# Linear Regression

**Let's run a Linear Regression in Python on our NBA season average data.**

In [None]:
import pandas as pd
import numpy as np

# This library is used to do simple plots before running our regression
import matplotlib.pyplot as plt

# Two libraries you can do linear regression with.
from sklearn import linear_model
import statsmodels.api as sm

# Used for displaying images from the internet I found to help learn this.
from IPython.display import Image

**Load the pre-made data from the csv file**

In [None]:
nba = pd.read_csv("NBA Reg Season Player Avgs with Win Pct 2000-2019.csv")
nba.head()

**Look at all the column names to help hypothesize which two variables might be related.**

In [None]:
nba.columns

**Maybe we can expect points scored (`PTS`) to go up as minutes played (`MP`) go up?**

First, make a scatterplot of these two columns to explore the data.

In [None]:
plt.scatter(x=nba['MP'], y=nba['PTS'])
plt.title("Points Scored vs Minutes Played")
plt.xlabel("Minutes Played")
plt.ylabel("Points Scored")
plt.show()

In [None]:
# Only using the 'MP' column, so we need to reshape it to 2D to fit scikit's fit() function
regr_X = np.array(nba['MP']).reshape(-1,1)
# Response is 'PTS', or the average # of points scored Per Game
regr_y = nba['PTS']

# Building a linear regression model using scikit's sklearn
regr = linear_model.LinearRegression()

# Calculating the parameters of our regression model using the fit() method
lin_model = regr.fit(X=regr_X, y=regr_y)

# Coefficient (slope) of minutes played in our model
print("Coefficient of minutes played in our model: ", lin_model.coef_)

# Intercept Value in our model
print("Intercept in our model: ", lin_model.intercept_)

# Coefficient of Determination Score
print("R^2 Score: ", regr.score(X=regr_X, y=regr_y))

PTS = (0.76)(MP) - 9.76
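To make the fitted equation concrete, here is a small sketch that plugs a few minutes-played values into it. The coefficients are the rounded values from the equation above; the helper name `predict_pts` and the sample inputs are made up for illustration and are not part of the original notebook.

```python
# Rounded coefficients copied from the fitted equation above.
SLOPE = 0.76        # extra points per extra minute played
INTERCEPT = -9.76   # predicted points at zero minutes (not meaningful on its own)


def predict_pts(minutes_played: float) -> float:
    """Predicted points per game for a given minutes-per-game value."""
    return SLOPE * minutes_played + INTERCEPT


# Hypothetical inputs, chosen only to illustrate the arithmetic.
for mp in (20, 30, 36):
    print(f"MP = {mp}: predicted PTS = {predict_pts(mp):.2f}")
```

For example, 30 minutes per game gives 0.76 * 30 - 9.76, or about 13.04 points. Because the intercept is negative, the line predicts negative points for players with very few minutes, so it is best read as a description of the overall trend rather than an exact rule at the low end.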

In [None]:
# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)

# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant', 'MP']
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)

summary_est = sm.OLS(summary_y, summary_X)

print(summary_est.fit().summary())

t-critical (t*) = 1.960 (two-sided, 95% confidence, large sample)

Our model's t-value = 79.546

Since our t-value is greater than t*, we have statistically significant evidence at the 95% confidence level that Minutes Played and Points Scored have a positive, linear relationship.
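The critical value of 1.960 is not computed anywhere in the notebook, so here is a minimal sketch of where it can come from. It assumes the sample is large enough that the t distribution is essentially the standard normal, and it uses Python's built-in `statistics.NormalDist` rather than any of the libraries imported above.

```python
from statistics import NormalDist

# Large-sample approximation: with thousands of rows, the t distribution
# is practically identical to the standard normal distribution.
std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-sided test at 95% confidence: 2.5% in each tail.
two_sided_95 = std_normal.inv_cdf(0.975)

# One-sided test at 95% confidence: 5% in the upper tail only.
one_sided_95 = std_normal.inv_cdf(0.95)

print(f"two-sided critical value: {two_sided_95:.3f}")
print(f"one-sided critical value: {one_sided_95:.3f}")
```

The one-sided cutoff is smaller than the two-sided one, so a t-value that clears 1.960 also clears the threshold for a one-sided "positive relationship" test.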

In [None]:
plt.scatter(x=nba['AST'], y=nba['TOV'])
plt.title("Turnovers vs Assists")
plt.xlabel("Assists")
plt.ylabel("Turnovers")
plt.show()

Ho: Assists (`AST`) & Turnovers (`TOV`) have NO linear relationship.

Ha: Assists & Turnovers have a positive, linear relationship.

In terms of the slope (β1) of the regression line:

Ho: β1 = 0

Ha: β1 > 0

In [None]:
# Only using the 'AST' column, so we need to reshape it to 2D before adding the constant term
regr_X = np.array(nba['AST']).reshape(-1,1)
# Response is 'TOV', or the average # of turnovers Per Game
regr_y = nba['TOV']

# Need to add a column of 1s to create a constant term
# statsmodels.api does not do it for us like sklearn does
summary_X = sm.add_constant(regr_X)

# Make into dataframes to make sure variable names are shown in output
summary_X = pd.DataFrame(summary_X).reset_index(drop=True)
summary_X.columns = ['Constant', 'AST']
summary_y = pd.DataFrame(regr_y).reset_index(drop=True)

summary_est = sm.OLS(summary_y, summary_X)

print(summary_est.fit().summary())

TOV = (0.26)(AST) + 1.05

Each additional assist per game is associated with an average of 0.26 more turnovers per game.

t-critical (t*) = 1.960 (two-sided, 95% confidence, large sample)

Our model's t-value = 59.108

Since our t-value is greater than t*, we have statistically significant evidence at the 95% confidence level that Assists and Turnovers have a positive, linear relationship.
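The t-values quoted above are read from the statsmodels summary table. As a rough sketch of what that number represents, the cell below computes a slope, its standard error, and the resulting t-value by hand with NumPy. It runs on synthetic data generated to loosely resemble the AST/TOV fit (the sample size, noise level, and random seed are all assumptions), so the numbers will not match the notebook's output.

```python
import numpy as np

# Synthetic stand-in for the real data (assumption: NOT the NBA csv).
rng = np.random.default_rng(0)
n = 500
ast = rng.uniform(0, 10, size=n)
tov = 1.05 + 0.26 * ast + rng.normal(0, 0.5, size=n)

# Design matrix with a column of 1s for the constant term,
# which is what sm.add_constant() builds.
X = np.column_stack([np.ones(n), ast])

# Ordinary least squares estimates of [intercept, slope].
beta, *_ = np.linalg.lstsq(X, tov, rcond=None)
intercept, slope = beta

# Residual variance with n - 2 degrees of freedom (two estimated parameters).
residuals = tov - X @ beta
sigma2 = residuals @ residuals / (n - 2)

# Standard error of the slope from the diagonal of sigma^2 * (X'X)^-1.
cov = sigma2 * np.linalg.inv(X.T @ X)
se_slope = np.sqrt(cov[1, 1])

# The t-value is the estimated slope divided by its standard error.
t_value = slope / se_slope

print(f"slope = {slope:.3f}, std err = {se_slope:.4f}, t = {t_value:.1f}")
```

The t-value measures how many standard errors the estimated slope sits away from zero, which is why a value far above t* counts as strong evidence against a slope of zero.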