<h3>In-Depth Analysis</h3>
<p>In this section of our report, we look at the machine learning algorithms needed for our capstone project. Having already explored the data extensively through our data wrangling and exploratory analysis, it is time to put those results to work!</p>

<p>As we are trying to predict a continuous variable (points scored), we will focus on two machine learning algorithms:<br>
- Linear Regression<br>
- Random Forest Regressor</p>

<p>I will be splitting this section into the following parts:<br>
1. Final Data Cleaning/Data Wrangling (Removing future variable biases)<br>
2. Fitting our data into the Linear Regression Model<br>
3. Cross Validate our Linear Regression Model and optimize<br>
4. Fitting our data into the Random Forest Regressor Model<br>
5. Cross Validate our Random Forest Regressor and optimize<br>
6. Compare our Linear Regression Model with the Random Forest Regressor<br>
7. Compare our optimal machine learning algorithm with the Y-T-D average<br>
8. Conclusion and Next Steps</p>

In [27]:
#importing libraries
import pandas as pd
import numpy as np
from nba_py.player import PlayerSummary
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


<h3>1. Final Data Cleaning/Data Wrangling (Removing future variable biases)</h3>
<p>We did a lot of data cleaning in our data wrangling section. We have also narrowed the data down to the data points that are highly correlated with points:<br>
- Categorical Data: Starter or Not, Player Position, Team playing against, Home vs Away, Game Number<br>
- Statistical Data: MIN, FG2M, FG2A, FG3M, FG3A, FTM, FTA</p>
<p>We require two final data cleaning stages before we can run our first analysis:<br>
1. We must create last-game values, year-to-date averages and past three game averages for the statistical categories that we will be using.<br>
2. We must also define our dependent variable and our independent variables to fit our models.
</p>

<p>We must create last game's values, year-to-date averages and past three game averages because, currently, each row in our dataset represents what happened within that exact game. If we were to use the same-game statistics in each row to predict points, we would be using future information to predict points. This would obviously bias the model, and there is an exact linear way of doing so: points are simply 2 × FG2M + 3 × FG3M + 1 × FTM.</p>
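<p>To see why same-game statistics cannot be used as features, the relationship can be checked on made-up box scores: points are 2 per two-pointer made, 3 per three-pointer made and 1 per free throw made. The sketch below uses synthetic numbers, not our dataset, and all variable names are illustrative assumptions.</p>

```python
import numpy as np

# Synthetic box scores (NOT our dataset): made shots for 200 imaginary games.
rng = np.random.default_rng(0)
n_games = 200
fg2m = rng.integers(0, 10, n_games)  # two-pointers made
fg3m = rng.integers(0, 5, n_games)   # three-pointers made
ftm = rng.integers(0, 8, n_games)    # free throws made

# Points are fully determined by the same-game shooting statistics.
pts = 2 * fg2m + 3 * fg3m + ftm

# Ordinary least squares on the same-game statistics (no intercept needed).
X = np.column_stack([fg2m, fg3m, ftm]).astype(float)
coef, _, _, _ = np.linalg.lstsq(X, pts.astype(float), rcond=None)

# The fit recovers the scoring rule itself (approximately 2, 3 and 1),
# which is why these columns must be replaced by lagged versions.
print(np.round(coef, 6))
```

<p>A model trained this way would look perfect on paper while being useless before tip-off, since none of these inputs exist until the game has been played.</p>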
<p>Of course, we could also use different moving averages (weekly moving average, monthly moving average, etc.). However, we want our machine learning algorithm to produce predictions as early in the season as possible. A large moving-average window yields NaN values until enough games have been played, so we would have to wait before we could predict anything. For simplicity, we will stick with the last game's values, year-to-date averages and past three game averages.</p>
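<p>The trade-off between window size and NaN values can be illustrated on a toy frame. This is a minimal sketch with two made-up players; the column names mirror ours, but the data and the exact feature construction are assumptions for illustration, not the code used in the next cell.</p>

```python
import pandas as pd

# Toy game log for two imaginary players, already in chronological order.
games = pd.DataFrame({
    'PLAYER_ID': [1, 1, 1, 1, 1, 2, 2, 2, 2],
    'GAME_NUMBER': [1, 2, 3, 4, 5, 1, 2, 3, 4],
    'PTS': [10, 20, 30, 40, 50, 4, 8, 12, 16],
}).sort_values(['PLAYER_ID', 'GAME_NUMBER'])

by_player = games.groupby('PLAYER_ID')['PTS']

# Shifting inside each player's group means a row only sees earlier games
# of the same player, never the current game or another player's games.
games['PTS_LASTGAME'] = by_player.shift(1)
games['PTS_AVGLAST3GAMES'] = by_player.transform(
    lambda s: s.shift(1).rolling(window=3).mean())
games['PTS_YTD'] = by_player.transform(
    lambda s: s.shift(1).expanding().mean())

# The three-game window is NaN for a player's first three games, while the
# last-game and year-to-date columns are NaN only for the first game.
print(games)
```

<p>A wider window would push the first usable row even later in each player's season, which is the waiting period described above.</p>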

In [33]:
seasons = ['2014-15','2015-16','2016-17','2017-18']
#Read one CSV per season and stack them into a single DataFrame
df = pd.concat([pd.read_csv("../raw_data/eda_data"+year+".csv") for year in seasons])
df = df.drop(['FT_PCT'], axis=1)
#NOTE - we have some players who do not have a position assigned.
#We are still waiting on the API call to get these positions. For purposes of setting up the code, we'll drop these players.
df = df.dropna(thresh=3)
df = df[(df['CENTER'] == 'C') | (df['FORWARD'] == 'F') | (df['GUARD'] == 'G')]
df = df.replace({'CENTER': 'C', 'FORWARD': 'F', 'GUARD':'G'}, 1)
df = df.fillna(0)
df['HOME_AWAY'] = np.where(df.MATCHUP.str.contains("@"), "AWAY","HOME")
#Take an explicit copy so the column assignments below do not raise SettingWithCopyWarning
clean_df = df[['TEAM_ABBREVIATION', 'GAME_ID','GAME_NUMBER','HOME_AWAY', 'PLAYER_ID','CENTER', 'GUARD', 'STARTER','MIN', 'FG2M','FG2A','FG3M', 'FG3A', 'FTM', 'FTA','PTS' ]].copy()
lis = ['MIN','FG2M','FG2A','FG3M', 'FG3A', 'FTM', 'FTA']
#Previous row's value for each statistic.
#NOTE - shift(1) is not grouped by PLAYER_ID, so a player's first row picks up the preceding row, which may belong to another player.
for col in lis:
    clean_df[col + '_LASTGAME'] = clean_df[col].shift(1)
lis_modified = [item + '_AVGLAST3GAMES' for item in lis]
dictionary = dict(zip(lis, lis_modified))
lis_YTD= [item + '_YTD' for item in lis]
YTD = dict(zip(lis, lis_YTD))

player_games_grouped = clean_df.set_index(['GAME_ID']).groupby(['PLAYER_ID'])
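#Past three game figures per player. NOTE - win_type='triang' makes this a weighted mean (weights 0.5, 1, 0.5) rather than a plain average,
#and the trailing .shift() runs over the whole frame rather than within each player.
player_games_threegame = pd.DataFrame(player_games_grouped.rolling(center=False,window=3,win_type='triang')[lis].mean().shift()).rename(index=str, columns=dictionary).reset_index()
#Year-to-date averages per player. NOTE - there is no shift here, so each average includes the current game.
players_games_ytd = player_games_grouped[lis].expanding(min_periods=2).mean().rename(index=str, columns=YTD).reset_index()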
player_games = clean_df[['PTS','PLAYER_ID','GAME_ID','HOME_AWAY','GAME_NUMBER','CENTER','GUARD', 'STARTER','MIN_LASTGAME', 'FG2M_LASTGAME','FG2A_LASTGAME','FG3M_LASTGAME', 'FG3A_LASTGAME','FTM_LASTGAME','FTA_LASTGAME']]
training_set = pd.merge(player_games_threegame, players_games_ytd, left_on=['PLAYER_ID', 'GAME_ID'], right_on=['PLAYER_ID','GAME_ID'])
training_set['GAME_ID'] = training_set['GAME_ID'].apply(int)
training_set['PLAYER_ID'] = training_set['PLAYER_ID'].apply(int)
second_set = pd.merge(player_games, training_set, left_on=['PLAYER_ID', 'GAME_ID'], right_on=['PLAYER_ID','GAME_ID'])
final_set = second_set.drop(['PLAYER_ID', 'GAME_ID'], axis=1)
evaluate = pd.get_dummies(final_set, drop_first=True)
evaluate = pd.get_dummies(evaluate, columns=['GAME_NUMBER'], drop_first=True)

evaluate = evaluate.dropna()

evaluate.head()

Unnamed: 0,PTS,CENTER,GUARD,MIN_LASTGAME,FG2M_LASTGAME,FG2A_LASTGAME,FG3M_LASTGAME,FG3A_LASTGAME,FTM_LASTGAME,FTA_LASTGAME,...,GAME_NUMBER_72,GAME_NUMBER_73,GAME_NUMBER_74,GAME_NUMBER_75,GAME_NUMBER_76,GAME_NUMBER_77,GAME_NUMBER_78,GAME_NUMBER_79,GAME_NUMBER_80,GAME_NUMBER_81
3,2,1.0,0.0,20.0,3.0,8.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,4,1.0,0.0,25.0,1.0,7.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
5,8,1.0,0.0,13.0,2.0,3.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
6,12,1.0,0.0,28.0,3.0,9.0,0.0,0.0,2.0,2.0,...,0,0,0,0,0,0,0,0,0,0
7,4,1.0,0.0,27.0,5.0,8.0,0.0,0.0,2.0,2.0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
df.head()

Unnamed: 0.1,Unnamed: 0,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,...,FORWARD,GUARD,STARTER_GAMES,BENCH_GAMES,TOTAL_GAMES,PERCENT_STARTS,STARTER,SEASON,FG2M,FG2A
0,0,708,Kevin Garnett,1610612751,BKN,Brooklyn Nets,21400006,2014-10-29,BKN @ BOS,L,...,1.0,0.0,47.0,0.0,47.0,1.0,YES,2014-15,5,8
1,1,708,Kevin Garnett,1610612751,BKN,Brooklyn Nets,21400033,2014-11-01,BKN @ DET,W,...,1.0,0.0,47.0,0.0,47.0,1.0,YES,2014-15,7,15
2,2,708,Kevin Garnett,1610612751,BKN,Brooklyn Nets,21400044,2014-11-03,BKN vs. OKC,W,...,1.0,0.0,47.0,0.0,47.0,1.0,YES,2014-15,3,8
3,3,708,Kevin Garnett,1610612751,BKN,Brooklyn Nets,21400060,2014-11-05,BKN vs. MIN,L,...,1.0,0.0,47.0,0.0,47.0,1.0,YES,2014-15,1,7
4,4,708,Kevin Garnett,1610612751,BKN,Brooklyn Nets,21400075,2014-11-07,BKN vs. NYK,W,...,1.0,0.0,47.0,0.0,47.0,1.0,YES,2014-15,2,3


In [34]:
X = evaluate.drop(['PTS'], axis=1)
y = evaluate[['PTS']]
#Splitting into training set and test set. The training set is used to fit the model; the test set is held out to measure its accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<h3>2. Fitting our data into the Linear Regression Model</h3>

In [36]:
#Initializing our regressor
regressor = LinearRegression()

#Fitting our training data to the regressor
regressor.fit(X_train, y_train)

#Checking the R^2 score on the held-out test set
score = regressor.score(X_test, y_test)
print(score)

0.46579983248408874
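<p>The number printed above is the coefficient of determination (R²), which is what <code>LinearRegression.score</code> returns. A value of about 0.47 means the model explains roughly 47% of the variance in points on the test set. The sketch below computes R² by hand on made-up numbers; the helper name <code>r_squared</code> and the sample values are illustrative assumptions.</p>

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Made-up points totals and predictions for four games.
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.5, 7.0, 9.5]

r2 = r_squared(y_true, y_pred)

# Always predicting the mean of y_true scores exactly 0, so R^2 measures
# the improvement over that constant baseline.
baseline = r_squared(y_true, [6, 6, 6, 6])
print(r2, baseline)
```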


<h3>3. Cross Validate our Linear Regression Model and optimize</h3>

<h3>4. Fitting our data into the Random Forest Regressor Model</h3>

<h3>5. Cross Validate our Random Forest Regressor and optimize</h3>

<h3>6. Compare our Linear Regression Model with the Random Forest Regressor</h3>

<h3>7. Compare our optimal machine learning algorithm with the Y-T-D average</h3>

<h3>8. Conclusion and Next Steps</h3>