In [1]:
import pandas as pd
import requests
import bs4
from bs4 import BeautifulSoup
import re
import time
from random import randint
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from math import isnan
plt.style.use('fivethirtyeight')
%matplotlib inline

# Capstone Part 3: Progress Report and Preliminary Findings

This progress report will describe the process of working through this capstone project so far, including the successes and setbacks encountered along the way. Then I will run a preliminary model and discuss the initial results, as well as possible future steps.

### Collecting Data

For my project I needed to collect two sets of data: (1) a list of players entering the 2017 NBA draft, with their college career statistics, and (2) the college statistics of NBA players, along with each player's PER (Player Efficiency Rating) for his professional career.

The plan is to use the NBA player data as the training set for my model(s). The features will be the NBA players' college stats, and the target will be their NBA PER. After fitting the model on this training set, I will use the college stats of the draft prospects to predict their NBA career PER.
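
The fit-then-predict plan above can be sketched roughly as follows. This is only an illustration: the numbers, the prospect names, and the two-feature subset are made up, and `LinearRegression` stands in for whichever model ends up being used.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: college stats of NBA players plus their NBA PER.
nba = pd.DataFrame({
    "pts_per_g": [20.1, 12.3, 15.8, 9.4],
    "trb_per_g": [5.0, 8.2, 6.1, 3.3],
    "PER": [24.0, 14.5, 17.2, 11.0],
})

# Hypothetical draft prospects: same college stat columns, no PER yet.
prospects = pd.DataFrame({
    "name": ["Prospect A", "Prospect B"],
    "trb_per_g": [7.0, 4.1],
    "pts_per_g": [17.5, 11.2],
})

feature_cols = ["pts_per_g", "trb_per_g"]

# Train on the NBA players' college stats with NBA PER as the target.
model = LinearRegression().fit(nba[feature_cols], nba["PER"])

# Select the prospect columns in the same order used for training.
prospects["predicted_PER"] = model.predict(prospects[feature_cols])
print(prospects[["name", "predicted_PER"]])
```

Selecting `feature_cols` explicitly on both frames keeps the column order identical between training and prediction, which matters because the two data sets come from separate scrapes.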

I scraped an initial list of the draft prospects from draftexpress.com. For those players, I scraped basketball-reference.com for their college career statistics. I used the same website to scrape the statistics for a set of NBA players. The set includes active and inactive US-born players who went to college and played in the 3-point era (1979/80 through the present). It also includes only players who have played more than 82 games, the length of one NBA regular season.
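
As a rough sketch of the parsing half of a scrape like this, the snippet below pulls rows out of an HTML stats table with BeautifulSoup. The HTML fragment, the table id, and the use of `data-stat` attributes as column keys are all assumptions made for illustration; they are not copied from either website. In a real scrape the HTML would come from `requests.get(url).text`, with a pause such as `time.sleep(randint(1, 3))` between requests.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a downloaded stats page.
SAMPLE_HTML = """
<table id="players_per_game">
  <tbody>
    <tr><th data-stat="season">Career</th>
        <td data-stat="g">101</td>
        <td data-stat="pts_per_g">17.7</td></tr>
  </tbody>
</table>
"""


def parse_stat_rows(html, table_id):
    """Return one dict per table row, keyed by each cell's data-stat attribute."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # The page has no such table (e.g. a player without college stats).
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        row = {
            cell["data-stat"]: cell.get_text(strip=True)
            for cell in cells
            if cell.has_attr("data-stat")
        }
        if row:
            rows.append(row)
    return rows


rows = parse_stat_rows(SAMPLE_HTML, "players_per_game")
print(rows)
```

Every value comes back as a string, so numeric conversion is left to the cleaning step.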

It took some time to get the code right for my initial scrapes, but I was able to get the information that I needed.

### Cleaning / Mining

I decided to start by cleaning and mining my training data. The cleaning wasn't much of a problem. Since most columns were numbers, I just needed to make sure they were in the correct format. I also needed to make dummy variables for the player positions and the most frequently occurring schools, and to insert NaNs into the dataframe wherever values were missing.
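
A minimal sketch of those cleaning steps, on a made-up four-row frame. The column names follow the real data set, but the values, the "blank string means missing" convention, and the choice of keeping only the single most frequent school are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw scrape: everything arrives as strings, blanks mean "missing".
raw = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "pts_per_g": ["17.7", "", "12.1", "9.8"],
    "position": ["G", "F", "C", "G"],
    "school": ["Duke", "Duke", "Kansas", "Smallville"],
})

clean = raw.copy()

# Coerce the stat column to floats; empty or malformed strings become NaN.
clean["pts_per_g"] = pd.to_numeric(clean["pts_per_g"], errors="coerce")

# One dummy column per position.
position_dummies = pd.get_dummies(clean["position"], prefix="pos").astype(int)

# Dummy columns only for the N most frequent schools (N=1 in this toy example).
top_schools = clean["school"].value_counts().head(1).index
school_dummies = pd.DataFrame(
    {school: (clean["school"] == school).astype(int) for school in top_schools}
)

clean = pd.concat(
    [clean.drop(columns=["position", "school"]), position_dummies, school_dummies],
    axis=1,
)
print(clean)
```

Limiting the school dummies to the most frequent schools keeps the feature count manageable; players from any other school simply get zeros in every school column.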

### Dealing with Missing Values

This is where I ran into some trouble. The college player data was pretty much complete so there isn't much to worry about there. The NBA player data, however, contained a lot of missing values. There were a few columns that I felt I could drop entirely, but some I felt were too important to take out.

In [2]:
df=pd.read_csv('/Users/ct/DSI-NYC-4/projects/capstone/part-02/nba_7.csv')

In [3]:
df.isnull().sum()

PER             0
name            0
g               0
mp_per_g        0
fg_per_g        0
fga_per_g       0
fg_pct          0
fg2_per_g     425
fg2a_per_g    448
fg2_pct       448
fg3_per_g     425
fg3a_per_g    448
fg3_pct       490
ft_per_g        0
fta_per_g       0
ft_pct          0
trb_per_g       0
ast_per_g     320
stl_per_g     320
blk_per_g     320
tov_per_g     484
pf_per_g      303
pts_per_g       0
position        0
school          0
height          0
weight          0
dtype: int64

Above are the null counts for each column. Many values are missing for stats like assists, blocks, steals, turnovers, and personal fouls. I don't want to drop those columns, since I consider them important for assessing player performance.

I decided to try imputing the missing values using the NBA players' early professional career stats (first 3 years). I wanted to scrape those first-3-year stats for the same NBA players, but the list that Basketball Reference generated for this was not identical to the list in the original scrape. As far as I could tell, there is no way to get the same list of players with my desired subset of stats.

I scraped the list I found on Basketball Reference and then tried to join it to the original NBA player dataframe on player name. The problem was that some players have identical names. To fix this, I manually made all of the names unique, which let me join the two dataframes properly. I then imputed the last five columns with missing values (assists, steals, blocks, turnovers, and personal fouls) using the newly scraped early-career stats. Unfortunately, this did not completely take care of my NaNs: after imputation, those columns still had about 100 missing values each.
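
The join-and-impute step could be sketched as below. Note that this is not what I actually did: instead of editing names by hand, the sketch assumes a second identifying column (here `school`) is available in both tables and joins on the pair. All names and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical college stats with gaps; two different players share a name.
college = pd.DataFrame({
    "name": ["John Smith", "John Smith", "Al Jones"],
    "school": ["Duke", "Kansas", "LSU"],
    "ast_per_g": [np.nan, 2.0, np.nan],
})

# Hypothetical early-career NBA stats; Al Jones is absent from this list.
early = pd.DataFrame({
    "name": ["John Smith", "John Smith"],
    "school": ["Duke", "Kansas"],
    "ast_per_g": [3.1, 2.4],
})

# Assumption: name + school identifies a player uniquely in both tables.
keys = ["name", "school"]
merged = college.merge(
    early,
    on=keys,
    how="left",                # keep every college row, matched or not
    suffixes=("", "_early"),
    validate="one_to_one",     # raise if the keys are not unique on either side
)

# Fill only the gaps; existing college values are left untouched.
merged["ast_per_g"] = merged["ast_per_g"].fillna(merged["ast_per_g_early"])
merged = merged.drop(columns=["ast_per_g_early"])
print(merged)
```

The last row stays NaN because that player does not appear in the early-career list, which is the same reason some NaNs can survive an imputation like this.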

To get to a state where I could run some sort of model, I dropped the columns relating to 2-point and 3-point field goals and then dropped the rows that still contained NaNs in the imputed columns. This reduced the number of rows from 1444 to 1304. I lost some data, but I had no more NaNs in my data set and could start building a model.
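
In pandas, that drop could look something like the following. The frame is a made-up three-row example; the column names follow the real data, and selecting the field goal columns by their `fg2`/`fg3` prefix is an assumption about how they were picked.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a 3-point column and two imputed columns.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "fg3_pct": [np.nan, 0.35, np.nan],
    "ast_per_g": [1.8, np.nan, 6.3],
    "tov_per_g": [2.0, 1.1, 3.0],
})

# Columns relating to 2-point and 3-point field goals.
shot_cols = [c for c in df.columns if c.startswith(("fg2", "fg3"))]
# Columns that were imputed and may still contain NaNs.
imputed_cols = ["ast_per_g", "tov_per_g"]

rows_before = len(df)
df = df.drop(columns=shot_cols).dropna(subset=imputed_cols)
print(f"rows: {rows_before} -> {len(df)}")
```

Passing `subset=` restricts the row drop to the imputed columns, so a NaN elsewhere would not silently remove more rows.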

### The Model

In [4]:
df=pd.read_csv('/Users/ct/DSI-NYC-4/projects/capstone/part-03/nonulls_2.csv')
df.head()

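| | PER | name | g | mp_per_g | fg_per_g | fga_per_g | fg_pct | ft_per_g | fta_per_g | ft_pct | ... | Alabama | Georgetown | Syracuse | Arkansas | Memphis | Louisville | Indiana | LSU | Notre Dame | Ohio State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27.9 | Michael Jordan | 101 | 29.599117 | 7.1 | 13.2 | 0.54 | 3.1 | 4.2 | 0.748 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 26.6 | Anthony Davis | 40 | 32.0 | 5.3 | 8.4 | 0.623 | 3.6 | 5.1 | 0.709 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 26.4 | Shaquille O'Neal | 90 | 30.5 | 8.7 | 14.3 | 0.61 | 4.1 | 7.1 | 0.575 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 26.2 | David Robinson | 127 | 29.5 | 8.1 | 13.3 | 0.613 | 4.8 | 7.6 | 0.627 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 25.7 | Chris Paul | 63 | 33.5 | 4.4 | 9.3 | 0.472 | 4.9 | 5.8 | 0.838 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |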


In [5]:
X=df.iloc[:,2:]
X.head()

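| | g | mp_per_g | fg_per_g | fga_per_g | fg_pct | ft_per_g | fta_per_g | ft_pct | trb_per_g | ast_per_g | ... | Alabama | Georgetown | Syracuse | Arkansas | Memphis | Louisville | Indiana | LSU | Notre Dame | Ohio State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101 | 29.599117 | 7.1 | 13.2 | 0.54 | 3.1 | 4.2 | 0.748 | 5.0 | 1.8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 40 | 32.0 | 5.3 | 8.4 | 0.623 | 3.6 | 5.1 | 0.709 | 10.4 | 1.3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 90 | 30.5 | 8.7 | 14.3 | 0.61 | 4.1 | 7.1 | 0.575 | 13.5 | 1.7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 127 | 29.5 | 8.1 | 13.3 | 0.613 | 4.8 | 7.6 | 0.627 | 10.3 | 0.7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 63 | 33.5 | 4.4 | 9.3 | 0.472 | 4.9 | 5.8 | 0.838 | 3.9 | 6.3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |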


In [6]:
y=df.iloc[:,0]
y.head()

0    27.9
1    26.6
2    26.4
3    26.2
4    25.7
Name: PER, dtype: float64

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X_scaler = StandardScaler()
X_scaled = X_scaler.fit_transform(X)

In [8]:
lr=LinearRegression()
lr.fit(X_scaled,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:
cross_val_score(lr,X_scaled,y).mean()

-6.127353160762028

In [10]:
ridge=Ridge()
ridge.fit(X_scaled,y)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [11]:
cross_val_score(ridge,X,y,cv=5).mean()

-13.052048063925648

As shown above, I tried a couple of different models on my data, both of which produced terrible results. I first tried a simple linear regression, but its mean cross-validation score (R², the default scorer for regressors) was -6.1. I then tried a Ridge regression, which came out even worse, with a mean cross-validation score of about -13. Note that the Ridge model was scored on the unscaled `X` with `cv=5`, while the linear regression was scored on `X_scaled` with the default number of folds, so the two scores are not directly comparable. I think I'm probably going to need help figuring out how to move forward. I could experiment with different subsets of features, but I'm not sure how much that would help. I have also considered converting the target into some sort of categorical value to make it easier for my model to make predictions.

Before moving forward, I think I need to figure out exactly why my models are performing so badly and then address those issues if possible.
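
One thing worth checking, as a hypothesis rather than a confirmed cause: the `head()` output suggests the rows are ordered from highest to lowest PER, and passing an integer (or nothing) as `cv` for a regressor gives contiguous, unshuffled folds. If both hold, each test fold covers a narrow band of PER values that the training folds barely overlap, which can push R² far below zero. The sketch below reproduces that effect on synthetic data (not the real data set) and compares it with shuffled folds; it also puts the scaler inside a pipeline so it is refit on each training fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 0.5, -0.5, 0.25, 0.0]) + rng.normal(size=300)

# Sort rows by the target, best first, mimicking a file ordered by PER.
order = np.argsort(-y)
X_sorted, y_sorted = X[order], y[order]

# Scaling lives inside the pipeline, so it is fit on training folds only.
model = make_pipeline(StandardScaler(), LinearRegression())

# Integer cv on a regressor: contiguous folds, no shuffling.
sorted_score = cross_val_score(model, X_sorted, y_sorted, cv=5).mean()

# Same data and model, but rows are shuffled before being split into folds.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_score = cross_val_score(model, X_sorted, y_sorted, cv=shuffled_cv).mean()

print(f"unshuffled folds: {sorted_score:.2f}")
print(f"shuffled folds:   {shuffled_score:.2f}")
```

If the real scores improve in the same way once the folds are shuffled, the negative values above say more about how the folds were built than about the models themselves.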

In [12]:
from sklearn.ensemble import GradientBoostingRegressor

gb=GradientBoostingRegressor()
gb.fit(X_scaled,y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=100,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

In [13]:
cross_val_score(gb,X_scaled,y).mean()

-6.1726514901579463