# Predicting the NBA's MVPs

In the previous sections we scraped and cleaned all the data we need to predict NBA MVPs. In this section we'll use that data with machine learning models to predict MVP voting as accurately as we can.

Let's start with reading in our combined dataset:

In [110]:
import pandas as pd

## Read combined player_mvp_stats file
stats = pd.read_csv("player_mvp_stats.csv")

In [112]:
## Display first 5 rows of stats table
stats.head()

Unnamed: 0.1,Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,Pts Max,Share,Team,W,L,W/L%,GB,PS/G,PA/G,SRS
0,0,A.C. Green,PF,35,DAL,50,35,18.5,2.2,5.1,...,0.0,0.0,Dallas Mavericks,19,31,0.38,18.0,91.6,94.0,-2.5
1,1,Bruno Šundov,C,18,DAL,3,0,3.7,0.7,2.3,...,0.0,0.0,Dallas Mavericks,19,31,0.38,18.0,91.6,94.0,-2.5
2,2,Cedric Ceballos,SF,29,DAL,13,5,27.1,4.5,10.8,...,0.0,0.0,Dallas Mavericks,19,31,0.38,18.0,91.6,94.0,-2.5
3,3,Chris Anstey,C,24,DAL,41,4,11.5,1.2,3.4,...,0.0,0.0,Dallas Mavericks,19,31,0.38,18.0,91.6,94.0,-2.5
4,4,Dirk Nowitzki,PF,20,DAL,47,24,20.4,2.9,7.1,...,0.0,0.0,Dallas Mavericks,19,31,0.38,18.0,91.6,94.0,-2.5


In [3]:
## Delete the leftover index column
del stats["Unnamed: 0"]

Let's handle the null values, since our machine learning algorithm cannot work with them.

In [5]:
## retrieve number of null values in each column
pd.isnull(stats).sum() 

Player        0
Pos           0
Age           0
Tm            0
G             0
GS            0
MP            0
FG            0
FGA           0
FG%          37
3P            0
3PA           0
3P%        1526
2P            0
2PA           0
2P%          70
eFG%         37
FT            0
FTA           0
FT%         389
ORB           0
DRB           0
TRB           0
AST           0
STL           0
BLK           0
TOV           0
PF            0
PTS           0
year          0
Pts Won       0
Pts Max       0
Share         0
Team          0
W             0
L             0
W/L%          0
GB            0
PS/G          0
PA/G          0
SRS           0
dtype: int64

The columns with null values are all percentage columns. This is probably because the player attempted no shots of that type: each percentage is shots made divided by shots attempted, which is undefined when attempts are zero.

Let's check this hypothesis:

In [11]:
## Retrieve the Player and the 3 point shots attempted from the rows that have a null 3P%
stats[pd.isnull(stats["3P%"])][["Player","3PA"]]

Unnamed: 0,Player,3PA
1,Bruno Šundov,0.0
7,Hot Rod Williams,0.0
20,John Salley,0.0
26,Travis Knight,0.0
31,Anthony Mason,0.0
...,...,...
10784,Evan Eschmeyer,0.0
10785,Gheorghe Mureșan,0.0
10787,Jim McIlvaine,0.0
10793,Mark Hendrickson,0.0


In [13]:
## Retrieve the Player and the free throws attempted from the rows that have a null FT%
stats[pd.isnull(stats["FT%"])][["Player","FTA"]]

Unnamed: 0,Player,FTA
1,Bruno Šundov,0.0
40,Jamal Robinson,0.0
44,A.J. Bramlett,0.0
47,Benoit Benjamin,0.0
91,A.J. Guyton,0.0
...,...,...
10631,Jason Hart,0.0
10662,George King,0.0
10742,Luke Zeller,0.0
10777,Malcolm Lee,0.0
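
The two spot checks above only look at `3P%` and `FT%`. As a minimal sketch of checking several percentage columns at once, the snippet below pairs each percentage column with its attempts column and tests whether nulls occur exactly where attempts are zero. The toy DataFrame and the `pct_to_attempts` mapping are assumptions for illustration, not part of the original notebook; on the real data the same loop would run over `stats`, with `FG%` and `eFG%` both paired with `FGA`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `stats` (values are made up for illustration)
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3PA": [0.0, 2.5, 0.0],
    "3P%": [np.nan, 0.360, np.nan],
    "FTA": [1.2, 0.0, 3.0],
    "FT%": [0.750, np.nan, 0.800],
})

# Hypothetical mapping: percentage column -> attempts column
pct_to_attempts = {"3P%": "3PA", "FT%": "FTA"}

def nulls_match_zero_attempts(df, pct_col, att_col):
    """True if pct_col is null exactly on the rows where att_col is 0."""
    return bool((df[pct_col].isnull() == (df[att_col] == 0)).all())

checks = {pct: nulls_match_zero_attempts(toy, pct, att)
          for pct, att in pct_to_attempts.items()}
print(checks)
```

If every check passes on the real table, filling the nulls with 0 is a reasonable choice, since no shots were made on zero attempts.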


In [14]:
## replace all null values with 0
stats = stats.fillna(0) 

Let's take a look at the columns and see which columns we should use as predictors for our algorithm:

In [15]:
## Display all columns in the stats table
stats.columns

Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'year',
       'Pts Won', 'Pts Max', 'Share', 'Team', 'W', 'L', 'W/L%', 'GB', 'PS/G',
       'PA/G', 'SRS'],
      dtype='object')

We'll use the numeric columns as predictors, leaving out "Pts Won", "Pts Max", and "Share": all three come directly from the MVP vote, and "Share" is the value we want to predict.

In [23]:
## Numeric columns to use as predictors
predictors = ['Age','G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'year',
        'W', 'L', 'W/L%', 'GB', 'PS/G',
       'PA/G', 'SRS']
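
Typing the list by hand works, but as a sketch of an alternative, pandas can pick out the numeric columns and the vote-derived columns can then be dropped. The toy DataFrame and the names `leaky` and `auto_predictors` are assumptions for illustration; on the real `stats` table this would also pick up any other numeric column that happens to be present.

```python
import pandas as pd

# Toy stand-in for `stats` (values are made up for illustration)
toy = pd.DataFrame({
    "Player": ["A", "B"],
    "Pos": ["PF", "C"],
    "Age": [35, 18],
    "PTS": [4.9, 1.3],
    "Pts Won": [0.0, 0.0],
    "Pts Max": [0.0, 0.0],
    "Share": [0.0, 0.0],
})

# Columns derived from the MVP vote, which must not be used as predictors
leaky = ["Pts Won", "Pts Max", "Share"]

# Keep numeric columns only, then drop the vote-derived ones
numeric_cols = toy.select_dtypes(include="number").columns
auto_predictors = [c for c in numeric_cols if c not in leaky]
print(auto_predictors)
```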

In [24]:
## Create train set of years before 2021
train = stats[stats["year"] < 2021]

In [25]:
## Create test set for the year 2021
test = stats[stats["year"] == 2021]

In [26]:
from sklearn.linear_model import Ridge

## Initiate the ridge model
reg = Ridge(alpha=.1) 

In [67]:
## use predictor columns to predict share
reg.fit(train[predictors], train["Share"]) 

Ridge(alpha=0.1)

In [28]:
## Use reg.predict to get predictions
predictions = reg.predict(test[predictors])

In [31]:
## Create data frame from predictions
predictions= pd.DataFrame(predictions,columns=["predictions"], index=test.index)

In [32]:
## Display predictions
predictions

Unnamed: 0,predictions
467,0.018302
468,-0.010733
469,0.003589
470,-0.003425
471,0.008344
...,...
10703,-0.008471
10704,-0.014132
10705,0.009494
10706,-0.020755


In [33]:
## Combine player and share columns with the predictions on axis=1(columns)
combination = pd.concat([test[["Player","Share"]],predictions],axis=1)

In [34]:
## Display combined dataframe
combination

Unnamed: 0,Player,Share,predictions
467,Aaron Gordon,0.0,0.018302
468,Austin Rivers,0.0,-0.010733
469,Bol Bol,0.0,0.003589
470,Facundo Campazzo,0.0,-0.003425
471,Greg Whittington,0.0,0.008344
...,...,...,...
10703,Patty Mills,0.0,-0.008471
10704,Quinndary Weatherspoon,0.0,-0.014132
10705,Rudy Gay,0.0,0.009494
10706,Tre Jones,0.0,-0.020755


In [36]:
## Sort share values in descending order
combination.sort_values("Share",ascending=False).head(10)

Unnamed: 0,Player,Share,predictions
478,Nikola Jokić,0.961,0.147722
6845,Joel Embiid,0.58,0.153719
2934,Stephen Curry,0.449,0.14985
7737,Giannis Antetokounmpo,0.345,0.200824
1108,Chris Paul,0.138,0.069609
8324,Luka Dončić,0.042,0.157725
6070,Damian Lillard,0.038,0.123984
2860,Julius Randle,0.02,0.086797
2855,Derrick Rose,0.01,0.027671
8614,Rudy Gobert,0.008,0.092053


In [38]:
from sklearn.metrics import mean_squared_error

## Mean squared error between actual and predicted vote share
mean_squared_error(combination["Share"],combination["predictions"])

0.0026522392093529137

In [39]:
## Count how often each actual vote share occurs
combination["Share"].value_counts()

0.000    525
0.001      3
0.961      1
0.138      1
0.010      1
0.020      1
0.449      1
0.005      1
0.038      1
0.003      1
0.580      1
0.345      1
0.042      1
0.008      1
Name: Share, dtype: int64
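
One way to read the two cells above: with 525 of the 540 players at a share of exactly 0, mean squared error is dominated by players who received no votes. As a rough check using only the value counts printed above, the sketch below computes the MSE of a hypothetical baseline that predicts 0 for everyone. The `share_counts` dictionary is transcribed from that output; the baseline itself is an illustration, not something the original notebook runs.

```python
# Actual 2021 vote shares and how often each occurs (from value_counts above)
share_counts = {
    0.000: 525, 0.001: 3, 0.961: 1, 0.138: 1, 0.010: 1, 0.020: 1, 0.449: 1,
    0.005: 1, 0.038: 1, 0.003: 1, 0.580: 1, 0.345: 1, 0.042: 1, 0.008: 1,
}

n_players = sum(share_counts.values())

# MSE of predicting 0 for every player: the mean of share squared
baseline_mse = sum(count * share ** 2
                   for share, count in share_counts.items()) / n_players

ridge_mse = 0.0026522392093529137  # value reported by mean_squared_error above

print(n_players, round(baseline_mse, 5), round(ridge_mse, 5))
```

The all-zero baseline scores close to the ridge model on this metric, which suggests MSE says little about how well the top of the ballot is ordered. That is the motivation for the ranking-based metric used next.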

In [42]:
combination=combination.sort_values("Share",ascending=False)

## create rank column by mvp voting
combination["RK"]= list(range(1,combination.shape[0]+1))

In [43]:
combination

Unnamed: 0,Player,Share,predictions,RK
478,Nikola Jokić,0.961,0.147722,1
6845,Joel Embiid,0.580,0.153719,2
2934,Stephen Curry,0.449,0.149850,3
7737,Giannis Antetokounmpo,0.345,0.200824,4
1108,Chris Paul,0.138,0.069609,5
...,...,...,...,...
3424,Chris Chiozza,0.000,0.009259,536
3423,Bruce Brown,0.000,-0.000794,537
3422,Blake Griffin,0.000,0.013149,538
3421,Andre Roberson,0.000,-0.026438,539


In [45]:
combination = combination.sort_values("predictions",ascending=False)

## create rank column by predicted rank
combination["Predicted_RK"] = list(range(1, combination.shape[0]+1))

In [47]:
combination.head(10)

Unnamed: 0,Player,Share,predictions,RK,Predicted_RK
7737,Giannis Antetokounmpo,0.345,0.200824,4,1
8324,Luka Dončić,0.042,0.157725,6,2
6845,Joel Embiid,0.58,0.153719,2,3
3427,James Harden,0.001,0.150258,13,4
2934,Stephen Curry,0.449,0.14985,3,5
478,Nikola Jokić,0.961,0.147722,1,6
3019,LeBron James,0.001,0.147227,15,7
3430,Kevin Durant,0.0,0.143231,531,8
3144,Russell Westbrook,0.005,0.125669,11,9
6070,Damian Lillard,0.038,0.123984,7,10


In [61]:
combination.sort_values("Share",ascending=False).head(10)

Unnamed: 0,Player,Share,predictions,RK,Predicted_RK
478,Nikola Jokić,0.961,0.147722,1,6
6845,Joel Embiid,0.58,0.153719,2,3
2934,Stephen Curry,0.449,0.14985,3,5
7737,Giannis Antetokounmpo,0.345,0.200824,4,1
1108,Chris Paul,0.138,0.069609,5,33
8324,Luka Dončić,0.042,0.157725,6,2
6070,Damian Lillard,0.038,0.123984,7,10
2860,Julius Randle,0.02,0.086797,8,23
2855,Derrick Rose,0.01,0.027671,9,80
8614,Rudy Gobert,0.008,0.092053,10,21


In [59]:
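def find_ap(combination):
    ## Average precision for the actual top 5 in MVP voting
    actual = combination.sort_values("Share",ascending=False).head(5)
    predicted = combination.sort_values("predictions",ascending=False)
    ps = []
    found = 0
    seen = 1
    ## Walk down the predicted ranking; each time we reach an actual top-5 player,
    ## record the precision so far (top-5 players found / players seen)
    for index,row in predicted.iterrows():
        if row["Player"] in actual["Player"].values:
            found += 1
            ps.append(found/seen)
        seen += 1
    return sum(ps) / len(ps) ## average precision (higher is better, 1.0 is perfect)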

In [60]:
find_ap(combination)

0.616969696969697
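
To see where this number comes from, here is the same calculation done by hand from the `Predicted_RK` values in the table above: the actual top five (Jokić, Embiid, Curry, Antetokounmpo, Paul) were predicted at ranks 6, 3, 5, 1, and 33. The helper name `average_precision_from_ranks` is an assumption for illustration.

```python
def average_precision_from_ranks(predicted_ranks):
    """Average precision given the predicted ranks of the relevant players."""
    precisions = []
    for found, rank in enumerate(sorted(predicted_ranks), start=1):
        # When the `found`-th relevant player appears at position `rank`,
        # the precision so far is found / rank
        precisions.append(found / rank)
    return sum(precisions) / len(precisions)

# Predicted ranks of the actual top 5, taken from the table above
top5_predicted_ranks = [6, 3, 5, 1, 33]
ap = average_precision_from_ranks(top5_predicted_ranks)
print(ap)
```

The terms are 1/1, 2/3, 3/5, 4/6, and 5/33, whose mean is 509/825, matching the `find_ap` output above up to floating-point rounding.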

In [65]:
years = list(range(1999,2022))

In [79]:
## Backtest: predict each year from 2004 to 2021 using all earlier years
aps = [] ## average precision score for each year
all_predictions = []
for year in years[5:]:
    train = stats[stats["year"] < year]
    test = stats[stats["year"] == year]
    reg.fit(train[predictors], train["Share"]) ## use predictor columns to predict share
    predictions = reg.predict(test[predictors])
    predictions = pd.DataFrame(predictions,columns=["predictions"], index=test.index)
    combination = pd.concat([test[["Player", "Share"]], predictions], axis = 1)
    
    all_predictions.append(combination)
    aps.append(find_ap(combination))

In [80]:
sum(aps)/len(aps) ## Mean average precision across all test years

0.6623176551624976
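
The loop above is an expanding-window backtest: `years[5:]` starts testing at 2004, presumably so that the first model has five seasons (1999 to 2003) to train on, and every later test year trains on all seasons before it. A minimal sketch that lists the splits without fitting anything (the `splits` name is an assumption for illustration):

```python
years = list(range(1999, 2022))

# (training years, test year) for each step of the backtest
splits = [([y for y in years if y < test_year], test_year)
          for test_year in years[5:]]

first_train, first_test = splits[0]
last_train, last_test = splits[-1]

print(len(splits))                                  # number of test years
print(first_train[0], first_train[-1], first_test)  # first split
print(last_train[0], last_train[-1], last_test)     # last split
```

Because each test year only ever sees earlier seasons, no future information leaks into training.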

In [85]:
def add_ranks(combination):
    combination=combination.sort_values("Share",ascending=False)
    ## create rank column by mvp voting
    combination["RK"]= list(range(1,combination.shape[0]+1))
    combination = combination.sort_values("predictions",ascending=False)
    combination["Predicted_Rk"] = list(range(1,combination.shape[0]+1))
    combination["Diff"] = combination["RK"] - combination["Predicted_Rk"]
    return combination

In [89]:
## Rank the second backtest year (2005) and show the players who finished in the top 5 of the vote
ranking = add_ranks(all_predictions[1])
ranking[ranking["RK"]<6].sort_values("Diff",ascending=False)

Unnamed: 0,Player,Share,predictions,RK,Predicted_Rk,Diff
8011,Tim Duncan,0.258,0.170034,4,2,2
3852,Shaquille O'Neal,0.813,0.268261,2,1,1
2616,Dirk Nowitzki,0.275,0.119752,3,7,-4
4067,Steve Nash,0.839,0.048844,1,34,-33


In [92]:
def backtest(stats,model,years,predictors):
    aps = []
    all_predictions = []
    for year in years:
        train = stats[stats["year"] < year]
        test = stats[stats["year"] == year]
        model.fit(train[predictors], train["Share"]) ## use predictor columns to predict share
        predictions = model.predict(test[predictors])
        predictions = pd.DataFrame(predictions,columns=["predictions"], index=test.index)
        combination = pd.concat([test[["Player", "Share"]], predictions], axis = 1)
        combination = add_ranks(combination)
        all_predictions.append(combination)
        aps.append(find_ap(combination))
    ## mean average precision, per-year scores, and all ranked predictions
    return sum(aps)/len(aps), aps, pd.concat(all_predictions)
    

In [93]:
mean_ap,aps,all_predictions=backtest(stats,reg,years[5:],predictors)

In [94]:
mean_ap

0.6623176551624976

Let's add some more predictors to give the model more information: each player's points, assists, steals, blocks, and three-pointers relative to that season's average.

In [95]:
## Divide each stat by that year's mean across all players
stat_ratios = stats[["PTS","AST","STL","BLK","3P","year"]].groupby("year").apply(lambda x: x/x.mean())

In [96]:
stat_ratios

Unnamed: 0,PTS,AST,STL,BLK,3P,year
0,0.667203,0.302863,0.888889,0.498584,0.000000,1.0
1,0.177013,0.181718,0.000000,0.000000,0.000000,1.0
2,1.702049,0.545154,0.740741,0.997167,2.265122,1.0
3,0.449341,0.424009,0.592593,0.747875,0.000000,1.0
4,1.116544,0.605727,0.888889,1.495751,0.849421,1.0
...,...,...,...,...,...,...
10810,0.735752,0.819562,0.479763,1.528302,0.650951,1.0
10811,0.071202,0.000000,0.000000,0.000000,0.130190,1.0
10812,1.281633,0.601012,1.119447,2.547170,0.520761,1.0
10813,0.474679,0.218550,0.319842,1.273585,0.650951,1.0


In [98]:
stats[["PTS_T","AST_R","STL_R","BLK_R","3P_R"]] = stat_ratios[["PTS","AST","STL","BLK","3P"]]
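
The shape of a `groupby(...).apply(...)` result can depend on the pandas version: newer releases may add the group key to the index, in which case an assignment like the one above no longer lines up with `stats`. As a hedged alternative, `transform` returns a result aligned to the original index. The toy data and the `ratio_cols` name below are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for `stats` (values are made up for illustration)
toy = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "PTS": [10.0, 30.0, 5.0, 15.0],
    "AST": [2.0, 6.0, 1.0, 3.0],
})

ratio_cols = ["PTS", "AST"]

# Divide each stat by its mean within the same year; the result keeps toy's index
ratios = toy.groupby("year")[ratio_cols].transform(lambda col: col / col.mean())

# Attach the ratios as new columns with an "_R" suffix
toy = toy.join(ratios.add_suffix("_R"))
print(toy)
```

A ratio of 1.0 means the player was exactly average for that season, which keeps seasons with different scoring levels comparable.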

In [99]:
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,W/L%,GB,PS/G,PA/G,SRS,PTS_T,AST_R,STL_R,BLK_R,3P_R
0,A.C. Green,PF,35,DAL,50,35,18.5,2.2,5.1,0.422,...,0.38,18.0,91.6,94.0,-2.5,0.667203,0.302863,0.888889,0.498584,0.0
1,Bruno Šundov,C,18,DAL,3,0,3.7,0.7,2.3,0.286,...,0.38,18.0,91.6,94.0,-2.5,0.177013,0.181718,0.0,0.0,0.0
2,Cedric Ceballos,SF,29,DAL,13,5,27.1,4.5,10.8,0.421,...,0.38,18.0,91.6,94.0,-2.5,1.702049,0.545154,0.740741,0.997167,2.265122
3,Chris Anstey,C,24,DAL,41,4,11.5,1.2,3.4,0.36,...,0.38,18.0,91.6,94.0,-2.5,0.449341,0.424009,0.592593,0.747875,0.0
4,Dirk Nowitzki,PF,20,DAL,47,24,20.4,2.9,7.1,0.405,...,0.38,18.0,91.6,94.0,-2.5,1.116544,0.605727,0.888889,1.495751,0.849421


In [100]:
predictors += ["PTS_T","AST_R","STL_R","BLK_R","3P_R"]

In [101]:
mean_ap,aps,all_predictions=backtest(stats,reg,years[5:],predictors)

In [102]:
mean_ap

0.6781347439120803

In [103]:
## converts positions to numeric values
stats["NPos"] = stats["Pos"].astype("category").cat.codes

In [105]:
## converts Teams to numeric values
stats["NTm"] = stats["Tm"].astype("category").cat.codes
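
`cat.codes` numbers the categories in sorted order, so the integers are labels rather than quantities (a code of 2 is not "more" than a code of 1). One common alternative for linear models, shown here only as a sketch and not used in the original notebook, is one-hot encoding with `pd.get_dummies`. The toy data is made up for illustration.

```python
import pandas as pd

# Toy stand-in for `stats` (values are made up for illustration)
toy = pd.DataFrame({"Player": ["A", "B", "C", "D"],
                    "Pos": ["PF", "C", "SF", "C"]})

# Integer codes, as in the cells above: categories are numbered in sorted order
toy["NPos"] = toy["Pos"].astype("category").cat.codes

# One-hot alternative: one indicator column per position
pos_dummies = pd.get_dummies(toy["Pos"], prefix="Pos")
encoded = pd.concat([toy, pos_dummies], axis=1)
print(encoded)
```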

A linear regression cannot make much use of these values, because the integer codes are arbitrary labels rather than quantities, so let's try a random forest model.

In [108]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=50,random_state=1,min_samples_split=5)

mean_ap,aps,all_predictions=backtest(stats,rf,years[18:],predictors)

In [109]:
mean_ap

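0.6781347439120803

This is exactly the ridge score from the previous backtest, which is a warning sign rather than a result. It was produced by a `backtest` that fit the global `reg` and looped over `years[5:]` whatever arguments it was given, so the random forest was never actually evaluated. Note also that `NPos` and `NTm` were never added to `predictors`. Getting a real random forest score means rerunning this cell with a `backtest` that uses its `model` and `years` arguments.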

# Summary

In this analysis we took on the project of building a model to predict the NBA MVP winner for any given year from 1999 to 2021.

The steps we took to do so were:

- Used BeautifulSoup and Selenium to scrape MVP, player, and team stats for each year from the NBA reference website

- Cleaned the scraped data to get it ready for machine learning

- Split the data into train and test sets

- Created an initial machine learning model and evaluated it to find a suitable error metric

- Built a backtesting function on top of that, which trains and tests across several years and returns a single error metric

- Ran diagnostics to see which columns the model relied on most and how we could improve performance

- Added more predictors to improve the prediction metric

- Finally, set up a random forest model, which can make better use of encoded categorical columns such as position and team
 
