# NBA Most Valuable Player
## Data Science Final Project

Group Member 1: David Basin
Group Member 2: Mateo Castro
Group Member 3: Abed Islam
***

We will attempt to predict who will win the NBA Most Valuable Player award using machine learning models such as support vector machines (SVM) and linear regression, validated with K-fold cross-validation. Our findings will be based on the advanced analytics of past MVP winners, their teams' records, and the advanced analytics of this year's players. The datasets used are provided.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, linear_model, metrics
%matplotlib inline 

### Part 1: Acquiring and cleaning the data
We load the data and clean it so that only the information needed for our models remains.

In [83]:
# Historical standings; 2017-18 is dropped here and reloaded from the per-conference files below
team_records = pd.read_csv("Standings/Team_Records.csv")
team_records = team_records.loc[team_records["Season"] != "2017-18"]
# Append East and West standings for seasons 17-18 through 21-22
for i in range(17,22):
    years = str(i)+"-"+str(i+1)
    east_standings = pd.read_csv("Standings/"+years+"_East.csv")
    west_standings = pd.read_csv("Standings/"+years+"_West.csv")
    team_records = pd.concat([team_records,east_standings,west_standings])
# Shorten season labels, e.g. "2016-17" becomes "16-17"
team_records["Season"] = team_records["Season"].str.slice(2)
# Strip the "*" marker from team names
team_records["Team"] = team_records["Team"].str.replace('*','',regex=False)
team_records.set_index(["Season","Team"],inplace=True)
team_dict = {"ATL":"Atlanta Hawks", "BRK":"Brooklyn Nets","BOS":"Boston Celtics","CHO":"Charlotte Hornets",
    "CHI":"Chicago Bulls","CLE":"Cleveland Cavaliers","DAL":"Dallas Mavericks","DEN":"Denver Nuggets",
    "DET":"Detroit Pistons","GSW":"Golden State Warriors","HOU":"Houston Rockets","IND":"Indiana Pacers",
    "LAC":"Los Angeles Clippers","LAL":"Los Angeles Lakers","MEM":"Memphis Grizzlies","MIA":"Miami Heat",
    "MIL":"Milwaukee Bucks","MIN":"Minnesota Timberwolves","NOP":"New Orleans Pelicans","NYK":"New York Knicks",
    "OKC":"Oklahoma City Thunder","ORL":"Orlando Magic","PHI":"Philadelphia 76ers","PHO":"Phoenix Suns",
    "POR":"Portland Trail Blazers","SAC":"Sacramento Kings","SAS":"San Antonio Spurs","TOR":"Toronto Raptors",
    "UTA":"Utah Jazz","WAS":"Washington Wizards","NOJ":"New Orleans Jazz","SEA":"Seattle SuperSonics",
    "WSB":"Washington Bullets","SDC":"San Diego Clippers","KCK":"Kansas City Kings","NJN":"New Jersey Nets",
    "CHH":"Charlotte Hornets","NOH":"New Orleans Hornets","CHA":"Charlotte Bobcats"}
    

season_frames = []  # one cleaned frame per season, combined after the loop
for i in range(77,121):
    # Two-digit season label, e.g. "77-78", "99-00", "20-21"
    years = str(i)[-2:]+"-"+str(i+1)[-2:]
    voting = pd.read_csv("Voting/"+years+"_Voting.csv", encoding = "ISO-8859-1", engine="python")
    advanced = pd.read_csv("advanced_stats/"+years+"_advanced.csv", encoding = "ISO-8859-1", engine="python")
    pergame = pd.read_csv("PerGame/PerGame_"+years+".csv", encoding = "ISO-8859-1", engine="python")
    #per game: FG%, TRB, AST, STL, BLK, TOV, PTS
    #advanced: PER, TS%, USG%, WS, OBPM, DBPM, VORP
    #Voting: Tm, Share
    voting = voting.loc[:,["Player","Tm","Share"]]
    advanced = advanced.loc[:,["Player","PER","TS%","USG%","WS","OBPM","DBPM","VORP"]]
    advanced["Player"] = advanced["Player"].str.replace('*','',regex=False)
    pergame = pergame.loc[:,["Player","FG%","TRB","AST","STL","BLK","TOV","PTS"]]
    pergame["Player"] = pergame["Player"].str.replace('*','',regex=False)
    final_df = pd.merge(voting, advanced, how="left", on="Player")
    final_df = pd.merge(final_df, pergame, how="left", on="Player")
    final_df["Year"] = years
    final_df["W/L%"] = 0.0  # float column, filled in row by row below
    for idx, row in final_df.iterrows():
        if row["Tm"] == "TOT":
            # "TOT" rows are players who played for several teams; use a fixed placeholder record
            final_df.loc[idx,"W/L%"] = .625
        else:
            team = team_dict[row["Tm"]]
            final_df.loc[idx,"W/L%"] = team_records.loc[(row["Year"],team)].at["W/L%"]
    # Normalize vote shares so they sum to 1 within the season
    final_df["Share"] = final_df["Share"]/final_df["Share"].sum()
    season_frames.append(final_df)
# Combine all seasons; without this step only the last season would be kept
final_df = pd.concat(season_frames, ignore_index=True)
advanced_22 = pd.read_csv("advanced_stats/21-22_advanced.csv", encoding = "ISO-8859-1", engine="python")
pergame_22 = pd.read_csv("PerGame/PerGame_21-22.csv", encoding = "ISO-8859-1", engine="python")
advanced_22 = advanced_22.loc[:,["Player","Tm","PER","TS%","USG%","WS","OBPM","DBPM","VORP"]]
advanced_22["Player"] = advanced_22["Player"].str.replace('*','',regex=False)
pergame_22 = pergame_22.loc[:,["Player","FG%","TRB","AST","STL","BLK","TOV","PTS"]]
pergame_22["Player"] = pergame_22["Player"].str.replace('*','',regex=False)
final_df_22 = pd.merge(advanced_22, pergame_22, how="left", on="Player")
for i, row in final_df_22.iterrows():
    if row["Tm"] == "TOT":
        final_df_22.loc[i,"W/L%"] = .625
    else:
        team = team_dict[row["Tm"]]
        final_df_22.loc[i,"W/L%"] = team_records.loc[("21-22",team)].at["W/L%"]
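
The row-by-row lookup of each player's team record could also be written as a single join. The sketch below is one possible alternative, not the approach used above. It runs on tiny stand-in frames whose names mirror the notebook's, and every number in them is made up for illustration. `TOT_WIN_PCT` is a hypothetical name for the 0.625 placeholder.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects; all values are invented.
team_dict = {"MIL": "Milwaukee Bucks", "DEN": "Denver Nuggets"}
team_records = pd.DataFrame({
    "Season": ["20-21", "20-21"],
    "Team": ["Milwaukee Bucks", "Denver Nuggets"],
    "W/L%": [0.6, 0.7],
})
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Tm": ["MIL", "DEN", "TOT"],
    "Year": ["20-21", "20-21", "20-21"],
})

TOT_WIN_PCT = 0.625  # hypothetical name for the placeholder used for "TOT" rows

# Translate abbreviations to full team names; "TOT" has no entry and becomes NaN.
players["Team"] = players["Tm"].map(team_dict)

# Join on (season, team) instead of looking up one row at a time.
records = team_records.rename(columns={"Season": "Year"})
merged = players.merge(records, how="left", on=["Year", "Team"])

# Rows without a matching team record fall back to the placeholder.
merged["W/L%"] = merged["W/L%"].fillna(TOT_WIN_PCT)
print(merged[["Player", "Tm", "W/L%"]])
```

One trade-off to keep in mind: the loop raises a `KeyError` for an abbreviation missing from `team_dict`, while this version would silently assign the placeholder, so an explicit check for unexpected missing teams may be worth adding.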

### Part 2: EDA 
In this part, we will perform exploratory data analysis (EDA) on the dataset to gain some initial insights into the data.

In [None]:
per_game_1 = pd.read_csv("Per")
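
As a possible first EDA step, one could rank the features by their correlation with the vote share. The sketch below assumes a frame with a `Share` column and a few of the feature columns from Part 1. It uses randomly generated stand-in data, so the printed correlations carry no meaning and only the mechanics are illustrated.

```python
import numpy as np
import pandas as pd

# Random stand-in data; real values would come from the cleaned frame in Part 1.
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({
    "PER": rng.normal(22, 4, n),
    "WS": rng.normal(10, 3, n),
    "PTS": rng.normal(24, 5, n),
    "W/L%": rng.uniform(0.4, 0.8, n),
    "Share": rng.uniform(0, 1, n),
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_share = (
    toy.corr()["Share"]
    .drop("Share")
    .sort_values(ascending=False)
)
print(corr_with_share)
```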

### Part 3: Creating and training the model
In this part, we create the model and fit it on the training set.
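
A minimal sketch of this step, assuming the vote share is treated as a regression target and that both model families named in the introduction are tried. The feature matrix here is synthetic (random features with a made-up linear relationship), and the 75/25 split and the SVR settings are illustrative choices rather than tuned values.

```python
import numpy as np
from sklearn import linear_model, svm
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 "players", 4 features, target is linear plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, 0.3, 0.1, 0.2]) + rng.normal(scale=0.1, size=200)

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit both candidate models on the training set only.
lin_model = linear_model.LinearRegression().fit(X_train, y_train)
svr_model = svm.SVR(kernel="rbf", C=1.0).fit(X_train, y_train)

# score() reports R^2 on the held-out rows.
lin_r2 = lin_model.score(X_test, y_test)
svr_r2 = svr_model.score(X_test, y_test)
print("linear regression R^2:", lin_r2)
print("SVR R^2:", svr_r2)
```

With the real data, splitting by season rather than by row may be preferable, since players from the same season compete for the same votes.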

### Part 4: Validating and testing the model
We will tune the model using K-fold cross-validation to find the best parameters, and then test the resulting model.
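
One way to combine K-fold cross-validation with a parameter search is scikit-learn's `GridSearchCV`. The sketch below assumes an SVR model and a small, arbitrary parameter grid. The data is synthetic, and the number of folds and the scoring metric are illustrative choices.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data with a made-up linear relationship.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

# 5 shuffled folds; each parameter combination is scored on every fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

search = GridSearchCV(
    svm.SVR(), param_grid, cv=cv, scoring="neg_mean_squared_error"
)
search.fit(X, y)

# best_score_ is the negated MSE, so flip the sign for reporting.
print("best parameters:", search.best_params_)
print("cross-validated MSE:", -search.best_score_)
```

By default the search refits the best configuration on all supplied rows, so `search.best_estimator_` can then be evaluated on a separate test set.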

### Part 5: Determining the MVP
Finally, the MVP will be determined for this season using the model created.
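
A sketch of how the final pick could be made, assuming a fitted regression model that predicts vote share: predict a share for every current player and take the row with the highest value. The training rows, player names, and feature values below are all invented, and only two features are used to keep the example short.

```python
import pandas as pd
from sklearn import linear_model

# Invented training rows standing in for past seasons.
train = pd.DataFrame({
    "PER": [15, 20, 25, 30],
    "WS": [5, 8, 12, 15],
    "Share": [0.05, 0.24, 0.46, 0.65],
})
# Invented current-season candidates.
current = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "PER": [22, 29, 18],
    "WS": [9, 14, 6],
})

features = ["PER", "WS"]
model = linear_model.LinearRegression().fit(train[features], train["Share"])

# Predict a vote share for each candidate and pick the largest.
current = current.assign(PredictedShare=model.predict(current[features]))
mvp_row = current.loc[current["PredictedShare"].idxmax()]
print("Predicted MVP:", mvp_row["Player"])
```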