# IRONHACK mini project


### The challenge

* Perform an end-to-end analysis putting into practice what you have learned so far. You will apply statistical or machine learning techniques and present your results to the class.


### The goal

* We act as consultants for an NBA team that wants to know how important Draft picks are for a franchise. We aim to show how much a pick matters and how well a player is likely to perform depending on where he was selected in the Draft.
* Is it worth having one of the top selections?

In [1]:
# Import libraries and dependencies

import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

pd.set_option('display.max_columns', None) # We want to see all the columns

In [2]:
# Load the original dataset

nba = pd.read_csv("Data/20_Years_of_NBA_Draft_Data/draft-data-20-years.csv")
nba

FileNotFoundError: [Errno 2] No such file or directory: '../Data/20_Years_of_NBA_Draft_Data/draft-data-20-years.csv'

In [None]:
nba.shape

In [None]:
nba.info()

In [None]:
nba.describe(include="all")

#### CONCLUSION:
>After inspecting the dataset and the descriptive tables, we see that the column 'Unnamed: 0' is just an index column and that there are two pairs of columns holding exactly the same values: ('Rk' and 'Pk') and ('DraftYr' and 'DraftYear'). Moreover, the overall draft picks are sorted in ascending order, whereas the rest of the numerical columns are sorted in descending order.

>CLARIFICATION: We are not going to drop the duplicated column 'Rk' because we will use it as the groupby column.

>We also observe that some rows are almost entirely NaN. Those rows belong to players who were selected in the Draft but never played in the NBA. As our study concerns athletes who actually played, we decided to drop them all.

>The describe table shows that at least one name is repeated, so we are going to check it.
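
The observation about near-empty rows can be checked directly. The following is a minimal sketch on a made-up toy table (the player names and values are invented for illustration, and `stat_cols` is a hypothetical helper list); it counts the missing values per row and isolates the rows whose career statistics are all missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the draft table (all values are made up for illustration)
toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Pk": [1, 2, 3, 4],
    "G": [820.0, np.nan, 410.0, np.nan],
    "TOTPTS": [15000.0, np.nan, 3200.0, np.nan],
    "WS": [95.1, np.nan, 12.4, np.nan],
})

# Number of missing values in each row
missing_per_row = toy.isna().sum(axis=1)

# Rows where every career statistic is missing: drafted but never played
stat_cols = ["G", "TOTPTS", "WS"]  # hypothetical subset of the statistic columns
never_played = toy[toy[stat_cols].isna().all(axis=1)]

print(missing_per_row.tolist())
print(never_played["Player"].tolist())
```

On the real dataset, the same two expressions applied to `nba` would show how many rows `dropna()` is about to remove and why.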

In [None]:
# Dropping unnecessary and repeated columns

nba = nba.drop(['Unnamed: 0', 'Tm', 'College', 'DraftYr', 'playerurl'], axis =1) # Dropping columns

In [None]:
nba.head(5) # Checking

In [None]:
# Checking for duplicated player names

nba.Player.value_counts()

In [None]:
# Checking for fully duplicated rows (the result is only displayed, not assigned back to nba)
# If we deduplicated on player names alone, we could drop different players who happen to share the same name

nba.drop_duplicates(ignore_index=True)

>We see the same number of rows as in the original dataset after dropping, so there are no duplicated players
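
The difference between a repeated name and a truly duplicated row can be made explicit. This is a small sketch on invented data (the names and picks are hypothetical) that separates the two cases with `DataFrame.duplicated`.

```python
import pandas as pd

# Toy data: two different players share a name (values are made up)
toy = pd.DataFrame({
    "Player": ["John Smith", "John Smith", "Alex Jones"],
    "Pk": [22, 33, 28],
    "DraftYear": [2006, 2007, 2001],
})

# Rows whose player name appears more than once
same_name = toy[toy.duplicated(subset=["Player"], keep=False)]

# Rows that are identical in every column (what drop_duplicates() removes)
full_duplicates = toy[toy.duplicated(keep=False)]

print(same_name)
print(len(full_duplicates))
```

Here the name is repeated but no row is fully duplicated, which matches the situation described above: the repeated names belong to different people.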

In [None]:
# Counting the NaN values in each column

nba.isna().sum()

In [None]:
nba = nba.dropna().reset_index(drop=True) # Dropping them
nba.isna().sum()

In [None]:
# Renaming the headers

nba.rename(columns={'Rk':'groupby', 'Pk':'overall_pick', 'Yrs':'years_played', 'G':'total_games_played', 'TOTMP':'total_minutes_played', 'TOTPTS':'total_points_scored', 'TOTTRB':'total_rebounds', 'TOTAST':'total_assists', 'FG%':'field_goal_percentage', '3P%':'3_point_percentage',
       'FT%':'free_throw_percentage', 'WS':'win_shares', 'WS/48':'win_shares_48_minutes', 'BPM':'box_plus_minus', 'VORP':'value_over_replacement_player', 'MPG':'minutes_per_game', 'PPG':'points_per_game', 'RPG':'rebounds_per_game', 'APG':'assists_per_game',
       'DraftYear':'draft_year'},inplace=True)
nba.head(5)

In [None]:
# We create a new DataFrame grouped by overall pick, holding the mean of every other numeric feature
# numeric_only=True skips the 'Player' text column, which cannot be averaged
# We also drop the 'draft_year' column because its mean carries no useful information

nba_groupby = nba.groupby("groupby").mean(numeric_only=True).round(2).drop(columns=["draft_year"])
nba_groupby

In [None]:
sns.pairplot(nba_groupby)

In [None]:
nba_groupby.corr("spearman").style.background_gradient(cmap='coolwarm')

#### CONCLUSION

* Just by looking at the pairplot and the heatmap, we see a strong correlation between the overall pick and the rest of the features. This correlation is negative because a lower pick number means an earlier selection, so the statistics fall as the pick number rises. We can therefore already conclude that there is a close relationship between the overall draft pick and a player's career performance.
* The rest of the features are also correlated with each other: the more games played, the more points scored, and so on.
* Only the percentage variables show no clear correlation. More games or a higher pick do not imply a better percentage; it is only slightly higher, which is not very relevant.
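
To read these relationships as numbers rather than colors, the Spearman correlations against the pick can be extracted and sorted. The sketch below uses synthetic data generated with an assumed linear trend plus noise, so the values only illustrate the pattern and are not results from the draft dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for nba_groupby (trend and noise levels are assumptions)
rng = np.random.default_rng(0)
pick = np.arange(1, 61)
toy = pd.DataFrame({
    "overall_pick": pick,
    "total_points_scored": 9000 - 120 * pick + rng.normal(0, 300, pick.size),
    "free_throw_percentage": rng.normal(0.72, 0.03, pick.size),
})

# Spearman correlation of every feature with the overall pick, most negative first
corr_with_pick = (
    toy.corr(method="spearman")["overall_pick"]
    .drop("overall_pick")
    .sort_values()
)
print(corr_with_pick)
```

A strongly negative value means the statistic drops as the pick number rises, while a value near zero corresponds to the behavior described for the percentage columns.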

In [None]:
# Creating a new DataFrame without the highly correlated columns in order to avoid multicollinearity

nba_groupby_new = nba_groupby.drop(['total_games_played', 'total_minutes_played', 'total_rebounds', 'total_assists', 
                                    'field_goal_percentage', '3_point_percentage', 'free_throw_percentage', 
                                    'win_shares_48_minutes', 'minutes_per_game', 'rebounds_per_game', 'assists_per_game'], 
                                   axis =1)

In [None]:
nba_groupby_new.corr("spearman").style.background_gradient(cmap='coolwarm')
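
Besides reading the heatmap, one common way to quantify multicollinearity is the variance inflation factor (VIF), which is not used in this notebook. The sketch below is a hedged illustration: `vif_table` is a hypothetical helper, and the data is synthetic with one pair of columns made nearly collinear on purpose.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(df):
    """Return the variance inflation factor of each column of df."""
    vifs = {}
    for col in df.columns:
        others = df.drop(columns=[col])
        # R² of regressing this column on all the other columns
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic data: minutes is almost a multiple of games, box_plus_minus is independent
rng = np.random.default_rng(0)
n = 200
games = rng.normal(500, 100, n)
toy = pd.DataFrame({
    "total_games_played": games,
    "total_minutes_played": 25 * games + rng.normal(0, 100, n),
    "box_plus_minus": rng.normal(0, 2, n),
})

vif = vif_table(toy)
print(vif.round(2))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign that a column is largely explained by the others, which would support dropping it.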

### Applying the model

In [None]:
# Split between explanatory features and the target variable

features_list = ['years_played', 'total_points_scored', 'win_shares', 'box_plus_minus', 
                 'value_over_replacement_player', 'points_per_game']

X = nba_groupby_new.loc[:, features_list]
y = nba_groupby_new.loc[:, "overall_pick"]

In [None]:
# Division of the dataset into a train set and a test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=0)

In [None]:
# Checking that the sizes of the split sets match

print("Number of rows of X_train = {}".format(len(X_train)))
print("Number of rows of X_test = {}".format(len(X_test)))
print("Number of rows of y_train= {}".format(len(y_train)))
print("Number of rows of y_test = {}".format(len(y_test)))
print("Percentage of train values = {} %".format(round(len(X_train) / len(X) * 100, 2)))

In [None]:
# Applying the scaler to the train and test set

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [None]:
# Creating the model with sklearn

from sklearn.linear_model import LinearRegression
regressor = LinearRegression() # Instantiation of the model
regressor.fit(X_train, y_train)

In [None]:
# The coefficients of the regressor

regressor.coef_
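
A bare coefficient array is hard to read because it carries no feature names. The sketch below, on synthetic data with assumed trends, pairs each coefficient with its column; since the features are standardized first, the magnitudes are comparable with each other.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the grouped data (trends and noise are assumptions)
rng = np.random.default_rng(0)
pick = np.arange(1, 61)
X_toy = pd.DataFrame({
    "years_played": 10 - 0.1 * pick + rng.normal(0, 0.3, pick.size),
    "points_per_game": 14 - 0.15 * pick + rng.normal(0, 0.5, pick.size),
})

# Standardize, fit, and label each coefficient with its feature name
X_scaled = StandardScaler().fit_transform(X_toy)
model = LinearRegression().fit(X_scaled, pick)
coefs = pd.Series(model.coef_, index=X_toy.columns).sort_values()
print(coefs)
```

In the notebook itself, the equivalent would be `pd.Series(regressor.coef_, index=features_list)`, assuming the column order of `X` has not changed since the split.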

In [None]:
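# Checking for overfitting (score() returns the R² coefficient of determination)

print(" Score of Train : {}\n Score of Test : {}".format(regressor.score(X_train, y_train), regressor.score(X_test, y_test)))

>The train and test scores are close, so the model does not appear to overfit and generalizes well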
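
Because the grouped table has only one row per pick, a single 80/20 split leaves very few test rows, and the score can depend on which rows land there. A k-fold cross-validation gives a steadier picture. This is a sketch under stated assumptions: the data is synthetic, and the pipeline and fold settings are illustrative choices rather than part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the grouped data (trends and noise are assumptions)
rng = np.random.default_rng(0)
pick = np.arange(1, 61)
X_toy = pd.DataFrame({
    "years_played": 10 - 0.1 * pick + rng.normal(0, 0.3, pick.size),
    "points_per_game": 14 - 0.15 * pick + rng.normal(0, 0.5, pick.size),
})
y_toy = pd.Series(pick, name="overall_pick")

# The scaler sits inside the pipeline, so it is refit on each training fold only
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring="r2")

print("R² per fold:", scores.round(3))
print("Mean R²:", round(scores.mean(), 3))
```

If the fold scores are similar to each other and to the train score, that supports the conclusion that the model is not overfitting.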

In [None]:
# Error metrics computed on the train set
predictions = regressor.predict(X_train)

mse = mean_squared_error(y_train, predictions)
rmse = np.sqrt(mse) # the 'squared' argument is deprecated in recent scikit-learn versions
mae = mean_absolute_error(y_train, predictions)
print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)

In [None]:
# Visualization of predictions on the train set vs real values of y (ground truth)

t = np.arange(0, predictions.size)

fig, ax = plt.subplots()

ax.scatter(t, predictions)
ax.scatter(t, y_train)

plt.title("Predictions vs Groundtruth overall_pick in the train set") 

plt.xlabel("Index of the train set") 
plt.ylabel("overall_pick") 
 
plt.legend(["Predictions train", "Groundtruth"], loc ="upper right")

plt.show()

### Visualization

In [None]:
nba1 = nba[nba["total_games_played"] > 250] # Keeping only players with more than 250 games played

fig = px.scatter_3d(nba1, x = "draft_year", y = "overall_pick", z = "total_points_scored", 
                    opacity = 0.75, hover_data = ["Player"],
                    color = "overall_pick", color_continuous_scale = "haline_r",
                    title = "TOTAL POINTS SCORED BY OVERALL PICK AND YEAR")

fig.update_traces(marker = dict(size = 3.5)) # scaling down the markers
fig.update_layout(template = "plotly_dark", font = dict(family = "PT Sans", size = 12))
fig.show()

* We want to know how important a player is. What happens when he is off the court?

In [None]:
fig = px.scatter_3d(nba1, x = "draft_year", y = "overall_pick", z = "value_over_replacement_player", 
                    opacity = 0.75, hover_data = ["Player"],
                    color = "overall_pick", color_continuous_scale = "Plotly3_r",
                    title = "VALUE OVER REPLACEMENT PLAYER BY OVERALL PICK AND YEAR")

fig.update_traces(marker = dict(size = 3.5)) # scaling down the markers
fig.update_layout(template = "plotly_dark", font = dict(family = "PT Sans", size = 12))
fig.show()

In [None]:
fig = px.bar(nba_groupby, x = "overall_pick", y = "years_played", title = "Career longevity based on draft overall pick",
             color = "overall_pick", color_continuous_scale = "Thermal_r")

fig.update_layout(template = "plotly_dark", font = dict(family = "PT Sans", size = 18))
fig.show()

# Conclusion

>Is it worthwhile to have a selection among the first fifteen positions?

>Our research shows that:

* Players selected in the top positions of the Draft perform better in points scored and team impact throughout their professional careers in the NBA

* Career longevity shows a relationship with the Draft pick. In our opinion, this is an indirect measure of a player's quality, as we assume that the NBA is a Darwinian, highly competitive environment where long careers are only available to high-performing players
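
The question about the first fifteen positions can also be answered with a direct group comparison. The following is only a sketch on synthetic data with an assumed downward trend (the `pick_group` column is a hypothetical helper), so the printed means illustrate the method and are not findings from the draft dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-pick averages (trends and noise are assumptions)
rng = np.random.default_rng(0)
pick = np.arange(1, 61)
toy = pd.DataFrame({
    "overall_pick": pick,
    "years_played": 10 - 0.1 * pick + rng.normal(0, 0.3, pick.size),
    "value_over_replacement_player": 12 - 0.2 * pick + rng.normal(0, 1, pick.size),
})

# Label each pick as top 15 or not, then compare the group means
toy["pick_group"] = np.where(toy["overall_pick"] <= 15, "top_15", "rest")
summary = toy.groupby("pick_group")[
    ["years_played", "value_over_replacement_player"]
].mean()
print(summary.round(2))
```

Applied to `nba_groupby`, the same two lines would put a number on the gap between the top fifteen picks and the rest.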

### Based on the analysis of this data, we conclude that picking players from the top positions of the Draft has a positive impact on teams