# Capstone Part 4: Report Writeup + Technical Analysis

### Note:

The code for my Model Selection, Model Tuning, and Prediction Making can be found in the notebooks listed along with this one in this issue. The specific notebooks are named:

- capstone_p3_nb2 (for Model Selection/Tuning)
- capstone_p3_nb5 (for Predictions/Results)

### Executive Summary

The goal of this project was to predict the NBA performance of the college basketball players entering the 2017 NBA draft. To do this, I first scraped DraftExpress.com and Sports-Reference.com and parsed the HTML with BeautifulSoup to collect all of my player data. After cleaning the data and dealing with missing values, I tested a number of different models to determine the best one. I then tuned the chosen model with GridSearchCV to find the best parameters and used it to make predictions for the draft class.
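
The scraping code itself lives in the earlier notebooks, so the following is only a minimal sketch of the parsing step. The HTML fragment, the table id `players_per_game`, the `data-stat` attributes, and the helper `parse_stat_row` are all assumptions made for illustration, and the numbers are made up.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a scraped stats page (assumption).
SAMPLE_HTML = """
<table id="players_per_game">
  <tbody>
    <tr>
      <td data-stat="season">Career</td>
      <td data-stat="g">25</td>
      <td data-stat="mp_per_g">35.7</td>
      <td data-stat="pts_per_g">23.2</td>
    </tr>
  </tbody>
</table>
"""


def parse_stat_row(html, table_id):
    """Return {stat_name: text} for the last row of the table with this id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return {}
    rows = table.find_all("tr")
    if not rows:
        return {}
    # Keep only cells that carry a stat name.
    return {
        cell["data-stat"]: cell.get_text(strip=True)
        for cell in rows[-1].find_all("td")
        if cell.has_attr("data-stat")
    }


career = parse_stat_row(SAMPLE_HTML, "players_per_game")
print(career)
```

The values come back as strings, so a real pipeline would still need to convert them to numbers and handle empty cells.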

### The Variables

In the end I used 39 variables for my model. The first 17 were college statistics and physical measurements:

- g: games played in college
- mp_per_g: minutes played per game
- fg_per_g: field goals made per game
- fga_per_g: field goals attempted per game
- fg_pct: field goal percentage	
- ft_per_g: free throws made per game
- fta_per_g: free throws attempted per game
- ft_pct: free throw percentage
- trb_per_g: total rebounds per game
- ast_per_g: assists per game	
- stl_per_g: steals per game
- blk_per_g: blocks per game
- tov_per_g: turnovers per game
- pf_per_g: personal fouls per game
- pts_per_g: points per game
- height: player height in inches
- weight: player weight in lbs

I also created dummy variables for the position categories and for the most frequently occurring schools in my training set. They were:

- Forward	
- Guard	
- UNC	
- Kentucky	
- UCLA	
- Duke	
- Arizona	
- Kansas	
- UConn	
- Michigan	
- Georgia Tech	
- Michigan State	
- Alabama	
- Georgetown	
- Syracuse	
- Arkansas	
- Memphis	
- Louisville	
- Indiana	
- LSU	
- Notre Dame	
- Ohio State
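
A rough sketch of how dummy columns like these could be built with pandas is shown here. The toy data, the `TOP_N` cutoff, and the choice to drop a "Center" column as the baseline position are assumptions for illustration, not the exact steps from my notebooks.

```python
import pandas as pd

# Toy data (assumption); the real training set has 1304 players.
players = pd.DataFrame({
    "school": ["Duke", "Duke", "Kansas", "UCLA", "Kansas", "Duke", "Butler"],
    "position": ["Guard", "Forward", "Center", "Guard", "Forward", "Guard", "Guard"],
})

TOP_N = 2  # hypothetical cutoff; the report keeps 20 schools

# Schools ordered from most to least frequent, truncated to the top N.
top_schools = players["school"].value_counts().head(TOP_N).index

# One 0/1 column per frequent school; all other schools get zeros everywhere.
school_dummies = pd.DataFrame(
    {school: (players["school"] == school).astype(int) for school in top_schools}
)

# One 0/1 column per position, dropping one level as the baseline (assumption).
position_dummies = (
    pd.get_dummies(players["position"]).drop(columns="Center").astype(int)
)

features = pd.concat([school_dummies, position_dummies], axis=1)
print(features)
```

Players from schools outside the top group simply have zeros in every school column, which keeps the number of features manageable.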

I had originally scraped 2-point and 3-point field goal counts and percentages as well. However, a lot of rows in my training set were missing those values. Since the overall field goal values were complete in my training set, I decided to drop the 2-point and 3-point variables and keep the overall field goal columns.
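
One way to make that kind of decision explicit is to measure the share of missing values per column and drop the columns above a cutoff. This is a small sketch under assumptions: the column names `fg2_pct` and `fg3_pct`, the toy values, and the `MAX_MISSING` threshold are hypothetical.

```python
import pandas as pd

NAN = float("nan")

# Toy data (assumption): overall field goal percentage is complete,
# the 2-point and 3-point percentages are mostly missing.
stats = pd.DataFrame({
    "fg_pct": [0.45, 0.51, 0.48, 0.43],
    "fg2_pct": [NAN, 0.55, NAN, 0.47],
    "fg3_pct": [NAN, 0.36, NAN, NAN],
})

MAX_MISSING = 0.25  # hypothetical threshold

# Fraction of missing values in each column.
missing_share = stats.isna().mean()

# Drop every column whose missing share exceeds the threshold.
to_drop = missing_share[missing_share > MAX_MISSING].index.tolist()
cleaned = stats.drop(columns=to_drop)

print(missing_share)
print("dropped:", to_drop)
```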

### The Target

I originally wanted to predict a specific PER (Player Efficiency Rating) number for each college player, but I was unable to build a good model for that. So I changed my approach and predicted a PER range instead. This changed the models from regressions to classifications.

I created 3 classes based on PER. Players with a PER below 12 were labeled as "bad", players between 12 and 17 inclusive were labeled as "average", and players over 17 were labeled as "good".
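
The class boundaries can be written as a small function. The name `per_to_class` is hypothetical; the thresholds are the ones stated above.

```python
def per_to_class(per):
    """Map a PER value to the class labels used in this report."""
    if per < 12:
        return "bad"
    if per <= 17:  # 12 to 17 inclusive
        return "average"
    return "good"


for per in (9.5, 12, 17, 21.3):
    print(per, per_to_class(per))
```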

Of the 1304 players in my training set, 466 were "bad", 709 were "average", and 129 were "good". I know that these classes are imbalanced, but I kept them because adjusting the ranges would make the classifications lose their meaning. The distribution also makes sense to me because, in the real world, the number of good players in the league is a lot lower than the number of average players.

With this class distribution in my training set, the baseline accuracy I was trying to beat with my models was about 54 percent, which is what always predicting "average" (709 of 1304 players) would score.
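
The baseline figure follows directly from the class counts, as this short check shows. It treats the baseline as the share of the majority class.

```python
# Class counts from the training set described above.
class_counts = {"bad": 466, "average": 709, "good": 129}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"{majority_class}: {baseline_accuracy:.4f}")  # average: 0.5437
```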

### Model Selection

I tested five different models to see what would work best, and selected one by comparing their mean cross-validation scores from cross_val_score. The results were as follows:

- Random Forest: 0.5497
- K Neighbors: 0.5567
- Linear SVC: 0.5681
- SVC: 0.5766
- Gradient Boost: 0.5789

After looking at the scores, I decided that I would use Gradient Boost as my classifier.
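
The comparison can be sketched roughly as follows. The synthetic data from `make_classification`, the five folds, and every hyperparameter shown are assumptions for illustration; the actual code and settings are in capstone_p3_nb2, and the scores printed here will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

# Synthetic 3-class data standing in for the player features (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "K Neighbors": KNeighborsClassifier(),
    "Linear SVC": LinearSVC(max_iter=5000, random_state=0),
    "SVC": SVC(),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
}

# Mean accuracy across folds for each candidate model.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(mean_scores, key=mean_scores.get)

for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f}")
print("best:", best_name)
```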

### Model Tuning

I used GridSearchCV to find the best parameters for my Gradient Boost model. The grid search selected the following parameters:

- max_depth: 3
- max_features: log2
- min_samples_leaf: 2
- min_samples_split: 4
- n_estimators: 200

After finding those parameters, I checked the cross_val_score for the tuned model and it was 0.5912.

Out of curiosity, I then wrapped the tuned model in a bagging classifier, which gave a cross_val_score of 0.5965. Since that was a slight improvement, I decided to use the bagged model to make my predictions.
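
A minimal sketch of the tuning and bagging steps is shown here. The grid is hypothetical (it only includes the selected values plus a couple of alternatives), and the synthetic data, the three folds, and the number of bagged estimators are assumptions; the real grid is in capstone_p3_nb2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic 3-class data standing in for the player features (assumption).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical grid; only the selected values are taken from the report.
param_grid = {
    "max_depth": [2, 3],
    "max_features": ["log2"],
    "min_samples_leaf": [2],
    "min_samples_split": [4],
    "n_estimators": [100, 200],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
tuned_gb = search.best_estimator_

# Wrap the tuned model in a bagging ensemble (5 copies is an assumption).
bagged = BaggingClassifier(tuned_gb, n_estimators=5, random_state=0)

tuned_score = cross_val_score(tuned_gb, X, y, cv=3).mean()
bagged_score = cross_val_score(bagged, X, y, cv=3).mean()

print("best params:", search.best_params_)
print(f"tuned: {tuned_score:.4f}  bagged: {bagged_score:.4f}")
```

Bagging fits each copy of the tuned model on a bootstrap sample and combines their predictions, which can smooth out some variance, though a gain is not guaranteed on every dataset.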

### Results

After getting my model ready, I made the predictions for the college players. The distribution of the predictions was:

    import pandas as pd
    df = pd.read_csv('/Users/ct/DSI-NYC-4/projects/capstone/part-03/cp_pred_3.csv')
    df['Pred'].value_counts()

    average    50
    bad        25
    good        3
    Name: Pred, dtype: int64

I then made my own rankings for these players by sorting them by their predicted probability of being in the "good" class. The rankings were:

    df[['name', 'Pred', 'good']].sort_values('good', ascending=False).head(10)

|    | name | Pred | good |
|----|------|------|------|
| 0 | Markelle Fultz | good | 0.67588 |
| 72 | Dennis Smith | good | 0.543082 |
| 2 | Jayson Tatum | good | 0.460192 |
| 71 | Josh Jackson | average | 0.433689 |
| 16 | Zach Collins | average | 0.342975 |
| 3 | De'Aaron Fox | average | 0.318901 |
| 27 | Jawun Evans | average | 0.311426 |
| 1 | Lonzo Ball | average | 0.299432 |
| 63 | Dedric Lawson | average | 0.259045 |
| 60 | Derrick White | average | 0.243041 |
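
For context, a probability column like "good" would typically come from a classifier's `predict_proba` output, whose columns follow the sorted order of the class labels. The sketch here shows the ranking logic in plain Python; the helper `rank_by_class_probability`, the player names, and the probabilities are made up for illustration.

```python
def rank_by_class_probability(names, probabilities, classes, target="good"):
    """Sort players by their predicted probability of the target class."""
    col = list(classes).index(target)
    pairs = [(name, row[col]) for name, row in zip(names, probabilities)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# scikit-learn keeps class labels in sorted order, so "good" is the last column.
classes = ["average", "bad", "good"]
names = ["Player A", "Player B", "Player C"]
probabilities = [
    [0.50, 0.30, 0.20],
    [0.25, 0.10, 0.65],
    [0.40, 0.15, 0.45],
]

ranking = rank_by_class_probability(names, probabilities, classes)
for name, prob in ranking:
    print(name, prob)
```

Ranking on the "good" probability rather than on the predicted label is what lets players labeled "average" still be ordered against each other.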


### Observations / Insights

- The model agrees with the many media outlets that consider Markelle Fultz the best player in this draft.
- The model also agrees that Jayson Tatum, Josh Jackson, and De'Aaron Fox are top prospects.
- Dennis Smith seems to be an underrated player. Other rankings currently have him at 6 or 7, while the model believes he is the second best player.
- The model isn't as high on Lonzo Ball as a lot of other people are. Most big boards have him at 2. My model has him at 8.
- There could be a few potential sleepers in the draft. Guys like Collins, Lawson, and White are not in the top ten on most big boards, but my rankings include them.

### Recommendations

Here are the recommendations I would make to teams based on my findings:

- Fultz is the obvious number 1 pick. It's safe to go with him if you're lucky enough to draw the 1st pick.
- Teams with top 5 picks should pass on Lonzo Ball. My rankings place seven players ahead of him, so he would be overrated at the 2 spot.
- Whoever picks Dennis Smith at 6, 7, or 8 will be getting a steal if he does end up being the 2nd best player to come out of the draft.
- Teams in the late first round or in the second round should take a shot at White or Lawson. With later draft positions, those guys would be low-risk, high-reward options.

### Next Steps

I believe that the next steps for improving my model all involve making my data better. One route is to expand the scope of the model to include players who did not go to college. For players who played overseas, I would need to figure out a way to normalize their stats against college stats. For players who went straight to the NBA after high school, I would need to find a way to normalize their high school stats as well. Doing these things would expand my data and give the model more information to train on. However, I feel that the biggest gains in data will only come from waiting. I learned while doing this project that there really isn't that much data to work with on this topic. Unfortunately, it looks like we will have to wait until enough players have played in the league to provide a dataset large enough to build a very good model on.

When it comes to the variables/features, though, there are some things I could do. For my missing values, I can either look for sources that have the missing data or figure out a good way to impute them. I could also try finding other variables to add to the model. One example of a variable I would like to add is strength of schedule for the players' college careers.