# Week 4: Intro to Modeling + Linear Regression

Welcome back to Week 4 of `Balling with Data`! Today, we'll finally begin **modeling**, the most exciting part of this project! In today's notebook, we won't go too in-depth into the theory. Instead, we'll show you some demos, and you'll begin applying those concepts to your own data. Let's get started!

In [1]:
# Standard imports
# If any of these don't work, try doing `pip install _____`, or try looking up the error message.
import numpy as np
import pandas as pd
import json
import time
import os.path
from os import path
import math
import datetime
import unidecode
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split


# So... what is modeling?

We've reached the most exciting part of the project: modeling! In this part of the project, we'll **predict different stats based on the features we've extracted and cleaned!**

However, in order for us to completely understand how this works, we'll need a good sense of **machine learning**, and of how different **machine learning models** can help us identify underlying trends in the data that we might not typically be able to see! Let's get into a video that introduces most of what we'll be covering today.

## Video Intro: Machine Learning Basics

In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo('ukzFI9rgwfU')

## Additional Resources

Here's a great article briefly explaining a bunch of the different models we'll be using for the project over the next few weeks!

[**All ML models explained in 6 mins**](https://towardsdatascience.com/all-machine-learning-models-explained-in-6-minutes-9fe30ff6776a)

# Now, let's get into Linear Regression.

As our first fundamental step into Machine Learning, we'll learn about **Linear Regression**. Many of you have probably seen it before, and it can take on a variety of different forms.

You might know it as something of the following:

$$y = ax + b$$

where the independent variable `x` is multiplied by a slope `a` and offset by an intercept `b`, both chosen to minimize the **least-squares error** between the line and the data points. In the scope of this project, **`x` would be our feature values, and `y` would be our prediction for those values.**

However, the equation above is a little too simple to use in our case. For our project, we're working with multidimensional data, with many different input features.

**So, how can we translate this into a higher dimension using Linear Algebra? (Hint: you guys might've seen this in different classes before)**
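
One common way to write this, as a sketch: the symbols $n$ (number of players), $d$ (number of features), and $\beta$ (the weight vector) are notation choices made here to match the OLS section that follows. Each player's prediction is a weighted sum of that player's features, and stacking all players into a matrix turns the whole thing into a single matrix-vector product:

$$\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_d x_{id} = x_i^T\beta \quad\Longrightarrow\quad \hat{y} = X\beta$$

Here $X$ has one row per player and $d + 1$ columns, where the first column is all ones so that $\beta_0$ plays the role of the intercept `b`, and the remaining entries of $\beta$ play the role of the slope `a`, one per feature.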

Now that we know how to translate this, let's take a look at the most common/accurate way to find the "best-fitting line" for all the points.

## Ordinary Least Squares

The most common way to find the best-fit line for all the points is with **ordinary least squares**. What you might've seen in the past with something like $y = ax + b$ is that we want to calculate the ideal slope (`a`) and intercept (`b`) in order to find the line that minimizes the **squared error between the line and the points**. Now that we've translated that equation to account for more variables and dimensions, let's take a look at the Linear Algebra way of viewing Linear Regression.

In [3]:
# A little image of least-squares!
from IPython.display import Image
Image(url="https://static-assets.imageservice.cloud/2873957/choosing-the-correct-type-of-regression-analysis-statistics-by-jim.jpg", width=500)

Solving a system of linear equations:

$$Ax = b$$

Ordinary Least Squares to minimize the sum of squared residuals (or the squared error):

$$\hat{\beta}_{OLS} = \underset{\beta}{\operatorname{argmin}} \Vert X\beta - y\Vert_2^2$$

Define the exact loss function, then take its derivative to find the optimal weights (coefficients) for our model:

$$ L(\beta) = \sum_{i=1}^{n}(x_i^T\beta - y_i)^2 = \Vert X\beta - y\Vert_2^2$$

Rewrite this expression and FOIL.

$$\begin{aligned}
L(\beta) &= (X\beta - y)^T(X\beta - y) \\
&= (X\beta)^T(X\beta) - (X\beta)^Ty - y^TX\beta + y^Ty \\
&= \beta^TX^TX\beta - 2\beta^TX^Ty + y^Ty
\end{aligned}$$

Take the gradient with respect to $\beta$ and set it to zero to find the optimal values for our weights/coefficients.

$$\begin{aligned}
\nabla_\beta L(\beta) &= \nabla_\beta (\beta^TX^TX\beta - 2\beta^TX^Ty + y^Ty) \\
&= \nabla_\beta(\beta^TX^TX\beta) - 2\nabla_\beta(\beta^TX^Ty) + \nabla_\beta(y^Ty) \\
&= 2X^TX\beta - 2X^Ty = 0
\end{aligned}$$

Therefore,

$$ X^TX\hat{\beta}_{OLS} = X^Ty$$

$$ \hat{\beta}_{OLS} = (X^TX)^{-1}X^Ty$$

**Now, we have proven that we can find the analytical solution to the Least Squares problem, and we can find the best weights for our model to minimize the squared error between our predictions and their actual values.**
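
To make the closed-form result concrete, here is a minimal sketch on synthetic data (not our player data). The names `features`, `true_beta`, and `beta_hat` are made up for this illustration. It solves the normal equations $X^TX\beta = X^Ty$ directly and cross-checks the answer against NumPy's built-in least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 samples, 3 features, known weights.
n, d = 200, 3
features = rng.normal(size=(n, d))
X = np.column_stack([np.ones(n), features])  # leading column of ones = intercept
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_beta + 0.01 * rng.normal(size=n)  # small noise on top of X @ beta

# Normal equations: solve (X^T X) beta = X^T y rather than forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

Using `np.linalg.solve` instead of `np.linalg.inv` gives the same $\hat{\beta}_{OLS}$ when $X^TX$ is invertible, and it is generally the more numerically stable way to evaluate the formula.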

**Side note:** Taking the inverse of $X^TX$ is not always computationally feasible or even possible (for example, when $X^TX$ is singular or very large). In these cases, we might need to rely on algorithms like **gradient descent** to find the optimal values for $\beta$.

However, for this part of the project, this equation should be computable, and you should be able to easily derive the analytical solution!
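
For reference, the gradient descent alternative mentioned in the side note can be sketched as follows. This is an illustration on synthetic data under assumed settings: the function name `gradient_descent`, the learning rate `lr`, and the step count `n_steps` are choices made for this example, not values from the project. It uses the gradient $2X^T(X\beta - y)$ derived above, divided by the number of rows so the step size does not depend on the dataset size (this rescaling does not change the minimizer).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): intercept column plus 2 features.
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.normal(size=n)

def gradient_descent(X, y, lr=0.1, n_steps=2000):
    """Minimize the mean squared error of X @ beta - y with plain gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * X.T @ (X @ beta - y) / len(y)  # gradient of the mean squared error
        beta -= lr * grad
    return beta

beta_gd = gradient_descent(X, y)

# Compare with the direct least-squares solution.
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_gd, beta_exact, atol=1e-6))
```

The learning rate matters: too large and the updates diverge, too small and convergence is slow. Standardized features, as used in this notebook, tend to make a single learning rate work reasonably well across all weights.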

## Your Turn!

Try to derive the analytical solution using our data! **Remember, `X` is our feature matrix, `y` holds our actual values, and $\hat{\beta}$ holds the weights for our model.** Feel free to ask us any questions about this. We know we're kind of just laying this on you guys, but we're sure that you'll be able to pick it up!

In [4]:
### POTENTIALLY USEFUL CLASS, if you end up using .apply for normalizing the dataframe. ###

# Takes advantage of Pandas's apply method
class Denormalize():
    """
    Stores variables to denormalize the different normalized columns later.
    """
    def __init__(self):
        self.means = {}
        self.stds = {}
        
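    def add_col(self, col):
        # Use pandas' own mean/std (sample std, ddof=1) so the stored values
        # match what `col.std()` uses when the column is normalized.
        self.means[col.name] = col.mean()
        self.stds[col.name] = col.std()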
    
    def get_means(self):
        return self.means
    
    def get_stds(self):
        return self.stds

**Some potentially useful things to know/resources**:
1. Matrix multiplication can be done with the `@` operator.
2. You may want to look into `np.linalg.inv` if you're stuck.
3. Remember, we need to make sure our features/predictions are normalized! Try to remember what is required in order for us to normalize the columns, and then un-normalize our predictions once we're done with the regression. (Hint: the previous notebook might've had some info about it)
4. Look into how pandas' `.apply` method works, and see what's printed out when you call it on a dataframe!
5. Look above at the `Denormalize` class. It can be useful for keeping track of different means and standard deviations, so you know how to convert them back when it comes time to give raw predictions!
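
As a rough illustration of the normalize / un-normalize round trip from hints 3 to 5, here is a minimal sketch on a toy dataframe. The frame `toy` and its column names are hypothetical stand-ins, not the project data. The key point is that the same means and standard deviations must be used in both directions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a couple of stat columns.
toy = pd.DataFrame({"points": [10.0, 20.0, 30.0, 40.0],
                    "assists": [1.0, 3.0, 2.0, 6.0]})

means = toy.mean()  # per-column mean
stds = toy.std()    # per-column sample standard deviation (ddof=1)

normalized = (toy - means) / stds     # z-scores: mean 0, std 1 per column
restored = normalized * stds + means  # undo with the SAME means and stds

print(normalized)
print(np.allclose(restored, toy))
```

Note that pandas' `.std()` defaults to `ddof=1` while `np.std` defaults to `ddof=0`, so mixing the two when normalizing and un-normalizing introduces a small mismatch.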

Once you've made some headway into creating the model, try and see how you can make it more accurate (plot different features)! See which ones might be worse for our model (ones with more outliers), and see if there's some additional **feature selection** that could help your model become more accurate!
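
One way to judge whether a set of features helps is to hold out part of the data and compare the error on it. The sketch below is an assumption-laden illustration on synthetic data: the helper `holdout_rmse` and the column choices are made up for this example. It uses `train_test_split` (already imported at the top of this notebook) and the root mean squared error (RMSE) as the score.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic data (hypothetical): intercept + 4 features, of which only
# the first two actually influence y; the last two are pure noise.
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X @ np.array([1.0, 2.0, -1.0, 0.0, 0.0]) + 0.5 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def holdout_rmse(cols):
    """Fit least squares on the chosen columns and score on the held-out rows."""
    beta, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    resid = X_te[:, cols] @ beta - y_te
    return float(np.sqrt(np.mean(resid ** 2)))

rmse_all = holdout_rmse([0, 1, 2, 3, 4])  # every feature
rmse_selected = holdout_rmse([0, 1, 2])   # only the informative features
rmse_missing = holdout_rmse([0, 3, 4])    # informative features dropped

print(rmse_all, rmse_selected, rmse_missing)
```

Dropping the informative features makes the held-out error much larger, while dropping the noise features changes it very little. The same comparison can be run on real feature subsets.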

In [5]:
# Loading in data here. The name of your data might be different, replace it below if this cell doesn't work!
data = pd.read_csv("data/player_data_final.csv")
data.head()

Unnamed: 0,name,NCAAB_assists,NCAAB_blocks,NCAAB_field_goal_attempts,NCAAB_field_goal_percentage,NCAAB_field_goals,NCAAB_free_throw_attempt_rate,NCAAB_free_throw_attempts,NCAAB_free_throw_percentage,NCAAB_free_throws,...,NCAAB_two_point_percentage,NCAAB_win_shares,NBA_assists,NBA_blocks,NBA_points,NBA_steals,NBA_total_rebounds,Center,Forward,Guard
0,Landry Fields,89.0,25.0,506.0,0.49,248.0,0.508,257.0,0.696,179.0,...,0.521,6.0,155.0,17.0,797.0,80.0,521.0,0,0,1
1,Andy Rautins,171.0,8.0,297.0,0.438,130.0,0.273,81.0,0.815,66.0,...,0.571,4.9,3.0,0.0,8.0,1.0,1.0,0,0,1
2,Patrick Patterson,36.0,51.0,374.0,0.575,215.0,0.348,130.0,0.692,90.0,...,0.626,7.0,41.0,37.0,328.0,17.0,200.0,0,1,0
3,Gani Lawal,15.0,49.0,325.0,0.529,172.0,0.683,222.0,0.572,127.0,...,0.531,4.1,0.0,0.0,0.0,0.0,0.0,0,1,0
4,Cole Aldrich,31.0,125.0,265.0,0.562,149.0,0.6,159.0,0.679,108.0,...,0.562,5.9,4.0,7.0,18.0,5.0,35.0,1,0,0


In [6]:
### SANDBOX AREA FOR FINDING THE ANALYTICAL SOLUTION TO THE DIFFERENT WEIGHTS

weights = ...

### Answer

In [7]:
def normalize_and_return(col, obj):
    """
    Calls normalize on the dataframe but also returns the different parameters to un-normalize later.
    """
    
    def normalize(col):
        """
        Normalizes a column to have a mean of 0 and a standard deviation of 1.
        """
        return (col - col.mean()) / col.std()
    
    obj.add_col(col)
    return normalize(col)

In [8]:
# NBA column names
NBA_cols = ['NBA_assists', 'NBA_blocks', 'NBA_points', 'NBA_steals', 'NBA_total_rebounds']

In [9]:
# Setting up X_train
denorm_feats = Denormalize()
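# Drop the targets (and the leftover CSV index column, if present) so they aren't used as features
orig_X_train = data.drop(columns=NBA_cols + ['Unnamed: 0'], errors='ignore').set_index('name')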
X_train = orig_X_train.apply(lambda x: normalize_and_return(x, denorm_feats))
X_train.insert(0, "intercept", np.ones(X_train.shape[0])) # Inserting column for intercept (b)
x_means, x_stds = denorm_feats.get_means(), denorm_feats.get_stds()

In [10]:
# Setting up y_train
orig_y_train = data[NBA_cols]
denorm_y = Denormalize()
y_train = orig_y_train.apply(lambda x: normalize_and_return(x, denorm_y))
y_means, y_stds = denorm_y.get_means(), denorm_y.get_stds() # This will be needed later for interpreting our results.

In [13]:
# Finding all the different weights with pseudoinverse
weights = np.linalg.pinv(X_train) @ y_train.values

In [14]:
weights

array([[-1.90819582e-16, -1.04777298e-15, -6.26235175e-16,
        -2.77555756e-17, -7.63278329e-16],
       [ 3.18739737e-01, -5.85552040e-02,  1.80297252e-03,
         5.35250609e-03, -1.19412740e-02],
       [ 1.92474471e-03,  3.11467521e-01, -2.25661157e-03,
        -1.10178825e-02,  2.45891131e-02],
       [-1.30228987e-01,  3.86514655e-02, -1.11543884e-02,
         7.22664435e-02, -8.96268539e-03],
       [ 2.24021237e-01,  2.09863323e-01,  1.84062312e-01,
         1.95396400e-01,  2.42490611e-01],
       [ 3.56391632e-01,  1.80718071e+00,  1.43268738e+00,
        -2.40275969e-01,  1.39933792e+00],
       [-9.65985118e-02,  1.02625009e-01, -2.35999477e-02,
        -4.38882696e-02,  2.95941321e-02],
       [ 6.33291838e-02, -1.69858599e-01, -1.69771895e-01,
         3.73215099e-02, -1.95578917e-02],
       [-8.44026114e-02, -9.40822468e-02, -1.07841832e-01,
        -1.10774558e-01, -1.09047364e-01],
       [ 3.14530052e-01,  8.48193620e-01,  8.28920760e-01,
         9.23953277e-02

In [15]:
# Finding the weights with the normal equations (explicit inverse of X^T X), for comparison
weights1 = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train.values

In [21]:
(weights - weights1) ** 2

Unnamed: 0,0,1,2,3,4
0,1.586112e-30,3.254822e-30,1.041086e-31,1.791596e-31,1.399682e-30
1,0.0113595,0.0002077862,0.01044363,0.006544068,0.002690566
2,0.0001060021,0.001130927,0.0005274519,0.0003132419,0.001556153
3,3.15279,1.243724,3.017618,0.2830427,0.1522279
4,0.003418043,0.05267956,0.003211766,0.0100464,0.03871898
5,10.0793,18.35028,0.5540931,1.486,8.251245
6,0.001606887,2.489117e-05,0.001334189,0.001065703,0.0006490743
7,0.6956785,0.02187844,0.755503,0.4968592,0.4143939
8,0.08130392,3.119561e-05,0.08024963,0.05226676,0.03167803
9,3.982271,1.13257,1.509425,1.551145,0.00838964


In [17]:
np.allclose(weights, weights1)

False
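
A likely (though not verified here) reason the two sets of weights disagree is that the feature matrix is rank-deficient. For instance, one-hot columns such as `Center`, `Forward`, and `Guard` always sum to 1, which duplicates the intercept column. When that happens $X^TX$ is singular, `np.linalg.inv` is unreliable, and the pseudoinverse picks the minimum-norm solution instead. The sketch below reproduces the situation on synthetic data. All names in it are hypothetical, and the explicit `rcond` cutoff is a choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 50
position = rng.integers(0, 3, size=n)
position[:3] = [0, 1, 2]       # make sure every category appears
one_hot = np.eye(3)[position]  # hypothetical Center / Forward / Guard dummies

# Intercept + one numeric feature + the three dummy columns.
X = np.column_stack([np.ones(n), rng.normal(size=n), one_hot])
y = rng.normal(size=n)

# The dummy columns sum to the intercept column, so X has 5 columns but rank 4.
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)

# pinv and lstsq both return the minimum-norm least-squares solution,
# so they agree with each other even though X^T X is not invertible.
beta_pinv = np.linalg.pinv(X, rcond=1e-10) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=1e-10)
print(np.allclose(beta_pinv, beta_lstsq))
```

Checking `np.linalg.matrix_rank(X_train)` against `X_train.shape[1]` is a quick way to test whether this applies to the real data. Dropping one dummy column per category is a common remedy.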

In [13]:
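# Calculating normalized predictions on the training data with the pseudoinverse weights
# (`weights` is a NumPy array here, so it has no `.values`; np.asarray accepts either form)
preds = X_train @ np.asarray(weights)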

In [14]:
# Renaming columns so it makes sense
named_preds = preds.rename(columns=dict(zip(range(5), NBA_cols)))

In [15]:
# Un-normalizing predictions so they make sense
def denorm(col, means, stds):
    """
    Denormalizes the different columns with the different passed in means and stds of cols.
    """
    mu = means[col.name]
    sigma = stds[col.name]
    return col * sigma + mu
    
raw_preds = named_preds.apply(lambda x: denorm(x, y_means, y_stds))

In [None]:
# Now, let the user make queries to see the different predictions.
# Note: this model can produce negative predictions; they are printed as-is below.
def run():
    while True:
        name = input("Who would you like to search for today? ")
        if name == 'STOP':
            print("\n")
            print("Have a wonderful day :)")
            return
        print("\n")
        print("Calculating...")
        time.sleep(1)
        try:
            
            # Get college stats
            college_stats = orig_X_train.loc[name] 
            per_game = college_stats / college_stats['NCAAB_games_played']
            stats_pg = round(per_game[['NCAAB_assists', 'NCAAB_blocks', 'NCAAB_points', 'NCAAB_total_rebounds', 'NCAAB_steals']], 2)
            
            # Get predicted stats
            stats = round(raw_preds.loc[name] / 82, 2).rename(lambda x: x[4:])
            
            # Get actual NBA stats
            actual_stats = round((data.set_index('name')[NBA_cols] / 82).loc[name], 2)
            
            success = (
"""

{} got {} PPG, {} RPG, {} APG, {} SPG, and {} BPG over {} games in college.

{} is projected to get {} points, {} rebounds, {} assists, {} steals, and {} blocks per game over 82 games.

{} actually got {} points, {} rebounds, {} assists, {} steals, and {} blocks per game over 82 games.

""".format(# For actual NCAAB
           name, stats_pg['NCAAB_points'], stats_pg['NCAAB_total_rebounds'], stats_pg['NCAAB_assists'], \
           stats_pg['NCAAB_steals'], stats_pg['NCAAB_blocks'], college_stats['NCAAB_games_played'], \
           
           # For pred NBA
           name, stats['points'], stats['total_rebounds'], stats['assists'], stats['steals'], stats['blocks'], \
           
           # For actual NBA
           name, actual_stats['NBA_points'], actual_stats['NBA_total_rebounds'], actual_stats['NBA_assists'], \
           actual_stats['NBA_steals'], actual_stats['NBA_blocks']
          ))
            
            print(success)
            
        except KeyError:
            error = (
"""\

Sorry! I can't find {} at this time.
Please check the capitalization/spelling, or try a different, more recent player.
You can also check out `raw_preds.index` for a list of all available player names to query.
""".format(name))
            print(error)
            
# Running the infinite while-loop
run()

Who would you like to search for today? Kyrie Irving


Calculating...


Kyrie Irving got 17.45 PPG, 3.36 RPG, 4.27 APG, 1.45 SPG, and 0.55 BPG over 11.0 games in college.

Kyrie Irving is projected to get -1.1 points, 0.16 rebounds, -0.57 assists, 0.02 steals, and 0.18 blocks per game over 82 games.

Kyrie Irving actually got 11.51 points, 2.33 rebounds, 3.35 assists, 0.66 steals, and 0.24 blocks per game over 82 games.


Who would you like to search for today? Trae Young


Calculating...


Trae Young got 27.38 PPG, 3.91 RPG, 8.72 APG, 1.69 SPG, and 0.25 BPG over 32.0 games in college.

Trae Young is projected to get 9.0 points, 2.18 rebounds, 3.43 assists, 0.63 steals, and 0.12 blocks per game over 82 games.

Trae Young actually got 18.89 points, 3.67 rebounds, 7.96 assists, 0.88 steals, and 0.18 blocks per game over 82 games.
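
Negative projections like the ones above cannot be right for counting stats. One simple post-processing option, shown here as a sketch on a hypothetical stand-in for `raw_preds`, is to floor every prediction at 0 with pandas' `clip`:

```python
import pandas as pd

# Hypothetical season totals standing in for `raw_preds`.
toy_preds = pd.DataFrame(
    {"NBA_points": [-90.2, 738.0], "NBA_assists": [-46.7, 281.4]},
    index=["Player A", "Player B"],
)

# Counting stats cannot be negative, so floor every prediction at 0.
clipped = toy_preds.clip(lower=0)
print(clipped)
```

Clipping only hides the symptom. A large negative prediction is still a sign that the linear model fits that player poorly.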




In [None]:
def divide(col):
    return col / orig_X_train['NCAAB_games_played']

In [145]:
college_pg = orig_X_train.apply(divide)

In [149]:
# Top 5 college players
for col in NBA_cols:
    print(college_pg['NCAAB_' + col[4:]].sort_values(ascending=False)[:5])
    print("\n")

name
Scott Machado       9.909091
Kendall Marshall    9.750000
Trae Young          8.718750
Denzel Valentine    7.774194
Lonzo Ball          7.611111
Name: NCAAB_assists, dtype: float64


name
Hassan Whiteside    5.352941
Jarvis Varnado      4.722222
Nerlens Noel        4.416667
Jeff Withey         3.945946
Khem Birch          3.757576
Name: NCAAB_blocks, dtype: float64


name
Jimmer Fredette    28.864865
Trae Young         27.375000
Doug McDermott     26.685714
Erick Green        25.031250
Buddy Hield        25.000000
Name: NCAAB_points, dtype: float64


name
Briante Weber     3.900000
Jevon Carter      3.027027
Marcus Smart      2.870968
Tahjere McCall    2.814815
Iman Shumpert     2.741935
Name: NCAAB_steals, dtype: float64


name
Kenneth Faried    14.514286
Jemerrio Jones    13.235294
Joel Bolomboy     12.575758
Caleb Swanigan    12.457143
Shawn Long        12.088235
Name: NCAAB_total_rebounds, dtype: float64




In [162]:
(data.set_index('name')[NBA_cols] / 82).loc['Landry Fields']

NBA_assists           1.890244
NBA_blocks            0.207317
NBA_points            9.719512
NBA_steals            0.975610
NBA_total_rebounds    6.353659
Name: Landry Fields, dtype: float64

In [128]:
# Top 5
for col in NBA_cols:
    print(raw_preds[col].sort_values(ascending=False)[:5])
    print("\n")

name
Trae Young                 281.436275
Jimmer Fredette            208.915796
Michael Carter-Williams    200.838465
Jerian Grant               188.788761
John Wall                  185.882391
Name: NBA_assists, dtype: float64


name
Nerlens Noel        53.626995
Jarvis Varnado      49.297839
Hassan Whiteside    48.727139
Sim Bhullar         48.487284
Jeff Withey         44.385135
Name: NBA_blocks, dtype: float64


name
Jimmer Fredette    776.522238
Trae Young         738.041159
Buddy Hield        619.972725
Joe Harris         575.422294
Klay Thompson      573.720562
Name: NBA_points, dtype: float64


name
Michael Carter-Williams    66.645186
Jevon Carter               57.141318
Marcus Smart               55.822011
Joe Harris                 55.322237
Sindarius Thornwell        52.151011
Name: NBA_steals, dtype: float64


name
Zach Collins           269.325247
Nerlens Noel           260.110610
Sim Bhullar            254.901327
Cody Zeller            253.314221
Willie Cauley-Stein    

In [111]:
list(raw_preds.index)

['Landry Fields',
 'Andy Rautins',
 'Patrick Patterson',
 'Gani Lawal',
 'Cole Aldrich',
 'Jeremy Lin',
 'Ekpe Udoh',
 'Dexter Pittman',
 'Derrick Caracter',
 'Devin Ebanks',
 'Lazar Hayward',
 'Dominique Jones',
 'Xavier Henry',
 'Greivis Vásquez',
 'Paul George',
 'Lance Stephenson',
 'Jeremy Evans',
 'Derrick Favors',
 'Gordon Hayward',
 'DeMarcus Cousins',
 'Hassan Whiteside',
 'Solomon Alabi',
 'Ed Davis',
 'Craig Brackins',
 'Evan Turner',
 'Al-Farouq Aminu',
 'Eric Bledsoe',
 'Willie Warren',
 'Trevor Booker',
 'Jordan Crawford',
 'John Wall',
 'Greg Monroe',
 'Avery Bradley',
 'Luke Harangody',
 'Luke Babbitt',
 'Armon Johnson',
 'Manny Harris',
 'Samardo Samuels',
 'Quincy Pondexter',
 'Damion James',
 'Ben Uzoh',
 'Sherron Collins',
 'Larry Sanders',
 'Kenneth Faried',
 'Julyan Stone',
 'Cory Joseph',
 'Kawhi Leonard',
 'Malcolm Thomas',
 'Reggie Jackson',
 'Ryan Reid',
 'Alec Burks',
 'Tobias Harris',
 'Darington Hobson',
 'Jon Leuer',
 'Jimmer Fredette',
 'Tyler Honeycutt',

# Using Sklearn

Now, using sklearn, we can do the same thing we've done above with just a few lines of code! Start by importing `LinearRegression` and reading up on the documentation for how you can get it working!

sklearn's `LinearRegression` also exposes a few parameters (such as `fit_intercept`) that control how the model is fit.

Take a look here at the documentation for a closer look: [Linear Regression documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

Additionally, here are a few lines to get you started!

In [26]:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression(fit_intercept=True) # sklearn learns the intercept itself, so no column of ones is needed

In [28]:
# Now, fit the linear model to our data, save the different weights, and have it predict the different stats
# based on the training data!
linear_model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
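
If you get stuck, the general fit / predict pattern looks roughly like the sketch below. It runs on synthetic data, and the names `X_demo`, `Y_demo`, `W_true`, and `demo_model` are made up for this illustration. Adapting it to the player data, including whether to pass normalized features, is left to you.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# Synthetic data (hypothetical): 3 features predicting 2 targets at once.
n = 200
X_demo = rng.normal(size=(n, 3))  # no column of ones needed
W_true = np.array([[1.0, 0.0], [0.5, -2.0], [0.0, 3.0]])
Y_demo = 4.0 + X_demo @ W_true + 0.01 * rng.normal(size=(n, 2))

demo_model = LinearRegression(fit_intercept=True)
demo_model.fit(X_demo, Y_demo)

learned_weights = demo_model.coef_          # shape (n_targets, n_features)
learned_intercepts = demo_model.intercept_  # shape (n_targets,)
demo_preds = demo_model.predict(X_demo)     # shape (n_samples, n_targets)

print(learned_weights.shape, learned_intercepts.shape, demo_preds.shape)
```

Because `fit_intercept=True` already handles the intercept, a feature matrix passed to `fit` would normally leave out the manual `intercept` column used in the from-scratch version.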

Congrats! You've now **implemented linear regression from scratch and used sklearn's `LinearRegression` class for easier predictive use in the future!**

# Saving Our Predictions

In [None]:
# Just for future reference, we're going to save our predictions here so that we can compare them with other models in the future!
raw_preds.to_csv("lin_reg_preds.csv")

# Silly stuff

In [None]:
for col in raw_preds:
    print(col)
    print(np.argmin(raw_preds[col]))

In [None]:
raw_preds.iloc[468]

In [None]:
orig_X_train.loc['Jordan Loyd']