# Week 4: Intro to Modeling + Linear Regression

Welcome back to Week 4 of `Balling with Data`! Today, we'll finally begin **modeling**, the most exciting part of this project! In today's notebook we won't go too in-depth on the theory. Instead, we'll show you some demos, and you'll begin applying those concepts to your own data. Let's get started!

In [3]:
# Standard imports
# If any of these don't work, try doing `pip install _____`, or try looking up the error message.
import numpy as np
import pandas as pd
import json
import time
import os.path
from os import path
import math
import datetime
import unidecode
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split


# So... what is modeling?

We've now reached the most exciting part of the project: modeling! In this part of the project, we'll **predict different stats based on the features we've extracted and cleaned!**

However, to fully understand how this works, we'll need a good sense of **machine learning** and of how different **machine learning models** can help us identify underlying trends in the data that we might not otherwise see! Let's get into a video that introduces most of what we'll be covering today.

## Video Intro: Machine Learning Basics

In [4]:
from IPython.display import YouTubeVideo
YouTubeVideo('ukzFI9rgwfU')

## Additional Resources

Here's a great article briefly explaining many of the different models we'll be using for the project over the next few weeks!

[**All ML models explained in 6 mins**](https://towardsdatascience.com/all-machine-learning-models-explained-in-6-minutes-9fe30ff6776a)

# Now, let's get into Linear Regression.

As our first fundamental step into Machine Learning, we'll learn about **Linear Regression**. Many of you have probably seen it before, and it can take on a variety of different forms.

You might know it as something of the following:

$$y = ax + b$$

where the independent variable `x` is multiplied by a constant `a` (the slope) and added to a constant `b` (the intercept), both chosen to minimize the **least-squares error** between the line and the data points. In the scope of this project, **`x` would be our feature values, and `y` would be our prediction for those given values.**

However, the equation above is a little too simple to use in our case. For our project, we're working with multidimensional data, with a variety of different input features.

**So, how can we translate this into a higher dimension using Linear Algebra? (Hint: you guys might've seen this in different classes before)**
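
One common way to write this down is sketched here. The symbols $n$ (number of players), $d$ (number of features), and the leading column of ones are notation we're assuming for illustration. Each player's prediction is a weighted sum of their features, and stacking all players into a matrix gives a single matrix equation:

$$\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_d x_{id} \quad \text{for } i = 1, \dots, n \qquad\Longrightarrow\qquad \hat{y} = X\beta$$

Here $X$ is an $n \times (d+1)$ matrix whose first column is all ones (so that $\beta_0$ acts as the intercept $b$), and $\beta$ is the vector of $d+1$ weights that replaces the single slope $a$.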

Now that we know how to translate this, let's take a look at the most common/accurate way to find the "best-fitting line" for all the points.

## Ordinary Least Squares

The most common way to find the best-fit line for all the points is with **ordinary least squares**. With something like $y = ax + b$, we want to calculate the ideal slope ($a$) and intercept ($b$) in order to find the line that minimizes the **squared error between the line and the points**. Now that we've translated that equation to account for more variables and dimensions, let's take a look at the Linear Algebra way of viewing Linear Regression.

In [5]:
# A little image of least-squares!
from IPython.display import Image
Image(url="https://static-assets.imageservice.cloud/2873957/choosing-the-correct-type-of-regression-analysis-statistics-by-jim.jpg", width=500)

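We start from a system of linear equations, which usually has no exact solution when there are more data points than features (here $A$ plays the role of our feature matrix $X$, $x$ of the weights $\beta$, and $b$ of the targets $y$):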

$$Ax = b$$

Ordinary Least Squares picks the weights that minimize the sum of squared residuals (the squared error):

$$\hat{\beta}_{OLS} = \underset{\beta}{\operatorname{argmin}} \Vert X\beta - y\Vert_2^2$$

Define the exact loss function, then take its derivative to find the optimal weights (coefficients) for our model:

$$ L(\beta) = \sum_{i=1}^{n}(x_i^T\beta - y_i)^2 = \Vert X\beta - y\Vert_2^2$$

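Rewrite this expression as a product and expand it (FOIL).

$$\begin{aligned}
L(\beta) &= (X\beta - y)^T(X\beta - y) \\
&= (X\beta)^T(X\beta) - (X\beta)^Ty - y^TX\beta + y^Ty \\
&= \beta^TX^TX\beta - 2\beta^TX^Ty + y^Ty
\end{aligned}$$

The two middle terms are scalars and transposes of each other, so they are equal and combine into $-2\beta^TX^Ty$.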

Take the gradient with respect to $\beta$ and set it to zero to find the optimal values for our weights/coefficients.

$$\begin{aligned}
\nabla_\beta L(\beta) &= \nabla_\beta (\beta^TX^TX\beta - 2\beta^TX^Ty + y^Ty) \\
&= \nabla_\beta(\beta^TX^TX\beta) - 2\nabla_\beta(\beta^TX^Ty) + \nabla_\beta(y^Ty) \\
&= 2X^TX\beta - 2X^Ty = 0
\end{aligned}$$

Therefore, assuming $X^TX$ is invertible,

$$ X^TX\hat{\beta}_{OLS} = X^Ty$$

$$ \hat{\beta}_{OLS} = (X^TX)^{-1}X^Ty$$

**Now, we have proven that we can find the analytical solution to the Least Squares problem, and we can find the best weights for our model to minimize the squared error between our predictions and their actual values.**
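
To make the closed-form expression concrete, here is a minimal sketch on synthetic data. The data, the "true" weights, and all variable names are made up for illustration and are not the player dataset. It computes $(X^TX)^{-1}X^Ty$ directly and cross-checks the result against NumPy's built-in least-squares solver.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the player dataset).
rng = np.random.default_rng(0)
n, d = 200, 3
features = rng.normal(size=(n, d))

# Prepend a column of ones so the first weight acts as the intercept.
X = np.column_stack([np.ones(n), features])
true_beta = np.array([1.0, 2.0, -3.0, 0.5])  # made-up weights used to generate y
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Closed-form OLS solution: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check with NumPy's least-squares solver, which avoids the explicit inverse.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

The two printed vectors should agree to within floating-point error and sit close to the made-up `true_beta`, since the added noise is small.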

**Side note:** Taking the inverse of $X^TX$ is not always computationally feasible or even possible (for example, when some features are redundant, $X^TX$ is singular). In these cases, we might need to rely on algorithms like **gradient descent** to find the optimal values for $\beta$.

However, for the purposes of this part of the project, this equation should be computable, and you should be able to easily derive the analytical solution!
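
For reference, here is a rough sketch of what gradient descent on the same squared-error loss could look like. The synthetic data, the learning rate, and the number of steps are all arbitrary assumptions, and the gradient is divided by `n` (the mean rather than the sum of squared errors) so the step size does not depend on the dataset size.

```python
import numpy as np

# Synthetic data (made up for illustration).
rng = np.random.default_rng(1)
n, d = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.05, size=n)

beta = np.zeros(X.shape[1])  # start with all weights at zero
learning_rate = 0.1          # assumed step size
n_steps = 2000               # assumed number of iterations

for _ in range(n_steps):
    # Gradient of the mean squared error: (2 / n) * X^T (X beta - y)
    gradient = 2 * X.T @ (X @ beta - y) / n
    beta -= learning_rate * gradient

# Compare with the analytical solution of the normal equations.
beta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)
print(beta_closed_form)
```

On well-conditioned data like this, the iterates should end up very close to the analytical solution. On real data, the learning rate usually needs tuning, and normalizing the features first tends to help.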

## Your Turn!

Try to derive the analytical solution using our data! **Remember, `X` is our feature matrix, `y` holds our actual values, and $\hat{\beta}$ holds the weights for our model.** Feel free to ask us any questions about this. We know we're kind of just laying this on you guys, but we're sure that you'll be able to pick it up!

In [6]:
### POTENTIALLY USEFUL CLASS, if you end up using .apply for normalizing the dataframe. ###

# Takes advantage of Pandas's apply method
class Denormalize():
    """
    Stores variables to denormalize the different normalized columns later.
    """
    def __init__(self):
        self.means = {}
        self.stds = {}
        
    def add_col(self, col):
        self.means[col.name] = np.mean(col)
        self.stds[col.name] = np.std(col)
    
    def get_means(self):
        return self.means
    
    def get_stds(self):
        return self.stds

**Some potentially useful things to know/resources**:
1. Matrix multiplication can be done with the `@` operator.
2. You may want to look into `np.linalg.inv` if you're stuck.
3. Remember, we need to make sure our features/predictions are normalized! Try to remember what is required in order for us to normalize the columns, and then un-normalize our predictions once we're done with the regression. (Hint: the previous notebook might've had some info about it)
4. Look into how pandas' `.apply` method works, and see what's printed out when you call it on a dataframe!
5. Look above at the `Denormalize` class. It can be useful for keeping track of different means and standard deviations, so you know how to convert them back when it comes time to give raw predictions!
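
As a hedged illustration of hints 3 and 4, the sketch below z-scores a tiny dataframe with `.apply` and then undoes the transform. The four rows are copied from the first four players in the `data.head()` preview. The helper `normalize` and the plain dictionaries are hypothetical stand-ins for whatever bookkeeping you choose (such as the `Denormalize` class).

```python
import pandas as pd

# Two columns for the first four players in the data.head() preview.
toy = pd.DataFrame({
    "NCAAB_assists": [89.0, 171.0, 36.0, 15.0],
    "NBA_points": [797.0, 8.0, 328.0, 0.0],
})

means, stds = {}, {}  # remembered per column so we can denormalize later

def normalize(col):
    """Z-score one column and store its mean and (population) std."""
    means[col.name] = col.mean()
    stds[col.name] = col.std(ddof=0)
    return (col - means[col.name]) / stds[col.name]

# DataFrame.apply passes each column to `normalize` as a Series.
normalized = toy.apply(normalize)

# Denormalize: multiply by the stored std and add back the stored mean.
restored_points = normalized["NBA_points"] * stds["NBA_points"] + means["NBA_points"]

print(normalized)
print(restored_points)
```

The same "multiply by the std, add the mean" step is what turns normalized predictions back into raw stat lines.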

Once you've made some headway on the model, try to see how you can make it more accurate (for example, by plotting different features)! See which ones might be worse for our model (ones with more outliers), and see if some additional **feature selection** could help your model become more accurate!

In [7]:
# Loading in data here. The name of your data might be different, replace it below if this cell doesn't work!
data = pd.read_csv("data/player_data_final.csv")
data.head()

Unnamed: 0,name,NCAAB_assists,NCAAB_blocks,NCAAB_field_goal_attempts,NCAAB_field_goal_percentage,NCAAB_field_goals,NCAAB_free_throw_attempt_rate,NCAAB_free_throw_attempts,NCAAB_free_throw_percentage,NCAAB_free_throws,...,NCAAB_two_point_percentage,NCAAB_win_shares,NBA_assists,NBA_blocks,NBA_points,NBA_steals,NBA_total_rebounds,Center,Forward,Guard
0,Landry Fields,89.0,25.0,506.0,0.49,248.0,0.508,257.0,0.696,179.0,...,0.521,6.0,155.0,17.0,797.0,80.0,521.0,0,0,1
1,Andy Rautins,171.0,8.0,297.0,0.438,130.0,0.273,81.0,0.815,66.0,...,0.571,4.9,3.0,0.0,8.0,1.0,1.0,0,0,1
2,Patrick Patterson,36.0,51.0,374.0,0.575,215.0,0.348,130.0,0.692,90.0,...,0.626,7.0,41.0,37.0,328.0,17.0,200.0,0,1,0
3,Gani Lawal,15.0,49.0,325.0,0.529,172.0,0.683,222.0,0.572,127.0,...,0.531,4.1,0.0,0.0,0.0,0.0,0.0,0,1,0
4,Cole Aldrich,31.0,125.0,265.0,0.562,149.0,0.6,159.0,0.679,108.0,...,0.562,5.9,4.0,7.0,18.0,5.0,35.0,1,0,0


In [8]:
### SANDBOX AREA FOR FINDING THE ANALYTICAL SOLUTION TO THE DIFFERENT WEIGHTS

weights = ...

In [1]:
# Now, using the weights, predict the different stats of the training data, and denormalize it so we can see
# the different raw predictions of rookie stat lines!
norm_preds = ...
raw_preds = ...

## Things to Note:

Take a look at our different predictions, particularly for `Jordan Loyd`. What do you see that isn't right here?

# Using Sklearn

Now, using Sklearn, we can do the same thing we've done up above with just a few lines of code! Start by importing `LinearRegression` and reading up on the documentation for how you can get it working!

With sklearn's `LinearRegression`, we can set a number of different parameters on our model (such as `fit_intercept`) that change how it fits the data!

Take a look here at the documentation for a closer look: [Linear Regression documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

Additionally, here are a few lines to get you started!

In [26]:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression(fit_intercept=True) # Fitting against an additional intercept column

In [28]:
# Now, fit the linear model to our data, save the different weights, and have it predict the different stats based
# off of the training data!
linear_model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
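
If you're unsure what the fit/predict flow looks like, here is a minimal sketch on synthetic data. The arrays `X_toy` and `y_toy` are made up for illustration and are not the player data. It fits `LinearRegression`, reads the learned weights from `intercept_` and `coef_`, and compares them with the closed-form least-squares solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(150, 3))
y_toy = 4.0 + X_toy @ np.array([1.5, -2.0, 0.0]) + rng.normal(scale=0.1, size=150)

# Fit the model, then predict on the same (training) data.
model = LinearRegression(fit_intercept=True)
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)

# Closed-form solution with an explicit column of ones for the intercept.
X_aug = np.column_stack([np.ones(len(X_toy)), X_toy])
beta_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_toy)

print(model.intercept_, model.coef_)
print(beta_hat)
```

With `fit_intercept=True`, sklearn handles the intercept for you, so `intercept_` should match the first entry of `beta_hat` and `coef_` the remaining entries.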

Congrats! You've now **implemented linear regression from scratch and used sklearn's `LinearRegression` class for easier predictive use in the future!**

# Saving Our Predictions

In [None]:
# Just for future reference, we're going to save our predictions here so that we can compare them with other models in the future!
raw_preds.to_csv("lin_reg_preds.csv")