# LASI 2021 Machine Learning Workshop

## Multiple Linear Regression

## Step 1: Identify target and feature variables

**Target Variable**: The variable I am trying to predict.

**Feature Variables**: The variables I will use to make the prediction.

For regression, the target variable is *continuous*.

We will represent the target variable as $y$ and the feature variables as $x_1$, $x_2$, $\ldots$, $x_n$.

In a linear regression, we are trying to find a set of coefficients such that:

$$ y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n $$
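
To make "finding a set of coefficients" concrete, here is a minimal sketch that uses NumPy's least-squares solver on made-up data. The values in `beta_true`, the noise level, and all variable names are assumptions chosen for illustration; they are not part of the workshop dataset.

```python
import numpy as np

# Synthetic data: two features and a target built from known coefficients.
rng = np.random.default_rng(0)
n_obs = 200
x1 = rng.normal(size=n_obs)
x2 = rng.normal(size=n_obs)

# Hypothetical "true" coefficients (beta_0, beta_1, beta_2) for the demo.
beta_true = np.array([1.0, 2.0, -3.0])
noise = rng.normal(scale=0.1, size=n_obs)
y = beta_true[0] + beta_true[1] * x1 + beta_true[2] * x2 + noise

# Design matrix: a column of ones for the intercept, then one column per feature.
X = np.column_stack([np.ones(n_obs), x1, x2])

# Least squares picks the coefficients that minimize the squared error ||y - X b||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The estimates in `beta_hat` should land close to `beta_true`, which is the sense in which $y$ is only approximately equal to the linear combination of the features.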

## Step 2: Plug the target and feature variables into the formula

In [None]:
# plug in your column names for y and x; note the formula is just a Python string
formula = "y ~ x1 + x2 + ... + xn"


## Step 3: Run the regression

In [None]:
# run the regression; smf (statsmodels.formula.api) and df are defined in the Example section below
model = smf.ols(formula, data=df).fit()

## Step 4: Retrieve the parameters or coefficients

In [None]:
# retrieve the parameters or coefficients
model.params

## Step 5: Obtain a summary report of the regression

In [None]:
# summary of regression
model.summary()
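
Steps 2 through 5 can be tried end to end before loading any real data. The sketch below assumes a small synthetic DataFrame named `demo` with columns `y`, `x1`, and `x2`; those names and the coefficients used to generate the data are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumed values, not the workshop dataset).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
demo["y"] = 5.0 + 1.5 * demo["x1"] - 2.0 * demo["x2"] + rng.normal(scale=0.5, size=100)

# Step 2: the formula is a plain string naming the target and the features.
formula = "y ~ x1 + x2"

# Step 3: fit an ordinary least squares model.
demo_model = smf.ols(formula, data=demo).fit()

# Step 4: the fitted coefficients, indexed by term name (the intercept is added automatically).
print(demo_model.params)

# Step 5: the full summary report.
print(demo_model.summary())
```

`params` is a pandas Series, so a single coefficient can be read by name, for example `demo_model.params["x1"]`.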

# Example

This dataset contains average SAT scores by US state for the years 2005 to 2015, along with GPA and family income data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

In [None]:
# load data set
df = pd.read_csv("data/school_scores.csv")

In [None]:
df.head()

In [None]:
# Select a subset of the dataset. We don't want all the columns

df = df[['Year', 'State.Code','Total.Math','Total.Verbal',
         'Academic Subjects.Mathematics.Average GPA','Academic Subjects.English.Average GPA',
         'Academic Subjects.Mathematics.Average Years','Academic Subjects.English.Average Years',
         'Total.Test-takers']]

In [None]:
# shorten the names
df.columns=['Year','State','MathSAT','VerbalSAT',
            'MathGPA','EnglishGPA',
            'MathYrs','EnglishYrs',
            'TotalTested']

## Run First Regression

**First Regression:** At this point, let's go back up to Steps 2 through 5 and run a regression of ``MathSAT ~ MathGPA``. We will also try ``VerbalSAT ~ EnglishGPA``.

In [None]:
plt.plot(df.MathGPA, df.MathSAT, '+')
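
As a sketch of what the first regression looks like in code, the block below fits ``MathSAT ~ MathGPA`` on a made-up stand-in DataFrame called `toy`, since the workshop CSV is not reproduced here. The numbers used to generate `toy` are assumptions; with the real data, pass `df` instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for df with the two columns this regression needs.
rng = np.random.default_rng(2)
toy = pd.DataFrame({"MathGPA": rng.uniform(2.5, 4.0, size=60)})
toy["MathSAT"] = 200 + 90 * toy["MathGPA"] + rng.normal(scale=15, size=60)

# One feature, one target: a simple linear regression.
math_model = smf.ols("MathSAT ~ MathGPA", data=toy).fit()

print(math_model.params)    # intercept and slope
print(math_model.rsquared)  # share of the variance in MathSAT explained by MathGPA
```

The same pattern applies to ``VerbalSAT ~ EnglishGPA``: only the formula string changes.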

## Run Second Regression

**Second Regression:** We will run a second regression with the formula ``TotalSAT ~ AvgGPA + TotalYrs``, where ``TotalSAT`` is the target variable. These columns do not exist yet, so we need to create them first.

In [None]:
df = df.assign(TotalSAT = lambda x: x['MathSAT'] + x['VerbalSAT'])
df = df.assign(TotalYrs = lambda x: x['MathYrs'] + x['EnglishYrs'])
df = df.assign(AvgGPA = lambda x: (x['MathGPA'] + x['EnglishGPA'])/2)

In [None]:
df.head()
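
With the new columns in place, the second regression follows the same steps as before. The sketch below assumes a synthetic DataFrame `toy` that already has `AvgGPA`, `TotalYrs`, and `TotalSAT` columns; the generating values are invented for illustration, and with the real data you would pass `df`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in with the three columns used by the formula.
rng = np.random.default_rng(3)
n = 80
toy = pd.DataFrame({
    "AvgGPA": rng.uniform(2.8, 3.8, size=n),
    "TotalYrs": rng.uniform(6.0, 8.0, size=n),
})
toy["TotalSAT"] = (
    400 + 150 * toy["AvgGPA"] + 20 * toy["TotalYrs"] + rng.normal(scale=20, size=n)
)

# Two features on the right-hand side make this a multiple linear regression.
second_model = smf.ols("TotalSAT ~ AvgGPA + TotalYrs", data=toy).fit()
print(second_model.params)
```

Each coefficient is read as the expected change in `TotalSAT` for a one-unit change in that feature, holding the other feature fixed.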

## Run Third Regression

**Third Regression:** For the third regression we will create a new categorical column called ``Region``. Then we will run a regression for ``TotalSAT ~ AvgGPA + TotalYrs + C(Region)``.

In [None]:
southeast = ["WV","VA","KY","TN","NC","SC","GA","AL","AR","LA","FL","MS"]
southwest  = ["TX","OK","NM","AZ"]
west =["CO","WY","MT","ID","WA","OR","UT","NV","CA","AK","HI"]
northeast =["MA","NJ","NY","CT","RI","ME","VT","NH","PA","DE","MD"]
midwest = ["OH","IN","MI","MN","IL","MO","WI","IA","KS","NE","SD","ND"]


def create_region(state):
    if state in southeast:
        region = "south"
    elif state in northeast:
        region = "northeast"
    elif state in west:
        region = "west"
    elif state in southwest:
        region = "southwest"
    elif state in midwest:
        region = "midwest"
    else:
        region = "other"
    return region
    

In [None]:
df['Region'] = df['State'].apply(create_region)
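
Once `Region` exists, wrapping it in `C(...)` tells the formula to treat it as categorical. The sketch below runs the third regression on a made-up DataFrame `toy`; the region offsets and other generating values are assumptions for illustration only, and with the real data you would pass `df`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in data; region labels match those returned by create_region.
rng = np.random.default_rng(4)
regions = ["midwest", "northeast", "south", "southwest", "west"]
n = 150
toy = pd.DataFrame({
    "AvgGPA": rng.uniform(2.8, 3.8, size=n),
    "TotalYrs": rng.uniform(6.0, 8.0, size=n),
    # Cycle through the labels so every region appears in the data.
    "Region": [regions[i % len(regions)] for i in range(n)],
})

# Hypothetical per-region shifts in the target, invented for this demo.
offsets = {"midwest": 0, "northeast": -30, "south": -10, "southwest": 5, "west": 15}
toy["TotalSAT"] = (
    400
    + 150 * toy["AvgGPA"]
    + 20 * toy["TotalYrs"]
    + toy["Region"].map(offsets)
    + rng.normal(scale=20, size=n)
)

# C(Region) expands into indicator terms, one per region except a reference level.
third_model = smf.ols("TotalSAT ~ AvgGPA + TotalYrs + C(Region)", data=toy).fit()
print(third_model.params)
```

One region is absorbed into the intercept as the reference level, so five regions produce four region coefficients, each measured relative to that reference.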