# Problem Session 3
## More Regression

The problems in this notebook will touch on the content covered in our `Regression` lectures including:
- `Simple Linear Regression`,
- `A First Predictive Modeling Project`,
- `Multiple Linear Regression` and
- `Categorical Variables and Interactions`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")

## Baseball is Back

Our first two problems will deal with the baseball data set we have worked with in the lecture notebooks.

##### 1. Run differential

A common statistic in baseball is the <i>run differential</i>: the difference between the runs a team scores and the runs it allows its opponents to score.

- Load the `baseball.csv` data set,
- Create a `RD` column by computing `R - RA`,
- Make a train test split with $20\%$ in the test set and
- then plot `W` against `RD`. 

Does there appear to be a linear relationship between the two?
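As a rough illustration of the first three steps, here is a minimal sketch on made-up data. The data frame `toy`, its values, and the random seed are hypothetical stand-ins; only the column names `R`, `RA`, `W` and `RD` come from the problem statement. It assumes `scikit-learn` is installed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for baseball.csv (values are simulated, not real data)
rng = np.random.default_rng(440)
toy = pd.DataFrame({"R": rng.integers(550, 950, size=50),
                    "RA": rng.integers(550, 950, size=50)})
toy["W"] = (81 + 0.1 * (toy["R"] - toy["RA"]) + rng.normal(0, 4, size=50)).round()

# run differential: runs scored minus runs allowed
toy["RD"] = toy["R"] - toy["RA"]

# hold out 20% of the rows as the test set
toy_train, toy_test = train_test_split(toy.copy(),
                                       shuffle=True,
                                       random_state=440,
                                       test_size=0.2)

print(len(toy_train), len(toy_test))

# a numeric companion to the scatter plot, computed on the training rows only
print(toy_train[["W", "RD"]].corr())
```

Any plotting or summary statistics should use the training rows only, so that the test set stays untouched until the very end.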

In [2]:
## load baseball
baseball = pd.read_csv("../Data/baseball.csv")

In [None]:
## Make RD here




In [None]:
## import train test split here




In [None]:
## make the train test split here




In [None]:
## plot W against RD here
plt.figure(figsize=(8,6))

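plt.scatter()  ## fill in: RD on the horizontal axis, W on the vertical axis, from the training set

plt.xlabel()  ## fill in the horizontal axis label
plt.ylabel()  ## fill in the vertical axis label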

plt.show()

##### 2. Comparing two models

In the `Multiple Linear Regression` notebook we built this model:

$$
\texttt{W} = \beta_0 + \beta_1 \texttt{R} + \beta_2 \texttt{RA} + \epsilon.
$$

Does having `RD` broken down into its component parts make for "better" predictions?

Use a validation set to see if the above model has a lower validation MSE than this simple linear regression model:

$$
\texttt{W} = \beta_0 + \beta_1 \texttt{RD} + \epsilon.
$$

<i>Note: We are using a validation set here because it will be faster than cross-validation.</i>
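The comparison described above could be organized along these lines. This is only a sketch on simulated data: `toy`, `toy_tt`, `toy_val` and the helper `validation_mse` are hypothetical names, and the simulated relationship between `W` and `RD` is an assumption made purely so the code runs. One model uses the components `R` and `RA`, the other uses `RD` alone.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the baseball training set (simulated values)
rng = np.random.default_rng(216)
n = 200
toy = pd.DataFrame({"R": rng.normal(700, 80, n),
                    "RA": rng.normal(700, 80, n)})
toy["RD"] = toy["R"] - toy["RA"]
toy["W"] = 81 + 0.1 * toy["RD"] + rng.normal(0, 4, n)

# split the training data once more to get a validation set
toy_tt, toy_val = train_test_split(toy, test_size=0.2,
                                   shuffle=True, random_state=216)

def validation_mse(features):
    """Fit on toy_tt with the given feature columns, return the MSE on toy_val."""
    model = LinearRegression()
    model.fit(toy_tt[features], toy_tt["W"])
    pred = model.predict(toy_val[features])
    return mean_squared_error(toy_val["W"], pred)

mse_mlr = validation_mse(["R", "RA"])  # components of RD as separate features
mse_slr = validation_mse(["RD"])       # RD alone

print("MLR validation MSE:", mse_mlr)
print("SLR validation MSE:", mse_slr)
```

Both models are scored on the same validation rows, which is what makes the two MSE values directly comparable.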

In [None]:
## Make a validation set




In [None]:
## Import LinearRegression here


## Import mean_squared_error here

In [None]:
## Make the MLR model here

## make a model object


## fit the model


## get the prediction on the validation set



## print the validation MSE


In [None]:
## Make the SLR model here

## make a model object


## fit the model


## get the prediction on the validation set



## print the validation MSE


In [None]:
## write or code here




## Carseats

Now let's return to building predictive models to predict carseat `Sales`.

##### 3. Load and prepare the data

First load the `carseats.csv` data set and make a train test split of the data, setting aside $20\%$ of the data as the test set.

In [None]:
carseats = pd.read_csv("../Data/carseats.csv")

In [None]:
## code here




##### 4. More EDA

First run the scatter plot and correlation code chunks below to remind yourself of the EDA you accomplished in `Problem Set 2`.

In [None]:
import matplotlib.pyplot as plt
from seaborn import set_style, pairplot

set_style("whitegrid")

In [None]:
pairplot(car_train[['Sales', 'CompPrice', 'Income', 'Advertising', 'Population']],
            x_vars = ['CompPrice', 'Income', 'Advertising', 'Population'],
            y_vars = ['Sales'],
            height = 3)

plt.show()

print("\n\n\n\n")

pairplot(car_train[['Sales', 'Price', 'Age', 'Education']],
            y_vars = ['Sales'],
            x_vars = ['Price', 'Age', 'Education'],
            height = 3)

plt.show()

In [None]:
print("Correlation with Sales")
print("==================")
car_train[['Sales', 'CompPrice', 'Income', 'Advertising', 
                        'Population', 'Price', 'Age', 'Education']].corr()['Sales'].sort_values()

Now let's do some EDA for the variables we have not yet examined, `ShelveLoc`, `Urban` and `US`.

- Using `seaborn`'s `swarmplot` (<a href="https://seaborn.pydata.org/generated/seaborn.swarmplot.html">https://seaborn.pydata.org/generated/seaborn.swarmplot.html</a>), visually explore whether `Sales` appears to differ across the values of `ShelveLoc`, `Urban`, and `US`.
- Using `pandas`' `groupby` (<a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html</a>), examine the means and medians of `Sales` for the different values of `ShelveLoc`, `Urban`, and `US`.
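For the `groupby` step, one possible pattern is to compute both statistics in a single table with `agg`. The small data frame `toy` below is made up for illustration; only the column names `ShelveLoc` and `Sales` come from the data set.

```python
import pandas as pd

# Hypothetical toy data (not the real carseats values)
toy = pd.DataFrame({
    "ShelveLoc": ["Bad", "Bad", "Bad", "Good", "Good", "Good",
                  "Medium", "Medium", "Medium"],
    "Sales": [3.0, 5.0, 10.0, 9.0, 11.0, 16.0, 6.0, 7.0, 8.0],
})

# one row per category, one column per statistic
summary = toy.groupby("ShelveLoc")["Sales"].agg(["mean", "median"])
print(summary)
```

A noticeable gap between the mean and the median within a group, as in the `Bad` rows here, can hint at skew or outliers in that group.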

In [None]:
## import swarmplot here




In [None]:
## Here is the sample code for the swarmplot of 'ShelveLoc'
plt.figure(figsize=(8,6))

swarmplot(data = car_train,
             y = 'Sales',
             x = 'ShelveLoc')

plt.ylabel("Sales", fontsize=16)
plt.xlabel("ShelveLoc", fontsize=16)
plt.xticks(fontsize=14)
plt.show()

## Make swarmplots for 'Urban' and 'US' here


In [None]:
## do groupbys here

## here is sample code for ShelveLoc
print("ShelveLoc")
print("+++++++++++++++++++++")
print("Mean")
print(car_train.groupby("ShelveLoc").Sales.mean())
print()
print("Median")
print(car_train.groupby("ShelveLoc").Sales.median())
print()
print()


##### 5. Choosing some models

Choose a few different linear regression models based on your explorations and write them down. If your models include categorical variables like `ShelveLoc`, `Urban`, or `US` make the one hot encoded variables below.

<i>If you are struggling to come up with some models, here are some potential models you could try. If you have models of your own that you want to try, feel free to ignore these suggestions :)</i>

$$
\text{Sales} = \beta_0 + \beta_1 \text{Price} + \epsilon
$$

$$
\text{Sales} = \beta_0 + \beta_1 \text{Price} + \beta_2 \text{Advertising} + \epsilon
$$

$$
\text{Sales} = \beta_0 + \beta_1 \text{Price} + \beta_2  \text{ShelveLoc_Bad} + \beta_3 \text{ShelveLoc_Good} + \epsilon
$$

$$
\text{Sales} = \beta_0 + \beta_1 \text{Price} + \beta_2 \text{Advertising} + \beta_3  \text{ShelveLoc_Bad} + \beta_4 \text{ShelveLoc_Good} + \epsilon
$$

$$
\text{Sales} = \beta_0 + \beta_1 \text{Price} + \beta_2 \text{Advertising} + \beta_3  \text{ShelveLoc_Bad} + \beta_4 \text{ShelveLoc_Good} + \beta_5 \text{Population} + \epsilon
$$
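The indicator columns `ShelveLoc_Bad` and `ShelveLoc_Good` that appear in the suggested models can be built with `pandas.get_dummies`. The sketch below uses a made-up data frame `toy`, and it assumes the third category is labeled `Medium`, which then serves as the reference level because it gets no column of its own.

```python
import pandas as pd

# Hypothetical toy data (not the real carseats values)
toy = pd.DataFrame({"Price": [120, 83, 80, 97],
                    "ShelveLoc": ["Bad", "Good", "Medium", "Bad"]})

# one 0/1 column per category, named ShelveLoc_<category>
dummies = pd.get_dummies(toy["ShelveLoc"], prefix="ShelveLoc", dtype=int)

# keep two of the three indicators; the omitted one is the reference level
toy[["ShelveLoc_Bad", "ShelveLoc_Good"]] = dummies[["ShelveLoc_Bad", "ShelveLoc_Good"]]

print(toy)
```

Keeping all three indicators alongside an intercept would make the columns perfectly collinear, which is why one category is left out.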

##### Write your models here



In [None]:
## code here







In [None]:
## code here







In [None]:
## code here







##### 6. Model Comparison

Compare the models you chose using cross-validation to find the one we think will have the lowest MSE. Remember to include a baseline model for comparison purposes.
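One way to structure such a comparison is sketched below on simulated data. The data frame `toy`, the dictionary `models`, the number of splits, and the simulated coefficients are all hypothetical choices; the baseline here is assumed to be "predict the mean of `Sales` on the training folds".

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical stand-in for the carseats training set (simulated values)
rng = np.random.default_rng(614)
n = 200
toy = pd.DataFrame({"Price": rng.normal(115, 20, n),
                    "Advertising": rng.integers(0, 30, n)})
toy["Sales"] = 13 - 0.05 * toy["Price"] + 0.1 * toy["Advertising"] + rng.normal(0, 1, n)

# model name -> list of feature columns; an empty list marks the baseline
models = {"baseline": [],
          "Price": ["Price"],
          "Price + Advertising": ["Price", "Advertising"]}

kfold = KFold(n_splits=5, shuffle=True, random_state=614)
mses = {name: [] for name in models}

for train_idx, holdout_idx in kfold.split(toy):
    tt = toy.iloc[train_idx]
    ho = toy.iloc[holdout_idx]
    for name, features in models.items():
        if not features:
            # baseline: always predict the mean Sales of the training folds
            pred = np.full(len(ho), tt["Sales"].mean())
        else:
            reg = LinearRegression().fit(tt[features], tt["Sales"])
            pred = reg.predict(ho[features])
        mses[name].append(mean_squared_error(ho["Sales"], pred))

# average holdout MSE across the five splits for each model
avg_mse = {name: np.mean(vals) for name, vals in mses.items()}
for name, value in avg_mse.items():
    print(name, value)
```

Every model is evaluated on the same five splits, so differences in average MSE reflect the models rather than the luck of the split.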

In [None]:
## import kfold and anything else you need here




In [None]:
## Code here




In [None]:
## Code here




In [None]:
## Code here




In [None]:
## Code here




--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).