<a href="https://colab.research.google.com/github/masonnystrom/DS-Unit-2-Linear-Models/blob/master/Unit_2_Sprint_1_Linear_Models_Study_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important for preparing you for the sprint challenge and for speaking about these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your SC or an interview. That being said, don't hesitate to ask for help if you are truly stuck.

Have fun studying!

# Resources

[SKLearn Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

[SKLearn Train Test Split Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[SKLearn Logistic Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[SKLearn Scoring Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)

In [0]:
import pandas as pd
import numpy as np

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Linear Regression

## Basics and Data Preparation

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Linear Regression:** `A model that describes the relationship between one or more independent variables and a continuous dependent variable with a linear equation.`

**Polynomial Regression:** `A form of linear regression in which the relationship between an independent variable and the dependent variable is modeled as an Nth-degree polynomial in x. It is useful because it can fit curved (non-linear) relationships while the equation stays linear in its coefficients.`

**Overfitting:** `Overfitting is when a model fits the training data too closely, noise included, and therefore predicts poorly on new data. The more complex the model, the greater the chance of overfitting.`

**Underfitting:** `Underfitting is when a model doesn't accurately predict or represent the data. This can be because the model is too simple, informed by too few features, or generally not the correct model for the data.`

**Outlier:** `An outlier is an observation that lies far from the rest of the data and can skew the results. It should be investigated and, if it is an error or not relevant to the analysis, removed.`

**Categorical Encoding:** `The process of converting categorical variables to numerical variables so that they can be utilized for modeling`

Use `auto_df` to complete the following.

In [173]:
columns = ['symboling','norm_loss','make','fuel','aspiration','doors',
           'bod_style','drv_wheels','eng_loc','wheel_base','length','width',
           'height','curb_weight','engine','cylinders','engine_size',
           'fuel_system','bore','stroke','compression','hp','peak_rpm',
           'city_mpg','hgwy_mpg','price']
auto_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
auto_df = pd.read_csv(auto_url, names=columns, header=None, na_values="?")
print(auto_df.shape)
auto_df.head()

(205, 26)


Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,bod_style,drv_wheels,eng_loc,wheel_base,length,width,height,curb_weight,engine,cylinders,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [174]:
auto_df_clean = auto_df.dropna(axis=0)
auto_df_clean.shape

(159, 26)

Perform a train test split on `auto_df`, your target feature is `price`

In [0]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(auto_df_clean, test_size=0.2, random_state=42)


It's always good to practice EDA, so explore the dataset with both explanatory statistics and visualizations.

In [176]:
train.describe()

Unnamed: 0,symboling,norm_loss,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
count,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0,127.0
mean,0.661417,119.937008,98.494488,172.63622,65.67874,54.055906,2475.283465,120.0,3.297244,3.21937,10.148189,95.07874,5094.094488,26.346457,31.874016,11628.496063
std,1.176627,34.323565,5.322285,11.562753,2.031913,2.247879,498.057957,32.082978,0.263715,0.301319,3.904602,30.112911,447.684261,5.787751,6.281735,6215.869151
min,-2.0,65.0,86.6,141.1,60.3,49.4,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,15.0,18.0,5118.0
25%,0.0,93.0,94.5,166.3,64.0,52.6,2087.5,97.0,3.065,3.075,8.7,69.0,4800.0,23.0,28.0,7429.0
50%,1.0,110.0,97.0,172.4,65.4,54.3,2340.0,110.0,3.27,3.27,9.0,88.0,5200.0,26.0,32.0,9095.0
75%,1.0,150.0,101.2,177.8,66.5,55.5,2834.0,136.0,3.54,3.4,9.4,114.0,5500.0,31.0,37.0,15310.0
max,3.0,197.0,115.6,202.6,71.7,59.8,4066.0,258.0,3.94,4.17,23.0,200.0,6600.0,47.0,53.0,35056.0


In [177]:
train.describe(exclude='number')

Unnamed: 0,make,fuel,aspiration,doors,bod_style,drv_wheels,eng_loc,engine,cylinders,fuel_system
count,127,127,127,127,127,127,127,127,127,127
unique,18,2,2,2,5,3,1,5,5,5
top,toyota,gas,std,four,sedan,fwd,front,ohc,four,2bbl
freq,27,115,107,79,61,81,127,98,105,52


Check for nulls and then write a function to fill in null values. As you can see with `norm_loss`, some of the nulls have a placeholder value of `?` that will need to be addressed.

In [178]:
train.isnull().sum()
# The ? placeholders were already converted to NaN by na_values in pd.read_csv, and dropna removed those rows.

symboling      0
norm_loss      0
make           0
fuel           0
aspiration     0
doors          0
bod_style      0
drv_wheels     0
eng_loc        0
wheel_base     0
length         0
width          0
height         0
curb_weight    0
engine         0
cylinders      0
engine_size    0
fuel_system    0
bore           0
stroke         0
compression    0
hp             0
peak_rpm       0
city_mpg       0
hgwy_mpg       0
price          0
dtype: int64

In [0]:
# Function that replaces '?' placeholders with NaN in every column
def fill_null(df):
    return df.replace('?', np.nan)

How does train test split address underfitting/overfitting?

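`Splitting the data into a train set and a test set means the model is scored on data it never saw while fitting. A large gap between the train and test scores signals overfitting (the model has fit noise in the training data), while poor scores on both signal underfitting. The split does not prevent either problem by itself; it lets you detect them.`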
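
A minimal sketch of how a held-out test set exposes a model that has fit noise. The data here is synthetic and every name (`X_tr`, `train_r2`, and so on) is an assumption for illustration, not part of the exercise: the target is unrelated to the features, yet the train score looks perfect because there are more features than training rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 30 pure-noise features
# and a target that is unrelated to any of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
y = rng.normal(size=40)

# Hold out half of the rows; the model never sees them during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)

# With more features than training rows the model can memorize the training
# set, so the train score looks perfect while the test score does not.
train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(f"train R^2: {train_r2:.3f}")
print(f"test R^2:  {test_r2:.3f}")
```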

What are three synonyms for the Y Variable?
- `dependent variable`
- `target`
- `response variable`

What are three synonyms for the X Variable(s)?
- `independent variable`
- `features`
- `predictors (explanatory variables)`

One hot encode a categorical feature

In [181]:
train.columns

Index(['symboling', 'norm_loss', 'make', 'fuel', 'aspiration', 'doors',
       'bod_style', 'drv_wheels', 'eng_loc', 'wheel_base', 'length', 'width',
       'height', 'curb_weight', 'engine', 'cylinders', 'engine_size',
       'fuel_system', 'bore', 'stroke', 'compression', 'hp', 'peak_rpm',
       'city_mpg', 'hgwy_mpg', 'price'],
      dtype='object')

In [0]:
target = 'price'
features = ['fuel', 'bore', 'bod_style', 'hgwy_mpg', 'doors', 'aspiration', 'make', 'city_mpg']
y_train = train[target]
X_train = train[features]
y_test = test[target]
X_test = test[features] 

In [183]:
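# category_encoders provides the OneHotEncoder that accepts use_cat_names
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
# Fit the encoder on the training data only, then reuse it on the test data
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)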

Define the 5 versions of **Baseline**:
* The score you'd get by guessing
* Fast, first models that beat guessing
* Complete, tuned "simpler" model (Simpler mathematically, computationally. Or less work for you, the data scientist.)
* Minimum performance that "matters" to go to production and benefit your employer and the people you serve.
* Human-level performance

What is the purpose of getting a baseline that tells you what you would get with a guess? (Mean or Majority Classifier Baseline)

`The mean or majority classifier serves as the starting prediction: with no model at all, the best single guess is the mean (regression) or the most common class (classification). A model is only useful if it beats that guess, so the mean or majority baseline is the first test of a model's effectiveness.`

Get the mean baseline for the target feature. If you log transformed the target feature, get the mean baseline of the log transformed target feature.

In [184]:
train.head()

Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,bod_style,drv_wheels,eng_loc,wheel_base,length,width,height,curb_weight,engine,cylinders,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
105,3,194.0,nissan,gas,turbo,two,hatchback,rwd,front,91.3,170.7,67.9,49.7,3139,ohcv,six,181,mpfi,3.43,3.27,7.8,200.0,5200.0,17,23,19699.0
179,3,197.0,toyota,gas,std,two,hatchback,rwd,front,102.9,183.5,67.7,52.0,3016,dohc,six,171,mpfi,3.27,3.35,9.3,161.0,5200.0,19,24,15998.0
6,1,158.0,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110.0,5500.0,19,25,17710.0
120,1,154.0,plymouth,gas,std,four,hatchback,fwd,front,93.7,157.3,63.8,50.6,1967,ohc,four,90,2bbl,2.97,3.23,9.4,68.0,5500.0,31,38,6229.0
68,-1,93.0,mercedes-benz,diesel,turbo,four,wagon,rwd,front,110.0,190.9,70.3,58.7,3750,ohc,five,183,idi,3.58,3.64,21.5,123.0,4350.0,22,25,28248.0


In [0]:
from sklearn.metrics import accuracy_score

In [186]:
target = 'price'
y_train = train[target]

# Mean baseline
mean = y_train.mean()
y_pred = [mean] * len(y_train)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_train, y_pred)
mae

4824.5459730919465
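
The same kind of mean baseline can also be sketched with scikit-learn's `DummyRegressor`, which slots into the usual fit/predict workflow. The toy values below are hypothetical and are not taken from `auto_df`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Toy target values (hypothetical, not taken from auto_df).
y_toy = np.array([5000.0, 7000.0, 9000.0, 15000.0])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# strategy="mean" always predicts the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)

baseline_mae = mean_absolute_error(y_toy, baseline_pred)
print(baseline_pred[0], baseline_mae)
```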

In [0]:
import plotly.express as px

## Modeling

What is the 5-step process for using Scikit-learn's estimator API?
1. `Import the estimator class`
2. `Instantiate this class`
3. `Arrange X features matrix and y target vector`
4. `Fit the model`
5. `Apply the model`
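
The five steps above can be sketched on a tiny made-up dataset. The data is an assumption chosen so the answer is easy to check by hand (`y = 2x + 1`).

```python
# 1. Import the estimator class
from sklearn.linear_model import LinearRegression
import numpy as np

# 2. Instantiate the class
model = LinearRegression()

# 3. Arrange the X features matrix (2-D) and y target vector (1-D)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X[:, 0] + 1

# 4. Fit the model
model.fit(X, y)

# 5. Apply the model to new data
new_pred = model.predict(np.array([[4.0]]))
print(model.coef_, model.intercept_, new_pred)
```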

Follow the 5 steps to make a prediction on your test set. The functions and changes you made to `X_train` may need to be applied to `X_test` if you have not done so already.

In [192]:
train.head()

Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,bod_style,drv_wheels,eng_loc,wheel_base,length,width,height,curb_weight,engine,cylinders,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
105,3,194.0,nissan,gas,turbo,two,hatchback,rwd,front,91.3,170.7,67.9,49.7,3139,ohcv,six,181,mpfi,3.43,3.27,7.8,200.0,5200.0,17,23,19699.0
179,3,197.0,toyota,gas,std,two,hatchback,rwd,front,102.9,183.5,67.7,52.0,3016,dohc,six,171,mpfi,3.27,3.35,9.3,161.0,5200.0,19,24,15998.0
6,1,158.0,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110.0,5500.0,19,25,17710.0
120,1,154.0,plymouth,gas,std,four,hatchback,fwd,front,93.7,157.3,63.8,50.6,1967,ohc,four,90,2bbl,2.97,3.23,9.4,68.0,5500.0,31,38,6229.0
68,-1,93.0,mercedes-benz,diesel,turbo,four,wagon,rwd,front,110.0,190.9,70.3,58.7,3750,ohc,five,183,idi,3.58,3.64,21.5,123.0,4350.0,22,25,28248.0


In [209]:
train.dtypes

symboling        int64
norm_loss      float64
make            object
fuel            object
aspiration      object
doors           object
bod_style       object
drv_wheels      object
eng_loc         object
wheel_base     float64
length         float64
width          float64
height         float64
curb_weight      int64
engine          object
cylinders       object
engine_size      int64
fuel_system     object
bore           float64
stroke         float64
compression    float64
hp             float64
peak_rpm       float64
city_mpg         int64
hgwy_mpg         int64
price          float64
dtype: object

In [0]:
# Step 1 - Use Linear Regression
from sklearn.linear_model import LinearRegression

In [0]:
# Step 2 instantiate this class
model = LinearRegression()

In [0]:
# Step 3 arrange x feature matrices
target = 'norm_loss'
features = [ 'city_mpg','hgwy_mpg']
y_train = train[target]
X_train = train[features]
y_test = test[target]
X_test = test[features] 

In [0]:
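# One-hot encoder (it leaves numeric columns such as city_mpg and hgwy_mpg unchanged)
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)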

In [0]:
# from sklearn.impute import SimpleImputer
# imputer = SimpleImputer(strategy='mean')
# X_train_imputed = imputer.fit_transform(X_train)
# X_test_imputed = imputer.transform(X_test)

In [0]:
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train_imputed)
# X_test_scaled = scaler.transform(X_test_imputed)

In [211]:
# Step 4 fit model (the imputing and scaling cells above are commented out, so fit on X_train)
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [215]:
# Step 5 apply the model
y_pred = model.predict(X_train)
# accuracy_score only applies to classification; score a continuous target with a regression metric
mean_absolute_error(y_train, y_pred)

## Scoring

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Mean Absolute Error (MAE):** `The average of the absolute differences between the predicted values and the actual values, in the same units as the target.`

**Mean Squared Error (MSE):** `The average of the squared differences between the predicted values and the actual values.`

**Root Mean Squared Error (RMSE):** `The square root of the MSE, which puts the error back in the units of the target. It is roughly the standard deviation of the residuals (prediction errors).`

**Coefficient of Determination ($R^2$):** `The proportion of the variance in the target that the model explains. A score of 1 is a perfect fit and 0 is no better than always predicting the mean.`

**Residual Error:** `The difference between the observed value and the value the model predicted.`

**Bias:** `the difference between the average prediction of our model and the correct value. High bias oversimplifies the model and underfits the data`

**Variance:** `Variance describes how much a model changes when you train it using different portions of your data set. High variance overfits data`

**Validation Curve:** `Your Answer Here`

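**Ordinary Least Squares:** `The method that finds the best-fit line by choosing the coefficients that minimize the sum of the squared residuals.`

**Ridge Regression:** `Ridge regression adds a penalty on the size of the coefficients (the sum of their squares) to reduce a model's variance, at the cost of a little more bias.`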

In a short paragraph, explain the Bias-Variance Tradeoff

```
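The bias-variance tradeoff is the balance between two sources of prediction error. A simple model has high bias and underfits, while a complex model has high variance and overfits. Reducing bias usually increases variance, so the goal is the level of complexity that minimizes the total error on new data.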
```
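
One way to see the tradeoff is to score the same model at several levels of complexity with scikit-learn's `validation_curve`. The sketch below is illustrative only: the curved data is synthetic, and the polynomial degrees are arbitrary choices. Train scores can only rise as the degree grows, while validation scores show whether the extra flexibility helps on unseen folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=60)

# make_pipeline names each step after its lowercased class name.
pipeline = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = [1, 3, 9]  # low, medium, and high model complexity

train_scores, val_scores = validation_curve(
    pipeline, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
)

# Each row holds the R^2 scores of one degree across the 5 folds.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for degree, tr, va in zip(degrees, train_mean, val_mean):
    print(f"degree {degree}: train R^2 {tr:.2f}, validation R^2 {va:.2f}")
```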

Use each of the regression metrics (MAE, MSE, RMSE, and $R^2$) on both the mean baseline and your predictions.

In [214]:
# MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_train, y_pred)
mae

20.900596331655247

In [0]:
# MSE

In [0]:
# RMSE

In [0]:
# R^2
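
A hedged sketch of all four metrics, applied to both a mean baseline and a set of predictions. The arrays are small made-up values so the numbers can be checked by hand, and `regression_report` is a hypothetical helper, not a scikit-learn function. RMSE is computed as the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration (not from auto_df).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_model = np.array([2.0, 5.0, 8.0, 12.0])       # pretend model predictions
y_base = np.full_like(y_true, y_true.mean())   # mean baseline predictions

def regression_report(y_true, y_hat):
    """Return MAE, MSE, RMSE, and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_hat)
    return {
        "MAE": mean_absolute_error(y_true, y_hat),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_true, y_hat),
    }

baseline_scores = regression_report(y_true, y_base)
model_scores = regression_report(y_true, y_model)
print("baseline:", baseline_scores)
print("model:   ", model_scores)
```

On the training data the mean baseline always has an $R^2$ of 0, which is why $R^2$ reads naturally as "improvement over guessing the mean."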

Print and plot the coefficients of your model.

In [216]:
# Print the coefficients (fitted attributes end with an underscore)
print(model.coef_)
print(model.intercept_)

In [0]:
# Plot the coefficients (one bar per feature)
import plotly.express as px
px.bar(x=features, y=model.coef_, labels={'x': 'feature', 'y': 'coefficient'})

Interpret your results with a short paragraph. How well did your model perform? How do you read a single prediction? Did you beat the baseline? 

```
Your Answer Here
```

Use Ridge Regression and get the $R^2$ score
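
A minimal sketch of the comparison on synthetic data. The dataset and `alpha=10.0` are assumptions for illustration; in practice `alpha` would be tuned on validation data. Ridge can never beat ordinary least squares on the training score, so any benefit shows up on held-out data and in the smaller coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): 10 features, only 3 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(size=80)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)  # alpha is a hypothetical choice

# .score() returns R^2 for regressors.
ols_train_r2, ols_test_r2 = ols.score(X_tr, y_tr), ols.score(X_te, y_te)
ridge_train_r2, ridge_test_r2 = ridge.score(X_tr, y_tr), ridge.score(X_te, y_te)
print(f"OLS   train/test R^2: {ols_train_r2:.3f} / {ols_test_r2:.3f}")
print(f"Ridge train/test R^2: {ridge_train_r2:.3f} / {ridge_test_r2:.3f}")

# The penalty shrinks the coefficients toward zero.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"coefficient norms: OLS {ols_norm:.3f}, Ridge {ridge_norm:.3f}")
```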

How does the ridge regression score compare to your linear regression and baseline scores?

```
Your Answer Here
```

# Logistic Regression

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Logistic Regression:** `A classification model for a binary (or categorical) dependent variable. It passes a linear combination of the independent variables through the logistic (sigmoid) function to predict the probability of each class.`

**Majority Classifier:** `A baseline that always predicts the most common class in the training data, which can be found using the mode.`

**Validation Set:** `A subset of the data, separate from both the train and test sets, that is used to evaluate a model and tune its hyperparameters.`

**Accuracy:** `The percentage of time that your model predicts the correct result.`

**Feature Selection:** `Determining which features to use in a model that will determine the prediction`

Answer each of the following questions with no more than a short paragraph.
<br/><br/>

What is the difference between linear regression and logistic regression?
```
Logistic regression is used when the dependent variable is categorical (most often binary). In contrast, linear regression is used when the dependent variable is continuous.
```

What is the purpose of having a validation set?
```
To evaluate and tune the trained model on data that is separate from the test data, so the test set stays untouched for one final, unbiased score.
```

Can we use MAE, MSE, RMSE, and $R^2$ to score a Logistic Regression model? Why or why not? If not, how do we score Logistic Regression models?
```
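Not directly. Logistic regression predicts a categorical class label, while MAE, MSE, RMSE, and R^2 measure the distance between continuous values. Instead, we score logistic regression with classification metrics such as accuracy.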
```

Use the Titanic dataset below to predict whether passengers survived or not. Try to avoid looking at the work you did during the lecture.

Make sure to do the following but feel free to do more:
- Train/Test/Validation Split
- Majority Classifier Baseline
- Include at least 2 features in X (Stretch, try K-Best)
- Use Logistic Regression
- Score your model's accuracy against the Majority Classifier Baseline
 - If you did not beat the baseline, tweak your model until it exceeds the baseline
- Score your model on the validation set
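
The workflow in the list above can be sketched on synthetic data before trying it on Titanic. Everything below is an assumption for illustration: the single made-up feature, the 60/20/20 split sizes, and the variable names are not taken from the lecture or the dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption; this is not the Titanic dataset).
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = (X[:, 0] > 0).astype(int)

# Split off the test set first, then carve a validation set out of the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval)

# Majority classifier baseline: always predict the most common training class.
majority_class = np.bincount(y_train).argmax()
baseline_acc = accuracy_score(y_val, np.full_like(y_val, majority_class))

# Fit on the training set only, then score on the validation set.
model = LogisticRegression().fit(X_train, y_train)
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"baseline accuracy: {baseline_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

The test set is deliberately left unscored here; it is meant for one final check after tuning is finished.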

In [0]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

train = pd.read_csv(DATA_PATH+'titanic/train.csv')
test = pd.read_csv(DATA_PATH+'titanic/test.csv')

In [0]:
train.head()