# Exercise: Linear Regression

In [131]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Simple Linear Regression

Use the `data/cars93.csv` dataset.

The x-axis is `Weight`, the independent variable. The y-axis is `MPG.highway`, the dependent variable.

In [132]:
cars_df = pd.read_csv("../data/cars93.csv")
cars_df.head()

| | Id | Manufacturer | Model | Type | Min.Price | Price | Max.Price | MPG.city | MPG.highway | AirBags | ... | Passengers | Length | Wheelbase | Width | Turn.circle | Rear.seat.room | Luggage.room | Weight | Origin | Make |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Acura | Integra | Small | 12.9 | 15.9 | 18.8 | 25 | 31 | NaN | ... | 5 | 177 | 102 | 68 | 37 | 26.5 | 11.0 | 2705 | non-USA | Acura Integra |
| 1 | 2 | Acura | Legend | Midsize | 29.2 | 33.9 | 38.7 | 18 | 25 | Driver & Passenger | ... | 5 | 195 | 115 | 71 | 38 | 30.0 | 15.0 | 3560 | non-USA | Acura Legend |
| 2 | 3 | Audi | 90 | Compact | 25.9 | 29.1 | 32.3 | 20 | 26 | Driver only | ... | 5 | 180 | 102 | 67 | 37 | 28.0 | 14.0 | 3375 | non-USA | Audi 90 |
| 3 | 4 | Audi | 100 | Midsize | 30.8 | 37.7 | 44.6 | 19 | 26 | Driver & Passenger | ... | 6 | 193 | 106 | 70 | 37 | 31.0 | 17.0 | 3405 | non-USA | Audi 100 |
| 4 | 5 | BMW | 535i | Midsize | 23.7 | 30.0 | 36.2 | 22 | 30 | Driver only | ... | 4 | 186 | 109 | 69 | 39 | 27.0 | 13.0 | 3640 | non-USA | BMW 535i |


### Correlations

Check the correlation between the dependent and independent variables with the [DataFrame.corr](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) method, whose `method` parameter defaults to `pearson`.

`Weight` and `MPG.highway` have a strong negative correlation: -0.810658.

In [133]:
# Weight, MPG.highway
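
One possible way to fill in this cell is sketched below. To stay self-contained it uses only the five rows visible in the `head()` output (the name `sample_df` is made up for this sketch), so the value it prints will differ from the full-dataset figure of -0.810658.

```python
import pandas as pd

# Five (Weight, MPG.highway) pairs copied from the head() output; not the full dataset
sample_df = pd.DataFrame({
    "Weight": [2705, 3560, 3375, 3405, 3640],
    "MPG.highway": [31, 25, 26, 26, 30],
})

# Pairwise correlation matrix (method="pearson" is the default)
corr_matrix = sample_df[["Weight", "MPG.highway"]].corr()
print(corr_matrix)

# Single coefficient between the two columns
weight_mpg_corr = sample_df["Weight"].corr(sample_df["MPG.highway"])
print(weight_mpg_corr)
```

On the real data, the same two calls would be made on `cars_df` instead of `sample_df`.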

### LinearRegression.fit(X, y, ...)

`X` is "Weight", the single feature. It must be two-dimensional, for example `cars_df[["Weight"]]`.  
`y` is "MPG.highway", the target.

Use the model to print:
- `model.coef_`
- `model.intercept_`
- `model.score(X, y)`, the coefficient of determination

In [134]:
# LinearRegression, fit, coef_, intercept_, score
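
A minimal sketch of this step follows, assuming the five `head()` rows as stand-in data (`sample_df` is a name invented here). The double brackets keep `X` two-dimensional, which is the shape `fit` expects for features.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Five rows copied from the head() output; swap in cars_df for the real exercise
sample_df = pd.DataFrame({
    "Weight": [2705, 3560, 3375, 3405, 3640],
    "MPG.highway": [31, 25, 26, 26, 30],
})

X = sample_df[["Weight"]]      # 2D: (n_samples, 1 feature)
y = sample_df["MPG.highway"]   # 1D target

model = LinearRegression()
model.fit(X, y)

print("coef_:", model.coef_)            # slope, one value per feature
print("intercept_:", model.intercept_)  # predicted y when Weight is 0
print("R^2:", model.score(X, y))        # coefficient of determination
```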

### Scatter plot

The x-axis is "Weight". The y-axis is "MPG.highway".

In [135]:
# Scatter plot of Weight vs. MPG.highway
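
The scatter plot could be drawn along these lines with Matplotlib. The data are again just the five `head()` rows under the made-up name `sample_df`, and the title is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Five rows copied from the head() output; swap in cars_df for the real exercise
sample_df = pd.DataFrame({
    "Weight": [2705, 3560, 3375, 3405, 3640],
    "MPG.highway": [31, 25, 26, 26, 30],
})

fig, ax = plt.subplots()
ax.scatter(sample_df["Weight"], sample_df["MPG.highway"])
ax.set_xlabel("Weight")
ax.set_ylabel("MPG.highway")
ax.set_title("Weight vs. MPG.highway")
```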

### Trendline

A trendline can be computed in several ways: with NumPy, with SciPy, or with the `LinearRegression` model.

In [136]:
# Use a trendline with a visualization
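
As one illustration of the NumPy route, the sketch below fits a degree-1 polynomial with `np.polyfit` and draws it over the scatter plot. It assumes the five `head()` rows as data (`sample_df` is a made-up name); the slope and intercept from a fitted `LinearRegression` model could be plotted the same way.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Five rows copied from the head() output; swap in cars_df for the real exercise
sample_df = pd.DataFrame({
    "Weight": [2705, 3560, 3375, 3405, 3640],
    "MPG.highway": [31, 25, 26, 26, 30],
})
x = sample_df["Weight"]
y = sample_df["MPG.highway"]

# Least-squares straight line: coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the line across the observed Weight range
x_line = np.linspace(x.min(), x.max(), 50)

fig, ax = plt.subplots()
ax.scatter(x, y, label="data")
ax.plot(x_line, slope * x_line + intercept, color="red", label="trendline")
ax.set_xlabel("Weight")
ax.set_ylabel("MPG.highway")
ax.legend()
```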

## Multiple Linear Regression

When reading data, the pandas `read_*` functions (such as `read_csv`) treat a default set of strings as missing values and replace them with `NaN`: " ", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "&lt;NA&gt;", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null". This applies while `keep_default_na` is `True`, which is the default; the `na_values` parameter adds further strings to the list.
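
The effect of `keep_default_na` can be seen on a small in-memory CSV. The three rows below are toy data invented for this sketch (the "Toy" row in particular is hypothetical), and the note about "None" assumes a recent pandas release that includes it in the defaults.

```python
import io

import pandas as pd

# Toy CSV text: "None" and "n/a" are ordinary strings in the file
csv_text = "Model,AirBags\nIntegra,None\nLegend,Driver & Passenger\nToy,n/a\n"

# Default behaviour: strings on the default NA list become missing values
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False with no na_values: every string is kept literally
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["AirBags"].isna().sum())
print(literal_df["AirBags"].tolist())
```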

In [137]:
cars_df.isna().sum()

Id                     0
Manufacturer           0
Model                  0
Type                   0
Min.Price              0
Price                  0
Max.Price              0
MPG.city               0
MPG.highway            0
AirBags               34
DriveTrain             0
Cylinders              0
EngineSize             0
Horsepower             0
RPM                    0
Rev.per.mile           0
Man.trans.avail        0
Fuel.tank.capacity     0
Passengers             0
Length                 0
Wheelbase              0
Width                  0
Turn.circle            0
Rear.seat.room         2
Luggage.room          11
Weight                 0
Origin                 0
Make                   0
dtype: int64

Drop the columns "Id", "Manufacturer", "Model", and "Make".

In [138]:
# ...
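
Dropping identifier columns might look like the sketch below. `toy_df` is a two-row stand-in built from values in the `head()` output, not the real `cars_df`.

```python
import pandas as pd

# Two-row stand-in with the identifier columns plus one numeric column
toy_df = pd.DataFrame({
    "Id": [1, 2],
    "Manufacturer": ["Acura", "Acura"],
    "Model": ["Integra", "Legend"],
    "Weight": [2705, 3560],
    "Make": ["Acura Integra", "Acura Legend"],
})

# drop returns a new DataFrame; the original is left unchanged
trimmed_df = toy_df.drop(columns=["Id", "Manufacturer", "Model", "Make"])
print(trimmed_df.columns.tolist())
```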

In "AirBags", pandas read the string "None" as a missing value. Replace the missing values with the string "None" and convert the column to the `category` dtype.

In [139]:
# Airbags pd.NA to "None", category dtype
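
A sketch of that replacement on a stand-in Series follows. The five values mirror the "AirBags" column of the `head()` output, with the first one missing, and the names `airbags` and `airbags_clean` are invented here.

```python
import numpy as np
import pandas as pd

# Stand-in for cars_df["AirBags"]: the first value is missing after read_csv
airbags = pd.Series(
    [np.nan, "Driver & Passenger", "Driver only", "Driver & Passenger", "Driver only"],
    name="AirBags",
)

# Fill the gaps with the literal string "None", then store as a categorical
airbags_clean = airbags.fillna("None").astype("category")

print(airbags_clean.cat.categories.tolist())
```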

Use the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to replace the missing values in "Rear.seat.room" and "Luggage.room" with each column's mean.

In [140]:
# SimpleImputer: "Rear.seat.room", "Luggage.room"
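
Mean imputation with `SimpleImputer` could be sketched as follows. The four rows are toy values with gaps inserted by hand, and `toy_df` and `room_cols` are names chosen only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

room_cols = ["Rear.seat.room", "Luggage.room"]

# Toy rows with one missing value per column
toy_df = pd.DataFrame({
    "Rear.seat.room": [26.5, 30.0, np.nan, 31.0],
    "Luggage.room": [11.0, np.nan, 14.0, 17.0],
})

# strategy="mean" fills each gap with the mean of its own column
imputer = SimpleImputer(strategy="mean")
toy_df[room_cols] = imputer.fit_transform(toy_df[room_cols])

print(toy_df)
```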

In [141]:
cars_df["Cylinders"].value_counts()

Cylinders
4         49
6         31
8          7
3          3
5          2
rotary     1
Name: count, dtype: int64

There are a few options with "Cylinders":
1. Replace "rotary" with `np.nan`.
2. Replace "rotary" with a "2" and convert to an integer.
3. Create dummy variables.

In [None]:
# ...
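
The three options listed above could be sketched like this on a small stand-in Series. The five values are toy data, and `opt1`, `opt2`, and `opt3` are names made up for the comparison.

```python
import numpy as np
import pandas as pd

# Stand-in for cars_df["Cylinders"]: strings, because "rotary" is mixed in
cylinders = pd.Series(["4", "6", "8", "rotary", "4"], name="Cylinders")

# Option 1: "rotary" becomes NaN, so the column has to be float
opt1 = cylinders.replace("rotary", np.nan).astype(float)

# Option 2: "rotary" becomes "2", then everything converts to integers
opt2 = cylinders.replace("rotary", "2").astype(int)

# Option 3: one indicator column per distinct value
opt3 = pd.get_dummies(cylinders, prefix="Cylinders")

print(opt1.tolist())
print(opt2.tolist())
print(opt3.columns.tolist())
```

Option 1 leaves a missing value that still needs handling before fitting, while option 3 adds one column per category.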

"Man.trans.avail" contains "Yes"/"No" strings. Convert them to a boolean or an integer.

Experiment with the dataset, but avoid creating too many dummy variables. Candidate columns for dummy variables include:
- Type
- AirBags
- DriveTrain
- Origin

In [None]:
# ...
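
Both conversions could be sketched as below, assuming a three-row toy frame (`toy_df` and its values are invented for illustration). Passing `drop_first=True` is one way to keep the number of dummy columns down; it is a choice made for this sketch, not a requirement of the exercise.

```python
import pandas as pd

# Toy rows: one Yes/No column and one categorical column
toy_df = pd.DataFrame({
    "Man.trans.avail": ["Yes", "No", "Yes"],
    "Type": ["Small", "Midsize", "Compact"],
})

# Map the Yes/No strings to 1/0
toy_df["Man.trans.avail"] = toy_df["Man.trans.avail"].map({"Yes": 1, "No": 0})

# One indicator column per category, minus the first to avoid redundancy
encoded_df = pd.get_dummies(toy_df, columns=["Type"], drop_first=True, dtype=int)

print(encoded_df)
```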

The target/dependent variable should be "MPG.highway".

Choose the features (independent variables) from the remaining columns; several combinations are reasonable.

Use `train_test_split` with a test set of 25% of the dataset. Any `random_state` works, but keep it constant so that the split is reproducible.

In [None]:
# train_test_split
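
The split could look like the following sketch. The 40 rows are synthetic numbers generated for illustration, not cars93 values, and the feature choice ("Weight" and "Horsepower") and `random_state=42` are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 40 rows with a rough negative Weight/MPG relationship
rng = np.random.default_rng(0)
n_rows = 40
toy_df = pd.DataFrame({
    "Weight": rng.uniform(1700, 4100, n_rows),
    "Horsepower": rng.uniform(55, 300, n_rows),
})
toy_df["MPG.highway"] = 50 - 0.007 * toy_df["Weight"] + rng.normal(0, 1.5, n_rows)

X = toy_df[["Weight", "Horsepower"]]   # features
y = toy_df["MPG.highway"]              # target

# 25% of the rows go to the test set; a fixed random_state makes it repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```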

Use `LinearRegression`. Fit the training set. Predict the test set.

In [None]:
# LinearRegression
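
A minimal sketch of fitting and predicting follows. It uses `make_regression` to generate synthetic features and targets as a stand-in for the prepared cars data, so the numbers carry no meaning about cars.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the MPG.highway target
X, y = make_regression(n_samples=80, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)      # learn coefficients from the training rows only
y_pred = model.predict(X_test)   # predictions for rows the model has not seen

print(model.coef_, model.intercept_)
print(y_pred[:5])
```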

Generate a scatter plot with actual values versus predicted values.

In [None]:
# Scatter plot prediction vs actual
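
An actual-versus-predicted plot could be sketched as below, again on synthetic `make_regression` data standing in for the cars features. The dashed diagonal is an optional addition: points on it are predicted exactly.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the MPG.highway target
X, y = make_regression(n_samples=80, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred)

# Reference line y = x spanning the range of both axes
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
ax.plot([low, high], [low, high], linestyle="--", color="gray")

ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
```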

Generate a scatter plot with residuals.

In [None]:
# Scatter plot with residuals
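
Residuals are the actual values minus the predicted values. The sketch below plots them against the predictions, using synthetic `make_regression` data as a stand-in; a cloud scattered evenly around the zero line suggests the linear model fits reasonably.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the MPG.highway target
X, y = make_regression(n_samples=80, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Residual = actual - predicted, one value per test row
residuals = y_test - y_pred

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals)
ax.axhline(0, linestyle="--", color="gray")  # zero-error reference
ax.set_xlabel("Predicted")
ax.set_ylabel("Residual (actual - predicted)")
```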

Calculate the R<sup>2</sup> score based on `X_test` and `y_test`.

In [None]:
# model.score()
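
For a regressor, `model.score(X_test, y_test)` returns the same R<sup>2</sup> value as `sklearn.metrics.r2_score` applied to the predictions. The sketch below checks this on synthetic `make_regression` data used as a stand-in.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the MPG.highway target
X, y = make_regression(n_samples=80, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# R^2 on held-out data, computed two equivalent ways
test_r2 = model.score(X_test, y_test)
same_r2 = r2_score(y_test, model.predict(X_test))

print(test_r2, same_r2)
```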

### Stretch Goals

- Try other combinations of dependent and independent variables. Is the correlation between them perfect, strong, moderate, weak, or zero?
- Use [matplotlib.pyplot.subplot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html) to create a grid of scatter plots of the dependent variable against each independent variable.
- Create a 3D plot!
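
As a starting point for the 3D stretch goal, the sketch below draws a 3D scatter of two features against the target. It uses only the "Weight", "Price", and "MPG.highway" values from the five `head()` rows, and picking "Price" as the second feature is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# Values copied from the five head() rows; not the full dataset
weight = [2705, 3560, 3375, 3405, 3640]
price = [15.9, 33.9, 29.1, 37.7, 30.0]
mpg_highway = [31, 25, 26, 26, 30]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # 3D axes
ax.scatter(weight, price, mpg_highway)
ax.set_xlabel("Weight")
ax.set_ylabel("Price")
ax.set_zlabel("MPG.highway")
```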