# Python Machine Learning: Regression Solutions

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
data = pd.read_csv('../data/auto-mpg.csv')

---
### Challenge 1: More EDA

Create the following plots, or examine the following distributions, while exploring your data:

1. A histogram of the displacement.
2. A histogram of the horsepower.
3. A histogram of the weight.
4. A histogram of the acceleration.
5. What are the unique model years, and their counts?
6. What are the unique origin values, and their counts?

---

In [None]:
ax = data['displacement'].hist(grid=False, bins=np.linspace(75, 450, 15))
ax.set_xlabel('Displacement')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
ax = data['horsepower'].hist(grid=False, bins=np.linspace(45, 230, 15))
ax.set_xlabel('Horsepower')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
ax = data['weight'].hist(grid=False)
ax.set_xlabel('Weight')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
ax = data['acceleration'].hist(grid=False)
ax.set_xlabel('Acceleration')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
data['model year'].value_counts().sort_index()

In [None]:
data['origin'].value_counts().sort_index()

---
### Challenge 2: Mean Absolute Error

Another commonly used metric in regression is the **Mean Absolute Error (MAE)**. As the name suggests, it is the mean of the absolute errors, that is, the absolute differences between the true and predicted values. Calculate the MAE on the training and test data with your trained model. We've imported `mean_absolute_error` for you below:

---

In [None]:
# Remove the response variable and car name
X = data.drop(columns=['car name', 'mpg'])
# Assign response variable to its own variable
y = data['mpg'].astype(np.float64)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error
# MAE on the training data, then on the held-out test data
print(mean_absolute_error(y_train, y_train_pred))
print(mean_absolute_error(y_test, y_test_pred))
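
To make the definition of the MAE concrete, here is a minimal sketch that computes it by hand with NumPy and compares the result against `mean_absolute_error`. The arrays `y_true` and `y_pred` are made-up values for illustration, not taken from the auto-mpg data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted MPG values, for illustration only
y_true = np.array([18.0, 25.0, 31.0, 22.0])
y_pred = np.array([20.0, 24.0, 28.0, 22.0])

# MAE: mean of the absolute differences between truth and prediction
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual)
print(mae_sklearn)
```

The absolute errors here are 2, 1, 3, and 0, so both lines should print 1.5. Because the MAE is in the same units as the response, it can be read directly as "on average, predictions are off by this many MPG."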

---
### Challenge 3: Feature Engineering

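You might notice that the `origin` variable takes only three values, so it's really a categorical variable: each sample has one of three origins. So far, we've treated it like a continuous variable, which forces the model to assume that moving from origin 1 to 2 changes the MPG by the same amount as moving from origin 2 to 3.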

How can we properly treat this variable as categorical? This is a question of preprocessing and **feature engineering**.

We can replace the `origin` feature with two binary variables. The first tells us whether origin is equal to 2. The second tells us whether origin is equal to 3. If both are false, origin is equal to 1, which serves as the baseline category.

By fitting a linear regression with these two binary features rather than treating `origin` as continuous, we can get a better sense of how the origin affects the MPG.

Create two new binary features corresponding to origin, and then recreate the training and test data. Then, fit a linear model to the new data. What do you find about the performance and new coefficients?

---

In [None]:
data['origin_2'] = (data['origin'] == 2).astype('int')
data['origin_3'] = (data['origin'] == 3).astype('int')
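
As an alternative to building the indicator columns by hand, pandas can generate them with `pd.get_dummies`. The sketch below uses a small made-up frame (`toy`) in place of the real data and assumes origin 1 should be the dropped baseline level, matching the manual encoding above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the `origin` column
toy = pd.DataFrame({'origin': [1, 2, 3, 1, 2]})

# Manual indicators, built the same way as in the solution cell
manual = pd.DataFrame({
    'origin_2': (toy['origin'] == 2).astype(int),
    'origin_3': (toy['origin'] == 3).astype(int),
})

# One-hot encode and drop the first level (origin == 1) as the baseline
dummies = pd.get_dummies(toy['origin'], prefix='origin', drop_first=True).astype(int)

print(dummies)
```

Both approaches should yield the same two columns. Dropping one level avoids perfect collinearity between the full set of indicators and the intercept.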

In [None]:
# Remove the response variable and car name
X = data.drop(columns=['car name', 'mpg', 'origin'])
# Assign response variable to its own variable
y = data['mpg'].astype(np.float64)

In [None]:
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate model: R^2 on the test set, then the fitted coefficients
print(model.score(X_test, y_test))
print(model.coef_)
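
The raw `coef_` array is hard to read without knowing which feature each entry belongs to. One way to label it, sketched here on synthetic data rather than the auto-mpg dataset, is to wrap the coefficients in a `pd.Series` indexed by the column names. The names `X_demo`, `y_demo`, and `demo_model`, and the coefficients used to build the response, are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data, not the auto-mpg dataset
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'weight': rng.normal(size=200),
    'origin_2': rng.integers(0, 2, size=200),
})
# Noise-free response built with known coefficients: -3 for weight, +2 for origin_2
y_demo = 25 - 3 * X_demo['weight'] + 2 * X_demo['origin_2']

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with the name of its column
coefs = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefs)
```

Applied to the model above, the same idea would be `pd.Series(model.coef_, index=X.columns)`. The coefficient on a binary feature such as `origin_2` is then the estimated difference in MPG relative to the baseline (origin 1), holding the other features fixed.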