# Linear Regression Practice - Plus Data Preprocessing/Preparation

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

## Business Problem

We have used-car data scraped in Belarus, which we can use to explore the used car market there.

Our goal is to build a model that effectively predicts the price of a used car based on its features (both numerical and categorical).

## Data Understanding

[Original Data Source](https://www.kaggle.com/datasets/lepchenkov/usedcarscatalog)

In [None]:
df = pd.read_csv("data/used_cars.csv")

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
sns.pairplot(df);

In [None]:
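# Correlations are only defined for numeric (and boolean) columns, so select those first
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), annot=True);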

In [None]:
df.describe(include=[object])

## Data Preparation

### Deciding which columns to use

Looking at the object columns, we can see that `manufacturer_name` and `model_name` have far too many unique values to one-hot encode. While those columns may be useful, they would take some extra work to get into our model - let's just drop them for now.
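
As a rough sketch of that decision, the snippet below counts unique values per column and then drops the two high-cardinality ones. It runs on a tiny made-up frame rather than the real data; `toy_df`, its values, and the `price_usd` column name are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the used-car data (values are made up for illustration)
toy_df = pd.DataFrame({
    "manufacturer_name": ["Subaru", "Ford", "BMW"],
    "model_name": ["Outback", "Focus", "X5"],
    "transmission": ["automatic", "mechanical", "automatic"],
    "price_usd": [10900.0, 5000.0, 12500.0],  # assumed target column name
})

# Count unique values per column to spot high-cardinality features
print(toy_df.nunique())

# Drop the columns we decided not to encode for now
toy_df = toy_df.drop(columns=["manufacturer_name", "model_name"])
print(toy_df.columns.tolist())
```

On the real data, the same `drop(columns=[...])` call on `df` should do the job.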

In [None]:
# Drop columns


### Impute Null Values

One thing we haven't dealt with much so far this week is null values! However, if we pass data containing nulls to a model like scikit-learn's `LinearRegression`, fitting will raise an error.

In [None]:
# Check out our null values
df.isna().sum()

#### Discuss: How should we deal with these null values?

-  


SKLearn Imputation User Guide: https://scikit-learn.org/stable/modules/impute.html#impute
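
One possible approach, sketched on made-up data: drop the rows missing engine capacity, split, then fill the remaining nulls in `color` with the most frequent training value. The toy frame, its values, and the `price_usd` column name are assumptions here, and `most_frequent` is just one of several reasonable strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (values are made up for illustration)
toy_df = pd.DataFrame({
    "color": ["black", np.nan, "black", "white", "black", "white", np.nan, "black"],
    "engine_capacity": [2.0, 1.6, np.nan, 1.8, 2.5, 1.4, 2.0, 1.6],
    "price_usd": [9000.0, 4000.0, 7000.0, 5500.0, 12000.0, 3000.0, 8000.0, 4500.0],
})

# Drop the rows where engine capacity is null
toy_df = toy_df.dropna(subset=["engine_capacity"])

X = toy_df.drop(columns=["price_usd"])
y = toy_df["price_usd"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Fit on the training split only, then reuse the learned fill values on the test split
imputer = SimpleImputer(strategy="most_frequent")
X_train_nonull = imputer.fit_transform(X_train)
X_test_nonull = imputer.transform(X_test)

# Both outputs are numpy arrays with no nulls left
print(pd.isna(X_train_nonull).sum(), pd.isna(X_test_nonull).sum())
```

Fitting the imputer on the training split alone keeps any information about the test rows out of the fill values.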

In [None]:
# Dropping rows where engine capacity is null
# Easiest to do this before a train test split, honestly


In [None]:
# We could run the simple imputer on the full dataset with little risk of data leakage,
# but fitting it on the training data only is the safer habit - so let's train test split first!

X = None
y = None

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# Can use simple imputer to fill nulls in color!
# Import it here


In [None]:
# Instantiate it with our strategy

# Then fit and transform


X_train_nonull = None
X_test_nonull = None

### Encode Our Categorical Data


In [None]:
# Even though X_train_nonull is a np array, we can explore it a bit
# Looping over each column to explore unique values
for col in range(X_train_nonull.shape[1]):
    col_uniques = np.unique(X_train_nonull[:,col])
    print(len(col_uniques))
    print(col_uniques)

What we can see is that, for our four categorical columns (`transmission`, `color`, `engine_type`, and `body_type`), the maximum number of uniques is 13 - none of these have so many categories that we can't just one-hot encode these!
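
Here is a minimal sketch of the fit-and-transform step on a toy object array, assuming column 0 is categorical and column 1 is numeric. The `toy_` names and values are made up; the real arrays would use the column indices defined in the next cells.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy object arrays standing in for the imputed train/test arrays
toy_train = np.array([["automatic", 2.0],
                      ["mechanical", 1.6],
                      ["automatic", 1.8]], dtype=object)
toy_test = np.array([["mechanical", 2.5]], dtype=object)

# One-hot encode column 0 (dropping the first category), pass column 1 through
toy_ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(drop="first"), [0])],
    remainder="passthrough", sparse_threshold=0)

# Learn the categories from train, then apply the same encoding to test
toy_train_enc = toy_ct.fit_transform(toy_train)
toy_test_enc = toy_ct.transform(toy_test)

print(toy_train_enc.shape, toy_test_enc.shape)
```

With two categories and `drop='first'`, the categorical column becomes a single 0/1 column, so the encoded arrays here still have two columns.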

In [None]:
# Let's use the same process we did yesterday
# I went ahead and provided the imports
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
# Define our categorical columns - by index number, because np array
cat_cols = [0, 1, 4, 6]

# Instantiate our encoder
encoder = OneHotEncoder(handle_unknown='error',
                        drop='first',
                        categories='auto')

# Create a ColumnTransformer object
ct = ColumnTransformer(transformers=[('ohe', encoder, cat_cols)],
                       remainder='passthrough', sparse_threshold=0)

# Now fit and transform!


X_train_enc = None
X_test_enc = None

In [None]:
# We can explore what this looks like now, at a glance
pd.DataFrame(X_train_enc, columns=ct.get_feature_names()).info()
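# A note - in sklearn 1.0+ this method is replaced by get_feature_names_out(), which gives better feature names
# But we'll stick with the version of sklearn that's in IllumiDesk for now
# Also, these columns all show up as object dtype because the array coming out of
# the imputer and encoder is an object array - the values themselves are numeric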

### Scale Our Data
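
Whichever scaler we pick, the pattern is the same: fit on the training data, then transform both splits. The sketch below uses `StandardScaler` as one option (not necessarily the best choice for this data) on made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric arrays (values are made up for illustration)
toy_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
toy_test = np.array([[2.0, 700.0]])

scaler = StandardScaler()
# Learn each column's mean and standard deviation from the training data only
toy_train_sc = scaler.fit_transform(toy_train)
# Reuse those training statistics on the test data
toy_test_sc = scaler.transform(toy_test)

print(toy_train_sc.mean(axis=0))  # approximately 0 per column
print(toy_train_sc.std(axis=0))   # approximately 1 per column
```

Note that the scaled test values are not guaranteed to have mean 0 or standard deviation 1, since they are scaled with the training statistics.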

In [None]:
# Import the scaler we want to use
# Which should we use? Why?


In [None]:
# Instantiate, fit, then transform


X_train_sc = None
X_test_sc = None

## Modeling

### Model-Less Baseline
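
The idea of a model-less baseline is to predict the training mean for every row and see how the metrics look. A small sketch on made-up prices (all `toy_` names and values are assumptions):

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets (values are made up for illustration)
toy_y_train = np.array([3000.0, 5000.0, 7000.0, 9000.0])
toy_y_test = np.array([4000.0, 10000.0])

# The baseline "model" is just the mean of the training target
train_mean = toy_y_train.mean()

# Predict that same value for every row in both splits
toy_baseline_train_preds = np.full(len(toy_y_train), train_mean)
toy_baseline_test_preds = np.full(len(toy_y_test), train_mean)

print(r2_score(toy_y_train, toy_baseline_train_preds))
print(r2_score(toy_y_test, toy_baseline_test_preds))
```

Predicting the training mean gives a train R2 of zero by construction, and the test R2 will be zero or lower, which is the bar any real model needs to clear.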

In [None]:
# Get the mean of our training y


In [None]:
# Grab predictions
baseline_train_preds = None
baseline_test_preds = None

In [None]:
# Evaluate
print(f"Train R2 Score: {r2_score(y_train, baseline_train_preds):.4f}")
print(f"Train MAE Score: ${mean_absolute_error(y_train, baseline_train_preds):.4f}")
print(f"Train RMSE Score: ${mean_squared_error(y_train, baseline_train_preds, squared=False):.4f}")
print("*"*20)
print(f"Test R2 Score: {r2_score(y_test, baseline_test_preds):.4f}")
print(f"Test MAE Score: ${mean_absolute_error(y_test, baseline_test_preds):.4f}")
print(f"Test RMSE Score: ${mean_squared_error(y_test, baseline_test_preds, squared=False):.4f}")

#### Evaluate: Thoughts?

- 


### Baseline Linear Regression Model

In [None]:
# Import the modeling library we want to use


In [None]:
# Create and fit our model
lr_base = None


In [None]:
# Grab predictions
train_preds = None
test_preds = None

In [None]:
# Evaluate
print(f"Train R2 Score: {r2_score(y_train, train_preds):.4f}")
print(f"Train MAE Score: ${mean_absolute_error(y_train, train_preds):.4f}")
print(f"Train RMSE Score: ${mean_squared_error(y_train, train_preds, squared=False):.4f}")
print("*"*20)
print(f"Test R2 Score: {r2_score(y_test, test_preds):.4f}")
print(f"Test MAE Score: ${mean_absolute_error(y_test, test_preds):.4f}")
print(f"Test RMSE Score: ${mean_squared_error(y_test, test_preds, squared=False):.4f}")

In [None]:
# visualize residuals
plt.scatter(train_preds, y_train-train_preds, label='Train')
plt.scatter(test_preds, y_test-test_preds, label='Test')

plt.axhline(y=0, color = 'red', label = '0')
plt.xlabel('predictions')
plt.ylabel('residuals')
plt.legend()
plt.show()

#### Evaluate: Thoughts?

- 


### Next Model: Log Y

We saw that our y value was pretty right-skewed to start with:

In [None]:
y_train.hist();

### Log Transforming

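Linear regression doesn't strictly require the predictors or the target to be normally distributed (the usual normality assumption is about the residuals), but it often works better when a heavily skewed target is made more symmetric.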

**Log-transforming** can be a good tool to make *right-skewed* data more normal.

(For *left-skewed* data, which is rarer, we can try transforming our data by raising it to an exponent greater than 1.)

Let's see what transforming it would look like.

Log transformation using numpy's `log1p` - [documentation](https://numpy.org/doc/stable/reference/generated/numpy.log1p.html) (Why `log1p` rather than `log`? It computes $\log(1 + x)$, so it is defined at zero and stays accurate for small values - [check out this post](https://stackoverflow.com/a/49538384/14222529). Also FYI, the inverse is `expm1`.)
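
A quick sanity check of that round trip, using a few made-up prices:

```python
import numpy as np

# Made-up prices, including a zero to show that log1p handles it
prices = np.array([0.0, 1.0, 999.0, 50000.0])

logged = np.log1p(prices)    # log(1 + x)
restored = np.expm1(logged)  # exp(x) - 1, the inverse of log1p

print(logged[0])                      # log1p(0) is 0.0
print(np.allclose(restored, prices))  # the round trip recovers the prices
```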

In [None]:
np.log1p(y_train).hist();

Much more normal! (Although it looks like there are still some extreme outliers.)

Let's log both our train and test y, then see if our model improves.
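
The workflow can be sketched end to end on synthetic data: log the target, fit on the logged values, and unlog the predictions before computing dollar-term metrics. Everything here (the `toy_` names, the random features, the made-up coefficients) is an assumption for illustration, not the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic right-skewed target: exponential of a linear signal plus noise
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(200, 2))
toy_y = np.exp(8 + 0.5 * toy_X[:, 0] - 0.3 * toy_X[:, 1]
               + rng.normal(scale=0.1, size=200))

toy_X_train, toy_X_test = toy_X[:150], toy_X[150:]
toy_y_train, toy_y_test = toy_y[:150], toy_y[150:]

# Log both targets
toy_y_train_log = np.log1p(toy_y_train)
toy_y_test_log = np.log1p(toy_y_test)

# Fit on the logged target; predictions come out in log space
toy_lr_log = LinearRegression().fit(toy_X_train, toy_y_train_log)
toy_test_preds_log = toy_lr_log.predict(toy_X_test)

# R2 in log space, MAE back in the original units via expm1
print(r2_score(toy_y_test_log, toy_test_preds_log))
print(mean_absolute_error(toy_y_test, np.expm1(toy_test_preds_log)))
```

Only the target is transformed here; the features go through the same preparation as before.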

In [None]:
y_train_log = None
y_test_log = None

In [None]:
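# Model - fit a new linear regression on the logged target
# Name it lr_log, since that's the name used when we look at coefficients below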


# Grab predictions
train_preds_log = None
test_preds_log = None

In [None]:
# Evaluate
# Note that, for the two in dollar terms (MAE and RMSE), I unlog the predictions
print(f"Train R2 Score: {r2_score(y_train_log, train_preds_log):.4f}")
print(f"Train MAE Score: ${mean_absolute_error(np.expm1(y_train_log), np.expm1(train_preds_log)):.4f}")
print(f"Train RMSE Score: ${mean_squared_error(np.expm1(y_train_log), np.expm1(train_preds_log), squared=False):.4f}")
print("*"*20)
print(f"Test R2 Score: {r2_score(y_test_log, test_preds_log):.4f}")
print(f"Test MAE Score: ${mean_absolute_error(np.expm1(y_test_log), np.expm1(test_preds_log)):.4f}")
print(f"Test RMSE Score: ${mean_squared_error(np.expm1(y_test_log), np.expm1(test_preds_log), squared=False):.4f}")

In [None]:
# visualize residuals
plt.scatter(train_preds_log, y_train_log-train_preds_log, label='Train')
plt.scatter(test_preds_log, y_test_log-test_preds_log, label='Test')

plt.axhline(y=0, color = 'red', label = '0')
plt.xlabel('predictions')
plt.ylabel('residuals')
plt.legend()
plt.show()

#### Evaluate: Thoughts?

- 


### Interpreting after Log Transformations

But with this transformed target, how do I now interpret my LR coefficients?


In [None]:
# Look at coefs for our ohe model
dict(zip(ct.get_feature_names(), lr_base.coef_))

In [None]:
# Now for coefs of our log model
dict(zip(ct.get_feature_names(), lr_log.coef_))

Before the transformation, I would have said that a one-unit increase (and note - units are based on however we scaled our inputs!) in the X column results on average in a `Xcoef` increase in our target. 

But what I need to say now is that a one-unit increase in our X column results on average in a `Xcoef` increase *in the logarithm of our target*, i.e. the price (strictly, price + 1, since we used `log1p`) is multiplied by a factor of $e^{\text{Xcoef}}$.

More practically, you can read the coefficient as a percentage change! Exponentiate the coefficient and subtract one to get the proportional change in price for a one-unit increase, then multiply by 100 to express it as a percentage.

Formula:

$(e^{\text{Xcoef}} - 1) \times 100$

In code:
```
(np.exp(Xcoef) - 1) * 100
```
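
To make the arithmetic concrete, here is a worked example with two hypothetical coefficients (the feature names and values are made up, not taken from our model):

```python
import numpy as np

# Hypothetical log-model coefficients, for illustration only
toy_coefs = {"feature_a": 0.05, "feature_b": -0.20}

toy_pct_changes = {}
for name, coef in toy_coefs.items():
    # Percentage change in the target for a one-unit increase in the feature
    toy_pct_changes[name] = (np.exp(coef) - 1) * 100
    print(f"{name}: {toy_pct_changes[name]:.2f}%")
# feature_a: 5.13%
# feature_b: -18.13%
```

For small coefficients, the percentage is close to the coefficient times 100 (0.05 gives about 5.13%), but the gap widens as coefficients get larger in magnitude.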



In [None]:
# For example:
log_coef_dict = dict(zip(ct.get_feature_names(), lr_log.coef_))

for feature, coef in log_coef_dict.items():
    print(f"A One-Unit Increase in {feature} results on average in a {(np.exp(coef) - 1) * 100:.4f}% change in price")

Note that our binary columns get hard to interpret here: if we standardized our inputs, a "one-unit increase" in a one-hot column is one standard deviation, not a switch from 0 to 1. In practice, before interpreting variables in any practical sense, we'd run one last model without scaling to allow us to better interpret our results - but we'll still likely keep our logged `y` as our target if it improves our model!

Reference:
- https://stats.oarc.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression/

### Next Model

Now what?

In [None]:
# code here to keep iterating!