# Regression: predicting the price of a car

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

For this project, we'll use the car price dataset by CooperUnion from Kaggle: https://www.kaggle.com/CooperUnion/cardataset

<img src="images/car-prices-dataset.jpg">

In [None]:
# read the dataset from 'data/car-price-processes.csv.gz'
df = pd.read_csv('data/car-price-processes.csv.gz')

In [None]:
# number of rows
len(df)

There are 11,914 rows in the data.

In [None]:
# take a look at the first 5 rows
df.head()

It has the following fields:
    
* `make` — make of a car (BMW, Toyota, and so on)
* `model` — model of a car
* `year` — year when the car was manufactured
* `engine_fuel_type` — type of fuel the engine needs (diesel, electric, and so on)
* `engine_hp` — horsepower of the engine
* `engine_cylinders` — number of cylinders in the engine
* `transmission_type` — type of transmission (automatic or manual)
* `driven_wheels` — front, rear, all
* `number_of_doors` — number of doors a car has
* `market_category` — luxury, crossover, and so on
* `vehicle_size` — compact, midsize, or large
* `vehicle_style` — sedan or convertible
* `highway_mpg` — miles per gallon (mpg) on the highway
* `city_mpg` — miles per gallon in the city
* `msrp` — manufacturer’s suggested retail price

## Exploratory data analysis
### Price

Our goal is to predict the price of a car. Let's look at the distribution of prices:

In [None]:
plt.figure(figsize=(10, 6))

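# histogram of prices (distplot is deprecated in newer seaborn versions in favor of histplot)
sns.distplot(df.msrp, kde=False)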

The data has a long tail, so it's hard to see anything. Let's zoom in a bit and look only at prices below 100,000:

In [None]:
plt.figure(figsize=(10, 6))

# histogram of prices below 100000
sns.distplot(df.msrp[df.msrp < 100000], kde=False)


To remove the effect of the long tail, we apply the log transformation:

$$y_\text{new} = \log(y + 1)$$
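NumPy provides this transformation as `np.log1p` and its inverse as `np.expm1`. A minimal sketch on a few made-up prices (the values are illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical prices spanning several orders of magnitude
prices = np.array([0.0, 2000.0, 30000.0, 2000000.0])

# log1p(y) = log(y + 1): the +1 keeps a price of 0 well-defined
log_prices = np.log1p(prices)

# expm1(z) = exp(z) - 1 undoes the transformation
restored = np.expm1(log_prices)

print(log_prices.round(2))
print(np.allclose(restored, prices))
```

After the transformation, the largest value is only a few times bigger than the typical one, which is what removes the long tail.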

In [None]:
plt.figure(figsize=(10, 6))

# histogram of log-transformed prices
sns.distplot(np.log1p(df.msrp), kde=False)



The tail is gone now. When the distribution has this bell-like shape, it's easier for the model to predict the prices.

<img src="images/normal_distribution.svg">

### Missing values

Not all machine learning models can deal with missing values. That's why we first need to check whether any data is missing:

In [None]:
# number of missing values in each column
df.isnull().sum()

We have 5 columns with missing data:

* `engine_fuel_type`
* `engine_hp`
* `engine_cylinders`
* `number_of_doors`
* `market_category`

We'll address them later, when training the model.


**Go back to the slides** 

## Training the model

### Splitting the data

We'll split the data into two parts: training and validation


<img src="images/validation_process.svg" />

**Note**: we use only two parts for simplicity. In a real project, you should split the data into three parts: train, validation, and test. Check [the book](https://mlbookcamp.com/) for more details.
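One possible way to get three parts is to call `train_test_split` twice. The sketch below uses a hypothetical 100-row dataframe, and the 60/20/20 proportions and the seed are arbitrary choices made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the car dataframe: 100 rows
df_toy = pd.DataFrame({'x': np.arange(100), 'msrp': np.arange(100) * 1000})

# First set aside 20% of the rows as the test set...
df_full_train, df_test = train_test_split(df_toy, test_size=0.2, random_state=1)

# ...then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2)
df_train_toy, df_val_toy = train_test_split(df_full_train, test_size=0.25, random_state=1)

print(len(df_train_toy), len(df_val_toy), len(df_test))
```

The test part stays untouched until the very end, so it gives an honest estimate after all the tuning on validation is done.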

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
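# hold out 20% of the rows for validation
# (random_state=1 is an arbitrary seed, used only to make the split reproducible)
df_train, df_val = train_test_split(df, test_size=0.2, random_state=1)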

In [None]:
len(df_train), len(df_val)

We want to predict MSRP. Let's remove it from our dataframes so we don't accidentally use it as a feature. We also need to apply the log transformation to it:

In [None]:
y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)

del df_train['msrp']
del df_val['msrp']

### Baseline

We start with a few numeric variables.
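In pandas, the numeric columns of a dataframe can be listed with `select_dtypes`. A minimal sketch on a hypothetical miniature dataframe (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature version of the car dataframe
df_toy = pd.DataFrame({
    'make': ['bmw', 'toyota'],
    'engine_hp': [335.0, 148.0],
    'city_mpg': [19, 24],
})

# Keep only the columns with a numeric dtype (int or float)
numeric_columns = df_toy.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```

On a dataframe that still contains the target, this selection would also return `msrp`, so the result needs a manual check before it is used as a feature list.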

To make it easier to use, let's create a list with variable names:

In [None]:
base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

In [None]:
df_train[base].head()

We already have $y$, the target variable we want to predict. Now we need $X$, the matrix with features. Let's create a function `prepare_X` that extracts features from the dataframe and creates this matrix.

We previously saw that our data has missing values. For now we'll simply replace these values with zero, but later we'll look at other options.

In [None]:
df_num = df_train[base]
df_num = df_num.fillna(0)
df_num.values


In [None]:
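def prepare_X(df):
    # keep the base features and replace missing values with zero
    df_num = df[base]
    df_num = df_num.fillna(0)
    X = df_num.values
    return X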

In [None]:
X_train = prepare_X(df_train)

This is what the data looks like after preprocessing:

In [None]:
# let's take a look at X_train
X_train

In [None]:
X_train.shape

Now we have both $X$ and $y$, so let's train our model. We'll use `LinearRegression` from `sklearn.linear_model`

In [None]:
from sklearn.linear_model import LinearRegression

To train a model, we use the `fit` method:

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

To predict, we use the `predict` method:

In [None]:
y_pred = lr.predict(X_train)

`y_pred` is an array with predictions for each row of the matrix `X_train`:

In [None]:
y_pred

Let's compare the predictions with the actual values:

In [None]:
plt.figure(figsize=(10, 6))

sns.distplot(y_train, label='target', kde=False)
sns.distplot(y_pred, label='prediction', kde=False)


plt.legend()
plt.show()

We see that the distributions have quite different shapes, so probably our model is not good. To quantify the predictive performance of our model, we use metrics.

For regression, we often use RMSE: Root Mean Squared Error.
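For reference, with $m$ rows, actual values $y_i$ and predictions $\hat{y}_i$ (this notation is an assumption, it is not defined elsewhere in the notebook), RMSE can be written as:

$$\text{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2}$$

The expression under the square root is MSE, the mean of the squared errors.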

scikit-learn versions before 1.4 don't have a dedicated RMSE function, so we use MSE (`mean_squared_error` in `sklearn.metrics`) and take its square root:

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
def rmse(y, y_pred):
    mse = mean_squared_error(y, y_pred)
    return np.sqrt(mse)
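To see what this function computes, here is the same calculation done by hand with NumPy. The three targets and predictions are made-up numbers chosen so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical targets and predictions, already on the log scale
y_true = np.array([10.0, 11.0, 12.0])
y_hat = np.array([10.5, 11.0, 11.5])

errors = y_hat - y_true         # [0.5, 0.0, -0.5]
mse = (errors ** 2).mean()      # (0.25 + 0.0 + 0.25) / 3
rmse_by_hand = np.sqrt(mse)

print(round(float(rmse_by_hand), 4))
```

Because large errors are squared before averaging, a few bad predictions raise RMSE more than many small ones.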

Now we can use this function to evaluate the quality of our predictions:

In [None]:
rmse(y_train, y_pred)

Evaluating the model on the training data is too optimistic, because the model has already seen this data. We use the validation set instead.

We don't yet have the feature matrix for it, so we first need to extract $X$ from the validation dataframe:

In [None]:
X_val = prepare_X(df_val)

Now let's apply the model and calculate RMSE:

In [None]:
y_pred = lr.predict(X_val)

rmse(y_val, y_pred)

### Adding more features

The process of extracting features from the dataframe is called "feature engineering". 

Let's engineer our first feature: age, which is the difference between the current year (2017 in the dataset) and the year when the car was manufactured.

We will create a new `prepare_X` function

In [None]:
def prepare_X(df):
    df = df.copy()
    features = base.copy()

    df['age'] = 2017 - df['year']
    features.append('age')

    df_num = df[features]
    df_num = df_num.fillna(0)
    X = df_num.values

    return X

In this function we modify the dataframe by adding a new column to it. That's why we first take a copy: the original dataframe we pass in stays unchanged.

We also create a list `features` with the features we'll use in $X$. First, we copy the `base` list, which contains the base numeric features we used previously, and then add `age` to it.
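The effect of the two copies can be checked on a small example. The helper `add_age`, the list `base_toy` and the one-row dataframe below are hypothetical and exist only for this illustration:

```python
import pandas as pd

base_toy = ['engine_hp']  # hypothetical one-element feature list
df_toy = pd.DataFrame({'engine_hp': [148.0], 'year': [2008]})

def add_age(df):
    df = df.copy()              # work on a copy of the dataframe
    features = base_toy.copy()  # and on a copy of the list
    df['age'] = 2017 - df['year']
    features.append('age')
    return df[features]

result = add_age(df_toy)

print(list(result.columns))  # columns selected inside the function
print(list(df_toy.columns))  # the original dataframe has no 'age' column
print(base_toy)              # the original list has no 'age' entry
```

Without the copies, every call would add the column to the caller's dataframe and append another `'age'` to the shared list.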

Let's use this new function for training the model:

In [None]:
X_train = prepare_X(df_train)
X_train.shape

In [None]:
X_train = prepare_X(df_train)

lr = LinearRegression().fit(X_train, y_train)

y_pred = lr.predict(X_train)
print('train', rmse(y_train, y_pred))

X_val = prepare_X(df_val)
y_pred = lr.predict(X_val)
print('validation', rmse(y_val, y_pred))

RMSE improved from 0.75 to 0.51. Let's look at the distributions:

In [None]:
plt.figure(figsize=(10, 6))

sns.distplot(y_val, label='target', kde=False)
sns.distplot(y_pred, label='prediction', kde=False)

plt.legend()
plt.show()

The distribution of the predictions is now closer to the original one

## Using the model

Now that we have a trained model, we can use it to make predictions.
Let's check how to do it for a car from our validation set (we didn't train our model on this data):

In [None]:
i = 0
ad = df_val.iloc[i].to_dict()
ad

To use the model:

* create a dataframe from this dictionary
* convert it to a matrix X (with one row)
* put it into the model
* get the first element from the results

In [None]:
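df_ad = pd.DataFrame([ad])      # one-row dataframe from the dictionary
X_ad = prepare_X(df_ad)         # feature matrix with a single row
y_pred = lr.predict(X_ad)[0]    # first (and only) prediction
y_pred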


We applied the logarithmic transformation to the target, so the prediction is on the log scale. Let's undo it to get the actual price recommendation:

In [None]:
# undo the logarithm with expm1
suggestion = np.expm1(y_pred)

We can compare it with the real price:

In [None]:
np.expm1(y_val[i]), suggestion

## Saving and loading the model

In [None]:
import pickle

Saving the model

In [None]:
with open('price-model.bin', 'wb') as f_out:
    pickle.dump(lr, f_out)

Loading the model

In [None]:
with open('price-model.bin', 'rb') as f_in:
    lr = pickle.load(f_in)

In [None]:
# code summary

base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

def prepare_X(df):
    df = df.copy()
    features = base.copy()

    df['age'] = 2017 - df.year
    features.append('age')

    df_num = df[features]
    df_num = df_num.fillna(0)
    X = df_num.values

    return X

In [None]:
car = {
    'year': 2008,
    'engine_hp': 148.0,
    'engine_cylinders': 4.0,
    'highway_mpg': 33,
    'city_mpg': 24
}

In [None]:
X_car = prepare_X(pd.DataFrame([car]))
np.expm1(lr.predict(X_car)[0]).round(2)

## Web service

```
DATA='{
    "year": 2008,
    "engine_hp": 148.0,
    "engine_cylinders": 4.0,
    "highway_mpg": 33,
    "city_mpg": 24
}'


curl -XPOST \
    --data "$DATA" \
    -H "Content-Type: application/json" \
    localhost:9696/predict
```
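The notebook does not show the service that answers this request. One possible shape for it is sketched below with Flask, which is an assumption, as are the `price` response key and the `StubModel` class that stands in for the model loaded from `price-model.bin`:

```python
import numpy as np
import pandas as pd
from flask import Flask, jsonify, request

base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

def prepare_X(df):
    df = df.copy()
    features = base.copy()
    df['age'] = 2017 - df.year
    features.append('age')
    return df[features].fillna(0).values

class StubModel:
    """Hypothetical stand-in for the model unpickled from 'price-model.bin'."""
    def predict(self, X):
        # always returns log1p(20000); a real service would use the trained model
        return np.full(len(X), np.log1p(20000.0))

model = StubModel()
app = Flask('price')  # the app name is arbitrary

@app.route('/predict', methods=['POST'])
def predict():
    car = request.get_json()                # JSON body sent by the client
    X = prepare_X(pd.DataFrame([car]))      # one-row feature matrix
    price = np.expm1(model.predict(X)[0])   # undo the log transformation
    return jsonify({'price': round(float(price), 2)})

# To serve on the port used by the curl command above, one could call:
# app.run(host='0.0.0.0', port=9696)

# Exercise the endpoint in-process with Flask's test client
client = app.test_client()
response = client.post('/predict', json={
    'year': 2008,
    'engine_hp': 148.0,
    'engine_cylinders': 4.0,
    'highway_mpg': 33,
    'city_mpg': 24,
})
print(response.status_code, response.get_json())
```

The test client sends the same JSON payload as the curl command without opening a network port, which makes the endpoint easy to check before deploying it.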