A.S. Lundervold, v. 170921

# Introduction

This notebook goes through some core concepts related to **regression** in machine learning, based on concrete examples. 

We'll use two data sets for this, of increasing complexity: _vehicles_ and _housing prices_. This notebook goes through the _vehicles_ example. We'll have a look at the housing data later. 

<img src="assets/cars.jpg">

# Setup

In [None]:
# This is a quick check of whether the notebook is currently running on Google Colaboratory, as that makes some difference for the code below.
# We'll do this in every notebook of the course.
if 'google.colab' in str(get_ipython()):
    print('The notebook is running on Colab. colab=True.')
    colab=True
else:
    print('The notebook is not running on Colab. colab=False.')
    colab=False

In [None]:
# To display plots directly in the notebook:
%matplotlib inline

We import our standard framework:

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sklearn

In [None]:
# Set the directory in which to store data
NB_DIR = Path.cwd()     
DATA = NB_DIR/'data'/'vehicles'     

DATA.mkdir(parents=True, exist_ok=True)

# Understand the problem and look at the big picture

## Frame the problem

Our task will be to predict the price of a car from various descriptive features. One can imagine using this to figure out whether an offered price is fair or, if you're a car dealer, to decide on your sale price. It can also be used to see if there are interesting general trends linking the price of a car to its features. 

## Select performance measures

If we imagine that our model is to be used as part of a more comprehensive pricing system, the broader picture may influence what performance measures we'd like to use. 

In this case we keep things simple: we just want the predicted price to, on average, correspond to the actual sale price. 

For regression problems, two widely used performance measures are the ***Root Mean Squared Error*** (RMSE) and the ***Mean Absolute Error*** (MAE). 

We'll talk more about these later in the notebook. 
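
For reference, here are the standard definitions of the two measures. The notation is an assumption introduced here, not something defined earlier in the notebook: $n$ is the number of instances, $y_i$ the actual price of car $i$, and $\hat{y}_i$ the predicted price.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

Because the errors are squared before averaging, RMSE penalizes large errors more heavily than MAE does.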

# Get the data

We'll use the data provided by Nehal Birla here: https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho. Download the archive and store it as `archive.zip` in the `DATA` directory to continue. 

In [None]:
import shutil

In [None]:
shutil.unpack_archive(DATA/'archive.zip', extract_dir=DATA)

There are three different data sets. Let's have a quick look at them to decide which one to use:

In [None]:
list(DATA.iterdir())

In [None]:
# Read the files via the DATA path defined in the setup
car_data = pd.read_csv(DATA/'car data.csv')
car_details = pd.read_csv(DATA/'CAR DETAILS FROM CAR DEKHO.csv')
car_details_v3 = pd.read_csv(DATA/'Car details v3.csv')

In [None]:
car_data.head()

In [None]:
car_data.info()

In [None]:
car_details.head()

In [None]:
car_details.info()

In [None]:
car_details_v3.head()

In [None]:
car_details_v3.info()

Let's use the last one as it has the most features and instances. Note that there are some missing values in `mileage`, `engine`, `max_power`, `torque` and `seats` that we'll have to deal with.

In [None]:
df = car_details_v3.copy()
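
The missing values mentioned above can also be counted directly, rather than read off the `info()` output. Here is a minimal sketch on a small hypothetical data frame (`toy` and its values are made up for illustration); the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    'mileage': ['23.4 kmpl', None, '17.7 kmpl', '21.1 kmpl'],
    'seats': [5.0, np.nan, 5.0, np.nan],
    'selling_price': [450000, 370000, 158000, 225000],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Fraction of rows that have at least one missing value
frac_incomplete = toy.isna().any(axis=1).mean()
print(frac_incomplete)
```

Knowing how many rows are affected helps when deciding later between dropping rows and imputing values.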

# Explore the data

In [None]:
df.head()

## Feature distributions

Here's a plot of the price distribution in our data:

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(df.selling_price, kde=True)
plt.show()

We see that there are some very expensive cars in the data set, but only a few.

How about the distribution of model years?

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(df.year, kde=True)
plt.show()

We note that most of the cars are quite new.

Is there a relationship between the model year and the price? Let's make a new categorical feature to investigate. Based on the above histogram, we say that cars from 2010 or earlier are "old", cars from 2011 to 2015 are "medium", and cars from 2016 to 2020 are "new".

In [None]:
df["age_cat"] = pd.cut(df["year"], bins=[1982, 2010, 2015, 2020],
                               labels=['old', 'medium', 'new'])
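
To see exactly which years end up in which category, here is a small sketch using the same bin edges as the cell above. The `years` values are hypothetical, chosen to sit on and around the bin edges.

```python
import pandas as pd

# Hypothetical model years, picked to probe the bin edges
years = pd.Series([1983, 2010, 2011, 2015, 2016, 2020])

# With the default right=True, each bin includes its right edge:
# (1982, 2010], (2010, 2015], (2015, 2020]
age_cat = pd.cut(years, bins=[1982, 2010, 2015, 2020],
                 labels=['old', 'medium', 'new'])
print(age_cat.tolist())
# ['old', 'old', 'medium', 'medium', 'new', 'new']
```

Note that a year equal to the lowest edge (1982) would fall outside all bins and become `NaN` unless `include_lowest=True` is passed.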



In [None]:
df.head()

In [None]:
df.age_cat.value_counts()

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(data=df, x='selling_price', hue='age_cat')
plt.show()

We observe a tendency for newer cars to be more expensive than older ones. 

What about transmission type?

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(df.transmission)
plt.show()

Most are manual transmission. Is there a relationship between the price and the type of transmission?

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(data=df, x='selling_price', hue='transmission')
plt.show()

It seems that cars with automatic transmission are pricier. We can also see this by computing the mean prices:

In [None]:
# We find all the rows corresponding to automatic transmission, 
# extract their selling prices, and compute the mean
df.loc[df.transmission=='Automatic'].selling_price.mean()

In [None]:
df.loc[df.transmission=='Manual'].selling_price.mean()
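
The two cells above can also be combined into a single `groupby`. Here is a sketch on hypothetical data (the `toy` frame and its prices are made up). It also reports the median, which is less sensitive than the mean to a few very expensive cars.

```python
import pandas as pd

# Hypothetical prices for the two transmission types
toy = pd.DataFrame({
    'transmission': ['Manual', 'Automatic', 'Manual', 'Automatic'],
    'selling_price': [300000, 900000, 500000, 1100000],
})

# One row per transmission type, one column per summary statistic
summary = toy.groupby('transmission')['selling_price'].agg(['mean', 'median'])
print(summary)
```

On the real data, the equivalent call would be `df.groupby('transmission')['selling_price'].agg(['mean', 'median'])`.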

How about the fuel and the price?

In [None]:
df.fuel.value_counts()

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(data=df, x='selling_price', hue='fuel')
plt.show()

## Converting the features' data types

There are many other features we could investigate in a similar way. But we're quickly faced with the problem that some of them, like the mileage, are numbers, but not stored as such:

In [None]:
df.head()

In [None]:
df.info()

We should convert some of the features stored as strings (`object`) to integers and floats. Specifically, the mileage, the engine size and the max power.

In [None]:
df.mileage.value_counts()

In [None]:
df.max_power.value_counts()

First we remove the units from the mileage:

In [None]:
df['mileage'] = df.mileage.str.replace(' kmpl', '')
df['mileage'] = df.mileage.str.replace(' km/kg', '')

Then we convert to floats:

In [None]:
df['mileage'] = df['mileage'].astype(float)

Let's do similarly for the others:

In [None]:
df.engine = df.engine.str.replace(' CC', '').astype(float)

In [None]:
df.max_power = df.max_power.str.replace(' bhp', '')
df.max_power = df.max_power.replace('', np.nan)        # Empty strings replaced by NaNs
df.max_power = df.max_power.astype(float)
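
An alternative to stripping each unit string by hand is to extract the numeric part with a regular expression and let pandas convert it. This is only a sketch on hypothetical values, not the approach used in the cells above. The entry `' bhp'` stands in for a value that carries a unit but no number.

```python
import pandas as pd

# Hypothetical raw values, including one without a number and one missing
raw = pd.Series(['74 bhp', '103.52 bhp', ' bhp', None])

# Pull out the first integer or decimal number; entries without one become NaN
numbers = raw.str.extract(r'(\d+(?:\.\d+)?)', expand=False)

# Convert the extracted strings to floats
max_power = pd.to_numeric(numbers, errors='coerce')
print(max_power.tolist())
```

The benefit is that the conversion does not depend on listing every unit that occurs in the column. The cost is that unit differences are hidden rather than handled, for example `kmpl` versus `km/kg` in `mileage`.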

In [None]:
df.head()

Rather than dealing with the heterogeneity of the torque feature, we'll simply drop it (feel free to do otherwise on your own!).

In [None]:
df.drop('torque', axis=1, inplace=True)

We'll also drop the name of the car. This is to simplify things. A better idea would be to use it to extract information about the make and model of the car. 

In [None]:
df.drop('name', axis=1, inplace=True)

This is now our data set:

In [None]:
df.head()

## Feature encoding

We need to represent the categorical features `fuel`, `transmission`, `seller_type` and `owner` as numbers for our machine learning models to work. 

How best to do such feature encoding is a relatively large topic. The short version: if the feature values are not related to each other in some *ordinal* way (i.e., there's no reason to treat one as "larger" than another), we can use a *one-hot encoding*; otherwise we use an ordinal encoder. 

In our case, `fuel`, `transmission` and `seller_type` are not ordinal features, while `owner` is (as it reflects the number of previous owners). 

We can use pandas to do the one-hot encoding:

In [None]:
one_hot = pd.get_dummies(df['fuel'])
df = df.join(one_hot)

In [None]:
one_hot = pd.get_dummies(df['transmission'])
df = df.join(one_hot)

In [None]:
one_hot = pd.get_dummies(df['seller_type'])
df = df.join(one_hot)

We get the following data frame:

In [None]:
df.head()

Now that we've stored the fuel, transmission and seller type information in one-hot encoded vectors, we can drop the original features:

In [None]:
df.drop(['fuel', 'transmission', 'seller_type'], axis=1, inplace=True)

In [None]:
df.head()
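
As a more compact alternative to the separate encode, join and drop steps above, `pd.get_dummies` can encode several columns of a data frame in one call. Here is a sketch on hypothetical data (the `toy` frame is made up for illustration).

```python
import pandas as pd

# Hypothetical data with two categorical columns and one numeric column
toy = pd.DataFrame({
    'fuel': ['Diesel', 'Petrol', 'CNG'],
    'transmission': ['Manual', 'Automatic', 'Manual'],
    'km_driven': [145500, 120000, 60000],
})

# Encode the listed columns; the originals are dropped automatically and
# the new columns are prefixed with the name of the column they came from
encoded = pd.get_dummies(toy, columns=['fuel', 'transmission'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The prefixed names (such as `fuel_Diesel`) make it clear where each column came from and avoid clashes if two features share a category name.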

For `owner` we'd like to keep the ordinal relationship:

In [None]:
df.owner.value_counts()

In [None]:
# Map each ownership category to an integer that preserves the ordering
owner_order = {
    'Test Drive Car': 0,
    'First Owner': 1,
    'Second Owner': 2,
    'Third Owner': 3,
    'Fourth & Above Owner': 4,
}
df['owner'] = df['owner'].map(owner_order)

In [None]:
df.head()
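
The same ordinal encoding can be sketched with scikit-learn's `OrdinalEncoder`, which is convenient when the encoding should become part of a preprocessing pipeline. This is an illustration on hypothetical data, not the method used above; the category order is passed explicitly so that the integers follow the number of owners.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The order of this list determines the integer assigned to each category
owner_levels = ['Test Drive Car', 'First Owner', 'Second Owner',
                'Third Owner', 'Fourth & Above Owner']

# Hypothetical owner values
toy = pd.DataFrame({'owner': ['First Owner', 'Third Owner',
                              'Test Drive Car', 'Fourth & Above Owner']})

encoder = OrdinalEncoder(categories=[owner_levels])

# The encoder expects a 2D input and returns a 2D float array
owner_encoded = encoder.fit_transform(toy[['owner']]).ravel()
print(owner_encoded)
# [1. 3. 0. 4.]
```

Without the `categories` argument, the encoder would sort the categories alphabetically, which would not match the intended ordering.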

## Setting up our $f: X \to y$

We'll store the features in `X` and the labels in `y`. Our goal is to approximate the function mapping `X` to `y`, where `y` is the `selling_price`:

<img src="assets/f_xy.png">

In [None]:
X = df.drop('selling_price', axis=1)
y = df['selling_price']

In [None]:
X.head()

In [None]:
y.head()

# Create training and test sets

> To stress a point repeated multiple times already: We're not interested in how well our models perform on the training set. What we're really after is how well they generalize to unseen data. 

The test set is meant to simulate unseen data (and should therefore not be touched when constructing and tuning our models). 

<img width=50% src="assets/testsplit.png"> 

It is important to make sure that the test set is a representative sample of the data. In our case, we want to make sure that it contains cars of all kinds of prices. 

We should base our decision of how to split the data on the explorations we've done above, and also on what the model is supposed to be used for (as that influences the kind of generalization estimate we want). Perhaps it is important to use the age of the car as part of the decision? Or the number of seats it has (perhaps we find it important that the test set contains at least some two-seaters)? And so on. 

In our case, we'll perform a _stratified split_ on `age_cat`, our new categorical feature representing the car's age. Since we saw above that newer cars tend to be more expensive, this also helps the test set cover a range of prices. 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=X.age_cat, random_state=42)

We now have 6096 instances for training and 2032 for testing:

In [None]:
len(X_train), len(X_test)

Their car age distributions are similar:

In [None]:
plt.hist(X_train.age_cat, alpha=0.5, label='train')
plt.hist(X_test.age_cat, alpha=0.5, label='test')
plt.legend(loc='upper right')
plt.show()
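
Besides comparing histograms, the effect of stratification can be checked numerically by comparing category proportions in the two sets. Here is a sketch on hypothetical data (the `toy` frame and its 60/30/10 category mix are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 60% 'new', 30% 'medium', 10% 'old'
toy = pd.DataFrame({
    'age_cat': ['new'] * 120 + ['medium'] * 60 + ['old'] * 20,
    'km_driven': np.arange(200),
})

# Default test_size is 0.25; stratify keeps the category mix in both parts
train, test = train_test_split(toy, stratify=toy['age_cat'], random_state=42)

# Proportion of each category in the two sets, side by side
train_props = train['age_cat'].value_counts(normalize=True)
test_props = test['age_cat'].value_counts(normalize=True)
comparison = pd.DataFrame({'train': train_props, 'test': test_props})
print(comparison)
```

In the notebook, the same comparison could be made with `X_train.age_cat` and `X_test.age_cat` before the `age_cat` column is dropped.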

After making the split we can drop the `age_cat` feature:

In [None]:
# Reassign instead of dropping in place, to avoid modifying a view of X
X_train = X_train.drop('age_cat', axis=1)
X_test = X_test.drop('age_cat', axis=1)

# Data preprocessing: Data cleaning, feature scaling and imputing missing data

Before we can use the data to train machine learning models, we need to make sure that it is "clean" and that the features are scaled, and we need to decide how to deal with missing data.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [None]:
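# Standardize features to zero mean and unit variance, fitting on the training set only.
# StandardScaler ignores NaNs when fitting and leaves them in place when transforming.
# The scaled copies are kept separately; X_train and X_test themselves are unchanged.
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)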

In [None]:
# The default strategy replaces each missing value with the mean of its column
imp = SimpleImputer()

In [None]:
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)
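
One common way to keep preprocessing steps in a fixed order, and to make sure they are fitted on the training data only, is a scikit-learn pipeline. The following is a sketch on hypothetical arrays rather than a replacement for the cells above. The median strategy is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features with missing values
X_train_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 20.0]])
X_test_toy = np.array([[np.nan, 40.0], [4.0, np.nan]])

# Impute first, then scale, so the scaler sees complete data
preprocess = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())

# Statistics (medians, means, standard deviations) come from the training set only
X_train_prep = preprocess.fit_transform(X_train_toy)

# The test set is transformed with the training-set statistics
X_test_prep = preprocess.transform(X_test_toy)

print(X_train_prep)
print(X_test_prep)
```

Tree-based models such as random forests are largely insensitive to feature scaling. Linear models trained with gradient descent, such as `SGDRegressor`, usually benefit from it.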

# Training a regression model

As for classification, we have a lot of choices when building our model. For now, we'll use some of the standard built-in models in scikit-learn. 

In [None]:
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import RandomForestRegressor

We'll start by trying out a random forest model:

In [None]:
rf_reg = RandomForestRegressor(random_state=42)

In [None]:
rf_reg.fit(X_train, y_train)

The model is now trained on the training data, and we can use it to make predictions for the test data:

In [None]:
y_pred = rf_reg.predict(X_test)

Here are some of the 2032 predictions from the Random Forest:

In [None]:
len(y_pred), y_pred[:10]

Here are some of the correct answers:

In [None]:
len(y_test), np.array(y_test)[:10]

Let's put them next to each other and print out the first few:

In [None]:
list(zip(y_test, y_pred))[:10] # "Zip" the two above arrays and display the first 10

We observe that the model is sometimes close to correct and sometimes way off. 

We can also make a scatter plot to compare the predicted prices against the actual prices:

In [None]:
plt.figure(figsize=(10,10))
sns.regplot(x=y_test, y=y_pred)
plt.show()

We see that the model at least isn't extremely bad.

> **But how good is it, really? Can we quantify its performance?** 

As we did for classification earlier, we need metrics that we can use to evaluate our models. As before, we can use these to compare different models and choices of model parameters. 

# Evaluating models / performance measures

First of all, as mentioned earlier, one should really ask "*What is the end goal for my system?*" We're supposed to create systems that are useful in some context, as part of a larger system, which typically has a higher-level goal that our system should aim to optimize. Perhaps it's worth sacrificing some predictive performance for speed, or for producing fewer prices that don't lead to sales?

However, we won't think about these broader context matters in these toy problems.

# TBC