In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.metrics import mean_squared_error

In [None]:
# Load the used_cars dataset
used_cars_df = pd.read_csv('../data/used_cars.csv')

In [None]:
used_cars_df

In [None]:
used_cars_df.info()

## Continuation

We're now going to build on the first lab of the week by introducing multiple linear regression in sklearn.

We'll begin with much the same EDA and cleaning as in the previous lab, plus a few extra steps (which you should have done in that lab's tasks), so you can simply follow along here.

## Exploratory data analysis (EDA) and data cleaning

The target for our regression problem here is the column *price*. 

Each row represents the characteristics of a car and its corresponding sales price. We are free to choose which of the available features to fit a model to and use to predict the target.

It's generally essential that we use our wits and domain expertise to pick and engineer good features for our model. Bad features will produce a bad model with poor predictive power. In other words, a useless model.

**Remove a redundant column**

In [None]:
# remove the first column which looks like a copy of the index column

columns_to_keep = used_cars_df.columns[1:]
print(columns_to_keep)

In [None]:
used_cars_df = used_cars_df[columns_to_keep]

**Keep only numerical columns, for now**

Many machine learning models require all inputs to be numerical (since you can't do mathematical operations on anything else). When using such a model, it is therefore essential to make sure that the data satisfies that condition.

Note that there are ways to transform any given column into numerical values that we can work with, but let's hold off on that for now and keep only the features that are already numerical.
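
As a taste of what such a transformation can look like, here is a minimal sketch of one-hot encoding with `pd.get_dummies`. The toy DataFrame and its `fuel_type` column are made up for illustration and are not taken from the used_cars dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration only
toy_df = pd.DataFrame({
    'fuel_type': ['petrol', 'diesel', 'petrol', 'cng'],
    'price': [4.5, 6.0, 3.8, 2.9],
})

# One-hot encode the categorical column: one 0/1 indicator column per category
encoded_df = pd.get_dummies(toy_df, columns=['fuel_type'], dtype=int)

print(encoded_df)
```

Every category becomes its own numerical column, which a linear model can then use like any other feature.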

In [None]:
used_cars_df.info()


In [None]:
used_cars_df = used_cars_df.select_dtypes(include=['int64', 'float64'])

used_cars_df

**Deal with missing data**

Let's try to find and mitigate missing data. Note that whether to remove data points is a very sensitive decision, and should be carefully considered. 

Augmenting and fixing the data is a better alternative, if the time to do so is available. 

All changes we make to the training data *will* affect our model's performance, either insignificantly or significantly, depending on which changes we've made and to what extent.
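
One simple way of fixing missing values rather than removing rows is imputation. The sketch below fills a gap with the column median. The toy values are made up, and median imputation is only one of several possible choices, not necessarily the right one for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy_df = pd.DataFrame({'kms_driven': [30000.0, np.nan, 50000.0, 70000.0]})

# Replace the missing value with the column median instead of dropping the row
median_kms = toy_df['kms_driven'].median()
imputed_df = toy_df.fillna({'kms_driven': median_kms})

print(imputed_df)
```

All four rows are kept, at the cost of one value that is a guess and not a measurement.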

In [None]:
# check for null-data

used_cars_df.isnull().sum()

In [None]:
used_cars_df = used_cars_df.dropna().reset_index(drop=True)

used_cars_df.isnull().sum()

**Dealing with unreasonable data**

Usually, we have to spend considerable time just cleaning the data and getting rid of the crap that has found its way into it.

Crap in data is very common in real life.

Let's begin by trying to understand the price column a bit better.

In [None]:
plt.hist(used_cars_df['price(in lakhs)'], bins=50);
plt.xlabel('price(in lakhs)');
plt.ylabel('count');

That's strange, it looks like there are a few cars that are extremely expensive. This is not incorrect per se, but let's look deeper.

In [None]:
used_cars_df[used_cars_df['price(in lakhs)']>100]

Ok, so we have 3 records of cars that look too suspicious.

Since Ali has been in India, he knows that 1 lakh is a common Indian unit that means one hundred thousand (Indian rupees, in this case).

70,000 lakhs is therefore 70,000 * 100,000 = 7,000,000,000 (Indian rupees).

Converting this to Swedish currency we get 877,447,200 SEK. Not reasonable at all.
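
The arithmetic can be sketched in a few lines of Python. Note that the exchange rate used here is simply the one implied by the two figures quoted in the text (an assumption for illustration), not a current rate.

```python
LAKH = 100_000  # one lakh is one hundred thousand

price_in_lakhs = 70_000
price_in_inr = price_in_lakhs * LAKH

# SEK per INR, back-computed from the figures quoted in the text (not a live rate)
sek_per_inr = 877_447_200 / 7_000_000_000
price_in_sek = price_in_inr * sek_per_inr

print(price_in_inr)
print(round(price_in_sek))
```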

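Let's just remove these records for simplicity, and plot again. Note that the filter below keeps only cars priced under 12.5 lakhs, which removes these records along with any other cars priced at 12.5 lakhs or more.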

In [None]:
used_cars_df = used_cars_df[used_cars_df['price(in lakhs)']<12.5].reset_index(drop=True)

plt.hist(used_cars_df['price(in lakhs)'], bins=50);

plt.xlabel('price(in lakhs)');
plt.ylabel('count');

Ah, now it looks much more realistic!

Ok, great. Let's also take a look at kms_driven

In [None]:
plt.hist(used_cars_df['kms_driven'], bins=50, color='green');

plt.xlabel('kms_driven');
plt.ylabel('count');

Well, this also looks a little suspicious. Perhaps?

In [None]:
used_cars_df[used_cars_df['kms_driven']>150000]

Ok, so there are only 6 cars that have been driven over 150,000 km. Let's remove them, since their values deviate too much from the rest and could therefore badly hurt the model's performance.

In [None]:
used_cars_df = used_cars_df[used_cars_df['kms_driven']<150000].reset_index(drop=True)

plt.hist(used_cars_df['kms_driven'], bins=50, color='green');

plt.xlabel('kms_driven');
plt.ylabel('count');

Remember, we now don't have any cars with high mileage in our training data at all. This is perfectly fine, but you should **not** try to use this resulting model to predict the price of cars with high mileage!
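
A simple guard against this kind of extrapolation is to check whether a new value falls inside the range seen during training. The helper `is_within_training_range` and the toy values below are hypothetical, just a sketch of the idea.

```python
import numpy as np

# Hypothetical training values for kms_driven, all below the 150000 cut-off
train_kms = np.array([12000, 45000, 80000, 149000])

def is_within_training_range(value, train_values):
    """Return True if value lies between the smallest and largest training value."""
    return bool(train_values.min() <= value <= train_values.max())

print(is_within_training_range(60000, train_kms))
print(is_within_training_range(220000, train_kms))
```

A car at 220000 km falls outside the range, so a prediction for it would be pure extrapolation.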

Let's look at mileage(kmpl) now. This is the fuel economy in kilometers per liter. What are reasonable values for this, do you think?

In [None]:
used_cars_df = used_cars_df[used_cars_df['mileage(kmpl)']<30].reset_index(drop=True)

plt.hist(used_cars_df['mileage(kmpl)'], bins = 20, color='red')

In [None]:
plt.scatter(used_cars_df['mileage(kmpl)'], used_cars_df['price(in lakhs)']);

If we look carefully, engine(cc) and max_power(bhp) contain the same values, so one of them is redundant!
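
Rather than trusting our eyes, we can check this kind of thing programmatically. Here is a minimal sketch on a made-up toy DataFrame (the values are invented; only the column names mirror the dataset). Note that this elementwise comparison treats NaN as unequal, so it is best run after missing values have been handled.

```python
import pandas as pd

# Hypothetical toy data in which two columns hold identical values
toy_df = pd.DataFrame({
    'engine(cc)': [1197.0, 1498.0, 998.0],
    'max_power(bhp)': [1197.0, 1498.0, 998.0],
    'seats': [5.0, 5.0, 4.0],
})

# True only if every row has the same value in both columns
same_columns = bool((toy_df['engine(cc)'] == toy_df['max_power(bhp)']).all())
different_columns = bool((toy_df['engine(cc)'] == toy_df['seats']).all())

print(same_columns, different_columns)
```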

In [None]:
used_cars_df

In [None]:
used_cars_df = used_cars_df.drop(columns=['max_power(bhp)'])

In [None]:
used_cars_df

Alright, looks like we have a good range with a good amount of samples

## Linear Regression

In [None]:
used_cars_df.head(2)

Alright, now let's try to use all our features to predict our target, the sales price.

We'll use a linear model for this. In other words, we'll now assume that we can model the price as

$$ price = w_5 \cdot (seats) + w_4 \cdot (kms\ driven) + w_3 \cdot (mileage(kmpl)) + w_2 \cdot (engine(cc)) + w_1 \cdot (torque(Nm)) + w_0$$

**Note** that we'll have a difficult time making any plots here, since our eyes are limited to 3D while we have 5 features.
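
To make the formula concrete, here is a small sketch on synthetic data (3 made-up features instead of our 5, with weights chosen arbitrarily). It shows that a fitted `LinearRegression` predicts exactly this weighted sum of the features plus the intercept $w_0$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target is an exact linear combination of 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])   # arbitrary weights chosen for illustration
true_w0 = 3.0                         # arbitrary intercept
y_toy = X_toy @ true_w + true_w0

toy_model = LinearRegression().fit(X_toy, y_toy)

# The prediction is the weighted sum of the features plus the intercept
manual_pred = X_toy @ toy_model.coef_ + toy_model.intercept_

print(np.allclose(manual_pred, toy_model.predict(X_toy)))
print(toy_model.coef_, toy_model.intercept_)
```

Since the toy target contains no noise, the fitted weights recover the ones used to generate the data. With real data, such as our used cars, that will not be the case.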

In [None]:
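# features: every column except the target, so all 5 features from the formula above (including seats)
X = used_cars_df.drop(columns=['price(in lakhs)']).values

# target: the sales price
y = used_cars_df['price(in lakhs)'].values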

In [None]:
# all our features are here now

print(X.shape)

# our target column has the same shape as before

print(y.shape)

In [None]:
# X is already 2-D: (n_samples, n_features). sklearn also accepts a 1-D y,
# but we reshape it into a column vector (n_samples, 1) to keep the shapes explicit

y = y.reshape(-1,1)

print(y.shape)

**Train/test split**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # set a random state, so we can reproduce our results

print('Train set:')
print('X:', len(X_train))
print('y:', len(y_train), end='\n\n')

print('Test set:')
print('X:', len(X_test))
print('y:', len(y_test))

In [None]:
# import a linear regression model and the MSE-metric from sklearn

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

multiple_linear_regression_model = LinearRegression()     # note that this model has innate support for several features!

In [None]:
# train the model

multiple_linear_regression_model.fit(X_train, y_train);

In [None]:
# print the weights of the trained model

print(multiple_linear_regression_model.intercept_)
print(multiple_linear_regression_model.coef_)

In [None]:
# predictions for both the train and test set

y_train_hat = multiple_linear_regression_model.predict(X_train)
y_test_hat = multiple_linear_regression_model.predict(X_test)

In [None]:
# calculate MSE for both sets

print('Train:')
print(f'MSE: {mean_squared_error(y_train, y_train_hat)}')

print('Test:')
print(f'MSE: {mean_squared_error(y_test, y_test_hat)}')

In [None]:
plt.scatter(y_test, y_test_hat);
plt.xlabel('y_test')
plt.ylabel('y_test_hat')


How does this result compare to the regression models you built using only one feature in the previous lab?
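
To illustrate what such a comparison can look like, here is a sketch on synthetic data where the target depends on three made-up features. It is constructed so that using all features helps; on real data, that is something you have to verify, not assume.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on all 3 features, plus a little noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
noise = rng.normal(scale=0.1, size=500)
y_toy = 1.5 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 0.7 * X_toy[:, 2] + noise

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# One model using only the first feature, one using all features
single_model = LinearRegression().fit(X_tr[:, [0]], y_tr)
multi_model = LinearRegression().fit(X_tr, y_tr)

mse_single = mean_squared_error(y_te, single_model.predict(X_te[:, [0]]))
mse_multi = mean_squared_error(y_te, multi_model.predict(X_te))

print(f'Single feature, test MSE: {mse_single}')
print(f'All features, test MSE: {mse_multi}')
```

Comparing the test MSE of both models on the same split is what makes the comparison fair.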

---

## Challenges

**Task 1**

Try re-training the multiple linear regression model above, each time removing a single feature. I.e., you should train 5 models with the following features respectively:

1. $x_1, x_2, x_3, x_4$

2. $x_1, x_2, x_3, x_5$

3. $x_1, x_2, x_4, x_5$

4. $x_1, x_3, x_4, x_5$

5. $x_2, x_3, x_4, x_5$

Do any of the above combinations of features yield a better-performing model? Or perhaps our efforts to model used car prices using only these features from this specific dataset are simply futile...