In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the used cars dataset
used_cars_df = pd.read_csv('../data/used_cars.csv')

## Exploratory data analysis (EDA) and data cleaning

The target for our regression problem is the price column, *price(in lakhs)*.

Each row represents the characteristics of a car and its corresponding sales price. We are free to choose which of the available features to fit a model to and use to predict the target.

It's generally essential that we use our wits and domain expertise to pick and engineer good features for our model. Bad features will produce a bad model with poor predictive power. In other words, a useless model.

**Remove a redundant column**

In [None]:
# remove the first column which looks like a copy of the index column

columns_to_keep = used_cars_df.columns[1:]
print(columns_to_keep)

In [None]:
used_cars_df = used_cars_df[columns_to_keep]

used_cars_df

**Keep only numerical columns, for now**

Many machine learning models require all inputs to be numerical (since you can't do mathematical operations on anything else). When using such a model, we must therefore make sure that the data satisfies that condition.

Note that there are ways to transform any given column into numerical values that we can work with, but let's hold off on that for now and only keep the features that are already numerical.

In [None]:
used_cars_df.info()


In [None]:
used_cars_df = used_cars_df.select_dtypes(include=['int64', 'float64'])

used_cars_df

**Deal with missing data**

Let's try to find and mitigate missing data. Note that whether to remove data points is a very sensitive decision, and should be carefully considered. 

Augmenting and fixing the data is a better alternative, if the time to do so is available. 

All changes we make to the training data *will* affect our model's performance, either insignificantly or significantly, depending on which changes we've made and to what extent.

In [None]:
# check for null-data

used_cars_df.isnull().sum()

We'll opt for the lazy way out here and remove the rows with null data, since they make up a very small number of records (not a good argument, by the way).

This is generally not a recommended approach, though. It might be well worth fixing the data instead.
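As a rough illustration of what "fixing the data" could look like, here is a minimal sketch that compares dropping rows with filling gaps using the column median. The toy values are made up, and median imputation is only one of several possible strategies, not necessarily the right one for this dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table; the values are made up.
toy_df = pd.DataFrame({
    'kms_driven': [40000.0, np.nan, 65000.0, 20000.0],
    'price(in lakhs)': [5.5, 7.2, 6.1, 9.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy_df.dropna().reset_index(drop=True)

# Option 2: fill each missing value with the median of its column.
imputed = toy_df.fillna(toy_df.median())

print('rows before:', len(toy_df))
print('rows after dropping:', len(dropped))
print('rows after imputing:', len(imputed))
print(imputed)
```

Dropping loses a record whose price was perfectly usable, while imputing keeps it at the cost of inventing a plausible feature value.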

**Question**: What consequences on the data does our decision to remove these records potentially have?

In [None]:
used_cars_df = used_cars_df.dropna().reset_index(drop=True)

used_cars_df.isnull().sum()

**Dealing with unreasonable data**

Usually, we have to spend considerable time just cleaning the data and getting rid of the crap that has wormed its way into it.

Crap in data is very common in real life.

Let's begin by trying to understand the price column a bit better.

In [None]:
plt.hist(used_cars_df['price(in lakhs)'], bins=50);
plt.xlabel('price(in lakhs)');
plt.ylabel('count');

That's strange, it looks like there are a few cars that are extremely expensive. This is not incorrect per se, but let's look deeper.

In [None]:
# show the most expensive cars first
used_cars_df.sort_values('price(in lakhs)', ascending=False).head(10)

Ok, so we have 3 records of cars that look too suspicious.

Since Ali has been in India, he knows that 1 lakh is a common Indian unit that means one hundred thousand (Indian rupees, in this case).

70 000 lakhs is therefore 70 000 * 100 000 = 7 000 000 000 (Indian rupees).

Converting this to Swedish currency we get 877 447 200 SEK. Not reasonable at all.
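The arithmetic above can be double-checked in a few lines. This is only a sketch: the exchange rate is not an official or current rate, it is simply the one implied by the SEK figure quoted in the text.

```python
LAKH = 100_000                      # 1 lakh = one hundred thousand

price_in_lakhs = 70_000
price_in_inr = price_in_lakhs * LAKH
print(price_in_inr)                 # 7000000000

# Exchange rate implied by the SEK figure quoted above (not a current rate).
price_in_sek = 877_447_200
implied_sek_per_inr = price_in_sek / price_in_inr
print(implied_sek_per_inr)
```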

Let's just remove these records for simplicity, and plot again

In [None]:
used_cars_df = used_cars_df[used_cars_df['price(in lakhs)']<12.5].reset_index(drop=True)

plt.hist(used_cars_df['price(in lakhs)'], bins=50);

plt.xlabel('price(in lakhs)');
plt.ylabel('count');

Ah, now it looks much more realistic!

Ok, great. Let's also take a look at kms_driven

In [None]:
plt.hist(used_cars_df['kms_driven'], bins=50, color='green');

plt.xlabel('kms_driven');
plt.ylabel('count');

Well, this also looks a little suspicious. Perhaps?

In [None]:
used_cars_df[used_cars_df['kms_driven']>150000]

Ok, so there are only 6 cars that have been driven over 150 000 km. Let's remove them, since their values deviate too much from the rest of the data and might therefore deteriorate the model's performance.

We can of course do this if we'd like, but let's think for a moment before doing so. What limitations are we putting on our model by removing these records?

In [None]:
used_cars_df = used_cars_df[used_cars_df['kms_driven']<=150000].reset_index(drop=True)  # keep cars at or below the 150000 km limit

plt.hist(used_cars_df['kms_driven'], bins=50, color='green');

plt.xlabel('kms_driven');
plt.ylabel('count');

Alright, looks like we have a good range.

---

**Bonus task**


Try to do some analysis (perhaps plots and calculating simple metrics such as min, max, mean, std etc.) on each of these remaining columns. Is there something in particular you find interesting? 

Can we do something about it? If you find any notable outliers, remove them for now.
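As a starting point, here is a minimal sketch of the kind of summary the task asks for, run on a made-up column with one planted extreme value. The numbers are purely illustrative and are not from the real dataset.

```python
import pandas as pd

# Toy column with one planted extreme value; the numbers are made up.
toy_df = pd.DataFrame({'kms_driven': [12000, 35000, 41000, 58000, 900000]})

# count, mean, std, min, quartiles and max in one call
summary = toy_df['kms_driven'].describe()
print(summary)

# A large gap between the 75th percentile and the max hints at outliers.
ratio = summary['max'] / summary['75%']
print('max / 75th percentile:', ratio)
```

The same call works on a whole DataFrame, in which case it returns one column of statistics per numerical feature.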

---

## Linear Regression

We'll begin our model fitting by limiting ourselves to a single feature.

We'll try to predict car prices using kms_driven as the sole feature, for now.

Note that this is obviously very limiting, but we do it for pedagogical reasons, both to get used to the scikit-learn package and to learn an important lesson...

In other words, we'll now assume that we can model

$$ price = w_1 \cdot (kms\ driven) + w_0 $$
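To get a feel for what fitting such a model means, the sketch below generates synthetic data from a known straight line and recovers the two weights with a least-squares fit. The "true" values of w_1 and w_0 are made up for illustration, and `np.polyfit` is used here only as a quick stand-in for the scikit-learn model used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the "true" w_1 and w_0 below are made up for illustration.
true_w1, true_w0 = -0.00005, 10.0
kms = rng.uniform(0, 150_000, size=200)
price = true_w1 * kms + true_w0 + rng.normal(0, 0.5, size=200)

# Least-squares fit of a degree-1 polynomial; returns [w_1, w_0].
w1, w0 = np.polyfit(kms, price, deg=1)

print('estimated w_1:', w1)
print('estimated w_0:', w0)
```

When the data really does follow a line plus noise, the estimates land close to the true weights. Whether the car data behaves like that is exactly what we are about to find out.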

In [None]:
X, y = used_cars_df['kms_driven'].values, used_cars_df['price(in lakhs)'].values

In [None]:
plt.scatter(X, y)
plt.xlabel('kms_driven');
plt.ylabel('price(in lakhs)');

#plt.ylim()

**Train/test split**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40) # set a random state, so we can reproduce our results

print('Train set:')
print('X:', len(X_train))
print('y:', len(y_train), end='\n\n')

print('Test set:')
print('X:', len(X_test))
print('y:', len(y_test))

In [None]:
plt.scatter(X_train, y_train, label = 'train')
plt.scatter(X_test, y_test, label = 'test')

plt.xlabel('kms_driven');
plt.ylabel('price(in lakhs)');

plt.legend();

In [None]:
from scipy import stats

stats.pearsonr(used_cars_df['kms_driven'], used_cars_df['price(in lakhs)'])

In [None]:
# import a linear regression model and the MSE-metric from sklearn

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
# this initializes a linear regression model. It has not been trained on anything yet

linear_regression_model = LinearRegression()

These models require a 2D input of shape (n_samples, n_features), but our current data is 1D.

In [None]:
print(X_train.shape)
print(y_train.shape)

We can mitigate this using the .reshape method

In [None]:
X_train = X_train.reshape(len(X_train), 1)
y_train = y_train.reshape(len(y_train), 1)

print(X_train.shape)
print(y_train.shape)

Note that we can also pass -1 to the .reshape method, which then infers that dimension from the size of the data.
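A small standalone sketch of how -1 behaves in `.reshape`, using a made-up array:

```python
import numpy as np

x = np.array([10, 20, 30, 40])       # a flat 1D array
print(x.shape)

# -1 means "work this dimension out from the array's size".
column = x.reshape(-1, 1)            # 4 samples, 1 feature
print(column.shape)

row = x.reshape(1, -1)               # 1 sample, 4 features
print(row.shape)
```

For a single feature, the column form is the one scikit-learn expects: one row per sample.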

In [None]:
X_test = X_test.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

print(X_test.shape)
print(y_test.shape)

Let's train!

In [None]:
linear_regression_model.fit(X_train, y_train);

In [None]:
linear_regression_model.intercept_

In [None]:
linear_regression_model.coef_

In [None]:
y_train_hat = linear_regression_model.predict(X_train)
y_test_hat = linear_regression_model.predict(X_test)

In [None]:
# compute the loss on the training set

print(mean_squared_error(y_train, y_train_hat))

print(np.sqrt(mean_squared_error(y_train, y_train_hat)))

In [None]:
# compute the loss on the test set

print(mean_squared_error(y_test, y_test_hat))

print(np.sqrt(mean_squared_error(y_test, y_test_hat)))

In [None]:
plt.scatter(X_test, y_test)
plt.scatter(X_test, y_test_hat)

plt.xlabel('kms_driven');
plt.ylabel('price(in lakhs)');
plt.title('Test data performance analysis');

Is the above result any good? Can you draw any conclusions?
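One possible way to judge whether an RMSE is "good" is to compare it with a trivial baseline that ignores the feature and always predicts the mean target, and to look at the R² score (`r2_score` was imported at the top but not used so far). The sketch below does this on synthetic data with a deliberately weak signal. The numbers are made up, and for brevity everything is evaluated on the data the model was fitted on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic data with a weak linear signal buried in noise (made-up numbers).
X_toy = rng.uniform(0, 100, size=(300, 1))
y_toy = 0.01 * X_toy[:, 0] + rng.normal(0, 2.0, size=300)

model = LinearRegression().fit(X_toy, y_toy)
y_hat = model.predict(X_toy)

# Baseline: ignore the feature and always predict the mean target.
y_baseline = np.full_like(y_toy, y_toy.mean())

rmse_model = np.sqrt(mean_squared_error(y_toy, y_hat))
rmse_baseline = np.sqrt(mean_squared_error(y_toy, y_baseline))
r2 = r2_score(y_toy, y_hat)

print('RMSE model:   ', rmse_model)
print('RMSE baseline:', rmse_baseline)
print('R^2:          ', r2)
```

If the model's RMSE is barely below the baseline (equivalently, R² is close to 0), the feature explains very little of the variation in the target.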

If we really insisted, for some reason, on fitting a straight line to our data - what would be the correct course of action?

**Polynomial regression**

If we for some reason believe that a single feature in itself is not good enough, and that we might need powers of that feature, we can try to fit a polynomial model instead.

In [None]:
X_train.shape

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# assume that we want to create features for polynomial of degree two

poly_transform = PolynomialFeatures(degree=2, include_bias=False) # this initializes our transformer

X_train_polynomial = poly_transform.fit_transform(X_train)

X_test_polynomial = poly_transform.transform(X_test)           # note that we do NOT fit on the test set, only transform it. More on this later.



What we have now created are powers of our feature kms_driven.

In other words, we now have two columns:

$kms\ driven,\ (kms\ driven)^2$
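To see the two columns concretely, here is a small sketch that applies the same transformer to a made-up array with easy numbers:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0], [4.0]])   # one feature, three samples

poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

# column 0 is x, column 1 is x**2
print(x_poly)
```

With `include_bias=False` there is no extra column of ones. The intercept is handled by the regression model itself.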

In [None]:
X_train_polynomial

We can now use these as features for our linear regression model we imported earlier (which supports multiple features).

This allows us to model

$$ price = w_2 \cdot (kms\ driven)^2 + w_1 \cdot (kms\ driven) + w_0 $$
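As a sanity check of this idea, the sketch below fits the quadratic model to noise-free synthetic data generated from known weights and recovers them. The weights are made up. It also chains the transformer and the regressor with scikit-learn's `make_pipeline`, which is a convenience not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free synthetic data from a known quadratic (weights are made up):
# y = 2.0 * x^2 - 1.0 * x + 0.5
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2.0 * x[:, 0] ** 2 - 1.0 * x[:, 0] + 0.5

# The pipeline applies the polynomial transform, then fits the linear model.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(x, y)

linreg = model.named_steps['linearregression']
print('w_1, w_2:', linreg.coef_)
print('w_0:', linreg.intercept_)
```

A pipeline fits the transformer only on the data passed to `.fit`, so calling `model.predict` on test data later transforms it without refitting.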

In [None]:
# initialize and fit the model

polynomial_regression_degree_2_model = LinearRegression()      # NOTE: this model handles both a single feature and multiple features

polynomial_regression_degree_2_model.fit(X_train_polynomial, y_train)

In [None]:
polynomial_regression_degree_2_model.intercept_

In [None]:
polynomial_regression_degree_2_model.coef_

In [None]:
# predict and calculate loss

y_train_hat = polynomial_regression_degree_2_model.predict(X_train_polynomial)
y_test_hat = polynomial_regression_degree_2_model.predict(X_test_polynomial)

print('Train MSE:', mean_squared_error(y_train, y_train_hat))
print('Test MSE:', mean_squared_error(y_test, y_test_hat))

print('Train RMSE:', np.sqrt(mean_squared_error(y_train, y_train_hat)))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_test_hat)))

In [None]:
plt.scatter(X_test, y_test)
plt.scatter(X_test, y_test_hat)

plt.xlabel('kms_driven');
plt.ylabel('price(in lakhs)');

Is that really better in any meaningful way?

Hmm... Is there some sort of conclusion we can draw here?

What should we do?

---

## Challenges

**Task 1**

Clearly, modeling car prices using kms_driven alone seems difficult, at least with linear or polynomial models.

But if you were forced to do it anyway, what should we do? Look at the plots above and see if you can come up with an idea.

<details>
  <summary>Answer</summary>
  Predicting prices for both expensive and cheaper cars simultaneously with one model doesn't seem to be a good idea. We might get much better performance if we instead split those strata and train a model on each separately.
</details>



**Task 2**

Create two datasets from used_cars_df, called used_expensive_cars_df and used_cheap_cars_df. Define cheap to be a car that costs 12.5 lakh or less.

Train a linear model on the used_cheap_cars_df data and try to predict the sales price using only kms_driven.

What do you end up with? Is it better than before? 

*Hint:* You can use pretty much everything we've done above!
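If the splitting step is unfamiliar, here is a minimal sketch of splitting a DataFrame in two with a boolean mask. It uses a made-up toy frame, not the real data, and the variable names are only illustrative.

```python
import pandas as pd

# Toy prices in lakhs; the values are made up.
toy_df = pd.DataFrame({'price(in lakhs)': [3.2, 12.5, 18.0, 7.9, 45.0]})

threshold = 12.5
is_cheap = toy_df['price(in lakhs)'] <= threshold   # True for cheap cars

cheap_df = toy_df[is_cheap].reset_index(drop=True)
expensive_df = toy_df[~is_cheap].reset_index(drop=True)   # ~ negates the mask

print(len(cheap_df), len(expensive_df))
```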

**Task 3**

Instead of kms_driven, now try using one of the other available features to model price. Do you get better performance? Limit your analysis to cheap cars.

Which of the features seems to be the single best one at predicting car price?

*Hint:* You might need to deal with some unreasonable data in the other features as well.



**Task 4**

We have not yet learned how to use non-numeric columns as features for ML models. However, see if you can still analyze the original columns we removed. Look for outliers, faulty data and other irregularities.