## Homework

> Note: sometimes your answer doesn't match one of the options exactly. That's fine.
> Select the option that's closest to your solution.

### Dataset

In this homework, we will use the Car price dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

We'll keep working with the `MSRP` variable, and we'll turn the problem into a classification task.

### Features

For the rest of the homework, you'll need to use only these columns:

* `Make`,
* `Model`,
* `Year`,
* `Engine HP`,
* `Engine Cylinders`,
* `Transmission Type`,
* `Vehicle Style`,
* `highway MPG`,
* `city mpg`

### Data preparation

* Select only the features from above, plus the target `MSRP`, and transform their names using the following line:
  ```python
  data.columns = data.columns.str.replace(' ', '_').str.lower()
  ```
* Fill in the missing values of the selected features with 0.
* Rename the `MSRP` variable (`msrp` after the name transformation) to `price`.

### Question 1

What is the most frequent observation (mode) for the column `transmission_type`?

- `AUTOMATIC`
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`


### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
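
As a minimal sketch of this idea, the snippet below builds a correlation matrix for a toy dataframe and picks the pair of distinct columns with the largest absolute correlation. The column names `a`, `b`, `c` and their values are made up for illustration and are not part of the homework dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the car price dataset).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],  # roughly 2 * a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Pearson correlation coefficient between every pair of columns.
corr = df.corr()

# Keep only the upper triangle; k=1 skips the diagonal of self-correlations.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pair of distinct columns with the largest absolute correlation.
best_pair = pairs.abs().idxmax()
print(best_pair, round(pairs.loc[best_pair], 3))
```

Masking everything except the upper triangle avoids both the trivial self-correlations on the diagonal and counting each pair twice.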

What are the two features that have the biggest correlation in this dataset?

- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
- `highway_mpg` and `city_mpg`


### Make `price` binary

* Now we need to turn the `price` variable from numeric into a binary format.
* Let's create a variable `above_average` which is `1` if the `price` is above its mean value and `0` otherwise.
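
A small illustration of this thresholding, assuming a toy `prices` series with made-up values. One expensive car pulls the mean up, so only that car lands in the positive class.

```python
import pandas as pd

# Hypothetical prices; the mean is 40,000.
prices = pd.Series([10_000, 20_000, 30_000, 100_000], name="price")

# 1 where the price is above the mean, 0 otherwise.
above_average = (prices > prices.mean()).astype(int)
print(above_average.tolist())  # [0, 0, 0, 1]
```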

### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value (`price`) is not in your dataframe.
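
One way to get a 60%/20%/20% split with `train_test_split` is to call it twice, as sketched below on a toy dataframe. The dataframe and the `df_*` names are illustrative assumptions. The second call uses `test_size=0.25` because 25% of the remaining 80% is 20% of the whole dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe with 100 rows (hypothetical data).
df = pd.DataFrame({"x": range(100)})

# First split: hold out 20% of all rows as the test set.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set.
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

print(len(df_train), len(df_val), len(df_test))  # 60 20 20
```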

### Question 3

* Calculate the mutual information score between `above_average` and other categorical variables in our dataset. 
  Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.
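
To build some intuition for the score, here is a minimal sketch on made-up data. The lists `target`, `informative` and `uninformative` are assumptions for illustration. A variable that fully determines the target gets a high score, and one that is independent of it scores zero.

```python
from sklearn.metrics import mutual_info_score

# Hypothetical binary target and two categorical variables.
target = [1, 1, 0, 0, 1, 1, 0, 0]
informative = ["a", "a", "b", "b", "a", "a", "b", "b"]    # determines the target
uninformative = ["x", "y", "x", "y", "x", "y", "x", "y"]  # independent of the target

mi_informative = mutual_info_score(target, informative)
mi_uninformative = mutual_info_score(target, uninformative)

print(round(mi_informative, 2), round(mi_uninformative, 2))
```

`mutual_info_score` reports the result in nats, so the informative variable scores about ln 2 ≈ 0.69 here.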

Which of these variables has the lowest mutual information score?
  
- `make`
- `model`
- `transmission_type`
- `vehicle_style`


### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.60
- 0.72
- 0.84
- 0.95


### Question 5

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 
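
The loop described above can be sketched on synthetic data as follows. The `signal` and `noise` columns and the helper `fit_and_score` are hypothetical names introduced for this example. `signal` determines the target and `noise` carries no information about it.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): the target depends only on `signal`.
pattern = pd.DataFrame({
    "signal": [-2, -1, 1, 2, -2, -1, 1, 2],
    "noise": [0, 1, 0, 1, 1, 0, 1, 0],
})
df_train = pd.concat([pattern] * 5, ignore_index=True)
df_val = pattern.copy()
y_train = (df_train["signal"] > 0).astype(int)
y_val = (df_val["signal"] > 0).astype(int)


def fit_and_score(columns):
    """Train on the given columns and return the validation accuracy."""
    model = LogisticRegression(solver="liblinear", C=10, max_iter=1000, random_state=42)
    model.fit(df_train[columns], y_train)
    return (model.predict(df_val[columns]) == y_val).mean()


features = ["signal", "noise"]
baseline = fit_and_score(features)

# Signed difference between the baseline and the accuracy without each feature.
diffs = {}
for feature in features:
    remaining = [f for f in features if f != feature]
    diffs[feature] = baseline - fit_and_score(remaining)

print(baseline, diffs)
```

On this synthetic data, dropping `noise` leaves the accuracy unchanged, while dropping `signal` cuts it to chance level. That makes `noise` the least useful feature.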

Which of the following features has the smallest difference?

- `year`
- `engine_hp`
- `transmission_type`
- `city_mpg`

> **Note**: the difference doesn't have to be positive


### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with the `'sag'` solver. Set the seed to `42`.
* This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Round your RMSE scores to 3 decimal digits.
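
As a rough sketch of the target transformation and the metric, the example below applies `np.log1p` to made-up prices and computes the RMSE on the log scale. The `prices` values and the hand-written "predictions" are assumptions for illustration, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical prices and their log-transformed targets.
prices = np.array([20_000.0, 35_000.0, 50_000.0])
y_true = np.log1p(prices)

# Pretend predictions that are off by 0.1 on the log scale.
y_pred = y_true + np.array([0.1, -0.1, 0.1])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(round(float(rmse), 3))  # 0.1

# np.expm1 inverts np.log1p if prices are needed on the original scale.
restored = np.expm1(y_true)
```

Because the target is log-transformed, the RMSE is measured in log units rather than in dollars.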

Which of these alphas leads to the best RMSE on the validation set?

- 0
- 0.01
- 0.1
- 1
- 10

> **Note**: If there are multiple options, select the smallest `alpha`.


## Submit the results

* Submit your results here: https://forms.gle/FFfNjEP4jU4rxnL26
* You can submit your solution multiple times. In this case, only the last submission will be used.
* If your answer doesn't match the options exactly, select the closest one.


## Deadline

The deadline for submitting is 2 October (Monday), 23:00 CEST.

After that, the form will be closed.

# Solution

In [218]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

## Features

In [219]:
FEATURES = ['Make',
            'Model',
            'Year',
            'Engine HP',
            'Engine Cylinders',
            'Transmission Type',
            'Vehicle Style',
            'highway MPG',
            'city mpg']

In [220]:
data = pd.read_csv('./data.csv')

## Data preparation

In [221]:
data = data[FEATURES+['MSRP']]
data.columns = data.columns.str.replace(' ', '_').str.lower()
data.fillna(0, inplace=True)
data.rename(columns={'msrp':'price'}, inplace=True)

In [222]:
data.columns

Index(['make', 'model', 'year', 'engine_hp', 'engine_cylinders',
       'transmission_type', 'vehicle_style', 'highway_mpg', 'city_mpg',
       'price'],
      dtype='object')

## Question 1

In [223]:
data.transmission_type.mode()

0    AUTOMATIC
Name: transmission_type, dtype: object

## Question 2

In [224]:
NUMERICAL_FEATURES = [ 'engine_hp', 'year', 'engine_cylinders', 'highway_mpg', 'city_mpg']
data[NUMERICAL_FEATURES].corr()

|  | engine_hp | year | engine_cylinders | highway_mpg | city_mpg |
|---|---|---|---|---|---|
| engine_hp | 1.0 | 0.338714 | 0.774851 | -0.415707 | -0.424918 |
| year | 0.338714 | 1.0 | -0.040708 | 0.25824 | 0.198171 |
| engine_cylinders | 0.774851 | -0.040708 | 1.0 | -0.614541 | -0.587306 |
| highway_mpg | -0.415707 | 0.25824 | -0.614541 | 1.0 | 0.886829 |
| city_mpg | -0.424918 | 0.198171 | -0.587306 | 0.886829 | 1.0 |


The two features with the highest correlation are `highway_mpg` and `city_mpg` (0.886829).

### Make `price` binary

In [225]:
data['above_average'] = (data.price > data.price.mean())*1

### Split the data

In [252]:
X = data.drop('price', axis=1)

data_full_train, data_test = train_test_split(X, test_size=0.2, random_state=42)

data_train, data_val = train_test_split(data_full_train, test_size=0.25, random_state=42)

## Question 3

In [228]:
data_full_train.dtypes

make                  object
model                 object
year                   int64
engine_hp            float64
engine_cylinders     float64
transmission_type     object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
above_average          int32
dtype: object

In [229]:
from sklearn.metrics import mutual_info_score

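for c in data_full_train.select_dtypes('object').columns:
    # Note: this uses data_full_train (train + validation rows);
    # the question asks for the training set only (data_train).
    score = mutual_info_score(data_full_train.above_average, data_full_train[c])
    print(f'{c}: {score:.2f}')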

make: 0.24
model: 0.46
transmission_type: 0.02
vehicle_style: 0.08


## Question 4

In [230]:
dv = DictVectorizer()
dicts_train = data_train.drop('above_average', axis=1).to_dict(orient='records')

dv.fit(dicts_train)

X_train = dv.transform(dicts_train)

In [231]:
model_q4 = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
y_train = data_train.above_average.values
model_q4.fit(X_train, y_train)

In [232]:
dicts_val = data_val.drop('above_average', axis=1).to_dict(orient='records')

X_val = dv.transform(dicts_val)

y_pred = model_q4.predict_proba(X_val)[:,1]
y_pred

array([3.07033780e-04, 9.97720253e-01, 6.97938861e-05, ...,
       1.01043039e-04, 9.91013704e-01, 9.90912809e-01])

In [233]:
y_pred = (y_pred>=0.5)*1

In [234]:
y_val = data_val.above_average.values
(y_pred==y_val).mean().round(2)

0.95

## Question 5

In [235]:
y_train = data_train.above_average.values
data_train.drop('above_average', axis=1, inplace=True)

In [236]:
y_val = data_val.above_average.values
data_val.drop('above_average', axis=1, inplace=True)

In [243]:
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)

def train_and_score(model, training_dataset, y_train, validation_dataset, y_true):

    dv = DictVectorizer()

    dicts_train = training_dataset.to_dict(orient='records')
    X_train = dv.fit_transform(dicts_train)

    model.fit(X_train, y_train)

    dicts_val = validation_dataset.to_dict(orient='records')
    X_val = dv.transform(dicts_val)

    y_pred = model.predict_proba(X_val)[:, 1]
    y_pred = (y_pred>=0.5)*1
    score = (y_pred==y_true).mean()

    return score

In [245]:
total_score = train_and_score(model, data_train, y_train, data_val, y_val)
total_score

0.946286193873269

In [246]:
columns = data_train.columns
scores = []
for col in columns:
    small_train = data_train.drop(col, axis=1)
    small_val = data_val.drop(col, axis=1)
    score = train_and_score(model, small_train, y_train, small_val, y_val)
    # Rank by the absolute difference; the signed difference (total_score - score) can be negative.
    scores.append((col, score, abs(total_score-score)))

In [247]:
sorted(scores, key=lambda x:x[2])

[('city_mpg', 0.9458665547629039, 0.0004196391103650221),
 ('engine_cylinders', 0.9471254720939991, 0.0008392782207301552),
 ('transmission_type', 0.9450272765421738, 0.0012589173310951773),
 ('year', 0.9479647503147294, 0.0016785564414604215),
 ('highway_mpg', 0.9441879983214435, 0.0020981955518254436),
 ('make', 0.9492236676458246, 0.002937473772555599),
 ('vehicle_style', 0.9425094418799832, 0.003776751993285754),
 ('engine_hp', 0.9303399076793957, 0.015946286193873282),
 ('model', 0.9240453210239195, 0.022240872849349502)]

## Question 6

In [262]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

In [256]:
data_full_train, data_test = train_test_split(data, test_size=0.2, random_state=42)

data_train, data_val = train_test_split(data_full_train, test_size=0.25, random_state=42)

In [257]:
y_train = np.log1p(data_train.price)
y_val = np.log1p(data_val.price)

data_train.drop('price', axis=1, inplace=True)  # only `price` is dropped; `above_average` (derived from `price`) stays among the features
data_val.drop('price', axis=1, inplace=True)

In [280]:

dv = DictVectorizer()

dicts_train = data_train.to_dict(orient='records')
X_train = dv.fit_transform(dicts_train)

dicts_val = data_val.to_dict(orient='records')
X_val = dv.transform(dicts_val)

results = []
for a in [0, 0.01, 0.1, 1, 10]:
    ridge_model = Ridge(solver='sag', random_state=42, alpha=a)
    ridge_model.fit(X_train, y_train)

    y_pred = ridge_model.predict(X_val)
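    # np.sqrt of the MSE gives the RMSE without the `squared=False` argument,
    # which newer Scikit-Learn versions no longer accept.
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))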

    results.append({'alpha': a, 'rmse': round(rmse, 3)})

In [288]:
sorted(results, key=lambda x:x['rmse'])

[{'alpha': 0, 'rmse': 0.248},
 {'alpha': 0.01, 'rmse': 0.248},
 {'alpha': 0.1, 'rmse': 0.248},
 {'alpha': 1, 'rmse': 0.252},
 {'alpha': 10, 'rmse': 0.33}]
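
Three alphas tie at 0.248 after rounding, so the tie-breaking rule from the question applies. A small sketch, reusing the rounded scores printed above, compares by RMSE first and by `alpha` second:

```python
# Rounded validation scores copied from the output above.
results = [
    {"alpha": 0, "rmse": 0.248},
    {"alpha": 0.01, "rmse": 0.248},
    {"alpha": 0.1, "rmse": 0.248},
    {"alpha": 1, "rmse": 0.252},
    {"alpha": 10, "rmse": 0.33},
]

# Lowest RMSE wins; among ties, the smallest alpha wins.
best = min(results, key=lambda r: (r["rmse"], r["alpha"]))
print(best["alpha"])  # 0
```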