In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Dataset
In this homework, we will use the Car price dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

Or you can do it with wget:

```wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv```

We'll keep working with the MSRP variable, and we'll transform it to a classification task.

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv')

In [3]:
data.head(3)

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350


#### Features
For the rest of the homework, you'll need to use only these columns:

* ```Make```,
* ```Model```,
* ```Year```,
* ```Engine HP```,
* ```Engine Cylinders```,
* ```Transmission Type```,
* ```Vehicle Style```,
* ```highway MPG```,
* ```city mpg```

In [4]:
columns = ['Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders', 
            'Transmission Type', 'Vehicle Style', 'highway MPG', 'city mpg', 
           'MSRP']

In [5]:
df = data[columns]

In [6]:
df.head(3)

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle Style,highway MPG,city mpg,MSRP
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,Coupe,26,19,46135
1,BMW,1 Series,2011,300.0,6.0,MANUAL,Convertible,28,19,40650
2,BMW,1 Series,2011,300.0,6.0,MANUAL,Coupe,28,20,36350


#### Data preparation
* Select only the features from above and transform their names using next line:<br>
```data.columns = data.columns.str.replace(' ', '_').str.lower()```
* Fill in the missing values of the selected features with 0.
* Rename ```MSRP``` variable to ```price```.

In [7]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [8]:
df.columns

Index(['make', 'model', 'year', 'engine_hp', 'engine_cylinders',
       'transmission_type', 'vehicle_style', 'highway_mpg', 'city_mpg',
       'msrp'],
      dtype='object')

In [9]:
df.isnull().sum()

make                  0
model                 0
year                  0
engine_hp            69
engine_cylinders     30
transmission_type     0
vehicle_style         0
highway_mpg           0
city_mpg              0
msrp                  0
dtype: int64

In [10]:
df['engine_hp'].fillna(0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['engine_hp'].fillna(0, inplace=True)


In [11]:
df.isnull().sum()

make                  0
model                 0
year                  0
engine_hp             0
engine_cylinders     30
transmission_type     0
vehicle_style         0
highway_mpg           0
city_mpg              0
msrp                  0
dtype: int64

In [None]:
df['engine_cylinders'].fillna(0, inplace=True)

In [12]:
df.isnull().sum()

make                  0
model                 0
year                  0
engine_hp             0
engine_cylinders     30
transmission_type     0
vehicle_style         0
highway_mpg           0
city_mpg              0
msrp                  0
dtype: int64

In [13]:
df.rename(columns={'msrp':'price'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={'msrp':'price'}, inplace=True)


In [14]:
df.head(3)

Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg,price
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,Coupe,26,19,46135
1,BMW,1 Series,2011,300.0,6.0,MANUAL,Convertible,28,19,40650
2,BMW,1 Series,2011,300.0,6.0,MANUAL,Coupe,28,20,36350


#### Question 1
What is the most frequent observation (mode) for the column ```transmission_type```?

* AUTOMATIC
* MANUAL
* AUTOMATED_MANUAL
* DIRECT_DRIVE

#### Answer: AUTOMATIC

In [15]:
df['transmission_type'].describe()

count         11914
unique            5
top       AUTOMATIC
freq           8266
Name: transmission_type, dtype: object

#### Question 2
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

* ```engine_hp``` and ```year```
* ```engine_hp``` and ```engine_cylinders```
* ```highway_mpg``` and ```engine_cylinders```
* ```highway_mpg``` and ```city_mpg```

#### Answer: ```highway_mpg``` and ```city_mpg```

In [16]:
df.dtypes

make                  object
model                 object
year                   int64
engine_hp            float64
engine_cylinders     float64
transmission_type     object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
price                  int64
dtype: object

In [17]:
numerical = ['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 
             'city_mpg']

In [18]:
df[numerical].corr()

Unnamed: 0,year,engine_hp,engine_cylinders,highway_mpg,city_mpg
year,1.0,0.338714,-0.041479,0.25824,0.198171
engine_hp,0.338714,1.0,0.780998,-0.415707,-0.424918
engine_cylinders,-0.041479,0.780998,1.0,-0.621606,-0.600776
highway_mpg,0.25824,-0.415707,-0.621606,1.0,0.886829
city_mpg,0.198171,-0.424918,-0.600776,0.886829,1.0


#### Make price binary
* Now we need to turn the price variable from numeric into a binary format.
* Let's create a variable ```above_average``` which is 1 if the price is above its mean value and 0 otherwise.

In [21]:
df['above_average']=df['price'].apply(lambda x: 1 if df['price']>df['price'].mean()
                                                       else 0, axis=1)

TypeError: <lambda>() got an unexpected keyword argument 'axis'

In [23]:
df.head()

Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg,price
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,Coupe,26,19,46135
1,BMW,1 Series,2011,300.0,6.0,MANUAL,Convertible,28,19,40650
2,BMW,1 Series,2011,300.0,6.0,MANUAL,Coupe,28,20,36350
3,BMW,1 Series,2011,230.0,6.0,MANUAL,Coupe,28,18,29450
4,BMW,1 Series,2011,230.0,6.0,MANUAL,Convertible,28,18,34500


In [22]:
df['price'].mean()

40594.737032063116

#### Split the data
* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
* Make sure that the target value (```above_average```) is not in your dataframe.

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [26]:
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=42)

In [27]:
y_train = df_train.price.values
y_val = df_val.price.values

#### Question 3
* Calculate the mutual information score between ```above_average``` and other categorical variables in our dataset. Use the training set only.
* Round the scores to 2 decimals using round(score, 2).

Which of these variables has the lowest mutual information score?

* ```make```
* ```model```
* ```transmission_type```
* ```vehicle_style```

#### Answer: transmission_type

In [30]:
df.dtypes

make                  object
model                 object
year                   int64
engine_hp            float64
engine_cylinders     float64
transmission_type     object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
price                  int64
dtype: object

In [32]:
categorical=['make', 'model', 'transmission_type', 'vehicle_style']

In [28]:
from sklearn.metrics import mutual_info_score

In [35]:
def calculate_mi(series):
    return round(mutual_info_score(series, df_train_full.price), 2)

df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')


display(df_mi.head())

Unnamed: 0,MI
model,5.5
make,2.74
vehicle_style,1.69
transmission_type,0.58


#### Question 4
* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    * ```model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)```
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

* 0.60
* 0.72
* 0.84
* 0.95

#### Question 5
* Let's find the least useful feature using the feature elimination technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of following feature has the smallest difference?

* ```year```
* ```engine_hp```
* ```transmission_type```
* ```city_mpg```

#### Question 6
* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column ```price```. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with a solver ```'sag'```. Set the seed to 42.
* This model also has a parameter ```alpha```. Let's try the following values: ```[0, 0.01, 0.1, 1, 10]```.
* Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

* 0
* 0.01
* 0.1
* 1
* 10