## Homework

> Note: sometimes your answer doesn't match one of the options exactly. That's fine. 
Select the option that's closest to your solution.

### Dataset

In this homework, we will use the Car price dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

**Note**: The dataset was obtained from [this Kaggle dataset](https://www.kaggle.com/CooperUnion/cardataset).



In [1]:
!wget "https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv"

--2023-09-27 11:06:27--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.


HTTP request sent, awaiting response... 200 OK
Length: 1475504 (1.4M) [text/plain]
Saving to: ‘data.csv.7’


2023-09-27 11:06:30 (1.06 MB/s) - ‘data.csv.7’ saved [1475504/1475504]



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
raw = pd.read_csv("data.csv")
raw.head()

|   | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |



We'll keep working with the `MSRP` variable, and we'll transform it to a classification task. 

### Features

For the rest of the homework, you'll need to use only these columns:

* `Make`,
* `Model`,
* `Year`,
* `Engine HP`,
* `Engine Cylinders`,
* `Transmission Type`,
* `Vehicle Style`,
* `highway MPG`,
* `city mpg`

### Data preparation

* Select only the features listed above and transform their names using the following line:
  ```
  data.columns = data.columns.str.replace(' ', '_').str.lower()
  ```
* Fill in the missing values of the selected features with 0.
* Rename `MSRP` variable to `price`.

In [4]:
cols_to_keep = [
    "Make",
    "Model",
    "Year",
    "Engine HP",
    "Engine Cylinders",
    "Transmission Type",
    "Vehicle Style",
    "highway MPG",
    "city mpg",
]

In [5]:
df = raw[cols_to_keep].copy()
df.columns = df.columns.str.lower().str.replace(" ", "_")
features = list(df.columns)
assert features == list(df.columns)

In [6]:
df.head()

|   | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 |


In [7]:
categorical = list(df.dtypes[df.dtypes.eq("object")].index)
categorical


['make', 'model', 'transmission_type', 'vehicle_style']

In [8]:
for col in categorical:
    df[col] = df[col].str.lower().str.replace(" ", "_")

In [9]:
numerical = list(df.columns[~df.columns.isin(categorical)])
numerical

['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

In [10]:
for col in list(df.columns[df.isna().sum().gt(0)]):
    df[col] = df[col].fillna(0)

In [11]:
df["price"] = raw["MSRP"]

In [12]:
df.isna().sum()

make                 0
model                0
year                 0
engine_hp            0
engine_cylinders     0
transmission_type    0
vehicle_style        0
highway_mpg          0
city_mpg             0
price                0
dtype: int64


### Question 1

What is the most frequent observation (mode) for the column `transmission_type`?

>- **`AUTOMATIC`**
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`


In [13]:
df.transmission_type.value_counts()

transmission_type
automatic           8266
manual              2935
automated_manual     626
direct_drive          68
unknown               19
Name: count, dtype: int64

**Answer: AUTOMATIC**
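
As a cross-check, the mode can also be read off directly instead of scanning the counts by eye. The sketch below uses a small made-up series (`toy` is a hypothetical stand-in for `df.transmission_type`), not the homework data.

```python
import pandas as pd

# Hypothetical toy data standing in for df.transmission_type
toy = pd.Series(["automatic", "manual", "automatic", "direct_drive", "automatic"])

# Series.mode() returns a Series because ties are possible; take the first entry
most_frequent = toy.mode().iloc[0]

# Equivalent route: the index label with the largest count
most_frequent_alt = toy.value_counts().idxmax()

print(most_frequent, most_frequent_alt)
```

Both routes should agree unless two categories are tied for the top count.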



### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
>- **`highway_mpg` and `city_mpg`**


In [14]:
df_corr = df[numerical].corr()
# adding background gradient
df_corr.style.background_gradient(cmap='coolwarm')

|   | year | engine_hp | engine_cylinders | highway_mpg | city_mpg |
|---|---|---|---|---|---|
| year | 1.0 | 0.338714 | -0.040708 | 0.25824 | 0.198171 |
| engine_hp | 0.338714 | 1.0 | 0.774851 | -0.415707 | -0.424918 |
| engine_cylinders | -0.040708 | 0.774851 | 1.0 | -0.614541 | -0.587306 |
| highway_mpg | 0.25824 | -0.415707 | -0.614541 | 1.0 | 0.886829 |
| city_mpg | 0.198171 | -0.424918 | -0.587306 | 0.886829 | 1.0 |
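
In the matrix above, the largest off-diagonal value is 0.886829, for `highway_mpg` and `city_mpg`. For larger matrices it can help to extract the strongest pair programmatically. The following is a minimal sketch on a made-up frame (`toy` and its columns `a`, `b`, `c` are hypothetical); in the notebook the input would be `df[numerical]`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be df[numerical]
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],   # nearly proportional to "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to the others
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation of 1.0) is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Rank by absolute value so strong negative correlations also count
top_pair = pairs.abs().idxmax()
print(top_pair, round(pairs[top_pair], 3))
```

Ranking by absolute value is a design choice here: a correlation of -0.9 is as strong as one of 0.9.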




### Make `price` binary

* Now we need to turn the `price` variable from numeric into a binary format.
* Let's create a variable `above_average` which is `1` if the `price` is above its mean value and `0` otherwise.


In [15]:
df["above_average"] = df.price.gt(df.price.mean()).astype(int)
df.head()

|   | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | price | above_average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bmw | 1_series_m | 2011 | 335.0 | 6.0 | manual | coupe | 26 | 19 | 46135 | 1 |
| 1 | bmw | 1_series | 2011 | 300.0 | 6.0 | manual | convertible | 28 | 19 | 40650 | 1 |
| 2 | bmw | 1_series | 2011 | 300.0 | 6.0 | manual | coupe | 28 | 20 | 36350 | 0 |
| 3 | bmw | 1_series | 2011 | 230.0 | 6.0 | manual | coupe | 28 | 18 | 29450 | 0 |
| 4 | bmw | 1_series | 2011 | 230.0 | 6.0 | manual | convertible | 28 | 18 | 34500 | 0 |



### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value (`price`) is not in your dataframe.


In [16]:
from sklearn.model_selection import train_test_split

In [64]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.2 / 0.8, random_state=42)
len(df_train), len(df_val), len(df_test)

(7148, 2383, 2383)
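
The second split uses `test_size=0.2 / 0.8` because the validation set should be 20% of the whole dataset, which is 25% of the 80% that remains after the test set is removed. A minimal sketch on 100 made-up rows (`rows` is a hypothetical stand-in for `df`) makes the arithmetic visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, so percentages map directly to counts
rows = np.arange(100)

# First split: hold out 20% of all rows for testing
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 20% of the total is 0.2 / 0.8 = 25% of the remaining 80%
train, val = train_test_split(full_train, test_size=0.2 / 0.8, random_state=42)

print(len(train), len(val), len(test))
```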

In [65]:
# Reset index and get y vectors
target = "price"

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train[target].values
y_val = df_val[target].values
y_test = df_test[target].values
        
del df_train[target] 
del df_val[target] 
del df_test[target] 

In [66]:
target = "above_average"

aa_y_train = df_train[target].values
aa_y_val = df_val[target].values
aa_y_test = df_test[target].values
        
del df_train[target] 
del df_val[target] 
del df_test[target] 


### Question 3

* Calculate the mutual information score between `above_average` and other categorical variables in our dataset. 
  Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the lowest mutual information score?
  
- `make`
- `model`
>- **`transmission_type`**
- `vehicle_style`


In [20]:
from sklearn.metrics import mutual_info_score

In [21]:
for cat in categorical:
    score = mutual_info_score(df_train[cat], aa_y_train)
    print(cat, round(score, 2))

make 0.24
model 0.46
transmission_type 0.02
vehicle_style 0.08
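
To get a feel for the scale of these scores, it may help to look at two extreme cases. The sketch below uses made-up labels (`target`, `informative`, and `uninformative` are hypothetical, not from the dataset): a feature that determines the target exactly, and one that is independent of it.

```python
import math
from sklearn.metrics import mutual_info_score

# Hypothetical toy labels and features, not taken from the homework data
target = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # determines the target exactly
uninformative = ["a", "b", "a", "b"]  # independent of the target

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(round(mi_informative, 2), round(mi_uninformative, 2))
```

`mutual_info_score` works in natural-log units, so a perfectly informative feature for a balanced binary target scores ln 2, while an independent feature scores 0.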




### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.60
- 0.72
- 0.84
>- **0.95**


In [22]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [23]:
dv = DictVectorizer(sparse=False)
train_dicts = df_train.to_dict(orient="records")
val_dicts = df_val.to_dict(orient="records")

X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)

In [24]:
# train the model
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, aa_y_train)

In [25]:
aa_y_pred = model.predict_proba(X_val)[:,1]
aa_y_pred

array([1.23478217e-03, 9.95791757e-01, 1.76463844e-04, ...,
       4.54991091e-04, 9.90574753e-01, 9.86817407e-01])

In [26]:
round((aa_y_val == model.predict(X_val)).mean(), 2)

0.95



### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

>- **`year`**
- `engine_hp`
- `transmission_type`
- `city_mpg`

> **Note**: the difference doesn't have to be positive


In [27]:
base = ['year', 'engine_hp', 'transmission_type', 'city_mpg']

In [28]:
def train_selected(selected):
    dv = DictVectorizer(sparse=False)
    train_dicts = df_train[selected].to_dict(orient="records")
    val_dicts = df_val[selected].to_dict(orient="records")

    X_train = dv.fit_transform(train_dicts)
    X_val = dv.transform(val_dicts)
    
    # train the model
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(X_train, aa_y_train)
    
    return round((aa_y_val == model.predict(X_val)).mean(), 2)


In [29]:
base_acc = train_selected(base)
base_acc

0.89

In [30]:
for feat in base:
    selected = base.copy()
    selected.remove(feat)
    acc = train_selected(selected)
    print(feat, "{:.2f} - {:.2f} = {:.2f}".format(acc, base_acc, acc - base_acc))

year 0.89 - 0.89 = 0.00
engine_hp 0.74 - 0.89 = -0.15
transmission_type 0.88 - 0.89 = -0.01
city_mpg 0.88 - 0.89 = -0.01
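
Rounding each accuracy to two decimals before subtracting can hide small differences, as the tie between `transmission_type` and `city_mpg` above suggests. One possible variation is to keep the differences unrounded and compare their absolute values. The sketch below shows only the elimination loop; `elimination_differences`, `toy_score`, and the numbers in `toy_gain` are hypothetical and made up for illustration, not results from the dataset.

```python
def elimination_differences(features, score_fn):
    """Return {feature: baseline_score - score_without_feature}, unrounded."""
    baseline = score_fn(features)
    diffs = {}
    for feat in features:
        reduced = [f for f in features if f != feat]
        diffs[feat] = baseline - score_fn(reduced)
    return diffs


# Hypothetical scorer for illustration only: pretends each feature adds a
# fixed amount of accuracy. In the notebook this role would be played by a
# function that trains the model and returns the validation accuracy.
toy_gain = {"year": 0.001, "engine_hp": 0.15, "city_mpg": 0.008}


def toy_score(selected):
    return 0.7 + sum(toy_gain[f] for f in selected)


diffs = elimination_differences(list(toy_gain), toy_score)

# The least useful feature is the one whose removal changes the score least
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
print(least_useful)
```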




### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with the `'sag'` solver. Set the seed to `42`.
* This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

>- **0**
- 0.01
- 0.1
- 1
- 10

> **Note**: If there are multiple options, select the smallest `alpha`.


In [49]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

In [67]:
y_train

array([ 33599,  26245, 248000, ...,  28345,   2000,  40220])

In [68]:
# Prepare target ys
y_train = np.log1p(y_train) 
y_val = np.log1p(y_val) 
y_test = np.log1p(y_test) 

In [70]:
y_val

array([10.26381581, 11.00544424,  9.90802723, ...,  9.99747868,
       11.72558222, 10.87749987])

In [79]:
# Prepare Xs
dv = DictVectorizer()
train_dicts = df_train.to_dict(orient="records")
val_dicts = df_val.to_dict(orient="records")

X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)


In [75]:
def test_alpha(alpha):
    model = Ridge(
        alpha=alpha,
        solver="sag",
        max_iter=1000,
        random_state=42,
    )

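    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    # Note: this returns the MSE, not the RMSE (apply np.sqrt to get the RMSE).
    # The square root is monotonic, so the alpha with the lowest MSE
    # also has the lowest RMSE before rounding.
    return mean_squared_error(y_val, y_pred)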

In [82]:
for alpha in [0, 0.01, 0.1, 1, 10]:
    score = test_alpha(alpha)
    print(alpha, round(score,3), sep=" : ")

0 : 0.065
0.01 : 0.065
0.1 : 0.065
1 : 0.067
10 : 0.113
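
RMSE is the square root of the mean squared error, so a value returned by `mean_squared_error` needs one more step before it can be reported as RMSE. A minimal NumPy sketch, assuming made-up numbers (`rmse`, `prices`, and the toy arrays are hypothetical and not taken from the dataset):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical prices, not from the dataset
prices = np.array([20000.0, 35000.0, 50000.0])
log_prices = np.log1p(prices)

# np.expm1 inverts np.log1p, so predictions can be mapped back to prices
recovered = np.expm1(log_prices)

# Toy check: residuals of 0, 0 and 2 give sqrt(4 / 3)
toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(toy_rmse, 3))
```

Because the target was log-transformed, an RMSE computed this way is in log-price units rather than in dollars.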




## Submit the results

* Submit your results here: https://forms.gle/FFfNjEP4jU4rxnL26
* You can submit your solution multiple times. In this case, only the last submission will be used.
* If your answer doesn't match the options exactly, select the closest one.


## Deadline

The deadline for submitting is 2 October (Monday), 23:00 CEST.

After that, the form will be closed.
