## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.
> If it's exactly in between two options, select the higher value.

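The selection rule in the note can be written as a small helper. This is only an illustrative sketch: `closest_option` is a hypothetical name that is not part of the homework, and the numbers passed to it are made up.

```python
def closest_option(value, options):
    """Return the option nearest to value; on an exact tie, prefer the higher one."""
    # Sort key: smaller distance wins first, then the larger option wins ties.
    return min(options, key=lambda opt: (abs(opt - value), -opt))


# Made-up values, only to illustrate the rule
print(closest_option(0.52, [0.1, 0.5, 1.0]))  # nearest option
print(closest_option(2.0, [1.0, 3.0]))        # exact tie, higher option is chosen
```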

### Dataset

In this homework, we continue using the fuel efficiency dataset.
Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

The goal of this homework is to create a regression model for predicting a car's fuel efficiency (column `'fuel_efficiency_mpg'`).

### Preparing the dataset 

Preparation:

* Fill missing values with zeros.
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [42]:
import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.metrics import root_mean_squared_error

%matplotlib inline

In [2]:
#data = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
data = 'car_fuel_efficiency.csv'

In [None]:
#!wget $data

In [4]:
df = pd.read_csv(data)
#df = pd.read_csv('car_fuel_efficiency.csv')

In [5]:
df.head()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


In [6]:
len(df)

9704

In [11]:
df.dtypes[df.dtypes=='object'],df.dtypes[df.dtypes=='int64'], df.dtypes[df.dtypes=='float64']

(origin        object
 fuel_type     object
 drivetrain    object
 dtype: object,
 engine_displacement    int64
 model_year             int64
 dtype: object,
 num_cylinders          float64
 horsepower             float64
 vehicle_weight         float64
 acceleration           float64
 num_doors              float64
 fuel_efficiency_mpg    float64
 dtype: object)

In [12]:
df.columns

Index(['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight',
       'acceleration', 'model_year', 'origin', 'fuel_type', 'drivetrain',
       'num_doors', 'fuel_efficiency_mpg'],
      dtype='object')

In [13]:
df.nunique()

engine_displacement      36
num_cylinders            14
horsepower              192
vehicle_weight         9704
acceleration            162
model_year               24
origin                    3
fuel_type                 2
drivetrain                2
num_doors                 9
fuel_efficiency_mpg    9704
dtype: int64

In [14]:
df.isnull().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

In [15]:
df.isnull().any().sum()

np.int64(4)

In [17]:
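# First hold out 20% of the data as the test set
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
# Then take 25% of the remaining 80% for validation (0.25 * 0.8 = 0.2 of the full data)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)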

df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_full_train = df_full_train.fuel_efficiency_mpg.values
y_train = df_train.fuel_efficiency_mpg.values
y_val = df_val.fuel_efficiency_mpg.values
y_test = df_test.fuel_efficiency_mpg.values

del df_full_train['fuel_efficiency_mpg']
del df_train['fuel_efficiency_mpg']
del df_val['fuel_efficiency_mpg']
del df_test['fuel_efficiency_mpg']

len(df), len(df_full_train), len(df_train), len(df_val), len(df_test)

(9704, 7763, 5822, 1941, 1941)

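The two-step split can be checked without the dataset. The sketch below uses a plain list of 9704 integers as a stand-in for the rows and applies the same `test_size` values. Only the resulting sizes matter here, not the contents.

```python
from sklearn.model_selection import train_test_split

rows = list(range(9704))  # stand-in for the 9704 rows of the dataset

# Step 1: hold out 20% for testing
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)
# Step 2: 25% of the remaining 80% equals 20% of the original data
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

print(len(train), len(val), len(test))
```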
In [19]:
# Define numerical and categorical features.
# Note: num_doors is stored as float64 but is treated as categorical here.
numerical = ['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight',
             'acceleration', 'model_year']
categorical = ['origin', 'fuel_type', 'drivetrain', 'num_doors']

In [21]:
df[numerical].isnull().sum(), df[categorical].isnull().sum(), df['fuel_efficiency_mpg'].isnull().sum()

(engine_displacement      0
 num_cylinders          482
 horsepower             708
 vehicle_weight           0
 acceleration           930
 model_year               0
 dtype: int64,
 origin          0
 fuel_type       0
 drivetrain      0
 num_doors     502
 dtype: int64,
 np.int64(0))

In [39]:
# Fill missing values in every split.
# Numerical columns get 0. Categorical columns get the string 'NA', so missing
# num_doors values become 'NA' rather than 0 because num_doors is listed as categorical.
for frame in (df_train, df_val, df_test, df_full_train):
    frame[numerical] = frame[numerical].fillna(0)
    frame[categorical] = frame[categorical].fillna('NA')

## Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?


* `'vehicle_weight'`
* `'model_year'`
* `'origin'`
* `'fuel_type'`

In [40]:
from sklearn.metrics import root_mean_squared_error

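# Vectorize the features; sparse=True follows the preparation instructions
dv = DictVectorizer(sparse=True)

# Fit the vectorizer on the training records only
train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

# Reuse the fitted vectorizer for the validation records
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)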

# Train a decision tree regressor
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

# Make predictions
y_pred_train = dt.predict(X_train)
y_pred_val = dt.predict(X_val)

# Calculate RMSE for training and validation sets
rmse_train = root_mean_squared_error(y_train, y_pred_train)
rmse_val = root_mean_squared_error(y_val, y_pred_val)

print(f'Train RMSE: {rmse_train:.2f}')
print(f'Val RMSE: {rmse_val:.2f}')

# Print the tree structure
print('\nDecision Tree Structure:')
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))

Train RMSE: 1.59
Val RMSE: 0.00

Decision Tree Structure:
|--- vehicle_weight <= 3022.11
|   |--- value: [16.88]
|--- vehicle_weight >  3022.11
|   |--- value: [12.94]



Q1 response: which feature is used for splitting the data?

* `'vehicle_weight'`

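Besides reading the `export_text` output, the root split can also be looked up from the fitted tree. The sketch below uses synthetic data rather than the homework dataset, constructed so that the target depends only on a weight-like column. The feature names are borrowed from the dataset purely for readability.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (not the homework dataset): the target depends on weight only
rng = np.random.default_rng(1)
weight = rng.uniform(2000, 4000, size=200)
year = rng.integers(2000, 2024, size=200)
y = 30 - 0.005 * weight + rng.normal(0, 0.1, size=200)

feature_names = ['model_year', 'vehicle_weight']
X = np.column_stack([year, weight])

dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X, y)

# tree_.feature[0] holds the column index used by the root node
root_feature = feature_names[dt.tree_.feature[0]]
print(root_feature)
```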
## Question 2

Train a random forest regressor with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)

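Fixing `random_state` is what makes the answer reproducible. A minimal sketch on synthetic data (unrelated to the homework dataset) fits the same forest twice with the same seed and compares the predictions. `fit_predict` is a hypothetical helper introduced only for this example, and `n_jobs` is left at its default.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, unrelated to the homework dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)


def fit_predict(seed):
    """Fit a 10-tree forest with the given seed and return its predictions on X."""
    rf = RandomForestRegressor(n_estimators=10, random_state=seed)
    rf.fit(X, y)
    return rf.predict(X)


# Same seed, same data: the predictions should coincide
same_seed_match = np.allclose(fit_predict(1), fit_predict(1))
print(same_seed_match)
```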

What's the RMSE of this model on the validation data?

* 0.045
* 0.45
* 4.5
* 45.0

In [41]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_val)
rmse = root_mean_squared_error(y_val, y_pred)
print(f'Val RMSE: {rmse:.6f}')

Val RMSE: 1.554945

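For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand on three made-up values and compares the result with `mean_squared_error` from scikit-learn followed by a square root, used here instead of `root_mean_squared_error` so that the example also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Manual RMSE: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```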