## Homework
- In this homework, we continue using the fuel efficiency dataset.
- The goal of this homework is to create a regression model for predicting the car fuel efficiency (column `'fuel_efficiency_mpg'`).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv

--2025-11-18 23:37:12--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 874188 (854K) [text/plain]
Saving to: ‘car_fuel_efficiency.csv’


2025-11-18 23:37:12 (96.7 MB/s) - ‘car_fuel_efficiency.csv’ saved [874188/874188]



In [5]:
df = pd.read_csv('car_fuel_efficiency.csv')
df.head().T

Unnamed: 0,0,1,2,3,4
engine_displacement,170,130,170,220,210
num_cylinders,3.0,5.0,,4.0,1.0
horsepower,159.0,97.0,78.0,,140.0
vehicle_weight,3413.433759,3149.664934,3079.038997,2542.392402,3460.87099
acceleration,17.7,17.8,15.1,20.2,14.4
model_year,2003,2007,2018,2009,2009
origin,Europe,USA,Europe,USA,Europe
fuel_type,Gasoline,Gasoline,Gasoline,Diesel,Gasoline
drivetrain,All-wheel drive,Front-wheel drive,Front-wheel drive,All-wheel drive,All-wheel drive
num_doors,0.0,0.0,0.0,2.0,2.0


### Preparing the dataset 

Preparation:

* Fill missing values with zeros.
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [26]:
df.dtypes

engine_displacement      int64
num_cylinders          float64
horsepower             float64
vehicle_weight         float64
acceleration           float64
model_year               int64
origin                  object
fuel_type               object
drivetrain              object
num_doors              float64
fuel_efficiency_mpg    float64
dtype: object

In [27]:
categorical_features = list(df.dtypes[df.dtypes == 'object'].index)
numerical_variables = list(df.dtypes[df.dtypes != 'object'].index)

# As the output below shows, numerical_variables also includes the target column, so the target has to be removed from the feature list
features = (categorical_features + numerical_variables)
features.remove('fuel_efficiency_mpg')
print('categorical')
print(categorical_features)
print()
print('numerical')
print(numerical_variables)
print(features)

categorical
['origin', 'fuel_type', 'drivetrain']

numerical
['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight', 'acceleration', 'model_year', 'num_doors', 'fuel_efficiency_mpg']
['origin', 'fuel_type', 'drivetrain', 'engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight', 'acceleration', 'model_year', 'num_doors']


In [28]:
df.isnull().sum()

engine_displacement    0
num_cylinders          0
horsepower             0
vehicle_weight         0
acceleration           0
model_year             0
origin                 0
fuel_type              0
drivetrain             0
num_doors              0
fuel_efficiency_mpg    0
dtype: int64

In [29]:
# To see only columns that have missing values
df.isnull().sum()[df.isnull().sum() != 0]

Series([], dtype: int64)

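The two outputs above show no missing values, most likely because these cells were re-run after the fill step below had already been applied to `df`. In the raw data, the missing values visible in the `df.head().T` output (`num_cylinders` and `horsepower`) are in numerical features.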
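One way to check which column types contain missing values is sketched below on a hypothetical toy frame. The name `toy` and its values are made up for illustration and are not taken from the homework dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the homework dataset
toy = pd.DataFrame({
    'horsepower': [159.0, np.nan, 78.0],
    'num_cylinders': [3.0, 5.0, np.nan],
    'origin': ['Europe', 'USA', 'Europe'],
})

# Count missing values once and keep only the columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]

# Data types of the columns that contain missing values
missing_dtypes = toy.dtypes[missing.index]
print(missing)
print(missing_dtypes)

# Fill the gaps with zeros, as the homework instructions require
toy_filled = toy.fillna(0)
```

Storing the result of `isnull().sum()` in a variable avoids computing it twice when filtering.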

In [31]:
df[numerical_variables] = df[numerical_variables].fillna(0.0)
# Since the instructions say to fill all missing values with 0, running df.fillna(0) would also suffice

In [13]:
df.isnull().sum()

engine_displacement    0
num_cylinders          0
horsepower             0
vehicle_weight         0
acceleration           0
model_year             0
origin                 0
fuel_type              0
drivetrain             0
num_doors              0
fuel_efficiency_mpg    0
dtype: int64

In [17]:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [18]:
len(df_train), len(df_val), len(df_test)

(5822, 1941, 1941)
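The second split uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the full data. A minimal sketch of that arithmetic, using a hypothetical array of 1,000 row indices (`rows`) instead of the dataframe:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataframe: 1,000 row indices
rows = np.arange(1000)

# First split: hold out 20% of all rows as the test set
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% is 20% of the original data
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

print(len(train), len(val), len(test))
```

With 1,000 rows this yields 600, 200 and 200 rows, which matches the 60%/20%/20% distribution.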

In [16]:
df_train.head()

Unnamed: 0,engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
4685,120,5.0,169.0,2966.679505,13.9,2005,USA,Gasoline,Front-wheel drive,-1.0,15.301475
125,200,3.0,143.0,2950.822121,17.1,2013,Asia,Diesel,Front-wheel drive,-1.0,15.331215
542,180,6.0,180.0,3078.221669,17.4,2007,USA,Gasoline,All-wheel drive,0.0,15.336679
8901,280,5.0,174.0,2797.991793,0.0,2016,USA,Diesel,All-wheel drive,0.0,15.86585
7215,250,4.0,133.0,2362.42693,16.3,2010,USA,Diesel,Front-wheel drive,-1.0,18.102203


In [19]:
# Reset the index of each dataframe (optional)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [23]:
# Extracting the features (input variables X) from the dataframe
X_train = df_train[features]
X_val = df_val[features]
X_test = df_test[features]

# Alternatively, delete the target from each dataframe so that only the features remain
# (do this only after extracting y_train, y_val and y_test):
# del df_train['fuel_efficiency_mpg']
# del df_val['fuel_efficiency_mpg']
# del df_test['fuel_efficiency_mpg']
# df_train, df_val and df_test then become equivalent to X_train, X_val and X_test above
# and can be passed straight into DictVectorizer.
# With that approach, the earlier cell that builds `features` (repeated below for reference)
# is not needed:
# features = (categorical_features + numerical_variables)
# features.remove('fuel_efficiency_mpg')

In [24]:
# Extracting the target (output variable Y) from the dataframe
y_train = df_train.fuel_efficiency_mpg.values
y_val = df_val.fuel_efficiency_mpg.values
y_test = df_test.fuel_efficiency_mpg.values



In [25]:
# One hot encoding
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=True)  # sparse=True, as specified in the preparation steps

X_train_dict = X_train.to_dict(orient='records')
X_train_en = dv.fit_transform(X_train_dict)

X_val_dict = X_val.to_dict(orient='records')
X_val_en = dv.transform(X_val_dict)

## Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?


* `'vehicle_weight'`
* `'model_year'`
* `'origin'`
* `'fuel_type'`

In [32]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

In [None]:
# The target is continuous, so a regressor (not a classifier) is needed
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train_en, y_train)

In [None]:
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
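Besides reading the text export, the splitting feature of a depth-1 tree can be looked up programmatically. The sketch below uses synthetic data, which is an assumption for illustration (`X_toy`, `y_toy`, `toy_names` and `stump` are hypothetical names), so it does not reveal the answer for the homework dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (illustration only): the target depends only on 'weight'
rng = np.random.RandomState(1)
X_toy = rng.uniform(size=(200, 2))
y_toy = 10.0 * X_toy[:, 0]
toy_names = ['weight', 'noise']

stump = DecisionTreeRegressor(max_depth=1, random_state=1)
stump.fit(X_toy, y_toy)

# The text export shows the single split and the two leaf values
print(export_text(stump, feature_names=toy_names))

# tree_.feature[0] is the column index used at the root node
root_feature = toy_names[stump.tree_.feature[0]]
print(root_feature)
```

In the notebook, the same lookup would presumably be `dv.get_feature_names_out()[dt.tree_.feature[0]]`, assuming `dt` has been fitted as in the cell above.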