# SUPERVISED LEARNING PROJECT

# Setting all the necessary packages

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
%matplotlib inline

# Reading and desctibing data

In this notebook a dataset concerning CO2 emission by vehicles in Canada is analyzed. The data are taken from https://www.kaggle.com/debajyotipodder/co2-emission-by-vehicles and cover a period of 7 years. There are total 7385 rows and 12 columns.

In [2]:
# Data location
location = 'archive'
df_file_name = 'CO2 Emissions_Canada.csv'
df = pd.read_csv(location+'/'+df_file_name, encoding='latin-1')

# A brief overview
print(df.dtypes)
print('Size of the DataSet: '+ str(df.shape))
df.head()

Make                                 object
Model                                object
Vehicle Class                        object
Engine Size(L)                      float64
Cylinders                             int64
Transmission                         object
Fuel Type                            object
Fuel Consumption City (L/100 km)    float64
Fuel Consumption Hwy (L/100 km)     float64
Fuel Consumption Comb (L/100 km)    float64
Fuel Consumption Comb (mpg)           int64
CO2 Emissions(g/km)                   int64
dtype: object
Size of the DataSet: (7385, 12)


Unnamed: 0,Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption City (L/100 km),Fuel Consumption Hwy (L/100 km),Fuel Consumption Comb (L/100 km),Fuel Consumption Comb (mpg),CO2 Emissions(g/km)
0,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,8.5,33,196
1,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,5.9,48,136
3,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,11.1,25,255
4,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,10.6,27,244


A brief description of the features follows:
* `Make`: Constructor name
* `Model`: Vehicle model (see below for abbreviations explained)
* `Vehicle Class`: Vehicle class (Compact, Small, SUV…)
* `Engine Size(L)`: Engine size in liters 
* `Cylinders`: Number of cylinders
* `Transmission`: Type of transmission and number of gears (see below for abbreviations explained)
* `Fuel Type`: Type of fuel (see below for abbreviations explained)
* `Fuel Consumption City (L/100 km)`: Fuel consumptions on city path 
* `Fuel Consumption Hwy (L/100 km)`: Fuel consumptions on highway path in liters for 100km
* `Fuel Consumption Comb (L/100 km)`: Fuel consumptions on highway path in liters for 100km
* `Fuel Consumption Comb (mpg)`: Fuel consumptions combined (55% city and 45% highway)
* `CO2 Emissions(g/km)`: CO2 emissions in grams for km

Below we give sense to the abbreviations that are present in the dataset:

MODEL
* 4WD/4X4 = Four-wheel drive
* AWD = All-wheel drive
* FFV = Flexible-fuel vehicle
* SWB = Short wheelbase
* LWB = Long wheelbase
* EWB = Extended wheelbase

TRANSMISSION

* A = Automatic
* AM = Automated manual
* AS = Automatic with select shift
* AV = Continuously variable
* M = Manual
* 3 - 10 = Number of gears

FUEL TYPE

* X = Regular gasoline
* Z = Premium gasoline
* D = Diesel
* E = Ethanol (E85)
* N = Natural gas

# Checking for null values 

Luckily there are no missing values.

In [3]:
missing_df = df.isnull().sum(axis=0).reset_index()
# we print the result in a more nice-looking form
missing_df.columns = ['variable', 'missing values']
missing_df

Unnamed: 0,variable,missing values
0,Make,0
1,Model,0
2,Vehicle Class,0
3,Engine Size(L),0
4,Cylinders,0
5,Transmission,0
6,Fuel Type,0
7,Fuel Consumption City (L/100 km),0
8,Fuel Consumption Hwy (L/100 km),0
9,Fuel Consumption Comb (L/100 km),0


# Feature Engineering



We expect the combined fuel consumption to be representable by the two separate fuel consumption (city and highway), so we drop the columns referring to combined fuel consumption.

In [4]:
cols_to_drop = ['Fuel Consumption Comb (L/100 km)','Fuel Consumption Comb (mpg)']
df.drop(cols_to_drop, axis=1, inplace=True)

In [5]:
# We also rename some features for a more clean labels
new_cols_dict = {'Vehicle Class': 'Class',
                'Engine Size(L)': 'Engine_Size',
                'Fuel Type': 'Fuel',
                'Fuel Consumption City (L/100 km)': 'Consumption_City',
                'Fuel Consumption Hwy (L/100 km)': 'Consumption_Hwy',
                'CO2 Emissions(g/km)': 'Emissions'}

df.rename(columns=new_cols_dict,  inplace = True)
df

Unnamed: 0,Make,Model,Class,Engine_Size,Cylinders,Transmission,Fuel,Consumption_City,Consumption_Hwy,Emissions
0,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,196
1,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,221
2,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,136
3,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,255
4,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,244
...,...,...,...,...,...,...,...,...,...,...
7380,VOLVO,XC40 T5 AWD,SUV - SMALL,2.0,4,AS8,Z,10.7,7.7,219
7381,VOLVO,XC60 T5 AWD,SUV - SMALL,2.0,4,AS8,Z,11.2,8.3,232
7382,VOLVO,XC60 T6 AWD,SUV - SMALL,2.0,4,AS8,Z,11.7,8.6,240
7383,VOLVO,XC90 T5 AWD,SUV - STANDARD,2.0,4,AS8,Z,11.2,8.3,232


For feature engineering, we want to parse the categorical features of `Make`, `Model`, `Class`, `Transmission`, `Fuel`. We produce a standard encode using dummy variables for  `Make`, as well as for `Class` and `Fuel`.
Considering `Model`, we are interested only in some major characteristics of the vehicle, such as whether the vehicle is 4-weels-driving or not, or if the vehicle is hybrid, etc.
Finally we notice that we can extrapolate two informations from the `Transmission` variable, that are the transmission category and the number of gears. The first is encoded with dummy variables, while de latter is used as a feature on its own. 


In [6]:
# Dummy encode for Make, Class, Fuel Type
ohc_cols = ['Make', 'Class', 'Fuel']
df_enc = pd.get_dummies(df, columns=ohc_cols, drop_first=True)
df_enc.shape

(7385, 67)

In [7]:
# Dealing with Transmission

# Creating two functions that pull out the numbers at the end of the trasmission string
def text_split(item):
    for index, letter in enumerate(item, 0):
        if letter.isdigit():
            return item[:index]
def num_split(item):
    for index, letter in enumerate(item, 0):
        if letter.isdigit():
            return int(item[index:])

df_enc['Gears'] = df_enc.Transmission.apply(num_split) # creating the columns containing the gears number
df_enc['Gears'] = df_enc['Gears'].fillna(0)# replacing 0 to NaN deriving from 
df_enc['Transmission'] = df_enc.Transmission.apply(text_split) # creating the columns containing Transmission type

# getting dummies with of Transmission
df_enc = pd.get_dummies(df_enc, columns=['Transmission'], drop_first=True)
df_enc.shape

(7385, 71)

In [8]:
# Dealing with Model

# Looking for key words in model columns
keywords = ['HYBRID', 'FFV', 'AWD', 'SWB', 'LWB', 'EWB']
for key in keywords:
    df_enc['Model_' + key] = df_enc['Model'].apply(lambda s: 1 if (key in s) else 0)
    print('There are ' + str(df_enc['Model_' + key].sum()) + ' ' + key + ' vehicles')

# special case: we have two keys
df_enc['Model_4WD'] = df_enc['Model'].apply(lambda s: 1 if ('4WD' in s or '4X4' in s) else 0) 
print('There are ' + str(df_enc['Model_4WD'].sum()) + ' 4WD vehicles')

# Drop Model
df_enc.drop(['Model'], axis=1, inplace=True)
df_enc.shape

There are 106 HYBRID vehicles
There are 592 FFV vehicles
There are 1148 AWD vehicles
There are 15 SWB vehicles
There are 39 LWB vehicles
There are 15 EWB vehicles
There are 804 4WD vehicles


(7385, 77)

# Supervised Learning

Now the dataset is formatted and we are ready to apply some linear regression techniques. The aim of this analysis is interpretation: the target variable is the emission of the vehicle and we aim to know which are the most relevant parameters that has influence on it. We will perform three linear regression: the first one will be a standard linear regression, in the second one we introduce cross validation and in the third one we will use some regularization techniques. 

SIMPLE LINEAR REGRESSION (With train and test split and scaler)

In [9]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

y_col = 'Emissions'
X_cols = [x for x in df_enc.columns if x != y_col]
X = df_enc[X_cols]
y = df_enc[y_col]

In [10]:
# we use the mean square error and R2-score as a metric

lr = LinearRegression()
s = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, random_state=42)

X_train_s = s.fit_transform(X_train)
X_test_s = s.transform(X_test)

lr.fit(X_train_s, y_train)
y_pred_lr = lr.predict(X_test_s)

mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_score_lr = r2_score(y_test, y_pred_lr)
print('LR MSE: ', mse_lr)
print('LR R2-Score: ', r2_score_lr)

LR MSE:  31.506819902595513
LR R2-Score:  0.9906721235605888


In [11]:
coeff_weight_lr = pd.DataFrame([X.columns.values, lr.coef_], index=['Variable', 'Weight'])
coeff_weight_lr.T.sort_values(by='Weight')


Unnamed: 0,Variable,Weight
60,Fuel_E,-31.1385
62,Fuel_X,-14.7539
63,Fuel_Z,-13.799
59,Class_VAN - PASSENGER,-0.954894
58,Class_VAN - CARGO,-0.749539
...,...,...
70,Model_FFV,1.00133
1,Cylinders,1.19045
16,Make_FORD,1.30974
3,Consumption_Hwy,22.4054


CROSS VALIDATION

In [12]:
# we use R2 score ad a metric

scores = []


kf = KFold(shuffle=True, random_state=72018, n_splits=10)
lr = LinearRegression()
s = StandardScaler()

for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :], 
                                        X.iloc[test_index, :], 
                                        y[train_index], 
                                        y[test_index])
    
    X_train_s = s.fit_transform(X_train)
    
    lr.fit(X_train_s, y_train)
    
    X_test_s = s.transform(X_test)
    
    y_pred = lr.predict(X_test_s)

    score = r2_score(y_test.values, y_pred)
    
    scores.append(score)
    
print('KF R2-Score: ',np.mean(scores))
coeff_weight_kf = pd.DataFrame([X.columns.values, lr.coef_], index=['Variable', 'Weight'])
coeff_weight_kf.T.sort_values(by='Weight')


KF R2-Score:  0.9923085930776331


Unnamed: 0,Variable,Weight
60,Fuel_E,-31.2985
62,Fuel_X,-14.8691
63,Fuel_Z,-13.9838
59,Class_VAN - PASSENGER,-0.823656
58,Class_VAN - CARGO,-0.729263
...,...,...
70,Model_FFV,1.04445
16,Make_FORD,1.128
1,Cylinders,1.31579
3,Consumption_Hwy,22.0984


RIDGE 

In [13]:
#we useR2-score as a metric

s = StandardScaler()
r = Ridge()

X_train_s = s.fit_transform(X_train)
X_test_s = s.transform(X_test)

r.fit(X_train_s, y_train)

y_pred = r.predict(X_test_s)

print('Lasso R2-Score: ', r2_score(y_pred, y_test))
coeff_weight_las = pd.DataFrame([X.columns.values, r.coef_], index=['Variable', 'Weight'])
coeff_weight_las.T.sort_values(by='Weight')


Lasso R2-Score:  0.9882189154847097


Unnamed: 0,Variable,Weight
60,Fuel_E,-31.2185
62,Fuel_X,-14.7728
63,Fuel_Z,-13.8817
59,Class_VAN - PASSENGER,-0.809971
58,Class_VAN - CARGO,-0.725404
...,...,...
70,Model_FFV,1.0369
16,Make_FORD,1.13206
1,Cylinders,1.33248
3,Consumption_Hwy,22.1366


# Conclusions

As we can observe, the performance of the model does not increase as we add regularisartion, linear regression has an higher r2-score than Ridge. Nevertheless, all the three models perform quite well and we can find some common behaviors when it comes to interpretation. Indeed, taking a look to coefficients of linear regression (in the first two cases) and of Lasso in the last one, we observe that all the three methods agree on giving a strong positive wight to the consumptions variables while a strong negative wight to the dummy variable of Ethanol powered vehicles. Form the coefficients coming from the two linear regressions it seems also that Ford has the highest positive influence on emissions, this effect is anyway canceled in Lasso. Some minor positive influences are attached also to the number of cylinders and gears. Anyway, sticking to the r2-score interpretation, a simple linear regression with a cross validation procedure seems to be the best way to model the data.

Some further analysis could be done on tuning the parameters of Ridge, as well as reducing the number of variables on the dataset (we could for instance exclude the type of transmission or the constructor) and see if some more wight is distributed among the other variables, generating some interesting results. Another alternative is to add some polynomial feature and see if this complexifications pays off and produces an higher precision. 