## Predicting CO2 Emissions by Vehicles

The dataset describes how a vehicle's CO2 emissions vary with its features. It comes from the Government of Canada's official open data website; this notebook uses a compiled version from [Kaggle datasets](https://www.kaggle.com/code/ahmetcanertekn/co2-emission-by-vehicles-eda-visualization/input). The data covers a period of 7 years.

### Model

4WD/4X4 = Four-wheel drive

AWD = All-wheel drive

FFV = Flexible-fuel vehicle

SWB = Short wheelbase

LWB = Long wheelbase

EWB = Extended wheelbase

### Transmission

A = Automatic

M = Manual

### Fuel type

X = Regular gasoline (petrol)

Z = Premium gasoline (petrol)

D = Diesel

E = Ethanol (E85)

N = Natural gas

### Fuel Consumption

City and highway fuel consumption ratings are given in litres per 100 kilometres (L/100 km). The combined rating (55% city, 45% highway) is given in L/100 km and in miles per gallon (mpg).
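
As a rough check of the 55/45 weighting, the sketch below recomputes the combined rating from the city and highway values of the first two ACURA ILX rows in the dataset. The function name `combined_l_per_100km` is illustrative and not part of the notebook.

```python
def combined_l_per_100km(city: float, hwy: float) -> float:
    """Weighted combined rating in L/100 km: 55% city, 45% highway."""
    return 0.55 * city + 0.45 * hwy


# City/highway values taken from the first two rows of the dataset.
first = round(combined_l_per_100km(9.9, 6.7), 1)    # dataset lists 8.5
second = round(combined_l_per_100km(11.2, 7.7), 1)  # dataset lists 9.6
print(first, second)
```

Both results agree with the `Fuel Consumption Comb (L/100 km)` values listed for those rows.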

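**To convert mpg to km/L (kmpl)**, divide the fuel economy value by **2.352**. This factor assumes US gallons; for imperial gallons the divisor is about 2.825. The dataset rows (for example, 8.5 L/100 km listed next to 33 mpg) are consistent with imperial gallons, so the kmpl values computed in this notebook with 2.352 are likely too high by a constant factor of about 1.2.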
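
The divisor follows from the definitions of the mile and the gallon. The sketch below is illustrative only (the constant and function names are not from the notebook) and shows both the US-gallon and the imperial-gallon factor, since "mpg" is ambiguous between the two.

```python
KM_PER_MILE = 1.609344               # international mile
LITRES_PER_US_GALLON = 3.785411784
LITRES_PER_IMP_GALLON = 4.54609


def mpg_to_kmpl(mpg: float, litres_per_gallon: float = LITRES_PER_US_GALLON) -> float:
    """Convert miles per gallon to kilometres per litre."""
    return mpg * KM_PER_MILE / litres_per_gallon


def l_per_100km_to_kmpl(l_per_100km: float) -> float:
    """Convert litres per 100 km to kilometres per litre."""
    return 100.0 / l_per_100km


# Divisors that turn mpg into km/L.
us_divisor = LITRES_PER_US_GALLON / KM_PER_MILE    # about 2.352
imp_divisor = LITRES_PER_IMP_GALLON / KM_PER_MILE  # about 2.825
print(round(us_divisor, 3), round(imp_divisor, 3))

# First dataset row: 33 mpg listed next to 8.5 L/100 km.
print(round(mpg_to_kmpl(33), 2))                         # US gallons: about 14.03
print(round(mpg_to_kmpl(33, LITRES_PER_IMP_GALLON), 2))  # imperial gallons: about 11.68
print(round(l_per_100km_to_kmpl(8.5), 2))                # from L/100 km: about 11.76
```

Comparing the last three values is a quick way to see which gallon definition a given mpg column uses.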

### CO2 Emissions

The tailpipe emissions of carbon dioxide, in grams per kilometre (g/km), for combined city and highway driving.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

In [2]:
dataset = pd.read_csv("CO2_Emissions.csv")

In [3]:
dataset.head(10)

Unnamed: 0,Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption City (L/100 km),Fuel Consumption Hwy (L/100 km),Fuel Consumption Comb (L/100 km),Fuel Consumption Comb (mpg),CO2 Emissions(g/km)
0,ACURA,ILX,COMPACT,2.0,4,A,Z,9.9,6.7,8.5,33,196
1,ACURA,ILX,COMPACT,2.4,4,M,Z,11.2,7.7,9.6,29,221
2,ACURA,ILX HYBRID,COMPACT,1.5,4,A,Z,6.0,5.8,5.9,48,136
3,ACURA,MDX 4WD,SUV - SMALL,3.5,6,A,Z,12.7,9.1,11.1,25,255
4,ACURA,RDX AWD,SUV - SMALL,3.5,6,A,Z,12.1,8.7,10.6,27,244
5,ACURA,RLX,MID-SIZE,3.5,6,A,Z,11.9,7.7,10.0,28,230
6,ACURA,TL,MID-SIZE,3.5,6,A,Z,11.8,8.1,10.1,28,232
7,ACURA,TL AWD,MID-SIZE,3.7,6,A,Z,12.8,9.0,11.1,25,255
8,ACURA,TL AWD,MID-SIZE,3.7,6,M,Z,13.4,9.5,11.6,24,267
9,ACURA,TSX,COMPACT,2.4,4,A,Z,10.6,7.5,9.2,31,212
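
Before selecting features, it can help to check the shape, column types and missing values of the loaded frame. The following is a minimal sketch, assuming a three-row sample (with a subset of the columns) copied from the preview above in place of the full `CO2_Emissions.csv` file; the name `sample` is illustrative.

```python
import io

import pandas as pd

# Three rows copied from the preview above; the notebook itself reads CO2_Emissions.csv.
sample_csv = io.StringIO(
    "Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,"
    "Fuel Consumption Comb (mpg),CO2 Emissions(g/km)\n"
    "ACURA,ILX,COMPACT,2.0,4,A,Z,33,196\n"
    "ACURA,ILX,COMPACT,2.4,4,M,Z,29,221\n"
    "ACURA,ILX HYBRID,COMPACT,1.5,4,A,Z,48,136\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # numeric vs. object (string) columns
print(sample.isna().sum())  # number of missing values per column
```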


### Extracting required features for training

In [4]:
dataset = dataset[["Vehicle Class", "Engine Size(L)", "Cylinders", "Transmission", "Fuel Type", "Fuel Consumption Comb (mpg)", "CO2 Emissions(g/km)"]]

In [5]:
dataset.head(5)

Unnamed: 0,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (mpg),CO2 Emissions(g/km)
0,COMPACT,2.0,4,A,Z,33,196
1,COMPACT,2.4,4,M,Z,29,221
2,COMPACT,1.5,4,A,Z,48,136
3,SUV - SMALL,3.5,6,A,Z,25,255
4,SUV - SMALL,3.5,6,A,Z,27,244


### Dropping rows with Fuel Type "E" and "N"

In [6]:
dataset = dataset[dataset["Fuel Type"] != "E"]

In [7]:
dataset = dataset[dataset["Fuel Type"] != "N"]

In [8]:
dataset.head(5)

Unnamed: 0,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (mpg),CO2 Emissions(g/km)
0,COMPACT,2.0,4,A,Z,33,196
1,COMPACT,2.4,4,M,Z,29,221
2,COMPACT,1.5,4,A,Z,48,136
3,SUV - SMALL,3.5,6,A,Z,25,255
4,SUV - SMALL,3.5,6,A,Z,27,244
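
The two filters above can also be written as a single step with `isin`. This is a sketch on a toy frame with made-up emission values, not the notebook's data; the names `toy`, `excluded` and `filtered` are illustrative.

```python
import pandas as pd

# Toy frame with illustrative values; fuel-type codes follow the legend above.
toy = pd.DataFrame(
    {
        "Fuel Type": ["Z", "X", "E", "D", "N", "X"],
        "CO2 Emissions(g/km)": [196, 210, 250, 180, 220, 205],
    }
)

excluded = ["E", "N"]
# .copy() returns an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
filtered = toy[~toy["Fuel Type"].isin(excluded)].copy()

print(filtered["Fuel Type"].value_counts())
```

Keeping the excluded codes in one list makes it easier to see, and later change, which fuel types are dropped.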


In [9]:
# Number of rows with regular gasoline (X)
dataset["Fuel Type"].value_counts()["X"]

3637

In [10]:
fuel_type_mapping = {'X': 0, 'Z': 1, 'D': 2}

In [11]:
# Every remaining fuel type is a key of the mapping, so map() encodes all rows.
dataset['Fuel Type'] = dataset['Fuel Type'].map(fuel_type_mapping)

In [12]:
dataset.tail(10)

Unnamed: 0,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (mpg),CO2 Emissions(g/km)
7375,MID-SIZE,2.0,4,A,1,29,223
7376,STATION WAGON - SMALL,2.0,4,A,1,32,208
7377,STATION WAGON - SMALL,2.0,4,A,1,30,219
7378,STATION WAGON - SMALL,2.0,4,A,1,30,220
7379,SUV - SMALL,2.0,4,A,0,31,210
7380,SUV - SMALL,2.0,4,A,1,30,219
7381,SUV - SMALL,2.0,4,A,1,29,232
7382,SUV - SMALL,2.0,4,A,1,27,240
7383,SUV - STANDARD,2.0,4,A,1,29,232
7384,SUV - STANDARD,2.0,4,A,1,26,248


In [13]:
transmission_mapping = {'A': 0, 'M': 1}

In [14]:
# Encode transmission as 0 (automatic) or 1 (manual).
dataset["Transmission"] = dataset["Transmission"].map(transmission_mapping)

In [15]:
dataset.head(10)

Unnamed: 0,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (mpg),CO2 Emissions(g/km)
0,COMPACT,2.0,4,0,1,33,196
1,COMPACT,2.4,4,1,1,29,221
2,COMPACT,1.5,4,0,1,48,136
3,SUV - SMALL,3.5,6,0,1,25,255
4,SUV - SMALL,3.5,6,0,1,27,244
5,MID-SIZE,3.5,6,0,1,28,230
6,MID-SIZE,3.5,6,0,1,28,232
7,MID-SIZE,3.7,6,0,1,25,255
8,MID-SIZE,3.7,6,1,1,24,267
9,COMPACT,2.4,4,0,1,31,212


In [16]:
dataset.tail(10)

Unnamed: 0,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (mpg),CO2 Emissions(g/km)
7375,MID-SIZE,2.0,4,0,1,29,223
7376,STATION WAGON - SMALL,2.0,4,0,1,32,208
7377,STATION WAGON - SMALL,2.0,4,0,1,30,219
7378,STATION WAGON - SMALL,2.0,4,0,1,30,220
7379,SUV - SMALL,2.0,4,0,0,31,210
7380,SUV - SMALL,2.0,4,0,1,30,219
7381,SUV - SMALL,2.0,4,0,1,29,232
7382,SUV - SMALL,2.0,4,0,1,27,240
7383,SUV - STANDARD,2.0,4,0,1,29,232
7384,SUV - STANDARD,2.0,4,0,1,26,248
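
The integer codes 0/1/2 for fuel type imply an ordering (X < Z < D) that a linear model treats as a numeric quantity. One common alternative, which this notebook does not use, is one-hot encoding. The sketch below applies `pd.get_dummies` to a toy frame; the frame and variable names are illustrative.

```python
import pandas as pd

# Toy frame using the same category codes as the dataset.
toy = pd.DataFrame(
    {
        "Fuel Type": ["X", "Z", "D", "Z"],
        "Transmission": ["A", "M", "A", "A"],
    }
)

# drop_first=True drops one level per column (here D and A) to avoid
# perfectly collinear indicator columns in a linear model.
encoded = pd.get_dummies(
    toy, columns=["Fuel Type", "Transmission"], drop_first=True, dtype=int
)
print(encoded)
```

With this encoding, each fuel type gets its own coefficient instead of sharing a single slope.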


### Convert mpg to kmpl and drop the original mpg column

In [17]:
dataset["Fuel Consumption Comb (kmpl)"] = dataset["Fuel Consumption Comb (mpg)"] / 2.352

dataset.drop(columns=["Fuel Consumption Comb (mpg)"], inplace=True)

In [18]:
dataset = dataset[['Vehicle Class', 'Engine Size(L)', 'Cylinders', 'Transmission', 'Fuel Type', 'Fuel Consumption Comb (kmpl)', 'CO2 Emissions(g/km)']]

In [19]:
dataset.head(5)

Unnamed: 0,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (kmpl),CO2 Emissions(g/km)
0,COMPACT,2.0,4,0,1,14.030612,196
1,COMPACT,2.4,4,1,1,12.329932,221
2,COMPACT,1.5,4,0,1,20.408163,136
3,SUV - SMALL,3.5,6,0,1,10.629252,255
4,SUV - SMALL,3.5,6,0,1,11.479592,244


In [20]:
dataset["Transmission"].unique()

array([0, 1], dtype=int64)

In [21]:
dataset["Vehicle Class"].unique()

array(['COMPACT', 'SUV - SMALL', 'MID-SIZE', 'TWO-SEATER', 'MINICOMPACT',
       'SUBCOMPACT', 'FULL-SIZE', 'STATION WAGON - SMALL',
       'SUV - STANDARD', 'VAN - CARGO', 'VAN - PASSENGER',
       'PICKUP TRUCK - STANDARD', 'MINIVAN', 'SPECIAL PURPOSE VEHICLE',
       'STATION WAGON - MID-SIZE', 'PICKUP TRUCK - SMALL'], dtype=object)

In [22]:
dataset["Fuel Type"].unique()

array([1, 2, 0], dtype=int64)

In [23]:
dataset.head(10)

Unnamed: 0,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (kmpl),CO2 Emissions(g/km)
0,COMPACT,2.0,4,0,1,14.030612,196
1,COMPACT,2.4,4,1,1,12.329932,221
2,COMPACT,1.5,4,0,1,20.408163,136
3,SUV - SMALL,3.5,6,0,1,10.629252,255
4,SUV - SMALL,3.5,6,0,1,11.479592,244
5,MID-SIZE,3.5,6,0,1,11.904762,230
6,MID-SIZE,3.5,6,0,1,11.904762,232
7,MID-SIZE,3.7,6,0,1,10.629252,255
8,MID-SIZE,3.7,6,1,1,10.204082,267
9,COMPACT,2.4,4,0,1,13.180272,212


### Feature Scaling and Training the model

In [24]:
from sklearn.model_selection import train_test_split
# Feature scaling is currently disabled; the models below are trained on unscaled features.
# from sklearn.preprocessing import StandardScaler

# sc = StandardScaler()
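
Since the scaler is commented out, the models in this notebook see unscaled features. Kernel methods such as `SVR` are generally sensitive to feature scale, so one option is to wrap the scaler and the model in a pipeline. The sketch below is an assumption-laden illustration on synthetic data (the arrays, coefficients and the name `scaled_svr` are made up), not the notebook's actual training code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the feature matrix: two features with different ranges.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(1.0, 6.0, 300), rng.uniform(5.0, 25.0, 300)])
y_demo = 40.0 * X_demo[:, 0] - 5.0 * X_demo[:, 1] + 200.0 + rng.normal(0.0, 5.0, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The pipeline fits the scaler on the training split only and reuses
# those statistics when predicting on the test split.
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X_tr, y_tr)
demo_score = scaled_svr.score(X_te, y_te)  # R^2 on the held-out split
print(demo_score)
```

Fitting the scaler inside the pipeline keeps test-set statistics out of training.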

In [25]:
X = dataset.drop(columns=["Vehicle Class", "CO2 Emissions(g/km)"])

In [26]:
X.head(10)

Unnamed: 0,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (kmpl)
0,2.0,4,0,1,14.030612
1,2.4,4,1,1,12.329932
2,1.5,4,0,1,20.408163
3,3.5,6,0,1,10.629252
4,3.5,6,0,1,11.479592
5,3.5,6,0,1,11.904762
6,3.5,6,0,1,11.904762
7,3.7,6,0,1,10.629252
8,3.7,6,1,1,10.204082
9,2.4,4,0,1,13.180272


In [27]:
y = dataset["CO2 Emissions(g/km)"]

In [28]:
y.head(10)

0    196
1    221
2    136
3    255
4    244
5    230
6    232
7    255
8    267
9    212
Name: CO2 Emissions(g/km), dtype: int64

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [30]:
X_test

Unnamed: 0,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption Comb (kmpl)
3210,2.0,4,1,0,13.180272
4895,1.6,4,0,0,14.455782
267,5.3,8,0,0,8.503401
1610,2.4,4,0,0,19.132653
1988,3.4,6,1,1,11.054422
...,...,...,...,...,...
5246,2.0,4,0,0,14.455782
334,1.4,4,0,0,15.306122
3512,1.6,4,1,2,19.132653
5909,1.8,4,0,0,14.455782


In [31]:
lr = LinearRegression()

In [32]:
y_train

5650    261
6068    273
6243    215
248     173
5832    312
       ... 
4032    198
5501    230
5536    273
5710    242
952     179
Name: CO2 Emissions(g/km), Length: 4699, dtype: int64

In [33]:
model = lr.fit(X_train, y_train)

In [34]:
model.predict(X_test[:100])

array([213.98694296, 194.90508071, 325.44501145, 139.62443811,
       265.22456729, 260.54233133, 304.7510686 , 166.37177994,
       271.66182343, 205.88163322, 251.70245109, 234.14012103,
       314.32551934, 231.59455734, 193.06905328, 198.23608102,
       289.0756321 , 285.06092741, 273.40577739, 175.06872058,
       293.90080765, 250.81935919, 198.23608102, 289.0756321 ,
       226.03481129, 243.42358571, 187.34053574, 246.14270503,
       251.70245109, 217.14406492, 273.40577739, 315.75265536,
       246.14270503, 327.30687076, 191.88002089, 187.34053574,
       330.34608737, 222.28605842, 317.68155389,  77.41584308,
       337.47295241, 206.34139076, 306.63796216, 348.98484012,
       179.27723099, 189.20239506,  95.88104018, 120.00691795,
       293.90080765, 283.09813349, 181.78078969, 211.90113681,
       316.18737865, 226.03481129, 279.8179993 , 268.38168925,
       181.16412454, 261.27690183, 196.37422171, 296.98761105,
       208.4271969 , 218.73669308, 123.67977045, 317.14

In [35]:
y_test[:100]

3210    214
4895    194
267     317
1610    145
1988    251
       ... 
6558    220
6826    244
2917    261
2957    226
4110    175
Name: CO2 Emissions(g/km), Length: 100, dtype: int64
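
Putting predictions next to the actual values makes the errors easier to read than two separate outputs. The sketch below uses only the first five predicted and actual values copied from the outputs above; the frame name `comparison` is illustrative.

```python
import pandas as pd

# First five actual values and linear-regression predictions from the outputs above.
comparison = pd.DataFrame(
    {
        "actual": [214, 194, 317, 145, 251],
        "predicted": [213.98694296, 194.90508071, 325.44501145,
                      139.62443811, 265.22456729],
    },
    index=[3210, 4895, 267, 1610, 1988],
)
comparison["residual"] = comparison["actual"] - comparison["predicted"]

mae_first_five = comparison["residual"].abs().mean()
print(comparison.round(2))
print("mean absolute error on these five rows:", round(mae_first_five, 2))
```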

In [36]:
model.score(X_test, y_test)

0.9250165077034531
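
For scikit-learn regressors, `score` returns the coefficient of determination R². The sketch below, on synthetic data with made-up coefficients, recomputes R² by hand and adds two error metrics in the target's own units. It illustrates the metric and does not reproduce the result above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data: one feature with a noisy linear relationship to the target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(5.0, 25.0, size=(200, 1))
y_demo = 450.0 - 12.0 * X_demo[:, 0] + rng.normal(0.0, 10.0, 200)

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, demo_model.score(X_demo, y_demo), r2_score(y_demo, pred))
print("MAE :", mean_absolute_error(y_demo, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_demo, pred)))
```

MAE and RMSE are expressed in g/km when applied to the notebook's target, which can be easier to interpret than R² alone.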

In [37]:
model_svr = SVR()

In [38]:
model_svr.fit(X_train, y_train)

In [39]:
model_svr.predict(X_test[:20])

array([212.00206049, 192.77620688, 324.49327681, 154.39005312,
       255.86891725, 247.07313963, 295.12148202, 169.10095472,
       265.58078671, 205.62733473, 237.26109721, 235.09983952,
       306.19070587, 234.91267696, 189.1916861 , 192.79749448,
       293.89962495, 292.10391858, 273.36095625, 172.31399858])

In [40]:
y_test[:20]

3210    214
4895    194
267     317
1610    145
1988    251
6438    250
1871    281
3798    170
3634    270
4188    204
101     232
1549    230
1551    292
6497    233
5966    218
1166    193
1368    290
4087    306
1138    269
3793    173
Name: CO2 Emissions(g/km), dtype: int64

In [41]:
model_svr.score(X_test, y_test)

0.9458388992920579
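
Both scores above come from a single train/test split, so the gap between the two models may partly reflect that particular split. Cross-validation gives a more stable comparison. The sketch below is a hedged illustration on synthetic data (the arrays and the `cv_scores` dictionary are made up for the example).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic data standing in for the notebook's features and target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(8.0, 20.0, size=(300, 1))
y_demo = 450.0 - 14.0 * X_demo[:, 0] + rng.normal(0.0, 8.0, 300)

cv_scores = {}
for name, estimator in [("LinearRegression", LinearRegression()), ("SVR", SVR())]:
    # For regressors the default scoring is R^2; cv=5 means five folds.
    cv_scores[name] = cross_val_score(estimator, X_demo, y_demo, cv=5)
    print(f"{name}: mean R^2 = {cv_scores[name].mean():.3f}, "
          f"std = {cv_scores[name].std():.3f}")
```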

### Saving the model

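In [47]:
# Earlier pickle-based saving, left disabled in favour of joblib below.
# import pickle

In [48]:
# pickle.dump(model, open("model.pkl", "wb"))  # the linear model is named `model` in this notebook

In [49]:
# pickle.dump(model_svr, open("model_svr.pkl", "wb"))

In [50]:
# pickle.dump(model_xgboost, open("model_xgboost.pkl", "wb"))  # no XGBoost model is trained in this notebook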

In [42]:
import joblib

In [43]:
joblib.dump(model, "model_lr.pkl")

['model_lr.pkl']

In [44]:
joblib.dump(model_svr, "model_svr.pkl")

['model_svr.pkl']
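
A saved model can be restored with `joblib.load` and used for prediction. The sketch below round-trips a tiny stand-in model through a temporary directory rather than the notebook's `model_lr.pkl`; the data and names are illustrative. Because joblib files are pickle-based, they should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model fitted on exact data: y = 2x + 1.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, path)   # same call the notebook uses
    restored = joblib.load(path)    # returns the fitted estimator

restored_prediction = restored.predict(np.array([[5.0]]))
print(restored_prediction)  # close to 11.0
```

At prediction time, the input must have the same columns in the same order as the training features.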