
Pollution is a big issue in the city. Policy makers in the city want to take action and deploy some measures to address this problem. You have been hired as a machine learning expert to analyze some data and help them make good decisions.

Cars that consume more fuel pollute more. As a first step, we want to estimate how much fuel each individual car consumes every 100 km. The provided dataset concerns city-cycle fuel consumption in liters per 100 kilometers (target).

The aim of this homework is to help you apply the skills you have learned so far to a real dataset. This involves understanding what the data means, handling and visualizing data, training, cross-validation, prediction, and testing your model.

### Description of covariates

This dataset has 3 multi-valued discrete covariates, 4 continuous covariates, and 1 string covariate (the target is also continuous):
1. cylinders:     multi-valued discrete
2. displacement:  continuous
3. horsepower:    continuous
4. weight:        continuous
5. acceleration:  continuous
6. model year:    multi-valued discrete
7. origin:        multi-valued discrete
8. car name:      string (unique for each instance)

### Load data

In [91]:
# importing the essentials:

import numpy as np
import pandas as pd
import scipy as sp 


In [139]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.datasets import make_regression



In [93]:
# for neural networks:

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD, Adam
from keras.regularizers import l2


In [94]:
# for plotting:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [95]:
# extras:
import seaborn as sns


In [96]:
# Loading data:
train_data = pd.read_csv('https://raw.githubusercontent.com/onefishy/Rwanda-course-2020/master/Competition_data/train.csv',na_values='?')
train_data.head(5)

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,fuel (L/100km)
0,4,106.0,63.0,2123.0,14.7,82,1,"""amc hornet""",6.189737
1,8,400.0,150.0,3760.0,8.5,70,1,"""dodge challenger se""",15.680667
2,4,104.0,70.0,2150.0,13.9,79,1,"""honda civic cvcc""",6.817681
3,4,92.0,68.0,-1971.0,17.6,82,3,"""volvo 264gl""",7.587419
4,6,167.0,120.0,3819.0,16.7,76,2,"""pontiac firebird""",14.255152


In [97]:
test_data = pd.read_csv('https://raw.githubusercontent.com/onefishy/Rwanda-course-2020/master/Competition_data/test.csv')
test_data.head(5)

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,4,122.0,80.0,2450.0,15.5,74,1,"""ford galaxie 500"""
1,6,257.0,95.0,3191.0,17.8,76,1,"""ford pinto"""
2,4,87.0,65.0,2108.0,18.9,80,3,"""plymouth duster"""
3,8,399.0,167.0,4906.0,12.5,73,1,"""ford country squire (sw)"""
4,8,399.0,150.0,4462.0,13.0,73,1,"""dodge aspen 6"""


In [98]:
train_data.dtypes

cylinders           int64
displacement      float64
horsepower         object
weight            float64
acceleration      float64
model year          int64
origin              int64
car name           object
fuel (L/100km)    float64
dtype: object

### Clean data

In [99]:
# weights cannot be negative, so take the absolute value
train_data['weight'] = train_data['weight'].abs()

# horsepower has dtype object and needs to be converted to numeric
train_data['horsepower'] = pd.to_numeric(train_data['horsepower'], errors='coerce')

# drop the car name from both the train and test datasets
train_data = train_data.drop(['car name'], axis=1)
test_data = test_data.drop(['car name'], axis=1)
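The cell above applies the weight and horsepower fixes to `train_data` only. If `test_data` needs the same treatment (an assumption, since the test file is not inspected in this notebook), one option is to wrap the steps in a function and call it on both frames. The sketch below uses a made-up toy frame; `clean_frame` is a hypothetical helper name, not part of the original code.

```python
import numpy as np
import pandas as pd


def clean_frame(df):
    """Return a cleaned copy of df (hypothetical helper, not in the notebook)."""
    out = df.copy()
    # Negative weights are treated as sign errors, as in the cell above.
    out['weight'] = out['weight'].abs()
    # Non-numeric horsepower entries (e.g. '?') become NaN.
    out['horsepower'] = pd.to_numeric(out['horsepower'], errors='coerce')
    # The car name is unique per row, so it is dropped.
    return out.drop(columns=['car name'])


# Toy frame standing in for train_data / test_data (values are made up).
toy = pd.DataFrame({
    'weight': [2123.0, -1971.0],
    'horsepower': ['63.0', '?'],
    'car name': ['car a', 'car b'],
})
cleaned = clean_frame(toy)
print(cleaned)
```

Calling the same function on both frames keeps the preprocessing of the two sets from drifting apart.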

### Check for null values


In [100]:
train_data.isna().sum()

cylinders         0
displacement      0
horsepower        6
weight            0
acceleration      0
model year        0
origin            0
fuel (L/100km)    0
dtype: int64

### Fill null values using a linear regression model

In [101]:
# rows with at least one null value (only horsepower has nulls)
null_data = train_data[train_data.isnull().any(axis=1)]
# complete rows, used to fit the imputation model
dropped_data = train_data.dropna()

# predict horsepower from all other columns
x_imp = dropped_data.drop(['horsepower'], axis=1)
y_imp = dropped_data.horsepower

X_train, X_test, y_train, y_test = train_test_split(x_imp, y_imp, test_size=0.3, random_state=42)

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

LinearRegression()

In [102]:
# features of the rows whose horsepower is missing
dropped_null_data = null_data.drop(['horsepower'], axis=1)

# predict all missing horsepower values in one call
lr_predictions = lr_model.predict(dropped_null_data)

for a in range(len(lr_predictions)):
    print("Prediction {}".format(a + 1))
    print(lr_predictions[a:a + 1])
    print("---------------")

null_index = train_data[train_data["horsepower"].isna()].index

# write the predictions back with .loc to avoid chained assignment
train_data.loc[null_index, "horsepower"] = lr_predictions

# print the number of missing values after imputation
print("Missing Values: {}".format(train_data.isnull().sum().sum()))

Prediction 1
[59.58975531]
---------------
Prediction 2
[74.67456643]
---------------
Prediction 3
[80.85621823]
---------------
Prediction 4
[59.77335237]
---------------
Prediction 5
[106.19192562]
---------------
Prediction 6
[90.55133999]
---------------
Missing Values: 0
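The notebook imports `KNNImputer` but does not use it. As an alternative to the regression-based fill, a nearest-neighbour imputation could look roughly like the sketch below. This is a different technique from the one used above, shown on a made-up toy frame; `n_neighbors=2` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing horsepower value (numbers are illustrative only).
toy = pd.DataFrame({
    'displacement': [106.0, 400.0, 104.0, 92.0, 167.0],
    'horsepower':   [63.0, 150.0, np.nan, 68.0, 120.0],
    'weight':       [2123.0, 3760.0, 2150.0, 1971.0, 3819.0],
})

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors closest rows, measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
print(filled)
```

The features are not scaled here, so `weight` dominates the distance computation; scaling first may be preferable on the real data.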


### Remove outliers

In [103]:
# define quantile limits
low = 0.05
high = 0.95

quantile_df = train_data.quantile([low, high])
features = ['displacement', 'horsepower', 'weight', 'acceleration', 'fuel (L/100km)']

# NOTE: each pass filters the original train_data, so train_data_rm ends up
# holding only the result of the last filter (fuel), as the row counts show
for i in features:
    train_data_rm = train_data[(train_data[i] > quantile_df.loc[low, i]) & (train_data[i] < quantile_df.loc[high, i])]
    print('Number of rows after outlier in {} removal: {}'.format(i, train_data_rm.shape[0]))

train_data = train_data_rm


Number of rows after outlier in displacement removal: 267
Number of rows after outlier in horsepower removal: 264
Number of rows after outlier in weight removal: 268
Number of rows after outlier in acceleration removal: 264
Number of rows after outlier in fuel (L/100km) removal: 255
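The row counts above are not cumulative: every pass of the loop filters the original `train_data`, so only the last filter (on fuel) survives. If the intent is to keep rows that fall inside the quantile range for every feature, the conditions can be combined into one mask. The sketch below shows this on random toy data; `trim_quantiles` is a hypothetical helper name. Applying it to the real data would change the row counts and RMSE values reported in this notebook.

```python
import numpy as np
import pandas as pd


def trim_quantiles(df, columns, low=0.05, high=0.95):
    """Keep rows strictly inside the (low, high) quantiles of every column.

    Hypothetical helper, not part of the notebook.
    """
    bounds = df[columns].quantile([low, high])
    mask = pd.Series(True, index=df.index)
    for col in columns:
        # AND the per-column conditions so every filter is retained.
        mask &= (df[col] > bounds.loc[low, col]) & (df[col] < bounds.loc[high, col])
    return df[mask]


rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
trimmed = trim_quantiles(toy, ['a', 'b'])
print(len(toy), len(trimmed))
```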


### Assign X and y

In [104]:
X = train_data.drop(['fuel (L/100km)'], axis = 1)

y = train_data['fuel (L/100km)']

### One-hot encoding

In [105]:
# one-hot encode origin and cylinders (model year is left numeric)

X['origin'] = X['origin'].astype(str)
X['cylinders'] = X['cylinders'].astype(str)
X = pd.get_dummies(X)

test_data['origin'] = test_data['origin'].astype(str)
test_data['cylinders'] = test_data['cylinders'].astype(str)
test_data = pd.get_dummies(test_data)

# add the cylinders_5 column, which the encoded test set lacks, at the position it has in X
test_data.insert(loc=7, column='cylinders_5', value = 0)
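Inserting `cylinders_5` at a hard-coded position works only while the set of categories stays the same. A more general way to make the test columns match the training columns is `DataFrame.reindex`. The sketch below uses small made-up frames (`X_toy`, `test_toy`) in place of `X` and `test_data`.

```python
import pandas as pd

# Toy stand-ins for the training and test features (values are made up).
X_toy = pd.get_dummies(pd.DataFrame({
    'weight': [2123.0, 3760.0, 2950.0],
    'cylinders': ['4', '8', '5'],
}))
test_toy = pd.get_dummies(pd.DataFrame({
    'weight': [2450.0, 4906.0],
    'cylinders': ['4', '8'],
}))

# reindex adds any training column the test frame lacks (filled with 0),
# drops extras, and puts the columns in the training order.
test_aligned = test_toy.reindex(columns=X_toy.columns, fill_value=0)
print(list(test_aligned.columns))
```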

### Split dataset

In [119]:
x_train,x_test,y_train,y_test = train_test_split(X,y, test_size=0.2,random_state=0)

In [120]:
print("Shape of Training data is:",x_train.shape)
print("Shape of Testing data is:",x_test.shape)

Shape of Training data is: (204, 13)
Shape of Testing data is: (51, 13)
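With only 51 held-out rows, a single split gives a fairly noisy RMSE estimate. The introduction mentions cross validation, which could be sketched as below with `cross_val_score`. The data here is synthetic (generated with `make_regression`) and only stands in for `x_train` and `y_train`; the fold count of 5 is an arbitrary choice.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for x_train / y_train (the real data is not used here).
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=5.0, random_state=0)

model = make_pipeline(StandardScaler(), SVR(C=8, epsilon=2e-1))

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=5,
                           scoring='neg_root_mean_squared_error')
cv_rmse = -neg_rmse
print('RMSE per fold:', cv_rmse)
print('Mean RMSE:', cv_rmse.mean())
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no information from the validation fold leaks into the scaling.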


# Modeling

## 1. Support Vector Regression - SVR

In [160]:
# Step 1: Fit a scaled SVR pipeline on the training set
svm_regr = make_pipeline(StandardScaler(), SVR(C=8, epsilon=2e-1))
svm_regr.fit(x_train, y_train)

# Step 2: Predict on the training set
y_train_pred = svm_regr.predict(x_train)
# Step 3: Compute RMSE on the training set
print('RMSE on Training Data on :', np.sqrt(mean_squared_error(y_train, y_train_pred)))
# Step 4: Predict on the test set
y_test_pred = svm_regr.predict(x_test)
# Step 5: Compute RMSE on the test set
print('RMSE on Testing Data  :', np.sqrt(mean_squared_error(y_test, y_test_pred)),'\n')

# Keep the test-set predictions for later use
prediction_SVM = svm_regr.predict(x_test)

RMSE on Training Data on : 0.772848752617257
RMSE on Testing Data  : 1.0458077072583167 



## 2. Linear Regression

In [161]:
lin_regr = LinearRegression()
lin_regr.fit(x_train,y_train)

y_train_pred = lin_regr.predict(x_train)

print('RMSE on training data on :', np.sqrt(mean_squared_error(y_train,y_train_pred)))

y_test_pred = lin_regr.predict(x_test)
print('RMSE on testing data on :', np.sqrt(mean_squared_error(y_test,y_test_pred)))

prediction_linreg = lin_regr.predict(x_test)


RMSE on training data on : 1.02287719099466
RMSE on testing data on : 0.9605270609452948


## 3. Random Forest

In [162]:
rand_regr = RandomForestRegressor(max_depth=3, random_state=0)
rand_regr.fit(x_train,y_train)

y_train_pred = rand_regr.predict(x_train)
print('RMSE on training data on: ', np.sqrt(mean_squared_error(y_train,y_train_pred)))

y_test_pred = rand_regr.predict(x_test)
print('RMSE on testing data on:', np.sqrt(mean_squared_error(y_test,y_test_pred)))

# store the random forest predictions (not the linear regression ones)
prediction_randreg = rand_regr.predict(x_test)

RMSE on training data on:  0.956258101283766
RMSE on testing data on: 1.140804750053362


## 4. Decision Tree 

In [163]:
dec_regr = DecisionTreeRegressor(max_depth=6)

dec_regr.fit(x_train,y_train)

y_train_pred = dec_regr.predict(x_train)
print('RMSE on training data on: ', np.sqrt(mean_squared_error(y_train, y_train_pred)))

y_test_pred = dec_regr.predict(x_test)
print('RMSE on testing data on: ', np.sqrt(mean_squared_error(y_test, y_test_pred)))


prediction_decreg = dec_regr.predict(x_test)



RMSE on training data on:  0.5358281574186513
RMSE on testing data on:  1.3523547731922596
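Each model cell repeats the same fit, predict, and RMSE steps, which is where copy-and-paste slips tend to creep in. One possible way to reduce the repetition is a small helper that returns both RMSE values, looped over a dictionary of models. The sketch below runs on synthetic data standing in for `X` and `y`; `rmse_report` is a hypothetical helper name, and the numbers it prints are unrelated to the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def rmse_report(model, x_tr, y_tr, x_te, y_te):
    """Fit model and return (train RMSE, test RMSE). Hypothetical helper."""
    model.fit(x_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(x_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(x_te)))
    return train_rmse, test_rmse


# Synthetic data standing in for X and y.
X_demo, y_demo = make_regression(n_samples=255, n_features=13, noise=10.0, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(max_depth=3, random_state=0),
    'Decision Tree': DecisionTreeRegressor(max_depth=6, random_state=0),
}
results = {name: rmse_report(m, x_tr, y_tr, x_te, y_te) for name, m in models.items()}
for name, (tr, te) in results.items():
    print('{:<18} train RMSE: {:.3f}  test RMSE: {:.3f}'.format(name, tr, te))
```

A large gap between train and test RMSE, as seen for the decision tree above (0.54 versus 1.35), is a sign of overfitting.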
