
## Auto MPG Data Set

Using an appropriate model (linear regression, logistic regression, or nonlinear regression), analyze the "mpg" data. Perform hyperparameter tuning to improve the model's performance.

Data Dictionary
1. mpg - fuel efficiency measured in miles per gallon (mpg)
2. cylinders - number of cylinders in the engine
3. displacement - engine displacement (in cubic inches)
4. horsepower - engine horsepower
5. weight - vehicle weight (in pounds)
6. acceleration - time to accelerate from 0 to 60 mph (in seconds)
7. model year
8. origin - origin of car (1: American, 2: European, 3: Japanese)
9. car name
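The data dictionary describes `origin` as a region code rather than a quantity, so one option is to one-hot encode it instead of feeding the integers 1 to 3 to a linear model. The cells below keep `origin` numeric; what follows is only a minimal sketch of the alternative, run on a tiny made-up frame (`toy` and `origin_labels` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Tiny made-up frame that mirrors only the `mpg` and `origin` columns.
toy = pd.DataFrame({"mpg": [18, 44, 32], "origin": [1, 2, 3]})

# Map the integer codes to the labels given in the data dictionary.
origin_labels = {1: "american", 2: "european", 3: "japanese"}
toy["origin"] = toy["origin"].map(origin_labels)

# One indicator column per region; drop_first removes the redundant
# baseline column (here "american") to avoid perfect collinearity.
encoded = pd.get_dummies(toy, columns=["origin"], drop_first=True, dtype=int)
print(encoded)
```

With this encoding each remaining indicator coefficient is read as the difference in mpg relative to the dropped baseline region.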


In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

In [3]:
from google.colab import drive
drive.mount("/drive")

Mounted at /drive


In [4]:
path = "/content/mpg.csv"
df = pd.read_csv(path)
df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18,8,307,130,3504,12,70,1,\tchevrolet chevelle malibu
1,15,8,350,165,3693,12,70,1,\tbuick skylark 320
2,18,8,318,150,3436,11,70,1,\tplymouth satellite
3,16,8,304,150,3433,12,70,1,\tamc rebel sst
4,17,8,302,140,3449,11,70,1,\tford torino
...,...,...,...,...,...,...,...,...,...
393,27,4,140,86,2790,16,82,1,\tford mustang gl
394,44,4,97,52,2130,25,82,2,\tvw pickup
395,32,4,135,84,2295,12,82,1,\tdodge rampage
396,28,4,120,79,2625,19,82,1,\tford ranger


In [5]:
corr = df.corr(method = "pearson", numeric_only = True)  # skip the text column car_name
corr.style.background_gradient(cmap = "coolwarm").format(precision = 2)  # set_precision is deprecated


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
mpg,1.0,-0.77,-0.8,-0.78,-0.83,0.4,0.58,0.56
cylinders,-0.77,1.0,0.95,0.84,0.9,-0.5,-0.35,-0.56
displacement,-0.8,0.95,1.0,0.9,0.93,-0.54,-0.37,-0.61
horsepower,-0.78,0.84,0.9,1.0,0.86,-0.68,-0.42,-0.45
weight,-0.83,0.9,0.93,0.86,1.0,-0.41,-0.31,-0.58
acceleration,0.4,-0.5,-0.54,-0.68,-0.41,1.0,0.26,0.21
model_year,0.58,-0.35,-0.37,-0.42,-0.31,0.26,1.0,0.18
origin,0.56,-0.56,-0.61,-0.45,-0.58,0.21,0.18,1.0
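Several predictors in the matrix above are strongly correlated with one another, which can make individual regression coefficients unstable. As a rough sketch, the pairs above a chosen cutoff can be listed programmatically. The values below are copied from the table for four of the features; the 0.9 cutoff and the names `cols`, `pairs`, and `strong` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Correlations copied from the table above for four of the predictors.
cols = ["cylinders", "displacement", "horsepower", "weight"]
corr = pd.DataFrame(
    [[1.00, 0.95, 0.84, 0.90],
     [0.95, 1.00, 0.90, 0.93],
     [0.84, 0.90, 1.00, 0.86],
     [0.90, 0.93, 0.86, 1.00]],
    index=cols, columns=cols,
)

# Keep only the strict upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs whose absolute correlation exceeds the (assumed) 0.9 cutoff.
strong = pairs[pairs.abs() > 0.9].sort_values(ascending=False)
print(strong)
```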


In [6]:
df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'car_name'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   mpg           398 non-null    int64 
 1   cylinders     398 non-null    int64 
 2   displacement  398 non-null    int64 
 3   horsepower    398 non-null    int64 
 4   weight        398 non-null    int64 
 5   acceleration  398 non-null    int64 
 6   model_year    398 non-null    int64 
 7   origin        398 non-null    int64 
 8   car_name      398 non-null    object
dtypes: int64(8), object(1)
memory usage: 28.1+ KB


In [8]:
df.shape

(398, 9)

In [16]:
Y = df["mpg"]
X = df.drop(["car_name", "mpg"], axis = 1)

In [17]:
X.head()

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,8,307,130,3504,12,70,1
1,8,350,165,3693,12,70,1
2,8,318,150,3436,11,70,1
3,8,304,150,3433,12,70,1
4,8,302,140,3449,11,70,1


In [18]:
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(278, 7) (120, 7) (278,) (120,)


In [20]:
# Standardization (zero mean, unit variance); the scaler is fitted on the training split only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
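Fitting the scaler on the training split only, as above, keeps test-set statistics out of the model. When cross-validation is used, one way to get the same guarantee inside every fold is to wrap the scaler and the model in a scikit-learn `Pipeline`. The sketch below runs on synthetic data; `X_demo`, `y_demo`, and the coefficients used to generate them are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[3000.0, 100.0], scale=[800.0, 40.0], size=(200, 2))
y_demo = 45.0 - 0.006 * X_demo[:, 0] - 0.04 * X_demo[:, 1] + rng.normal(0, 1.0, 200)

# The pipeline refits the scaler on the training part of each fold,
# so the held-out part never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print("Mean cross-validated R-squared: %.4f" % scores.mean())
```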

In [21]:
model = LinearRegression()
model.fit(X_train, Y_train)

# R-squared on the test set
R2 = model.score(X_test, Y_test)
print("R-squared: %.4f" %R2) # share of the variance in mpg explained by the feature variables

print("Intercept (b0): ", model.intercept_) # predicted mpg when every standardized feature is at its training mean
print("slope(b1):", model.coef_) # change in mpg per one standard deviation increase in each feature

y_pred = model.predict(X_test)

# Compute the mean squared error on the test set
MSE = mean_squared_error(Y_test, y_pred)
print("Mean squared error: %.2f" % MSE)

R-squared: 0.8180
Intercept (b0):  23.413669064748202
slope(b1): [-0.52297989  2.11182116 -0.82179902 -5.47187939  0.43395704  2.75287047
  1.15906569]
Mean squared error: 11.65
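The task statement also allows nonlinear regression, and `PolynomialFeatures` is imported at the top but not used in the cells shown. A minimal sketch of how a degree-2 polynomial model could be compared with the plain linear fit is given below. It uses synthetic data with a deliberately curved relationship (`weight_demo` and `mpg_demo` are made up), so it says nothing about how such a model would score on the real mpg data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: a single feature with a curved effect on the target.
rng = np.random.default_rng(0)
weight_demo = rng.uniform(1600, 5000, size=300)
mpg_demo = 60.0 - 0.02 * weight_demo + 2e-6 * weight_demo**2 + rng.normal(0, 1.0, 300)

X_demo = weight_demo.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, mpg_demo, test_size=0.3, random_state=0)

# Baseline: standardize, then fit a straight line.
linear = make_pipeline(StandardScaler(), LinearRegression())
# Alternative: standardize, expand to degree-2 terms, then fit.
poly = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

mse_lin = mean_squared_error(y_te, linear.fit(X_tr, y_tr).predict(X_te))
mse_poly = mean_squared_error(y_te, poly.fit(X_tr, y_tr).predict(X_te))
print("Linear MSE: %.2f, degree-2 MSE: %.2f" % (mse_lin, mse_poly))
```

The polynomial degree is itself a hyperparameter and could be tuned with cross-validation in the same way as the regularization settings.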


In [22]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV # cross validation
parameters = {'loss': ('squared_error', 'huber'), 
              'penalty': ('l2', 'l1', 'elasticnet'), 
              'alpha': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1], 
              'max_iter': [3000, 5000],
              'random_state': [0, 2, 5]
              }
model3 = SGDRegressor()
clf = GridSearchCV(model3, parameters)
clf.fit(X_train, Y_train)

In [26]:
clf.best_params_

{'alpha': 0.01,
 'loss': 'squared_error',
 'max_iter': 3000,
 'penalty': 'l1',
 'random_state': 2}

In [23]:
clf.best_score_

0.8090621255916555
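Note that `best_score_` is the mean cross-validated score (R-squared by default for regressors) on the training folds, so it is not directly comparable to the test-set R-squared of the plain `LinearRegression` above. A sketch of how a tuned model could be scored on held-out data is shown below. It uses synthetic, already standardized features and a much smaller grid than the notebook; all names and values in it are illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with roughly unit-variance features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 23.0 + X_demo @ np.array([-5.0, 2.5, 1.0]) + rng.normal(0, 1.0, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Small illustrative grid; the seed is fixed rather than tuned.
grid = {"alpha": [1e-4, 1e-2], "penalty": ["l2", "l1"]}
search = GridSearchCV(SGDRegressor(max_iter=3000, random_state=0), grid, cv=5)
search.fit(X_tr, y_tr)

cv_r2 = search.best_score_            # mean R-squared over the CV folds
test_r2 = search.score(X_te, y_te)    # R-squared of the refitted best model
test_mse = mean_squared_error(y_te, search.predict(X_te))
print("CV R2: %.4f, test R2: %.4f, test MSE: %.2f" % (cv_r2, test_r2, test_mse))
```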

In [24]:
clf.best_estimator_.intercept_

array([23.40744004])

In [25]:
clf.best_estimator_.coef_

array([ 0.        ,  0.58675971, -0.9791737 , -4.52616865,  0.1946593 ,
        2.69931413,  1.05782259])
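The coefficient array is easier to read when paired with the feature names. The sketch below copies the values printed above and assumes they follow the column order shown by `X.head()`; `coef_table`, `ranked`, and `dropped` are illustrative names.

```python
import pandas as pd

# Column order as shown by X.head(); values copied from the output above.
features = ["cylinders", "displacement", "horsepower", "weight",
            "acceleration", "model_year", "origin"]
coefs = [0.0, 0.58675971, -0.9791737, -4.52616865,
         0.1946593, 2.69931413, 1.05782259]

coef_table = pd.Series(coefs, index=features)

# Rank by absolute size; the features were standardized, so magnitudes
# are comparable across features.
ranked = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)

# Features whose coefficient the L1 penalty shrank exactly to zero.
dropped = coef_table[coef_table == 0].index.tolist()
print(ranked)
print("Zeroed by the L1 penalty:", dropped)
```

Read this way, `weight` carries the largest coefficient in magnitude, and the L1 penalty removes `cylinders` from the model entirely, which is consistent with its strong correlation with `displacement` and `weight`.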