# **Importing Python Libraries & Reading the Data**

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import metrics

import plotly as py
import datetime
from datetime import date
import cufflinks as cf
from plotly.offline import iplot
import plotly.express as px


car = pd.read_csv("../input/car-data/CarPrice_Assignment.csv")


Using the pandas head function, we can review the first 5 rows.

In [None]:
car.head()

We can use the pandas describe function to get statistical information about the numeric variables in our dataset.

In [None]:
car.describe()


I will check the data types and non-null counts with the info function.

In [None]:
car.info()



# Exploratory Data Analysis

To understand the numeric variables and the relations among them, we can use data visualization methods and statistical results. Let's start with the numeric variables.

I will look at the distribution of price values with a histogram.

The distribution appears right-skewed rather than normal.

In [None]:
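# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(car["price"], kde=True)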
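
To put a number on the visual impression of skewness, a sketch along the following lines may help. It uses a synthetic, log-normally distributed series as a stand-in for car["price"] (the values are made up for illustration, not taken from the dataset) and compares the sample skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (assumption: a stand-in for car["price"])
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=9.3, scale=0.5, size=500)), name="price")

skew_raw = prices.skew()          # a positive value indicates a longer right tail
skew_log = np.log(prices).skew()  # a log transform usually pulls the right tail in

print(f"skew (raw): {skew_raw:.2f}")
print(f"skew (log): {skew_log:.2f}")
```

A clearly positive skewness on the raw values, and a value near zero after the log transform, would support reading the histogram as right-skewed.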

We can use pairplot to understand pairwise relationships between the variables.

In [None]:
sns.pairplot(car)

As you can see in the graph above, some of the variables seem to have strong correlations, for example citympg and highwaympg. We can use the corr() function to calculate pairwise correlations of all numerical variables in our dataset.

In [None]:
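# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
car.corr(numeric_only=True)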

To see these relations both visually and numerically, we can use the seaborn heatmap function as below:

In [None]:
sns.set(rc = {'figure.figsize':(15,8)})
sns.heatmap(car.corr(numeric_only=True), annot=True, cmap="YlGnBu")

Some of the pairs with high correlations are carlength-wheelbase, highwaympg-citympg, curbweight-enginesize and horsepower-price.
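
Reading the strongest pairs off a large heatmap by eye is error-prone, so they can also be listed programmatically. The sketch below is one possible way to do it; top_correlated_pairs is a hypothetical helper name, and the toy data frame is made up purely to exercise it.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data (made up): b is almost a multiple of a, c is unrelated noise
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
top = top_correlated_pairs(toy, n=1)
print(top)
```

Applied to the car data frame, the same call would return the pairs as a Series indexed by the two column names.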

In [None]:
#Car_ID is irrelevant so I dropped it 
car=car.drop('car_ID',axis=1)

Now, let's explore the categorical variables in our dataset. First, I will evaluate CarName.

In [None]:
# Pass the column as x=; in recent seaborn the first positional argument is data
sns.countplot(x=car["CarName"])


As you can see above, there is a high variety of car names. I looked at the values of CarName and decided to split them and derive the brand from this variable.

In [None]:
# Take the first word of CarName as the brand (vectorized, avoids chained assignment)
car["CarBrand"] = car["CarName"].str.split().str[0]

In [None]:
sns.set(rc = {'figure.figsize':(30,15)})
sns.countplot(x=car["CarBrand"])


As you can see above, the car brand variable seems more convenient to use, but some of the brand names seem to be misspelled. I will fix these mistakes.

In [None]:
car["CarBrand"].unique()

In [None]:
# Map misspelled or inconsistently cased brand names to a single spelling
car["CarBrand"] = car["CarBrand"].replace({
    "maxda": "mazda",
    "alfa-romero": "alfa-romeo",
    "Nissan": "nissan",
    "porcshce": "porsche",
    "toyouta": "toyota",
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
})

In [None]:
sns.set(rc = {'figure.figsize':(20,8)})
sns.countplot(x=car["CarBrand"])

I will check the relation between car brand and price with a boxplot. If car brand appears to affect price, I will convert it to dummy variables to use in our regression model.

In [None]:
sns.boxplot(data=car, x="CarBrand" , y="price")

Car brand might have an effect on price, so I decided to use the CarBrand variable by converting it to dummy variables.

In [None]:
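# dtype=int keeps the indicators numeric; recent pandas returns booleans, which statsmodels may reject
car = pd.get_dummies(car, prefix='', prefix_sep='',
                     columns=['CarBrand'], dtype=int)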
car=car.drop('CarName',axis=1)
car
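
One detail of this encoding is worth illustrating. By default, pd.get_dummies keeps one indicator column per category, so the indicators of a variable always sum to 1 and duplicate the information in a regression intercept. The sketch below, on a made-up toy frame, contrasts that with drop_first=True, which drops one category as a reference level. It is an optional alternative, not what this notebook does.

```python
import pandas as pd

# Toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas"],
    "price": [13495.0, 22470.0, 7895.0, 9980.0],
})

# Default behaviour: one indicator column per category
full = pd.get_dummies(toy, columns=["fueltype"], prefix="", prefix_sep="", dtype=int)

# drop_first=True drops the first category in sorted order ("diesel") as the reference level
reduced = pd.get_dummies(toy, columns=["fueltype"], prefix="", prefix_sep="",
                         drop_first=True, dtype=int)

# In the full encoding the indicators sum to 1 in every row
print(full[["diesel", "gas"]].sum(axis=1).tolist())
print(reduced.columns.tolist())
```

Keeping every category, as done here, is a likely reason some indicator columns show infinite VIF values once a constant is added to the model.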

With the same logic, I decided which variables to encode and which to remove from the dataset. The code is given below.

In [None]:
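sns.countplot(x=car["fueltype"])
plt.show()  # finish this figure so the next plot is not drawn on the same axes
sns.boxplot(data=car, x="fueltype", y="price")
plt.show()
car = pd.get_dummies(car, prefix='', prefix_sep='',
                     columns=['fueltype'], dtype=int)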




In [None]:
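sns.countplot(x=car['aspiration'])
plt.show()  # finish this figure so the next plot is not drawn on the same axes
sns.boxplot(data=car, x="aspiration", y="price")
plt.show()
car = pd.get_dummies(car, prefix='', prefix_sep='',
                     columns=['aspiration'], dtype=int)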




In [None]:

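sns.countplot(x=car['doornumber'])
plt.show()  # finish each figure so the plots are not drawn on the same axes
sns.boxplot(data=car, x="doornumber", y="price")
plt.show()
sns.stripplot(data=car, x="doornumber", y="price")
plt.show()
car = car.drop('doornumber', axis=1)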








In [None]:
sns.countplot(x=car['carbody'])
plt.show()  # finish this figure so the next plot is not drawn on the same axes
sns.stripplot(data=car, x="carbody", y="price")
plt.show()
car = car.drop('carbody', axis=1)


In [None]:
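sns.countplot(x=car['drivewheel'])
plt.show()  # finish this figure so the next plot is not drawn on the same axes
sns.stripplot(data=car, x="drivewheel", y="price")
plt.show()
car = pd.get_dummies(car, prefix='', prefix_sep='',
                     columns=['drivewheel'], dtype=int)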


In [None]:

sns.countplot(x=car['enginelocation'])
plt.show()  # finish this figure so the next plot is not drawn on the same axes
sns.stripplot(data=car, x="enginelocation", y="price")
plt.show()
car = car.drop('enginelocation', axis=1)

In [None]:

sns.countplot(x=car['enginetype'])
plt.show()  # finish this figure so the next plot is not drawn on the same axes
sns.stripplot(data=car, x="enginetype", y="price")
plt.show()
car = car.drop('enginetype', axis=1)

In [None]:

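sns.countplot(x=car['cylindernumber'])
plt.show()  # finish this figure so the next plot is not drawn on the same axes
sns.stripplot(data=car, x="cylindernumber", y="price")
plt.show()
car = pd.get_dummies(car, prefix='cylindernum', prefix_sep='',
                     columns=['cylindernumber'], dtype=int)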

In [None]:

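sns.countplot(x=car['fuelsystem'])
plt.show()  # finish this figure so the next plot is not drawn on the same axes
sns.stripplot(data=car, x="fuelsystem", y="price")
plt.show()
car = pd.get_dummies(car, prefix='fuel', prefix_sep='',
                     columns=['fuelsystem'], dtype=int)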

I dropped doornumber, carbody, enginelocation and enginetype based on their stripplot and boxplot graphs. These variables don't seem to affect price values.

In [None]:
car.info()

# Model Building - Linear Regression

First, we import the train_test_split function from sklearn.model_selection and split our dataset into a 70% training set and a 30% test set.

In [None]:
from sklearn.model_selection import train_test_split

X=car.drop("price",axis=1)
y=car["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)


I imported the linear regression model from sklearn.linear_model and statsmodels.api to look at the statistical results of the model.

In [None]:
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.feature_selection import RFE


lm = LinearRegression()

lm.fit(X_train,y_train)

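# Fit the same specification with statsmodels on the training data, with an explicit intercept
X_train_sm = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_sm).fit()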

predictions = lm.predict( X_test)

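I will evaluate the model with MAE, MSE, RMSE, R square and Adjusted R square. The first four are computed on the test set, while Adjusted R square comes from the statsmodels fit on the training set. I will record these metrics for each model in a data frame to see how the results change from beginning to end.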

In [None]:

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)
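
Because the adjusted R square printed above comes from the training fit, it is not directly comparable with the test-set R square. As a sketch of an alternative, the function below computes all five metrics on the same sample, using the textbook formula adjusted R square = 1 - (1 - R square)(n - 1)/(n - p - 1) for n observations and p predictors. The name regression_metrics and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """MAE, MSE, RMSE, R^2 and adjusted R^2, all computed on the same sample."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    r2 = 1.0 - float(np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    # Adjusted R^2 penalises R^2 for the number of predictors (needs n > n_features + 1)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MAE": float(np.mean(np.abs(resid))),
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "R_square": r2,
        "Adjusted_R_square": adj_r2,
    }

# Toy numbers (made up) just to exercise the function
result = regression_metrics(
    [10.0, 12.0, 14.0, 16.0, 18.0],
    [11.0, 12.0, 13.0, 16.0, 19.0],
    n_features=2,
)
print(result)
```

For the models in this notebook, a call such as regression_metrics(y_test, predictions, n_features=X_test.shape[1]) would give a test-set adjusted R square, which is typically lower than the training value.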

In [None]:
count=1
MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj

model_metrics = pd.DataFrame(columns = ["# No","MAE", "MSE", "RMSE", "R_square", "Adjusted_R_square"])
model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
model_metrics

# Linear Regression with Recursive Feature Elimination (RFE) 

The first linear regression model uses all 58 predictors, several of which are collinear; I fitted it first only to get baseline metrics. Now we can use the RFE method to select a desired number of features. I will keep 30 variables to build a new model.

In [None]:

lm = LinearRegression()
lm.fit(X_train, y_train)
rfe = RFE(lm, n_features_to_select=30)  # keyword argument is required in recent scikit-learn
rfe = rfe.fit(X_train, y_train)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))


[('symboling', False, 25),
 ('wheelbase', False, 16),
 ('carlength', False, 17),
 ('carwidth', False, 7),
 ('carheight', False, 26),
 ('curbweight', False, 27),
 ('enginesize', False, 12),
 ('boreratio', True, 1),
 ('stroke', True, 1),
 ('compressionratio', False, 2),
 ('horsepower', False, 24),
 ('peakrpm', False, 28),
 ('citympg', False, 20),
 ('highwaympg', False, 23),
 ('alfa-romeo', False, 13),
 ('audi', True, 1),
 ('bmw', True, 1),
 ('buick', True, 1),
 ('chevrolet', True, 1),
 ('dodge', True, 1),
 ('honda', False, 8),
 ('isuzu', False, 14),
 ('jaguar', True, 1),
 ('mazda', False, 5),
 ('mercury', True, 1),
 ('mitsubishi', True, 1),
 ('nissan', True, 1),
 ('peugeot', True, 1),
 ('plymouth', True, 1),
 ('porsche', True, 1),
 ('renault', True, 1),
 ('saab', True, 1),
 ('subaru', False, 22),
 ('toyota', False, 3),
 ('volkswagen', False, 11),
 ('volvo', True, 1),
 ('diesel', True, 1),
 ('gas', True, 1),
 ('std', True, 1),
 ('turbo', True, 1),
 ('4wd', False, 19),
 ('fwd', False, 6),
 ('rwd', True, 1),
 ('cylindernumeight', True, 1),
 ('cylindernumfive', False, 21),
 ('cylindernumfour', True, 1),
 ('cylindernumsix', False, 18),
 ('cylindernumthree', False, 29),
 ('cylindernumtwelve', True, 1),
 ('cylindernumtwo', True, 1),
 ('fuel1bbl', False, 4),
 ('fuel2bbl', False, 9),
 ('fuel4bbl', True, 1),
 ('fuelidi', True, 1),
 ('fuelmfi', True, 1),
 ('fuelmpfi', True, 1),
 ('fuelspdi', False, 10),
 ('fuelspfi', False, 15)]
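
To make the support and ranking columns in the listing above easier to interpret, here is a minimal RFE run on synthetic data in which only two of four features drive the target. All names and values in it are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): only x0 and x1 drive the target, x2 and x3 are noise
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x0", "x1", "x2", "x3"])
y_toy = 5.0 * X_toy["x0"] - 3.0 * X_toy["x1"] + rng.normal(scale=0.1, size=200)

# Repeatedly fit the model and discard the weakest feature until two remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X_toy, y_toy)

selected = X_toy.columns[selector.support_].tolist()
print(selected)
print(dict(zip(X_toy.columns, selector.ranking_)))
```

Selected features get rank 1, and higher ranks mark features that were eliminated earlier. With a linear model, RFE ranks features by the size of their coefficients, so features on large numeric scales (curbweight, for example) can be eliminated early simply because their coefficients are small. Standardizing the features first might therefore change the selection.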

In [None]:
col_sup = X_train.columns[rfe.support_]

col_sup


Index(['boreratio', 'stroke', 'audi', 'bmw', 'buick', 'chevrolet', 'dodge',
       'jaguar', 'mercury', 'mitsubishi', 'nissan', 'peugeot', 'plymouth',
       'porsche', 'renault', 'saab', 'volvo', 'diesel', 'gas', 'std', 'turbo',
       'rwd', 'cylindernumeight', 'cylindernumfour', 'cylindernumtwelve',
       'cylindernumtwo', 'fuel4bbl', 'fuelidi', 'fuelmfi', 'fuelmpfi'],
      dtype='object')

In [None]:
X_train_rfe = X_train[col_sup]


I ran the model with the 30 variables selected by RFE. The results summary below still shows multicollinearity problems and high p-values, so we will continue removing features to address them.

In [None]:
import statsmodels.api as sm  
X_train_rfe= sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())


X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)


MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
model_metrics

# Detecting Multicollinearity with Variance Inflation Factors (VIF)

As mentioned above, the model has multicollinearity problems, as expected from the correlation plots shown earlier. We can measure multicollinearity with the VIF score. As a rule of thumb, each variable's VIF should be less than 5.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_data = pd.DataFrame()

vif_data["feature"] = X_train_rfe.columns

vif_data["VIF"] = [variance_inflation_factor(X_train_rfe.values, i)
                          for i in range(len(X_train_rfe.columns))]
vif_data
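
For intuition about what these scores mean, the VIF of a feature j is 1 / (1 - R_j^2), where R_j^2 is the R square obtained by regressing feature j on all the other features. The sketch below computes it directly from that definition with NumPy on made-up data. It adds its own intercept to each auxiliary regression, which is an assumption of this sketch, and vif_from_definition is a hypothetical helper name.

```python
import numpy as np

def vif_from_definition(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Auxiliary regression: intercept plus every column except j
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Toy data (made up): b is nearly a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)
c = rng.normal(size=300)
vifs = vif_from_definition(np.column_stack([a, b, c]))
print([round(v, 1) for v in vifs])
```

When one column is an exact linear combination of the others, R_j^2 equals 1 and the VIF becomes infinite. This would happen, for example, with two indicator columns that always sum to the constant column.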

In [None]:
#I dropped diesel due to inf vif value

X_train_rfe=X_train_rfe.drop('diesel',axis=1)
vif_data = pd.DataFrame()

vif_data["feature"] = X_train_rfe.columns

vif_data["VIF"] = [variance_inflation_factor(X_train_rfe.values, i)
                          for i in range(len(X_train_rfe.columns))]
vif_data

In [None]:
#I dropped gas due to inf vif value

X_train_rfe=X_train_rfe.drop('gas',axis=1)
vif_data = pd.DataFrame()

vif_data["feature"] = X_train_rfe.columns

vif_data["VIF"] = [variance_inflation_factor(X_train_rfe.values, i)
                          for i in range(len(X_train_rfe.columns))]
vif_data

In [None]:
#I dropped std due to inf vif value

X_train_rfe=X_train_rfe.drop('std',axis=1)
vif_data = pd.DataFrame()

vif_data["feature"] = X_train_rfe.columns

vif_data["VIF"] = [variance_inflation_factor(X_train_rfe.values, i)
                          for i in range(len(X_train_rfe.columns))]
vif_data

Now all the variables' VIF scores are less than 5 and we can run the model again. Based on the OLS summary results, I assume that multicollinearity is no longer a problem.

In [None]:

X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

According to the OLS summary results, the multicollinearity problem is fixed. In the next step, we will evaluate the p-values of the variables, remove those with high p-values one at a time, and see whether the model metrics improve further. I will start by removing the plymouth variable due to its high p-value.

In [None]:
#I dropped plymouth due to high p value

X_train_rfe=X_train_rfe.drop('plymouth',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1



In [None]:
#I dropped fuel4bbl due to high p value

X_train_rfe=X_train_rfe.drop('fuel4bbl',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped mitsubishi due to high p value

X_train_rfe=X_train_rfe.drop('mitsubishi',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped chevrolet due to high p value

X_train_rfe=X_train_rfe.drop('chevrolet',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped fuelidi due to high p value

X_train_rfe=X_train_rfe.drop('fuelidi',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped fuelmfi due to high p value

X_train_rfe=X_train_rfe.drop('fuelmfi',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped mercury due to high p value

X_train_rfe=X_train_rfe.drop('mercury',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped renault due to high p value

X_train_rfe=X_train_rfe.drop('renault',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped nissan due to high p value

X_train_rfe=X_train_rfe.drop('nissan',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped dodge due to high p value

X_train_rfe=X_train_rfe.drop('dodge',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped cylindernumtwo due to high p value

X_train_rfe=X_train_rfe.drop('cylindernumtwo',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped rwd due to high p value

X_train_rfe=X_train_rfe.drop('rwd',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped stroke due to high p value

X_train_rfe=X_train_rfe.drop('stroke',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

In [None]:
#I dropped cylindernumtwelve due to high p value

X_train_rfe=X_train_rfe.drop('cylindernumtwelve',axis=1)
X_train_rfe = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfe).fit()

#Summary of linear model
print(lm_rfe.summary())

# Adding constant
X_test_rfe = sm.add_constant(X_test)

X_test_new = X_test_rfe[X_train_rfe.columns]
predictions = lm_rfe.predict(X_test_new)

model=lm_rfe

 
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("R square:",metrics.r2_score(y_test, predictions))
print('Adjusted R square : ', model.rsquared_adj)

MAE=metrics.mean_absolute_error(y_test, predictions)
MSE=metrics.mean_squared_error(y_test, predictions)
RMSE=np.sqrt(metrics.mean_squared_error(y_test, predictions))
R_square=metrics.r2_score(y_test, predictions)
Adjusted_R_square= model.rsquared_adj


model_metrics.loc[count]=[count,MAE , MSE, RMSE, R_square, Adjusted_R_square]
count=count+1

Now all the variables in the model are significant. We can evaluate the results with the consolidated model_metrics data frame.

In [None]:
model_metrics

In [None]:
sns.set_style("whitegrid")
sns.set(rc={'figure.figsize':(11.7,8.27)})
plt.ylim(3000, 4000)  # the ymin/ymax keywords were removed from matplotlib
sns.lineplot(x=model_metrics['# No'],y=model_metrics["RMSE"])

As you can see above, RMSE values stay between roughly 3400 and 3500 after the first model. This is a significant improvement over the first linear model we built, and the remaining variation comes from the RFE, VIF and p-value elimination steps.

In [None]:

sns.set(rc={'figure.figsize':(11.7,8.27)})
plt.ylim(0.70, 1.0)
sns.lineplot(x=model_metrics['# No'],y=model_metrics["R_square"])

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
plt.ylim(0.70, 1.0)
sns.lineplot(x=model_metrics['# No'],y=model_metrics["Adjusted_R_square"])

The R square value increased to 83%, whereas the Adjusted R square decreased to 91%. Both metrics are satisfactory and suggest that the final model fits the data well when predicting car prices.

In this post, I shared how to build a linear regression model with the scikit-learn library and how to select features using RFE and VIF.