### Car Price Prediction using Linear Regression
<img src='https://i.pinimg.com/originals/72/0c/a5/720ca5bb5a0a70ea0fb60e7db560c952.gif' height=500 width=500/>
<p>Business Goal: 
We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.</p>

In [None]:
#importing libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
#reading data 

dfcars = pd.read_csv('../input/car-price-prediction/CarPrice_Assignment.csv')
dfcars.head()


In [None]:
dfcars.dtypes

In [None]:
dfcars.columns

#### Checking for missing values

In [None]:
dfcars.isnull().sum()

In [None]:
dfcars.isnull().values.any()

There are no missing values in the data, so let's move on and check for duplicate records.

#### Checking for duplicate records

In [None]:
dfcars.duplicated(dfcars.columns[1:]).sum()

There are no duplicate rows in the data, so it is clean and ready for further processing.

In [None]:
# drop car_id column as it is irrelevant for the analysis and modeling
dfcars.drop('car_ID',axis=1,inplace=True)

In [None]:
dfcars.head()

### Exploratory Data Analysis

There is no column for the car brand, but from knowledge of the car business we can say that the price range varies with the brand.
Let's create a brand column from the car name.

In [None]:
dfcars['brand'] = dfcars['CarName'].str.split(' ',expand=True)[0]

In [None]:
dfcars.brand.unique()

Some brands clearly appear under several spellings (for example `maxda` and `mazda`). Let's resolve this issue.
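
As an alternative to the row-by-row `.loc` assignments in the next cell, the same cleanup can be sketched with a single mapping. This is only an illustration on a toy Series (not the real `dfcars`); the name `brand_fixes` is an assumption introduced here.

```python
import pandas as pd

# Toy stand-in for dfcars['brand']; the misspellings are the ones fixed in this notebook.
brands = pd.Series(['maxda', 'Nissan', 'porcshce', 'toyouta', 'vokswagen', 'vw', 'bmw'])

# Hypothetical mapping from each variant spelling to its canonical brand name.
brand_fixes = {
    'maxda': 'mazda',
    'porcshce': 'porsche',
    'toyouta': 'toyota',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
}

# Lower-casing handles 'Nissan' -> 'nissan'; replace() leaves unmapped values untouched.
cleaned = brands.str.lower().replace(brand_fixes)
print(cleaned.tolist())
```

Keeping the corrections in one dictionary makes it easy to add a new variant later without writing another assignment.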

In [None]:
dfcars.loc[dfcars['brand']=='maxda','brand'] = 'mazda'
dfcars.loc[dfcars['brand']=='Nissan','brand'] = 'nissan'
dfcars.loc[dfcars['brand']=='porcshce','brand'] = 'porsche'
dfcars.loc[dfcars['brand']=='toyouta','brand'] = 'toyota'
dfcars.loc[(dfcars['brand']=='vokswagen') | (dfcars['brand']=='vw'),'brand'] = 'volkswagen'

In [None]:
dfcars.brand.unique()

In [None]:
#explore dependent variable
sns.distplot(dfcars['price'])
plt.axvline(dfcars['price'].mean(),color='red')
plt.axvline(dfcars['price'].median(),color='green')

The histogram shows that the mean price is somewhere near 13000 USD and that most of the cars are priced below 20000 USD.
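
The red (mean) and green (median) lines in the plot can also be compared numerically. A minimal sketch on invented toy prices (not the real `price` column) shows how a few expensive cars pull the mean above the median:

```python
import pandas as pd

# Toy prices in USD, invented for illustration: mostly cheap cars plus a few expensive ones.
prices = pd.Series([6000, 7000, 7500, 8000, 9000, 10000, 12000, 16000, 35000, 41000])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # a positive value indicates a longer right tail

print(f"mean={mean_price:.0f}, median={median_price:.0f}, skew={skewness:.2f}")
```

On the real data the same three calls would be applied to `dfcars['price']`.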

In [None]:
#box plot
sns.boxplot(dfcars['price'])

The box plot shows that there are outliers in the data: a few high-end cars priced above 30000 USD.
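
The points drawn beyond the whiskers follow the usual 1.5 × IQR rule (the default whisker length in seaborn and matplotlib). A small sketch on invented toy prices, not the real data, shows how the same rule can be applied by hand:

```python
import pandas as pd

# Toy prices in USD, invented for illustration.
prices = pd.Series([6000, 7000, 7500, 8000, 9000, 10000, 12000, 16000, 35000, 41000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are the ones a box plot draws as individual points.
upper_fence = q3 + 1.5 * iqr
outliers = prices[prices > upper_fence]

print(upper_fence, outliers.tolist())
```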

In [None]:
dfcars.columns

In [None]:
#Categorical variables
cat = ['price','symboling','brand','CarName','fueltype','aspiration','doornumber','carbody','drivewheel','enginelocation','enginetype','cylindernumber','fuelsystem']
dfcars_cat = dfcars[cat]

#Continuous variables
cont = [i for i in dfcars.columns if i not in cat]+['price']
dfcars_cont = dfcars[cont]

#### Univariate Analysis for Categorical features

In [None]:
dfcars_cat.symboling.value_counts().plot(kind='bar')


symboling is a risk score associated with owning the car: a positive number is riskier and a negative number is safer.

The graph above shows that most cars have a positive symboling, so most cars in the data are on the riskier side.
Our assumption is that a car with a higher symboling should be comparatively cheaper.
    



In [None]:
print(len(dfcars_cat.CarName.unique()),'unique cars')


In [None]:
dfcars_cat.brand.value_counts().plot(kind='bar')

In [None]:
dfcars_cat.fueltype.value_counts()

Most of the cars run on gas.

In [None]:
dfcars_cat.aspiration.value_counts().plot(kind='bar')

Most of the cars have standard aspiration. We assume that turbo-aspirated cars are costlier, and we will check this in the segmented analysis.

In [None]:
dfcars_cat.doornumber.value_counts()


More than 50% of the cars have four doors.
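
Percentages like this one can be read directly from `value_counts(normalize=True)` instead of being derived from raw counts. A minimal sketch on a toy column (the values are invented, not the real `doornumber` data):

```python
import pandas as pd

# Toy stand-in for dfcars['doornumber'].
doors = pd.Series(['four', 'four', 'two', 'four', 'two', 'four', 'four', 'two'])

# normalize=True returns fractions; multiply by 100 to get percentages.
shares = doors.value_counts(normalize=True).mul(100).round(1)
print(shares)
```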

In [None]:
dfcars_cat.carbody.value_counts()


Close to 50% of the cars in the dataset are sedans.

In [None]:
dfcars_cat.drivewheel.value_counts().plot(kind='bar')

Front-wheel-drive cars are the most common, followed by rear-wheel-drive cars, and there are very few all-wheel-drive cars.

In [None]:
dfcars_cat.enginelocation.value_counts()

Since there are very few data points with enginelocation as 'rear', it makes little sense to include this feature in the model.

In [None]:
#dropping engine location
dfcars.drop('enginelocation',axis=1,inplace=True)

In [None]:
dfcars_cat.enginetype.value_counts()

72% of the cars are of ohc type engine

In [None]:
dfcars_cat.cylindernumber.value_counts().plot(kind='bar')

78% of the cars have 4 cylinders

In [None]:
dfcars_cat.fuelsystem.value_counts()

#### Segmented Univariate analysis with price

In [None]:
sns.boxplot(data=dfcars,x='symboling',y='price')

The box plot supports our initial assumption: cars with negative symboling tend to be costlier.

In [None]:
sns.boxplot(data=dfcars_cat,x='price',y='brand')

BMW, Porsche, Jaguar and Buick have the costlier car variants.

In [None]:
dfcars_cat.columns

In [None]:
sns.boxplot(data=dfcars_cat,y='price',x='fueltype')

On average, diesel cars are costlier than gas cars.
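
A box plot comparison like this can be backed by a grouped summary. The sketch below uses a tiny invented table (not the real data) to show the pattern; on the notebook's data the same `groupby` would run on `dfcars`.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'fueltype': ['gas', 'gas', 'gas', 'diesel', 'diesel'],
    'price': [7000, 9000, 14000, 13000, 18000],
})

# Count, mean and median price per fuel type.
summary = toy.groupby('fueltype')['price'].agg(['count', 'mean', 'median'])
print(summary)
```

Looking at the count next to the mean is useful here, because a group with only a handful of cars gives a much less reliable average.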

In [None]:
sns.boxplot(data=dfcars_cat,x='price',y='aspiration')

In [None]:
sns.boxplot(data=dfcars_cat,x='doornumber',y='price')

As the box plot shows, doornumber does not impact price much. We could remove this variable, but it is too early to do that. Let's build some models first and then see whether doornumber has to be removed.

In [None]:
sns.boxplot(data=dfcars_cat,x='carbody',y='price')

Hatchbacks and wagons are cheaper cars as per the above box plot

In [None]:
sns.boxplot(data=dfcars_cat,x='enginetype',y='price')

#### Univariate analysis for continuous variables

In [None]:
dfcars_cont.columns

In [None]:
dfcars_cont[dfcars_cont.columns[:7]].describe()

In [None]:
dfcars_cont[dfcars_cont.columns[7:]].describe()

In [None]:
sns.pairplot(dfcars_cont)

In [None]:
sns.distplot(dfcars_cont.wheelbase)

In [None]:
sns.distplot(dfcars_cont.carlength)

In [None]:
sns.distplot(dfcars_cont.horsepower)

#### Bivariate Analysis

In [None]:
dfcars_cont.corr()

In [None]:
#lets see pictorial representation in form of heatmap
cmap = sns.color_palette("YlOrBr", as_cmap=True)  # keep the palette so it can be passed to heatmap
plt.figure(figsize = (16,5))
sns.heatmap(dfcars_cont.corr(),annot=True,cmap=cmap)

Let's observe the correlation of the continuous features with price.
<ul>
    <li>stroke has a low correlation with price</li>
    <li>compressionratio has a low correlation with price</li>
    <li>carheight and peakrpm also have a low correlation with price</li>
</ul>
We remove these variables from the dataframe.

NOTE: We observed multicollinearity in the dataset, but let's continue to build the LR model and check the VIFs for each of the features.
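
For reference, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. The NumPy sketch below illustrates that definition on synthetic data. The helper name `vif` is hypothetical, and it adds an intercept to each auxiliary regression, which is an assumption of this sketch and may give values that differ from the statsmodels helper used later in this notebook.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (2-D array without a constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: c is nearly a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

The two nearly collinear columns get a large VIF while the independent one stays close to 1, which is the pattern to look for in the VIF tables below.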

In [None]:
dfcars.drop(['stroke','compressionratio','carheight','peakrpm'],axis=1,inplace=True)

In [None]:
dfcars.drop('CarName',axis=1,inplace=True)

In [None]:
dfcars.columns

### Modeling : Linear Regression

Categorical variables cannot be used in the model directly, so we need to encode them before moving forward.
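
The two encodings used in the next cells can be illustrated on a tiny invented table: a binary column is mapped to 0/1, and a multi-level column is expanded into dummy columns with the first level dropped. The toy frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'carbody': ['sedan', 'hatchback', 'wagon', 'sedan'],
    'fueltype': ['gas', 'diesel', 'gas', 'gas'],
})

# Binary category -> a single 0/1 column.
toy['fueltype'] = toy['fueltype'].map({'gas': 0, 'diesel': 1})

# Multi-level category -> k-1 dummy columns; drop_first avoids the dummy variable trap.
dummies = pd.get_dummies(toy[['carbody']], drop_first=True, dtype=int)
encoded = pd.concat([toy.drop(columns='carbody'), dummies], axis=1)
print(encoded)
```

Here `hatchback` becomes the reference level, so a row with zeros in both dummy columns is a hatchback.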

In [None]:
dfcars['doornumber'] = dfcars['doornumber'].map({'two':0,'four':1})
dfcars['fueltype'] = dfcars['fueltype'].map({'gas':0,'diesel':1})


In [None]:
#lets convert other categorical vars to dummy vars
cols_for_dummy = ['brand','aspiration','carbody','drivewheel','enginetype','fuelsystem','cylindernumber']
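# dtype=int keeps the dummies numeric; recent pandas versions return bool columns by default
dummy = pd.get_dummies(dfcars[cols_for_dummy],drop_first=True,dtype=int)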
dfcars.drop(cols_for_dummy,axis=1,inplace=True)

In [None]:
#concat dfcars and dummy
dfcars = pd.concat([dfcars,dummy],axis=1)

In [None]:
dfcars.head()

In [None]:
#importing libs for MachineLearning
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
scale = MinMaxScaler()
to_be_scaled = ['symboling','wheelbase','carlength','carwidth','curbweight','enginesize','boreratio','horsepower','citympg'
               ,'highwaympg','price']
dfcars[to_be_scaled] = scale.fit_transform(dfcars[to_be_scaled])
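
A note on the scaling step above: the scaler is fitted on the full dataset before the train/test split, so the minimum and maximum of the test rows influence the scaled training data. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the column values and `random_state=0` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for two of the notebook's columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'enginesize': rng.uniform(60, 330, 50),
    'price': rng.uniform(5000, 45000, 50),
})

# Split first, then learn min/max from the training rows only.
train, test = train_test_split(toy, train_size=0.7, random_state=0)
train, test = train.copy(), test.copy()

cols = ['enginesize', 'price']
scaler = MinMaxScaler()
train[cols] = scaler.fit_transform(train[cols])  # fit + transform on train
test[cols] = scaler.transform(test[cols])        # reuse the train min/max on test

print(train[cols].min().tolist(), train[cols].max().tolist())
```

With this order the scaled test values may fall slightly outside [0, 1], which is expected.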

In [None]:
dfcars

In [None]:
#train test split
x = list(dfcars.columns)
x.remove('price')
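# random_state fixes the split so that the feature eliminations below are reproducible
xtrain,xtest,Ytrain,Ytest = train_test_split(dfcars[x],dfcars['price'],train_size=0.7,random_state=42)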

In [None]:
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()

In [None]:
lr_model.summary()

In [None]:
#calculate VIFs
def giveVIFs():
    vif = pd.DataFrame()
    vif['features'] = xtrain.columns
    vif['VIF'] = [variance_inflation_factor(xtrain.values,i) for i in range(xtrain.shape[1])]
    vif = vif.sort_values(by='VIF',ascending=False)
    display(vif)
giveVIFs()

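We observe that the p-values of fuelsystem_spdi, fuelsystem_mpfi, fuelsystem_mfi, fuelsystem_4bbl, enginetype_ohc and drivewheel_rwd are very large. We will eliminate features step by step, rebuilding the model and rechecking the VIFs after each drop. The next cell starts with a different set of columns (cylindernumber_two, enginetype_l, enginetype_rotor, fuelsystem_idi, fueltype and cylindernumber_three); the variables listed above are dropped in later steps.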


In [None]:
xtrain.drop(['cylindernumber_two', 'enginetype_l', 'enginetype_rotor','fuelsystem_idi','fueltype','cylindernumber_three'],axis=1,inplace=True)

In [None]:
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#Lets remove symboling
xtrain.drop(['symboling'],axis=1,inplace=True)

In [None]:
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#Lets remove carlength
xtrain.drop(['carlength'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#Lets remove citympg
xtrain.drop(['citympg'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#Lets remove horsepower
xtrain.drop(['horsepower'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#Lets remove wheelbase and brand_subaru
xtrain.drop(['wheelbase','brand_subaru'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove highwaympg and fuelsystem_mpfi
xtrain.drop(['highwaympg','fuelsystem_mpfi'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove drivewheel_fwd and drivewheel_rwd
xtrain.drop(['drivewheel_fwd','drivewheel_rwd'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove brand_mazda
xtrain.drop(['brand_mazda'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
#lets remove brand_honda and doornumber
xtrain.drop(['brand_honda','doornumber'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove brand_renault
xtrain.drop(['brand_renault'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove brand_jaguar and brand_dodge
xtrain.drop(['brand_jaguar','brand_dodge'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove fuelsystem_4bbl and fuelsystem_spdi
xtrain.drop(['fuelsystem_4bbl','fuelsystem_spdi'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove brand_chevrolet, brand_volkswagen
xtrain.drop(['brand_chevrolet','brand_volkswagen'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove brand_mercury
xtrain.drop(['brand_mercury'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove enginetype_ohcv
xtrain.drop(['enginetype_ohcv'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove brand_isuzu and enginetype_ohc
xtrain.drop(['brand_isuzu','enginetype_ohc'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
#lets remove brand_peugeot and brand_audi
xtrain.drop(['brand_peugeot','brand_audi'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove brand_buick
xtrain.drop(['brand_buick'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#lets remove carbody_hardtop
xtrain.drop(['carbody_hardtop'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#remove fuelsystem_2bbl,enginetype_ohcf and brand_toyota
xtrain.drop(['fuelsystem_2bbl','enginetype_ohcf','brand_toyota'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#remove curbweight
xtrain.drop(['curbweight'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#remove carwidth
xtrain.drop(['carwidth'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()


In [None]:
#remove boreratio
xtrain.drop(['boreratio'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#remove brand_plymouth
xtrain.drop(['brand_plymouth'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
# remove fuelsystem_mfi
xtrain.drop(['fuelsystem_mfi'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#remove cylindernumber_five
xtrain.drop(['cylindernumber_five'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#remove enginetype_dohcv
xtrain.drop(['enginetype_dohcv'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

In [None]:
#remove cylindernumber_four
xtrain.drop(['cylindernumber_four'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
#remove brand_volvo
xtrain.drop(['brand_volvo'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
#remove brand_saab
xtrain.drop(['brand_saab'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
#remove cylindernumber_six
xtrain.drop(['cylindernumber_six'],axis=1,inplace=True)
xtrain_sm = sm.add_constant(xtrain)
lr = sm.OLS(Ytrain,xtrain_sm)
lr_model = lr.fit()
lr_model.summary()

In [None]:
giveVIFs()

#### The model looks good now, and the VIFs also look fine for all the remaining features

## Residual Analysis

In [None]:
y_train_pred = lr_model.predict(xtrain_sm)
y_train_pred

In [None]:
res = Ytrain - y_train_pred
sns.distplot(res)

The residuals appear approximately normally distributed and centered at 0.
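
One caveat when reading this plot: for an OLS model with an intercept, the training residuals average to zero by construction, so the centering alone says little. The shape (symmetry, tails) is the informative part. A minimal NumPy sketch on synthetic data illustrates both points; the data and coefficients are invented.

```python
import numpy as np

# Synthetic data: a linear relation plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 0.2 + 0.6 * x + rng.normal(scale=0.05, size=100)

# Least-squares fit with an intercept column.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
res = y - A @ coef

res_mean = res.mean()  # ~0 by construction when an intercept is included
res_skew = np.mean((res - res_mean) ** 3) / res.std() ** 3  # near 0 for symmetric residuals

print(f"mean={res_mean:.2e}, skew={res_skew:.2f}")
```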

In [None]:
sns.scatterplot(res)

## Predictions

In [None]:
xtest.describe()

In [None]:
xtest = xtest[['enginesize', 'brand_bmw', 'brand_mitsubishi', 'brand_nissan',
       'brand_porsche', 'aspiration_turbo', 'carbody_hatchback',
       'carbody_sedan', 'carbody_wagon', 'fuelsystem_spfi',
       'cylindernumber_twelve']]
xtest_sm = sm.add_constant(xtest)
xtest_sm

In [None]:
#make predictions on test data
y_test_pred = lr_model.predict(xtest_sm)
y_test_pred

In [None]:
#check r2 score
from sklearn.metrics import r2_score

In [None]:
r2_score(Ytest,y_test_pred)
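
Since `price` was min-max scaled before modeling, it is worth noting that R² is unchanged by such a scaling of the target: it is the ratio of two sums of squares that are rescaled by the same factor. The sketch below computes R² by hand on invented prices and checks this; the values and the assumed min/max are illustrative only.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def min_max(values, lo, hi):
    """Min-max scale values with a given minimum and maximum."""
    return (values - lo) / (hi - lo)

# Invented actual and predicted prices in USD.
y_true_usd = np.array([7000.0, 9500.0, 13000.0, 18000.0, 35000.0])
y_pred_usd = np.array([8000.0, 9000.0, 12000.0, 20000.0, 33000.0])

lo, hi = 5000.0, 45000.0  # assumed price range used for scaling

r2_usd = r2(y_true_usd, y_pred_usd)
r2_scaled = r2(min_max(y_true_usd, lo, hi), min_max(y_pred_usd, lo, hi))
rmse_usd = np.sqrt(np.mean((y_true_usd - y_pred_usd) ** 2))

print(f"R2 (USD)={r2_usd:.4f}, R2 (scaled)={r2_scaled:.4f}, RMSE (USD)={rmse_usd:.1f}")
```

An error metric such as RMSE, in contrast, does depend on the scale, so it is easier to interpret after converting predictions back to USD.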