# Geely Automobile 

Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.
 
They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:
- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market.

## Step 1: Reading and Understanding the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
#Importing dataset
car_df = pd.read_csv('CarPrice_Assignment.csv')

FileNotFoundError: [Errno 2] File b'CarPrice_Assignment.csv' does not exist: b'CarPrice_Assignment.csv'

In [None]:
#analyze the dataframe
car_df.head()

In [None]:
# The dataframe has 205 rows and 26 columns
car_df.shape

In [None]:
car_df.info()

In [None]:
# percentage of missing values
round(car_df.isnull().sum()/len(car_df.index), 2)*100

#### There are no null values in the dataframe, so no cleaning or imputation is required
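
The same check can be sketched on a small made-up frame (the column values below are illustrative, not taken from the dataset). `isnull().mean()` gives the fraction of missing values per column directly:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; 'price' has one missing entry out of four.
toy_df = pd.DataFrame({
    "horsepower": [111, 154, 102, 115],
    "price": [13495.0, np.nan, 13950.0, 17450.0],
})

# Fraction of nulls per column, expressed as a percentage.
missing_pct = (toy_df.isnull().mean() * 100).round(2)
print(missing_pct)
```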

#### Creating model fitting function for reusability

In [None]:
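def fit_LRM(X_train,Y_train):
    # Fit an OLS model with an intercept and print its summary.
    # Relies on statsmodels.api being imported as `sm` before the first call.
    X_train = sm.add_constant(X_train)
    lm = sm.OLS(Y_train,X_train).fit()
    print(lm.summary())
    return lm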

In [None]:
def getVIF(X_train):
    # Calculate the VIFs for the new model
    vif = pd.DataFrame()
    X = X_train
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return(vif)
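
For intuition about what `getVIF` reports, here is a minimal NumPy sketch of the textbook definition VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the other features. The function name `vif_sketch` and the synthetic data are assumptions for illustration. The sketch adds an intercept to each auxiliary regression, so its numbers may differ from those of the statsmodels function used above.

```python
import numpy as np

def vif_sketch(X):
    """Return VIF_i = 1 / (1 - R_i^2) for each column of a 2-D array X.

    R_i^2 comes from regressing column i on the remaining columns plus an
    intercept (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic example: `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif_sketch(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two near-duplicate columns get large VIFs while the independent one stays close to 1, which is the pattern used later to decide which features to drop.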

In [None]:
def draw_categorical(column_name,df,title=None,hue=None):
    # Count plot of a categorical column, bars ordered by frequency and labelled with counts
    fig, ax = plt.subplots(figsize=(14,8))
    plt.title(title)
    ax = sns.countplot(data=df, x=column_name, order=df[column_name].value_counts().index, hue=hue)
    ax.set(xlabel=column_name, ylabel="Count")

    for p in ax.patches:
        ax.text(p.get_x() + p.get_width()/2., np.nan_to_num(p.get_height()), '%d' % int(np.nan_to_num(p.get_height())),
                fontsize=12, color='black', ha='center', va='bottom')
    plt.xticks(rotation='vertical',fontsize=12)
    plt.show()

In [None]:
def draw_continous_plot(column_name,df,hue=None,title=None):
    # Distribution plot (left) and vertical box plot (right) for a numeric column
    sns.set(style="darkgrid")
    sns.set_palette(sns.color_palette("hls",20))
    fig, ax = plt.subplots(nrows=1,ncols=2,figsize=(20,8))
    if title:
        fig.suptitle(title)
    ax[0].set_title("Distribution Plot")
    sns.distplot(df[column_name],ax=ax[0])
    ax[1].set_title("Box Plot")
    sns.boxplot(data=df, y=column_name,ax=ax[1],orient='v')
    plt.xticks(rotation='vertical',fontsize=12)
    plt.show()

## Step 2: Data Preparation

There is a variable named CarName which comprises two parts: the first word is the car company and the rest is the car model. For example, chevrolet impala has 'chevrolet' as the company name and 'impala' as the model name. Only the company name is used as an independent variable for model building.

In [None]:
#Keep only the company name (first word of CarName) and rename the column to CarCompany
car_df['CarName']=car_df['CarName'].str.split(' ').str[0]
car_df.rename(columns={"CarName": "CarCompany"},inplace=True)
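
A quick sketch of the split on a few made-up names shows what the cell above does. Because the split is on spaces only, hyphenated company names stay intact:

```python
import pandas as pd

# Illustrative names only; the real values come from the CarName column.
names = pd.Series(["chevrolet impala", "alfa-romero giulia", "bmw x3"])

# Split on spaces and keep the first token (the company).
companies = names.str.split(" ").str[0]
print(companies.tolist())
```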

In [None]:
#analyze the dataframe
car_df.head()

In [None]:
# List the distinct company names to check consistency; several are misspelled or inconsistently cased
car_df['CarCompany'].unique()

In [None]:
#Correcting the misspelled names in the dataframe (assign back instead of an inplace replace on the column)
car_df['CarCompany'] = car_df['CarCompany'].replace(['Nissan', 'toyouta','porcshce','vw','vokswagen','maxda']
                        , ['nissan', 'toyota','porsche','volkswagen','volkswagen','mazda'])
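
An alternative sketch of the same correction uses a dictionary, which keeps each wrong spelling next to its fix and is harder to misalign than two parallel lists. The toy series is made up; the mapping mirrors the lists above.

```python
import pandas as pd

company = pd.Series(["Nissan", "toyouta", "vw", "mazda", "maxda"])

# Misspelling -> canonical name, mirroring the two lists used above.
corrections = {
    "Nissan": "nissan",
    "toyouta": "toyota",
    "porcshce": "porsche",
    "vw": "volkswagen",
    "vokswagen": "volkswagen",
    "maxda": "mazda",
}

cleaned = company.replace(corrections)
print(cleaned.tolist())
```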

In [None]:
#check the dataframe again
car_df['CarCompany'].unique()

In [None]:
#Check the number of distinct values per column; car_ID is unique, so there are no duplicate IDs
car_df.nunique()

In [None]:
#Drop car_ID: it is a unique row identifier and carries no predictive information
car_df.drop(['car_ID'], axis =1, inplace = True)

In [None]:
car_df.head()

## Step 3: Data Analysis

In [None]:
plt.figure(figsize = (20,10))  
sns.heatmap(car_df.select_dtypes(include='number').corr(),annot = True)

In [None]:
sns.pairplot(car_df)
plt.show()

From the visualizations above, the following independent variables are correlated with the dependent variable (price):
- wheelbase
- carlength
- carwidth
- curbweight
- enginesize
- boreratio
- horsepower

The heat map also shows multicollinearity among these independent variables:
- carlength
- carwidth 
- curbweight
- wheelbase 

In [None]:
#wheelbase, carlength, carwidth and curbweight are highly correlated, so three could be dropped and one kept
#car_df.drop(['carwidth','curbweight','wheelbase'], axis =1, inplace = True)
#carlength would be retained

The variables below are also highly correlated (97%):
- highwaympg
- citympg

In [None]:
#highwaympg and citympg are highly correlated, so one could be dropped and the other kept
#car_df.drop(['citympg'], axis =1, inplace = True)
#highwaympg would be retained

The variables below are also highly correlated (84%):
- carlength
- carwidth 
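
Reading pairs off a heat map by eye gets tedious with many columns. The sketch below lists the highly correlated pairs programmatically. The helper name `high_corr_pairs`, the 0.8 threshold and the synthetic data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# Synthetic data: highwaympg tracks citympg closely, stroke is unrelated.
rng = np.random.default_rng(1)
city = rng.uniform(15, 45, size=100)
toy = pd.DataFrame({
    "citympg": city,
    "highwaympg": city + 5 + rng.normal(0, 1, size=100),
    "stroke": rng.uniform(2, 4, size=100),
})
pairs = high_corr_pairs(toy)
print(pairs)
```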

### Draw Categorical Variable Count and check which feature is predominant in US market

In [None]:
categorical_columns=list(car_df.columns[car_df.dtypes == 'object'])
categorical_columns

Toyota clearly has the most cars in the dataset

In [None]:
draw_categorical(column_name='CarCompany',title='CarCompany Distribution',df=car_df)

Gas vehicles far outnumber diesel vehicles

In [None]:
draw_categorical(column_name='fueltype',title='fueltype Distribution',df=car_df)

STD Aspiration predominates

In [None]:
draw_categorical(column_name='aspiration',title='aspiration Distribution',df=car_df)

Sedan and hatchback predominate

In [None]:
draw_categorical(column_name='carbody',title='carbody Distribution',df=car_df)

FWD cars outnumber RWD cars

In [None]:
draw_categorical(column_name='drivewheel',title='drivewheel Distribution',df=car_df)

Front-engine cars are by far the most common

In [None]:
draw_categorical(column_name='enginelocation',title='enginelocation Distribution',df=car_df)

OHC engine vehicles are predominant

In [None]:
draw_categorical(column_name='enginetype',title='enginetype Distribution',df=car_df)

MPFI fuel systems are the most common, followed by 2bbl

In [None]:
draw_categorical(column_name='fuelsystem',title='fuelsystem Distribution',df=car_df)

Car company definitely affects car prices

In [None]:
car_df['price'].mean()

Jaguar, Buick, Porsche, BMW and Volvo have mean prices above the average price across all companies

In [None]:
car_df.groupby('CarCompany').price.mean().sort_values(ascending=False).head()

In [None]:
continous_variables = list(car_df.columns[car_df.dtypes != 'object'])
continous_variables

In [None]:
#draw_continous_plot(column_name='wheelbase',title='wheelbase Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='carlength',title='carlength Distribution',df=car_df)

In [None]:
#draw_continous_plot(column_name='carwidth',title='carwidth Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='carheight',title='carheight Distribution',df=car_df)

In [None]:
#draw_continous_plot(column_name='curbweight',title='curbweight Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='symboling',title='symboling Distribution',df=car_df)

Engine size has outliers

In [None]:
draw_continous_plot(column_name='enginesize',title='enginesize Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='boreratio',title='boreratio Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='stroke',title='stroke Distribution',df=car_df)

CompressionRatio has outliers

In [None]:
draw_continous_plot(column_name='compressionratio',title='compressionratio Distribution',df=car_df)

Horsepower has outliers

In [None]:
draw_continous_plot(column_name='horsepower',title='horsepower Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='peakrpm',title='peakrpm Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='highwaympg',title='highwaympg Distribution',df=car_df)

Price also has outliers

In [None]:
draw_continous_plot(column_name='price',title='price Distribution',df=car_df)

Treat outliers by capping enginesize, compressionratio and horsepower at their 96th percentile values

In [None]:
car_df[['horsepower','compressionratio','enginesize']].quantile([0.01,0.90,.96])

In [None]:
# Cap horsepower at its 96th percentile value (.loc avoids chained assignment)
car_df.loc[car_df['horsepower'] > 182.00, 'horsepower'] = 182.00

In [None]:
car_df.loc[car_df['enginesize'] > 209.00, 'enginesize'] = 209.00

In [None]:
car_df.loc[car_df['compressionratio'] > 10.94, 'compressionratio'] = 10.94
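
The capping step can also be sketched without hard-coding the thresholds: compute the quantile and clip to it. The helper name `cap_at_quantile` and the toy series are assumptions for illustration.

```python
import pandas as pd

def cap_at_quantile(series, q=0.96):
    """Replace values above the q-th quantile with the quantile itself."""
    return series.clip(upper=series.quantile(q))

# Toy data: the values 0, 1, ..., 100.
toy_hp = pd.Series(range(101), dtype=float)
capped = cap_at_quantile(toy_hp)
print(capped.max())
```

Values below the cap are left untouched, so the shape of the bulk of the distribution is preserved.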

Remove price outliers (rows more than 3 standard deviations from the mean price)

In [None]:
car_df = car_df[np.abs(car_df.price-car_df.price.mean()) <= (3*car_df.price.std())].copy()
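
A small sketch of the 3-standard-deviation rule on made-up numbers. Note that the mean and standard deviation are computed with the outlier included, and with the sample standard deviation a single point can be at most (n-1)/sqrt(n) standard deviations from the mean, so the rule cannot flag anything in samples of 10 or fewer. That is why the toy series has 21 points.

```python
import pandas as pd

# Twenty ordinary values and one extreme one (made-up numbers).
prices = pd.Series([10.0] * 20 + [1000.0])

# Keep rows within 3 standard deviations of the mean.
within_3_sigma = (prices - prices.mean()).abs() <= 3 * prices.std()
filtered = prices[within_3_sigma]
print(len(prices), "->", len(filtered))
```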

In [None]:
car_df.shape

In [None]:
draw_continous_plot(column_name='price',title='price Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='compressionratio',title='compressionratio Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='enginesize',title='enginesize Distribution',df=car_df)

In [None]:
draw_continous_plot(column_name='horsepower',title='horsepower Distribution',df=car_df)

In [None]:
car_df.info()

### Dealing with Categorical variables
#### Convert the text values to numbers for columns with only two distinct values
- FuelType 1(gas) 0(diesel)
- Aspiration 1(std) 0(turbo)
- DoorNumber 1(two) 0(four)
- EngineLocation 1(front) 0(rear)
- Cylindernumber is not converted here; it is dummy-encoded in the Dummy Variables section

In [None]:
#Fuel Type ,Only two distinct values ,hence convert them to numbers
car_df['fueltype']=car_df['fueltype'].map({'gas':1,'diesel':0})

In [None]:
#Aspiration has two distinct values ,hence convert them to numbers
car_df['aspiration']=car_df['aspiration'].map({'std':1,'turbo':0})

In [None]:
#doornumber has two distinct values ,hence convert them to numbers
car_df['doornumber']=car_df['doornumber'].map({'two':1,'four':0})

In [None]:
#enginelocation has two distinct values ,hence convert them to numbers
car_df['enginelocation']=car_df['enginelocation'].map({'front':1,'rear':0})
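
One thing to watch with `map`: any value missing from the dictionary silently becomes NaN. The sketch below, on made-up data with a deliberately mis-cased entry, shows a quick way to list such values after encoding.

```python
import pandas as pd

# Made-up values; "Gas" is deliberately mis-cased.
fuel = pd.Series(["gas", "diesel", "gas", "Gas"])
encoded = fuel.map({"gas": 1, "diesel": 0})

# Values that were not covered by the mapping end up as NaN.
unmapped = fuel[encoded.isnull()].unique().tolist()
print(unmapped)
```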

In [None]:
car_df.head()

### Create Derived Variables based on the correlations

In [None]:
car_df['highwaympg/citympg']=car_df['highwaympg']/car_df['citympg']

In [None]:
car_df['carlength/carwidth']=car_df['carlength']/car_df['carwidth']

In [None]:
car_df['carlength/carheight']=car_df['carlength']/car_df['carheight']

In [None]:
car_df['horsepower/curbweight']=car_df['horsepower']/car_df['curbweight']

In [None]:
car_df.drop(['carlength','highwaympg','citympg','carwidth','carheight'],axis=1,inplace=True)

In [None]:
car_df.price.describe()

In [None]:
# Bin cars into price segments, based on the price distribution above
car_segment_bin = [0, 10000, 20000, 40000]
car_segment_slot = ['lowline', 'midline', 'highline']
car_df['car_segment_bin'] = pd.cut(car_df['price'], car_segment_bin, labels=car_segment_slot)
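
A sketch of how `pd.cut` assigns these segments, using the same bins and labels on made-up prices. Bins are right-inclusive by default, so a price of exactly 10000 lands in `lowline`, and anything above the last edge becomes NaN.

```python
import pandas as pd

bins = [0, 10000, 20000, 40000]
labels = ["lowline", "midline", "highline"]

# Made-up prices, including one on a bin edge and one beyond the last edge.
toy_prices = pd.Series([5000, 10000, 15000, 30000, 45000])
segments = pd.cut(toy_prices, bins, labels=labels)
print(segments.tolist())
```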

In [None]:
car_df.head()

### Dummy Variables
- carbody
- CarCompany
- driveWheel
- EngineType
- fuelsystem
- cylindernumber 
- car_segment_bin

In [None]:
# One-hot encode every remaining categorical column (carbody, CarCompany, drivewheel, ...)
# dtype=int keeps the dummies numeric (0/1) for the statsmodels fit later on
car_df_new = pd.get_dummies(car_df, dtype=int)


In [None]:
car_df=car_df_new

In [None]:
# Dropping one dummy variable and keeping n-1 variables for each feature
car_df.drop(['carbody_hatchback',
         'drivewheel_4wd',
         'enginetype_l',
         'cylindernumber_three',
         'fuelsystem_1bbl',
         'car_segment_bin_lowline'],axis=1,inplace=True)
car_df.columns
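
The n-1 rule can be seen on a tiny made-up frame: with three drive-wheel levels, two dummy columns are enough, because a row of all zeros identifies the dropped level. `drop_first=True` does the dropping automatically (it removes the first level in sorted order, here `4wd`).

```python
import pandas as pd

toy = pd.DataFrame({"drivewheel": ["fwd", "rwd", "4wd", "fwd"]})

# Two columns encode three levels; the 4wd row is all zeros.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```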

In [None]:
# symboling is numeric but takes only a few discrete values, so treat it as categorical:
# build dummy variables for it and drop the first level to keep n-1 columns
symboling_dummies = pd.get_dummies(car_df['symboling'], drop_first = True,prefix="symboling_", dtype=int)

# Add the results to the original Car dataframe
car_df = pd.concat([car_df, symboling_dummies], axis = 1)

#drop the original column 
car_df.drop(['symboling'], axis =1, inplace = True)
car_df.head()

In [None]:
car_df.info()

## Splitting the Data into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

# random_state is fixed so that the train and test sets contain the same rows on every run
car_df_train, car_df_test = train_test_split(car_df, train_size = 0.7, test_size = 0.3, random_state = 100)

### Rescaling the Features 

We will use MinMax scaling.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
car_df_train.info()

In [None]:
cols_to_scale = ['wheelbase','curbweight',
         'enginesize', 'boreratio', 'stroke', 'compressionratio','horsepower', 'peakrpm'
                 , 'price'
                ,'highwaympg/citympg','carlength/carwidth','carlength/carheight','horsepower/curbweight'
                ]


In [None]:
car_df_train[cols_to_scale] = scaler.fit_transform(car_df_train[cols_to_scale])

car_df_train.head()
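
What min-max scaling computes can be sketched in plain NumPy: each column is mapped with (x - min) / (max - min), where min and max come from the training data only. The numbers below are made up. Test values outside the training range fall outside [0, 1], which is expected.

```python
import numpy as np

# Made-up single-column data.
train = np.array([[50.0], [100.0], [150.0]])
test = np.array([[75.0], [200.0]])

# "Fit": learn min and range from the training data only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "Transform": apply the same parameters to both sets.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range
print(train_scaled.ravel(), test_scaled.ravel())
```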

In [None]:
car_df_train.describe()

### Dividing into X and Y sets for the model building

In [None]:
Y_car_df_train = car_df_train.pop('price')
X_car_df_train = car_df_train

## Building our model

We will be using the **LinearRegression class from scikit-learn** for its compatibility with RFE (which is a utility from sklearn)

### RFE
Recursive feature elimination

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the number of output variables equal to 15
lm = LinearRegression()
lm.fit(X_car_df_train, Y_car_df_train)

rfe = RFE(lm, n_features_to_select=15)             # running RFE
rfe = rfe.fit(X_car_df_train, Y_car_df_train)
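
The idea behind recursive feature elimination can be sketched in a few lines of NumPy: fit a linear model, drop the feature with the smallest absolute coefficient, and repeat. This is a simplified illustration under the assumption that features are on comparable scales, not scikit-learn's implementation. The name `rfe_sketch` and the synthetic data are hypothetical.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Ordinary least squares with an intercept on the surviving features.
        design = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[1:])))   # skip the intercept
        remaining.pop(weakest)
    return remaining

# Synthetic data: only features 0 and 2 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.01 * rng.normal(size=200)
selected = rfe_sketch(X, y, n_keep=2)
print(selected)
```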

In [None]:
list(zip(X_car_df_train.columns,rfe.support_,rfe.ranking_))

In [None]:
X_car_df_train.shape

In [None]:
#columns that are supported
col_supported = X_car_df_train.columns[rfe.support_]
col_supported

In [None]:
#columns that are not supported
col_not_supported = X_car_df_train.columns[~rfe.support_]
col_not_supported

In [None]:
X_car_df_train_rfe = X_car_df_train.drop(col_not_supported, axis=1)
X_car_df_train_rfe.head()

In [None]:
X_car_df_train_rfe.columns

### Building the model using statsmodels, for the detailed statistics

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm  

In [None]:
# fit LRM
lm = fit_LRM(X_car_df_train_rfe,Y_car_df_train)

In [None]:
getVIF(X_car_df_train_rfe)

In [None]:
#Dropping CarCompany_porsche based on the model summary and VIF above
X_car_df_train1 = X_car_df_train_rfe.drop(['CarCompany_porsche'], axis=1)

In [None]:
X_car_df_train1.columns

### Model 1

In [None]:
#fit LRM
lm = fit_LRM(X_car_df_train1,Y_car_df_train)

In [None]:
getVIF(X_car_df_train1)

In [None]:
plt.figure(figsize = (20,10))  
sns.heatmap(X_car_df_train1.corr(),annot = True)

horsepower is highly correlated with horsepower/curbweight. The cell below drops CarCompany_subaru; horsepower/curbweight is dropped later, before Model 8.

In [None]:
X_car_df_train2 = X_car_df_train1.drop(['CarCompany_subaru'], axis=1)

In [None]:
X_car_df_train2.columns

### Model 2
- Dropped column CarCompany_subaru based on the p-value

In [None]:
#fit LRM
lm = fit_LRM(X_car_df_train2,Y_car_df_train)

In [None]:
getVIF(X_car_df_train2)

In [None]:
X_car_df_train3 = X_car_df_train2.drop(['enginetype_dohcv'], axis=1)

### Model 3

In [None]:
#fit LRM
lm = fit_LRM(X_car_df_train3,Y_car_df_train)

In [None]:
getVIF(X_car_df_train3)

In [None]:
X_car_df_train4 = X_car_df_train3.drop(['enginetype_ohcf'], axis=1)

### Model 4

In [None]:
#fit LRM
lm = fit_LRM(X_car_df_train4,Y_car_df_train)

In [None]:
getVIF(X_car_df_train4)

In [None]:
plt.figure(figsize = (20,10))  
sns.heatmap(X_car_df_train4.corr(),annot = True)

In [None]:
X_car_df_train5=X_car_df_train4.drop(['carlength/carwidth'], axis=1)

### Model 5

In [None]:
#fit LRM
lm = fit_LRM(X_car_df_train5,Y_car_df_train)

In [None]:
getVIF(X_car_df_train5)

In [None]:
plt.figure(figsize = (20,10))  
sns.heatmap(X_car_df_train5.corr(),annot = True)

horsepower is highly correlated with carlength/carheight. The cell below drops CarCompany_saab; carlength/carheight is dropped before Model 7.

In [None]:
X_car_df_train6=X_car_df_train5.drop(['CarCompany_saab'], axis=1)

### Model 6
- Choosing Model 5 as the base

In [None]:
lm = fit_LRM(X_car_df_train6,Y_car_df_train)

In [None]:
getVIF(X_car_df_train6)

In [None]:
plt.figure(figsize = (20,10))  
sns.heatmap(X_car_df_train6.corr(),annot = True)

horsepower is correlated with cylindernumber_six. The cell below drops carlength/carheight.

In [None]:
X_car_df_train7=X_car_df_train6.drop(['carlength/carheight'], axis=1)

### Model 7

In [None]:
lm = fit_LRM(X_car_df_train7,Y_car_df_train)

In [None]:
getVIF(X_car_df_train7)

In [None]:
plt.figure(figsize = (20,10))  
sns.heatmap(X_car_df_train7.corr(),annot = True)

In [None]:
X_car_df_train8=X_car_df_train7.drop(['horsepower/curbweight'], axis=1)

### Model 8

In [None]:
lm = fit_LRM(X_car_df_train8,Y_car_df_train)

In [None]:
getVIF(X_car_df_train8)

In [None]:
plt.figure(figsize = (20,10))  
sns.heatmap(X_car_df_train8.corr(),annot = True)

### Model 9

In [None]:
X_car_df_train9=X_car_df_train8.drop(['peakrpm'], axis=1)

In [None]:
lm = fit_LRM(X_car_df_train9,Y_car_df_train)

In [None]:
getVIF(X_car_df_train9)

In [None]:
plt.figure(figsize = (20,10))  
sns.heatmap(X_car_df_train9.corr(),annot = True)

### Model 10

In [None]:
X_car_df_train10=X_car_df_train9.drop(['CarCompany_bmw'], axis=1)

In [None]:
lm = fit_LRM(X_car_df_train10,Y_car_df_train)

In [None]:
getVIF(X_car_df_train10)

### Model 11

In [None]:
X_car_df_train11=X_car_df_train10.drop(['enginelocation'], axis=1)

In [None]:
lm = fit_LRM(X_car_df_train11,Y_car_df_train)

In [None]:
getVIF(X_car_df_train11)

In [None]:
X_car_df_train11 = sm.add_constant(X_car_df_train11)
lm = sm.OLS(Y_car_df_train,X_car_df_train11).fit() 
print(lm.summary())

### Residual Analysis of the train data
Next, check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression) by plotting a histogram of the error terms.

In [None]:
y_train_price = lm.predict(X_car_df_train11)

In [None]:
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((Y_car_df_train - y_train_price), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label
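
Alongside the histogram, a few summary numbers can help judge normality: for normally distributed residuals the mean, skewness and excess kurtosis should all be close to zero. The helper `residual_summary` below is a hypothetical sketch, demonstrated on synthetic residuals rather than the model's.

```python
import numpy as np

def residual_summary(residuals):
    """Mean, skewness and excess kurtosis of a 1-D array of residuals."""
    r = np.asarray(residuals, dtype=float)
    centered = r - r.mean()
    sd = centered.std()
    return {
        "mean": float(r.mean()),
        "skewness": float(np.mean(centered ** 3) / sd ** 3),
        "excess_kurtosis": float(np.mean(centered ** 4) / sd ** 4 - 3.0),
    }

# Synthetic, normally distributed "residuals" for illustration.
rng = np.random.default_rng(0)
summary = residual_summary(rng.normal(0.0, 1.0, size=5000))
print(summary)
```

In the notebook, the same function could be applied to `Y_car_df_train - y_train_price`.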

### Making Predictions
Before making any inference about the equation of the linear regression, let's test it on the test set


Applying the scaling on the test set

The scaler was fitted on the training set, so the test set is only transformed, never re-fitted.
Only the variables used to train the final model are kept for prediction.

In [None]:
num_vars = ['wheelbase','curbweight',
         'enginesize', 'boreratio', 'stroke', 'compressionratio','horsepower', 'peakrpm'
                 , 'price'
                ,'highwaympg/citympg','carlength/carwidth','carlength/carheight','horsepower/curbweight'
                ]

car_df_test[num_vars] = scaler.transform(car_df_test[num_vars])

Dividing into X_test and y_test

In [None]:
Y_car_df_test = car_df_test.pop('price')
X_car_df_test = car_df_test

In [None]:

# Now let's use our model to make predictions.

# Creating X_test_new dataframe by dropping variables from X_test
X_car_df_train11= X_car_df_train11.drop(['const'], axis=1)
X_car_df_test_new = X_car_df_test[X_car_df_train11.columns]

# Adding a constant variable 
X_car_df_test_new = sm.add_constant(X_car_df_test_new)

In [None]:
# Making predictions
y_pred = lm.predict(X_car_df_test_new)

In [None]:
Y_car_df_test.head()

### Model Evaluation

In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(Y_car_df_test,y_pred)
fig.suptitle('Y_car_df_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('Y_car_df_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)                          # Y-label


In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(Y_car_df_test, y_pred))
print('Model RMSE:',rmse)

from sklearn.metrics import r2_score
r2=r2_score(Y_car_df_test, y_pred)
print('Model r2_score:',r2)
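
For reference, the two metrics above can be written out by hand. This NumPy sketch uses the standard definitions, RMSE = sqrt(mean squared error) and R^2 = 1 - SS_res / SS_tot, on made-up numbers. Since price was min-max scaled, the RMSE printed above is in scaled units, not dollars.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Root mean squared error and coefficient of determination."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    rmse = np.sqrt(ss_res / len(y_true))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2

# Made-up values: one prediction is off by 1.
rmse_toy, r2_toy = rmse_and_r2([1, 2, 3], [1, 2, 4])
print(rmse_toy, r2_toy)
```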

## Inferences

#### Predictors
- wheelbase
- horsepower
- carbody_convertible
- car_segment_bin_highline

- Adj R-Square = 0.909
- R-squared    = 0.911
- price = -0.0634 + 0.2251*wheelbase + 0.4297*horsepower + 0.1571*carbody_convertible + 0.3419*car_segment_bin_highline (price, wheelbase and horsepower are on the min-max scaled scale)
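
A worked sketch of the fitted equation follows. The coefficients are the ones listed above; inputs and output are on the min-max scaled scale. Converting back to dollars needs the training-set minimum and maximum price, which are not shown in this notebook, so the values passed to `to_dollars` below are purely hypothetical placeholders.

```python
def predict_scaled_price(wheelbase, horsepower, is_convertible, is_highline):
    """Apply the final model's equation to min-max scaled inputs."""
    return (-0.0634
            + 0.2251 * wheelbase
            + 0.4297 * horsepower
            + 0.1571 * is_convertible
            + 0.3419 * is_highline)

def to_dollars(scaled_price, train_min, train_max):
    """Invert min-max scaling; train_min and train_max are assumed inputs."""
    return scaled_price * (train_max - train_min) + train_min

# A mid-range, non-convertible, non-highline car (made-up scaled inputs).
scaled = predict_scaled_price(0.5, 0.5, 0, 0)
# Hypothetical price range, for illustration only.
dollars = to_dollars(scaled, train_min=5000.0, train_max=35000.0)
print(round(scaled, 4), round(dollars, 2))
```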