# Automobile Data: Car Price Prediction

In this project I will explore the features of an automobile data set to predict car prices.

## 1. Data Wrangling

Data Wrangling is the process of converting data from the initial format to a clean format that may be better for analysis.

In [1]:
# import pandas, numpy, seaborn, matplotlib, scipy library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats 

In [2]:
# Preview the first five lines of the local copy of the raw data.
# `head` is a Unix utility, so this shell command fails in the Windows command prompt.
!head -n5 datasets/autodata.data


In [None]:
# Read the online file from the URL below and assign it to the variable "df"
other_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"
df = pd.read_csv(other_path, header=None)

In [None]:
# Check the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe") 
df.head(5)

### Add Headers
<p>
To better describe the data, let's introduce headers for the dataframe. The column names are documented at: <a href="https://archive.ics.uci.edu/ml/datasets/Automobile" target="_blank">https://archive.ics.uci.edu/ml/datasets/Automobile</a>
</p>

In [None]:
# create headers list
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
print("headers\n", headers)

In [None]:
# replace headers and recheck the data frame
df.columns = headers
df.head()

In [None]:
# checking shape
df.shape

### Identify and handle missing values

In [None]:
# replace the "?" symbol with NaN
df.replace('?', np.nan, inplace=True)
df.head()

In [None]:
# Evaluating for Missing Data
missing_data = df.isnull()
missing_data.head(5)

In [None]:
# Count missing values in each column
df.isnull().sum().sort_values(ascending=False)

Replace missing values with the column mean:
1. "normalized-losses": 41 missing values, replace them with the mean
2. "stroke": 4 missing values, replace them with the mean
3. "bore": 4 missing values, replace them with the mean
4. "horsepower": 2 missing values, replace them with the mean
5. "peak-rpm": 2 missing values, replace them with the mean

In [None]:
# Calculate the mean of the column "normalized-losses"
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
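# replace NaN with the mean value in the "normalized-losses" column
# (assign the result back; an in-place replace on a selected column may not update df)
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)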

In [None]:
# Calculate the mean of the column "bore"
avg_bore=df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)
# replace NaN with the mean value in the "bore" column (assign the result back to df)
df["bore"] = df["bore"].replace(np.nan, avg_bore)

In [None]:
# Calculate the mean value for the "stroke" column
avg_stroke = df["stroke"].astype("float").mean(axis = 0)
print("Average of stroke:", avg_stroke)
# replace NaN with the mean value in the "stroke" column (assign the result back to df)
df["stroke"] = df["stroke"].replace(np.nan, avg_stroke)

In [None]:
# Calculate the mean value for the "horsepower" column
avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
# replace NaN with the mean value in the "horsepower" column (assign the result back to df)
df['horsepower'] = df['horsepower'].replace(np.nan, avg_horsepower)

In [None]:
# Calculate the mean value for the "peak-rpm" column
avg_peakrpm = df['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)
# replace NaN with the mean value in the "peak-rpm" column (assign the result back to df)
df['peak-rpm'] = df['peak-rpm'].replace(np.nan, avg_peakrpm)
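The five imputation cells above follow one pattern, so they could be collapsed into a loop. The following is only a sketch on a made-up toy frame (the name `toy` and its values are assumptions, not the automobile data); it converts each column to float and fills missing entries with that column's mean.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the automobile data (values are made up).
toy = pd.DataFrame({
    "bore": ["3.0", np.nan, "3.4"],
    "stroke": ["2.5", "3.5", np.nan],
})

# Impute each listed column with its own mean, converting to float first.
for col in ["bore", "stroke"]:
    as_float = toy[col].astype("float")
    toy[col] = as_float.fillna(as_float.mean())

print(toy)
```

Note that this variant also converts the columns to float, which the notebook does separately in the data type conversion step.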

Replace missing "num-of-doors" values with the most frequent value (the mode): 2 values are missing.

In [None]:
# count the cell value
df['num-of-doors'].value_counts()

In [None]:
# Obtain the most frequent value
df['num-of-doors'].value_counts().idxmax()

In [None]:
# replace the missing 'num-of-doors' values with the most frequent value
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")

Drop the whole row: "price" has 4 missing values, and any data entry without a price cannot be used for prediction.
    

In [None]:
# drop missing values along the column "price" 
df.dropna(subset=["price"], axis=0, inplace=True)

# reset index
df.reset_index(drop=True, inplace=True)
df.head(20)

### Data Type Conversion

In [None]:
# data types of the data frame
print(df.dtypes)

In [None]:
# Convert data types to proper format
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")

#List data type after conversion
print(df.dtypes)

### Data Standardization

In [None]:
# Convert mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df["city-mpg"]

# transform mpg to L/100km by mathematical operation (235 divided by mpg)
df["highway-L/100km"] = 235/df["highway-mpg"]

# rename column name from "highway-mpg" to "highway-L/100km"
# df.rename(columns={"highway-mpg":'highway-L/100km'}, inplace=True)

# check your transformed data 
df.head()
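The constant 235 used above is a rounded unit-conversion factor. As a sketch, assuming US gallons (the source does not say which gallon is meant), it can be derived from the definitions of the gallon and the mile; the helper name `mpg_to_l_per_100km` is hypothetical.

```python
# Unit-conversion constants (exact by definition; US gallon assumed).
LITRES_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

# L/100km = (litres per gallon * 100) / (km per mile * mpg)
factor = LITRES_PER_US_GALLON * 100 / KM_PER_MILE
print(round(factor, 3))


def mpg_to_l_per_100km(mpg):
    """Convert US miles per gallon to litres per 100 km."""
    return factor / mpg


print(round(mpg_to_l_per_100km(30), 2))
```

The factor comes out slightly above 235, so rounding to 235 changes the converted values only marginally.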

### Data Normalization

Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so its average is 0, scaling the variable so its variance is 1, or scaling the variable so its values range from 0 to 1.
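To make the options concrete, here is a small sketch of three common scalings on made-up values (the series `length` below is illustrative, not taken from the dataset): division by the maximum, min-max scaling, and z-score standardization.

```python
import pandas as pd

# Made-up lengths, used only to illustrate the three scalings.
length = pd.Series([150.0, 170.0, 180.0, 200.0])

# Simple feature scaling: largest value becomes 1.
simple_scaled = length / length.max()

# Min-max scaling: values span exactly 0 to 1.
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: mean 0 and standard deviation 1.
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple_scaled, "min_max": min_max, "z_score": z_score}))
```

This notebook uses the first form, dividing each column by its maximum.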

In [None]:
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max() 

# show the scaled columns
df[["length","width","height"]].head()

### Binning

Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis. 

In this dataset, "horsepower" is a real-valued variable ranging from 48 to 288, with 57 unique values. I will use the Pandas method 'cut' to segment the 'horsepower' column into 3 bins: low horsepower, medium horsepower, and high horsepower.

In [None]:
# data type conversion
df["horsepower"]=df["horsepower"].astype(int, copy=True)

In [None]:
# Plot the histogram of horsepower to see what its distribution looks like
%matplotlib inline
plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("Horsepower")
plt.ylabel("Count")
plt.title("Horsepower Distribution")
plt.show()

In [None]:
# make new bins
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
group_names = ['Low', 'Medium', 'High']
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels = group_names, include_lowest = True )
df["horsepower-binned"].value_counts()

In [None]:
%matplotlib inline

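# reindex so the counts follow the order of group_names (value_counts sorts by frequency)
plt.bar(group_names, df["horsepower-binned"].value_counts().reindex(group_names))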

# set x/y labels and plot title
plt.xlabel("Horsepower")
plt.ylabel("Count")
plt.title("Horsepower bins")
plt.show()

In [None]:
# draw histogram of attribute "horsepower" with bins = 3
plt.hist(df["horsepower"], bins = 3)

# set x/y labels and plot title
plt.xlabel("Horsepower")
plt.ylabel("Count")
plt.title("Horsepower Distribution")
plt.show()

### Indicator variable

An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. These variables can be used for regression analysis. To use fuel type attribute in regression analysis, I convert "fuel-type" into indicator variables.

In [None]:
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_variable_1.head()

In [None]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis = 1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace = True)
df.head()

In [None]:
# get indicator variables of aspiration and assign it to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df['aspiration'])

# change column names for clarity
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)

# show first 5 instances of data frame "dummy_variable_2"
dummy_variable_2.head()

In [None]:
# merge the new dataframe into the original dataframe
df = pd.concat([df, dummy_variable_2], axis=1)

# drop original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)
df.head()

In [None]:
# save the clean dataset to csv
df.to_csv("automobile_clean.csv", index=False)

## 2. Exploratory Data Analysis

Exploratory data analysis involves examining the distribution of the variables in the dataset, identifying outliers, finding trends and patterns, and looking for relationships between variables using heat maps or correlation matrices.

### Continuous numerical variables

Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.

In order to start understanding the relationship between an individual variable and the price, I have used "regplot", which plots the scatterplot plus the fitted regression line for the data.

#### Engine Size Vs Price

In [None]:
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
plt.show()

As the engine-size goes up, the price goes up: this indicates a positive correlation between these two variables. Engine size seems like a pretty good predictor of price since the points lie close to the fitted regression line.

In [None]:
df[["engine-size", "price"]].corr()

I examined the correlation between 'engine-size' and 'price' and saw it's approximately 0.87

#### Highway mpg Vs Price

In [None]:
sns.regplot(x="highway-mpg", y="price", data=df)
plt.show()

As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.

In [None]:
df[['highway-mpg', 'price']].corr()

I examined the correlation between 'highway-mpg' and 'price' and saw it's approximately -0.704. Now let's check whether "peak-rpm" is a good predictor of "price".

#### Peak RPM  Vs Price

In [None]:
sns.regplot(x="peak-rpm", y="price", data=df)
plt.show()

Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore it is not a reliable variable.

In [None]:
df[['peak-rpm','price']].corr()

The correlation between 'peak-rpm' and 'price' is approximately -0.101616.

#### Stroke Vs Price

In [None]:
df[["stroke","price"]].corr()

The correlation between "stroke" and "price" is approximately 0.082.

This is a weak correlation, so a linear regression of 'price' on 'stroke' will not work well. This can be demonstrated using regplot.

In [None]:
sns.regplot(x="stroke", y="price", data=df)
plt.show()

### Categorical variables

These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots. Let's look at the relationship between "body-style" and "price".

#### Body-Style Vs Price

In [None]:
sns.boxplot(x="body-style", y="price", data=df)
plt.show()

It can be seen that the distributions of price between the different body-style categories have a significant overlap, and so body-style would not be a good predictor of price. Let's examine "engine-location" and "price".

#### Engine Location Vs Price

In [None]:
sns.boxplot(x="engine-location", y="price", data=df)
plt.show()

Here it can be seen that the distributions of price for the two engine-location categories, front and rear, are distinct enough to take engine-location as a potentially good predictor of price. Now, let's examine "drive-wheels" and "price".

#### Drive Wheels  Vs Price

In [None]:
# drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)
plt.show()

It can be seen that the distribution of price between the different drive-wheels categories differs; so drive-wheels could potentially be a predictor of price.

### Descriptive Statistical Analysis

In [None]:
# statistical summary of all the columns in "df"
df.describe(include = "all")

Value-counts is a good way of understanding how many units of each characteristic/variable we have.

In [None]:
# drive-wheels as variable
# name the count column directly; the default column name differs between pandas versions
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame(name='value_counts')
drive_wheels_counts.index.name = 'drive-wheels'
print(drive_wheels_counts)

In [None]:
# engine-location as variable
# name the count column directly; the default column name differs between pandas versions
engine_loc_counts = df['engine-location'].value_counts().to_frame(name='value_counts')
engine_loc_counts.index.name = 'engine-location'
print(engine_loc_counts)

On examining the value counts of the engine location, it can be concluded that it would not be a good predictor variable for the price. There are only three cars with a rear engine and 198 with an engine in the front, so the result is skewed.

The "groupby" method groups data by different categories. The data is grouped based on one or several variables and analysis is performed on the individual groups.

In [None]:
df_group_one = df[['drive-wheels','body-style','price']]
# average only the numeric column; "body-style" is text and cannot be averaged
df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean(numeric_only=True)
df_group_one

From the above data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.

Now, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combinations of 'drive-wheels' and 'body-style'.

In [None]:
# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean().sort_values('price')
grouped_test1

This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row.

In this case, we will leave the drive-wheel variable as the rows of the table, and pivot body-style to become the columns of the table:

In [None]:
grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot

In [None]:
#visualize the grouped results
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

### Feature Selection

Here we do correlation analysis amongst all the independent continuous variables and eliminate variables that have very close correlation. We also do a feature importance test to find out the importance of each independent feature as it pertains to the dependent variable. We eliminate the features that are not so important.
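One way to spot variables with "very close correlation" is to list every column pair whose absolute Pearson correlation exceeds a threshold. The sketch below uses synthetic data; the helper `high_corr_pairs` and the 0.9 threshold are illustrative assumptions rather than part of this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=n),  # nearly duplicates "a"
    "b": rng.normal(size=n),                       # independent of "a"
})


def high_corr_pairs(frame, threshold=0.9):
    """Return (col1, col2, r) for column pairs with |r| above threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


pairs = high_corr_pairs(toy)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal, since it carries almost the same information as the other.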

In order to check for multicollinearity, first I am storing the names of all continuous variables in cont and categorical variables in categ.

In [None]:
#splitting up column names into cont and categ

# df.columns
categ = ['make', 'body-style', 'drive-wheels','engine-location', 'engine-type', 'fuel-type-diesel', 'fuel-type-gas', 'aspiration-std','aspiration-turbo','horsepower-binned']
cont = ['symboling', 'normalized-losses', 'num-of-doors', 'wheel-base', 'length', 'width','height', 'curb-weight','num-of-cylinders','engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio','horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

#subsetting the data into datasets having continuous and categorical variables
df_cont = df.loc[:, cont]
df_categ= df.loc[:, categ]



In [None]:
#  Calculate the correlation between variables 

# setting width and height of plot
f, ax = plt.subplots(figsize = (14, 14))

#generate the correlation matrix (numeric columns only; some columns in cont hold text)
corr = df_cont.corr(numeric_only=True)

#putting corr into perspective via heatmap
sns.heatmap(corr, 
            mask = np.zeros_like(corr, dtype = bool), 
            cmap = sns.diverging_palette(220, 10, as_cmap = True), 
            square = True, 
            ax = ax, 
            annot = True, 
            cbar = True)
plt.show()

From the correlation plot, we can see that the features that affect price could be: highway-mpg, city-mpg, horsepower, engine-size, curb-weight, width, length, wheel-base and bore. Wheel-base, length, width, height, curb-weight, engine-size and bore are correlated with each other, so we may choose only some of these features. Either engine-size or curb-weight can be selected because they are strongly correlated. Likewise, we may choose either highway-mpg or city-mpg because of their high correlation. Let's investigate further.


### Pearson Correlation Coefficient and P-value

#### 'wheel-base' Vs 'price'

In [None]:
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

Since the p-value is  <  0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585). So this variable may be eliminated.

#### 'horsepower' Vs 'price'

In [None]:
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  

Since the p-value is  <  0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).

#### 'length' Vs 'price'

In [None]:
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  

Since the p-value is  <  0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).

#### 'width' Vs 'price'

In [None]:
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value ) 

Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).

#### 'curb-weight' Vs 'price'

In [None]:
pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  

Since the p-value is  <  0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).

#### 'engine-size' Vs 'price'

In [None]:
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

Since the p-value is  <  0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).

#### 'bore' Vs 'price'

In [None]:
pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =  ", p_value ) 

Since the p-value is  <  0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).

#### City-mpg Vs Price

In [None]:
pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  

Since the p-value is  <  0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of ~ -0.687 shows that the relationship is negative and moderately strong.

#### Highway-mpg Vs Price

In [None]:
pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value ) 

Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of ~ -0.705 shows that the relationship is negative and moderately strong.
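The per-feature cells above repeat the same call, so the coefficients and p-values can also be collected into one table. This is a sketch on synthetic data (the columns `size`, `noise`, and `price` below are made up for illustration), using `scipy.stats.pearsonr` as in the cells above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 100
size = rng.normal(size=n)
toy = pd.DataFrame({
    "size": size,                                          # drives the price
    "noise": rng.normal(size=n),                           # unrelated to price
    "price": 3.0 * size + rng.normal(scale=0.5, size=n),
})

# Compute r and the p-value for each candidate feature against price.
rows = []
for col in ["size", "noise"]:
    r, p = stats.pearsonr(toy[col], toy["price"])
    rows.append({"feature": col, "pearson_r": r, "p_value": p})

# Rank features by the strength of the linear relationship.
summary = pd.DataFrame(rows).sort_values(
    "pearson_r", key=lambda s: s.abs(), ascending=False
)
print(summary)
```

Sorting by the absolute coefficient puts strong negative relationships, such as the mpg columns here, alongside strong positive ones.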

### ANOVA: Analysis of Variance

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.

P-value: The p-value tells how statistically significant the calculated score is.

If the price variable is strongly correlated with the variable being analyzed, expect ANOVA to return a sizeable F-test score and a small p-value.
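To show what the F-test score measures, the sketch below computes it by hand as the ratio of the between-group mean square to the within-group mean square, then compares it with `scipy.stats.f_oneway`. The three groups are made-up numbers, not prices from the dataset.

```python
import numpy as np
from scipy import stats

# Three made-up price groups, used only to illustrate the computation.
groups = [
    np.array([10.0, 12.0, 11.0, 13.0]),
    np.array([20.0, 22.0, 19.0, 21.0]),
    np.array([11.0, 10.0, 12.0, 9.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)          # number of groups
n = all_values.size      # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the between-group to the within-group mean square.
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
# The p-value is the upper tail of the F distribution with (k-1, n-k) degrees of freedom.
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

A large F means the group means differ by much more than the spread inside each group would explain.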

In [None]:
# group by a single key (a string, not a list) so get_group accepts a plain label
grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby('drive-wheels')
grouped_test2.get_group('4wd')['price']

In [None]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val)   

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val )

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])  
   
print( "ANOVA results: F=", f_val, ", P =", p_val)  

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])  
 
print("ANOVA results: F=", f_val, ", P =", p_val) 

### Conclusion: Important Variables

I now have a better idea of what the data looks like and which variables are important to take into account when predicting the car price. I have narrowed it down to the following variables:

Continuous numerical variables:

1. Length
2. Width
3. Curb-weight
4. Engine-size
5. Horsepower
6. City-mpg
7. Highway-mpg
8. Wheel-base
9. Bore

Categorical variables: Drive-wheels

Now I can build machine learning models to automate the analysis. Feeding the model with variables that meaningfully affect the target variable will improve its prediction performance.

## 3. Model Development

Now that all the preprocessing has been done, let us split the data into training and test sets. The model will be fit on the training data and then evaluated on the test data, and its accuracy will be judged by its performance on the test data.

In [None]:
# target variable
y_data = df['price']

In [None]:
# predictor variables: every column except price
x_data=df.drop('price',axis=1)

In [None]:
from sklearn.model_selection import train_test_split

# hold out 10% of the rows as the test set
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)

# report the size of each split
print("Number of test samples :", x_test.shape[0])
print("Number of training samples:",x_train.shape[0])

In [None]:
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error

Now, I will develop several models that will predict the price of the car using the variables.

### Simple Linear Regression

In [None]:
# Create a linear regression object
lm = LinearRegression()

I want to look at how highway-mpg can help predict car price. Using simple linear regression, I will create a linear function with "highway-mpg" as the predictor variable and the "price" as the response variable.

In [None]:
# Fit the linear model using highway-mpg
X = x_train[['highway-mpg']]
Y = y_train
lm.fit(X,Y)
# Find the R^2 (score the test set on the same feature the model was fit on)
print('The R-square of training data is: ', lm.score(X, Y))
print('The R-square of test data is: ', lm.score(x_test[['highway-mpg']], y_test))

#### Cross-validation Score

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
Rcross = cross_val_score(lm, x_data[['highway-mpg']], y_data, cv=4)
print("The mean of the folds is", Rcross.mean(), "and the standard deviation is" , Rcross.std())

In [None]:
# output a prediction
Yhat=lm.predict(X)
print('The first four predicted values are: ', Yhat[0:4])

In [None]:
# Yhat holds predictions for the training rows, so compare it with the training target Y
mse = mean_squared_error(Y, Yhat)
print('The mean square error of price and predicted value is: ', mse)

Now I will train the model using 'engine-size' as the independent variable and 'price' as the dependent variable.

In [None]:
# Create a linear regression object
lm1 = LinearRegression()

In [None]:
# Fit the linear model using engine-size
lm1.fit(df[['engine-size']], df[['price']])

In [None]:
# the slope and intercept of the model
print(lm1.intercept_)
print(lm1.coef_)

### Multiple Linear Regression

Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables.

From the previous section I know that other good predictors of price could be: Horsepower, Curb-weight, Engine-size and Highway-mpg. Let's develop a model using these variables as the predictor variables.

In [None]:
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
# fit the model 
lm.fit(Z, df['price'])
# Find the R^2
print('The R-square is: ', lm.score(Z, df['price']))

In [None]:
Y_predict_multifit = lm.predict(Z)

In [None]:
print('The mean square error of price and predicted value using multifit is: ', \
      mean_squared_error(df['price'], Y_predict_multifit))

In [None]:
# print both values; a bare expression only displays the last one in a cell
print(lm.intercept_)
print(lm.coef_)
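The intercept and coefficients define the prediction equation: the intercept plus the sum of each coefficient times its feature. A minimal sketch on synthetic data (the intercept 5 and slopes 2 and -3 are made-up values chosen so the result is easy to check) rebuilds the predictions by hand and compares them with `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
# Synthetic response: known intercept 5 and slopes 2 and -3, plus small noise.
y = 5.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.01, size=50)

model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand from the fitted parameters.
manual = model.intercept_ + X @ model.coef_

print(model.intercept_, model.coef_)
print(np.allclose(manual, model.predict(X)))
```

The same arithmetic applies to the four-feature model above, with one coefficient per column of `Z`.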

### Model Evaluation using Visualization

#### Regression Plot

When it comes to simple linear regression, an excellent way to visualize the fit of the model is by using regression plots.

In [None]:
width = 6
height = 5
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
plt.show()

#### Residual Plot

A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis. If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.
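The same idea can be checked numerically. In this sketch a straight line is fit to deliberately curved, noise-free toy data (not the car data), and the residuals turn out to follow the squared predictor almost exactly, which is the kind of pattern that signals a linear model is not appropriate.

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y = x ** 2                      # deliberately curved, noise-free toy data

# Fit a straight line and compute residuals (observed minus fitted).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# For a good linear fit the residuals would show no pattern in x.
# Here they track x**2 closely, which signals curvature.
pattern = np.corrcoef(residuals, x ** 2)[0, 1]
print(pattern)
```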

In [None]:
width = 6
height = 5
plt.figure(figsize=(width, height))
sns.residplot(x='highway-mpg', y='price',data=df)
plt.show()

#### Distribution Plot

Multiple Linear Regression models can't be visualized with regression or residual plot. One way to look at the fit of the model is by looking at the distribution plot.

In [None]:
Y_hat = lm.predict(Z)
plt.figure(figsize=(width, height))

# kdeplot draws the same density curves; distplot is deprecated in recent seaborn releases
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual Value")
sns.kdeplot(x=Y_hat, color="b", label="Fitted Values", ax=ax1)

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.legend()
plt.show()

### Polynomial Regression and Pipelines

Polynomial regression is a particular case of the general linear regression model or multiple linear regression models. We get non-linear relationships by squaring or setting higher-order terms of the predictor variables. The linear model did not provide the best fit while using highway-mpg as the predictor variable. Let's see if fitting a polynomial model to the data instead helps.

In [None]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

In [None]:
x = df['highway-mpg']
y = df['price']
# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

In [None]:
from sklearn.metrics import r2_score
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)
mean_squared_error(df['price'], p(x))

In [None]:
PlotPolly(p, x, y, 'highway-mpg')
np.polyfit(x, y, 3)

We can perform a polynomial transform on multiple features.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
pr=PolynomialFeatures(degree=2)
Z_pr=pr.fit_transform(Z)
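To see what the transform produces, here is a small sketch with two made-up samples and two features, `a` and `b`. With `degree=2` the output holds a constant column, the original features, their squares, and their product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up samples with two features each: columns a and b.
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])

pr = PolynomialFeatures(degree=2)
X_pr = pr.fit_transform(X)

# Columns are: 1, a, b, a^2, a*b, b^2
print(X_pr)
print(X_pr.shape)
```

By the same counting, the four features in `Z` expand to 15 columns: 1 constant, 4 linear terms, 4 squares, and 6 pairwise products.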

#### Pipeline

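Data pipelines chain the processing steps (here scaling, the polynomial transform, and model fitting) into a single object that is fit and used for prediction in one call.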

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]

In [None]:
pipe=Pipeline(Input)

In [None]:
pipe.fit(Z,y)

In [None]:
ypipe=pipe.predict(Z)
ypipe[0:4]
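The pipeline above is fit and evaluated on the same rows. As a sketch of how it could be combined with the train/test split from the start of this section, the example below fits the same three steps on a training split and scores on held-out rows. The data is synthetic, and the names `X_train`, `X_test`, `r2_train`, and `r2_test` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 2))
# Synthetic target with a quadratic term, so polynomial features help.
y = 1.0 + X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=1
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
])

# The scaler and polynomial transform are learned from the training rows only.
pipe.fit(X_train, y_train)

r2_train = pipe.score(X_train, y_train)
r2_test = pipe.score(X_test, y_test)
print(r2_train, r2_test)
```

Because the scaler is fit inside the pipeline, no information from the test rows leaks into the preprocessing.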