# Car Price Model Development

In this section, we develop several models that predict the price of a car from its variables, or features. The predictions are only estimates, but they should give an objective idea of how much a car should cost.

Some questions that can be answered after creating the model:

+ Do I know if the dealer is offering fair value for my trade-in?
+ Do I know if I put a fair value on my car?

In data analytics, Model Development is often used to help us predict future observations from the data we have.

A model helps us understand how different variables relate to one another and how these variables are used to predict the result.

##### First we import the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Then we create a path to our dataset and load it into a dataframe. 

In [None]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'
df = pd.read_csv(path)
df.head()

### Let's start first with Linear Regression and Multiple Linear Regression

We load the module for linear regression and create a linear regression object from the library **scikit-learn** (imported as `sklearn`).

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm

For the linear regression model we choose one of the features to fit in our model. In this case we choose `'highway-mpg'` as a predictor variable and `'price'` as a response variable.

In [None]:
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X,Y)           # Fitting the linear model using mpg

A prediction can be output:

In [None]:
Yhat=lm.predict(X)
Yhat[0:5]

The intercept and slope of the fitted line can be read from the fitted model:

In [None]:
print('The intercept value is:',lm.intercept_)
print('The slope is:',lm.coef_)
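
To make the role of the intercept and slope concrete, here is a minimal sketch showing that a simple linear regression prediction is just `intercept + slope * x`. It uses synthetic stand-in data rather than the automobile dataset; the names `mpg`, `price`, `X_demo` and `lm_demo`, and the numbers used to generate the data, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the automobile dataset):
# price drops as mpg rises, with some random noise added.
rng = np.random.default_rng(0)
mpg = rng.uniform(15, 55, size=50)
price = 38000 - 800 * mpg + rng.normal(0, 500, size=50)

X_demo = mpg.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
lm_demo = LinearRegression().fit(X_demo, price)

# The prediction is the fitted line: intercept + slope * x
manual = lm_demo.intercept_ + lm_demo.coef_[0] * mpg
same_as_predict = np.allclose(manual, lm_demo.predict(X_demo))
print(same_as_predict)
```

The same check should hold for the `lm` object fitted on `'highway-mpg'` above, since `predict` evaluates the fitted line.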

### Multiple linear regression

Even without the help of statistics we know that the highway mpg alone is not enough to accurately predict the price of a car. In this case it is crucial to take into account more features and train our model by using them. 

In [None]:
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]

In [None]:
lm.fit(Z, df['price'])        # Fit the values into the model

In [None]:
print('The intercept value is:',lm.intercept_)
print('The coefficients are:',lm.coef_)   # one coefficient per feature in Z
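
With several features, the fitted model has one coefficient per feature, and a prediction is the intercept plus the weighted sum of the feature values. The sketch below illustrates this on synthetic, noise-free data; `Z_demo`, `true_coefs`, `target` and `mlr` are hypothetical names and values, not taken from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with four features and a known linear relationship
# (an assumption for illustration; no noise, so the fit is exact).
rng = np.random.default_rng(1)
Z_demo = rng.normal(size=(100, 4))
true_coefs = np.array([50.0, 4.0, 80.0, -30.0])
target = -1000.0 + Z_demo @ true_coefs

mlr = LinearRegression().fit(Z_demo, target)

# Prediction = intercept + sum over features of (coefficient * feature value)
manual = mlr.intercept_ + Z_demo @ mlr.coef_
print(np.allclose(manual, mlr.predict(Z_demo)))
print(mlr.coef_.shape)  # one coefficient per feature
```

On real data the relationship is noisy, so the fitted coefficients only approximate whatever underlying relationship exists.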

### Evaluating the model using visualization

In [None]:
# import the visualization package: seaborn
import seaborn as sns
%matplotlib inline 

Let's visualize **highway-mpg** as a potential predictor variable of price:

In [None]:
width = 8
height = 6
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)

We can see from this plot that price is negatively correlated with highway-mpg, since the regression slope is negative. However, many data points lie far from the regression line, which can be an indication that this linear model might not be the best fit.

Let us try doing the same with peak rpm.

In [None]:
plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)

Comparing the regression plots of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" are much closer to the fitted line and, on average, decrease as highway-mpg increases. The "peak-rpm" plot suggests that the correlation between 'peak-rpm' and 'price' is weak. We can verify this by using the **.corr()** method.

In [None]:
df[["peak-rpm","highway-mpg","price"]].corr()
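
As a rough illustration of what `.corr()` reports, the sketch below computes the Pearson correlation coefficient by hand and compares it with the pandas result. The data frame `demo` is synthetic: its column names mirror the dataset, but the values are generated here and are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: price depends on highway-mpg,
# while peak-rpm is generated independently of price.
rng = np.random.default_rng(2)
mpg = rng.uniform(15, 55, size=200)
demo = pd.DataFrame({
    "highway-mpg": mpg,
    "peak-rpm": rng.uniform(4000, 6600, size=200),
    "price": 38000 - 800 * mpg + rng.normal(0, 3000, size=200),
})

corr = demo.corr()  # Pearson correlation by default

# Pearson r by hand: co-variation divided by the product of the spreads
x = demo["highway-mpg"].to_numpy()
y = demo["price"].to_numpy()
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(r_manual, corr.loc["highway-mpg", "price"])
```

A value near -1 or 1 indicates a strong linear relationship, and a value near 0 indicates a weak one.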

A good way to see whether a linear model is appropriate for the data is to use a residual plot, which shows the difference between each actual value and its predicted value. If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data.

In [None]:
plt.figure(figsize=(width, height))
sns.residplot(x=df['highway-mpg'],y=df['price'])
plt.show()

Here we see that the points are not randomly spread out around the x-axis, so a non-linear model might be a better fit.
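
The idea behind the residual plot can be sketched numerically. The example below fits a straight line to deliberately curved synthetic data (an assumption for illustration, not the automobile dataset) and shows that the residuals follow a pattern instead of scattering randomly around zero.

```python
import numpy as np

# Curved (quadratic) synthetic relationship, no noise
x = np.linspace(15, 55, 100)
y = 60000 - 2500 * x + 30 * x**2

# Fit a straight line and compute residuals = actual - predicted
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Residuals average to zero, but they are positive at both ends of the
# x range and negative in the middle: a U-shaped pattern.
print(residuals[0] > 0, residuals[len(x) // 2] < 0, residuals[-1] > 0)
```

A systematic shape like this is the numerical counterpart of a residual plot whose points curve around the x-axis.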

The multiple linear regression model that was created earlier cannot be visualized with a regression or residual plot, because it has more than one predictor. One way to look at the fit of the model is by looking at the **distribution plot**. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.

In [None]:
Y_hat = lm.predict(Z)       # First make the prediction

In [None]:
plt.figure(figsize=(width, height))                   # Then we compare it to the actual values

# sns.distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual Value")
sns.kdeplot(x=Y_hat, color="b", label="Fitted Values", ax=ax1)

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.legend()                                          # Show the curve labels

plt.show()
plt.close()

We can see that the fitted values are reasonably close to the actual values since the two distributions overlap a bit. However, there is definitely some room for improvement.

### Using polynomial regression and pipelines

We will use the following function to plot the data:

In [None]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on an evenly spaced grid of x values
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    # Plot the data as points and the fitted polynomial as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

Let's get the variables:

In [None]:
x = df['highway-mpg']
y = df['price']

Let's fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.

In [None]:
# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

In [None]:
PlotPolly(p, x, y, 'highway-mpg')

We can see from the plot that this polynomial model fits the data better than the linear model: the generated polynomial curve follows the data points more closely.
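
One way to put a number on "fits better" is to compare the mean squared error of fits of different degrees. The sketch below does this on synthetic data with a curved relationship; the data and the name `mse_by_degree` are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in data with a clearly non-linear trend plus noise
rng = np.random.default_rng(4)
x = rng.uniform(15, 55, size=150)
y = 0.5 * (55 - x) ** 3 + 5000 + rng.normal(0, 800, size=150)

# Fit a line (degree 1) and a cubic (degree 3), then compare the errors
mse_by_degree = {}
for degree in (1, 3):
    p_demo = np.poly1d(np.polyfit(x, y, degree))
    mse_by_degree[degree] = np.mean((y - p_demo(x)) ** 2)

print(mse_by_degree)
```

Note that the error measured on the same data used for fitting can only go down as the degree goes up, so a lower value here does not by itself show that the higher-degree model will predict new data better.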

We can perform a polynomial transform on multiple features. First, we import the module:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

We create a PolynomialFeatures object of degree 2:

In [None]:
pr=PolynomialFeatures(degree=2)
pr

In [None]:
Z_pr=pr.fit_transform(Z)

In the original data, there are 201 samples and 4 features.

In [None]:
Z.shape

After the transformation, there are 201 samples and 15 features.

In [None]:
Z_pr.shape
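
The jump from 4 to 15 columns can be checked by counting terms: a degree-2 expansion of 4 features contains 1 constant term, 4 linear terms, 4 squared terms and 6 pairwise products. The sketch below confirms this on a small made-up array (`Z_demo` is a hypothetical stand-in for `Z`).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Small made-up array with 5 samples and 4 features
Z_demo = np.arange(20, dtype=float).reshape(5, 4)
Z_demo_pr = PolynomialFeatures(degree=2).fit_transform(Z_demo)

# Number of polynomial terms up to a given degree, bias column included
n_features, degree = 4, 2
expected = comb(n_features + degree, degree)  # 1 + 4 + 4 + 6 = 15

print(Z_demo_pr.shape, expected)
```

The first output column is the constant (bias) term, which is all ones by default.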

### Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.

In [None]:
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]

We input the list as an argument to the pipeline constructor:

In [None]:
pipe=Pipeline(Input)
pipe

First, we convert Z to type float to avoid conversion warnings that may appear as a result of StandardScaler taking float inputs.

Then, we can normalize the data, perform a transform and fit the model in a single call.

In [None]:
Z = Z.astype(float)
pipe.fit(Z,y)

Similarly, we can normalize the data, perform a transform and produce a prediction in a single call.

In [None]:
ypipe=pipe.predict(Z)
ypipe[0:4]
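
To see what the pipeline does internally, the sketch below runs the same three steps by hand and checks that the predictions match. The data (`Z_demo`, `y_demo`) is synthetic and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with four features
rng = np.random.default_rng(5)
Z_demo = rng.normal(loc=100, scale=20, size=(80, 4))
y_demo = 3 * Z_demo[:, 0] + 0.01 * Z_demo[:, 1] ** 2 + rng.normal(0, 1, size=80)

# All three steps wrapped in one pipeline
pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
]).fit(Z_demo, y_demo)

# The same three steps, run by hand one after another
scaler = StandardScaler().fit(Z_demo)
scaled = scaler.transform(Z_demo)
features = PolynomialFeatures(include_bias=False).fit_transform(scaled)
model = LinearRegression().fit(features, y_demo)
manual_pred = model.predict(features)

print(np.allclose(manual_pred, pipe_demo.predict(Z_demo)))
```

The pipeline applies each step's output as the next step's input, which is why one call to `fit` or `predict` is enough.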

## Calculating the Mean Squared Error and R^2

#### Model 1 Simple Linear Regression

Calculating R^2

In [None]:
#highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))

We can say that ~49.659% of the variation of the price is explained by this simple linear model "highway_mpg_fit".

Let's calculate the MSE:

We can predict the output i.e., "yhat" using the predict method, where X is the input variable:

In [None]:
Yhat=lm.predict(X)
print('The first four predicted values are: ', Yhat[0:4])

Let's import the function mean_squared_error from the module metrics:

In [None]:
from sklearn.metrics import mean_squared_error

We can compare the predicted results with the actual results:

In [None]:
mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)
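
The mean squared error is the average of the squared differences between actual and predicted values. The sketch below computes it by hand on four made-up numbers (not taken from the dataset) and compares the result with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only
actual = np.array([10.0, 12.0, 15.0, 20.0])
predicted = np.array([11.0, 12.0, 13.0, 22.0])

# Errors are -1, 0, 2, -2, so the squared errors are 1, 0, 4, 4
mse_manual = np.mean((actual - predicted) ** 2)
mse_sklearn = mean_squared_error(actual, predicted)

print(mse_manual, mse_sklearn)  # both 2.25
print(np.sqrt(mse_manual))      # 1.5, in the same units as the target
```

Because the errors are squared, the MSE is expressed in squared units of the target. Taking the square root gives a value in the original units, which is often easier to interpret.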

#### Model 2 Multiple Linear Regression

Calculating R^2

In [None]:
# fit the model 
lm.fit(Z, df['price'])
# Find the R^2
print('The R-square is: ', lm.score(Z, df['price']))

We can say that ~80.935 % of the variation of price is explained by this multiple linear regression "multi_fit".

Let's calculate the MSE.

We produce a prediction:

In [None]:
Y_predict_multifit = lm.predict(Z)

We compare the predicted results with the actual results:

In [None]:
print('The mean square error of price and predicted value using multifit is: ', \
      mean_squared_error(df['price'], Y_predict_multifit))

#### Model 3: Polynomial Fit

Calculating the R^2.

Let's import the function r2_score from the module metrics, since the NumPy polynomial object has no score method.

In [None]:
from sklearn.metrics import r2_score

We apply the function to get the value of R^2:

In [None]:
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)

We can say that ~67.419 % of the variation of price is explained by this polynomial fit.
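
R^2 compares the model's squared error with the squared error of always predicting the mean: it equals one minus the ratio of the two. The sketch below computes it by hand on four made-up numbers (not from the dataset) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
actual = np.array([10.0, 12.0, 15.0, 20.0])
predicted = np.array([11.0, 12.0, 13.0, 22.0])

ss_res = np.sum((actual - predicted) ** 2)      # error left after the model
ss_tot = np.sum((actual - actual.mean()) ** 2)  # error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(actual, predicted)))
```

An R^2 of 1 means the predictions match the actual values exactly, while a value of 0 means the model does no better than predicting the mean.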

##### MSE

We can also calculate the MSE:

In [None]:
mean_squared_error(df['price'], p(x))