# Multiple Linear Regression

In this notebook we extend the simple linear regression model from the last chapter, which predicted car prices, to include several independent variables in order to produce better predictions.

### Package and Data Loading

As before, we will import the required packages and our car price data set.

In [None]:
import pandas as pd
import matplotlib.pyplot as plot
import statsmodels.api as stats
import numpy as np

In [None]:
carprice_df = pd.read_csv('CarPrice_Assignment.csv')

### Assessing the Data

In [None]:
carprice_df

In [None]:
carprice_df

In [None]:
carprice_df

Here we check the number of unique values in each categorical variable, selecting those columns with df.select_dtypes().

In [None]:
carprice_df.select_dtypes(include='object').nunique()  # unique values per categorical column

As we can see, the data contains a mixture of numeric and categorical (object) types. We will remove the car_ID field, as it is only an identifier. For the purposes of this lesson we will also remove CarName, as it contains a large number of unique values. (How could we extract more useful information from this variable?)

In [None]:
carprice_df = carprice_df.drop(columns=['car_ID', 'CarName'])
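One possible answer to the question above is to pull the manufacturer out of CarName before dropping the column. The sketch below assumes the names follow a "brand model" pattern. The sample values are hypothetical and are not rows read from the data file.

```python
import pandas as pd

# Hypothetical names in an assumed "brand model" format (not taken from the CSV).
names = pd.Series(['alfa-romero giulia', 'audi 100 ls', 'bmw 320i', 'toyota corolla'])

# Treat everything before the first space as the manufacturer.
brand = names.str.split(' ', n=1).str[0]
print(brand.value_counts())
```

A brand column has far fewer unique values than the full name, so it could be one-hot encoded like the other categorical variables.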

## Basic Multiple Regression Model

In [None]:
Y_basic = 
X_basic = 

In [None]:
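model_basic = stats.OLS(Y_basic, X_basic)  # ordinary least squares, as in the previous notebook
results_basic = model_basic.fit()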

We can see our results and the parameters for each of the independent variables using the .summary() method again.

In [None]:
print(results_basic.summary())

## Full Multiple Regression Model

A correlation matrix is a handy way to inspect the numerical variables, because it shows the correlation between every pair of variables at once. We can then remove some of the independent variables that are highly correlated with one another, since multicollinearity makes the fitted coefficients unstable and hard to interpret. We create the correlation matrix with the df.corr() method and add a red/blue heatmap to make the extreme correlations easier to spot.

In [None]:
carprice_df.select_dtypes(exclude='object').corr().style.background_gradient(cmap='coolwarm')

In [None]:
carprice_df = carprice_df
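Reading the heatmap by eye works for a handful of columns. A programmatic version of the same check might look like the sketch below, which runs on a small synthetic frame. The 0.9 threshold and the column names a, b and c are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=100)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.05, size=100),  # almost a rescaled copy of a
    'c': rng.normal(size=100),                      # unrelated to a and b
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair of variables is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

This keeps the first variable of each highly correlated pair and flags the later one. Which of the two to drop in practice is a modelling decision.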

#### One Hot Encoding

In [None]:
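dummy = pd.get_dummies(carprice_df.select_dtypes(include='object'), dtype=float)  # one 0/1 column per category level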

In [None]:
carprice_df = pd.concat([carprice_df.select_dtypes(exclude='object'), dummy], axis=1)
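To make the encoding step concrete, here is a small sketch of pd.get_dummies on a toy frame. The values are hypothetical, and dtype=float is used so the indicator columns are numeric rather than boolean.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'fueltype': ['gas', 'diesel', 'gas'],
    'price': [13495.0, 16500.0, 9000.0],
})

# Each level of 'fueltype' becomes its own 0/1 column named '<column>_<level>'
encoded = pd.get_dummies(toy, columns=['fueltype'], dtype=float)
print(encoded)
```

Keeping every level alongside a constant column makes the indicators sum to the constant, which is a source of perfect multicollinearity. Dropping one level per variable (for example with the drop_first=True option of pd.get_dummies) is one common way to avoid it.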

We can now repeat the process of removing highly correlated variables, this time including the one-hot encoded features.

In [None]:
carprice_df.corr().style.background_gradient(cmap='coolwarm')

In [None]:
carprice_df = carprice_df.drop(columns=['compressionratio', 'drivewheel_fwd', 'enginetype_rotor', 'fuelsystem_4bbl', 'fuelsystem_idi'])

In [None]:
carprice_df.shape

### Test/Train Split

In [None]:
train_df = carprice_df.sample(frac=0.7, random_state=99)  # random_state is a seed value, so the split is reproducible
test_df = carprice_df.drop(train_df.index)

In [None]:
train_df.shape

In [None]:
test_df.shape
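The sample-then-drop pattern used for the split can be checked on a toy frame, as sketched below. It relies on the index labels being unique, since the test set is defined as every row whose label was not sampled into the training set.

```python
import pandas as pd

toy = pd.DataFrame({'x': range(10)})

train = toy.sample(frac=0.7, random_state=99)  # 70% of rows, chosen at random
test = toy.drop(train.index)                   # everything not in the training set

# The two sets should be disjoint and together cover the original frame
print(len(train), len(test))
print(train.index.intersection(test.index).empty)
```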

### Fitting the Linear Regression Model

We once again use statsmodels to fit our linear regression model. We do this in the same way as the previous notebook except now our X_train contains all of our independent variables (plus the constant column).

In [None]:
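Y_train = train_df.price
X_train = stats.add_constant(train_df.drop(columns=['price']))  # all predictors plus the constant column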

In [None]:
model_carprice = stats.OLS(Y_train, X_train)
results_carprice = model_carprice.fit()

In [None]:
print(results_carprice.summary())

In [None]:
print('The sum of squared residuals is {:.1f}'.format(results_carprice.ssr))

We can also use our test set to compare our predictions with the observed values.

In [None]:
Y_test = test_df.price
test_df = stats.add_constant(test_df)
X_test = test_df[X_train.columns]

In [None]:
test_predictions = results_carprice.predict(X_test)

In [None]:
plot.scatter(test_predictions, Y_test)
plot.plot([5000, 50000], [5000, 50000], c='k', ls='--')
plot.xlabel('Predicted Price [$]')
plot.ylabel('Observed Price [$]')
plot.show()
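Alongside the scatter plot, the comparison between predicted and observed values can be summarised numerically. The sketch below computes the root mean squared error and the coefficient of determination by hand. The observed and predicted arrays are hypothetical numbers and are not output from the model above.

```python
import numpy as np

# Hypothetical values standing in for Y_test and test_predictions
observed = np.array([10000.0, 15000.0, 20000.0, 30000.0])
predicted = np.array([11000.0, 14000.0, 22000.0, 28000.0])

residuals = observed - predicted

# Root mean squared error: typical size of a prediction error, in dollars
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2 = 1 - SS_res / SS_tot: share of variance in the observations explained
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('RMSE: {:.1f}'.format(rmse))
print('R squared: {:.3f}'.format(r2))
```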

## Scikit-Learn


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

### Test/Train Split

In [None]:
Y = carprice_df.price
X = carprice_df.drop(columns=['price'])

In [None]:
sk_X_train, sk_X_test, sk_Y_train, sk_Y_test = train_test_split(X, Y, test_size=0.3, random_state=99)

In [None]:
regressor = LinearRegression()  
regressor.fit(sk_X_train, sk_Y_train)

In [None]:
sk_intercept_carprice = regressor.intercept_
sk_engsize_coeffs = regressor.coef_
sk_ssr_carprice = np.sum((sk_Y_train-regressor.predict(sk_X_train))**2)

In [None]:
pd.Series(sk_engsize_coeffs, index=sk_X_train.columns)

In [None]:
print('The intercept value is {:.1f}'.format(sk_intercept_carprice))
print('The sum of squared residuals is {:.1f}'.format(sk_ssr_carprice))