The objective of this project is to practise what we've been learning in the Coursera course. The dataset contains information about cars and the prices they sold for. We want to predict car prices using multiple linear regression.

First, we have to import the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
df = pd.read_csv('coursera_machine_learning_specialization/course_1_supervised_machine_learning:_regression_and_classification/week_2/custom_project_cars/cars.csv')
df.head()

In [None]:
# Get only the numeric columns
df = df.select_dtypes(include=np.number)
# Get rid of the ID column
df.drop(columns=['ID'], inplace=True)
df.head()

Before doing the regression, I'll follow this article, which walks through exploratory data analysis with Pandas: https://www.kaggle.com/code/kashnitsky/topic-1-exploratory-data-analysis-with-pandas

In [None]:
print(df.shape)
df.info()

In [None]:
df.describe()

In [None]:
df.corr(numeric_only=True)

In [None]:
plt.matshow(df.corr(numeric_only=True))
plt.show()

There are a few things that I'm noticing from the correlation matrix.
- length, width, and weight are strongly correlated with each other
- gas mileage for city and highway are strongly correlated
- horsepower is correlated with enginesize
- price seems to be correlated with wheelbase, width, length, weight, enginesize, boreratio, horsepower, citympg, highwaympg
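
Reading relationships like these off a colour plot is a bit error-prone, so here is a minimal sketch of how the strongly correlated pairs could be listed programmatically. It runs on synthetic data, not the cars dataset: the `demo` frame, its column names (only loosely modelled on the real ones), the `strong_pairs` helper, and the 0.8 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cars data; the values are randomly generated
# for illustration only and are NOT taken from cars.csv.
rng = np.random.default_rng(0)
n = 200
weight = rng.normal(2500, 500, n)
demo = pd.DataFrame({
    "curbweight": weight,
    "carlength": weight * 0.05 + rng.normal(0, 5, n),  # tied to weight
    "boreratio": rng.normal(3.3, 0.3, n),              # independent noise
})
demo["price"] = weight * 5 + rng.normal(0, 500, n)     # driven by weight


def strong_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute correlation exceeds `threshold`."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation of 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


print(strong_pairs(demo))
```

With the real data, something like `strong_pairs(df)` would give a ranked list to check against the observations above. The threshold is a judgment call rather than a fixed rule.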

Now that we have some idea of which features we want to use, let's drop the unimportant ones and do some feature engineering.

In [None]:
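df_select = df.copy()
# Combine the two strongly correlated mileage columns into a single average
df_select['mpg'] = df_select[['citympg', 'highwaympg']].mean(axis=1)
# Keep a small, less redundant set of features plus the target column
df_select = df_select.loc[:, ['curbweight', 'boreratio', 'horsepower', 'mpg', 'price']]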
df_select.head()

Before modelling, let's split the data into a training set and a test set. We don't need a separate validation set because we're not really comparing other model types.

In [None]:
# Both frames have the same rows, so the same random_state gives the same split
train_all, test_all = train_test_split(df, test_size=0.2, random_state=1)
train_select, test_select = train_test_split(df_select, test_size=0.2, random_state=1)

In [None]:
linear_model_all = LinearRegression()
linear_model_select = LinearRegression()

linear_model_all.fit(train_all.loc[:, :'highwaympg'], train_all.loc[:, 'price'])
linear_model_select.fit(train_select.loc[:, :'mpg'], train_select.loc[:, 'price'])

b_all = linear_model_all.intercept_
w_all = linear_model_all.coef_
b_select = linear_model_select.intercept_
w_select = linear_model_select.coef_

print('ALL FEATURES:')
for coef in list(zip(linear_model_all.feature_names_in_, w_all)):
    print(coef)
print()
print('SELECT FEATURES:')
for coef in list(zip(linear_model_select.feature_names_in_, w_select)):
    print(coef)

print()

print(f"b_all = {b_all:0.2f}")
print(f"b_select = {b_select:0.2f}")

print()

test_all_pred = linear_model_all.predict(test_all.loc[:, :'highwaympg'])
test_select_pred = linear_model_select.predict(test_select.loc[:, :'mpg'])

# r2_score expects (y_true, y_pred), in that order
print(f"training r2 score all = {r2_score(train_all.loc[:, 'price'], linear_model_all.predict(train_all.loc[:, :'highwaympg']))}")
print(f"training r2 score select = {r2_score(train_select.loc[:, 'price'], linear_model_select.predict(train_select.loc[:, :'mpg']))}")

print()

print(f"r2 score all = {r2_score(test_all.loc[:, 'price'], test_all_pred)}")
print(f"r2 score select = {r2_score(test_select.loc[:, 'price'], test_select_pred)}")

As we can see, the model using all the features seems to overfit the training data: its r2 value on the training set is much higher than its r2 on the test set. The selective model also seems to perform better on the test set, which is a plus. We could do further feature engineering and selection, but this is enough for the purposes of practice.
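
One thing to keep in mind when comparing scores like these is that scikit-learn's `r2_score` takes the true values first and the predictions second, and the result is not symmetric in the two. The small sketch below uses made-up numbers (not values from the cars data) to show how much the score can change when the arguments are swapped.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up prices and predictions, purely illustrative.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([18.0, 22.0, 28.0, 32.0])  # predictions shrunk toward the mean

r2_correct = r2_score(y_true, y_pred)  # truth first, prediction second
r2_swapped = r2_score(y_pred, y_true)  # arguments reversed

print(f"r2_score(y_true, y_pred) = {r2_correct:.3f}")  # 0.728
print(f"r2_score(y_pred, y_true) = {r2_swapped:.3f}")  # -0.172
```

The residuals are identical in both calls. What changes is the variance they are compared against: the score is 1 minus the residual sum of squares divided by the total sum of squares of whichever array is passed first.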