In [1]:
import pandas as pd
import matplotlib
import numpy as np
from sklearn import preprocessing
from sklearn import linear_model
from sklearn.model_selection import train_test_split

## Introduction
The first attempt using linear regression (see the "LinearRegression" notebook in this same directory) did not go well. We got an R^2 score of -1377777, far worse than the score of about 0 that a constant prediction of the mean price would earn.

I think the reason is that we tried to cover 400+ car models with only 42,000 rows of data. Some car models had as few as 3 examples, which doesn't seem like enough to learn anything reliable about them.

For this notebook, I've restricted the data to only those models that have 100 or more examples in the dataset.
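
The restricted CSV used below was prepared ahead of time, and the exact filtering code is not shown in this notebook. As a minimal sketch of how such a restriction could be done, assuming a `model` column like the one in this dataset (the helper name `restrict_to_common_models` and the toy data are hypothetical):

```python
import pandas as pd

def restrict_to_common_models(df, min_examples=100, column="model"):
    """Keep only rows whose `column` value appears at least `min_examples` times."""
    # Map each row's value to the number of rows that share it.
    counts = df[column].map(df[column].value_counts())
    return df[counts >= min_examples]

# Toy illustration with a threshold of 2 instead of 100.
toy = pd.DataFrame({
    "model": ["Explorer", "Explorer", "Traverse", "Sorento", "Sorento", "Sorento"],
    "price": [12788, 13500, 25900, 21969, 20500, 22100],
})
restricted = restrict_to_common_models(toy, min_examples=2)
print(restricted["model"].value_counts())
```

In the toy frame, "Traverse" appears only once and is dropped, while the other two car models are kept.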

## Data Preparation
Read the data and look at the first few rows:

In [8]:
df = pd.read_csv('../../data/cars/processed_data/data-3-27-18-restricted-model.csv')
df.head(5)

Unnamed: 0,listing_id,vin,make,model,year,condition,mpg_city,mileage,transmission,cars_rating,state,price
0,728835398,1FM5K7B86DGB59447,Ford,Explorer,2013,Used,18.0,115244.0,Automatic,4.4,CO,12788
1,728467649,JTEZT14R350022851,Toyota,4-Runner,2005,Used,17.0,183578.0,Automatic,4.7,GA,7021
2,728759154,1GNKVGKDXHJ317848,Chevrolet,Traverse,2017,Used,17.0,23077.0,Automatic,4.8,NM,25900
3,728922954,1N6AD0ER7EN718734,Nissan,Frontier,2014,Used,17.0,25864.0,Automatic,4.4,CA,18995
4,728946737,5XYPGDA38GG104959,Kia,Sorento,2016,Used,19.0,26598.0,Automatic,4.8,IA,21969


Fill in missing values, using the column mean for continuous columns and the most frequent value for categorical columns:

In [9]:
# Continuous columns: fill gaps with the column mean
df['mpg_city'] = df['mpg_city'].fillna(df['mpg_city'].mean())
df['mileage'] = df['mileage'].fillna(df['mileage'].mean())
df['cars_rating'] = df['cars_rating'].fillna(df['cars_rating'].mean())
df['year'] = df['year'].fillna(df['year'].mean())

# Categorical columns: fill gaps with the most frequent value
df['make'] = df['make'].fillna(df['make'].value_counts().index[0])
df['model'] = df['model'].fillna(df['model'].value_counts().index[0])
df['state'] = df['state'].fillna(df['state'].value_counts().index[0])
df['transmission'] = df['transmission'].fillna(df['transmission'].value_counts().index[0])
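
The same fill rules can be written once and applied to a list of columns, which makes it easier to confirm afterwards that no gaps remain. This is only a sketch of that idea; the helper name `fill_missing` and the toy data are hypothetical and not part of the original notebook:

```python
import pandas as pd

def fill_missing(df, continuous, categorical):
    """Return a copy with means filled into continuous columns and
    the most frequent value filled into categorical columns."""
    out = df.copy()
    for col in continuous:
        out[col] = out[col].fillna(out[col].mean())
    for col in categorical:
        # value_counts() ignores missing values and sorts by frequency.
        out[col] = out[col].fillna(out[col].value_counts().index[0])
    return out

# Toy frame with one gap in each column.
toy = pd.DataFrame({
    "mileage": [10000.0, None, 30000.0],
    "make": ["Ford", None, "Ford"],
})
filled = fill_missing(toy, continuous=["mileage"], categorical=["make"])

# Count of remaining missing values per column (all zeros if the fill worked).
print(filled.isna().sum())
```

Running `df.isna().sum()` on the real data after the cell above is a quick way to check that every column used as a feature is complete.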

## Model Training

In [10]:
features = df[['cars_rating', 'make', 'model', 'mileage', 'mpg_city', 'state', 'transmission', 'year']]
labels = df[['price']]
features_encoded = pd.get_dummies(features, columns=['make', 'model', 'state', 'transmission'])
X_train, X_test, Y_train, Y_test = train_test_split(features_encoded, labels, test_size=0.2, train_size=0.8)
features.describe()

,cars_rating,mileage,mpg_city,year
count,33363.0,33363.0,33363.0,33363.0
mean,4.593293,54180.981816,22.267125,2013.502083
std,0.220367,43189.484623,5.817766,3.603217
min,3.0,1.0,10.0,1990.0
25%,4.5,24160.0,18.0,2012.0
50%,4.6,40946.0,22.0,2015.0
75%,4.8,74200.5,25.0,2016.0
max,5.0,302417.0,116.0,2018.0
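
The `pd.get_dummies` call in the cell above turns each categorical column into one indicator column per distinct value, which is why `describe()` on the un-encoded `features` only summarizes the four numeric columns. A small illustration on toy data (the values here are made up for the example):

```python
import pandas as pd

toy = pd.DataFrame({
    "make": ["Ford", "Toyota", "Ford"],
    "year": [2013, 2005, 2017],
})

# One indicator column is created per distinct value of 'make';
# the numeric 'year' column is passed through unchanged.
encoded = pd.get_dummies(toy, columns=["make"])
print(list(encoded.columns))
print(encoded["make_Ford"].astype(int).tolist())
```

With hundreds of distinct car models, this encoding adds hundreds of columns, so a car model with only a handful of rows gets its own coefficient fitted from very little data.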


In [11]:
model = linear_model.LinearRegression()
model.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [12]:
model.score(X_test, Y_test)

0.77899004713176589

A much better score! Restricting the data to car models with at least 100 examples has really helped: R^2 went from -1377777 to about 0.78.
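
For context on both scores, `LinearRegression.score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up prices to show why 1.0 is a perfect fit, 0.0 matches always predicting the mean, and wildly wrong predictions give a negative value (the helper name `r_squared` and all numbers are illustrative only):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

prices = [10000, 15000, 20000, 25000]

perfect = r_squared(prices, prices)                           # exact predictions
baseline = r_squared(prices, [17500] * 4)                     # always predict the mean
wild = r_squared(prices, [90000, -40000, 70000, -10000])      # far-off predictions

print(perfect, baseline, wild)
```

A hugely negative score like the one from the first notebook therefore means that some predictions were off by far more than the spread of the prices themselves.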