In [2]:
import pandas as pd
import matplotlib
import numpy as np
from sklearn import preprocessing
from sklearn import linear_model
from sklearn.model_selection import train_test_split

## Introduction
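The first attempt using linear regression (see the "LinearRegression" notebook in this same directory) did not go well. We got an R^2 score of -1377777. A negative R^2 means the model did worse on the test set than simply predicting the mean price for every car, so that fit was essentially useless.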
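To make the meaning of a negative R^2 concrete, here is a minimal sketch of the formula R^2 = 1 - SS_res / SS_tot in plain Python. The helper name `r_squared` and the prices are made up for illustration; they are not taken from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # error of "always predict the mean"
    return 1 - ss_res / ss_tot

prices = [10000, 15000, 20000, 25000]  # made-up prices, for illustration only
mean_price = sum(prices) / len(prices)

perfect = r_squared(prices, prices)                          # exact predictions
baseline = r_squared(prices, [mean_price] * len(prices))     # always predict the mean
wild = r_squared(prices, [90000, -40000, 150000, 0])         # predictions far off the mark

print(perfect, baseline, wild)
```

Exact predictions give 1, always predicting the mean gives 0, and predictions that are further from the truth than the mean give a negative value with no lower bound.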

I think the reason is that we tried to learn 400+ car models from only 42,000 rows of data. Some car models had as few as 3 examples, which doesn't seem like enough to learn anything reliable about them.

For this notebook, I've restricted the data to only those models that have 100 or more examples in the dataset.
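The restricted CSV was prepared outside this notebook, so the exact filtering code is not shown here. As a rough sketch of how such a restriction could be done in pandas (the function name `restrict_to_common_models` and the toy data are hypothetical):

```python
import pandas as pd

def restrict_to_common_models(df, min_examples=100):
    """Keep only rows whose car model appears at least `min_examples` times."""
    # Map every row to the number of rows sharing its model.
    counts = df['model'].map(df['model'].value_counts())
    return df[counts >= min_examples]

# Toy data with a lower threshold so the effect is visible.
toy = pd.DataFrame({
    'model': ['Explorer', 'Explorer', 'Explorer', 'RAV4'],
    'price': [12788, 13500, 11900, 20995],
})
restricted = restrict_to_common_models(toy, min_examples=2)
print(restricted['model'].value_counts())
```

Rows with a missing `model` get a count of NaN and are dropped by the comparison, which may or may not be what the original preprocessing did.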

## Data Preparation
Read the data and look at the first few rows:

In [3]:
# df = pd.read_csv('../../data/cars/processed_data/data-3-27-18-restricted-model.csv')
df = pd.read_csv('../../data/aggregated/processed_data/data-3-28-18-restricted.csv')
df.head(5)

Unnamed: 0,listing_id,vin,make,model,year,mileage,transmission,exterior_color,state,price,source
0,728835398,1FM5K7B86DGB59447,Ford,Explorer,2013,115244.0,Automatic,Silver,CO,12788,Cars.com
1,728467649,JTEZT14R350022851,Toyota,4-Runner,2005,183578.0,Automatic,White,GA,7021,Cars.com
2,728759154,1GNKVGKDXHJ317848,Chevrolet,Traverse,2017,23077.0,Automatic,White,NM,25900,Cars.com
3,728922954,1N6AD0ER7EN718734,Nissan,Frontier,2014,25864.0,Automatic,Blue,CA,18995,Cars.com
4,729022991,2T3WFREV9FW164238,Toyota,RAV4,2015,28785.0,Automatic,White,CA,20995,Cars.com


Replace missing values in those columns that are missing data:

In [5]:
# Continuous: fill missing values with the column mean
df['mileage'] = df['mileage'].fillna(df['mileage'].mean())
df['year'] = df['year'].fillna(df['year'].mean())

# Categorical: fill missing values with the most frequent value
# (value_counts() sorts by frequency, so index[0] is the mode)
df['make'] = df['make'].fillna(df['make'].value_counts().index[0])
df['model'] = df['model'].fillna(df['model'].value_counts().index[0])
df['state'] = df['state'].fillna(df['state'].value_counts().index[0])
df['transmission'] = df['transmission'].fillna(df['transmission'].value_counts().index[0])
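To see what these two fill strategies do, here is a small self-contained sketch on a toy frame. The values are invented for illustration and are not from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mileage': [10000.0, np.nan, 30000.0],
    'make': ['Ford', 'Ford', None],
})

# Continuous column: mean() skips NaN, so the gap is filled with the
# average of the observed values.
toy['mileage'] = toy['mileage'].fillna(toy['mileage'].mean())

# Categorical column: value_counts() ignores missing values and sorts by
# frequency, so index[0] is the most common make.
toy['make'] = toy['make'].fillna(toy['make'].value_counts().index[0])

print(toy)
```

Note that mean-filling `year` the same way can produce a fractional year. That is harmless for a linear model but worth remembering if the column is later treated as an integer.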

## Model Training

In [7]:
features = df[['make', 'model', 'mileage', 'state', 'transmission', 'year']]
labels = df[['price']]
features_encoded = pd.get_dummies(features, columns=['make', 'model', 'state', 'transmission'])
X_train, X_test, Y_train, Y_test = train_test_split(features_encoded, labels, test_size=0.2, train_size=0.8)
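The one-hot encoding step is what turns each car model into its own feature column, which is why rare models were a problem in the first attempt. A minimal sketch of what `pd.get_dummies` produces, using a made-up two-column frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'make': ['Ford', 'Toyota', 'Ford'],
    'year': [2013, 2005, 2017],
})

# Each distinct value of 'make' becomes its own indicator column;
# numeric columns such as 'year' pass through unchanged.
encoded = pd.get_dummies(toy, columns=['make'])
print(list(encoded.columns))  # ['year', 'make_Ford', 'make_Toyota']
```

With 400+ car models, the encoded matrix gets 400+ indicator columns for `model` alone, and a model with 3 rows contributes a column that is almost entirely zeros.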

In [8]:
model = linear_model.LinearRegression()
model.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [15]:
model.score(X_test, Y_test)

0.77724205828266502

A much better score! Restricting the data to car models with at least 100 examples each has really helped.
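One caveat: the score comes from a single random 80/20 split with no fixed seed, so it will change a little from run to run. A possible way to check how stable it is would be k-fold cross-validation. The sketch below uses synthetic data so that it runs on its own; in this notebook, `features_encoded` and `labels` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration, not the car data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Five train/test splits, one R^2 score per split.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())
```

A small standard deviation across folds would suggest the improvement is not an artifact of one lucky split.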