Processing
---

This notebook prepares the data for modeling. Null values in numeric data are imputed. Lists of car features are cleaned and consolidated (for example, colors `black` and `Black` are considered the same). String features and list-of-string features are split, and the most common values are converted to dummy variables.
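As a rough illustration of the first two steps, here is a minimal sketch on a made-up frame. The column names mirror the dataset, but the values, the median imputation strategy, and the `toy` variable are assumptions for illustration only, not necessarily what the notebook does later.

```python
import pandas as pd

# Toy frame: column names follow the dataset, values are invented.
toy = pd.DataFrame({
    "VehMileage": [39319.0, None, 20404.0],
    "VehColorExt": ["Black", "black ", None],
})

# Impute numeric nulls with the column median (an assumed strategy).
toy["VehMileage"] = toy["VehMileage"].fillna(toy["VehMileage"].median())

# Normalize whitespace and case so 'Black' and 'black ' collapse to one value.
# Missing colors stay missing.
toy["VehColorExt"] = toy["VehColorExt"].str.strip().str.lower()

print(toy)
```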

In [13]:
#import libraries
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# import sklearn

#display options
pd.options.display.max_columns = 40
%matplotlib inline
plt.style.use('dark_background')

In [11]:
#import data
data_path = '../data/'
train_data_filename = 'Training_DataSet.csv'
test_data_filename = 'Test_Dataset.csv'

traindf = pd.read_csv(data_path + train_data_filename)


#copy training data to a new dataframe to use for modeling
traindf_proc = traindf.copy()
traindf_proc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6298 entries, 0 to 6297
Data columns (total 29 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ListingID             6298 non-null   int64  
 1   SellerCity            6298 non-null   object 
 2   SellerIsPriv          6298 non-null   bool   
 3   SellerListSrc         6296 non-null   object 
 4   SellerName            6298 non-null   object 
 5   SellerRating          6298 non-null   float64
 6   SellerRevCnt          6298 non-null   int64  
 7   SellerState           6298 non-null   object 
 8   SellerZip             6296 non-null   float64
 9   VehBodystyle          6298 non-null   object 
 10  VehCertified          6298 non-null   bool   
 11  VehColorExt           6225 non-null   object 
 12  VehColorInt           5570 non-null   object 
 13  VehDriveTrain         5897 non-null   object 
 14  VehEngine             5937 non-null   object 
 15  VehFeats             

# SellerCity
Although SellerCity did not appear to influence the average dealer listing price, it may still carry signal when combined with other features. A linear regression model will not capture such interactions on its own, but a decision tree, neural network, or similar model could.

In [36]:
#make dummies for the seller cities with the most sales in the training set.
#perhaps a bit arbitrary, but let's cut it off at cities above 30 sales.
#that's the 20 most common cities.
cities = traindf['SellerCity'].value_counts(ascending=False)[:20]
cities = cities.index
cities

Index(['Chicago', 'Battle Creek', 'Columbus', 'Louisville', 'Houston',
       'Atlanta', 'Richmond', 'Raleigh', 'Indianapolis', 'Vienna',
       'Cincinnati', 'Dallas', 'White Bear Lake', 'Palmyra', 'Rochester',
       'Nashville', 'Milwaukee', 'St. Louis', 'Lexington', 'Pittsburgh'],
      dtype='object')
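The cell above slices the top 20 cities, which in this dataset happens to match the "above 30 sales" cutoff. If the cutoff itself is what matters, it can be applied directly to the counts. The sketch below uses a made-up series and a hypothetical `min_sales` threshold rather than the real data.

```python
import pandas as pd

# Invented city column, only to illustrate the selection logic.
toy_cities = pd.Series(["Chicago"] * 4 + ["Columbus"] * 3 + ["Vienna"] * 1)

min_sales = 2  # hypothetical cutoff; the notebook's comment suggests about 30 on the real data
counts = toy_cities.value_counts()

# Keep every city whose count exceeds the cutoff, most frequent first.
frequent_cities = counts[counts > min_sales].index.tolist()
print(frequent_cities)
```

Filtering by count keeps the rule stable if the training data changes, whereas a fixed top-N slice may start including cities with very few sales.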

In [46]:
# takes a dataframe, a column to expand, and a list of values in that column to turn into dummies.
# any value in the column that is not in dummy_list is ignored (it gets 0 in every dummy column)

#the purpose of this is to produce the same columns, in the same order, in the training set and in any test set.
#Otherwise, the dummy columns may not include the same cities if the frequency distribution differs
#or if a certain city isn't represented.
def make_specific_dummies(df, column, dummy_list):
    #make a copy so we don't change the original
    df2 = df.copy()
    #add one 0/1 indicator column per value in the dummy list
    for dummy_value in dummy_list:
        df2[column + '_' + dummy_value] = df2[column].apply(lambda entry: 1 if entry == dummy_value else 0)
    #drop the original column now that it has been expanded
    return df2.drop(columns = column)
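To check that this approach really yields matching columns for train and test data, here is a small sketch on invented frames. It uses a vectorized variant of `make_specific_dummies` under the hypothetical name `make_specific_dummies_vec`. The toy frames and the city list are assumptions for illustration.

```python
import pandas as pd

def make_specific_dummies_vec(df, column, dummy_list):
    """Vectorized variant (hypothetical name) of make_specific_dummies."""
    df2 = df.copy()
    for dummy_value in dummy_list:
        # The comparison is vectorized; missing values compare unequal and become 0.
        df2[f"{column}_{dummy_value}"] = (df2[column] == dummy_value).astype(int)
    return df2.drop(columns=column)

# Invented frames: the "test" frame has no Chicago rows, one unseen city, and a null.
toy_train = pd.DataFrame({"SellerCity": ["Chicago", "Columbus", "Chicago"]})
toy_test = pd.DataFrame({"SellerCity": ["Columbus", "Fargo", None]})
city_list = ["Chicago", "Columbus"]

train_out = make_specific_dummies_vec(toy_train, "SellerCity", city_list)
test_out = make_specific_dummies_vec(toy_test, "SellerCity", city_list)

# Both frames end up with identical, identically ordered columns.
print(list(train_out.columns) == list(test_out.columns))
```

The unseen city and the null both map to all-zero rows, which is the "ignored" behavior described in the comments above.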
    

In [48]:
traindf = make_specific_dummies(traindf, 'SellerCity', cities)
traindf.head()

Unnamed: 0,ListingID,SellerIsPriv,SellerListSrc,SellerName,SellerRating,SellerRevCnt,SellerState,SellerZip,VehBodystyle,VehCertified,VehColorExt,VehColorInt,VehDriveTrain,VehEngine,VehFeats,VehFuel,VehHistory,VehListdays,VehMake,VehMileage,...,SellerCity_Chicago,SellerCity_Battle Creek,SellerCity_Columbus,SellerCity_Louisville,SellerCity_Houston,SellerCity_Atlanta,SellerCity_Richmond,SellerCity_Raleigh,SellerCity_Indianapolis,SellerCity_Vienna,SellerCity_Cincinnati,SellerCity_Dallas,SellerCity_White Bear Lake,SellerCity_Palmyra,SellerCity_Rochester,SellerCity_Nashville,SellerCity_Milwaukee,SellerCity_St. Louis,SellerCity_Lexington,SellerCity_Pittsburgh
0,3287,False,Inventory Command Center,Prime Motorz,5.0,32,MI,48091.0,SUV,False,White,Black,4X4,3.6L V6,"['Adaptive Cruise Control', 'Antilock Brakes',...",Gasoline,"1 Owner, Non-Personal Use Reported, Buyback Pr...",8.600069,Jeep,39319.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,3920,False,Cadillac Certified Program,Gateway Chevrolet Cadillac,4.8,1456,ND,58103.0,SUV,True,Black,,,,,Gasoline,"1 Owner, Buyback Protection Eligible",2.920127,Cadillac,30352.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,4777,False,Jeep Certified Program,Wilde Chrysler Jeep Dodge Ram & Subaru,4.8,1405,WI,53186.0,SUV,True,Brilliant Black Crystal Pearlcoat,Black,4x4/4WD,Regular Unleaded V-6 3.6 L/220,['18 WHEEL & 8.4 RADIO GROUP-inc: Nav-Capa...,Gasoline,"1 Owner, Buyback Protection Eligible",28.107014,Jeep,38957.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,6242,False,Inventory Command Center,Century Dodge Chrysler Jeep RAM,4.4,21,MO,63385.0,SUV,False,Diamond Black Crystal Pearlcoat,Black,4WD,3.6L V6,"['Android Auto', 'Antilock Brakes', 'Apple Car...",Gasoline,"1 Owner, Non-Personal Use Reported, Buyback Pr...",59.816875,Jeep,20404.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,7108,False,HomeNet Automotive,Superior Buick GMC of Fayetteville,3.7,74,AR,72703.0,SUV,False,Radiant Silver Metallic,Cirrus,FWD,Gas V6 3.6L/222.6,"['4-Wheel Disc Brakes', 'ABS', 'Adjustable Ste...",Gasoline,"1 Owner, Non-Personal Use Reported, Buyback Pr...",98.665301,Cadillac,19788.0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# SellerListSrc

Only two values are null. Those two rows will be the only ones with 0 in every dummy column (no `drop_first`).
Again, `pd.get_dummies` could be a problem if the test set contains the categories in a different order or is missing some of them. Make the dummy columns manually.
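The `pd.get_dummies` concern can be reproduced on a toy example. The sketch below uses invented series (`toy_train`, `toy_test`) and shows one alternative to manual columns: pinning the categories with `pd.CategoricalDtype`. This is an illustration of the issue, not the approach the notebook takes.

```python
import pandas as pd

toy_train = pd.Series(["HomeNet Automotive", "Sell It Yourself"], name="SellerListSrc")
toy_test = pd.Series(["Sell It Yourself", "Sell It Yourself"], name="SellerListSrc")

# get_dummies only creates columns for values present in the data it is given.
train_cols = list(pd.get_dummies(toy_train).columns)
test_cols = list(pd.get_dummies(toy_test).columns)
print(train_cols == test_cols)  # the test frame lacks a 'HomeNet Automotive' column

# Pinning the categories (taken from the training data) fixes the column set and order.
known_sources = ["HomeNet Automotive", "Sell It Yourself"]
source_dtype = pd.CategoricalDtype(categories=known_sources)
test_fixed = pd.get_dummies(toy_test.astype(source_dtype))
print(list(test_fixed.columns))
```

Values outside the pinned categories become missing when cast, so they produce all-zero rows, similar to how `make_specific_dummies` ignores values not in its list.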

In [59]:
sources = traindf['SellerListSrc'].dropna().unique()
sources

array(['Inventory Command Center', 'Cadillac Certified Program',
       'Jeep Certified Program', 'HomeNet Automotive',
       'Digital Motorworks (DMi)', 'My Dealer Center', 'Sell It Yourself',
       'Five Star Certified Program'], dtype=object)

In [61]:
traindf = make_specific_dummies(traindf, 'SellerListSrc', sources)

# SellerName
Treat this like SellerCity and flag the most frequent seller names. Again, somewhat arbitrarily for now, keep only the top 20 sellers.
Some sellers might be associated with low prices for certain models.

In [66]:
sellers = traindf['SellerName'].value_counts(ascending = False)[:20]
sellers = sellers.index