# Lab 2 - Vehicles with over 100K Kilometers.

    Scott Gozdzialski
    Adam Baca
    Zoheb Allam
    Ethan Graham
    
    The data can be found https://www.kaggle.com/mirosval/personal-cars-classifieds

##  Data Preparation Part 1 

There are roughly 3,5 Million rows and the following columns:

- maker - The manufacturer of the vehicle
- model - The distinct model of the vehicle
- mileage - in KM (our Response variable)
- manufacture_year
- engine_displacement - in cc
- engine_power - in kW
- body_type - Coupe, van, sedan, etc.
- color_slug - main color of the vehicle
- stk_year - year of the last emission control
- transmission - automatic or manual
- door_count
- seat_count
- fuel_type - gasoline, diesel, cng, lpg, electric
- date_created - when the ad was scraped
- date_last_seen - when the ad was last seen. Our policy was to remove all ads older than 60 days
- price_eur - list price converted to EUR

The first step is to download the data.

In [64]:
#Import the file of 3.5 Million records we will parse it down to 81000 usable records
import pandas as pd
import numpy as np
from __future__ import print_function
path = "~\\Desktop\\Cars.csv"

df = pd.read_csv(path,sep = ",")

First we are going to have to clean the data.  As can be seen below most of the data is object type wich will not work for our classification models.

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3552912 entries, 0 to 3552911
Data columns (total 16 columns):
maker                  object
model                  object
mileage                float64
manufacture_year       float64
engine_displacement    float64
engine_power           float64
body_type              object
color_slug             object
stk_year               object
transmission           object
door_count             object
seat_count             object
fuel_type              object
date_created           object
date_last_seen         object
price_eur              float64
dtypes: float64(5), object(11)
memory usage: 433.7+ MB


First, we change the date the ad was created and the date it was removed to a interger of the number of days the ad ran. then we drop the columns we will not be using. Stk-year is very close to model year, model takes up to much memory seperate and is unworkable.  Finally we will drop all the rows with NAs.  With 3.5 rows we have plenty to use after removing them.

In [66]:
#Convert the date varibles into a delta between and type int
df.date_created = pd.to_datetime(df['date_created'])
df.date_last_seen = pd.to_datetime(df['date_last_seen'])
df['total_days'] = df['date_last_seen'] - df['date_created']
df.total_days = df['total_days'].dt.days.astype(int)

df.drop(['stk_year','model','date_created','date_last_seen'], axis=1, inplace=True)

df = df.dropna()

Next, we convert door count and seat count to ints, and remove eronious information.  There are no vehicles with a 10cc engine.

In [67]:
df.door_count = df.door_count.replace('None','0')
df.door_count = df.door_count.astype(int)
df.seat_count = df.door_count.replace('None','0')
df.seat_count = df.door_count.astype(int)

df = df.sort_values('engine_displacement', ascending=False)
df = df[:82088]

df = df.sort_values('engine_power', ascending=False)
df = df[:81500]

Then, we OneHotEncode maker, body type,color slug, and fuel type. We turn Transmision to binary (1,0) variable. Finally we remove the columns that we OneHotEncoded.  We also make a back up dataframe incase we make a mistake we can go backto this point without rerunning everything above.

In [68]:
tmp_df = pd.get_dummies(df.maker,prefix='Maker')
df = pd.concat((df,tmp_df),axis=1) # add back into the dataframe

tmp_df = pd.get_dummies(df.body_type,prefix='Body type')
df = pd.concat((df,tmp_df),axis=1) # add back into the dataframe

tmp_df = pd.get_dummies(df.color_slug,prefix='Color')
df = pd.concat((df,tmp_df),axis=1) # add back into the dataframe

tmp_df = pd.get_dummies(df.fuel_type,prefix='Fuel')
df = pd.concat((df,tmp_df),axis=1) # add back into the dataframe

df['manual'] = df.transmission=='man' 
df.manual = df.manual.astype(np.int)

df.drop(['body_type','color_slug','fuel_type','maker','transmission'], axis= 1, inplace = True)

df_backup = df

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81500 entries, 3215366 to 3153910
Data columns (total 83 columns):
mileage                   81500 non-null float64
manufacture_year          81500 non-null float64
engine_displacement       81500 non-null float64
engine_power              81500 non-null float64
door_count                81500 non-null int32
seat_count                81500 non-null int32
price_eur                 81500 non-null float64
total_days                81500 non-null int32
Maker_alfa-romeo          81500 non-null uint8
Maker_aston-martin        81500 non-null uint8
Maker_audi                81500 non-null uint8
Maker_bentley             81500 non-null uint8
Maker_bmw                 81500 non-null uint8
Maker_chevrolet           81500 non-null uint8
Maker_chrysler            81500 non-null uint8
Maker_citroen             81500 non-null uint8
Maker_dacia               81500 non-null uint8
Maker_dodge               81500 non-null uint8
Maker_fiat                8

Changing mileage to a binary of milage over 100K Kilometers.

In [70]:
df['mileage_100K'] = df['mileage'] > 100000

# we want to predict the X and y data as follows:

y = df['mileage_100K'].values # get the labels we want
del df['mileage_100K'] 
del df['mileage']# get rid of the class label
X = df.values # use everything else to predict!

## Data Preparation part 2 - Final dataset.
The final dataset constists of 81500 records with 82 columns.

First we downloaded our dataset of car sales in the Czech Republic and Germany.  Most of it downloads as obect type so we changed door count and seat count to intergers.  

We calculated the difference between when the advertisment started and was dropped and created a new variable of total days the ad ran.    

We dropped the stk-year and model, stk-year is not needed since it is very similar to model year.  Model has to many classifications in the rows for us to be able to seperate it out.  When we tried we ran out of memory.

Speaking of seperation the classifactions out, we spereated out Maker, Body type, Color, and Fuel type with one hot endoing. 

We also, dropped any row with a NA value.  This was done for two reasons, first it removed useless rows that will mess with our classifiaction models, second since we started with 3.5M records dropping the rows with NAs left us with 81500 rows of usable data.  The entire 3.5M records would eat up the resources of our machines and 81500 records should be a large enough sampleset to properly capture the nature of the data.

## Model and Evaluation 1 - Evaluation Metric.


**********Needs to be written better****

I think we should focus on accuarcy.  We are not worried if ther are false positives or false negatives.  F-measurements could be used, but we will use somethign like that .

## Model and Evaluation 2 - dividing data


****Needs to be written up
I say we do 10 fold cross-validation

In [71]:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=10)

## Model and Evaluation 3 - Model selection

** Needs write up
I say we do K nearest neighbors, random forest, and Logistic regression

### Random Forest

In [72]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=50,random_state=1)
iter_num = 1
rf_acc=acc

for train_indices, test_indices in cv.split(X,y): 
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    clf.fit(X_train,y_train) 
    y_hat = clf.predict(X_test)
    rf_acc[inter_num] = mt.accuracy_score(y_test,y_hat)
    print("Random Forest ", iter_num," accuracy", rf_acc[inter_num] )
    iter_num+=1
    
print ("Average accuracy = ", rf_acc.mean()*100, "+-", rf_acc.std()*100)

Random Forest  1  accuracy 0.428045638572
Random Forest  2  accuracy 0.855109802478
Random Forest  3  accuracy 0.87314439946
Random Forest  4  accuracy 0.84713532082
Random Forest  5  accuracy 0.847116564417
Random Forest  6  accuracy 0.794601226994
Random Forest  7  accuracy 0.772610136213
Random Forest  8  accuracy 0.744263099767
Random Forest  9  accuracy 0.801816173764
Random Forest  10  accuracy 0.746717388637
Average accuracy =  80.2910878674 +- 4.72213663787


### K Nearest Neighbors

In [73]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics as mt

clf = KNeighborsClassifier(n_neighbors=3)
iter_num = 1
knn_acc=acc

for train_indices, test_indices in cv.split(X,y): 
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    clf.fit(X_train,y_train) 
    y_hat = clf.predict(X_test)
    knn_acc[inter_num] = mt.accuracy_score(y_test,y_hat)
    print("K Nearest Neighbors ", iter_num," accuracy", knn_acc[inter_num] )
    iter_num+=1
    
print ("Average accuracy = ", knn_acc.mean()*100, "+-", knn_acc.std()*100)

K Nearest Neighbors  1  accuracy 0.521899153478
K Nearest Neighbors  2  accuracy 0.714513556619
K Nearest Neighbors  3  accuracy 0.770580296896
K Nearest Neighbors  4  accuracy 0.719175561281
K Nearest Neighbors  5  accuracy 0.694969325153
K Nearest Neighbors  6  accuracy 0.699386503067
K Nearest Neighbors  7  accuracy 0.688428027979
K Nearest Neighbors  8  accuracy 0.684501165787
K Nearest Neighbors  9  accuracy 0.7229107866
K Nearest Neighbors  10  accuracy 0.686219167996
Average accuracy =  79.686105661 +- 5.69138102529


### Logistic Regression

In [79]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression() # get object
iter_num = 1
lr_acc=acc

for train_indices, test_indices in cv.split(X,y): 
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    clf.fit(X_train,y_train) 
    y_hat = clf.predict(X_test)
    lr_acc[inter_num] = mt.accuracy_score(y_test,y_hat)
    print("K Nearest Neighbors ", iter_num," accuracy", lr_acc[inter_num] )
    iter_num+=1
    
print ("Average accuracy = ", lr_acc.mean()*100, "+-", lr_acc.std()*100)

K Nearest Neighbors  1  accuracy 0.707888602625
K Nearest Neighbors  2  accuracy 0.730585204269
K Nearest Neighbors  3  accuracy 0.752913752914
K Nearest Neighbors  4  accuracy 0.847258005153
K Nearest Neighbors  5  accuracy 0.796687116564
K Nearest Neighbors  6  accuracy 0.76736196319
K Nearest Neighbors  7  accuracy 0.77248742177
K Nearest Neighbors  8  accuracy 0.793348877163
K Nearest Neighbors  9  accuracy 0.788931157197
K Nearest Neighbors  10  accuracy 0.664744140385
Average accuracy =  76.2220624123 +- 4.87592120242


## Modeling and Evaluation 4 - Analyze Results

Need write up and visualizations

## Modeling and Evaluation 5 - Model advantages

Need write up


## Modeling and Evaluation 6 - Important attributes

## Deployment