# Lab 2 - Task 1 Vehicles with over 100K Kilometers/Task 2 Body Type.
IN this lab we will be using a dataset found on Kaggle.com about vehcile advertisements in the Czech republic and Germany.  For task 1 we will be focusing on predicting if a vehicle has over 100K kilometers. For task 2 we will be focusing on predicting the different body types of the vehicles.

    Scott Gozdzialski
    Adam Baca
    Zoheb Allam
    Ethan Graham
    
    The data can be found https://www.kaggle.com/mirosval/personal-cars-classifieds

##  Data Preparation Part 1 

There are roughly 3.5 Million rows and the following columns:

- maker - The manufacturer of the vehicle
- model - The distinct model of the vehicle
- mileage - in KM (our Response variable)
- manufacture_year
- engine_displacement - in cc
- engine_power - in kW
- body_type - Coupe, van, sedan, etc.
- color_slug - main color of the vehicle
- stk_year - year of the last emission control
- transmission - automatic or manual
- door_count
- seat_count
- fuel_type - gasoline, diesel, cng, lpg, electric
- date_created - when the ad was scraped
- date_last_seen - when the ad was last seen. Our policy was to remove all ads older than 60 days
- price_eur - list price converted to EUR

The first step is to download the data.

In [98]:
#Import the file of 3.5 Million records we will parse it down to 81000 usable records
import pandas as pd
import numpy as np
from __future__ import print_function
path = "~\\Desktop\\Cars.csv"

df = pd.read_csv(path,sep = ",")

First we are going to have to clean the data.  As can be seen below most of the data is object type wich will not work for our classification models.

In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3552912 entries, 0 to 3552911
Data columns (total 16 columns):
maker                  object
model                  object
mileage                float64
manufacture_year       float64
engine_displacement    float64
engine_power           float64
body_type              object
color_slug             object
stk_year               object
transmission           object
door_count             object
seat_count             object
fuel_type              object
date_created           object
date_last_seen         object
price_eur              float64
dtypes: float64(5), object(11)
memory usage: 433.7+ MB


First, we change the date the ad was created and the date it was removed to a interger of the number of days the ad ran. then we drop the columns we will not be using. Stk-year is very close to model year, model takes up to much memory seperate and is unworkable.  Finally we will drop all the rows with NAs.  With 3.5 rows we have plenty to use after removing them.

In [100]:
#Convert the date varibles into a delta between and type int
df.date_created = pd.to_datetime(df['date_created'])
df.date_last_seen = pd.to_datetime(df['date_last_seen'])
df['total_days'] = df['date_last_seen'] - df['date_created']
df.total_days = df['total_days'].dt.days.astype(int)

df.drop(['stk_year','model','date_created','date_last_seen'], axis=1, inplace=True)

df = df.dropna()

Next, we convert door count and seat count to ints, and remove eronious information.  There are no vehicles with a 10cc engine.

In [101]:
df.door_count = df.door_count.replace('None','0')
df.door_count = df.door_count.astype(int)
df.seat_count = df.door_count.replace('None','0')
df.seat_count = df.door_count.astype(int)

df = df.sort_values('engine_displacement', ascending=False)
df = df[:82088]

df = df.sort_values('engine_power', ascending=False)
df = df[:81500]

#### Task 1
This is where the dataframe for task one and task two deviate from each other.  Task one will one hot encode body type to predict milage.

Then, we OneHotEncode maker, body type,color slug, and fuel type. We turn Transmision to binary (1,0) variable. Finally we remove the columns that we OneHotEncoded.  We also make a back up dataframe incase we make a mistake we can go backto this point without rerunning everything above.

In [102]:
df2 = df

tmp_df = pd.get_dummies(df.maker,prefix='Maker')
df = pd.concat((df,tmp_df),axis=1) # add back into the dataframe

tmp_df = pd.get_dummies(df.body_type,prefix='Body type')
df = pd.concat((df,tmp_df),axis=1) # add back into the dataframe

tmp_df = pd.get_dummies(df.color_slug,prefix='Color')
df = pd.concat((df,tmp_df),axis=1) # add back into the dataframe

tmp_df = pd.get_dummies(df.fuel_type,prefix='Fuel')
df = pd.concat((df,tmp_df),axis=1) # add back into the dataframe

df['manual'] = df.transmission=='man' 
df.manual = df.manual.astype(np.int)

df.drop(['body_type','color_slug','fuel_type','maker','transmission'], axis= 1, inplace = True)

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81500 entries, 3215366 to 3153910
Data columns (total 83 columns):
mileage                   81500 non-null float64
manufacture_year          81500 non-null float64
engine_displacement       81500 non-null float64
engine_power              81500 non-null float64
door_count                81500 non-null int32
seat_count                81500 non-null int32
price_eur                 81500 non-null float64
total_days                81500 non-null int32
Maker_alfa-romeo          81500 non-null uint8
Maker_aston-martin        81500 non-null uint8
Maker_audi                81500 non-null uint8
Maker_bentley             81500 non-null uint8
Maker_bmw                 81500 non-null uint8
Maker_chevrolet           81500 non-null uint8
Maker_chrysler            81500 non-null uint8
Maker_citroen             81500 non-null uint8
Maker_dacia               81500 non-null uint8
Maker_dodge               81500 non-null uint8
Maker_fiat                8

Changing mileage to a binary of milage over 100K Kilometers.

In [70]:
df['mileage_100K'] = df['mileage'] > 100000

# we want to predict the X and y data as follows:

y = df['mileage_100K'].values # get the labels we want
del df['mileage_100K'] 
del df['mileage']# get rid of the class label
X = df.values # use everything else to predict!

####  Task 2
For task 2 we will be predicting the body type and leaving mileage as a consistant number.

In [103]:
tmp_df = pd.get_dummies(df2.maker,prefix='Maker')
df2 = pd.concat((df2,tmp_df),axis=1) # add back into the dataframe

tmp_df = pd.get_dummies(df2.color_slug,prefix='Color')
df = pd.concat((df2,tmp_df),axis=1) # add back into the dataframe

tmp_df = pd.get_dummies(df2.fuel_type,prefix='Fuel')
df = pd.concat((df2,tmp_df),axis=1) # add back into the dataframe

df2['manual'] = df2.transmission=='man' 
df2.manual = df2.manual.astype(np.int)

df2.drop(['color_slug','fuel_type','maker','transmission'], axis= 1, inplace = True)
    

In [104]:
y2 = df2['body_type'].values # get the labels we want
del df2['body_type'] 
X2 = df2.values

In [81]:
#from sklearn.feature_selection import VarianceThreshold

#dftest = df

#selector = VarianceThreshold(.05)
#selector.fit_transform(dftest)
#idx = selector.get_support()
#idx2 = idx.index()
#idx2
#dftest2 = dftest[:,idx2.values]

## Data Preparation part 2 - Final dataset.
The final dataset constists of 81500 records with 82 columns.

First we downloaded our dataset of car sales in the Czech Republic and Germany.  Most of it downloads as obect type so we changed door count and seat count to intergers.  

We calculated the difference between when the advertisment started and was dropped and created a new variable of total days the ad ran.    

We dropped the stk-year and model, stk-year is not needed since it is very similar to model year.  Model has to many classifications in the rows for us to be able to seperate it out.  When we tried we ran out of memory.

#### Task 1

Speaking of seperation the classifactions out, we spereated out Maker, Body type, Color, and Fuel type with one hot endoing. 

#### Task 2

For the seperations in task 2, we seperated out Maker, Color, and Fuel Type with one hot encoding.  We left body type in the dataframe because this will now be our response variable.  Body type has 9 different classifaction we will be focusing on predicting.

#### Both tasks
We also, dropped any row with a NA value.  This was done for two reasons, first it removed useless rows that will mess with our classifiaction models, second since we started with 3.5M records dropping the rows with NAs left us with 81500 rows of usable data.  The entire 3.5M records would eat up the resources of our machines and 81500 records should be a large enough sampleset to properly capture the nature of the data.

## Model and Evaluation 1 - Evaluation Metric.

To evaluate our different classification methods, we will be examining the accuracy with which they can predict vehicles with over 100k miles (task 1) and vehicle body type (task 2). The accuracy of each method will tell us the percentage of our sample correctly classified (PCC), in otherwords the percent of true positives and true negatives. PCC is the most commonly used metric to assess overall model accuracy and is calculated without taking into account what kind of errors are made, meaning each error has the same weight. Since our models aim to predict features in cars and not something related to health care like cancer, we are not concerned about the different impacts the false positives and false negatives may have on our sample and therefore can be treated equally.        

## Model and Evaluation 2 - dividing data

We will be using 10 fold cross validation in order to divide our data into training and testing splits. Cross validation is when you divide a sample of data into subsets and then perform the analysis on a training subset and validate those results with the testing subset. Cross validation allows you to determine if the results of the model will generalize to an independent dataset and also can limit issues like overfitting. 

With 10 fold cross validation, the cross validation process is repeated 10 times with each of the subsamples being used only once for validation. The main advantages to repeating this process 10 times is that all observations are used for both training and validation and therefore we do not lose sample size which can affect modeling capabilities. 

In order to aggregate the results from the 10 fold cross validation we will take the PCC for each model and average it together to form an overall accuracy measure. This will enable us to get a more accurate estimate of each models performance.  

In [8]:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=10)

## Model and Evaluation 3 - Model selection

** Needs write up
I say we do K nearest neighbors, random forest, and Logistic regression

### Random Forest - task 1 (mileage)

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics as mt

clf = RandomForestClassifier(n_estimators=50,random_state=1)
iter_num = 0
rf_acc = np.zeros(10)

for train_indices, test_indices in cv.split(X,y): 
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    clf.fit(X_train,y_train) 
    rf_y_hat = clf.predict(X_test)
    rf_acc[iter_num] = mt.accuracy_score(y_test,rf_y_hat)
    print("Random Forest ", iter_num," accuracy", rf_acc[iter_num])
    iter_num+=1
    
    
print ("Average accuracy = ", rf_acc.mean()*100, "+-", rf_acc.std()*100)

Random Forest  0  accuracy 0.428045638572
Random Forest  1  accuracy 0.855109802478
Random Forest  2  accuracy 0.87314439946
Random Forest  3  accuracy 0.84713532082
Random Forest  4  accuracy 0.847116564417
Random Forest  5  accuracy 0.794601226994
Random Forest  6  accuracy 0.772610136213
Random Forest  7  accuracy 0.744263099767
Random Forest  8  accuracy 0.801816173764
Random Forest  9  accuracy 0.746717388637
Average accuracy =  77.1055975112 +- 12.2282325962


### Random Forest - task 2 (Body type)

In [115]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics as mt

clf = RandomForestClassifier(n_estimators=50,random_state=1)
iter_num = 0
rf_acc2 = np.zeros(10)

for train_indices, test_indices in cv.split(X,y): 
    X2_train = X2[train_indices]
    y2_train = y2[train_indices]
    X2_test = X2[test_indices]
    y2_test = y2[test_indices]
    clf.fit(X2_train,y2_train) 
    rf_y_hat2 = clf.predict(X2_test)
    rf_acc2[iter_num] = mt.accuracy_score(y2_test,rf_y_hat2)
    print("Random Forest ", iter_num," accuracy", rf_acc2[iter_num])
    iter_num+=1
    
    
print ("Average accuracy = ", rf_acc2.mean()*100, "+-", rf_acc2.std()*100)

Random Forest  0  accuracy 0.587289903079
Random Forest  1  accuracy 0.600785179733
Random Forest  2  accuracy 0.596736596737
Random Forest  3  accuracy 0.595632437738
Random Forest  4  accuracy 0.56282208589
Random Forest  5  accuracy 0.576196319018
Random Forest  6  accuracy 0.571481163333
Random Forest  7  accuracy 0.581298318812
Random Forest  8  accuracy 0.656644987115
Random Forest  9  accuracy 0.692232175727
Average accuracy =  60.2111916718 +- 3.86970345172


### K Nearest Neighbors - Task 1 (Mileage)

In [24]:
from sklearn.neighbors import KNeighborsClassifier


clf = KNeighborsClassifier(n_neighbors=3)
iter_num = 0
knn_acc= np.zeros(10)

for train_indices, test_indices in cv.split(X,y): 
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    clf.fit(X_train,y_train) 
    knn_y_hat = clf.predict(X_test)
    knn_acc[iter_num] = mt.accuracy_score(y_test,knn_y_hat)
    print("K Nearest Neighbors ", iter_num," accuracy", knn_acc[iter_num] )
    iter_num+=1
    
print ("Average accuracy = ", knn_acc.mean()*100, "+-", knn_acc.std()*100)

K Nearest Neighbors  0  accuracy 0.521899153478
K Nearest Neighbors  1  accuracy 0.714513556619
K Nearest Neighbors  2  accuracy 0.770580296896
K Nearest Neighbors  3  accuracy 0.719175561281
K Nearest Neighbors  4  accuracy 0.694969325153
K Nearest Neighbors  5  accuracy 0.699386503067
K Nearest Neighbors  6  accuracy 0.688428027979
K Nearest Neighbors  7  accuracy 0.684501165787
K Nearest Neighbors  8  accuracy 0.7229107866
K Nearest Neighbors  9  accuracy 0.686219167996
Average accuracy =  69.0258354486 +- 6.11926978358


### K Nearest Neightbors Task 2 ( Body Type)

We will test 1,3,5, and 7 neighbors

Staring with 1

In [111]:
from sklearn.neighbors import KNeighborsClassifier


clf = KNeighborsClassifier(n_neighbors=1)
iter_num = 0
knn_acc2= np.zeros(10)

for train_indices, test_indices in cv.split(X,y): 
    X2_train = X2[train_indices]
    y2_train = y2[train_indices]
    X2_test = X2[test_indices]
    y2_test = y2[test_indices]
    clf.fit(X2_train,y2_train) 
    knn_y_hat2 = clf.predict(X2_test)
    knn_acc2[iter_num] = mt.accuracy_score(y2_test,knn_y_hat2)
    print("K Nearest Neighbors ", iter_num," accuracy", knn_acc2[iter_num] )
    iter_num+=1
    
print ("Average accuracy = ", knn_acc2.mean()*100, "+-", knn_acc2.std()*100)

K Nearest Neighbors  0  accuracy 0.267574530732
K Nearest Neighbors  1  accuracy 0.302416881364
K Nearest Neighbors  2  accuracy 0.324745430009
K Nearest Neighbors  3  accuracy 0.312722365354
K Nearest Neighbors  4  accuracy 0.306871165644
K Nearest Neighbors  5  accuracy 0.350552147239
K Nearest Neighbors  6  accuracy 0.351454166155
K Nearest Neighbors  7  accuracy 0.395385936925
K Nearest Neighbors  8  accuracy 0.430114124432
K Nearest Neighbors  9  accuracy 0.45465701313
Average accuracy =  34.9649376099 +- 5.69017219092


In [112]:
from sklearn.neighbors import KNeighborsClassifier


clf = KNeighborsClassifier(n_neighbors=3)
iter_num = 0
knn_acc2= np.zeros(10)

for train_indices, test_indices in cv.split(X,y): 
    X2_train = X2[train_indices]
    y2_train = y2[train_indices]
    X2_test = X2[test_indices]
    y2_test = y2[test_indices]
    clf.fit(X2_train,y2_train) 
    knn_y_hat2 = clf.predict(X2_test)
    knn_acc2[iter_num] = mt.accuracy_score(y2_test,knn_y_hat2)
    print("K Nearest Neighbors ", iter_num," accuracy", knn_acc2[iter_num] )
    iter_num+=1
    
print ("Average accuracy = ", knn_acc2.mean()*100, "+-", knn_acc2.std()*100)

K Nearest Neighbors  0  accuracy 0.299472457367
K Nearest Neighbors  1  accuracy 0.302784934364
K Nearest Neighbors  2  accuracy 0.32327321801
K Nearest Neighbors  3  accuracy 0.311372837689
K Nearest Neighbors  4  accuracy 0.305153374233
K Nearest Neighbors  5  accuracy 0.366748466258
K Nearest Neighbors  6  accuracy 0.374033623758
K Nearest Neighbors  7  accuracy 0.435636274389
K Nearest Neighbors  8  accuracy 0.476868327402
K Nearest Neighbors  9  accuracy 0.528899251442
Average accuracy =  37.2424276491 +- 7.762964895


In [113]:
from sklearn.neighbors import KNeighborsClassifier


clf = KNeighborsClassifier(n_neighbors=5)
iter_num = 0
knn_acc2= np.zeros(10)

for train_indices, test_indices in cv.split(X,y): 
    X2_train = X2[train_indices]
    y2_train = y2[train_indices]
    X2_test = X2[test_indices]
    y2_test = y2[test_indices]
    clf.fit(X2_train,y2_train) 
    knn_y_hat2 = clf.predict(X2_test)
    knn_acc2[iter_num] = mt.accuracy_score(y2_test,knn_y_hat2)
    print("K Nearest Neighbors ", iter_num," accuracy", knn_acc2[iter_num] )
    iter_num+=1
    
print ("Average accuracy = ", knn_acc2.mean()*100, "+-", knn_acc2.std()*100)

K Nearest Neighbors  0  accuracy 0.31713900135
K Nearest Neighbors  1  accuracy 0.33701386333
K Nearest Neighbors  2  accuracy 0.353944301313
K Nearest Neighbors  3  accuracy 0.352717457981
K Nearest Neighbors  4  accuracy 0.322085889571
K Nearest Neighbors  5  accuracy 0.391901840491
K Nearest Neighbors  6  accuracy 0.398944655786
K Nearest Neighbors  7  accuracy 0.463737881949
K Nearest Neighbors  8  accuracy 0.494784636152
K Nearest Neighbors  9  accuracy 0.545465701313
Average accuracy =  39.7773522923 +- 7.44879925335


In [114]:
from sklearn.neighbors import KNeighborsClassifier


clf = KNeighborsClassifier(n_neighbors=7)
iter_num = 0
knn_acc2= np.zeros(10)

for train_indices, test_indices in cv.split(X,y): 
    X2_train = X2[train_indices]
    y2_train = y2[train_indices]
    X2_test = X2[test_indices]
    y2_test = y2[test_indices]
    clf.fit(X2_train,y2_train) 
    knn_y_hat2 = clf.predict(X2_test)
    knn_acc2[iter_num] = mt.accuracy_score(y2_test,knn_y_hat2)
    print("K Nearest Neighbors ", iter_num," accuracy", knn_acc2[iter_num] )
    iter_num+=1
    
print ("Average accuracy = ", knn_acc2.mean()*100, "+-", knn_acc2.std()*100)

K Nearest Neighbors  0  accuracy 0.313458471353
K Nearest Neighbors  1  accuracy 0.355293828978
K Nearest Neighbors  2  accuracy 0.372346951294
K Nearest Neighbors  3  accuracy 0.369402527297
K Nearest Neighbors  4  accuracy 0.331901840491
K Nearest Neighbors  5  accuracy 0.394969325153
K Nearest Neighbors  6  accuracy 0.416492821205
K Nearest Neighbors  7  accuracy 0.472327892993
K Nearest Neighbors  8  accuracy 0.509019511597
K Nearest Neighbors  9  accuracy 0.55061970794
Average accuracy =  40.858328783 +- 7.42898382347


### Logistic Regression - Task 1 only (MIleage)

In [26]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression() # get object
iter_num = 0
lr_acc= np.zeros(10)

for train_indices, test_indices in cv.split(X,y): 
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    clf.fit(X_train,y_train) 
    lr_y_hat = clf.predict(X_test)
    lr_acc[iter_num] = mt.accuracy_score(y_test,lr_y_hat)
    print("Logistic regression ", iter_num," accuracy", lr_acc[iter_num] )
    iter_num+=1
    
print ("Average accuracy = ", lr_acc.mean()*100, "+-", lr_acc.std()*100)

Logistic regression  0  accuracy 0.707888602625
Logistic regression  1  accuracy 0.730585204269
Logistic regression  2  accuracy 0.752913752914
Logistic regression  3  accuracy 0.847258005153
Logistic regression  4  accuracy 0.796687116564
Logistic regression  5  accuracy 0.76736196319
Logistic regression  6  accuracy 0.77248742177
Logistic regression  7  accuracy 0.793348877163
Logistic regression  8  accuracy 0.788931157197
Logistic regression  9  accuracy 0.664744140385
Average accuracy =  76.2220624123 +- 4.87592120242


### Naive Bayes - task 2 only (Body type)

In [118]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB


clf_mnb = MultinomialNB(alpha=1)
y_hat_mnb2 = clf_mnb.fit(X2_train, y2_train).predict(X2_test)
acc = mt.accuracy_score(y2_test,y_hat_mnb2)
print("The Multinomial NB accuracy at alpha=1 is ", acc)

clf_bnb = BernoulliNB(alpha=1, binarize=0.0)
y_hat_bnb2 = clf_bnb.fit(X2_train, y2_train).predict(X2_test)
acc = mt.accuracy_score(y2_test,y_hat_bnb2)
print("The Bernoulli NB accuracy at alpha=1 is ", acc)

clf_mnb = MultinomialNB(alpha=0.5)
y_hat_mnb2 = clf_mnb.fit(X2_train, y2_train).predict(X2_test)
acc = mt.accuracy_score(y2_test,y_hat_mnb2)
print("The Multinomial NB accuracy at alpha=0.5 is ", acc)

clf_bnb = BernoulliNB(alpha=0.5, binarize=0.0)
y_hat_bnb2 = clf_bnb.fit(X2_train, y2_train).predict(X2_test)
acc = mt.accuracy_score(y2_test,y_hat_bnb2)
print("The Bernoulli NB accuracy at alpha=0.5 is ", acc)

clf_mnb = MultinomialNB(alpha=0.1)
y_hat_mnb2 = clf_mnb.fit(X2_train, y2_train).predict(X2_test)
acc = mt.accuracy_score(y2_test,y_hat_mnb2)
print("The Multinomial NB accuracy at alpha=0.1 is ", acc)

clf_bnb = BernoulliNB(alpha=0.1, binarize=0.0)
y_hat_bnb2 = clf_bnb.fit(X2_train, y2_train).predict(X2_test)
acc = mt.accuracy_score(y2_test,y_hat_bnb2)
print("The Bernoulli NB accuracy at alpha=0.1 is ", acc)


The Multinomial NB accuracy at alpha=1 is  0.346300159529
The Bernoulli NB accuracy at alpha=1 is  0.301263958768
The Multinomial NB accuracy at alpha=0.5 is  0.346300159529
The Bernoulli NB accuracy at alpha=0.5 is  0.301263958768
The Multinomial NB accuracy at alpha=0.1 is  0.346300159529
The Bernoulli NB accuracy at alpha=0.1 is  0.301263958768


## Modeling and Evaluation 4 - Analyze Results

Need write up and visualizations

## Modeling and Evaluation 5 - Model advantages

Need write up


## Modeling and Evaluation 6 - Important attributes

## Deployment