# Mini-Project 1 - Used Cars in the USA - SVM & LR Classification
#### By: David Wei, Sophia Wu, Dhruba Dey, Queena Wang

## Introduction
In this section we will continue using our used car dataset and build a classification model using Logistic Regression (LR) and Support Vector Machines (SVM).

In [1]:
#importing libraries and reading in file
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore') #ignoring warnings
import missingno as msno

#plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine.data import economics
from plotnine import ggplot, aes, geom_line

#general sklearn libraries
from scipy.stats import trim_mean, kurtosis
from scipy.stats.mstats import mode, gmean, hmean
import ptitprince as pt
import sklearn.preprocessing as preprocessing
import sklearn.model_selection as cross_validation
import sklearn.linear_model as linear_model
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit

#logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

For convenience and clarity, we have exported the result of all the data tidying and cleaning from our initial EDA workbook as an importable file. This allows us to pick up where we left off without cluttering this notebook with all of the prior code. For reference, see the following [github link](https://github.com/chee154/ml-Py-used_cars/blob/main/Used_Car_Lab_1_DataVisualization.ipynb), which contains all of that work.

In [2]:
df_raw = pd.read_csv(r'E:\Data Files\used_cars_data_cleaned.csv')
print('# of Records: '+str(len(df_raw)))
print('# of Columns: '+str(df_raw.shape[1]))

# of Records: 697989
# of Columns: 18


## Data Tidying

#### Data Cleanup
<b>NOTE</b>: The first time we ran the logistic regression model, it returned an error:
<br>
*ValueError: Input contains NaN, infinity or a value too large for dtype('float64').*
<br>
Therefore, we will fix this by once again removing any straggling NA records from our dataset.

In [3]:
df_cleaned = df_raw.copy()
df_cleaned = df_cleaned.dropna()
print(len(df_cleaned))
print('# of Records Removed: '+str(len(df_raw)-len(df_cleaned)))

697881
# of Records Removed: 108
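
Before dropping rows, it can help to see which columns the missing values come from. The snippet below is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not rows from our dataset); the same two calls could be applied to `df_raw`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the used-car data (values are made up for illustration)
toy = pd.DataFrame({
    "mileage": [42394.0, np.nan, 36410.0],
    "horsepower": [160.0, 311.0, np.nan],
    "year": [2018, 2018, 2017],
})

# Count missing values per column to see where the NaNs come from
na_per_column = toy.isna().sum()
print(na_per_column)

# Drop every row that contains at least one NaN, as in the cell above
toy_cleaned = toy.dropna()
print("# of Records Removed:", len(toy) - len(toy_cleaned))
```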


#### Transforming Response Variable from Continuous to Categorical
Since our main interest in this dataset is the 'price' of a vehicle, we will transform our continuous price attribute into a categorical one by grouping all car prices into the following:
* "<5000"          : price < 5000
* "5000-10000"     : 5000 <= price <= 10000
* "10000-15000"    : 10000 < price <= 15000
* "15000-20000"    : 15000 < price <= 20000
* "20000-25000"    : 20000 < price <= 25000
* "25000 and over" : price > 25000

In [4]:
price_group = []
for price in df_cleaned["price"]:
    if price < 5000:
        price_group.append("<5000")
    elif 5000 <= price <= 10000:
        price_group.append("5000-10000")
    elif 10000 < price <= 15000:
        price_group.append("10000-15000")
    elif 15000 < price <= 20000:
        price_group.append("15000-20000")
    elif 20000 < price <= 25000:
        price_group.append("20000-25000")
    else:
        price_group.append("25000 and over")

In [5]:
df_price_group = df_cleaned.copy()
df_price_group["price_group"] = price_group
del df_price_group["price"]
print(df_price_group['price_group'])

0            10000-15000
1         25000 and over
2            20000-25000
3            20000-25000
4         25000 and over
               ...      
697984       15000-20000
697985       20000-25000
697986       15000-20000
697987    25000 and over
697988       15000-20000
Name: price_group, Length: 697881, dtype: object
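
The row-by-row loop above works, but on roughly 700k records a vectorized version may be faster. The following is a sketch of one possible alternative using `np.select`, which picks the first matching condition and so mirrors the `if`/`elif` chain, including the inclusive lower bound at 5000. The `prices` series here is a made-up example rather than our actual `price` column.

```python
import numpy as np
import pandas as pd

# Made-up prices chosen to sit on and around the bin boundaries
prices = pd.Series([4999.99, 5000, 10000, 10000.01, 15000, 20000, 25000, 25000.01])

# Conditions are evaluated in order; the first True one wins, like if/elif
conditions = [
    prices < 5000,
    prices <= 10000,
    prices <= 15000,
    prices <= 20000,
    prices <= 25000,
]
labels = ["<5000", "5000-10000", "10000-15000", "15000-20000", "20000-25000"]

# Anything matching no condition falls into the last group, like the else branch
price_group_vec = np.select(conditions, labels, default="25000 and over")
print(list(price_group_vec))
```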


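#### Label Encoding
Once the data has been imported and cleaned, we will work on transforming our dataset to be more useful for our classification models. To start, we will encode all of our categorical (object) and boolean columns as numbers. Note that the function below uses `LabelEncoder`, which assigns one integer code per category rather than creating one-hot indicator columns.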

In [8]:
def number_encode_features(df_price_group):
    result = df_price_group.copy()
    encoders = {}
    for column in result.columns:
        # np.object and np.bool were removed in NumPy 1.24; the builtins are equivalent here
        if result.dtypes[column] == object or result.dtypes[column] == bool:
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column])
    print('Columns converted: '+str(encoders))
    return result

The output below shows a snapshot of the final data after the categorical columns have been encoded.
Note that the body type is now represented numerically, rather than as the string (object) type it had before encoding.

In [9]:
encoded_data = number_encode_features(df_price_group)
encoded_data

Columns converted: {'body_type': LabelEncoder(), 'frame_damaged': LabelEncoder(), 'has_accidents': LabelEncoder(), 'is_new': LabelEncoder(), 'price_group': LabelEncoder()}


Unnamed: 0,body_type,city_fuel_economy,daysonmarket,engine_displacement,frame_damaged,has_accidents,height,highway_fuel_economy,horsepower,is_new,length,maximum_seating,mileage,owner_count,seller_rating,width,year,price_group
0,6,27.0,55,1500.0,0,0,57.6,36.0,160.0,0,57.6,5.0,42394.0,1.0,3.447761,73.0,2018,0
1,1,18.0,36,3500.0,0,0,55.1,24.0,311.0,0,55.1,4.0,62251.0,1.0,2.800000,81.5,2018,3
2,5,18.0,27,3600.0,0,0,70.7,27.0,310.0,0,70.7,8.0,36410.0,1.0,3.447761,78.6,2018,2
3,5,15.0,27,3600.0,0,1,69.9,22.0,281.0,0,69.9,8.0,36055.0,1.0,3.447761,78.5,2017,2
4,5,18.0,24,3600.0,0,0,69.3,25.0,295.0,0,69.3,5.0,25745.0,1.0,3.447761,84.8,2018,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697984,5,26.0,32,1400.0,0,0,66.0,31.0,138.0,0,66.0,5.0,7444.0,1.0,4.533333,69.9,2019,1
697985,5,26.0,17,2500.0,0,0,66.4,32.0,170.0,0,66.4,5.0,20160.0,1.0,4.333333,80.0,2017,2
697986,6,26.0,17,2500.0,0,0,57.9,37.0,179.0,0,57.9,5.0,62138.0,1.0,4.333333,72.0,2018,1
697987,4,18.0,89,3500.0,0,0,70.6,23.0,278.0,0,70.6,5.0,20009.0,1.0,5.000000,75.2,2017,3
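
The encoding above gives `body_type` a single integer code per category, which a linear model may read as an ordering between body types. For comparison, here is a minimal sketch of what a true one-hot encoding of that column could look like using `pd.get_dummies`. The `toy` frame and its body-type strings are made up for illustration and are not taken from our dataset.

```python
import pandas as pd

# Toy frame; the body-type strings are made up for illustration
toy = pd.DataFrame({
    "body_type": ["Sedan", "SUV", "Coupe", "Sedan"],
    "horsepower": [160.0, 310.0, 281.0, 138.0],
})

# One 0/1 indicator column per category instead of a single integer code
one_hot = pd.get_dummies(toy, columns=["body_type"], dtype=int)
print(one_hot)
```

Whether this helps the models here is an open question; it also widens the feature matrix by one column per category.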


Cross-referencing the integer codes of our transformed variable 'price_group' against their original string values, for future analysis and interpretation:

In [41]:
print(df_final.groupby(['price_group']).size())
print('-----------------------------------------')
print(df_price_group.groupby(['price_group']).size())

price_group
0     96001
1    172368
2    115254
3    268595
4     40482
5      5181
dtype: int64
-----------------------------------------
price_group
10000-15000        96001
15000-20000       172368
20000-25000       115254
25000 and over    268595
5000-10000         40482
<5000               5181
dtype: int64
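
The matching counts above imply the code-to-label mapping, which follows from `LabelEncoder` sorting its classes as strings. As a sketch, the mapping can also be read directly from a fitted encoder's `classes_` attribute. This example fits a fresh encoder on the six group names, since `number_encode_features` does not return its encoders; the names `encoder` and `code_to_label` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# The six price-group labels defined earlier, in arbitrary order
price_labels = [
    "<5000", "5000-10000", "10000-15000",
    "15000-20000", "20000-25000", "25000 and over",
]

encoder = LabelEncoder()
encoder.fit(price_labels)

# classes_ holds the labels in sorted (string) order; the position is the integer code
code_to_label = {code: str(label) for code, label in enumerate(encoder.classes_)}
for code, label in code_to_label.items():
    print(code, "->", label)
```

Because the sort is lexicographic, the codes do not follow price order: "5000-10000" and "<5000" receive codes 4 and 5.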


Now we are ready for some model building!

In [26]:
df_final = encoded_data.copy()
print('# of Records: '+str(len(df_final)))
print('# of Columns: '+str(df_final.shape[1]))
print()
print(df_final.dtypes)

# of Records: 697881
# of Columns: 18

body_type                 int32
city_fuel_economy       float64
daysonmarket              int64
engine_displacement     float64
frame_damaged             int64
has_accidents             int64
height                  float64
highway_fuel_economy    float64
horsepower              float64
is_new                    int64
length                  float64
maximum_seating         float64
mileage                 float64
owner_count             float64
seller_rating           float64
width                   float64
year                      int64
price_group               int32
dtype: object


## Training and Testing Split
Once our dataset is ready for modeling, we will move on to splitting up our data. We will use a 70:30 split, which leaves our training set with roughly 489k records and our test set with the remainder (roughly 209k records). We will then run 3 iterations of shuffle-split cross-validation, where each iteration draws a fresh random 70:30 split (unlike k-fold, the test sets can overlap between iterations), with a seed of 42 because it (42) is the answer to the ultimate question of life, the universe, and everything.
<br><br>
Our response variable will be price, more specifically the price group ('price_group') a car falls in.

In [11]:
if 'price_group' in df_final:
    y = df_final['price_group'].values # get the labels we want
    del df_final['price_group'] # get rid of the class label
    X = df_final.values # use everything else to predict!
print(y)
print()

num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,random_state=42, test_size= 0.3)     
print(cv_object)

[0 3 2 ... 1 3 1]

ShuffleSplit(n_splits=3, random_state=42, test_size=0.3, train_size=None)
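
To make the behaviour of `ShuffleSplit` concrete, the sketch below runs the same splitter settings over a placeholder array with as many rows as our cleaned dataset and records the size of each train/test partition. The `X_dummy` array is an assumption standing in for the real feature matrix; only its length matters here.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_records = 697881                   # row count of the cleaned dataset
X_dummy = np.zeros((n_records, 1))   # placeholder features; only the length matters

splitter = ShuffleSplit(n_splits=3, random_state=42, test_size=0.3)

# Each iteration draws an independent random permutation of the row indices
split_sizes = []
for train_idx, test_idx in splitter.split(X_dummy):
    split_sizes.append((len(train_idx), len(test_idx)))
    print("train:", len(train_idx), "test:", len(test_idx))
```

Since each iteration reshuffles independently, a record can land in the test set more than once across the three iterations.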


## Logistic Regression

In [12]:
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None, solver='liblinear' ) 

In [13]:
iter_num=0

for train_indices, test_indices in cv_object.split(X,y): 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # train the reusable logistic regression model on the training data
    lr_clf.fit(X_train,y_train)  # train object
    y_hat = lr_clf.predict(X_test) # get test set predictions

    # now let's get the accuracy and confusion matrix for this iteration of training/testing
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    iter_num+=1

====Iteration 0  ====
accuracy 0.5940104602010843
confusion matrix
 [[ 8282 17791     4  1384  1359     0]
 [ 2907 37266     1 11376   278     1]
 [  544 17526     7 16295    70     2]
 [  163  6378     2 74023    14     3]
 [ 4853  2435     0    69  4786     0]
 [   74    59     0     2  1410     1]]
====Iteration 1  ====
accuracy 0.5931268359085807
confusion matrix
 [[ 8003 17858     2  1482  1220     2]
 [ 2751 37299     1 11422   272     0]
 [  490 17589     1 16732    59     2]
 [  180  6272     1 74079    19     2]
 [ 4884  2305     0    81  4793     7]
 [   82    32     0     2  1436     5]]
====Iteration 2  ====
accuracy 0.5915076540969121
confusion matrix
 [[ 7979 18043     9  1418  1275     3]
 [ 2804 37257    24 11597   260     2]
 [  542 17366    11 16641    77     4]
 [  210  6342     1 73901    21     2]
 [ 4790  2460     0    76  4692     5]
 [   67    66     0     0  1419     1]]


### LR - Analysis & Interpretations
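
One way to start reading the confusion matrices is per-class recall: the diagonal entry divided by its row total, since scikit-learn puts true labels on rows and predictions on columns. The sketch below recomputes accuracy and recall from the iteration 0 matrix printed above. The class names assume the sorted-string order that `LabelEncoder` uses, which agrees with the group counts shown earlier.

```python
import numpy as np

# Confusion matrix from iteration 0 (rows = true class, columns = predicted class)
conf = np.array([
    [8282, 17791, 4, 1384, 1359, 0],
    [2907, 37266, 1, 11376, 278, 1],
    [544, 17526, 7, 16295, 70, 2],
    [163, 6378, 2, 74023, 14, 3],
    [4853, 2435, 0, 69, 4786, 0],
    [74, 59, 0, 2, 1410, 1],
])

# Integer codes 0..5 in LabelEncoder's sorted-string order
class_names = [
    "10000-15000", "15000-20000", "20000-25000",
    "25000 and over", "5000-10000", "<5000",
]

# Overall accuracy is the share of test records on the diagonal
accuracy = np.trace(conf) / conf.sum()
print("accuracy:", accuracy)

# Recall per class: correctly predicted records / all records truly in that class
recall = np.diag(conf) / conf.sum(axis=1)
for name, value in zip(class_names, recall):
    print(f"{name:>15}: recall = {value:.4f}")
```

In this iteration, recall is about 92% for "25000 and over" but well under 1% for "20000-25000" and "<5000", so the overall accuracy of about 59% is carried mostly by the two largest classes.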

## Support Vector Machines