# Mini-Project 1 - Used Cars in the USA - SVM & LR Classification
#### By: David Wei, Sophia Wu, Dhruba Dey, Queena Wang

## Introduction
In this section we continue working with our used car dataset and build classification models using Logistic Regression (LR) and Support Vector Machines (SVM).

In [1]:
#importing libraries and reading in file
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore') #ignoring warnings
import missingno as msno

#plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine.data import economics
from plotnine import ggplot, aes, geom_line

#general sklearn libraries
from scipy.stats import trim_mean, kurtosis
from scipy.stats.mstats import mode, gmean, hmean
import ptitprince as pt
import sklearn.preprocessing as preprocessing
import sklearn.model_selection as cross_validation
import sklearn.linear_model as linear_model
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler #for scaling

#logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.model_selection import cross_val_score

For convenience and clarity, we have exported the tidied and cleaned dataset produced in our initial EDA workbook as an importable file. This allows us to pick up where we left off without cluttering this notebook with all of the prior code. For reference, please see the following [github link](https://github.com/chee154/ml-Py-used_cars/blob/main/Used_Car_Lab_1_DataVisualization.ipynb), which contains all of that work.

In [2]:
df_raw = pd.read_csv(r'E:\Data Files\used_cars_data_cleaned.csv')
print('# of Records: '+str(len(df_raw)))
print('# of Columns: '+str(df_raw.shape[1]))

# of Records: 697989
# of Columns: 18


## Data Tidying

#### Addressing Empty Values
<b>NOTE</b>: First time running the logistic regression model returned an error: 
<br>
*ValueError: Input contains NaN, infinity or a value too large for dtype('float64').*
<br>
Therefore, we will fix this by once again removing any straggling NA records from our dataset.

In [3]:
df_cleaned = df_raw.copy()
df_cleaned = df_cleaned.dropna()
print(len(df_cleaned))
print('# of Records Removed: '+str(len(df_raw)-len(df_cleaned)))

697881
# of Records Removed: 108
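Before dropping rows, it can help to see which columns actually hold the missing values. The following is a minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative assumptions, not taken from our dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_raw; values are made up for illustration.
toy = pd.DataFrame({
    "price": [14639.0, np.nan, 23723.0, 22422.0],
    "mileage": [42394.0, 62251.0, np.nan, 36055.0],
    "year": [2018, 2018, 2018, 2017],
})

# Count missing values per column, keeping only columns that have any.
na_counts = toy.isna().sum()
na_counts = na_counts[na_counts > 0]
print(na_counts)

# Number of rows that dropna() would remove (at least one missing value).
rows_with_na = int(toy.isna().any(axis=1).sum())
print("# of rows with at least one NA:", rows_with_na)
```

The same two calls could be pointed at `df_raw` to confirm where the removed records come from.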


#### Cleaning up final dataframe
After the initial EDA, we will make the appropriate adjustments to our dataset based on our findings. To begin, let's look at the relationship between **'length'** and **'height'**, since the two attributes are very highly correlated.

In [4]:
df_test = df_cleaned.copy()
print(df_test.shape[1])

if df_test['length'].equals(df_test['height']):
    print('they are the same... dropping one')
    df_test = df_test.drop(columns='height')
    print(df_test.shape[1])
else: 
    print('they are not the same')

18
they are the same... dropping one
17
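The pairwise check above can be generalized to find any column that exactly repeats an earlier one. This is a rough sketch on a toy frame (the first three rows of length, height and width from our data); transposing a frame with hundreds of thousands of rows can be slow, so treat it as an illustration rather than something tuned for the full dataset:

```python
import pandas as pd

# Small stand-in frame; 'height' repeats 'length' exactly.
toy = pd.DataFrame({
    "length": [57.6, 55.1, 70.7],
    "height": [57.6, 55.1, 70.7],
    "width": [73.0, 81.5, 78.6],
})

# Transposing turns columns into rows, so duplicated() flags any column
# whose values exactly repeat an earlier column.
duplicate_cols = toy.columns[toy.T.duplicated()].tolist()
print(duplicate_cols)

# Drop the repeated columns, keeping the first occurrence of each.
deduped = toy.drop(columns=duplicate_cols)
print(deduped.columns.tolist())
```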


## Data Transformations for Modeling

In [None]:
df_test['rank']=df_test['price'].rank()

In [None]:
df_test['rank_pct']= df_test['price'].rank(pct=True)

In [None]:
df_test[['price', 'rank', 'rank_pct']].sort_values(by=['rank'])

In [None]:
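def apply_class_func(rank_pct):
    # Map a percentile rank to a quartile class. The cutoffs are contiguous,
    # so values such as 0.495 or 0.745 no longer fall through and return None.
    if rank_pct >= .75:
        return "A"
    elif rank_pct >= .5:
        return "B"
    elif rank_pct >= .25:
        return "C"
    else:
        return "D"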

In [None]:
df_test['price_cat']=df_test['rank_pct'].apply(apply_class_func)

In [None]:
df_test.groupby(['price_cat']).size()

In [None]:
print(df_test[['price', 'rank', 'rank_pct','price_cat']])
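As an alternative to ranking and then mapping percentiles by hand, pandas can assign quartile classes directly. The sketch below uses `pd.qcut` on a made-up price series; the prices are illustrative, and with tied values the bin edges chosen by `qcut` may not match the rank-based thresholds exactly:

```python
import pandas as pd

# Eight made-up prices, already sorted for readability.
prices = pd.Series([4000.0, 9000.0, 14000.0, 19000.0,
                    24000.0, 29000.0, 34000.0, 39000.0])

# qcut splits on the empirical quartiles; labels run from the lowest bin ("D")
# to the highest bin ("A"), matching the A-D scheme used above.
price_cat = pd.qcut(prices, q=4, labels=["D", "C", "B", "A"])
print(price_cat.value_counts().sort_index())
```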

#### Transforming Response Variable from Continuous to Categorical
Since our main interest in this dataset is the 'price' of a vehicle, we will transform our continuous price attribute into a categorical one by grouping all car prices into the following:
* "<5000"          : price < 5000
* "5000-10000"     : 5000 <= price <= 10000
* "10000-15000"    : 10000 < price <= 15000
* "15000-20000"    : 15000 < price <= 20000
* "20000-25000"    : 20000 < price <= 25000
* "25000 and over" : price > 25000

In [5]:
df_test2 = df_cleaned.copy()

In [7]:
price_group = []
for price in df_test2["price"]:
    if price < 5000:
        price_group.append("<5000")
    elif 5000 <= price <= 10000:
        price_group.append("5000-10000")
    elif 10000 < price <= 15000:
        price_group.append("10000-15000")
    elif 15000 < price <= 20000:
        price_group.append("15000-20000")
    elif 20000 < price <= 25000:
        price_group.append("20000-25000")
    else:
        price_group.append("25000 and over")

In [8]:
df_price_group = df_test2.copy()
df_price_group["price_group"] = price_group
del df_price_group["price"]
print(df_price_group['price_group'])

0            10000-15000
1         25000 and over
2            20000-25000
3            20000-25000
4         25000 and over
               ...      
697984       15000-20000
697985       20000-25000
697986       15000-20000
697987    25000 and over
697988       15000-20000
Name: price_group, Length: 697881, dtype: object


In [9]:
df_price_group.groupby(['price_group']).size()

price_group
10000-15000        96001
15000-20000       172368
20000-25000       115254
25000 and over    268595
5000-10000         40482
<5000               5181
dtype: int64
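The same price groups can also be assigned without an explicit Python loop. The following is a minimal vectorized sketch using `np.select`, shown on a handful of made-up boundary prices; it assumes missing prices have already been dropped, since a NaN would otherwise fall into the default group:

```python
import numpy as np
import pandas as pd

# Made-up prices chosen to sit on or near the group boundaries.
prices = pd.Series([4999.0, 5000.0, 10000.0, 10000.5, 20000.0, 25000.0, 25000.01])
p = prices.to_numpy()

# np.select takes the first condition that is True for each element,
# so the upper bounds can be listed in increasing order.
conditions = [p < 5000, p <= 10000, p <= 15000, p <= 20000, p <= 25000]
labels = ["<5000", "5000-10000", "10000-15000", "15000-20000", "20000-25000"]

price_group_vec = pd.Series(
    np.select(conditions, labels, default="25000 and over"),
    index=prices.index,
)
print(price_group_vec.tolist())
```

On roughly 700k rows this kind of vectorized assignment is typically much faster than appending to a list one price at a time.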

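#### Encoding Categorical Variables
Once the data has been imported and cleaned, we will work on transforming our dataset to be more useful for our classification models. To start, we encode all of our categorical (object) and boolean columns as integers. Note that the function below uses scikit-learn's LabelEncoder, which assigns one integer code per category, rather than a true one-hot encoding.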

In [10]:
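def number_encode_features(df):
    # work on a copy of the frame passed in (not the global df_cleaned)
    result = df.copy()
    encoders = {}
    for column in result.columns:
        # np.object / np.bool were removed in NumPy 1.24; use the builtins instead
        if result.dtypes[column] == object or result.dtypes[column] == bool:
            # fit one LabelEncoder per column and replace values with integer codes
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column])
    print('Columns converted: '+str(encoders))
    return result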

Below is a snapshot of the data after the categorical columns have been encoded.
Note that the body type is now represented as an integer rather than as a string (object).

In [11]:
encoded_data = number_encode_features(df_cleaned)
encoded_data

Columns converted: {'body_type': LabelEncoder(), 'frame_damaged': LabelEncoder(), 'has_accidents': LabelEncoder(), 'is_new': LabelEncoder()}


Unnamed: 0,body_type,city_fuel_economy,daysonmarket,engine_displacement,frame_damaged,has_accidents,height,highway_fuel_economy,horsepower,is_new,length,maximum_seating,mileage,owner_count,price,seller_rating,width,year
0,6,27.0,55,1500.0,0,0,57.6,36.0,160.0,0,57.6,5.0,42394.0,1.0,14639.0,3.447761,73.0,2018
1,1,18.0,36,3500.0,0,0,55.1,24.0,311.0,0,55.1,4.0,62251.0,1.0,32000.0,2.800000,81.5,2018
2,5,18.0,27,3600.0,0,0,70.7,27.0,310.0,0,70.7,8.0,36410.0,1.0,23723.0,3.447761,78.6,2018
3,5,15.0,27,3600.0,0,1,69.9,22.0,281.0,0,69.9,8.0,36055.0,1.0,22422.0,3.447761,78.5,2017
4,5,18.0,24,3600.0,0,0,69.3,25.0,295.0,0,69.3,5.0,25745.0,1.0,29424.0,3.447761,84.8,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697984,5,26.0,32,1400.0,0,0,66.0,31.0,138.0,0,66.0,5.0,7444.0,1.0,17836.0,4.533333,69.9,2019
697985,5,26.0,17,2500.0,0,0,66.4,32.0,170.0,0,66.4,5.0,20160.0,1.0,20700.0,4.333333,80.0,2017
697986,6,26.0,17,2500.0,0,0,57.9,37.0,179.0,0,57.9,5.0,62138.0,1.0,17700.0,4.333333,72.0,2018
697987,4,18.0,89,3500.0,0,0,70.6,23.0,278.0,0,70.6,5.0,20009.0,1.0,40993.0,5.000000,75.2,2017
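For comparison, here is a rough sketch of what a true one-hot encoding of a categorical column would look like using `pd.get_dummies`. The frame and the category names ("Sedan", "Coupe", "SUV") are hypothetical stand-ins, not values read from our dataset:

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "body_type": ["Sedan", "Coupe", "SUV", "SUV"],
    "horsepower": [160.0, 311.0, 310.0, 281.0],
})

# One-hot encoding creates one 0/1 indicator column per category,
# so no artificial ordering is implied between body types.
one_hot = pd.get_dummies(toy, columns=["body_type"], dtype=int)
print(one_hot.columns.tolist())
print(one_hot)
```

Integer label codes imply an ordering (for example, 6 > 1) that a linear model will treat as meaningful, which is one reason one-hot encoding is often preferred for nominal attributes in logistic regression and SVMs.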


Referencing our transformed variable 'price_group' as an int back to its original string value for future analysis and interpretation.

In [None]:
print(df_final.groupby(['price_group']).size())
print('-----------------------------------------')
print(df_price_group.groupby(['price_group']).size())

Now we are ready for some model building!

In [12]:
df_final = encoded_data.copy()
print('# of Records: '+str(len(df_final)))
print('# of Columns: '+str(df_final.shape[1]))
print()
print(df_final.dtypes)

# of Records: 697881
# of Columns: 18

body_type                 int32
city_fuel_economy       float64
daysonmarket              int64
engine_displacement     float64
frame_damaged             int64
has_accidents             int64
height                  float64
highway_fuel_economy    float64
horsepower              float64
is_new                    int64
length                  float64
maximum_seating         float64
mileage                 float64
owner_count             float64
price                   float64
seller_rating           float64
width                   float64
year                      int64
dtype: object


## Training and Testing Split
Once our dataset is ready for modeling, we move on to splitting our data. We use a 70:30 split, which leaves roughly 488k records in the training set and the remainder (209k records) in the test set. We repeat this random split 3 times using ShuffleSplit with a seed of 42, because it (42) is the answer to the ultimate question of life, the universe, and everything.
<br><br>
Our response variable is intended to be price, more specifically the price group ('price_group') a car falls in. Note that the cell below, as written, uses 'has_accidents' as the class label.

In [20]:
if 'has_accidents' in df_final:
    y = df_final['has_accidents'].values # get the labels we want
    del df_final['has_accidents'] # get rid of the class label
    X = df_final.values # use everything else to predict!
print(y)
print(X)

num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,random_state=42, test_size= 0.3)     
print(cv_object)

[0 0 0 ... 0 0 0]
[[6.00000000e+00 2.70000000e+01 5.50000000e+01 ... 3.44776119e+00
  7.30000000e+01 2.01800000e+03]
 [1.00000000e+00 1.80000000e+01 3.60000000e+01 ... 2.80000000e+00
  8.15000000e+01 2.01800000e+03]
 [5.00000000e+00 1.80000000e+01 2.70000000e+01 ... 3.44776119e+00
  7.86000000e+01 2.01800000e+03]
 ...
 [6.00000000e+00 2.60000000e+01 1.70000000e+01 ... 4.33333333e+00
  7.20000000e+01 2.01800000e+03]
 [4.00000000e+00 1.80000000e+01 8.90000000e+01 ... 5.00000000e+00
  7.52000000e+01 2.01700000e+03]
 [5.00000000e+00 2.60000000e+01 1.70000000e+01 ... 4.33333333e+00
  7.24000000e+01 2.01700000e+03]]
ShuffleSplit(n_splits=3, random_state=42, test_size=0.3, train_size=None)
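To make the behavior of `ShuffleSplit` concrete, the sketch below runs the same settings (3 splits, 30% test, seed 42) on a toy array of 100 rows. The array `X_toy` is an assumption for illustration only:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 100 toy rows standing in for the real feature matrix.
X_toy = np.arange(100).reshape(-1, 1)

cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
test_sets = []
for i, (train_idx, test_idx) in enumerate(cv.split(X_toy)):
    print(f"split {i}: train={len(train_idx)}, test={len(test_idx)}")
    test_sets.append(set(test_idx.tolist()))

# Unlike K-fold, each test set is drawn independently, so they may overlap.
overlap = test_sets[0] & test_sets[1]
print("rows shared by the first two test sets:", len(overlap))
```

Each iteration is an independent random 70:30 split, so a row can appear in more than one test set, which is a difference from K-fold cross-validation, where the test folds are disjoint.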


## Logistic Regression

In [22]:
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None, solver='liblinear' ) 

In [23]:
iter_num=0

for train_indices, test_indices in cv_object.split(X,y): 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # train the reusable logistic regression model on the training data
    lr_clf.fit(X_train,y_train)  # train object
    y_hat = lr_clf.predict(X_test) # get test set predictions

    # now let's get the accuracy and confusion matrix for this iteration of training/testing
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    iter_num+=1

====Iteration 0  ====
accuracy 0.8788479449764764
confusion matrix
 [[183886    184]
 [ 25181    114]]
====Iteration 1  ====
accuracy 0.8775201203639577
confusion matrix
 [[183600    206]
 [ 25437    122]]
====Iteration 2  ====
accuracy 0.8775153440164306
confusion matrix
 [[183611    202]
 [ 25442    110]]
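Because the two classes are very unbalanced, accuracy alone can be misleading. As a small worked example, the sketch below takes the confusion matrix printed for iteration 0 and compares the model's accuracy with the accuracy of always predicting the majority class:

```python
import numpy as np

# Confusion matrix from iteration 0 above: rows = true class, columns = predicted.
conf = np.array([[183886, 184],
                 [25181, 114]])

tn, fp, fn, tp = conf.ravel()
total = conf.sum()

accuracy = (tn + tp) / total
majority_baseline = (tn + fp) / total   # accuracy of always predicting class 0
recall_pos = tp / (tp + fn)             # share of class-1 rows the model finds

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"recall (class 1):  {recall_pos:.4f}")
```

The accuracy (about 0.8788) is marginally below the majority-class baseline (about 0.8792), and recall for class 1 is below 1%, so the model is predicting class 0 for nearly every record.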


In [24]:
print('Count of Training Set: '+str(len(train_indices)))
print('Count of Test Set: '+ str(len(test_indices)))

Count of Training Set: 488516
Count of Test Set: 209365


In [None]:
# here we can change some of the parameters interactively
from ipywidgets import widgets as wd

def lr_explor(cost):
    lr_clf = LogisticRegression(penalty='l2', C=cost, class_weight=None,solver='liblinear') # get object
    accuracies = cross_val_score(lr_clf,X,y=y,cv=cv_object) # this also can help with parallelism
    print(accuracies)

wd.interact(lr_explor,cost=(0.001,5.0,0.05),__manual=True)
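If the interactive widget is not available, the same exploration can be done with a plain loop over a few values of `C`. This is a hedged sketch on synthetic data from `make_classification` (the data, the grid of `C` values, and the names `X_toy`/`y_toy` are assumptions, not our dataset or tuned settings); the penalty is left at scikit-learn's default, which is L2:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in data so the example runs on its own.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=42)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)

# Evaluate a small grid of regularization strengths (smaller C = stronger penalty).
results = {}
for cost in [0.001, 0.01, 0.1, 1.0, 5.0]:
    clf = LogisticRegression(C=cost, solver="liblinear")
    scores = cross_val_score(clf, X_toy, y_toy, cv=cv)
    results[cost] = scores.mean()
    print(f"C={cost:<6} mean accuracy={scores.mean():.3f}")
```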

### LR - Analysis & Interpretations

**Weights *before* normalization:**
<br>
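Before normalizing our dataset, the raw weights are all very small (the largest, **'year'**, is only about -0.001 in magnitude) and are not directly comparable, because the attributes are on very different scales. For example, **'mileage'** (about 7.9e-06) and **'frame_damaged'** (about 3.0e-06) both have tiny raw weights. **'is_new'** has a weight very close to zero (about -1.4e-07), which can be interpreted as this variable having little effect; this makes sense since prior analysis (lab 1) showed us that 99% of our data is used cars. Keep in mind that the model above was fit with 'has_accidents' as the label, so these weights describe the relationship to accident history rather than to price group.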

In [25]:
weights = lr_clf.coef_.T # take transpose to make a column vector
variable_names = df_final.columns
for coef, name in zip(weights,variable_names):
    print(name, 'has weight of', coef[0])

body_type has weight of -5.221811619310908e-07
city_fuel_economy has weight of -3.0408958132218724e-05
daysonmarket has weight of 0.0003073801827651254
engine_displacement has weight of 4.282246898186513e-05
frame_damaged has weight of 2.9894079244211207e-06
height has weight of -0.00011599391734123565
highway_fuel_economy has weight of -7.198702431440321e-06
horsepower has weight of 0.000651767010988677
is_new has weight of -1.350836635398382e-07
length has weight of -0.00011599391734123565
maximum_seating has weight of -9.229635995469008e-06
mileage has weight of 7.938259602998602e-06
owner_count has weight of 3.3831162980863724e-05
price has weight of -3.2277191379042465e-05
seller_rating has weight of -1.2858786938423712e-05
width has weight of -8.766885179295116e-05
year has weight of -0.0009708500382388764


**Weights *after* normalization:**
<br>
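Once the attributes have been standardized, the weights become comparable across variables. As shown below, **'price'** has the largest weight in magnitude (about -0.40), followed by **'mileage'** (about 0.21) and **'horsepower'** (about 0.16).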

In [26]:
# scale attributes by the training set
scl_obj = StandardScaler()
scl_obj.fit(X_train)

X_train_scaled = scl_obj.transform(X_train) # apply to training
X_test_scaled = scl_obj.transform(X_test) # apply those means and std to the test set (without snooping at the test set values)

# train the model just as before
lr_clf = LogisticRegression(penalty='l2', C=0.05, solver='liblinear') # get object, the 'C' value is less (can you guess why??)
lr_clf.fit(X_train_scaled,y_train)  # train object

y_hat = lr_clf.predict(X_test_scaled) # get test set predictions

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf )

accuracy: 0.8776299763570797
[[183418    395]
 [ 25225    327]]
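The scaling step above is fit on a single training split. One way to apply the same idea inside every cross-validation split, without fitting the scaler on test rows, is to wrap the scaler and the classifier in a pipeline. The following is a minimal sketch on synthetic data (the data and the names `X_toy`/`y_toy` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs on its own.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=42)

# The pipeline refits StandardScaler on each training split only,
# then applies those means and standard deviations to the matching test split.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.05, solver="liblinear"))
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv)
print(scores)
```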


In [28]:
# sort these attributes and spit them out
zip_vars = zip(lr_clf.coef_.T,df_final.columns) # combine attributes
zip_vars = sorted(zip_vars) 

for coef, name in zip_vars:
    print(name, 'has weight of', coef[0]) # now print them out

price has weight of -0.402318886608593
year has weight of -0.10059202119938179
city_fuel_economy has weight of -0.08709642395116673
seller_rating has weight of -0.062478778628759626
width has weight of -0.05783510801209675
engine_displacement has weight of -0.04383185665717114
is_new has weight of -0.03939158028027939
daysonmarket has weight of 0.009779825039265097
body_type has weight of 0.014761435600783945
height has weight of 0.015202113070669501
length has weight of 0.015202113070669501
maximum_seating has weight of 0.01706636631521288
frame_damaged has weight of 0.08502806586538414
owner_count has weight of 0.10268263727046947
highway_fuel_economy has weight of 0.12472064979266269
horsepower has weight of 0.15883455698304672
mileage has weight of 0.20513927741025922
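Sorting by signed value puts the strongest negative and strongest positive weights at opposite ends of the list. To rank attributes by strength regardless of sign, the weights can be sorted by absolute value instead. The sketch below does this for a few of the scaled weights printed above (rounded to four decimals):

```python
import pandas as pd

# A subset of the scaled weights printed above, rounded to 4 decimals.
weights = pd.Series({
    "price": -0.4023,
    "mileage": 0.2051,
    "horsepower": 0.1588,
    "year": -0.1006,
    "daysonmarket": 0.0098,
})

# Order by absolute value so strong negative and positive weights rank together,
# while keeping the original signs in the output.
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)
```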


In [29]:
print(lr_clf.coef_.T)

[[ 0.01476144]
 [-0.08709642]
 [ 0.00977983]
 [-0.04383186]
 [ 0.08502807]
 [ 0.01520211]
 [ 0.12472065]
 [ 0.15883456]
 [-0.03939158]
 [ 0.01520211]
 [ 0.01706637]
 [ 0.20513928]
 [ 0.10268264]
 [-0.40231889]
 [-0.06247878]
 [-0.05783511]
 [-0.10059202]]


In [None]:
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
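The link above points to scikit-learn's `SelectKBest`. As a rough illustration of how it could be used for feature selection, the sketch below keeps the 5 highest-scoring features from synthetic data. The data, the choice of `f_classif` as the scoring function, and `k=5` are all assumptions for the example, not settings chosen for our dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 10 features, of which 3 are informative.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=42
)

# Score each feature with an ANOVA F-test against the label and keep the top 5.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_toy, y_toy)

print(X_selected.shape)
print("selected column indices:", selector.get_support(indices=True))
```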

## Support Vector Machines