<a href="https://colab.research.google.com/github/cmadding/MSDS_7333_QTW/blob/master/Case_Study_Unit_15.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Case Study Unit 15  

Allen Ansari, Chris Ballenger, Shantanu Godbole, Chad Madding

DS 7333 Quantifying the World

August 15, 2020

#### Introduction
In this unit, we use an ensemble of classification methods to find the best accuracy and ROC AUC score. A grid search helps tune the random forest and remove low-performing features so that only the most useful ones remain. We focus on the F1-score, accuracy, recall, precision, and ROC AUC score to find the best prediction rate while balancing true positives against false negatives.
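As a reminder of how the metrics named above relate to each other, here is a minimal sketch that computes accuracy, precision, recall, and the F1-score from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and chosen only for illustration; they are not results from this dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

Lowering the number of false negatives raises recall, often at the cost of more false positives and therefore lower precision; the F1-score summarizes that trade-off in one number.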

#### Methods
In this project, we first explore the data, convert categorical data where needed, and replace missing values.
After exploration and cleaning, we identify the strongest predictors and remove the weaker features. We then run a grid search to tune the random forest hyperparameters. Finally, we design an ensemble model that aims for the highest ROC AUC score, accuracy, precision, recall, and F1-score.
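The missing-value step could look something like the sketch below. It assumes a simple strategy (median for numeric columns, most frequent value for categorical columns), which is one common choice and not necessarily the one used later in this notebook. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "x0": [1.0, np.nan, 3.0, 4.0],
    "x24": ["asia", None, "asia", "america"],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Assumed strategy: median for numeric columns, mode for categorical columns
filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```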

In [1]:
#Connect Google Drive to Colab
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [2]:
# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# visualize missing values
import missingno as msno

# model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# linear classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# non-linear classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# ensemble learners
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier

# metrics
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

# model persistence
from pickle import dump
from pickle import load

# ignore deprecated warnings
import warnings
warnings.filterwarnings('ignore')



In [3]:
# read in the dataset
data = pd.read_csv('/content/drive/My Drive/Colab/Data/final_project.csv')
data.shape

(160000, 51)

The dataset contains 51 variables and 160,000 rows of data.

We can look at the first few rows of data to see the makeup of the dataset.

In [4]:
#the first five rows of data
data.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,y
0,-0.166563,-3.961588,4.621113,2.481908,-1.800135,0.804684,6.718751,-14.789997,-1.040673,-4.20495,6.187465,13.251523,25.665413,-5.017267,10.503714,-2.517678,2.11791,5.865923,-6.666158,1.791497,-1.909114,-1.73794,-2.516715,3.553013,euorpe,-0.80134,1.14295,1.005131,-18.473784,July,tuesday,-3.851669,0.0%,-1.940031,-5.492063,0.627121,-0.873824,$1313.96,-1.353729,-5.186148,-10.6122,-1.497117,5.414063,-2.325655,1.674827,-0.264332,60.781427,-7.689696,0.151589,-8.040166,0
1,-0.149894,-0.585676,27.839856,4.152333,6.426802,-2.426943,40.477058,-6.725709,0.896421,0.330165,-11.708859,-2.352809,-25.014934,9.799608,-10.960705,1.504,-2.397836,-9.301839,-1.999413,5.045258,-5.809984,10.814319,-0.478112,10.590601,asia,0.818792,-0.642987,0.751086,3.749377,Aug,wednesday,1.391594,-0.02%,2.211462,-4.460591,1.035461,0.22827,$1962.78,32.816804,-5.150012,2.147427,36.29279,4.490915,0.762561,6.526662,1.007927,15.805696,-4.896678,-0.320283,16.719974,0
2,-0.321707,-1.429819,12.251561,6.586874,-5.304647,-11.31109,17.81285,11.060572,5.32588,-2.632984,1.572647,-4.170771,12.078602,-5.158498,7.30278,-2.192431,-4.065428,-7.675055,4.041629,-6.633628,1.700321,-2.419221,2.467521,-5.270615,asia,-0.718315,-0.566757,4.171088,11.522448,July,wednesday,-3.262082,-0.01%,0.419607,-3.804056,-0.763357,-1.612561,$430.47,-0.333199,8.728585,-0.863137,-0.368491,9.088864,-0.689886,-2.731118,0.7542,30.856417,-7.428573,-2.090804,-7.869421,0
3,-0.245594,5.076677,-24.149632,3.637307,6.505811,2.290224,-35.111751,-18.913592,-0.337041,-5.568076,-2.000255,-19.286668,10.99533,-5.914378,2.5114,1.292362,-2.496882,-15.722954,-2.735382,1.117536,1.92367,-14.179167,1.470625,-11.484431,asia,-0.05243,-0.558582,9.215569,30.595226,July,wednesday,-2.285241,0.01%,-3.442715,4.42016,1.164532,3.033455,$-2366.29,14.188669,-6.38506,12.084421,15.691546,-7.467775,2.940789,-6.424112,0.419776,-72.424569,5.361375,1.80607,-7.670847,0
4,-0.273366,0.306326,-11.352593,1.676758,2.928441,-0.616824,-16.505817,27.532281,1.199715,-4.309105,6.66753,1.965913,-28.106348,-1.25895,5.759941,0.472584,-1.150097,-14.118709,4.527964,-1.284372,-9.026317,-7.039818,-1.978748,-15.998166,asia,-0.223449,0.350781,1.811182,-4.094084,July,tuesday,0.921047,0.01%,-0.43164,12.165494,-0.167726,-0.341604,$-620.66,-12.578926,1.133798,30.004727,-13.911297,-5.229937,1.783928,3.957801,-0.096988,-14.085435,-0.208351,-0.894942,15.724742,1


In [5]:
print("Table 1: Basic Statistical Details")
data.describe()

Table 1: Basic Statistical Details


Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x25,x26,x27,x28,x31,x33,x34,x35,x36,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,y
count,159974.0,159975.0,159962.0,159963.0,159974.0,159963.0,159974.0,159973.0,159979.0,159970.0,159957.0,159970.0,159964.0,159969.0,159966.0,159965.0,159974.0,159973.0,159960.0,159965.0,159962.0,159971.0,159973.0,159953.0,159978.0,159964.0,159970.0,159965.0,159961.0,159959.0,159959.0,159970.0,159973.0,159969.0,159977.0,159964.0,159960.0,159974.0,159963.0,159960.0,159971.0,159969.0,159963.0,159968.0,159968.0,160000.0
mean,-0.001028,0.001358,-1.150145,-0.024637,-0.000549,0.013582,-1.67067,-7.692795,-0.03054,0.005462,0.002253,0.030232,-1.334402,0.007669,0.008104,0.001215,0.006223,0.01204,0.012694,0.024555,0.299074,-0.029137,0.0084,0.722028,-0.000806,-0.001066,-0.004159,0.031543,-0.005945,-0.006567,-0.000426,0.000936,0.006453,6.05913,0.004253,-2.316526,6.701076,-1.83382,-0.002091,-0.00625,0.000885,-12.755395,0.028622,-0.000224,-0.674224,0.401231
std,0.371137,6.340632,13.27348,8.065032,6.382293,7.670076,19.298665,30.542264,8.901185,6.35504,7.871429,8.769633,14.75099,8.953837,6.964097,3.271779,4.984065,7.569351,4.540714,7.595316,5.806203,9.409635,5.41201,14.909127,1.263656,0.843258,6.774047,14.439534,2.767508,1.747762,8.01418,2.379558,1.593183,16.891603,5.134322,17.043549,18.680196,5.110705,1.534952,4.164595,0.396621,36.608641,4.788157,1.935501,15.036738,0.490149
min,-1.592635,-26.278302,-59.394048,-35.476594,-28.467536,-33.822988,-86.354483,-181.506976,-37.691045,-27.980659,-36.306571,-38.092869,-64.197967,-38.723514,-30.905214,-17.002359,-26.042983,-34.395898,-20.198686,-35.633396,-26.677396,-43.501854,-23.644193,-66.640341,-6.364653,-3.857484,-32.003555,-72.896705,-12.289364,-7.451454,-36.116606,-10.008149,-6.866024,-74.297559,-22.101647,-74.059196,-82.167224,-27.93375,-6.876234,-17.983487,-1.753221,-201.826828,-21.086333,-8.490155,-65.791191,0.0
25%,-0.251641,-4.260973,-10.166536,-5.454438,-4.313118,-5.14813,-14.780146,-27.324771,-6.031058,-4.260619,-5.288196,-5.903274,-11.379492,-6.029945,-4.696755,-2.207774,-3.344027,-5.07147,-3.056131,-5.101553,-3.607789,-6.361115,-3.649766,-9.268532,-0.852784,-0.567293,-4.597919,-9.702464,-1.874206,-1.183681,-5.401084,-1.610337,-1.068337,-5.249882,-3.458716,-13.953629,-5.80408,-5.162869,-1.039677,-2.812055,-0.266518,-36.428329,-3.216016,-1.3208,-10.931753,0.0
50%,-0.002047,0.004813,-1.340932,-0.031408,0.000857,0.014118,-1.948594,-6.956789,-0.01684,0.006045,-0.018176,0.010941,-1.624439,-0.003473,0.002467,0.003535,0.012754,0.024541,0.015904,0.044703,0.433055,-0.026385,0.011144,1.029609,-0.003723,-0.001501,0.037138,0.24421,0.002013,-0.006079,-0.013089,-0.002399,0.003645,6.18441,0.019068,-2.701867,6.84011,-1.923754,-0.004385,-0.010484,0.001645,-12.982497,0.035865,-0.011993,-0.57441,0.0
75%,0.248532,4.28422,7.871676,5.445179,4.30666,5.190749,11.446931,12.217071,5.972349,4.305734,5.331573,5.935032,8.374524,6.041959,4.701299,2.21166,3.366853,5.101962,3.073002,5.164732,4.306566,6.316457,3.672678,11.028035,0.851765,0.567406,4.649773,9.936995,1.856369,1.17946,5.411667,1.603089,1.079895,17.420148,3.463308,8.981616,19.266367,1.453507,1.033275,2.783274,0.269049,11.445443,3.268028,1.317703,9.651072,1.0
max,1.600849,27.988178,63.545653,38.906025,26.247812,35.55011,92.390605,149.150634,39.049831,27.377842,37.945583,36.360443,73.279354,42.392177,32.54634,13.782559,21.961123,37.057048,19.652986,33.51555,27.81456,46.237503,24.863012,58.4905,5.314169,3.951652,28.645074,67.753845,12.279356,7.78712,34.841428,9.892426,6.999544,90.467981,21.545591,88.824477,100.050432,22.668041,6.680922,19.069759,1.669205,150.859415,20.836854,8.226552,66.877604,1.0


In [6]:
print("Table 2: Categorical Details")
data.describe(include=['object'])

Table 2: Categorical Details


Unnamed: 0,x24,x29,x30,x32,x37
count,159972,159970,159970,159969,159977
unique,3,12,5,12,129198
top,asia,July,wednesday,0.01%,$-415.46
freq,138965,45569,101535,40767,6


There are a few categorical variables that will need encoding. x32 and x37 do not appear to be encoded correctly: x32 has a % sign and x37 is a dollar amount. We can remove the $ sign and the % sign, then convert both columns to float values.

In [7]:
# x32 contains a % sign; remove it and convert to float
def PerSign(var):
    var = var.str.replace('%', "")
    return var.astype("float")

data['x32'] = PerSign(data['x32'])
data.x32.head()

0    0.00
1   -0.02
2   -0.01
3    0.01
4    0.01
Name: x32, dtype: float64

In [8]:
# x37 contains a $ sign; remove it and convert to float
def DollarSign(var):
    # regex=False treats '$' as a literal character, not the regex end-of-string anchor
    var = var.str.replace('$', "", regex=False)
    return var.astype("float")

data['x37'] = DollarSign(data['x37'])
data.x37.head()

0    1313.96
1    1962.78
2     430.47
3   -2366.29
4    -620.66
Name: x37, dtype: float64
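The two helpers above differ only in the symbol they strip, so they could be folded into one function. The sketch below is one possible way to do that; `strip_symbols_to_float` is a hypothetical name, and the toy series reuse a few values from the output above. It passes `regex=False` so that `$` is treated as a literal character, and uses `pd.to_numeric(..., errors="coerce")` so that missing or unparseable entries become NaN instead of raising an error.

```python
import pandas as pd


def strip_symbols_to_float(series, symbols=("$", "%")):
    """Remove literal symbols from a string Series and convert it to float."""
    cleaned = series
    for symbol in symbols:
        # regex=False: match the symbol literally
        cleaned = cleaned.str.replace(symbol, "", regex=False)
    # Missing or unparseable entries become NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Toy series for illustration
dollars = pd.Series(["$1313.96", "$-2366.29", None])
percents = pd.Series(["0.0%", "-0.02%", "0.01%"])

dollars_clean = strip_symbols_to_float(dollars)
percents_clean = strip_symbols_to_float(percents)
print(dollars_clean)
print(percents_clean)
```

On the real data this would be called as `data['x37'] = strip_symbols_to_float(data['x37'])`, and likewise for `x32`.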

In [12]:
# divide the variables based on their types
objects = data.loc[:, data.dtypes == object]
numerics = data.loc[:, data.dtypes == float]
responseVariable = data.y

There are only three categorical variables left. With entries like asia, July, and wednesday, these look like variables we can encode properly.

In [13]:
# summarize the 'object' columns
objects.describe().T

Unnamed: 0,count,unique,top,freq
x24,159972,3,asia,138965
x29,159970,12,July,45569
x30,159970,5,wednesday,101535
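One way to encode these three columns is one-hot encoding with `pd.get_dummies`, sketched below on a toy frame. This is an illustration under stated assumptions, not necessarily the encoding used later in the notebook. The first row of the data preview shows the label `euorpe` in x24; the sketch assumes it is a misspelling of `europe` and corrects it before encoding.

```python
import pandas as pd

# Toy frame for illustration, using labels seen in the data preview
toy = pd.DataFrame({
    "x24": ["euorpe", "asia", "asia"],
    "x30": ["tuesday", "wednesday", "wednesday"],
})

# Assumed correction: treat "euorpe" as a misspelling of "europe"
toy["x24"] = toy["x24"].replace({"euorpe": "europe"})

# One-hot encode the categorical columns; dtype=int yields 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=["x24", "x30"], dtype=int)
print(encoded)
```

With only 3, 12, and 5 unique levels, one-hot encoding adds a modest number of columns; an ordinal encoding for the month and weekday columns would be another option.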
