Using ONE of the following sources, complete the questions for only that source. 

Credit approval: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29

Cardiac Arrhythmia: https://archive.ics.uci.edu/ml/datasets/Arrhythmia 

Abalone age: https://archive.ics.uci.edu/ml/datasets/Abalone - this one is a bit harder since it's not binary like the others, but if you really want to master these concepts, you should pick this one. 

Note: at least one of your models should have the most relevant performance metric above .90. All performance metrics should be above .75. You will partially be graded on model performance.

1) Preprocess your dataset. Indicate which steps worked and which didn’t. Include your thoughts on why certain steps worked and certain steps didn’t. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

In [2]:
#read_csv already returns a DataFrame; the file has no header row
arr_df = pd.read_csv("arrhythmia.csv", header=None)

arr_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,270,271,272,273,274,275,276,277,278,279
0,75,0,190,80,91,193,371,174,121,-16,...,0.0,9.0,-0.9,0.0,0.0,0.9,2.9,23.3,49.4,8
1,56,1,165,64,81,174,401,149,39,25,...,0.0,8.5,0.0,0.0,0.0,0.2,2.1,20.4,38.8,6
2,54,0,172,95,138,163,386,185,102,96,...,0.0,9.5,-2.4,0.0,0.0,0.3,3.4,12.3,49.0,10
3,55,0,175,94,100,202,380,179,143,28,...,0.0,12.2,-2.2,0.0,0.0,0.4,2.6,34.6,61.6,1
4,75,0,190,80,88,181,360,177,103,-16,...,0.0,13.1,-3.6,0.0,0.0,-0.1,3.9,25.4,62.8,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
447,53,1,160,70,80,199,382,154,117,-37,...,0.0,4.3,-5.0,0.0,0.0,0.7,0.6,-4.4,-0.5,1
448,37,0,190,85,100,137,361,201,73,86,...,0.0,15.6,-1.6,0.0,0.0,0.4,2.4,38.0,62.4,10
449,36,0,166,68,108,176,365,194,116,-85,...,0.0,16.3,-28.6,0.0,0.0,1.5,1.0,-44.2,-33.2,2
450,32,1,155,55,93,106,386,218,63,54,...,-0.4,12.0,-0.7,0.0,0.0,0.5,2.4,25.0,46.6,1


In [27]:
#give columns headings to make them easier to manipulate
arr_df.columns = ['N'+str(x) for x in range(0,280)]
arr_df

Unnamed: 0,N0,N1,N2,N3,N4,N5,N6,N7,N8,N9,...,N270,N271,N272,N273,N274,N275,N276,N277,N278,N279
0,75,0,190,80,91,193,371,174,121,-16,...,0.0,9.0,-0.9,0.0,0.0,0.9,2.9,23.3,49.4,8
1,56,1,165,64,81,174,401,149,39,25,...,0.0,8.5,0.0,0.0,0.0,0.2,2.1,20.4,38.8,6
2,54,0,172,95,138,163,386,185,102,96,...,0.0,9.5,-2.4,0.0,0.0,0.3,3.4,12.3,49.0,10
3,55,0,175,94,100,202,380,179,143,28,...,0.0,12.2,-2.2,0.0,0.0,0.4,2.6,34.6,61.6,1
4,75,0,190,80,88,181,360,177,103,-16,...,0.0,13.1,-3.6,0.0,0.0,-0.1,3.9,25.4,62.8,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
447,53,1,160,70,80,199,382,154,117,-37,...,0.0,4.3,-5.0,0.0,0.0,0.7,0.6,-4.4,-0.5,1
448,37,0,190,85,100,137,361,201,73,86,...,0.0,15.6,-1.6,0.0,0.0,0.4,2.4,38.0,62.4,10
449,36,0,166,68,108,176,365,194,116,-85,...,0.0,16.3,-28.6,0.0,0.0,1.5,1.0,-44.2,-33.2,2
450,32,1,155,55,93,106,386,218,63,54,...,-0.4,12.0,-0.7,0.0,0.0,0.5,2.4,25.0,46.6,1


In [32]:
#last column is the arrhythmia classification (aka y_actual), so we'll need that
arr_class_df = arr_df['N279']
arr_class_df

0       8
1       6
2      10
3       1
4       7
       ..
447     1
448    10
449     2
450     1
451     1
Name: N279, Length: 452, dtype: int64

In [67]:
#drop columns with extraneous/duplicate info
#probably not best practice, but all the columns after the first dozen or so are too subject-specific for me to get
#so let's drop them
arr_df_2 = arr_df.iloc[0:451, 0:13]
arr_df_2

Unnamed: 0,N0,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12
0,75,0,190,80,91,193,371,174,121,-16,13,64,-2
1,56,1,165,64,81,174,401,149,39,25,37,-17,31
2,54,0,172,95,138,163,386,185,102,96,34,70,66
3,55,0,175,94,100,202,380,179,143,28,11,-5,20
4,75,0,190,80,88,181,360,177,103,-16,13,61,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
446,20,1,157,57,81,151,363,166,80,43,42,72,42
447,53,1,160,70,80,199,382,154,117,-37,4,40,-27
448,37,0,190,85,100,137,361,201,73,86,66,52,79
449,36,0,166,68,108,176,365,194,116,-85,-19,-61,-70


In [68]:
#concatenate 1st dozen-ish columns with N279 (aka arrhythmia class, aka y_actual)
arr_df_3 = pd.concat([arr_df_2, arr_class_df], axis=1)
arr_df_3

Unnamed: 0,N0,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N279
0,75.0,0.0,190.0,80.0,91.0,193.0,371.0,174.0,121.0,-16.0,13,64,-2,8
1,56.0,1.0,165.0,64.0,81.0,174.0,401.0,149.0,39.0,25.0,37,-17,31,6
2,54.0,0.0,172.0,95.0,138.0,163.0,386.0,185.0,102.0,96.0,34,70,66,10
3,55.0,0.0,175.0,94.0,100.0,202.0,380.0,179.0,143.0,28.0,11,-5,20,1
4,75.0,0.0,190.0,80.0,88.0,181.0,360.0,177.0,103.0,-16.0,13,61,3,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
447,53.0,1.0,160.0,70.0,80.0,199.0,382.0,154.0,117.0,-37.0,4,40,-27,1
448,37.0,0.0,190.0,85.0,100.0,137.0,361.0,201.0,73.0,86.0,66,52,79,10
449,36.0,0.0,166.0,68.0,108.0,176.0,365.0,194.0,116.0,-85.0,-19,-61,-70,2
450,32.0,1.0,155.0,55.0,93.0,106.0,386.0,218.0,63.0,54.0,29,-22,43,1


In [69]:
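#the NaN row comes from the concat above: arr_df_2 only has rows 0-450, but arr_class_df has rows 0-451,
#so row 451 has a class label and no features (which is also why N0-N9 turned into floats); drop it
arr_df_4 = arr_df_3.iloc[0:451, :]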
arr_df_4

Unnamed: 0,N0,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N279
0,75.0,0.0,190.0,80.0,91.0,193.0,371.0,174.0,121.0,-16.0,13,64,-2,8
1,56.0,1.0,165.0,64.0,81.0,174.0,401.0,149.0,39.0,25.0,37,-17,31,6
2,54.0,0.0,172.0,95.0,138.0,163.0,386.0,185.0,102.0,96.0,34,70,66,10
3,55.0,0.0,175.0,94.0,100.0,202.0,380.0,179.0,143.0,28.0,11,-5,20,1
4,75.0,0.0,190.0,80.0,88.0,181.0,360.0,177.0,103.0,-16.0,13,61,3,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
446,20.0,1.0,157.0,57.0,81.0,151.0,363.0,166.0,80.0,43.0,42,72,42,1
447,53.0,1.0,160.0,70.0,80.0,199.0,382.0,154.0,117.0,-37.0,4,40,-27,1
448,37.0,0.0,190.0,85.0,100.0,137.0,361.0,201.0,73.0,86.0,66,52,79,10
449,36.0,0.0,166.0,68.0,108.0,176.0,365.0,194.0,116.0,-85.0,-19,-61,-70,2
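
The stray NaN row can be reproduced on a tiny frame: slicing the features with a row limit while keeping the full label column leaves the two pieces with different indexes, and `pd.concat(..., axis=1)` aligns on the index. A minimal sketch, assuming a made-up 4-row frame (the `toy_*` names and values are illustrative, not from the notebook):

```python
import pandas as pd

# Toy stand-in for arr_df: 4 rows, 2 feature columns plus a label column.
toy_df = pd.DataFrame(
    {"N0": [75, 56, 54, 55], "N1": [0, 1, 0, 0], "label": [8, 6, 10, 1]}
)

toy_features_short = toy_df.iloc[0:3, 0:2]  # row limit drops the last row by accident
toy_labels = toy_df["label"]                # still has all 4 rows

# concat aligns on the index, so row 3 gets NaN features and the
# integer feature columns are upcast to float.
misaligned = pd.concat([toy_features_short, toy_labels], axis=1)
print(misaligned)

# Slicing only the columns keeps every row, so no cleanup step is needed.
toy_features_full = toy_df.iloc[:, 0:2]
aligned = pd.concat([toy_features_full, toy_labels], axis=1)
print(aligned)
```

Selecting columns with `iloc[:, 0:13]` instead of `iloc[0:451, 0:13]` would keep all 452 rows in step with the label column.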


In [71]:
#replace the "?" missing-value markers with zeroes
arr_df_5 = arr_df_4.replace("?", 0)
arr_df_5

Unnamed: 0,N0,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N279
0,75.0,0.0,190.0,80.0,91.0,193.0,371.0,174.0,121.0,-16.0,13,64,-2,8
1,56.0,1.0,165.0,64.0,81.0,174.0,401.0,149.0,39.0,25.0,37,-17,31,6
2,54.0,0.0,172.0,95.0,138.0,163.0,386.0,185.0,102.0,96.0,34,70,66,10
3,55.0,0.0,175.0,94.0,100.0,202.0,380.0,179.0,143.0,28.0,11,-5,20,1
4,75.0,0.0,190.0,80.0,88.0,181.0,360.0,177.0,103.0,-16.0,13,61,3,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
446,20.0,1.0,157.0,57.0,81.0,151.0,363.0,166.0,80.0,43.0,42,72,42,1
447,53.0,1.0,160.0,70.0,80.0,199.0,382.0,154.0,117.0,-37.0,4,40,-27,1
448,37.0,0.0,190.0,85.0,100.0,137.0,361.0,201.0,73.0,86.0,66,52,79,10
449,36.0,0.0,166.0,68.0,108.0,176.0,365.0,194.0,116.0,-85.0,-19,-61,-70,2
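
Replacing "?" with 0 treats a missing measurement as if it were a real reading of zero, and it can leave the affected columns as mixed string/number objects. One alternative, sketched here under the assumption that median imputation is acceptable for these columns, is to have pandas parse "?" as NaN and fill the gaps per column. The inline CSV and its values are made up for illustration:

```python
import io
import pandas as pd

# Made-up sample shaped like the raw file: no header row, "?" marks a missing value.
raw = io.StringIO("75,13,64\n56,?,-17\n54,34,?\n55,11,-5\n")

# na_values="?" makes pandas read the marker as NaN, so every column stays numeric.
sample = pd.read_csv(raw, header=None, na_values="?")
sample.columns = ["N" + str(i) for i in range(sample.shape[1])]
print(sample.isna().sum())

# Fill each column's gaps with that column's median instead of a hard-coded 0.
imputed = sample.fillna(sample.median())
print(imputed)
```

In a real run the medians would ideally be computed on the training split only and then reused for the test split.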


In [81]:
X = arr_df_5.drop('N279', axis=1)
y = arr_df_5['N279']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
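#NOTE: fit_transform re-fits the scaler on the test rows; sc.transform(X_test) would reuse the training mean/std
X_test = sc.fit_transform(X_test)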

from sklearn.linear_model import LinearRegression

#baseline: linear regression fit directly on the class label
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

#for a regressor, score() returns R^2 (coefficient of determination), not accuracy
r2 = regressor.score(X_test, y_test)
print(r2)

0.1753525950556286


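Not the best score, I will admit. Strictly speaking this number is R^2 rather than accuracy, since LinearRegression treats the arrhythmia class codes as a numeric quantity instead of as labels, so a low value is not too surprising. Truly, I spent a long time trying to figure out better ways to preprocess this data, but I think my unfamiliarity with the information stored in the dataset (a cardiologist I am not) got the best of me in the end.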
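
A minimal sketch of one alternative baseline, assuming a classifier suits class labels better than a regressor: a scaler and a `LogisticRegression` chained in a pipeline, so the scaler is fit on the training rows only. The data is synthetic (`make_classification`), and `X_demo`, `clf`, and the other names are placeholders rather than anything from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 451 rows, 13 features, 3 classes (not the arrhythmia file).
X_demo, y_demo = make_classification(
    n_samples=451, n_features=13, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and reuses those
# statistics when it transforms the test rows inside predict()/score().
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

# For a classifier, score() is plain accuracy: the share of correct labels.
demo_accuracy = clf.score(X_te, y_te)
print(round(demo_accuracy, 3))
```

The printed number only describes the synthetic data; it says nothing about how the arrhythmia features would score.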

2) Create a decision tree model tuned to the best of your abilities. Explain how you tuned it.

In [101]:
from sklearn import tree

X = arr_df_5.drop('N279', axis=1)
y = arr_df_5['N279']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
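#NOTE: fit_transform re-fits the scaler on the test rows; sc.transform(X_test) would reuse the training mean/std
X_test = sc.fit_transform(X_test)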

# decision tree classifier
dt = tree.DecisionTreeClassifier(max_depth = 10, random_state=42)

dt = dt.fit(X_train, y_train)
dt.score(X_test, y_test)

0.5384615384615384

In [102]:
dt = tree.DecisionTreeClassifier(criterion='entropy', max_depth = 10, random_state=42)

dt = dt.fit(X_train, y_train)
dt.score(X_test, y_test)

0.4725274725274725

In [103]:
dt = tree.DecisionTreeClassifier(max_depth = 50, random_state=42)

dt = dt.fit(X_train, y_train)
dt.score(X_test, y_test)

0.45054945054945056

In [106]:
dt = tree.DecisionTreeClassifier(max_depth = 10, min_samples_split=4, random_state=42)

dt = dt.fit(X_train, y_train)
dt.score(X_test, y_test)

0.5274725274725275

In [116]:
dt = tree.DecisionTreeClassifier(max_depth = 10, random_state=42, max_leaf_nodes=10)

dt = dt.fit(X_train, y_train)
dt.score(X_test, y_test)

0.6263736263736264

In [138]:
dt = tree.DecisionTreeClassifier(max_depth = 10, random_state=42, max_leaf_nodes=100)

dt = dt.fit(X_train, y_train)
dt.score(X_test, y_test)

0.5274725274725275

In [139]:
dt = tree.DecisionTreeClassifier(max_depth = 10, random_state=42, max_leaf_nodes=20)

dt = dt.fit(X_train, y_train)
dt.score(X_test, y_test)

0.5714285714285714

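Keeping the criterion on the default of 'gini' was the way to go (0.538 vs. 0.473 for 'entropy' at max_depth=10).  Increasing max_depth from 10 to 50 lowered the score to 0.451, and raising min_samples_split to 4 lowered it slightly (0.527).  Capping max_leaf_nodes helped the most: 10 leaves gave the best score (0.626), ahead of 20 (0.571) and 100 (0.527), probably counteracting my lack of good preprocessing.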
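
The one-setting-at-a-time comparison above could also be run as a cross-validated grid search, which scores every combination on several folds instead of a single test split. A rough sketch, assuming synthetic stand-in data and a small grid built from the values tried in this section (`X_demo`, `param_grid`, and `search` are illustrative names):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the arrhythmia file).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, n_classes=3, random_state=42
)

# Candidate values mirror the ones tried by hand above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 50],
    "max_leaf_nodes": [10, 20, 100],
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best combo
```

Because trees split on thresholds, they do not need scaled inputs, so the sketch skips the scaler.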

3) Create a random forest model tuned to the best of your abilities. Explain how you tuned it.

In [88]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
#estimators = models (here, decision trees)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.6373626373626373

In [89]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.6483516483516484

In [90]:
rf = RandomForestClassifier(n_estimators=50, random_state=42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.6593406593406593

In [127]:
rf = RandomForestClassifier(n_estimators=50, random_state=42, bootstrap=False)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.5824175824175825

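Turning off bootstrapping dropped the score substantially (0.659 to 0.582).  Dropping the number of estimators from 200 to 100 to 50 nudged the score up (0.637, 0.648, 0.659), but each step is a single test sample out of 91, so I wouldn't read much into it.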
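
One way to check whether differences between n_estimators settings are real or just noise from a single split is to look at the mean and spread of cross-validated scores. A minimal sketch, assuming synthetic stand-in data (`X_demo`, `rf_demo`, and `results` are illustrative names, and the printed numbers will not match the arrhythmia scores):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the arrhythmia file).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, n_classes=3, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for n in (50, 100, 200):
    rf_demo = RandomForestClassifier(n_estimators=n, random_state=42)
    scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")
    # Keep mean and standard deviation so the settings can be compared fairly.
    results[n] = (scores.mean(), scores.std())
    print(n, round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```

If the gaps between the means are smaller than the standard deviations, the settings are hard to tell apart.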

4) Create an xgboost model tuned to the best of your abilities. Explain how you tuned it. 

In [131]:
X = arr_df_5.drop('N279', axis=1)
y = arr_df_5['N279']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
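#NOTE: fit_transform re-fits the scaler on the test rows; sc.transform(X_test) would reuse the training mean/std
X_test = sc.fit_transform(X_test)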

#XGBoost = extreme gradient boost
from xgboost import XGBClassifier

#fit model to training data
xgb = XGBClassifier(random_state=42)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
#predict() already returns class labels, so rounding leaves them unchanged
predictions = [round(value) for value in y_pred]

xgb.score(X_test, y_test)

0.5384615384615384

In [132]:
xgb = XGBClassifier(random_state=42, max_depth=10)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
#predict() already returns class labels, so rounding leaves them unchanged
predictions = [round(value) for value in y_pred]

xgb.score(X_test, y_test)

0.5824175824175825

In [134]:
xgb = XGBClassifier(random_state=42, max_depth=15)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
#predict() already returns class labels, so rounding leaves them unchanged
predictions = [round(value) for value in y_pred]

xgb.score(X_test, y_test)

0.5934065934065934

Not going to lie, I find the number of parameters for this one overwhelming.  Also, I only just got it to run (couldn't get it to work during class when we went over it, had to install a bunch of things).  But I at least tuned it a little: raising max_depth to 10 and then 15 moved the score from 0.538 to 0.582 and 0.593.
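
One setup detail that may matter when re-running this cell: as far as I know, newer xgboost releases expect class labels numbered 0 to K-1, while the arrhythmia codes start at 1 and skip values. A small sketch of re-encoding the labels with scikit-learn's `LabelEncoder` before fitting and mapping predictions back afterwards (`y_raw` is a made-up sample in the style of N279; the xgboost call itself is left out so the snippet runs without it):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels in the style of N279: codes start at 1 and skip values.
y_raw = np.array([8, 6, 10, 1, 7, 1, 10, 2, 1])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)  # sorted unique codes become 0..K-1
print(encoder.classes_)
print(y_encoded)

# A model trained on y_encoded would predict encoded ids; inverse_transform
# maps them back to the original codes for reporting. Here the encoded
# labels themselves stand in for predictions.
y_restored = encoder.inverse_transform(y_encoded)
print(y_restored)
```

The encoded array is what would be passed as `y` to `XGBClassifier.fit`.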

5) Which model performed best? What is your performance metric? Why? 

Based on the accuracy score, RandomForest performed best overall (0.659 with n_estimators=50, vs. 0.626 for the best decision tree and 0.593 for the best XGBoost run).  I used accuracy because it seemed the most straightforward/concrete to grasp. There are probably better metrics to go by, but at this point I'm just trying to get and stay caught up, so not the time to be a perfectionist!
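
On the "better metrics" point: accuracy can look fine while a model ignores the rarer classes, which could matter here if one class dominates, as class 1 seems to in the label output earlier. A toy sketch with made-up labels, comparing accuracy with balanced accuracy and macro F1 from `sklearn.metrics`:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: 8 samples of class 1, 2 samples of class 2.
y_true = [1] * 8 + [2] * 2
# A lazy "model" that always predicts the majority class.
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)           # share of correct labels
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(acc)                 # 0.8
print(bal)                 # 0.5
print(round(macro_f1, 4))  # 0.4444
```

The majority-class guesser gets 0.8 accuracy but only 0.5 balanced accuracy, because it never identifies class 2.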