### Project 2
### Group 22
Group Members: Ahmet Gungor, Blake Long

### Importing packages and CSV files into pandas DataFrames

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
auditrisk = pd.read_csv('audit_risk.csv')
trial = pd.read_csv('trial.csv')

Checking which columns the two datasets have in common

In [3]:
auditrisk.columns.intersection(trial.columns)

Index(['History', 'LOCATION_ID', 'Money_Value', 'PARA_A', 'PARA_B', 'Risk',
       'Score', 'Sector_score', 'TOTAL', 'numbers'],
      dtype='object')

### Data Preparation


Dropping the Risk columns for the regression analysis

In [4]:
ar = auditrisk.drop(['Risk'],axis=1)
tr = trial.drop(['Risk'], axis=1)

In [5]:
(ar['District_Loss'] == tr['District']).describe()

count      776
unique       1
top       True
freq       776
dtype: object

In [6]:
ar = ar.rename(columns={'District_Loss': 'District'})

In [7]:
tr.loc[ar['Money_Value'] != tr['Money_Value']]

Unnamed: 0,Sector_score,LOCATION_ID,PARA_A,SCORE_A,PARA_B,SCORE_B,TOTAL,numbers,Marks,Money_Value,MONEY_Marks,District,Loss,LOSS_SCORE,History,History_score,Score
642,55.57,4,0.23,2,0.0,2,0.23,5.0,2,,2,2,0,2,0,2,2.0


Merging the datasets

In [8]:
cols_to_use = ar.columns.difference(tr.columns)
reg = pd.merge(tr, ar[cols_to_use], left_index=True, right_index=True, how='outer')

Filling the missing value with the average Money_Value of rows with an identical Score value

In [9]:
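# Rows that share the Score (2.0) of the row with the missing Money_Value
score_2_rows = reg[reg.Score == 2.0]
money_mean = score_2_rows.Money_Value.mean()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
reg['Money_Value'] = reg['Money_Value'].fillna(money_mean)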
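The cell above uses a single mean because only one row is missing. A more general version of the same idea, shown here as a minimal sketch on made-up data (the `toy` frame and its values are not from the audit dataset), fills each missing `Money_Value` with the mean of the rows that share its `Score`:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the merged frame; only the column names follow the notebook.
toy = pd.DataFrame({
    "Score": [2.0, 2.0, 2.0, 4.0, 4.0],
    "Money_Value": [1.0, np.nan, 3.0, 10.0, 20.0],
})

# Per-row mean of Money_Value within the row's Score group (NaN values are skipped).
group_mean = toy.groupby("Score")["Money_Value"].transform("mean")

# Fill only the missing entries; existing values are left untouched.
toy["Money_Value"] = toy["Money_Value"].fillna(group_mean)
print(toy)
```

With several missing rows spread over different `Score` values, this avoids hard-coding the score of each affected row.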

Dropping Duplicate Columns that differ only by scale

Checking whether each pair of columns is identical up to a factor of 10

In [10]:
dft = pd.DataFrame()
dft['History_score'] = [sum(reg['History_score'] == (reg['Prob']*10))]
dft['LOSS_SCORE'] = [sum(reg['LOSS_SCORE'] == (reg['PROB']*10))]
dft['Marks'] = [sum(reg['Marks'] == (reg['Score_B.1']*10))]
dft['MONEY_Marks'] = [sum(reg['MONEY_Marks'] == (reg['Score_MV']*10))]
dft['SCORE_A'] = [sum(reg['SCORE_A'] == (reg['Score_A']*10))]
dft['SCORE_B'] = [sum(reg['SCORE_B'] == (reg['Score_B']*10))]
dft

Unnamed: 0,History_score,LOSS_SCORE,Marks,MONEY_Marks,SCORE_A,SCORE_B
0,776,776,776,776,776,776
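The counts above rely on exact equality between floating-point values. As a hedged alternative, the sketch below runs the same kind of check with `np.isclose`, which tolerates small rounding differences. The values are made up, and the `pairs` mapping is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up columns; only the naming pattern follows the notebook.
toy = pd.DataFrame({
    "SCORE_A": [2, 4, 6],
    "Score_A": [0.2, 0.4, 0.6],
    "SCORE_B": [2, 4, 6],
    "Score_B": [0.2, 0.4, 0.5],   # last row deliberately does not match
})

# Hypothetical mapping: column on the 0-10 scale -> column on the 0-1 scale.
pairs = {"SCORE_A": "Score_A", "SCORE_B": "Score_B"}

# Number of rows in which the large-scale column equals 10x the small-scale one.
match_counts = {
    big: int(np.isclose(toy[big], toy[small] * 10).sum())
    for big, small in pairs.items()
}
print(match_counts)
```

A pair is a pure rescaling only when its count equals the number of rows.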


All Detection_Risk values are identical, so the column carries no information and was dropped.

LOCATION_ID was dropped because its values are inconsistent: it contains both city names and numerical identifiers. We tried a one-hot encoding of LOCATION_ID, but it added significant time to running the code. We then created a lasso feature selection plot for the data with the one-hot encoded LOCATION_ID and compared it with a lasso feature selection plot of the data without LOCATION_ID. The features produced by the one-hot encoding were not significant enough to justify keeping them in the dataset.
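The comparison described above is not included in this notebook. The following is only a minimal sketch of the general approach on made-up data: one-hot encode a mixed-type `LOCATION_ID` column with `pd.get_dummies`, fit a `Lasso`, and inspect which coefficients remain non-zero. The `toy` frame, the `CITY_X` label, the target, and `alpha=0.1` are all assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import Lasso

# Made-up data: LOCATION_ID mixes numeric codes and a city name, as described above.
toy = pd.DataFrame({
    "LOCATION_ID": ["1", "2", "CITY_X", "1", "2", "CITY_X", "1", "2"],
    "PARA_A": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
})
target = 2.0 * toy["PARA_A"].to_numpy()   # target depends on PARA_A only

# One indicator column per distinct LOCATION_ID value.
one_hot = pd.get_dummies(toy["LOCATION_ID"], prefix="LOC", dtype=float)
features = pd.concat([toy[["PARA_A"]], one_hot], axis=1)

# Lasso shrinks uninformative coefficients towards (often exactly) zero.
lasso = Lasso(alpha=0.1).fit(features, target)
coefs = pd.Series(lasso.coef_, index=features.columns)
print(coefs)
```

Indicator columns whose coefficients stay at or near zero across reasonable `alpha` values are candidates for removal.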

In [11]:
reg = reg.drop(['LOCATION_ID', 'History_score', 'LOSS_SCORE', 'Marks', 'MONEY_Marks', 'SCORE_A', 'SCORE_B', 'Detection_Risk'],axis=1)


In [12]:
reg.shape

(776, 25)

## Regression

Importing functions for scaling, regression, and data splitting

In [13]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from  sklearn.preprocessing  import PolynomialFeatures
from  sklearn.linear_model import Ridge
from  sklearn.linear_model import Lasso
from sklearn.svm import LinearSVR
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

In [14]:
regmodels = pd.DataFrame(columns = ['Model', 'Score','Parameters'])
regmodels

Unnamed: 0,Model,Score,Parameters


Audit Risk is the target value

In [15]:
y = reg['Audit_Risk'].values

In [16]:
reg = reg.drop(['Audit_Risk'],axis=1)

In [17]:
X = reg.values
X.shape

(776, 24)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


In [19]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
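The scaler above is fitted once on the whole training split, so in the grid searches that follow, every validation fold has already contributed to the scaling range. One common way to keep scaling inside each fold is to put the scaler and the model in a `Pipeline`. The sketch below uses synthetic data from `make_regression` and a small, arbitrary `alpha` grid; it is an illustration, not the setup used for the results in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with the same number of features as X above.
X_demo, y_demo = make_regression(n_samples=200, n_features=24, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The scaler is re-fitted on the training part of every CV fold.
pipe = Pipeline([("scale", MinMaxScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_tr, y_tr)
print(search.best_params_)
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid key is `ridge__alpha`.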


### Bagging Models

Ridge Regression and SVR with an 'rbf' kernel were selected as our two base models for bagging because they tested best in Project 1. The parameters used for these models are the best parameters found in Project 1.

Ridge Bagging

In [20]:
from sklearn.ensemble import BaggingRegressor
param_grid = {'n_estimators': [10, 100, 1000],
             'max_samples': [100, 200, 300]}

grid_search = GridSearchCV(BaggingRegressor(Ridge(alpha = 1), bootstrap = True, random_state=0), param_grid, cv=5, scoring = 'neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'max_samples': 300, 'n_estimators': 1000}
Best cross-validation score: -76.80


In [21]:
from sklearn.metrics import mean_squared_error
bag_reg = BaggingRegressor(Ridge(alpha = 1), bootstrap = True, random_state=0, max_samples=300, n_estimators=1000)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
mean_squared_error(y_test, y_pred)

2939.6998736665682

SVR with 'rbf' kernel Bagging

In [22]:
from sklearn.ensemble import BaggingRegressor
param_grid = {'n_estimators': [10, 100, 1000],
             'max_samples': [100, 200, 300]}

grid_search = GridSearchCV(BaggingRegressor(SVR(C=100, epsilon=0.001, gamma=0.1, kernel = 'rbf'), bootstrap = True, random_state=0), param_grid, cv=5, scoring = 'neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'max_samples': 300, 'n_estimators': 10}
Best cross-validation score: -77.90




In [23]:
bag_reg = BaggingRegressor(SVR(C=100, epsilon=0.001, gamma=0.1, kernel = 'rbf'), bootstrap = True, random_state=0, max_samples=300, n_estimators=10)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
mean_squared_error(y_test, y_pred)

4653.393195318713

### Pasting Models

Ridge Pasting

In [24]:
from sklearn.ensemble import BaggingRegressor
param_grid = {'n_estimators': [10, 100, 1000],
             'max_samples': [100, 200, 300]}

grid_search = GridSearchCV(BaggingRegressor(Ridge(alpha = 1), bootstrap = False, random_state=0), param_grid, cv=5, scoring = 'neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'max_samples': 300, 'n_estimators': 1000}
Best cross-validation score: -75.70


In [25]:
from sklearn.metrics import mean_squared_error
bag_reg = BaggingRegressor(Ridge(alpha = 1), bootstrap = False, random_state=0, max_samples=300, n_estimators=1000)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
mean_squared_error(y_test, y_pred)

2936.5299668342323

SVR with 'rbf' kernel Pasting

In [26]:
from sklearn.ensemble import BaggingRegressor
param_grid = {'n_estimators': [10, 100, 1000],
             'max_samples': [100, 200, 300]}

grid_search = GridSearchCV(BaggingRegressor(SVR(C=100, epsilon=0.001, gamma=0.1, kernel = 'rbf'), bootstrap = False, random_state=0), param_grid, cv=5, scoring = 'neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'max_samples': 300, 'n_estimators': 10}
Best cross-validation score: -75.94




In [27]:
bag_reg = BaggingRegressor(SVR(C=100, epsilon=0.001, gamma=0.1, kernel = 'rbf'), bootstrap = False, random_state=0, max_samples=300, n_estimators=10)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
mean_squared_error(y_test, y_pred)

4621.078967806816
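The bagging and pasting cells above differ only in the `bootstrap` argument: bagging draws each estimator's training rows with replacement, pasting draws them without replacement. The sketch below makes that visible on synthetic data by inspecting `estimators_samples_`; the data, the ensemble size, and the helper `drawn_indices` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

def drawn_indices(bootstrap):
    """Fit a small ensemble and return the row indices drawn for its first estimator."""
    ens = BaggingRegressor(Ridge(alpha=1), n_estimators=3, max_samples=100,
                           bootstrap=bootstrap, random_state=0)
    ens.fit(X_demo, y_demo)
    return ens.estimators_samples_[0]

bagging_idx = drawn_indices(bootstrap=True)    # sampling with replacement
pasting_idx = drawn_indices(bootstrap=False)   # sampling without replacement

print("bagging unique rows:", len(np.unique(bagging_idx)), "of", len(bagging_idx))
print("pasting unique rows:", len(np.unique(pasting_idx)), "of", len(pasting_idx))
```

With pasting, every drawn row is distinct; with bagging, some rows are repeated, so each estimator sees fewer distinct rows for the same `max_samples`.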

### AdaBoostRegressor Models

Ridge with AdaBoosting

In [28]:
from sklearn.ensemble import AdaBoostRegressor
ada_reg = AdaBoostRegressor(Ridge(alpha=1), random_state=0)
param_grid = {'n_estimators': [10, 50, 100],
              'learning_rate': [0.1, 0.5, 1]}
grid_search = GridSearchCV(ada_reg, param_grid, cv=5, scoring = 'neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))     

Best parameters: {'learning_rate': 0.1, 'n_estimators': 10}
Best cross-validation score: -75.34687800278823




In [29]:
ada_reg = AdaBoostRegressor(Ridge(alpha=1), learning_rate=0.1, n_estimators=10, random_state=0)
ada_reg.fit(X_train, y_train)
y_pred = ada_reg.predict(X_test)
mean_squared_error(y_test, y_pred)

3060.744238803113

SVR with 'rbf' kernel AdaBoosting

In [30]:
from sklearn.ensemble import AdaBoostRegressor
ada_reg = AdaBoostRegressor(SVR(C=100, epsilon=0.001, gamma=0.1, kernel = 'rbf'), random_state=0)
param_grid = {'n_estimators': [10, 50, 100],
              'learning_rate': [0.1, 0.5, 1]}
grid_search = GridSearchCV(ada_reg, param_grid, cv=5, scoring = 'neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))  



Best parameters: {'learning_rate': 0.1, 'n_estimators': 100}
Best cross-validation score: -22.901069291027625


In [31]:
ada_reg = AdaBoostRegressor(SVR(C=100, epsilon=0.001, gamma=0.1, kernel = 'rbf'), learning_rate=0.1, n_estimators=100, random_state=0)
ada_reg.fit(X_train, y_train)
y_pred = ada_reg.predict(X_test)
mean_squared_error(y_test, y_pred)

4541.312675514256

### Gradient Boosting Regressor

In [32]:
from  sklearn.ensemble import GradientBoostingRegressor
param_grid = {'n_estimators': [10, 50, 100, 1000],
              'learning_rate': [0.1, 0.5, 1],
              'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
grid_search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=5, scoring = 'neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))             



Best parameters: {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 1000}
Best cross-validation score: -35.40915835908603


In [33]:
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=0.1, random_state=0)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
mean_squared_error(y_test, y_pred)

3939.382859347095

### PCA Regression Models

In [34]:
from sklearn.decomposition import PCA

pca = PCA(n_components =0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
pca.n_components_

8
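Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows the mechanism on synthetic data (three hidden signals mixed into ten correlated features); the data is an assumption, and the number of components it yields is unrelated to the 8 found above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 underlying signals mixed into 10 correlated features plus small noise.
signals = rng.normal(size=(300, 3))
X_demo = signals @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

pca_demo = PCA(n_components=0.95).fit(X_demo)

# Running total of the variance explained by the kept components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print("components kept:", pca_demo.n_components_)
print("cumulative explained variance:", cumulative)
```

Printing `cumulative` for the real data would show how close the first seven components come to the 95% threshold.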

### K neighbors Regressor

In [35]:
param_grid = {'n_neighbors': [1,2,3,4,5,6,7,8,9,10]}
grid_search = GridSearchCV(KNeighborsRegressor(), param_grid, scoring = 'neg_mean_squared_error', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'n_neighbors': 3}
Best cross-validation score: -106.97




In [36]:
tdf = pd.DataFrame([['KNR', grid_search.best_score_, grid_search.best_params_]], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)


### Linear Regression

In [37]:
lreg = LinearRegression()
lreg.fit(X_train, y_train)
print(lreg.score(X_train, y_train))
print(lreg.score(X_test, y_test))
bestscore = lreg.score(X_test, y_test)

0.600274367147537
0.6566662601583748


In [38]:
from sklearn.model_selection import cross_val_score

In [39]:
bestscore = np.mean(cross_val_score(LinearRegression(), X = X_train, y = y_train, scoring = 'neg_mean_squared_error', cv = 5))

In [40]:
tdf = pd.DataFrame([['Linear Reg', bestscore, 'None']], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)

### Polynomial Regression

In [41]:
train_score_list = []
test_score_list = []

for n in range(1,4):
    poly = PolynomialFeatures(n)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    lreg.fit(X_train_poly, y_train)
    train_score_list.append(lreg.score(X_train_poly, y_train))
    test_score_list.append(lreg.score(X_test_poly, y_test))
print(train_score_list)
print(test_score_list)
bestscore = test_score_list[1]

[0.6002743671475369, 0.8835220543385736, 0.9995928862399858]
[0.6566662601583761, -0.4240396636601078, -3929.001897355965]
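The train score rising to nearly 1 while the test score collapses is the usual sign of overfitting. One plausible contributor is how quickly the number of expanded features grows with the degree. The sketch below counts the columns that `PolynomialFeatures` produces for the 8 PCA components; the dummy array only supplies the input shape.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 8                      # PCA components kept above
dummy = np.zeros((1, n_inputs))   # only the shape matters for counting features

# Number of output columns (including the bias column) for degrees 1 to 3.
feature_counts = {
    degree: PolynomialFeatures(degree).fit(dummy).n_output_features_
    for degree in range(1, 4)
}
print(feature_counts)   # {1: 9, 2: 45, 3: 165}
```

With the default 75/25 split of 776 rows, the training set has 582 rows, so the degree-3 model fits 165 coefficients without any regularization.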


In [42]:
from sklearn.pipeline import make_pipeline
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

In [43]:
param_grid = {'polynomialfeatures__degree': [1, 2, 3, 4 ,5]}
grid = GridSearchCV(PolynomialRegression(), param_grid, scoring = 'neg_mean_squared_error', cv = 5)
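# Note: unlike the other grid searches, this one is fitted on the test split,
# so its cross-validation score is not directly comparable with the other models.
grid.fit(X_test, y_test)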
print("Best parameters: {}".format(grid.best_params_))
print("Best cross-validation score: {}".format(grid.best_score_))

Best parameters: {'polynomialfeatures__degree': 1}
Best cross-validation score: -2008.254639410068




In [44]:
tdf = pd.DataFrame([['Polynomial Reg', grid.best_score_, grid.best_params_]], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)

### Ridge

In [45]:
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
print("Parameter grid:\n{}".format(param_grid))

Parameter grid:
{'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}


In [46]:
grid_search = GridSearchCV(Ridge(), param_grid, scoring = 'neg_mean_squared_error', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'alpha': 0.1}
Best cross-validation score: -128.25




In [47]:
tdf = pd.DataFrame([['Ridge', grid_search.best_score_, grid_search.best_params_]], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)

### Lasso

In [48]:
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
print("Parameter grid:\n{}".format(param_grid))

Parameter grid:
{'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}


In [49]:
grid_search = GridSearchCV(Lasso(), param_grid, scoring = 'neg_mean_squared_error', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'alpha': 0.001}
Best cross-validation score: -128.27




In [50]:
tdf = pd.DataFrame([['Lasso', grid_search.best_score_, grid_search.best_params_]], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)

### Simple SVM

In [51]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LinearSVR(), param_grid, scoring = 'neg_mean_squared_error', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'C': 10}
Best cross-validation score: -159.07




In [52]:
tdf = pd.DataFrame([['LinearSVR', grid_search.best_score_, grid_search.best_params_]], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)

### SVM with Kernels

### RBF

In [53]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'epsilon': [.001, .01, .1, 1, 10, 100],
              'kernel': ['rbf']}
grid_search = GridSearchCV(SVR(), param_grid, scoring = 'neg_mean_squared_error', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'C': 100, 'epsilon': 0.1, 'gamma': 1, 'kernel': 'rbf'}
Best cross-validation score: -102.11




In [54]:
tdf = pd.DataFrame([['SVR-rbf', grid_search.best_score_, grid_search.best_params_]], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)

### Polynomial

In [55]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10],
              'degree': [1,2,3],
              'kernel': ['poly']}
grid_search = GridSearchCV(SVR(), param_grid,scoring = 'neg_mean_squared_error', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'C': 10, 'degree': 1, 'kernel': 'poly'}
Best cross-validation score: -175.94




In [56]:
tdf = pd.DataFrame([['SVR-poly', grid_search.best_score_, grid_search.best_params_]], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)

### Linear

In [57]:
param_grid = {'C': [0.01, 0.1, 1, 10],
              'kernel': ['linear']}
grid_search = GridSearchCV(SVR(), param_grid, scoring = 'neg_mean_squared_error', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'C': 10, 'kernel': 'linear'}
Best cross-validation score: -159.11


In [58]:
tdf = pd.DataFrame([['SVR-Linear', grid_search.best_score_, grid_search.best_params_]], columns = ['Model', 'Score','Parameters'])
regmodels = pd.concat([regmodels, tdf], ignore_index=True)

### Best Regressor

In [59]:
regmodels

Unnamed: 0,Model,Score,Parameters
0,KNR,-106.9655,{'n_neighbors': 3}
1,Linear Reg,-128.095981,
2,Polynomial Reg,-2008.254639,{'polynomialfeatures__degree': 1}
3,Ridge,-128.250929,{'alpha': 0.1}
4,Lasso,-128.265266,{'alpha': 0.001}
5,LinearSVR,-159.074016,{'C': 10}
6,SVR-rbf,-102.110629,"{'C': 100, 'epsilon': 0.1, 'gamma': 1, 'kernel..."
7,SVR-poly,-175.941548,"{'C': 10, 'degree': 1, 'kernel': 'poly'}"
8,SVR-Linear,-159.105581,"{'C': 10, 'kernel': 'linear'}"


In [60]:
regmodels.loc[regmodels['Score'].idxmax()]

Model                                                   SVR-rbf
Score                                                  -102.111
Parameters    {'C': 100, 'epsilon': 0.1, 'gamma': 1, 'kernel...
Name: 6, dtype: object
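The scores in this table are negated mean squared errors, because scikit-learn's `neg_mean_squared_error` scorer flips the sign so that a higher score is always better. That is why `idxmax` picks the best model. As a small sketch, the snippet below converts a few of the scores from the table into RMSE values, which are in the same units as `Audit_Risk`; the choice of which rows to copy is arbitrary.

```python
import numpy as np
import pandas as pd

# Scores copied from the results table above (negated MSE from 5-fold CV).
scores = pd.Series({
    "KNR": -106.9655,
    "Linear Reg": -128.095981,
    "Ridge": -128.250929,
    "Lasso": -128.265266,
    "SVR-rbf": -102.110629,
})

mse = -scores          # flip the sign back to a plain MSE
rmse = np.sqrt(mse)    # same units as the target

print(rmse.round(3))
print("lowest RMSE:", rmse.idxmin())
```

The model with the highest negated MSE is the one with the lowest RMSE, so both views rank the models identically.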

### Old Results Table and Comparison

In [61]:
proj1regresults = pd.read_csv('proj1reg.csv')
proj1regresults

Unnamed: 0,Model,Score,Parameters
0,KNR,-99.573875,{'n_neighbors': 1}
1,Linear Reg,-104.492864,
2,Polynomial Reg,-111.678277,{'polynomialfeatures__degree': 2}
3,Ridge,-79.327138,{'alpha': 1}
4,Lasso,-90.298255,{'alpha': 0.1}
5,LinearSVR,-92.277958,{'C': 100}
6,SVR-rbf,-79.673951,"{'C': 100, 'epsilon': 0.001, 'gamma': 0.1, 'ke..."
7,SVR-poly,-197.822911,"{'C': 10, 'degree': 1, 'kernel': 'poly'}"
8,SVR-Linear,-108.373189,"{'C': 10, 'kernel': 'linear'}"


In [148]:
proj1regresults.loc[proj1regresults['Score'].idxmax()]

Model                Ridge
Score             -79.3271
Parameters    {'alpha': 1}
Name: 3, dtype: object

In Project 1, the Ridge regression model gave us the lowest MSE. After applying PCA and keeping enough components to account for 95% of the variance in the data, SVR with an rbf kernel gave the lowest MSE among the models tested on the post-PCA dataset. Overall, the MSE is higher for our PCA regression models, which is expected because PCA discards part of the information in the original features. The shift from Ridge regression to SVR with an rbf kernel is also plausible after PCA. In Project 1, SVR with an rbf kernel was second best behind the Ridge regression model, and since Ridge regression is geared towards shrinking the coefficients of many features, it makes sense that it would be less effective with a reduced feature set.

Overall, our PCA-based models scored worse on MSE, except for SVR with a polynomial kernel, which scored better on the PCA dataset. In addition, our polynomial regression model scored significantly worse on the PCA dataset. This is likely because polynomial regression relies on feature expansion and would fare worse with a reduced feature set. Its grid search was also fitted on the test split rather than the training split, so its score is not directly comparable with the other models.
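The comparison in the two paragraphs above can be laid out side by side. The sketch below copies the scores from the two results tables into one frame and computes the change per model; the column names `Project 1`, `PCA`, and `Change` are chosen here for illustration.

```python
import pandas as pd

models = ["KNR", "Linear Reg", "Polynomial Reg", "Ridge", "Lasso",
          "LinearSVR", "SVR-rbf", "SVR-poly", "SVR-Linear"]
# Negated MSE values copied from the two results tables above.
proj1 = [-99.573875, -104.492864, -111.678277, -79.327138, -90.298255,
         -92.277958, -79.673951, -197.822911, -108.373189]
pca = [-106.9655, -128.095981, -2008.254639, -128.250929, -128.265266,
       -159.074016, -102.110629, -175.941548, -159.105581]

comparison = pd.DataFrame({"Project 1": proj1, "PCA": pca}, index=models)
# Positive change means the PCA model has a lower MSE than its Project 1 counterpart.
comparison["Change"] = comparison["PCA"] - comparison["Project 1"]
improved = comparison.index[comparison["Change"] > 0].tolist()

print(comparison.sort_values("Change"))
print("Improved with PCA:", improved)
```

Sorting by `Change` puts the polynomial regression at the top as the largest degradation and SVR-poly at the bottom as the only improvement.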

### Deep learning Models with Regression

### DL Regression Data Preparation

In [None]:
# Assumes `reg` still contains the 'Audit_Risk' column (it is dropped in In [16] above)
y = reg['Audit_Risk'].values
reg = reg.drop(['Audit_Risk'],axis=1)
X = reg.values
X.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()
#scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
from tensorflow.keras import metrics
X_train.shape

In [None]:
model1 = Sequential()
#input layer
model1.add(Dense(5, input_dim = 24, activation = 'relu'))
#model1.add(RBFLayer(10,initializer=InitCentersRandom(X_train),betas=5.0,input_shape=(24,)))
#hidden layer
#model1.add(Dense(5, activation = 'tanh'))
#output layer: no activation function
model1.add(Dense(1))

#sgd = optimizers.SGD(lr=0.01, nesterov=True)
model1.compile(loss = 'mse', optimizer ='sgd', metrics = ['mse'])

model1.fit(X_train,y_train, epochs = 30, batch_size = 75)
#30-50

In [None]:
model1.evaluate(X_train, y_train)

In [None]:
model1.evaluate(X_test, y_test)

In [None]:
from sklearn.metrics import r2_score
y_pred = model1.predict(X_train)
y_train_pred =  y_pred.reshape(-1,1)
r2_score(y_train, y_train_pred)

In [None]:
r2_score(y_test, model1.predict(X_test))

Getting good regression results with a neural network was difficult. We tried many combinations of epoch counts and batch sizes, different scalers, different numbers of neurons per layer, and different numbers of hidden layers. However, we could not get a better MSE on the test set than the result above. The predictions behaved unexpectedly: in some cases the test MSE still went up when we increased the number of epochs and decreased the batch size. Since we had a hard time explaining this, we tried replacing the input layer with an RBF layer, because SVR with an rbf kernel was our second-best model on the regression task in Project 1. We implemented the RBF layer through an external .py file, but it only made our results worse. The results above are therefore the best we could obtain, although we are aware of the MSE gap between the training and test sets. These results are worse than most of our Project 1 and Project 2 PCA results.

## Classification

In [62]:
sum(auditrisk['Risk'] == trial['Risk'])

595

The Risk values differ between the two datasets: only 595 of the 776 rows agree.

In [63]:
# Copy so that adding the Risk column below does not also modify reg
clas = reg.copy()

We decided to concatenate the two values (audit_risk label first, trial label second), which gives four possible groups: TrueTrue, TrueFalse, FalseTrue, FalseFalse

In [64]:
ar['Risk'] = auditrisk['Risk'] == 1
tr['Risk'] = trial['Risk'] == 1
temp = pd.DataFrame()
temp['Risk'] = ar['Risk'].map(str) + tr['Risk'].map(str)

In [65]:
clas['Risk'] = temp['Risk']

In [66]:
temp['TT'] = sum(clas['Risk'] == 'TrueTrue')
temp['TF'] = sum(clas['Risk'] == 'TrueFalse')
temp['FT'] = sum(clas['Risk'] == 'FalseTrue')
temp['FF'] = sum(clas['Risk'] == 'FalseFalse')
temp.head()

Unnamed: 0,Risk,TT,TF,FT,FF
0,TrueTrue,305,0,181,290
1,FalseFalse,305,0,181,290
2,FalseFalse,305,0,181,290
3,TrueTrue,305,0,181,290
4,FalseFalse,305,0,181,290


Only three of the four groups occur in the data: TT, FT, FF
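The same breakdown can be obtained more directly with `value_counts` on the combined label, or with `pd.crosstab` on the two original columns. The sketch below uses six made-up labels in place of `auditrisk['Risk']` and `trial['Risk']`.

```python
import pandas as pd

# Made-up labels standing in for auditrisk['Risk'] and trial['Risk'].
risk_a = pd.Series([1, 0, 0, 1, 0, 0], name="audit_risk")
risk_b = pd.Series([1, 0, 1, 1, 1, 0], name="trial")

# Same construction as above: boolean labels joined as strings.
combined = (risk_a == 1).map(str) + (risk_b == 1).map(str)
print(combined.value_counts())

# Rows: audit_risk label, columns: trial label.
table = pd.crosstab(risk_a, risk_b)
print(table)
```

The cross-tabulation also shows empty combinations as explicit zeros, which makes a missing group such as TrueFalse easy to spot.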

### Classification Evaluation Guidelines

In [67]:
clasmodels = pd.DataFrame(columns = ['Method','Model', 'Score','Parameters'])

In [68]:
y = clas['Risk'].values
clasx = clas.drop(['Risk'],axis=1)
X = clasx.values

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [70]:
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import recall_score

### Voting Classifiers

The three models with the highest recall in Project 1 were selected for the voting classifiers

### Hard Voting

In [71]:
from sklearn.ensemble import VotingClassifier
dt_clf = DecisionTreeClassifier(max_depth= 3, random_state=0)
soft_clf = LogisticRegression(C=100, multi_class='multinomial', solver='lbfgs', max_iter= 200)
rbf_clf = SVC(C=10, gamma=100, kernel='rbf', probability = True)

voting_clf = VotingClassifier(estimators=[('soft', soft_clf), ('rbf', rbf_clf), ('tree', dt_clf)], voting='hard')
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

0.9896907216494846
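For a single-label multiclass problem like this one, micro-averaged recall equals plain accuracy, because every misclassified sample counts as exactly one false negative. The sketch below checks this on made-up labels and also prints the per-class recall, which the micro average hides.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels using the three groups found above.
y_true = ["TrueTrue", "TrueTrue", "FalseTrue", "FalseTrue", "FalseFalse", "FalseFalse"]
y_hat = ["TrueTrue", "TrueTrue", "FalseTrue", "FalseFalse", "FalseFalse", "FalseFalse"]

micro = recall_score(y_true, y_hat, average="micro")
accuracy = accuracy_score(y_true, y_hat)

# average=None returns one recall value per class, in the order given by `labels`.
labels = ["FalseFalse", "FalseTrue", "TrueTrue"]
per_class = recall_score(y_true, y_hat, average=None, labels=labels)

print(micro, accuracy)
print(dict(zip(labels, per_class)))
```

If missing one particular group is costlier than missing the others, the per-class values or `average='macro'` may be more informative than the micro average.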

### Soft Voting

In [72]:
from sklearn.ensemble import VotingClassifier
dt_clf = DecisionTreeClassifier(max_depth= 3, random_state=0)
soft_clf = LogisticRegression(C=100, multi_class='multinomial', solver='lbfgs', max_iter= 200)
rbf_clf = SVC(C=10, gamma=100, kernel='rbf', probability = True)

voting_clf = VotingClassifier(estimators=[('soft', soft_clf), ('rbf', rbf_clf), ('tree', dt_clf)], voting='soft')
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

0.9948453608247423
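Hard voting picks the class predicted by the majority of the classifiers, while soft voting averages their predicted class probabilities and picks the most likely class. The two can disagree when one classifier is very confident. The sketch below works through one such case with made-up probabilities; it reproduces the voting rules with NumPy and does not call `VotingClassifier`.

```python
import numpy as np

# Made-up class probabilities from three classifiers for a single sample.
# Columns correspond to three classes (index 0, 1, 2).
probas = np.array([
    [0.9, 0.1, 0.0],   # classifier 1: very confident in class 0
    [0.4, 0.6, 0.0],   # classifier 2: mildly prefers class 1
    [0.4, 0.6, 0.0],   # classifier 3: mildly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_choice = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_choice = probas.mean(axis=0).argmax()

print("hard:", hard_choice, "soft:", soft_choice)   # hard: 1 soft: 0
```

Soft voting needs probability estimates from every model, which is why the SVC above is created with `probability = True`.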

### Bagging Classifier Models

The two models with the highest recall in Project 1 were selected for the bagging and pasting classifier models

Decision tree bagging classifier

In [73]:
from sklearn.ensemble import BaggingClassifier
param_grid = {'n_estimators': [10, 100, 1000],
             'max_samples': [100, 200, 300]}

grid_search = GridSearchCV(BaggingClassifier(DecisionTreeClassifier(max_depth= 3, random_state=0), bootstrap = True, random_state=0), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'max_samples': 100, 'n_estimators': 100}
Best cross-validation score: 0.99


In [74]:
bag_reg = BaggingClassifier(DecisionTreeClassifier(max_depth= 3, random_state=0), bootstrap = True, random_state=0, max_samples=100, n_estimators=100)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

0.9948453608247423

Softmax Regression Bagging Classifier

In [75]:
from sklearn.ensemble import BaggingClassifier
param_grid = {'n_estimators': [10, 100, 1000],
             'max_samples': [100, 200, 300]}

grid_search = GridSearchCV(BaggingClassifier(LogisticRegression(C=100, multi_class='multinomial', solver='lbfgs', max_iter= 200), bootstrap = True, random_state=0), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'max_samples': 300, 'n_estimators': 10}
Best cross-validation score: 0.97


In [76]:
bag_reg = BaggingClassifier(LogisticRegression(C=100, multi_class='multinomial', solver='lbfgs',max_iter= 200), bootstrap = True, random_state=0, max_samples=300, n_estimators=10)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

0.9690721649484536

### Pasting Classifier Models

Decision tree pasting classifier

In [77]:
from sklearn.ensemble import BaggingClassifier
param_grid = {'n_estimators': [10, 100, 1000],
             'max_samples': [100, 200, 300]}

grid_search = GridSearchCV(BaggingClassifier(DecisionTreeClassifier(max_depth= 3, random_state=0), bootstrap = False, random_state=0), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'max_samples': 200, 'n_estimators': 10}
Best cross-validation score: 0.99


In [78]:
bag_reg = BaggingClassifier(DecisionTreeClassifier(max_depth= 3, random_state=0), bootstrap = False, random_state=0, max_samples=200, n_estimators=10)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

0.9948453608247423

Softmax Regression Pasting Classifier

In [79]:
from sklearn.ensemble import BaggingClassifier
param_grid = {'n_estimators': [10, 100, 1000],
             'max_samples': [100, 200, 300]}

grid_search = GridSearchCV(BaggingClassifier(LogisticRegression(C=100, multi_class='multinomial', solver='lbfgs', max_iter= 200), bootstrap = False, random_state=0), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'max_samples': 300, 'n_estimators': 100}
Best cross-validation score: 0.98


In [80]:
bag_reg = BaggingClassifier(LogisticRegression(C=100, multi_class='multinomial', solver='lbfgs', max_iter= 200), bootstrap = False, random_state=0, max_samples=300, n_estimators=100)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

0.9896907216494846

### AdaBoost Classifier Models

Decision Tree AdaBoosting

In [81]:
from sklearn.ensemble import AdaBoostClassifier
ada_clas = AdaBoostClassifier(DecisionTreeClassifier(max_depth= 3, random_state=0), random_state=0)
param_grid = {'n_estimators': [10, 50, 100],
              'learning_rate': [0.1, 0.5, 1]}
grid_search = GridSearchCV(ada_clas, param_grid, cv=5, scoring='recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))     

Best parameters: {'learning_rate': 0.1, 'n_estimators': 10}
Best cross-validation score: 0.9948453608247423


In [82]:
ada_clas = AdaBoostClassifier(DecisionTreeClassifier(max_depth= 3, random_state=0), learning_rate=0.1, n_estimators=10, random_state=0)
ada_clas.fit(X_train, y_train)
y_pred = ada_clas.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

0.9948453608247423

Softmax Regression AdaBoosting

In [83]:
from sklearn.ensemble import AdaBoostClassifier
ada_clas = AdaBoostClassifier(LogisticRegression(C=100, multi_class='multinomial', solver='lbfgs', max_iter= 200), random_state=0)
param_grid = {'n_estimators': [10, 50, 100],
              'learning_rate': [0.1, 0.5, 1]}
grid_search = GridSearchCV(ada_clas, param_grid, cv=5, scoring='recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))   

Best parameters: {'learning_rate': 1, 'n_estimators': 10}
Best cross-validation score: 0.9501718213058419


In [84]:
ada_clas = AdaBoostClassifier(LogisticRegression(C=100, multi_class='multinomial', solver='lbfgs', max_iter= 200), learning_rate=1, n_estimators=10, random_state=0)
ada_clas.fit(X_train, y_train)
y_pred = ada_clas.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

0.9690721649484536

### Gradient Boosting Classifier Model

In [85]:
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {'n_estimators': [10, 50, 100, 1000],
              'learning_rate': [0.1, 0.5, 1],
              'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
grid_search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))  

Best parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 50}
Best cross-validation score: 0.9965635738831615


In [86]:
gbc = GradientBoostingClassifier(max_depth=5, n_estimators=50, learning_rate=0.1, random_state=0)
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
recall_score(y_test, y_pred, average = 'micro')

1.0
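
The recall_micro scoring used throughout pools true positives and false negatives over all classes. For a single-label multiclass target such as Risk, that pooled ratio equals plain accuracy. A minimal pure-Python sketch of the relationship follows; the helper names and the six labels are made up for illustration.

```python
from collections import Counter

def micro_recall(y_true, y_pred):
    """Micro-averaged recall: pooled TP / pooled (TP + FN) over all classes."""
    labels = set(y_true) | set(y_pred)
    tp = Counter()
    fn = Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # the true class t was missed
    total_tp = sum(tp[c] for c in labels)
    total_fn = sum(fn[c] for c in labels)
    return total_tp / (total_tp + total_fn)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels using the three Risk classes from this notebook.
y_true = ['TrueTrue', 'FalseTrue', 'FalseFalse', 'FalseFalse', 'TrueTrue', 'FalseTrue']
y_pred = ['TrueTrue', 'FalseFalse', 'FalseFalse', 'FalseFalse', 'FalseTrue', 'FalseTrue']

print(micro_recall(y_true, y_pred), accuracy(y_true, y_pred))
```

Since every misclassified sample counts as exactly one false negative, the two numbers always agree in this setting.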

### PCA Classification Models

In [87]:
from sklearn.decomposition import PCA

pca = PCA(n_components =0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
pca.n_components_

8
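
Passing a fraction such as n_components=0.95 asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below re-derives that selection with NumPy's SVD on synthetic data. It is an illustration of the idea, not scikit-learn's implementation, and the data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 6 features with very different variances.
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

# Centre the data and take the singular values; squared singular values
# are proportional to the variance explained by each principal component.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained_ratio = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, cumulative.round(3))
```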

### KNN Grid Search

In [88]:
param_grid = {'estimator__n_neighbors': [1,2,3,4,5,6,7,8,9,10]}
grid_search = GridSearchCV(OneVsRestClassifier(KNeighborsClassifier()), param_grid, scoring = 'recall_micro', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))

Best parameters: {'estimator__n_neighbors': 1}
Best cross-validation score: 0.9776632302405498
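
The estimator__ prefix in the parameter grid above follows scikit-learn's nested parameter naming. OneVsRestClassifier stores the wrapped model under its estimator attribute, so that model's n_neighbors is addressed as estimator__n_neighbors. A minimal sketch to confirm the name (the variable names are illustrative):

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

ovr = OneVsRestClassifier(KNeighborsClassifier())

# Nested parameters are exposed as <attribute>__<parameter>.
knn_params = [name for name in ovr.get_params() if name.startswith('estimator__')]
print(sorted(knn_params))

# The same name works with set_params, which is what GridSearchCV relies on.
ovr.set_params(estimator__n_neighbors=3)
print(ovr.estimator.n_neighbors)
```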


In [89]:
tdf = pd.DataFrame([['Recall_Micro', 'KNN', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)
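
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append cells in this notebook only run on older pandas versions. A sketch of the same row-accumulation pattern with pd.concat follows. The two rows are illustrative stand-ins, not a re-run of the grid search.

```python
import pandas as pd

columns = ['Method', 'Model', 'Score', 'Parameters']

# Hypothetical existing results table with one row already recorded.
clasmodels = pd.DataFrame(
    [['Recall_Micro', 'KNN', 0.977663, {'estimator__n_neighbors': 1}]],
    columns=columns)

# New result row, built the same way as tdf in the cells above.
tdf = pd.DataFrame(
    [['GridSearchCV', 'KNN', 0.977663, {'n_neighbors': 1}]],
    columns=columns)

# pd.concat with ignore_index=True renumbers rows like append(..., ignore_index=True).
clasmodels = pd.concat([clasmodels, tdf], ignore_index=True)
print(clasmodels)
```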

In [90]:
param_grid = {'n_neighbors': [1,2,3,4,5,6,7,8,9,10]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'n_neighbors': 1}
Best cross-validation score: 0.98


In [91]:
tdf = pd.DataFrame([['GridSearchCV', 'KNN', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

### Logistic Regression Grid Search

In [92]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'estimator__penalty': ['l1', 'l2']}
grid_search = GridSearchCV(OneVsRestClassifier(LogisticRegression()), param_grid, scoring = 'recall_micro', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'estimator__C': 100, 'estimator__penalty': 'l1'}
Best cross-validation score: 0.98




In [93]:
tdf = pd.DataFrame([['Recall_Micro', 'Logistic Regression', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

In [94]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'C': 100, 'penalty': 'l1'}
Best cross-validation score: 0.98




In [95]:
tdf = pd.DataFrame([['GridSearchCV', 'Logistic Regression', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

### Softmax Regression Grid Search

In [96]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(OneVsRestClassifier(LogisticRegression(multi_class="multinomial",solver="lbfgs")), param_grid, scoring = 'recall_micro', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))

Best parameters: {'estimator__C': 1000}
Best cross-validation score: 0.9742268041237113


In [97]:
tdf = pd.DataFrame([['Recall_Micro', 'Softmax Regression', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

In [98]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'multi_class': ['multinomial'], 'solver':['lbfgs']}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'C': 100, 'multi_class': 'multinomial', 'solver': 'lbfgs'}
Best cross-validation score: 0.97




In [99]:
tdf = pd.DataFrame([['GridSearchCV', 'Softmax Regression', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

### Linear SVM Grid Search

In [100]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(OneVsRestClassifier(LinearSVC()), param_grid, scoring = 'recall_micro', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))



Best parameters: {'estimator__C': 100}
Best cross-validation score: 0.9639175257731959




In [101]:
tdf = pd.DataFrame([['Recall_Micro', 'Linear SVC', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

In [102]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LinearSVC(), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'C': 100}
Best cross-validation score: 0.97




In [103]:
tdf = pd.DataFrame([['GridSearchCV', 'Linear SVC', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

### SVC with kernel trick

### RBF Kernel Grid Search

In [104]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'estimator__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'estimator__kernel': ['rbf']}
grid_search = GridSearchCV(OneVsRestClassifier(SVC()), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'estimator__C': 100, 'estimator__gamma': 100, 'estimator__kernel': 'rbf'}
Best cross-validation score: 0.98


In [105]:
tdf = pd.DataFrame([['Recall_Micro', 'SVC-rbf', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

In [106]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'kernel': ['rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'C': 100, 'gamma': 100, 'kernel': 'rbf'}
Best cross-validation score: 0.98


In [107]:
tdf = pd.DataFrame([['GridSearchCV', 'SVC-rbf', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

### Linear Kernel Grid Search

In [108]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'estimator__kernel': ['linear']}
grid_search = GridSearchCV(OneVsRestClassifier(SVC()), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'estimator__C': 100, 'estimator__kernel': 'linear'}
Best cross-validation score: 0.94


In [109]:
tdf = pd.DataFrame([['Recall_Micro', 'SVC-linear', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

In [110]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'kernel': ['linear']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'C': 100, 'kernel': 'linear'}
Best cross-validation score: 0.97


In [111]:
tdf = pd.DataFrame([['GridSearchCV', 'SVC-linear', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

### Polynomial Kernel Grid Search

In [112]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'estimator__kernel': ['poly'],
               'estimator__degree': [1,2,3,4]}
grid_search = GridSearchCV(OneVsRestClassifier(SVC()), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'estimator__C': 100, 'estimator__degree': 2, 'estimator__kernel': 'poly'}
Best cross-validation score: 0.93




In [113]:
tdf = pd.DataFrame([['Recall_Micro', 'SVC-poly', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

In [114]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'kernel': ['poly'],
               'degree': [1,2,3,4]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'C': 10, 'degree': 1, 'kernel': 'poly'}
Best cross-validation score: 0.96




In [115]:
tdf = pd.DataFrame([['GridSearchCV', 'SVC-poly', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

### Decision Tree Classifier Grid Search

In [116]:
param_grid = {'estimator__max_depth': [1,2,3,4,5,6,7,8,9,10], 'estimator__random_state': [0]}
grid_search = GridSearchCV(OneVsRestClassifier(DecisionTreeClassifier()), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))

Best parameters: {'estimator__max_depth': 7, 'estimator__random_state': 0}
Best cross-validation score: 0.9707903780068728


In [117]:
tdf = pd.DataFrame([['Recall_Micro', 'Decision Tree', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

In [118]:
param_grid = {'max_depth': [1,2,3,4,5,6,7,8,9,10], 'random_state': [0]}
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, scoring = 'recall_micro', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))

Best parameters: {'max_depth': 6, 'random_state': 0}
Best cross-validation score: 0.9690721649484536


In [119]:
tdf = pd.DataFrame([['GridSearchCV', 'Decision Tree', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

In [120]:
from sklearn.dummy import DummyClassifier

dummy_stratified = DummyClassifier(strategy='stratified')
dummy_stratified.fit(X_train, y_train)

pred_strat = dummy_stratified.predict(X_test)

print("Unique predicted labels: {}".format(np.unique(pred_strat)))
print("Test score: {:.2f}".format(dummy_stratified.score(X_test, y_test)))
dummy_score = dummy_stratified.score(X_test, y_test)

Unique predicted labels: ['FalseFalse' 'FalseTrue' 'TrueTrue']
Test score: 0.31


In [121]:
tdf = pd.DataFrame([['Dummy', 'Dummy Classifier', dummy_score, 'None']], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)

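### AUC Data Preparation
This step was needed because the roc_auc_score function, as used here, does not work with multiclass labels. The Risk target is therefore binarized into one indicator column per class.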

In [122]:
y = clas['Risk'].values
clasx = clas.drop(['Risk'],axis=1)
X = clasx.values

In [123]:
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

y = label_binarize(y, classes=['TrueTrue', 'FalseTrue', 'FalseFalse'])
n_classes = y.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y

array([[1, 0, 0],
       [0, 0, 1],
       [0, 0, 1],
       ...,
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]])
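
To make the column layout of the binarized target concrete, the sketch below applies label_binarize to three hand-picked labels. The columns follow the order given in classes, so column 0 is TrueTrue, column 1 is FalseTrue and column 2 is FalseFalse. The sample array y_small is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

classes = ['TrueTrue', 'FalseTrue', 'FalseFalse']
y_small = np.array(['TrueTrue', 'FalseFalse', 'FalseTrue'])  # hypothetical sample

# One indicator column per class, in the order listed in `classes`.
Y = label_binarize(y_small, classes=classes)
print(Y)
```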

In [124]:
pca = PCA(n_components =0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
pca.n_components_

8

### AUC Evaluation

### KNN AUC

In [125]:
param_grid = {'estimator__n_neighbors': [1,2,3,4,5,6,7,8,9,10]}
grid_search = GridSearchCV(OneVsRestClassifier(KNeighborsClassifier()), param_grid, scoring = 'roc_auc', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))

Best parameters: {'estimator__n_neighbors': 9}
Best cross-validation score: 0.988383096830991


In [126]:
tdf = pd.DataFrame([['AUC', 'KNN', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf,ignore_index=True)


### Logistic Regression AUC

In [127]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'estimator__penalty': ['l1', 'l2']}
grid_search = GridSearchCV(OneVsRestClassifier(LogisticRegression()), param_grid, scoring = 'roc_auc', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))



Best parameters: {'estimator__C': 1000, 'estimator__penalty': 'l1'}
Best cross-validation score: 0.9536404616463954




In [128]:
tdf = pd.DataFrame([['AUC', 'Logistic Regression', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf, ignore_index=True)


### Softmax Regression AUC

In [129]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(OneVsRestClassifier(LogisticRegression(multi_class="multinomial",solver="lbfgs")), param_grid, scoring = 'roc_auc', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))

Best parameters: {'estimator__C': 1000}
Best cross-validation score: 0.953537022865902


In [130]:
tdf = pd.DataFrame([['AUC', 'Softmax Regression', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf, ignore_index=True)

### Linear SVM AUC

In [131]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(OneVsRestClassifier(LinearSVC()), param_grid, scoring = 'roc_auc', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))



Best parameters: {'estimator__C': 100}
Best cross-validation score: 0.9519227829137205




In [132]:
tdf = pd.DataFrame([['AUC', 'Linear SVM', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf, ignore_index=True)


### RBF Kernel AUC

In [133]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'estimator__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'estimator__kernel': ['rbf']}
grid_search = GridSearchCV(OneVsRestClassifier(SVC()), param_grid, cv=5, scoring = 'roc_auc', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'estimator__C': 100, 'estimator__gamma': 10, 'estimator__kernel': 'rbf'}
Best cross-validation score: 0.99


In [134]:
tdf = pd.DataFrame([['AUC', 'rbf kernel', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf, ignore_index=True)


### Linear Kernel AUC

In [135]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'estimator__kernel': ['linear']}
grid_search = GridSearchCV(OneVsRestClassifier(SVC()), param_grid, cv=5, scoring = 'roc_auc', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'estimator__C': 100, 'estimator__kernel': 'linear'}
Best cross-validation score: 0.95


In [136]:
tdf = pd.DataFrame([['AUC', 'linear kernel', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf, ignore_index=True)


### Polynomial Kernel AUC

In [137]:
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'estimator__kernel': ['poly'],
               'estimator__degree': [1,2,3,4]}
grid_search = GridSearchCV(OneVsRestClassifier(SVC()), param_grid, cv=5, scoring = 'roc_auc', return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))



Best parameters: {'estimator__C': 100, 'estimator__degree': 2, 'estimator__kernel': 'poly'}
Best cross-validation score: 0.98




In [138]:
tdf = pd.DataFrame([['AUC', 'polynomial kernel', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf, ignore_index=True)


### Decision Tree AUC

In [139]:
param_grid = {'estimator__max_depth': [1,2,3,4,5,6,7,8,9,10], 'estimator__random_state': [0]}
grid_search = GridSearchCV(OneVsRestClassifier(DecisionTreeClassifier()), param_grid, scoring = 'roc_auc', cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {}".format(grid_search.best_score_))
best_clf = grid_search.best_estimator_

Best parameters: {'estimator__max_depth': 3, 'estimator__random_state': 0}
Best cross-validation score: 0.978151010115416


In [140]:
tdf = pd.DataFrame([['AUC', 'Decision Tree', grid_search.best_score_, grid_search.best_params_]], columns = ['Method','Model', 'Score','Parameters'])
clasmodels = clasmodels.append(tdf, ignore_index=True)


### Classification Model Evaluation and Selection

In [141]:
clasmodels

Unnamed: 0,Method,Model,Score,Parameters
0,Recall_Micro,KNN,0.977663,{'estimator__n_neighbors': 1}
1,GridSearchCV,KNN,0.977663,{'n_neighbors': 1}
2,Recall_Micro,Logistic Regression,0.975945,"{'estimator__C': 100, 'estimator__penalty': 'l1'}"
3,GridSearchCV,Logistic Regression,0.975945,"{'C': 100, 'penalty': 'l1'}"
4,Recall_Micro,Softmax Regression,0.974227,{'estimator__C': 1000}
5,GridSearchCV,Softmax Regression,0.967354,"{'C': 100, 'multi_class': 'multinomial', 'solv..."
6,Recall_Micro,Linear SVC,0.963918,{'estimator__C': 100}
7,GridSearchCV,Linear SVC,0.97079,{'C': 100}
8,Recall_Micro,SVC-rbf,0.9811,"{'estimator__C': 100, 'estimator__gamma': 100,..."
9,GridSearchCV,SVC-rbf,0.9811,"{'C': 100, 'gamma': 100, 'kernel': 'rbf'}"


In [142]:
clasmodels.loc[clasmodels['Score'].idxmax()]

Method                                                      AUC
Model                                                rbf kernel
Score                                                    0.9934
Parameters    {'estimator__C': 100, 'estimator__gamma': 10, ...
Name: 21, dtype: object

In [143]:
clasmodels.groupby(['Method'], sort=False)['Score'].max()

Method
Recall_Micro    0.981100
GridSearchCV    0.981100
Dummy           0.365979
AUC             0.993400
Name: Score, dtype: float64

### Comparison With Old Results

In [144]:
proj1clasresults = pd.read_csv('proj1clas.csv')
proj1clasresults

Unnamed: 0,Method,Model,Score,Parameters
0,Recall_Micro,KNN,0.972509,{'estimator__n_neighbors': 1}
1,GridSearchCV,KNN,0.972509,{'n_neighbors': 1}
2,Recall_Micro,Logistic Regression,0.975945,"{'estimator__C': 1000, 'estimator__penalty': '..."
3,GridSearchCV,Logistic Regression,0.972509,"{'C': 100, 'penalty': 'l1'}"
4,Recall_Micro,Softmax Regression,0.975945,{'estimator__C': 1000}
5,GridSearchCV,Softmax Regression,0.977663,"{'C': 100, 'multi_class': 'multinomial', 'solv..."
6,Recall_Micro,Linear SVC,0.967354,{'estimator__C': 100}
7,GridSearchCV,Linear SVC,0.967354,{'C': 100}
8,Recall_Micro,SVC-rbf,0.975945,"{'estimator__C': 10, 'estimator__gamma': 100, ..."
9,GridSearchCV,SVC-rbf,0.975945,"{'C': 10, 'gamma': 100, 'kernel': 'rbf'}"


In [147]:
proj1clasresults.loc[proj1clasresults['Score'].idxmax()]

Method                                                      AUC
Model                                             Decision Tree
Score                                                  0.995011
Parameters    {'estimator__max_depth': 4, 'estimator__random...
Name: 24, dtype: object

In [146]:
proj1clasresults.groupby(['Method'], sort=False)['Score'].max()

Method
Recall_Micro    0.993127
GridSearchCV    0.991409
Dummy           0.371134
AUC             0.995011
Name: Score, dtype: float64

In [None]:
(proj1clasresults['Score'] - clasmodels['Score']).mean()
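
Subtracting the two Score columns directly pairs rows by index, which is only meaningful if both tables list the same models in the same order. A more defensive sketch aligns the rows on Method and Model before taking the mean. The two small frames and their scores are made up for illustration and are not the notebook's results.

```python
import pandas as pd

# Hypothetical excerpts of the two result tables (scores are illustrative only).
old = pd.DataFrame({'Method': ['Recall_Micro', 'AUC'],
                    'Model': ['KNN', 'KNN'],
                    'Score': [0.97, 0.99]})
new = pd.DataFrame({'Method': ['AUC', 'Recall_Micro'],
                    'Model': ['KNN', 'KNN'],
                    'Score': [0.98, 0.95]})

# Pair each project 1 score with the PCA score for the same method and model.
merged = old.merge(new, on=['Method', 'Model'], suffixes=('_proj1', '_pca'))
mean_diff = (merged['Score_proj1'] - merged['Score_pca']).mean()
print(mean_diff)
```

The inner merge drops any model that appears in only one table, so the number of rows in merged is worth checking before trusting the average.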

With our previous set of models from project 1, we found the DecisionTreeClassifier model to be our best scoring model for both AUC and recall. Our PCA dataset, which reduces our original 24 features to 8 principal components that explain 95% of the variance in the data, presents the SVC with rbf kernel as the best scoring model for both recall and AUC. The SVC with rbf kernel scored just under the DecisionTreeClassifier model in our project 1 testing, so it is reasonable that PCA shifted the top result to another high scoring model.

Overall, the models trained on our PCA dataset scored slightly worse on recall and AUC. This is expected, since dimensionality reduction discards some information, which contributes to error. The difference between the PCA and non-PCA results was on average (insert average difference here).

### Deep Learning Algorithms with Classification

In [None]:
y = clas['Risk']
y = pd.get_dummies(y).values
clasx = clas.drop(['Risk'],axis=1)
X = clasx.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train.shape

In [None]:
model2 = Sequential()
# first hidden layer (receives the 24 input features)
model2.add(Dense(20, input_dim = 24, activation = 'relu'))
# further hidden layers
model2.add(Dense(10, activation = 'relu'))
model2.add(Dense(5, activation = 'relu'))
# output layer: one softmax unit per Risk class
model2.add(Dense(3, activation = 'softmax'))

# softmax output with one-hot targets, so use categorical cross-entropy
model2.compile(loss= 'categorical_crossentropy' , optimizer = 'adam', metrics = [metrics.Recall()] )

model2.fit(X_train, y_train, epochs = 55, batch_size = 30)
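
For a softmax output trained against one-hot targets, categorical cross-entropy is the conventional loss: it averages the negative log of the probability the network assigns to the true class. The NumPy sketch below computes it by hand. The logits, targets and function names are made up for illustration and are independent of Keras.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    # Mean over samples of -log(probability assigned to the true class).
    return float(-np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1)))

# Hypothetical logits for 2 samples over 3 classes, with one-hot targets.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.3, 3.0]])
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]])

probs = softmax(logits)
loss = categorical_crossentropy(y_onehot, probs)
print(probs.sum(axis=1), loss)
```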

In [None]:
model2.evaluate(X_train, y_train)

In [None]:
model2.evaluate(X_test, y_test)

For the classification deep learning task, the neural network had strong results compared to our regression tasks. We used recall as our metric, since recall is what we used for the classification tasks in the projects. We developed the above neural network and tuned it by varying the number of epochs and the batch size. Even though we could reach scores as high as 0.9950 on the training or test set, we decided to leave it at a relatively high value where the training and testing scores were well aligned. Overall, the neural network for classification was a success and could produce the same results as our best models from the projects, but much faster.