## Classification Problems

<div style="border-bottom: 3px solid black; margin-bottom:5px"></div>
<div style="border-bottom: 3px solid black"></div>

# 1. Importing Required Libraries

The code cell below imports the libraries used throughout this notebook.

**Run the code cell below** 

In [1]:
from scipy.io import arff
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.svm               # For SVC
import sklearn.model_selection   # For GridSearchCV and RandomizedSearchCV
import scipy
import scipy.stats               # For reciprocal distribution
import warnings
import sklearn.linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report, f1_score, confusion_matrix, recall_score, precision_score,accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder,StandardScaler,MinMaxScaler
warnings.filterwarnings("ignore", category=DeprecationWarning)  # Ignore sklearn deprecation warnings
warnings.filterwarnings("ignore", category=FutureWarning)       # Ignore warnings about upcoming API changes
warnings.filterwarnings("ignore", category=UserWarning)         # Ignore general library warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)      # Ignore numerical runtime warnings

<div style="border-bottom: 3px solid black; margin-bottom:5px"></div>
<div style="border-bottom: 3px solid black"></div>

# 2. Loading All Datasets

The code block below will load all the datasets for classification problems.


**Dataset Mapper**
1. Diabetic Retinopathy -> CP_1
2. Default of credit card clients -> CP_2
3. Breast Cancer Wisconsin -> CP_3
4. Statlog (Australian credit approval) -> CP_4
5. Statlog (German credit data) -> CP_5
6. Steel Plates Faults -> CP_6
7. Adult -> CP_7
8. Yeast -> CP_8
9. Thoracic Surgery Data -> CP_9
10. Seismic-Bumps -> CP_10

**Run the code cell below to load the data** 

In [2]:
np.random.seed(23)

def splitData(data):
    """Splits the data into features (X) and labels (y); the label is the last column."""
    X = data.iloc[:,:len(data.columns)-1]
    y = data.iloc[:,-1]
    return X,y

def getTrainTestData(data):
    """
    Splits the data into a training set and a testing set.
    The train:test size ratio is 80:20.
    """
    X,y = splitData(data)
    y = y.astype(int)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X,y,test_size=0.2,random_state=0)
    return X_train, X_test, y_train, y_test

def convertCategorical(df):
    """Encodes every object-typed (categorical) column as integer labels."""
    categorical_feature_mask = df.dtypes==object
    categorical_cols = df.columns[categorical_feature_mask].tolist()
    le = LabelEncoder()
    df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
    return df

def check(df):
    """Replaces ' ?' entries in each affected column with that column's most frequent value."""
    for x in df.columns[df.isin([' ?']).any()]:
        most_frequent = df[x].value_counts().idxmax()
        df[x] = df[x].replace(' ?', most_frequent)
    return df

def minMax(x):
    """Returns the min and max value of every column."""
    return pd.Series(index=['min','max'],data=[x.min(),x.max()])
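
The preprocessing helpers above can be tried on a small frame before running them on the real files. The sketch below is only an illustration: the column names and values are made up, and it uses one fresh `LabelEncoder` per column and `select_dtypes` rather than the exact calls in `check` and `convertCategorical`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the column names and values are hypothetical.
df = pd.DataFrame({
    "workclass": [" Private", " ?", " Private", " State-gov"],
    "age": [39, 50, 38, 53],
    "income": [" <=50K", " >50K", " <=50K", " >50K"],
})

# Step 1: replace the ' ?' placeholder with each column's most frequent value.
for col in df.columns[df.isin([" ?"]).any()]:
    most_frequent = df[col].value_counts().idxmax()
    df[col] = df[col].replace(" ?", most_frequent)

# Step 2: integer-encode every non-numeric column.
for col in df.select_dtypes(exclude="number").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# Step 3: the last column is the label, everything before it is a feature.
X, y = df.iloc[:, :-1], df.iloc[:, -1]
print(X)
print(y.tolist())
```

`LabelEncoder` assigns codes in sorted order of the values it sees, so a train file and a test file encoded separately only receive matching codes when both contain the same set of categories.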

In [3]:
# Diabetic Retinopathy Data | 19 Features | 1151 Samples
CP_1 = arff.loadarff('CP_Data/messidor_features.arff')
CP_1 = pd.DataFrame(CP_1[0])
print('Class balance count for diabetic retinopathy data set')
print(CP_1.iloc[:,-1].value_counts())
CP_1_X_train, CP_1_X_test, CP_1_y_train, CP_1_y_test = getTrainTestData(CP_1)
print('------------------------------------------------------------')


# Default of Credit Card Clients Data| 23 Features | 30000 Samples
CP_2 = pd.read_excel('CP_Data/credit.xls',header=None,skiprows=2)
CP_2 = CP_2.drop(0, axis=1)  # Drop the first column (ID)
CP_2.columns = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
print('Class balance count for default credit card clients data set')
print(CP_2.iloc[:,-1].value_counts())
CP_2_X_train, CP_2_X_test, CP_2_y_train, CP_2_y_test = getTrainTestData(CP_2)
print('------------------------------------------------------------')

# Breast Cancer Wisconsin Data | 10 Features | 699 Samples
CP_3 = pd.read_csv("CP_Data/breast-cancer-wisconsin.data", sep=",",header=None)
# Missing values are marked '?'; replace them with the placeholder 5.5 before casting column 6 to int
CP_3=CP_3.replace('?', 5.5)
CP_3[6]=CP_3[6].astype(int)
CP_3[10]= CP_3[10].replace(2,0)
CP_3[10]= CP_3[10].replace(4,1)
print('Class balance count for breast cancer data set')
print(CP_3.iloc[:,-1].value_counts())
CP_3_X_train, CP_3_X_test, CP_3_y_train, CP_3_y_test = getTrainTestData(CP_3)
print('------------------------------------------------------------')

# Australian Credit Approval Data | 14 Features | 690 Samples
CP_4 = pd.read_csv("CP_Data/australian.dat", sep=r"\s+",header=None)
print('Class balance count for australian credit approval data set')
print(CP_4.iloc[:,-1].value_counts())
CP_4_X_train, CP_4_X_test, CP_4_y_train, CP_4_y_test = getTrainTestData(CP_4)
print('------------------------------------------------------------')

# German Credit Data | 24 Features | 1000 Samples
CP_5 = pd.read_csv("CP_Data/german.data-numeric", sep=r"\s+",header=None)
print('Class balance count for german credit data set')
print(CP_5.iloc[:,-1].value_counts())
CP_5_X_train, CP_5_X_test, CP_5_y_train, CP_5_y_test = getTrainTestData(CP_5)
print('------------------------------------------------------------')

# Steel Plates Faults Data | 33 Features | 1941 Samples
CP_6 = pd.read_csv("CP_Data/Faults.NNA", sep=r"\s+",header=None)
print('Class balance count for steel plates faults data set')
print(CP_6.iloc[:,-1].value_counts())
CP_6_X_train, CP_6_X_test, CP_6_y_train, CP_6_y_test = getTrainTestData(CP_6)
print('------------------------------------------------------------')

# Adult Data | 14 Features | 48842 Samples
CP_7 = pd.read_csv("CP_Data/adult.data", sep=",",header=None)
if(' ?' in CP_7.values):
    CP_7=check(CP_7)
CP_7= convertCategorical(CP_7)
print('Class balance count for adult data set')
print(CP_7.iloc[:,-1].value_counts())
CP_7_X_train, CP_7_y_train = splitData(CP_7)
CP_7_test = pd.read_csv("CP_Data/adult.test", sep=",",header=None,skiprows=1)
if(' ?' in CP_7_test.values):
    CP_7_test=check(CP_7_test)
CP_7_test= convertCategorical(CP_7_test)
CP_7_X_test, CP_7_y_test = splitData(CP_7_test)
print('------------------------------------------------------------')

# Yeast Data | 9 Features | 1484 Samples
CP_8 = pd.read_csv("CP_Data/yeast.data", sep=r"\s+",header=None)
CP_8= convertCategorical(CP_8)
X1,y1 = splitData(CP_8)
print('Class balance count for yeast data set')
print(CP_8.iloc[:,-1].value_counts())
CP_8_X_train, CP_8_X_test, CP_8_y_train, CP_8_y_test = sklearn.model_selection.train_test_split(X1,y1,test_size=0.2,random_state=0)
#CP_8_X_train, CP_8_X_test, CP_8_y_train, CP_8_y_test = getTrainTestData(CP_8)
print('------------------------------------------------------------')


# Thoracic Surgery Data | 16 Features | 470 Samples
CP_9 = arff.loadarff('CP_Data/ThoraricSurgery.arff')
CP_9 = pd.DataFrame(CP_9[0])
CP_9.columns = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
CP_9= convertCategorical(CP_9)
print('Class balance count for thoracic surgery data set')
print(CP_9.iloc[:,-1].value_counts())
CP_9_X_train, CP_9_X_test, CP_9_y_train, CP_9_y_test = getTrainTestData(CP_9)
print()
#X2,y2 = splitData(CP_9)
#CP_9_X_train, CP_9_X_test, CP_9_y_train, CP_9_y_test = sklearn.model_selection.train_test_split(X2,y2,test_size=0.2,random_state=0)
#CP_9_X_train, CP_9_X_test, CP_9_y_train, CP_9_y_test = getTrainTestData(CP_9)
print('------------------------------------------------------------')


# Seismic-Bumps | 18 Features | 2584 Samples
# Columns 13, 14 and 15 are deleted because each holds a single constant value
CP_10 = arff.loadarff('CP_Data/seismic-bumps.arff')
CP_10 = pd.DataFrame(CP_10[0])
#plt.matshow(CP_101.corr())
#plt.show()
CP_10.columns = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
del CP_10[13]
del CP_10[14]
del CP_10[15]
CP_10.columns = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
CP_10= convertCategorical(CP_10)
print('Class balance count for seismic bumps data set')
print(CP_10.iloc[:,-1].value_counts())
CP_10_X_train, CP_10_X_test, CP_10_y_train, CP_10_y_test = getTrainTestData(CP_10)
print('------------------------------------------------------------')

print('Classification Data Loaded Successfully.')

Class balance count for diabetic retinopathy data set
b'1'    611
b'0'    540
Name: Class, dtype: int64
------------------------------------------------------------
Class balance count for default credit card clients data set
0    23364
1     6636
Name: 23, dtype: int64
------------------------------------------------------------
Class balance count for breast cancer data set
0    458
1    241
Name: 10, dtype: int64
------------------------------------------------------------
Class balance count for australian credit approval data set
0    383
1    307
Name: 14, dtype: int64
------------------------------------------------------------
Class balance count for german credit data set
1    700
2    300
Name: 24, dtype: int64
------------------------------------------------------------
Class balance count for steel plates faults data set
0    1268
1     673
Name: 33, dtype: int64
------------------------------------------------------------
Class balance count for adult data set
0    24720
1

<div style="border-bottom: 3px solid black; margin-bottom:5px"></div>
<div style="border-bottom: 3px solid black"></div>

# 3. Classifiers and Hyperparameter Search Helper Methods

<div style="border-bottom: 3px solid black"></div>

## Hyperparameter Search Helper Methods

#### Logistic Regression Parameter Search

In [4]:
def randomSearchLRC(model,X_train,y_train):
    print('Randomized Search')
    param_distributions = {
        'C'     : scipy.stats.reciprocal(0.01, 1000.),
        'solver' : ['newton-cg', 'lbfgs', 'sag', 'saga'],
    }
    randcv = sklearn.model_selection.RandomizedSearchCV(model, param_distributions,cv=5, n_iter=50,n_jobs=4,  random_state=23).fit(X_train,y_train)
    print(randcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (randcv.best_score_*100))
    return randcv.best_estimator_

In [5]:
def gridSearchLRC(model,X_train,y_train):
    print('Grid Search')
    param_grid = { 
        'C' : np.logspace(-2, 3, 10),
        'solver' : ['newton-cg', 'lbfgs', 'sag', 'saga'],
    }
    gridcv = sklearn.model_selection.GridSearchCV(model, param_grid,n_jobs=4,   cv=5).fit(X_train,y_train)
    print(gridcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (gridcv.best_score_*100))
    return gridcv.best_estimator_
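
Both helpers search `C` over the same interval, 0.01 to 1000, but in different ways: the grid uses 10 log-spaced points, while the randomized search draws from a reciprocal (log-uniform) distribution. The sketch below compares the two; the sample size and seed are illustrative choices and will not necessarily reproduce the exact candidates that `RandomizedSearchCV` evaluates.

```python
import numpy as np
import scipy.stats

# Grid search: 10 values evenly spaced in log10(C) between 1e-2 and 1e3.
grid_C = np.logspace(-2, 3, 10)

# Randomized search: C is drawn from a reciprocal (log-uniform) distribution
# over the same interval, so every decade is equally likely.
dist_C = scipy.stats.reciprocal(0.01, 1000.0)
sampled_C = dist_C.rvs(size=50, random_state=23)

print("grid values   :", np.round(grid_C, 3))
print("sampled range :", sampled_C.min(), "to", sampled_C.max())

# Count how many of the 50 draws fall into each decade [10^k, 10^(k+1)).
decade, count = np.unique(np.floor(np.log10(sampled_C)).astype(int), return_counts=True)
print(dict(zip(decade.tolist(), count.tolist())))
```

The random draws can land between grid points, which is the main reason to run both searches on a continuous parameter such as `C`.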

#### Decision Tree Parameter Search


In [6]:
def randomSearchDTC(model,X_train,y_train):
    print('Randomized Search')
    param_grid = {
        'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
        'max_features': ['auto', 'sqrt'],
        'min_samples_leaf': [1, 2, 4],
        'min_samples_split': [2, 5, 10],
    }
    randcv = sklearn.model_selection.RandomizedSearchCV(model, param_grid, n_jobs=4, cv=5, n_iter=50,random_state=0).fit(X_train,y_train)
    print(randcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (randcv.best_score_*100))
    return randcv.best_estimator_

In [7]:
def gridSearchDTC(model,X_train,y_train):
    print('Grid Search')
    param_grid = {
        'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
        'max_features': ['auto', 'sqrt'],
        'min_samples_leaf': [1, 2, 4],
        'min_samples_split': [2, 5, 10],
    }
    gridcv = sklearn.model_selection.GridSearchCV(model, param_grid, n_jobs=4,  cv=5).fit(X_train,y_train)
    print(gridcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (gridcv.best_score_*100))
    return gridcv.best_estimator_


#### K-Nearest Neighbours Parameter Search

In [8]:
def randomSearchKNN(model,X_train,y_train):
    print('Randomized Search')
    param_distributions = {
        'n_neighbors'     : [3,5,11,19],
        'weights' : ['uniform', 'distance'],
        'p' : [1,2],
        'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'],
    }
    randcv = sklearn.model_selection.RandomizedSearchCV(model, param_distributions,cv=5,n_jobs=4,  n_iter=50,  random_state=0).fit(X_train,y_train)
    print(randcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (randcv.best_score_*100))
    return randcv.best_estimator_

In [9]:
def gridSearchKNN(model,X_train,y_train):
    print('Grid Search')
    param_grid = { 
        'n_neighbors' : [3,5,11,19], 
        'weights' : ['uniform', 'distance'], 
        'p': [1,2], 
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
    }
    gridcv = sklearn.model_selection.GridSearchCV(model, param_grid,n_jobs=4,   cv=5).fit(X_train,y_train)
    print(gridcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (gridcv.best_score_*100))
    return gridcv.best_estimator_

#### Random Forest Parameter Search


In [10]:
def randomSearchRFC(model,X_train,y_train):
    print('Randomized Search')
    param_grid = {
        'bootstrap': [True, False],
        'max_depth': [10, 90, 100, None],
        'max_features': ['auto', 'sqrt'],
        'min_samples_leaf': [1, 2, 4],
        'min_samples_split': [2, 5, 10],
        'n_estimators': [10,20,50,100,150,200,250]
    }
    randcv = sklearn.model_selection.RandomizedSearchCV(model, param_grid, cv=5,n_jobs=4,  n_iter=50, random_state=0).fit(X_train,y_train)
    print(randcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (randcv.best_score_*100))
    return randcv.best_estimator_

In [11]:
def gridSearchRFC(model,X_train,y_train):
    print('Grid Search')
    param_grid = {
        'bootstrap': [True, False],
        'max_depth': [10, 90, 100, None],
        'max_features': ['auto', 'sqrt'],
        'min_samples_leaf': [1, 2, 4],
        'min_samples_split': [2, 5, 10],
        'n_estimators': [10,20,50,100,150,200,250]
    }   
    gridcv = sklearn.model_selection.GridSearchCV(model, param_grid,n_jobs=4,   cv=5).fit(X_train,y_train)
    print(gridcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (gridcv.best_score_*100))
    return gridcv.best_estimator_
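
The cost difference between the two random forest searches is easy to estimate by counting model fits. The sketch below copies the grid from `gridSearchRFC` and multiplies out the combinations; it counts cross-validation fits only and leaves out the final refit on the full training set.

```python
import math

# Values copied from gridSearchRFC; only the number of options per key matters here.
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 90, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [10, 20, 50, 100, 150, 200, 250],
}
cv_folds = 5   # cv=5 in both helpers
n_iter = 50    # n_iter=50 in randomSearchRFC

# Grid search tries every combination once per fold.
n_combinations = math.prod(len(values) for values in param_grid.values())
grid_fits = n_combinations * cv_folds

# With list-valued parameters the random search samples without replacement,
# so it can never try more candidates than the grid contains.
random_fits = min(n_iter, n_combinations) * cv_folds

print(n_combinations)  # 1008
print(grid_fits)       # 5040
print(random_fits)     # 250
```

The same arithmetic applied to the AdaBoost helpers gives only 2 x 5 = 10 combinations, so there the randomized search with `n_iter=50` cannot explore more than the grid does.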

#### AdaBoost Parameter Search

In [12]:
def randomSearchABC(model,X_train,y_train):
    print('Randomized Search')
    param_dist = {
        'n_estimators': [50, 100],
        'learning_rate' : [0.01,0.05,0.1,0.3,1],
    }
    randcv = sklearn.model_selection.RandomizedSearchCV(model, param_dist,n_jobs=4,  n_iter=50, random_state=0).fit(X_train,y_train)
    print(randcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (randcv.best_score_*100))
    return randcv.best_estimator_

In [13]:
def gridSearchABC(model,X_train,y_train):
    print('Grid Search')
    param_dist = {
        'n_estimators': [50, 100],
        'learning_rate' : [0.01,0.05,0.1,0.3,1],
    }
    gridcv = sklearn.model_selection.GridSearchCV(model, param_dist,n_jobs=4,  cv=5).fit(X_train,y_train)
    print(gridcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (gridcv.best_score_*100))
    return gridcv.best_estimator_

#### Neural Network Classification Parameter Search

In [14]:
def randomSearchNNC(model,X_train,y_train):
    print('Randomized Search')
    
    x1 = range(3)
    param_dist = {        
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'learning_rate': ['constant','adaptive'],
    'batch_size': np.power(2, x1),
    'momentum': [0.3,0.4,0.6,0.7,0.9],
    }
    
    randcv = sklearn.model_selection.RandomizedSearchCV(model, param_dist,n_jobs=4,  n_iter=50, random_state=0)
    randcv.fit(X_train,y_train)
    print(randcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (randcv.best_score_*100))
    return randcv.best_estimator_

In [15]:
def gridSearchNNC(model,X_train,y_train):
    print('Grid Search')
    param_dist = {        
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'learning_rate': ['constant','adaptive'],
    'batch_size': [ 1, 2, 4],
    'momentum': [0.3,0.4,0.6,0.7,0.9],
    }
    
    gridcv = sklearn.model_selection.GridSearchCV(model, param_dist,n_jobs=4,  cv=5).fit(X_train,y_train)
    print(gridcv.best_params_)
    print("%.1f%% accuracy on validation sets (average)" % (gridcv.best_score_*100))
    return gridcv.best_estimator_

#### Support Vector Machine Parameter Search

In [16]:
def randomSearchSVM(model,X_train,y_train):
    print('Randomized Search')
    param_distributions = {
        'C'     : scipy.stats.reciprocal(1.0, 1000.),
        'gamma' : scipy.stats.reciprocal(0.01, 10.),
    }
    random_search = sklearn.model_selection.RandomizedSearchCV(model, param_distributions,cv=5,n_jobs=4, n_iter=30, random_state=23).fit(X_train,y_train)
    print(random_search.best_params_)
    return random_search.best_estimator_

In [17]:
def gridSearchSVM(model,X_train,y_train):
    print('Grid Search')
    param_grid = {
        'C': [0.001, 0.01, 0.1, 1, 10],
        'gamma' : [0.001, 0.01, 0.1, 1]
    }
    gridcv =  sklearn.model_selection.GridSearchCV(model, param_grid,n_jobs=4, cv=5).fit(X_train,y_train)
    print(gridcv.best_params_)
    return gridcv.best_estimator_

<div style="border-bottom: 3px solid black"></div>

## Score Helper Method

In [18]:
def scoreHelper(clf, X_train, X_test, y_train, y_test, parity):
    parity = str(parity)  # Accept the dataset number as either an int or a string
    print('Training Accuracy : ',clf.score(X_train,y_train))
    print('Testing Accuracy : ',clf.score(X_test,y_test))
    if parity in ('1', '2', '3', '5'):
        print('Training Recall score : ',recall_score(y_train,clf.predict(X_train)))
        print("Testing Recall score : ", recall_score(y_test,clf.predict(X_test)))
    elif parity == '4':
        print('Training Precision score : ',precision_score(y_train,clf.predict(X_train)))
        print("Testing Precision score : ", precision_score(y_test,clf.predict(X_test)))
    else:
        print("Training f1 score : ", f1_score(y_train,clf.predict(X_train),average=None))
        print("Testing f1 score : ", f1_score(y_test,clf.predict(X_test),average=None))
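
`scoreHelper` reports recall for datasets 1, 2, 3 and 5, precision for dataset 4, and per-class F1 (`average=None`) for the rest. The sketch below computes the same quantities by hand on a made-up, imbalanced toy example, to show why accuracy alone can look good while the minority class is never detected. The helper name `precision_recall_f1` is hypothetical and is not part of the notebook or of scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Undefined ratios are reported as 0.0 here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 9 negatives and 1 positive, with a classifier that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_per_class = [precision_recall_f1(y_true, y_pred, label)[2] for label in (0, 1)]

print(accuracy)      # 0.9
print(f1_per_class)  # F1 for class 0 is 18/19, F1 for class 1 is 0.0
```

A per-class F1 pair with a high first value and a zero second value, next to a high accuracy, is the signature of a model that predicts only the majority class.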

<div style="border-bottom: 3px solid black"></div>

## Classifiers

#### Logistic Regression (Classification)

In [19]:
def LRC(X_train, X_test, y_train, y_test, parity, hs, C=1, solver='newton-cg'):
    print('\nResult for Logistic Regression Classification')
    clf = LogisticRegression(C=C, solver=solver)    
    if hs:
        clf = gridSearchLRC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
        clf = randomSearchLRC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
    else:
        clf.fit(X_train, y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)

#### Gaussian Naive Bayes

In [20]:
def NBC(X_train, X_test, y_train, y_test, parity, hs):
    print('\nResult for Gaussian Naive Bayes Classification')
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    scoreHelper(clf, X_train, X_test, y_train, y_test, parity)

#### K-Nearest Neighbours

In [21]:
def KNN(X_train, X_test, y_train, y_test, parity, hs, n_neighbors=5, weights='uniform', p=2, algorithm='auto'):
    print('\nResult for K-Nearest Neighbours Classification')
    clf = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, p=p, algorithm=algorithm)
    if hs:
        clf = gridSearchKNN(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
        clf = randomSearchKNN(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
    else:
        clf.fit(X_train, y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)

#### Support Vector Machine

In [22]:
def SVM(X_train, X_test, y_train, y_test, parity, hs, C=1, gamma='scale'):
    print('\nResult for SVM Classification')
    clf = SVC(C=C, gamma=gamma)
    if hs:
        clf = gridSearchSVM(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
        clf = randomSearchSVM(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
    else:
        clf.fit(X_train, y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)

#### Decision Tree Classification

In [23]:
def DTC(X_train, X_test, y_train, y_test, parity, hs, max_depth=None, max_features=None,min_samples_leaf=1,min_samples_split=2):
    print('\nResult for Decision Tree Classification')
    clf = tree.DecisionTreeClassifier(random_state = 0, max_depth=max_depth, max_features=max_features,min_samples_leaf=min_samples_leaf,min_samples_split=min_samples_split)
    if hs:
        clf = gridSearchDTC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
        clf = randomSearchDTC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
    else:
        clf.fit(X_train, y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)    

#### Random Forest

In [24]:
def RFC(X_train, X_test, y_train, y_test, parity, hs, max_depth=None, max_features='sqrt',min_samples_leaf=1,min_samples_split=2, bootstrap=True,n_estimators=100):
    print('\nResult for Random Forest Classification')
    clf = RandomForestClassifier(max_depth=max_depth, max_features=max_features,min_samples_leaf=min_samples_leaf,min_samples_split=min_samples_split, bootstrap=bootstrap, n_estimators=n_estimators, random_state=0)
    if hs:
        clf = gridSearchRFC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
        clf = randomSearchRFC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
    else:
        clf.fit(X_train, y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)

#### AdaBoost 

In [25]:
def ABC(X_train, X_test, y_train, y_test, parity, hs, n_estimators=50, learning_rate=1):
    print('\nResult for AdaBoost Classification')
    clf = AdaBoostClassifier(n_estimators=n_estimators,learning_rate=learning_rate,random_state=0)
    if hs:
        clf = gridSearchABC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
        clf = randomSearchABC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
    else:
        clf.fit(X_train, y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)

#### Neural Network (MLPClassifier)

In [26]:
def NNC(X_train, X_test, y_train, y_test, parity, hs,  activation='relu', solver='adam', learning_rate='constant',batch_size='auto', momentum=0.0):
    print('\nResult for Neural Network Classification')
    clf = MLPClassifier(activation=activation, solver=solver, learning_rate=learning_rate,batch_size=batch_size, momentum=momentum)
    if hs:
        clf = gridSearchNNC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
        clf = randomSearchNNC(clf,X_train,y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)
    else:
        clf.fit(X_train, y_train)
        scoreHelper(clf, X_train, X_test, y_train, y_test, parity)

<div style="border-bottom: 3px solid black; margin-bottom:5px"></div>
<div style="border-bottom: 3px solid black"></div>

# 4. Working with Datasets

In [27]:
def classification(X_train, X_test, y_train, y_test, parity, hs = False):
    LRC(X_train, X_test, y_train, y_test, parity, hs)
    NBC(X_train, X_test, y_train, y_test, parity, hs)
    DTC(X_train, X_test, y_train, y_test, parity, hs)
    RFC(X_train, X_test, y_train, y_test, parity, hs)
    KNN(X_train, X_test, y_train, y_test, parity, hs)
    ABC(X_train, X_test, y_train, y_test, parity, hs)
    NNC(X_train, X_test, y_train, y_test, parity, hs)
    SVM(X_train, X_test, y_train, y_test, parity, hs)

In [28]:
print('Diabetic Retinopathy Dataset')
classification(CP_1_X_train, CP_1_X_test, CP_1_y_train, CP_1_y_test,'1',hs = False)
print('-------------------------------------------------------')

print('\n\nDefault of Credit Card Clients Dataset')
classification(CP_2_X_train, CP_2_X_test, CP_2_y_train, CP_2_y_test,'2',hs = False)
print('-------------------------------------------------------')

print('\n\nBreast Cancer Wisconsin Dataset')
classification(CP_3_X_train, CP_3_X_test, CP_3_y_train, CP_3_y_test,'3',hs = False)
print('-------------------------------------------------------')

print('\n\nAustralian Credit Approval Dataset')
classification(CP_4_X_train, CP_4_X_test, CP_4_y_train, CP_4_y_test, '4',hs = False)
print('-------------------------------------------------------')

print('\n\nGerman Credit Dataset')
classification(CP_5_X_train, CP_5_X_test, CP_5_y_train, CP_5_y_test,'5',hs = False)
print('-------------------------------------------------------')

print('\n\nSteel Plates Faults Dataset')
classification(CP_6_X_train, CP_6_X_test, CP_6_y_train, CP_6_y_test,'6',hs = False)
print('-------------------------------------------------------')

print('\n\nAdult Dataset')
classification(CP_7_X_train, CP_7_X_test, CP_7_y_train, CP_7_y_test,'7',hs = False)
print('-------------------------------------------------------')

print('\n\nYeast Dataset')
classification(CP_8_X_train, CP_8_X_test, CP_8_y_train, CP_8_y_test,'8',hs = False)
print('-------------------------------------------------------')

print('\n\nThoracic Surgery Dataset')
classification(CP_9_X_train, CP_9_X_test, CP_9_y_train, CP_9_y_test,'9',hs = False)
print('-------------------------------------------------------')

print('\n\nSeismic-Bumps Dataset')
classification(CP_10_X_train, CP_10_X_test, CP_10_y_train, CP_10_y_test,'10',hs = False)
print('-------------------------------------------------------')

Diabetic Retinopathy Dataset

Result for Logistic Regression Classification
Training Accuracy :  0.7630434782608696
Testing Accuracy :  0.7359307359307359
Training Recall score :  0.7006237006237006
Testing Recall score :  0.676923076923077

Result for Gaussian Naive Bayes Classification
Training Accuracy :  0.6054347826086957
Testing Accuracy :  0.5367965367965368
Training Recall score :  0.3076923076923077
Testing Recall score :  0.23846153846153847

Result for Decision Tree Classification
Training Accuracy :  1.0
Testing Accuracy :  0.5887445887445888
Training Recall score :  1.0
Testing Recall score :  0.6076923076923076

Result for Random Forest Classification
Training Accuracy :  1.0
Testing Accuracy :  0.6623376623376623
Training Recall score :  1.0
Testing Recall score :  0.6307692307692307

Result for K-Nearest Neighbours Classification
Training Accuracy :  0.7804347826086957
Testing Accuracy :  0.6190476190476191
Training Recall score :  0.735966735966736
Testing Recall score

Training Accuracy :  0.9188144329896907
Testing Accuracy :  0.9357326478149101
Training f1 score :  [0.93936477 0.87719298]
Testing f1 score :  [0.95344507 0.89626556]

Result for Gaussian Naive Bayes Classification
Training Accuracy :  0.5038659793814433
Testing Accuracy :  0.46786632390745503
Training f1 score :  [0.3984375  0.57785088]
Testing f1 score :  [0.37837838 0.53483146]

Result for Decision Tree Classification
Training Accuracy :  1.0
Testing Accuracy :  1.0
Training f1 score :  [1. 1.]
Testing f1 score :  [1. 1.]

Result for Random Forest Classification
Training Accuracy :  1.0
Testing Accuracy :  1.0
Training f1 score :  [1. 1.]
Testing f1 score :  [1. 1.]

Result for K-Nearest Neighbours Classification
Training Accuracy :  0.759020618556701
Testing Accuracy :  0.6606683804627249
Training f1 score :  [0.82291667 0.62298387]
Testing f1 score :  [0.7509434  0.46774194]

Result for AdaBoost Classification
Training Accuracy :  1.0
Testing Accuracy :  1.0
Training f1 score :  

Training Accuracy :  1.0
Testing Accuracy :  0.9400386847195358
Training f1 score :  [1. 1.]
Testing f1 score :  [0.96903097 0.06060606]

Result for K-Nearest Neighbours Classification
Training Accuracy :  0.9332365747460087
Testing Accuracy :  0.9381044487427466
Training f1 score :  [0.96518668 0.18823529]
Testing f1 score :  [0.968      0.05882353]

Result for AdaBoost Classification
Training Accuracy :  0.9366231253023706
Testing Accuracy :  0.9342359767891683
Training f1 score :  [0.96699421 0.20606061]
Testing f1 score :  [0.966 0.   ]

Result for Neural Network Classification
Training Accuracy :  0.9308176100628931
Testing Accuracy :  0.9439071566731141
Training f1 score :  [0.96416938 0.        ]
Testing f1 score :  [0.97114428 0.        ]

Result for SVM Classification
Training Accuracy :  0.9308176100628931
Testing Accuracy :  0.9477756286266924
Training f1 score :  [0.96416938 0.        ]
Testing f1 score :  [0.97318769 0.        ]
--------------------------------------------

# Novelty 1. Heart Disease Classification

### Feature Information

1. age - in years
2. sex - (1 = male; 0 = female)
3. cp - chest pain type
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
5. chol - serum cholesterol in mg/dl
6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg - resting electrocardiographic results
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
11. slope - the slope of the peak exercise ST segment
12. ca - number of major vessels (0-3) colored by fluoroscopy
13. thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target - 1 = heart disease or 0 = no heart disease

In [29]:
print('Novelty1. Heart Disease Classification')
heart_data = pd.read_csv("CP_Data/heart.csv", sep=",")
X_train, X_test, y_train, y_test = getTrainTestData(heart_data)
classification(X_train, X_test, y_train, y_test,'14',hs = False)
print('---------------------------------------------------------')

Novelty1. Heart Disease Classification

Result for Logistic Regression Classification
Training Accuracy :  0.8388429752066116
Testing Accuracy :  0.8524590163934426
Training f1 score :  [0.81516588 0.85714286]
Testing f1 score :  [0.82352941 0.87323944]

Result for Gaussian Naive Bayes Classification
Training Accuracy :  0.8347107438016529
Testing Accuracy :  0.8524590163934426
Training f1 score :  [0.81481481 0.85074627]
Testing f1 score :  [0.82352941 0.87323944]

Result for Decision Tree Classification
Training Accuracy :  1.0
Testing Accuracy :  0.7868852459016393
Training f1 score :  [1. 1.]
Testing f1 score :  [0.77192982 0.8       ]

Result for Random Forest Classification
Training Accuracy :  1.0
Testing Accuracy :  0.8852459016393442
Training f1 score :  [1. 1.]
Testing f1 score :  [0.87272727 0.89552239]

Result for K-Nearest Neighbours Classification
Training Accuracy :  0.78099173553719
Testing Accuracy :  0.639344262295082
Training f1 score :  [0.76444444 0.7953668 ]
Testi

<div style="border-bottom: 3px solid black; margin-bottom:5px"></div>
<div style="border-bottom: 3px solid black"></div>

## For Best Models
Use the code cell below to rerun a single model with its best parameters.

1. Find the dataset and its parameters in the Configs file (be careful with the parity value: parity = the number of the dataset you are testing).
2. Pass the dataset and the parameters to the model.
3. Run the cell to reproduce the best result.

Use the matching CP_#_ prefix for the dataset arguments, for example CP_3_X_train, CP_3_X_test, ...

**Dataset Mapper**
1. Diabetic Retinopathy -> CP_1
2. Default of credit card clients -> CP_2
3. Breast Cancer Wisconsin -> CP_3
4. Statlog (Australian credit approval) -> CP_4
5. Statlog (German credit data) -> CP_5
6. Steel Plates Faults -> CP_6
7. Adult -> CP_7
8. Yeast -> CP_8
9. Thoracic Surgery Data -> CP_9
10. Seismic-Bumps -> CP_10

**Model Mapper**
1. Logistic Regression Classifier -> LRC()
2. Naive Bayes Classifier -> NBC()
3. Decision Tree Classifier -> DTC()
4. Random Forest Classifier -> RFC()
5. K-Nearest Neighbors -> KNN()
6. AdaBoost Classifier -> ABC()
7. Neural Network Classifier -> NNC()
8. Support Vector Machine Classifier -> SVM()


In [30]:
# LRC(CP_10_X_train, CP_10_X_test, CP_10_y_train, CP_10_y_test, parity='10', hs=False, C=0.01, solver='sag')
# NBC(X_train, X_test, y_train, y_test, parity='1', hs=False)
# DTC(X_train, X_test, y_train, y_test, parity='1', hs=False, max_depth=10, max_features='sqrt',min_samples_leaf=1,min_samples_split=2)
# RFC(X_train, X_test, y_train, y_test, parity='1', hs=False, max_depth=10, max_features='sqrt',min_samples_leaf=1,min_samples_split=2, bootstrap=False,n_estimators=100)
# KNN(X_train, X_test, y_train, y_test, parity='1', hs=False, n_neighbors=3, weights='uniform', p=1, algorithm='auto')
# ABC(X_train, X_test, y_train, y_test, parity='1', hs=False, n_estimators=50, learning_rate=1)
# NNC(X_train, X_test, y_train, y_test, parity='1', hs=False, activation='tanh', solver='adam', learning_rate='adaptive', momentum=0.6)
# SVM(X_train, X_test, y_train, y_test, parity='1', hs=False, C=1, gamma='scale')