# Performing Multiple Types of Classification

This notebook covers two ways to encode categorical data (label encoding and one-hot encoding) and applies several classification algorithms to the encoded data. Each algorithm's section links to the scikit-learn documentation, which explains in more depth how the algorithm is implemented and how it works.

In [1]:
import pandas as pd

### German Credit Dataset
#### Source : https://www.mldata.io/dataset-details/german_credit_data/

checking_account_status	 - Status of existing checking account (A11: < 0 DM, A12: 0 <= x < 200 DM, A13 : >= 200 DM / salary assignments for at least 1 year, A14 : no checking account)

duration	-	Duration in month

credit_history	-	A30: no credits taken/ all credits paid back duly, A31: all credits at this bank paid back duly, A32: existing credits paid back duly till now, A33: delay in paying off in the past, A34 : critical account/ other credits existing (not at this bank)

purpose	- 	Purpose of Credit (A40 : car (new), A41 : car (used), A42 : furniture/equipment, A43 : radio/television, A44 : domestic appliances, A45 : repairs, A46 : education, A47 : (vacation - does not exist?), A48 : retraining, A49 : business, A410 : others)

savings	-	Savings in accounts/bonds (A61 : < 100 DM, A62 : 100 <= x < 500 DM, A63 : 500 <= x < 1000 DM, A64 : >= 1000 DM, A65 : unknown/ no savings account)

credit_amount	-	Credit amount

present_employment	- 	A71 : unemployed, A72 : < 1 year, A73 : 1 <= x < 4 years, A74 : 4 <= x < 7 years, A75 : .. >= 7 years

installment_rate	-	Installment Rate in percentage of disposable income

personal	-	Personal Marital Status and Sex (A91 : male : divorced/separated, A92 : female : divorced/separated/married, A93 : male : single, A94 : male : married/widowed, A95 : female : single)

other_debtors	-	A101 : none, A102 : co-applicant, A103 : guarantor

present_residence	-	Present residence since

property	-	A121 : real estate, A122 : if not A121 : building society savings agreement/ life insurance, A123 : if not A121/A122 : car or other, not in attribute 6, A124 : unknown / no property

age	-	Age in years

other_installment_plans	-	A141 : bank, A142 : stores, A143 : none

customer_type	-	Target class: 1=Good, 2=Bad

In [2]:
data = pd.read_csv('datasets/german_credit_data_dataset.csv')

data.head()

Unnamed: 0,checking_account_status,duration,credit_history,purpose,credit_amount,savings,present_employment,installment_rate,personal,other_debtors,...,property,age,other_installment_plans,housing,existing_credits,job,dependents,telephone,foreign_worker,customer_type
0,A11,6,A34,A43,1169.0,A65,A75,4.0,A93,A101,...,A121,67.0,A143,A152,2.0,A173,1,A192,A201,1
1,A12,48,A32,A43,5951.0,A61,A73,2.0,A92,A101,...,A121,22.0,A143,A152,1.0,A173,1,A191,A201,2
2,A14,12,A34,A46,2096.0,A61,A74,2.0,A93,A101,...,A121,49.0,A143,A152,1.0,A172,2,A191,A201,1
3,A11,42,A32,A42,7882.0,A61,A74,2.0,A93,A103,...,A122,45.0,A143,A153,1.0,A173,2,A191,A201,1
4,A11,24,A33,A40,4870.0,A61,A73,3.0,A93,A101,...,A124,53.0,A143,A153,2.0,A173,2,A191,A201,2


In [3]:
data.shape

(1000, 21)

In [4]:
data.columns

Index(['checking_account_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings', 'present_employment', 'installment_rate',
       'personal', 'other_debtors', 'present_residence', 'property', 'age',
       'other_installment_plans', 'housing', 'existing_credits', 'job',
       'dependents', 'telephone', 'foreign_worker', 'customer_type'],
      dtype='object')

In [5]:
data = data.drop(['telephone', 'personal', 'present_residence', 'other_installment_plans'], axis=1)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   checking_account_status  1000 non-null   object 
 1   duration                 1000 non-null   int64  
 2   credit_history           1000 non-null   object 
 3   purpose                  1000 non-null   object 
 4   credit_amount            1000 non-null   float64
 5   savings                  1000 non-null   object 
 6   present_employment       1000 non-null   object 
 7   installment_rate         1000 non-null   float64
 8   other_debtors            1000 non-null   object 
 9   property                 1000 non-null   object 
 10  age                      1000 non-null   float64
 11  housing                  1000 non-null   object 
 12  existing_credits         1000 non-null   float64
 13  job                      1000 non-null   object 
 14  dependents               

In [7]:
data['savings'].unique()

array(['A65', 'A61', 'A63', 'A64', 'A62'], dtype=object)

### Label encoding and one hot encoding

In [8]:
from sklearn.preprocessing import LabelEncoder

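# Manual ordinal encoding: A65 (unknown / no savings) -> 0, A61..A64 -> 1..4.
savings_dict = {"A65": 0, "A61": 1, "A62": 2, "A63": 3, "A64": 4}

# Assign back rather than using inplace=True on a column selection.
data['savings'] = data['savings'].map(savings_dict)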

data.head()

Unnamed: 0,checking_account_status,duration,credit_history,purpose,credit_amount,savings,present_employment,installment_rate,other_debtors,property,age,housing,existing_credits,job,dependents,foreign_worker,customer_type
0,A11,6,A34,A43,1169.0,0,A75,4.0,A101,A121,67.0,A152,2.0,A173,1,A201,1
1,A12,48,A32,A43,5951.0,1,A73,2.0,A101,A121,22.0,A152,1.0,A173,1,A201,2
2,A14,12,A34,A46,2096.0,1,A74,2.0,A101,A121,49.0,A152,1.0,A172,2,A201,1
3,A11,42,A32,A42,7882.0,1,A74,2.0,A103,A122,45.0,A153,1.0,A173,2,A201,1
4,A11,24,A33,A40,4870.0,1,A73,3.0,A101,A124,53.0,A153,2.0,A173,2,A201,2
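
The hand-written dictionary above is used because the savings codes have a natural order that does not match their alphabetical order. As a minimal sketch on a toy column (the `toy` frame and the `savings_order` name are illustrative and not part of the notebook), the following compares the manual mapping with `LabelEncoder`, which assigns integers in sorted order of the labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column holding the five savings codes seen in the dataset.
toy = pd.DataFrame({"savings": ["A65", "A61", "A63", "A64", "A62"]})

# Manual ordinal mapping: A65 (unknown / no savings) is 0,
# then A61..A64 follow in increasing order of savings.
savings_order = {"A65": 0, "A61": 1, "A62": 2, "A63": 3, "A64": 4}
toy["manual"] = toy["savings"].map(savings_order)

# LabelEncoder sorts the labels first, so A61 -> 0 and A65 -> 4.
toy["label_encoder"] = LabelEncoder().fit_transform(toy["savings"])

print(toy)
```

With `LabelEncoder`, "unknown / no savings account" would end up with the largest code, which is why a dictionary is the safer choice when the order of the integers is meant to carry meaning.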


In [9]:
data = pd.get_dummies(data, columns=['checking_account_status', 
                                     'credit_history', 
                                     'purpose',
                                     'present_employment',
                                     'property', 
                                     'housing', 
                                     'other_debtors',
                                     'job', 
                                     'foreign_worker'])

data.shape

(1000, 48)
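
The jump in column count comes from `pd.get_dummies` replacing each listed column with one indicator column per distinct value. A small sketch on made-up rows (the `toy` frame below is illustrative only) shows the effect:

```python
import pandas as pd

# Three made-up rows with one numeric and two categorical columns.
toy = pd.DataFrame({
    "duration": [6, 48, 12],
    "housing": ["A152", "A152", "A153"],
    "job": ["A173", "A172", "A173"],
})

# Each categorical column is replaced by one indicator column per category;
# columns not listed in `columns` are left untouched.
encoded = pd.get_dummies(toy, columns=["housing", "job"])

print(encoded.columns.tolist())
print(encoded.shape)
```

Here two categorical columns with two distinct values each become four indicator columns, so the frame goes from 3 columns to 5.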

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 48 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     1000 non-null   int64  
 1   credit_amount                1000 non-null   float64
 2   savings                      1000 non-null   int64  
 3   installment_rate             1000 non-null   float64
 4   age                          1000 non-null   float64
 5   existing_credits             1000 non-null   float64
 6   dependents                   1000 non-null   int64  
 7   customer_type                1000 non-null   int64  
 8   checking_account_status_A11  1000 non-null   uint8  
 9   checking_account_status_A12  1000 non-null   uint8  
 10  checking_account_status_A13  1000 non-null   uint8  
 11  checking_account_status_A14  1000 non-null   uint8  
 12  credit_history_A30           1000 non-null   uint8  
 13  credit_history_A31 

In [11]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

### Naive Baye's Classifier
https://scikit-learn.org/stable/modules/naive_bayes.html

In [12]:
def naive_bayes(x_train, y_train):

    classifier = GaussianNB()
    classifier.fit(x_train, y_train)
    
    return classifier

### K-nearest-neighbors Classifier
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [13]:
def k_nearest_neighbors(x_train, y_train):

    classifier = KNeighborsClassifier(n_neighbors=10)
    classifier.fit(x_train, y_train)
    
    return classifier

### Support Vector Classifier
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [14]:
def svc(x_train, y_train):

    classifier = SVC(kernel='rbf', gamma='scale')
    classifier.fit(x_train, y_train)
    
    return classifier

### Decision Tree Classifier
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [15]:
def decision_tree(x_train, y_train):

    classifier = DecisionTreeClassifier(max_depth=6)
    classifier.fit(x_train, y_train)
    
    return classifier

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [17]:
X = data.drop('customer_type', axis=1)
Y = data['customer_type']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
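
The split above is random, so the scores reported later will differ from run to run. One possible way to make a split repeatable and keep the good/bad ratio the same in both parts is to pass `random_state` and `stratify`. The sketch below uses synthetic data and an assumed 70/30 class ratio purely for illustration; the notebook itself does not set these arguments:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, classes 1 and 2 in a 70/30 ratio.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 70 + [2] * 30)

# random_state fixes the shuffle; stratify keeps the class ratio in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(x_tr.shape, x_te.shape)
print("class 2 rows in test split:", int((y_te == 2).sum()))
```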

In [18]:
def build_and_train_classifier(x_train, y_train, classification_fn):

    # Fit the model with the supplied function, then predict on the held-out
    # test split (x_test and y_test are the notebook-level variables from In [17]).
    model = classification_fn(x_train, y_train)
    y_pred = model.predict(x_test)
    
    train_score = model.score(x_train, y_train)
    test_score = accuracy_score(y_test, y_pred)

    print("Training Score : ", train_score)
    print("Testing Score : ", test_score)
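
The helper above reads the test split from notebook-level variables and only prints its results. A variant that takes the test data as arguments and returns the scores could look like the sketch below. The name `evaluate_classifier`, the `fit_fns` dictionary, and the synthetic data are assumptions made for this example, not part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate_classifier(fit_fn, x_train, y_train, x_test, y_test):
    """Fit a model with fit_fn and return train/test accuracy as a dict."""
    model = fit_fn(x_train, y_train)
    return {
        "train": model.score(x_train, y_train),
        "test": accuracy_score(y_test, model.predict(x_test)),
    }


# Synthetic binary classification data standing in for the credit dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
xa, xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Each entry fits and returns a classifier, like the functions defined earlier.
fit_fns = {
    "naive_bayes": lambda x, y: GaussianNB().fit(x, y),
    "decision_tree": lambda x, y: DecisionTreeClassifier(max_depth=6, random_state=0).fit(x, y),
}

results = {name: evaluate_classifier(fn, xa, ya, xb, yb) for name, fn in fit_fns.items()}
for name, scores in results.items():
    print(name, scores)
```

Returning the scores makes it easier to collect results for several classifiers in one table instead of reading them off separate cell outputs.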

In [19]:
build_and_train_classifier(x_train, y_train, naive_bayes)

Training Score :  0.72875
Testing Score :  0.745


In [20]:
build_and_train_classifier(x_train, y_train, k_nearest_neighbors)

Training Score :  0.72125
Testing Score :  0.7


In [21]:
build_and_train_classifier(x_train, y_train, svc)

Training Score :  0.70625
Testing Score :  0.72


In [22]:
build_and_train_classifier(x_train, y_train, decision_tree)

Training Score :  0.80625
Testing Score :  0.685


### Split the training data further into 2 parts to test warm_start

In [23]:
x_train_1, x_train_2, y_train_1, y_train_2 = train_test_split(x_train, y_train, test_size=0.5)

### Random Forest Classifier
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [24]:
rfc = RandomForestClassifier(max_depth=4, n_estimators=2, warm_start=True)

In [25]:
rfc.fit(x_train_1, y_train_1)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=4, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=2,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=True)

In [26]:
y_pred = rfc.predict(x_test)

In [27]:
test_score = accuracy_score(y_test, y_pred)
print("Testing Score : ", test_score)

Testing Score :  0.705


In [28]:
rfc.n_estimators += 2

rfc.fit(x_train_2, y_train_2)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=4, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=4,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=True)

In [29]:
y_pred = rfc.predict(x_test)

In [30]:
test_score = accuracy_score(y_test, y_pred)

print("Testing Score : ", test_score)

Testing Score :  0.75
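
With `warm_start=True`, a second call to `fit` keeps the trees that already exist and only trains the additional ones requested by the larger `n_estimators`. In the cells above, this means the first two trees were trained on the first half of the training data and the two new trees on the second half. The sketch below checks this behaviour on synthetic data (the `forest`, `X_a`, and `X_b` names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data split into two halves, mirroring x_train_1 / x_train_2.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_a, y_a = X_demo[:100], y_demo[:100]
X_b, y_b = X_demo[100:], y_demo[100:]

forest = RandomForestClassifier(max_depth=4, n_estimators=2,
                                warm_start=True, random_state=0)
forest.fit(X_a, y_a)
first_trees = list(forest.estimators_)  # remember the two trees fitted so far

# Ask for two more trees; only the new ones are trained on the second half.
forest.n_estimators += 2
forest.fit(X_b, y_b)

print("number of trees:", len(forest.estimators_))
print("first two trees kept:",
      all(old is new for old, new in zip(first_trees, forest.estimators_[:2])))
```

Note that the existing trees are not refitted on the new data, so this is a way of growing the ensemble in stages rather than true incremental learning of each tree.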
