# LAB5

Name: AKSHAY KEKUDA

In [1]:
import matplotlib.pyplot as plt
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import (
    OneHotEncoder, LabelEncoder, Binarizer, KBinsDiscretizer,
    MaxAbsScaler, StandardScaler, MinMaxScaler
)

In [2]:
source_df = pd.read_csv('bank_customer_turnover.csv')
df = source_df.copy(True)

In [3]:
df.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

The data set does not have any missing values.

In [4]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [5]:
df.shape

(10000, 14)

In [6]:
source_df.shape

(10000, 14)

In [7]:
source_df.CreditScore.describe()

count    10000.000000
mean       650.528800
std         96.653299
min        350.000000
25%        584.000000
50%        652.000000
75%        718.000000
max        850.000000
Name: CreditScore, dtype: float64

In [8]:
source_df.Geography.value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

In [9]:
source_df.Age.describe()
source_df.EstimatedSalary.describe()

count     10000.000000
mean     100090.239881
std       57510.492818
min          11.580000
25%       51002.110000
50%      100193.915000
75%      149388.247500
max      199992.480000
Name: EstimatedSalary, dtype: float64

## Data Dictionary of preprocessed data

|#|Attribute|Data Type|Range|Attribute Type|Description|
|-:|:-|:-|:-|:-|:-|
|1|RowNumber|int64|[1,10000]|integer|Row number of the record. Not important for the analysis|
|2|CustomerId|int64|random integers|integer|Customer ID. Not important for the analysis|
|3|Surname|object|String|word|Surname of the customer. Not important for the analysis|
|4|CreditScore|int64|[350,850]|integer|Credit score of the customer|
|5|Geography|object|France, Germany, Spain|categorical, nominal|Region. This is one-hot encoded|
|6|Gender|object|Male or Female|categorical, nominal|Gender of the customer. This can be coded as 1/0|
|7|Age|int64|[18,92]|numeric|Age of the customer|
|8|Tenure|int64|[0,10]|numeric|Number of years the customer has been with the bank|
|9|Balance|float64|[0.0, 250898.09]|numeric|Balance of the customer in the bank|
|10|NumOfProducts|int64|[1,4]|numeric|Number of products the customer has in the bank|
|11|HasCrCard|int64|0 or 1|categorical, nominal|1 signifies that the customer has a credit card and 0 signifies that the customer does not|
|12|IsActiveMember|int64|0 or 1|categorical, nominal|1 means the customer is an active member and 0 means the customer is not|
|13|EstimatedSalary|float64|[11.58,199992.48]|numeric|Estimated salary of the customer|
|14|Exited|int64|0 or 1|categorical|0 signifies that the customer is with the bank and 1 signifies that the customer has exited the bank|

In [10]:
df.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

The data types of the data set are as expected

In [11]:
df['CustomerId'].is_unique

True

Since the CustomerId column returned True for the is_unique check, we can say that no customer appears more than once in the data.

Let us now analyze the Gender attribute

In [12]:
df['Gender'].value_counts()

Male      5457
Female    4543
Name: Gender, dtype: int64

In [13]:
df['Gender'].describe()

count     10000
unique        2
top        Male
freq       5457
Name: Gender, dtype: object

In [None]:
data = df.Gender.value_counts(normalize=True) * 100
sns.set_style("whitegrid")
fig = plt.figure(figsize=(12,4), dpi=75)
ax1 = fig.add_subplot(111)
sns.barplot(x=data.keys(), y=data, ci=None, palette="muted",orient='v', ax=ax1)
ax1.set_title("Distribution of gender of customers", fontsize=15)
ax1.set_xlabel ("Gender")
ax1.set_ylabel ("% of customers")
plt.show()

Let us now look at the Age of the customers

In [None]:
df.Age.describe()

In [None]:
df.Age.hist()
plt.show()

In [None]:
print("The most common customer age is {} years".format(df.Age.mode()[0]))
print("Median age of customers is {} years".format(int(df.Age.median())))

In [None]:
bp = sns.boxplot(data = df['Age'])
plt.show()

The box plot shows that the age data has some outliers, so we will remove the values above the upper IQR fence.

In [None]:
def iqr_fence(x):
    Q1 = x.quantile(0.25)
    Q3 = x.quantile(0.75)
    IQR = Q3 - Q1
    Lower_Fence = Q1 - (1.5 * IQR)
    Upper_Fence = Q3 + (1.5 * IQR)
    return [Upper_Fence,Lower_Fence]
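As a quick sanity check of the 1.5 * IQR rule used above, here is a minimal sketch on made-up ages. The helper name `iqr_bounds` and the sample values are assumptions for illustration only; unlike `iqr_fence`, it returns the lower fence first.

```python
import pandas as pd

def iqr_bounds(x):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up ages, not taken from the bank data
ages = pd.Series([30, 32, 35, 37, 40, 42, 44, 46, 90])
lower, upper = iqr_bounds(ages)

# Anything outside the fences is treated as an outlier
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper, outliers.tolist())
```

Here Q1 = 35 and Q3 = 44, so the fences are 21.5 and 57.5 and only the age 90 falls outside them.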

In [None]:
temp = source_df[(source_df['Age']>61)]
temp.Exited.value_counts()

In [None]:
max_age = iqr_fence(df['Age'])[0]
print(max_age)
print("Share of age outliers = {:.2f}%".format((df['Age'] > max_age).mean() * 100))
df=df[df['Age']<=max_age]
sns.boxplot(data = df['Age'])
plt.show()

In [None]:
temp = source_df[(source_df['Age']>58)]
temp.Exited.value_counts()

Let us look at the credit score

In [None]:
sns.boxplot(data = df['CreditScore'])
plt.show()

In [None]:
min_score = iqr_fence(df['CreditScore'])[1]
(df['CreditScore']>min_score).value_counts()/100

About 0.15% of the credit scores are outliers, so we can afford to drop them.

In [None]:
df=df[df['CreditScore']>= min_score]
df.shape

In [None]:
sns.boxplot(data = df['CreditScore'])
plt.show()

Let us now look at the NumOfProducts attribute

In [None]:
bp = sns.boxplot(data = df['NumOfProducts'])
plt.show()

In [None]:
temp = df[(df['NumOfProducts']==4)]
temp.Exited.value_counts()

All customers with 4 products have exited the bank. So we will remove these points


In [None]:
df=df[df['NumOfProducts']<= 3]
bp = sns.boxplot(data = df['NumOfProducts'])
plt.show()

Let us look at the salary attribute of the customers.

In [None]:
sns.boxplot(data = df['EstimatedSalary'])
plt.show()

In [None]:
df.EstimatedSalary.describe()

In [None]:
print("Median salary of customers is ${:.3f}".format((df.EstimatedSalary.median())))
print("Mean salary of customers is ${:.3f}".format((df.EstimatedSalary.mean())))


Let us look at the class Exited

In [None]:
df.Exited.value_counts()

Since the data doesn't describe what 1 and 0 stand for, I assume 1 signifies that the customer has exited the bank and 0 that the customer has not. Since only about 20% of the data belongs to the exited class, we may have a class imbalance problem here.

In [None]:
count_class_0, count_class_1 = df.Exited.value_counts()
df_class_0 = df[df['Exited'] == 0]
df_class_1 = df[df['Exited'] == 1]
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df = pd.concat([df_class_1_over, df_class_0], axis=0)

print('Random over-sampling:')
print(df.Exited.value_counts(normalize=True))
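The random over-sampling in the cell above can be illustrated on a toy frame. This is only a sketch under assumptions: the 8/2 class split and the `Age` values are made up, and `reset_index(drop=True)` is an addition that avoids the duplicate index labels that sampling with replacement otherwise leaves behind.

```python
import pandas as pd

# Made-up toy data: 8 customers who stayed (0) and 2 who exited (1)
toy = pd.DataFrame({"Exited": [0] * 8 + [1] * 2, "Age": range(30, 40)})

majority = toy[toy["Exited"] == 0]
minority = toy[toy["Exited"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_over = minority.sample(n=len(majority), replace=True, random_state=0)

# reset_index gives every row a unique label again
balanced = pd.concat([majority, minority_over], axis=0).reset_index(drop=True)
counts = balanced["Exited"].value_counts()
print(counts)
```

Unique index labels matter for any later step that looks rows up by label.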

In [None]:
df.shape

In [None]:
# sns.lmplot(data=df, x="Age", fit_reg=False, hue="Exited")
# plt.show()

In [None]:
# fig = plt.figure(figsize=(14,14), dpi=75)
# sns.pairplot(df[df.columns[1:]], kind="scatter", markers=["o", "D"], hue="Exited")

In [None]:
# sns.relplot(x="EstimatedSalary", y="Age", hue="Exited", col="Gender", data=df);

In [None]:
fig = plt.figure(figsize=(14,5), dpi=75)
plt.subplot(121)
sns.histplot(df, x = 'EstimatedSalary', hue = 'Gender')
plt.subplot(122)
sns.histplot(df, x = 'Age', hue = 'Gender')
plt.show()

The above graphs show that the data set has a similar number of customers in all salary ranges. Also, many of the customers are around 40 years of age.

Before we proceed, we can encode the Male and Female genders as 1 and 0 respectively.

In [None]:
df.head()

In [None]:
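# Vectorized encoding: Male -> 1, Female -> 0.
# This avoids mixing label-based (.at) and position-based (.iloc) lookups,
# which no longer line up once rows have been filtered and resampled.
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})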

In [None]:
df.Gender.value_counts()

In [None]:
df['Gender'] = df['Gender'].astype(int)
df.dtypes

We have now converted the Gender data type to type int

In [None]:
df.boxplot(column = ["EstimatedSalary", "Age"])
plt.show()

As CreditScore, Age, Tenure, Balance, NumOfProducts and EstimatedSalary are on different scales, we will normalize these columns.


In [None]:
scaled_df = df.copy(deep=True)
scaler = MinMaxScaler()
cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
scaled_df[cols] = scaler.fit_transform(scaled_df[cols])
scaled_df.head()
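With its default settings, `MinMaxScaler` maps each value x in a column to (x - min) / (max - min), so every scaled column lies in [0, 1]. A minimal pure-Python sketch of that formula follows. `min_max_scale` is a hypothetical helper, and the three scores are the minimum, first-row and maximum CreditScore values shown in the outputs earlier; the notebook fits the scaler after outlier removal, so its fitted minimum may differ.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Minimum, first-row and maximum CreditScore from the raw data
scores = [350, 619, 850]
scaled = min_max_scale(scores)
print(scaled)
```

For the score 619 this gives (619 - 350) / (850 - 350) = 0.538.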

In [None]:
df.Geography.value_counts()

In [None]:
scaled_df=pd.get_dummies(scaled_df, columns=['Geography'])
scaled_df.head()
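To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up frame. The rows are illustrative only; the point is that the Geography column is replaced by one indicator column per country.

```python
import pandas as pd

# Made-up rows covering the three countries in the data set
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 700],
    "Geography": ["France", "Spain", "France", "Germany"],
})

encoded = pd.get_dummies(toy, columns=["Geography"])
dummy_cols = [c for c in encoded.columns if c.startswith("Geography_")]

# Exactly one indicator is set in every row
row_sums = encoded[dummy_cols].sum(axis=1)
print(encoded.columns.tolist())
```

The indicator columns are named after the original column plus the category, for example `Geography_France`.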

Here the attributes RowNumber, CustomerId and Surname are not necessary for our analysis. We can drop them.

In [None]:
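# Pass the column list by keyword; the positional axis argument is not accepted by recent pandas
scaled_df = scaled_df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])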

In [None]:
scaled_df.dtypes

In [None]:
fig = plt.figure(figsize=(12,5), dpi=75)
sns.boxplot(data=scaled_df[cols])
plt.show()

From the above graph, we can see that the data has no outliers. 

In [None]:
fig = plt.figure(figsize=(10,5), dpi=75)
plt.subplot(121)
sns.violinplot(x=scaled_df['Gender'], y=scaled_df['EstimatedSalary'])
plt.subplot(122)
sns.violinplot(x=scaled_df['Gender'], y=scaled_df['Age'])
plt.show()

The above violin plot emphasizes that the distributions for males and females are not totally similar. There are more males (encoded as 1 above) around the median salary and age.

In [None]:
scaled_df.describe()

In [None]:
scaled_df.dtypes

In [None]:
cols = ['CreditScore',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'EstimatedSalary']

In [None]:
df_corr= scaled_df[cols].corr()
df_corr

In [None]:
cols.append('Exited')
scaled_df[cols].head()

## Preprocessing

1.   Attributes RowNumber, Customer ID, and Surname were removed
2.   Numerical attributes CreditScore, Age, Tenure, Balance, NumOfProducts and EstimatedSalary were scaled with a min-max scaler
3. The Geography attribute was one-hot encoded into three attributes (France, Spain, Germany) using get_dummies
4. Outliers in Age, NumOfProducts and CreditScore were removed after analyzing box plots
5. Since the Exited class is split roughly 80% not exited and 20% exited, there is a class imbalance. To tackle this I tried both undersampling and oversampling the data, which gave varying results
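Steps 1 to 3 above can be sketched as a single function. This is an illustrative sketch, not the notebook's exact code: `preprocess` is a hypothetical helper, it scales with plain pandas arithmetic instead of `MinMaxScaler`, it skips the outlier removal and resampling steps, and the three input rows are copied from the `df.head()` output shown earlier.

```python
import pandas as pd

ID_COLS = ["RowNumber", "CustomerId", "Surname"]
NUMERIC_COLS = ["CreditScore", "Age", "Tenure", "Balance",
                "NumOfProducts", "EstimatedSalary"]

def preprocess(raw):
    """Drop id columns, encode Gender, min-max scale numerics, one-hot Geography."""
    out = raw.drop(columns=ID_COLS)
    out["Gender"] = out["Gender"].map({"Male": 1, "Female": 0})
    num = out[NUMERIC_COLS].astype(float)
    out[NUMERIC_COLS] = (num - num.min()) / (num.max() - num.min())
    return pd.get_dummies(out, columns=["Geography"])

# First three rows of the data set, as printed by df.head()
raw = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Female"],
    "Age": [42, 41, 42],
    "Tenure": [2, 1, 8],
    "Balance": [0.0, 83807.86, 159660.8],
    "NumOfProducts": [1, 1, 3],
    "HasCrCard": [1, 0, 1],
    "IsActiveMember": [1, 1, 0],
    "EstimatedSalary": [101348.88, 112542.58, 113931.57],
    "Exited": [1, 0, 1],
})

clean = preprocess(raw)
print(clean.columns.tolist())
```

Keeping the steps in one function makes it easier to apply the same transformation to the undersampled and oversampled variants of the data.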




In [None]:
fig = plt.figure(figsize=(8,8), dpi=75)
mask = np.triu(np.ones_like(df_corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)

sns.heatmap(df_corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

Here we see that age and balance are the most strongly correlated pair. Similarly, credit score and tenure are the next most closely correlated.
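Reading the strongest pairs off a heatmap can be error-prone, so one option is to rank the pairs numerically. The sketch below is an assumption-laden illustration: `rank_pairs` is a hypothetical helper and the columns `a`, `b`, `c` are made-up data, but the same function could be applied to `df_corr`.

```python
import numpy as np
import pandas as pd

def rank_pairs(corr):
    """Return each column pair once, sorted by absolute correlation (descending)."""
    # Keep only the strictly lower triangle so (x, y) and (y, x) are not both listed
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up columns: b rises with a, c tends to fall as a rises
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})
ranked = rank_pairs(toy.corr())
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.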

We will now start with building the classifiers

In [None]:
scaled_df.sample(5)

In [None]:
# Move the target column 'Exited' to the end so that all features come first
df_cols = scaled_df.columns.to_list()
df_cols.remove('Exited')
df_cols.append('Exited')
scaled_df = scaled_df[df_cols]

In [None]:
df_cols

In [None]:
scaled_df.head(5)

In [None]:
scaled_df.shape

In [None]:
X = scaled_df[df_cols[0:-1]]
y = scaled_df[df_cols[-1]]

## k-fold cross validation using accuracy score

In [None]:
kf = KFold(shuffle=True, random_state=0)

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
err_train = []
err_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    knn.fit(X_train, y_train)
    
    y_hat_tr = knn.predict(X_train)
    
    err_tr = 1 - accuracy_score(y_train, y_hat_tr)
    err_train.append(err_tr)
    
    y_hat_te = knn.predict(X_test)
    
    err_te = 1 - accuracy_score(y_test, y_hat_te)
    
    err_test.append(err_te)
    
    print('Fold {}: err.train={:0.4f}, err.test={:0.4f}'.format(fold+1, err_tr, err_te))

print("KNN err_train(avg)={:0.4f}, err_test(avg)={:0.4f}".format(np.mean(err_train),np.mean(err_test)))

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
err_train = []
err_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    lr.fit(X_train, y_train)
    
    y_hat_tr = lr.predict(X_train)
    
    err_tr = 1 - accuracy_score(y_train, y_hat_tr)
    err_train.append(err_tr)
    
    y_hat_te = lr.predict(X_test)
    
    err_te = 1 - accuracy_score(y_test, y_hat_te)
    
    err_test.append(err_te)
    
    print('Fold {}: err.train={:0.4f}, err.test={:0.4f}'.format(fold+1, err_tr, err_te))

print(" Logistic Regression err_train(avg)={:0.4f}, err_test(avg)={:0.4f}".format(np.mean(err_train),np.mean(err_test)))

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
err_train = []
err_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    dtc.fit(X_train, y_train)
    
    y_hat_tr = dtc.predict(X_train)
    
    err_tr = 1 - accuracy_score(y_train, y_hat_tr)
    err_train.append(err_tr)
    
    y_hat_te = dtc.predict(X_test)
    
    err_te = 1 - accuracy_score(y_test, y_hat_te)
    
    err_test.append(err_te)
    
    print('Fold {}: err.train={:0.4f}, err.test={:0.4f}'.format(fold+1, err_tr, err_te))

print("Decision Tree Classifiers err_train(avg)={:0.4f}, err_test(avg)={:0.4f}".format(np.mean(err_train),np.mean(err_test)))

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
err_train = []
err_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    nb.fit(X_train, y_train)
    
    y_hat_tr = nb.predict(X_train)
    
    err_tr = 1 - accuracy_score(y_train, y_hat_tr)
    err_train.append(err_tr)
    
    y_hat_te = nb.predict(X_test)
    
    err_te = 1 - accuracy_score(y_test, y_hat_te)
    
    err_test.append(err_te)
    
    print('Fold {}: err.train={:0.4f}, err.test={:0.4f}'.format(fold+1, err_tr, err_te))

print(" Naive Bayes err_train(avg)={:0.4f}, err_test(avg)={:0.4f}".format(np.mean(err_train),np.mean(err_test)))

In [None]:
from sklearn.svm import SVC
svc = SVC()
err_train = []
err_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    svc.fit(X_train, y_train)
    
    y_hat_tr = svc.predict(X_train)
    
    err_tr = 1 - accuracy_score(y_train, y_hat_tr)
    err_train.append(err_tr)
    
    y_hat_te = svc.predict(X_test)
    
    err_te = 1 - accuracy_score(y_test, y_hat_te)
    
    err_test.append(err_te)
    
    print('Fold {}: err.train={:0.4f}, err.test={:0.4f}'.format(fold+1, err_tr, err_te))

print("SVC err_train(avg)={:0.4f}, err_test(avg)={:0.4f}".format(np.mean(err_train),np.mean(err_test)))

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
err_train = []
err_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    rfc.fit(X_train, y_train)
    
    y_hat_tr = rfc.predict(X_train)
    
    err_tr = 1 - accuracy_score(y_train, y_hat_tr)
    err_train.append(err_tr)
    
    y_hat_te = rfc.predict(X_test)
    
    err_te = 1 - accuracy_score(y_test, y_hat_te)
    
    err_test.append(err_te)
    
    print('Fold {}: err.train={:0.4f}, err.test={:0.4f}'.format(fold+1, err_tr, err_te))

print(" Random Forest Classifier err_train(avg)={:0.4f}, err_test(avg)={:0.4f}".format(np.mean(err_train),np.mean(err_test)))
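The six cells above repeat the same loop with a different classifier each time. One way to reduce the duplication is a small helper like the one sketched here. `cv_errors` is a hypothetical name, and the demo data comes from `make_classification` rather than the bank data set, so the printed numbers are not comparable to the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def cv_errors(model, X, y, kf, metric=accuracy_score):
    """Fit the model on each fold and return (train_errors, test_errors)."""
    X, y = np.asarray(X), np.asarray(y)  # accepts DataFrames/Series as well
    train_err, test_err = [], []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        train_err.append(1 - metric(y[train_idx], model.predict(X[train_idx])))
        test_err.append(1 - metric(y[test_idx], model.predict(X[test_idx])))
    return train_err, test_err

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

tr, te = cv_errors(KNeighborsClassifier(n_neighbors=5), X_demo, y_demo, kf_demo)
print("KNN err_train(avg)={:0.4f}, err_test(avg)={:0.4f}".format(np.mean(tr), np.mean(te)))
```

Passing `metric=f1_score` would return 1 minus the F1 score per fold, so the same helper could serve both evaluation sections with a small adjustment.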


Summarizing the above results, we have the following generalization errors using the accuracy score measure on 5 fold runs:
<br></br>

|#|Classifier Type|Generalization error (run 1)|Generalization error (run 2)|Generalization error (run 3)|Generalization error (run 4)|
|-:|:-|:-|:-|:-|:-|
|1|LogisticRegression|0.26|0.16|0.17|0.18|
|2|KNeighborsClassifier|0.16|0.08|0.08|0.09|
|3|DecisionTreeClassifier|0.07|0.12|0.00|0.14|
|4|GaussianNB|0.43|0.09|0.10|0.11|
|5|SVC|0.20|0.12|0.15|0.16|
|6|RandomForestClassifier|0.04|0.08|0.00|0.11|

As can be seen from the above table, the Random Forest classifier gives generalization errors at or below those of the Decision Tree classifier in every run, and these two, together with KNN, have the lowest error rates overall.



## k-fold cross validation using F1 measure
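The F1 score is the harmonic mean of precision and recall for the positive class; `f1_score` as called in the cells below scores the class labelled 1 (exited). A minimal pure-Python sketch on made-up labels shows how it can differ from accuracy. `f1_from_labels` is a hypothetical helper written for illustration.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute F1 for the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 exited customers, 5 who stayed
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

f1 = f1_from_labels(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("F1={:.3f}, accuracy={:.3f}".format(f1, accuracy))
```

Here precision and recall are both 2/3, so F1 is 2/3 while accuracy is 0.75, because the correctly predicted negatives count towards accuracy but not towards F1.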


In [None]:
f1_train = []
f1_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    knn.fit(X_train, y_train)
    
    y_hat_tr = knn.predict(X_train)
    
    f1_tr = f1_score(y_train, y_hat_tr)
    f1_train.append(f1_tr)
    
    y_hat_te = knn.predict(X_test)
    
    f1_te = f1_score(y_test, y_hat_te)
    
    f1_test.append(f1_te)
    
    print('Fold {}: f1.train={:0.2f}, f1.test={:0.2f}'.format(fold+1, f1_tr, f1_te))

print("KNN f1_train(avg)={:0.2f}, f1_test(avg)={:0.2f}".format(np.mean(f1_train),np.mean(f1_test)))

In [None]:
f1_train = []
f1_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    lr.fit(X_train, y_train)
    
    y_hat_tr = lr.predict(X_train)
    
    f1_tr = f1_score(y_train, y_hat_tr)
    f1_train.append(f1_tr)
    
    y_hat_te = lr.predict(X_test)
    
    f1_te = f1_score(y_test, y_hat_te)
    
    f1_test.append(f1_te)
    
    print('Fold {}: f1.train={:0.2f}, f1.test={:0.2f}'.format(fold+1, f1_tr, f1_te))

print("Logistic Regression f1_train(avg)={:0.2f}, f1_test(avg)={:0.2f}".format(np.mean(f1_train),np.mean(f1_test)))

In [None]:
f1_train = []
f1_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    dtc.fit(X_train, y_train)
    
    y_hat_tr = dtc.predict(X_train)
    
    f1_tr = f1_score(y_train, y_hat_tr)
    f1_train.append(f1_tr)
    
    y_hat_te = dtc.predict(X_test)
    
    f1_te = f1_score(y_test, y_hat_te)
    
    f1_test.append(f1_te)
    
    print('Fold {}: f1.train={:0.2f}, f1.test={:0.2f}'.format(fold+1, f1_tr, f1_te))

print("Decision Tree Classifier f1_train(avg)={:0.2f}, f1_test(avg)={:0.2f}".format(np.mean(f1_train),np.mean(f1_test)))

In [None]:
f1_train = []
f1_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    nb.fit(X_train, y_train)
    
    y_hat_tr = nb.predict(X_train)
    
    f1_tr = f1_score(y_train, y_hat_tr)
    f1_train.append(f1_tr)
    
    y_hat_te = nb.predict(X_test)
    
    f1_te = f1_score(y_test, y_hat_te)
    
    f1_test.append(f1_te)
    
    print('Fold {}: f1.train={:0.2f}, f1.test={:0.2f}'.format(fold+1, f1_tr, f1_te))

print("Naive Bayes f1_train(avg)={:0.2f}, f1_test(avg)={:0.2f}".format(np.mean(f1_train),np.mean(f1_test)))

In [None]:
f1_train = []
f1_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    svc.fit(X_train, y_train)
    
    y_hat_tr = svc.predict(X_train)
    
    f1_tr = f1_score(y_train, y_hat_tr)
    f1_train.append(f1_tr)
    
    y_hat_te = svc.predict(X_test)
    
    f1_te = f1_score(y_test, y_hat_te)
    
    f1_test.append(f1_te)
    
    print('Fold {}: f1.train={:0.2f}, f1.test={:0.2f}'.format(fold+1, f1_tr, f1_te))

print("SVC f1_train(avg)={:0.2f}, f1_test(avg)={:0.2f}".format(np.mean(f1_train),np.mean(f1_test)))

In [None]:
f1_train = []
f1_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    rfc.fit(X_train, y_train)
    
    y_hat_tr = rfc.predict(X_train)
    
    f1_tr = f1_score(y_train, y_hat_tr)
    f1_train.append(f1_tr)
    
    y_hat_te = rfc.predict(X_test)
    
    f1_te = f1_score(y_test, y_hat_te)
    
    f1_test.append(f1_te)
    
    print('Fold {}: f1.train={:0.2f}, f1.test={:0.2f}'.format(fold+1, f1_tr, f1_te))

print("Random Forest Classifier f1_train(avg)={:0.2f}, f1_test(avg)={:0.2f}".format(np.mean(f1_train),np.mean(f1_test)))

In [None]:
kf = KFold(n_splits = 10, shuffle=True, random_state=0)

In [None]:
f1_train = []
f1_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    rfc.fit(X_train, y_train)
    
    y_hat_tr = rfc.predict(X_train)
    
    f1_tr = f1_score(y_train, y_hat_tr)
    f1_train.append(f1_tr)
    
    y_hat_te = rfc.predict(X_test)
    
    f1_te = f1_score(y_test, y_hat_te)
    
    f1_test.append(f1_te)
    
    print('Fold {}: f1.train={:0.2f}, f1.test={:0.2f}'.format(fold+1, f1_tr, f1_te))

print("Random Forest Classifier f1_train(avg)={:0.2f}, f1_test(avg)={:0.2f}".format(np.mean(f1_train),np.mean(f1_test)))

In [None]:
f1_train = []
f1_test = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    knn.fit(X_train, y_train)
    
    y_hat_tr = knn.predict(X_train)
    
    f1_tr = f1_score(y_train, y_hat_tr)
    f1_train.append(f1_tr)
    
    y_hat_te = knn.predict(X_test)
    
    f1_te = f1_score(y_test, y_hat_te)
    
    f1_test.append(f1_te)
    
    print('Fold {}: f1.train={:0.2f}, f1.test={:0.2f}'.format(fold+1, f1_tr, f1_te))

print("KNN f1_train(avg)={:0.2f}, f1_test(avg)={:0.2f}".format(np.mean(f1_train),np.mean(f1_test)))

## HyperParameter Tuning

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [None]:
grid_params = {
    'n_neighbors' : [3,5,9,15, 50],
    'weights' : ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}
gs = GridSearchCV(knn, grid_params, n_jobs=-1)
gs.fit(X_train, y_train)
y_hat_tr = gs.predict(X_train)
err_tr = 1 - accuracy_score(y_train, y_hat_tr)
y_hat_te = gs.predict(X_test)
err_te = 1 - accuracy_score(y_test, y_hat_te)
print("KNN Training error: {:.2f} Test Error: {:.2f}".format(err_tr, err_te))
print(gs.best_params_)

In [None]:
svc.get_params().keys()

In [None]:
grid_params = {'kernel':['linear', 'rbf'], 'C':[1, 10]}
gs = GridSearchCV(svc, grid_params, n_jobs=-1)
gs.fit(X_train, y_train)
y_hat_tr = gs.predict(X_train)
err_tr = 1 - accuracy_score(y_train, y_hat_tr)
y_hat_te = gs.predict(X_test)
err_te = 1 - accuracy_score(y_test, y_hat_te)
print("SVC Training error: {:.2f} Test Error: {:.2f}".format(err_tr, err_te))
print(gs.best_params_)


In [None]:
grid_params = {'criterion': ['gini', 'entropy'],
               'max_depth': [1,50,250,1000, 2000, None],
               'min_samples_split': range(2,10),  # integer values must be at least 2
               'min_samples_leaf': range(1,5)}
gs = GridSearchCV(dtc, grid_params, n_jobs=-1)
gs.fit(X_train, y_train)
y_hat_tr = gs.predict(X_train)
err_tr = 1 - accuracy_score(y_train, y_hat_tr)
y_hat_te = gs.predict(X_test)
err_te = 1 - accuracy_score(y_test, y_hat_te)
print("DTC Training error: {:.2f} Test Error: {:.2f}".format(err_tr, err_te))
print(gs.best_params_)

In [None]:
grid_params = { 
    'n_estimators': [200, 500],
    # Note: 'auto' behaves like 'sqrt' for classifiers and is rejected by scikit-learn >= 1.3
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion' :['gini', 'entropy']
}
gs = GridSearchCV(rfc, grid_params, n_jobs=-1)
gs.fit(X_train, y_train)
y_hat_tr = gs.predict(X_train)
err_tr = 1 - accuracy_score(y_train, y_hat_tr)
y_hat_te = gs.predict(X_test)
err_te = 1 - accuracy_score(y_test, y_hat_te)
print("RFC Training error: {:.2f} Test Error: {:.2f}".format(err_tr, err_te))
print(gs.best_params_)


Summarizing the above results, we have the following observations:
<br></br>

|#|Classifier Type|Generalization error (default hyperparameters)|Generalization error (tuned hyperparameters)|
|-:|:-|:-|:-|
|1|KNeighborsClassifier|0.16|0.10|
|2|DecisionTreeClassifier|0.07|0.09|
|3|RandomForestClassifier|0.04|0.05|

We see that for KNN, grid search lowered the generalization error from 0.16 to 0.10.

For DecisionTreeClassifier and RandomForestClassifier there really isn't much difference in performance (0.07 vs 0.09 and 0.04 vs 0.05).

For KNN we got
{'metric': 'manhattan', 'n_neighbors': 50, 'weights': 'distance'} as the best hyperparameters. The default KNN used earlier had the minkowski metric with p=2 (Euclidean distance), 5 neighbors and uniform weights.

For DTC, I observed that the best hyperparameters change every time I run the search; in particular, max_depth varies. This is probably because neither the train/test split nor the tree uses a fixed random_state.

For RFC {'criterion': 'gini', 'max_features': 'auto', 'n_estimators': 200} gave the best results.
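The run-to-run variation in the Decision Tree's best hyperparameters noted above can be reduced by fixing the random seeds. The sketch below is an illustration under assumptions: `tune_tree` is a hypothetical helper, the data is synthetic, and the grid is much smaller than the one used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

def tune_tree(seed):
    """Run a small grid search with the split and the tree both seeded."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.33, random_state=seed)
    grid = {"max_depth": [2, 5, None], "min_samples_split": [2, 5]}
    gs = GridSearchCV(DecisionTreeClassifier(random_state=seed), grid, cv=5)
    gs.fit(X_tr, y_tr)
    return gs.best_params_

first = tune_tree(0)
second = tune_tree(0)
print(first, second)
```

With the same seed the two searches see the same split and build the same trees, so they select the same parameters. Different seeds can still give different answers, which suggests how sensitive the selection is to the particular split.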

---

**Performance of Classifiers with undersampled vs oversampled data**

I noticed that when undersampling the data to tackle the class imbalance problem, all classifiers had a high generalization error. Even after using grid search, the error was around 20%.

This probably has to do with the bias-variance tradeoff, and the way to mitigate class imbalance is to be somewhere between undersampling and oversampling.
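One way to land between the two extremes is to resample every class to the midpoint of the class sizes: the majority class is undersampled and the minority class is oversampled. This is only a sketch of that idea, not something evaluated in this lab. `resample_to_midpoint` is a hypothetical helper and the 80/20 toy frame is made up.

```python
import pandas as pd

def resample_to_midpoint(frame, label="Exited", seed=0):
    """Resample each class to the mean class size (under- or over-sampling as needed)."""
    counts = frame[label].value_counts()
    target = int(counts.mean())
    parts = []
    for cls, n in counts.items():
        rows = frame[frame[label] == cls]
        # Sample with replacement only when the class is smaller than the target
        parts.append(rows.sample(n=target, replace=(n < target), random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Made-up frame with the same 80/20 split as the Exited class
toy = pd.DataFrame({"Exited": [0] * 80 + [1] * 20, "Age": range(100)})
balanced = resample_to_midpoint(toy)
print(balanced["Exited"].value_counts())
```

When rows are duplicated before the train/test split, copies of the same row can land on both sides, which may make test errors look optimistic. Resampling only the training portion of each fold is one way to check for this.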