# Assignment 1 - Algorithmic Bias 

---

## Author: 
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Eoghan Hogan
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Eoghan.Hogan@ucdconnect.ie
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 17335293

---

# Index:
    
- i - Imports
- ii - Constants
- iii - Data
---
- Question 1
    - 1 - Exploring Data
    - 2 - Data Classes
    - 3 - Classifiers
        - 3.1 - KNN (w/bias discussion)
        - 3.2 - D-Trees (w/bias discussion)
        - 3.3 - Log Regression (w/bias discussion)
        - 3.4 - Gradient Boost (w/bias discussion)
    - 4 - Scaling Data
    - 5 - Exploring Scaled Data
    - 6 - Scaled Classifiers
        - 6.1 - KNN (w/bias discussion)
        - 6.2 - D-Trees (w/bias discussion)
        - 6.3 - Log Regression (w/bias discussion)
        - 6.4 - Gradient Boost (w/bias discussion)
    - 7 - Scaled Results vs Non-Scaled Results
        - Non-Scaled Results
        - Scaled Results
        - All Results
- Question 2
    - 8 - Rectifying Sample Bias Strategy
    - 9 - SMOTE
    - 10 - Condensed Nearest Neighbour
- Question 3
    - 11 - Strategy Testing (New Dataset)
    - 12 - Comparison of all methods
- Conclusion
---
- References

# i - Imports

In [None]:
import numpy as np
import pandas as pd
from collections import Counter

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold

from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour

import imblearn

from matplotlib import pyplot as plt

%matplotlib inline

# ii - Constants

In [None]:
cv_n = 10
kf = KFold(n_splits=cv_n, random_state=42, shuffle=True)

# iii - Data 

In [None]:
surv = pd.read_csv('survival.csv')
surv['Survived'] = 'GE5'
surv.loc[surv['Class']==2,'Survived']='L5'
surv.head()

## What is Bias?

Bias, in the sense used here, arises when a dataset is heavily unbalanced: one class outnumbers another by a large factor. The ratio can be anywhere from 10:1 to 10000:1, and when there are not a lot of features even a small imbalance can bias a model towards the majority class.

Bias can be a real problem because a lot of really interesting datasets are imbalanced.


Bias in a classification model is when the model is biased towards the class that appears most often in the training sample. That is, the model predicts the majority class most of the time because, during training, it achieves a higher score by doing so. Getting a minority-class sample wrong also has little impact on the training algorithm, because it sees so few of them.
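To make the point above concrete, here is a minimal sketch of a "classifier" that always predicts the majority class. The function name `majority_baseline_report` and the 75/25 toy labels are my own assumptions for illustration, not values taken from the survival data.

```python
from collections import Counter

def majority_baseline_report(y_true):
    """Score a 'classifier' that always predicts the most common label."""
    counts = Counter(y_true)
    majority_label, _ = counts.most_common(1)[0]
    y_pred = [majority_label] * len(y_true)

    # Overall accuracy: fraction of samples where prediction equals truth
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    # Per-class recall: fraction of each class that was predicted correctly
    recall = {}
    for label in counts:
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recall[label] = hits / counts[label]
    return majority_label, accuracy, recall

# Hypothetical 3:1 imbalance (toy labels only)
y_toy = ["GE5"] * 75 + ["L5"] * 25
label, acc, rec = majority_baseline_report(y_toy)
print(f"always predict {label}: accuracy={acc}, recall={rec}")
```

The accuracy looks respectable even though the minority class is never predicted, which is why the notebook tracks the predicted minority percentage alongside accuracy.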

---
---
# Question 1
---
---

# 1 - Exploring Data

#### Dataframe Summary Statistics

In [None]:
surv.describe()

From the summary stats I cannot identify anything odd. The features are clearly on different scales, but for the time being I am leaving them as they are; I will come back to scaling the data later on.

## Plots

### Age vs Class plot

In [None]:
surv[["Age", "Class"]].plot()

### Year vs Class plot

In [None]:
surv[["Year", "Class"]].plot()

### NNodes vs Class plot

In [None]:
surv[["NNodes", "Class"]].plot()

## Plots
The plots as they stand did not provide any useful insight, but it is always good practice to at least try to explore the data in a bit of depth before doing anything else. So although nothing major was revealed, it was still worth the effort.

# 2 - Data Classes

Here we look at the amount of bias in the samples.
We also pull the labels and classes out into their own arrays,
and we take the features out into their own array to use for training.

In [None]:
y = surv["Class"].values
labels = surv['Survived'].values

In [None]:
c_y = Counter(labels)
X = surv[["Age", "Year", "NNodes"]].values

print(f"Shape of features: {X.shape}")
print(f"Shape of output: {y.shape}")
print(f"Classes:\n\tGE5(survived):\t{c_y['GE5']}\n\tL5: {c_y['L5']}")
print(f"Minority class : {round(c_y['L5'] / len(y) * 100, 1)}%")

## Setting up data for hold out

For hold-out we want to split the data into training and test sets, as we have done below. 
I also wanted to check the class bias in the training and testing samples. 

In [None]:
c2_ydist = round((Counter(y)[2]/len(y)) * 100, 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
c1_tr = round(Counter(y_train)[1]/(len(y_train)) * 100, 4)
c2_tr = round(Counter(y_train)[2]/(len(y_train)) *100, 4)
c1_te = round(Counter(y_test)[1]/(len(y_test))* 100, 4)
c2_te = round(Counter(y_test)[2]/(len(y_test))* 100, 4)
print(f"Class 1 in the Training samples: {c1_tr}%")
print(f"Class 2 in the Training samples: {c2_tr}%")
print(f"Class 1 in the Testing samples: {c1_te}%")
print(f"Class 2 in the Testing samples: {c2_te}%")
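A plain random split does not guarantee the same class proportions in both parts. As a rough sketch (this is not the split used in this notebook), a stratified hold-out can be built by splitting each class separately. The helper `stratified_split` and the 80/20 toy labels are hypothetical. scikit-learn's `train_test_split` offers the same idea through its `stratify` argument.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with roughly equal class shares in both."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # shuffle within the class
        n_test = round(len(idxs) * test_size)  # same fraction from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 80 samples of class 1, 20 of class 2
toy_labels = [1] * 80 + [2] * 20
train_idx, test_idx = stratified_split(toy_labels)

train_share = Counter(toy_labels[i] for i in train_idx)[2] / len(train_idx)
test_share = Counter(toy_labels[i] for i in test_idx)[2] / len(test_idx)
print(f"class 2 share: train={train_share}, test={test_share}")
```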

# 3 - Classifiers

# 3.1 - K-NN

In [None]:
kNN = KNeighborsClassifier(n_neighbors=3)

### 3.1 - Hold Out

In [None]:
y_pred = kNN.fit(X_train, y_train).predict(X_test)
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
ho_knn_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {ho_knn_predmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### 3.1 - Cross Validation 

In [None]:
print(f"Number of Cross Validation Folds: {cv_n}")
y_pred = cross_val_predict(kNN, X, y, cv=kf)
print(classification_report(y, y_pred))
print(f"Minority class 2 in y: {c2_ydist}%")
cv_knn_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {cv_knn_predmin}%\n")
scores = cross_val_score(kNN, X, y, cv=kf)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(kNN, X, y, cv=kf, scoring='f1')
print("f1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(kNN, X, y, cv=kf, scoring='precision')
print("precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(kNN, X, y, cv=kf, scoring='recall')
print("recall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

### KNN RESULTS
The k-NN heavily underpredicted the minority class, which is not surprising: when it does its distance calculations there are not many minority-class samples among the nearest neighbours.
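The voting effect can be seen in a tiny hand-rolled k-NN. This is only a sketch under my own assumptions: `knn_predict` and the five toy points are made up for illustration and are not the scikit-learn implementation used above.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    ranked = sorted(zip(train_X, train_y), key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: four majority points (class 1), one minority point (class 2)
train_X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.25, 0.25)]
train_y = [1, 1, 1, 1, 2]

# The query sits right next to the single minority point
query = (0.26, 0.26)
pred_k3 = knn_predict(train_X, train_y, query, k=3)
pred_k1 = knn_predict(train_X, train_y, query, k=1)
print(f"k=3 predicts {pred_k3}, k=1 predicts {pred_k1}")
```

Even though the closest neighbour is the minority sample, with k=3 the two surrounding majority points outvote it.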

# 3.2 - Decision trees

In [None]:
Dtree = DecisionTreeClassifier(criterion='entropy', random_state=42)

### 3.2 - Hold Out

In [None]:
y_pred = Dtree.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
ho_dtree_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {ho_dtree_predmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### 3.2 - Cross Validation

In [None]:
print(f"Number of Cross Validation Folds: {cv_n}")
y_pred = cross_val_predict(Dtree, X, y, cv=kf)
print(classification_report(y, y_pred))
print(f"Minority class 2 in y: {c2_ydist}%")
cv_dtree_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {cv_dtree_predmin}%\n")
scores = cross_val_score(Dtree, X, y, cv=kf)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(Dtree, X, y, cv=kf, scoring='f1')
print("f1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(Dtree, X, y, cv=kf, scoring='precision')
print("precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(Dtree, X, y, cv=kf, scoring='recall')
print("recall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

### Decision Tree RESULTS
The Decision Tree underpredicted the minority class in hold-out testing and then, surprisingly, overpredicted it under cross-validation. My theory is that, since we only have 3 features, the tree was able to learn a good split on them, so the imbalance was not so much of a problem.

# 3.3 Logistic Regression

In [None]:
lreg = LogisticRegression(solver='lbfgs', random_state=42, max_iter=10000)

### 3.3 Hold out

In [None]:
y_pred = lreg.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
ho_lreg_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {ho_lreg_predmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### 3.3 - Cross Validation

In [None]:
print(f"Number of Cross Validation Folds: {cv_n}")
y_pred = cross_val_predict(lreg, X, y, cv=kf)
print(classification_report(y, y_pred))
print(f"Minority class 2 in y: {c2_ydist}%")
cv_lreg_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {cv_lreg_predmin}%\n")
scores = cross_val_score(lreg, X, y, cv=kf)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(lreg, X, y, cv=kf, scoring='f1')
print("f1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(lreg, X, y, cv=kf, scoring='precision')
print("precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(lreg, X, y, cv=kf, scoring='recall')
print("recall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

### Logistic Regression RESULTS
Logistic Regression underestimates the minority class by a massive margin.

# 3.4 - Gradient Boosting

In [None]:
gbreg = GradientBoostingRegressor(random_state=42)

### 3.4 Hold out

In [None]:
gbreg.fit(X_train, y_train)
y_pred = gbreg.predict(X_test).round()
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
ho_gbreg_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {ho_gbreg_predmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### 3.4 - Cross Validation

In [None]:
kf =  KFold(n_splits=cv_n, random_state=42, shuffle=True)
print(f"Number of Cross Validation Folds: {cv_n}")
y_pred = cross_val_predict(gbreg, X, y, cv=kf)
y_pred = np.array(list(map(lambda x: round(x), y_pred)))
print(classification_report(y, y_pred))
print(f"Minority class 2 in y: {c2_ydist}%")
cv_gbreg_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {cv_gbreg_predmin}%\n")
# gbreg is a regressor, so cross_val_score reports R^2 by default, not accuracy
scores = cross_val_score(gbreg, X, y, cv=kf)
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

### Gradient Boosting Regression RESULTS
Gradient Boosting Regression underestimates the minority class by a smaller margin than Logistic Regression, but it is still further off than the other methods. The regression-based models as a whole underestimate the minority class, which isn't surprising given that their fitting relies heavily on the y variable. 

# 4 - Scaling Data

### As we saw at the start, our features vary a lot in magnitude

While this is not always a problem, some classifiers will bias themselves towards the feature with the largest magnitude, so it can be good practice to normalize the features. Here I discretize the ages into 6 bins that represent age ranges between 30 and 90. I normalise the year with a MinMax scaler to bound it between 1 and 2. Finally, I use a StandardScaler to normalise the NNodes feature.

In [None]:
surv_scal = surv.copy()
a = surv_scal["Age"]
yr = surv_scal["Year"]
nn = surv_scal["NNodes"]
print("*" * 10, "Before Scaling", "*" * 10)
print("Age Range:", min(a), max(a))
print("Year Range:", min(yr), max(yr))
print("NNodes Range:", min(nn), max(nn), "\n")
kb = KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='uniform')
ss = StandardScaler()
mms = MinMaxScaler((1,2))
surv_scal["Age"] = kb.fit_transform(surv_scal["Age"].values.reshape(-1, 1)).flatten()
surv_scal["Year"] = mms.fit_transform(yr.values.reshape(-1, 1)).flatten()
surv_scal["NNodes"] = ss.fit_transform(nn.values.reshape(-1, 1)).flatten()
a = surv_scal["Age"]
yr = surv_scal["Year"]
nn = surv_scal["NNodes"]
print("*" * 10, "After Scaling", "*" * 10)
print("Age Range:", min(a), max(a))
print("Year Range:", min(yr), max(yr))
print("NNodes Range:", min(nn), max(nn))
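As a sanity check on what the three transformers above do, here is a minimal pure-Python sketch of the underlying formulas. The function names and toy values are assumptions for illustration, and edge handling in scikit-learn may differ slightly from this simplified version.

```python
from statistics import mean, pstdev

def uniform_bins(values, n_bins=6):
    """Equal-width binning; the maximum value falls into the last bin."""
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / n_bins          # assumes v_max > v_min
    return [min(int((v - v_min) / width), n_bins - 1) for v in values]

def min_max(values, lo=1.0, hi=2.0):
    """Rescale linearly so the smallest value maps to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy values, not rows from survival.csv
ages_binned = uniform_bins([30, 45, 60, 90])
years_scaled = min_max([58, 64, 69])
nodes_scaled = standardize([0, 1, 2, 3, 4])
print(ages_binned)
print(years_scaled)
print(nodes_scaled)
```

After these transforms every feature lives on a range of a few units, which is the property the scaled classifiers below rely on.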

# 5 - Exploring Scaled data

In which we see if we can draw anything useful from the Scaled dataframe

In [None]:
surv_scal.describe()

Again, the summary statistics of the scaled data do not tell us much about the data. However, we can clearly see that the features are all much closer together in magnitude, which should help stop classifiers from favouring one feature.

## plots (Scaled)

Let's see if the scaled data lends itself to more insightful plots.

### Age vs Class (Scaled)

In [None]:
surv_scal[["Age", "Class"]].plot()

Besides seeing that our ages were correctly binned we are not provided with any insights from this plot

### Year vs Class (Scaled)

In [None]:
surv_scal[["Year", "Class"]].plot()

This plot is very messy and does not provide us with any sort of inference. Even if we use a scatter plot we don't see anything insightful.

### NNodes vs Class (Scaled)

In [None]:
surv_scal.plot("NNodes", "Class", kind="scatter")

We see that there is not much variance; however, the biggest NNodes value does lie inside class 2.

### Plots - While not lending themselves to any insight, it is still good practice.

# 6 - Scaled Classifiers

We create new hold-out sets from the scaled features.

In [None]:
X_scale = surv_scal[["Age", "Year", "NNodes"]].values
s_c2_ydist = round((Counter(y)[2]/len(y)) * 100, 2) 
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.25, random_state=42)

# 6.1 - K-NN (Scaled)

In [None]:
kNN = KNeighborsClassifier(n_neighbors=3)

### 6.1 - Hold out

In [None]:
y_pred = kNN.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
sho_knn_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {sho_knn_predmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### 6.1 - Cross Validation

In [None]:
print(f"Number of Cross Validation Folds: {cv_n}")
y_pred = cross_val_predict(kNN, X_scale, y, cv=kf)
print(classification_report(y, y_pred))
print(f"Minority class 2 in y: {s_c2_ydist}%")
scv_knn_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {scv_knn_predmin}%\n")
scores = cross_val_score(kNN, X_scale, y, cv=kf)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(kNN, X_scale, y, cv=kf, scoring='f1')
print("f1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(kNN, X_scale, y, cv=kf, scoring='precision')
print("precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(kNN, X_scale, y, cv=kf, scoring='recall')
print("recall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# 6.2 - Decision Trees (Scaled)

In [None]:
Dtree = DecisionTreeClassifier(criterion='entropy', random_state=42)

### 6.2 - Hold out

In [None]:
y_pred = Dtree.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
sho_dtree_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {sho_dtree_predmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### 6.2 - Cross Validation

In [None]:
print(f"Number of Cross Validation Folds: {cv_n}")
y_pred = cross_val_predict(Dtree, X_scale, y, cv=kf)
print(classification_report(y, y_pred))
print(f"Minority class 2 in y: {s_c2_ydist}%")
scv_dtree_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {scv_dtree_predmin}%\n")
scores = cross_val_score(Dtree, X_scale, y, cv=kf)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(Dtree, X_scale, y, cv=kf, scoring='f1')
print("f1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(Dtree, X_scale, y, cv=kf, scoring='precision')
print("precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(Dtree, X_scale, y, cv=kf, scoring='recall')
print("recall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# 6.3 - Logistic Regression (Scaled)

In [None]:
lreg = LogisticRegression(solver='lbfgs', random_state=42, max_iter=10000).fit(X_train, y_train)

### 6.3 - Hold out

In [None]:
y_pred = lreg.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
sho_lreg_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {sho_lreg_predmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### 6.3 - Cross Validation

In [None]:
print(f"Number of Cross Validation Folds: {cv_n}")
y_pred = cross_val_predict(lreg, X_scale, y, cv=kf)
print(classification_report(y, y_pred))
print(f"Minority class 2 in y: {s_c2_ydist}%")
scv_lreg_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {scv_lreg_predmin}%\n")
scores = cross_val_score(lreg, X_scale, y, cv=kf)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(lreg, X_scale, y, cv=kf, scoring='f1')
print("f1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(lreg, X_scale, y, cv=kf, scoring='precision')
print("precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(lreg, X_scale, y, cv=kf, scoring='recall')
print("recall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# 6.4 - Gradient Boosting (Scaled)

In [None]:
gbreg = GradientBoostingRegressor(random_state=42)

### 6.4 - Hold out

In [None]:
gbreg.fit(X_train, y_train)
y_pred = gbreg.predict(X_test)
y_pred = np.array(list(map(lambda x: round(x), y_pred)))
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
sho_gbreg_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {sho_gbreg_predmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### 6.4 - Cross Validation

In [None]:
print(f"Number of Cross Validation Folds: {cv_n}")
y_pred = cross_val_predict(gbreg, X_scale, y, cv=kf)
y_pred = np.array(list(map(lambda x: round(x), y_pred)))
print(classification_report(y, y_pred))
print(f"Minority class 2 in y: {s_c2_ydist}%")
scv_gbreg_predmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {scv_gbreg_predmin}%\n")
# gbreg is a regressor, so cross_val_score reports R^2 by default, not accuracy
scores = cross_val_score(gbreg, X_scale, y, cv=kf)
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

## 7 - Q1 Results

### 7.1 Non-Scaled Results

In [None]:
data1 = [c2_ydist] * 4
data2 = [ho_knn_predmin, ho_dtree_predmin, ho_lreg_predmin, ho_gbreg_predmin]
data3 = [cv_knn_predmin, cv_dtree_predmin, cv_lreg_predmin, cv_gbreg_predmin]
labels = ["k-NN", "Decision Tree", "Logistic Regression", "Gradient Boost"]
width =0.3
plt.figure(figsize=(15,10))
plt.bar(np.arange(len(data1)), data1, width=width)
plt.bar(np.arange(len(data2))+ width, data2, width=width)
plt.bar(np.arange(len(data3))+ width + width, data3, width=width)
plt.xticks(range(len(data1)), labels)
plt.xlabel('ML method')
plt.ylabel('percentage class 2 predicted')
plt.title('Non-Scaled data - Classifier Bias')
plt.legend(["C2 in dataset", "C2 predicted for Hold Out", "C2 predicted for Cross Validation" ],loc="best",fontsize="x-large")
nsrplt = plt
plt.show()

### 7.2 Scaled Results

In [None]:
data1  = [s_c2_ydist] * 4
data2  = [sho_knn_predmin, sho_dtree_predmin, sho_lreg_predmin, sho_gbreg_predmin]
data3  = [scv_knn_predmin, scv_dtree_predmin, scv_lreg_predmin, scv_gbreg_predmin]
labels = ["k-NN", "Decision Tree", "Logistic Regression", "Gradient Boost"]
width  = 0.3
plt.figure(figsize=(15,10))
plt.bar(np.arange(len(data1)), data1, width=width)
plt.bar(np.arange(len(data2))+ width, data2, width=width)
plt.bar(np.arange(len(data3))+ width + width, data3, width=width)
plt.xticks(range(len(data1)), labels)
plt.xlabel('ML method')
plt.ylabel('percentage class 2 predicted')
plt.title('Scaled data - Classifier Bias')
plt.legend(["C2 in dataset", "C2 predicted for Hold Out", "C2 predicted for Cross Validation" ],loc="best",fontsize="x-large")
srplt = plt
plt.show()

### 7.3 All Results

In [None]:
data1 = [c2_ydist] * 4
data2 = [ho_knn_predmin, ho_dtree_predmin, ho_lreg_predmin, ho_gbreg_predmin]
data3 = [cv_knn_predmin, cv_dtree_predmin, cv_lreg_predmin, cv_gbreg_predmin]
data4 = [sho_knn_predmin, sho_dtree_predmin, sho_lreg_predmin, sho_gbreg_predmin]
data5 = [scv_knn_predmin, scv_dtree_predmin, scv_lreg_predmin, scv_gbreg_predmin]
labels = ["k-NN", "Decision Tree", "Logistic Regression", "Gradient Boost"]
width = 0.15
plt.figure(figsize=(15,10))
l = np.arange(len(data1))
plt.bar(l, data2, width=width)
plt.bar(l + (width), data3, width=width)
plt.bar(l + (width * 2), data1, width=width)
plt.bar(l + (width * 3), data4, width=width)
plt.bar(l + (width * 4), data5, width=width)
plt.xticks(range(len(data1)), labels)
plt.xlabel('ML method')
plt.ylabel('percentage class 2 predicted')
plt.title('All results - Classifier Bias')
labels = ["C2 predicted for Hold Out", "C2 predicted for Cross Validation","C2 in dataset"
 ,"C2 predicted for Hold Out (scaled)", "C2 predicted for Cross Validation (scaled)"]
plt.legend(labels,loc="best",fontsize="x-large")
nsandsplt = plt
plt.show()

All the results together show us that scaling the data does not actually change the general pattern; in fact, in most cases we get the same results within a few percent. Nonetheless, what we can conclude is that k-NN does the best at classifying correctly while being trained on an imbalanced dataset.

---
---
# Question 2
---
---

# 8 - Rectifying Sample Bias Strategy

For Question 2 I am going to suggest two strategies: first we will test _SMOTE_ and then _Condensed Nearest Neighbour Undersampling_. I chose these two because they made the most sense to me. 

Before we continue I want to talk about hyper-parameter tuning and why I did not approach the problem that way.

We have already seen hyperparameters in this notebook and have technically already decided on some, such as the number of folds we plug into our k-fold splitter and our random seeds. 

On reflection we could have gone a step further and done a basic Bayesian search for hyperparameters. But if we acknowledge that the training samples are inherently biased, no amount of model parameter tuning is going to alleviate that bias. That is why methods for fixing dataset bias exist, and that is what we go on to explore with SMOTE and CNN undersampling.

___

## 9 - SMOTE

When we have imbalanced data, the simplest approach to rectify it is probably to duplicate examples in the minority class, although these duplicates don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique (SMOTE).

Synthesizing new minority examples is therefore an improvement on duplicating them. It is a form of data augmentation for tabular data and can be very effective.
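The core interpolation step can be sketched in a few lines. This is a simplified illustration under my own assumptions (the function `smote_sketch`, its parameters and the toy points are hypothetical); the `imblearn` implementation used in the cells below handles more details.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority samples on the corners of a unit square
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(toy_minority, n_new=5)
for point in new_points:
    print(point)
```

Every synthetic point lies between two real minority samples, so it is new data rather than an exact duplicate.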

In [None]:
surv = pd.read_csv('survival.csv')
surv['Survived'] = 'GE5'
surv.loc[surv['Class']==2,'Survived']='L5'
labels = surv["Survived"]
y = surv["Class"]
X = surv[["Age", "Year", "NNodes"]].values

A quick look again at the class imbalance

In [None]:
c_y = Counter(labels)
classimba = [(i, (c_y[i] / sum(c_y.values())) ) for i in c_y]
for i in classimba:
    print(f"{i[0]} - {c_y[i[0]]} samples - {round(i[1] * 100, 2)}% of total")

Class Labels are split about 3/4 to 1/4 with GE5 containing most of the samples.

In [None]:
over = SMOTE(sampling_strategy=0.8, random_state=42)  # fixed seed so the resampling is reproducible
X, y = over.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
classimba = [(i, (counter[i] / sum(counter.values())) ) for i in counter]
for i in classimba:
    print(f"{i[0]} - {counter[i[0]]} samples - {round(i[1] * 100, 2)}% of total")

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
c = Counter(y_test)
c2_te = round(c[2] / sum(c.values()) * 100, 2)
print("{}% of test set is class 2".format(c2_te))

c = Counter(y_train)
c2_tr = round(c[2] / sum(c.values()) * 100, 2)
print("{}% of train set is class 2".format(c2_tr))

In [None]:
kNN = KNeighborsClassifier(n_neighbors=3)
y_pred = kNN.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
knnpredmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {knnpredmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

In [None]:
Dtree = DecisionTreeClassifier(criterion='entropy', random_state=42)
y_pred = Dtree.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
dtreepredmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {dtreepredmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

In [None]:
lreg = LogisticRegression(solver='lbfgs', random_state=42, max_iter=10000).fit(X_train, y_train)
y_pred = lreg.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
lregpredmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {lregpredmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

In [None]:
gbreg = GradientBoostingRegressor(random_state=42)
gbreg.fit(X_train, y_train)
y_pred = gbreg.predict(X_test)
y_pred = np.array(list(map(lambda x: round(x), y_pred)))
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
gbregpredmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {gbregpredmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### SMOTE Results

In [None]:
data1 = [testmin] * 4
data2 = [knnpredmin, dtreepredmin, lregpredmin, gbregpredmin]
labels = ["k-NN", "D-Tree", "Log. Reg.", "Grad. Boost"]
width =0.3
plt.bar(np.arange(len(data1)), data1, width=width)
plt.bar(np.arange(len(data2))+ width, data2, width=width)
plt.xticks(range(len(data1)), labels)
plt.xlabel('ML method')
plt.ylabel('percentage class 2 predicted')
plt.title('SMOTE: Test class 2 (blue) vs Predicted Class 2(Orange)')
smoteresplt = plt
plt.show()

We can see how SMOTE has affected our results in comparison to before. Now the only method that still suffers from the imbalance is Logistic Regression; the rest of the methods do well. 

## 10 - Condensed Nearest Neighbour Undersampling

Condensed Nearest Neighbors, or CNN for short, is an undersampling technique that seeks a subset of a collection of samples that results in no loss in model performance, referred to as a minimal consistent set.
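To show the idea, here is a single-pass sketch loosely following the condensed nearest neighbour rule: keep every minority sample, then only keep a majority sample if a 1-NN classifier built on the samples kept so far would get it wrong. The function `condensed_nn_sketch` and the toy points are assumptions for illustration; the `imblearn` implementation used below may differ in its details.

```python
import math
from collections import Counter

def condensed_nn_sketch(X, y, minority_label):
    """Return the retained (point, label) pairs after one condensing pass."""
    # The store starts with every minority sample plus the first majority sample
    store = [(x, lab) for x, lab in zip(X, y) if lab == minority_label]
    majority = [(x, lab) for x, lab in zip(X, y) if lab != minority_label]
    if majority:
        store.append(majority[0])

    for x, lab in majority[1:]:
        nearest = min(store, key=lambda item: math.dist(item[0], x))
        if nearest[1] != lab:        # 1-NN on the store misclassifies it
            store.append((x, lab))   # so it carries information: keep it
    return store

# Hypothetical toy data: class 1 is the majority, class 2 the minority
X_toy = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (4, 4), (5, 5), (5, 6)]
y_toy = [1, 1, 1, 1, 1, 1, 2, 2]

kept = condensed_nn_sketch(X_toy, y_toy, minority_label=2)
kept_counts = Counter(lab for _, lab in kept)
print(kept_counts)
```

Majority points deep inside their own cluster are dropped, while the one near the class boundary is kept, which is how the classes end up much closer in size.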

Here we load the dataset, split off the classes and class labels, and summarise the class distribution after applying Condensed Nearest Neighbour.

In [None]:
surv = pd.read_csv('survival.csv')
surv['Survived'] = 'GE5'
surv.loc[surv['Class']==2,'Survived']='L5'
labels = surv["Survived"]
y = surv["Class"]
X = surv[["Age", "Year", "NNodes"]].values
undersample = CondensedNearestNeighbour(n_neighbors=1)
X, y = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
classimba = [(i, (counter[i] / sum(counter.values())) ) for i in counter]
for i in classimba:
    print(f"{i[0]} - {counter[i[0]]} samples - {round(i[1] * 100, 2)}% of total")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
c = Counter(y_test)
c2_te = round(c[2] / sum(c.values()) * 100, 2)
print("percent of class 2 in test set {}".format(c2_te))

We can see that when we applied CNN the classes became almost equal in size. The Condensed Nearest Neighbour algorithm was able to maintain the same variance in our samples but reduced what we needed down to about 81 samples in each class. 

In [None]:
kNN = KNeighborsClassifier(n_neighbors=3)
y_pred = kNN.fit(X_train, y_train).predict(X_test)
class_report = classification_report(y_test, y_pred)
print(class_report)
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
knnpredmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {knnpredmin}%")
rocauc = roc_auc_score(y_test, y_pred)
print(f"roc auc score: {rocauc}")

In [None]:
Dtree = DecisionTreeClassifier(criterion='entropy', random_state=42)
y_pred = Dtree.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
dtreepredmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {dtreepredmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

In [None]:
lreg = LogisticRegression(solver='lbfgs', random_state=42, max_iter=10000).fit(X_train, y_train)
y_pred = lreg.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
lregpredmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {lregpredmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

In [None]:
gbreg = GradientBoostingRegressor(random_state=42)
gbreg.fit(X_train, y_train)
y_pred = gbreg.predict(X_test)
y_pred = np.array(list(map(lambda x: round(x), y_pred)))
print(classification_report(y_test, y_pred))
testmin = c2_te
print(f"Minority class in test set: {testmin}%")
gbregpredmin = round((Counter(y_pred)[2]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {gbregpredmin}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

### CNN Results on Bias

In [None]:
data1 = [testmin] * 4
data2 = [knnpredmin, dtreepredmin, lregpredmin, gbregpredmin]
labels = ["k-NN", "D-Tree", "Log. Reg.", "Grad. Boost"]
width =0.3
plt.bar(np.arange(len(data1)), data1, width=width)
plt.bar(np.arange(len(data2))+ width, data2, width=width)
plt.xticks(range(len(data1)), labels)
plt.xlabel('ML method')
plt.ylabel('percentage class 2 predicted')
plt.title('Condensed Nearest Neighbours \n Test class 2 (blue) vs Predicted Class 2(Orange)')
cnnresplt = plt
plt.show()

We can see that with Condensed Nearest Neighbour undersampling the bias in the class predictions changes compared to using no rectifying strategy. The k-NN is now actually biased towards picking class 2, which is very interesting. The undersampling brought the number of samples down to 82 and 81 respectively, and we then hold out 25% of them in the train/test split, meaning we are really only using about 40 samples to test. This could be an explanation for these results. The classes are now as balanced as they can be, but (and this is a very important point) due to the small test sample we cannot draw anything conclusive.
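To get a feel for how little about 40 test samples can tell us, here is a rough worked example using the normal-approximation interval for a proportion. The predicted share of 0.5 is a made-up value for illustration, not a result from the cells above, and `proportion_interval` is a hypothetical helper.

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Approximate 95% interval for a proportion observed on n samples."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

# Hypothetical: half of 40 test samples predicted as class 2
low, high = proportion_interval(0.5, 40)
print(f"plausible range: {low:.3f} to {high:.3f}")
```

With a range this wide, a difference of ten percentage points between two classifiers could easily be noise.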

---
---
# Question 3
---
---
## 11 - New Dataset
We are going to use the Diabetes dataset, which is imbalanced, and see how the methods compare.

In [None]:
dataset = './diabetes.csv'

In [None]:
raw_df = pd.read_csv(dataset)
print("Original Class Balance")
y = raw_df.pop("Outcome")
X = raw_df.values
counter = Counter(y)
classimba = [(i, (counter[i] / sum(counter.values())) ) for i in counter]
for i in classimba:
    print(f"{i[0]} - {counter[i[0]]} samples - {round(i[1] * 100, 2)}% of total")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
c = Counter(y_train)
trminc = round(c[1] / sum(c.values()) * 100, 4)
print("train minority: ", trminc)
c = Counter(y_test)
teminc = round(c[1] / sum(c.values()) * 100, 4)
print("test minority: ", teminc)
print("X_train, X_test, y_train, y_test", len(X_train), len(X_test), len(y_train), len(y_test))

In [None]:
print("SMOTE Class Balance")
raw_df = pd.read_csv(dataset)
y = raw_df.pop("Outcome")
X = raw_df.values
over = SMOTE(sampling_strategy=0.8, random_state=42)
X, y = over.fit_resample(X, y)
counter = Counter(y)
classimba = [(i, (counter[i] / sum(counter.values())) ) for i in counter]
for i in classimba:
    print(f"{i[0]} - {counter[i[0]]} samples - {round(i[1] * 100, 2)}% of total")
OX_train, OX_test, Oy_train, Oy_test = train_test_split(X, y, test_size=0.25, random_state=42)
c = Counter(Oy_train)
trOminc = round(c[1] / sum(c.values()) * 100, 4)
print("train minority: ", trOminc)
c = Counter(Oy_test)
teOminc = round(c[1] / sum(c.values()) * 100, 4)
print("test minority: ", teOminc)
print("OX_train {}, OX_test {}, Oy_train {}, Oy_test {}".format(len(OX_train), len(OX_test), len(Oy_train), len(Oy_test)))

In [None]:
print("Condensed NN Undersample Class Balance")
raw_df = pd.read_csv(dataset)
y = raw_df.pop("Outcome")
X = raw_df.values
undersample = CondensedNearestNeighbour(n_neighbors=1, random_state=42)
X, y = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
classimba = [(i, (counter[i] / sum(counter.values())) ) for i in counter]
for i in classimba:
    print(f"{i[0]} - {counter[i[0]]} samples - {round(i[1] * 100, 2)}% of total")
CX_train, CX_test, Cy_train, Cy_test = train_test_split(X, y, test_size=0.25, random_state=0)
c = Counter(Cy_train)
trCminc = round(c[1] / sum(c.values()) * 100, 4)
print("train minority: ", trCminc)
c = Counter(Cy_test)
teCminc = round(c[1] / sum(c.values()) * 100, 4)
print(f"test minority: {teCminc}")
print("CX_train {}, CX_test {}, Cy_train {}, Cy_test {}".format(len(CX_train), len(CX_test), len(Cy_train), len(Cy_test)))

# 12 - Comparisons

### 12.1 - K Nearest Neighbours (k = 3)

In [None]:
print("Original Data")
kNN = KNeighborsClassifier(n_neighbors=3)
y_pred = kNN.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
print(f"Minority class in test set: {teminc}%")
knnpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {knnpred}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")
print("\nSMOTE")
kNN = KNeighborsClassifier(n_neighbors=3)
y_pred = kNN.fit(OX_train, Oy_train).predict(OX_test)
print(classification_report(Oy_test, y_pred))
print(f"Minority class in test set: {teOminc}%")
oknnpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {oknnpred}%")
print(f"roc auc score: {roc_auc_score(Oy_test, y_pred)}")
print("\nCNN")
kNN = KNeighborsClassifier(n_neighbors=3)
y_pred = kNN.fit(CX_train, Cy_train).predict(CX_test)
print(classification_report(Cy_test, y_pred))
print(f"Minority class in test set: {teCminc}%")
cknnpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {cknnpred}%")
print(f"roc auc score: {roc_auc_score(Cy_test, y_pred)}")

### 12.2 - Decision Tree

In [None]:
print("Original Data")
Dtree = DecisionTreeClassifier(criterion='entropy', random_state=42)
y_pred = Dtree.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
print(f"Minority class in test set: {teminc}%")
dtreepred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {dtreepred}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")
print("\nSMOTE")
Dtree = DecisionTreeClassifier(criterion='entropy', random_state=42)
y_pred = Dtree.fit(OX_train, Oy_train).predict(OX_test)
print(classification_report(Oy_test, y_pred))
print(f"Minority class in test set: {teOminc}%")
odtreepred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {odtreepred}%")
print(f"roc auc score: {roc_auc_score(Oy_test, y_pred)}")
print("\nCNN")
Dtree = DecisionTreeClassifier(criterion='entropy', random_state=42)
y_pred = Dtree.fit(CX_train, Cy_train).predict(CX_test)
print(classification_report(Cy_test, y_pred))
print(f"Minority class in test set: {teCminc}%")
cdtreepred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {cdtreepred}%")
print(f"roc auc score: {roc_auc_score(Cy_test, y_pred)}")

### 12.3 - Logistic Regression

In [None]:
print("Original Data")
lreg = LogisticRegression(solver='lbfgs', random_state=42, max_iter=10000)
y_pred = lreg.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))
print(f"Minority class in test set: {teminc}%")
lregpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {lregpred}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

print("\nSMOTE")
lreg = LogisticRegression(solver='lbfgs', random_state=42, max_iter=10000)
y_pred = lreg.fit(OX_train, Oy_train).predict(OX_test)
print(classification_report(Oy_test, y_pred))
print(f"Minority class in test set: {teOminc}%")
olregpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {olregpred}%")
print(f"roc auc score: {roc_auc_score(Oy_test, y_pred)}")

print("\nCNN")
lreg = LogisticRegression(solver='lbfgs', random_state=42, max_iter=10000)
y_pred = lreg.fit(CX_train, Cy_train).predict(CX_test)
print(classification_report(Cy_test, y_pred))
print(f"Minority class in test set: {teCminc}%")

clregpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {clregpred}%")
print(f"roc auc score: {roc_auc_score(Cy_test, y_pred)}")

### 12.4 - Gradient Boosting

In [None]:
print("Original Data")
gbreg = GradientBoostingRegressor(random_state=42)
gbreg.fit(X_train, y_train)
y_pred = gbreg.predict(X_test)
y_pred = np.array(list(map(lambda x: round(x), y_pred)))
print(classification_report(y_test, y_pred))
print(f"Minority class in test set: {teminc}%")
gbregpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2)
print(f"Predicted minority class : {gbregpred}%")
print(f"roc auc score: {roc_auc_score(y_test, y_pred)}")

print("\nSMOTE")
gbreg = GradientBoostingRegressor(random_state=42)
gbreg.fit(OX_train, Oy_train)
y_pred = gbreg.predict(OX_test)
y_pred = np.array(list(map(lambda x: round(x), y_pred)))
print(classification_report(Oy_test, y_pred))
print(f"Minority class in test set: {teOminc}%")
ogbregpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {ogbregpred}%")
print(f"roc auc score: {roc_auc_score(Oy_test, y_pred)}")

print("\nCNN")
gbreg = GradientBoostingRegressor(random_state=42)
gbreg.fit(CX_train, Cy_train)
y_pred = gbreg.predict(CX_test)
y_pred = np.array(list(map(lambda x: round(x), y_pred)))
print(classification_report(Cy_test, y_pred))
print(f"Minority class in test set: {teCminc}%")
cgbregpred = round((Counter(y_pred)[1]/len(y_pred)) * 100, 2) 
print(f"Predicted minority class : {cgbregpred}%")
print(f"roc auc score: {roc_auc_score(Cy_test, y_pred)}")
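
The gradient boosting cells above turn a regressor's continuous output into class labels with `round`. A regressor is not restricted to the label range, so rounding alone can produce values that are not valid classes. The sketch below shows one way to guard against that by clipping after rounding. The helper `to_class_labels` and the scores are hypothetical and only illustrate the idea; they are not output from `gbreg`.

```python
def to_class_labels(scores, low=0, high=1):
    """Round continuous regressor outputs to integers and clip them into [low, high]."""
    return [min(high, max(low, round(s))) for s in scores]

# Illustrative raw scores only, including values outside the 0..1 label range.
raw_scores = [-0.7, 0.2, 0.49, 0.51, 0.9, 1.6]

print([round(s) for s in raw_scores])  # [-1, 0, 0, 1, 1, 2]
print(to_class_labels(raw_scores))     # [0, 0, 0, 1, 1, 1]
```

For the first dataset, whose labels are 1 and 2, the same helper would be called with `low=1, high=2`.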

In [None]:
from matplotlib.lines import Line2D
data1 = [teminc] * 4
data2 = [teOminc] * 4
data3 = [teCminc] * 4
data1 = data1+data2+data3
data2 = [knnpred, dtreepred, lregpred, gbregpred,
        oknnpred, odtreepred, olregpred, ogbregpred,
        cknnpred, cdtreepred, clregpred, cgbregpred]
labels1 = ["k-NN", "D-Tree", "Log. Reg.", "Grad. Boost"]
labels2 = ["SMOTE k-NN", "SMOTE D-Tree", "SMOTE Log. Reg.", "SMOTE Grad. Boost"]
labels3 = ["CNN k-NN", "CNN D-Tree", "CNN Log. Reg.", "CNN Grad. Boost"]
labels = labels1+labels2+labels3
width = 0.15
plt.figure(figsize=(20,15))
l = np.arange(len(data1))
predbars = plt.bar(l + (width * 1.5), data2, width=width)
testbars = plt.bar(l, data1, width=(width*2))
plt.xticks(l, labels, rotation=20)
plt.xlabel('ML method')
plt.ylabel('Percentage of minority class (class 1)')
plt.title('Diabetes data - Classifier Bias')

for i in range(len(testbars)):
    if i < 4:
        testbars[i].set_color('#ed3528')
        predbars[i].set_color('#4245f4')
    if i >=4 and i < 8:
        testbars[i].set_color('#c2dd27')
        predbars[i].set_color('#1fcce2')
    if i >=8 and i < 12:
        testbars[i].set_color('#20e21d')
        predbars[i].set_color('#a13aea')
        
legend_elements = [
                    Line2D([0], [0], color='#ed3528', lw=6, label='Original Test Minority Class'),
                    Line2D([0], [0], color='#4245f4', lw=6, label='Original Predicted Minority Class'),
                    Line2D([0], [0], color='#c2dd27', lw=6, label='SMOTE Test Minority Class'),
                    Line2D([0], [0], color='#1fcce2', lw=6, label='SMOTE Predicted Minority Class'),
                    Line2D([0], [0], color='#20e21d', lw=6, label='CNN Test Minority Class'),
                    Line2D([0], [0], color='#a13aea', lw=6, label='CNN Predicted Minority Class')
]
plt.legend(handles=legend_elements,loc="best",fontsize="x-large")
x = plt.figure
plt.show()

## RESULTS 

Above is a bar chart comparing three approaches: doing nothing to rectify the bias, using SMOTE, and using Condensed Nearest Neighbour undersampling.
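
Each pair of bars compares the minority share in the test labels with the minority share in the predictions. As a minimal sketch, that comparison can be reduced to a single signed number. The helpers `minority_share` and `bias_gap` are hypothetical names introduced here, and the labels are toy values rather than the diabetes data.

```python
from collections import Counter

def minority_share(labels, minority_label=1):
    """Percentage of labels equal to minority_label, rounded to 2 decimals."""
    counts = Counter(labels)
    return round(counts[minority_label] / len(labels) * 100, 2)

def bias_gap(y_true, y_pred, minority_label=1):
    """Predicted minus actual minority share, in percentage points.
    A negative value means the model under-predicts the minority class."""
    gap = minority_share(y_pred, minority_label) - minority_share(y_true, minority_label)
    return round(gap, 2)

# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

print(minority_share(y_true))    # 40.0
print(minority_share(y_pred))    # 20.0
print(bias_gap(y_true, y_pred))  # -20.0
```

A gap near zero corresponds to bars of almost equal height in the chart. Note that this only measures how often the minority class is predicted, not whether the individual predictions are correct.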

With SMOTE, the predictions are overall much closer to the true share of the minority class in the test set. Interestingly, the SMOTE predictions follow the same pattern as the unsampled ones, only closer to the actual values: k-NN still predicts the minority class more than the others, the decision tree and gradient boosting are roughly the same, and logistic regression still predicts the minority class the least.
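
For intuition on what SMOTE adds to the data, the sketch below generates one synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbours. This is a simplified illustration of the idea in plain Python, not imblearn's implementation; the function name `smote_like_sample` and the points are assumptions made for the example.

```python
import random

def smote_like_sample(minority_points, k=2, rng=None):
    """Create one synthetic point on the segment between a random minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(42)
    base = rng.choice(minority_points)
    # Sort the other minority points by squared Euclidean distance to base.
    others = [p for p in minority_points if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

# Illustrative 2-D minority samples only.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
print(smote_like_sample(minority))
```

Because every synthetic point lies between two existing minority samples, the oversampled class stays inside the region it already occupied. With `sampling_strategy=0.8`, imblearn targets a minority-to-majority ratio of 0.8, so the minority share after resampling is about 0.8 / 1.8, or roughly 44.4%.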

Condensed Nearest Neighbour undersampling produces some very interesting results. The k-NN and decision tree classifiers do very well and the bias is seemingly removed; however, logistic regression and gradient boosting are in fact biased towards the original minority class. A likely reason is that once we apply CNN undersampling, the original minority class becomes the majority by 7%.
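
The sketch below illustrates why condensed nearest neighbour undersampling can flip the class balance. It is a simplified, single-pass version of the idea written in plain Python, not imblearn's implementation: every minority sample is kept, and a majority sample is only kept if the samples stored so far would misclassify it with 1-NN. The function name `condensed_nn` and the toy points are assumptions made for the example.

```python
import random
from collections import Counter

def condensed_nn(points, labels, minority_label, seed=42):
    """Keep all minority samples plus only those majority samples that the
    current store misclassifies with a 1-nearest-neighbour rule."""
    rng = random.Random(seed)
    store = [(p, y) for p, y in zip(points, labels) if y == minority_label]
    majority = [(p, y) for p, y in zip(points, labels) if y != minority_label]
    rng.shuffle(majority)
    # Seed the store with a single majority sample.
    store.append(majority.pop())

    def nearest_label(p):
        # Label of the stored sample closest to p (squared Euclidean distance).
        return min(store, key=lambda s: sum((a - b) ** 2 for a, b in zip(p, s[0])))[1]

    for p, y in majority:
        if nearest_label(p) != y:
            store.append((p, y))
    return store

# Toy data: six tightly clustered majority points (0) and three minority points (1).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4), (0.4, 0.2),
          (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
labels = [0] * 6 + [1] * 3

kept = condensed_nn(points, labels, minority_label=1)
print(Counter(y for _, y in kept))
```

In this toy case the majority cluster is well separated, so a single stored majority point classifies all the others correctly and none of them are added. The result keeps three minority samples and one majority sample, so the original minority class ends up as the larger one.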

While CNN undersampling gave us great results on the previous dataset, here it actually ends up being inferior to SMOTE.

# _Conclusion_

We have seen throughout this notebook what bias is and how we can attempt to remove it. Oversampling and undersampling strategies can remove bias from our samples and thus remove it from our models.

Another interesting result from our experiments is that the regression models are much more sensitive to bias: they often under-predict the minority class much more than k-NN and the decision tree do. We also see which of the four models deals best with bias on these datasets with few features.

Overall, we can clearly conclude that bias in a sample set leads to biased machine learning models. This is a valuable conclusion in itself, as it arms us with knowledge of how bias affects models going forward.