# Business Analytics Final Project
## by Florian Veitz (IBAIT21), prepared for Prof. Dr. Mohebi, University of West Florida

## Neural Networks

Neural networks of this kind are also known as multi-layer perceptrons (MLPs). For this assignment you will therefore use the `MLPClassifier` class from scikit-learn.

Take a look at the documentation to learn more about the default parameterisation of the `MLPClassifier` (which activation function it uses, which optimizer/solver it uses, number and size of hidden layers, etc.):

https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html.
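
The defaults described in the documentation can also be inspected programmatically. The following is a minimal sketch, assuming a recent scikit-learn version; `default_params` is only an illustrative variable name.

```python
from sklearn.neural_network import MLPClassifier

# get_params() returns the constructor arguments of an (unfitted) estimator
default_params = MLPClassifier().get_params()

for name in ("hidden_layer_sizes", "activation", "solver", "batch_size", "max_iter"):
    print(f"{name}: {default_params[name]}")
```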

### Task 1: Neural Network Classifier on MNIST

In [None]:
# load required libraries
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# ConfusionMatrixDisplay is part of the public sklearn.metrics API
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The task will be to perform classification on handwritten digits from 0 to 9 (MNIST dataset).

In [None]:
# download dataset from https://www.openml.org/ which contains many sample datasets for machine learning
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

**Recap**: The dataset contains 70000 examples, each of which has 784 values (pixels). These pixels are stored in a flat array but represent a 28 by 28 pixel gray-scale image. Values range from 0 to 255, the usual range of an 8-bit colour channel. A value of 0 represents a black pixel whereas 255 represents a white pixel. Shades of gray are the values larger than 0 but smaller than 255.
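
To illustrate the flat layout, here is a small sketch that reshapes a made-up 784-value vector into a 28 by 28 grid. The data is synthetic (an assumption for illustration), not an actual MNIST digit.

```python
import numpy as np

# Hypothetical flat image: 784 gray-scale values in the range 0..255
flat_image = np.arange(784) % 256

# Reshape the flat vector into a 28 x 28 grid (row-major order)
image = flat_image.reshape(28, 28)

print(image.shape)
print(image.min(), image.max())
```

The first pixel of the second row, `image[1, 0]`, is the 29th entry of the flat vector, `flat_image[28]`.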

In [None]:
# if we want to plot a single example we need to reshape the array
X_numpy = X.to_numpy()
def show_image(i):
  first_image = np.array(X_numpy[i], dtype='float').reshape((28, 28))
  plt.imshow(first_image, cmap='gray')

show_image(0)

#### Introduction

**Table of contents for the ML model:**


*   Data preparation:
 *   Performing an 80/20 train/test split
 *   Performing feature scaling
*   Training the model
 *   Use of the `MLPClassifier` from `sklearn.neural_network`
*   Evaluation of the model performance
 *   Calculating the accuracy
 *   Plotting the confusion matrix
 *   Plotting some misclassified instances
* Comparison of the model performance with the results of the softmax regression on MNIST in the previous assignment



### Data Preparation

Perform an 80/20 train/test split

In [None]:
np.random.seed(0)
train_data_x, test_data_x, train_data_y, test_data_y = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 100)

Perform feature scaling

In [None]:
scaler = MinMaxScaler()

scaled_train_data_x = train_data_x.copy()
scaled_test_data_x = test_data_x.copy()

scaled_train_data_x = scaler.fit_transform(train_data_x)
scaled_test_data_x = scaler.transform(test_data_x)
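
As a small illustration of what min-max scaling computes, the sketch below compares `MinMaxScaler` with the manual formula `(x - min) / (max - min)` on made-up numbers (the arrays are assumptions, not MNIST data). The scaler is fitted on the training values only, so test values outside the training range can end up outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[0.0], [128.0], [255.0]])
test = np.array([[64.0], [300.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max from the training data
test_scaled = scaler.transform(test)        # reuses the training min and max

# The same transformation written out by hand
manual = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())
print(test_scaled.ravel())
```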

### Train the model

In [None]:
mlp = MLPClassifier()
# limit the training set in order to keep the training time low
size = 3000
# train and evaluate on the scaled features computed above (NumPy arrays)
train_data_x = scaled_train_data_x[:size]
train_data_y = train_data_y.to_numpy()[:size]
test_data_x = scaled_test_data_x
test_data_y = test_data_y.to_numpy()

mlp.fit(train_data_x, train_data_y)

### Evaluation of the model performance

Calculating the accuracy and plotting the confusion matrix


In [None]:
predictions = mlp.predict(test_data_x)

# Performance metrics
print(f"Accuracy: {accuracy_score(test_data_y, predictions)}")

labels = range(0, 10)

conf_matrix = confusion_matrix(test_data_y, predictions)
cm_display = ConfusionMatrixDisplay(conf_matrix, display_labels=labels)
cm_display.plot()
plt.show()
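
The accuracy can also be read off the confusion matrix: it is the sum of the diagonal divided by the total number of samples. A minimal sketch with made-up labels (three classes instead of ten, purely for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal
acc_from_cm = np.trace(cm) / cm.sum()

# Share of correctly classified samples per true class (row-wise)
per_class = cm.diagonal() / cm.sum(axis=1)

print(acc_from_cm, accuracy_score(y_true, y_pred))
print(per_class)
```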

In [None]:
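count_miss = 0

# `actual_y` is used as loop variable so that the global `y` is not overwritten
for i, actual_y in enumerate(test_data_y):
    pred_y = predictions[i]

    if pred_y != actual_y:
        count_miss += 1
        print(f"Predicted value: {pred_y}. Actual value: {actual_y}")
        # plot the test image itself; show_image(i) would index the full dataset X instead
        plt.imshow(np.asarray(test_data_x[i], dtype=float).reshape(28, 28), cmap='gray')
        plt.show()

        # stop after ten misclassified examples
        if count_miss == 10:
            break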

### Task 2: Neural Network Classifier


In [None]:
# load required libraries
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

#### Dataset

This is a classic marketing bank dataset uploaded originally in the UCI Machine Learning Repository and contains >41k records. You can find more information about the features (attributes) on the official UCI website:
https://archive.ics.uci.edu/ml/datasets/bank+marketing

The dataset describes a marketing campaign of a financial institution. It can be analysed to derive strategies for improving the bank's future marketing campaigns.

The target variable is called 'deposit' and indicates whether a person has subscribed to a term deposit (more information: https://www.investopedia.com/terms/t/termdeposit.asp).


In [None]:
# Import the data
df = pd.read_csv('')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X.head()

#### Instructions

This task will combine a lot of different aspects of what we have discussed in class over the past weeks.

**You are expected to do:**

*   **Data exploration** (6 points):
 *   Check which features are available.
 > *   Can some features directly be discarded?
 *   Check if data is messy (e.g. missing values)
 *   Check for correlation with target variable
 *   Look for outliers
 *   Class distribution
*   **Data preparation** (6 points):
 *   Perform some data cleaning e.g.
  >  *   Replace missing values
  >  *   Outlier handling
  >  *   Removal of duplicates
 *   Convert non-numeric features to numeric features
 *   Perform a 80/20 train/test split
 *   Perform feature scaling
 *   In case of class imbalance, think about how you want to deal with it. Please briefly explain your decision.
*   **Training and model evaluation** (10 points):
 *   Please use `MLPClassifier` from `sklearn.neural_network`
 > *   The model should have 4 hidden layers with sizes hidden_layer_size=(10, 1) (parameter hidden_layer_sizes)
 > *   Set the batch_size to 64
 *   Evaluate the model performance
 > *   Calculate the accuracy and other metrics which might be helpful to evaluate the model's performance
 > *   Based on your findings, describe some measures you could take to improve the model's performance even further.
 > *   Try to analyse if you see indications of underfitting or overfitting and which countermeasures you could take.
 *   Please train another model using one of the techniques we have discussed in the lectures and compare the performance to the performance achieved with the neural network.


**For each decision you make, briefly explain your reasoning.**



### Data exploration (6 points):


*   Check which features are available.
  *  Can some features directly be discarded?
*   Check if data is messy (e.g. missing values)
*   Check for correlation with target variable
*   Look for outliers
*   Class distribution

In [None]:
#available features
X.head()

As mentioned in the description of the data set, the feature duration should not be used for a realistic predictive model.

In [None]:
df = df.drop(['duration'], axis=1)
X = X.drop(['duration'], axis=1)
df.head()
X.head()

In [None]:
#distribution of each feature
for name in X.columns:
  plt.figure()  # new figure per feature so the plots do not overlap
  sns.histplot(data=X, x=name)
  plt.xticks(rotation=60)
  plt.show()

In [None]:
#closer look at specific values of some features
print(f"Length of dataset: {len(df)}")
print(df['job'].value_counts()['unknown'], 'x unknown at job')
print(df['education'].value_counts()['unknown'], 'x unknown at education')
print(df['contact'].value_counts()['unknown'], 'x unknown at contact')
print(df['poutcome'].value_counts()['unknown'], 'x unknown at poutcome')
# print(df['pdays'].value_counts()[-1], 'x -1 days contacted')
# print(df['previous'].value_counts()[0], 'x not prev contacted')

The dataset contains some features where the values are unknown. As the percentage of unknown values in the contact and especially the poutcome column is rather high, one cannot delete these rows.
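
The share of unknown values per column can be computed directly. The following sketch uses a made-up toy frame (the values are assumptions, not taken from the bank dataset):

```python
import pandas as pd

# Hypothetical toy frame; the real column values come from the CSV file
toy = pd.DataFrame({
    "contact": ["cellular", "unknown", "unknown", "telephone"],
    "poutcome": ["unknown", "unknown", "success", "unknown"],
})

# Fraction of rows per column whose value is the string 'unknown'
unknown_share = (toy == "unknown").mean()
print(unknown_share)
```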

In [None]:
#label encoding for non-numeric columns
le = LabelEncoder()

# get all str columns
str_columns = df.select_dtypes(include="object")
# encode all str columns
for c in str_columns:
  df[c] = le.fit_transform(df[c])

#first rows of encoded data
df.head()
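
For reference, a short sketch of what `LabelEncoder` does with a single nominal column. The category values are made up for illustration. The integer codes follow the sorted order of the categories, so they carry no meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal values
marital = ["married", "single", "divorced", "single"]

encoder = LabelEncoder()
codes = encoder.fit_transform(marital)

print(list(encoder.classes_))  # categories in sorted order
print(list(codes))             # integer code of each input value
```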

In [None]:
#overview of how much influence each single feature has on the target 'deposit'
#kernel density estimates (KDE) of each feature, split by target class
for c in df.columns:
  sns.displot(df, x=c, hue='deposit', kind='kde', fill=True)

In [None]:
#correlation
#last column is relevant for our use case
print(df.corr())

By taking a closer look at the features' distributions, KDE plots and correlations, one can see that every feature appears to have at least some influence on the deposit. Of course, some features have a larger influence on the outcome than others, but from our point of view no feature should be discarded outright.
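
To compare features more easily, the correlations with the target can be extracted from the correlation matrix and ranked by absolute value. A minimal sketch on a made-up toy frame (values are assumptions; only the column name 'deposit' is taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data with an already encoded target column
toy = pd.DataFrame({
    "age": [25, 60, 40, 35, 55, 45],
    "balance": [100, 200, 150, 900, 1000, 950],
    "deposit": [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, without the target itself
target_corr = toy.corr()["deposit"].drop("deposit")

# Rank by absolute value, since strong negative correlations matter as well
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```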

In [None]:
#boxplot of every column to look for outliers
for name in df.columns:
  print(name)
  plt.boxplot(df[name])
  plt.xticks(rotation=60)
  plt.show()

---

 ### Data preparation (6 points):
 *   Perform some data cleaning e.g.
  >  *   Replace missing values
  >  *   Outlier handling
  >  *   Removal of duplicates
 *   Convert non-numeric features to numeric features
 *   Perform a 80/20 train/test split
 *   Perform feature scaling
 *   In case of class imbalance, think about how you want to deal with it. Please briefly explain your decision.

#### Data cleaning
* We cannot replace the missing values ('unknown') with a meaningful substitute such as the median, because unknown values occur only in nominal features. We therefore decided to keep 'unknown' as a separate category.


In [None]:
print(len(df))
df_clean = df.drop_duplicates()
print(len(df_clean))
df = df_clean

* One duplicate row was found and removed.

We remove outliers by means of the following outlier detection function, which is based on z-standardization (z-scores).

In [None]:
def remove_outlier(df, columns, z_threshold):
  # drops every row whose value in one of `columns` lies z_threshold or more
  # standard deviations away from the column mean
  data_1 = df.copy()

  for column in columns:
    # mean and std of the column, computed on the unfiltered data
    mean_1 = np.mean(df[column])
    std_1 = np.std(df[column])

    # z_score = (y - mean_1) / std_1, solved for y at +/- z_threshold
    y_max = z_threshold * std_1 + mean_1
    y_min = (-z_threshold) * std_1 + mean_1

    print(column)
    print(f"Max: {y_max}")
    print(f"Min: {y_min}")

    # keep only the rows strictly inside (y_min, y_max)
    data_1 = data_1[(data_1[column] > y_min) & (data_1[column] < y_max)]

  return data_1
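
The same idea can be sketched in vectorized form for a single column. The data is made up, and the use of the population standard deviation (`ddof=0`) is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical column: 19 ordinary values and one extreme value
toy = pd.DataFrame({"balance": [100 + i for i in range(19)] + [5000]})

# z-score per row, using the population standard deviation (ddof=0, an assumption)
z = (toy["balance"] - toy["balance"].mean()) / toy["balance"].std(ddof=0)

# keep the rows whose absolute z-score is below the threshold of 3
kept = toy[z.abs() < 3]
print(len(toy), len(kept))
```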

We decided to only focus on three columns for outlier detection as they contain numeric values and clear outliers can be seen in the boxplots above.

In [None]:
outlier_columns = ['age', 'balance', 'campaign']
df_filtered = remove_outlier(df, outlier_columns, 3)

print(len(df.drop_duplicates()))
print(len(df_filtered))

print(df_filtered.head())

---
Non-numeric data have already been converted by encoding in the data exploration part.

In [None]:
#New X and y due to outlier removal
X = df_filtered.iloc[:, :-1]
y = df_filtered.iloc[:, -1]
#checking balance of the dataset
print(y.value_counts())

The dataset is not perfectly balanced. We therefore decided to perform a stratified train/test split, which keeps the class ratio identical in the training and the test set.

In [None]:
#train test split
np.random.seed(0)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 100, stratify = y)

In [None]:
#checking balance
print('Training data')
print(y_train.value_counts())
print('Test data')
print(y_test.value_counts())

We have now ensured that the ratio between positive and negative outcomes is the same in the training and the test data set as in the overall data set.
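
The effect of the `stratify` argument can be checked on a small synthetic example. The 60/40 class ratio and the arrays below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 60/40 class ratio and dummy features
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy
)

# Share of the positive class in each subset
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```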

In [None]:
#feature scaling
scaler = MinMaxScaler()

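#fit the scaler on the training data only, then apply it to both sets
#rebuilding the DataFrames avoids writing floats into integer columns in place
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)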

print(X_train.head())
print(X_test.head())

## Training and model evaluation (10 points):
*   **Training and model evaluation** (10 points):
 *   Please use `MLPClassifier` from `sklearn.neural_network`
 > *   The model should have 4 hidden layers with sizes hidden_layer_size=(10, 1) (parameter hidden_layer_sizes)
 > *   Set the batch_size to 64
 *   Evaluate the model performance
 > *   Calculate the accuracy and other metrics which might be helpful to evaluate the model's performance
 > *   Based on your findings, describe some measures you could take to improve the model's performance even further.
 > *   Try to analyse if you see indications of underfitting or overfitting and which countermeasures you could take.
 *   Please train another model using one of the techniques we have discussed in the lectures and compare the performance to the performance achieved with the neural network.

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10, 10) , batch_size = 64, max_iter = 500)
X_train = X_train.values.reshape(-1, len(X.columns))
y_train = y_train.values.reshape(-1)
X_test = X_test.values.reshape(-1, len(X.columns))
y_test = y_test.values.reshape(-1)

mlp.fit(X_train, y_train)

In [None]:
predictions = mlp.predict(X_test)

# Performance metrics
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
print(f"Precision: {precision_score(y_test, predictions)}")
print(f"Recall: {recall_score(y_test, predictions)}")

labels = range(0, 2)

conf_matrix = confusion_matrix(y_test, predictions, normalize='true')
cm_display = ConfusionMatrixDisplay(conf_matrix, display_labels=labels)
cm_display.plot()
plt.show()

The model has an accuracy of about 68%, a precision of ~70% and a recall of ~59%. In the normalized confusion matrix, the true-positive rate is 59% and the true-negative rate is 77%. Accordingly, the false-positive rate is 23% and the false-negative rate is 41%. Our dummy prediction would be to always predict the majority class, which here is the value 0. This would achieve an accuracy of 1121/(1121+1011) ≈ 52.58% on our test data set. In comparison, our model shows a moderate performance, but it is still clearly better than this baseline accuracy.
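
The baseline accuracy follows directly from the class counts of the test set stated above. A minimal sketch (the dictionary name `class_counts` is only illustrative):

```python
# Class counts of the test set as stated in the text
class_counts = {0: 1121, 1: 1011}

# Always predicting the majority class is correct for exactly that class
baseline_accuracy = max(class_counts.values()) / sum(class_counts.values())

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```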

In our opinion, precision is the most relevant metric for our use case.
It describes how many of the positively predicted labels are actually positive, i.e. how many of the customers predicted to make a deposit actually did so after the calls.
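
For clarity, precision and recall can be written out in terms of confusion-matrix counts. The counts below are hypothetical and do not come from our model.

```python
# Hypothetical confusion-matrix counts (not our model's results)
tp, fp, fn, tn = 60, 20, 40, 80

# Precision: share of predicted positives that are actually positive
precision = tp / (tp + fp)

# Recall: share of actual positives that the model finds
recall = tp / (tp + fn)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```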

There are several ways to improve the model's performance:
- On the one hand, one could collect more data in general, which is then used for a new, larger training set.
- On the other hand, one could think of a different ratio between the size of our training and test set.
- Another option would be to handle outliers differently, for example by cutting more or fewer outliers (changing the z-threshold) or by changing the features whose outlier rows are removed.
- Furthermore, dropping the features with the lowest correlation instead of using all features might lead to an improvement. However, a decrease in features might also lead to underfitting.
- We can also try out another activation function and another optimizer (solver).
- We could adapt the model by changing the number of hidden layers and the number of nodes in each layer.
- Decrease the learning rate of the model for a finer/better fit. We could also use different modes of the learning rate, for example an adaptive rate that decreases over time. This can lead to a better fit with the same training time.
- Decrease the batch size, which makes the model training slower but can achieve a better result.
- Increase the number of training epochs to achieve a better model fit.


**Over-/Underfitting:**

There are several features that have only a very small correlation with our dependent variable y "deposit" (e.g. corr(age, deposit) = 0.0349).
We could cut some of these features and train the model again. If the model then has a better accuracy it was probably overfitted before.
We can also compare the training accuracy to the test accuracy. A higher training accuracy than the test accuracy can be a sign of overfitting.

In [None]:
# Detect overfitting:
# overfitting is likely when the training accuracy is much greater than the test accuracy


train_pred = mlp.predict(X_train)

print(f"Train Accuracy: {accuracy_score(y_train, train_pred)}")
print(f"Test Accuracy: {accuracy_score(y_test, predictions)}")



Our model's training accuracy is about 3 percentage points higher than its test accuracy. This indicates slight overfitting. To counter this we can use the following methods:

* Cut some features (with a low correlation to y) and see if the model performs better.
* Increase the amount of regularization used.
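
As a rough sketch of the regularization option: `MLPClassifier` exposes the L2 penalty as `alpha` and offers `early_stopping`. The data below is synthetic, and the chosen `alpha=0.01` is an arbitrary assumption, not a tuned value for the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (hypothetical), not the bank dataset
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Stronger L2 penalty than the default and early stopping on a validation split
regularized = MLPClassifier(
    hidden_layer_sizes=(10, 10, 10, 10),
    batch_size=64,
    alpha=0.01,
    early_stopping=True,
    max_iter=500,
    random_state=0,
)
regularized.fit(X_tr, y_tr)

# A smaller gap between training and test accuracy suggests less overfitting
gap = regularized.score(X_tr, y_tr) - regularized.score(X_te, y_te)
print(f"Train/test accuracy gap: {gap:.3f}")
```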

### Training another model

Logistic Regression Model:

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

log_pred = logreg.predict(X_test)

# Performance metrics
print(f"Accuracy: {accuracy_score(y_test, log_pred)}")
print(f"Precision: {precision_score(y_test, log_pred)}")
print(f"Recall: {recall_score(y_test, log_pred)}")

labels = range(0, 2)

conf_matrix = confusion_matrix(y_test, log_pred, normalize='true')
cm_display = ConfusionMatrixDisplay(conf_matrix, display_labels=labels)
cm_display.plot()
plt.show()

KNN Model:

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

knn_pred = knn.predict(X_test)

# Performance metrics
print(f"Accuracy: {accuracy_score(y_test, knn_pred)}")
print(f"Precision: {precision_score(y_test, knn_pred)}")
print(f"Recall: {recall_score(y_test, knn_pred)}")

labels = range(0, 2)

conf_matrix = confusion_matrix(y_test, knn_pred, normalize='true')
cm_display = ConfusionMatrixDisplay(conf_matrix, display_labels=labels)
cm_display.plot()
plt.show()

Decision Tree:

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

dt_pred = dt.predict(X_test)

# Performance metrics
print(f"Accuracy: {accuracy_score(y_test, dt_pred)}")
print(f"Precision: {precision_score(y_test, dt_pred)}")
print(f"Recall: {recall_score(y_test, dt_pred)}")

labels = range(0, 2)

conf_matrix = confusion_matrix(y_test, dt_pred, normalize='true')
cm_display = ConfusionMatrixDisplay(conf_matrix, display_labels=labels)
cm_display.plot()
plt.show()

We trained 3 different models for comparison, where the best model of these was the LogisticRegression with these metrics:

Accuracy: 0.667

Precision: 0.647

Recall: 0.654

Our Neural Network has these metrics:

Accuracy: 0.683

Precision: 0.700

Recall: 0.589

So our neural network has a better accuracy (+1.6 percentage points) and an even better precision (+5.3 percentage points), but a worse recall (-6.5 percentage points).

We would still use the neural network model, because the most important aspect for us is to have the best success rate for our selected calls (measured by the precision).
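
The differences between the two models can be recomputed from the metrics listed above. A minimal sketch (the dictionary names are only illustrative):

```python
# Metrics as reported above
logreg_metrics = {"accuracy": 0.667, "precision": 0.647, "recall": 0.654}
mlp_metrics = {"accuracy": 0.683, "precision": 0.700, "recall": 0.589}

# Difference in percentage points (neural network minus logistic regression)
diff_pp = {
    name: round((mlp_metrics[name] - logreg_metrics[name]) * 100, 1)
    for name in logreg_metrics
}

for name, value in diff_pp.items():
    print(f"{name}: {value:+.1f} percentage points")
```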





In [None]:
# same architecture as before, but with tanh instead of the default relu activation
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10, 10), batch_size=64, activation='tanh', max_iter=500)

mlp.fit(X_train, y_train)

In [None]:
predictions = mlp.predict(X_test)

# Performance metrics
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

labels = range(0, 2)

conf_matrix = confusion_matrix(y_test, predictions, normalize='true')
cm_display = ConfusionMatrixDisplay(conf_matrix, display_labels=labels)
cm_display.plot()
plt.show()

# Sandbox Hyperparameter Tuning

We did some hyperparameter tuning for our neural network.
The grid search takes a very long time to compute, so we have documented our results here.

We had to realize that the hyperparameters below do not improve our model.

'Best' hyperparameters:

{'activation': 'identity', 'alpha': 0.05, 'batch_size': 20, 'hidden_layer_sizes': (20, 20, 20, 20), 'learning_rate': 'constant', 'solver': 'lbfgs'}

In [None]:
# mlp_gs = MLPClassifier(max_iter=500)
# parameter_space = {
#     'hidden_layer_sizes': [(10,10,10,10),(5,5,5,5),(20,20,20,20), (10,10,10), (10,10)],
#     'activation': ['tanh', 'relu', 'identity', 'logistic'],
#     'solver': ['lbfgs', 'sgd', 'adam'],
#     'alpha': [0.0001, 0.05],
#     'learning_rate': ['constant','adaptive', 'invscaling'],
#     'batch_size': [64, 20, 100],
# }
# from sklearn.model_selection import GridSearchCV
# clf = GridSearchCV(mlp_gs, parameter_space, n_jobs=-1, cv=5)
# clf.fit(X, y) # X is train samples and y is the corresponding labels
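
A much smaller grid search runs in reasonable time. The following sketch uses synthetic data and a deliberately tiny grid. Both are assumptions for illustration and not the search we actually ran. The search is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (hypothetical), not the bank dataset
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Deliberately tiny grid: 2 x 2 = 4 candidates, each evaluated with 3-fold CV
small_grid = {
    "hidden_layer_sizes": [(10, 10), (10, 10, 10, 10)],
    "alpha": [0.0001, 0.05],
}
search = GridSearchCV(MLPClassifier(max_iter=300, random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)  # only the training split is used for the search

test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(test_accuracy)
```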

In [None]:
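# note: with solver='lbfgs', scikit-learn does not use mini-batches, and learning_rate only applies to solver='sgd'
mlp2 = MLPClassifier(activation='identity', alpha=0.05, batch_size=20, hidden_layer_sizes=(20, 20, 20, 20), learning_rate='constant', solver='lbfgs', max_iter=500)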
X_train = X_train.reshape(-1, len(X.columns))
y_train = y_train.reshape(-1)
X_test = X_test.reshape(-1, len(X.columns))
y_test = y_test.reshape(-1)

mlp2.fit(X_train, y_train)

pred2 = mlp2.predict(X_test)

# Performance metrics
print(f"Accuracy: {accuracy_score(y_test, pred2)}")
print(f"Precision: {precision_score(y_test, pred2)}")
print(f"Recall: {recall_score(y_test, pred2)}")

labels = range(0, 2)

conf_matrix = confusion_matrix(y_test, pred2, normalize='true')
cm_display = ConfusionMatrixDisplay(conf_matrix, display_labels=labels)
cm_display.plot()
plt.show()


#### Further tips for working on the assignment:

When analyzing the model's performance, please think about what the baseline performance of the task would be and if your model performs better or not. It is quite unlikely the model will get a perfect score with the given parametrization. You can try to improve the performance by varying several hyperparameters of the model (e.g. number of hidden layers and number of neurons in a hidden layer, batch_size, training epochs, etc.).

Please be aware that too many hidden layers and neurons and a large number of epochs will cause the model to train longer. If the model is too complex you might encounter time-outs in Colab.

If the number of epochs is too low, sklearn will show a warning that the model has not yet converged.
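
The convergence warning can be reproduced and detected in code. This is a small sketch on synthetic data, where `max_iter=2` is chosen deliberately low to trigger the warning:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (hypothetical)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Record all warnings raised while fitting with far too few epochs
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    MLPClassifier(max_iter=2, random_state=0).fit(X_toy, y_toy)

not_converged = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print(f"Convergence warning raised: {not_converged}")
```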