## **Term Project**
- Instructor: [Jaeung Sim](https://www.business.uconn.edu/person/jaeung-sim/) (University of Connecticut)
- Course: OPIM 5512 Data Science Using Python
- Submission Deadline: March 31 (Sun), 2024
- Presentation Day: April 4 (Thu), 2024

**Team Members**

* [Lixi Yang] [MS FinTech]
> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/UCONN_academic_logo.png/800px-UCONN_academic_logo.png" width="200" height="58">

**Objectives**
* Build a predictive model and maximize its accuracy on the given test set.

**Evaluation**
* Predictive performance (400 points)
* Presentation of your model (100 points)

**Things to submit**
1. **(Revise the current file)** Python notebook with your model training and prediction process
1. Prediction results on the train set (CSV file)
1. Prediction results on the test set (CSV file)
1. PowerPoint slides for presentation (PPTX file)


### **Part 1. Data Loading and Processing Stage**

In [18]:
# Set Google Drive directory
import os
os.getcwd()

from google.colab import drive
drive.mount('/content/drive')

os.chdir('/content/drive/My Drive/Colab Notebooks/OPIM 5512')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
# Import common modules
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

import numpy as np
import os
import sys

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import cv2
import IPython
from six.moves import urllib

print(tf.__version__)

2.15.0


In [62]:
# Load the datasets
train_df = pd.read_csv('train_set.csv')
test_df = pd.read_csv('test_set.csv')

In [63]:
# Handle missing values if any
train_df.fillna(train_df.mean(), inplace=True)
test_df.fillna(test_df.mean(), inplace=True)
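The cell above fills each set with its own column means. A common alternative is to compute the means on the training features only and reuse them for the test set, so both sets are treated identically. The sketch below illustrates this on made-up toy frames; `toy_train`, `toy_test`, and `feature_cols` are hypothetical names, not part of the project data.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_df / test_df (values are made up)
toy_train = pd.DataFrame({"x1": [1.0, np.nan, 3.0], "x2": [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({"x1": [np.nan, 5.0], "x2": [np.nan, 40.0]})

feature_cols = ["x1", "x2"]

# Compute column means on the training features only
train_means = toy_train[feature_cols].mean()

# Fill missing values in both sets with the same training-set means
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

Whether this changes anything for the project depends on whether the provided files contain missing values at all, which the notebook does not report.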

In [64]:
train_df.head(10)

Unnamed: 0,train_ids,outcome,pred_outcome,x1,x2,x3,x4,x5,x6,x7,...,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30
0,1,0,,12.455142,-0.726902,0.132204,0.021137,0.375811,0.849215,-11.551916,...,0.019066,5.067406,0.061547,-0.057606,-0.920825,2.090509,0.656039,0.379908,0.003946,0.295916
1,2,1,,12.689968,-0.695152,0.854335,0.186314,0.580607,0.631155,-12.150918,...,0.951556,15.611707,0.042158,-0.2313,-1.60545,1.11636,0.959253,0.464886,0.514108,0.177314
2,3,1,,12.617456,-1.092101,0.667993,0.996318,0.003047,0.891401,-11.512059,...,0.664515,16.988228,0.131478,-1.593457,-1.137651,1.700194,0.345309,0.837884,0.220993,-0.165809
3,4,1,,13.286695,-0.809891,0.473316,0.633564,1.792703,0.791155,-17.873219,...,0.54982,22.169784,0.634384,-1.211246,-0.667244,1.342179,0.137132,0.343911,0.583459,-0.220375
4,5,0,,12.795946,-0.879515,0.503944,0.633704,0.039258,0.509893,-15.337241,...,0.364394,22.919292,0.129454,-1.832095,-1.521731,2.505429,0.102728,0.474297,0.114105,0.457079
5,6,1,,12.606099,-0.870836,0.136574,0.533365,1.66085,0.206676,-8.510825,...,0.882047,21.581377,0.677712,-2.051231,-1.250059,1.152854,0.873212,0.072503,0.55795,0.496189
6,7,1,,12.736353,-0.356584,0.83426,0.463782,0.021484,0.44657,-11.559809,...,0.048353,16.218598,0.727762,-2.51947,-0.609035,1.584525,0.167062,0.433042,0.946915,-0.528994
7,8,1,,12.876908,-0.242602,0.588064,0.491945,0.147027,0.186708,-12.898948,...,0.7363,19.662901,0.739841,-1.31637,-0.743059,1.165816,0.12554,0.237227,0.479306,0.269989
8,9,1,,13.314789,-0.312973,0.915521,0.88772,0.165026,0.200054,-20.12762,...,0.271457,20.822864,0.800611,-1.687842,-2.312733,1.754037,0.524364,0.622769,0.635082,0.555477
9,10,1,,12.634204,-1.036817,0.562269,0.329655,0.680303,0.531583,-19.43686,...,0.819362,20.838756,0.208523,-2.10043,-2.808996,2.32142,0.998407,0.950286,0.129001,0.574724


In [65]:
test_df.head(10)

Unnamed: 0,test_ids,pred_outcome,x1,x2,x3,x4,x5,x6,x7,x8,...,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30
0,1,,12.498463,-0.646779,0.78595,0.91077,1.731347,0.682239,-10.221844,0.07174,...,0.220086,20.017854,0.329739,-0.385504,-1.265012,2.109177,0.355305,0.863437,0.731918,-0.262751
1,2,,13.100022,-0.544496,0.214328,0.176274,1.768621,0.325016,-20.261994,0.387404,...,0.938708,4.604821,0.8318,-1.66694,-2.392454,1.759878,0.257287,0.692118,0.147551,-0.792381
2,3,,12.729651,-0.913639,0.342797,0.3574,0.031701,0.447554,-12.219009,0.708057,...,0.765379,21.774929,0.897186,-1.174316,-0.611716,1.02852,0.14978,0.361348,0.907778,0.840932
3,4,,13.326329,-1.184101,0.328942,0.649397,0.490836,0.111597,-12.428663,0.788841,...,0.238402,11.547741,0.407835,-1.752565,-2.763918,1.57935,0.411979,0.181892,0.167979,0.21436
4,5,,13.025427,-0.640031,0.811635,0.379645,0.116359,0.513888,-14.010096,0.570646,...,0.973849,12.794935,0.445175,-1.572456,-0.940902,1.239223,0.382799,0.338374,0.41813,-0.194136
5,6,,13.049361,-0.59989,0.218481,0.530789,0.096108,0.980475,-8.73254,0.957685,...,0.651577,8.947086,0.267441,-2.327667,-0.599476,2.21985,0.486613,0.67512,0.1752,0.471237
6,7,,13.138126,-1.03737,0.314835,0.095242,0.345522,0.292107,-13.988552,0.647613,...,0.99178,4.15075,0.880012,-1.954702,-0.918427,1.598132,0.488881,0.224602,0.89,-0.514479
7,8,,12.744857,-0.629873,0.383764,0.920447,1.602793,0.072955,-11.114146,0.732761,...,0.762695,16.969214,0.636586,-1.359622,0.444983,1.968079,0.676495,0.08871,0.541081,-0.609453
8,9,,12.580849,-0.736689,0.426734,0.50165,0.243313,0.814972,-12.104569,0.828754,...,0.305391,20.241162,0.264136,-2.401171,-4.365939,1.810088,0.974319,0.021955,0.581049,-0.45999
9,10,,12.981449,-0.423168,0.289255,0.760818,0.689598,0.107961,-12.120379,0.453518,...,0.975265,2.786237,0.820555,-1.507574,-1.793304,1.431971,0.151807,0.936106,0.834047,-0.545817


In [66]:
# Splitting the training data into features and target variable
X_train = train_df.drop(['outcome', 'train_ids', 'pred_outcome'], axis=1)
Y_train = train_df['outcome']

In [67]:
X_train.head(10)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30
0,12.455142,-0.726902,0.132204,0.021137,0.375811,0.849215,-11.551916,0.180976,0.918221,0.085364,...,0.019066,5.067406,0.061547,-0.057606,-0.920825,2.090509,0.656039,0.379908,0.003946,0.295916
1,12.689968,-0.695152,0.854335,0.186314,0.580607,0.631155,-12.150918,0.453926,0.752773,0.216626,...,0.951556,15.611707,0.042158,-0.2313,-1.60545,1.11636,0.959253,0.464886,0.514108,0.177314
2,12.617456,-1.092101,0.667993,0.996318,0.003047,0.891401,-11.512059,0.231615,0.366196,0.971985,...,0.664515,16.988228,0.131478,-1.593457,-1.137651,1.700194,0.345309,0.837884,0.220993,-0.165809
3,13.286695,-0.809891,0.473316,0.633564,1.792703,0.791155,-17.873219,0.696212,0.564825,0.35082,...,0.54982,22.169784,0.634384,-1.211246,-0.667244,1.342179,0.137132,0.343911,0.583459,-0.220375
4,12.795946,-0.879515,0.503944,0.633704,0.039258,0.509893,-15.337241,0.563707,0.47012,0.780886,...,0.364394,22.919292,0.129454,-1.832095,-1.521731,2.505429,0.102728,0.474297,0.114105,0.457079
5,12.606099,-0.870836,0.136574,0.533365,1.66085,0.206676,-8.510825,0.535948,0.759629,0.625043,...,0.882047,21.581377,0.677712,-2.051231,-1.250059,1.152854,0.873212,0.072503,0.55795,0.496189
6,12.736353,-0.356584,0.83426,0.463782,0.021484,0.44657,-11.559809,0.458226,0.949339,0.736243,...,0.048353,16.218598,0.727762,-2.51947,-0.609035,1.584525,0.167062,0.433042,0.946915,-0.528994
7,12.876908,-0.242602,0.588064,0.491945,0.147027,0.186708,-12.898948,0.786294,0.290779,0.335637,...,0.7363,19.662901,0.739841,-1.31637,-0.743059,1.165816,0.12554,0.237227,0.479306,0.269989
8,13.314789,-0.312973,0.915521,0.88772,0.165026,0.200054,-20.12762,0.237465,0.023705,0.687848,...,0.271457,20.822864,0.800611,-1.687842,-2.312733,1.754037,0.524364,0.622769,0.635082,0.555477
9,12.634204,-1.036817,0.562269,0.329655,0.680303,0.531583,-19.43686,0.29723,0.136939,0.063443,...,0.819362,20.838756,0.208523,-2.10043,-2.808996,2.32142,0.998407,0.950286,0.129001,0.574724


In [68]:
Y_train.head(10)

0    0
1    1
2    1
3    1
4    0
5    1
6    1
7    1
8    1
9    1
Name: outcome, dtype: int64

In [69]:
X_test = test_df.drop(['test_ids', 'pred_outcome'], axis=1)
X_test.head(10)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30
0,12.498463,-0.646779,0.78595,0.91077,1.731347,0.682239,-10.221844,0.07174,0.623391,0.929441,...,0.220086,20.017854,0.329739,-0.385504,-1.265012,2.109177,0.355305,0.863437,0.731918,-0.262751
1,13.100022,-0.544496,0.214328,0.176274,1.768621,0.325016,-20.261994,0.387404,0.560405,0.883938,...,0.938708,4.604821,0.8318,-1.66694,-2.392454,1.759878,0.257287,0.692118,0.147551,-0.792381
2,12.729651,-0.913639,0.342797,0.3574,0.031701,0.447554,-12.219009,0.708057,0.204432,0.231154,...,0.765379,21.774929,0.897186,-1.174316,-0.611716,1.02852,0.14978,0.361348,0.907778,0.840932
3,13.326329,-1.184101,0.328942,0.649397,0.490836,0.111597,-12.428663,0.788841,0.701492,0.246072,...,0.238402,11.547741,0.407835,-1.752565,-2.763918,1.57935,0.411979,0.181892,0.167979,0.21436
4,13.025427,-0.640031,0.811635,0.379645,0.116359,0.513888,-14.010096,0.570646,0.26305,0.717662,...,0.973849,12.794935,0.445175,-1.572456,-0.940902,1.239223,0.382799,0.338374,0.41813,-0.194136
5,13.049361,-0.59989,0.218481,0.530789,0.096108,0.980475,-8.73254,0.957685,0.654378,0.948077,...,0.651577,8.947086,0.267441,-2.327667,-0.599476,2.21985,0.486613,0.67512,0.1752,0.471237
6,13.138126,-1.03737,0.314835,0.095242,0.345522,0.292107,-13.988552,0.647613,0.192534,0.731026,...,0.99178,4.15075,0.880012,-1.954702,-0.918427,1.598132,0.488881,0.224602,0.89,-0.514479
7,12.744857,-0.629873,0.383764,0.920447,1.602793,0.072955,-11.114146,0.732761,0.592781,0.344443,...,0.762695,16.969214,0.636586,-1.359622,0.444983,1.968079,0.676495,0.08871,0.541081,-0.609453
8,12.580849,-0.736689,0.426734,0.50165,0.243313,0.814972,-12.104569,0.828754,0.021356,0.532294,...,0.305391,20.241162,0.264136,-2.401171,-4.365939,1.810088,0.974319,0.021955,0.581049,-0.45999
9,12.981449,-0.423168,0.289255,0.760818,0.689598,0.107961,-12.120379,0.453518,0.003652,0.277529,...,0.975265,2.786237,0.820555,-1.507574,-1.793304,1.431971,0.151807,0.936106,0.834047,-0.545817


### **Part 2. Model Training Stage**

Start with some common binary classification models.

In [28]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression().fit(X_train, Y_train)
Y_pred1 = logit.predict(X_train)

from sklearn.metrics import classification_report

# Print the result
print(classification_report(Y_train, Y_pred1))

              precision    recall  f1-score   support

           0       0.76      0.69      0.72     19449
           1       0.85      0.89      0.87     37351

    accuracy                           0.82     56800
   macro avg       0.80      0.79      0.79     56800
weighted avg       0.82      0.82      0.82     56800



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
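The warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both, assuming a standard scikit-learn pipeline, is shown here on synthetic data from `make_classification` (a stand-in, not the project data); `X_demo`, `y_demo`, and `scaled_logit` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 30 features, like x1..x30
X_demo, y_demo = make_classification(n_samples=500, n_features=30, random_state=0)

# Standardize the features, then fit logistic regression with a higher iteration cap
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logit.fit(X_demo, y_demo)

demo_accuracy = scaled_logit.score(X_demo, y_demo)
print(f"Training accuracy on synthetic data: {demo_accuracy:.3f}")
```

Because the scaler is part of the pipeline, calling `predict` on new rows applies the same scaling learned from the training rows.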


In [29]:
# Support Vector Machines
from sklearn.svm import SVC

svm = SVC().fit(X_train, Y_train)
Y_pred2 = svm.predict(X_train)

# Print the result
print(classification_report(Y_train, Y_pred2))

              precision    recall  f1-score   support

           0       0.78      0.66      0.71     19449
           1       0.84      0.90      0.87     37351

    accuracy                           0.82     56800
   macro avg       0.81      0.78      0.79     56800
weighted avg       0.82      0.82      0.81     56800



In [30]:
# Gradient Boosting Trees
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier().fit(X_train, Y_train)
Y_pred3 = gbt.predict(X_train)

# Print the result
print(classification_report(Y_train, Y_pred3))

              precision    recall  f1-score   support

           0       0.81      0.69      0.74     19449
           1       0.85      0.91      0.88     37351

    accuracy                           0.84     56800
   macro avg       0.83      0.80      0.81     56800
weighted avg       0.83      0.84      0.83     56800



In [31]:
# K Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier().fit(X_train, Y_train)
Y_pred4 = knn.predict(X_train)

# Print the result
print(classification_report(Y_train, Y_pred4))

              precision    recall  f1-score   support

           0       0.77      0.65      0.71     19449
           1       0.83      0.90      0.87     37351

    accuracy                           0.82     56800
   macro avg       0.80      0.78      0.79     56800
weighted avg       0.81      0.82      0.81     56800



In [32]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB

# Use Gaussian Naive Bayes because it works for continuous data
gnb = GaussianNB().fit(X_train, Y_train)
Y_pred5 = gnb.predict(X_train)

# Print the result
print(classification_report(Y_train, Y_pred5))

              precision    recall  f1-score   support

           0       0.73      0.61      0.67     19449
           1       0.81      0.88      0.85     37351

    accuracy                           0.79     56800
   macro avg       0.77      0.75      0.76     56800
weighted avg       0.79      0.79      0.79     56800



In [33]:
# Perceptron
from sklearn.linear_model import Perceptron

perceptron = Perceptron().fit(X_train, Y_train)
Y_pred6 = perceptron.predict(X_train)

# Print the result
print(classification_report(Y_train, Y_pred6))

              precision    recall  f1-score   support

           0       0.47      0.99      0.63     19449
           1       0.99      0.41      0.58     37351

    accuracy                           0.61     56800
   macro avg       0.73      0.70      0.61     56800
weighted avg       0.81      0.61      0.60     56800



In [34]:
# Neural Network
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(max_iter=1000)
mlp.fit(X_train, Y_train)
Y_pred7 = mlp.predict(X_train)

# Print the result
print(classification_report(Y_train, Y_pred7))

              precision    recall  f1-score   support

           0       0.79      0.75      0.77     19449
           1       0.87      0.90      0.89     37351

    accuracy                           0.85     56800
   macro avg       0.83      0.82      0.83     56800
weighted avg       0.85      0.85      0.85     56800
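All of the reports above score each model on the same rows it was trained on, so they probably overstate accuracy on unseen data. One way to get a less optimistic estimate is k-fold cross-validation. The sketch below uses synthetic data and a smaller gradient boosting model purely for illustration; `X_demo`, `y_demo`, and `gbt_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 30 features, like x1..x30
X_demo, y_demo = make_classification(n_samples=600, n_features=30, random_state=0)

gbt_demo = GradientBoostingClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: each fold is scored by a model that never saw it
cv_scores = cross_val_score(gbt_demo, X_demo, y_demo, cv=5, scoring="accuracy")

# Training-set accuracy for comparison (same rows used for fitting and scoring)
train_score = gbt_demo.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"Cross-validated accuracy: {cv_scores.mean():.3f}")
print(f"Training accuracy:        {train_score:.3f}")
```

Comparing the two numbers gives a rough sense of how much a model overfits.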



Scikit-learn provides a relatively simple and intuitive way to build and train neural network models. It is designed for traditional machine learning tasks and offers tools for implementing algorithms quickly.

Next, I apply what we learned about deep learning over the past few weeks to build a neural network with Keras. Keras offers more flexibility and control: the model is assembled layer by layer, dropout layers can be added to reduce overfitting, and the optimizer, loss function, and evaluation metrics are specified explicitly. This gives more options for customizing model training.

In [35]:
# Install the keras-tuner library for hyperparameter tuning
%pip install keras-tuner

Collecting keras-tuner
  Downloading keras_tuner-1.4.7-py3-none-any.whl (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.1/129.1 kB 3.6 MB/s eta 0:00:00
Collecting kt-legacy (from keras-tuner)
  Downloading kt_legacy-1.0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: kt-legacy, keras-tuner
Successfully installed keras-tuner-1.4.7 kt-legacy-1.0.5


In [47]:
# First step is to find the optimal learning rate, number of hidden layers, number of neurons, and dropout rate through a random search
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from keras_tuner import RandomSearch

# The first Dense layer takes the 30 input features; its width ('input_units') is tuned from 10 to 100 in steps of 10
# Number of neurons in each additional hidden layer is randomly selected from 10 to 100 in steps of 10
# Number of additional hidden layers is randomly selected from 1 to 3
# Dropout rate is randomly selected from 0 to 0.3 in steps of 0.1
# Hidden layers use the relu activation function
# Output layer uses sigmoid, which maps the output to a probability for binary classification
# Use adam as optimizer
# Learning rate is randomly chosen from a fixed list of values
def build_model(hp):
    model = Sequential()
    model.add(Dense(units=hp.Int('input_units', min_value=10, max_value=100, step=10), activation='relu', input_shape=(30,)))
    for i in range(hp.Int('n_layers', 1, 3)):
        model.add(Dense(units=hp.Int(f'units_layer_{i}', min_value=10, max_value=100, step=10), activation='relu'))
        model.add(Dropout(rate=hp.Float(f'dropout_layer_{i}', min_value=0.0, max_value=0.3, step=0.1)))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer=tf.keras.optimizers.Adam(
          hp.Choice('learning_rate', values=[0.05, 0.025, 0.01, 0.005, 0.001, 0.0005, 0.0001])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model

# Define random search rules
# Due to time constraints, I will try 100 parameter combinations, executing each once
tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=100,
    executions_per_trial=1,
    directory='my_dir',
    project_name='keras_tuner_6'
)

# Epochs and batch size are fixed at 100 and 1000 during the search
# 30% of the original training set is held out as a validation set
tuner.search(X_train, Y_train, epochs=100, batch_size=1000, validation_split=0.3, verbose=2)

# Return the best performing set of hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

# Print the result
print(f"Best learning rate: {best_hps.get('learning_rate')}")
print(f"Best number of layers: {best_hps.get('n_layers')}")
for i in range(best_hps.get('n_layers')):
    print(f" - Layer {i+1} neurons: {best_hps.get(f'units_layer_{i}')}")
    print(f" - Layer {i+1} dropout: {best_hps.get(f'dropout_layer_{i}')}")

Reloading Tuner from my_dir/keras_tuner_6/tuner0.json
Best learning rate: 0.01
Best number of layers: 2
 - Layer 1 neurons: 70
 - Layer 1 dropout: 0.2
 - Layer 2 neurons: 90
 - Layer 2 dropout: 0.2
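To make the search space concrete, the following sketch draws random configurations from the same ranges used in `build_model`, using only the standard library. It illustrates the idea of random search and is not how keras-tuner samples internally; `sample_config` is a hypothetical helper, and the dropout endpoints 0.0 and 0.3 are assumed to be included.

```python
import random

LEARNING_RATES = [0.05, 0.025, 0.01, 0.005, 0.001, 0.0005, 0.0001]

def sample_config(rng):
    """Draw one configuration from the ranges described in build_model."""
    n_layers = rng.randint(1, 3)  # 1 to 3 hidden layers, inclusive
    return {
        "input_units": rng.randrange(10, 101, 10),  # 10, 20, ..., 100
        "n_layers": n_layers,
        "units": [rng.randrange(10, 101, 10) for _ in range(n_layers)],
        "dropout": [rng.choice([0.0, 0.1, 0.2, 0.3]) for _ in range(n_layers)],
        "learning_rate": rng.choice(LEARNING_RATES),
    }

rng = random.Random(0)  # fixed seed so the draws are repeatable
trials = [sample_config(rng) for _ in range(5)]
for trial in trials:
    print(trial)
```

Each draw corresponds to one trial; a tuner would train a model per draw and keep the configuration with the best validation accuracy.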


In [50]:
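# Then use the EarlyStopping callback to find the best combination of epochs and batch size
from tensorflow.keras.callbacks import EarlyStopping

# Apply the tuned settings printed above (70 and 90 neurons, dropout 0.2, learning rate 0.01)
# Note: the separately tuned first layer ('input_units') from build_model is not reproduced here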
def create_model():
    model = Sequential([
        Dense(70, activation='relu', input_shape=(30,)),
        Dropout(0.2),
        Dense(90, activation='relu'),
        Dropout(0.2),
        Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Early stopping monitors validation accuracy (mode='max')
# If val_accuracy does not improve for 50 consecutive epochs, training is stopped

# Test the performance of chosen batch sizes one by one
# Initialize the best values
batch_sizes = [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2500, 3000]
best_batch_size = 0
best_val_accuracy = 0
best_epochs = 0

# The maximum epochs is set to 1000
# Record best validation accuracy, best batch size, and the epoch at which it was reached
for batch_size in batch_sizes:
    print(f"\nTraining with batch size: {batch_size}")
    # Rebuild the model and the callback so every batch size starts from fresh weights
    model = create_model()
    early_stopping = EarlyStopping(monitor='val_accuracy', patience=50, mode='max', restore_best_weights=True)
    history = model.fit(X_train, Y_train,
                        epochs=1000,
                        batch_size=batch_size,
                        validation_split=0.2,
                        callbacks=[early_stopping],
                        verbose=2)
    val_accuracy = max(history.history['val_accuracy'])

    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        best_batch_size = batch_size
        # 1-based epoch with the highest validation accuracy (not the epoch where training stopped)
        best_epochs = int(np.argmax(history.history['val_accuracy'])) + 1

# Print the result
print(f"Best batch size: {best_batch_size}")
print(f"Best epochs: {best_epochs}")
print(f"Best val accuracy: {best_val_accuracy}")


Training with batch size: 200
Epoch 1/1000
228/228 - 3s - loss: 0.5600 - accuracy: 0.7459 - val_loss: 0.3947 - val_accuracy: 0.8157 - 3s/epoch - 14ms/step
Epoch 2/1000
228/228 - 1s - loss: 0.4058 - accuracy: 0.8074 - val_loss: 0.3906 - val_accuracy: 0.8140 - 1s/epoch - 6ms/step
Epoch 3/1000
228/228 - 1s - loss: 0.4003 - accuracy: 0.8084 - val_loss: 0.3888 - val_accuracy: 0.8150 - 1s/epoch - 6ms/step
Epoch 4/1000
228/228 - 1s - loss: 0.3966 - accuracy: 0.8111 - val_loss: 0.3922 - val_accuracy: 0.8129 - 818ms/epoch - 4ms/step
Epoch 5/1000
228/228 - 1s - loss: 0.3884 - accuracy: 0.8130 - val_loss: 0.3713 - val_accuracy: 0.8194 - 769ms/epoch - 3ms/step
Epoch 6/1000
228/228 - 1s - loss: 0.3873 - accuracy: 0.8140 - val_loss: 0.3865 - val_accuracy: 0.8156 - 1s/epoch - 5ms/step
Epoch 7/1000
228/228 - 1s - loss: 0.3821 - accuracy: 0.8175 - val_loss: 0.3737 - val_accuracy: 0.8185 - 1s/epoch - 6ms/step
Epoch 8/1000
228/228 - 1s - loss: 0.3779 - accuracy: 0.8193 - val_loss: 0.3663 - val_accuracy:
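
The patience rule can be sketched in plain Python to show the difference between the epoch with the best validation accuracy and the epoch at which training stops. This is a simplified illustration, not the Keras implementation; `early_stop_summary` is a hypothetical helper, the accuracy values are made up, and epochs are counted from 1.

```python
def early_stop_summary(val_accuracies, patience):
    """Return (best_epoch, stopped_epoch), both 1-based.

    stopped_epoch is None when the patience rule never triggers.
    """
    best_value = float("-inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, value in enumerate(val_accuracies, start=1):
        if value > best_value:
            best_value = value
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Made-up validation accuracies for seven epochs
history = [0.80, 0.82, 0.81, 0.83, 0.82, 0.82, 0.81]
best_epoch, stopped_epoch = early_stop_summary(history, patience=3)
print(best_epoch, stopped_epoch)  # prints: 4 7
```

With a patience of 50, the stopping epoch is 50 epochs later than the best one, so the two numbers should not be used interchangeably when choosing the final number of epochs.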

In [52]:
# Apply all optimal hyperparameters to the final model
def final_model():
    model = Sequential([
        Dense(70, activation='relu', input_shape=(30,)),
        Dropout(0.2),
        Dense(90, activation='relu'),
        Dropout(0.2),
        Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

model = final_model()

final = model.fit(X_train, Y_train, epochs=127, batch_size=3000, verbose=2)

loss, accuracy = model.evaluate(X_train, Y_train, verbose=0)
print(f'Training Accuracy: {accuracy:.4f}')

Epoch 1/127
19/19 - 2s - loss: 0.9308 - accuracy: 0.5930 - 2s/epoch - 88ms/step
Epoch 2/127
19/19 - 0s - loss: 0.6064 - accuracy: 0.6822 - 262ms/epoch - 14ms/step
Epoch 3/127
19/19 - 0s - loss: 0.5287 - accuracy: 0.7294 - 244ms/epoch - 13ms/step
Epoch 4/127
19/19 - 0s - loss: 0.4686 - accuracy: 0.7713 - 257ms/epoch - 14ms/step
Epoch 5/127
19/19 - 0s - loss: 0.4309 - accuracy: 0.7952 - 244ms/epoch - 13ms/step
Epoch 6/127
19/19 - 0s - loss: 0.4113 - accuracy: 0.8048 - 260ms/epoch - 14ms/step
Epoch 7/127
19/19 - 0s - loss: 0.4008 - accuracy: 0.8093 - 253ms/epoch - 13ms/step
Epoch 8/127
19/19 - 0s - loss: 0.3989 - accuracy: 0.8100 - 243ms/epoch - 13ms/step
Epoch 9/127
19/19 - 0s - loss: 0.3903 - accuracy: 0.8141 - 247ms/epoch - 13ms/step
Epoch 10/127
19/19 - 0s - loss: 0.3883 - accuracy: 0.8148 - 255ms/epoch - 13ms/step
Epoch 11/127
19/19 - 0s - loss: 0.3837 - accuracy: 0.8166 - 237ms/epoch - 12ms/step
Epoch 12/127
19/19 - 0s - loss: 0.3823 - accuracy: 0.8182 - 246ms/epoch - 13ms/step
Epoc
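
If the Keras model were used for prediction, its sigmoid output would be a probability per row rather than a 0/1 label, so it would need to be thresholded first. A minimal sketch with made-up probabilities follows; the 0.5 cutoff and the names `demo_probs` and `demo_labels` are assumptions for illustration.

```python
import numpy as np

# Made-up sigmoid outputs with the (n_samples, 1) shape a Dense(1) output layer yields
demo_probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Flatten to 1-D and label a row 1 when its probability is at least 0.5
demo_labels = (demo_probs.ravel() >= 0.5).astype(int)

print(demo_labels)  # prints: [1 0 1 1]
```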

### **Part 3. Prediction and Exporting the Results for the Test Set**

In [72]:
# Predict with the scikit-learn MLPClassifier (mlp), which had the highest training accuracy (0.85) among the scikit-learn models above
pred_outcome = mlp.predict(X_test)

# Add predictions to the test dataframe
test_df['pred_outcome'] = pred_outcome

# Export to CSV file (note: this overwrites the original test_set.csv)
test_df.to_csv('test_set.csv', index=False)
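
The submission list also asks for prediction results on the train set. A possible way to produce that file, mirroring the test-set export, is sketched below on a toy frame. The output file name `train_set_predictions.csv`, the toy values, and the temporary directory are assumptions for illustration; in the notebook the predictions would come from the chosen model.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for train_df (values are made up)
toy_train = pd.DataFrame({
    "train_ids": [1, 2, 3],
    "outcome": [0, 1, 1],
    "pred_outcome": [None, None, None],
    "x1": [12.4, 12.7, 12.6],
})

# Stand-in for model.predict(X_train)
toy_preds = [0, 1, 0]
toy_train["pred_outcome"] = toy_preds

# Write to a separate file so the original input CSV is left untouched
out_path = os.path.join(tempfile.mkdtemp(), "train_set_predictions.csv")
toy_train.to_csv(out_path, index=False)

# Reload and check the share of rows where prediction and outcome agree
reloaded = pd.read_csv(out_path)
train_accuracy = (reloaded["pred_outcome"] == reloaded["outcome"]).mean()
print(f"Training accuracy: {train_accuracy:.3f}")
```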

In [74]:
# Check test dataframe
test_df.head(10)

Unnamed: 0,test_ids,pred_outcome,x1,x2,x3,x4,x5,x6,x7,x8,...,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30
0,1,0,12.498463,-0.646779,0.78595,0.91077,1.731347,0.682239,-10.221844,0.07174,...,0.220086,20.017854,0.329739,-0.385504,-1.265012,2.109177,0.355305,0.863437,0.731918,-0.262751
1,2,1,13.100022,-0.544496,0.214328,0.176274,1.768621,0.325016,-20.261994,0.387404,...,0.938708,4.604821,0.8318,-1.66694,-2.392454,1.759878,0.257287,0.692118,0.147551,-0.792381
2,3,1,12.729651,-0.913639,0.342797,0.3574,0.031701,0.447554,-12.219009,0.708057,...,0.765379,21.774929,0.897186,-1.174316,-0.611716,1.02852,0.14978,0.361348,0.907778,0.840932
3,4,1,13.326329,-1.184101,0.328942,0.649397,0.490836,0.111597,-12.428663,0.788841,...,0.238402,11.547741,0.407835,-1.752565,-2.763918,1.57935,0.411979,0.181892,0.167979,0.21436
4,5,1,13.025427,-0.640031,0.811635,0.379645,0.116359,0.513888,-14.010096,0.570646,...,0.973849,12.794935,0.445175,-1.572456,-0.940902,1.239223,0.382799,0.338374,0.41813,-0.194136
5,6,0,13.049361,-0.59989,0.218481,0.530789,0.096108,0.980475,-8.73254,0.957685,...,0.651577,8.947086,0.267441,-2.327667,-0.599476,2.21985,0.486613,0.67512,0.1752,0.471237
6,7,0,13.138126,-1.03737,0.314835,0.095242,0.345522,0.292107,-13.988552,0.647613,...,0.99178,4.15075,0.880012,-1.954702,-0.918427,1.598132,0.488881,0.224602,0.89,-0.514479
7,8,0,12.744857,-0.629873,0.383764,0.920447,1.602793,0.072955,-11.114146,0.732761,...,0.762695,16.969214,0.636586,-1.359622,0.444983,1.968079,0.676495,0.08871,0.541081,-0.609453
8,9,1,12.580849,-0.736689,0.426734,0.50165,0.243313,0.814972,-12.104569,0.828754,...,0.305391,20.241162,0.264136,-2.401171,-4.365939,1.810088,0.974319,0.021955,0.581049,-0.45999
9,10,1,12.981449,-0.423168,0.289255,0.760818,0.689598,0.107961,-12.120379,0.453518,...,0.975265,2.786237,0.820555,-1.507574,-1.793304,1.431971,0.151807,0.936106,0.834047,-0.545817


In [75]:
# Check CSV file
test_df_final = pd.read_csv('test_set.csv')
test_df_final.head(10)

Unnamed: 0,test_ids,pred_outcome,x1,x2,x3,x4,x5,x6,x7,x8,...,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30
0,1,0,12.498463,-0.646779,0.78595,0.91077,1.731347,0.682239,-10.221844,0.07174,...,0.220086,20.017854,0.329739,-0.385504,-1.265012,2.109177,0.355305,0.863437,0.731918,-0.262751
1,2,1,13.100022,-0.544496,0.214328,0.176274,1.768621,0.325016,-20.261994,0.387404,...,0.938708,4.604821,0.8318,-1.66694,-2.392454,1.759878,0.257287,0.692118,0.147551,-0.792381
2,3,1,12.729651,-0.913639,0.342797,0.3574,0.031701,0.447554,-12.219009,0.708057,...,0.765379,21.774929,0.897186,-1.174316,-0.611716,1.02852,0.14978,0.361348,0.907778,0.840932
3,4,1,13.326329,-1.184101,0.328942,0.649397,0.490836,0.111597,-12.428663,0.788841,...,0.238402,11.547741,0.407835,-1.752565,-2.763918,1.57935,0.411979,0.181892,0.167979,0.21436
4,5,1,13.025427,-0.640031,0.811635,0.379645,0.116359,0.513888,-14.010096,0.570646,...,0.973849,12.794935,0.445175,-1.572456,-0.940902,1.239223,0.382799,0.338374,0.41813,-0.194136
5,6,0,13.049361,-0.59989,0.218481,0.530789,0.096108,0.980475,-8.73254,0.957685,...,0.651577,8.947086,0.267441,-2.327667,-0.599476,2.21985,0.486613,0.67512,0.1752,0.471237
6,7,0,13.138126,-1.03737,0.314835,0.095242,0.345522,0.292107,-13.988552,0.647613,...,0.99178,4.15075,0.880012,-1.954702,-0.918427,1.598132,0.488881,0.224602,0.89,-0.514479
7,8,0,12.744857,-0.629873,0.383764,0.920447,1.602793,0.072955,-11.114146,0.732761,...,0.762695,16.969214,0.636586,-1.359622,0.444983,1.968079,0.676495,0.08871,0.541081,-0.609453
8,9,1,12.580849,-0.736689,0.426734,0.50165,0.243313,0.814972,-12.104569,0.828754,...,0.305391,20.241162,0.264136,-2.401171,-4.365939,1.810088,0.974319,0.021955,0.581049,-0.45999
9,10,1,12.981449,-0.423168,0.289255,0.760818,0.689598,0.107961,-12.120379,0.453518,...,0.975265,2.786237,0.820555,-1.507574,-1.793304,1.431971,0.151807,0.936106,0.834047,-0.545817
