# Neural Network Final Solution Experiments - Classification Tasks

In this notebook we test our final solution on the following three datasets, each with a classification task:
1. Adult
2. Spotify
3. Titanic

In [2]:
import datetime
import os
import sys
import warnings

import pandas as pd

warnings.filterwarnings("ignore")

# Add the project root (three directory levels up) to the import path so that `main` can be imported.
root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(''))))
sys.path.insert(0, root)
from main import main

## 1. Adult

In [3]:
data_path = '../../../datasets/converted_datasets/adult_converted.csv'
data = pd.read_csv(data_path)
data.head()

Unnamed: 0,age,fnlwgt,education,educational-num,race,gender,capital-gain,capital-loss,hours-per-week,income,workclass,marital-status,occupation,native-country,relationship
0,25,226802,6,7,Black,Male,0,0,40,0,4,4,7,39,5
1,38,89814,8,9,White,Male,0,0,50,0,4,2,5,39,4
2,28,336951,10,12,White,Male,0,0,40,1,2,2,11,39,4
3,44,160323,12,10,Black,Male,7688,0,40,1,4,2,7,39,4
4,18,103497,12,10,White,Female,0,0,30,0,0,4,0,39,5


- The target column to predict is the income column.

- The following columns have fewer than 50 unique integer values and are therefore actually categorical: education, educational-num, workclass, marital-status, occupation, native-country and relationship.

- Additionally, we know that education and educational-num are ordinal, while all the others are nominal.
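
The cardinality rule described above can be sketched as follows. This is a minimal illustration, not the code inside `main`: the helper name `low_cardinality_int_columns`, its `max_unique` parameter, and the toy frame are assumptions made for the example.

```python
import pandas as pd


def low_cardinality_int_columns(df, target, max_unique=50):
    """Return integer-typed columns (excluding the target) with fewer than max_unique distinct values."""
    candidates = []
    for col in df.columns:
        if col == target:
            continue  # the target is never a candidate feature
        if pd.api.types.is_integer_dtype(df[col]) and df[col].nunique() < max_unique:
            candidates.append(col)
    return candidates


# Toy stand-in frame with made-up values (NOT the real Adult data).
toy = pd.DataFrame({
    "age": list(range(100)),                     # 100 distinct integers: kept as numeric
    "education": [i % 16 for i in range(100)],   # 16 distinct integers: candidate
    "race": ["White", "Black"] * 50,             # strings: not an integer column
    "income": [0, 1] * 50,                       # target column, skipped
})

candidates = low_cardinality_int_columns(toy, target="income")
print(candidates)
```

On the real data, the same check would be run on the frame loaded from `data_path`, with `'income'` as the target.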

We apply our model to the Adult dataset:

In [8]:
warnings.filterwarnings("ignore")
start = datetime.datetime.now()
main(data_path, 'income')
end = datetime.datetime.now()
print("\n Model run time: {}".format(end-start))

The following columns have less than 50 unique INT values, therefore they will be checked:
education
educational-num
workclass
marital-status
occupation
native-country
relationship

Training embedding model start.


Epoch 1/10
1527/1527 - 4s - loss: 0.3745 - accuracy: 0.8269 - 4s/epoch - 2ms/step
Epoch 2/10
1527/1527 - 3s - loss: 0.3642 - accuracy: 0.8317 - 3s/epoch - 2ms/step
Epoch 3/10
1527/1527 - 3s - loss: 0.3631 - accuracy: 0.8310 - 3s/epoch - 2ms/step
Epoch 4/10
1527/1527 - 3s - loss: 0.3625 - accuracy: 0.8315 - 3s/epoch - 2ms/step
Epoch 5/10
1527/1527 - 3s - loss: 0.3619 - accuracy: 0.8323 - 3s/epoch - 2ms/step
Epoch 6/10
1527/1527 - 3s - loss: 0.3616 - accuracy: 0.8328 - 3s/epoch - 2ms/step
Epoch 7/10
1527/1527 - 3s - loss: 0.3608 - accuracy: 0.8336 - 3s/epoch - 2ms/step
Epoch 8/10
1527/1527 - 3s - loss: 0.3611 - accuracy: 0.8335 - 3s/epoch - 2ms/step
Epoch 9/10
1527/1527 - 3s - loss: 0.3602 - accuracy: 0.8340 - 3s/epoch - 2ms/step
Epoch 10/10
1527/1527 - 3s - loss: 0.3601 - ac

## 2. Spotify

In [9]:
data_path = '../../../datasets/converted_datasets/spotify_converted.csv'
data = pd.read_csv(data_path, low_memory=False)
data.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,type,id,uri,track_href,analysis_url,duration_ms,time_signature,song_name,0,genre
0,0.831,0.814,2,-7.364,1,0.42,0.0598,0.0134,0.0556,0.389,...,audio_features,2Vc6NJ9PW9gD9q343XFRKx,spotify:track:2Vc6NJ9PW9gD9q343XFRKx,https://api.spotify.com/v1/tracks/2Vc6NJ9PW9gD...,https://api.spotify.com/v1/audio-analysis/2Vc6...,124539,4,Trap,Retrograde,0
1,0.719,0.493,8,-7.23,1,0.0794,0.401,0.0,0.118,0.124,...,audio_features,7pgJBLVz5VmnL7uGHmRj6p,spotify:track:7pgJBLVz5VmnL7uGHmRj6p,https://api.spotify.com/v1/tracks/7pgJBLVz5Vmn...,https://api.spotify.com/v1/audio-analysis/7pgJ...,224427,4,Trap,,0
2,0.85,0.893,5,-4.783,1,0.0623,0.0138,4e-06,0.372,0.0391,...,audio_features,0vSWgAlfpye0WCGeNmuNhy,spotify:track:0vSWgAlfpye0WCGeNmuNhy,https://api.spotify.com/v1/tracks/0vSWgAlfpye0...,https://api.spotify.com/v1/audio-analysis/0vSW...,98821,4,Trap,,0
3,0.476,0.781,0,-4.71,1,0.103,0.0237,0.0,0.114,0.175,...,audio_features,0VSXnJqQkwuH2ei1nOQ1nu,spotify:track:0VSXnJqQkwuH2ei1nOQ1nu,https://api.spotify.com/v1/tracks/0VSXnJqQkwuH...,https://api.spotify.com/v1/audio-analysis/0VSX...,123661,3,Trap,(Prod.,0
4,0.798,0.624,2,-7.668,1,0.293,0.217,0.0,0.166,0.591,...,audio_features,4jCeguq9rMTlbMmPHuO7S3,spotify:track:4jCeguq9rMTlbMmPHuO7S3,https://api.spotify.com/v1/tracks/4jCeguq9rMTl...,https://api.spotify.com/v1/audio-analysis/4jCe...,123298,4,Trap,,0


- The target column to predict is the mode column.

- The following columns have fewer than 50 unique integer values and are therefore actually categorical: key, genre and time_signature (the number of beats in each bar).

- Additionally, we know that time_signature is ordinal, while all the others are nominal.

We apply our model to the Spotify dataset:

In [10]:
warnings.filterwarnings("ignore")
start = datetime.datetime.now()
main(data_path, 'mode')
end = datetime.datetime.now()
print("\n Model run time: {}".format(end-start))

The following columns have less than 50 unique INT values, therefore they will be checked:
key
time_signature
genre

Training embedding model start.


Epoch 1/10
1323/1323 - 3s - loss: 0.5873 - accuracy: 0.7049 - 3s/epoch - 2ms/step
Epoch 2/10
1323/1323 - 2s - loss: 0.5770 - accuracy: 0.7124 - 2s/epoch - 2ms/step
Epoch 3/10
1323/1323 - 2s - loss: 0.5752 - accuracy: 0.7114 - 2s/epoch - 2ms/step
Epoch 4/10
1323/1323 - 2s - loss: 0.5741 - accuracy: 0.7118 - 2s/epoch - 2ms/step
Epoch 5/10
1323/1323 - 2s - loss: 0.5741 - accuracy: 0.7113 - 2s/epoch - 2ms/step
Epoch 6/10
1323/1323 - 2s - loss: 0.5728 - accuracy: 0.7114 - 2s/epoch - 2ms/step
Epoch 7/10
1323/1323 - 2s - loss: 0.5718 - accuracy: 0.7139 - 2s/epoch - 2ms/step
Epoch 8/10
1323/1323 - 2s - loss: 0.5720 - accuracy: 0.7123 - 2s/epoch - 2ms/step
Epoch 9/10
1323/1323 - 2s - loss: 0.5713 - accuracy: 0.7130 - 2s/epoch - 2ms/step
Epoch 10/10
1323/1323 - 2s - loss: 0.5720 - accuracy: 0.7115 - 2s/epoch - 2ms/step

Training embedding model en

## 3. Titanic

In [11]:
data_path = '../../../datasets/converted_datasets/titanic_converted.csv'
data = pd.read_csv(data_path, low_memory=False)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,2


- The target column to predict is the Survived column.

- The following columns have fewer than 50 unique integer values and are therefore actually categorical: Pclass, SibSp (number of siblings and spouses on board), Parch (number of parents/children on board) and Embarked (the port where the passenger boarded).

- Additionally, we know that Pclass, SibSp and Parch are ordinal, while Embarked is nominal.

We apply our model to the Titanic dataset:

In [13]:
warnings.filterwarnings("ignore")
start = datetime.datetime.now()
main(data_path, 'Survived')
end = datetime.datetime.now()
print("\n Model run time: {}".format(end-start))

The following columns have less than 50 unique INT values, therefore they will be checked:
Pclass
SibSp
Parch
Embarked

Training embedding model start.


Epoch 1/10
28/28 - 1s - loss: 0.6685 - accuracy: 0.6162 - 850ms/epoch - 30ms/step
Epoch 2/10
28/28 - 0s - loss: 0.6363 - accuracy: 0.6308 - 54ms/epoch - 2ms/step
Epoch 3/10
28/28 - 0s - loss: 0.6055 - accuracy: 0.6801 - 56ms/epoch - 2ms/step
Epoch 4/10
28/28 - 0s - loss: 0.5946 - accuracy: 0.7026 - 53ms/epoch - 2ms/step
Epoch 5/10
28/28 - 0s - loss: 0.5877 - accuracy: 0.7149 - 56ms/epoch - 2ms/step
Epoch 6/10
28/28 - 0s - loss: 0.5820 - accuracy: 0.7183 - 53ms/epoch - 2ms/step
Epoch 7/10
28/28 - 0s - loss: 0.5842 - accuracy: 0.7194 - 53ms/epoch - 2ms/step
Epoch 8/10
28/28 - 0s - loss: 0.5775 - accuracy: 0.7138 - 56ms/epoch - 2ms/step
Epoch 9/10
28/28 - 0s - loss: 0.5727 - accuracy: 0.7026 - 52ms/epoch - 2ms/step
Epoch 10/10
28/28 - 0s - loss: 0.5770 - accuracy: 0.7082 - 52ms/epoch - 2ms/step

Training embedding model end.


'Pclass': 

_____________________________________________________________________________________________________________

In total we had:

<ins>Nominal Columns:</ins>\
adult - workclass\
adult - marital-status\
adult - occupation\
adult - native-country\
adult - relationship\
spotify - key\
spotify - genre\
titanic - Embarked

<ins>Ordinal Columns:</ins>\
adult - education\
adult - educational-num\
spotify - time_signature\
titanic - SibSp\
titanic - Pclass\
titanic - Parch

Our model correctly classified the following columns:\
adult - workclass, adult - marital-status, adult - occupation, adult - native-country, adult - education, adult - educational-num, adult - relationship, spotify - key, spotify - genre, spotify - time_signature, titanic - SibSp, titanic - Pclass, titanic - Parch.

Our model misclassified the following columns:\
titanic - Embarked.

Combined with the correct classification of all of the Video Games Sales dataset columns (available in the regression task experiment), we obtain an <b>accuracy score of 16/17 = 0.94</b>.
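
The arithmetic behind this score can be checked with a few lines. The counts for this notebook come from the lists above. The number of Video Games Sales columns is not stated here, so it is inferred from the reported denominator of 17 and should be read as an assumption.

```python
# Counts taken from the lists in this notebook.
correct_here = 13   # columns classified correctly (Adult, Spotify, Titanic)
wrong_here = 1      # titanic - Embarked

# Denominator reported in the text; the Video Games Sales share is inferred from it.
total_reported = 17
video_games_cols = total_reported - (correct_here + wrong_here)  # all reported as correct

accuracy = (correct_here + video_games_cols) / total_reported
print(video_games_cols, round(accuracy, 2))
```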

### A Short Error Analysis

If we were to research this solution further, we would want to analyze the model's errors in depth.\
In this notebook we suggest a few ideas for doing so.

<ins>titanic - Embarked:</ins>\
The model computed a mean Spearman correlation of 0.8 between the original value space and the embedding space. This is quite high, so the model classified the column as ordinal.
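
One plausible reading of such a score is sketched below. It rank-correlates the pairwise distances between integer codes with the pairwise distances between embedding vectors. This is an assumption about how the comparison could be made, not the computation used in `main`. The function name and the toy vectors are hypothetical.

```python
import numpy as np
import pandas as pd


def spearman_codes_vs_embeddings(codes, embeddings):
    """Spearman correlation between code-space and embedding-space pairwise distances."""
    codes = np.asarray(codes, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    i, j = np.triu_indices(len(codes), k=1)  # every unordered pair of categories
    code_dist = np.abs(codes[i] - codes[j])
    emb_dist = np.linalg.norm(embeddings[i] - embeddings[j], axis=1)
    # Spearman's rho equals the Pearson correlation of the ranks.
    return pd.Series(code_dist).rank().corr(pd.Series(emb_dist).rank())


codes = [0, 1, 2, 3]
# Made-up 2-D embeddings: the first set follows the code order, the second does not.
ordered_emb = [[0.0, 0.0], [1.0, 0.1], [2.0, 0.0], [3.0, 0.2]]
shuffled_emb = [[0.0, 0.0], [3.0, 0.2], [1.0, 0.1], [2.0, 0.0]]

rho_ordered = spearman_codes_vs_embeddings(codes, ordered_emb)
rho_shuffled = spearman_codes_vs_embeddings(codes, shuffled_emb)
print(rho_ordered, rho_shuffled)
```

Under this reading, a high score means that categories with nearby codes also ended up close together in the embedding space, which is what would be expected of an ordinal column.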

We explored the original values, which are Southampton, Cherbourg, and Queenstown, and found no geographic explanation for an ordering among them.

We then checked the value distribution and found an imbalance that may explain the error: more than 70% of the passengers boarded at Southampton, just under 20% at Cherbourg, and the rest at Queenstown.
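
A distribution check of this kind takes only a few lines in pandas. The series below is a toy stand-in whose counts are chosen to roughly match the proportions quoted above. It is not the real column, which is integer-coded in the converted CSV.

```python
import pandas as pd

# Illustrative counts only, chosen to roughly match the proportions quoted in the text.
embarked = pd.Series(["Southampton"] * 72 + ["Cherbourg"] * 19 + ["Queenstown"] * 9)

# Relative frequency of each value, sorted from most to least common.
shares = embarked.value_counts(normalize=True)
dominant_share = shares.iloc[0]
print(shares)
```

When one value dominates this strongly, the embeddings of the rare values are trained on few examples, which may make their positions, and therefore the correlation score, less reliable.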


<ins>adult - relationship and adult - marital-status:</ins>\
We would also want to analyze the columns whose classification is uncertain because their Spearman correlation score is very close to the decision threshold.
The relationship and marital-status columns received scores of 0.38 and 0.39, respectively.
One can argue that these columns are ordinal since, for example, the never-married value can be seen as smaller than the husband value.
An interesting test would be to map the values in some well-thought-out order (in contrast to the random mapping we used) and check whether the new Spearman correlation is higher.
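
A minimal sketch of such a re-mapping is shown below. The order in `hypothetical_order` is invented for illustration and covers only a few marital-status values. Choosing a defensible order is exactly the open question raised above.

```python
import pandas as pd

# Toy sample of marital-status labels (not the full column).
marital = pd.Series(
    ["Never-married", "Married-civ-spouse", "Divorced", "Widowed", "Never-married"]
)

# Hypothetical ordering, chosen by hand instead of the random mapping.
hypothetical_order = ["Never-married", "Married-civ-spouse", "Divorced", "Widowed"]

# An ordered categorical assigns integer codes that follow the chosen order.
ordered = pd.Categorical(marital, categories=hypothetical_order, ordered=True)
ordered_codes = pd.Series(ordered.codes, index=marital.index)
print(ordered_codes.tolist())
```

The re-encoded column could then replace the randomly mapped one in the converted CSV before rerunning the model, so that the two Spearman scores can be compared.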