This project was made by Dane Zieman

This project uses data from https://www.kaggle.com/miroslavsabo/young-people-survey/. The goal is to predict the empathy of an example given the other features. Specifically, people who answered the "empathy" question with 1, 2 or 3 are labelled "not very empathetic" and people who answered 4 or 5 are labelled "very empathetic". A Random Forest and a Neural Network are used to accomplish this task.

First, make sure you have all the appropriate modules installed.

In [3]:
import tensorflow as tf
import os
import numpy as np
import pandas as pd
import math
from tensorflow.keras import layers
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from joblib import dump, load


print(tf.__version__)
print(tf.keras.__version__)

1.12.0
2.1.6-tf


Output: the TensorFlow and Keras version numbers

Next, begin by loading the data.

In [None]:
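#load the raw data
#__file__ is not defined inside a notebook, so fall back to the working directory
try:
    dir_path = os.path.dirname(os.path.realpath(__file__))
except NameError:
    dir_path = os.getcwd()
filename = os.path.join(dir_path, "responses.csv")

npFeatures = pd.read_csv(filename, sep=',', header=0)
#extract the labels (column 94 holds the "Empathy" answers)
npLabels2 = []
for z in npFeatures.values:
    npLabels2.append(z[94])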

Now that the data is loaded, it needs to be preprocessed. All categorical features need to be converted to numerical data, as TensorFlow cannot run on categorical data. Additionally, any example that does not have an empathy value must be removed and all NaN entries must be filled in. For this assignment, all NaNs were replaced with the column average, rounded down.

In [None]:
#remove all examples that don't have our target labelled
removeInd = []
for z in range(len(npLabels2)):
    if np.isnan(npLabels2[z]):
        removeInd.append(z)
    else:
        if npLabels2[z] > 3:
            npLabels2[z] = 1
        else:
            npLabels2[z] = 0

npFeatures2 = []

for z in range(len(npFeatures)):
    if z not in removeInd:
        npFeatures2.append(npFeatures.values[z])

npLabels2 = np.delete(npLabels2,removeInd)

#Preprocessing the categorical features
for z in range(len(npFeatures2)):
    if (npFeatures2[z][73] == "never smoked"):
        npFeatures2[z][73] = 1
    elif (npFeatures2[z][73] == "tried smoking"):
        npFeatures2[z][73] = 2
    elif (npFeatures2[z][73] == "former smoker"):
        npFeatures2[z][73] = 3
    elif (npFeatures2[z][73] == "current smoker"):
        npFeatures2[z][73] = 4
    if (npFeatures2[z][74] == "never"):
        npFeatures2[z][74] = 1
    elif (npFeatures2[z][74] =="social drinker"):
        npFeatures2[z][74] = 2
    elif (npFeatures2[z][74] =="drink a lot"):
        npFeatures2[z][74] = 3
    if (npFeatures2[z][107] == "i am often early"):
        npFeatures2[z][107] = 1
    elif (npFeatures2[z][107] == "i am always on time"):
        npFeatures2[z][107] = 2
    elif (npFeatures2[z][107] == "i am often running late"):
        npFeatures2[z][107] = 3
    if (npFeatures2[z][108] == "never"):
        npFeatures2[z][108] = 1
        npFeatures2[z][94] = 1 #reusing the space that used to store the label
        #use it to store whether or not they lie pathologically
    elif (npFeatures2[z][108] == "only to avoid hurting someone"):
        npFeatures2[z][108] = 1
        npFeatures2[z][94] = 1
    elif (npFeatures2[z][108] == "sometimes"):
        npFeatures2[z][108] = 2
        npFeatures2[z][94] = 1
    elif (npFeatures2[z][108] == "everytime it suits me"):
        npFeatures2[z][108] = 3
        npFeatures2[z][94] = 2
    if (npFeatures2[z][132] == "no time at all"):
        npFeatures2[z][132] = 1
    elif (npFeatures2[z][132] == "less than an hour a day"):
        npFeatures2[z][132] = 2
    elif (npFeatures2[z][132] == "few hours a day"):
        npFeatures2[z][132] = 3
    elif (npFeatures2[z][132] == "most of the day"):
        npFeatures2[z][132] = 4
    if (npFeatures2[z][144] == "male"):
        npFeatures2[z][144] = 1
    elif (npFeatures2[z][144] == "female"):
        npFeatures2[z][144] = 2
    if (npFeatures2[z][145] == "right handed"):
        npFeatures2[z][145] = 1
    elif (npFeatures2[z][145] == "left handed"):
        npFeatures2[z][145] = 2
    if (npFeatures2[z][146] == "currently a primary school pupil"):
        npFeatures2[z][146] = 1
    elif (npFeatures2[z][146] == "primary school"):
        npFeatures2[z][146] = 2
    elif (npFeatures2[z][146] == "secondary school"):
        npFeatures2[z][146] = 3
    elif (npFeatures2[z][146] == "college/bachelor degree"):
        npFeatures2[z][146] = 4
    elif (npFeatures2[z][146] == "masters degree"):
        npFeatures2[z][146] = 5
    elif (npFeatures2[z][146] == "doctorate degree"):
        npFeatures2[z][146] = 6
    if (npFeatures2[z][147] == "no"):
        npFeatures2[z][147] = 1
    elif (npFeatures2[z][147] == "yes"):
        npFeatures2[z][147] = 2
    if (npFeatures2[z][148] == "village"):
        npFeatures2[z][148] = 1
    elif (npFeatures2[z][148] == "city"):
        npFeatures2[z][148] = 2
    if (npFeatures2[z][149] == "block of flats"):
        npFeatures2[z][149] = 1
    elif (npFeatures2[z][149] == "house/bungalow"):
        npFeatures2[z][149] = 2

#replace all nans with the floored average of the non-missing values in that column
for z in range(len(npFeatures2[0])):
    total = 0
    present = 0
    for i in range(len(npFeatures2)):
        if (np.isnan(npFeatures2[i][z])):
            continue
        else:
            total += npFeatures2[i][z]
            present += 1
    #divide by the number of non-missing entries, not by the number of rows
    avg = math.floor(total / present) if present > 0 else 0
    for i in range(len(npFeatures2)):
        if (np.isnan(npFeatures2[i][z])):
            npFeatures2[i][z] = avg

npFeatures2 = np.array(npFeatures2)
npLabels2 = np.array(npLabels2)
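
The two preprocessing ideas used above (mapping category strings to ordinal codes, then filling missing entries with a floored column average) can be sketched more compactly. This is a minimal sketch rather than the project's code: `SMOKING_CODES` mirrors the smoking mapping from the cell above, while `encode_column`, `impute_column_mean` and the toy values are hypothetical and only for illustration.

```python
import math
import numpy as np

# Same ordinal codes as the smoking column in the cell above.
SMOKING_CODES = {
    "never smoked": 1,
    "tried smoking": 2,
    "former smoker": 3,
    "current smoker": 4,
}

def encode_column(values, codes):
    """Map category strings to numbers; anything unmapped (e.g. NaN) becomes NaN."""
    return np.array([codes.get(v, np.nan) for v in values], dtype=float)

def impute_column_mean(column):
    """Replace NaNs with the floored mean of the non-missing entries."""
    column = np.asarray(column, dtype=float).copy()
    missing = np.isnan(column)
    if missing.all():
        return column  # nothing to average over
    column[missing] = math.floor(column[~missing].mean())
    return column

# Toy data, made up for illustration.
raw = ["never smoked", "current smoker", float("nan"), "tried smoking"]
encoded = encode_column(raw, SMOKING_CODES)
filled = impute_column_mean(encoded)
print(filled)
```

Keeping each mapping in a dictionary makes it easy to check the codes at a glance and to reuse the same helper for every categorical column.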

After preprocessing, npFeatures2 and npLabels2 are NumPy arrays that hold all the data and all the labels. The next step is to build a baseline to compare the models against. For this, the majority classifier is used: it always predicts the most common label.

In [None]:
#Do most frequent classifier
#labels are 0 or 1, so count[t] holds the number of examples with label t
count = [0, 0]
for z in npLabels2:
    t = int(z)
    count[t] += 1

#predict whichever label is more common
label = 0
if (count[0] < count[1]):
    label = 1

misses = 0
for z in npLabels2:
    t = int(z)
    if t != label:
        misses += 1

numLabels = npLabels2.size
print("Most Frequent Classifier Accuracy: " + str((numLabels-misses)/numLabels) )

Output: Roughly 0.667
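
The same baseline can also be computed without explicit loops. The following is a small sketch, assuming the labels are already 0 or 1 as they are after preprocessing; `majority_baseline_accuracy` is a hypothetical helper name and the toy labels are made up.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label (labels are 0 or 1)."""
    labels = np.asarray(labels, dtype=int)
    counts = np.bincount(labels, minlength=2)  # counts[0] = zeros, counts[1] = ones
    return counts.max() / labels.size

# Toy labels, made up for illustration: four 1s and two 0s.
toy_labels = [1, 0, 1, 1, 0, 1]
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

Because the majority classifier is right exactly as often as the most common label occurs, its accuracy is simply the larger class count divided by the number of examples.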

Now it is time to do a train-test split. For this project, a 90-10 train-test split is used. The 90% training portion is then split again, so that roughly 75% of all the data is used for training and 15% for validation.

In [None]:
#make a test train split (roughly 75-15-10 train-valid-test random split)
randomindices = np.arange(1005)
np.random.shuffle(randomindices)
test_data = npFeatures2[randomindices[905:]]
test_labels = npLabels2[randomindices[905:]]
training_set = npFeatures2[randomindices[:905]]
label_set = npLabels2[randomindices[:905]]
randomindices = np.arange(905)
training_data = training_set[randomindices[:755]]
training_labels = label_set[randomindices[:755]]
validation_data = training_set[randomindices[755:905]]
validation_labels = label_set[randomindices[755:905]] 

#save the test data and labels to a file
np.save("Processed_Test_Data", test_data)
np.save("Processed_Test_Labels", test_labels)
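
The index arithmetic in the split above can be wrapped in a small helper. This is a sketch under stated assumptions: `three_way_split` is a hypothetical function, and the `seed` argument is an addition for reproducibility that the cell above does not use. The sizes (755, 150 and 100 out of 1005) are the ones from the cell above.

```python
import numpy as np

def three_way_split(n_examples, n_train, n_valid, seed=None):
    """Return shuffled, non-overlapping index arrays for train, validation, and test."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_examples)
    train_idx = order[:n_train]
    valid_idx = order[n_train:n_train + n_valid]
    test_idx = order[n_train + n_valid:]  # whatever is left over
    return train_idx, valid_idx, test_idx

# Sizes taken from the cell above: 755 training, 150 validation, 100 test examples.
train_idx, valid_idx, test_idx = three_way_split(1005, 755, 150, seed=0)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The returned arrays can be used to index the feature and label arrays in the same way, which keeps every example paired with its own label.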

Next, the neural network will be built. This neural network is built using TensorFlow's Keras. The class weights and model layers were chosen by iteration, after testing a number of different choices on the model. Some choices, such as poor class weights or softmax layers at certain points of the model, will cause the model to simply predict the majority class every time. Try experimenting with different weights and layers!

In [None]:
#Build the tensor flow model
class_weight = {0: 1.,
                1: 2.}
model = tf.keras.Sequential()
model.add(layers.Dense(150, activation='relu'))
model.add(layers.Dense(128, activation='selu'))
model.add(layers.Dense(128, activation='sigmoid'))
model.add(layers.Dense(128, activation='selu'))
model.add(layers.Dense(128, activation='sigmoid'))
model.add(layers.Dense(64, activation='selu'))
model.add(layers.Dense(64, activation='sigmoid'))
model.add(layers.Dense(64, activation='selu'))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(16, activation='selu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer=tf.train.AdamOptimizer(0.0008),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(training_data, training_labels, epochs=50, batch_size=32,
          validation_data=(validation_data, validation_labels), class_weight=class_weight)
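
To give a feel for what the class weights do, the sketch below scales each example's binary cross-entropy by the weight of its true class, which is the general idea behind the `class_weight` argument. It is a plain NumPy illustration, not Keras's internal code: `weighted_binary_crossentropy` is a hypothetical helper, the probabilities are made up, and the exact reduction Keras applies internally may differ.

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_prob, class_weight):
    """Mean binary cross-entropy with each example scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for probabilities of exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), 1e-7, 1 - 1e-7)
    per_example = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_example))

toy_weights = {0: 1., 1: 2.}    # same values as class_weight in the cell above
toy_true = [1, 0]               # made-up labels
toy_prob = [0.5, 0.5]           # made-up predicted probabilities

weighted = weighted_binary_crossentropy(toy_true, toy_prob, toy_weights)
unweighted = weighted_binary_crossentropy(toy_true, toy_prob, {0: 1., 1: 1.})
print(weighted, unweighted)
```

With these weights, a mistake on a class 1 example costs twice as much as the same mistake on a class 0 example, so the optimizer is pushed harder to get class 1 right.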

One way to try to improve the model is to try different train/validation splits and see whether they improve the model relative to the majority classifier. For example, if the validation split is biased towards the positive class, the model may get a good score without being significantly better than the majority classifier. The code below tries to find the model that does best against the majority classifier across different train/validation splits.

In [None]:
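#compare to most frequent classifier
#evaluate the model trained above on its validation split
result = model.evaluate(validation_data, validation_labels, batch_size=32)
#labels are 0 or 1, so count[t] holds the number of examples with label t
count = [0, 0]
for z in validation_labels:
    t = int(z)
    count[t] += 1

label = 0
if (count[0] < count[1]):
    label = 1

misses = 0
for z in validation_labels:
    t = int(z)
    if t != label:
        misses += 1
MFCacc = (len(validation_labels)-misses)/len(validation_labels)
acc = result[1]
maxDiff = acc - MFCacc
#start with the first model as the best one so far
theModel = model
used_validation = validation_data
used_labels = validation_labels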

#Do 10 more comparisons to the majority classifier
for z in range(10): 
    print("Cross Training run #" + str(z+1) + "/10")
    randomindices = np.arange(905)
    np.random.shuffle(randomindices)
    training_data = training_set[randomindices[:755]]
    training_labels = label_set[randomindices[:755]] 
    validation_data = training_set[randomindices[755:905]]
    validation_labels = label_set[randomindices[755:905]] 
    
    model = tf.keras.Sequential()
    model.add(layers.Dense(150, activation='relu'))
    model.add(layers.Dense(128, activation='selu'))
    model.add(layers.Dense(128, activation='sigmoid'))
    model.add(layers.Dense(128, activation='selu'))
    model.add(layers.Dense(128, activation='sigmoid'))
    model.add(layers.Dense(64, activation='selu'))
    model.add(layers.Dense(64, activation='sigmoid'))
    model.add(layers.Dense(64, activation='selu'))
    model.add(layers.Dense(32, activation='relu'))
    model.add(layers.Dense(16, activation='selu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer=tf.train.AdamOptimizer(0.0008),
                loss='binary_crossentropy',
                metrics=['accuracy'])
    model.fit(training_data, training_labels, epochs=50, batch_size=32,
            validation_data=(validation_data, validation_labels), class_weight=class_weight, verbose = 0)

    #evaluate the model on the validation data
    result = model.evaluate(validation_data, validation_labels, batch_size=32)
    loss = result[0]
    acc = result[1]
    print(result)
    #compare to most frequent classifier
    #labels are 0 or 1, so count[t] holds the number of examples with label t
    count = [0, 0]
    for v in validation_labels:
        t = int(v)
        count[t] += 1

    label = 0
    if (count[0] < count[1]):
        label = 1

    misses = 0
    for v in validation_labels:
        t = int(v)
        if t != label:
            misses += 1
    MFCacc = (len(validation_labels)-misses)/len(validation_labels)
    diff = acc - MFCacc
    print(diff)
    if (diff > maxDiff):
        maxDiff = diff
        theModel = model
        used_validation = validation_data
        used_labels = validation_labels
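
The selection rule used above (keep the run whose accuracy beats its own split's majority baseline by the largest margin) can be isolated from the training code. The following is a sketch with made-up numbers; `majority_accuracy`, `best_margin` and the `runs` list are hypothetical and stand in for the accuracies and validation labels produced by each run.

```python
import numpy as np

def majority_accuracy(labels):
    """Accuracy of always predicting the most common label (labels are 0 or 1)."""
    labels = np.asarray(labels, dtype=int)
    return np.bincount(labels, minlength=2).max() / labels.size

def best_margin(runs):
    """Pick the run whose accuracy exceeds its own majority baseline by the most.

    `runs` is a list of (name, accuracy, validation_labels) tuples.
    Returns (name, margin) for the best run.
    """
    best_name, best_diff = None, -np.inf
    for name, acc, labels in runs:
        diff = acc - majority_accuracy(labels)
        if diff > best_diff:
            best_name, best_diff = name, diff
    return best_name, best_diff

# Made-up accuracies and labels for illustration.
runs = [
    ("split A", 0.75, [1, 1, 1, 0]),  # baseline 0.75, margin 0.00
    ("split B", 0.75, [1, 1, 0, 0]),  # baseline 0.50, margin 0.25
]
winner, margin = best_margin(runs)
print(winner, margin)
```

Both toy runs have the same raw accuracy, but only the second one actually improves on always guessing the majority label, which is why the margin is used instead of the accuracy alone.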

The best model is selected from the 11 runs (the first model plus the 10 trained in the loop). Now that the model is fully built, it is time to evaluate it on the test data.

In [None]:
#inspect the individual results on the test data
print("Testing")
result = theModel.evaluate(test_data, test_labels, batch_size=32)
print(result)
result = theModel.predict_classes(test_data, batch_size=32)
confusionMat = confusion_matrix(test_labels, result)
print(confusionMat)

Output: Loss, accuracy and confusion matrix for the test data
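
The confusion matrix printed above can be turned into a few summary numbers. The sketch below assumes the layout that sklearn's confusion_matrix uses for 0/1 labels (rows are true labels, columns are predicted labels); `summarize_confusion` is a hypothetical helper and the counts are made up, not results from this project.

```python
import numpy as np

def summarize_confusion(mat):
    """Accuracy, precision, and recall for class 1 from a 2x2 confusion matrix.

    Layout follows sklearn.metrics.confusion_matrix: rows are true labels,
    columns are predicted labels, so mat = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = np.asarray(mat)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up counts for illustration (100 test examples).
toy = [[10, 23], [7, 60]]
accuracy, precision, recall = summarize_confusion(toy)
print(accuracy, precision, recall)
```

Looking at precision and recall alongside accuracy helps to spot a model that has collapsed into predicting the majority class, since such a model leaves one column of the matrix empty.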

Interesting results from the validation data can be observed as well. For this data set, classifications of the negative class are the most interesting, so those will be output with the code below. Feel free to change the output statements to look at different predictions.

In [None]:
#interesting validation results
result = theModel.predict_classes(used_validation, batch_size=32)
c = 0
comparisons = []
for z in range(len(result)):
    if (result[z][0] == 0 and used_labels[z] == 0):
        comparisons.append("Prediction #" + str(z) +":" + str(result[z][0]) + " Label:"+str(used_labels[z]))
    if (result[z][0] == 0 and used_labels[z] == 1):
        comparisons.append("Prediction #" + str(z) +":" + str(result[z][0]) + " Label:"+str(used_labels[z]))
for z in comparisons:
    print(z)

The model architecture can also be exported to a JSON file, and the weights can be exported as well.

In [None]:
#save the selected model (theModel), not the last one trained in the loop
theModel.save_weights('tensorModel_weights.h5')
with open('tensorModel_architecture.json', 'w') as f:
    f.write(theModel.to_json())

That's all there is to do for the neural network. Next is the random forest classifier. Sklearn's RandomForestClassifier was used for this, with the same train-test data split. As with the neural network, the class weights, the number of trees in the forest, and the maximum depth of the trees were chosen by iteration. If the maximum depth was less than 5 and the class weights were chosen poorly, the random forest would be close to or exactly the majority classifier. Feel free to experiment with these hyperparameters.

In [None]:
#build the first random forest
randomindices = np.arange(905)
np.random.shuffle(randomindices)
cross_training_data = training_set[randomindices[:755]]
cross_training_labels = label_set[randomindices[:755]] 
validation_data = training_set[randomindices[755:905]]
validation_labels = label_set[randomindices[755:905]]  
clf = RandomForestClassifier(n_estimators=500, max_depth=11, random_state = 0, class_weight = class_weight)
clf.fit(cross_training_data, cross_training_labels)
result = clf.predict(validation_data)
print(result)
c = 0
for z in range(len(result)):
    if (result[z] == validation_labels[z]):
        c +=1
score = c/len(result)
print(str(score))
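
The counting loop used to score the predictions is equivalent to comparing the two arrays element by element and taking the mean. A minimal sketch, with a hypothetical `accuracy` helper and made-up values:

```python
import numpy as np

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    return float(np.mean(predictions == labels))

# Made-up predictions and labels for illustration: two of the four agree.
toy_acc = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
print(toy_acc)
```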

As with the neural network, it is desirable to compare this model to the majority classifier, and to try different train-validation splits to find the model that outperforms the majority classifier by the largest margin.

In [None]:
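theclf = clf
#remember the split this classifier was validated on
used_validation = validation_data
used_labels = validation_labels
#compare to most frequent
#labels are 0 or 1, so count[t] holds the number of examples with label t
count = [0, 0]
for z in validation_labels:
    t = int(z)
    count[t] += 1

label = 0
if (count[0] < count[1]):
    label = 1

misses = 0
for z in validation_labels:
    t = int(z)
    if t != label:
        misses += 1
MFCacc = (len(validation_labels)-misses)/len(validation_labels)
maxDiff = score - MFCacc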

#try 10 different train-validation splits
for z in range(10):
    randomindices = np.arange(905)
    np.random.shuffle(randomindices)
    cross_training_data = training_set[randomindices[:755]]
    cross_training_labels = label_set[randomindices[:755]] 
    validation_data = training_set[randomindices[755:905]]
    validation_labels = label_set[randomindices[755:905]]  
    clf = RandomForestClassifier(n_estimators=500, max_depth=11, random_state = 0, class_weight = class_weight)
    clf.fit(cross_training_data, cross_training_labels)
    result = clf.predict(validation_data)
    print(result)
    c = 0
    for z in range(len(result)):
        if (result[z] == validation_labels[z]):
            c +=1
    score = c/len(result)
    print(str(score))
    #compare to most frequent
    #labels are 0 or 1, so count[t] holds the number of examples with label t
    count = [0, 0]
    for v in validation_labels:
        t = int(v)
        count[t] += 1

    label = 0
    if (count[0] < count[1]):
        label = 1

    misses = 0
    for v in validation_labels:
        t = int(v)
        if t != label:
            misses += 1
    MFCacc = (len(validation_labels)-misses)/len(validation_labels)
    diff = score - MFCacc
    print(diff)
    if (diff > maxDiff):
        theclf = clf
        maxDiff=diff
        used_validation = validation_data
        used_labels = validation_labels

After this, the random forest model is built. Now it is time to try it on the test data.

In [None]:
#use the selected classifier (theclf), not the last one trained in the loop
result = theclf.predict(test_data)
confusionMat = confusion_matrix(test_labels, result)
print(confusionMat)
c = 0
for z in range(len(result)):
    if (result[z] == test_labels[z]):
        c +=1
score = c/len(result)
print(str(score))

It is also useful to look at interesting examples from the validation data. Below is code to look at all the examples that are classified as 0.

In [None]:
result = theclf.predict(used_validation)
comparisons = []
for z in range(len(result)):
    if (result[z] == 0 and used_labels[z] == 0):
        comparisons.append("Prediction #" + str(z) +":" + str(result[z]) + " Label:"+str(used_labels[z]))
    if (result[z] == 0 and used_labels[z] == 1):
        comparisons.append("Prediction #" + str(z) +":" + str(result[z]) + " Label:"+str(used_labels[z]))
for z in comparisons:
    print(z)

One useful feature that the random forest has and the neural network does not is the ability to see the importance of each individual feature. Below is the code to do this.

In [None]:
#pair each column name with its importance
#the 'Empathy' column was reused to store the lying flag, so rename it here
featureWeights = [('Pathological Liar?', w) if name == 'Empathy' else (name, w)
                  for name, w in zip(npFeatures.columns, theclf.feature_importances_)]
featureWeights.sort(key=lambda x: x[1], reverse=True)
print(featureWeights)
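
When the full list is too long to read, it can help to look only at the few most important features. The sketch below is one way to do that with NumPy; `top_features` is a hypothetical helper, and the names and importances are placeholders rather than values from the trained forest.

```python
import numpy as np

def top_features(names, importances, k=3):
    """Return the k (name, importance) pairs with the largest importance."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1][:k]  # indices, highest importance first
    return [(names[i], float(importances[i])) for i in order]

# Placeholder names and importances, made up for illustration.
toy_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
toy_importances = [0.10, 0.40, 0.30, 0.20]
top = top_features(toy_names, toy_importances, k=2)
print(top)
```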

Finally, save the random forest model.

In [None]:
dump(theclf, 'randomForest.joblib') 

References:
TensorFlow Keras: https://www.tensorflow.org/guide/keras

Sklearn RandomForestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
