# SurveSense: Measuring Workplace Satisfaction from Large Employee Text Surveys
This notebook contains the two models that I trained for my consulting project at Insight Data Science. Because the models need a GPU for training, I used Google Colab notebooks to train them and to generate the predictions.

The notebook has two parts. The first trains the multi-class topic classification model; the second trains the binary sentiment classifier. Each part also includes the code for generating predictions on the unlabeled data, but that code has been commented out.

## **Model 1: Multi-class Topics Classification** 

### 1. Mount Drive for Colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Copy the file containing the labeled data from Google Drive to the Colab working directory
!cp '/content/drive/My Drive/Insight/topics_labels.csv' 'topics_labels.csv'

### 2. Import Libraries

In [3]:
import numpy as np
import pandas as pd
import re

!pip install keras==2.2.4

%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)

import tensorflow_hub as hub
import os
from keras import backend as K
import keras.layers as layers
from keras.models import Model, load_model
from keras.engine import Layer

1.15.0


Using TensorFlow backend.


### 3. Read Data

In [0]:
data_df = pd.read_csv('topics_labels.csv')
data_df = data_df.sample(frac=1)


In [5]:
data_df.head()

Unnamed: 0.1,Unnamed: 0,QID,text,topic
11472,51023,Negative,"Dementia training for Care managers, PM and ni...",management
11056,44219,Negative,The company needs to realize the individual ne...,management
13577,83354,Positive,The diversity and blending of staff management,management
30268,71621,Negative,none,no
21031,35433,Positive,I have been working with silverado for a coupl...,overall


### 4. Data Preprocessing

In [0]:
# Convert text column to a list of documents
texts = list(data_df['text']) 

# Keep only the first 100 words in each document
texts = [' '.join(t.split()[0:100]) for t in texts] 

# Convert topic column to a list of labels
labels = list(data_df['topic'])

# Fit label encoder for one-hot encoding
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(labels)

# Write functions to encode to one-hot and decode to categorical
from keras.utils import to_categorical
def encode(le, labels):
    enc = le.transform(labels)
    return to_categorical(enc)

def decode(le, one_hot):
    dec = np.argmax(one_hot, axis=1)
    return le.inverse_transform(dec)
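
The `encode` and `decode` helpers above rely on `LabelEncoder` and `to_categorical`. As a rough illustration of the same round trip, the sketch below uses only NumPy. The names `truncate_words`, `one_hot_encode`, and `one_hot_decode` are hypothetical, and the sample labels are a toy list rather than the real dataset.

```python
import numpy as np

def truncate_words(text, max_words=100):
    """Keep only the first `max_words` whitespace-separated tokens."""
    # str() so that a non-string cell (e.g. NaN) does not raise on .split()
    return ' '.join(str(text).split()[:max_words])

def one_hot_encode(labels):
    """Return (sorted class names, one-hot matrix) for a list of string labels."""
    classes, indices = np.unique(labels, return_inverse=True)
    one_hot = np.eye(len(classes), dtype=np.float32)[indices]
    return classes, one_hot

def one_hot_decode(classes, one_hot):
    """Map each one-hot (or probability) row back to its class name."""
    return classes[np.argmax(one_hot, axis=1)]

# Toy labels for illustration only
sample_labels = ['management', 'no', 'overall', 'management']
classes, encoded = one_hot_encode(sample_labels)
decoded = one_hot_decode(classes, encoded)
print(encoded.shape, list(decoded))
```

Like `LabelEncoder`, `np.unique` orders the classes alphabetically, so column `i` of the one-hot matrix corresponds to the `i`-th class name in sorted order.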

In [0]:
# Define labels and texts ready to be fed to the ELMo embedding layer
label_enc = encode(le,labels)
text_enc = texts

In [8]:
# Check the shape of encoded labels
label_enc.shape

(32500, 10)

### 5. Split into train and test

In [0]:
text_train = np.asarray(text_enc[:25000])
label_train = np.asarray(label_enc[:25000])

text_test = np.asarray(text_enc[25000:])
label_test = np.asarray(label_enc[25000:])
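
The split above takes the first 25,000 rows after the unseeded `sample(frac=1)` shuffle, so the train/test partition differs on every run. A minimal sketch of a seeded shuffle-and-split follows; `shuffled_split`, the `seed` argument, and the toy arrays are assumptions for illustration and are not part of the original pipeline.

```python
import numpy as np

def shuffled_split(texts, labels, n_train, seed=0):
    """Shuffle texts and labels together with a fixed seed, then split."""
    texts = np.asarray(texts)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(texts))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return texts[train_idx], labels[train_idx], texts[test_idx], labels[test_idx]

# Toy data: 10 documents with 10 one-hot labels
toy_texts = ['t{}'.format(i) for i in range(10)]
toy_labels = np.eye(10)
text_tr, label_tr, text_te, label_te = shuffled_split(toy_texts, toy_labels, n_train=8)
print(len(text_tr), len(text_te))
```

With a fixed seed, repeated runs produce the same partition, which makes the reported test metrics easier to reproduce.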

### 6. Build ELMo Embedding Layer


In [0]:
# Download ELMo from TensorFlow Hub
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

In [0]:
# Create a custom layer that allows us to update weights (lambda layers do not have trainable parameters!)

class ElmoEmbeddingLayer(Layer):
    def __init__(self, **kwargs):
        self.dimensions = 1024
        self.trainable=True
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.elmo = hub.Module('https://tfhub.dev/google/elmo/2', trainable=self.trainable,
                               name="{}_module".format(self.name))

        self.trainable_weights += K.tf.trainable_variables(scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(K.squeeze(K.cast(x, tf.string), axis=1),
                      as_dict=True,
                      signature='default',
                      )['default']
        return result

    def compute_mask(self, inputs, mask=None):
        return K.not_equal(inputs, '--PAD--')

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.dimensions)

### 7. Build and Train Model

In [12]:
# Build Model Using Keras Layers

input_text = layers.Input(shape=(1,), dtype="string")
embedding = ElmoEmbeddingLayer()(input_text)
dense = layers.Dense(128, activation='relu')(embedding)
# Sigmoid gives an independent score per topic; these scores are thresholded per class in section 10
pred = layers.Dense(10, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1)                 0         
_________________________________________________________________
elmo_embedding_layer_1 (Elmo (None, 1024)              4         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               131200    
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
Total params: 132,494
Trainable params: 132,494
Non-trainable params: 0
_________________________________________________________________
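
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the totals; `dense_params` is a hypothetical helper, and the ELMo count of 4 is simply taken from the summary output above.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

elmo_params = 4                           # as reported by model.summary()
hidden_params = dense_params(1024, 128)   # ELMo vector (1024) -> 128 units
output_params = dense_params(128, 10)     # 128 units -> 10 topic scores
total_params = elmo_params + hidden_params + output_params
print(hidden_params, output_params, total_params)  # 131200 1290 132494
```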


In [13]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())  
    session.run(tf.tables_initializer())
    history = model.fit(text_train, label_train, epochs=1, batch_size=6, validation_data=(text_test,label_test))
    model.save_weights('./elmo-model.h5')

Train on 25000 samples, validate on 7500 samples
Epoch 1/1

In [0]:
# Run on testing data to generate predictions
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./elmo-model.h5')  
    predicts = model.predict(text_test, batch_size=128)

In [0]:
# Copy the trained model weights to Google Drive
!cp 'elmo-model.h5' '/content/drive/My Drive/Insight/elmo-model.h5'

### 8. Model Evaluation

In [0]:
# Import evaluation functions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [0]:
# Choose the class with the highest probability as the prediction
y_pred = predicts.argmax(axis=-1)


In [0]:
# Convert the predicted class indices to one-hot vectors so they match label_test
lb = preprocessing.LabelBinarizer()
lb.fit(list(range(0,10)))
y_pred_ohe = lb.transform(y_pred)

In [26]:
# Print accuracy, precision, and recall
print('Test Accuracy Score: ', format(accuracy_score(label_test,y_pred_ohe)))
print('Test Precision Score: ', format(precision_score(label_test, y_pred_ohe, average = 'macro')))
print('Test Recall Score: ', format(recall_score(label_test, y_pred_ohe, average = 'macro')))
print('Test F1-Score Score: ', format(f1_score(label_test, y_pred_ohe, average = 'macro')))

Test Accuracy Score:  0.9096
Test Precision Score:  0.903518051886465
Test Recall Score:  0.8383890831360153
Test F1-Score Score:  0.8650065429430287
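
To make the macro averaging explicit, here is a small NumPy sketch that computes the same kind of scores from integer class labels. `macro_metrics` is a hypothetical helper and the labels are toy values; on the notebook's arrays one would presumably pass `label_test.argmax(axis=1)` and `y_pred`.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall and F1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        # Fall back to 0.0 when a class is never predicted or never present
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        'accuracy': float(np.mean(y_true == y_pred)),
        'precision': float(np.mean(precisions)),
        'recall': float(np.mean(recalls)),
        'f1': float(np.mean(f1s)),
    }

# Toy labels for illustration only
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
scores = macro_metrics(toy_true, toy_pred, n_classes=3)
print(scores)
```

Macro F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall, which is why the F1 value reported above does not follow directly from the other two numbers.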


### 9. Make Predictions on Unlabeled Data

In [0]:
# Copy the file containing the unlabeled data from Google Drive to the Colab working directory
!cp '/content/drive/My Drive/Insight/cleaned_data_english.csv' 'cleaned_data_english.csv'

In [21]:
# Read unlabeled data
unlabeled_df = pd.read_csv('cleaned_data_english.csv')
unlabeled_df.head()

Unnamed: 0,QID,Text
0,Positive,all is good
1,Negative,Nothing
2,Negative,I believe that there is more favoritism toward...
3,Positive,The connection with each individual resident
4,Positive,no


In [0]:
# Convert the text comments to a numpy array
X_predict = unlabeled_df['Text']
X_predict = list(X_predict)
X_predict = [' '.join(t.split()[0:100]) for t in X_predict]
X_predict = np.asarray(X_predict)

In [0]:
# ## Make predictions

# with tf.Session() as session:
#     K.set_session(session)
#     session.run(tf.global_variables_initializer())
#     session.run(tf.tables_initializer())
#     model.load_weights('./elmo-model.h5')  
#     predicts = model.predict(X_predict, batch_size=128)

In [0]:
# # Save predictions

# np.savetxt('predicts_final.txt', predicts)

### 10. Create DataFrame with Unlabeled Text and Predictions

In [0]:
# Load Predictions
!cp '/content/drive/My Drive/Insight/predicts_final.txt' 'predicts_final.txt'
predicts = np.loadtxt('predicts_final.txt', dtype=float)
predicts_df = pd.read_csv('cleaned_data_english.csv')

In [0]:
# Create DataFrame with topic class as column name
columnNames = ['Benefits', 'Coworkers', 'EmployeeRelations', 'Management',
               'None', 'Organized','Overall','Pay','Schedule','Staffing']

predicts_df[columnNames] = pd.DataFrame(predicts, columns = columnNames )

In [31]:
predicts_df.head(2)

Unnamed: 0,QID,Text,Benefits,Coworkers,EmployeeRelations,Management,None,Organized,Overall,Pay,Schedule,Staffing
0,Positive,all is good,0.2987912,0.000109,0.000152,0.000257,0.681909,4.9e-05,0.001233,4.7e-05,0.000155,0.00264
1,Negative,Nothing,1.788139e-07,0.0,0.0,0.0,1.0,0.0,1e-06,0.0,0.0,0.0


In [0]:
# Convert probabilities to labels by using different thresholds

predicts_df['Benefits']=predicts_df['Benefits']>0.7
predicts_df['Coworkers']=predicts_df['Coworkers']>0.5
predicts_df['EmployeeRelations']=predicts_df['EmployeeRelations']>0.05
predicts_df['Management']=predicts_df['Management']>0.85
predicts_df['None']=predicts_df['None']>0.95
predicts_df['Organized']=predicts_df['Organized']>0.05
predicts_df['Overall']=predicts_df['Overall']>0.2
predicts_df['Pay']=predicts_df['Pay']>0.4
predicts_df['Schedule']=predicts_df['Schedule']>0.4
predicts_df['Staffing']=predicts_df['Staffing']>0.3
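
The per-class cutoffs above can also be kept in one dictionary and applied in a loop, which makes them easier to audit and tune. This is only a sketch: `THRESHOLDS` and `apply_thresholds` are hypothetical names, and the two probability rows are copied from the `predicts_df.head(2)` output shown earlier.

```python
import pandas as pd

# Same cutoffs as in the cell above, collected in one place
THRESHOLDS = {
    'Benefits': 0.7, 'Coworkers': 0.5, 'EmployeeRelations': 0.05,
    'Management': 0.85, 'None': 0.95, 'Organized': 0.05,
    'Overall': 0.2, 'Pay': 0.4, 'Schedule': 0.4, 'Staffing': 0.3,
}

def apply_thresholds(df, thresholds):
    """Return a copy of df with each probability column turned into a 0/1 flag."""
    out = df.copy()
    for column, cutoff in thresholds.items():
        out[column] = (out[column] > cutoff).astype(int)
    return out

# The two rows shown in the head(2) output
probs = pd.DataFrame(
    [[0.2987912, 0.000109, 0.000152, 0.000257, 0.681909,
      4.9e-05, 0.001233, 4.7e-05, 0.000155, 0.00264],
     [1.788139e-07, 0.0, 0.0, 0.0, 1.0, 0.0, 1e-06, 0.0, 0.0, 0.0]],
    columns=list(THRESHOLDS),
)
flags = apply_thresholds(probs, THRESHOLDS)
print(flags)
```

Because each class is thresholded independently, a comment can end up with no topic at all, as happens with "all is good" in the first row.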


In [34]:
predicts_df[columnNames]=predicts_df[columnNames].astype('int')
predicts_df.head()

Unnamed: 0,QID,Text,Benefits,Coworkers,EmployeeRelations,Management,None,Organized,Overall,Pay,Schedule,Staffing
0,Positive,all is good,0,0,0,0,0,0,0,0,0,0
1,Negative,Nothing,0,0,0,0,1,0,0,0,0,0
2,Negative,I believe that there is more favoritism toward...,0,0,1,0,0,0,0,0,0,0
3,Positive,The connection with each individual resident,0,0,0,0,0,0,0,0,0,0
4,Positive,no,0,0,0,0,1,0,0,0,0,0


## **Model 2: Binary Sentiment Classification** 

### 1. Load Data 

In [35]:
# Copy data to colab drive
!cp '/content/drive/My Drive/Insight/sentiment_labels_training.csv' 'sentiment_labels_training.csv'
data_df = pd.read_csv('sentiment_labels_training.csv', header=None, names=['QID','Comment','Sentiment'])
data_df.head()

Unnamed: 0,QID,Comment,Sentiment
0,Positive,all is good,1
1,Negative,Nothing,1
2,Negative,I believe that there is more favoritism toward...,0
3,Positive,The connection with each individual resident,1
4,Positive,no,0


### 2. Preprocess Data

In [0]:
# Concatenate the Question Asked to Capture Implied Sentiment
data_df['Question'] = data_df['QID'].replace(['Positive','Negative'],['What works well: ','What needs improvement: '])
data_df['Text'] = data_df['Question']+data_df['Comment']
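
The idea is that a short answer such as "Nothing" means something different depending on which question it answers. A plain-Python sketch of the same prefixing step is shown below; `QUESTION_PREFIX` and `add_question` are hypothetical names, and treating a missing comment as an empty string is an assumption that the original cell does not make.

```python
QUESTION_PREFIX = {
    'Positive': 'What works well: ',
    'Negative': 'What needs improvement: ',
}

def add_question(qid, comment):
    """Prepend the survey question to a comment so the model sees the context."""
    prefix = QUESTION_PREFIX[qid]
    # Assumption: missing comments (NaN) become an empty string
    text = comment if isinstance(comment, str) else ''
    return prefix + text

examples = [
    add_question('Positive', 'all is good'),
    add_question('Negative', 'Nothing'),
    add_question('Negative', float('nan')),
]
print(examples)
```

Unlike `Series.replace`, which leaves unknown values unchanged, this version raises a `KeyError` for an unexpected `QID`, which may be preferable when the set of questions is fixed.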

### 3. Split Data Into Train and Test Sets

In [0]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data_df[['Text','Sentiment']], test_size=0.1)

# Create datasets (Only take up to 100 words for memory)
train_text = train_df['Text'].tolist()
train_text = [' '.join(t.split()[0:100]) for t in train_text]
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
train_label = train_df['Sentiment'].tolist()

test_text = test_df['Text'].tolist()
test_text = [' '.join(t.split()[0:100]) for t in test_text]
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
test_label = test_df['Sentiment'].tolist()
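
The `[:, np.newaxis]` step turns the list of strings into a column of shape `(n, 1)`, matching the `Input(shape=(1,), dtype="string")` layer used by the model. A small sketch of that conversion follows; `to_column` is a hypothetical helper name.

```python
import numpy as np

def to_column(texts, max_words=100):
    """Clip each text to max_words tokens and return an (n, 1) object array."""
    clipped = [' '.join(t.split()[:max_words]) for t in texts]
    return np.array(clipped, dtype=object)[:, np.newaxis]

batch = to_column(['What works well: all is good',
                   'What needs improvement: Nothing'])
print(batch.shape)  # (2, 1)
```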

### 4. Build ELMo Embedding Layer

In [0]:
# Create a custom layer that allows us to update weights


class ElmoEmbeddingLayer(Layer):
    def __init__(self, **kwargs):
        self.dimensions = 1024
        self.trainable=True
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.elmo = hub.Module('https://tfhub.dev/google/elmo/2', trainable=self.trainable,
                               name="{}_module".format(self.name))

        self.trainable_weights += K.tf.trainable_variables(scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(K.squeeze(K.cast(x, tf.string), axis=1),
                      as_dict=True,
                      signature='default',
                      )['default']
        return result

    def compute_mask(self, inputs, mask=None):
        return K.not_equal(inputs, '--PAD--')

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.dimensions)

### 5. Build Model

In [39]:
from keras.optimizers import Adam

input_text = layers.Input(shape=(1,), dtype="string")
embedding = ElmoEmbeddingLayer()(input_text)
dense = layers.Dense(128, activation='relu')(embedding)
pred = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)

model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.0001), metrics=['accuracy'])
model.summary()


INFO:tensorflow:Saver not created because there are no variables in the graph to restore

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 1)                 0         
_________________________________________________________________
elmo_embedding_layer_2 (Elmo (None, 1024)              4         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               131200    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total params: 131,333
Trainable params: 131,333
Non-trainable params: 0
_________________________________________________________________


### 6. Train Model

In [44]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())  
    session.run(tf.tables_initializer())
    history = model.fit(train_text, train_label, epochs=10, batch_size=6, validation_data=(test_text, test_label))
    model.save_weights('./elmo-sentiment.h5')

Train on 404 samples, validate on 45 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### 7. Make Predictions on Test Set

In [0]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./elmo-sentiment.h5')  
    predicts = model.predict(test_text, batch_size=64)

### 8. Model Evaluation

In [46]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Threshold the sigmoid outputs at 0.5 and flatten to a 1-D array of 0/1 labels
y_pred = (predicts > 0.5).astype(int).ravel()

print('Test Accuracy Score: ', format(accuracy_score(test_label,y_pred)))
print('Test Precision Score: ', format(precision_score(test_label, y_pred)))
print('Test Recall Score: ', format(recall_score(test_label, y_pred)))

Test Accuracy Score:  0.9111111111111111
Test Precision Score:  0.8421052631578947
Test Recall Score:  0.9411764705882353
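
With only 45 validation samples, the printed scores pin down the confusion matrix: precision 16/19 and recall 16/17 with 41 of 45 correct imply 16 true positives, 3 false positives, 1 false negative, and 25 true negatives. The sketch below re-derives the scores from those inferred counts, adds the F1 score, and attaches a Wilson interval to the accuracy. The counts are inferred rather than read from the notebook, and `wilson_interval` is a hypothetical helper.

```python
import math

# Counts inferred from the printed scores (45 validation samples in total)
tp, fp, fn, tn = 16, 3, 1, 25
n = tp + fp + fn + tn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

low, high = wilson_interval(tp + tn, n)
print(accuracy, precision, recall, f1)
print(low, high)
```

The interval is wide because the test set is small, so the accuracy figure is best read as a rough estimate rather than a precise one.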


### 9. Make Predictions on Unlabeled Data

In [0]:
# data_list = unlabeled_df['Text'].tolist()
# data_list = [' '.join(t.split()[0:100]) for t in data_list]
# data_array = np.array(data_list, dtype=object)[:, np.newaxis]

# with tf.Session() as session:
#     K.set_session(session)
#     session.run(tf.global_variables_initializer())
#     session.run(tf.tables_initializer())
#     model.load_weights('./elmo-sentiment.h5')  
#     predicts = model.predict(data_array, batch_size=128)

In [0]:
# Save Predictions
#np.savetxt('sentiment_final.txt', predicts)
#!cp 'sentiment_final.txt' '/content/drive/My Drive/Insight/sentiment_final.txt'