# Introduction

The dataset used in this homework is my project data, scraped from Glassdoor reviews of companies that have had or currently have a female CEO. The data includes two text components: positive and negative comments about the company. It also has a column giving the employee's employment status. I want to use this dataset to build a model that predicts future employee turnover from employee comments.

I tried to run this on my full data, but it took forever, so I drew a random subset of 2,000 entries for this homework. I made sure that there is no class imbalance in my outcome variable.
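One way to draw a class-balanced random subset is sketched below. This is only an illustration, not necessarily how `df_subset.csv` was produced: the helper `balanced_sample`, the toy rows, and the per-class count are all hypothetical.

```python
import random
from collections import Counter


def balanced_sample(rows, label_key, n_per_class, seed=0):
    """Draw the same number of rows from every class (hypothetical helper)."""
    rng = random.Random(seed)
    # Group the rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    # Sample without replacement from each group, then shuffle the result.
    sample = []
    for _, group in sorted(by_class.items()):
        sample.extend(rng.sample(group, n_per_class))
    rng.shuffle(sample)
    return sample


# Toy stand-in for the full data: 70 rows with status 1 and 30 with status 0.
rows = [{"review_pros": f"text {i}", "status": 1 if i < 70 else 0} for i in range(100)]
subset = balanced_sample(rows, "status", n_per_class=20)
counts = Counter(row["status"] for row in subset)
print(counts)
```

With a pandas DataFrame, the same idea could be expressed with a group-wise sample per `status` value.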

Link to data & Code: https://pennstateoffice365-my.sharepoint.com/:f:/r/personal/tzz5177_psu_edu/Documents/597hm4?csf=1&web=1&e=Q3efK2

# Packages and Data

In [1]:
import pandas as pd
import numpy as np

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import roc_auc_score, confusion_matrix
from gensim.models import KeyedVectors




In [19]:
all = pd.read_csv("df_subset.csv")

In [20]:
all['status'] 

0       0
1       1
2       1
3       1
4       0
       ..
1995    1
1996    0
1997    0
1998    1
1999    0
Name: status, Length: 2000, dtype: int64

# Version 1

In [24]:
X_train, X_test, y_train, y_test = train_test_split(all['review_pros'], all['status'], test_size=0.2)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train).toarray()
X_test = vectorizer.transform(X_test).toarray()

model = Sequential()
model.add(Dense(10, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(1, activation='sigmoid'))  # Sigmoid for binary classification


model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.5849999785423279


The first run uses very simple settings. Accuracy is .58, which is not much better than guessing.
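To make the input of version 1 concrete, here is a simplified pure-Python sketch of the bag-of-words features that `CountVectorizer` builds. It assumes the default settings (lowercasing and tokens of two or more word characters); the function names and the toy sentences are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters, like the default pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(texts):
    # Map each distinct training token to a column index, in sorted order.
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(texts, vocab):
    # One row per text, one count per vocabulary word; unseen words are dropped.
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


train_texts = ["Good pay and good people", "Free lunch"]
vocab = fit_vocabulary(train_texts)
X = transform(train_texts + ["good daycare"], vocab)
print(vocab)
print(X)
```

Word order is lost in this representation, and words that appear only in the test set ("daycare" here) contribute nothing, which may be part of why such a simple model struggles.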

In [22]:
y_pred = model.predict(X_test)
y_pred_class = (y_pred > 0.5).astype('int32')

# Calculate AUC-ROC
auc_roc = roc_auc_score(y_test, y_pred)
print(f"AUC-ROC: {auc_roc}")

# Calculate Specificity
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
specificity = tn / (tn + fp)
print(f"Specificity: {specificity}")

AUC-ROC: 0.5923308270676693
Specificity: 0.5631578947368421


AUC-ROC = .59 and Specificity = .56. Not great. I am going to set the learning rate explicitly (0.001, which is also Adam's default in Keras), make the batch size smaller, increase the number of epochs, and add more layers with dropout.
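For reference, specificity is the share of true class-0 cases that the model also labels 0. With binary 0/1 labels, `confusion_matrix(...).ravel()` returns the counts in the order tn, fp, fn, tp, with 0 as the negative class. The sketch below reproduces that calculation in plain Python on made-up labels and probabilities (the helper name and the numbers are illustrative only).

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count tn, fp, fn, tp for binary labels, treating 0 as the negative class."""
    tn = fp = fn = tp = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob > threshold else 0
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


# Made-up example: four true 0s followed by four true 1s.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.4, 0.6, 0.7, 0.3, 0.8, 0.9, 0.55]

tn, fp, fn, tp = confusion_counts(y_true, y_prob)
specificity = tn / (tn + fp)  # share of true 0s predicted as 0
sensitivity = tp / (tp + fn)  # share of true 1s predicted as 1
print(tn, fp, fn, tp, specificity, sensitivity)
```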


# Version 2

In [25]:
# Hyperparameters
learning_rate = 0.001
batch_size = 16
epochs = 25


model_v2 = Sequential()
model_v2.add(Dense(64, activation='relu', input_dim=X_train.shape[1]))
model_v2.add(Dropout(0.5))
model_v2.add(Dense(32, activation='relu'))
model_v2.add(Dropout(0.5))
model_v2.add(Dense(16, activation='relu'))
model_v2.add(Dense(1, activation='sigmoid'))

optimizer = Adam(learning_rate=learning_rate)

model_v2.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model_v2.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)


y_pred_prob_v2 = model_v2.predict(X_test)
y_pred_class_v2 = (y_pred_prob_v2 > 0.5).astype('int32')


auc_roc_v2 = roc_auc_score(y_test, y_pred_prob_v2)
print(f"AUC-ROC: {auc_roc_v2}")

tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class_v2).ravel()
specificity_v2 = tn / (tn + fp)
print(f"Specificity: {specificity_v2}")

loss_v2, accuracy_v2 = model_v2.evaluate(X_test, y_test)
print(f"Test Loss: {loss_v2}")
print(f"Test Accuracy: {accuracy_v2}")

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
AUC-ROC: 0.6074891970656215
Specificity: 0.4719626168224299
Test Loss: 2.47048282623291
Test Accuracy: 0.5874999761581421


AUC-ROC = .61, Specificity = .47, Accuracy = .59. AUC-ROC improved slightly, but specificity dropped. I am going to try a CNN in the next model.

# Version 3

In [29]:

max_features_v3 = 10000  # Size of the vocabulary
maxlen_v3 = 100         # Maximum length of each sequence
tokenizer_v3 = Tokenizer(num_words=max_features_v3)
tokenizer_v3.fit_on_texts(all['review_pros'])
X_v3 = tokenizer_v3.texts_to_sequences(all['review_pros'])
X_v3 = pad_sequences(X_v3, maxlen=maxlen_v3)
X_train_v3, X_test_v3, y_train_v3, y_test_v3 = train_test_split(X_v3, all['status'], test_size=0.2, random_state=42)

# CNN model architecture
model_v3 = Sequential()
model_v3.add(Embedding(max_features_v3, 50, input_length=maxlen_v3))
model_v3.add(Conv1D(64, 5, activation='relu'))
model_v3.add(GlobalMaxPooling1D())
model_v3.add(Dense(10, activation='relu'))
model_v3.add(Dropout(0.5))
model_v3.add(Dense(1, activation='sigmoid'))


optimizer_v3 = Adam(learning_rate=0.001)
model_v3.compile(optimizer=optimizer_v3, loss='binary_crossentropy', metrics=['accuracy'])
model_v3.fit(X_train_v3, y_train_v3, epochs=30, batch_size=32)


y_pred_prob_v3 = model_v3.predict(X_test_v3)
y_pred_class_v3 = (y_pred_prob_v3 > 0.5).astype('int32')


auc_roc_v3 = roc_auc_score(y_test_v3, y_pred_prob_v3)
print(f"AUC-ROC: {auc_roc_v3}")
tn, fp, fn, tp = confusion_matrix(y_test_v3, y_pred_class_v3).ravel()
specificity_v3 = tn / (tn + fp)
print(f"Specificity: {specificity_v3}")
loss_v3, accuracy_v3 = model_v3.evaluate(X_test_v3, y_test_v3)
print(f"Test Loss: {loss_v3}")
print(f"Test Accuracy: {accuracy_v3}")


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
AUC-ROC: 0.6158599124452783
Specificity: 0.6102564102564103
Test Loss: 2.316941022872925
Test Accuracy: 0.5699999928474426


AUC-ROC = .62 and Specificity = .61. This version is better than v2 on both of these metrics, although accuracy is slightly lower (.57 vs. .59). I want to start using a pre-trained embedding in the next version.
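The CNN versions feed fixed-length integer sequences into the embedding layer. As a rough illustration of what `pad_sequences` does with its default settings (padding and truncating both at the front, with 0 as the fill value), here is a plain-Python sketch; `pad_pre` and the toy sequences are hypothetical.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded


# Toy word-id sequences: one shorter and one longer than maxlen.
sequences = [[5, 3], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(sequences, maxlen=4)
print(padded)
```

With `maxlen=100`, a review longer than 100 tokens keeps only its last 100 tokens under these defaults.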

# Version 4 (accidentally named v5)

In [33]:
import gensim.downloader as api

# Download and load the Word2Vec model from Gensim
word_vectors = api.load("word2vec-google-news-300")




In [34]:
# Text preprocessing
max_features_v5 = 10000
maxlen_v5 = 100
tokenizer_v5 = Tokenizer(num_words=max_features_v5)
tokenizer_v5.fit_on_texts(all['review_pros'])
X_v5 = tokenizer_v5.texts_to_sequences(all['review_pros'])
X_v5 = pad_sequences(X_v5, maxlen=maxlen_v5)
X_train_v5, X_test_v5, y_train_v5, y_test_v5 = train_test_split(X_v5, all['status'], test_size=0.2, random_state=42)

embedding_dim_v5 = 300
embedding_matrix_v5 = np.zeros((max_features_v5, embedding_dim_v5))
for word, i in tokenizer_v5.word_index.items():
    if i < max_features_v5 and word in word_vectors.key_to_index:
        embedding_matrix_v5[i] = word_vectors[word]

model_v5 = Sequential()
model_v5.add(Embedding(max_features_v5, embedding_dim_v5, input_length=maxlen_v5,
                       weights=[embedding_matrix_v5], trainable=False))
model_v5.add(Conv1D(64, 5, activation='relu'))
model_v5.add(GlobalMaxPooling1D())
model_v5.add(Dense(10, activation='relu'))
model_v5.add(Dropout(0.5))
model_v5.add(Dense(1, activation='sigmoid'))


model_v5.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_v5.fit(X_train_v5, y_train_v5, epochs=30, batch_size=32)


loss_v5, accuracy_v5 = model_v5.evaluate(X_test_v5, y_test_v5)
print(f"Test Loss: {loss_v5}")
print(f"Test Accuracy: {accuracy_v5}")


y_pred_prob_v5 = model_v5.predict(X_test_v5)
y_pred_class_v5 = (y_pred_prob_v5 > 0.5).astype('int32')
auc_roc_v5 = roc_auc_score(y_test_v5, y_pred_prob_v5)
print(f"AUC-ROC: {auc_roc_v5}")
tn, fp, fn, tp = confusion_matrix(y_test_v5, y_pred_class_v5).ravel()
specificity_v5 = tn / (tn + fp)
print(f"Specificity: {specificity_v5}")

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Test Loss: 1.696050763130188
Test Accuracy: 0.5625
AUC-ROC: 0.5497185741088181
Specificity: 0.358974358974359


Using Word2Vec actually makes my model perform worse. It could be that its training corpus does not match my data (news articles vs. online comments). I am going to make one last try by changing the complexity of my neural network: adjusting the layer sizes and the number of layers.
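One way to probe the mismatch hypothesis would be to measure how many tokenizer words actually receive a pre-trained vector, since any word missing from the Word2Vec vocabulary keeps an all-zero row in the embedding matrix. The sketch below uses a small made-up vocabulary in place of `tokenizer_v5.word_index` and `word_vectors.key_to_index`; the helper name and every value in it are assumptions, not measurements from this data.

```python
def embedding_coverage(word_index, pretrained_vocab, max_features):
    """Return the share of kept words with a pre-trained vector, plus the misses."""
    # Only words with an index below max_features get a row in the matrix.
    kept = [word for word, idx in word_index.items() if idx < max_features]
    missing = [word for word in kept if word not in pretrained_vocab]
    coverage = (len(kept) - len(missing)) / len(kept) if kept else 0.0
    return coverage, missing


# Toy stand-ins: the Keras Tokenizer lowercases words by default, while a
# pre-trained vocabulary may store some entries with capital letters.
word_index = {"good": 1, "benefits": 2, "wfh": 3, "kms": 4, "pto": 5}
pretrained_vocab = {"good", "benefits", "PTO"}

coverage, missing = embedding_coverage(word_index, pretrained_vocab, max_features=10000)
print(coverage, missing)
```

If coverage on the real vocabulary turned out to be low, that would support the mismatch explanation; if it is high, the frozen embedding (`trainable=False`) may be the more likely cause.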

# Version 5 (v6)

In [35]:
max_features_v6 = 10000  
maxlen_v6 = 100          

tokenizer_v6 = Tokenizer(num_words=max_features_v6)
tokenizer_v6.fit_on_texts(all['review_pros'])
X_v6 = tokenizer_v6.texts_to_sequences(all['review_pros'])
X_v6 = pad_sequences(X_v6, maxlen=maxlen_v6)

X_train_v6, X_test_v6, y_train_v6, y_test_v6 = train_test_split(X_v6, all['status'], test_size=0.2, random_state=42)

# Adjusted model architecture
model_v6 = Sequential()
model_v6.add(Embedding(max_features_v6, 50, input_length=maxlen_v6))  # Embedding layer
model_v6.add(Conv1D(128, 3, activation='relu'))  # Increased number of filters and changed kernel size
model_v6.add(GlobalMaxPooling1D())
model_v6.add(Dense(20, activation='relu'))  # Increased size of this dense layer
model_v6.add(Dropout(0.5))  # Dropout for regularization
model_v6.add(Dense(10, activation='relu'))  # Additional dense layer
model_v6.add(Dense(1, activation='sigmoid'))  # Output layer


model_v6.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_v6.fit(X_train_v6, y_train_v6, epochs=30, batch_size=32)


loss_v6, accuracy_v6 = model_v6.evaluate(X_test_v6, y_test_v6)
print(f"Test Loss: {loss_v6}")
print(f"Test Accuracy: {accuracy_v6}")


y_pred_prob_v6 = model_v6.predict(X_test_v6)
y_pred_class_v6 = (y_pred_prob_v6 > 0.5).astype('int32')
auc_roc_v6 = roc_auc_score(y_test_v6, y_pred_prob_v6)
print(f"AUC-ROC: {auc_roc_v6}")
tn, fp, fn, tp = confusion_matrix(y_test_v6, y_pred_class_v6).ravel()
specificity_v6 = tn / (tn + fp)
print(f"Specificity: {specificity_v6}")

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Test Loss: 2.6523804664611816
Test Accuracy: 0.5924999713897705
AUC-ROC: 0.6364727954971858
Specificity: 0.7128205128205128


This version performs a lot better than the previous version. Accuracy = .59, AUC-ROC = .64, and Specificity (the metric I care about the most) = .71.

# Version 6 (v7)

In [36]:
# Parameters (based on the previous successful model)
max_features_v7 = 8000  # Reduced from 10000 in v6
maxlen_v7 = 100         # Keeping the same as v6

# Preprocessing (same steps as v6, with the smaller vocabulary)
tokenizer_v7 = Tokenizer(num_words=max_features_v7)
tokenizer_v7.fit_on_texts(all['review_pros'])
X_v7 = tokenizer_v7.texts_to_sequences(all['review_pros'])
X_v7 = pad_sequences(X_v7, maxlen=maxlen_v7)
X_train_v7, X_test_v7, y_train_v7, y_test_v7 = train_test_split(X_v7, all['status'], test_size=0.2, random_state=42)

# Adjusted model architecture
model_v7 = Sequential()
model_v7.add(Embedding(max_features_v7, 50, input_length=maxlen_v7))
model_v7.add(Conv1D(128, 3, activation='relu'))
model_v7.add(Conv1D(128, 3, activation='relu'))  # Additional convolutional layer
model_v7.add(GlobalMaxPooling1D())
model_v7.add(Dense(30, activation='relu'))  # Increased neurons in this layer
model_v7.add(Dropout(0.5))
model_v7.add(Dense(15, activation='relu'))  # Another dense layer with more neurons
model_v7.add(Dropout(0.5))  # Additional dropout layer for regularization
model_v7.add(Dense(1, activation='sigmoid'))


model_v7.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_v7.fit(X_train_v7, y_train_v7, epochs=30, batch_size=32)


loss_v7, accuracy_v7 = model_v7.evaluate(X_test_v7, y_test_v7)
print(f"Test Loss: {loss_v7}")
print(f"Test Accuracy: {accuracy_v7}")

y_pred_prob_v7 = model_v7.predict(X_test_v7)
y_pred_class_v7 = (y_pred_prob_v7 > 0.5).astype('int32')
auc_roc_v7 = roc_auc_score(y_test_v7, y_pred_prob_v7)
print(f"AUC-ROC: {auc_roc_v7}")
tn, fp, fn, tp = confusion_matrix(y_test_v7, y_pred_class_v7).ravel()
specificity_v7 = tn / (tn + fp)
print(f"Specificity: {specificity_v7}")

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Test Loss: 6.378418445587158
Test Accuracy: 0.5475000143051147
AUC-ROC: 0.5863539712320199
Specificity: 0.6153846153846154


This is worse than v6; maybe it overfit. I am going to keep v6 as my final model for this homework. Now, to test how well it works, I am going to write a few sentences myself and use my model to predict on them. I tried to think of how I'd rate these companies if I worked for them.
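A test loss of 6.38 is far above the roughly 0.69 (ln 2) that a constant 0.5 prediction would score on binary cross-entropy, which is consistent with the overfitting suspicion. A common remedy is early stopping on a validation loss; Keras offers an `EarlyStopping` callback for this, used together with validation data passed to `fit`. The sketch below only illustrates the patience rule in plain Python, with invented validation losses and a hypothetical helper name.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both zero-indexed, under a patience rule."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # stop training here
    return best_epoch, len(val_losses) - 1


# Invented validation losses: improving at first, then rising as the model overfits.
val_losses = [0.69, 0.66, 0.65, 0.67, 0.72, 0.90, 1.40]
best_epoch, stop_epoch = best_epoch_with_patience(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

Under this rule training would stop well before 30 epochs, and the weights from the best epoch would be the ones to keep.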

# Testing Model Performance with my Writings

In [37]:
# Sample text entries
texts = [
    "Honestly, I don't know what to say about this place, I really can't think of anything good to say. I HATE HATE my job at Walgreens.",
    "Bed bath and beyond is honestly the best employer I have, people are nice, the benefits are good, and my manager she's like an angel. It's really the people that make your experience great.",
    "Payments are good, work-life balance is good, and end of year benefit is decent.",
    "It's a big company across the world so you'd get a lot of travel opportunities. I guess it could be good or bad. I like the free lunch provided there.",
    "NONE None NONE DON'T work for them.",
    "This job makes me want to kms.",
    "They have free daycare in the building, which to me is the most amazing thing = I can leave my kid there and visit her during the day within my office building. I think this by itself makes me want to work for Oracle forever.",
    "Employee discount is pretty good. Sometimes you get 40% off for some latest shoes, and you can always get the first dibs on all the popular ones."
]


X_sample = tokenizer_v6.texts_to_sequences(texts)
X_sample = pad_sequences(X_sample, maxlen=maxlen_v6)
predictions = model_v6.predict(X_sample)
predicted_classes = (predictions > 0.5).astype(int)


predicted_classes




array([[0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0]])

1. Honestly, I don't know what to say about this place, I really can't think of anything good to say. I HATE HATE my job at Walgreens.
[Model prediction - Quit]

2. Bed bath and beyond is honestly the best employer I have, people are nice, the benefits are good, and my manager she's like an angel. It's really the people that make your experience great.
[Model prediction - Quit]

3. Payments are good, work-life balance is good, and end of year benefit is decent.
[Model prediction - Stay]

4. It's a big company across the world so you'd get a lot of travel opportunities. I guess it could be good or bad. I like the free lunch provided there.
[Model prediction - Quit]

5. NONE None NONE DON'T work for them.
[Model prediction - Quit]

6. This job makes me want to kms.
[Model prediction - Stay]

7. They have free daycare in the building, which to me is the most amazing thing = I can leave my kid there and visit her during the day within my office building. I think this by itself makes me want to work for Oracle forever.
[Model prediction - Stay]

8. Employee discount is pretty good. Sometimes you get 40% off for some latest shoes, and you can always get the first dibs on all the popular ones.
[Model prediction - Quit]

# Summary

I think the model I built is decently good. It definitely captured the most intense comments, 1, 5, and 7, which are super negative or super positive. It missed 6, so it does not capture internet slang too well, which makes sense. The model is definitely more prone to predict Quit than Stay, which aligns with my original intention: I want it to catch employees with even remote turnover potential, so that companies could target some preventive measures. For ambiguous comments such as 4 and 8, the model predicted Quit; a human reader could go either way on those. After all, had I had a different purpose, such as maximizing other metrics, this model might not have been selected.
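How strongly a model leans toward Quit (class 0) also depends on the 0.5 cutoff applied to its predicted probabilities. The sketch below, on made-up labels and probabilities, shows how raising the threshold labels more cases as 0, which raises specificity at the cost of sensitivity. The helper name and all numbers are illustrative assumptions, not results from `model_v6`.

```python
def rates_at_threshold(y_true, y_prob, threshold):
    """Return (specificity, sensitivity) with class 0 treated as the negative class."""
    tn = fp = fn = tp = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob > threshold else 0
        if truth == 0:
            tn += pred == 0
            fp += pred == 1
        else:
            fn += pred == 0
            tp += pred == 1
    return tn / (tn + fp), tp / (tp + fn)


# Made-up example: four true 0s (Quit) followed by four true 1s (Stay).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.4, 0.6, 0.7, 0.3, 0.8, 0.9, 0.55]

# Map each candidate threshold to its (specificity, sensitivity) pair.
results = {t: rates_at_threshold(y_true, y_prob, t) for t in (0.3, 0.5, 0.7)}
for threshold, (spec, sens) in results.items():
    print(threshold, spec, sens)
```

If catching potential quitters matters most, the cutoff could be chosen on a validation set to reach a target specificity instead of staying fixed at 0.5.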

The model could definitely be improved if the training data measured turnover intention instead of actual employment status, since employment status can be contaminated by a number of other factors. This model used only 2,000 entries, and I am content with its current performance. With more data and a more sophisticated embedding such as BERT, it might work better. (I tried using BERT, but one epoch with only 600 entries took more than 12 hours, so I had to interrupt it and change strategy.)