This notebook converts the essays from raw strings into word-embedding matrices. It uses the GloVe word embeddings. More information about these embeddings can be found here: https://nlp.stanford.edu/projects/glove/

If you'd like to move directly to the neural network, you can skip this notebook completely and load the prepared data directly.

If you would like to follow along with the preprocessing work, follow the link above and download the glove.6B.zip file to the "data/" directory of this project. Then uncompress the zip file to reveal a directory named "glove.6B/" with four .txt files inside, one for each of the available word embeddings. You can now proceed with this notebook.
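For orientation, each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The following is a minimal sketch of a loader under that assumption. `load_glove` is a hypothetical helper, not necessarily how `src.preprocess` reads the files, and the demo parses a tiny made-up in-memory sample rather than the real download.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe-style text lines into a {word: vector} dict (hypothetical helper)."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # First field is the token, the remaining fields are the vector components
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


# Made-up sample in the same layout (3 dimensions instead of the real embedding size)
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
glove = load_glove(sample)
print(sorted(glove), glove["cat"].shape)
```

With the real data, the same function could be given an open file handle for one of the files in "glove.6B/" instead of the in-memory sample.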

In [1]:
from src.preprocess import get_data, vectorize_essays, pad_embedding, word_count
%load_ext autoreload
%autoreload 2

In [2]:
# Import training data from tsv file and store in a pandas dataframe
data_path = './data/training_set_rel3.tsv'
essay_df = get_data(data_path)

We can inspect the contents of the dataframe to understand what information is provided in the training set.
TODO: more EDA if we have time.

We'd like to drop the columns that are irrelevant to the neural network.

In [3]:
# Columns that are not used by the neural network
drop_labels = ['rater1_trait1', 'rater1_trait2', 'rater1_trait3', 'rater1_trait4', 'rater1_trait5', 'rater1_trait6',
               'rater2_trait1', 'rater2_trait2', 'rater2_trait3', 'rater2_trait4', 'rater2_trait5', 'rater2_trait6',
               'rater3_trait1', 'rater3_trait2', 'rater3_trait3', 'rater3_trait4', 'rater3_trait5', 'rater3_trait6',
               'rater3_domain1', 'rater1_domain2', 'rater2_domain2', 'domain1_score', 'domain2_score']
essay_df.drop(columns=drop_labels, inplace=True)
essay_df.tail()

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1
12971,21626,8,In most stories mothers and daughters are eit...,17,18
12972,21628,8,I never understood the meaning laughter is th...,15,17
12973,21629,8,"When you laugh, is @CAPS5 out of habit, or is ...",20,26
12974,21630,8,Trippin' on fen...,20,20
12975,21633,8,Many people believe that laughter can improve...,20,20


The given dataset holds essays from 8 distinct "essay sets". To speed up training and reduce memory usage, we will only examine a subset of these essay sets. The rest of the essays are discarded.

In [4]:
sets = [1,3,4,5,6]
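# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
essay_df = essay_df[essay_df['essay_set'].isin(sets)].copy()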

"Examining essays in sets {}".format(essay_df.essay_set.unique())

'Examining essays in sets [1 3 4 5 6]'

In [5]:
essays = essay_df['essay'] # A pandas Series of essays as strings

# Add a column to dataframe with essays as word embeddings
essay_df['essays_embed'] = vectorize_essays(essays, embed_size=100, verbose=True)

The total number of essays is 8884
1000 Essays Vectorized
2000 Essays Vectorized
3000 Essays Vectorized
4000 Essays Vectorized
5000 Essays Vectorized
6000 Essays Vectorized
7000 Essays Vectorized
8000 Essays Vectorized
8884 Total Essays Vectorized!
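The implementation of `vectorize_essays` lives in `src.preprocess` and is not shown here. As a rough illustration of the idea only, the sketch below maps each whitespace-separated, lowercased token of one essay to its embedding and stacks the results into a matrix with one row per word. The function name `vectorize_essay`, the tokenization, and the choice of a zero vector for unknown words are all assumptions made for this example.

```python
import numpy as np


def vectorize_essay(essay, embeddings, embed_size):
    """Return a (num_words, embed_size) matrix for one essay (illustrative sketch)."""
    unknown = np.zeros(embed_size, dtype=np.float32)  # assumption: unknown words map to zeros
    tokens = essay.lower().split()                    # assumption: naive whitespace tokenization
    rows = [embeddings.get(token, unknown) for token in tokens]
    if not rows:
        return np.zeros((0, embed_size), dtype=np.float32)
    return np.stack(rows)


# Toy 2-dimensional embeddings, made up for the demo
toy_embeddings = {
    "dear": np.array([1.0, 0.0], dtype=np.float32),
    "newspaper": np.array([0.0, 1.0], dtype=np.float32),
}
matrix = vectorize_essay("Dear local newspaper", toy_embeddings, embed_size=2)
print(matrix.shape)  # (3, 2): three words, two dimensions each
```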


In [6]:
essay_df['word_count'] = essay_df.apply(word_count, axis=1)
# TODO: potentially add a matplotlib visualization later
print("*** Word Count Statistics ***\n")
for essay_set in sets:  # named essay_set to avoid shadowing the built-in `set`
    print('*'*40)
    print("Word count statistics for set {}: ".format(essay_set))
    print(essay_df[essay_df['essay_set']==essay_set]['word_count'].describe())
    print('\n')

*** Word Count Statistics ***

****************************************
Word count statistics for set 1: 
count    1783.000000
mean      319.823331
std       106.196183
min        10.000000
25%       249.000000
50%       318.000000
75%       385.000000
max       732.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 3: 
count    1726.000000
mean       91.243917
std        45.958534
min         6.000000
25%        55.000000
50%        84.000000
75%       123.000000
max       302.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 4: 
count    1770.000000
mean       75.905085
std        43.277093
min         2.000000
25%        42.000000
50%        69.000000
75%       103.000000
max       313.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 5: 
count    1805.000000
mean      103.087535
std        50.1541

We would also like to remove word-count outliers. Essays longer than 600 words are removed, which leaves 8871 of the 8884 essays.

In [7]:
max_length = 600 # Maximum length for an essay
# Copy again so that the padded embeddings can be assigned without SettingWithCopyWarning
essay_df = essay_df[essay_df['word_count']<=max_length].copy()

essay_df['word_count'].describe()

count    8871.000000
mean      143.059858
std       108.228778
min         2.000000
25%        68.000000
50%       113.000000
75%       172.000000
max       600.000000
Name: word_count, dtype: float64

Next we will pad each essay to the maximum length so that every essay's word-embedding matrix has the same size.

In [8]:
essays_embed = essay_df['essays_embed']
essay_df['essays_embed'] = pad_embedding(essays_embed,\
                                         max_length=max_length, right_pad=True)
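To make the padding step concrete, here is a minimal sketch of right- or left-padding a single embedding matrix with zero rows. `pad_matrix` is a hypothetical stand-in written for this example; the actual `pad_embedding` in `src.preprocess` operates on the whole column and may handle edge cases differently.

```python
import numpy as np


def pad_matrix(matrix, max_length, right_pad=True):
    """Pad a (num_words, embed_size) matrix with zero rows up to max_length rows."""
    num_words, embed_size = matrix.shape
    if num_words > max_length:
        raise ValueError("essay is longer than max_length")
    padding = np.zeros((max_length - num_words, embed_size), dtype=matrix.dtype)
    # right_pad=True keeps the words first and appends the zero rows after them
    parts = (matrix, padding) if right_pad else (padding, matrix)
    return np.vstack(parts)


short_essay = np.ones((3, 4), dtype=np.float32)  # 3 words, 4-dimensional toy embeddings
padded = pad_matrix(short_essay, max_length=5)
print(padded.shape)  # (5, 4)
```

Padding every essay to the same number of rows lets the matrices be stacked into a single array for batch training.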

In [9]:
essay_df.dropna(subset=['essays_embed'], inplace=True)

In [10]:
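# Add max score for each set.
# score_ranges holds (min, max) tuples indexed by essay_set number;
# the (0,0) entries are placeholders for indices 0 and 2, which are not used here.
score_ranges = [(0,0), (2,12), (0,0), (0,3), (0,4), (0,4), (0,4)]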

for idx in essay_df.index:
    essay_df.loc[idx,'max_score']=score_ranges[essay_df.loc[idx,'essay_set']][1]

In [11]:
maximum_score = 12
for idx in essay_df.index:
    essay_df.loc[idx,'norm_score1']=essay_df.loc[idx,'rater1_domain1']*maximum_score/essay_df.loc[idx,'max_score']
    essay_df.loc[idx,'norm_score2']=essay_df.loc[idx,'rater2_domain1']*maximum_score/essay_df.loc[idx,'max_score']
    
essay_df['norm_score1']=[int(x) for x in essay_df['norm_score1']]
essay_df['norm_score2']=[int(x) for x in essay_df['norm_score2']]
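The normalization above rescales each rater score to a common 12-point scale by multiplying by 12 and dividing by the set's maximum score, then truncating to an integer. The sketch below restates that arithmetic on plain numbers. The dictionary `MAX_SCORE_BY_SET` simply copies the maximum values from `score_ranges` for the sets used in this notebook, and `normalize_score` is a hypothetical helper name.

```python
# Maximum scores copied from score_ranges for the essay sets kept in this notebook
MAX_SCORE_BY_SET = {1: 12, 3: 3, 4: 4, 5: 4, 6: 4}
TARGET_MAX = 12  # common scale, same value as maximum_score above


def normalize_score(raw_score, essay_set):
    """Rescale a rater score to the 0..TARGET_MAX scale, truncating to an int."""
    return int(raw_score * TARGET_MAX / MAX_SCORE_BY_SET[essay_set])


print(normalize_score(4, 1))  # 4 * 12 / 12 -> 4
print(normalize_score(2, 3))  # 2 * 12 / 3  -> 8
print(normalize_score(3, 5))  # 3 * 12 / 4  -> 9
```

Because every maximum in this table divides 12 evenly, integer rater scores always map to integer normalized scores, so the `int(...)` conversion does not discard any fractional part here.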

In [14]:
essay_df.head()

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,essays_embed,word_count,max_score,norm_score1,norm_score2
0,1,1,"Dear local newspaper, I think effects computer...",4,4,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",299,12.0,4,4
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5,4,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",349,12.0,5,4
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",4,3,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",236,12.0,4,3
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",5,5,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",449,12.0,5,5
4,5,1,"Dear @LOCATION1, I know having computers has a...",4,4,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",387,12.0,4,4


In [15]:
# Store the preprocessed data frame for later use
essay_df.to_pickle('./data/essay_df.pkl')