This notebook converts the essays from raw strings into word-embedding vectors. It uses the GloVe word embeddings. More information about these word embeddings can be found here: https://nlp.stanford.edu/projects/glove/

If you'd like to move directly to the neural network, you can skip this notebook completely and load the prepared data directly!

If you would like to follow along with the preprocessing work, follow the link above and download the glove.6B.zip file to the "data/" directory of this project. Then uncompress the zip file to reveal a directory named "glove.6B/" with 4 txt files inside. These are the four available sets of word embeddings. You are now able to proceed with this notebook.
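
As a rough illustration of what the downloaded files contain, the sketch below parses GloVe's plain-text format, in which each line holds a word followed by its space-separated vector components. The helper name `load_glove` and the file name `glove.6B.100d.txt` are assumptions made for this example; the project's own loading code lives in `src.preprocess` and may differ.

```python
import io
import numpy as np

def load_glove(lines):
    """Parse GloVe text lines ("word v1 v2 ... vd") into a dict of word -> vector."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Tiny in-memory stand-in for a real embedding file (hypothetical values)
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n")
glove = load_glove(sample)
print(glove['cat'].shape)

# With the real download (path and file name assumed from the instructions above):
# with open('data/glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
#     glove = load_glove(f)
```

Keeping the vectors in a plain dictionary makes the per-word lookup cheap, at the cost of holding the whole vocabulary in memory.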

In [140]:
from src.preprocess import get_data, vectorize_essays, pad_embedding, word_count
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [141]:
# Import training data from tsv file and store in a pandas dataframe
data_path = './data/training_set_rel3.tsv'
essay_df = get_data(data_path)

We can inspect the contents of the dataframe to understand what information is provided in the training set. Further exploratory data analysis (EDA) may be added later, time permitting.

We'd like to drop some columns that are irrelevant to the neural network.

In [142]:
drop_labels = ['rater1_trait1', 'rater1_trait2', 'rater1_trait3', 'rater1_trait4', 'rater1_trait5', 'rater1_trait6',
               'rater2_trait1', 'rater2_trait2', 'rater2_trait3', 'rater2_trait4', 'rater2_trait5', 'rater2_trait6',
               'rater3_trait1', 'rater3_trait2', 'rater3_trait3', 'rater3_trait4', 'rater3_trait5', 'rater3_trait6',
               'rater1_domain1', 'rater2_domain1', 'rater3_domain1', 'rater1_domain2', 'rater2_domain2']
essay_df.drop(columns=drop_labels, inplace=True)
essay_df.tail()

Unnamed: 0,essay_id,essay_set,essay,domain1_score,domain2_score
12971,21626,8,In most stories mothers and daughters are eit...,35,
12972,21628,8,I never understood the meaning laughter is th...,32,
12973,21629,8,"When you laugh, is @CAPS5 out of habit, or is ...",40,
12974,21630,8,Trippin' on fen...,40,
12975,21633,8,Many people believe that laughter can improve...,40,


The given dataset holds essays from 8 distinct "essay sets". To reduce training time and memory usage, we will only examine a subset of these essay sets. The rest of the essays can be discarded.

In [143]:
sets = [1,3,4,5,6]
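# Copy the filtered frame so that later column assignments do not trigger SettingWithCopyWarning
essay_df = essay_df[essay_df['essay_set'].isin(sets)].copy()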

print("Examining essays in sets {}".format(essay_df.essay_set.unique()))

Examining essays in sets [1 3 4 5 6]


In [144]:
essays = essay_df['essay'] # A pandas Series of essays as strings

# Add a column to dataframe with essays as word embeddings
essay_df['essays_embed'] = vectorize_essays(essays, embed_size=100, verbose=True)

The total number of essays is 8884
1000 Essays Vectorized
2000 Essays Vectorized
3000 Essays Vectorized
4000 Essays Vectorized
5000 Essays Vectorized
6000 Essays Vectorized
7000 Essays Vectorized
8000 Essays Vectorized
8884 Total Essays Vectorized!
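
The implementation of `vectorize_essays` lives in `src.preprocess` and is not shown in this notebook. As a minimal sketch of the general idea, assuming simple whitespace tokenization, lowercasing, and zero vectors for unknown words (all assumptions, as is the helper name `vectorize_essay`), a single essay can be turned into a matrix with one row per word:

```python
import numpy as np

def vectorize_essay(essay, embeddings, embed_size):
    """Map each whitespace token to its vector; unknown words become zeros."""
    unknown = np.zeros(embed_size, dtype=np.float32)
    tokens = essay.lower().split()
    vectors = [embeddings.get(tok, unknown) for tok in tokens]
    if not vectors:
        # Empty essay: return a matrix with zero rows
        return np.zeros((0, embed_size), dtype=np.float32)
    return np.stack(vectors)

# Toy 2-dimensional embeddings (hypothetical values, for illustration only)
toy_embeddings = {
    'dear': np.array([1.0, 0.0], dtype=np.float32),
    'newspaper': np.array([0.0, 1.0], dtype=np.float32),
}
matrix = vectorize_essay("Dear local newspaper", toy_embeddings, embed_size=2)
print(matrix.shape)  # (3, 2)
```

Each essay therefore yields a matrix whose row count equals its word count, which is why essays need padding to a common length before they can be batched.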


In [145]:
essay_df['word_count'] = essay_df.apply(word_count, axis=1)
# Potentially add matplotlib visualization later
print("*** Word Count Statistics ***\n")
for essay_set in sets:  # avoid shadowing the built-in `set`
    print('*'*40)
    print("Word count statistics for set {}: ".format(essay_set))
    print(essay_df[essay_df['essay_set'] == essay_set]['word_count'].describe())
    print('\n')

*** Word Count Statistics ***

****************************************
Word count statistics for set 1: 
count    1783.000000
mean      403.526640
std       133.734738
min        10.000000
25%       315.000000
50%       402.000000
75%       487.000000
max       918.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 3: 
count    1726.000000
mean      118.953071
std        58.932136
min        10.000000
25%        73.000000
50%       110.000000
75%       159.000000
max       401.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 4: 
count    1770.000000
mean      102.149153
std        57.075221
min         2.000000
25%        57.000000
50%        94.000000
75%       138.750000
max       401.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 5: 
count    1805.000000
mean      134.165651
std        63.9813

We would also like to remove word-count outliers. Essays longer than 600 words will be removed.

In [146]:
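max_length = 600 # Maximum length for an essay
# Copy the filtered frame so that later column assignments do not trigger SettingWithCopyWarning
essay_df = essay_df[essay_df['word_count'] <= max_length].copy()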

print(essay_df['word_count'].describe())

count    8767.000000
mean      179.020532
std       125.991802
min         2.000000
25%        90.000000
50%       147.000000
75%       218.500000
max       600.000000
Name: word_count, dtype: float64


Next we will pad each essay to the maximum length so that every essay's word-embedding matrix has the same size.

In [147]:
essays_embed = essay_df['essays_embed']
essay_df['essays_embed'] = pad_embedding(essays_embed,
                                         max_length=max_length, right_pad=True)
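
The body of `pad_embedding` is defined in `src.preprocess` and is not reproduced here. A minimal sketch of the padding step for one essay, assuming zero rows are used as padding and that longer matrices are truncated (both assumptions, as is the helper name `pad_matrix`), could look like this:

```python
import numpy as np

def pad_matrix(matrix, max_length, right_pad=True):
    """Zero-pad (or truncate) a (n_words, embed_size) matrix to max_length rows."""
    matrix = matrix[:max_length]
    n_missing = max_length - matrix.shape[0]
    padding = np.zeros((n_missing, matrix.shape[1]), dtype=matrix.dtype)
    # right_pad=True appends the zero rows; otherwise they are prepended
    parts = (matrix, padding) if right_pad else (padding, matrix)
    return np.concatenate(parts, axis=0)

essay = np.ones((3, 4), dtype=np.float32)  # 3 words, 4-dimensional embeddings
padded = pad_matrix(essay, max_length=5)
print(padded.shape)  # (5, 4)
```

With `max_length=600` and `embed_size=100`, as used in this notebook, every essay would become a 600 x 100 matrix under this scheme.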

In [139]:
# Add max score for each set
score_ranges = [(2,12), (0,3), (0,3), (0,4), (0,4)]  # (min, max) score, in the same order as `sets`
# Map each essay set to its maximum score, then look it up per row
max_scores = {essay_set: high for essay_set, (low, high) in zip(sets, score_ranges)}
essay_df['max_score'] = essay_df['essay_set'].map(max_scores)

In [119]:
#essay_df = normalize_scores(essay_df, sets, max_score=12)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [120]:
essay_df.describe()

Unnamed: 0,essay_id,essay_set,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,word_count,norm_score
count,8767.0,8767.0,8767.0,8767.0,0.0,8767.0,0.0,0.0,0.0,8767.0,1800.0
mean,9337.33398,3.849549,2.389301,2.379035,,3.302156,,,,179.020532,8.16
std,5082.461507,1.706457,1.31976,1.31955,,2.7182,,,,125.991802,2.911891
min,1.0,1.0,0.0,0.0,,0.0,,,,2.0,0.0
25%,6504.5,3.0,1.0,1.0,,2.0,,,,90.0,6.0
50%,9857.0,4.0,2.0,2.0,,2.0,,,,147.0,9.0
75%,13239.5,5.0,3.0,3.0,,4.0,,,,218.5,9.0
max,16633.0,6.0,6.0,6.0,,12.0,,,,600.0,12.0


In [11]:
# Store the preprocessed data frame for later use
essay_df.to_pickle('./data/essay_df.pkl')
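
To skip this notebook and load the prepared data directly, the pickled dataframe can be read back with `pd.read_pickle`. The sketch below demonstrates the round trip on a small stand-in dataframe in a temporary directory; the column values are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Small stand-in for essay_df (hypothetical values)
df = pd.DataFrame({'essay_set': [1, 3], 'word_count': [338, 97]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'essay_df.pkl')
    df.to_pickle(path)                # same call as in the cell above
    restored = pd.read_pickle(path)   # load the stored dataframe

print(restored.equals(df))  # True
```

In the project itself, the equivalent call would be `pd.read_pickle('./data/essay_df.pkl')`.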