This notebook converts the essays from raw strings into word-embedding matrices using the GloVe word embeddings. More information about these embeddings can be found here: https://nlp.stanford.edu/projects/glove/

To follow along with the preprocessing, download the glove.6B.zip file from the link above into the "data/" directory of this project. Then extract the zip file to produce a directory named "glove.6B/" containing four .txt files, one for each available embedding size (50, 100, 200, and 300 dimensions). You can then proceed with this notebook.
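
As a rough sketch of how one of the extracted files could be read, assuming the standard GloVe text layout (one token per line followed by its space-separated vector components): the function name `load_glove` is hypothetical and is not part of `src.preprocess`, which may load the vectors differently.

```python
import numpy as np

def load_glove(path):
    """Read a GloVe text file into a {token: vector} dictionary (illustrative helper)."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Each line holds a token followed by its vector components
            token, *values = line.rstrip().split(" ")
            embeddings[token] = np.asarray(values, dtype=np.float32)
    return embeddings

# Usage, assuming the archive was extracted as described above:
# glove = load_glove("data/glove.6B/glove.6B.100d.txt")
```

Holding the vectors in a plain dictionary keeps lookups simple, at the cost of loading the whole file into memory.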

In [1]:
from src.preprocess import get_data, vectorize_essays, pad_embedding, word_count
%load_ext autoreload
%autoreload 2

In [2]:
# Import training data from tsv file and store in a pandas dataframe
data_path = './data/training_set_rel3.tsv'
essay_df = get_data(data_path)

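We can inspect the contents of the dataframe to understand what information the training set provides.

TODO: add more exploratory data analysis (EDA) if time permits.

First, we drop the columns that are not used by the neural network: the per-trait rater scores, the third rater's score, and all domain 2 scores.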

In [3]:
drop_labels = ['rater1_trait1', 'rater1_trait2', 'rater1_trait3', 'rater1_trait4', 'rater1_trait5', 'rater1_trait6',
               'rater2_trait1', 'rater2_trait2', 'rater2_trait3', 'rater2_trait4', 'rater2_trait5', 'rater2_trait6',
               'rater3_trait1', 'rater3_trait2', 'rater3_trait3', 'rater3_trait4', 'rater3_trait5', 'rater3_trait6',
               'rater3_domain1', 'rater1_domain2', 'rater2_domain2', 'domain2_score']
essay_df.drop(columns=drop_labels, inplace=True)
essay_df.head()

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,domain1_score
0,1,1,"Dear local newspaper, I think effects computer...",4,4,8
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5,4,9
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",4,3,7
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",5,5,10
4,5,1,"Dear @LOCATION1, I know having computers has a...",4,4,8


The dataset contains essays from 8 distinct "essay sets". To reduce training time and memory usage, we examine only a subset of these essay sets and discard the rest. For a more detailed explanation of why certain essay sets were kept and others were not, please see the research report.

In [4]:
sets = [1,3,4,5,6]
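# Copy the filtered rows so that adding columns later does not trigger SettingWithCopyWarning
essay_df = essay_df[essay_df['essay_set'].isin(sets)].copy()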

"Examining essays in sets {}".format(essay_df.essay_set.unique())

'Examining essays in sets [1 3 4 5 6]'

In [5]:
essays = essay_df['essay']  # A pandas Series of essays as strings

# Add a column to dataframe with essays as word embeddings
essay_df['essays_embed'] = vectorize_essays(essays, embed_size=100, verbose=False)

Vectorizing essays...


8884 Total Essays Vectorized!
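
The implementation of `vectorize_essays` lives in `src.preprocess` and is not shown in this notebook. The sketch below illustrates one plausible way to turn a single essay into an embedding matrix. It rests on several assumptions: lower-casing, whitespace tokenization, and a zero vector for out-of-vocabulary tokens. The names `vectorize_essay` and `toy` are hypothetical.

```python
import numpy as np

def vectorize_essay(essay, embeddings, embed_size=100):
    """Map an essay string to an (n_tokens, embed_size) matrix (illustrative helper)."""
    tokens = essay.lower().split()  # naive whitespace tokenizer (assumption)
    unknown = np.zeros(embed_size, dtype=np.float32)  # used for out-of-vocabulary tokens
    rows = [embeddings.get(token, unknown) for token in tokens]
    if not rows:
        return np.zeros((0, embed_size), dtype=np.float32)
    return np.vstack(rows)

# Toy 4-dimensional embeddings standing in for the real GloVe vectors
toy = {
    "dear": np.ones(4, dtype=np.float32),
    "reader": np.full(4, 2.0, dtype=np.float32),
}
matrix = vectorize_essay("Dear reader hello", toy, embed_size=4)
print(matrix.shape)  # (3, 4)
```

Each row of the result corresponds to one token, which is why essays of different lengths produce matrices with different numbers of rows.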


In [6]:
essay_df['word_count'] = essay_df.apply(word_count, axis=1)
# TODO: add a matplotlib visualization of the word-count distributions
print("*** Word Count Statistics ***\n")
wc_stats = {}
for essay_set in sets:  # named essay_set to avoid shadowing the built-in `set`
    set_counts = essay_df.loc[essay_df['essay_set'] == essay_set, 'word_count']
    print('*'*40)
    print("Word count statistics for set {}: ".format(essay_set))
    print(set_counts.describe())
    print('\n')
    mean, std = set_counts.agg(['mean', 'std'])

    # Record the lower and upper word-count cut-offs (mean -/+ 1 std) for each set
    wc_stats[essay_set] = (int(round(mean - std)), int(round(mean + std)))

*** Word Count Statistics ***

****************************************
Word count statistics for set 1: 
count    1783.000000
mean      319.823331
std       106.196183
min        10.000000
25%       249.000000
50%       318.000000
75%       385.000000
max       732.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 3: 
count    1726.000000
mean       91.243917
std        45.958534
min         6.000000
25%        55.000000
50%        84.000000
75%       123.000000
max       302.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 4: 
count    1770.000000
mean       75.905085
std        43.277093
min         2.000000
25%        42.000000
50%        69.000000
75%       103.000000
max       313.000000
Name: word_count, dtype: float64


****************************************
Word count statistics for set 5: 
count    1805.000000
mean      103.087535
std        50.1541

We would also like to remove word-count outliers. Each essay set is examined individually: within each set, essays whose word count falls more than one standard deviation from that set's mean are removed.
TODO: add a second plot of the word-count statistics after filtering to show the improvement.
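
As a worked example of the cut-off arithmetic, the sketch below recomputes the bounds for set 1 from the mean and standard deviation printed earlier. The helper names `word_count_bounds` and `is_kept` are hypothetical. The strict inequalities mirror the filter in the next cell, so an essay whose word count equals a cut-off is dropped.

```python
def word_count_bounds(mean, std):
    """Return the (lower, upper) cut-offs one standard deviation around the mean."""
    return int(round(mean - std)), int(round(mean + std))

def is_kept(word_count, bounds):
    """Keep an essay only if its word count lies strictly between the cut-offs."""
    lower, upper = bounds
    return lower < word_count < upper

# Mean and standard deviation of set 1, taken from the statistics printed above
bounds = word_count_bounds(319.823331, 106.196183)
print(bounds)                # (214, 426)
print(is_kept(215, bounds))  # True
print(is_kept(214, bounds))  # False
```

These bounds agree with the filtered statistics for set 1, whose minimum and maximum word counts are 215 and 425.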

In [7]:
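for essay_set in sets:  # named essay_set to avoid shadowing the built-in `set`
    wc_min, wc_max = wc_stats[essay_set]
    # Leave other sets untouched; within this set keep only wc_min < word_count < wc_max
    in_range = (essay_df['word_count'] > wc_min) & (essay_df['word_count'] < wc_max)
    essay_df = essay_df[(essay_df['essay_set'] != essay_set) | in_range]
    print(essay_df[essay_df['essay_set'] == essay_set]['word_count'].describe())
    print('*'*40)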

count    1234.000000
mean      318.367099
std        55.685500
min       215.000000
25%       275.000000
50%       318.000000
75%       363.000000
max       425.000000
Name: word_count, dtype: float64
****************************************
count    1130.000000
mean       86.095575
std        25.733142
min        46.000000
25%        64.000000
50%        83.000000
75%       107.750000
max       136.000000
Name: word_count, dtype: float64
****************************************
count    1202.000000
mean       71.164725
std        24.172592
min        34.000000
25%        50.000000
50%        69.500000
75%        91.000000
max       118.000000
Name: word_count, dtype: float64
****************************************
count    1208.000000
mean      101.022351
std        27.013765
min        54.000000
25%        79.000000
50%       101.000000
75%       123.000000
max       152.000000
Name: word_count, dtype: float64
****************************************
count    1229.000000
mean      1

Next, we pad each essay to the maximum length allowed for its essay set, so that all essays within a set have word-embedding matrices of the same shape.

In [8]:
essay_df['essays_embed_pad'] = essay_df.apply(lambda x: pad_embedding(x['essays_embed'], x['essay_set'], wc_stats, right_pad=True), axis=1)
#essay_df['essays_embed'] = pad_embedding(essays_embed,\
#                                         max_length=max_length, right_pad=True)

In [9]:
for essay_set in sets:  # named essay_set to avoid shadowing the built-in `set`
    padded = essay_df.loc[essay_df['essay_set'] == essay_set, 'essays_embed_pad']
    shapes = [matrix.shape for matrix in padded]
    # Report only if every padded essay in the set has the same shape
    if all(shape == shapes[0] for shape in shapes):
        print('All {} padded essays in set {} have a shape of {}'.format(len(shapes), essay_set, shapes[0]))

All 1234 padded essays in set 1 have a shape of (426, 100)


All 1130 padded essays in set 3 have a shape of (137, 100)


All 1202 padded essays in set 4 have a shape of (119, 100)


All 1208 padded essays in set 5 have a shape of (153, 100)


All 1229 padded essays in set 6 have a shape of (175, 100)
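
The shapes above suggest that each essay is padded with extra rows up to the upper word-count cut-off of its set (426 rows for set 1). The body of `pad_embedding` is not shown in this notebook, so the following is only a sketch of the idea, assuming zero-valued padding rows and truncation of over-long inputs. The name `pad_matrix` is hypothetical.

```python
import numpy as np

def pad_matrix(matrix, target_len, right_pad=True):
    """Zero-pad (or truncate) an (n_tokens, embed_size) matrix to target_len rows."""
    matrix = matrix[:target_len]  # truncate if the essay is longer than the target
    n_missing = target_len - matrix.shape[0]
    padding = np.zeros((n_missing, matrix.shape[1]), dtype=matrix.dtype)
    # right_pad=True appends the zero rows after the essay, otherwise they go first
    parts = (matrix, padding) if right_pad else (padding, matrix)
    return np.vstack(parts)

essay = np.ones((3, 4), dtype=np.float32)  # toy essay: 3 tokens, 4-dimensional embeddings
padded = pad_matrix(essay, target_len=5)
print(padded.shape)  # (5, 4)
```

Padding every essay in a set to one fixed length lets the matrices be stacked into a single array for batched training.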


In [10]:
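# Add the maximum domain1 score for each essay's set.
# score_ranges[i] holds the (min, max) score for essay set i; indices 0 and 2 are unused placeholders.
score_ranges = [(0,0), (2,12), (0,0), (0,3), (0,4), (0,4), (0,4)]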

for idx in essay_df.index:
    essay_df.loc[idx,'max_score']=score_ranges[essay_df.loc[idx,'essay_set']][1]

In [11]:
# Rescale every score so that each set's maximum maps to 12, then truncate to an integer
maximum_score = 12
score_columns = {'rater1_domain1': 'norm_score1',
                 'rater2_domain1': 'norm_score2',
                 'domain1_score': 'norm_score'}
for raw_col, norm_col in score_columns.items():
    essay_df[norm_col] = (essay_df[raw_col] * maximum_score / essay_df['max_score']).astype(int)
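
The rescaling in the cell above amounts to multiplying each score by 12, dividing by the maximum score of its set, and truncating to an integer. A small worked example, using a hypothetical helper named `normalize_score`:

```python
def normalize_score(score, max_score, target_max=12):
    """Rescale a score so that max_score maps to target_max, truncating to an integer."""
    return int(score * target_max / max_score)

print(normalize_score(3, 4))   # 9  (a 3 on a 0-4 scale)
print(normalize_score(2, 3))   # 8  (a 2 on a 0-3 scale)
print(normalize_score(8, 12))  # 8  (set 1 scores are unchanged)
```

Note that only the maximum is rescaled, so the minimum of each set's range is not shifted.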

In [128]:
essay_df.head(15)

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,domain1_score,essays_embed,word_count,max_score,norm_score1,norm_score2,norm_score
0,1,1,"Dear local newspaper, I think effects computer...",4,4,8,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",299,12.0,4,4,8
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5,4,9,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",349,12.0,5,4,9
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",4,3,7,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",236,12.0,4,3,7
4,5,1,"Dear @LOCATION1, I know having computers has a...",4,4,8,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",387,12.0,4,4,8
8,9,1,"Dear reader, @ORGANIZATION1 has had a dramatic...",4,5,9,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",395,12.0,4,5,9
9,10,1,In the @LOCATION1 we have the technology of a ...,5,4,9,"[[0.027166, -0.1762, -0.19623, 0.33527, 0.0623...",419,12.0,5,4,9
10,11,1,"Dear @LOCATION1, @CAPS1 people acknowledge the...",4,4,8,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",284,12.0,4,4,8
11,12,1,Dear @CAPS1 @CAPS2 I feel that computers do ta...,4,4,8,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",321,12.0,4,4,8
13,14,1,My three detaileds for this news paper article...,3,3,6,"[[0.080273, -0.10861, 0.72067, -0.45136, -0.74...",261,12.0,3,3,6
16,17,1,"Dear Local Newspaper, I belive that computers ...",4,4,8,"[[-0.45701, 0.23121, 0.87486, -0.7933, -0.1945...",298,12.0,4,4,8


In [12]:
# Store the preprocessed data frame for later use
essay_df.to_pickle('./data/essay_df.pkl')
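
A later notebook can restore the dataframe with `pd.read_pickle`. Unlike a CSV export, pickling preserves Python objects stored in a column, such as the embedding matrices. The round trip below uses a small stand-in dataframe and a temporary path, both hypothetical. Pickle files should only be loaded from trusted sources.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for essay_df (hypothetical values)
df = pd.DataFrame({"essay_set": [1, 3], "norm_score": [8, 4]})

path = os.path.join(tempfile.mkdtemp(), "essay_df.pkl")
df.to_pickle(path)

restored = pd.read_pickle(path)
print(restored.equals(df))  # True
```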