# Vocabulary and embedding

In [1]:
import pandas as pd
import numpy as np

In [2]:
from ast import literal_eval

Now that the text is pre-processed, we can create the vocabulary and the embedding.

There are many embeddings to choose from, for example doc2vec, GloVe, or fastText. For the first iteration of our project, we are going to go with Facebook's fastText.

In [3]:
# load data
df = pd.read_csv('../data/train/comments_processed.csv', index_col=0)
df_test = pd.read_csv('../data/test/comments_processed.csv', index_col=0)

In [4]:
df.comment = df.comment.apply(literal_eval)
df_test.comment = df_test.comment.apply(literal_eval)

In [86]:
df.head(n=5)

Unnamed: 0,comment,sentiment
0,"[movi, get, respect, sure, lot, memor, quot, l...",1
1,"[bizarr, horror, movi, fill, famou, face, stol...",1
2,"[solid, unremark, film, matthau, einstein, won...",1
3,"[strang, feel, sit, alon, theater, occupi, par...",1
4,"[probabl, alreadi, know, addit, episod, never,...",1


We will create the vocabulary by assigning every word a number.
However, we will first sort the words by their "priority", which is simply the number of times each word occurs.
This is done only so that we can later limit the size of the vocabulary (e.g. to the 15000 most frequently used words). We will see later whether that makes sense.

Another option is to use an existing vocabulary together with pre-trained embeddings.
This has the advantage of covering most English words. A slight disadvantage is that such an embedding is not domain-specific, and people may phrase movie reviews a little differently than other texts.

We can compare multiple approaches later.

In [5]:
# create vocabulary: count how often each word occurs in the training comments
words = {}
for idx, comment in df.comment.items():  # Series.iteritems() was removed in pandas 2.0
    for word in comment:
        if word in words:
            words[word] += 1
        else:
            words[word] = 1

Order by occurrences

In [6]:
words = [(k, v) for k, v in words.items()]

In [7]:
words = sorted(words, key=lambda x: x[1], reverse=True)
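
The counting and sorting steps above could also be written with `collections.Counter` from the standard library. The following is only a minimal sketch on a toy list of comments; `toy_comments` and `toy_words` are made-up names, not part of this notebook's data.

```python
from collections import Counter

# Toy stand-in for df.comment: each comment is a list of stemmed tokens.
toy_comments = [
    ["movi", "good", "stori"],
    ["movi", "bad"],
    ["film", "good", "movi"],
]

# Count every token across all comments.
counts = Counter(token for comment in toy_comments for token in comment)

# most_common() returns (word, count) pairs, most frequent first.
toy_words = counts.most_common()
print(toy_words[:2])
```

On the real data, `Counter(...).most_common()` should yield the same kind of list of `(word, count)` tuples as the two cells above, although the order of words with equal counts may differ.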

In [8]:
words[:30]

[('movi', 50595),
 ('film', 47239),
 ('one', 26912),
 ('like', 22185),
 ('time', 15542),
 ('good', 14924),
 ('make', 14546),
 ('get', 14065),
 ('charact', 13993),
 ('see', 13896),
 ('watch', 13796),
 ('would', 13188),
 ('stori', 12895),
 ('even', 12764),
 ('realli', 11694),
 ('scene', 10429),
 ('well', 9951),
 ('show', 9750),
 ('look', 9730),
 ('much', 9667),
 ('end', 9445),
 ('peopl', 9314),
 ('could', 9216),
 ('bad', 9108),
 ('also', 9097),
 ('great', 9064),
 ('first', 8888),
 ('think', 8847),
 ('love', 8794),
 ('way', 8673)]

In [9]:
words[-30:]

[('firsthalf', 1),
 ('drosselmei', 1),
 ('closedforweekend', 1),
 ('highsecur', 1),
 ('warveteran', 1),
 ('milla', 1),
 ('antithril', 1),
 ('descriptionwoman', 1),
 ('securityi', 1),
 ('realizationi', 1),
 ('quarterfin', 1),
 ('rakishli', 1),
 ('untempt', 1),
 ('crumley', 1),
 ('buic', 1),
 ('antiplot', 1),
 ('fictiondrama', 1),
 ('thereinaft', 1),
 ('lockup', 1),
 ('dresssuit', 1),
 ('overvot', 1),
 ('ontyp', 1),
 ('infantalis', 1),
 ('rou', 1),
 ('orientalist', 1),
 ('tooand', 1),
 ('repleat', 1),
 ('jowl', 1),
 ('camora', 1),
 ('capich', 1)]

In [8]:
len(words)

79709

As probably expected, the word "movi" (the stem of "movie") occurs more often than there are documents: 50595 occurrences across 25000 comments. The same goes for the word "film". I think these two words might be domain stopwords: used frequently and therefore adding little value to the comments.

Interestingly, the words "good" and "like" also rank pretty high. However, they might have been used together with "not", so on their own (without context) they are not that helpful. This may be an example of why "not" should not be among the English stopwords for this problem.

Another thing is that a lot of words occur only once across all the comments.
Our vocabulary, on the other hand, is quite large: almost 80 000 words.
This raises the question of whether it makes sense to keep every word in the vocabulary, or whether we want to use only words that occur, for example, at least 2-3 times (or more).

Further analysis, such as computing tf-idf, might help us understand which words are most meaningful for our model, but for now we will just skip words that occur only a few times.
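
To get a feel for a minimum-occurrence threshold, one could count how many distinct words survive each cut-off. This is a small sketch on made-up counts; `toy_counts` and `vocab_size_at_min_count` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Hypothetical word counts standing in for the real (word, count) pairs.
toy_counts = Counter(
    {"movi": 5, "film": 4, "good": 3, "overdu": 2, "capich": 1, "jowl": 1}
)


def vocab_size_at_min_count(counts, min_count):
    """Return how many distinct words occur at least `min_count` times."""
    return sum(1 for c in counts.values() if c >= min_count)


for threshold in (1, 2, 3):
    print(threshold, vocab_size_at_min_count(toy_counts, threshold))
```

With the list of tuples used in this notebook, the same idea would be `sum(1 for _, c in words if c >= min_count)`.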

In [9]:
words[12500]

('overdu', 12)

In [120]:
words[15001]

('pinter', 9)

In [117]:
words[25000]

('preordain', 3)

In [116]:
words[35000]

('apparentlyonli', 2)

Here we can also see that our vocabulary contains typos and run-together words (e.g. "apparentlyonli").
Beyond the 15000 most frequent words, no word occurs more than 9 times (in the whole set of 25000 reviews). We can therefore try limiting our vocabulary to the 15000 most frequent words (at the risk of decreasing accuracy).

We can also play with this parameter later; for now it simplifies things a little and speeds up training as well.

In [11]:
words = words[:15000]

In [12]:
vocab = {}
# assign each word an id by frequency rank, starting at 1 (0 is kept free for padding)
for idx, word in enumerate(words):
    vocab[word[0]] = idx + 1

In [15]:
[(k, v) for k, v in vocab.items()][:20]  # not the most efficient way :)

[('movi', 1),
 ('film', 2),
 ('one', 3),
 ('like', 4),
 ('time', 5),
 ('good', 6),
 ('make', 7),
 ('get', 8),
 ('charact', 9),
 ('see', 10),
 ('watch', 11),
 ('would', 12),
 ('stori', 13),
 ('even', 14),
 ('realli', 15),
 ('scene', 16),
 ('well', 17),
 ('show', 18),
 ('look', 19),
 ('much', 20)]

Now that we have our vocabulary constructed, we can replace words with their ids.

In [12]:
df['comment_ids'] = df.comment.apply(lambda comm: list(map(lambda x: vocab.get(x, None), comm)))

Since we built the vocabulary ourselves, and from the training set only, the test set may contain words that are new (not present in the training set). We will remove those words from the comments, just like the training words that fell outside the 15000 most frequent.

In [13]:
df_test['comment_ids'] = df_test.comment.apply(lambda comm: list(map(lambda x: vocab.get(x, None), comm)))

In [14]:
df['comment_ids'] = df.comment_ids.apply(lambda comm: list(filter(lambda x: x is not None, comm)))
df_test['comment_ids'] = df_test.comment_ids.apply(lambda comm: list(filter(lambda x: x is not None, comm)))
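
The lookup and the filtering can also be combined into a single step. The sketch below assumes a small made-up vocabulary; `encode` and `toy_vocab` are illustrative names and are not defined elsewhere in this notebook.

```python
def encode(comment, vocab):
    """Map tokens to ids, dropping tokens that are not in the vocabulary."""
    return [vocab[token] for token in comment if token in vocab]


toy_vocab = {"movi": 1, "film": 2, "good": 3}

# "unseenword" is not in the vocabulary, so it is silently dropped.
print(encode(["good", "unseenword", "movi"], toy_vocab))
```

Applied to the data frames, this would look like `df.comment.apply(lambda c: encode(c, vocab))`.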

In [15]:
df_test.head(n=3)

Unnamed: 0,comment,sentiment,comment_ids
0,"[base, actual, stori, john, boorman, show, str...",1,"[332, 63, 13, 221, 9212, 18, 764, 190, 786, 54..."
1,"[gem, film, four, product, anticip, qualiti, i...",1,"[1145, 2, 619, 218, 2348, 367, 750, 518, 150, ..."
2,"[realli, like, show, drama, romanc, comedi, ro...",1,"[15, 4, 18, 373, 717, 106, 847, 3, 587, 344, 2..."


In [130]:
df.head(n=10)

Unnamed: 0,comment,sentiment,comment_ids
0,"[movi, get, respect, sure, lot, memor, quot, l...",1,"[1, 8, 615, 140, 67, 751, 1564, 716, 1145, 354..."
1,"[bizarr, horror, movi, fill, famou, face, stol...",1,"[966, 109, 1, 624, 701, 228, 2183, 6760, 1478,..."
2,"[solid, unremark, film, matthau, einstein, won...",1,"[998, 7012, 2, 2525, 4637, 102, 379, 61, 33, 1..."
3,"[strang, feel, sit, alon, theater, occupi, par...",1,"[473, 60, 424, 502, 503, 3788, 597, 13585, 137..."
4,"[probabl, alreadi, know, addit, episod, never,...",1,"[156, 385, 35, 1006, 176, 48, 673, 229, 116, 1..."
5,"[saw, movi, two, grown, children, although, cl...",1,"[131, 1, 40, 1976, 357, 185, 922, 6135, 98, 16..."
6,"[use, imdb, given, hefti, vote, favourit, film...",1,"[62, 806, 285, 10303, 1281, 1282, 2, 64, 72, 2..."
7,"[good, film, power, messag, love, redempt, lov...",1,"[6, 2, 275, 556, 29, 2540, 29, 1431, 323, 1030..."
8,"[made, quartet, trio, continu, qualiti, earlie...",1,"[34, 8684, 3138, 454, 367, 799, 2, 193, 259, 1..."
9,"[matur, man, admit, shed, tear, film, matur, r...",1,"[1800, 56, 725, 2480, 1073, 2, 1800, 952, 1800..."


This is the part where we would create and train our fastText (or another) embedding, or just use a pre-trained one (and then compare the results).
Since this is only the first iteration of our project, we stop here and let our neural network create and train the embedding itself, even if it does not capture relations between words.
We can then compare how much more advanced embeddings help us achieve better results.

In [57]:
# embedding training will come here...
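
If we later switch to pre-trained vectors, one possible way to plug them in is to build an embedding matrix whose row index equals the word id. The following is only a sketch under assumptions: `vectors` is a hypothetical plain dict from word to NumPy array (how it is loaded, for example from a fastText file, is not shown), and `build_embedding_matrix` is an illustrative name, not part of any library. Note also that our tokens are stemmed, so many of them may not appear in a vocabulary trained on unstemmed text.

```python
import numpy as np


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """Build a (len(vocab) + 1, dim) matrix; row 0 stays zero because ids start at 1."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab) + 1, dim), dtype=np.float32)
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
        else:
            # No pre-trained vector available: fall back to a small random init.
            matrix[idx] = rng.normal(scale=0.1, size=dim)
    return matrix


# Made-up vocabulary and vectors, for illustration only.
toy_vocab = {"movi": 1, "film": 2}
toy_vectors = {"film": np.array([0.5, -0.5, 1.0], dtype=np.float32)}

emb = build_embedding_matrix(toy_vocab, toy_vectors, dim=3)
print(emb.shape)
```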

We also still need to make sure that our documents all have the same length. Let's do a quick analysis of lengths first:

In [16]:
# compute comment lengths
df['words_n'] = df.comment_ids.apply(len)

In [17]:
df.words_n.describe()

count    25000.000000
mean       112.184480
std         84.741327
min          4.000000
25%         60.000000
50%         83.000000
75%        137.000000
max       1320.000000
Name: words_n, dtype: float64

In [133]:
df.words_n.quantile(0.95)

291.0

In [134]:
df.words_n.quantile(0.90)

223.0

In [18]:
df.words_n.quantile(0.85)

183.0

Only a small part of our data set (slightly more than 10%) contains more than 200 words.
Since we think that 100 words could be enough to detect sentiment, we will set that as the maximum comment length and cut off any words beyond it. However, this is a parameter to play with later.

In [19]:
# set 100 as maximum comment length
df['x'] = df.comment_ids.apply(lambda x: x[:100])
df_test['x'] = df_test.comment_ids.apply(lambda x: x[:100])

In [20]:
# pad shorter comments with 0
df['x'] = df.x.apply(lambda x: np.pad(x, (0, 100 - len(x)), mode='constant'))
df_test['x'] = df_test.x.apply(lambda x: np.pad(x, (0, 100 - len(x)), mode='constant'))
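
Truncation and padding can also be wrapped in one helper, which makes the maximum length easier to change later. This is a minimal sketch; `pad_or_truncate` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np


def pad_or_truncate(ids, max_len=100, pad_value=0):
    """Return exactly `max_len` ids: cut long inputs, right-pad short ones."""
    ids = list(ids)[:max_len]
    out = np.full(max_len, pad_value, dtype=np.int64)
    out[:len(ids)] = ids
    return out


print(pad_or_truncate([5, 9, 2], max_len=5))
```

Because the output array is created with an explicit integer dtype, the result has the same dtype regardless of the input's length.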

And we are ready to go for now.

In [21]:
# length of vocabulary for further processing
len(vocab)

15000

In [22]:
# persist - to pickle now
df.to_pickle('../data/train/comments_embed.pkl')
df_test.to_pickle('../data/test/comments_embed.pkl')

In [23]:
del df
del df_test