In [7]:
import pandas as pd
import torch
import fastai

# The data
We will start working with a sample from the IMDB sentiment classification dataset. The goal for now is to understand the dataset and prepare it for working with neural nets.


In [8]:
data_path = fastai.untar_data(fastai.URLs.IMDB_SAMPLE)
data_path.ls()


[PosixPath('/Users/dani/.fastai/data/imdb_sample/models'),
 PosixPath('/Users/dani/.fastai/data/imdb_sample/texts.csv')]

Let's load and see the data:

In [9]:
dataset = pd.read_csv(data_path/'texts.csv')
dataset.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


Each row corresponds to one example: the sentiment `label`, the input `text`, and `is_valid`, a flag added by the fastai authors to split the dataset into training and validation sets.
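
As a rough illustration of how the `is_valid` column can be used, here is a minimal sketch that splits a small made-up DataFrame with the same three columns. The rows and the names `toy`, `train_df` and `valid_df` are assumptions for the example, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for texts.csv with the same three columns (values are made up).
toy = pd.DataFrame({
    "label": ["negative", "positive", "negative", "positive"],
    "text": ["bad movie", "great film", "boring plot", "loved it"],
    "is_valid": [False, False, True, False],
})

# Rows flagged is_valid go to the validation set, the rest to training.
train_df = toy[~toy["is_valid"]]
valid_df = toy[toy["is_valid"]]

print(len(train_df), len(valid_df))
```

The same boolean indexing should work on `dataset`, since its `is_valid` column holds `True`/`False` values.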

# Preparing the data
Before feeding a neural net with text, we need to turn it into sequences of numbers.
This can be done in many ways: one integer per word, per character, per subword, etc.
Let's keep it simple for now and turn words into integer ids.
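
To make the difference between granularities concrete, here is a small sketch that tokenizes the same short string at word level and at character level. The sample string and the variable names are assumptions chosen for illustration.

```python
text = "Hard to believe"

# Word-level: one token per space-separated word.
word_tokens = text.split(' ')

# Character-level: one token per character (spaces included).
char_tokens = list(text)

# Each granularity gets its own id table; sorted() makes the ids reproducible.
word_ids = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}
char_ids = {tok: i for i, tok in enumerate(sorted(set(char_tokens)))}

print(word_tokens)
print([word_ids[t] for t in word_tokens])
print(len(char_tokens), len(char_ids))
```

Character-level sequences are longer but need a much smaller id table. Word-level sequences are shorter but the table grows with every new word.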

In [17]:
# Let's get the first example
raw_text = dataset['text'][0]
raw_text

"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!"

In [18]:
# The simplest tokenization we can do is splitting by white-space
tokens = raw_text.split(' ')
tokens

['Un-bleeping-believable!',
 'Meg',
 'Ryan',
 "doesn't",
 'even',
 'look',
 'her',
 'usual',
 'pert',
 'lovable',
 'self',
 'in',
 'this,',
 'which',
 'normally',
 'makes',
 'me',
 'forgive',
 'her',
 'shallow',
 'ticky',
 'acting',
 'schtick.',
 'Hard',
 'to',
 'believe',
 'she',
 'was',
 'the',
 'producer',
 'on',
 'this',
 'dog.',
 'Plus',
 'Kevin',
 'Kline:',
 'what',
 'kind',
 'of',
 'suicide',
 'trip',
 'has',
 'his',
 'career',
 'been',
 'on?',
 'Whoosh...',
 'Banzai!!!',
 'Finally',
 'this',
 'was',
 'directed',
 'by',
 'the',
 'guy',
 'who',
 'did',
 'Big',
 'Chill?',
 'Must',
 'be',
 'a',
 'replay',
 'of',
 'Jonestown',
 '-',
 'hollywood',
 'style.',
 'Wooofff!']

OK, we have a list of words for the text. To turn it into integer ids, we need to build a mapping from word to id and vice versa (id to word). This is called a vocabulary, and it is an essential component of any deep learning NLP pipeline. Let's build our first vocabulary (just for this text sample):

In [23]:
unique_tokens = set(tokens) # remove repeated tokens 
unique_tokens

{'-',
 'Banzai!!!',
 'Big',
 'Chill?',
 'Finally',
 'Hard',
 'Jonestown',
 'Kevin',
 'Kline:',
 'Meg',
 'Must',
 'Plus',
 'Ryan',
 'Un-bleeping-believable!',
 'Whoosh...',
 'Wooofff!',
 'a',
 'acting',
 'be',
 'been',
 'believe',
 'by',
 'career',
 'did',
 'directed',
 "doesn't",
 'dog.',
 'even',
 'forgive',
 'guy',
 'has',
 'her',
 'his',
 'hollywood',
 'in',
 'kind',
 'look',
 'lovable',
 'makes',
 'me',
 'normally',
 'of',
 'on',
 'on?',
 'pert',
 'producer',
 'replay',
 'schtick.',
 'self',
 'shallow',
 'she',
 'style.',
 'suicide',
 'the',
 'this',
 'this,',
 'ticky',
 'to',
 'trip',
 'usual',
 'was',
 'what',
 'which',
 'who'}

In [34]:
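# Map each word to an id. We enumerate `tokens` (not `unique_tokens`),
# so a repeated word keeps the position of its last occurrence.
word_to_id = {word: i for i, word in enumerate(tokens)}
this_id = word_to_id['this']
this_id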

49

In [36]:
id_to_word = {i: word for i, word in enumerate(tokens)} # turns ids into words
id_to_word[this_id]

'this'

Now we can use our 'vocab' to turn our text into a sequence of numbers:

In [41]:
numericalized_tokens = [word_to_id[w] for w in tokens]
numericalized_tokens

[0,
 1,
 2,
 3,
 4,
 5,
 18,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 50,
 53,
 29,
 30,
 49,
 32,
 33,
 34,
 35,
 36,
 37,
 63,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68]
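
Note that the ids above are not contiguous: the dictionaries enumerate `tokens`, so a repeated word such as 'her' gets the position of its last occurrence (18) and the earlier position (6) is never used. A minimal sketch of one way to get contiguous ids, assuming we are happy to number words in order of first appearance (the sample sentence is made up):

```python
tokens = "her usual self in this , her acting on this dog".split(' ')

# dict.fromkeys drops repeats while keeping first-appearance order
# (dicts preserve insertion order in Python 3.7+).
unique_in_order = list(dict.fromkeys(tokens))

word_to_id = {word: i for i, word in enumerate(unique_in_order)}
id_to_word = {i: word for word, i in word_to_id.items()}

ids = [word_to_id[w] for w in tokens]
print(ids)
print([id_to_word[i] for i in ids] == tokens)
```

With contiguous ids, the largest id is always the vocabulary size minus one, which is what an embedding table of that size expects.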

In [42]:
# Now let's build a vocab for the whole dataset
all_tokens = []

In [46]:
for text in dataset['text']:
    all_tokens.extend(text.split(' '))
all_tokens[100:110]

['relatively',
 'cheery.',
 'There',
 'are',
 'no',
 'really',
 'superstars',
 'in',
 'the',
 'cast,']

In [47]:
len(all_tokens)

743391

In [49]:
unique_all_tokens = set(all_tokens)
# Get the size of our vocab
len(unique_all_tokens)

36462

In [53]:
# Build the vocab
word_to_id = {word: i for i, word in enumerate(unique_all_tokens)}
id_to_word = {i: word for i, word in enumerate(unique_all_tokens)}
john_id = word_to_id['John']

In [59]:
id_to_word[john_id+1] # next word in the vocab

'lingering'

Now we have a really simple vocab for numericalizing our training/validation data.

# Turning words into vectors
We now have an id for every word. Almost every neural net for NLP uses these integer ids in its first layer to look up a vector for each word (or character, or subword). This is known as the embedding layer. The embedding layer is basically a lookup table of size V × d, where V is the size of the vocab and d is the dimension of the embedding vector. Let's see how this works:


In [61]:
import torch
from torch.nn.modules import Embedding
vocab_size = len(word_to_id)
emb_dim = 50 
embedding_layer = Embedding(vocab_size, emb_dim)
embedding_layer


Embedding(36462, 50)

In [70]:
# Let's get the vector for our first word
v_0 = embedding_layer(torch.tensor(0)) # The layer expects the ids wrapped in a torch.Tensor
v_0

tensor([-0.3850, -0.5787,  0.7424,  0.1339, -0.1917,  0.3614, -0.7191,  0.4496,
        -0.6583, -0.1638,  1.1447, -0.1735,  2.1880, -1.7839,  0.8560, -0.2002,
        -1.4404,  1.6758,  0.2755, -0.6098, -2.2459,  1.6085,  1.6329, -0.4918,
         0.0097, -2.4239,  0.9176,  1.5877,  0.8225,  0.8477,  0.6123,  0.6060,
         1.1446,  0.4422,  1.3977,  1.4591, -1.2798,  1.2375, -0.9987, -1.5003,
        -1.3405, -2.1017, -0.4779, -0.6793,  0.1321, -0.4403, -0.5138,  0.4170,
        -0.6814, -0.4919], grad_fn=<EmbeddingBackward>)
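
To see why an embedding layer is "basically a lookup table", here is a dependency-free sketch in plain Python, not PyTorch itself. It builds a small V × d table of random numbers and shows that reading row `i` gives the same vector as multiplying a one-hot vector by the table. The sizes and function names are assumptions for illustration. In PyTorch, the corresponding table is stored in the layer's `weight` attribute.

```python
import random

random.seed(0)
vocab_size, emb_dim = 5, 3

# The "embedding layer": V rows, each a d-dimensional vector.
table = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in range(vocab_size)]

def lookup(token_id):
    # Embedding a token is just reading its row.
    return table[token_id]

def one_hot_matmul(token_id):
    # Equivalent view: one-hot vector (length V) times the V x d table.
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * table[i][j] for i in range(vocab_size))
            for j in range(emb_dim)]

print(lookup(2))
print(one_hot_matmul(2))
```

The lookup avoids building the one-hot vector at all, which matters when V is in the tens of thousands as it is here.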

In [76]:
# Now let's try with our first full example
tokens = dataset['text'][0].split(' ')
numericalized_example = [word_to_id[w] for w in tokens]
'text of length {} tokens'.format(len(numericalized_example)), numericalized_example, tokens

('text of length 69 tokens',
 [2208,
  30483,
  18175,
  24362,
  22588,
  11295,
  31186,
  15496,
  29227,
  4232,
  16973,
  28194,
  12297,
  6063,
  14246,
  2856,
  29999,
  23056,
  31186,
  15063,
  4932,
  17484,
  4822,
  21957,
  17282,
  32531,
  28989,
  8554,
  1886,
  15482,
  29042,
  26219,
  7010,
  26343,
  33113,
  19499,
  25225,
  19757,
  29308,
  9386,
  25803,
  18325,
  678,
  27816,
  18676,
  9198,
  30989,
  2892,
  2370,
  26219,
  8554,
  16123,
  10256,
  1886,
  32002,
  29390,
  21029,
  17212,
  4351,
  13757,
  8564,
  9114,
  30043,
  29308,
  13593,
  26959,
  219,
  20330,
  34417],
 ['Un-bleeping-believable!',
  'Meg',
  'Ryan',
  "doesn't",
  'even',
  'look',
  'her',
  'usual',
  'pert',
  'lovable',
  'self',
  'in',
  'this,',
  'which',
  'normally',
  'makes',
  'me',
  'forgive',
  'her',
  'shallow',
  'ticky',
  'acting',
  'schtick.',
  'Hard',
  'to',
  'believe',
  'she',
  'was',
  'the',
  'producer',
  'on',
  'this',
  'dog.',
  

In [77]:
v_example = embedding_layer(torch.tensor(numericalized_example))
v_example # A matrix of the vectors corresponding to each of the 69 tokens

tensor([[ 1.2027, -0.2946,  0.8443,  ...,  1.4954, -1.4349, -0.2545],
        [-0.8398,  0.0350, -0.9193,  ..., -0.7567, -0.6925, -0.3173],
        [-0.5739, -1.7067, -2.0933,  ...,  0.9647, -0.6415,  0.4665],
        ...,
        [-2.4109,  2.4399, -0.3522,  ..., -0.1244,  0.7360, -0.3778],
        [ 0.3791, -0.5356,  1.3328,  ...,  0.9829,  0.8671, -0.1291],
        [ 0.5168,  0.9240, -1.0961,  ..., -0.0199, -0.2333,  0.9682]],
       grad_fn=<EmbeddingBackward>)

In [78]:
v_example[0] # the embedding vector of the first token

tensor([ 1.2027, -0.2946,  0.8443, -0.6528,  0.1477, -1.2029,  0.4607, -1.2531,
        -0.1346, -0.6594,  0.7075,  0.1972,  1.1120, -0.0413, -2.2882,  0.4343,
         2.0869, -0.9942,  0.6648,  0.2084, -0.0663,  0.7633, -0.9001,  0.9449,
        -0.8934, -0.0338,  0.5865,  2.1943,  0.8381,  1.1049, -0.0109,  0.2139,
         0.1015, -1.6645, -0.4943, -0.5859, -1.4801, -2.3506, -1.6687, -0.3140,
         0.0878,  1.1838,  1.1238, -1.4592, -2.0059, -0.1387,  0.4990,  1.4954,
        -1.4349, -0.2545], grad_fn=<SelectBackward>)

Good job! Now we have turned text into 'dense' real-valued vectors!
Next, let's try to generalize this a little bit.

But first, let's try our vocab on text outside the IMDB sample dataset.


In [79]:
my_movie_review_text = 'Climax from Gaspar Noé is a shockingly beatiful movie'.split(' ')
numericalized_movie_review = [word_to_id[w] for w in my_movie_review_text]

KeyError: 'Climax'

What happened?

'Climax' is what is called an out-of-vocabulary word (often abbreviated OOV or unk). This is important to handle in supervised learning for NLP, because our model is expected to work with text not seen during training, validation or test. The simplest way to deal with it is to add a special token to our vocabulary and assign it to every unknown word.
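
The idea can be sketched with a plain dictionary before wrapping it in a class. This is a minimal illustration under assumptions: the `<unk>` symbol, reserving id 0 for it, and the toy training sentence are choices made for the example.

```python
UNK = '<unk>'  # assumed symbol for unknown words

train_tokens = 'this movie is a beautiful movie'.split(' ')

# Reserve id 0 for the unknown token, then number the known words.
word_to_id = {UNK: 0}
for w in train_tokens:
    word_to_id.setdefault(w, len(word_to_id))

def to_id(word):
    # dict.get falls back to the unknown id instead of raising KeyError.
    return word_to_id.get(word, word_to_id[UNK])

print([to_id(w) for w in 'Climax is a beautiful movie'.split(' ')])
```

Every unseen word collapses to the same id, so the model can still process the sentence, at the cost of losing whatever that word meant.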

Let's build this into a more general vocabulary class.



In [178]:
class Vocab:
    def __init__(self, unk_symbol='<unk>', is_label=False):
        # Ids start at 1, so id 0 is never assigned and len() is one more
        # than the number of stored words.
        self.size = 1
        self.word_to_id = {}
        self.id_to_word = {}
        # Label vocabs (is_label=True) get no unknown token; see Exercise 2
        if not is_label:
            self.unk_symbol = unk_symbol
            self.unk_id = self.add_word(unk_symbol)
    def add_word(self, w):
        if w not in self.word_to_id:
            self.word_to_id[w] = self.size
            self.id_to_word[self.size] = w
            self.size += 1
        return self.word_to_id[w]  # id of w, whether new or already present
    def to_id(self, w):
        return self.word_to_id[w] if w in self.word_to_id else self.unk_id
    def to_word(self, id):
        return self.id_to_word[id] if id in self.id_to_word else self.unk_symbol
    def __len__(self):
        return self.size
vocab = Vocab()
vocab.to_id('Climax')

1

In [155]:
len(vocab)

2

In [156]:
vocab.add_word('Climax')

2

In [157]:
# Now, let's try to build the vocab for the full dataset
full_vocab = Vocab()
for text in dataset['text']:
    for w in text.split(' '):
        full_vocab.add_word(w)
len(full_vocab) # Previous length (36462) + 1 for the unk token + 1 because ids start at 1 = 36464
    

36464

In [164]:
# Finally let's try on our previous unseen example
my_movie_review_text

['Climax',
 'from',
 'Gaspar',
 'Noé',
 'is',
 'a',
 'shockingly',
 'beatiful',
 'movie']

In [165]:
numericalized_movie_review = [full_vocab.to_id(w) for w in my_movie_review_text]
numericalized_movie_review

[1, 283, 1, 1, 67, 59, 25417, 1, 270]

In [166]:
# Map the ids back to words to check the round trip
[full_vocab.to_word(i) for i in numericalized_movie_review]

['<unk>', 'from', '<unk>', '<unk>', 'is', 'a', 'shockingly', '<unk>', 'movie']

In [167]:
# We got four unknown words, one of them a misspelling of beautiful
# Let's try with the right spelling
my_movie_review_text = 'Climax from Gaspar Noé is a shockingly beautiful movie'.split(' ')
numericalized_movie_review = [full_vocab.to_id(w) for w in my_movie_review_text]
numericalized_movie_review

[1, 283, 1, 1, 67, 59, 25417, 1535, 270]

In [168]:
full_vocab.to_word(1535)

'beautiful'

Now we have a working Vocab with a simple tokenization mechanism (splitting on spaces). But what if we wanted a more general tokenizer with features such as lowercasing, normalization, etc.?

Let's try to generalize this a little.

In [173]:
class Tokenizer:
    def __init__(self, lowercase=False):
        self.lowercase = lowercase
    def __call__(self, text):
        return [w.lower() if self.lowercase else w for w in text.split(' ')]
my_tokenizer = Tokenizer(lowercase=True)
        

In [174]:
my_tokenizer('Climax is a horrible movie with nice music')

['climax', 'is', 'a', 'horrible', 'movie', 'with', 'nice', 'music']
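
Splitting on spaces leaves punctuation glued to words, which is why 'this' and 'this,' ended up as separate tokens earlier. As one possible extension, here is a sketch of a regex-based tokenizer that separates punctuation. The pattern and the name `regex_tokenize` are assumptions, and real tokenizers handle many more cases.

```python
import re

# A word (optionally with an internal apostrophe, e.g. "doesn't")
# or a single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text, lowercase=False):
    if lowercase:
        text = text.lower()
    return TOKEN_RE.findall(text)

print(regex_tokenize("Hard to believe she was the producer on this dog."))
print(regex_tokenize("Meg Ryan doesn't even look", lowercase=True))
```

With this pattern 'dog.' becomes the two tokens 'dog' and '.', so the word shares one id regardless of the punctuation that follows it.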

## Exercise 1
Now please build a new vocab by tokenizing the full dataset with lowercased words:

In [176]:
# Build the vocab for the full dataset, this time with lowercased tokens
lowercased_vocab = Vocab()
for text in dataset['text']:
    pass  # your code here (replace this line)
len(lowercased_vocab) # We should get a smaller vocab

33877
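
To see why the lowercased vocabulary comes out smaller, here is a small sketch with plain sets and made-up sentences. It deliberately avoids `Vocab` and `Tokenizer`, so the exercise itself is left to you.

```python
texts = ["The movie was great", "the Movie was bad", "Great acting"]

cased = set()
lowercased = set()
for text in texts:
    for w in text.split(' '):
        cased.add(w)              # 'The' and 'the' count as different words
        lowercased.add(w.lower()) # both collapse to 'the'

print(len(cased), len(lowercased))
```

Words that differ only in capitalization are merged, so the lowercased set can never be larger than the cased one.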

## Exercise 2
We have been focusing on text, but what about labels? Labels are frequently text too, as in our case, where we have `positive` and `negative`. Neural nets don't understand text, so we need to turn labels into numbers as well. The good news is that we can reuse our previous vocab to do this!

Please create the vocab for labels. In this case we do not want the vocab to contain an unknown label, so we will use the `is_label` parameter.

In [183]:
labels_vocab = Vocab(is_label=True)
# Create the labels vocab
for label in dataset['label']:
    labels_vocab.add_word(label)
labels_vocab.word_to_id

{'negative': 1, 'positive': 2}
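
The label ids start at 1 because `Vocab` starts counting at 1. Some loss functions, such as PyTorch's `CrossEntropyLoss`, expect class indices that start at 0. A minimal sketch of a zero-based label mapping, assuming we sort the label names to fix the order (the list of labels is made up):

```python
labels = ['negative', 'positive', 'negative', 'positive', 'negative']

# Sort the distinct labels so the mapping is reproducible across runs.
label_to_id = {label: i for i, label in enumerate(sorted(set(labels)))}
id_to_label = {i: label for label, i in label_to_id.items()}

label_ids = [label_to_id[label] for label in labels]
print(label_to_id)
print(label_ids)
```

An unseen label raises a `KeyError` here, which is usually what we want: unlike words, an unknown label points to a problem in the data.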