## Movie review data preparation

This notebook prepares the movie review dataset for use with a PyTorch model. The dataset consists of sentences, which need to be converted to a numerical representation.  
The sentences are split into tokens using the [spaCy](https://spacy.io) tokenizer. A vocabulary is then built with [torchtext](https://pytorch.org/text/stable/index.html) based on an existing set of word vectors, [GloVe](https://nlp.stanford.edu/projects/glove/), with a 100-dimensional embedding space trained on a corpus of 6 billion tokens.
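
As a rough illustration of what "numerical representation" means here, the sketch below maps tokens to integer indices using a toy vocabulary. It is not the pipeline used in this notebook: plain whitespace splitting stands in for spaCy, the vocabulary is built by hand rather than by torchtext, and the helper name `numericalize` and the example sentences are hypothetical.

```python
# Toy sketch: sentences -> tokens -> integer indices.
# Assumptions: whitespace tokenization stands in for spaCy, and the
# vocabulary is built by hand rather than by torchtext.

sentences = ["contains no wit", "no labored gags"]

# the first two indices are reserved for the special tokens
itos = ["<unk>", "<pad>"]
for sentence in sentences:
    for token in sentence.split():
        if token not in itos:
            itos.append(token)
stoi = {token: index for index, token in enumerate(itos)}


def numericalize(sentence, stoi, unk_token="<unk>"):
    """Map each token to its index, falling back to the unknown token."""
    return [stoi.get(token, stoi[unk_token]) for token in sentence.split()]


print(numericalize("no wit here", stoi))  # -> [3, 4, 0]
```

The word `here` is not in the toy vocabulary, so it is mapped to the index of `<unk>`.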

In [1]:
import os
import pandas as pd
import numpy as np

import torch
# Field lives in torchtext.legacy, which is only available in torchtext 0.9 to 0.11
from torchtext.legacy.data import Field
from torchtext.data import get_tokenizer

In [2]:
# to install spaCy and its English tokenizer model, run these two commands
# (the GloVe vectors are downloaded automatically by torchtext when the vocabulary is built):
# !pip install spacy
# !python3 -m spacy download en_core_web_sm

In [3]:
# load the raw training data
data_path = os.path.join(os.path.expanduser('~'), 'surfdrive/Shared/datasets/stanford_sentiment_treebank_v2')
train_data = pd.read_csv(os.path.join(data_path, 'train.tsv'), delimiter='\t')

print(f'The training data contains {len(train_data)} samples')
train_data.head()

The training data contains 67349 samples


Unnamed: 0,sentence,label
0,hide new secretions from the parental units,0
1,"contains no wit , only labored gags",0
2,that loves its characters and communicates som...,1
3,remains utterly satisfied to remain the same t...,0
4,on the worst revenge-of-the-nerds clichés the ...,0


In [4]:
# tokenize the training data sentences
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
tokens = [tokenizer(sentence) for sentence in train_data['sentence']]

In [5]:
# build vocabulary based on glove vectors
TEXT = Field(pad_token='<pad>', unk_token='<unk>')
TEXT.build_vocab(tokens, vectors='glove.6B.100d')

print(f'Number of words in vocabulary: {TEXT.vocab.vectors.size(0)}')
print(f'Size of embedding space: {TEXT.vocab.vectors.size(1)}')

Number of words in vocabulary: 13889
Size of embedding space: 100
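
The vocabulary size above counts every token seen in the training data plus the two special tokens, whether or not GloVe has a vector for it. The sketch below shows one way to list the words that have no pretrained vector. It assumes that such words are given an all-zero row, which is understood to be the default `unk_init` behaviour of torchtext's `Vectors`; the helper name and the toy data are hypothetical.

```python
def words_without_pretrained_vector(itos, vectors):
    """Return the words whose embedding row consists only of zeros."""
    missing = []
    for word, row in zip(itos, vectors):
        if all(float(value) == 0.0 for value in row):
            missing.append(word)
    return missing


# toy data standing in for TEXT.vocab.itos and TEXT.vocab.vectors
toy_itos = ["<unk>", "<pad>", "movie", "qwertyish"]
toy_vectors = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.1, -0.3, 0.2],
    [0.0, 0.0, 0.0],
]
print(words_without_pretrained_vector(toy_itos, toy_vectors))
# -> ['<unk>', '<pad>', 'qwertyish']
```

Calling the helper with `TEXT.vocab.itos` and `TEXT.vocab.vectors` should work the same way, since iterating over a 2-D tensor yields its rows, although a vectorized check would be faster for a large vocabulary.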


In [6]:
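# store the vocabulary: one line per word, followed by its embedding values
# (utf-8 is set explicitly because the vocabulary contains non-ASCII words)
with open('word_vectors.txt', 'w', encoding='utf-8') as f: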
    for i, embedding in enumerate(TEXT.vocab.vectors):
        word = TEXT.vocab.itos[i]
        f.write(f'{word} {" ".join(embedding.numpy().astype(str))}\n')
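
To check the stored file, or to use it elsewhere without torchtext, it can be read back into a dictionary. The following is a minimal sketch under the assumption that no token contains a space, so the first field on each line is the word and the remaining fields are the vector values; the function name `load_word_vectors` is hypothetical, and a small in-memory sample stands in for the real file.

```python
import io


def load_word_vectors(lines):
    """Parse lines of the form '<word> <v1> <v2> ...' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], parts[1:]
        vectors[word] = [float(value) for value in values]
    return vectors


# toy stand-in for the contents of word_vectors.txt
sample = io.StringIO("<unk> 0.0 0.0\nmovie 0.1 -0.3\n")
loaded = load_word_vectors(sample)
print(loaded["movie"])  # -> [0.1, -0.3]
```

For the real file, the same function can be passed the open file handle, provided the file is opened with the same encoding it was written with.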