Copy this notebook (File>Save a copy in Drive) and then work on your copy.
==
To send me your work: Share (top-right of the window) > Get link / Copy link

Send me an email containing the link after having
*   changed the link permission to Editor and
*   allowed anyone with the link to access the notebook.

Merit KULDKEPP : meritkuldkepp@gmail.com

OR

Nicolas MERLI : merli.nicolas.0@gmail.com

Goal
==

We are about to design and train a neural system to perform sentiment analysis on film reviews. More precisely, the network will have to output the probability that the input review expresses a positive opinion (overall).

The system will be a bag-of-words model using GloVe embeddings. It will have to first average the embeddings of the words of the input review, and then send the result through a simple network that should output a probability.
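
As a rough illustration of this pipeline, the sketch below averages toy 2-dimensional word vectors and passes the result through a single logistic unit. The vectors, weights and bias are made-up values chosen for the example (they are neither GloVe embeddings nor trained parameters), and the helper names are hypothetical; the actual model in this notebook is built with PyTorch layers.

```python
import math

# Hypothetical 2-dimensional "embeddings"; real GloVe 6B vectors have 50 to 300 dimensions.
toy_embeddings = {
    "great": [1.0, 0.5],
    "boring": [-1.0, -0.5],
    "movie": [0.0, 0.1],
}

def average_embedding(tokens, embeddings):
    """Averages the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def positive_probability(tokens, embeddings, weights, bias):
    """Applies a single logistic unit to the bag-of-words vector."""
    bow = average_embedding(tokens, embeddings)
    score = sum(w * x for w, x in zip(weights, bow)) + bias
    return 1.0 / (1.0 + math.exp(-score))

# Made-up parameters; in the notebook they are learned during training.
toy_weights, toy_bias = [2.0, 1.0], 0.0
p_pos = positive_probability(["great", "movie"], toy_embeddings, toy_weights, toy_bias)
p_neg = positive_probability(["boring", "movie"], toy_embeddings, toy_weights, toy_bias)
print(p_pos > 0.5, p_neg < 0.5)  # True True
```

Because the representation is an average, word order is lost: any permutation of the same tokens yields the same probability.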

There is a lot of already written code at the beginning of the notebook. It is important that you understand it as you will have to reuse/reproduce it for future work.

Loading Pytorch is important.
==

In [1]:
# Imports Pytorch.
import torch

Downloading the dataset
==
The dataset we are going to use is the Large Movie Review Dataset (https://ai.stanford.edu/~amaas/data/sentiment/).

In [2]:
# Downloads the dataset.
import urllib.request # 'import urllib' alone does not guarantee that the 'request' submodule is loaded.

tmp = urllib.request.urlretrieve("https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz")
filename = tmp[0]

In [3]:
# Extracts the dataset.
import tarfile
tar = tarfile.open(filename)
tar.extractall()
tar.close()

In [4]:
import os # Useful library to read files and inspect directories.

In [5]:
# Shows which files and directories are present in the current working directory.
for filename in os.listdir("."):
  print(filename)

aclImdb
[TP_Sentiment_Analysis]_Students.ipynb
.ipynb_checkpoints


In [6]:
dataset_root = "aclImdb"
# Shows which files and directories are present at the root of the dataset directory.
for filename in os.listdir(dataset_root):
  print(filename)

train
imdb.vocab
imdbEr.txt
README
test


In [7]:
# Shows several reviews.
dirname = os.path.join(dataset_root, "train", "neg") # "aclImdb/{train|test}/{neg|pos}"
for idx, filename in enumerate(os.listdir(dirname)):
  if(idx >= 5): break # Stops after the 5th file.
  
  print(filename)
  with open(os.path.join(dirname, filename)) as f:
    review = f.read()
    print(review)
  print()

3479_1.txt
I went to see this film at the cinemas and i was shocked when I got in the room. There was only me and my girlfriend! This shouted to me that this film is not very good. <br /><br />Not to my surprise, the film was dire. Ben Affleck plays a guy who buys a family for Christmas. It is a very predictable narrative with him falling in love with the girl that hates him. His acting is OKish but for the comedy aspect of the film he is not very good. The plot line is poor and the comedy almost non-existent.<br /><br />However, there are some good points. For example, the family is falling apart and the mother is very funny.<br /><br />I hope this review stops other people wasting their money. I was very embarrassed when I came out of the room!!!

1912_4.txt
There's nothing new here. All the standard romantic-comedy scenes, even down to the taxi sprinting to the airport to stop the woman flying away. The only thing that saves this is the acting of Alison Eastwood & some of the minor 

Preprocessing the dataset
==

In [9]:
import nltk # Imports NLTK, an NLP library.
nltk.download('punkt') # Loads a module required for tokenization.

import collections # This library defines useful data structures. 

[nltk_data] Downloading package punkt to /workspace/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [10]:
# TODO
# Write a preprocess() function that takes a review as input
# 1. Replaces all "<br />" occurrences in the review with spaces
# 2. Splits the review into tokens using the nltk library (check the word_tokenize method documentation)
# 3. Lowercases all the extracted tokens
# Returns a list of tokens of the current review

newline = "<br />"
def preprocess(text):
  text=text.replace(newline," ")
  text=text.lower()
  return(nltk.word_tokenize(text))
  
  

In [11]:
# Reads and pre-processes the reviews.
dataset = {"train": [], "test": []}
binary_classes = {"neg": 0, "pos": 1}
for part_name, l in dataset.items():
  for class_name, value in binary_classes.items():
    path = os.path.join(dataset_root, part_name, class_name)
    print("Processing %s..." % path, end='')
    for filename in os.listdir(path):
        with open(os.path.join(path, filename)) as f:
          review_text = f.read()
          review_tokens = preprocess(review_text)
          
          l.append((review_tokens, value))
    print(" done")

Processing aclImdb/train/neg... done
Processing aclImdb/train/pos... done
Processing aclImdb/test/neg... done
Processing aclImdb/test/pos... done


In [12]:
# Splits the train set into a proper train set and a development/validation set.
# 'dataset["train"]' happens to be a list composed of a certain number of negative examples followed by the same number of positive examples.
# We are going to use 3/4 of the original train set as our actual train set, and 1/4 as our development set.
# We want to keep balanced train and development sets, i.e. for both, half of the reviews should be positive and half should be negative.
if("dev" in dataset): print("This should only be run once.")
else:
  dev_set_half_size = int((len(dataset["train"]) / 4) / 2) # Half of a quarter of the training set size.
  dataset["dev"] = dataset["train"][:dev_set_half_size] + dataset["train"][-dev_set_half_size:] # Takes some negative examples at the beginning and some positive ones at the end.
  dataset["train"] = dataset["train"][dev_set_half_size:-dev_set_half_size] # Removes the examples used for the development set.

  for (part, data) in dataset.items():
    class_counts = collections.defaultdict(int)
    for (_, p) in data: class_counts[p] += 1
    print(f"{part}: {class_counts}")
  print("Train set split into train/dev.")

train: defaultdict(<class 'int'>, {0: 9375, 1: 9375})
test: defaultdict(<class 'int'>, {0: 12500, 1: 12500})
dev: defaultdict(<class 'int'>, {0: 3125, 1: 3125})
Train set split into train/dev.
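
The slicing above relies on the negative examples coming first and the positive ones last. A minimal sketch on a toy list shows why taking the same number of items from both ends keeps both sets balanced. The `toy_train` data and the `balanced_split` helper are made up for illustration and are not part of the notebook.

```python
def balanced_split(examples, dev_fraction=0.25):
    """Splits a list laid out as [negatives..., positives...] (same count of each)
    into (train, dev), taking the dev examples equally from both ends."""
    half = int(len(examples) * dev_fraction / 2)
    if half == 0:
        # Guard: examples[-0:] would select the whole list, not an empty slice.
        return list(examples), []
    dev = examples[:half] + examples[-half:]
    train = examples[half:-half]
    return train, dev

# Toy data: 8 negative (label 0) followed by 8 positive (label 1) examples.
toy_train = [(f"neg{i}", 0) for i in range(8)] + [(f"pos{i}", 1) for i in range(8)]
train, dev = balanced_split(toy_train)
print(len(train), len(dev))  # 12 4
```

Note that this only works before the data is shuffled, which is why the split is done before the batch generator is created.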


Loading the word embeddings
==
We are going to use GloVe embeddings.

All word forms with a frequency below a given threshold are going to be considered unknown forms.

In [13]:
# Computes the frequency of all word forms in the train set.
word_counts = collections.defaultdict(int)
for tokens, _ in dataset["train"]:
  for token in tokens: word_counts[token] += 1

print(word_counts)



In [14]:
# Builds a vocabulary containing only those words present in the train set with a frequency above a given threshold.
count_threshold = 4
vocabulary = set()
for word, count in word_counts.items():
    if(count > count_threshold): vocabulary.add(word)

print(vocabulary)
print(len(vocabulary))

26417


In [15]:
import zipfile
import numpy as np

In [16]:
# Returns a dictionary {word[String]: id[Integer]} and a list of Numpy arrays
# `data_path` is the path of the directory containing the GloVe files (if None, 'glove.6B' is used)
# `max_size` is the number of word embeddings read (starting from the most frequent; in the GloVe files, the words are sorted)
# If `vocabulary` is specified, the output vocabulary contains the intersection of `vocabulary` and the words with a defined embedding. Otherwise, all words with a defined embedding are used.
def get_glove(dim=50, vocabulary=None, max_size=-1, data_path=None):
  dimensions = set([50, 100, 200, 300]) # Available dimensions for GloVe 6B
  fallback_url = 'http://nlp.stanford.edu/data/glove.6B.zip' # (Remember that in GloVe 6B, words are lowercased.)

  assert (dim in dimensions), (f'Unavailable GloVe 6B dimension: {dim}.')

  if(data_path is None): data_path = 'glove.6B'

  # Checks that the data is here, otherwise downloads it.
  if(not os.path.isdir(data_path)):
    #print('Directory "%s" does not exist. Creation.' % data_path)
    os.makedirs(data_path)
  
  glove_weights_file_path = os.path.join(data_path, f'glove.6B.{dim}d.txt')
  
  if(not os.path.isfile(glove_weights_file_path)):
    local_zip_file_path = os.path.join(data_path, os.path.basename(fallback_url))
  
    if(not os.path.isfile(local_zip_file_path)):
      print(f'Retrieving GloVe embeddings from {fallback_url}.')
      urllib.request.urlretrieve(fallback_url, local_zip_file_path)
    
    with zipfile.ZipFile(local_zip_file_path, 'r') as z:
      print(f'Extracting GloVe embeddings from {local_zip_file_path}.')
      z.extractall(path=data_path)
  
  assert os.path.isfile(glove_weights_file_path), (f"GloVe file {glove_weights_file_path} not found.")

  # Reads GloVe data.
  print('Reading GloVe embeddings.')
  new_vocabulary = {} # A dictionary {word[String]: id[Integer]}
  embeddings = [] # The list of embeddings (Numpy arrays)
  with open(glove_weights_file_path, 'r') as f:
    for line in f: # Each line consists of the word followed by a space and all of the coefficients of the vector separated by a space.
      values = line.split()

      # Here, I'm trying to detect where on the line the word ends and where the vector begins. As in some version(s) of GloVe words can contain spaces, this is not entirely trivial.
      vector_part = ' '.join(values[-dim:])
      x = line.find(vector_part)
      word = line[:(x - 1)]

      if((vocabulary is not None) and (word not in vocabulary)): # If a vocabulary was specified and if the word is not in it…
        continue # …this word is skipped.

      new_vocabulary[word] = len(new_vocabulary)
      embedding = np.asarray(values[-dim:], dtype=np.float32)
      embeddings.append(embedding)

      if(len(new_vocabulary) == max_size): break
  print('(GloVe embeddings loaded.)')
  print()

  return (new_vocabulary, embeddings)

In [17]:
(new_vocabulary, embeddings) = get_glove(dim=50, vocabulary=vocabulary)

Retrieving GloVe embeddings from http://nlp.stanford.edu/data/glove.6B.zip.
Extracting GloVe embeddings from glove.6B/glove.6B.zip.
Reading GloVe embeddings.
(GloVe embeddings loaded.)



In [18]:
print(len(new_vocabulary)) # Shows the size of the vocabulary.
print(new_vocabulary) # Shows each word and its id.

25594
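
For reference, each line of a GloVe text file holds a word followed by its coefficients, separated by single spaces. Assuming that layout, a line can also be split from the right, so that a word containing spaces stays in one piece. The `parse_glove_line` helper and the sample lines below are made up for illustration; they are not part of the GloVe distribution or of `get_glove`.

```python
def parse_glove_line(line, dim):
    """Splits a GloVe-style line into (word, vector), keeping the last `dim` fields as the vector."""
    fields = line.rstrip("\n").rsplit(" ", dim)
    if len(fields) != dim + 1:
        raise ValueError(f"Expected a word followed by {dim} coefficients.")
    word = fields[0]
    vector = [float(x) for x in fields[1:]]
    return word, vector

# Made-up 3-dimensional examples (real GloVe 6B files use 50 to 300 dimensions).
print(parse_glove_line("movie 0.1 -0.2 0.3\n", dim=3))     # ('movie', [0.1, -0.2, 0.3])
print(parse_glove_line("new york 0.5 0.0 -1.0\n", dim=3))  # ('new york', [0.5, 0.0, -1.0])
```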


Batch generator
==

In [19]:
# Defines a class of objects that produce batches from the dataset.
class BatchGenerator:
  def __init__(self, dataset, vocabulary):
    self.dataset = dataset
    for part in self.dataset.values(): # Shuffles the dataset so that positive and negative examples are mixed.
      np.random.shuffle(part)

    self.vocabulary = vocabulary # Dictionary {word[String]: id[Integer]}
    self.unknown_word_id = len(vocabulary) # Id for unknown forms
    self.padding_idx = len(vocabulary) + 1 # Not all reviews of a given batch will have the same length. We will "pad" shorter reviews with a special token id so that the batch can be represented by a matrix.
  
  def length(self, data_type='train'):
    return len(self.dataset[data_type])

  # Returns a random batch.
  # If `subset` is an integer, only a subset of the corpus is used. This can be useful to debug the system.
  def get_batch(self, batch_size, data_type, subset=None):
    data = self.dataset[data_type] # selects the relevant portion of the dataset.
    
    max_i = len(data) if(subset is None) else min(subset, len(data))
    instance_ids = np.random.randint(max_i, size=batch_size) # Randomly picks some instance ids.

    return self._ids_to_batch(data, instance_ids)

  def _ids_to_batch(self, data, instance_ids):
    word_ids = [] # Will be a list of lists of word ids (Integer)
    polarity = [] # Will be a list of review polarities (Boolean)
    texts = [] # Will be a list of lists of words (String)
    for instance_id in instance_ids:
      text, p = data[instance_id]

      word_ids.append([self.vocabulary.get(w, self.unknown_word_id) for w in text])
      polarity.append(p)
      texts.append(text)
    
    # Padding
    self.pad(word_ids)

    word_ids = torch.tensor(word_ids, dtype=torch.long) # Conversion to a tensor
    polarity = torch.tensor(polarity, dtype=torch.bool) # Conversion to a tensor

    return (word_ids, polarity, texts) # We don't really need `texts` but it might be useful to debug the system.
  
  # Pads a list of lists (i.e. adds fake word ids so that all sequences in the batch have the same length, so that we can use a matrix to represent them).
  # In place
  def pad(self, word_ids):
    max_length = max([len(s) for s in word_ids])
    for s in word_ids: s.extend([self.padding_idx] * (max_length - len(s)))
  
  # Returns a generator of batches for a full epoch.
  # If `subset` is an integer, only a subset of the corpus is used. This can be useful to debug the system.
  def all_batches(self, batch_size, data_type="train", subset=None):
    data = self.dataset[data_type]
    
    max_i = len(data) if(subset is None) else min(subset, len(data))

    # Loop that generates all full batches (batches of size 'batch_size')
    i = 0
    while((i + batch_size) <= max_i):
      instance_ids = np.arange(i, (i + batch_size))
      yield self._ids_to_batch(data, instance_ids)
      i += batch_size
    
    # Possibly generates the last (not full) batch.
    if(i < max_i):
      instance_ids = np.arange(i, max_i)
      yield self._ids_to_batch(data, instance_ids)
  
  # Turns a list of arbitrary pre-processed texts into a batch.
  # This function will be used to infer the polarity of an unannotated review.
  def turn_into_batch(self, texts):
    word_ids = [[self.vocabulary.get(w, self.unknown_word_id) for w in text] for text in texts]
    self.pad(word_ids)
    return torch.tensor(word_ids, dtype=torch.long)

batch_generator = BatchGenerator(dataset=dataset, vocabulary=new_vocabulary)
print(batch_generator.length('train')) # Prints the number of instances in the train set.

18750


In [24]:
tmp = batch_generator.get_batch(3, data_type="train")
print(tmp[0]) # Prints the matrix of token ids.
print(tmp[1]) # Prints the vector of polarities.
print(tmp[2]) # Prints the list of reviews.

tensor([[  142,    36,    13,  ..., 25595, 25595, 25595],
        [ 2593,  2997, 25594,  ..., 25595, 25595, 25595],
        [ 2497, 24073,    13,  ...,     7, 24602,     2]])
tensor([False, False,  True])
[['well', 'this', 'is', 'a', 'typical', '``', 'straight', 'to', 'the', 'toilet', "''", 'slasher', 'film', '.', 'long', 'story', 'short', ',', 'a', 'bunch', 'of', 'teenagers/young', 'adults', 'becoming', 'stranded', 'in', 'the', 'middle', 'of', 'creepy', 'woods', 'and', 'get', 'hacked', 'down', 'by', 'naked', 'nymphomaniac', 'demons', '.', 'this', 'movie', 'has', 'all', 'the', 'basics', 'for', 'this', 'slasher', 'fromage', ':', '-naked', 'women', ',', '-teens', 'or', 'young', 'adults', 'being', 'marooned', 'in', 'someplace', 'spooky', ',', '-gory', 'death', 'scenes', ',', '-the', 'last', 'survivor', 'being', 'a', 'well', 'built', 'young', 'woman', 'who', 'will', 'always', 'show', 'off', 'her', 'midriff', ',', 'but', 'never', 'bra', 'less', ',', '-a', 'creepy', ',', 'crazy', 'man', 'who

In [25]:
len(list(batch_generator.all_batches(batch_size=3, data_type="train"))) # Number of batches of size 3 in the training set

6250
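
To make the role of the two special ids concrete, here is a small pure-Python sketch of what padding does and of why the padding positions must be skipped when averaging. The ids and the scalar "embeddings" are toy values, and `pad_batch` and `masked_mean` are hypothetical helpers written for this example only.

```python
def pad_batch(sequences, padding_idx):
    """Returns copies of the sequences, right-padded to the length of the longest one."""
    max_length = max(len(s) for s in sequences)
    return [s + [padding_idx] * (max_length - len(s)) for s in sequences]

def masked_mean(ids, table, padding_idx):
    """Averages the (scalar) embeddings of the non-padding ids of one sequence."""
    values = [table[i] for i in ids if i != padding_idx]
    return sum(values) / len(values) if values else 0.0

# Toy setup: 3 known words (ids 0 to 2), id 3 for unknown forms, id 4 for padding.
unknown_id, padding_idx = 3, 4
table = [1.0, 2.0, 3.0, 2.0, 0.0]  # One scalar "embedding" per id; padding maps to 0.

batch = pad_batch([[0, 1, 2], [2]], padding_idx)
print(batch)  # [[0, 1, 2], [2, 4, 4]]

naive = sum(table[i] for i in batch[1]) / len(batch[1])  # Also counts the padding positions.
print(naive)                                      # 1.0
print(masked_mean(batch[1], table, padding_idx))  # 3.0
```

Even with an all-zero padding vector, the naive mean is wrong because it divides by the padded length rather than by the true length of the review.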

The model
==

In [31]:
class SentimentClassifier(torch.nn.Module):
  # embeddings: list of Numpy arrays
  # hidden_sizes: list of the size of all hidden layers (Integer)
  def __init__(self, embeddings, hidden_sizes, freeze_embeddings=True, device='cuda'):
    embeddings = list(embeddings) # Creates a copy of the list of embeddings, so we can add or remove entries without affecting the original list.
    super().__init__() # Calls the constructor of the parent class. Usually, this is necessary when creating a custom module.

    self.padding_idx = len(embeddings) + 1 # len(embeddings) will be the id of the embedding of the unknown word

    # Here you have to 
    # (i) define a vector for unknown forms (the average of actual word embeddings) and a vector for the padding token (full of 0s) and 
    # (ii) define an embedding layer 'self.embeddings' using torch.nn.Embedding and not forgetting to use the 'freeze' and 'padding_idx' arguments.
    #################
    unknown_vector = np.mean(embeddings, axis=0)
    embeddings.append(unknown_vector)
    padding_vector = np.zeros_like(unknown_vector)
    embeddings.append(padding_vector)

    weights = torch.tensor(np.stack(embeddings), dtype=torch.float32) # Shape: (vocabulary size + 2, embedding size)
    self.embeddings = torch.nn.Embedding.from_pretrained(weights, freeze=freeze_embeddings, padding_idx=self.padding_idx)
    #################
    self.embeddings = self.embeddings.to(device) # Sends the word embeddings to 'device', which is potentially a GPU.

    # Here you have to define self.main_part, the network that computes a probability out of a review (represented as the average of the embeddings of the tokens).
    # The number of hidden layers is determined by 'hidden_sizes', which is a list of integers describing the (output) size of each of them.
    # Use torch.nn.Linear to build linear layers.
    # torch.nn.Sequential takes one argument per module and not a list of modules as argument, but if 'modules' is a list of modules, 'torch.nn.Sequential(*modules)' (with the star notation) works.
    #################
    modules = []
    input_size = self.embeddings.weight.shape[1]
    for hidden_size in hidden_sizes:
      modules.append(torch.nn.Linear(input_size, hidden_size))
      modules.append(torch.nn.ReLU())
      input_size = hidden_size
    modules.append(torch.nn.Linear(input_size, 1)) # Output layer: one score per review.
    modules.append(torch.nn.Sigmoid()) # Turns the score into a probability.

    self.main_part = torch.nn.Sequential(*modules)
    #################
    self.main_part = self.main_part.to(device) # Sends the network to 'device', which is potentially a GPU.

    self.device = device

  # 'batch' is a matrix of word ids (Integer).
  def forward(self, batch):
    #################
    batch = batch.to(self.device)
    # (i) Turns 'batch' into a matrix of embeddings, i.e. a tensor of rank 3 and shape (batch size, max length, embedding size).
    embedded = self.embeddings(batch)
    # (ii) Averages all embeddings for a given review, being careful not to take padding vectors into account.
    mask = (batch != self.padding_idx).unsqueeze(-1).to(embedded.dtype) # 1. for actual tokens, 0. for padding
    lengths = mask.sum(dim=1).clamp(min=1) # Number of actual tokens per review (at least 1 to avoid dividing by 0)
    bag_of_words = (embedded * mask).sum(dim=1) / lengths
    # (iii) Sends these bag-of-words representations to the network.
    prob = self.main_part(bag_of_words)
    # Returns a tensor of shape (batch size) instead of (batch size, 1).
    return prob.squeeze(-1)
    #################

In [32]:
# Checking if the model is well defined and if a forward pass is feasible.
# This cell should return a torch tensor of size 3
model = SentimentClassifier(embeddings, hidden_sizes=[100, 200], freeze_embeddings=True)
batch = batch_generator.get_batch(3, data_type="train")
print(model(batch[0]))



In [33]:
# Function that computes the accuracy of the model on a given part of the dataset.
evaluation_batch_size = 256
def evaluation(data_type, subset=None):
  nb_correct = 0
  total = 0
  for batch in batch_generator.all_batches(evaluation_batch_size, data_type=data_type, subset=subset):
    prob = model(batch[0].to(model.device)) # Forward pass
    answer = (prob > 0.5) # Shape: (batch size)
    nb_correct += (answer == batch[1].to(model.device)).sum().item()
    total += batch[0].shape[0]
      
  accuracy = (nb_correct / total)
  return accuracy

Training
==
Once everything works, feel free to find better hyperparameters.
The goal is to maximise the accuracy on the development set.

In [None]:
model = SentimentClassifier(embeddings, hidden_sizes=[30, 20], freeze_embeddings=False, device='cuda')
MSELoss = torch.nn.MSELoss()

# Tests the model on a couple of instances before training.
model.eval() # Tells Pytorch we are in evaluation/inference mode (can be useful if dropout is used, for instance).
print(model(batch_generator.turn_into_batch([preprocess(text) for text in ["This movie was terrible!!", "Pure gold!"]]).to(model.device)))

# Training procedure
learning_rate = 0.006
l2_reg = 0.0001
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.99, weight_decay=l2_reg) # Once the backward propagation has been done, call the 'step' method (with no argument) to update the parameters.
batch_size = 64
subset = None # Use an integer to train on a smaller portion of the training set, otherwise use None.
epoch_size = batch_generator.length("train") if(subset is None) else subset # In number of instances

nb_epoch = 20
epoch_id = 0 # Id of the current epoch
instances_processed = 0 # Number of instances trained on in the current epoch
epoch_loss = [] # Will contain the loss for each batch of the current epoch
while(epoch_id < nb_epoch):
  model.train() # Tells Pytorch we are in training mode (can be useful if dropout is used, for instance).
  
  model.zero_grad() # Makes sure the gradient is reinitialised to zero.
  
  batch = batch_generator.get_batch(batch_size, data_type="train", subset=subset)

  # You have to (i) compute the prediction of the model, 
  # (ii) compute the loss (use an average over the batch), 
  # (iii) call "backward" on the loss and (iv) store the loss in "epoch_loss".
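  ###################
  prediction = model(batch[0].to(model.device)) # (i) Forward pass
  target = batch[1].to(model.device).float() # Polarities converted from booleans to 0./1. floats, as the loss expects floats
  loss = MSELoss(prediction, target) # (ii) Mean squared error, averaged over the batch
  loss.backward() # (iii) Backward propagation
  epoch_loss.append(loss.item()) # (iv) Stores the loss of this batch
  ###################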
  
  optimizer.step() # Updates the parameters.

  instances_processed += batch_size
  if(instances_processed > epoch_size):
    print(f"-- END OF EPOCH {epoch_id}.")
    print(f"Average loss: {sum(epoch_loss) / len(epoch_loss)}.")

    # Evaluation
    model.eval() # Tells Pytorch we are in evaluation/inference mode (can be useful if dropout is used, for instance).
    with torch.no_grad(): # Deactivates Autograd (it is computationally expensive and we don't need it here).
      accuracy = evaluation("train")
      print(f"Accuracy on the train set: {accuracy}.")

      accuracy = evaluation("dev")
      print(f"Accuracy on the dev set: {accuracy}.")

    epoch_id += 1
    instances_processed -= epoch_size
    epoch_loss = []

In [None]:
model.eval() # Tells Pytorch we are in evaluation/inference mode (can be useful if dropout is used, for instance).
model(batch_generator.turn_into_batch([preprocess(text) for text in ["This movie was terrible!!", "Pure gold!", "Bad.", "Not bad!"]]).to(model.device))

In [None]:
# To go further

# 1. Try to explain the results of the four reviews given in the cell above. Is there any weird behavior? How could you explain it?
# 2. You can try using longer GloVe vectors when creating the vocabulary (get_glove function). Beware: this will increase training time.
# 3. Feel free to play with the given hyperparameters of the model to obtain a better accuracy.
# 4. Accuracy is not a very informative metric. Could you explain why? If so, try to find a new metric and implement it.
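
As one possible starting point for item 4, the sketch below computes precision, recall and F1 for the positive class from lists of gold labels and predictions. This is only one candidate metric among others, the `precision_recall_f1` helper is not part of the notebook, and the labels are made up.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 for the positive class (label True/1)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)      # True positives
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)  # False positives
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)  # False negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up labels: 3 positive and 5 negative reviews.
gold      = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(accuracy)  # 0.625
print(tuple(round(x, 3) for x in precision_recall_f1(gold, predicted)))  # (0.5, 0.333, 0.4)
```

In this toy case the accuracy looks acceptable while only one of the three positive reviews is found, which is the kind of behavior that a single accuracy figure hides.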