<a href="https://colab.research.google.com/github/Zinni98/Sentiment-analysis-project/blob/lbsa/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Polarity Classification

In [1]:
from google.colab import drive
drive.mount("/content/gdrive/")

Mounted at /content/gdrive/


In [2]:
import sys
sys.path.append("/content/gdrive/My Drive/nlu-project")

### Get the data

In [3]:
import nltk
import torch
from nltk.corpus import movie_reviews
import numpy as np
nltk.download("punkt")

nltk.download("movie_reviews")
nltk.download("subjectivity")
nltk.download("stopwords")
nltk.download("sentiwordnet")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package subjectivity to /root/nltk_data...
[nltk_data]   Unzipping corpora/subjectivity.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

## Exploratory analysis

Firstly let's explore the data:

In [4]:
mr = movie_reviews
neg = mr.paras(categories = "neg")
pos = mr.paras(categories = "pos")
print(f"length of each part of the dataset:\n - pos: {len(pos)} \n - neg: {len(neg)}")
print(pos)

length of each part of the dataset:
 - pos: 1000 
 - neg: 1000
[[['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', 'there', "'", 's', 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.'], ['for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'", '80s', 'with', 'a', '12', '-', 'part', 'series', 'called', 'the', 'watchmen', '.'], ['to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'l

It's easy to see that data comes in the following format:

- pos = [doc1, doc2, ..., doc1000] (the same applies for negative sentiment examples)

Where each doc has the following structure:

- doc1 = [sentence_1, sentence_2, ..., sentence_k]

Each sentence is a list of tokens, so the dataset is already tokenized.

### Word embedding
Since I'm going to use deep learning models, I'm going to choose a word embedding to transform the text into vectors.
I'm going to start with a pretrained version of GloVe word embedding.
Since is a pre-trained word embedding (hence basically a lookup table), I'm going to check how many words of the vocabulary are covered by the pretrained word embedding model:

In [None]:
def create_vocab(corpus_words):
    vocab = dict()
    for word in corpus_words:
      try:
        vocab[word] += 1
      except:
        vocab[word] = 1
    return vocab

def get_corpus_words(corpus) -> list:
    return [w for doc in corpus for sent in doc for w in sent]

In [None]:
import operator
from tqdm import tqdm
from torchtext.vocab import GloVe
import torch

global_vectors = GloVe(name='840B', dim=300)

# function inspired by https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-when-using-embeddings/notebook
def check_coverage(vocab,embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    null_embedding = torch.tensor([0.0]*300)
    for word in tqdm(vocab):
        try:
          if torch.equal(embeddings_index.get_vecs_by_tokens(word), null_embedding):
            raise KeyError
          a[word] = embeddings_index.get_vecs_by_tokens(word)
          k += vocab[word]
        except:

            oov[word] = vocab[word]
            i += vocab[word]
            pass

    print()
    print(f'Found embeddings for {len(a) / len(vocab):.2%} of vocab')
    print(f'Found embeddings for  {k / (k + i):.2%} of all text')
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

vocab = create_vocab(get_corpus_words(pos + neg))
oov = check_coverage(vocab, global_vectors)

.vector_cache/glove.840B.300d.zip: 2.18GB [06:50, 5.31MB/s]                            
100%|█████████▉| 2196016/2196017 [04:24<00:00, 8308.59it/s]
100%|██████████| 39768/39768 [00:01<00:00, 22744.70it/s]



Found embeddings for 91.93% of vocab
Found embeddings for  99.58% of all text


In [None]:
oov

I'm going to see which are the words that are not covered by the embedding (Out Of Vocabulary words), so I can try to see if there are some tenchniques that can be applied in order to improve coverage.
The majority of OOV words aren't related with a praticular sentiment (they are basically nouns or some type punctuation), so they can be safely removed. That happens because unknown words are encoded as $[0] * embedding.length$, so no useful information is added.
Others OOV words are regular words surrounded by underscores, so they are not recognized by the fixed word embedding. To avoid this problem I implemented a procedure in order to clean these words:

In [None]:
def remove_underscores(corpus):
  for doc in corpus:
    for sent in doc:
      for idx, word in enumerate(sent):
        if "_" in word:
          cleaned_word = _clean_word(word)
          sent[idx] = cleaned_word
  return corpus


def _clean_word(word: str):
  word = word.replace("_", " ")
  word = word.split()
  word = " ".join(word)
  return word


In [None]:
corpus = pos + neg
clean_corpus = remove_underscores(corpus), oov
vocab = create_vocab(get_corpus_words(clean_corpus))
oov = check_coverage(vocab, global_vectors)

100%|██████████| 39519/39519 [00:01<00:00, 28083.73it/s]



Found embeddings for 92.48% of vocab
Found embeddings for  99.61% of all text


In [4]:
from cmath import phase
from dis import findlabels
from unicodedata import name
from nltk.corpus import stopwords
import re
import spacy


CONTRACTION_MAP =  {"ain't": "is not",
                        "aren't": "are not",
                        "can't": "cannot",
                        "can't've": "cannot have",
                        "'cause": "because",
                        "could've": "could have",
                        "couldn't": "could not",
                        "couldn't've": "could not have",
                        "didn't": "did not",
                        "doesn't": "does not",
                        "don't": "do not",
                        "hadn't": "had not",
                        "hadn't've": "had not have",
                        "hasn't": "has not",
                        "haven't": "have not",
                        "he'd": "he would",
                        "he'd've": "he would have",
                        "he'll": "he will",
                        "he'll've": "he he will have",
                        "he's": "he is",
                        "how'd": "how did",
                        "how'd'y": "how do you",
                        "how'll": "how will",
                        "how's": "how is",
                        "i'd": "i would",
                        "i'd've": "i would have",
                        "i'll": "i will",
                        "i'll've": "i will have",
                        "i'm": "i am",
                        "i've": "i have",
                        "isn't": "is not",
                        "it'd": "it would",
                        "it'd've": "it would have",
                        "it'll": "it will",
                        "it'll've": "it will have",
                        "it's": "it is",
                        "let's": "let us",
                        "ma'am": "madam",
                        "mayn't": "may not",
                        "might've": "might have",
                        "mightn't": "might not",
                        "mightn't've": "might not have",
                        "must've": "must have",
                        "mustn't": "must not",
                        "mustn't've": "must not have",
                        "needn't": "need not",
                        "needn't've": "need not have",
                        "o'clock": "of the clock",
                        "oughtn't": "ought not",
                        "oughtn't've": "ought not have",
                        "shan't": "shall not",
                        "sha'n't": "shall not",
                        "shan't've": "shall not have",
                        "she'd": "she would",
                        "she'd've": "she would have",
                        "she'll": "she will",
                        "she'll've": "she will have",
                        "she's": "she is",
                        "should've": "should have",
                        "shouldn't": "should not",
                        "shouldn't've": "should not have",
                        "so've": "so have",
                        "so's": "so as",
                        "that'd": "that would",
                        "that'd've": "that would have",
                        "that's": "that is",
                        "there'd": "there would",
                        "there'd've": "there would have",
                        "there's": "there is",
                        "they'd": "they would",
                        "they'd've": "they would have",
                        "they'll": "they will",
                        "they'll've": "they will have",
                        "they're": "they are",
                        "they've": "they have",
                        "to've": "to have",
                        "wasn't": "was not",
                        "we'd": "we would",
                        "we'd've": "we would have",
                        "we'll": "we will",
                        "we'll've": "we will have",
                        "we're": "we are",
                        "we've": "we have",
                        "weren't": "were not",
                        "what'll": "what will",
                        "what'll've": "what will have",
                        "what're": "what are",
                        "what's": "what is",
                        "what've": "what have",
                        "when's": "when is",
                        "when've": "when have",
                        "where'd": "where did",
                        "where's": "where is",
                        "where've": "where have",
                        "who'll": "who will",
                        "who'll've": "who will have",
                        "who's": "who is",
                        "who've": "who have",
                        "why's": "why is",
                        "why've": "why have",
                        "will've": "will have",
                        "won't": "will not",
                        "won't've": "will not have",
                        "would've": "would have",
                        "wouldn't": "would not",
                        "wouldn't've": "would not have",
                        "y'all": "you all",
                        "y'all'd": "you all would",
                        "y'all'd've": "you all would have",
                        "y'all're": "you all are",
                        "y'all've": "you all have",
                        "you'd": "you would",
                        "you'd've": "you would have",
                        "you'll": "you will",
                        "you'll've": "you will have",
                        "you're": "you are",
                        "you've": "you have",
                    }
class MRAbstractPipeline():
    def __init__(self):
        self.pipeline = []
    
    def pipe(self, corpus):
        for el in self.pipeline:
            corpus = el(corpus)
        return corpus
    
    def __call__(self, *args, **kwds):
        if args[0] == None:
            raise ValueError("Need a corpus as argument")
        corpus = args[0]
        return self.pipe(corpus)
        

class MRPipelineTokens(MRAbstractPipeline):
    """
    Pipeline for documents represented as list of tokens
    """
    def __init__(self):
        super(MRPipelineTokens, self).__init__()
        self.pipeline = [self.remove_underscores, 
                         self.reducing_character_repetitions,
                         #self.unite_ts,
                         self.clean_contractions,
                         self.clean_special_chars,
                         self.remove_stop_words]

    def remove_underscores(self, corpus):
        """
        Solves the problem where some of the words are surrounded by underscores
        (e.g. "_hello_")
        """
        for doc in corpus:
            for idx, word in enumerate(doc):
                if "_" in word:
                    cleaned_word = self._clean_word(word)
                    doc[idx] = cleaned_word
        return corpus


    def _clean_word(self, word: str):
        word = word.replace("_", " ")
        # remove spaces before and after the word
        word = word.split()
        word = " ".join(word)
        return word
    
    def reducing_character_repetitions(self, corpus):
        
        new_corpus = []
        for doc in corpus:
            new_doc = [self._clean_repetitions(w) for w in doc]
            new_corpus.append(new_doc)
        return new_corpus

    # inspired by https://towardsdatascience.com/cleaning-preprocessing-text-data-by-building-nlp-pipeline-853148add68a
    def _clean_repetitions(self, word):
        """
        This Function will reduce repetition to two characters 
        for alphabets and to one character for punctuations.

        Parameters
        ----------
            word: str                
        Returns
        -------
        str
            Finally formatted text with alphabets repeating to 
            one characters & punctuations limited to one repetition 
            
        Example:
        Input : Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)
        Output : Really, Great !?.;:)

        """
        # Pattern matching for all case alphabets
        pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)

        # Limiting all the repetitions to two characters.
        # MODIFIED: keep only one repetition of the character
        formatted_text = pattern_alpha.sub(r"\1\1", word) 

        # Pattern matching for all the punctuations that can occur
        pattern_punct = re.compile(r'([.,/#!$%^&*?;:{}=_`~()+-])\1{1,}')

        # Limiting punctuations in previously formatted string to only one.
        combined_formatted = pattern_punct.sub(r'\1', formatted_text)

        # The below statement is replacing repetitions of spaces that occur more than two times with that of one occurrence.
        final_formatted = re.sub(' {2,}',' ', combined_formatted)
        return final_formatted
    
    def unite_ts(self, corpus):
      new_corpus = []
      for doc in corpus:
        indexes = self._get_neg_indexes(doc)
        for el in indexes:
          doc[el[0]:el[1]] = ["".join(doc[el[0]:el[1]])]
        new_corpus.append(doc)
      return new_corpus

    def _get_neg_indexes(self, sent):
      indexes = []
      for idx, word in enumerate(sent):
        # Try to avoid out of range indexes (there can be some "'" a the beginning or end of the phrase)
        try:
          if word=="'" and sent[idx+1]=="t":
            indexes.append((idx-1,idx+2))
        except:
          pass
      return indexes
    
    def clean_contractions(self, corpus):
        new_corpus = []
        for doc in corpus:
            new_doc = []
            for word in doc:
                try:
                    correct = CONTRACTION_MAP[word]
                    correct = correct.split()
                    new_doc += correct
                except:
                    new_doc.append(word)
            new_corpus.append(new_doc)
        return new_corpus

    def clean_special_chars(self, corpus):
        new_corpus = [[self._clean_special_word(w) for w in doc] for doc in corpus] 
        return new_corpus
    
    def _clean_special_word(self, word):
        # The formatted text after removing not necessary punctuations.
        formatted_text = re.sub(r"[^a-zA-Z0-9:€$-,%.?!]+", '', word) 
        # In the above regex expression,I am providing necessary set of punctuations that are frequent in this particular dataset.
        return formatted_text
    
    def remove_stop_words(self, corpus):
        stops = stopwords.words("english")
        stops = [word for word in stops if "'t" not in word or "not" not in word]
        return [[word for word in doc if word not in stops] for doc in corpus]
    

class MRPipelinePhrases(MRAbstractPipeline):
    """
    Pipeline for documents represented as list of phrases
    """
    def __init__(self):
        super(MRPipelinePhrases, self).__init__()
        self.pipeline = [self.remove_underscores, 
                         self.clean_special_chars,
                         self.reducing_character_repetitions,
                         self.lemmatize]

    def remove_underscores(self, corpus):
        """
        Solves the problem where some of the words are surrounded by underscores
        (e.g. "_hello_")
        """
        new_corpus = [self._clean_word(doc) for doc in corpus]
        return new_corpus


    def _clean_word(self, doc: str):
        doc = doc.replace("_", " ")
        return doc
    
    def reducing_character_repetitions(self, corpus):
        new_corpus = [self._clean_repetitions(doc) for doc in corpus]
        return new_corpus
    
    # inspired by https://towardsdatascience.com/cleaning-preprocessing-text-data-by-building-nlp-pipeline-853148add68a
    def _clean_repetitions(self, word):
        """
        This Function will reduce repetition to two characters 
        for alphabets and to one character for punctuations.

        Parameters
        ----------
            word: str                
        Returns
        -------
        str
            Finally formatted text with alphabets repeating to 
            one characters & punctuations limited to one repetition 
            
        Example:
        Input : Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)
        Output : Realy, Great !?.;:)

        """
        # Pattern matching for all case alphabets
        pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)

        # Limiting all the repetitions to two characters.
        # MODIFIED: keep only one repetition of the character
        formatted_text = pattern_alpha.sub(r"\1\1", word) 

        # Pattern matching for all the punctuations that can occur
        pattern_punct = re.compile(r'([., /#!$%^&*?;:{}=_`~()+-])\1{1,}')

        # Limiting punctuations in previously formatted string to only one.
        combined_formatted = pattern_punct.sub(r'\1', formatted_text)

        # The below statement is replacing repetitions of spaces that occur more than two times with that of one occurrence.
        final_formatted = re.sub(' {2,}',' ', combined_formatted)
        return final_formatted

    def clean_special_chars(self, corpus):
        new_corpus = [self._clean_special_word(doc)  for doc in corpus]
        return new_corpus
    
    def _clean_special_word(self, word):
        # The formatted text after removing not necessary punctuations.
        formatted_text = re.sub(r"[^a-zA-Z0-9:€$-,%.?!]+", ' ', word) 
        # In the above regex expression,I am providing necessary set of punctuations that are frequent in this particular dataset.
        return formatted_text
    

    def lemmatize(self, corpus):
        nlp = spacy.load('en_core_web_sm')
        return [[token.lemma_ for token in nlp(doc)] for doc in corpus]


### Corpus class
I'm going to create a class for the representation of the corpus in order to have a self contained way to have all the functions that may be useful for the processing of the corpus.

In [5]:
from nltk.corpus import movie_reviews
import numpy as np
import torch
import spacy


class MovieReviewsCorpusPhrases():
    def __init__(self, preprocess_pipeline = None):
        """
        If non preprocess_pipeline is given, the text gets tokenized by default
        using spacy tokenizer
        """
        self.mr = movie_reviews
        if preprocess_pipeline != None and not isinstance(preprocess_pipeline, MRPipelinePhrases):
            raise ValueError(f"preprocess_pipeline is not valid, you should pass \
                                a MRPipelinePhrases object or None")
        self.pipeline = preprocess_pipeline
        self.raw_corpus, self.labels = self._get_raw_corpus()
        if self.pipeline == None:
            self.processed_corpus = self.raw_corpus
        else:
            # Flattened and preprocessed corpus
            self.processed_corpus = self._preprocess()
        
        self.vocab = self._create_vocab()
        

    def _get_raw_corpus(self):
        neg = [self.mr.raw(doc) for doc in self.mr.fileids()[:1000]]
        pos = [self.mr.raw(doc) for doc in self.mr.fileids()[1000:]]
        labels = [0]*len(neg) + [1]*len(pos)
        return neg + pos, labels
    
    def _preprocess(self):
        if self.pipeline != None:
            return self.pipeline(self.raw_corpus)
        else:
            return self.raw_corpus
        
    def _create_vocab(self):
        vocab = dict()
        corpus_words = [w for doc in self.processed_corpus for w in doc]
        for word in corpus_words:
            try:
                vocab[word] += 1
            except:
                vocab[word] = 1
        return vocab

    def get_embedding_matrix(self, embedding, embedding_dim):
        """
        Returns
        -------
        np.ndarray
            A 2D which each row has the corresponding embedding from the vocabulary
        """
        matrix_length = len(self.vocab)
        embedding_matrix = np.zeros((matrix_length, embedding_dim))
        # If I use torch.zeros directly it crashes (don't know why)
        embedding_matrix = torch.from_numpy(embedding_matrix.copy())
        null_embedding = torch.tensor([0.0]*embedding_dim)
        for idx, key in enumerate(self.vocab.keys()):
            if torch.equal(embedding[key], null_embedding):
                embedding_matrix[idx] = torch.randn(embedding_dim)
            else:
                embedding_matrix[idx] = embedding[key]
                
        return embedding_matrix
    
    def get_indexed_corpus(self):
        """
        Returns
        -------
        Dictionary
            Containing correspondences word -> index
        
        list(list(torch.tensor))
            The corpus represented as indexes corresponding to each word
        """
        vocab = {}
        for idx, key in enumerate(self.vocab.keys()):
            vocab[key] = idx
        
        indexed_corpus = [torch.tensor([torch.tensor(vocab[w], dtype=torch.int32) for w in doc]) for doc in self.processed_corpus]
        return indexed_corpus, self.labels
    
    def __len__(self):
        return len(self.processed_corpus)


class MovieReviewsCorpus():
    def __init__(self, preprocess_pipeline = None):
        # list of documents, each document is a list containing words of that document
        self.mr = movie_reviews
        self.pipeline = preprocess_pipeline
        # Corpus as list of documents. Documents as list of sentences. Sentences as list of tokens
        self.unprocessed_corpus, self.labels = self._get_corpus()
        # Corpus as list of documents. Documents as list of tokens
        self.flattened_corpus = self._flatten()
        if preprocess_pipeline == None:
            self.processed_corpus = self.flattened_corpus
        else:
            # Flattened and preprocessed corpus
            self.processed_corpus = self._preprocess()

        self.corpus_words = self.get_corpus_words()
        self.vocab = self._create_vocab()



    def _list_to_str(self, doc) -> str:
        """
        Put all elements of the list into a single string, separating each element with a space.
        """
        return " ".join([w for sent in doc for w in sent])

    def _preprocess(self):
        return self.pipeline(self.flattened_corpus)

    def _flatten(self):
        """
        Returns
        -------
        list[list[str]]
            Each inner list represents a document. Each document is a list of tokens.
        """

        # 3 nested list: each list contain a document, each inner list contains a phrase (until fullstop), each phrase contains words.

        corpus = [[w for w in self._list_to_str(d).split(" ")] for d in self.unprocessed_corpus]
        return corpus

    def _get_corpus(self):
        neg = self.mr.paras(categories = "neg")
        pos = self.mr.paras(categories = "pos")
        labels = [0] * len(pos) + [1] * len(neg)
        return neg + pos, labels

    def movie_reviews_dataset_raw(self):
        """
        Returns the dataset containing:

        - A list of all the documents
        - The corresponding label for each document

        Returns
        -------
        tuple(list, list)
            The dataset: first element is the list of the document, the second element of the tuple is the associated label (positive or negative) for each document
        """

        return self.flattened_corpus, self.labels

    def get_sentence_ds(self):
        neg = self.mr.paras(categories = "neg")
        pos = self.mr.paras(categories = "pos")

        pos = [phrase for doc in pos for phrase in doc]
        neg = [phrase for doc in neg for phrase in doc]

        labels = np.array([0] * len(pos) + [1] * len(neg))
        corpus = neg+pos
        return corpus, labels


    def get_corpus_words(self) -> list:
        return [w for doc in self.processed_corpus for w in doc]
    
    def get_embedding_matrix(self, embedding, embedding_dim):
        """
        Returns
        -------
        np.ndarray
            A 2D which each row has the corresponding embedding from the vocabulary
        """
        matrix_length = len(self.vocab)
        embedding_matrix = np.zeros((matrix_length, embedding_dim))
        # If I use torch.zeros directly it crashes (don't know why)
        embedding_matrix = torch.from_numpy(embedding_matrix.copy())
        null_embedding = torch.tensor([0.0]*embedding_dim)
        for idx, key in enumerate(self.vocab.keys()):
            if torch.equal(embedding[key], null_embedding):
                embedding_matrix[idx] = torch.randn(embedding_dim)
            else:
                embedding_matrix[idx] = embedding[key]
                
        return embedding_matrix
    
    def get_fasttext_embedding_matrix(self, embedding, embedding_dim):
        matrix_length = len(self.vocab)
        embedding_matrix = np.zeros((matrix_length, embedding_dim))
        # If I use torch.zeros directly it crashes (don't know why)
        embedding_matrix = torch.from_numpy(embedding_matrix.copy())
        null_embedding = torch.tensor([0.0]*embedding_dim)
        for idx, key in enumerate(self.vocab.keys()):
            tensor_embedding = torch.from_numpy(embedding[key].copy())
            if torch.equal(tensor_embedding, null_embedding):
                embedding_matrix[idx] = torch.randn(embedding_dim)
            else:
                embedding_matrix[idx] = tensor_embedding
                
        return embedding_matrix
    
    def get_indexed_corpus(self):
        """
        Returns
        -------
        Dictionary
            Containing correspondences word -> index
        
        list(list(torch.tensor))
            The corpus represented as indexes corresponding to each word
        """
        vocab = {}
        for idx, key in enumerate(self.vocab.keys()):
            vocab[key] = idx
        
        indexed_corpus = [torch.tensor([torch.tensor(vocab[w], dtype=torch.int32) for w in doc]) for doc in self.processed_corpus]
        return indexed_corpus, self.labels


    def _create_vocab(self):
        vocab = dict()
        for word in self.corpus_words:
            try:
                vocab[word] += 1
            except:
                vocab[word] = 1
        return vocab

    def __len__(self):
        return len(self.flattened_corpus)


In [6]:
from torch.utils.data import Dataset
from torchtext.vocab import GloVe

class MovieReviewsDataset(Dataset):
  def __init__(self, raw_dataset):
    super(MovieReviewsDataset, self).__init__()
    self.corpus = np.array(raw_dataset[0], dtype = object)
    self.targets = np.array(raw_dataset[1], dtype = np.int32)
    self.max_element = len(max(self.corpus, key=lambda x: len(x)))

  def __len__(self):
    return len(self.corpus)
  
  def __getitem__(self, index):
    item = self.corpus[index]
    label = self.targets[index]
    return (item, label)

### Create the model class
Let's first try with a simple BiLSTM

In [7]:
import torch.nn as nn
from torch.autograd import Variable
from torch.nn.utils.rnn import pad_packed_sequence
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
import torch

class BiLSTM(nn.Module):
    def __init__(self, embedding_matrix = None, device = "cuda", input_size = 300, hidden_size = 128, output_size = 2):
        super(BiLSTM, self).__init__()

        self.hidden_size = hidden_size
        self.device = device
        if embedding_matrix != None:
          self.embedding = self.create_embedding_layer(embedding_matrix)
        else:
          self.embedding = None
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first = True, bidirectional=True)
        self.fc = nn.Sequential(nn.ReLU(),
                                nn.BatchNorm1d(hidden_size*2, eps = 1e-08),
                                nn.Dropout(0.3),
                                nn.Linear(hidden_size*2, output_size)
                                )

    def create_embedding_layer(self, embedding_matrix):
        num_embeddings, embedding_dim = embedding_matrix.shape
        emb_layer = nn.Embedding(num_embeddings, embedding_dim, -1)
        emb_layer.load_state_dict({"weight": embedding_matrix})
        return emb_layer

    # function taken from https://discuss.pytorch.org/t/how-to-use-pack-sequence-if-we-are-going-to-use-word-embedding-and-bilstm/28184/4
    def simple_elementwise_apply(self, fn, packed_sequence):
        """applies a pointwise function fn to each element in packed_sequence"""
        return torch.nn.utils.rnn.PackedSequence(fn(packed_sequence.data), packed_sequence.batch_sizes)

    def init_hidden(self, batch_size):
        if self.cuda:
            return (torch.zeros(2, batch_size, self.hidden_size).to(self.device),
                    torch.zeros(2, batch_size, self.hidden_size).to(self.device),)

    def forward(self, x):
        batch_size = x.batch_sizes[0].item()
        hidden = self.init_hidden(batch_size)

        x = self.simple_elementwise_apply(self.embedding, x)

        # output: batch_size, sequence_length, hidden_size * 2 (since is bilstm)
        out, _ = self.lstm(x, hidden)
        out, input_sizes = pad_packed_sequence(out, batch_first=True)
        # Interested only in the last layer
        out = out[list(range(batch_size)), input_sizes - 1, :]
        out = self.fc(out)

        return out

class BiLSTMAttention(BiLSTM):
    # BiLSTM with attention inspired by the following paper: https://aclanthology.org/S18-1040.pdf
    def __init__(self, embedding_matrix = None, device="cuda", input_size=300,
                 hidden_size=128, context_size = None, output_size=2):
        super(BiLSTMAttention, self).__init__(embedding_matrix, device, input_size, hidden_size, output_size)
        # Not self attention :)
        if context_size != None:
          self.attention = nn.Linear(self.hidden_size * 2, context_size)
          self.history = nn.Parameter(torch.randn(context_size))
        else:
          self.attention = nn.Linear(self.hidden_size * 2, 1)
          self.history = None

    def forward(self, x):

        batch_size = x.batch_sizes[0].item()
        hidden = self.init_hidden(batch_size)

        if self.embedding != None:
          x = self.simple_elementwise_apply(self.embedding, x)

        # output: batch_size, sequence_length, hidden_size * 2 (since is bilstm)
        out, _ = self.lstm(x, hidden)
        out, input_sizes = pad_packed_sequence(out, batch_first=True)

        if self.history == None:
          attention_values = torch.tanh(self.attention(out)).squeeze()
          attention_weights = torch.softmax(attention_values, dim = 1).unsqueeze(1)
          # n_docs, sequence_length
        else:
          attention_values = torch.tanh(self.attention(out))
          attention_weights = torch.softmax(attention_values.matmul(self.history), dim = 1).unsqueeze(1)
          # n_docs, sequence_length

        out = torch.sum(attention_weights.matmul(out), dim = 1)

        out = self.fc(out)

        attention_weights = attention_weights.squeeze()
        att = [doc[:input_sizes[idx]] for idx, doc in enumerate(attention_weights)]

        return out, att

    


In [8]:
def training_step(net, data_loader, optimizer, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  net.train()

  for batch_idx, (inputs, targets, _) in enumerate(data_loader):

    inputs = inputs.to(device)
    targets = targets.to(device)
    in_size = targets.size(dim=0)

    outputs, _ = net(inputs)

    loss = cost_function(outputs, targets)

    loss.backward()

    optimizer.step()

    optimizer.zero_grad()
    
    samples += in_size
    cumulative_loss += loss.item()
    _, predicted = outputs.max(dim=1)

    cumulative_accuracy += predicted.eq(targets).sum().item()

  return cumulative_loss/samples, (cumulative_accuracy/samples)*100

In [9]:
def test_step(net, data_loader, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  net.eval()

  with torch.no_grad():

    for batch_idx, (inputs, targets, _) in enumerate(data_loader):
      inputs = inputs.to(device)
      targets = targets.to(device)
      in_size = targets.size(dim=0)

      outputs, _ = net(inputs)

      loss = cost_function(outputs, targets)

      samples += in_size
      cumulative_loss += loss.item()
      _, predicted = outputs.max(dim=1)

      cumulative_accuracy += predicted.eq(targets).sum().item()

    return cumulative_loss/samples, (cumulative_accuracy/samples)*100


In [10]:
from torch.utils.data import DataLoader
from torch.optim import Adam
import torch.nn as nn

def main(train_loader, test_loader, embedding_matrix, device = "cuda", epochs = 10):

  net = BiLSTMAttention(embedding_matrix, device = device, input_size=300).to(device)

  optimizer = Adam(net.parameters(), 0.001, betas = (0.9, 0.9), amsgrad=True)

  cost_function = nn.CrossEntropyLoss()

  for e in range(epochs):
    print(f"epoch {e}:")
    train_loss, train_accuracy = training_step(net, train_loader, optimizer, cost_function, device)
    print(f"Training loss: {train_loss} \n Training accuracy: {train_accuracy}")
    test_loss, test_accuracy = test_step(net, test_loader, cost_function, device)
    print(f"Test loss: {test_loss} \n Test accuracy: {test_accuracy}")
    print("------------------------------------------------------------------")
  
  _, test_accuracy = test_step(net, test_loader, cost_function, device)

  return test_accuracy


In [11]:
from typing import List
from torch.nn.utils.rnn import pack_padded_sequence
from torch.utils.data import Subset
from sklearn.model_selection import train_test_split

def pad(batch, max_size):
    # try:
    pad = torch.tensor([-1]*batch[0].size(dim=1), dtype = torch.float).to("cuda")
    embedded = 1
    # except:
    #  pad = torch.tensor([-1])
    #  embedded = 0
    for idx in range(len(batch)):
        remaining = max_size - batch[idx].size(dim = 0)
        abc = pad.repeat(remaining)
        if embedded:
          batch[idx] = torch.cat((batch[idx], pad.repeat(remaining, 1)), dim = 0)
        else:
          batch[idx] = torch.cat((batch[idx], pad.repeat(remaining)), dim = 0)
    return batch

def batch_to_tensor(X: List[torch.tensor], max_size):
    # try:
    X_tensor = torch.zeros((len(X), max_size, X[0].size(dim=1)), dtype=torch.float).to("cuda")
    # except:
    #  X_tensor = torch.zeros((len(X), max_size), dtype=torch.int32)

    for i, embed in enumerate(X):
        X_tensor[i] = embed
    return X_tensor

def sort_ds(X, Y):
    """
    Sort inputs by document lengths
    """
    document_lengths = np.array([tens.size(dim = 0) for tens in X])
    indexes = np.argsort(document_lengths)
    document_lengths = document_lengths.tolist()

    X_sorted = [X[idx] for idx in indexes][::-1]
    Y_sorted = [Y[idx] for idx in indexes][::-1]
    document_lengths = torch.tensor([document_lengths[idx] for idx in indexes][::-1])

    return X_sorted, Y_sorted, document_lengths, indexes



def collate(batch):
    X, Y = list(zip(*batch))
    # Sort dataset
    X, Y, document_lengths, indexes = sort_ds(X, Y)

    # Get tensor sizes
    max_size = torch.max(document_lengths).item()

    # Pad tensor each element
    X = pad(X, max_size)

    # Transform the batch to a tensor
    X_tensor = batch_to_tensor(X, max_size)
    Y_tensor = torch.tensor(Y)
    # Return the padded sequence object
    X_final = pack_padded_sequence(X_tensor, document_lengths, batch_first=True)
    return X_final, Y_tensor, indexes


def get_data(batch_size: int, dataset, collate_fn, random_state = 42):
  # Random Split
  train_indexes, test_indexes = train_test_split(list(range(len(dataset.targets))), test_size = 0.2,
                                                  stratify = dataset.targets, random_state = random_state)

  train_ds = Subset(dataset, train_indexes)
  test_ds = Subset(dataset, test_indexes)

  train_loader = DataLoader(train_ds, batch_size = batch_size, collate_fn = collate_fn, pin_memory=True)
  test_loader = DataLoader(test_ds, batch_size = batch_size, collate_fn = collate_fn, pin_memory=True)

  return train_loader, test_loader

In [12]:
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

from torch.utils.data import Dataset


# Workaround in order to use .targets to access labels of the subset (doesn't work with Subset pytorch class)
# https://discuss.pytorch.org/t/attributeerror-subset-object-has-no-attribute-targets/66564
class CustomSubset(Dataset):
    """
    Subset of a dataset at specified indices.

    Arguments:
        dataset (Dataset): The whole Dataset
        indices (sequence): Indices in the whole set selected for subset
        labels(sequence) : targets as required for the indices. will be the same length as indices
    """
    def __init__(self, dataset, indices, labels):
        self.dataset = torch.utils.data.Subset(dataset, indices)
        self.targets = labels
    def __getitem__(self, idx):
        item = self.dataset[idx][0]
        target = self.targets[idx]
        return (item, target)

    def __len__(self):
        return len(self.targets)

def to_categorical(y, num_classes):
    """ 1-hot encodes a tensor """
    return np.eye(num_classes, dtype='uint8')[y]

def main_cross_validation(dataset, embedding_matrix, collate_fn,
                          device = "cuda", epochs = 20, random_state = 42, batch_size = 128):


  train_indexes, test_indexes = train_test_split(list(range(len(dataset.targets))), test_size = 0.2,
                                                  stratify = dataset.targets, random_state = random_state)

  train_targets = np.asarray(dataset.targets[train_indexes], dtype=np.int64)
  test_targets = np.asarray(dataset.targets[test_indexes], dtype=np.int64)

  # I use ds and set because the first means that the dataset should be splitted again (train + val),
  # the latter means that it is ready to use
  train_ds = CustomSubset(dataset, train_indexes, train_targets)
  test_set = CustomSubset(dataset, test_indexes, test_targets)
  test_loader = DataLoader(test_set, batch_size = batch_size, collate_fn = collate_fn, pin_memory=True)

  skf = StratifiedKFold(5, shuffle = True, random_state=random_state)

  fold_accuracies = []
  
  for fold, (train_indexes, val_indexes) in enumerate(skf.split(np.zeros(len(train_ds)),
                                                      train_targets)):
    
    net = BiLSTMAttention(embedding_matrix, device = device, input_size=300).to(device)
    optimizer = Adam(net.parameters(), 0.001, betas = (0.9, 0.9), amsgrad=True)
    cost_function = nn.CrossEntropyLoss()
    
    train_set = Subset(train_ds, train_indexes)
    val_set = Subset(train_ds, val_indexes)

    train_loader = DataLoader(train_set, batch_size = batch_size, collate_fn = collate_fn, pin_memory=True)
    val_loader = DataLoader(val_set, batch_size = batch_size, collate_fn = collate_fn, pin_memory = True)


    for e in range(epochs):
      print(f"epoch {e}:")
      train_loss, train_accuracy = training_step(net, train_loader, optimizer, cost_function, device)
      print(f"Training loss: {train_loss} \n Training accuracy: {train_accuracy}")
      val_loss, val_accuracy = test_step(net, val_loader, cost_function, device)
      print(f"Val loss: {val_loss} \n Val accuracy: {val_accuracy}")
      print("------------------------------------------------------------------")
    
    fold_accuracies.append(val_accuracy)

  _, test_accuracy = test_step(net, test_loader, cost_function, device)

  fold_accuracies = np.array(fold_accuracies)

  return test_accuracy, fold_accuracies.mean(), fold_accuracies.std()



## Lexicon Based Supervised Attention Model

In [13]:
class MRPipelineLBSA(MRAbstractPipeline):
    """
    Pipeline for documents represented as list of tokens
    """
    def __init__(self):
        super(MRPipelineLBSA, self).__init__()
        self.pipeline = [self.remove_underscores, 
                         self.reducing_character_repetitions,
                         self.clean_contractions,
                         self.clean_special_chars,
                         self.remove_stop_words]

    def remove_underscores(self, corpus):
        """
        Solves the problem where some of the words are surrounded by underscores
        (e.g. "_hello_")
        """

        for doc in corpus:
          for sent in doc:
            for idx, word in enumerate(sent):
                if "_" in word:
                    cleaned_word = self._clean_word(word)
                    sent[idx] = cleaned_word
        return corpus


    def _clean_word(self, word: str):
        word = word.replace("_", " ")
        # remove spaces before and after the word
        word = word.split()
        word = " ".join(word)
        return word
    
    def reducing_character_repetitions(self, corpus):
        
        new_corpus = [[[self._clean_repetitions(w) for w in sent] for sent in doc] for doc in corpus]
        return new_corpus

    # inspired by https://towardsdatascience.com/cleaning-preprocessing-text-data-by-building-nlp-pipeline-853148add68a
    def _clean_repetitions(self, word):
        """
        This Function will reduce repetition to two characters 
        for alphabets and to one character for punctuations.

        Parameters
        ----------
            word: str                
        Returns
        -------
        str
            Finally formatted text with alphabets repeating to 
            one characters & punctuations limited to one repetition 
            
        Example:
        Input : Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)
        Output : Really, Great !?.;:)

        """
        # Pattern matching for all case alphabets
        pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)

        # Limiting all the repetitions to two characters.
        # MODIFIED: keep only one repetition of the character
        formatted_text = pattern_alpha.sub(r"\1\1", word) 

        # Pattern matching for all the punctuations that can occur
        pattern_punct = re.compile(r'([.,/#!$%^&*?;:{}=_`~()+-])\1{1,}')

        # Limiting punctuations in previously formatted string to only one.
        combined_formatted = pattern_punct.sub(r'\1', formatted_text)

        # The below statement is replacing repetitions of spaces that occur more than two times with that of one occurrence.
        final_formatted = re.sub(' {2,}',' ', combined_formatted)
        return final_formatted
    
    def clean_contractions(self, corpus):
        new_corpus = []
        for doc in corpus:
          new_doc = []
          for sent in doc:
            new_sent = []
            for word in sent:
                try:
                    correct = CONTRACTION_MAP[word]
                    correct = correct.split()
                    new_sent += correct
                except:
                    new_sent.append(word)
            new_doc.append(new_sent)
          new_corpus.append(new_doc)
        return new_corpus

    def clean_special_chars(self, corpus):
        new_corpus = [[[self._clean_special_word(w) for w in sent] for sent in doc] for doc in corpus] 
        return new_corpus
    
    def _clean_special_word(self, word):
        # The formatted text after removing not necessary punctuations.
        formatted_text = re.sub(r"[^a-zA-Z0-9:€$-,%.?!]+", '', word) 
        # In the above regex expression,I am providing necessary set of punctuations that are frequent in this particular dataset.
        return formatted_text
    
    def remove_stop_words(self, corpus):
      stops = stopwords.words("english")
      # We don't want to remove stop words associated with negations
      stops = [word for word in stops if "'t" not in word or "not" not in word]
      return [[[word for word in sent if word not in stops] for sent in doc] for doc in corpus]

In [14]:

class MovieReviewsCorpusLBSA():
  def __init__(self, preprocess_pipeline = None):
      """
      If non preprocess_pipeline is given, the text gets tokenized by default
      using spacy tokenizer
      """
      self.mr = movie_reviews
      if preprocess_pipeline != None and not isinstance(preprocess_pipeline, MRPipelineLBSA):
          raise ValueError(f"preprocess_pipeline is not valid, you should pass \
                              a MRPipelineLBSA object or None")
      self.pipeline = preprocess_pipeline
      self.raw_corpus, self.labels = self._get_raw_corpus()
      if self.pipeline == None:
          self.processed_corpus = self.raw_corpus
      else:
          # Flattened and preprocessed corpus
          self.processed_corpus = self._preprocess()
      
      self.vocab = self._create_vocab()
      

  def _get_raw_corpus(self):
      neg = self.mr.paras(categories = "neg")
      pos = self.mr.paras(categories = "pos")
      labels = [0]*len(neg) + [1]*len(pos)
      return neg + pos, labels
  
  def _preprocess(self):
      if self.pipeline != None:
          return self.pipeline(self.raw_corpus)
      else:
          return self.raw_corpus
      
  def _create_vocab(self):
      vocab = dict()
      corpus_words = [w for doc in self.processed_corpus for sent in doc for w in sent]
      for word in corpus_words:
          try:
              vocab[word] += 1
          except:
              vocab[word] = 1
      return vocab

  def get_embedding_matrix(self, embedding, embedding_dim):
      """
      Returns
      -------
      np.ndarray
          A 2D which each row has the corresponding embedding from the vocabulary
      """
      matrix_length = len(self.vocab)
      embedding_matrix = np.zeros((matrix_length, embedding_dim))
      # If I use torch.zeros directly it crashes (don't know why)
      embedding_matrix = torch.from_numpy(embedding_matrix.copy())
      null_embedding = torch.tensor([0.0]*embedding_dim)
      for idx, key in enumerate(self.vocab.keys()):
          if torch.equal(embedding[key], null_embedding):
              embedding_matrix[idx] = torch.randn(embedding_dim)
          else:
              embedding_matrix[idx] = embedding[key]
              
      return embedding_matrix
  
  def get_indexed_corpus(self):
      """
      Returns
      -------
      Dictionary
          Containing correspondences word -> index
      
      list(int)
          labels associated with each document
      """
      vocab = {}
      for idx, key in enumerate(self.vocab.keys()):
          vocab[key] = idx
      
      # each doc is a list of tensor which represent sentences, each sentence is a tensor of indexed words
      indexed_corpus = [[torch.tensor([vocab[w] for w in sent], dtype=torch.int32) 
                        for sent in doc]
                        for doc in self.processed_corpus]
      return indexed_corpus, self.labels
  
  def __len__(self):
      return len(self.processed_corpus)

c = MovieReviewsCorpusLBSA()

In [15]:
from nltk.corpus import sentiwordnet as swn
import math

class MovieReviewsDatasetLBSA(Dataset):
  def __init__(self, corpus):
    super(MovieReviewsDatasetLBSA, self).__init__()
    self.corpus = corpus
    indexed_corpus = self.corpus.get_indexed_corpus()
    # Word level gold attention vector
    self.word_lambda = 3
    self.sentence_lambda = 3
    self.sentiment_degree = self._compute_sentiment_degree()
    self.wl_gold_av = self._compute_gold_words()
    self.sl_gold_av = self._compute_gold_sents()
    self.data = indexed_corpus[0]
    self.targets = indexed_corpus[1]
  
  def _compute_sentiment_degree(self):
    vocab = self._build_senti_vocab(self.corpus.vocab)
    corpus = self.corpus.processed_corpus
    scores = [[[vocab[word] for word in sent] for sent in doc] for doc in corpus]
    return scores

  def _compute_gold_sents(self):
    sentence_sentiment_degree  = [[sum(sent)/len(sent) for sent in doc] for doc in self.sentiment_degree]
    gold = [self._normalized_softmax(doc, self.sentence_lambda) for doc in sentence_sentiment_degree]
    return gold


  def _compute_gold_words(self):
    gold = [[self._normalized_softmax(sent_scores, self.word_lambda) for sent_scores in doc] for doc in self.sentiment_degree]
    return gold

  def _normalized_softmax(self, sequence, lam):
    multiplied_sequence = [lam * el for el in sequence]
    total = sum([math.exp(el) for el in sequence])
    res = torch.tensor([math.exp(lam * el)/total for el in sequence])
    return res

  def _build_senti_vocab(self, vocab):
    for key in vocab.keys():
      vocab[key] = 0

    max_value = 0
    for key in vocab.keys():
      senses = list(swn.senti_synsets(key))
      pos = 0
      neg = 0
      for sense in senses:
        pos += sense.pos_score()
        neg += sense.neg_score()
      if (pos != 0) or (neg != 0):
        vocab[key] = max(pos, neg)
      if vocab[key] > max_value:
        max_value = vocab[key]

    for key in vocab.keys():
      vocab[key] = self.maprange((0, max_value), (0, 1), vocab[key])

    return vocab
  
  def maprange(self, a, b, s):
    """
    Maps the number s from range a = [a1, a2] to range b = [b1, b2]
    """
    # Source: https://rosettacode.org/wiki/Map_range#Python
    (a1, a2), (b1, b2) = a, b
    return  b1 + ((s - a1) * (b2 - b1) / (a2 - a1))


  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    item = self.data[index]
    label = self.targets[index]
    gold_word = self.wl_gold_av[index]
    gold_sent = self.sl_gold_av[index]
    return (item, label, gold_word, gold_sent)

corpus = MovieReviewsCorpusLBSA()
ds = MovieReviewsDatasetLBSA(corpus)
print(type(ds.sl_gold_av[1]))

<class 'torch.Tensor'>


To understand:
- If it is better to introduce intermediate supervision

- If it is better to use one hot encoding for the output

- If I intepreted well the word-loss

In [16]:
class EncoderLBSA(BiLSTMAttention):
    # Lexicon Based Supervised Attention model (LBSA) inspired by the following paper: https://aclanthology.org/C18-1074.pdf
    def __init__(self, embedding_matrix, device="cuda", input_size=300,
                 hidden_size=128, context_size = 150, output_size=2):

        super(EncoderLBSA, self).__init__(embedding_matrix, device, input_size, hidden_size, context_size, output_size)

    # TODO: Pass the part inside for to super.forward()  
    def forward(self, x):
      att = []
      res = []
      for i, doc in enumerate(x):
        doc = doc.to(self.device)
        batch_size = doc.batch_sizes[0].item()
        hidden = self.init_hidden(batch_size)

        doc = self.simple_elementwise_apply(self.embedding, doc)

        out, _ = self.lstm(doc, hidden)
        out, input_sizes = pad_packed_sequence(out, batch_first=True)
        # n_sents, n_words_per_sent, hidden_size * 2 (since is bilstm)


        attention_values = torch.tanh(self.attention(out))
        # n_sents, n_words_per_sent, context_size

        attention_weights = torch.softmax(attention_values.matmul(self.history), dim = 1).unsqueeze(1)
        # n_sents, n_words_per_sent

        out = torch.sum(attention_weights.matmul(out), dim = 1)
        # n_sents, hidden*2
        
        attention_weights = attention_weights.squeeze(dim=1)

        att.append([sent[:input_sizes[idx]] for idx, sent in enumerate(attention_weights)])

        res.append(out)
      # n_doc, seq_lengths, hidden * 2
      return res, att

In [29]:
def sortLBSA(X, w_gold, s_gold):

  sentence_lengths = [np.array([sent.size(dim=0) for sent in doc]) for doc in X]
  indexes = [np.argsort(doc) for doc in sentence_lengths]
  indexes = [el.tolist() for el in indexes]

  X_sorted = [[doc[idx2] for idx2 in indexes[idx]][::-1] for idx, doc in enumerate(X)]
  w_gold = [[doc[idx2] for idx2 in indexes[idx]][::-1] for idx, doc in enumerate(w_gold)]
  s_gold = [torch.tensor([doc[idx2] for idx2 in indexes[idx]][::-1]) for idx, doc in enumerate(s_gold)]
  sentence_lengths = [[doc[idx2] for idx2 in indexes[idx]][::-1] for idx, doc in enumerate(sentence_lengths)]

  return X_sorted, w_gold, s_gold, sentence_lengths, indexes

def padLBSA(batch, max_sizes):
    pad = torch.tensor([-1])
    for idx1, doc in enumerate(batch):
      for idx2, sent in enumerate(doc):
        remaining = max_sizes[idx1] - sent.size(dim = 0)
        batch[idx1][idx2] = torch.cat((sent, pad.repeat(remaining)), dim = 0)
    return batch

def to_tensorLBSA(batch, max_sizes):
  res = []
  for idx, doc in enumerate(batch):
    buff = torch.zeros(len(doc), max_sizes[idx], dtype=torch.int32)
    for idx2, sent in enumerate(doc):
      buff[idx2] = sent

    res.append(buff)
  return res

def collateLBSA(batch):
  X, Y, w_gold, s_gold = list(zip(*batch))
  X, w_gold, s_gold, sentence_lengths, indexes = sortLBSA(X, w_gold, s_gold)
  # can take doc[0] since senetence_lengths is sorted
  max_sizes = [doc[0] for doc in sentence_lengths]

  # Pad tensor each element
  X = padLBSA(X, max_sizes)
  # Transform the batch to a tensor
  X = to_tensorLBSA(X, max_sizes)

  # Return the padded sequence object
  X = [pack_padded_sequence(doc, sentence_lengths[idx], batch_first=True) for idx, doc in enumerate(X)]
  return X, Y, w_gold, s_gold, indexes

In [18]:
def loss_LBSA(outputs, targets, mu_w = 0.001, mu_s = 0.05):
  dec_output, w_att, s_att = outputs
  target, w_gold, s_gold = targets

  total_loss = 0
  ce = nn.CrossEntropyLoss()

  total_loss += ce(dec_output, target)

  w_loss = torch.mean(torch.tensor([
    torch.sum(torch.tensor([
        ce(w_att[idx1][idx2], sent) for idx2, sent in enumerate(doc)
    ])) * mu_w for idx1, doc in enumerate(w_gold)
  ]))
  total_loss += w_loss

  s_loss = torch.mean(torch.tensor([
      ce(s_att[idx], doc) * mu_s for idx, doc in enumerate(s_gold)
  ]))
  total_loss += s_loss

  return total_loss

In [19]:
def training_step_LBSA(encoder, decoder, data_loader, optimizer, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  encoder.train()
  decoder.train()

  for batch_idx, (inputs, target, w_gold, s_gold, _) in enumerate(data_loader):
    
    in_size = len(target)

    enc_output, w_att = encoder(inputs)

    # for i, doc in enumerate(enc_output):
        # for j, sent in enumerate(doc):
          # enc_output[i][j] = sent.cpu()
    
    batch = [(el, target[idx]) for idx, el in enumerate(enc_output)]

    dec_input, target, indexes = collate(batch)
    
    s_gold = [s_gold[idx] for idx in indexes][::-1]
    # dec_input = dec_input.to(device)
    target = target.to(device)
    for idx1, doc in enumerate(w_gold):
      for idx2, sent in enumerate(doc):
        w_gold[idx1][idx2] = sent.to(device)
    
    for idx, doc in enumerate(s_gold):
       s_gold[idx] = doc.to(device)
    

    dec_output, s_att = decoder(dec_input)

    outputs = (dec_output, w_att, s_att)
    targets = (target, w_gold, s_gold)

    loss = cost_function(outputs, targets)

    loss.backward()

    optimizer.step()

    optimizer.zero_grad()
    
    samples += in_size
    cumulative_loss += loss.item()
    _, predicted = dec_output.max(dim=1)

    cumulative_accuracy += predicted.eq(target).sum().item()

  return cumulative_loss/samples, (cumulative_accuracy/samples)*100

In [27]:
def test_step_LBSA(encoder, decoder, data_loader, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  encoder.eval()
  decoder.eval()

  with torch.no_grad():

    for batch_idx, (inputs, target, w_gold, s_gold, _) in enumerate(data_loader):
      in_size = len(target)

      enc_output, w_att = encoder(inputs)
      
      batch = [(el, target[idx]) for idx, el in enumerate(enc_output)]

      dec_input, target, indexes = collate(batch)

      s_gold = [s_gold[idx] for idx in indexes][::-1]
      # Not sorting also w_gold becuse in the encoder, documents don't get shuffled

      target = target.to(device)
      for idx1, doc in enumerate(w_gold):
        for idx2, sent in enumerate(doc):
          w_gold[idx1][idx2] = sent.to(device)
    
      for idx, doc in enumerate(s_gold):
        s_gold[idx] = doc.to(device)

      dec_output, s_att = decoder(dec_input)

      outputs = (dec_output, w_att, s_att)
      targets = (target, w_gold, s_gold)

      loss = cost_function(outputs, targets)
      
      samples += in_size
      cumulative_loss += loss.item()
      _, predicted = dec_output.max(dim=1)

      cumulative_accuracy += predicted.eq(target).sum().item()

    return cumulative_loss/samples, (cumulative_accuracy/samples)*100

In [None]:
def training_step_LBSA_new(encoder, decoder, data_loader, optimizer, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  encoder.train()
  decoder.train()

  for batch_idx, (inputs, target, w_gold, s_gold, sent_indexes) in enumerate(data_loader):
    
    in_size = len(target)

    enc_output, w_att = encoder(inputs)

    #n_doc, n_sents, hidden*2
    
    # Sorting the output of the encoder after 
    enc_output = [enc_output[doc_idx][idx] for doc_idx in sent_indexes for idx in doc_idx ]

    # for i, doc in enumerate(enc_output):
        # for j, sent in enumerate(doc):
          # enc_output[i][j] = sent.cpu()
    
    batch = [(el, target[idx]) for idx, el in enumerate(enc_output)]

    dec_input, target, indexes = collate(batch)
    
    s_gold = [s_gold[idx] for idx in indexes][::-1]
    # dec_input = dec_input.to(device)
    target = target.to(device)
    for idx1, doc in enumerate(w_gold):
      for idx2, sent in enumerate(doc):
        w_gold[idx1][idx2] = sent.to(device)
    
    for idx, doc in enumerate(s_gold):
       s_gold[idx] = doc.to(device)
    

    dec_output, s_att = decoder(dec_input)

    outputs = (dec_output, w_att, s_att)
    targets = (target, w_gold, s_gold)

    loss = cost_function(outputs, targets)

    loss.backward()

    optimizer.step()

    optimizer.zero_grad()
    
    samples += in_size
    cumulative_loss += loss.item()
    _, predicted = dec_output.max(dim=1)

    cumulative_accuracy += predicted.eq(target).sum().item()

  return cumulative_loss/samples, (cumulative_accuracy/samples)*100

In [21]:
from torch.utils.data import DataLoader
from torch.optim import Adam
import torch.nn as nn

def main_LBSA(train_loader, test_loader, embedding_matrix, device = "cuda", epochs = 10):

  encoder = EncoderLBSA(embedding_matrix = embedding_matrix, device = device, input_size=300, hidden_size=100).to(device)
  decoder = BiLSTMAttention(device = device, input_size=100*2, context_size = 150).to(device)

  optimizer = Adam(list(encoder.parameters()) + list(decoder.parameters()), 0.001, betas = (0.9, 0.999), amsgrad=True)

  cost_function = loss_LBSA

  for e in range(epochs):
    print(f"epoch {e}:")
    train_loss, train_accuracy = training_step_LBSA_new(encoder, decoder, train_loader, optimizer, cost_function, device)
    print(f"Training loss: {train_loss} \n Training accuracy: {train_accuracy}")
    test_loss, test_accuracy = test_step_LBSA(encoder, decoder, test_loader, cost_function, device)
    print(f"Test loss: {test_loss} \n Test accuracy: {test_accuracy}")
    print("------------------------------------------------------------------")
  
  _, test_accuracy = test_step(encoder, decoder, test_loader, cost_function, device)

  return test_accuracy


In [22]:
global_vectors = GloVe(name='840B', dim=300, cache = "/content/gdrive/My Drive/nlu-project/.vector_cache")

In [23]:
mr_pipeline = MRPipelineLBSA()
corpus = MovieReviewsCorpusLBSA(mr_pipeline)

In [24]:
embedding_matrix = corpus.get_embedding_matrix(global_vectors, 300)
# ds = corpus.get_indexed_corpus()

In [25]:
dataset = MovieReviewsDatasetLBSA(corpus)
train_loader, test_loader = get_data(128, dataset, collate_fn=collateLBSA)

In [28]:
# 87.3, 86 e qualcosa
train_loader, test_loader = get_data(128, dataset, collate_fn=collateLBSA)
# Da fare una collate nuova direttamente che non vada a toccare i tensori esistenti.
# Oppure ragionare un attimo.... servono davvero i gradient per l'input? O posso fare il detach?
accuracy, mean, std = main_LBSA(train_loader, test_loader, embedding_matrix, device = "cuda", epochs = 20)
print(f"Overall accuracy: {accuracy}")
print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")

epoch 0:
Training loss: 0.007132519111037254 
 Training accuracy: 64.9375
Test loss: 0.009536461383104324 
 Test accuracy: 73.0
------------------------------------------------------------------
epoch 1:
Training loss: 0.004899058844894171 
 Training accuracy: 85.0
Test loss: 0.009184689819812774 
 Test accuracy: 61.75000000000001
------------------------------------------------------------------
epoch 2:
Training loss: 0.0035149676539003847 
 Training accuracy: 93.9375
Test loss: 0.008473468273878097 
 Test accuracy: 79.0
------------------------------------------------------------------
epoch 3:
Training loss: 0.0026734076254069806 
 Training accuracy: 98.875
Test loss: 0.007285471260547638 
 Test accuracy: 82.75
------------------------------------------------------------------
epoch 4:
Training loss: 0.002284518387168646 
 Training accuracy: 99.875
Test loss: 0.006816456913948059 
 Test accuracy: 81.0
------------------------------------------------------------------
epoch 5:
Train

KeyboardInterrupt: ignored

In [None]:
"""
tensor([1315, 1222, 1011, 1010,  936,  862,  814,  807,  807,  764,  718,  515,
         495,  388,  344,  323])
tensor([1617, 1361, 1311, 1178, 1081, 1068,  958,  941,  925,  768,  688,  619,
         604,  573,  484,  405])
"""

# First try to parse phrases documet-wise, then try to parse each phrase of a document separately, and then aggregate the result (if there are more positive phrases then positive, otherwise negative). (Try also to give a weight depending on the number of sentiment lexemes)