<a href="https://colab.research.google.com/github/Zinni98/Sentiment-analysis-project/blob/refactoring/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Polarity Classification

In [1]:
from google.colab import drive
drive.mount("/content/gdrive/")

Mounted at /content/gdrive/


In [2]:
import sys
sys.path.append("/content/gdrive/My Drive/nlu-project")

In [3]:
import nltk
import torch
from nltk.corpus import movie_reviews
import numpy as np
nltk.download("punkt")
nltk.download("movie_reviews")
nltk.download("subjectivity")
nltk.download("stopwords")
nltk.download("sentiwordnet")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package subjectivity to /root/nltk_data...
[nltk_data]   Unzipping corpora/subjectivity.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

## Exploratory analysis

First, let's explore the data:

In [None]:
mr = movie_reviews
neg = mr.paras(categories = "neg")
pos = mr.paras(categories = "pos")
print(f"length of each part of the dataset:\n - pos: {len(pos)} \n - neg: {len(neg)}")
print(pos)


It's easy to see that data comes in the following format:

- pos = [doc1, doc2, ..., doc1000] (the same applies for negative sentiment examples)

Where each doc has the following structure:

- doc1 = [sentence_1, sentence_2, ..., sentence_k]

Each sentence is a list of tokens, so the dataset is already tokenized.
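
As a minimal sketch of this nesting, the toy corpus below (invented for illustration, not taken from `movie_reviews`) mirrors the corpus → documents → sentences → tokens layout and shows how one document can be flattened into a single token list. The helper name `flatten_doc` is hypothetical.

```python
# Toy corpus (hypothetical data) with the same nesting as mr.paras():
# corpus -> documents -> sentences -> tokens
toy_pos = [
    [["a", "great", "film", "."], ["i", "loved", "it", "."]],  # doc1: two sentences
    [["solid", "acting", "."]],                                # doc2: one sentence
]


def flatten_doc(doc):
    """Concatenate the sentences of one document into a flat token list."""
    return [token for sent in doc for token in sent]


n_docs = len(toy_pos)            # number of documents
n_sents_doc1 = len(toy_pos[0])   # number of sentences in the first document
flat_doc1 = flatten_doc(toy_pos[0])
print(n_docs, n_sents_doc1, flat_doc1)
```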

### Word embedding
Since I'm going to use deep learning models, I need a word embedding to transform the text into vectors.
I'm going to start with a pretrained version of the GloVe word embedding.
Since a pretrained word embedding is basically a lookup table, I first check how many words of the corpus vocabulary it covers:

In [None]:
def create_vocab(corpus_words):
    # Map each word to the number of times it occurs in the corpus
    vocab = dict()
    for word in corpus_words:
      vocab[word] = vocab.get(word, 0) + 1
    return vocab

def get_corpus_words(corpus) -> list:
    return [w for doc in corpus for sent in doc for w in sent]

In [None]:
import operator
from tqdm import tqdm
from torchtext.vocab import GloVe
import torch

global_vectors = GloVe(name='840B', dim=300)

# function inspired by https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-when-using-embeddings/notebook
def check_coverage(vocab, embeddings_index):
    known = {}   # words that have an embedding
    oov = {}     # out-of-vocabulary words with their corpus frequency
    n_known = 0  # number of corpus tokens covered by the embedding
    n_oov = 0    # number of corpus tokens not covered
    # unknown tokens are returned as an all-zero vector
    null_embedding = torch.zeros(300)
    for word in tqdm(vocab):
        vector = embeddings_index.get_vecs_by_tokens(word)
        if torch.equal(vector, null_embedding):
            oov[word] = vocab[word]
            n_oov += vocab[word]
        else:
            known[word] = vector
            n_known += vocab[word]

    print()
    print(f'Found embeddings for {len(known) / len(vocab):.2%} of vocab')
    print(f'Found embeddings for  {n_known / (n_known + n_oov):.2%} of all text')
    # OOV words sorted by decreasing frequency
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

vocab = create_vocab(get_corpus_words(pos + neg))
oov = check_coverage(vocab, global_vectors)

In [None]:
oov

Next I look at the words that are not covered by the embedding (Out Of Vocabulary words), to see whether some techniques can be applied to improve coverage.
The majority of OOV words aren't related to a particular sentiment (they are basically nouns or some type of punctuation), so they can be safely removed. This is because unknown words are encoded as the zero vector $[0] * embedding.length$, so they add no useful information.
Other OOV words are regular words surrounded by underscores, so they are not recognized by the fixed word embedding. To avoid this problem I implemented a procedure to clean these words:

In [None]:
def remove_underscores(corpus):
  for doc in corpus:
    for sent in doc:
      for idx, word in enumerate(sent):
        if "_" in word:
          cleaned_word = _clean_word(word)
          sent[idx] = cleaned_word
  return corpus


def _clean_word(word: str):
  word = word.replace("_", " ")
  word = word.split()
  word = " ".join(word)
  return word
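
As a quick check of what `_clean_word` does, the sketch below re-implements the same three steps in a standalone function (named `clean_word` here purely for illustration) and applies it to a few made-up tokens. Note that a token with an inner underscore turns into a string containing a space, and a lone underscore turns into an empty string.

```python
def clean_word(word: str) -> str:
    # Underscores become spaces; split()/join() then strips the outer
    # spaces and collapses repeated ones.
    return " ".join(word.replace("_", " ").split())


examples = ["_hello_", "__great__", "well_known", "_"]  # made-up tokens
cleaned = [clean_word(w) for w in examples]
print(cleaned)
```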


In [None]:
corpus = pos + neg
clean_corpus = remove_underscores(corpus)
vocab = create_vocab(get_corpus_words(clean_corpus))
oov = check_coverage(vocab, global_vectors)

100%|██████████| 39519/39519 [00:01<00:00, 28083.73it/s]



Found embeddings for 92.48% of vocab
Found embeddings for  99.61% of all text
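
The two figures measure different things: the first is the share of distinct vocabulary entries that have an embedding, while the second weights every entry by its frequency in the corpus. A few rare OOV types therefore barely affect text coverage. A minimal sketch with a toy token list and a hypothetical set of known words (standing in for the GloVe lookup):

```python
from collections import Counter

# Toy data, invented for illustration
tokens = ["the", "movie", "was", "good", "the", "plot", "was", "_weird_", "the", "end"]
vocab = Counter(tokens)

# Hypothetical stand-in for "words that have a pretrained vector"
known_words = {"the", "movie", "was", "good", "plot", "end"}

covered_types = [w for w in vocab if w in known_words]
covered_tokens = sum(vocab[w] for w in covered_types)
total_tokens = sum(vocab.values())

vocab_coverage = len(covered_types) / len(vocab)  # share of distinct words
text_coverage = covered_tokens / total_tokens     # share of running text

print(f"vocab coverage: {vocab_coverage:.2%}")
print(f"text coverage:  {text_coverage:.2%}")
```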


In [4]:
from nltk.corpus import stopwords
import re
import spacy
from abc import ABC, abstractmethod

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report
from nltk.sentiment.util import mark_negation

from nltk.corpus import subjectivity


CONTRACTION_MAP =  {"ain't": "is not",
                        "aren't": "are not",
                        "can't": "cannot",
                        "can't've": "cannot have",
                        "'cause": "because",
                        "could've": "could have",
                        "couldn't": "could not",
                        "couldn't've": "could not have",
                        "didn't": "did not",
                        "doesn't": "does not",
                        "don't": "do not",
                        "hadn't": "had not",
                        "hadn't've": "had not have",
                        "hasn't": "has not",
                        "haven't": "have not",
                        "he'd": "he would",
                        "he'd've": "he would have",
                        "he'll": "he will",
                        "he'll've": "he will have",
                        "he's": "he is",
                        "how'd": "how did",
                        "how'd'y": "how do you",
                        "how'll": "how will",
                        "how's": "how is",
                        "i'd": "i would",
                        "i'd've": "i would have",
                        "i'll": "i will",
                        "i'll've": "i will have",
                        "i'm": "i am",
                        "i've": "i have",
                        "isn't": "is not",
                        "it'd": "it would",
                        "it'd've": "it would have",
                        "it'll": "it will",
                        "it'll've": "it will have",
                        "it's": "it is",
                        "let's": "let us",
                        "ma'am": "madam",
                        "mayn't": "may not",
                        "might've": "might have",
                        "mightn't": "might not",
                        "mightn't've": "might not have",
                        "must've": "must have",
                        "mustn't": "must not",
                        "mustn't've": "must not have",
                        "needn't": "need not",
                        "needn't've": "need not have",
                        "o'clock": "of the clock",
                        "oughtn't": "ought not",
                        "oughtn't've": "ought not have",
                        "shan't": "shall not",
                        "sha'n't": "shall not",
                        "shan't've": "shall not have",
                        "she'd": "she would",
                        "she'd've": "she would have",
                        "she'll": "she will",
                        "she'll've": "she will have",
                        "she's": "she is",
                        "should've": "should have",
                        "shouldn't": "should not",
                        "shouldn't've": "should not have",
                        "so've": "so have",
                        "so's": "so as",
                        "that'd": "that would",
                        "that'd've": "that would have",
                        "that's": "that is",
                        "there'd": "there would",
                        "there'd've": "there would have",
                        "there's": "there is",
                        "they'd": "they would",
                        "they'd've": "they would have",
                        "they'll": "they will",
                        "they'll've": "they will have",
                        "they're": "they are",
                        "they've": "they have",
                        "to've": "to have",
                        "wasn't": "was not",
                        "we'd": "we would",
                        "we'd've": "we would have",
                        "we'll": "we will",
                        "we'll've": "we will have",
                        "we're": "we are",
                        "we've": "we have",
                        "weren't": "were not",
                        "what'll": "what will",
                        "what'll've": "what will have",
                        "what're": "what are",
                        "what's": "what is",
                        "what've": "what have",
                        "when's": "when is",
                        "when've": "when have",
                        "where'd": "where did",
                        "where's": "where is",
                        "where've": "where have",
                        "who'll": "who will",
                        "who'll've": "who will have",
                        "who's": "who is",
                        "who've": "who have",
                        "why's": "why is",
                        "why've": "why have",
                        "will've": "will have",
                        "won't": "will not",
                        "won't've": "will not have",
                        "would've": "would have",
                        "wouldn't": "would not",
                        "wouldn't've": "would not have",
                        "y'all": "you all",
                        "y'all'd": "you all would",
                        "y'all'd've": "you all would have",
                        "y'all're": "you all are",
                        "y'all've": "you all have",
                        "you'd": "you would",
                        "you'd've": "you would have",
                        "you'll": "you will",
                        "you'll've": "you will have",
                        "you're": "you are",
                        "you've": "you have",
                    }

class PipelineElement(ABC):
  """
  Abstract class for the definition of each element
  """
  def __init__(self):
    pass

  @abstractmethod
  def __call__(self):
    pass


class Pipeline():
  """
  Pipeline class which collects pipeline elements (in the order given).
  This class implements __call__ method so it is a callable.
  When called it applies all the PipelineElements in the order given.
  """
  def __init__(self, *args):
    """
    Parameters
    ----------
    *args
      PipelineElements
    """
    self.pipeline = []
    for arg in args:
      self.add_pipeline_element(arg)

  def add_pipeline_element(self, element: PipelineElement, position: int = None):
    """
    Adds a new pipeline element to the pipeline
    
    Parameters
    ----------
    element : PipelineElement
      the element to be added to the pipeline
    
    position : int
      position in the pipeline where the element should be added
      position ranges from 0 to (n_elements - 1) where n_elements
      is the number of elements in the pipeline.
    Raises
    ------
    TypeError
      If the type of element is not PipelineElement
    """
    if not isinstance(element, PipelineElement):
      raise TypeError("Wrong element type, only PipelineElement subclasses can be added")
    # compare with None explicitly, otherwise position 0 would be treated as "no position"
    if position is not None:
      if position >= len(self.pipeline):
        raise ValueError("position index exceeds the length of the pipeline")
      self.pipeline.insert(position, element)
    else:
      self.pipeline.append(element)
  
  def pipe(self, corpus):
    """
    Applies each element in the pipeline

    Parameters
    ----------
    corpus : list
      list containing each document in the corpus
    """
    for el in self.pipeline:
        corpus = el(corpus)
    return corpus
  
  def get_elements(self):
    """
    Gives elements of the pipeline with respective index indicating the order
    in which elements are called

    Returns
    -------
    dict
      Where the key indicates the position of each element in the pipeline
      (i.e. execution order, where 0 is the first element of the pipeline
      being called) and the value indicates the actual element.
    """
    res = {}
    for idx, el in enumerate(self.pipeline):
      res[idx] = el
    return res
 
  def __call__(self, *args):
      if not args or args[0] is None:
          raise ValueError("Need a corpus as argument")
      corpus = args[0]
      return self.pipe(corpus)
  
  def __len__(self):
    return len(self.pipeline)
        
# Flattened Elements

class UnderscoreRemoverFlat(PipelineElement):
  """
  Assumes the corpus is flat (i.e. the corpus is a list of documents,
  each document is a list of words, therefore the document is not
  divided in sentences)
  """
  def __init__(self):
    super(UnderscoreRemoverFlat, self).__init__()

  def remove_underscores(self, corpus):
    """
    Solves the problem where some of the words are surrounded by underscores
    (e.g. "_hello_")

    Parameters
    ----------
    corpus : list of list
      corpus to be processed
    """
    for doc in corpus:
        for idx, word in enumerate(doc):
            if "_" in word:
                cleaned_word = self._clean_word(word)
                doc[idx] = cleaned_word
    return corpus


  def _clean_word(self, word: str):
    word = word.replace("_", " ")
    # in order to remove spaces before and after the word
    word = word.split()
    word = " ".join(word)
    return word

  def __call__(self, corpus):
    return self.remove_underscores(corpus)

class CharacterRepetitionRemoverFlat(PipelineElement):
  """
  Reduces repetition to two characters 
  for alphabets and to one character for punctuations.

  Examples
  --------
  >>> crr = CharacterRepetitionRemoverFlat()
  >>> crr([["Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)"]])
  [['Reallyy, Greeaatt !?.;:)']]
  """
  def __init__(self):
    super(CharacterRepetitionRemoverFlat, self).__init__()

  def reducing_character_repetitions(self, corpus):
    """
    Parameters
    ----------
      corpus : list of list
    Returns
    -------
    list of list
      Formatted text with alphabets repeating to 
      two characters & punctuations limited to one repetition 

    """
    new_corpus = []
    for doc in corpus:
        new_doc = [self._clean_repetitions(w) for w in doc]
        new_corpus.append(new_doc)
    return new_corpus

  # inspired by https://towardsdatascience.com/cleaning-preprocessing-text-data-by-building-nlp-pipeline-853148add68a
  def _clean_repetitions(self, word):
    # Pattern matching for all case alphabets
    pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)

    # Limiting all the repetitions to two characters.
    formatted_text = pattern_alpha.sub(r"\1\1", word) 

    # Pattern matching for all the punctuations that can occur
    pattern_punct = re.compile(r'([.,/#!$%^&*?;:{}=_`~()+-])\1{1,}')

    # Limiting punctuations in previously formatted string to only one.
    combined_formatted = pattern_punct.sub(r'\1', formatted_text)

    # Replace every run of two or more spaces with a single space.
    final_formatted = re.sub(' {2,}',' ', combined_formatted)
    return final_formatted

  def __call__(self, corpus):
    return self.reducing_character_repetitions(corpus)

class ApostrophesMergerFlat(PipelineElement):
  """
  Merges words like "don't" which in the original corpus are
  separated like: ["don", "'", "t"]

  Examples
  --------
  >>> am = ApostrophesMergerFlat()
  >>> am([["I", "'", "ve", "a", "pair", "of", "shoes"]])
  [["I've", "a", "pair", "of", "shoes"]]
  """

  def __init__(self):
    super(ApostrophesMergerFlat, self).__init__()

  def merge_apostrophes(self, corpus):
    """
    Parameters
    ----------
    corpus : list of list

    Returns
    -------
    list of list
      Formatted text where contractions are merged into one single word
    
    """
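    new_corpus = []
    for doc in corpus:
      indexes = self._get_neg_indexes(doc)
      # Merge from right to left: each merge shortens the list by two,
      # so going backwards keeps the remaining indexes valid
      for start, end in reversed(indexes):
        doc[start:end] = ["".join(doc[start:end])]
      new_corpus.append(doc)
    return new_corpus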

  def _get_neg_indexes(self, sent):

    # s not considered because contraction can be either "is", genitive or "has"
    contr = ["t", "ve", "re", "ll", "d", "all", "y", "cause", "m", "clock", "am"] #, "s"]
    indexes = []
    for idx, word in enumerate(sent):
      # Explicit bounds check: a "'" at the beginning or at the end of the phrase
      # has no neighbour on one side, so it is skipped
      if word == "'" and 0 < idx < len(sent) - 1 and sent[idx+1] in contr:
        indexes.append((idx-1, idx+2))
    return indexes
  
  def __call__(self, corpus):
    return self.merge_apostrophes(corpus)


class ContractionCleanerFlat(PipelineElement):
  """
  Clean all contractions by using a predefined contraction map

  Example
  -------
  >>> cc = ContractionCleanerFlat()
  >>> cc([["i", "don't", "know"]])
  [['i', 'do', 'not', 'know']]
  """
  def __init__(self):
    super(ContractionCleanerFlat, self).__init__()

  def clean_contractions(self, corpus):
    """
    Parameters
    ----------
    corpus : list of list

    Returns
    -------
    list of list
      Formatted text where each contraction is expanded into its separate words

    """
    new_corpus = []
    for doc in corpus:
      new_doc = []
      for word in doc:
        try:
            correct = CONTRACTION_MAP[word]
            correct = correct.split()
            new_doc += correct
        except:
            new_doc.append(word)
      new_corpus.append(new_doc)
    return new_corpus

  def __call__(self, corpus):
    return self.clean_contractions(corpus)

class SpecialCharsCleanerFlat(PipelineElement):
  """
  Removes all the characters matched by the following regex pattern
  (i.e. everything outside the listed character set): "[^a-zA-Z0-9:€$-,%?!]+"
  """
  def __init__(self):
    super(SpecialCharsCleanerFlat, self).__init__()

  def clean_special_chars(self, corpus):
    new_corpus = [[self._clean_special_word(w) for w in doc] for doc in corpus]
    # drop the tokens that became empty after cleaning
    new_corpus = [[w for w in doc if w] for doc in new_corpus]
    return new_corpus
    
  def _clean_special_word(self, word):
    # Remove every character outside the allowed set, which keeps the
    # punctuation that is frequent in this particular dataset.
    # Note: inside the character class "$-," is a range, so % & ' ( ) * + are kept too.
    formatted_text = re.sub(r"[^a-zA-Z0-9:€$-,%?!]+", '', word) 
    return formatted_text
  
  def __call__(self, corpus):
    return self.clean_special_chars(corpus)

class StopWordsRemoverFlat(PipelineElement):
  """
  Removes stopwords from the document.
  It doesn't remove stopwords that contain negations
  """

  def __init__(self):
    super(StopWordsRemoverFlat, self).__init__()

  def remove_stop_words(self, corpus):
    stops = stopwords.words("english")
    # keep negations in the text: drop from the stop list every word containing "'t" or "not"
    stops = [word for word in stops if "'t" not in word and "not" not in word]
    return [[word for word in doc if word not in stops] for doc in corpus]

  def __call__(self, corpus):
    return self.remove_stop_words(corpus)


################################################################################

# Non-flattened elements
"""
These items are the same as before, with the only difference that now
the corpus is assumed not to be flattened, so each document
is composed of several separate sentences:

Example
-------
corpus = [doc_1, doc_2, ...]
doc_1 = [sentence_1, sentence_2, ...]   (each sentence is a list of tokens)
"""
class UnderscoreRemover(PipelineElement):

  def __init__(self):
    super(UnderscoreRemover, self).__init__()

  def remove_underscores(self, corpus):
    """
    Solves the problem where some of the words are surrounded by underscores
    (e.g. "_hello_")
    """
    for doc in corpus:
      for sent_idx, sent in enumerate(doc):
        new_sent = []
        for idx, word in enumerate(sent):
          if "_" in word:
            cleaned_word = self._clean_word(word)
            new_sent += cleaned_word
          else:
            new_sent.append(word)
        if len(new_sent) > 0:
          doc[sent_idx] = new_sent
    return corpus

  def _clean_word(self, word: str):
    word = word.replace("_", " ")
    # remove spaces before and after the word
    word = word.split()
    return word

  def __call__(self, corpus):
    return self.remove_underscores(corpus)



class CharacterRepetitionRemover(PipelineElement):
  def __init__(self):
    super(CharacterRepetitionRemover, self).__init__()

  def reducing_character_repetitions(self, corpus):
      new_corpus = [[[self._clean_repetitions(w) for w in sent] for sent in doc] for doc in corpus]
      return new_corpus

  # inspired by https://towardsdatascience.com/cleaning-preprocessing-text-data-by-building-nlp-pipeline-853148add68a
  def _clean_repetitions(self, word):
    """
    This Function will reduce repetition to two characters 
    for alphabets and to one character for punctuations.

    Parameters
    ----------
        word: str                
    Returns
    -------
    str
        Formatted text with letters repeated at most 
        twice & punctuation limited to one occurrence 
        
    Example:
    Input : Realllllllllyyyyy,        Greeeeaaaatttt   !!!!?....;;;;:)
    Output : Reallyy, Greeaatt !?.;:)

    """
    # Pattern matching for all case alphabets
    pattern_alpha = re.compile(r"([A-Za-z])\1{1,}", re.DOTALL)

    # Limiting all the repetitions to two characters.
    formatted_text = pattern_alpha.sub(r"\1\1", word) 

    # Pattern matching for all the punctuations that can occur
    pattern_punct = re.compile(r'([.,/#!$%^&*?;:{}=_`~()+-])\1{1,}')

    # Limiting punctuations in previously formatted string to only one.
    combined_formatted = pattern_punct.sub(r'\1', formatted_text)

    # Replace every run of two or more spaces with a single space.
    final_formatted = re.sub(' {2,}',' ', combined_formatted)
    return final_formatted
  
  def __call__(self, corpus):
    return self.reducing_character_repetitions(corpus)


class ApostrophesMerger(PipelineElement):
  def __init__(self):
    super(ApostrophesMerger, self).__init__()

  def merge_apostrophes(self, corpus):
    new_corpus = []
    for doc in corpus:
      new_doc = []
      for sent in doc:
        indexes = self._get_neg_indexes(sent)
        # Merge from right to left so that the remaining indexes stay valid
        for start, end in reversed(indexes):
          sent[start:end] = ["".join(sent[start:end])]
        new_doc.append(sent)
      new_corpus.append(new_doc)
    return new_corpus

  def _get_neg_indexes(self, sent):
    contr = ["t", "ve", "re", "ll", "d", "all", "y", "cause", "m", "clock", "am"]#, "s"]
    indexes = []
    for idx, word in enumerate(sent):
      # Explicit bounds check: a "'" at the beginning or at the end of the phrase
      # has no neighbour on one side, so it is skipped
      if word == "'" and 0 < idx < len(sent) - 1 and sent[idx+1] in contr:
        indexes.append((idx-1, idx+2))
    # return only after the whole sentence has been scanned
    return indexes
  
  def __call__(self, corpus):
    return self.merge_apostrophes(corpus)


class ContractionCleaner(PipelineElement):

  def __init__(self):
    super(ContractionCleaner, self).__init__()

  def clean_contractions(self, corpus):
    new_corpus = []
    for doc in corpus:
      new_doc = []
      for sent in doc:
        new_sent = []
        for word in sent:
          try:
            correct = CONTRACTION_MAP[word]
            correct = correct.split()
            new_sent += correct
          except:
            new_sent.append(word)
        new_doc.append(new_sent)
      new_corpus.append(new_doc)
    return new_corpus

  def __call__(self, corpus):
    return self.clean_contractions(corpus)


class SpecialCharsCleaner(PipelineElement):
  def __init__(self):
    super(SpecialCharsCleaner, self).__init__()
  
  def clean_special_chars(self, corpus):
      for idx_doc, doc in enumerate(corpus):
        for sent_idx, sent in enumerate(doc):
          new_sent = []
          for word_idx, word in enumerate(sent):
            new_word = self._clean_special_word(word)
            if new_word != " ":
              new_sent += new_word.split()
          if len(new_sent) > 0:
            doc[sent_idx] = new_sent
      return corpus
    
  def _clean_special_word(self, word):
    # The formatted text after removing not necessary punctuations.
    formatted_text = re.sub(r"[^a-zA-Z0-9:€$-,%?!]+", ' ', word) 
    return formatted_text
  
  def __call__(self, corpus):
    return self.clean_special_chars(corpus)

class StopWordsRemover(PipelineElement):
  def __init__(self):
    super(StopWordsRemover, self).__init__()

  def remove_stop_words(self, corpus):
    stops = stopwords.words("english")
    # Don't want to remove stop words associated with negations
    stops = [word for word in stops if "'t" not in word and "not" not in word]
    return [[[word for word in sent if word not in stops] for sent in doc] for doc in corpus]

  def __call__(self, corpus):
    return self.remove_stop_words(corpus)

class ShallowObjectiveSentsRemover(PipelineElement):
  def __init__(self, threshold = .5, clf = MultinomialNB, trained = False):
    self.vectorizer = CountVectorizer()
    self.classifier = clf()
    if not trained:
      self.best_estimator = self._train()
    else:
      self.best_estimator = self.classifier
  
  def _train(self):
    subj = [sent for sent in subjectivity.sents(categories = 'subj')]
    obj = [sent for sent in subjectivity.sents(categories = 'obj')]

    corpus = [self.neg_marking_list2str(d) for d in subj] + [self.neg_marking_list2str(d) for d in obj]
    vectors = self.vectorizer.fit_transform(corpus)
    labels = np.array([1] * len(subj) + [0] * len(obj))
    scores = cross_validate(self.classifier, vectors, labels, cv=StratifiedKFold(n_splits=10) , scoring=['accuracy'], return_estimator=True)
    estimator = scores["estimator"][scores["test_accuracy"].argmax()]
    return estimator

  def neg_marking_list2str(self, sent):
    # takes a sentence (list of tokens), marks the tokens in the scope of a negation
    # and joins the result into a single string
    negated_doc = mark_negation(sent, double_neg_flip=True)
    return " ".join([w for w in negated_doc])
    

  def remove_objective_sents(self, corpus):
    transformed_corpus = [[self.vectorizer.transform([self.neg_marking_list2str(sent)]) for sent in doc] for doc in corpus]
    res = [[corpus[doc_idx][sent_idx] for sent_idx, sent in enumerate(doc) if self.best_estimator.predict(sent).item()]
           for doc_idx, doc in enumerate(transformed_corpus)]
    return res
  
  def __call__(self, corpus):
    return self.remove_objective_sents(corpus)

class Flattener(PipelineElement):
  def __init__(self):
    super().__init__()
  
  def flatten(self, corpus):
    corpus = [[w for sent in doc for w in sent] for doc in corpus]
    return corpus

  def __call__(self, corpus):
    return self.flatten(corpus)
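
To illustrate how these elements are meant to compose, here is a self-contained sketch of the same callable-chaining pattern. `Element`, `Lowercase`, `DropEmpty` and `MiniPipeline` are hypothetical stand-ins written for this example, not part of the project; the real elements defined above are passed to `Pipeline` in the same way, in the order in which they should run.

```python
from abc import ABC, abstractmethod


class Element(ABC):
    """Stand-in for PipelineElement: a callable that maps a corpus to a corpus."""

    @abstractmethod
    def __call__(self, corpus):
        ...


class Lowercase(Element):
    def __call__(self, corpus):
        return [[w.lower() for w in doc] for doc in corpus]


class DropEmpty(Element):
    def __call__(self, corpus):
        return [[w for w in doc if w] for doc in corpus]


class MiniPipeline:
    """Stand-in for Pipeline: applies the elements in the order given."""

    def __init__(self, *elements):
        self.elements = list(elements)

    def __call__(self, corpus):
        for el in self.elements:
            corpus = el(corpus)  # the output of one step feeds the next
        return corpus


toy_corpus = [["Great", "", "Movie"], ["BAD", "plot"]]  # flat toy corpus
result = MiniPipeline(Lowercase(), DropEmpty())(toy_corpus)
print(result)
```

Because every element takes a corpus and returns a corpus, steps can be reordered or removed without touching the others, as long as each one receives the nesting level (flat or sentence-split) it expects.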


### Corpus class
I'm going to create a class representing the corpus, so that all the information about its attributes (vocab, words, ...) is available in a self-contained way.

In [5]:
from nltk.corpus import movie_reviews
import numpy as np
import torch
import spacy


class MovieReviewsCorpus():
  def __init__(self, preprocess_pipeline = None):
    # list of documents, each document is a list containing words of that document
    self.mr = movie_reviews
    self.pipeline = preprocess_pipeline
    # Corpus as list of documents. Documents as list of sentences. Sentences as list of tokens
    self.unprocessed_corpus, self.labels = self._get_corpus()
    if preprocess_pipeline is None:
      self.pipeline = Pipeline(Flattener())
      self.processed_corpus = self._preprocess()
    else:
        # Flattened and preprocessed corpus
      self.processed_corpus = self._preprocess()

    self.corpus_words = self.get_corpus_words()
    self.vocab = self._create_vocab()

  def _preprocess(self):
    return self.pipeline(self.unprocessed_corpus)

  def _get_corpus(self):
    neg = self.mr.paras(categories = "neg")
    pos = self.mr.paras(categories = "pos")
    # negative documents come first in the returned corpus, so their label (0) comes first too
    labels = [0] * len(neg) + [1] * len(pos)
    return neg + pos, labels

  def movie_reviews_dataset_raw(self):
    """
    Returns the dataset containing:

    - A list of all the documents
    - The corresponding label for each document

    Returns
    -------
    tuple(list, list)
        The dataset: first element is the list of the document, the second element of the tuple is the associated label (positive or negative) for each document
    """

    return self.processed_corpus, self.labels

  def get_corpus_words(self) -> list:
    """
    list of all the words in the corpus
    """
    return [w for doc in self.processed_corpus for w in doc]
  
  def get_embedding_matrix(self, embedding, embedding_dim):
    """
    Returns
    -------
    torch.Tensor
        A 2D tensor in which each row holds the embedding of the corresponding vocabulary word
        (words missing from the embedding get a random vector)
    """
    matrix_length = len(self.vocab)
    embedding_matrix = np.zeros((matrix_length, embedding_dim))
    # If I use torch.zeros directly it crashes (don't know why)
    embedding_matrix = torch.from_numpy(embedding_matrix.copy())
    null_embedding = torch.tensor([0.0]*embedding_dim)
    for idx, key in enumerate(self.vocab.keys()):
      if torch.equal(embedding[key], null_embedding):
        embedding_matrix[idx] = torch.randn(embedding_dim)
      else:
        embedding_matrix[idx] = embedding[key]
            
    return embedding_matrix
  
  def get_fasttext_embedding_matrix(self, embedding, embedding_dim):
      matrix_length = len(self.vocab)
      embedding_matrix = np.zeros((matrix_length, embedding_dim))
      # If I use torch.zeros directly it crashes (don't know why)
      embedding_matrix = torch.from_numpy(embedding_matrix.copy())
      null_embedding = torch.tensor([0.0]*embedding_dim)
      for idx, key in enumerate(self.vocab.keys()):
        tensor_embedding = torch.from_numpy(embedding[key].copy())
        if torch.equal(tensor_embedding, null_embedding):
          embedding_matrix[idx] = torch.randn(embedding_dim)
        else:
          embedding_matrix[idx] = tensor_embedding
            
      return embedding_matrix
  
  def get_indexed_corpus(self):
    """
    Returns
    -------
    list(torch.tensor)
        The corpus, where each document is a tensor with the vocabulary index of each word
    
    list
        The label of each document
    """
    vocab = {}
    for idx, key in enumerate(self.vocab.keys()):
      vocab[key] = idx
    
    indexed_corpus = [torch.tensor([vocab[w] for w in doc], dtype=torch.int32) for doc in self.processed_corpus]
    return indexed_corpus, self.labels


  def _create_vocab(self):
    vocab = dict()
    for word in self.corpus_words:
      # Count occurrences without relying on a bare except
      vocab[word] = vocab.get(word, 0) + 1
    return vocab

  def __len__(self):
      return len(self.processed_corpus)
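
The word-frequency vocabulary built in `_create_vocab` can also be expressed with `collections.Counter` from the standard library. This is a minimal sketch on a toy corpus (the documents below are made up for illustration). The insertion order of the keys is what later defines each word's row in the embedding matrix.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data, not taken from movie_reviews)
toy_corpus = [["a", "good", "movie"], ["a", "bad", "movie"]]

# Flatten the documents and count occurrences of each word
vocab = Counter(w for doc in toy_corpus for w in doc)

# Keys keep first-seen order, so enumerate() gives a stable word -> index map
word_to_index = {word: idx for idx, word in enumerate(vocab)}

print(vocab["movie"])   # 2
print(word_to_index)    # {'a': 0, 'good': 1, 'movie': 2, 'bad': 3}
```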


In [6]:
from torch.utils.data import Dataset
from torchtext.vocab import GloVe

class MovieReviewsDataset(Dataset):
  def __init__(self, raw_dataset):
    super(MovieReviewsDataset, self).__init__()
    self.corpus = raw_dataset[0]
    self.targets = raw_dataset[1]
    self.max_element = len(max(self.corpus, key=lambda x: len(x)))

  def __len__(self):
    return len(self.corpus)
  
  def __getitem__(self, index):
    item = self.corpus[index]
    label = self.targets[index]
    return (item, label)

### Create the model class
Let's first try with a simple BiLSTM

In [7]:
import torch.nn as nn
from torch.autograd import Variable
from torch.nn.utils.rnn import pad_packed_sequence
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
import torch

class BiLSTM(nn.Module):
    def __init__(self, embedding_matrix = None, device = "cuda", input_size = 300, hidden_size = 128, output_size = 2):
      super().__init__()
      self.hidden_size = hidden_size
      self.device = device
      if embedding_matrix is not None:
        self.embedding = self.create_embedding_layer(embedding_matrix)
      else:
        self.embedding = None
      self.lstm = nn.LSTM(input_size, hidden_size, batch_first = True, bidirectional=True)
      self.fc = nn.Sequential(nn.ReLU(),
                              nn.BatchNorm1d(hidden_size*2, eps = 1e-08),
                              nn.Dropout(0.2),
                              nn.Linear(hidden_size*2, output_size)
                              )

    def create_embedding_layer(self, embedding_matrix):
        num_embeddings, embedding_dim = embedding_matrix.shape
        emb_layer = nn.Embedding(num_embeddings, embedding_dim, -1)
        emb_layer.load_state_dict({"weight": embedding_matrix})
        return emb_layer

    # function taken from https://discuss.pytorch.org/t/how-to-use-pack-sequence-if-we-are-going-to-use-word-embedding-and-bilstm/28184/4
    def simple_elementwise_apply(self, fn, packed_sequence):
        """applies a pointwise function fn to each element in packed_sequence"""
        return torch.nn.utils.rnn.PackedSequence(fn(packed_sequence.data), packed_sequence.batch_sizes)

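    def init_hidden(self, batch_size):
        # Zero initial hidden and cell states: (num_directions, batch_size, hidden_size).
        # The previous `if self.cuda:` check tested a bound method, which is always truthy.
        return (torch.zeros(2, batch_size, self.hidden_size).to(self.device),
                torch.zeros(2, batch_size, self.hidden_size).to(self.device))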

    def common(self, x):
      batch_size = x.batch_sizes[0].item()
      hidden = self.init_hidden(batch_size)

      if self.embedding is not None:
        x = self.simple_elementwise_apply(self.embedding, x)

      # output: batch_size, sequence_length, hidden_size * 2 (since is bilstm)
      out, _ = self.lstm(x, hidden)
      out, input_sizes = pad_packed_sequence(out, batch_first=True)

      return out, input_sizes

    def forward(self, x):
      batch_size = x.batch_sizes[0].item()
      out, input_sizes = self.common(x)
      # Keep only the output at the last valid time step of each sequence
      out = out[list(range(batch_size)), input_sizes - 1, :]
      out = self.fc(out)
      out = out.squeeze()
      return out

class Residual(nn.Module):
  def __init__(self, in_size, out_size):
    super().__init__()

    self.fc1 = nn.Sequential(nn.ReLU(),
                              nn.Linear(in_size, in_size)
                              )
    self.fc2 = nn.Sequential(nn.ReLU(),
                              nn.Linear(in_size, in_size),
                              nn.ReLU()
                              )

    self.fc3 = nn.Linear(in_size, out_size)

  def forward(self, x):
    lay1 = self.fc1(x)
    lay2 = self.fc2(lay1) + x
    out = self.fc3(lay2)

    return out



class BiLSTMAttention(BiLSTM):
  # BiLSTM with attention inspired by the following paper: https://aclanthology.org/S18-1040.pdf
  def __init__(self, embedding_matrix = None, device="cuda", input_size=300,
                hidden_size=128, context_size = None, output_size=2):
    super().__init__(embedding_matrix, device, input_size, hidden_size, output_size)
    # Not self attention :)
    if context_size:
      self.attention = nn.Linear(self.hidden_size * 2, context_size)
      self.history = nn.Parameter(torch.randn(context_size))
    else:
      self.attention = nn.Linear(self.hidden_size * 2, 1)
      self.history = None
    
    self.fc = Residual(hidden_size*2, output_size)
  
  def forward(self, x):
    out, input_sizes = super().common(x)

    if self.history is not None:
      attention_values = torch.tanh(self.attention(out))
      attention_weights = torch.softmax(attention_values.matmul(self.history), dim = 1).unsqueeze(1)
      # n_docs, sequence_length
    else:
      attention_values = torch.tanh(self.attention(out)).squeeze(dim = 2)
      attention_weights = torch.softmax(attention_values, dim = 1).unsqueeze(1)
      # n_docs, sequence_length

    out = torch.sum(attention_weights.matmul(out), dim = 1)

    out = self.fc(out)

    attention_weights = attention_weights.squeeze(dim = 1)
    att = [doc[:input_sizes[idx]] for idx, doc in enumerate(attention_weights)]

    return out, att
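
In `BiLSTMAttention.forward`, the softmax runs over the full padded sequence length, so padded time steps also receive some attention weight before being sliced away. One possible variation, sketched here on random toy tensors and not part of the original model, masks the padded positions before the softmax so that the weights of each document sum to one over its real tokens only. The names `scores`, `lengths`, `hidden` and `mask` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# Toy batch: 2 documents padded to length 4, real lengths 4 and 2
scores = torch.randn(2, 4)               # unnormalized attention scores
lengths = torch.tensor([4, 2])
hidden = torch.randn(2, 4, 6)            # stand-in for the BiLSTM outputs

# mask[i, t] is True for real tokens and False for padding
mask = torch.arange(scores.size(1)).unsqueeze(0) < lengths.unsqueeze(1)

# Padded positions get -inf so that softmax assigns them zero weight
masked_scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=1)

# Weighted sum over time: (n_docs, 1, seq_len) @ (n_docs, seq_len, features)
pooled = weights.unsqueeze(1).matmul(hidden).squeeze(1)

print(weights[1])        # the last two entries are exactly 0
print(pooled.shape)      # torch.Size([2, 6])
```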

    


In [8]:
def training_step(net, data_loader, optimizer, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  net.train()

  for batch_idx, (inputs, targets, _) in enumerate(data_loader):

    inputs = inputs.to(device)
    targets = targets.to(device)
    in_size = targets.size(dim=0)

    outputs, _ = net(inputs)

    loss = cost_function(outputs, targets)

    loss.backward()
    torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)

    optimizer.step()

    optimizer.zero_grad()
    
    samples += in_size
    cumulative_loss += loss.item()
    _, predicted = outputs.max(dim=1)

    cumulative_accuracy += predicted.eq(targets).sum().item()

  return cumulative_loss/samples, (cumulative_accuracy/samples)*100

In [9]:
def test_step(net, data_loader, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  net.eval()

  with torch.no_grad():

    for batch_idx, (inputs, targets, _) in enumerate(data_loader):
      inputs = inputs.to(device)
      targets = targets.to(device)
      in_size = targets.size(dim=0)

      outputs, _ = net(inputs)
      loss = cost_function(outputs, targets)

      samples += in_size
      cumulative_loss += loss.item()
      _, predicted = outputs.max(dim=1)

      cumulative_accuracy += predicted.eq(targets).sum().item()

    return cumulative_loss/samples, (cumulative_accuracy/samples)*100


In [10]:
from torch.utils.data import DataLoader
from torch.optim import Adam
from torch.optim import RAdam
import torch.nn as nn
from torch.optim.lr_scheduler import ExponentialLR

def main(train_loader, test_loader, embedding_matrix, device = "cuda", epochs = 10):

  net = BiLSTMAttention(embedding_matrix, device = device, input_size=300).to(device)

  optimizer = Adam(net.parameters(), 0.001, betas = (0.9, 0.999), amsgrad=True)

  cost_function = nn.CrossEntropyLoss()

  # Empirical result in this scenario:
  # Even though Adam already adapts the learning rate, adding the scheduler gave
  # a more stable convergence (more stable results across folds in k-fold)
  scheduler = ExponentialLR(optimizer, 0.8)

  flag = False

  for e in range(epochs):
    print(f"epoch {e}:")
    train_loss, train_accuracy = training_step(net, train_loader, optimizer, cost_function, device)
    print(f"Training loss: {train_loss} \n Training accuracy: {train_accuracy}")
    test_loss, test_accuracy = test_step(net, test_loader, cost_function, device)
    print(f"Test loss: {test_loss} \n Test accuracy: {test_accuracy}")
    print("------------------------------------------------------------------")
    if train_accuracy == 100:
      if flag:
        break
      else:
        flag = True
    scheduler.step()
  _, test_accuracy = test_step(net, test_loader, cost_function, device)

  return test_accuracy


In [11]:
from typing import List
from torch.nn.utils.rnn import pack_padded_sequence
from torch.utils.data import Subset
from sklearn.model_selection import train_test_split

def pad(batch: List[torch.Tensor], max_size: int):
  """
  Pads elements in the batch in order to have the same length,
  that is the length of the longest element in the sequence

  Parameters
  ----------
  batch : list of torch.Tensor
    batch of elements. Each sequence of the batch can be either a tensor
    containing indexes (i.e. [2 423 1 ... 123] where each number corresponds to
    one entry in a vocabulary) or a tensor containing the embeddings directly
    (i.e. [[embedding_word1], [embedding_word2], ..., [embedding_wordn]])
  
  max_size : int
    size of the longest sequence in the batch
  
  Returns
  -------
  list of torch.tensor
    Batch where all elements are padded

  """
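  # 2D elements are already embedded (seq_len, embedding_dim); 1D elements are word indexes
  embedded = batch[0].dim() > 1
  if embedded:
    # Keep the padding on the same device as the data instead of hard-coding "cuda"
    pad_value = torch.full((batch[0].size(dim=1),), -1.0, dtype = torch.float, device = batch[0].device)
  else:
    pad_value = torch.tensor([-1], dtype = batch[0].dtype)
  for idx in range(len(batch)):
    remaining = max_size - batch[idx].size(dim = 0)
    if embedded:
      batch[idx] = torch.cat((batch[idx], pad_value.repeat(remaining, 1)), dim = 0)
    else:
      batch[idx] = torch.cat((batch[idx], pad_value.repeat(remaining)), dim = 0)
  return batch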

def batch_to_tensor(X: List[torch.Tensor], max_size):
  """
  Transforms the entire batch into a tensor

  Parameters
  ----------
  X : list of torch.tensor
    already padded batch

  max_size : int
    maximum size of the sequences
  
  Returns
  -------
  torch.tensor
    Batch in tensor type
  """
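  if X[0].dim() > 1:
    # Embedded input: (batch, max_size, embedding_dim), on the same device as the data
    X_tensor = torch.zeros((len(X), max_size, X[0].size(dim=1)), dtype=torch.float, device=X[0].device)
  else:
    # Indexed input: (batch, max_size)
    X_tensor = torch.zeros((len(X), max_size), dtype=torch.int32)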

  for i, embed in enumerate(X):
      X_tensor[i] = embed
  return X_tensor

def sort_ds(X, Y):
  """
  Sort inputs by document lengths

  Parameters
  ----------
  X : list of torch.tensor
    The batch
  Y : list
    Labels
  
  Returns
  -------
  tuple
    sorted batch, sorted labels (to keep the correspondences),
    sorted document lengths, indexes resulting from the argsort
  """
  document_lengths = np.array([tens.size(dim = 0) for tens in X])
  indexes = np.argsort(document_lengths)
  document_lengths = document_lengths.tolist()

  X_sorted = [X[idx] for idx in indexes][::-1]
  Y_sorted = [Y[idx] for idx in indexes][::-1]
  document_lengths = torch.tensor([document_lengths[idx] for idx in indexes][::-1])

  return X_sorted, Y_sorted, document_lengths, indexes

def collate(batch):
  """
  collate function for batch of corpus

  Returns
  -------
  tuple
    packed sequence for the batch, tensor of labels, indexes for original
    position of the elements (used for lbsa method)
  """
  X, Y = list(zip(*batch))
  # Sort dataset
  X, Y, document_lengths, indexes = sort_ds(X, Y)

  # Get tensor sizes
  max_size = torch.max(document_lengths).item()

  # Pad each element to max_size
  X = pad(X, max_size)

  # Transform the batch to a tensor
  X_tensor = batch_to_tensor(X, max_size)
  Y_tensor = torch.tensor(Y)
  # Return the packed sequence object
  X_final = pack_padded_sequence(X_tensor, document_lengths, batch_first=True)
  return X_final, Y_tensor, indexes
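
For comparison, PyTorch's own utilities cover the same sort, pad and pack steps. The sketch below is an alternative, not the notebook's `collate`, and assumes each document is a float tensor of shape `(length, features)`. `pad_sequence` pads to the longest element, and `pack_padded_sequence(..., enforce_sorted=False)` sorts internally and remembers the permutation, so `pad_packed_sequence` returns the batch in its original order.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy batch of embedded documents with different lengths (hypothetical data)
docs = [torch.randn(3, 4), torch.randn(1, 4), torch.randn(2, 4)]
lengths = torch.tensor([d.size(0) for d in docs])

# Pad every document with zeros up to the length of the longest one
padded = pad_sequence(docs, batch_first=True, padding_value=0.0)

# enforce_sorted=False lets PyTorch sort by length internally
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

# Unpacking restores the original (unsorted) order of the batch
unpacked, unpacked_lengths = pad_packed_sequence(packed, batch_first=True)

print(padded.shape)               # torch.Size([3, 3, 4])
print(packed.batch_sizes)         # tensor([3, 2, 1])
print(unpacked_lengths.tolist())  # [3, 1, 2]
```

With this approach the explicit bookkeeping of `indexes` is only needed when something outside the packed sequence, such as gold attention vectors, has to follow the same order.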


In [12]:
def get_data(batch_size: int, dataset, collate_fn, random_state = 42):
  """
  Performs a stratified random split of the dataset using a 80/20 ratio.

  Returns
  -------
  tuple
    training set data loader, test set data loader
  """
  train_indexes, test_indexes = train_test_split(list(range(len(dataset.targets))), test_size = 0.2,
                                                  stratify = dataset.targets, random_state = random_state)

  train_ds = Subset(dataset, train_indexes)
  test_ds = Subset(dataset, test_indexes)

  train_loader = DataLoader(train_ds, batch_size = batch_size, collate_fn = collate_fn, pin_memory=True)
  test_loader = DataLoader(test_ds, batch_size = batch_size, collate_fn = collate_fn, pin_memory=True)

  return train_loader, test_loader

In [13]:
from torch.utils.data.sampler import SubsetRandomSampler
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

from torch.utils.data import Dataset



def main_cross_validation(main_fn, dataset, embedding_matrix, collate_fn,
                          device = "cuda", epochs = 20, random_state = 42, batch_size = 32):

  targets = np.asarray(dataset.targets, dtype=np.int64)

  skf = StratifiedKFold(10, shuffle = True, random_state=random_state)

  fold_accuracies = []
  
  for fold, (train_indexes, val_indexes) in enumerate(skf.split(np.zeros(len(dataset)),
                                                      targets)):
    print(f"\n Fold: {fold}")
    train_sampler = SubsetRandomSampler(train_indexes)
    val_sampler = SubsetRandomSampler(val_indexes)

    train_loader = DataLoader(dataset, batch_size = batch_size, sampler = train_sampler,
                              collate_fn = collate_fn, pin_memory=True)
    val_loader = DataLoader(dataset, batch_size = batch_size, sampler = val_sampler,
                            collate_fn = collate_fn, pin_memory = True)


    val_accuracy = main_fn(train_loader, val_loader, embedding_matrix, device, epochs)
    
    fold_accuracies.append(val_accuracy)


  fold_accuracies = np.array(fold_accuracies)

  return fold_accuracies.mean(), fold_accuracies.std()



## Lexicon Based Supervised Attention Model

In [14]:
class MovieReviewsCorpusLBSA():
  def __init__(self, preprocess_pipeline = None):
      """
      If no preprocess_pipeline is given, the corpus is used as returned by
      the NLTK reader (already split into sentences and tokens)
      """
      self.mr = movie_reviews
      self.pipeline = preprocess_pipeline
      self.raw_corpus, self.labels = self._get_raw_corpus()
      if self.pipeline is None:
          self.processed_corpus = self.raw_corpus
      else:
          self.processed_corpus = self._preprocess()
      
      self.vocab = self._create_vocab()
      

  def _get_raw_corpus(self):
      neg = self.mr.paras(categories = "neg")
      pos = self.mr.paras(categories = "pos")
      labels = [0]*len(neg) + [1]*len(pos)
      return neg + pos, labels
  
  def _preprocess(self):
      if self.pipeline is not None:
          return self.pipeline(self.raw_corpus)
      else:
          return self.raw_corpus
      
  def _create_vocab(self):
      vocab = dict()
      corpus_words = [w for doc in self.processed_corpus for sent in doc for w in sent]
      for word in corpus_words:
          # Count occurrences without relying on a bare except
          vocab[word] = vocab.get(word, 0) + 1
      return vocab

  def get_embedding_matrix(self, embedding, embedding_dim):
      """
      Returns
      -------
      torch.Tensor
          A 2D tensor in which each row holds the embedding of the corresponding
          vocabulary word (words without a pretrained vector get a random one)
      """
      matrix_length = len(self.vocab)
      embedding_matrix = np.zeros((matrix_length, embedding_dim))
      # If I use torch.zeros directly it crashes (don't know why)
      embedding_matrix = torch.from_numpy(embedding_matrix.copy())
      null_embedding = torch.tensor([0.0]*embedding_dim)
      for idx, key in enumerate(self.vocab.keys()):
          if torch.equal(embedding[key], null_embedding):
              embedding_matrix[idx] = torch.randn(embedding_dim)
          else:
              embedding_matrix[idx] = embedding[key]
              
      return embedding_matrix
  
  def get_indexed_corpus(self):
      """
      Returns
      -------
      list(list(torch.tensor))
          The corpus, where each document is a list of sentences and each
          sentence is a tensor of word indexes
      
      list(int)
          labels associated with each document
      """
      vocab = {}
      for idx, key in enumerate(self.vocab.keys()):
          vocab[key] = idx
      
      # each doc is a list of tensor which represent sentences, each sentence is a tensor of indexed words
      indexed_corpus = [[torch.tensor([vocab[w] for w in sent], dtype=torch.int32) 
                        for sent in doc]
                        for doc in self.processed_corpus]
      return indexed_corpus, self.labels
  
  
  
  def __len__(self):
      return len(self.processed_corpus)

c = MovieReviewsCorpusLBSA()

In [15]:
from nltk.corpus import sentiwordnet as swn
import pandas as pd
import os
import math
import json

class MovieReviewsDatasetLBSA(Dataset):
  def __init__(self, corpus):
    super(MovieReviewsDatasetLBSA, self).__init__()
    self.corpus = corpus
    indexed_corpus = self.corpus.get_indexed_corpus()
    # Word level gold attention vector
    self.word_lambda = 3
    self.sentence_lambda = 3
    self.sentiment_degree = self._compute_sentiment_degree()
    self.wl_gold_av = self._compute_gold_words()
    self.sl_gold_av = self._compute_gold_sents()
    self.data = indexed_corpus[0]
    self.targets = indexed_corpus[1]
  
  def _compute_sentiment_degree(self):
    senti_vocab = self._build_senti_vocab(self.corpus.vocab)
    path = '/content/gdrive/My Drive/nlu-project/lexicons/'
    mpqa_vocab = self._build_0_1_vocab(self.corpus.vocab, path + 'mpqa/mpqa.json')
    bingliu_vocab = self._build_0_1_vocab(self.corpus.vocab, path + 'bingliu/bingliu.json')
    inquirer_vocab = self._build_0_1_vocab(self.corpus.vocab, path + 'inquirer/inquirer.json')
    concreteness_vocab = self._build_0_1_vocab(self.corpus.vocab, path + 'concreteness/concreteness.json')
    twitter_vocab = self._build_0_1_vocab(self.corpus.vocab, path + 'twitter/twitter.json')
    qwn_vocab = self._build_0_1_vocab(self.corpus.vocab, path + 'qwn/qwn.json')
    senticnet_vocab = self.build_sentic_net_vocab(self.corpus.vocab, path + "sentic_net/senticnet.txt")
    social_sent = self._build_social_sent_vocab(self.corpus.vocab, path + "social_sent")
    res = self._compute_average_sentiment_degree(senti_vocab,
                                                 mpqa_vocab,
                                                 bingliu_vocab,
                                                 inquirer_vocab,
                                                 concreteness_vocab,
                                                 twitter_vocab,
                                                 qwn_vocab,
                                                 senticnet_vocab,
                                                 social_sent
                                                 )
    
    corpus = self.corpus.processed_corpus
    scores = [[[res[word] for word in sent] for sent in doc] for doc in corpus]
    return scores

  def _compute_gold_sents(self):
    sentence_sentiment_degree  = [[sum(sent)/len(sent) for sent in doc] for doc in self.sentiment_degree]
    gold = [self._normalized_softmax(doc, self.sentence_lambda) for doc in sentence_sentiment_degree]
    return gold


  def _compute_gold_words(self):
    gold = [[self._normalized_softmax(sent_scores, self.word_lambda) for sent_scores in doc] for doc in self.sentiment_degree]
    return gold

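  def _normalized_softmax(self, sequence, lam):
    # Softmax over the scores scaled by lam; the same scaled values are used in
    # the numerator and the denominator so that the result sums to 1
    scaled = [lam * el for el in sequence]
    total = sum(math.exp(el) for el in scaled)
    res = torch.tensor([math.exp(el) / total for el in scaled])
    return res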

  def _build_0_1_vocab(self, vocab, path):
    """
    Taken from https://github.com/williamleif/socialsent/blob/master/socialsent/data/lexicons/mpqa.json

    Values:
    - 1 = positive
    - 0 = neutral
    - -1 = negative
    
    The absolute value will be taken
    """
    with open(path, 'r') as f:
      lexicon = json.load(f)
    
    res_vocab = {}
    for key in vocab.keys():
      res_vocab[key] = 0
    
    for key in res_vocab.keys():
      try:
        value = lexicon[key]
        res_vocab[key] = abs(value)
      except KeyError:
        pass
    
    return res_vocab


  def _build_senti_vocab(self, vocab):
    """
    builds a vocab using senti-wordnet
    """
    senti_vocab = {}
    for key in vocab.keys():
      senti_vocab[key] = 0

    max_value = 0
    for key in senti_vocab.keys():
      senses = list(swn.senti_synsets(key))
      pos = 0
      neg = 0
      for sense in senses:
        if sense.synset.name().split(".")[0] == key:
          pos += sense.pos_score()
          neg += sense.neg_score()
      if (pos != 0) or (neg != 0):
        senti_vocab[key] = max(pos, neg)
      if senti_vocab[key] > max_value:
        max_value = senti_vocab[key]

    # for key in senti_vocab.keys():
      # senti_vocab[key] = self.maprange((0, max_value), (0, 1), senti_vocab[key])

    return senti_vocab
  
  def build_sentic_net_vocab(self, vocab, path):

    # A regex separator is only supported by the python parsing engine
    df = pd.read_csv(path, sep="\t+", engine="python")

    df.replace(["negative", "positive"], 1, inplace = True)
    # df.set_index(["CONCEPT"], inplace = True)

    df = dict(zip(df.CONCEPT, df.POLARITY))

    res_vocab = {}
    for key in vocab.keys():
      res_vocab[key] = 0

    for key in res_vocab.keys():
      try:
        value = df[key]
        res_vocab[key] = value
      except KeyError:
        pass

    return res_vocab
  
  def _build_social_sent_vocab(self, vocab, path):
    word_path = f"{path}/frequent_words/"
    adj_path = f"{path}/adjectives"
    word_files = [os.path.join(word_path, filename) for filename in os.listdir(word_path) if ".tsv" in filename]
    adj_files = [os.path.join(adj_path, filename) for filename in os.listdir(adj_path) if ".tsv" in filename]

    word_dfs = [pd.read_csv(f, sep = "\t", names = ["word", "mean", "std"]) for f in word_files]
    adj_dfs = [pd.read_csv(f, sep = "\t", names = ["word", "mean", "std"]) for f in adj_files]

    words = pd.read_csv(f"{word_path}/2000.tsv", sep = "\t", names = ["word", "mean", "std"])
    adjs = pd.read_csv(f"{adj_path}/2000.tsv", sep = "\t", names = ["word", "mean", "std"])
    tot = pd.concat([words, adjs])

    tot = tot.drop("std", axis = 1)
    tot["mean"] = tot["mean"].abs()
    tot.sort_values(by=["mean"], inplace = True)
    tot.drop_duplicates(subset = "word", keep="last", inplace = True)

    tot = dict(zip(tot["word"], tot["mean"]))

    res_vocab = {}
    for key in vocab.keys():
      res_vocab[key] = 0

    for key in res_vocab.keys():
      try:
        value = tot[key]
        res_vocab[key] = value
      except KeyError:
        pass

    return res_vocab
  
  def maprange(self, a, b, s):
    """
    Maps the number s from range a = [a1, a2] to range b = [b1, b2]
    """
    # Source: https://rosettacode.org/wiki/Map_range#Python
    (a1, a2), (b1, b2) = a, b
    return  b1 + ((s - a1) * (b2 - b1) / (a2 - a1))
  
  def _compute_average_sentiment_degree(self, *args):
    """
    Assumption: all arguments in args are dictionaries containing the same keys
    and numbers as values.

    Returns
    -------
    Dict
      average of the sentiment degree across dictionaries for each word
    
    Example
    -------
    we have two dictionaries that give a sentiment degree to words:
    a = {"good": 0.9, "bad": 0.7}
    b = {"good": 0.5, "bad": 0.1}

    result = {"good": 0.7, "bad": 0.4}
    """
    n_args = len(args)
    res = {}
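    for arg in args:
      for key in arg.keys():
        # setdefault creates the list the first time a key is seen, so the value
        # from the first dictionary is included in the average as well
        res.setdefault(key, []).append(arg[key])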
    for key in res.keys():
      if len(res[key]) != 0:
        res[key] = sum(res[key]) / len(res[key])
      else:
        res[key] = 0
    
    return res


  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    item = self.data[index]
    label = self.targets[index]
    gold_word = self.wl_gold_av[index]
    gold_sent = self.sl_gold_av[index]
    return (item, label, gold_word, gold_sent)
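
The gold attention vectors are meant to be a softmax over the sentiment degrees, scaled by a temperature-like factor (`word_lambda` or `sentence_lambda`). Below is a minimal sketch in plain Python, assuming that numerator and denominator both use the scaled scores so that the result sums to one. The function name `scaled_softmax` and the scores are made up for illustration.

```python
import math

def scaled_softmax(scores, lam):
    """Softmax of lam * scores; a larger lam gives a more peaked distribution."""
    scaled = [lam * s for s in scores]
    # Subtract the maximum for numerical stability (does not change the result)
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sentiment degrees for a four-word sentence
degrees = [0.0, 0.8, 0.1, 0.0]

soft = scaled_softmax(degrees, lam=1)
sharp = scaled_softmax(degrees, lam=3)

print(round(sum(sharp), 6))   # 1.0
print(sharp[1] > soft[1])     # True: lam=3 puts more mass on the sentiment word
```

A larger lambda pushes the gold distribution towards the words (or sentences) with the highest sentiment degree, which is the signal the attention layers are supervised with.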

Open questions:
- Whether it is better to introduce intermediate supervision

- Whether it is better to use one-hot encoding for the output

- Whether I interpreted the word-level loss correctly

In [16]:
class EncoderLBSA(BiLSTMAttention):
    # Lexicon Based Supervised Attention model (LBSA) inspired by the following paper: https://aclanthology.org/C18-1074.pdf
    def __init__(self, embedding_matrix, device="cuda", input_size=300,
                 hidden_size=128, context_size = 20, output_size=2):

        super(EncoderLBSA, self).__init__(embedding_matrix, device, input_size, hidden_size, context_size, output_size)
        self.fc = Residual(hidden_size*2, hidden_size)

    # TODO: Pass the part inside for to super.forward()  
    def forward(self, x):
      att = []
      res = []
      for i, doc in enumerate(x):
        doc = doc.to(self.device)
        batch_size = doc.batch_sizes[0].item()
        hidden = self.init_hidden(batch_size)

        doc = self.simple_elementwise_apply(self.embedding, doc)

        out, _ = self.lstm(doc, hidden)
        out, input_sizes = pad_packed_sequence(out, batch_first=True)
        # n_sents, n_words_per_sent, hidden_size * 2 (since is bilstm)


        attention_values = torch.tanh(self.attention(out))
        # n_sents, n_words_per_sent, context_size

        attention_weights = torch.softmax(attention_values.matmul(self.history), dim = 1).unsqueeze(1)
        # n_sents, n_words_per_sent

        out = torch.sum(attention_weights.matmul(out), dim = 1)
        # n_sents, hidden*2

        out = self.fc(out)
        
        attention_weights = attention_weights.squeeze(dim=1)

        att.append([sent[:input_sizes[idx]] for idx, sent in enumerate(attention_weights)])

        res.append(out)
      # res: list of n_docs tensors, each of shape (n_sents, hidden_size)
      return res, att

In [17]:
def sortLBSA(X, w_gold, s_gold):

  sentence_lengths = [np.array([sent.size(dim=0) for sent in doc]) for doc in X]
  indexes = [np.argsort(doc) for doc in sentence_lengths]
  indexes = [el.tolist() for el in indexes]

  X_sorted = [[doc[idx2] for idx2 in indexes[idx]][::-1] for idx, doc in enumerate(X)]
  # w_gold = [[doc[idx2] for idx2 in indexes[idx]][::-1] for idx, doc in enumerate(w_gold)]#
  # s_gold = [torch.tensor([doc[idx2] for idx2 in indexes[idx]][::-1]) for idx, doc in enumerate(s_gold)]#
  sentence_lengths = [[doc[idx2] for idx2 in indexes[idx]][::-1] for idx, doc in enumerate(sentence_lengths)]

  return X_sorted, w_gold, s_gold, sentence_lengths, indexes

def padLBSA(batch, max_sizes):
    pad = torch.tensor([-1])
    for idx1, doc in enumerate(batch):
      for idx2, sent in enumerate(doc):
        remaining = max_sizes[idx1] - sent.size(dim = 0)
        batch[idx1][idx2] = torch.cat((sent, pad.repeat(remaining)), dim = 0)
    return batch

def to_tensorLBSA(batch, max_sizes):
  res = []
  for idx, doc in enumerate(batch):
    buff = torch.zeros(len(doc), max_sizes[idx], dtype=torch.int32)
    for idx2, sent in enumerate(doc):
      buff[idx2] = sent

    res.append(buff)
  return res


def collateLBSA(batch):
  X, Y, w_gold, s_gold = list(zip(*batch))

  X, w_gold, s_gold, sentence_lengths, indexes = sortLBSA(X, w_gold, s_gold)
  # can take doc[0] since sentence_lengths is sorted in descending order
  max_sizes = [doc[0] for doc in sentence_lengths]

  # Pad tensor each element
  X = padLBSA(X, max_sizes)
  # Transform the batch to a tensor
  X = to_tensorLBSA(X, max_sizes)

  # Return the padded sequence object
  X = [pack_padded_sequence(doc, sentence_lengths[idx], batch_first=True) for idx, doc in enumerate(X)]
  return X, Y, w_gold, s_gold, indexes

In [18]:
def element_wise_log_loss(out, labels):
  res = - out.log().mul(labels).sum(dim=0)
  return res

def loss_LBSA(outputs, targets, mu_w = 0.0005, mu_s = 0.025):
  dec_output, w_att, s_att = outputs
  target, w_gold, s_gold = targets

  total_loss = 0
  ce = nn.CrossEntropyLoss()

  cross_loss = ce(dec_output, target)
  total_loss += cross_loss

  # consider rescaling this loss if it differs greatly from the cross entropy

  # torch.stack (unlike torch.tensor) keeps the terms in the autograd graph,
  # so the attention supervision actually contributes gradients
  w_loss = torch.mean(torch.stack([
    torch.sum(torch.stack([
        element_wise_log_loss(w_att[idx1][idx2], sent) for idx2, sent in enumerate(doc)
    ])) * mu_w for idx1, doc in enumerate(w_gold)
  ]))
  total_loss += w_loss

  s_loss = torch.mean(torch.stack([
      element_wise_log_loss(s_att[idx], doc) * mu_s for idx, doc in enumerate(s_gold)
  ]))
  total_loss += s_loss

  return total_loss
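
When per-sentence losses are combined, it matters how the list of scalar tensors is turned into one tensor. `torch.tensor(...)` copies the values into a fresh tensor that is detached from the autograd graph, whereas `torch.stack(...)` keeps the result connected to the attention weights. Here is a small sketch with made-up attention logits and gold vectors:

```python
import torch

# Toy attention logits for two sentences (hypothetical values)
logits = torch.tensor([[0.2, 1.0, -0.5], [0.0, 0.3, 0.1]], requires_grad=True)
att = torch.softmax(logits, dim=1)
gold = torch.tensor([[0.1, 0.8, 0.1], [0.3, 0.4, 0.3]])

# Cross entropy between gold and predicted attention, one scalar per sentence
per_sentence = [-(att[i].log() * gold[i]).sum() for i in range(2)]

# torch.tensor only copies the values (made explicit here with .item())
detached = torch.tensor([l.item() for l in per_sentence]).sum()
# torch.stack keeps the graph, so gradients can flow back to the logits
connected = torch.stack(per_sentence).sum()

print(detached.requires_grad)    # False
print(connected.requires_grad)   # True

connected.backward()
print(logits.grad.shape)         # torch.Size([2, 3])
```

Both sums hold the same value; only the stacked one can influence the attention parameters during training.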

In [19]:
def training_step_LBSA(encoder, decoder, data_loader, optimizer, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  encoder.train()
  decoder.train()

  for batch_idx, (inputs, target, w_gold, s_gold, _) in enumerate(data_loader):
    
    in_size = len(target)

    enc_output, w_att = encoder(inputs)
    
    batch = [(el, target[idx]) for idx, el in enumerate(enc_output)]

    dec_input, target, indexes = collate(batch)
    
    s_gold = [s_gold[idx] for idx in indexes][::-1]

    target = target.to(device)
    for idx1, doc in enumerate(w_gold):
      for idx2, sent in enumerate(doc):
        w_gold[idx1][idx2] = sent.to(device)
    
    for idx, doc in enumerate(s_gold):
       s_gold[idx] = doc.to(device)
    

    dec_output, s_att = decoder(dec_input)

    outputs = (dec_output, w_att, s_att)
    targets = (target, w_gold, s_gold)

    loss = cost_function(outputs, targets)

    loss.backward()
    torch.nn.utils.clip_grad_norm_(list(encoder.parameters()) + list(decoder.parameters()), 1.0)

    optimizer.step()

    optimizer.zero_grad()
    
    samples += in_size
    cumulative_loss += loss.item()
    _, predicted = dec_output.max(dim=1)

    cumulative_accuracy += predicted.eq(target).sum().item()

  return cumulative_loss/samples, (cumulative_accuracy/samples)*100

In [20]:
def test_step_LBSA(encoder, decoder, data_loader, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  encoder.eval()
  decoder.eval()

  with torch.no_grad():

    for batch_idx, (inputs, target, w_gold, s_gold, _) in enumerate(data_loader):
      in_size = len(target)

      enc_output, w_att = encoder(inputs)
      
      batch = [(el, target[idx]) for idx, el in enumerate(enc_output)]

      dec_input, target, indexes = collate(batch)

      s_gold = [s_gold[idx] for idx in indexes][::-1]
      # w_gold is not re-sorted because the encoder does not reorder the documents

      target = target.to(device)
      for idx1, doc in enumerate(w_gold):
        for idx2, sent in enumerate(doc):
          w_gold[idx1][idx2] = sent.to(device)
    
      for idx, doc in enumerate(s_gold):
        s_gold[idx] = doc.to(device)

      dec_output, s_att = decoder(dec_input)

      outputs = (dec_output, w_att, s_att)
      targets = (target, w_gold, s_gold)

      loss = cost_function(outputs, targets)
      
      samples += in_size
      cumulative_loss += loss.item()
      _, predicted = dec_output.max(dim=1)

      cumulative_accuracy += predicted.eq(target).sum().item()

    return cumulative_loss/samples, (cumulative_accuracy/samples)*100

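Sorting is likely a problem here: sentences are reordered by length before the encoder, so unless the word-level attention weights are put back in their original order they are not compared against the matching gold vectors and the supervision is wrong.
The same goes for the sentence-level attention, since documents are reordered before the decoder.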

In [21]:
def training_step_LBSA_new(encoder, decoder, data_loader, optimizer, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  encoder.train()
  decoder.train()

  for batch_idx, (inputs, target, w_gold, s_gold, sent_indexes) in enumerate(data_loader):
    
    in_size = len(target)

    enc_output, w_att = encoder(inputs)
    # Sorting the sentences of the encoder to their original position
    # Inverting the output because indexes are in ascending order, output is in descending
    enc_output = [torch.flip(enc_output[doc_idx], dims = [0]) for doc_idx, _ in enumerate(enc_output)]
    w_att = [w_att[doc_idx][::-1] for doc_idx, _ in enumerate(w_att)]
    # Using argsort on the indexes reverses the previous argsort
    inverted_indexes = [np.argsort(np.array(doc)) for doc in sent_indexes]
    inverted_indexes = [el.tolist() for el in inverted_indexes]
    # Sort the sentences with original sorting:
    # n_doc, n_sents, hidden*2

    enc_output = [enc_output[doc_idx][sent_idx] for doc_idx, sent_idx in enumerate(inverted_indexes)]
    w_att = [[doc[idx2] for idx2 in inverted_indexes[idx]] for idx, doc in enumerate(w_att)]

    
    batch = [(el, target[idx]) for idx, el in enumerate(enc_output)]

    dec_input, target, indexes = collate(batch)
    
    target = target.to(device)
    for idx1, doc in enumerate(w_gold):
      for idx2, sent in enumerate(doc):
        w_gold[idx1][idx2] = sent.to(device)
    
    for idx, doc in enumerate(s_gold):
       s_gold[idx] = doc.to(device)
    

    dec_output, s_att = decoder(dec_input)

    dec_output = torch.flip(dec_output, dims = [0])
    s_att = s_att[::-1]
    target = torch.flip(target, dims = [0])


    inverted_indexes = np.argsort(np.array(indexes))
    inverted_indexes = inverted_indexes.tolist()
    # Sort the sentences with original sorting:
    # n_doc, n_sents, hidden*2
    dec_output = dec_output[inverted_indexes]
    s_att = [s_att[idx] for idx in inverted_indexes]
    target = target[inverted_indexes]


    outputs = (dec_output, w_att, s_att)
    targets = (target, w_gold, s_gold)
    
    loss = cost_function(outputs, targets)

    loss.backward()
    torch.nn.utils.clip_grad_norm_(list(encoder.parameters()) + list(decoder.parameters()), 1.0)

    optimizer.step()

    optimizer.zero_grad()
    
    samples += in_size
    cumulative_loss += loss.item()
    _, predicted = dec_output.max(dim=1)

    cumulative_accuracy += predicted.eq(target).sum().item()

  return cumulative_loss/samples, (cumulative_accuracy/samples)*100

In [22]:
def test_step_LBSA_new(encoder, decoder, data_loader, cost_function, device = 'cuda'):
  cumulative_loss = 0
  cumulative_accuracy = 0
  samples = 0

  encoder.eval()
  decoder.eval()

  with torch.no_grad():

    for batch_idx, (inputs, target, w_gold, s_gold, sent_indexes) in enumerate(data_loader):
      in_size = len(target)

      enc_output, w_att = encoder(inputs)

      # First, flip to ascending order (the dataloader's collate function
      # sorts sentences in descending order)
      enc_output = [torch.flip(enc_output[doc_idx], dims = [0]) for doc_idx, _ in enumerate(enc_output)]
      w_att = [w_att[doc_idx][::-1] for doc_idx, _ in enumerate(w_att)]

      # Second take indexes for getting the original positions
      # Using argsort on the indexes reverses the previous argsort
      inverted_indexes = [np.argsort(np.array(doc)) for doc in sent_indexes]
      inverted_indexes = [el.tolist() for el in inverted_indexes]

      # Third Sort the sentences with original sorting:
      # n_doc, n_sents, hidden*2
      enc_output = [enc_output[doc_idx][sent_idx] for doc_idx, sent_idx in enumerate(inverted_indexes)]
      w_att = [[doc[idx2] for idx2 in inverted_indexes[idx]] for idx, doc in enumerate(w_att)]
      
      # Create batch
      batch = [(el, target[idx]) for idx, el in enumerate(enc_output)]

      dec_input, target, indexes = collate(batch)

      # w_gold is not reordered because, in the encoder, documents are not sorted inside collate

      # Send w_gold to device
      target = target.to(device)
      for idx1, doc in enumerate(w_gold):
        for idx2, sent in enumerate(doc):
          w_gold[idx1][idx2] = sent.to(device)
    
      # Send s_gold to device
      for idx, doc in enumerate(s_gold):
        s_gold[idx] = doc.to(device)

      dec_output, s_att = decoder(dec_input)


      dec_output = torch.flip(dec_output, dims = [0])
      s_att = s_att[::-1]
      target = torch.flip(target, dims = [0])

      inverted_indexes = np.argsort(np.array(indexes))
      inverted_indexes = inverted_indexes.tolist()
      # Sort the sentences with original sorting:
      # n_doc, n_sents, hidden*2
      dec_output = dec_output[inverted_indexes]
      s_att = [s_att[idx] for idx in inverted_indexes]
      target = target[inverted_indexes]

      # s_gold is not reordered either, since the outputs were restored to the original document order above

      outputs = (dec_output, w_att, s_att)
      targets = (target, w_gold, s_gold)

      loss = cost_function(outputs, targets)
      
      samples += in_size
      cumulative_loss += loss.item()
      _, predicted = dec_output.max(dim=1)

      cumulative_accuracy += predicted.eq(target).sum().item()

    return cumulative_loss/samples, (cumulative_accuracy/samples)*100

In [23]:
from torch.utils.data import DataLoader
from torch.optim import Adam
from torch.optim import RAdam
from torch.optim.lr_scheduler import ExponentialLR
import torch.nn as nn


def main_LBSA(train_loader, test_loader, embedding_matrix, device = "cuda", epochs = 10):
  encoder = EncoderLBSA(embedding_matrix = embedding_matrix, device = device, input_size=300, hidden_size=100).to(device)
  decoder = BiLSTMAttention(device = device, input_size = 100, context_size = 20).to(device)

  optimizer = Adam(list(encoder.parameters()) + list(decoder.parameters()), 0.001, betas = (0.9, 0.999), amsgrad=True)

  scheduler = ExponentialLR(optimizer, 0.8)

  cost_function = loss_LBSA

  flag = False

  for e in range(epochs):
    print(f"epoch {e}:")
    train_loss, train_accuracy = training_step_LBSA_new(encoder, decoder, train_loader, optimizer, cost_function, device)
    print(f"Training loss: {train_loss} \n Training accuracy: {train_accuracy}")
    test_loss, test_accuracy = test_step_LBSA_new(encoder, decoder, test_loader, cost_function, device)
    print(f"Test loss: {test_loss} \n Test accuracy: {test_accuracy}")
    print("------------------------------------------------------------------")
    # Stop early once training accuracy has reached 100% for the second time:
    # the model has converged, so there is no need to go ahead
    if train_accuracy == 100:
      if flag:
        break
      else:
        flag = True
    scheduler.step()

  return test_accuracy
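
With `ExponentialLR(optimizer, 0.8)` and one `scheduler.step()` per epoch, the learning rate is multiplied by 0.8 after every epoch. The sketch below computes the resulting schedule in plain Python. The helper `lr_at_epoch` is a hypothetical name introduced only for illustration; the numbers come from `main_LBSA`.

```python
# Values taken from main_LBSA: initial learning rate 0.001, decay factor 0.8.
initial_lr = 0.001
gamma = 0.8
epochs = 10


def lr_at_epoch(epoch, initial_lr=initial_lr, gamma=gamma):
    """Learning rate in effect during `epoch`, assuming one scheduler step per epoch."""
    return initial_lr * gamma ** epoch


schedule = [lr_at_epoch(e) for e in range(epochs)]
for e, lr in enumerate(schedule):
    print(f"epoch {e}: lr = {lr:.6f}")
```

By the last of the 10 epochs the learning rate has dropped to roughly 13% of its initial value, so later epochs make much smaller updates.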


## Polarity Tests
### Shallow baseline

In [97]:
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report
from nltk.sentiment.util import mark_negation
from sklearn.decomposition import PCA
from nltk.corpus import subjectivity

In [98]:
def neg_marking_list2str(sent):
  # marks the negation scope in a single tokenized sentence
  # and joins the resulting tokens into one string
  negated_doc = mark_negation(sent, double_neg_flip=True)
  return " ".join(negated_doc)

In [100]:
vectorizer = CountVectorizer()
classifier = MultinomialNB()

subj = [sent for sent in subjectivity.sents(categories = 'subj')]
obj = [sent for sent in subjectivity.sents(categories = 'obj')]

corpus = subj + obj

corpus = [neg_marking_list2str(d) for d in corpus]
vectors = vectorizer.fit_transform(corpus)
labels = numpy.array([1] * len(subj) + [0] * len(obj))
scores = cross_validate(classifier, vectors, labels, cv=StratifiedKFold(n_splits=10) , scoring=['accuracy'], return_estimator=True)
# Taking the best estimator in accuracy
pred = scores["estimator"][scores["test_accuracy"].argmax()]

In [101]:
def remove_objective_sents(vectorizer, estimator, corpus):
  transformed_corpus = [[vectorizer.transform([neg_marking_list2str(sent)]) for sent in doc] for doc in corpus]
  res = [[corpus[doc_idx][sent_idx] for sent_idx, sent in enumerate(doc) if estimator.predict(sent).item()]
          for doc_idx, doc in enumerate(transformed_corpus)]
  return res

In [102]:
def neg_marking_subj(doc):
  # takes the doc and produces a single list
  flattened_doc = [w for sent in doc for w in sent]
  # negates the whole document
  negated_doc = mark_negation(flattened_doc, double_neg_flip=True)
  return " ".join([w for w in negated_doc])

In [104]:
mr = movie_reviews
neg = mr.paras(categories = "neg")
pos = mr.paras(categories = "pos")
mr_corpus = pos + neg


mr_corpus = remove_objective_sents(vectorizer, pred, mr_corpus)
mr_corpus = [neg_marking_subj(d) for d in mr_corpus]

In [105]:
vectors = vectorizer.fit_transform(mr_corpus)
labels = numpy.array([0] * len(pos) + [1] * len(neg))

# Redefine vectorizer and classifier since already used for subjectivity
# classifier = SVC(kernel = "linear")

In [106]:
scores = cross_validate(classifier, vectors, labels, cv=StratifiedKFold(n_splits=10) , scoring=['accuracy'], return_estimator=True)
average = sum(scores['test_accuracy'])/len(scores['test_accuracy'])
print(round(average, 3))

0.863


### Deep Models (FastText Embedding)
In this section I am going to analyze the performance of deep models using fastText embeddings.


#### BiLSTM with attention mechanism
The first model I am going to test is the BiLSTM with attention

In [24]:
from torchtext.vocab import FastText
fast_text = FastText('en', cache = "/content/gdrive/My Drive/nlu-project/Embeddings/.vector_cache")

In [25]:
mr_pipeline = Pipeline(UnderscoreRemover(),
                       CharacterRepetitionRemover(),
                       ApostrophesMerger(),
                       ContractionCleaner(),
                       SpecialCharsCleaner(),
                       Flattener()
                      )
mr_corpus = MovieReviewsCorpus(mr_pipeline)
mr_embedding_matrix = mr_corpus.get_embedding_matrix(fast_text, 300)
mr_dataset = MovieReviewsDataset(mr_corpus.get_indexed_corpus())

In [None]:
# Smaller batches give noisier gradient estimates, which can act as a regularizer;
# here the smaller batch size gave the better cross-validated accuracy (see the runs noted below)
mean, std = main_cross_validation(main, mr_dataset, mr_embedding_matrix, collate, epochs = 10, batch_size=32)
# 82 +- 3 using 128 as bsize
# 85.6 +- 3.6 using 32 as bsize
print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")
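
The fold statistics reported above are a mean and a standard deviation over per-fold test accuracies. `main_cross_validation` is not shown in this section, so the following is only a sketch of that aggregation. The accuracies are hypothetical placeholder values, not results from this notebook, and the use of the population standard deviation is an assumption.

```python
import statistics

# Hypothetical per-fold test accuracies (in percent); NOT results from this notebook.
fold_accuracies = [88.0, 90.5, 87.5, 89.0, 91.0]

mean = statistics.mean(fold_accuracies)
# Population standard deviation (assumption; a sample estimate would use statistics.stdev).
std = statistics.pstdev(fold_accuracies)

print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")
```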

In [26]:
# -1 adds the element in the penultimate position
mr_pipeline.add_pipeline_element(ShallowObjectiveSentsRemover(), -1)
mr_corpus = MovieReviewsCorpus(mr_pipeline)
mr_embedding_matrix = mr_corpus.get_embedding_matrix(fast_text, 300)
mr_dataset = MovieReviewsDataset(mr_corpus.get_indexed_corpus())

In [27]:
mean, std = main_cross_validation(main, mr_dataset, mr_embedding_matrix, collate, epochs = 10, batch_size=32)
# 87.95 +- 1.45
print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")


 Fold: 0
epoch 0:
Training loss: 0.021671149101522232 
 Training accuracy: 53.833333333333336
Test loss: 0.021594336926937102 
 Test accuracy: 66.5
------------------------------------------------------------------
epoch 1:
Training loss: 0.016298895469970175 
 Training accuracy: 76.5
Test loss: 0.016355734765529633 
 Test accuracy: 77.0
------------------------------------------------------------------
epoch 2:
Training loss: 0.009472410513295068 
 Training accuracy: 87.6111111111111
Test loss: 0.013666114658117295 
 Test accuracy: 82.0
------------------------------------------------------------------
epoch 3:
Training loss: 0.005179130551922652 
 Training accuracy: 94.38888888888889
Test loss: 0.010311165153980255 
 Test accuracy: 89.0
------------------------------------------------------------------
epoch 4:
Training loss: 0.0021231775113847106 
 Training accuracy: 98.33333333333333
Test loss: 0.012304572686553002 
 Test accuracy: 86.5
--------------------------------------------

##### LBSA Method

In [26]:
LBSA_pipeline = Pipeline(UnderscoreRemover(),
                         CharacterRepetitionRemover(),
                         ApostrophesMerger(),
                         ContractionCleaner(),
                         SpecialCharsCleaner(),
                         )
LBSA_corpus = MovieReviewsCorpusLBSA(LBSA_pipeline)
LBSA_embedding_matrix = LBSA_corpus.get_embedding_matrix(fast_text, 300)
LBSA_dataset = MovieReviewsDatasetLBSA(LBSA_corpus)



In [None]:
mean, std = main_cross_validation(main_LBSA, LBSA_dataset, LBSA_embedding_matrix, collateLBSA, epochs = 10, batch_size=32)
# 86.2 +- 2.21
# 86.1 +- 3
print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")

In [27]:
LBSA_pipeline.add_pipeline_element(ShallowObjectiveSentsRemover())
LBSA_corpus = MovieReviewsCorpusLBSA(LBSA_pipeline)
LBSA_embedding_matrix = LBSA_corpus.get_embedding_matrix(fast_text, 300)
LBSA_dataset = MovieReviewsDatasetLBSA(LBSA_corpus)



In [None]:
mean, std = main_cross_validation(main_LBSA, LBSA_dataset, LBSA_embedding_matrix, collateLBSA, epochs = 15, batch_size=32)
# - mean: 88.5 
# - standard deviation: 1.9235384061671346

# Good Run:
# - mean: 89.35 
# - standard deviation: 1.285496013218244
print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")

### Deep Models (Glove 840B Embedding)

In [28]:
global_vectors = GloVe(name='840B', dim=300, cache = "/content/gdrive/My Drive/nlu-project/Embeddings/.vector_cache")

In [29]:
LBSA_pipeline = Pipeline(UnderscoreRemover(),
                         CharacterRepetitionRemover(),
                         ApostrophesMerger(),
                         ContractionCleaner(),
                         SpecialCharsCleaner(),
                         )
LBSA_corpus = MovieReviewsCorpusLBSA(LBSA_pipeline)
LBSA_embedding_matrix = LBSA_corpus.get_embedding_matrix(global_vectors, 300)
LBSA_dataset = MovieReviewsDatasetLBSA(LBSA_corpus)



In [None]:
# 87.3, 86.75 and something - 88.15 LBSA

mean, std = main_cross_validation(main_LBSA, LBSA_dataset, LBSA_embedding_matrix, collateLBSA, epochs = 15, batch_size=32)
print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")


 Fold: 0
epoch 0:
Training loss: 0.03088712328010135 
 Training accuracy: 52.888888888888886
Test loss: 0.033306020200252535 
 Test accuracy: 71.5
------------------------------------------------------------------
epoch 1:
Training loss: 0.023796848704417548 
 Training accuracy: 78.83333333333333
Test loss: 0.022833701968193055 
 Test accuracy: 84.5
------------------------------------------------------------------
epoch 2:
Training loss: 0.01700952697131369 
 Training accuracy: 89.66666666666666
Test loss: 0.0197608944773674 
 Test accuracy: 89.0
------------------------------------------------------------------
epoch 3:
Training loss: 0.012782681518130832 
 Training accuracy: 96.22222222222221
Test loss: 0.026397609114646912 
 Test accuracy: 86.0
------------------------------------------------------------------
epoch 4:
Training loss: 0.011112581921948327 
 Training accuracy: 99.27777777777777
Test loss: 0.024816671013832094 
 Test accuracy: 88.5
-----------------------------------

# Subjectivity Detection
Now I'm going to implement a subjectivity detector in order to find objective sentences. This allows me to remove objective sentences from the movie reviews dataset, so I am left only (almost only, since the model will not be 100% accurate) with subjective sentences.

## Dataset exploration
First I am going to see how the dataset is composed:

In [25]:
from nltk.corpus import subjectivity


subj = [sent for sent in subjectivity.sents(categories = 'subj')]
obj = [sent for sent in subjectivity.sents(categories = 'obj')]

Then I will print the first 10 sentences of each class (subjective and objective) in order to see how the dataset is arranged:

In [None]:
subj[:10]

In [None]:
obj[:10]

It can be clearly seen that the dataset is composed of single sentences rather than documents, unlike the movie reviews dataset.
Now I am going to compare the number of sentences in the two classes.

In [None]:
print(len(obj))
print(len(subj))

The objective and subjective classes contain the same number of sentences, so the dataset is balanced and accuracy is a suitable metric.
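
To make the point about balance concrete, the sketch below computes the accuracy of a classifier that always predicts the majority class. The label lists are hypothetical toy data and `majority_baseline_accuracy` is a helper name introduced only for this illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical toy label sets (1 = subjective, 0 = objective).
balanced = [1] * 5 + [0] * 5
skewed = [1] * 9 + [0] * 1

print(majority_baseline_accuracy(balanced))  # 0.5
print(majority_baseline_accuracy(skewed))    # 0.9
```

On balanced data the trivial baseline sits at 50%, so any accuracy well above that reflects real discrimination between the classes; on skewed data a high accuracy can be misleading.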

In [27]:
class SubjectivityCorpus():
  def __init__(self, preprocess_pipeline = None):
    self.pipeline = preprocess_pipeline
    # Corpus as a list of sentences; each sentence is a list of tokens
    self.corpus, self.labels = self._get_corpus()
    if preprocess_pipeline is None:
      self.processed_corpus = self.corpus
    else:
      self.processed_corpus = self._preprocess()

    #for optimization purposes
    self._vocab_index = 0


    self.corpus_words = self.get_corpus_words()
    self.vocab = self._create_vocab()



  def _list_to_str(self, doc) -> str:
      """
      Put all elements of the list into a single string, separating each element with a space.
      """
      return " ".join([w for sent in doc for w in sent])

  def _preprocess(self):
      return self.pipeline(self.corpus)

  def _get_corpus(self):
    subj = [sent for sent in subjectivity.sents(categories = 'subj')]
    obj = [sent for sent in subjectivity.sents(categories = 'obj')]
    labels = [1] * len(subj) + [0] * len(obj)
    return subj + obj, labels

  def subjectivity_dataset_raw(self):
      """
      Returns the dataset containing:

      - A list of all the sentences
      - The corresponding label for each sentence

      Returns
      -------
      tuple(list, list)
          The dataset: the first element is the list of sentences, the second is the associated label (1 = subjective, 0 = objective) for each sentence
      """

      return self.corpus, self.labels


  def get_corpus_words(self) -> list:
      return [w for sent in self.processed_corpus for w in sent]
  
  def get_embedding_matrix(self, embedding, embedding_dim):
      """
      Returns
      -------
      torch.Tensor
          A 2D tensor in which each row holds the embedding of the corresponding vocabulary word
      """
      matrix_length = len(self.vocab)
      embedding_matrix = np.zeros((matrix_length, embedding_dim))
      # If I use torch.zeros directly it crashes (don't know why)
      embedding_matrix = torch.from_numpy(embedding_matrix.copy())
      null_embedding = torch.tensor([0.0]*embedding_dim)
      for key in self.vocab.keys():
          if torch.equal(embedding[key], null_embedding):
              embedding_matrix[self.vocab[key]] = torch.randn(embedding_dim)
          else:
              embedding_matrix[self.vocab[key]] = embedding[key]
              
      return embedding_matrix

  
  def get_fasttext_embedding_matrix(self, embedding, embedding_dim):
      matrix_length = len(self.vocab)
      embedding_matrix = np.zeros((matrix_length, embedding_dim))
      # If I use torch.zeros directly it crashes (don't know why)
      embedding_matrix = torch.from_numpy(embedding_matrix.copy())
      null_embedding = torch.tensor([0.0]*embedding_dim)
      for key in self.vocab.keys():
          tensor_embedding = torch.from_numpy(embedding[key].copy())
          if torch.equal(tensor_embedding, null_embedding):
              embedding_matrix[self.vocab[key]] = torch.randn(embedding_dim)
          else:
              embedding_matrix[self.vocab[key]] = tensor_embedding
              
      return embedding_matrix
  
  def get_indexed_corpus(self):
      """
      Returns
      -------
      Dictionary
          Containing correspondences word -> index
      
      list(list(torch.tensor))
          The corpus represented as indexes corresponding to each word
      """
      vocab = {}
      for idx, key in enumerate(self.vocab.keys()):
          vocab[key] = idx
      
      indexed_corpus = [torch.tensor([vocab[word] for word in sent]) for sent in self.processed_corpus]
      return indexed_corpus, self.labels

  def embed_vocab(self, vocab):
    for word in vocab.keys():
      if word not in self.vocab:
          self.vocab[word] = self._vocab_index
          self._vocab_index += 1

  def _create_vocab(self):
      vocab = dict()
      for word in self.corpus_words:
        if word not in vocab:
          vocab[word] = self._vocab_index
          self._vocab_index += 1
      return vocab

  def __len__(self):
      return len(self.corpus)
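
The vocabulary and indexing logic of the class can be sketched on its own as follows. This is a simplified stand-alone version that uses plain lists instead of tensors. `build_vocab` and `index_sentences` are hypothetical helper names, and the two sentences are toy data.

```python
def build_vocab(sentences):
    """Map each distinct word to an integer index, in order of first appearance."""
    vocab = {}
    for sent in sentences:
        for word in sent:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def index_sentences(sentences, vocab):
    """Replace every word with its vocabulary index."""
    return [[vocab[word] for word in sent] for sent in sentences]


# Toy corpus: a list of tokenized sentences.
toy_corpus = [["the", "movie", "is", "good"], ["the", "plot", "is", "thin"]]
toy_vocab = build_vocab(toy_corpus)
toy_indexed = index_sentences(toy_corpus, toy_vocab)

print(toy_vocab)    # {'the': 0, 'movie': 1, 'is': 2, 'good': 3, 'plot': 4, 'thin': 5}
print(toy_indexed)  # [[0, 1, 2, 3], [0, 4, 2, 5]]
```

Row `i` of the embedding matrix then holds the vector of the word whose index is `i`, which is what lets an embedding layer look sentences up by index.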


In [None]:
from torch.utils.data import Dataset
from torchtext.vocab import GloVe

class SubjectivityDataset(Dataset):
  def __init__(self, raw_dataset):
    super(SubjectivityDataset, self).__init__()
    self.corpus = raw_dataset[0]
    self.targets = raw_dataset[1]

  def __len__(self):
    return len(self.corpus)
  
  def __getitem__(self, index):
    item = self.corpus[index]
    label = self.targets[index]
    return (item, label)

In [None]:
pipeline_sub = pipeline = Pipeline(UnderscoreRemoverFlat(),
                                    CharacterRepetitionRemoverFlat(),
                                    ApostrophesMergerFlat(),
                                    ContractionCleanerFlat(),
                                    SpecialCharsCleanerFlat(),
                                    )
corpus_subj = SubjectivityCorpus(pipeline)
corpus_subj.embed_vocab(corpus.vocab)

In [None]:
dataset_subj = SubjectivityDataset(corpus_subj.get_indexed_corpus())
train_loader_subj, test_loader_subj = get_data(128, dataset_subj, collate)

In [None]:
embedding_matrix_subj = corpus_subj.get_embedding_matrix(global_vectors, 300)

In [None]:
_, net_obj= main(train_loader_subj, test_loader_subj, embedding_matrix_subj, device = "cuda", epochs = 10)

# print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")

epoch 0:
Training loss: 0.0013972479817457497 
 Training accuracy: 81.75
Test loss: 0.0011940113492310048 
 Test accuracy: 80.7
------------------------------------------------------------------
epoch 1:
Training loss: 0.00021139853063505143 
 Training accuracy: 79.725
Test loss: 0.0002531803525052965 
 Test accuracy: 81.2
------------------------------------------------------------------
epoch 2:


KeyboardInterrupt: ignored

## Using the trained subjectivity detector to improve polarity classification

In [26]:
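from torch.nn.utils.rnn import pack_padded_sequence

def collate_testing(X, device = "cuda"):
  # Turns a single indexed sentence into a packed batch of size 1
  if isinstance(X, list):
    X = torch.tensor(X)
  if X.dim() == 1:
    X = torch.unsqueeze(X, 0)
  
  X_tensor = X.to(device)
  # The length must be the number of tokens: with a length of 1 only the first token would be packed
  X_final = pack_padded_sequence(X_tensor, torch.tensor([X_tensor.size(1)]), batch_first=True)
  return X_final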


In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report
from nltk.sentiment.util import mark_negation

class ObjectiveSentsRemover(PipelineElement):
  def __init__(self, net = None, vocab = None):
    """
    Parameters
    ----------
    net
      Already trained network in order to classify objective sentences
    """
    super(ObjectiveSentsRemover, self).__init__()
    if net is None:
      raise TypeError("Network cannot be None-type")
    self.net = net
    self.vocab = vocab
  
  def _indexed_corpus(self, corpus):
    indexed_corpus = []
    for doc in corpus:
      new_doc = []
      for sent in doc:
        new_sent = []
        for word in sent:
          new_sent.append(self.vocab[word])
        new_doc.append(new_sent)
      indexed_corpus.append(new_doc)
    return indexed_corpus


  def remove_objective_sents(self, corpus):
    indexed_corpus = self._indexed_corpus(corpus)
    print(len(corpus[0]))
    with torch.no_grad():
      self.net.eval()
      # I want to keep the sentence if it is subjective i.e. when result of classification = 1
      res = [[corpus[doc_idx][sent_idx] for sent_idx, sent in enumerate(doc) if self.net(collate_testing(sent))[0].max(dim = 1)[1].item() == 1]
             for doc_idx, doc in enumerate(indexed_corpus)]
      print(len(res[0]))
    return res
  
  def __call__(self, corpus):
    return self.remove_objective_sents(corpus)

class ShallowObjectiveSentsRemover(PipelineElement):
  def __init__(self, threshold = .5, clf = MultinomialNB, trained = False):
    self.vectorizer = CountVectorizer()
    self.classifier = clf()
    if not trained:
      self.best_estimator = self._train()
    else:
      self.best_estimator = self.classifier
  
  def _train(self):
    subj = [sent for sent in subjectivity.sents(categories = 'subj')]
    obj = [sent for sent in subjectivity.sents(categories = 'obj')]

    corpus = [self.neg_marking_list2str(d) for d in subj] + [self.neg_marking_list2str(d) for d in obj]
    vectors = self.vectorizer.fit_transform(corpus)
    labels = np.array([1] * len(subj) + [0] * len(obj))
    scores = cross_validate(self.classifier, vectors, labels, cv=StratifiedKFold(n_splits=10) , scoring=['accuracy'], return_estimator=True)
    estimator = scores["estimator"][scores["test_accuracy"].argmax()]
    return estimator

  def neg_marking_list2str(self, sent):
    # marks the negation scope in a single tokenized sentence
    # and joins the resulting tokens into one string
    negated_doc = mark_negation(sent, double_neg_flip=True)
    return " ".join(negated_doc)
    

  def remove_objective_sents(self, corpus):
    transformed_corpus = [[self.vectorizer.transform([self.neg_marking_list2str(sent)]) for sent in doc] for doc in corpus]
    print(len(corpus[0]))
    res = [[corpus[doc_idx][sent_idx] for sent_idx, sent in enumerate(doc) if self.best_estimator.predict(sent).item()]
           for doc_idx, doc in enumerate(transformed_corpus)]
    print(len(res[0]))
    return res
  
  def __call__(self, corpus):
    return self.remove_objective_sents(corpus)


In [28]:
pipeline = Pipeline(UnderscoreRemover(),
                    CharacterRepetitionRemover(),
                    ApostrophesMerger(),
                    ContractionCleaner(),
                    SpecialCharsCleaner(),
                    ShallowObjectiveSentsRemover(),
                    )
corpus = MovieReviewsCorpusLBSA(pipeline)
# 22

35
27


In [29]:
embedding_matrix = corpus.get_embedding_matrix(global_vectors, 300)
dataset = MovieReviewsDatasetLBSA(corpus)



In [31]:
# 20 epochs because of the warmup
mean, std = main_cross_validation(main_LBSA, dataset, embedding_matrix, collateLBSA, epochs = 15)
## 88.65 +- 1.24 neither residual nor batchnorm
## 89.15 +- 2.16             Residual with reg = False
print(f"Folds statistics:\n----------------\n - mean: {mean} \n - standard deviation: {std}")


 Fold: 0
epoch 0:
Training loss: 0.007959709366162618 
 Training accuracy: 61.05555555555555
Test loss: 0.009188937544822693 
 Test accuracy: 50.0
------------------------------------------------------------------
epoch 1:
Training loss: 0.005274999472830031 
 Training accuracy: 78.38888888888889
Test loss: 0.00855442076921463 
 Test accuracy: 56.49999999999999
------------------------------------------------------------------
epoch 2:
Training loss: 0.003910372091664208 
 Training accuracy: 89.77777777777777
Test loss: 0.0074517691135406496 
 Test accuracy: 85.5
------------------------------------------------------------------
epoch 3:
Training loss: 0.002459457086192237 
 Training accuracy: 96.38888888888889
Test loss: 0.0054387211799621586 
 Test accuracy: 84.5
------------------------------------------------------------------
epoch 4:
Training loss: 0.0018047150638368394 
 Training accuracy: 98.83333333333333
Test loss: 0.0091081303358078 
 Test accuracy: 79.0
-------------------

KeyboardInterrupt: ignored