# LSTM

In this notebook, we cover the basics of LSTMs and the sentiment-analysis API of Flair, using pretrained embeddings such as GloVe and Flair embeddings on the IMDB dataset.

Here we will use [Flair](https://github.com/zalandoresearch/flair  "flair").


Everything is explained in detail in the [blog post](https://dudeperf3ct.github.io/lstm/gru/nlp/2019/01/28/Force-of-LSTM-and-GRU/). This notebook replicates the results of the blog post and runs in Colab. Enjoy!


Let's look at [NLP-progress](http://nlpprogress.com/english/sentiment_analysis.html) to see the current state of the art in sentiment analysis.


Model | Accuracy (%) | Paper |
----- | ------------ | ----- |
ULMFit| 95.4     | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146)|
Block-sparse LSTM| 94.99 | [GPU Kernels for Block-Sparse Weights](https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/blocksparsepaper.pdf)|
oh-LSTM | 94.1 | [Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings](https://arxiv.org/abs/1602.02373) |
Virtual adversarial training  | 94.1 | [Adversarial Training Methods for Semi-Supervised Text Classification](https://arxiv.org/abs/1605.07725) |
BCN+Char+CoVe | 91.8 | [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107) |


#### Run in Colab

You can run this notebook in Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dudeperf3ct/DL_notebooks/blob/master/lstm_and_gru/lstm_and_gru_flair.ipynb)


## Download libraries

In [0]:
! pip install flair

Collecting flair
  Downloading https://files.pythonhosted.org/packages/44/54/76374f9a448ca765446502e7f2bb53c976e9c055102290fe6f8b0b038b37/flair-0.4.1.tar.gz (78kB)
    100% |████████████████████████████████| 81kB 3.0MB/s
Collecting segtok>=1.5.7 (from flair)
  Downloading https://files.pythonhosted.org/packages/1d/59/6ed78856ab99d2da04084b59e7da797972baa0efecb71546b16d48e49d9b/segtok-1.5.7.tar.gz
Collecting mpld3>=0.3 (from flair)
  Downloading https://files.pythonhosted.org/packages/91/95/a52d3a83d0a29ba0d6898f6727e9858fe7a43f6c2ce81a5fe7e05f0f4912/mpld3-0.3.tar.gz (788kB)
    100% |████████████████████████████████| 798kB 8.1MB/s
Collecting sqlitedict>=1.6.0 (from flair)
  Downloading https://files.pythonhosted.org/packages/0f/1c/c757b93147a219cf1e25cef7e1ad9b595b7f802159493c45ce116521caff/sqlitedict-1.6.0.tar.gz
Collecting deprecated>=1.2.4 (from flair)
  Downloading https://files.pythonhosted.org/packages/9f/7a/003fa432f1e45625626549726c2fbb7a29baa764e9d1fdb2323a

## IMDB data

Code adapted from bentrevett's [Simple Sentiment Analysis notebook](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb).

In [0]:
from pathlib import Path
import os

from flair.data import Sentence
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


## Flair

### Data Preprocessing Step

Flair’s classification dataset format is based on Facebook’s FastText format. Each line must begin with one or more labels, and every label carries the prefix `__label__`. The text of the document follows the labels.

The format is as follows:


```
__label__<class_1> <text>
__label__<class_2> <text>
```
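As a minimal sketch of the format above, the helper below builds one FastText-style line from a label (or several labels) and a text. The function name `to_fasttext_line`, the example reviews, and the extra label `short` are hypothetical and only serve as illustration.

```python
def to_fasttext_line(labels, text):
    """Build one line: every label gets the __label__ prefix, then the text follows."""
    if isinstance(labels, str):
        labels = [labels]
    prefix = " ".join("__label__" + label for label in labels)
    # Collapse whitespace (including newlines) so each document stays on one line.
    clean_text = " ".join(text.split())
    return prefix + " " + clean_text


# Hypothetical examples; the third one carries two labels.
examples = [
    ("pos", "This film is great"),
    ("neg", "This film is\nterrible"),
    (["neg", "short"], "Bad."),
]
lines = [to_fasttext_line(labels, text) for labels, text in examples]
print("\n".join(lines))
```

Keeping each document on a single line matters because the format is line-based: a stray newline inside a review would otherwise start a new, unlabeled record.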



In [0]:
# import pandas as pd
# data = pd.read_csv(csv_filename)

# data = data[['v1', 'v2']].rename(columns={"v1":"label", "v2":"text"})
 
# data['label'] = '__label__' + data['label'].astype(str)

# data.iloc[0:int(len(data)*0.8)].to_csv('train.csv', sep='\t', index = False, header = False)
# data.iloc[int(len(data)*0.8):int(len(data)*0.9)].to_csv('test.csv', sep='\t', index = False, header = False)
# data.iloc[int(len(data)*0.9):].to_csv('dev.csv', sep='\t', index = False, header = False)

#corpus = NLPTaskDataFetcher.load_classification_corpus(Path('./'), test_file='test.csv', dev_file='dev.csv', train_file='train.csv')
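The commented-out cell above slices the data into 80% train, 10% test and 10% dev by position. The same slicing can be sketched in plain Python as follows; `split_80_10_10` and the placeholder rows are hypothetical names used only for this illustration.

```python
def split_80_10_10(rows):
    """Split rows by position: first 80% train, next 10% test, last 10% dev."""
    n = len(rows)
    train_end = int(n * 0.8)
    test_end = int(n * 0.9)
    return rows[:train_end], rows[train_end:test_end], rows[test_end:]


# Placeholder rows already in the __label__ format.
rows = ["__label__pos review {}".format(i) for i in range(20)]
train, test, dev = split_80_10_10(rows)
print(len(train), len(test), len(dev))  # 16 2 2
```

Because the split is positional, rows that are sorted by label should be shuffled first; otherwise one split may contain only a single class.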

In [0]:
# Shortcut: download the IMDB dataset, already preprocessed into the required format
# We will downsample data: 70% of original --> NOT ENOUGH RAM

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB).downsample(0.7)

2019-02-23 00:21:33,175 Reading data from /root/.flair/datasets/imdb
2019-02-23 00:21:33,177 Train: /root/.flair/datasets/imdb/train.txt
2019-02-23 00:21:33,179 Dev: None
2019-02-23 00:21:33,180 Test: /root/.flair/datasets/imdb/test.txt


In [0]:
stats = corpus.obtain_statistics()
print(stats)

{
    "TRAIN": {
        "dataset": "TRAIN",
        "total_number_of_documents": 2250,
        "number_of_documents_per_class": {
            "pos": 1130,
            "neg": 1120
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 609753,
            "min": 18,
            "max": 1269,
            "avg": 271.0013333333333
        }
    },
    "TEST": {
        "dataset": "TEST",
        "total_number_of_documents": 2500,
        "number_of_documents_per_class": {
            "pos": 1250,
            "neg": 1250
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 687743,
            "min": 23,
            "max": 1640,
            "avg": 275.0972
        }
    },
    "DEV": {
        "dataset": "DEV",
        "total_number_of_documents": 250,
        "number_of_documents_per_class": {
            "neg": 130,
            "pos": 120
        },
        "number_of_tokens_per_tag": {},
       

### GloVe

In [0]:
# init embedding
glove_embedding = [WordEmbeddings('glove')]

In [0]:
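# Encode each review with an LSTM run over the (reprojected) GloVe word vectors
document_embeddings = DocumentRNNEmbeddings(glove_embedding, hidden_size=128, rnn_layers=1,
                                            reproject_words=True, reproject_words_dimension=256,
                                            bidirectional=False, dropout=0.5, word_dropout=0.2,
                                            rnn_type='LSTM')

# Single-label classifier over the labels found in the corpus
classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)

trainer = ModelTrainer(classifier, corpus)

# Train for at most 10 epochs; model files are written to the current directory
trainer.train('./', max_epochs=10)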
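The `rnn_type='LSTM'` argument above selects an LSTM as the recurrent encoder. As a rough illustration of what one LSTM time step computes, here is a minimal pure-Python sketch of the textbook cell equations. It is not Flair's or PyTorch's implementation; the helper names (`lstm_step`, `random_params`), the tiny sizes, and the random weights are assumptions chosen for readability.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Return W @ x + U @ h + b using plain Python lists."""
    return [
        sum(w * x_j for w, x_j in zip(W_row, x))
        + sum(u * h_j for u, h_j in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps each gate name to (W, U, b)."""
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # candidate cell values
    # New cell state: keep part of the old state, add part of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed view of the cell state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


def random_params(input_size, hidden_size, rng):
    """Small random weights and zero biases for the four gates (illustrative only)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        gate: (mat(hidden_size, input_size), mat(hidden_size, hidden_size), [0.0] * hidden_size)
        for gate in "ifog"
    }


rng = random.Random(0)
params = random_params(input_size=4, hidden_size=3, rng=rng)
# A toy "sentence": five word vectors of dimension 4.
sequence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]

h, c = [0.0] * 3, [0.0] * 3
for x in sequence:
    h, c = lstm_step(x, h, c, params)

# The hidden state after the last word plays the role of the document vector.
print(h)
```

In this sketch `hidden_size` corresponds to the length of `h`, so the `hidden_size=128` setting in the cell above would yield a 128-dimensional vector per review, which the classifier then maps to a label.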

In [0]:
classifier = TextClassifier.load_from_file('./best-model.pt')

In [0]:
sentence = Sentence("This film is terrible")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

In [0]:
sentence = Sentence("This film is great")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

### Custom Classifier


To train a custom text classifier we first need a labelled dataset in the FastText-based `__label__` format described in the Data Preprocessing Step above.



In [0]:
word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]

In [0]:
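# Same LSTM document encoder, now over GloVe plus Flair embeddings and with a larger hidden state
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=512, rnn_layers=1,
                                            reproject_words=True, reproject_words_dimension=256,
                                            bidirectional=False, dropout=0.5, word_dropout=0.2,
                                            rnn_type='LSTM')

# Single-label classifier over the labels found in the corpus
classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)

trainer = ModelTrainer(classifier, corpus)

# Train for at most 10 epochs; model files are written to the current directory
trainer.train('./', max_epochs=10)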

In [0]:
classifier = TextClassifier.load_from_file('./best-model.pt')

In [0]:
sentence = Sentence("This film is terrible")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

In [0]:
sentence = Sentence("This film is great")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

### Classifier API

In [0]:
from flair.models import TextClassifier
from flair.data import Sentence

classifier = TextClassifier.load('en-sentiment')

2019-03-03 19:03:52,956 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/models-v0.4/TEXT-CLASSIFICATION_imdb/imdb.pt not found in cache, downloading to /tmp/tmpwckm7dng


100%|██████████| 2794252905/2794252905 [06:35<00:00, 7068826.37B/s]

2019-03-03 19:10:29,481 copying /tmp/tmpwckm7dng to cache at /root/.flair/models/imdb.pt





2019-03-03 19:10:50,402 removing temp file /tmp/tmpwckm7dng
2019-03-03 19:10:50,409 loading file /root/.flair/models/imdb.pt


  result = unpickler.load()


In [0]:
sentence = Sentence("This has to be one of the biggest misfires ever...the script was nice and could have ended a lot better.the actors should have played better and maybe then i would have given this movie a slightly better grade. maybe Hollywood should remake this movie with some little better actors and better director.sorry guys for disappointment but the movie is bad.<br /><br />If i had to re-watch it it would be like torture. I don't want to spoil everyone's opinion with mine so..my advice is watch the movie first..see if u like it and after vote(do not vote before you watch it ! ) and by the way... Have fun watching it ! Don't just peek...watch it 'till the end :))))))))) !!")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

Sentence above is:  [NEGATIVE (1.0)]
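The printed list holds label objects rather than plain strings. A small helper for pulling out the best label and its confidence might look like the sketch below. It assumes each label exposes `value` and `score` attributes, and it uses a hypothetical `FakeLabel` stand-in so that it runs without Flair installed; `top_label` and its `threshold` parameter are likewise illustrative names.

```python
from collections import namedtuple

# Stand-in for a Flair label (assumption: real labels expose `value` and `score`).
FakeLabel = namedtuple("FakeLabel", ["value", "score"])


def top_label(labels, threshold=0.5):
    """Return (value, score) of the highest-scoring label, or None if there is
    no label or the best score is below the threshold."""
    if not labels:
        return None
    best = max(labels, key=lambda label: label.score)
    if best.score < threshold:
        return None
    return best.value, best.score


labels = [FakeLabel("NEGATIVE", 1.0)]
print(top_label(labels))  # ('NEGATIVE', 1.0)
```

With a real prediction, passing `sentence.labels` in place of `labels` should work the same way, provided the attribute names match the installed Flair version.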


In [0]:
sentence = Sentence("Five medical students (Kevin Bacon, David Labraccio; William Baldwin, Dr. Joe Hurley; Oliver Platt, Randy Steckle; Julia Roberts, Dr. Rachel Mannus; Kiefer Sutherland, Nelson) experiment with clandestine near death & afterlife experiences, (re)searching for medical & personal enlightenment. One by one, each medical student's heart is stopped, then revived.<br /><br />Under temporary death spells each experiences bizarre visions, including forgotten childhood memories. Their flashbacks are like children's nightmares. The revived students are disturbed by remembering regretful acts they had committed or had done against them. As they experience afterlife, they bring real life experiences back into the present. As they continue to experiment, their remembrances dramatically intensify; so much so, some are physically overcome. Thus, they probe & transcend deeper into the death-afterlife experiences attempting to find a cure.<br /><br />Even though the DVD was released in 2007, this motion picture was released in 1990. Therefore, Kevin Bacon, William Baldwin, Julia Roberts & Kiefer Sutherland were in the early stages of their adult acting careers. Besides the plot being extremely intriguing, the suspense building to a dramatic climax & the script being tight & convincing, all of the young actors make \"Flatliners,\" what is now an all-star cult semi-sci-fi suspense. Who knew 17 years ago that the film careers of this young group of actors would skyrocket? I suspect that director Joel Schumacher did.")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

Sentence above is:  [POSITIVE (1.0)]


In [0]:
sentence = Sentence("A very accurate depiction of small time mob life filmed in New Jersey. The story, characters and script are believable but the acting drops the ball. Still, it's worth watching, especially for the strong images, some still with me even though I first viewed this 25 years ago.<br /><br />A young hood steps up and starts doing bigger things (tries to) but these things keep going wrong, leading the local boss to suspect that his end is being skimmed off, not a good place to be if you enjoy your health, or life.<br /><br />This is the film that introduced Joe Pesce to Martin Scorsese. Also present is that perennial screen wise guy, Frank Vincent. Strong on characterizations and visuals. Sound muddled and much of the acting is amateurish, but a great story.")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

Sentence above is:  [POSITIVE (1.0)]


In [0]:
sentence = Sentence("Afraid of the Dark left me with the impression that several different screenplays were written, all too short for a feature length film, then spliced together clumsily into this Frankenstein's monster.<br /><br />At his best, the protagonist, Lucas, is creepy. As hard as it is to draw a bead on the secondary characters, they're far more sympathetic.<br /><br />Afraid of the Dark could have achieved mediocrity had it taken just one approach and seen it through -- and had it made Lucas simply psychotic and confused instead of ghoulish and off-putting. I wanted to see him packed off into an asylum so the rest of the characters could have a normal life.")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

Sentence above is:  [NEGATIVE (1.0)]


In [0]:
sentence = Sentence("This movie is a disaster within a disaster film. It is full of great action scenes, which are only meaningful if you throw away all sense of reality. Let's see, word to the wise, lava burns you; steam burns you. You can't stand next to lava. Diverting a minor lava flow is difficult, let alone a significant one. Scares me to think that some might actually believe what they saw in this movie.<br /><br />Even worse is the significant amount of talent that went into making this film. I mean the acting is actually very good. The effects are above average. Hard to believe somebody read the scripts for this and allowed all this talent to be wasted. I guess my suggestion would be that if this movie is about to start on TV ... look away! It is like a train wreck: it is so awful that once you know what is coming, you just have to watch. Look away and spend your time on more meaningful content.")
classifier.predict(sentence)

# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

Sentence above is:  [NEGATIVE (1.0)]
