# Building an LSTM-based recurrent neural network to classify the sentiment of image captions

https://www.oreilly.com/learning/perform-sentiment-analysis-with-lstms-using-tensorflow

Pretrained GloVe word vectors will be used (GloVe is a separate method from word2vec):

https://nlp.stanford.edu/projects/glove/

- Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): glove.6B.zip
- Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
- Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip
- Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip
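
Each archive unpacks to plain-text files with one token per line, followed by its space-separated vector components. Below is a minimal sketch of a parser for that layout, assuming well-formed lines whose token contains no spaces. The name `load_glove_text` and the in-memory sample are illustrative and not part of any library.

```python
import io

import numpy as np


def load_glove_text(lines):
    """Parse GloVe-style text lines into (word_to_index, word_vectors, dim)."""
    word_to_index = {}
    rows = []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word_to_index[parts[0]] = len(rows)
        rows.append(np.asarray(parts[1:], dtype=np.float32))
    word_vectors = np.vstack(rows)
    return word_to_index, word_vectors, word_vectors.shape[1]


# Tiny in-memory stand-in for a real file such as glove.6B.50d.txt
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n")
word_to_index, word_vectors, word_vector_size = load_glove_text(sample)
print(word_to_index, word_vectors.shape, word_vector_size)
```

For a real file, the same function could be called on an open file handle, for example `open(path, encoding="utf-8")`.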

https://github.com/adeshpande3/LSTM-Sentiment-Analysis 

https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM

Paper: Minimal Gated Unit for Recurrent Neural Networks https://arxiv.org/pdf/1603.09420v1.pdf 

https://www.quora.com/How-do-we-use-LSTM-to-do-sentiment-analysis 

It’s actually quite simple! Encode the document with one-hot encoding, then train the LSTM to read the document character by character, optimizing the hidden state after the last character so that it captures the sentiment of the sentence.
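
The idea can be sketched in PyTorch as follows. This is a minimal, untrained illustration and not the implementation from any of the linked resources. The class name `CharLSTMSentiment`, the toy alphabet, and the layer sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn


class CharLSTMSentiment(nn.Module):
    """Reads one-hot characters and classifies from the final hidden state."""

    def __init__(self, alphabet_size, hidden_size, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size=alphabet_size,
                            hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, one_hot_chars):
        # one_hot_chars: (batch, seq_len, alphabet_size)
        _, (h_n, _) = self.lstm(one_hot_chars)
        # h_n: (num_layers, batch, hidden_size); use the last layer's state
        return self.classifier(h_n[-1])


torch.manual_seed(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "  # toy alphabet (assumption)
char_to_index = {ch: i for i, ch in enumerate(alphabet)}


def encode(text):
    """One-hot encode a string into a (seq_len, alphabet_size) float tensor."""
    indices = torch.tensor([char_to_index[ch] for ch in text])
    return nn.functional.one_hot(indices, num_classes=len(alphabet)).float()


# Three classes, e.g. positive / negative / neutral
model = CharLSTMSentiment(len(alphabet), hidden_size=16, num_classes=3)
batch = encode("a good day").unsqueeze(0)  # add the batch dimension
logits = model(batch)
print(logits.shape)
```

Training would then minimize a cross-entropy loss between these logits and the sentiment labels.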

Here is an example of using the hidden state of an LSTM to recognize images. Use the same technique, but with sentiment labels (e.g. positive, negative, neutral) instead of image labels: https://arxiv.org/pdf/1603.09420...

Note: in one-hot encoding, each character is represented by a vector whose length equals the alphabet size, filled with 0s everywhere except for a 1 at the position of that character.
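
A small NumPy sketch of this encoding, using a four-character toy alphabet chosen only for illustration (the helper name `one_hot_encode` is likewise an assumption):

```python
import numpy as np


def one_hot_encode(text, alphabet):
    """Return a (len(text), len(alphabet)) matrix with one 1 per row."""
    char_to_index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = np.zeros((len(text), len(alphabet)), dtype=np.float32)
    for position, ch in enumerate(text):
        encoded[position, char_to_index[ch]] = 1.0
    return encoded


alphabet = "abc "  # toy alphabet: 'a', 'b', 'c', and space
matrix = one_hot_encode("a cab", alphabet)
print(matrix)
```

Each row corresponds to one character of the input and each column to one symbol of the alphabet.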

- **Keras example:** https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras/data

- **TODO: Add Python NLTK sentiment analysis**

- **TODO: Positive/Negative word clouds:** https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis/code

- **PyTorch examples: Natural Language Inference (SNLI) with GloVe vectors, LSTMs, and torchtext:** https://github.com/pytorch/examples/tree/master/snli
- **Deep Learning for NLP with PyTorch:** https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html


SNLI: https://nlp.stanford.edu/projects/snli/

Paper: https://nlp.stanford.edu/pubs/snli_paper.pdf

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pprint
pp = pprint.PrettyPrinter(width=41, compact=True)

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# Loading Pretrained Vectors

- http://dl4nlp.info/en/latest/ (NOT GOOD)
- http://pytorch-nlp-tutorial-ny2018.readthedocs.io/en/latest/
- http://pytorch-nlp-tutorial-ny2018.readthedocs.io/en/latest/recipes/load_pretrained_vectors.html
- git clone https://github.com/joosthub/pytorch-nlp-tutorial-ny2018.git
- data: https://drive.google.com/file/d/0B2hg7DTHpfLsdHhEUVhHWU5hUXc/view


It can be extremely useful to give a model an advantageous starting point.

To do this, we can set the values of the embedding matrix from pretrained word vectors.

In [None]:
import torch

# load_word_vectors() is shown in the tutorial's day 1 word vector notebook;
# vectorizer, settings, and net are assumed to be defined earlier
word_to_index, word_vectors, word_vector_size = load_word_vectors()

# iterate over our vocabulary items
for word, emb_index in vectorizer.word_vocab.items():
    # only words present in the loaded GloVe vectors get a pretrained row
    if word.lower() in word_to_index:
        # get the index into the GloVe vectors
        glove_index = word_to_index[word.lower()]
        # get the GloVe vector itself and convert it to a PyTorch tensor
        glove_vec = torch.FloatTensor(word_vectors[glove_index])

        # this only matters if using CUDA
        if settings.CUDA:
            glove_vec = glove_vec.cuda()

        # finally, if net is our network and emb is its embedding layer,
        # copy the vector into the matching row of the weight matrix
        # (set_ would only repoint the temporary slice at new storage)
        net.emb.weight.data[emb_index, :].copy_(glove_vec)
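
The cell above depends on objects from the tutorial (`load_word_vectors`, `vectorizer`, `settings`, `net`). A self-contained sketch of the same idea is given below, assuming toy stand-ins for the GloVe data and the vocabulary. It builds the weight matrix in NumPy and hands it to `nn.Embedding.from_pretrained`, which is one possible alternative to writing rows in place.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-ins for the loaded GloVe data and the model vocabulary
word_to_index = {"good": 0, "bad": 1}
word_vectors = np.array([[0.1, 0.2, 0.3],
                         [-0.1, -0.2, -0.3]], dtype=np.float32)
word_vocab = {"<pad>": 0, "Good": 1, "bad": 2, "zebra": 3}

# Start from small random values so words without a GloVe vector
# still get an initial embedding
rng = np.random.default_rng(0)
weights = rng.normal(
    scale=0.1, size=(len(word_vocab), word_vectors.shape[1])
).astype(np.float32)

# Overwrite the rows of words that have a pretrained vector
for word, emb_index in word_vocab.items():
    glove_index = word_to_index.get(word.lower())
    if glove_index is not None:
        weights[emb_index] = word_vectors[glove_index]

# freeze=False keeps the embedding trainable after initialization
emb = nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
print(emb.weight.shape)
```

Rows for words missing from the pretrained set (here `<pad>` and `zebra`) keep their random initialization.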

# Formal Start

## Twitter Sentiment Analysis [1] dataset

Description: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

Download: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip 

## Inspired by: https://github.com/hpanwar08/sentiment-analysis-torchtext

## TODO
1. Add Python NLTK sentiment analysis

2. [DONE] Positive/Negative word clouds: https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis/code

## Importing GloVe vectors

https://github.com/spro/practical-pytorch/blob/master/glove-word-vectors/glove-word-vectors.ipynb 

Installing torchtext: https://github.com/pytorch/text



In [2]:

# Loading word vectors

import torch
import torchtext.vocab as vocab

# sentiment = data.TabularDataset(
#     path='data/sentiment/train.json', format='json',
#     fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
#             'sentiment_gold': ('labels', data.Field(sequential=False))})

glove = vocab.GloVe(name='6B', dim=100)

print('Loaded {} words'.format(len(glove.itos)))

.vector_cache/glove.6B.zip: 862MB [02:27, 5.85MB/s]                              
100%|██████████| 400000/400000 [00:12<00:00, 31935.86it/s]


Loaded 400000 words


In [3]:
def get_word(word):
    return glove.vectors[glove.stoi[word]]
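
One way to sanity-check the loaded vectors is to look up the nearest neighbours of a word by cosine similarity. The sketch below uses a hand-made four-word table in place of `glove.itos`, `glove.stoi`, and `glove.vectors`, so that it runs without the download. The values are invented for illustration, and `closest` is a hypothetical helper.

```python
import numpy as np

# Toy stand-ins for glove.itos, glove.stoi, and glove.vectors (made-up values)
itos = ["good", "great", "bad", "table"]
stoi = {word: index for index, word in enumerate(itos)}
vectors = np.array([[0.9, 0.1, 0.0],
                    [0.8, 0.2, 0.1],
                    [-0.7, 0.3, 0.0],
                    [0.0, 0.1, 0.9]], dtype=np.float32)


def get_word(word):
    return vectors[stoi[word]]


def closest(word, n=2):
    """Return the n words most similar to `word` by cosine similarity."""
    query = get_word(word)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    similarities = vectors @ query / norms
    order = np.argsort(-similarities)  # highest similarity first
    ranked = [(itos[i], float(similarities[i])) for i in order if itos[i] != word]
    return ranked[:n]


print(closest("good", n=3))
```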

In [None]:
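# source: https://medium.com/@sonicboom8/sentiment-analysis-torchtext-55fb57b1fab8
import spacy
from torchtext import data

# spaCy pipeline used for tokenization; the parser, tagger, and NER
# components are disabled.
# Note: 'en' is a spaCy 2.x shortcut; spaCy 3 requires the full package
# name, e.g. 'en_core_web_sm'.
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])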