<a href="https://colab.research.google.com/github/alanwuha/ce7455-nlp/blob/master/Logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression Sentiment Analysis

In this series we'll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

In this first notebook, we'll start very simple to understand the general concepts whilst not really caring about good results. Further notebooks will build on this knowledge and we'll actually get good results.

## 0. Environment Setup

In [2]:
!pip install "torch>=1.2.0"
!pip install torchtext==0.4.0
%matplotlib inline

Collecting torchtext==0.4.0
  Downloading https://files.pythonhosted.org/packages/43/94/929d6bd236a4fb5c435982a7eb9730b78dcd8659acf328fd2ef9de85f483/torchtext-0.4.0-py3-none-any.whl (53kB)
Installing collected packages: torchtext
  Found existing installation: torchtext 0.3.1
    Uninstalling torchtext-0.3.1:
      Successfully uninstalled torchtext-0.3.1
Successfully installed torchtext-0.4.0


## 1. Preparing Data

One of the main concepts of __TorchText__ is the `Field`. A `Field` defines how your data should be processed. In our sentiment classification task, the data consists of both the __raw string__ of the review and the sentiment, either __"pos" or "neg"__.

The parameters of a `Field` specify __how the data should be processed__.

We use the `TEXT` field to define __how the review should be processed__, and the `LABEL` field to process the sentiment.

Our `TEXT` field has `tokenize='spacy'` as an argument. This specifies that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the [spaCy](https://spacy.io/) tokenizer. If no `tokenize` argument is passed, the __default is simply splitting the string on spaces__.
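
To make the difference concrete, here is a minimal sketch comparing a plain split on spaces with a punctuation-aware tokenizer. The regex tokenizer is only a stand-in for illustration: it is not spaCy and will not reproduce spaCy's output in general, and both function names are hypothetical.

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace; punctuation stays glued to the words.
    return text.split()

def simple_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: emit words and
    # individual punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "This film is great, and I love it."
print(whitespace_tokenize(sentence))
print(simple_tokenize(sentence))
```

With the whitespace split, `great,` and `it.` are single tokens, so they would be counted as different words from `great` and `it`. A tokenizer that separates punctuation avoids this.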

`LABEL` is defined by a `LabelField`, a special subclass of the `Field` class specifically used for handling labels. We will explain the `dtype` argument later.

For more on `Fields`, go [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).

We also set the random seeds for reproducibility.

Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP).

The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as `torchtext.datasets` objects. It processes the data using the `Fields` we have previously defined. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.

In [3]:
import torch
from torchtext import data, datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:07<00:00, 11.2MB/s]


In [4]:
print(type(data))
print(type(train_data))

sample = train_data[0]

print(sample)
print(sample.__dict__)
print(sample.__dict__.keys())

print('Text: ', ' '.join(sample.text))
print('Label: ', sample.label, type(sample.label))

<class 'module'>
<class 'torchtext.datasets.imdb.IMDB'>
<torchtext.data.example.Example object at 0x7f3c1645e390>
{'text': ['"', 'A', 'Mouse', 'in', 'the', 'House', '"', 'is', 'a', 'very', 'classic', 'cartoon', 'by', 'Tom', '&', 'Jerry', ',', 'faithful', 'to', 'their', 'tradition', 'but', 'with', 'jokes', 'of', 'its', 'own', '.', 'It', 'is', 'hysterical', ',', 'hilarious', ',', 'very', 'entertaining', 'and', 'quite', 'amusing', '.', 'Artwork', 'is', 'of', 'good', 'quality', 'either.<br', '/><br', '/>This', 'short', 'is', "n't", 'just', 'about', 'Tom', 'trying', 'to', 'catch', 'Jerry', '.', 'Butch', 'lives', 'in', 'the', 'same', 'house', 'and', 'he', "'s", 'trying', 'to', 'catch', 'the', 'mouse', 'too', ',', 'because', '«', 'there', "'s", 'only', 'going', 'to', 'be', 'one', 'cat', 'in', 'this', 'house', 'in', 'the', 'morning', '--', 'and', 'that', "'s", 'the', 'cat', 'that', 'catches', 'the', 'mouse».<br', '/><br', '/>If', 'you', 'ask', 'me', ',', 'there', 'are', 'lots', 'of', 'funny', 

We can check the size of each split and look at one of the examples.

In [6]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')
print(vars(train_data.examples[0])) # vars(object) is equivalent to object.__dict__

Number of training examples: 25000
Number of testing examples: 25000
{'text': ['"', 'A', 'Mouse', 'in', 'the', 'House', '"', 'is', 'a', 'very', 'classic', 'cartoon', 'by', 'Tom', '&', 'Jerry', ',', 'faithful', 'to', 'their', 'tradition', 'but', 'with', 'jokes', 'of', 'its', 'own', '.', 'It', 'is', 'hysterical', ',', 'hilarious', ',', 'very', 'entertaining', 'and', 'quite', 'amusing', '.', 'Artwork', 'is', 'of', 'good', 'quality', 'either.<br', '/><br', '/>This', 'short', 'is', "n't", 'just', 'about', 'Tom', 'trying', 'to', 'catch', 'Jerry', '.', 'Butch', 'lives', 'in', 'the', 'same', 'house', 'and', 'he', "'s", 'trying', 'to', 'catch', 'the', 'mouse', 'too', ',', 'because', '«', 'there', "'s", 'only', 'going', 'to', 'be', 'one', 'cat', 'in', 'this', 'house', 'in', 'the', 'morning', '--', 'and', 'that', "'s", 'the', 'cat', 'that', 'catches', 'the', 'mouse».<br', '/><br', '/>If', 'you', 'ask', 'me', ',', 'there', 'are', 'lots', 'of', 'funny', 'gags', 'in', 'this', 'cartoon', '.', 'The', 

We only have train and test splits so far, so we create a validation set by splitting the training set. A `split_ratio` of 0.8 means that 80% of the examples stay in the training set and 20% make up the validation set.

In [7]:
import random

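# random.seed() returns None, so seed first and pass the resulting state explicitly.
random.seed(SEED)
train_data, valid_data = train_data.split(split_ratio = 0.8, random_state = random.getstate())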
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 20000
Number of validation examples: 5000
Number of testing examples: 25000
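
As a rough illustration of what an 80/20 split does, the sketch below shuffles a list of stand-in examples with a seeded generator and slices it. This is a plain-Python approximation of the idea, not TorchText's implementation, and `split_examples` is a hypothetical helper.

```python
import random

def split_examples(examples, split_ratio=0.8, seed=1234):
    # Shuffle a copy with a dedicated, seeded generator so the
    # split is the same every time the function is called.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * split_ratio))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(25000))  # stand-ins for the 25,000 training reviews
train_part, valid_part = split_examples(examples)
print(len(train_part), len(valid_part))
```

Shuffling before slicing matters: without it, the validation set would simply be the last 20% of the examples in whatever order they were loaded.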


Next we have to build a _vocabulary_. This is effectively a lookup table where every unique word in your data set has a corresponding _index_ (an integer).

We do this because our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0 except one, which is 1, and whose dimensionality is the total number of unique words in your vocabulary, commonly denoted by _V_.
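
The sketch below shows the idea on a toy four-word vocabulary. The helper names `build_index` and `one_hot` are hypothetical and exist only for this illustration.

```python
def build_index(tokens):
    # Give each unique token the next free integer, in order of first appearance.
    index = {}
    for token in tokens:
        if token not in index:
            index[token] = len(index)
    return index

def one_hot(token, index):
    # A vector of V zeros with a single 1 at the token's index.
    vector = [0] * len(index)
    vector[index[token]] = 1
    return vector

index = build_index(["this", "film", "is", "great"])
print(index)                 # {'this': 0, 'film': 1, 'is': 2, 'great': 3}
print(one_hot("is", index))  # [0, 0, 1, 0]
```

Here _V_ is 4, so every vector has four elements. With the real vocabulary, each vector would have one element per word in the vocabulary.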

![alt text](https://doc-0s-4g-docs.googleusercontent.com/docs/securesc/ldgmc9f1rnrbpb7r2nci7mdkujir7e1k/sttjcf30i03i1ummpenuql67vsscr25u/1580385600000/15602990810144463660/04768977881078875371/1lrne4KntVuYW7SW-V_sP_Xk8y95vswO1?authuser=0)

The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow, and the vectors may not fit onto your GPU (if you're using one).

There are two ways to effectively cut down our vocabulary: we can either take only the top _n_ most common words or ignore words that appear fewer than _m_ times. We'll do the former, keeping only the top 25,000 words.
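
Both cut-offs can be sketched with `collections.Counter` on a toy token list. The values of `n` and `m` here are arbitrary and chosen only so that the two options give different results.

```python
from collections import Counter

tokens = ["the", "film", "was", "great", "the", "acting",
          "was", "great", "the", "great", "the", "end"]
counts = Counter(tokens)

# Option 1: keep only the n most common tokens.
n = 2
top_n = [token for token, _ in counts.most_common(n)]

# Option 2: drop tokens that appear fewer than m times.
m = 2
frequent = [token for token, count in counts.items() if count >= m]

print(top_n)     # ['the', 'great']
print(frequent)  # ['the', 'was', 'great']
```

The first option fixes the vocabulary size in advance, which is convenient when the size of the model's input layer depends on it. The second fixes a quality threshold instead and lets the size vary with the data.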

What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `<unk>` token. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".
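
The replacement itself is a simple lookup. A minimal sketch using the sentence above, where `replace_unknown` is a hypothetical helper and the vocabulary is a hand-written set:

```python
UNK = "<unk>"

def replace_unknown(tokens, vocab):
    # Keep tokens that are in the vocabulary; swap everything else for <unk>.
    return [token if token in vocab else UNK for token in tokens]

vocab = {"This", "film", "is", "great", "and", "I", "it"}  # "love" is missing
sentence = "This film is great and I love it".split()

print(" ".join(replace_unknown(sentence, vocab)))
# This film is great and I <unk> it
```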

The following builds the vocabulary, only keeping the most common `max_size` tokens.

![alt text](https://doc-0s-4g-docs.googleusercontent.com/docs/securesc/ldgmc9f1rnrbpb7r2nci7mdkujir7e1k/thivloa32v6eegel2v5rksgm9fh6253a/1580385600000/15602990810144463660/04768977881078875371/1FybOlHRx0ayGZp5hWxuu3WuYaHMboV3I?authuser=0)

In [8]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

print(f'Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}')
print(f'Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}')
print(TEXT.vocab.itos[:10])
print(LABEL.vocab.stoi)

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']
defaultdict(None, {'neg': 0, 'pos': 1})
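
The `TEXT` vocabulary has 25,002 entries rather than 25,000 because of the two special tokens, `<unk>` and `<pad>`, which sit at the front of `itos` in the output above. The sketch below mimics that layout on toy data. It is a plain-Python approximation, not TorchText's `Vocab` class, and `build_vocab` and `numericalize` are hypothetical names.

```python
from collections import Counter

def build_vocab(tokens, max_size, specials=("<unk>", "<pad>")):
    # Special tokens come first, followed by the max_size most common tokens.
    counts = Counter(tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Map each token to its index, falling back to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

tokens = "the film the acting the film".split()
itos, stoi = build_vocab(tokens, max_size=2)

print(itos)                                            # ['<unk>', '<pad>', 'the', 'film']
print(numericalize(["the", "acting", "film"], stoi))   # [2, 0, 3]
```

`itos` ("integer to string") is a list and `stoi` ("string to integer") is the reverse mapping. The word `acting` fell outside the top two, so it maps to index 0, the index of `<unk>`.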


The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a `BucketIterator` which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.
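
To see why grouping by length helps, the sketch below counts how many padding tokens are needed when batches are formed in arbitrary order versus after sorting by length. It only illustrates the principle and is not how `BucketIterator` is implemented; the helper names and the toy lengths are assumptions.

```python
def pad_batch(batch, pad_token="<pad>"):
    # Pad every example to the length of the longest one in the batch.
    longest = max(len(example) for example in batch)
    return [example + [pad_token] * (longest - len(example)) for example in batch]

def count_padding(examples, batch_size):
    # Total number of pad tokens needed across consecutive batches.
    total = 0
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        longest = max(len(example) for example in batch)
        total += sum(longest - len(example) for example in batch)
    return total

lengths = [2, 9, 3, 10, 2, 8]
examples = [["tok"] * n for n in lengths]

unsorted_padding = count_padding(examples, batch_size=3)
bucketed_padding = count_padding(sorted(examples, key=len), batch_size=3)

print(pad_batch([["a"], ["b", "c"]]))      # [['a', '<pad>'], ['b', 'c']]
print(unsorted_padding, bucketed_padding)  # 23 5
```

Padding tokens carry no information but still cost computation, so fewer of them means less wasted work per batch.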

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using `torch.device`, which we then pass to the iterator.

In [0]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)

In [11]:
print(device)

cuda
