<a href="https://colab.research.google.com/github/alanwuha/ce7455-nlp/blob/master/Logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression Sentiment Analysis

In this series we'll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

In this first notebook, we'll start very simple to understand the general concepts whilst not really caring about good results. Further notebooks will build on this knowledge and we'll actually get good results.

## 0. Environment Setup

In [28]:
!pip install torch>=1.2.0
!pip install torchtext==0.4.0
%matplotlib inline



## 1. Preparing Data

One of the main concepts of __TorchText__ is the `Field`. These define how your data should be processed. In our sentiment classification task the data consists of both the __raw string__ of the review and the sentiment, either __"pos" or "neg"__.

The parameters of a `Field` specify __how the data should be processed__.

We use the `TEXT` field to define __how the review should be processed__, and the `LABEL` field to process the sentiment.

Our `TEXT` field has `tokenize='spacy'` as an argument. This defines that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the [spaCy](https://spacy.io/) tokenizer. If no `tokenize` argument is passed, the __default is simply splitting the string on spaces__.

`LABEL` is defined by a `LabelField`, a special subset of the `Field` class specifically used for handling labels. We will explain the `dtype` argument later.

For more on `Fields`, go [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).

We also set the random seeds for reproducibility.

Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP).

The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as `torchtext.datasets` objects. It processes the data using the `Fields` we have previously defined. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.

In [0]:
import torch
from torchtext import data, datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [30]:
print(type(data))
print(type(train_data), '\n')

sample = train_data[0]

print(vars(train_data))
print(len(train_data.examples), '\n')

print(sample, '\n')

print(type(vars(sample)))
print(vars(sample))
print(sample.__dict__, '\n')

print(type(sample.__dict__.keys()))
print(sample.__dict__.keys(), '\n')

print(type(sample.__dict__.values()))
print(sample.__dict__.values(), '\n')

print(type(sample.__dict__.items()))
print(sample.__dict__.items(), '\n')

print('Text: ', ' '.join(sample.text))
print('Label: ', sample.label, type(sample.label))

<class 'module'>
<class 'torchtext.datasets.imdb.IMDB'> 

{'examples': [<torchtext.data.example.Example object at 0x7fd7b81e0550>, <torchtext.data.example.Example object at 0x7fd7b81e0630>, <torchtext.data.example.Example object at 0x7fd7c4a91240>, <torchtext.data.example.Example object at 0x7fd7c08eccc0>, <torchtext.data.example.Example object at 0x7fd7b4114ef0>, <torchtext.data.example.Example object at 0x7fd7b28e4668>, <torchtext.data.example.Example object at 0x7fd7b28e51d0>, <torchtext.data.example.Example object at 0x7fd7bb4cd978>, <torchtext.data.example.Example object at 0x7fd7bb4cf198>, <torchtext.data.example.Example object at 0x7fd7c3a0fdd8>, <torchtext.data.example.Example object at 0x7fd7bae12f60>, <torchtext.data.example.Example object at 0x7fd7c03a3c18>, <torchtext.data.example.Example object at 0x7fd7c0c9cda0>, <torchtext.data.example.Example object at 0x7fd7b437a668>, <torchtext.data.example.Example object at 0x7fd7c0573be0>, <torchtext.data.example.Example object at 0

We can see length of datasets and show some examples

In [31]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')
print(vars(train_data.examples[0])) # vars(object) is equivalent to object.__dict__

Number of training examples: 25000
Number of testing examples: 25000
{'text': ['Diagnosis', 'Murder', 'has', 'been', 'shown', 'on', 'most', 'Weekday', 'afternoons', 'on', 'BBC1', 'since', 'I', 'used', 'to', 'watch', 'it', 'while', 'ill', 'from', 'School', 'a', 'good', '10', 'years', 'ago', '-', 'I', 'know', 'I', 'should', "n't", 'really', 'enjoy', 'it', ',', 'in', 'the', 'same', 'way', 'I', 'should', "n't", 'enjoy', "'", 'Murder', 'she', 'Wrote', "'", 'but', 'I', "'m", 'totally', 'addicted', 'to', 'both', 'and', 'even', 'have', 'the', 'DVD', 'box', '-', 'sets', '....', 'OK', 'I', 'know', 'that', "'s", 'sad!<br', '/><br', '/>Dick', 'Van', 'Dyke', 'carries', 'the', 'show', 'as', 'he', 'stars', 'as', 'Dr', '.', 'Mark', 'Sloan', 'a', 'Doctor', 'at', 'Community', 'General', 'Hospital', 'in', 'L.A', 'who', 'is', 'also', 'a', 'Police', 'consultant', 'for', 'the', 'L.A.P.D.', '-', 'his', 'son', 'Steve', '(', 'Barry', 'van', 'Dyke', '-', 'Dick', "'s", 'real', 'life', 'son', ')', 'is', 'a', 'Pol

Generate the validation set with a `split_ratio` of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set.

In [32]:
import random

train_data, valid_data = train_data.split(random_state = random.seed(SEED), split_ratio=0.8)
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 20000
Number of validation examples: 5000
Number of testing examples: 25000


Next we have to build a _vocabulary_. This is effectively a look up table where every unique work in your data in your data set has a corresponding _index_ (an integer).

We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by V.

![image1](https://doc-0s-4g-docs.googleusercontent.com/docs/securesc/ldgmc9f1rnrbpb7r2nci7mdkujir7e1k/3uop048ls2a8pqmlpku25if7nflkq6hj/1580436000000/15602990810144463660/04768977881078875371/1lrne4KntVuYW7SW-V_sP_Xk8y95vswO1?authuser=0)

The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow and possibly won't fit onto your GPU (if you're using one).

There are two ways to effectively cut down our vocabulary, we can either only take the top _n_ most common words or ignore words that appear less than _m_ times. We'll do the former, only keeping the top 25,000 words.

What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `<unk>` token. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".

The following builds the vocabulary, only keeping the most common `max_size` tokens.

![image2](https://doc-0s-4g-docs.googleusercontent.com/docs/securesc/ldgmc9f1rnrbpb7r2nci7mdkujir7e1k/5419ncve5o5ldsrskere3sm2f8qg8i4g/1580436000000/15602990810144463660/04768977881078875371/1FybOlHRx0ayGZp5hWxuu3WuYaHMboV3I?authuser=0)

In [33]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)
print(f'Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}')
print(f'Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}')
print(TEXT.vocab.itos[:10])
print(LABEL.vocab.stoi)

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']
defaultdict(None, {'neg': 0, 'pos': 1})


The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a `BucketIterator` which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using `torch.device`, we then pass this device to the iterator.

In [0]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)

In [57]:
print(device)

cuda


## 2. Build the Model

The next stage is building the model that we'll eventually train and evaluate. Our model us the representation of the sentence as sum of the embedding of all the tokens in the sentence.

![alt text](https://doc-0o-4g-docs.googleusercontent.com/docs/securesc/ldgmc9f1rnrbpb7r2nci7mdkujir7e1k/45027nstchj9c9ihne782iksamiio95q/1580436000000/15602990810144463660/04768977881078875371/1Pd0MDZxb_WaGkrOKQZQiitXbH2fIIAeJ?authuser=0)

In [0]:
import torch.nn as nn

class LogisticRegression(nn.Module):
  def __init__(self, vocab_size, embedding_dim):
    super().__init__()
    self.emddingbe = nn.Embedding(vocab_size, embedding_dim, padding_idx=1)
    self.fc = nn.Linear(embedding_dim, 1)

  def forward(self, text):
    # text = [sent len, batch size]
    embedded = self.embedding(text)

    # embedded = [sent len, batch size, emb dim]
    hidden = torch.sum(embedded, dim = 0)
    # hidden = [batch size, hid dim]
    output = self.fc(embedded).squeeze(0)
    # output = [batch_size]

    return self.fc(hidden.squeeze(0))

Initialize the model

In [60]:
VOCAB_SIZE = len(TEXT.vocab)
EMBEDDING_DIM = 100
model = LogisticRegression(VOCAB_SIZE, EMBEDDING_DIM)

print(model)

LogisticRegression(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (fc): Linear(in_features=100, out_features=1, bias=True)
)


Print the number of parameters in the model

In [61]:
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,500,301 trainable parameters


In [16]:
m = nn.Linear(5, 2)
print(type(m), '\n')
print(m, '\n')
print(m.state_dict, '\n')
print(vars(m), '\n')
print(m.parameters())

print([p.numel() for p in model.parameters if p.requires_grad])

<class 'torch.nn.modules.linear.Linear'> 

Linear(in_features=5, out_features=2, bias=True) 

<bound method Module.state_dict of Linear(in_features=5, out_features=2, bias=True)> 

{'training': True, '_parameters': OrderedDict([('weight', Parameter containing:
tensor([[-0.0282, -0.2291,  0.3909, -0.2913,  0.1076],
        [-0.2279,  0.2199, -0.3151, -0.2024,  0.3449]], requires_grad=True)), ('bias', Parameter containing:
tensor([0.2086, 0.1009], requires_grad=True))]), '_buffers': OrderedDict(), '_backward_hooks': OrderedDict(), '_forward_hooks': OrderedDict(), '_forward_pre_hooks': OrderedDict(), '_state_dict_hooks': OrderedDict(), '_load_state_dict_pre_hooks': OrderedDict(), '_modules': OrderedDict(), 'in_features': 5, 'out_features': 2} 

<generator object Module.parameters at 0x7fed66bdffc0>


Set up the training: optimizer, criterion/loss function

In [0]:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-9)

criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

Define training function

In [0]:
def binary_accuracy(preds, y):
  """Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8

  >>> binary_accuracy([-2.0], [1])
  0.0
  >>> binary_accuracy([2.0], [1])
  1.0
  >>> binary_accuracy([-6.0, 4.0], [0, 1])
  0.5
  """

  # round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(preds))
  correct = (rounded_preds == y).float()  # convert into float for division
  acc = correct.sum() / len(correct)
  return acc

## 3. Define the train and evaluation function

In [0]:
def train(model, iterator, optimizer, criterion):

  epoch_loss = 0
  epoch_acc = 0

  model.train()

  for batch in iterator:
    predictions = model(batch.text).squeeze(1)

    loss = criterion(predictions, batch.label)

    acc = binary_accuracy(predictions, batch.label)

    optimizer.zero_grad()

    loss.backward()

    optimizer.step()

    epoch_loss += loss.item()
    epoch_acc += acc.item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
def evaluate(model, iterator, criterion):

  epoch_loss = 0
  epoch_acc = 0

  model.eval()

  with torch.no_grad():

    for batch in iterator:

      predictions = model(batch.text).squeeze(1)

      loss = criterion(predictions, batch.label)

      acc = binary_accuracy(predictions, batch.label)

      epoch_loss += loss.item()
      epoch_acc += acc.item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [0]:
import time

def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60) # elapsed_time // 60
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))  # elapsed_time % 60
  return elapsed_mins, elapsed_secs

## 4. Train the network

In [70]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

  start_time = time.time()

  train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
  valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

  end_time = time.time()

  epoch_mins, epoch_secs = epoch_time(start_time, end_time)

  if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), 'tut1-model1.pt')

  print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins:02}m {epoch_secs:02}s')
  print(f'\tTrain Loss: {train_loss: .3f} | Train Acc: {train_acc*100:.2f}%')
  print(f'\tVal. Loss: {valid_loss: .3f} | Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 00m 03s
	Train Loss:  2.759 | Train Acc: 58.57%
	Val. Loss:  5.782 | Val. Acc: 49.35%
Epoch: 02 | Epoch Time: 00m 03s
	Train Loss:  2.654 | Train Acc: 58.79%
	Val. Loss:  1.375 | Val. Acc: 61.35%
Epoch: 03 | Epoch Time: 00m 03s
	Train Loss:  2.765 | Train Acc: 58.38%
	Val. Loss:  1.424 | Val. Acc: 61.23%
Epoch: 04 | Epoch Time: 00m 03s
	Train Loss:  2.600 | Train Acc: 58.71%
	Val. Loss:  0.982 | Val. Acc: 65.15%
Epoch: 05 | Epoch Time: 00m 03s
	Train Loss:  2.491 | Train Acc: 59.71%
	Val. Loss:  6.476 | Val. Acc: 51.66%


## 5. Test the network

In [71]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 3.494 | Test Acc: 51.63%
