<a href="https://colab.research.google.com/github/fazaghifari/Notebook-Collections/blob/master/NLP_and_timeseries/Binary_Sentiment_BERT_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Simple Sentiment Analysis with BERT


#### Install Transformers from HuggingFace

In [1]:
!pip install transformers



#### Import libraries and check for a GPU

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, Dataset
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import time
import numpy as np
# check if CUDA is available
train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')
    print(f'Device name = {torch.cuda.get_device_name(0)}')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

CUDA is available!  Training on GPU ...
Device name = Tesla P100-PCIE-16GB


---
### Load in and visualize the data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import os
file1 = 'reviews.txt'
file2 = 'labels.txt'
folder = '/content/drive/My Drive/datasets/sentiment_class/'
path1 = os.path.join(folder,file1)
path2 = os.path.join(folder,file2)
# read data from text files
with open(path1, 'r') as f:
    reviews = f.read()
with open(path2, 'r') as f:
    labels = f.read()

In [5]:
print(reviews[:2000])
print()
print(labels[:26])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

### Data cleaning

In this step, all punctuation is removed from the reviews, since it makes little difference for this task. The text is then split on '\n' into a list of reviews (called `sentences` in the code), and the labels are converted to 0 (negative) and 1 (positive).
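The cleaning steps described above can be tried on a small example before running them on the full dataset. This is a minimal sketch: the two reviews and labels are made up for illustration, and `clean_and_split` is a hypothetical helper name that does not appear in the notebook.

```python
from string import punctuation

def clean_and_split(raw_reviews, raw_labels):
    """Lowercase, strip punctuation, split on newlines, and encode labels as 0/1."""
    text = raw_reviews.lower()
    text = ''.join(c for c in text if c not in punctuation)
    sentences = text.split('\n')
    labels = [1 if label == 'positive' else 0 for label in raw_labels.split('\n')]
    return sentences, labels

# Toy inputs (illustrative only, not taken from reviews.txt / labels.txt)
toy_reviews = "Great film, loved it!\nTerrible... just terrible."
toy_labels = "positive\nnegative"

toy_sentences, toy_encoded = clean_and_split(toy_reviews, toy_labels)
print(toy_sentences)  # ['great film loved it', 'terrible just terrible']
print(toy_encoded)    # [1, 0]
```

Note that removing punctuation one character at a time leaves any double spaces in place; the BERT tokenizer splits on whitespace, so this should not affect the token IDs.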

In [6]:
from string import punctuation

print(punctuation)

# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])
sentences = all_text.split('\n') # Split text to list of sentences
all_text = ' '.join(sentences)

# Convert labels to 0 and 1
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
print(len(sentences))
print(len(encoded_labels))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
25000
25000


### Tokenize Sentence

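The BERT tokenizer converts each sentence into token IDs, padded or truncated to `max_length`, together with an attention mask that marks which positions hold real tokens.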
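To make the padding and attention mask concrete, here is a rough pure-Python sketch of what happens to a sequence of word-piece IDs. It is not the tokenizer's implementation. The special IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) match the tensors printed later in this notebook, while `pad_and_mask` and the sample IDs are assumptions made for illustration.

```python
def pad_and_mask(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Wrap IDs in [CLS]/[SEP], truncate to max_len, pad, and build the mask."""
    body = token_ids[:max_len - 2]          # leave room for the two special tokens
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)                   # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

ids, mask = pad_and_mask([1103, 2523, 1108, 1363], max_len=8)
print(ids)   # [101, 1103, 2523, 1108, 1363, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The attention mask lets BERT ignore the padded positions, which is why both `input_ids` and `attention_mask` are passed to the model.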

In [0]:
max_length = 200
split_frac = 0.8
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'

In [0]:
split = int(split_frac * len(sentences))
training_sentences = sentences[0:split]
val_sentences = sentences[split:len(sentences)]
training_labels = encoded_labels[0:split]
val_labels = encoded_labels[split:len(sentences)]

In [0]:
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
class SentimentDataset(Dataset):
    def __init__(self, sentences, labels, tokenizer, max_len):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self, item):
        sentence = str(self.sentences[item])
        label = self.labels[item]

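        # use the tokenizer and max length stored on the dataset, not the globals
        encoded = self.tokenizer.encode_plus(
        sentence,
        max_length=self.max_len,
        add_special_tokens=True, # Add '[CLS]' and '[SEP]'
        return_token_type_ids=False,
        padding='max_length', # replaces the deprecated pad_to_max_length=True (transformers >= 3.0)
        truncation=True, # cut reviews longer than max_length
        return_attention_mask=True,
        return_tensors='pt',  # Return PyTorch tensors
        )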

        return {
        'text': sentence,
        'input_ids': encoded['input_ids'].flatten(),
        'attention_mask': encoded['attention_mask'].flatten(),
        'targets': torch.tensor(label, dtype=torch.long)
        }

def create_loader(sentences, labels, tokenizer, max_len, batch_size):
  ds = SentimentDataset(
    sentences=sentences,
    labels=labels,
    tokenizer=tokenizer,
    max_len=max_len
  )
  return DataLoader(
    ds,
    batch_size=batch_size,
    num_workers=4
  )

train_batch_size = 32
val_batch_size = 20

train_loader = create_loader(training_sentences, training_labels, tokenizer, max_length, train_batch_size)
val_loader = create_loader(val_sentences, val_labels, tokenizer, max_length, val_batch_size)

In [10]:
# obtain one batch of training data
dataiter = iter(train_loader)
data = next(dataiter)

print('Sample input size: ', data['input_ids'].shape) # batch_size, seq_length
print('Sample input: \n', data['input_ids'])
print()
print('Sample label size: ', data['targets'].shape) # batch_size
print('Sample label: \n', data['targets'])

Sample input size:  torch.Size([32, 200])
Sample input: 
 tensor([[  101,  9304,  4165,  ...,     0,     0,     0],
        [  101,  1642,  1104,  ...,     0,     0,     0],
        [  101, 12501,  1757,  ...,  1267,  1191,   102],
        ...,
        [  101,  1142,  2523,  ...,     0,     0,     0],
        [  101,  1176,  1141,  ...,     0,     0,     0],
        [  101,   170,  5624,  ...,     0,     0,     0]])

Sample label size:  torch.Size([32])
Sample label: 
 tensor([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
        1, 0, 1, 0, 1, 0, 1, 0])


---
### Define Model

In [0]:
class SentimentClassifier(nn.Module):
  def __init__(self, n_classes, PRE_TRAINED_MODEL_NAME):
    super(SentimentClassifier, self).__init__()
    self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
    self.drop = nn.Dropout(p=0.3)
    self.fc1 = nn.Linear(self.bert.config.hidden_size, 256)
    self.fc2 = nn.Linear(256,1)
    self.sig = nn.Sigmoid()
  def forward(self, input_ids, attention_mask):
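    # index 1 is the pooled [CLS] output; indexing works whether the model
    # returns a tuple (older transformers) or a ModelOutput (transformers >= 4)
    pooled_output = self.bert(
      input_ids=input_ids,
      attention_mask=attention_mask
    )[1]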
    output = self.drop(pooled_output)
    output = F.relu(self.fc1(output))
    output = self.sig(self.fc2(output))
    return output
        

### Instantiate the network


* `output_size`: Number of output units. It is 1 here because the model emits a single sigmoid probability for the positive class.
* `PRE_TRAINED_MODEL_NAME`: Name of the pretrained BERT checkpoint to load (`bert-base-cased`, set above).
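For reference, the classification head defined above (dropout, a 768→256 linear layer with ReLU, then a 256→1 linear layer with sigmoid) can be traced with plain NumPy to check the shapes. This is only a sketch under assumptions: the weights are random rather than trained, the pooled BERT output is replaced by random numbers, and dropout is omitted because it is the identity in evaluation mode. `head_forward` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def head_forward(pooled, w1, b1, w2, b2):
    """Mirror fc1 -> ReLU -> fc2 -> sigmoid from SentimentClassifier.forward."""
    hidden = np.maximum(pooled @ w1 + b1, 0.0)   # fc1 + ReLU
    logits = hidden @ w2 + b2                    # fc2
    return 1.0 / (1.0 + np.exp(-logits))         # sigmoid

batch_size, hidden_size = 4, 768                 # 768 = BERT-base hidden size
pooled = rng.standard_normal((batch_size, hidden_size))  # stand-in for pooled_output
w1 = rng.standard_normal((hidden_size, 256)) * 0.02
b1 = np.zeros(256)
w2 = rng.standard_normal((256, 1)) * 0.02
b2 = np.zeros(1)

probs = head_forward(pooled, w1, b1, w2, b2)
print(probs.shape)  # (4, 1)
```

Each row is one probability per review, which is why the training loop calls `output.squeeze()` before comparing against the label vector.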



In [18]:
# Instantiate the model w/ hyperparams
output_size = 1

net = SentimentClassifier(output_size, PRE_TRAINED_MODEL_NAME)

print(net)

SentimentClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affin

---
### Training
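The loop below scores the model's sigmoid outputs with `nn.BCELoss`. As a quick illustration of what that loss computes, here is a small pure-Python version of mean binary cross-entropy. The probabilities and targets are made-up values, and `bce` is a hypothetical helper rather than part of the notebook.

```python
import math

def bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy: -[t*log(p) + (1-t)*log(1-p)] averaged over the batch."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # guard against log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

confident_right = bce([0.9, 0.2], [1, 0])
confident_wrong = bce([0.1, 0.8], [1, 0])
print(round(confident_right, 4))  # 0.1643
print(confident_wrong > confident_right)  # True
```

Confident wrong predictions are penalized much more heavily than confident correct ones, which helps explain how validation loss can rise even when validation accuracy barely changes.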

In [19]:
# loss and optimization functions
# for param in net.bert.parameters():
#     param.requires_grad = False
lr=2e-5

criterion = nn.BCELoss().to(device)
optimizer = AdamW(net.parameters(), lr=lr, correct_bias=False)
# training params
epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
clip=5 # gradient clipping (unused: clip_grad_norm_ below is called with max_norm=1.0)

# move model to GPU, if available
net = net.to(device)

net.train()
# train for some number of epochs
print("Training on ", device)
print("====================TRAINING PROCESS====================")
for e in range(epochs):
    t0 = time.time()
    # keep track of training and validation loss
    train_loss = 0.0
    valid_loss = 0.0
    accuracy = 0.0

    # batch loop
    for d in train_loader:
        counter += 1
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        labels = d["targets"].to(device)

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output= net(input_ids, attention_mask)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # clip the gradient norm to 1.0 to keep fine-tuning updates stable
        nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
        optimizer.step()
        train_loss += loss.item()

        
    # Get validation loss
    net.eval()
    with torch.no_grad():
        for d in val_loader:

            # move the validation batch to the same device as the model
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            labels = d["targets"].to(device)

            output = net(input_ids, attention_mask)
            val_loss = criterion(output.squeeze(), labels.float())

            valid_loss += val_loss.item()
            equals = torch.round(output) == labels.view(*output.shape)
            accuracy += torch.mean(equals.type(torch.FloatTensor))
        
    train_loss = train_loss/len(train_loader)
    valid_loss = valid_loss/len(val_loader)
    valid_acc = accuracy/len(val_loader)

    net.train()
    print("Epoch: {}/{}...".format(e+1, epochs),
            "Step: {}...".format(counter),
            "Loss: {:.6f}...".format(train_loss),
            "Val Loss: {:.6f}".format(valid_loss),
           "Val acc: {:.3f}".format(valid_acc),
           "Time elapsed: {:.1f}".format( time.time()-t0))

Training on  cuda:0
Epoch: 1/4... Step: 625... Loss: 0.355704... Val Loss: 0.277339 Val acc: 0.890 Time elapsed: 401.0
Epoch: 2/4... Step: 1250... Loss: 0.193204... Val Loss: 0.306142 Val acc: 0.898 Time elapsed: 400.7
Epoch: 3/4... Step: 1875... Loss: 0.114813... Val Loss: 0.464821 Val acc: 0.889 Time elapsed: 400.5
Epoch: 4/4... Step: 2500... Loss: 0.081735... Val Loss: 0.508594 Val acc: 0.885 Time elapsed: 400.2
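In the log above, training loss keeps falling while validation loss rises after the first epoch, which is a common sign of overfitting. One possible way to act on this is to pick the epoch to keep from the logged metrics. The sketch below simply copies the four rows of the log into a list; the `history` structure is an assumption, since the notebook does not store its metrics.

```python
# Metrics copied from the training log above
history = [
    {"epoch": 1, "train_loss": 0.355704, "val_loss": 0.277339, "val_acc": 0.890},
    {"epoch": 2, "train_loss": 0.193204, "val_loss": 0.306142, "val_acc": 0.898},
    {"epoch": 3, "train_loss": 0.114813, "val_loss": 0.464821, "val_acc": 0.889},
    {"epoch": 4, "train_loss": 0.081735, "val_loss": 0.508594, "val_acc": 0.885},
]

best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_acc = max(history, key=lambda row: row["val_acc"])

print(best_by_loss["epoch"])  # 1
print(best_by_acc["epoch"])   # 2
```

Under either criterion the best checkpoint comes from the first two epochs, so saving the model state at the end of each epoch (for example with `torch.save(net.state_dict(), ...)`) and reloading the best one would likely be preferable to using the final weights.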


---
### Testing

In [22]:
# Get loss and accuracy on the held-out split (val_loader is reused as the test set)

test_losses = [] # track loss
num_correct = 0

net.eval()
# iterate over test data
with torch.no_grad():
    for d in val_loader:

        # move the batch to the same device as the model

        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        labels = d["targets"].to(device)
        
        # get predicted outputs
        output = net(input_ids, attention_mask)
        
        # calculate loss
        test_loss = criterion(output.squeeze(), labels.float())
        test_losses.append(test_loss.item())
        
        # convert output probabilities to predicted class (0 or 1)
        pred = torch.round(output.squeeze())  # rounds to the nearest integer
        
        # compare predictions to true label
        correct_tensor = pred.eq(labels.float().view_as(pred))
        correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
        num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(val_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.509
Test accuracy: 0.885
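The accuracy above comes from rounding each sigmoid output to 0 or 1 and counting matches with the labels. A minimal pure-Python sketch of the same calculation follows; the probabilities and targets are invented for illustration, and `accuracy` is a hypothetical helper. For values in [0, 1], `p > 0.5` agrees with `torch.round`, which rounds an exact 0.5 down to 0.

```python
def accuracy(probs, targets, threshold=0.5):
    """Fraction of predictions that match the targets after thresholding."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(pred == t) for pred, t in zip(preds, targets))
    return correct / len(targets)

toy_probs = [0.98, 0.02, 0.61, 0.45]   # illustrative model outputs
toy_targets = [1, 0, 0, 1]

print(accuracy(toy_probs, toy_targets))  # 0.5
```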


### Inference on a test review



In [0]:
# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'


In [24]:
from string import punctuation

def tokenize_review(test_review):
    test_review = test_review.lower() # lowercase
    # get rid of punctuation
    test_text = ''.join([c for c in test_review if c not in punctuation])

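    # tokenize with the BERT tokenizer, padding or truncating to max_length
    encoded = tokenizer.encode_plus(
        test_text,
        max_length=max_length,
        add_special_tokens=True, # Add '[CLS]' and '[SEP]'
        return_token_type_ids=False,
        padding='max_length', # replaces the deprecated pad_to_max_length=True (transformers >= 3.0)
        truncation=True, # cut reviews longer than max_length
        return_attention_mask=True,
        return_tensors='pt',  # Return PyTorch tensors
        )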

    return encoded['input_ids'], encoded['attention_mask']

# test code and generate tokenized review
token,mask = tokenize_review(test_review_neg)
print(token)
print(mask)

tensor([[ 101, 1103, 4997, 2523,  178, 1138, 1562, 3176, 1108, 6434, 1105,  178,
         1328, 1139, 1948, 1171, 1142, 2523, 1125, 2213, 3176, 1105, 1103, 8556,
         1108, 3345,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,  

In [0]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be 
        positive or negative in sentiment, using a trained model.
        
        params:
        net - A trained net 
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review (unused here;
                          tokenize_review pads to the global max_length)
        '''
    
    
    net.eval()
    
    # tokenize review
    test_ints, mask = tokenize_review(test_review)
    
    if(train_on_gpu):
        test_ints = test_ints.cuda()
        mask = mask.cuda()
    
    # get the output from the model
    output= net(test_ints, mask)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 
    # printing output value, before rounding
    print(test_ints.shape)
    print(output.shape)
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    # print custom response
    if(pred.item()==1):
        print("Positive review detected!")
    else:
        print("Negative review detected.")
    
        

In [0]:
# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'
test_review_rand = "The movie is average. Storyline is quite good, The visual effects and the acting is pretty decent."

In [41]:
# call function
# try negative and positive reviews!
seq_length=200
predict(net, test_review_rand, seq_length)

torch.Size([1, 200])
torch.Size([1, 1])
Prediction value, pre-rounding: 0.019959
Negative review detected.
