<a href="https://colab.research.google.com/github/fazaghifari/Notebook-Collections/blob/master/NLP_and_timeseries/Simple_Sentiment_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Simple Sentiment Analysis with LSTM


#### Install PyTorch-NLP

In [1]:
!pip install pytorch-nlp



#### Import Libraries

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
import time
import numpy as np
# check if CUDA is available
train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')
    print(f'Device name = {torch.cuda.get_device_name(0)}')

CUDA is available!  Training on GPU ...
Device name = Tesla P100-PCIE-16GB


---
### Load in and visualize the data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import os
file1 = 'reviews.txt'
file2 = 'labels.txt'
folder = '/content/drive/My Drive/datasets/sentiment_class/'
path1 = os.path.join(folder,file1)
path2 = os.path.join(folder,file2)
# read data from text files
with open(path1, 'r') as f:
    reviews = f.read()
with open(path2, 'r') as f:
    labels = f.read()

In [5]:
print(reviews[:2000])
print()
print(labels[:26])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

### Data cleaning

In this step, all punctuation is removed from the reviews, because punctuation makes little difference for this task. The whole text is then split on '\n' into a list of reviews (called `sentences` in the code), and the labels are converted to 0 (negative) and 1 (positive).

In [6]:
from string import punctuation

print(punctuation)

# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])
sentences = all_text.split('\n') # Split text to list of sentences
all_text = ' '.join(sentences)

# Convert labels to 0 and 1
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
print(len(sentences))
print(len(encoded_labels))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
25000
25000
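As a small illustration of the cleaning step, the sketch below applies the same idea to a made-up two-review corpus. The strings `raw_reviews` and `raw_labels` are hypothetical stand-ins for the contents of `reviews.txt` and `labels.txt`, and `splitlines()` is used instead of `split('\n')` so that a trailing newline does not produce an empty final entry.

```python
from string import punctuation

# Hypothetical two-review corpus in the same layout as reviews.txt / labels.txt
raw_reviews = "Great movie, loved it!\nTerrible plot... fell asleep.\n"
raw_labels = "positive\nnegative\n"

# Lowercase, then drop every punctuation character
cleaned = ''.join(c for c in raw_reviews.lower() if c not in punctuation)

# splitlines() ignores a trailing newline, so no empty final review is produced
toy_sentences = cleaned.splitlines()
toy_labels = [1 if label == 'positive' else 0 for label in raw_labels.splitlines()]

print(toy_sentences)
print(toy_labels)
```

Checking that the two lists have the same length, as the cell above does, is a quick way to catch a stray empty line in either file.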


### Tokenize Sentence

The PyTorch-NLP package is used to tokenize the words. Compared to the standard torchtext package, this tool is easier to use and more straightforward.
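Conceptually, a static encoder of this kind builds a word-to-index table from the corpus and then maps each review to a list of integers. The sketch below is a plain-Python approximation of that idea, not the PyTorch-NLP implementation: `build_vocab`, `encode`, and the `RESERVED` list are hypothetical names, and the reserved tokens of the real encoder may differ.

```python
# Hypothetical reserved entries; the real encoder's reserved tokens may differ
RESERVED = ['<pad>', '<unk>']

def build_vocab(texts):
    """Assign each new word the next free index, in order of first appearance."""
    vocab = list(RESERVED)
    index = {tok: i for i, tok in enumerate(vocab)}
    for text in texts:
        for word in text.split():
            if word not in index:
                index[word] = len(vocab)
                vocab.append(word)
    return vocab, index

def encode(text, index):
    """Map words to indices, falling back to <unk> for unseen words."""
    unk = index['<unk>']
    return [index.get(word, unk) for word in text.split()]

vocab, index = build_vocab(['the movie was good', 'the plot was bad'])
print(encode('the movie was bad', index))
print(encode('the sequel was bad', index))
```

Because indices are handed out in order of first appearance, words from the first review get the smallest non-reserved indices.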

In [0]:
embedding_dim = 100
max_length = 200
split_frac = 0.8

In [0]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is right-padded with 0's
        or truncated to the input seq_length.
    '''
    
    features=np.zeros((len(reviews_ints), seq_length))
    for ii, content in enumerate(reviews_ints):
        limit = len(content) if len(content) <= seq_length else seq_length
        features[ii, :limit] = content[:seq_length]
    
    return features
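To see what the padding step does, here is a small stand-alone sketch on toy input. `pad_or_truncate` is a hypothetical stand-in that mirrors the behaviour of `pad_features` so the block can run on its own; the toy sequences are made up.

```python
import numpy as np

def pad_or_truncate(seqs, seq_length):
    """Right-pad with zeros or cut to seq_length (illustrative stand-in for pad_features)."""
    out = np.zeros((len(seqs), seq_length))
    for row, seq in enumerate(seqs):
        kept = seq[:seq_length]          # drop anything past seq_length
        out[row, :len(kept)] = kept      # remaining positions stay 0
    return out

toy = [[5, 6, 7], [8, 9, 10, 11, 12, 13]]
print(pad_or_truncate(toy, 4))
```

The short sequence gets zeros appended on the right, while the long one loses its tail. Since `np.zeros` defaults to `float64`, the padded array holds floats, which is why the model later casts its input with `.long()`.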

In [0]:
from torchnlp.encoders.text import StaticTokenizerEncoder

tokenizer = StaticTokenizerEncoder(sentences)
sequences = [tokenizer.encode(sentence) for sentence in sentences]
padded = pad_features(sequences, max_length)
word_index = {word:ii for ii,word in enumerate(tokenizer.vocab)}
split = int(split_frac * len(sentences))

training_sequences = padded[0:split]
val_sequences = padded[split:len(sentences)]
training_labels = encoded_labels[0:split]
val_labels = encoded_labels[split:len(sentences)]

In [10]:
print(tokenizer.vocab_size)
print(word_index['i'])
print(training_sequences.shape)
print(training_labels.shape)
print(training_sequences[:30,64:74])

74077
62
(20000, 200)
(20000,)
[[  56.   14.   57.   58.   59.   60.   32.   56.   14.   61.]
 [ 137.   14.  138.  139.  140.  114.  133.   14.  141.  142.]
 [ 204.   33.  205.  206.   14.  207.  208.  170.  198.  209.]
 [ 382.  177.  383.  384.   14.  355.    7.  385.  386.  120.]
 [ 234.  674.  675.   62.   79.    8.  673.  248.  553.  467.]
 [  14.  723.   62.  131.  338.   95.  339.  216.  216.  177.]
 [ 301.   64.   33.  776.   35.  133.   28.    8.   96.   21.]
 [ 809.   48.  154.  800.  392.  193.  131.  810.  528.   64.]
 [  14.  846.   33.  252.  469.  177.   59.   28.   59.  346.]
 [ 872.  652.  873.   56.   36.  874.   64.  875.  876.  877.]
 [ 260. 1010.  216.  216.  373. 1011. 1012. 1013.   64. 1014.]
 [1062.   28. 1063.  830.  177.   14. 1064. 1065. 1066.  453.]
 [1094. 1095.  154.   64.  772. 1096.  154. 1097.   33. 1098.]
 [  13.   14.  988.   56.  505. 1129.  614.   38.   14.   15.]
 [  36. 1235. 1236.  216.  216.  154.  814.  139. 1237. 1238.]
 [ 433.  701.  107.  627

### Download Embedding Layer Weights


In [11]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d terenceliu4444/glove6b100dtxt
!unzip glove6b100dtxt.zip

kaggle.json
glove6b100dtxt.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  glove6b100dtxt.zip
replace glove.6B.100d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [0]:
# Parse the GloVe file: each line is a word followed by its vector components
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Rows for words without a pretrained vector stay all-zero
embeddings_matrix = np.zeros((tokenizer.vocab_size, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector

In [13]:
print(embeddings_matrix.shape)

(74077, 100)
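The same lookup can be tried on a tiny scale. The sketch below uses made-up 3-dimensional vectors in GloVe's text layout and a hypothetical `toy_word_index`, and also records which words have no pretrained vector, which is worth checking on the real vocabulary too.

```python
import io
import numpy as np

# Hypothetical 3-dimensional vectors in GloVe's "word v1 v2 ..." text layout
glove_text = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
toy_word_index = {'<pad>': 0, 'good': 1, 'zzyzx': 2, 'bad': 3}

vectors = {}
for line in glove_text:
    parts = line.split()
    vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

matrix = np.zeros((len(toy_word_index), 3))
missing = []
for word, i in toy_word_index.items():
    if word in vectors:
        matrix[i] = vectors[word]
    else:
        missing.append(word)   # row stays all-zero

print(matrix)
print('no pretrained vector for:', missing)
```

Words left with an all-zero row are indistinguishable from padding when the embedding layer is frozen, so a long `missing` list would suggest either training the embeddings or revisiting the cleaning step.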


In [14]:
word_index['high']

6

---
### DataLoaders and Batching

In [0]:
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(training_sequences), torch.from_numpy(training_labels))
valid_data = TensorDataset(torch.from_numpy(val_sequences), torch.from_numpy(val_labels))

# dataloaders
train_batch_size = 32
val_batch_size = 20

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=train_batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=val_batch_size)

In [16]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([32, 200])
Sample input: 
 tensor([[1.3580e+03, 1.1430e+03, 1.3100e+02,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [3.2452e+04, 7.0000e+00, 4.8300e+02,  ..., 4.6000e+02, 6.9830e+03,
         5.2800e+02],
        [3.8161e+04, 4.9440e+03, 6.4000e+01,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        ...,
        [6.2000e+01, 2.0540e+03, 1.0280e+03,  ..., 1.1590e+03, 1.0400e+02,
         5.6000e+01],
        [6.2000e+01, 2.8070e+03, 3.3000e+01,  ..., 7.0000e+02, 5.0400e+02,
         6.2000e+01],
        [3.1430e+03, 8.6400e+02, 1.0400e+02,  ..., 2.3270e+03, 2.1610e+03,
         5.0800e+02]], dtype=torch.float64)

Sample label size:  torch.Size([32])
Sample label: 
 tensor([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0])
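The batch sizes matter beyond speed here: the training loop later creates the LSTM hidden state once per epoch for a fixed batch size, so every batch needs exactly that many rows. A quick check, using the split sizes printed earlier (20000 training and 5000 validation reviews); `batches_per_epoch` is a hypothetical helper:

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Return (number of batches, size of the final batch)."""
    n_batches = math.ceil(n_samples / batch_size)
    last = n_samples - (n_batches - 1) * batch_size
    return n_batches, last

print(batches_per_epoch(20000, 32))  # training split
print(batches_per_epoch(5000, 20))   # validation split
print(batches_per_epoch(5000, 32))   # hypothetical: leaves a short final batch
```

Both splits divide evenly, so no short batch occurs. With other sizes, passing `drop_last=True` to `DataLoader` is one way to keep every batch the same size.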


---
### Define Model

In [0]:
def custom_emb_layer(weights_matrix, non_trainable=False):
    num_embeddings, embedding_dim = weights_matrix.shape
    # from_pretrained freezes the weights by default, so pass the flag explicitly;
    # cast to float32 so the embedding matches the dtype of the other layers
    emb_layer = nn.Embedding.from_pretrained(torch.from_numpy(weights_matrix).float(),
                                             freeze=non_trainable)

    return emb_layer, num_embeddings, embedding_dim

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, output_size, weights_matrix, hidden_dim, n_layers, drop_prob=0.5, bi=False):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.bidirect = bi
        
        # define all layers
        self.embedding, num_embeddings, embedding_dim = custom_emb_layer(weights_matrix, True)
        self.conv1d = nn.Conv1d(100, 256, 5)
        self.lstm = nn.LSTM(256, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True, bidirectional=bi)
        self.maxpool1d = nn.MaxPool1d(4)
        self.dropout1 = nn.Dropout(0.4)
        self.dropout2 = nn.Dropout(0.3)
        self.fc1 = nn.Linear(hidden_dim, 64)
        self.fc2 = nn.Linear(64, output_size)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)
        x = x.long()
        x = self.dropout1(self.embedding(x))
        x = x.transpose(1,2).float()
        x = self.maxpool1d(self.conv1d(x))
        x = x.transpose(1,2)
        x, hidden = self.lstm(x, hidden)
        x = x.contiguous().view(-1, self.hidden_dim)
        x = self.dropout2(x)
        x = F.relu(self.fc1(x))
        sig_out = self.sig(self.fc2(x))
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        if self.bidirect == True:
            c = 2
        else:
            c = 1

        if (train_on_gpu):
            hidden = (weight.new(self.n_layers*c, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers*c, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers*c, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers*c, batch_size, self.hidden_dim).zero_())
        
        return hidden
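It helps to trace how the sequence length changes before the LSTM. The sketch below applies the output-length formula documented for `nn.Conv1d` and `nn.MaxPool1d` to the layer settings used in the model; `conv1d_out_len` is a hypothetical helper name.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of nn.Conv1d / nn.MaxPool1d along the sequence axis."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

seq_len = 200                                   # max_length used for padding
after_conv = conv1d_out_len(seq_len, kernel_size=5)
after_pool = conv1d_out_len(after_conv, kernel_size=4, stride=4)
print(after_conv, after_pool)
```

So a batch of shape `(batch, 200)` becomes `(batch, 200, 100)` after the embedding, `(batch, 256, 196)` after the convolution, and `(batch, 256, 49)` after pooling. The LSTM therefore processes 49 time steps rather than 200, which is a large part of why each epoch is fast.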
        

### Instantiate the network


* `output_size`: Size of the desired output. Here it is 1: a single sigmoid score that is rounded to a positive or negative class.
* `hidden_dim`: Number of units in the hidden layers of the LSTM cells. Larger values add capacity and often improve performance, at the cost of more computation and a higher risk of overfitting. Common values are 128, 256, and 512.
* `n_layers`: Number of stacked LSTM layers in the network, typically between 1 and 3.
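To get a feel for how `hidden_dim` and `n_layers` affect model size, the sketch below counts LSTM parameters. It assumes PyTorch's layout of four gates per layer, each with input weights, recurrent weights, and two bias vectors; `lstm_param_count` is a hypothetical helper, and the input size of 256 matches the convolution output used in this notebook.

```python
def lstm_param_count(input_size, hidden_dim, n_layers, bidirectional=False):
    """Count the trainable parameters of a stacked LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first receive the previous layer's (possibly doubled) output
        layer_in = input_size if layer == 0 else hidden_dim * directions
        # 4 gates, each with input weights, recurrent weights, and two bias vectors
        per_direction = 4 * (layer_in * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
        total += per_direction * directions
    return total

print(lstm_param_count(256, 256, 2))
print(lstm_param_count(256, 512, 2))
```

Doubling `hidden_dim` roughly triples to quadruples the parameter count, since the recurrent weights grow quadratically.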



In [18]:
# Instantiate the model w/ hyperparams
output_size = 1
hidden_dim = 256
n_layers = 2

net = SentimentRNN(output_size, embeddings_matrix, hidden_dim, n_layers, bi=False)

print(net)

SentimentRNN(
  (embedding): Embedding(74077, 100)
  (conv1d): Conv1d(100, 256, kernel_size=(5,), stride=(1,))
  (lstm): LSTM(256, 256, num_layers=2, batch_first=True, dropout=0.5)
  (maxpool1d): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (dropout1): Dropout(p=0.4, inplace=False)
  (dropout2): Dropout(p=0.3, inplace=False)
  (fc1): Linear(in_features=256, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=1, bias=True)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


---
### Training

In [19]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
# training params
epochs = 50  # number of full passes over the training data

counter = 0
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
print("====================TRAINING PROCESS====================")
for e in range(epochs):
    t0 = time.time()
    # keep track of training and validation loss
    train_loss = 0.0
    valid_loss = 0.0
    accuracy = 0.0

    # initialize hidden state
    h = net.init_hidden(train_batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        # It is left disabled in this run; uncomment the next line to enable it.
        # nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()
        train_loss += loss.item()

        
    # Get validation loss
    val_h = net.init_hidden(val_batch_size)
    net.eval()
    with torch.no_grad():
        for inputs, labels in valid_loader:

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            val_h = tuple([each.data for each in val_h])

            if(train_on_gpu):
                inputs, labels = inputs.cuda(), labels.cuda()

            output, val_h = net(inputs, val_h)
            val_loss = criterion(output.squeeze(), labels.float())

            valid_loss += val_loss.item()
            equals = torch.round(output) == labels.view(*output.shape)
            accuracy += torch.mean(equals.type(torch.FloatTensor))
        
    train_loss = train_loss/len(train_loader)
    valid_loss = valid_loss/len(valid_loader)
    valid_acc = accuracy/len(valid_loader)

    net.train()
    print("Epoch: {}/{}...".format(e+1, epochs),
            "Step: {}...".format(counter),
            "Loss: {:.6f}...".format(train_loss),
            "Val Loss: {:.6f}".format(valid_loss),
           "Val acc: {:.3f}".format(valid_acc),
           "Time elapsed: {:.1f}".format( time.time()-t0))

Epoch: 1/50... Step: 625... Loss: 0.693197... Val Loss: 0.689331 Val acc: 0.541 Time elapsed: 6.6
Epoch: 2/50... Step: 1250... Loss: 0.641000... Val Loss: 0.523590 Val acc: 0.739 Time elapsed: 6.6
Epoch: 3/50... Step: 1875... Loss: 0.531877... Val Loss: 0.472397 Val acc: 0.783 Time elapsed: 6.6
Epoch: 4/50... Step: 2500... Loss: 0.496419... Val Loss: 0.529168 Val acc: 0.781 Time elapsed: 6.6
Epoch: 5/50... Step: 3125... Loss: 0.483490... Val Loss: 0.476830 Val acc: 0.791 Time elapsed: 6.6
Epoch: 6/50... Step: 3750... Loss: 0.474553... Val Loss: 0.457054 Val acc: 0.786 Time elapsed: 6.6
Epoch: 7/50... Step: 4375... Loss: 0.459142... Val Loss: 0.466897 Val acc: 0.785 Time elapsed: 6.6
Epoch: 8/50... Step: 5000... Loss: 0.453548... Val Loss: 0.434779 Val acc: 0.802 Time elapsed: 6.6
Epoch: 9/50... Step: 5625... Loss: 0.439901... Val Loss: 0.427160 Val acc: 0.804 Time elapsed: 6.6
Epoch: 10/50... Step: 6250... Loss: 0.438985... Val Loss: 0.418977 Val acc: 0.815 Time elapsed: 6.6
Epoch: 11/
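The first-epoch training loss of about 0.693 is close to ln 2 ≈ 0.693147, which is the binary cross-entropy of a model that outputs 0.5 for every review. A minimal sketch of the per-example loss, written in plain Python rather than with `nn.BCELoss`, makes the scale of the numbers in the log easier to read; `bce` is a hypothetical helper.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one prediction (the quantity nn.BCELoss averages)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

print(round(bce(0.5, 1), 6))    # uninformed guess
print(round(bce(0.9, 1), 6))    # confident and right
print(round(bce(0.9, 0), 6))    # confident and wrong
```

A loss near 0.69 therefore means the model has not yet learned anything useful, and values below that indicate predictions that are better than chance on average.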

---
### Testing

In [23]:
# Get loss and accuracy on held-out data (the validation split is reused as the test set here)

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(val_batch_size)

net.eval()
# iterate over test data
for inputs, labels in valid_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(valid_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.430
Test accuracy: 0.825
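The accuracy above comes from rounding each sigmoid output to 0 or 1 and comparing it with the label. The sketch below shows the same computation on made-up numbers in plain Python; `accuracy_from_probs` and the sample values are hypothetical.

```python
def accuracy_from_probs(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

probs = [0.91, 0.12, 0.55, 0.40]   # hypothetical sigmoid outputs
labels = [1, 0, 0, 0]
print(accuracy_from_probs(probs, labels))
```

Using `p > 0.5` matches `torch.round` on probabilities, because `torch.round` sends an exact 0.5 to 0. Moving the threshold trades false positives against false negatives, which can be useful if one kind of error is more costly.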


### Inference on a test review



In [0]:
# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'


In [26]:
from string import punctuation

def tokenize_review(test_review):
    test_review = test_review.lower() # lowercase
    # get rid of punctuation
    test_text = ''.join([c for c in test_review if c not in punctuation])

    # splitting by spaces
    test_words = test_text.split()

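    # tokens; words missing from the training vocabulary are skipped
    # so that an unseen word does not raise a KeyError
    test_ints = []
    test_ints.append([word_index[word] for word in test_words if word in word_index])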

    return test_ints

# test code and generate tokenized review
test_ints = tokenize_review(test_review_neg)
print(test_ints)

[[14, 880, 700, 62, 468, 167, 621, 453, 3069, 64, 62, 952, 26, 340, 547, 346, 700, 223, 506, 621, 64, 14, 142, 453, 611]]


In [0]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be
        positive or negative in sentiment, using a trained model.
        
        params:
        net - A trained net 
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
        '''
    
    
    net.eval()
    
    # tokenize review
    test_ints = tokenize_review(test_review)
    
    # pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    # convert to tensor to pass into your model
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, h = net(feature_tensor, h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 
    # printing output value, before rounding
    print(feature_tensor.shape)
    print(output.shape)
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    # print custom response
    if(pred.item()==1):
        print("Positive review detected!")
    else:
        print("Negative review detected.")
    
        

In [0]:
# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'


In [30]:
# call function
# try negative and positive reviews!
seq_length=200
predict(net, test_review_neg, seq_length)

torch.Size([1, 200])
torch.Size([1])
Prediction value, pre-rounding: 0.001398
Negative review detected.
