# End text-classification model

This notebook contains the end model, a bidirectional LSTM, used for relation extraction.
The code was originally run on Google Colab, so paths and setup steps may need minor changes in other environments.

The label model outputs a probability for each of the two classes. These probabilities, which serve as the training labels, are still noisy. To improve accuracy, we train an end extraction model on the tokens of the sentences themselves.

## Import module

In [1]:
#from google.colab import drive
import pandas as pd
import numpy as np
import spacy
import snorkel
from sklearn.model_selection import train_test_split
import torch
from torchtext import data, vocab

In [297]:
#drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Text pre-processing

The text pre-processing steps require torchtext, PyTorch's text module.<br>
It can be installed with the command "pip install torchtext". Note that the Field API used below belongs to older torchtext releases: it moved to torchtext.legacy in version 0.9 and was removed in 0.12.<br><br>
Our inputs contain two types of text, left_text and right_text, split according to the location of the target entity.
We instantiate our data objects using torchtext as below.

Both left and right sentences are tokenized into words by spaCy's tokenizer.<br>
For the text embeddings, we use GloVe's 300-d pre-trained word vectors, which can be downloaded online. They can be replaced by any other pre-trained word vectors.



In [6]:
def text_processing():

    # Define our Torch Data objects
    TEXT = data.Field(tokenize = "spacy", batch_first = True, include_lengths = True)

    LABEL = data.Field(dtype = torch.float, sequential = False, use_vocab=False, batch_first = True)

    # Define the fields to be used later for data-processing
    fields = [(None, None), ('left_text',TEXT), ('right_text',TEXT), ('label',LABEL)]

    # we use the dataset that contains the columns (idx, left_sentence, right_sentence, supply_label)
    # we make the dataset into torchtext's Dataset object using the fields defined above
    #training_data = data.TabularDataset(path = 'drive/My Drive/Data/partitioned_sents_final.csv', format='csv', fields=fields, skip_header=True)
    training_data = data.TabularDataset(path = 'partitioned_sents_final.csv', format='csv', fields=fields, skip_header=True)

    # We split the training and valid dataset (We use 80-20 ratio)
    train_data, valid_data = training_data.split(split_ratio=0.8)

    # For each sentence, we want to embed them using the pre-trained Glove vector (300-dimension)
    #embeddings = vocab.Vectors('glove.6B.300d.txt', 'drive/My Drive/Data/')
    embeddings = vocab.Vectors('glove.6B.300d.txt', 'glove.6B/')

    # Build the vocab based on the Glove vector. Min_freq:3, Max_size=5,000
    TEXT.build_vocab(train_data, min_freq = 3, max_size = 5000, vectors = embeddings)
    LABEL.build_vocab(train_data)

    # Store the vocab size. Note that the vocab contains 2 extra tokens (<unk>, <pad>)
    vocab_size = len(TEXT.vocab)
    
    return TEXT, LABEL, train_data, valid_data

In [8]:
TEXT, LABEL, train_data, valid_data = text_processing()
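
To make the effect of `min_freq` and `max_size` concrete, here is a minimal pure-Python sketch of the vocabulary-building step. It is not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers written for illustration, the toy sentences are made up, and the tie-breaking order is an assumption.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=3, max_size=5000):
    """Return (itos, stoi) with <unk> at index 0 and <pad> at index 1."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically), and cap the list at max_size tokens.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))[:max_size]
    itos = ["<unk>", "<pad>"] + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices; out-of-vocabulary tokens fall back to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

# Toy corpus (illustrative only): "acme" and "to" appear 3 times, everything else less.
sentences = [
    ["acme", "supplies", "parts", "to"],
    ["acme", "supplies", "engines", "to"],
    ["acme", "buys", "parts", "to"],
]
itos, stoi = build_vocab(sentences, min_freq=3, max_size=5000)
ids = numericalize(["acme", "sells", "to"], stoi)
```

Tokens below the frequency threshold, and tokens never seen in training, all share the `<unk>` index, so the final vocabulary size is at most `max_size + 2`.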

## Define the End LSTM model

For the end classification model, we define a 2-layer bidirectional LSTM followed by a fully-connected layer for the final classification. Since this is a binary classification, we use the sigmoid function as the output activation.

Within a batch, sentences shorter than the longest one are padded with <pad> tokens so that all sentences have the same length. We then apply torch's pack_padded_sequence to the embeddings so that the LSTM layers skip the padded positions. <br>
The final hidden states of the left and right sentences are then concatenated and fed to the final dense layer.
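
The following sketch illustrates what packing buys us, under toy assumptions (random 4-dimensional "embeddings", a hidden size of 5, and made-up lengths; none of these values come from the notebook). It checks that the final hidden state of a padded-then-packed short sequence matches the one obtained by running that sequence alone, and shows how the last layer's forward and backward states are concatenated.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Toy batch: 2 sequences of 4-dim embeddings with true lengths 3 and 1, padded to length 3.
batch = torch.randn(2, 3, 4)
batch[1, 1:] = 0.0                      # positions past the true length hold padding
lengths = torch.tensor([3, 1])          # lengths must live on the CPU

lstm = nn.LSTM(input_size=4, hidden_size=5, num_layers=2,
               bidirectional=True, batch_first=True)
lstm.eval()

with torch.no_grad():
    packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
    _, (hidden, _) = lstm(packed)
    # hidden has shape [num_layers * num_directions, batch, hidden_size];
    # hidden[-2] is the last layer's forward state, hidden[-1] its backward state.
    sentence_vec = torch.cat((hidden[-2], hidden[-1]), dim=1)

    # Reference: run the short sequence on its own, with no padding at all.
    _, (h_alone, _) = lstm(batch[1:2, :1])
    alone_vec = torch.cat((h_alone[-2], h_alone[-1]), dim=1)
```

Each sentence ends up as a vector of size `2 * hidden_size`; concatenating the left and right sentence vectors gives the `2 * 2 * hidden_dim` input expected by the dense layer.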

In [300]:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class SupplyClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        
        # Define the Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # Define the LSTM layer, shared by the left and right text, using the hyperparameters defined
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers = n_layers, bidirectional = bidirectional,
                           dropout = dropout, batch_first = True)

        # Define the dense layer
        # Note the input dimension: 2 * 2 * hidden_dim, because we concatenate the left and right sentences, each encoded bidirectionally
        self.fc = nn.Linear(2 * 2 * hidden_dim, output_dim)

        # Define the activation function. Since it is a binary-classification, we use sigmoid
        self.act = nn.Sigmoid()

    def forward(self, left_text, right_text, left_text_lengths, right_text_lengths):
        
        # embedded vectors for the left and right text
        left_embedded = self.embedding(left_text)
        right_embedded = self.embedding(right_text)
        
        # pack_padded_sequence tells the LSTM the true length of each sentence.
        # Sentences shorter than the longest one in the batch are padded with <pad>,
        # and packing ensures those <pad> positions are not fed to the LSTM.
        # With batch_first = True, the inputs have the shape [batch size, sentence length, embedding dim].
        # The lengths are moved to the CPU, which recent PyTorch versions require.
        
        left_packed_emb = pack_padded_sequence(left_embedded, left_text_lengths.cpu(), batch_first = True, enforce_sorted=False)
        right_packed_emb = pack_padded_sequence(right_embedded, right_text_lengths.cpu(), batch_first = True, enforce_sorted=False)
        
        
        # we store the outputs, hidden_states and cell_states for each of the sentence
        left_out, (l_hidden, l_cell) = self.lstm(left_packed_emb)
        right_out, (r_hidden, r_cell) = self.lstm(right_packed_emb)
        
        # Since our model is Bi-LSTM, we need to concatenate the hidden states for each direction
        l_hidden = torch.cat((l_hidden[-2,:,:], l_hidden[-1,:,:]), dim = 1)
        r_hidden = torch.cat((r_hidden[-2,:,:], r_hidden[-1,:,:]), dim = 1)
        
        # Finally, we concatenate the hidden states for both left and right sentence
        hidden = torch.cat((l_hidden, r_hidden), dim = 1)
        
        # We feed the concatenated hidden states into our fully-connected layer
        fc_out = self.fc(hidden)
        
        # Then, we acquire the final results by putting the fc_out into our sigmoid activation function
        final = self.act(fc_out)
        
        return final
        

In [301]:
def binary_accuracy(preds, y):
    """
        This function computes the binary accuracy.
    """
    # round the preds to 0 or 1
    preds = torch.round(preds)
    
    # check if the preds are correct
    check = (preds == y).float()
    
    # accuracy
    acc = check.sum() / len(check)
    
    return acc

## Model hyperparameters

In [None]:
# GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
vocab_size = len(TEXT.vocab)
embedding_dim = 300
hidden_nodes = 32
output_nodes = 1
num_layers = 2
bidirectional = True
dropout = 0.2

## Instantiate the model

Using the parameters and hyperparameters defined above, we instantiate the model.


In [302]:
import torch.optim as optim


# instantiate the end-model
model = SupplyClassifier(vocab_size = vocab_size,
                        embedding_dim = embedding_dim,
                        hidden_dim = hidden_nodes,
                        output_dim = output_nodes,
                        n_layers = num_layers,
                        bidirectional = bidirectional,
                        dropout = dropout)

# use GPU if available
model = model.to(device)

# Initialize the embedding layer with the pre-trained GloVe vectors
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

# Define the batch size
BATCH_SIZE = 64

# Define the optimizer. We use ADAM.
optimizer = optim.Adam(model.parameters())

# Define the loss function. We use BCELoss since it is a binary classification
loss_criterion = nn.BCELoss()
loss_criterion = loss_criterion.to(device)
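
For reference, binary cross-entropy averaged over a batch can be written out by hand. The sketch below is a plain-Python illustration of the formula `-(y * log(p) + (1 - y) * log(1 - p))`, not PyTorch's implementation; the `eps` clamp and the example probabilities are assumptions made for illustration.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and targets in [0, 1]."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)

# Same targets, three different sets of predictions
confident_right = binary_cross_entropy([0.9, 0.1], [1.0, 0.0])
uninformative = binary_cross_entropy([0.5, 0.5], [1.0, 0.0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1.0, 0.0])
```

The loss grows sharply for confident wrong predictions, and the formula also accepts soft targets between 0 and 1, which matters if the probabilistic labels from the label model are used directly.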

In [303]:
def train(model, iterator, optimizer, loss_criterion):
    """
        This function trains the end model for one epoch given the iterator, optimizer and loss_criterion
    """
    # reset the accumulated accuracy and loss at the start of every epoch
    loss_epoch = 0
    acc_epoch = 0
    
    # Initialize the training phase
    model.train()
    for batch in iterator:
        
        # zero out the gradients
        optimizer.zero_grad()
        
        # for each batch, store the text and the length of the sentences
        left_text, left_text_len = batch.left_text
        right_text, right_text_len = batch.right_text

        # flatten the predictions from [batch size, 1] to [batch size]
        preds = model(left_text, right_text, left_text_len, right_text_len).squeeze(1)

        # compute the loss of the batch
        loss = loss_criterion(preds, batch.label)

        # compute the accuracy for the batch
        acc = binary_accuracy(preds, batch.label)

        # perform back-prop
        loss.backward()
        
        # update weights
        optimizer.step()
        
        # accumulate the loss and accuracy
        loss_epoch += loss.item()
        acc_epoch += acc.item()
    return loss_epoch / len(iterator), acc_epoch / len(iterator)

In [304]:
def evaluate(model, iterator, loss_criterion):
    
    # set the accuracy and loss to zero every iteration
    loss_epoch = 0
    acc_epoch = 0
    
    #Initialize the evaluation phase
    model.eval()
    
    # We don't need to record the grads for this evaluation process
    with torch.no_grad():
        
        for batch in iterator:
            
            # for each batch, store the text and the length of the sentences
            left_text, left_text_len = batch.left_text
            right_text, right_text_len = batch.right_text

            # flatten the predictions from [batch size, 1] to [batch size]
            preds = model(left_text, right_text, left_text_len, right_text_len).squeeze(1)

            # compute the loss of the batch
            loss = loss_criterion(preds, batch.label)
          
            # compute the accuracy for the batch
            acc = binary_accuracy(preds, batch.label)

            # accumulate the loss and accuracy
            loss_epoch += loss.item()
            acc_epoch += acc.item()
            
    return loss_epoch / len(iterator), acc_epoch / len(iterator)

In [305]:
# Define the train and valid iterator using BucketIterator
train_iterator, valid_iterator = data.BucketIterator.splits(
        (train_data, valid_data),
        batch_size = BATCH_SIZE,
        sort_key = lambda x: len(x.left_text), # Sort the batches by text length size
        sort_within_batch = True,
        device = device)
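
The idea behind bucketing is to batch examples of similar length together so that less padding is needed. The sketch below is a simplified pure-Python illustration, not torchtext's actual BucketIterator algorithm; `bucket_batches`, `padding_cost` and the toy lengths are hypothetical.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group example indices into batches of similar length, then shuffle the batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

def padding_cost(batches, lengths):
    """Total number of <pad> positions needed to pad each batch to its longest example."""
    return sum(max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b)
               for b in batches)

# Toy sentence lengths (illustrative only)
lengths = [3, 27, 5, 30, 4, 25, 6, 29]
bucketed = bucket_batches(lengths, batch_size=4)
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]      # batches taken in dataset order

bucketed_cost = padding_cost(bucketed, lengths)
naive_cost = padding_cost(naive, lengths)
```

Note that the `sort_key` above only considers the length of `left_text`, so the `right_text` field of a batch may still require substantial padding.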

In [306]:
N_EPOCHS = 10
best_val_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    # train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, loss_criterion)
    
    # evaluate the model
    val_loss, val_acc = evaluate(model, valid_iterator, loss_criterion)
    
    # Save the best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
        
    print(f'Epoch {epoch + 1} Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'Epoch {epoch + 1} Validation Loss: {val_loss:.3f} |  Val. Acc: {val_acc*100:.2f}%')

Epoch 1 Train Loss: 0.046 | Train Acc: 98.31%
Epoch 1 Validation Loss: 0.021 |  Val. Acc: 99.30%
Epoch 2 Train Loss: 0.016 | Train Acc: 99.52%
Epoch 2 Validation Loss: 0.015 |  Val. Acc: 99.52%
Epoch 3 Train Loss: 0.009 | Train Acc: 99.73%
Epoch 3 Validation Loss: 0.013 |  Val. Acc: 99.62%
Epoch 4 Train Loss: 0.006 | Train Acc: 99.84%
Epoch 4 Validation Loss: 0.015 |  Val. Acc: 99.60%
Epoch 5 Train Loss: 0.004 | Train Acc: 99.89%
Epoch 5 Validation Loss: 0.021 |  Val. Acc: 99.62%
Epoch 6 Train Loss: 0.003 | Train Acc: 99.92%
Epoch 6 Validation Loss: 0.025 |  Val. Acc: 99.62%
Epoch 7 Train Loss: 0.002 | Train Acc: 99.94%
Epoch 7 Validation Loss: 0.027 |  Val. Acc: 99.58%
Epoch 8 Train Loss: 0.002 | Train Acc: 99.96%
Epoch 8 Validation Loss: 0.035 |  Val. Acc: 99.59%
Epoch 9 Train Loss: 0.002 | Train Acc: 99.96%
Epoch 9 Validation Loss: 0.033 |  Val. Acc: 99.62%
Epoch 10 Train Loss: 0.001 | Train Acc: 99.96%
Epoch 10 Validation Loss: 0.038 |  Val. Acc: 99.57%
Epoch 11 Train Loss: 0.001 |

KeyboardInterrupt: ignored
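
In the log above, the validation loss reaches its minimum at epoch 3 and rises afterwards while the training loss keeps falling, so the checkpoint kept in `saved_weights.pt` is the one from epoch 3. A patience-based stopping rule would have ended training earlier. The sketch below is a hypothetical helper (it is not part of the notebook) applied to the validation losses logged for epochs 1 to 10; the patience value of 2 is an assumption.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses copied from the training log above (epochs 1 to 10)
val_losses = [0.021, 0.015, 0.013, 0.015, 0.021, 0.025, 0.027, 0.035, 0.033, 0.038]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
```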

In [314]:
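# Load the best weights saved during training
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path, map_location=device))
model.eval()

import spacy
# 'en' is a shortcut accepted by spaCy 2.x; spaCy 3.x requires a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')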

def predict(model, df):

    result = []
    probs_result = []
    for idx, row in df.iterrows():
        left_sent = row['left_sents']
        right_sent = row['right_sents']

        left_tokenized = [tok.text for tok in nlp.tokenizer(left_sent)]
        right_tokenized = [tok.text for tok in nlp.tokenizer(right_sent)]

        left_indexed = [TEXT.vocab.stoi[t] for t in left_tokenized]
        right_indexed = [TEXT.vocab.stoi[t] for t in right_tokenized]

        l_length = [len(left_indexed)] 
        r_length = [len(right_indexed)] 

        l_tensor = torch.LongTensor(left_indexed).to(device)              #convert to tensor
        l_tensor = l_tensor.unsqueeze(1).T 

        r_tensor = torch.LongTensor(right_indexed).to(device)
        r_tensor = r_tensor.unsqueeze(1).T

        l_len = torch.LongTensor(l_length)
        r_len = torch.LongTensor(r_length)

        prediction = model(l_tensor, r_tensor, l_len, r_len)
        probs_result.append([1 - prediction.item(), prediction.item()])
        result.append(round(prediction.item()))
    return result, probs_result

## Evaluation

We load the test dataset to evaluate our trained classifier.

In [319]:
from snorkel.analysis import metric_score
from snorkel.utils import probs_to_preds

# Load the test-set
df = pd.read_csv("drive/My Drive/Data/test_dataset.csv", index_col = 0)
df['left_sents'] = df['left_sents'].astype('str')
df['right_sents'] = df['right_sents'].astype('str')
df['supply'] = df['supply'].astype('float')


# compute the predicted labels and class probabilities for the test set
# (the printed labels say "Label model", but these scores are for the end model trained above)
ans, probs = predict(model, df)
print(
    f"Label model accuracy score: {metric_score(df['supply'], np.array(ans), metric='accuracy')}"
)
print(
    f"Label model f1 score: {metric_score(df['supply'], np.array(ans), metric='f1')}"
)
print(
    f"Label model roc-auc score: {metric_score(df['supply'], probs = np.array(probs), metric='roc_auc')}"
)

Label model accuracy score: 0.8821331521739131
Label model f1 score: 0.593200468933177
Label model roc-auc score: 0.9381681937395687
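
The scores above combine an accuracy near 0.88 with an F1 near 0.59. A gap like this typically appears when the positive class is rare, because true negatives raise accuracy but do not enter F1 at all. The sketch below illustrates the arithmetic with hypothetical confusion-matrix counts; they are not the counts of this test set, which the notebook does not report.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for an imbalanced test set: 100 positives, 900 negatives
tp, fp, fn, tn = 60, 50, 40, 850

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

With these made-up counts the accuracy is 0.91 while F1 is only about 0.57, so inspecting precision and recall separately would show where the end model loses ground.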


## GPU Information

In [None]:
! nvidia-smi

Sat May 16 10:39:27 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru