<a href="https://colab.research.google.com/github/arutraj/ML_Basics/blob/main/Question_Tagging_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Objective

The objective of notebook is to build a model to automatically predict tags for a given a StackExchange question by using the text of the question in PyTorch.
![alt text](https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/se/se-logo.svg?v=d29f0785ebb7)

__Dataset Specs__: Over 85,000 questions and over 1300 unique tags

[Download Link](https://www.kaggle.com/stackoverflow/statsquestions#Questions.csv)


# Steps To Follow


1. Load Data and Import Libraries

2. Dataset Preparation

      2.1 Loading the Data

      2.2 Merge Tags with Questions

      2.3 Filter Questions with respect to Top-10 Tags
      
3. Text Preprocessing

      3.1 Text Representation

4. Model Building

      4.1 Model Architecture

5. Model Training and Model Evaluation

6. Model Building for LSTM and Model Evaluation for LSTM

#1. Importing Libraries

In [1]:
#string matching
import re

#reading files
import pandas as pd
#array processing
import numpy as np

#handling html data
from bs4 import BeautifulSoup

#visualization
import matplotlib.pyplot as plt

#for metrics
from sklearn import metrics

#for seed
import random

# to one hot encode labels
from sklearn.preprocessing import MultiLabelBinarizer

#defining tensors
import torch

#layers
from torch import nn

#layers and wrappers
from torch.nn import Sequential, Linear,  ReLU, Sigmoid, Dropout, BCELoss, Embedding, RNN, LSTM

#handling text data
from torchtext import data



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!unzip '/content/drive/MyDrive/stats_questions.zip'

Archive:  /content/drive/MyDrive/stats_questions.zip
  inflating: Answers.csv             
  inflating: Questions.csv           
  inflating: Tags.csv                
  inflating: database.sqlite         


#2. Data Preparation

### 2.1 Loading the data

In [6]:
# load the stackoverflow questions dataset
questions_df = pd.read_csv('/content/Questions.csv',encoding='latin-1')
# load the tags dataset
tags_df = pd.read_csv('/content/Tags.csv')

In [7]:
# Display the first five rows of the dataset
questions_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learn...,"<p>Last year, I read a blog post from <a href=..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demog...
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain En...,<p>How would you describe in plain English the...
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values i...,<p>After taking a statistics course and then t...
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not me...,"<p>There is an old saying: ""Correlation does n..."


In [8]:
# Display the first five rows
tags_df.head()

Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior
2,1,elicitation
3,2,distributions
4,2,normality


In [9]:
# No. of unique tags
len(tags_df['Tag'].unique())

1315

### 2.2 Merge Tags with Questions

In [10]:
# remove "-" from the tags
tags_df['Tag'] = tags_df['Tag'].apply(lambda x:re.sub("-"," ",x))

In [11]:
# group tags Id wise
tags_df = tags_df.groupby('Id').apply(lambda x:x['Tag'].values).reset_index(name='tags')
tags_df.head()

Unnamed: 0,Id,tags
0,1,"[bayesian, prior, elicitation]"
1,2,"[distributions, normality]"
2,3,"[software, open source]"
3,4,"[distributions, statistical significance]"
4,6,[machine learning]


In [12]:
# merge tags and questions
df = pd.merge(questions_df,tags_df,how='inner',on='Id')

In [13]:
# fetch required columns
df = df[['Id','Body','tags']]

In [14]:
#first 5 rows
df.head()

Unnamed: 0,Id,Body,tags
0,6,"<p>Last year, I read a blog post from <a href=...",[machine learning]
1,21,<p>What are some of the ways to forecast demog...,"[forecasting, population, census]"
2,22,<p>How would you describe in plain English the...,"[bayesian, frequentist]"
3,31,<p>After taking a statistics course and then t...,"[hypothesis testing, t test, p value, interpre..."
4,36,"<p>There is an old saying: ""Correlation does n...","[correlation, teaching]"


In [15]:
#shape of the dataset
df.shape

(85085, 3)

### 2.3 Filter Questions with respect to Top-10 Tags


In [16]:
# check occurence of each tag
freq={}
for i in df['tags']:
  for j in i:
    if j in freq.keys():
      freq[j] = freq[j] + 1
    else:
      freq[j] = 1

In [17]:
# sort the dictionary in descending order
freq = dict(sorted(freq.items(), key=lambda x:x[1],reverse=True))

In [18]:
# Top 10 most frequent tags
common_tags = list(freq.keys())[:10]
print(common_tags)

['r', 'regression', 'machine learning', 'time series', 'probability', 'hypothesis testing', 'self study', 'distributions', 'logistic', 'classification']


In [19]:
#finding queries associated with common tags
x=[]
y=[]

for i in range(len(df['tags'])):

  temp=[]
  for j in df['tags'][i]:
    if j in common_tags:
      temp.append(j)

  #if common tags are more than 1
  if(len(temp)>1):
    x.append(df['Body'][i])
    y.append(temp)

In [20]:
# number of questions left
len(x)

11106

In [21]:
#first 5 tags
y[:5]

[['r', 'time series'],
 ['regression', 'distributions'],
 ['distributions', 'probability', 'hypothesis testing'],
 ['hypothesis testing', 'self study'],
 ['r', 'regression', 'time series']]

In [22]:
#combining the labels by space
y = [ ",".join([str(j) for j in i ]) for i in y]

In [23]:
#labels after converting to string
y[:5]

['r,time series',
 'regression,distributions',
 'distributions,probability,hypothesis testing',
 'hypothesis testing,self study',
 'r,regression,time series']

In [24]:
#save to dataframe
dframe = pd.DataFrame({'query':x,'tags':y})

In [25]:
#first 5 rows
dframe.head()

Unnamed: 0,query,tags
0,<p>I recently started working for a tuberculos...,"r,time series"
1,<p>Am I looking for a better behaved distribut...,"regression,distributions"
2,<p>There are many ways to measure how similar ...,"distributions,probability,hypothesis testing"
3,<blockquote>\n <p>A Lab has been asked to eva...,"hypothesis testing,self study"
4,<p>How would we measure the predictive power o...,"r,regression,time series"


In [26]:
#save to csv
dframe.to_csv('stack.csv',index=False)

In [27]:
dframe.shape

(11106, 2)

# 3.Preprocessing

### 3.1 Text Representation

In [28]:
def cleaner(text):

  text = BeautifulSoup(text).get_text()

  # fetch alphabetic characters
  text = re.sub("[^a-zA-Z]", " ", text)

  # convert text to lower case
  text = text.lower()

  return text

In [29]:
dframe['query'] = dframe['query'].apply(lambda x: cleaner(x))

In [30]:
from sklearn.model_selection import train_test_split
dframe_train, dframe_test = train_test_split(dframe, test_size = 0.2, random_state = 42)

In [31]:
dframe_train.shape, dframe_test.shape

((8884, 2), (2222, 2))

In [33]:
dframe_test.head()

Unnamed: 0,query,tags
8772,suppose you have data in the following format ...,"regression,self study"
9847,i have a question on how a statistician would ...,"r,self study"
3265,assume that there are n realisations of five...,"machine learning,probability"
2319,first let me start off by saying i know the co...,"r,regression"
9298,i create time series model via model sarima...,"r,time series"


In [34]:
#preparing the output labels
train_tags_list=[i.split(",") for i in dframe_train.tags]
test_tags_list=[i.split(",") for i in dframe_test.tags]

In [35]:
# Using MultilabelBinarizer to convert the list of tags to numberical format
from sklearn.preprocessing import MultiLabelBinarizer

# Example labels
labels = train_tags_list

# Create a MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Transform the labels into a binary matrix
mlb.fit(labels)

train_labels = mlb.transform(train_tags_list)
test_labels = mlb.transform(test_tags_list)

In [36]:
train_labels[:5]

array([[0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]])

In [37]:
# Dopping the 'tag' column from the datasets
dframe_train.drop(columns = ['tags'], inplace = True)
dframe_test.drop(columns = ['tags'], inplace = True)

In [38]:
# Adding the new numerically converted tags to the data
dframe_train['Fit_tags'] = list(train_labels)
dframe_test['Fit_tags'] = list(test_labels)

In [39]:
# Display the first five rows of the data
dframe_train.head()

Unnamed: 0,query,Fit_tags
7911,i have two data sets that each have the follow...,"[0, 0, 1, 0, 0, 0, 0, 1, 0, 0]"
10808,suppose we have a historical panel longitudin...,"[0, 0, 0, 0, 0, 0, 1, 1, 0, 0]"
1508,is it possible to say that my samples are sign...,"[0, 1, 0, 0, 0, 0, 1, 0, 0, 0]"
5095,i have built a logistic regression where the o...,"[0, 0, 1, 1, 0, 0, 1, 0, 0, 0]"
3853,i fit a simple linear model y bx to a data...,"[0, 0, 0, 0, 0, 0, 1, 0, 0, 1]"


<b>Custom data and DataLoaders for PyTorch 2.1.1</b>

In [40]:
def Making_custom_data(df):
  custom_dataset = []
  for row in df.values:
    tupp = (row[1], row[0])
    custom_dataset.append(tupp)
  return custom_dataset

In [41]:
df_train = Making_custom_data(dframe_train)
df_test = Making_custom_data(dframe_test)

In [42]:
df_train[1]

(array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0]),
 'suppose we have a historical  panel longitudinal  dataset on the number of buildings in each sub region  this is a made up dataset to explain the concept      the variable    year    ranges from   to    and it represents the year that each data point belongs to   the variable    sub region    ranges from   to    and it represents the sub region the data was collected from  the variable    type    ranges from   to   and it represents the type of each building  say  office car or residential   the variable    group    ranges from   to   and it represents the age group of each building  say    years       years      years old   the variable    count    is the dependent variable  y  and it represents the number of each building group of each type in each sub region at any specific year  the variable    population    is one of the independent variables  x   and it represent the population size in each sub region  note  it has the same value for each


<b>Making Iterable Custom Dataset.</b>



This code defines a custom iterator class named CustomIterator that enables iterating over a dataset. It initializes the iterator with the dataset and provides methods for checking the dataset length, accessing individual data points, and iterating through the dataset<br>__init__ Method: This method initializes the iterator with the dataset and sets the initial index to 0.

__iter__ Method: This method returns the iterator object itself, enabling it to be used in a for loop.

__len__ Method: This method returns the length of the dataset, allowing for checking the dataset size.

__getitem__ Method: This method takes an index idx and returns the corresponding data point from the dataset.

__next__ Method: This method implements the iterator behavior. It checks if the current index is within the dataset's length. If so, it returns the data point at the current index, increments the index, and proceeds. Otherwise, it raises StopIteration to indicate the end of the dataset.



In [43]:
class CustomIterator:
    def __init__(self, dataset):
        self.dataset = dataset
        self.current_index = 0

    def __iter__(self):
        return self

    def __len__(self):
      return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[idx]

    def __next__(self):
        if self.current_index < len(self.dataset):
            data_point = self.dataset[self.current_index]
            self.current_index += 1
            return data_point
        else:
            # If we've reached the end of the dataset, raise StopIteration
            raise StopIteration

# Example usage:
# Assuming your dataset is a list of tuples (label, text)

# Create an instance of the custom iterator
train_iter = CustomIterator(df_train)
test_iter = CustomIterator(df_test)

In [44]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")


def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)


vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>","<pad>"])
vocab.set_default_index(vocab["<unk>"])



In [45]:
list(vocab.get_stoi().items())[:10]

[('zwet', 24972),
 ('zval', 24971),
 ('zuur', 24970),
 ('zugdmqipoleu', 24967),
 ('ztest', 24965),
 ('zph', 24964),
 ('zoubin', 24962),
 ('zkdicw', 24959),
 ('zingales', 24956),
 ('zhu', 24953)]

In [46]:
vocab(['here', 'is', 'an', 'example',''])

[96, 7, 50, 108, 0]

In [47]:
text_pipeline = lambda x: vocab(tokenizer(x))

In [48]:
text_pipeline('here is the an example')

[96, 7, 2, 50, 108]

<b>Making DataLoader</b>

In [49]:
from torch.utils.data import DataLoader

In [50]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(_label)
        temp_text = text_pipeline(_text)
        if(len(temp_text) < 100):
          to_append = [1]*(100 - len(temp_text))
          temp_text += to_append
        else:
          temp_text = temp_text[:100]
        processed_text = torch.tensor(temp_text, dtype=torch.int64)
        text_list.append(processed_text)
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.stack(text_list)
    return label_list.to(device), text_list

dataloader_train = DataLoader(train_iter, batch_size=128, shuffle=False, collate_fn=collate_batch, drop_last = True)
dataloader_test = DataLoader(test_iter, batch_size=128, shuffle=False, collate_fn=collate_batch, drop_last = True)

In [51]:
# Unpack the batch into individual components
data_iter = iter(dataloader_train)
data_iter_test = iter(dataloader_test)
labels, text = next(data_iter)
labels_test, text_test = next(data_iter_test)

  label_list = torch.tensor(label_list, dtype=torch.int64)


In [52]:
text.shape, labels.shape

(torch.Size([128, 100]), torch.Size([128, 10]))

In [53]:
text_test.shape, labels_test.shape

(torch.Size([128, 100]), torch.Size([128, 10]))

#4. Model Building

<b>super(Net, self).__init__()</b><br>
This line initializes the Net class instance and calls the __init__() method of the parent class nn.Module

<b>self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)</b>

We are adding an embedding layer which will make the embeddings using the nn.Embedding module. The embedding layer converts each word in the vocabulary (size vocab_size) into a vector of size embedding_dim.

<b>self.rnn_layer = nn.RNN(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)</b>

In this line we are addin an RNN layer. The RNN layer processes sequences of embedded word vectors. embedding_dim represents the input size for each word vector, hidden_size determines the complexity of the RNN's internal state representation, and batch_first=True indicates that the batch dimension is at the first position of the input tensor.

<b>self.fc = nn.Sequential( nn.Linear(hidden_size, 128), nn.ReLU(), nn.Linear(128, output_size), nn.Sigmoid() )</b>

This line creates a fully connected layer using the nn.Sequential module. The fully connected layer takes the last output from the RNN layer as input and transforms it into a probability distribution over output_size classes. The intermediate layers (with hidden size 128) introduce non-linearity using nn.ReLU(), and the output layer applies a sigmoid activation function.

<b>def forward(self, x):</b>

This line defines the forward() method, which is responsible for performing the forward pass through the neural network.

<b>embedded = self.embedding(x)</b>

This line applies the embedding layer to the input sequence x, converting each word into its corresponding embedding vector.

<b>rnn_output, _ = self.rnn_layer(embedded)</b>

This line applies the RNN layer to the embedded word vectors, producing an output sequence rnn_output and an updated hidden state (not used here).

<b>rnn_output = rnn_output[:, -1]</b>

This line extracts the last output from the RNN layer's output sequence, representing the final state of the RNN after processing the entire sentence.

<b>output = self.fc(rnn_output)</b>

This line passes the last RNN output through the fully connected layer, transforming it into a probability distribution over the output classes.

<b>return output</b>

This line returns the final output of the neural network, representing the predicted class probabilities for the input sentence.

### 4.1 Model Architecture

In [54]:
class Net(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        super(Net, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.rnn_layer = nn.RNN(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, output_size),
            nn.Sigmoid()
        )

    def forward(self, x):
        embedded = self.embedding(x)
        rnn_output, _ = self.rnn_layer(embedded)
        rnn_output = rnn_output[:, -1]  # Considering the last output of the sequence
        output = self.fc(rnn_output)
        return output


In [55]:
#define the model
model = Net(len(vocab), 50, 128, 10)

In [56]:
#model layers
model

Net(
  (embedding): Embedding(24973, 50)
  (rnn_layer): RNN(50, 128, batch_first=True)
  (fc): Sequential(
    (0): Linear(in_features=128, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=10, bias=True)
    (3): Sigmoid()
  )
)

<b>Checking the model on one of the Batch from our data</b>

In [57]:
data_iter = iter(dataloader_train)
batch = next(data_iter)

# Unpack the batch into individual omponents
label_list, text_list = batch

In [58]:
# #pass an text to the model to understand the output
# deactivates autograd
with torch.no_grad():
  pred = model(text_list)
  print(pred)

tensor([[0.4642, 0.4858, 0.5135,  ..., 0.4432, 0.4890, 0.5074],
        [0.4691, 0.4894, 0.5281,  ..., 0.4724, 0.4973, 0.5047],
        [0.4621, 0.4552, 0.5566,  ..., 0.4604, 0.5354, 0.5220],
        ...,
        [0.4806, 0.4720, 0.5371,  ..., 0.4713, 0.4966, 0.4842],
        [0.4548, 0.4648, 0.5435,  ..., 0.4521, 0.5255, 0.4856],
        [0.4806, 0.4720, 0.5371,  ..., 0.4713, 0.4966, 0.4842]])


In [59]:
#define optimizer and loss
optimizer = torch.optim.Adam(model.parameters())
criterion = BCELoss()

# checking if GPU is available
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()

<b>The train()</b> function is responsible for training the neural network model using a given dataloader and optimizer. It iterates over the batches in the dataloader, computes the loss for each batch, performs the backward propagation, and updates the model's weights using the optimizer.

<b>The model.train()</b> call sets the model to training mode, enabling dropout and other regularization techniques specific to the training process.

<b>epoch_loss</b> is initialized to 0 to accumulate the loss over the entire epoch.
<b>no_of_batches</b> is initialized to 0 to count the total number of batches processed.

<b>Batch Iteration:</b>
The for loop here, iterates over the batches provided by the dataloader.

<b>Unpack Batch:</b>
The batch is unpacked into batch_y (labels) and batch_x (text data).

<b>Convert Labels to Float:</b>
The labels batch_y are converted to Float tensors to match the data type expected by the loss function.

<b>Push to CUDA (if available):</b>
If a CUDA GPU is available, the tensors batch_x and batch_y are transferred to the GPU for faster computations.

<b>Clear Gradients:</b>
The gradients accumulated from the previous backward pass are cleared using optimizer.zero_grad(), ensuring that the gradients are only computed for the current batch.

<b>Forward Pass:</b>
The input text data batch_x is passed through the neural network model model, producing the output predictions outputs.

<b>Convert to 1D Tensor:</b>
The output predictions outputs are squeezed to a one-dimensional tensor, as the loss function expects a flattened output.

<b>Calculate Loss and Accuracy:</b>
The loss function criterion is applied to the output predictions outputs and the target labels batch_y, resulting in the loss value loss.

<b>Backward Pass:</b>
The loss loss is propagated back through the network using loss.backward(), updating the gradients of the network's parameters.

<b>Update Weights:</b>
The optimizer optimizer performs a weight update step using optimizer.step(), adjusting the network's parameters based on the accumulated gradients.

<b>Track Epoch Loss:</b>
The current batch loss loss.item() is added to the accumulated epoch_loss to keep track of the overall loss for the epoch.

<b>Count Batches:</b>
The no_of_batches counter is incremented to track the total number of batches processed during the epoch.

<b>Return Epoch Loss:</b>
The function returns the average loss for the epoch, calculated as epoch_loss / no_of_batches.

In [60]:
def train(dataloader, batch_size):

    # Activate training phase
    model.train()

    # Initialization
    epoch_loss = 0
    no_of_batches = 0

    # Iterate over the dataloader
    count = 0
    for batch in dataloader:
        # Unpack the batch into text and labels
        batch_y, batch_x = batch
        print('Batch_no: ', count)
        count += 1

        # Convert labels to Float
        batch_y = batch_y.float()

        # Push to CUDA
        if torch.cuda.is_available():
            batch_x, batch_y = batch_x.cuda(), batch_y.cuda()

        # Clear gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(batch_x)

        # Converting to a 1-dimensional tensor
        outputs = outputs.squeeze()

        # Calculate loss and accuracy
        loss = criterion(outputs, batch_y)

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()

        # Keep track of the loss and accuracy of a epoch
        epoch_loss = epoch_loss + loss.item()

        # No. of batches
        no_of_batches = no_of_batches + 1

    return epoch_loss / no_of_batches


<b>The evaluate()</b> function is responsible for evaluating the performance of the trained neural network model on a given dataloader. It iterates over the batches in the dataloader, computes the loss for each batch, and accumulates the loss to calculate the average loss for the evaluation dataset.

<b>Deactivate Training Phase:</b>
The model.eval() call sets the model to evaluation mode, disabling dropout and other regularization techniques specific to the training process.

<b>Initialization:</b>
epoch_loss is initialized to 0 to accumulate the loss over the entire evaluation set.
no_of_batches is initialized to 0 to count the total number of batches processed.

<b>Batch Iteration:</b>
A for loop iterates over the batches provided by the dataloader.
Unpack Batch:

The batch is unpacked into batch_y (labels) and batch_x (text data).
Convert Labels to Float:

The labels batch_y are converted to Float tensors to match the data type expected by the loss function.
Move Tensors to GPU (if available):

<b>Deactivate Autograd:</b>
The with torch.no_grad() context disables gradient calculation, as we are only interested in evaluating the model's performance, not updating its parameters.

<b>Forward Pass:</b>
The input text data batch_x is passed through the neural network model model, producing the output predictions outputs.

<b>Convert Outputs to 1D Tensor:</b>
The output predictions outputs are squeezed to a one-dimensional tensor, as the loss function expects a flattened output.
Calculate Loss:

The loss function criterion is applied to the output predictions outputs and the target labels batch_y, resulting in the loss value loss.

<b>Keep Track of Loss:</b>
The current batch loss loss.item() is added to the accumulated epoch_loss to keep track of the overall loss for the evaluation set.

<b>Count Batches:</b>
The no_of_batches counter is incremented to track the total number of batches processed during the evaluation set.

<b>Calculate Average Loss:</b>
The function returns the average loss for the evaluation set, calculated as epoch_loss / no_of_batches.

In [61]:
def evaluate(dataloader, batch_size):
    # Deactivate training phase
    model.eval()

    # Initialization
    epoch_loss = 0
    no_of_batches = 0

    count = 0
    # Iterate over the dataloader
    for batch in dataloader:
        print('Batch No: ', count)
        count+=1
        # Unpack the batch into text and labels
        batch_y, batch_x = batch

        # Convert labels to Float
        batch_y = batch_y.float()

        # Move tensors to GPU if available
        if torch.cuda.is_available():
            batch_x = batch_x.cuda()
            batch_y = batch_y.cuda()

        # Deactivate autograd
        with torch.no_grad():
            # Forward pass
            outputs = model(batch_x)

            # Convert outputs to 1-dimensional tensor
            outputs = outputs.squeeze()

            # Calculate loss
            loss = criterion(outputs, batch_y)

            # Keep track of loss
            epoch_loss += loss.item()

            # No. of batches
            no_of_batches += 1

    # Calculate average loss
    epoch_loss /= no_of_batches

    return epoch_loss

In [62]:
def predict(dataloader, batch_size):
    # Deactivate training phase
    model.eval()

    count = 0
    # Initialize empty list for predictions
    predictions = []

    # Iterate over the dataloader
    for batch in dataloader:
        # Unpack the batch into text
        batch_y, batch_x = batch
        print('Batch No: ', count)
        count += 1
        # Move tensor to GPU if available
        if torch.cuda.is_available():
            batch_x = batch_x.cuda()

        # Deactivate autograd
        with torch.no_grad():
            # Forward pass
            outputs = model(batch_x)

            # Convert outputs to 1-dimensional tensor
            outputs = outputs.squeeze()

            # Convert to numpy array and append to predictions list
            prediction = outputs.data.cpu().numpy()
            predictions.append(prediction)

    # Concatenate predictions into a single numpy array
    predictions = np.concatenate(predictions, axis=0)

    return predictions


 Evaluating

In [63]:
N_EPOCHS = 10
batch_size = 128

# Initialization
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    # Train the model
    train_loss = train(dataloader_train, batch_size)

    # Evaluate the model
    valid_loss = evaluate(dataloader_test, batch_size)

    print('\nEpoch :', epoch,
          '\tTraining loss:', round(train_loss, 4),
          '\tValidation loss:', round(valid_loss, 4))

    # Save the best model
    if best_valid_loss >= valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
        print("\n----------------------------------------------------Saved best model------------------------------------------------------------------")


Batch_no:  0
Batch_no:  1
Batch_no:  2
Batch_no:  3
Batch_no:  4
Batch_no:  5
Batch_no:  6
Batch_no:  7
Batch_no:  8
Batch_no:  9
Batch_no:  10
Batch_no:  11
Batch_no:  12
Batch_no:  13
Batch_no:  14
Batch_no:  15
Batch_no:  16
Batch_no:  17
Batch_no:  18
Batch_no:  19
Batch_no:  20
Batch_no:  21
Batch_no:  22
Batch_no:  23
Batch_no:  24
Batch_no:  25
Batch_no:  26
Batch_no:  27
Batch_no:  28
Batch_no:  29
Batch_no:  30
Batch_no:  31
Batch_no:  32
Batch_no:  33
Batch_no:  34
Batch_no:  35
Batch_no:  36
Batch_no:  37
Batch_no:  38
Batch_no:  39
Batch_no:  40
Batch_no:  41
Batch_no:  42
Batch_no:  43
Batch_no:  44
Batch_no:  45
Batch_no:  46
Batch_no:  47
Batch_no:  48
Batch_no:  49
Batch_no:  50
Batch_no:  51
Batch_no:  52
Batch_no:  53
Batch_no:  54
Batch_no:  55
Batch_no:  56
Batch_no:  57
Batch_no:  58
Batch_no:  59
Batch_no:  60
Batch_no:  61
Batch_no:  62
Batch_no:  63
Batch_no:  64
Batch_no:  65
Batch_no:  66
Batch_no:  67
Batch_no:  68
Batch No:  0
Batch No:  1
Batch No:  2
Batch

### 5.2 Checking the performance of the model

In [64]:
#load weights of best model
path='saved_weights.pt'
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [65]:
#predict probabilities
batch_size = 128
y_pred_prob = []
y_pred_prob.append(predict(dataloader_test, batch_size))

Batch No:  0
Batch No:  1
Batch No:  2
Batch No:  3
Batch No:  4
Batch No:  5
Batch No:  6
Batch No:  7
Batch No:  8
Batch No:  9
Batch No:  10
Batch No:  11
Batch No:  12
Batch No:  13
Batch No:  14
Batch No:  15
Batch No:  16


In [66]:
y_pred_prob = np.concatenate(y_pred_prob, axis = 0)
y_pred_prob = np.concatenate(y_pred_prob, axis = 0)

In [67]:
y_pred_prob.shape

(21760,)

In [68]:
y_pred_prob[:10]

array([0.12655829, 0.16578057, 0.1173611 , 0.15160848, 0.19196936,
       0.18465072, 0.3293871 , 0.42862576, 0.19954805, 0.1490999 ],
      dtype=float32)

In [69]:
#actual tags
y_temp = []
for x in range(len(dataloader_test)):
  data_iter = iter(dataloader_test)
  batch = next(data_iter)

  # Unpack the batch into individual omponents
  label_list, text_list = batch
  y_temp.append(label_list.cpu().numpy())

y_temp = np.concatenate(y_temp, axis=0)
y_true = []
for x in y_temp:
  y_true.append(x)
y_true = np.concatenate(y_true, axis=0)

In [70]:
y_true.shape

(21760,)

In [71]:
#define candidate threshold values
threshold  = np.arange(0,0.5,0.01)
print(threshold)

[0.   0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1  0.11 0.12 0.13
 0.14 0.15 0.16 0.17 0.18 0.19 0.2  0.21 0.22 0.23 0.24 0.25 0.26 0.27
 0.28 0.29 0.3  0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4  0.41
 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49]


In [72]:
# convert probabilities into classes or tags based on a threshold value
def classify(y_pred_prob, thresh):
  y_pred = np.where(y_pred_prob<thresh, 0, 1)

  return np.array(y_pred)

In [73]:
score=[]

for thresh in threshold:

    #classes for each threshold
    y_pred = classify(y_pred_prob, thresh)

    score.append(metrics.f1_score(y_true, y_pred))

In [74]:
# find the optimal threshold
opt = threshold[score.index(max(score))]
print(opt)

0.27


In [75]:
y_pred_prob.shape

(21760,)

In [76]:
#predictions for optimal threshold
y_pred = classify(y_pred_prob, opt)

In [77]:
#Classification report
print(metrics.classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.84      0.85     17187
           1       0.43      0.45      0.44      4573

    accuracy                           0.76     21760
   macro avg       0.64      0.65      0.64     21760
weighted avg       0.76      0.76      0.76     21760



In [78]:
y_pred_labels_numeric = np.array(np.split(y_pred, 10))
y_true_labels_numeric = np.array(np.split(y_true, 10))

In [79]:
y_pred_labels_numeric = np.array(y_pred_labels_numeric.transpose())
y_true_labels_numeric = np.array(y_true_labels_numeric.transpose())

In [80]:
y_pred_labels_numeric.shape, y_true_labels_numeric.shape

((2176, 10), (2176, 10))

In [81]:
#convert back to tags
y_pred_label = mlb.inverse_transform(y_pred_labels_numeric)
y_true_label = mlb.inverse_transform(y_true_labels_numeric)

# # get all validation text
# queries = [" ".join(i) for i in dframe_test.query]

# create a dataframe to show the data and prediction side by side
df = pd.DataFrame({'Questions':dframe_test['query'][:2176],'Actual Tags':y_true_label,'Predicted Tags':y_pred_label})

# print first five rows
df.head()

Unnamed: 0,Questions,Actual Tags,Predicted Tags
8772,suppose you have data in the following format ...,"(distributions,)","(distributions, r)"
9847,i have a question on how a statistician would ...,"(distributions, time series)","(distributions, r)"
3265,assume that there are n realisations of five...,"(probability, regression)","(machine learning, time series)"
2319,first let me start off by saying i know the co...,"(machine learning, regression)","(distributions, machine learning, time series)"
9298,i create time series model via model sarima...,"(hypothesis testing, logistic)","(hypothesis testing, regression)"


# 6. LSTM Model Building

In [82]:
text_list.shape

torch.Size([128, 100])

In [83]:
class Net(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        super(Net, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.lstm_layer = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, output_size),
            nn.Sigmoid()
        )

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_output, (hidden_state, cell_state) = self.lstm_layer(embedded)
        lstm_output = lstm_output[:, -1, :]  # Considering the last output of the sequence
        output = self.fc(lstm_output)
        return output


In [84]:
#define the model
model = Net(len(vocab), 50, 128, 10)

In [85]:
#model layers
model

Net(
  (embedding): Embedding(24973, 50)
  (lstm_layer): LSTM(50, 128, batch_first=True)
  (fc): Sequential(
    (0): Linear(in_features=128, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=10, bias=True)
    (3): Sigmoid()
  )
)

In [86]:
data_iter = iter(dataloader_train)
batch = next(data_iter)

# Unpack the batch into individual omponents
label_list, text_list = batch

In [87]:
label_list.shape

torch.Size([128, 10])

In [88]:
# #pass an text to the model to understand the output
#deactivates autograd
with torch.no_grad():
  pred = model(text_list)
  print(pred)

tensor([[0.4760, 0.5144, 0.5009,  ..., 0.5104, 0.4829, 0.5135],
        [0.4800, 0.5312, 0.4789,  ..., 0.5070, 0.4796, 0.5179],
        [0.4696, 0.5165, 0.4936,  ..., 0.5038, 0.4861, 0.5237],
        ...,
        [0.4791, 0.5144, 0.4851,  ..., 0.5065, 0.4859, 0.5094],
        [0.4716, 0.5067, 0.4873,  ..., 0.5133, 0.4947, 0.5117],
        [0.4791, 0.5144, 0.4851,  ..., 0.5065, 0.4859, 0.5094]])


In [89]:
#define optimizer and loss
optimizer = torch.optim.Adam(model.parameters())
criterion = BCELoss()

# checking if GPU is available
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()

In [90]:
def train(dataloader, batch_size):

    # Activate training phase
    model.train()

    # Initialization
    epoch_loss = 0
    no_of_batches = 0

    # Iterate over the dataloader
    count = 0
    for batch in dataloader:
        # Unpack the batch into text and labels
        batch_y, batch_x = batch
        print('Batch_no: ', count)
        count += 1

        # Convert labels to Float
        batch_y = batch_y.float()

        # Push to CUDA
        if torch.cuda.is_available():
            batch_x, batch_y = batch_x.cuda(), batch_y.cuda()

        # Clear gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(batch_x)

        # Converting to a 1-dimensional tensor
        outputs = outputs.squeeze()

        # Calculate loss and accuracy
        loss = criterion(outputs, batch_y)

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()

        # Keep track of the loss and accuracy of a epoch
        epoch_loss = epoch_loss + loss.item()

        # No. of batches
        no_of_batches = no_of_batches + 1

    return epoch_loss / no_of_batches


In [91]:
def evaluate(dataloader, batch_size):
    # Deactivate training phase
    model.eval()

    # Initialization
    epoch_loss = 0
    no_of_batches = 0

    count = 0
    # Iterate over the dataloader
    for batch in dataloader:
        print('Batch No: ', count)
        count+=1
        # Unpack the batch into text and labels
        batch_y, batch_x = batch

        # Convert labels to Float
        batch_y = batch_y.float()

        # Move tensors to GPU if available
        if torch.cuda.is_available():
            batch_x = batch_x.cuda()
            batch_y = batch_y.cuda()

        # Deactivate autograd
        with torch.no_grad():
            # Forward pass
            outputs = model(batch_x)

            # Convert outputs to 1-dimensional tensor
            outputs = outputs.squeeze()

            # Calculate loss
            loss = criterion(outputs, batch_y)

            # Keep track of loss
            epoch_loss += loss.item()

            # No. of batches
            no_of_batches += 1

    # Calculate average loss
    epoch_loss /= no_of_batches

    return epoch_loss

In [92]:
def predict(dataloader, batch_size):
    # Deactivate training phase
    model.eval()

    count = 0
    # Initialize empty list for predictions
    predictions = []

    # Iterate over the dataloader
    for batch in dataloader:
        # Unpack the batch into text
        batch_y, batch_x = batch
        print('Batch No: ', count)
        count += 1
        # Move tensor to GPU if available
        if torch.cuda.is_available():
            batch_x = batch_x.cuda()

        # Deactivate autograd
        with torch.no_grad():
            # Forward pass
            outputs = model(batch_x)

            # Convert outputs to 1-dimensional tensor
            outputs = outputs.squeeze()

            # Convert to numpy array and append to predictions list
            prediction = outputs.data.cpu().numpy()
            predictions.append(prediction)

    # Concatenate predictions into a single numpy array
    predictions = np.concatenate(predictions, axis=0)

    return predictions


<b>Model Evaluation for LSTM</b>

In [93]:
N_EPOCHS = 10
batch_size = 128

# Initialization
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    # Train the model
    train_loss = train(dataloader_train, batch_size)

    # Evaluate the model
    valid_loss = evaluate(dataloader_test, batch_size)

    print('\nEpoch :', epoch,
          '\tTraining loss:', round(train_loss, 4),
          '\tValidation loss:', round(valid_loss, 4))

    # Save the best model
    if best_valid_loss >= valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights_lstm.pt')
        print("\n----------------------------------------------------Saved best model------------------------------------------------------------------")


Batch_no:  0
Batch_no:  1
Batch_no:  2
Batch_no:  3
Batch_no:  4
Batch_no:  5
Batch_no:  6
Batch_no:  7
Batch_no:  8
Batch_no:  9
Batch_no:  10
Batch_no:  11
Batch_no:  12
Batch_no:  13
Batch_no:  14
Batch_no:  15
Batch_no:  16
Batch_no:  17
Batch_no:  18
Batch_no:  19
Batch_no:  20
Batch_no:  21
Batch_no:  22
Batch_no:  23
Batch_no:  24
Batch_no:  25
Batch_no:  26
Batch_no:  27
Batch_no:  28
Batch_no:  29
Batch_no:  30
Batch_no:  31
Batch_no:  32
Batch_no:  33
Batch_no:  34
Batch_no:  35
Batch_no:  36
Batch_no:  37
Batch_no:  38
Batch_no:  39
Batch_no:  40
Batch_no:  41
Batch_no:  42
Batch_no:  43
Batch_no:  44
Batch_no:  45
Batch_no:  46
Batch_no:  47
Batch_no:  48
Batch_no:  49
Batch_no:  50
Batch_no:  51
Batch_no:  52
Batch_no:  53
Batch_no:  54
Batch_no:  55
Batch_no:  56
Batch_no:  57
Batch_no:  58
Batch_no:  59
Batch_no:  60
Batch_no:  61
Batch_no:  62
Batch_no:  63
Batch_no:  64
Batch_no:  65
Batch_no:  66
Batch_no:  67
Batch_no:  68
Batch No:  0
Batch No:  1
Batch No:  2
Batch

### 6.2 Exercise:  Check the performance of the LSTM model just like we did in the RNN model