<a href="https://colab.research.google.com/github/abhialag/iiscdlfa_kaggle_grp4/blob/main/Abhay_Group_4_M3_Mini_Hackathon_Irrelevant_Inappropriate_Questions_Classification_exp_v5_lstmcnn_nlpaug.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Irrelevant/inappropriate Questions Classification using Deep Neural Networks.


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the text data
* represent the text/words using pretrained word embeddings - Word2Vec/GloVe
* build deep neural networks to classify questions as irrelevant/inappropriate or not


## Dataset

The challenge in this competition is to predict whether a question asked on a well known public forum/platform is irrelevant/inappropriate or not.

An irrelevant/inappropriate question is defined as a question intended to make a statement rather than to seek helpful or meaningful answers. The following characteristics can indicate that a question is irrelevant/inappropriate:

* Based on false information, or contains absurd assumptions
* Has a non-neutral tone
* Has an exaggerated tone to underscore a point about a group of people
* Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory against an individual or a group of people
* Uses sexual content (such as incest, pedophilia), and not to seek genuine answers
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
* Based on an unrealistic premise about a group of people
* Is not grounded in reality

The training dataset includes the 1044897 questions that were asked, each labeled as irrelevant/inappropriate (target = 1) or relevant/appropriate (target = 0). The test dataset consists of approximately 261000 questions.

The training data might be imbalanced or noisy and is not guaranteed to be perfect. Take the necessary steps to handle this while building the model.


## Description

This dataset has the following information:

1. **qid** - unique question identifier
2. **question_text** - the text of the question asked in the well known public forum/platform
3. **target** - a question labeled "irrelevant/inappropriate" has a value of 1, otherwise 0



## Problem Statement

Classify approximately 261000 questions asked on a well known public forum as 'irrelevant/inappropriate' or 'relevant/appropriate' using deep neural networks such as RNN/CNN/BERT/LSTM.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle (https://www.kaggle.com/t/bde6f23028154933a99e4b4ca8a3dff2), click the user icon, then open your profile as shown below and click Account.

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP.PNG)

### 2. Next, scroll down to the API access section and click on **Create New Token** to download an API key (kaggle.json).

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP_1.PNG)

### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"abhaykumardnnai","key":"1716a26a8649843ef484f1b554327b9f"}'}

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

kaggle.json  sample_data/


In [None]:
!pip install urllib3==1.25

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting urllib3==1.25
  Downloading urllib3-1.25-py2.py3-none-any.whl (149 kB)
Reason for being yanked: Broken release
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.15
    Uninstalling urllib3-1.26.15:
      Successfully uninstalled urllib3-1.26.15
Successfully installed urllib3-1.25


### 4. Install the Kaggle API using the following command


In [None]:
!pip install -U -q kaggle==1.5.8

  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done
  Building wheel for kaggle (setup.py) ... done
  Building wheel for slugify (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchdata 0.6.1 requires urllib3>=1.25, but you have urllib3 1.24.3 which is incompatible.

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

kaggle.json


In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab
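
The copy and permission steps above can also be done from Python. The following is a minimal sketch, not part of the original notebook; `install_kaggle_token` is a hypothetical helper name, and the source and destination paths are parameters you would adapt.

```python
import json
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(src, dest_dir):
    """Copy a Kaggle token into dest_dir and restrict it to owner read/write.

    Hypothetical helper: mirrors the mkdir / cp / chmod 600 shell steps.
    """
    src = Path(src)
    # Fail early if the file is not a token with the two expected fields.
    token = json.loads(src.read_text())
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"token file is missing fields: {sorted(missing)}")

    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p
    dest = dest_dir / src.name
    shutil.copyfile(src, dest)                    # cp
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)   # chmod 600
    return dest
```

A call such as `install_kaggle_token("kaggle.json", "~/.kaggle")` would then replace the three shell cells.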

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 and 2.

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c toxic-questions-classification

Downloading toxic-questions-classification.zip to /content
 99% 60.0M/60.6M [00:01<00:00, 50.5MB/s]
100% 60.6M/60.6M [00:01<00:00, 45.4MB/s]


In [None]:
!unzip /content/toxic-questions-classification.zip

Archive:  /content/toxic-questions-classification.zip
  inflating: sample_submission.csv   
  inflating: test_dataset.csv        
  inflating: train_dataset.csv       


## YOUR CODING STARTS FROM HERE

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.1-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Import required packages

In [None]:
# Import required packages
import pandas as pd
import numpy as np

import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Model
from keras.layers import Input, Embedding, concatenate, Dense, Bidirectional, Dropout, Flatten, Conv1D, MaxPooling1D
from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
import pickle
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

In [None]:
# @title Download the glove embedding Dataset
!wget -qq https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/glove.6B.zip
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


##   **Stage 1**:  Data Loading and Exploratory Data Analysis (1 Point)

In [None]:
# YOUR CODE HERE

# Loading the train_dataset
df_train = pd.read_csv('train_dataset.csv')
print('Length of train data',len(df_train),'\n')
print(df_train.head(),'\n') # looking at the data and field structure
print(df_train['target'].value_counts(),'\n')  #looking at the spread of target variable
print(df_train.isnull().values.any(),'\n') # zero null values


Length of train data 1044897 

                    qid                                      question_text  \
0  2549b81c4adff1849a7f                          Is CSE at bit Meara good?   
1  0558ed93a4630e68f7ac  Is it better to exercise before or after the b...   
2  5d72d5233059e44f8a8e  Can character naming in writing infringe on tr...   
3  3968636ac28841d0c901  Why does everyone making YouTube videos in Jap...   
4  201d2b9a777bbf25443f  Is there any relation between horse power and ...   

   target  
0       0  
1       0  
2       0  
3       0  
4       0   

0    980293
1     64604
Name: target, dtype: int64 

False 
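
The counts above show that only about 6% of the training questions carry target = 1. Besides oversampling, one common way to handle such a skew is to weight classes inversely to their frequency. The sketch below is illustrative only: `class_weights` is a hypothetical helper and the toy labels merely imitate the 94:6 ratio.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels with roughly the same 94:6 skew as the training set.
toy_labels = [0] * 94 + [1] * 6
weights = class_weights(toy_labels)
print(weights)
```

Weights like these could be passed to a loss function through its class-weight argument (for example the `weight` tensor of `nn.CrossEntropyLoss`) as an alternative or complement to augmenting the minority class.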



In [None]:
print(df_train[df_train['target']==1].head()) # to see if 1 means negative or positive
# target 1 means negative, irrelevant and inappropriate question

                     qid                                      question_text  \
16  8ea797496fc68c9d8d98             Why are black people always tormented?   
28  72e1085eab12b6aa55e2                              How do you spell aye?   
29  8137a860b078efcadd4c  Why do Conservatives want all news to be conse...   
55  4233e8ed3bbbf5b8a242  Are we all for calling the people born in the ...   
67  4c4e07c6a1723d0fe649  Why did the frustrated Catholics of South Indi...   

    target  
16       1  
28       1  
29       1  
55       1  
67       1  


In [None]:
# import nltk
# from nltk.corpus import stopwords
# nltk.download('stopwords')
# stopwords = set(stopwords.words('english'))

In [None]:
# removal of stop words
# NOTE: relies on the `stopwords` set built in the NLTK cell above, which is
# currently commented out; uncomment that cell before calling this function.
def stopwordsremoval(sentence):
  words = sentence.lower().split()
  filtered_words = [word for word in words if word not in stopwords]
  return ' '.join(filtered_words)

In [None]:
def cleaning_dataset(df):

    # Pre-Processing
    # convert all sentences to string format
    df['question_text'] = df['question_text'].astype(str)

    # convert all sentences to lower case
    df['question_text'] = df['question_text'].apply(lambda sentence_A: sentence_A.lower())
    # df['question_text'] = df['question_text'].apply(lambda sentence: stopwordsremoval(sentence))
    return df

In [None]:
# cleaning the questions column by lowering
df_train_cleaned = cleaning_dataset(df_train)
df_train_cleaned.drop(['qid'],axis=1,inplace=True)
df_train_cleaned.head(2)


Unnamed: 0,question_text,target
0,is cse at bit meara good?,0
1,is it better to exercise before or after the b...,0


In [None]:
!pip install nlpaug

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


In [None]:
from transformers import DistilBertModel, DistilBertTokenizer, DistilBertForSequenceClassification
# from imblearn.over_sampling import SMOTE
import nlpaug
import nlpaug.augmenter.word as naw
from nlpaug.augmenter.word import SynonymAug

In [None]:
# sentences_pos = df_train_cleaned[df_train_cleaned['target']==0]['question_text'].tolist()
sentences_neg = df_train_cleaned[df_train_cleaned['target']==1]['question_text'].tolist()
labels = df_train_cleaned['target'].tolist()
# sentences_pos[0:5],sentences_neg[0:5],labels[0:3]

In [None]:
# Define the NLP augmentation function
def augment_sentence(sentence, num_aug=10):
    # WordNet synonym replacement; aug_max caps the number of words swapped per copy
    aug = SynonymAug(aug_max=4)
    augmented_texts = aug.augment(sentence, n=num_aug)
    return augmented_texts

In [None]:
sentences_neg_augmented = [x for sent in sentences_neg for x in augment_sentence(sent)]
print(sentences_neg_augmented[8:12])
label_neg_augmented = [1 for x in sentences_neg_augmented]
# print(label_neg_augmented[0:15],len(label_neg_augmented))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


["wherefore are black people e'er torture?", 'why are black masses constantly rag?', 'how coif you spell aye?', 'how do you spell aye?']


In [None]:
tot_sentences = df_train_cleaned['question_text'].tolist()
tot_sentences.extend(sentences_neg_augmented)
tot_labels = df_train_cleaned['target'].tolist()
tot_labels.extend(label_neg_augmented)
len(tot_sentences),len(tot_labels)

(1690937, 1690937)

In [None]:
aug_data = {'question_text':tot_sentences,'target':tot_labels}  # named aug_data to avoid shadowing the built-in dict
# df_aug = pd.DataFrame(aug_data)
df_train_cleaned = pd.DataFrame(aug_data) # used directly while NLP augmentation replaces SMOTE
# len(df_aug),df_aug.tail(4)

In [None]:
# df_train_cleaned = df_aug

In [None]:
#### finished NLP augmentation

In [None]:
# Tokenizer class from the keras.preprocessing.text module creates a word-to-index integer dictionary
# Vectorize the text samples
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train_cleaned['question_text'])

In [None]:
train_ques_seq = tokenizer.texts_to_sequences(df_train_cleaned['question_text'])

In [None]:
max_len = 50

In [None]:
# pad every sequence (post-padding with zeros) to max_len tokens so all vectors have uniform length
max_len = 80 # earlier 50
train_ques_seq = pad_sequences(train_ques_seq, maxlen=max_len, padding='post')

In [None]:
print(train_ques_seq[0:2])

[[    4  1246    51  1295 99198    74     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]
 [    4    17   139     2   755   179    27    83     1  4607     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]]


In [None]:
print(train_ques_seq.shape)

(1690937, 80)
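
To make the padding behaviour concrete, here is a pure-Python sketch of what `pad_sequences(..., padding='post')` does to a single sequence. It assumes the Keras default `truncating='pre'`, under which over-long sequences keep their last `maxlen` tokens; `pad_post` is a hypothetical name used only for illustration.

```python
def pad_post(seq, maxlen, value=0):
    """Pad on the right with `value`; keep the LAST maxlen tokens if too long.

    Mirrors pad_sequences(..., padding='post') with its default
    truncating='pre' for a single sequence.
    """
    if maxlen <= 0:
        return []
    trimmed = list(seq)[-maxlen:]          # 'pre' truncation drops leading tokens
    return trimmed + [value] * (maxlen - len(trimmed))


print(pad_post([4, 1246, 51], 5))          # [4, 1246, 51, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

If the start of a long question matters more than its end, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.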


### Load the GloVe word embeddings
Now, let us load the 200-dimensional GloVe embeddings.

In [None]:
embeddings_index = {}
# Loading the 200-dimensional GloVe vectors (earlier 100d)
with open('/content/glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [None]:
print(len(tokenizer.word_index))

202108


In [None]:
# Adding 1 because index 0 is reserved (used for padding)
words_not_found = []
vocab_size = len(tokenizer.word_index) + 1
print('Vocab Size %d' % vocab_size)
print('Loaded %s word vectors.' % len(embeddings_index))

embedding_dim = 200 #earlier 50

# Create a weight matrix for words in the training data
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i >= vocab_size:
        continue
    embedding_vector = embeddings_index.get(word)
    if (embedding_vector is not None) and len(embedding_vector) > 0:
        embedding_matrix[i] = embedding_vector
    else:
        words_not_found.append(word)

print(embedding_matrix.shape)

Vocab Size 202109
Loaded 400000 word vectors.
(202109, 200)


In [None]:
embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.071549  ,  0.093459  ,  0.023738  , ...,  0.33616999,
         0.030591  ,  0.25577   ],
       [ 0.57345998,  0.54170001, -0.23477   , ...,  0.54417998,
        -0.23069   ,  0.34946999],
       ...,
       [ 0.16487999,  0.14946   , -0.23224001, ...,  0.20412999,
        -0.13065   , -0.50190997],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.15442   , -1.14240003, -0.15925001, ...,  0.25330001,
        -0.37436   , -0.18716   ]])
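
Rows for words that GloVe does not contain stay all-zero, as some rows in the matrix above show, and the loop collects those words in `words_not_found` without reporting them. A quick coverage check along the following lines may help judge how much of the vocabulary actually gets a pretrained vector. This is a sketch with toy stand-ins; `embedding_coverage` is a hypothetical helper.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return (fraction of vocabulary with a pretrained vector, missing words)."""
    missing = [w for w in word_index if w not in embeddings_index]
    found = len(word_index) - len(missing)
    coverage = found / len(word_index) if word_index else 0.0
    return coverage, missing


# Toy stand-ins for tokenizer.word_index and the GloVe lookup.
toy_word_index = {"why": 1, "is": 2, "qwertyzz": 3, "good": 4}
toy_embeddings = {"why": [0.1], "is": [0.2], "good": [0.3]}
coverage, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(coverage, missing)   # 0.75 ['qwertyzz']
```

In the notebook the same function could be called with `tokenizer.word_index` and `embeddings_index`.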

# **The GloVe vectorization above is used for the CNN and LSTM models**

In [None]:
X = train_ques_seq
Y = df_train_cleaned['target']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, stratify=Y, test_size=0.18, shuffle=True)
# Check for the shape of train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1386568, 80) (304369, 80) (1386568,) (304369,)
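
One caveat with this split: the augmented sentences were appended to the data before splitting, so synonym variants of the same original question can end up in both the training and validation sets, which may make the validation accuracy look better than it is. A possible alternative, sketched below under simplifying assumptions (no stratification, a placeholder `toy_augment` instead of `SynonymAug`, hypothetical function names), is to hold out the validation set first and augment only the training rows.

```python
import random


def split_then_augment(sentences, labels, augment, test_frac=0.18, seed=0):
    """Hold out a validation set first, then augment only the training positives."""
    rng = random.Random(seed)
    idx = list(range(len(sentences)))
    rng.shuffle(idx)
    n_val = int(len(idx) * test_frac)
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    train = [(sentences[i], labels[i]) for i in train_idx]
    val = [(sentences[i], labels[i]) for i in val_idx]

    # Augmented copies are derived from training rows only.
    extra = [(aug, 1) for text, y in train if y == 1 for aug in augment(text)]
    return train + extra, val


def toy_augment(text):
    # Placeholder for a real augmenter such as SynonymAug.
    return [text + " (aug)"]


sents = [f"q{i}" for i in range(50)]
labs = [1 if i % 5 == 0 else 0 for i in range(50)]
train, val = split_then_augment(sents, labs, toy_augment)
print(len(train), len(val))
```

In practice the shuffle could be replaced by `train_test_split(..., stratify=...)` on the un-augmented data, with augmentation applied to the resulting training portion.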


In [None]:
print(torch.tensor(list(y_train)))
torch.tensor(X_train)

tensor([0, 1, 1,  ..., 1, 1, 1])


tensor([[   4, 4344,  239,  ...,    0,    0,    0],
        [  30,    3,   58,  ...,    0,    0,    0],
        [  10,   31,  361,  ...,    0,    0,    0],
        ...,
        [  40,    9,  611,  ...,    0,    0,    0],
        [  30,    3,  817,  ...,    0,    0,    0],
        [  10,    9,  276,  ...,    0,    0,    0]], dtype=torch.int32)

In [None]:
from imblearn.over_sampling import SMOTE
print(X_train.shape,'\n',y_train.value_counts())

(1386568, 80) 
 0    803840
1    582728
Name: target, dtype: int64


In [None]:
# Apply SMOTE to oversample the minority class
# smote = SMOTE(sampling_strategy={1: 700000})
# X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# print(len(X_train),len(X_train_resampled),y_train_resampled.shape,y_train_resampled[0:1],y_train_resampled.value_counts())

In [None]:
X_train_resampled, y_train_resampled = X_train, y_train #in case not running smote but just nlpaug

In [None]:
print(X_train_resampled.shape)

(1386568, 80)


In [None]:
# Custom dataset class
class SentenceClassDataset(Dataset):
    def __init__(self, questions, labels):
        self.questions = questions
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        questions = self.questions[idx]
        labels = self.labels[idx]
        return questions,labels

In [None]:
# train_dataset = SentenceClassDataset(torch.tensor(X_train),torch.tensor(list(y_train)))
train_dataset = SentenceClassDataset(torch.tensor(X_train_resampled),torch.tensor(list(y_train_resampled)))
test_dataset = SentenceClassDataset(torch.tensor(X_test),torch.tensor(list(y_test)))


In [None]:
class CNNSentenceClassifier(nn.Module):
    def __init__(self, embedding_dim, num_classes, vocab_size, pretrained_embeddings):
        super(CNNSentenceClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
        self.conv1 = nn.Conv1d(embedding_dim, 128, kernel_size=3)
        self.conv2 = nn.Conv1d(128, 256, kernel_size=3)
        self.fc1 = nn.Linear(256, 128)
        self.fc2 = nn.Linear(128, num_classes)
        self.softmax = nn.Softmax(dim=1)  # explicit dim, as in CNNSentenceClassifier_v2

    def forward(self, x):
        embedded = self.embedding(x)
        embedded = embedded.permute(0, 2, 1)
        conv1_out = F.relu(self.conv1(embedded))
        conv2_out = F.relu(self.conv2(conv1_out))
        pooled = F.max_pool1d(conv2_out, kernel_size=conv2_out.size(2)).squeeze(2)
        fc1_out = F.relu(self.fc1(pooled))
        fc2_out = self.fc2(fc1_out)
        logits = self.softmax(fc2_out)
        return logits

class CNNSentenceClassifier_v2(nn.Module):
    def __init__(self, embedding_dim, num_classes, vocab_size, pretrained_embeddings):
        super(CNNSentenceClassifier_v2, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
        self.conv1 = nn.Conv1d(embedding_dim, 128, kernel_size=3)
        self.conv2 = nn.Conv1d(128, 256, kernel_size=3)
        self.fc1 = nn.Linear(256, 128)
        self.fc2 = nn.Linear(128, num_classes)
        self.dropout = nn.Dropout(p=0.1) # 0.5 was initial
        self.batch_norm1 = nn.BatchNorm1d(128)
        self.batch_norm2 = nn.BatchNorm1d(256)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        embedded = self.embedding(x)
        embedded = embedded.permute(0, 2, 1)
        conv1_out = F.relu(self.batch_norm1(self.conv1(embedded)))
        conv2_out = F.relu(self.batch_norm2(self.conv2(conv1_out)))
        pooled = F.max_pool1d(conv2_out, kernel_size=conv2_out.size(2)).squeeze(2)
        fc1_out = F.relu(self.dropout(self.fc1(pooled)))
        fc2_out = self.fc2(fc1_out)
        logits = self.softmax(fc2_out)
        return logits


class LSTMAttnCNNSentenceClassifier_2(nn.Module):
    def __init__(self, embedding_matrix, hidden_size, num_classes):
        super(LSTMAttnCNNSentenceClassifier_2, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(
            torch.FloatTensor(embedding_matrix), freeze=True)

        # take the embedding width from the matrix itself rather than the global embedding_dim
        self.conv1 = nn.Conv1d(embedding_matrix.shape[1], 128, kernel_size=4)
        self.conv2 = nn.Conv1d(128, 256, kernel_size=4)
        self.fc1 = nn.Linear(256, 128)
        self.fc2 = nn.Linear(128, num_classes)
        # one dropout layer shared by the CNN and LSTM branches
        self.dropout = nn.Dropout(p=0.2) # 0.5 was initial
        self.batch_norm1 = nn.BatchNorm1d(128)
        self.batch_norm2 = nn.BatchNorm1d(256)

        self.lstm1 = nn.LSTM(embedding_matrix.shape[1], hidden_size,
                            batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(hidden_size*2, hidden_size,
                            batch_first=True, bidirectional=True)
        self.batch_norm = nn.BatchNorm1d(hidden_size*2)
        self.fc = nn.Linear(hidden_size*2, num_classes)
        self.attention = nn.Linear(hidden_size*2, 1)
        self.softmax = nn.Softmax()

    def forward(self, x):
        embedded = self.embedding(x)
        output, (hidden, _) = self.lstm1(embedded)
        output = self.dropout(output)
        output = output.permute(0, 2, 1)
        output = self.batch_norm(output)
        output = output.permute(0, 2, 1)
        output, (hidden, _) = self.lstm2(output)
        output = self.dropout(output)
        attention_weights = F.softmax(self.attention(output), dim=1)
        context_vector = torch.sum(output * attention_weights, dim=1)
        fc_out = self.fc(context_vector)
        logits_lstm = self.softmax(fc_out)

        # embedded = self.embedding(x)
        embedded = embedded.permute(0, 2, 1)
        conv1_out = F.relu(self.batch_norm1(self.conv1(embedded)))
        conv2_out = F.relu(self.batch_norm2(self.conv2(conv1_out)))
        pooled = F.max_pool1d(conv2_out, kernel_size=conv2_out.size(2)).squeeze(2)
        fc1_out = F.relu(self.dropout(self.fc1(pooled)))
        fc2_out = self.fc2(fc1_out)
        logits_cnn = self.softmax(fc2_out)
        # logits = []
        # for value1, value2 in zip(logits_lstm,logits_cnn):
        #   high_logit = max(value1,value2)
        #   logits.append(high_logit)

        # logits = torch.max(logits_lstm,logits_cnn)
        logits = (logits_lstm + logits_cnn)/2

        return logits
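
The attention step in `forward` scores every timestep, normalizes the scores with a softmax over the sequence, and sums the LSTM outputs with those weights. The same computation for a single example can be written in plain Python as a sketch (`attention_pool` is a hypothetical name, and the scores stand in for the output of the `attention` linear layer):

```python
import math


def attention_pool(outputs, scores):
    """Softmax the per-timestep scores, then return the weighted sum of outputs.

    outputs: list of T vectors (each a list of floats); scores: list of T floats.
    """
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(outputs[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, outputs)) for d in range(dim)]
    return context, weights


context, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(weights, context)   # [0.5, 0.5] [0.5, 0.5]
```

With equal scores every timestep contributes equally; a larger score for one timestep shifts the context vector toward that timestep's output.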


In [None]:

# Define the hyperparameters
embedding_dim = 200 #earlier 50
num_classes = 2
vocab_size = len(tokenizer.word_index) + 1  # +1 for the reserved padding index 0
pretrained_embeddings = embedding_matrix  # Provide your pretrained GloVe embeddings
num_epochs = 10
batch_size = 100
hidden_size = 64
learning_rate = 0.001

# Create an instance of the CNN model (the v2 variant adds batch norm and dropout)
# cnn_model = CNNSentenceClassifier(embedding_dim, num_classes, vocab_size, pretrained_embeddings)
cnn_model = CNNSentenceClassifier_v2(embedding_dim, num_classes, vocab_size, pretrained_embeddings)

# Create an instance of the LSTM + Attn model
# lstmattn_model = LSTMAttnSentenceClassifier(pretrained_embeddings,hidden_size,num_classes)
# lstmattn_model = LSTMAttnSentenceClassifier_2(pretrained_embeddings,hidden_size,num_classes)

lstmattnCNN_model = LSTMAttnCNNSentenceClassifier_2(pretrained_embeddings,hidden_size,num_classes)


In [None]:
# Define the loss function and optimizer
# model = cnn_model  # if CNN model
# model = lstmattn_model #if LSTM ATTn model
model = lstmattnCNN_model # if LSTM Attn CNN model

model
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model.to(device)

cuda


LSTMAttnCNNSentenceClassifier_2(
  (embedding): Embedding(202109, 200)
  (conv1): Conv1d(200, 128, kernel_size=(4,), stride=(1,))
  (conv2): Conv1d(128, 256, kernel_size=(4,), stride=(1,))
  (fc1): Linear(in_features=256, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (batch_norm1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (batch_norm2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (lstm1): LSTM(200, 64, batch_first=True, bidirectional=True)
  (lstm2): LSTM(128, 64, batch_first=True, bidirectional=True)
  (batch_norm): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc): Linear(in_features=128, out_features=2, bias=True)
  (attention): Linear(in_features=128, out_features=1, bias=True)
  (softmax): Softmax(dim=None)
)

In [None]:
# Create data loader

train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_data_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)  # evaluation does not need shuffling

In [None]:
# for batch in train_data_loader:
#     print(tuple(t.to(device) for t in batch))
#     questions, labels = tuple(t.to(device) for t in batch)

In [None]:
# Example code for training the model
for epoch in range(num_epochs):
  model.train()
  total_loss = 0

  # Training loop
  for batch in train_data_loader:
    # print(tuple(t.to(device) for t in batch))
    inputs, labels = tuple(t.to(device) for t in batch)
    optimizer.zero_grad()
    outputs = model(inputs)  # inputs is your input sentence tensor
    loss = criterion(outputs, labels)  # labels is your target class tensor
    total_loss += loss.item()
    # print('batch loss',loss.item())

    loss.backward()
    optimizer.step()
    # Perform evaluation, validation, or other tasks as needed

  # Calculate average training loss
  avg_train_loss = total_loss / len(train_data_loader)

  # Validation loop
  model.eval()
  val_loss = 0
  val_accuracy = 0
  val_steps = 0
  total_f1_score = 0
  # Evaluate on the held-out validation split
  for batch in test_data_loader:
    inputs, labels = tuple(t.to(device) for t in batch)
    with torch.no_grad():
      outputs = model(inputs)  # inputs is your input sentence tensor
      loss = criterion(outputs, labels)
      val_loss += loss.item()
      # print('Val Loss',loss.item())
      logits = outputs
      _, predictions = torch.max(logits, dim=1)
      val_accuracy += torch.sum(predictions == labels).item()
      val_steps += labels.size(0)
      # f1 = f1_score(labels, predictions) # while gpu commenting out
      # print("F1 score:", f1) #while gpu commenting out
      # total_f1_score += f1 #while gpu commenting out

  # Calculate average validation loss and accuracy
  avg_val_loss = val_loss / len(test_data_loader)
  avg_val_accuracy = val_accuracy / val_steps
  # avg_total_f1_score = total_f1_score/val_steps #while gpu commenting out

  print(f'Epoch {epoch+1}/{num_epochs}')
  print(f'Training loss: {avg_train_loss}')
  print(f'Validation loss: {avg_val_loss}')
  print(f'Validation accuracy: {avg_val_accuracy}')
  # print(f'Validation F1 Score: {avg_total_f1_score}') # while gpu commenting out
  print()




Epoch 1/10
Training loss: 0.39574488776295486
Validation loss: 0.3776740433547876
Validation accuracy: 0.937243937457494

Epoch 2/10
Training loss: 0.3768528299759464
Validation loss: 0.37183259663498824
Validation accuracy: 0.9416891996228263

Epoch 3/10
Training loss: 0.3707646455502383
Validation loss: 0.3704015528501725
Validation accuracy: 0.9437097733343409

Epoch 4/10
Training loss: 0.3669398521063356
Validation loss: 0.37076165301923525
Validation accuracy: 0.9425927081930157

Epoch 5/10
Training loss: 0.3639828323991237
Validation loss: 0.3668832438220802
Validation accuracy: 0.9462855941308084

Epoch 6/10
Training loss: 0.36181586976091223
Validation loss: 0.3645767149528912
Validation accuracy: 0.949199819955383

Epoch 7/10
Training loss: 0.35988624274223446
Validation loss: 0.3633427641864832
Validation accuracy: 0.9508064224674655

Epoch 8/10
Training loss: 0.3586485675729131
Validation loss: 0.3627122314389839
Validation accuracy: 0.9512532485239955

Epoch 9/10
Training l
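
A note on the loss values above: the model classes apply a softmax in `forward`, and `nn.CrossEntropyLoss` applies a log-softmax to whatever it receives. When its inputs are already probabilities in [0, 1], the two-class loss for a sample cannot drop below log(1 + e^-1) ≈ 0.3133, which may explain why the losses level off around 0.36. The plain-Python sketch below reproduces that floor; `cross_entropy` is an illustrative re-implementation of the formula, not the PyTorch function.

```python
import math


def cross_entropy(inputs, target):
    """Cross-entropy as -log_softmax(inputs)[target] for one sample."""
    m = max(inputs)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in inputs))
    return log_sum_exp - inputs[target]


# Best case when the model already outputs probabilities: a perfect [1.0, 0.0].
loss_on_probs = cross_entropy([1.0, 0.0], target=0)
# Raw logits are unbounded, so the loss can approach zero.
loss_on_logits = cross_entropy([10.0, -10.0], target=0)
print(round(loss_on_probs, 4), loss_on_logits < 1e-6)   # 0.3133 True
```

The usual arrangement with `nn.CrossEntropyLoss` is to return the raw output of the last linear layer during training and apply a softmax only when probabilities are needed at inference time.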

In [None]:
print(logits)

In [None]:
len([x for x in predictions if x==1])

33
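
The F1 lines in the validation loop are commented out when running on GPU. One way around this, sketched here as an assumption rather than the notebook's method, is to collect labels and predictions as plain Python lists across all batches (for example via `.cpu().tolist()`) and compute a single F1 score per epoch. Averaging per-batch F1 values generally does not give the same number as F1 over the whole validation set. `binary_f1` is a hypothetical helper.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class from two equal-length label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# In the validation loop one would extend these per batch, e.g. with
# labels.cpu().tolist() and predictions.cpu().tolist() (assumed usage).
all_labels = [1, 0, 1, 1, 0, 0]
all_preds = [1, 0, 0, 1, 1, 0]
print(binary_f1(all_labels, all_preds))
```

Because the target classes are imbalanced, F1 on the positive class is usually more informative here than plain accuracy.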

In [None]:
# saving the model, hyperparameters and embedding matrix as pretrained embeddings
torch.save(model, 'cnnlstm_model_v6.pth')  # saves the entire pickled model object (the class definition is needed at load time)
# torch.save(model.state_dict(), 'cnn_model.pth')  # alternative: save only the model parameters

hyperparameters = {
    'embedding_dim': embedding_dim,
    'hidden_dim': hidden_size,
    'num_classes': num_classes,
    # 'dropout': dropout,
    'learning_rate': learning_rate,
    'vocab_size': vocab_size,
    'pretrained_embeddings':pretrained_embeddings,
    'num_epochs':num_epochs,
    'batch_size': batch_size,
    'max_token_len': max_len
}
torch.save(hyperparameters, 'lstmcnn_hyperparameters_v6.pth')  #saving the model hyperparameters

np.save('lstmcnn_pretrained_embeddings_v6.npy', pretrained_embeddings) #saving the pretrained glove embeddings as embedding matrix
# saving fitted tokenizer
with open('lstmcnn_tokenizer_v6.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)





***For Prediction***

In [None]:
# # Function for inference/prediction on a new dataset
# def preprocess_input(input_sentences):
#     input_tensor = []
#     for sentence in input_sentences:
#         word_indices = []
#         for word in sentence.split():
#             if word in word_to_index:
#                 word_indices.append(word_to_index[word])
#             else:
#                 word_indices.append(0)  # Assign index 0 for unknown words
#         input_tensor.append(word_indices)
#     input_tensor = torch.LongTensor(input_tensor)
#     return input_tensor

In [None]:
## Load the model, hyperparameters, embedding matrix and tokenizer
## Note: this reads the v4 artifacts, while the save cell above writes v6 files
def load_model():
  model = torch.load('cnnlstm_model_v4.pth')  # the model class must already be defined in the global scope
  hyperparameters = torch.load('lstmcnn_hyperparameters_v4.pth')
  pretrained_embeddings = np.load('lstmcnn_pretrained_embeddings_v4.npy')
  with open('lstmcnn_tokenizer_v4.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

  return model, hyperparameters, pretrained_embeddings, tokenizer

In [None]:
## Run class CNNSentenceClassifier(nn.Module) block above

In [None]:
model,hyperparameters,pretrained_embeddings,tokenizer = load_model()

In [None]:
print(hyperparameters)

{'embedding_dim': 200, 'hidden_dim': 64, 'num_classes': 2, 'learning_rate': 0.001, 'vocab_size': 202061, 'pretrained_embeddings': array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.071549  ,  0.093459  ,  0.023738  , ...,  0.33616999,
         0.030591  ,  0.25577   ],
       [ 0.57345998,  0.54170001, -0.23477   , ...,  0.54417998,
        -0.23069   ,  0.34946999],
       ...,
       [ 0.17603   ,  0.061422  ,  0.080036  , ...,  0.041023  ,
         0.33221999, -0.1688    ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]]), 'num_epochs': 5, 'batch_size': 100, 'max_token_len': 80}


In [None]:
df_pred = pd.read_csv('test_dataset.csv')
df_pred.head(3)
print(df_pred.head(3),'\n',df_pred.shape)

                    qid                                      question_text
0  d5cacbea9be29bd47a78                               Is Minance any good?
1  5650c4a236fe3b555c31            Do computers have reserved key strokes?
2  b778db4f09f9326195ea  When was the last time that the US had such a ... 
 (261221, 2)


In [None]:
## Run cleaning_dataset function

In [None]:
# clean the question_text column (including lowercasing)
df_pred_cleaned = cleaning_dataset(df_pred)
# df_pred_cleaned.drop(['qid'],axis=1,inplace=True)
df_pred_cleaned.head(2)


Unnamed: 0,qid,question_text
0,d5cacbea9be29bd47a78,is minance any good?
1,5650c4a236fe3b555c31,do computers have reserved key strokes?


In [None]:
pred_ques_seq = tokenizer.texts_to_sequences(df_pred_cleaned['question_text'])

In [None]:
# pad every sequence to max_token_len (80 in the saved hyperparameters) so all rows have the same length
max_token_len = hyperparameters['max_token_len']
pred_ques_seq = pad_sequences(pred_ques_seq, maxlen=max_token_len, padding='post')
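To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences(..., padding='post')` does to each row. The helper name `pad_post` is hypothetical, and the truncation rule (keep the last `maxlen` tokens) mirrors the Keras default `truncating='pre'`, which the cell above does not override.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence on the right with `value`; keep the last `maxlen` tokens if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front, as truncating='pre' would
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# Token indices of the first two test questions, padded to a short length for readability
rows = pad_post([[4, 49336, 63, 74], [9, 3520, 24, 6065, 1487, 18188]], maxlen=8)
print(rows)
```

With `maxlen=80` the same rule yields the `(261221, 80)` array printed in the next cell.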

In [None]:
print(pred_ques_seq.shape,pred_ques_seq[0:2])

(261221, 80) [[    4 49336    63    74     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]
 [    9  3520    24  6065  1487 18188     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     

In [None]:
pred_ques_seq = torch.tensor(pred_ques_seq)
print(pred_ques_seq.shape,pred_ques_seq[0:2])

torch.Size([261221, 80]) tensor([[    4, 49336,    63,    74,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    9,  3520,    24,  6065,  1487, 18188,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,    

In [None]:
pred_batch_size = 5000

In [None]:
# Custom dataset class
class PredSentenceClassDataset(Dataset):
    def __init__(self, questions):
        self.questions = questions

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        questions = self.questions[idx]
        return questions
pred_dataset = PredSentenceClassDataset(pred_ques_seq)
pred_data_loader = torch.utils.data.DataLoader(pred_dataset, batch_size=pred_batch_size, shuffle=False)



In [None]:
model.eval()
final_pred_class_value = []
with torch.no_grad():  # inference only, so skip gradient tracking
  for batch_questions in pred_data_loader:
    print(batch_questions.shape)
    batch_questions = batch_questions.to(device)
    outputs = model(batch_questions)
    _, pred_class_value = torch.max(outputs, dim=1)  # index of the highest-scoring class
    final_pred_class_value.extend(pred_class_value)
    print(len(final_pred_class_value))

torch.Size([5000, 80])
5000
torch.Size([5000, 80])


  logits_lstm = self.softmax(fc_out)
  logits_cnn = self.softmax(fc2_out)


10000
torch.Size([5000, 80])
15000
torch.Size([5000, 80])
20000
torch.Size([5000, 80])
25000
torch.Size([5000, 80])
30000
torch.Size([5000, 80])
35000
torch.Size([5000, 80])
40000
torch.Size([5000, 80])
45000
torch.Size([5000, 80])
50000
torch.Size([5000, 80])
55000
torch.Size([5000, 80])
60000
torch.Size([5000, 80])
65000
torch.Size([5000, 80])
70000
torch.Size([5000, 80])
75000
torch.Size([5000, 80])
80000
torch.Size([5000, 80])
85000
torch.Size([5000, 80])
90000
torch.Size([5000, 80])
95000
torch.Size([5000, 80])
100000
torch.Size([5000, 80])
105000
torch.Size([5000, 80])
110000
torch.Size([5000, 80])
115000
torch.Size([5000, 80])
120000
torch.Size([5000, 80])
125000
torch.Size([5000, 80])
130000
torch.Size([5000, 80])
135000
torch.Size([5000, 80])
140000
torch.Size([5000, 80])
145000
torch.Size([5000, 80])
150000
torch.Size([5000, 80])
155000
torch.Size([5000, 80])
160000
torch.Size([5000, 80])
165000
torch.Size([5000, 80])
170000
torch.Size([5000, 80])
175000
torch.Size([5000, 80]

In [None]:
# pred_batch_size = 5000
# final_pred_class_value = []
# total_pred_run = int(len(pred_ques_seq)/pred_batch_size)
# print(total_pred_run)
# for i in range(total_pred_run):
#   # print(i)
#   if(i<total_pred_run-1):
#     outputs = model(pred_ques_seq[i*pred_batch_size:(i+1)*pred_batch_size-1,:])
#   else:
#     outputs = model(pred_ques_seq[i*pred_batch_size:])

#   print(i*pred_batch_size,(i+1)*pred_batch_size-1)

#   _, pred_class_value = torch.max(outputs, dim=1)
#   final_pred_class_value.extend(pred_class_value)
#   print(len(final_pred_class_value))

# # print(len(final_pred_class_value),final_pred_class_value)
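The commented-out loop above slices batches by hand. Its `(i+1)*pred_batch_size-1` upper bound would skip the last row of every batch except the final one, because Python slices already exclude the stop index. A minimal sketch of the boundary arithmetic, using a hypothetical helper `batch_slices`:

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) pairs that cover range(n_rows) exactly once."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


slices = list(batch_slices(261221, 5000))  # test set size and batch size used above
print(len(slices), slices[0], slices[-1])
```

Each pair can be used as `pred_ques_seq[start:stop]`; the `DataLoader` in the earlier cell does the same bookkeeping automatically.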


In [None]:
# print(final_pred_class_value)
print([x for x in final_pred_class_value if x == 1])
final_pred_class_value = [int(x) for x in final_pred_class_value]
final_pred_class_value[0:3]

[tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='c

[0, 0, 0]

In [None]:
# print(df_pred.head(2))
df_pred['target'] = pd.Series(final_pred_class_value)
df_pred.tail(2)

Unnamed: 0,qid,question_text,target
261219,4c6218c04aff5e60bebb,what are the best fandom shirts?,0
261220,dae65fdd97e961ee7f02,how can i approach a bank to grant me access t...,0


In [None]:
df_pred[df_pred['target']==1]

Unnamed: 0,qid,question_text,target
13,2300536249628f6e8ba9,was the award for the most polluted cities in ...,1
22,8d006c69df0152efcc34,if psychopaths lack shame then why don't they ...,1
24,55adc6b9879746a8590e,does islam allow my parents to punish me for n...,1
38,ccd2222b5a38d73c9654,"in questioning the trump - russia connection, ...",1
81,847a8a8045a2e72b2b4d,why do male porn stars initially pound the wom...,1
...,...,...,...
261147,7ca760b5506ad5d30c6c,what if muslims exceed hindus in india?,1
261158,91cc1a26b03c126f3274,why do people on quora ask silly questions or ...,1
261165,68e13801f25f321b0805,"should even the nice guys, who've never been i...",1
261184,0f20ad60f68e3ff78341,why do indians in quora regard chinese roads a...,1


In [None]:
df_pred[['qid','target']].to_csv('Group4_Pred_Submission_v77_lC.csv', index=False)  # index=False keeps only the qid and target columns
# df_pred.to_csv('Group4_Pred_Submission_v75_lC_withques.csv')
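Before uploading, it can help to sanity-check the file layout. pandas `to_csv` writes the row index as an unnamed first column unless `index=False` is passed. The sketch below is only an illustration: `check_submission` is a hypothetical helper, and the expected `qid,target` header is an assumption about the competition format, not something stated in this notebook.

```python
import csv
import io


def check_submission(csv_text, expected_header=("qid", "target")):
    """Return a list of problems found in a submission CSV given as text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    problems = []
    if not rows or tuple(rows[0]) != tuple(expected_header):
        problems.append(f"unexpected header: {rows[0] if rows else None}")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_header):
            problems.append(f"line {line_no}: expected {len(expected_header)} fields, got {len(row)}")
        elif row[-1] not in ("0", "1"):
            problems.append(f"line {line_no}: target is not 0 or 1")
    return problems


good = "qid,target\nd5cacbea9be29bd47a78,0\n5650c4a236fe3b555c31,0\n"
with_index = ",qid,target\n0,d5cacbea9be29bd47a78,0\n"  # layout when the index column is kept
print(check_submission(good))
print(check_submission(with_index))
```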

In [None]:

# submit the file to kaggle
!kaggle competitions submit toxic-questions-classification -f Group4_Pred_Submission_v77_lC.csv -m "Model dr+bn+cn+mk"

In [None]:
from google.colab import files
# files.download('Group4_Pred_Submission_v75_lC_withques.csv')
files.download('Group4_Pred_Submission_v77_lC.csv')


# **FOR BERT PROCESSING**

In [None]:
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
)


In [None]:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # binary classification (appropriate vs. inappropriate)


In [None]:
# Load the DistilBERT pre-trained model and tokenizer (this replaces the BERT model loaded above)
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
# Use the variant with a classification head so the model returns logits for the two classes
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

In [None]:
# df_train_cleaned

category_A_limit = 100000  # Specify the desired limit for category A
category_A = 0  # Specify the target category appropriate = 0

# Filter the DataFrame to keep only rows with target category A
df_category_A = df_train_cleaned[df_train_cleaned['target'] == category_A]

# Shuffle the filtered DataFrame
df_category_A = df_category_A.sample(frac=1, random_state=42)

# Keep only the limited number of rows for category A
df_category_A = df_category_A[:category_A_limit]

# Filter out the rows with target category A from the original DataFrame
df_other_category = df_train_cleaned[df_train_cleaned['target'] != category_A]

# Concatenate the limited category A DataFrame with the other category DataFrame
df_final = pd.concat([df_category_A, df_other_category], ignore_index=True)

# Shuffle the final DataFrame
df_final = df_final.sample(frac=1, random_state=42)
print(df_final.shape,df_final.head(2))
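The cell above undersamples the majority class (target 0) to 100000 rows and keeps every row of the other class. The same idea on plain Python lists, as a minimal sketch with a hypothetical helper `undersample` and made-up toy rows:

```python
import random


def undersample(rows, majority_label, limit, seed=42):
    """Keep at most `limit` rows of `majority_label` and every row of the other labels."""
    rng = random.Random(seed)
    majority = [r for r in rows if r["target"] == majority_label]
    others = [r for r in rows if r["target"] != majority_label]
    rng.shuffle(majority)            # pick a random subset of the majority class
    kept = majority[:limit] + others
    rng.shuffle(kept)                # mix the classes again
    return kept


# Toy data (made up for illustration): 20 rows of class 0, 3 rows of class 1
toy = [{"question_text": f"q{i}", "target": 0} for i in range(20)]
toy += [{"question_text": f"p{i}", "target": 1} for i in range(3)]
balanced = undersample(toy, majority_label=0, limit=5)
print(len(balanced), sum(r["target"] for r in balanced))
```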

In [None]:
df_train_cleaned=df_final
print(df_train_cleaned.shape)

In [None]:
print(list(df_train_cleaned['question_text'])[0:2])
ques_sentence_data = list(df_train_cleaned['question_text'])
labels = list(df_train_cleaned['target'])
print(labels[0:3])

In [None]:
# Tokenize each question with the BERT tokenizer and convert the labels to a tensor
def bert_tokenize(ques_sentence_data,labels):
  input_ids = []
  attention_masks = []
  for ques in ques_sentence_data:
      encoded_dict = tokenizer.encode_plus(
          ques,
          add_special_tokens=True,
          max_length=60, # default 128
          padding='max_length',
          truncation=True,
          return_attention_mask=True,
          return_tensors='pt'
      )
      input_ids.append(encoded_dict['input_ids'])
      attention_masks.append(encoded_dict['attention_mask'])

  input_ids = torch.cat(input_ids, dim=0)
  attention_masks = torch.cat(attention_masks, dim=0)
  labels = torch.tensor(labels)
  return(input_ids, attention_masks,labels)

# # Create a DataLoader for the dataset
# dataset = TensorDataset(input_ids, attention_masks, labels)
# batch_size = 40
# dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
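To illustrate what `padding='max_length'`, `truncation=True` and `return_attention_mask=True` produce for a single question, here is a minimal pure-Python sketch. `encode_fixed_length` is a hypothetical helper, the word-piece ids in the example are invented, and the special-token ids are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in the bert-base-uncased vocabulary


def encode_fixed_length(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad with PAD_ID and build the attention mask."""
    body = list(token_ids)[: max_length - 2]          # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                             # 1 marks a real token
    pad = max_length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad     # 0 marks padding, ignored by attention


# Hypothetical word-piece ids, not produced by a real tokenizer
ids_row, mask_row = encode_fixed_length([2003, 2023, 1037, 3160], max_length=8)
print(ids_row)
print(mask_row)
```

In `bert_tokenize` each such row is returned as a `(1, 60)` tensor, and `torch.cat(..., dim=0)` stacks the rows into one `(num_questions, 60)` tensor.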


In [None]:
# PREPROCESS your dataset by encoding BERT wise, do this for prediction data as well
input_ids, attention_masks,labels = bert_tokenize(ques_sentence_data,labels)
print(input_ids[0:2],labels[0:2])

In [None]:
from sklearn.model_selection import train_test_split

# Split inputs, attention masks and labels in a single call so the three stay row-aligned
train_inputs, val_inputs, train_masks, val_masks, train_labels, val_labels = train_test_split(
    input_ids, attention_masks, labels, test_size=0.2, random_state=42)
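Inputs, masks and labels must end up in the same rows after the split. The pure-Python sketch below shows why one shared permutation guarantees this; `split_aligned` is a hypothetical helper for illustration, not the scikit-learn implementation.

```python
import random


def split_aligned(arrays, test_size=0.2, seed=42):
    """Split several equal-length lists with one shared shuffle so rows stay matched."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "all arrays must have the same length"
    order = list(range(n))
    random.Random(seed).shuffle(order)   # one permutation, reused for every array
    n_val = int(round(n * test_size))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return [([a[i] for i in train_idx], [a[i] for i in val_idx]) for a in arrays]


toy_ids = [[i, i] for i in range(10)]    # stand-in for input_ids rows
toy_masks = [[i] for i in range(10)]     # stand-in for attention_masks rows
toy_labels = list(range(10))             # stand-in for labels
(tr_ids, va_ids), (tr_masks, va_masks), (tr_labels, va_labels) = split_aligned(
    [toy_ids, toy_masks, toy_labels])
print(len(tr_labels), len(va_labels))
```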


In [None]:
# Define batch size and create DataLoader
batch_size = 4000

In [None]:
from torch.utils.data import TensorDataset, DataLoader


train_dataset = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataset = TensorDataset(val_inputs, val_masks, val_labels)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)


In [None]:
# Set up optimizer and training parameters
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)
num_epochs = 2
criterion = nn.CrossEntropyLoss()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model.to(device)


In [None]:

# Before training loop
torch.cuda.empty_cache()

In [None]:
# Fine-tuning loop


for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    # Training loop
    for batch in train_dataloader:
        batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)
        optimizer.zero_grad()

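        outputs = model(input_ids=batch_inputs, attention_mask=batch_masks)
        # The model returns an output object, so the loss is computed on its logits
        # (this assumes a *ForSequenceClassification model)
        loss = criterion(outputs.logits, batch_labels)
        loss.backward()
        optimizer.step()

        # Alternative: pass the labels to the model and use its built-in loss
        # outputs = model(input_ids=batch_inputs, attention_mask=batch_masks, labels=batch_labels)
        # loss = outputs.loss

        total_loss += loss.item()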

    # Calculate average training loss
    avg_train_loss = total_loss / len(train_dataloader)

    # Validation loop
    model.eval()
    val_loss = 0
    val_accuracy = 0
    val_steps = 0

    for batch in val_dataloader:
        batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)

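        with torch.no_grad():
            outputs = model(input_ids=batch_inputs, attention_mask=batch_masks)
            # No labels are passed, so outputs.loss is None; compute the loss from the logits
            loss = criterion(outputs.logits, batch_labels)

            # outputs = model(input_ids=batch_inputs, attention_mask=batch_masks, labels=batch_labels)

        val_loss += loss.item()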

        logits = outputs.logits
        _, predictions = torch.max(logits, dim=1)
        val_accuracy += torch.sum(predictions == batch_labels).item()
        val_steps += batch_labels.size(0)

    # Calculate average validation loss and accuracy
    avg_val_loss = val_loss / len(val_dataloader)
    avg_val_accuracy = val_accuracy / val_steps

    print(f'Epoch {epoch+1}/{num_epochs}')
    print(f'Training loss: {avg_train_loss}')
    print(f'Validation loss: {avg_val_loss}')
    print(f'Validation accuracy: {avg_val_accuracy}')
    print()


##   **Stage 2**: Data Pre-Processing (1 Point)

####  Clean and Transform the data into a specified format


In [None]:
# YOUR CODE HERE

##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)



In [None]:
# YOUR CODE HERE

##   **Stage 4**: Build and Train the Deep networks model using Pytorch/Keras (5 Points)



In [None]:
# YOUR CODE HERE

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)








In [None]:
# YOUR CODE HERE