<a href="https://colab.research.google.com/github/abhialag/iiscdlfa_kaggle_grp4/blob/main/Abhay_Group_4_M3_Mini_Hackathon_Irrelevant_Inappropriate_Questions_Classification_exp_v1_1306_DistilBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Irrelevant/inappropriate Questions Classification using Deep Neural Networks.


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the text data
* represent the text/words using the pretrained word embeddings - Word2Vec/Glove
* build the deep neural networks to classify the questions as Irrelevant/inappropriate or not


## Dataset

The challenge in this competition is to predict whether a question asked on a well known public forum/platform is irrelevant/inappropriate or not.

An irrelevant/inappropriate question is defined as a question intended to make a statement rather than to seek helpful or meaningful answers. The following characteristics can signal that a question is irrelevant/inappropriate:

* Based on false information, or contains absurd assumptions
* Has a non-neutral tone
* Has an exaggerated tone to underscore a point about a group of people
* Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory against an individual or a group of people
* Uses sexual content (such as incest or pedophilia) without seeking genuine answers
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
* Based on an unrealistic premise about a group of people
* Is not grounded in reality

The training dataset includes the 1,044,897 questions that were asked, and whether each was identified as irrelevant/inappropriate (target = 1) or as relevant/appropriate (target = 0). The test dataset consists of approximately 261,000 questions.

The training data might be imbalanced or noisy. They are not guaranteed to be perfect. Please take the necessary actions/steps while building the model.


## Description

This dataset has the following information:

1. **qid** - unique question identifier
2. **question_text** - the text of the question asked in the well known public forum/platform
3. **target** - a question labeled "irrelevant/inappropriate" has a value of 1, otherwise 0
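As a small illustration of the three fields, the sketch below parses two rows with Python's standard `csv` module. The rows are copied from the `df_train` previews printed later in this notebook; the variable names are chosen for this example only.

```python
import csv
import io

# Two rows copied from the df_train previews shown later in this notebook.
sample_csv = io.StringIO(
    "qid,question_text,target\n"
    "2549b81c4adff1849a7f,Is CSE at bit Meara good?,0\n"
    "72e1085eab12b6aa55e2,How do you spell aye?,1\n"
)

rows = list(csv.DictReader(sample_csv))

# target arrives as text from the CSV, so convert it to an int label.
sample_labels = [int(row["target"]) for row in rows]

# qids of the rows labeled irrelevant/inappropriate (target == 1).
flagged_qids = [row["qid"] for row in rows if int(row["target"]) == 1]

print(sample_labels)
print(flagged_qids)
```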



## Problem Statement

To classify approximately 261,000 questions asked on a well-known public forum as 'irrelevant/inappropriate' or 'relevant/appropriate' using deep neural networks such as RNN/CNN/BERT/LSTM.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle at (https://www.kaggle.com/t/bde6f23028154933a99e4b4ca8a3dff2), click your user icon, and then click your profile as shown below. Then click **Account**.

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP.PNG)

### 2. Next, scroll down to the API access section and click on **Create New Token** to download an API key (kaggle.json).

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP_1.PNG)

### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

kaggle.json  sample_data/


In [None]:
!pip install urllib3==1.25

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting urllib3==1.25
  Downloading urllib3-1.25-py2.py3-none-any.whl (149 kB)
Reason for being yanked: Broken release
ERROR: Operation cancelled by user

### 4. Install the Kaggle API using the following command


In [None]:
!pip install -U -q kaggle==1.5.8

  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done
  Building wheel for kaggle (setup.py) ... done
  Building wheel for slugify (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchdata 0.6.1 requires urllib3>=1.25, but you have urllib3 1.24.3 which is incompatible.

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

kaggle.json


In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab
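For readers who want to check what `chmod 600` leaves behind, here is a minimal standard-library sketch. It assumes a POSIX file system (as on Colab) and uses a throwaway temporary file in place of the real token; the names `token_path` and `others_can_access` exist only in this example.

```python
import os
import stat
import tempfile

# A throwaway file stands in for ~/.kaggle/kaggle.json (illustration only).
with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = os.path.join(tmp_dir, "kaggle.json")
    with open(token_path, "w") as handle:
        handle.write("{}")

    # Equivalent of `chmod 600`: the owner may read and write, nobody else may do anything.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)

    # Read the permission bits back and test the group/other bits.
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    others_can_access = bool(mode & (stat.S_IRWXG | stat.S_IRWXO))

print(oct(mode), others_can_access)
```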

### 6. Now download the competition data (train and test) from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 and 2.

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c toxic-questions-classification

Downloading toxic-questions-classification.zip to /content
 81% 49.0M/60.6M [00:01<00:00, 43.4MB/s]
100% 60.6M/60.6M [00:01<00:00, 52.4MB/s]


In [None]:
!unzip /content/toxic-questions-classification.zip

Archive:  /content/toxic-questions-classification.zip
  inflating: sample_submission.csv   
  inflating: test_dataset.csv        
  inflating: train_dataset.csv       


## YOUR CODING STARTS FROM HERE

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.1-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Import required packages

In [None]:
# Import required packages
import pandas as pd
import numpy as np
import random
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Model
from keras.layers import Input, Embedding, concatenate, Dense, Bidirectional, Dropout, Flatten, Conv1D, MaxPooling1D
from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from torch.utils.data import DataLoader, TensorDataset
import pickle
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

In [None]:
!pip install nlpaug

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


In [None]:
from transformers import DistilBertModel, DistilBertTokenizer, DistilBertForSequenceClassification
from imblearn.over_sampling import SMOTE
import nlpaug
import nlpaug.augmenter.word as naw

##   **Stage 1**:  Data Loading and Exploratory Data Analysis (1 Point)

In [None]:
# Set random seeds for reproducibility
random.seed(42)        # random.shuffle is used later for the train/validation split
np.random.seed(42)
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


In [None]:
# YOUR CODE HERE

# Loading the train_dataset
df_train = pd.read_csv('train_dataset.csv')
print('Length of train data',len(df_train),'\n')
print(df_train.head(),'\n') # looking at the data and field structure
print(df_train['target'].value_counts(),'\n')  #looking at the spread of target variable
print(df_train.isnull().values.any(),'\n') # zero null values


Length of train data 1044897 

                    qid                                      question_text  \
0  2549b81c4adff1849a7f                          Is CSE at bit Meara good?   
1  0558ed93a4630e68f7ac  Is it better to exercise before or after the b...   
2  5d72d5233059e44f8a8e  Can character naming in writing infringe on tr...   
3  3968636ac28841d0c901  Why does everyone making YouTube videos in Jap...   
4  201d2b9a777bbf25443f  Is there any relation between horse power and ...   

   target  
0       0  
1       0  
2       0  
3       0  
4       0   

0    980293
1     64604
Name: target, dtype: int64 

False 
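The counts above show a strong class imbalance. As a rough sketch (not part of the original notebook), the snippet below turns those two counts into the positive-class share and into inverse-frequency class weights using the common `total / (num_classes * count)` formula. Such weights could, for example, be passed to a weighted loss; whether that helps here is untested.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 980293, 1: 64604}

total = sum(class_counts.values())
num_classes = len(class_counts)

# Share of questions labeled irrelevant/inappropriate (target == 1).
positive_fraction = class_counts[1] / total

# "Balanced" inverse-frequency weights: rarer classes get larger weights.
class_weights = {
    label: total / (num_classes * count)
    for label, count in class_counts.items()
}

print(f"positive fraction: {positive_fraction:.4f}")
print({label: round(weight, 3) for label, weight in class_weights.items()})
```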



In [None]:
print(df_train[df_train['target']==1].head()) # to see if 1 means negative or positive
# target 1 means negative, irrelevant and inappropriate question

                     qid                                      question_text  \
16  8ea797496fc68c9d8d98             Why are black people always tormented?   
28  72e1085eab12b6aa55e2                              How do you spell aye?   
29  8137a860b078efcadd4c  Why do Conservatives want all news to be conse...   
55  4233e8ed3bbbf5b8a242  Are we all for calling the people born in the ...   
67  4c4e07c6a1723d0fe649  Why did the frustrated Catholics of South Indi...   

    target  
16       1  
28       1  
29       1  
55       1  
67       1  


In [None]:
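# removal of stop words
# NOTE: `stopwords` was never defined in this notebook. As an assumption, this version
# falls back to NLTK's English list (needs nltk.download('stopwords') to have been run).
def stopwordsremoval(sentence, stopwords=None):
  if stopwords is None:
    from nltk.corpus import stopwords as nltk_stopwords
    stopwords = set(nltk_stopwords.words('english'))
  words = sentence.lower().split()
  filtered_words = [word for word in words if word not in stopwords]
  return ' '.join(filtered_words)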

In [None]:
def cleaning_dataset(df):

    # Pre-Processing
    # convert all sentences to string format
    df['question_text'] = df['question_text'].astype(str)

    # convert all sentences to lower case
    df['question_text'] = df['question_text'].apply(lambda sentence_A: sentence_A.lower())
    # df['question_text'] = df['question_text'].apply(lambda sentence: stopwordsremoval(sentence))
    return df

In [None]:
# cleaning the questions column by lowering
df_train_cleaned = cleaning_dataset(df_train)
df_train_cleaned.drop(['qid'],axis=1,inplace=True)
df_train_cleaned.head(2)


                                       question_text  target
0                          is cse at bit meara good?       0
1  is it better to exercise before or after the b...       0


In [None]:
from transformers import DistilBertModel, DistilBertTokenizer, DistilBertForSequenceClassification
# from imblearn.over_sampling import SMOTE
import nlpaug
import nlpaug.augmenter.word as naw
from nlpaug.augmenter.word import SynonymAug

In [None]:
# sentences_pos = df_train_cleaned[df_train_cleaned['target']==0]['question_text'].tolist()
sentences_neg = df_train_cleaned[df_train_cleaned['target']==1]['question_text'].tolist()
labels = df_train_cleaned['target'].tolist()
# sentences_pos[0:5],sentences_neg[0:5],labels[0:3]

In [None]:
# Define the NLP augmentation function
def augment_sentence(sentence, num_aug=10):
    # aug = SynonymAug(aug_max=4)
    aug = SynonymAug(aug_max=4)
    augmented_texts = aug.augment(sentence, n=num_aug)
    return augmented_texts

In [None]:
sentences_neg_augmented = [x for sent in sentences_neg for x in augment_sentence(sent)]
print(sentences_neg_augmented[8:12])
label_neg_augmented = [1 for x in sentences_neg_augmented]
# print(label_neg_augmented[0:15],len(label_neg_augmented))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


['why are black multitude constantly crucify?', 'why represent black masses always torment?', 'how do you spell aye?', 'how do you import aye?']
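The sample printed above includes `'how do you spell aye?'`, which is identical to the lower-cased original question, so some augmented variants are plain copies. A minimal sketch of filtering such copies and repeats is shown below; `drop_unchanged_and_duplicates` is a hypothetical helper, not something the notebook or `nlpaug` provides.

```python
def drop_unchanged_and_duplicates(original, candidates):
    """Keep variants that differ from the original, without repeats, preserving order."""
    seen = {original}
    kept = []
    for text in candidates:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


original = "how do you spell aye?"
# The first two variants come from the printed output above;
# the third is a repeat added here purely for illustration.
candidates = [
    "how do you spell aye?",
    "how do you import aye?",
    "how do you import aye?",
]

unique_variants = drop_unchanged_and_duplicates(original, candidates)
print(unique_variants)
```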


In [None]:
tot_sentences = df_train_cleaned['question_text'].tolist()
tot_sentences.extend(sentences_neg_augmented)
tot_labels = df_train_cleaned['target'].tolist()
tot_labels.extend(label_neg_augmented)
len(tot_sentences),len(tot_labels)

(1690937, 1690937)
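The combined length can be checked by hand: with `num_aug=10` variants for each of the 64,604 positive questions, the total should be 1,044,897 + 646,040. The short calculation below reproduces that figure and the resulting class share. It assumes every call returned exactly ten variants, which the matching total suggests for this run.

```python
n_total = 1044897      # rows in train_dataset.csv
n_positive = 64604     # rows with target == 1
num_aug = 10           # default num_aug of augment_sentence

n_augmented = n_positive * num_aug
n_combined = n_total + n_augmented

# Share of positive examples once the augmented sentences are appended.
positive_share = (n_positive + n_augmented) / n_combined

print(n_augmented, n_combined, round(positive_share, 3))
```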

In [None]:
# dict = {'question_text':tot_sentences,'target':tot_labels}
# # df_aug = pd.DataFrame(dict)
# df_train_cleaned = pd.DataFrame(dict) #temporary until we are using nlp aug instead of smote
# # len(df_aug),df_aug.tail(4)

FOR BERT Processing Approach 1

In [None]:
# from transformers import BertModel, BertTokenizer

# # Load pre-trained BERT model and tokenizer
# pretrained_model_name = 'bert-base-uncased'
# tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)
# bert_model = BertModel.from_pretrained(pretrained_model_name)
# bert_model
# # Freeze all BERT parameters except the last layer
# for param in bert_model.parameters():
#     param.requires_grad = False

# # Define your custom classifier
# class Classifier(nn.Module):
#     def __init__(self, input_size, hidden_size, num_classes):
#         super(Classifier, self).__init__()
#         self.fc = nn.Linear(input_size, hidden_size)
#         self.relu = nn.ReLU()
#         self.dropout = nn.Dropout(0.1)
#         self.output_layer = nn.Linear(hidden_size, num_classes)

#     def forward(self, x):
#         x = self.fc(x)
#         x = self.relu(x)
#         x = self.dropout(x)
#         x = self.output_layer(x)
#         return x


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Load DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

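# Define the DistilBERT model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.config.output_hidden_states = False

# Freeze every existing parameter. Note that this also freezes `pre_classifier`,
# which the loading message reports as newly (randomly) initialized.
for param in model.parameters():
  param.requires_grad = False

# Replace the head with a fresh linear layer. New layers are trainable by default,
# so this is the only part of the model the optimizer will update.
model.classifier = nn.Linear(model.config.hidden_size,2)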

# # Define the loss function and optimizer
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.AdamW(model.parameters(),lr=learning_rate)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.

In [None]:
# Example usage:
input_size = 768  # Size of BERT's output
hidden_size = 256
num_classes = 2  # Binary classification

In [None]:
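# import random
# NOTE: the original cell zipped tot_sentences with `labels` (the un-augmented label
# list), so zip() silently truncated to 1,044,897 pairs and dropped every augmented
# sentence. The recorded outputs below come from that original run.
data = list(zip(tot_sentences, tot_labels))
random.shuffle(data)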

In [None]:
split_ratio = 0.8
split_index = int(len(data)*split_ratio)
train_data = data[:split_index]
valid_data = data[split_index:]
train_inputs,train_labels = zip(*train_data)
valid_inputs,valid_labels = zip(*valid_data)


In [None]:
len(train_data)

835917
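The recorded length, 835,917, equals `int(0.8 * 1044897)`, the size of the un-augmented training set, rather than `int(0.8 * 1690937)`. That is the pattern one would expect when `zip()` receives lists of different lengths, because it stops at the shorter one without raising an error. The sketch below shows the behavior on toy data and a hypothetical `paired` helper that refuses to truncate silently.

```python
def paired(sentences, labels):
    """Pair sentences with labels, raising instead of truncating on a length mismatch."""
    if len(sentences) != len(labels):
        raise ValueError(
            f"length mismatch: {len(sentences)} sentences vs {len(labels)} labels"
        )
    return list(zip(sentences, labels))


# Toy stand-ins: three sentences but only two labels.
toy_sentences = ["a", "b", "c"]
toy_labels = [0, 1]

# Plain zip() stops at the shorter list and raises nothing.
silently_truncated = list(zip(toy_sentences, toy_labels))

try:
    paired(toy_sentences, toy_labels)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Expected 80% split sizes without and with the augmented sentences.
split_without_aug = int(1044897 * 0.8)
split_with_aug = int(1690937 * 0.8)

print(len(silently_truncated), mismatch_detected)
print(split_without_aug, split_with_aug)
```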

In [None]:
# Tokenize and encode train inputs
train_input_ids = []
train_attention_masks = []
for input_text in train_inputs:
    encoded_inputs = tokenizer.encode_plus(
        input_text,
        add_special_tokens=True,
        max_length=128,  # Maximum length of input sequences
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    train_input_ids.append(encoded_inputs['input_ids'])
    train_attention_masks.append(encoded_inputs['attention_mask'])

train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_masks = torch.cat(train_attention_masks, dim=0)
train_labels = torch.tensor(train_labels)

# Tokenize and encode validation inputs
valid_input_ids = []
valid_attention_masks = []
for input_text in valid_inputs:
    encoded_inputs = tokenizer.encode_plus(
        input_text,
        add_special_tokens=True,
        max_length=128,  # Maximum length of input sequences
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    valid_input_ids.append(encoded_inputs['input_ids'])
    valid_attention_masks.append(encoded_inputs['attention_mask'])

valid_input_ids = torch.cat(valid_input_ids, dim=0)
valid_attention_masks = torch.cat(valid_attention_masks, dim=0)
valid_labels = torch.tensor(valid_labels)
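To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified pure-Python sketch of how a fixed-length `input_ids` row and its `attention_mask` relate. `pad_and_mask` is a hypothetical helper and the token ids are illustrative. Unlike this sketch, the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Clip or right-pad a list of token ids and build the matching attention mask."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask


# Illustrative token ids; real ids come from the tokenizer's vocabulary.
ids, mask = pad_and_mask([101, 2003, 2009, 102], max_length=6)
print(ids)
print(mask)

# A sequence longer than max_length is clipped and needs no padding.
long_ids, long_mask = pad_and_mask(list(range(1, 9)), max_length=6)
print(long_ids)
print(long_mask)
```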


In [None]:
torch.cuda.empty_cache()

In [None]:
# Define training parameters
batch_size = 200
num_epochs = 2
learning_rate = 0.001
# Define loss function and optimizer
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
# Create data loaders for batching
train_dataset = torch.utils.data.TensorDataset(train_input_ids, train_attention_masks, train_labels)
valid_dataset = torch.utils.data.TensorDataset(valid_input_ids, valid_attention_masks, valid_labels)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)

In [None]:
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(),lr=learning_rate)
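Since every pretrained parameter was frozen and only the replacement `nn.Linear(hidden_size, 2)` head was created afterwards, the optimizer effectively updates a single small layer. A back-of-the-envelope count of its parameters, assuming the hidden size of 768 noted in the `input_size` cell above:

```python
hidden_size = 768   # DistilBERT hidden size, as set in the input_size cell above
num_labels = 2      # binary classification

# A linear layer holds a weight matrix (in_features x out_features) plus a bias vector.
weight_params = hidden_size * num_labels
bias_params = num_labels
trainable_params = weight_params + bias_params

print(trainable_params)
```

With so few trainable parameters the model behaves like a linear classifier on top of fixed features, which is worth keeping in mind when reading the training losses.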

In [None]:
# model = Classifier(input_size, hidden_size, num_classes)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [None]:
# import tqdm
# num_epochs = 2

In [None]:
for epoch in range(num_epochs):
  model.train()
  running_loss = 0.0
  correct_predictions = 0

  for inputs,attention,labels in train_dataloader:
    inputs = inputs.to(device)
    attention = attention.to(device)
    labels = labels.to(device)

    optimizer.zero_grad()

    # index 0 of the model output holds the logits
    outputs = model(inputs,attention_mask=attention)[0]
    _, predicted_labels = torch.max(outputs, 1)
    correct_predictions += torch.sum(predicted_labels == labels).item()

    loss = criterion(outputs, labels)
    print(f"Training batch Loss: {loss.item():.4f}")
    loss.backward()
    optimizer.step()

    running_loss += loss.item() * inputs.size(0)

  epoch_loss = running_loss / len(train_dataloader.dataset)
  epoch_accuracy = correct_predictions / len(train_dataloader.dataset)

  print(f"Epoch {epoch+1}/{num_epochs} - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.4f}")

  # Evaluation
  model.eval()
  valid_loss = 0.0
  valid_correct_predictions = 0
  valid_seen = 0         # number of validation samples actually evaluated
  max_valid_batches = 1  # the original run evaluated one batch; set to None for full validation

  with torch.no_grad():
    for batch_idx,(inputs,attention,labels) in enumerate(valid_dataloader):
      if max_valid_batches is not None and batch_idx >= max_valid_batches:
        break
      inputs = inputs.to(device)
      attention = attention.to(device)
      labels = labels.to(device)

      outputs = model(inputs,attention_mask=attention)[0]
      _, predicted_labels = torch.max(outputs, 1)
      valid_correct_predictions += torch.sum(predicted_labels == labels).item()

      loss = criterion(outputs, labels)
      valid_loss += loss.item() * inputs.size(0)
      valid_seen += inputs.size(0)

      print(f"Validation batch Loss: {loss.item():.4f}")

  # Normalize by the samples actually evaluated, not by the full validation set size
  valid_loss /= valid_seen
  valid_accuracy = valid_correct_predictions / valid_seen

  print(f"Valid Loss: {valid_loss:.4f} - Valid Accuracy: {valid_accuracy:.4f}")

Streaming output truncated to the last 5000 lines.
Training batch Loss: 0.1118
Training batch Loss: 0.1743
Training batch Loss: 0.1514
Training batch Loss: 0.1310
Training batch Loss: 0.1019
Training batch Loss: 0.1925
Training batch Loss: 0.1896
Training batch Loss: 0.1819
Training batch Loss: 0.1429
Training batch Loss: 0.1673
Training batch Loss: 0.1574
Training batch Loss: 0.1549
Training batch Loss: 0.1503
Training batch Loss: 0.1724
Training batch Loss: 0.2207
Training batch Loss: 0.1444
Training batch Loss: 0.1569
Training batch Loss: 0.1275
Training batch Loss: 0.1337
Training batch Loss: 0.1370
Training batch Loss: 0.1238
Training batch Loss: 0.0968
Training batch Loss: 0.1408
Training batch Loss: 0.2106
Training batch Loss: 0.2236
Training batch Loss: 0.1744
Training batch Loss: 0.1018
Training batch Loss: 0.1301
Training batch Loss: 0.1549
Training batch Loss: 0.1410
Training batch Loss: 0.0896
Training batch Loss: 0.1493
Training batch Loss: 0.1732
Training ba
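The loop above reports loss and accuracy only. With roughly 6% positive questions in the raw training data, accuracy can look good even for a model that never predicts class 1, which is presumably why `f1_score` is imported at the top of the notebook. A dependency-free sketch of the F1 computation on toy labels (the labels are illustrative, not results from this run):

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 score for the positive class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: one positive out of ten, and a model that predicts all zeros.
y_true = [0] * 9 + [1]
all_zero = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, all_zero)) / len(y_true)
f1_all_zero = f1_from_labels(y_true, all_zero)

# A model that finds the positive but also raises one false alarm.
one_false_alarm = [0] * 8 + [1, 1]
f1_false_alarm = f1_from_labels(y_true, one_false_alarm)

print(accuracy, f1_all_zero, round(f1_false_alarm, 3))
```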

In [None]:
df_pred = pd.read_csv('test_dataset.csv')
df_pred.head(3)
print(df_pred.head(3),'\n',df_pred.shape)

                    qid                                      question_text
0  d5cacbea9be29bd47a78                               Is Minance any good?
1  5650c4a236fe3b555c31            Do computers have reserved key strokes?
2  b778db4f09f9326195ea  When was the last time that the US had such a ... 
 (261221, 2)


In [None]:
## Run cleaning_dataset function

In [None]:
# cleaning the questions column by lowering
df_pred_cleaned = cleaning_dataset(df_pred)
# df_pred_cleaned.drop(['qid'],axis=1,inplace=True)
df_pred_cleaned.head(2)


                    qid                            question_text
0  d5cacbea9be29bd47a78                     is minance any good?
1  5650c4a236fe3b555c31  do computers have reserved key strokes?


In [None]:
pred_sentences = df_pred_cleaned['question_text'].tolist()
# Tokenize and encode pred inputs
pred_input_ids = []
pred_attention_masks = []
for input_text in pred_sentences:
    encoded_inputs = tokenizer.encode_plus(
        input_text,
        add_special_tokens=True,
        max_length=128,  # Maximum length of input sequences
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    pred_input_ids.append(encoded_inputs['input_ids'])
    pred_attention_masks.append(encoded_inputs['attention_mask'])

pred_input_ids = torch.cat(pred_input_ids, dim=0)
pred_attention_masks = torch.cat(pred_attention_masks, dim=0)

In [None]:
pred_dataset = torch.utils.data.TensorDataset(pred_input_ids, pred_attention_masks)
# shuffle must be False here: predictions are written back to df_pred by row position
pred_data_loader = torch.utils.data.DataLoader(pred_dataset, batch_size=batch_size, shuffle=False)
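Predictions are later attached to `df_pred` by position, so the loader has to yield rows in file order, or the order has to be recorded and undone. The toy sketch below uses only the standard library, with made-up ids and scores, to show both the in-order case and how a shuffled order can be restored.

```python
import random

qids = ["q0", "q1", "q2", "q3", "q4", "q5"]   # hypothetical ids, in file order
row_scores = [0, 1, 0, 0, 1, 0]               # hypothetical per-row model outputs

# In-order iteration: position i of the output belongs to row i.
in_order = [row_scores[i] for i in range(len(qids))]

# Shuffled iteration: outputs come back in the shuffled order.
rng = random.Random(0)
order = list(range(len(qids)))
rng.shuffle(order)
shuffled_out = [row_scores[i] for i in order]

# Alignment is only recoverable if the shuffled order was recorded.
restored = [None] * len(qids)
for position, row_index in enumerate(order):
    restored[row_index] = shuffled_out[position]

print(in_order == row_scores, restored == row_scores)
```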

In [None]:
model.eval()
final_pred_class_value = []
with torch.no_grad():
  for batch_questions,batch_attn in pred_data_loader:
    print(batch_questions.shape)

    batch_questions = batch_questions.to(device)
    batch_attn = batch_attn.to(device)
    outputs = model(batch_questions,batch_attn)[0]
    _, pred_class_value = torch.max(outputs, dim=1)
    final_pred_class_value.extend(pred_class_value)
    print(len(final_pred_class_value))


torch.Size([200, 128])
200
torch.Size([200, 128])
400
torch.Size([200, 128])
600
torch.Size([200, 128])
800
torch.Size([200, 128])
1000
torch.Size([200, 128])
1200
torch.Size([200, 128])
1400
torch.Size([200, 128])
1600
torch.Size([200, 128])
1800
torch.Size([200, 128])
2000
torch.Size([200, 128])
2200
torch.Size([200, 128])
2400
torch.Size([200, 128])
2600
torch.Size([200, 128])
2800
torch.Size([200, 128])
3000
torch.Size([200, 128])
3200
torch.Size([200, 128])
3400
torch.Size([200, 128])
3600
torch.Size([200, 128])
3800
torch.Size([200, 128])
4000
torch.Size([200, 128])
4200
torch.Size([200, 128])
4400
torch.Size([200, 128])
4600
torch.Size([200, 128])
4800
torch.Size([200, 128])
5000
torch.Size([200, 128])
5200
torch.Size([200, 128])
5400
torch.Size([200, 128])
5600
torch.Size([200, 128])
5800
torch.Size([200, 128])
6000
torch.Size([200, 128])
6200
torch.Size([200, 128])
6400
torch.Size([200, 128])
6600
torch.Size([200, 128])
6800
torch.Size([200, 128])
7000
torch.Size([200, 128])
7

In [None]:
# print(final_pred_class_value)
print([x for x in final_pred_class_value if x == 1])
final_pred_class_value = [int(x) for x in final_pred_class_value]
final_pred_class_value[0:3]

[tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='cuda:0'), tensor(1, device='c

[0, 0, 0]

In [None]:
# print(df_pred.head(2))
df_pred['target'] = pd.Series(final_pred_class_value)
df_pred.tail(2)

                         qid                                      question_text  target
261219  4c6218c04aff5e60bebb                   what are the best fandom shirts?       0
261220  dae65fdd97e961ee7f02  how can i approach a bank to grant me access t...       0


In [None]:
df_pred[df_pred['target']==1]

                         qid                                      question_text  target
58      6261ab7856529e366f49                  which is best mobile under 30000?       1
73      683b2ffa45dd5f500581                          does oppo f1s support vr?       1
83      eaa62c7dd8495b59db88  how many degrees celsius is the visible part o...       1
104     b5898cf9fc477563ff43                       are vegetables bad for dogs?       1
110     efc4989139c790ac0b23        what std can be transmitted through saliva?       1
...                      ...                                                ...     ...
260829  e6b22d947cf1188e70b7  what is the best thing that you have ever done...       1
260870  96d3da7a93052bc9724f  why did marvel make the living tribunal much w...       1
260969  8beed858384dfc29294c         how do documentary film makers earn money?       1
261033  4620f66afaeffbcbeb07          what kinds of lawyer get the most income?       1


In [None]:
# index=False keeps the DataFrame index out of the file, leaving only qid and target
df_pred[['qid','target']].to_csv('Group4_Pred_Submission_distilbert_v80_lC.csv', index=False)
# df_pred.to_csv('Group4_Pred_Submission_distilbert_v80_lC.csv')
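By default `DataFrame.to_csv` also writes the index as an unnamed first column. Assuming the competition expects exactly the two columns `qid` and `target` (the contents of `sample_submission.csv` are not shown in this notebook), the file layout can be sketched with the standard `csv` module. The qids are copied from the test preview above and the target values are placeholders.

```python
import csv
import io

# qids copied from the test preview above; the targets are placeholder values.
predictions = [("d5cacbea9be29bd47a78", 0), ("5650c4a236fe3b555c31", 0)]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["qid", "target"])   # header row, no index column
writer.writerows(predictions)

submission_lines = buffer.getvalue().splitlines()
header = submission_lines[0]
print(header)
print(len(submission_lines) - 1, "prediction rows")
```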

In [None]:
from google.colab import files
# files.download('Group4_Pred_Submission_distilbert_v80_lC.csv')
files.download('Group4_Pred_Submission_distilbert_v80_lC.csv')


##   **Stage 2**: Data Pre-Processing  (1 Point)

####  Clean and Transform the data into a specified format


In [None]:
# YOUR CODE HERE

##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)



In [None]:
# YOUR CODE HERE

##   **Stage 4**: Build and Train the Deep networks model using Pytorch/Keras (5 Points)



In [None]:
# YOUR CODE HERE

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)








In [None]:
# YOUR CODE HERE