<a href="https://colab.research.google.com/github/VellummyilumVinoth/Toxic_Comment_Classification/blob/main/Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [29]:
import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


In [30]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [31]:
# Step 1: Install the required libraries
!pip install transformers

from transformers import DistilBertTokenizer, DistilBertModel, DistilBertForSequenceClassification


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [32]:
from torch import cuda
device = torch.device('cuda' if cuda.is_available() else 'cpu')

print(f"Current device: {device}")

Current device: cuda


In [33]:
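# Step 2: Load the DistilBERT model and the fine-tuned tokenizer
# NOTE: the model weights come from the base "distilbert-base-uncased" checkpoint, so the
# classification head is newly initialized (see the warning in the output). To predict with
# fine-tuned weights, pass the fine-tuned checkpoint directory to from_pretrained() instead.

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=6)
tokenizer = DistilBertTokenizer.from_pretrained('/content/drive/MyDrive/finetuned_distilbert')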

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classi
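
The warning above says the classifier weights are newly initialized, which is what happens when the base checkpoint is loaded rather than a fine-tuned one. A minimal sketch of a pre-flight check follows. It assumes the fine-tuned weights were saved with `save_pretrained()` into the same Drive folder as the tokenizer, which the notebook does not show. The helper name `has_model_checkpoint` is hypothetical, and sharded checkpoints are not handled.

```python
from pathlib import Path

# Weight file names written by save_pretrained() for an unsharded model.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def has_model_checkpoint(directory):
    """Return True if `directory` contains a config.json and a weights file."""
    path = Path(directory)
    has_config = (path / "config.json").is_file()
    has_weights = any((path / name).is_file() for name in WEIGHT_FILES)
    return has_config and has_weights


# Hypothetical use in this notebook (path copied from the tokenizer call, not run here):
# finetuned_dir = "/content/drive/MyDrive/finetuned_distilbert"
# if has_model_checkpoint(finetuned_dir):
#     model = DistilBertForSequenceClassification.from_pretrained(finetuned_dir)
# else:
#     print("No model weights found in", finetuned_dir)
```

If the check fails, the folder probably holds only tokenizer files, and the fine-tuned weights would have to be exported from the training notebook first.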

In [34]:
# Step 3: Load and preprocess your CSV dataset

import pandas as pd
predict_data = pd.read_csv("/content/drive/MyDrive/preprocessed_Reddit_Data_1.csv")


In [35]:
predict_data

Unnamed: 0.1,Unnamed: 0,Title,Author,ID,clean_title
0,0,UkrainianConflict Discussion Megathread,humanlikecorvus,y7gz80,"['ukrainianconflict', 'discussion', 'megathread']"
1,1,Zelenskyy survives over 12 assassination attem...,Far-Childhood9338,10e17wq,"['zelenskyy', 'survives', 'assassination', 'at..."
2,2,In the first round of presidential elections i...,RevealDisinfo,10digs3,"['first', 'round', 'presidential', 'elections'..."
3,3,"A further 20,000 Ukrainian recruits will be tr...",tedwja,10dv085,"['ukrainian', 'recruits', 'trained', 'uk', 'ye..."
4,4,"Zelensky: ""Tanks, APCs and artillery are exact...",zizp,10duei9,"['zelensky', 'tanks', 'apcs', 'artillery', 'ex..."
...,...,...,...,...,...
971,971,Hundreds of US military vehicles arrive in Dut...,Standard_Spaniard,109fbyj,"['hundreds', 'us', 'military', 'vehicles', 'ar..."
972,972,BREAKING: Poland will deliver a company of Leo...,rulepanic,1096adh,"['breaking', 'poland', 'deliver', 'company', '..."
973,973,Russian airline aircraft suffer massive breakd...,Breech_Loader,109envz,"['russian', 'airline', 'aircraft', 'suffer', '..."
974,974,"The Russian Federation declared that it ""has t...",RevealDisinfo,1095iye,"['russian', 'federation', 'declared', 'right',..."


In [36]:
# Step 4: Keep only the ID and Title columns and rename Title to comment_text
columns_to_keep = ['ID', 'Title']
# rename() without inplace=True returns a new DataFrame, which avoids SettingWithCopyWarning
predict_data = predict_data[columns_to_keep].rename(columns={"Title": "comment_text"})
titles = predict_data['comment_text'].tolist()


In [37]:
titles

['UkrainianConflict Discussion Megathread',
 'Zelenskyy survives over 12 assassination attempts since start of full-scale invasion',
 'In the first round of presidential elections in the Czech Republic, retired general Petr Pavel won. He advocates the declaration of a no-fly zone or the introduction of NATO troops into Ukraine to protect humanitarian corridors.',
 'A further 20,000 Ukrainian recruits will be trained in the UK this year.',
 'Zelensky: "Tanks, APCs and artillery are exactly what Ukraine needs to restore its territorial integrity. Thank you @RishiSunak, thank you @BWallaceMP, thank you British people for this powerful contribution to our common victory over tyranny."',
 'Should Canada send Leopard battle tanks to Ukraine? â\x80\x98Not there yet,â\x80\x99 says Trudeau - National | Globalnews.ca',
 'â\x9a¡ï¸\x8fUkraine will also receive 100 units of armored vehicles from the UK, including FV432 Mk.3 "Bulldog" armored personnel carriers, dozens of drones, 100 "advanced missi
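
Some of the titles printed above contain sequences such as `â\x80\x98`, the typical result of UTF-8 text being decoded as Latin-1 somewhere upstream. A minimal sketch of a repair step is shown here. It assumes each affected string was mis-decoded as a whole, and the helper name `fix_mojibake` is hypothetical. Strings that do not fit this pattern are returned unchanged.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not this kind of mojibake (or already clean): leave the text as it is
        return text


# "\xe2" is the character shown as "â" in the printed titles
sample = "\xe2\x80\x98Not there yet,\xe2\x80\x99 says Trudeau"
fixed = fix_mojibake(sample)  # "\u2018Not there yet,\u2019 says Trudeau" (curly quotes restored)
```

In the notebook this could be applied before tokenization, for example `titles = [fix_mojibake(t) for t in titles]`.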

In [38]:
# Step 5: Tokenize the titles
encoded_inputs = tokenizer(titles, padding=True, truncation=True, return_tensors='pt')

# Step 6: Pass the tokenized inputs through the model to get the raw logits
model.eval()  # disable dropout for inference
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**encoded_inputs)
    predicted_labels = outputs.logits  # raw logits: one row per title, one column per class
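
The cell above sends every title through the model in a single call, which can exhaust memory on larger inputs. A minimal sketch of splitting the list into batches follows. The helper name `iter_batches` and the batch size of 32 are illustrative assumptions, not part of the notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical use with the notebook's objects (not run here):
# all_logits = []
# model.eval()
# with torch.no_grad():
#     for batch in iter_batches(titles, 32):
#         enc = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
#         all_logits.append(model(**enc).logits)
# predicted_labels = torch.cat(all_logits)
```

Padding is then applied per batch, so short titles are no longer padded to the length of the longest title in the whole dataset.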



In [39]:

# Step 7: Class names corresponding to the model's six output columns
class_names = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Step 8: Convert the logits to probabilities and apply a threshold to get binary values
threshold = 0.5  # Adjust the threshold as needed

predicted_labels = (torch.sigmoid(predicted_labels) > threshold).to(torch.int).tolist()

# Create a new DataFrame with the predicted labels
label_df = pd.DataFrame(predicted_labels, columns=class_names)

# Concatenate the original DataFrame with the label DataFrame
result_df = pd.concat([predict_data, label_df], axis=1)

# Save the result DataFrame as a new CSV file
result_df.to_csv('prediction.csv', index=False)
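
The thresholding step treats each class independently: every logit is passed through a sigmoid and compared with the threshold, so one title can receive several labels or none. A small pure-Python sketch of the same arithmetic is shown below. The logit values are made up for illustration, and the helper names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def to_binary(logits, threshold=0.5):
    """Return 1 for each logit whose sigmoid exceeds the threshold, else 0."""
    return [int(sigmoid(x) > threshold) for x in logits]


# Made-up logits for one title, one value per class
example_logits = [0.3, -1.2, 2.0, -0.1, 0.0, -3.0]
print(to_binary(example_logits))  # [1, 0, 1, 0, 0, 0]
```

With a threshold of 0.5 the test is equivalent to checking whether the logit is greater than zero, so small shifts in the logits around zero are enough to flip a label.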

In [40]:
result_df

Unnamed: 0,ID,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,y7gz80,UkrainianConflict Discussion Megathread,1,0,1,1,1,0
1,10e17wq,Zelenskyy survives over 12 assassination attem...,0,0,1,1,1,0
2,10digs3,In the first round of presidential elections i...,0,0,1,1,1,0
3,10dv085,"A further 20,000 Ukrainian recruits will be tr...",0,0,1,1,1,0
4,10duei9,"Zelensky: ""Tanks, APCs and artillery are exact...",0,0,1,1,1,0
...,...,...,...,...,...,...,...,...
971,109fbyj,Hundreds of US military vehicles arrive in Dut...,0,0,1,1,1,0
972,1096adh,BREAKING: Poland will deliver a company of Leo...,0,0,1,1,1,0
973,109envz,Russian airline aircraft suffer massive breakd...,0,0,1,1,1,0
974,1095iye,"The Russian Federation declared that it ""has t...",0,0,1,1,1,0
