#### AUTHOR : VAISHNAV KRISHNA P
##### TITLE : TOXIC COMMENT CLASSIFICATION
##### DATASET LINK : https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge

#### LOADING NECESSARY DEPENDENCIES

In [56]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# NLTK-related dependencies
import re
import nltk
from nltk import download
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer,PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [58]:
# Loading the dataset
train_df = pd.read_csv('train.csv')

In [59]:
train_df.head()

,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


#### LABEL THE TEXT

In [60]:
# default label: mark every row as clean (0)
train_df['label'] = 0

# labels for the text
labels = ['toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate']


# Update label to 1 if any of the label columns is 1
train_df['label'] = train_df[labels].max(axis=1)
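
The row-wise maximum works here because every label column holds only 0 or 1, so the maximum is 1 exactly when at least one flag is set. A minimal sketch on a made-up frame (`toy_df` and `toy_labels` are illustrative names, not part of the dataset) shows this, together with an equivalent `any()` formulation:

```python
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "toxic":  [0, 1, 0],
    "insult": [0, 0, 1],
    "threat": [0, 0, 0],
})
toy_labels = ["toxic", "insult", "threat"]

# Row-wise maximum over 0/1 columns: 1 if any flag is set, else 0.
toy_df["label"] = toy_df[toy_labels].max(axis=1)

# Equivalent formulation using any(), cast back to int.
toy_df["label_any"] = toy_df[toy_labels].any(axis=1).astype(int)

print(toy_df[["label", "label_any"]])
```

Both columns agree row by row, so either form could be used for the binary label.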

In [62]:
train_df['label'].value_counts()

label,count
0,143346
1,16225


#### APPLYING DOWNSAMPLING TO REDUCE CLASS IMBALANCE

In [61]:
# the dataset is dominated by clean text, so we keep 32K clean comments and all 16.2K toxic comments for our model
clean_text = train_df[train_df['label'] == 0].sample(n=32000)
toxic_text = train_df[train_df['label'] == 1]
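
Without a seed, `sample` draws a different subset of clean comments on every run, so later results are not exactly repeatable. A minimal sketch, assuming a small made-up frame (`demo_df` is a hypothetical stand-in for `train_df`) and an arbitrary seed of 42, shows how `random_state` pins the draw:

```python
import pandas as pd

# Hypothetical stand-in for train_df: 8 clean rows and 2 toxic rows.
demo_df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "label": [0] * 8 + [1] * 2,
})
clean_rows = demo_df[demo_df["label"] == 0]

# The same random_state yields the same rows on every call.
first_draw = clean_rows.sample(n=4, random_state=42)
second_draw = clean_rows.sample(n=4, random_state=42)

print(first_draw.index.tolist() == second_draw.index.tolist())
```

The seed value itself is arbitrary; what matters is that it stays fixed between runs.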

In [63]:
# first 10 clean text
for i in range(10):
  print(f"Clean Text{i}: {clean_text['comment_text'].values[i]}\n")

Clean Text0: "Jason Smith was not the captain we're looking for.  Don't have time today to create a proper article for ""our"" Jason Smith, and not sure how to make it not link to the actor.   15:29, 3 Apr 2005 (UTC)"

Clean Text1: YES YES. finally someone tells the sheeple. NK has beeen moving against the new world order that the east coast usa intelligentsia tried to push w the help of its jews alies

Clean Text2: Incest laws are purely motivated by eugenic principles, so quite the opposite is true.
The genius project was a success since it broke the taboo for other sperm banks with most now having eugenic principles to not accept donors who have genetic diseases running in the family.
Unless you want to steer the article further down the path of intellectual dishonesty I'd say this article has an anti-eugenics stance.

Clean Text3: So although YOU are 'well aware' that Vampires versus Zombies 'sucked mightily', you consider it 'vandalisation' to use the word 'shoddy' to describe it 

In [51]:
# first 10 toxic text
for i in range(10):
  print(f"Toxic Text{i}: {toxic_text['comment_text'].values[i]}\n")

Toxic Text0: COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK

Toxic Text1: Hey... what is it..
@ | talk .
What is it... an exclusive group of some WP TALIBANS...who are good at destroying, self-appointed purist who GANG UP any one who asks them questions abt their ANTI-SOCIAL and DESTRUCTIVE (non)-contribution at WP?


Toxic Text2: Bye! 

Don't look, come or think of comming back! Tosser.

Toxic Text3: You are gay or antisemmitian? 

Archangel WHite Tiger

Meow! Greetingshhh!

Uh, there are two ways, why you do erased my comment about WW2, that holocaust was brutally slaying of Jews and not gays/Gypsys/Slavs/anyone...

1 - If you are anti-semitian, than shave your head bald and go to the skinhead meetings!

2 - If you doubt words of the Bible, that homosexuality is a deadly sin, make a pentagram tatoo on your forehead go to the satanistic masses with your gay pals!


Beware of the Dark Side!

Toxic Text4: FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!

Toxic Text5: I'm Sorry 

I'm sorry I screw

#### CONCATENATING DATAFRAMES

In [64]:
# concatenating the clean text & toxic text
training_df = pd.concat([clean_text, toxic_text], axis=0)

In [65]:
training_df['label'].value_counts() # successfully concatenated

label,count
0,32000
1,16225


#### TEXT CLEANING
1. Lowercasing
2. Removing hyperlinks
3. Removing numbers, punctuation and special characters
4. Tokenization
5. Stop word removal
6. Stemming (Porter stemmer)

In [66]:
# function for preprocessing the text

# downloading the tokenizer models and the stop word list
download('punkt_tab')
download('stopwords')

# stemmer object
stemmer = PorterStemmer()

# build the stop word set once; a set lookup is much faster than reloading the list for every token
stop_words = set(stopwords.words("english"))

def text_cleaning(text):
    # converting to lower case
    text = text.lower()

    # removing hyperlinks
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)

    # removing numbers, punctuation marks and all other irrelevant symbols
    text = re.sub(r"[^a-z\s]", "", text)

    # applying word tokenization
    word_tokens = word_tokenize(text)

    # removing the stop words
    clean_tokens = [word for word in word_tokens if word not in stop_words]

    # applying stemming (Porter stemmer)
    stemmed_tokens = [stemmer.stem(word) for word in clean_tokens]

    return " ".join(stemmed_tokens)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [67]:
# Cleaning the text
training_df["clean_text"] = training_df['comment_text'].apply(text_cleaning)
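
To see what the regex steps do before stop word removal and stemming, here is a minimal standard-library sketch. `basic_clean` is a hypothetical helper, not the notebook's `text_cleaning`; as a simplifying assumption it splits on whitespace instead of calling NLTK's `word_tokenize`.

```python
import re

def basic_clean(text):
    """Regex-only part of the cleaning: lowercase, drop links, keep letters."""
    text = text.lower()
    # drop anything that starts like a link
    text = re.sub(r"http\S+|www\S+", "", text)
    # keep only lowercase letters and whitespace
    text = re.sub(r"[^a-z\s]", "", text)
    # split() with no arguments also collapses repeated whitespace
    return text.split()

sample = "Check http://example.com NOW!!! 123"
print(basic_clean(sample))  # ['check', 'now']
```

The link, the punctuation and the digits all disappear, leaving only the two lowercase words.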

In [130]:
# independent (X) and dependent (y) features
X = training_df['clean_text'].values
y = training_df['label'].values

In [131]:
# Tf-Idf vectorizer
vectorizer = TfidfVectorizer(max_features=10000)

X = vectorizer.fit_transform(X)

In [132]:
# training and testing split
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=42)
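
Note that the vectorizer above is fitted on the full corpus before the split, so the IDF weights and vocabulary also reflect the test comments. One way to avoid this, sketched here on a tiny made-up corpus (`texts` and `targets` are illustrative, not the notebook's data), is to split the raw text first and let a scikit-learn pipeline fit TF-IDF on the training part only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for training_df['clean_text'] and the labels.
texts = ["thank help", "great articl", "nice edit", "good sourc",
         "stupid idiot", "hate idiot", "stupid moron", "shut moron"]
targets = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, so the test comments stay unseen by the vectorizer.
text_train, text_test, t_train, t_test = train_test_split(
    texts, targets, test_size=0.25, random_state=42, stratify=targets
)

# The pipeline fits TF-IDF on the training fold only, then the classifier.
pipeline = make_pipeline(TfidfVectorizer(max_features=10000), LogisticRegression())
pipeline.fit(text_train, t_train)

print(pipeline.score(text_test, t_test))
```

Whether this changes the reported accuracy on the real data is not something this sketch can show; it only illustrates the ordering of the steps.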

In [133]:
# Training using a simple logistic regression
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train,y_train)

In [134]:
# predictions
predictions = model.predict(X_test)

In [135]:
# Model evaluations
from sklearn.metrics import accuracy_score,confusion_matrix

accuracy_score(y_test,predictions)

0.9054979579013509
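
Accuracy alone can hide how the errors are distributed: with 32000 clean and 16225 toxic comments, a model that always predicts "clean" would already be right about two thirds of the time. The imported `confusion_matrix` can break the errors down by class. A minimal sketch on made-up labels (`demo_true` and `demo_pred` are illustrative; in the notebook they would be `y_test` and `predictions`):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration only.
demo_true = [0, 0, 1, 1, 1, 0]
demo_pred = [0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(demo_true, demo_pred)
print(cm)

# Per-class precision, recall and F1.
print(classification_report(demo_true, demo_pred, target_names=["clean", "toxic"]))
```

For toxicity detection the false negatives (toxic comments predicted as clean) are usually the cell worth watching.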

In [153]:
text = "You are a worst person"

# apply the same cleaning used for training before vectorizing
text_transform = vectorizer.transform([text_cleaning(text)])
y_label = model.predict(text_transform)[0]

if y_label == 0:
  print("Model Predicted as: Clean Text")
else:
  print("Model Predicted as: Toxic Text")

Model Predicted as: Toxic Text
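
For repeated checks it may be convenient to wrap the prediction in a function that also reports the model's probability for the toxic class. The sketch below is one possible shape, not the notebook's code: `classify_comment`, the 0.5 threshold and the tiny demo corpus are all assumptions made so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_comment(text, vectorizer, model, cleaner=None):
    """Return (label, probability of the toxic class) for one comment."""
    if cleaner is not None:
        text = cleaner(text)
    features = vectorizer.transform([text])
    # Column 1 of predict_proba corresponds to class 1 (toxic).
    toxic_prob = float(model.predict_proba(features)[0, 1])
    label = "Toxic Text" if toxic_prob >= 0.5 else "Clean Text"
    return label, toxic_prob

# Tiny made-up corpus so the sketch is self-contained.
demo_texts = ["thank help", "nice edit", "stupid idiot", "hate idiot"]
demo_labels = [0, 0, 1, 1]
demo_vectorizer = TfidfVectorizer().fit(demo_texts)
demo_model = LogisticRegression().fit(demo_vectorizer.transform(demo_texts), demo_labels)

label, prob = classify_comment("Stupid idiot", demo_vectorizer, demo_model, cleaner=str.lower)
print(label, round(prob, 3))
```

In the notebook, the call would pass the fitted `vectorizer`, `model` and `text_cleaning` instead of the demo objects.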
