### Author info : Vaishnav Krishna P
#### Dataset URL : https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge?select=train.csv
- The dataset is taken from Kaggle.

In [147]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

In [220]:
# importing the dataset (only the first 100 rows are kept)
dataset = pd.read_csv('../Dataset/train.csv')[:100]

In [221]:
# first 5 records 
dataset.head()

|   | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [222]:
# shape of the dataset 
dataset.shape

(100, 8)

In [223]:
# description of the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             100 non-null    object
 1   comment_text   100 non-null    object
 2   toxic          100 non-null    int64 
 3   severe_toxic   100 non-null    int64 
 4   obscene        100 non-null    int64 
 5   threat         100 non-null    int64 
 6   insult         100 non-null    int64 
 7   identity_hate  100 non-null    int64 
dtypes: int64(6), object(2)
memory usage: 6.4+ KB


In [224]:
# Let's print some of the texts
for i in range(5,10):
    print(f"\nText{i}: {dataset.iloc[i,1]}\n")


Text5: "

Congratulations from me as well, use the tools well.  · talk "


Text6: COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK


Text7: Your vandalism to the Matt Shirvington article has been reverted.  Please don't do it again, or you will be banned.


Text8: Sorry if the word 'nonsense' was offensive to you. Anyway, I'm not intending to write anything in the article(wow they would jump on me for vandalism), I'm merely requesting that it be more encyclopedic so one can use it for school as a reference. I have been to the selective breeding page but it's almost a stub. It points to 'animal breeding' which is a short messy article that gives you no info. There must be someone around with expertise in eugenics? 93.161.107.169


Text9: alignment on this subject and which are contrary to those of DuLithgow



### TEXT CLEANING
1. Lowercasing
2. Removing numbers, punctuation and special characters
3. Tokenization
4. Spelling correction
5. Stop-word removal
6. Lemmatization

In [226]:
# Lowering the case 
text = "This is a devil's House"

# lower case 
text.lower()

"this is a devil's house"

In [232]:
# Removal of numbers and special characters
text = "12345hello.:'? this ()&45 May."

import re
text = re.sub(r'[^a-z\s]', "",text.lower())
print(text)

hello this  may
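
The output above still contains a double space where `()&45` was removed. One possible follow-up, sketched here as an assumption rather than as part of the original notebook, is to collapse runs of whitespace after stripping the unwanted characters. The helper name `strip_non_letters` is hypothetical.

```python
import re

def strip_non_letters(text: str) -> str:
    """Lowercase, keep only a-z and whitespace, then tidy the spacing."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    # Collapse whitespace runs left behind by the removed characters.
    return re.sub(r"\s+", " ", text).strip()

print(strip_non_letters("12345hello.:'? this ()&45 May."))
```

When the text is tokenized and re-joined with single spaces afterwards, as `text_cleaning` does further down, the extra spaces disappear anyway, so this step mainly matters when the regex is used on its own.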


In [234]:
# Auto spelling correction
from autocorrect import Speller 

speller = Speller(lang='en')

In [237]:
speller('misleding') # correct : misleading 

'misleading'

In [239]:
# Word tokenization
from nltk.tokenize import word_tokenize

text = "this is a final day of the collage"
tokenized_words = word_tokenize(text)

In [241]:
tokenized_words # tokenized words

['this', 'is', 'a', 'final', 'day', 'of', 'the', 'collage']

In [242]:
# removing the stop words
from nltk.corpus import stopwords

In [244]:
# stopwords 
stop_words = stopwords.words("english")

In [247]:
stop_words[:10] # sample stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [249]:
# function for preprocessing the text
import re
from nltk.corpus import stopwords
from autocorrect import Speller
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# speller object
speller = Speller(lang='en')

# lemmatization object
lemmatizer = WordNetLemmatizer()

# build the stop-word set once instead of reloading the list for every token
stop_word_set = set(stopwords.words("english"))

def text_cleaning(text):
    # converting to lower case
    text = text.lower()

    # removing numbers, punctuation marks and all other irrelevant symbols
    text = re.sub(r"[^a-z\s]", "", text)

    # applying the word tokenization
    word_tokens = word_tokenize(text)

    # applying the spelling correction
    word_tokens = [speller(word) for word in word_tokens]

    # removing the stop words
    clean_tokens = [word for word in word_tokens if word not in stop_word_set]

    # applying the lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in clean_tokens]

    return " ".join(lemmatized_tokens)

In [251]:
# Calling the function 
text_cleaning("please call me @ 854345678")

'please call'
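
Because punctuation is stripped before stop-word removal, a contraction such as `don't` becomes `dont`, which no longer matches the `don't` entry in the NLTK English stop-word list and therefore survives cleaning. The sketch below illustrates the effect of the ordering. It makes two simplifying assumptions so that it runs without extra packages: it uses a tiny hand-written stop-word set (not the NLTK list) and plain whitespace splitting (not `word_tokenize`). The `clean` helper and its `stop_first` flag are hypothetical.

```python
import re

# Tiny illustrative stop-word set; this is NOT the NLTK list.
STOP_WORDS = {"me", "the", "do", "not", "don't"}

def clean(text: str, stop_first: bool) -> str:
    tokens = text.lower().split()
    if stop_first:
        # Filter while apostrophes are still present, then strip symbols.
        tokens = [t for t in tokens if t not in STOP_WORDS]
        tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]
    else:
        # Strip symbols first (the order used in text_cleaning), then filter.
        tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # Drop tokens that became empty after stripping.
    return " ".join(t for t in tokens if t)

print(clean("Please don't remove the template", stop_first=False))
print(clean("Please don't remove the template", stop_first=True))
```

Whether the contraction should be kept is a modelling choice. For toxicity classification, negations may carry signal, so the original ordering is not necessarily wrong.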

In [253]:
dataset["clean_text"] = dataset['comment_text'].apply(text_cleaning)
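
`text_cleaning` calls the speller once per token, so a word that occurs many times is corrected again on every occurrence. If that turns out to be a bottleneck (an assumption; the notebook reports no timings), one option is to memoise the correction step. The sketch below uses a stand-in `slow_correct` function instead of `Speller` so that it runs without extra packages; both `slow_correct` and `cached_correct` are hypothetical names.

```python
from functools import lru_cache

calls = 0  # counts how often the underlying corrector really runs

def slow_correct(word: str) -> str:
    """Stand-in for an expensive corrector such as Speller(lang='en')."""
    global calls
    calls += 1
    return {"misleding": "misleading"}.get(word, word)

@lru_cache(maxsize=None)
def cached_correct(word: str) -> str:
    # Each distinct word reaches slow_correct only once.
    return slow_correct(word)

tokens = ["misleding", "text", "misleding", "text", "text"]
corrected = [cached_correct(t) for t in tokens]
print(corrected, calls)
```

With the real speller, the body of `cached_correct` would call the notebook's `speller` object instead of `slow_correct`.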

In [260]:
for i in range(10):
    print(f"\nOriginal text : {dataset['comment_text'].iloc[i]}\n\nClean Text: {dataset['clean_text'].iloc[i]}\n")


Original text : Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27

Clean Text: explanation edits made username hardcore metallica fan reverted vandalism closure gas voted new york doll fac please dont remove template talk page since im retired


Original text : D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)

Clean Text: dawn match background colour im seemingly stuck thanks talk january utc


Original text : Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.

Clean Text: hey man im really trying edit war guy constantly removing rele