# Data Cleaning: Exploratory Analysis

<hr>

## To Do

-  Remove punctuation.
-  Remove stop words.
-  Normalize (lowercase) data.
-  Remove non-letter characters (numbers, symbols, emojis, etc.).
-  Remove URLs.
-  Remove hashtags.
-  Remove mentions and usernames.
-  Remove HTML tags.
-  Implement spell correction.
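
The steps above (except spell correction) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the project's `TextDataCleaner`: the function name `clean_text`, the regex constants, and the tiny `STOP_WORDS` set are all hypothetical. It also assumes English text, so any letter outside `a-z` is dropped along with digits and emojis, and it removes a hashtag together with its word, which may not be what the final cleaner should do.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "this", "to"})

# Hypothetical patterns, one per to-do item.
URL_RE = re.compile(r"https?://\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_text(text: str) -> str:
    """Apply the to-do steps in an order that keeps the later patterns simple."""
    text = URL_RE.sub(" ", text)        # URLs first, while "://" is still intact
    text = HTML_TAG_RE.sub(" ", text)   # then HTML tags
    text = MENTION_RE.sub(" ", text)    # then @mentions
    text = HASHTAG_RE.sub(" ", text)    # then #hashtags (drops the whole tag)
    text = text.lower()                 # normalize case
    text = NON_LETTER_RE.sub("", text)  # digits, punctuation, symbols, emojis
    # Splitting on whitespace also collapses the gaps left by the removals.
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)
```

For instance, `clean_text("Check <b>THIS</b> out @bob #wow https://example.com/x?y=1 It's 100% great!!!")` returns `"check out its great"`.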

In [1]:
%load_ext watermark
%watermark -v -p numpy,pandas,polars,torch,lightning --conda

# OR
from watermark import watermark


print(watermark(packages="polars,scikit-learn,torch,lightning", python=True))

Python implementation: CPython
Python version       : 3.10.8
IPython version      : 8.23.0

numpy    : 1.26.4
pandas   : 2.2.2
polars   : 0.20.21
torch    : 2.2.2
lightning: 2.2.2

conda environment: n/a

Python implementation: CPython
Python version       : 3.10.8
IPython version      : 8.23.0

polars      : 0.20.21
scikit-learn: 1.4.2
torch       : 2.2.2
lightning   : 2.2.2



In [2]:
# Standard library
from pathlib import Path
import re
import json
from typing import Any, Optional, Union
import logging
from pprint import pprint
import warnings

# Third-party imports
import numpy as np
import numpy.typing as npt
import pandas as pd
import polars as pl
from rich.console import Console
from rich.theme import Theme

custom_theme = Theme(
    {
        "info": "#76FF7B",
        "warning": "#FBDDFE",
        "error": "#FF0000",
    }
)
console = Console(theme=custom_theme)

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")


# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

In [3]:
from toxic_classifier.utilities.dataloaders import (
    CyberBullyDataLoader,
    GHCDataLoader,
    ToxicCommentsDataLoader,
)

In [4]:
# Train data
path: str = "../data/ghc_data/**/*.tsv"


ghc_dataloader: GHCDataLoader = GHCDataLoader(path=path, stratify=True)

X_train, X_test, y_train, y_test = ghc_dataloader.prepare_data()
X_train.head()

text,dataset
str,str
"""You think I'm …","""gabe_hate_corp…"
"""That's Joseph …","""gabe_hate_corp…"
"""U didn’t know …","""gabe_hate_corp…"
"""#RareGabby""","""gabe_hate_corp…"
"""DOG DOXXED Thi…","""gabe_hate_corp…"


In [53]:
X_train.to_pandas()

Unnamed: 0,text,dataset
0,"You think I'm suppose to blame them? I blame my government for allowing it to happen. If I was in their shoes and saw big opportunity, I'd be swimming the Rio Grande too.",gabe_hate_corpus
1,That's Joseph Goebbels.,gabe_hate_corpus
2,"U didn’t know area (Douma) u just said Assad made no headway, the rebels (US) kept civilians captive there 3 years. US moderate rebels defeated n Douma day of gas attack on social media (fact). So 1. Assad gassed Syrians to celebrate 2. Assad gas Douma to hit the rebels, accidentally hit Syrians 3. Rebels gas Syrians to frame Assad",gabe_hate_corpus
3,#RareGabby,gabe_hate_corpus
4,DOG DOXXED This will be the Fate of you and all the world's Jew and Israel haters SHORTLY die Dog die https://www.youtube.com/watch?v=YcR9k8o4I0w,gabe_hate_corpus
...,...,...
22031,Were you expecting striptease going on in the corner?,gabe_hate_corpus
22032,Great tips. Never of lasagna gardening before.,gabe_hate_corpus
22033,This is my first #painting that I fully completed and am actually really proud of! It's based on a Skyrim screenshot (and Skyrim is my favorite game). I'm seriously so glad with how this one turned out!!!,gabe_hate_corpus
22034,sanders is a fucking menshevik,gabe_hate_corpus


In [95]:
X_train.with_columns(
    cleaned_text=pl.col("text").str.replace_all(pattern=r"https?://\S{1,150}", value="")
).to_pandas()

Unnamed: 0,text,dataset,cleaned_text
0,"You think I'm suppose to blame them? I blame my government for allowing it to happen. If I was in their shoes and saw big opportunity, I'd be swimming the Rio Grande too.",gabe_hate_corpus,"You think I'm suppose to blame them? I blame my government for allowing it to happen. If I was in their shoes and saw big opportunity, I'd be swimming the Rio Grande too."
1,That's Joseph Goebbels.,gabe_hate_corpus,That's Joseph Goebbels.
2,"U didn’t know area (Douma) u just said Assad made no headway, the rebels (US) kept civilians captive there 3 years. US moderate rebels defeated n Douma day of gas attack on social media (fact). So 1. Assad gassed Syrians to celebrate 2. Assad gas Douma to hit the rebels, accidentally hit Syrians 3. Rebels gas Syrians to frame Assad",gabe_hate_corpus,"U didn’t know area (Douma) u just said Assad made no headway, the rebels (US) kept civilians captive there 3 years. US moderate rebels defeated n Douma day of gas attack on social media (fact). So 1. Assad gassed Syrians to celebrate 2. Assad gas Douma to hit the rebels, accidentally hit Syrians 3. Rebels gas Syrians to frame Assad"
3,#RareGabby,gabe_hate_corpus,#RareGabby
4,DOG DOXXED This will be the Fate of you and all the world's Jew and Israel haters SHORTLY die Dog die https://www.youtube.com/watch?v=YcR9k8o4I0w,gabe_hate_corpus,DOG DOXXED This will be the Fate of you and all the world's Jew and Israel haters SHORTLY die Dog die
...,...,...,...
22031,Were you expecting striptease going on in the corner?,gabe_hate_corpus,Were you expecting striptease going on in the corner?
22032,Great tips. Never of lasagna gardening before.,gabe_hate_corpus,Great tips. Never of lasagna gardening before.
22033,This is my first #painting that I fully completed and am actually really proud of! It's based on a Skyrim screenshot (and Skyrim is my favorite game). I'm seriously so glad with how this one turned out!!!,gabe_hate_corpus,This is my first #painting that I fully completed and am actually really proud of! It's based on a Skyrim screenshot (and Skyrim is my favorite game). I'm seriously so glad with how this one turned out!!!
22034,sanders is a fucking menshevik,gabe_hate_corpus,sanders is a fucking menshevik


In [99]:
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

# lookup suggestions for multi-word input strings (supports compound
# splitting & merging)
input_term = (
    "whereis th elove hehad dated forImuch of thepast who "
    "couqdn'tread in sixtgrade and ins pired him"
)
# max edit distance per lookup (per single word, not per whole input string)
suggestions = sym_spell.lookup_compound(input_term, max_edit_distance=2)
# display suggestion term, edit distance, and term frequency
for suggestion in suggestions:
    print(suggestion)

where is the love he had dated for much of the past who couldn't read in six grade and inspired him, 9, 0
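
To make the `max_edit_distance=2` setting concrete, here is a minimal sketch of a per-word edit distance. It is plain Levenshtein distance written for illustration; the function name `levenshtein` is hypothetical, and symspellpy's own distance measure may differ (for example, by also counting transpositions of adjacent characters as a single edit).

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]
```

Under this measure `"couqdn't"` is one edit away from `"couldn't"`, so it falls within a limit of 2, while a token three or more edits away from every dictionary word would not be matched at that limit.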


In [137]:
import pkg_resources
from symspellpy import SymSpell


class SpellChecker:
    def __init__(
        self, max_dictionary_edit_distance: int = 2, prefix_length: int = 7
    ) -> None:
        """SpellChecker constructor. It uses symspellpy to spell check the data."""
        self.max_dictionary_edit_distance = max_dictionary_edit_distance
        self.prefix_length = prefix_length
        self.dictionary_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt"
        )
        self.bigram_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
        )
        # Load the dictionaries once; reloading them on every call is slow.
        self.sym_spell: SymSpell = self._load_sym_spell_model()

    def _load_sym_spell_model(self) -> SymSpell:
        """Build a SymSpell instance and load the unigram and bigram dictionaries."""
        sym_spell: SymSpell = SymSpell(
            max_dictionary_edit_distance=self.max_dictionary_edit_distance,
            prefix_length=self.prefix_length,
        )
        sym_spell.load_dictionary(self.dictionary_path, term_index=0, count_index=1)
        sym_spell.load_bigram_dictionary(self.bigram_path, term_index=0, count_index=2)
        return sym_spell

    def __call__(self, phrase: str) -> str:
        """Return the top compound-lookup correction for `phrase`."""
        # The lookup distance may not exceed the dictionary edit distance,
        # so reuse the configured value instead of hard-coding 2.
        suggestions = self.sym_spell.lookup_compound(
            phrase, max_edit_distance=self.max_dictionary_edit_distance
        )
        return suggestions[0].term

In [144]:
my_spell_checker = SpellChecker()
# The checker returns the corrected phrase as a plain string.
suggestions: str = my_spell_checker(
    phrase="I like to prai and communcate with my cleator"
)

suggestions

'i like to pray and communicate with my creator'

In [124]:
[str(suggestion) for suggestion in suggestions][0].split(",")[0]

'i like to pray and communicate with my creator'

In [149]:
from toxic_classifier.utilities.datacleaners import TextDataCleaner


t = TextDataCleaner(X_train)
t.prepare_data()

text,dataset
str,str
"""think im suppo…","""gabe_hate_corp…"
"""thats joseph g…","""gabe_hate_corp…"
"""u didnt know a…","""gabe_hate_corp…"
"""raregabby""","""gabe_hate_corp…"
"""dog doxxed fat…","""gabe_hate_corp…"
…,…
"""expecting stri…","""gabe_hate_corp…"
"""great tips las…","""gabe_hate_corp…"
"""painting fully…","""gabe_hate_corp…"
"""sanders fuckin…","""gabe_hate_corp…"


In [88]:
X_train.with_columns(
    text=pl.col("text").str.replace_all(pattern=r"[0-9]", value="")
).to_pandas()

Unnamed: 0,text,dataset
0,"You think I'm suppose to blame them? I blame my government for allowing it to happen. If I was in their shoes and saw big opportunity, I'd be swimming the Rio Grande too.",gabe_hate_corpus
1,That's Joseph Goebbels.,gabe_hate_corpus
2,"U didn’t know area (Douma) u just said Assad made no headway, the rebels (US) kept civilians captive there years. US moderate rebels defeated n Douma day of gas attack on social media (fact). So . Assad gassed Syrians to celebrate . Assad gas Douma to hit the rebels, accidentally hit Syrians . Rebels gas Syrians to frame Assad",gabe_hate_corpus
3,#RareGabby,gabe_hate_corpus
4,DOG DOXXED This will be the Fate of you and all the world's Jew and Israel haters SHORTLY die Dog die https://www.youtube.com/watch?v=YcRkoIw,gabe_hate_corpus
...,...,...
22031,Were you expecting striptease going on in the corner?,gabe_hate_corpus
22032,Great tips. Never of lasagna gardening before.,gabe_hate_corpus
22033,This is my first #painting that I fully completed and am actually really proud of! It's based on a Skyrim screenshot (and Skyrim is my favorite game). I'm seriously so glad with how this one turned out!!!,gabe_hate_corpus
22034,sanders is a fucking menshevik,gabe_hate_corpus
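
The output above shows a side effect of stripping digits on the raw text: the YouTube link in row 4 survives with its digits removed, so it no longer points anywhere and would be harder to recognise later. A minimal sketch of one way around this is to remove URLs before digits. The helper name `strip_urls_then_digits` is hypothetical, and the URL pattern is the one used in the earlier URL-removal cell; the same ordering could be expressed by chaining two `str.replace_all` calls in polars.

```python
import re

URL_RE = re.compile(r"https?://\S{1,150}")  # same pattern as the URL-removal cell
DIGIT_RE = re.compile(r"[0-9]")


def strip_urls_then_digits(text: str) -> str:
    """Remove URLs first, then digits, then collapse leftover whitespace."""
    without_urls = URL_RE.sub("", text)
    without_digits = DIGIT_RE.sub("", without_urls)
    return " ".join(without_digits.split())
```

For example, `strip_urls_then_digits("watch 2 clips https://example.com/watch?v=a1b2 today")` returns `"watch clips today"`.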


In [None]:
# Toxic comment data paths (not used by the loader below; kept for the
# commented-out arguments)
toxic_train_path: str = "../data/toxic_comment_data/train.csv"
labels_path: str = "../data/toxic_comment_data/test_labels.csv"
other_path: str = "../data/toxic_comment_data/test.csv"

# Cyberbullying data
path: str = "../data/cyberbully_data/cyberbullying_tweets.csv"

cyberbully_dataloader: CyberBullyDataLoader = CyberBullyDataLoader(
    path=path,
    # labels_path=labels_path,
    # other_path=other_path,
    separator=",",
    stratify=True,
)

cyberbully_dataloader.prepare_data()