# NLP project - Hate Speech

HateHaters
- Bittner Bodowin
- Langela Regincos Jan
- Marcon Elias

NLP projects play a vital role in combating hate speech online and in fostering a safer, more inclusive digital environment.
Robust NLP models are crucial for accurately detecting and addressing instances of hate speech across online platforms.
Detection tools help mitigate the harmful effects of hate speech and promote respectful, constructive online interactions.

Datasets used:

- Paul Rottger – HateCheck (https://github.com/paul-rottger/hatecheck-data)
- Dynamically Generated Hate Speech Dataset (https://github.com/paul-rottger/hatecheck-data/blob/main/test_suite_cases.csv)



In [1]:
import pandas as pd

# Load the datasets
test_suite_df = pd.read_csv('test_suite_cases.csv')
hate1_df = pd.read_csv('hate1.csv')



In [2]:
# Display the first 5 rows of each DataFrame, including the column names
top5_test_suite = test_suite_df.head()
top5_hate1 = hate1_df.head()

display(top5_test_suite)  # top5_hate1 is displayed in the next cell

Unnamed: 0.1,Unnamed: 0,functionality,case_id,test_case,label_gold,target_ident,direction,focus_words,focus_lemma,ref_case_id,ref_templ_id,templ_id,case_templ
0,0,derog_neg_emote_h,1,I hate women.,hateful,women,general,hate,hate,,,1,I hate [IDENTITY_P].
1,1,derog_neg_emote_h,2,I hate trans people.,hateful,trans people,general,hate,hate,,,1,I hate [IDENTITY_P].
2,2,derog_neg_emote_h,3,I hate gay people.,hateful,gay people,general,hate,hate,,,1,I hate [IDENTITY_P].
3,3,derog_neg_emote_h,4,I hate black people.,hateful,black people,general,hate,hate,,,1,I hate [IDENTITY_P].
4,4,derog_neg_emote_h,5,I hate disabled people.,hateful,disabled people,general,hate,hate,,,1,I hate [IDENTITY_P].


In [3]:
display(top5_hate1)

Unnamed: 0.1,Unnamed: 0,acl.id,X1,text,label,type,target,level,split,round.base,annotator,round,acl.id.matched
0,1,acl1,1,dalits are lowlives,hate,notgiven,notgiven,original,train,1,GNZuCtwed3,1,
1,2,acl10,2,gay people are a burden to our society,hate,notgiven,notgiven,original,train,1,GNZuCtwed3,1,
2,3,acl100,3,Arabs are not welcome,hate,notgiven,notgiven,original,train,1,vDe7GN0NrL,1,
3,4,acl1000,4,I'm not saying we should actually eliminate he...,hate,notgiven,notgiven,original,train,1,oemYWm1Tjg,1,
4,5,acl10000,5,bananas are for black people,hate,notgiven,notgiven,original,test,1,QiOKkCi7F8,1,


### Columns to keep from both datasets:
- text (called `test_case` in test_suite_cases.csv)
- label (called `label_gold` in test_suite_cases.csv)

In [4]:
# Select the two columns we need and rename them in one step. rename() returns
# a new DataFrame, which avoids pandas' SettingWithCopyWarning that an
# in-place rename on a column selection triggers.
test_suite = test_suite_df[["label_gold", "test_case"]].rename(
    columns={'label_gold': 'label', 'test_case': 'text'}
)

display(test_suite.head())


Unnamed: 0,label,text
0,hateful,I hate women.
1,hateful,I hate trans people.
2,hateful,I hate gay people.
3,hateful,I hate black people.
4,hateful,I hate disabled people.


In [5]:
hate1 = hate1_df[["label", "text"]]

display(hate1.head())

Unnamed: 0,label,text
0,hate,dalits are lowlives
1,hate,gay people are a burden to our society
2,hate,Arabs are not welcome
3,hate,I'm not saying we should actually eliminate he...
4,hate,bananas are for black people


In [6]:
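# Stack both datasets; ignore_index=True gives the result a fresh 0..n-1 index
# instead of repeating each source DataFrame's row labels
df_combine = pd.concat([test_suite, hate1], axis=0, ignore_index=True)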

display(df_combine.head())

Unnamed: 0,label,text
0,hateful,I hate women.
1,hateful,I hate trans people.
2,hateful,I hate gay people.
3,hateful,I hate black people.
4,hateful,I hate disabled people.


In [7]:
df_combine.label.value_counts()

hate           22175
nothate        18969
hateful         2563
non-hateful     1165
Name: label, dtype: int64

In [8]:

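# Harmonize the label names across both datasets. Assign the result back
# instead of calling replace(..., inplace=True) on the selected column, which
# newer pandas versions warn about and may not apply to df_combine.
df_combine['label'] = df_combine['label'].replace({
    'hateful': 'hate',
    'nothate': 'not_hate',
    'non-hateful': 'not_hate'
})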

df_combine.label.value_counts()

hate        24738
not_hate    20134
Name: label, dtype: int64
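
The label harmonization step can also be written as a strict mapping that fails loudly on anything unexpected. The sketch below is a plain-Python illustration under assumptions: the names `LABEL_MAP` and `normalize_label` are hypothetical, and the four raw label values are the ones shown in the `value_counts()` output.

```python
# Hypothetical strict mapping from the four raw labels to the two target labels
LABEL_MAP = {
    "hate": "hate",
    "hateful": "hate",
    "nothate": "not_hate",
    "non-hateful": "not_hate",
}


def normalize_label(raw):
    """Map a raw label to 'hate' or 'not_hate'; raise on unknown values."""
    try:
        return LABEL_MAP[raw]
    except KeyError:
        raise ValueError(f"Unexpected label: {raw!r}") from None


print(normalize_label("non-hateful"))  # not_hate
```

Unlike a plain replacement, which leaves unknown values untouched, this raises as soon as a dataset introduces a new label spelling. It could be applied with `df_combine['label'].map(normalize_label)`.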

## Standard Pre-processing Workflow

### Text Cleaning (removing special characters and numbers, handling contractions)

In [10]:
import re

# Dictionary of common English contractions.
# Keys are lowercase because matching is case-insensitive.
CONTRACTIONS = {
    "i'm": "I am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "isn't": "is not",
    "aren't": "are not",
    "didn't": "did not",
    "haven't": "have not",
    "wouldn't": "would not",
    "shouldn't": "should not",
    "couldn't": "could not",
    # Add more contractions as needed
}

# Compile the pattern once instead of on every call.
# The \b word boundaries prevent matches inside longer words.
CONTRACTIONS_RE = re.compile(
    r"\b(%s)\b" % "|".join(re.escape(key) for key in CONTRACTIONS),
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return CONTRACTIONS_RE.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)

def clean_text(text):
    """Expand contractions, then remove everything except ASCII letters and whitespace."""
    text = expand_contractions(text)
    # The fourth positional argument of re.sub is `count`, not `flags`, so no
    # flags are passed here; the character class already covers both cases.
    return re.sub(r"[^a-zA-Z\s]", "", text)

# Apply the cleaning function to the combined DataFrame
df_combine['text'] = df_combine['text'].apply(clean_text)

# Display the head of the combined DataFrame to verify the changes
display(df_combine.head())


Unnamed: 0,label,text
0,hate,I hate women
1,hate,I hate trans people
2,hate,I hate gay people
3,hate,I hate black people
4,hate,I hate disabled people
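
A pitfall worth keeping in mind for the character-removal step: the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag combination passed as the fourth positional argument is read as `count`. Since `re.I | re.A` has the integer value 258, that would cap the number of removals at 258 per string. The sketch below makes this visible with an artificial sample string (an assumption for illustration, not data from the project).

```python
import re

PATTERN = r"[^a-zA-Z\s]"
sample = "1" * 300 + "abc"  # 300 digits followed by three letters

# Integer value of the flag combination; as a 4th positional argument
# re.sub would interpret this number as `count`
flag_value = int(re.I | re.A)

# Equivalent to passing the flags in the `count` position: only the first
# `flag_value` matches are removed
truncated = re.sub(PATTERN, "", sample, count=flag_value)

# Passing flags by keyword removes every match
full = re.sub(PATTERN, "", sample, flags=re.A)

print(flag_value)      # 258
print(len(truncated))  # 45 (42 digits are left over, plus "abc")
print(full)            # abc
```

Short texts never hit the limit, which is why the problem is easy to miss; passing `flags=` by keyword avoids it entirely.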


### Normalization

In [11]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download necessary NLTK resources
nltk.download('punkt')  # For tokenization
nltk.download('wordnet')  # For lemmatization

# Now, initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Bodowin1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Bodowin1\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
# Also download the 'omw-1.4' resource: without it, the lemmatization step further below raised an error
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Bodowin1\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [13]:
# Convert text to lowercase
df_combine['text'] = df_combine['text'].str.lower()

def lemmatize_text(text):
    """Tokenize the text, lemmatize each token, and rejoin the tokens into a string."""
    tokens = word_tokenize(text)
    # Without a part-of-speech argument, lemmatize() treats every token as a noun
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)

# Apply the lemmatization function to the 'text' column
df_combine['text'] = df_combine['text'].apply(lemmatize_text)

# Display the head of the DataFrame to verify the changes
df_combine.head()


Unnamed: 0,label,text
0,hate,i hate woman
1,hate,i hate trans people
2,hate,i hate gay people
3,hate,i hate black people
4,hate,i hate disabled people


### Tokenization

In [14]:
import nltk
from nltk.tokenize import word_tokenize

# Make sure the 'punkt' tokenizer models are available (no-op if already downloaded)
nltk.download('punkt')

def tokenize_text(text):
    """Split normalized text into a list of word tokens using NLTK's word_tokenize."""
    return word_tokenize(text)

# Apply the tokenization function to each row in the 'text' column
df_combine['tokens'] = df_combine['text'].apply(tokenize_text)

# Display the first few rows to check the tokenized text
print(df_combine.head())



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Bodowin1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


  label                    text                       tokens
0  hate            i hate woman             [i, hate, woman]
1  hate     i hate trans people     [i, hate, trans, people]
2  hate       i hate gay people       [i, hate, gay, people]
3  hate     i hate black people     [i, hate, black, people]
4  hate  i hate disabled people  [i, hate, disabled, people]


### Removing Stop Words

In [15]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Bodowin1\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
from nltk.corpus import stopwords

# Get the list of English stop words
stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
    """Remove stop words from a list of tokens"""
    filtered_tokens = [token for token in tokens if token not in stop_words]
    return filtered_tokens


In [17]:
# Apply stop-word removal to the 'tokens' column
df_combine['tokens_filtered'] = df_combine['tokens'].apply(remove_stop_words)

# Display the DataFrame to verify stop words are removed
print(df_combine.head())

  label                    text                       tokens  \
0  hate            i hate woman             [i, hate, woman]   
1  hate     i hate trans people     [i, hate, trans, people]   
2  hate       i hate gay people       [i, hate, gay, people]   
3  hate     i hate black people     [i, hate, black, people]   
4  hate  i hate disabled people  [i, hate, disabled, people]   

            tokens_filtered  
0             [hate, woman]  
1     [hate, trans, people]  
2       [hate, gay, people]  
3     [hate, black, people]  
4  [hate, disabled, people]  
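
One thing to watch with stop-word removal in hate speech data: NLTK's English stop-word list contains negations such as "not" and "no", so a sentence like "arabs are not welcome" loses the word that carries its meaning. The sketch below shows the effect and one possible workaround. It is only an illustration under assumptions: `STOP_WORDS` is a tiny hand-written set, not NLTK's list, and `NEGATIONS` and `remove_stop_words_keep_negations` are hypothetical names.

```python
# Tiny hand-written stop-word set for illustration only (not NLTK's full list)
STOP_WORDS = {"i", "are", "a", "to", "our", "not", "no"}

# Negation words to keep even though they are stop words (an assumption)
NEGATIONS = {"not", "no", "nor", "never"}


def remove_stop_words_keep_negations(tokens, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Drop stop words, but keep any token listed in `keep`."""
    return [token for token in tokens if token in keep or token not in stop_words]


tokens = ["arabs", "are", "not", "welcome"]
plain = [token for token in tokens if token not in STOP_WORDS]
kept = remove_stop_words_keep_negations(tokens)

print(plain)  # ['arabs', 'welcome']
print(kept)   # ['arabs', 'not', 'welcome']
```

With the NLTK list, the same idea could be expressed as `set(stopwords.words('english')) - NEGATIONS`. Whether keeping negations helps the classifier would have to be checked empirically.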


### Vectorization

In [18]:
# Join the tokens back into a string
df_combine['processed_text'] = df_combine['tokens_filtered'].apply(lambda tokens: ' '.join(tokens))


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Limiting to 1000 features for simplicity

# Fit and transform the 'processed_text' column
tfidf_matrix = tfidf_vectorizer.fit_transform(df_combine['processed_text'])

# tfidf_matrix is a sparse matrix of shape (n_samples, n_features)
print(tfidf_matrix.shape)


(44872, 1000)
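
To make the TF-IDF weights less of a black box, the following pure-Python sketch computes them for three of the processed texts. It assumes the formula documented for scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows). The helper `tfidf_rows` is hypothetical and splits on whitespace, whereas `TfidfVectorizer` uses its own token pattern and, with `max_features`, keeps only the most frequent terms.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


docs = ["hate woman", "hate trans people", "hate gay people"]
rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, {term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "woman" receives a higher weight than "hate" because "hate" occurs in every document and is therefore less distinctive.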


In [21]:
# TODO: handle the class imbalance? (24738 'hate' vs. 20134 'not_hate' rows)
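
One common option for the open question above is to weight the classes inversely to their frequency. The sketch below uses the "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` is hypothetical, and the label counts are taken from the `value_counts()` output earlier in the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Label counts as reported after harmonizing the labels
labels = ["hate"] * 24738 + ["not_hate"] * 20134
weights = balanced_class_weights(labels)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

A dictionary like this can be passed as `class_weight` to scikit-learn classifiers that support it. With a split of roughly 55% to 45%, the imbalance is mild, so reweighting may turn out to be unnecessary.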