# Project 2 NLP: Hatespeech Classifier

## Authors:

Adrian Obermühlner & Freja Rasmussen

## Resarch Question:

How do different preprocessing methods (nothing, stop word removal, lemming, stemming,…) affect the result of a hate speech classifier?

## Imports

In [64]:
# Imports
import pandas as pd
import numpy as np
import torch
import regex as re
import matplotlib.pyplot as plt

# Preprocessing imports
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Tokenizing
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator


In [2]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: NVIDIA GeForce GTX 1650 Ti


## Data Import


In [3]:
RANDOM_SEED = 42
BINARY_LABEL = "is_hate"
CATEGORIES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

np.random.seed(RANDOM_SEED)  # set random seed for reproducibility
# Make the labels into hate and no hate as 1 and 0

def binarize_labels(df):
    return (df[CATEGORIES].sum(axis=1) > 0).astype(int)

data_train = pd.read_csv("./data/train/train.csv", index_col=0)
data_train[BINARY_LABEL] = binarize_labels(data_train)

data_test = pd.read_csv("./data/test/test.csv", index_col=0).join(
    pd.read_csv("./data/test_labels/test_labels.csv", index_col=0)
)
data_test.drop(data_test[data_test["toxic"] == -1].index, inplace=True)
data_test[BINARY_LABEL] = binarize_labels(data_test)

In [4]:
data_train['comment_text'].head(10)

id
0000997932d777bf    Explanation\nWhy the edits made under my usern...
000103f0d9cfb60f    D'aww! He matches this background colour I'm s...
000113f07ec002fd    Hey man, I'm really not trying to edit war. It...
0001b41b1c6bb37e    "\nMore\nI can't make any real suggestions on ...
0001d958c54c6e35    You, sir, are my hero. Any chance you remember...
00025465d4725e87    "\n\nCongratulations from me as well, use the ...
0002bcb3da6cb337         COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
00031b1e95af7921    Your vandalism to the Matt Shirvington article...
00037261f536c51d    Sorry if the word 'nonsense' was offensive to ...
00040093b2687caa    alignment on this subject and which are contra...
Name: comment_text, dtype: object

In [5]:
data_test.head(10)

Unnamed: 0_level_0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,is_hate
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0001ea8717f6de06,Thank you for understanding. I think very high...,0,0,0,0,0,0,0
000247e83dcc1211,:Dear god this site is horrible.,0,0,0,0,0,0,0
0002f87b16116a7f,"""::: Somebody will invariably try to add Relig...",0,0,0,0,0,0,0
0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ...",0,0,0,0,0,0,0
00059ace3e3e9a53,""" \n\n == Before adding a new product to the l...",0,0,0,0,0,0,0
000663aff0fffc80,this other one from 1897,0,0,0,0,0,0,0
000689dd34e20979,== Reason for banning throwing == \n\n This ar...,0,0,0,0,0,0,0
000844b52dee5f3f,|blocked]] from editing Wikipedia. |,0,0,0,0,0,0,0
00091c35fa9d0465,"== Arabs are committing genocide in Iraq, but ...",1,0,0,0,0,0,1
000968ce11f5ee34,Please stop. If you continue to vandalize Wiki...,0,0,0,0,0,0,0


In [6]:
# get the distribution of the labels to see if roughly similar for both

is_hate_count_train = data_train['is_hate'].value_counts()
ratio_train = is_hate_count_train/ len(data_train)

is_hate_count_test = data_test['is_hate'].value_counts()
ratio_test = is_hate_count_test/ len(data_test)

print('Ratio of no/is hate for train set: ', ratio_train)
print('Ratio of no/is hate for test set: ', ratio_test)

Ratio of no/is hate for train set:  0    0.898321
1    0.101679
Name: is_hate, dtype: float64
Ratio of no/is hate for test set:  0    0.90242
1    0.09758
Name: is_hate, dtype: float64


## Representation

## Data Preprocessing

**Note**: We would need to make a loop for the different combinations of 
preprocessing (none, only stemming, only lemming, only stop word removal and every combination of this)
Either as coloumns that can be used to iterate over for the model training and validation, or make the preprocessing
and then go further and repeat from beginning.


In [73]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def handle_negations(tokens):
    negation_words = {'not', "n't", 'no', 'never', 'none'}
    processed_tokens = []
    skip_next = False

    for i, word in enumerate(tokens):
        if skip_next:
            skip_next = False
            continue

        if word in negation_words and i + 1 < len(tokens):
            processed_tokens.append(word + '_' + tokens[i + 1])
            skip_next = True
        else:
            processed_tokens.append(word)

    return processed_tokens

import re
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


# Assuming stop_words is defined somewhere

def get_wordnet_pos(treebank_tag):
    """Converts treebank tags to wordnet tags."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


def preprocess_text(text, use_lower=True, remove_stopwords=False, use_stemming=False, combine_negations=False, keep_semantic_punctuation=True, rare_words=None):
    if use_lower:
        text = text.lower()
    
    tokens = word_tokenize(text)
    
    if combine_negations:
        tokens = handle_negations(tokens)

    if remove_stopwords:
        tokens = [word for word in tokens if word.lower() not in stop_words]
    
    if not keep_semantic_punctuation:
        tokens = [re.sub(r'[^\w\s]', '', word) for word in tokens]
    
    if use_stemming:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
    else:
        lemmatizer = WordNetLemmatizer()
        tagged_tokens = pos_tag(tokens)
        tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_tokens]
    
    filtered_tokens = [word for word in tokens if word.isalpha()]

    return ' '.join(filtered_tokens)


# Apply the preprocessing function to the training and test datasets
# We don't pass the rare_words parameter, so rare word removal is not performed
#data_train['comment_text_clean_2'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_stemming=False))
#data_test['comment_text_clean_2'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_stemming=False))



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\flras\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\flras\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\flras\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\flras\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [11]:
# Lemmatization (default behavior, without stemming)
print(data_train['comment_text'][:10000].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=False)))

id
0000997932d777bf    explanation edits make username hardcore metal...
000103f0d9cfb60f    daww match background colour m seemingly stuck...
000113f07ec002fd    hey man m really try edit war s guy constantly...
0001b41b1c6bb37e    ca nt make real suggestion improvement wonder ...
0001d958c54c6e35                      sir hero chance remember page s
                                          ...                        
1a790ff1007a10e3    number may either list separately begin stuck ...
1a7a4868968e2b9e                                 two love disagree nt
1a7c3bec9a71415d    change lance thomas lance thomas link american...
1a7c9c14b0cf0fe0    state court put article deal state law state c...
1a7d550fec6e9777                               buddy thing nottingham
Name: comment_text, Length: 10000, dtype: object


In [12]:
data_train.shape

(159571, 8)

In [9]:
# Lemmatization (default behavior, without stemming)
data_train['text_lemmatization'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=False))

In [10]:
# Keeping semantic punctuation (keeping ! and ?)
data_train['text_punctuation'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=True, use_stemming=False, combine_negations=False))


In [13]:
data_train.to_csv('train_all_coloumns.csv')

In [186]:

# Removing all punctuation
data_train['text_no_punctuation'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=False))
data_train.to_csv('train_all_coloumns.csv')

KeyboardInterrupt: 

In [12]:
# Keep negations
data_train['text_negations'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=True))
data_train.to_csv('train_all_coloumns.csv')

In [14]:
# Only lowercase
data_train['data_text_1'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=False, keep_semantic_punctuation=True, use_stemming=False))
data_train.to_csv('train_all_coloumns.csv')

In [15]:

# Lowercase and stopwords removal
data_train['data_text_2'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=True, use_stemming=False))
data_train.to_csv('train_all_coloumns.csv')

In [16]:

# Add punctuation handling
data_train['data_text_3'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False))
data_train.to_csv('train_all_coloumns.csv')

In [17]:

# Incorporate lemmatization
data_train['data_text_4'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=True))
data_train.to_csv('train_all_coloumns.csv')

In [None]:

# Incorporate stemming
data_train['data_text_5'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=True, combine_negations=True))
data_train.to_csv('train_all_coloumns.csv')

In [10]:
testing = data_train['comment_text'][:100000].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=True, combine_negations=True))
print(testing)

id
0000997932d777bf    explan edit made usernam hardcor metallica fan...
000103f0d9cfb60f    daww match background colour m seemingli stuck...
000113f07ec002fd    hey man m realli edit war s guy constantli rem...
0001b41b1c6bb37e    ca real suggest improv wonder section statist ...
0001d958c54c6e35                         sir hero chanc rememb page s
                                          ...                        
172e07f9ab48bb39    refer book kemantney languag howev figur confi...
172e16e4f8de66fe    line text three section list offic proport rea...
1730ef89d087b7c8                                  say countri countri
1731c1aae5f1cb32    renata go russian websit find work publish eve...
1731fcac2661469b    stop remov matter talk perman makeup edit war ...
Name: comment_text, Length: 100000, dtype: object


In [None]:
data_train = pd.read_csv("./train_all_coloumns.csv", index_col=0)

data_test = pd.read_csv("./test_all_coloumns.csv", index_col=0)
data_train.head(3)

Unnamed: 0_level_0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,is_hate,text_lemmatization,text_punctuation,text_no_punctuation,text_negations,data_text_1,data_text_2,data_text_3,data_text_4
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,0,explanation edits make username hardcore metal...,explanation edits make username hardcore metal...,explanation edits make username hardcore metal...,explanation edits make username hardcore metal...,explanation why the edits make under my userna...,explanation edits make username hardcore metal...,explanation edits make username hardcore metal...,explanation edits make username hardcore metal...
000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,0,daww match background colour m seemingly stuck...,match background colour seemingly stick thanks...,daww match background colour m seemingly stuck...,daww match background colour m seemingly stuck...,he match this background colour i seemingly st...,match background colour seemingly stick thanks...,daww match background colour m seemingly stuck...,daww match background colour m seemingly stuck...
000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,0,hey man m really try edit war s guy constantly...,hey man really try edit war guy constantly rem...,hey man m really try edit war s guy constantly...,hey man m really edit war s guy constantly rem...,hey man i really not try to edit war it just t...,hey man really try edit war guy constantly rem...,hey man m really try edit war s guy constantly...,hey man m really edit war s guy constantly rem...


In [6]:
print(data_train['data_text_4'][6], data_train.shape, data_test.shape)

cocksucker piss around work (159571, 16) (63978, 16)


In [None]:

# Incorporate stemming
data_train['data_text_5'] = data_train['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=True, combine_negations=True))

In [74]:
# Lemmatization (default behavior, without stemming)
data_test['text_lemmatization'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=False))
data_test.to_csv('test_all_coloumns.csv')


In [8]:
# Lemmatization (default behavior, without stemming)
data_test['text_punctuation'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=False))
data_test.to_csv('test_all_coloumns.csv')

In [75]:

# Removing all punctuation
data_test['text_punctuation'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=True, use_stemming=False, combine_negations=False))
data_test.to_csv('test_all_coloumns.csv')

In [9]:

# Stemming (enabling stemming, no lemmatization)
data_test['text_stemming'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=True, combine_negations=False))
data_test.to_csv('test_all_coloumns.csv')

In [185]:
# Removing all punctuation
data_test['text_no_punctuation'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=False))
data_test.to_csv('train_all_coloumns.csv')

In [10]:

# Keep negations
data_test['text_negations'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=True))
data_test.to_csv('test_all_coloumns.csv')

In [11]:

# Removing negations
data_test['text_no_negations'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=False))
data_test.to_csv('test_all_coloumns.csv')

In [12]:

# Only lowercase
data_test['data_text_1'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=False, keep_semantic_punctuation=True, use_stemming=False))
data_test.to_csv('test_all_coloumns.csv')

In [13]:

# Lowercase and stopwords removal
data_test['data_text_2'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=True, use_stemming=False))
data_test.to_csv('test_all_coloumns.csv')

In [14]:

# Add punctuation handling
data_test['data_text_3'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False))
data_test.to_csv('test_all_coloumns.csv')

In [15]:

# Incorporate lemmatization
data_test['data_text_4'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=False, combine_negations=True))
data_test.to_csv('test_all_coloumns.csv')

In [None]:

# Incorporate stemming
data_test['data_text_5'] = data_test['comment_text'].apply(lambda x: preprocess_text(x, use_lower=True, remove_stopwords=True, keep_semantic_punctuation=False, use_stemming=True, combine_negations=True))

data_test.to_csv('test_all_coloumns.csv')

: 

In [None]:
data_test.to_csv('test_all_coloumns.csv')
data_train.to_csv('train_all_coloumns.csv')

In [76]:

data_test['text_punctuation']

id
0001ea8717f6de06    thank understanding think highly would revert ...
000247e83dcc1211                               dear god site horrible
0002f87b16116a7f    somebody invariably try add religion really me...
0003e1cccfd5a40a    say right type type institution need case thre...
00059ace3e3e9a53    add new product list make sure relevant add ne...
                                          ...                        
fff8f64043129fa2    jerome see never get around surprise looked ex...
fff9d70fe0722906                   http heh famous kida envy congrats
fffa8a11c4378854                              want speak gay romanian
fffac2a094c8e0e2    mel gibson nazi bitch make shitty movie much b...
fffb5451268fb5ba    unicorn lair discovery supposedly lair discove...
Name: text_punctuation, Length: 63978, dtype: object

In [8]:
print(data_train.isna().sum())

comment_text             0
toxic                    0
severe_toxic             0
obscene                  0
threat                   0
insult                   0
identity_hate            0
is_hate                  0
text_lemmatization      63
text_punctuation       155
text_no_punctuation     63
text_negations         115
data_text_1             65
data_text_2            155
data_text_3             63
data_text_4            115
dtype: int64


In [9]:
# Train and Test set
# Define test and train set
def datasetDefinition(columnName):
    X_train = data_train[columnName]
    y_train = data_train["is_hate"]

    X_test = data_test[columnName]
    y_test = data_test["is_hate"]
    return X_train, y_train, X_test, y_test

## Word Embedding



**Notes**: Tokenizing with TF-IDF

In [161]:
tfidf_vectorizer = TfidfVectorizer(max_features=2000)

In [162]:
def makeToTensors(X_train, y_train, X_test, y_test):
    # Make the test and train sets to tensors and apply TF-IDF
    X_train.fillna('', inplace=True)
    X_test.fillna('', inplace=True)
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

    # Transform testing data
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    # Convert TF-IDF matrices to PyTorch tensors
    X_train_tensor = torch.tensor(X_train_tfidf.toarray(), dtype=torch.float32)
    X_test_tensor = torch.tensor(X_test_tfidf.toarray(), dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train, dtype=torch.long)
    y_test_tensor = torch.tensor(y_test, dtype=torch.long)
    return X_train_tensor, y_train_tensor, X_test_tensor, y_test_tensor

## Model Implementation & Test with Testset

**Note**: Does a CNN makes sense for sentiment analysis? or a simpler model?

**Answers and additional Notes**:
Make a CNN with PyTorch using skorch as wrapper to make it possible to use sklearn.pipeline with the model
This way gridsearch for hyper parameters is possible and tfidfVectorizer can be used for tf-idf
CNN: vector size 300, conv. layer of some size, flatten, relu, end with softmax or something
Example: https://www.kaggle.com/code/raviusz/jigsaw-toxic-comment
example look very good to get basics and then change some of architecture
hyperparameter tuning for each model? only if time permits, alt. tune on best model and use for rest

**Note**: We will use the given test set to compare the different approaches. Make a dataframe with all the results
in accuracy, f1, recall, etc. 

In [182]:

# CNN: The basic model

class CNN(nn.Module):
    def __init__(self, dropout_prob=0.5):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=5)
        self.conv2 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=5)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.conv_output_size = self._get_conv_output_size(2000)
        
        self.fc1 = nn.Linear(self.conv_output_size, 64)
        self.dropout = nn.Dropout(dropout_prob)
        self.fc2 = nn.Linear(64, 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = x.unsqueeze(1)  # Add channel dimension
        x = nn.functional.relu(self.conv1(x))
        x = self.pool(x)
        x = nn.functional.relu(self.conv2(x))
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        x = nn.functional.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x
        
    def _get_conv_output_size(self, input_size):
        x = torch.randn(1, 1, input_size)  # Add channel dimension
        x = nn.functional.relu(self.conv1(x))
        x = self.pool(x)
        x = nn.functional.relu(self.conv2(x))
        x = self.pool(x)
        return x.view(1, -1).size(1)


batchSize = 25

def trainCNN(X_train, y_train, X_test, y_test,  batch_size=batchSize, epochs=7, learning_rate=0.001):

    # Step 4: Train the model
    train_dataset = TensorDataset(X_train, y_train)  # Assuming X_train_tensor and y_train_tensor are tensors
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    train_loss = []
    train_accuracy = []
    # Move model to GPU
    model = CNN().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    print(f'learning rate: {learning_rate}, total number of epochs: {epochs}')
    for epoch in range(int(epochs)):
        model.train()
        total_loss = 0.0
        correct = 0
        total = 0
        epoch_accuracy = 0

        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        
        epoch_loss = total_loss / len(train_dataset)
        epoch_accuracy = correct / total
        train_loss.append(epoch_loss)
        train_accuracy.append(epoch_accuracy)
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss}, Accuracy: {epoch_accuracy}')

    # Step 5: Evaluate the model
    # Assuming X_test_tensor and y_test_tensor are tensors
    test_dataset = TensorDataset(X_test, y_test)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)

    model.eval()
    total_correct = 0
    total_predicted = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total_correct += torch.sum(predicted == labels).item()
            total_predicted += len(predicted)
    # Calculate evaluation metrics
    accuracy = total_correct / total_predicted
    

    return accuracy, model, train_loss, train_accuracy
    print(f'Test Accuracy: {accuracy}')


In [183]:
# for all the columns load them into X_train
# make the tensors, train the cnn and thenn take the outcome and add it to a df
columnName = ['text_lemmatization', 'text_punctuation', 'text_no_punctuation', 'text_negations', 'text_no_negations', 'data_text_1', 'data_text_2',
            'data_text_2', 'data_text_3', 'data_text_4']
results = []
all_train_loss = []
all_train_accuracy = []
for name in columnName:
    X_train, y_train, X_test, y_test = datasetDefinition(name)
    X_train_T, y_train_T, X_test_T, y_test_T = makeToTensors(X_train, y_train, X_test, y_test)
    
    # Get the test accuracy using the best model
    accuracy, model, train_loss, train_accuracy = trainCNN(X_train_T, y_train_T, X_test_T, y_test_T, epochs=15, learning_rate=0.0005)
    
    all_train_loss.append(train_loss)
    all_train_accuracy.append(train_accuracy)

    plt.plot(train_loss, label=f'{name} Loss')
    plt.plot(train_accuracy, label=f'{name} Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Value')
    plt.title(f'Loss and Accuracy vs. Epochs for {name}')
    plt.legend()
    plt.show()

    results.append({'Column': name, 'Test_Accuracy': accuracy})

results_df = pd.DataFrame(results)
results_df.to_csv('hyperparameter_tuning_results.csv', index=False)

learning rate: 0.0005, total number of epochs: 15
Epoch [1/15], Loss: 0.4151755427423488, Accuracy: 0.8982145878637096
Epoch [2/15], Loss: 0.4149526729235221, Accuracy: 0.8983211235124177
Epoch [3/15], Loss: 0.4149413854670959, Accuracy: 0.8983211235124177


KeyboardInterrupt: 

In [None]:
columnName = ['text_lemmatization', 'text_punctuation', 'text_no_punctuation', 'text_negations', 'text_no_negations', 'data_text_1', 'data_text_2',
            'data_text_2', 'data_text_3', 'data_text_4']

for i, name in enumerate(columnName):
    plt.plot(all_train_loss[i], label=f'{name} Loss')

plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Loss vs. Epochs for Different Preprocessing Methods')
plt.legend()
plt.show()

for i, name in enumerate(columnName):
    plt.plot(all_train_accuracy[i], label=f'{name} Accuracy')

plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training Accuracy vs. Epochs for Different Preprocessing Methods')
plt.legend()
plt.show()

IndexError: list index out of range

In [193]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

columnName = ['text_lemmatization', 'text_punctuation', 'text_no_punctuation', 'text_negations', 'text_no_negations', 'data_text_1', 'data_text_2',
            'data_text_2', 'data_text_3', 'data_text_4']

classification_reports = []
linear_pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver='sag', max_iter=1000)
)
for name in columnName:
    X_train, y_train, X_test, y_test = datasetDefinition(name)
    linear_pipeline.fit(X_train, y_train)
    y_pred = linear_pipeline.predict(X_test)
    print(name, classification_report(y_test, y_pred))
    report = classification_report(y_test, y_pred, output_dict=True)
    report_df = pd.DataFrame(report).transpose()
    classification_reports.append(report_df)
    
classification_reports_df = pd.concat(classification_reports, keys=columnName)

# Save the DataFrame to a CSV file
classification_reports_df.to_csv('classification_reports.csv')

plt.figure(figsize=(12, 8))
for name in columnName:
    plt.plot(classification_reports_df.loc[name]['accuracy'], label=name, marker='o')
plt.title('Accuracy')
plt.xlabel('Class')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Plot recall for each dataset
plt.figure(figsize=(12, 8))
for name in columnName:
    plt.plot(classification_reports_df.loc[name]['recall'], label=name, marker='o')
plt.title('Recall')
plt.xlabel('Class')
plt.ylabel('Recall')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Plot F1-score for each dataset
plt.figure(figsize=(12, 8))
for name in columnName:
    plt.plot(classification_reports_df.loc[name]['f1-score'], label=name, marker='o')
plt.title('F1-score')
plt.xlabel('Class')
plt.ylabel('F1-score')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

text_lemmatization               precision    recall  f1-score   support

           0       0.97      0.95      0.96     57735
           1       0.64      0.74      0.68      6243

    accuracy                           0.93     63978
   macro avg       0.80      0.85      0.82     63978
weighted avg       0.94      0.93      0.94     63978

text_punctuation               precision    recall  f1-score   support

           0       0.97      0.95      0.96     57735
           1       0.64      0.74      0.68      6243

    accuracy                           0.93     63978
   macro avg       0.80      0.85      0.82     63978
weighted avg       0.94      0.93      0.94     63978

text_no_punctuation               precision    recall  f1-score   support

           0       0.97      0.95      0.96     57735
           1       0.64      0.74      0.68      6243

    accuracy                           0.93     63978
   macro avg       0.80      0.85      0.82     63978
weighted avg      

ValueError: np.nan is an invalid document, expected byte or unicode string.

In [195]:
data_train['text_negations']

id
0000997932d777bf    explanation edits make username hardcore metal...
000103f0d9cfb60f    daww match background colour m seemingly stuck...
000113f07ec002fd    hey man m really edit war s guy constantly rem...
0001b41b1c6bb37e    ca real suggestion improvement wonder section ...
0001d958c54c6e35                      sir hero chance remember page s
                                          ...                        
ffe987279560d7ff    second time ask view completely contradict cov...
ffea4adeee384e90                 ashamed horrible thing put talk page
ffee36eab5c267c9    spitzer umm theres article prostitution ring c...
fff125370e4aaaf3    look like actually put speedy first version de...
fff46fc426af1f9a    really understand come idea bad right away kin...
Name: text_negations, Length: 159571, dtype: object