# **1. Preprocessing for Tree-Based Models**

## **Preprocessing for Tree-Based Toxicity Models**

This notebook performs full preprocessing of the **Jigsaw Toxic Comment Classification** dataset in preparation for tree-based models such as XGBoost or Random Forests.

It includes:
- Cleaning and transforming comment text
- Handling missing values and identity labels
- Text normalization (tokenization, lemmatization)
- Sentence vector generation using Word2Vec

The final cleaned dataset is exported for use in downstream modeling.

---

## **Contents**

1. **Data Load & Inspection**  
   1.1 Mount Google Drive and load raw `train.csv` / `test.csv`  
   1.2 Inspect missing values and data structure  

2. **Missing Value Treatment**  
   2.1 Fill `parent_id`, `comment_text`, and identity columns  

3. **Text Cleaning & Normalization**  
   3.1 Remove hashtags, mentions, links, punctuation  
   3.2 Convert to lowercase, tokenize, and remove stopwords  
   3.3 Lemmatize tokens  

4. **Sentence Vectorization**  
   4.1 Train Word2Vec on tokenized comments  
   4.2 Generate average vector per comment  
   4.3 Store sentence vectors in `sentence_vector` column  

5. **Export Cleaned Dataset**  
   5.1 Save as `train_cleaned.csv` for modeling


### **1.1 Load Raw Data from Google Drive**

- Mounts Google Drive and loads the `train.csv` and `test.csv` files.
- Performs basic inspection (`.head()`, `.info()`, and `.isnull().sum()`) to understand structure and detect missing values.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os

from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
data_path = "/content/drive/My Drive/Jigsaw/"
train = pd.read_csv(data_path + "train.csv")

In [None]:
train.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1804874 entries, 0 to 1804873
Data columns (total 45 columns):
 #   Column                               Dtype  
---  ------                               -----  
 0   id                                   int64  
 1   target                               float64
 2   comment_text                         object 
 3   severe_toxicity                      float64
 4   obscene                              float64
 5   identity_attack                      float64
 6   insult                               float64
 7   threat                               float64
 8   asian                                float64
 9   atheist                              float64
 10  bisexual                             float64
 11  black                                float64
 12  buddhist                             float64
 13  christian                            float64
 14  female                               float64
 15  heterosexual                    

In [None]:
test = pd.read_csv(data_path + "test.csv")

In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97320 entries, 0 to 97319
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            97320 non-null  int64 
 1   comment_text  97320 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.5+ MB


In [None]:
test.isnull().sum()

Unnamed: 0,0
id,0
comment_text,0


In [None]:
train.isnull().sum()

Unnamed: 0,0
id,0
target,0
comment_text,3
severe_toxicity,0
obscene,0
identity_attack,0
insult,0
threat,0
asian,1399744
atheist,1399744


### **1.2 Missing Value Treatment**

- Fills missing `parent_id` values with -1 to denote root-level comments.
- Replaces missing `comment_text` entries with the placeholder `"MISSING"`.
- Fills identity columns (e.g., race, religion, gender) with 0, assuming non-activation.

In [None]:
# filling -1 for missing parent_id
train['parent_id'].fillna(-1, inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train['parent_id'].fillna(-1, inplace=True)


In [None]:
# filling missing comment_text with MISSING
train['comment_text'].fillna('MISSING', inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train['comment_text'].fillna('MISSING', inplace=True)


In [None]:
# filling 0s for (basically the rest of the columns) racial and religious columns
train.fillna(0, inplace=True)

### **1.3 Text Cleaning and Tokenization**

- Cleans `comment_text` by removing hashtags, mentions, URLs, and special characters using regex.
- Converts text to lowercase and tokenizes using `nltk.word_tokenize()`.
- Removes English stopwords with NLTK's predefined stopword list.

In [None]:
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

True

In [None]:
def preprocess_text(text):
  #remove hashtags
  text = re.sub(r'#\w+', '', str(text)) # Convert text to string using str(text)
  #remove @
  text = re.sub(r'@\w+', '', str(text)) # Convert text to string using str(text)
  #remove URLs
  text = re.sub(r'http\S+', '', str(text)) # Convert text to string using str(text)
  #remove special characters
  text = re.sub(r'[^\w\s]', '', str(text)) # Convert text to string using str(text)

  #convert to lowercase
  text = text.lower()

  #tokenize
  tokens = nltk.word_tokenize(text)

  #remove stopwords
  stop_words = set(nltk.corpus.stopwords.words('english'))
  filtered_tokens = [word for word in tokens if word not in stop_words]
  return ' '.join(filtered_tokens) # Join the tokens back

In [None]:
train['text'] = train['comment_text'].apply(preprocess_text)

In [None]:
train.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count,text
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,rejected,0,0,0,0,0,0.0,0,4,cool like would want mother read really great ...
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,rejected,0,0,0,0,0,0.0,0,4,thank would make life lot less anxietyinducing...
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,rejected,0,0,0,0,0,0.0,0,4,urgent design problem kudos taking impressive
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,rejected,0,0,0,0,0,0.0,0,4,something ill able install site releasing
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,rejected,0,0,0,1,0,0.0,4,47,haha guys bunch losers


### **1.4 Lemmatization**

- Applies lemmatization using NLTK’s `WordNetLemmatizer` to normalize tokens (e.g., "running" → "run").

In [None]:
# lemmatize the text column in train data
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
train['text'] = train['text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

### **1.5 Sentence Vectorization (Word2Vec)**

- Trains a Word2Vec model on the cleaned, tokenized text.
- Converts each comment into a 100-dimensional sentence vector by averaging its token embeddings.


In [None]:
train['text'].head()

Unnamed: 0,text
0,cool like would want mother read really great ...
1,thank would make life lot less anxietyinducing...
2,urgent design problem kudos taking impressive
3,something ill able install site releasing
4,haha guy bunch loser


In [None]:
from gensim.models import Word2Vec

In [None]:
def get_sentence_vector(word_list, model):
    vectors = [model.wv[word] for word in word_list if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)


In [None]:
sentences = [sentence.split() for sentence in train['text']]

In [None]:
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

In [None]:
# Apply the function to your dataset
train["sentence_vector"] = train["text"].apply(lambda word_list: get_sentence_vector(word_list, model))

# Check first few sentence vectors
print(train["sentence_vector"].head())

0    [0.3695966, 0.030246127, 1.7840608, -0.7254034...
1    [0.29705206, 0.3363558, 1.7053747, -0.48943752...
2    [0.2693393, 0.26726028, 1.7680713, -0.43150175...
3    [0.39833638, 0.3731384, 1.6136991, -0.2354782,...
4    [0.4626832, 0.43987605, 2.0688334, 0.14376386,...
Name: sentence_vector, dtype: object


In [None]:
train

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count,text,sentence_vector
0,59848,0.000000,"This is so cool. It's like, 'would you want yo...",0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0,0,0,0,0,0.0,0,4,cool like would want mother read really great ...,"[0.3695966, 0.030246127, 1.7840608, -0.7254034..."
1,59849,0.000000,Thank you!! This would make my life a lot less...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0,0,0,0,0,0.0,0,4,thank would make life lot less anxietyinducing...,"[0.29705206, 0.3363558, 1.7053747, -0.48943752..."
2,59852,0.000000,This is such an urgent design problem; kudos t...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0,0,0,0,0,0.0,0,4,urgent design problem kudos taking impressive,"[0.2693393, 0.26726028, 1.7680713, -0.43150175..."
3,59855,0.000000,Is this something I'll be able to install on m...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0,0,0,0,0,0.0,0,4,something ill able install site releasing,"[0.39833638, 0.3731384, 1.6136991, -0.2354782,..."
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.000000,0.021277,0.872340,0.0,0.0,0.0,...,0,0,0,1,0,0.0,4,47,haha guy bunch loser,"[0.4626832, 0.43987605, 2.0688334, 0.14376386,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1804869,6333967,0.000000,"Maybe the tax on ""things"" would be collected w...",0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0,0,0,0,0,0.0,0,4,maybe tax thing would collected product import...,"[0.35382685, 0.24724989, 1.7214196, -0.6015393..."
1804870,6333969,0.000000,What do you call people who STILL think the di...,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0,0,0,0,0,0.0,0,4,call people still think divine role creation,"[0.48454404, 0.3087887, 1.7444198, -0.50347936..."
1804871,6333982,0.000000,"thank you ,,,right or wrong,,, i am following ...",0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0,0,0,0,0,0.0,0,4,thank right wrong following advice,"[-0.1587102, 0.61544734, 1.700462, -0.7920288,..."
1804872,6334009,0.621212,Anyone who is quoted as having the following e...,0.030303,0.030303,0.045455,0.621212,0.0,0.0,0.0,...,0,0,0,0,0,0.0,0,66,anyone quoted following exchange even apocryph...,"[0.23050489, 0.4827941, 1.7122359, -0.37115076..."


### **1.6 Export Cleaned Data**

- Saves the processed and vectorized dataset to `train_cleaned.csv` for downstream modeling tasks.

In [None]:
train.to_csv(data_path + 'train_cleaned.csv', index=False)