# Overview

 Data Loading and Preprocessing:
   - The script reads a CSV file (`train.csv`) into a pandas DataFrame.
   - It performs data preprocessing steps using functions defined later in the code, including converting text to lowercase, removing escape characters, HTML tags, links, digits, punctuation, stopwords, etc.
   - The DataFrame is then filtered to keep only relevant columns (`comment_text` and labels for multi-label classification).

 Sample Selection:
   - Due to the large size of the dataset, the code randomly selects 50% of the data to train the models. This reduces the data size to prevent memory issues.

 Text Vectorization (TF-IDF):
   - The script uses TF-IDF vectorization to convert the text data into numerical representations suitable for machine learning algorithms.
   - The `TfidfVectorizer` from scikit-learn is used to perform this vectorization.
   - The text data is split into training and testing sets using the `train_test_split` function from scikit-learn.

 Modelling - Binary Relevance with Different Classifiers & Classifier Chain.
   - The script demonstrates the use of the Binary Relevance method for multi-label classification.
   - It trains different classifiers using the MultiOutputClassifier wrapper from scikit-learn.
   - Five classifiers are used: MultinomialNB, LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and XGBClassifier.
   - Each classifier is trained on the training data and evaluated on the testing data using metrics such as Hamming loss, accuracy, and log loss.


   The output of each evaluation is printed to the console.


In [None]:
import sklearn

In [None]:
!pip install scikit-multilearn


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-multilearn
  Downloading scikit_multilearn-0.2.0-py3-none-any.whl (89 kB)
Installing collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from skmultilearn.adapt import MLkNN
from sklearn.metrics import hamming_loss, accuracy_score

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Specify the path to your CSV file
csv_file_path = '/content/drive/MyDrive/sikka/train.csv'

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(csv_file_path)

In [None]:
df

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \n\nThat is ...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \n\nUmm, theres no actual article for ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0


In [None]:

df['comment_text']

0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
                                ...                        
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: comment_text, Length: 159571, dtype: object

In [None]:
df.shape

(159571, 8)

In [None]:
df.dtypes

id               object
comment_text     object
toxic             int64
severe_toxic      int64
obscene           int64
threat            int64
insult            int64
identity_hate     int64
dtype: object

In [None]:
df.nunique()

id               159571
comment_text     159571
toxic                 2
severe_toxic          2
obscene               2
threat                2
insult                2
identity_hate         2
dtype: int64

# Problem Statement:
 A Kaggle competition for a multi-label classification problem on text data: each text sample can belong to several classes at once, or to none. You must create a model which predicts the probability of each class for each text sample. The details can be found here - https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
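
To make the multi-label setting concrete, here is a minimal sketch of a label indicator matrix. The label names are the six columns of `train.csv`; the three rows are made-up values for illustration, not rows from the dataset.

```python
import numpy as np

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical label rows: one row per comment, one column per label.
y = np.array([
    [0, 0, 0, 0, 0, 0],   # clean comment: no label at all
    [1, 0, 1, 0, 1, 0],   # three labels at once
    [1, 0, 0, 0, 0, 0],   # a single label
])

# Number of active labels per comment (can be 0, 1, or more).
labels_per_comment = y.sum(axis=1)

def active_labels(row):
    """Return the names of the labels that are set to 1 in one row."""
    return [name for name, flag in zip(labels, row) if flag]

print(labels_per_comment)
print(active_labels(y[1]))
```

In a multi-class problem every row would contain exactly one 1; here a row may contain none or several, which is why the modelling sections use multi-label methods.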


# DATA PREPROCESSING

In [None]:
# need latest version of matplotlib >=(3.4.1)
!pip install --upgrade matplotlib
#installing required libraries
!pip install venn
!pip install contractions
!pip install scikit-multilearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting venn
  Downloading venn-0.1.3.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: venn
  Building wheel for venn (setup.py) ... done
  Created wheel for venn: filename=venn-0.1.3-py3-none-any.whl size=19699 sha256=a7b17f2c36e53334d90063d6f4eb345ea129342ff0b212905ad0ac4c48144325
  Stored in directory: /root/.cache/pip/wheels/9c/ce/43/705b4a04cd822891d1d7a4c43fc444b4798978e72c79528c5f
Successfully built venn
Installing collected packages: venn
Successfully installed venn-0.1.3
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Down

In [None]:
#mandatory libraries
import os
import re
import string
import numpy as np
import pandas as pd

#plotting libraries
import venn
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


#NLTK libraries  & for data cleaning
import contractions
import nltk
from nltk.tree import Tree
from nltk.corpus import stopwords
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from wordcloud import WordCloud, STOPWORDS

#sk-learn libraries for vectorization and TSNE
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

#miscellaneous libraries
import warnings
warnings.filterwarnings("ignore")
from tqdm.notebook import tqdm
from itertools import combinations


In [None]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tree import Tree
from nltk.chunk import ne_chunk
from nltk.tokenize import word_tokenize
import contractions
from tqdm import tqdm
from wordcloud import STOPWORDS
import numpy as np

# Function to convert the input text to lower case
def convert_to_lower_case(text):
    return text.lower()

# Function to remove newline, tab, and slashes from the input text
def remove_escape_char(text):
    return re.sub(r"[\n\t\\\/]", " ", text, flags=re.MULTILINE)

# Function to remove HTML tags and its content from the input text
def remove_html_tags(text):
    return re.sub(r"<.*>", " ", text, flags=re.MULTILINE)

# Function to remove any kind of links without HTML tags
def remove_links(text):
    text = re.sub(r"http\S+", " ", text, flags=re.MULTILINE)
    return re.sub(r"www\S+", " ", text, flags=re.MULTILINE)

# Function to remove digits from the input text
def remove_digits(text):
    return re.sub(r'\d', " ", text, flags=re.MULTILINE)

# Function to remove punctuation marks from the input text
def remove_punctuation(text):
    for i in string.punctuation:
        text = text.replace(i, " ")
    return text

# Function to keep only alphabets and underscore
def keep_alpha_and_underscore(text):
    return re.sub(r"[^a-zA-Z_]", " ", text, flags=re.MULTILINE)

# Function to remove extra spaces if any
def remove_extra_spaces_if_any(text):
    return re.sub(r" {2,}", " ", text, flags=re.MULTILINE)

# Downloading necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')

# Printing stop words from NLTK library
stop_words = stopwords.words('english')
display_length = 10
for i in range(int(np.ceil(len(stop_words) / display_length))):
    print(stop_words[i * display_length:(i + 1) * display_length])

# Printing stop words from Word Cloud library
display_length = 10
word_cloud_stp_wrds = list(STOPWORDS)
for i in range(int(np.ceil(len(list(word_cloud_stp_wrds)) / display_length))):
    print(word_cloud_stp_wrds[i * display_length:(i + 1) * display_length])

# Creating a list of final stop words by combining NLTK and Word Cloud stop words,
# and adding custom words
final_stop_words = list(STOPWORDS.union(set(stop_words)))
final_stop_words.extend(["mr", "mrs", "miss", "one", "two", "three", "four", "five",
                         "six", "seven", "eight", "nine", "ten", "us", "also", "dont", "cant",
                         "any", "can", "along", "among", "during", "anyone", "a", "b", "c",
                         "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r",
                         "s", "t", "u", "v", "w", "x", "y", "z", "hi", "hello", "hey", "ok", "okay",
                         "lol", "rofl", "hola", "let", "may", "etc"])
display_length = 10
for i in range(int(np.ceil(len(final_stop_words) / display_length))):
    print(final_stop_words[i * display_length:(i + 1) * display_length])

lemmatiser = WordNetLemmatizer()




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
["you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']
['himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself']
['they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this']
['that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']
['been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing']
['a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until']
['while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into']
['through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down']
['in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once']
['here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each']
['few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only']
['own', 'same', 'so

In [None]:
# Preprocessing function
def preprocess(text):
    preprocessed_text = []
    for each_text in tqdm(text):
        result = remove_links(each_text)
        result = remove_html_tags(result)
        result = remove_escape_char(result)
        result = remove_digits(result)
        result = remove_punctuation(result)
        result = convert_to_lower_case(result)
        result = ' '.join(non_stop_word for non_stop_word in result.split() if non_stop_word not in final_stop_words)
        result = keep_alpha_and_underscore(result)
        result = remove_extra_spaces_if_any(result)
        result = ' '.join(lemmatiser.lemmatize(word, pos="v") for word in result.split())
        preprocessed_text.append(result.strip())
    return preprocessed_text
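
The regex-based steps of `preprocess` can be tried on a single string without any NLTK downloads. The sketch below applies the same patterns in the same order, but it is a simplified stand-in: the lemmatization step is omitted, and `clean_comment`, `toy_stop_words`, and `sample` are hypothetical names and values introduced only for this illustration. It also passes the stop words as a `set`, which is an assumption on my part (the notebook uses a list); set membership tests are constant-time on average.

```python
import re
import string

def clean_comment(text, stop_words):
    """Apply the cleaning steps of preprocess() in order, without lemmatization."""
    text = re.sub(r"http\S+", " ", text)      # links starting with http
    text = re.sub(r"www\S+", " ", text)       # links starting with www
    text = re.sub(r"<.*>", " ", text)         # greedy: first '<' to last '>' on a line
    text = re.sub(r"[\n\t\\\/]", " ", text)   # newlines, tabs, slashes
    text = re.sub(r"\d", " ", text)           # digits
    for ch in string.punctuation:             # punctuation (includes '_')
        text = text.replace(ch, " ")
    text = text.lower()
    text = " ".join(w for w in text.split() if w not in stop_words)
    text = re.sub(r"[^a-zA-Z_]", " ", text)   # anything that is not a letter
    text = re.sub(r" {2,}", " ", text)        # collapse repeated spaces
    return text.strip()

# Hypothetical stop words and sample comment, for illustration only.
toy_stop_words = {"the", "a", "is", "at", "see", "it"}
sample = "See it at http://example.com!\nThe page <b>is</b> GREAT, 100% great."

cleaned = clean_comment(sample, toy_stop_words)
print(cleaned)  # page great great
```

Note that the greedy `<.*>` pattern removes everything between the first `<` and the last `>` on a line, so text between two tags is dropped together with the tags.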


In [None]:
# Performing the preprocessing on all the comments in the dataset
preprocessed_data = preprocess(df['comment_text'].values)

100%|██████████| 159571/159571 [47:23<00:00, 56.12it/s]


In [None]:
df['comment_text']= preprocessed_data

In [None]:
df.to_csv("clean_comments.csv")

In [4]:
df = pd.read_csv("/content/clean_comments.csv")

In [None]:
df

Unnamed: 0.1,Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0000997932d777bf,explanation edit make username hardcore metall...,0,0,0,0,0,0
1,1,000103f0d9cfb60f,aww match background colour seemingly stick th...,0,0,0,0,0,0
2,2,000113f07ec002fd,man really try edit war guy constantly remove ...,0,0,0,0,0,0
3,3,0001b41b1c6bb37e,make real suggestions improvement wonder secti...,0,0,0,0,0,0
4,4,0001d958c54c6e35,sir hero chance remember page,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
159566,159566,ffe987279560d7ff,second time ask view completely contradict cov...,0,0,0,0,0,0
159567,159567,ffea4adeee384e90,ashamed horrible thing put talk page,0,0,0,0,0,0
159568,159568,ffee36eab5c267c9,umm actual article prostitution ring,0,0,0,0,0,0
159569,159569,fff125370e4aaaf3,look actually put speedy first version delete ...,0,0,0,0,0,0


In [5]:
df = df[['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

In [6]:
df['comment_text'].isnull().values.any()

True

After cleaning, some comments became empty strings (every token was removed), and pandas reads those empty fields back from the CSV as NaN.
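
A minimal sketch of why the NaN values appear: an empty string written to CSV comes back as a missing value when read with the default `read_csv` settings. The two-row frame below is made-up data, and an in-memory buffer stands in for `clean_comments.csv`.

```python
import io
import pandas as pd

# Hypothetical cleaned data: the second comment lost all its tokens.
toy = pd.DataFrame({"comment_text": ["sir hero chance", ""], "toxic": [0, 1]})

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

n_missing = int(restored["comment_text"].isnull().sum())
print(n_missing)  # 1

# Dropping the affected rows, as the notebook does next.
restored_clean = restored.dropna(subset=["comment_text"])
print(len(restored_clean))  # 1
```

Passing `index=False` to `to_csv`, as in this sketch, also avoids the extra `Unnamed: 0` columns that appear when the saved file is read back.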


In [7]:
df = df.dropna(subset=['comment_text'])


In [None]:
df

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,explanation edit make username hardcore metall...,0,0,0,0,0,0
1,aww match background colour seemingly stick th...,0,0,0,0,0,0
2,man really try edit war guy constantly remove ...,0,0,0,0,0,0
3,make real suggestions improvement wonder secti...,0,0,0,0,0,0
4,sir hero chance remember page,0,0,0,0,0,0
...,...,...,...,...,...,...,...
159566,second time ask view completely contradict cov...,0,0,0,0,0,0
159567,ashamed horrible thing put talk page,0,0,0,0,0,0
159568,umm actual article prostitution ring,0,0,0,0,0,0
159569,look actually put speedy first version delete ...,0,0,0,0,0,0


# The dataset is too large and crashed the Colab notebook multiple times, so I randomly select 50% of the data and train the models on this sample

In [8]:
sample_frac = 0.50 # Specify the desired fraction of the dataset

df = df.sample(frac=sample_frac, random_state=42)
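
Since some labels are rare, it can be worth checking that a random 50% sample keeps roughly the same label prevalence as the full data. The sketch below shows one way to do that; the frame `toy` and its label rates are synthetic assumptions, not statistics from `train.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic label columns with made-up prevalences (assumption for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "toxic": rng.binomial(1, 0.10, size=10_000),
    "threat": rng.binomial(1, 0.01, size=10_000),
})

# The same fraction and seed always select the same rows.
sample_a = toy.sample(frac=0.5, random_state=42)
sample_b = toy.sample(frac=0.5, random_state=42)

# Share of positive labels before and after sampling, side by side.
prevalence = pd.DataFrame({"full": toy.mean(), "sample": sample_a.mean()})
print(prevalence)
```

If a rare label drifts noticeably between the two columns, a stratified split such as `iterative_train_test_split` from scikit-multilearn (imported later in the notebook) may be worth trying instead of plain random sampling.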


In [10]:
#mandatory libraries
import os
import re
import numpy as np
import pandas as pd
import scipy
import string

#nltk-preprocessing
import nltk
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
from nltk.stem.wordnet import WordNetLemmatizer

#plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

#misc
import joblib
import warnings
warnings.filterwarnings("ignore")
from tqdm.notebook import tqdm
from itertools import combinations

#multi-processing
import multiprocessing
from multiprocessing import Pool,freeze_support
from multiprocessing import Process

#multi-label
from skmultilearn.model_selection import iterative_train_test_split
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset

#metrics
from sklearn.metrics import hamming_loss
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn.metrics import roc_curve, auc,roc_auc_score

#modelling
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

#Tensor flow for NLP
import tensorflow as tf
from tensorflow.keras.layers import Dense,Input,Activation,Dropout,BatchNormalization
from tensorflow.keras.models import Model,Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.callbacks import ReduceLROnPlateau
from keras.callbacks import Callback

#model loading
from tensorflow.keras.models import load_model

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


In [12]:
X = df['comment_text']
y = df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]


# Train test split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Text vectorization (TF-IDF)

In [14]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
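
The fit/transform split above matters: the vocabulary and IDF weights are learned from the training text only, and words that appear only in the test text are ignored. A minimal sketch on a made-up corpus (the documents below are illustrative, not taken from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
train_docs = ["edit war page", "talk page edit", "horrible thing"]
test_docs = ["edit horrible unseenword"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)   # learns the vocabulary and IDF weights
X_te = vec.transform(test_docs)        # reuses them; no refitting on test data

vocab = sorted(vec.vocabulary_)
print(vocab)       # ['edit', 'horrible', 'page', 'talk', 'thing', 'war']
print(X_tr.shape)  # (3, 6)
print(X_te.shape)  # (1, 6)
print(X_te.nnz)    # 2: 'unseenword' is not in the training vocabulary
```

Both matrices are SciPy sparse matrices with one column per vocabulary term, which is the format the classifiers below receive.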


# Modelling

Solving a multi-label classification problem is not straightforward. There is no single off-the-shelf algorithm or classifier for it, but we can either transform the problem or use "adapted" algorithms.

Different approaches to solve a multi-label classification problem, namely:

1. Problem Transformation

2. Adapted Algorithm

In [None]:
# need scikit-multilearn library for multi-label classification
!pip install scikit-multilearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Problem Transformation Methods
These include the Binary Relevance, Label Powerset and Classifier Chain methods.
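
As a rough illustration of what Binary Relevance does, the sketch below trains one independent binary classifier per label column by hand and compares the result with scikit-learn's `MultiOutputClassifier`, the wrapper used in the cells that follow. The data is synthetic and the choice of `LogisticRegression` is only an assumption for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic features and three binary labels (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([
    X[:, 0] > 0,
    X[:, 1] + X[:, 2] > 0,
    X[:, 3] > 0.5,
]).astype(int)

# Binary Relevance by hand: one classifier per label, each blind to the others.
manual_models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
manual_pred = np.column_stack([m.predict(X) for m in manual_models])

# The same strategy through the scikit-learn wrapper.
wrapped = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
wrapped_pred = wrapped.predict(X)

print(np.array_equal(manual_pred, wrapped_pred))
```

Because each label gets its own model, nothing in this setup can learn that, say, two labels tend to occur together.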



# Binary Relevance - MultinomialNB



In [None]:
# Train a single MultinomialNB classifier
classifier = MultinomialNB()

# Wrap the classifier with MultiOutputClassifier
br_classifier = MultiOutputClassifier(classifier)

# Fit the classifier to the training data
br_classifier.fit(X_train, y_train)



In [None]:
from sklearn.metrics import hamming_loss, accuracy_score, log_loss
from sklearn.multiclass import OneVsRestClassifier

# Convert the y_train to binary format
y_train_binary = y_train.values.astype(int)

# Train the OneVsRestClassifier with MultinomialNB
br_classifier = OneVsRestClassifier(MultinomialNB())
br_classifier.fit(X_train, y_train_binary)

# Make predictions on the test set
y_pred_binary = br_classifier.predict(X_test)
y_pred_prob= br_classifier.predict_proba(X_test)
# Calculate Hamming loss
hamming_loss_value = hamming_loss(y_test, y_pred_binary)
print("Hamming Loss:", hamming_loss_value)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_binary)
print("Accuracy:", accuracy)

# Calculate log loss
log_loss_value = log_loss(y_test, y_pred_prob)
print("Log Loss:", log_loss_value)


Hamming Loss: 0.03240159128978225
Accuracy: 0.8989740368509213
Log Loss: 0.3895355847069237




1. Hamming Loss: 0.03
   - This indicates that, on average, the model misclassifies around 3% of the individual labels (each comment contributes six label decisions). A lower Hamming loss is desirable, so the obtained value suggests that the model performs well in terms of label-wise accuracy.

2. Accuracy: 0.90
   - For multi-label targets, `accuracy_score` computes subset accuracy: a comment counts as correct only if all six of its labels are predicted correctly. The value therefore means that about 89.9% of the test comments have every label right, not that 89.9% of the labels are right. Accuracy alone might not be sufficient to evaluate the performance of a multi-label classification model, especially when dealing with imbalanced datasets or varying importance of different labels.

3. Log Loss: 0.39
   - Log loss measures the discrepancy between the predicted probabilities and the true labels. A lower log loss value indicates better alignment between the predicted probabilities and the true labels.
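
The difference between Hamming loss and subset accuracy is easy to verify on a tiny example. The arrays below are made up for illustration. The last part computes log loss separately for each label and averages the results; that mean column-wise log loss is an alternative I am sketching here, not the calculation used in the cells of this notebook, which pass the full label matrix to `log_loss` in one call.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, log_loss

# Hypothetical true labels, hard predictions, and predicted probabilities
# for 4 samples and 3 labels.
y_true = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 0]])
y_prob = np.array([[0.9, 0.2, 0.1],
                   [0.1, 0.8, 0.4],
                   [0.2, 0.1, 0.1],
                   [0.7, 0.4, 0.2]])

# Hamming loss: share of individual label cells that are wrong (2 of 12).
manual_hamming = np.mean(y_true != y_pred)

# Subset accuracy: share of rows where every label is right (2 of 4).
manual_subset_acc = np.mean(np.all(y_true == y_pred, axis=1))

print(manual_hamming, hamming_loss(y_true, y_pred))
print(manual_subset_acc, accuracy_score(y_true, y_pred))

# Binary log loss per label column, then the mean across labels.
per_label_log_loss = [
    log_loss(y_true[:, j], y_prob[:, j], labels=[0, 1])
    for j in range(y_true.shape[1])
]
mean_columnwise_log_loss = float(np.mean(per_label_log_loss))
print(mean_columnwise_log_loss)
```

Two wrong cells give a Hamming loss of about 0.17, yet they sit in two different rows, so subset accuracy drops to 0.5.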

# Binary Relevance - LogisticRegression

In [None]:
# LogisticRegression
classifier = LogisticRegression()
br_classifier = MultiOutputClassifier(classifier)
br_classifier.fit(X_train, y_train)

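# Predict hard labels with the fitted model (otherwise y_pred_binary
# would still hold the predictions of the previous classifier)
y_pred_binary = br_classifier.predict(X_test)

# Predict probabilities using each classifier
y_pred_proba = [clf.predict_proba(X_test)[:, 1] for clf in br_classifier.estimators_]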

# Reshape predicted probabilities
y_pred_proba_reshaped = np.array(y_pred_proba).T

hamming_loss_value = hamming_loss(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)
log_loss_value = log_loss(y_test, y_pred_proba_reshaped)
print("LogisticRegression - Hamming Loss:", hamming_loss_value)
print("LogisticRegression - Accuracy:", accuracy)
print("LogisticRegression - Log Loss:", log_loss_value)


LogisticRegression - Hamming Loss: 0.021496370742601897
LogisticRegression - Accuracy: 0.9141541038525963
LogisticRegression - Log Loss: 0.2842339309438655


# Binary Relevance - DecisionTreeClassifier

In [None]:
# DecisionTreeClassifier
classifier = DecisionTreeClassifier()
br_classifier = MultiOutputClassifier(classifier)
br_classifier.fit(X_train, y_train)

y_pred_binary = br_classifier.predict(X_test)

y_pred_proba = [clf.predict_proba(X_test)[:, 1] for clf in br_classifier.estimators_]
# Reshape predicted probabilities
y_pred_proba_reshaped = np.array(y_pred_proba).T

hamming_loss_value = hamming_loss(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)
log_loss_value = log_loss(y_test, y_pred_proba_reshaped)

print("DecisionTreeClassifier - Hamming Loss:", hamming_loss_value)
print("DecisionTreeClassifier - Accuracy:", accuracy)
print("DecisionTreeClassifier - Log Loss:", log_loss_value)


DecisionTreeClassifier - Hamming Loss: 0.02568397543271915
DecisionTreeClassifier - Accuracy: 0.8885050251256281
DecisionTreeClassifier - Log Loss: 1.695421880141967


# Binary Relevance - RF


In [None]:

# RandomForestClassifier
classifier = RandomForestClassifier()
br_classifier = MultiOutputClassifier(classifier)
br_classifier.fit(X_train, y_train)
y_pred_binary = br_classifier.predict(X_test)
y_pred_proba = [clf.predict_proba(X_test)[:, 1] for clf in br_classifier.estimators_]
# Reshape predicted probabilities
y_pred_proba_reshaped = np.array(y_pred_proba).T

hamming_loss_value = hamming_loss(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)
log_loss_value = log_loss(y_test, y_pred_proba_reshaped)
print("RandomForestClassifier - Hamming Loss:", hamming_loss_value)
print("RandomForestClassifier - Accuracy:", accuracy)
print("RandomForestClassifier - Log Loss:", log_loss_value)



RandomForestClassifier - Hamming Loss: 0.020292434394193187
RandomForestClassifier - Accuracy: 0.9135259631490787
RandomForestClassifier - Log Loss: 0.3965953845384737


# Binary Relevance - XGB


In [None]:
# XGBClassifier
classifier = XGBClassifier()
br_classifier = MultiOutputClassifier(classifier)
br_classifier.fit(X_train, y_train)
y_pred_binary = br_classifier.predict(X_test)

y_pred_proba = [clf.predict_proba(X_test)[:, 1] for clf in br_classifier.estimators_]
# Reshape predicted probabilities
y_pred_proba_reshaped = np.array(y_pred_proba).T

hamming_loss_value = hamming_loss(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)
log_loss_value = log_loss(y_test, y_pred_proba_reshaped)
print("XGBClassifier - Hamming Loss:", hamming_loss_value)
print("XGBClassifier - Accuracy:", accuracy)
print("XGBClassifier - Log Loss:", log_loss_value)

XGBClassifier - Hamming Loss: 0.02065884980457845
XGBClassifier - Accuracy: 0.91321189279732
XGBClassifier - Log Loss: 0.2924138137729429


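Among these models, RandomForest, XGBoost, and LogisticRegression perform similarly in terms of Hamming loss (about 0.020 to 0.021) and accuracy (about 0.913 to 0.914). LogisticRegression has the lowest log loss (0.284), closely followed by XGBoost (0.292).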



# The Binary Relevance method does not take into account the interdependence of labels; it simply creates a separate classifier for each label.

# Classifier chain

Classifier Chains is another simple technique. Unlike Binary Relevance, a Classifier Chain preserves the relationship between the labels. It works as follows:

Classifier 1 takes all the inputs and is fit on the first target label alone. Classifier 2 takes all the inputs together with the first target label and is fit on the second label. Classifier 3 takes all the inputs together with the first and second target labels and is fit on the third label, and so on.

In general, the first classifier is trained on the input data only, and each subsequent classifier is trained on the input space plus the labels handled by all previous classifiers in the chain.
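
The chaining described above can be sketched in a few lines of NumPy. This is a simplified, dense-array illustration under my own assumptions (synthetic data, `LogisticRegression` as the base model, hypothetical helper names `fit_chain` and `predict_chain`); it is not the scikit-multilearn implementation used in the cells below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y):
    """Fit one model per label; model j also sees the true labels 0..j-1."""
    models = []
    X_aug = X
    for j in range(Y.shape[1]):
        models.append(LogisticRegression().fit(X_aug, Y[:, j]))
        # Append the true label j as an extra feature for the next model.
        X_aug = np.hstack([X_aug, Y[:, [j]]])
    return models

def predict_chain(models, X):
    """Predict label by label, feeding each prediction to the next model."""
    X_aug = X
    preds = []
    for model in models:
        p = model.predict(X_aug)
        preds.append(p)
        # At prediction time the true labels are unknown, so predictions are used.
        X_aug = np.hstack([X_aug, p.reshape(-1, 1)])
    return np.column_stack(preds)

# Synthetic data: the second label depends on the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
first = X[:, 0] > 0
Y = np.column_stack([first, first & (X[:, 1] > 0), X[:, 2] > 0]).astype(int)

models = fit_chain(X, Y)
chain_pred = predict_chain(models, X)

# Each model in the chain sees one more input column than the previous one.
print([m.n_features_in_ for m in models])  # [4, 5, 6]
print(chain_pred.shape)                    # (200, 3)
```

The order of the labels in the chain is a design choice: a label can only benefit from labels that come before it.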

In [None]:
# Multinomial Naive Bayes
cc_classifier_mnb = ClassifierChain(classifier=MultinomialNB())
cc_classifier_mnb.fit(X_train, y_train)

y_pred_mnb = cc_classifier_mnb.predict(X_test)
hamming_loss_mnb = hamming_loss(y_test, y_pred_mnb)

y_pred= cc_classifier_mnb.predict_proba(X_test)

log_loss_mnb = log_loss(y_test, y_pred.toarray())
accuracy_mnb = accuracy_score(y_test, y_pred_mnb)

print("Hamming Loss (MultinomialNB):", hamming_loss_mnb)
print("Log Loss (MultinomialNB):", log_loss_mnb)
print("Accuracy (MultinomialNB):", accuracy_mnb)

Hamming Loss (MultinomialNB): 0.030220547180346176
Log Loss (MultinomialNB): 0.40508569313919496
Accuracy (MultinomialNB): 0.9027428810720268


In [None]:
# Logistic Regression
from skmultilearn.problem_transform import ClassifierChain
from sklearn.linear_model import LogisticRegression
# initialize classifier chains multi-label classifier
classifier = ClassifierChain(LogisticRegression())
# Training logistic regression model on train data
classifier.fit(X_train, y_train)


In [None]:

y_pred= classifier.predict(X_test)
y_pred_prob= classifier.predict_proba(X_test)

# Metrics for the logistic regression classifier chain
hamming_loss_lr = hamming_loss(y_test, y_pred)
log_loss_lr = log_loss(y_test, y_pred_prob.toarray())
accuracy_lr = accuracy_score(y_test, y_pred)

print("Hamming Loss :", hamming_loss_lr)
print("Log Loss :", log_loss_lr)
print("Accuracy:", accuracy_lr)

Hamming Loss : 0.023345896147403684
Log Loss : 0.3250033568561115
Accuracy: 0.9104899497487438


In [15]:
#XGB
from skmultilearn.problem_transform import ClassifierChain
from sklearn.metrics import hamming_loss, log_loss, accuracy_score
from xgboost import XGBClassifier

# Create ClassifierChain with XGBoost classifier
cc_classifier_xgb = ClassifierChain(classifier=XGBClassifier())

# Train the classifier
cc_classifier_xgb.fit(X_train, y_train)

In [16]:
# Make predictions on the test set
y_pred_xgb = cc_classifier_xgb.predict(X_test)
y_pred = cc_classifier_xgb.predict_proba(X_test)
# Calculate Hamming loss
hamming_loss_xgb = hamming_loss(y_test, y_pred_xgb)
print("Hamming Loss (XGBoost):", hamming_loss_xgb)

# Calculate log loss
log_loss_xgb = log_loss(y_test, y_pred.toarray())
print("Log Loss (XGBoost):", log_loss_xgb)

# Calculate accuracy
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print("Accuracy (XGBoost):", accuracy_xgb)


Hamming Loss (XGBoost): 0.038662486938349006
Log Loss (XGBoost): 0.6169338468618949
Accuracy (XGBoost): 0.8683385579937304


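With both approaches (binary relevance and classifier chain), logistic regression achieved the lowest log loss. Within the classifier chain it also achieved the lowest Hamming loss; under binary relevance, random forest and XGBoost had marginally lower Hamming loss (0.0203 and 0.0207 versus 0.0215). We kept Hamming loss as an evaluation metric because we knew the dataset is not balanced.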

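Compared with Binary Relevance, the Classifier Chain metrics improved only very slightly, and only for MultinomialNB (Hamming loss 0.0324 to 0.0302, accuracy 0.899 to 0.903), possibly because the chain is able to exploit the correlation between the labels. For logistic regression and XGBoost, the classifier chain results shown above are no better than their Binary Relevance counterparts.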

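In the classifier chain, logistic regression again gives the strongest results (Hamming loss 0.0233, log loss 0.325, accuracy 0.910), while the XGBoost run shown above is weaker (Hamming loss 0.0387, log loss 0.617, accuracy 0.868).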

The Label Powerset approach is also a good option and preserves the correlations and dependencies between the labels. Its main disadvantage is that the number of possible label combinations, each of which becomes its own class, grows exponentially with the number of target labels. Training such a large multi-class problem becomes much more complex, and many classes have very few examples, which tends to lower accuracy.
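
The Label Powerset transformation itself is simple to sketch: every distinct row of the label matrix becomes one class of a multi-class problem. The small matrix below is made up for illustration; the last lines show the upper bound on the number of classes for the six labels in this dataset.

```python
import numpy as np

# Hypothetical label matrix: 5 samples, 3 labels.
Y = np.array([
    [0, 0, 0],
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
])

# Each distinct label combination becomes one class id.
combos, class_ids = np.unique(Y, axis=0, return_inverse=True)
class_ids = class_ids.ravel()  # keep a flat vector across NumPy versions

print(combos)     # the 4 distinct combinations, in sorted order
print(class_ids)  # [0 2 2 1 3]

# Predicted class ids map back to label rows by indexing into combos.
decoded = combos[class_ids]

# Upper bound on the number of classes with 6 binary labels.
n_labels = 6
max_classes = 2 ** n_labels
print(max_classes)  # 64
```

Only combinations that actually occur in the training data become classes, so a combination first seen at prediction time can never be predicted.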

# Conclusion


There are two primary approaches commonly employed to address multi-label classification problems: problem transformation methods and algorithm adaptation methods.

Problem transformation methods involve converting the multi-label problem into one or more single-label problems: several binary problems for Binary Relevance and Classifier Chain, or one multi-class problem for Label Powerset. This enables the use of ordinary single-label classifiers on each transformed problem. I have mainly implemented problem transformation methods in this notebook.

On the other hand, algorithm adaptation methods focus on modifying the algorithms to directly handle multi-label classification. Instead of simplifying the problem by conversion, these methods aim to tackle the problem in its original, comprehensive form.

However, it is worth noting that these methods require a significant amount of time to process the dataset. To mitigate this, the experiments were run with problem transformation methods on a random 50% subset of the training data.
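
For completeness, here is a rough sketch of the second family, where a single model handles all labels directly instead of relying on a transformation. It uses scikit-learn's `KNeighborsClassifier`, which accepts a two-dimensional label matrix, as a stand-in. This is my own substitution for illustration on synthetic data; it is not the `MLkNN` class imported at the top of the notebook, and it was not part of the experiments above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features and three binary labels (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
Y = np.column_stack([
    X[:, 0] > 0,
    X[:, 1] > 0,
    (X[:, 0] + X[:, 1]) > 1,
]).astype(int)

# One model is fit on the whole label matrix; no wrapper is involved.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, Y)
pred = knn.predict(X[:10])

print(pred.shape)  # (10, 3)
```

Neighbor search over a large TF-IDF matrix is expensive at prediction time, which is consistent with the runtime concern raised above.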