## Multi-label classification  

Often, we encounter data that can be assigned to more than one category (for example, movie genres or the items in an image).  
However, typical classification tasks predict a single label, because they treat classes as mutually exclusive.   

Multi-label classification is the supervised learning problem in which an instance may be associated with multiple labels. This contrasts with the traditional task of single-label classification (i.e., multi-class or binary), where each instance is associated with exactly one class label. 
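
As a small illustration of the difference, the sketch below encodes a hypothetical set of movie genres as a label indicator matrix with scikit-learn's `MultiLabelBinarizer`. The movies and genres are invented for the example; each row may contain more than one 1, which a single-label target cannot express.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre labels, one list of genres per movie
genres = [
    ["comedy"],
    ["action", "comedy"],
    ["drama", "action"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# Columns follow the sorted class names
print(list(mlb.classes_))  # ['action', 'comedy', 'drama']
print(Y.tolist())          # [[0, 1, 0], [1, 1, 0], [1, 0, 1]]
```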

  

### Techniques   

There are two main categories of methods for solving the multi-label classification problem:  
* problem transformation methods and 
* algorithm adaptation methods 

In the first case, the learning task is transformed into one or more single-label classification tasks. 
In the second, the algorithms themselves are adapted so that they can handle multi-label data.   
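
To make "problem transformation" concrete, here is a minimal NumPy sketch of two common ways to turn a label matrix into single-label targets. The matrix `Y` is made up for illustration: splitting it by column gives one binary target per label, while treating each distinct row as its own class gives a single multi-class target.

```python
import numpy as np

# Made-up label indicator matrix: 4 instances, 3 labels
Y = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
])

# One binary target per label (each column on its own)
binary_targets = [Y[:, j] for j in range(Y.shape[1])]

# One multi-class target: every distinct label combination becomes a class id
combos, class_ids = np.unique(Y, axis=0, return_inverse=True)
class_ids = class_ids.ravel()

print(binary_targets[0].tolist())  # [1, 0, 1, 1]
print(class_ids.tolist())          # [1, 0, 1, 2]
```

The first view grows linearly with the number of labels; the second can grow with the number of distinct label combinations seen in the data.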


<br />

The dataset used here is GoEmotions.  
It was released by Google and consists of Reddit comments annotated with the emotions they express.  
Its authors describe it as the largest manually annotated dataset of its kind: 58K English Reddit comments, labeled for 27 emotion categories or neutral.  
The paper is available on [arXiv.org](https://arxiv.org/abs/2005.00547).

In [2]:
import pathlib
import pandas as pd 
import numpy as np 
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import PorterStemmer 
import re 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer


In [3]:
dataset = pathlib.Path.cwd() / 'Datasets/train.tsv'
df = pd.read_csv(dataset, sep='\t', header=None, names=['comment', 'label', 'id'])
df['label'] = df['label'].str.split(',')

In [4]:
emotion_list = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment',                     
                'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism',                 
                'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']

enkman_mapping = {
        "anger": ["anger", "annoyance", "disapproval"],
        "disgust": ["disgust"],
        "fear": ["fear", "nervousness"],
        "joy": ["joy", "amusement", "approval", "excitement", "gratitude",  "love", "optimism", "relief", "pride", "admiration", "desire", "caring"],
        "sadness": ["sadness", "disappointment", "embarrassment", "grief",  "remorse"],
        "surprise": ["surprise", "realization", "confusion", "curiosity"],
        "neutral": ["neutral"],
        }
enkman_mapping_rev = {v:key for key, value in enkman_mapping.items() for v in value}

In [5]:
# function from Google Research analysis 
def idx2class(idx_list):
    arr = []
    for i in idx_list:
        arr.append(emotion_list[int(i)])
    return arr

In [6]:
# add emotion label to the label ids
df['emotions'] = df['label'].apply(idx2class)

# use the Ekman mapping (stored in enkman_mapping) to reduce the emotions to ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'neutral']
df['mapped_emotions'] = df['emotions'].apply(lambda x: [enkman_mapping_rev[i] for i in x])

# remove duplicates, e.g. ['joy', 'joy'] becomes ['joy'] (dict.fromkeys keeps first-seen order)
df['mapped_emotions'] = df['mapped_emotions'].apply(lambda x: list(dict.fromkeys(x)))
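
The mapping and de-duplication steps can be checked on a tiny example. This is a sketch with a reduced, illustrative mapping (not the full table above); the names `mapping`, `reverse` and `map_emotions` are hypothetical helpers introduced only here.

```python
# Reduced, illustrative mapping from coarse to fine-grained emotions
mapping = {
    "anger": ["anger", "annoyance", "disapproval"],
    "joy": ["joy", "amusement", "love"],
}
# Reverse lookup: fine-grained emotion -> coarse emotion
reverse = {fine: coarse for coarse, fines in mapping.items() for fine in fines}


def map_emotions(fine_labels):
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(reverse[label] for label in fine_labels))


print(map_emotions(["annoyance", "anger"]))   # ['anger']
print(map_emotions(["love", "disapproval"]))  # ['joy', 'anger']
```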

(Simple) text pre-processing and TF-IDF representation

_NOTICE_   
`r/` marks a subreddit (a Reddit category)   
Example: 'r/hockey has no love for us! Just stay here with all us cool people!'

\[NAME] is a placeholder that replaces a word that may refer to a brand or a person  
Example: 'How have \[NAME] and \[NAME] looked tonight? I was watching the Huskies game during the first period.'
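
Both conventions can be stripped with regular expressions. The following is a minimal standalone sketch, assuming the goal is to drop the placeholder and the `r/` prefix while keeping the subreddit name; `strip_reddit_markers` is a hypothetical helper, not part of the notebook.

```python
import re


def strip_reddit_markers(text):
    # Drop the [NAME] placeholder, whatever its casing
    text = re.sub(r"\[name\]", "", text, flags=re.IGNORECASE)
    # Drop the "r/" prefix but keep the subreddit name that follows it
    text = re.sub(r"\br/", "", text)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()


print(strip_reddit_markers("r/hockey has no love for us!"))
# hockey has no love for us!
print(strip_reddit_markers("How have [NAME] and [NAME] looked tonight?"))
# How have and looked tonight?
```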

In [7]:
stemmer = PorterStemmer()
stopword_list = stopwords.words('english')


def process_reddit_comment(strng):
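    # remove the [NAME] placeholder (the text is lowercased before this function is called)
    processed_strng = re.sub(r'\[name\]', '', strng)
    # remove the "r/" subreddit prefix, keeping the subreddit name
    processed_strng = re.sub(r'\br/', '', processed_strng)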
    return processed_strng


def punct_remover(strng):
    # punctuation marks to be completely removed
    clean_strng = re.sub(r'[?!\'"#|]', '', strng)
    # punctuation marks to be replaced with a space
    clean_strng = re.sub(r'[.,()/]', ' ', clean_strng)
    # replace runs of spaces with a single space
    clean_strng = re.sub(r' +', ' ', clean_strng)

    return clean_strng


def tokenize_stem_no_stopwords(strng):
    return [stemmer.stem(w) for w in word_tokenize(strng) if w not in stopword_list]

In [8]:
# lowercase and remove punctuation 
df['processed_comment'] = df['comment'].str.lower()
df['processed_comment'] = df['processed_comment'].apply(process_reddit_comment)
df['processed_comment'] = df['processed_comment'].apply(punct_remover)
df['processed_comment'] = df['processed_comment'].apply(tokenize_stem_no_stopwords)

Emotions are represented as columns.  
If a Reddit comment is labeled with emotion x, the x column holds a 1; otherwise it holds a 0.

In [9]:
# one indicator column per emotion: 1 if the comment carries that emotion, else 0
for emotion in enkman_mapping.keys():
    df[emotion] = df['mapped_emotions'].apply(lambda x: 1 if emotion in x else 0)

In [10]:
X_train, X_test = train_test_split(df, random_state=156, test_size=0.25, shuffle=True)

In [11]:
tfidf = TfidfVectorizer()

# tokens are joined back into strings; the vocabulary is fitted on the training split only
x_train = tfidf.fit_transform(X_train['processed_comment'].apply(lambda x: ' '.join(x)))
x_test = tfidf.transform(X_test['processed_comment'].apply(lambda x: ' '.join(x)))
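
The fit/transform split used above can be seen on a toy corpus. In this sketch the documents are invented; the point is that the vocabulary comes from the training documents only, so a word that appears only at test time is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented, already pre-processed documents
train_docs = ["love hockey game", "hate game", "love love"]
test_docs = ["love unknownword"]

vec = TfidfVectorizer()
x_tr = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
x_te = vec.transform(test_docs)       # reuses them, ignoring unseen words

print(sorted(vec.vocabulary_))  # ['game', 'hate', 'hockey', 'love']
print(x_tr.shape, x_te.shape)   # (3, 4) (1, 4)
```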

### Modeling 

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, hamming_loss, label_ranking_loss
from sklearn.multiclass import OneVsRestClassifier
from skmultilearn.problem_transform import BinaryRelevance, ClassifierChain, LabelPowerset
from skmultilearn.ensemble import RakelD, RakelO
from sklearn.naive_bayes import GaussianNB

OneVsRest

In [42]:
for emotion in enkman_mapping.keys():
    print(f'OneVsRest classification for the emotion of {emotion}')
    clf = OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)
    clf.fit(x_train, X_train[emotion])
    y_pred = clf.predict(x_test)
    print(f'Accuracy on {emotion} is {accuracy_score(X_test[emotion], y_pred)*100:.2f}% \n')


OneVsRest classification for the emotion of anger
Accuracy on anger is 88.12% 

OneVsRest classification for the emotion of disgust
Accuracy on disgust is 98.14% 

OneVsRest classification for the emotion of fear
Accuracy on fear is 98.28% 

OneVsRest classification for the emotion of joy
Accuracy on joy is 82.13% 

OneVsRest classification for the emotion of sadness
Accuracy on sadness is 93.56% 

OneVsRest classification for the emotion of surprise
Accuracy on surprise is 88.30% 

OneVsRest classification for the emotion of neutral
Accuracy on neutral is 72.37% 



### Problem transformation methods 
1. Binary Relevance
2. ClassifierChain
3. Label Powerset
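
Before using the library classes, the first two ideas can be sketched by hand with plain scikit-learn. This is an illustration on synthetic data, not the scikit-multilearn implementation: binary relevance fits one independent classifier per label, while a classifier chain also passes earlier labels on as extra features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the second label depends on the first one
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y0 = (X[:, 0] > 0).astype(int)
y1 = ((X[:, 1] > 0) & (y0 == 1)).astype(int)
Y = np.column_stack([y0, y1])

# Binary relevance: one independent classifier per label column
br_models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
br_pred = np.column_stack([m.predict(X) for m in br_models])

# Classifier chain: the second classifier also sees the first label
first = LogisticRegression().fit(X, Y[:, 0])
second = LogisticRegression().fit(np.column_stack([X, Y[:, 0]]), Y[:, 1])
p0 = first.predict(X)
# At prediction time the true label is unknown, so the predicted one is used
p1 = second.predict(np.column_stack([X, p0]))
chain_pred = np.column_stack([p0, p1])

print(br_pred.shape, chain_pred.shape)  # (80, 2) (80, 2)
```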

In [13]:
from skmultilearn.problem_transform import BinaryRelevance

clf = BinaryRelevance(GaussianNB())
clf.fit(x_train, X_train[enkman_mapping.keys()])
y_pred = clf.predict(x_test)

print(f'Accuracy score is {accuracy_score(X_test[enkman_mapping.keys()], y_pred)*100:.2f}% \n')
print(f'Hamming loss is {hamming_loss(X_test[enkman_mapping.keys()], y_pred)*100:.2f}% \n')


Accuracy score is 5.06% 

Hamming loss is 46.58% 
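
The two metrics measure different things, which is why they can look so far apart. On multi-label targets, scikit-learn's `accuracy_score` is the subset (exact-match) accuracy: a row counts as correct only if every label matches. `hamming_loss` is the fraction of individual label cells that are wrong. A small worked example with made-up predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up ground truth and predictions: 4 instances, 3 labels
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

subset_acc = accuracy_score(y_true, y_pred)  # 2 of 4 rows match exactly
ham = hamming_loss(y_true, y_pred)           # 2 of 12 cells are wrong

print(subset_acc)     # 0.5
print(round(ham, 4))  # 0.1667
```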



In [20]:
from skmultilearn.problem_transform import ClassifierChain

clf = ClassifierChain(classifier=GaussianNB())
clf.fit(x_train, X_train[enkman_mapping.keys()])
y_pred = clf.predict(x_test)

print(f'Accuracy score is {accuracy_score(X_test[enkman_mapping.keys()], y_pred)*100:.2f}% \n')
print(f'Hamming loss is {hamming_loss(X_test[enkman_mapping.keys()], y_pred)*100:.2f}% \n')

Accuracy score is 5.15% 

Hamming loss is 46.52% 



In [21]:
from skmultilearn.problem_transform import  LabelPowerset

clf = LabelPowerset(classifier=GaussianNB())
clf.fit(x_train, X_train[enkman_mapping.keys()])
y_pred = clf.predict(x_test)

print(f'Accuracy score is {accuracy_score(X_test[enkman_mapping.keys()], y_pred)*100:.2f}% \n')
print(f'Hamming loss is {hamming_loss(X_test[enkman_mapping.keys()], y_pred)*100:.2f}% \n')

Accuracy score is 13.47% 

Hamming loss is 25.34% 



### Binary Relevance Using a Feed-Forward Neural Network

In [28]:
from skmultilearn.ext import Keras
from keras.models import Sequential
from keras.layers import Dense

def create_model_single_class(input_dim, output_dim):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=input_dim, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(output_dim, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [29]:
KERAS_PARAMS = dict(epochs=10, batch_size=100, verbose=0)

clf = BinaryRelevance(classifier=Keras(create_model_single_class, False, KERAS_PARAMS), require_dense=[True,True])
clf.fit(x_train, X_train[enkman_mapping.keys()])
y_pred = clf.predict(x_test)

print(f'Accuracy score is {accuracy_score(X_test[enkman_mapping.keys()], y_pred)*100:.2f}% \n')
print(f'Hamming loss is {hamming_loss(X_test[enkman_mapping.keys()], y_pred)*100:.2f}% \n')



Accuracy score is 40.73% 

Hamming loss is 13.61% 



### Algorithm Adaptation methods 
1. BRkNNaClassifier
2. BRkNNbClassifier
3. MLkNN
4. MLTSVM
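
As a rough illustration of what "adapting" an algorithm means, scikit-learn's `KNeighborsClassifier` accepts a label indicator matrix directly and votes on each label among the nearest neighbours. This sketch uses invented one-dimensional data and is not the BRkNN or MLkNN implementation from scikit-multilearn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented data: one feature, two labels per instance
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
Y = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [1, 1]])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, Y)  # the multi-label target is passed as-is

# Each label is decided by a majority vote among the 3 nearest neighbours
pred = knn.predict(np.array([[0.05], [1.05]]))
print(pred.tolist())  # [[1, 0], [0, 1]]
```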

A toy dataset is used here for time efficiency.

In [32]:
from skmultilearn.dataset import load_dataset

X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')

emotions:train - exists, not redownloading
emotions:test - exists, not redownloading


In [42]:
from skmultilearn.adapt import BRkNNaClassifier

clf = BRkNNaClassifier(k=3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f'Accuracy score is {accuracy_score(y_test, y_pred)*100:.2f}% \n')

Accuracy score is 19.31% 



In [43]:
from skmultilearn.adapt import BRkNNbClassifier

clf = BRkNNbClassifier(k=3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f'Accuracy score is {accuracy_score(y_test, y_pred)*100:.2f}% \n')

Accuracy score is 4.46% 



In [44]:
from skmultilearn.adapt import MLkNN

clf = MLkNN(k=3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f'Accuracy score is {accuracy_score(y_test, y_pred)*100:.2f}% \n')
print(f'Hamming loss is {hamming_loss(y_test, y_pred)*100:.2f}% \n')

Accuracy score is 19.31% 

Hamming loss is 29.54% 



In [69]:
from skmultilearn.adapt import MLTSVM

clf =  MLTSVM(c_k = 2**-1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f'Accuracy score is {accuracy_score(y_test, y_pred)*100:.2f}% \n')
# label_ranking_loss is defined on scores; here it receives the hard 0/1 predictions
print(f'Ranking Loss is {label_ranking_loss(y_test, y_pred)*100:.2f}% \n')




Accuracy score is 7.43% 

Ranking Loss is 78.29% 

