## Multi-label classification  

Often, we may encounter data that can be classified into more than one category (for example, the genres of a movie or the items in an image).  
However, typical classification tasks predict a single label, because they treat classes as mutually exclusive.   

Multi-Label Classification is the supervised learning problem where an instance may be associated with multiple labels. This is opposed to the traditional task of single-label classification (i.e., multi-class, or binary) where each instance is only associated with a single class label. 

  

### Techniques   

There are two main categories of methods for solving the multi-label classification problem:  
* problem transformation methods and 
* algorithm adaptation methods 

In the first case, the learning task is transformed into one or more single-label classification tasks. 
In the second, the algorithms are adapted so that they can handle multi-label data.   
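
As a rough illustration of problem transformation, the sketch below converts a toy multi-label dataset into (a) one binary target vector per label, often called binary relevance, and (b) a single multi-class target whose classes are the distinct label combinations, often called label powerset. The sample texts, label names, and helper functions are made up for illustration and are not part of this notebook's pipeline.

```python
# Toy multi-label data: each instance carries a set of labels (hypothetical examples).
samples = [
    ("great goal, what a game", {"joy", "surprise"}),
    ("this referee is terrible", {"anger"}),
    ("i am scared and sad", {"fear", "sadness"}),
    ("what a game", {"joy", "surprise"}),
]
labels = sorted({label for _, label_set in samples for label in label_set})


def binary_relevance_targets(samples, labels):
    """One independent binary target vector per label."""
    return {
        label: [1 if label in label_set else 0 for _, label_set in samples]
        for label in labels
    }


def label_powerset_targets(samples):
    """One multi-class target: each distinct label combination becomes a class id."""
    class_ids = {}
    targets = []
    for _, label_set in samples:
        key = tuple(sorted(label_set))
        class_ids.setdefault(key, len(class_ids))
        targets.append(class_ids[key])
    return targets, class_ids


br_targets = binary_relevance_targets(samples, labels)
lp_targets, lp_classes = label_powerset_targets(samples)

print(br_targets["joy"])  # binary target for the single label 'joy'
print(lp_targets)         # one class id per instance
```

Each binary target vector can then be handed to any ordinary single-label classifier, whereas algorithm adaptation methods skip this step and consume the label sets directly.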


<br />

The dataset used here is GoEmotions.  
It was released by Google and contains Reddit comments together with the emotions detected in them.  
It is described as the largest manually annotated dataset of 58K English Reddit comments, labeled for 27 emotion categories or neutral.  
Find the paper on [arXiv.org](https://arxiv.org/abs/2005.00547)

In [1]:
import pathlib
import pandas as pd 
import numpy as np 
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import PorterStemmer
import re 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
dataset = pathlib.Path.cwd() / 'Datasets/train.tsv'
df = pd.read_csv(dataset, sep='\t', header=None, names=['comment', 'label', 'id'])
df['label'] = df['label'].str.split(',')

In [3]:
emotion_list = [
    'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion',
    'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment',
    'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism',
    'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral',
]

enkman_mapping = {
        "anger": ["anger", "annoyance", "disapproval"],
        "disgust": ["disgust"],
        "fear": ["fear", "nervousness"],
        "joy": ["joy", "amusement", "approval", "excitement", "gratitude", "love", "optimism",
                "relief", "pride", "admiration", "desire", "caring"],
        "sadness": ["sadness", "disappointment", "embarrassment", "grief", "remorse"],
        "surprise": ["surprise", "realization", "confusion", "curiosity"],
        "neutral": ["neutral"],
        }
enkman_mapping_rev = {v:key for key, value in enkman_mapping.items() for v in value}

In [4]:
# function from Google Research analysis 
def idx2class(idx_list):
    arr = []
    for i in idx_list:
        arr.append(emotion_list[int(i)])
    return arr

In [5]:
# add emotion label to the label ids
df['emotions'] = df['label'].apply(idx2class)

# use the Ekman mapping to reduce the emotions to ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'neutral']
df['mapped_emotions'] = df['emotions'].apply(lambda x: [enkman_mapping_rev[i] for i in x])

# drop duplicates such as ['joy', 'joy'] that appear when two fine-grained emotions map to the same group
# (dict.fromkeys keeps the first occurrence of each emotion and preserves order)
df['mapped_emotions'] = df['mapped_emotions'].apply(lambda x: list(dict.fromkeys(x)))
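
The mapping steps above can be traced on a single comment. The sketch below uses a shortened, illustrative copy of the fine-to-coarse mapping and an order-preserving de-duplication via `dict.fromkeys`; the names `toy_mapping`, `toy_mapping_rev`, and `map_and_dedupe` are hypothetical and exist only for this example.

```python
# Shortened, illustrative copy of the fine-to-coarse mapping (not the full table).
toy_mapping = {
    "joy": ["joy", "amusement", "love"],
    "anger": ["anger", "annoyance"],
    "neutral": ["neutral"],
}
# Invert it: fine-grained emotion -> coarse emotion.
toy_mapping_rev = {fine: coarse for coarse, fines in toy_mapping.items() for fine in fines}


def map_and_dedupe(fine_emotions):
    """Map fine-grained emotions to coarse ones, dropping repeats but keeping order."""
    coarse = [toy_mapping_rev[e] for e in fine_emotions]
    return list(dict.fromkeys(coarse))


print(map_and_dedupe(["amusement", "love"]))       # both fine labels collapse to one coarse label
print(map_and_dedupe(["annoyance", "amusement"]))  # two distinct coarse labels survive
```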

### (Simple) text pre-processing and TF-IDF representation

_NOTICE_   
`r/` is the prefix of a subreddit, i.e. a Reddit category.   
Example: 'r/hockey has no love for us! Just stay here with all us cool people!'

\[NAME] is a placeholder that replaces a word that may refer to a brand or a person.  
Example: 'How have \[NAME] and \[NAME] looked tonight? I was watching the Huskies game during the first period.'

In [6]:
stemmer = PorterStemmer()
stopword_list = stopwords.words('english')


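def process_reddit_comment(strng):
    # remove the [NAME] placeholder (comments are lowercased before this call)
    processed_strng = re.sub(r'\[name\]', '', strng)
    # remove the subreddit prefix, e.g. 'r/hockey' or '/r/hockey' -> 'hockey'
    processed_strng = re.sub(r'/?\br/', '', processed_strng)
    return processed_strng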


def punct_remover(strng):
    # punctuation marks to be completely removed ('|' is a literal character inside a class)
    clean_strng = re.sub(r'[?!\'"#|]', '', strng)
    # punctuation marks to be replaced with a space
    clean_strng = re.sub(r'[.,()/]', ' ', clean_strng)
    # replace multi-space with single space
    clean_strng = re.sub(r' +', ' ', clean_strng)

    return clean_strng


def tokenize_stem_no_stopwords(strng):
    return [stemmer.stem(w) for w in word_tokenize(strng) if w not in stopword_list]

In [7]:
# lowercase, strip Reddit placeholders and punctuation, then tokenize, stem, and drop stopwords
df['processed_comment'] = df['comment'].str.lower()
df['processed_comment'] = df['processed_comment'].apply(process_reddit_comment)
df['processed_comment'] = df['processed_comment'].apply(punct_remover)
df['processed_comment'] = df['processed_comment'].apply(tokenize_stem_no_stopwords)

Each sentiment is represented by its own column.  
If a Reddit comment is classified as having sentiment x, it gets a 1 in column x and a 0 otherwise.

In [8]:
N = df.shape[0]

# one indicator column per emotion: 1 if the comment carries that emotion, else 0
# (no zero-initialisation is needed, since each column is assigned in full)
for emotion in enkman_mapping.keys():
    df[emotion] = df['mapped_emotions'].apply(lambda x: 1 if emotion in x else 0)
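
To make the indicator-column layout concrete, here is a small sketch that builds the same kind of 0/1 row for a few hand-written label lists. The `multi_hot` helper and the sample rows are assumptions for illustration; scikit-learn's `MultiLabelBinarizer` can produce a comparable indicator matrix.

```python
# Coarse emotion classes, in the column order used for the indicator vectors.
classes = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def multi_hot(label_list, classes):
    """Return a 0/1 vector with a 1 for every class present in label_list."""
    present = set(label_list)
    return [1 if c in present else 0 for c in classes]


# Hand-written label lists (hypothetical rows).
rows = [["neutral"], ["anger"], ["joy", "surprise"]]
encoded = [multi_hot(r, classes) for r in rows]

for r, vec in zip(rows, encoded):
    print(r, vec)
```

A row with several labels simply has several ones, which is what distinguishes this encoding from the one-hot encoding used in multi-class problems.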

In [11]:
df

| | comment | label | id | emotions | mapped_emotions | processed_comment | anger | disgust | fear | joy | sadness | surprise | neutral |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | My favourite food is anything I didn't have to... | [27] | eebbqej | [neutral] | [neutral] | [favourit, food, anyth, didnt, cook] | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | Now if he does off himself, everyone will thin... | [27] | ed00q6i | [neutral] | [neutral] | [everyon, think, he, laugh, screw, peopl, inst... | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | WHY THE FUCK IS BAYLESS ISOING | [2] | eezlygj | [anger] | [anger] | [fuck, bayless, iso] | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | To make her feel threatened | [14] | ed7ypvh | [fear] | [fear] | [make, feel, threaten] | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | Dirty Southern Wankers | [3] | ed0bdzj | [annoyance] | [anger] | [dirti, southern, wanker] | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43405 | Added you mate well I’ve just got the bow and ... | [18] | edsb738 | [love] | [joy] | [ad, mate, well, ’, got, bow, love, hunt, aspe... | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 43406 | Always thought that was funny but is it a refe... | [6] | ee7fdou | [confusion] | [surprise] | [alway, thought, funni, refer, anyth] | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 43407 | What are you talking about? Anything bad that ... | [3] | efgbhks | [annoyance] | [anger] | [talk, anyth, bad, happen, fault, -, good, thing] | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 43408 | More like a baptism, with sexy results! | [13] | ed1naf8 | [excitement] | [joy] | [like, baptism, sexi, result] | 0 | 0 | 0 | 1 | 0 | 0 | 0 |


In [9]:
X_train, X_test = train_test_split(df, random_state=156, test_size=0.25, shuffle=True)

In [10]:
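tfidf = TfidfVectorizer()

# the tokens are joined back into strings because TfidfVectorizer expects raw text;
# the vocabulary and IDF weights are fitted on the training split only and reused for the test split
x_train = tfidf.fit_transform(X_train['processed_comment'].apply(lambda x: ' '.join(x)))
x_test = tfidf.transform(X_test['processed_comment'].apply(lambda x: ' '.join(x)))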
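
For intuition about what the vectorizer computes, the sketch below reimplements TF-IDF in plain Python using the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. It is a simplified illustration on made-up token lists, not a replacement for `TfidfVectorizer`; tokenisation and sparse storage, for example, are left out.

```python
import math
from collections import Counter

# Made-up, already tokenised documents.
docs = [["love", "hockey"], ["love", "love", "game"], ["bad", "game"]]


def tfidf(docs):
    """Return the vocabulary, per-term IDF weights, and L2-normalised TF-IDF rows."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF, as if one extra document contained every term.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, idf, rows


vocab, idf, rows = tfidf(docs)
print(vocab)
print([round(v, 3) for v in rows[0]])
```

Terms that occur in fewer documents, such as `hockey` here, receive a larger IDF weight than terms shared by several documents, such as `love`.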