<a href="https://colab.research.google.com/github/ameasure/colab_tutorials/blob/master/Multiclass_Multilabel_LogisticRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiclass Multilabel
Multiclass multilabel may be the most confusing terminology in ML. Multiclass means there are multiple possible classes (i.e. occupation codes) but only one correct one for each example. Multilabel means each example can have multiple possible classes (i.e. both a janitor and a nurse).

Suppose, for example, we know the NAICS code for an establishment and want to predict which occupations are there (of which there could be many). It might look like the following, where id identifies each establishment.

In [47]:
import pandas as pd

df = pd.DataFrame([{'id': 1, 'naics': '1234', 'job': '31'},
                   {'id': 1, 'naics': '1234', 'job': '32'},
                   {'id': 2, 'naics': '2244', 'job': '31'},
                   {'id': 2, 'naics': '2244', 'job': '33'},
                   {'id': 2, 'naics': '2244', 'job': '58'}])
df.head()


Unnamed: 0,id,naics,job
0,1,1234,31
1,1,1234,32
2,2,2244,31
3,2,2244,33
4,2,2244,58


Sklearn multilabel algorithms expect each example to be associated with a list of labels, so we need to aggregate the labels into a list of labels, for each id. We can do this using the pandas `groupby` and `agg` methods as follows:


In [48]:
df = df.groupby('id')['job'].agg(list).reset_index()
df.head()

Unnamed: 0,id,job
0,1,"[31, 32]"
1,2,"[31, 33, 58]"


We now have data suitable for sklearn. We fit a multilabel logistic regression below.

In [50]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer

# 2 training examples but 4 possible outputs (a,b,c, and/or d)
vect = CountVectorizer()
vect.fit(df['naics'])
x_train = vect.transform(df['naics'])

mlb = MultiLabelBinarizer()
mlb.fit(df['jobs'])
y_train = mlb.transform(df['jobs'])
print(y_train)

[[1 1 0 0]
 [1 0 1 1]
 [0 0 1 0]]


In [51]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(x_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [52]:
y_pred = clf.predict(x_train)
print(y_pred)

[[1 0 1 0]
 [1 0 1 0]
 [1 0 1 0]]


In [56]:
preds = mlb.inverse_transform(y_pred)
print(preds)

[('31', '33'), ('31', '33'), ('31', '33')]


In [54]:
from sklearn.metrics import accuracy_score, f1_score

accuracy_score(y_train, y_pred)

0.0

In [55]:
f1_score(y_train, y_pred, average='macro')

0.4