## Description of our performance metrics

Multilabel classification

Our project is a multilabel classification problem: each movie in the dataset is mapped to a set of genre labels. In this type of problem the labels are not mutually exclusive; for example, a single movie might be both Romance and Horror. Because of this, the predictions and the true labels are vectors of genre label sets rather than vectors of single genres. We can still extend the fundamental ideas of precision, recall, etc. to multilabel classification problems.
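As a small illustration of what "vectors of genre label sets" means in practice, the sketch below turns a few label sets into a binary indicator matrix with scikit-learn's `MultiLabelBinarizer`. The three movies in `genre_sets` are made-up toy data, not rows from our dataset.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy label sets, one tuple of genres per movie
genre_sets = [("Romance", "Horror"), ("Comedy",), ("Horror", "Thriller")]

mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(genre_sets)

print(mlb.classes_)  # one column per genre, sorted alphabetically
print(indicator)     # one row per movie, 1 where the movie has that genre
```

Each row of `indicator` is the binary form of one label set, which is the representation the metrics below operate on.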

Multilabel evaluation metrics are grouped into two main categories: example-based and label-based metrics. Example-based metrics are computed individually for each instance and then averaged to obtain the final value. Label-based metrics are computed per label instead of per instance, and the per-label results are then combined by one of two approaches: micro-averaging or macro-averaging.

Let MLD be the multilabel dataset, $|D|$ the number of samples, $\hat{y_j}$ the predicted label set for the $j$-th movie, $y_j$ the corresponding true label set, and $n_{labels}$ the number of labels.

We can evaluate a model with several different measurements:

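**1. Compare bit-wise**: the fraction of individual label entries that are predicted correctly. This can be too lenient, since each movie has only a few genres and most entries are 0 anyway, so we will not use it here.

**2. Compare vector-wise** (accuracy classification score): a prediction counts as correct only if the whole predicted label set matches the true label set exactly. This can be too strict, so we will not focus on it.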
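To make the "too lenient" versus "too strict" contrast concrete, here is a minimal sketch on a made-up 3-movie, 4-label indicator matrix. The arrays are hypothetical; `accuracy_score` on multilabel indicator input computes the exact-match (subset) accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy data: rows are movies, columns are labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])

# Bit-wise: share of individual entries that agree
bitwise_accuracy = (y_true == y_pred).mean()

# Vector-wise: share of rows that agree on every label
subset_accuracy = accuracy_score(y_true, y_pred)

print(bitwise_accuracy, subset_accuracy)
```

Only 2 of the 12 entries are wrong, so the bit-wise score is 10/12, yet only one of the three rows matches exactly, so the vector-wise score is 1/3.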

**3. Jaccard similarity score**: computes the average of Jaccard similarity coefficients (size of the intersection divided by the size of the union of two label sets) between pairs of label sets.

$$Jaccard(\hat{y_j},y_j) = \frac{1}{|D|}\sum_{j=1}^{|D|}\frac{|\hat{y_j}\cap{y_j}|}{|\hat{y_j}\cup{y_j}|}$$
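The formula can be checked by hand on a toy example. The sketch below uses made-up indicator arrays and assumes a scikit-learn version that provides `jaccard_score`; with `average='samples'` it averages the per-movie coefficients, which should agree with the manual computation.

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Hypothetical toy data: rows are movies, columns are labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])

# Per-movie intersection and union sizes of the two label sets
intersection = np.logical_and(y_true, y_pred).sum(axis=1)
union = np.logical_or(y_true, y_pred).sum(axis=1)
jaccard_manual = (intersection / union).mean()

jaccard_sklearn = jaccard_score(y_true, y_pred, average="samples")
print(jaccard_manual, jaccard_sklearn)
```

The per-movie coefficients here are 1/2, 1 and 2/3, so both values come out to 13/18.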

**4. Hamming loss**. It is the most common evaluation metric in the multilabel literature. For each sample it takes the size of the symmetric difference between the predicted and true label sets (the number of labels on which they disagree, i.e. the xor of the two binary vectors), divides by the number of labels, and then averages over all samples in the MLD.

Hamming loss $L_{Hamming}$ is defined as:

$$L_{Hamming}(\hat{y_j},y_j) = \frac{1}{|D|}\sum_{j=1}^{|D|}\frac{|\hat{y_j} \,\triangle\, y_j|}{n_{labels}}$$

The best value of Hamming loss is 0, the worst value of Hamming loss is 1.
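As a quick check of the definition, the following sketch computes Hamming loss by hand with an element-wise xor and compares it with `sklearn.metrics.hamming_loss`. The indicator arrays are made-up toy data.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Hypothetical toy data: rows are movies, columns are labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])

n_labels = y_true.shape[1]

# Number of disagreeing labels per movie, normalised by the label count
per_sample = np.logical_xor(y_true, y_pred).sum(axis=1) / n_labels
hamming_manual = per_sample.mean()

hamming_sklearn = hamming_loss(y_true, y_pred)
print(hamming_manual, hamming_sklearn)
```

Two of the twelve entries disagree, so both values are 1/6. Note that this is exactly one minus the bit-wise accuracy from item 1.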

**5. Multi-label precision**: the proportion of predicted labels that are relevant, i.e. that are also in the true label set.

$$L_{precision}(\hat{y_j},y_j) = \frac{1}{|D|}\sum_{j=1}^{|D|}\frac{|\hat{y_j}\cap{y_j}|}{|\hat{y_j}|}$$

**6. Multi-label recall**: commonly used along with the previous metric, it measures the proportion of relevant (true) labels that the classifier predicts.

$$L_{recall}(\hat{y_j},y_j) = \frac{1}{|D|}\sum_{j=1}^{|D|}\frac{|\hat{y_j}\cap{y_j}|}{|{y_j}|}$$
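A small worked example may help keep the two denominators apart: precision divides by the number of predicted labels, recall by the number of true labels. The sketch uses made-up indicator arrays and compares the manual values with scikit-learn's `precision_score` and `recall_score` using `average='samples'`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy data: rows are movies, columns are labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])

# Labels that are both predicted and true, per movie
hits = np.logical_and(y_true, y_pred).sum(axis=1)

precision_manual = (hits / y_pred.sum(axis=1)).mean()  # divide by predicted count
recall_manual = (hits / y_true.sum(axis=1)).mean()     # divide by true count

precision_sklearn = precision_score(y_true, y_pred, average="samples")
recall_sklearn = recall_score(y_true, y_pred, average="samples")
print(precision_manual, recall_manual)
```

Here precision is (1 + 1 + 2/3) / 3 = 8/9 and recall is (1/2 + 1 + 1) / 3 = 5/6. Every toy row has at least one true and one predicted label, so no division by zero occurs; real data may need that case handled.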

**7. F1 score (harmonic mean of precision and recall)**: provides a balanced assessment between precision and recall (sensitivity).

$$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$

For the multi-label case, the per-label F1 scores are combined into a single number, and we can specify the averaging in the following ways: 'micro' calculates the metric globally by counting the total true positives, false negatives and false positives; 'macro' calculates the metric for each label and takes the unweighted mean; and 'weighted' calculates the metric for each label and takes the mean weighted by support (the number of true instances for each label) to account for imbalance.

If some labels have more instances than others and we want to bias our metric towards the most populated ones, we will use the micro-average. If we instead want to bias our metric toward the least populated ones (or at least not toward the most populated ones), we will use the macro-average. If the micro-average result is significantly lower than the macro-average one, it means that we have some gross misclassification in the most populated labels, whereas our smaller labels are probably correctly classified. If the macro-average result is significantly lower than the micro-average one, it means our smaller labels are poorly classified, whereas our larger ones are probably correctly classified. Given the imbalanced nature of our data, we will probably want to use the weighted average. The best value of F1 score is 1, the worst value of F1 score is 0.
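The effect of the averaging choice can be seen on a small imbalanced example. The sketch below uses made-up data with one frequent label that is predicted perfectly, one medium label, and one rare label that is always wrong; it is only meant to show how the three averages diverge.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical toy data: label 0 is frequent, label 2 is rare
y_true = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 1, 0],
                   [1, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 1, 0],
                   [1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

f1_per_label = f1_score(y_true, y_pred, average=None)
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_weighted = f1_score(y_true, y_pred, average="weighted")

print(f1_per_label)
print(f1_micro, f1_macro, f1_weighted)
```

The per-label scores are 1.0, 0.8 and 0.0. The macro average (0.6) is pulled down by the rare label, while the micro (14/17) and weighted (7.4/9) averages stay above 0.8 because the frequent label dominates them.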

Based on the literature review, **Hamming loss** (best: 0, worst: 1) and **F1 score** (best: 1, worst: 0) are the suggested metrics for multi-label classification problems. We will compute all of the metrics above during the model building process, but with a focus on Hamming loss and F1 score.


## Example of performance evaluation 

The actual performance evaluation is included in other IPython notebooks; the following code is provided as a convenience for teammates.

In [2]:
from sklearn.metrics import hamming_loss
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

In [85]:
#For example purpose:
imdb_example = pd.read_csv(r"C:\Users\cheriexu\Downloads\CS109B\movie_project\r\imdb_cluster_result.csv")
imdb_example.head(5)

Unnamed: 0,X,certificates_R,certificates_PG,art.direction_1,assistant.director_1,casting.director_1,cinematographer_1,costume.department_1,costume.designer_1,countries_1,...,Action,Documentary,Musical,History,Family,Fantasy,Sport,Biography,cluster_response,genres_comb
0,100,1,0,0.322495,0.042103,0.010279,0.046254,0.307966,0.028662,0.089642,...,0,0,0,0,0,0,0,0,4,"""Romance"", ""Comedy"", ""Fantasy"""
1,10001,0,1,0.027673,0.093694,0.319431,0.16525,0.307966,0.17286,0.014232,...,0,0,0,1,0,0,0,0,5,"""Horror"", ""Thriller"", ""Action"""
2,10002,1,0,0.212394,0.024906,0.006523,0.016308,0.012453,0.005535,0.089642,...,0,0,0,0,0,0,0,0,1,"""Horror"", ""Thriller"", ""Drama"""
3,10003,0,1,0.019767,0.024906,0.014133,0.025301,0.126112,0.007116,0.536766,...,1,0,0,0,0,0,0,0,1,"""Horror"", ""Thriller"", ""Drama"""
4,10004,1,0,0.104764,0.355406,0.017395,0.049812,0.126112,0.040028,0.536766,...,0,0,0,0,0,1,0,0,1,"""Horror"", ""Thriller"", ""Drama"""


In [86]:
import ast

# Input:  dataframe loaded from the csv file
# Output: y: binary indicator matrix suitable for multi-label classification
#         m: the fitted MultiLabelBinarizer, needed later to map predictions back to genres
def process_multilabel(dataframe):
    # Convert the response variable from a string to a tuple of genres,
    # for example '"Romance", "Horror"' to ("Romance", "Horror").
    # ast.literal_eval only parses Python literals, so it is safer than eval.
    dataframe['genres_comb'] = dataframe['genres_comb'].apply(ast.literal_eval)
    # .ix has been removed from pandas; use label-based .loc instead
    y = dataframe.loc[:, 'genres_comb']
    m = MultiLabelBinarizer().fit(y)
    y = m.transform(y)
    return y, m

In [87]:
y, m = process_multilabel(imdb_example)

In [91]:
imdb_example.loc[:, 'genres_comb'].head(5)

0    (Romance, Comedy, Fantasy)
1    (Horror, Thriller, Action)
2     (Horror, Thriller, Drama)
3     (Horror, Thriller, Drama)
4     (Horror, Thriller, Drama)
Name: genres_comb, dtype: object

In [92]:
m.classes_

array(['Action', 'Adventure', 'Animation', 'Comedy', 'Drama', 'Family',
       'Fantasy', 'Horror', 'Music', 'Romance', 'Thriller'], dtype=object)

In [93]:
# For example purposes only: two different slices of the true labels
# stand in for a real (y_true, y_pred) pair
y_true_m = y[1:100]
y_pred_m = y[101:200]

In [94]:
# Two evaluation functions. Both expect binary indicator matrices
# (the output of MultiLabelBinarizer.transform). If the inputs are raw
# genre lists such as [Horror, Thriller, Action], binarize them first
# with m.transform(...).
def f1score_evaluation(y_true_m, y_pred_m, average_method):
    # average_method: 'micro', 'macro', 'weighted' or 'samples'
    return f1_score(y_true_m, y_pred_m, average=average_method)

def hammingloss_evaluation(y_true_m, y_pred_m):
    return hamming_loss(y_true_m, y_pred_m)

In [95]:
f1score_evaluation(y_true_m, y_pred_m, 'weighted')

0.49209799363385326

In [96]:
hammingloss_evaluation(y_true_m, y_pred_m)

0.2865013774104683

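For cross_val_score or GridSearchCV:

For the F1 score with weighted average, either set scoring='f1_weighted' or build a scorer from our function, e.g.:

    from sklearn.multiclass import OneVsRestClassifier
    # SVC does not accept a multilabel indicator matrix directly, so wrap it one-vs-rest
    clf = OneVsRestClassifier(svm.SVC(kernel='linear', C=1))
    # extra keyword arguments to make_scorer are passed on to the scoring function
    f1_scorer = make_scorer(f1score_evaluation, greater_is_better=True, average_method='weighted')
    cross_val_score(clf, X, y, cv=3, scoring=f1_scorer)

For Hamming loss (greater_is_better=False makes the scorer return the negated loss, so reported values closer to 0 are better):

    hamming_scorer = make_scorer(hammingloss_evaluation, greater_is_better=False)
    cross_val_score(clf, X, y, cv=3, scoring=hamming_scorer)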
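The same kind of scorer can be passed to GridSearchCV. The sketch below is self-contained and runs on synthetic data from `make_multilabel_classification` as a stand-in for our movie features; the names `X_demo` and `y_demo`, the one-vs-rest wrapper and the grid of `C` values are illustrative assumptions, not our actual tuning setup.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import hamming_loss, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the movie features and the genre indicator matrix
X_demo, y_demo = make_multilabel_classification(
    n_samples=100, n_features=10, n_classes=4, random_state=0)

# greater_is_better=False negates the loss so that GridSearchCV can maximise it
hamming_scorer = make_scorer(hamming_loss, greater_is_better=False)

search = GridSearchCV(
    OneVsRestClassifier(SVC(kernel="linear")),
    param_grid={"estimator__C": [0.1, 1, 10]},  # C of the wrapped SVC
    scoring=hamming_scorer,
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # flip the sign back to get the Hamming loss
```

Because the scorer is negated, `best_score_` is at most 0, and the reported Hamming loss is its negative.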