# NLP with Multilabel Classification
--- 

Challenge for the Hand Talk [selection process](https://handtalk.notion.site/Classifica-o-de-frases-por-setor-18c80adbbf874c519c9efe19678ac4c1).  
*author: [@baiochi](http://github.com/baiochi)*

# Imports

In [35]:
# Algebra and maths operations
import numpy as np
# Data manipulation
import pandas as pd
# NLP toolkit
import nltk
# ML tools
import sklearn
# Visualization tools
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import plotly.express as px
import plotly.graph_objects as go
# Progress bar
from tqdm import tqdm
tqdm.pandas()
# Save models
import joblib
# Libraries version
import session_info

# Hide logs
import logging
loggers_to_shut_up = [
    "hyperopt.tpe",
    "hyperopt.fmin",
    "hyperopt.pyll.base",
]
for logger in loggers_to_shut_up:
    logging.getLogger(logger).setLevel(logging.ERROR)

# Custom plot colors
URBAN_PALETTE_CATEGORICAL = pd.DataFrame({
    'cyan'    : '#1696d2',
    'gray'    : '#d2d2d2',
    'magenta' : '#ec008b',
    'yellow'  : '#fdbf11',
    'dark'    : '#332d2f',
    'ocean'   : '#0a4c6a',
}, index=['hex_code'])

# Colors for Cell output
WHITE = '\033[39m'
CYAN = '\033[36m'

session_info.show()

# First cycle: baseline model

A baseline model gives us a reference point against which to compare the improved models.  
It is a classifier with barely any data transformation (such as feature engineering) or hyper-parameter tuning.  

### Select data and split into train/test

The first step in creating a model is to split the data randomly into train and test datasets.  
The train data is used to fit the model, and metrics are computed on both sets to evaluate performance and check for underfitting or overfitting.  
In this example we use 80% of the data for training and 20% for testing.  

> Note: for reproducibility, the multi-label binarization step is run again here.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Read data
df = pd.read_csv('data/dataset.csv')

# Define features
X = df['sentence']

# Define target and binarize labels
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['category'].apply(str.split, sep=','))
labels = mlb.classes_

# Apply train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check data dimension
print('Data shape after splitting:')
print(f'X_train={X_train.shape}\t\t X_test={X_test.shape}\ny_train={y_train.shape}\t y_test={y_test.shape}')

Data shape after splitting:
X_train=(416,)		 X_test=(105,)
y_train=(416, 5)	 y_test=(105, 5)
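
To make the binarization step concrete, here is a minimal sketch on made-up category strings (the label names below are hypothetical and not taken from `data/dataset.csv`). It only assumes the same comma-separated format used in the `category` column.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical comma-separated category strings (illustration only)
raw = ["education", "finance,retail", "education,retail"]

mlb_demo = MultiLabelBinarizer()
y_demo = mlb_demo.fit_transform([s.split(",") for s in raw])

print(mlb_demo.classes_)  # sorted label names, one per output column
print(y_demo)             # one row per sentence, 1 where the label applies
```

Each sentence becomes one row of the indicator matrix, which is why `y_train` and `y_test` above have one column per label.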


### Build a `Pipeline` example with Logistic Regression

One of the biggest advantages of using the scikit-learn `Pipeline` is that it helps prevent data leakage.  

> Data leakage can occur when the test dataset is used in steps where it should not be, such as fitting a scaler or vectorizer, or fitting the model itself. 
When this happens, comparing train and test metrics to check whether the model is overfitted can be misleading.   
  
In order to build our pipeline, we must first:  
- Instantiate our transformer, which in our case is `TfidfVectorizer`  
- Instantiate the estimator, e.g. `LogisticRegression`  
- Select the meta classifier; we will begin with `MultiOutputClassifier`   
   
> Note: `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`.  
   
Then all of these objects are grouped in a `list` and passed as a parameter to the `Pipeline` object.  
Each element of the list represents a **step** of the pipeline, consisting of a tuple of two values (a string naming the step, and the object itself).  
   
After fitting the data, the following metrics are computed for the train and test datasets in order to evaluate the model:     
- **Mean accuracy score**: the value returned by `score`, ranging from 0% to 100%. For multilabel targets this is the subset accuracy, so a sample only counts as correct when every one of its labels is predicted correctly.  
- **Hamming loss**: fraction of labels that are incorrectly predicted, ranging from 0 to 1; lower values mean a better score.  
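
The difference between the two metrics is easiest to see on a tiny example. The sketch below uses made-up targets and predictions (not the project data): one wrong label out of nine cells is enough to mark a whole sample as incorrect for subset accuracy, while the Hamming loss only counts that single cell.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Toy targets: 3 samples, 3 labels (illustrative values only)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]])

subset_acc = accuracy_score(y_true, y_pred)  # exact-match ratio: 2 of 3 rows
h_loss = hamming_loss(y_true, y_pred)        # wrong cells: 1 of 9
print(f"subset accuracy={subset_acc:.3f}, hamming loss={h_loss:.3f}")
```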

In [3]:
# Get portuguese stopwords
from unidecode import unidecode
pt_stopwords = [unidecode(word) for word in nltk.corpus.stopwords.words('portuguese')]

# Instance the transformer
from sklearn.feature_extraction.text import TfidfVectorizer
# Convert a collection of raw documents to a matrix of TF-IDF features
tfidf_vec = TfidfVectorizer(
    strip_accents='unicode',
    stop_words=pt_stopwords,
    max_features=5000,
    max_df=0.85,)

# Instance estimator with default hyper-parameters
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Import meta classifier
from sklearn.multioutput import MultiOutputClassifier

# Create Pipeline
from sklearn.pipeline import Pipeline
lr_pipeline = Pipeline([
    # Step 1, transform features
    ('tfidf', tfidf_vec),
    # Step 2, estimator wrapped in the meta classifier 
    ('clf_moc', MultiOutputClassifier(lr)),
])

# Fit model with train data
lr_pipeline.fit(X_train, y_train)


from sklearn.metrics import accuracy_score, hamming_loss

# Compute metrics for the model
print('Score for Logistic Regression')
print(f'\n{CYAN}Mean accuracy{WHITE}')
print(f'Train dataset: {lr_pipeline.score(X_train, y_train)*100:.2f}%')
print(f'Test dataset:  {lr_pipeline.score(X_test, y_test)*100:.2f}%')
# Hamming Loss metrics
print(f'\n{CYAN}Hamming loss average{WHITE}')
print(f'Train dataset: {hamming_loss(y_train, lr_pipeline.predict(X_train)):.2f}')
print(f'Test datasets:  {hamming_loss(y_test, lr_pipeline.predict(X_test)):.2f}')

# Cross validation score with F1 weighted
from sklearn.model_selection import cross_val_score
X_transf = tfidf_vec.transform(X)
print(f'\n{CYAN}Cross validation score{WHITE}\nF1 weighted: {cross_val_score(lr_pipeline[1], X_transf, y, cv=5, scoring="f1_weighted").mean():.2f}')

Score for Logistic Regression

Mean accuracy
Train dataset: 18.75%
Test dataset:  11.43%

Hamming loss average
Train dataset: 0.18
Test datasets:  0.19

Cross validation score
F1 weighted: 0.12
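
The cross-validation above scores only the classifier step on features that were vectorized beforehand. An alternative that keeps the vectorizer inside each fold is to cross-validate the whole `Pipeline` on the raw text. The following is a minimal sketch under that assumption, using made-up sentences and labels rather than the project dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical sentences with two labels: [finance, education]
texts = [
    "open a bank account", "enroll in the course",
    "bank loan rates", "school course schedule",
    "bank credit card", "course exam grades",
    "bank savings account", "school enrollment course",
]
y_toy = np.array([[1, 0], [0, 1]] * 4)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression())),
])

# The vectorizer is re-fitted on the training part of every fold,
# so the held-out fold never influences the vocabulary or IDF weights
scores = cross_val_score(pipe, texts, y_toy, cv=2, scoring="f1_weighted")
print(scores.mean())
```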


As we can see, the model performed very poorly.   
Although Logistic Regression can be adapted to this multi-label case, other algorithms are more suitable for this type of problem.  

Some of them are:
- Naive Bayes;
- Decision trees;
- Ensemble methods, e.g. Random Forest, Gradient Boosting.  

For now, let's create a custom function to easily switch between different estimators.

### Custom function for the next estimators

This function receives the full features and target (the train/test split is done inside), a transformer object, an estimator object and, optionally, a meta classifier class.

In [4]:
# Function to run the pipeline for a given classifier and estimator
def run_pipeline(X, y, transformer, estimator, multiclass_strategy=None, show_metrics=True):
    
    # Get name from variable type
    var_name = lambda x : str(type(x)).split('.')[-1][:-2]
    transformer_name = var_name(transformer)
    estimator_name = var_name(estimator)

    # Apply train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    if multiclass_strategy:
        print(f'Multiclass strategy: {str(multiclass_strategy).split(".")[-1][:-2]}')
        estimator = multiclass_strategy(estimator)
    print(f'Estimator: {estimator_name}')

    # Create Pipeline
    pipeline = Pipeline([
        (transformer_name,        transformer),
        ('clf_' + estimator_name, estimator),
    ])
    
    # Fit model
    try:
        pipeline.fit(X_train, y_train)
    except ValueError:
        print(f'\nError: y should be a 1d array, got an array of shape {y_train.shape} instead.')
        print('Try wrapping the estimator in a meta classifier.')
        return None

    if show_metrics:
        # Model score (average accuracy)
        print(f'\n{CYAN}Average accuracy score{WHITE}')
        print(f'Train dataset: {pipeline.score(X_train, y_train)*100:.2f}%')
        print(f'Test dataset:  {pipeline.score(X_test, y_test)*100:.2f}%')
        # Hamming Loss metrics
        print(f'\n{CYAN}Hamming loss average{WHITE}')
        print(f'Train dataset: {hamming_loss(y_train, pipeline.predict(X_train)):.2f}')
        print(f'Test datasets:  {hamming_loss(y_test, pipeline.predict(X_test)):.2f}')
        # Cross validation score with F1 weighted
        # (use the transformer fitted above rather than the global `tfidf_vec`)
        X_transf = transformer.transform(X)
        print(f'\n{CYAN}Cross validation score{WHITE}\nF1 weighted: {cross_val_score(pipeline[1], X_transf, y, cv=5, scoring="f1_weighted").mean():.2f}')
    
    return pipeline

In [5]:
# Testing with LogisticRegression
lr_pipeline = run_pipeline(X, y , tfidf_vec, LogisticRegression(), MultiOutputClassifier)

Multiclass strategy: MultiOutputClassifier
Estimator: LogisticRegression

Average accuracy score
Train dataset: 18.75%
Test dataset:  11.43%

Hamming loss average
Train dataset: 0.18
Test datasets:  0.19

Cross validation score
F1 weighted: 0.12


In [6]:
# Save LogisticRegression model
joblib.dump(lr_pipeline, 'models/logistic_regression.joblib')

['models/logistic_regression.joblib']

### Multinomial Naive Bayes

This classifier is suited to classification with discrete features such as word counts. According to the scikit-learn documentation, fractional counts such as the TF-IDF values used here may also work in practice.

#### Binary Relevance

**MultiOutputClassifier** uses the binary relevance method, which converts our multi-label problem into one independent binary classification problem per label and trains a separate binary classifier for each.  

PROS: estimates single-label classifiers and can generalize beyond available label combinations.  
  
CONS: not suitable for a large number of labels and ignores label relations.  


Reference: [scikit-learn.org](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier)
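
As a quick illustration of binary relevance, the sketch below fits `MultiOutputClassifier` on a made-up count matrix with two labels and checks that one estimator was trained per label. The data is hypothetical and only meant to show the mechanics.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy count features (4 samples, 3 terms) and 2 labels, illustrative only
X_toy = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 0, 2]])
Y_toy = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

moc = MultiOutputClassifier(MultinomialNB()).fit(X_toy, Y_toy)

print(len(moc.estimators_))      # one fitted MultinomialNB per label column
print(moc.predict(X_toy).shape)  # predictions keep the (n_samples, n_labels) shape
```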

In [7]:
from sklearn.naive_bayes import MultinomialNB

# Instance estimator
mnb = MultinomialNB(
    fit_prior=True, 
    class_prior=None)

mnb_moc_pipeline = run_pipeline(X, y, tfidf_vec, mnb, MultiOutputClassifier)

Multiclass strategy: MultiOutputClassifier
Estimator: MultinomialNB

Average accuracy score
Train dataset: 69.95%
Test dataset:  26.67%

Hamming loss average
Train dataset: 0.06
Test datasets:  0.16

Cross validation score
F1 weighted: 0.34


In [8]:
# Save MultinomialNB model with MultiOutputClassifier
joblib.dump(mnb_moc_pipeline, 'models/multinomial_nb_moc.joblib')

['models/multinomial_nb_moc.joblib']

#### Classifier Chain

This method combines a number of binary classifiers into a single multi-label model that is capable of exploiting correlations among labels.

PROS: estimates single-label classifiers and can generalize beyond available label combinations; also takes label relations into account.  

CONS: not suitable for a large number of labels, and quality strongly depends on the label ordering in the chain.  

Reference: [scikit-learn.org](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain)
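
To show how the chain uses earlier labels, here is a small sketch on made-up data with an explicit `order`. The second model in the chain receives the original features plus the first label as an extra column, which is visible in the width of its coefficient matrix. The data and the chosen order are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Toy data: 3 features, 2 labels (illustrative only)
X_toy = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0],
                  [0, 0, 1], [1, 0, 1], [0, 1, 1]], dtype=float)
Y_toy = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

# Predict label 1 first, then label 0 using label 1 as an additional feature
chain = ClassifierChain(LogisticRegression(), order=[1, 0]).fit(X_toy, Y_toy)

print(list(chain.order_))
print(chain.estimators_[0].coef_.shape)  # first link: original features only
print(chain.estimators_[1].coef_.shape)  # second link: features + 1 earlier label
```

Trying a few different orders (or `order='random'` with several seeds) is one way to check how sensitive the results are to the ordering.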

In [9]:
from sklearn.multioutput import ClassifierChain

mnb_cc_pipeline = run_pipeline(X, y, tfidf_vec, mnb, ClassifierChain)

Multiclass strategy: ClassifierChain
Estimator: MultinomialNB

Average accuracy score
Train dataset: 73.80%
Test dataset:  30.48%

Hamming loss average
Train dataset: 0.05
Test datasets:  0.16

Cross validation score
F1 weighted: 0.43


In [10]:
# Save MultinomialNB model with ClassifierChain
joblib.dump(mnb_cc_pipeline, 'models/multinomial_nb_cc.joblib')

['models/multinomial_nb_cc.joblib']

#### Label Powerset

Assigns a unique class to every label combination present in the training data.  

The method maps each combination to a unique combination id number, and performs multi-class classification
using the `classifier` as multi-class classifier and combination ids as classes.  

PROS: estimates label dependencies with only one classifier. It is often the best solution for subset accuracy if the training data contains all relevant label combinations.  

CONS: requires all label combinations predictable by the classifier to be present in the training data, and is very prone to underfitting with large label spaces.  
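
The mapping idea can be sketched by hand in a few lines. This is not the `skmultilearn` implementation, only an illustration of the transformation on made-up data: every distinct label row gets an integer id, a regular multi-class classifier is trained on those ids, and predictions are mapped back to label rows.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features and 2 labels (illustrative only)
X_toy = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [2, 0, 0]])
Y_toy = np.array([[1, 0], [0, 1], [1, 1], [1, 0]])

# Map every distinct label combination to an integer class id
combo_to_id = {}
class_ids = [combo_to_id.setdefault(tuple(row), len(combo_to_id)) for row in Y_toy]
id_to_combo = {i: combo for combo, i in combo_to_id.items()}

# Train one multi-class model on the ids, then map predictions back to label rows
clf_demo = MultinomialNB().fit(X_toy, class_ids)
Y_pred = np.array([id_to_combo[int(i)] for i in clf_demo.predict(X_toy)])

print(class_ids)
print(Y_pred)
```

Because predictions are always one of the combinations seen during training, a combination absent from the training data can never be predicted.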

In [11]:
from skmultilearn.problem_transform import LabelPowerset

mnb_lp_pipeline = run_pipeline(X, y, tfidf_vec,mnb, LabelPowerset)

Multiclass strategy: LabelPowerset
Estimator: MultinomialNB

Average accuracy score
Train dataset: 88.94%
Test dataset:  60.00%

Hamming loss average
Train dataset: 0.04
Test datasets:  0.15

Cross validation score
F1 weighted: 0.61


In [12]:
# Save MultinomialNB model with LabelPowerset
joblib.dump(mnb_lp_pipeline, 'models/multinomial_nb_lp.joblib')

['models/multinomial_nb_lp.joblib']

### Model performance

Among the three strategies, *Label Powerset* had the best scores.   
  

| Strategy | Binary Relevance | Classifier Chain | Label Powerset |
|---|---|---|---|
| Train acc | 69.95% | 73.80% | 88.94% |
| Test acc | 26.67% | 30.48% | 60.00% |
| Train hamming-loss | 0.06 | 0.05 | 0.04 |
| Test hamming-loss | 0.16 | 0.16 | 0.15 |
| CV score F1 weighted | 0.34 | 0.43 | 0.61 |  
  
We will use this meta classifier with Multinomial Naive Bayes as the baseline and compute further metrics for it.  

In [13]:
# Define baseline pipeline based on the best score
baseline_pipeline = run_pipeline(X, y, tfidf_vec, mnb, LabelPowerset, show_metrics=False)

Multiclass strategy: LabelPowerset
Estimator: MultinomialNB


In [14]:
# Make predictions
y_pred_train = baseline_pipeline.predict(X_train).toarray()
y_pred_test = baseline_pipeline.predict(X_test).toarray()
y_proba_train = baseline_pipeline.predict_proba(X_train).toarray()
y_proba_test = baseline_pipeline.predict_proba(X_test).toarray()

#### ROC AUC for Multilabel problem

##### Compute values 
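
For reference, scikit-learn's `roc_auc_score` accepts multilabel indicator targets directly and exposes the averaging mode through its `average` parameter. The sketch below, on made-up targets and scores, shows the per-label, macro and micro values; it is a simpler companion to the curve-based computation in this section, not a replacement for it.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy multilabel targets and scores (illustrative values only)
y_true_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
y_score_demo = np.array([[0.9, 0.2], [0.4, 0.7], [0.6, 0.8],
                         [0.3, 0.1], [0.2, 0.6], [0.1, 0.9]])

per_label = roc_auc_score(y_true_demo, y_score_demo, average=None)   # one AUC per label
macro = roc_auc_score(y_true_demo, y_score_demo, average="macro")    # unweighted mean of label AUCs
micro = roc_auc_score(y_true_demo, y_score_demo, average="micro")    # all label cells pooled together

print(per_label, macro, micro)
```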

In [15]:
from sklearn.metrics import auc, roc_curve, roc_auc_score

def multiclass_rouc_auc(y_true, y_score, n_classes):
    
    # Get FPR/TPR and ROC AUC score for each class
    fpr, tpr, roc_auc = {}, {}, {}
    for i in range(y_score.shape[1]):
        # false positives and true positives
        fpr[i], tpr[i], _ = roc_curve(y_true[:, i], y_score[:, i])
        # roc auc score
        roc_auc[i] = roc_auc_score(y_true[:, i], y_score[:, i])
    
    # Compute micro-average ROC curve and ROC area from prediction scores
    fpr['micro'], tpr['micro'], _ = roc_curve(y_true.ravel(), y_score.ravel())
    roc_auc['micro'] = auc(fpr['micro'], tpr['micro'])
    
    # Aggregate all false positive rates
    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
    
    # Interpolate all ROC curves at these points
    mean_tpr = np.zeros_like(all_fpr)
    for i in range(n_classes):
        mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
    
    # Average it and compute AUC
    mean_tpr /= n_classes
    fpr['macro'] = all_fpr
    tpr['macro'] = mean_tpr
    roc_auc['macro'] = auc(fpr['macro'], tpr['macro'])
    
    return fpr, tpr, roc_auc

In [16]:
# Calculate metrics for train and test
fpr_train, trp_train, roc_auc_train = multiclass_rouc_auc(y_train, y_proba_train, len(labels))
fpr_test, trp_test, roc_auc_test = multiclass_rouc_auc(y_test, y_proba_test, len(labels))

##### Plot figure 

In [17]:
def plot_multiclass_roc_auc(fpr, tpr, roc_auc, labels, title=None):
    
    fig = go.Figure()

    # Baseline roc curve
    fig.add_shape(type='line', line=dict(dash='dash'),
                x0=0, x1=1, y0=0, y1=1)
    # Plot curve for each class
    for index, label in enumerate(labels):
        name = f'{label} (area = {roc_auc[index]:.3f})'
        fig.add_trace(go.Scatter(x=fpr[index], y=tpr[index], name=name, mode='lines'))

    # Plot micro and macro average
    fig.add_trace(go.Scatter(x=fpr['macro'], y=tpr['macro'], name=f'macro-average (area = {roc_auc["macro"]:.2f})', mode='lines', line_dash='dot'))
    fig.add_trace(go.Scatter(x=fpr['micro'], y=tpr['micro'], name=f'micro-average (area = {roc_auc["micro"]:.2f})', mode='lines', line_dash='dot'))

    fig.update_layout(
        # Fall back to a generic title when none is given (the default is None)
        title_text=f'{title} ROC curves' if title else 'ROC curves',
        xaxis_title='False Positive Rate',
        yaxis_title='True Positive Rate',
        yaxis=dict(scaleanchor='x', scaleratio=1),
        xaxis=dict(constrain='domain'),
        legend=dict(
            yanchor="bottom",
            y=0.02,
            xanchor="left",
            x=0.55
        ),
        autosize=False,
        width=600, height=500,
    )
    
    return fig

In [18]:
plot_multiclass_roc_auc(fpr_train, trp_train, roc_auc_train, labels, title='Train dataset').show()
plot_multiclass_roc_auc(fpr_test, trp_test, roc_auc_test, labels, title='Test dataset').show()

#### Classification report

In [19]:
from sklearn.metrics import f1_score, classification_report

# Compute F1 average score with param 'weighted' to account for the target imbalance
f1_score_train = f1_score(y_train, y_pred_train, average="weighted")
f1_score_test = f1_score(y_test, y_pred_test, average="weighted")

# Compute precision, recall and f1 score for each label, plus micro and macro average
clf_report_train = pd.DataFrame(classification_report(y_train, y_pred_train, output_dict=True)).T
clf_report_test = pd.DataFrame(classification_report(y_test, y_pred_test, output_dict=True)).T

# Rename axis with labels
clf_report_train.index = list(labels) + ['micro avg', 'macro avg', 'weighted avg', 'samples avg']
clf_report_test.index = list(labels) + ['micro avg', 'macro avg', 'weighted avg', 'samples avg']

# Add the ROC AUC scores computed previously
clf_report_train['roc auc'] = list(roc_auc_train.values()) + ([np.nan] * 2)
clf_report_test['roc auc'] = list(roc_auc_test.values()) + ([np.nan] * 2)

In [20]:
print(f'Classification report for Train dataset. F1 Average Score = {f1_score_train:.3f}')
clf_report_train

Classification report for Train dataset. F1 Average Score = 0.905


| | precision | recall | f1-score | support | roc auc |
|---|---|---|---|---|---|
| educação | 1.0 | 0.891304 | 0.942529 | 92.0 | 0.999933 |
| finanças | 1.0 | 0.666667 | 0.8 | 63.0 | 0.98723 |
| indústrias | 0.961538 | 0.862069 | 0.909091 | 87.0 | 0.993537 |
| orgão público | 0.872483 | 0.992366 | 0.928571 | 131.0 | 0.999223 |
| varejo | 1.0 | 0.822785 | 0.902778 | 79.0 | 0.998948 |
| micro avg | 0.947115 | 0.871681 | 0.907834 | 452.0 | 0.982963 |
| macro avg | 0.966804 | 0.847038 | 0.896594 | 452.0 | 0.996216 |
| weighted avg | 0.95564 | 0.871681 | 0.905234 | 452.0 | NaN |
| samples avg | 0.947115 | 0.918269 | 0.927885 | 452.0 | NaN |


In [21]:
print(f'Classification report for Test dataset. F1 Average Score = {f1_score_test:.3f}')
clf_report_test

Classification report for Test dataset. F1 Average Score = 0.629


| | precision | recall | f1-score | support | roc auc |
|---|---|---|---|---|---|
| educação | 0.958333 | 0.741935 | 0.836364 | 31.0 | 0.976024 |
| finanças | 1.0 | 0.230769 | 0.375 | 13.0 | 0.901338 |
| indústrias | 0.866667 | 0.65 | 0.742857 | 20.0 | 0.978235 |
| orgão público | 0.45283 | 1.0 | 0.623377 | 24.0 | 0.953961 |
| varejo | 0.8 | 0.296296 | 0.432432 | 27.0 | 0.871557 |
| micro avg | 0.67619 | 0.617391 | 0.645455 | 115.0 | 0.891453 |
| macro avg | 0.815566 | 0.5838 | 0.602006 | 115.0 | 0.940503 |
| weighted avg | 0.804431 | 0.617391 | 0.628662 | 115.0 | NaN |
| samples avg | 0.67619 | 0.638095 | 0.650794 | 115.0 | NaN |
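
The "F1 Average Score" printed with each report is the support-weighted mean of the per-label F1 scores, which is also what the `weighted avg` row shows. A short check on made-up targets (not the project data) makes the relationship explicit.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel targets and predictions (illustrative only)
y_true_demo = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]])
y_pred_demo = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])

per_label_f1 = f1_score(y_true_demo, y_pred_demo, average=None)
support = y_true_demo.sum(axis=0)  # number of true samples per label

# Weighted F1 = per-label F1 averaged with the label supports as weights
manual_weighted = np.average(per_label_f1, weights=support)
sklearn_weighted = f1_score(y_true_demo, y_pred_demo, average="weighted")

print(manual_weighted, sklearn_weighted)
```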


#### Confusion Matrix

In [64]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

for i, label in enumerate(labels):
	px.imshow(
		confusion_matrix(y_test[:, i], y_pred_test[:, i]),
		text_auto=True,
		color_continuous_scale='viridis',
		labels=dict(x='Predicted', y='True', color='Count'),
		title=f'Confusion Matrix for {label}'
	).update_layout(
		xaxis_showticklabels=False,
		yaxis_showticklabels=False,
	).show()
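
As a compact alternative to looping over labels, scikit-learn also provides `multilabel_confusion_matrix`, which returns one 2x2 matrix per label in a single call. The sketch below uses made-up arrays; each matrix is laid out as `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Toy multilabel targets and predictions (illustrative only)
y_true_demo = np.array([[1, 0, 1], [0, 1, 0]])
y_pred_demo = np.array([[1, 0, 0], [0, 1, 1]])

mcm = multilabel_confusion_matrix(y_true_demo, y_pred_demo)

print(mcm.shape)  # (n_labels, 2, 2)
print(mcm[2])     # third label: both samples are misclassified
```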

#### Feature importance

In [22]:
# Create a map for each label
labels_mapping = {
    '0' : 'educação',
    '1' : 'finanças',
    '2' : 'indústrias',
    '3' : 'orgão público',
    '4' : 'varejo',
}
# Get classifier from pipeline
clf = baseline_pipeline[1]
# Extract combination labels
comb_labels = list(clf.unique_combinations_.keys())
# Get features names
feature_names = tfidf_vec.get_feature_names_out()
# Create feature dictionary
feature_importance = pd.DataFrame({'ranking':list(range(1,11))})
for i, class_label in enumerate(comb_labels):
    # Get index for the top probs
    top_n_features = np.argsort(clf.classifier.feature_log_prob_[i])[-10:]
    # Rename label or combination of labels
    if len(class_label) > 1:
        class_label = ','.join(labels_mapping[j] for j in class_label.split(','))
    else:
        class_label = labels_mapping[class_label]
    # Add observation to dataframe
    feature_importance[class_label] = [feature_names[j] for j in top_n_features]
# Set ranking to index
feature_importance.set_index('ranking', inplace=True)
# Sort columns
feature_importance.sort_index(axis=1, inplace=True)

feature_importance.head()

| ranking | educação | educação,finanças | educação,indústrias | educação,orgão público | finanças | finanças,indústrias | finanças,orgão público | finanças,varejo | indústrias | indústrias,orgão público | indústrias,varejo | orgão público | varejo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | horas | seguranca | 480 | estudantes | mes | gama | financeiras | consegue | carros | design | concessionarias | emissao | encontre |
| 2 | graduacao | trabalho | 24x | prefeitura | banco | bilhao | des | 12x | tracker | civil | adquira | direito | compre |
| 3 | aula | tecnico | producao | processo | conta | renovacao | instituicoes | unidades | seguranca | construcao | explore | nacional | escolha |
| 4 | estudos | 03 | aprendizagem | inscricoes | itau | pague | if | 89 | logistica | brasil | buscar | lei | pacote |
| 5 | lingua | 32x | assistente | municipal | valor | parcela | imposto | 29 | onix | trabalhadores | combina | justica | 90 |


#### Ranking metrics for Multilabel

In [23]:
from sklearn.metrics import coverage_error, label_ranking_average_precision_score, label_ranking_loss, ndcg_score

pd.DataFrame({
    'Coverage error' : [
        coverage_error(y_train, y_proba_train),
        coverage_error(y_test, y_proba_test),
    ],
    'LRAP' : [
        label_ranking_average_precision_score(y_train, y_proba_train),
        label_ranking_average_precision_score(y_test, y_proba_test),
    ],
    'Label ranking loss' : [
        label_ranking_loss(y_train, y_proba_train),
        label_ranking_loss(y_test, y_proba_test),
    ],
    'NDCG' : [
        ndcg_score(y_train, y_proba_train),
        ndcg_score(y_test, y_proba_test),
    ],
}, index=['Train dataset', 'Test dataset']).T

| | Train dataset | Test dataset |
|---|---|---|
| Coverage error | 1.221154 | 1.714286 |
| LRAP | 0.971274 | 0.812381 |
| Label ranking loss | 0.027444 | 0.149206 |
| NDCG | 0.981371 | 0.862186 |
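
To help read these numbers, the sketch below evaluates three of the ranking metrics on a tiny made-up example where the values can be worked out by hand. Coverage error is how far down the ranked label list one must go to include every true label (lower is better, and the minimum is the average number of true labels per sample), LRAP rewards true labels that are ranked near the top, and ranking loss counts wrongly ordered (true, false) label pairs.

```python
import numpy as np
from sklearn.metrics import (coverage_error,
                             label_ranking_average_precision_score,
                             label_ranking_loss)

# Two samples, three labels, one true label each (illustrative only)
y_true_demo = np.array([[1, 0, 0], [0, 0, 1]])
y_score_demo = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])

# Sample 1: the true label is ranked 2nd; sample 2: the true label is ranked 3rd
cov = coverage_error(y_true_demo, y_score_demo)                          # (2 + 3) / 2
lrap = label_ranking_average_precision_score(y_true_demo, y_score_demo)  # (1/2 + 1/3) / 2
rloss = label_ranking_loss(y_true_demo, y_score_demo)                    # (1/2 + 2/2) / 2

print(cov, lrap, rloss)
```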


# Second cycle: testing other estimators

## Decision Tree

In [24]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=42)

dt_pipeline = run_pipeline(X, y, tfidf_vec, dt_clf)

Estimator: DecisionTreeClassifier

Average accuracy score
Train dataset: 100.00%
Test dataset:  60.00%

Hamming loss average
Train dataset: 0.00
Test datasets:  0.15

Cross validation score
F1 weighted: 0.60


In [25]:
# Save Decision Tree model
joblib.dump(dt_pipeline, 'models/decision_tree.joblib')

['models/decision_tree.joblib']

In [73]:
import graphviz
from sklearn.tree import export_graphviz

dot_data = export_graphviz(
    dt_pipeline[1], out_file=None, 
    #feature_names=dt_pipeline[0].get_feature_names_out(),  
    class_names=labels,  
    filled=True, rounded=True,  
    special_characters=True)  

#graphviz.Source(dot_data)

## Random Forest

The [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) uses the concept of *Bagging* (Bootstrap Aggregating) to generate random independent trees and makes the final prediction by averaging the individual predictions.  
  
This approach tends to average out some errors and outliers, making the model more general and less prone to overfitting. 
Also, since it can handle multilabel problems natively, it is not necessary to wrap it in a meta classifier. Although that is possible, doing so has a large chance of overfitting the model.  
The most important parameters are:  
- `n_estimators`: how many trees will be built. More trees take longer to train, and each additional tree contributes less to the final prediction.  
- `max_features`: number of features in the random sample considered at each split. Lower values tend to reduce overfitting, but can lead to underfitting.
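
As a minimal sketch of the native multilabel support, the snippet below fits a small forest directly on a two-column indicator target, with no meta classifier. The random data and the parameter values are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic features and two labels derived from them (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.random((20, 4))
Y_toy = (X_toy[:, :2] > 0.5).astype(int)

rf_demo = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
rf_demo.fit(X_toy, Y_toy)  # a 2-D indicator target is accepted as-is

print(rf_demo.n_outputs_)            # number of label columns seen during fit
print(rf_demo.predict(X_toy).shape)  # predictions keep the (n_samples, n_labels) shape
```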


In [26]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()

rf_pipeline = run_pipeline(X, y, tfidf_vec, rf_clf)

Estimator: RandomForestClassifier

Average accuracy score
Train dataset: 100.00%
Test dataset:  50.48%

Hamming loss average
Train dataset: 0.00
Test datasets:  0.14

Cross validation score
F1 weighted: 0.60


In [27]:
# Save Random Forest model
joblib.dump(rf_pipeline, 'models/random_forest.joblib')

['models/random_forest.joblib']

## Support Vector Machine

## XGBoost

In [None]:
from xgboost import XGBClassifier

# Create a pipeline with the classifier
xgb_pipeline = Pipeline([
	('tfidf', TfidfVectorizer(
		strip_accents='unicode',
		stop_words=pt_stopwords,
	)),
	('clf', XGBClassifier(
		use_label_encoder=False,
		random_state=42
	))
]).fit(X_train, y_train)

# Third cycle: hyper-parameter tuning

## Optimized search with `hyperopt` 

In [34]:
# Optimize hyperparameters with hyperopt library
from hyperopt import Trials, hp, tpe, fmin, space_eval
# Import cross validation model selection
from sklearn.model_selection import cross_validate

# Transform features
X_transf = tfidf_vec.fit_transform(X)
X_train_transf = tfidf_vec.fit_transform(X_train)
X_test_transf = tfidf_vec.transform(X_test)

# Define the parameter space to search over for the best model parameters using hyperopt
param_space = {
	'n_estimators':hp.randint('n_estimators', 100, 1000),
	'max_depth': hp.randint('max_depth', 10, 200),           
	'min_samples_split': hp.uniform('min_samples_split', 0, 1),   
	'min_samples_leaf': hp.randint('min_samples_leaf', 1, 10),
	'criterion': hp.choice('criterion', ['gini', 'entropy']),
	'max_features': hp.choice('max_features', ['sqrt', 'log2']),
	'bootstrap': hp.choice('bootstrap', [True, False]),
	'class_weight': hp.choice('class_weight', ['balanced', 'balanced_subsample', None]),
}

# Define the objective function to minimize
pbar = tqdm(total=30, desc="Hyperopt")
def objective(params):
	# Create the pipeline with the parameters
	pipeline = Pipeline([
		('tfidf', tfidf_vec),
		('clf', RandomForestClassifier(**params, random_state=42))
	])
	# Compute the cross validation score
	cross_validate_results = cross_validate(	
		pipeline, X_train, y_train, 
		cv=5, scoring='f1_weighted', n_jobs=-1)
	pbar.update()
	# Return the negative of the cross validation score
	return -cross_validate_results['test_score'].mean()

# Create the trials object to store the results
trials = Trials()
# Run the hyperparameter optimization
best = fmin(
	fn=objective, 
	space=param_space, 
	algo=tpe.suggest, 
	max_evals=30, trials=trials,
	show_progressbar=False)
pbar.close()
# Get the best parameters
best_params = space_eval(param_space, best)
# Print the best parameters
print(best_params)
# Create the classifier with the best parameters
rf_clf = RandomForestClassifier(**best_params)
# Run the pipeline
rf_pipeline = run_pipeline(X, y, tfidf_vec, rf_clf)
# Get the f1 score for the test dataset
f1_score_test = f1_score(y_test, rf_pipeline[1].predict(X_test_transf), average="weighted")
# Print the f1 score
print(f'F1 score for test dataset = {f1_score_test:.3f}')

Hyperopt: 100%|██████████| 30/30 [03:27<00:00,  6.91s/it]


{'bootstrap': False, 'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 169, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 0.2355470696890487, 'n_estimators': 985}
Estimator: RandomForestClassifier

Average accuracy score
Train dataset: 89.18%
Test dataset:  51.43%

Hamming loss average
Train dataset: 0.02
Test datasets:  0.11

Cross validation score
F1 weighted: 0.64
F1 score for test dataset = 0.680


## Fine tuning with `GridSearchCV`

In [38]:
from sklearn.model_selection import GridSearchCV

# Define classifiers and hyper-parameters
parameters = [
    # Multinomial Naive Bayes
    {
        'classifier': [MultinomialNB()],
        'classifier__alpha': np.linspace(0.1, 1.0, 5, endpoint=True),
        'classifier__class_prior': [None],
        'classifier__fit_prior': [True]
    },
    # Random Forest Classifier
    {
        'classifier': [RandomForestClassifier()],
        'classifier__criterion': ['entropy'],
        'classifier__n_estimators': [800, 900, 1000, 1200],
        'classifier__max_features': ['sqrt'],
        'classifier__max_depth': [150, 160, 170, 180],
        'classifier__class_weight': ['balanced'],
        'classifier__bootstrap': [False],
    },
]

# Instance GridSearch object
grid_search = GridSearchCV(
    LabelPowerset(), 
    param_grid=parameters, 
    scoring='f1_weighted',
    cv=5,
    verbose=1, n_jobs=-1)

# Run GridSearch
grid_search.fit(X_train_transf, y_train)

# Parameters of the best estimator
print (f'{CYAN}Best estimator hyper-parameters: {WHITE}\n{grid_search.best_params_}')
# Model score (average accuracy)
print(f'\n{CYAN}Average accuracy score{WHITE}')
print(f'Train dataset: {grid_search.best_estimator_.score(X_train_transf, y_train)*100:.2f}%')
print(f'Test dataset:  {grid_search.best_estimator_.score(X_test_transf, y_test)*100:.2f}%')
# Hamming Loss metrics
print(f'\n{CYAN}Hamming loss average{WHITE}')
print(f'Train dataset: {hamming_loss(y_train, grid_search.best_estimator_.predict(X_train_transf)):.2f}')
print(f'Test datasets:  {hamming_loss(y_test, grid_search.best_estimator_.predict(X_test_transf)):.2f}')
# Cross validation score with F1 weighted
print(f'\n{CYAN}Cross validation score{WHITE}\nF1 weighted: {cross_val_score(grid_search.best_estimator_, X_transf, y, cv=5, scoring="f1_weighted").mean():.2f}')

Fitting 5 folds for each of 21 candidates, totalling 105 fits
Best estimator hyper-parameters: 
{'classifier': MultinomialNB(alpha=0.1), 'classifier__alpha': 0.1, 'classifier__class_prior': None, 'classifier__fit_prior': True}

Average accuracy score
Train dataset: 98.32%
Test dataset:  74.29%

Hamming loss average
Train dataset: 0.00
Test datasets:  0.09

Cross validation score
F1 weighted: 0.73


In [65]:
# Save grid search results
joblib.dump(grid_search, 'grid_search.joblib')

['grid_search.joblib']
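
The grid search above runs on features that were vectorized beforehand. A variant worth considering is to search over a full `Pipeline`, so that vectorizer settings and classifier settings are tuned together and the vectorizer is re-fitted inside each fold. The sketch below is an assumption-laden illustration on made-up sentences, using `MultiOutputClassifier` from scikit-learn instead of `LabelPowerset` to stay dependency-free; the grid values are arbitrary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical sentences with two labels: [finance, education]
texts = [
    "open a bank account", "enroll in the course",
    "bank loan rates", "school course schedule",
    "bank credit card", "course exam grades",
    "bank savings account", "school enrollment course",
]
y_toy = np.array([[1, 0], [0, 1]] * 4)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(MultinomialNB())),
])

# Parameters are addressed as <step>__<param>; the wrapped estimator
# inside MultiOutputClassifier is reached through `estimator__`
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=2)
search.fit(texts, y_toy)
print(search.best_params_)
```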

## Metrics for model with best parameters

In [40]:
# Predictions
y_pred_train = grid_search.predict(X_train_transf).toarray()
y_pred_test = grid_search.predict(X_test_transf).toarray()
y_proba_train = grid_search.predict_proba(X_train_transf).toarray()
y_proba_test = grid_search.predict_proba(X_test_transf).toarray()
# Calculate metrics for train and test
fpr_train, trp_train, roc_auc_train = multiclass_rouc_auc(y_train, y_proba_train, len(labels))
fpr_test, trp_test, roc_auc_test = multiclass_rouc_auc(y_test, y_proba_test, len(labels))
# Compute precision, recall and f1 score for each label, plus micro and macro average
clf_report_train = pd.DataFrame(classification_report(y_train, y_pred_train, output_dict=True)).T
clf_report_test = pd.DataFrame(classification_report(y_test, y_pred_test, output_dict=True)).T
# Rename axis with labels
clf_report_train.index = list(labels) + ['micro avg', 'macro avg', 'weighted avg', 'samples avg']
clf_report_test.index = list(labels) + ['micro avg', 'macro avg', 'weighted avg', 'samples avg']
# Add the ROC AUC scores computed previously
clf_report_train['roc auc'] = list(roc_auc_train.values()) + ([np.nan] * 2)
clf_report_test['roc auc'] = list(roc_auc_test.values()) + ([np.nan] * 2)
# Plot ROC AUC curves
plot_multiclass_roc_auc(fpr_train, trp_train, roc_auc_train, labels, title='Train dataset').show()
plot_multiclass_roc_auc(fpr_test, trp_test, roc_auc_test, labels, title='Test dataset').show()

In [41]:
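# Recompute the weighted F1 from the grid-search predictions
# (the earlier `f1_score_train` variable still holds the baseline value)
print(f'Classification report for Train dataset. F1 Average Score = {f1_score(y_train, y_pred_train, average="weighted"):.3f}')
clf_report_train

Classification report for Train dataset. F1 Average Score = 0.989


| | precision | recall | f1-score | support | roc auc |
|---|---|---|---|---|---|
| educação | 1.0 | 1.0 | 1.0 | 92.0 | 1.0 |
| finanças | 0.983607 | 0.952381 | 0.967742 | 63.0 | 0.99964 |
| indústrias | 1.0 | 0.977011 | 0.988372 | 87.0 | 0.99993 |
| orgão público | 1.0 | 0.992366 | 0.996169 | 131.0 | 0.999973 |
| varejo | 0.987179 | 0.974684 | 0.980892 | 79.0 | 1.0 |
| micro avg | 0.995516 | 0.982301 | 0.988864 | 452.0 | 0.999936 |
| macro avg | 0.994157 | 0.979288 | 0.986635 | 452.0 | 0.99994 |
| weighted avg | 0.995474 | 0.982301 | 0.988816 | 452.0 | NaN |
| samples avg | 0.995192 | 0.989183 | 0.991186 | 452.0 | NaN |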


In [42]:
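# Recompute the weighted F1 from the grid-search predictions
# (the earlier `f1_score_test` variable still holds the Random Forest value)
print(f'Classification report for Test dataset. F1 Average Score = {f1_score(y_test, y_pred_test, average="weighted"):.3f}')
clf_report_test

Classification report for Test dataset. F1 Average Score = 0.785


| | precision | recall | f1-score | support | roc auc |
|---|---|---|---|---|---|
| educação | 0.925926 | 0.806452 | 0.862069 | 31.0 | 0.971229 |
| finanças | 0.666667 | 0.461538 | 0.545455 | 13.0 | 0.853679 |
| indústrias | 0.85 | 0.85 | 0.85 | 20.0 | 0.975294 |
| orgão público | 0.785714 | 0.916667 | 0.846154 | 24.0 | 0.965792 |
| varejo | 0.809524 | 0.62963 | 0.708333 | 27.0 | 0.871557 |
| micro avg | 0.828571 | 0.756522 | 0.790909 | 115.0 | 0.932683 |
| macro avg | 0.807566 | 0.732857 | 0.762402 | 115.0 | 0.931872 |
| weighted avg | 0.826823 | 0.756522 | 0.784763 | 115.0 | NaN |
| samples avg | 0.828571 | 0.785714 | 0.8 | 115.0 | NaN |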


# Conclusions

In this last model, despite having the highest subset accuracy on the test dataset (74.29%) and the highest cross-validated weighted F1 (0.73), the almost perfect train score (98.32%) indicates overfitting.  