[Classificação de frases por setor](https://handtalk.notion.site/Classifica-o-de-frases-por-setor-18c80adbbf874c519c9efe19678ac4c1)

https://www.kaggle.com/code/rhodiumbeng/classifying-multi-label-comments-0-9741-lb/notebook

In [72]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
import plotly.graph_objects as go
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from unidecode import unidecode
import pickle

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/gabriel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [73]:
df = pd.read_csv("dataset.csv")

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 521 entries, 0 to 520
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  521 non-null    object
 1   category  521 non-null    object
dtypes: object(2)
memory usage: 8.3+ KB


In [75]:
df

Unnamed: 0,sentence,category
0,"Auxílio-Doença Previdenciário, Benefícios em E...",orgão público
1,"PAGAR TODAS AS CONTAS EM ATRASO R$1.290,90.",finanças
2,Então encontraremos na próxima aula.,educação
3,Veja os resultados da categoria de ofertas do ...,indústrias
4,"Além disso, a embalagem é reutilizável e 100% ...","indústrias,varejo"
...,...,...
516,"Selecione o local de estudo, curso sem encontr...",educação
517,ESTUDANTES DA REDE MUNICIPAL VOLTAM ÀS AULAS E...,"educação,orgão público"
518,Empresas e órgãos públicos,orgão público
519,DGE – Departamento de Gestão Estratégica Metas...,orgão público


In [76]:
col0 = np.unique(df["category"].str.split(",").apply(pd.Series).loc[:, 0])
col1 = np.unique(df["category"].str.split(",").apply(pd.Series).loc[:, 1].dropna())
cat = np.unique(np.append(col1, col0))
num_classes = len(cat)

In [77]:
num_classes

5

The training dataset contains texts that are categorized into one or more of six distinct classes: 'educação', 'finanças', 'indústrias', 'orgão público', and 'varejo'. This setup constitutes a multi-label classification challenge.

## Preprocessing

First, we'll divide the dataset into training and testing sets. This ensures that the model is trained on a subset of the data and evaluated on a separate set it hasn't seen before, allowing for a fair assessment of its performance. This step is crucial in preventing data leakage and ensuring that our evaluation metrics accurately reflect the model's ability to generalize to new data.

In [78]:
train, test = train_test_split(df, test_size=0.2)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

### One hot encoding targe variable

Second, we'll apply one-hot encoding to the target column. This process will transform each category into a separate column, where a category's presence or absence in a sample is represented by 1 or 0, respectively.

In [79]:
train["category"] = train["category"].str.split(",")
test["category"] = test["category"].str.split(",")

mlb = MultiLabelBinarizer()

one_hot_encoded_train = mlb.fit_transform(train['category'])
one_hot_train_df = pd.DataFrame(one_hot_encoded_train, columns=mlb.classes_)
train = pd.concat([train, one_hot_train_df], axis=1).drop('category', axis=1)

one_hot_encoded_test = mlb.transform(test['category'])
one_hot_test_df = pd.DataFrame(one_hot_encoded_test, columns=mlb.classes_)
test = pd.concat([test, one_hot_test_df], axis=1).drop('category', axis=1)

In [80]:
# check missing values in numeric columns
train.describe()

Unnamed: 0,educação,finanças,indústrias,orgão público,varejo
count,416.0,416.0,416.0,416.0,416.0
mean,0.235577,0.15625,0.213942,0.300481,0.1875
std,0.42487,0.363529,0.41058,0.459019,0.390782
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,1.0,0.0
max,1.0,1.0,1.0,1.0,1.0


In [81]:
train["sentence"] = train["sentence"].str.strip().str.lower()
test["sentence"] = test["sentence"].str.strip().str.lower()

In [82]:
X_train = train.sentence
X_test = test.sentence

print(X_train.shape, X_test.shape)

(416,) (105,)


### TfidfVectorizer

TfidfVectorizer stands for Term Frequency-Inverse Document Frequency. It enhances the simple count-based approach by considering not only how often a word appears in a single document but also how unique the word is across all documents in the corpus. It combines two metrics:

- Term Frequency (TF): Similar to CountVectorizer, it measures how frequently a term occurs in a document. This is normalized by dividing by the total number of words in the document to avoid bias towards longer documents.
- Inverse Document Frequency (IDF): This measures how unique or common a word is in the entire document corpus. The more documents a word appears in, the lower its IDF (and thus, its importance).

The TF-IDF score of a word increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word across the corpus. This helps to diminish the effect of frequently occurring words that don’t hold much meaningful information about the document

In [83]:
stop_words_pt = stopwords.words('portuguese')
vectorizer = TfidfVectorizer(stop_words=stop_words_pt)

X_train_matrix = vectorizer.fit_transform(X_train)
X_test_matrix = vectorizer.transform(X_test)

In [84]:
X_train_matrix

<416x1663 sparse matrix of type '<class 'numpy.float64'>'
	with 2780 stored elements in Compressed Sparse Row format>

In [85]:
X_test_matrix

<105x1663 sparse matrix of type '<class 'numpy.float64'>'
	with 409 stored elements in Compressed Sparse Row format>

In [86]:
y_train = train[cat]
y_test = test[cat]

## Training

There are various strategies to address multi-label classification issues. In scenarios where there is no significant correlation among the target classes, one straightforward approach is to employ Binary Relevance.

Binary Relevance stands out for its simplicity and popularity in multi-label classification tasks, aiming to simultaneously predict multiple labels for each dataset instance. Its fundamental concept involves breaking down the multi-label classification challenge into several independent binary classification tasks, corresponding to each label within the dataset.

In [87]:
from sklearn.tree import DecisionTreeClassifier
kf = KFold(n_splits=5)
dict_acc = dict(zip(cat, [[] for _ in range(len(cat))]))
model_dict = {}
for i, (train_index, test_index) in enumerate(kf.split(X_train_matrix)):
    X_tr, X_te = X_train_matrix[train_index], X_train_matrix[test_index]
    y_tr, y_te = y_train.loc[train_index], y_train.loc[test_index]
    print(f"Training in Fold: {i}")
    for label in cat:
        # model = LogisticRegression(C=15)
        model = DecisionTreeClassifier()
        print(f'Processing {label}')
        model.fit(X_tr, y_tr[label])
        y_pred = model.predict(X_te)
        score = accuracy_score(y_te[label], y_pred)
        print(f'Training accuracy is {score}')
        test_y_prob = model.predict_proba(X_te)[:,1]
        model_dict[label] = model
        dict_acc[label].append(score)
        if score > np.max(dict_acc[label]):
            model_dict[label] = model
    print("*" * 50)

Training in Fold: 0
Processing educação
Training accuracy is 0.8095238095238095
Processing finanças
Training accuracy is 0.8452380952380952
Processing indústrias
Training accuracy is 0.8690476190476191
Processing orgão público
Training accuracy is 0.7619047619047619
Processing varejo
Training accuracy is 0.8809523809523809
**************************************************
Training in Fold: 1
Processing educação
Training accuracy is 0.8795180722891566
Processing finanças
Training accuracy is 0.8313253012048193
Processing indústrias
Training accuracy is 0.7831325301204819
Processing orgão público
Training accuracy is 0.8072289156626506
Processing varejo
Training accuracy is 0.8313253012048193
**************************************************
Training in Fold: 2
Processing educação
Training accuracy is 0.8674698795180723
Processing finanças
Training accuracy is 0.9036144578313253
Processing indústrias
Training accuracy is 0.8433734939759037
Processing orgão público
Training accuracy is 

In [88]:
for c in model_dict.keys():
    filename = "_".join(unidecode(c).split(" "))
    with open(f'./models/{filename}.pkl', 'wb') as file:
        pickle.dump(model_dict[c], file)

In [89]:
acc_df = pd.DataFrame(dict_acc)
acc_df["index"] = acc_df.reset_index()["index"].apply(lambda x: f"Fold {x + 1}")
acc_df = acc_df.rename(columns={"index": ""}).set_index("").T

In [90]:
acc_df

Unnamed: 0,Fold 1,Fold 2,Fold 3,Fold 4,Fold 5
educação,0.809524,0.879518,0.86747,0.855422,0.891566
finanças,0.845238,0.831325,0.903614,0.891566,0.807229
indústrias,0.869048,0.783133,0.843373,0.831325,0.891566
orgão público,0.761905,0.807229,0.855422,0.771084,0.819277
varejo,0.880952,0.831325,0.843373,0.855422,0.86747


The mean training accuracy across folds is shown bellow:

In [91]:
acc_df.mean(axis=1)

educação         0.860700
finanças         0.855795
indústrias       0.843689
orgão público    0.802983
varejo           0.855709
dtype: float64

## Validation

In [92]:
from sklearn.metrics import accuracy_score, hamming_loss, precision_score, recall_score, f1_score, jaccard_score
for label in cat:
    print(f"Label: {label}")
    y_true = y_test[label]
    y_pred = model_dict[label].predict(X_test_matrix)
    accuracy = np.round(accuracy_score(y_true, y_pred), 4)
    hamming = np.round(hamming_loss(y_true, y_pred), 4)
    precision = np.round(precision_score(y_true, y_pred, average='macro'), 4)
    recall = np.round(recall_score(y_true, y_pred, average='macro'), 4)
    f1 = np.round(f1_score(y_true, y_pred, average='macro'), 4)
    jaccard = np.round(jaccard_score(y_true, y_pred, average='macro'), 4)

    print(f"Accuracy: {accuracy}\nHamming Loss: {hamming}\nPrecision: {precision}\nRecall: {recall}\nF1 Score: {f1}\nJaccard Score: {jaccard}")
    print("*" * 30)

Label: educação
Accuracy: 0.8286
Hamming Loss: 0.1714
Precision: 0.824
Recall: 0.6675
F1 Score: 0.6983
Jaccard Score: 0.5729
******************************
Label: finanças
Accuracy: 0.8571
Hamming Loss: 0.1429
Precision: 0.6594
Recall: 0.7195
F1 Score: 0.6812
Jaccard Score: 0.5671
******************************
Label: indústrias
Accuracy: 0.9048
Hamming Loss: 0.0952
Precision: 0.8737
Recall: 0.7663
F1 Score: 0.8056
Jaccard Score: 0.6974
******************************
Label: orgão público
Accuracy: 0.8095
Hamming Loss: 0.1905
Precision: 0.8297
Recall: 0.6867
F1 Score: 0.7125
Jaccard Score: 0.58
******************************
Label: varejo
Accuracy: 0.8952
Hamming Loss: 0.1048
Precision: 0.8988
Recall: 0.8263
F1 Score: 0.8536
Jaccard Score: 0.7527
******************************


In [93]:
# sample = "Melhor política industrial é acabar com isenção para compras internacionais, diz presidente da Fiemg"
# sample = "Chevrolet Spin recebe mudanças, mas mantém antigo motor 1.8"
sample = "Bancos estão mudando datas de fechamento das faturas dos cartões?"

def prediction(sample):
    sample = sample.strip().lower()
    sample_matrix = vectorizer.transform([sample])
    classes = []
    probs = []
    for key, value in model_dict.items():
        prob = np.round(100 * value.predict_proba(sample_matrix)[0][-1], 2)
        print(value.predict_proba(sample_matrix))
        print(f"{key}: {prob}")
        if prob >= 50:
            classes.append(key)
            probs.append(prob)
    return dict(zip(classes, probs))

In [94]:
prediction(sample)

[[1. 0.]]
educação: 0.0
[[1. 0.]]
finanças: 0.0
[[1. 0.]]
indústrias: 0.0
[[1. 0.]]
orgão público: 0.0
[[1. 0.]]
varejo: 0.0


{}

In [95]:
X_train_matrix

<416x1663 sparse matrix of type '<class 'numpy.float64'>'
	with 2780 stored elements in Compressed Sparse Row format>

In [96]:
y_train

Unnamed: 0,educação,finanças,indústrias,orgão público,varejo
0,0,0,1,0,0
1,1,0,0,0,0
2,0,0,1,0,1
3,0,0,0,0,1
4,0,0,0,1,0
...,...,...,...,...,...
411,0,0,0,1,0
412,1,0,0,0,0
413,1,0,0,0,0
414,0,0,0,0,1


In [150]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

clf = MultiOutputClassifier(MultinomialNB()).fit(X_train_matrix, y_train)

In [160]:
from catboost import CatBoostClassifier
from sklearn.multiclass import OneVsRestClassifier

ovr = OneVsRestClassifier(estimator=CatBoostClassifier(iterations=10))
ovr.fit(X_train_matrix, y_train)

Learning rate set to 0.483378
0:	learn: 0.5545012	total: 54.9ms	remaining: 494ms
1:	learn: 0.4907733	total: 58.9ms	remaining: 236ms
2:	learn: 0.4442690	total: 63.9ms	remaining: 149ms
3:	learn: 0.4028789	total: 68.1ms	remaining: 102ms
4:	learn: 0.3766845	total: 71.8ms	remaining: 71.8ms
5:	learn: 0.3610786	total: 75.7ms	remaining: 50.4ms
6:	learn: 0.3423515	total: 80.5ms	remaining: 34.5ms
7:	learn: 0.3226904	total: 84.5ms	remaining: 21.1ms
8:	learn: 0.3129856	total: 88ms	remaining: 9.78ms
9:	learn: 0.2953522	total: 92.3ms	remaining: 0us
Learning rate set to 0.483378
0:	learn: 0.4328148	total: 9.05ms	remaining: 81.5ms
1:	learn: 0.3729760	total: 13.9ms	remaining: 55.5ms
2:	learn: 0.3436014	total: 19.3ms	remaining: 45ms
3:	learn: 0.3292257	total: 26.9ms	remaining: 40.4ms
4:	learn: 0.3130460	total: 31.6ms	remaining: 31.6ms
5:	learn: 0.2956640	total: 38.1ms	remaining: 25.4ms
6:	learn: 0.2769167	total: 42.8ms	remaining: 18.3ms
7:	learn: 0.2688242	total: 48ms	remaining: 12ms
8:	learn: 0.2519707

In [152]:
from sklearn.metrics import accuracy_score
print('Accuracy Score: ', accuracy_score(y_test, prediction))

Accuracy Score:  0.22857142857142856


In [153]:
from sklearn.metrics import hamming_loss
print('Hamming Loss: ', round(hamming_loss(y_test, prediction),2))

Hamming Loss:  0.17


In [140]:
# sample = "Melhor política industrial é acabar com isenção para compras internacionais, diz presidente da Fiemg"
# sample = "Chevrolet Spin recebe mudanças, mas mantém antigo motor 1.8"
# sample = "Bancos estão mudando datas de fechamento das faturas dos cartões?"
sample = "A união entre os dois bancos foi aprovada pelo CADE em 18 de agosto de 2010."
sample = sample.strip().lower()
sample_matrix = vectorizer.transform([sample])
prediction = clf.predict(sample_matrix)
predicted_labels = (cat * prediction)[(cat * prediction) != ""]

In [141]:
predicted_labels

array([], dtype=object)