[Classificação de frases por setor](https://handtalk.notion.site/Classifica-o-de-frases-por-setor-18c80adbbf874c519c9efe19678ac4c1)

https://www.kaggle.com/code/rhodiumbeng/classifying-multi-label-comments-0-9741-lb/notebook

In [82]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
import plotly.graph_objects as go
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from unidecode import unidecode
import pickle

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/gabriel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [83]:
df = pd.read_csv("dataset.csv")

In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 521 entries, 0 to 520
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  521 non-null    object
 1   category  521 non-null    object
dtypes: object(2)
memory usage: 8.3+ KB


In [85]:
df

Unnamed: 0,sentence,category
0,"Auxílio-Doença Previdenciário, Benefícios em E...",orgão público
1,"PAGAR TODAS AS CONTAS EM ATRASO R$1.290,90.",finanças
2,Então encontraremos na próxima aula.,educação
3,Veja os resultados da categoria de ofertas do ...,indústrias
4,"Além disso, a embalagem é reutilizável e 100% ...","indústrias,varejo"
...,...,...
516,"Selecione o local de estudo, curso sem encontr...",educação
517,ESTUDANTES DA REDE MUNICIPAL VOLTAM ÀS AULAS E...,"educação,orgão público"
518,Empresas e órgãos públicos,orgão público
519,DGE – Departamento de Gestão Estratégica Metas...,orgão público


In [86]:
col0 = np.unique(df["category"].str.split(",").apply(pd.Series).loc[:, 0])
col1 = np.unique(df["category"].str.split(",").apply(pd.Series).loc[:, 1].dropna())
cat = np.unique(np.append(col1, col0))
num_classes = len(cat)

In [87]:
num_classes

5

The training dataset contains texts that are categorized into one or more of six distinct classes: 'educação', 'finanças', 'indústrias', 'orgão público', and 'varejo'. This setup constitutes a multi-label classification challenge.

## Preprocessing

First, we'll divide the dataset into training and testing sets. This ensures that the model is trained on a subset of the data and evaluated on a separate set it hasn't seen before, allowing for a fair assessment of its performance. This step is crucial in preventing data leakage and ensuring that our evaluation metrics accurately reflect the model's ability to generalize to new data.

In [88]:
train, test = train_test_split(df, test_size=0.2)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

### One hot encoding targe variable

Second, we'll apply one-hot encoding to the target column. This process will transform each category into a separate column, where a category's presence or absence in a sample is represented by 1 or 0, respectively.

In [89]:
train["category"] = train["category"].str.split(",")
test["category"] = test["category"].str.split(",")

mlb = MultiLabelBinarizer()

one_hot_encoded_train = mlb.fit_transform(train['category'])
one_hot_train_df = pd.DataFrame(one_hot_encoded_train, columns=mlb.classes_)
train = pd.concat([train, one_hot_train_df], axis=1).drop('category', axis=1)

one_hot_encoded_test = mlb.transform(test['category'])
one_hot_test_df = pd.DataFrame(one_hot_encoded_test, columns=mlb.classes_)
test = pd.concat([test, one_hot_test_df], axis=1).drop('category', axis=1)

In [90]:
# check missing values in numeric columns
train.describe()

Unnamed: 0,educação,finanças,indústrias,orgão público,varejo
count,416.0,416.0,416.0,416.0,416.0
mean,0.230769,0.158654,0.216346,0.295673,0.192308
std,0.421832,0.365793,0.412249,0.456894,0.394588
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,1.0,0.0
max,1.0,1.0,1.0,1.0,1.0


In [91]:
correlation = train[["educação", "finanças", "indústrias", "orgão público", "varejo"]].corr()

The correlation between category is very lower, near to 0.

There is no correlation between them

In [92]:
mask = np.triu(np.ones_like(correlation, dtype=bool))
rLT = correlation.mask(mask)

heat = go.Heatmap(
    z = rLT,
    x = rLT.columns.values,
    y = rLT.columns.values,
    zmin = - 0.25, # Sets the lower bound of the color domain
    zmax = 1,
    xgap = 1, # Sets the horizontal gap (in pixels) between bricks
    ygap = 1,
    colorscale = 'viridis',
)

fig=go.Figure(data=[heat])
layout = fig.update_layout(
    title={
    'text': "<b>Categories correlation</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=600, width=600,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    yaxis_autorange='reversed'
)

fig.show()

In [93]:
# Verifying if there is any blank sentence
print(df[df["sentence"] == ""])
print("\n")
print(df[df["sentence"] == " "])

Empty DataFrame
Columns: [sentence, category]
Index: []


Empty DataFrame
Columns: [sentence, category]
Index: []


### Exploratory Data Analysis

Let's plot a histogram to verify the distribution of sentence length

In [94]:
train['sentence_length'] = train['sentence'].apply(lambda x: len(str(x)))

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=train["sentence_length"],
    marker_color='#7860bd',
))
fig.update_layout(
    title={
    'text': "<b>Sentence length histogram</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=550, width=1100,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
)

fig.show()

In [95]:
df

Unnamed: 0,sentence,category
0,"Auxílio-Doença Previdenciário, Benefícios em E...",orgão público
1,"PAGAR TODAS AS CONTAS EM ATRASO R$1.290,90.",finanças
2,Então encontraremos na próxima aula.,educação
3,Veja os resultados da categoria de ofertas do ...,indústrias
4,"Além disso, a embalagem é reutilizável e 100% ...","indústrias,varejo"
...,...,...
516,"Selecione o local de estudo, curso sem encontr...",educação
517,ESTUDANTES DA REDE MUNICIPAL VOLTAM ÀS AULAS E...,"educação,orgão público"
518,Empresas e órgãos públicos,orgão público
519,DGE – Departamento de Gestão Estratégica Metas...,orgão público


In [96]:
def category_length(category):
    length_sum = (train[category] * train["sentence_length"]).sum()
    length_mean = np.round(length_sum / train[category].sum(), 2)
    return length_sum, length_mean

In [97]:
for category in cat:
    length_sum, length_mean = category_length(category)
    print(f"Summing the length of all sentences in the category '{category}' we have {length_sum} characters.")
    print(f"The mean length of the category '{category}' is {length_mean} characters.\n")

Summing the length of all sentences in the category 'educação' we have 6934 characters.
The mean length of the category 'educação' is 72.23 characters.

Summing the length of all sentences in the category 'finanças' we have 4747 characters.
The mean length of the category 'finanças' is 71.92 characters.

Summing the length of all sentences in the category 'indústrias' we have 6144 characters.
The mean length of the category 'indústrias' is 68.27 characters.

Summing the length of all sentences in the category 'orgão público' we have 8040 characters.
The mean length of the category 'orgão público' is 65.37 characters.

Summing the length of all sentences in the category 'varejo' we have 4918 characters.
The mean length of the category 'varejo' is 61.48 characters.



In [None]:
train

In [98]:
train["sentence"] = train["sentence"].str.strip().str.lower()
test["sentence"] = test["sentence"].str.strip().str.lower()

In [None]:
number_docs = [train["educação"].sum() + test["educação"].sum(),
               train["finanças"].sum() + test["finanças"].sum(),
               train["indústrias"].sum() + test["indústrias"].sum(),
               train["orgão público"].sum() + test["orgão público"].sum(),
               train["varejo"].sum() + test["varejo"].sum()
               ]


fig = go.Figure()
fig.add_trace(go.Histogram(
    x=["educação", "finanças", "indústrias", "orgão público", "varejo"],
    y=number_docs,
    histfunc='sum', texttemplate="%{y}",
    textposition='outside', outsidetextfont=dict(size=12),
    marker_color='#7860bd',
))

fig.update_traces(xbins_size="M1")
fig.update_xaxes(showgrid=True, ticklabelmode="period", dtick="M1", tickformat="%b\n%Y")
fig.update_layout(bargap=0.1)
fig.update_layout(
    title={
    'text': "<b>Quantity of samples with each category</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=550, width=1100,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
    yaxis=dict(categoryorder='total ascending')
)

fig.show()

In [100]:
train = train.drop('sentence_length',axis=1)
X_train = train.sentence
X_test = test.sentence

print(X_train.shape, X_test.shape)

(416,) (105,)


### TfidfVectorizer

TfidfVectorizer stands for Term Frequency-Inverse Document Frequency. It enhances the simple count-based approach by considering not only how often a word appears in a single document but also how unique the word is across all documents in the corpus. It combines two metrics:

- Term Frequency (TF): Similar to CountVectorizer, it measures how frequently a term occurs in a document. This is normalized by dividing by the total number of words in the document to avoid bias towards longer documents.
- Inverse Document Frequency (IDF): This measures how unique or common a word is in the entire document corpus. The more documents a word appears in, the lower its IDF (and thus, its importance).

The TF-IDF score of a word increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word across the corpus. This helps to diminish the effect of frequently occurring words that don’t hold much meaningful information about the document

In [101]:
stop_words_pt = stopwords.words('portuguese')
vectorizer = TfidfVectorizer(stop_words=stop_words_pt)

X_train_matrix = vectorizer.fit_transform(X_train)
X_test_matrix = vectorizer.transform(X_test)

In [102]:
X_train_matrix

<416x1711 sparse matrix of type '<class 'numpy.float64'>'
	with 2824 stored elements in Compressed Sparse Row format>

In [103]:
X_test_matrix

<105x1711 sparse matrix of type '<class 'numpy.float64'>'
	with 416 stored elements in Compressed Sparse Row format>

In [104]:
y_train = train[cat]
y_test = test[cat]

## Training

There are various strategies to address multi-label classification issues. In scenarios where there is no significant correlation among the target classes, one straightforward approach is to employ Binary Relevance.

Binary Relevance stands out for its simplicity and popularity in multi-label classification tasks, aiming to simultaneously predict multiple labels for each dataset instance. Its fundamental concept involves breaking down the multi-label classification challenge into several independent binary classification tasks, corresponding to each label within the dataset.

In [105]:
from sklearn.tree import DecisionTreeClassifier
kf = KFold(n_splits=5)
dict_acc = dict(zip(cat, [[] for _ in range(len(cat))]))
model_dict = {}
for i, (train_index, test_index) in enumerate(kf.split(X_train_matrix)):
    X_tr, X_te = X_train_matrix[train_index], X_train_matrix[test_index]
    y_tr, y_te = y_train.loc[train_index], y_train.loc[test_index]
    print(f"Training in Fold: {i}")
    for label in cat:
        # model = LogisticRegression(C=15)
        model = DecisionTreeClassifier()
        print(f'Processing {label}')
        model.fit(X_tr, y_tr[label])
        y_pred = model.predict(X_te)
        score = accuracy_score(y_te[label], y_pred)
        print(f'Training accuracy is {score}')
        test_y_prob = model.predict_proba(X_te)[:,1]
        model_dict[label] = model
        dict_acc[label].append(score)
        if score > np.max(dict_acc[label]):
            model_dict[label] = model
    print("*" * 50)

Training in Fold: 0
Processing educação
Training accuracy is 0.8095238095238095
Processing finanças
Training accuracy is 0.8452380952380952
Processing indústrias
Training accuracy is 0.8809523809523809
Processing orgão público
Training accuracy is 0.8452380952380952
Processing varejo
Training accuracy is 0.8928571428571429
**************************************************
Training in Fold: 1
Processing educação
Training accuracy is 0.8554216867469879
Processing finanças
Training accuracy is 0.7951807228915663
Processing indústrias
Training accuracy is 0.8554216867469879
Processing orgão público
Training accuracy is 0.8554216867469879
Processing varejo
Training accuracy is 0.8192771084337349
**************************************************
Training in Fold: 2
Processing educação
Training accuracy is 0.891566265060241
Processing finanças
Training accuracy is 0.8554216867469879
Processing indústrias
Training accuracy is 0.8433734939759037
Processing orgão público
Training accuracy is 0

In [106]:
for c in model_dict.keys():
    filename = "_".join(unidecode(c).split(" "))
    with open(f'./models/{filename}.pkl', 'wb') as file:
        pickle.dump(model_dict[c], file)

In [107]:
acc_df = pd.DataFrame(dict_acc)
acc_df["index"] = acc_df.reset_index()["index"].apply(lambda x: f"Fold {x + 1}")
acc_df = acc_df.rename(columns={"index": ""}).set_index("").T

In [108]:
acc_df

Unnamed: 0,Fold 1,Fold 2,Fold 3,Fold 4,Fold 5
educação,0.809524,0.855422,0.891566,0.891566,0.855422
finanças,0.845238,0.795181,0.855422,0.879518,0.927711
indústrias,0.880952,0.855422,0.843373,0.855422,0.819277
orgão público,0.845238,0.855422,0.783133,0.819277,0.759036
varejo,0.892857,0.819277,0.879518,0.831325,0.879518


The mean training accuracy across folds is shown bellow:

In [109]:
acc_df.mean(axis=1)

educação         0.860700
finanças         0.860614
indústrias       0.850889
orgão público    0.812421
varejo           0.860499
dtype: float64

## Validation

In [110]:
from sklearn.metrics import accuracy_score, hamming_loss, precision_score, recall_score, f1_score, jaccard_score
for label in cat:
    print(f"Label: {label}")
    y_true = y_test[label]
    y_pred = model_dict[label].predict(X_test_matrix)
    accuracy = np.round(accuracy_score(y_true, y_pred), 4)
    hamming = np.round(hamming_loss(y_true, y_pred), 4)
    precision = np.round(precision_score(y_true, y_pred, average='macro'), 4)
    recall = np.round(recall_score(y_true, y_pred, average='macro'), 4)
    f1 = np.round(f1_score(y_true, y_pred, average='macro'), 4)
    jaccard = np.round(jaccard_score(y_true, y_pred, average='macro'), 4)

    print(f"Accuracy: {accuracy}\nHamming Loss: {hamming}\nPrecision: {precision}\nRecall: {recall}\nF1 Score: {f1}\nJaccard Score: {jaccard}")
    print("*" * 30)

Label: educação
Accuracy: 0.8476
Hamming Loss: 0.1524
Precision: 0.8379
Recall: 0.74
F1 Score: 0.77
Jaccard Score: 0.6454
******************************
Label: finanças
Accuracy: 0.8571
Hamming Loss: 0.1429
Precision: 0.5991
Recall: 0.6079
F1 Score: 0.6032
Jaccard Score: 0.5098
******************************
Label: indústrias
Accuracy: 0.8667
Hamming Loss: 0.1333
Precision: 0.7543
Recall: 0.7543
F1 Score: 0.7543
Jaccard Score: 0.6346
******************************
Label: orgão público
Accuracy: 0.819
Hamming Loss: 0.181
Precision: 0.828
Recall: 0.7295
F1 Score: 0.754
Jaccard Score: 0.6218
******************************
Label: varejo
Accuracy: 0.8286
Hamming Loss: 0.1714
Precision: 0.7864
Recall: 0.7184
F1 Score: 0.7412
Jaccard Score: 0.6119
******************************


In [116]:
# sample = "Melhor política industrial é acabar com isenção para compras internacionais, diz presidente da Fiemg"
# sample = "Chevrolet Spin recebe mudanças, mas mantém antigo motor 1.8"
sample = "Bancos estão mudando datas de fechamento das faturas dos cartões?"

def prediction(sample):
    sample = sample.strip().lower()
    sample_matrix = vectorizer.transform([sample])
    classes = []
    probs = []
    for key, value in model_dict.items():
        prob = np.round(100 * value.predict_proba(sample_matrix)[0][-1], 2)
        print(f"{key}: {prob}")
        if prob >= 50:
            classes.append(key)
            probs.append(prob)
    return dict(zip(classes, probs))

In [117]:
prediction(sample)

educação: 0.0
finanças: 0.0
indústrias: 0.0
orgão público: 0.0
varejo: 0.0


{}