[Sentence classification by sector](https://handtalk.notion.site/Classifica-o-de-frases-por-setor-18c80adbbf874c519c9efe19678ac4c1)

In [332]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
import plotly.graph_objects as go
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/gabriel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [333]:
df = pd.read_csv("dataset.csv")

In [334]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 521 entries, 0 to 520
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  521 non-null    object
 1   category  521 non-null    object
dtypes: object(2)
memory usage: 8.3+ KB


In [335]:
df

Unnamed: 0,sentence,category
0,"Auxílio-Doença Previdenciário, Benefícios em E...",orgão público
1,"PAGAR TODAS AS CONTAS EM ATRASO R$1.290,90.",finanças
2,Então encontraremos na próxima aula.,educação
3,Veja os resultados da categoria de ofertas do ...,indústrias
4,"Além disso, a embalagem é reutilizável e 100% ...","indústrias,varejo"
...,...,...
516,"Selecione o local de estudo, curso sem encontr...",educação
517,ESTUDANTES DA REDE MUNICIPAL VOLTAM ÀS AULAS E...,"educação,orgão público"
518,Empresas e órgãos públicos,orgão público
519,DGE – Departamento de Gestão Estratégica Metas...,orgão público


In [336]:
col0 = np.unique(df["category"].str.split(",").apply(pd.Series).loc[:, 0])
col1 = np.unique(df["category"].str.split(",").apply(pd.Series).loc[:, 1].dropna())
cat = np.unique(np.append(col1, col0))
num_classes = len(cat)

In [337]:
num_classes

5

The training dataset contains texts that are each assigned to one or more of five distinct classes: 'educação', 'finanças', 'indústrias', 'orgão público', and 'varejo'. Because a sentence can carry several labels at once, this is a multi-label classification problem.
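The cell above reads the labels from the first two positions of the split `category` string. A minimal sketch of an alternative that makes no assumption about how many labels a row carries is shown here; the `toy` frame and its placeholder sentences are hypothetical, only the column names and label strings come from the dataset.

```python
import pandas as pd

# Toy frame mimicking the dataset layout (placeholder sentences, real label names).
toy = pd.DataFrame({
    "sentence": ["a", "b", "c"],
    "category": ["educação", "indústrias,varejo", "educação,orgão público"],
})

# Split the comma-separated labels, give each label its own row, then deduplicate.
labels = sorted(toy["category"].str.split(",").explode().unique())

# Number of labels attached to each sentence.
labels_per_row = toy["category"].str.split(",").str.len()

print(labels)
print(labels_per_row.tolist())
```

`explode` handles any number of labels per row, so the same two lines would keep working if a sentence were ever tagged with three or more categories.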

## Preprocessing

First, we'll divide the dataset into training and testing sets. This ensures that the model is trained on a subset of the data and evaluated on a separate set it hasn't seen before, allowing for a fair assessment of its performance. This step is crucial in preventing data leakage and ensuring that our evaluation metrics accurately reflect the model's ability to generalize to new data.

In [338]:
train, test = train_test_split(df, test_size=0.2)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
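The split above passes no `random_state`, so each run draws a different partition and the numbers further down will vary from run to run. A minimal sketch of a reproducible split follows; the seed value 42 and the `toy` frame are arbitrary assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "sentence": [f"s{i}" for i in range(10)],
    "category": ["educação"] * 5 + ["varejo"] * 5,
})

# Fixing random_state makes the shuffle deterministic.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

# Both calls select exactly the same rows.
print(test_a.index.equals(test_b.index))
```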

### One-hot encoding the target variable

Second, we'll apply one-hot encoding to the target column. This process will transform each category into a separate column, where a category's presence or absence in a sample is represented by 1 or 0, respectively.

In [339]:
train["category"] = train["category"].str.split(",")
test["category"] = test["category"].str.split(",")

mlb = MultiLabelBinarizer()

one_hot_encoded_train = mlb.fit_transform(train['category'])
one_hot_train_df = pd.DataFrame(one_hot_encoded_train, columns=mlb.classes_)
train = pd.concat([train, one_hot_train_df], axis=1).drop('category', axis=1)

one_hot_encoded_test = mlb.transform(test['category'])
one_hot_test_df = pd.DataFrame(one_hot_encoded_test, columns=mlb.classes_)
test = pd.concat([test, one_hot_test_df], axis=1).drop('category', axis=1)
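To make the encoding concrete, here is a small sketch of what `MultiLabelBinarizer` produces on a hand-made list of label sets. The three label lists are hypothetical; the label names are taken from the dataset.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three hypothetical samples, each with one or two labels.
toy_labels = [
    ["indústrias", "varejo"],
    ["educação"],
    ["educação", "orgão público"],
]

toy_mlb = MultiLabelBinarizer()
Y = toy_mlb.fit_transform(toy_labels)

# One column per class (sorted), one row per sample.
print(list(toy_mlb.classes_))
print(Y)

# The encoding is reversible: each row maps back to its tuple of labels.
decoded = toy_mlb.inverse_transform(Y)
print(decoded)
```

Fitting the binarizer on the training labels only and reusing it with `transform` on the test labels, as the cell above does, keeps the column order identical in both frames.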

In [340]:
# check missing values in numeric columns
train.describe()

Unnamed: 0,educação,finanças,indústrias,orgão público,varejo
count,416.0,416.0,416.0,416.0,416.0
mean,0.235577,0.146635,0.199519,0.295673,0.213942
std,0.42487,0.354167,0.40012,0.456894,0.41058
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,1.0,0.0
max,1.0,1.0,1.0,1.0,1.0


In [341]:
correlation = train[["educação", "finanças", "indústrias", "orgão público", "varejo"]].corr()

The correlation between categories is very low, close to 0, so there is essentially no linear relationship between them.
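A correlation coefficient does not show how often two labels actually appear together. As a complementary check, a label co-occurrence count can be sketched as below; the `toy_Y` indicator frame is hypothetical and only mirrors the one-hot layout of `train`.

```python
import pandas as pd

# Hypothetical one-hot label frame: one row per sentence, one column per label.
toy_Y = pd.DataFrame({
    "educação":      [1, 0, 1, 0],
    "orgão público": [0, 1, 1, 0],
    "varejo":        [0, 0, 0, 1],
})

# Y^T . Y counts, for every pair of labels, the rows where both are 1.
# The diagonal holds the number of sentences carrying each label.
cooc = toy_Y.T.dot(toy_Y)
print(cooc)
```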

In [342]:
mask = np.triu(np.ones_like(correlation, dtype=bool))
rLT = correlation.mask(mask)

heat = go.Heatmap(
    z = rLT,
    x = rLT.columns.values,
    y = rLT.columns.values,
    zmin = - 0.25, # Sets the lower bound of the color domain
    zmax = 1,
    xgap = 1, # Sets the horizontal gap (in pixels) between bricks
    ygap = 1,
    colorscale = 'viridis',
)

fig=go.Figure(data=[heat])
layout = fig.update_layout(
    title={
    'text': "<b>Categories correlation</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=600, width=600,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    yaxis_autorange='reversed'
)

fig.show()

In [343]:
fig.update_layout(
    title={
    'text': "<b>Sentence length histogram</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=550, width=1100,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
)

In [344]:
# Verifying if there is any blank sentence
print(df[df["sentence"] == ""])
print("\n")
print(df[df["sentence"] == " "])

Empty DataFrame
Columns: [sentence, category]
Index: []


Empty DataFrame
Columns: [sentence, category]
Index: []


### Exploratory Data Analysis

Let's plot a histogram to inspect the distribution of sentence lengths, measured in characters.

In [345]:
train['sentence_length'] = train['sentence'].apply(lambda x: len(str(x)))

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=train["sentence_length"],
    marker_color='#7860bd',
))
fig.update_layout(
    title={
    'text': "<b>Sentence length histogram</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=550, width=1100,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
)

fig.show()

In [347]:
def category_length(category):
    length_sum = (train[category] * train["sentence_length"]).sum()
    length_mean = np.round(length_sum / train[category].sum(), 2)
    return length_sum, length_mean

In [348]:
for category in cat:
    length_sum, length_mean = category_length(category)
    print(f"Summing the length of all sentences in the category '{category}' we have {length_sum} characters.")
    print(f"The mean length of the category '{category}' is {length_mean} characters.\n")

Summing the length of all sentences in the category 'educação' we have 7034 characters.
The mean length of the category 'educação' is 71.78 characters.

Summing the length of all sentences in the category 'finanças' we have 4489 characters.
The mean length of the category 'finanças' is 73.59 characters.

Summing the length of all sentences in the category 'indústrias' we have 5724 characters.
The mean length of the category 'indústrias' is 68.96 characters.

Summing the length of all sentences in the category 'orgão público' we have 8095 characters.
The mean length of the category 'orgão público' is 65.81 characters.

Summing the length of all sentences in the category 'varejo' we have 5204 characters.
The mean length of the category 'varejo' is 58.47 characters.



In [349]:
train["sentence"] = train["sentence"].str.strip().str.lower()
test["sentence"] = test["sentence"].str.strip().str.lower()

In [350]:
train

Unnamed: 0,sentence,educação,finanças,indústrias,orgão público,varejo,sentence_length
0,ficam com uma soma enorme que permite o invest...,0,1,0,0,0,71
1,escolha o número de parcelas,0,0,0,0,1,28
2,"materiais para se aprofundar nas leituras, bon...",1,0,0,0,0,56
3,contabilidade para advogados.,0,1,0,0,0,29
4,acesse o fale conosco e veja tutoriais e també...,0,0,0,1,0,88
...,...,...,...,...,...,...,...
411,cadastro nacional de adoção (cna) - portal cnj.,0,0,0,1,0,47
412,para que serve a previdência social?,0,0,0,1,0,36
413,selecione o tipo de documento; documentos pess...,1,0,0,0,0,75
414,programa de aprendizagem,1,0,0,0,0,24


In [351]:
train = train.drop('sentence_length',axis=1)
X_train = train.sentence
X_test = test.sentence

print(X_train.shape, X_test.shape)

(416,) (105,)


### TfidfVectorizer

TF-IDF stands for Term Frequency-Inverse Document Frequency, and `TfidfVectorizer` is scikit-learn's implementation of it. It improves on a plain word count (as produced by `CountVectorizer`) by considering not only how often a word appears in a single document but also how rare the word is across all documents in the corpus. It combines two metrics:

- Term Frequency (TF): how frequently a term occurs in a document. In the textbook definition this count is divided by the total number of words in the document to avoid bias towards longer documents; scikit-learn instead keeps the raw count and, by default, L2-normalizes each document vector afterwards.
- Inverse Document Frequency (IDF): how rare a word is across the entire corpus. The more documents a word appears in, the lower its IDF (and thus its importance).

The TF-IDF score of a word increases proportionally to the number of times the word appears in the document but is offset by the frequency of the word across the corpus. This diminishes the effect of frequently occurring words that don’t hold much meaningful information about the document.
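The weighting can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, raw term counts, L2 row normalization) and compares a manual computation with `TfidfVectorizer` on three made-up documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three hypothetical documents.
docs = ["pagar contas em atraso", "pagar boleto", "aula de matemática"]

# Raw term counts: one row per document, one column per vocabulary term.
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF used by scikit-learn by default: ln((1 + n) / (1 + df)) + 1.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit Euclidean length.
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

"pagar" occurs in two of the three documents, so it receives a lower IDF than terms such as "boleto" that occur in only one.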

In [352]:
stop_words_pt = stopwords.words('portuguese')
vectorizer = TfidfVectorizer(stop_words=stop_words_pt)

X_train_matrix = vectorizer.fit_transform(X_train)
X_test_matrix = vectorizer.transform(X_test)

In [353]:
X_train_matrix

<416x1696 sparse matrix of type '<class 'numpy.float64'>'
	with 2813 stored elements in Compressed Sparse Row format>

In [354]:
X_test_matrix

<105x1696 sparse matrix of type '<class 'numpy.float64'>'
	with 412 stored elements in Compressed Sparse Row format>

In [355]:
y_train = train[cat]
y_test = test[cat]

## Training

There are various strategies for multi-label classification problems. When there is no significant correlation among the target classes, one straightforward approach is Binary Relevance.

Binary Relevance is a simple and popular approach to multi-label classification. It breaks the multi-label problem down into several independent binary classification tasks, one per label in the dataset, and trains a separate classifier for each, so that every instance receives a yes/no prediction for every label.
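As an illustration of the idea, Binary Relevance can be sketched with scikit-learn's `OneVsRestClassifier`, which fits one binary estimator per column of a label indicator matrix. This is a swapped-in convenience wrapper, not the loop this notebook uses; the sentences and labels below are made up, and `C=12.0` is simply reused from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy corpus with two labels: column 0 = finanças, column 1 = educação.
sentences = [
    "pagar contas em atraso",
    "aula de matemática amanhã",
    "contas da escola",
    "próxima aula",
    "pagar boleto",
    "escola pública",
]
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 1], [1, 0], [0, 1]])

X = TfidfVectorizer().fit_transform(sentences)

# One independent logistic regression is trained per label column.
clf = OneVsRestClassifier(LogisticRegression(C=12.0))
clf.fit(X, Y)

pred = clf.predict(X)
print(pred.shape)            # one row per sentence, one column per label
print(len(clf.estimators_))  # one fitted binary model per label
```

Because the per-label models never see each other's targets, this approach ignores any dependence between labels, which is why it suits the low correlations observed earlier.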

In [356]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

In [357]:
# TODO: add a routine to save the best models so they can be used in the validation phase

kf = KFold(n_splits=5)
dict_acc = dict(zip(cat, [[] for _ in range(len(cat))]))
for i, (train_index, test_index) in enumerate(kf.split(X_train_matrix)):
    X_tr, X_te = X_train_matrix[train_index], X_train_matrix[test_index]
    y_tr, y_te = y_train.loc[train_index], y_train.loc[test_index]
    logreg = LogisticRegression(C=12.0)
    print(f"Training in Fold: {i}")
    for label in cat:
        print(f'Processing {label}')
        logreg.fit(X_tr, y_tr[label])
        y_pred = logreg.predict(X_te)
        score = accuracy_score(y_te[label], y_pred)
        print(f'Training accuracy is {score}')
        test_y_prob = logreg.predict_proba(X_te)[:,1]
        dict_acc[label].append(score)
    print("*" * 50)

Training in Fold: 0
Processing educação
Training accuracy is 0.9047619047619048
Processing finanças
Training accuracy is 0.8452380952380952
Processing indústrias
Training accuracy is 0.8809523809523809
Processing orgão público
Training accuracy is 0.7857142857142857
Processing varejo
Training accuracy is 0.8095238095238095
**************************************************
Training in Fold: 1
Processing educação
Training accuracy is 0.8433734939759037
Processing finanças
Training accuracy is 0.8554216867469879
Processing indústrias
Training accuracy is 0.891566265060241
Processing orgão público
Training accuracy is 0.7469879518072289
Processing varejo
Training accuracy is 0.8192771084337349
**************************************************
Training in Fold: 2
Processing educação
Training accuracy is 0.7951807228915663
Processing finanças
Training accuracy is 0.8433734939759037
Processing indústrias
Training accuracy is 0.8313253012048193
Processing orgão público
Training accuracy is 0

In [358]:
dict_acc

{'educação': [0.9047619047619048,
  0.8433734939759037,
  0.7951807228915663,
  0.7831325301204819,
  0.7951807228915663],
 'finanças': [0.8452380952380952,
  0.8554216867469879,
  0.8433734939759037,
  0.8554216867469879,
  0.9397590361445783],
 'indústrias': [0.8809523809523809,
  0.891566265060241,
  0.8313253012048193,
  0.8313253012048193,
  0.8313253012048193],
 'orgão público': [0.7857142857142857,
  0.7469879518072289,
  0.8072289156626506,
  0.8433734939759037,
  0.8554216867469879],
 'varejo': [0.8095238095238095,
  0.8192771084337349,
  0.8674698795180723,
  0.891566265060241,
  0.7951807228915663]}

In [373]:
acc_df = pd.DataFrame(dict_acc)
acc_df["index"] = acc_df.reset_index()["index"].apply(lambda x: f"Fold {x + 1}")
acc_df = acc_df.rename(columns={"index": ""}).set_index("").T

In [374]:
acc_df

Unnamed: 0,Fold 1,Fold 2,Fold 3,Fold 4,Fold 5
educação,0.904762,0.843373,0.795181,0.783133,0.795181
finanças,0.845238,0.855422,0.843373,0.855422,0.939759
indústrias,0.880952,0.891566,0.831325,0.831325,0.831325
orgão público,0.785714,0.746988,0.807229,0.843373,0.855422
varejo,0.809524,0.819277,0.86747,0.891566,0.795181


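The mean accuracy per label, averaged across the five folds, is shown below. Although the loop prints it as "training accuracy", each score is computed on the held-out fold, so it is a cross-validation accuracy: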

In [380]:
acc_df.mean(axis=1)

educação         0.824326
finanças         0.867843
indústrias       0.853299
orgão público    0.807745
varejo           0.836604
dtype: float64
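Each label is positive in only about 15% to 30% of the training rows, so a classifier that always predicts 0 already scores a high accuracy. The sketch below compares the cross-validated means with that majority-class baseline. Both series are copied by hand from the outputs above (the `mean` row of `train.describe()` and `acc_df.mean(axis=1)`); the baseline uses the whole training set, so per-fold baselines would differ slightly.

```python
import pandas as pd

# Share of positive rows per label, from the `mean` row of train.describe().
positive_rate = pd.Series({
    "educação": 0.235577,
    "finanças": 0.146635,
    "indústrias": 0.199519,
    "orgão público": 0.295673,
    "varejo": 0.213942,
})

# Mean cross-validated accuracy per label, from acc_df.mean(axis=1).
cv_accuracy = pd.Series({
    "educação": 0.824326,
    "finanças": 0.867843,
    "indústrias": 0.853299,
    "orgão público": 0.807745,
    "varejo": 0.836604,
})

# Accuracy of always predicting "label absent".
baseline = 1 - positive_rate

comparison = pd.DataFrame({"baseline": baseline, "cv_accuracy": cv_accuracy})
comparison["gain"] = comparison["cv_accuracy"] - comparison["baseline"]
print(comparison.round(3))
```

The gain column is a more informative summary than raw accuracy here; metrics such as precision, recall, or F1 per label would be a natural next step.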

## Validation

In [381]:
y_pred = logreg.predict(X_test_matrix)

In [382]:
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
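After the cross-validation loop, `logreg` holds only the model that was fitted last (the last label in `cat`, on the last fold), so the `y_pred` above covers a single label. A minimal sketch of evaluating every label on the test set is given below: it fits one model per label on the full training matrix and collects the predictions in a frame. All data and the `toy_` names are hypothetical stand-ins for `X_train_matrix`, `X_test_matrix`, `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data in the same layout as the notebook's train/test frames.
toy_train_sentences = [
    "pagar contas em atraso", "aula de matemática amanhã", "contas da escola",
    "próxima aula", "pagar boleto", "escola pública",
]
toy_y_train = pd.DataFrame({
    "finanças": [1, 0, 1, 0, 1, 0],
    "educação": [0, 1, 1, 1, 0, 1],
})
toy_test_sentences = ["pagar a escola", "aula amanhã"]
toy_y_test = pd.DataFrame({"finanças": [1, 0], "educação": [1, 1]})

# Fit the vectorizer on training text only, then reuse it for the test text.
toy_vectorizer = TfidfVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train_sentences)
toy_X_test = toy_vectorizer.transform(toy_test_sentences)

# One fitted model per label, kept in a dict so none is overwritten.
models = {
    label: LogisticRegression(C=12.0).fit(toy_X_train, toy_y_train[label])
    for label in toy_y_train.columns
}

# One prediction column per label, aligned with toy_y_test.
toy_y_pred = pd.DataFrame(
    {label: model.predict(toy_X_test) for label, model in models.items()}
)

# Per-label accuracy on the held-out test set.
test_scores = {
    label: accuracy_score(toy_y_test[label], toy_y_pred[label])
    for label in toy_y_pred.columns
}
print(toy_y_pred)
print(test_scores)
```

Keeping the models in a dictionary keyed by label is one simple way to address the TODO in the training cell about retaining the fitted models for the validation phase.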