[Sentence classification by sector](https://handtalk.notion.site/Classifica-o-de-frases-por-setor-18c80adbbf874c519c9efe19678ac4c1)

In [36]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
import plotly.graph_objects as go
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from unidecode import unidecode
import pickle

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/gabriel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [37]:
df = pd.read_csv("dataset.csv")

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 521 entries, 0 to 520
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  521 non-null    object
 1   category  521 non-null    object
dtypes: object(2)
memory usage: 8.3+ KB


In [39]:
df

| | sentence | category |
|---|---|---|
| 0 | Auxílio-Doença Previdenciário, Benefícios em E... | orgão público |
| 1 | PAGAR TODAS AS CONTAS EM ATRASO R$1.290,90. | finanças |
| 2 | Então encontraremos na próxima aula. | educação |
| 3 | Veja os resultados da categoria de ofertas do ... | indústrias |
| 4 | Além disso, a embalagem é reutilizável e 100% ... | indústrias,varejo |
| ... | ... | ... |
| 516 | Selecione o local de estudo, curso sem encontr... | educação |
| 517 | ESTUDANTES DA REDE MUNICIPAL VOLTAM ÀS AULAS E... | educação,orgão público |
| 518 | Empresas e órgãos públicos | orgão público |
| 519 | DGE – Departamento de Gestão Estratégica Metas... | orgão público |


In [40]:
col0 = np.unique(df["category"].str.split(",").apply(pd.Series).loc[:, 0])
col1 = np.unique(df["category"].str.split(",").apply(pd.Series).loc[:, 1].dropna())
cat = np.unique(np.append(col1, col0))
num_classes = len(cat)

In [41]:
num_classes

5

The training dataset contains texts that are each assigned to one or more of five distinct classes: 'educação', 'finanças', 'indústrias', 'orgão público', and 'varejo'. This makes the task a multi-label classification problem.
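The label set can also be collected without assuming at most two labels per row. The following is a minimal sketch on a hypothetical toy frame (`toy`, `labels`, `classes`, and `labels_per_row` are illustrative names, not part of the notebook) that uses `Series.explode` to put each label on its own row.

```python
import pandas as pd

# Toy stand-in for the notebook's `category` column (hypothetical rows).
toy = pd.DataFrame({
    "category": [
        "educação",
        "indústrias,varejo",
        "educação,orgão público",
        "finanças",
    ]
})

# Split each cell into a list of labels, then give every label its own row.
labels = toy["category"].str.split(",").explode().str.strip()

# Distinct labels, regardless of how many labels a single row carries.
classes = sorted(labels.unique())
num_classes = len(classes)

# explode() keeps the original row index, so grouping by it counts labels per row.
labels_per_row = labels.groupby(level=0).size()

print(classes)
print(num_classes)
print(labels_per_row.tolist())
```

Rows with more than one label are what make the problem multi-label rather than multi-class.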

## Preprocessing

### Binarize variable

Let's apply one-hot encoding to the target column. This process will transform each category into a separate column, where a category's presence or absence in a sample is represented by 1 or 0, respectively.

In [42]:
df["category"] = df["category"].str.split(",")
mlb = MultiLabelBinarizer()
one_hot_encoded_train = mlb.fit_transform(df['category'])
one_hot_train_df = pd.DataFrame(one_hot_encoded_train, columns=mlb.classes_)
df = pd.concat([df, one_hot_train_df], axis=1).drop('category', axis=1)
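To make the encoding concrete, here is a small sketch of `MultiLabelBinarizer` on three hypothetical label lists (`toy_labels`, `toy_mlb`, `encoded`, and `decoded` are illustrative names). It shows that a sample with two categories gets two 1s in its row, and that `inverse_transform` recovers the original labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists, in the same shape as df["category"] after str.split(",").
toy_labels = [["educação"], ["indústrias", "varejo"], ["orgão público"]]

toy_mlb = MultiLabelBinarizer()
encoded = toy_mlb.fit_transform(toy_labels)

# classes_ holds the column order (sorted label names).
print(toy_mlb.classes_)
# One row per sample, one column per class; 1 marks presence.
print(encoded)

# inverse_transform maps each binary row back to a tuple of labels.
decoded = toy_mlb.inverse_transform(encoded)
print(decoded)
```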

In [43]:
# Summary statistics; a count of 521 in every column confirms there are no missing values
df.describe()

| | educação | finanças | indústrias | orgão público | varejo |
|---|---|---|---|---|---|
| count | 521.0 | 521.0 | 521.0 | 521.0 | 521.0 |
| mean | 0.236084 | 0.145873 | 0.205374 | 0.297505 | 0.203455 |
| std | 0.425083 | 0.353318 | 0.404363 | 0.4576 | 0.402955 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
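Because each label column is binary, its mean is the share of samples carrying that label. As a worked check, the sketch below multiplies the means reported by `df.describe()` by the 521 rows reported by `df.info()` to recover per-category counts and the average number of labels per sample. The variable names are illustrative, and the numbers are copied from the outputs above rather than recomputed from the dataset.

```python
# Values copied from the df.info() and df.describe() outputs above.
n_samples = 521
column_means = {
    "educação": 0.236084,
    "finanças": 0.145873,
    "indústrias": 0.205374,
    "orgão público": 0.297505,
    "varejo": 0.203455,
}

# mean of a 0/1 column = (number of 1s) / n_samples, so count = mean * n_samples.
counts = {name: round(mean * n_samples) for name, mean in column_means.items()}

# Total label assignments exceed n_samples because some sentences have two labels.
total_labels = sum(counts.values())
labels_per_sample = total_labels / n_samples

print(counts)
print(total_labels, round(labels_per_sample, 3))
```

This gives 567 label assignments over 521 sentences, or about 1.09 labels per sentence, so most sentences carry a single category.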


In [44]:
correlation = df[["educação", "finanças", "indústrias", "orgão público", "varejo"]].corr()

The pairwise correlations between the categories are very low, close to 0, so the categories are essentially uncorrelated with one another.
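For binary label columns, the Pearson correlation returned by `DataFrame.corr()` is the phi coefficient, and it can be read alongside raw co-occurrence counts. The sketch below uses a hypothetical six-row frame (`toy`, `co_occurrence`, and `phi` are illustrative names) to show both views; it is not computed from the notebook's data.

```python
import pandas as pd

# Hypothetical one-hot label columns for six sentences.
toy = pd.DataFrame({
    "educação":      [1, 1, 0, 0, 0, 1],
    "orgão público": [0, 1, 1, 1, 0, 0],
    "varejo":        [0, 0, 0, 0, 1, 0],
})

# Y^T Y: entry (i, j) counts sentences tagged with both label i and label j;
# the diagonal holds the number of sentences per label.
co_occurrence = toy.T.dot(toy)

# Pearson correlation on 0/1 columns (the phi coefficient).
phi = toy.corr()

print(co_occurrence)
print(phi.round(3))
```

Labels that rarely appear together tend to show slightly negative values, which is one reason a color scale for such a heatmap may need to extend below 0.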

In [45]:
mask = np.triu(np.ones_like(correlation, dtype=bool))
rLT = correlation.mask(mask)

heat = go.Heatmap(
    z = rLT,
    x = rLT.columns.values,
    y = rLT.columns.values,
    zmin = - 0.25, # Sets the lower bound of the color domain
    zmax = 1,
    xgap = 1, # Sets the horizontal gap (in pixels) between bricks
    ygap = 1,
    colorscale = 'viridis',
)

fig=go.Figure(data=[heat])
layout = fig.update_layout(
    title={
    'text': "<b>Categories correlation</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=600, width=600,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    yaxis_autorange='reversed'
)

fig.show()

In [46]:
# Check whether there are any blank sentences
print(df[df["sentence"] == ""])
print("\n")
print(df[df["sentence"] == " "])

Empty DataFrame
Columns: [sentence, educação, finanças, indústrias, orgão público, varejo]
Index: []


Empty DataFrame
Columns: [sentence, educação, finanças, indústrias, orgão público, varejo]
Index: []
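The check above matches only the empty string and a single space. A slightly more general variant, sketched here on hypothetical data (`toy` and `is_blank` are illustrative names), also flags missing values and strings made up of any amount of whitespace.

```python
import pandas as pd

# Hypothetical sentences, including several kinds of "blank" entries.
toy = pd.DataFrame({
    "sentence": ["Empresas e órgãos públicos", "", "   ", "\t\n", None]
})

# A sentence is blank if it is missing or empty once surrounding whitespace is stripped.
is_blank = toy["sentence"].isna() | toy["sentence"].fillna("").str.strip().eq("")

print(toy[is_blank])
print(int(is_blank.sum()))
```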


### Exploratory Data Analysis

Let's plot a histogram to check the distribution of sentence lengths, measured in characters.

In [47]:
df['sentence_length'] = df['sentence'].apply(lambda x: len(str(x)))

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=df["sentence_length"],
    marker_color='#7860bd',
))
fig.update_layout(
    title={
    'text': "<b>Sentence length histogram</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=550, width=1100,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
)

fig.show()
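A numeric summary can complement the histogram. The sketch below computes character and word counts for three sentences taken from the dataset preview (the names `toy`, `char_length`, `word_count`, and `summary` are illustrative); counting words by whitespace splitting is an assumption here, not something the notebook does.

```python
import pandas as pd

# Three sentences shown in full in the dataset preview above.
toy = pd.Series([
    "Então encontraremos na próxima aula.",
    "Empresas e órgãos públicos",
    "PAGAR TODAS AS CONTAS EM ATRASO R$1.290,90.",
], name="sentence")

# Length in characters, matching the notebook's sentence_length column.
char_length = toy.str.len()

# Length in whitespace-separated tokens (an additional, assumed measure).
word_count = toy.str.split().str.len()

# count, mean, std, min, quartiles, and max for both measures.
summary = pd.DataFrame({"chars": char_length, "words": word_count}).describe()
print(summary)
```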

In [48]:
df

| | sentence | educação | finanças | indústrias | orgão público | varejo | sentence_length |
|---|---|---|---|---|---|---|---|
| 0 | Auxílio-Doença Previdenciário, Benefícios em E... | 0 | 0 | 0 | 1 | 0 | 78 |
| 1 | PAGAR TODAS AS CONTAS EM ATRASO R$1.290,90. | 0 | 1 | 0 | 0 | 0 | 43 |
| 2 | Então encontraremos na próxima aula. | 1 | 0 | 0 | 0 | 0 | 36 |
| 3 | Veja os resultados da categoria de ofertas do ... | 0 | 0 | 1 | 0 | 0 | 71 |
| 4 | Além disso, a embalagem é reutilizável e 100% ... | 0 | 0 | 1 | 0 | 1 | 57 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 516 | Selecione o local de estudo, curso sem encontr... | 1 | 0 | 0 | 0 | 0 | 61 |
| 517 | ESTUDANTES DA REDE MUNICIPAL VOLTAM ÀS AULAS E... | 1 | 0 | 0 | 1 | 0 | 59 |
| 518 | Empresas e órgãos públicos | 0 | 0 | 0 | 1 | 0 | 26 |
| 519 | DGE – Departamento de Gestão Estratégica Metas... | 0 | 0 | 0 | 1 | 0 | 104 |


In [49]:
def category_length(category):
    length_sum = (df[category] * df["sentence_length"]).sum()
    length_mean = np.round(length_sum / df[category].sum(), 2)
    return length_sum, length_mean

In [50]:
for category in cat:
    length_sum, length_mean = category_length(category)
    print(f"Summing the length of all sentences in the category '{category}' we have {length_sum} characters.")
    print(f"The mean length of the category '{category}' is {length_mean} characters.\n")

Summing the length of all sentences in the category 'educação' we have 8619 characters.
The mean length of the category 'educação' is 70.07 characters.

Summing the length of all sentences in the category 'finanças' we have 5486 characters.
The mean length of the category 'finanças' is 72.18 characters.

Summing the length of all sentences in the category 'indústrias' we have 7256 characters.
The mean length of the category 'indústrias' is 67.81 characters.

Summing the length of all sentences in the category 'orgão público' we have 10320 characters.
The mean length of the category 'orgão público' is 66.58 characters.

Summing the length of all sentences in the category 'varejo' we have 6228 characters.
The mean length of the category 'varejo' is 58.75 characters.
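The same per-category totals and means can be computed for all label columns at once. This is a minimal sketch on a hypothetical three-row frame (`toy`, `label_cols`, `length_sum`, and `length_mean` are illustrative names): each one-hot column is multiplied by the length column, so a sentence contributes its length to every category it belongs to.

```python
import pandas as pd

# Hypothetical one-hot labels plus a character-length column.
toy = pd.DataFrame({
    "educação":        [1, 0, 1],
    "varejo":          [0, 1, 1],
    "sentence_length": [36, 57, 60],
})
label_cols = ["educação", "varejo"]

# Row-wise multiply each label column by the sentence length, then sum per column.
length_sum = toy[label_cols].mul(toy["sentence_length"], axis=0).sum()

# Divide by the number of sentences in each category to get the mean length.
length_mean = (length_sum / toy[label_cols].sum()).round(2)

print(length_sum)
print(length_mean)
```

Note that the third toy sentence has two labels, so its 60 characters are counted in both categories; the same applies to multi-label sentences in the real dataset.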



In [51]:
df["sentence"] = df["sentence"].str.strip().str.lower()

In [52]:
number_docs = [df["educação"].sum(),
               df["finanças"].sum(),
               df["indústrias"].sum(),
               df["orgão público"].sum(),
               df["varejo"].sum()
               ]


fig = go.Figure()
# The counts are already aggregated, so a bar chart is used instead of a histogram
fig.add_trace(go.Bar(
    x=["educação", "finanças", "indústrias", "orgão público", "varejo"],
    y=number_docs,
    texttemplate="%{y}",
    textposition='outside', outsidetextfont=dict(size=12),
    marker_color='#7860bd',
))

fig.update_xaxes(showgrid=True)
fig.update_layout(bargap=0.1)
fig.update_layout(
    title={
    'text': "<b>Quantity of samples with each category</b>",
    'font' : dict(size=26, color='black', family='Helvetica'),
    'y':0.95,
    'x':0.5},
    height=550, width=1100,
    font_family='Helvetica',
    font_color='black',
    font_size=16,
    plot_bgcolor='white',
    # Sort the categories (on the x axis) by their sample count
    xaxis=dict(categoryorder='total ascending')
)

fig.show()