# Assignment 3 - Sentiment Analysis of News
- Team: Arthur Santos, Daniele Simas, Laís Dib, and Luã Maury
- Dataset: [Stock Market News Data in Portuguese](https://www.kaggle.com/datasets/mateuspicanco/financial-phrase-bank-portuguese-translation)

## 1. Preprocessing

In [1]:
import pandas as pd

In [2]:
PATH_DF = "./financial_phrase_bank_pt_br.csv"

In [3]:
df = pd.read_csv(PATH_DF)
df.head()

Unnamed: 0,y,text,text_pt
0,neutral,Technopolis plans to develop in stages an area...,A Technopolis planeja desenvolver em etapas um...
1,negative,The international electronic industry company ...,"A Elcoteq, empresa internacional da indústria ..."
2,positive,With the new production plant the company woul...,Com a nova planta de produção a empresa aument...
3,positive,According to the company 's updated strategy f...,De acordo com a estratégia atualizada da empre...
4,positive,FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag...,FINANCIAMENTO DO CRESCIMENTO DA ASPOCOMP A Asp...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4845 entries, 0 to 4844
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   y        4845 non-null   object
 1   text     4845 non-null   object
 2   text_pt  4845 non-null   object
dtypes: object(3)
memory usage: 113.7+ KB


### 1.1. Checking for duplicate rows and removing them

In [5]:
df.duplicated().sum()

6

In [6]:
df.drop_duplicates(inplace=True, ignore_index=True)
df.duplicated().sum()

0

### 1.2. Converting text to lowercase

In [7]:
for column in df.columns:
    df[column] = df[column].str.lower()
    
df.head()

Unnamed: 0,y,text,text_pt
0,neutral,technopolis plans to develop in stages an area...,a technopolis planeja desenvolver em etapas um...
1,negative,the international electronic industry company ...,"a elcoteq, empresa internacional da indústria ..."
2,positive,with the new production plant the company woul...,com a nova planta de produção a empresa aument...
3,positive,according to the company 's updated strategy f...,de acordo com a estratégia atualizada da empre...
4,positive,financing of aspocomp 's growth aspocomp is ag...,financiamento do crescimento da aspocomp a asp...


### 1.3. Removing punctuation from the text

In [8]:
import string

In [9]:
EXCLUDE = set(string.punctuation)
EXCLUDE

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~'}

In [10]:
def remove_punctuation(text):
    # Leave non-string values (e.g. NaN) untouched instead of hiding errors
    if not isinstance(text, str):
        return text
    return ''.join(char for char in text if char not in EXCLUDE)
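
The same cleanup can be done with a translation table built by `str.maketrans`. The sketch below is an alternative, not the notebook's implementation; the name `remove_punctuation_translate` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text):
    """Hypothetical alternative to remove_punctuation based on str.translate."""
    if not isinstance(text, str):
        return text  # leave non-string values (e.g. NaN) untouched
    return text.translate(PUNCTUATION_TABLE)

example = remove_punctuation_translate("a elcoteq, empresa internacional 's")
print(example)
```

With either version, dropping the apostrophe from `'s` leaves a stray `s` token in the English text, as the output of the next cell shows.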

In [11]:
for column in df.columns:
    df[column] = df[column].apply(remove_punctuation)

df.head()

Unnamed: 0,y,text,text_pt
0,neutral,technopolis plans to develop in stages an area...,a technopolis planeja desenvolver em etapas um...
1,negative,the international electronic industry company ...,a elcoteq empresa internacional da indústria e...
2,positive,with the new production plant the company woul...,com a nova planta de produção a empresa aument...
3,positive,according to the company s updated strategy fo...,de acordo com a estratégia atualizada da empre...
4,positive,financing of aspocomp s growth aspocomp is agg...,financiamento do crescimento da aspocomp a asp...


### 1.4. Removing stopwords

In [12]:
# !pip install nltk

In [13]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ArkadeUser\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
from nltk.corpus import stopwords

- English text

In [15]:
eng_stopwords = stopwords.words('english')

df["text"] = df["text"].apply(lambda x: ' '.join([word for word in x.split() if word not in (eng_stopwords)]))
df["text"].head()

0    technopolis plans develop stages area less 100...
1    international electronic industry company elco...
2    new production plant company would increase ca...
3    according company updated strategy years 20092...
4    financing aspocomp growth aspocomp aggressivel...
Name: text, dtype: object

- Portuguese text

In [16]:
pt_stopwords = stopwords.words('portuguese')

df["text_pt"] = df["text_pt"].apply(lambda x: ' '.join([word for word in x.split() if word not in (pt_stopwords)]))
df["text_pt"].head()

0    technopolis planeja desenvolver etapas área in...
1    elcoteq empresa internacional indústria eletrô...
2    nova planta produção empresa aumentaria capaci...
3    acordo estratégia atualizada empresa anos 2009...
4    financiamento crescimento aspocomp aspocomp pe...
Name: text_pt, dtype: object
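
The filter above tests membership against a list; converting the stopwords to a `set` gives the same result with constant-time lookups. A minimal sketch, assuming a tiny hand-written stopword sample (`SAMPLE_PT_STOPWORDS` is hypothetical and stands in for `stopwords.words('portuguese')` so the example runs without NLTK data):

```python
# Hypothetical, tiny stopword sample; the notebook uses stopwords.words('portuguese')
SAMPLE_PT_STOPWORDS = {"a", "de", "com", "em", "da", "o", "um"}

def remove_stopwords(text, stopword_set):
    """Keep only the whitespace-separated tokens that are not stopwords."""
    return " ".join(word for word in text.split() if word not in stopword_set)

example = remove_stopwords("com a nova planta de produção a empresa", SAMPLE_PT_STOPWORDS)
print(example)
```

In the notebook this would amount to wrapping the NLTK list once, for example `set(pt_stopwords)`, and passing it to the same kind of filter.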

In [17]:
df.head()

Unnamed: 0,y,text,text_pt
0,neutral,technopolis plans develop stages area less 100...,technopolis planeja desenvolver etapas área in...
1,negative,international electronic industry company elco...,elcoteq empresa internacional indústria eletrô...
2,positive,new production plant company would increase ca...,nova planta produção empresa aumentaria capaci...
3,positive,according company updated strategy years 20092...,acordo estratégia atualizada empresa anos 2009...
4,positive,financing aspocomp growth aspocomp aggressivel...,financiamento crescimento aspocomp aspocomp pe...


### 1.5. Removing accents

In [18]:
# !pip install unidecode

In [19]:
from unidecode import unidecode

In [20]:
for column in df.columns:
    df[column] = df[column].apply(lambda x: unidecode(x))

df.head()

Unnamed: 0,y,text,text_pt
0,neutral,technopolis plans develop stages area less 100...,technopolis planeja desenvolver etapas area in...
1,negative,international electronic industry company elco...,elcoteq empresa internacional industria eletro...
2,positive,new production plant company would increase ca...,nova planta producao empresa aumentaria capaci...
3,positive,according company updated strategy years 20092...,acordo estrategia atualizada empresa anos 2009...
4,positive,financing aspocomp growth aspocomp aggressivel...,financiamento crescimento aspocomp aspocomp pe...
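
For Latin-script text such as Portuguese, accents can also be stripped with the standard library alone. This is a sketch of an alternative to `unidecode`, not a drop-in replacement: `unidecode` additionally transliterates non-Latin characters, which this version does not. The function name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after NFKD decomposition (standard library only)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

example = strip_accents("produção indústria eletrônica")
print(example)
```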


### 1.6. Lemmatization

In [21]:
# !pip install spacy

In [22]:
!spacy download en_core_web_sm
!spacy download pt_core_news_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)

Collecting pt-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.6.0/pt_core_news_sm-3.6.0-py3-none-any.whl (13.0 MB)

In [23]:
import spacy

In [24]:
NLP_ENG = spacy.load("en_core_web_sm")
NLP_PT = spacy.load("pt_core_news_sm")

In [25]:
def lemmatizer(data, language):
    final_data = []
    lemm = None
    
    if language == "english":
        lemm = NLP_ENG
    else:
        lemm = NLP_PT
        
    for word in lemm(data):
        if word.pos_ == "VERB":
            final_data.append(word.lemma_)
        else:
            final_data.append(word.orth_)
        
    return " ".join(final_data)

In [26]:
df["lemmatized_text"] = df.text.map(lambda x: lemmatizer(x, "english"))
df["lemmatized_text_pt"] = df.text_pt.map(lambda x: lemmatizer(x, language="portuguese"))

df.head()

Unnamed: 0,y,text,text_pt,lemmatized_text,lemmatized_text_pt
0,neutral,technopolis plans develop stages area less 100...,technopolis planeja desenvolver etapas area in...,technopolis plans develop stages area less 100...,technopolis planejar desenvolver etapas area i...
1,negative,international electronic industry company elco...,elcoteq empresa internacional industria eletro...,international electronic industry company elco...,elcoteq empresa internacional industria eletro...
2,positive,new production plant company would increase ca...,nova planta producao empresa aumentaria capaci...,new production plant company would increase ca...,nova planta producao empresa aumentar capacida...
3,positive,according company updated strategy years 20092...,acordo estrategia atualizada empresa anos 2009...,accord company update strategy years 20092012 ...,acordo estrategia atualizada empresa ano 20092...
4,positive,financing aspocomp growth aspocomp aggressivel...,financiamento crescimento aspocomp aspocomp pe...,finance aspocomp growth aspocomp aggressively ...,financiamento crescimento aspocomp aspocomp pe...


### 1.7. Checking for duplicate rows again and removing them

In [27]:
df.duplicated().sum()

5

In [28]:
df.drop_duplicates(inplace=True, ignore_index=True)
df.duplicated().sum()

0

### 1.8. Checking the distribution of data across classes

In [29]:
# Number of examples per class
df["y"].value_counts()

y
neutral     2867
positive    1363
negative     604
Name: count, dtype: int64

In [30]:
# Proportion of each class (%)
df["y"].value_counts(normalize=True)*100

y
neutral     59.309061
positive    28.196111
negative    12.494828
Name: proportion, dtype: float64

The dataset is imbalanced: the neutral class holds about 59% of the examples, against about 28% for positive and 12% for negative, so the classes differ by far more than 10 percentage points. To account for this, the models are trained with stratified cross-validation, which keeps these class proportions in every fold.
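
To illustrate what stratification means here, the sketch below assigns indices to folds class by class, so each fold receives roughly the same share of every class. It is a simplified pure-Python illustration of the idea, not scikit-learn's `StratifiedKFold` algorithm; the function name `stratified_folds` and the synthetic `labels` list (built from the class counts shown above) are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=3, seed=42):
    """Assign each index to a fold so every fold keeps roughly the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Synthetic labels with the same class counts as the dataset
labels = ["neutral"] * 2867 + ["positive"] * 1363 + ["negative"] * 604
folds = stratified_folds(labels)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
for counts in fold_counts:
    print(dict(counts))
```

Stratification keeps the folds representative of the full data, but it does not by itself rebalance the classes; that is handled separately in section 5.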

### 1.9. Preparing TF-IDF

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
eng_vectorizer = TfidfVectorizer(sublinear_tf=True)
pt_vectorizer = TfidfVectorizer(sublinear_tf=True)

In [33]:
features_text = eng_vectorizer.fit_transform(df["lemmatized_text"])
features_text_pt = pt_vectorizer.fit_transform(df["lemmatized_text_pt"])
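
With `sublinear_tf=True`, the raw term count is replaced by `1 + ln(count)` before the IDF weighting. The sketch below reproduces that weighting in plain Python under scikit-learn's documented defaults (smoothed IDF and L2 normalization). It is only an illustration: tokens are split on whitespace, whereas `TfidfVectorizer` uses its own token pattern, and the name `tfidf_sublinear` is hypothetical.

```python
import math
from collections import Counter

def tfidf_sublinear(docs):
    """Sublinear TF-IDF with smoothed IDF and L2-normalized rows, as dicts."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: (1 + math.log(count)) * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(weight * weight for weight in weights.values()))
        vectors.append({term: weight / norm for term, weight in weights.items()} if norm else {})
    return vectors

example_docs = ["company profit rise rise", "company loss"]
example_vectors = tfidf_sublinear(example_docs)
print(example_vectors[0])
```

In the first document, `rise` outweighs `company` because it is repeated and appears in only one document, while the logarithm keeps the repetition from doubling its weight.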

### 1.10. Splitting the data into training and test sets

In [34]:
from sklearn.preprocessing import LabelEncoder

In [35]:
le = LabelEncoder()
le.fit(df['y'])
le.classes_

array(['negative', 'neutral', 'positive'], dtype=object)
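
The integer codes follow the sorted order of the class names, which is why `negative` becomes 0, `neutral` 1, and `positive` 2. A minimal sketch of that mapping in plain Python (the helper `fit_label_mapping` is hypothetical and only mirrors what the encoder's `classes_` output above implies):

```python
def fit_label_mapping(labels):
    """Map each distinct label to its position in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

mapping = fit_label_mapping(["neutral", "negative", "positive", "positive"])
encoded = [mapping[label] for label in ["neutral", "negative", "positive"]]
print(mapping, encoded)
```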

- Training data

In [36]:
# Classes
y_train = df['y'][:-400]
y_train

0        neutral
1       negative
2       positive
3       positive
4       positive
          ...   
4429    negative
4430    negative
4431    negative
4432    negative
4433    negative
Name: y, Length: 4434, dtype: object

In [37]:
y_train_le = le.transform(y_train)
y_train_le

array([1, 0, 2, ..., 0, 0, 0])

In [38]:
# Features
X_train_eng = features_text[:-400]
X_train_pt = features_text_pt[:-400]

- Test data

In [39]:
# Classes
y_test = df['y'][-400:]
y_test

4434    negative
4435    negative
4436    negative
4437    negative
4438    negative
          ...   
4829    negative
4830     neutral
4831    negative
4832    negative
4833    negative
Name: y, Length: 400, dtype: object

In [40]:
y_test_le = le.transform(y_test)
y_test_le

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2,
       2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2,
       1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1,
       1, 2, 0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       2, 0, 2, 1, 0, 1, 1, 2, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 0, 0, 2, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 2, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [41]:
# Features
X_test_eng = features_text[-400:]
X_test_pt = features_text_pt[-400:]

## 2. Training models with cross-validation

In [42]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
import joblib

In [43]:
PATH_METRICS = "./metrics/"
PATH_MODELS = "./models/"

In [44]:
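import numpy as np
from sklearn.base import clone

def evaluate_model(model, skf, X, y, model_name):
    # Cross-validate the model, then refit and save one copy per fold
    y = np.asarray(y)  # positional indexing, independent of any pandas index
    pipeline = Pipeline([('estimator', model)])
    scores = cross_validate(pipeline, X, y, cv=skf, scoring=["accuracy",
                                                             "f1_weighted",
                                                             "precision_weighted", 
                                                             "recall_weighted"])
    # Turn keys such as "test_f1_weighted" into "Test F1 Weighted"
    columns = {key: key.replace('_', ' ').title() for key in scores}

    n_splits = skf.get_n_splits()
    metrics = pd.DataFrame(scores, index=["K-Fold " + str(i) for i in range(1, n_splits + 1)])
    metrics = metrics.rename(columns=columns)
    
    for i, (train_index, _) in enumerate(skf.split(X, y)):
        # Fit a fresh copy per fold so the saved models are independent objects
        fold_model = clone(model)
        fold_model.fit(X[train_index], y[train_index])
        
        save_path = PATH_MODELS + model_name + str(i) + ".pkl"
        joblib.dump(fold_model, save_path)
    
    return metrics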

In [45]:
# !pip install xgboost

In [46]:
from sklearn import tree
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [47]:
# Instantiating StratifiedKFold
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Instantiating the models to be trained
decision_tree = tree.DecisionTreeClassifier()
svm_model = svm.SVC(kernel='linear')
kNN = KNeighborsClassifier(n_neighbors=5)
xgb_model = xgb.XGBClassifier()
lr_model = LogisticRegression()
nb_model = MultinomialNB()
rf_model = RandomForestClassifier()
etc_model = ExtraTreesClassifier()

- Cross-validation on English texts

In [48]:
# Training each model and collecting its metrics
metrics_dt_eng = evaluate_model(decision_tree, skf, X_train_eng, y_train, "Decision_Tree_eng")
metrics_svm_eng = evaluate_model(svm_model, skf, X_train_eng, y_train, "SVM_eng")
metrics_knn_eng = evaluate_model(kNN, skf, X_train_eng, y_train, "KNN_eng")
metrics_xgb_eng = evaluate_model(xgb_model, skf, X_train_eng, y_train_le, "XGB_eng")
metrics_lr_eng = evaluate_model(lr_model, skf, X_train_eng, y_train, "Logistic_Regression_eng")
metrics_nb_eng = evaluate_model(nb_model, skf, X_train_eng, y_train, "Naive_Bayes_eng")
metrics_rf_eng = evaluate_model(rf_model, skf, X_train_eng, y_train, "Random_Forest_eng")
metrics_etc_eng = evaluate_model(etc_model, skf, X_train_eng, y_train, "Extra_Tree_eng")

found 0 physical cores < 1
  File "c:\Users\ArkadeUser\Documents\projects\nlp-sentimental-analysis\venv\Lib\site-packages\joblib\externals\loky\backend\context.py", line 282, in _count_physical_cores
    raise ValueError(f"found {cpu_count_physical} physical cores < 1")
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [49]:
metrics_dt_eng

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,3.488176,0.023978,0.711773,0.705493,0.70148,0.711773
K-Fold 2,2.359762,0.024,0.712449,0.70925,0.707162,0.712449
K-Fold 3,3.329239,0.016004,0.694858,0.690629,0.687413,0.694858


In [50]:
metrics_svm_eng

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,1.060325,0.407952,0.767253,0.746396,0.760637,0.767253
K-Fold 2,1.038465,0.414531,0.764547,0.740703,0.762694,0.764547
K-Fold 3,1.094825,0.406285,0.738836,0.709347,0.724834,0.738836


In [51]:
metrics_knn_eng

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,0.008116,0.390908,0.709743,0.696435,0.694603,0.709743
K-Fold 2,0.009184,0.1696,0.713126,0.684486,0.696572,0.713126
K-Fold 3,0.007505,0.177736,0.700947,0.666445,0.684654,0.700947


In [52]:
metrics_xgb_eng

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,1.512913,0.016006,0.774019,0.760686,0.767921,0.774019
K-Fold 2,0.536114,0.016001,0.769959,0.752305,0.758227,0.769959
K-Fold 3,0.536098,0.012011,0.743572,0.725392,0.731289,0.743572


In [53]:
metrics_lr_eng

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,0.566131,0.016274,0.744249,0.707091,0.761332,0.744249
K-Fold 2,0.555012,0.0242,0.748309,0.709231,0.757647,0.748309
K-Fold 3,0.454655,0.032381,0.728011,0.681038,0.71271,0.728011


In [54]:
metrics_nb_eng

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,0.008125,0.041133,0.705007,0.640424,0.650364,0.705007
K-Fold 2,0.010207,0.022128,0.709066,0.644451,0.666847,0.709066
K-Fold 3,0.0079,0.016855,0.698241,0.628531,0.652656,0.698241


In [55]:
metrics_rf_eng

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,14.334477,0.101358,0.751691,0.717633,0.742062,0.751691
K-Fold 2,14.99746,0.111963,0.759134,0.725059,0.752766,0.759134
K-Fold 3,14.095937,0.112497,0.744926,0.708512,0.762509,0.744926


In [56]:
metrics_etc_eng

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,22.638088,0.139432,0.753721,0.729917,0.742737,0.753721
K-Fold 2,24.07935,0.135307,0.771989,0.744107,0.771369,0.771989
K-Fold 3,22.017257,0.136739,0.756428,0.727445,0.759177,0.756428


- Cross-validation on Portuguese texts

In [57]:
# Training each model and collecting its metrics
metrics_dt_pt = evaluate_model(decision_tree, skf, X_train_pt, y_train, "Decision_Tree_pt")
metrics_svm_pt = evaluate_model(svm_model, skf, X_train_pt, y_train, "SVM_pt")
metrics_knn_pt = evaluate_model(kNN, skf, X_train_pt, y_train, "KNN_pt")
metrics_xgb_pt = evaluate_model(xgb_model, skf, X_train_pt, y_train_le, "XGB_pt")
metrics_lr_pt = evaluate_model(lr_model, skf, X_train_pt, y_train, "Logistic_Regression_pt")
metrics_nb_pt = evaluate_model(nb_model, skf, X_train_pt, y_train, "Naive_Bayes_pt")
metrics_rf_pt = evaluate_model(rf_model, skf, X_train_pt, y_train, "Random_Forest_pt")
metrics_etc_pt = evaluate_model(etc_model, skf, X_train_pt, y_train, "Extra_Tree_pt")

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [58]:
metrics_dt_pt

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,2.741453,0.023946,0.717862,0.710998,0.708712,0.717862
K-Fold 2,2.747814,0.023999,0.707037,0.702475,0.700746,0.707037
K-Fold 3,2.704338,0.023991,0.694181,0.687813,0.684329,0.694181


In [59]:
metrics_svm_pt

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,1.170115,0.432072,0.759134,0.737845,0.75496,0.759134
K-Fold 2,1.120782,0.429992,0.776049,0.752999,0.778676,0.776049
K-Fold 3,1.159268,0.446781,0.758457,0.732961,0.759833,0.758457


In [60]:
metrics_knn_pt

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,0.00185,0.217833,0.70433,0.689035,0.687526,0.70433
K-Fold 2,0.0,0.208186,0.698917,0.662461,0.689454,0.698917
K-Fold 3,0.0,0.173575,0.710419,0.67614,0.702064,0.710419


In [61]:
metrics_xgb_pt

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,0.628312,0.014509,0.775372,0.758948,0.770118,0.775372
K-Fold 2,0.524805,0.012135,0.782815,0.765339,0.777253,0.782815
K-Fold 3,0.520527,0.008202,0.76184,0.744195,0.756732,0.76184


In [62]:
metrics_lr_pt

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,0.63649,0.033363,0.744249,0.707107,0.751174,0.744249
K-Fold 2,0.629121,0.022336,0.751015,0.712389,0.762551,0.751015
K-Fold 3,0.387756,0.026387,0.730041,0.685964,0.720708,0.730041


In [63]:
metrics_nb_pt

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,0.005936,0.018258,0.711096,0.64958,0.655627,0.711096
K-Fold 2,0.0,0.02424,0.707037,0.641584,0.660926,0.707037
K-Fold 3,0.006077,0.018232,0.699594,0.63204,0.650511,0.699594


In [64]:
metrics_rf_pt

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,15.37565,0.112865,0.766576,0.737103,0.772877,0.766576
K-Fold 2,15.892756,0.117261,0.756428,0.723703,0.760941,0.756428
K-Fold 3,14.951352,0.106857,0.744926,0.709726,0.746655,0.744926


In [65]:
metrics_etc_pt

Unnamed: 0,Fit Time,Score Time,Test Accuracy,Test F1 Weighted,Test Precision Weighted,Test Recall Weighted
K-Fold 1,24.560087,0.147513,0.772666,0.749005,0.773806,0.772666
K-Fold 2,25.40217,0.133341,0.764547,0.735546,0.768942,0.764547
K-Fold 3,23.492034,0.14345,0.759811,0.733109,0.77058,0.759811
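
To compare the models at a glance, the per-fold scores can be averaged. The sketch below does this for the weighted F1 of the Portuguese models, with the values copied from the tables above; the dictionary `f1_per_fold_pt` and the display names are introduced only for this example.

```python
from statistics import mean

# "Test F1 Weighted" per fold, copied from the Portuguese cross-validation tables
f1_per_fold_pt = {
    "Decision Tree": [0.710998, 0.702475, 0.687813],
    "SVM": [0.737845, 0.752999, 0.732961],
    "KNN": [0.689035, 0.662461, 0.67614],
    "XGBoost": [0.758948, 0.765339, 0.744195],
    "Logistic Regression": [0.707107, 0.712389, 0.685964],
    "Naive Bayes": [0.64958, 0.641584, 0.63204],
    "Random Forest": [0.737103, 0.723703, 0.709726],
    "Extra Trees": [0.749005, 0.735546, 0.733109],
}

mean_f1 = {name: mean(scores) for name, scores in f1_per_fold_pt.items()}
best_model = max(mean_f1, key=mean_f1.get)

# Print the models from best to worst mean weighted F1
for name, score in sorted(mean_f1.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
```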


## 3. Testing the models trained with cross-validation

In [66]:
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
import os

In [67]:
def testing_metrics(y_true, y_pred, language, filename, training_type=""):
    name_df = ""
    df = None
    columns = ["Models", "Accuracy", "Precision Weighted", 
               "F1-Score Weigthed", "Recall Weighted"]
    
    if "eng" == language:
        name_df = "{}testing_metrics_eng.csv".format(training_type)
    else:
        name_df = "{}testing_metrics_pt.csv".format(training_type)
    
    if name_df not in os.listdir(PATH_METRICS):
        df = pd.DataFrame(columns=columns)
    else:
        df = pd.read_csv(PATH_METRICS + name_df)
    
    acc = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average="weighted")
    f1 = f1_score(y_true, y_pred, average="weighted")
    recall = recall_score(y_true, y_pred, average="weighted")
    
    data = [filename, acc, precision, f1, recall]
    len_df = len(df)
    
    for column, d in zip(columns, data):
        df.at[len_df, column] = d
            
    df.to_csv(PATH_METRICS + name_df, index=False)
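
The `average="weighted"` option computes each metric per class and then averages the results weighted by the number of true examples of each class. The sketch below is a plain-Python illustration of that definition, with undefined ratios counted as 0; the function name `weighted_scores` and the toy labels are assumptions made for the example.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = support[label]
        p_c = true_pos / predicted if predicted else 0.0
        r_c = true_pos / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / total  # share of true examples in this class
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

example_true = ["negative", "negative", "negative", "neutral"]
example_pred = ["neutral", "neutral", "negative", "neutral"]
example_scores = weighted_scores(example_true, example_pred)
print(example_scores)
```

Two properties of this definition are visible in the result tables: weighted recall always equals accuracy, and weighted precision can stay high while accuracy is low when a frequent class is rarely predicted but predicted correctly when it is.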

- Mapping the model files

In [68]:
files = [i for i in os.listdir(PATH_MODELS)]
files_eng = [i for i in files if "eng" in i]
files_pt = [i for i in files if i not in files_eng]

In [69]:
files_eng.sort()
files_eng

['Decision_Tree_eng0.pkl',
 'Decision_Tree_eng1.pkl',
 'Decision_Tree_eng2.pkl',
 'Extra_Tree_eng0.pkl',
 'Extra_Tree_eng1.pkl',
 'Extra_Tree_eng2.pkl',
 'KNN_eng0.pkl',
 'KNN_eng1.pkl',
 'KNN_eng2.pkl',
 'Logistic_Regression_eng0.pkl',
 'Logistic_Regression_eng1.pkl',
 'Logistic_Regression_eng2.pkl',
 'Naive_Bayes_eng0.pkl',
 'Naive_Bayes_eng1.pkl',
 'Naive_Bayes_eng2.pkl',
 'Random_Forest_eng0.pkl',
 'Random_Forest_eng1.pkl',
 'Random_Forest_eng2.pkl',
 'SVM_eng0.pkl',
 'SVM_eng1.pkl',
 'SVM_eng2.pkl',
 'XGB_eng0.pkl',
 'XGB_eng1.pkl',
 'XGB_eng2.pkl']

In [70]:
files_pt.sort()
files_pt

['Decision_Tree_pt0.pkl',
 'Decision_Tree_pt1.pkl',
 'Decision_Tree_pt2.pkl',
 'Extra_Tree_pt0.pkl',
 'Extra_Tree_pt1.pkl',
 'Extra_Tree_pt2.pkl',
 'KNN_pt0.pkl',
 'KNN_pt1.pkl',
 'KNN_pt2.pkl',
 'Logistic_Regression_pt0.pkl',
 'Logistic_Regression_pt1.pkl',
 'Logistic_Regression_pt2.pkl',
 'Naive_Bayes_pt0.pkl',
 'Naive_Bayes_pt1.pkl',
 'Naive_Bayes_pt2.pkl',
 'Random_Forest_pt0.pkl',
 'Random_Forest_pt1.pkl',
 'Random_Forest_pt2.pkl',
 'SVM_pt0.pkl',
 'SVM_pt1.pkl',
 'SVM_pt2.pkl',
 'XGB_pt0.pkl',
 'XGB_pt1.pkl',
 'XGB_pt2.pkl']

- Testing the models on English text

In [71]:
for file in files_eng:
    model = joblib.load(PATH_MODELS + file)
    y_pred = model.predict(X_test_eng)
    
    if "XGB" not in file:
        testing_metrics(y_test, y_pred, "eng", file, "CV")
    else:
        testing_metrics(y_test_le, y_pred, "eng", file, "CV")

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [72]:
pd.read_csv(PATH_METRICS + "CVtesting_metrics_eng.csv")

Unnamed: 0,Models,Accuracy,Precision Weighted,F1-Score Weigthed,Recall Weighted
0,Decision_Tree_eng0.pkl,0.3175,0.682592,0.301562,0.3175
1,Decision_Tree_eng1.pkl,0.3225,0.660187,0.324247,0.3225
2,Decision_Tree_eng2.pkl,0.33,0.615429,0.338737,0.33
3,Extra_Tree_eng0.pkl,0.2525,0.740737,0.168178,0.2525
4,Extra_Tree_eng1.pkl,0.2475,0.626262,0.165932,0.2475
5,Extra_Tree_eng2.pkl,0.245,0.559683,0.159807,0.245
6,KNN_eng0.pkl,0.3225,0.64627,0.311249,0.3225
7,KNN_eng1.pkl,0.2975,0.628772,0.27511,0.2975
8,KNN_eng2.pkl,0.2975,0.64613,0.286223,0.2975
9,Logistic_Regression_eng0.pkl,0.235,0.733313,0.140378,0.235


- Testing the models on Portuguese text

In [73]:
for file in files_pt:
    model = joblib.load(PATH_MODELS + file)
    y_pred = model.predict(X_test_pt)
    
    if "XGB" not in file:
        testing_metrics(y_test, y_pred, "pt", file, "CV")
    else:
        testing_metrics(y_test_le, y_pred, "pt", file, "CV")

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [74]:
pd.read_csv(PATH_METRICS + "CVtesting_metrics_pt.csv")

Unnamed: 0,Models,Accuracy,Precision Weighted,F1-Score Weigthed,Recall Weighted
0,Decision_Tree_pt0.pkl,0.47,0.702354,0.515918,0.47
1,Decision_Tree_pt1.pkl,0.4375,0.700806,0.478735,0.4375
2,Decision_Tree_pt2.pkl,0.475,0.686425,0.51343,0.475
3,Extra_Tree_pt0.pkl,0.305,0.686925,0.269581,0.305
4,Extra_Tree_pt1.pkl,0.3075,0.720896,0.275529,0.3075
5,Extra_Tree_pt2.pkl,0.2675,0.679252,0.214341,0.2675
6,KNN_pt0.pkl,0.335,0.663632,0.33021,0.335
7,KNN_pt1.pkl,0.3675,0.676636,0.379038,0.3675
8,KNN_pt2.pkl,0.3525,0.681905,0.347534,0.3525
9,Logistic_Regression_pt0.pkl,0.2625,0.733035,0.186731,0.2625


Since the models trained with cross-validation did not perform well when tested on the last 400 examples of the original dataset, other approaches are tried next in an attempt to improve the metrics.

## 4. Training and testing without cross-validation
Here, the models are fitted once on X_train_eng and X_train_pt, without cross-validation, and evaluated on X_test_eng and X_test_pt to check whether the metrics improve.

In [75]:
from sklearn.neural_network import MLPClassifier

In [76]:
# Instantiating the models to be trained
decision_tree = tree.DecisionTreeClassifier()
svm_model = svm.SVC(kernel='linear')
kNN = KNeighborsClassifier(n_neighbors=5)
xgb_model = xgb.XGBClassifier()
lr_model = LogisticRegression()
nb_model = MultinomialNB()
rf_model = RandomForestClassifier()
etc_model = ExtraTreesClassifier()
mlp = MLPClassifier()

### 4.1. English texts
- Training

In [77]:
decision_tree.fit(X_train_eng, y_train)
svm_model.fit(X_train_eng, y_train)
kNN.fit(X_train_eng, y_train)
xgb_model.fit(X_train_eng, y_train_le)
lr_model.fit(X_train_eng, y_train)
nb_model.fit(X_train_eng, y_train)
rf_model.fit(X_train_eng, y_train)
etc_model.fit(X_train_eng, y_train)
mlp.fit(X_train_eng, y_train)

- Testing

In [78]:
testing_metrics(y_test, decision_tree.predict(X_test_eng), "eng", "Decision Tree")
testing_metrics(y_test, svm_model.predict(X_test_eng), "eng", "SVM")
testing_metrics(y_test, kNN.predict(X_test_eng), "eng", "KNN")
testing_metrics(y_test_le, xgb_model.predict(X_test_eng), "eng", "XGBoost")
testing_metrics(y_test, lr_model.predict(X_test_eng), "eng", "Logistic Regression")
testing_metrics(y_test, nb_model.predict(X_test_eng), "eng", "Naive Bayes")
testing_metrics(y_test, rf_model.predict(X_test_eng), "eng", "Random Forest")
testing_metrics(y_test, etc_model.predict(X_test_eng), "eng", "Extra Tree")
testing_metrics(y_test, mlp.predict(X_test_eng), "eng", "MLP Classifier")

  _warn_prf(average, modifier, msg_start, len(result))


In [79]:
pd.read_csv(PATH_METRICS + "testing_metrics_eng.csv")

Unnamed: 0,Models,Accuracy,Precision Weighted,F1-Score Weigthed,Recall Weighted
0,Decision Tree,0.3425,0.66258,0.360447,0.3425
1,SVM,0.2925,0.715094,0.24375,0.2925
2,KNN,0.31,0.642051,0.296555,0.31
3,XGBoost,0.4025,0.714615,0.413457,0.4025
4,Logistic Regression,0.2425,0.733536,0.150517,0.2425
5,Naive Bayes,0.2175,0.072224,0.108383,0.2175
6,Random Forest,0.2275,0.731924,0.128818,0.2275
7,Extra Tree,0.2775,0.707049,0.207812,0.2775
8,MLP Classifier,0.3525,0.689618,0.355763,0.3525


### 4.2. Portuguese texts
- Training

In [80]:
decision_tree.fit(X_train_pt, y_train)
svm_model.fit(X_train_pt, y_train)
kNN.fit(X_train_pt, y_train)
xgb_model.fit(X_train_pt, y_train_le)
lr_model.fit(X_train_pt, y_train)
nb_model.fit(X_train_pt, y_train)
rf_model.fit(X_train_pt, y_train)
etc_model.fit(X_train_pt, y_train)
mlp.fit(X_train_pt, y_train)

- Testing

In [81]:
testing_metrics(y_test, decision_tree.predict(X_test_pt), "pt", "Decision Tree")
testing_metrics(y_test, svm_model.predict(X_test_pt), "pt", "SVM")
testing_metrics(y_test, kNN.predict(X_test_pt), "pt", "KNN")
testing_metrics(y_test_le, xgb_model.predict(X_test_pt), "pt", "XGBoost")
testing_metrics(y_test, lr_model.predict(X_test_pt), "pt", "Logistic Regression")
testing_metrics(y_test, nb_model.predict(X_test_pt), "pt", "Naive Bayes")
testing_metrics(y_test, rf_model.predict(X_test_pt), "pt", "Random Forest")
testing_metrics(y_test, etc_model.predict(X_test_pt), "pt", "Extra Tree")
testing_metrics(y_test, mlp.predict(X_test_pt), "pt", "MLP Classifier")

  _warn_prf(average, modifier, msg_start, len(result))


In [82]:
pd.read_csv(PATH_METRICS + "testing_metrics_pt.csv")

Unnamed: 0,Models,Accuracy,Precision Weighted,F1-Score Weigthed,Recall Weighted
0,Decision Tree,0.455,0.688776,0.493201,0.455
1,SVM,0.4375,0.728271,0.455923,0.4375
2,KNN,0.3375,0.642909,0.344157,0.3375
3,XGBoost,0.4925,0.72063,0.525667,0.4925
4,Logistic Regression,0.3225,0.73904,0.288387,0.3225
5,Naive Bayes,0.22,0.073857,0.110567,0.22
6,Random Forest,0.31,0.675283,0.279573,0.31
7,Extra Tree,0.3,0.724475,0.26005,0.3
8,MLP Classifier,0.4625,0.743727,0.493084,0.4625


The models improved only slightly. XGBoost had the highest weighted F1-score for both Portuguese (0.525667) and English (0.413457).

## 5. Training and testing with a balanced dataset
### 5.1. Balancing the dataset down to the size of the smallest class

In [83]:
y_train.value_counts()

y
neutral     2782
positive    1310
negative     342
Name: count, dtype: int64

In [84]:
classes = y_train.unique()
elements_per_class = 342  # size of the smallest class (negative)
chosen_elements = []

for c in classes:
    y_class = y_train.loc[y_train == c]

    # Undersample each class down to the size of the smallest one
    if len(y_class) >= elements_per_class:
        chosen = y_class.sample(n=elements_per_class, random_state=42)
    else:
        chosen = y_class

    chosen_elements.append(chosen)

y_train_balanced = pd.concat(chosen_elements).sort_index()
y_train_balanced

1       negative
12      positive
25      positive
31      positive
33      positive
          ...   
4429    negative
4430    negative
4431    negative
4432    negative
4433    negative
Name: y, Length: 1026, dtype: object

In [85]:
y_train_le_balanced = y_train_le[y_train_balanced.index]

In [86]:
X_train_eng_balanced = X_train_eng[y_train_balanced.index]
X_train_eng_balanced.shape

(1026, 10332)

In [87]:
X_train_pt_balanced = X_train_pt[y_train_balanced.index]
X_train_pt_balanced.shape

(1026, 11672)
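The undersampling above can also be written more compactly with `groupby(...).sample(...)`. This is a minimal sketch on toy data: `y_toy` and `X_toy` are hypothetical stand-ins for `y_train` and the feature matrices, and the smallest class size is computed instead of hard-coded.

```python
import numpy as np
import pandas as pd

# Toy labels and features standing in for y_train / X_train_* (hypothetical data)
y_toy = pd.Series(["neutral"] * 6 + ["positive"] * 4 + ["negative"] * 2, name="y")
X_toy = np.arange(len(y_toy) * 3).reshape(len(y_toy), 3)

# Size of the smallest class, computed rather than typed in by hand
n_min = y_toy.value_counts().min()

# Draw n_min labels from every class, then restore the original row order
y_balanced = y_toy.groupby(y_toy).sample(n=n_min, random_state=42).sort_index()

# Index labels equal row positions only because y_toy has a default RangeIndex
X_balanced = X_toy[y_balanced.index.to_numpy()]

print(y_balanced.value_counts().to_dict())
print(X_balanced.shape)
```

Selecting rows of the feature matrix by index labels, as the notebook does, relies on the labels matching row positions. If the index of the label series had gaps, the labels would first need to be mapped to positions.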

### English texts
- Training

In [88]:
decision_tree.fit(X_train_eng_balanced, y_train_balanced)
svm_model.fit(X_train_eng_balanced, y_train_balanced)
kNN.fit(X_train_eng_balanced, y_train_balanced)
xgb_model.fit(X_train_eng_balanced, y_train_le_balanced)
lr_model.fit(X_train_eng_balanced, y_train_balanced)
nb_model.fit(X_train_eng_balanced, y_train_balanced)
rf_model.fit(X_train_eng_balanced, y_train_balanced)
etc_model.fit(X_train_eng_balanced, y_train_balanced)
mlp.fit(X_train_eng_balanced, y_train_balanced)

- Testing

In [89]:
testing_metrics(y_test, decision_tree.predict(X_test_eng), "eng", "Decision Tree", "BL")
testing_metrics(y_test, svm_model.predict(X_test_eng), "eng", "SVM", "BL")
testing_metrics(y_test, kNN.predict(X_test_eng), "eng", "KNN", "BL")
testing_metrics(y_test_le, xgb_model.predict(X_test_eng), "eng", "XGBoost", "BL")
testing_metrics(y_test, lr_model.predict(X_test_eng), "eng", "Logistic Regression", "BL")
testing_metrics(y_test, nb_model.predict(X_test_eng), "eng", "Naive Bayes", "BL")
testing_metrics(y_test, rf_model.predict(X_test_eng), "eng", "Random Forest", "BL")
testing_metrics(y_test, etc_model.predict(X_test_eng), "eng", "Extra Tree", "BL")
testing_metrics(y_test, mlp.predict(X_test_eng), "eng", "MLP Classifier", "BL")

In [90]:
pd.read_csv(PATH_METRICS + "BLtesting_metrics_eng.csv")

Unnamed: 0,Models,Accuracy,Precision Weighted,F1-Score Weigthed,Recall Weighted
0,Decision Tree,0.535,0.668954,0.571192,0.535
1,SVM,0.5775,0.699035,0.609989,0.5775
2,KNN,0.5375,0.606984,0.560663,0.5375
3,XGBoost,0.5975,0.69808,0.626977,0.5975
4,Logistic Regression,0.5525,0.700614,0.586288,0.5525
5,Naive Bayes,0.7025,0.73983,0.714935,0.7025
6,Random Forest,0.5325,0.696377,0.568465,0.5325
7,Extra Tree,0.5375,0.691626,0.571306,0.5375
8,MLP Classifier,0.5575,0.669204,0.591203,0.5575


### Portuguese texts
- Training

In [91]:
decision_tree.fit(X_train_pt_balanced, y_train_balanced)
svm_model.fit(X_train_pt_balanced, y_train_balanced)
kNN.fit(X_train_pt_balanced, y_train_balanced)
xgb_model.fit(X_train_pt_balanced, y_train_le_balanced)
lr_model.fit(X_train_pt_balanced, y_train_balanced)
nb_model.fit(X_train_pt_balanced, y_train_balanced)
rf_model.fit(X_train_pt_balanced, y_train_balanced)
etc_model.fit(X_train_pt_balanced, y_train_balanced)
mlp.fit(X_train_pt_balanced, y_train_balanced)

- Testing

In [92]:
testing_metrics(y_test, decision_tree.predict(X_test_pt), "pt", "Decision Tree", "BL")
testing_metrics(y_test, svm_model.predict(X_test_pt), "pt", "SVM", "BL")
testing_metrics(y_test, kNN.predict(X_test_pt), "pt", "KNN", "BL")
testing_metrics(y_test_le, xgb_model.predict(X_test_pt), "pt", "XGBoost", "BL")
testing_metrics(y_test, lr_model.predict(X_test_pt), "pt", "Logistic Regression", "BL")
testing_metrics(y_test, nb_model.predict(X_test_pt), "pt", "Naive Bayes", "BL")
testing_metrics(y_test, rf_model.predict(X_test_pt), "pt", "Random Forest", "BL")
testing_metrics(y_test, etc_model.predict(X_test_pt), "pt", "Extra Tree", "BL")
testing_metrics(y_test, mlp.predict(X_test_pt), "pt", "MLP Classifier", "BL")

In [93]:
pd.read_csv(PATH_METRICS + "BLtesting_metrics_pt.csv")

Unnamed: 0,Models,Accuracy,Precision Weighted,F1-Score Weigthed,Recall Weighted
0,Decision Tree,0.595,0.689786,0.628642,0.595
1,SVM,0.5675,0.685662,0.602382,0.5675
2,KNN,0.555,0.621997,0.579884,0.555
3,XGBoost,0.63,0.696022,0.654409,0.63
4,Logistic Regression,0.585,0.711554,0.617721,0.585
5,Naive Bayes,0.615,0.68173,0.640046,0.615
6,Random Forest,0.5975,0.701167,0.622704,0.5975
7,Extra Tree,0.6025,0.686345,0.628512,0.6025
8,MLP Classifier,0.5525,0.652658,0.587001,0.5525


Naive Bayes stood out on the English test set, with an F1-Score of 0.714935. In Portuguese, XGBoost still has the highest F1-Score of all models, at 0.654409.
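The best model can also be read off a metrics table programmatically. The sketch below rebuilds the weighted F1 column of the balanced Portuguese table inline so that it runs on its own. In the notebook, the same lookup would presumably be applied to the DataFrame returned by `pd.read_csv`. The column name keeps the spelling used in the saved CSV.

```python
import pandas as pd

# Weighted F1 values copied from the balanced Portuguese results table above
metrics_pt = pd.DataFrame({
    "Models": ["Decision Tree", "SVM", "KNN", "XGBoost", "Logistic Regression",
               "Naive Bayes", "Random Forest", "Extra Tree", "MLP Classifier"],
    "F1-Score Weigthed": [0.628642, 0.602382, 0.579884, 0.654409, 0.617721,
                          0.640046, 0.622704, 0.628512, 0.587001],
})

# Row with the highest weighted F1
best = metrics_pt.loc[metrics_pt["F1-Score Weigthed"].idxmax()]
print(best["Models"], best["F1-Score Weigthed"])
```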

### 5.2. Balancing the dataset by increasing the number of examples in the smaller classes

In [94]:
# !pip install imbalanced-learn

In [95]:
from imblearn.over_sampling import ADASYN
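As background, ADASYN creates synthetic minority examples by interpolating between a minority point and one of its minority-class neighbours, and it generates more of them around points that are surrounded by other classes. The function below is a simplified illustration of that idea, not imblearn's implementation: the name `adasyn_like_oversample` is hypothetical, it handles a single minority class, and it samples seed points at random instead of computing an exact count per point.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def adasyn_like_oversample(X, y, minority_label, n_new, k=5, seed=42):
    """Simplified ADASYN-style oversampling for one class (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]

    # 1) Difficulty of each minority point: share of other-class points among
    #    its k nearest neighbours (assumes no duplicates, so neighbour 0 is itself)
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neigh = nn_all.kneighbors(X_min)
    ratio = (y[neigh[:, 1:]] != minority_label).mean(axis=1)
    if ratio.sum() == 0:
        weights = np.full(len(X_min), 1.0 / len(X_min))
    else:
        weights = ratio / ratio.sum()

    # 2) Harder points are chosen as seeds more often
    seeds = rng.choice(len(X_min), size=n_new, p=weights)

    # 3) Each synthetic point lies on the segment between a seed and one of
    #    its nearest neighbours within the minority class
    k_min = min(k, len(X_min) - 1)
    nn_min = NearestNeighbors(n_neighbors=k_min + 1).fit(X_min)
    _, neigh_min = nn_min.kneighbors(X_min)
    partners = neigh_min[seeds, rng.integers(1, k_min + 1, size=n_new)]
    lam = rng.random((n_new, 1))
    X_new = X_min[seeds] + lam * (X_min[partners] - X_min[seeds])
    return X_new, np.full(n_new, minority_label)


# Toy 2-D data (hypothetical): 40 majority points and 8 minority points
rng_demo = np.random.default_rng(0)
X_major = rng_demo.normal(0.0, 1.0, size=(40, 2))
X_minor = rng_demo.normal(2.0, 1.0, size=(8, 2))
X_demo = np.vstack([X_major, X_minor])
y_demo = np.array(["neutral"] * 40 + ["negative"] * 8)

X_new, y_new = adasyn_like_oversample(X_demo, y_demo, "negative", n_new=32)
print(X_new.shape, y_new[:3])
```

Because every synthetic point is a convex combination of two real minority points, the new examples stay inside the region already covered by that class.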

### English texts
- Training

In [96]:
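# Fixed seed so that both calls below resample identically and the string and encoded labels stay aligned
adasyn = ADASYN(random_state=42)
X_train_eng_resampled, y_train_resampled = adasyn.fit_resample(X_train_eng, y_train)
_, y_train_le_resampled = adasyn.fit_resample(X_train_eng, y_train_le)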

In [97]:
decision_tree.fit(X_train_eng_resampled, y_train_resampled)
svm_model.fit(X_train_eng_resampled, y_train_resampled)
kNN.fit(X_train_eng_resampled, y_train_resampled)
xgb_model.fit(X_train_eng_resampled, y_train_le_resampled)
lr_model.fit(X_train_eng_resampled, y_train_resampled)
nb_model.fit(X_train_eng_resampled, y_train_resampled)
rf_model.fit(X_train_eng_resampled, y_train_resampled)
etc_model.fit(X_train_eng_resampled, y_train_resampled)
mlp.fit(X_train_eng_resampled, y_train_resampled)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


- Testing

In [98]:
testing_metrics(y_test, decision_tree.predict(X_test_eng), "eng", "Decision Tree", "RS")
testing_metrics(y_test, svm_model.predict(X_test_eng), "eng", "SVM", "RS")
testing_metrics(y_test, kNN.predict(X_test_eng), "eng", "KNN", "RS")
testing_metrics(y_test_le, xgb_model.predict(X_test_eng), "eng", "XGBoost", "RS")
testing_metrics(y_test, lr_model.predict(X_test_eng), "eng", "Logistic Regression", "RS")
testing_metrics(y_test, nb_model.predict(X_test_eng), "eng", "Naive Bayes", "RS")
testing_metrics(y_test, rf_model.predict(X_test_eng), "eng", "Random Forest", "RS")
testing_metrics(y_test, etc_model.predict(X_test_eng), "eng", "Extra Tree", "RS")
testing_metrics(y_test, mlp.predict(X_test_eng), "eng", "MLP Classifier", "RS")

In [99]:
pd.read_csv(PATH_METRICS + "RStesting_metrics_eng.csv")

Unnamed: 0,Models,Accuracy,Precision Weighted,F1-Score Weigthed,Recall Weighted
0,Decision Tree,0.4225,0.672256,0.454549,0.4225
1,SVM,0.395,0.702459,0.401111,0.395
2,KNN,0.5,0.533565,0.48215,0.5
3,XGBoost,0.5425,0.741285,0.575595,0.5425
4,Logistic Regression,0.435,0.703043,0.456779,0.435
5,Naive Bayes,0.6225,0.735383,0.658022,0.6225
6,Random Forest,0.2825,0.716682,0.235111,0.2825
7,Extra Tree,0.31,0.709134,0.283183,0.31
8,MLP Classifier,0.3825,0.685018,0.389924,0.3825


### Portuguese texts
- Training

In [100]:
X_train_pt_resampled, y_train_resampled = adasyn.fit_resample(X_train_pt, y_train)
_, y_train_le_resampled = adasyn.fit_resample(X_train_pt, y_train_le)

In [101]:
decision_tree.fit(X_train_pt_resampled, y_train_resampled)
svm_model.fit(X_train_pt_resampled, y_train_resampled)
kNN.fit(X_train_pt_resampled, y_train_resampled)
xgb_model.fit(X_train_pt_resampled, y_train_le_resampled)
lr_model.fit(X_train_pt_resampled, y_train_resampled)
nb_model.fit(X_train_pt_resampled, y_train_resampled)
rf_model.fit(X_train_pt_resampled, y_train_resampled)
etc_model.fit(X_train_pt_resampled, y_train_resampled)
mlp.fit(X_train_pt_resampled, y_train_resampled)

- Testing

In [102]:
testing_metrics(y_test, decision_tree.predict(X_test_pt), "pt", "Decision Tree", "RS")
testing_metrics(y_test, svm_model.predict(X_test_pt), "pt", "SVM", "RS")
testing_metrics(y_test, kNN.predict(X_test_pt), "pt", "KNN", "RS")
testing_metrics(y_test_le, xgb_model.predict(X_test_pt), "pt", "XGBoost", "RS")
testing_metrics(y_test, lr_model.predict(X_test_pt), "pt", "Logistic Regression", "RS")
testing_metrics(y_test, nb_model.predict(X_test_pt), "pt", "Naive Bayes", "RS")
testing_metrics(y_test, rf_model.predict(X_test_pt), "pt", "Random Forest", "RS")
testing_metrics(y_test, etc_model.predict(X_test_pt), "pt", "Extra Tree", "RS")
testing_metrics(y_test, mlp.predict(X_test_pt), "pt", "MLP Classifier", "RS")

In [103]:
pd.read_csv(PATH_METRICS + "RStesting_metrics_pt.csv")

Unnamed: 0,Models,Accuracy,Precision Weighted,F1-Score Weigthed,Recall Weighted
0,Decision Tree,0.4875,0.687703,0.534837,0.4875
1,SVM,0.505,0.7267,0.543708,0.505
2,KNN,0.57,0.519107,0.529954,0.57
3,XGBoost,0.58,0.726151,0.614801,0.58
4,Logistic Regression,0.535,0.726948,0.572778,0.535
5,Naive Bayes,0.6375,0.720389,0.667987,0.6375
6,Random Forest,0.415,0.697988,0.440006,0.415
7,Extra Tree,0.5075,0.735747,0.540926,0.5075
8,MLP Classifier,0.45,0.721996,0.479252,0.45


Naive Bayes remained the best model on the English test set, with an F1-Score of 0.658022. In Portuguese, Naive Bayes now also performed best, with an F1-Score of 0.667987.

## 6. Searching for optimal hyperparameters for the best models
- Training without cross-validation: XGBoost
- Training with data balanced to the smallest class: Naive Bayes and XGBoost
- Training with data balanced with ADASYN: Naive Bayes

Here we search for optimal hyperparameters for the best models from the approaches above, for Portuguese texts only.

### 6.1. XGBoost
- Original dataset

In [107]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

hyperparameters = {
    'max_depth': [6, 12, 15],  # Candidate values for the maximum tree depth
    'learning_rate': [0.1, 0.01, 0.001],  # Candidate values for the learning rate
    'n_estimators': [100, 500, 1000]  # Candidate values for the number of estimators
}

skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

grid_search = GridSearchCV(xgb_model, hyperparameters, cv=skf, scoring='f1_weighted')
grid_search.fit(X_train_pt, y_train_le)

print("Best performance score:")
print(grid_search.best_score_)

Best performance score:
0.756332564285095


- Dataset balanced to the smallest class

In [108]:
from sklearn.model_selection import GridSearchCV

hyperparameters = {
    'max_depth': [6, 12, 15],  # Candidate values for the maximum tree depth
    'learning_rate': [0.1, 0.01, 0.001],  # Candidate values for the learning rate
    'n_estimators': [100, 500, 1000]  # Candidate values for the number of estimators
}

grid_search = GridSearchCV(xgb_model, hyperparameters, cv=skf, scoring='f1_weighted')
grid_search.fit(X_train_pt_balanced, y_train_le_balanced)

print("Best performance score:")
print(grid_search.best_score_)

Best performance score:
0.6065881989145527


- Dataset balanced with ADASYN

In [109]:
from sklearn.model_selection import GridSearchCV

hyperparameters = {
    'max_depth': [6, 12, 15],  # Candidate values for the maximum tree depth
    'learning_rate': [0.1, 0.01, 0.001],  # Candidate values for the learning rate
    'n_estimators': [100, 500, 1000]  # Candidate values for the number of estimators
}

grid_search = GridSearchCV(xgb_model, hyperparameters, cv=skf, scoring='f1_weighted')
grid_search.fit(X_train_pt_resampled, y_train_le_resampled)

print("Best performance score:")
print(grid_search.best_score_)

Best performance score:
0.8876527501104017
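
One point worth keeping in mind when reading this score: the cross-validation ran on data that had already been oversampled, so a synthetic example and the real example it was derived from can fall into different folds, which tends to make the score optimistic. A common alternative is to oversample only the training part of each fold. The sketch below illustrates that idea under simplifying assumptions: it swaps in plain random oversampling for ADASYN and logistic regression for XGBoost, and uses hypothetical toy data in place of the vectorized texts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def oversample(X, y, rng):
    """Random oversampling: draw rows with replacement until all classes match the largest."""
    classes, counts = np.unique(y, return_counts=True)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=counts.max(), replace=True)
        for c in classes
    ])
    return X[idx], y[idx]


# Hypothetical imbalanced toy data standing in for the vectorized texts
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (90, 5)), rng.normal(1.0, 1.0, (30, 5))])
y = np.array([0] * 90 + [1] * 30)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in skf.split(X, y):
    # Resample the training fold only; the validation fold keeps its original distribution
    X_tr, y_tr = oversample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    y_val_pred = model.predict(X[val_idx])
    scores.append(f1_score(y[val_idx], y_val_pred, average="weighted"))

print(np.mean(scores))
```

With imbalanced-learn, the same effect is usually obtained by placing the sampler and the classifier in an `imblearn.pipeline.Pipeline` and passing that pipeline to `GridSearchCV`.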


### 6.2. Naive Bayes
- Original dataset

In [104]:
from sklearn.model_selection import GridSearchCV

hyperparameters = {
    'alpha': [0.1, 1.0, 10.0],  # Candidate values for the smoothing parameter alpha
    'fit_prior': [True, False]  # Whether to learn class prior probabilities from the data
}

grid_search = GridSearchCV(nb_model, hyperparameters, cv=skf, scoring='f1_weighted')
grid_search.fit(X_train_pt, y_train)

print("Best performance score:")
print(grid_search.best_score_)

Best performance score:
0.6909068065699291


- Dataset balanced to the smallest class

In [105]:
from sklearn.model_selection import GridSearchCV

hyperparameters = {
    'alpha': [0.1, 1.0, 10.0],  # Candidate values for the smoothing parameter alpha
    'fit_prior': [True, False]  # Whether to learn class prior probabilities from the data
}

grid_search = GridSearchCV(nb_model, hyperparameters, cv=skf, scoring='f1_weighted')
grid_search.fit(X_train_pt_balanced, y_train_le_balanced)

print("Best performance score:")
print(grid_search.best_score_)

Best performance score:
0.6130118057257555


- Dataset balanced with ADASYN

In [106]:
from sklearn.model_selection import GridSearchCV

hyperparameters = {
    'alpha': [0.1, 1.0, 10.0],  # Candidate values for the smoothing parameter alpha
    'fit_prior': [True, False]  # Whether to learn class prior probabilities from the data
}

grid_search = GridSearchCV(nb_model, hyperparameters, cv=skf, scoring='f1_weighted')
grid_search.fit(X_train_pt_resampled, y_train_le_resampled)

print("Best performance score:")
print(grid_search.best_score_)

Best performance score:
0.8711621649697584
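
The cells above print only `best_score_`, which is the mean cross-validated score on the data passed to `fit`. To see which combination won and how it does on held-out data, `best_params_` and `best_estimator_` can be used as sketched below. The sketch assumes `nb_model` is a `MultinomialNB`, which the `alpha` and `fit_prior` grid suggests but the notebook does not confirm here, and it uses hypothetical count data instead of the vectorized texts.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count features: class 1 uses the first five "words" more often
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.poisson(1.0, size=(200, 10))
X[:, :5] += 2 * y[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

param_grid = {"alpha": [0.1, 1.0, 10.0], "fit_prior": [True, False]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(MultinomialNB(), param_grid, cv=cv, scoring="f1_weighted")
search.fit(X_tr, y_tr)

print(search.best_params_)  # winning hyperparameter combination
print(search.best_score_)   # mean cross-validated score on the training split

# Score of the refitted best model on data the search never saw
test_f1 = f1_score(y_te, search.best_estimator_.predict(X_te), average="weighted")
print(test_f1)
```

Reporting the test-set score next to `best_score_` keeps the tuned models comparable with the test metrics in the earlier tables.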


## 7. Final remarks
- Best model for English texts: Naive Bayes
  - F1-Score: 0.714935 (test set)
  - Dataset: balanced to the size of the smallest class
  - Without GridSearchCV(): this approach was not tried

- Best model for Portuguese texts: XGBoost
  - F1-Score: 0.8876527501104017 (mean cross-validated score on the ADASYN-resampled training data, not on the test set)
  - Dataset: balanced with ADASYN
  - With GridSearchCV()