# 01 - Preprocessing

In this notebook we preprocess the data: a light cleanup of the text (if needed) and the definition of the target variable (sentiment).

## Constants

In [2]:
# Number of randomly drawn samples used to inspect the text
N_SAMPLES = 5

# Control sentences for sanity-checking the models
TEXTS_EX = ["This is terrific!",  # 2 --> Positive
            "To be honest, it's overrated",  # 0 --> Negative
            "Nothing special"]  # 1 --> Neutral

## Imports

In [3]:
import pandas as pd
import numpy as np
import pickle as pkl
import re
from tqdm.notebook import tqdm
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from scipy.special import softmax

## Setup and configuration

In [4]:
pd.set_option("max_colwidth", 125)  # Makes long texts easier to read
tqdm.pandas()  # Shows a progress bar when running apply and similar methods

## Execution

### Reading and validating the data

First, let's check which data files are available before doing any preprocessing.

In [5]:
! ls -lh ../data

total 1,6M
-rw-rw-r-- 1 felipe felipe   40 nov 23 13:11 classes_dict.pkl
-rw-rw-r-- 1 felipe felipe 710K nov 21 17:13 dataset_train.csv
-rw-rw-r-- 1 felipe felipe 645K nov 23 15:20 dataset_train_wclasses.csv
-rw-rw-r-- 1 felipe felipe 180K nov 21 17:13 dataset_valid.csv


In [6]:
! head -n 5 ../data/dataset_train.csv

|input
0|judging from previous posts this used to be a good place , but not any longer .
1|we , there were four of us , arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude .
2|they never brought us complimentary noodles , ignored repeated requests for sugar , and threw our dishes on the table .
3|the food was lousy - too sweet or too salty and the portions tiny .


In [7]:
df_train = pd.read_csv('../data/dataset_train.csv', sep='|', usecols=[1])
df_val = pd.read_csv('../data/dataset_valid.csv', sep='|', usecols=[1])

df_train

Unnamed: 0,input
0,"judging from previous posts this used to be a good place , but not any longer ."
1,"we , there were four of us , arrived at noon - the place was empty - and the staff acted like we were imposing on them an..."
2,"they never brought us complimentary noodles , ignored repeated requests for sugar , and threw our dishes on the table ."
3,the food was lousy - too sweet or too salty and the portions tiny .
4,"after all that , they complained to me about the small tip ."
...,...
8860,"Outstanding Bagels , but you get what you pay for ."
8861,The sides were ok and incredibly salty .
8862,"While the menu is n't especially groundbreaking , everything I 've tried so far has been well-executed and tasty ."
8863,It 's just O.K . pizza .


Since the dataset is relatively large, we randomly select N_SAMPLES rows (defined in the "Constants" section) and check the text for inconsistencies. The cell can be re-executed as many times as needed to inspect different samples.

In [7]:
indexes_val = np.random.randint(0, len(df_train), size=N_SAMPLES)

for sentence in df_train.iloc[indexes_val]['input'].values:
    print(f'--> {sentence}', end='\n\n')

--> Needless to say a PC that ca n't support a cell phone is less than useless !

--> The worse Hotel In Miami South Beach

--> To this day , there are NONE .

--> My MacBook is faster than any comparable PC .

--> The food was DELICIOUS · one you got to it .



In [8]:
indexes_val = np.random.randint(0, len(df_val), size=N_SAMPLES)

for sentence in df_val.iloc[indexes_val]['input'].values:
    print(f'--> {sentence}', end='\n\n')

--> This place is so much fun .

--> The crispy chicken was n't for us , though .

--> Try ordering from the regular menu , then you would not regret !

--> It 's a great place to pick up a cheap lunch or dinner .

--> The place is a lot of fun .



In addition, as a precaution, let's check for rows with missing values:

In [9]:
df_train.isnull().sum(), df_val.isnull().sum()

(input    0
 dtype: int64,
 input    0
 dtype: int64)

From the samples above, we can see that although the training set contains sentences outside the restaurant-review context, all the validation sentences we sampled appear to belong to it. Because of the tight deadline, we will first generate the targets from the training set as it is; if the results on the validation set turn out to be unsatisfactory, we will try some technique to filter the training set so that it contains more restaurant-review sentences. Note also that, since we did not see any error that could interfere with defining the targets, we will not apply any text preprocessing a priori.

Finally, since punctuation and extra whitespace can get in the way of training (the next step), we remove punctuation from the text and collapse repeated whitespace. The same step also lowercases the text.

In [18]:
df_train['input'] = df_train['input'].apply(lambda f: re.sub(r'\s\s+', ' ', re.sub(r'[^\w\s]', '', f).strip().lower()))
df_train.dropna(inplace=True)
df_train

Unnamed: 0,input
0,judging from previous posts this used to be a good place but not any longer
1,we there were four of us arrived at noon the place was empty and the staff acted like we were imposing on them and they w...
2,they never brought us complimentary noodles ignored repeated requests for sugar and threw our dishes on the table
3,the food was lousy too sweet or too salty and the portions tiny
4,after all that they complained to me about the small tip
...,...
8860,outstanding bagels but you get what you pay for
8861,the sides were ok and incredibly salty
8862,while the menu is nt especially groundbreaking everything i ve tried so far has been wellexecuted and tasty
8863,it s just ok pizza
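
The cleaning step above can be read as a small named function. The sketch below is a minimal, standalone version of the same two regular expressions; the name `clean_text` is hypothetical and does not appear elsewhere in the notebook. It also makes one side effect easy to see: the tokenized contractions in this dataset (`is n't`, `It 's`) lose their apostrophes and become `nt` and `s`.

```python
import re


def clean_text(text: str) -> str:
    """Remove punctuation, lowercase, and collapse runs of whitespace."""
    # Drop every character that is neither a word character nor whitespace
    no_punct = re.sub(r"[^\w\s]", "", text)
    # Trim the ends, lowercase, then turn 2+ whitespace characters into one space
    return re.sub(r"\s\s+", " ", no_punct.strip().lower())


print(clean_text("It 's just O.K . pizza ."))  # it s just ok pizza
print(clean_text("the menu is n't especially groundbreaking"))
```

If the contractions matter for the downstream model, they would need to be handled before the punctuation is stripped; that is outside the scope of this sketch.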


## Scripts

To create the targets, we will test two pre-trained Hugging Face models:

- MODEL 1: [cardiffnlp/twitter-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment) - Trained on ~58M tweets; returns one of 3 classes: 0 - Negative, 1 - Neutral and 2 - Positive.
- MODEL 2: [nlptown/bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) - Trained on product reviews in six languages (including English); returns the sentiment as a number of stars, 1 being the worst and 5 the best.
<!-- - [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions) - TODO -->

These models were selected mainly for their popularity and for how they are rated in an [article published by Hugging Face](https://huggingface.co/blog/sentiment-analysis-python).

To choose the better model, we run inference on the whole training set, then look at where the two models disagree and manually check which one did better in those cases. To do that, we need an equivalence between their outputs:

|Equivalent class|Acronym|Model 1|Model 2|
|---|---|---|---|
|Negative|NEG|0|1.00 to 2.33 stars|
|Neutral|NEU|1|2.34 to 3.66 stars|
|Positive|POS|2|3.67 to 5.00 stars|

Since both models come from Hugging Face, let's build a common wrapper class to simplify inference:

In [19]:
class HuggingFaceLLM:
    def __init__(self, model_ref, classes_map):
        self.__set_tokenizer(model_ref)
        self.__set_model(model_ref)
        self.__set_classes_map(classes_map)

    def __set_tokenizer(self, model_ref):
        try:
            tokenizer = AutoTokenizer.from_pretrained(f'../models/hf/{model_ref}')
        except Exception:  # No usable local copy: download and cache it
            tokenizer = AutoTokenizer.from_pretrained(model_ref)
            tokenizer.save_pretrained(f'../models/hf/{model_ref}')
        
        self.__tokenizer = tokenizer 

    
    def __set_model(self, model_ref):
        try:
            model = AutoModelForSequenceClassification.from_pretrained(f'../models/hf/{model_ref}')
        except Exception:  # No usable local copy: download and cache it
            model = AutoModelForSequenceClassification.from_pretrained(model_ref)
            model.save_pretrained(f'../models/hf/{model_ref}')

        self.__model = model


    def __set_classes_map(self, classes_map):  # Setter kept only for consistency with the others
        self.__classes_map = classes_map

    
    def predict_scores(self, input_text):
        encoded_input = self.__tokenizer(input_text, return_tensors='pt')
        output = self.__model(**encoded_input)
        scores = output[0][0].detach().numpy()
        scores = softmax(scores)

        return scores


    def predict_target(self, input_text):
        scores = self.predict_scores(input_text)
        target = np.argmax(scores)
        
        return target

    
    def predict_class(self, input_text):
        scores = self.predict_scores(input_text)
        pred_class = self.__classes_map(scores)

        return pred_class
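
The tokenizer and model setters in the class above share one pattern: try the local copy under `../models/hf/`, and on failure download the artifact and save it locally. The sketch below isolates that pattern with an explicit directory check instead of exception handling. Every name in it (`load_with_local_cache`, `load_local`, `download`, `save`) is hypothetical and introduced only for illustration; it is not part of the notebook or of the `transformers` API.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_local_cache(local_dir: str,
                          load_local: Callable[[str], T],
                          download: Callable[[], T],
                          save: Callable[[T, str], None]) -> T:
    """Return the object cached in local_dir, downloading and saving it on a miss."""
    if Path(local_dir).is_dir():
        # Cache hit: read the artifact from disk
        return load_local(local_dir)

    # Cache miss: fetch the artifact, then persist it for the next run
    obj = download()
    Path(local_dir).mkdir(parents=True, exist_ok=True)
    save(obj, local_dir)
    return obj
```

For the tokenizer, for example, `load_local` could be `AutoTokenizer.from_pretrained`, `download` could be `lambda: AutoTokenizer.from_pretrained(model_ref)`, and `save` could be `lambda tok, d: tok.save_pretrained(d)`. One trade-off of the directory check: a directory that exists but holds an incomplete download would still count as a hit, which the `try`/`except` version handles by falling back to the download.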

### Model 1 - cardiffnlp/twitter-roberta-base-sentiment

First we check one result from the model, then apply it to the whole dataset. The procedure follows the example provided by the model's author.

#### Loading the model

To define our model we need both its reference (author/model) and a function that maps the scores to the desired class.

In [20]:
def model1_map(score): 
    target = np.argmax(score)

    match target:
        case 0:
            return 'NEG'
        
        case 1: 
            return 'NEU'

        case 2:
            return 'POS'

        case _:
            raise ValueError('Invalid target')


model1_ref = "cardiffnlp/twitter-roberta-base-sentiment"

model1 = HuggingFaceLLM(model1_ref, model1_map)

#### Testing the model on example sentences

In [21]:
for t in TEXTS_EX:
    print(f'{t} -> {model1.predict_class(t)} ({model1.predict_scores(t)})')

This is terrific! -> POS ([0.0023162  0.01429832 0.9833855 ])
To be honest, it's overrated -> NEG ([0.78642404 0.19143705 0.02213896])
Nothing special -> NEU ([0.2822652  0.6492529  0.06848197])
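
For model 1, the path from raw model output to a class label is a softmax followed by an argmax. The sketch below reproduces those two steps in plain Python, without NumPy or SciPy, so it can be run on its own. The helper names (`softmax`, `to_label`, `LABELS`) and the logits in the second example are hypothetical; only the probability vector in the first example is taken from the output above.

```python
import math

LABELS = ["NEG", "NEU", "POS"]  # index order used by model 1: 0, 1, 2


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_label(probs):
    """Return the label with the highest probability (argmax)."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best_index]


# Probabilities printed above for "Nothing special"
print(to_label([0.2822652, 0.6492529, 0.06848197]))  # NEU

# Hypothetical logits, used only to illustrate the softmax step
print(to_label(softmax([-1.0, 0.5, 2.0])))  # POS
```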


#### Predicting the classes for the dataset

Example of how to predict using apply:

In [22]:
preds = df_train['input'].iloc[0:N_SAMPLES].progress_apply(lambda t: model1.predict_class(t))

for i, p in enumerate(preds):
    print(f'--> {df_train["input"].iloc[i]} ({p})', end='\n\n')

  0%|          | 0/5 [00:00<?, ?it/s]

--> judging from previous posts this used to be a good place but not any longer (NEG)

--> we there were four of us arrived at noon the place was empty and the staff acted like we were imposing on them and they were very rude (NEG)

--> they never brought us complimentary noodles ignored repeated requests for sugar and threw our dishes on the table (NEG)

--> the food was lousy too sweet or too salty and the portions tiny (NEG)

--> after all that they complained to me about the small tip (NEG)



Now let's apply it to the whole training set:

In [23]:
df_train['class_model1'] = df_train['input'].progress_apply(lambda t: model1.predict_class(t))
df_train.tail(10)

  0%|          | 0/8865 [00:00<?, ?it/s]

Unnamed: 0,input,class_model1
8855,oh but wait we were out of drinks which were also delightfully overpriced,NEG
8856,since it literally is a complete hole in the wall it s a bit intimidating at first but you get over that very quickly as ...,POS
8857,it looked like shredded cheese partly done still in strips,NEU
8858,what generous portions,NEU
8859,warm comfortable surroundings nice appointments witness the etched glass and brickwork separating the dining rooms,POS
8860,outstanding bagels but you get what you pay for,POS
8861,the sides were ok and incredibly salty,NEU
8862,while the menu is nt especially groundbreaking everything i ve tried so far has been wellexecuted and tasty,POS
8863,it s just ok pizza,NEU
8864,but the coconut rice was good,POS


### Model 2 - nlptown/bert-base-multilingual-uncased-sentiment

With the predictions from the first model in hand, we now run the second.

#### Loading the model

As before, we need to pass the model reference and a score $\rightarrow$ class mapping function. Unlike the first model, the second one outputs probabilities over ratings from 1 to 5 stars. Following our equivalence table, we create a suitable mapping function.

In [24]:
def model2_map(score): 
    target = np.dot(score, [1, 2, 3, 4, 5])  # Expected star rating: each rating weighted by its probability

    if 1 <= target < 2.33: 
        return 'NEG'

    elif 2.33 <= target < 3.66:
        return 'NEU'

    elif 3.66 <= target <= 5:
        return 'POS'

    raise Exception('Invalid target')


model2_ref = "nlptown/bert-base-multilingual-uncased-sentiment"

model2 = HuggingFaceLLM(model2_ref, model2_map)

#### Testing the model on example sentences

In [25]:
for t in TEXTS_EX:
    print(f'{t} -> {model2.predict_class(t)} ({model2.predict_scores(t)})')

This is terrific! -> POS ([0.00313067 0.00178067 0.008292   0.07177804 0.9150186 ])
To be honest, it's overrated -> NEG ([0.20254053 0.471416   0.2968501  0.02402706 0.00516636])
Nothing special -> NEG ([0.23077075 0.4229156  0.3145413  0.0264923  0.00527997])
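
The control sentence "Nothing special" was meant to be Neutral, yet model 2 labels it NEG. Working through the arithmetic shows why: most of the probability mass sits on 2 and 3 stars, which puts the expected rating at about 2.15, below the 2.33 cutoff. The sketch below recomputes this in plain Python. The helper names are hypothetical, and `stars_to_class` is a simplified version of the thresholds that omits the range check done in `model2_map`.

```python
STARS = [1, 2, 3, 4, 5]


def expected_stars(probs):
    """Probability-weighted average of the star ratings."""
    return sum(p * s for p, s in zip(probs, STARS))


def stars_to_class(stars):
    """Simplified thresholds from the equivalence table (no range check)."""
    if stars < 2.33:
        return "NEG"
    if stars < 3.66:
        return "NEU"
    return "POS"


# Probabilities printed above for "Nothing special"
probs = [0.23077075, 0.4229156, 0.3145413, 0.0264923, 0.00527997]
stars = expected_stars(probs)
print(round(stars, 2), stars_to_class(stars))  # 2.15 NEG
```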


#### Predicting the classes for the dataset

In [26]:
preds = df_train['input'].iloc[0:N_SAMPLES].progress_apply(lambda t: model2.predict_class(t))

for i, p in enumerate(preds):
    print(f'--> {df_train["input"].iloc[i]} ({p})', end='\n\n')

  0%|          | 0/5 [00:00<?, ?it/s]

--> judging from previous posts this used to be a good place but not any longer (NEU)

--> we there were four of us arrived at noon the place was empty and the staff acted like we were imposing on them and they were very rude (NEG)

--> they never brought us complimentary noodles ignored repeated requests for sugar and threw our dishes on the table (NEG)

--> the food was lousy too sweet or too salty and the portions tiny (NEG)

--> after all that they complained to me about the small tip (NEU)



In [27]:
df_train['class_model2'] = df_train['input'].progress_apply(lambda t: model2.predict_class(t))
df_train.tail(10)

  0%|          | 0/8865 [00:00<?, ?it/s]

Unnamed: 0,input,class_model1,class_model2
8855,oh but wait we were out of drinks which were also delightfully overpriced,NEG,NEG
8856,since it literally is a complete hole in the wall it s a bit intimidating at first but you get over that very quickly as ...,POS,NEU
8857,it looked like shredded cheese partly done still in strips,NEU,NEG
8858,what generous portions,NEU,NEU
8859,warm comfortable surroundings nice appointments witness the etched glass and brickwork separating the dining rooms,POS,POS
8860,outstanding bagels but you get what you pay for,POS,POS
8861,the sides were ok and incredibly salty,NEU,NEU
8862,while the menu is nt especially groundbreaking everything i ve tried so far has been wellexecuted and tasty,POS,NEU
8863,it s just ok pizza,NEU,NEU
8864,but the coconut rice was good,POS,NEU


### Validating the results and selecting the best estimator

For this task, we count how often the two models predicted the same class and how often they did not. For the disagreements, we pick some samples and analyze them manually; then we choose the best estimator and export the new training set (with targets) using that model.

In [28]:
df_train['class_model1'].value_counts()

class_model1
POS    4393
NEU    2436
NEG    2036
Name: count, dtype: int64

In [29]:
df_train['class_model2'].value_counts()

class_model2
POS    4121
NEU    2779
NEG    1965
Name: count, dtype: int64

In [30]:
(df_train['class_model1'] == df_train['class_model2']).value_counts(normalize=True)

True     0.693627
False    0.306373
Name: proportion, dtype: float64
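
The agreement rate alone does not show which class pairs the models disagree on. The sketch below is a dependency-free illustration that counts each (model 1, model 2) label pair. The function name `agreement_report` is hypothetical, and the toy labels are the ten rows displayed by `df_train.tail(10)` above, not the full training set.

```python
from collections import Counter


def agreement_report(labels_a, labels_b):
    """Return the agreement rate and a count of every (a, b) label pair."""
    pairs = Counter(zip(labels_a, labels_b))
    total = sum(pairs.values())
    agree = sum(n for (a, b), n in pairs.items() if a == b)
    return agree / total, pairs


# Last ten rows shown earlier (class_model1, class_model2)
m1 = ["NEG", "POS", "NEU", "NEU", "POS", "POS", "NEU", "POS", "NEU", "POS"]
m2 = ["NEG", "NEU", "NEG", "NEU", "POS", "POS", "NEU", "NEU", "NEU", "NEU"]

rate, pairs = agreement_report(m1, m2)
print(rate)  # 0.6
for (a, b), n in pairs.most_common():
    if a != b:
        print(f"model1={a} model2={b}: {n}")
```

On the real DataFrame, `pd.crosstab(df_train['class_model1'], df_train['class_model2'])` should give the same breakdown as a table.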

In [31]:
df_train_diff = df_train[df_train['class_model1'] != df_train['class_model2']]

df_train_diff.iloc[np.random.randint(0, len(df_train_diff), 10)]

Unnamed: 0,input,class_model1,class_model2
8332,the machine itself is alright,POS,NEU
2373,service was quick,NEU,POS
6417,of course i inspected the other netbooks and clearly their hinges are tighter and i even demonstrate the difference betwe...,NEU,POS
5563,we hade a bigger room with a place to relax and sit down,POS,NEU
5471,european people mostly english and german speaking only few russian and polish middle age,NEU,NEG
6571,we carry the netbook around here and there hence it s kinda of irritating when the lcd just slide downwards,NEG,NEU
7931,it would nt fit in most 17inch bags,NEG,NEU
64,once you step into cosette you re miraculously in a small off the beaten path parisian bistro,POS,NEU
8377,really no problems with the hand me down computers i received from my children,NEU,POS
8439,buyers beware,NEU,NEG


Across the cases analyzed, the two models appeared to be about equally accurate, so this superficial analysis does not let us draw a firm conclusion. Nevertheless, we will treat the first model as the better of the two and, if the results in the final stages are not satisfactory, we can run a second test with the other model.

Also, since the classification model's training step needs numeric values, we convert the classes to their respective numbers right away:

In [32]:
classes_dict = {
    'NEG': 0,
    'NEU': 1,
    'POS': 2
}

with open('../data/classes_dict.pkl', 'wb') as file:
    pkl.dump(classes_dict, file)

# For model 1
df_train_exp = df_train.drop(columns=['class_model2']).rename(columns={'class_model1': 'class'})
df_train_exp['class'] = df_train_exp['class'].map(classes_dict)

# For model 2 - uncomment to switch models
# df_train_exp = df_train.drop(columns=['class_model1']).rename(columns={'class_model2': 'class'})
# df_train_exp['class'] = df_train_exp['class'].map(classes_dict)

df_train_exp.dropna().to_csv('../data/dataset_train_wclass.csv', index=False)  # dropna() is only a safeguard

df_train_exp

Unnamed: 0,input,class
0,judging from previous posts this used to be a good place but not any longer,0
1,we there were four of us arrived at noon the place was empty and the staff acted like we were imposing on them and they w...,0
2,they never brought us complimentary noodles ignored repeated requests for sugar and threw our dishes on the table,0
3,the food was lousy too sweet or too salty and the portions tiny,0
4,after all that they complained to me about the small tip,0
...,...,...
8860,outstanding bagels but you get what you pay for,2
8861,the sides were ok and incredibly salty,1
8862,while the menu is nt especially groundbreaking everything i ve tried so far has been wellexecuted and tasty,2
8863,it s just ok pizza,1
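
Later steps that read the numeric `class` column will probably need to turn the numbers back into labels. The sketch below shows one way to do that: load the pickled dictionary and invert it. It writes to a temporary directory as a stand-in for `../data/classes_dict.pkl`, and the name `id_to_class` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

classes_dict = {"NEG": 0, "NEU": 1, "POS": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classes_dict.pkl"  # stand-in for ../data/classes_dict.pkl
    with open(path, "wb") as fh:
        pickle.dump(classes_dict, fh)
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)

# Invert label -> id into id -> label to decode numeric predictions
id_to_class = {idx: name for name, idx in loaded.items()}
print([id_to_class[i] for i in [0, 1, 2, 2]])  # ['NEG', 'NEU', 'POS', 'POS']
```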


**Note**: feature extraction, class balancing, etc. will be done during training, since whether they help or hurt depends on the algorithm used.

### Cleaning the validation set

Finally, let's clean the validation set to simplify later steps.

In [8]:
df_val['input'] = df_val['input'].apply(lambda f: re.sub(r'\s\s+', ' ', re.sub(r'[^\w\s]', '', f).strip().lower()))
df_val.dropna(inplace=True)
df_val.dropna().to_csv('../data/dataset_valid_cleaned.csv', index=False)  # Second dropna() is only a safeguard
df_val

Unnamed: 0,input
0,the pizza was really good
1,knowledge of the chef and the waitress are below average
2,the service was ok
3,i m happy to have nosh in the neighborhood and the food is very comforting
4,indoor was very cozy and cute
...,...
2212,i wish they had one near my office i would go everyday
2213,however i do not understand the extraordinary hype about this restaurant
2214,i do nt get it what s so special about prune
2215,one less manhattanite the better
