### Bibliotecas necessárias

- A biblioteca **simpletransformers** é baseada na famosa biblioteca do Hugging Face, conhecida como **transformers**. Como no nome dela já diz, trata-se de uma ferramenta mais simples de usar, onde seu uso é apropriado em tarefas bem definidas, como é o caso deste trabalho. O objetivo é classificar *tweets* relacionados à política em positivo, negativo ou neutro.

In [1]:
!pip install "simpletransformers" -qq

In [2]:
import re
import tqdm
import torch
import string
import pickle
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tokenize import TweetTokenizer
from sklearn import metrics
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from pylab import rcParams
from collections import defaultdict
import logging
import warnings
warnings.filterwarnings('ignore')

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Leitura dos dados

In [4]:
df_train = pd.read_csv('/content/drive/My Drive/NLP/Competition1/Dados/train.csv', usecols=['Id', 'Created At', 'Text', 'Classificacao'])
df_test = pd.read_csv('/content/drive/My Drive/NLP/Competition1/Dados/test.csv', usecols=['Id', 'Created At', 'Text'])

In [5]:
df_train.set_index('Id', inplace=True)
df_train.head()

Unnamed: 0_level_0,Created At,Text,Classificacao
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
6272,Mon Jan 09 15:27:43 +0000 2017,Dois são detidos ao tentar jogar celulares e d...,Positivo
1644,Sun Jan 08 02:14:34 +0000 2017,me matan esas minas q cambian 554 veces su fot...,Neutro
7956,Sat Feb 11 09:49:11 +0000 2017,Líderes de motim em presídio de Minas Gerais s...,Positivo
85,Thu Jan 05 14:43:03 +0000 2017,#Mídia: Press Release from Business Wire : Di...,Neutro
6006,Wed Feb 08 22:52:10 +0000 2017,Vacinação contra febre amarela é intensificada...,Positivo


In [6]:
df_test.set_index('Id', inplace=True)
df_test.head()

Unnamed: 0_level_0,Created At,Text
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
3568,Thu Jan 05 12:00:34 +0000 2017,RT @JDanieldf: Pedindo para que MG reaja? Reag...
1323,Fri Jan 06 11:54:50 +0000 2017,Homem que matou ex-mulher e jogou corpo em cis...
7976,Sat Feb 11 15:51:14 +0000 2017,"New post: ""Três adolescentes são apreendidos p..."
2408,Wed Jan 04 18:08:43 +0000 2017,RT @AnaPaulaVolei: Mais 2 helicópteros!!A cara...
4435,Wed Jan 04 18:12:12 +0000 2017,"RT @UOLNoticias: Custaram R$ 21,8 milhões: Mes..."


### Análise exploratória de dados

In [7]:
from sklearn.model_selection import train_test_split

dados_train, dados_val = train_test_split(df_train, test_size=0.25, random_state=0)
print(f"Train shape: {dados_train.shape}")
print(f"Val shape: {dados_val.shape}")
print(f"Test shape: {df_test.shape}")
print()
print(f"Número total de tweets: {dados_train.shape[0] + dados_val.shape[0] + df_test.shape[0]}")

Train shape: (4919, 3)
Val shape: (1640, 3)
Test shape: (1640, 2)

Número total de tweets: 8199


In [8]:
pd.set_option('display.max_columns', None)
dados_train.head()

Unnamed: 0_level_0,Created At,Text,Classificacao
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
675,Fri Jan 06 09:47:32 +0000 2017,"Com o pai preso, filhos de Cunha curtem 'a vid...",Neutro
3828,Wed Jan 04 20:47:53 +0000 2017,RT @MulherTamarindo: pessoas de Minas\nlarguem...,Negativo
3642,Sat Jan 07 09:45:20 +0000 2017,RT @joseluisfreita2: Vereador chega algemado à...,Negativo
5742,Thu Jan 26 14:21:23 +0000 2017,Rio faz bloqueio contra febre amarela em munic...,Positivo
3615,Mon Jan 09 16:15:18 +0000 2017,RT @jornalhoje: Minas Gerais é o quarto estado...,Positivo


In [9]:
pd.set_option('display.max_columns', None)
dados_val.head()

Unnamed: 0_level_0,Created At,Text,Classificacao
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3219,Fri Jan 06 22:35:16 +0000 2017,"RT @EstadaoPolitica: Em calamidade financeira,...",Negativo
6788,Sun Jan 08 14:38:51 +0000 2017,"RT @xenofonte: Com três anos, presídio privado...",Positivo
5651,Sun Jan 22 13:28:49 +0000 2017,Mutirão de vacinação contra a febre amarela se...,Positivo
1280,Sun Jan 08 14:07:05 +0000 2017,"Governo pagou R$ 2,4 bilhões a empreiteiras al...",Neutro
3805,Sat Jan 07 03:36:30 +0000 2017,RT @mmarques57: Ipatinga-MG - Governo da PTist...,Negativo


In [10]:
dados_train['Created At'] = pd.to_datetime(dados_train['Created At'])
dados_val['Created At'] = pd.to_datetime(dados_val['Created At'])
df_test['Created At'] = pd.to_datetime(df_test['Created At'])

In [11]:
# Descobrindo período que os tweets foram postados
print("Período de postagem dos tweets")

datas = [min(dados_train['Created At']), max(dados_train['Created At']), 
        min(dados_val['Created At']), max(dados_val['Created At']),
        min(df_test['Created At']), max(df_test['Created At'])]
print(f"Início: {min(datas)}")
print(f"Final: {max(datas)}")

Período de postagem dos tweets
Início: 2016-12-31 18:51:58+00:00
Final: 2017-02-13 11:03:33+00:00


In [12]:
# Estatísticas dos dados de treinamento
dados_train.groupby(['Classificacao']).count()

Unnamed: 0_level_0,Created At,Text
Classificacao,Unnamed: 1_level_1,Unnamed: 2_level_1
Negativo,1473,1473
Neutro,1449,1449
Positivo,1997,1997


In [13]:
# Estatísticas dos dados de validação
dados_val.groupby(['Classificacao']).count()

Unnamed: 0_level_0,Created At,Text
Classificacao,Unnamed: 1_level_1,Unnamed: 2_level_1
Negativo,497,497
Neutro,501,501
Positivo,642,642


In [14]:
train = dados_train.drop(['Created At'], axis=1)
train.head()

Unnamed: 0_level_0,Text,Classificacao
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
675,"Com o pai preso, filhos de Cunha curtem 'a vid...",Neutro
3828,RT @MulherTamarindo: pessoas de Minas\nlarguem...,Negativo
3642,RT @joseluisfreita2: Vereador chega algemado à...,Negativo
5742,Rio faz bloqueio contra febre amarela em munic...,Positivo
3615,RT @jornalhoje: Minas Gerais é o quarto estado...,Positivo


In [15]:
val = dados_val.drop(['Created At'], axis=1)
val.head()

Unnamed: 0_level_0,Text,Classificacao
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
3219,"RT @EstadaoPolitica: Em calamidade financeira,...",Negativo
6788,"RT @xenofonte: Com três anos, presídio privado...",Positivo
5651,Mutirão de vacinação contra a febre amarela se...,Positivo
1280,"Governo pagou R$ 2,4 bilhões a empreiteiras al...",Neutro
3805,RT @mmarques57: Ipatinga-MG - Governo da PTist...,Negativo


### Funções de pré-processamento

In [16]:
def clean_tweets(tweet):
    tweet = re.sub('@(\\w{1,15})\b', '', tweet)
    tweet = tweet.replace("via ", "")
    tweet = tweet.replace("RT ", "")
    tweet = tweet.lower()
    return tweet
    
def clean_url(tweet):
    tweet = re.sub('http\\S+', '', tweet, flags=re.MULTILINE)   
    return tweet
    
def remove_stop_words(tweet):
    stops = set(stopwords.words("portuguese"))
    stops.update(['.',',','"',"'",'?',':',';','(',')','[',']','{','}'])
    toks = [tok for tok in tweet if not tok in stops and len(tok) >= 3]
    return toks
    
def stemming_tweets(tweet):
    stemmer = SnowballStemmer('portuguese')
    stemmed_words = [stemmer.stem(word) for word in tweet]
    return stemmed_words

def remove_number(tweet):
    newTweet = re.sub('\\d+', '', tweet)
    return newTweet

def remove_hashtags(tweet):
    result = ''

    for word in tweet.split():
        if word.startswith('#') or word.startswith('@'):
            result += word[1:]
            result += ' '
        else:
            result += word
            result += ' '

    return result

In [17]:
def preprocessing(tweet, swords, url, stemming, ctweets, number, hashtag):

    if ctweets:
        tweet = clean_tweets(tweet)

    if url:
        tweet = clean_url(tweet)

    if hashtag:
        tweet = remove_hashtags(tweet)
    
    twtk = TweetTokenizer(strip_handles=True, reduce_len=True)

    if number:
        tweet = remove_number(tweet)
    
    tokens = [w.lower() for w in twtk.tokenize(tweet) if w != "" and w is not None]

    if swords:
        tokens = remove_stop_words(tokens)

    if stemming:
        tokens = stemming_tweets(tokens)

    text = " ".join(tokens)

    return text

### Pré-processamento

In [21]:
train['NewText'] = train['Text'].apply(lambda x: preprocessing(x, swords = False, url = True, stemming = False, 
                                                                     ctweets = True, number = True, hashtag = True))

In [22]:
val['NewText'] = val['Text'].apply(lambda x: preprocessing(x, swords = False, url = True, stemming = False, 
                                                                     ctweets = True, number = True, hashtag = True))

In [23]:
df_test['NewText'] = df_test['Text'].apply(lambda x: preprocessing(x, swords = False, url = True, stemming = False, 
                                                                     ctweets = True, number = True, hashtag = True))

### BERT com pré-processamento

In [18]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [19]:
def cria_labels(sentimento):
    if sentimento == 'Positivo':
        return 1
    elif sentimento == 'Negativo':
        return 0
    else:
        return 2

In [24]:
# Filtrando os dados que serão treinados
train_df = train[['NewText', 'Classificacao']]
train_df['Classificacao'] = train_df['Classificacao'].apply(cria_labels)
train_df.head()

Unnamed: 0_level_0,NewText,Classificacao
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
675,"com o pai preso , filhos de cunha curtem ' a v...",2
3828,mulhertamarindo : pessoas de minas larguem o p...,0
3642,joseluisfreita : vereador chega algemado à câm...,0
5742,rio faz bloqueio contra febre amarela em munic...,1
3615,jornalhoje : minas gerais é o quarto estado a ...,1


In [25]:
# Filtrando os dados de validação
val_df = val[['NewText', 'Classificacao']]
val_df['Classificacao'] = val_df['Classificacao'].apply(cria_labels)
val_df.head()

Unnamed: 0_level_0,NewText,Classificacao
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
3219,"estadaopolitica : em calamidade financeira , g...",0
6788,"xenofonte : com três anos , presídio privado e...",1
5651,mutirão de vacinação contra a febre amarela se...,1
1280,"governo pagou r $ , bilhões a empreiteiras alv...",2
3805,mmarques : ipatinga-mg - governo da ptista cec...,0


In [26]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [30]:
bert_args = {
    "num_train_epochs": 1,
    "train_batch_size": 4,
    "eval_batch_size": 4,
    "learning_rate": 2e-5, 
    'max_seq_length': 350,
    'evaluate_during_training': True,
    'wandb_project': 'competition-sentiment',
    'output_dir': '/content/drive/My Drive/NLP/Competition1/BertComPreProc',
    'overwrite_output_dir': True
}

In [31]:
model = ClassificationModel(
    "bert", "neuralmind/bert-base-portuguese-cased", args=bert_args, num_labels=3
)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

In [None]:
import gc

gc.collect()                # Coleta lixo.
torch.cuda.empty_cache()    # Limpa o chache.

model.train_model(train_df, eval_df=val_df)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/4919 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_bert_350_3_2


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Initializing WandB run for training.


In [None]:
y_pred2, raw_outputs = model.predict(list(df_test['NewText']))

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1640 [00:00<?, ?it/s]

  0%|          | 0/52 [00:00<?, ?it/s]

In [None]:
def replace(sentimento):
    if sentimento == 1:
        return "Positivo"
    elif sentimento == 0:
        return "Negativo"
    else:
        return "Neutro"

In [None]:
predictions2 = pd.DataFrame(None, columns=['Id', 'Category'], index=None)

predictions2['Id'] = df_test['Id']
predictions2['Category'] = y_pred2

predictions2['Category'] = predictions2['Category'].apply(replace)

In [None]:
predictions2

Unnamed: 0,Id,Category
0,3568,Negativo
1,1323,Neutro
2,7976,Positivo
3,2408,Negativo
4,4435,Negativo
...,...,...
1635,3536,Neutro
1636,6881,Positivo
1637,627,Neutro
1638,2165,Neutro


In [None]:
predictions2.to_csv('/content/drive/My Drive/NLP/Competition1/preds_bert_com_preproc.csv', sep=',', index=None)

### BERT sem pré-processamento

In [None]:
train_spp = pd.read_csv('/content/drive/My Drive/NLP/Competition1/train.csv')
test_spp = pd.read_csv('/content/drive/My Drive/NLP/Competition1/test.csv')

train_spp = train_spp[['Text', 'Classificacao']]
test_spp  = test_spp[['Id', 'Text']]

In [None]:
from sklearn.model_selection import train_test_split

train_spp['Classificacao'] = train_spp['Classificacao'].apply(modifica)

train, val = train_test_split(train_spp, test_size=0.25, random_state=0)
print(f"Train shape: {train.shape}")
print(f"Train shape: {val.shape}")
print(f"Train shape: {test_spp.shape}")

Train shape: (4919, 2)
Train shape: (1640, 2)
Train shape: (1640, 2)


In [None]:
model_args_bert = {
    "num_train_epochs": 4,
    "train_batch_size": 32,
    "eval_batch_size": 32,
    "learning_rate": 2e-5, 
    'max_seq_length': 350,
    'evaluate_during_training': True,
    'wandb_project': 'nlp-competition1',
    'output_dir': '/content/drive/My Drive/NLP/Competition1/ModelSemPreProc',
    'overwrite_output_dir': True
}

In [None]:
import gc
import torch
gc.collect()                # Coleta lixo.
torch.cuda.empty_cache()    # Limpa o chache.

bert_model_sp = ClassificationModel(
    "bert", "neuralmind/bert-base-portuguese-cased", args=model_args_bert, num_labels=3
)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

In [None]:
gc.collect()                # Coleta lixo.
torch.cuda.empty_cache()    # Limpa o chache.

bert_model_sp.train_model(train, eval_df=val)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/4919 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_bert_350_3_2


Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Initializing WandB run for training.
[34m[1mwandb[0m: Currently logged in as: [33malisonpr[0m (use `wandb login --relogin` to force relogin)


Running Epoch 0 of 4:   0%|          | 0/154 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1640 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_bert_350_3_2


Running Epoch 1 of 4:   0%|          | 0/154 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1640 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_bert_350_3_2


Running Epoch 2 of 4:   0%|          | 0/154 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1640 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_bert_350_3_2


Running Epoch 3 of 4:   0%|          | 0/154 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1640 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_bert_350_3_2
INFO:simpletransformers.classification.classification_model: Training of bert model complete. Saved to /content/drive/My Drive/NLP/Competition1/ModelSemPreProc.


(616,
 {'eval_loss': [0.16879291767970875,
   0.13865949685434595,
   0.1489485835459513,
   0.1532282373921659],
  'global_step': [154, 308, 462, 616],
  'mcc': [0.9254533118768793,
   0.9447649197302324,
   0.9485324881410823,
   0.9512186769345341],
  'train_loss': [0.08019743859767914,
   0.015742581337690353,
   0.18792709708213806,
   0.006747514009475708]})

In [None]:
y_pred4, raw_outputs = bert_model_sp.predict(list(test_spp['Text']))

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1640 [00:00<?, ?it/s]

  0%|          | 0/52 [00:00<?, ?it/s]

In [None]:
predictions4 = pd.DataFrame(None, columns=['Id', 'Category'], index=None)

predictions4['Id'] = test_spp['Id']
predictions4['Category'] = y_pred4

predictions4['Category'] = predictions4['Category'].apply(replace)

In [None]:
predictions4

Unnamed: 0,Id,Category
0,3568,Negativo
1,1323,Neutro
2,7976,Positivo
3,2408,Negativo
4,4435,Negativo
...,...,...
1635,3536,Neutro
1636,6881,Positivo
1637,627,Neutro
1638,2165,Neutro


In [None]:
predictions4.to_csv('/content/drive/My Drive/NLP/Competition1/preds_bert_sem_preproc.csv', sep=',', index=None)