# Experiments - Paper Reproduction

The goal of this notebook is reproducing the results from the paper [Building a Sentiment Corpus of Tweets in Brazilian Portuguese](https://arxiv.org/abs/1712.08917).

According to the publication, the steps performed were:
 - Data Representation:
     - Bag-of-words with occurrence of terms
     - Presence of negation words (“not”, “never”,...) (Avanço et al., 2016)
     - Positive and negative emoticons (Avanço et al., 2016)
     - Positive and negative emojis (Avanço et al., 2016)
     - Presence of positive and negative words (Avanço et al., 2016)
     - PoS tags (NLPnet tagger (Fonseca et al., 2015))
 - Algorithms:
    - Linear SVM (C: 1)
    - Bernoulli Naive Bayes (alpha:0.1)
    - Logistic Regression
    - Multilayer Perceptron (2 layers, 200 neurons, learning-rate:00.1)
    - Decision Tree classifier
    - Random Forest approach with 200 estimators.

## Libraries and Settings

Load basic libraries and append source code directory (src) into the system path.

In [1]:
import os
import sys
sys.path.append(os.path.abspath(os.path.pardir))

Thirdy party libraries

In [2]:
# General
import funcy as fp
import numpy as np
import pandas as pd
from functools import partial

# Visualization / Presentation
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from IPython.core.display import HTML, display

# NLP Libraries
import re
import nlpnet
import stanza
import spacy
from spacy.tokenizer import _get_regex_pattern

Internal libraries

In [3]:
from src import settings
from src.pipeline.resources import load_corpus
from src.pipeline.general import clean_text
from src.pipeline.executors import simple_pipeline_executor

Presentation settings

In [4]:
%matplotlib inline 
pd.set_option('max_colwidth', 150)

## Load and Split Dataset

In [5]:
frame = load_corpus()
frame.head()

Unnamed: 0,id,hashtag,votes,hard,sentiment,group,text,repeat
0,863044774588272640,#encontro,"[1, 1, 1, 1, 1, 1, 1]",0,1,test,que coisa linda O programa estava mostrando uma familia que adotou um adolescente de NUMBER anos que amor !,False
1,865583716088766467,#encontro,"[1, 1, 1, 1, 1, 1, 1]",0,1,test,por mais com as irmãs galvão adorei elas,False
2,865063232201011201,#TheNoite,"[1, 0, 1, 1, 1, 0, 0]",2,1,test,mr CATRA USERNAME lançando sua nova música PPK CHORA no USERNAME k k k 👅 😉 #MrCatra #PpkChora,False
3,864668391008763905,#masterchefbr,"[0, 0, 0, 0, 0, 0, 0]",0,0,test,quem viu aquela lutadora modela barbuda tatuada #MasterChefBR,False
4,865572794016378882,#encontro,"[-1, -1, -1, -1, -1, -1, -1]",0,-1,test,tô passada com esse cara quanta merda pode sair da boca de alguém em alguns minutos 😠,False


Separate training and test records and delete the original frame.

In [6]:
training = frame.loc[(frame.group == 'train') & (frame.sentiment.isin(['-1', '0', '1']))].copy(deep=True)
test = frame.loc[(frame.group == 'test') & (frame.sentiment.isin(['-1', '0', '1']))].copy(deep=True)
del frame

Check class balance on training set

In [7]:
training.sentiment.value_counts()

1     5741
-1    3839
0     3410
Name: sentiment, dtype: int64

## Tokenize and Extract Features from Tweets

Instantiate NLP libraries to tokenize text and extract information (e.g., lemma, pos tag, and polarization).

There are three libraries used:
 - [nlpnet](http://nilc.icmc.usp.br/nlpnet/): Used to be able to reproduce exactly the same pos tags used by TweetSentBR.
 - [Spacy](https://spacy.io/): Lightweight and versatile library to perform tokenization, and get lemma and pos tags.
 - [Stanza](https://stanfordnlp.github.io/stanza/): 

In [8]:
nlpnet.set_data_dir(settings.NLPNET_POS_TAGGER_PATH)
nlpnet_nlp = nlpnet.POSTagger().tag

# Load Stanza to get access to tokenization, words expansion, pos tags, lemmas and polarization
stanza_nlp = stanza.Pipeline('pt', processors='tokenize, mwt, pos, lemma', use_gpu=False, tokenize_no_ssplit=True, verbose=0)

# Load Spacy to get access tokenization, lemma and pos tag
spacy_nlp = spacy.load('pt')

# Extend default token regex to avoid splitting hashtag, smile, emoji and emoticon.
re_token_match = _get_regex_pattern(spacy_nlp.Defaults.token_match)
re_token_match = f"({re_token_match}|#\w+|\w+-\w+)"
spacy_nlp.tokenizer.token_match = re.compile(re_token_match).match

Iterate over tweets to trim repeating spaces and words with more than 2 contiguous characters repeated. 

In [9]:
partial_clean_text = partial(clean_text, unify_html_tags=False, unify_urls=False, trim_repeating_spaces=True, unify_hashtags=False,
                             unify_mentions=False, unify_numbers=False, trim_repeating_letters=True)

training.loc[:, 'clean_text'] = training['text'].apply(partial_clean_text)
test.loc[:, 'clean_text'] = test['text'].apply(partial_clean_text)

In [10]:
disposable_frame = training.query('text != clean_text')
print(f'Records changed: {len(disposable_frame):,} from {len(training):,} ({len(disposable_frame) / len(training) * 100:.2f}%)')
display(disposable_frame.head(10))
del disposable_frame

Records changed: 1,510 from 12,990 (11.62%)


Unnamed: 0,id,hashtag,votes,hard,sentiment,group,text,repeat,clean_text
301,864531050508288000,#videoShowAoVivo,[1],0,1,train,sofia linda esse seu batom tá show e joaquim como sempre lindo bjusss da paraiba,False,sofia linda esse seu batom tá show e joaquim como sempre lindo bjuss da paraiba
318,864301806544990208,#maisvoce,[-1],0,-1,train,USERNAME coitada da taís araújo ! 🍲 😂 😂 😂 🍲 kkkk #MaisVocê,False,USERNAME coitada da taís araújo ! 🍲 😂 😂 😂 🍲 kk #MaisVocê
323,865426213510053889,#ConversaComBial,[1],0,1,train,ahhh vai começar uhuuu 😍 😍 👏 🏼 👏 🏼 👏 🏼 👏 🏼 👏 🏼 💜 💜 USERNAME #MaiaraeMaraisaNoBial,False,ahh vai começar uhuu 😍 😍 👏 🏼 👏 🏼 👏 🏼 👏 🏼 👏 🏼 💜 💜 USERNAME #MaiaraeMaraisaNoBial
324,865059540303310848,#TheNoite,[1],0,1,train,o mestre mandou é o melhor quadro velho kkk,False,o mestre mandou é o melhor quadro velho kk
335,864335520167362560,#TheNoite,[1],0,1,train,história bêbada do USERNAME hj kkk,False,história bêbada do USERNAME hj kk
338,864889627232141314,#videoShowAoVivo,[1],0,1,train,outra coisa jocaota tem brilho tem graça amooo,False,outra coisa jocaota tem brilho tem graça amoo
348,864334590353195008,#TheNoite,[1],0,1,train,MINHA SERIEEE,False,MINHA SERIEE
372,864467549215436805,#maisvoce,[1],0,1,train,pra me ganhar de vez VIAVIANE FAZ PLANILHAS É NERD ESTUDAAA ATRIZ DE VERDADE SE ORGANIZA pasmanterepivanomaisvoce,False,pra me ganhar de vez VIAVIANE FAZ PLANILHAS É NERD ESTUDAA ATRIZ DE VERDADE SE ORGANIZA pasmanterepivanomaisvoce
393,862156481885601793,#masterchefbr,[-1],0,-1,train,ahhh uma semana pra ter de novo :(,False,ahh uma semana pra ter de novo :(
401,864685396357242880,#masterchefbr,[-1],0,-1,train,A paola kkk,False,A paola kk


Tokenize and extract features from sentences and tokens on both, training and test sets.

In [11]:
from src.pipeline.processors import *
from src.pipeline.computers import *
from src.pipeline.extractors import *

main_word_processors = [#process_word_polarity, process_word_pos_tag,
                         process_negative_words, process_sentilex_word_polarity, process_emoticon_polarity, process_emoji_polarity]
main_sentence_processors = [#compute_polarity_features, compute_pos_tag_features,
                             compute_negative_words_features, compute_sentilex_polarity_features, compute_emoticon_polarity_features, compute_emoji_polarity_features]

training[['tokens', 'features']] = simple_pipeline_executor(training.clean_text.tolist(), extract_tokens_and_features, spacy_nlp, main_word_processors, main_sentence_processors)
test[['tokens', 'features']] = simple_pipeline_executor(test.clean_text.tolist(), extract_tokens_and_features, spacy_nlp, main_word_processors, main_sentence_processors)

  polarities_percentage = polarities_count / total
  polarities_percentage = polarities_count / total
  return array(a, dtype, copy=False, order=order)


Use nlpnet to extract pos tags from tokens and create additional features from it.

In [12]:
extra_word_processors = [#process_word_polarity, 
                        process_word_pos_tag]
extra_sentence_processors = [#compute_polarity_features, 
                            compute_pos_tag_features]

training['pos_tag_features'] = simple_pipeline_executor(training.clean_text.tolist(), extract_features, nlpnet_nlp, extra_word_processors, extra_sentence_processors)
test['pos_tag_features'] = simple_pipeline_executor(test.clean_text.tolist(), extract_features, nlpnet_nlp, extra_word_processors, extra_sentence_processors)

Assert there are no *NA* values on features.

In [13]:
for column in ['features', 'aux_features', 'pos_tag_features']:
    if column in training.columns:
        X = training[column].to_numpy()
        X = np.stack(X, axis=0)
        assert all([len(item) == 0 for item in np.where(np.isnan(X))]), f'There are na values in {column} column.'

In [14]:
training.head()

Unnamed: 0,id,hashtag,votes,hard,sentiment,group,text,repeat,clean_text,tokens,features,pos_tag_features
283,863587647016636417,#altasHoras,[-1],0,-1,train,apareceu o índice de morte na minha cidade tô muito assustado #BelemPedePaz,False,apareceu o índice de morte na minha cidade tô muito assustado #BelemPedePaz,"[aparecer, o, índice, de, morte, o, meu, cidade, tô, muito, assustar, #BelemPedePaz]","[0, 2, 10, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
284,863591661594697728,#altasHoras,[0],0,0,train,O tchan já pode substituir a morena pela bella gil #AltasHoras,False,O tchan já pode substituir a morena pela bella gil #AltasHoras,"[O, tchan, já, poder, substituir, o, moreno, pelar, bella, gil, #AltasHoras]","[0, 0, 11, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
285,863385491344941060,#édecasa,[1],0,1,train,rafael ainda nem nasceu e já escuta USERNAME #anjo #40semanas #EDeCasa,False,rafael ainda nem nasceu e já escuta USERNAME #anjo #40semanas #EDeCasa,"[rafael, ainda, nem, nascer, e, já, escutar, USERNAME, #anjo, #40semanas, #EDeCasa]","[1, 0, 11, 0, 0, 0, 0, 0, 0]","[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
286,865073884411953152,#ConversaComBial,[1],0,1,train,até que enfim um excelente programa de entrevistas na TV aberta 👍,False,até que enfim um excelente programa de entrevistas na TV aberta 👍,"[até, que, enfim, um, excelente, programar, de, entrevisto, o, TV, aberto, 👍]","[0, 0, 10, 2, 0, 0, 0, 0, 1]","[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
287,862176233949409281,#masterchefbr,[1],0,1,train,master chef me da fome de madrugadato virando coruja sonambulo,False,master chef me da fome de madrugadato virando coruja sonambulo,"[master, chef, me, da, fome, de, madrugadato, virar, corujar, sonambulo]","[0, 1, 9, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


## Format Features for Models

In [15]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
from sklearn.pipeline import Pipeline
from src.pipeline.resources import load_stopwords

# Shift labels (-1, 0, 1) to the right (0, 1, 2) to comply with sklearn requirements.
y_training = training.sentiment.apply(lambda x: int(x) + 1).to_numpy()
y_test = test.sentiment.apply(lambda x: int(x) + 1).to_numpy()


# Transform tokens into Bag of Words and then compute TF-IDF
text_clf = Pipeline([
    #('vect', CountVectorizer(lowercase=True, stop_words=stopwords.words('portuguese'))),
    ('vect', HashingVectorizer(analyzer='word', ngram_range=(1, 1), n_features=5000, lowercase=False, stop_words=None)),
    ('tfidf', TfidfTransformer()),
    ]
)
X_training = text_clf.fit_transform(training.tokens.apply(lambda x: ' '.join(x))).toarray()
X_test = text_clf.transform(test.tokens.apply(lambda x: ' '.join(x))).toarray()

# Combine token based features with manually created features (e.g., pos tag, negations, and word polarity)
feature_columns = ['features', 'pos_tag_features']
features = [X_training] + [np.stack(training.features.to_list(), axis=0) for column in feature_columns]
X_training = np.concatenate(features, axis=1)

features = [X_test] + [np.stack(test.features.to_list(), axis=0) for column in feature_columns]
X_test = np.concatenate(features, axis=1)

del features

In [16]:
print(X_training.shape)
print(X_test.shape)

(12990, 5018)
(2010, 5018)


## Model Training

In [17]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import KFold

from src.utils import y_hat_to_sparse, y_to_sparse

In [18]:
models_parameters = {
    'RF': {'n_estimators':200, 'criterion':'entropy', 'n_jobs':-1},
    'LR': {'n_jobs': -1},
    'LinearSVM': {'C':1.0, 'dual':False},
    'PolinomialSVM': {'C':10.0, 'kernel': 'poly'},
    'BernoulliNB': {'alpha':0.1},
    'MultinomialNB': {'alpha':1.6},
    'MLP': {'activation':'tanh', 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'alpha': 0.001, 'early_stopping': True, 'hidden_layer_sizes':(200, 200)},
    'DT': {'criterion': 'gini', 'max_depth': None},
}

models_to_train = {
    'RF': RandomForestClassifier,
    'LR': LogisticRegression,
    'LinearSVM': LinearSVC,
    ##'PolinomialSVM': SVC,
    'BernoulliNB': BernoulliNB,
    ##'MultinomialNB': MultinomialNB,
    'MLP': MLPClassifier,
    'DT': DecisionTreeClassifier,
}

splits = 20
trained_models = []
validation_scores = []

fold = KFold(n_splits=20, shuffle=True)
for fold_idx, (training_idx, validation_idx) in tqdm(enumerate(fold.split(range(X_training.shape[0]))), total=splits):
    for model_name, model_class in models_to_train.items():
        model = model_class(**models_parameters.get(model_name, {}))
        model.fit(X_training[training_idx], y_training[training_idx])

        trained_models.append((model_name, fold_idx, model))

        preds = model.predict(X_training[validation_idx])
        pred_labels = np.rint(preds)

        sparse_y = y_to_sparse(y_training[validation_idx])
        sparse_pred = y_hat_to_sparse(pred_labels)

        eval_metric = metrics.f1_score(sparse_y, sparse_pred, average=None)
        validation_scores.append((model_name, fold_idx, eval_metric))

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=20.0), HTML(value='')))




In [19]:
evaluation_frame = pd.DataFrame(validation_scores, columns=['Algorithm', 'Iteration', 'RawMetrics'])
evaluation_f1_matrix = np.stack(evaluation_frame['RawMetrics'].to_list(), axis=0)
evaluation_frame = pd.concat([evaluation_frame, pd.DataFrame(evaluation_f1_matrix, columns=['F1-Neg', 'F1-Neu', 'F1-Pos'])], axis=1)
evaluation_frame['F1-Measure'] = np.mean(evaluation_f1_matrix, axis=1)

display(HTML('<h3>Individual Predictions</h3>'))
display(evaluation_frame.head(10))

display(HTML('<h3>Summarized Predictions</h3>'))
evaluation_summary_frame = (evaluation_frame
                            [['Algorithm', 'F1-Neg', 'F1-Neu', 'F1-Pos', 'F1-Measure']]
                            .groupby('Algorithm')
                            .agg([np.mean, np.std])
                           )
display(evaluation_summary_frame)

display(HTML('<h3>Best Model</h3>'))
best_model_index = evaluation_summary_frame[('F1-Measure', 'mean')].argmax()
best_model_name = evaluation_summary_frame[('F1-Measure', 'mean')].index[best_model_index]
evaluation_summary_frame.iloc[best_model_index].to_frame().T

Unnamed: 0,Algorithm,Iteration,RawMetrics,F1-Neg,F1-Neu,F1-Pos,F1-Measure
0,RF,0,"[0.6091370558375634, 0.41353383458646614, 0.728125]",0.609137,0.413534,0.728125,0.583599
1,LR,0,"[0.6142131979695431, 0.42145593869731807, 0.7565891472868216]",0.614213,0.421456,0.756589,0.597419
2,LinearSVM,0,"[0.641025641025641, 0.4350877192982456, 0.7519999999999999]",0.641026,0.435088,0.752,0.609371
3,BernoulliNB,0,"[0.5751295336787564, 0.4640522875816993, 0.7269736842105263]",0.57513,0.464052,0.726974,0.588719
4,MLP,0,"[0.6032608695652174, 0.4460431654676259, 0.7737003058103976]",0.603261,0.446043,0.7737,0.607668
5,DT,0,"[0.5416666666666666, 0.4142394822006472, 0.685337726523888]",0.541667,0.414239,0.685338,0.547081
6,RF,1,"[0.6221079691516709, 0.4511784511784512, 0.7231270358306188]",0.622108,0.451178,0.723127,0.598804
7,LR,1,"[0.5783783783783784, 0.41379310344827586, 0.7000000000000001]",0.578378,0.413793,0.7,0.564057
8,LinearSVM,1,"[0.6158038147138964, 0.47826086956521735, 0.7201309328968903]",0.615804,0.478261,0.720131,0.604732
9,BernoulliNB,1,"[0.5854922279792746, 0.49122807017543857, 0.6678321678321678]",0.585492,0.491228,0.667832,0.581517


Unnamed: 0_level_0,F1-Neg,F1-Neg,F1-Neu,F1-Neu,F1-Pos,F1-Pos,F1-Measure,F1-Measure
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
Algorithm,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
BernoulliNB,0.589472,0.031033,0.442861,0.026232,0.693504,0.020203,0.575279,0.016853
DT,0.530762,0.022269,0.412198,0.040145,0.65233,0.024943,0.531764,0.018156
LR,0.614124,0.024179,0.411251,0.037408,0.714635,0.021764,0.580003,0.020962
LinearSVM,0.62403,0.027419,0.457559,0.028927,0.719853,0.019693,0.600481,0.0162
MLP,0.630612,0.031406,0.468619,0.035683,0.731347,0.020781,0.610193,0.014778
RF,0.624433,0.023487,0.432048,0.028045,0.727564,0.016715,0.594682,0.013609


Unnamed: 0_level_0,F1-Neg,F1-Neg,F1-Neu,F1-Neu,F1-Pos,F1-Pos,F1-Measure,F1-Measure
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
MLP,0.630612,0.031406,0.468619,0.035683,0.731347,0.020781,0.610193,0.014778


In [20]:
best_model = models_to_train[best_model_name](**models_parameters.get(best_model_name, {}))
best_model.fit(X_training, y_training)

preds = best_model.predict(X_test)
pred_labels = np.rint(preds)

sparse_y = y_to_sparse(y_test)
sparse_pred = y_hat_to_sparse(pred_labels)

eval_metric = metrics.f1_score(sparse_y, sparse_pred, average=None)

In [21]:
display(HTML('<h3>Test</h3>'))
display(pd.DataFrame([eval_metric], columns=['F1-Neg', 'F1-Neu', 'F1-Pos'])
        .assign(F1=lambda f: f.apply(np.mean, axis=1))
        .mean()
        .to_frame()
        .T
        [['F1-Pos', 'F1-Neu', 'F1-Neg', 'F1']]
       )

Unnamed: 0,F1-Pos,F1-Neu,F1-Neg,F1
0,0.764487,0.459016,0.69572,0.639741
