# Experiments - Paper Reproduction

The goal of this notebook is reproducing the results from the paper [Building a Sentiment Corpus of Tweets in Brazilian Portuguese](https://arxiv.org/abs/1712.08917).

According to the publication, the steps performed were:
 - Data Representation:
     - Bag-of-words with occurrence of terms
     - Presence of negation words (“not”, “never”,...) (Avanço et al., 2016)
     - Positive and negative emoticons (Avanço et al., 2016)
     - Positive and negative emojis (Avanço et al., 2016)
     - Presence of positive and negative words (Avanço et al., 2016)
     - PoS tags (NLPnet tagger (Fonseca et al., 2015))
 - Algorithms:
    - Linear SVM (C: 1)
    - Bernoulli Naive Bayes (alpha:0.1)
    - Logistic Regression
    - Multilayer Perceptron (2 layers, 200 neurons, learning-rate:00.1)
    - Decision Tree classifier
    - Random Forest approach with 200 estimators.

## Libraries and Settings

Load basic libraries and append source code directory (src) into the system path.

In [None]:
import os
import sys
sys.path.append(os.path.abspath(os.path.pardir))

Thirdy party libraries

In [2]:
# General
import funcy as fp
import numpy as np
import pandas as pd
from functools import partial

# Visualization / Presentation
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from IPython.core.display import HTML, display

# NLP Libraries
import re
import nlpnet
import stanza
import spacy
from spacy.tokenizer import _get_regex_pattern

Internal libraries

In [3]:
from src import settings
from src.pipeline.resources import load_corpus
from src.pipeline.general import clean_text
from src.pipeline.executors import simple_pipeline_executor

Presentation settings

In [4]:
%matplotlib inline 
pd.set_option('max_colwidth', 150)

## Load and Split Dataset

In [5]:
frame = load_corpus()
frame.head()

Unnamed: 0,id,hashtag,votes,hard,sentiment,group,text,repeat
0,863044774588272640,#encontro,"[1, 1, 1, 1, 1, 1, 1]",0,1,test,que coisa linda O programa estava mostrando uma familia que adotou um adolescente de NUMBER anos que amor !,False
1,865583716088766467,#encontro,"[1, 1, 1, 1, 1, 1, 1]",0,1,test,por mais com as irmãs galvão adorei elas,False
2,865063232201011201,#TheNoite,"[1, 0, 1, 1, 1, 0, 0]",2,1,test,mr CATRA USERNAME lançando sua nova música PPK CHORA no USERNAME k k k 👅 😉 #MrCatra #PpkChora,False
3,864668391008763905,#masterchefbr,"[0, 0, 0, 0, 0, 0, 0]",0,0,test,quem viu aquela lutadora modela barbuda tatuada #MasterChefBR,False
4,865572794016378882,#encontro,"[-1, -1, -1, -1, -1, -1, -1]",0,-1,test,tô passada com esse cara quanta merda pode sair da boca de alguém em alguns minutos 😠,False


Separate training and test records and delete the original frame.

In [6]:
training = frame.loc[(frame.group == 'train') & (frame.sentiment.isin(['-1', '0', '1']))].copy(deep=True)
test = frame.loc[(frame.group == 'test') & (frame.sentiment.isin(['-1', '0', '1']))].copy(deep=True)
del frame

Check class balance on training set

In [7]:
training.sentiment.value_counts()

1     5741
-1    3839
0     3410
Name: sentiment, dtype: int64

## Tokenize and Extract Features from Tweets

Instantiate NLP libraries to tokenize text and extract information (e.g., lemma, pos tag, and polarization).

There are three libraries used:
 - [nlpnet](http://nilc.icmc.usp.br/nlpnet/): Used to be able to reproduce exactly the same pos tags used by TweetSentBR.
 - [Spacy](https://spacy.io/): Lightweight and versatile library to perform tokenization, and get lemma and pos tags.
 - [Stanza](https://stanfordnlp.github.io/stanza/): 

In [8]:
nlpnet.set_data_dir(settings.NLPNET_POS_TAGGER_PATH)
nlpnet_nlp = nlpnet.POSTagger().tag

# Load Stanza to get access to tokenization, words expansion, pos tags, lemmas and polarization
stanza_nlp = stanza.Pipeline('pt', processors='tokenize, mwt, pos, lemma', use_gpu=False, tokenize_no_ssplit=True, verbose=0)

# Load Spacy to get access tokenization, lemma and pos tag
spacy_nlp = spacy.load('pt')

# Extend default token regex to avoid splitting hashtag, smile, emoji and emoticon.
re_token_match = _get_regex_pattern(spacy_nlp.Defaults.token_match)
re_token_match = f"({re_token_match}|#\w+|\w+-\w+)"
spacy_nlp.tokenizer.token_match = re.compile(re_token_match).match

Iterate over tweets to trim repeating spaces and words with more than 2 contiguous characters repeated. 

In [9]:
partial_clean_text = partial(clean_text, unify_html_tags=False, unify_urls=False, trim_repeating_spaces=True, unify_hashtags=False,
                             unify_mentions=False, unify_numbers=False, trim_repeating_letters=True)

training.loc[:, 'clean_text'] = training['text'].apply(partial_clean_text)
test.loc[:, 'clean_text'] = test['text'].apply(partial_clean_text)

In [10]:
disposable_frame = training.query('text != clean_text')
print(f'Records changed: {len(disposable_frame):,} from {len(training):,} ({len(disposable_frame) / len(training) * 100:.2f}%)')
display(disposable_frame.head(10))
del disposable_frame

Records changed: 1,510 from 12,990 (11.62%)


Unnamed: 0,id,hashtag,votes,hard,sentiment,group,text,repeat,clean_text
301,864531050508288000,#videoShowAoVivo,[1],0,1,train,sofia linda esse seu batom tá show e joaquim como sempre lindo bjusss da paraiba,False,sofia linda esse seu batom tá show e joaquim como sempre lindo bjuss da paraiba
318,864301806544990208,#maisvoce,[-1],0,-1,train,USERNAME coitada da taís araújo ! 🍲 😂 😂 😂 🍲 kkkk #MaisVocê,False,USERNAME coitada da taís araújo ! 🍲 😂 😂 😂 🍲 kk #MaisVocê
323,865426213510053889,#ConversaComBial,[1],0,1,train,ahhh vai começar uhuuu 😍 😍 👏 🏼 👏 🏼 👏 🏼 👏 🏼 👏 🏼 💜 💜 USERNAME #MaiaraeMaraisaNoBial,False,ahh vai começar uhuu 😍 😍 👏 🏼 👏 🏼 👏 🏼 👏 🏼 👏 🏼 💜 💜 USERNAME #MaiaraeMaraisaNoBial
324,865059540303310848,#TheNoite,[1],0,1,train,o mestre mandou é o melhor quadro velho kkk,False,o mestre mandou é o melhor quadro velho kk
335,864335520167362560,#TheNoite,[1],0,1,train,história bêbada do USERNAME hj kkk,False,história bêbada do USERNAME hj kk
338,864889627232141314,#videoShowAoVivo,[1],0,1,train,outra coisa jocaota tem brilho tem graça amooo,False,outra coisa jocaota tem brilho tem graça amoo
348,864334590353195008,#TheNoite,[1],0,1,train,MINHA SERIEEE,False,MINHA SERIEE
372,864467549215436805,#maisvoce,[1],0,1,train,pra me ganhar de vez VIAVIANE FAZ PLANILHAS É NERD ESTUDAAA ATRIZ DE VERDADE SE ORGANIZA pasmanterepivanomaisvoce,False,pra me ganhar de vez VIAVIANE FAZ PLANILHAS É NERD ESTUDAA ATRIZ DE VERDADE SE ORGANIZA pasmanterepivanomaisvoce
393,862156481885601793,#masterchefbr,[-1],0,-1,train,ahhh uma semana pra ter de novo :(,False,ahh uma semana pra ter de novo :(
401,864685396357242880,#masterchefbr,[-1],0,-1,train,A paola kkk,False,A paola kk


Tokenize and extract features from sentences and tokens on both, training and test sets.

In [11]:
from src.pipeline.processors import *
from src.pipeline.computers import *
from src.pipeline.extractors import *

main_word_processors = [#process_word_polarity, process_word_pos_tag,
                         process_negative_words, process_sentilex_word_polarity, process_emoticon_polarity, process_emoji_polarity]
main_sentence_processors = [#compute_polarity_features, compute_pos_tag_features,
                             compute_negative_words_features, compute_sentilex_polarity_features, compute_emoticon_polarity_features, compute_emoji_polarity_features]

training[['tokens', 'features']] = simple_pipeline_executor(training.clean_text.tolist(), extract_tokens_and_features, spacy_nlp, main_word_processors, main_sentence_processors)
test[['tokens', 'features']] = simple_pipeline_executor(test.clean_text.tolist(), extract_tokens_and_features, spacy_nlp, main_word_processors, main_sentence_processors)

  polarities_percentage = polarities_count / total
  polarities_percentage = polarities_count / total
  return array(a, dtype, copy=False, order=order)


Use nlpnet to extract pos tags from tokens and create additional features from it.

In [12]:
extra_word_processors = [#process_word_polarity, 
                        process_word_pos_tag]
extra_sentence_processors = [#compute_polarity_features, 
                            compute_pos_tag_features]

training['pos_tag_features'] = simple_pipeline_executor(training.clean_text.tolist(), extract_features, nlpnet_nlp, extra_word_processors, extra_sentence_processors)
test['pos_tag_features'] = simple_pipeline_executor(test.clean_text.tolist(), extract_features, nlpnet_nlp, extra_word_processors, extra_sentence_processors)

Assert there are no *NA* values on features.

In [13]:
for column in ['features', 'aux_features', 'pos_tag_features']:
    if column in training.columns:
        X = training[column].to_numpy()
        X = np.stack(X, axis=0)
        assert all([len(item) == 0 for item in np.where(np.isnan(X))]), f'There are na values in {column} column.'

In [14]:
training.head()

Unnamed: 0,id,hashtag,votes,hard,sentiment,group,text,repeat,clean_text,tokens,features,pos_tag_features
283,863587647016636417,#altasHoras,[-1],0,-1,train,apareceu o índice de morte na minha cidade tô muito assustado #BelemPedePaz,False,apareceu o índice de morte na minha cidade tô muito assustado #BelemPedePaz,"[aparecer, o, índice, de, morte, o, meu, cidade, tô, muito, assustar, #BelemPedePaz]","[0, 2, 10, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
284,863591661594697728,#altasHoras,[0],0,0,train,O tchan já pode substituir a morena pela bella gil #AltasHoras,False,O tchan já pode substituir a morena pela bella gil #AltasHoras,"[O, tchan, já, poder, substituir, o, moreno, pelar, bella, gil, #AltasHoras]","[0, 0, 11, 0, 0, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
285,863385491344941060,#édecasa,[1],0,1,train,rafael ainda nem nasceu e já escuta USERNAME #anjo #40semanas #EDeCasa,False,rafael ainda nem nasceu e já escuta USERNAME #anjo #40semanas #EDeCasa,"[rafael, ainda, nem, nascer, e, já, escutar, USERNAME, #anjo, #40semanas, #EDeCasa]","[1, 0, 11, 0, 0, 0, 0, 0, 0]","[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
286,865073884411953152,#ConversaComBial,[1],0,1,train,até que enfim um excelente programa de entrevistas na TV aberta 👍,False,até que enfim um excelente programa de entrevistas na TV aberta 👍,"[até, que, enfim, um, excelente, programar, de, entrevisto, o, TV, aberto, 👍]","[0, 0, 10, 2, 0, 0, 0, 0, 1]","[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
287,862176233949409281,#masterchefbr,[1],0,1,train,master chef me da fome de madrugadato virando coruja sonambulo,False,master chef me da fome de madrugadato virando coruja sonambulo,"[master, chef, me, da, fome, de, madrugadato, virar, corujar, sonambulo]","[0, 1, 9, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


## Format Features for Models

In [15]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
from sklearn.pipeline import Pipeline
from src.pipeline.resources import load_stopwords

# Shift labels (-1, 0, 1) to the right (0, 1, 2) to comply with sklearn requirements.
y_training = training.sentiment.apply(lambda x: int(x) + 1).to_numpy()
y_test = test.sentiment.apply(lambda x: int(x) + 1).to_numpy()


# Transform tokens into Bag of Words and then compute TF-IDF
text_clf = Pipeline([
    #('vect', CountVectorizer(lowercase=True, stop_words=stopwords.words('portuguese'))),
    ('vect', HashingVectorizer(analyzer='word', ngram_range=(1, 1), n_features=5000, lowercase=True, stop_words=None)),
    ('tfidf', TfidfTransformer()),
    ]
)
X_training = text_clf.fit_transform(training.tokens.apply(lambda x: ' '.join(x))).toarray()
X_test = text_clf.transform(test.tokens.apply(lambda x: ' '.join(x))).toarray()

# Combine token based features with manually created features (e.g., pos tag, negations, and word polarity)
feature_columns = ['features', 'pos_tag_features']
features = [X_training] + [np.stack(training.features.to_list(), axis=0) for column in feature_columns]
X_training = np.concatenate(features, axis=1)

features = [X_test] + [np.stack(test.features.to_list(), axis=0) for column in feature_columns]
X_test = np.concatenate(features, axis=1)

del features

In [16]:
print(X_training.shape)
print(X_test.shape)

(12990, 5018)
(2010, 5018)


## Model Training

In [17]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import KFold

from src.utils import y_hat_to_sparse, y_to_sparse

In [18]:
models_parameters = {
    'RF': {'n_estimators':200, 'criterion':'entropy', 'n_jobs':-1},
    'LR': {'n_jobs': -1},
    'LinearSVM': {'C':1.0, 'dual':False},
    'PolinomialSVM': {'C':10.0, 'kernel': 'poly'},
    'BernoulliNB': {'alpha':0.1},
    'MLP': {'activation':'tanh', 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'alpha': 0.001, 'early_stopping': True, 'hidden_layer_sizes':(200, 200)},
    'DT': {'criterion': 'gini', 'max_depth': None},
}

models_to_train = {
    'RF': RandomForestClassifier,
    'LR': LogisticRegression,
    'LinearSVM': LinearSVC,
    'PolinomialSVM': SVC,
    'BernoulliNB': BernoulliNB,
    'MLP': MLPClassifier,
    'DT': DecisionTreeClassifier,
}

trained_models = []
test_scores = []

for ix in tqdm(range(5)):

    iteration_index = np.arange(X_training.shape[0])
    np.random.shuffle(iteration_index)
    
    for model_name, model_class in models_to_train.items():
        model = model_class(**models_parameters.get(model_name, {}))
        model.fit(X_training[iteration_index], y_training[iteration_index])

        trained_models.append((model_name, ix, model))

        preds = model.predict(X_test)
        pred_labels = np.rint(preds)

        sparse_y = y_to_sparse(y_test)
        sparse_pred = y_hat_to_sparse(pred_labels)

        eval_metric = metrics.f1_score(sparse_y, sparse_pred, average=None)
        test_scores.append((model_name, ix, eval_metric))

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=5.0), HTML(value='')))




In [19]:
evaluation_frame = pd.DataFrame(test_scores, columns=['Algorithm', 'Iteration', 'RawMetrics'])
evaluation_f1_matrix = np.stack(evaluation_frame['RawMetrics'].to_list(), axis=0)
evaluation_frame = pd.concat([evaluation_frame, pd.DataFrame(evaluation_f1_matrix, columns=['F1-Neg', 'F1-Neu', 'F1-Pos'])], axis=1)
evaluation_frame['F1-Measure'] = np.mean(evaluation_f1_matrix, axis=1)

display(HTML('<h3>Individual Predictions</h3>'))
display(evaluation_frame.head(10))

display(HTML('<h3>Summarized Predictions</h3>'))
evaluation_summary_frame = (evaluation_frame
                            [['Algorithm', 'F1-Neg', 'F1-Neu', 'F1-Pos', 'F1-Measure']]
                            .groupby('Algorithm')
                            .agg([np.mean, np.std])
                           )
display(evaluation_summary_frame.sort_values(by=('F1-Measure', 'mean'), ascending=False))

Unnamed: 0,Algorithm,Iteration,RawMetrics,F1-Neg,F1-Neu,F1-Pos,F1-Measure
0,RF,0,"[0.6737012987012987, 0.486796785304248, 0.7480438184663538]",0.673701,0.486797,0.748044,0.636181
1,LR,0,"[0.6573426573426574, 0.3910171730515192, 0.729589428975932]",0.657343,0.391017,0.729589,0.59265
2,LinearSVM,0,"[0.6672268907563025, 0.49793388429752067, 0.7551020408163265]",0.667227,0.497934,0.755102,0.640088
3,PolinomialSVM,0,"[0.45549132947976884, 0.1957585644371941, 0.6726986624704956]",0.455491,0.195759,0.672699,0.441316
4,BernoulliNB,0,"[0.6470092670598147, 0.4829600778967868, 0.7009966777408637]",0.647009,0.48296,0.700997,0.610322
5,MLP,0,"[0.6803069053708439, 0.5252725470763132, 0.7519042437431991]",0.680307,0.525273,0.751904,0.652495
6,DT,0,"[0.5440677966101696, 0.41837732160312807, 0.6659328563566318]",0.544068,0.418377,0.665933,0.542793
7,RF,1,"[0.6785137318255251, 0.4604486422668241, 0.7493540051679586]",0.678514,0.460449,0.749354,0.629439
8,LR,1,"[0.6573426573426574, 0.3947368421052632, 0.7306238185255199]",0.657343,0.394737,0.730624,0.594234
9,LinearSVM,1,"[0.6677880571909167, 0.49896907216494846, 0.755507791509941]",0.667788,0.498969,0.755508,0.640755


Unnamed: 0_level_0,F1-Neg,F1-Neg,F1-Neu,F1-Neu,F1-Pos,F1-Pos,F1-Measure,F1-Measure
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
Algorithm,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
MLP,0.67784,0.008584,0.488028,0.061362,0.762316,0.008613,0.642728,0.017031
LinearSVM,0.667003,0.000501,0.499071,0.000674,0.755584,0.000292,0.640553,0.000265
RF,0.678409,0.005909,0.479488,0.01204,0.750038,0.004971,0.635978,0.003905
BernoulliNB,0.647009,0.0,0.48296,0.0,0.700997,0.0,0.610322,0.0
LR,0.657343,0.0,0.393249,0.002037,0.73021,0.000567,0.593601,0.000868
DT,0.54916,0.006384,0.420688,0.005373,0.665153,0.010627,0.545,0.006294
PolinomialSVM,0.455491,0.0,0.195759,0.0,0.672699,0.0,0.441316,0.0
