<a href="https://colab.research.google.com/github/eduartheinen/foursquare-tips/blob/master/foursquare_tips.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**TODO/Table of Contents:**
1. [x] **Data Preprocessing**
2. [x] **Feature Engineering:**
>BOW, TF-IDF, LSA, Class-Balanced Loss;
3. [x] **ML Models:**
>Naive Bayes, Logistic Regression, SVM, XGBoost;
4. [ ] **Deep Learning Models:**
> Bi-LSTM (learns, but sequential data would be better), 
>
> BERT fine-tuning (future work);
5. [x] **GridSearch:**
> Cross validated combination of best parameters and features;
6. [ ] **Plot and Discuss Results**
> Visualize accuracy, precision, recall, and F1 with box plots for each model and configuration; plot the confusion matrix and ROC-AUC
7. [ ] **Model/Decision interpretation with LIME**
8. [ ] **EMOJIS**
9. [ ] **Term Frequency Analysis -- Positive and Negative Classes**
> Comparison before and after TF-IDF
10. [ ] **Word Cloud**
11. [ ] **Bigram/Trigram Comparison**
12. [ ] **Present Predictions**
13. [ ] **Papers to Support Design Choices**
14. [ ] **Explain the Differences Between Models**

In [6]:
!pip install -U spacy setuptools wheel xgboost plotly chart-studio # ipyml transformers
!python -m spacy download pt_core_news_sm # comment this line after first run

Requirement already up-to-date: spacy in /usr/local/lib/python3.7/dist-packages (3.0.3)
Requirement already up-to-date: setuptools in /usr/local/lib/python3.7/dist-packages (53.1.0)
Requirement already up-to-date: wheel in /usr/local/lib/python3.7/dist-packages (0.36.2)
Requirement already up-to-date: xgboost in /usr/local/lib/python3.7/dist-packages (1.3.3)
Requirement already up-to-date: plotly in /usr/local/lib/python3.7/dist-packages (4.14.3)
Requirement already up-to-date: chart-studio in /usr/local/lib/python3.7/dist-packages (1.1.0)
2021-02-28 00:33:42.537818: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_sm')


In [7]:
# uncomment only if using BERT
# !wget https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/bert-base-portuguese-cased_pytorch_checkpoint.zip
# !wget https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/vocab.txt -P bert_checkpoint/
# !unzip bert-base-portuguese-cased_pytorch_checkpoint.zip -d bert_checkpoint/

In [8]:
import re
import string
import spacy
import pandas as pd
import numpy as np

from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import KFold

#**0. Foursquare Tips Dataset**
The dataset consists of user reviews in Portuguese about venues in São Paulo, Brazil, collected with the Foursquare API from the categories Food, Shop & Service, and Nightlife Spot.

>```dataset_test.csv``` has a total of 179,181 reviews.
>
>```tips_scenario1_train.csv``` contains 1708 reviews labeled as **negative, neutral or positive**.
>
>```tips_scenario2_train.csv``` contains 1788 reviews labeled as **negative or positive**.


In [9]:
#@title
import chart_studio.plotly as py
import plotly.graph_objects as go
from plotly.subplots import make_subplots


path = 'https://raw.githubusercontent.com/eduartheinen/foursquare-tips/master/data/'

df1 = pd.read_csv(path + 'tips_scenario1_train.csv').dropna(how='any')
df2 = pd.read_csv(path + 'tips_scenario2_train.csv').dropna(how='any')

fig = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=("Scenario 1","Scenario 2"))

# Count samples in an explicit label order so each bar matches its x label
# (labels are encoded as -1 = negative, 0 = neutral, 1 = positive)
fig.add_trace(go.Bar(x=['negative', 'neutral', 'positive'],
                     y=[(df1.rotulo == c).sum() for c in (-1, 0, 1)],
                     marker_color=['red', 'gray', 'blue']),
              row=1, col=1)

fig.add_trace(go.Bar(x=['negative', 'positive'],
                     y=[(df2.rotulo == c).sum() for c in (-1, 1)],
                     marker_color=['red', 'blue']), 
              row=1, col=2)
    
fig.update_layout(autosize=False, showlegend=False, width=1000, height=500, title='Samples Distribution')

# **1. Data Preprocessing**

- Raw dataset entries contain a sentence of 30 words, followed by its label.

- After URLs, punctuation marks, and numbers are removed, each sentence is processed with a **spaCy** pipeline trained on Portuguese text (`pt_core_news_sm`).

- This step tokenizes the sentence and attaches properties to each token.

- Properties such as:
>   **```lemma_```** returns the word's canonical form,
>
>   **```pos_```** returns its part-of-speech tag (noun, verb, adjective, ...),
>
>   **```is_stop```** indicates whether the token is a stop word.

- **For instance,** the sentence:
> "Eu fui morar na Estação da Luz. Porque estava muito escuro dentro do meu coração"

- **...would become**:
> ```tokens = ['morar', 'estação', 'luz', 'escuro', 'coração']```
>
> ```lemmas = ['morar', 'estação', 'luzir', 'escuro', 'coração']```
>
> ```pos = ['VERB', 'NOUN', 'NOUN', 'ADJ', 'NOUN']```
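
The cleaning that happens before the spaCy pass can be sketched with the standard library alone. This is a minimal illustration that mirrors the steps described above (URLs, punctuation, and numbers removed, text lowercased); the function name `clean_text` and the example sentence with a URL are assumptions made for this sketch, and stop-word removal and lemmatization are left to spaCy.

```python
import re
import string

# Translation table that turns every ASCII punctuation mark into a space
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def clean_text(sentence):
    """Hypothetical helper: normalize a raw review before tokenization."""
    sentence = re.sub(r'http\S+', '', sentence)          # drop URLs before punctuation
    sentence = sentence.translate(PUNCTUATION_TO_SPACE)  # punctuation -> spaces
    sentence = sentence.lower()
    sentence = re.sub(r'\d+', '', sentence)              # drop numbers
    sentence = re.sub(r' +', ' ', sentence)              # collapse repeated spaces
    return sentence.strip()


example = "Eu fui morar na Estação da Luz. Veja http://exemplo.com 2021!"
cleaned = clean_text(example)
print(cleaned)  # eu fui morar na estação da luz veja
```

Removing URLs first matters: once punctuation is replaced by spaces, a URL breaks into several fragments that the `http\S+` pattern can no longer match as a whole.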

#**2. Feature Engineering**

One of the first fundamental choices involved in building a Machine/Deep Learning model is how to represent real-world observations as features that the model can read and interpret.

The **observations** in the dataset are **sentences of 30 words**, each followed by a **label** that indicates the **overall sentiment** of that sentence.

Because the sentences are user reviews written in Portuguese, they retain the subjective and flexible nature of colloquial language. An ideal machine learning model should therefore capture not only the correlation between words and labels, but also the context of each word and its effect on the sentence's meaning.

##2.1 Text Representation: Bag of Words and TF-IDF

![bow.png](https://i.imgur.com/TTjtkUH.png)
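
As a rough illustration of the two representations, the sketch below builds a bag-of-words matrix and its TF-IDF counterpart for three toy documents using only the standard library. The documents are invented for this example, and the weighting assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which is the default scheme documented for scikit-learn's `TfidfVectorizer`; it is not the notebook's actual pipeline.

```python
import math
from collections import Counter

# Toy corpus (hypothetical reviews, already cleaned)
docs = ["comida boa", "comida ruim", "atendimento bom comida boa"]

# Vocabulary: every distinct term, sorted for a stable column order
vocab = sorted({term for doc in docs for term in doc.split()})

# Bag of words: one row per document, one raw count per vocabulary term
bow = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

# Document frequency: number of documents in which each term occurs
n_docs = len(docs)
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: rare terms get larger weights
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]


def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted


tfidf = [tfidf_row(row) for row in bow]

for term, column in zip(vocab, zip(*tfidf)):
    print(f"{term:12s}", [round(v, 3) for v in column])
```

In the second document, "ruim" appears in a single review while "comida" appears in all three, so TF-IDF gives "ruim" the larger weight even though both have a raw count of 1.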

##2.2 Singular Value Decomposition and Latent Semantic Analysis
![lsa.png](https://i.imgur.com/EMWATJX.png)


In [22]:
#@title
class FoursquareTipsDataset():
    def __init__(self, df, ngram_range=(1, 1)):  # the vectorizers need a (min_n, max_n) tuple, not None
      # extracting lemmas and POS tags with spacy even though we are not using them yet
      self.sentences, self.terms, self.lemmas, self.pos = self.preprocess(df.texto)
      self.labels = df.rotulo.reset_index(drop=True)
      self.feature_type_ = 'bow'

      # bag of words
      self.count_vectorizer = CountVectorizer(ngram_range=ngram_range)
      self.bow = self.count_vectorizer.fit_transform(self.sentences)

      # tfidf
      # drop terms that occur in more than 99% or in fewer than 0.2% of the documents
      self.tfidf_vectorizer = TfidfVectorizer(ngram_range=ngram_range, max_df=0.99, min_df=0.002)
      self.tfidf = self.tfidf_vectorizer.fit_transform(self.sentences)

      # SVD/LSA
      print('fitting bow_lsa')
      self.svd_bow = self.fit_svd_bow(self.bow)
      print('fitting tfidf_lsa')
      self.svd_tfidf = self.fit_svd_tfidf(self.tfidf)

      # for easy indexing
      self.sentences = pd.DataFrame(self.sentences)
      self.lemmas = pd.DataFrame(self.lemmas)
      self.pos = pd.DataFrame(self.pos)

    def feature_type(self, feature_type):
      self.feature_type_ = feature_type

    @staticmethod
    def preprocess(reviews):
        sentences = []
        lemmas = []
        pos = []
        terms = []

        for sentence in tqdm(reviews):
            sentence = re.sub(r'http\S+', '', sentence)  # removes urls before punctuation
            punctuation_to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
            sentence = sentence.translate(punctuation_to_space)  # change punctuations to spaces
            sentence = str.lower(sentence)
            sentence = re.sub(r'\d+', '', sentence)  # removes numbers
            sentence = re.sub(' +', ' ', sentence)  # removes double spaces

            # spacy processing -- nlp(sentence) -- adds properties to words,
            # like "lemma_", "pos_" and "is_stop" for stop_words.
            sentence = [w for w in nlp(sentence) if not w.is_stop]
            lemmas.append([w.lemma_ for w in sentence])
            pos.append([w.pos_ for w in sentence])
            terms.append(sentence)

            # sklearn count/tfidf vectorizers require raw text
            sentences.append(' '.join([w.text for w in sentence]))

        return sentences, terms, lemmas, pos

    def fit_svd_bow(self, data):
      for c in range(1400, 2000, 100):
        svd = TruncatedSVD(n_components=c, n_iter=10)
        svd.fit(data)
        if svd.explained_variance_ratio_.sum() > 0.98:
          break
      print(f'{c} components explained {svd.explained_variance_ratio_.sum():.4f} of feature variance.')
      return svd.transform(data)  # reuse the fitted model instead of refitting it

    def fit_svd_tfidf(self, data):
      for c in range(1, data.shape[1], 20):
        svd = TruncatedSVD(n_components=c, n_iter=10)
        svd.fit(data)
        if svd.explained_variance_ratio_.sum() > 0.98:
          break
      print(f'{c} components explained {svd.explained_variance_ratio_.sum():.4f} of feature variance.')
      return svd.transform(data)  # reuse the fitted model instead of refitting it

    def __getitem__(self, i):
      if self.feature_type_ == 'svd_tfidf':
        return self.svd_tfidf[i], self.labels.iloc[i]

      if self.feature_type_ == 'svd_bow':
        return self.svd_bow[i], self.labels.iloc[i]

      if self.feature_type_ == 'tfidf':
        return self.tfidf[i].toarray(), self.labels.iloc[i]

      return self.bow[i].toarray(), self.labels.iloc[i]

    def __len__(self):
        return len(self.sentences)

##\<Load and Process Dataset\>

In [23]:
#@title
nlp = spacy.load('pt_core_news_sm')
path = 'https://raw.githubusercontent.com/eduartheinen/foursquare-tips/master/data/'

df = pd.read_csv(path + 'tips_scenario1_train.csv').dropna(how='any')
data = FoursquareTipsDataset(df, ngram_range=(1, 2))

100%|██████████| 1708/1708 [00:14<00:00, 117.40it/s]


fitting bow_lsa
1500 components explained 0.9880 of feature variance.
fitting tfidf_lsa
861 components explained 0.9819 of feature variance.


##2.3 Class-Balanced Loss Based on Effective Number of Samples
https://openaccess.thecvf.com/content_CVPR_2019/papers/Cui_Class-Balanced_Loss_Based_on_Effective_Number_of_Samples_CVPR_2019_paper.pdf
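
The weighting scheme from the paper can be sketched in a few lines: a class with `n` samples has an effective number of samples `E_n = (1 - beta^n) / (1 - beta)`, and its weight is proportional to `1 / E_n`, rescaled so that the weights sum to the number of classes. The function name `class_balanced_weights`, the class counts, and the value of `beta` below are assumptions chosen for illustration only.

```python
def class_balanced_weights(samples_per_class, beta):
    """Hypothetical helper: weights proportional to 1 / effective number of samples."""
    # Effective number of samples per class: (1 - beta^n) / (1 - beta)
    effective_num = [(1.0 - beta ** n) / (1.0 - beta) for n in samples_per_class]

    # Rarer classes have a smaller effective number, hence a larger raw weight
    raw = [1.0 / e for e in effective_num]

    # Rescale so that the weights add up to the number of classes
    scale = len(samples_per_class) / sum(raw)
    return [w * scale for w in raw]


# Illustrative counts only: 300 negative and 1200 positive samples
weights = class_balanced_weights([300, 1200], beta=0.999)
print(weights)
```

With `beta = 0` every class has an effective number of 1 and all weights are equal; as `beta` approaches 1 the weights approach inverse class frequency, so `beta` controls how strongly minority classes are emphasized.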

In [13]:
#@title
N = data.bow.shape[0] * data.bow.shape[1]
beta = (N - 1) / N
num_classes = 2
samples_per_class = [len(data.labels[data.labels == -1]), \
                     len(data.labels[data.labels == 1])]

effective_num = 1.0 - np.power(beta, samples_per_class)
weights = (1.0 - beta) / np.array(effective_num)
weights = weights / np.sum(weights) * num_classes

# explicit label order (-1 = negative, 0 = neutral, 1 = positive) so counts match the bar labels
scaled_class_samples = [len(df1[df1.rotulo==c]) for c in (-1, 0, 1)]

fig = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=("Original","Scaled Proportionally to Samples"))
fig.add_trace(go.Bar(x=['negative', 'neutral', 'positive'],
                     y=scaled_class_samples,
                     marker_color=['red', 'gray', 'blue']),
              row=1, col=1)

scaled_class_samples[0] = scaled_class_samples[0] * weights[0]
scaled_class_samples[2] = scaled_class_samples[2] * weights[1]

fig.add_trace(go.Bar(x=['negative', 'neutral', 'positive'],
                     y=scaled_class_samples,
                     marker_color=['red', 'gray', 'blue']), 
              row=1, col=2)
    
fig.update_layout(autosize=False, showlegend=False, width=1000, height=500, title='Samples Distribution')

# class_weights = {-1:weights[0], 0:1, 1:weights[1]}
class_weights = {-1:weights[0], 1:weights[1]}

fig.show()

#**3. Probabilistic and Machine Learning Models**

## 3.1 Naive Bayes

![lsa.png](https://i.imgur.com/dyhj5yi.png)

https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf


In [34]:
from sklearn.naive_bayes import ComplementNB
features = ['bow', 'tfidf']
scores = {f:[] for f in features}

for f in features:
  data.feature_type(f)
  kf = KFold(n_splits=10, shuffle=True)
  for train_index, test_index in kf.split(range(0, len(data))):
    x_train, y_train = data[train_index]
    x_test, y_test = data[test_index]
    
    cnb = ComplementNB()
    cnb.fit(x_train, y_train)
    scores[f].append(cnb.score(x_test, y_test))

cnb_scores = scores
nb_bow = np.mean(scores['bow'])
nb_tfidf = np.mean(scores['tfidf'])
scores

{'bow': [0.7134502923976608,
  0.6549707602339181,
  0.672514619883041,
  0.672514619883041,
  0.5906432748538012,
  0.695906432748538,
  0.5964912280701754,
  0.7192982456140351,
  0.6941176470588235,
  0.6588235294117647],
 'tfidf': [0.7251461988304093,
  0.7426900584795322,
  0.7543859649122807,
  0.7309941520467836,
  0.695906432748538,
  0.7192982456140351,
  0.6842105263157895,
  0.7192982456140351,
  0.7235294117647059,
  0.7529411764705882]}

## 3.2 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']

scores = {f:[] for f in features}
scores.update({f+'_weighted':[] for f in features})

for f in features:
  data.feature_type(f)
  kf = KFold(n_splits=10, shuffle=True)
  for train_index, test_index in kf.split(range(0, len(data))):
    x_train, y_train = data[train_index]
    x_test, y_test = data[test_index]  # test_index is an index array, so it cannot be used in a slice

    lr_cv = LogisticRegression(C=1e3, n_jobs=-1, max_iter=500)
    lr_cv.fit(x_train, y_train)
    scores[f].append(lr_cv.score(x_test, y_test))

    lr_cv = LogisticRegression(C=1, n_jobs=-1, max_iter=500, 
                                class_weight=class_weights)
    lr_cv.fit(x_train, y_train)
    scores[f+'_weighted'].append(lr_cv.score(x_test, y_test))

lr_scores = scores
scores

## 3.3 Support Vector Classification

Data that is not linearly separable in its original feature space can be implicitly mapped into a higher-dimensional space, where a linear separator may exist, by means of a kernel function.

The kernel takes two vectors in the original space and returns their dot product in the higher-dimensional feature space, so the mapping itself never has to be computed explicitly.

![lsa.png](https://miro.medium.com/max/913/1*gXvhD4IomaC9Jb37tzDUVg.png)
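
A small numerical check makes the kernel idea concrete. The sketch below uses a homogeneous polynomial kernel of degree 2 on 2-D inputs, chosen because its explicit feature map is short enough to write out; it is a simplified kernel for illustration, not the exact parameterization used by `SVC(kernel='poly')`, and the function names are assumptions of this example.

```python
import math


def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product reproduces poly_kernel on 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, -1.0)

# Kernel evaluated directly in the original 2-D space
implicit = poly_kernel(x, y)

# Same quantity computed as a dot product in the 3-D feature space
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))

print(implicit, explicit)
```

Both values agree up to floating-point rounding, which is the point of the kernel trick: the classifier only ever needs the kernel value, so the higher-dimensional vectors are never materialized.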



In [None]:
from sklearn.svm import SVC

features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']

scores = {f:[] for f in features}
scores.update({f+'_weighted':[] for f in features})

train_index = test_index = int(len(data)*0.75) # hold out the last 25% of the dataset for testing
# for train_index, test_index in kf.split(data.bow):
for f in features:
  data.feature_type(f)  # select the representation under evaluation
  x_train, y_train = data[:train_index]
  x_test, y_test = data[test_index:]

  scv = SVC(kernel='poly', C=1.0)
  scv.fit(x_train, y_train)
  scores[f].append(scv.score(x_test, y_test))

  scv = SVC(kernel='poly', C=1.0, class_weight=class_weights)
  scv.fit(x_train, y_train)
  scores[f+'_weighted'].append(scv.score(x_test, y_test))

svc_scores = scores

for f in features:
  tmp = np.mean(scores[f])
  print(f'{f}:{tmp}')

## 3.4 XGBoost

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score


features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']

scores = {f:[] for f in features}
scores.update({f+'_weighted':[] for f in features})

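# train_index = test_index = int(len(data)*0.75) # hold out the last 25% of the dataset for testing
kf = KFold(n_splits=10, shuffle=True)  # defined here so the cell does not depend on earlier cells
for train_index, test_index in kf.split(data.bow):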
  for f in features:
    data.feature_type(f)
    x_train, y_train = data[train_index]
    x_test, y_test = data[test_index]

    xgb = XGBClassifier()
    xgb.fit(x_train, y_train)
    scores[f].append(xgb.score(x_test, y_test))

#**4. Deep Learning Models: Bi-directional LSTM**

In [None]:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, random_split

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_index = test_index = int(len(data) * 0.75)  # hold out the last 25% of the dataset for testing
hidden_dim = 64
learning_rate = 0.001
weight_decay = 0.01
max_epochs = 100
criterion = nn.NLLLoss()


class BiLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super(BiLSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.num_classes = num_classes

        # GloVe, but contains only vectors that refer to words on the dataset vocabulary
        # self.word_embeds = nn.Embedding.from_pretrained(embedding_vectors)
        self.lstm = nn.LSTM(input_dim, hidden_dim // 2, num_layers=1, bidirectional=True, batch_first=True)

        # Maps the output of the LSTM into label space.
        self.hidden2label = nn.Linear(hidden_dim, num_classes)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (torch.nn.init.xavier_uniform_(torch.zeros(2, 1, self.hidden_dim // 2)).to(device),
                torch.nn.init.xavier_uniform_(torch.zeros(2, 1, self.hidden_dim // 2)).to(device))

    def forward(self, sentence):
        self.hidden = self.init_hidden()
        # with batch_first=True the LSTM expects input of shape (batch, seq, feature)
        sentence = sentence.view(1, 1, len(sentence))
        lstm_out, _ = self.lstm(sentence, self.hidden)
        lstm_feats = self.hidden2label(lstm_out)
        labels = F.log_softmax(lstm_feats[0], dim=1)

        return labels

In [None]:
def training_epoch(model, optimizer, train_x, train_y):  # criterion, scheduler
    model.train()
    losses = []
    progress_bar = tqdm(range(len(train_x)), desc='Training', leave=False)
    for i in progress_bar:
        inputs = torch.Tensor(train_x[i]).to(device)
        target = torch.LongTensor([0 if train_y[i] == -1 else 1]).to(device)

        # Clear gradients accumulated in the previous step
        # optimizer.zero_grad()
        model.zero_grad()

        # Forward pass
        output = model(inputs)
        loss = criterion(output, target)

        # acc = accuracy_score(model, output)

        # Backward pass: compute gradients of the loss
        loss.backward()

        # Update the parameters with the optimizer
        optimizer.step()
        # scheduler.step()

        losses.append(loss.item())

    # plot_losses(losses, 'batch #', 'neg_log_likelihood_loss', 'Training Loss')
    return sum(losses) / len(losses)


from sklearn.metrics import f1_score


def validate_epoch(model, test_x, test_y):  # criterion, scheduler
    model.eval()
    predictions = []
    targets = []

    with torch.no_grad():
        progress_bar = tqdm(range(len(test_x)), desc='Validating', leave=False)
        for i in progress_bar:
            inputs = torch.Tensor(test_x[i]).to(device)

            # Forward pass: keep only the predicted class index (0 = negative, 1 = positive)
            predictions.append(model(inputs)[0].argmax().item())
            targets.append(0 if test_y[i] == -1 else 1)

    # F1 score of the positive class over the whole validation set
    return f1_score(targets, predictions)

In [None]:
# Hold out the last 25% of the bag-of-words features for validation
data.feature_type('bow')
train_x, train_y = data[:train_index]
test_x, test_y = data[test_index:]
train_y, test_y = train_y.to_numpy(), test_y.to_numpy()  # allow positional indexing

model = BiLSTM(data.bow.shape[1], hidden_dim, num_classes).to(device)
optimizer = optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()),
                        lr=learning_rate, weight_decay=weight_decay)

loss_file = 'losses.csv'
train_losses = []
best_Fs = []
best_F = 0
epoch = 0

# resume training
checkpoint_file = 'ner-bilstm-crf-feb20.pth'

In [None]:
# training epochs
for epoch in range(max_epochs):
    train_loss = training_epoch(model, optimizer, train_x, train_y)
    new_F = validate_epoch(model, test_x, test_y)

    if new_F > best_F:
        best_F = new_F

    if epoch % 10 == 0:
      torch.save({
          'epoch': epoch,
          'model_state_dict': model.state_dict(),
          'optimizer_state_dict': optimizer.state_dict(),
          'train_losses': train_losses,
          'best_Fs': best_Fs
      }, checkpoint_file)
        

    # with open(loss_file, 'a', newline='\n') as csvfile:
    #     writer = csv.writer(csvfile)
    #     writer.writerow([epoch + 1, train_loss, new_F])

    tqdm.write(
        f'epoch #{epoch + 1:3d}\ttrain_loss: {train_loss:.6f}\tcurrent_F: {new_F:.6f}\tbest_F: {best_F:.6f} \n',
    )

    train_losses.append(train_loss)
    best_Fs.append(new_F)

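def plot_losses(values, xlabel, ylabel, title):
    # plot_losses is not defined anywhere in this notebook; this is a minimal
    # stand-in that draws a line plot with the plotly objects imported earlier
    fig = go.Figure(go.Scatter(y=values, mode='lines'))
    fig.update_layout(title=title, xaxis_title=xlabel, yaxis_title=ylabel)
    fig.show()


plot_losses(train_losses, 'training epoch #', 'neg_log_likelihood_loss', 'Training Loss')
plot_losses(best_Fs, 'validation epoch #', 'F score', 'F score')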