<a href="https://colab.research.google.com/github/eduartheinen/foursquare-tips/blob/master/foursquare_tips_scenario2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title
!pip install -U spacy setuptools wheel xgboost plotly chart-studio # ipyml transformers
!python -m spacy download pt_core_news_sm # comment this line after first run

In [1]:
#@title
import re
import string
from collections import Counter

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

#spacy
import spacy
#pytorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torchtext.vocab import Vocab
#sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
from sklearn.model_selection import KFold
# from sklearn.naive_bayes import ComplementNB
# from sklearn.svm import LinearSVC

from tqdm import tqdm
# from xgboost import XGBClassifier

# **0. Foursquare Tips Dataset**
Composed of user reviews in Portuguese about places in São Paulo, Brazil, collected with the Foursquare API from the categories Food, Shop & Service, and Nightlife Spot.

>```dataset_test.csv``` has a total of 179,181 reviews.
>
>```tips_scenario1_train.csv``` contains 1708 reviews labeled as **negative, neutral or positive**.
>
>```tips_scenario2_train.csv``` contains 1788 reviews labeled as **negative or positive**.


In [2]:
#@title
import chart_studio.plotly as py
import plotly.graph_objects as go
from plotly.subplots import make_subplots


path = 'https://raw.githubusercontent.com/eduartheinen/foursquare-tips/master/data/'

df1 = pd.read_csv(path + 'tips_scenario1_train.csv').dropna(how='any')
df2 = pd.read_csv(path + 'tips_scenario2_train.csv').dropna(how='any')

fig = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=("Scenario 1","Scenario 2"))

fig.add_trace(go.Bar(x=['negative', 'neutral', 'positive'],
                     y=[len(df1[df1.rotulo==c]) for c in df1.rotulo.unique()],
                     marker_color=['red', 'gray', 'blue']),
              row=1, col=1)

fig.add_trace(go.Bar(x=['negative', 'positive'],
                     y=[len(df2[df2.rotulo==c]) for c in df2.rotulo.unique()],
                     marker_color=['red', 'blue']), 
              row=1, col=2)
    
fig.update_layout(autosize=False, showlegend=False, width=1000, height=500, title='Samples Distribution')

# **1. Data Preprocessing**

- Raw dataset entries contain a sentence of 30 words, followed by its label.

- After removing URLs, punctuation marks and numbers, each sentence is processed with the **spaCy NLP API**, using a model trained on Portuguese.

- This step tokenizes the sentence and adds properties to each term.

- Properties such as:
>   **```lemma_```** returns the word's canonical form,
>
>   **```pos_```** returns its part-of-speech tag (noun, verb, adjective, ...),
>
>   **```is_stop```** indicates whether it is a stop word.

- **For instance,** the sentence:
> "Eu fui morar na Estação da Luz. Porque estava muito escuro dentro do meu coração"

- **...would become**:
> ```tokens = ['morar', 'estação', 'luz', 'escuro', 'coração']```
>
> ```lemmas = ['morar', 'estação', 'luzir', 'escuro', 'coração']```
>
> ```pos = ['VERB', 'NOUN', 'NOUN', 'ADJ', 'NOUN']```
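
The text-cleaning part of this step can be sketched with the standard library alone. This is a minimal illustration, assuming the same order of operations as the notebook's `preprocess` method; the function name `clean_text` and the final `strip()` are additions for this example, and the spaCy stop-word and lemma steps are left out so the snippet runs without a language model.

```python
import re
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def clean_text(text):
    """Hypothetical helper: URL, punctuation and number removal only."""
    text = re.sub(r'http\S+', '', text)   # drop URLs before touching punctuation
    text = text.translate(PUNCT_TO_SPACE)  # punctuation -> spaces
    text = text.lower()
    text = re.sub(r'\d+', '', text)        # drop numbers
    text = re.sub(r' +', ' ', text)        # collapse repeated spaces
    return text.strip()                    # assumption: trim leading/trailing spaces


example = "Eu fui morar na Estação da Luz. Veja http://example.com 123!"
cleaned = clean_text(example)
print(cleaned)
```

URLs are removed first because stripping punctuation earlier would break them into fragments that the `http\S+` pattern no longer matches.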

# **2. Feature Engineering**

One of the first fundamental choices involved in building a machine/deep learning model is how to represent real-world observations as features that the model can read and understand.

The **observations** in the dataset are **sentences with 30 words**, each followed by a **label** that indicates the **overall sentiment** of that sentence.

As the sentences are user reviews written in Portuguese, they retain the subjective and flexible nature of colloquial language. Thus, an ideal machine learning model should capture not only the correlation between words and labels, but also the words' context and its effects on the sentence's meaning.

## 2.1 Text Representation: Bag of Words and TF-IDF

![bow.png](https://i.imgur.com/TTjtkUH.png)
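
As a rough illustration of the two representations, the sketch below builds a bag-of-words count matrix and a TF-IDF matrix by hand. It is not the code path used in this notebook, which relies on `CountVectorizer` and `TfidfVectorizer`; the helper names, the whitespace tokenization and the toy sentences are assumptions. The weighting uses the smoothed form `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Hypothetical helper: count matrix over a sorted vocabulary."""
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = []
    for toks in tokenized:
        freq = Counter(toks)
        counts.append([freq[term] for term in vocab])
    return vocab, counts


def tfidf(counts):
    """Hypothetical helper: smoothed idf weighting with L2-normalized rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result


docs = ["comida boa", "comida ruim", "atendimento ruim ruim"]
vocab, counts = bag_of_words(docs)
weights = tfidf(counts)
```

In the first toy sentence, "boa" occurs in one document and "comida" in two, so "boa" ends up with the larger TF-IDF weight even though both have a raw count of 1.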

## 2.2 Singular Value Decomposition and Latent Semantic Analysis
![lsa.png](https://i.imgur.com/EMWATJX.png)
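
The idea behind LSA can be sketched with a plain NumPy SVD on a tiny document-term matrix. This is only an illustration under assumptions: the matrix values and the function name `truncated_svd` are made up, and the "energy" ratio computed here (share of squared singular values kept) is analogous to, but need not coincide with, the `explained_variance_ratio_` reported by scikit-learn's `TruncatedSVD` used later in this notebook.

```python
import numpy as np


def truncated_svd(X, k):
    """Hypothetical helper: project the rows of X onto the top-k singular directions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    reduced = U[:, :k] * s[:k]                     # documents in the k-dim latent space
    energy = (s[:k] ** 2).sum() / (s ** 2).sum()   # share of squared singular values kept
    return reduced, energy


# Toy document-term matrix: 4 documents, 3 terms (values are illustrative).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 2.0]])

reduced_2, energy_2 = truncated_svd(X, 2)
reduced_3, energy_3 = truncated_svd(X, 3)
print(reduced_2.shape, round(float(energy_2), 4))
```

Keeping fewer components gives a denser, lower-dimensional representation at the cost of discarding the directions with the smallest singular values.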


In [3]:
#@title
class FoursquareTipsDataset():
    def __init__(self, df, ngram_range=(1, 1)):  # (1, 1) = unigrams; the vectorizers reject None
        # extracting lemmas and POS tags with spacy even though we are not using them yet
        self.sentences, self.terms, self.lemmas, self.pos = self.preprocess(df.texto)
        self.labels = df.rotulo.reset_index(drop=True)
        self.feature_type_ = 'bow'

        self.vocab = Vocab(  # Collections.Counter for words in sentences
            Counter([str(term) for sentence in self.terms for term in sentence]))
        # min_freq=min_freq, specials=special_tags)

        # bag of words
        self.count_vectorizer = CountVectorizer(ngram_range=ngram_range)
        self.bow = self.count_vectorizer.fit_transform(self.sentences)

        # tfidf
        self.tfidf_vectorizer = TfidfVectorizer(ngram_range=ngram_range, max_df=0.99,
                                                min_df=0.002)  # removed if present in less than 3.6 documents
        self.tfidf = self.tfidf_vectorizer.fit_transform(self.sentences)

        # SVD/LSA
        print('fitting bow_lsa')
        self.svd_bow = self.fit_svd_bow(self.bow)
        print('fitting tfidf_lsa')
        self.svd_tfidf = self.fit_svd_tfidf(self.tfidf)

        # for easy indexing
        self.sentences = pd.DataFrame(self.sentences)
        self.terms = pd.DataFrame(self.terms)
        self.lemmas = pd.DataFrame(self.lemmas)
        self.pos = pd.DataFrame(self.pos)

    def feature_type(self, feature_type):
        self.feature_type_ = feature_type

    @staticmethod
    def preprocess(reviews):
        sentences = []
        lemmas = []
        pos = []
        terms = []

        for sentence in tqdm(reviews):
            sentence = re.sub(r'http\S+', '', sentence)  # removes urls before punctuation
            punctuation_to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
            sentence = sentence.translate(punctuation_to_space)  # change punctuations to spaces
            sentence = str.lower(sentence)
            sentence = re.sub(r'\d+', '', sentence)  # removes numbers
            sentence = re.sub(' +', ' ', sentence)  # removes double spaces

            # spacy processing -- nlp(sentence) -- adds properties to words,
            # like "lemma_", "pos_" and "is_stop" for stop_words.
            sentence = [w for w in nlp(sentence) if not w.is_stop]  # drop stop words once
            lemmas.append([w.lemma_ for w in sentence])
            pos.append([w.pos_ for w in sentence])
            terms.append([w.text for w in sentence])

            # sklearn count/tfidf vectorizers require raw text
            sentences.append(' '.join([w.text for w in sentence]))

        return sentences, terms, lemmas, pos

    def fit_svd_bow(self, data):
        for c in range(1400, 2000, 100):
            svd = TruncatedSVD(n_components=c, n_iter=10)
            svd.fit(data)
            if svd.explained_variance_ratio_.sum() > 0.98:
                break
        print(f'{c} components explained {svd.explained_variance_ratio_.sum():.4f} of feature variance.')
        return svd.transform(data)  # reuse the fitted model instead of refitting

    def fit_svd_tfidf(self, data):
        for c in range(1, data.shape[1], 20):
            svd = TruncatedSVD(n_components=c, n_iter=10)
            svd.fit(data)
            if svd.explained_variance_ratio_.sum() > 0.98:
                break
        print(f'{c} components explained {svd.explained_variance_ratio_.sum():.4f} of feature variance.')
        return svd.transform(data)  # reuse the fitted model instead of refitting

    def __getitem__(self, i):
        if self.feature_type_ == 'svd_tfidf':
            return self.svd_tfidf[i], self.labels.iloc[i]

        if self.feature_type_ == 'svd_bow':
            return self.svd_bow[i], self.labels.iloc[i]

        if self.feature_type_ == 'tfidf':
            return self.tfidf[i].toarray(), self.labels.iloc[i]

        if self.feature_type_ == 'sequence':
            return self.terms.iloc[i]. \
                       apply(lambda s: [self.vocab.stoi[t] for t in s]). \
                       reset_index(drop=True), \
                   self.labels.iloc[i].reset_index(drop=True)

        return self.bow[i].toarray(), self.labels.iloc[i]

    def __len__(self):
        return self.bow.shape[0]

## \<Load and Process Dataset\>

In [4]:
#@title
nlp = spacy.load('pt_core_news_sm')
nlp.Defaults.stop_words.add('a')
nlp.Defaults.stop_words.add('e')

path = 'https://raw.githubusercontent.com/eduartheinen/foursquare-tips/master/data/'

df = pd.read_csv(path + 'tips_scenario2_train.csv').dropna(how='any')
data = FoursquareTipsDataset(df, ngram_range=(1, 2))

100%|██████████| 1788/1788 [00:15<00:00, 112.82it/s]


fitting bow_lsa
1600 components explained 0.9912 of feature variance.
fitting tfidf_lsa
901 components explained 0.9821 of feature variance.


## \<Validation Data\>
We set aside 15% of the samples for validation, on which we report each model's accuracy, precision, recall and F1 score, and plot confusion matrices.

- **accuracy**: fraction of exact matches between y_true and y_pred;
- **precision**: ```tp / (tp + fp)``` (intuitively) the ability not to label as positive a negative sample;
- **recall**: ```tp / (tp + fn)``` (intuitively) the ability to find all the positive samples;
- **f1 score**: ```2 * (precision * recall) / (precision + recall)``` the harmonic mean of precision and recall.
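
The formulas above can be checked by hand on a small example. The sketch below is a plain-Python illustration, not the scikit-learn functions used in the notebook; the helper name `precision_recall_f1` and the toy labels are assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Hypothetical helper: metrics for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: 4 positive and 2 negative samples (illustrative only).
y_true = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg']
y_pred = ['pos', 'pos', 'pos', 'neg', 'pos', 'pos']

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive='pos')
print(precision, recall, f1)
```

Here tp = 3, fp = 2 and fn = 1, so precision is 3/5 and recall is 3/4. The cells below pass `average=None` to the scikit-learn scorers, which returns one such value per class rather than a single number.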

In [5]:
#@title
train_valid_split = int(len(data) * 0.85)  # excluding 0.15 for validation

# **3. Probabilistic and Machine Learning Models**

## 3.1 Naive Bayes

![lsa.png](https://i.imgur.com/dyhj5yi.png)

[Tackling the Poor Assumptions of Naive Bayes Text Classifiers (2003)](https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf)
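
The basic complement rule from the paper linked above can be sketched in a few lines of NumPy. This is a simplified illustration and not the `ComplementNB` implementation used in this notebook: it omits the paper's text transforms and weight normalization, and the function names and toy counts are assumptions. For each class, term probabilities are estimated from all documents that do *not* belong to it, and a document is assigned to the class whose complement fits it worst.

```python
import numpy as np


def fit_complement_nb(X, y, alpha=1.0):
    """Hypothetical helper: log term probabilities estimated from each class's complement."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    weights = []
    for c in classes:
        comp_counts = X[y != c].sum(axis=0) + alpha  # smoothed counts outside class c
        theta = comp_counts / comp_counts.sum()
        weights.append(np.log(theta))
    return classes, np.vstack(weights)


def predict_complement_nb(X, classes, weights):
    """Pick the class whose complement explains the document worst (lowest score)."""
    scores = np.asarray(X, dtype=float) @ weights.T  # shape: (n_samples, n_classes)
    return classes[np.argmin(scores, axis=1)]


# Toy term counts: two terms, two documents per class (illustrative only).
X_toy = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y_toy = np.array([0, 0, 1, 1])

classes, weights = fit_complement_nb(X_toy, y_toy)
pred = predict_complement_nb(X_toy, classes, weights)
print(pred)
```

Estimating parameters from the complement uses more data per class, which the paper argues makes the weights less sensitive to uneven class sizes.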


In [6]:
#@title
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score, confusion_matrix, classification_report

features = ['bow', 'tfidf']

accuracy = {f:[] for f in features}
precision = {f:[] for f in features}
recall = {f:[] for f in features}
f1 = {f:[] for f in features}

for f in features:
  data.feature_type(f)
  kf = KFold(n_splits=5, shuffle=True)
  
  for train_index, test_index in kf.split(range(0, train_valid_split)):
    # get data
    x_train, y_train = data[train_index]
    x_test, y_test = data[test_index]

    # model fit and predict
    cnb = ComplementNB()
    cnb.fit(x_train, y_train)
    y_pred = cnb.predict(x_test)

    # append scores/metrics
    accuracy[f].append(accuracy_score(y_test, y_pred))
    precision[f].append(list(precision_score(y_test, y_pred, average=None)))
    recall[f].append(list(recall_score(y_test, y_pred, average=None)))
    f1[f].append(list(f1_score(y_test, y_pred, average=None)))

## Naive Bayes - Accuracy
**accuracy**: exact matches between y_true and y_pred;

In [12]:
#@title
class_names = ['negative', 'positive']
class_colors = ['red', 'blue']

# create figures
fig1 = make_subplots(rows=1, cols=1)  #, subplot_titles=(['Accuracy']))#, 'Precision BOW', 'Precision TFIDF', 'Recall', 'F1']))
fig2 = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(features))
fig3 = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(features))
fig4 = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(features))


mean_accuracy = []
mean_precision = []
mean_recall = []
mean_f1 = []

col_counter = 1
for f in features:

  fig1.add_trace(go.Box(y=accuracy[f], name=f), row=1, col=1)
  mean_accuracy.append(np.mean(accuracy[f]))

  precision_feature = pd.DataFrame(precision[f])
  mean_precision.append([precision_feature.iloc[:, i].mean() for i in range(0,2)])  # another FOR

  recall_feature = pd.DataFrame(recall[f])
  mean_recall.append([recall_feature.iloc[:, i].mean() for i in range(0,2)])  # another FOR
  
  f1_feature = pd.DataFrame(f1[f])
  mean_f1.append([f1_feature.iloc[:, i].mean() for i in range(0,2)]) # and another FOR
  
  for c in range(0,2):
    fig2.add_trace(go.Box(y=precision_feature.iloc[:, c], name=class_names[c], marker_color=class_colors[c]), row=1, col=col_counter)
    fig3.add_trace(go.Box(y=recall_feature.iloc[:, c], name=class_names[c], marker_color=class_colors[c]), row=1, col=col_counter)
    fig4.add_trace(go.Box(y=f1_feature.iloc[:, c], name=class_names[c], marker_color=class_colors[c]), row=1, col=col_counter)
  
  col_counter += 1

# layout is set once, after all traces have been added
fig1.update_layout(title='CNB Accuracy', showlegend=False, autosize=False, width=500, height=500)
fig2.update_layout(title='CNB Precision', showlegend=False, autosize=False, width=800, height=500)
fig3.update_layout(title='CNB Recall', showlegend=False, autosize=False, width=800, height=500)
fig4.update_layout(title='CNB F1 Score', showlegend=False, autosize=False, width=800, height=500)

fig1.show()

## Naive Bayes - Precision 
**precision**: ```tp / (tp + fp)``` (intuitively) the ability not to label as positive a negative sample;

In [10]:
#@title
class_names = ['negative', 'positive']
pd.DataFrame(mean_precision, columns=class_names, index=features).sort_values(by='negative', ascending=False)

Unnamed: 0,negative,positive
bow,0.779299,0.835916
tfidf,0.765859,0.83375


In [13]:
#@title
fig2.show()

## Naive Bayes - Recall
**recall**:```tp / (tp + fn)``` (intuitively) the ability to find all the positive samples;

In [14]:
#@title
pd.DataFrame(mean_recall, columns=class_names, index=features).sort_values(by='negative', ascending=False)


In [15]:
#@title
fig3.show()

## Naive Bayes - F1 score 
**f1 score:** ```2 * (precision * recall) / (precision + recall)``` the harmonic mean of precision and recall;

In [16]:
#@title
pd.DataFrame(mean_f1, columns=class_names, index=features).sort_values(by='negative', ascending=False)

Unnamed: 0,negative,positive
bow,0.778877,0.836922
tfidf,0.770437,0.830061


In [17]:
#@title
fig4.show()

## Naive Bayes - Confusion Matrix

In [18]:
#@title
# report with valid data
x_train, y_train = data[:train_valid_split]
x_valid, y_valid = data[train_valid_split:]

cnb = ComplementNB()
cnb.fit(x_train, y_train)
y_pred = cnb.predict(x_valid)

fig = go.Figure(data=go.Heatmap(z=confusion_matrix(y_valid, y_pred)))
fig.update_layout(title='CNB Confusion Matrix', showlegend=False, autosize=False, width=400, height=400)
fig.show()

## 3.2 Support Vector Classification

Data that is not linearly separable in its original dimension space can be projected into a higher-dimensional space with a kernel function. 

This function takes as input vectors in the original space and returns the dot product of the vectors in the higher-dimensional feature space.

![svc.png](https://i.imgur.com/0fgHDhe.png)
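
A small worked example may make the kernel idea concrete. The sketch below uses a degree-2 polynomial kernel on 2-D vectors, chosen only for illustration: evaluating `(x · z)^2` in the original space gives the same number as mapping both vectors into a 3-D feature space and taking the dot product there. The function names are hypothetical. Note that the cells below use `LinearSVC`, which fits a linear boundary directly on the input features and does not apply a kernel.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def poly2_feature_map(x):
    """Explicit 3-D feature map that corresponds to poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly2_kernel(x, z)
explicit_value = dot(poly2_feature_map(x), poly2_feature_map(z))
print(kernel_value, explicit_value)
```

The kernel never builds the higher-dimensional vectors, which is what keeps this approach practical when the feature space is very large.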

In [None]:
#@title
from sklearn.svm import LinearSVC

features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']

accuracy = {f:[] for f in features}
precision = {f:[] for f in features}
recall = {f:[] for f in features}
f1 = {f:[] for f in features}

for f in features:
  data.feature_type(f)
  kf = KFold(n_splits=5, shuffle=True)
  
  for train_index, test_index in kf.split(range(0, train_valid_split)):
    # get data
    x_train, y_train = data[train_index]
    x_test, y_test = data[test_index]

    # model fit and predict
    svc = LinearSVC(C=1.0, max_iter=1000)
    svc.fit(x_train, y_train)
    y_pred = svc.predict(x_test)
    
    # append scores/metrics
    accuracy[f].append(accuracy_score(y_test, y_pred))
    precision[f].append(list(precision_score(y_test, y_pred, average=None)))
    recall[f].append(list(recall_score(y_test, y_pred, average=None)))
    f1[f].append(list(f1_score(y_test, y_pred, average=None)))

## SVC - Accuracy
**accuracy**: exact matches between y_true and y_pred;

In [30]:
#@title
def plot_analysis(accuracy, precision, recall, f1, model_name):

  features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']
  class_names = ['negative', 'positive']
  class_colors = ['red', 'blue']

  mean_accuracy = []
  mean_recall = []
  mean_precision = []
  mean_f1 = []

  # create figures
  fig1 = make_subplots(rows=1, cols=1)
  fig2 = make_subplots(rows=1, cols=4, shared_yaxes=True, subplot_titles=(features))
  fig3 = make_subplots(rows=1, cols=4, shared_yaxes=True, subplot_titles=(features))
  fig4 = make_subplots(rows=1, cols=4, shared_yaxes=True, subplot_titles=(features))

  col_counter = 1
  row_counter = 1

  for f in features:
    fig1.add_trace(go.Box(y=accuracy[f], name=f), row=1, col=1)
    mean_accuracy.append(np.mean(accuracy[f]))

    precision_feature = pd.DataFrame(precision[f])
    mean_precision.append([precision_feature.iloc[:, i].mean() for i in range(0,2)])  # another FOR

    recall_feature = pd.DataFrame(recall[f])
    mean_recall.append([recall_feature.iloc[:, i].mean() for i in range(0,2)])  # another FOR
    
    f1_feature = pd.DataFrame(f1[f])
    mean_f1.append([f1_feature.iloc[:, i].mean() for i in range(0,2)]) # and another FOR
    
    for c in range(0,2):
      fig2.add_trace(go.Box(y=precision_feature.iloc[:, c], name=class_names[c], marker_color=class_colors[c]), row=row_counter, col=col_counter)
      fig3.add_trace(go.Box(y=recall_feature.iloc[:, c], name=class_names[c], marker_color=class_colors[c]), row=row_counter, col=col_counter)
      fig4.add_trace(go.Box(y=f1_feature.iloc[:, c], name=class_names[c], marker_color=class_colors[c]), row=row_counter, col=col_counter)


    col_counter += 1
    if col_counter % 4 == 1:
      row_counter = 2 
      col_counter = 1


  # layout is set once, after all traces have been added
  fig1.update_layout(title=f'{model_name} Accuracy', showlegend=False, autosize=False, width=800, height=500)
  fig2.update_layout(title=f'{model_name} Precision', showlegend=False, autosize=False, width=1200, height=400)
  fig3.update_layout(title=f'{model_name} Recall', showlegend=False, autosize=False, width=1200, height=400)
  fig4.update_layout(title=f'{model_name} F1 Score', showlegend=False, autosize=False, width=1200, height=400)

  # out of fors
  return fig1, fig2, fig3, fig4, mean_accuracy, mean_precision, mean_recall, mean_f1


fig1, fig2, fig3, fig4, mean_accuracy, mean_precision, mean_recall, mean_f1 = plot_analysis(accuracy, precision, recall, f1, 'SVC')
fig1.show()

## SVC - Precision 
**precision**: ```tp / (tp + fp)``` (intuitively) the ability not to label as positive a negative sample;

In [25]:
#@title
features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']
pd.DataFrame(mean_precision, columns=['negative', 'positive'], index=features).sort_values(by='negative', ascending=False)

Unnamed: 0,negative,positive
bow,0.817132,0.756506
svd_tfidf,0.780398,0.786984
tfidf,0.777085,0.798678
svd_bow,0.765017,0.754446


In [31]:
#@title
fig2.show()

## SVC - Recall
**recall**:```tp / (tp + fn)``` (intuitively) the ability to find all the positive samples;

In [28]:
#@title
features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']
pd.DataFrame(mean_recall, columns=['negative', 'positive'], index=features).sort_values(by='negative', ascending=False)

Unnamed: 0,negative,positive
tfidf,0.709661,0.851094
svd_tfidf,0.684921,0.853939
svd_bow,0.615272,0.862048
bow,0.605376,0.90199


In [32]:
#@title
fig3.show()

## SVC - F1 score 
**f1 score:** ```2 * (precision * recall) / (precision + recall)``` the harmonic mean of precision and recall;

In [34]:
#@title
features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']
pd.DataFrame(mean_f1, columns=['negative', 'positive'], index=features).sort_values(by='negative', ascending=False)

Unnamed: 0,negative,positive
tfidf,0.741451,0.823801
svd_tfidf,0.727673,0.818418
bow,0.692352,0.821317
svd_bow,0.681659,0.804546


In [35]:
fig4.show()

## SVC - Confusion Matrix

In [37]:
# report with valid data
data.feature_type('svd_tfidf')
x_train, y_train = data[:train_valid_split]
x_valid, y_valid = data[train_valid_split:]

svc = LinearSVC(C=1.0)
svc.fit(x_train, y_train)
y_pred = svc.predict(x_valid)

fig = go.Figure(data=go.Heatmap(z=confusion_matrix(y_valid, y_pred)))
fig.update_layout(title='SVC Confusion Matrix', showlegend=False, autosize=False, width=500, height=500)
fig.show()

## 3.3 XGBoost

![xgb.png](https://i.imgur.com/pRm7g3K.png)
[XGBoost: A Scalable Tree Boosting System (2016)](https://dl.acm.org/doi/pdf/10.1145/2939672.2939785)

In [None]:
#@title
from xgboost import XGBClassifier


features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']

accuracy = {f:[] for f in features}
precision = {f:[] for f in features}
recall = {f:[] for f in features}
f1 = {f:[] for f in features}

for f in features:
  data.feature_type(f)
  kf = KFold(n_splits=5, shuffle=True)
  
  for train_index, test_index in kf.split(range(0, train_valid_split)):
    print(f'{f}')
    # get data
    x_train, y_train = data[train_index]
    x_test, y_test = data[test_index]

    # model fit and predict
    xgb = XGBClassifier()
    xgb.fit(x_train, y_train)
    y_pred = xgb.predict(x_test)
    
    # append scores/metrics
    accuracy[f].append(accuracy_score(y_test, y_pred))
    precision[f].append(list(precision_score(y_test, y_pred, average=None)))
    recall[f].append(list(recall_score(y_test, y_pred, average=None)))
    f1[f].append(list(f1_score(y_test, y_pred, average=None)))

## XGB - Accuracy
**accuracy**: exact matches between y_true and y_pred;

In [40]:
#@title
fig1, fig2, fig3, fig4, mean_accuracy, mean_precision, mean_recall, mean_f1 = plot_analysis(accuracy, precision, recall, f1, 'XGBoost')
fig1.show()

## XGB - Precision 
**precision**: ```tp / (tp + fp)``` (intuitively) the ability not to label as positive a negative sample;

In [42]:
#@title
features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']
pd.DataFrame(mean_precision, columns=['negative', 'positive'], index=features).sort_values(by='negative', ascending=False)

Unnamed: 0,negative,positive
bow,0.814312,0.748928
tfidf,0.788124,0.75479
svd_tfidf,0.755589,0.736844
svd_bow,0.687418,0.718471


In [43]:
#@title
fig2.show()

## XGB - Recall
**recall**:```tp / (tp + fn)``` (intuitively) the ability to find all the positive samples;

In [44]:
#@title
features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']
pd.DataFrame(mean_recall, columns=['negative', 'positive'], index=features).sort_values(by='negative', ascending=False)

Unnamed: 0,negative,positive
tfidf,0.609976,0.879571
bow,0.58967,0.899423
svd_tfidf,0.578277,0.864783
svd_bow,0.565473,0.812069


In [45]:
#@title
fig3.show()

## XGB - F1 score 
**f1 score:** ```2 * (precision * recall) / (precision + recall)``` the harmonic mean of precision and recall;

In [46]:
#@title
features = ['bow', 'svd_bow', 'tfidf', 'svd_tfidf']
pd.DataFrame(mean_f1, columns=['negative', 'positive'], index=features).sort_values(by='negative', ascending=False)

Unnamed: 0,negative,positive
tfidf,0.685239,0.811412
bow,0.683785,0.81723
svd_tfidf,0.654876,0.79541
svd_bow,0.620114,0.762185


In [47]:
#@title
fig4.show()

## XGB - Confusion Matrix

In [51]:
#@title
# report with valid data
data.feature_type('svd_tfidf')
x_train, y_train = data[:train_valid_split]
x_valid, y_valid = data[train_valid_split:]

xgb = XGBClassifier()
xgb.fit(x_train, y_train)
y_pred = xgb.predict(x_valid)

fig = go.Figure(data=go.Heatmap(z=confusion_matrix(y_valid, y_pred)))
fig.update_layout(title='XGBoost Confusion Matrix', showlegend=False, autosize=False, width=500, height=500)
fig.show()