# Basic tutorial: text classification
#### Author: Matteo Caorsi

This short tutorial introduces the basics of *text classification* using *giotto-deep*.

## Scope

Text classification, of which **sentiment analysis** is a well-known example, consists in classifying text excerpts into predefined classes. For example, one may try to classify movie reviews into five classes, each corresponding to a different level of enjoyment expressed by the author:

| Movie review | liking score |
|----|-----|
| This movie makes me cry! It's so moving! | 5/5 |
| I hate this movie: bad acting and no plot! Definitely not recommended. | 1/5 |

## Content

The main steps of the tutorial are the following:
 1. Create a dataset and preprocess it
 2. Create a model
 3. Define metrics and losses
 4. Run benchmarks
 5. Visualise results interactively

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import numpy as np
from gtda.diagrams import BettiCurve
from gtda.plotting import plot_betti_surfaces
import torch
from torch import nn
from torch.utils.data.sampler import SubsetRandomSampler
from torch.optim import Adam, SparseAdam, SGD
from torchvision import transforms

from gdeep.models import FFNet
from gdeep.visualization import persistence_diagrams_of_activations
from gdeep.data.datasets import DatasetBuilder
from gdeep.data import PreprocessingPipeline
from gdeep.data import TransformingDataset
from gdeep.data.preprocessors import Normalization, TokenizerTextClassification
from gdeep.data.datasets import DataLoaderBuilder
from gdeep.trainer import Trainer
from gdeep.analysis.interpretability import Interpreter
from gdeep.visualization import Visualiser
from gdeep.models import ModelExtractor
from gdeep.search import GiottoSummaryWriter


# Initialize the tensorboard writer

In order to analyse the results of your models, you need to start tensorboard.
In a terminal, move into the `/examples` folder and run the following command:

```
tensorboard --logdir=runs
```

After the training, open [http://localhost:6006/](http://localhost:6006/) to see all the visualization results.

In [None]:
writer = GiottoSummaryWriter()


# Create your dataset

In this section we will import the [AG_NEWS](https://pytorch.org/text/stable/datasets.html#ag-news) dataset. This dataset contains news articles, each classified into one of four macro categories (World, Sports, Business and Sci/Tech).

In [None]:
# many times we get an IterableDataset, which is good for memory consumption but cannot be subsampled!
# converting it to a map-style dataset lets us sample individual items and build batches from them:
bd = DatasetBuilder(name="AG_NEWS", convert_to_map_dataset=True)
ds_tr_str, ds_val_str, ds_ts_str = bd.build()


Each sample is presented as a tuple of the form `(label, text)`.

In [None]:
print("Before the preprocessing: \n", ds_tr_str[0])
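
Indexing with `ds_tr_str[0]` is possible only because the dataset was converted to a map-style dataset. The difference from an iterable-style dataset can be illustrated with a minimal sketch in plain Python; the function `stream_samples` and the three samples are made up for illustration and are not taken from AG_NEWS or from the giotto-deep API.

```python
from itertools import islice


def stream_samples():
    """Hypothetical iterable-style source: yields (label, text) one at a time."""
    yield (3, "Wall St. rallies as oil prices fall")
    yield (2, "Home team wins the championship final")
    yield (4, "New chip doubles laptop battery life")


# An iterator can only be consumed in order: no len(), no indexing.
iterable_ds = stream_samples()
first_two = list(islice(iterable_ds, 2))

# Materialising the stream into a list gives a map-style dataset:
# it supports len() and random access, which index-based samplers rely on.
map_ds = list(stream_samples())
subset = [map_ds[i] for i in (2, 0)]

print(len(first_two), len(map_ds))
print(subset[0])
```

The price of the conversion is memory: the whole dataset has to be held at once, which is the trade-off mentioned in the comment of the dataset cell.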


## Required preprocessing

Neural networks cannot directly deal with strings. We first have to preprocess the dataset in three main steps:
 1. Tokenise each string into its words (and possibly normalise each word, e.g. by removing plurals, interjections, capital letters...)
 2. Build a vocabulary out of these tokens (each normalised word from step 1 is called a token)
 3. Embed each token into a vector, so that each sentence becomes a list of vectors

The **first two steps** are performed by the `TokenizerTextClassification` class. The embedding will be added directly to the model (see below).
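
To make the first two steps concrete, here is a minimal sketch in plain Python. It is not how `TokenizerTextClassification` is implemented: the helper names `tokenize`, `build_vocabulary` and `encode`, the regular expression and the `<unk>` convention are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(texts, unk_token="<unk>"):
    """Map each distinct token to an integer; index 0 is kept for unknown tokens."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {unk_token: 0}
    for token, _ in counts.most_common():  # most frequent tokens get the smallest indices
        vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, unk_token="<unk>"):
    """Turn a string into a list of token indices."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokenize(text)]


corpus = ["This movie makes me cry!", "I hate this movie: bad acting and no plot!"]
vocab = build_vocabulary(corpus)
ids = encode("This movie has no dragons", vocab)
print(ids)
```

Words that never appeared in the corpus (here "has" and "dragons") are mapped to the unknown index, so the encoded sentence always has one integer per token.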

In [None]:
ptd = TokenizerTextClassification()

# in case you need to combine multiple preprocessing:
# ppp = PreprocessingPipeline(((PreprocessTextData(), IdentityTransform(), TextDataset),
#                             (Normalisation(), IdentityTransform(), BasicDataset)))


ptd.fit_to_dataset(ds_tr_str)
transformed_textds = ptd.attach_transform_to_dataset(ds_tr_str)

transformed_textts = ptd.attach_transform_to_dataset(ds_val_str)

print("After the preprocessing: \n", transformed_textds[0])

# the only part of the training/test set we are interested in
train_indices = list(range(64 * 20))
test_indices = list(range(64 * 10))

dl_tr2, dl_ts2, _ = DataLoaderBuilder((transformed_textds, transformed_textts)).build(
    (
        {"batch_size": 16, "sampler": SubsetRandomSampler(train_indices)},
        {"batch_size": 16, "sampler": SubsetRandomSampler(test_indices)},
    )
)
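
The samplers in the cell above restrict training and validation to a subset of indices, visited in random order. The following sketch imitates that behaviour in plain Python; it is not PyTorch's implementation, and the function `subset_random_batches` with its `seed` parameter is a hypothetical helper.

```python
import random


def subset_random_batches(indices, batch_size, seed=None):
    """Yield batches of indices drawn without replacement, in random order."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


train_indices = list(range(64 * 20))  # same subset size as in the cell above
batches = list(subset_random_batches(train_indices, batch_size=16, seed=0))

print(len(batches))     # number of batches per epoch
print(len(batches[0]))  # size of one batch
```

With 1280 indices and a batch size of 16, one pass over the subset consists of 80 batches, and every index is used exactly once.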


## Define and train your model

Our model is going to be a simple embedding layer (i.e. a layer whose goal is to transform a list of tokens into a list of vectors), followed by a simple feed-forward layer.

Furthermore, since the output of the embedding layer has three dimensions (batch, tokens, embedding), we need to remove one before the tensor can be fed to the final feed-forward layer: in this very simple tutorial, this is done by averaging over the token dimension.
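
The effect of the averaging on the shapes can be checked by hand. The sketch below uses nested Python lists instead of tensors so that the arithmetic stays visible; the embedding table and the token indices are made-up values, and `embed` and `mean_pool` are illustrative helpers that mirror what an embedding lookup followed by `torch.mean(embedded, dim=1)` does.

```python
def embed(token_ids, table):
    """Look up one vector per token index, as an embedding layer does."""
    return [table[i] for i in token_ids]


def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical embedding table: 4 tokens, embedding dimension 2.
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]

# A batch of 2 sentences with 3 tokens each.
batch = [[1, 2, 3], [1, 1, 0]]

# After the lookup: shape (batch=2, tokens=3, embed_dim=2).
embedded = [embed(sentence, table) for sentence in batch]

# After averaging over the token axis: shape (batch=2, embed_dim=2).
pooled = [mean_pool(sentence) for sentence in embedded]
print(pooled)
```

Each sentence is reduced to a single vector of size `embed_dim`, whatever its number of tokens, which is exactly the input a linear layer expects.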

In [None]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text):
        embedded = self.embedding(text)
        mean = torch.mean(embedded, dim=1)
        return self.fc(mean)


In the next cell we initialise the model parameters and the model as well.

In [None]:
vocab_size = len(ptd.vocabulary)
emsize = 64
# print(vocab_size, emsize)
model = TextClassificationModel(vocab_size, emsize, 4)
print(model)


# Training the model

We are finally ready to train our model: we have already built a dataset that is compatible with the model. We also know that the task is a classification task, hence we can use the cross-entropy loss!

Let's see how in giotto-deep, in only a couple of lines, one can run the training.

In [None]:
loss_fn = nn.CrossEntropyLoss()

pipe = Trainer(model, (dl_tr2,), loss_fn, writer)

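# train the model with SGD for 7 epochs, without cross-validation;
# the two dictionaries hold the optimiser and the dataloader parameters
pipe.train(SGD, 7, False, {"lr": 0.01}, {"batch_size": 20})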
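
For reference, the cross-entropy loss of a single sample is the negative log-probability that the softmax of the model output assigns to the correct class. The sketch below computes it in plain Python for made-up logits over four classes; `cross_entropy` is an illustrative helper, and `nn.CrossEntropyLoss` additionally averages this quantity over the batch by default.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtracting the maximum keeps exp() numerically stable
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [2.0, 0.5, 0.1, -1.0]  # hypothetical model output for one sample

loss_right = cross_entropy(logits, 0)   # the largest logit is the true class
loss_wrong = cross_entropy(logits, 3)   # the smallest logit is the true class
loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)

print(loss_right, loss_wrong, loss_uniform)
```

The loss is small when the model favours the correct class and large otherwise. A model that cannot tell the four classes apart scores `log(4)`, roughly 1.386, which is a useful baseline when reading the training curves.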


# Simply use interpretability tools

In the next two cells we use interpretability tools on the model to see what our model has learned and how. Basically, we want to understand how each word in a sentence contributes to the classification choice.

But first, how about sending the model structure to tensorboard for an interactive inspection? It's as simple as writing the code in the next cell.

In [None]:
vs = Visualiser(pipe)
vs.plot_interactive_model()


In [None]:
# Interpretability!

inter = Interpreter(pipe.model, method="LayerIntegratedGradients")

inter.interpret_text(
    "I am writing about money and business",
    0,
    ptd.vocabulary,
    ptd.tokenizer,
    layer=pipe.model.embedding,
    n_steps=500,
    return_convergence_delta=True
)


vs.plot_interpreter_text(inter)
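
To give an intuition of what an integrated-gradients attribution computes, here is a minimal sketch on a toy model, written in plain Python. It is only an illustration under stated assumptions: the model is a hypothetical logistic unit with made-up weights, the integral is approximated with a midpoint rule, and `delta` is defined here as the gap between the sum of the attributions and the change in the model output. The library used by `Interpreter` may use a different approximation and sign convention.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy model: a single logistic unit."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient(x, w, b):
    """Gradient of the toy model with respect to its input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, n_steps=500):
    """Average the gradient along the straight path from baseline to x."""
    total = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps  # midpoint rule
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        total = [t + g for t, g in zip(total, gradient(point, w, b))]
    attributions = [(xi - bi) * t / n_steps for xi, bi, t in zip(x, baseline, total)]
    # Completeness check: attributions should sum to f(x) - f(baseline).
    delta = sum(attributions) - (model(x, w, b) - model(baseline, w, b))
    return attributions, delta


w, b = [2.0, -1.0, 0.0], 0.0  # hypothetical weights
x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
attributions, delta = integrated_gradients(x, baseline, w, b)
print(attributions, delta)
```

The first feature pushes the output up, the second pushes it down and the third, having zero weight, receives no attribution. A `delta` close to zero indicates that the number of steps is large enough for the approximation to be trusted.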
