### Text Classification: Sentiment Analysis

Hello there! This is an example of sentiment analysis based on the second chapter of the [book](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/) Natural Language Processing with Transformers (and its [GitHub](https://github.com/nlp-with-transformers) repository). This notebook aims to extract the most relevant elements of the book's first end-to-end, hands-on chapter, which uses the 🤗 Hugging Face ecosystem. I might change a thing or two for my own experiments, since I am using this material for personal learning.

#### Load Data

We start by loading the data from the 🤗 Datasets library. We are using the `emotion` dataset, which contains tweets written in English. This dataset was proposed in a [paper](https://aclanthology.org/D18-1404/) by E. Saravia <i>et al.</i>, in which the authors not only collected the data but also labeled it with the sentiment inferred from hashtags.

In [None]:
from datasets import load_dataset

dataset = load_dataset("emotion")

#### EDA

We now perform a small exploratory data analysis (EDA) to examine the corpus.

In [None]:
print("This is the dataset object:\n", dataset)
print("__________________________________________________________")
print("This is the dataset object type:\n", type(dataset))
print("__________________________________________________________")
print("The dataset is split into train, validation and test. Each partition (train, for instance) contains:\n", dataset["train"])
print("__________________________________________________________")
print("And the first element of the train partition is:\n", dataset["train"][0])
print("__________________________________________________________")
print(f"Splitting it into text and label, we get: {dataset['train'][0]['text']} and {dataset['train'][0]['label']}")

We can inspect the object's `features` attribute to see the column types and the label information, and use the `set_format` method to switch the output format to `pandas`, which makes it easier to compute statistics over the corpus.

In [None]:
dataset["train"].features

In [None]:
dataset.set_format("pandas")
df = dataset["train"][:]  # Slicing the whole split returns a pandas DataFrame
df.head()

We can map a label integer to its respective class name by using the `.int2str` method of the `label` feature.

In [None]:
dataset["train"].features["label"].int2str(0)
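
For intuition, `int2str` behaves like a lookup into an ordered list of class names. The sketch below is a plain-Python stand-in, not the 🤗 Datasets implementation. The `LABEL_NAMES` list is an assumption about the label order of `emotion`, so it is worth checking against `dataset["train"].features["label"].names`.

```python
# Assumed label order for the `emotion` dataset; verify it against
# dataset["train"].features["label"].names before relying on it.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def int2str(label):
    """Map an integer class id to its name (hypothetical stand-in)."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"Invalid integer class label {label}")
    return LABEL_NAMES[label]


def str2int(name):
    """Inverse mapping: class name back to its integer id."""
    return LABEL_NAMES.index(name)


print(int2str(0))
print(str2int("joy"))
```

Keeping the mapping as an ordered list means the integer id is simply the position of the name, which is why the inverse lookup is a plain `index` call.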

Applying to the whole dataset, we get:

In [None]:
def label_int2str(label):
    return dataset["train"].features["label"].int2str(label)

df["label_name"] = df["label"].apply(label_int2str)
df.head()

To visualize the dataset, we can plot how balanced (or not) the classes are and the distribution of words per tweet for each sentiment:

In [None]:
import matplotlib.pyplot as plt

df["label_name"].value_counts(ascending=True).plot(kind="barh")
plt.show()

In [None]:
df["Words per tweet"] = df["text"].str.split().apply(len)
df.boxplot("Words per tweet", by="label_name", grid=False)
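
The word count above is a whitespace split followed by `len`, and the box plot then summarizes those counts per label. A minimal plain-Python sketch of the same computation follows. The rows are invented toy strings, not samples from the dataset, and only the median per label is computed here.

```python
from collections import defaultdict
from statistics import median

# Invented toy rows (text, label_name); not taken from the dataset.
rows = [
    ("i feel so happy today", "joy"),
    ("i am thrilled", "joy"),
    ("i feel completely alone", "sadness"),
    ("i miss her", "sadness"),
    ("i feel very low and tired", "sadness"),
]

words_per_label = defaultdict(list)
for text, label_name in rows:
    # str.split() with no argument splits on any run of whitespace,
    # mirroring df["text"].str.split() applied row by row.
    words_per_label[label_name].append(len(text.split()))

# The median is the line drawn inside each box of the box plot.
medians = {label: median(counts) for label, counts in words_per_label.items()}
print(medians)
```

Note that this counts whitespace-separated words, not model tokens, so it only gives a rough idea of how long the sequences will be after tokenization.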

#### Tokenization

Now that the data has been visualized and we have a grasp of its contents, we can move toward training. To do that, we need a tokenizer, which splits text into tokens and converts each token into an integer id that the model can process. 🤗 Transformers provides `AutoTokenizer`, which loads the tokenizer associated with the model checkpoint we are going to train. In this tutorial, we follow the book and use `DistilBERT`.

In [None]:
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

*Important*: On page 35, the authors stress that, when working with a pretrained model, we must use the same tokenizer that the model was trained with. A different tokenizer can split the text differently and assign different ids to the same tokens, so the model would receive inputs that do not match the vocabulary it learned.
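
A toy illustration of why the pairing matters: the two vocabularies below are invented and contain the same tokens in a different order. Because a model looks up embeddings by id, ids produced with one vocabulary mean something else under the other. This is only a sketch with a whitespace tokenizer, not how `AutoTokenizer` works internally.

```python
# Two invented vocabularies that hold the same tokens in a different order.
vocab_a = {"[UNK]": 0, "i": 1, "love": 2, "nlp": 3}
vocab_b = {"[UNK]": 0, "nlp": 1, "i": 2, "love": 3}


def encode(text, vocab):
    """Lowercase, split on whitespace, and look up ids ([UNK] if missing)."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]


def decode(ids, vocab):
    """Turn ids back into text using the given vocabulary."""
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


ids_a = encode("I love NLP", vocab_a)
ids_b = encode("I love NLP", vocab_b)
print(ids_a, ids_b)

# Reading ids produced with vocab_b through vocab_a's table scrambles the text.
print(decode(ids_b, vocab_a))
```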

Now, we test the tokenizer on one sentence to see its behavior:

In [None]:
text = "Tokenizing text is a core task of NLP."
encoded_text = tokenizer(text)
print(encoded_text)
print('__________________________________________________________')
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
print('__________________________________________________________')
print(tokenizer.convert_tokens_to_string(tokens))
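
Tokens printed with a `##` prefix are subword continuations: DistilBERT's tokenizer is based on WordPiece, which splits words that are not in the vocabulary into smaller known pieces. The sketch below shows the greedy longest-match-first idea for a single word. The `wordpiece` helper and `toy_vocab` are hypothetical names made up for this illustration, and the real tokenizer also handles normalization, punctuation, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Invented toy vocabulary; the real one has about 30,000 entries.
toy_vocab = {"token", "##izing", "text", "nl", "##p"}

print(wordpiece("tokenizing", toy_vocab))
print(wordpiece("nlp", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```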

Now we proceed with applying tokenization to the whole dataset:

In [None]:
def tokenize(batch):
    # padding=True pads every sequence to the longest one in the batch
    # truncation=True cuts sequences down to the model's maximum context size
    return tokenizer(batch["text"], padding=True, truncation=True)

dataset.reset_format()  # Reset the format from pandas back to the default one
print(tokenize(dataset["train"][:2]))

# batch_size=None sends each split through tokenize as one single batch
dataset_encoded = dataset.map(tokenize, batched=True, batch_size=None)

print(dataset_encoded["train"].column_names) # We got two more "columns" in our dataset: input_ids and attention_mask
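
To make the two new columns concrete, here is a minimal sketch of what `padding=True` and `truncation=True` amount to for a batch of token ids. The `pad_and_truncate` helper, the padding id `0`, and the ids themselves are assumptions for illustration. The real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
PAD_ID = 0  # assumed padding id for this sketch


def pad_and_truncate(batch_ids, max_length):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]  # made-up ids
out = pad_and_truncate(batch, max_length=6)
print(out["input_ids"])
print(out["attention_mask"])
```

After this step every row has the same length, which is what allows a batch to be stacked into one rectangular tensor.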

#### Training

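There are two ways to use the pretrained model for classification:
- Feature extraction: use the model's hidden states as features and train a separate classifier on them, leaving the pretrained weights untouched.
- Fine-tuning: train the whole model end to end, updating the pretrained weights as well.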

##### Feature Extraction

In [None]:
from transformers import AutoModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
model = AutoModel.from_pretrained(model_ckpt).to(device)

In [None]:
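# With TensorFlow, use TFAutoModel instead of AutoModel.
# Some checkpoints are released for only one framework. In that case, pass
# from_pt=True (to TFAutoModel) or from_tf=True (to AutoModel) in
# from_pretrained so that the weights are converted.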

In [None]:
text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")
inputs['input_ids'].size()

In [None]:
# Dimensions: 