# Training a sentiment classifier

Task: Create a function that takes a text string as input and outputs its sentiment (positive or negative).

In [None]:
import evaluate
import numpy as np
import pandas as pd
import gradio as gr
from datasets import load_dataset
from transformers import AutoConfig, AutoTokenizer, DataCollatorWithPadding, pipeline
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

pd.options.display.max_colwidth = 300

## Training data

For training a sentiment classifier, we need a dataset that contains text documents and sentiment labels (positive or negative).

Here we use movie reviews from the Internet Movie Database (IMDb). Each review consists of the review text and a rating (originally 1 to 10 stars, converted to positive/negative here).
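
As a rough illustration of how star ratings can be turned into binary labels: according to the dataset's description, as far as we know, ratings of 4 or lower count as negative, ratings of 7 or higher count as positive, and the neutral ones in between are left out. The function `rating_to_label` is a hypothetical helper written for this sketch, not part of the `datasets` library; the label ids follow the 0 = negative, 1 = positive convention used later in this notebook.

```python
from typing import Optional

def rating_to_label(stars: int) -> Optional[int]:
    """Map a 1-10 star rating to a binary sentiment label (hypothetical helper)."""
    if not 1 <= stars <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {stars}")
    if stars <= 4:
        return 0  # NEGATIVE
    if stars >= 7:
        return 1  # POSITIVE
    return None  # neutral ratings (5-6) are assumed to be excluded

labels = [rating_to_label(s) for s in (1, 4, 5, 7, 10)]
print(labels)
```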

A helper function from the `datasets` library takes care of downloading the data from the Internet.

We don't need the whole dataset. Let's take a random sample.
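
Shuffling with a fixed seed and then keeping the first few rows gives a random sample that is the same on every run. Here is a minimal pure-Python sketch of that idea. The helper `take_random_sample` is hypothetical, and the `datasets` library uses its own random generator, so the rows it selects will differ from the ones picked here.

```python
import random

def take_random_sample(items, n, seed=42):
    """Shuffle a copy of `items` with a fixed seed and keep the first `n` (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n]

reviews = [f"review {i}" for i in range(100)]
sample_a = take_random_sample(reviews, 5)
sample_b = take_random_sample(reviews, 5)

# The same seed yields the same sample on every call.
print(sample_a == sample_b)
```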

In [None]:
imdb = load_dataset("imdb")

imdb_small_train = imdb['train'].shuffle(seed=42).select(range(1000))
imdb_small_test = imdb['test'].shuffle(seed=42).select(range(500))

What does the data look like?

In [None]:
imdb_small_train.select(range(10)).to_pandas()

## Training a model

We don't train the sentiment classifier model from scratch. Instead, we reuse a [pre-trained language model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) that has been trained on very large text collections to predict the similarity of sentences.

We only need to fine-tune it for the sentiment prediction task.

A minor technical detail: the pre-trained model expects a list of token indices as input, not raw text. We therefore load and apply the tokenizer that matches the model.
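
To make "token indices" concrete, here is a toy sketch of what a tokenizer does. The vocabulary, the ids, and the function `toy_tokenize` are made up for illustration. The real tokenizer has a much larger vocabulary and, as far as we know, splits rare words into subword pieces instead of mapping them to an unknown token.

```python
# Toy vocabulary; the ids are made up for illustration.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "movie": 5, "was": 6, "awesome": 7}

def toy_tokenize(text):
    """Lowercase, split on whitespace, and map each word to its id."""
    words = text.lower().split()
    ids = [vocab.get(word, vocab["[UNK]"]) for word in words]
    # Wrap the sequence in start and end markers, as BERT-style tokenizers do.
    input_ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_tokenize("The movie was great")
print(encoded)
```

The word "great" is not in the toy vocabulary, so it is mapped to the id of `[UNK]`.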

In [None]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
tokenizer('I wonder what this sentence looks like when tokenized?')

Next, preprocess the IMDB dataset by applying the tokenizer to each document.
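
Two details of this step are worth a closer look: `truncation=True` cuts off reviews that are longer than the model's maximum input length, and the `DataCollatorWithPadding` used during training pads each batch to the length of its longest sequence. The sketch below imitates both with plain lists. The functions `truncate` and `pad_batch`, the maximum length of 5, and the pad id 0 are assumptions chosen for illustration, and the real tokenizer additionally keeps its end-of-sequence marker when truncating.

```python
def truncate(ids, max_length):
    """Keep at most `max_length` token ids."""
    return ids[:max_length]

def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

raw = ([11, 12, 13, 14, 15, 16, 17], [21, 22, 23])
batch = [truncate(ids, max_length=5) for ids in raw]
padded = pad_batch(batch)
print(padded)
```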

In [None]:
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True)

tokenized_imdb_train = imdb_small_train.map(preprocess_function, batched=True)
tokenized_imdb_test = imdb_small_test.map(preprocess_function, batched=True)

We want to fine-tune the pre-trained model to output POSITIVE or NEGATIVE. Here we configure a two-class classifier on top of the pre-trained language model.

In [None]:
config = AutoConfig.from_pretrained('sentence-transformers/all-MiniLM-L6-v2',
                                    num_labels=2,
                                    id2label={0: 'NEGATIVE', 1: 'POSITIVE'},
                                    label2id={'NEGATIVE': 0, 'POSITIVE': 1})

model = AutoModelForSequenceClassification.from_pretrained('sentence-transformers/all-MiniLM-L6-v2', config=config)

Now we train the sentiment classifier.
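
During training, the `compute_metrics` function in the next cell turns the model's raw output scores (logits) into class predictions with an argmax and compares them to the true labels. A small worked example in plain Python shows what that computes. The logits and labels are made up, and `argmax` and `accuracy_score` are hypothetical stand-ins for `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_score(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# One row of two scores (NEGATIVE, POSITIVE) per review; values are made up.
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 0.4], [1.7, 0.3]]
labels = [0, 1, 0, 0]

predictions = [argmax(row) for row in logits]
print(predictions, accuracy_score(predictions, labels))
```

Three of the four made-up predictions match their labels, so the accuracy is 0.75.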

In [None]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir='./models',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    report_to='none',
    eval_steps=100,
    save_steps=100,
    evaluation_strategy='steps',  # newer transformers releases call this eval_strategy
    save_strategy='steps',
    metric_for_best_model='accuracy',
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb_train,
    eval_dataset=tokenized_imdb_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

In [None]:
# Optional: load a saved checkpoint instead of using the model still in memory
# model = AutoModelForSequenceClassification.from_pretrained('models/checkpoint-300')
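
The checkpoint number in the commented-out line can be sanity-checked with a little arithmetic. The sketch assumes training on a single device, no gradient accumulation, and that the last, partially filled batch of each epoch is kept; with a different setup the numbers change.

```python
import math

num_examples = 1000   # size of imdb_small_train
batch_size = 16       # per_device_train_batch_size
num_epochs = 5        # num_train_epochs
save_steps = 100      # save_steps (same value as eval_steps)

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Steps at which a checkpoint is written every `save_steps` steps.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
print(steps_per_epoch, total_steps, checkpoint_steps)
```

Under these assumptions training runs for 315 steps, so `checkpoint-300` is the last checkpoint written at a multiple of `save_steps`.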

## The sentiment function

We have just trained a model that predicts whether a text has positive or negative sentiment.

Let's package the tokenizer and the trained model into a simple function called `sentiment` that takes a text string as input and outputs the predicted sentiment (and also a score that indicates how certain the prediction is).
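
A rough sketch of where the label and the score come from: the model outputs one raw score (logit) per class, and, as far as we understand the `pipeline` defaults for a two-class model, the reported score is the softmax probability of the predicted class. The logits below are made up, and `softmax` and `to_prediction` are hypothetical helpers, not the library's implementation.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

prediction = to_prediction([-1.2, 2.3])
print(prediction)
```

A score close to 1 means the model is confident in its prediction; a score near 0.5 means the two classes are almost equally likely.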

In [None]:
sentiment = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [None]:
sentiment('The movie was awesome!!!')

In [None]:
sentiment('Acting was bad and the plot was horrible')

## A simple UI for testing

In [None]:
def sentiment_wrapper(text):
    return str(sentiment(text)[0])

app = gr.Interface(fn=sentiment_wrapper, inputs="text", outputs="text")
app.launch()