The button below opens this notebook in Google Colab.

<a href="https://colab.research.google.com/github/anjisun221/css_codes/blob/main/ay21t1/Lab05_text_classification/Lab05_text_classification%20-%20Students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Lab 5 - Text classification

In this lab, you will learn:
* How to use a pre-trained model to classify text
* How to fine-tune the model to build a classification model
* How to evaluate the performance of a model

This lab is written by Jisun AN (jisunan@smu.edu.sg) and Michelle KAN (michellekan@smu.edu.sg).


#### **[Important]** Change Runtime Type to GPU

For this lab, you need to use "**GPU**." On the Menu, click "**Runtime**" --> "**Change Runtime Type**" and select Hardware Accelerator as "**GPU**"!!

# 0. Import Packages

In [None]:
# Packages for data analysis
import pandas as pd
import numpy as np

# Packages for train/test dataset split
from sklearn.model_selection import train_test_split

# Uncomment below if you want to see errors in more detail.
# import os 
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1"


# 1. Getting the data

In this lab, we will use restaurant review data. 

The data was manually annotated by humans for aspect and sentiment. 

One review may cover two or more aspects and thus carry two or more sentiments. 

Note that we excluded reviews with conflicting sentiment.

`restaurant_reviews_v2.tsv` is a tab-separated file with the following fields: 

- `sid` is review id
- `text` is a review
- `aspect` refers to the area of interest the review is about. It is one of these four labels: <i>food, service, ambience, price</i> 
- `sentiment` consists of one of these labels: <i>positive, negative, neutral</i>



In [None]:
df = pd.read_table("https://raw.githubusercontent.com/anjisun221/css_codes/main/ay21t1/Lab05_text_classification/restaurant_reviews_v2.tsv", sep="\t")
print(df.shape)
df.head()


#### Q1. How many rows are 'positive'? 

In [None]:
# Your code here 

Your answer: 

#### Q2. How many rows are about 'price'? 

In [None]:
# Your code here 

Your answer:

#### Q3. What's the text of the 10th row?

In [None]:
# Your code here

Your answer:

![Hugging Face](https://huggingface.co/front/assets/huggingface_logo-noborder.svg)

# 2. Transformer and Hugging Face

A transformer is a deep learning model built on the attention mechanism. It has enabled major breakthroughs in many Natural Language Processing (NLP) tasks, such as text classification, question answering, and machine translation. 

You can think of it as a model that knows a lot about language. A pre-trained model has been trained on a large amount of text (billions of sentences), so it can be fine-tuned for downstream tasks such as text classification. 

[Hugging Face](https://huggingface.co/) is an NLP-focused startup with a large open-source community, in particular around the Transformers library. 🤗 Transformers is a Python library that exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2, and DistilBERT, which obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. These architectures come pre-trained with several sets of weights. Getting started with Transformers only requires installing the pip package `transformers`.
 


Let's install the Transformers and Datasets libraries to run this notebook.

In [None]:
# this will take some time
!pip install datasets transformers[sentencepiece]

The most basic object in the 🤗 Transformers library is the `pipeline`. It connects a model with its necessary preprocessing and postprocessing steps (e.g., tokenizing), allowing us to directly input any text and get an intelligible answer. 

Below is one example of a text classification pipeline, `sentiment classification`. 

The text classification pipeline can currently be loaded from `pipeline()` using the following task identifier: `sentiment-analysis` (for classifying sequences according to positive or negative sentiments).

The models that this pipeline can use are models that have been fine-tuned on a sequence classification task. See the up-to-date list of available models on [huggingface.co/models](https://huggingface.co/models).





In [None]:
# Use the GPU if Colab has one available; otherwise fall back to the CPU
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

### Text classification with Hugging Face's Pipeline

In [None]:
from transformers import pipeline

Do you want to classify your sentence based on its sentiment? 

It's two lines of code with the Hugging Face pipeline!

In [None]:
# Load the "sentiment prediction" model.
classifier = pipeline("sentiment-analysis", device = 0)

# input: sentence, output: sentiment label and score
classifier("I've been waiting for a HuggingFace course my whole life.")

We can even pass several sentences!



In [None]:
classifier([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!"
])

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.

There are three main steps involved when you pass some text to a pipeline:

1. The text is preprocessed into a format the model can understand (e.g., tokenization, which splits a sentence into a list of tokens, and vectorization, which maps each token to a numeric value).
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.
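
The three steps can be sketched with a toy, pure-Python pipeline. This is only an illustration of the flow: the word weights, the function names (`preprocess`, `toy_model`, `postprocess`, `toy_pipeline`), and the scoring rule are all invented here and have nothing to do with the neural model that the real pipeline downloads.

```python
import math

# Toy vocabulary with hand-picked sentiment weights (invented for illustration).
WORD_WEIGHTS = {"love": 2.0, "great": 1.5, "hate": -2.0, "horrible": -1.5}

def preprocess(text):
    """Step 1: lowercase, replace punctuation with spaces, and split into tokens."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()

def toy_model(tokens):
    """Step 2: return raw scores (logits) for [negative, positive]."""
    score = sum(WORD_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return [-score, score]

def postprocess(logits):
    """Step 3: turn logits into probabilities and report the most likely label."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))

print(toy_pipeline("I hate this so much!"))
print(toy_pipeline("I love this place, great food."))
```

The output has the same shape as the real classifier's result (a `label` and a `score`), which is why the post-processing step matters: it is what turns raw numbers into something readable.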

In [None]:
# let's check back our data
df.head()

Since we have a ready-made model, we will use our data to evaluate the model.

In [None]:
# We extract text and label to build the model

sentences = list(df['text'].values) # text will be the input to the model
y_sentiment_true = list(df['sentiment'].values) # true labels (sentiment)
y_aspect_true = list(df['aspect'].values) # true labels (aspect)


In [None]:
# sentences is a list of texts of our review data
sentences[0]

In [None]:
sentiment_classifier = pipeline("sentiment-analysis", device = 0)
sentiment_result = sentiment_classifier(sentences)
sentiment_result[0]


To evaluate the prediction results with true labels, we extract the labels from the results. 

Since our true labels are 'positive' and 'negative' but the predicted labels are in capital letters ('POSITIVE' and 'NEGATIVE'), we need to convert them to lowercase.

In [None]:
# extract the predicted labels using list comprehension 
y_sentiment_pred = [result['label'].lower() for result in sentiment_result]
y_sentiment_pred[0]

We will evaluate the classification result using the four metrics below: 

* Accuracy: the percentage of texts that were categorized with the correct tag.
* Precision: the percentage of examples the classifier got right out of the total number of examples that it predicted for a given tag.
* Recall: the percentage of examples the classifier predicted for a given tag out of the total number of examples it should have predicted for that given tag.
* F1 Score: the harmonic mean of precision and recall.
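
To make these definitions concrete, here is a small hand computation on five made-up labels (they are not taken from the lab data). The helper name `tag_metrics` is hypothetical; it simply counts matches the way the definitions above describe.

```python
def tag_metrics(y_true, y_pred, tag):
    """Precision, recall, and F1 for one tag, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == tag and p == tag)
    predicted = sum(1 for p in y_pred if p == tag)   # everything the classifier called `tag`
    actual = sum(1 for t in y_true if t == tag)      # everything that really is `tag`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for five reviews.
toy_true = ["positive", "positive", "positive", "negative", "negative"]
toy_pred = ["positive", "positive", "negative", "negative", "positive"]

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
precision, recall, f1 = tag_metrics(toy_true, toy_pred, "positive")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.60 precision=0.67 recall=0.67 f1=0.67
```

Three of the five predictions are correct (accuracy 0.60). For the 'positive' tag, two of the three predicted positives are right (precision 2/3) and two of the three true positives are found (recall 2/3).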

To compute these metrics on our data, we will use scikit-learn's `classification_report` function.

In [None]:
from sklearn.metrics import classification_report


In [None]:
# input: true labels and predicted labels
print(classification_report(y_sentiment_true, y_sentiment_pred)) 

Pretty good, right!?

But note that these reviews are fairly clean data, whereas social media text is much noisier, so you may not see results this good when you try it on your own data!

## Zero-shot classification

We’ll start by tackling a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.

In [None]:
classifier = pipeline("zero-shot-classification", device=0)
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!
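
A rough sketch of how such label scores can come about, assuming the pipeline is built on a natural language inference (NLI) model: each candidate label is turned into a hypothesis sentence, the model scores how strongly the input text supports each hypothesis, and the scores are normalized into probabilities. The hypothesis wording and the numbers below are assumptions made up for illustration; no model is called.

```python
import math

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [v - max(values) for v in values]   # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

candidate_labels = ["education", "politics", "business"]

# One hypothesis per candidate label; the template wording is an assumption.
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Invented entailment scores, standing in for what an NLI model would return
# when asked whether the input text supports each hypothesis.
toy_entailment_logits = [3.0, -1.0, 0.5]

scores = softmax(toy_entailment_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the labels only enter through the hypothesis sentences, swapping in a different label list needs no retraining, which is the point of the zero-shot setup.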



In [None]:
classifier(
    "Trump will win",
    candidate_labels=["agree", "disagree"],
)


### Exercise 1. Try the zero-shot classification by yourself. 

Come up with a possible set of labels and an example sentence. Then, try zero-shot classification with your own example. How does it work? Does it make any sense? 

Share your labels, sentences, and results [here](https://padlet.com/anjisun221/smt203zeroshot)


In [None]:
# Your code here


## Evaluating zero-shot classification

Our restaurant review data also has 'aspect' labels.

We will now use zero-shot classification to predict the aspect labels given each review. Then, we will evaluate the model's performance on our data.

In [None]:
# Below code will take some time to finish. 
aspect_classifier = pipeline("zero-shot-classification", device=0)
aspect_result = aspect_classifier(list(sentences), candidate_labels=['service', 'food', 'price', 'ambience'],)
aspect_result[0]

Since the model returns probabilities for all four labels, we will assign the label with the highest probability to the text. 

`np.argmax()` returns the index of the highest value in a list.

For the first text, the 'food' label has the probability of 0.687, so the index of the highest value is 0.  
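
The selection step can be sketched on a hand-written result with the same `labels` and `scores` fields. The text and numbers are invented, and `top_label` is a hypothetical helper that does in plain Python what `labels[np.argmax(scores)]` does with NumPy.

```python
# A hand-written result in the same shape as one zero-shot output
# (the text and numbers are invented for illustration).
toy_result = {
    "sequence": "The pasta was delicious.",
    "labels": ["food", "service", "ambience", "price"],
    "scores": [0.70, 0.15, 0.10, 0.05],
}

def top_label(result):
    """Return the label whose score is highest, like labels[np.argmax(scores)]."""
    scores = result["scores"]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return result["labels"][best_index]

print(top_label(toy_result))  # food
```

If the pipeline already returns its labels sorted by descending score, as the output here suggests, the best index is always 0. Taking the argmax anyway avoids depending on that ordering.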

In [None]:
np.argmax(aspect_result[0]['scores']) 

In [None]:
y_aspect_pred = [result['labels'][np.argmax(result['scores'])] for result in aspect_result]
y_aspect_pred[0]

In [None]:
# using classification_report to evaluate the predicted labels.
print(classification_report(y_aspect_true, y_aspect_pred))


Given that we have four labels, f1-score of 0.67 isn't bad at all! 

### Exercise 2. Sentiment analysis by Zero-shot classification 

1. Use zero-shot classification to classify the restaurant reviews into 'positive' and 'negative' categories. 
2. Extract the predicted labels from the results
3. Evaluate the predicted labels with true labels.
4. What's the macro average of f1-score? 
5. Is it better than using `sentiment-analysis` model in the previous section? 

In [None]:
# Your code here

Your answer to the question 4 of the exercise 2: ??

Your answer to the question 5 of the exercise 2: ??

# 3. [Optional] Build your own classification model by fine-tuning the pre-trained model 

So far, the pre-trained models seem to work quite well with our restaurant review data. But what if they don't work well with your own data, and the evaluation results are too poor to use in your study? 

In that case, you can build your own classification model by fine-tuning the pre-trained model with your own labeled data. 

Here, we will learn how you can fine-tune the pre-trained BERT model to build your own classification model! 

Basically, we will need to go through some steps that were hidden in the `pipeline`. 

In [None]:
# We extract text and label to build the model
sentences = list(df['text'].values)
y_str = list(df['aspect'].values)


Our labels are strings, but to build the classifier, we need to convert them to numbers. 

In [None]:
y = []
for each in y_str:
    if each == "ambience":
        y.append(0)
    elif each == "food":
        y.append(1)
    elif each == "price":
        y.append(2)        
    elif each == "service":
        y.append(3)


In [None]:
# the 'food' label should be encoded as '1'
print(y_str[0], y[0])

### Train and test datasets

To build our own classifier, we will split our labeled data (restaurant reviews) into training, validation, and test datasets. 

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) is a function in Sklearn model selection for splitting data arrays into two subsets: for training data and for testing data. With this function, you don't need to divide the dataset manually. It has the following syntax:

    train_test_split(X, y, train_size=0.*,test_size=0.*, random_state=*)

The function takes the following parameters:
- `X, y`: the dataset you're selecting to use. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
- `train_size`: This parameter sets the size of the training dataset. It can be `None` (the default, meaning the complement of the test size), an int (the exact number of samples), or a float between 0.0 and 1.0 (the proportion of the data to use).
- `test_size`: This parameter sets the size of the test dataset and accepts the same kinds of values. If it is `None`, it is set to the complement of the training size; if `train_size` is also `None`, it is set to 0.25.
- `random_state`: This parameter controls the shuffling applied before the split. By default the split is different on every run; pass an integer to get the same split each time.

In [None]:
# Randomly split the data into training (80%) and test (20%) datasets
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.20, random_state=999)


We now have a training and a test dataset, but let's also create a validation set that we can use for evaluation
and tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:

In [None]:
# Randomly split the training data into training (80%) and validation (20%) datasets
sentences_train, sentences_val, y_train, y_val = train_test_split(sentences_train, y_train, test_size=.2, random_state=999)
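
Applying an 80/20 split twice does not leave 80% of the data for training. The arithmetic can be checked with a small stand-in: `split_list` is a hypothetical helper that mimics the idea of `train_test_split` on a plain list, and the 100 items are placeholders rather than the lab's reviews.

```python
import random

def split_list(items, test_fraction, seed):
    """Shuffle a copy of `items` and cut it into (train, test) parts.

    Hypothetical helper that mimics the idea of train_test_split;
    it is not scikit-learn's implementation.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

toy_items = list(range(100))                        # pretend we have 100 reviews
toy_train, toy_test = split_list(toy_items, 0.20, seed=999)
toy_train, toy_val = split_list(toy_train, 0.20, seed=999)

print(len(toy_train), len(toy_val), len(toy_test))  # 64 16 20
```

So the final proportions are 64% training, 16% validation, and 20% test, and every item lands in exactly one of the three sets.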


### Tokenization

Alright, we've read in our dataset. Now let's tackle tokenization. We'll eventually train a classifier using
pre-trained DistilBert, so let's use the DistilBert tokenizer.

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Now we can simply pass our texts to the tokenizer. We'll pass `truncation=True` and `padding=True`, which
ensure that all of our sequences are padded to the same length and truncated to be no longer than the model's maximum input
length. This will allow us to feed batches of sequences into the model at the same time.
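
What padding and truncation do to a batch can be seen with a toy encoder. This is a sketch under simplifying assumptions: `toy_encode` is a hypothetical function that splits on whitespace and numbers words in order of appearance, whereas the real DistilBERT tokenizer uses subword units, a fixed vocabulary, and special tokens.

```python
def toy_encode(texts, max_length):
    """Whitespace-tokenize, truncate to max_length, then pad to the longest sequence."""
    vocab = {"[PAD]": 0}
    batch = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            ids.append(vocab.setdefault(word, len(vocab)))  # assign the next free id
        batch.append(ids[:max_length])                      # truncation
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [0] * (longest - len(ids)) for ids in batch]        # padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = toy_encode(["great food", "the staff was so horrible to us"], max_length=5)
print(toy_batch["input_ids"])       # [[1, 2, 0, 0, 0], [3, 4, 5, 6, 7]]
print(toy_batch["attention_mask"])  # [[1, 1, 0, 0, 0], [1, 1, 1, 1, 1]]
```

The short sentence is padded with zeros, the long one is cut to five tokens, and the attention mask marks which positions hold real tokens (1) and which are padding (0).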

In [None]:
train_encodings = tokenizer(sentences_train, truncation=True, padding=True)
val_encodings = tokenizer(sentences_val, truncation=True, padding=True)
test_encodings = tokenizer(sentences_test, truncation=True, padding=True)

Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a
`torch.utils.data.Dataset` object and implementing `__len__` and `__getitem__`. In TensorFlow, we pass our input
encodings and labels to the `from_tensor_slices` constructor method. We put the data in this format so that the data
can be easily batched such that each key in the batch encoding corresponds to a named parameter of the
`DistilBertForSequenceClassification.forward` method of the model we will train.

In [None]:
import torch

class myDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = myDataset(train_encodings, y_train)
val_dataset = myDataset(val_encodings, y_val)
test_dataset = myDataset(test_encodings, y_test)


## Train

Now that our datasets are ready, we can fine-tune a model either with the 🤗
`Trainer`/`TFTrainer` or with native PyTorch/TensorFlow. See [training](https://huggingface.co/transformers/training.html).

The steps above prepared the datasets in the format the trainer expects. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments`/`TFTrainingArguments`, and
instantiate a `Trainer`/`TFTrainer`.

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=5,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

# if it's not a binary classification, num_labels should be given! 
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

The training loss measures how far the model's predictions are from the true labels, so lower is better. 

You can increase the parameter `num_train_epochs` to train the model longer. This may help to increase the performance of the model. 
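
To give a feel for the numbers in the training log, here is a small worked example, assuming the loss is the usual cross-entropy for single-label classification. The logits are invented, and `cross_entropy` is a hypothetical helper written for this illustration.

```python
import math

def cross_entropy(logits, true_class):
    """Negative log of the softmax probability assigned to the true class."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for numerical stability
    log_sum = math.log(sum(math.exp(x) for x in shifted))
    return -(shifted[true_class] - log_sum)

# Invented logits for the four aspect classes; the true class is index 1 ("food").
confident_right = cross_entropy([0.1, 4.0, 0.2, 0.3], true_class=1)
confident_wrong = cross_entropy([4.0, 0.1, 0.2, 0.3], true_class=1)
uniform_guess = cross_entropy([0.0, 0.0, 0.0, 0.0], true_class=1)

print(f"{confident_right:.3f} {confident_wrong:.3f} {uniform_guess:.3f}")
```

A model that guesses uniformly among four classes has a loss of ln(4), about 1.386. A confident correct prediction gives a loss close to 0, and a confident wrong one gives a much larger loss, so a training loss that falls well below 1.386 indicates the model is learning.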


## Test

Now, let's evaluate the model performance.

In [None]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.evaluate()

The model achieves a macro-average F1 score of 0.78. 

Our model performs better than the zero-shot model!
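
The `compute_metrics` function above passes `average='macro'`, which means the per-class scores are averaged with equal weight regardless of how many examples each class has. The sketch below contrasts that with a support-weighted average; the per-class F1 values and class sizes are invented and are not results from this lab.

```python
# Invented per-class F1 scores and class sizes, for illustration only.
toy_f1 = {"ambience": 0.50, "food": 0.90, "price": 0.60, "service": 0.80}
toy_support = {"ambience": 10, "food": 60, "price": 10, "service": 20}

# Macro average: every class counts equally.
macro_f1 = sum(toy_f1.values()) / len(toy_f1)

# Weighted average: each class counts in proportion to its number of examples.
weighted_f1 = sum(toy_f1[c] * toy_support[c] for c in toy_f1) / sum(toy_support.values())

print(f"macro={macro_f1:.2f} weighted={weighted_f1:.2f}")  # macro=0.70 weighted=0.81
```

When classes are imbalanced, the macro average is pulled down by weak performance on small classes, which makes it a stricter summary than the weighted average.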

## Save & Load the model

Since our model performs well on the test data, we can use it to predict labels for data that don't have any! 

For that, let's try to save and load the new model. 

In [None]:
trainer.save_model()

In [None]:
new_model = DistilBertForSequenceClassification.from_pretrained("./results", num_labels=4)

new_trainer = Trainer(
    model=new_model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

## Prediction

You can predict the labels of your unlabelled data. 

We will use the sentences in the test data (`sentences_test`) and predict the aspect labels. Here, we will pretend that they don't have any true labels! 


In [None]:
sentences_test[0]

In [None]:
len(sentences_test)

In [None]:
# create dataset for prediction
new_encodings = tokenizer(sentences_test, truncation=True, padding=True)
# create dummy labels with the number of sentences to predict. 
y_new = np.full(len(sentences_test), 1)
new_dataset = myDataset(new_encodings, y_new)

In [None]:
new_predictions = new_trainer.predict(new_dataset)

The predictions are raw scores, one per class. You can extract the label with the highest score (and thus the highest probability) using the code below. 

In [None]:
new_preds = np.argmax(new_predictions.predictions, axis=-1)
new_preds

Note that the labels are encoded into numeric form. You may need to convert back to the string format. 

Remember that:
```
y = []
for each in y_str:
    if each == "ambience":
        y.append(0)
    elif each == "food":
        y.append(1)
    elif each == "price":
        y.append(2)        
    elif each == "service":
        y.append(3)
```
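
One way to convert the numeric predictions back to strings is to invert that mapping with a dictionary. The names `label2id` and `id2label` are hypothetical, and `toy_preds` is a stand-in for `new_preds`.

```python
# Hypothetical names; the codes match the mapping shown above.
label2id = {"ambience": 0, "food": 1, "price": 2, "service": 3}
id2label = {idx: label for label, idx in label2id.items()}

toy_preds = [1, 3, 0, 2]   # stand-in for new_preds
toy_pred_labels = [id2label[int(p)] for p in toy_preds]   # int() also accepts NumPy integers
print(toy_pred_labels)  # ['food', 'service', 'ambience', 'price']
```

Keeping both directions in dictionaries means the encoding and decoding cannot drift apart, and an unexpected value raises a `KeyError` instead of being silently skipped.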



### [Optional] Exercise 3. Classify the below new texts!

You have two new texts. Please use the fine-tuned model to classify those texts into four aspects. 

Print out the result. 

In [None]:
new_texts = ["Not too crazy about their sake martini", 
             "But the staff was so horrible to us."]


In [None]:
# Your code here