### Statistical Learning for Data Science 2 (229352) 
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #12

### Install transformers

In [None]:
!python -m pip install transformers accelerate sentencepiece emoji pythainlp --quiet
!python -m pip install --no-deps thai2transformers==0.1.2 --quiet

Transformers documentation: https://huggingface.co/docs/transformers/index

## Sequence Classification

In [None]:
from transformers import pipeline

classifier = pipeline(task="sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
classifier("I love to hate you")

[{'label': 'NEGATIVE', 'score': 0.9974361062049866}]

## A closer look: Tokenization + Classification

### Load tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
text = "I love you"

tokens = tokenizer.tokenize(text)

tokens

['i', 'love', 'you']

In [None]:
sentence = tokenizer.convert_tokens_to_ids(tokens)

sentence

[1045, 2293, 2017]

In [None]:
sentence = tokenizer(text,  return_tensors="pt")

sentence

{'input_ids': tensor([[ 101, 1045, 2293, 2017,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
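
The ids 101 and 102 that bracket the sentence are the special `[CLS]` and `[SEP]` tokens, which the tokenizer call adds automatically. That is why this output is two ids longer than the result of `convert_tokens_to_ids`. The following is a minimal sketch of the idea in plain Python. It uses a hypothetical `toy_vocab` holding only the ids shown above, and it is not the tokenizer's actual implementation.

```python
# Toy vocabulary holding only the ids that appear in the outputs above.
# The real tokenizer has roughly 30,000 entries; this dict is illustrative.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "i": 1045, "love": 2293, "you": 2017}


def encode(tokens, vocab):
    """Look up each token id and wrap the sequence in [CLS] ... [SEP]."""
    ids = [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]
    attention_mask = [1] * len(ids)  # 1 = real token, 0 would mark padding
    return {"input_ids": ids, "attention_mask": attention_mask}


encoded = encode(["i", "love", "you"], toy_vocab)
print(encoded)
```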

### Load model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [None]:
model(**sentence).logits

tensor([[-4.2756,  4.6393]], grad_fn=<AddmmBackward0>)

### Classification

In [None]:
import torch

torch.softmax(model(**sentence).logits, dim=1)

tensor([[1.3436e-04, 9.9987e-01]], grad_fn=<SoftmaxBackward0>)
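
To see where these probabilities come from, the softmax can be reproduced by hand from the two logits printed earlier. The sketch below uses only the standard library. The `labels` list assumes the index order used by this SST-2 checkpoint (0 is NEGATIVE, 1 is POSITIVE).

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-4.2756, 4.6393]  # values copied from the model output above
probs = softmax(logits)
labels = ["NEGATIVE", "POSITIVE"]  # assumed index order for this checkpoint
prediction = labels[probs.index(max(probs))]
print(probs, prediction)
```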

In [None]:
from thai2transformers.preprocess import process_transformers

input_text = process_transformers("ขอเงินกู้<mask>หน่อย<pad>")

print(input_text)

ขอเงินกู้<mask>หน่อย<pad>


In [None]:
thai_classifier = pipeline(task="fill-mask",
                           tokenizer=AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased"),
                           model="airesearch/wangchanberta-base-att-spm-uncased")

thai_classifier(input_text)

[{'score': 0.13870567083358765,
  'token': 5682,
  'token_str': 'นอกระบบ',
  'sequence': 'ขอเงินกู้นอกระบบ หน่อย'},
 {'score': 0.0380280502140522,
  'token': 10887,
  'token_str': 'ในบัญชี',
  'sequence': 'ขอเงินกู้ในบัญชี หน่อย'},
 {'score': 0.023623887449502945,
  'token': 1045,
  'token_str': 'ธนาคาร',
  'sequence': 'ขอเงินกู้ธนาคาร หน่อย'},
 {'score': 0.022009387612342834,
  'token': 4501,
  'token_str': 'สหกรณ์',
  'sequence': 'ขอเงินกู้สหกรณ์ หน่อย'},
 {'score': 0.020918430760502815,
  'token': 561,
  'token_str': 'คืน',
  'sequence': 'ขอเงินกู้คืน หน่อย'}]

See an example of a classification model deployed on a HuggingFace Space at: https://huggingface.co/spaces/Donlapark/sample-text-classification

# Fine-tuning

In [None]:
!python -m pip install datasets evaluate --quiet

### We will fine-tune a classification model on the Yelp review dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [None]:
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

### Modify the tokenizer so that it can be applied to our dataset

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased",
                                          use_fast=True)


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function,
                                 batched=True,
                                 remove_columns=["text"])

In [None]:
tokenized_datasets["train"][100]['input_ids'][:20]

[101,
 1422,
 11471,
 1111,
 9092,
 1116,
 1132,
 189,
 6034,
 1344,
 119,
 1252,
 1111,
 1141,
 1106,
 1253,
 8693,
 1177,
 14449,
 1193]
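
Because `tokenize_function` uses `padding="max_length"` and `truncation=True`, every review ends up with the same number of ids: short reviews are padded and long ones are cut. The sketch below illustrates that behaviour in plain Python. It is a simplification that assumes the ids already include `[CLS]` and `[SEP]`, and the helper name `pad_or_truncate` is hypothetical.

```python
PAD_ID = 0        # id of [PAD] in the bert-base-cased vocabulary
MAX_LENGTH = 512  # maximum sequence length of bert-base-cased


def pad_or_truncate(input_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Simplified: assumes input_ids starts with [CLS] and ends with [SEP].
    """
    if len(input_ids) > max_length:
        # Keep the closing [SEP] so the sequence stays well-formed.
        input_ids = input_ids[: max_length - 1] + input_ids[-1:]
    n_real = len(input_ids)
    n_pad = max_length - n_real
    return input_ids + [pad_id] * n_pad, [1] * n_real + [0] * n_pad


short_ids = [101, 1422, 11471, 102]
ids, mask = pad_or_truncate(short_ids, max_length=8)
print(ids)   # padded with zeros up to length 8
print(mask)  # zeros mark the padding positions
```

The attention mask tells the model which positions hold padding, so the padded zeros do not influence the prediction.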

### We will only train on a small subset of the dataset

In [None]:
# Fix the seed so the same 1,000 examples are selected on every run
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))

small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

### Load model

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Specify training arguments

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch",  # evaluate at the end of every epoch
                                  learning_rate=2e-5,
                                  optim="adamw_torch")  # use PyTorch's AdamW optimizer

### Train the model

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
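
As configured, the trainer reports only the training and validation loss. One common way to also track accuracy is to pass a function through the `compute_metrics` argument of `Trainer`. The following is a sketch under the assumption that the function receives a `(logits, labels)` pair of NumPy arrays. The sample arrays are made up for the check and are not results from this lab.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from the (logits, labels) pair the Trainer supplies."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Quick check on made-up logits for three reviews and five classes.
fake_logits = np.array([[2.0, 0.1, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 0.0, 0.0, 3.0],
                        [0.0, 1.5, 0.0, 0.0, 0.0]])
fake_labels = np.array([0, 4, 2])
result = compute_metrics((fake_logits, fake_labels))
print(result)
```

Adding `compute_metrics=compute_metrics` to the `Trainer(...)` call should then report accuracy alongside the validation loss at each evaluation.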

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: ignored

In [None]:
import torch

# Move the inputs to the same device as the model (GPU if available)
sentence = tokenizer("I hate you", return_tensors="pt").to(model.device)

torch.softmax(model(**sentence).logits, dim=1)

tensor([[0.2488, 0.2066, 0.1772, 0.1602, 0.2071]], device='cuda:0',
       grad_fn=<SoftmaxBackward0>)
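
The five probabilities correspond to labels 0 to 4, that is, 1 to 5 stars in the Yelp dataset. They are all close to 0.2 here because training was interrupted, so the classification head is still essentially untrained. A small sketch of turning such a row into a star rating, with the numbers copied from the output above:

```python
# Probabilities copied from the output above (label indices 0 to 4).
probs = [0.2488, 0.2066, 0.1772, 0.1602, 0.2071]

# Index of the largest probability is the predicted label.
predicted_label = max(range(len(probs)), key=lambda i: probs[i])
predicted_stars = predicted_label + 1  # label 0 corresponds to 1 star

print(f"label {predicted_label} -> {predicted_stars} star(s), "
      f"p = {probs[predicted_label]:.4f}")
```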

## Exercise

1. Choose your own task (it can be image- or audio-related) that can be performed using one of the HuggingFace models.
2. Use the HuggingFace model to create a Streamlit app in a HuggingFace Space that asks for the user's input and then performs the chosen task.
3. Deploy the model on a HuggingFace Space.

To see what Transformers can do, you might want to check out the links below:

https://huggingface.co/docs/transformers/task_summary

https://huggingface.co/docs/transformers/index

[List of HuggingFace models](https://huggingface.co/models)

[Streamlit Documentation](https://docs.streamlit.io/library/api-reference/widgets)

#### Insert your HuggingFace Space link here:

# Upload model to HuggingFace Hub

We will upload the tokenizer and the model to the HuggingFace Hub. First, we need to install a library that lets us log in to our HuggingFace account from Colab.

In [None]:
!python -m pip install huggingface_hub --quiet


Enter your access token to log in, then create a new model repository on the Hub, which will be used to store your model.

In [None]:
!huggingface-cli login
!huggingface-cli repo create finetuned_yelp --type model


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 

Finally, push your tokenizer and model to the Hub by running the cell below.

Once they are uploaded, you can load the model and tokenizer within a HuggingFace Space using `pipeline("sentiment-analysis", model="your_username/finetuned_yelp")` (change `your_username` to your HuggingFace username). [Here](https://huggingface.co/spaces/Donlapark/sample-text-classification)'s an example.



In [None]:
tokenizer.push_to_hub("finetuned_yelp")
model.push_to_hub("finetuned_yelp")