### Install transformers

In [None]:
!python -m pip install transformers accelerate sentencepiece emoji pythainlp --quiet
!python -m pip install --no-deps thai2transformers==0.1.2 --quiet

Transformers documentation: https://huggingface.co/docs/transformers/index

##  Sequence Classification

In [None]:
from transformers import pipeline

classifier = pipeline(task="sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
classifier("I love to hate you")

## A closer look: Tokenization + Classification

### Load tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
text = "I love you"

tokens = tokenizer.tokenize(text)

tokens

In [None]:
sentence = tokenizer.convert_tokens_to_ids(tokens)

sentence

In [None]:
sentence = tokenizer(text,  return_tensors="pt")

sentence
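Calling the tokenizer directly does more than `tokenize` followed by `convert_tokens_to_ids`: it also adds the special start and end tokens and returns an attention mask. The sketch below imitates that behaviour with a toy whitespace tokenizer. The vocabulary, its ids, and the `toy_encode` helper are made up for illustration and are not the real DistilBERT vocabulary or API.

```python
# Toy vocabulary: these ids are invented for illustration only.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "you": 6}


def toy_encode(text):
    """Lowercase, split on whitespace, map to ids, add [CLS]/[SEP]."""
    tokens = text.lower().split()
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    input_ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    # Every position holds a real token here, so the mask is all ones.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_encode("I love you")
print(encoded)
```

A real tokenizer splits text into subword pieces rather than whole words, so the token list can be longer than the word count.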

### Load model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)

### Classification

In [None]:
import torch

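# Softmax over the class dimension turns logits into probabilities
with torch.no_grad():
    probs = torch.softmax(model(**sentence).logits, dim=1)
probs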
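To make the softmax step concrete, here is a minimal pure-Python sketch of the same computation on a single pair of logits. The logit values are made up, and the label order (index 0 is `NEGATIVE`, index 1 is `POSITIVE`) is an assumption that can be checked against `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order; verify with model.config.id2label.
LABELS = ["NEGATIVE", "POSITIVE"]

example_logits = [-4.0, 4.0]  # made-up logits for illustration
probs = softmax(example_logits)
predicted = LABELS[probs.index(max(probs))]
print(predicted, probs)
```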

In [None]:
from thai2transformers.preprocess import process_transformers

# Pre-process the Thai text; <mask> marks the position the model should fill in
input_text = process_transformers("ขอเงินกู้<mask>หน่อย<pad>")

# Note: this is a fill-mask pipeline (WangchanBERTa), not a classifier
thai_fill_mask = pipeline(task="fill-mask",
                          tokenizer=AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased"),
                          model="airesearch/wangchanberta-base-att-spm-uncased")

thai_fill_mask(input_text)

See an example of the classification model deployed on HuggingFace space at: https://huggingface.co/spaces/Donlapark/sample-text-classification

# Fine-tuning

In [None]:
!python -m pip install datasets evaluate --quiet

### We will fine-tune a classification model on the Yelp review dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [None]:
dataset["train"][100]

### Wrap the tokenizer in a function so that it can be applied to our dataset

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased",
                                          use_fast=True)


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function,
                                 batched=True,
                                 remove_columns=["text"])
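The `padding="max_length"` and `truncation=True` options make every example the same length, which is what lets the examples be stacked into batches. A minimal sketch of the idea on plain lists of ids follows. The helper name `pad_or_truncate` is hypothetical, the pad id of 0 is an assumption, and unlike the real tokenizer this sketch does not keep the closing special token when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Cut the sequence to max_length, then right-pad it with pad_id."""
    ids = input_ids[:max_length]           # truncation
    n_pad = max_length - len(ids)          # how many pad tokens are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([5, 6, 7], max_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_length=4)
print(short)
print(long)
```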

In [None]:
tokenized_datasets["train"][100]['input_ids'][:20]

### We will only train on a small subset of the dataset

In [None]:
# A fixed seed makes the random subset reproducible across runs
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))

small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

### Load model

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

### Specify training arguments

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch",  # renamed to `eval_strategy` in newer transformers releases
                                  learning_rate=2e-5,
                                  optim="adamw_torch")  # use PyTorch's AdamW optimizer
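It can help to estimate how long training will run and how the learning rate changes along the way. The sketch below assumes the `TrainingArguments` defaults as I understand them (batch size 8, 3 epochs, linear decay with no warmup), a single device, and no gradient accumulation. Both helper functions are hypothetical and only mirror the arithmetic.

```python
import math


def total_training_steps(num_examples, batch_size=8, num_epochs=3):
    """Optimizer steps for a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_lr(step, total_steps, base_lr=2e-5):
    """Learning rate under linear decay to zero, assuming no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


steps = total_training_steps(1000)  # the 1000-example training subset
print(steps)                        # 375
print(linear_lr(0, steps))          # starts at the base learning rate
print(linear_lr(steps, steps))      # decays to 0.0 at the last step
```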

### Train the model

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
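Because no `compute_metrics` function is passed, the per-epoch evaluation reports only the loss. A minimal accuracy metric might look like the sketch below, which could be supplied as `Trainer(..., compute_metrics=compute_metrics)`. It is written in plain Python so that it works on nested lists as well as on the arrays the `Trainer` provides. The toy logits and labels are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Return accuracy given a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up example with 5 classes, matching num_labels=5
toy_logits = [[0.1, 2.0, 0.3, 0.0, -1.0],
              [1.5, 0.2, 0.1, 0.0, 0.3]]
toy_labels = [1, 4]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```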

In [None]:
trainer.train()

In [None]:
import torch

# Move the inputs to whichever device the fine-tuned model is on (GPU if available)
sentence = tokenizer("I hate you", return_tensors="pt").to(model.device)

torch.softmax(model(**sentence).logits, dim=1)

## Exercise

1. Choose your own task (it can be image or audio related) that can be performed using one of the HuggingFace models.
2. Use the HuggingFace model to create a Streamlit app in a HuggingFace Space that asks for the user's input and then performs the task.
3. Deploy the app on a HuggingFace Space.

To see what Transformers can do, you might want to check out the links below:

https://huggingface.co/docs/transformers/task_summary

https://huggingface.co/docs/transformers/index

[List of HuggingFace models](https://huggingface.co/models)

[Streamlit Documentation](https://docs.streamlit.io/library/api-reference/widgets)

#### Insert your HuggingFace Space link here:

# Upload model to HuggingFace Hub

We will upload the tokenizer and the model to the HuggingFace Hub. First, we need to install a library that lets us log in to our HuggingFace account from Colab.

In [None]:
!python -m pip install huggingface_hub --quiet

Enter your access token to log in, then create a new model repository, which will be used to store your model.

In [None]:
!huggingface-cli login
!huggingface-cli repo create finetuned_yelp --type model

Finally, you can now save your tokenizer and model.

In [None]:
tokenizer.push_to_hub("finetuned_yelp")
model.push_to_hub("finetuned_yelp")

To load the model and tokenizer from the HuggingFace Hub, pass `username/finetuned_yelp` to `from_pretrained` (change `username` to your HuggingFace username).

Now you can load the model within a HuggingFace Space using `pipeline("sentiment-analysis", model="your_username/finetuned_yelp")`. [Here](https://huggingface.co/spaces/Donlapark/sample-text-classification)'s an example.
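Loading the uploaded tokenizer and model back from the Hub can be sketched as follows. The helper names `hub_repo_id` and `load_from_hub` are hypothetical, and `your_username` is a placeholder for your own account name. The `transformers` import sits inside the function so that the block can be read and run without downloading anything until `load_from_hub` is actually called.

```python
def hub_repo_id(username, repo_name="finetuned_yelp"):
    """Build the '<username>/<repo>' identifier used on the Hub."""
    return f"{username}/{repo_name}"


def load_from_hub(username):
    """Download the fine-tuned tokenizer and model from the Hub."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    repo_id = hub_repo_id(username)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSequenceClassification.from_pretrained(repo_id)
    return tokenizer, model


repo_id = hub_repo_id("your_username")  # placeholder username
print(repo_id)
```

Calling `load_from_hub("your_username")` requires network access and that the repository already exists under that account.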