# About.
- This is simple code that demonstrates the basic workflow of an NLP task.
- Uses 🤗 (Hugging Face) libraries to fine-tune a pretrained language model, currently the most common approach.
- Task: IMDb movie review classification (binary sentiment).

# 0. Setup.

In [1]:
# Random seed.
from transformers import set_seed
set_seed(42)
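
`set_seed` seeds Python's `random`, NumPy, and the deep learning backend in one call. The effect of seeding can be illustrated with the standard library alone. This is only a sketch of the idea, and the helper name `draw` is hypothetical.

```python
import random

def draw(seed, n=3):
    """Return n pseudo-random floats after seeding the generator."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first  = draw(42)
second = draw(42)
other  = draw(43)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```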

# 1. Data Preparation.

In [2]:
from datasets import load_dataset

ds       = load_dataset('imdb')
train_ds = ds['train']
test_ds  = ds['test']

# 2. Tokenization.

In [3]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    # `batch` is a dict of columns; batch['text'] is a list of review strings.
    return tokenizer(batch['text'], truncation=True, padding=True)

train_tokenized = train_ds.map(tokenize, batched=True)
test_tokenized  = test_ds.map(tokenize, batched=True)

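> #### Note) Tokenize using `.map()`.  
> Define a tokenization function and apply it with `.map()` rather than calling the tokenizer on the whole dataset at once.  
> With `batched=True` the function receives many examples per call, so the tokenizer can process them together, and the results are cached on disk instead of being held in memory.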
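
What `truncation=True, padding=True` do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_batch`, the toy token ids, and `max_length=5` are assumptions, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length, then pad every sequence to the longest one."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids      = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids, not real DistilBERT vocabulary ids.
batch = pad_batch(
    [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]],
    max_length=5,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding happens inside each `.map()` call, sequences are padded to the longest example in that batch, and the attention mask tells the model which positions are padding.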

# 3. Pretrained Model.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# 4. Fine-Tuning.

In [6]:
import evaluate
import numpy as np
from transformers import TrainingArguments, Trainer

# TrainingArguments.
training_args = TrainingArguments(
    output_dir                   = "./temp",
    per_device_train_batch_size  = 16,
    per_device_eval_batch_size   = 16,
    num_train_epochs             = 1,
)

# Evaluation Metrics.
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Trainer.
trainer = Trainer(model           = model,
                  args            = training_args,
                  train_dataset   = train_tokenized,
                  eval_dataset    = test_tokenized,
                  compute_metrics = compute_metrics)

trainer.train()

# Logged training loss (recorded every 500 steps by default).
trainer.state.log_history

| Step | Training Loss |
|:-----|:--------------|
| 500  | 0.3313        |
| 1000 | 0.2464        |
| 1500 | 0.21          |


TrainOutput(global_step=1563, training_loss=0.25986074409802146, metrics={'train_runtime': 452.1608, 'train_samples_per_second': 55.29, 'train_steps_per_second': 3.457, 'total_flos': 3311684966400000.0, 'train_loss': 0.25986074409802146, 'epoch': 1.0})
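
The numbers in `TrainOutput` can be checked by hand. The sketch below assumes a single device and no gradient accumulation, which is consistent with the reported values but is not stated in the notebook.

```python
import math

num_examples = 25_000      # IMDb train split size
batch_size   = 16          # per_device_train_batch_size
num_epochs   = 1           # num_train_epochs
runtime_s    = 452.1608    # train_runtime reported by the trainer

# The last batch of the epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps     = steps_per_epoch * num_epochs

samples_per_second = round(num_examples * num_epochs / runtime_s, 2)
steps_per_second   = round(total_steps / runtime_s, 3)

print(total_steps)          # compare with global_step
print(samples_per_second)   # compare with train_samples_per_second
print(steps_per_second)     # compare with train_steps_per_second
```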

# 5. Evaluation.

In [28]:
results = trainer.evaluate()
results
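
`trainer.evaluate()` returns a dictionary whose keys are prefixed with `eval_`, so the value from `compute_metrics` appears as `eval_accuracy`. What that function computes can be sketched on toy data. The logits and labels below are made up for illustration, and accuracy is computed by hand with NumPy instead of `evaluate.load("accuracy")`.

```python
import numpy as np

# Hypothetical logits for four reviews; columns are [negative, positive].
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  0.9],
                   [ 0.2,  0.1],
                   [-1.5,  2.2]])
labels = np.array([0, 1, 1, 1])   # IMDb convention: 0 = negative, 1 = positive

predictions = np.argmax(logits, axis=-1)            # index of the larger logit per row
accuracy    = float((predictions == labels).mean())  # fraction of correct predictions

print(predictions)
print({"accuracy": accuracy})
```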

# 6. All-in-one.


In [None]:
# 1. Data Preparation.
from datasets import load_dataset
ds       = load_dataset('imdb')
train_ds = ds['train']
test_ds  = ds['test']

# 2. Tokenization.
from transformers import AutoTokenizer
checkpoint = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    # `batch` is a dict of columns; batch['text'] is a list of review strings.
    return tokenizer(batch['text'], truncation=True, padding=True)

train_tokenized = train_ds.map(tokenize, batched=True)
test_tokenized  = test_ds.map(tokenize, batched=True)

# 3. Pretrained Model.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 4. Fine-Tuning.
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir                   = "./temp",
    per_device_train_batch_size  = 16,
    per_device_eval_batch_size   = 16,
    num_train_epochs             = 1,
)

trainer = Trainer(model         = model, 
                  args          = training_args,
                  train_dataset = train_tokenized,
                  eval_dataset  = test_tokenized)

trainer.train()

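# 5. Evaluation.
# No compute_metrics is passed to Trainer here, so only the loss and runtime statistics are reported.
results = trainer.evaluate()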

# 7. Using `pipeline`.

| Pipeline                        | Description                                                  | Domain       |
|:--------------------------------|:------------------------------------------------------------|:------------ |
| `feature-extraction`            | Extract vector representations of text.                     | NLP          |
| `fill-mask`                     | Fill in missing parts of a sentence.                        | NLP          |
| `ner`                           | Perform named entity recognition (e.g., identify names, locations). | NLP          |
| `question-answering`            | Answer questions based on given context.                    | NLP          |
| `sentiment-analysis`            | Classify text sentiment (e.g., positive, negative).         | NLP          |
| `summarization`                 | Generate a concise summary from a larger text.              | NLP          |
| `text-generation`               | Generate text based on a prompt.                            | NLP          |
| `translation`                   | Translate text into different languages.                    | NLP          |
| `zero-shot-classification`      | Classify text into arbitrary labels without task-specific fine-tuning. | NLP          |
| `text2text-generation`          | General text-to-text transformations (e.g., summarization, translation). | NLP          |
| `table-question-answering`      | Answer questions about tabular data.                        | NLP/Table    |
| `document-question-answering`   | Answer questions based on a document or PDF.                | NLP/Document |
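| `text-generation` (code model)  | Generate code snippets from text descriptions; there is no separate `code-generation` task. | NLP/Code     |
| `conversational`                | Engage in multi-turn conversations like a chatbot (deprecated and removed in newer `transformers` releases; `text-generation` accepts chat messages instead). | NLP          |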
| `audio-classification`          | Classify audio into categories (e.g., speech, music).       | Audio        |
| `automatic-speech-recognition`  | Convert spoken audio to text.                               | Audio        |
| `image-classification`          | Classify images into categories.                            | Vision       |
| `object-detection`              | Detect objects in images with bounding boxes.               | Vision       |
| `image-to-text`                 | Generate captions for images.                               | Multi-modal  |
| `depth-estimation`              | Predict depth maps from images for 3D applications.         | Vision       |


In [41]:
from transformers import pipeline
clf = pipeline(task='sentiment-analysis')  # No model given, so a default pretrained model is chosen.

clf('It was good.')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9998384714126587}]
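
The `score` in this output is a probability. For a two-class model the pipeline applies a softmax to the model's logits and reports the most likely label. A minimal sketch of that last step follows; the logits are hypothetical values, not the model's real outputs, and the `id2label` mapping is assumed to match the label names shown above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak  = max(logits)
    exps  = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}   # assumed mapping, matching the labels above
logits   = [-4.2, 4.5]                      # hypothetical logits for 'It was good.'

probs = softmax(logits)
best  = max(range(len(probs)), key=probs.__getitem__)   # index of the highest probability

print({"label": id2label[best], "score": probs[best]})
```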