# IndoML 2023 Tutorial: Part 2
## The Era of LLMs!

### In-context learning and Prompt Engineering


1. We will query recent LLMs such as GPT-3, FLAN-T5, and LLaMA in natural language to get answers/predictions.
2. These models are fine-tuned on instructions or human feedback, which enables them to perform a task through "prompting".
3. The best part is that we don't need to train a model to get started; direct inference from these pretrained models is enough.
    * NOTE: These models can also be fine-tuned on our own data for better results, but we will not cover that in this tutorial.

### Methods that we will try:

1. FLAN-T5

## Load `dataset`

In [1]:
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

tqdm.pandas()

dataset = load_dataset("AmazonScience/massive")


  from .autonotebook import tqdm as notebook_tqdm
2023-08-26 10:15:58.521414: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Load `AutoTokenizer` and `AutoModelForSeq2SeqLM`

In [5]:
# pip install -q transformers accelerate bitsandbytes
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# checkpoint = "bigscience/mt0-base"
# checkpoint = "bigscience/bloomz-3b"  # decoder-only: would need AutoModelForCausalLM instead
checkpoint = "google/flan-t5-xxl"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# FLAN-T5 is an encoder-decoder model, so it is loaded with the seq2seq class
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)

TypeError: __init__() got an unexpected keyword argument 'load_in_4bit'

## Model Specific Example of Prompt Engineering

In [None]:
inputs = tokenizer.encode("Detect Intent: I am going to school", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

<pad> I am going to school.</s>


## Preprocessing

- We will create a prompt for each evaluation sample in the dataset.
- There are a few ways to format these prompts; this step is called "Prompt Engineering".
    - Few-shot In-context learning
    - Zero-shot In-context learning
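
To make the difference between the two styles concrete, here is a minimal sketch of a prompt builder. It reuses the prompt template from this tutorial; the function name `build_prompt` and the demonstration utterances are hypothetical and only for illustration (the labels `alarm_set` and `weather_query` are MASSIVE intent names). With no demonstrations the prompt is zero-shot; with demonstrations prepended it is few-shot.

```python
def build_prompt(utt, examples=()):
    """Format an intent-detection prompt.

    With no examples this is a zero-shot prompt; with examples it is few-shot.
    """
    template = 'What is the intent of the following utterance?\n"{utt}"\n\nAns:'
    # Each demonstration is the same template followed by its gold label
    shots = [template.format(utt=u) + " " + label for u, label in examples]
    # The query comes last and ends with "Ans:" so the model completes the label
    return "\n\n".join(shots + [template.format(utt=utt)])


# Hypothetical demonstrations, written for illustration only
demos = [
    ("wake me up at nine am", "alarm_set"),
    ("what's the weather like today", "weather_query"),
]

zero_shot = build_prompt("play some jazz")
few_shot = build_prompt("play some jazz", demos)
print(few_shot)
```

Few-shot prompts are longer, so they cost more tokens per query, but the demonstrations show the model the expected label format.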

In [None]:
# Add a new feature column to the dataset
# Prompt: What is the intent of the following utterance?\n"{utt}"\n\nAns:
def add_prompt(example):
    example["prompt"] = f'What is the intent of the following utterance?\n"{example["utt"]}"\n\nAns:'
    return example

extended_eval_set = dataset['validation'].map(add_prompt)


Map: 100%|██████████| 103683/103683 [00:13<00:00, 7549.40 examples/s]


In [3]:
print(extended_eval_set[0]['prompt'])

x = extended_eval_set[0]['prompt']
tok_x = tokenizer(x, return_tensors="pt")
y = model.generate(tok_x['input_ids'].to("cuda"), num_beams=5, num_return_sequences=5, max_length=100)
print(tokenizer.decode(y[0]))

NameError: name 'extended_eval_set' is not defined

In [3]:
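def tokenize_function(examples):  # assumed definition (not shown in this notebook): tokenize the "utt" column
    return tokenizer(examples["utt"], truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # passed to the Trainer later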

In [4]:
tokenized_datasets['train'][0].keys()

dict_keys(['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt', 'worker_id', 'slot_method', 'judgments', 'input_ids', 'attention_mask'])

In [5]:
# Remove columns that are not needed
to_remove = list(tokenized_datasets['train'][0].keys())

# Don't remove these columns
to_remove.remove('input_ids')
to_remove.remove('attention_mask')
to_remove.remove('intent')

tokenized_datasets = tokenized_datasets.remove_columns(to_remove)
tokenized_datasets = tokenized_datasets.rename_column('intent', 'labels')


In [6]:
tokenized_datasets['train'].features

{'labels': ClassLabel(names=['datetime_query', 'iot_hue_lightchange', 'transport_ticket', 'takeaway_query', 'qa_stock', 'general_greet', 'recommendation_events', 'music_dislikeness', 'iot_wemo_off', 'cooking_recipe', 'qa_currency', 'transport_traffic', 'general_quirky', 'weather_query', 'audio_volume_up', 'email_addcontact', 'takeaway_order', 'email_querycontact', 'iot_hue_lightup', 'recommendation_locations', 'play_audiobook', 'lists_createoradd', 'news_query', 'alarm_query', 'iot_wemo_on', 'general_joke', 'qa_definition', 'social_query', 'music_settings', 'audio_volume_other', 'calendar_remove', 'iot_hue_lightdim', 'calendar_query', 'email_sendemail', 'iot_cleaning', 'audio_volume_down', 'play_radio', 'cooking_query', 'datetime_convert', 'qa_maths', 'iot_hue_lightoff', 'iot_hue_lighton', 'transport_query', 'music_likeness', 'email_query', 'play_music', 'audio_volume_mute', 'social_post', 'alarm_set', 'qa_factoid', 'calendar_set', 'play_game', 'alarm_remove', 'lists_remove', 'transpor

## Dataloaders [Optional]

- Instead of going through individual samples in the dataset, we would like to use batches of samples to train our model.
- Dataloader
    1. Creates an "iterator" over the dataset, which returns a batch of samples every turn
    2. Handles shuffling
    3. Collates samples into batches, by padding the input sequences to the maximum length in the batch
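
The batching and padding steps above can be sketched in plain Python. This is only an illustration of the idea, not how `DataLoader` or `DataCollatorWithPadding` are implemented: the names `pad_batch` and `iter_batches` and the toy token ids are hypothetical, shuffling is left out, and the real classes return tensors rather than lists.

```python
def pad_batch(samples, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        n_pad = max_len - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * len(s["input_ids"]) + [0] * n_pad)
        batch["labels"].append(s["labels"])
    return batch


def iter_batches(dataset, batch_size):
    """Yield padded batches in order (no shuffling in this sketch)."""
    for start in range(0, len(dataset), batch_size):
        yield pad_batch(dataset[start:start + batch_size])


# Hypothetical token ids, only for illustration
toy = [
    {"input_ids": [101, 7, 8, 102], "labels": 3},
    {"input_ids": [101, 9, 102], "labels": 0},
    {"input_ids": [101, 4, 5, 6, 2, 102], "labels": 1},
]
batches = list(iter_batches(toy, batch_size=2))
print(batches[0]["input_ids"])  # [[101, 7, 8, 102], [101, 9, 102, 0]]
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps batches of short utterances small.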

In [7]:
# from torch.utils.data import DataLoader

# train_dataloader = DataLoader(
#     tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
# )
# eval_dataloader = DataLoader(
#     tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
# )

In [8]:
# Check the shape of the tensors
# for batch in tqdm(train_dataloader):
#     break
# {k: v.shape for k, v in batch.items()}

```python
{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 25]),
 'attention_mask': torch.Size([8, 25])}
```

## Load model using an `AutoModel`

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=tokenized_datasets['train'].features['labels'].num_classes)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.

In [10]:
# # Let's check the output of a single forward pass
# outputs = model(**batch)
# print(outputs.loss, outputs.logits.shape)

## Setup Evaluation Metric

In [11]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [12]:
# We need to define a compute_metric function that is supported by the Trainer output
# It basically converts the logits to predictions and then calls the metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
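
As a quick sanity check of what `compute_metrics` does, here is a dependency-free sketch of the same logits-to-accuracy computation. The helper name `accuracy_from_logits` and the toy logits are hypothetical; the tutorial's version uses `np.argmax` and the `evaluate` accuracy metric instead.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for 4 samples and 3 classes
logits = [
    [0.1, 2.0, -1.0],  # argmax -> 1
    [1.5, 0.2, 0.3],   # argmax -> 0
    [0.0, 0.1, 0.9],   # argmax -> 2
    [0.4, 0.3, 0.2],   # argmax -> 0
]
labels = [1, 0, 1, 0]
result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```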

## Setup Trainer and Finetuning

1. When testing your code for the first time, it is a good idea to use a smaller dataset so that you can iterate quickly.
2. We have done the same here by using 4096 and 2096 samples from the dataset for training and validation, respectively.

In [13]:
train_data_subset = tokenized_datasets["train"].select(range(4096))
eval_data_subset = tokenized_datasets["validation"].select(range(2096))

In [14]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # warmup_steps=500,
    # weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    report_to="none" # set "wandb" here to log to wandb
)
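
It can help to estimate how many optimization steps these arguments imply before starting a run. The following is a rough sketch of that arithmetic, assuming a single device and no gradient accumulation (both are assumptions, not settings made explicitly above); the variable names are chosen for illustration.

```python
import math

num_train_examples = 4096  # size of train_data_subset
batch_size = 32            # per_device_train_batch_size
num_epochs = 10            # num_train_epochs
num_devices = 1            # assumption: a single device
grad_accum_steps = 1       # assumption: no gradient accumulation

effective_batch = batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 128 1280
```

With `save_strategy="epoch"`, a checkpoint is written once per epoch, so under these assumptions checkpoints are 128 steps apart.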

**Wandb Logging**

When training ML models, keeping track of various hyperparameters, the training and validation losses, and evaluation metrics is very important.
`wandb` can help you here.

To install wandb, run the following commands in your terminal:

```bash
pip install wandb
# Then login to your wandb account
wandb login
```

Once that is done, set the `report_to` argument in `TrainingArguments` to `"wandb"`.

In [15]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    # train_dataset=tokenized_datasets["train"],
    # eval_dataset=tokenized_datasets["validation"],
    train_dataset=train_data_subset, # Using the smaller subsets
    eval_dataset=eval_data_subset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [16]:
stats = trainer.train()

***** Running training *****
  Num examples = 4096
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1280
  Number of trainable parameters = 66999612
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


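| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 2.1266 | 4.311208 | 0.205153 |
| 2 | 1.3326 | 4.608765 | 0.259542 |
| 3 | 0.8773 | 4.879213 | 0.278626 |
| 4 | 0.6287 | 5.112392 | 0.292462 |
| 5 | 0.5658 | 5.199319 | 0.304389 |
| 6 | 0.4456 | 5.414936 | 0.302481 |
| 7 | 0.3588 | 5.228592 | 0.311069 |
| 8 | 0.2249 | 5.661997 | 0.306298 |
| 9 | 0.2291 | 5.564895 | 0.307252 |
| 10 | 0.2399 | 5.539486 | 0.310115 |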


***** Running Evaluation *****
  Num examples = 2096
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-128
Configuration saved in ./results/checkpoint-128/config.json
Model weights saved in ./results/checkpoint-128/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-128/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-128/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2096
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-256
Configuration saved in ./results/checkpoint-256/config.json
Model weights saved in ./results/checkpoint-256/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-256/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-256/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2096
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-384
Configuration saved in ./results/checkpoint-384/config.json
Model w

In [17]:
# Print Stats
from pprint import pprint
pprint(trainer.evaluate())

***** Running Evaluation *****
  Num examples = 2096
  Batch size = 32


{'epoch': 10.0,
 'eval_accuracy': 0.3110687022900763,
 'eval_loss': 5.2285919189453125,
 'eval_runtime': 1.0504,
 'eval_samples_per_second': 1995.343,
 'eval_steps_per_second': 62.83}
