<a href="https://colab.research.google.com/github/VladCiocan/APIJSONParser/blob/master/aya_expanse_sft_bengali.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning Aya Expanse On More Languages

While Aya Expanse models are highly optimized through post-training for instruction-following performance on 23 languages, which cover half of the world's population, these models were pre-trained on a very large corpus of text that contains many more languages. Knowledge of many languages acquired in pre-training, combined with strong cross-lingual representations, means that Aya Expanse models often perform well in languages that were not explicitly optimized for in post-training, even with little to no additional training data in that language.

We can further improve the performance of Aya Expanse models on a language which is not part of the original set of 23 optimized languages by supervised fine-tuning (SFT) on a small dataset of instructions for a particular target language. In this notebook, we provide an example of fine-tuning Aya Expanse on a Bengali dataset and demonstrate that with a small amount of fine-tuning data, we can train Aya Expanse to perform well in Bengali.

In [None]:
# install dependencies
!pip install -U bitsandbytes transformers peft accelerate trl datasets sentencepiece wandb

# optional for faster, lower memory usage attention
!pip install flash-attn --no-build-isolation

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch
from datasets import load_dataset
from trl import SFTTrainer

In [None]:
USE_GPU = True
if USE_GPU:
    device = "cuda:0"
else:
    device = "cpu"

# you may want to change the following parameters depending on your GPU configuration

# free T4 instance
# QUANTIZE_4BIT = True
# USE_GRAD_CHECKPOINTING = True
# TRAIN_BATCH_SIZE = 2
# TRAIN_MAX_SEQ_LENGTH = 512
# USE_FLASH_ATTENTION = False
# GRAD_ACC_STEPS = 16

# equivalent A100 setting
QUANTIZE_4BIT = True
USE_GRAD_CHECKPOINTING = True
TRAIN_BATCH_SIZE = 16
TRAIN_MAX_SEQ_LENGTH = 512
USE_FLASH_ATTENTION = True
GRAD_ACC_STEPS = 2
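
One way to read "equivalent" here is that both presets consume the same number of examples per optimizer step: the per-device batch size multiplied by the number of gradient accumulation steps. The following is a minimal sketch of that arithmetic, assuming a single GPU. The helper `effective_batch_size` and the `presets` dictionary are illustrative names and are not part of the notebook.

```python
# Illustrative helper (not part of the notebook): examples consumed per optimizer step.
def effective_batch_size(per_device_batch_size, grad_acc_steps, num_devices=1):
    return per_device_batch_size * grad_acc_steps * num_devices

# Values copied from the two presets above; a single GPU is assumed.
presets = {
    "T4": {"per_device_batch_size": 2, "grad_acc_steps": 16},
    "A100": {"per_device_batch_size": 16, "grad_acc_steps": 2},
}

for name, cfg in presets.items():
    print(name, effective_batch_size(**cfg))
```

Both presets come out at 32 examples per step, so the T4 preset trades memory for more forward and backward passes per update.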

## Loading and Testing the Base Model
Let's load the Aya Expanse 8B model and tokenizer. If you would like to use Aya Expanse 32B, change `MODEL_NAME` to `"CohereForAI/aya-expanse-32b"`.


In [None]:
MODEL_NAME = "CohereForAI/aya-expanse-8b"

# Load Model
quantization_config = None
if QUANTIZE_4BIT:
  quantization_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_use_double_quant=True,
      bnb_4bit_compute_dtype=torch.bfloat16,
  )

attn_implementation = None
if USE_FLASH_ATTENTION:
  attn_implementation="flash_attention_2"

model = AutoModelForCausalLM.from_pretrained(
          MODEL_NAME,
          quantization_config=quantization_config,
          attn_implementation=attn_implementation,
          torch_dtype=torch.bfloat16,
        )
model = model.to(device)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

You shouldn't move a model that is dispatched using accelerate hooks.


In [None]:
def get_message_format(prompts):
  messages = []

  for p in prompts:
    messages.append(
        [{"role": "user", "content": p}]
      )

  return messages

def generate_aya(
      model,
      prompts,
      temperature=0.75,
      top_p=1.0,
      top_k=0,
      max_new_tokens=1024
    ):

  messages = get_message_format(prompts)

  input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        padding=True,
        return_tensors="pt",
      )
  input_ids = input_ids.to(model.device)
  prompt_padded_len = len(input_ids[0])

  gen_tokens = model.generate(
        input_ids,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=max_new_tokens,
        do_sample=True,
      )

  # get only generated tokens
  gen_tokens = [
      gt[prompt_padded_len:] for gt in gen_tokens
    ]

  gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
  return gen_text
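
The slicing step in `generate_aya` relies on every prompt in the batch being padded to the same length, with the padding on the left, so that the generated tokens start at the same index in every row. Below is a toy sketch with plain Python lists, assuming left padding and made-up token ids. `PAD`, `left_pad`, and `toy_generate` are hypothetical stand-ins for the tokenizer and `model.generate`.

```python
PAD = 0  # hypothetical padding id

def left_pad(batch, pad_id=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

def toy_generate(padded_prompts, new_tokens):
    """Stand-in for model.generate: returns prompt + continuation per row."""
    return [p + n for p, n in zip(padded_prompts, new_tokens)]

prompts = [[11, 12, 13], [21]]        # two prompts of different length
padded = left_pad(prompts)            # [[11, 12, 13], [0, 0, 21]]
prompt_padded_len = len(padded[0])    # 3, identical for every row

outputs = toy_generate(padded, [[101, 102], [201, 202]])
generated_only = [row[prompt_padded_len:] for row in outputs]
print(generated_only)                 # [[101, 102], [201, 202]]
```

With right padding the same slice would cut at the wrong place for the shorter prompt, which is why the padding side matters for batched generation.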

Let's do a quick test of the model generations in English, Vietnamese, Japanese, and Turkish, which are all part of the original set of 23 optimized languages.

In [None]:
# Test generations on languages in the Aya 23 set
prompts = [
    "Write a list of three fruits and tell me about each of them", # English
    "Viết danh sách ba loại trái cây và kể cho tôi nghe về từng loại trái cây đó", # Vietnamese
    "3 つの果物のリストを書いて、それぞれについて教えてください", # Japanese
    "Üç meyveden oluşan bir liste yazın ve bana her birini anlatın" # Turkish
]

generations = generate_aya(model, prompts)

for p, g in zip(prompts, generations):
  print(
      "PROMPT", p, "RESPONSE", g, "\n", sep="\n"
    )

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


PROMPT
Write a list of three fruits and tell me about each of them
RESPONSE
Here is a list of three unique fruits along with some details about each:

1. **Dragon Fruit (Pitaya):**
   - Appearance: Dragon fruit is easily recognizable by its vibrant pink or yellow skin, covered in scales, resembling a dragon's skin. It has a green crown at the top. The flesh inside can be white or vibrant pink, dotted with small black seeds.
   - Origin and Growth: Native to Central America and northern South America, dragon fruit grows on epiphytic cacti. It is now cultivated in various tropical regions worldwide. The plant produces single, large fruit that can weigh up to 2 pounds.
   - Taste and Texture: The flesh has a mild, sweet flavor, often described as a cross between a kiwi and a pear. It has a juicy, slightly tangy taste and a creamy, buttery texture. The seeds are edible and add a crunchy element.

2. **Durian:**
   - Appearance: Durian is a spiky, oval-shaped fruit with a thick, hard shell.

As expected, the model performs well in languages that were part of the original set of 23 optimized languages. Next, let's test the model's generations in Bengali, which is not part of that set.


In [None]:
prompts = [
  'Translate from English to Bengali: "Rates are competitive, almost always the best in the market"'
]

generations = generate_aya(model, prompts)

for p, g in zip(prompts, generations):
  print(
      "PROMPT", p, "RESPONSE", g, "\n", sep="\n"
    )

PROMPT
Translate from English to Bengali: "Rates are competitive, almost always the best in the market"
RESPONSE
বিক্রয় তাদের সহজ, বাধাতমকেছে মার্কেটে সবচেয়ে উন্নত"




While the model is able to generate a response in the correct target language of Bengali, the translation could be improved. The response translated back into English is "Selling their simple, barrier-free solutions is the most advanced on the market".

## Dataset Setup

Here we load an English to Bengali translation dataset from the Aya Collection. We filter the dataset to only include Bengali examples and define a formatting function to format the prompts to follow the chat template used by Aya Expanse models. This formatting function will be passed to the `SFTTrainer` below for training.

In [None]:
# Load an English to Bengali translation dataset from Aya Collection
dataset = load_dataset("CohereForAI/aya_collection", "templated_indic_sentiment")['train']
dataset = dataset.filter(lambda example: example['language']=='ben')

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['inputs'])):
        text = f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{example['inputs'][i]}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{example['targets'][i]}"
        output_texts.append(text)
    return output_texts
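
To see what the trainer receives, the formatting function can be applied to a small hand-written batch. The sketch below repeats the function so that it runs on its own. The two toy rows are invented for illustration and are not taken from the dataset.

```python
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['inputs'])):
        text = f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{example['inputs'][i]}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{example['targets'][i]}"
        output_texts.append(text)
    return output_texts

# A batch is a dict of columns (lists), the shape the function's loop expects.
# These rows are made up for the example.
toy_batch = {
    "inputs": [
        'Translate from English to Bengali: "Thank you"',
        'Translate from English to Bengali: "Water"',
    ],
    "targets": ["ধন্যবাদ", "জল"],
}

formatted = formatting_prompts_func(toy_batch)
for text in formatted:
    print(text)
```

Each returned string holds one user turn followed by the opening of the chatbot turn and the target text, so the function returns one training string per row in the batch.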

Here is an example prompt and response from the dataset:

In [None]:
print(f"PROMPT\n{dataset['inputs'][0]}")
print(f"RESPONSE\n{dataset['targets'][0]}")

PROMPT
Translate from English to Bengali: "This boat's soundbar is still wire-connectivity for all the speakers. The HDMI port doesn't match all the devices, hence it suddenly gets disconnected sometimes."
RESPONSE
"এই বোটের সাউন্ডবারটি এখনও সব স্পিকারের জন্য তারের সংযোগ। এইচডিএমআই পোর্ট সব ডিভাইসের সঙ্গে ম্যাচ করে না, তাই সংযোগ মাঝে মাঝে হঠাৎ বিচ্ছিন্ন হয়ে যায়।"


## SFT Model Training
Below we configure SFT training of Aya Expanse on the Bengali dataset constructed above. We use LoRA for efficient fine-tuning, so only the LoRA adapters are updated and saved during training.
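
To get a sense of how small the trained portion is, the number of LoRA parameters can be estimated by hand: an adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r * (d_in + d_out) weights. The sketch below uses r=32 and the four attention projections from the configuration in this section, together with the shapes reported in the model printout further down (q_proj 4096 to 4096, k_proj 4096 to 1024, 32 decoder layers). The shapes for v_proj and o_proj are assumptions, since the printout is truncated before reaching them.

```python
def lora_params(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

R = 32
NUM_LAYERS = 32
projections = {
    "q_proj": (4096, 4096),  # from the printout
    "k_proj": (4096, 1024),  # from the printout
    "v_proj": (4096, 1024),  # assumption: same shape as k_proj
    "o_proj": (4096, 4096),  # assumption: same shape as q_proj
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in projections.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the adapters hold roughly 27 million weights, a fraction of a percent of an 8B-parameter base model.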

In [None]:
# Training Configuration
training_arguments = TrainingArguments(
    output_dir="results",
    num_train_epochs=20,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACC_STEPS,
    gradient_checkpointing=USE_GRAD_CHECKPOINTING,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=10,
    learning_rate=1e-3,
    weight_decay=0.001,
    fp16=False,
    bf16=True,
    warmup_ratio=0.05,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="none"
)

peft_config = LoraConfig(
    lora_alpha=32,
    r=32,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=TRAIN_MAX_SEQ_LENGTH,
    tokenizer=tokenizer,
    args=training_arguments,
    formatting_func=formatting_prompts_func
)

# Train the model
trainer.train()

# Save the model to disk
trainer.model.save_pretrained(save_directory='aya-expanse-bengali-sft')
model.config.use_cache = True
model.eval()


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Step,Training Loss
10,1.365
20,1.1556
30,1.0439
40,0.9923
50,0.8941
60,0.8811
70,0.8637
80,0.7718
90,0.7221
100,0.7326



CohereForCausalLM(
  (model): CohereModel(
    (embed_tokens): Embedding(256000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x CohereDecoderLayer(
        (self_attn): CohereFlashAttention2(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Identity()
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=32, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=32, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
            (lora_dropout): ModuleDict(
 

## Testing the Fine-Tuned Model
Now, let's load the fine-tuned model and test it on the same Bengali prompt as before.

In [None]:
# Test Bengali inference on loaded fine-tuned model

# Load Model and LoRA Adapter
quantization_config = None
if QUANTIZE_4BIT:
  quantization_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_use_double_quant=True,
      bnb_4bit_compute_dtype=torch.bfloat16,
  )

attn_implementation = None
if USE_FLASH_ATTENTION:
  attn_implementation="flash_attention_2"

loaded_sft_model = AutoModelForCausalLM.from_pretrained(
          MODEL_NAME,
          quantization_config=quantization_config,
          attn_implementation=attn_implementation,
          torch_dtype=torch.bfloat16,
        )
loaded_sft_model = loaded_sft_model.to(device)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
loaded_sft_model.load_adapter("aya-expanse-bengali-sft")


prompts = [
  'Translate from English to Bengali: "Rates are competitive, almost always the best in the market"'
]

generations = generate_aya(loaded_sft_model, prompts)

for p, g in zip(prompts, generations):
  print(
      "PROMPT", p, "RESPONSE", g, "\n", sep="\n"
    )

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

You shouldn't move a model that is dispatched using accelerate hooks.


PROMPT
Translate from English to Bengali: "Rates are competitive, almost always the best in the market"
RESPONSE
"দরগুলি প্রতিযোগিতামূলক, প্রায় সবসময় সেরা"




The model output translated back into English is "The rates are competitive, almost always the best", which, as expected, is much better than the model's output before fine-tuning.