# ChatML format
https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/chat-markup-language
https://towardsdatascience.com/evaluations-with-chat-formats-7604067023c9

## Chat templates

https://huggingface.co/docs/transformers/main/chat_templating

## TRL library
https://huggingface.co/docs/trl/en/index

**Pre-requisites**

You MUST have completed the lessons in the "Advanced HuggingFace" section, because the classes used in this section require knowledge of the `transformers` library.

**NOTE:**

Loading a model with `from_pretrained` downloads its weights to your local drive (by default, into the Hugging Face cache directory).

In [9]:
# Quote the version specifiers so the shell does not treat ">=" as a redirect
# !pip install "trl>=0.12.2" "datasets>=3.1.0"

In [2]:
from trl import setup_chat_format
from transformers import AutoTokenizer, AutoModelForCausalLM

## Generate tokens for a prompt/completion

In [4]:
# model_name = "facebook/opt-350m"
# model_name = "google/gemma-2-2b-it"
# model_name = "deepseek-ai/DeepSeek-V2-Lite"

## Both the following are fine-tuned versions of Mistral
# model_name = "HuggingFaceH4/zephyr-7b-beta"
model_name = "teknium/OpenHermes-2.5-Mistral-7B"

# Load the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Print the tokenized text, including special tokens such as <s>
tokens = tokenizer.tokenize("this is a single sentence", add_special_tokens=True)
print("Single sentence: ", tokens)

# Here, a list of two strings is tokenized as a sentence pair (see the output below)
tokens = tokenizer.tokenize(["this is a prompt.", "this is a completion"], add_special_tokens=True)
print("Sentence pair: ", tokens)

Single sentence:  ['<s>', '▁this', '▁is', '▁a', '▁single', '▁sentence']
Sentence pair:  ['<s>', '▁this', '▁is', '▁a', '▁prompt', '.', '<s>', '▁this', '▁is', '▁a', '▁completion']
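
The `▁` character in the tokens above is the SentencePiece word-boundary marker (U+2581), and `<s>` is the beginning-of-sequence token. As a rough illustration only, the sketch below rebuilds the original text from the printed tokens by hand. The helper name `tokens_to_text` is hypothetical; in practice `tokenizer.convert_tokens_to_string` or `tokenizer.decode` would normally be used instead.

```python
# Tokens copied from the "Single sentence" output above.
tokens = ['<s>', '▁this', '▁is', '▁a', '▁single', '▁sentence']


def tokens_to_text(tokens, special_tokens=("<s>", "</s>")):
    """Hypothetical helper: join SentencePiece-style tokens back into text."""
    # Drop special tokens, then turn each word-boundary marker into a space.
    pieces = [t for t in tokens if t not in special_tokens]
    return "".join(pieces).replace("▁", " ").strip()


print(tokens_to_text(tokens))
```

This hand-rolled version ignores details such as byte-fallback tokens, so it is meant only to show what the markers mean.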


## Chat template

In [8]:
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {
        "role": "assistant",
        "content": "I'm doing well, thank you! How can I assist you today?",
    },
]

In [5]:
# Set up the chat format on the tokenizer ONLY if it is not already set,
# i.e., if tokenizer_config.json doesn't have the chat_template attribute
if tokenizer.chat_template is None:
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_name
    )
    model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)


input_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(input_text)

<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I assist you today?<|im_end|>



### Add "assistant" marker to the end

In [6]:
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)

<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I assist you today?<|im_end|>
<|im_start|>assistant
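
To make the layout of the two outputs above concrete, here is a minimal pure-Python sketch that produces the same ChatML-style text without loading a tokenizer. The function `render_chatml` is a hypothetical helper written for illustration; it is not part of `transformers` or `trl`, and the real chat template is a Jinja template stored with the tokenizer, which may differ in small details.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical helper: format messages in the ChatML layout shown above."""
    text = ""
    for message in messages:
        # Each turn is wrapped as: <|im_start|>{role}\n{content}<|im_end|>\n
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open a new assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {
        "role": "assistant",
        "content": "I'm doing well, thank you! How can I assist you today?",
    },
]

print(render_chatml(messages, add_generation_prompt=True))
```

With `add_generation_prompt=False` the text ends after the last `<|im_end|>`; with `True` it ends with an open assistant header, which is the form typically used when prompting the model for its next reply.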



### Tokenize

In [7]:
input_text = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
print(input_text)

[32001, 2188, 13, 16230, 28725, 910, 460, 368, 28804, 32000, 28705, 13, 32001, 13892, 13, 28737, 28742, 28719, 2548, 1162, 28725, 6979, 368, 28808, 1602, 541, 315, 6031, 368, 3154, 28804, 32000, 28705, 13, 32001, 13892, 13]
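
Comparing these IDs with the text version of the same conversation suggests that, for this tokenizer, 32001 corresponds to `<|im_start|>` and 32000 to `<|im_end|>`. That mapping is inferred from the output above, not taken from the tokenizer's documentation, so treat it as an assumption. Under that assumption, the following sketch locates each turn in the ID list; `split_turns` is a hypothetical helper.

```python
# Assumed special-token IDs, inferred from the printed output above.
IM_START_ID = 32001
IM_END_ID = 32000

token_ids = [32001, 2188, 13, 16230, 28725, 910, 460, 368, 28804, 32000, 28705, 13,
             32001, 13892, 13, 28737, 28742, 28719, 2548, 1162, 28725, 6979, 368,
             28808, 1602, 541, 315, 6031, 368, 3154, 28804, 32000, 28705, 13,
             32001, 13892, 13]


def split_turns(ids):
    """Hypothetical helper: return (start, end) index pairs, one per turn.

    `end` is None for a turn that was opened but never closed,
    such as the trailing generation prompt.
    """
    turns = []
    start = None
    for i, tok in enumerate(ids):
        if tok == IM_START_ID:
            start = i
        elif tok == IM_END_ID and start is not None:
            turns.append((start, i))
            start = None
    if start is not None:
        turns.append((start, None))
    return turns


print(split_turns(token_ids))
```

The last pair has no end index because `add_generation_prompt=True` leaves the final assistant turn open.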


## Function templates

## Data collators

In [1]:
!python --version

Python 3.12.4


In [2]:
!pip list

Package                      Version
---------------------------- ---------------
accelerate                   1.2.0
ai21                         3.0.1
ai21-tokenizer               0.12.0
aiohappyeyeballs             2.4.3
aiohttp                      3.10.5
aiosignal                    1.2.0
annotated-types              0.6.0
anthropic                    0.40.0
anyio                        4.6.2
argon2-cffi                  21.3.0
argon2-cffi-bindings         21.2.0
asttokens                    2.0.5
async-lru                    2.0.4
async-timeout                4.0.3
attrs                        24.2.0
Babel                        2.11.0
beautifulsoup4               4.12.3
bitsandbytes                 0.45.0
bleach                       6.2.0
Brotli                       1.0.9
cachetools                   5.5.0
certifi                      2024.8.30
cffi                         1.17.1
charset-normalizer           3.3.2
click                        8.1.7
cohere                       