# Generate Instruction-Response Pairs

- This notebook uses the Phi-3 Mini instruct model (3.8B parameters) to generate a synthetic dataset.

- The dataset creation follows the "hack" proposed in the paper "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing": https://arxiv.org/abs/2406.08464

- The dataset creation includes 2 steps:
    - Step 1: Instruction generation
    - Step 2: Response generation

- The generated dataset will be an instruction dataset with "instruction" and "response" fields, similar to the Alpaca format (which calls the second field "output"):

```
{
    "instruction": "What is the capital of Vietnam?",
    "response": "The capital of Vietnam is Hanoi."
}
```
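Alpaca-style records carry three keys: `instruction`, `input`, and `output`. As a minimal sketch, assuming entries are plain dicts with `instruction` and `response` keys as built later in this notebook, a hypothetical helper `to_alpaca` (not part of any library) could convert them:

```python
def to_alpaca(entry: dict) -> dict:
    """Convert an {'instruction', 'response'} entry to Alpaca-style keys.

    `to_alpaca` is a hypothetical helper used only for illustration.
    """
    return {
        "instruction": entry["instruction"],
        "input": "",  # no separate input is generated here, so it stays empty
        "output": entry["response"],
    }


example = {
    "instruction": "What is the capital of Vietnam?",
    "response": "The capital of Vietnam is Hanoi.",
}
print(to_alpaca(example))
```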

# Imports

In [1]:
!pip install accelerate -q

In [2]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import StoppingCriteria, StoppingCriteriaList
import torch
from tqdm import trange
import json

# Load Tokenizer & Model

In [3]:
model_id = 'microsoft/Phi-3-mini-128k-instruct'
device = 'cuda'

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map=device,
                                             low_cpu_mem_usage=True,
                                             torch_dtype=torch.float16)


# Examine the Phi-3 chat template

In [5]:
messages = [
    {'role': 'user', 'content': 'What is the capital of Vietnam?'},
    {'role': 'assistant', 'content': 'The capital of Vietnam is Hanoi.'}
]

In [6]:
tokenizer.decode(
    tokenizer.apply_chat_template(messages)
)

'<s><|user|> What is the capital of Vietnam?<|end|><|assistant|> The capital of Vietnam is Hanoi.<|end|>'
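The decoded string above suggests how a conversation is laid out: each turn is a role tag, the content, and `<|end|>`. The sketch below rebuilds that string by hand so the structure is easier to see. It is an approximation written to match the decoded output shown here, not the tokenizer's actual Jinja template, and `render_chat` is a hypothetical name.

```python
BOS = "<s>"
END = "<|end|>"


def render_chat(messages: list[dict]) -> str:
    """Lay out messages as '<|role|> content<|end|>', prefixed by BOS."""
    turns = "".join(f"<|{m['role']}|> {m['content']}{END}" for m in messages)
    return BOS + turns


messages = [
    {"role": "user", "content": "What is the capital of Vietnam?"},
    {"role": "assistant", "content": "The capital of Vietnam is Hanoi."},
]
rendered = render_chat(messages)
print(rendered)

# Everything up to and including the first role tag is the "pre-query" prefix:
# the part of the layout that comes before any user text.
pre_query = rendered[: rendered.index("<|user|>") + len("<|user|>")]
print(pre_query)
```

The prefix extracted at the end is the only text the model needs to see in order to start writing a user turn on its own.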

# Step 1: Instruction Generation

With the chat template shown above, generating an instruction only requires passing the pre-query prefix `<s><|user|>` as input and stopping at `<|end|>`. Because the model is aligned, it continues this prefix with a plausible user query.

In [7]:
tokenizer('<|end|>', add_special_tokens=False).input_ids

[32007]

In [8]:
stop_on_tokens = [32007]

In [9]:
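class StopOnTokens(StoppingCriteria):
    """Stop generation once the last generated token is one of `stop_ids`."""

    def __init__(self, stop_ids):
        self.stop_ids = set(stop_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Only the first sequence is inspected, so this assumes a batch size of 1.
        return input_ids[0][-1].item() in self.stop_ids

# Reuse the stop token list defined above instead of hard-coding the id again.
stopping_criteria = StoppingCriteriaList([StopOnTokens(stop_on_tokens)])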
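To make the role of the stopping criterion concrete, here is a pure-Python sketch of a decoding loop that checks the last token after every step. `generate_until_stop` and `fake_next_token` are hypothetical stand-ins that run without a model; real generation goes through `model.generate`.

```python
from typing import Callable, Iterable

STOP_ID = 32007  # id of <|end|> reported by the tokenizer above


def generate_until_stop(
    prompt_ids: list[int],
    next_token: Callable[[list[int]], int],
    stop_ids: Iterable[int],
    max_new_tokens: int,
) -> list[int]:
    """Append tokens one at a time; stop once the last token is a stop id."""
    stop_set = set(stop_ids)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(next_token(ids))
        if ids[-1] in stop_set:  # same check as StopOnTokens.__call__
            break
    return ids


# A scripted stand-in for the model: emits 5, 6, the stop token, then 7.
script = iter([5, 6, STOP_ID, 7])


def fake_next_token(ids: list[int]) -> int:
    return next(script)


out = generate_until_stop([1, 2], fake_next_token, [STOP_ID], max_new_tokens=10)
print(out)  # [1, 2, 5, 6, 32007]
```

The token `7` is never produced because the loop exits as soon as the stop id appears. Without a stop id, the loop ends only when `max_new_tokens` is exhausted.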

In [10]:
def generate_instruction(
        model,
        tokenizer,
        pre_query_template,
        stopping_criteria,
        device
    ):
    encodings = tokenizer(pre_query_template, return_tensors='pt', add_special_tokens=False).to(device)

    tokens = model.generate(
        input_ids=encodings['input_ids'],
        max_new_tokens=1024,
        temperature=1.,
        top_p=1.,
        do_sample=True,
        stopping_criteria=stopping_criteria,
        pad_token_id=tokenizer.eos_token_id
    )[0]

    instruction = tokenizer.decode(tokens, skip_special_tokens=True)
    return instruction

In [11]:
instruction = generate_instruction(model=model,
                     tokenizer=tokenizer,
                     pre_query_template='<s><|user|>',
                     stopping_criteria=stopping_criteria,
                     device=device)

print(f'Generated instruction: {instruction}')

Generated instruction: I need a C header for a software package that's like a suite of utilities. Could you set up constants for version info as four strings, and define numeric constants for status, which should be okay only between 0 and 1, and error codes with descriptions that reset on minor updates. There should be version formatting for major and minor to a string, a string for default config, and functions to get this string and a description. Also, include a default numeric timeout, a function to print errors with code descriptions, a logging function with different verbosity levels, and preprocessor macros for conditional compilation. Lastly, ensure the protection of everything against multiple inclusions and add comments for documentation. The version and error descriptions should be easily changeable.


# Step 2: Response Generation

In [13]:
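def generate_response(model, tokenizer, instruction, device):
    messages = [
        {'role': 'user', 'content': instruction}
    ]
    # add_generation_prompt=True ends the prompt with the assistant tag,
    # so the model answers instead of continuing the user turn.
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors='pt'
    ).to(device)

    tokens = model.generate(
        input_ids=input_ids,
        max_new_tokens=1024,
        temperature=1.,
        top_p=1.,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )[0]
    # Decode only the newly generated tokens. Slicing off the prompt avoids
    # string splitting, and skipping special tokens removes a trailing <|end|>
    # without cutting real text when generation stops at max_new_tokens.
    response = tokenizer.decode(tokens[input_ids.shape[-1]:], skip_special_tokens=True)
    return response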
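If the full sequence is decoded without skipping special tokens, a generation that ended because `max_new_tokens` was reached shows up as text with no trailing `<|end|>`. A small sketch, assuming the decoded string uses the tags shown earlier, of how a hypothetical helper `extract_response` could split off the response and report whether it finished cleanly:

```python
ASSISTANT_TAG = "<|assistant|>"
END_TAG = "<|end|>"


def extract_response(decoded: str) -> tuple[str, bool]:
    """Return (response, finished) from a decoded prompt-plus-response string.

    `finished` is True only if the text ends with END_TAG, i.e. the model
    closed its turn rather than being cut off by the token limit.
    """
    response = decoded.split(ASSISTANT_TAG)[-1]
    finished = response.endswith(END_TAG)
    if finished:
        response = response[: -len(END_TAG)]
    return response, finished


complete = "<s><|user|> Hi<|end|><|assistant|> Hello!<|end|>"
cut_off = "<s><|user|> Hi<|end|><|assistant|> Hello, I was about to"
print(extract_response(complete))  # (' Hello!', True)
print(extract_response(cut_off))   # (' Hello, I was about to', False)
```

Entries for which `finished` is `False` could be dropped or regenerated before they enter the dataset.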

In [14]:
response = generate_response(model, tokenizer, instruction, device)

print(f'Generated response: {response}')

Generated response:  Certainly! Here's how you can incorporate the changes:

```c
// Start of the header file
#ifndef SOFTWARE_UTILITIES_H
#define SOFTWARE_UTILITIES_H
// ... [Rest of the predefined macros and includes]

// Define version in a separate header if needed
// #include "software_version.h"

// Error codes array
#define ERROR_CODES [(int*)(^)([0 "InvalidParameter"]) ..., NULL]

// End of the version-related definitions

// Function to check if a numeric code refers to a valid error
int isValidErrorCode(int code);

// ... [Other function declarations]

// End of the utility functions

// Define a macro for preprocessing a config path
#define CONFIG_PATH_PREPROCESSOR(path) #path // Simple example for string concatenation

// End of the header file
#endif // SOFTWARE_UTILITIES_H

// In your source file, implement the isValidErrorCode function
int isValidErrorCode(int code) {
    return (code > DEFAULT_NAKED_ERROR && code <= MAX_NAKED_ERROR);
}
```
I've shown you how to declare 

In [15]:
dataset = [
    {'instruction': instruction, 'response': response}
]

dataset

[{'instruction': "I need a C header for a software package that's like a suite of utilities. Could you set up constants for version info as four strings, and define numeric constants for status, which should be okay only between 0 and 1, and error codes with descriptions that reset on minor updates. There should be version formatting for major and minor to a string, a string for default config, and functions to get this string and a description. Also, include a default numeric timeout, a function to print errors with code descriptions, a logging function with different verbosity levels, and preprocessor macros for conditional compilation. Lastly, ensure the protection of everything against multiple inclusions and add comments for documentation. The version and error descriptions should be easily changeable.",
  'response': ' Certainly! Here\'s how you can incorporate the changes:\n\n```c\n// Start of the header file\n#ifndef SOFTWARE_UTILITIES_H\n#define SOFTWARE_UTILITIES_H\n// ... [R

# Combine the 2 steps to create an instruction dataset

In [16]:
dataset = []
dataset_size = 3

for _ in trange(dataset_size):
    instruction = generate_instruction(
        model=model,
        tokenizer=tokenizer,
        pre_query_template='<s><|user|>',
        stopping_criteria=stopping_criteria,
        device=device
    )

    response = generate_response(model, tokenizer, instruction, device)

    entry = {
        'instruction': instruction,
        'response': response
    }

    dataset.append(entry)

100%|██████████| 3/3 [01:54<00:00, 38.12s/it]
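Sampling may occasionally return an empty or repeated instruction. As a sketch, the loop above can be wrapped in a function that skips such cases. The two generator arguments are assumed to be simple callables (for example `functools.partial` wrappers around `generate_instruction` and `generate_response`), and `build_dataset` itself is a hypothetical helper.

```python
from typing import Callable


def build_dataset(
    make_instruction: Callable[[], str],
    make_response: Callable[[str], str],
    size: int,
    max_attempts: int = 100,
) -> list[dict]:
    """Collect up to `size` unique, non-empty instruction-response pairs."""
    dataset, seen = [], set()
    for _ in range(max_attempts):
        if len(dataset) == size:
            break
        instruction = make_instruction().strip()
        if not instruction or instruction in seen:
            continue  # skip empty or duplicate instructions
        seen.add(instruction)
        dataset.append(
            {"instruction": instruction, "response": make_response(instruction)}
        )
    return dataset


# Stand-ins for the model calls so the sketch runs without a GPU.
canned = iter(["Name a prime.", "", "Name a prime.", "Define entropy."])
demo = build_dataset(lambda: next(canned), lambda q: f"Answer to: {q}", size=2)
print(demo)
```

The `max_attempts` cap keeps the loop from running forever if the generator keeps returning unusable instructions.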


In [17]:
with open("instruction-data-phi3.json", mode="w") as file:
    json.dump(dataset, file, indent=4)
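As a quick sanity check, the file can be read back and each entry verified to have both fields. This sketch writes a small stand-in dataset to a temporary directory so that it runs on its own; in the notebook, the path would be `instruction-data-phi3.json`. `load_and_check` is a hypothetical helper.

```python
import json
import os
import tempfile


def load_and_check(path: str) -> list[dict]:
    """Load a JSON dataset and verify every entry has non-empty string fields."""
    with open(path, encoding="utf-8") as file:
        data = json.load(file)
    for i, entry in enumerate(data):
        for key in ("instruction", "response"):
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"entry {i} has a missing or empty {key!r}")
    return data


sample = [{"instruction": "What is the capital of Vietnam?",
           "response": "The capital of Vietnam is Hanoi."}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "instruction-data-sample.json")
    with open(path, mode="w", encoding="utf-8") as file:
        json.dump(sample, file, indent=4)
    loaded = load_and_check(path)

print(len(loaded))  # 1
```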

In [18]:
dataset

[{'instruction': "I need your help setting up a Scala project using functional programming libraries. I want to convert HTML structures into reactive components with some specific functionalities. Could you provide code to create a package for reactive HTML parsing that includes:\n\n1. A utility to parse optional HTML content that can fail, returning some results.\n2. A function for parsing an HTML string input to extract and generate reactive components based on the content.\n3. A structure that represents parsed HTML, containing raw string data and reactive components with HTML content and a boolean flag.\n4. Methods to extract the raw HTML from this structure, both with and without trimming whitespace.\n\nIt's crucial to include proper error handling and comments. Can you supply me just with the Scala code for these requirements?",
  'response': ' ```scala\nclass HtmlParsingUtils(implicit val sa: Scalatags) extends HtmlParser {\n  def parseOrFail(htmlOrEmptyString: Option[String]): 