# Generate Instruction-Response Pairs

- This notebook uses the Phi-3 Mini instruct model (3.8B parameters) to generate a synthetic dataset.

- The dataset creation follows the "hack" proposed in the paper "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing": https://arxiv.org/abs/2406.08464

- The dataset creation includes 2 steps:
    - Step 1: Instruction generation
    - Step 2: Response generation

- The generated dataset is an instruction dataset with "instruction" and "response" fields, similar to the Alpaca format (which names the answer field "output"):

```
{
    "instruction": "What is the capital of Vietnam?",
    "response": "The capital of Vietnam is Hanoi."
}
```
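The two steps can be sketched without loading a model. This is a minimal illustration, not the notebook's implementation: `make_pair` and the two stub callables are hypothetical stand-ins for the model calls defined later, and the record uses the `response` key that the notebook's code writes.

```python
from typing import Callable, Dict

def make_pair(instruction_fn: Callable[[], str],
              response_fn: Callable[[str], str]) -> Dict[str, str]:
    """Run step 1, then step 2, and package the result as one dataset record."""
    instruction = instruction_fn()         # Step 1: the model writes the user turn
    response = response_fn(instruction)    # Step 2: the model answers that turn
    return {"instruction": instruction, "response": response}

# Stub callables standing in for real model calls (assumption: illustration only).
def fake_instruction() -> str:
    return "What is the capital of Vietnam?"

def fake_response(instruction: str) -> str:
    return "The capital of Vietnam is Hanoi."

pair = make_pair(fake_instruction, fake_response)
print(pair)
```

Keeping the two steps behind separate callables makes it possible to swap in a different model for either step.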

# Imports

In [1]:
!pip install accelerate -q

In [2]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import StoppingCriteria, StoppingCriteriaList
import torch
from tqdm import trange
import json

# Load Tokenizer & Model

In [3]:
model_id = 'microsoft/Phi-3-mini-128k-instruct'
device = 'cuda'

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map=device,
                                             low_cpu_mem_usage=True,
                                             torch_dtype=torch.float16)


# Examine the Phi-3 Chat Template

In [5]:
messages = [
    {'role': 'user', 'content': 'What is the capital of Vietnam?'},
    {'role': 'assistant', 'content': 'The capital of Vietnam is Hanoi.'}
]

In [6]:
tokenizer.decode(
    tokenizer.apply_chat_template(messages)
)

'<s><|user|> What is the capital of Vietnam?<|end|><|assistant|> The capital of Vietnam is Hanoi.<|end|>'

# Step 1: Instruction Generation

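Given the chat template above, generating an instruction only requires passing `<s><|user|>` as the input and stopping generation at `<|end|>`. Because the prompt ends exactly where a user turn would begin, the model writes the user message itself.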
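The pre-query prefix can be read directly off a rendered chat template. The following is a minimal sketch using plain string handling; `pre_query_prefix` is a hypothetical helper, and the `rendered` string is copied from the decoded template output shown earlier.

```python
def pre_query_prefix(rendered: str, user_tag: str = "<|user|>") -> str:
    """Return the rendered template up to and including the first user tag."""
    idx = rendered.find(user_tag)
    if idx == -1:
        raise ValueError(f"user tag {user_tag!r} not found in template")
    return rendered[: idx + len(user_tag)]

# Decoded chat template from the previous section.
rendered = ("<s><|user|> What is the capital of Vietnam?<|end|>"
            "<|assistant|> The capital of Vietnam is Hanoi.<|end|>")

prefix = pre_query_prefix(rendered)
print(prefix)  # <s><|user|>
```

For a different model family, the same approach should work as long as `user_tag` is replaced by that model's user-turn marker.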

In [7]:
tokenizer('<|end|>', add_special_tokens=False).input_ids

[32007]

In [8]:
stop_on_tokens = [32007]

In [9]:
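class StopOnTokens(StoppingCriteria):
    """Stop generation once the last generated token is one of `stop_ids`."""
    def __init__(self, stop_ids):
        self.stop_ids = stop_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Only the first sequence is checked, which is enough for a batch size of 1
        return input_ids[0][-1].item() in self.stop_ids

stopping_criteria = StoppingCriteriaList([StopOnTokens(stop_on_tokens)])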
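The criterion above inspects only the first sequence (`input_ids[0]`), which is sufficient here because one prompt is generated at a time. Below is a minimal sketch of the same idea extended to a batch, written over plain Python lists so it runs without a model. `contains_stop` and `batch_finished` are hypothetical helpers, not part of `transformers`, and the sketch assumes each list holds only the newly generated token IDs.

```python
from typing import Sequence

def contains_stop(generated_ids: Sequence[int], stop_ids: Sequence[int]) -> bool:
    """True if any generated token is a stop token."""
    return any(token in stop_ids for token in generated_ids)

def batch_finished(batch: Sequence[Sequence[int]], stop_ids: Sequence[int]) -> bool:
    """True only when every sequence in the batch has produced a stop token."""
    return all(contains_stop(seq, stop_ids) for seq in batch)

stop_ids = [32007]  # ID of <|end|>, as returned by the tokenizer above

# Second sequence stopped early and was padded afterwards (0 is a made-up pad ID).
done = batch_finished([[5, 9, 32007], [7, 32007, 0]], stop_ids)
not_done = batch_finished([[5, 9], [7, 32007]], stop_ids)
print(done, not_done)  # True False
```

Checking for the stop token anywhere in the generated part, rather than only in the last position, keeps the check valid when sequences that finished early are padded.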

In [10]:
def generate_instruction(
        model,
        tokenizer,
        pre_query_template,
        stopping_criteria,
        device
    ):
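    # The prompt already contains <s>, so no extra special tokens are added
    encodings = tokenizer(pre_query_template, return_tensors='pt', add_special_tokens=False).to(device)

    tokens = model.generate(
        input_ids=encodings['input_ids'],
        attention_mask=encodings['attention_mask'],
        max_new_tokens=1024,
        temperature=1.,
        top_p=1.,
        do_sample=True,  # pure sampling for diverse instructions
        stopping_criteria=stopping_criteria,
        pad_token_id=tokenizer.eos_token_id
    )[0]

    # Skipping special tokens drops the <s><|user|> prefix and the final <|end|>
    instruction = tokenizer.decode(tokens, skip_special_tokens=True)
    return instruction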
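Sampling at temperature 1.0 can yield empty, whitespace-padded, or repeated instructions. A minimal sketch of a cleanup pass follows; `clean_instructions` and its `min_chars` threshold are hypothetical choices for illustration and are not part of the notebook's pipeline.

```python
from typing import Iterable, List

def clean_instructions(instructions: Iterable[str], min_chars: int = 10) -> List[str]:
    """Strip whitespace, drop very short texts, and remove case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for text in instructions:
        text = text.strip()
        if len(text) < min_chars:
            continue  # empty or too short to be a useful instruction
        key = text.lower()
        if key in seen:
            continue  # already kept an equivalent instruction
        seen.add(key)
        cleaned.append(text)
    return cleaned

raw = ["  What is the capital of Vietnam? ", "", "what is the capital of vietnam?", "Hi"]
cleaned = clean_instructions(raw)
print(cleaned)  # ['What is the capital of Vietnam?']
```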

In [11]:
instruction = generate_instruction(model=model,
                     tokenizer=tokenizer,
                     pre_query_template='<s><|user|>',
                     stopping_criteria=stopping_criteria,
                     device=device)

print(f'Generated instruction: {instruction}')

Generated instruction: I need your help to construct a PowerShell script with a function named 'Get-CMSLogonUser' that fetches a list of current logon users from a Citrix management platform. I require the function to have optional input parameters: an identifier and a filter, along with several output parameters like page size, page number, all defaulting to 50 and 1 respectively. There also should be an 'Auth' object that I'll pass to it.

The function needs to use an API call to obtain the user list, which involves constructing the URL with the required parameters and performing an API request. It should handle the response and output the data.

Could you improve this initial outline of the script?

```powershell
function Get-CMSLogonUser {
    param (
        [int]$Identifier = 0,
        [string]$Filter = "",
        [int]$PageSize = 50,
        [int]$PageNumber = 1
    )

    # Assuming $authObject exists and contains necessary auth info

    $url = "/api/v1/citrix/vnp/users/sear
```

# Step 2: Response Generation

In [12]:
def generate_response(model, tokenizer, instruction, device):
    messages = [
        {'role': 'user', 'content': instruction}
    ]
    # add_generation_prompt=True ensures the prompt ends with the assistant tag
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors='pt'
    ).to(device)

    tokens = model.generate(
        input_ids=input_ids,
        max_new_tokens=1024,
        temperature=1.,
        top_p=1.,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )[0]
    instruction_and_response = tokenizer.decode(tokens)
    response = instruction_and_response.split('<|assistant|>')[-1]
    # Remove the end marker only if present; generation may stop at max_new_tokens instead
    if response.endswith('<|end|>'):
        response = response[:-len('<|end|>')]
    return response
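A response that reaches `max_new_tokens` ends without an `<|end|>` marker, so it can be useful to know whether the marker was present. The sketch below works on the decoded string only; `split_response` is a hypothetical helper, and the tag strings are taken from the chat template shown earlier.

```python
from typing import Tuple

def split_response(decoded: str,
                   assistant_tag: str = "<|assistant|>",
                   end_tag: str = "<|end|>") -> Tuple[str, bool]:
    """Return (response_text, finished), where finished means the end tag was found."""
    response = decoded.split(assistant_tag)[-1]
    finished = response.endswith(end_tag)
    if finished:
        response = response[: -len(end_tag)]
    return response.strip(), finished

complete = split_response("<s><|user|> Hi<|end|><|assistant|> Hello!<|end|>")
truncated = split_response("<s><|user|> Hi<|end|><|assistant|> Hel")
print(complete)   # ('Hello!', True)
print(truncated)  # ('Hel', False)
```

Pairs whose `finished` flag is `False` could then be dropped or regenerated before they are added to the dataset.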

In [13]:
response = generate_response(model, tokenizer, instruction, device)

print(f'Generated response: {response}')

Generated response:  Understood. Here is a snippet of the refactored Python class incorporating the changes:

```python
import requests
from enum import Enum

class PageSize(Enum):
    FIRST = 1
    DEFAULT = 50
    LAST = 100

class PageNumber(Enum):
    FIRST = 1

class AuthError(Exception):
    pass

class CMSResponse:
    def __init__(self, status_code, data):
        self.status_code = status_code
        self.data = data

class CMSAuth:
    def __init__(self, client_id, client_secret):
        self.client_id = client_id
        self.client_secret = client_secret
        # Additional security-related fields could be added here

    def perform_api_call(self, auth_token, payload):
        # Example of OAuth2 request - this may vary to fit exact OAuth2 setup
        headers = {'Authorization': f'Bearer {auth_token}'}
        url = 'https://api.example.com/citrix'
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code != 200:
```

In [14]:
dataset = [
    {'instruction': instruction, 'response': response}
]

dataset

[{'instruction': 'I need your help to construct a PowerShell script with a function named \'Get-CMSLogonUser\' that fetches a list of current logon users from a Citrix management platform. I require the function to have optional input parameters: an identifier and a filter, along with several output parameters like page size, page number, all defaulting to 50 and 1 respectively. There also should be an \'Auth\' object that I\'ll pass to it.\n\nThe function needs to use an API call to obtain the user list, which involves constructing the URL with the required parameters and performing an API request. It should handle the response and output the data.\n\nCould you improve this initial outline of the script?\n\n```powershell\nfunction Get-CMSLogonUser {\n    param (\n        [int]$Identifier = 0,\n        [string]$Filter = "",\n        [int]$PageSize = 50,\n        [int]$PageNumber = 1\n    )\n\n    # Assuming $authObject exists and contains necessary auth info\n\n    $url = "/api/v1/citr

# Combine the 2 Steps to Create an Instruction Dataset

In [15]:
dataset = []
dataset_size = 3

for _ in trange(dataset_size):
    instruction = generate_instruction(
        model=model,
        tokenizer=tokenizer,
        pre_query_template='<s><|user|>',
        stopping_criteria=stopping_criteria,
        device=device
    )

    response = generate_response(model, tokenizer, instruction, device)

    entry = {
        'instruction': instruction,
        'response': response
    }

    dataset.append(entry)

100%|██████████| 3/3 [02:28<00:00, 49.38s/it]
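The progress bar reports roughly 49.38 seconds per pair. A quick back-of-the-envelope sketch for larger runs is shown below, assuming the per-pair time stays constant (it varies with generated length and hardware); `estimate_hours` is a hypothetical helper and the 1,000-pair target is only an example.

```python
def estimate_hours(seconds_per_pair: float, n_pairs: int) -> float:
    """Estimate wall-clock hours for n_pairs at a fixed per-pair cost."""
    return seconds_per_pair * n_pairs / 3600.0

hours = estimate_hours(49.38, 1000)
print(f"{hours:.1f} hours")  # 13.7 hours
```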


In [16]:
with open("instruction-data-phi3.json", mode="w") as file:
    json.dump(dataset, file, indent=4)
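Alpaca-style datasets name the answer field `output` and also carry an `input` field, whereas the entries saved here use `response`. The sketch below shows one way to convert the records; `to_alpaca` is a hypothetical helper, the output file name is made up, and a temporary directory is used so the example does not overwrite anything.

```python
import json
import os
import tempfile

def to_alpaca(entries):
    """Map {'instruction', 'response'} records to Alpaca-style keys."""
    return [
        {"instruction": e["instruction"], "input": "", "output": e["response"]}
        for e in entries
    ]

sample = [{"instruction": "What is the capital of Vietnam?",
           "response": "The capital of Vietnam is Hanoi."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "instruction-data-alpaca.json")
    with open(path, mode="w", encoding="utf-8") as file:
        json.dump(to_alpaca(sample), file, indent=4)
    # Read the file back to confirm it round-trips
    with open(path, encoding="utf-8") as file:
        reloaded = json.load(file)

print(reloaded[0]["output"])  # The capital of Vietnam is Hanoi.
```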

In [17]:
dataset

[{'instruction': 'Hello! Can you provide a detailed account of the historical event where the Treaty of Versailles was signed after World War I? Make sure to mention the key figures involved, the countries that were part of the negotiation, the date of signing, and the impact it had on international relations. Give a comprehensive answer without missing any crucial details.',
  'response': ' Certainly! The Treaty of Versailles was a historically significant event as it marked the official end of World War I. It was signed on June 28, 1919, exactly five years after the assassination of Archduke Franz Ferdinand, which was one of the triggers for World War I. \n\nThe peace conference for the treaty took place in the Palace of Versailles, France, hence its name. The main figures involved in the negotiation of the treaty were Georges Clemenceau, the French Prime Minister, David Lloyd George, the British Prime Minister, and Woodrow Wilson, the President of the United States. These three lead