# Implementation of Self-Instruct

Paper link: https://arxiv.org/abs/2212.10560

### What is our goal? Create 1000 rows of finetuning data from 10 rows of data

- So what does a row of data ready for finetuning look like?
- It consists of two things: a prompt and a completion.
```
{"prompt": "Classify the sentiment of the following movie review:\n\nInput: This movie was absolutely terrible. The acting was wooden, the plot made no sense, and I was bored throughout.\n\nOutput:", "completion": " Negative"}
{"prompt": "Determine whether the following statement is a fact or opinion:\n\nInput: The Earth orbits around the Sun.\n\nOutput:", "completion": " Fact"}
```
- Both rows above are classification examples.


- Here's another, this time a summarization task.
```
{"prompt": "Summarize the following text in one sentence:\n\nInput: The Industrial Revolution, which took place from the 18th to 19th centuries, was a period during which predominantly agrarian, rural societies in Europe and America became industrial and urban. This period saw the mechanization of manufacturing, improved transportation systems, and significant technological innovations. It transformed the daily lives of people and had far-reaching effects on socioeconomic and cultural conditions.\n\nOutput:", "completion": " The Industrial Revolution was a period of rapid industrialization and urbanization in Europe and America, characterized by technological advancements that dramatically changed society and the economy."}
```
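To make the row format concrete, here is a minimal sketch of checking that a JSONL line has the shape shown above. The helper name `parse_row` and the specific checks are illustrative assumptions, not part of the paper or of any finetuning API.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}  # the two fields used in the rows above

def parse_row(line):
    """Parse one JSONL line and check it has the two expected string fields."""
    row = json.loads(line)
    missing = REQUIRED_KEYS - row.keys()
    if missing:
        raise ValueError(f"row is missing keys: {sorted(missing)}")
    if not all(isinstance(row[key], str) for key in REQUIRED_KEYS):
        raise ValueError("prompt and completion must both be strings")
    return row

line = json.dumps({
    "prompt": "Determine whether the following statement is a fact or opinion:"
              "\n\nInput: The Earth orbits around the Sun.\n\nOutput:",
    "completion": " Fact",
})
row = parse_row(line)
print(row["completion"])
```

A check like this is cheap to run over the whole output file before handing it to a training job.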


### Steps to do that:

- Begin with a Seed Set of Human-Written Instructions: Start by compiling a collection of well-crafted instructions that serve as the foundation for generating new tasks. Examples of instructions include:

    - {"instruction": "Design a 3-ingredient snack that meets the specified criteria, and provide a short explanation for each ingredient's role in the snack."}
    - {"instruction": "Find a healthier alternative for each of the given options."}
- Generate New Instructions: Utilize a language model to generate a new set of instructions based on the seed tasks. This process involves creating prompts that guide the model to generate instructions similar to the seed tasks. The generated instructions are then filtered to ensure they are not excessively similar to existing instructions using the ROUGE-L similarity score.

- Classify Instructions: Determine whether each instruction corresponds to a classification or non-classification task. This classification is essential for selecting the appropriate template when generating instances for each instruction.

- Generate Instances: For each instruction, use a suitable template (either input-first or output-first) to generate instances. The template is tailored to the task type (classification or non-classification). The generated instances are then extracted from the model's response and added to the metadata of each instruction.
    - Here's what a complete instance looks like:
    - {"instruction": "Design a 3-ingredient snack that meets the specified criteria, and provide a short explanation for each ingredient's role in the snack.", "examples": [{"class_label": "Sweet & Savory", "input": {"ingredient1": "Apple slices", "explain1": "Providing natural sweetness and crisp texture", "ingredient2": "Cheddar cheese", "explain2": "Adding a rich, savory flavor and creamy texture", "ingredient3": "Walnuts", "explain3": "Adding crunch and earthy flavor"}, "output": "Sweet & Savory"}, {"class_label": "Sweet Treat", "input": {"ingredient1": "Pineapple chunks", "explain1": "Adding natural sweetness and juicy texture", "ingredient2": "Dark chocolate chips", "explain2": "Providing rich, sweet flavor and indulgent texture", "ingredient3": "Coconut flakes", "explain3": "Adding tropical flavor and extra crunch"}, "output": "Sweet Treat"}, {"class_label": "Savory Snack", "input": {"ingredient1": "Baby carrots", "explain1": "Providing crunchy texture and mild sweetness", "ingredient2": "Cream cheese", "explain2": "Adding rich, creamy flavor and smooth texture", "ingredient3": "Chopped pecans", "explain3": "Adding nutty flavor and extra crunch"}, "output": "Savory Snack"}]}

- Prepare data for finetuning: Finally, we take each instance, clean it, filter out duplicate instances, shuffle the remaining instances, and save the encoded instances to a JSONL file.
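The duplicate filtering and shuffling in the last step can be sketched as follows. This is a minimal illustration under assumptions: the helper name `dedupe_and_shuffle` is hypothetical, duplicates are taken to mean exact matches on the stripped prompt and completion, and a seeded local random generator is used so the order is reproducible.

```python
import random

def dedupe_and_shuffle(rows, seed=0):
    """Drop exact duplicate prompt/completion pairs, then shuffle reproducibly."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["prompt"].strip(), row["completion"].strip())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    random.Random(seed).shuffle(unique)  # local RNG, global random state untouched
    return unique

rows = [
    {"prompt": "A\n\nOutput:", "completion": " x"},
    {"prompt": "A\n\nOutput:", "completion": " x"},  # exact duplicate
    {"prompt": "B\n\nOutput:", "completion": " y"},
]
cleaned = dedupe_and_shuffle(rows)
print(len(cleaned))  # 2
```

Near-duplicates (for example, rows that differ only in punctuation) would need a fuzzier comparison than this exact-match key.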

# Setting up the environment and importing libraries

In [1]:
import os
import json
import random
import re
import string
import time
from datetime import datetime
from groq import Groq  # Import the Groq client
import pandas as pd
import numpy as np
from rouge_score import rouge_scorer
from multiprocessing import Pool
from functools import partial
from templates.clf_task_template import template_1
from templates.instance_gen_template import output_first_template_for_clf, input_first_template_for_gen


# Setting up a function to make use of the free Llama-3 API available from Groq

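### Function Arguments:
- Prompts: list of prompts to send to the API, one request per prompt
- Stop Sequences: for every prompt, return the response only up to the first stop sequence that is found
- Retries: number of times to retry after an API error, with a growing wait between attempts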
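The stop-sequence behaviour can be sketched on its own, without calling the API. The helper name `truncate_at_stop` is an assumption for illustration. Like the notebook's function, it checks the stop sequences in list order and cuts at the first one that appears anywhere in the text.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the first stop sequence (in list order) that appears in it."""
    for stop_seq in stop_sequences:
        if stop_seq in text:
            return text.split(stop_seq)[0]
    return text

reply = "Hello! It's nice to meet you. Is there something I can help you with?"
print(repr(truncate_at_stop(reply, ["you"])))  # "Hello! It's nice to meet "
```

Note that the cut happens at the first occurrence of the matching sequence, so a common word such as "you" will truncate most replies early.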

In [2]:
# Configure Groq API
# Read the key from the environment rather than hard-coding it in the notebook
groq_api_key = os.environ["GROQ_API_KEY"]
client = Groq(api_key=groq_api_key)

# Function to make requests to Groq API
def make_requests(prompts, stop_sequences=None, retries=3):
    stop_sequences = stop_sequences or []  # avoid a mutable default argument
    response = None
    retry_cnt = 0
    backoff_time = 30
    results = []

    while retry_cnt <= retries:
        try:
            # Resume after the last successful prompt so a retry does not duplicate results
            for prompt in prompts[len(results):]:
                chat_completion = client.chat.completions.create(
                    messages=[
                        {"role": "user", "content": prompt}
                    ],
                    model="llama3-8b-8192"
                )
                response = chat_completion.choices[0].message.content.strip()
                # Truncate the response at the first stop sequence found
                for stop_seq in stop_sequences:
                    if stop_seq in response:
                        print("\n\nOriginal response")
                        print(response)
                        print("\n\nFound stop sequence ", stop_seq)
                        print(response.split(stop_seq))
                        response = response.split(stop_seq)[0]
                        break
                results.append({
                    "prompt": prompt,
                    "response": {"choices": [{"text": response}]},
                    "created_at": str(datetime.now()),
                })
            break
        except Exception as e:
            # Wait, then retry with a longer delay each time
            print(f"Error: {e}. Retrying in {backoff_time} seconds...")
            time.sleep(backoff_time)
            backoff_time *= 1.5
            retry_cnt += 1
            
    return results


### Here's an example of how to use the above function

In [3]:
make_requests(
    prompts=["Hello", "What are 3 breeds of cats ?"], 
    stop_sequences=["you"]
)



Original response
Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?


Found stop sequence  you
["Hello! It's nice to meet ", '. Is there something I can help ', ' with, or would ', ' like to chat?']


[{'prompt': 'Hello',
  'response': {'choices': [{'text': "Hello! It's nice to meet "}]},
  'created_at': '2024-07-12 11:07:15.762225'},
 {'prompt': 'What are 3 breeds of cats ?',
  'response': {'choices': [{'text': 'Here are three breeds of cats:\n\n1. Siamese\n2. Persian\n3. Maine Coon'}]},
  'created_at': '2024-07-12 11:07:16.224871'}]

# Load Seed Tasks 

The set of human-written instructions that we will expand, e.g. from 50 to 500.

In [4]:
# Load seed tasks from a JSONL file (one JSON object per line)
seed_tasks_path = "data/seed_tasks.jsonl"
with open(seed_tasks_path, "r") as f:
    seed_tasks = [json.loads(line) for line in f]

In [5]:
seed_instructions = [task["instruction"] for task in seed_tasks]

# Example of an instruction
seed_instructions[0]

"Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?"

# Generate Similar Instructions

#### Here's what this function does in plain English:
- The generate_instructions function takes the seed tasks and tries to generate up to num_instructions new instructions. On each iteration it samples five seed instructions, asks the language model for a similar one, and discards the candidate if its ROUGE-L F-measure against any seed or previously accepted instruction exceeds 0.7. Because rejected or unparseable candidates are skipped rather than regenerated, the function may return fewer instructions than requested. It returns the list of accepted instructions and a parallel list of metadata holding each instruction's ten most similar existing instructions with their scores, the average similarity score, and a request index.
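To show what the similarity filter computes, here is a dependency-free sketch of a ROUGE-L F-measure based on the longest common subsequence of tokens. It is a simplified stand-in for the rouge_score package used in the notebook: it lowercases and splits on whitespace and does no stemming, so its scores can differ slightly from the library's. The function names are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f(reference, candidate):
    """Simplified ROUGE-L F-measure: lowercase whitespace tokens, no stemming."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def is_novel(candidate, existing, threshold=0.7):
    """Keep a candidate only if no existing instruction scores above threshold."""
    return all(rouge_l_f(old, candidate) <= threshold for old in existing)

existing = ["Find a healthier alternative for each of the given options."]
print(is_novel("Find a healthier alternative for each of the given foods.", existing))  # False
print(is_novel("Write a haiku about autumn rain.", existing))                           # True
```

The first candidate shares nine of ten tokens with the existing instruction (F-measure 0.9), so it is rejected; the second shares only one token and is kept.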



In [8]:
def generate_instructions(seed_tasks, num_instructions=10):
    instructions = []
    metadata = []
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    seed_instructions = [task["instruction"] for task in seed_tasks]
    machine_instructions = []
    request_idx = 0

    for i in range(num_instructions):
        # Construct the prompt
        seed_instructions_sample = [task["instruction"] for task in random.sample(seed_tasks, 5)]
        
        prompt = "Generate a new instruction for a task that is similar to the following instructions. Format your response as a JSON object with the key 'instruction'. Strictly output JSON and nothing else.\n\n"
        prompt += "Examples:\n"

        # Add the seed instructions to the prompt
        for task in seed_instructions_sample:
            prompt += f'{{"instruction": "{task}"}}\n'

        prompt += "\nNew instruction:"

        # Call the make_requests function to generate a new instruction
        response = make_requests([prompt], stop_sequences=[])[0]["response"]["choices"][0]["text"]

        # Parse the JSON response and extract the new instruction
        # (skip the candidate if the reply is not valid JSON or lacks the key)
        try:
            new_instruction = json.loads(response)["instruction"]
        except (json.JSONDecodeError, KeyError, TypeError):
            print(f"Error parsing JSON response: {response}")
            continue

        # Compute ROUGE-L similarity with seed and machine-generated instructions
        all_instructions = seed_instructions + machine_instructions
        with Pool(4) as p:
            rouge_scores = p.map(partial(scorer.score, new_instruction), all_instructions)
        rouge_scores = [score["rougeL"].fmeasure for score in rouge_scores]

        # If the new instruction is too similar to existing instructions, skip it
        if max(rouge_scores) > 0.7:
            continue

        # Add the generated instruction and metadata to the lists
        instructions.append(new_instruction)
        metadata.append({
            "most_similar": {
                all_instructions[i]: rouge_scores[i] for i in np.argsort(rouge_scores)[-10:][::-1]
            },
            "avg_similarity_score": float(np.mean(rouge_scores)),
            "request_idx": request_idx
        })
        machine_instructions.append(new_instruction)
        request_idx += 1

    return instructions, metadata

In [9]:
# Number of instructions to generate
NUM_INSTRUCTIONS_TO_GENERATE = 50

new_instructions, new_metadata = generate_instructions(seed_tasks, num_instructions=NUM_INSTRUCTIONS_TO_GENERATE)

# Save the generated instructions and metadata to a JSONL file
with open("generated_instructions.jsonl", "w") as f:
    for instruction, metadata in zip(new_instructions, new_metadata):
        f.write(json.dumps({
            "instruction": instruction,
            "metadata": metadata
        }) + "\n")

# Checking Whether Each Instruction Is a Classification Task

Instance generation uses a different template for classification tasks, so each instruction is labelled first. The function below builds a prompt asking whether the instruction is a classification task, sends it to the language model, and marks the instruction as classification when the reply contains "yes". It returns a list of dictionaries, each holding an instruction and a boolean is_classification value.
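Model replies often wrap the JSON in extra text, so a plain substring check can misfire when the word "yes" appears in an explanation. A stricter alternative is sketched below: it parses the JSON object and reads the key directly. The helper name `parse_is_classification` and the choice to return None for unusable replies are assumptions, not what the notebook's function does.

```python
import json

def parse_is_classification(response):
    """Return True/False from a reply, or None if no usable JSON is found."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    value = str(payload.get("is_classification", "")).strip().lower()
    if value in ("yes", "no"):
        return value == "yes"
    return None

print(parse_is_classification('Here is the response:\n\n{"is_classification": "yes"}'))  # True
print(parse_is_classification('{\n"is_classification": "no"\n}\n\nYes, I am sure.'))      # False
print(parse_is_classification('{"is_classification": "no"'))                              # None
```

Returning None lets the caller decide whether to retry or to fall back to a default label.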

In [10]:
def classify_instructions(instructions):
    prefix = template_1
    classified_instructions = []

    for instruction in instructions:
        # Construct the prompt
        prompt = prefix + " " + instruction + "\n" + "Is it classification? Format your response as a JSON object with the key 'is_classification' and value 'yes' or 'no' only for the last instruction. Strictly output JSON and nothing else.\n\n"
        
        response = make_requests([prompt], stop_sequences=[])[0]["response"]["choices"][0]["text"]
        print("Response", response)

        # Parse the response: treat any reply containing "yes" as a classification task
        is_classification = "yes" in response.lower()

        # Add the classification label to the instruction data
        instruction_data = {
            "instruction": instruction,
            "is_classification": is_classification
        }
        classified_instructions.append(instruction_data)

    return classified_instructions

In [11]:
import json

# Open the existing JSONL file with metadata
with open("generated_instructions.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

# Extract the instructions from the data
instructions = [item["instruction"] for item in data]

# Use the `classify_instructions` function to determine the value of `is_classification` for each instruction
classified_instructions = classify_instructions(instructions)

# Add the `is_classification` field to each instruction in the data
for item, classification in zip(data, classified_instructions):
    item["metadata"]["is_classification"] = classification["is_classification"]

# Save the updated data to a new JSONL file
with open("classified_instructions.jsonl", "w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")

Response Here is the response:

{"is_classification": "yes"}
Response {"is_classification": "no"}
Response {
"is_classification": "no"
}
Response {
"is_classification": "no"
}
Response Here is the JSON output:

{
"is_classification": "no"
}
Response {"is_classification": "no"}
Response {
"is_classification": "no"
}
Response {
"is_classification": "no"
}
Response Here is the JSON response:

{
"is_classification": "no"
}

The task "Create a concise FAQ section consisting of three questions and answers for the selected topic" involves generating text based on a given topic, rather than classifying the input into pre-defined categories.
Response {
"is_classification": "no"
}
Response Here is the JSON response:

{
"is_classification": "yes"
}
Response {
"is_classification": "no"
Response {"is_classification": "no"}
Response {
"is_classification": "no"
}
Response {
"is_classification": "yes"
}
Response {"is_classification": "no"}
Response {
"is_classification": "yes"
}
Response {"is_classifi

# Generating Instances

This code defines two functions. extract_json pulls a JSON object out of a response string, and generate_instances uses the language model to generate instances for each data item based on its instruction and classification label.

The extract_json function takes a response string and the prompt that produced it. It finds the first '{' and the last '}' in the response and tries to parse the substring between them as JSON. If no braces are found or parsing fails, it re-sends the prompt and tries again, up to max_retries times. If no valid JSON object has been extracted by then, it returns None.

The generate_instances function takes a list of data items, an input-first template string, and an output-first template string. For each item it picks the output-first template if the classification label is True and the input-first template otherwise, sends the resulting prompt to make_requests (which calls the Groq API), and passes the response to extract_json. Any instances that come back are stored in the item's metadata. The function modifies the items in place; the code that follows it saves the updated data to a new JSONL file.
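The extract-and-retry logic can be tried without the API by passing in a callable that supplies the next reply. This is a sketch under assumptions: `extract_json_with_retry` and its `ask_again` parameter are hypothetical names, and the fake replies stand in for real model output.

```python
import json

def extract_json_with_retry(response, ask_again, max_retries=3):
    """Slice from the first '{' to the last '}' and parse; re-ask on failure."""
    for attempt in range(max_retries + 1):
        start, end = response.find("{"), response.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(response[start:end + 1])
            except json.JSONDecodeError:
                pass  # fall through to the retry below
        if attempt < max_retries:
            response = ask_again()  # hypothetical stand-in for a fresh API call
    return None

# Fake model replies: the first attempt is truncated, the retry is valid.
replies = iter(['Here is the JSON:\n{"examples": [{"input": "a", "output": "b"}]}'])
result = extract_json_with_retry('{"examples": [', lambda: next(replies))
print(result)
```

Injecting the retry callable keeps the parsing logic testable and makes the number of extra API calls explicit.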



In [12]:
import json
import re

def extract_json(response, prompt, max_retries=3):
    # Find the first "{" and the last "}"
    start = response.find('{')
    end = response.rfind('}')
    if start != -1 and end != -1:
        try:
            return json.loads(response[start:end+1])
        except json.JSONDecodeError:
            print("Error: Invalid JSON. Retrying...")
    else:
        print("Error: No JSON object found in the response. Retrying...")

    if max_retries <= 0:
        print("Error: Max retries reached. Unable to get valid JSON.")
        return None

    # Make another call to make_requests and try to parse the new response
    new_response = make_requests([prompt], stop_sequences=[])[0]["response"]["choices"][0]["text"]
    return extract_json(new_response, prompt, max_retries - 1)

def generate_instances(data, input_first_template, output_first_template):
    for item in data:
        instruction = item["instruction"]
        is_classification = item["metadata"]["is_classification"]
        if is_classification:
            # Use the output-first template for classification tasks
            prompt = output_first_template + " " + instruction + "\n"
            prompt += "\n\nPlease return the output in JSON format with the keys 'task' and 'examples'."
        else:
            # Use the input-first template for non-classification tasks
            prompt = input_first_template + " " + instruction
            prompt += "\n\nPlease return the output in JSON format with the key 'examples'."

        response = make_requests([prompt], stop_sequences=[])[0]["response"]["choices"][0]["text"]
        print("Response", response)
        # Store the parsed instances only if valid JSON was extracted
        instances = extract_json(response, prompt)
        if instances:
            item["metadata"]["instances"] = instances

# Open the classified instructions JSONL file
with open("classified_instructions.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

# Generate instances for each instruction and add them to the metadata
generate_instances(data, input_first_template_for_gen, output_first_template_for_clf)

# Save the updated data to a new JSONL file
with open("instructions_with_instances.jsonl", "w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")

Response Here is the generated JSON:

{
  "task": "Develop a scenario in which the given set of skills are valuable, and highlight at least three specific ways they would be useful.",
  "examples": [
    {
      "class_label": "Scenario",
      "input": "A team of data analysts is tasked with analyzing customer purchase behavior for a large retail corporation. They need to identify patterns in the data to inform product development and marketing strategies.",
      "output": "The team's skills in data analysis, visualization, and machine learning would be valuable in identifying patterns in customer purchase behavior, allowing them to make data-driven decisions to increase sales and customer retention."
    },
    {
      "class_label": "Scenario",
      "input": "A company is looking to develop an AI-powered chatbot to improve customer service. They need a team that can design the chatbot's dialogue flow and train it to respond to customer inquiries accurately.",
      "output": "The 

# Preparing Data for Finetuning

This code prepares the data for fine-tuning a language model. It reads the records from the JSONL file, takes the instruction, input, and output of each example (for classification tasks the class label is used as the output), formats them into prompt and completion strings, shuffles the resulting instances, and writes them to a new JSONL file in the format shown at the top of this notebook.
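The per-record conversion can be sketched in isolation as below. The helper name `record_to_rows` is an assumption; unlike a direct dictionary lookup, it returns an empty list for records that ended up without any generated instances instead of raising an error.

```python
def record_to_rows(item):
    """Turn one generated record into a list of prompt/completion dicts."""
    meta = item.get("metadata", {})
    examples = (meta.get("instances") or {}).get("examples", [])
    rows = []
    for example in examples:
        # Classification tasks use the class label as the target text.
        target = example["class_label"] if meta.get("is_classification") else example["output"]
        prompt = f"{item['instruction']}\n\nInput: {example['input']}\n\nOutput:"
        rows.append({"prompt": prompt, "completion": str(target)})
    return rows

item = {
    "instruction": "Classify the sentiment of the following movie review:",
    "metadata": {
        "is_classification": True,
        "instances": {"examples": [
            {"class_label": "Negative", "input": "I was bored throughout.", "output": "Negative"},
        ]},
    },
}
print(record_to_rows(item))
print(record_to_rows({"instruction": "x", "metadata": {"is_classification": False}}))  # []
```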

In [None]:
import json
import random

def prepare_finetuning_data(input_file, output_file):
    # Open the input file and load the data
    with open(input_file, "r") as f:
        data = [json.loads(line) for line in f]

    # Initialize an empty list to store the instances
    instances = []

    # Iterate over each item in the data
    for item in data:
        # Extract the instruction and classification label from the item
        instruction = item["instruction"]
        is_classification = item["metadata"]["is_classification"]

        # Skip items for which no instances could be generated
        examples = item["metadata"].get("instances", {}).get("examples", [])

        # Iterate over each example in the item
        for example in examples:
            input_text = example["input"]
            # Classification tasks use the class label as the target; other tasks use the output
            output = example["class_label"] if is_classification else example["output"]

            # Format the prompt string based on the instruction and input
            prompt = f"{instruction}\n\nInput: {input_text}\n\nOutput:"

            # Append a dictionary containing the prompt and completion to the instances list
            instances.append({"prompt": prompt, "completion": output})

    # Shuffle the instances list to ensure that the data is well-mixed for training
    random.shuffle(instances)

    # Write the instances list to a JSONL file
    with open(output_file, "w") as f:
        for instance in instances:
            f.write(json.dumps(instance) + "\n")

# Call the function with the input and output file paths
input_file = "instructions_with_instances.jsonl"
output_file = "finetuning_data.jsonl"
prepare_finetuning_data(input_file, output_file)
