# **Agent Instruct**

## Installing Dependencies

In [1]:
pip install "distilabel[hf-transformers, openai]>=1.0.0"

Note: you may need to restart the kernel to use updated packages.


## Importing the libraries

In [2]:
from datasets import DatasetDict, Dataset
import pandas as pd
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, KeepColumns, LoadDataFromDicts
from distilabel.steps import Step, StepInput
from distilabel.steps.typing import StepOutput
from distilabel.steps.tasks import TextGeneration, SelfInstruct
from typing import List

Logging in to the Hugging Face Hub

In [3]:
import os
from huggingface_hub import login

# Read the access token from the environment instead of hard-coding it in the notebook
HF_AUTH_TOKEN = os.environ["HF_TOKEN"]
login(token=HF_AUTH_TOKEN)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Defining the prompts for LLMs

In [4]:
criteria_for_query_generation = (
    # Each item ends with "\n" because adjacent string literals are joined without a separator
    "1. Relevance: Ensure the questions are directly related to the content and context of the input paragraph.\n"
    "2. Diversity: Include a variety of question types such as factual, analytical, inferential, and evaluative.\n"
    "3. Clarity: Make sure each question is clear, concise, and unambiguous.\n"
    "4. Complexity: Incorporate questions of varying difficulty levels, from simple recall to complex analysis.\n"
    "5. Coverage: Cover the entire content of the paragraph, addressing different sections and key points.\n"
    "6. Specificity: Frame questions to be specific and pointed, encouraging precise answers.\n"
    "7. Engagement: Create questions that are interesting and engaging, promoting thoughtful responses.\n"
    "8. Open-endedness: A portion of the generated questions should encourage creative and thoughtful responses, rather than simple factual recall.\n"
    "9. Output: Provide only the five user queries without any introductory or explanatory text."
)

# application_description = "This AI assistant is designed to provide comprehensive and informative answers to a wide range of questions based on the information it has been trained on. It should be able to handle complex queries, identify relevant information, and present answers in a clear and concise manner. The goal is to create an AI that can simulate human-like understanding and reasoning to respond to any query effectively."

application_description = "This AI assistant is designed to generate a series of relevant and thought-provoking questions based on the provided context or input. The goal is to generate questions that cover different aspects of the topic without providing answers. The goal is to create an AI that can simulate human-like understanding and reasoning to respond to any query effectively."

suggestions_prompt_value = "You are an AI assistant tasked with generating suggestions to improve a given question. Your task is to analyze the provided question and generate exactly three distinct suggestions that enhance its complexity, quality, or diversity. These suggestions should maintain the core meaning of the original question while introducing new elements or perspectives. Focus on generating creative and informative suggestions that could lead to more challenging and thought-provoking questions. Do not include any introductory or concluding statements and avoid using any special formatting or headings. Simply provide three clear and concise suggestions."

questions_prompt_value = "You are an AI assistant tasked with generating refined questions based on provided suggestions. Modify the question according to these suggestions to create exactly three new, more refined and complex questions.  Each question should be numbered sequentially, starting with the number 1, and end with a question mark. Do not include any additional text, formatting, or explanation. Simply provide the questions in the following format: 1. [Question] 2. [Question] 3. [Question]"
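Python joins adjacent string literals inside parentheses with no separator, so a numbered list such as `criteria_for_query_generation` reaches the model as one run-on line unless each item ends with a newline. A minimal sketch of an alternative that builds the list with explicit newlines; the `criteria` list below is a shortened placeholder, not the notebook's full set of criteria:

```python
# Illustrative, shortened criteria; placeholders rather than the full list above.
criteria = [
    "Relevance: keep questions tied to the input paragraph.",
    "Clarity: make each question clear and unambiguous.",
    "Output: provide only the user queries.",
]

# Number the items and join them with explicit newlines.
criteria_text = "\n".join(f"{i}. {item}" for i, item in enumerate(criteria, start=1))

# Adjacent literals, by contrast, are concatenated with nothing in between.
run_on = (
    "1. First."
    "2. Second."
)

print(criteria_text)
print(repr(run_on))
```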



Defining Instruction Splitter class

In [5]:
class InstructionSplitter:
    def split_instructions_from_dataset(self, dataset: Dataset):
        # Expand every row into one row per generated instruction
        new_rows = []
        for row in dataset:
            new_rows.extend(self.split_instructions_from_row(row))
        return new_rows

    def split_instructions_from_row(self, row):
        # Copy the row once per entry in 'instructions', replacing the
        # list column with a single-valued 'instruction' column
        results = []
        for instruction in row['instructions']:
            result = row.copy()
            result['instruction'] = instruction
            del result['instructions']
            results.append(result)
        return results
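
To make the splitter's effect concrete, the following sketch applies the same row-splitting logic to a single hand-written row. The function name `split_row` and the sample row are made up for illustration; the sketch uses plain dictionaries and does not depend on `datasets` or `distilabel`.

```python
def split_row(row):
    """Return one copy of `row` per entry in row['instructions']."""
    results = []
    for instruction in row["instructions"]:
        result = row.copy()                   # shallow copy keeps the other columns
        result["instruction"] = instruction   # singular column holds one question
        del result["instructions"]            # drop the list column from the copy
        results.append(result)
    return results

# Hypothetical row, shaped like the output of a step that emits a list of instructions.
row = {"raw_seed": "Solar power", "instructions": ["What is it?", "Why does it matter?"]}
rows = split_row(row)
for r in rows:
    print(r)
```

Because each result is built from a copy, the original row keeps its `instructions` list.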

Defining the RenameColumn step

In [6]:
from pydantic import Field

class RenameColumn(Step):

    old_column: str = Field(..., description="The name of the column to rename.")
    new_column: str = Field(..., description="The new name for the column.")

    @property
    def inputs(self) -> List[str]:
        # Specify the input fields expected by this step
        return [self.old_column]

    @property
    def outputs(self) -> List[str]:
        # Specify the output fields that this step will produce
        return [self.new_column]

    def process(self, inputs: StepInput) -> StepOutput:
        for example in inputs:
            if self.old_column in example:
                example[self.new_column] = example.pop(self.old_column)  # Rename the column
        yield inputs
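
The `process` method above mutates each row in place and then yields the whole batch once. A rough stand-alone sketch of that pattern, using a plain generator function instead of a distilabel `Step` (the `rename_column` function and the sample batch are illustrative assumptions):

```python
from typing import Dict, Iterator, List

def rename_column(batch: List[Dict], old: str, new: str) -> Iterator[List[Dict]]:
    """Rename key `old` to `new` in every row, then yield the batch once."""
    for row in batch:
        if old in row:
            row[new] = row.pop(old)  # move the value under the new key
    yield batch

# Made-up batch: the second row lacks the column and is left unchanged.
batch = [{"instruction": "Explain tides."}, {"other": 1}]
renamed = next(rename_column(batch, "instruction", "raw_seed"))
print(renamed)
```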

Defining the column replacer step

In [7]:
class ReplaceAllColumnValues(Step):
    column_name: str = Field(..., description="The name of the column whose values will be changed.")
    new_value: str = Field(..., description="The new value that will replace all existing values in the column.")

    @property
    def inputs(self) -> List[str]:
        return [self.column_name]

    @property
    def outputs(self) -> List[str]:
        return [self.column_name]

    def process(self, inputs: StepInput) -> StepOutput:
        for example in inputs:
            if self.column_name in example:
                example[self.column_name] = self.new_value  # Update the column value
        yield inputs
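
In the pipeline this step is used to swap the system prompt between generation stages. A small sketch of the same overwrite on plain dictionaries (the `replace_all` helper and the rows are hypothetical):

```python
def replace_all(batch, column, new_value):
    """Overwrite `column` with `new_value` in every row that already has it."""
    for row in batch:
        if column in row:
            row[column] = new_value
    return batch

batch = [
    {"system_prompt": "old prompt A", "instruction": "Q1"},
    {"system_prompt": "old prompt B", "instruction": "Q2"},
    {"instruction": "Q3"},  # no system_prompt column, so this row is left untouched
]
updated = replace_all(batch, "system_prompt", "Suggest three improvements.")
print(updated)
```

Rows that lack the column are skipped rather than given one, so the overwrite only takes effect when upstream rows already carry a `system_prompt` column.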

Defining the SplitInstructions step

In [8]:
class SplitInstructions(Step):
    @property
    def inputs(self) -> List[str]:
        # Specify the input fields expected by this step
        return ['instructions']

    @property
    def outputs(self) -> List[str]:
        # Specify the output fields that this step will produce
        return ['instruction']

    def process(self, inputs: StepInput) -> StepOutput:
        inputs = InstructionSplitter().split_instructions_from_dataset(inputs)
        yield inputs

Defining the merge question and suggestion step

In [9]:
class MergeQuestionSuggesions(Step):
    @property
    def inputs(self) -> List[str]:
        # Specify the input fields expected by this step
        return ['question', 'suggestions']

    @property
    def outputs(self) -> List[str]:
        # Specify the output fields that this step will produce
        return ['instruction']

    def process(self, inputs: StepInput) -> StepOutput:
        for example in inputs:
            # Append the suggestions to the question to form the next prompt
            combined_text = example['question'] + "\n\nSuggestions:\n" + example['suggestions']
            example['instruction'] = combined_text
        yield inputs
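
The merged `instruction` is the question followed by a `Suggestions:` section. A sketch with made-up values showing the text the next generation step would receive (the `merge` helper and the example strings are illustrative):

```python
def merge(question: str, suggestions: str) -> str:
    """Concatenate a question and its suggestions in the format used above."""
    return question + "\n\nSuggestions:\n" + suggestions

# Made-up question and suggestions standing in for model output.
example = {
    "question": "How do solar panels work?",
    "suggestions": "1. Ask about efficiency.\n2. Compare climates.\n3. Add cost.",
}
example["instruction"] = merge(example["question"], example["suggestions"])
print(example["instruction"])
```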

## Creating the Pipeline

In [14]:
shared_model = TransformersLLM(model="microsoft/Phi-3.5-mini-instruct", device="cuda:0")

with Pipeline(name="Question Generation") as pipeline:
    load_hub_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"prompt": "instruction"}
    )

    text_generation = TextGeneration(
        # llm = TransformersLLM(model="microsoft/Phi-3-mini-4k-instruct"),
        # llm = TransformersLLM(model="meta-llama/Meta-Llama-3-8B-Instruct", device= "cuda:0"),
        llm = shared_model,
        input_batch_size=1,
        add_raw_output=False,
        output_mappings={"generation": "input", "model_name": "transformed_text_model"},
    )

    self_instruct = SelfInstruct(
        llm = shared_model,
        # llm = TransformersLLM(model="Doctor-Shotgun/TinyLlama-1.1B-32k-Instruct", device= "cuda:0"),
        input_batch_size=1,
        add_raw_output=False,
        num_instructions=5,
        criteria_for_query_generation=criteria_for_query_generation,
        application_description=application_description,
        output_mappings={"model_name": "instructions_model"},
    )

    rename_1 = RenameColumn(
        name="rename_instr_to_raw_seed",
        old_column="instruction",
        new_column="raw_seed"
    )

    split_instr = SplitInstructions(
        name="split_instructions_step"
    )

    prompt_change_1 = ReplaceAllColumnValues(
        name="suggestion_system_prompt",
        column_name="system_prompt",
        new_value=suggestions_prompt_value
    )

    suggestion_generation = TextGeneration(
        # llm = TransformersLLM(model="meta-llama/Meta-Llama-3-8B-Instruct", device= "cuda:0"),
        llm = shared_model,
        input_batch_size=1,
        add_raw_output=False,
        output_mappings={"generation": "suggestions", "model_name": "suggestions_model"},
    )

    rename_2 = RenameColumn(
        name="rename_instr_to_question",
        old_column="instruction",
        new_column="question"
    )

    merge_question_suggestions = MergeQuestionSuggesions(
        name="merge_question_suggestions_step"
    )

    prompt_change_2 = ReplaceAllColumnValues(
        name="question_system_prompt",
        column_name="system_prompt",
        new_value=questions_prompt_value
    )

    question_generation = TextGeneration(
        llm = shared_model,
        # llm = TransformersLLM(model="Doctor-Shotgun/TinyLlama-1.1B-32k-Instruct", device= "cuda:0"),
        input_batch_size=1,
        add_raw_output=False,
        output_mappings={"model_name": "refined_q_model"},
    )

    keep_columns = KeepColumns(
        columns=["generation"],
    )

    # Connect the steps in execution order
    (
        load_hub_dataset
        >> text_generation
        >> self_instruct
        >> rename_1
        >> split_instr
        >> prompt_change_1
        >> suggestion_generation
        >> rename_2
        >> merge_question_suggestions
        >> prompt_change_2
        >> question_generation
        >> keep_columns
    )

### Running pipeline

In [15]:
distiset = pipeline.run(
    parameters={
        load_hub_dataset.name: {
            "repo_id": "hassaan-qaisar/initial_prompt",
            "split": "train",
        },
        text_generation.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 256,
                    "temperature": 0.7,
                },
            },
        },
        self_instruct.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 256,
                    "temperature": 0.7,
                },
            },
        },
        suggestion_generation.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 256,
                    "temperature": 0.7,
                },
            },
        },
        question_generation.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 256,
                    "temperature": 0.7,
                },
            },
        },
    },
)

You are not running the flash-attention implementation, expect numerical differences.



You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



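
The four generation steps in the run above receive identical `generation_kwargs`, so the `parameters` dictionary can also be built programmatically. A sketch under the assumption that the step names are available as strings; the names below are placeholders, and in the notebook they would come from `text_generation.name` and the other step objects:

```python
import copy

# Shared generation settings, matching the values used in the run above.
GENERATION_KWARGS = {"max_new_tokens": 256, "temperature": 0.7}

def llm_parameters(step_names):
    """Build the per-step runtime parameters for steps sharing one LLM config."""
    return {
        name: {"llm": {"generation_kwargs": copy.deepcopy(GENERATION_KWARGS)}}
        for name in step_names
    }

# Placeholder step names, standing in for `step.name` of each generation step.
step_names = ["text_generation", "self_instruct", "suggestion_generation", "question_generation"]
parameters = llm_parameters(step_names)

# Dataset-loading parameters are added separately, as in the original call.
parameters["load_dataset"] = {"repo_id": "hassaan-qaisar/initial_prompt", "split": "train"}
print(parameters)
```

Each step gets its own copy of the settings, so changing one step's `temperature` later does not affect the others.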

In [16]:
print(distiset)

Distiset({
    default: DatasetDict({
        train: Dataset({
            features: ['generation'],
            num_rows: 17
        })
    })
})


In [17]:
print(distiset['default']['train'].to_pandas())

                                           generation
0    1. How do recent enhancements in photovoltaic...
1    1. How will evolving international diplomatic...
2    1. How do recent developments in composite ma...
3    1. How do fluctuating climatic conditions inf...
4    1. How might differentiated tariff structures...
5    1. What is the correlation between individual...
6    1. How do varying regional climates affect th...
7    1. How can identifying key indigenous plants ...
8    1. How does alteration in personal routine fr...
9    1. What is the comparative effect of extensiv...
10   1. Assess the feasibility and implications of...
11   1. Analyze the comparative effectiveness of h...
12   1. In light of advancements such as quantum c...
13   1. How does the application of homomorphic en...
14   1. In the context of fusion between advanced ...
15   1. Assess the interplay between maintaining r...
16   1. How do advanced machine learning algorithm...


In [18]:
for row in distiset['default']['train'].to_pandas()['generation']:
    print(row)

 1. How do recent enhancements in photovoltaic cell design influence global electrical conversion efficiency trends through different climates, subsequently affecting regional contributions to decreasing overall atmospheric CO2?
   
2. What is the comprehensive life cycle analysis comparing greenhouse gas emissions between extensive transitions toward solar infrastructure against persistent dependency on traditional oil and coal industries, factoring elements like ecosystem alteration due to spatial requirements, material production cycles inclusive of assembly and upkeep phases, culminating disposal practices juxtaposed with sustained extraction activities?

3. In what manner has integration of support mechanisms via worldwide environmental accords propelled forward acceptance and assimilation of cleaner alternative energies amongst emerging economies, particularly evaluating this shift’s measurable outcomes related to both socio-economic progressions and tangible declines in national

Splitting the questions

In [19]:
# Split each 'generation' string into one row per question
def split_generation(examples):
    # With batched=True, `examples` maps column names to lists of values
    instructions = []
    for generation in examples['generation']:
        # Split on '?' and restore the question mark on each non-empty piece
        for piece in generation.split('?'):
            if piece.strip():
                instructions.append(piece.strip() + '?')

    # Returning a longer list than the input batch is allowed because
    # the original 'generation' column is removed by map()
    return {'instruction': instructions}

# Map the function to split 'generation' column
split_dataset = distiset['default']['train'].map(
    split_generation,
    batched=True,
    remove_columns=['generation'],
    batch_size=1
)

print(split_dataset)


Dataset({
    features: ['instruction'],
    num_rows: 48
})
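
Splitting on `?` also turns any trailing fragment, such as a question cut off by `max_new_tokens`, into a row with a `?` appended. The sketch below reproduces the splitting rule on a made-up string to show this; the `split_questions` helper and the sample text are illustrative only.

```python
def split_questions(generation: str):
    """Split on '?' and re-append the question mark to each non-empty piece."""
    return [piece.strip() + "?" for piece in generation.split("?") if piece.strip()]

# Made-up generation whose third question is cut off mid-sentence.
sample = " 1. How does A affect B?\n\n2. Why is C important?\n\n3. In what manner has D"
for question in split_questions(sample):
    print(question)
```

The last item is an incomplete question that now ends in `?`; filtering such rows would need an extra check, which the cell above does not perform.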


In [20]:
for example in split_dataset.to_pandas()['instruction']:
  print(example)

1. How do recent enhancements in photovoltaic cell design influence global electrical conversion efficiency trends through different climates, subsequently affecting regional contributions to decreasing overall atmospheric CO2?
2. What is the comprehensive life cycle analysis comparing greenhouse gas emissions between extensive transitions toward solar infrastructure against persistent dependency on traditional oil and coal industries, factoring elements like ecosystem alteration due to spatial requirements, material production cycles inclusive of assembly and upkeep phases, culminating disposal practices juxtaposed with sustained extraction activities?
3. In what manner has integration of support mechanisms via worldwide environmental accords propelled forward acceptance and assimilation of cleaner alternative energies amongst emerging economies, particularly evaluating this shift’s measurable outcomes related to both socio-economic progressions and tangible declines in nationalized c

In [21]:
import os

split_dataset.push_to_hub(
    "ahsanirfan961/arena-dataset",
    # Token with write access, read from the environment instead of being hard-coded
    token=os.environ["HF_TOKEN"],
)


CommitInfo(commit_url='https://huggingface.co/datasets/ahsanirfan961/arena-dataset/commit/c9689e7efe75b58e1709a13fc5e9a3874e1621de', commit_message='Upload dataset', commit_description='', oid='c9689e7efe75b58e1709a13fc5e9a3874e1621de', pr_url=None, pr_revision=None, pr_num=None)