
Add EvolInstruct and EvolInstructGenerator tasks #407

Merged 41 commits into core-refactor on Mar 16, 2024

Conversation

alvarobartt (Member) commented Mar 11, 2024

Description

This PR adds both the EvolInstruct and EvolInstructGenerator tasks, ported from https://github.com/h2oai/h2o-wizardlm/blob/main/wizardlm.py with some slight modifications to suit our needs, while respecting the evolutionary approach.

Besides that, the AsyncLLM has been fixed to use asyncio event loops instead of asyncio.run, since the latter was raising errors when called from within an already running loop, so the AsyncLLM implementation is now more robust. Also, the GeneratorTask has been included for tasks like EvolInstructGenerator, i.e. tasks that generate data without seed data as input.
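
For context, a minimal sketch of the event-loop pattern described above; this is not the actual AsyncLLM code, and the class and method names are placeholders:

import asyncio


class AsyncLLMSketch:
    async def agenerate(self, prompt: str) -> str:
        # Stand-in for the real asynchronous generation call.
        await asyncio.sleep(0)
        return f"response to: {prompt}"

    def generate(self, prompts: list[str]) -> list[str]:
        # Reuse (or lazily create) a single event loop instead of calling
        # `asyncio.run`, which raises when invoked from an already running loop.
        try:
            loop = asyncio.get_event_loop()
        except RuntimeError:
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
        coroutines = [self.agenerate(prompt) for prompt in prompts]
        return loop.run_until_complete(asyncio.gather(*coroutines))


if __name__ == "__main__":
    llm = AsyncLLMSketch()
    print(llm.generate(["What is Evol-Instruct?"]))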

Closes #408

Example

import time

from distilabel.llm.mistral import MistralLLM
from distilabel.llm.openai import OpenAILLM
from distilabel.pipeline.local import Pipeline
from distilabel.steps.globals.huggingface import PushToHub
from distilabel.steps.task.evol_instruct.generator import EvolInstructGenerator

if __name__ == "__main__":
    start_time = time.time()

    with Pipeline() as pipeline:
        evol_instruct = EvolInstructGenerator(
            name="evol_instruct",
            llm=OpenAILLM(
                model="gpt-4",
                api_key="sk-***",  # type: ignore
            ),
            num_instructions=10,
            generate_answers=False,
        )

        push_to_hub = PushToHub(name="push_to_hub")  # type: ignore
        evol_instruct.connect(push_to_hub)

    pipeline.run(
        parameters={
            "evol_instruct": {
                "generation_kwargs": {
                    "max_new_tokens": 512,
                    "temperature": 0.7,
                },
            },
            "push_to_hub": {
                "repo_id": "alvarobartt/evol-instruct",
                "split": "train",
                "private": False,
                "token": "hf_***",
            },
        }
    )

    print("--- %s seconds ---" % (time.time() - start_time))

Reference

https://github.com/h2oai/h2o-wizardlm/blob/main/wizardlm.py

This differs from the default `EvolInstruct`, which will be refactored to be an evolution on top of existing instructions, i.e. always expecting `seed_data` (this requires modifications on top of the original implementation).
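
As an illustration of that distinction, a hypothetical usage sketch of the refactored `EvolInstruct` within a pipeline; the module path and the `num_evolutions` parameter are assumptions for illustration, not the final API:

from distilabel.llm.openai import OpenAILLM
from distilabel.pipeline.local import Pipeline
from distilabel.steps.task.evol_instruct.base import EvolInstruct  # assumed module path

with Pipeline() as pipeline:
    # Hypothetical: evolve instructions coming from a previous step (the seed
    # data) instead of generating them from scratch as EvolInstructGenerator does.
    evol_instruct = EvolInstruct(
        name="evol_instruct",
        llm=OpenAILLM(model="gpt-4", api_key="sk-***"),  # type: ignore
        num_evolutions=2,  # assumed parameter: evolution rounds per seed instruction
    )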
@alvarobartt alvarobartt added the enhancement (New feature or request) and task labels Mar 11, 2024
@alvarobartt alvarobartt added this to the 1.0.0 milestone Mar 11, 2024
@alvarobartt alvarobartt self-assigned this Mar 11, 2024
@alvarobartt alvarobartt changed the base branch from main to core-refactor March 11, 2024 13:10
@alvarobartt alvarobartt marked this pull request as ready for review March 13, 2024 08:51
@alvarobartt alvarobartt linked an issue Mar 13, 2024 that may be closed by this pull request
alvarobartt (Member, Author) commented:

> @alvarobartt, I am also missing the system prompt they defined in the paper, but we don't want to implement that either? https://github.com/argilla-io/distilabel/blob/main/src/distilabel/tasks/_templates/evol-instruct.jinja2

Fair, we can include it, but I'm afraid the misalignments between their official implementation and the paper don't have a clear reference; should we take more inspiration from the paper instead? My experience so far is that using GPT-4 for generating and evolving instructions works well, but we can try comparing runs with and without the system_prompt. Also, does that apply to both the instruction and the answer, or only the instruction? I guess the former, but LMK.

> @alvarobartt Also, I seem to be missing the elimination step, but you wanted to skip this because the general state of LLMs has improved and you feel we don't need it anymore, correct?

Right, I was using gpt-4, mistral-medium and mistral-large, and didn't run into any of those issues to tackle during post-processing. Maybe we can include it just in case, but IMO those checks were not needed for this small proof of concept. Anyway, all these details will show up whenever we implement DEITA using these building blocks, so we can see whether something's off or not; but I'm afraid that some of the things they added on top of the post-processing were suited to their models, while bigger and more recent ones seem to produce nice results with minimal post-processing.
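
For reference, the elimination step being discussed is essentially a handful of heuristic filters on the evolved instructions and answers. A rough sketch of such post-processing follows; the criteria and the 80-word threshold are paraphrased from the WizardLM implementation rather than taken from this PR:

def should_eliminate(seed: str, evolved: str, response: str | None = None) -> bool:
    """Heuristic elimination checks, loosely paraphrasing the WizardLM ones."""
    evolved_lower = evolved.lower()
    # 1. The evolution leaked parts of the meta prompt into the instruction.
    if any(marker in evolved_lower for marker in ("given prompt", "rewritten prompt", "#rewritten prompt#")):
        return True
    # 2. The evolved instruction adds nothing over the seed (simplified here to
    #    exact equality; WizardLM delegates this judgement to the LLM itself).
    if evolved.strip() == seed.strip():
        return True
    # 3. The answer is a short refusal (e.g. "Sorry, I cannot ...").
    if response is not None and "sorry" in response.lower() and len(response.split()) < 80:
        return True
    return False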

alvarobartt (Member, Author) commented:

Also, I've been exploring a bit on GitHub: is this a more faithful reproduction of EvolInstruct (pre-WizardLM)? https://github.com/nlpxucan/WizardLM/tree/main/Evol_Instruct cc @davidberenstein1957

davidberenstein1957 (Member) commented Mar 13, 2024

@alvarobartt I would say so, yes. This is also the way we initially implemented it. I would also go with the more extensive prompts to ensure we have a higher chance of making it work even without more advanced models.

gabrielmbmb (Member) left a comment


Hi @alvarobartt, it looks good to me. Some methods are quite big, so I would split them up.

Review comments (outdated, resolved):
src/distilabel/steps/task/base.py
src/distilabel/steps/task/evol_instruct/base.py (3 comments)
src/distilabel/steps/task/evol_instruct/generator.py (2 comments)
@alvarobartt alvarobartt merged commit 51872f7 into core-refactor Mar 16, 2024
4 checks passed
@alvarobartt alvarobartt deleted the evol-instruct-task branch March 16, 2024 11:10
Labels
enhancement (New feature or request), task
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Add EvolInstruct faithful reproduction of WizardLM
3 participants