Add new serialization strategy #288

Merged
merged 14 commits into from
Jan 26, 2024

@plaguss commented Jan 22, 2024

Description

This PR adds a new module, distilabel.utils.serialization, that enables custom serialization of tasks (to be extended to the other distilabel components in the future). When saving to or loading from disk, a task is now stored as task-distilabel.json instead of task.pkl.

  • This format replaces the current pickle-based one, making it safer to move the files around.
  • The format is currently fully custom, but we could switch to pydantic's serialization if we see fit.
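
The safety argument in the first bullet can be illustrated with a minimal, standard-library-only sketch. The `Payload` class is hypothetical and not part of distilabel; it only shows that unpickling can call a function chosen by whoever produced the file, whereas parsing JSON yields plain data.

```python
import json
import pickle


class Payload:
    """Hypothetical object whose pickled form calls a function on load."""

    def __reduce__(self):
        # pickle stores "call eval with this argument" instead of any state.
        # A hostile file could name any importable callable here.
        return (eval, ("1 + 1",))


blob = pickle.dumps(Payload())
loaded = pickle.loads(blob)  # runs eval("1 + 1") while loading, so loaded == 2

# JSON parsing only builds dicts, lists, strings, numbers, booleans and None.
data = json.loads('{"num_instructions": 5}')
```

Loading the pickle returns the result of the embedded call rather than a `Payload` instance, which is why pickle files from untrusted sources should not be loaded.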

Sample serialized task (the __type_info__ entry below identifies it as a SelfInstructTask):

{
  "system_prompt": "You are an expert prompt writer, writing the best and most diverse prompts for a variety of tasks. You are given a task description and a set of instructions for how to write the prompts for an specific AI application.",
  "principles": {
    "harmlessness": [
      "The assistant should not provide answers that could cause physical, psychological, or social harm to any group of people; damage to or loss of equipment or property; damage to the environment; or harm to institutions or resources necessary to human wellbeing. Feel free to refuse to answer questions that could cause the above harm.",
      "..."
    ],
    "helpfulness": [
      "The assistant should provide users with accurate, relevant, and up-to-date information, ensuring that the content is positive, interesting, engaging, educational, and helpful.",
      "As an AI assistan..."
    ]
  },
  "principles_distribution": null,
  "application_description": "AI assistant",
  "num_instructions": 5,
  "criteria_for_query_generation": "Incorporate a diverse range of verbs, avoiding repetition.\nEnsure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\nDesign queries to be self-contained and standalone.\nBlend interrogative (e.g., \"What is the significance of x?\") and imperative (e.g., \"Detail the process of x.\") styles.",
  "__jinja2_template__": "...",
  "__type_info__": {
    "module": "distilabel.tasks.text_generation.self_instruct",
    "name": "SelfInstructTask"
  }
}
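
As a rough sketch of how a file like the sample above could be written and read back, the following uses the `__type_info__` entry to find the class to rebuild. This is not the code added in this PR: `dump_object` and `load_object` are hypothetical names, and `types.SimpleNamespace` stands in for a task class so the example runs with the standard library alone.

```python
import importlib
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace

TYPE_INFO_KEY = "__type_info__"  # key name taken from the sample JSON


def dump_object(obj, path):
    """Write public attributes plus the module and class name (hypothetical helper)."""
    data = {k: v for k, v in vars(obj).items() if not k.startswith("_")}
    data[TYPE_INFO_KEY] = {
        "module": type(obj).__module__,
        "name": type(obj).__name__,
    }
    Path(path).write_text(json.dumps(data, indent=2))


def load_object(path):
    """Rebuild the object by importing the recorded class (hypothetical helper)."""
    data = json.loads(Path(path).read_text())
    type_info = data.pop(TYPE_INFO_KEY)
    module = importlib.import_module(type_info["module"])
    cls = getattr(module, type_info["name"])
    return cls(**data)


# SimpleNamespace is only a stand-in for a real task class.
task = SimpleNamespace(application_description="AI assistant", num_instructions=5)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "task-distilabel.json"
    dump_object(task, path)
    saved = json.loads(path.read_text())
    restored = load_object(path)
```

Because loading imports whatever module the file names, a real implementation may want to restrict the accepted modules, for example to those inside the library's own package.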

Closes #261.

@plaguss self-assigned this Jan 22, 2024
@plaguss added the "team: ml" label Jan 22, 2024
from distilabel.utils.imports import _ARGILLA_AVAILABLE
from distilabel.utils.serialization import load_task_from_disk

Shouldn't we be able to replace load_task_from_disk too?

Contributor Author

What do you mean exactly?

We call self.task.save, and I thought we might also add load_task_from_disk in that load_from_disk function, but I now see it inherits from Dataset, so that does not make sense.

Contributor Author

Yes, those two are different: one is for the dataset, which we don't really deal with ourselves, and the other is for the task, which is saved as a JSON file.
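
The split described in this thread can be sketched with stand-in classes. Nothing below is distilabel's or the datasets library's actual code: `BaseDataset`, `TaskDataset`, and `load_task` are hypothetical, and the task is reduced to a plain dict. The point is only that saving is extended to write the task JSON, while the inherited loader knows nothing about it, so the task is read back in a separate step.

```python
import json
import tempfile
from pathlib import Path

TASK_FILE_NAME = "task-distilabel.json"  # file name from the PR description


class BaseDataset:
    """Stand-in for a third-party dataset class that owns its persistence."""

    def __init__(self, rows):
        self.rows = rows

    def save_to_disk(self, path):
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        (path / "rows.json").write_text(json.dumps(self.rows))

    @classmethod
    def load_from_disk(cls, path):
        return cls(json.loads((Path(path) / "rows.json").read_text()))


class TaskDataset(BaseDataset):
    """Hypothetical subclass: only saving is extended to also write the task."""

    def __init__(self, rows, task=None):
        super().__init__(rows)
        self.task = task

    def save_to_disk(self, path):
        super().save_to_disk(path)
        if self.task is not None:
            (Path(path) / TASK_FILE_NAME).write_text(json.dumps(self.task))


def load_task(path):
    """Hypothetical task loader: reads only the task JSON from the folder."""
    return json.loads((Path(path) / TASK_FILE_NAME).read_text())


with tempfile.TemporaryDirectory() as tmp:
    TaskDataset([{"input": "hi"}], task={"num_instructions": 5}).save_to_disk(tmp)
    dataset = TaskDataset.load_from_disk(tmp)  # inherited loader, task is None here
    task_after_load = dataset.task
    dataset.task = load_task(tmp)  # task restored in a separate step
```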

Resolved review threads: src/distilabel/utils/serialization.py (4, outdated), tests/tasks/test_serialization.py (1)
Contributor Author

plaguss commented Jan 24, 2024

Still need to review some private variables from the dump

@plaguss merged commit cc5d08e into main Jan 26, 2024
4 checks passed
@plaguss deleted the feat/serializer branch January 26, 2024 12:46
Linked issue: [FEATURE] Improve serialization strategy for tasks