[Feature Request] Support Synthetic Data Generation via Self-Instruct #2012

@Zhangzeyu97

Required prerequisites

Motivation

To improve training data diversity, we need a robust synthetic data generation pipeline. The goal is to use the Self-Instruct methodology to automatically create new, high-quality datapoints from a combination of human-provided seed examples and machine-generated content.

This can be done by introducing a SelfInstructGenerator that supports:

  • Few-shot prompting for novel question generation
  • Code-based rationale generation
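
A minimal sketch of how the two capabilities above could fit together. Everything here is hypothetical and only meant to illustrate the idea: the `Datapoint` fields, the function names, the prompt wording, the default shot counts, and the `complete` callable (a stand-in for whatever model backend is used) are assumptions, not an existing API.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Datapoint:
    question: str
    rationale: str = ""    # Python code that computes the answer
    answer: str = ""
    source: str = "human"  # "human" seed or "machine" generated


def build_few_shot_prompt(human: List[Datapoint], machine: List[Datapoint],
                          n_human: int = 2, n_machine: int = 1,
                          rng: Optional[random.Random] = None) -> str:
    """Sample human and machine examples and format them as few-shot shots."""
    rng = rng or random.Random()
    shots = rng.sample(human, min(n_human, len(human)))
    shots += rng.sample(machine, min(n_machine, len(machine)))
    lines = ["Write one new question in the style of the examples."]
    for i, dp in enumerate(shots, 1):
        lines.append(f"Example {i}: {dp.question}")
    lines.append("New question:")
    return "\n".join(lines)


def run_rationale(code: str) -> Optional[str]:
    """Run a code rationale and return the value it stores in `answer`.

    NOTE: plain exec() is NOT a sandbox; a real pipeline would need an
    isolated interpreter for untrusted, model-written code.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception:
        return None
    if "answer" not in namespace:
        return None
    return str(namespace["answer"])


def generate(human: List[Datapoint], machine: List[Datapoint],
             complete: Callable[[str], str],
             rng: Optional[random.Random] = None) -> Optional[Datapoint]:
    """Generate one new datapoint, or None if the rationale fails to run."""
    question = complete(build_few_shot_prompt(human, machine, rng=rng)).strip()
    code = complete(
        "Write Python code that solves the question and stores the result "
        f"in a variable named `answer`.\nQuestion: {question}\nCode:"
    )
    answer = run_rationale(code)
    if not question or answer is None:
        return None
    return Datapoint(question, code, answer, source="machine")


def fake_complete(prompt: str) -> str:
    """Stub model used only to make this sketch runnable without a backend."""
    if prompt.endswith("New question:"):
        return "What is 6 times 7?"
    return "answer = 6 * 7"


seeds = [Datapoint("What is 2 plus 3?"), Datapoint("What is 10 minus 4?")]
demo = generate(seeds, [], fake_complete, rng=random.Random(0))
```

Executing the rationale acts as a cheap filter: a datapoint is kept only if its code runs and produces an `answer`, which is one possible way to approximate the "high-quality" requirement. Accepted datapoints could then be appended to the machine pool and reused as shots in later rounds.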

Solution

No response

Alternatives

No response

Additional context

No response
