Questions about data distribution and task distribution #5

@DREAMXFAR

Thanks for your wonderful work and for releasing the UnicEdit dataset. I have downloaded part of the released 2M data, but I am a little confused about the data and task distribution and would appreciate some help.
Since I want to sample a subset of UnicEdit, I downloaded the 00001-00060 parquet files (~900k). However, I find that the distribution across tasks is highly imbalanced, e.g., Subject Addition: 209963, Subject Removal: 168, Counting Change: 65, Color Alteration: 231632.
So my question is: could you provide a detailed task distribution of the data for reference? Have the authors trained any models on UnicBench-10M, and would such an imbalanced distribution harm the performance of existing models during fine-tuning? Looking forward to your reply.
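
For context, here is a minimal sketch of the kind of subset sampling I have in mind: count records per task, then cap each task at a fixed number so the frequent tasks do not dominate. The field name `task`, the helper names, and the toy records are all assumptions for illustration; they are not the actual UnicEdit schema or any official tooling.

```python
import random
from collections import Counter, defaultdict


def task_counts(records, key="task"):
    """Count how many records carry each task label."""
    return Counter(r[key] for r in records)


def capped_sample(records, cap, key="task", seed=0):
    """Keep at most `cap` randomly chosen records per task label."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for r in records:
        by_task[r[key]].append(r)
    sample = []
    for group in by_task.values():
        # Rare tasks are kept whole; frequent tasks are downsampled to `cap`.
        sample.extend(group if len(group) <= cap else rng.sample(group, cap))
    return sample


# Toy records standing in for rows read from the parquet files
# (the "task" key is a hypothetical column name).
records = (
    [{"task": "Subject Addition"}] * 50
    + [{"task": "Color Alteration"}] * 40
    + [{"task": "Subject Removal"}] * 3
)
subset = capped_sample(records, cap=10)
print(task_counts(subset))
```

Capping only reduces the large tasks; it cannot create more samples for rare ones such as Subject Removal or Counting Change, which is why the full task distribution would be helpful.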
