Allow user to use custom calibration data for quantization #27

Merged
merged 8 commits into from
Sep 15, 2023

boehm-e (Contributor) commented Sep 5, 2023

Hi,

If you have some time, could you review these changes?
They should allow a custom dataset (List[str]) to be used for the calibration step.

Thx :)

casper-hansen (Owner) commented

Thanks for this PR; TheBloke also asked for this. I will review it later. Before this is merged, I would also like to create two examples of how to use the functionality: one with a string pointing to a Hugging Face dataset, and one with a list of preprocessed data.

if data == "pileval":
    dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
else:
    raise NotImplementedError


Should work. We might want to find a way to define the split instead of defaulting to train, though.

Suggested change
-    raise NotImplementedError
+    dataset = load_dataset(data, split="train")


Defaulting to train could be avoided by adding a kwarg that defaults to validation, which could then be used in L9 and L11.

casper-hansen (Owner) commented

I agree that we should not raise an exception here. Instead, we should try to load the dataset using the actual string that was passed, and select the split through a second argument. We could default to the validation split: it should be small enough for calibration, yet still scientifically sound, since we would use the test split to measure perplexity.
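
A minimal sketch of the behaviour described in this comment, not the code that was merged. The function name `resolve_calib_texts`, the `text_column` and `loader` parameters, and the assumption that each row exposes its text under a `"text"` key are all hypothetical; `load_dataset(path, split=...)` is the real Hugging Face `datasets` call. The `loader` hook is there only so the logic can be exercised without downloading anything.

```python
from typing import Callable, Iterable, List, Optional, Union


def resolve_calib_texts(
    data: Union[str, List[str]] = "pileval",
    split: str = "validation",      # default split, as proposed in the review
    text_column: str = "text",      # assumed column name (hypothetical parameter)
    loader: Optional[Callable[..., Iterable[dict]]] = None,  # hypothetical test hook
) -> List[str]:
    """Return raw calibration strings from a dataset name or a list of strings."""
    # Case 1: the caller already supplies preprocessed text (List[str]).
    if isinstance(data, list):
        if not all(isinstance(item, str) for item in data):
            raise TypeError("a custom calibration list must contain only strings")
        return data

    if not isinstance(data, str):
        raise TypeError("data must be a dataset name (str) or a List[str]")

    # Case 2: keep the existing "pileval" shortcut from the diff.
    if data == "pileval":
        data = "mit-han-lab/pile-val-backup"

    # Case 3: any other string is treated as a dataset name instead of
    # raising NotImplementedError.
    if loader is None:
        from datasets import load_dataset  # imported lazily; needs `datasets` installed
        loader = load_dataset

    dataset = loader(data, split=split)
    return [row[text_column] for row in dataset]
```

Under these assumptions, `resolve_calib_texts(["first sample", "second sample"])` returns the list unchanged, while `resolve_calib_texts("some/dataset", split="train")` would defer to `load_dataset` with the split the caller chose.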

casper-hansen mentioned this pull request Sep 6, 2023 (30 tasks)
aadnesd commented Feb 12, 2024

What's the benefit of using custom data?
