Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(get_dataset): dataset sampling for large datasets #71

Merged
merged 1 commit into from
Oct 23, 2023

Conversation

keerthanvasist
Copy link
Member

Description of changes:
dataset sampling for large datasets: For large datasets like cnn_dailymail and xsum, we were unable to sample the dataset using the to_pandas() method - in order to solve for that use case, we are now using take_batch as a workaround when there are large datasets.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

oyangz
oyangz previously approved these changes Oct 23, 2023
nathanng17
nathanng17 previously approved these changes Oct 23, 2023
malhotra18
malhotra18 previously approved these changes Oct 23, 2023
data = ray.data.from_pandas(data.to_pandas().sample(num_records, random_state=SEED))
if count > 100000:
# If count is larger than 100000, we take the first 100000 row, and then sample from that to
# maintain deterministic behaviour
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, Could you please also add reason here, in comments. For using take_batch instead?

I would say also add a TODO to revisit this, this logic is adding an inherent bias towards first 100000 rows of the dataset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should come back to this. Added a TODO

@keerthanvasist keerthanvasist merged commit ee66db9 into aws:main Oct 23, 2023
1 of 2 checks passed
@keerthanvasist
Copy link
Member Author

Merging since I had previous approvals

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants