Update _BatchManager to work with GlobalSteps and input_batch_size per step #366

Merged
gabrielmbmb merged 7 commits into core-refactor from batch_manager_logic_for_global_step
Feb 29, 2024

Conversation

gabrielmbmb
Member

@gabrielmbmb gabrielmbmb commented Feb 29, 2024

Description

This PR updates the _BatchManager to create a "super" batch for GlobalSteps, gathering all the data from the previous batches. It also adds a new attribute to the Step class called input_batch_size, which defines the number of rows that the step will receive per batch. To support this, the _BatchManager now stores the data of the output batches instead of the batches themselves, which makes it possible to build batches of the desired size for each step.
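The buffering idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `RowBuffer` class, its method names, and the `last` and `upstream_finished` flags are all hypothetical. It only shows how storing rows (rather than whole batches) lets each step receive batches of its own `input_batch_size`, and how a global step can receive everything at once.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class RowBuffer:
    """Hypothetical stand-in for a batch manager; not distilabel's actual code."""

    def __init__(self) -> None:
        # Rows waiting to be consumed, keyed by the name of the receiving step.
        self._pending: Dict[str, List[Row]] = defaultdict(list)

    def add_rows(self, step_name: str, rows: List[Row]) -> None:
        # Store the data of an upstream output batch, not the batch object itself.
        self._pending[step_name].extend(rows)

    def next_batch(
        self, step_name: str, input_batch_size: int, last: bool = False
    ) -> Optional[List[Row]]:
        # Return a batch of the size the step asked for, or None if there are
        # not enough rows yet. When `last` is True, flush whatever remains.
        rows = self._pending[step_name]
        if len(rows) < input_batch_size and not (last and rows):
            return None
        self._pending[step_name] = rows[input_batch_size:]
        return rows[:input_batch_size]

    def super_batch(
        self, step_name: str, upstream_finished: bool
    ) -> Optional[List[Row]]:
        # A global step needs all the data, so wait until upstream is done
        # and then hand over every buffered row in a single batch.
        if not upstream_finished:
            return None
        return self._pending.pop(step_name, [])


buffer = RowBuffer()
buffer.add_rows("generate", [{"id": i} for i in range(5)])
first = buffer.next_batch("generate", input_batch_size=3)  # 3 rows
second = buffer.next_batch("generate", input_batch_size=3)  # None, only 2 rows left
tail = buffer.next_batch("generate", input_batch_size=3, last=True)  # remaining 2 rows
```

Because the buffer holds plain rows, the size of the upstream output batches and the size requested by the downstream step no longer need to match.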

@gabrielmbmb gabrielmbmb added the enhancement New feature or request label Feb 29, 2024
@gabrielmbmb gabrielmbmb added this to the 1.0.0 milestone Feb 29, 2024
@gabrielmbmb gabrielmbmb self-assigned this Feb 29, 2024
@gabrielmbmb gabrielmbmb changed the base branch from main to core-refactor February 29, 2024 12:31
Contributor

@plaguss plaguss left a comment

Just a comment irrelevant to the nice PR

@@ -29,7 +29,7 @@
SaveFormats = Literal["json", "yaml"]


-def _get_class(module: str = None, name: str = None) -> Type:
+def _get_class(module: str, name: str) -> Type:
Contributor

The default args are there so that we can pass the data using **kwargs here. I also see that I need to update the code to use distilabel/pipeline/serialization and remove this file. I'll do that in a different PR.
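For context, the calling pattern mentioned here might look like the sketch below. Only the signature appears in the diff, so the function body, the name `get_class_sketch`, and the `serialized` dictionary are assumptions made for illustration.

```python
import importlib
from typing import Any, Dict, Optional, Type


def get_class_sketch(module: Optional[str] = None, name: Optional[str] = None) -> Type:
    """Hypothetical helper: resolve a class from a module path and a class name."""
    if module is None or name is None:
        raise ValueError("Both 'module' and 'name' are required to locate the class.")
    # Import the module by its dotted path and fetch the attribute by name.
    return getattr(importlib.import_module(module), name)


# Data as it might come out of a serialized (json/yaml) file; assumed shape.
serialized: Dict[str, Any] = {"module": "collections", "name": "OrderedDict"}
cls = get_class_sketch(**serialized)
```

Keyword unpacking also works with required parameters as long as both keys are present; the defaults only matter when a key may be missing from the dictionary.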

Member Author

oops, my bad!

@gabrielmbmb gabrielmbmb marked this pull request as ready for review February 29, 2024 14:08
@gabrielmbmb gabrielmbmb merged commit c691d96 into core-refactor Feb 29, 2024
4 checks passed
@gabrielmbmb gabrielmbmb deleted the batch_manager_logic_for_global_step branch February 29, 2024 14:13
Labels
enhancement New feature or request

2 participants