Refactor _build_dataset piece for speed #344

plaguss · 2024-02-14T10:04:41Z

Description

This PR makes a small refactor to Pipeline._build_dataset so that the final dataset is built with a single call to Dataset.from_pandas instead of calling Dataset.add_item once per row. The row-by-row approach made the process prohibitively slow for larger datasets (it started to be a problem at around 5K rows).

alvarobartt

💯

Refactor _build_dataset piece for speed

7e9f963

plaguss requested review from alvarobartt and gabrielmbmb February 14, 2024 10:04

plaguss self-assigned this Feb 14, 2024

plaguss added the improvement label Feb 14, 2024

alvarobartt approved these changes Feb 14, 2024

alvarobartt added this to the 0.6.0 milestone Feb 14, 2024

plaguss merged commit d24ee88 into main Feb 14, 2024
4 checks passed

plaguss deleted the test/build-dataset branch February 14, 2024 16:40

jphme pushed a commit to jphme/distilabel that referenced this pull request Feb 20, 2024

Refactor _build_dataset piece for speed (argilla-io#344)

2405cf1
