
Tofu pipeline #12

Merged
merged 31 commits into from
Apr 24, 2024
Conversation

J-Dymond
Contributor

Basic TOFU dataset pipeline set up. There are four different granularity settings, which are admittedly defined in quite a convoluted way, but I think the comments and the README explain how to use them.
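The granularity settings themselves aren't spelled out in this description, so as a hedged illustration only (the `granularity` values, row keys, and `forget_retain_split` helper below are hypothetical, not the actual data_utils.py API), a split at two of the possible levels might look like:

```python
def forget_retain_split(rows, granularity, forget_ids):
    """Split rows into (forget, retain) at the given granularity.

    Hypothetical sketch: `granularity` picks which id column drives the
    split; the real code in data_utils.py may work differently.
    """
    key = {"author": "author_id", "question": "question_id"}.get(granularity)
    if key is None:
        raise ValueError(f"unknown granularity: {granularity!r}")
    forget = [r for r in rows if r[key] in forget_ids]
    retain = [r for r in rows if r[key] not in forget_ids]
    return forget, retain


# toy data: 3 authors with 2 questions each
rows = [
    {"author_id": a, "question_id": a * 2 + q}
    for a in range(3)
    for q in range(2)
]
forget, retain = forget_retain_split(rows, "author", {0})
# forgetting author 0 removes both of that author's questions
```

At question granularity the same helper would be called with question ids instead, which is what makes a single keyed split function attractive over per-level duplicated code.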

@J-Dymond J-Dymond linked an issue Apr 10, 2024 that may be closed by this pull request
Collaborator

@philswatton philswatton left a comment


This is good work - much better than my first commit when I started here!

Some comments contained below. The big ones are around refactoring data_utils.py - some of this code is duplicated, and in general data_utils.py feels like it could be a little shorter. I've made some initial suggestions in this direction; happy to discuss during our meeting tomorrow.

Finally: I'd suggest renaming data_utils.py to load_data.py or even just tofu.py

(Resolved review threads on src/arcsf/data/data_utils.py and src/arcsf/data/README.md)
@jack89roberts
Contributor

Finally: I'd suggest renaming data_utils.py to load_data.py or even just tofu.py

I'd go tofu.py

@J-Dymond J-Dymond linked an issue Apr 11, 2024 that may be closed by this pull request
@jack89roberts
Contributor

jack89roberts commented Apr 11, 2024

On all the comments re indexing etc. I don't think we need to reinvent the wheel if something already works. But my main suggestion is to consider whether it would be helpful to create author_id and question_id columns in the dataset at the beginning, which could also be used in the various sampling/filtering functions. One pro of this is it might make some analysis easier later - e.g. we wouldn't need to keep propagating information about which questions/authors have been removed to figure out which rows in the retain/forget sets correspond to which authors (as we could just refer to the author_id).

HuggingFace Datasets also have some in-built functions that could help: https://huggingface.co/docs/datasets/en/process . E.g. with an author_id column the author_level split could just be something like (probably not the most efficient way):

forget_set = all_data.filter(lambda row: row["author_id"] in forgotten_author_numbers)
retain_set = all_data.filter(lambda row: row["author_id"] not in forgotten_author_numbers)
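As a plain-Python sketch of the suggestion above (the `QUESTIONS_PER_AUTHOR` constant and `add_ids` helper are illustrative assumptions, not code from this PR; it also assumes questions are stored contiguously per author, as in TOFU), adding the id columns once up front makes the later splits simple keyed filters:

```python
QUESTIONS_PER_AUTHOR = 20  # assumption: TOFU-style contiguous grouping per author

def add_ids(rows):
    """Attach question_id/author_id to each row once, up front.

    This mirrors what a Dataset.map(..., with_indices=True) call could do
    on a HuggingFace Dataset; plain lists are used here to keep the
    sketch self-contained.
    """
    for i, row in enumerate(rows):
        row["question_id"] = i
        row["author_id"] = i // QUESTIONS_PER_AUTHOR
    return rows


rows = add_ids([{"question": f"q{i}"} for i in range(40)])  # two toy authors
forgotten_author_numbers = {1}

# same shape as the filter(...) calls above, on plain lists
forget_set = [r for r in rows if r["author_id"] in forgotten_author_numbers]
retain_set = [r for r in rows if r["author_id"] not in forgotten_author_numbers]
```

Because every row keeps its `author_id`, later analysis can map retain/forget rows back to authors directly, with no need to propagate removal bookkeeping through the pipeline.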

@philswatton
Collaborator

(Quoting jack89roberts's comment above on adding author_id/question_id columns.)

I was thinking about this too - am open to arguments either way for now, but we'll almost certainly want to have our own indexing for each level of hierarchy when we start generating our own data in #2

(Resolved review threads on src/arcsf/data/tofu.py)
Contributor

@jack89roberts jack89roberts left a comment


Good to merge pending the last few changes we discussed.

@jack89roberts jack89roberts requested review from philswatton and removed request for philswatton April 19, 2024 15:53
@jack89roberts jack89roberts merged commit 6627760 into develop Apr 24, 2024
1 check passed
@jack89roberts jack89roberts deleted the tofu_pipeline branch April 24, 2024 09:19
Successfully merging this pull request may close these issues.

Implement Data Preprocessing
TOFU dataset in pipeline
3 participants