Tofu pipeline #12
Conversation
This is good work - much better than my first commit when I started here!
Some comments contained below. The big ones are around refactoring data_utils.py - some of this code has duplication, and in general data_utils.py feels like it could be a little shorter. I've made some initial suggestions in this direction; happy to discuss during our meeting tomorrow.
Finally: I'd suggest renaming data_utils.py to load_data.py or even just tofu.py
I'd go
On all the comments re indexing etc. I don't think we need to reinvent the wheel if something already works. But my main suggestion is to consider whether it would be helpful to use HuggingFace's built-in `filter`, e.g.:

```python
forget_set = all_data.filter(lambda row: row["author_id"] in forgotten_author_numbers)
retain_set = all_data.filter(lambda row: row["author_id"] not in forgotten_author_numbers)
```
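For reference, the forget/retain split can be sketched with plain Python lists (the toy rows and author ids here are made up, and list comprehensions stand in for `datasets.Dataset.filter`) to check the logic without pulling the actual dataset:

```python
# Toy stand-in for the HuggingFace dataset: a list of row dicts.
all_data = [
    {"author_id": 0, "question": "q0"},
    {"author_id": 1, "question": "q1"},
    {"author_id": 2, "question": "q2"},
]
forgotten_author_numbers = {1}  # authors to unlearn

# Same predicate as the filter calls above, applied with comprehensions.
forget_set = [row for row in all_data if row["author_id"] in forgotten_author_numbers]
retain_set = [row for row in all_data if row["author_id"] not in forgotten_author_numbers]
```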
I was thinking about this too - am open to arguments either way for now, but we'll almost certainly want to have our own indexing for each level of hierarchy when we start generating our own data in #2
…ith some helper functions. Outputs a dataset in the same format as huggingface repo, but with randomised author selection.
…in the original codebase.
…hical information. TO DO: verify facts are removed.
…ored function load_tofu.
…values returned in the debug dict
6ca271a to 82b0c50
Good to merge pending the last few changes we discussed.
Basic TOFU dataset pipeline set up. There are four different granularity settings, which are admittedly defined in a somewhat convoluted way, but I think the comments and readme explain how to use them.
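As a rough sketch of how a granularity setting might drive the forget/retain split (the level names, the `<level>_id` column convention, and `load_tofu_sketch` are all illustrative assumptions, not the pipeline's actual API):

```python
# Illustrative sketch only: GRANULARITIES, the `<level>_id` column convention,
# and load_tofu_sketch are assumptions, not the pipeline's real interface.
GRANULARITIES = ("author", "book", "chunk", "question")  # assumed level names

def load_tofu_sketch(rows, granularity, forget_ids):
    """Split rows into (forget, retain) by id membership at one granularity."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"granularity must be one of {GRANULARITIES}")
    key = f"{granularity}_id"  # assumed column naming convention
    forget = [r for r in rows if r[key] in forget_ids]
    retain = [r for r in rows if r[key] not in forget_ids]
    return forget, retain
```

The idea being that each granularity just selects a different id column to split on, rather than needing separate code paths per level.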