
Tofu pipeline #12

Merged
merged 31 commits into from
Apr 24, 2024
Conversation

J-Dymond
Contributor

Basic TOFU dataset pipeline set up. There are four different granularity settings, which are admittedly defined in quite a convoluted way, but I think the comments and the README explain how to use them.
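The granularity settings themselves aren't spelled out in this description, so as a hedged illustration only (the `granularity` values, row keys, and `forget_retain_split` helper below are hypothetical, not the actual data_utils.py API), a split at two of the possible levels might look like:

```python
def forget_retain_split(rows, granularity, forget_ids):
    """Split rows into (forget, retain) at the given granularity.

    Hypothetical sketch: `granularity` picks which id column drives the
    split; the real code in data_utils.py may work differently.
    """
    key = {"author": "author_id", "question": "question_id"}.get(granularity)
    if key is None:
        raise ValueError(f"unknown granularity: {granularity!r}")
    forget = [r for r in rows if r[key] in forget_ids]
    retain = [r for r in rows if r[key] not in forget_ids]
    return forget, retain


# toy data: 3 authors with 2 questions each
rows = [
    {"author_id": a, "question_id": a * 2 + q}
    for a in range(3)
    for q in range(2)
]
forget, retain = forget_retain_split(rows, "author", {0})
# forgetting author 0 removes both of that author's questions
```

At question granularity the same helper would be called with question ids instead, which is what makes a single keyed split function attractive over per-level duplicated code.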

@J-Dymond J-Dymond linked an issue Apr 10, 2024 that may be closed by this pull request
Collaborator

@philswatton philswatton left a comment


This is good work - much better than my first commit when I started here!

Some comments contained below. The big ones are around refactoring data_utils.py - some of this code is duplicated, and in general data_utils.py feels like it could be a little shorter. I've made some initial suggestions in this direction; happy to discuss during our meeting tomorrow.

Finally: I'd suggest renaming data_utils.py to load_data.py or even just tofu.py

(Resolved review threads on src/arcsf/data/data_utils.py and src/arcsf/data/README.md)
@jack89roberts
Contributor

Finally: I'd suggest renaming data_utils.py to load_data.py or even just tofu.py

I'd go tofu.py

@J-Dymond J-Dymond linked an issue Apr 11, 2024 that may be closed by this pull request
@jack89roberts
Contributor

jack89roberts commented Apr 11, 2024

On all the comments re indexing etc. I don't think we need to reinvent the wheel if something already works. But my main suggestion is to consider whether it would be helpful to create author_id and question_id columns in the dataset at the beginning, which could also be used in the various sampling/filtering functions. One pro of this is it might make some analysis easier later - e.g. we wouldn't need to keep propagating information about which questions/authors have been removed to figure out which rows in the retain/forget sets correspond to which authors (as we could just refer to the author_id).

HuggingFace Datasets also have some in-built functions that could help: https://huggingface.co/docs/datasets/en/process . E.g. with an author_id column the author_level split could just be something like (probably not the most efficient way):

forget_set = all_data.filter(lambda row: row["author_id"] in forgotten_author_numbers)
retain_set = all_data.filter(lambda row: row["author_id"] not in forgotten_author_numbers)
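As a plain-Python sketch of the suggestion above (the `QUESTIONS_PER_AUTHOR` constant and `add_ids` helper are illustrative assumptions, not code from this PR; it also assumes questions are stored contiguously per author, as in TOFU), adding the id columns once up front makes the later splits simple keyed filters:

```python
QUESTIONS_PER_AUTHOR = 20  # assumption: TOFU-style contiguous grouping per author

def add_ids(rows):
    """Attach question_id/author_id to each row once, up front.

    This mirrors what a Dataset.map(..., with_indices=True) call could do
    on a HuggingFace Dataset; plain lists are used here to keep the
    sketch self-contained.
    """
    for i, row in enumerate(rows):
        row["question_id"] = i
        row["author_id"] = i // QUESTIONS_PER_AUTHOR
    return rows


rows = add_ids([{"question": f"q{i}"} for i in range(40)])  # two toy authors
forgotten_author_numbers = {1}

# same shape as the filter(...) calls above, on plain lists
forget_set = [r for r in rows if r["author_id"] in forgotten_author_numbers]
retain_set = [r for r in rows if r["author_id"] not in forgotten_author_numbers]
```

Because every row keeps its `author_id`, later analysis can map retain/forget rows back to authors directly, with no need to propagate removal bookkeeping through the pipeline.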

@philswatton
Collaborator

(Quoting jack89roberts's comment above on adding author_id/question_id columns.)

I was thinking about this too - am open to arguments either way for now, but we'll almost certainly want to have our own indexing for each level of hierarchy when we start generating our own data in #2

(Resolved review threads on src/arcsf/data/tofu.py)
Contributor

@jack89roberts jack89roberts left a comment


Good to merge pending the last few changes we discussed.

@jack89roberts jack89roberts requested review from philswatton and removed request for philswatton April 19, 2024 15:53
@jack89roberts jack89roberts merged commit 6627760 into develop Apr 24, 2024
1 check passed
@jack89roberts jack89roberts deleted the tofu_pipeline branch April 24, 2024 09:19
Successfully merging this pull request may close these issues.

Implement Data Preprocessing
TOFU dataset in pipeline
3 participants