Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3 double drop #12

Merged
merged 23 commits into from Jan 25, 2023
Merged

3 double drop #12

merged 23 commits into from Jan 25, 2023

Conversation

philswatton
Copy link
Contributor

These commits do 2 broad things to contribute to #3:

  • Turn the CIFAR10DataModuleDrop class's dataloader into one that can mange dropping the same % from both A and B, while still being useful for the case where we only want to drop from one but not the other
  • Adds a test to ensure that when dropping from both, we get complete non-overlap as expected

src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
tests/test_data.py Outdated Show resolved Hide resolved
Copy link
Contributor

@lannelin lannelin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to whiteboard around design here so I'll finish up my review for now and we can discuss in person!

src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
@philswatton
Copy link
Contributor Author

philswatton commented Jan 20, 2023

I've overhauled the code in line with our discussions. The DMPair class now takes drop percentages as input, sets up a CIFAR data module from pytorch, then uses this to generate two instances of the DMCifarSubset class. This should in theory improve the overall efficiency of the code.

What isn't handled in the current code is:

  • Generalisation to allow inclusion of other datasets
    • This shouldn't be too difficult to do, would mainly need arguments for datasets A and B, additional subset classes for other datamodules (e.g. MINST), and different behaviours if we want to drop from both A and B (in the case where A and B are not the same, we don't need to consider overlap)
    • However, this is better handled in the more distant future
  • Handling of transformations, and making them different across A and B
  • Validation splits
    • In line with today's discussion, this is something we'll need to consider, especially where A and B are from the same source dataset. However, I think that's a bridge better crossed in the future.

Copy link
Contributor

@jack89roberts jack89roberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I've added a few comments (some of which may just be me misunderstanding what's going on).

src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
tests/test_data.py Outdated Show resolved Hide resolved
Copy link
Contributor

@lannelin lannelin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structure looks good.

A couple of minor comments.

Further, could you see if you can iterate through the resulting dataloaders and whether they behave as expected? I get an error trying to loop through dmpair.A.train_dataloader()
would be good to test looping through and count the no. of items. This should probably be a separate test as otherwise just testing the inds which are separate.

tests/test_data.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Show resolved Hide resolved
tests/test_data.py Outdated Show resolved Hide resolved
tests/test_data.py Outdated Show resolved Hide resolved
@philswatton
Copy link
Contributor Author

I've added several updates to the PR:

  • Taken the double test-train split out of the DMPair class into its own function
  • Tests now cover A and B
  • Extra test covering batching over the dataloader, for a total of 8 tests up from 3
  • Variable names hopefully clearer
  • Extra comments, var names for equations hopefully also clarify stuff

I haven't addressed the .setup() comment yet. Mainly becuase looking at the source code (https://github.com/Lightning-AI/lightning-bolts/blob/c26c8d8f407de386038d5fb13297233a8aa052e7/pl_bolts/datamodules/vision_datamodule.py#L10) this is the stage at which transformations etc are applied, so I'd rather handle it an another PR aimed at dealing with the transforms

Copy link
Contributor

@jack89roberts jack89roberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not gone through everything in full but LGTM - I just have some minor comments on the formatting/structure of the new splitting function.

src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Outdated Show resolved Hide resolved
src/modsim2/data/loader.py Show resolved Hide resolved
Copy link
Contributor

@lannelin lannelin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed on slack, let's add the setup override even if it's just a copy+paste of the transform code in the parent. In addition, could you add some comments and logging to clarify that we will not be doing late loading and that the data setup is happening in the initialisation of the class?
I suggest just using logging with a logger set up at module level logger = logging.getLogger(__name__) (as per https://docs.python.org/3/howto/logging.html#advanced-logging-tutorial) and a warning level message in __init__ like logger.warning('sensible message here')

tests/test_data.py Show resolved Hide resolved
Copy link
Contributor

@lannelin lannelin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice work. Very minor comment re. moving the logging statement. Happy for it to be merged after.

src/modsim2/data/loader.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jack89roberts jack89roberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@philswatton philswatton merged commit 5a5671a into develop Jan 25, 2023
@philswatton philswatton deleted the 3-double-drop branch January 25, 2023 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants