1f010c4 to febb865
Added 21 commits, December 24, 2023 13:48
Changes
New behaviour
`"scan_for_errors": true`: the dataset will be read entirely at startup and any bad images will be removed if `delete_problematic_images: true`. It will remove any outdated cache entries.
`"disabled": true` in the dataset config entry.
Removed arguments
`multidatabackend.json`
`--data_backend` is now `--data_backend_config` and is a path to a dataset config. See `multidatabackend.json.example` for help converting your existing configurations over.
New arguments
`--data_backend_config` (`DATALOADER_CONFIG` in `sdxl-env.sh`)
`--override_dataset_config`
`--vae_cache_behaviour`
What: Configure the behaviour of the integrity scan check.
Why: A dataset could have incorrect settings applied at multiple points of training, e.g. if you accidentally delete the `.json` cache files from your dataset and switch the data backend config to use square images rather than aspect-crops. This results in an inconsistent data cache, which can be corrected by setting `scan_for_errors` to `true` in your `multidatabackend.json` configuration file. When this scan runs, it relies on the setting of `--vae_cache_behaviour` to determine how to resolve the inconsistency: `recreate` (the default) will remove the offending cache entry so that it can be recreated, and `sync` will update the bucket metadata to reflect the reality of the real training sample. Recommended value: `recreate`.
data backend init function
add the identifier field to all components
convert static BucketManager calls to concrete object methods where at all possible
ensure VAE caching occurs in sequence across each dataset
ensure text embeds end up in a single folder, since they are reusable
checks for valid multi-dataset settings
load multiple local datasets correctly
load multiple AWS datasets correctly
combining local and AWS datasets
ensure that crop parameter overrides are correctly used for each dataset
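To make the `scan_for_errors`, `disabled`, and `--vae_cache_behaviour` descriptions above more concrete, here is a minimal Python sketch. It is not the code from this pull request. The keys `scan_for_errors`, `delete_problematic_images`, and `disabled` and the values `recreate` and `sync` come from the description. Everything else is an assumption made for illustration: the `id` and `path` keys, the helper names `active_datasets` and `resolve_mismatch`, and the idea that an inconsistency means a recorded image size that differs from the real one.

```python
import json

# Hypothetical dataset entries. Only "scan_for_errors",
# "delete_problematic_images" and "disabled" are named in the description;
# "id" and "path" are illustrative placeholders.
DATASETS = [
    {
        "id": "local-photos",
        "path": "/data/photos",
        "scan_for_errors": True,
        "delete_problematic_images": True,
    },
    {"id": "old-set", "path": "/data/old", "disabled": True},
]


def active_datasets(entries):
    """Return only the entries that are not marked "disabled": true."""
    return [entry for entry in entries if not entry.get("disabled", False)]


def resolve_mismatch(bucket_metadata, vae_cache, sample_id, real_size,
                     behaviour="recreate"):
    """Resolve one sample whose recorded size may not match the real image.

    bucket_metadata maps sample id -> recorded (width, height).
    vae_cache maps sample id -> cached entry.
    Both are plain dicts here; the real project may store them differently.
    """
    if bucket_metadata.get(sample_id) == real_size:
        return "ok"  # nothing to fix
    if behaviour == "recreate":
        # Drop the offending cache entry so it can be rebuilt later.
        vae_cache.pop(sample_id, None)
        return "removed"
    if behaviour == "sync":
        # Trust the real sample and update the recorded metadata instead.
        bucket_metadata[sample_id] = real_size
        return "synced"
    raise ValueError(f"unknown behaviour: {behaviour!r}")


if __name__ == "__main__":
    # Show the entries as JSON and run both behaviours on a toy mismatch.
    print(json.dumps(active_datasets(DATASETS), indent=2))
    metadata = {"a.png": (1024, 1024)}
    cache = {"a.png": "cached-latent"}
    print(resolve_mismatch(metadata, cache, "a.png", (1024, 768)))
    print(resolve_mismatch(metadata, cache, "a.png", (1024, 768), "sync"))
```

In this sketch `recreate` leaves the bucket metadata untouched and only discards the cache entry, while `sync` keeps the cache entry and rewrites the metadata. That is one reading of why the description recommends `recreate`: the stale entry is rebuilt from the current settings rather than kept.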