[AL-2322] Improved dataset directory validation #2459
Conversation
…ise-valid dataset directory. - Added a separate dataset_validate function to check for problems in the structure
I added the logic, but no unit tests, since I'm not sure about the existing patterns for testing against file-system files. I could use some pointers on that.
Codecov Report — Patch coverage:

@@            Coverage Diff             @@
##             main    #2459      +/-   ##
==========================================
- Coverage   84.86%   84.82%    -0.05%
==========================================
  Files         326      326
  Lines       38306    38286       -20
==========================================
- Hits        32509    32476       -33
- Misses       5797     5810       +13
I added a test against the in-memory storage as a unit test.
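A unit test against in-memory storage might look like the following sketch. The dict-backed `memory_storage` and the simplified `dataset_exists` body are assumptions standing in for Deep Lake's `mem://` storage provider and the check introduced in this PR; the real key helpers are replaced with literal file names here.

```python
def dataset_exists(storage):
    # Simplified stand-in for the PR's check: treat the directory as a
    # dataset if either marker file is present in the storage mapping.
    # (Key names are assumptions, not the real key-helper output.)
    return (
        "dataset_meta.json" in storage
        or "version_control_info.json" in storage
    )


def test_dataset_exists_with_memory_storage():
    # A plain dict mimics the dict-like interface of an in-memory
    # storage provider, so the test needs no file system at all.
    memory_storage = {}
    assert not dataset_exists(memory_storage)

    memory_storage["dataset_meta.json"] = b"{}"
    assert dataset_exists(memory_storage)
```

Because the storage is just a mapping, the same test body works unchanged against any provider that exposes `__contains__`.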
@@ -132,6 +132,7 @@ wandb/
 *Python-3.7*
 *mem:/*
 hub_pytest/
+deeplake/test-dataset
not sure why we need this
Kudos, SonarCloud Quality Gate passed!
return (
    get_dataset_meta_key(FIRST_COMMIT_ID) in storage
    or get_version_control_info_key() in storage
)
Just curious: is it enough to check only these things? Shouldn't we also check whether chunks exist?
🚀 🚀 Pull Request
Checklist:
coverage-rate up

Changes
Currently, if a single file is missing in a dataset folder (e.g.
dataset_meta.json
), Deep Lake decides it's not a Deep Lake dataset, leading to the confusing error "…..A Deep Lake dataset does not exist at the given path….". This PR makes the check more sophisticated: the "is this a dataset directory" check now looks for a variety of files, and a separate "validation" function is added for cases where any of them missing would be a problem.
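The two-tier check described above can be sketched as follows. The key-helper names mirror those in the reviewed diff, but their bodies, the `FIRST_COMMIT_ID` value, and the exact set of required files are simplified assumptions; `storage` is any dict-like mapping, as with Deep Lake's storage providers.

```python
# Stand-ins for the real constants and key helpers (assumptions).
FIRST_COMMIT_ID = "first-commit-id-placeholder"


def get_dataset_meta_key(commit_id):
    # Simplified: the real helper derives the key from the commit layout.
    return f"versions/{commit_id}/dataset_meta.json"


def get_version_control_info_key():
    return "version_control_info.json"


def dataset_exists(storage):
    """Loose check: any one marker file is enough to treat the
    directory as a (possibly damaged) Deep Lake dataset."""
    return (
        get_dataset_meta_key(FIRST_COMMIT_ID) in storage
        or get_version_control_info_key() in storage
    )


def dataset_validate(storage):
    """Strict check: return every expected file that is missing, so the
    caller can raise a specific error instead of 'dataset does not exist'."""
    required = [
        get_dataset_meta_key(FIRST_COMMIT_ID),
        get_version_control_info_key(),
    ]
    return [key for key in required if key not in storage]
```

Splitting the checks this way lets the open path accept a partially damaged directory as "a dataset" while the validation pass reports exactly which files are missing.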