Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AL-2322] Improved dataset directory validation #2459

Merged
merged 6 commits into from
Jul 6, 2023

Conversation

nvoxland
Copy link
Contributor

🚀 🚀 Pull Request

Checklist:

  • My code follows the style guidelines of this project and the Contributing document
  • I have commented my code, particularly in hard-to-understand areas
  • I have kept the coverage-rate up
  • I have performed a self-review of my own code and resolved any problems
  • I have checked to ensure there aren't any other open Pull Requests for the same change
  • I have described and made corresponding changes to the relevant documentation
  • New and existing unit tests pass locally with my changes

Changes

Currently, if a single file is missing in a dataset folder (dataset_meta.json), Deep Lake thinks it’s not a deeplake dataset leading to a confusing error of ….."A Deep Lake dataset does not exist at the given path…..".

This make this check more sophisticated, making "the is this a dataset directory" check look for a variety of files while also adding a "validation" function for times when any of them missing will be a problem.

…ise-valid dataset directory.

- Added a separate dataset_validate function to check for problems in the structure
@CLAassistant
Copy link

CLAassistant commented Jun 28, 2023

CLA assistant check
All committers have signed the CLA.

@nvoxland
Copy link
Contributor Author

I added the logic, but no unit tests since I'm not sure the existing patterns for testing against file system files. I could use some pointers on that

deeplake/util/keys.py Outdated Show resolved Hide resolved
deeplake/util/keys.py Outdated Show resolved Hide resolved
deeplake/core/dataset/dataset.py Outdated Show resolved Hide resolved
deeplake/util/keys.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jun 29, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.05 ⚠️

Comparison is base (df3b051) 84.86% compared to head (3fe8baa) 84.82%.

❗ Current head 3fe8baa differs from pull request most recent head 91d4883. Consider uploading reports for the commit 91d4883 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2459      +/-   ##
==========================================
- Coverage   84.86%   84.82%   -0.05%     
==========================================
  Files         326      326              
  Lines       38306    38286      -20     
==========================================
- Hits        32509    32476      -33     
- Misses       5797     5810      +13     
Flag Coverage Δ
unittests 84.82% <100.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
deeplake/util/keys.py 94.69% <100.00%> (-2.68%) ⬇️

... and 24 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@nvoxland nvoxland marked this pull request as ready for review June 30, 2023 21:21
@nvoxland
Copy link
Contributor Author

I added a test against the in-memory storage as a unit test

@@ -132,6 +132,7 @@ wandb/
*Python-3.7*
*mem:/*
hub_pytest/
deeplake/test-dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why we need this

@nvoxland nvoxland added the trigger-test Label trigger to run tests on PRs label Jul 3, 2023
.gitignore Outdated Show resolved Hide resolved
deeplake/util/keys.py Outdated Show resolved Hide resolved
@nvoxland nvoxland added trigger-test Label trigger to run tests on PRs and removed trigger-test Label trigger to run tests on PRs labels Jul 4, 2023
@sonarcloud
Copy link

sonarcloud bot commented Jul 5, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

100.0% 100.0% Coverage
0.0% 0.0% Duplication

return (
get_dataset_meta_key(FIRST_COMMIT_ID) in storage
or get_version_control_info_key() in storage
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious is it enough to check only these things? Shouldn't we also check whether chunks exists?

@nvoxland nvoxland merged commit 3b6bf75 into main Jul 6, 2023
6 of 9 checks passed
@nvoxland nvoxland deleted the improve_dataset_exists branch July 6, 2023 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
trigger-test Label trigger to run tests on PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants