Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a fully random UUID as dataset ID #4790

Merged
merged 2 commits into from Aug 3, 2020
Merged

Use a fully random UUID as dataset ID #4790

merged 2 commits into from Aug 3, 2020

Conversation

mih
Copy link
Member

@mih mih commented Jul 31, 2020

Until now the dataset ID encoded the time of dataset creation, and a machine-identifying MAC-address (which may or may not be a reliable/persistent identifier).

DataLad neither used, nor advertised this encoded information in the ID. Moreover, having such information unconditionally encoded can counteract other features (e.g. "fake dates").

I'd argue that if identifying information of this kind is desired, it should be placed explicitly into a dataset. If UUIDs with encoded content are desired, the UUID of a dataset can be altered (right after creation) to match such needs.

Fixes gh-4785

This issue is related to datalad/datalad-metalad#30 and impacts the suitability of an identifier for use in anonymous metadata on personal data.

This PR goes against maint, because it should have no functional impact, but there is value in getting this change out sooner.

@mih mih added the metadata label Jul 31, 2020
@mih mih changed the base branch from master to maint Jul 31, 2020
Until now the dataset ID encoded the time of dataset creation, and
a machine-identifying MAC-address (which may or may not be a
reliable/persistent identifier).

DataLad neither used, nor advertised this encoded information in the
ID. Moreover, having such information unconditionally encoded can
counteract other features (e.g. "fake dates").

I'd argue that if identifying information of this kind is desired,
it should be placed explicitly into a dataset. If UUIDs with encoded
content are desired, the UUID of a dataset can be altered (right after
creation) to match such needs.

Fixes dataladgh-4785
datalad/core/local/create.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jul 31, 2020

Codecov Report

Merging #4790 into maint will decrease coverage by 0.30%.
The diff coverage is 84.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #4790      +/-   ##
==========================================
- Coverage   89.68%   89.37%   -0.31%     
==========================================
  Files         275      287      +12     
  Lines       37103    40292    +3189     
==========================================
+ Hits        33275    36011    +2736     
- Misses       3828     4281     +453     
Impacted Files Coverage Δ
datalad/config.py 96.90% <ø> (+0.30%) ⬆️
datalad/core/local/save.py 98.68% <ø> (ø)
datalad/distributed/create_sibling_gitlab.py 74.67% <ø> (ø)
...ad/distributed/tests/test_create_sibling_gitlab.py 100.00% <ø> (ø)
datalad/distribution/create_sibling_github.py 83.87% <ø> (+0.26%) ⬆️
datalad/distribution/drop.py 98.79% <ø> (ø)
datalad/distribution/install.py 98.87% <ø> (ø)
datalad/distribution/tests/test_create_github.py 95.77% <ø> (+0.18%) ⬆️
datalad/interface/tests/test_unlock.py 97.91% <ø> (-0.18%) ⬇️
datalad/interface/tests/test_utils.py 100.00% <ø> (ø)
... and 361 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55200cd...7daa9ef. Read the comment docs.

@yarikoptic
Copy link
Member

yarikoptic commented Jul 31, 2020

Not of any specific advantage (at least I don't see it besides being able to say that some random uuId is a uuid of a datalad dataset), and probably would limit the space a bit, but I recalled that for datalad special remotes I have used uuid5 with our datalad namespace:

https://github.com/datalad/datalad/blob/master/datalad/customremotes/base.py#L663

@mih
Copy link
Member Author

mih commented Aug 1, 2020

Test failure is #4773

@mih
Copy link
Member Author

mih commented Aug 1, 2020

Not of any specific advantage (at least I don't see it besides being able to say that some random uuId is a uuid of a datalad dataset), and probably would limit the space a bit, but I recalled that for datalad special remotes I have used uuid5 with our datalad namespace:

https://github.com/datalad/datalad/blob/master/datalad/customremotes/base.py#L663

I don't see how a UUID5 would fit our needs. AFAICS there is no "internal" randomness, hence we would need to generate random names to get any randomness in the UUID.

In [1]: uuid.uuid5(uuid.NAMESPACE_URL, 'http://datalad.org')
Out[1]: UUID('1a50c8dc-30dc-5327-8e59-450be59a00b7')

In [2]: uuid.uuid5(uuid.NAMESPACE_URL, 'http://datalad.org')
Out[2]: UUID('1a50c8dc-30dc-5327-8e59-450be59a00b7')

If the names have to be random, it doesn't matter if there is a constant part of it (eg. 'http://datalad.org'). Also it is based on the SHA1 of the name, hence decoding that part will not be possible

In [3]: uuid.uuid5(uuid.NAMESPACE_URL, 'http://datalad.org/a')
Out[3]: UUID('a14c0cef-27b1-50e0-9044-c8d0fe90650f')

In [4]: uuid.uuid5(uuid.NAMESPACE_URL, 'http://datalad.org/b')
Out[4]: UUID('9e0672b3-732b-5763-85e4-68462cb26385')

I think we can stick with uuid4()

@@ -459,7 +459,8 @@ def __call__(

if _seed is None:
# just the standard way
uuid_id = uuid.uuid1().urn.split(':')[-1]
# use a fully random identifier (i.e. UUID version 4)
uuid_id = str(uuid.uuid4())
else:
# Let's generate preseeded ones
uuid_id = str(uuid.UUID(int=random.getrandbits(128)))
Copy link
Member

@yarikoptic yarikoptic Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not to just always use this one (so no if _seed conditioning) then so we have a single implementation instead of conditioning here on _seed:

def uuid4():
    """Generate a random UUID."""
    return UUID(bytes=os.urandom(16), version=4)

so the only difference would be from stock uuid4 is that we would use random.getrandbits and not os.urandom ("Return a bytes object containing random bytes suitable for cryptographic use."). Not sure if "cryptographic" aspect would really matter here...?

Copy link
Member

@yarikoptic yarikoptic Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be just add an explicit version=4 as well

Copy link
Member Author

@mih mih Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of an effort towards a certified identifier setup. Being able to say "this is a UUID4" is better than "this should not be too different from a UUID4".

That being said, I am also OK with leaving things as they are and adding an explicit, separate mode where the properties of all identifiers are precisely defined.

Copy link
Member

@yarikoptic yarikoptic Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, with the above I was just trying to reduce number of "modes" of operation in the code, not to increase it... I am ok with this PR as is then.

@mih
Copy link
Member Author

mih commented Aug 3, 2020

Thx @yarikoptic

@mih mih merged commit daacea9 into datalad:maint Aug 3, 2020
1 of 2 checks passed
@mih mih deleted the bf-4785 branch Aug 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Switch dataset ID to a fully random identifier
3 participants