fix: use os urandom to generate doc id #464

Merged: hanxiao merged 3 commits into main from fix-urandom-id on Aug 1, 2022

@hanxiao (Member) commented Aug 1, 2022

replacing #462

Hi all, there is a bug in docarray's Document.id: it is affected by random.seed(). The following snippet reproduces the behavior:

import random
import numpy as np
from docarray import Document, DocumentArray

da = DocumentArray()

for _ in range(10):
    random.seed(0)
    np.random.seed(0)
    # now do some math stuff
    tensor = ...

    da.append(Document(tensor=tensor))

print(da[:, 'id'])

Then you will find that every ID is identical:

['e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd']

Expected behavior:
Doc ID should not be controlled by random.seed/numpy.seed/torch.seed etc.

Unexpected behavior:
Every doc now gets the same ID, which leads to many other problems across DocArray: push, pull, persist, and all kinds of APIs. It is also a side effect that users do not expect when using DocArray.
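
The collision can be reproduced without docarray. The sketch below assumes an ID generator that draws from Python's global `random` state; `seeded_id` is a hypothetical stand-in, since this PR description does not show the previous implementation.

```python
import random
import uuid


def seeded_id() -> str:
    """Hypothetical ID generator that reads from the global `random` state."""
    return uuid.UUID(int=random.getrandbits(128)).hex


ids = []
for _ in range(3):
    random.seed(0)  # reseeding rewinds the stream that seeded_id() reads from
    ids.append(seeded_id())

print(ids)                 # three identical 32-character hex strings
print(len(set(ids)) == 1)  # True: every ID collides
```

Any code path that reseeds for reproducibility between document creations would hit the same collision.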

Solution:

Use os.urandom(16).hex() for doc.id generation. It is a bit slower than the current approach but still faster than uuid1, and it is independent of any random seed.
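
A minimal sketch of the proposed approach, outside docarray. The helper name `new_doc_id` is hypothetical and only wraps the `os.urandom(16).hex()` call named above; it is not the merged diff.

```python
import os
import random


def new_doc_id() -> str:
    """Return a 32-character hex ID drawn from the OS entropy source."""
    return os.urandom(16).hex()


ids = []
for _ in range(10):
    random.seed(0)  # has no effect on os.urandom
    ids.append(new_doc_id())

# With 128 random bits per ID, a repeat among 10 draws is practically impossible.
print(len(set(ids)))
```

Because `os.urandom` reads from the operating system rather than from Python's `random` module, reseeding inside the loop does not make the IDs repeat.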

This bug is sneaky and very unexpected; it caused the problem reported at https://jina-ai.slack.com/archives/C0169V26ATY/p1658429345447309

In general, the current design of ID generation is flawed because it was never exercised in compute-intensive, real numerical applications.

codecov bot commented Aug 1, 2022

Codecov Report

Merging #464 (54f74f3) into main (ed7e9b6) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #464      +/-   ##
==========================================
- Coverage   83.33%   83.30%   -0.04%     
==========================================
  Files         134      134              
  Lines        6516     6516              
==========================================
- Hits         5430     5428       -2     
- Misses       1086     1088       +2     
Flag       Coverage Δ
docarray   83.30% <100.00%> (-0.04%) ⬇️

Impacted Files                         Coverage Δ
docarray/__init__.py                   75.00% <100.00%> (ø)
docarray/document/data.py              91.48% <100.00%> (ø)
docarray/array/mixins/io/pushpull.py   92.13% <0.00%> (-2.25%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hanxiao hanxiao merged commit fa8b3d0 into main Aug 1, 2022
@hanxiao hanxiao deleted the fix-urandom-id branch August 1, 2022 11:21