fix: use os urandom to generate doc id #464

hanxiao · 2022-08-01T09:05:49Z

replacing #462

Hi all, there is a bug in DocArray's Document.id: it is affected by random.seed(). The behavior can be reproduced as follows:

import random
import numpy as np
from docarray import Document, DocumentArray

da = DocumentArray()

for _ in range(10):
    random.seed(0)
    np.random.seed(0)
    # now do some math stuff
    tensor = ...

    da.append(Document(tensor=tensor))

print(da[:, 'id'])

This prints ten identical IDs:

['e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd', 'e3e70682c2094cac629f6fbed82c07cd']

Expected behavior:
Doc ID should not be controlled by random.seed, numpy.random.seed, torch.manual_seed, etc.

Unexpected behavior:
Every doc now has the same ID, which leads to many downstream problems across DocArray's push, pull, persist, and other APIs. It is also a side effect that users do not expect when using DocArray.
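
The collision can be reproduced without DocArray. The sketch below assumes the ID is drawn from Python's global random generator; the description above only says the ID is affected by random.seed(), so seeded_id is a hypothetical stand-in and not DocArray's actual code:

```python
import random
import uuid


def seeded_id() -> str:
    # Hypothetical stand-in (an assumption, not DocArray's implementation):
    # builds a 128-bit hex ID from the global `random` generator, so the
    # result is fully determined by the last call to random.seed().
    return uuid.UUID(int=random.getrandbits(128)).hex


ids = []
for _ in range(10):
    random.seed(0)  # reseeding before each draw resets the generator state
    ids.append(seeded_id())

print(len(set(ids)))  # 1: the same seed always yields the same ID
```

Any user code that reseeds inside a loop, as in the reproduction above, would hit this pattern.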

Solution:

Use os.urandom(16).hex() for doc.id generation. It is a bit slower than the current approach but faster than uuid1, and it is independent of any seed.
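
A minimal sketch of the proposed approach, showing that reseeding the random module has no effect on IDs taken from the OS entropy source. The helper name new_doc_id is hypothetical and only for illustration; the actual change lives in the files touched by this PR:

```python
import os
import random


def new_doc_id() -> str:
    # 16 bytes from the OS entropy source, hex-encoded to 32 characters.
    # os.urandom does not read any state from the `random` module.
    return os.urandom(16).hex()


urandom_ids = []
for _ in range(10):
    random.seed(0)  # same reseeding pattern as in the bug report
    urandom_ids.append(new_doc_id())

# Number of distinct IDs; a collision among 128-bit values is vanishingly unlikely.
print(len(set(urandom_ids)))
```

The ID keeps the same 32-character hex shape as the one shown in the report, so code that relies on the length of the ID should be unaffected.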

This bug is sneaky and very unexpected; it caused the issue reported at https://jina-ai.slack.com/archives/C0169V26ATY/p1658429345447309

In general, the current ID generation design is flawed because it was never exercised in computationally intensive, real numerical applications.

codecov · 2022-08-01T09:11:48Z

Codecov Report

Merging #464 (54f74f3) into main (ed7e9b6) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #464      +/-   ##
==========================================
- Coverage   83.33%   83.30%   -0.04%     
==========================================
  Files         134      134              
  Lines        6516     6516              
==========================================
- Hits         5430     5428       -2     
- Misses       1086     1088       +2

Flag	Coverage Δ
docarray	`83.30% <100.00%> (-0.04%)`	⬇️

Impacted Files	Coverage Δ
docarray/__init__.py	`75.00% <100.00%> (ø)`
docarray/document/data.py	`91.48% <100.00%> (ø)`
docarray/array/mixins/io/pushpull.py	`92.13% <0.00%> (-2.25%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8d6d3d2...54f74f3.

fix: use os urandom to generate doc id

38ee935

github-actions bot added size/xs area/core component/document labels Aug 1, 2022

fix: use os urandom to generate doc id

fca7ba9

github-actions bot added size/s area/testing and removed size/xs labels Aug 1, 2022

JoanFM approved these changes Aug 1, 2022

test: avoid collide push pull

54f74f3

hanxiao merged commit fa8b3d0 into main Aug 1, 2022

hanxiao deleted the fix-urandom-id branch August 1, 2022 11:21
