Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in to_pydantic_model/from_pydantic_model #221

Closed
JohannesMessner opened this issue Mar 23, 2022 · 0 comments · Fixed by #223
Closed

Bug in to_pydantic_model/from_pydantic_model #221

JohannesMessner opened this issue Mar 23, 2022 · 0 comments · Fixed by #223
Labels
type/bug Something isn't working

Comments

@JohannesMessner
Copy link
Member

JohannesMessner commented Mar 23, 2022

Problem description:

The class methods to_pydantic_model() / from_pydantic_model() do not appear to be inverses of each other.

How to reproduce:

>>> Document.from_pydantic_model(Document(blob=b'hello').to_pydantic_model())
binascii.Error: Invalid base64-encoded string: number of data characters (5) cannot be 1 more than a multiple of 4

Seems like this part of the conversion (in the from_pydantic_model() method) might be based on an assumption that does not always hold:

elif f_name == 'blob':
    # here is a dirty fishy itchy trick
    # the original bytes will be encoded two times:
    # first time is real during `to_dict/to_json`, it converts into base64 string
    # second time is at `from_dict/from_json`, it is unnecessary yet inevitable, the result string get
    # converted into a binary string and encoded again.
    # consequently, we need to decode two times here!
    fields[f_name] = base64.b64decode(base64.b64decode(value))

My guess is that the blob is actually only encoded once, but it tries to decode twice, giving the error above.

Related Issue

The .dict() method of PydanticDocument does not implement any decoding of blob, but I think it should:

>>> d = Document(blob=b'hello')
>>> d.blob
b'hello'
>>> d_pyd = d.to_pydantic_model()  # encodes the blob
>>> d_pyd.blob
'aGVsbG8='
>>> d_pyd.dict()['blob']  # I think this should decode the blob and return b'hello'
'aGVsbG8='

@JoanFM I believe that if this behaviour of .dict() is fixed then the trick in the core of creating a DocumentArray becomes unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant