[Possible Bug] imdl fails to deserialize torrents with .utf-8 key variants #534

Closed
DerBunteBall opened this issue Mar 6, 2024 · 6 comments


@DerBunteBall

Hi,

imdl fails to deserialize torrents that contain the .utf-8 key variants name.utf-8 and path.utf-8.

They are not explicitly defined in BEP3, but they seem to have been introduced by BitTorrent Inc. and are used in uTorrent; also see this.

To explain it briefly:

Torrents of this type hold e.g. both a name key and a name.utf-8 key. The values of these two dict entries are encoded differently, so there are two variants of the name (I think ASCII or the system default encoding in name, and UTF-8 in the .utf-8 key). BEP3 normally says text should always be UTF-8. The same thing commonly happens in the files dict entries with path and path.utf-8, which works just like the name key.

torf accepts these torrents without problems (it simply treats the files as valid).

This can be fixed by rewriting the torrent, e.g. with torf: copy the values of the .utf-8 fields into the regular fields and then remove the .utf-8 variants.

Because this doesn't seem to violate BEP3 directly and is used in practice, I think it's best to behave like clients do: if the .utf-8 variant exists, prefer it.

I could provide a sample from the wild via Discord.
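To make the "prefer the .utf-8 variant" rule concrete, here is a minimal Python sketch. It assumes the info dict has already been decoded from bencode into a dict with bytes keys and bytes values; the helper name preferred and the sample values are hypothetical and only illustrate the layout described above.

```python
def preferred(entry: dict, key: bytes):
    """Return the value of the .utf-8 variant of key if present, else the plain key."""
    utf8_key = key + b".utf-8"
    return entry[utf8_key] if utf8_key in entry else entry[key]


# Hypothetical info dict: the plain keys use a legacy encoding (cp1252 here),
# the .utf-8 variants carry the same text encoded as UTF-8.
info = {
    b"name": "Müller".encode("cp1252"),
    b"name.utf-8": "Müller".encode("utf-8"),
    b"files": [
        {
            b"length": 3,
            b"path": ["Grüße.txt".encode("cp1252")],
            b"path.utf-8": ["Grüße.txt".encode("utf-8")],
        },
    ],
}

# Decoding the preferred values as UTF-8 always succeeds for this sample.
name = preferred(info, b"name").decode("utf-8")
paths = [
    [part.decode("utf-8") for part in preferred(f, b"path")]
    for f in info[b"files"]
]
```

For a torrent without any .utf-8 variants the helper just returns the plain value, so the lookup behaves the same as before.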

Best Regards

@casey
Owner

casey commented Mar 6, 2024

Thanks for the report! I'd be fine if someone created a PR supporting this, although it might, in practice, be quite a messy PR. imdl relies on serde for serializing and deserializing bencode, and I think it would be a bit hard to make it accept both and fall back to the .utf-8 fields.

@casey
Owner

casey commented Mar 6, 2024

I think maybe the best way to support this would be to have a fix command, which could fix common problems with torrents and avoid using serde. I.e., it would use raw bencode deserialization and, upon encountering a field that ends with .utf-8, replace the corresponding non-.utf-8 field with its value.
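The idea of such a fix step can be sketched in Python (not imdl's Rust code, and not an existing imdl command). The decode, encode and fix functions below are hypothetical helpers: a minimal raw bencode decoder and encoder without error handling, plus a pass that overwrites each plain key with its .utf-8 variant and drops the variant.

```python
SUFFIX = b".utf-8"


def decode(data: bytes, i: int = 0):
    """Decode one bencoded value starting at data[i]; return (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":  # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = decode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":  # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = decode(data, i)
            value, i = decode(data, i)
            result[key] = value
        return result, i + 1
    # byte string: <length>:<bytes>, kept as raw bytes (no UTF-8 assumption)
    colon = data.index(b":", i)
    start = colon + 1
    length = int(data[i:colon])
    return data[start:start + length], start + length


def encode(value) -> bytes:
    """Encode ints, bytes, lists and dicts (bytes keys, sorted) as bencode."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(encode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = sorted(value.items(), key=lambda kv: kv[0])
        return b"d" + b"".join(encode(k) + encode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def fix(value):
    """Recursively replace plain fields with their .utf-8 variants and drop the variants."""
    if isinstance(value, list):
        return [fix(v) for v in value]
    if isinstance(value, dict):
        out = {k: fix(v) for k, v in value.items() if not k.endswith(SUFFIX)}
        for k, v in value.items():
            if k.endswith(SUFFIX):
                out[k[:-len(SUFFIX)]] = fix(v)
        return out
    return value


# Round trip on a tiny hand-written metainfo dict (sample values are made up).
raw = encode({b"info": {b"name": b"M\xfcller",
                        b"name.utf-8": "Müller".encode("utf-8"),
                        b"piece length": 16384}})
decoded, end = decode(raw)
fixed_raw = encode(fix(decoded))
```

Because the byte strings are never interpreted as text, the non-UTF-8 value in name does not cause a failure during decoding; it is simply overwritten by the .utf-8 value before re-encoding.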

@DerBunteBall
Author

DerBunteBall commented Mar 6, 2024

The torrent dump command outputs them normally, but show and all the other commands fail. I think the dump command uses plain bencode.

It can be fixed by putting the values of the .utf-8 fields into the normal ones.

In torf, something simple like this did it for me:

#!/usr/bin/env python3
"""Copy the .utf-8 key variants over the plain keys, then drop the variants."""

import sys
from torf import Torrent

def main(argv=None):
    # read() is a classmethod, so no throwaway instance is needed.
    t = Torrent.read("my_torrent.torrent")
    info = t.metainfo["info"]
    # Multi-file torrents: prefer path.utf-8 over path in every file entry.
    for my_file in info.get("files", []):
        if "path.utf-8" in my_file:
            my_file["path"] = my_file.pop("path.utf-8")
    # Prefer name.utf-8 over name.
    if "name.utf-8" in info:
        info["name"] = info.pop("name.utf-8")
    t.write("my_torrent_fixed.torrent")

if __name__ == "__main__":
    main(sys.argv)

@DerBunteBall
Author

By the way, just removing the .utf-8 keys doesn't help. Because the name field can be in any encoding in this case, removing them leads to the same error. So serde seems to strictly expect UTF-8.
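A small Python illustration of why the plain field alone still fails under strict UTF-8 decoding. This only mirrors the behaviour described above in Python terms; it is an assumption that imdl's deserialization rejects the bytes for the same reason. The helper is_valid_utf8 and the sample name are hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same name in a legacy encoding (cp1252 as an example) and in UTF-8.
legacy_name = "Müller".encode("cp1252")  # b"M\xfcller"
utf8_name = "Müller".encode("utf-8")     # b"M\xc3\xbcller"

legacy_ok = is_valid_utf8(legacy_name)
utf8_ok = is_valid_utf8(utf8_name)
```

Here legacy_ok is False and utf8_ok is True, so dropping name.utf-8 while keeping the legacy bytes in name leaves the torrent just as unreadable for a strict UTF-8 parser.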

@DerBunteBall
Author

DerBunteBall commented Mar 11, 2024

imdl should be able to handle these things natively.

The reason is that files modified like this get new info hashes. This is true both for torrents where optional md5sum keys contain invalid data (not plain strings) and for torrents with .utf-8 key variants. That's at least the state of the implementation for now. I guess it's possible to do it in another way, but that could violate the specification.

So the fix above produces a torrent with a different info hash than the original, but its data can still be verified because the piece hashes aren't touched. torf creates a new info hash when writing out the file by hashing the info dict. I guess there is no real specification for torrent modification, so it would presumably be possible to modify the torrent without changing the info hash (just modify the bencode).

BUT a changed info hash will be confusing, and in the case where only the bencode is modified and the old hash is kept, that hash would no longer be valid for the info dict.
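The effect on the info hash can be reproduced with a short Python sketch. The info hash is the SHA-1 of the bencoded info dict, so any change to its keys or values changes the hash, even when pieces stays the same. The bencode helper is a minimal hypothetical encoder and the sample dict is made up.

```python
import hashlib


def bencode(value) -> bytes:
    """Minimal bencode encoder for ints, bytes, lists and dicts with bytes keys."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = sorted(value.items(), key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def info_hash(info: dict) -> str:
    """SHA-1 over the bencoded info dict, as a hex string."""
    return hashlib.sha1(bencode(info)).hexdigest()


# Made-up info dict with a legacy-encoded name and its .utf-8 variant.
original = {
    b"length": 3,
    b"name": "Müller".encode("cp1252"),
    b"name.utf-8": "Müller".encode("utf-8"),
    b"piece length": 16384,
    b"pieces": b"\x00" * 20,  # placeholder piece hash
}

# The rewrite from the fix: move name.utf-8 into name, leave pieces untouched.
fixed = {k: v for k, v in original.items() if k != b"name.utf-8"}
fixed[b"name"] = original[b"name.utf-8"]

hash_before = info_hash(original)
hash_after = info_hash(fixed)
same_pieces = original[b"pieces"] == fixed[b"pieces"]
```

hash_before and hash_after differ while same_pieces is True, which matches the observation above: the rewritten torrent still verifies against the same data but is identified by a different info hash.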

@casey
Owner

casey commented May 14, 2024

I think this would just be too complex for the implementation. imdl uses serde, and I can't think of a simple way to substitute values from .utf-8 keys when they are present.

@casey casey closed this as completed May 14, 2024