
RAM Issue in Tutorial 9 (DPR Training) Colab #735

Closed
brandenchan opened this issue Jan 14, 2021 · 6 comments
Assignees: brandenchan
Labels: type:bug Something isn't working

Comments

@brandenchan
Contributor

When running the DPR Training tutorial in Colab, the download of the training dataset seems to run fine, but the program crashes before starting to download the dev file due to running out of RAM.

# Download original DPR data
# WARNING: the train set is 7.4GB and the dev set is 800MB
doc_dir = "data/dpr_training/"
s3_url_train = "https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz"
s3_url_dev = "https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz"
fetch_archive_from_http(s3_url_train, output_dir=doc_dir + "train/")
fetch_archive_from_http(s3_url_dev, output_dir=doc_dir + "dev/")

We need to find some way to reduce this RAM consumption. It likely has something to do with the way the files are being uncompressed:

def fetch_archive_from_http(url: str, output_dir: str, proxies: Optional[dict] = None):
            ...
            elif url[-3:] == ".gz":
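                # NOTE: this reads and decompresses the entire archive into memory at once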
                json_bytes = gzip.open(temp_file.name).read()
                filename = url.split("/")[-1].replace(".gz", "")
                output_filename = Path(output_dir) / filename
                output = open(output_filename, "wb")
                output.write(json_bytes)
            else:
                ...
        return True
@brandenchan brandenchan added the type:bug Something isn't working label Jan 14, 2021
@brandenchan brandenchan self-assigned this Jan 14, 2021
@Timoeller
Contributor

How about our tutorial just uses small files so people can quickly go through the code + execution, and we include the links to the large data files as comments for interested users? Roughly like the sketch below.
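For illustration only (the sample URL is a hypothetical placeholder, not an existing file):

# Use a small sample by default so the notebook runs quickly in Colab
doc_dir = "data/dpr_training/"
s3_url_sample = "https://example.com/dpr_sample.json.gz"  # hypothetical placeholder URL
fetch_archive_from_http(s3_url_sample, output_dir=doc_dir + "train/")

# Full DPR data for interested users (7.4GB train / 800MB dev):
# fetch_archive_from_http("https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz", output_dir=doc_dir + "train/")
# fetch_archive_from_http("https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz", output_dir=doc_dir + "dev/")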

@lalitpagaria
Contributor

lalitpagaria commented Jan 14, 2021

Actually, the following line reads the whole file into memory and also decompresses it in memory:

json_bytes = gzip.open(temp_file.name).read()

Python 3 supports buffered reading automatically, so if we change it to the following, memory utilization will improve:

            elif url[-3:] == ".gz":
                filename = url.split("/")[-1].replace(".gz", "")
                output_filename = Path(output_dir) / filename
                with gzip.open(temp_file.name) as f, open(output_filename, "wb") as output:
                    for line in f:
                        output.write(line)

I have not tested the above snippet; I will do so tonight. If it does not work, there is another solution: reading the file in chunks. But I don't think that will be necessary, as gzip already supports buffered IO.
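For reference, a minimal sketch of that chunked fallback (not tested against the repo; the 1 MB chunk size and the helper name are arbitrary assumptions):

import gzip
import shutil
from pathlib import Path

def decompress_gz_in_chunks(archive_path: str, output_filename: Path, chunk_size: int = 1024 * 1024):
    # Copy the decompressed stream in fixed-size chunks so only one chunk
    # is held in memory at a time, instead of the whole decompressed file.
    with gzip.open(archive_path, "rb") as f, open(output_filename, "wb") as output:
        shutil.copyfileobj(f, output, length=chunk_size)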

One more point: we could decompress the file directly from the URL instead of downloading it to a temp file and then decompressing it. Python compression libs have streaming support. Refer to #709 (comment).
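Roughly what that could look like (a sketch assuming requests is available and the helper name is hypothetical; not the exact approach from #709):

import gzip
import shutil
from pathlib import Path

import requests

def fetch_and_decompress(url: str, output_filename: Path):
    # Stream the compressed bytes from the URL and decompress on the fly,
    # so neither the archive nor the decompressed file is held fully in memory.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        # response.raw is a file-like object over the raw (still gzipped) bytes
        with gzip.open(response.raw, "rb") as f, open(output_filename, "wb") as output:
            shutil.copyfileobj(f, output)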

@brandenchan
Contributor Author

Hey @lalitpagaria, thanks so much for the suggestion! I actually tested it and it solved the problem. I haven't removed the tempfile code, but I did integrate your snippet in #737.

@lalitpagaria
Contributor

@brandenchan glad to hear it. Thanks for testing it. 🙂

@lalitpagaria
Contributor

@tholor This can be closed now

@tholor tholor closed this as completed Jan 22, 2021
@nguyen-brat

Have you updated the repo? When I run the tutorial I still face the same issue :((
