
RAM Issue in Tutorial 9 (DPR Training) Colab #735

Closed
brandenchan opened this issue Jan 14, 2021 · 6 comments
Assignees: brandenchan
Labels: type:bug Something isn't working

Comments

@brandenchan
Contributor

When running the DPR Training tutorial in Colab, the download of the training dataset seems to run fine, but the program crashes before starting to download the dev file due to running out of RAM.

# Download original DPR data
# WARNING: the train set is 7.4GB and the dev set is 800MB
doc_dir = "data/dpr_training/"
s3_url_train = "https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz"
s3_url_dev = "https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz"
fetch_archive_from_http(s3_url_train, output_dir=doc_dir + "train/")
fetch_archive_from_http(s3_url_dev, output_dir=doc_dir + "dev/")

We need to find some way to reduce this RAM consumption. It likely has something to do with the way the files are being uncompressed:

def fetch_archive_from_http(url: str, output_dir: str, proxies: Optional[dict] = None):
            ...
            elif url[-3:] == ".gz":
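                # NOTE: this reads and decompresses the entire archive into memory at once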
                json_bytes = gzip.open(temp_file.name).read()
                filename = url.split("/")[-1].replace(".gz", "")
                output_filename = Path(output_dir) / filename
                output = open(output_filename, "wb")
                output.write(json_bytes)
            else:
                ...
        return True
@brandenchan brandenchan added the type:bug Something isn't working label Jan 14, 2021
@brandenchan brandenchan self-assigned this Jan 14, 2021
@Timoeller
Contributor

How about our tutorial just uses small files so people can quickly go through the code + execution, and we include the links to the large data files as comments for interested users? Roughly like the sketch below.
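For illustration only (the sample URL is a hypothetical placeholder, not an existing file):

# Use a small sample by default so the notebook runs quickly in Colab
doc_dir = "data/dpr_training/"
s3_url_sample = "https://example.com/dpr_sample.json.gz"  # hypothetical placeholder URL
fetch_archive_from_http(s3_url_sample, output_dir=doc_dir + "train/")

# Full DPR data for interested users (7.4GB train / 800MB dev):
# fetch_archive_from_http("https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz", output_dir=doc_dir + "train/")
# fetch_archive_from_http("https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz", output_dir=doc_dir + "dev/")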

@lalitpagaria
Contributor

lalitpagaria commented Jan 14, 2021

Actually, the following line reads the whole file into memory and also decompresses it in memory:

json_bytes = gzip.open(temp_file.name).read()

Python 3 supports buffered reading automatically, so if we change it to the following, memory utilization will improve:

            elif url[-3:] == ".gz":
                filename = url.split("/")[-1].replace(".gz", "")
                output_filename = Path(output_dir) / filename
                with gzip.open(temp_file.name) as f, open(output_filename, "wb") as output:
                    for line in f:
                        output.write(line)

I have not tested the above snippet; I will do so tonight. If it does not work, there is another solution: reading the file in chunks. But I don't think that will be necessary, as gzip already supports buffered IO.
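For reference, a minimal sketch of that chunked fallback (not tested against the repo; the 1 MB chunk size and the helper name are arbitrary assumptions):

import gzip
import shutil
from pathlib import Path

def decompress_gz_in_chunks(archive_path: str, output_filename: Path, chunk_size: int = 1024 * 1024):
    # Copy the decompressed stream in fixed-size chunks so only one chunk
    # is held in memory at a time, instead of the whole decompressed file.
    with gzip.open(archive_path, "rb") as f, open(output_filename, "wb") as output:
        shutil.copyfileobj(f, output, length=chunk_size)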

One more point: we could decompress the file directly from the URL instead of downloading it to a temp file and then decompressing it. Python compression libs have streaming support. Refer to #709 (comment).
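Roughly what that could look like (a sketch assuming requests is available and the helper name is hypothetical; not the exact approach from #709):

import gzip
import shutil
from pathlib import Path

import requests

def fetch_and_decompress(url: str, output_filename: Path):
    # Stream the compressed bytes from the URL and decompress on the fly,
    # so neither the archive nor the decompressed file is held fully in memory.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        # response.raw is a file-like object over the raw (still gzipped) bytes
        with gzip.open(response.raw, "rb") as f, open(output_filename, "wb") as output:
            shutil.copyfileobj(f, output)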

@brandenchan
Contributor Author

Hey @lalitpagaria, thanks so much for the suggestion! I actually tested it and it solved the problem. I haven't removed the tempfile code, but I did integrate your snippet in #737.

@lalitpagaria
Contributor

@brandenchan glad to hear it. Thanks for testing it. 🙂

@lalitpagaria
Contributor

@tholor This can be closed now

@tholor tholor closed this as completed Jan 22, 2021
@nguyen-brat

Have you updated the repo? When I run the tutorial I still face the same issue :((
