
Enhanced MMapIndexedDataset: less memory, higher speed #816

Conversation

davidecaroselli
Contributor

I have made an upgrade to my previous implementation of MMapIndexedDataset, now:

  • It uses up to 4 times less memory and disk space
  • Words per second is slightly improved thanks to less memory access

@myleott
Contributor

myleott commented Jun 19, 2019

Nice!

@facebook-github-bot left a comment

@myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

if vocab_size is not None and vocab_size < 65500:
    return np.uint16
else:
    return np.int32
Contributor

This makes sense. Is this the main cause of the memory/speed improvement?

Contributor Author

Yes, this is the actual boost. I now have a dataset that is 1/4 the size of the original, and this means a huge saving in memory, disk space, and also load time!
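A minimal, self-contained sketch of why this pays off (the helper name and the vocabulary size below are illustrative assumptions, not the exact fairseq code): token IDs that fit in uint16 take 2 bytes each instead of 8 for int64, so the on-disk buffer and the memory-mapped pages shrink by up to 4x.

```python
import numpy as np

def best_fitting_dtype(vocab_size=None):
    # Same decision as in the snippet above: token IDs fit in uint16
    # whenever the vocabulary stays below 65500 entries.
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    return np.int32

# 1M token IDs from a 32k-entry vocabulary (a typical BPE size).
tokens = np.random.randint(0, 32000, size=1_000_000)           # int64 on most platforms
compact = tokens.astype(best_fitting_dtype(vocab_size=32000))  # stored as uint16

print(tokens.nbytes // compact.nbytes)  # -> 4: same data, a quarter of the bytes
```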

return tensor.long()
np_array = np.frombuffer(self._bin_buffer, dtype=self._index.dtype, count=size, offset=ptr)
if self._index.dtype != np.int64:
    np_array = np_array.astype(np.int64)
Contributor

Does this change improve memory usage and/or speed as well?

Contributor Author

This is due to the fact that invoking tensor.long() on a uint16 tensor was throwing an exception (tensors support casting to int64, int32, uint8, ..., but not uint16).
This way I bypass the problem!
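A small standalone sketch of the workaround, with made-up buffer contents (not the fairseq code itself): PyTorch historically has no uint16 tensor dtype, so the numpy array is promoted to int64 before it ever reaches torch.from_numpy().

```python
import numpy as np
import torch

# Pretend this is the memory-mapped .bin buffer holding uint16 token IDs.
buffer = np.arange(10, dtype=np.uint16).tobytes()

np_array = np.frombuffer(buffer, dtype=np.uint16, count=10, offset=0)

# torch.from_numpy(np_array) would fail at this point on most PyTorch
# versions, because uint16 is not a supported tensor dtype; casting on
# the numpy side first sidesteps the problem.
if np_array.dtype != np.int64:
    np_array = np_array.astype(np.int64)

tensor = torch.from_numpy(np_array)
print(tensor.long().dtype)  # torch.int64 -- .long() is now a safe no-op cast
```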

@davidecaroselli
Contributor Author

I'm running the very last test, where I train a net on 8 GPUs for 6 hours and verify that the translation results are equivalent to the same net trained with the old version of MMapIndexedDataset.

PS: I just noticed that the new version, 0.7.0, is out. That's great, but I was really hoping to have this improvement in the release! Too late, I guess... Will I have to wait ~3 months, or are you planning a more regular and faster release schedule? It would be great to see new improvements available sooner, which is even more important at this fast-growing stage of the framework.

@myleott
Contributor

myleott commented Jun 20, 2019

Will I have to wait ~3 months, or are you planning a more regular and faster release schedule?

We'll try to release more regularly. I cut 0.7.0 because we have a major logging refactoring landing soon and I wanted to cut a release right before it. We'll probably move to 0.8.0 soon thereafter.

I can put this in a 0.7.1 as soon as it's merged.

Edit: Also the 0.7.0 tag is missing the commit that changes the version number -- will delete and retag with the updated version number and push to pypi shortly.

facebook-github-bot pushed a commit that referenced this pull request Jun 20, 2019
Summary:
I have made an upgrade to my previous implementation of MMapIndexedDataset, now:
- It uses up to **4 times less memory and disk space**
- Words per second is slightly improved thanks to less memory access
Pull Request resolved: #816

Differential Revision: D15899848

Pulled By: myleott

fbshipit-source-id: 9ddeb4809729ef69cc6b0867b33ee71184d845e6
@davidecaroselli
Contributor Author

Thanks @myleott for the amazing product! :)

@myleott
Contributor

myleott commented Jun 20, 2019

0.7.1 is on pypi with this change!

@davidecaroselli
Contributor Author

Thanks a million!

@davidecaroselli davidecaroselli deleted the features/mmap_dataset branch December 10, 2020 10:38
yzpang pushed a commit to yzpang/gold-off-policy-text-gen-iclr21 that referenced this pull request Feb 19, 2021
Summary:
I have made an upgrade to my previous implementation of MMapIndexedDataset, now:
- It uses up to **4 times less memory and disk space**
- Words per second is slightly improved thanks to less memory access
Pull Request resolved: facebookresearch/fairseq#816

Differential Revision: D15899848

Pulled By: myleott

fbshipit-source-id: 9ddeb4809729ef69cc6b0867b33ee71184d845e6