Enhanced MMapIndexedDataset: less memory, higher speed #816
Conversation
Nice!
@myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```python
if vocab_size is not None and vocab_size < 65500:
    return np.uint16
else:
    return np.int32
```
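To illustrate the dtype selection above, here is a minimal self-contained sketch (the function name and the stand-alone framing are assumptions for illustration; in the PR this logic lives inside the dataset builder). The 65500 threshold leaves headroom below the `uint16` maximum of 65535:

```python
import numpy as np

def best_fitting_dtype(vocab_size=None):
    """Pick the narrowest NumPy dtype that can hold every token ID.

    Hypothetical helper mirroring the logic in the diff: token IDs fit
    in uint16 when the vocabulary is small enough; otherwise fall back
    to int32.
    """
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    else:
        return np.int32

# A uint16 token takes 2 bytes instead of 8 for int64, which is where
# the roughly 4x reduction in memory and disk space comes from.
```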
This makes sense. Is this the main cause of the memory/speed improvement?
Yes, this is the actual boost. I now have a dataset that is 1/4 the size of the original, which means a huge saving in memory, disk space, and load time!
```python
return tensor.long()
np_array = np.frombuffer(self._bin_buffer, dtype=self._index.dtype, count=size, offset=ptr)
if self._index.dtype != np.int64:
    np_array = np_array.astype(np.int64)
```
Does this change improve memory usage and/or speed as well?
This is due to the fact that invoking `tensor.long()` on a `uint16` tensor was throwing an exception (tensors support casting only from `int64`, `int32`, `uint8`, ... but not `uint16`). This way I bypass the problem!
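The workaround can be sketched in NumPy alone (the buffer here is a stand-in for the real memory-mapped `.bin` file; variable names follow the diff, but the example is an illustration, not the PR's actual code path). The cast to `int64` happens on the NumPy side, before the array would ever be handed to PyTorch:

```python
import numpy as np

# Stand-in for the memory-mapped binary buffer of uint16 token IDs.
buffer = np.array([10, 20, 70], dtype=np.uint16).tobytes()

# Read tokens back out of the raw buffer, as the dataset does.
np_array = np.frombuffer(buffer, dtype=np.uint16, count=3, offset=0)

# Widen to int64 on the NumPy side: PyTorch cannot cast a uint16
# tensor with tensor.long(), so the conversion must happen here.
if np_array.dtype != np.int64:
    np_array = np_array.astype(np.int64)
```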
I'm running the very last test, where I train a net on 8 GPUs for 6 hours and check that the translation results are equivalent to the same net trained with the old version of MMapIndexedDataset.

PS: I just noticed the new version 0.7.0 is out. That's great, but I was really hoping to have this improvement in the release! Too late, I guess... Will I have to wait ~3 months, or are you planning a more regular and faster release schedule? It would be great to see new improvements become available in less time, which is even more important at this fast-growing stage of the framework.
We'll try to release more regularly. I cut 0.7.0 because we have a major logging refactoring landing soon and I wanted to cut a release right before it. We'll probably move to 0.8.0 soon thereafter. I can put this in a 0.7.1 as soon as it's merged.

Edit: Also, the 0.7.0 tag is missing the commit that changes the version number -- will delete and retag with the updated version number and push to pypi shortly.
Summary: I have made an upgrade to my previous implementation of MMapIndexedDataset, now:
- It uses up to **4 times less memory and disk space**
- Words per second is slightly improved thanks to less memory access

Pull Request resolved: #816
Differential Revision: D15899848
Pulled By: myleott
fbshipit-source-id: 9ddeb4809729ef69cc6b0867b33ee71184d845e6
Thanks @myleott for the amazing product! :)
0.7.1 is on pypi with this change!
Thanks a million!