
Enhanced MMapIndexedDataset: less memory, higher speed #816

Conversation

davidecaroselli
Contributor

I have made an upgrade to my previous implementation of MMapIndexedDataset, now:

  • It uses up to 4 times less memory and disk space
  • Words per second is slightly improved thanks to less memory access

@myleott
Contributor

myleott commented Jun 19, 2019

Nice!

@facebook-github-bot left a comment

@myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

if vocab_size is not None and vocab_size < 65500:
    return np.uint16
else:
    return np.int32
Contributor

This makes sense. Is this the main cause of the memory/speed improvement?

Contributor Author

Yes, this is the actual boost. I now have a dataset that is 1/4 the size of the original, and this means a huge saving in memory, disk space, and also load time!
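A minimal, self-contained sketch of why this pays off (the helper name and the vocabulary size below are illustrative assumptions, not the exact fairseq code): token IDs that fit in uint16 take 2 bytes each instead of 8 for int64, so the on-disk buffer and the memory-mapped pages shrink by up to 4x.

```python
import numpy as np

def best_fitting_dtype(vocab_size=None):
    # Same decision as in the snippet above: token IDs fit in uint16
    # whenever the vocabulary stays below 65500 entries.
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    return np.int32

# 1M token IDs from a 32k-entry vocabulary (a typical BPE size).
tokens = np.random.randint(0, 32000, size=1_000_000)           # int64 on most platforms
compact = tokens.astype(best_fitting_dtype(vocab_size=32000))  # stored as uint16

print(tokens.nbytes // compact.nbytes)  # -> 4: same data, a quarter of the bytes
```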

return tensor.long()
np_array = np.frombuffer(self._bin_buffer, dtype=self._index.dtype, count=size, offset=ptr)
if self._index.dtype != np.int64:
    np_array = np_array.astype(np.int64)
Contributor

Does this change improve memory usage and/or speed as well?

Contributor Author

This is due to the fact that invoking tensor.long() on a uint16 tensor was throwing an exception (tensors support casting to int64, int32, uint8, ..., but not uint16).
This way I bypass the problem!
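A small standalone sketch of the workaround, with made-up buffer contents (not the fairseq code itself): PyTorch historically has no uint16 tensor dtype, so the numpy array is promoted to int64 before it ever reaches torch.from_numpy().

```python
import numpy as np
import torch

# Pretend this is the memory-mapped .bin buffer holding uint16 token IDs.
buffer = np.arange(10, dtype=np.uint16).tobytes()

np_array = np.frombuffer(buffer, dtype=np.uint16, count=10, offset=0)

# torch.from_numpy(np_array) would fail at this point on most PyTorch
# versions, because uint16 is not a supported tensor dtype; casting on
# the numpy side first sidesteps the problem.
if np_array.dtype != np.int64:
    np_array = np_array.astype(np.int64)

tensor = torch.from_numpy(np_array)
print(tensor.long().dtype)  # torch.int64 -- .long() is now a safe no-op cast
```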

@davidecaroselli
Contributor Author

I'm running the very last test, where I train a net on 8 GPUs for 6 hours and verify that the translation results are equivalent to the same net trained with the old version of MMapIndexedDataset.

PS: I just noticed that the new version, 0.7.0, is out. That's great, but I was really hoping to have this improvement in the release! Too late, I guess... Will I have to wait ~3 months, or are you planning a more regular and faster release schedule? It would be great to see new improvements available sooner, which is even more important at this fast-growing stage of the framework.

@myleott
Contributor

myleott commented Jun 20, 2019

Will I have to wait ~3 months, or are you planning a more regular and faster release schedule?

We'll try to release more regularly. I cut 0.7.0 because we have a major logging refactoring landing soon and I wanted to cut a release right before it. We'll probably move to 0.8.0 soon thereafter.

I can put this in a 0.7.1 as soon as it's merged.

Edit: Also the 0.7.0 tag is missing the commit that changes the version number -- will delete and retag with the updated version number and push to pypi shortly.

facebook-github-bot pushed a commit that referenced this pull request Jun 20, 2019
Summary:
I have made an upgrade to my previous implementation of MMapIndexedDataset, now:
- It uses up to **4 times less memory and disk space**
- Words per second is slightly improved thanks to less memory access
Pull Request resolved: #816

Differential Revision: D15899848

Pulled By: myleott

fbshipit-source-id: 9ddeb4809729ef69cc6b0867b33ee71184d845e6
@davidecaroselli
Contributor Author

Thanks @myleott for the amazing product! :)

@myleott
Contributor

myleott commented Jun 20, 2019

0.7.1 is on pypi with this change!

@davidecaroselli
Contributor Author

Thanks a million!

@davidecaroselli davidecaroselli deleted the features/mmap_dataset branch December 10, 2020 10:38
yzpang pushed a commit to yzpang/gold-off-policy-text-gen-iclr21 that referenced this pull request Feb 19, 2021
Summary:
I have made an upgrade to my previous implementation of MMapIndexedDataset, now:
- It uses up to **4 times less memory and disk space**
- Words per second is slightly improved thanks to less memory access
Pull Request resolved: facebookresearch/fairseq#816

Differential Revision: D15899848

Pulled By: myleott

fbshipit-source-id: 9ddeb4809729ef69cc6b0867b33ee71184d845e6