This repository has been archived by the owner on Mar 19, 2024. It is now read-only.
Background
This PR is a follow-up to issue 755.
It uses threads to parallelize the saveVectors method across multiple cores, which cuts wall-clock time for large datasets (from 7m39s to 3m21s in the benchmark below) at the cost of more total CPU time. The order of the vectors in the output is not guaranteed.
Benchmarks
With binary built from master branch:
$ time ./fasttext-master supervised -input data/amazon_review_polarity_csv/train.csv -output out-single
Read 282M words
Number of words: 6735845
Number of labels: 0
Progress: 100.0% words/sec/thread: 2829461 lr: 0.000000 loss: -nan ETA: 0h 0m
real 7m39.323s
user 11m50.848s
sys 0m25.348s
With binary built using parallel saveVectors:
$ time ./fasttext supervised -input data/amazon_review_polarity_csv/train.csv -output out-parallel
Read 282M words
Number of words: 6735845
Number of labels: 0
Progress: 100.0% words/sec/thread: 2754298 lr: 0.000000 loss: -nan ETA: 0h 0m
real 3m21.412s
user 17m18.682s
sys 0m18.058s
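For readers who want a picture of the approach described under Background, here is a minimal sketch of one way to split vector output across threads. It is not the code from this PR: the `WordVector` struct, the `saveVectorsParallel` function, and the buffer-then-lock strategy are all assumptions made for illustration. Each thread formats a contiguous slice of the vocabulary into a private buffer and appends it to the shared stream under a mutex, so lines within a slice stay in order while the slices themselves land in whatever order the threads finish.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a vocabulary entry and its embedding.
struct WordVector {
  std::string word;
  std::vector<float> values;
};

// Illustrative only: formats each slice in its own thread, then appends the
// finished buffer to `out` while holding a mutex. Slice order in the output
// depends on thread scheduling and is therefore not guaranteed.
void saveVectorsParallel(const std::vector<WordVector>& vectors,
                         std::ostream& out, std::size_t numThreads) {
  assert(numThreads > 0);
  std::mutex outMutex;
  std::vector<std::thread> workers;
  // Ceiling division so every entry belongs to exactly one slice.
  const std::size_t chunk = (vectors.size() + numThreads - 1) / numThreads;
  for (std::size_t t = 0; t < numThreads; ++t) {
    const std::size_t begin = std::min(t * chunk, vectors.size());
    const std::size_t end = std::min(begin + chunk, vectors.size());
    if (begin == end) break;  // no entries left for further threads
    workers.emplace_back([&vectors, &out, &outMutex, begin, end] {
      std::ostringstream buffer;  // private to this thread, no locking needed
      for (std::size_t i = begin; i < end; ++i) {
        buffer << vectors[i].word;
        for (float v : vectors[i].values) buffer << ' ' << v;
        buffer << '\n';
      }
      std::lock_guard<std::mutex> lock(outMutex);
      out << buffer.str();  // one locked write per slice
    });
  }
  for (auto& w : workers) w.join();
}
```

Because only the slice order varies, two outputs produced this way contain the same set of lines, so any comparison between a single-threaded and a multi-threaded run has to be order-insensitive (for example, comparing sorted lines).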
Validated equal outputs with:
CPU details