Add Bert Language Modeling example #21818
Conversation
Run Python 3.8 PostCommit

Postcommits and unit tests pass locally. PTAL @tvalentyn

Can one of the admins verify this patch?
Codecov Report
@@ Coverage Diff @@
## master #21818 +/- ##
==========================================
- Coverage 74.15% 74.07% -0.08%
==========================================
Files 698 699 +1
Lines 92417 92504 +87
==========================================
- Hits 68530 68526 -4
- Misses 22636 22727 +91
Partials 1251 1251
Changed the dataset to a custom file with my own sentences. If we get the OK for the model, then I think this example should be good.
'--output',
dest='output',
help='Path where to save output predictions.'
' text file.')
broken help string
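The two string fragments above concatenate to "Path where to save output predictions. text file.", which reads as a broken sentence. A minimal sketch of the kind of fix being requested (the exact wording here is an assumption, not the final text of the PR):

```python
import argparse

# Join the help text into one coherent phrase instead of splitting
# it mid-sentence across two string literals.
parser = argparse.ArgumentParser()
parser.add_argument(
    '--output',
    dest='output',
    help='Path to the text file in which to save the output predictions.')
args = parser.parse_args(['--output', 'predictions.txt'])
print(args.output)  # predictions.txt
```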
'masked_text': text_and_masked_text_tuple,
'predicted_text': text_and_predictions
})
| 'Merge' >> beam.CoGroupByKey()
Is this something that can be folded into PostProcess? Isn't the masked_text already part of the predicted result?
Yes, we can fold it in. The original text, not the masked text, is the key of predicted_text. I.e. the format is:

masked_text is (original_text, masked_text)
predicted_text is (original_text, predicted_word)

and then we join on the original_text key.
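The join semantics described above can be sketched in plain Python (this is a stand-in for what `beam.CoGroupByKey()` computes, not Beam code; the sample sentences are made up for illustration):

```python
from collections import defaultdict

# Both collections are keyed by the original text, so CoGroupByKey
# groups their values under that shared key.
masked_text = [('the dog ate', 'the [MASK] ate')]   # (original_text, masked_text)
predicted_text = [('the dog ate', 'dog')]           # (original_text, predicted_word)

merged = defaultdict(lambda: {'masked_text': [], 'predicted_text': []})
for key, value in masked_text:
    merged[key]['masked_text'].append(value)
for key, value in predicted_text:
    merged[key]['predicted_text'].append(value)

print(dict(merged))
```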
Actually, the CoGBK can't be folded.
text_and_tokenized_text_tuple
| 'PyTorchRunInference' >> RunInference(
    KeyedModelHandler(model_handler)).with_output_types(
        Tuple[str, PredictionResult])
I don't think we need these hints anymore.
Added custom text of sentences. Ran local tests.

Removed CoGBK for simplicity.
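The simplification is that once RunInference emits (original_text, PredictionResult) pairs, the predicted word can be recovered in a single post-processing step, so the separate CoGroupByKey merge is unnecessary. A plain-Python stand-in (the `PredictionResult` namedtuple and the sample values here are hypothetical illustrations, not Beam's actual dataclass):

```python
from collections import namedtuple

# Hypothetical stand-in for Beam's PredictionResult: the model input
# (the masked text) and the model's inference.
PredictionResult = namedtuple('PredictionResult', ['example', 'inference'])

def post_process(keyed_result):
    # The key already carries the original text, so no join is needed.
    original_text, result = keyed_result
    predicted_word = result.inference  # in the real example: decoded from logits
    return f'{original_text};{predicted_word}'

print(post_process(('the dog ate', PredictionResult('the [MASK] ate', 'dog'))))
```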
test_pipeline = TestPipeline(is_integration_test=True)
# Path to text file containing some sentences
file_of_sentences = 'gs://apache-beam-ml/datasets/custom/sentences.txt'  # pylint: disable=line-too-long
output_file_dir = 'gs://apache-beam-ml/testing/predictions'
For test output, it's better to use a bucket with a lifecycle configured, to leave less clutter behind. For example:

:~$ gsutil lifecycle get gs://temp-storage-for-end-to-end-tests/
{"rule": [{"action": {"type": "Delete"}, "condition": {"age": 14}}]}

Lifecycle may be per bucket (not sure if we can configure it just for ./testing/predictions), so switching outputs for all tests to gs://temp-storage-for-end-to-end-tests/ may be easiest.
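The lifecycle rule quoted above can be written out as JSON with the stdlib, e.g. to produce a file for `gsutil lifecycle set` (the 14-day Delete rule is taken directly from the comment; the filename is an arbitrary example):

```python
import json

# The rule shown by `gsutil lifecycle get`: delete objects older than 14 days.
lifecycle = {
    'rule': [{'action': {'type': 'Delete'}, 'condition': {'age': 14}}]
}
print(json.dumps(lifecycle))
```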
cc: @AnandInguva FYI.
SG, thanks.
Add Bert Language Modeling example
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Choose reviewer(s) and mention them in a comment (R: @username).
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.