Dataframe wordcount example. #12889
5697966 to 0bd0e44
    | 'Split' >> beam.FlatMap(
        lambda line: re.findall(r'[\w]+', line)).with_output_types(str)
    # Map to Row objects to generate a schema suitable for conversion to a dataframe.
    | 'ToRows' >> beam.Map(lambda word: beam.Row(word=word)))
Maybe use Select here?
I actually toyed with that, but it's not as natural (or time-saving) for 1-field schemas.
LGTM
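For readers without a Beam environment, what the Split and ToRows steps in the diff above compute can be sketched in plain Python. This is only an illustration, not the PR's code: `WordRow`, `split_words`, and `to_rows` are hypothetical names, and `WordRow` is a stand-in for `beam.Row(word=...)` that carries the same single named field.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class WordRow(NamedTuple):
    """Hypothetical stand-in for beam.Row(word=...): one named field."""
    word: str


def split_words(lines: Iterable[str]) -> Iterator[str]:
    """Flat-map each line to its words, using the regex from the diff."""
    for line in lines:
        yield from re.findall(r'[\w]+', line)


def to_rows(words: Iterable[str]) -> List[WordRow]:
    """Wrap each word in a single-field record, as the 'ToRows' step does."""
    return [WordRow(word=w) for w in words]


rows = to_rows(split_words(["to be, or not", "to be"]))
print([r.word for r in rows])
```

The named field is what gives the collection a schema with a `word` column; with only one field, wrapping each element by hand stays about as short as any projection helper would be.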
    df = to_dataframe(words)
    df['count'] = 1
    counted = df.groupby('word').sum()
    counted.to_csv(known_args.output)
It could be nice to tee counted back to a PCollection and print it as an example of to_pcollection. That's easier to do once unbatching is the default; I can add it as part of #12882. WDYT?
Makes sense, let's do that.
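The counting idiom in the diff above (add a constant `count` column, group by `word`, sum) can be tried in ordinary pandas. The sketch below assumes the deferred dataframe returned by `to_dataframe` behaves like a pandas DataFrame for these operations; `count_words` and the sample input are hypothetical, and no Beam pipeline is involved.

```python
import pandas as pd


def count_words(words):
    """Count occurrences per word with the same three steps as the diff."""
    df = pd.DataFrame({"word": list(words)})
    df["count"] = 1                      # every row contributes one occurrence
    return df.groupby("word").sum()      # sums 'count' within each word group


counted = count_words(["the", "cat", "the", "hat", "the"])
print(counted.to_csv())
```

The result is indexed by `word` with a single `count` column, which is why writing it with `to_csv` produces one line per distinct word.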
CHANGES.md
Outdated
* X feature added (Java/Python) ([BEAM-X](https://issues.apache.org/jira/browse/BEAM-X)).

## Breaking Changes

* X behavior was changed ([BEAM-X](https://issues.apache.org/jira/browse/BEAM-X)).
* Python 2 and Python 3.5 support dropped.
* Pandas 1.x allowed. Older version of Pandas may still be used, but may not be as well tested.
Should this comment be part of the Dataframes note above? I don't think pandas 1.x support has any broader implications.
I think it makes more sense here: the main implication is how diamond dependencies might get resolved.
Gotcha, makes sense
0bd0e44 to 895f222
895f222 to 4146e55
Codecov Report
@@ Coverage Diff @@
## master #12889 +/- ##
=======================================
Coverage 82.32% 82.33%
=======================================
Files 452 453 +1
Lines 54016 54040 +24
=======================================
+ Hits 44471 44496 +25
+ Misses 9545 9544 -1
Also add reference in release notes.