[CsvIO] Create CsvIOStringToCsvRecord Class #31857
Assigning reviewers. If you would like to opt out of this review, comment R: @m-trieu for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments).
ahmedabu98
left a comment
Thanks, left some comments
Should this class be exposed to the user (i.e., a public class), or is it internal only?
Since everything is currently internal, all classes are package-private.
Can we make it possible to create a CsvIOStringToCsvRecord transform without having to provide an errorHandlerTransform?
I agree. I added the transform to the CsvIOConfiguration class itself, but I will update this class to include it.
Unfortunately, I had to completely remove error checking from this class because of a git rebase requirement: the branch could not pull the most recent version of CsvIOParseConfiguration. The test for bad records will also have to live in a different class.
nit: cleanup
Should this be outputting CSVRecord?
- extends PTransform<PCollection<String>, PCollection<Iterable<String>>> {
+ extends PTransform<PCollection<String>, PCollection<CSVRecord>> {
The output was originally CSVRecord, but because CSVRecord is read-only and its equals() method is affected by formatting differences, CSVRecord instances cannot be compared directly in test checks.
Hmmm I see. We should probably find a way around that. Looking at the CSVRecord class, there are some useful methods we'd be missing out on if we cast it to an Iterable<String>.
Instead, we could try converting the output CSVRecord to a list before doing our test checks. For example:
PCollection<List<String>> outputStrings = input
    .apply(underTest)
    .apply(MapElements.into(TypeDescriptors.lists(TypeDescriptors.strings())).via(
        record -> ImmutableList.copyOf(record.iterator())));
// PAssert checks with outputStrings
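The idea behind converting to a list can be sketched without Beam or Commons CSV. The following is only an illustration under stated assumptions: `ReadOnlyRecord` is a hypothetical stand-in for a read-only record type that, by assumption, inherits identity-based `equals()`, and `RecordListSketch.toList` is a made-up helper name. Copying the cells into a `List` gives element-by-element equality that test checks can rely on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a read-only record type; it does NOT override
// equals(), so two records with the same cells are not equal to each other.
final class ReadOnlyRecord implements Iterable<String> {
  private final String[] values;

  ReadOnlyRecord(String... values) {
    this.values = values.clone();
  }

  @Override
  public Iterator<String> iterator() {
    return Arrays.asList(values).iterator();
  }
}

final class RecordListSketch {
  // Copies the cells into a List, whose equals() compares element by element.
  static List<String> toList(Iterable<String> record) {
    List<String> cells = new ArrayList<>();
    for (String cell : record) {
      cells.add(cell);
    }
    return cells;
  }

  public static void main(String[] args) {
    ReadOnlyRecord a = new ReadOnlyRecord("a", "1");
    ReadOnlyRecord b = new ReadOnlyRecord("a", "1");
    System.out.println(a.equals(b));                 // prints false (identity comparison)
    System.out.println(toList(a).equals(toList(b))); // prints true (value comparison)
  }
}
```

In a pipeline test, the same conversion would happen inside a map step before the assertion, as in the snippet above.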
You can implement a custom comparison in the test instead of relying on equals, e.g.:
assertThat(csvRecords)
    .comparingElementsUsing(
        Correspondence.from(...customComparator))
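To make the custom-comparison idea concrete, here is a JDK-only sketch of what such a check does conceptually. It is not Truth's implementation or API: `CorrespondenceSketch`, `SAME_CELLS`, and `containsExactly` are hypothetical names introduced for illustration, and the predicate assumes non-null cells.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiPredicate;

final class CorrespondenceSketch {
  // Hypothetical predicate: two records correspond when their cells match in order.
  static final BiPredicate<Iterable<String>, Iterable<String>> SAME_CELLS =
      (actual, expected) -> {
        Iterator<String> a = actual.iterator();
        Iterator<String> e = expected.iterator();
        while (a.hasNext() && e.hasNext()) {
          if (!a.next().equals(e.next())) {
            return false;
          }
        }
        // Both iterators must be exhausted, otherwise the lengths differ.
        return !a.hasNext() && !e.hasNext();
      };

  // Returns true when every actual element pairs with a distinct expected element,
  // ignoring order. Greedy pairing is enough when the predicate behaves like an
  // equivalence, as SAME_CELLS does.
  static <A, E> boolean containsExactly(
      List<A> actual, List<E> expected, BiPredicate<? super A, ? super E> matches) {
    if (actual.size() != expected.size()) {
      return false;
    }
    List<E> remaining = new ArrayList<>(expected);
    for (A item : actual) {
      boolean matched = false;
      for (Iterator<E> it = remaining.iterator(); it.hasNext(); ) {
        if (matches.test(item, it.next())) {
          it.remove();
          matched = true;
          break;
        }
      }
      if (!matched) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<List<String>> actual =
        Arrays.asList(Arrays.asList("a", "1"), Arrays.asList("b", "2"));
    List<List<String>> expected =
        Arrays.asList(Arrays.asList("b", "2"), Arrays.asList("a", "1"));
    System.out.println(containsExactly(actual, expected, SAME_CELLS)); // prints true
  }
}
```

The point of the design is that the record type's own equals() is never called; only the supplied predicate decides whether two elements match.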
Thank you, I will try both approaches and see which is the best way around the equals mismatch. @ahmedabu98 I agree; CSVRecord is also easier to convert into a custom type, so overall it is a better choice for the classes that depend on CsvIOStringToCsvRecord.
Upon further reading of the documentation, Serializable support in CSVRecord is deprecated for future versions (those later than 1.8, which is the version currently in use). This was the main reason for using Iterable rather than CSVRecord.
via CSVRecord 1.8: Note: Support for Serializable is scheduled to be removed in version 2.0. In version 1.8 the mapping between the column header and the column index was removed from the serialised state. The class maintains serialization compatibility with versions pre-1.8 for the record values; these must be accessed by index following deserialization. There will be loss of any functionally linked to the header mapping when transferring serialised forms pre-1.8 to 1.8 and vice versa.
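For comparison, a plain list of cell values survives a Java serialization round trip with nothing lost, which is roughly why it is the safer element type here. This is only a sketch assuming the concern is Java serialization of elements; it does not show how Beam itself encodes elements, and `CellSerializationSketch` and `roundTrip` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CellSerializationSketch {
  // Serializes the cell values to bytes and reads them back.
  static List<String> roundTrip(Iterable<String> cells) {
    // Copy into an ArrayList, which is Serializable regardless of the input type.
    List<String> copyOfCells = new ArrayList<>();
    for (String cell : cells) {
      copyOfCells.add(cell);
    }
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(copyOfCells);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        @SuppressWarnings("unchecked")
        List<String> restored = (List<String>) in.readObject();
        return restored;
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("round trip failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(Arrays.asList("a", "1"))); // prints [a, 1]
  }
}
```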
Ahh I see, that is pretty unfortunate. I guess Iterable<String> makes sense to go with (even though the naming of this transform will be a little misleading)
The naming isn't the best, but the only data truly needed from the CSVRecord is the cell values, for future parsing.
sdks/java/io/csv/src/test/java/org/apache/beam/sdk/io/csv/CsvIOStringToCsvRecordTest.java
Can this class be final?
This class can be final, but that departs from the convention of all the other classes created so far, which are currently only package-private.
29c0ba2 to ed3e6c5
Run Java_Examples_Dataflow PreCommit
ahmedabu98
left a comment
LGTM
Fixes #31854
* Create CsvIOStringToCsvRecord class
* Create CsvIOStringToCsvRecord Class
* Create CsvIOStringToCsvRecord Class
* Create CsvIOStringToCsvRecord Class
* Fixed BadRecord Output
* Make class final
---------
Co-authored-by: Lahari Guduru <lahariguduru@google.com>
This PR closes #31854.