Skip to content

Conversation

@lahariguduru
Copy link
Contributor

This PR closes #31854.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @m-trieu for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@lahariguduru lahariguduru changed the title [CsvIO] Create CsIOStringToCsvRecord Class [CsvIO] Create CsvIOStringToCsvRecord Class Jul 12, 2024
Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, left some comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this class be exposed to the user (ie. public class) or is it internal only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since everything is currently internal, all classes are package private

Comment on lines 41 to 40
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it possible to create a CsvIOStringToCsvRecord transform without having to provide an errorHandlerTransform?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I added the transform to the CsvIOConfiguration class itself but will update this class to include it.

Copy link
Contributor Author

@lahariguduru lahariguduru Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately had to completely remove checking for errors within this class due to git rebase requirement -- the branch being unable to pull the most recent version of CsvIOParseConfiguration. The test for Bad record will also have to be in different class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cleanup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be outputting CSVRecord?

Suggested change
extends PTransform<PCollection<String>, PCollection<Iterable<String>>> {
extends PTransform<PCollection<String>, PCollection<CSVRecord>> {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output was originally CSVRecord, but due to CSVRecord being read-only and formatting differences in equals() method, it is not possible to compare CSVRecord(read-only) in test checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I see. We should probably find a way around that. Looking at the CSVRecord class, there's some useful methods we'd be missing out on if we cast it to an Iterable<String>.

Instead, we can maybe try converting the output CSVRecord to an ArrayList before doing our test checks. For example:

PCollection<String> outputStrings = input
        .apply(underTest)
        .apply(MapElements.into(TypeDescriptors.lists(Typedescriptors.strings())).via(
                                record -> ImmutableList.copyOf(record.iterator())));

// Passert checks with outputStrings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can implement custom comparison in test to override the equals

I.e

assertThat(csvRecords)
          .comparingElementsUsing(
           Correspondence.from(...customComparator) 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I will implement both measures and see the best way to go about getting around the equals mismatch. @ahmedabu98 I agree, CSVRecord is also easier to convert into custom type so overall a better choice for corresponding classes that require CsvIOStringToCsvRecord.

Copy link
Contributor Author

@lahariguduru lahariguduru Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further documentation, in future versions the Serializable is deprecated in CSVRecord (versions later than 1.8, which is currently being used). This was the main reason for using Iterable rather than CSVRecord.

via CSVRecord 1.8: Note: Support for Serializable is scheduled to be removed in version 2.0. In version 1.8 the mapping between the column header and the column index was removed from the serialised state. The class maintains serialization compatibility with versions pre-1.8 for the record values; these must be accessed by index following deserialization. There will be loss of any functionally linked to the header mapping when transferring serialised forms pre-1.8 to 1.8 and vice versa.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see, that is pretty unfortunate. I guess Iterable<String> makes sense to go with (even though the naming of this transform will be a little misleading)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming itself isn't the best but only data truly needed from the CsvRecord is the cell for future parsing.

@lahariguduru lahariguduru requested a review from ahmedabu98 July 12, 2024 23:58
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this class be final?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class can be final but it defies the convention of all other created classes that are currently only package private.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can implement custom comparison in test to override the equals

I.e

assertThat(csvRecords)
          .comparingElementsUsing(
           Correspondence.from(...customComparator) 

@lahariguduru lahariguduru force-pushed the string-to-csv-record branch from 29c0ba2 to ed3e6c5 Compare July 15, 2024 16:33
@lahariguduru
Copy link
Contributor Author

Run Java_Examples_Dataflow PreCommit

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ahmedabu98
Copy link
Contributor

Fixes #31854

@ahmedabu98 ahmedabu98 merged commit ea76111 into apache:master Jul 15, 2024
acrites pushed a commit to acrites/beam that referenced this pull request Jul 17, 2024
* Create CsvIOStringToCsvRecord class

* Create CsvIOStringToCsvRecord Class

* Create CsvIOStringToCsvRecord Class

* Create CsvIOStringToCsvRecord Class

* Fixed BadRecord Output

* Make class final

---------

Co-authored-by: Lahari Guduru <lahariguduru@google.com>
reeba212 pushed a commit to reeba212/beam that referenced this pull request Dec 4, 2024
* Create CsvIOStringToCsvRecord class

* Create CsvIOStringToCsvRecord Class

* Create CsvIOStringToCsvRecord Class

* Create CsvIOStringToCsvRecord Class

* Fixed BadRecord Output

* Make class final

---------

Co-authored-by: Lahari Guduru <lahariguduru@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CsvIO]: Create CsvIOStringToCsvRecord Class

3 participants