Fix record conversion for Arrays #591

Open
wants to merge 1 commit into
base: develop

@tgmof tgmof commented Nov 16, 2022

Issue summary: I cannot use Wrangler or any other provided XML plugin for a seemingly simple use case: importing nested XML data with repeated elements (i.e. JSON arrays) into any sink.

Steps to reproduce:
1. Create a pipeline GCS -> Wrangler -> any sink (with the GCS input path set as a runtime variable).

2. Use the following sample to create the output schema (with the xml-to-json transform), then run the pipeline with this file.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<MyRoot>
    <SomeField>
        <Total>65.95</Total>
        <Total>3.98</Total>
        <Total TotalType="FinalTotal">65.95</Total>
    </SomeField>
    <Timer>
        <StartTimestamp>2022-10-03T11:01:48</StartTimestamp>
    </Timer>
</MyRoot>

3. Observe that the pipeline succeeds.

4. Change the source to a new file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<MyRoot>
    <SomeField>
        <Total>65.95</Total>
    </SomeField>
    <Timer>
        <StartTimestamp>2022-10-03T11:01:48</StartTimestamp>
    </Timer>
</MyRoot>

5. Observe that the pipeline fails with the "Unable to decode array 'body_MyRoot_SomeField'" error.

Why this PR? Because there is no general way to know whether an XML document contains repeated elements, so any field should be expected to be repeated.
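
To illustrate the shape problem, here is a minimal sketch that uses only the JDK's DOM parser. It assumes the common XML-to-JSON convention of emitting a scalar for a single child element and an array for repeated ones; `naiveValue` is a hypothetical helper written for this example, not Wrangler's actual xml-to-json implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class RepeatedElementShapeSketch {

  // Hypothetical helper: returns a String when exactly one element matches
  // the tag, and a List<String> otherwise (assumed convention, see lead-in).
  static Object naiveValue(String xml, String tag) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList nodes = doc.getElementsByTagName(tag);
      if (nodes.getLength() == 1) {
        // A single occurrence collapses to a scalar.
        return nodes.item(0).getTextContent();
      }
      List<String> values = new ArrayList<>();
      for (int i = 0; i < nodes.getLength(); i++) {
        values.add(nodes.item(i).getTextContent());
      }
      return values;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  public static void main(String[] args) {
    String repeated = "<SomeField><Total>65.95</Total><Total>3.98</Total></SomeField>";
    String single = "<SomeField><Total>65.95</Total></SomeField>";
    System.out.println(naiveValue(repeated, "Total") instanceof List); // true
    System.out.println(naiveValue(single, "Total") instanceof List);   // false
  }
}
```

The same tag yields two different runtime types depending on the file, while the output schema was fixed from the first sample.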

Why I think it's a good idea to do that in the standard CDAP code:
1. Correct me if I'm wrong, but RecordConvertor.java is meant to convert the runtime input data to match the output schema. It is NOT meant to validate the input against the output schema.
2. An array is a "high-level" data type: it is always filled with elements that have a type of their own (or it has no elements, in which case there is no issue in the first place). Wrapping the value with Collections.singletonList(object) is therefore pretty much the array equivalent of Double.parseDouble(value), which is already in this code: we simply cast the input to match the output schema.
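
The coercion described above can be sketched roughly as follows. This is a standalone illustration, not the actual diff: `coerceToList` is a hypothetical name, and the real RecordConvertor code has additional schema handling that is omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class ArrayCoercionSketch {

  // Hypothetical helper: makes a runtime value fit an array-typed schema field.
  static List<Object> coerceToList(Object value) {
    if (value == null) {
      // Leave nulls alone; nullability is a separate concern.
      return null;
    }
    if (value instanceof Collection) {
      // Already a collection: copy it as-is.
      return new ArrayList<>((Collection<?>) value);
    }
    // A lone value becomes a one-element list, the "array equivalent"
    // of parsing a string into a double.
    return Collections.singletonList(value);
  }

  public static void main(String[] args) {
    System.out.println(coerceToList(Arrays.asList("65.95", "3.98"))); // [65.95, 3.98]
    System.out.println(coerceToList("65.95"));                        // [65.95]
  }
}
```

With this kind of wrapping, the second sample file (a single `<Total>`) would produce a one-element array instead of failing to decode.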

google-cla bot commented Nov 16, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.
