Skip to content

Conversation

@abhisheknath2011
Copy link
Contributor

@abhisheknath2011 abhisheknath2011 commented Oct 13, 2022

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
  • Upgrade of Avro version to 1.9.
  • Usage of Avro compatibility helper to make avro 1.9 compatible changes (Refer - https://github.com/linkedin/avro-util).
  • Replaced deprecated public Jackson APIs with the ones compatible with Avro 1.9.
  • Replaced guava library import from avro shaded with direct guava libraries.
  • Upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency.
  • Test failure fixes due to upgrade changes.
  • Updated the Avro schemas and avro data with respect to default value, null handling etc. These are requred to fix the test failures.
  • LinkedIn github hive i.e. https://github.com/linkedin/gobblin-hive/tree/branch-1.0.1-avro fixes hive issues by using Avro compatibility helper and all the new hive artifacts are published to https://linkedin.jfrog.io/artifactory/gobblin-hive. Gobblin oss now depends on these artifacts instead of Apache hive.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
  • Fixed all the failed tests and gobblin CI looks all green.
  • Tested using internal jobs pipeline.

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

abhisheknath2011 and others added 21 commits August 12, 2021 09:51
…cated public APIs with the compatible APIs (apache#3349)

* Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs.

* Avro 1.9 upgrade compatible change - Replaced guava library import from avro shaded with direct guava libraries

* Applied Gobblin codestyle formatting.

Co-authored-by: Lei <autumnust@users.noreply.github.com>
…nd kept jackson mapper 1.x for modules with hive, helix library dependency. (apache#3368)

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency.
…#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e018:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```
…cated public APIs with the compatible APIs (apache#3349)

* Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs.

* Avro 1.9 upgrade compatible change - Replaced guava library import from avro shaded with direct guava libraries

* Applied Gobblin codestyle formatting.

Co-authored-by: Lei <autumnust@users.noreply.github.com>
…nd kept jackson mapper 1.x for modules with hive, helix library dependency. (apache#3368)

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency.
…#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e018:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```
…cated public APIs with the compatible APIs (apache#3349)

* Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs.

* Avro 1.9 upgrade compatible change - Replaced guava library import from avro shaded with direct guava libraries

* Applied Gobblin codestyle formatting.

Co-authored-by: Lei <autumnust@users.noreply.github.com>
…nd kept jackson mapper 1.x for modules with hive, helix library dependency. (apache#3368)

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency.
…#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e018:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```
…#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e018:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```
@codecov-commenter
Copy link

codecov-commenter commented Oct 13, 2022

Codecov Report

Merging #3581 (b97726d) into master (0936016) will increase coverage by 0.83%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #3581      +/-   ##
============================================
+ Coverage     46.87%   47.71%   +0.83%     
+ Complexity    10659     8598    -2061     
============================================
  Files          2117     1710     -407     
  Lines         82965    65380   -17585     
  Branches       9242     7089    -2153     
============================================
- Hits          38893    31193    -7700     
+ Misses        40507    31481    -9026     
+ Partials       3565     2706     -859     
Impacted Files Coverage Δ
.../org/apache/gobblin/test/SequentialTestSource.java 0.00% <ø> (ø)
...nversion/hive/query/HiveAvroORCQueryGenerator.java 65.89% <ø> (ø)
...ent/copy/replication/ConfigBasedMultiDatasets.java 13.95% <ø> (ø)
...ion/google/webmaster/GoogleWebmasterExtractor.java 53.53% <ø> (ø)
...le/webmaster/GoogleWebmasterExtractorIterator.java 41.73% <ø> (ø)
...blin/service/monitoring/KafkaJobStatusMonitor.java 44.88% <ø> (ø)
...e/avro/FieldAttributeBasedDeltaFieldsProvider.java 80.00% <100.00%> (+0.58%) ⬆️
...preduce/avro/MRCompactorAvroKeyDedupJobRunner.java 45.83% <100.00%> (+0.45%) ⬆️
...bblin/converter/filter/AvroSchemaFieldRemover.java 94.64% <100.00%> (ø)
...er/GobblinTrackingEventFlattenFilterConverter.java 70.90% <100.00%> (+0.53%) ⬆️
... and 418 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

maven {
url "https://linkedin.jfrog.io/artifactory/gobblin-hive/"
}
jcenter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace jcenter with maven central? We are trying to limit dependencies on jcenter because of its flakiness after becoming EoL. This makes our CI flakey

See #3554 for similar type changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed jcenter(). Will check if everything looks good from github CI side.

@Will-Lo Will-Lo changed the title Avro 1.9 upgrade of Gobblin OSS [GOBBLIN-1726] Avro 1.9 upgrade of Gobblin OSS Oct 14, 2022
for (Map.Entry<String, JsonNode> stringJsonNodeEntry : field.getJsonProps().entrySet()) {
newField.addProp(stringJsonNodeEntry.getKey(), stringJsonNodeEntry.getValue());
// Avro 1.9 compatible change - replaced deprecated public api getJsonProps with getObjectProps
for (Map.Entry<String, Object> objectEntry : field.getObjectProps().entrySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getObjectProps() doesn't exist in Avro 1.7.7, so this is not Avro version agnostic. We did add a new API in AvroCompatibilityHelper to getAllPropNames(), so you should be able to use the AvroCompatibilityHelper here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have updated the PR with the changes.

newField.addProp(stringJsonNodeEntry.getKey(), stringJsonNodeEntry.getValue());
// Avro 1.9 compatible change - replaced deprecated public api getJsonProps with getObjectProps
for (Map.Entry<String, Object> objectEntry : field.getObjectProps().entrySet()) {
newField.addProp(objectEntry.getKey(), objectEntry.getValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for addProp(). Use the AvroCompatibilityHelper instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


@Test
public void testGetMultiConvertsStrings() throws IOException {
// The below error is due to invalid avro data. Type with "null" union must have "null" first and then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not strictly accurate (as far as Avro is concerned). A union with "null" doesn't have to have null as the first type. It's usually intended to signify optional-ness, and usually people do want to put null first, so that they can set "null" as the default value, but it's not required.

What is required (as per Avro) is that the default value must have the same type as the first entry in the union. So, for example, this is correct:

"type": ["null", "string"],
"default": null

Whereas this is wrong:

// default is null, but must be a string, since that's the first type in the union
"type": ["string", "null"],
"default": null

This is also wrong:

// default is a string ("null"), not null, which is the first type in the union
"type": ["null", "string"],
"default": "null"

And this is totally correct, though usually rare to see:

"type": ["string", "null"]
"default": "foobar"

So what you really want to say in this comment is that since the default value is null, you want to ensure that that's the first type in the union.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. I have updated the comment.

@Will-Lo Will-Lo merged commit 7eaa6e8 into apache:master Oct 18, 2022
phet pushed a commit to phet/gobblin that referenced this pull request Dec 3, 2022
* [Branch avro_1_9] Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs (apache#3349)

* Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs.

* Avro 1.9 upgrade compatible change - Replaced guava library import from avro shaded with direct guava libraries

* Applied Gobblin codestyle formatting.

Co-authored-by: Lei <autumnust@users.noreply.github.com>

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency. (apache#3368)

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency.

* Changes for upgrade Avro 1.9.2 and leverges hive with avro changes from https://linkedin.jfrog.io/artifactory/gobblin-hive (apache#3458)

* Use helper-all v0.2.74 to solve issues around default values. (apache#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e018:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```

* [Branch avro_1_9] Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs (apache#3349)

* Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs.

* Avro 1.9 upgrade compatible change - Replaced guava library import from avro shaded with direct guava libraries

* Applied Gobblin codestyle formatting.

Co-authored-by: Lei <autumnust@users.noreply.github.com>

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency. (apache#3368)

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency.

* Changes for upgrade Avro 1.9.2 and leverges hive with avro changes from https://linkedin.jfrog.io/artifactory/gobblin-hive (apache#3458)

* Use helper-all v0.2.74 to solve issues around default values. (apache#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e018:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```

* [Branch avro_1_9] Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs (apache#3349)

* Avro 1.9 upgrade compatible change - replaced deprecated public APIs with the compatible APIs.

* Avro 1.9 upgrade compatible change - Replaced guava library import from avro shaded with direct guava libraries

* Applied Gobblin codestyle formatting.

Co-authored-by: Lei <autumnust@users.noreply.github.com>

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency. (apache#3368)

* Avro 1.9 upgrade compatible change - upgraded jackson mapper to 2.x and kept jackson mapper 1.x for modules with hive, helix library dependency.

* Changes for upgrade Avro 1.9.2 and leverges hive with avro changes from https://linkedin.jfrog.io/artifactory/gobblin-hive (apache#3458)

* Use helper-all v0.2.74 to solve issues around default values. (apache#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e018:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```

* Merging apache/gobblin master with avro_1_9

* Use helper-all v0.2.74 to solve issues around default values. (apache#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e018:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```

* Added deprecated json method using AvroCompatibilityHelper

* Removed unused import and replaced Integer.valueOf with Integer.parseInt

* Exclude com.linkedin.hive dependency from gradle build files similar to org.apache.hive

* Repalce direct avro field creation with AvroCompatibilityHelper.createSchemaField

* Removed extra dependency. Addressed review comment - removed jcenter() repository

* Upgrade AvroCompatHelper version

* Removed the code that are actually moved to AvroHiveTypeUtils.java in the master branch

* Addresssed review comments: replaced getObjectProps/getObjectProp with AvroCompatibilityHelper methods

* Fix for test failure

Co-authored-by: Lei <autumnust@users.noreply.github.com>
Co-authored-by: Sreeram Ramachandran <sramachandran@linkedin.com>
phet added a commit to phet/gobblin that referenced this pull request Dec 5, 2022
* upstream/master:
  move dataset handler code before cleaning up staging data (apache#3594)
  [GOBBLIN-1730] Include flow execution id when try to cancel/submit job using SimpleKafkaSpecProducer (apache#3588)
  [GOBBLIN-1734] make DestinationDatasetHandler work on streaming sources (apache#3592)
  give option to cancel helix workflow through Delete API (apache#3580)
  [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior (apache#3586)
  Support multiple node types in shared flowgraph, fix logs (apache#3590)
  Search for dummy file in writer directory (apache#3589)
  Use root cause for checking if exception is transient (apache#3585)
  [GOBBLIN-1724] Support a shared flowgraph layout in GaaS (apache#3583)
  [GOBBLIN-1731] Enable HiveMetadataWriter to override table schema lit… (apache#3587)
  [GOBBLIN-1726] Avro 1.9 upgrade of Gobblin OSS (apache#3581)
  [GOBBLIN-1725] Fix bugs in gaas warm standby mode (apache#3582)
  [GOBBLIN-1718] Define DagActionStoreMonitor to listen for kill/resume… (apache#3572)
  Add log line for committing/retrieving watermarks in streaming (apache#3578)
  [GOBBLIN-1707] Enhance `IcebergDataset` to detect when files already at dest then proceed with only delta (apache#3575)
  Ignore AlreadyExistsException in hive writer (apache#3579)
  Fail GMIP container for known transient exceptions to avoid data loss (apache#3576)
  GOBBLIN-1715: Support vectorized row batch pooling (apache#3574)
  [GOBBLIN-1696] Implement file based flowgraph that detects changes to the underlying… (apache#3548)
  GOBBLIN-1719 Replace moveToTrash with moveToAppropriateTrash for hadoop trash (apache#3573)
  [GOBBLIN-1703] avoid double quota increase for adhoc flows (apache#3550)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants