[GOBBLIN-1484]Make Gobblin metadata writer be able to support schema source DB by ZihanLi58 · Pull Request #3329 · apache/gobblin

ZihanLi58 · 2021-07-14T00:28:36Z

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1484

Description

Here are some details about my PR, including screenshots (if applicable):
Sometime we need the avro schema in place even for ORC tables, we have that information in ingestion job but not for compaction job. And it's hard to get the avro schema from orc file itself, so we want to support schema source db, so that we can fetch the schema from source db where ingestion job registers to.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
unit test

Commits

My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

codecov-commenter · 2021-07-14T00:40:44Z

Codecov Report

Merging #3329 (f0754b1) into master (5b495af) will increase coverage by 0.03%.
The diff coverage is 7.69%.

@@             Coverage Diff              @@
##             master    #3329      +/-   ##
============================================
+ Coverage     46.50%   46.53%   +0.03%     
- Complexity    10110    10122      +12     
============================================
  Files          2048     2048              
  Lines         79403    79428      +25     
  Branches       8864     8872       +8     
============================================
+ Hits          36928    36965      +37     
+ Misses        39051    39036      -15     
- Partials       3424     3427       +3

Impacted Files	Coverage Δ
...apache/gobblin/hive/writer/HiveMetadataWriter.java	`0.00% <0.00%> (ø)`
...org/apache/gobblin/iceberg/GobblinMCEProducer.java	`50.00% <0.00%> (-1.36%)`	⬇️
.../gobblin/iceberg/writer/IcebergMetadataWriter.java	`68.62% <0.00%> (-0.32%)`	⬇️
...pache/gobblin/iceberg/writer/GobblinMCEWriter.java	`71.20% <100.00%> (+0.46%)`	⬆️
...lin/util/filesystem/FileSystemInstrumentation.java	`92.85% <0.00%> (-7.15%)`	⬇️
...a/org/apache/gobblin/cluster/GobblinHelixTask.java	`60.21% <0.00%> (-2.16%)`	⬇️
...lin/elasticsearch/writer/FutureCallbackHolder.java	`61.42% <0.00%> (-1.43%)`	⬇️
...main/java/org/apache/gobblin/yarn/YarnService.java	`14.36% <0.00%> (-0.79%)`	⬇️
...pache/gobblin/configuration/ConfigurationKeys.java	`0.00% <0.00%> (ø)`
...in/java/org/apache/gobblin/cluster/HelixUtils.java	`38.01% <0.00%> (+5.78%)`	⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b495af...f0754b1. Read the comment docs.

sv2000 · 2021-07-14T18:21:20Z

gobblin-iceberg/src/test/java/org/apache/gobblin/iceberg/writer/HiveMetadataWriterTest.java

      return tables;
    }
-    protected Iterable<String> getDatabaseNames(Path path) {
+    /*protected Iterable<String> getDatabaseNames(Path path) {


Remove the commented out block.

sv2000 · 2021-07-14T18:34:52Z

gobblin-hive-registration/src/main/java/org/apache/gobblin/hive/writer/HiveMetadataWriter.java

+      // If schema source is NONE and schema source db is set, we will directly update the schema to source db schema
+      String schemaSourceDb = gmce.getRegistrationProperties().get(HiveMetaStoreBasedRegister.SCHEMA_SOURCE_DB);
+      try {
+        String sourceSchema = fetchSchemaFromTable(schemaSourceDb, spec.getTable().getTableName());


Should we cache the sourceSchema to avoid repeated lookups?

Update the code to try to use the existing schema map to get the schema. But if there is no schema in schema map, which means we don't register the source table and not maintain the latest schema, we will still fetch from hive to make sure we can always have the latest schema

autumnust · 2021-07-14T20:46:15Z

No additional comments beyond what @sv2000 mentioned above. But a broader question to deal with this kind of feature is: What should be the right way to specify "lineage" of schema between different tables? Is setting source.db in GMCE a right approach (which means you need to set this in a specific application's GMCE if you expect the application itself doesn't carry the schema during runtime, for example compaction), or is there something broader missing in the overall picture.

ZihanLi58 · 2021-07-14T22:55:26Z

No additional comments beyond what @sv2000 mentioned above. But a broader question to deal with this kind of feature is: What should be the right way to specify "lineage" of schema between different tables? Is setting source.db in GMCE a right approach (which means you need to set this in a specific application's GMCE if you expect the application itself doesn't carry the schema during runtime, for example compaction), or is there something broader missing in the overall picture.

Yeah I do think there is something miss broader. Ideally, we should have a source of truth relationship graph between each table, so that when we see schema update, we can modify all tables using that schema. Leveraging config store is doable, but will introduce more complexity in manage the relationships. One better way is that we can use datahub for this case, but this will need more design. As for now, I would like to support source db in the GMCE itself to make it feasible for OSS user as well

sv2000

+1.

ZihanLi58 added 4 commits July 12, 2021 16:42

[GOBBLIN-1484]initial commit

6ef7d0d

avoid listing when compute old specs

4cfc474

clean code

37a745e

add unit tests

a331f7e

fix the issue that new partitions have data loss

9988d54

sv2000 requested changes Jul 14, 2021

View reviewed changes

address comments

8535c0d

ZihanLi58 closed this Jul 14, 2021

ZihanLi58 reopened this Jul 14, 2021

address comments

f0754b1

sv2000 approved these changes Jul 15, 2021

View reviewed changes

sv2000 merged commit e5b6461 into apache:master Jul 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GOBBLIN-1484]Make Gobblin metadata writer be able to support schema source DB#3329

[GOBBLIN-1484]Make Gobblin metadata writer be able to support schema source DB#3329
sv2000 merged 7 commits intoapache:masterfrom
ZihanLi58:GOBBLIN-1484

ZihanLi58 commented Jul 14, 2021

Uh oh!

codecov-commenter commented Jul 14, 2021 •

edited

Loading

Uh oh!

sv2000 Jul 14, 2021

Uh oh!

ZihanLi58 Jul 14, 2021

Uh oh!

sv2000 Jul 14, 2021

Uh oh!

ZihanLi58 Jul 14, 2021

Uh oh!

autumnust commented Jul 14, 2021

Uh oh!

ZihanLi58 commented Jul 14, 2021 •

edited

Loading

Uh oh!

sv2000 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ZihanLi58 commented Jul 14, 2021

JIRA

Description

Tests

Commits

Uh oh!

codecov-commenter commented Jul 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sv2000 Jul 14, 2021

Choose a reason for hiding this comment

Uh oh!

ZihanLi58 Jul 14, 2021

Choose a reason for hiding this comment

Uh oh!

sv2000 Jul 14, 2021

Choose a reason for hiding this comment

Uh oh!

ZihanLi58 Jul 14, 2021

Choose a reason for hiding this comment

Uh oh!

autumnust commented Jul 14, 2021

Uh oh!

ZihanLi58 commented Jul 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sv2000 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Jul 14, 2021 •

edited

Loading

ZihanLi58 commented Jul 14, 2021 •

edited

Loading