Skip to content

[GOBBLIN-1484]Make Gobblin metadata writer be able to support schema source DB#3329

Merged
sv2000 merged 7 commits intoapache:masterfrom
ZihanLi58:GOBBLIN-1484
Jul 15, 2021
Merged

[GOBBLIN-1484]Make Gobblin metadata writer be able to support schema source DB#3329
sv2000 merged 7 commits intoapache:masterfrom
ZihanLi58:GOBBLIN-1484

Conversation

@ZihanLi58
Copy link
Copy Markdown
Contributor

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    Sometime we need the avro schema in place even for ORC tables, we have that information in ingestion job but not for compaction job. And it's hard to get the avro schema from orc file itself, so we want to support schema source db, so that we can fetch the schema from source db where ingestion job registers to.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    unit test

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #3329 (f0754b1) into master (5b495af) will increase coverage by 0.03%.
The diff coverage is 7.69%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #3329      +/-   ##
============================================
+ Coverage     46.50%   46.53%   +0.03%     
- Complexity    10110    10122      +12     
============================================
  Files          2048     2048              
  Lines         79403    79428      +25     
  Branches       8864     8872       +8     
============================================
+ Hits          36928    36965      +37     
+ Misses        39051    39036      -15     
- Partials       3424     3427       +3     
Impacted Files Coverage Δ
...apache/gobblin/hive/writer/HiveMetadataWriter.java 0.00% <0.00%> (ø)
...org/apache/gobblin/iceberg/GobblinMCEProducer.java 50.00% <0.00%> (-1.36%) ⬇️
.../gobblin/iceberg/writer/IcebergMetadataWriter.java 68.62% <0.00%> (-0.32%) ⬇️
...pache/gobblin/iceberg/writer/GobblinMCEWriter.java 71.20% <100.00%> (+0.46%) ⬆️
...lin/util/filesystem/FileSystemInstrumentation.java 92.85% <0.00%> (-7.15%) ⬇️
...a/org/apache/gobblin/cluster/GobblinHelixTask.java 60.21% <0.00%> (-2.16%) ⬇️
...lin/elasticsearch/writer/FutureCallbackHolder.java 61.42% <0.00%> (-1.43%) ⬇️
...main/java/org/apache/gobblin/yarn/YarnService.java 14.36% <0.00%> (-0.79%) ⬇️
...pache/gobblin/configuration/ConfigurationKeys.java 0.00% <0.00%> (ø)
...in/java/org/apache/gobblin/cluster/HelixUtils.java 38.01% <0.00%> (+5.78%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b495af...f0754b1. Read the comment docs.

return tables;
}
protected Iterable<String> getDatabaseNames(Path path) {
/*protected Iterable<String> getDatabaseNames(Path path) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the commented out block.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

// If schema source is NONE and schema source db is set, we will directly update the schema to source db schema
String schemaSourceDb = gmce.getRegistrationProperties().get(HiveMetaStoreBasedRegister.SCHEMA_SOURCE_DB);
try {
String sourceSchema = fetchSchemaFromTable(schemaSourceDb, spec.getTable().getTableName());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we cache the sourceSchema to avoid repeated lookups?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the code to try to use the existing schema map to get the schema. But if there is no schema in schema map, which means we don't register the source table and not maintain the latest schema, we will still fetch from hive to make sure we can always have the latest schema

@autumnust
Copy link
Copy Markdown
Contributor

No additional comments beyond what @sv2000 mentioned above. But a broader question to deal with this kind of feature is: What should be the right way to specify "lineage" of schema between different tables? Is setting source.db in GMCE a right approach (which means you need to set this in a specific application's GMCE if you expect the application itself doesn't carry the schema during runtime, for example compaction), or is there something broader missing in the overall picture.

@ZihanLi58 ZihanLi58 closed this Jul 14, 2021
@ZihanLi58 ZihanLi58 reopened this Jul 14, 2021
@ZihanLi58
Copy link
Copy Markdown
Contributor Author

ZihanLi58 commented Jul 14, 2021

No additional comments beyond what @sv2000 mentioned above. But a broader question to deal with this kind of feature is: What should be the right way to specify "lineage" of schema between different tables? Is setting source.db in GMCE a right approach (which means you need to set this in a specific application's GMCE if you expect the application itself doesn't carry the schema during runtime, for example compaction), or is there something broader missing in the overall picture.

Yeah I do think there is something miss broader. Ideally, we should have a source of truth relationship graph between each table, so that when we see schema update, we can modify all tables using that schema. Leveraging config store is doable, but will introduce more complexity in manage the relationships. One better way is that we can use datahub for this case, but this will need more design. As for now, I would like to support source db in the GMCE itself to make it feasible for OSS user as well

Copy link
Copy Markdown
Contributor

@sv2000 sv2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

@sv2000 sv2000 merged commit e5b6461 into apache:master Jul 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants