[GOBBLIN-1484] Make Gobblin metadata writer be able to support schema source DB #3329
sv2000 merged 7 commits into apache:master
Codecov Report
@@ Coverage Diff @@
## master #3329 +/- ##
============================================
+ Coverage 46.50% 46.53% +0.03%
- Complexity 10110 10122 +12
============================================
Files 2048 2048
Lines 79403 79428 +25
Branches 8864 8872 +8
============================================
+ Hits 36928 36965 +37
+ Misses 39051 39036 -15
- Partials 3424 3427 +3
    return tables;
  }
  protected Iterable<String> getDatabaseNames(Path path) {
  /*protected Iterable<String> getDatabaseNames(Path path) {
Remove the commented out block.
  // If schema source is NONE and schema source db is set, we will directly update the schema to source db schema
  String schemaSourceDb = gmce.getRegistrationProperties().get(HiveMetaStoreBasedRegister.SCHEMA_SOURCE_DB);
  try {
    String sourceSchema = fetchSchemaFromTable(schemaSourceDb, spec.getTable().getTableName());
Should we cache the sourceSchema to avoid repeated lookups?
Updated the code to first try the existing schema map to get the schema. If the schema map has no entry, which means we have not registered the source table and are not maintaining its latest schema, we still fetch from Hive so that we always have the latest schema.
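The lookup order described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class `SchemaSourceLookup`, its methods, the `db.table` key format, and the `hiveFetcher` callback (standing in for `fetchSchemaFromTable`) are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper, not part of Gobblin: illustrates "schema map first, Hive as fallback".
class SchemaSourceLookup {
  // Latest schemas of tables this writer has registered, keyed by "db.table" (assumed key format).
  private final Map<String, String> registeredSchemas = new HashMap<>();
  // Stand-in for the Hive metastore fetch (fetchSchemaFromTable in the diff above).
  private final BiFunction<String, String, String> hiveFetcher;
  // Counts fallback fetches, only so the behavior is observable in this sketch.
  int hiveFetchCount = 0;

  SchemaSourceLookup(BiFunction<String, String, String> hiveFetcher) {
    this.hiveFetcher = hiveFetcher;
  }

  // Called whenever this writer registers a table and therefore knows its latest schema.
  void recordRegisteredSchema(String db, String table, String schema) {
    registeredSchemas.put(db + "." + table, schema);
  }

  String resolveSourceSchema(String sourceDb, String table) {
    String known = registeredSchemas.get(sourceDb + "." + table);
    if (known != null) {
      // The source table is registered by this writer, so the map holds its latest schema.
      return known;
    }
    // The source table is not tracked here. The result is deliberately not cached,
    // so every call sees whatever schema Hive currently holds.
    hiveFetchCount++;
    return hiveFetcher.apply(sourceDb, table);
  }
}
```

The fallback result is intentionally left out of the map in this sketch: the writer only trusts schemas it maintains itself, which matches the stated goal of always having the latest schema when the source table is registered elsewhere.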
No additional comments beyond what @sv2000 mentioned above. But a broader question for this kind of feature is: what is the right way to specify the "lineage" of schemas between different tables? Is setting source.db in the GMCE the right approach (which means you need to set it in a specific application's GMCE if that application doesn't carry the schema itself at runtime, for example compaction), or is there something broader missing from the overall picture?
Yeah, I do think there is something broader missing. Ideally, we should have a source-of-truth relationship graph between tables, so that when we see a schema update, we can modify all tables using that schema. Leveraging the config store is doable, but it will introduce more complexity in managing the relationships. A better way would be to use DataHub for this case, but that will need more design. For now, I would like to support the source db in the GMCE itself to make it feasible for OSS users as well.
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Sometimes we need the Avro schema in place even for ORC tables. We have that information in the ingestion job but not in the compaction job, and it is hard to get the Avro schema from the ORC file itself. So we want to support a schema source db, so that we can fetch the schema from the source db that the ingestion job registers to.
Tests
unit test