Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-21029: External table replication for existing deployments running incremental replication. #523

Closed
wants to merge 4 commits into from

Conversation

sankarh
Copy link
Contributor

@sankarh sankarh commented Jan 29, 2019

No description provided.

@sankarh sankarh force-pushed the HIVE-21029 branch 3 times, most recently from a7ea7d9 to 859b07e Compare January 30, 2019 06:51
@@ -204,7 +211,7 @@ public static boolean shouldReplicate(NotificationEvent tableForEvent,
.getTableName(), e);
return false;
}
return shouldReplicate(replicationSpec, table, hiveConf);
return shouldReplicate(replicationSpec, table, false, hiveConf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here true should be passed as this is called for incremental dump. Better to change the prototype of this function also to avoid misuse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

// If incremental events are already applied, then check and perform if need to bootstrap any tables.
if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) {
if (work.hasBootstrapLoadTasks()) {
LOG.debug("Current incremental dump have tables to be bootstrapped. Switching to bootstrap "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can run in parallel to incremental load. For external table both incremental and bootstrap can not be there

Copy link
Contributor Author

@sankarh sankarh Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot be done. We have to bootstrap external tables only after finish applying events. This is to avoid problems if some non-external table uses same name as external table and it has alter/drop table event in the current dump.
Anyways, bootstrap in itself loads all external tables in parallel. Also, it is metadata-only operations and most of the work is with copy of external tables data which anyways done in parallel with incremental events load.

@@ -351,13 +366,23 @@ public StageType getType() {

private int executeIncrementalLoad(DriverContext driverContext) {
try {
IncrementalLoadTasksBuilder builder = work.incrementalLoadTasksBuilder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (loadWork.getPathsToCopyIterator().hasNext()) {
taskChainTail.addDependentTask(TaskFactory.get(loadWork, conf));
}

This can be removed from IncrementalLoadTasksBuilder.build method

Copy link
Contributor Author

@sankarh sankarh Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Will remove it.

// Update last repl ID of the database only if the current dump is not incremental. If bootstrap
// is combined with incremental dump, it contains only tables to bootstrap. So, needn't change
// last repl ID of the database.
if (!iterator.hasNext() && !constraintIterator.hasNext() && !work.isIncrementalLoad()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no checkpoint support for external table bootstrap ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkpoint support is there. It is done in table/partition level for inc-bootstrap case. Database object should be unaltered and hence no need to set checkpoint flag on DB.

+ " repl dump to bootstrap all external tables."),
REPL_BOOTSTRAP_EXTERNAL_TABLES("hive.repl.bootstrap.external.tables", false,
"Indicates if repl dump should bootstrap the information about external tables during incremental \n"
+ "phase of replication. It should be used in conjunction with 'hive.repl.include.external.tables' \n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should not be a config value as it may cause repeated bootstrap of external table. This can be passed as some parameter in dump command.

Copy link
Contributor Author

@sankarh sankarh Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be set in WITH clause of REPL DUMP. Also, as the description says, it should be set to true only once for repl dump after enabling external tables replication. No issues having it as config provided user sets it properly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it should not be a config ..should be a flag. Or atleast block it some how , so that it can be given only using repl dump command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, updated description of this config to highlight that it should be set only via WITH clause of REPL DUMP.

// This in-turn applicable for partitions creation as well.
if ((parentDb != null) && (!event.replicationSpec().allowReplacementInto(parentDb.getParameters()))) {
return tracker;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it removed ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to this check, external tables are not bootstrapped as DB is later (higher last_repl_id) than table itself. Also, this check doesn't make sense as it is bootstrap flow where we set DB's last_repl_id as 0 before triggering tables bootstrap. It will be always true. Also, in retry case, we depend on checkpoints to ignore replicated tables not this.

if (conf.getBoolVar(HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES)
&& (!conf.getBoolVar(HiveConf.ConfVars.REPL_DUMP_METADATA_ONLY)
|| conf.getBoolVar(HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES))) {
Path dbRoot = getBootstrapDbRoot(dumpRoot, dbName, true);
try (Writer writer = new Writer(dumpRoot, conf)) {
for (String tableName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) {
Table table = hiveDb.getTable(dbName, tableName);
if (TableType.EXTERNAL_TABLE.equals(table.getTableType())) {
writer.dataLocationDump(table);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataLocationDump should be done only if its not metadata only dump

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is validated inside dataLocationDump method and it is noop if metadata-only.

!conf.getBoolVar(HiveConf.ConfVars.REPL_DUMP_METADATA_ONLY)) {
if (conf.getBoolVar(HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES)
&& (!conf.getBoolVar(HiveConf.ConfVars.REPL_DUMP_METADATA_ONLY)
|| conf.getBoolVar(HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the condition is very confusing ..add a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

if (conf.getBoolVar(HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES)
&& (!conf.getBoolVar(HiveConf.ConfVars.REPL_DUMP_METADATA_ONLY)
|| conf.getBoolVar(HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES))) {
Path dbRoot = getBootstrapDbRoot(dumpRoot, dbName, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getBootstrapDbRoot should be called only if inc-bootstrap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? getBootstrapDbRoot takes flag to decide if it is for inc-bootstrap or normal one.

FileSystem fs = incBootstrapDir.getFileSystem(hiveConf);
if (fs.exists(incBootstrapDir)) {
this.bootstrapIterator = new BootstrapEventsIterator(incBootstrapDir.toString(), dbNameToLoadIn, hiveConf);
this.constraintsIterator = new ConstraintEventsIterator(dumpDirectory, hiveConf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.constraintsIterator ..why this is required ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeBootstrapLoad method expects it to be not null even if constraints are not there. Also, I set it here so that in future if bootstrapped table (such as ACID tables) have constraints, it works fine.

@sankarh sankarh closed this Feb 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants