Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-20517 : Creation of staging directory and Move operation is taking time in S3 #430

Closed
wants to merge 2 commits into from

Conversation

maheshk114
Copy link
Contributor

No description provided.

@@ -489,6 +489,11 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
+ "used in conjunction with 'hive.repl.dump.metadata.only' set to false. if 'hive.repl.dump.metadata.only' \n"
+ " is set to true then this config parameter has no effect as external table meta data is flushed \n"
+ " always by default."),
REPL_ENABLE_MOVE_OPTIMIZATION("hive.repl.enable.move.optimization", false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, highlight the side effect of setting this flag to true in HDFS system where incremental repl may lead to partial data read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -489,6 +489,11 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
+ "used in conjunction with 'hive.repl.dump.metadata.only' set to false. if 'hive.repl.dump.metadata.only' \n"
+ " is set to true then this config parameter has no effect as external table meta data is flushed \n"
+ " always by default."),
REPL_ENABLE_MOVE_OPTIMIZATION("hive.repl.enable.move.optimization", false,
"If its set to true, replcopy task copy the source files directly to the destination path. \n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall rephrase the statement as "If its set to true, REPL LOAD copies data files directly to the target table/partition location instead of copying to staging directory first and then move to target location. This optimize the REPL LOAD on object data stores such as S3 or WASB where creating a directory and move files are costly operations."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

run("CREATE TABLE " + tableNamePart + " (fld int) partitioned by (part int) STORED AS TEXTFILE", driver);

run("insert into " + tableNameNoPart + " values (1) ", driver);
run("insert into " + tableNameNoPart + " values (2) ", driver);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall just limit to 1 row.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for partitioned table first one goes in add partition flow and second one in insert flow


run("insert into " + tableNameNoPart + " values (1) ", driver);
run("insert into " + tableNameNoPart + " values (2) ", driver);
verifySetup("SELECT fld from " + tableNameNoPart , new String[]{ "1" , "2" }, driver);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verifySetup can be removed as they are generic scenarios tested by many other test cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

verifyRun("SELECT fld from " + tableNamePart + " where part = 10" , new String[]{ "1" , "2"}, driverMirror);
verifyRun("SELECT fld from " + tableNamePart + " where part = 11" , new String[]{ "3" }, driverMirror);
verifyRun("SELECT fld from " + tableNameNoPart , new String[]{ "1" , "2" }, driverMirror);
verifyRun("SELECT count(*) from " + tableNamePart , new String[]{ "3"}, driverMirror);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall remove all count(*) queries as they are already verified with values query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(*) for check if stats are updated or not

@@ -80,6 +84,8 @@
public static final String FUNCTIONS_ROOT_DIR_NAME = "_functions";
public static final String CONSTRAINTS_ROOT_DIR_NAME = "_constraints";

private static final List<String> CLOUD_SCHEME_PREFIXES = Arrays.asList("s3a", "wasb");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to hardcode this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats the only way now

withinContext.replicationSpec.setIsReplace(true);
// In case of ACID operations, same directory may have many other sub directory for different write id stmt id
// combination. So we can not set isreplace to true.
withinContext.replicationSpec.setIsReplace(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But REPL LOAD copies to sub-directory directly not base dir and so replace flag to true make sense in case of retries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a check to remove stale files during copy , if the copy is done from cm path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this change for replace flag can be reverted right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no ..for acid tables the files need not be replaced

@@ -49,6 +49,8 @@
// If set to false, it'll behave as a traditional CopyTask.
protected boolean readSrcAsFilesList = false;

protected boolean deleteDestIfExist = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it protected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is used by replcopy task

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why do we have public methods to expose this? Can be either private member + public methods or just protected if accessed by extended classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to private


if (replicationSpec.isInReplicationScope() &&
x.getCtx().getConf().getBoolean(REPL_ENABLE_MOVE_OPTIMIZATION.varname, false)) {
lft = LoadFileType.IGNORE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we skip move task instead of IGNORE type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that can be done only if its a add partition ..for insert operation move is required ..anyways thats handled in a separate patch

Path destPath;
if (replicationSpec.isInReplicationScope() &&
x.getCtx().getConf().getBoolean(REPL_ENABLE_MOVE_OPTIMIZATION.varname, false)) {
loadFileType = LoadFileType.IGNORE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we skip move task instead of IGNORE type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move task is required for statistics

@maheshk114 maheshk114 force-pushed the HIVE-20517 branch 2 times, most recently from a57ce7b to f3cf0bd Compare September 15, 2018 07:21
@@ -489,6 +489,12 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
+ "used in conjunction with 'hive.repl.dump.metadata.only' set to false. if 'hive.repl.dump.metadata.only' \n"
+ " is set to true then this config parameter has no effect as external table meta data is flushed \n"
+ " always by default."),
REPL_ENABLE_MOVE_OPTIMIZATION("hive.repl.enable.move.optimization", false,
"If its set to true, REPL LOAD copies data files directly to the target table/partition location \n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall remove all \n in the message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of the parameters are there with \n ?

@@ -1321,6 +1322,106 @@ public Boolean apply(@Nullable CallerArguments args) {
.verifyResult(replicatedDbName + ".testFunctionOne");
}

@Test
public void testMoveOptimizationBootstrapReplLoadRetryAfterFailureFor() throws Throwable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall remove "For" from test name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

};

InjectableBehaviourObjectStore.setAddNotificationModifier(callerVerifier);
replica.loadFailure(replicadb, tuple.dumpLocation, withConfigs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put in try catch block and call InjectableBehaviourObjectStore.resetAddNotificationModifier in finally block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

replica.load(replicadb, tuple.dumpLocation, withConfigs);

replica.run("use " + replicadb)
.run("select country from " + tbl)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be select place from tbl where country='india' to we know if the data file is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

try {
Hive hiveDb = Hive.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use getHive() of Task class to consider the Task conf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

withinContext.replicationSpec.setIsReplace(true);
// In case of ACID operations, same directory may have many other sub directory for different write id stmt id
// combination. So we can not set isreplace to true.
withinContext.replicationSpec.setIsReplace(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this change for replace flag can be reverted right?

OVERWRITE_EXISTING
OVERWRITE_EXISTING,
/**
* No need to rename the file, used in case of replication to s3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"No need to move the file".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -49,6 +49,8 @@
// If set to false, it'll behave as a traditional CopyTask.
protected boolean readSrcAsFilesList = false;

protected boolean deleteDestIfExist = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why do we have public methods to expose this? Can be either private member + public methods or just protected if accessed by extended classes.

Boolean success = addNotificationEventModifier.apply(entry);
if ((success != null) && !success) {
throw new MetaException("InjectableBehaviourObjectStore: Invalid addNotificationEvent operation on DB: "
+ entry.getDbName() + " table: " + entry.getTableName() + " event : " + entry.getEventType());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DB and table name need not be set always. shall use String.valueOf(entry.getDbName()) as it return "null" string if null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

printing null is fine i think

if ((success != null) && !success) {
throw new MetaException("InjectableBehaviourObjectStore: Invalid addNotificationEvent operation on DB: "
+ entry.getDbName() + " table: " + entry.getTableName() + " event : " + entry.getEventType());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to call super.addNotificationEvent(entry); if modifier method returns true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@github-actions
Copy link

github-actions bot commented Jun 9, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Jun 9, 2020
@github-actions github-actions bot closed this Jun 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants