Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHOENIX-5190 Implement TaskRegionObserver for Index rebuild #457

Closed
wants to merge 1 commit into from

Conversation

gokceni
Copy link
Contributor

@gokceni gokceni commented Mar 7, 2019

No description provided.

@gokceni
Copy link
Contributor Author

gokceni commented Mar 7, 2019

@gjacoby126
Copy link
Contributor

@gokceni - what JIRA does this correspond to?

@gokceni
Copy link
Contributor Author

gokceni commented Mar 12, 2019

@gjacoby126 PHOENIX-5190 is the JIRA.

@gokceni gokceni changed the title Implement TaskRegionObserver for Index rebuild PHOENIX-5190 Implement TaskRegionObserver for Index rebuild Mar 13, 2019
Copy link
Contributor

@kadirozde kadirozde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gokceni, you have done a great job in extending the existing system task framework and implementing the new use case. I have one comment on your code for removing the INDEX_NAME column.

Your changes got me thinking about improving the design further as follows:

  • Map the task type to the class name of the corresponding task handler. This can be implemented as enum.
  • Have one thread that goes through the system task table and maps the task type to its class, creates an instance of it with task attributes and finally calls the run method of the instance.
  • Task handling classes should not know anything about the system task table or getting a phoenix connection to access its records.

I think this will simply adding new tasks and will make the code more readable.

PhoenixDatabaseMetaData.TENANT_ID + ", " +
PhoenixDatabaseMetaData.TABLE_SCHEM + ", " +
PhoenixDatabaseMetaData.TABLE_NAME + ", " +
PhoenixDatabaseMetaData.INDEX_NAME + ", " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INDEX_NAME should be a task specific attribute. I do not think it should be part of the generic task attributes. Either the name of the index table should be stored in the TABLE_NAME column or encoded in the DATA column. I suggest removing INDEX_NAME from from the DDL


task.run();

String viewIndexTableName = "_IDX_" + baseTable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should use the appropriate SchemaUtil method to construct the view index table name.

admin.disableTable(tableName);
admin.truncateTable(tableName, false);

data = "{GusWorkId:abc}";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider using a constant

assertTrue( count == numOfRuns);

// Check task status and other column values.
ResultSet rs = conn.createStatement().executeQuery("SELECT * " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this would be useful to have as a utility function somewhere. In general it's good to break up long functions into helper functions with descriptive names when there's a discrete unit of work.

mutateSystemTaskTable(conn, stmt, accessCheckEnabled);
}

public static void addTask(PhoenixConnection conn, TaskType taskType, String tenantId, String schemaName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there should be a separation of concerns here -- persisting a Task to a Phoenix table and reading it back is distinct from scheduling a task. I'd suggest breaking some of these Task classes into their own files and either fully encapsulate their persistence logic in the Tasks (the Data Access Object pattern) or have standalone persistence classes that handle the persistence (the Repository pattern).

Either way would improve the readability and reduce the coupling between the scheduling logic and the persistence logic.

Also agree with @kadirozde 's ideas to use reflection and JSON parsing to reduce the amount of boilerplate necessary to hook in each new task type for this framework.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gjacoby126 This is not a new method, it was already in this file. I moved it to this class. I am not seeing the patterns for other System tables. I will add a Tasks class tho.


public static String[] getArgValues(boolean directApi, boolean useSnapshot, String schemaName,
String dataTable, String indxTable, String tenantId) {
final List<String> args = Lists.newArrayList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is leaking IndexTool internals into this task. Would be better to provide an abstraction so you can call the IndexTool directly.

tableName = taskRecord.getTableName();
indexName = taskRecord.getIndexName();
final String[] cmdArgs =
getArgValues(true, false, schemaName, tableName, indexName, tenantId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned above, it would be better to have an API in the IndexTool that could be called directly in an object-oriented fashion. The IndexTool should only have a CLI so that it can be called from a real shell command line, but here we don't need such things.

JsonParser jsonParser = new JsonParser();
JsonObject jsonObject = jsonParser.parse(data).getAsJsonObject();
if (jsonObject.has("DisableBefore")) {
String disableBefore = jsonObject.get("DisableBefore").toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like logic that should be in the IndexTool itself.

public static final String TASK_STATUS = "TASK_STATUS";
public static final String TASK_END_TS = "TASK_END_TS";
public static final String PRIORITY = "PRIORITY";
public static final String DATA = "DATA";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make the constant named "TASK_DATA_COLUMN" or something along those lines to indicate what it's for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and same with PRIORITY)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will rename as TASK_DATA. The others don't have _COLUMN at the end


} catch (TableAlreadyExistsException e) {
long currentServerSideTableTimeStamp = e.getTable().getTimeStamp();
if (currentServerSideTableTimeStamp <= MetaDataProtocol.MIN_SYSTEM_TABLE_TIMESTAMP_5_1_0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind this change will also likely go into 4.15, so the upgrade will need to happen for either 4.15 or 5.1

@gokceni
Copy link
Contributor Author

gokceni commented Mar 14, 2019

@kadirozde (and @gjacoby126) for this comment:
"Have one thread that goes through the system task table and maps the task type to its class, creates an instance of it with task attributes and finally calls the run method of the instance."

If we do it like this, don't we tie both tasks to a single thread?
From the discussion we had, I thought we said each task having their own thread. Right now, there is 2. If we keep adding different task types, they will all run on 1 thread. Also, we might want to run them on different intervals (some tasks might not be that frequent, some might). What do you think?

@kadirozde
Copy link
Contributor

@gokceni, regarding having one main thread responsible form scanning the system table allows us to separate the system table management concerns from the individual task handing ones. Initially having one thread will be fine as we will not have many tasks. I think, when we need more threads and scheduling task based on their priorities, this would be handled within the system task framework without changing task handling classes if the concerns are separated.

@gokceni gokceni force-pushed the gokcen branch 2 times, most recently from fb432f7 to 4a4764f Compare March 22, 2019 20:18
DropTableWithViewsIT.assertTaskColumns(conn, PTable.TaskStatus.COMPLETED.toString(), PTable.TaskType.INDEX_REBUILD,
"{\"IndexName\":\""+ indexName +"\",\"TaskDetails\":\"SUCCESS\"}");

// See that index is rebuild and confirm index has rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo (should be "index is rebuilt")

initMethod.invoke(obj, env, timeMaxInterval);

// Change task status to STARTED
Task.addTask(connForTask, taskRecord.getTaskType(), taskRecord.getTenantId(), taskRecord.getSchemaName(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one server runs TaskRegionObserver, and it runs this line to set a Task as STARTED, and then immediately afterward the server dies, will the next server that runs TaskRegionObserver pick up the orphaned task and actually run it? Since you only screen out FAILED and COMPLETE above, I think the answer's yes, but making sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we exclude Completed and Failed states and get anything else

" ) VALUES(?,?,?,?,?,?,?,?)");
stmt = setValuesToAddTaskPS(stmt, taskType, tenantId, schemaName, tableName);
if (taskStatus != null) {
stmt.setString(5, taskStatus);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this logic shouldn't be in setValuesToAddTaskPS? (You could have an overload that took just the original 4 params that calls the full 8 param function)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will do

Copy link
Contributor

@gjacoby126 gjacoby126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Thanks @gokceni !

@gokceni gokceni force-pushed the gokcen branch 4 times, most recently from 9f7eb9a to b84af26 Compare April 2, 2019 21:30
Copy link
Contributor

@kadirozde kadirozde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1

@gokceni gokceni force-pushed the gokcen branch 3 times, most recently from f6516fe to bc57c47 Compare April 3, 2019 18:21
@gjacoby126
Copy link
Contributor

Closing because this was pushed via git command line

@gjacoby126 gjacoby126 closed this Apr 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants