[15721] Index Suggestion #1347

Open
wants to merge 182 commits into
base: master
from

8 participants
@sivaprasadsudhir
Member

sivaprasadsudhir commented May 5, 2018

This PR implements a search-based algorithm that suggests the best set of indexes for a given workload, based on the AutoAdmin paper from Microsoft.

The main workflow and classes are given below:

  • The helper classes and methods for our algorithm are defined in brain/index_selection_util[.h/.c]. These include IndexSelectionKnobs (the tunable knobs of the algorithm), HypotheticalIndexObject (a hypothetical index class), and IndexConfiguration (a set of hypothetical indexes). Several helper methods and overloaded operators for IndexConfiguration are also defined here.

  • IndexSelection is the main wrapper around our tool which takes in a workload (set of queries) and the tunable parameters of the algorithm. It returns the best index configuration through the main external API GetBestIndexes

  • The methods for running the search-based algorithm are in brain/index_selection[.h/.c]. Each search iteration uses three modules:

    • GenerateCandidateIndexes generates the admissible indexes (indexes that benefit at least one query in the workload) from the provided queries and prunes the useless ones.
    • Enumerate gets the top k indexes that would reduce the cost of executing the workload, through a combination of exhaustive search (ExhaustiveEnumeration) and greedy search (GreedySearch).
    • GenerateMultiColumnIndexes generates multi-column indexes from the single-column indexes by taking a cross product, and adds them to the result, which is used for the next iteration.
  • The WhatIfIndex in brain/what_if_index[.h/.c] returns the best physical plan tree and its cost for a hypothetical index configuration. This is made possible by a set of changes to optimizer/optimizer[.h/.c].

  • The IndexSelectionContext in brain/index_selection_context[.h/.c] memoizes the cost of the query for a given configuration to reduce the number of calls sent to WhatIfAPI

  • Integrating this module to work with the self-driving infrastructure of Peloton (Brain) is a work in progress
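The workflow above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `Index`, `Config`, `CostFn`, and `CostCache` are hypothetical stand-ins for HypotheticalIndexObject, IndexConfiguration, the what-if call, and IndexSelectionContext, and only the greedy half of Enumerate is shown (the exhaustive seed step is omitted).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT the Peloton definitions.
using Index = std::vector<std::string>;                // ordered column names
using Config = std::set<Index>;                        // a set of hypothetical indexes
using CostFn = std::function<double(const Config &)>;  // stand-in for a what-if call

// Memoizes the workload cost per configuration so that repeated
// configurations do not trigger another what-if call.
class CostCache {
 public:
  explicit CostCache(CostFn what_if) : what_if_(std::move(what_if)) {}

  double Cost(const Config &config) {
    auto it = memo_.find(config);
    if (it != memo_.end()) return it->second;
    ++what_if_calls_;
    double cost = what_if_(config);
    memo_.emplace(config, cost);
    return cost;
  }

  std::size_t what_if_calls() const { return what_if_calls_; }

 private:
  CostFn what_if_;
  std::map<Config, double> memo_;
  std::size_t what_if_calls_ = 0;
};

// Greedy step: repeatedly add the candidate that lowers the cost the most,
// until k indexes are chosen or no candidate improves the cost.
Config GreedySearch(const Config &candidates, std::size_t k, CostCache &cache) {
  Config chosen;
  double best_cost = cache.Cost(chosen);
  while (chosen.size() < k) {
    const Index *best = nullptr;
    for (const Index &cand : candidates) {
      if (chosen.count(cand) > 0) continue;
      Config trial = chosen;
      trial.insert(cand);
      double cost = cache.Cost(trial);
      if (cost < best_cost) {
        best_cost = cost;
        best = &cand;
      }
    }
    if (best == nullptr) break;  // no remaining candidate helps
    chosen.insert(*best);
  }
  return chosen;
}

// Cross product: extend every chosen index with every single column
// that it does not already contain.
Config GenerateMultiColumnIndexes(const Config &chosen,
                                  const Config &single_column) {
  Config result;
  for (const Index &base : chosen) {
    for (const Index &col : single_column) {
      if (col.empty()) continue;
      if (std::find(base.begin(), base.end(), col.front()) != base.end()) continue;
      Index extended = base;
      extended.push_back(col.front());
      result.insert(extended);
    }
  }
  return result;
}
```

In a full iteration, the output of `GenerateMultiColumnIndexes` would be fed back in as the candidate set for the next round; the cache keeps repeated configurations from costing another what-if call.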

vkonagar and others added some commits May 12, 2018

Fix: Index Selection returns empty set because the
catalog cache eviction is not done properly.
Merge remote-tracking branch 'origin/auto_index' into auto_index
# Conflicts:
#	src/brain/what_if_index.cpp
Fix warning in IndexConfigComparator
warning: the specified comparator type does not provide a const call
operator [-Wuser-defined-warnings]
Hack to make travis pass the build.
DEFUALT_SCHEMA_NAME can't be found error. Fix this when merging with
master.
Hack to make travis pass the build.
DEFUALT_SCHEMA_NAME can't be found error. Fix this when merging with
master.
@malin1993ml

Overall the code quality is good! The documentation is very good. I left some comments to fix. Besides the comments, there are also two things:

  1. I didn't check all the files, but it looks like you didn't use forward declarations to reduce dependencies. You should check where you're only using pointers in the .h file, forward-declare those classes, and move the includes to the .cpp file as much as possible.
  2. Some tests on Jenkins are failing. Please fix them as well.
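To illustrate the first point, here is a minimal single-file sketch of the forward-declaration pattern. The class `IndexSelectionSketch` and this shape of `Workload` are hypothetical and only stand in for the PR's classes; the comments mark which part would live in the header and which in the .cpp file.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// ---- What would live in the header (.h) ----
class Workload;  // forward declaration: enough for pointers and references

class IndexSelectionSketch {
 public:
  explicit IndexSelectionSketch(const Workload *workload)
      : workload_(workload) {}

  // Declared here, defined where Workload is a complete type.
  std::size_t NumQueries() const;

 private:
  const Workload *workload_;  // a pointer member needs no full definition
};

// ---- What would live in the source file (.cpp), which includes the
// ---- full Workload header ----
class Workload {
 public:
  explicit Workload(std::vector<std::string> queries)
      : queries_(std::move(queries)) {}
  std::size_t Size() const { return queries_.size(); }

 private:
  std::vector<std::string> queries_;
};

std::size_t IndexSelectionSketch::NumQueries() const {
  return workload_->Size();
}
```

With this split, files that include the header no longer pull in the Workload header transitively, so a change to Workload recompiles fewer translation units.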
}
}
Workload *w;

@malin1993ml

malin1993ml May 18, 2018

Member

Where is this w used?

// Column is a table column name.
// 2. GROUP BY (if present)
// 3. ORDER BY (if present)
// 4. all updated columns for UPDATE query.

@malin1993ml

malin1993ml May 18, 2018

Member

I think we should only get the columns in the WHERE clause of the UPDATE query, not all updated columns, right? The code looks correct, but the comment looks wrong.

auto result = std::hash<std::string>()(key.second->GetInfo());
for (auto index : indexes) {
  // TODO[Siva]: Use IndexObjectHasher to hash this
  result ^= std::hash<std::string>()(index->ToString());

@malin1993ml

malin1993ml May 18, 2018

Member

You probably want to fix this now.
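One possible shape for that fix, as a hedged sketch: hash each index from its fields with a dedicated hasher instead of going through `ToString()`, and keep the XOR across indexes so the result does not depend on iteration order. `IndexSketch`, `IndexSketchHasher`, and `HashCombine` are hypothetical names; the fields `db_oid`, `table_oid`, and `column_oids` are assumed, and the real IndexObjectHasher may look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for HypotheticalIndexObject; field names are assumed.
struct IndexSketch {
  uint32_t db_oid;
  uint32_t table_oid;
  std::vector<uint32_t> column_oids;
};

// Order-sensitive combine step (the widely used boost-style mix).
inline void HashCombine(std::size_t &seed, std::size_t value) {
  seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hashes one index from its fields, without building a string.
struct IndexSketchHasher {
  std::size_t operator()(const IndexSketch &index) const {
    std::size_t seed = std::hash<uint32_t>()(index.db_oid);
    HashCombine(seed, std::hash<uint32_t>()(index.table_oid));
    for (uint32_t col : index.column_oids) {
      // Column order is part of an index's identity, so it affects the hash.
      HashCombine(seed, std::hash<uint32_t>()(col));
    }
    return seed;
  }
};

// Hashes a whole configuration. XOR is commutative, so the result is the
// same for any ordering of the same indexes, as in the reviewed loop.
std::size_t HashConfiguration(const std::vector<IndexSketch> &indexes) {
  std::size_t result = 0;
  IndexSketchHasher hasher;
  for (const IndexSketch &index : indexes) {
    result ^= hasher(index);
  }
  return result;
}
```

One caveat of plain XOR is that two identical entries cancel each other out, which is harmless only if a configuration never holds duplicates.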

// The mapping from the object to the shared pointer
std::unordered_map<HypotheticalIndexObject,
                   std::shared_ptr<HypotheticalIndexObject>,
                   IndexObjectHasher> map_;

@malin1993ml

malin1993ml May 18, 2018

Member

Why did we decide to do it this way instead of directly storing a set of HypotheticalIndexObjects? Is it for efficiency reasons?
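For context on the question, here is a sketch of what an object-to-pointer map can provide that a plain set does not: one canonical shared pointer per distinct index, so callers can hold and compare cheap pointers. This is only one plausible motivation and is not stated in the PR; `IndexSketch`, `IndexSketchHasher`, and `IndexPoolSketch` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for HypotheticalIndexObject.
struct IndexSketch {
  unsigned table_oid;
  std::vector<unsigned> column_oids;
  bool operator==(const IndexSketch &other) const {
    return table_oid == other.table_oid && column_oids == other.column_oids;
  }
};

struct IndexSketchHasher {
  std::size_t operator()(const IndexSketch &index) const {
    std::size_t seed = std::hash<unsigned>()(index.table_oid);
    for (unsigned col : index.column_oids) {
      seed ^= std::hash<unsigned>()(col) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};

// Interning pool: equal objects always map to the same shared pointer.
class IndexPoolSketch {
 public:
  std::shared_ptr<IndexSketch> GetOrCreate(const IndexSketch &index) {
    auto it = map_.find(index);
    if (it != map_.end()) return it->second;
    auto ptr = std::make_shared<IndexSketch>(index);
    map_.emplace(index, ptr);
    return ptr;
  }

  std::size_t size() const { return map_.size(); }

 private:
  std::unordered_map<IndexSketch, std::shared_ptr<IndexSketch>,
                     IndexSketchHasher> map_;
};
```

The cost is that every object is stored twice (once as the key, once behind the pointer), which is the kind of trade-off the question is probing.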

}
PELOTON_ASSERT(index->column_oids.size() > 0);
auto response = request.send().wait(client.getWaitScope());

@malin1993ml

malin1993ml May 18, 2018

Member

Can you check the response and throw a warning if it does not succeed?

// Get index objects
bool InsertIndexObject(std::shared_ptr<IndexCatalogObject> index_object);
bool EvictIndexObject(oid_t index_oid);
bool EvictIndexObject(const std::string &index_name);

@malin1993ml

malin1993ml May 18, 2018

Member

Again. These should be protected and only used by what-if API through friend class.
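A minimal sketch of the suggested pattern, with hypothetical class names (`TableCacheSketch`, `WhatIfSketch`) standing in for the catalog object and the what-if API: the mutators are protected, and only the befriended class can call them.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

class WhatIfSketch;  // forward declaration of the only intended caller

// Hypothetical stand-in for a catalog object holding index entries.
class TableCacheSketch {
 public:
  std::size_t NumIndexes() const { return indexes_.size(); }

 protected:
  // Mutators are hidden from ordinary callers.
  bool InsertIndexObject(unsigned oid, const std::string &name) {
    return indexes_.emplace(oid, name).second;
  }
  bool EvictIndexObject(unsigned oid) { return indexes_.erase(oid) > 0; }

 private:
  friend class WhatIfSketch;  // grants access to the protected mutators
  std::unordered_map<unsigned, std::string> indexes_;
};

// Hypothetical stand-in for the what-if API.
class WhatIfSketch {
 public:
  // Temporarily installs a hypothetical index, observes the table,
  // then evicts the index so the cache is left unchanged.
  static std::size_t CountWithHypothetical(TableCacheSketch &table,
                                           unsigned oid) {
    table.InsertIndexObject(oid, "hypothetical");
    std::size_t count = table.NumIndexes();
    table.EvictIndexObject(oid);
    return count;
  }
};
```

Any other caller that tries `table.InsertIndexObject(...)` fails to compile, which enforces the "only used by What-if API" note in the code rather than in a comment.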

@@ -79,6 +67,9 @@ class TableCatalogObject {
inline oid_t GetDatabaseOid() { return database_oid; }
inline uint32_t GetVersionId() { return version_id; }
// NOTE: should be only used by What-if API.
void SetValidIndexObjects(bool is_valid);

@malin1993ml

malin1993ml May 18, 2018

Member

protected.

// try get from cache
auto pg_table = Catalog::GetInstance()
                    ->GetSystemCatalogs(database_oid)
                    ->GetTableCatalog();

@malin1993ml

malin1993ml May 18, 2018

Member

This is not the correct way to do this. We should not directly access the catalog instance. We should get the database catalog object from txn->catalog_cache_, then get the table objects, and then the index objects. Everything should go through the local catalog cache of the transaction, not the global catalog instance.

// TODO: Avoid using this function.
// Copied from SQL testing util.
// Execute a SQL query end-to-end

@malin1993ml

malin1993ml May 18, 2018

Member

Yeah this is a hack.... Is there a way to fix this?

for (size_t idx = 0; idx < key_attr_list.size(); ++idx) {
  // If index cannot further reduce scan range, break
  if (idx == op->key_column_id_list.size() ||
      key_attr_list[idx] != op->key_column_id_list[idx]) {

@malin1993ml

malin1993ml May 18, 2018

Member

@chenboy It looks like this requires key_column_id_list to have exactly the same order as key_attr_list? But in fact, an index on col(a, b, c) can also help a query with the predicates (b=2 AND a=1 AND c=3), right?
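A sketch of an order-insensitive check, assuming all predicates are equality predicates: walk the index key columns in index order and test membership in the set of predicate columns, rather than comparing the two lists position by position. `MatchedKeyPrefix` and its parameter names are hypothetical and do not map onto the optimizer's actual fields.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Returns how many leading index key columns are covered by equality
// predicates, regardless of the order the predicates appear in the query.
std::size_t MatchedKeyPrefix(
    const std::vector<unsigned> &index_key_columns,
    const std::vector<unsigned> &equality_predicate_columns) {
  std::unordered_set<unsigned> predicate_set(
      equality_predicate_columns.begin(), equality_predicate_columns.end());
  std::size_t matched = 0;
  for (unsigned col : index_key_columns) {
    // The first key column without a predicate ends the usable prefix.
    if (predicate_set.count(col) == 0) break;
    ++matched;
  }
  return matched;
}
```

With columns a=0, b=1, c=2, an index on (a, b, c) and predicates listed as (b, a, c) match all three key columns, while predicates on only (b, c) match none, because the leading column a is unconstrained.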
