Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a FullRelationScan operation. #164

Closed

Conversation

floriankramer
Copy link
Member

This implements a FullRelationScan opertion, which allows for queries of the form:

SELECT ?a WHERE {
 ?a ?b ?c
}
GROUP BY ?a

as we had discussed. The current implementation is based upon the FullRelationMetadata stored in the index. It does not do any optimizations (e.g. applying prefix filters when scanning the metadata or caching a prefix of the result and applying LIMIT early).
The FullRelationScan is used if all of these three conditions are met by the query:

  • There must only be one triple which must only contain variables
  • Only one of these variables (and optionally its count) must be selected.
  • The query must group on the selected variable.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM though I did add some questions. Will be interesting to see what kind of performance we get from this.

// ____________________________________________________________
ConstIterator cend() const { return _map.end(); }

// ____________________________________________________________
Iterator end() { return _map.end(); }

// ____________________________________________________________
ConstIterator end() const { return _map.end(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already cend()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use a for loop to iterate through the meta data right now (for (auto e : meta)) and that seems to only use begin and end, but I needed a constant version of those as I only had a reference to the constant metadata. std::vector seems to also have a constant end and a separate also constant cend.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about for (const auto& e: meta) {…}?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@niklas88 : the for (const auto...) overload uses the const overload to begin and end.
So they are necessary for Florians use case.
@floriankramer : Niklas is right, MetaData should by all means be accessed in a const way.
Btw: By a second thought I don't think you will have much fun with this implementation since iterating over the MetaData takes a very long time. I think storing the ids of all subjects, objects and predicates in a sorted way on disk would be much faster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joka921 Is there currently any non constant access? The Index only returns a constant reference.
I tried running the current implementation on the wikidata full dataset on galera and aborted the query after it took more than a full minute, so we definitely need to precompute this, preferably during the index building.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@floriankramer I think @joka921 was only talking about the for (auto e: meta) in your comment which didn't make the constness explicit.

@@ -215,12 +215,18 @@ class MetaDataWrapperHashMap {
// __________________________________________________________________
Iterator begin() { return _map.begin(); }

// __________________________________________________________________
ConstIterator begin() const { return _map.begin(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use cbegin() even though in std::vector there is also a const begin() see https://en.cppreference.com/w/cpp/container/vector/begin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ad_utility::HashMap does not have a cbegin, and we can't add it using a simple using statement, (that leads to g++ complaining that there is no matching member).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, ok this was just style nitpicking anyway

return os.str();
}

size_t FullRelationScan::getResultWidth() const { return 2; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct when there is not COUNT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The count is always returned by the operation. When it is not selected it is discarded during the json assembly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about when using this as a subquery e.g. to join it with wikibase:claim?

std::to_string(static_cast<int>(type)));
}

for (const std::pair<size_t, const FullRelationMetaData&> elem :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these ifs do basically the same except for which meta data is used and using a different type for predicates. I think this could be handled in a templated method which just gets called with index.getXyzMeta() and index.getNofXyz()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return _psoMeta;
} else {
AD_THROW(ad_semsearch::Exception::CHECK_FAILED,
"Can only use the spo metadta if all 6 permutations "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I mean we still have _psoMetah?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just forgot to remove the check after copying the method. It's gone now.

@floriankramer
Copy link
Member Author

Given the bad performance of this implementation on large datasets, and the possiblity of extracting similar data to what this provided using schema:name I'll close this pr for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants