New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grouping #63
Grouping #63
Conversation
Can you add documentation (e.g. in README.md) explaining the constraints on the current implementation of these features. Also I think some commits like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with questions and nitpicks
GROUP BY ?profession | ||
ORDER BY ?avg | ||
|
||
Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggreagates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question, when using MAX
or MIN
is there a way to get other attributes from the left side that has the max/min. For example when dealing with Freebase's dated integers, MAX
over the date (idoes MAX
use sort order?) would give you the latest date of the entry but is there a way to get the integer at that date or does that need full subquery support including LIMIT
and ORDER BY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAX and MIN don't interpret knowledgebase ids in the current implementation. For ordering, the numerical order of the id is used.
As far as I understand the sparql standard, after the group by only basic arithmetic, HAVING and ORDER BY can affect the result. I don't know how Freebase's dated integers are stored, so I'm not certain if full subquery support would be needed, but it does seem likely.
ORDER BY ?avg | ||
|
||
Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggreagates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`. | ||
Group concat also supports a custom separator: `(GROUP_CONCAT(?a ; separator=" ; ") as ?concat)`. Xsd types float, decimal and integer are recognized as numbers, other types or unbound variables (e.g. no entries for an optional part) in one of the aggregates that need to interpret the variable (e.g. AVG) lead to either no result or nan. MAX with an unbound variable will always return the unbound variable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like it has potential for breaking some output formats if they lack escaping. We should check that at least JSON uses proper escaping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some basic escaping is applied, and everything is contained within a string, as quotation marks are not allowed within the separator string (at least not in the current implementation).
* @param hasRelation A mapping from entity ids to sets of relations | ||
* @param patterns A mapping from pattern ids to patterns | ||
* @param subjectColumn The column containing the entities for which the | ||
* relations should be counted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation isn't appreciated enough so let me say. Thanks for documenting!
struct Aggregate { | ||
AggregateType _type; | ||
size_t _inCol, _outCol; | ||
// Used to store the string necessary for the group concat aggregate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also say why this is void*
instead of std::string
_aliases.push_back(a); | ||
} | ||
} | ||
std::sort(_aliases.begin(), _aliases.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment why we sort here
@@ -180,6 +194,19 @@ class QueryExecutionTree { | |||
row[validIndices[validIndices.size() - 1].first])) | |||
<< "\"]"; | |||
break; | |||
case ResultTable::ResultType::FLOAT: { | |||
float f; | |||
std::memcpy(&f, &row[validIndices[validIndices.size() - 1].first], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at another point something similar used a reinterpret_cast
I think this variant here is cleaner and safer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the reinterpret casts with memcpys. The standard seems to actually not define any behaviour for the reinterpret casts when used for non similar types, so this should be more standard conform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think that's why I had it on my mind that it's possibly safer
src/engine/ResultTable.h
Outdated
@@ -21,7 +21,7 @@ class ResultTable { | |||
public: | |||
enum Status { FINISHED = 0, OTHER = 1 }; | |||
|
|||
enum class ResultType { KB, VERBATIM, TEXT }; | |||
enum class ResultType { KB, VERBATIM, TEXT, FLOAT, STRING }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add comments detailing what each type is (e.g. difference between TEXT
, STRING
, KB
and VERBATIM
)
src/util/Conversions.h
Outdated
@@ -48,6 +49,9 @@ inline string convertFloatToIndexWord(const string& value, | |||
//! Converts like this: "PP0*2E0*1234 to "12.34 and M-0*1E9*876 to -0.123". | |||
inline string convertIndexWordToFloat(const string& indexWord); | |||
|
|||
//! Converts like this: "PP0*2E0*1234 to "12.34 and M-0*1E9*876 to -0.123". | |||
inline float convertIndexWordToFloatValue(const string& indexWord); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is there a string convertIndexWordToFloat(const string& indexWord);
wit the same comment but returning a std::string
? If I'm seeing this right the other one is just for displaying as a string. So for the sake of better names I propose dropping the suffix …Value
from this one and adding …String
to the other one or alternatively I think we could drop the other one entirely and rely on later conversion to string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originaly there was only the one converting the index word to float, which is used when generating the json output. As SUM and AVG need to interpret floats I added the convert to float value version. I renamed them to convertIndexWordToFloatString and convertIndexWordToFloat.
test/CMakeLists.txt
Outdated
add_library(tests | ||
SparqlParserTest | ||
StringUtilsTest | ||
StringUtilsTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this indent seems wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I accidentally mixed tabs and spaces.
test/CMakeLists.txt
Outdated
SparsehashTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and this one too
test/SparqlParserTest.cpp
Outdated
ASSERT_EQ(1u, pq._aliases.size()); | ||
ASSERT_EQ("?a", pq._aliases[0]._inVarName); | ||
ASSERT_EQ("?count", pq._aliases[0]._outVarName); | ||
ASSERT_EQ(true, pq._aliases[0]._isAggregate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use ASSERT_TRUE(
I can remove the last merge commit before merging this pull request. i just didn't want to do a rebase and force push due to the review, as that can break the association between comments and code. |
src/engine/GroupBy.cpp
Outdated
if (a._distinct) { | ||
for (size_t i = blockStart; i <= blockEnd; i++) { | ||
const auto it = distinctHashSet.find((*input)[i][a._inCol]); | ||
if (it == distinctHashSet.end()) { | ||
distinctHashSet.insert((*input)[i][a._inCol]); | ||
res += *reinterpret_cast<const float*>(&(*input)[i][a._inCol]); | ||
std::memcpy(&tmpF, &(*input)[i][a._inCol], sizeof(float)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to first derefence and then take the address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input could be a pointer to a vector of vectors, or a pointer to a vector of arrays so the dereferencing is required to access the proper elment, of which we then need the address. Adding more parantheses could make the intended effect of the code clearer, might also make it harder to read though.
KB, | ||
// An unsigned integer (size_t) | ||
VERBATIM, | ||
// An entry in the text index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/entry/offset/gc ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see what the comment refers to. Maybe something was removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment "An entry in the text index" but I think we are really storing a byte offset
// A 32 bit float, stored in the first 4 bytes of the entry. The last four | ||
// bytes have to be zero. | ||
FLOAT, | ||
// An entry in the ResultTable _localVocab |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about LOCAL_VOCAB
I find STRING
a bit too ambiguous when there is TEXT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a better name, I changed it in the latest commit.
@@ -272,7 +272,7 @@ string convertFloatToIndexWord(const string& orig, size_t nofExponentDigits, | |||
} | |||
|
|||
// _____________________________________________________________________________ | |||
string convertIndexWordToFloat(const string& indexWord) { | |||
string convertIndexWordToFloatString(const string& indexWord) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two functions still share a lot of code, maybe this one should really just call convertIndexWordToFloatValue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the convertIndexWordToFloatString implementation with a call to std::to_string(convertIndexWordToFloat(indexWord));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While testing the new implementation I ran into the problem, that the conversion using the float value suffers from small precision errors, which currently breaks the unit tests. This could be especially problematic with integers, which are internally stored as floats, as it might not be possible to represent the integer exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How, where and why do we store integers in floats? I'm a little confused, I thought this part is for stuff marked ^^xsd:float
or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't know, I really don't like the code duplication but I also don't like it holding up this PR. So I will merge now and then add an issue for the code duplication
The previous Travis failure seems to have just been a fluke, somehow Travis's build container couldn't reach Canonical's repository (I sincerely hope they actually run their own mirror). A simple rerun fixed it. I have also added a couple more comments. |
Added support for GroupBy statements and aggregate aliases.