Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grouping #63

Merged
merged 12 commits into from Jun 26, 2018
Merged

Grouping #63

merged 12 commits into from Jun 26, 2018

Conversation

floriankramer
Copy link
Member

Added support for GroupBy statements and aggregate aliases.

@niklas88
Copy link
Member

Can you add documentation (e.g. in README.md) explaining the constraints on the current implementation of these features. Also I think some commits like Fixed testGroupByAndAlias can be squashed.

@niklas88 niklas88 self-assigned this Jun 15, 2018
Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with questions and nitpicks

GROUP BY ?profession
ORDER BY ?avg

Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggreagates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question, when using MAX or MIN is there a way to get other attributes from the left side that has the max/min. For example when dealing with Freebase's dated integers, MAX over the date (idoes MAX use sort order?) would give you the latest date of the entry but is there a way to get the integer at that date or does that need full subquery support including LIMIT and ORDER BY?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX and MIN don't interpret knowledgebase ids in the current implementation. For ordering, the numerical order of the id is used.
As far as I understand the sparql standard, after the group by only basic arithmetic, HAVING and ORDER BY can affect the result. I don't know how Freebase's dated integers are stored, so I'm not certain if full subquery support would be needed, but it does seem likely.

ORDER BY ?avg

Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggreagates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`.
Group concat also supports a custom separator: `(GROUP_CONCAT(?a ; separator=" ; ") as ?concat)`. Xsd types float, decimal and integer are recognized as numbers, other types or unbound variables (e.g. no entries for an optional part) in one of the aggregates that need to interpret the variable (e.g. AVG) lead to either no result or nan. MAX with an unbound variable will always return the unbound variable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like it has potential for breaking some output formats if they lack escaping. We should check that at least JSON uses proper escaping.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some basic escaping is applied, and everything is contained within a string, as quotation marks are not allowed within the separator string (at least not in the current implementation).

* @param hasRelation A mapping from entity ids to sets of relations
* @param patterns A mapping from pattern ids to patterns
* @param subjectColumn The column containing the entities for which the
* relations should be counted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation isn't appreciated enough so let me say. Thanks for documenting!

struct Aggregate {
AggregateType _type;
size_t _inCol, _outCol;
// Used to store the string necessary for the group concat aggregate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also say why this is void* instead of std::string

_aliases.push_back(a);
}
}
std::sort(_aliases.begin(), _aliases.end(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment why we sort here

@@ -180,6 +194,19 @@ class QueryExecutionTree {
row[validIndices[validIndices.size() - 1].first]))
<< "\"]";
break;
case ResultTable::ResultType::FLOAT: {
float f;
std::memcpy(&f, &row[validIndices[validIndices.size() - 1].first],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at another point something similar used a reinterpret_cast I think this variant here is cleaner and safer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the reinterpret casts with memcpys. The standard seems to actually not define any behaviour for the reinterpret casts when used for non similar types, so this should be more standard conform.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think that's why I had it on my mind that it's possibly safer

@@ -21,7 +21,7 @@ class ResultTable {
public:
enum Status { FINISHED = 0, OTHER = 1 };

enum class ResultType { KB, VERBATIM, TEXT };
enum class ResultType { KB, VERBATIM, TEXT, FLOAT, STRING };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments detailing what each type is (e.g. difference between TEXT, STRING, KB and VERBATIM)

@@ -48,6 +49,9 @@ inline string convertFloatToIndexWord(const string& value,
//! Converts like this: "PP0*2E0*1234 to "12.34 and M-0*1E9*876 to -0.123".
inline string convertIndexWordToFloat(const string& indexWord);

//! Converts like this: "PP0*2E0*1234 to "12.34 and M-0*1E9*876 to -0.123".
inline float convertIndexWordToFloatValue(const string& indexWord);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is there a string convertIndexWordToFloat(const string& indexWord); wit the same comment but returning a std::string? If I'm seeing this right the other one is just for displaying as a string. So for the sake of better names I propose dropping the suffix …Value from this one and adding …String to the other one or alternatively I think we could drop the other one entirely and rely on later conversion to string

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originaly there was only the one converting the index word to float, which is used when generating the json output. As SUM and AVG need to interpret floats I added the convert to float value version. I renamed them to convertIndexWordToFloatString and convertIndexWordToFloat.

add_library(tests
SparqlParserTest
StringUtilsTest
StringUtilsTest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this indent seems wrong

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accidentally mixed tabs and spaces.

SparsehashTest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this one too

ASSERT_EQ(1u, pq._aliases.size());
ASSERT_EQ("?a", pq._aliases[0]._inVarName);
ASSERT_EQ("?count", pq._aliases[0]._outVarName);
ASSERT_EQ(true, pq._aliases[0]._isAggregate);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ASSERT_TRUE(

@floriankramer
Copy link
Member Author

I can remove the last merge commit before merging this pull request. i just didn't want to do a rebase and force push due to the review, as that can break the association between comments and code.

if (a._distinct) {
for (size_t i = blockStart; i <= blockEnd; i++) {
const auto it = distinctHashSet.find((*input)[i][a._inCol]);
if (it == distinctHashSet.end()) {
distinctHashSet.insert((*input)[i][a._inCol]);
res += *reinterpret_cast<const float*>(&(*input)[i][a._inCol]);
std::memcpy(&tmpF, &(*input)[i][a._inCol], sizeof(float));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to first derefence and then take the address?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input could be a pointer to a vector of vectors, or a pointer to a vector of arrays so the dereferencing is required to access the proper elment, of which we then need the address. Adding more parantheses could make the intended effect of the code clearer, might also make it harder to read though.

KB,
// An unsigned integer (size_t)
VERBATIM,
// An entry in the text index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/entry/offset/gc ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see what the comment refers to. Maybe something was removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "An entry in the text index" but I think we are really storing a byte offset

// A 32 bit float, stored in the first 4 bytes of the entry. The last four
// bytes have to be zero.
FLOAT,
// An entry in the ResultTable _localVocab
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about LOCAL_VOCAB I find STRING a bit too ambiguous when there is TEXT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a better name, I changed it in the latest commit.

@@ -272,7 +272,7 @@ string convertFloatToIndexWord(const string& orig, size_t nofExponentDigits,
}

// _____________________________________________________________________________
string convertIndexWordToFloat(const string& indexWord) {
string convertIndexWordToFloatString(const string& indexWord) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions still share a lot of code, maybe this one should really just call convertIndexWordToFloatValue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the convertIndexWordToFloatString implementation with a call to std::to_string(convertIndexWordToFloat(indexWord));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing the new implementation I ran into the problem, that the conversion using the float value suffers from small precision errors, which currently breaks the unit tests. This could be especially problematic with integers, which are internally stored as floats, as it might not be possible to represent the integer exactly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How, where and why do we store integers in floats? I'm a little confused, I thought this part is for stuff marked ^^xsd:float or similar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@niklas88 niklas88 Jun 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't know, I really don't like the code duplication but I also don't like it holding up this PR. So I will merge now and then add an issue for the code duplication

@niklas88
Copy link
Member

niklas88 commented Jun 18, 2018

The previous Travis failure seems to have just been a fluke, somehow Travis's build container couldn't reach Canonical's repository (I sincerely hope they actually run their own mirror). A simple rerun fixed it.

I have also added a couple more comments.

@niklas88 niklas88 merged commit 8dd5aba into ad-freiburg:master Jun 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants