Grouping #63

floriankramer · 2018-06-13T16:04:12Z

Added support for GroupBy statements and aggregate aliases.

niklas88 · 2018-06-14T09:41:46Z

Can you add documentation (e.g. in README.md) explaining the constraints on the current implementation of these features? Also, I think some commits like "Fixed testGroupByAndAlias" can be squashed.

niklas88

LGTM with questions and nitpicks

niklas88 · 2018-06-15T08:38:01Z

README.md

+    GROUP BY ?profession
+    ORDER BY ?avg
+
+Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggreagates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`.


Just a question: when using MAX or MIN, is there a way to get other attributes from the row that has the max/min? For example, when dealing with Freebase's dated integers, MAX over the date (does MAX use sort order?) would give you the latest date of the entry, but is there a way to get the integer at that date, or does that need full subquery support including LIMIT and ORDER BY?

MAX and MIN don't interpret knowledge base ids in the current implementation. For ordering, the numerical order of the id is used.
As far as I understand the SPARQL standard, after the GROUP BY only basic arithmetic, HAVING and ORDER BY can affect the result. I don't know how Freebase's dated integers are stored, so I'm not certain whether full subquery support would be needed, but it does seem likely.
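
To illustrate the reply above, here is a minimal sketch of a MAX aggregate that compares ids purely by their numerical value, without interpreting what they stand for. The `Id` alias, the pair-based row layout and the function name `maxPerGroup` are assumptions made for this example and are not taken from the PR; the sketch also assumes the input is already sorted by the group column.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical id type; each row is (group id, value id).
using Id = std::uint64_t;
using Row = std::pair<Id, Id>;

// Computes one (group, max value) row per group. The input must be sorted by
// the group id. Values are compared by the numerical order of their ids.
inline std::vector<Row> maxPerGroup(const std::vector<Row>& sortedRows) {
  std::vector<Row> result;
  for (const Row& row : sortedRows) {
    if (result.empty() || result.back().first != row.first) {
      // A new group starts here.
      result.push_back(row);
    } else {
      // Same group: keep the numerically larger id.
      result.back().second = std::max(result.back().second, row.second);
    }
  }
  return result;
}
```

Because only the id is compared, the result is the "largest" entry only if the id order happens to match the intended value order.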

niklas88 · 2018-06-15T08:41:44Z

README.md

+    ORDER BY ?avg
+
+Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggreagates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`.
+Group concat also supports a custom separator: `(GROUP_CONCAT(?a ; separator=" ; ") as ?concat)`. Xsd types float, decimal and integer are recognized as numbers, other types or unbound variables (e.g. no entries for an optional part) in one of the aggregates that need to interpret the variable (e.g. AVG) lead to either no result or nan. MAX with an unbound variable will always return the unbound variable.


This sounds like it has potential for breaking some output formats if they lack escaping. We should check that at least JSON uses proper escaping.

Some basic escaping is applied, and everything is contained within a string, as quotation marks are not allowed within the separator string (at least not in the current implementation).
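
For reference, a minimal sketch of the kind of escaping a JSON string value needs, so that a separator or concatenated value cannot break the output. The helper name `escapeForJson` is hypothetical and this is not the escaping routine used in the PR; it only covers quotation marks, backslashes and control characters.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: escapes a string so it can be placed between the
// quotation marks of a JSON string value.
inline std::string escapeForJson(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\r': out += "\\r";  break;
      case '\t': out += "\\t";  break;
      default:
        if (c < 0x20) {
          // Remaining control characters are written as \u00XX.
          char buf[7];
          std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
          out += buf;
        } else {
          out += static_cast<char>(c);
        }
    }
  }
  return out;
}
```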

niklas88 · 2018-06-15T08:45:18Z

src/engine/Engine.h

+   * @param hasRelation A mapping from entity ids to sets of relations
+   * @param patterns A mapping from pattern ids to patterns
+   * @param subjectColumn The column containing the entities for which the
+   *                      relations should be counted.


Documentation isn't appreciated enough, so let me say: thanks for documenting!

niklas88 · 2018-06-15T08:47:05Z

src/engine/GroupBy.h

+  struct Aggregate {
+    AggregateType _type;
+    size_t _inCol, _outCol;
+    // Used to store the string necessary for the group concat aggregate.


Maybe also say why this is void* instead of std::string

niklas88 · 2018-06-15T08:53:51Z

src/engine/GroupBy.cpp

+      _aliases.push_back(a);
+    }
+  }
+  std::sort(_aliases.begin(), _aliases.end(),


Add a comment why we sort here

niklas88 · 2018-06-15T09:32:33Z

src/engine/QueryExecutionTree.h

@@ -180,6 +194,19 @@ class QueryExecutionTree {
                    row[validIndices[validIndices.size() - 1].first]))
             << "\"]";
          break;
+        case ResultTable::ResultType::FLOAT: {
+          float f;
+          std::memcpy(&f, &row[validIndices[validIndices.size() - 1].first],


I think at another point something similar used a reinterpret_cast. I think this variant here is cleaner and safer.

I replaced the reinterpret_casts with memcpys. The standard does not actually seem to define any behaviour for reinterpret_cast when used between dissimilar types, so this should be more standard-conforming.

Yes I think that's why I had it on my mind that it's possibly safer
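
A minimal sketch of the memcpy pattern discussed in this thread, assuming an 8-byte entry type. The `Id` alias and the helper names `floatToId` and `idToFloat` are made up for illustration and do not appear in the diff.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the 8-byte entry type of a result row.
using Id = std::uint64_t;

static_assert(sizeof(float) <= sizeof(Id), "a float must fit into one entry");

// Stores a float in the first sizeof(float) bytes of an otherwise zeroed entry.
inline Id floatToId(float f) {
  Id id = 0;
  std::memcpy(&id, &f, sizeof(float));
  return id;
}

// Reads the float back. Copying the bytes avoids accessing an Id object
// through a float pointer, which the aliasing rules do not allow.
inline float idToFloat(Id id) {
  float f;
  std::memcpy(&f, &id, sizeof(float));
  return f;
}
```

Compilers typically turn such a fixed-size memcpy into a plain register move, so the safer variant should not cost anything at runtime.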

niklas88 · 2018-06-15T09:36:32Z

src/engine/ResultTable.h

@@ -21,7 +21,7 @@ class ResultTable {
 public:
  enum Status { FINISHED = 0, OTHER = 1 };

-  enum class ResultType { KB, VERBATIM, TEXT };
+  enum class ResultType { KB, VERBATIM, TEXT, FLOAT, STRING };


Can you add comments detailing what each type is (e.g. the difference between TEXT, STRING, KB and VERBATIM)?

niklas88 · 2018-06-15T09:41:32Z

src/util/Conversions.h

@@ -48,6 +49,9 @@ inline string convertFloatToIndexWord(const string& value,
 //! Converts like this: "PP0*2E0*1234 to "12.34 and M-0*1E9*876 to -0.123".
 inline string convertIndexWordToFloat(const string& indexWord);

+//! Converts like this: "PP0*2E0*1234 to "12.34 and M-0*1E9*876 to -0.123".
+inline float convertIndexWordToFloatValue(const string& indexWord);


Why is there a string convertIndexWordToFloat(const string& indexWord); with the same comment but returning a std::string? If I'm seeing this right, the other one is just for displaying as a string. So for the sake of better names I propose dropping the suffix …Value from this one and adding …String to the other one. Alternatively, I think we could drop the other one entirely and rely on later conversion to string.

Originally there was only the one converting the index word to a float string, which is used when generating the JSON output. As SUM and AVG need to interpret floats, I added the version that converts to a float value. I renamed them to convertIndexWordToFloatString and convertIndexWordToFloat.

niklas88 · 2018-06-15T09:52:40Z

test/CMakeLists.txt

 add_library(tests
            SparqlParserTest
-			StringUtilsTest
+      			StringUtilsTest


this indent seems wrong

I accidentally mixed tabs and spaces.

niklas88 · 2018-06-15T09:52:54Z

test/CMakeLists.txt

and this one too

niklas88 · 2018-06-15T10:29:17Z

test/SparqlParserTest.cpp

+  ASSERT_EQ(1u, pq._aliases.size());
+  ASSERT_EQ("?a", pq._aliases[0]._inVarName);
+  ASSERT_EQ("?count", pq._aliases[0]._outVarName);
+  ASSERT_EQ(true, pq._aliases[0]._isAggregate);


Use ASSERT_TRUE(

floriankramer · 2018-06-17T11:14:56Z

I can remove the last merge commit before merging this pull request. I just didn't want to do a rebase and force push during the review, as that can break the association between comments and code.

niklas88 · 2018-06-18T10:09:37Z

src/engine/GroupBy.cpp

        if (a._distinct) {
          for (size_t i = blockStart; i <= blockEnd; i++) {
            const auto it = distinctHashSet.find((*input)[i][a._inCol]);
            if (it == distinctHashSet.end()) {
              distinctHashSet.insert((*input)[i][a._inCol]);
-              res += *reinterpret_cast<const float*>(&(*input)[i][a._inCol]);
+              std::memcpy(&tmpF, &(*input)[i][a._inCol], sizeof(float));


Do we need to first dereference and then take the address?

input could be a pointer to a vector of vectors or a pointer to a vector of arrays, so the dereferencing is required to access the proper element, of which we then need the address. Adding more parentheses could make the intended effect of the code clearer, but might also make it harder to read.
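
The following sketch spells out the expression with the extra parentheses mentioned above. The function template `readFloat` and the element type `std::uint64_t` are assumptions for illustration; only the shape of the expression `&(*input)[i][col]` is taken from the diff.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Works for a pointer to a vector of vectors as well as for a pointer to a
// vector of arrays, since both support input[row][col] after dereferencing.
template <typename Table>
float readFloat(const Table* input, std::size_t row, std::size_t col) {
  float f;
  // (*input)       dereferences the pointer to get the table,
  // [row][col]     selects the entry inside the table,
  // &( ... )       takes the address of that entry for memcpy.
  std::memcpy(&f, &((*input)[row][col]), sizeof(float));
  return f;
}
```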

niklas88 · 2018-06-18T10:12:25Z

src/engine/ResultTable.h

+    KB,
+    // An unsigned integer (size_t)
+    VERBATIM,
+    // An entry in the text index


s/entry/offset/gc ?

I don't see what the comment refers to. Maybe something was removed?

I mean the comment "An entry in the text index"; I think we are really storing a byte offset.

niklas88 · 2018-06-18T10:13:26Z

src/engine/ResultTable.h

+    // A 32 bit float, stored in the first 4 bytes of the entry. The last four
+    // bytes have to be zero.
+    FLOAT,
+    // An entry in the ResultTable _localVocab


How about LOCAL_VOCAB? I find STRING a bit too ambiguous when there is also TEXT.

Definitely a better name, I changed it in the latest commit.

niklas88 · 2018-06-18T10:14:16Z

src/util/Conversions.h

@@ -272,7 +272,7 @@ string convertFloatToIndexWord(const string& orig, size_t nofExponentDigits,
 }

 // _____________________________________________________________________________
-string convertIndexWordToFloat(const string& indexWord) {
+string convertIndexWordToFloatString(const string& indexWord) {


These two functions still share a lot of code, maybe this one should really just call convertIndexWordToFloatValue?

I replaced the convertIndexWordToFloatString implementation with a call to std::to_string(convertIndexWordToFloat(indexWord));

While testing the new implementation I ran into the problem that the conversion via the float value suffers from small precision errors, which currently breaks the unit tests. This could be especially problematic with integers, which are internally stored as floats, as it might not be possible to represent the integer exactly.

How, where and why do we store integers in floats? I'm a little confused, I thought this part is for stuff marked ^^xsd:float or similar?

https://github.com/ad-freiburg/QLever/blob/ee89fdc51b6eef788265e6524c2d550aaa96cdd2/src/util/Conversions.h#L114-L122

Hmm, I don't know, I really don't like the code duplication but I also don't like it holding up this PR. So I will merge now and then add an issue for the code duplication
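
The precision concern raised in this thread can be reproduced in isolation. The sketch below checks whether an integer survives a round trip through a 32-bit float; the function name `survivesFloatRoundTrip` is hypothetical and the code is independent of the conversion functions in Conversions.h.

```cpp
#include <cstdint>

// Returns true if the integer can be represented exactly as a 32-bit float.
// A float has a 24-bit significand, so not every integer above 2^24 fits.
inline bool survivesFloatRoundTrip(std::uint32_t value) {
  const float asFloat = static_cast<float>(value);
  // value is at most 2^32 - 1, so asFloat is at most 2^32 and always fits
  // into an int64_t.
  return static_cast<std::int64_t>(asFloat) == static_cast<std::int64_t>(value);
}
```

For example, 16777216 (2^24) is preserved, while 16777217 is rounded to 16777216 when stored as a float.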

niklas88 · 2018-06-18T10:16:59Z

The previous Travis failure seems to have just been a fluke, somehow Travis's build container couldn't reach Canonical's repository (I sincerely hope they actually run their own mirror). A simple rerun fixed it.

I have also added a couple more comments.

floriankramer added 4 commits June 6, 2018 17:57

added basic group by

d03a008

Implemented average for group by.

69ef94b

Added all aggregates but GROUP_CONCAT.

effef19

Implemented GROUP_CONCAT aggregate

eaa686d

floriankramer requested a review from niklas88 June 13, 2018 16:04

floriankramer added 5 commits June 14, 2018 14:59

Added unit tests to GroupBy.

9534223

Added 'decimal' to the accepted xsd number types

65e6482

Removed old todos, reformatted code, added comments

92ff3fd

Added support for distinct aggregates.

72da0cd

Added support for leading plus signs in xsd floats

3bbf812

floriankramer force-pushed the grouping branch from b8838a5 to 92ff3fd Compare June 14, 2018 13:02

Added documentation for group by to the readme

f6bf974

niklas88 self-assigned this Jun 15, 2018

niklas88 approved these changes Jun 15, 2018


niklas88 reviewed Jun 15, 2018


niklas88 mentioned this pull request Jun 15, 2018

Fix warning by using ASSERT_FALSE() #66

Merged

floriankramer added 2 commits June 17, 2018 12:34

Addressed pull request review

be4ac34

Merge branch 'master' into grouping and applied clang formatting

90c7b10

niklas88 reviewed Jun 18, 2018


niklas88 merged commit 8dd5aba into ad-freiburg:master Jun 26, 2018
