New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Starting 'Select *' implementation #546
Conversation
reintroducing '_re_string.size() > 0'
Restored code, Implementing 'variant<_selectedVariables,Asterisk>' throughout the files e2e -> accept whitespace in vars-paths TODO: test/SparqlParserTest
updating to the '_varsOrAsterisk' variant type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this implementations,
We have found 2 bugs together and have discussed how the SelectedVarsOrAsterisk
should work to make the code more readable.
Please add some unit tests for the SparqlParser and some e2e tests via the scientists.quieries.yaml
in the e2e
directory.
src/engine/Server.cpp
Outdated
query.hasSelectClause() | ||
? query.selectClause()._selectedVariables | ||
: std::vector<std::string>{"?subject", "?predicate", "?object"}; | ||
query.hasSelectClause() ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong,
even with the * you have to put all the variables
in the ["selected"] field, and they have to be in the correct order
(as by the column indices).
src/parser/SparqlParser.cpp
Outdated
@@ -162,9 +164,12 @@ void SparqlParser::parseSelect(ParsedQuery* query) { | |||
if (_lexer.accept("reduced")) { | |||
selectClause._reduced = true; | |||
} | |||
if (_lexer.accept("*")) { | |||
selectClause._varsOrAsterisk = '*'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
selectClause._varsOrAsterisk.setAsterisk()
test/SparqlParserTest.cpp
Outdated
ASSERT_EQ(1u, selectClause._selectedVariables.size()); | ||
ASSERT_EQ("?movie", selectClause._selectedVariables[0]); | ||
ASSERT_EQ(1u, std::get<ParsedQuery::_selectedVariables>(selectClause._varsOrAsterisk).size()); | ||
ASSERT_EQ("?movie", std::get<ParsedQuery::_selectedVariables>(selectClause._varsOrAsterisk)[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also implement some tests for queries with SELECT *.
Branch renamed "select_asterisk" |
TODO: Unit tests for SparqlParserTest e2e tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these improvements. It is already much more readable, and it almost works.
- The
QleverJson
format also needs the variable field to be set to the explicit variables in the correct order and not to "*". - I discussed with @hannahbast that we want the variables in a SELECT * query to be in the order of their first appearance in the query. Since doing this exactly again would need to intrusively change the parsing of the complete query, and since this is not required by the standard but just a convenience feature, you can implement this as follows:
- In the
SelectedVarsOrAsterisk
class, store the complete text of the query. - As soon as you resolve the asterisk to concrete variables (you know all the variables then) you can just search for the first occurence of each variable in the query and sort by this. It is ok, if the order is wrong in the presence of subqueries, e.g.
SELECT * WHERE {
{ SELECT ?a WHERE {
?a <pred> ?c
} }
?a <pred> ?b.
?a <pred2> ?c
}
We get the variable order ?a ?c ?b
although the first occurence of ?c is in the inner suquery space, and ?a ?b ?c
would be expected.
- As soon as there's a GROUP BY in the query, SELECT * is illegal. Make sure to implement this case, and also write a test, that checks, that the parser throws a readable error in this case.
src/engine/QueryExecutionTree.cpp
Outdated
@@ -91,22 +92,29 @@ void QueryExecutionTree::setVariableColumns( | |||
// _____________________________________________________________________________ | |||
template <QueryExecutionTree::ExportSubFormat format> | |||
ad_utility::stream_generator::stream_generator | |||
QueryExecutionTree::generateResults(const vector<string>& selectVars, | |||
QueryExecutionTree::generateResults(const SelectedVarsOrAsterisk & selectVarsOrAsterisk, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the type is now called SelectedVarsOrAsterisk
the corresponding variables should also be called selectedVarsOrAsterisk
(vs. selectVarsOrAsterisk
)
src/engine/QueryExecutionTree.cpp
Outdated
if (getVariableColumns().contains(var)) { | ||
auto columnIndex = getVariableColumns().at(var); | ||
if(selectVarsOrAsterisk.isAsterisk()) { | ||
for(const auto& elem: getVariableColumns()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a further improvement:
the type of elem
here is a pair of (std::string, Index) (the index is an unsigned integer).
The string is the Variable (the keys of the hash map).
So you can further Improve this to
for (const auto& [variable, columnIndex]) {.... }
here variable
is what previously was elem.first
and columnIndex
is variable.second.
This is a C++17 feature called "structured binding" in case you want to research it.
src/engine/QueryExecutionTree.cpp
Outdated
var, columnIndex, resultTable.getResultType(columnIndex)}); | ||
} else { | ||
exportColumns.emplace_back(std::nullopt); | ||
elem.first, elem.second, resultTable.getResultType(elem.second)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you sort here by the variable (currently "elem.first", then the output order becomes
deterministic for the asterisk (easiest way for the e2e tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course it should then be commented, that these deterministic orders are not required by the standard.
src/engine/QueryExecutionTree.cpp
Outdated
|
||
if(selectVarsOrAsterisk.isAsterisk()) { | ||
vector<string> vars_names; | ||
for(auto const& varName_index: getVariableColumns()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Also consider the structured binding here.
-
This only works, if
getVariableColumns()
(a hash map) alsways gives us the results in the same order. This should "typically" work, but in general you should either sort them consistently, or write exactly the variables which you have extracted here to theselectedVarsOrAsterisk
src/engine/QueryPlanner.cpp
Outdated
} | ||
if (!usePatternTrick) { | ||
continue; | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this commented out?
it should work as expected.
You can then immediately assign
the selectedVariables to a variable
const auto& selectedVariables = ....getSelectVariables()
and use this below to make the code
much more readable.
src/engine/Server.cpp
Outdated
if(query.selectClause()._varsOrAsterisk.isAsterisk()) { | ||
string str; | ||
str+=(query.selectClause()._varsOrAsterisk.getAsterisk()); | ||
j["selected"] = std::vector<std::string>{str}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you also need the variable names in the (correct!) order,
and not only the asterisk.
src/parser/ParsedQuery.h
Outdated
[[nodiscard]] SelectVarsOrAsterisk get() const { | ||
// return clone(); | ||
return SelectVarsOrAsterisk{_varsOrAsterisk}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you need this functionality,
You should simply use the copy constructor, which
I think in this case even exists implicitly.
(you don't need "clone" or "get" for this)
src/parser/ParsedQuery.h
Outdated
[[nodiscard]] char getAsterisk() const { | ||
return std::get<char>(_varsOrAsterisk);; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want this method.
Because "isAsterisk" means "export all Variables"
and we don't really need the *
which is an implementation detail.
In the one method were you print it, you can just hardcode the *
src/parser/ParsedQuery.h
Outdated
struct SelectClause { | ||
vector<string> _selectedVariables; | ||
// Aliases will be empty if Selector '*' is set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More explicit comment:
// `_aliases` will be empty if `_varsOrAsterisk.isAsterisk()` is true. This means that there was a
// `SELECT *` clause in the query.
TODO: add Unit Tests for SparqlParserTest/e2e
Updated _variablesOrder type from List to Set, Reviewed asString() function in case of Selector asterisk TODO: add Unit Tests for SparqlParserTest/e2e
De-factor from set to list (by input & unique) Result_Head_Vars are now ordered Should the results_bindings be in the same order? TODO: add Unit Tests for SparqlParserTest/e2e
removed backup code as well added comment for 'nlohmann' object for having object keys alphabetically sorted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is getting better.
I have formulated some improved comments and names
and a few suggestions for types.
src/engine/QueryExecutionTree.cpp
Outdated
validIndices.emplace_back(pair<size_t, ResultTable::ResultType>( | ||
elem.second, resultTable->getResultType(elem.second))); | ||
if(selectedVarsOrAsterisk.isAsterisk()) { | ||
list<string> orderedVariables = selectedVarsOrAsterisk.retrieveOrder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call them orderedVariablesFromQuery
instead of orderedVariables
src/engine/QueryExecutionTree.cpp
Outdated
if(selectedVarsOrAsterisk.isAsterisk()) { | ||
list<string> orderedVariables = selectedVarsOrAsterisk.retrieveOrder(); | ||
auto allVars = getVariableColumns(); | ||
for (const auto& var : orderedVariables) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var
-> variableFromQuery
src/engine/QueryExecutionTree.cpp
Outdated
elem.second, resultTable->getResultType(elem.second))); | ||
if(selectedVarsOrAsterisk.isAsterisk()) { | ||
list<string> orderedVariables = selectedVarsOrAsterisk.retrieveOrder(); | ||
auto allVars = getVariableColumns(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allVars
-> variablesFromExecutionTree
src/engine/QueryExecutionTree.cpp
Outdated
} | ||
} | ||
for(const auto& var : allVars){ | ||
LOG(DEBUG) << "Variable " << var.first << " was not parsed!! \n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOG(WARN) << "The variable \"" << var.first << "\" was found in the execution tree, but not in the original query. This is likely a bug\n";
src/engine/QueryExecutionTree.cpp
Outdated
if(selectedVarsOrAsterisk.isAsterisk()) { | ||
for(auto var: selectedVarsOrAsterisk.retrieveOrder()){ | ||
if (getVariableColumns().contains(var)) { | ||
auto columnIndex = getVariableColumns().at(var); | ||
exportColumns.push_back(VariableAndColumnIndex{ | ||
var, columnIndex, resultTable.getResultType(columnIndex)}); | ||
} else { | ||
exportColumns.emplace_back(std::nullopt); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also use your code from above here (including the warning etc.)
We have another PR by someone else upcoming, that will soon remove this duplication.
src/parser/ParsedQuery.h
Outdated
@@ -357,11 +359,29 @@ class ParsedQuery { | |||
[[nodiscard]] auto& getSelectVariables() { | |||
return std::get<std::vector<string>>(_varsOrAsterisk); | |||
} | |||
|
|||
void pushVariablesOrder(const string& variable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this function addVariableFromQueryBody
and document it.
// Add a variable, that was found in the query body. The added variables will only be used if `isAsterisk` is true.
src/parser/ParsedQuery.h
Outdated
[[nodiscard]] std::list<string>& retrieveOrder() { | ||
return _variablesOrder; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should delete the non-const overload. If someone modifies the variables from outside,
this class is not valid anymore.
src/parser/ParsedQuery.h
Outdated
return _variablesOrder; | ||
} | ||
|
||
[[nodiscard]] std::list<string> retrieveOrder() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this orderedVariablesFromQueryBody
and comment
// Return the variables that were added by the
addVariableFromQueryBody` function. The return value does not contain duplicates.
src/parser/ParsedQuery.h
Outdated
// `_aliases` will be empty if Selector '*' is present. | ||
// Essentially, there is a `SELECT *` clause in the query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This means, that there is a .......
(instead of the "Essentially")
Thank you for your comments and help @joka921 !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are almost there.
validIndices.emplace_back(pair<size_t, ResultTable::ResultType>( | ||
it->second, resultTable->getResultType(it->second))); | ||
allVars.erase(it); | ||
variablesFromExecutionTree.erase(it); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else branch should also warn:
OG(WARN) << "The variable "" << variableFromQuery <<
"" was found in the original, query, but not in the execution tree. "
"This is likely a bug\n";
exportColumns.push_back(VariableAndColumnIndex{ | ||
var, columnIndex, resultTable.getResultType(columnIndex)}); | ||
variableFromQuery, columnIndex, resultTable.getResultType(columnIndex)}); | ||
variablesFromExecutionTree.erase(variableFromQuery); | ||
} else { | ||
exportColumns.emplace_back(std::nullopt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add the complementary warning here.
src/parser/ParsedQuery.cpp
Outdated
@@ -41,7 +41,7 @@ string ParsedQuery::asString() const { | |||
os << "\nSELECT: {\n\t"; | |||
if(usesAsterisk) { | |||
auto list = selectClause._varsOrAsterisk.retrieveOrder(); | |||
std::list<string>::iterator it; | |||
std::vector<string>::iterator it; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still auto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm sorry, I think one of my comments got lost here).
src/parser/ParsedQuery.h
Outdated
void setAsterisk() { | ||
_varsOrAsterisk = '*'; | ||
} | ||
std::vector<string> _variablesOrder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this member _variablesFromQueryBody
src/parser/ParsedQuery.h
Outdated
} | ||
|
||
[[nodiscard]] std::list<string> retrieveOrder() const { | ||
[[nodiscard]] auto retrieveOrder() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here one of my comments disappeared:
Call this method orderedVariablesFromQueryBody
and comment it:
// get the variables which
addVariableFromQueryBody` was previously called. The result contains no duplicates and is ordered by the first appearance in the query body.
No description provided.