Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Starting 'Select *' implementation #546

Merged
merged 14 commits into from Jan 27, 2022
Merged

Conversation

joaomarques90
Copy link
Contributor

No description provided.

reintroducing '_re_string.size() > 0'
Restored code,
Implementing 'variant<_selectedVariables,Asterisk>' throughout the files
e2e -> accept whitespace in vars-paths

TODO: test/SparqlParserTest
updating to the '_varsOrAsterisk' variant type
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this implementations,
We have found 2 bugs together and have discussed how the SelectedVarsOrAsterisk should work to make the code more readable.

Please add some unit tests for the SparqlParser and some e2e tests via the scientists.quieries.yaml in the e2e directory.

src/engine/QueryExecutionTree.cpp Outdated Show resolved Hide resolved
src/engine/QueryExecutionTree.cpp Outdated Show resolved Hide resolved
src/engine/QueryExecutionTree.cpp Outdated Show resolved Hide resolved
src/engine/QueryExecutionTree.cpp Outdated Show resolved Hide resolved
src/parser/ParsedQuery.h Show resolved Hide resolved
src/engine/QueryPlanner.cpp Show resolved Hide resolved
query.hasSelectClause()
? query.selectClause()._selectedVariables
: std::vector<std::string>{"?subject", "?predicate", "?object"};
query.hasSelectClause() ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong,
even with the * you have to put all the variables
in the ["selected"] field, and they have to be in the correct order
(as by the column indices).

@@ -162,9 +164,12 @@ void SparqlParser::parseSelect(ParsedQuery* query) {
if (_lexer.accept("reduced")) {
selectClause._reduced = true;
}
if (_lexer.accept("*")) {
selectClause._varsOrAsterisk = '*';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selectClause._varsOrAsterisk.setAsterisk()

ASSERT_EQ(1u, selectClause._selectedVariables.size());
ASSERT_EQ("?movie", selectClause._selectedVariables[0]);
ASSERT_EQ(1u, std::get<ParsedQuery::_selectedVariables>(selectClause._varsOrAsterisk).size());
ASSERT_EQ("?movie", std::get<ParsedQuery::_selectedVariables>(selectClause._varsOrAsterisk)[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also implement some tests for queries with SELECT *.

@joaomarques90
Copy link
Contributor Author

Branch renamed "select_asterisk"

@joaomarques90 joaomarques90 restored the latest branch January 21, 2022 15:47
@joaomarques90 joaomarques90 reopened this Jan 21, 2022
TODO:

Unit tests for SparqlParserTest
e2e tests
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these improvements. It is already much more readable, and it almost works.

  1. The QleverJson format also needs the variable field to be set to the explicit variables in the correct order and not to "*".
  2. I discussed with @hannahbast that we want the variables in a SELECT * query to be in the order of their first appearance in the query. Since doing this exactly again would need to intrusively change the parsing of the complete query, and since this is not required by the standard but just a convenience feature, you can implement this as follows:
  • In the SelectedVarsOrAsterisk class, store the complete text of the query.
  • As soon as you resolve the asterisk to concrete variables (you know all the variables then) you can just search for the first occurence of each variable in the query and sort by this. It is ok, if the order is wrong in the presence of subqueries, e.g.
SELECT * WHERE {
  { SELECT ?a WHERE {
     ?a <pred> ?c
 } }
 ?a <pred> ?b.
?a <pred2> ?c
}

We get the variable order ?a ?c ?b although the first occurence of ?c is in the inner suquery space, and ?a ?b ?c would be expected.

  • As soon as there's a GROUP BY in the query, SELECT * is illegal. Make sure to implement this case, and also write a test, that checks, that the parser throws a readable error in this case.

@@ -91,22 +92,29 @@ void QueryExecutionTree::setVariableColumns(
// _____________________________________________________________________________
template <QueryExecutionTree::ExportSubFormat format>
ad_utility::stream_generator::stream_generator
QueryExecutionTree::generateResults(const vector<string>& selectVars,
QueryExecutionTree::generateResults(const SelectedVarsOrAsterisk & selectVarsOrAsterisk,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the type is now called SelectedVarsOrAsterisk
the corresponding variables should also be called selectedVarsOrAsterisk
(vs. selectVarsOrAsterisk)

if (getVariableColumns().contains(var)) {
auto columnIndex = getVariableColumns().at(var);
if(selectVarsOrAsterisk.isAsterisk()) {
for(const auto& elem: getVariableColumns()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a further improvement:
the type of elem here is a pair of (std::string, Index) (the index is an unsigned integer).
The string is the Variable (the keys of the hash map).

So you can further Improve this to
for (const auto& [variable, columnIndex]) {.... }

here variable is what previously was elem.first and columnIndex is variable.second.

This is a C++17 feature called "structured binding" in case you want to research it.

var, columnIndex, resultTable.getResultType(columnIndex)});
} else {
exportColumns.emplace_back(std::nullopt);
elem.first, elem.second, resultTable.getResultType(elem.second)});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you sort here by the variable (currently "elem.first", then the output order becomes
deterministic for the asterisk (easiest way for the e2e tests).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course it should then be commented, that these deterministic orders are not required by the standard.


if(selectVarsOrAsterisk.isAsterisk()) {
vector<string> vars_names;
for(auto const& varName_index: getVariableColumns()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Also consider the structured binding here.

  2. This only works, if getVariableColumns() (a hash map) alsways gives us the results in the same order. This should "typically" work, but in general you should either sort them consistently, or write exactly the variables which you have extracted here to the selectedVarsOrAsterisk

}
if (!usePatternTrick) {
continue;
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented out?
it should work as expected.
You can then immediately assign
the selectedVariables to a variable
const auto& selectedVariables = ....getSelectVariables() and use this below to make the code
much more readable.

if(query.selectClause()._varsOrAsterisk.isAsterisk()) {
string str;
str+=(query.selectClause()._varsOrAsterisk.getAsterisk());
j["selected"] = std::vector<std::string>{str};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you also need the variable names in the (correct!) order,

and not only the asterisk.

[[nodiscard]] SelectVarsOrAsterisk get() const {
// return clone();
return SelectVarsOrAsterisk{_varsOrAsterisk};
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need this functionality,
You should simply use the copy constructor, which
I think in this case even exists implicitly.
(you don't need "clone" or "get" for this)

src/parser/ParsedQuery.h Outdated Show resolved Hide resolved
Comment on lines 349 to 351
[[nodiscard]] char getAsterisk() const {
return std::get<char>(_varsOrAsterisk);;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want this method.
Because "isAsterisk" means "export all Variables"
and we don't really need the * which is an implementation detail.
In the one method were you print it, you can just hardcode the *

struct SelectClause {
vector<string> _selectedVariables;
// Aliases will be empty if Selector '*' is set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More explicit comment:

// `_aliases` will be empty if `_varsOrAsterisk.isAsterisk()` is true. This means that there was a
// `SELECT *` clause in the query.

TODO:

add Unit Tests for SparqlParserTest/e2e
Updated _variablesOrder type from List to Set,
Reviewed asString() function in case of Selector asterisk

TODO:

add Unit Tests for SparqlParserTest/e2e
De-factor from set to list (by input & unique)
Result_Head_Vars are now ordered

Should the results_bindings be in the same order?

TODO:

add Unit Tests for SparqlParserTest/e2e
removed backup code as well
added comment for 'nlohmann' object for having  object keys alphabetically sorted
@joaomarques90 joaomarques90 marked this pull request as ready for review January 27, 2022 13:19
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is getting better.
I have formulated some improved comments and names
and a few suggestions for types.

validIndices.emplace_back(pair<size_t, ResultTable::ResultType>(
elem.second, resultTable->getResultType(elem.second)));
if(selectedVarsOrAsterisk.isAsterisk()) {
list<string> orderedVariables = selectedVarsOrAsterisk.retrieveOrder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call them orderedVariablesFromQuery instead of orderedVariables

if(selectedVarsOrAsterisk.isAsterisk()) {
list<string> orderedVariables = selectedVarsOrAsterisk.retrieveOrder();
auto allVars = getVariableColumns();
for (const auto& var : orderedVariables) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var -> variableFromQuery

elem.second, resultTable->getResultType(elem.second)));
if(selectedVarsOrAsterisk.isAsterisk()) {
list<string> orderedVariables = selectedVarsOrAsterisk.retrieveOrder();
auto allVars = getVariableColumns();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allVars -> variablesFromExecutionTree

}
}
for(const auto& var : allVars){
LOG(DEBUG) << "Variable " << var.first << " was not parsed!! \n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG(WARN) << "The variable \"" << var.first << "\" was found in the execution tree, but not in the original query. This is likely a bug\n";

Comment on lines 162 to 171
if(selectedVarsOrAsterisk.isAsterisk()) {
for(auto var: selectedVarsOrAsterisk.retrieveOrder()){
if (getVariableColumns().contains(var)) {
auto columnIndex = getVariableColumns().at(var);
exportColumns.push_back(VariableAndColumnIndex{
var, columnIndex, resultTable.getResultType(columnIndex)});
} else {
exportColumns.emplace_back(std::nullopt);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also use your code from above here (including the warning etc.)
We have another PR by someone else upcoming, that will soon remove this duplication.

@@ -357,11 +359,29 @@ class ParsedQuery {
[[nodiscard]] auto& getSelectVariables() {
return std::get<std::vector<string>>(_varsOrAsterisk);
}

void pushVariablesOrder(const string& variable) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this function addVariableFromQueryBody and document it.

// Add a variable, that was found in the query body. The added variables will only be used if `isAsterisk` is true.

src/parser/ParsedQuery.h Outdated Show resolved Hide resolved
Comment on lines 372 to 374
[[nodiscard]] std::list<string>& retrieveOrder() {
return _variablesOrder;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should delete the non-const overload. If someone modifies the variables from outside,
this class is not valid anymore.

return _variablesOrder;
}

[[nodiscard]] std::list<string> retrieveOrder() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this orderedVariablesFromQueryBody and comment

// Return the variables that were added by the addVariableFromQueryBody` function. The return value does not contain duplicates.

Comment on lines 383 to 384
// `_aliases` will be empty if Selector '*' is present.
// Essentially, there is a `SELECT *` clause in the query.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// This means, that there is a .......
(instead of the "Essentially")

Thank you for your comments and help @joka921 !
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are almost there.

validIndices.emplace_back(pair<size_t, ResultTable::ResultType>(
it->second, resultTable->getResultType(it->second)));
allVars.erase(it);
variablesFromExecutionTree.erase(it);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This else branch should also warn:
OG(WARN) << "The variable "" << variableFromQuery <<
"" was found in the original, query, but not in the execution tree. "
"This is likely a bug\n";

exportColumns.push_back(VariableAndColumnIndex{
var, columnIndex, resultTable.getResultType(columnIndex)});
variableFromQuery, columnIndex, resultTable.getResultType(columnIndex)});
variablesFromExecutionTree.erase(variableFromQuery);
} else {
exportColumns.emplace_back(std::nullopt);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add the complementary warning here.

@@ -41,7 +41,7 @@ string ParsedQuery::asString() const {
os << "\nSELECT: {\n\t";
if(usesAsterisk) {
auto list = selectClause._varsOrAsterisk.retrieveOrder();
std::list<string>::iterator it;
std::vector<string>::iterator it;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still auto

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm sorry, I think one of my comments got lost here).

void setAsterisk() {
_varsOrAsterisk = '*';
}
std::vector<string> _variablesOrder;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this member _variablesFromQueryBody

}

[[nodiscard]] std::list<string> retrieveOrder() const {
[[nodiscard]] auto retrieveOrder() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here one of my comments disappeared:
Call this method orderedVariablesFromQueryBody and comment it:
// get the variables which addVariableFromQueryBody` was previously called. The result contains no duplicates and is ordered by the first appearance in the query body.

@joka921 joka921 merged commit d9d162b into ad-freiburg:master Jan 27, 2022
@hannahbast hannahbast mentioned this pull request Feb 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants