Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode_or_json_validator should assume literals are unicode rather than json #1967

Merged
merged 3 commits into from
Nov 3, 2014
Merged

unicode_or_json_validator should assume literals are unicode rather than json #1967

merged 3 commits into from
Nov 3, 2014

Conversation

aliceh75
Copy link
Contributor

@aliceh75 aliceh75 commented Oct 9, 2014

As per #1966:

When using the recline grid, attempting a full text search with an integer causes the grid to hang forever.

This is what happens:

  1. The full text search is sent to the datastore API using the q parameter, which is set to (say) 1234;
  2. q is validated using ckanext.datastore.logic.schema.unicode_or_json_validator;
  3. In turns this calls json_validator which does the validation by attempting to parse the string as json. This uses json.loads. This implementation considers literals to be valid json [*], so u'1234' becomes int(1234)
  4. Later in the process, ckanext.datastore.plugin.DatastorePlugin.datastore_validate, checks whether q is a dict or a string. It doesn't allow other types - so the search fails on validation.

The fact that unicode_or_json_validator accepts literals has other implications - for instance a double quoted string "hello" would be accepted, and returned without the quotes. So I suggest the best option is to change unicode_or_json_validator to not accept JSON literals.

I will create a PR for this.

[*] RFC 4267 says that a JSON document should contain an array or a string, however the ECMA variation of the standard does not impose this restriction. In practice most implementations accept litterals.

@wardi
Copy link
Contributor

wardi commented Oct 9, 2014

@vitorbaptista what's unicode_or_json_validator used for? If you're expecting JSON wouldn't it be better to always pass JSON? i.e. add a separate qjson parameter to datastore_search.

Another option might be to only accept actual dicts passed to the API (although that won't work for web forms).

@aliceh75 won't your change make it impossible to pass strings that happen to look like JSON objects?

@aliceh75
Copy link
Contributor Author

aliceh75 commented Oct 9, 2014

@wardi : The current code already makes it impossible to search for text that looks like JSON objects - but if you means that my fix does not go far enough, you are correct. Someone might genuinely want to search for "{"a":"value"}" (after all, someone might upload a document that contains JSON data).

The fix for that problem is a larger one though - as you say, having a separate qjson is the only way to fully address the problem.

(As far as I understand, the reason q accepts both text strings and json dicts is that is used for full text search on the _full_text field (when using a string) or on a given field (when using a json dict))

@wardi
Copy link
Contributor

wardi commented Oct 9, 2014

So this change means that instead of the current situation of using "42" (with the quotes) to search for the number 42 as a string, you can just search for 42. So you only need to add double quotes around things that look like JSON objects, e.g. "{}" to search for the string {}.

That is an improvement.

@wardi
Copy link
Contributor

wardi commented Oct 9, 2014

Both unicode_or_json_validator and json_validator need some docstrings that explain the input they accept, reject and how they convert data.

json_validator is also strange because it will pass through dicts, lists and None, but not other JSON types like bool and float (although it lets you create them) That doesn't seem intentional.

@vitorbaptista
Copy link
Contributor

@wardi The intent was to add support for full-text search on a specific field, while keeping the code backwards-compatible. @aliceh75 found a case where the code isn't backwards compatible, so this is a bug IMO.

The documentation for this specific attribute is in https://github.com/ckan/ckan/blob/master/ckanext/datastore/logic/action.py#L277-L280.

:param q: full text query. If it's a string, it'll search on all fields on
         each row. If it's a dictionary as {"key1": "a", "key2": "b"},
         it'll search on each specific field (optional)

:type q: string or dictionary

@aliceh75
Copy link
Contributor Author

@wardi: I've added docstrings.

@wardi
Copy link
Contributor

wardi commented Oct 15, 2014

@aliceh75 thanks, those look good. The last thing I'll ask for is a few new tests in https://github.com/ckan/ckan/blob/master/ckanext/datastore/tests/test_interface.py so that we know this will keep working in the future.

@wardi
Copy link
Contributor

wardi commented Oct 15, 2014

@amercader amercader added this to the CKAN 2.3 milestone Oct 22, 2014
@aliceh75
Copy link
Contributor Author

@wardi : I wrote some tests, and in so-doing realized there was another issue: columns that contains numbers only (ie. type int or float) do not get added to the full text index.

While this is a separate issue, I think it makes sense to fix both together - otherwise this PR would merely be adding FTS support for integers within text strings, which sounds like an unfinished feature.

The issue in question is fairly easy to fix. It is caused by _to_full_text in db.py which explicitely only includes json dicts and strings in the full text representation. See the code at : https://github.com/ckan/ckan/blob/master/ckanext/datastore/db.py#L751

The commit that introduced this code does not offer any rational for excluding numeric fields from the full text search. I think they should be re-introduced, and I think it should be done as part of this PR. Let me know if you agree, and I will do so (and ensure my tests cover both cases).

@aliceh75
Copy link
Contributor Author

In the meantime I've updated the tests to only test for what actually works: integers/numbers within strings.

@aliceh75
Copy link
Contributor Author

@wardi : Any thoughts on the issue of non-text fields not being added to the full text field? I'm keen to write a fix for this, but I would like to know if I should do it as part of this PR or submit a separate PR. Thanks!

@wardi
Copy link
Contributor

wardi commented Oct 30, 2014

@aliceh75 yes, sorry for my slow reply. If you'd like to fix both issues as part of the same PR please go ahead. I agree that number columns are useful to have as part of the full text search.

@aliceh75
Copy link
Contributor Author

@wardi Done. As not all PostgreSQL field types make sense in a full text search, I decided to limit it to:

int8, int4, int2, float4, float8, date, time, timetz, timestamp, numeric, text

And I have added the following tests:

  • integer within text string
  • integer in integer column
  • float within text string
  • float in float column
  • date in date column

I noticed that some tests were failing depending on PostgreSQL version because they relied on the exact implementation of to_tsvector. The point of the test is not to test to_tsvector, so I've changed them not to test the exact format of the full text column, but only to check that expected lexemes are included in the full text column.

wardi added a commit that referenced this pull request Nov 3, 2014
…er-fails

unicode_or_json_validator should assume literals are unicode rather than json
@wardi wardi merged commit cf68e74 into ckan:master Nov 3, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants