unicode_or_json_validator should assume literals are unicode rather than json #1967
@vitorbaptista what's unicode_or_json_validator used for? If you're expecting JSON wouldn't it be better to always pass JSON? i.e. add a separate qjson parameter to datastore_search. Another option might be to only accept actual dicts passed to the API (although that won't work for web forms). @aliceh75 won't your change make it impossible to pass strings that happen to look like JSON objects?
@wardi : The current code already makes it impossible to search for text that looks like JSON objects - but if you mean that my fix does not go far enough, you are correct. Someone might genuinely want to search for "{"a":"value"}" (after all, someone might upload a document that contains JSON data). The fix for that problem is a larger one though - as you say, having a separate (As far as I understand, the reason
So this change means that instead of the current situation of using That is an improvement.
Both json_validator is also strange because it will pass through dicts, lists and None, but not other JSON types like bool and float (although it lets you create them). That doesn't seem intentional.
@wardi The intent was to add support for full-text search on a specific field, while keeping the code backwards-compatible. @aliceh75 found a case where the code isn't backwards compatible, so this is a bug IMO. The documentation for this specific attribute is in https://github.com/ckan/ckan/blob/master/ckanext/datastore/logic/action.py#L277-L280.
@wardi: I've added docstrings.
@aliceh75 thanks, those look good. The last thing I'll ask for is a few new tests in https://github.com/ckan/ckan/blob/master/ckanext/datastore/tests/test_interface.py so that we know this will keep working in the future.
.. or maybe they should be in https://github.com/ckan/ckan/blob/master/ckanext/datastore/tests/test_search.py
@wardi : I wrote some tests, and in so doing realized there was another issue: columns that contain numbers only (i.e. type int or float) do not get added to the full text index. While this is a separate issue, I think it makes sense to fix both together - otherwise this PR would merely be adding FTS support for integers within text strings, which sounds like an unfinished feature. The issue in question is fairly easy to fix. It is caused by The commit that introduced this code does not offer any rationale for excluding numeric fields from the full text search. I think they should be re-introduced, and I think it should be done as part of this PR. Let me know if you agree, and I will do so (and ensure my tests cover both cases).
In the meantime I've updated the tests to only test for what actually works: integers/numbers within strings.
@wardi : Any thoughts on the issue of non-text fields not being added to the full text field? I'm keen to write a fix for this, but I would like to know if I should do it as part of this PR or submit a separate PR. Thanks!
@aliceh75 yes, sorry for my slow reply. If you'd like to fix both issues as part of the same PR please go ahead. I agree that number columns are useful to have as part of the full text search.
@wardi Done. As not all PostgreSQL field types make sense in a full text search, I decided to limit it to: int8, int4, int2, float4, float8, date, time, timetz, timestamp, numeric, text. And I have added the following tests:
I noticed that some tests were failing depending on PostgreSQL version because they relied on the exact implementation of
…er-fails unicode_or_json_validator should assume literals are unicode rather than json
As per #1966:
When using the recline grid, attempting a full text search with an integer causes the grid to hang forever.
This is what happens:
q parameter, which is set to (say) 1234;
q is validated using ckanext.datastore.logic.schema.unicode_or_json_validator;
json_validator which does the validation by attempting to parse the string as json. This uses json.loads. This implementation considers literals to be valid json [*], so u'1234' becomes int(1234)
ckanext.datastore.plugin.DatastorePlugin.datastore_validate, checks whether q is a dict or a string. It doesn't allow other types - so the search fails on validation.
The fact that unicode_or_json_validator accepts literals has other implications - for instance a double quoted string "hello" would be accepted, and returned without the quotes. So I suggest the best option is to change unicode_or_json_validator to not accept JSON literals.
I will create a PR for this.
[*] RFC 4627 says that a JSON document should contain an object or an array, however the ECMA variation of the standard does not impose this restriction. In practice most implementations accept literals.
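The behaviour described in the steps above can be reproduced with the standard library alone. The following is a minimal sketch written for Python 3, not the code from this PR: parse_q is a hypothetical name, and the rule it applies (treat the input as JSON only when it decodes to an object) is an assumption based on the suggestion above to stop accepting JSON literals.

```python
import json


def parse_q(value):
    """Hypothetical stand-in for the validator (not CKAN's actual code).

    Treats the input as JSON only when it decodes to an object (dict);
    anything else is returned unchanged as text.
    """
    if not isinstance(value, str):
        # Already-decoded values (e.g. a dict passed to the API) pass through.
        return value
    try:
        parsed = json.loads(value)
    except ValueError:
        # Not JSON at all: keep it as a plain search string.
        return value
    # Bare literals such as 1234 or "hello" are valid input for json.loads,
    # but here they are kept as the original text instead of being converted.
    return parsed if isinstance(parsed, dict) else value


# json.loads accepts bare literals, which is what turns '1234' into an int.
print(repr(json.loads('1234')))
print(repr(json.loads('"hello"')))

# With the object-only rule, literals stay as text and objects are decoded.
print(repr(parse_q('1234')))
print(repr(parse_q('"hello"')))
print(repr(parse_q('{"a": "value"}')))
```

As noted in the conversation, a rule like this still cannot search for text that happens to look like a JSON object; that would need a separate parameter.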