unicode_or_json_validator should assume literals are unicode rather than json #1967

aliceh75 · 2014-10-09T12:02:02Z

As per #1966:

When using the recline grid, attempting a full text search with an integer causes the grid to hang forever.

This is what happens:

The full text search is sent to the datastore API using the q parameter, which is set to (say) 1234;
q is validated using ckanext.datastore.logic.schema.unicode_or_json_validator;
In turns this calls json_validator which does the validation by attempting to parse the string as json. This uses json.loads. This implementation considers literals to be valid json [*], so u'1234' becomes int(1234)
Later in the process, ckanext.datastore.plugin.DatastorePlugin.datastore_validate, checks whether q is a dict or a string. It doesn't allow other types - so the search fails on validation.

The fact that unicode_or_json_validator accepts literals has other implications - for instance a double quoted string "hello" would be accepted, and returned without the quotes. So I suggest the best option is to change unicode_or_json_validator to not accept JSON literals.

I will create a PR for this.

[*] RFC 4267 says that a JSON document should contain an array or a string, however the ECMA variation of the standard does not impose this restriction. In practice most implementations accept litterals.

wardi · 2014-10-09T15:20:43Z

@vitorbaptista what's unicode_or_json_validator used for? If you're expecting JSON wouldn't it be better to always pass JSON? i.e. add a separate qjson parameter to datastore_search.

Another option might be to only accept actual dicts passed to the API (although that won't work for web forms).

@aliceh75 won't your change make it impossible to pass strings that happen to look like JSON objects?

aliceh75 · 2014-10-09T15:38:38Z

@wardi : The current code already makes it impossible to search for text that looks like JSON objects - but if you means that my fix does not go far enough, you are correct. Someone might genuinely want to search for "{"a":"value"}" (after all, someone might upload a document that contains JSON data).

The fix for that problem is a larger one though - as you say, having a separate qjson is the only way to fully address the problem.

(As far as I understand, the reason q accepts both text strings and json dicts is that is used for full text search on the _full_text field (when using a string) or on a given field (when using a json dict))

wardi · 2014-10-09T15:51:06Z

So this change means that instead of the current situation of using "42" (with the quotes) to search for the number 42 as a string, you can just search for 42. So you only need to add double quotes around things that look like JSON objects, e.g. "{}" to search for the string {}.

That is an improvement.

wardi · 2014-10-09T16:30:52Z

Both unicode_or_json_validator and json_validator need some docstrings that explain the input they accept, reject and how they convert data.

json_validator is also strange because it will pass through dicts, lists and None, but not other JSON types like bool and float (although it lets you create them) That doesn't seem intentional.

vitorbaptista · 2014-10-09T19:49:54Z

@wardi The intent was to add support for full-text search on a specific field, while keeping the code backwards-compatible. @aliceh75 found a case where the code isn't backwards compatible, so this is a bug IMO.

The documentation for this specific attribute is in https://github.com/ckan/ckan/blob/master/ckanext/datastore/logic/action.py#L277-L280.

:param q: full text query. If it's a string, it'll search on all fields on
         each row. If it's a dictionary as {"key1": "a", "key2": "b"},
         it'll search on each specific field (optional)

:type q: string or dictionary

aliceh75 · 2014-10-10T13:42:56Z

@wardi: I've added docstrings.

wardi · 2014-10-15T18:09:27Z

@aliceh75 thanks, those look good. The last thing I'll ask for is a few new tests in https://github.com/ckan/ckan/blob/master/ckanext/datastore/tests/test_interface.py so that we know this will keep working in the future.

wardi · 2014-10-15T18:10:55Z

.. or maybe they should be in https://github.com/ckan/ckan/blob/master/ckanext/datastore/tests/test_search.py

aliceh75 · 2014-10-24T10:32:32Z

@wardi : I wrote some tests, and in so-doing realized there was another issue: columns that contains numbers only (ie. type int or float) do not get added to the full text index.

While this is a separate issue, I think it makes sense to fix both together - otherwise this PR would merely be adding FTS support for integers within text strings, which sounds like an unfinished feature.

The issue in question is fairly easy to fix. It is caused by _to_full_text in db.py which explicitely only includes json dicts and strings in the full text representation. See the code at : https://github.com/ckan/ckan/blob/master/ckanext/datastore/db.py#L751

The commit that introduced this code does not offer any rational for excluding numeric fields from the full text search. I think they should be re-introduced, and I think it should be done as part of this PR. Let me know if you agree, and I will do so (and ensure my tests cover both cases).

…han json #1966

aliceh75 · 2014-10-24T11:39:44Z

In the meantime I've updated the tests to only test for what actually works: integers/numbers within strings.

aliceh75 · 2014-10-30T10:24:22Z

@wardi : Any thoughts on the issue of non-text fields not being added to the full text field? I'm keen to write a fix for this, but I would like to know if I should do it as part of this PR or submit a separate PR. Thanks!

wardi · 2014-10-30T12:52:30Z

@aliceh75 yes, sorry for my slow reply. If you'd like to fix both issues as part of the same PR please go ahead. I agree that number columns are useful to have as part of the full text search.

…1966

…es are present

aliceh75 · 2014-10-31T08:12:45Z

@wardi Done. As not all PostgreSQL field types make sense in a full text search, I decided to limit it to:

int8, int4, int2, float4, float8, date, time, timetz, timestamp, numeric, text

And I have added the following tests:

integer within text string
integer in integer column
float within text string
float in float column
date in date column

I noticed that some tests were failing depending on PostgreSQL version because they relied on the exact implementation of to_tsvector. The point of the test is not to test to_tsvector, so I've changed them not to test the exact format of the full text column, but only to check that expected lexemes are included in the full text column.

…er-fails unicode_or_json_validator should assume literals are unicode rather than json

amercader assigned wardi Oct 14, 2014

aliceh75 mentioned this pull request Oct 15, 2014

Add phrase search to datastore #1979

Closed

amercader added this to the CKAN 2.3 milestone Oct 22, 2014

unicode_or_json_validator should assume literals are unicode rather t…

9c6a9a1

…han json #1966

aliceh75 mentioned this pull request Oct 30, 2014

Date & number search problems NaturalHistoryMuseum/ckanext-nhm#59

Closed

aliceh75 added 2 commits October 30, 2014 15:44

Ensure that numeric and date fields are added to the full text index #…

7c79bde

…1966

Do not test the implementation of ts_vector, only that expected lexem…

4a6f1ab

…es are present

wardi added a commit that referenced this pull request Nov 3, 2014

Merge pull request #1967 from aliceh75/1966-full-text-search-on-integ…

cf68e74

…er-fails unicode_or_json_validator should assume literals are unicode rather than json

wardi merged commit cf68e74 into ckan:master Nov 3, 2014

aliceh75 mentioned this pull request Nov 3, 2014

Full text search on integers fails #1966

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unicode_or_json_validator should assume literals are unicode rather than json #1967

unicode_or_json_validator should assume literals are unicode rather than json #1967

aliceh75 commented Oct 9, 2014

wardi commented Oct 9, 2014

aliceh75 commented Oct 9, 2014

wardi commented Oct 9, 2014

wardi commented Oct 9, 2014

vitorbaptista commented Oct 9, 2014

aliceh75 commented Oct 10, 2014

wardi commented Oct 15, 2014

wardi commented Oct 15, 2014

aliceh75 commented Oct 24, 2014

aliceh75 commented Oct 24, 2014

aliceh75 commented Oct 30, 2014

wardi commented Oct 30, 2014

aliceh75 commented Oct 31, 2014

unicode_or_json_validator should assume literals are unicode rather than json #1967

unicode_or_json_validator should assume literals are unicode rather than json #1967

Conversation

aliceh75 commented Oct 9, 2014

wardi commented Oct 9, 2014

aliceh75 commented Oct 9, 2014

wardi commented Oct 9, 2014

wardi commented Oct 9, 2014

vitorbaptista commented Oct 9, 2014

aliceh75 commented Oct 10, 2014

wardi commented Oct 15, 2014

wardi commented Oct 15, 2014

aliceh75 commented Oct 24, 2014

aliceh75 commented Oct 24, 2014

aliceh75 commented Oct 30, 2014

wardi commented Oct 30, 2014

aliceh75 commented Oct 31, 2014