Query Engine #14

ahirner · 2016-03-16T19:18:18Z

Querying tables and documents in a flexible, concise and precise way is important for two reasons:

1. The user:
   - enters search terms and wants relevant results
   - looks for similar tables (the original feature of the hackathon)
2. Data from similar documents can be aggregated automatically for up-stream calculations (the canonical example is in the OS branch: total cost calculation, https://github.com/ahirner/TabulaRazr-OS/blob/xirr-specific/xirr_calc.py)

To do that we need the following matching functions:

- exact
- boolean combinations
- fuzzy
- semantic (hackathon feature)

on the following fields:

- Header(s), context strings
- Document text (everything else not associated with tables)
- Column labels
- Column types and subtypes
- String ('other') columns
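As a rough illustration of how the first three matching functions could map onto query bodies, here is a minimal sketch that builds Elasticsearch Query DSL dictionaries (`term`, `match` with `fuzziness`, and `bool`). The field names (`header`, `column_labels`, `column_types`) and the helper names are assumptions made for this example; no index schema has been defined yet.

```python
# Sketch only: the helper names and field names below are hypothetical.

def exact(field, value):
    """Exact match on a non-analyzed field via a term query."""
    return {"term": {field: value}}


def fuzzy(field, text, fuzziness="AUTO"):
    """Typo-tolerant full-text match via a match query with fuzziness."""
    return {"match": {field: {"query": text, "fuzziness": fuzziness}}}


def boolean(must=(), should=(), must_not=()):
    """Boolean combination of other queries; empty clause lists are dropped."""
    clauses = {
        "must": list(must),
        "should": list(should),
        "must_not": list(must_not),
    }
    return {"bool": {name: qs for name, qs in clauses.items() if qs}}


# Tables whose header roughly matches "total cost", preferring currency
# columns and excluding tables with a column labelled "draft".
query = {
    "query": boolean(
        must=[fuzzy("header", "total cost")],
        should=[exact("column_types", "currency")],
        must_not=[exact("column_labels", "draft")],
    )
}
```

The resulting dictionary would be sent as the body of a search request. Semantic matching is not covered by this sketch.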

I propose using Elasticsearch, indexing documents and tables as separate types. It is fast and scalable, and every requirement listed here can be translated into queries that are native to Elasticsearch.
Even semantic search (= ANN, #9) can be achieved by transforming vector representations into proxy "words", as done here: https://github.com/ascribe/image-match/blob/master/image_match/signature_database_base.py (although that code works on '5' based discrete vectors, not dense vectors).
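To make the proxy "words" idea concrete, here is a minimal sketch of one possible encoding for dense vectors: each component is clipped, quantized into a small number of levels, and emitted as a token that names its dimension and level. This is an assumption for illustration, not the scheme used in image-match; the value range, the number of levels and the token format are all hypothetical.

```python
# Sketch only: the range [-1, 1], the 5 levels and the "d{i}_l{level}" token
# format are assumptions, not taken from image-match or from this project.

def vector_to_words(vector, low=-1.0, high=1.0, bins=5):
    """Quantize each component into one of `bins` levels and emit one token per dimension."""
    width = (high - low) / bins
    words = []
    for i, x in enumerate(vector):
        clipped = min(max(x, low), high)
        # The top edge of the range falls into the last bin.
        level = min(int((clipped - low) / width), bins - 1)
        words.append(f"d{i}_l{level}")
    return words


def shared_words(a, b):
    """Count tokens that two encoded vectors have in common (a crude similarity proxy)."""
    return len(set(a) & set(b))


words = vector_to_words([-1.0, 0.0, 1.0])
# words == ["d0_l0", "d1_l2", "d2_l4"]
```

Indexed as an ordinary text field, such tokens would let a plain term-overlap query rank nearby vectors higher. That makes it a coarse approximation of nearest-neighbour search, and vectors that fall just on either side of a bin boundary share no token for that dimension.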

ahirner self-assigned this Mar 16, 2016

ahirner added this to the Architecture Freeze milestone Mar 16, 2016

ahirner added the backend label Mar 16, 2016

ahirner mentioned this issue Mar 16, 2016

Calculate the similarities from a given table to other tables #5

Closed
