This repository has been archived by the owner on Jan 2, 2019. It is now read-only.

Query Engine #14

Open
ahirner opened this issue Mar 16, 2016 · 0 comments

ahirner commented Mar 16, 2016

Querying tables and documents in a flexible, concise and precise way is important for two reasons:

  1. Users:
    • enter search terms and want relevant results
    • look for similar tables (the original hackathon feature)
  2. Automation: data from similar documents has to be aggregated automatically for downstream calculations (the canonical example is in the OS branch, the total cost calculation: https://github.com/ahirner/TabulaRazr-OS/blob/xirr-specific/xirr_calc.py)

To do that we need the following matching functions:

  • exact
  • boolean combinations
  • fuzzy
  • semantic (hackathon feature)

on the following fields:

  • Header(s), context strings
  • Document text (everything else not associated with tables)
  • Column labels
  • Column types and subtypes
  • String ('other') columns
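
To make the matching functions concrete, here is a minimal sketch of how exact, fuzzy and boolean matching could be written as Elasticsearch query DSL bodies, built as plain Python dicts. The field names (`header`, `column_types`, `column_labels`) and the helper functions are hypothetical, since no schema has been defined yet; only the `term`, `match` (with `fuzziness`) and `bool` clauses are standard query DSL.

```python
# Sketch only: helper names and field names are assumptions, not an agreed schema.

def exact(field, value):
    """Exact match; intended for a non-analyzed (keyword-like) field."""
    return {"term": {field: value}}


def fuzzy(field, text, fuzziness="AUTO"):
    """Full-text match that tolerates small edit distances."""
    return {"match": {field: {"query": text, "fuzziness": fuzziness}}}


def boolean(must=(), should=(), must_not=()):
    """Combine clauses; empty groups are left out of the bool query."""
    groups = {"must": must, "should": should, "must_not": must_not}
    return {"bool": {name: list(clauses) for name, clauses in groups.items() if clauses}}


def search_body(clause):
    """Wrap a clause into a request body that a search call would accept."""
    return {"query": clause}


# Tables whose header roughly says "total cost", preferring currency columns,
# excluding anything labelled "draft".
body = search_body(boolean(
    must=[fuzzy("header", "total cost")],
    should=[exact("column_types", "currency")],
    must_not=[exact("column_labels", "draft")],
))
```

The resulting `body` could then be passed to whichever Elasticsearch client ends up being used; semantic matching is not covered by this sketch.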

I propose to use Elasticsearch, indexing documents and tables as separate types. It is fast and scalable, and it lets us translate every requirement above into queries that are native to Elasticsearch.
Even semantic search (= ANN, #9) can be achieved by transforming vector representations into proxy "words", as done here: https://github.com/ascribe/image-match/blob/master/image_match/signature_database_base.py (although that code works on discrete vectors with 5 possible values per component, not on dense vectors).
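
As a rough illustration of the proxy "words" idea, a dense vector could be quantized per component and each (dimension, bucket) pair emitted as a token, so that similar vectors share many tokens and ordinary term matching approximates nearest-neighbour search. This is a simplified sketch under my own assumptions (value range, 5 buckets, token format, function names); it is not the scheme used in the linked image-match code.

```python
# Sketch only: range [-1, 1], 5 buckets and the "d<i>_b<k>" token format are assumptions.

def vector_to_words(vec, levels=5, lo=-1.0, hi=1.0):
    """Turn a dense vector into one token per dimension by bucketing its value."""
    step = (hi - lo) / levels
    words = []
    for i, x in enumerate(vec):
        clipped = min(max(x, lo), hi)                      # keep the value inside [lo, hi]
        bucket = min(int((clipped - lo) / step), levels - 1)  # hi itself falls in the last bucket
        words.append(f"d{i}_b{bucket}")
    return words


def shared_words(a, b):
    """Number of tokens two vectors have in common (a crude similarity score)."""
    return len(set(vector_to_words(a)) & set(vector_to_words(b)))


query_words = vector_to_words([0.1, 0.5, -0.3])
```

The tokens could be indexed as an ordinary text field and queried with a `match` on `query_words`, so that scoring rewards token overlap. Hard bucket boundaries mean two very close values can still land in different buckets, which is one reason to treat this only as a starting point.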
