Skip to content
This repository has been archived by the owner on Mar 11, 2021. It is now read-only.

Search Architecture #480

Closed
tsmaeder opened this issue Nov 11, 2016 · 9 comments
Closed

Search Architecture #480

tsmaeder opened this issue Nov 11, 2016 · 9 comments

Comments

@tsmaeder
Copy link
Contributor

From a post to almighty-public:

As we move to offer more search functionality in our API, I am becoming a bit worried about our architecture in that respect. If I look at the search service we have written for the "link typeahead" feature, we see code that parses a query string in a format that is special to this exact API call, and converts it directly into a query on the postgres db.

If we look at what the code does, it could be expressed like this:

searchTerms:= parseSpecialSyntax(queryString)
`return select * from work_items i where i.id matches(searchTearms) or i.title matches(searchTerms) or i.Description matches(searchTerms)

While the first line is specific to this particular API call, the second part can be reused in other queries. I have tried to express this separation in the WorkItemRepository.List() method by passing in a tree of operators that describes the query instead of a query string.

I am quite open to discuss how we want to offer a search API on our object model to the request handlers, but I think that we need an internal search API layer that offers what we can search for in a structure way. Query parsing for special case queries (for example, parsing well known URLs) should not be part of this API, but done in a layer above. This way we can reuse a lot of checks (for example for sql injection) and the translation to the sql layer. I believe that we will write a bunch of special case queries that we will be able to map to a relatively simple set internal search API calls.

The purpose of this issue is to track this concern and to develop a proposal how to address it.

@tsmaeder tsmaeder added this to the Sprint #123 milestone Nov 11, 2016
@tsmaeder
Copy link
Contributor Author

I did some experiments, adding full text indices on the work item sample db (130k work items).

Sizes: (1 page == 8k)

Relation size:
select relpages from pg_class where relname='work_items';    
N    15192

index sizes: 
SELECT c2.relname, c2.relpages FROM pg_class c, pg_class c2, pg_index i WHERE c.relname = 'work_items' AND c.oid = i.indrelid AND c2.oid = i.indexrelid^JORDER BY c2.relname;
 fulltext_search_index |    19471
 wi_des_ftidx          |     6158
 wi_id_ftidx           |      904
 wi_title_ftidx        |      823
 work_items_pkey       |      512

One thing that is interesting here is that the combined index sizes of the single field full text indices are smaller thant the ft-index of the combined fields ("fulltext_search_index"). One reason may be that no weights were set when creating the individual indices.

@tsmaeder
Copy link
Contributor Author

Here are some query execution times:

explain (analyze, costs) select fields->'system.title' from work_items where to_tsvector('english', fields->>'system.title') @@ to_tsquery('english', 'NullPointerException') or to_tsvector('english', fields->>'system.description') @@ to_tsquery('english', 'NullPointerException') or to_tsvector('english', id::text) @@ to_tsquery('english', 'NullPointerException)

Bitmap Heap Scan on work_items  (cost=68.34..7104.24 rows=1972 width=510) (actual time=0.045..0.172 rows=50 loops=1)
   Recheck Cond: ((to_tsvector('english'::regconfig, (fields ->> 'system.title'::text)) @@ '''nullpointerexcept'''::tsquery) OR (to_tsvector('english'::regconfig, (fields ->> 'system.description'::text)) @@ '''nullpointerexcept'''::tsquery) OR (to_tsvector('english'::regconfig, (id)::text) @@ '''nullpointerexcept'''::tsquery))
   Heap Blocks: exact=43
   ->  BitmapOr  (cost=68.34..68.34 rows=1982 width=0) (actual time=0.033..0.033 rows=0 loops=1)
         ->  Bitmap Index Scan on wi_title_ftidx  (cost=0.00..20.95 rows=661 width=0) (actual time=0.021..0.021 rows=42 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, (fields ->> 'system.title'::text)) @@ '''nullpointerexcept'''::tsquery)
         ->  Bitmap Index Scan on wi_des_ftidx  (cost=0.00..24.95 rows=661 width=0) (actual time=0.008..0.008 rows=12 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, (fields ->> 'system.description'::text)) @@ '''nullpointerexcept'''::tsquery)
         ->  Bitmap Index Scan on wi_id_ftidx  (cost=0.00..20.95 rows=661 width=0) (actual time=0.004..0.004 rows=0 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, (id)::text) @@ '''nullpointerexcept'''::tsquery)
 Planning time: 0.160 ms
 Execution time: 0.205 ms

explain (analyze, costs) select fields->>'title' from work_items where tsv @@ to_tsquery('english', 'NullPointerException');
 Bitmap Heap Scan on work_items  (cost=170.34..1248.40 rows=302 width=510) (actual time=0.275..0.405 rows=50 loops=1)
   Recheck Cond: (tsv @@ '''nullpointerexcept'''::tsquery)
   Heap Blocks: exact=43
   ->  Bitmap Index Scan on fulltext_search_index  (cost=0.00..170.26 rows=302 width=0) (actual time=0.264..0.264 rows=50 loops=1)
         Index Cond: (tsv @@ '''nullpointerexcept'''::tsquery)
 Planning time: 0.194 ms
 Execution time: 0.429 ms

It is interesting to observe that the query for individual fields is faster than the query for the combined field.

@tsmaeder
Copy link
Contributor Author

Execution time rises as selectivity goes down: searching for "test" takes 107ms (combined fields) and 113ms (single fields)

@tsmaeder
Copy link
Contributor Author

tsmaeder commented Nov 11, 2016

Thinking about queries for fields in json: in order to generate meaningful queries for field values (businessValue < 5, title ilike 'foo%), we need to know type of value that is stored in a field. This implies that a field must have a unique type for the scope of work item types we use to generate the query. The db schema we have right now does not ensure this property.

@aslakknutsen
Copy link
Contributor

aslakknutsen commented Nov 11, 2016

This implies that a field must have a unique type for the scope of work item types we use to generate the query.

Yeah, I agree. This is where the 'general purpose' WIT starts to break down a bit. I observed the same thing in looking trough VSO as well.

While you can define a Field as { Name: X, Type: Markup, Ref: system.assignee } in the WIT def, the Ref system.assignee require a Type User to actually work.

Technically, most(all?) system.* Fields (known fields) have a predefined type.

@tsmaeder
Copy link
Contributor Author

Ranking with single-field queries is more complicated. Not clear how to compute an overall rank.

@tsmaeder
Copy link
Contributor Author

Added https://github.com/tsmaeder/almighty-core/tree/poc_matches_operator with an added "matches" operator.

@tsmaeder
Copy link
Contributor Author

Open issues here: field types, joins, ranking.

@tsmaeder tsmaeder removed this from the Sprint #123 milestone Nov 14, 2016
@joshuawilson joshuawilson modified the milestone: Backlog Jul 3, 2017
@aslakknutsen
Copy link
Contributor

@baijum Something to evaluate as the Search Query lang is being written? tsmaeder@df4df84

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants