User Guide

Note

Examples are based on IMDB data. This guide is a work in progress; some sections are still to be written.

Query

The ~pandagg.tree.query.abstract.Query class provides:

  • multiple syntaxes to declare and update a query
  • query validation (with nested clauses validation)
  • ability to insert clauses at specific points
  • tree-like visual representation

Instantiation

From native "dict" query

Given the following query:

>>> expected_query = {'bool': {'must': [
...     {'terms': {'genres': ['Action', 'Thriller']}},
...     {'range': {'rank': {'gte': 7}}},
...     {'nested': {
...         'path': 'roles',
...         'query': {'bool': {'must': [
...             {'term': {'roles.gender': {'value': 'F'}}},
...             {'term': {'roles.role': {'value': 'Reporter'}}}]}
...         }
...     }}
... ]}}

To instantiate ~pandagg.tree.query.abstract.Query, simply pass the "dict" query as argument:

>>> from pandagg.query import Query
>>> q = Query(expected_query)

A visual representation of the query is available with ~pandagg.tree.query.abstract.Query.show:

>>> q.show()
<Query>
bool
└── must
    ├── nested, path="roles"
    │   └── query
    │       └── bool
    │           └── must
    │               ├── term, field=roles.gender, value="F"
    │               └── term, field=roles.role, value="Reporter"
    ├── range, field=rank, gte=7
    └── terms, genres=["Action", "Thriller"]

Call ~pandagg.tree.query.abstract.Query.to_dict to convert it to a native dict:

>>> q.to_dict()
{'bool': {
    'must': [
        {'range': {'rank': {'gte': 7}}},
        {'terms': {'genres': ['Action', 'Thriller']}},
        {'nested': {
            'path': 'roles',
            'query': {'bool': {'must': [
                {'term': {'roles.role': {'value': 'Reporter'}}},
                {'term': {'roles.gender': {'value': 'F'}}}
            ]}}
        }}
    ]
}}

>>> from pandagg.utils import equal_queries
>>> equal_queries(q.to_dict(), expected_query)
True

Note

The equal_queries function ignores the order of clauses in must/should parameters, since that order does not affect Elasticsearch execution, i.e.:

>>> equal_queries({'must': [A, B]}, {'must': [B, A]})
True
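An order-insensitive comparison of this kind can be sketched in plain Python. The helper below is hypothetical and is not pandagg's equal_queries implementation; it assumes that only the bool occurrence lists (must, should, must_not, filter) are order-free, and it leaves every other list untouched.

```python
import json

# Assumption: these are the only keys whose list order is irrelevant.
ORDER_FREE_KEYS = {'must', 'should', 'must_not', 'filter'}


def canonical(node):
    """Return a copy of node in which order-free clause lists are sorted."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            value = canonical(value)
            if key in ORDER_FREE_KEYS and isinstance(value, list):
                # Sort clauses by their JSON text to get a stable order.
                value = sorted(value, key=lambda item: json.dumps(item, sort_keys=True))
            result[key] = value
        return result
    if isinstance(node, list):
        return [canonical(item) for item in node]
    return node


def same_query(left, right):
    """Hypothetical stand-in for an order-insensitive query comparison."""
    return canonical(left) == canonical(right)


a = {'terms': {'genres': ['Action', 'Thriller']}}
b = {'range': {'rank': {'gte': 7}}}
print(same_query({'bool': {'must': [a, b]}}, {'bool': {'must': [b, a]}}))  # True
```

Lists outside those keys, such as the values of a terms clause, are still compared in order in this sketch.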

With DSL classes

Pandagg provides a DSL to declare this query in a quite similar fashion:

>>> from pandagg.query import Nested, Bool, Range, Term, Terms

>>> q = Bool(must=[
...     Terms(genres=['Action', 'Thriller']),
...     Range(rank={"gte": 7}),
...     Nested(
...         path='roles',
...         query=Bool(must=[
...             Term(roles__gender='F'),
...             Term(roles__role='Reporter')
...         ])
...     )
... ])

All these classes inherit from ~pandagg.tree.query.abstract.Query and thus provide the same interface.

>>> from pandagg.query import Query
>>> isinstance(q, Query)
True

With single clause as flattened syntax

In the flattened syntax, the query clause type is used as first argument:

>>> from pandagg.query import Query
>>> q = Query('terms', genres=['Action', 'Thriller'])
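To make the mapping between the flattened syntax and the native dict concrete, here is a minimal sketch in plain Python. The flat_clause function is hypothetical and is not part of pandagg; it assumes that a double underscore in a keyword name stands for a dot in the field path, as roles__gender does in the DSL example above.

```python
def flat_clause(clause_type, **body):
    """Build a native dict clause from a clause type and keyword arguments.

    Hypothetical helper: '__' in a keyword name is read as '.' in the field name.
    """
    fields = {name.replace('__', '.'): value for name, value in body.items()}
    return {clause_type: fields}


print(flat_clause('terms', genres=['Action', 'Thriller']))
# {'terms': {'genres': ['Action', 'Thriller']}}
print(flat_clause('term', roles__gender='F'))
# {'term': {'roles.gender': 'F'}}
```

The second result uses the short form of a term clause; the outputs shown in this guide use the expanded {'value': 'F'} form instead.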

Query enrichment

All methods described below return a new ~pandagg.tree.query.abstract.Query instance and leave the initial query unchanged.

For instance:

>>> from pandagg.query import Query
>>> initial_q = Query()
>>> enriched_q = initial_q.query('terms', genres=['Comedy', 'Short'])

>>> initial_q.to_dict() is None
True

>>> enriched_q.to_dict()
{'terms': {'genres': ['Comedy', 'Short']}}

Note

Calling ~pandagg.tree.query.abstract.Query.to_dict on an empty Query returns None:

>>> from pandagg.query import Query
>>> Query().to_dict() is None
True
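The "return a new instance" behaviour can be illustrated with a small stand-alone class. TinyQuery below is a hypothetical toy, not pandagg's Query; it only shows the copy-then-extend pattern under the assumption that state is copied on every call.

```python
import copy


class TinyQuery:
    """Hypothetical toy query holder that is never modified in place."""

    def __init__(self, clauses=None):
        self._clauses = list(clauses or [])

    def query(self, clause_type, **body):
        # Copy the current state, extend the copy, and wrap it in a new object.
        clauses = copy.deepcopy(self._clauses)
        clauses.append({clause_type: body})
        return TinyQuery(clauses)

    def to_dict(self):
        if not self._clauses:
            return None  # an empty query has no dict form
        if len(self._clauses) == 1:
            return copy.deepcopy(self._clauses[0])
        return {'bool': {'must': copy.deepcopy(self._clauses)}}


initial_q = TinyQuery()
enriched_q = initial_q.query('terms', genres=['Comedy', 'Short'])
print(initial_q.to_dict())   # None
print(enriched_q.to_dict())  # {'terms': {'genres': ['Comedy', 'Short']}}
```

Because each call works on a copy, intermediate queries can be kept and reused as starting points for several variants.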

query() method

The base method to enrich a ~pandagg.tree.query.abstract.Query is ~pandagg.tree.query.abstract.Query.query.

Considering this query:

>>> from pandagg.query import Query
>>> q = Query()

~pandagg.tree.query.abstract.Query.query accepts the following syntaxes:

from dictionary:

>>> q.query({"terms": {"genres": ['Comedy', 'Short']}})

flattened syntax:

>>> q.query("terms", genres=['Comedy', 'Short'])

from Query instance (this includes DSL classes):

>>> from pandagg.query import Terms
>>> q.query(Terms(genres=['Action', 'Thriller']))

Compound clauses specific methods

~pandagg.tree.query.abstract.Query instances also expose the following methods for specific compound queries:

(TODO: detail allowed syntaxes)

Specific to bool queries:

  • ~pandagg.tree.query.abstract.Query.bool
  • ~pandagg.tree.query.abstract.Query.filter
  • ~pandagg.tree.query.abstract.Query.must
  • ~pandagg.tree.query.abstract.Query.must_not
  • ~pandagg.tree.query.abstract.Query.should

Specific to other compound queries:

  • ~pandagg.tree.query.abstract.Query.nested
  • ~pandagg.tree.query.abstract.Query.constant_score
  • ~pandagg.tree.query.abstract.Query.dis_max
  • ~pandagg.tree.query.abstract.Query.function_score
  • ~pandagg.tree.query.abstract.Query.has_child
  • ~pandagg.tree.query.abstract.Query.has_parent
  • ~pandagg.tree.query.abstract.Query.parent_id
  • ~pandagg.tree.query.abstract.Query.pinned_query
  • ~pandagg.tree.query.abstract.Query.script_score
  • ~pandagg.tree.query.abstract.Query.boost

Inserted clause location

For all insertion methods detailed above, the inserted clause is placed by default at the top level of the query, and a bool clause is generated if necessary.

Considering the following query:

>>> from pandagg.query import Query
>>> q = Query('terms', genres=['Action', 'Thriller'])
>>> q.show()
<Query>
terms, genres=["Action", "Thriller"]

A bool query will be created:

>>> q = q.query('range', rank={"gte": 7})
>>> q.show()
<Query>
bool
└── must
    ├── range, field=rank, gte=7
    └── terms, genres=["Action", "Thriller"]

And reused if necessary:

>>> q = q.must_not('range', year={"lte": 1970})
>>> q.show()
<Query>
bool
├── must
│   ├── range, field=rank, gte=7
│   └── terms, genres=["Action", "Thriller"]
└── must_not
    └── range, field=year, lte=1970
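The "create a bool clause if necessary, otherwise reuse it" rule can be sketched on native dicts. The add_clause function below is a hypothetical illustration working on plain dictionaries, not pandagg's insertion logic.

```python
import copy


def add_clause(query, clause, occur='must'):
    """Return a new query dict with clause inserted at the top level.

    Hypothetical helper: occur is one of the bool parameters (must, must_not, ...).
    """
    clause = copy.deepcopy(clause)
    if query is None:
        return clause if occur == 'must' else {'bool': {occur: [clause]}}
    query = copy.deepcopy(query)
    if 'bool' not in query:
        # No bool yet: wrap the existing clause in a new bool query.
        query = {'bool': {'must': [query]}}
    # A bool exists at this point: reuse it.
    query['bool'].setdefault(occur, []).append(clause)
    return query


q0 = {'terms': {'genres': ['Action', 'Thriller']}}
q1 = add_clause(q0, {'range': {'rank': {'gte': 7}}})
q2 = add_clause(q1, {'range': {'year': {'lte': 1970}}}, occur='must_not')
print(q2)
```

Here q1 wraps the terms clause in a new bool query, and q2 reuses that bool to add a must_not entry, mirroring the two steps above.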

Inserting a clause at a specific location requires naming clauses:

>>> from pandagg.query import Term

>>> q = q.nested(path='roles', _name='nested_roles', query=Term('roles.gender', value='F'))
>>> q.show()
<Query>
bool
├── must
│   ├── nested, _name=nested_roles, path="roles"
│   │   └── query
│   │       └── term, field=roles.gender, value="F"
│   ├── range, field=rank, gte=7
│   └── terms, genres=["Action", "Thriller"]
└── must_not
    └── range, field=year, lte=1970

Naming a clause makes it possible to insert other clauses above or below it using the parent/child parameters:

>>> q = q.query('term', roles__role='Reporter', parent='nested_roles')
>>> q.show()
<Query>
bool
├── must
│   ├── nested, _name=nested_roles, path="roles"
│   │   └── query
│   │       └── bool
│   │           └── must
│   │               ├── term, field=roles.role, value="Reporter"
│   │               └── term, field=roles.gender, value="F"
│   ├── range, field=rank, gte=7
│   └── terms, genres=["Action", "Thriller"]
└── must_not
    └── range, field=year, lte=1970
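Insertion below a named clause can be sketched as a search for the _name followed by the same bool wrapping as at the top level. The functions below are hypothetical and operate on native dicts; they assume the named clause keeps its child under a query key, as nested does.

```python
import copy


def find_named(node, name):
    """Depth-first search for the clause body whose _name equals name."""
    if isinstance(node, dict):
        if node.get('_name') == name:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_named(child, name)
        if found is not None:
            return found
    return None


def insert_below(query, parent_name, clause):
    """Return a copy of query with clause added under the named clause."""
    query = copy.deepcopy(query)
    parent = find_named(query, parent_name)
    if parent is None:
        raise KeyError(parent_name)
    existing = parent.get('query')
    if existing is None:
        parent['query'] = copy.deepcopy(clause)
    elif 'bool' in existing:
        existing['bool'].setdefault('must', []).append(copy.deepcopy(clause))
    else:
        # Wrap the existing child and the new clause in a bool query.
        parent['query'] = {'bool': {'must': [existing, copy.deepcopy(clause)]}}
    return query


base = {'bool': {'must': [
    {'nested': {'_name': 'nested_roles', 'path': 'roles',
                'query': {'term': {'roles.gender': {'value': 'F'}}}}},
    {'range': {'rank': {'gte': 7}}},
]}}
updated = insert_below(base, 'nested_roles', {'term': {'roles.role': {'value': 'Reporter'}}})
print(updated['bool']['must'][0]['nested']['query'])
```

The printed child of the nested clause is a bool query whose must list holds both term clauses, while base is left unchanged.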

TODO: explain parent_param, child_param, mode merging strategies on same named clause etc..

Aggregation

The ~pandagg.tree.aggs.aggs.Aggs class provides:

  • multiple syntaxes to declare and update an aggregation
  • clause validation (with nested clauses validation)
  • ability to insert clauses at specific points

Aggregation declaration

Aggregation response

TODO

TODO

Mapping

Interactive mapping

In an interactive context, the ~pandagg.interactive.mapping.IMapping class provides navigation features with autocompletion to quickly explore a large mapping:

>>> from pandagg.mapping import IMapping
>>> from examples.imdb.load import mapping as imdb_mapping
>>> m = IMapping(imdb_mapping)
>>> m.roles
<IMapping subpart: roles>
roles                    [Nested]
├── actor_id              Integer
├── first_name            Text
│   └── raw             ~ Keyword
├── gender                Keyword
├── last_name             Text
│   └── raw             ~ Keyword
└── role                  Keyword
>>> m.roles.first_name
<IMapping subpart: roles.first_name>
first_name                Text
└── raw                 ~ Keyword

To get the complete field definition, just call it:

>>> m.roles.first_name()
<Mapping Field first_name> of type text:
{
    "type": "text",
    "fields": {
        "raw": {
            "type": "keyword"
        }
    }
}
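For readers without an interactive session, the same lookup can be approximated on a raw mapping dict. The sketch below is hypothetical and does not use IMapping; the mapping excerpt is an assumption reconstructed from the tree printed above and is not the full IMDB mapping.

```python
# Assumed excerpt of the mapping, limited to fields shown in the tree above.
mapping_excerpt = {
    'properties': {
        'roles': {
            'type': 'nested',
            'properties': {
                'gender': {'type': 'keyword'},
                'first_name': {
                    'type': 'text',
                    'fields': {'raw': {'type': 'keyword'}},
                },
            },
        },
    },
}


def field_definition(mapping, path):
    """Follow a dotted path through 'properties' and multi-field 'fields'."""
    node = mapping
    for name in path.split('.'):
        children = {**node.get('properties', {}), **node.get('fields', {})}
        node = children[name]  # raises KeyError for an unknown field
    return node


print(field_definition(mapping_excerpt, 'roles.first_name'))
# {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}
print(field_definition(mapping_excerpt, 'roles.first_name.raw'))
# {'type': 'keyword'}
```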

An IMapping instance can be bound to an Elasticsearch client to get quick access to aggregation computation on mapping fields.

Suppose you have the following client:

>>> from elasticsearch import Elasticsearch
>>> client = Elasticsearch(hosts=['localhost:9200'])

The client can be bound at instantiation:

>>> m = IMapping(imdb_mapping, client=client, index_name='movies')

Doing so generates an "a" attribute on mapping fields; this attribute lists all aggregations available for that field type (with autocompletion):

>>> m.roles.gender.a.terms()
[('M', {'key': 'M', 'doc_count': 2296792}),
 ('F', {'key': 'F', 'doc_count': 1135174})]

Note

Nested clauses will be automatically taken into account.
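As an illustration of what taking nested clauses into account can mean, the sketch below builds a raw Elasticsearch request body for a terms aggregation on a field that sits under a nested parent. It is a hypothetical helper, not the code pandagg runs: the aggregation names (terms_agg, nested_roles) and the mapping excerpt are assumptions, and multi-fields are not handled.

```python
# Assumed excerpt of the mapping; only the fields needed for the example.
mapping_excerpt = {
    'properties': {
        'roles': {
            'type': 'nested',
            'properties': {'gender': {'type': 'keyword'}},
        },
    },
}


def nested_parents(mapping, path):
    """List the dotted paths of all nested objects along path, outermost first."""
    parents = []
    node = mapping
    parts = path.split('.')
    for depth, name in enumerate(parts, start=1):
        node = node['properties'][name]
        if node.get('type') == 'nested':
            parents.append('.'.join(parts[:depth]))
    return parents


def terms_agg_body(mapping, path):
    """Wrap a terms aggregation in one nested aggregation per nested parent."""
    aggs = {'terms_agg': {'terms': {'field': path}}}
    for parent in reversed(nested_parents(mapping, path)):
        name = 'nested_' + parent.replace('.', '_')
        aggs = {name: {'nested': {'path': parent}, 'aggs': aggs}}
    return {'size': 0, 'aggs': aggs}


print(terms_agg_body(mapping_excerpt, 'roles.gender'))
```

The printed body places the terms aggregation on roles.gender inside a nested aggregation with path roles, which is the shape Elasticsearch requires for fields of nested objects.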

Cluster indices discovery

TODO