Skip to content

Latest commit

 

History

History
302 lines (238 loc) · 9.5 KB

user-guide.rst

File metadata and controls

302 lines (238 loc) · 9.5 KB

User Guide

Note

Examples will be based on IMDB data. This is a work in progress. Some sections still need to be furnished.

Query

The ~pandagg.tree.query.Query class allows multiple ways to declare and udpate an Elasticsearch query.

Let's explore the multiple ways we have to declare the following query:

>>> expected_query = {'bool': {'must': [ >>> {'terms': {'genres': ['Action', 'Thriller']}}, >>> {'range': {'rank': {'gte': 7}}}, >>> {'nested': { >>> 'path': 'roles', >>> 'query': {'bool': {'must': [ >>> {'term': {'roles.gender': {'value': 'F'}}}, >>> {'term': {'roles.role': {'value': 'Reporter'}}}]} >>> } >>> }} >>> ]}}

Pandagg DSL

Pandagg provides a DSL to declare this query in a quite similar fashion:

>>> from pandagg.query import Nested, Bool, Query, Range, Term, Terms

>>> q = Query( >>> Bool(must=[ >>> Terms('genres', terms=['Action', 'Thriller']), >>> Range('rank', gte=7), >>> Nested( >>> path='roles', >>> query=Bool(must=[ >>> Term('roles.gender', value='F'), >>> Term('roles.role', value='Reporter') >>> ]) >>> ) >>> ]) >>>)

The serialized query is then available with query_dict method:

>>> q.query_dict() == expected_query True

A visual representation of the query helps to have a clearer view:

>>> q <Query> bool └── must ├── nested │ ├── path="roles" │ └── query │ └── bool │ └── must │ ├── term, field=roles.gender, value="F" │ └── term, field=roles.role, value="Reporter" ├── range, field=rank, gte=7 └── terms, field=genres, values=['Action', 'Thriller']

Chaining

Another way to declare this query is through chaining:

>>> from pandagg.utils import equal_queries >>> from pandagg.query import Nested, Bool, Query, Range, Term, Terms

>>> q = Query()>>> .query({'terms': {'genres': ['Action', 'Thriller']}})>>> .nested(path='roles', _name='nested_roles', query=Term('roles.gender', value='F'))>>> .query(Range('rank', gte=7))>>> .query(Term('roles.role', value='Reporter'), parent='nested_roles')

>>> equal_queries(q.query_dict(), expected_query) True

Note

equal_queries function won't consider order of clauses in must/should parameters since it actually doesn't matter in Elasticsearch execution, ie >>> equal_queries({'must': [A, B]}, {'must': [B, A]}) True

Regular syntax

Eventually, you can also use regular Elasticsearch dict syntax:

>>> q = Query(expected_query) >>> q <Query> bool └── must ├── nested │ ├── path="roles" │ └── query │ └── bool │ └── must │ ├── term, field=roles.gender, value="F" │ └── term, field=roles.role, value="Reporter" ├── range, field=rank, gte=7 └── terms, field=genres, values=['Action', 'Thriller']

Aggregation

Aggregation declaration

Aggregation response

TODO

Mapping

Here is a portion of IMDB example mapping:

>>> imdb_mapping = { >>> 'dynamic': False, >>> 'properties': { >>> 'movie_id': {'type': 'integer'}, >>> 'name': { >>> 'type': 'text', >>> 'fields': { >>> 'raw': {'type': 'keyword'} >>> } >>> }, >>> 'year': { >>> 'type': 'date', >>> 'format': 'yyyy' >>> }, >>> 'rank': {'type': 'float'}, >>> 'genres': {'type': 'keyword'}, >>> 'roles': { >>> 'type': 'nested', >>> 'properties': { >>> 'role': {'type': 'keyword'}, >>> 'actor_id': {'type': 'integer'}, >>> 'gender': {'type': 'keyword'}, >>> 'first_name': { >>> 'type': 'text', >>> 'fields': { >>> 'raw': {'type': 'keyword'} >>> } >>> }, >>> 'last_name': { >>> 'type': 'text', >>> 'fields': { >>> 'raw': {'type': 'keyword'} >>> } >>> } >>> } >>> } >>> } >>> }

Mapping DSL

The ~pandagg.tree.mapping.Mapping class provides a more compact view, which can help when dealing with large mappings:

>>> from pandagg.mapping import Mapping >>> m = Mapping(imdb_mapping) <Mapping> {Object} ├── genres Keyword ├── movie_id Integer ├── name Text │ └── raw ~ Keyword ├── rank Float ├── roles [Nested] │ ├── actor_id Integer │ ├── first_name Text │ │ └── raw ~ Keyword │ ├── gender Keyword │ ├── last_name Text │ │ └── raw ~ Keyword │ └── role Keyword └── year Date

With pandagg DSL, an equivalent declaration would be the following:

>>> from pandagg.mapping import Mapping, Object, Nested, Float, Keyword, Date, Integer, Text >>> >>> dsl_mapping = Mapping(properties=[ >>> Integer('movie_id'), >>> Text('name', fields=[ >>> Keyword('raw') >>> ]), >>> Date('year', format='yyyy'), >>> Float('rank'), >>> Keyword('genres'), >>> Nested('roles', properties=[ >>> Keyword('role'), >>> Integer('actor_id'), >>> Keyword('gender'), >>> Text('first_name', fields=[ >>> Keyword('raw') >>> ]), >>> Text('last_name', fields=[ >>> Keyword('raw') >>> ]) >>> ]) >>> ])

Which is exactly equivalent to initial mapping:

>>> dsl_mapping.serialize() == imdb_mapping True

Interactive mapping

In interactive context, the ~pandagg.interactive.mapping.IMapping class provides navigation features with autocompletion to quickly discover a large mapping:

>>> from pandagg.mapping import IMapping >>> m = IMapping(imdb_mapping) >>> m.roles <IMapping subpart: roles> roles [Nested] ├── actor_id Integer ├── first_name Text │ └── raw ~ Keyword ├── gender Keyword ├── last_name Text │ └── raw ~ Keyword └── role Keyword >>> m.roles.first_name <IMapping subpart: roles.first_name> first_name Text └── raw ~ Keyword

To get the complete field definition, just call it:

>>> m.roles.first_name() <Mapping Field first_name> of type text: { "type": "text", "fields": { "raw": { "type": "keyword" } } }

A IMapping instance can be bound to an Elasticsearch client to get quick access to aggregations computation on mapping fields.

Suppose you have the following client:

>>> from elasticsearch import Elasticsearch >>> client = Elasticsearch(hosts=['localhost:9200'])

Client can be bound either at initiation:

>>> m = IMapping(imdb_mapping, client=client, index_name='movies')

or afterwards through bind method:

>>> m = IMapping(imdb_mapping) >>> m.bind(client=client, index_name='movies')

Doing so will generate a a attribute on mapping fields, this attribute will list all available aggregation for that field type (with autocompletion):

>>> m.roles.gender.a.terms() [('M', {'key': 'M', 'doc_count': 2296792}), ('F', {'key': 'F', 'doc_count': 1135174})]

Note

Nested clauses will be automatically taken into account.

Cluster indices discovery

TODO