When a search request is executed via the pandagg.search.Search.execute method of pandagg.search.Search, a pandagg.response.Response instance is returned.
>>> from elasticsearch import Elasticsearch
>>> from pandagg.search import Search
>>>
>>> client = Elasticsearch(hosts=['localhost:9200'])
>>> response = (
...     Search(using=client, index='movies')
...     .size(2)
...     .filter('term', genres='Documentary')
...     .agg('avg_rank', 'avg', field='rank')
...     .execute()
... )
>>> response
<Response> took 9ms, success: True, total result >=10000, contains 2 hits
>>> response.__class__
pandagg.response.Response
The raw Elasticsearch response dict is available under the data attribute:
>>> response.data
{
    'took': 9,
    'timed_out': False,
    '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
    'hits': {
        'total': {'value': 10000, 'relation': 'gte'},
        'max_score': 0.0,
        'hits': [{'_index': 'movies', ...}]
    },
    'aggregations': {'avg_rank': {'value': 6.496829211219546}}
}
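Since data is a plain dict, its fields can be read with ordinary key access. The sketch below does not call pandagg or Elasticsearch: it works on a hand-written dict shaped like the response above, with the hit list left empty because the output above elides it. The helper name summarize is hypothetical and only serves the illustration.

```python
# A dict shaped like the raw response shown above. The hit list is left
# empty because the original output elides its content.
raw = {
    "took": 9,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 10000, "relation": "gte"},
        "max_score": 0.0,
        "hits": [],
    },
    "aggregations": {"avg_rank": {"value": 6.496829211219546}},
}


def summarize(raw_response):
    """Hypothetical helper: pull a few common fields out of a raw response dict."""
    total = raw_response["hits"]["total"]
    return {
        "took_ms": raw_response["took"],
        "total": total["value"],
        # A 'gte' relation means the value is a lower bound, not an exact count.
        "total_is_exact": total["relation"] == "eq",
        "avg_rank": raw_response["aggregations"]["avg_rank"]["value"],
    }


summary = summarize(raw)
print(summary)
```

The total_is_exact flag matters here: with a 'gte' relation, 10000 is only a lower bound on the number of matching documents.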
Hits are available under the hits attribute:
>>> response.hits
<Hits> total: >10000, contains 2 hits
>>> response.hits.total
{'value': 10000, 'relation': 'gte'}
>>> response.hits.hits
[<Hit 642> score=0.00, <Hit 643> score=0.00]
Those hits are instances of pandagg.response.Hit.
Iterating directly over a pandagg.response.Response yields those hits:
>>> list(response)
[<Hit 642> score=0.00, <Hit 643> score=0.00]
>>> hit = next(iter(response))
Each hit exposes its raw dict under the data attribute:
>>> hit.data
{'_index': 'movies',
 '_type': '_doc',
 '_id': '642',
 '_score': 0.0,
 '_source': {'movie_id': 642,
  'name': '10 Tage in Calcutta',
  'year': 1984,
  'genres': ['Documentary'],
  'roles': None,
  'nb_roles': 0,
  'directors': [{'director_id': 33096,
    'first_name': 'Reinhard',
    'last_name': 'Hauff',
    'full_name': 'Reinhard Hauff',
    'genres': ['Documentary', 'Drama', 'Musical', 'Short']}],
  'nb_directors': 1,
  'rank': None}}
>>> hit._index
'movies'
>>> hit._source
{'movie_id': 642,
 'name': '10 Tage in Calcutta',
 'year': 1984,
 'genres': ['Documentary'],
 'roles': None,
 'nb_roles': 0,
 'directors': [{'director_id': 33096,
   'first_name': 'Reinhard',
   'last_name': 'Hauff',
   'full_name': 'Reinhard Hauff',
   'genres': ['Documentary', 'Drama', 'Musical', 'Short']}],
 'nb_directors': 1,
 'rank': None}
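The raw hit nests document fields under _source, next to metadata keys such as _index and _id. As a rough illustration of working with that shape, the sketch below merges both into one flat dict. It uses a hand-copied subset of the hit shown above (roles and directors are left out for brevity), and hit_to_row is a hypothetical helper, not a pandagg function.

```python
# Subset of the raw hit shown above; 'roles', 'directors' and 'nb_directors'
# are omitted here only to keep the example short.
hit_data = {
    "_index": "movies",
    "_type": "_doc",
    "_id": "642",
    "_score": 0.0,
    "_source": {
        "movie_id": 642,
        "name": "10 Tage in Calcutta",
        "year": 1984,
        "genres": ["Documentary"],
        "nb_roles": 0,
        "rank": None,
    },
}


def hit_to_row(raw_hit):
    """Hypothetical helper: merge metadata keys and _source fields into one dict."""
    # Keep every metadata key except the nested _source itself.
    row = {key: value for key, value in raw_hit.items() if key != "_source"}
    # Add the document fields; a missing or null _source contributes nothing.
    row.update(raw_hit.get("_source") or {})
    return row


row = hit_to_row(hit_data)
print(row["_id"], row["name"], row["year"])
```

Metadata keys start with an underscore in this example, so they do not collide with the document fields; a document that itself contains such keys would need an explicit policy.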
If the pandas dependency is installed, hits can be parsed as a dataframe:
>>> response.hits.to_dataframe()
     _index  _score  _type  directors  genres  movie_id  name  nb_directors  nb_roles  rank  roles  year
_id
642  movies  0.0  _doc  [{'director_id': 33096, 'first_name': 'Reinhard', 'last_name': 'Hauff', 'full_name': 'Reinhard Hauff', 'genres': ['Documentary', 'Drama', 'Musical', 'Short']}]  [Documentary]  642  10 Tage in Calcutta  1  0  None  None  1984
643  movies  0.0  _doc  [{'director_id': 32148, 'first_name': 'Tanja', 'last_name': 'Hamilton', 'full_name': 'Tanja Hamilton', 'genres': ['Documentary']}]  [Documentary]  643  10 Tage, ein ganzes Leben  1  0  None  None  2004
Aggregations are handled differently: the aggregations attribute of a pandagg.response.Response returns a pandagg.response.Aggregations instance, which provides specific parsing abilities in addition to exposing the raw aggregations response under its data attribute.
Let's build a slightly more complex aggregation query to showcase its features:
>>> from elasticsearch import Elasticsearch
>>> from pandagg.search import Search
>>>
>>> client = Elasticsearch(hosts=['localhost:9200'])
>>> response = (
...     Search(using=client, index='movies')
...     .size(0)
...     .groupby('decade', 'histogram', interval=10, field='year')
...     .groupby('genres', size=3)
...     .agg('avg_rank', 'avg', field='rank')
...     .agg('avg_nb_roles', 'avg', field='nb_roles')
...     .filter('range', year={"gte": 1990})
...     .execute()
... )
Note
For more details about how to build an aggregation query, consult the user-guide.aggs section.
Using the data attribute:
>>> response.aggregations.data
{'decade': {'buckets': [{'key': 1990.0,
    'doc_count': 79495,
    'genres': {'doc_count_error_upper_bound': 0,
     'sum_other_doc_count': 38060,
     'buckets': [{'key': 'Drama',
       'doc_count': 12232,
       'avg_nb_roles': {'value': 18.518067364290385},
       'avg_rank': {'value': 5.981429367965072}},
      {'key': 'Short', ...
Using pandagg.response.Aggregations.to_normalized:
>>> response.aggregations.to_normalized()
{'level': 'root',
 'key': None,
 'value': None,
 'children': [{'level': 'decade',
   'key': 1990.0,
   'value': 79495,
   'children': [{'level': 'genres',
     'key': 'Drama',
     'value': 12232,
     'children': [{'level': 'avg_rank',
       'key': None,
       'value': 5.981429367965072},
      {'level': 'avg_nb_roles',
       'key': None,
       'value': 18.518067364290385}]},
    {'level': 'genres',
     'key': 'Short',
     'value': 12197,
     'children': [{'level': 'avg_rank',
       'key': None,
       'value': 6.311325829450123},
      ...
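Every node of the normalized output has the same keys (level, key, value, and children on non-leaf nodes), so it can be walked with one short recursive function. The sketch below runs on a hand-copied fragment of the output above (only the 1990 / Drama branch); iter_paths is a hypothetical helper written for this illustration, not part of pandagg.

```python
# Fragment of the normalized output shown above: only the 1990 > Drama branch.
normalized = {
    "level": "root",
    "key": None,
    "value": None,
    "children": [
        {
            "level": "decade",
            "key": 1990.0,
            "value": 79495,
            "children": [
                {
                    "level": "genres",
                    "key": "Drama",
                    "value": 12232,
                    "children": [
                        {"level": "avg_rank", "key": None, "value": 5.981429367965072},
                        {"level": "avg_nb_roles", "key": None, "value": 18.518067364290385},
                    ],
                }
            ],
        }
    ],
}


def iter_paths(node, path=()):
    """Yield (path, value) for every node below the given one, depth first."""
    # Leaf nodes carry no 'children' key, hence the default.
    for child in node.get("children", []):
        if child["key"] is None:
            label = child["level"]
        else:
            label = f"{child['level']}={child['key']}"
        child_path = path + (label,)
        yield child_path, child["value"]
        yield from iter_paths(child, child_path)


paths = list(iter_paths(normalized))
for path, value in paths:
    print(" > ".join(path), value)
```

The uniform node shape is what makes this format convenient: the traversal needs no knowledge of the aggregation types involved.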
Using pandagg.response.Aggregations.to_interactive_tree:
>>> response.aggregations.to_interactive_tree()
<IResponse>
root
├── decade=1990                                        79495
│   ├── genres=Documentary                              8393
│   │   ├── avg_nb_roles                  3.7789824854045038
│   │   └── avg_rank                       6.517093241977517
│   ├── genres=Drama                                   12232
│   │   ├── avg_nb_roles                  18.518067364290385
│   │   └── avg_rank                       5.981429367965072
│   └── genres=Short                                   12197
│       ├── avg_nb_roles                   3.023284414200213
│       └── avg_rank                       6.311325829450123
└── decade=2000                                        57649
    ├── genres=Documentary                              8639
    │   ├── avg_nb_roles                   5.581433036231045
    │   └── avg_rank                       6.980897812811443
    ├── genres=Drama                                   11500
    │   ├── avg_nb_roles                  14.385391304347825
    │   └── avg_rank                       6.269675415719865
    └── genres=Short                                   13451
        ├── avg_nb_roles                   4.053081555274701
        └── avg_rank                        6.83625304327684
Aggregations can also be parsed into a tabular format. Doing so requires identifying a level that draws the line between:
- grouping levels: those used to identify rows (here decade and genres), which provide the doc_count of each row
- column levels: those used to populate columns and cells (here avg_nb_roles and avg_rank)
The tabular format is especially well suited to aggregations with a T shape, such as the one above.
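To make the split between grouping levels and column levels concrete, here is a minimal sketch that flattens a raw nested aggregation dict into one row per innermost bucket. It is not pandagg's implementation: flatten is a hypothetical helper, the input is a hand-copied fragment of the raw response shown earlier (1990 decade, Drama and Short buckets only), and it assumes every column level is a single-value metric exposing a value key.

```python
# Fragment of the raw aggregation response: 1990 decade, two genre buckets.
raw_aggs = {
    "decade": {
        "buckets": [
            {
                "key": 1990.0,
                "doc_count": 79495,
                "genres": {
                    "buckets": [
                        {
                            "key": "Drama",
                            "doc_count": 12232,
                            "avg_nb_roles": {"value": 18.518067364290385},
                            "avg_rank": {"value": 5.981429367965072},
                        },
                        {
                            "key": "Short",
                            "doc_count": 12197,
                            "avg_nb_roles": {"value": 3.023284414200213},
                            "avg_rank": {"value": 6.311325829450123},
                        },
                    ]
                },
            }
        ]
    }
}


def flatten(aggs, grouping_levels, column_levels, prefix=()):
    """Hypothetical helper: one row per innermost bucket, keyed by grouping keys."""
    level, *deeper = grouping_levels
    rows = {}
    for bucket in aggs[level]["buckets"]:
        keys = prefix + (bucket["key"],)
        if deeper:
            # Still above the dividing level: descend into the next grouping level.
            rows.update(flatten(bucket, deeper, column_levels, keys))
        else:
            # Innermost grouping level: doc_count and metrics become the cells.
            row = {"doc_count": bucket["doc_count"]}
            for name in column_levels:
                row[name] = bucket[name]["value"]
            rows[keys] = row
    return rows


rows = flatten(raw_aggs, ["decade", "genres"], ["avg_rank", "avg_nb_roles"])
for keys, row in rows.items():
    print(keys, row)
```

Each key tuple identifies a row, and each column level contributes one cell, which is why a single chain of groupings ending in sibling metrics maps cleanly onto a table.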
Using pandagg.response.Aggregations.to_dataframe:
>>> response.aggregations.to_dataframe()
                    avg_nb_roles  avg_rank  doc_count
decade genres
1990.0 Drama           18.518067  5.981429      12232
       Short            3.023284  6.311326      12197
       Documentary      3.778982  6.517093       8393
2000.0 Short            4.053082  6.836253      13451
       Drama           14.385391  6.269675      11500
       Documentary      5.581433  6.980898       8639
Using pandagg.response.Aggregations.to_tabular:
>>> response.aggregations.to_tabular()
(['decade', 'genres'],
 {(1990.0, 'Drama'): {'doc_count': 12232,
   'avg_rank': 5.981429367965072,
   'avg_nb_roles': 18.518067364290385},
  (1990.0, 'Short'): {'doc_count': 12197,
   'avg_rank': 6.311325829450123,
   'avg_nb_roles': 3.023284414200213},
  (1990.0, 'Documentary'): {'doc_count': 8393,
   'avg_rank': 6.517093241977517,
   'avg_nb_roles': 3.7789824854045038},
  (2000.0, 'Short'): {'doc_count': 13451,
   'avg_rank': 6.83625304327684,
   'avg_nb_roles': 4.053081555274701},
  (2000.0, 'Drama'): {'doc_count': 11500,
   'avg_rank': 6.269675415719865,
   'avg_nb_roles': 14.385391304347825},
  (2000.0, 'Documentary'): {'doc_count': 8639,
   'avg_rank': 6.980897812811443,
   'avg_nb_roles': 5.581433036231045}})
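The pair shown above (a list of index names plus a dict keyed by tuples) is easy to turn into flat records, for instance to write a CSV file without pandas. The sketch below assumes exactly that shape and uses two rows copied from the output; tabular_to_records is a hypothetical helper, and other parameter values of to_tabular may produce a different structure.

```python
import csv
import io

# Two rows copied from the to_tabular() output shown above.
index_names = ["decade", "genres"]
rows = {
    (1990.0, "Drama"): {
        "doc_count": 12232,
        "avg_rank": 5.981429367965072,
        "avg_nb_roles": 18.518067364290385,
    },
    (2000.0, "Short"): {
        "doc_count": 13451,
        "avg_rank": 6.83625304327684,
        "avg_nb_roles": 4.053081555274701,
    },
}


def tabular_to_records(index_names, rows):
    """Hypothetical helper: merge each key tuple with its cells into one flat dict."""
    records = []
    for keys, cells in rows.items():
        # Pair each index name with the matching element of the key tuple.
        record = dict(zip(index_names, keys))
        record.update(cells)
        records.append(record)
    return records


records = tabular_to_records(index_names, rows)

# Serialize the records as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

The column order of the CSV follows the first record: index names first, then the cells in the order they appear in the row dict.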
Note
TODO - explain parameters:
- index_orient
- grouped_by
- expand_columns
- expand_sep
- normalize
- with_single_bucket_groups