## Exploration on `elasticsearch_dsl`

In [1]:
import json
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

In [2]:
url = 'http://localhost:9200'
es = Elasticsearch(url)

### Example Search

In [3]:
s = Search()

query_string = 'tags:(machine learning)'
s = s.query('query_string', query=query_string, fields=['tags'])
s = s.filter('range', **{'datetime': {'gte': '2017-07-01 00:00:00'}})
s = s.highlight('tags')

res = es.search(index='bangda_blog',
                body=s.to_dict(),
                size=50)

num_results = len(res['hits']['hits'])
print('Number of results:', num_results)

Number of results: 29


In [4]:
# full query
full_query = s.to_dict()
print(json.dumps(full_query, indent=4))

{
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "datetime": {
                            "gte": "2017-07-01 00:00:00"
                        }
                    }
                }
            ],
            "must": [
                {
                    "query_string": {
                        "query": "tags:(machine learning)",
                        "fields": [
                            "tags"
                        ]
                    }
                }
            ]
        }
    },
    "highlight": {
        "fields": {
            "tags": {}
        }
    }
}


Search results:

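By default, `_score` is computed with Lucene's Practical Scoring Function, a similarity model based on **TF-IDF**. (From Elasticsearch 5.0 onward the default similarity is BM25, which builds on the same term-frequency and inverse-document-frequency ideas.)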
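To make the TF-IDF idea concrete, here is a minimal sketch in plain Python. It is a simplification for illustration, not Lucene's actual implementation: it omits field-length norms, query normalization, and boosts, and the function name `tfidf_score` and the toy corpus are assumptions introduced here.

```python
import math

def tfidf_score(query_terms, doc_tokens, corpus):
    """Toy TF-IDF score: sum over query terms of sqrt(tf) * idf.

    Simplified on purpose; real Lucene scoring has more factors.
    """
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        # Term frequency: how often the term occurs in this document.
        tf = math.sqrt(doc_tokens.count(term))
        # Document frequency: how many documents contain the term.
        df = sum(1 for doc in corpus if term in doc)
        # Rarer terms get a larger inverse document frequency.
        idf = 1.0 + math.log(n_docs / (df + 1))
        score += tf * idf
    return score

# Hypothetical, already-tokenized tag lists.
corpus = [
    ['machine', 'learning', 'deep', 'learning'],
    ['machine', 'learning', 'kaggle'],
    ['data', 'science', 'analytics'],
]
scores = [tfidf_score(['machine', 'learning'], doc, corpus) for doc in corpus]
print(scores)
```

The first document ranks highest because `learning` occurs twice in it, and the third scores zero because it contains neither query term.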

In [5]:
ml_blogs = []

print('{:67s} | {:78s} | {}'.format('Title', 'Tags', 'Score'))
print('-' * 162)    
for result in res['hits']['hits']:
    print('{:67s} | {:78s} | {}'.format(result['_source']['title'], result['_source']['tags'], result['_score']))
    ml_blogs.append(result['_source']['text'])

Title                                                               | Tags                                                                           | Score
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Deep Learning Basis & Cheatsheet                                    | [Machine Learning, Deep Learning]                                              | 1.773541
First Journey through Kaggle                                        | [Machine Learning, Kaggle]                                                     | 1.6966699
My Kaggle Experiences Summary                                       | [Machine Learning, Kaggle]                                                     | 1.6966699
Stanford NLP (coursera) Notes (10) - Relation Extraction            | [NLP, Machine Learning]                                                        | 1.6966699
L1 Norm Regularization Explained     

### Query DSL

Elasticsearch provides a full Query DSL based on JSON to define queries.

#### Compare `query context` and `filter context`

- A query clause used in query context answers the question "How well does this document match this query clause?" Besides deciding whether or not the document matches, the query clause also calculates a `_score` representing how well the document matches, relative to other documents.

- In filter context, a query clause answers the question "Does this document match this query clause?" The answer is a simple yes or no; no scores are calculated.
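The two contexts correspond to two keys of a `bool` query: clauses under `must` are scored, clauses under `filter` are not. As a rough sketch, the request body can be assembled by hand without `elasticsearch_dsl`. The helper name `build_bool_query` is hypothetical and exists only for this illustration.

```python
import json

def build_bool_query(scored_clauses, filter_clauses):
    """Hypothetical helper: scored clauses go under `must` (query context),
    yes/no clauses go under `filter` (filter context)."""
    bool_body = {}
    if filter_clauses:
        bool_body['filter'] = list(filter_clauses)
    if scored_clauses:
        bool_body['must'] = list(scored_clauses)
    return {'query': {'bool': bool_body}}

body = build_bool_query(
    scored_clauses=[
        {'query_string': {'query': 'tags:(machine learning)', 'fields': ['tags']}},
    ],
    filter_clauses=[
        {'range': {'datetime': {'gte': '2017-07-01 00:00:00'}}},
    ],
)
print(json.dumps(body, indent=4))
```

Moving the `range` clause from `filter_clauses` to `scored_clauses` places it under `must`, which is the change explored in the next cell.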

In [6]:
s = Search()

query_string = 'tags:(machine learning)'
s = s.query('query_string', query=query_string, fields=['tags'])
# change from filter to query
s = s.query('range', **{'datetime': {'gte': '2017-07-01 00:00:00'}})
s = s.highlight('tags')

# full query
full_query = s.to_dict()
print(json.dumps(full_query, indent=4))

{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "tags:(machine learning)",
                        "fields": [
                            "tags"
                        ]
                    }
                },
                {
                    "range": {
                        "datetime": {
                            "gte": "2017-07-01 00:00:00"
                        }
                    }
                }
            ]
        }
    },
    "highlight": {
        "fields": {
            "tags": {}
        }
    }
}


**Observation**: the `range` clause on `datetime` moves from `filter` to `must`, so it now runs in query context and contributes to `_score`.

#### Explore `term` / `match` queries and logical operators (`must`, `should`, `must_not`)

- Exact `learning`

In [7]:
s = Search()

s = s.query('term', title='learning')

# full query
full_query = s.to_dict()
print(json.dumps(full_query, indent=4))

res = es.search(index='bangda_blog', body=s.to_dict(), size=50)
num_results = len(res['hits']['hits'])
print('Number of results:', num_results)

print('{:67s} | {:78s} | {}'.format('Title', 'Tags', 'Score'))
print('-' * 162)    
for result in res['hits']['hits']:
    print('{:67s} | {:78s} | {}'.format(result['_source']['title'], result['_source']['tags'], result['_score']))

{
    "query": {
        "term": {
            "title": "learning"
        }
    }
}
Number of results: 13
Title                                                               | Tags                                                                           | Score
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Deep Learning Basis & Cheatsheet                                    | [Machine Learning, Deep Learning]                                              | 1.6724609
Deep Learning for Anomaly Detection                                 | [Machine Learning, Deep Learning, Anomaly Detection]                           | 1.5507748
AB Testing (Udacity) Learning Notes (1)                             | [Data Science, Analytics]                                                      | 1.4455951
AB Testing (Udacity) Learning Notes (2)                             | [Data Science, Analy

- Exact `deep learning`

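Passing `title='deep learning'` to `term` returns 0 results: `term` does not analyze its input, and the analyzed `title` field is indexed as separate tokens (`deep`, `learning`), so no single token `deep learning` exists. Instead, chain one `term` query per token, as in cell `In [8]`; chained queries are combined under `must`, which behaves like `AND` in SQL.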
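The sketch below imitates this behavior with a tiny inverted index in plain Python. The `analyze` function is only a rough stand-in for an Elasticsearch analyzer (lowercase, split on non-alphanumerics), and the `term` helper and document ids are assumptions made for illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r'[^a-z0-9]+', text.lower()) if tok]

titles = {
    1: 'Deep Learning Basis & Cheatsheet',
    2: 'Deep Learning for Anomaly Detection',
    3: 'First Journey through Kaggle',
    4: 'AB Testing (Udacity) Learning Notes (1)',
}

# Inverted index: token -> set of document ids whose title contains it.
index = defaultdict(set)
for doc_id, title in titles.items():
    for tok in analyze(title):
        index[tok].add(doc_id)

def term(value):
    """Exact lookup of one token; the value is NOT analyzed."""
    return index.get(value, set())

no_hits = term('deep learning')            # no such single token in the index
both = term('deep') & term('learning')     # `must` behaves like a set intersection

print(no_hits)
print(sorted(both))
```

The intersection keeps only the titles that contain both tokens, which mirrors how two chained `term` queries narrow the result set.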

In [8]:
s = Search()

s = s.query('term', title='deep')
s = s.query('term', title='learning')

# full query
full_query = s.to_dict()
print(json.dumps(full_query, indent=4))

res = es.search(index='bangda_blog', body=s.to_dict(), size=50)
num_results = len(res['hits']['hits'])
print('Number of results:', num_results)

print('{:67s} | {:78s} | {}'.format('Title', 'Tags', 'Score'))
print('-' * 162)    
for result in res['hits']['hits']:
    print('{:67s} | {:78s} | {}'.format(result['_source']['title'], result['_source']['tags'], result['_score']))

{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "title": "deep"
                    }
                },
                {
                    "term": {
                        "title": "learning"
                    }
                }
            ]
        }
    }
}
Number of results: 3
Title                                                               | Tags                                                                           | Score
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Deep Learning Basis & Cheatsheet                                    | [Machine Learning, Deep Learning]                                              | 4.912375
Deep Learning for Anomaly Detection                                 | [Machine Learning, Deep Learning, Anomaly Detection]                           

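- `deep` OR `learning` with `match`

Passing `title='deep learning'` to `match` returns every document whose title mentions `deep` **OR** `learning`, because `match` analyzes the query text and by default combines the resulting terms with `OR`. Cell `In [9]` runs `match` with the single term `learning`, which returns the same 13 results as the `term` query on `learning` above.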
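A minimal sketch of that `OR` behavior, again in plain Python rather than against a live index. The `analyze` and `match` helpers are illustrative assumptions. The `operator` argument mirrors the `operator` parameter of the real `match` query (`or` by default, `and` to require every term).

```python
import re

def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r'[^a-z0-9]+', text.lower()) if tok]

def match(title, query_text, operator='or'):
    """Toy `match`: analyze both sides, then combine per-term hits."""
    title_tokens = set(analyze(title))
    hits = [tok in title_tokens for tok in analyze(query_text)]
    return all(hits) if operator == 'and' else any(hits)

titles = [
    'Deep Learning Basis & Cheatsheet',
    'AB Testing (Udacity) Learning Notes (1)',
    'First Journey through Kaggle',
]

or_hits = [t for t in titles if match(t, 'deep learning')]
and_hits = [t for t in titles if match(t, 'deep learning', operator='and')]

print(or_hits)
print(and_hits)
```

With the default operator a title only needs one of the two terms; with `operator='and'` it needs both.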

In [9]:
s = Search()

s = s.query('match', title='learning')

# full query
full_query = s.to_dict()
print(json.dumps(full_query, indent=4))

res = es.search(index='bangda_blog', body=s.to_dict(), size=50)
num_results = len(res['hits']['hits'])
print('Number of results:', num_results)

print('{:67s} | {:78s} | {}'.format('Title', 'Tags', 'Score'))
print('-' * 162)    
for result in res['hits']['hits']:
    print('{:67s} | {:78s} | {}'.format(result['_source']['title'], result['_source']['tags'], result['_score']))

{
    "query": {
        "match": {
            "title": "learning"
        }
    }
}
Number of results: 13
Title                                                               | Tags                                                                           | Score
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Deep Learning Basis & Cheatsheet                                    | [Machine Learning, Deep Learning]                                              | 1.6724609
Deep Learning for Anomaly Detection                                 | [Machine Learning, Deep Learning, Anomaly Detection]                           | 1.5507748
AB Testing (Udacity) Learning Notes (1)                             | [Data Science, Analytics]                                                      | 1.4455951
AB Testing (Udacity) Learning Notes (2)                             | [Data Science, Anal

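Use `should`, which behaves like `OR` in SQL. Note that when a `bool` query also contains a `must` clause, as in cell `In [10]`, the `should` clauses become optional: they do not change which documents match, they only raise the score of documents that satisfy them.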
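The interaction between `must` and `should` can be sketched with a toy evaluator. This is an illustration under stated assumptions, not Elasticsearch's implementation: the function `bool_match` is hypothetical, `filter` and `must_not` are ignored, and the "score" is simply the number of matching clauses.

```python
def bool_match(tokens, must=(), should=()):
    """Toy bool query over a set of tokens; returns (matches, score).

    - every `must` term is required;
    - with no `must`, at least one `should` term is required;
    - with a `must`, `should` terms are optional and only add to the score.
    """
    tokens = set(tokens)
    if not all(term in tokens for term in must):
        return False, 0
    should_hits = sum(1 for term in should if term in tokens)
    if not must and should and should_hits == 0:
        return False, 0
    return True, len(must) + should_hits

# `must` + `should`: the document matches either way, `learning` only boosts it.
with_boost = bool_match({'deep', 'learning'}, must=['deep'], should=['learning'])
without_boost = bool_match({'deep', 'anomaly'}, must=['deep'], should=['learning'])

# `should` alone behaves like OR: at least one term has to match.
or_hit = bool_match({'learning', 'notes'}, should=['deep', 'learning'])
or_miss = bool_match({'kaggle'}, should=['deep', 'learning'])

print(with_boost, without_boost, or_hit, or_miss)
```

This is consistent with the `must`/`should` query returning the same 3 documents as the query that puts both terms under `must`: the `must` clause decides membership, and `should` only influences ranking.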

In [10]:
s = Search()

query = Q('bool', must=[Q('term', title='deep')], should=[Q('term', title='learning')])
s = s.query(query)

# full query
full_query = s.to_dict()
print(json.dumps(full_query, indent=4))

res = es.search(index='bangda_blog', body=s.to_dict(), size=50)
num_results = len(res['hits']['hits'])
print('Number of results:', num_results)

print('{:67s} | {:78s} | {}'.format('Title', 'Tags', 'Score'))
print('-' * 162)    
for result in res['hits']['hits']:
    print('{:67s} | {:78s} | {}'.format(result['_source']['title'], result['_source']['tags'], result['_score']))

{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "title": "deep"
                    }
                }
            ],
            "should": [
                {
                    "term": {
                        "title": "learning"
                    }
                }
            ]
        }
    }
}
Number of results: 3
Title                                                               | Tags                                                                           | Score
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Deep Learning Basis & Cheatsheet                                    | [Machine Learning, Deep Learning]                                              | 4.912375
Deep Learning for Anomaly Detection                                 | [Machine Learning, Deep Learning, Anomaly

#### Query on multiple fields

In [11]:
s = Search()

query = Q('bool', must=[Q('match', title='recommender'), Q('match', tags='learning')])
s = s.query(query)

# full query
full_query = s.to_dict()
print(json.dumps(full_query, indent=4))

res = es.search(index='bangda_blog', body=s.to_dict(), size=50)
num_results = len(res['hits']['hits'])
print('Number of results:', num_results)

print('{:67s} | {:78s} | {}'.format('Title', 'Tags', 'Score'))
print('-' * 162)    
for result in res['hits']['hits']:
    print('{:67s} | {:78s} | {}'.format(result['_source']['title'], result['_source']['tags'], result['_score']))

{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "title": "recommender"
                    }
                },
                {
                    "match": {
                        "tags": "learning"
                    }
                }
            ]
        }
    }
}
Number of results: 3
Title                                                               | Tags                                                                           | Score
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Matrix Factorization for Recommender Systems                        | [Recommender Sys, Machine Learning, Factorization Models]                      | 3.561971
Recommender Systems Overview - From A Survey                        | [Recommender Sys, Machine Learning]                                    