# ElasticSearch Process and Tips (Indexing, Analyzing, Querying)

### Things About Your Local ES Installation

There is a config directory where you have ES installed.  This is where you would edit the config yml file, or add files for stopwords or synonyms.  (This is important information you will need, so make sure you know where your install lives.)

To start up: run elasticsearch (bin/elasticsearch), then run bin/kibana inside the kibana directory.  Kibana is what hosts the plugins you might want, like Sense.  See below re Sense.

### The Sense Plugin to Kibana and ES is Useful:

Install Kibana, and then add plugins (they become "apps" under the tab in the Kibana toolbar: the little grid icon to the right of the "Sense" name in the pic below).

Sense plugin: https://github.com/bleskes/sense  (Note, it used to be part of Marvel.)
Install directions: https://www.elastic.co/guide/en/sense/current/installing.html

**Example: in the ES docs**:

    curl 'localhost:9200/_cat/indices?v'

**In Sense**:

    GET /_cat/indices?v

<img src="images/sense_ui.png">

In the github repo, there is a text file of Sense queries that will work after you index your data using the steps below.

### Using Python...  Slightly different API:

We will be loading the data from the stored dataframe, and then indexing the data in ES.  We can do queries against it from Python or using the Sense plugin.

In [41]:
from elasticsearch import Elasticsearch, client

#es = Elasticsearch(hosts=[{'host': 'elasticsearch.aws.blahblah.com', 'port': '9200'}])
localEs = Elasticsearch()
localclient = client.IndicesClient(localEs)

### Analyzers, Defaults, and Preventing Analysis

Analysis is the process of chopping up your text and storing it in a form that can be searched efficiently against.

#### Read this:

https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html

An analyzer is, in order, a sequence of:
* character filters (zero or more)
* a tokenizer (exactly one)
* token filters (zero or more)
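
To make the three stages concrete, here is a toy pure-Python imitation of the pipeline. It is only a sketch: the function names are made up for illustration, the regex tokenizer is far cruder than ES's standard tokenizer, and none of this is how ES implements analysis internally.

```python
import re

def char_filter(text):
    # Character filter: rewrites the raw string before it is tokenized.
    return text.replace("&", " and ")

def tokenizer(text):
    # Tokenizer: splits the string into tokens (here: runs of word characters).
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    # Token filter: changes tokens.
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=("the", "a")):
    # Token filter: removes tokens.
    return [t for t in tokens if t not in stopwords]

def toy_analyze(text):
    # Order matters: char filters, then the tokenizer, then token filters.
    tokens = tokenizer(char_filter(text))
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens

print(toy_analyze("The Cat & a Dog"))  # ['cat', 'and', 'dog']
```

Swapping the order of the two token filters would change the result ("The" would survive the stop filter and then be lowercased), which is why the filter order in an analyzer definition matters.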

To prevent analysis, you can specify "not_analyzed" as the "index" setting of the field in its mapping.  The Interwebs also suggest "keyword" as the analyzer for a field, but some folks claim it does some simple analysis.

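The default analyzer (if unspecified!) for string fields is "standard."  As a custom analyzer, it would be defined roughly like this (in the real "standard" analyzer the stop filter is disabled by default, which is why stopwords survive in the outputs further down):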

    {
        "type":      "custom",
        "tokenizer": "standard",
        "filter":  [ "lowercase", "stop" ]
    }
    
More on default analysis from the docs (https://www.elastic.co/guide/en/elasticsearch/guide/current/_controlling_analysis.html):

>While we can specify an analyzer at the field level, how do we determine which analyzer is used for a field if none is specified at the field level?
>
>Analyzers can be specified at several levels. Elasticsearch works through each level until it finds an analyzer that it can use. At index time, the order is as follows:
>
>1. The analyzer defined in the field mapping, else
>2. The analyzer named default in the index settings, which defaults to
>3. The standard analyzer
>
>...At search time, the sequence is slightly different:...
>
>1. The analyzer defined in the query itself, else
>2. The search_analyzer defined in the field mapping, else
>3. The analyzer defined in the field mapping, else
>4. The analyzer named default_search in the index settings, which defaults to
>5. The analyzer named default in the index settings, which defaults to
>6. The standard analyzer
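
The search-time order quoted above is a "first one that is set wins" lookup. A minimal sketch of that logic in plain Python follows; the function and parameter names are hypothetical and only mirror the six levels in the quote, they are not part of any ES API.

```python
def resolve_search_analyzer(query_analyzer=None,
                            field_search_analyzer=None,
                            field_analyzer=None,
                            index_default_search=None,
                            index_default=None):
    """Return the first analyzer name that is set, in search-time order."""
    candidates = [query_analyzer, field_search_analyzer, field_analyzer,
                  index_default_search, index_default]
    for name in candidates:
        if name is not None:
            return name
    # Nothing was configured anywhere: fall back to the standard analyzer.
    return "standard"

print(resolve_search_analyzer(field_analyzer="english"))  # english
print(resolve_search_analyzer())                          # standard
```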
    
#### We can inspect analysis with the "analyze" function (or "_analyze" in the curl style).

In [43]:
#localEs.indices.delete(index='my_index')
localEs.indices.create(index='my_index')

{u'acknowledged': True}

In [44]:
# this is the default analyzer ES will use if you don't specify one! Specify one!

localclient.analyze(index='my_index', analyzer='standard',
                                        text='My kitty-cat is adorable.')

{u'tokens': [{u'end_offset': 2,
   u'position': 0,
   u'start_offset': 0,
   u'token': u'my',
   u'type': u'<ALPHANUM>'},
  {u'end_offset': 8,
   u'position': 1,
   u'start_offset': 3,
   u'token': u'kitty',
   u'type': u'<ALPHANUM>'},
  {u'end_offset': 12,
   u'position': 2,
   u'start_offset': 9,
   u'token': u'cat',
   u'type': u'<ALPHANUM>'},
  {u'end_offset': 15,
   u'position': 3,
   u'start_offset': 13,
   u'token': u'is',
   u'type': u'<ALPHANUM>'},
  {u'end_offset': 24,
   u'position': 4,
   u'start_offset': 16,
   u'token': u'adorable',
   u'type': u'<ALPHANUM>'}]}

In [46]:
# A utility to make analysis results easier to read:

def get_analyzer_tokens(result):
    ''' Utility to combine tokens in an analyzer result. '''
    tokens = result[u'tokens']
    print(tokens)
    return ' '.join([token['token'] for token in tokens])

In [49]:
get_analyzer_tokens(localclient.analyze(index='my_index',
                                        analyzer="standard",
                                        text='My kitty-cat\'s a pain in the neck.'))

[{u'end_offset': 2, u'token': u'my', u'type': u'<ALPHANUM>', u'start_offset': 0, u'position': 0}, {u'end_offset': 8, u'token': u'kitty', u'type': u'<ALPHANUM>', u'start_offset': 3, u'position': 1}, {u'end_offset': 14, u'token': u"cat's", u'type': u'<ALPHANUM>', u'start_offset': 9, u'position': 2}, {u'end_offset': 16, u'token': u'a', u'type': u'<ALPHANUM>', u'start_offset': 15, u'position': 3}, {u'end_offset': 21, u'token': u'pain', u'type': u'<ALPHANUM>', u'start_offset': 17, u'position': 4}, {u'end_offset': 24, u'token': u'in', u'type': u'<ALPHANUM>', u'start_offset': 22, u'position': 5}, {u'end_offset': 28, u'token': u'the', u'type': u'<ALPHANUM>', u'start_offset': 25, u'position': 6}, {u'end_offset': 33, u'token': u'neck', u'type': u'<ALPHANUM>', u'start_offset': 29, u'position': 7}]


u"my kitty cat's a pain in the neck"

**NB: Prevent analysis with the "keyword" analyzer, or set the field's "index" to "not_analyzed" in the mapping.**

But if you do this, you need to match on the EXACT field contents to search for it.  Best to keep an analyzed copy too, if it's meant to be English searchable text.

In [7]:
get_analyzer_tokens(localclient.analyze(index='my_index',
                                        analyzer='keyword',
                                        text='My kitty-cat\'s a pain in the neck.'))

[{u'end_offset': 34, u'token': u"My kitty-cat's a pain in the neck.", u'type': u'word', u'start_offset': 0, u'position': 0}]


u"My kitty-cat's a pain in the neck."

## The Built-In ES "English" Analyzer:
### A useful analyzer for text is the built-in English one, which does this, approximately:

https://www.elastic.co/guide/en/elasticsearch/guide/current/language-intro.html

See: 
https://simpsora.wordpress.com/2014/05/02/customizing-elasticsearch-english-analyzer/

>Tokenizer: Standard tokenizer

>TokenFilters:
>* Standard token filter
>* English possessive filter, which removes trailing 's from words
>* Lowercase token filter
>* Stop token filter
>* Keyword marker filter, which protects certain tokens from modification by stemmers
>* Porter stemmer filter, which reduces words down to a base form (“stem”)


These are the stop-words defined for English:

    a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
    no, not, of, on, or, such, that, the, their, then, there, these,
    they, this, to, was, will, with
    
If you want to customize you can create a new filter yourself or use a file in your config directory for ES.    

In [48]:
# Try it on some text and see...
get_analyzer_tokens(localclient.analyze(index='my_index', analyzer='english',
                                        text='My kitty-cat\'s a pain in the neck.'))

[{u'end_offset': 2, u'token': u'my', u'type': u'<ALPHANUM>', u'start_offset': 0, u'position': 0}, {u'end_offset': 8, u'token': u'kitti', u'type': u'<ALPHANUM>', u'start_offset': 3, u'position': 1}, {u'end_offset': 14, u'token': u'cat', u'type': u'<ALPHANUM>', u'start_offset': 9, u'position': 2}, {u'end_offset': 21, u'token': u'pain', u'type': u'<ALPHANUM>', u'start_offset': 17, u'position': 4}, {u'end_offset': 33, u'token': u'neck', u'type': u'<ALPHANUM>', u'start_offset': 29, u'position': 7}]


u'my kitti cat pain neck'

If you wanted to customize the 'english' analyzer with your own special rules (extra stopwords etc), see here: https://www.elastic.co/guide/en/elasticsearch/guide/current/configuring-language-analyzers.html
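
As a sketch of what such a customization can look like: the docs linked above describe configuring a language analyzer by giving it a "type" of "english" plus your own "stopwords". The example below reuses the default stopword list shown earlier but drops "no" and "not", so negations stay searchable. The analyzer name "my_english" and the index name in the commented call are made up for illustration.

```python
import json

# Default English stopwords, copied from the list above.
ENGLISH_STOPWORDS = """a an and are as at be but by for if in into is it
no not of on or such that the their then there these
they this to was will with""".split()

# Keep negations searchable by leaving them out of the stopword list.
MY_STOPWORDS = [w for w in ENGLISH_STOPWORDS if w not in ("no", "not")]

ENGLISH_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {          # hypothetical analyzer name
                    "type": "english",
                    "stopwords": MY_STOPWORDS
                }
            }
        }
    }
}

print(json.dumps(ENGLISH_SETTINGS, indent=2))
# With a running cluster, something like:
# localEs.indices.create(index='some_new_index', body=json.dumps(ENGLISH_SETTINGS))
```

As with any analyzer, it only takes effect once a field mapping refers to it by name.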
        

## Analyzers and Custom Analyzers

You want to make sure you are explicit about types in your data, so that ES doesn't just guess and maybe get it wrong. Also, this is how you set explicit analysis.



Create a setting for the index:

    PUT /my_index
    {
        "settings": {
            "analysis": {
                "char_filter": { ... custom character filters ... },
                "tokenizer":   { ...    custom tokenizers     ... },
                "filter":      { ...   custom token filters   ... },
                "analyzer":    { ...    custom analyzers referring to the definitions above ... }
            }
        }
    }
    
For example - this saves a bunch of analysis components into an analyzer called 'my_analyzer':

    PUT /my_index
    {
        "settings": {
            "analysis": {
                "char_filter": {
                    "&_to_and": {
                        "type":       "mapping",
                        "mappings": [ "&=> and "]
                }},
                "filter": {
                    "my_stopwords": {
                        "type":       "stop",
                        "stopwords": [ "the", "a" ]
                }},
                "analyzer": {
                    "my_analyzer": {
                        "type":         "custom",
                        "char_filter":  [ "html_strip", "&_to_and" ],
                        "tokenizer":    "standard",
                        "filter":       [ "lowercase", "my_stopwords" ]
                }}
    }}}
    
Then you **use it**, by referring to it in a mapping for a document in this index:

    PUT /my_index/_mapping/my_type
    {
        "properties": {
            "title": {
                "type":      "string",
                "analyzer":  "my_analyzer"
            }
        }
    }
    
#### Remember: If you don't assign it to a field in a mapping, you aren't using it.

In Python:

In [53]:
MY_SETTINGS = {
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
        }}
}

MAPPING = {
    "my_doc_type": {
        "properties": {
            "title": {
                "type":      "string",
                "analyzer":  "my_analyzer"
            }
        }
    }
}

## Stopwords Note

The default list of stopwords is indicated thusly:
    
>"stopwords": "\_english\_"

So you can specify both that filter and a custom stopwords list, if you want.
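
A minimal sketch of combining the two, assuming hypothetical filter and analyzer names ("english_stop", "my_extra_stopwords", "my_stop_analyzer") and made-up extra stopwords: one stop filter uses the built-in English list, a second one carries the custom list, and the analyzer chains them.

```python
import json

STOPWORD_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                # Built-in English list, referenced by its special name.
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                # Extra domain-specific words (made up for this example).
                "my_extra_stopwords": {"type": "stop",
                                       "stopwords": ["yelp", "review"]}
            },
            "analyzer": {
                "my_stop_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase first, so the stop filters see lowercased tokens
                    "filter": ["lowercase", "english_stop", "my_extra_stopwords"]
                }
            }
        }
    }
}

print(json.dumps(STOPWORD_SETTINGS, indent=2))
```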

In [55]:
import json

#localEs.indices.delete(index='my_index')  # uncomment if this index exists and you need to remake it
localEs.indices.create(index='my_index', body=json.dumps(MY_SETTINGS))
localEs.indices.put_mapping(index='my_index', doc_type="my_doc_type", body=json.dumps(MAPPING))

{u'acknowledged': True}

In [56]:
# Check that your mapping looks right!

localclient.get_mapping(index='my_index')

{u'my_index': {u'mappings': {u'my_doc_type': {u'properties': {u'title': {u'analyzer': u'my_analyzer',
      u'type': u'string'}}}}}}

In [57]:
res = localclient.analyze(index='my_index', analyzer='my_analyzer',\
                          text="<p>This is the title & a Capitalized Word!</p>")

In [58]:
get_analyzer_tokens(res)

[{u'end_offset': 7, u'token': u'this', u'type': u'<ALPHANUM>', u'start_offset': 3, u'position': 0}, {u'end_offset': 10, u'token': u'is', u'type': u'<ALPHANUM>', u'start_offset': 8, u'position': 1}, {u'end_offset': 20, u'token': u'title', u'type': u'<ALPHANUM>', u'start_offset': 15, u'position': 3}, {u'end_offset': 22, u'token': u'and', u'type': u'<ALPHANUM>', u'start_offset': 21, u'position': 4}, {u'end_offset': 36, u'token': u'capitalized', u'type': u'<ALPHANUM>', u'start_offset': 25, u'position': 6}, {u'end_offset': 41, u'token': u'word', u'type': u'<ALPHANUM>', u'start_offset': 37, u'position': 7}]


u'this is title and capitalized word'

## Tokenizers vs. Analyzers - Be Careful.

Some of the names in ES are confusing.  There is a **"standard" analyzer** and a **"standard" tokenizer**. https://www.elastic.co/guide/en/elasticsearch/guide/current/standard-tokenizer.html#standard-tokenizer

Check them out:

In [35]:
get_analyzer_tokens(localclient.analyze(index='my_index',
                                        analyzer='standard',
                                        text='My kitty-cat\'s not a pain in the \'neck\'!'))

[{u'end_offset': 2, u'token': u'my', u'type': u'<ALPHANUM>', u'start_offset': 0, u'position': 0}, {u'end_offset': 8, u'token': u'kitty', u'type': u'<ALPHANUM>', u'start_offset': 3, u'position': 1}, {u'end_offset': 14, u'token': u"cat's", u'type': u'<ALPHANUM>', u'start_offset': 9, u'position': 2}, {u'end_offset': 18, u'token': u'not', u'type': u'<ALPHANUM>', u'start_offset': 15, u'position': 3}, {u'end_offset': 20, u'token': u'a', u'type': u'<ALPHANUM>', u'start_offset': 19, u'position': 4}, {u'end_offset': 25, u'token': u'pain', u'type': u'<ALPHANUM>', u'start_offset': 21, u'position': 5}, {u'end_offset': 28, u'token': u'in', u'type': u'<ALPHANUM>', u'start_offset': 26, u'position': 6}, {u'end_offset': 32, u'token': u'the', u'type': u'<ALPHANUM>', u'start_offset': 29, u'position': 7}, {u'end_offset': 38, u'token': u'neck', u'type': u'<ALPHANUM>', u'start_offset': 34, u'position': 8}]


u"my kitty cat's not a pain in the neck"

In [34]:
#  The difference is subtle but there.
get_analyzer_tokens(localclient.analyze(index='my_index',
                                        tokenizer="standard",
                                        text='My kitty-cat\'s not a pain in the \'neck\'!'))

[{u'end_offset': 2, u'token': u'My', u'type': u'<ALPHANUM>', u'start_offset': 0, u'position': 0}, {u'end_offset': 8, u'token': u'kitty', u'type': u'<ALPHANUM>', u'start_offset': 3, u'position': 1}, {u'end_offset': 14, u'token': u"cat's", u'type': u'<ALPHANUM>', u'start_offset': 9, u'position': 2}, {u'end_offset': 18, u'token': u'not', u'type': u'<ALPHANUM>', u'start_offset': 15, u'position': 3}, {u'end_offset': 20, u'token': u'a', u'type': u'<ALPHANUM>', u'start_offset': 19, u'position': 4}, {u'end_offset': 25, u'token': u'pain', u'type': u'<ALPHANUM>', u'start_offset': 21, u'position': 5}, {u'end_offset': 28, u'token': u'in', u'type': u'<ALPHANUM>', u'start_offset': 26, u'position': 6}, {u'end_offset': 32, u'token': u'the', u'type': u'<ALPHANUM>', u'start_offset': 29, u'position': 7}, {u'end_offset': 38, u'token': u'neck', u'type': u'<ALPHANUM>', u'start_offset': 34, u'position': 8}]


u"My kitty cat's not a pain in the neck"

In [37]:
# However, the english analyzer lowercases everything and also removes the negation,
# because "not" is in the stopwords list. (Both analyzer and tokenizer are passed here;
# the output below is what the analyzer produces.)
get_analyzer_tokens(localclient.analyze(index='my_index',
                                        analyzer="english",
                                        tokenizer="standard",
                                        text='My kitty-cat\'s not a pain in the \'neck\'!'))

[{u'end_offset': 2, u'token': u'my', u'type': u'<ALPHANUM>', u'start_offset': 0, u'position': 0}, {u'end_offset': 8, u'token': u'kitti', u'type': u'<ALPHANUM>', u'start_offset': 3, u'position': 1}, {u'end_offset': 14, u'token': u'cat', u'type': u'<ALPHANUM>', u'start_offset': 9, u'position': 2}, {u'end_offset': 25, u'token': u'pain', u'type': u'<ALPHANUM>', u'start_offset': 21, u'position': 5}, {u'end_offset': 38, u'token': u'neck', u'type': u'<ALPHANUM>', u'start_offset': 34, u'position': 8}]


u'my kitti cat pain neck'

## Indexing Yelp Data

In [2]:
import pandas as pd

In [3]:
df = pd.read_msgpack("data/yelp_df_forES.msg")

In [4]:
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,user_id,tokens,text_length,net_sentiment,sent_per_token,fake_name
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,rLtl8ZkDX5vH5nAx9C3q5Q,wife took birthday breakfast excellent weather...,542,35.0,0.064576,The Our Saturday
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,0hT2KtfLiobPvh6cDC8JQg,love gyro plate rice good also dig candy selec...,50,6.0,0.12,Rice
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",uZetl9T0NcROGOyFfughhg,rosie dakota love chaparral dog park convenien...,280,10.0,0.035714,Rosie Dakota LOVE
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,vYmM4KTsC8ZfQBg-j5MWkw,general manager scott petello good egg detail ...,248,13.0,0.052419,General Manager Scott
7,hW0Ne_HTHEAgGF1rAdmR-g,2012-07-12,JL7GXJ9u4YMx7Rzs05NfiQ,4,"Luckily, I didn't have to travel far to make m...",1ieuYcKS7zeAv_U15AB13A,luckily travel far make connecting flight than...,162,13.0,0.080247,Luckily And Phoenix


In [22]:
# test with a small sample if you want
dfshort = df.query('stars >= 5 and net_sentiment > 35')

In [23]:
len(dfshort)

358

In [14]:
# filter out any rows with a nan for sent_per_token, which breaks bulk load:
df = df[df.sent_per_token.notnull()]
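
A likely reason NaN breaks the load (an assumption, not verified against the ES error here): Python's json module writes NaN as the bare token NaN, which is not valid JSON. The sketch below shows that, plus a hypothetical helper, nan_to_none, for cases where you would rather keep the row and send null than drop it.

```python
import json
import math

# json.dumps happily emits NaN, which strict JSON parsers reject.
print(json.dumps({"sent_per_token": float("nan")}))  # {"sent_per_token": NaN}

def nan_to_none(doc):
    """Return a copy of doc with float NaN values replaced by None (JSON null)."""
    return {key: (None if isinstance(value, float) and math.isnan(value) else value)
            for key, value in doc.items()}

print(json.dumps(nan_to_none({"sent_per_token": float("nan"), "stars": 5})))
```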

In [15]:
MAPPING = {
    'review': {
        'properties': {
            'business_id': {'index': 'not_analyzed', 'type': 'string'},
            'date': {'index': 'not_analyzed', 'format': 'dateOptionalTime', 'type': 'date'},
            'review_id': {'index': 'not_analyzed', 'type': 'string'},
            'stars': {'index': 'not_analyzed', 'type': 'integer'},
            'text': {'index': 'analyzed',
                     'analyzer': 'english',
                     'store': 'yes',
                     'term_vector': 'with_positions_offsets_payloads',
                     'type': 'string'},
            'fake_name': {'index': 'not_analyzed', 'type': 'string'},
            'text_orig': {'index': 'not_analyzed', 'type': 'string'},
            'user_id': {'index': 'not_analyzed', 'type': 'string'},
            'net_sentiment': {'index': 'not_analyzed', 'type': 'integer'},
            'sent_per_token': {'index': 'not_analyzed', 'type': 'float'}
        }
    }
}

In [18]:
localEs.indices.delete(index='yelp', ignore=404)  # ignore=404: don't raise if the index doesn't exist yet (first run)
localEs.indices.create(index='yelp')
localEs.indices.put_mapping(index='yelp', doc_type='review', body=json.dumps(MAPPING))

{u'acknowledged': True}

In [19]:
# Bulk data is structured as alternating op_dict and data_dict entries.

bulk_data = []

for index, row in df.iterrows():
    data_dict = {}
    data_dict['text_orig'] = row['text']
    data_dict['text']= row['text']
    data_dict['net_sentiment'] = row['net_sentiment'] 
    data_dict['sent_per_token'] = row['sent_per_token']
    data_dict['stars'] = row['stars']
    data_dict['fake_name'] = row['fake_name']
    data_dict['user_id'] = row['user_id']
    data_dict['business_id'] = row['business_id']
    data_dict['date'] = row['date']
    data_dict['review_id'] = row['review_id']
    op_dict = {
            "index": {
            "_index": 'yelp',
            "_type": 'review',
            "_id": row['review_id']
            }
    }
    bulk_data.append(op_dict)
    bulk_data.append(data_dict)

In [20]:
bulk_data[0]

{'index': {'_id': u'fWKvX83p0-ka4JS3dc6E5A',
  '_index': 'yelp',
  '_type': 'review'}}

In [21]:
len(bulk_data)

344884

In [None]:
# One huge bulk request may time out, or fail without a useful error message.  Mine did, so see below.
res = localEs.bulk(index = 'yelp', body = bulk_data)
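
One way around a single giant request is to send the alternating list in smaller slices. The helper below is a sketch with a hypothetical name (bulk_batches) and an arbitrary default batch size; it only does the slicing, so it runs without a cluster. The commented lines show how it might be used with the client from above.

```python
def bulk_batches(bulk_data, docs_per_batch=1000):
    """Yield slices of bulk_data holding at most docs_per_batch (action, doc) pairs."""
    step = 2 * docs_per_batch  # every document takes two list entries
    for start in range(0, len(bulk_data), step):
        yield bulk_data[start:start + step]

# Tiny fake payload: 5 documents -> 10 list entries.
fake_bulk = []
for i in range(5):
    fake_bulk.append({"index": {"_index": "yelp", "_type": "review", "_id": str(i)}})
    fake_bulk.append({"text": "review %d" % i})

batches = list(bulk_batches(fake_bulk, docs_per_batch=2))
print([len(b) for b in batches])  # [4, 4, 2]

# With a live cluster, something like:
# for batch in bulk_batches(bulk_data):
#     res = localEs.bulk(index='yelp', body=batch)
#     if res['errors']:
#         print("this batch reported errors")
```

The Python client also ships a helpers module with its own bulk function that chunks requests for you; that may be the simpler route if you don't need control over the slicing.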

In [24]:
# In order to find the error, I did them one-by-one, with a try.
for ind, obj in enumerate(bulk_data):
    # every other one is the data, so use those to do it one by one
    if ind % 2 != 0:
        try:
            localEs.index(index='yelp', doc_type='review', id=obj['review_id'], body=json.dumps(obj))
        except Exception:
            # print the document that failed so it can be inspected
            print(obj)

In [82]:
localEs.search(index='yelp', doc_type='review', q='pizza-cookie')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'mkNmB283CAXzUfpQm0hr5Q',
    u'_index': u'yelp',
    u'_score': 2.1016731,
    u'_source': {u'business_id': u'8Hn5X1AqgmSLHRG2KgBJBg',
     u'date': u'2012-03-09',
     u'fake_name': u'Probably Phoenix There',
     u'net_sentiment': 5.0,
     u'review_id': u'mkNmB283CAXzUfpQm0hr5Q',
     u'sent_per_token': 0.16129032258064516,
     u'stars': 4,
     u'text': u'Great pizza, better pizza cookie.',
     u'text_orig': u'Great pizza, better pizza cookie.',
     u'user_id': u'4TZARz8AjNWfukK2UfZitA'},
    u'_type': u'review'},
   {u'_id': u'lxrcTePTWl6GpssJOPl1Sg',
    u'_index': u'yelp',
    u'_score': 1.7622974,
    u'_source': {u'business_id': u'ioGIaoswvbbeY8P965IYRA',
     u'date': u'2012-05-19',
     u'fake_name': u'Oregano Flagstaff NAU',
     u'net_sentiment': 5.0,
     u'review_id': u'lxrcTePTWl6GpssJOPl1Sg',
     u'sent_per_token': 0.0625,
     u'stars': 5,
     u'text': u'Save room at the e

Remember that relevance scores are based on TF-IDF statistics from the index: how often a term occurs in the matching doc, weighed against how many docs in the index contain that term:
    https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html

Want to explain why something matched?  You need the id of the matched doc.

In [59]:
localEs.explain(index='yelp', doc_type='review', q='pizza-cookie', id=u'fmn5yGrPChOYMR2vGOIrYA')

{u'_id': u'fmn5yGrPChOYMR2vGOIrYA',
 u'_index': u'yelp',
 u'_type': u'review',
 u'explanation': {u'description': u'sum of:',
  u'details': [{u'description': u'sum of:',
    u'details': [{u'description': u'weight(_all:pizza in 10428) [PerFieldSimilarity], result of:',
      u'details': [{u'description': u'score(doc=10428,freq=2.0), product of:',
        u'details': [{u'description': u'queryWeight, product of:',
          u'details': [{u'description': u'idf(docFreq=1896, maxDocs=36713)',
            u'details': [],
            u'value': 3.9628572},
           {u'description': u'queryNorm',
            u'details': [],
            u'value': 0.13838264}],
          u'value': 0.5483907},
         {u'description': u'fieldWeight in 10428, product of:',
          u'details': [{u'description': u'tf(freq=2.0), with freq of:',
            u'details': [{u'description': u'termFreq=2.0',
              u'details': [],
              u'value': 2.0}],
            u'value': 1.4142135},
           {u'descr
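
The tf and idf numbers in that explanation can be reproduced by hand. The sketch below uses the formulas of Lucene's classic TF-IDF similarity (tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1))), on the assumption that this index uses that default similarity, which the explain output suggests. The function names are just for illustration.

```python
import math

def tf(freq):
    # term frequency component: square root of the raw count
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    # inverse document frequency: rarer terms get a higher weight
    return 1.0 + math.log(float(max_docs) / (doc_freq + 1))

print(tf(2.0))                          # ~1.4142135, matches tf(freq=2.0)
print(idf(1896, 36713))                 # ~3.9628572, matches idf(docFreq=1896, maxDocs=36713)
print(idf(1896, 36713) * 0.13838264)    # ~0.5483907, matches queryWeight (idf * queryNorm)
```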

### More Like This

A variety of options for finding similar documents, including term counts and custom stop words: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-mlt-query.html
  
        

In [82]:
text = df.iloc[0].text
text

u'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

In [99]:
QUERY = {
      "query": { 
        "more_like_this" : {
            "fields" : ["text"],
            "like_text": text,
            "analyzer" : "english",
            "min_term_freq" : 2
            }
        }
    }

In [100]:
# Result is not brilliant, though.  You could keep only the hits that score above a threshold.
localEs.search(index='yelp', doc_type='review', body=json.dumps(QUERY))

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'fWKvX83p0-ka4JS3dc6E5A',
    u'_index': u'yelp',
    u'_score': 1.8600082,
    u'_source': {u'business_id': u'9yKzy9PApeiPPOUJEtnvkg',
     u'date': u'2011-01-26',
     u'fake_name': u'The Our Saturday',
     u'net_sentiment': 35.0,
     u'review_id': u'fWKvX83p0-ka4JS3dc6E5A',
     u'sent_per_token': 0.06457564575645756,
     u'stars': 5,
     u'text': u'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you
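
Note that the top hit above is the very review the text came from. A small client-side filter can drop it and apply a score threshold at the same time. This is a sketch: filter_hits is a hypothetical helper, and the response below is a cut-down fake with made-up ids and scores, not real output.

```python
def filter_hits(response, min_score, exclude_id=None):
    """Keep hits scoring at least min_score, optionally skipping one document id."""
    return [hit for hit in response['hits']['hits']
            if hit['_score'] >= min_score and hit['_id'] != exclude_id]

# Fake, abbreviated search response for illustration only.
fake_response = {'hits': {'hits': [
    {'_id': 'fWKvX83p0-ka4JS3dc6E5A', '_score': 1.86},
    {'_id': 'made-up-id-1', '_score': 0.90},
    {'_id': 'made-up-id-2', '_score': 0.20},
]}}

kept = filter_hits(fake_response, min_score=0.5,
                   exclude_id='fWKvX83p0-ka4JS3dc6E5A')
print([hit['_id'] for hit in kept])  # ['made-up-id-1']
```

The search body also accepts a top-level "min_score" setting if you would rather have ES do the thresholding; a useful cutoff value has to be found by experiment, since scores are not comparable across queries.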

### Suggestions: For Misspellings

Can be added to queries too, to help if there are no matches.  Still in development, though. See: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#search-suggesters

In [118]:
SUGGESTION = {
    "my-suggestion":
        {"text" : "cheese piza",
         "term" : { "field" : "text" }
        }
}

In [120]:
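# I don't love the results, tbh.  No options come back for "cheese": the suggester sees the analyzed
# token 'chees', and (probably) offers nothing because by default the term suggester only suggests
# for terms that are missing from the index.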

localEs.suggest(index='yelp', body=SUGGESTION)

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'my-suggestion': [{u'length': 6,
   u'offset': 0,
   u'options': [],
   u'text': u'chees'},
  {u'length': 4,
   u'offset': 7,
   u'options': [{u'freq': 4320, u'score': 0.75, u'text': u'pizza'},
    {u'freq': 647, u'score': 0.75, u'text': u'pita'},
    {u'freq': 21, u'score': 0.75, u'text': u'pima'},
    {u'freq': 18, u'score': 0.75, u'text': u'pina'},
    {u'freq': 3, u'score': 0.75, u'text': u'pizz'}],
   u'text': u'piza'}]}
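
To turn a response like this into a "did you mean" string, take the first option for each token and fall back to the token itself when there are no options. A sketch, using a hypothetical helper name and an abbreviated copy of the response above:

```python
def apply_suggestions(response, name):
    """Build a corrected string from the first option of each suggested token."""
    words = []
    for entry in response[name]:
        options = entry['options']
        words.append(options[0]['text'] if options else entry['text'])
    return ' '.join(words)

# Abbreviated copy of the suggest response shown above.
sample = {'my-suggestion': [
    {'text': 'chees', 'offset': 0, 'length': 6, 'options': []},
    {'text': 'piza', 'offset': 7, 'length': 4, 'options': [
        {'text': 'pizza', 'freq': 4320, 'score': 0.75},
        {'text': 'pita', 'freq': 647, 'score': 0.75}]},
]}

print(apply_suggestions(sample, 'my-suggestion'))  # chees pizza
```

The stemmed 'chees' shows the catch: suggestions come from the analyzed terms of the field, so for display you would probably want to run the suggester against a field with lighter analysis, or map back to the user's original words using the offsets.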

## Reminders:
* check your mapping on your fields
* check your analyzer results - they can be mysterious and hidden; if you configure wrong, it will use defaults...
* check your document tokenization
* use multi-fields to be sure of matches that may need stopwords too
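
For the last reminder, a multi-field mapping indexes the same source text more than once, each copy with its own analyzer. A minimal sketch in the ES 2.x mapping style used in this notebook; the sub-field name "std" is made up, and this mapping is not the one used for the indexing above.

```python
import json

MULTI_FIELD_MAPPING = {
    "review": {
        "properties": {
            "text": {
                "type": "string",
                "analyzer": "english",        # stemmed, stopwords removed
                "fields": {
                    "std": {                  # hypothetical sub-field name
                        "type": "string",
                        "analyzer": "standard"  # keeps stopwords such as "not"
                    }
                }
            }
        }
    }
}

print(json.dumps(MULTI_FIELD_MAPPING, indent=2))
```

Queries can then target "text" for stemmed matching or "text.std" when stopwords matter (for example a phrase containing "not").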

## Let's Index the Businesses too

In [25]:
biz = pd.read_msgpack("data/biz_stats_df.msg")

In [26]:
len(biz)

11479

In [27]:
biz.head(3)

Unnamed: 0,business_id,reviews,sent_per_token_median,net_sentiment_median,stars_median,stars_mean,stars_std,text_length_median,ids,fake_name
0,--5jkZ3-nUPZxUvtcbr8Uw,11,0.046753,10.0,5.0,4.545455,0.687552,327.0,--5jkZ3-nUPZxUvtcbr8Uw,George The Tzatziki
1,--BlvDO_RG2yElKu9XA1_g,29,0.041667,10.0,5.0,4.103448,1.291312,242.0,--BlvDO_RG2yElKu9XA1_g,Hawaii Hawaii That
2,-0D_CYhlD2ILkmLR0pBmnA,3,0.023339,13.0,4.0,3.333333,1.154701,298.0,-0D_CYhlD2ILkmLR0pBmnA,Saw Olive Gourmet


In [31]:
B_MAPPING = {
    'business': {
        'properties': {
            'business_id': {'index': 'not_analyzed', 'type': 'string'},
            'reviews': {'index': 'not_analyzed', 'type': 'integer'},
            'stars_median': {'index': 'not_analyzed', 'type': 'float'},
            'stars_mean': {'index': 'not_analyzed', 'type': 'float'},
            'text_length_median': {'index': 'not_analyzed', 'type': 'float'},
            'fake_name': {'index': 'not_analyzed', 'type': 'string'},
            'net_sentiment_median': {'index': 'not_analyzed', 'type': 'float'},
            'sent_per_token_median': {'index': 'not_analyzed', 'type': 'float'}
        }
    }
}

In [32]:
#localEs.indices.delete(index='yelp')  # nb: this errors the first time you run it. comment out.
#localEs.indices.create(index='yelp')  # do not do this if you already made the reviews!
localEs.indices.put_mapping(index='yelp', doc_type='business', body=json.dumps(B_MAPPING))

{u'acknowledged': True}

In [34]:
# Bulk data is structured as alternating op_dict and data_dict entries.

bulk_data = []

for index, row in biz.iterrows():
    data_dict = {}
    data_dict['net_sentiment_median'] = row['net_sentiment_median'] 
    data_dict['sent_per_token_median'] = row['sent_per_token_median']
    data_dict['stars_median'] = row['stars_median']
    data_dict['stars_mean'] = row['stars_mean']
    data_dict['fake_name'] = row['fake_name']
    data_dict['text_length_median'] = row['text_length_median']
    data_dict['business_id'] = row['business_id']
    data_dict['reviews'] = row['reviews']
    op_dict = {
            "index": {
            "_index": 'yelp',
            "_type": 'business',
            "_id": row['business_id']
            }
    }
    bulk_data.append(op_dict)
    bulk_data.append(data_dict)

In [35]:
# One huge bulk request may time out, or fail without a useful error message.  The reviews load did, see above.
res = localEs.bulk(index = 'yelp', body = bulk_data)

In [39]:
localEs.search(index='yelp', doc_type='business', q='JokKtdXU7zXHcr20Lrk29A')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'JokKtdXU7zXHcr20Lrk29A',
    u'_index': u'yelp',
    u'_score': 2.140637,
    u'_source': {u'business_id': u'JokKtdXU7zXHcr20Lrk29A',
     u'fake_name': u'This The The',
     u'net_sentiment_median': 11.0,
     u'reviews': 571,
     u'sent_per_token_median': 0.05241935483870968,
     u'stars_mean': 4.352014010507881,
     u'stars_median': 4.0,
     u'text_length_median': 231.0},
    u'_type': u'business'}],
  u'max_score': 2.140637,
  u'total': 1},
 u'timed_out': False,
 u'took': 6}

## Aggregate Queries to get Business IDs and More


Here we are using the operator "and" to make sure all words in the search match, and then getting counts of matching business IDs.

In [125]:
QUERY = {
    "query": {
        "match": {
            "text": {      
                "query":    "good pizza",
                "operator": "and"
            }
        }
    },
    "aggs" : {
      "businesses" : {
            "terms" : {
              "field" : "business_id"
            }
        }
    }
}

In [124]:
localEs.search(index="yelp", doc_type="review", body=json.dumps(QUERY))

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'aggregations': {u'businesses': {u'buckets': [{u'doc_count': 184,
     u'key': u'VVeogjZya58oiTxK7qUjAQ'},
    {u'doc_count': 123, u'key': u'pwpl-rxwNRQdgqFz_-qMPg'},
    {u'doc_count': 120, u'key': u'V1nEpIRmEa1768oj_tuxeQ'},
    {u'doc_count': 93, u'key': u'dcd3C1gWv-vVdQ9XYV8Ubw'},
    {u'doc_count': 69, u'key': u'7SO_rX1F6rQEl-5s3wZxgQ'},
    {u'doc_count': 68, u'key': u'VY_tvNUCCXGXQeSvJl757Q'},
    {u'doc_count': 52, u'key': u'8Hn5X1AqgmSLHRG2KgBJBg'},
    {u'doc_count': 52, u'key': u'LMG0zsAkUSscIvmV9vvm3A'},
    {u'doc_count': 46, u'key': u'XkNQVTkCEzBrq7OlRHI11Q'},
    {u'doc_count': 46, u'key': u'YQvg0JCGRFUkb6reMMf3Iw'}],
   u'doc_count_error_upper_bound': 23,
   u'sum_other_doc_count': 4053}},
 u'hits': {u'hits': [{u'_id': u'csIoVmNtuhxrQvNd0zPi6g',
    u'_index': u'yelp',
    u'_score': 2.7316122,
    u'_source': {u'business_id': u'e_riFHMoJ1Yguvr0KtOkDQ',
     u'date': u'2012-10-27',
     u'fake_name': u'Very T
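
The interesting part of that response is the bucket list, not the hits. A small sketch for pulling out (business_id, count) pairs; bucket_counts is a hypothetical helper and the sample is an abbreviated copy of the aggregation output above.

```python
def bucket_counts(response, agg_name):
    """Return (key, doc_count) pairs for a terms aggregation, in response order."""
    buckets = response['aggregations'][agg_name]['buckets']
    return [(bucket['key'], bucket['doc_count']) for bucket in buckets]

# Abbreviated copy of the aggregation part of the response above.
sample = {'aggregations': {'businesses': {'buckets': [
    {'doc_count': 184, 'key': 'VVeogjZya58oiTxK7qUjAQ'},
    {'doc_count': 123, 'key': 'pwpl-rxwNRQdgqFz_-qMPg'},
    {'doc_count': 120, 'key': 'V1nEpIRmEa1768oj_tuxeQ'},
]}}}

for business_id, count in bucket_counts(sample, 'businesses'):
    print("%s %d" % (business_id, count))
```

If only the counts are needed, adding "size": 0 to the search body asks ES to skip returning the hits themselves.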

In [128]:
# exact match on field: https://www.elastic.co/guide/en/elasticsearch/guide/master/_finding_exact_values.html
# requires a not_analyzed field for the match (the term query is not analyzed, so it must equal the indexed term exactly)
QUERY = {
  "query": {
    "constant_score" : { 
        "filter" : {
            "term" : { 
                      "business_id" : "VVeogjZya58oiTxK7qUjAQ"
                    }
            }
        }
  }
}

In [129]:
localEs.search(index="yelp", doc_type="business", body=json.dumps(QUERY))

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'VVeogjZya58oiTxK7qUjAQ',
    u'_index': u'yelp',
    u'_score': 1.0,
    u'_source': {u'business_id': u'VVeogjZya58oiTxK7qUjAQ',
     u'fake_name': u'Pizzeria Bianco Useless',
     u'net_sentiment_median': 9.0,
     u'reviews': 486,
     u'sent_per_token_median': 0.034909109373413225,
     u'stars_mean': 3.9526748971193415,
     u'stars_median': 4.0,
     u'text_length_median': 279.5},
    u'_type': u'business'}],
  u'max_score': 1.0,
  u'total': 1},
 u'timed_out': False,
 u'took': 656}
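
Since the same query shape is needed for every business id coming out of the aggregation, it can be wrapped in a small builder. This is a sketch with a hypothetical helper name; it only builds the request body, so it runs without a cluster.

```python
import json

def exact_match_query(field, value):
    """Build a constant_score/term query body for an exact match on one field."""
    return {"query": {"constant_score": {"filter": {"term": {field: value}}}}}

body = exact_match_query("business_id", "VVeogjZya58oiTxK7qUjAQ")
print(json.dumps(body))

# With a live cluster, something like:
# localEs.search(index="yelp", doc_type="business", body=json.dumps(body))
```

This only works because business_id is mapped as not_analyzed: on an analyzed field the id would have been lowercased at index time, and a term query with the original mixed-case value would find nothing.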

## Now Move to The JS App

Now that you have the data indexed and searchable, we can build a small app to iterate on, working toward the eventual UI you might want.  Use the "web" folder in my repo for that.

### Some Other Reference Materials:

* Tutorial slides from PyCon 2015, repo here (may be out of date!): https://github.com/erikrose/elasticsearch-tutorial
* Docs for the Python Lib: https://elasticsearch-py.readthedocs.org/en/master/index.html
