# Indexing in MongoDB:

## Single-field indexes:

Let's evaluate the two situations -where we used index and the one in which we don't- performance wise.

Let's query the restaurant with: 41156888 as id.

In [15]:
import pymongo as pg #Among other things handles conncection between the db and python environement 
from pprint import pprint #pretty print
import datetime #Needed to format the datetime input later on.
import pandas as pd

In [3]:
client=pg.MongoClient()
db=client.test

In [6]:
db.restaurants.find_one()

{'_id': ObjectId('5cccb9c3d398404a2c2a03ea'),
 'address': {'building': '351',
  'coord': [-73.98513559999999, 40.7676919],
  'street': 'West   57 Street',
  'zipcode': '10019'},
 'borough': 'Manhattan',
 'cuisine': 'Irish',
 'grades': [{'date': datetime.datetime(2014, 9, 6, 0, 0),
   'grade': 'A',
   'score': 2},
  {'date': datetime.datetime(2013, 7, 22, 0, 0), 'grade': 'A', 'score': 11},
  {'date': datetime.datetime(2012, 7, 31, 0, 0), 'grade': 'A', 'score': 12},
  {'date': datetime.datetime(2011, 12, 29, 0, 0), 'grade': 'A', 'score': 12}],
 'name': 'Dj Reynolds Pub And Restaurant',
 'restaurant_id': '30191841'}

In [31]:
print("Straight search")
res=db.restaurants.find({'restaurant_id':'41156888'}).explain()
pd.DataFrame({'Examined documents':[res['executionStats']['totalDocsExamined']],'Time to execute(ms)':[res['executionStats']['executionTimeMillis']],'Num of results':[res['executionStats']['nReturned']]})

Straight search


Unnamed: 0,Examined documents,Num of results,Time to execute(ms)
0,25359,1,30


We examined all the files! (British museum search).

In [39]:
print('Using limit(1)')
limres=db.restaurants.find({'restaurant_id':'41156888'}).limit(1).explain()
pd.DataFrame({'Examined documents':[limres['executionStats']['totalDocsExamined']],'Time to execute(ms)':[limres['executionStats']['executionTimeMillis']],'Num of results':[limres['executionStats']['nReturned']]})

Using limit(1)


Unnamed: 0,Examined documents,Num of results,Time to execute(ms)
0,6027,1,8


There is a considerable decrease for both the number of examined files and the execution time. Although, this is not a very attractive way for tackling this problem of query performance in general, and that's because...well for starters, it's not always the case that we are looking only for one of the documents fulfilling the given condition. In this case since the restaurant id is unique, it's quite convenient, we get the same result as a full search with a lower cost...In fact, the research starts and once it founds a result matching, it stops so we get that sense of improved performance while it's only due to finding the wanted result sooner rather than later (It could have found it much later and the performance wouldn't be this good). More technically speaking, limit() determinates how many time we are going to iterate over the cursor (that contains the result of the query) in order to generate the returned value. If we are going to iterate only once, once a result is obtained continuing has no meaning. So it's improved performance but it's 'fake' improvement, in other words there is no heuristic (when it comes to limit()) that optimizes the search within the documents. General queries don't always get this convenient. 

Indexes are a great way to optimize queries like this because they organize data by a given field to let MongoDB find it

quickly. We will try to create an index on the retaurant_id field:

### Index creation:

In [54]:
db.restaurants.create_index('restaurant_id') #Create index on the restaurant_id field.

'restaurant_id_1'

In [59]:
pprint(db.restaurants.index_information())

{'_id_': {'key': [('_id', 1)], 'ns': 'test.restaurants', 'v': 2},
 'restaurant_id_1': {'key': [('restaurant_id', 1)],
                     'ns': 'test.restaurants',
                     'v': 2}}


In [61]:
print("Indexed search:")
Indres=db.restaurants.find({'restaurant_id':'41156888'}).explain()
pd.DataFrame({'Examined documents':[res['executionStats']['totalDocsExamined']],'Time to execute(ms)':[res['executionStats']['executionTimeMillis']],'Num of results':[res['executionStats']['nReturned']]})

Indexed search:


Unnamed: 0,Examined documents,Num of results,Time to execute(ms)
0,760,14,1


Best performance so far, thing to be expected. Although this gives great read peerformance although it hurts the insertion performances when we have a lot of them, since they too need update at each transaction. So the task of choosing fields to index is really important keeping in mind this trade-off.

## Compound indexes: