# MongoDB

MongoDB is a document database. It stores JSON-like documents, which are encoded internally as BSON (binary JSON).

- [Documentation](https://docs.mongodb.com)
- [Query selectors](https://docs.mongodb.com/manual/reference/operator/query/#query-selectors)

Note that MongoDB also provides a GUI, [MongoDB Compass](https://www.mongodb.com/products/compass), that might be useful when you are getting familiar with MongoDB. However, we will focus only on `pymongo`.

## Concepts

- What a document database is
- Why document databases
- Collections ~ tables
- Documents ~ rows
- Joins are possible, but it is more common to embed nested objects
- [Basic data manipulation: CRUD](https://docs.mongodb.com/manual/crud/)
- Using `find`
- Simple summaries
- Using the `aggregate` method and setting up pipelines
- Geospatial queries
- Creating indexes to speed up queries

In [1]:
from pymongo import MongoClient, GEOSPHERE
from bson.objectid import ObjectId
from bson.son import SON

In [2]:
import requests
from bson import json_util

In [3]:
import collections
from pathlib import Path

In [4]:
import os

In [5]:
from pprint import pprint

## Set up

This connects to the MongoDB daemon (`mongod`). With no arguments, `MongoClient` connects to `localhost` on port 27017.

In [6]:
client = MongoClient()

This specifies the database, after dropping any existing copy so that we start from a clean state. It does not matter if the database does not exist.

In [7]:
client.drop_database('starwars')

In [8]:
db = client.starwars

This specifies a `collection`

In [9]:
people = db.people

Check what collections are in the database. Note that the `people` collection is only created when the first document is inserted.

In [10]:
db.list_collection_names()

[]

## Get Data

In [11]:
base_url = 'http://swapi.dev/api/'

In [12]:
# Build the URL by concatenation; os.path.join is meant for file paths, not URLs
resp = requests.get(base_url + 'people/1')
data = resp.json()

In [13]:
data

{'name': 'Luke Skywalker',
 'height': '172',
 'mass': '77',
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue',
 'birth_year': '19BBY',
 'gender': 'male',
 'homeworld': 'http://swapi.dev/api/planets/1/',
 'films': ['http://swapi.dev/api/films/1/',
  'http://swapi.dev/api/films/2/',
  'http://swapi.dev/api/films/3/',
  'http://swapi.dev/api/films/6/'],
 'species': [],
 'vehicles': ['http://swapi.dev/api/vehicles/14/',
  'http://swapi.dev/api/vehicles/30/'],
 'starships': ['http://swapi.dev/api/starships/12/',
  'http://swapi.dev/api/starships/22/'],
 'created': '2014-12-09T13:50:51.644000Z',
 'edited': '2014-12-20T21:17:56.891000Z',
 'url': 'http://swapi.dev/api/people/1/'}

We will fetch details of the homeworld and starships as a nested document.

In [14]:
def get_nested(d):
    d['homeworld']  = requests.get(d['homeworld']).json()
    urls = d['starships']
    starships = [requests.get(url).json() for url in urls]
    d['starships']  = starships
    return d

The REST API returns numbers as strings, so we convert integer-like strings to `int` and leave all other strings unchanged.

In [15]:
def convert_str(x):
    try:
        return int(x)
    except ValueError:
        return x

def to_num(data):
    for key in data:
        val = data[key]
        if isinstance(val, str):
            data[key] = convert_str(val)
        elif isinstance(val, dict):
            for k, v in val.items():
                if isinstance(v, str):
                    val[k] = convert_str(v)
        elif isinstance(val, list):
            for i, item in enumerate(val):
                if isinstance(item, str):
                    data[key][i] = convert_str(item)
                elif isinstance(item, dict):
                    for k, v in item.items():
                        if isinstance(v, str):
                            data[key][i][k] = convert_str(v)      
    return data
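
The conversion above handles a fixed nesting depth (top level, one nested dict, and lists of dicts). A recursive variant is sketched below under the assumption that the data contains only dicts, lists, and scalars. The name `to_num_recursive` and the `starships` entry in `sample` are illustrative and not part of the notebook.

```python
def convert_str(x):
    """Return int(x) if the string x is an integer literal, else x unchanged."""
    try:
        return int(x)
    except ValueError:
        return x


def to_num_recursive(value):
    """Recursively convert integer-like strings at any nesting depth.

    Returns a new structure instead of modifying the input in place.
    """
    if isinstance(value, str):
        return convert_str(value)
    if isinstance(value, dict):
        return {k: to_num_recursive(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_num_recursive(v) for v in value]
    return value  # numbers, None, booleans pass through untouched


# Illustrative sample; the starship entry is made up for the example
sample = {
    'name': 'Luke Skywalker',
    'height': '172',
    'homeworld': {'name': 'Tatooine', 'diameter': '10465', 'gravity': '1 standard'},
    'starships': [{'name': 'example ship', 'crew': '1', 'MGLT': 'unknown'}],
}

converted = to_num_recursive(sample)
print(converted['height'], converted['homeworld']['diameter'])
```

Unlike `to_num`, this version leaves the original dictionary untouched, which can be convenient when the raw API response is still needed.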

In [16]:
data = to_num(get_nested(data))

In [17]:
data

{'name': 'Luke Skywalker',
 'height': 172,
 'mass': 77,
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue',
 'birth_year': '19BBY',
 'gender': 'male',
 'homeworld': {'name': 'Tatooine',
  'rotation_period': 23,
  'orbital_period': 304,
  'diameter': 10465,
  'climate': 'arid',
  'gravity': '1 standard',
  'terrain': 'desert',
  'surface_water': 1,
  'population': 200000,
  'residents': ['http://swapi.dev/api/people/1/',
   'http://swapi.dev/api/people/2/',
   'http://swapi.dev/api/people/4/',
   'http://swapi.dev/api/people/6/',
   'http://swapi.dev/api/people/7/',
   'http://swapi.dev/api/people/8/',
   'http://swapi.dev/api/people/9/',
   'http://swapi.dev/api/people/11/',
   'http://swapi.dev/api/people/43/',
   'http://swapi.dev/api/people/62/'],
  'films': ['http://swapi.dev/api/films/1/',
   'http://swapi.dev/api/films/3/',
   'http://swapi.dev/api/films/4/',
   'http://swapi.dev/api/films/5/',
   'http://swapi.dev/api/films/6/'],
  'created': '2014-12-09T13:50:4

## Insertion

### Single inserts

In [18]:
result = people.insert_one(data)

In [19]:
db.list_collection_names()

['people']

### Bulk inserts

We load some previously retrieved values from file to avoid hitting the SWAPI server repeatedly.

In [20]:
import pickle

with open('sw.pickle', 'rb') as f:
    xs = pickle.load(f)

In [21]:
result = people.insert_many(xs)

In [22]:
result.inserted_ids

[ObjectId('5f5f8fa42324e433baa62a71'),
 ObjectId('5f5f8fa42324e433baa62a72'),
 ObjectId('5f5f8fa42324e433baa62a73'),
 ObjectId('5f5f8fa42324e433baa62a74'),
 ObjectId('5f5f8fa42324e433baa62a75'),
 ObjectId('5f5f8fa42324e433baa62a76'),
 ObjectId('5f5f8fa42324e433baa62a77'),
 ObjectId('5f5f8fa42324e433baa62a78'),
 ObjectId('5f5f8fa42324e433baa62a79')]

## Queries

In [23]:
people.find_one(
    # search criteria
    {'name': 'Luke Skywalker'}, 
    # values to return
    {'name': True, 
     'hair_color': True,
     'skin_color': True, 
     'eye_color': True
    } 
)

{'_id': ObjectId('5f5f8fa42324e433baa62a70'),
 'name': 'Luke Skywalker',
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue'}

In [24]:
for p in people.find(
    # search criteria
    {}, 
    # values to return
    {'name': True, 
     'hair_color': True,
     'skin_color': True, 
     'eye_color': True
    } 
):
    print(p)

{'_id': ObjectId('5f5f8fa42324e433baa62a70'), 'name': 'Luke Skywalker', 'hair_color': 'blond', 'skin_color': 'fair', 'eye_color': 'blue'}
{'_id': ObjectId('5f5f8fa42324e433baa62a71'), 'name': 'C-3PO', 'hair_color': 'n/a', 'skin_color': 'gold', 'eye_color': 'yellow'}
{'_id': ObjectId('5f5f8fa42324e433baa62a72'), 'name': 'R2-D2', 'hair_color': 'n/a', 'skin_color': 'white, blue', 'eye_color': 'red'}
{'_id': ObjectId('5f5f8fa42324e433baa62a73'), 'name': 'Darth Vader', 'hair_color': 'none', 'skin_color': 'white', 'eye_color': 'yellow'}
{'_id': ObjectId('5f5f8fa42324e433baa62a74'), 'name': 'Leia Organa', 'hair_color': 'brown', 'skin_color': 'light', 'eye_color': 'brown'}
{'_id': ObjectId('5f5f8fa42324e433baa62a75'), 'name': 'Owen Lars', 'hair_color': 'brown, grey', 'skin_color': 'light', 'eye_color': 'blue'}
{'_id': ObjectId('5f5f8fa42324e433baa62a76'), 'name': 'Beru Whitesun lars', 'hair_color': 'brown', 'skin_color': 'light', 'eye_color': 'blue'}
{'_id': ObjectId('5f5f8fa42324e433baa62a77'

### Using object ID

Note that an `ObjectId` is NOT a string. You must convert a string to an `ObjectId` (e.g. `ObjectId('5f5f8fa42324e433baa62a71')`) before using it in a query.

From the official docs, the `ObjectId` consists of:

- a 4-byte value representing the seconds since the Unix epoch,
- a 5-byte random value, and
- a 3-byte counter, starting with a random value.

In particular, because the timestamp comes first, sorting by `ObjectId` gives an approximate ordering by creation time. The ordering is only approximate when the IDs are generated on different machines.
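
The layout described above can be checked by hand with the standard library alone. This is a minimal sketch, assuming the ID is given as a 24-character hex string; `split_object_id` is a hypothetical helper and not part of `pymongo`.

```python
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 24-character ObjectId hex string into its three fields."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError('an ObjectId is exactly 12 bytes (24 hex characters)')
    seconds = int.from_bytes(raw[:4], 'big')  # 4-byte timestamp
    return {
        'timestamp': datetime.fromtimestamp(seconds, tz=timezone.utc),
        'random': raw[4:9].hex(),                   # 5-byte random value
        'counter': int.from_bytes(raw[9:], 'big'),  # 3-byte counter
    }


# Two consecutive IDs taken from the insert_many output above
first = split_object_id('5f5f8fa42324e433baa62a71')
second = split_object_id('5f5f8fa42324e433baa62a72')

print(first['timestamp'].isoformat())
print(first['random'] == second['random'])    # same random value
print(second['counter'] - first['counter'])   # counter advanced by one
```

With `pymongo` itself, the same timestamp is available through the `generation_time` attribute of an `ObjectId`.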

In [25]:
result.inserted_ids[0]

ObjectId('5f5f8fa42324e433baa62a71')

In [26]:
people.find_one(
    result.inserted_ids[0],
    {'name': True, 'hair_color': True, 'skin_color': True, 'eye_color': True}
)

{'_id': ObjectId('5f5f8fa42324e433baa62a71'),
 'name': 'C-3PO',
 'hair_color': 'n/a',
 'skin_color': 'gold',
 'eye_color': 'yellow'}

### Bulk queries

The general `find` method returns a cursor, where each entry is a dictionary.

In [27]:
for person in people.find(
    {'gender': 'male'}
):
    print(person['name'])

Luke Skywalker
Darth Vader
Owen Lars
Biggs Darklighter
Obi-Wan Kenobi


You can also explicitly define the projection.

In [28]:
for x in people.find(
    {'gender': 'male'},             
    {
        '_id': False,
        'name': True,
        'gender': True
    }
): 
    pprint(x)

{'gender': 'male', 'name': 'Luke Skywalker'}
{'gender': 'male', 'name': 'Darth Vader'}
{'gender': 'male', 'name': 'Owen Lars'}
{'gender': 'male', 'name': 'Biggs Darklighter'}
{'gender': 'male', 'name': 'Obi-Wan Kenobi'}


#### Using regex search

In [29]:
for x in people.find(
    {
        'name': {'$regex': '^L'},
    },
    {
        'name': True, 
        'gender': True, 
        '_id': False
    }
):
    pprint(x)

{'gender': 'male', 'name': 'Luke Skywalker'}
{'gender': 'female', 'name': 'Leia Organa'}


The above example passes the pattern as a string to the `$regex` operator, which MongoDB evaluates as a Perl-compatible regular expression (PCRE). You can also pass a compiled Python regular expression with `pymongo`.

In [30]:
import re

name_pat = re.compile(r'^l', re.IGNORECASE)

In [31]:
for x in people.find(
    {
        'name': name_pat,
    },
    {
        'name': True,
        'gender': True,
        '_id': False
    }
):
    pprint(x)

{'gender': 'male', 'name': 'Luke Skywalker'}
{'gender': 'female', 'name': 'Leia Organa'}
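
To preview which names a pattern would select before sending it to the server, the same pattern can be tried locally with Python's `re` module. This is only a sketch: it assumes the ten names seen in the outputs above and uses `search`, since a regex query matches anywhere in the string unless the pattern is anchored.

```python
import re

# Names copied from the query outputs above
names = [
    'Luke Skywalker', 'C-3PO', 'R2-D2', 'Darth Vader', 'Leia Organa',
    'Owen Lars', 'Beru Whitesun lars', 'R5-D4', 'Biggs Darklighter',
    'Obi-Wan Kenobi',
]

# Anchored, case-insensitive: names starting with 'l' or 'L'
name_pat = re.compile(r'^l', re.IGNORECASE)
starts_with_l = [n for n in names if name_pat.search(n)]

# Unanchored: matches 'lars' anywhere in the name
lars_pat = re.compile(r'lars', re.IGNORECASE)
contains_lars = [n for n in names if lars_pat.search(n)]

print(starts_with_l)
print(contains_lars)
```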


#### Using relational operators

In [32]:
for x in people.find(
    {
        'mass': {'$lt': 100},
    },
    {
        'name': True, 
        'mass': True, 
        '_id': False
    }
):
    pprint(x)

{'mass': 77, 'name': 'Luke Skywalker'}
{'mass': 75, 'name': 'C-3PO'}
{'mass': 32, 'name': 'R2-D2'}
{'mass': 49, 'name': 'Leia Organa'}
{'mass': 75, 'name': 'Beru Whitesun lars'}
{'mass': 32, 'name': 'R5-D4'}
{'mass': 84, 'name': 'Biggs Darklighter'}
{'mass': 77, 'name': 'Obi-Wan Kenobi'}


In [33]:
mass_range = {'$lt': 100, '$gt': 50}

In [34]:
for x in people.find(
    {
        'mass': mass_range,
    },
    {
        'name': True, 
        'mass': True,
        '_id': False
    }
):
    pprint(x)

{'mass': 77, 'name': 'Luke Skywalker'}
{'mass': 75, 'name': 'C-3PO'}
{'mass': 75, 'name': 'Beru Whitesun lars'}
{'mass': 84, 'name': 'Biggs Darklighter'}
{'mass': 77, 'name': 'Obi-Wan Kenobi'}
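
Several operators in one condition document are combined with an implicit AND. The sketch below mimics that behaviour in plain Python for the four comparison operators only; `in_range` is a hypothetical helper, and the masses are copied from the output of the `$lt` query above.

```python
import operator

# Mapping from MongoDB comparison operators to Python functions
OPS = {
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
}


def in_range(value, spec):
    """True if a numeric value satisfies every operator in spec (implicit AND)."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        # Non-numeric values such as 'unknown' never match a numeric bound
        return False
    return all(OPS[op](value, bound) for op, bound in spec.items())


masses = {
    'Luke Skywalker': 77, 'C-3PO': 75, 'R2-D2': 32, 'Leia Organa': 49,
    'Beru Whitesun lars': 75, 'R5-D4': 32, 'Biggs Darklighter': 84,
    'Obi-Wan Kenobi': 77,
}

mass_range = {'$lt': 100, '$gt': 50}
selected = [name for name, mass in masses.items() if in_range(mass, mass_range)]
print(selected)
```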


#### Nested search

Nowadays, many relational databases allow you to store data as JSON columns.  However, document databases allow the convenience of nested searches.

In [35]:
for x in people.find(
    {
        'homeworld.name': 'Tatooine',
    },
    {
        'name': True, 
        'species.name': True, 
        '_id': False
    }
):
    pprint(x)

{'name': 'Luke Skywalker', 'species': []}
{'name': 'C-3PO', 'species': []}
{'name': 'Darth Vader', 'species': []}
{'name': 'Owen Lars', 'species': []}
{'name': 'Beru Whitesun lars', 'species': []}
{'name': 'R5-D4', 'species': []}
{'name': 'Biggs Darklighter', 'species': []}
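
The dotted key `'homeworld.name'` walks into the embedded document. A rough plain-Python equivalent is sketched below; `get_path` is a hypothetical helper that only follows nested dicts, whereas MongoDB's dot notation also looks inside arrays.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as 'homeworld.name' through nested dicts.

    Returns None when any step of the path is missing.
    """
    current = doc
    for key in dotted.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Trimmed-down documents with the same shape as those in the collection
docs = [
    {'name': 'Luke Skywalker', 'homeworld': {'name': 'Tatooine'}},
    {'name': 'Leia Organa', 'homeworld': {'name': 'Alderaan'}},
]

from_tatooine = [d['name'] for d in docs if get_path(d, 'homeworld.name') == 'Tatooine']
print(from_tatooine)
```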


#### Matching multiple criteria

This is quite subtle. By default, when a query places several criteria on fields inside an array of embedded documents, each criterion may be satisfied by a different array element. Here `Obi-Wan Kenobi` is returned because each of the 3 conditions is matched by at least one of his starships, even though no single starship matches all 3 criteria.

In [36]:
for x in people.find(
    {'name': 'Obi-Wan Kenobi'},
    {
        'starships': True,
        '_id': False
    }
):
    pprint(x)

{'starships': [{'MGLT': 'unknown',
                'cargo_capacity': 60,
                'consumables': '7 days',
                'cost_in_credits': 180000,
                'created': '2014-12-20T17:35:23.906000Z',
                'crew': 1,
                'edited': '2014-12-20T21:23:49.930000Z',
                'films': ['http://swapi.dev/api/films/5/',
                          'http://swapi.dev/api/films/6/'],
                'hyperdrive_rating': '1.0',
                'length': 8,
                'manufacturer': 'Kuat Systems Engineering',
                'max_atmosphering_speed': 1150,
                'model': 'Delta-7 Aethersprite-class interceptor',
                'name': 'Jedi starfighter',
                'passengers': 0,
                'pilots': ['http://swapi.dev/api/people/10/',
                           'http://swapi.dev/api/people/58/'],
                'starship_class': 'Starfighter',
                'url': 'http://swapi.dev/api/starships/48/'},
               {'MGLT

In [37]:
for x in people.find(
    {
        'starships.cost_in_credits': {'$lt': 250000},
        'starships.max_atmosphering_speed': {'$gt': 500},
        'starships.passengers': {'$gt': 0}
    },
    {
        'name': True, 
        'starship.name': True, 
        'starships.max_atmosphering_speed': True,
        'starships.passengers': True,
        'starships.cost_in_credits': True,     
        '_id': False
    }
):
    pprint(x)

{'name': 'Luke Skywalker',
 'starships': [{'cost_in_credits': 149999,
                'max_atmosphering_speed': 1050,
                'passengers': 0},
               {'cost_in_credits': 240000,
                'max_atmosphering_speed': 850,
                'passengers': 20}]}
{'name': 'Obi-Wan Kenobi',
 'starships': [{'cost_in_credits': 180000,
                'max_atmosphering_speed': 1150,
                'passengers': 0},
               {'cost_in_credits': 125000000,
                'max_atmosphering_speed': 1050,
                'passengers': 48247},
               {'cost_in_credits': 'unknown',
                'max_atmosphering_speed': 1050,
                'passengers': 3},
               {'cost_in_credits': 320000,
                'max_atmosphering_speed': 1500,
                'passengers': 0},
               {'cost_in_credits': 168000,
                'max_atmosphering_speed': 1100,
                'passengers': 0}]}


#### Matching multiple criteria simultaneously

To find someone with a single starship that matches all 3 conditions, we need to use the `$elemMatch` operator.

In [38]:
for x in people.find(
    {
        'starships': {
            '$elemMatch': { 
                'cost_in_credits': {'$lt': 250000},
                'max_atmosphering_speed': {'$gt': 500},
                'passengers': {'$gt': 1}
            }
        }
    },
    {
        'name': True, 
        'starship.name': True, 
        'starships.max_atmosphering_speed': True,
        'starships.passengers': True,
        'starships.cost_in_credits': True,     
        '_id': False
    }
):
    pprint(x)

{'name': 'Luke Skywalker',
 'starships': [{'cost_in_credits': 149999,
                'max_atmosphering_speed': 1050,
                'passengers': 0},
               {'cost_in_credits': 240000,
                'max_atmosphering_speed': 850,
                'passengers': 20}]}
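
The difference between the two queries can be reproduced in plain Python. The sketch below uses the starship values printed above for Luke Skywalker and Obi-Wan Kenobi, with `passengers > 0` as the third condition; the function names are hypothetical, and the `is_num` guard stands in for the fact that a string such as `'unknown'` does not satisfy a numeric comparison.

```python
def is_num(x):
    """True for ints and floats, excluding booleans."""
    return isinstance(x, (int, float)) and not isinstance(x, bool)


conditions = [
    lambda s: is_num(s['cost_in_credits']) and s['cost_in_credits'] < 250000,
    lambda s: is_num(s['max_atmosphering_speed']) and s['max_atmosphering_speed'] > 500,
    lambda s: is_num(s['passengers']) and s['passengers'] > 0,
]


def match_across_elements(ships, conditions):
    """Dot-notation semantics: each condition may be met by a different element."""
    return all(any(cond(s) for s in ships) for cond in conditions)


def match_single_element(ships, conditions):
    """$elemMatch semantics: one element must meet every condition."""
    return any(all(cond(s) for cond in conditions) for s in ships)


def ship(cost, speed, passengers):
    return {'cost_in_credits': cost, 'max_atmosphering_speed': speed,
            'passengers': passengers}


luke = [ship(149999, 1050, 0), ship(240000, 850, 20)]
obi_wan = [ship(180000, 1150, 0), ship(125000000, 1050, 48247),
           ship('unknown', 1050, 3), ship(320000, 1500, 0),
           ship(168000, 1100, 0)]

print(match_across_elements(obi_wan, conditions), match_single_element(obi_wan, conditions))
print(match_across_elements(luke, conditions), match_single_element(luke, conditions))
```

Obi-Wan Kenobi passes only the first test, while Luke Skywalker passes both because his second starship satisfies all three conditions at once.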


## Aggregate Queries

In [39]:
people.count_documents({'homeworld.name': 'Tatooine'})

7

In [40]:
people.distinct('homeworld.name')

['Alderaan', 'Naboo', 'Stewjon', 'Tatooine']

### Using aggregate

The `aggregate` method runs a **pipeline**: a list of stages that are executed in order, with the output documents of each stage passed as input to the next. Typical stages are `$match` to filter documents, `$group` to summarize them, and `$sort` to order the results.

Filter and count

In [41]:
cmds = [
     {'$match': {'homeworld.name': 'Tatooine'}},
     {'$group': {'_id': '$homeworld.name', 
                 'count': {'$sum': 1}}},
]

In [42]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'Tatooine', 'count': 7}


Filter and find total mass

In [43]:
cmds = [
     {'$match': {'homeworld.name': 'Tatooine'}},
     {'$group': {'_id': '$homeworld.name', 
                 'total_mass': {'$sum': '$mass'}}},
]

In [44]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'Tatooine', 'total_mass': 599}


Total mass of the inhabitants of each planet

In [45]:
cmds = [
     {'$group': {'_id': '$homeworld.name', 
                 'total_mass': {'$sum': '$mass'}}},
]

In [46]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'Tatooine', 'total_mass': 599}
{'_id': 'Naboo', 'total_mass': 32}
{'_id': 'Stewjon', 'total_mass': 77}
{'_id': 'Alderaan', 'total_mass': 49}


Filter, group by, and sort.

In [47]:
cmds = [
     {
         '$match': {
             'mass': {
                 '$lt': 100
                     }
         },
     },
     {
         '$group': {
             '_id': '$homeworld.name',
             'total_mass': {'$sum': '$mass'},
             'avg_mass': {'$avg': '$mass'}
         },
     },
     {
        '$sort': { 
            'avg_mass': -1
        }
     }
]

In [48]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'Stewjon', 'avg_mass': 77.0, 'total_mass': 77}
{'_id': 'Tatooine', 'avg_mass': 68.6, 'total_mass': 343}
{'_id': 'Alderaan', 'avg_mass': 49.0, 'total_mass': 49}
{'_id': 'Naboo', 'avg_mass': 32.0, 'total_mass': 32}
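
To see what each stage contributes, the pipeline above can be mimicked in plain Python. This is a sketch only: the function names are hypothetical, and the sample is a trimmed copy of the collection. The masses for Darth Vader and Owen Lars do not appear in the outputs above and are assumptions chosen to be consistent with the Tatooine total of 599.

```python
from collections import defaultdict


def person(name, mass, planet):
    return {'name': name, 'mass': mass, 'homeworld': {'name': planet}}


people_sample = [
    person('Luke Skywalker', 77, 'Tatooine'),
    person('C-3PO', 75, 'Tatooine'),
    person('R2-D2', 32, 'Naboo'),
    person('Darth Vader', 136, 'Tatooine'),  # assumed mass
    person('Leia Organa', 49, 'Alderaan'),
    person('Owen Lars', 120, 'Tatooine'),    # assumed mass
    person('Beru Whitesun lars', 75, 'Tatooine'),
    person('R5-D4', 32, 'Tatooine'),
    person('Biggs Darklighter', 84, 'Tatooine'),
    person('Obi-Wan Kenobi', 77, 'Stewjon'),
]


def match_stage(docs, max_mass):
    """Like {'$match': {'mass': {'$lt': max_mass}}}."""
    return [d for d in docs if d['mass'] < max_mass]


def group_stage(docs):
    """Like $group on '$homeworld.name' with $sum and $avg of '$mass'."""
    masses = defaultdict(list)
    for d in docs:
        masses[d['homeworld']['name']].append(d['mass'])
    return [{'_id': planet, 'total_mass': sum(ms), 'avg_mass': sum(ms) / len(ms)}
            for planet, ms in masses.items()]


def sort_stage(groups):
    """Like {'$sort': {'avg_mass': -1}} (descending)."""
    return sorted(groups, key=lambda g: g['avg_mass'], reverse=True)


# Each stage consumes the output of the previous one
result = sort_stage(group_stage(match_stage(people_sample, 100)))
for row in result:
    print(row)
```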


#### SQL equivalent (approximate)

```sql
SELECT homeworld.name, SUM(mass) AS total_mass, AVG(mass) AS avg_mass
FROM people
JOIN homeworld
ON people.homeworld_id = homeworld.homeworld_id
WHERE mass < 100
GROUP BY homeworld.name
ORDER BY avg_mass DESC
```

### Using MapReduce

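With `MapReduce` you get the full power of JavaScript, but it is more complex and often less efficient. You should use `aggregate` in preference to `map_reduce` in most cases. Note that map-reduce is deprecated as of MongoDB 5.0 and `Collection.map_reduce` was removed in PyMongo 4.0, so the cells below require an older PyMongo release.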

- In the map stage, you create a (key, value) pair
- In the reduce stage, you perform a reduction (e.g. sum) of the values associated with each key
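
The two stages can be sketched in plain Python before writing any JavaScript. The `map_reduce` function below is a hypothetical, in-memory stand-in for the server-side operation, and the eye colors are those of the first seven people printed earlier.

```python
from collections import defaultdict


def map_eye_color(doc):
    """Map stage: emit one (key, value) pair per document."""
    yield doc['eye_color'], 1


def reduce_sum(key, values):
    """Reduce stage: combine all values emitted for one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


docs = [
    {'name': 'Luke Skywalker', 'eye_color': 'blue'},
    {'name': 'C-3PO', 'eye_color': 'yellow'},
    {'name': 'R2-D2', 'eye_color': 'red'},
    {'name': 'Darth Vader', 'eye_color': 'yellow'},
    {'name': 'Leia Organa', 'eye_color': 'brown'},
    {'name': 'Owen Lars', 'eye_color': 'blue'},
    {'name': 'Beru Whitesun lars', 'eye_color': 'blue'},
]

counts = map_reduce(docs, map_eye_color, reduce_sum)
print(counts)
```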

In [49]:
from bson.code import Code

Count the number of people by `eye_color`.

In [50]:
mapper = Code('''
function() {
    emit(this.eye_color, 1);
}
''')

reducer = Code('''
function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result1'
)

In [51]:
for doc in result.find():
    pprint(doc)

{'_id': 'blue', 'value': 3.0}
{'_id': 'blue-gray', 'value': 1.0}
{'_id': 'brown', 'value': 2.0}
{'_id': 'red', 'value': 2.0}
{'_id': 'yellow', 'value': 2.0}


The output is also stored in the `result1` collection we specified.

In [52]:
list(db.result1.find())

[{'_id': 'blue', 'value': 3.0},
 {'_id': 'blue-gray', 'value': 1.0},
 {'_id': 'brown', 'value': 2.0},
 {'_id': 'red', 'value': 2.0},
 {'_id': 'yellow', 'value': 2.0}]

Using JavaScript Array functions to simplify code.

In [53]:
mapper = Code('''
function() {
    emit(this.eye_color, 1);
}
''')

reducer = Code('''
function (key, values) {
    return Array.sum(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result2'
)

In [54]:
for doc in result.find():
    pprint(doc)

{'_id': 'blue', 'value': 3.0}
{'_id': 'blue-gray', 'value': 1.0}
{'_id': 'brown', 'value': 2.0}
{'_id': 'red', 'value': 2.0}
{'_id': 'yellow', 'value': 2.0}


Find average mass by gender.

In [55]:
mapper = Code('''
function() {
    emit(this.gender, this.mass);
}
''')

reducer = Code('''
function (key, values) {
    return Array.avg(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result3'
)

In [56]:
for doc in result.find():
    pprint(doc)

{'_id': 'female', 'value': 62.0}
{'_id': 'male', 'value': 98.8}
{'_id': 'n/a', 'value': 46.333333333333336}
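
A caveat on averaging inside the reducer: MongoDB may call the reduce function more than once for the same key, feeding earlier results back in, so the reducer should give the same answer however the values are grouped. Averaging does not have that property. The sketch below uses made-up values to show the problem and one common workaround, which is to carry `(sum, count)` pairs and divide only at the end.

```python
def avg(values):
    return sum(values) / len(values)


values = [10, 20, 60]  # made-up numbers for illustration

# Reducing everything in one call gives the true mean
direct = avg(values)

# Reducing in two steps (a "re-reduce") gives a different answer
rereduced = avg([avg(values[:2]), values[2]])


def reduce_pairs(pairs):
    """Associative reducer over (sum, count) pairs."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return (total, count)


# Emit (value, 1) per document, reduce in any grouping, divide last
partial = reduce_pairs([(10, 1), (20, 1)])
total, count = reduce_pairs([partial, (60, 1)])
safe_mean = total / count

print(direct, rereduced, safe_mean)
```

In an actual map-reduce job, the final division would typically go in a `finalize` function. With only a handful of documents the re-reduce step may never be triggered, which is why the result above can still look right.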


Count number of members in each species

In [57]:
mapper = Code('''
function() {
    this.species.map(function(z) {
      emit(z.name, 1);
    })
}
''')

reducer = Code('''
function (key, values) {
    return Array.sum(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result4'
)

In [58]:
for doc in result.find():
    pprint(doc)

{'_id': None, 'value': 3.0}


#### Using the `aggregate` method

See if you can convert the above MapReduce queries to `aggregate` method calls. An example is provided.

In [59]:
cmds = [
    {
         '$group': {
             '_id': '$eye_color',
             'count': {'$sum': 1},
         },
     },
     {
        '$sort': { 
            '_id': 1
        }
     }
]

In [60]:
for p in people.aggregate(cmds):
    pprint(p)

{'_id': 'blue', 'count': 3}
{'_id': 'blue-gray', 'count': 1}
{'_id': 'brown', 'count': 2}
{'_id': 'red', 'count': 2}
{'_id': 'yellow', 'count': 2}


## Geospatial queries

You specify queries using [GeoJSON Objects](https://docs.mongodb.com/manual/reference/geojson/). Coordinates are given in `[longitude, latitude]` order. The supported object types are:

- Point
- LineString
- Polygon
- MultiPoint
- MultiLineString
- MultiPolygon
- GeometryCollection

In [61]:
crime = db.crime

In [62]:
import json

In [63]:
path = 'data/crime-mapping.geojson'

with open(path) as f:
    datastore = json.load(f)

In [64]:
results = crime.insert_many(datastore['features'])

In [65]:
crime.find_one({})

{'_id': ObjectId('5f5f8fa52324e433baa62a7a'),
 'geometry': {'type': 'Point', 'coordinates': [-78.78200313, 35.760212065]},
 'type': 'Feature',
 'properties': {'ucr': '2650',
  'domestic': 'N',
  'period': ['Everything', 'Last Year'],
  'street': 'KILDAIRE FARM RD',
  'radio': 'Everything,Last Year',
  'time_to': -62135553600,
  'crime_type': 'ALL OTHER - ESCAPE FROM CUSTODY OR RESIST ARREST',
  'district': 'D3',
  'phxrecordstatus': None,
  'lon': -78.78200313,
  'timeframe': ['Last Year'],
  'crimeday': 'THURSDAY',
  'phxstatus': None,
  'location_category': 'TOWN OWNED',
  'violentproperty': 'All Other',
  'residential_subdivision': 'SHOPPES OF KILDAIRE',
  'offensecategory': 'All Other Offenses',
  'chrgcnt': None,
  'time_from': -62135553600,
  'map_reference': 'P027',
  'date_to': '11/30/2017',
  'lat': 35.760212065,
  'phxcommunity': 'No',
  'crime_category': 'ALL OTHER',
  'activity_date': None,
  'beat_number': '112',
  'record': 3145,
  'incident_number': '17010528',
  'apartm

In [66]:
crime.find_one({},
              {
                  'geometry': 1,
                  '_id': 0,
              }
              )

{'geometry': {'type': 'Point', 'coordinates': [-78.78200313, 35.760212065]}}

In [67]:
crime.create_index([('geometry', GEOSPHERE)])

'geometry_2dsphere'

List the 5 crimes nearest to the location. The `$near` operator returns results sorted from nearest to farthest.

In [68]:
loc = SON([('type', 'Point'), ('coordinates', [-78.78200313, 35.760212065])])

for doc in crime.find(
    {
        'geometry' : SON([('$near', {'$geometry' : loc})])
    },
    {
        '_id': 0,
        'properties.crime_type': 1,
        'properties.date_from': 1
    }
).limit(5):
    pprint(doc)

{'properties': {'crime_type': 'ALL OTHER - ESCAPE FROM CUSTODY OR RESIST '
                              'ARREST',
                'date_from': '2017-11-30'}}
{'properties': {'crime_type': 'LARCENY - AUTO PARTS OR ACCESSORIES',
                'date_from': '2018-03-20'}}
{'properties': {'crime_type': 'COUNTERFEITING - USING',
                'date_from': '2018-08-05'}}
{'properties': {'crime_type': 'DRUGS - DRUG VIOLATIONS '
                              '(POSS./SELL/MAN./DEL./TRNSPRT/CULT.)',
                'date_from': '2017-11-30'}}
{'properties': {'crime_type': 'VANDALISM - DAMAGE TO PROPERTY',
                'date_from': '2018-03-26'}}


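List crimes committed nearby (within 200 m). Distances are given in meters, and the tiny `$minDistance` excludes crimes recorded at the location itself.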

In [69]:
loc = SON([('type', 'Point'), ('coordinates', [-78.78200313, 35.760212065])])

for doc in crime.find(
    {
        'geometry' : SON([('$near', {'$geometry' : loc, '$minDistance': 1e-6, '$maxDistance': 200})]),
    },
    {
        '_id': 0,
        'geometry.coordinates': 1,
        'properties.crime_type': 1,
        'properties.date_from': 1
    }
):
    pprint(doc)

{'geometry': {'coordinates': [-78.78102423, 35.7607323]},
 'properties': {'crime_type': 'ASSAULT - SIMPLE - ALL OTHER',
                'date_from': '2018-02-14'}}
{'geometry': {'coordinates': [-78.78131931, 35.761138061]},
 'properties': {'crime_type': 'VANDALISM - GRAFFITI',
                'date_from': '2018-07-20'}}
{'geometry': {'coordinates': [-78.7827814, 35.759087052]},
 'properties': {'crime_type': 'VANDALISM - GRAFFITI',
                'date_from': '2018-07-29'}}
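
As a sanity check on the 200 m radius, the great-circle distance from the query point to each returned location can be computed with the haversine formula. This is a sketch under stated assumptions: `haversine_m` is a hypothetical helper, and the Earth radius used here is a common mean value that may differ slightly from the constant MongoDB uses internally.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # assumed mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# GeoJSON order is [longitude, latitude]
center = [-78.78200313, 35.760212065]
nearby = [
    [-78.78102423, 35.7607323],
    [-78.78131931, 35.761138061],
    [-78.7827814, 35.759087052],
]

distances = [haversine_m(*center, *point) for point in nearby]
for point, d in zip(nearby, distances):
    print(point, round(d), 'm')
```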


## Indexes

Just as with relational databases, you can add indexes to speed up search. Note that while reads become faster, writes become slower, because every index must be updated whenever a document is inserted or changed. There is always a trade-off.
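
The effect of an index can be illustrated without a database. In the sketch below a sorted list with binary search stands in for MongoDB's B-tree index, which is a simplifying assumption, and the function names are hypothetical. The point is the number of documents examined: every document for a collection scan, and only the matching ones for an index lookup.

```python
from bisect import bisect_left

names = [
    'Luke Skywalker', 'C-3PO', 'R2-D2', 'Darth Vader', 'Leia Organa',
    'Owen Lars', 'Beru Whitesun lars', 'R5-D4', 'Biggs Darklighter',
    'Obi-Wan Kenobi',
]
docs = [{'_id': i, 'name': n} for i, n in enumerate(names)]


def collection_scan(docs, name):
    """Examine every document, as in a COLLSCAN plan."""
    examined = 0
    hits = []
    for doc in docs:
        examined += 1
        if doc['name'] == name:
            hits.append(doc)
    return hits, examined


# The "index": (key, position) pairs kept in sorted order.
# Keeping this structure up to date is the extra cost paid on every write.
index = sorted((doc['name'], pos) for pos, doc in enumerate(docs))
keys = [key for key, _ in index]


def index_scan(docs, name):
    """Binary-search the index, then fetch only the matching documents."""
    i = bisect_left(keys, name)
    hits = []
    while i < len(keys) and keys[i] == name:
        hits.append(docs[index[i][1]])
        i += 1
    return hits, len(hits)


print(collection_scan(docs, 'Luke Skywalker')[1])  # documents examined without index
print(index_scan(docs, 'Luke Skywalker')[1])       # documents examined with index
```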

In [70]:
people.find({}).explain

<bound method Cursor.explain of <pymongo.cursor.Cursor object at 0x113936370>>

In [71]:
people.find({'name': 'Luke Skywalker'}).explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'starwars.people',
  'indexFilterSet': False,
  'parsedQuery': {'name': {'$eq': 'Luke Skywalker'}},
  'winningPlan': {'stage': 'COLLSCAN',
   'filter': {'name': {'$eq': 'Luke Skywalker'}},
   'direction': 'forward'},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 1,
  'executionTimeMillis': 0,
  'totalKeysExamined': 0,
  'totalDocsExamined': 10,
  'executionStages': {'stage': 'COLLSCAN',
   'filter': {'name': {'$eq': 'Luke Skywalker'}},
   'nReturned': 1,
   'executionTimeMillisEstimate': 0,
   'works': 12,
   'advanced': 1,
   'needTime': 10,
   'needYield': 0,
   'saveState': 0,
   'restoreState': 0,
   'isEOF': 1,
   'direction': 'forward',
   'docsExamined': 10},
  'allPlansExecution': []},
 'serverInfo': {'host': 'Cliburns-MacBook-Pro.local',
  'port': 27017,
  'version': '4.2.3',
  'gitVersion': '6874650b362138df74be53d366bbefc321ea32d4'},
 'ok': 1.0}

In [72]:
people.create_index('name')

'name_1'

In [73]:
people.find({'name': 'Luke Skywalker'}).explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'starwars.people',
  'indexFilterSet': False,
  'parsedQuery': {'name': {'$eq': 'Luke Skywalker'}},
  'winningPlan': {'stage': 'FETCH',
   'inputStage': {'stage': 'IXSCAN',
    'keyPattern': {'name': 1},
    'indexName': 'name_1',
    'isMultiKey': False,
    'multiKeyPaths': {'name': []},
    'isUnique': False,
    'isSparse': False,
    'isPartial': False,
    'indexVersion': 2,
    'direction': 'forward',
    'indexBounds': {'name': ['["Luke Skywalker", "Luke Skywalker"]']}}},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 1,
  'executionTimeMillis': 1,
  'totalKeysExamined': 1,
  'totalDocsExamined': 1,
  'executionStages': {'stage': 'FETCH',
   'nReturned': 1,
   'executionTimeMillisEstimate': 0,
   'works': 2,
   'advanced': 1,
   'needTime': 0,
   'needYield': 0,
   'saveState': 0,
   'restoreState': 0,
   'isEOF': 1,
   'docsExamined': 1,
   'alreadyHasObj': 0,
   'inputStage': {'stage':