# MongoDB

MongoDB is a document database. It stores JSON-like documents, encoded internally as BSON (binary JSON).

- [Documentation](https://docs.mongodb.com)
- [Query selectors](https://docs.mongodb.com/manual/reference/operator/query/#query-selectors)

Note that MongoDB also provides a GUI via [MongoDB Compass](https://www.mongodb.com/products/compass), which might be useful while you are getting familiar with MongoDB. However, we will focus only on `pymongo`.

## Concepts

- What a document database is
- Why document databases
- Collections ~ tables
- Documents ~ rows
- Joins are possible, but it is more common to embed nested objects
- [Basic data manipulation: CRUD](https://docs.mongodb.com/manual/crud/)
- Using `find`
- Simple summaries
- Using the `aggregate` method and setting up pipelines
- Geospatial queries
- Creating indexes to speed up queries

In [1]:
from pymongo import MongoClient, GEOSPHERE
from bson.objectid import ObjectId
from bson.son import SON

In [2]:
import requests
from bson import json_util

In [3]:
import collections
from pathlib import Path

In [4]:
import os

In [5]:
from pprint import pprint

## Set up

This connects to the MongoDB daemon (`mongod`), here assumed to be reachable at host `mongodb` on port 27017.

In [6]:
client = MongoClient('mongodb:27017')

Drop any existing `starwars` database so that we start from a clean slate. It does not matter if it does not exist.

In [7]:
client.drop_database('starwars')

This specifies the database. MongoDB creates it lazily on the first write.

In [None]:
db = client.starwars

This specifies a `collection`

In [None]:
people = db.people

Check what collections are in the database. Note that the `people` collection is only created when the first value is inserted.

In [None]:
db.list_collection_names()

## Get Data

In [None]:
base_url = 'http://swapi.dev/api/'

In [None]:
# Build the URL by string concatenation; os.path.join is meant for file paths, not URLs
resp = requests.get(base_url + 'people/1')
data = resp.json()

In [None]:
data

We will fetch details of the homeworld and starships as a nested document.

In [None]:
def get_nested(d):
    d['homeworld']  = requests.get(d['homeworld']).json()
    urls = d['starships']
    starships = [requests.get(url).json() for url in urls]
    d['starships']  = starships
    return d

The REST API returns numbers as strings, so we convert them to integers where possible.

In [None]:
def convert_str(x):
    try:
        return int(x)
    except ValueError:
        return x

def to_num(data):
    for key in data:
        val = data[key]
        if isinstance(val, str):
            data[key] = convert_str(val)
        elif isinstance(val, dict):
            for k, v in val.items():
                if isinstance(v, str):
                    val[k] = convert_str(v)
        elif isinstance(val, list):
            for i, item in enumerate(val):
                if isinstance(item, str):
                    data[key][i] = convert_str(item)
                elif isinstance(item, dict):
                    for k, v in item.items():
                        if isinstance(v, str):
                            data[key][i][k] = convert_str(v)      
    return data
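
The nested loops above handle exactly two levels of nesting and only integers. A more general variant, sketched below under the assumption that every numeric-looking string should be converted, recurses through dicts and lists of any depth. The names `convert_value` and `to_num_recursive` and the sample record are hypothetical and not part of the original notebook; unlike `convert_str`, this sketch also tries `float`.

```python
def convert_value(x):
    """Return x as int or float when it parses as a number, else unchanged.

    Note: float() also accepts strings such as 'nan' and 'inf'.
    """
    for cast in (int, float):
        try:
            return cast(x)
        except ValueError:
            continue
    return x


def to_num_recursive(obj):
    """Walk dicts and lists of any depth, converting numeric strings."""
    if isinstance(obj, dict):
        return {k: to_num_recursive(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_num_recursive(v) for v in obj]
    if isinstance(obj, str):
        return convert_value(obj)
    return obj


# Hypothetical record shaped like a SWAPI response
sample = {
    'name': 'Luke Skywalker',
    'height': '172',
    'starships': [{'name': 'X-wing', 'length': '12.5', 'cost': 'unknown'}],
}
converted = to_num_recursive(sample)
print(converted)
```

This version returns a new structure instead of modifying its argument in place, which is a design choice of the sketch rather than a requirement.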

In [None]:
data = to_num(get_nested(data))

In [None]:
data

## Insertion

### Single inserts

In [None]:
result = people.insert_one(data)

In [None]:
db.list_collection_names()

### Bulk inserts

We load some previously retrieved values from file to avoid hitting the SWAPI server repeatedly.

In [None]:
import pickle

with open('sw.pickle', 'rb') as f:
    xs = pickle.load(f)

In [None]:
result = people.insert_many(xs)

In [None]:
result.inserted_ids

## Queries

In [None]:
people.find_one(
    # search criteria
    {'name': 'Luke Skywalker'}, 
    # values to return
    {'name': True, 
     'hair_color': True,
     'skin_color': True, 
     'eye_color': True
    } 
)

In [None]:
for p in people.find(
    # search criteria
    {}, 
    # values to return
    {'name': True, 
     'hair_color': True,
     'skin_color': True, 
     'eye_color': True
    } 
):
    print(p)

### Using object ID

Note that an `ObjectId` is NOT a string. You must convert a string to an `ObjectId` (using the `ObjectId` class imported from `bson.objectid`) before using it in a query.

From the official docs, the `ObjectId` consists of

- a 4-byte value representing the seconds since the Unix epoch,
- a 5-byte random value, and
- a 3-byte counter, starting with a random value.

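In particular, because the timestamp comes first, sorting by `ObjectId` gives an approximate ordering by creation time. The ordering is only approximate because the timestamp has one-second resolution and the clocks of different machines may disagree.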
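
The byte layout listed above can be checked with the standard library alone. The sketch below splits an arbitrary 24-character hex string (chosen for illustration, not taken from the `people` collection) into its three parts and decodes the timestamp.

```python
from datetime import datetime, timezone

# An arbitrary 24-character hex string in ObjectId format (illustrative only)
oid_hex = '5f908d0937cbfb5282ec2cb7'

raw = bytes.fromhex(oid_hex)               # 12 bytes in total
seconds = int.from_bytes(raw[:4], 'big')   # 4-byte big-endian timestamp
random_part = raw[4:9]                     # 5-byte random value
counter = int.from_bytes(raw[9:], 'big')   # 3-byte counter

created = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(seconds, created.isoformat())
```

With `pymongo` installed, `ObjectId(oid_hex).generation_time` should report the same creation time.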

In [None]:
result.inserted_ids[0]

In [None]:
people.find_one(
    result.inserted_ids[0],
    {'name': True, 'hair_color': True, 'skin_color': True, 'eye_color': True}
)

### Bulk queries

The general `find` method returns a cursor, where each entry is a dictionary.

In [None]:
for person in people.find(
    {'gender': 'male'}
):
    print(person['name'])

You can also explicitly define the projection.

In [None]:
for x in people.find(
    {'gender': 'male'},             
    {
        '_id': False,
        'name': True,
        'gender': True
    }
): 
    pprint(x)

#### Using regex search

In [None]:
for x in people.find(
    {
        'name': {'$regex': '^L'},
    },
    {
        'name': True, 
        'gender': True, 
        '_id': False
    }
):
    pprint(x)

The above example passes the pattern as a string to the `$regex` operator, which MongoDB evaluates using Perl-compatible regular expressions (PCRE). You can also pass a compiled Python regular expression with `pymongo`.

In [None]:
import re

name_pat = re.compile(r'^l', re.IGNORECASE)

In [None]:
for x in people.find(
    {
        'name': name_pat,
    },
    {
        'name': True,
        'gender': True,
        '_id': False
    }
):
    pprint(x)
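
To see what the compiled pattern matches without a database connection, the sketch below applies it to a hypothetical list of names. The `query` dict shows the string form of the same case-insensitive search using `$options`; it is built here but not sent anywhere.

```python
import re

name_pat = re.compile(r'^l', re.IGNORECASE)

# Hypothetical sample names, not fetched from the database
names = ['Luke Skywalker', 'leia organa', 'Darth Vader', 'Lando Calrissian']

# Keep the names that start with 'l' or 'L'
matches = [n for n in names if name_pat.search(n)]
print(matches)

# String form of the same search for use in a find() filter
query = {'name': {'$regex': '^l', '$options': 'i'}}
```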

#### Using relational operators

In [None]:
for x in people.find(
    {
        'mass': {'$lt': 100},
    },
    {
        'name': True, 
        'mass': True, 
        '_id': False
    }
):
    pprint(x)

In [None]:
mass_range = {'$lt': 100, '$gt': 50}

In [None]:
for x in people.find(
    {
        'mass': mass_range,
    },
    {
        'name': True, 
        'mass': True,
        '_id': False
    }
):
    pprint(x)

#### Nested search

Nowadays, many relational databases allow you to store data as JSON columns. However, document databases make nested searches convenient: use dot notation, such as `homeworld.name`, to reach into an embedded document.

In [None]:
for x in people.find(
    {
        'homeworld.name': 'Tatooine',
    },
    {
        'name': True, 
        'species.name': True, 
        '_id': False
    }
):
    pprint(x)

#### Matching multiple criteria

This is quite subtle. By default, when you match multiple criteria on fields of an array of embedded documents, each criterion is tested against the array independently, so different elements can satisfy different criteria. Here `Obi-Wan Kenobi` is returned because each of the 3 conditions is matched by one or more of his starships, even though none of his starships matches all 3 criteria.

In [None]:
for x in people.find(
    {
        'starships.cost_in_credits': {'$lt': 250000},
        'starships.max_atmosphering_speed': {'$gt': 500},
        'starships.passengers': {'$gt': 0}
    },
    {
        'name': True, 
        'starships.name': True, 
        'starships.max_atmosphering_speed': True,
        'starships.passengers': True,
        'starships.cost_in_credits': True,     
        '_id': False
    }
):
    pprint(x)

In [None]:
for x in people.find(
    {'name': 'Obi-Wan Kenobi'},
    {
        'starships.name': True,
        'starships.cost_in_credits': True,
        'starships.max_atmosphering_speed': True,
        'starships.passengers': True,
        '_id': False
    }
):
    pprint(x)

#### Matching multiple criteria simultaneously

To find someone with a single starship that matches all 3 conditions, we need to use the `$elemMatch` operator.

In [None]:
for x in people.find(
    {
        'starships': {
            '$elemMatch': { 
                'cost_in_credits': {'$lt': 250000},
                'max_atmosphering_speed': {'$gt': 500},
                'passengers': {'$gt': 1}
            }
        }
    },
    {
        'name': True, 
        'starships.name': True, 
        'starships.max_atmosphering_speed': True,
        'starships.passengers': True,
        'starships.cost_in_credits': True,     
        '_id': False
    }
):
    pprint(x)
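
The difference between the two query styles above can be emulated in plain Python. The sketch below uses a hypothetical document whose two starships each satisfy only some of the conditions; it illustrates the matching logic and is not how MongoDB evaluates queries internally.

```python
# Hypothetical document: two starships, neither satisfying every condition
person = {
    'name': 'Example Pilot',
    'starships': [
        {'cost_in_credits': 180000, 'max_atmosphering_speed': 400, 'passengers': 0},
        {'cost_in_credits': 320000, 'max_atmosphering_speed': 1050, 'passengers': 3},
    ],
}

conditions = [
    lambda s: s['cost_in_credits'] < 250000,
    lambda s: s['max_atmosphering_speed'] > 500,
    lambda s: s['passengers'] > 0,
]

ships = person['starships']

# Dotted-path matching: each condition may be met by a different element
across_elements = all(any(cond(s) for s in ships) for cond in conditions)

# $elemMatch: one single element has to meet all conditions
same_element = any(all(cond(s) for cond in conditions) for s in ships)

print(across_elements, same_element)
```

The document would be returned by the dotted-path query but not by the `$elemMatch` query.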

## Aggregate Queries

In [None]:
people.count_documents({'homeworld.name': 'Tatooine'})

In [None]:
people.distinct('homeworld.name')

### Using aggregate

The `aggregate` method runs a **pipeline** of stages. Each stage (for example `$match`, `$group`, or `$sort`) transforms the documents it receives and passes its output to the next stage, so the order of the stages matters. The `$group` stage is the one that summarizes results.

Filter and count

In [None]:
cmds = [
     {'$match': {'homeworld.name': 'Tatooine'}},
     {'$group': {'_id': '$homeworld.name', 
                 'count': {'$sum': 1}}},
]

In [None]:
for p in people.aggregate(cmds):
    pprint(p)

Filter and find total mass

In [None]:
cmds = [
     {'$match': {'homeworld.name': 'Tatooine'}},
     {'$group': {'_id': '$homeworld.name', 
                 'total_mass': {'$sum': '$mass'}}},
]

In [None]:
for p in people.aggregate(cmds):
    pprint(p)

Total mass of all members of a planet

In [None]:
cmds = [
     {'$group': {'_id': '$homeworld.name', 
                 'total_mass': {'$sum': '$mass'}}},
]

In [None]:
for p in people.aggregate(cmds):
    pprint(p)

Filter, group by, and sort.

In [None]:
cmds = [
     {
         '$match': {
             'mass': {
                 '$lt': 100
                     }
         },
     },
     {
         '$group': {
             '_id': '$homeworld.name',
             'total_mass': {'$sum': '$mass'},
             'avg_mass': {'$avg': '$mass'}
         },
     },
     {
        '$sort': { 
            'avg_mass': -1
        }
     }
]

In [None]:
for p in people.aggregate(cmds):
    pprint(p)
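
To make the three stages concrete, the sketch below reproduces them step by step in plain Python on a few hypothetical records. It mirrors the logic of `$match`, `$group`, and `$sort` only; the names and masses are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records standing in for documents in `people`
docs = [
    {'name': 'A', 'mass': 77, 'homeworld': {'name': 'Tatooine'}},
    {'name': 'B', 'mass': 120, 'homeworld': {'name': 'Tatooine'}},
    {'name': 'C', 'mass': 49, 'homeworld': {'name': 'Alderaan'}},
    {'name': 'D', 'mass': 75, 'homeworld': {'name': 'Tatooine'}},
]

# $match: keep documents with mass < 100
matched = [d for d in docs if d['mass'] < 100]

# $group: collect masses per homeworld name
groups = defaultdict(list)
for d in matched:
    groups[d['homeworld']['name']].append(d['mass'])

summary = [
    {'_id': k, 'total_mass': sum(v), 'avg_mass': sum(v) / len(v)}
    for k, v in groups.items()
]

# $sort: descending by avg_mass, as requested by -1 in the pipeline
summary.sort(key=lambda r: r['avg_mass'], reverse=True)
print(summary)
```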

#### SQL equivalent (approximate)

```sql
SELECT homeworld.name, AVG(mass) AS avg_mass, SUM(mass) AS total_mass
FROM people
JOIN homeworld
ON people.homeworld_id = homeworld.homeworld_id
WHERE mass < 100
GROUP BY homeworld.name
ORDER BY avg_mass DESC
```

### Using MapReduce

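With `MapReduce` you get the full power of JavaScript, but it is more complex and often less efficient. You should use `aggregate` in preference to `map_reduce` in most cases. Note also that `Collection.map_reduce` was removed in PyMongo 4.0, so the examples below require PyMongo 3.x.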

- In the map stage, you create a (key, value) pair
- In the reduce stage, you perform a reduction (e.g. sum) of the values associated with each key

#### Motivating Python example

In [None]:
from functools import reduce

In [None]:
eye_color = ['blue', 'blue', 'green', 'brown', 'grey', 'green', 'blue']

In [None]:
res = [(x, 1) for x in eye_color]
res

In [None]:
d = {}
for k, v in res:
    d[k] = d.get(k, 0) + v
d
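
The same count can be written with `functools.reduce`, which makes the reduce stage explicit. This is a small sketch; the helper name `add_pair` is hypothetical.

```python
from functools import reduce

eye_color = ['blue', 'blue', 'green', 'brown', 'grey', 'green', 'blue']

# Map: one (key, value) pair per item
pairs = [(x, 1) for x in eye_color]


def add_pair(acc, pair):
    """Fold one (key, value) pair into a new running-total dict."""
    key, value = pair
    return {**acc, key: acc.get(key, 0) + value}


# Reduce: fold all pairs into per-key totals, starting from an empty dict
counts = reduce(add_pair, pairs, {})
print(counts)
```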

#### Map-reduce example in Mongo

In [None]:
from bson.code import Code

Count the number of people by `eye_color`

In [None]:
mapper = Code('''
function() {
    emit(this.eye_color, 1);
}
''')

reducer = Code('''
function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result1'
)

In [None]:
for doc in result.find():
    pprint(doc)

The output is also stored in the `result1` collection we specified.

In [None]:
list(db.result1.find())

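Using the `Array.sum` helper to simplify the code. Note that this helper is provided by MongoDB's JavaScript environment and is not part of standard JavaScript.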

In [None]:
mapper = Code('''
function() {
    emit(this.eye_color, 1);
}
''')

reducer = Code('''
function (key, values) {
    return Array.sum(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result2'
)

In [None]:
for doc in result.find():
    pprint(doc)

Find average mass by gender.

In [None]:
mapper = Code('''
function() {
    emit(this.gender, this.mass);
}
''')

reducer = Code('''
function (key, values) {
    return Array.avg(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result3'
)

In [None]:
for doc in result.find():
    pprint(doc)
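
One caveat on the average-by-gender reducer above: MongoDB may call the reduce function more than once for the same key, feeding earlier results back in, so a reducer should give the same answer however the values are grouped. Averaging does not have that property. The sketch below, using hypothetical mass values, shows the discrepancy and one common workaround, which is to carry `(sum, count)` pairs and divide only at the end.

```python
masses = [60, 80, 100, 120, 40]   # hypothetical values for one key


def avg(values):
    return sum(values) / len(values)


# Reducing in one call versus two partial calls gives different answers
single_pass = avg(masses)
two_pass = avg([avg(masses[:2]), avg(masses[2:])])


def reduce_pairs(pairs):
    """Combine (sum, count) pairs; safe to re-apply to its own output."""
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))


# Carry (sum, count) pairs through two partial reductions, then combine
pairs = [(m, 1) for m in masses]
partial = [reduce_pairs(pairs[:2]), reduce_pairs(pairs[2:])]
total, count = reduce_pairs(partial)
safe_avg = total / count

print(single_pass, two_pass, safe_avg)
```

In a MongoDB map-reduce job the final division would typically go in a finalize function; with `aggregate`, the `$avg` accumulator avoids the issue altogether.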

Count number of members in each species

In [None]:
mapper = Code('''
function() {
    this.species.map(function(z) {
      emit(z.name, 1);
    })
}
''')

reducer = Code('''
function (key, values) {
    return Array.sum(values);
}
''')

result = people.map_reduce(
    mapper, 
    reducer, 
    'result4'
)

In [None]:
for doc in result.find():
    pprint(doc)

#### Using the `aggregate` method

See if you can convert the above MapReduce queries to `aggregate` method calls. An example is provided.

In [None]:
cmds = [
    {
         '$group': {
             '_id': '$eye_color',
             'count': {'$sum': 1},
         },
     },
     {
        '$sort': { 
            '_id': 1
        }
     }
]

In [None]:
for p in people.aggregate(cmds):
    pprint(p)
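
For comparison, here is one possible way to express two of the other MapReduce queries as pipelines. This is a sketch, not the only solution: it assumes `species` is stored as an array of embedded documents (as the mapper's `this.species.map` suggests), and the variable names are hypothetical. The pipelines are only defined here; running them needs a live `people` collection.

```python
# Average mass by gender (counterpart of the Array.avg reducer)
avg_mass_by_gender = [
    {'$group': {'_id': '$gender', 'avg_mass': {'$avg': '$mass'}}},
    {'$sort': {'_id': 1}},
]

# Members per species: $unwind emits one document per array element
count_by_species = [
    {'$unwind': '$species'},
    {'$group': {'_id': '$species.name', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
]

# With a live connection these would run as, for example:
#     for doc in people.aggregate(count_by_species):
#         pprint(doc)
```

Unlike the JavaScript reducer, `$avg` ignores non-numeric values such as the string `'unknown'`, so the two approaches may not agree on every group.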

## Geospatial queries

You specify queries using [GeoJSON Objects](https://docs.mongodb.com/manual/reference/geojson/)

- Point
- LineString
- Polygon
- MultiPoint
- MultiLineString
- MultiPolygon
- GeometryCollection

In [None]:
crime = db.crime

In [None]:
import json

In [None]:
path = 'data/crime-mapping.geojson'

with open(path) as f:
    datastore = json.load(f)

In [None]:
results = crime.insert_many(datastore['features'])

In [None]:
crime.find_one({})

In [None]:
crime.find_one({},
              {
                  'geometry': 1,
                  '_id': 0,
              }
              )

In [None]:
crime.create_index([('geometry', GEOSPHERE)])

List 5 crimes near the location

In [None]:
loc = SON([('type', 'Point'), ('coordinates', [-78.78200313, 35.760212065])])

for doc in crime.find(
    {
        'geometry' : SON([('$near', {'$geometry' : loc})])
    },
    {
        '_id': 0,
        'properties.crime_type': 1,
        'properties.date_from': 1
    }
).limit(5):
    pprint(doc)

List crimes committed nearby (within 200 m)

In [None]:
loc = SON([('type', 'Point'), ('coordinates', [-78.78200313, 35.760212065])])

for doc in crime.find(
    {
        # `$near` is the query operator; `$geoNear` is an aggregation stage
        'geometry': SON([('$near', {
            '$geometry': loc,
            '$minDistance': 1e-6,
            '$maxDistance': 200
        })]),
    },
    {
        '_id': 0,
        'geometry.coordinates': 1,
        'properties.crime_type': 1,
        'properties.date_from': 1
    }
):
    pprint(doc)
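
The proximity filters above share the same shape, so it can help to build them with a small function. The helper below is a hypothetical sketch (`near_filter` is not a `pymongo` function). It relies on two facts worth remembering: GeoJSON coordinates are ordered `[longitude, latitude]`, and distances for a GeoJSON `$near` query are given in meters.

```python
def near_filter(lon, lat, max_m, min_m=0):
    """Build a `$near` filter for a GeoJSON point (distances in meters)."""
    return {
        '$near': {
            '$geometry': {'type': 'Point', 'coordinates': [lon, lat]},
            '$minDistance': min_m,
            '$maxDistance': max_m,
        }
    }


# Filter for documents whose `geometry` lies within 200 m of the point
query = {'geometry': near_filter(-78.78200313, 35.760212065, max_m=200)}
print(query)
```

The resulting dict can be passed as the first argument to `crime.find`, provided the `geometry` field has a `GEOSPHERE` index as created earlier.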

## Indexes

Just as with relational databases, you can add indexes to speed up search. Note that while reads become faster, writes become slower. There is always a trade-off.

In [None]:
people.find({}).explain()

In [None]:
people.find({'name': 'Luke Skywalker'}).explain()

In [None]:
people.create_index('name')

In [None]:
people.find({'name': 'Luke Skywalker'}).explain()
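
The `explain()` output is a large nested dict, and the part that shows whether an index was used is the winning plan. The sketch below walks a plan and lists its stages. The two dicts are hand-written, heavily abbreviated stand-ins for real output, and the assumed layout (`queryPlanner.winningPlan` with nested `inputStage` entries) may differ between server versions; `plan_stages` is a hypothetical helper.

```python
def plan_stages(plan):
    """Collect stage names from a winning plan, outermost first."""
    stages = []
    while plan:
        stages.append(plan.get('stage'))
        plan = plan.get('inputStage')
    return stages


# Hand-written, abbreviated stand-ins for explain() output (assumed shape)
before = {'queryPlanner': {'winningPlan': {'stage': 'COLLSCAN'}}}
after = {'queryPlanner': {'winningPlan': {
    'stage': 'FETCH',
    'inputStage': {'stage': 'IXSCAN', 'indexName': 'name_1'},
}}}

print(plan_stages(before['queryPlanner']['winningPlan']))
print(plan_stages(after['queryPlanner']['winningPlan']))
```

A `COLLSCAN` stage means every document was scanned, while an `IXSCAN` stage means the query used an index.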