
# Data Science and AI
## Lab 2.1.4: Python with MongoDB
INSTRUCTIONS:
- Run the cells
- Observe and understand the results
- Answer the questions

## Introduction to PyMongo

In [2]:
# from IPython.core.display import display, HTML
!pip install pymongo
!pip install folium
import os

import pymongo
from pymongo import MongoClient

import numpy as np
import pandas as pd

import folium

Collecting pymongo
  Downloading https://files.pythonhosted.org/packages/a3/8c/ec46f4aa95515989711a7893e64c30f9d33c58eaccc01f8f37c4513739a2/pymongo-3.9.0-cp37-cp37m-macosx_10_6_intel.whl (378kB)
Installing collected packages: pymongo
Successfully installed pymongo-3.9.0


In [3]:
print('PyMongo version: %s' % pymongo.__version__)
print('Folium version : %s' % folium.__version__)

PyMongo version: 3.9.0
Folium version : 0.10.0


## Start the MongoDB server
Start the `mongod` process to start the server

Type at the command prompt

    $ ./mongod --dbpath <path-to-db-directory>

In [4]:
server = 'localhost'
port = 27017
client = MongoClient(server, port)

#### QUESTION: What would this do?

In [5]:
db = client.test
collection = db.people
collection.drop()

ServerSelectionTimeoutError: localhost:27017: [Errno 61] Connection refused

#### ANSWER: Drops collection `people` from database `test`

#### QUESTION: What would this do?

In [None]:
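# `client.test.drop` would only reference a collection named 'drop';
# drop_database() is the call that actually drops the database
client.drop_database('test')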

#### ANSWER: Drops database `test`

#### Drop database `mydatabase` before creating it below

In [7]:
# `client.mydatabase.drop` would only reference a collection named 'drop';
# drop_database() is the call that actually drops the database
client.drop_database('mydatabase')

#### Create a new database called `mydatabase`

In [8]:
mydb = client['mydatabase']

#### Confirm that the database exists
- list all databases in your system

In [9]:
print(client.list_database_names())

['admin', 'config', 'local', 'test']


- check for the database by name

In [11]:
dblist = client.list_database_names()

if 'mydatabase' in dblist:
    print('The database exists.')

If the database has never been written to, it will not be found, because MongoDB creates databases lazily: a database does not exist until data has been written to it.

#### Create a collection called `customers` (with object name `mycol`)

In [12]:
mycol = mydb['customers']

#### Create a document (i.e. a dictionary)
- Create a document (i.e. a dictionary) with _two_ `name`:`value` items

        'name' = 'John' and
        'address' = 'Highway 37'

- Insert it into the `customers` collection

In [13]:
mydict = {
    'name': 'John',
    'address': 'Highway 37'
}

x = mycol.insert_one(mydict)

#### Now test for the existence of the database

In [14]:
print(client.list_database_names())

['admin', 'config', 'local', 'mydatabase', 'test']


#### List all collections in the database

In [15]:
print(mydb.list_collection_names())

['customers']


#### Insert another record in the `customers` collection
- Insert a document with these `name`:`value` items

        'name' = 'Peter' and
        'address' = 'Lowstreet 27'

- Return the value of the `_id` field

In [16]:
mydict = {
    'name': 'Peter',
    'address': 'Lowstreet 27'
}

x = mycol.insert_one(mydict)
print(x.inserted_id)

5d8ef6f1065b5e2b127feb57
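The value printed above is an `ObjectId`. Its first four bytes encode the creation time in seconds since the Unix epoch, so the timestamp can be recovered from the hex string alone. A minimal sketch using only the standard library (the helper name `objectid_created_at` is illustrative, not part of PyMongo):

```python
from datetime import datetime, timezone


def objectid_created_at(hex_id):
    """Decode the creation time stored in the first 4 bytes of an ObjectId."""
    # The first 8 hex characters are a big-endian count of seconds since the epoch.
    seconds = int(hex_id[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The _id returned by insert_one() in the cell above
print(objectid_created_at('5d8ef6f1065b5e2b127feb57'))  # 2019-09-28 06:00:17+00:00
```

With PyMongo installed, the same information is available through the `generation_time` attribute of a `bson.ObjectId`.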


#### Given the list of dicts below
- Insert multiple documents into the collection using the `insert_many()` method

In [17]:
mylist = [
    {'name': 'Amy', 'address': 'Apple st 652'},
    {'name': 'Hannah', 'address': 'Mountain 21'},
    {'name': 'Michael', 'address': 'Valley 345'},
    {'name': 'Sandy', 'address': 'Ocean blvd 2'},
    {'name': 'Betty', 'address': 'Green Grass 1'},
    {'name': 'Richard', 'address': 'Sky st 331'},
    {'name': 'Susan', 'address': 'One way 98'},
    {'name': 'Vicky', 'address': 'Yellow Garden 2'},
    {'name': 'Ben', 'address': 'Park Lane 38'},
    {'name': 'William', 'address': 'Central st 954'},
    {'name': 'Chuck', 'address': 'Main Road 989'},
    {'name': 'Viola', 'address': 'Sideway 1633'}
]

In [18]:
x = mycol.insert_many(mylist)

#### Print a list of the `_id` values of the inserted documents

In [19]:
x.inserted_ids

[ObjectId('5d8ef705065b5e2b127feb58'),
 ObjectId('5d8ef705065b5e2b127feb59'),
 ObjectId('5d8ef705065b5e2b127feb5a'),
 ObjectId('5d8ef705065b5e2b127feb5b'),
 ObjectId('5d8ef705065b5e2b127feb5c'),
 ObjectId('5d8ef705065b5e2b127feb5d'),
 ObjectId('5d8ef705065b5e2b127feb5e'),
 ObjectId('5d8ef705065b5e2b127feb5f'),
 ObjectId('5d8ef705065b5e2b127feb60'),
 ObjectId('5d8ef705065b5e2b127feb61'),
 ObjectId('5d8ef705065b5e2b127feb62'),
 ObjectId('5d8ef705065b5e2b127feb63')]

#### Execute the next cell to insert a list of dicts with specified `_id`s

In [20]:
mylist = [
    {'_id': 1, 'name': 'John', 'address': 'Highway 37'},
    {'_id': 2, 'name': 'Peter', 'address': 'Lowstreet 27'},
    {'_id': 3, 'name': 'Amy', 'address': 'Apple st 652'},
    {'_id': 4, 'name': 'Hannah', 'address': 'Mountain 21'},
    {'_id': 5, 'name': 'Michael', 'address': 'Valley 345'},
    {'_id': 6, 'name': 'Sandy', 'address': 'Ocean blvd 2'},
    {'_id': 7, 'name': 'Betty', 'address': 'Green Grass 1'},
    {'_id': 8, 'name': 'Richard', 'address': 'Sky st 331'},
    {'_id': 9, 'name': 'Susan', 'address': 'One way 98'},
    {'_id': 10, 'name': 'Vicky', 'address': 'Yellow Garden 2'},
    {'_id': 11, 'name': 'Ben', 'address': 'Park Lane 38'},
    {'_id': 12, 'name': 'William', 'address': 'Central st 954'},
    {'_id': 13, 'name': 'Chuck', 'address': 'Main Road 989'},
    {'_id': 14, 'name': 'Viola', 'address': 'Sideway 1633'}
]

In [21]:
x = mycol.insert_many(mylist)
x.inserted_ids

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

#### Now try inserting a new dict with an existing `_id`

In [22]:
try:
    x = mycol.insert_one({'_id': 14, 'name': 'Manuel', 'address': 'Barcelona'})
except pymongo.errors.DuplicateKeyError as ex:
    print(ex)

E11000 duplicate key error collection: mydatabase.customers index: _id_ dup key: { _id: 14 }


So, if we want to manage `_id`s in code, we need to be careful!
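One possible way to avoid the `DuplicateKeyError` when managing `_id`s in code is an upsert: replace the stored document if the `_id` already exists, and insert it otherwise. The sketch below assumes that overwriting an existing document is acceptable, which may not suit every application. `save_with_id` is a hypothetical helper name; `replace_one(..., upsert=True)` and the `upserted_id` attribute of its result are PyMongo's own.

```python
def save_with_id(collection, doc):
    """Insert doc, or replace the stored document that has the same _id.

    Returns True if a new document was inserted, False if one was replaced.
    """
    result = collection.replace_one({'_id': doc['_id']}, doc, upsert=True)
    # upserted_id is None when an existing document was matched and replaced
    return result.upserted_id is not None


# Example call against the collection used in this lab (requires a running server):
# save_with_id(mycol, {'_id': 14, 'name': 'Manuel', 'address': 'Barcelona'})
```

Whether to overwrite, skip, or raise on a duplicate `_id` is a design decision; the `try`/`except` shown above is the right choice when duplicates should be reported rather than silently replaced.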

#### This returns the first document in the collection

In [23]:
x = mycol.find_one()
x

{'_id': ObjectId('5d8ef6dd065b5e2b127feb56'),
 'name': 'John',
 'address': 'Highway 37'}

#### Find the first document whose `name` is `Hannah`

In [24]:
x = mycol.find_one({'name': 'Hannah'})
x

{'_id': ObjectId('5d8ef705065b5e2b127feb59'),
 'name': 'Hannah',
 'address': 'Mountain 21'}

#### This returns (and prints) all documents in the collection

In [25]:
for x in mycol.find():
    print(x)

{'_id': ObjectId('5d8ef6dd065b5e2b127feb56'), 'name': 'John', 'address': 'Highway 37'}
{'_id': ObjectId('5d8ef6f1065b5e2b127feb57'), 'name': 'Peter', 'address': 'Lowstreet 27'}
{'_id': ObjectId('5d8ef705065b5e2b127feb58'), 'name': 'Amy', 'address': 'Apple st 652'}
{'_id': ObjectId('5d8ef705065b5e2b127feb59'), 'name': 'Hannah', 'address': 'Mountain 21'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5a'), 'name': 'Michael', 'address': 'Valley 345'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5b'), 'name': 'Sandy', 'address': 'Ocean blvd 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5c'), 'name': 'Betty', 'address': 'Green Grass 1'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5d'), 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5e'), 'name': 'Susan', 'address': 'One way 98'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5f'), 'name': 'Vicky', 'address': 'Yellow Garden 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb60'), 'name': 'Ben', 'address': 'Park Lane 38'}
{'_id': ObjectI

#### This returns only the name and address fields

In [26]:
for x in mycol.find({}, {'_id': 0, 'name': 1, 'address': 1}):
    print(x)

{'name': 'John', 'address': 'Highway 37'}
{'name': 'Peter', 'address': 'Lowstreet 27'}
{'name': 'Amy', 'address': 'Apple st 652'}
{'name': 'Hannah', 'address': 'Mountain 21'}
{'name': 'Michael', 'address': 'Valley 345'}
{'name': 'Sandy', 'address': 'Ocean blvd 2'}
{'name': 'Betty', 'address': 'Green Grass 1'}
{'name': 'Richard', 'address': 'Sky st 331'}
{'name': 'Susan', 'address': 'One way 98'}
{'name': 'Vicky', 'address': 'Yellow Garden 2'}
{'name': 'Ben', 'address': 'Park Lane 38'}
{'name': 'William', 'address': 'Central st 954'}
{'name': 'Chuck', 'address': 'Main Road 989'}
{'name': 'Viola', 'address': 'Sideway 1633'}
{'name': 'John', 'address': 'Highway 37'}
{'name': 'Peter', 'address': 'Lowstreet 27'}
{'name': 'Amy', 'address': 'Apple st 652'}
{'name': 'Hannah', 'address': 'Mountain 21'}
{'name': 'Michael', 'address': 'Valley 345'}
{'name': 'Sandy', 'address': 'Ocean blvd 2'}
{'name': 'Betty', 'address': 'Green Grass 1'}
{'name': 'Richard', 'address': 'Sky st 331'}
{'name': 'Susa

#### Print only the `_id` and name fields

In [27]:
for x in mycol.find({}, {'_id': 1, 'name': 1}):
    print(x)

{'_id': ObjectId('5d8ef6dd065b5e2b127feb56'), 'name': 'John'}
{'_id': ObjectId('5d8ef6f1065b5e2b127feb57'), 'name': 'Peter'}
{'_id': ObjectId('5d8ef705065b5e2b127feb58'), 'name': 'Amy'}
{'_id': ObjectId('5d8ef705065b5e2b127feb59'), 'name': 'Hannah'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5a'), 'name': 'Michael'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5b'), 'name': 'Sandy'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5c'), 'name': 'Betty'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5d'), 'name': 'Richard'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5e'), 'name': 'Susan'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5f'), 'name': 'Vicky'}
{'_id': ObjectId('5d8ef705065b5e2b127feb60'), 'name': 'Ben'}
{'_id': ObjectId('5d8ef705065b5e2b127feb61'), 'name': 'William'}
{'_id': ObjectId('5d8ef705065b5e2b127feb62'), 'name': 'Chuck'}
{'_id': ObjectId('5d8ef705065b5e2b127feb63'), 'name': 'Viola'}
{'_id': 1, 'name': 'John'}
{'_id': 2, 'name': 'Peter'}
{'_id': 3, 'name': 'Amy'}
{'_id': 4, 'name': 'Hannah'}
{'_id'

So, we must explicitly use `'_id': 0` to exclude it, but for other fields we simply omit them from the dict argument.
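The rule for inclusion-style projections can be mimicked on a plain dictionary. This is only an illustrative sketch of the behaviour described above, not how MongoDB implements it; the `project` function is hypothetical and handles only the case where fields are listed with `1` (plus an optional `'_id': 0`).

```python
def project(doc, projection):
    """Mimic an inclusion-style projection on a plain dict."""
    result = {}
    # _id is returned unless the projection explicitly sets it to 0
    if projection.get('_id', 1) and '_id' in doc:
        result['_id'] = doc['_id']
    # every other field is returned only if it is listed with a truthy value
    for field, wanted in projection.items():
        if field != '_id' and wanted and field in doc:
            result[field] = doc[field]
    return result


doc = {'_id': 1, 'name': 'John', 'address': 'Highway 37'}
print(project(doc, {'_id': 0, 'name': 1, 'address': 1}))  # {'name': 'John', 'address': 'Highway 37'}
print(project(doc, {'name': 1}))                          # {'_id': 1, 'name': 'John'}
```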

To include field conditions in a query, we use `$` operators. This finds addresses that sort after the string 'S', i.e. those starting with 'S' or any later character:

In [29]:
myquery = {'address': {'$gt': 'S'}}

mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('5d8ef705065b5e2b127feb5a'), 'name': 'Michael', 'address': 'Valley 345'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5d'), 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5f'), 'name': 'Vicky', 'address': 'Yellow Garden 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb63'), 'name': 'Viola', 'address': 'Sideway 1633'}
{'_id': 5, 'name': 'Michael', 'address': 'Valley 345'}
{'_id': 8, 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': 10, 'name': 'Vicky', 'address': 'Yellow Garden 2'}
{'_id': 14, 'name': 'Viola', 'address': 'Sideway 1633'}
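To see why these four addresses match, it may help to reproduce the comparison in plain Python. Without a collation, MongoDB compares strings byte by byte, which for the ASCII addresses used here gives the same ordering as Python's `>` on strings. The list below simply repeats the addresses inserted earlier.

```python
addresses = [
    'Highway 37', 'Lowstreet 27', 'Apple st 652', 'Mountain 21',
    'Valley 345', 'Ocean blvd 2', 'Green Grass 1', 'Sky st 331',
    'One way 98', 'Yellow Garden 2', 'Park Lane 38', 'Central st 954',
    'Main Road 989', 'Sideway 1633',
]

# Same test as {'address': {'$gt': 'S'}}: keep strings that sort after 'S'
after_s = [a for a in addresses if a > 'S']
print(after_s)  # ['Valley 345', 'Sky st 331', 'Yellow Garden 2', 'Sideway 1633']
```

Note that 'Sky st 331' qualifies because a longer string with the same prefix sorts after the shorter one, and that any address starting with a lowercase letter would also match, since lowercase ASCII letters sort after uppercase ones.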


Here are some more query operators (comparison, element, and logical):

    $gt
    $gte
    $eq
    $in
    $nin
    $exists
    $and
    $or
    $not

Experiment with these until you understand how to use them.

#### Now find all docs with an address that begins with 'S'
**HINT**: The value for 'address' in the argument should be the regex-based dict `{ '$regex': '^S' }`

In [30]:
myquery = {'address': {'$regex': '^S'}}

mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('5d8ef705065b5e2b127feb5d'), 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': ObjectId('5d8ef705065b5e2b127feb63'), 'name': 'Viola', 'address': 'Sideway 1633'}
{'_id': 8, 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': 14, 'name': 'Viola', 'address': 'Sideway 1633'}
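For a simple anchored pattern such as `^S`, the match can be checked offline with Python's `re` module. This is a sketch for experimenting with patterns before sending them to the server; MongoDB uses its own regular-expression engine, so more advanced syntax may behave differently.

```python
import re

addresses = [
    'Highway 37', 'Lowstreet 27', 'Apple st 652', 'Mountain 21',
    'Valley 345', 'Ocean blvd 2', 'Green Grass 1', 'Sky st 331',
    'One way 98', 'Yellow Garden 2', 'Park Lane 38', 'Central st 954',
    'Main Road 989', 'Sideway 1633',
]

# '^S' anchors the match at the start of the string
pattern = re.compile('^S')
starts_with_s = [a for a in addresses if pattern.search(a)]
print(starts_with_s)  # ['Sky st 331', 'Sideway 1633']
```

Unlike the `$gt` query earlier, the regex excludes 'Valley 345' and 'Yellow Garden 2', because it tests the first character rather than the sort order.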


#### Sorting can be applied by invoking the `sort()` method after the `find()` method
- Sort the collection by the name field

In [31]:
mydoc = mycol.find().sort('name')

for x in mydoc:
    print(x)

{'_id': ObjectId('5d8ef705065b5e2b127feb58'), 'name': 'Amy', 'address': 'Apple st 652'}
{'_id': 3, 'name': 'Amy', 'address': 'Apple st 652'}
{'_id': ObjectId('5d8ef705065b5e2b127feb60'), 'name': 'Ben', 'address': 'Park Lane 38'}
{'_id': 11, 'name': 'Ben', 'address': 'Park Lane 38'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5c'), 'name': 'Betty', 'address': 'Green Grass 1'}
{'_id': 7, 'name': 'Betty', 'address': 'Green Grass 1'}
{'_id': ObjectId('5d8ef705065b5e2b127feb62'), 'name': 'Chuck', 'address': 'Main Road 989'}
{'_id': 13, 'name': 'Chuck', 'address': 'Main Road 989'}
{'_id': ObjectId('5d8ef705065b5e2b127feb59'), 'name': 'Hannah', 'address': 'Mountain 21'}
{'_id': 4, 'name': 'Hannah', 'address': 'Mountain 21'}
{'_id': ObjectId('5d8ef6dd065b5e2b127feb56'), 'name': 'John', 'address': 'Highway 37'}
{'_id': 1, 'name': 'John', 'address': 'Highway 37'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5a'), 'name': 'Michael', 'address': 'Valley 345'}
{'_id': 5, 'name': 'Michael', 'address': 'Valley

#### Now sort in reverse order
**HINT**: The `sort()` method takes an optional second parameter

In [32]:
mydoc = mycol.find().sort('name', direction=pymongo.DESCENDING)

for x in mydoc:
    print(x)

{'_id': ObjectId('5d8ef705065b5e2b127feb61'), 'name': 'William', 'address': 'Central st 954'}
{'_id': 12, 'name': 'William', 'address': 'Central st 954'}
{'_id': ObjectId('5d8ef705065b5e2b127feb63'), 'name': 'Viola', 'address': 'Sideway 1633'}
{'_id': 14, 'name': 'Viola', 'address': 'Sideway 1633'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5f'), 'name': 'Vicky', 'address': 'Yellow Garden 2'}
{'_id': 10, 'name': 'Vicky', 'address': 'Yellow Garden 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5e'), 'name': 'Susan', 'address': 'One way 98'}
{'_id': 9, 'name': 'Susan', 'address': 'One way 98'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5b'), 'name': 'Sandy', 'address': 'Ocean blvd 2'}
{'_id': 6, 'name': 'Sandy', 'address': 'Ocean blvd 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5d'), 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': 8, 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': ObjectId('5d8ef6f1065b5e2b127feb57'), 'name': 'Peter', 'address': 'Lowstreet 27'}
{'_id': 2, 'name': 'Peter',

#### A single record can be deleted by specifying some criterion

In [33]:
mycol.delete_one({'address': 'Mountain 21'})

<pymongo.results.DeleteResult at 0x1d356025708>
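The `DeleteResult` object shown above is more useful when its `deleted_count` attribute is inspected. A small sketch, assuming a PyMongo collection is passed in; `delete_first_match` is a hypothetical helper name.

```python
def delete_first_match(collection, query):
    """Delete at most one matching document and report how many were removed."""
    result = collection.delete_one(query)  # a pymongo DeleteResult
    return result.deleted_count            # 0 if nothing matched, otherwise 1


# Example call against the collection used in this lab (requires a running server):
# delete_first_match(mycol, {'address': 'Mountain 21'})
```

Checking the count is a cheap way to detect a query that matched nothing, for instance because of a typo in the address.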

#### Now delete all docs with integer `_id` values (1 to 14)

In [34]:
mycol.delete_many({'_id': {'$lt': 15}})
for x in mycol.find():
    print(x)

{'_id': ObjectId('5d8ef6dd065b5e2b127feb56'), 'name': 'John', 'address': 'Highway 37'}
{'_id': ObjectId('5d8ef6f1065b5e2b127feb57'), 'name': 'Peter', 'address': 'Lowstreet 27'}
{'_id': ObjectId('5d8ef705065b5e2b127feb58'), 'name': 'Amy', 'address': 'Apple st 652'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5a'), 'name': 'Michael', 'address': 'Valley 345'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5b'), 'name': 'Sandy', 'address': 'Ocean blvd 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5c'), 'name': 'Betty', 'address': 'Green Grass 1'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5d'), 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5e'), 'name': 'Susan', 'address': 'One way 98'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5f'), 'name': 'Vicky', 'address': 'Yellow Garden 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb60'), 'name': 'Ben', 'address': 'Park Lane 38'}
{'_id': ObjectId('5d8ef705065b5e2b127feb61'), 'name': 'William', 'address': 'Central st 954'}
{'_id': Obj

- This would delete all docs
    ```python
    x = mycol.delete_many({})
    ```
- This would remove the collection
    ```python
    mycol.drop()
    ```

#### Change the first instance of `address`
- Change the first instance of 'address' == 'Valley 345' to 'Canyon 123' using `update_one()`

**HINT**: The first parameter of `update_one()` is the criterion (query); the second is a dictionary specifying the update, here a `$set` of the field to change and its new value.

In [35]:
myquery = {'address': 'Valley 345'}
newvalues = {'$set': {'address': 'Canyon 123'}}

mycol.update_one(myquery, newvalues)
for x in mycol.find():
    print(x)

{'_id': ObjectId('5d8ef6dd065b5e2b127feb56'), 'name': 'John', 'address': 'Highway 37'}
{'_id': ObjectId('5d8ef6f1065b5e2b127feb57'), 'name': 'Peter', 'address': 'Lowstreet 27'}
{'_id': ObjectId('5d8ef705065b5e2b127feb58'), 'name': 'Amy', 'address': 'Apple st 652'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5a'), 'name': 'Michael', 'address': 'Canyon 123'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5b'), 'name': 'Sandy', 'address': 'Ocean blvd 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5c'), 'name': 'Betty', 'address': 'Green Grass 1'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5d'), 'name': 'Richard', 'address': 'Sky st 331'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5e'), 'name': 'Susan', 'address': 'One way 98'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5f'), 'name': 'Vicky', 'address': 'Yellow Garden 2'}
{'_id': ObjectId('5d8ef705065b5e2b127feb60'), 'name': 'Ben', 'address': 'Park Lane 38'}
{'_id': ObjectId('5d8ef705065b5e2b127feb61'), 'name': 'William', 'address': 'Central st 954'}
{'_id': Obj

#### The `limit()` method can be applied after the `find()` method to limit the number of docs returned
- Show the first _5_ docs

In [36]:
myresult = mycol.find().limit(5)
for x in myresult:
    print(x)

{'_id': ObjectId('5d8ef6dd065b5e2b127feb56'), 'name': 'John', 'address': 'Highway 37'}
{'_id': ObjectId('5d8ef6f1065b5e2b127feb57'), 'name': 'Peter', 'address': 'Lowstreet 27'}
{'_id': ObjectId('5d8ef705065b5e2b127feb58'), 'name': 'Amy', 'address': 'Apple st 652'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5a'), 'name': 'Michael', 'address': 'Canyon 123'}
{'_id': ObjectId('5d8ef705065b5e2b127feb5b'), 'name': 'Sandy', 'address': 'Ocean blvd 2'}
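Combining `limit()` with `skip()` and a sort gives simple pagination. The following is a sketch under the assumption that sorting by `_id` is an acceptable page order; `fetch_page` is a hypothetical helper, while `find()`, `sort()`, `skip()` and `limit()` are the PyMongo cursor methods.

```python
def fetch_page(collection, page, page_size=5):
    """Return one page of documents (page 0 is the first page)."""
    cursor = (
        collection.find()
        .sort('_id', 1)           # 1 is the value of pymongo.ASCENDING
        .skip(page * page_size)   # jump over the earlier pages
        .limit(page_size)         # return at most one page of documents
    )
    return list(cursor)


# Example call against the collection used in this lab (requires a running server):
# for doc in fetch_page(mycol, page=1):
#     print(doc)
```

Sorting before skipping matters: without an explicit sort, the order in which documents are returned is not guaranteed to be the same from one query to the next.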


In [37]:
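# close the connection (PyMongo 3.x reopens it automatically on the next
# operation; later major versions may not allow reuse after close())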
client.close()

## PyMongo for Data Science
**MongoDB** has many more features of interest to developers, but the main focus of a Data Scientist will be wrangling and munging the data.

It may or may not be desirable to do all the data munging in **Pandas**; for a large, distributed database, it may be imperative to perform aggregation in **MongoDB**.

## Demo
Based on the [**Simple MongoDB demo**](https://rsandstroem.github.io/MongoDBDemo.html) from the **Data Scientist Blog**

This code creates a database named `test` and populates it from a JSON file using the `mongoimport` program (executed in the operating system itself, rather than in Python).

In [38]:
db = client.test

# Drop the collection in case it was previously created
collection = db.people
collection.drop()

In [40]:
# change to the data directory
back_dir = os.getcwd()
os.chdir('data')
os.getcwd()

'C:\\Users\\liuy\\Documents\\DSIA\\DSIA-MEL-PT-201910-master\\Module 2\\data'

In [49]:
# if this does not return 0,
# ... execute mongoimport from a command window instead:
RC = os.system('mongoimport -d test -c people dummyData.json')
if RC != 0:
    print('ERROR: Could not import data!\n')
    print('Need to import the data manually from the command line.')
    print('Use:')
    print('  mongoimport -d test -c people dummyData.json')

ERROR: Could not import data!

Need to import the data manually from the command line.
Use:
  mongoimport -d test -c people dummyData.json


In [50]:
# move back to the notebook's directory
os.chdir(back_dir)
os.getcwd()

'C:\\Users\\liuy\\Documents\\DSIA\\DSIA-MEL-PT-201910-master\\Module 2'

In [57]:
db = client.test
collection = db.people
cursor = collection.find().sort('Age', pymongo.ASCENDING).limit(3)
for doc in cursor:
    print(doc)

{'_id': ObjectId('5d8efabd0ad7d60377d54cca'), 'Name': 'Sawyer, Neve M.', 'Age': 18, 'Country': 'Serbia', 'Location': '-34.37446, 174.0838'}
{'_id': ObjectId('5d8efabd0ad7d60377d54c89'), 'Name': 'Townsend, Cadman I.', 'Age': 19, 'Country': 'Somalia', 'Location': '-87.69188, -144.16138'}
{'_id': ObjectId('5d8efabd0ad7d60377d54cb1'), 'Name': 'Graham, Emerald O.', 'Age': 20, 'Country': 'Eritrea', 'Location': '61.35398, 28.04381'}


Here is a small demonstration of the **MongoDB** aggregation framework.
- We want to create a table of the number of persons in each country and their average age
- To do it we group by country
- We extract the results from **MongoDB** aggregation into a **Pandas** dataframe, and use the country as index

In [56]:
pipeline = [
    {'$group':
     {'_id': '$Country',
      'AvgAge': {'$avg': '$Age'},
      'Count': {'$sum': 1}}},
    {'$sort':
     {'Count': -1,
      'AvgAge': 1}}
]

# returns a cursor
aggResult = collection.aggregate(pipeline)

# use list to turn the cursor to an array of documents
df1 = pd.DataFrame(list(aggResult))
df1 = df1.set_index('_id')
df1.head()

| _id | AvgAge | Count |
|---|---|---|
| China | 46.25 | 4 |
| Antarctica | 46.333333 | 3 |
| Guernsey | 48.333333 | 3 |
| Puerto Rico | 26.5 | 2 |
| Heard Island and Mcdonald Islands | 29.0 | 2 |
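To make the `$group` and `$sort` stages concrete, the same logic can be written in plain Python. This sketch runs on a handful of documents copied from the outputs in this notebook (not the full `people` collection), so only the China row can be compared with the table above; `group_by_country` is a hypothetical name.

```python
from collections import defaultdict

# A small subset of the documents shown in this notebook's outputs
people = [
    {'Name': 'Holman, Hasad O.', 'Age': 32, 'Country': 'China'},
    {'Name': 'Byrd, Dante A.', 'Age': 43, 'Country': 'China'},
    {'Name': 'Carney, Tamekah I.', 'Age': 57, 'Country': 'China'},
    {'Name': 'Mayer, Violet U.', 'Age': 53, 'Country': 'China'},
    {'Name': 'Sawyer, Neve M.', 'Age': 18, 'Country': 'Serbia'},
    {'Name': 'Townsend, Cadman I.', 'Age': 19, 'Country': 'Somalia'},
    {'Name': 'Graham, Emerald O.', 'Age': 20, 'Country': 'Eritrea'},
]


def group_by_country(docs):
    """Emulate the $group stage, then the $sort stage, of the pipeline."""
    ages = defaultdict(list)
    for doc in docs:
        ages[doc['Country']].append(doc['Age'])   # '_id': '$Country'
    rows = [
        {'_id': country, 'AvgAge': sum(a) / len(a), 'Count': len(a)}
        for country, a in ages.items()
    ]
    # {'Count': -1, 'AvgAge': 1}: largest count first, ties by ascending average age
    rows.sort(key=lambda r: (-r['Count'], r['AvgAge']))
    return rows


for row in group_by_country(people):
    print(row)
```

On a large collection the difference is where the work happens: the pipeline runs inside the database and only the summary rows travel to Python, whereas this emulation needs every document in memory first.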


For simple cases one can either pass a filter document to `find()`, e.g. `find({'Country': 'China'})`, or use the `$match` operator in the aggregation framework.

In [55]:
pipeline = [
    {'$match': {'Country': 'China'}}
]

aggResult = collection.aggregate(pipeline)

df2 = pd.DataFrame(list(aggResult))
df2.head()

| | Age | Country | Location | Name | _id |
|---|---|---|---|---|---|
| 0 | 32 | China | 39.9127, 116.3833 | Holman, Hasad O. | 5d8efabd0ad7d60377d54c86 |
| 1 | 43 | China | 31.2, 121.5 | Byrd, Dante A. | 5d8efabd0ad7d60377d54cc2 |
| 2 | 57 | China | 45.75, 126.6333 | Carney, Tamekah I. | 5d8efabd0ad7d60377d54ccb |
| 3 | 53 | China | 40, 95 | Mayer, Violet U. | 5d8efabd0ad7d60377d54cd9 |


Now we can apply all the power of Python libraries to analyse and visualise the data.

Here, we will use the folium package to plot markers for the locations of the people we just found in China (click on a marker to see their data)

In [54]:
world_map = folium.Map(location=[35, 100], zoom_start=4)
for i in range(len(df2)):
    location = [float(loc) for loc in df2.Location[i].split(',')]
    folium.Marker(
        location=location,
        popup=df2.Name[i] + ', age:' + str(df2.Age[i])).add_to(world_map)

world_map
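The `Location` field is stored as a single `'lat, lon'` string, so the list comprehension above does the parsing inline. A slightly more defensive version is sketched below; `parse_location` is a hypothetical helper, and the range checks are an added assumption about what counts as a valid coordinate.

```python
def parse_location(text):
    """Convert a 'lat, lon' string into a [lat, lon] list of floats."""
    parts = text.split(',')
    if len(parts) != 2:
        raise ValueError('expected "lat, lon", got %r' % text)
    lat, lon = (float(p) for p in parts)  # float() tolerates surrounding spaces
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        raise ValueError('coordinates out of range: %r' % text)
    return [lat, lon]


print(parse_location('39.9127, 116.3833'))  # [39.9127, 116.3833]
print(parse_location('40, 95'))             # [40.0, 95.0]
```

In the marker loop, `location = parse_location(df2.Location[i])` could replace the inline list comprehension, so that a malformed record fails with a clear message instead of producing a misplaced marker.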

In [None]:
# close the connection at the very end
client.close()

## HOMEWORK

1. Read up on how to perform aggregation in MongoDB
    - Insert a duplicate record into the collection
    ```python
    mydict = {'name': 'John', 'address': 'Highway 37'}
    ```
    - Now write a command to find docs with a duplicate 'name' field (using aggregation) and remove them
    - Print the collection

2. Read up on how to apply indexes in MongoDB
    - Create an index on the 'name' and 'address' fields in this collection
    - Print the indexes for the collection