<font size="+3"><strong>3.1. Wrangling Data with MongoDB 🇰🇪</strong></font>

In [1]:
from pprint import PrettyPrinter

import pandas as pd
from IPython.display import VimeoVideo
from pymongo import MongoClient

In [2]:
VimeoVideo("665412094", h="8334dfab2e", width=600)

In [3]:
VimeoVideo("665412135", h="dcff7ab83a", width=600)

**Task 3.1.1:** Instantiate a `PrettyPrinter`, and assign it to the variable `pp`.

- [<span id='technique'>Construct a `PrettyPrinter` instance in <span id='tool'>pprint.](../%40textbook/11-databases-mongodb.ipynb#Servers-and-Clients)

In [4]:
pp = PrettyPrinter(indent=2)
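
As a quick check of what `indent=2` does, here is a minimal sketch that formats a small nested dictionary. The `sample` record is made up for illustration and does not come from the database; `pformat` is used so the formatted text can be inspected as a string.

```python
from pprint import PrettyPrinter

# Same settings as the notebook's `pp`.
pp = PrettyPrinter(indent=2)

# Hypothetical record, shaped loosely like a sensor document.
sample = {
    "temperature": 16.5,
    "metadata": {"site": 29, "measurement": "temperature", "sensor_type": "DHT22"},
    "tags": ["nairobi", "air-quality", "illustrative-only"],
}

# pformat returns the formatted text instead of printing it.
formatted = pp.pformat(sample)
print(formatted)
```

Because the record is wider than the default 80-character width, the printer breaks it across lines and, by default, sorts the dictionary keys.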

# Prepare Data

## Connect

In [5]:
VimeoVideo("665412155", h="1ca0dd03d0", width=600)

**Task 3.1.2:** Create a client that connects to the database running at `localhost` on port `27017`.

- [What's a <span id='term'>database client?](../%40textbook/11-databases-mongodb.ipynb#Servers-and-Clients)
- [What's a <span id='term'>database server?](../%40textbook/11-databases-mongodb.ipynb#Servers-and-Clients)
- [<span id='technique'>Create a client object for a <span id='tool'>MongoDB</span> instance.](../%40textbook/11-databases-mongodb.ipynb#Servers-and-Clients) 

In [6]:
client = MongoClient(host="localhost", port=27017)

## Explore

In [7]:
VimeoVideo("665412176", h="6fea7c6346", width=600)

**Task 3.1.3:** Print a list of the databases available on `client`.

- [What's an <span id='term'>iterator?](../%40textbook/02-python-advanced.ipynb#Iterators-and-Iterables-)
- [<span id='technique'>List the databases of a server using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Exploring-a-Database)
- [<span id='technique'>Print output using <span id='tool'>pprint.](../%40textbook/11-databases-mongodb.ipynb#Exploring-a-Database)

In [8]:
pp.pprint(list(client.list_databases()))

[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
  {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 6873088},
  {'empty': False, 'name': 'config', 'sizeOnDisk': 12288},
  {'empty': False, 'name': 'local', 'sizeOnDisk': 73728}]
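
`list_databases` is wrapped in `list(...)` above because it hands back an iterator rather than a list. The sketch below uses a plain generator as a stand-in for that iterator (the `fake_cursor` function is hypothetical, not part of PyMongo) to show that such an object can only be consumed once.

```python
# A stand-in for a database cursor: a generator that yields dictionaries.
# The names mirror the output above; the generator itself is hypothetical.
def fake_cursor():
    for name in ["admin", "air-quality", "config", "local"]:
        yield {"name": name}


cursor = fake_cursor()
first_pass = [d["name"] for d in cursor]   # consumes every item
second_pass = [d["name"] for d in cursor]  # nothing left to yield

print(first_pass)
print(second_pass)
```

If the results are needed more than once, storing them in a list first avoids having to re-run the query.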


In [9]:
VimeoVideo("665412216", h="7d4027dc33", width=600)

**Task 3.1.4:** Assign the `"air-quality"` database to the variable `db`.

- [What's a <span id='term'>MongoDB database?](../%40textbook/11-databases-mongodb.ipynb#Databases)
- [<span id='technique'>Access a database using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Servers-and-Clients)

In [10]:
db = client["air-quality"]

In [12]:
VimeoVideo("665412231", h="89c546b00f", width=600)

**Task 3.1.5:** Use the [`list_collections`](https://pymongo.readthedocs.io/en/stable/api/pymongo/database.html?highlight=list_collections#pymongo.database.Database.list_collections) method to print a list of the collections available in `db`.

- [What's a <span id='term'>MongoDB collection?](../%40textbook/11-databases-mongodb.ipynb#Collections)
- [<span id='technique'>List the collections in a database using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Exploring-a-Database)

In [15]:
for collection in db.list_collections():
    print(collection['name'])

lagos
system.buckets.lagos
dar-es-salaam
system.buckets.dar-es-salaam
system.views
nairobi
system.buckets.nairobi
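
Names that start with `system.` are internal collections that MongoDB maintains, such as the buckets behind time-series collections. A minimal sketch, assuming the names printed above have been copied into a plain list, of how to keep only the collections you would query directly:

```python
# Collection names copied from the output above.
names = [
    "lagos",
    "system.buckets.lagos",
    "dar-es-salaam",
    "system.buckets.dar-es-salaam",
    "system.views",
    "nairobi",
    "system.buckets.nairobi",
]

# Drop MongoDB's internal collections, which are prefixed with "system.".
user_collections = [name for name in names if not name.startswith("system.")]
print(user_collections)
```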


In [26]:
for collection in db.list_collections():
    print(collection)  # each entry has the keys: name, type, options, info

{'name': 'lagos', 'type': 'timeseries', 'options': {'timeseries': {'timeField': 'timestamp', 'metaField': 'metadata', 'granularity': 'seconds', 'bucketMaxSpanSeconds': 3600}}, 'info': {'readOnly': False}}
{'name': 'system.buckets.lagos', 'type': 'collection', 'options': {'validator': {'$jsonSchema': {'bsonType': 'object', 'required': ['_id', 'control', 'data'], 'properties': {'_id': {'bsonType': 'objectId'}, 'control': {'bsonType': 'object', 'required': ['version', 'min', 'max'], 'properties': {'version': {'bsonType': 'number'}, 'min': {'bsonType': 'object', 'required': ['timestamp'], 'properties': {'timestamp': {'bsonType': 'date'}}}, 'max': {'bsonType': 'object', 'required': ['timestamp'], 'properties': {'timestamp': {'bsonType': 'date'}}}, 'closed': {'bsonType': 'bool'}}}, 'data': {'bsonType': 'object'}, 'meta': {}}, 'additionalProperties': False}}, 'clusteredIndex': True, 'timeseries': {'timeField': 'timestamp', 'metaField': 'metadata', 'granularity': 'seconds', 'bucketMaxSpanSecon

In [14]:
VimeoVideo("665412252", h="bff2abbdc0", width=600)

**Task 3.1.6:** Assign the `"nairobi"` collection in `db` to the variable name `nairobi`.

- [<span id='technique'>Access a collection in a database using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Collections)

In [22]:
lagos = db["lagos"]

In [23]:
lagos.count_documents({})

166496

In [16]:
nairobi = db["nairobi"]

In [17]:
nairobi

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'air-quality'), 'nairobi')

In [18]:
VimeoVideo("665412270", h="e4a5f5c84b", width=600)

**Task 3.1.7:** Use the [`count_documents`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.count_documents) method to see how many documents are in the `nairobi` collection.

- [What's a <span id='term'>MongoDB document?](../%40textbook/11-databases-mongodb.ipynb#Documents)
- [<span id='technique'>Count the documents in a collection using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Documents) 

In [19]:
nairobi.count_documents({})

202212

In [20]:
VimeoVideo("665412279", h="c2315f3be1", width=600)

**Task 3.1.8:** Use the [`find_one`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find_one) method to retrieve one document from the `nairobi` collection, and assign it to the variable name `result`.

- [What's <span id='term'>metadata?](../%40textbook/11-databases-mongodb.ipynb#Metadata)
- [What's <span id='term'>semi-structured data?](../%40textbook/11-databases-mongodb.ipynb#Semi-structured-Data)
- [<span id='technique'>Retrieve a document from a collection using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Retrieving-Data)

In [21]:
result = nairobi.find_one({}) #gives one record in nairobi collection
pp.pprint(result)

{ '_id': ObjectId('61c4aee0203ab8b7db80ee9b'),
  'metadata': { 'lat': -1.3,
                'lon': 36.785,
                'measurement': 'temperature',
                'sensor_id': 58,
                'sensor_type': 'DHT22',
                'site': 29},
  'temperature': 16.5,
  'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 4, 301000)}


In [24]:
VimeoVideo("665412306", h="e1e913dfd1", width=600)

**Task 3.1.9:** Use the [`distinct`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.distinct) method to determine how many sensor sites are included in the `nairobi` collection.

- [<span id='technique'>Get a list of distinct values for a key among all documents using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Analyzing-Data)

In [27]:
nairobi.distinct("metadata.site")  # two sites in this dataset: 6 and 29

[6, 29]
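
The dotted key `"metadata.site"` tells MongoDB to look inside the nested `metadata` document. The sketch below imitates that lookup in plain Python on a few made-up documents. The `get_path` helper is hypothetical and only illustrates the idea; it is not how PyMongo or the server implements `distinct`.

```python
def get_path(doc, dotted_key):
    """Follow a dotted key such as 'metadata.site' through nested dicts."""
    value = doc
    for part in dotted_key.split("."):
        value = value[part]
    return value


# Hypothetical documents with the same nesting as the `nairobi` records.
docs = [
    {"metadata": {"site": 29, "measurement": "temperature"}},
    {"metadata": {"site": 6, "measurement": "P2"}},
    {"metadata": {"site": 29, "measurement": "P2"}},
]

# Collect the unique values found at the dotted path.
sites = sorted({get_path(d, "metadata.site") for d in docs})
print(sites)
```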

In [28]:
VimeoVideo("665412322", h="4776c6d548", width=600)

**Task 3.1.10:** Use the [`count_documents`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.count_documents) method to determine how many readings there are for each site in the `nairobi` collection.

- [<span id='technique'>Count the documents in a collection using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Analyzing-Data) 

In [30]:
nairobi.count_documents({"metadata.site": 6})


70360

In [34]:
print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))
print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))

Documents from site 6: 70360
Documents from site 29: 131852


In [35]:
VimeoVideo("665412344", h="d2354584cd", width=600)

**Task 3.1.11:** Use the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method to determine how many readings there are for each site in the `nairobi` collection.

- [<span id='technique'>Perform aggregation calculations on documents using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Analyzing-Data)

In [37]:
# Group-by operation using PyMongo: `aggregate` takes a list of pipeline
# stages that run in order (similar in spirit to a scikit-learn pipeline).
result = nairobi.aggregate(
    [
        # The "$" prefix in "$metadata.site" refers to the value of that field;
        # "$group" is the stage operator and "$count" is the accumulator.
        {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))

[{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
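
Conceptually, a `$group` stage with a `$count` accumulator buckets documents by the `_id` expression and counts each bucket (in a `$group` stage, `{"$count": {}}` behaves like `{"$sum": 1}`). The sketch below reproduces that idea with `collections.Counter` on three made-up documents; it is an analogy in plain Python, not the server's implementation.

```python
from collections import Counter

# Hypothetical documents with the same nesting as the `nairobi` records.
docs = [
    {"metadata": {"site": 29}, "P2": 34.43},
    {"metadata": {"site": 6}, "P2": 12.1},
    {"metadata": {"site": 29}, "P2": 30.53},
]

# Count documents per site, then shape the result like the pipeline output.
counts = Counter(doc["metadata"]["site"] for doc in docs)
grouped = [{"_id": site, "count": n} for site, n in sorted(counts.items())]
print(grouped)
```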


In [38]:
VimeoVideo("665412372", h="565122c9cc", width=600)

**Task 3.1.12:** Use the [`distinct`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.distinct) method to determine how many types of measurements have been taken in the `nairobi` collection.

- [<span id='technique'>Get a list of distinct values for a key among all documents using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Analyzing-Data)

In [39]:
nairobi.distinct("metadata.measurement")

['humidity', 'P1', 'P2', 'temperature']

In [40]:
VimeoVideo("665412380", h="f7f7a39bb3", width=600)

**Task 3.1.13:** Use the [`find`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find) method to retrieve the PM 2.5 readings from all sites. Be sure to limit your results to 3 records only.

- [<span id='technique'>Query a collection using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Retrieving-Data)

In [41]:
result = nairobi.find({"metadata.measurement": "P2"}).limit(3)  # PM 2.5 readings are stored under the measurement name "P2"
pp.pprint(list(result))

[ { 'P2': 34.43,
    '_id': ObjectId('61c4aee3203ab8b7db82711c'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
  { 'P2': 30.53,
    '_id': ObjectId('61c4aee3203ab8b7db82711d'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
  { 'P2': 22.8,
    '_id': ObjectId('61c4aee3203ab8b7db82711e'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(

In [42]:
VimeoVideo("665412389", h="8976ea3090", width=600)

**Task 3.1.14:** Use the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method to calculate how many readings there are for each type (`"humidity"`, `"temperature"`, `"P2"`, and `"P1"`) in site `6`.

- [<span id='technique'>Perform aggregation calculations on documents using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Analyzing-Data)

In [45]:

result = nairobi.aggregate(
    [
        # Keep only site 6, then count documents per measurement type
        {"$match": {"metadata.site": 6}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]
)
pp.pprint(list(result))

[ {'_id': 'temperature', 'count': 17011},
  {'_id': 'humidity', 'count': 17011},
  {'_id': 'P1', 'count': 18169},
  {'_id': 'P2', 'count': 18169}]
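
Since a pipeline is just a list of dictionaries, the same `$match` then `$group` pattern can be built for any site with a small function. The helper name `readings_by_type_pipeline` is hypothetical and introduced here only for illustration.

```python
def readings_by_type_pipeline(site):
    """Build a pipeline that counts readings per measurement type at one site."""
    return [
        # Stage 1: keep only documents from the requested site.
        {"$match": {"metadata.site": site}},
        # Stage 2: group what remains by measurement type and count each group.
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]


# Build one pipeline per site; no database connection is needed for this step.
pipelines = {site: readings_by_type_pipeline(site) for site in (6, 29)}
print(pipelines[6])

# With a live connection, a pipeline would be passed to the collection, e.g.
# nairobi.aggregate(readings_by_type_pipeline(6))
```

Order matters: placing `$match` first means the grouping stage only sees documents from the chosen site.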


In [46]:
VimeoVideo("665412418", h="0c4b125254", width=600)

**Task 3.1.15:** Use the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method to calculate how many readings there are for each type (`"humidity"`, `"temperature"`, `"P2"`, and `"P1"`) in site `29`.

- [<span id='technique'>Perform aggregation calculations on documents using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Analyzing-Data)

In [47]:

result = nairobi.aggregate(
    [
        # Keep only site 29, then count documents per measurement type
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]
)
pp.pprint(list(result))

[ {'_id': 'temperature', 'count': 33019},
  {'_id': 'humidity', 'count': 33019},
  {'_id': 'P1', 'count': 32907},
  {'_id': 'P2', 'count': 32907}]
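
The counts reported in this notebook can be cross-checked with simple arithmetic: the per-type counts should add up to each site's total, and the site totals should add up to the size of the collection. A minimal sketch, with every number copied from the outputs above:

```python
# Numbers copied from the cell outputs in this notebook.
total_documents = 202212
site_totals = {6: 70360, 29: 131852}
by_type = {
    6: {"temperature": 17011, "humidity": 17011, "P1": 18169, "P2": 18169},
    29: {"temperature": 33019, "humidity": 33019, "P1": 32907, "P2": 32907},
}

# Per-type counts should sum to the per-site totals.
type_sums = {site: sum(counts.values()) for site, counts in by_type.items()}
print(type_sums == site_totals)

# Per-site totals should sum to the whole collection.
print(sum(site_totals.values()) == total_documents)
```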


## Import

In [48]:
VimeoVideo("665412437", h="7a436c7e7e", width=600)

**Task 3.1.16:** Use the [`find`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find) method to retrieve the PM 2.5 readings from site `29`. Be sure to limit your results to 3 records only. Since we won't need the metadata for our model, use the `projection` argument to limit the results to the `"P2"` and `"timestamp"` keys only.

- [<span id='technique'>Query a collection using <span id='tool'>PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Retrieving-Data)

In [51]:
# `projection` controls which keys are returned; pass it as a dictionary
result = nairobi.find(
    {"metadata.site": 29, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0},  # 1 includes a key; "_id" is returned unless explicitly set to 0
)
pp.pprint(result.next())  # pull only the first document from the cursor

{'P2': 34.43, 'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
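
To make the inclusion rules concrete, here is a simplified plain-Python imitation of a projection over top-level keys. The `project` function is hypothetical and covers only the case used in this notebook (keys set to `1`, with `_id` kept unless set to `0`); real MongoDB projections support more than this. The `_id` value is shown as a plain string stand-in rather than an `ObjectId`.

```python
from datetime import datetime


def project(doc, projection):
    """Simplified inclusion projection over top-level keys only."""
    out = {}
    # "_id" is included by default and dropped only when explicitly set to 0.
    if projection.get("_id", 1) and "_id" in doc:
        out["_id"] = doc["_id"]
    for key, flag in projection.items():
        if key != "_id" and flag and key in doc:
            out[key] = doc[key]
    return out


# Values copied from the first P2 record shown earlier in this notebook.
doc = {
    "_id": "61c4aee3203ab8b7db82711c",
    "P2": 34.43,
    "metadata": {"site": 29, "measurement": "P2"},
    "timestamp": datetime(2018, 9, 1, 0, 0, 2, 472000),
}

without_id = project(doc, {"P2": 1, "timestamp": 1, "_id": 0})
with_id = project(doc, {"P2": 1})
print(without_id)
print(with_id)
```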


In [52]:
VimeoVideo("665412442", h="494636d1ea", width=600)

**Task 3.1.17:** Read records from your `result` into the DataFrame `df`. Be sure to set the index to `"timestamp"`.

- [<span id='technique'>Create a DataFrame from a dictionary using <span id='tool'>pandas.](../%40textbook/03-pandas-getting-started.ipynb#Working-with-DataFrames)

In [55]:
result = nairobi.find(
    {"metadata.site": 29, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0},  # keep "P2" and "timestamp"; drop "_id"
)
# `result` is an iterator, so it is exhausted after one pass; re-run the query before reading it again
df = pd.DataFrame(result).set_index("timestamp")
df.head()

| timestamp | P2 |
|---|---|
| 2018-09-01 00:00:02.472 | 34.43 |
| 2018-09-01 00:05:03.941 | 30.53 |
| 2018-09-01 00:10:04.374 | 22.80 |
| 2018-09-01 00:15:04.245 | 13.30 |
| 2018-09-01 00:20:04.869 | 16.57 |
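
The same DataFrame construction can be tried without a database connection. The sketch below feeds pandas an iterator of dictionaries, standing in for the PyMongo cursor, using the first three rows shown above; the name `df_demo` is introduced here only to avoid overwriting `df`.

```python
from datetime import datetime

import pandas as pd

# An iterator of records, standing in for the cursor returned by `find`.
records = iter(
    [
        {"P2": 34.43, "timestamp": datetime(2018, 9, 1, 0, 0, 2, 472000)},
        {"P2": 30.53, "timestamp": datetime(2018, 9, 1, 0, 5, 3, 941000)},
        {"P2": 22.8, "timestamp": datetime(2018, 9, 1, 0, 10, 4, 374000)},
    ]
)

# Each dictionary becomes a row; moving "timestamp" to the index
# leaves "P2" as the only column.
df_demo = pd.DataFrame(records).set_index("timestamp")
print(df_demo)
```

Because the `timestamp` values are `datetime` objects, setting that column as the index should yield a `DatetimeIndex`, which is what the check in the next cell expects of `df`.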


In [56]:
# Check your work
assert df.shape[1] == 1, f"`df` should have only one column, not {df.shape[1]}."
assert df.columns == [
    "P2"
], f"The single column in `df` should be `'P2'`, not {df.columns[0]}."
assert isinstance(df.index, pd.DatetimeIndex), "`df` should have a `DatetimeIndex`."

---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
