<div style="float:left;font-size:20px;">
    <h1>MongoDB</h1>
</div><div style="float:right;"><img src="../assets/banner.jpg"></div>

<hr>

Key features of MongoDB:

- Schemaless, unlike relational databases. This gives greater flexibility but can make the data harder to maintain.
- Stores data as BSON, a binary encoding of JSON-like documents, and can power high-volume applications.
- Drivers exist for nearly every language, including C/C++, Python, PHP, Ruby, Perl, .NET and Node.js.
- Stores data in _collections_, rather than tables.
- A _collection_ is comparable to a table, except that there are no fixed columns: each entry (row) can have its own set of key-value pairs.
- Each of these entries or rows inside a collection is called a _document_. 
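
To make the collection/document terminology concrete, here is a minimal sketch in plain Python with no MongoDB involved. A collection is modelled as a list of dicts, and the documents in it do not share a fixed set of keys. The names `cats` and `fields_in_use` are made up for this illustration.

```python
# A "collection" modelled as a plain list of dict "documents".
# Illustration only: real documents live on the MongoDB server as BSON.
cats = [
    {"name": "Tom", "age": 3},
    {"name": "Felix", "breed": "Tabby", "tags": ["indoor", "shy"]},
    {"name": "Salem"},
]


def fields_in_use(collection):
    """Return the set of keys that appear in at least one document."""
    keys = set()
    for document in collection:
        keys.update(document)
    return keys


all_fields = fields_in_use(cats)
# A document that lacks a key simply does not have it; there is no NULL column.
cats_without_age = [doc["name"] for doc in cats if "age" not in doc]

print(sorted(all_fields))
print(cats_without_age)
```

This flexibility is also where the maintenance cost comes from: code that reads the collection has to cope with keys that may be absent.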

Further reading: https://www.hongkiat.com/blog/webdev-with-mongodb-part1/


## Setup

Configuration file: `C:\Program Files\MongoDB\Server\4.0\bin\mongod.cfg`

Run the server with the config: 

`cd C:\Program Files\MongoDB\Server\4.0\bin
mongod --config mongod.cfg`   
 
 
Connect to the DB with either Python or MongoDB Compass:

`
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
client['dev'].list_collection_names()
client['dev']['test'].find_one()  # Get a single document (None if nothing matches)
client['dev']['test'].find()  # Get a lazy Cursor over all documents
`
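
The connection string packs the scheme, host and port into a single URI. As a rough illustration of what it carries, the sketch below splits it with the standard library. PyMongo has its own URI parser, so `urlsplit` is used here only to show the parts, not to mirror what the driver does internally.

```python
from urllib.parse import urlsplit

uri = "mongodb://localhost:27017/"
parts = urlsplit(uri)

scheme = parts.scheme   # protocol prefix
host = parts.hostname   # server address
port = parts.port       # TCP port as an int (27017 is MongoDB's default)

print(scheme, host, port)
```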

More examples in:

*R:\Dropbox\Python\CatAna\examples\data\mongodb\mongo.py*

__Note:__ The maximum size of a single BSON document is 16 MB.

# PyMongo
http://api.mongodb.com/python/current/tutorial.html
    
An important note about collections (and databases) in MongoDB is that they are created lazily - none of the above commands have actually performed any operations on the MongoDB server. Collections and databases are created when the first document is inserted into them.

- MongoDB stores data in BSON format. BSON strings are UTF-8 encoded so PyMongo must ensure that any strings it stores contain only valid UTF-8 data.
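
As a small sketch of what the UTF-8 requirement means in practice, the helper below checks whether raw bytes from an external source decode cleanly as UTF-8 before they are turned into a `str` and stored. `is_valid_utf8` is a hypothetical helper written for this note; it is not part of PyMongo.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


good = "café".encode("utf-8")    # valid UTF-8 bytes
bad = "café".encode("latin-1")   # b'caf\xe9' is not valid UTF-8

print(is_valid_utf8(good), is_valid_utf8(bad))
```

Bytes that fail the check need to be decoded with their real encoding first, after which the resulting `str` can be stored as usual.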

## Examples

In [2]:
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')

# List available DBs
client.list_database_names()

['admin', 'arctic', 'config', 'local', 'meta_db']

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'admin')

In [14]:
# Connect to a DB and list collections
dev = client['dev']
dev = MongoClient().dev  # Alternatively
dev.list_collection_names()

['posts']

In [13]:
# Writing data
dev = MongoClient().dev

import datetime
post = {"author": "Mike",
        "text": "My first blog post!",
        "tags": ["mongodb", "python", "pymongo"],
        "date": datetime.datetime.utcnow()}

result = dev.posts.insert_one(post)
result  # InsertOneResult; the new document's ID is result.inserted_id

<pymongo.results.InsertOneResult at 0x27f1abdb2c8>

In [18]:
# Retrieve data by ID
stored_post = dev.posts.find_one(result.inserted_id)
stored_post

{'_id': ObjectId('5e850256b5ec12a738d3515c'),
 'author': 'Mike',
 'text': 'My first blog post!',
 'tags': ['mongodb', 'python', 'pymongo'],
 'date': datetime.datetime(2020, 4, 1, 21, 6, 30, 290000)}

In [19]:
# Retrieve data by search
dev.posts.find_one({'author': 'Mike'})

{'_id': ObjectId('5e850256b5ec12a738d3515c'),
 'author': 'Mike',
 'text': 'My first blog post!',
 'tags': ['mongodb', 'python', 'pymongo'],
 'date': datetime.datetime(2020, 4, 1, 21, 6, 30, 290000)}

### Bulk inserts

In [27]:
# Create a collection of cats and insert 1000 new documents
# (explicit creation is optional: collections are created lazily on first insert)
if 'cats' not in dev.list_collection_names():
    dev.create_collection('cats')

import numpy as np
n_cats = 1000
ages = np.random.uniform(0, 10, n_cats)
weights = np.random.uniform(1, 4, n_cats)
breeds = ['Tabby', 'Bombay', 'Calico', 'Siamese']

cat_data = [{'age': age, 'weight': weight, 'breed': np.random.choice(breeds)} for age, weight in zip(ages, weights)]
    
dev.cats.insert_many(cat_data)

<pymongo.results.InsertManyResult at 0x27f1b68e488>

In [37]:
# Find multiple data elements, result is a lazy Cursor object
tabby_cats = dev.cats.find({'breed': 'Tabby'})
type(tabby_cats)

pymongo.cursor.Cursor

In [38]:
# First two elements (limit on the server instead of fetching every match)
list(tabby_cats.limit(2))

[{'_id': ObjectId('5e8505eeb5ec12a738d3554b'),
  'age': 6.0873644412467085,
  'weight': 3.517296448453242,
  'breed': 'Tabby'},
 {'_id': ObjectId('5e8505eeb5ec12a738d3554c'),
  'age': 3.4481095661142436,
  'weight': 1.1312420890279995,
  'breed': 'Tabby'}]

In [43]:
# Filtering by comparison operators and sorting
filtered_cats = dev.cats.find({'age': {'$lt': 2}}).sort("weight")

In [44]:
# First three elements (limit on the server instead of fetching every match)
list(filtered_cats.limit(3))

[{'_id': ObjectId('5e850438b5ec12a738d35470'),
  'age': 0.06707121053268095,
  'weight': 1.0072234368744741},
 {'_id': ObjectId('5e850438b5ec12a738d3530d'),
  'age': 0.23598394793078015,
  'weight': 1.0101494189603266},
 {'_id': ObjectId('5e8505eeb5ec12a738d35616'),
  'age': 0.9181937282119534,
  'weight': 1.0129457898180672,
  'breed': 'Siamese'}]
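
To show what the `{'age': {'$lt': 2}}` query and the sort mean, here is a minimal in-memory sketch in plain Python. It is a stand-in for the semantics only, not how MongoDB evaluates queries; `COMPARISONS` and `matches` are hypothetical names, and the sample documents are made up.

```python
import operator

# Hypothetical in-memory stand-in for a few of MongoDB's comparison operators.
COMPARISONS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}


def matches(document, query):
    """Check a document against a query of the form {field: {op: value}}."""
    for field, conditions in query.items():
        if field not in document:
            return False
        for op, value in conditions.items():
            if not COMPARISONS[op](document[field], value):
                return False
    return True


cats = [
    {"age": 0.5, "weight": 3.0},
    {"age": 5.0, "weight": 1.5},
    {"age": 1.5, "weight": 1.2, "breed": "Siamese"},
]

# Equivalent in spirit to find({'age': {'$lt': 2}}).sort('weight')
young = [doc for doc in cats if matches(doc, {"age": {"$lt": 2}})]
young_by_weight = sorted(young, key=lambda doc: doc["weight"])

print(young_by_weight)
```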

### Aggregation

In [55]:
from bson.son import SON
# SON keeps key order, which matters for "$sort". Plain dicts also preserve
# insertion order from Python 3.7; collections.OrderedDict is another option.
pipeline = [
     {"$unwind": "$breed"},
     {"$group": {"_id": "$breed", "count": {"$sum": 1}}},
     {"$sort": SON([("count", -1), ("_id", -1)])}
]

import pprint
pprint.pprint(list(dev.cats.aggregate(pipeline)))

[{'_id': 'Calico', 'count': 268},
 {'_id': 'Tabby', 'count': 252},
 {'_id': 'Siamese', 'count': 241},
 {'_id': 'Bombay', 'count': 239}]
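
For comparison, the same group-count-sort logic can be sketched in plain Python on a handful of made-up documents. This only mirrors the stages of the pipeline; it says nothing about how the server executes them. Documents without a `breed` field are skipped, which matches the default behaviour of `$unwind` on a missing field.

```python
from collections import Counter

cats = [
    {"breed": "Tabby"},
    {"breed": "Calico"},
    {"breed": "Tabby"},
    {"age": 2.0},          # no breed: skipped, like the documents $unwind drops
    {"breed": "Bombay"},
]

# $group with {"$sum": 1}: count documents per breed.
counts = Counter(doc["breed"] for doc in cats if "breed" in doc)

# $sort on count descending, then _id descending as the tie-breaker.
grouped = sorted(
    ({"_id": breed, "count": n} for breed, n in counts.items()),
    key=lambda row: (row["count"], row["_id"]),
    reverse=True,
)

print(grouped)
```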


### Aggregation - Map/Reduce

In [65]:
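from bson.code import Code
# Note: forEach only works on array fields (such as the 'tags' list in the posts
# example). 'age' is a single number here, so this mapper fails on the cats
# collection; the next cell redefines mapper and reducer.
mapper = Code("""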
              function () {
                this.age.forEach(function(z) {
                  emit(z, 1);
                });
              }
              """)

reducer = Code("""
               function (key, values) {
                 var total = 0;
                 for (var i = 0; i < values.length; i++) {
                   total += values[i];
                 }
                 return total;
               }
               """)

In [71]:
from bson.code import Code
mapper = Code("""
              function () {
                  emit(this.breed, this.age);
              };
              """)

reducer = Code("""
               function (key, values) {
                  return Array.sum(values);
               }
               """)

In [72]:
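# Note: Collection.map_reduce was removed in PyMongo 4.0; on newer versions use
# an aggregation pipeline or run the mapReduce command via dev.command().
result = dev.cats.map_reduce(mapper, reducer, "myresults")
for doc in result.find():
    pprint.pprint(doc)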

{'_id': None, 'value': 5045.712360692306}
{'_id': 'Bombay', 'value': 1198.8239883284687}
{'_id': 'Calico', 'value': 1400.349592126633}
{'_id': 'Siamese', 'value': 1271.5255462473976}
{'_id': 'Tabby', 'value': 1308.4953704555014}
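
The map and reduce steps can be emulated in a few lines of plain Python, which also suggests where the `'_id': None` row comes from: documents without a `breed` field all emit under the same missing key. This is a sketch on made-up data, assuming the mapper and reducer from the last cell; the Python functions `mapper` and `reducer` here are stand-ins for the JavaScript versions.

```python
from collections import defaultdict

cats = [
    {"breed": "Tabby", "age": 2.0},
    {"breed": "Tabby", "age": 3.5},
    {"breed": "Bombay", "age": 1.0},
    {"age": 4.0},                      # no breed field
]


def mapper(doc):
    """Map step: emit one (key, value) pair per document."""
    return doc.get("breed"), doc["age"]


def reducer(key, values):
    """Reduce step: collapse all values emitted for one key."""
    return sum(values)


# Group the emitted values by key, then reduce each group.
emitted = defaultdict(list)
for doc in cats:
    key, value = mapper(doc)
    emitted[key].append(value)

results = {key: reducer(key, values) for key, values in emitted.items()}

print(results)
```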


### Views

In [49]:
dev.createView('catView', 'cats',
              )

TypeError: 'Collection' object is not callable. If you meant to call the 'createView' method on a 'Database' object it is failing because no such method exists.
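
`createView` is a mongo shell helper rather than a PyMongo method. PyMongo treats `dev.createView` as a collection named `createView`, and calling a collection raises the `TypeError` above. One way around this is to send MongoDB's `create` command with `viewOn` and `pipeline` fields. The sketch below only builds that command document so it can be inspected without a server; `build_create_view_command` is a hypothetical helper and the `$match` pipeline is an assumed example, not something taken from these notes.

```python
def build_create_view_command(view_name, source, pipeline):
    """Build the document for MongoDB's `create` command with a view definition."""
    # "create" must be the first key; dicts keep insertion order in Python 3.7+.
    return {
        "create": view_name,
        "viewOn": source,
        "pipeline": list(pipeline),
    }


# Hypothetical view: only the Tabby cats from the 'cats' collection.
command = build_create_view_command(
    "catView", "cats", [{"$match": {"breed": "Tabby"}}]
)

print(command)

# With a running server this would be sent as:
#     dev.command(command)
```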

# Arctic

Arctic is a Python wrapper for MongoDB that supports serialization of a number of datatypes for storage in the mongo document model. Data is compressed using LZ4, reducing both disk and I/O utilisation.

Features:
- Version Store (historised data with snapshot functionality)
- DataFrame Store
- NdArray Store
- Pickle Store

## Documentation
https://www.mongodb.com/press/man-ahl-arctic-open-source

https://github.com/manahl/arctic

https://arctic.readthedocs.io/en/latest/

### Examples
https://github.com/manahl/arctic/blob/master/howtos/how_to_use_arctic.py

https://github.com/manahl/arctic/blob/master/howtos/how_to_custom_arctic_library.py

http://api.mongodb.com/python/current/api/pymongo/collection.html

In [12]:
# Connect to MongoDB cluster
from arctic import Arctic
conn = Arctic('localhost')

In [14]:
# List libraries on cluster
conn.list_libraries()

['data', 'HomeUI', 'random', 'bigdata', 'home']

In [18]:
# List all data entries (symbols)
conn['data'].list_symbols()

['pythia/qqbar_Zgamma_fs_1000',
 'pythia/qqbar_Zgamma_full',
 'pythia/qqbar_Zgamma_full_1000',
 'pythia/qqbar_Zgamma_mu_100k',
 'pythia/qqbar_Zgamma_mu_10k']

In [20]:
# Create a new library
if 'home' not in conn.list_libraries():
    conn.initialize_library('home')

```python
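# Write new data (df is assumed to be an existing pandas DataFrame)
conn['data'].write('pythia/qqbar_Zgamma_fs_1000', df)

# Write the same symbol again, this time with metadata attached
data = conn['data']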
data.write('pythia/qqbar_Zgamma_fs_1000', df, metadata={'process': 'qqbar_Zgamma',
                                                        'generator': 'pythia',
                                                        'simulation': 'full',
                                                        'events': 1000})
```

### Catana DataStore

In [None]:
db = DataStore().connect('catana')
db.ls()
# Data is written in the format: 'project.category.name'
db.write(data=a, name='numbers')




Data access is lazy:
```python
>>> db['scratch.root.cheese']
 VersionedItem(symbol=scratch.root.cheese,library=arctic.catana,data=<class 'numpy.ndarray'>,version=1,metadata=None,host=localhost)
```
Read data by:
```python
numbers = db['scratch.root.cheese'].data
```
