# Introduction

This tutorial explains the concept of document-oriented databases using MongoDB in Python. Document-oriented databases do not enforce a fixed schema or explicit relations among data, so they are less structured than relational databases such as MySQL. Operations that read or write a whole document can often be faster than the analogous multi-table operations in a relational database, and document stores are typically easier to scale across machines. For these reasons, document-oriented databases such as MongoDB are popular for big data.

## Installing the Libraries

You'll need to install MongoDB and the Python library pymongo.

If you have Homebrew, you can install MongoDB in the following way.

```
$ brew update
$ brew install mongodb
```

If you want to install MongoDB with TLS/SSL, you can run

```
$ brew install mongodb --with-openssl
```

If you don't use Homebrew, you can download MongoDB [here](https://www.mongodb.com/download-center "MongoDB Download").

Now you should install pymongo, the Python library for MongoDB.

```
$ pip install --upgrade pymongo
```

# Document-Oriented vs. Relational Databases

We will begin by relating (pun intended) document-oriented databases to relational databases, starting with an example. Suppose you are taking two classes. For both of these classes, you have an associated list of test scores, which you have stored as `classes.csv`. We can view this data as a relational database in the following way.

In [1]:
import sqlite3
from AdditionalFiles import load_sql
conn = sqlite3.connect(":memory:")
load_sql(conn, "classes.csv")

    class  grade_1  grade_2  grade_3
0  15-388       90       70       80
1  15-213      100       50       70


This is very straightforward. We have a list of classes and corresponding test scores. However, things get more complicated when the classes have an unequal number of tests. For example, say you scored 95% on your fourth test in 15-213. You are eager to add this exam to the table, but this proves to be a difficult task. You'd have to add a fourth column, but there is no fourth test score for 15-388, so the table quickly becomes complicated.

Note that we could represent classes and grades in separate tables, with one table for classes and another table for the relations between classes and grades. However, even selecting all of the information would then require a join across tables, which adds complexity and can become costly as the data grows. Therefore, we will take an alternate approach.
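To make the two-table layout concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`classes`, `scores`, `test_name`) are made up for this illustration and are not part of `classes.csv` or the `load_sql` helper.

```python
import sqlite3

# in-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# hypothetical schema: one row per class, one row per (class, test) score
cur.execute("CREATE TABLE classes (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE scores ("
    "class_id INTEGER REFERENCES classes(id), test_name TEXT, score INTEGER)"
)

cur.executemany("INSERT INTO classes VALUES (?, ?)", [(1, "15-388"), (2, "15-213")])
cur.executemany("INSERT INTO scores VALUES (?, ?, ?)", [
    (1, "grade_1", 90), (1, "grade_2", 70), (1, "grade_3", 80),
    (2, "grade_1", 100), (2, "grade_2", 50), (2, "grade_3", 70),
])

# the fourth 15-213 test is just one more row; no new column is needed
cur.execute("INSERT INTO scores VALUES (?, ?, ?)", (2, "grade_4", 95))

# reading everything back requires a join between the two tables
rows = cur.execute("""
    SELECT classes.name, scores.test_name, scores.score
    FROM classes JOIN scores ON scores.class_id = classes.id
    ORDER BY classes.id, scores.test_name
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

The extra test fits without touching the schema, but every read of a class together with its scores now goes through the join.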

Here we introduce the *document-oriented database*. Suppose we have the same data for these classes stored in a file `classes.json`.

In [2]:
import json
# the with-block closes the file automatically
with open('classes.json', 'r') as classes_file:
    classes = json.load(classes_file)
print json.dumps(classes, sort_keys=True, indent=2, separators=(',',': '))

[
  {
    "course": "15-388",
    "scores": {
      "midterm_1": 90,
      "midterm_2": 70,
      "midterm_3": 80
    }
  },
  {
    "course": "15-213",
    "scores": {
      "midterm_1": 100,
      "midterm_2": 50,
      "midterm_3": 70
    }
  },
  {
    "course": "15-150",
    "scores": {}
  }
]


A document-oriented database allows us to represent a group of *documents* collectively known as a *collection*. In our example, each class is a *document*, and the full list of classes is a *collection*.

Notably, the documents do not all have to contain the same information, so we are able to add our fourth exam score to our 15-213 document with no problem! But how do we add the data to our document? To do this, we will introduce MongoDB.
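Before moving on, the flexibility is easy to see with plain Python dictionaries standing in for documents. This is only a sketch of the idea, with no database involved; the variable names are arbitrary.

```python
# documents modeled as plain dictionaries, mirroring classes.json
classes = [
    {"course": "15-388", "scores": {"midterm_1": 90, "midterm_2": 70, "midterm_3": 80}},
    {"course": "15-213", "scores": {"midterm_1": 100, "midterm_2": 50, "midterm_3": 70}},
    {"course": "15-150", "scores": {}},
]

# add a fourth score to 15-213 only; the other documents are untouched
for doc in classes:
    if doc["course"] == "15-213":
        doc["scores"]["midterm_4"] = 95

# each document now lists a different set of score names
for doc in classes:
    print("{}: {}".format(doc["course"], sorted(doc["scores"])))
```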

# An Introduction to MongoDB

In this section we will learn to use MongoDB, a popular implementation of a document-oriented database.

Let's first set up our database from our `classes.json`.

Note that this will require a MongoDB instance running on your local machine. To accomplish this, simply run

```
$ mongod
```

Note that you may have to pass additional flags or arguments. Run

```
$ mongod --help
```

for more information.

Notably, you should set up MongoDB to store its data in a directory `data/` with a subdirectory `db/` (so `<absolute_path>/data/db/`) and to record logs in a file within a directory `log/`. You can achieve these with the flags `--dbpath` and `--logpath`, respectively, as shown below.

```
$ mongod --dbpath <path>data/db/ --logpath <path>log/mongo.log
```

In [3]:
import bson, pymongo
from pymongo import MongoClient

conn = MongoClient()

# create db if it doesn't already exist
db = conn.test_database

# create collection grades
grades = db.grades

# make sure grades is empty, so we can run this function multiple times
grades.delete_many({})

# add entries, initialize list of id's of objects
id_objects = grades.insert_many(classes)
ids = id_objects.inserted_ids

# analogous to "SELECT * FROM classes;"
grades_list = grades.find()

# print all grades
for grade in grades_list:
    print grade

{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2f9'), u'scores': {u'midterm_1': 90, u'midterm_2': 70, u'midterm_3': 80}}
{u'course': u'15-213', u'_id': ObjectId('581c1d1cd64511962bfbb2fa'), u'scores': {u'midterm_1': 100, u'midterm_2': 50, u'midterm_3': 70}}
{u'course': u'15-150', u'_id': ObjectId('581c1d1cd64511962bfbb2fb'), u'scores': {}}


And we have successfully created a collection with our grades. But you want to put your great test score in the collection! Let's take a look at how to modify a document.

If we know the ID of the document, this process is quite simple. We can find a *single* document with `<collection>.find_one()` and update a single document with `<collection>.update_one()`.

In [4]:
# this is the id we want
courseId = ids[1]
print courseId

# print newline and the document
print
print grades.find_one({"_id": courseId})

# update the document
grades.update_one({'_id': courseId}, {"$set": {'scores.midterm_4': 95}}, upsert=True)

# print newline and the updated document
print
print grades.find_one({"_id": courseId})

581c1d1cd64511962bfbb2fa

{u'course': u'15-213', u'_id': ObjectId('581c1d1cd64511962bfbb2fa'), u'scores': {u'midterm_1': 100, u'midterm_2': 50, u'midterm_3': 70}}

{u'course': u'15-213', u'_id': ObjectId('581c1d1cd64511962bfbb2fa'), u'scores': {u'midterm_4': 95, u'midterm_1': 100, u'midterm_2': 50, u'midterm_3': 70}}


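Let's make a few notes. Firstly, we use the `$set` operator to set the key `scores.midterm_4` to the value `95`; the dotted path reaches into the nested `scores` dictionary, and `$set` creates the field if it doesn't already exist. Secondly, the parameter `upsert=True` tells MongoDB to insert a new document if no document matches the filter. Our filter matches an existing document, so `upsert=True` has no effect here.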
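To see what the dotted path in `$set` does, here is a rough pure-Python sketch of the same effect on an ordinary dictionary. The helper `set_by_path` is a hypothetical name invented for this illustration; it mimics the outcome of `$set`, not how MongoDB implements it.

```python
def set_by_path(doc, dotted_path, value):
    """Set a nested key, creating intermediate dictionaries as needed."""
    keys = dotted_path.split(".")
    target = doc
    for key in keys[:-1]:
        # walk down one level, creating an empty dictionary if the key is missing
        target = target.setdefault(key, {})
    target[keys[-1]] = value
    return doc

course = {
    "course": "15-213",
    "scores": {"midterm_1": 100, "midterm_2": 50, "midterm_3": 70},
}

# same path and value as the update_one call above
set_by_path(course, "scores.midterm_4", 95)
print(course["scores"])
```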

Now you have received a great homework grade for 15-388: 100%! You are eager to add this score, but unfortunately, we have lost our list of IDs!

In [5]:
id_objects = None
ids = []

We will have to find the document using only the information that the `course` is `15-388`. Fortunately, we have an alternate way to select documents!

In [6]:
# print the document
print grades.find_one({"course": '15-388'})

# update the document
grades.update_one({"course": '15-388'}, {"$set": {'scores.homework_1': 100}}, upsert=True)

# print newline and the updated document
print
print grades.find_one({"course": '15-388'})

{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2f9'), u'scores': {u'midterm_1': 90, u'midterm_2': 70, u'midterm_3': 80}}

{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2f9'), u'scores': {u'homework_1': 100, u'midterm_1': 90, u'midterm_2': 70, u'midterm_3': 80}}
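Conceptually, a filter such as `{"course": '15-388'}` selects documents whose fields equal the given values. The sketch below imitates that on a plain list of dictionaries. It handles only top-level equality, the `find_one` function here is a stand-in written for this illustration, and `15-999` is a made-up course used to show the no-match case.

```python
def find_one(docs, query):
    """Return the first document whose fields equal every value in query, else None."""
    for doc in docs:
        if all(doc.get(field) == expected for field, expected in query.items()):
            return doc
    return None

grades = [
    {"course": "15-388", "scores": {"midterm_1": 90, "midterm_2": 70, "midterm_3": 80}},
    {"course": "15-213", "scores": {"midterm_1": 100, "midterm_2": 50, "midterm_3": 70}},
]

match = find_one(grades, {"course": "15-388"})
missing = find_one(grades, {"course": "15-999"})  # hypothetical course, not in the data

print(match["scores"])
print(missing)
```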


You notice there's an error! In your database, your midterm 2 score in 15-388 is a 70%, but you actually got a 90%. You're going to need to change that! To change a value, we use the same `$set` syntax as for adding a value. This time we leave out `upsert=True`: `upsert` is `False` by default, and because the filter matches an existing document, the update behaves the same either way.

In [7]:
# update the document
grades.update_one({"course": '15-388'}, {"$set": {'scores.midterm_2': 90}})

# print the updated document
print grades.find_one({"course": '15-388'})

{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2f9'), u'scores': {u'homework_1': 100, u'midterm_1': 90, u'midterm_2': 90, u'midterm_3': 80}}
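Since `upsert` has now come up twice, here is a rough pure-Python sketch of what the flag changes. It models a collection as a list of dictionaries and supports only top-level equality filters and top-level fields. The `update_one` function below is a hypothetical stand-in written for this illustration, not PyMongo's implementation, and the `units` field and course `15-451` are made up.

```python
def update_one(docs, query, new_fields, upsert=False):
    """Apply new_fields to the first document matching query.

    With upsert=True, insert a new document when nothing matches.
    """
    for doc in docs:
        if all(doc.get(k) == v for k, v in query.items()):
            doc.update(new_fields)
            return "updated"
    if upsert:
        new_doc = dict(query)       # the filter's fields seed the new document
        new_doc.update(new_fields)
        docs.append(new_doc)
        return "inserted"
    return "no match"

courses = [{"course": "15-388"}, {"course": "15-213"}]

# matching document: the fields are set whether or not upsert is given
first = update_one(courses, {"course": "15-388"}, {"units": 12})
# no matching document and no upsert: nothing changes
second = update_one(courses, {"course": "15-451"}, {"units": 12})
# no matching document with upsert: a new document is added
third = update_one(courses, {"course": "15-451"}, {"units": 12}, upsert=True)

print(first)
print(second)
print(third)
```

The flag only matters in the last case, where the filter matches nothing.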


Finally, we will show you how to remove a document.

You don't like 15-150, and so you are electing to drop the course. You want to remove it from your schedule.

To remove a document, we use the `.delete_one()` function to remove one document or `.delete_many()` to remove several documents. Since we only want to remove one document, we will simply use `.delete_one()`.

In [8]:
# remove the document
grades.delete_one({"course": '15-150'})

# print the remaining documents
for course in grades.find():
    print course

{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2f9'), u'scores': {u'homework_1': 100, u'midterm_1': 90, u'midterm_2': 90, u'midterm_3': 80}}
{u'course': u'15-213', u'_id': ObjectId('581c1d1cd64511962bfbb2fa'), u'scores': {u'midterm_4': 95, u'midterm_1': 100, u'midterm_2': 50, u'midterm_3': 70}}


And we've removed the course!

# Operations in MongoDB

It's midsemester and you want to calculate your midsemester grade! To do this, you'll need to understand a little about operations in MongoDB.

Let's start by averaging the scores for each class. Of course, we can retrieve each class as a Python object and iterate through them.

In [9]:
avgs = {}
for class_ in grades.find():
    # collect this course's scores and average them
    scores = class_['scores'].values()
    if scores:  # skip courses with no scores to avoid dividing by zero
        avgs[class_['course']] = float(sum(scores)) / len(scores)
print avgs

{u'15-388': 90.0, u'15-213': 78.75}


# Aggregation

Here we introduce the topic of *aggregation*. To simplify this transition, we will define a new collection with the information stored in a more typical way for MongoDB: one document per score.

In [10]:
# create collection grades_by_test
grades_by_test = db.grades_by_test

# make sure grades_by_test is empty, so we can run this cell multiple times
grades_by_test.delete_many({})

# add entries, initialize list of ids of objects
# (the with-block closes the file automatically)
with open('classes-by-test.json', 'r') as classes_by_test_file:
    id_objects = grades_by_test.insert_many(json.load(classes_by_test_file))
ids = id_objects.inserted_ids

# analogous to "SELECT * FROM classes;"
grades_by_test_list = grades_by_test.find()

# print all grades
for grade in grades_by_test_list:
    print grade

{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2fc'), u'exam_name': u'midterm_1', u'score': 90}
{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2fd'), u'exam_name': u'midterm_2', u'score': 90}
{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2fe'), u'exam_name': u'midterm_3', u'score': 80}
{u'course': u'15-388', u'_id': ObjectId('581c1d1cd64511962bfbb2ff'), u'exam_name': u'homework_1', u'score': 100}
{u'course': u'15-213', u'_id': ObjectId('581c1d1cd64511962bfbb300'), u'exam_name': u'midterm_1', u'score': 100}
{u'course': u'15-213', u'_id': ObjectId('581c1d1cd64511962bfbb301'), u'exam_name': u'midterm_2', u'score': 50}
{u'course': u'15-213', u'_id': ObjectId('581c1d1cd64511962bfbb302'), u'exam_name': u'midterm_3', u'score': 70}
{u'course': u'15-213', u'_id': ObjectId('581c1d1cd64511962bfbb303'), u'exam_name': u'midterm_4', u'score': 95}


We will use the aggregation framework.

An aggregation is a sequence of operations (a *pipeline*) that we perform on the data. In this case, we want to group the scores by course and average them.

In [11]:
pipeline = [{
        '$group': {
            "_id": "$course",
            "avgScore": {
                "$avg":"$score"
            }
        }
}]

print list(grades_by_test.aggregate(pipeline))

[{u'_id': u'15-213', u'avgScore': 78.75}, {u'_id': u'15-388', u'avgScore': 90.0}]


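We'll now go over the aggregation pipeline. Our pipeline has a single `$group` stage. Setting `_id` to `$course` groups the documents by their `course` field, and `avgScore` uses the `$avg` operator to average the `score` field within each group.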
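For readers who think more easily in Python, the `$group` stage in the pipeline above computes roughly the following. This is a pure-Python sketch, not how MongoDB evaluates the pipeline; the `documents` list simply copies the `course` and `score` fields printed earlier, and the results are sorted by course only to make the order predictable.

```python
from collections import defaultdict

# course and score fields copied from the grades_by_test collection
documents = [
    {"course": "15-388", "score": 90},
    {"course": "15-388", "score": 90},
    {"course": "15-388", "score": 80},
    {"course": "15-388", "score": 100},
    {"course": "15-213", "score": 100},
    {"course": "15-213", "score": 50},
    {"course": "15-213", "score": 70},
    {"course": "15-213", "score": 95},
]

# "_id": "$course" -> bucket the documents by their course field
scores_by_course = defaultdict(list)
for doc in documents:
    scores_by_course[doc["course"]].append(doc["score"])

# "$avg": "$score" -> average the score field within each bucket
result = [
    {"_id": course, "avgScore": float(sum(scores)) / len(scores)}
    for course, scores in sorted(scores_by_course.items())
]
print(result)
```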

# Map & Reduce

In this section we will discuss mapping and reducing in MongoDB. This is a method to aggregate large amounts of data into some convenient representation. Note that map and reduce must be JavaScript functions, and so this section will require some knowledge of JavaScript. [Here](https://developer.mozilla.org/en-US/docs/Web/JavaScript "JavaScript Documentation") is a link to the JavaScript docs, which include a good tutorial that can be found [here](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide "JavaScript Guide"). I also personally endorse [Codecademy's JavaScript tutorial](https://www.codecademy.com/learn/javascript "Codecademy JavaScript") as a brief introduction to the language. Note also that the examples below rely on `Collection.map_reduce`, which is available in PyMongo 3.x but was removed in PyMongo 4.0, and that MongoDB has deprecated map-reduce in favor of the aggregation pipeline since version 5.0.

We first discuss the **map** step of this process. In general, *mapping* is a concept in functional programming where we map every element of some list to some element in a new list. In MongoDB, this is achieved by *emitting* key-value pairs for each document in the collection. Note that each document can emit an arbitrary number of times (including 0). After the map step, each key `k` has a corresponding list `L` of values `[l1, ..., ln]`, where each pair (`k`, `li`) was emitted by map.

We construct a function `mapper` that will be given each document as input. We can reference the document with the keyword `this`.

After mapping, we **reduce** our key-value pairs. We use a function that processes the accumulated list `L` and returns some value `v` that will be associated with each key.
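The two steps can be sketched in plain Python, with no MongoDB involved. The `map_reduce`, `mapper`, and `reducer` names below are chosen for this illustration, and the mapper yields pairs instead of calling `emit`. As a small example, the sketch finds the lowest score per course on a handful of documents.

```python
from collections import defaultdict

def map_reduce(documents, mapper, reducer):
    """Group every emitted (key, value) pair by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):  # a document may emit zero or more pairs
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

def mapper(doc):
    # yielding a pair stands in for MongoDB's emit(key, value)
    yield doc["course"], doc["score"]

def reducer(key, values):
    # values is the accumulated list L for this key
    return min(values)

documents = [
    {"course": "15-388", "score": 90},
    {"course": "15-388", "score": 80},
    {"course": "15-213", "score": 100},
    {"course": "15-213", "score": 50},
]

lowest = map_reduce(documents, mapper, reducer)
print(lowest)
```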

Let's begin writing our aggregation as a map-reduce! We first want to select which key-value pair to emit. The pair (`course`, `score`) makes sense, because a list of scores associated with each course can simply be averaged after the map step. We can then reduce by computing the average of each list `L`. Don't worry too much about the JavaScript syntax, but you should be able to understand the meaning of each function.

In [22]:
from bson.code import Code

# emits course, score
mapper = Code("""
              function() {
                emit(this.course, this.score);
              }
              """)

# computes average of each list of test scores
reducer = Code("""
               function(key, values) {
                 var total = 0;
                 var length = values.length;
                 
                 for (var i = 0; i < length; i++) { // sum the scores
                   total += values[i];
                 }
                 
                 return total / length;
               }
               """)

# mapreduce to find averages
averages = grades_by_test.map_reduce(mapper, reducer, "averages")

for average in averages.find():
    print average

{u'_id': u'15-213', u'value': 78.75}
{u'_id': u'15-388', u'value': 90.0}


And we have a list of averages, like before. What's more interesting though is that mapreduce can be used on our original dataset! Let's try this.

In [24]:
# emits course, score
mapper = Code("""
              function() {
                for (var key in this.scores) { //iterate through dictionary
                  emit(this.course, this.scores[key]);
                }
              }
              """)

# mapreduce to find averages
averages_2 = grades.map_reduce(mapper, reducer, "averages_2")

for average in averages_2.find():
    print average

{u'_id': u'15-213', u'value': 78.75}
{u'_id': u'15-388', u'value': 90.0}


Notice that our map function on this dataset emits the same (`course`, `score`) pairs as before, so we can reuse the same reduce function and get the same result.
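One caveat is worth sketching. MongoDB documents that `reduce` may be called more than once for the same key, with earlier outputs fed back in as inputs, so an averaging reducer like the one above is only safe when each key is reduced in a single call, as happens with this small dataset. The pure-Python sketch below (no MongoDB involved, and the split into chunks is contrived) shows how re-reducing averages can go wrong and how carrying a (sum, count) pair avoids the problem. All function names are invented for this illustration.

```python
def average_reducer(values):
    """Average a list of numbers, like the JavaScript reducer above."""
    return float(sum(values)) / len(values)

scores = [100, 50, 70, 95]  # the 15-213 scores
true_average = average_reducer(scores)

# simulate reduce running on a partial chunk, then again on its own output
partial = average_reducer(scores[:3])
re_reduced = average_reducer([partial, scores[3]])  # an average of averages

def pair_reducer(pairs):
    """Combine (sum, count) pairs; safe to apply to its own output."""
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))

# emit (score, 1) for each score, then reduce in the same two steps
pairs = [(score, 1) for score in scores]
partial_pair = pair_reducer(pairs[:3])
total, count = pair_reducer([partial_pair, pairs[3]])
safe_average = float(total) / count

print(true_average)
print(re_reduced)
print(safe_average)
```

In MongoDB, the final division would typically go in a separate `finalize` function, so that `reduce` itself only combines sums and counts.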

# Map Reduce Example

Let's walk through a more difficult example.

# Sharding

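Sharding is MongoDB's method for distributing a collection's data across multiple machines so that a deployment can scale horizontally. It is outside the scope of this tutorial, so we only mention it here as a reference for future learning.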

# References

Document-Oriented Databases:
1. [Wikipedia Document-Oriented Database](https://en.wikipedia.org/wiki/Document-oriented_database "Wikipedia Document-Oriented Database")
2. [Document-Oriented Databases and MongoDB](https://www.mongodb.com/document-databases "Document-Oriented Databases and MongoDB")

MongoDB:
1. [MongoDB Manual](https://docs.mongodb.com/manual/ "MongoDB Manual")
2. [Aggregation Pipeline](https://docs.mongodb.com/manual/core/aggregation-pipeline/ "Aggregation Pipeline")
3. [Map-Reduce](https://docs.mongodb.com/manual/core/map-reduce/ "Map-Reduce")
4. [Sharding](https://docs.mongodb.com/manual/sharding/ "Sharding")

PyMongo:
1. [Pymongo Documentation](https://api.mongodb.com/python/current/ "Pymongo Documentation")
2. [Getting Started MongoDB - Python](https://docs.mongodb.com/getting-started/python/ "Getting Started MongoDB - Python")

JavaScript:
1. [JavaScript Documentation](https://developer.mozilla.org/en-US/docs/Web/JavaScript "JavaScript Documentation")
2. [JavaScript Guide](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide "JavaScript Guide")
3. [Codecademy JavaScript](https://www.codecademy.com/learn/javascript "Codecademy JavaScript")