<DIV ALIGN=CENTER>

# Introduction to MongoDB
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

In the previous course, we discussed relational databases, SQL, and
using Python to work with relational databases. With the rapid growth
in large data sets, however, there has been an explosion in new database
technologies. In this IPython Notebook, we explore [MongoDB][mdb], one
of the more popular new database technologies.  [MongoDB][mdbw] is a
NoSQL document-oriented database, which means it is _not only SQL_ and
stores data as documents. The data are stored using dynamic schemas in
the _BSON_ format, which is a JSON-like binary format. For more information,
the [MongoDB documentation website][mdbd] provides a wealth of useful
information.
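
As a rough illustration of what "JSON-like" means, the following sketch uses only the Python standard library (not MongoDB itself). The sample document is made up for illustration. It shows that a simple document round-trips through JSON, while a date value is rejected by the standard `json` module; BSON, by contrast, defines a date type.

```python
import datetime
import json

# A document is a set of key/value pairs; in Python it is written as a dict.
document = {'fname': 'Jane', 'lname': 'Doe', 'lucky_numbers': [2, 5, 9]}

# Strings, numbers, and lists survive a round trip through JSON unchanged.
decoded = json.loads(json.dumps(document))
print(decoded == document)

# JSON has no date type, so the standard json module refuses a datetime value.
dated = dict(document, hire_date=datetime.datetime(2017, 4, 1))
try:
    json.dumps(dated)
    json_accepts_dates = True
except TypeError:
    json_accepts_dates = False
print(json_accepts_dates)
```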

-----
[mdb]: https://www.mongodb.org
[mdbw]: https://en.wikipedia.org/wiki/MongoDB
[mdbd]: https://docs.mongodb.org/manual/

## Python and MongoDB

To use Python to interact with MongoDB, we need to use a suitable Python
library. The recommended Python library is [_pymongo_][pymdb], which
provides support for establishing a connection between a Python program
and a MongoDB server as well as support tools for working with MongoDB. 

We have already installed _pymongo_ in the course Docker container;
however, you can easily install it by using `pip`. For example, to
install _pymongo_ for use with Python3 for the current user, we can
execute:

```bash
pip3 install pymongo --user
```

Once this library is installed, we can import the MongoDB client to
establish a connection and retrieve data and MongoDB information.

```python
from pymongo import MongoClient
```

-----

[pymdb]: http://api.mongodb.org/python/current/

In [1]:
from pymongo import MongoClient

-----

## Local MongoDB Server

To use a local MongoDB server, for instance, a MongoDB server running
inside our course Docker container, we need to first start the server.
To do this, open a terminal window inside the Docker container, most
easily done using the _New_ menu on the JupyterHub Server homepage,
followed by _Terminal_.

Inside this new terminal window, start up the MongoDB server by issuing
the following command:

```bash 
mongod --nojournal 
``` 

This will start the mongo database daemon with no journaling (since we
are not worried about crash safety). This will produce a list of
messages in your terminal window. At this point the local server is
ready to start accepting connections. To open a connection to the
localhost using pymongo, we establish a new MongoDB client:


```python
client = MongoClient()
```

which assumes a local server on the default port. Alternatively, we can
explicitly list the hostname and port. This form is preferred because it
makes the server and port number visible, and both are easily changed
when we move to a remote MongoDB server.

```python
client = MongoClient("mongodb://localhost:27017")
```

which connects to the local MongoDB daemon using the default local host
name and port.

-----

## Remote MongoDB Server

To connect to a remote MongoDB server, for instance by using the course MongoDB cloud computing system, hosted by NCSA's Nebula cloud computing system, we simply need the IP address for the server and the port number on which the MongoDB daemon is listening. For this course, Notebooks running on the course JupyterHub Server can access a MongoDB server on `141.142.211.6` and the default port number of `27017`:


```python
client = MongoClient("mongodb://141.142.211.6:27017")
```
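
Since the local and remote connection strings differ only in the host, one possible convenience is to build them from their parts. The sketch below uses only the standard library; `mongo_uri` is a hypothetical helper introduced here for illustration, not part of pymongo.

```python
from urllib.parse import urlsplit


def mongo_uri(host='localhost', port=27017):
    """Build a minimal MongoDB connection string (hypothetical helper)."""
    return 'mongodb://{0}:{1}'.format(host, port)


local_uri = mongo_uri()
remote_uri = mongo_uri('141.142.211.6')

# Splitting the string again confirms which server and port it names.
parts = urlsplit(remote_uri)
print(local_uri)
print(parts.hostname, parts.port)
```

Either string could then be passed to `MongoClient` in place of the literal strings shown earlier.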

-----

In [2]:
# Establish a connection to MongoDB (uncomment only one of these lines)

# For remote course server use
client = MongoClient("mongodb://141.142.211.6:27017")

# For local Docker server use (this will not work in INFO490)
#client = MongoClient("mongodb://localhost:27017")

-----
## MongoDB Database

MongoDB provides storage for collections of documents. To manage a set
of related collections, MongoDB uses the concept of a database. Thus a
MongoDB database is similar to a standard relational database, which
contains a collection of tables.

In the next few sections, we explore the _pymongo_ library in a similar
manner as the official [_pymongo_ tutorial][pymt]. In addition, in this
Notebook we use dictionary style access to acquire a database,
collection, or document. There is also an attribute style method to
access these items, but dictionary style is preferred since it reinforces
the concept that MongoDB is a document-style database and that Python
dictionaries are used to define the document schema. In addition, the
dictionary style enables names to be used that might not be legal Python
names, such as `test-database`. 
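
To see why a name such as `test-database` cannot be used with attribute style access, we can ask Python whether each name is a legal identifier. This small check uses only built-in string methods; the list of names is chosen for illustration.

```python
# Names we might want to use for a database or collection.
candidate_names = ['students', 'test_collection', 'test-database']

# str.isidentifier() reports whether a name could follow the dot in
# attribute style access; a hyphen would be parsed as a subtraction.
is_legal_attribute = {name: name.isidentifier() for name in candidate_names}

for name, legal in is_legal_attribute.items():
    print('{0!r}: usable as an attribute name = {1}'.format(name, legal))
```

Dictionary style access takes the name as an ordinary string, so it accepts all three names.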

Finally, since we are using a shared resource without authentication, we
use your netid to create a database for each student. Do not try to
access other students' databases; the cloud system has logging enabled
that will allow us to identify any such effort, which will be punished
with an instructor-determined point reduction.

-----
[pymt]: http://api.mongodb.org/python/current/tutorial.html

In [3]:
# Filename containing user's netid
fname = '/home/data_scientist/users.txt'
with open(fname, 'r') as fin:
    netid = fin.readline().rstrip()

# We will delete our working database if it exists before recreating.
dbname = 'test-{0}'.format(netid)

if  dbname in client.database_names():
    client.drop_database(dbname)

print('Existing databases:', client.database_names())

Existing databases: []


In [4]:
# now create, by accessing, our new database
db = client[dbname]
print('Existing databases:', client.database_names())

Existing databases: []


----

MongoDB utilizes _lazy evaluation_ when creating databases or
collections, which simply means these objects are not created until
they are actually needed. This was shown above for databases, where
we accessed a new database (named `test-` followed by the netid), but
the new database did not show up in the list of active MongoDB
databases. This database will not even be created when we add a
collection; instead it will be created when we first add data to a
collection, which is demonstrated in the next few code cells.
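
The idea of lazy creation can be sketched with a toy in-memory model. This is not how pymongo or the MongoDB server is implemented; the classes `ToyDatabase` and `ToyCollection` are hypothetical and exist only to show that obtaining a handle creates nothing, while the first insert does.

```python
class ToyCollection:
    """Handle to a named collection; holds no data of its own."""

    def __init__(self, storage, name):
        self._storage = storage
        self._name = name

    def insert_one(self, document):
        # The first write is what actually creates the collection entry.
        self._storage.setdefault(self._name, []).append(dict(document))


class ToyDatabase:
    """Toy stand-in for a database that creates collections lazily."""

    def __init__(self):
        self._storage = {}

    def __getitem__(self, name):
        # Returning a handle does not register anything in storage.
        return ToyCollection(self._storage, name)

    def collection_names(self):
        return sorted(self._storage)


toy_db = ToyDatabase()
toy_students = toy_db['students']
names_before = toy_db.collection_names()

toy_students.insert_one({'fname': 'Jane', 'lname': 'Doe'})
names_after = toy_db.collection_names()

print(names_before, names_after)
```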

We now create a new collection, named `test_collection`, into which we
can insert new data.

-----

In [5]:
# Create a test collection in our working database

collection = db['test_collection']

print('Existing databases:', client.database_names())
print('Existing collections:', db.collection_names())

Existing databases: []
Existing collections: []


-----

## Using MongoDB

Unlike a relational database, MongoDB is schema-less. We insert documents
into a MongoDB database without creating tables or schemas. MongoDB
does, however, support traditional database operations such as inserting
data, querying data, updating data, and deleting data. These operations
typically come in two forms:

- `xxx_one()`, which works on one document.
- `xxx_many()`, which operates on multiple documents.

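where the `xxx` can be `insert`, `update`, or `delete` to add, modify,
or remove data from a MongoDB database. Searching follows the same
pattern with slightly different names: `find_one` returns a single
document, and `find` returns all matching documents.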

In the rest of this Notebook, we demonstrate these functionalities with
an example data set similar to the data set used in the relational
database notebook.

-----


## Inserting Data

Given a collection, we can easily add new _documents_ to our MongoDB
collection by employing a Python dictionary to map the document schema
to the document data. In the following code cell, we first create a
`student` document, followed by a `students` collection to hold
`student` documents, and we insert the first student by using the
`insert_one` method on the `students` collection. We retrieve this new
student's id, which we display as a validation of this process. After
this process, we display the newly created database and collection.

-----

In [6]:
student = {'fname': 'Jane',
           'lname': 'Doe',
           'company': 'bdg surf shop'}

students = db['students']

jane_id = students.insert_one(student).inserted_id
print("New Student ID: ", jane_id)

New Student ID:  58e01b34d4be7f78c5176cdc


In [7]:
print('Existing databases:', client.database_names())
print('Existing collections:', db.collection_names())

Existing databases: ['test-bigdog']
Existing collections: ['students']


-----

Unlike relational database tables, a MongoDB collection can store
documents that have different schemas. We demonstrate this in the next
two code cells where we create two new students that each have a different
schema from the original student. After inserting these new students, we
count the number of documents in the `students` collection.
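
Before running these cells, it may help to see the differing schemas side by side. The following sketch works on plain Python dictionaries without a MongoDB server. The documents mirror the three students in this section, with a fixed `hire_date` chosen here for illustration.

```python
import datetime

# Three documents that would share one collection but differ in schema.
docs = [
    {'fname': 'Jane', 'lname': 'Doe', 'company': 'bdg surf shop'},
    {'fname': 'John', 'lname': 'Doe', 'company': 'bdg surf shop',
     'lucky_numbers': [2, 5, 9, 13, 27]},
    {'fname': 'Pat', 'lname': 'Doe', 'company': 'bdg surf shop',
     'hire_date': datetime.datetime(2017, 4, 1)},
]

# Iterating over a dict yields its keys, so set(doc) is the set of fields.
field_sets = [set(doc) for doc in docs]

shared_fields = set.intersection(*field_sets)
optional_fields = set.union(*field_sets) - shared_fields

print('Fields in every document:', sorted(shared_fields))
print('Fields in only some documents:', sorted(optional_fields))
```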

-----

In [8]:
student = {'fname': 'John',
           'lname': 'Doe',
           'company': 'bdg surf shop',
           'lucky_numbers': [2, 5, 9, 13, 27]}

john_id = students.insert_one(student).inserted_id
print("New Student ID: ", john_id)

New Student ID:  58e01b34d4be7f78c5176cdd


In [9]:
import datetime

student = {'fname': 'Pat',
           'lname': 'Doe',
           'company': 'bdg surf shop',
           'hire_date': datetime.datetime.utcnow()}

pat_id = students.insert_one(student).inserted_id
print("New Student ID: ", pat_id)

New Student ID:  58e01b34d4be7f78c5176cde


In [10]:
print("Number of students = ", students.count())

Number of students =  3


-----

We can also insert multiple documents at once by collecting the new
documents in a Python `list` and using the `insert_many` method to
perform a bulk insert.

-----

In [11]:
new_students = [
    {'fname': 'Mike',
     'lname': 'Simone',
     'company': 'Del Ray Enterprises',
     'products': [{'id': 1, 'name': 'eyeware'}, {'id': 2, 'name': 'hat'}]},
    {'fname': 'Clair',
     'lname': 'Hwu',
     'company': 'Hoboken Surfware Incorporated',
     'comment': 'Great supplier, fast, fair, and courteous.'}]

result = students.insert_many(new_students)

print(result.inserted_ids)

[ObjectId('58e01b34d4be7f78c5176cdf'), ObjectId('58e01b34d4be7f78c5176ce0')]


In [12]:
print("Number of students = ", students.count())

Number of students =  5


-----

## Retrieving Data

MongoDB provides `find_one` and `find` methods that can be used to find
one or more documents in a collection. The first method, `find_one`,
simply returns one document (by default the first document in the
collection) unless an argument is supplied that specifically selects
documents. For example, the second code cell is used to find one
document with a specific id value. More generally, the `find` method can
be used to iterate over all (or given a suitable argument, a limited set
of) documents in the collection, as demonstrated in the third code cell.

-----

In [13]:
students.find_one()

{'_id': ObjectId('58e01b34d4be7f78c5176cdc'),
 'company': 'bdg surf shop',
 'fname': 'Jane',
 'lname': 'Doe'}

In [14]:
students.find_one({"_id": pat_id})

{'_id': ObjectId('58e01b34d4be7f78c5176cde'),
 'company': 'bdg surf shop',
 'fname': 'Pat',
 'hire_date': datetime.datetime(2017, 4, 1, 21, 27, 16, 445000),
 'lname': 'Doe'}

In [15]:
for student in students.find():
    print(student)

{'_id': ObjectId('58e01b34d4be7f78c5176cdc'), 'company': 'bdg surf shop', 'lname': 'Doe', 'fname': 'Jane'}
{'lucky_numbers': [2, 5, 9, 13, 27], '_id': ObjectId('58e01b34d4be7f78c5176cdd'), 'company': 'bdg surf shop', 'lname': 'Doe', 'fname': 'John'}
{'_id': ObjectId('58e01b34d4be7f78c5176cde'), 'company': 'bdg surf shop', 'lname': 'Doe', 'hire_date': datetime.datetime(2017, 4, 1, 21, 27, 16, 445000), 'fname': 'Pat'}
{'_id': ObjectId('58e01b34d4be7f78c5176cdf'), 'company': 'Del Ray Enterprises', 'products': [{'name': 'eyeware', 'id': 1}, {'name': 'hat', 'id': 2}], 'lname': 'Simone', 'fname': 'Mike'}
{'_id': ObjectId('58e01b34d4be7f78c5176ce0'), 'comment': 'Great supplier, fast, fair, and courteous.', 'company': 'Hoboken Surfware Incorporated', 'lname': 'Hwu', 'fname': 'Clair'}


-----

As previously mentioned, we can also use the `find` method to quickly
identify specific documents in a collection, over which we can iterate
to perform additional operations. In the following code cells, we first
search for documents with the _last name_ attribute equal to `Hwu`,
after which, we apply the `count` method to the set of documents
returned by searching for _last name_ equal to `Doe`.

-----

In [16]:
for student in students.find({"lname": "Hwu"}):
    print(student)

{'_id': ObjectId('58e01b34d4be7f78c5176ce0'), 'comment': 'Great supplier, fast, fair, and courteous.', 'company': 'Hoboken Surfware Incorporated', 'lname': 'Hwu', 'fname': 'Clair'}


In [17]:
print("Number of students = ", students.find({"lname": "Doe"}).count())

Number of students =  3


-----

Given a document, we can also extract a specific value by employing
dictionary style access, which should make sense since the document is
accessed in Python as a dictionary object. In the following example, we
extract the first and last names for all documents. This requires that
all documents contain these values; if one does not, a `KeyError` is
raised. Handling these conditions is beyond the scope of this
Notebook.

-----

In [18]:
for student in students.find():
    print(student['fname'], student['lname'])

Jane Doe
John Doe
Pat Doe
Mike Simone
Clair Hwu


-----

## Modifying Data

We can [modify documents][um] in a MongoDB database by finding the
relevant document(s) and setting the attributes to the new values. Given
the flexible document nature of MongoDB, this actual updating process is
more complicated than in other types of databases. For example, we first
find the document to update, and then we must instruct MongoDB to change
the appropriate values. Finding the relevant document or documents is
identical to the techniques presented previously to find documents.

The second step is to indicate what document attributes should be
modified. First, to modify an existing attribute, we create a dictionary
that defines a `$set` key, followed by a dictionary that contains
mappings between the attribute name and the new value. By using a
dictionary, we are able to modify multiple values in this manner all at
once. For example, we would have the following dictionary to modify the
`fname` attribute to have the value `Peter`:

```python
{'$set': {'fname': 'Peter'}}
```

[Other operators][uo] can be used beyond the `$set` operator, including
`$inc` to increment a value, or `$rename` to rename an attribute. To
add new attributes and values to a document, we simply include them in
the same `$set` dictionary. These concepts are demonstrated in the
following code cells, where we first insert several new documents. Next,
we modify one document by replacing a value and adding a new attribute.
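
To make the effect of these operators concrete, here is a toy model that applies an update dictionary to a plain Python dictionary. It is only a sketch of the semantics for top-level fields, not the server's implementation; `apply_update` and the `visits` field are hypothetical names introduced for illustration.

```python
import copy


def apply_update(document, update):
    """Return a copy of document with $set, $inc, and $rename applied."""
    result = copy.deepcopy(document)

    # $set replaces an existing value or adds a new attribute.
    for field, value in update.get('$set', {}).items():
        result[field] = value

    # $inc adds to a numeric value, starting from zero if it is missing.
    for field, amount in update.get('$inc', {}).items():
        result[field] = result.get(field, 0) + amount

    # $rename moves a value to a new attribute name, if the old one exists.
    for old_name, new_name in update.get('$rename', {}).items():
        if old_name in result:
            result[new_name] = result.pop(old_name)

    return result


original = {'fname': 'Petr', 'lname': 'Dow', 'visits': 1}
updated = apply_update(original, {
    '$set': {'fname': 'Peter', 'reason': 'typo in name'},
    '$inc': {'visits': 2},
    '$rename': {'lname': 'last_name'},
})
print(updated)
```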

-----
[um]: https://docs.mongodb.org/getting-started/python/update/
[uo]: https://docs.mongodb.org/manual/reference/operator/update/

In [19]:
# Simple, temporary data that we will modify and delete
temp_students = [{'fname': 'Petr', 'lname': 'Dow', 'company': 'bdg surf shop'},
                 {'fname': 'Paul', 'lname': 'Dow', 'company': 'bdg surf shop'},
                 {'fname': 'Mary', 'lname': 'Dow', 'company': 'bdg surf shop'},
                 {'fname': 'Arthur', 'lname': 'Dow', 'company': 'bdg surf shop'}]

# Insert new temporary data
temp_results = students.insert_many(temp_students)

print(temp_results.inserted_ids)                 
print("Number of students = ", students.count())

[ObjectId('58e01b35d4be7f78c5176ce1'), ObjectId('58e01b35d4be7f78c5176ce2'), ObjectId('58e01b35d4be7f78c5176ce3'), ObjectId('58e01b35d4be7f78c5176ce4')]
Number of students =  9


In [20]:
uo_result = students.update_one({'fname': 'Petr'}, {'$set': {'fname': 'Peter',
                                                             'reason': 'typo in name'}})

print('{0} student records modified.'.format(uo_result.modified_count))
for student in students.find({"fname": "Peter"}):
    print(student)

1 student records modified.
{'_id': ObjectId('58e01b35d4be7f78c5176ce1'), 'reason': 'typo in name', 'company': 'bdg surf shop', 'lname': 'Dow', 'fname': 'Peter'}


-----

We can also update multiple documents by using the `update_many` method.
This method works in the same manner as `update_one`, but it will update
all matching documents. This function is demonstrated in the following
code cell, where the company name is updated and a new `hire_date`
attribute is added to each document that has an original `company` name
of `bdg surf shop`.

-----

In [21]:
um_result = students.update_many({'company': 'bdg surf shop'},
                                 {'$set': {'company': "Bigdog's surf shop",
                                           'hire_date': datetime.datetime.utcnow()}})

print('{0} student records modified.'.format(um_result.modified_count))
for student in students.find({"company": "Bigdog's surf shop"}):
    print(student)

7 student records modified.
{'_id': ObjectId('58e01b34d4be7f78c5176cdc'), 'company': "Bigdog's surf shop", 'lname': 'Doe', 'hire_date': datetime.datetime(2017, 4, 1, 21, 27, 17, 216000), 'fname': 'Jane'}
{'lucky_numbers': [2, 5, 9, 13, 27], 'lname': 'Doe', 'hire_date': datetime.datetime(2017, 4, 1, 21, 27, 17, 216000), 'company': "Bigdog's surf shop", '_id': ObjectId('58e01b34d4be7f78c5176cdd'), 'fname': 'John'}
{'_id': ObjectId('58e01b34d4be7f78c5176cde'), 'company': "Bigdog's surf shop", 'lname': 'Doe', 'hire_date': datetime.datetime(2017, 4, 1, 21, 27, 17, 216000), 'fname': 'Pat'}
{'reason': 'typo in name', 'lname': 'Dow', 'hire_date': datetime.datetime(2017, 4, 1, 21, 27, 17, 216000), 'company': "Bigdog's surf shop", '_id': ObjectId('58e01b35d4be7f78c5176ce1'), 'fname': 'Peter'}
{'_id': ObjectId('58e01b35d4be7f78c5176ce2'), 'company': "Bigdog's surf shop", 'lname': 'Dow', 'hire_date': datetime.datetime(2017, 4, 1, 21, 27, 17, 216000), 'fname': 'Paul'}
{'_id': ObjectId('58e01b35d4be

-----

## Deleting Data

To delete documents from a collection, we first build a filter that
identifies the appropriate document or documents and pass this filter
into a delete method. pymongo provides two delete mechanisms: 
- `delete_one` to delete one document: nothing is deleted if there are no matches, and only the first match is deleted if there are several,
- `delete_many` to delete multiple documents.

The following code cell demonstrates the use of the `delete_one` method
to delete the document for _Peter Dow_ by matching the `Peter` value for
the `fname` attribute. Note that in this collection, there is only one
match.

-----

In [22]:
# Display number of students
print('{0} students with last name Dow'.format(students.count({'lname': 'Dow'})))

# Delete one student
do_result = students.delete_one({'fname': 'Peter'})

# Display number of students
print('{0} student records deleted.'.format(do_result.deleted_count))
print('{0} students with last name Dow'.format(students.count({'lname': 'Dow'})))

4 students with last name Dow
1 student records deleted.
3 students with last name Dow


-----

In a similar manner, we can delete multiple documents by using the
`delete_many` method. In this case, all documents that match the
pattern passed in as an argument to the delete method are deleted. For
example, the following code cell will delete all documents that have a
value of `Dow` for the `lname` attribute.
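
The difference between the two delete methods can be sketched with a toy model that operates on a Python list of dictionaries. These functions are simplified stand-ins written for illustration (simple equality filters only); they are not the pymongo methods of the same name, and they return only the number of deleted documents.

```python
def matches(document, query):
    """True if every key in query is present in document with an equal value."""
    return all(key in document and document[key] == value
               for key, value in query.items())


def delete_one(documents, query):
    """Remove the first matching document; return how many were deleted."""
    for index, document in enumerate(documents):
        if matches(document, query):
            del documents[index]
            return 1
    return 0


def delete_many(documents, query):
    """Remove every matching document; return how many were deleted."""
    kept = [doc for doc in documents if not matches(doc, query)]
    deleted = len(documents) - len(kept)
    documents[:] = kept
    return deleted


people = [{'fname': 'Jane', 'lname': 'Doe'},
          {'fname': 'Peter', 'lname': 'Dow'},
          {'fname': 'Paul', 'lname': 'Dow'},
          {'fname': 'Mary', 'lname': 'Dow'}]

no_match = delete_one(people, {'fname': 'Nobody'})
one_deleted = delete_one(people, {'lname': 'Dow'})
rest_deleted = delete_many(people, {'lname': 'Dow'})
print(no_match, one_deleted, rest_deleted, people)
```

In this model an empty filter matches every document, which mirrors passing an empty dictionary to `delete_many`.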

-----

In [23]:
# Display number of students
print('{0} students with last name Dow'.format(students.count({'lname': 'Dow'})))

# Delete all students with last name Dow
do_result = students.delete_many({'lname': 'Dow'})

# Display number of students
print('{0} student records deleted.'.format(do_result.deleted_count))
print('{0} students with last name Dow'.format(students.count({'lname': 'Dow'})))

3 students with last name Dow
3 student records deleted.
0 students with last name Dow


----

## Advanced Querying

MongoDB also supports a [rich query][mdbq] syntax, but it likely will
seem odd to anyone familiar with SQL. The full set includes comparison,
logical, element tests, evaluation methods, geospatial, array, and
projection operations. These operators begin with a `$` character, and
the rest of the name identifies the specific operator. For example,
`$gte` is _greater than or equal to_. 

The format for the query is to encode the target field as the key of a
dictionary, and the operator and any associated values as a second
dictionary that maps to the field's key. For example, to test if the
field `age` is less than 20, we write the following query 
`{age:{ $lt: 20}}`. 
The same format is used in the following code cell, where we apply the
`$eq` operator to identify the documents with last name equal to `Doe`,
after which we sort the documents by first name. When using pymongo, we
enclose the attributes and operators in quotes to ensure they are passed
correctly to the MongoDB server.
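
To show how such a query dictionary is read, the following toy evaluator applies a few comparison operators to plain Python dictionaries. It is a simplified sketch that covers only six comparison operators on top-level fields, not the server's query engine. The `people` data and the helper names are made up for illustration.

```python
import operator

# Map a few MongoDB comparison operator names to Python comparisons.
OPERATORS = {'$eq': operator.eq, '$ne': operator.ne,
             '$gt': operator.gt, '$gte': operator.ge,
             '$lt': operator.lt, '$lte': operator.le}


def field_matches(value, condition):
    """Test one field value against a plain value or an operator dictionary."""
    if isinstance(condition, dict):
        return all(OPERATORS[name](value, operand)
                   for name, operand in condition.items())
    return value == condition


def matches(document, query):
    """True if the document satisfies every field condition in the query."""
    return all(field in document and field_matches(document[field], condition)
               for field, condition in query.items())


people = [{'fname': 'Pat', 'age': 19},
          {'fname': 'Jane', 'age': 22},
          {'fname': 'John', 'age': 17}]

# In Python, field names and operators are written as quoted strings.
under_20 = sorted((p for p in people if matches(p, {'age': {'$lt': 20}})),
                  key=lambda p: p['fname'])
under_20_names = [p['fname'] for p in under_20]

# Several operators in one dictionary must all hold for the same field.
in_range = [p['fname'] for p in people
            if matches(p, {'age': {'$gte': 18, '$lt': 20}})]

print(under_20_names, in_range)
```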

-----

[mdbq]: http://www.mongodb.org/display/DOCS/Advanced+Queries

In [24]:
for student in students.find({"lname": {'$eq': 'Doe'}}).sort('fname'):
    print('{0} {1}'.format(student['fname'], student['lname']))

Jane Doe
John Doe
Pat Doe


-----

### Dropping Collection

We can easily remove every document in a collection by passing an empty
dictionary to the `delete_many` method. This instructs the method to
delete all documents in the collection, which is similar to deleting all
rows from a table in a relational database. The collection itself can
then be removed with the `drop` method, which is similar to dropping the
table. This technique is demonstrated in the following code cell where
we delete all documents in the `students` collection and then drop it.

-----

In [25]:
# Display number of students
print('{0} students'.format(students.count()))

# Delete all students
da_result = students.delete_many({})

# Display number of students and number deleted
print('{0} student records deleted.'.format(da_result.deleted_count))
print('{0} students'.format(students.count()))

# Drop the entire collection
students.drop()

5 students
5 student records deleted.
0 students


-----
### Student Activity

In the preceding cells, we introduced MongoDB and the pymongo database
driver. Now that you have run the Notebook, go back and make the
following changes to see how the results change.

1. Create your own collection to hold your friends. Possible attributes
would be first name, last name, age, major, and interest. Insert relevant
data and execute some simple queries.

2. An IPython Notebook is stored as a JSON file on your disk. Try
reading in several course notebooks and adding them to a MongoDB
collection.

3. Can you connect the Twitter client you created earlier in this
course with a MongoDB collection to persist tweets? Why might this be a
good idea?

-----