# MongoDB Workshop

Today we will be experimenting with MongoDB!  

This workshop will have you execute some sample code to get you familiar with using the [pymongo](http://api.mongodb.com/python/current/index.html) library.  Then, we will use the Marvel comics dataset to insert and query data.

The data was obtained from [Kaggle](https://www.kaggle.com/fivethirtyeight/fivethirtyeight-comic-characters-dataset) but was originally sourced from [Marvel Wikia](http://marvel.wikia.com/Main_Page) and [DC Wikia](http://dc.wikia.com/wiki/Main_Page). It is split into two files, one for DC and one for Marvel: `dc-wikia-data.csv` and `marvel-wikia-data.csv`. Each file has the following variables:

Variable | Definition
---|---------
`page_id` | The unique identifier for that character's page within the wikia
`name` | The name of the character
`urlslug` | The unique url within the wikia that takes you to the character
`ID` | The identity status of the character (Secret Identity, Public identity, [on marvel only: No Dual Identity])
`ALIGN` | If the character is Good, Bad or Neutral
`EYE` | Eye color of the character
`HAIR` | Hair color of the character
`SEX` | Sex of the character (e.g. Male, Female, etc.)
`GSM` | If the character is a gender or sexual minority (e.g. Homosexual characters, bisexual characters)
`ALIVE` | If the character is alive or deceased
`APPEARANCES` | The number of appearances of the character in comic books (as of Sep. 2, 2014; the number will become increasingly out of date as time goes on)
`FIRST APPEARANCE` | The month and year of the character's first appearance in a comic book, if available
`YEAR` | The year of the character's first appearance in a comic book, if available

<br>
The first thing we will do is import our libraries.

In [None]:
import csv
import json
from glob import glob
from pprint import pprint
from pymongo import MongoClient

## Let's look at our data

We will use Python's [glob](https://docs.python.org/3/library/glob.html) module to quickly view the files in the data directory.  This directory contains a file listing all of the characters that have appeared in the Marvel comics.

In [None]:
files = glob("data/*")
files

Let's make a function that will return the characters to us one at a time as a `dictionary` object.  We will make use of [generators](https://docs.python.org/3/howto/functional.html#generators) which use the [yield](https://docs.python.org/3/reference/simple_stmts.html#yield) keyword to pass execution back to the calling code.

One reason to use a `yield`, or generator, is so that you can "stream" data rather than loading it all into memory at once.  In this example, each individual row of the file will be read, converted to a `dict`, and then returned to the calling code.  We also do a little cleanup of the data by converting some fields to `int`.

In [None]:
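def get_characters(source):
    assert source in ["marvel", "dc"]
    # Open the file that matches the requested source instead of always files[0].
    # The path follows the file names given above and the data/ directory we globbed.
    path = "data/{}-wikia-data.csv".format(source)
    with open(path, 'r', newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Convert numeric fields to int; empty values become None.
            row["Year"] = int(row["Year"]) if row["Year"] else None
            row["APPEARANCES"] = int(row["APPEARANCES"]) if row["APPEARANCES"] else None
            # Hand one cleaned-up row back to the caller, then pause until the next is requested.
            yield dict(row)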
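To make the "streaming" behavior concrete, here is a minimal, self-contained sketch of the same pattern. It reads from an in-memory string instead of a file, and the two sample rows, the `SAMPLE` constant, the `events` list, and the `stream_rows` name are all made up for illustration. The point is that the body of a generator function does not run until the caller asks for an item.

```python
import csv
import io

# Hypothetical two-row sample standing in for a real CSV file.
SAMPLE = "name,APPEARANCES\nHero A,12\nHero B,\n"

events = []  # records when work actually happens


def stream_rows(text):
    """Yield one cleaned-up row at a time from CSV text."""
    events.append("opened")
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # Same cleanup idea as above: numeric strings become int, blanks become None.
        row["APPEARANCES"] = int(row["APPEARANCES"]) if row["APPEARANCES"] else None
        events.append("yielded " + row["name"])
        yield dict(row)


rows = stream_rows(SAMPLE)  # nothing has run yet, so events is still empty
first = next(rows)          # runs the body up to the first yield

print(events)  # ['opened', 'yielded Hero A']
print(first)   # {'name': 'Hero A', 'APPEARANCES': 12}
```

Only the first row has been processed at this point; the second row is not touched until `next(rows)` is called again or a loop continues.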

With our generator function, we can now loop through all of the items with ease and without worrying about our memory usage.  Below is an example of looping through a generator; the `break` stops the loop after a single item.

In [None]:
for item in get_characters("marvel"):
    pprint(item)
    break

Of course, we can force the generator to give us all of the items at once using `list`.  This is usually a bad idea, since the author of a generator typically had good reasons to hand out only one item at a time, but it is convenient here for counting and inspecting the records.

In [None]:
data = list(get_characters("marvel"))

print("There are {} items".format(len(data)))
data[7]
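If you only want to peek at the first few items, one option is `itertools.islice` from the standard library, which takes a fixed number of items and leaves the rest of the generator unread. The sketch below uses a made-up endless generator (`count_up`) as a stand-in for a large data source.

```python
from itertools import islice


def count_up():
    """Hypothetical endless generator standing in for a large data source."""
    n = 0
    while True:
        yield n
        n += 1


# Take just the first three items; calling list() on count_up() directly would never finish.
first_three = list(islice(count_up(), 3))
print(first_three)  # [0, 1, 2]
```

The same idea should apply to our data, for example `list(islice(get_characters("marvel"), 5))` to look at five characters without reading the whole file.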

We've now reviewed our data and should have a good idea of what to expect.  We can now think about how we would like to store this in MongoDB.

## Let's connect to the server

At this point we want to connect to our server and get ready to insert data.  

In [None]:
client = MongoClient('localhost', 27017)

Let's take a look at the properties and methods available to us.

In [None]:
dir(client)

Now let's use the `list_database_names` method to see what databases already exist.

In [None]:
client.list_database_names()

## Let's experiment with MongoDB

First, choose a (string) name for your database.  If you are sharing the server with other users you want to make sure it is unique.

In [None]:
DATABASE_NAME = "your_unique_name_here"  # placeholder: replace with your own unique database name
db = client[DATABASE_NAME]
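If you are unsure how to pick a name that will not collide with someone else's, one possible approach is to derive it from your username. The helper below is a hypothetical sketch, not part of pymongo: `make_db_name` and the `workshop` prefix are made up for illustration. It keeps only letters, digits, and underscores and trims the result to 63 characters, which is a conservative reading of MongoDB's database naming restrictions.

```python
import re


def make_db_name(username, prefix="workshop"):
    """Build a conservative database name from a prefix and a username (hypothetical helper)."""
    raw = "{}_{}".format(prefix, username)
    # Replace anything other than letters, digits, and underscores.
    safe = re.sub(r"[^A-Za-z0-9_]", "_", raw)
    # Keep the name under 64 characters.
    return safe[:63]


print(make_db_name("ada.lovelace"))  # workshop_ada_lovelace
```

You could pass the result of `getpass.getuser()` as the username, assuming your environment exposes one.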

<br>Let's use the [insert_one](http://api.mongodb.com/python/current/tutorial.html#inserting-a-document) method to add a single object to the `students` collection.

In [None]:
db["students"].insert_one({"name": "Gregor Gregorson"})

<br>
Now let's try inserting a bunch of students at the same time.

In [None]:
db["students"].insert_many([
    {"name": "Bob Bobertson"},
    {"name": "Roberta Robertason"},
    {"name": "Salvatore McFesterson"},
])

<br>Presumably we now have 4 student records in our collection.  Let's get the first matching one using `find_one` with some search parameters.

In [None]:
# Search for a name we actually inserted; a name with no match would return None.
db["students"].find_one({'name': 'Gregor Gregorson'})

<br>Now let's query for all of the records and loop through them.  Note that each record now has a unique `_id` field.

In [None]:
for item in db["students"].find():
    print(item)

<br>
You can do lots of things with the collection object.  Let's look at all the properties and methods available to us.

In [None]:
dir(db["students"])

## Add our comics data to MongoDB

Let's create a "comics" collection from the database object to use for the rest of this workshop.  Using this `collection` variable we will add our comic book character data.

In [None]:
collection = db["comics"]

<br>Now, use this collection variable to add the comic book characters to the database like we did earlier with the students.  You may choose to add them one at a time using [insert_one](http://api.mongodb.com/python/current/tutorial.html#inserting-a-document) or all at once using [insert_many](http://api.mongodb.com/python/current/tutorial.html#bulk-inserts).



In [None]:
collection.insert_many(data)
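A middle ground between one-at-a-time inserts and loading everything into a list is to insert in batches. The sketch below shows only the batching step, using the standard library; `in_batches` is a hypothetical helper name, and the numbers stand in for character records.

```python
from itertools import islice


def in_batches(iterable, size):
    """Yield lists of up to `size` items from any iterable (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Numbers stand in for character records here.
batches = list(in_batches(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With the generator from earlier, each batch could then be passed to `collection.insert_many(batch)` inside a loop over `in_batches(get_characters("marvel"), 1000)`; the batch size of 1000 is an arbitrary choice for illustration.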

## Now let's query the data

Like before, let's use `find_one` to query for the first record in the collection.  Then you will start adding code to perform your own queries.

In [None]:
collection.find_one()

<br>Let's find the record for Captain America - the name used by the dataset has been included below.

In [None]:
name = 'Captain America (Steven Rogers)'
collection.find_one({'name': name})

<br>Let's query for all of the "Good Characters", with "Gold Eyes", with a "Secret Identity" and print their names.  For this you will use the [find](http://api.mongodb.com/python/current/tutorial.html#querying-for-more-than-one-document) method.

In [None]:
for item in collection.find({'ALIGN': 'Good Characters', 'EYE': 'Gold Eyes', 'ID': 'Secret Identity'}):
    print(item["name"])

<br>Now, let's query for all of the "Female", "Good Characters" that were introduced since 2010 but appear only once.  We will print out their names but this is a slightly more advanced query.  Check the [pymongo](http://api.mongodb.com/python/current/tutorial.html#range-queries) and/or [mongodb](https://docs.mongodb.com/manual/reference/operator/query/#query-selectors) documentation for help. 

In [None]:
# APPEARANCES and Year were converted to int when we read the file, so compare against numbers.
for item in collection.find({'ALIGN': 'Good Characters', 'SEX': 'Female Characters', 'APPEARANCES': 1, 'Year': {'$gte': 2010}}):
    print(item["name"])
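To build intuition for what a filter like this means, the sketch below constructs the filter document and pairs it with a rough plain-Python equivalent. It does not talk to MongoDB: `build_filter`, `matches`, and the three sample documents are made up for illustration, and `matches` only approximates how the server evaluates equality and `$gte`.

```python
def build_filter(align, sex, min_year, appearances):
    """Return a MongoDB-style filter document (plain dict)."""
    return {
        "ALIGN": align,
        "SEX": sex,
        "APPEARANCES": appearances,
        "Year": {"$gte": min_year},  # range operator: Year >= min_year
    }


def matches(doc, align, sex, min_year, appearances):
    """Rough plain-Python equivalent of the filter, for intuition only."""
    year = doc.get("Year")
    return (
        doc.get("ALIGN") == align
        and doc.get("SEX") == sex
        and doc.get("APPEARANCES") == appearances
        and year is not None  # a missing year never satisfies the range condition
        and year >= min_year
    )


# Hypothetical documents, not taken from the dataset.
docs = [
    {"name": "A", "ALIGN": "Good Characters", "SEX": "Female Characters", "APPEARANCES": 1, "Year": 2011},
    {"name": "B", "ALIGN": "Good Characters", "SEX": "Female Characters", "APPEARANCES": 1, "Year": 1999},
    {"name": "C", "ALIGN": "Good Characters", "SEX": "Female Characters", "APPEARANCES": 1, "Year": None},
]

query = build_filter("Good Characters", "Female Characters", 2010, 1)
selected = [d["name"] for d in docs
            if matches(d, "Good Characters", "Female Characters", 2010, 1)]
print(selected)  # ['A']
```

Top-level keys in the filter are combined with a logical AND, and a nested document such as `{"$gte": 2010}` replaces a plain equality test with an operator.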

# All Done!

Want more work?

* Read and insert the data from the DC dataset.
* Come up with 3 different searches for this new data.