If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/lcdm-uiuc/info490-sp17/blob/master/help/act_assign_tab.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

## Problem 13.1. MongoDB

In this problem, we work with MongoDB from a Python program by using the pymongo database driver.

In [1]:
import os
import datetime
import json
import bson
import pymongo as pm

from nose.tools import assert_equal, assert_true, assert_is_instance

Here, we will use historical weather data from [Weather Underground](http://www.wunderground.com/) to create a database. The dataset covers January 1, 2001, and was collected at O'Hare (KORD). To make life easier for you, I've loaded the data below:

In [2]:
fpath = '/home/data_scientist/data/weather'
fname = 'weather_kord_2001_0101.json'

with open(os.path.join(fpath, fname)) as f:
    weather_json = json.load(f)

assert_is_instance(weather_json, dict)
assert_equal(set(weather_json.keys()), set(['current_observation', 'response', 'history']))
assert_true('observations' in weather_json['history'])

In [3]:
observations = weather_json['history']['observations']
print('There are {} dictionaries in the list.'.format(len(observations)))
print('The first element is\n{}'.format(observations[0]))

assert_is_instance(observations, list)
assert_true(all(isinstance(o, dict) for o in observations))

There are 24 dictionaries in the list.
The first element is
{'wgusti': '-9999.0', 'tempm': '-10.6', 'pressurei': '30.38', 'windchillm': '-14.9', 'heatindexi': '-9999', 'dewptm': '-11.7', 'date': {'min': '56', 'hour': '00', 'pretty': '12:56 AM CST on January 01, 2001', 'mon': '01', 'tzname': 'America/Chicago', 'year': '2001', 'mday': '01'}, 'tornado': '0', 'icon': 'cloudy', 'wspdm': '7.4', 'thunder': '0', 'precipi': '-9999.00', 'wspdi': '4.6', 'dewpti': '10.9', 'wdird': '360', 'heatindexm': '-9999', 'pressurem': '1028.5', 'wgustm': '-9999.0', 'tempi': '12.9', 'conds': 'Overcast', 'rain': '0', 'windchilli': '5.2', 'utcdate': {'min': '56', 'hour': '06', 'pretty': '6:56 AM GMT on January 01, 2001', 'mon': '01', 'tzname': 'UTC', 'year': '2001', 'mday': '01'}, 'metar': 'METAR KORD 010656Z 36004KT 9SM BKN055 OVC095 M11/M12 A3034 RMK AO2 SLP285 T11061117 $', 'hail': '0', 'fog': '0', 'precipm': '-9999.00', 'visi': '9.0', 'wdire': 'North', 'snow': '0', 'hum': '92', 'vism': '14.5'}
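
The observation above stores every measurement as a string, and several fields hold values such as `-9999.0`. The sketch below shows one way to convert such fields to numbers before analysis. It assumes that -9999 is a placeholder for a missing reading, which this notebook does not state. The helper name `to_number` and the constant `MISSING_SENTINEL` are hypothetical.

```python
# Hypothetical helper; treating -9999 as "missing" is an assumption,
# not something documented in this notebook.
MISSING_SENTINEL = -9999.0


def to_number(value, sentinel=MISSING_SENTINEL):
    """Convert a string such as '12.9' to float; return None for the sentinel."""
    number = float(value)
    return None if number == sentinel else number


# A few fields copied from the first observation printed above.
sample = {'tempi': '12.9', 'wgusti': '-9999.0', 'heatindexi': '-9999', 'hum': '92'}

cleaned = {key: to_number(value) for key, value in sample.items()}
print(cleaned)
```

The rest of this assignment keeps the values as strings, so the queries below compare against strings such as `"10.0"`.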


We connect to the course MongoDB cloud computing system, hosted by NCSA's Nebula cloud.

In [4]:
client = pm.MongoClient("mongodb://141.142.211.6:27017")

Since we are using a shared resource without authentication, we use your netid to create a database for each student.

In [5]:
# Filename containing user's netid
fname = '/home/data_scientist/users.txt'
with open(fname, 'r') as fin:
    netid = fin.readline().rstrip()

# We will delete our working database if it exists before recreating it.
dbname = 'assignment-{0}'.format(netid)

if dbname in client.database_names():
    client.drop_database(dbname)

print('Existing databases:', client.database_names())

assert_true(dbname not in client.database_names())

Existing databases: ['admin', 'local']


## Inserting Data

- Create a new collection using the name `collection_name` and add new documents `data` to our MongoDB collection.
- Return a list of object IDs as a validation of the insertion process.

In [6]:
def insert_data(db, collection_name, data):
    '''
    Creates a new collection using the name `collection_name`
    and adds new documents `data` to our MongoDB collection.
    
    Parameters
    ----------
    db: A pymongo.database.Database instance.
    collection_name: Name of new MongoDB collection.
    data: A list of dictionaries.
    
    Returns
    -------
    A list of bson.ObjectId
    '''
    
    # YOUR CODE HERE
    # new collection
    coll = db[collection_name]
    
    # add new documents
    ids = map(lambda x: coll.insert_one(x).inserted_id, data)
    
    # get Ids
    inserted_ids = list(ids) 
    
    return inserted_ids
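
The function above issues one `insert_one` call per document. As an alternative sketch, pymongo's `Collection.insert_many` sends the whole list in one call and returns a result whose `inserted_ids` attribute lists the new IDs in the order provided. The function name `insert_data_bulk` is hypothetical. The `FakeResult` and `FakeCollection` classes are stand-ins, not pymongo classes, so that the sketch runs without a MongoDB server. They use integer IDs where a real server would assign `ObjectId` values.

```python
def insert_data_bulk(db, collection_name, data):
    """Insert all documents in one call and return their IDs as a list."""
    result = db[collection_name].insert_many(data)
    return list(result.inserted_ids)


# --- Stand-ins so the sketch runs without a server (not pymongo classes) ---
class FakeResult:
    def __init__(self, inserted_ids):
        self.inserted_ids = inserted_ids


class FakeCollection:
    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        ids = []
        for doc in documents:
            # Like pymongo, add an "_id" to each document that lacks one.
            doc.setdefault('_id', len(self.docs))
            self.docs.append(doc)
            ids.append(doc['_id'])
        return FakeResult(ids)


fake_db = {'0101': FakeCollection()}
ids = insert_data_bulk(fake_db, '0101', [{'conds': 'Clear'}, {'conds': 'Overcast'}])
print(ids)
```

With a real connection you would pass `client[dbname]` in place of `fake_db`.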

In [7]:
inserted_ids = insert_data(client[dbname], '0101', observations)

print("New weather ID: ", inserted_ids)
print('Existing databases:', client.database_names())
print('Existing collections:', client[dbname].collection_names())

New weather ID:  [ObjectId('58f21972b870f800c7f9f452'), ObjectId('58f21972b870f800c7f9f453'), ObjectId('58f21972b870f800c7f9f454'), ObjectId('58f21972b870f800c7f9f455'), ObjectId('58f21972b870f800c7f9f456'), ObjectId('58f21972b870f800c7f9f457'), ObjectId('58f21972b870f800c7f9f458'), ObjectId('58f21972b870f800c7f9f459'), ObjectId('58f21972b870f800c7f9f45a'), ObjectId('58f21972b870f800c7f9f45b'), ObjectId('58f21972b870f800c7f9f45c'), ObjectId('58f21972b870f800c7f9f45d'), ObjectId('58f21972b870f800c7f9f45e'), ObjectId('58f21972b870f800c7f9f45f'), ObjectId('58f21972b870f800c7f9f460'), ObjectId('58f21972b870f800c7f9f461'), ObjectId('58f21972b870f800c7f9f462'), ObjectId('58f21972b870f800c7f9f463'), ObjectId('58f21972b870f800c7f9f464'), ObjectId('58f21972b870f800c7f9f465'), ObjectId('58f21972b870f800c7f9f466'), ObjectId('58f21972b870f800c7f9f467'), ObjectId('58f21972b870f800c7f9f468'), ObjectId('58f21972b870f800c7f9f469')]
Existing databases: ['admin', 'assignment-yimingg2', 'local']
Existing

In [8]:
assert_is_instance(inserted_ids, list)
assert_true(all(isinstance(i, bson.objectid.ObjectId) for i in inserted_ids))

assert_true(dbname in client.database_names())
assert_true('0101' in client[dbname].collection_names())
assert_equal(client[dbname]['0101'].count(), len(observations))

## Retrieving Data

- Find all documents that have a given weather `condition` (e.g., `conds == "Clear"` or `conds == "Partly Cloudy"`).
- Return the `_id` values of all documents that match the search query.

In [9]:
def retrieve_data(collection, condition):
    '''
    Finds all documents that have a given weather `condition`
    and returns the `_id` values of all documents that match the search query.
    
    Parameters
    ----------
    collection: A pymongo.collection.Collection instance.
    condition: A string, e.g., "Clear", "Partly Cloudy", "Overcast".
    
    Returns
    -------
    A list of bson.ObjectId
    '''
    
    #YOUR CODE HERE
    result = [doc['_id'] for doc in collection.find({"conds": condition})]
    
    return result
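
To make the filter `{"conds": condition}` concrete, here is a pure-Python sketch of a top-level equality match: a document matches when each key in the filter is present with exactly that value. The helper `matches` is hypothetical and covers only this simple case, not MongoDB's full query language (operators, nested fields, arrays).

```python
def matches(doc, query):
    """Return True if every key in `query` equals the value stored in `doc`."""
    return all(key in doc and doc[key] == value for key, value in query.items())


docs = [
    {'_id': 1, 'conds': 'Clear'},
    {'_id': 2, 'conds': 'Overcast'},
    {'_id': 3, 'conds': 'Clear'},
]

clear = [doc['_id'] for doc in docs if matches(doc, {'conds': 'Clear'})]
print(clear)  # [1, 3]
```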

In [10]:
clear_ids = retrieve_data(client[dbname]['0101'], 'Clear')
print(clear_ids)

[ObjectId('58f21972b870f800c7f9f455'), ObjectId('58f21972b870f800c7f9f45f'), ObjectId('58f21972b870f800c7f9f460'), ObjectId('58f21972b870f800c7f9f467'), ObjectId('58f21972b870f800c7f9f468'), ObjectId('58f21972b870f800c7f9f469')]


In [11]:
assert_is_instance(clear_ids, list)
assert_true(all(isinstance(i, bson.objectid.ObjectId) for i in clear_ids))

conds = {obs['conds'] for obs in observations}
for cond in conds:
    r = retrieve_data(client[dbname]['0101'], cond)
    n = [obs['_id'] for obs in observations if obs['conds'] == cond]
    assert_equal(len(r), len(n))
    assert_equal(set(r), set(n))

## Modifying Data

- Find all documents whose `conds` value is `"Clear"` and change the `conds` attribute to `"Cloudy"`.
- Return the number of documents modified as a validation of the process.

In [12]:
def modify_data(collection):
    '''
    Finds all documents whose "conds" value is "Clear"
    and changes the "conds" attribute to "Cloudy".

    Parameters
    ----------
    collection: A pymongo.collection.Collection instance.
    
    Returns
    -------
    An int. The number of documents modified.
    '''
    
    #YOUR CODE HERE
    # change attribute on every matching document in a single call
    result = collection.update_many({'conds': 'Clear'}, {'$set': {'conds': 'Cloudy'}})
    
    # the number of modified documents, as reported by the server
    return result.modified_count

In [13]:
n_modified = modify_data(client[dbname]['0101'])
print('{0} records modified.'.format(n_modified))

6 records modified.


In [14]:
assert_equal(
    n_modified,
    len([obs['_id'] for obs in observations if obs['conds'] == 'Clear'])
    )

conds = [obs['conds'] for obs in observations]
for cond in conds:
    if cond != 'Clear' and cond != 'Cloudy':
        r = retrieve_data(client[dbname]['0101'], cond)
        n = [obs['_id'] for obs in observations if obs['conds'] == cond]
        assert_equal(len(r), len(n))
        assert_equal(set(r), set(n))

## Advanced Querying

- Find all documents with `visi` equal to `"10.0"` and sort the documents by `conds`.
- Return a list of `conds` as a validation of the process.

In [15]:
def query(collection):
    '''
    Finds all documents with "visi" equal to "10.0"
    and sorts the documents by "conds".
    
    Parameters
    ----------
    collection: A pymongo.collection.Collection instance.

    Returns
    -------
    A list of strings.
    '''
    
    #YOUR CODE HERE
    result = [doc['conds'] for doc in collection.find({"visi": {'$eq': '10.0'}}).sort('conds')]

    return result
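
Because every measurement in this dataset is stored as a string, the filter compares `visi` with the string `"10.0"`, not the number 10. The sketch below uses plain Python, with no database, to illustrate two consequences: equality needs the exact text, and ordering is lexicographic rather than numeric. It assumes that string ordering in the database behaves like Python's character-by-character comparison for these ASCII values. The sample list is made up for illustration.

```python
visi_values = ['9.0', '10.0', '2.5', '10']

# Equality on strings needs the exact text: '10' is not '10.0'.
exact = [v for v in visi_values if v == '10.0']

# Sorting strings is lexicographic: '10' and '10.0' sort before '2.5' and '9.0'.
as_text = sorted(visi_values)

# Converting to float first gives numeric order.
as_numbers = sorted(visi_values, key=float)

print(exact)
print(as_text)
print(as_numbers)
```

Sorting by `conds` is unaffected, since alphabetical order is what we want for condition names.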

In [16]:
query_conds = query(client[dbname]['0101'])
print(query_conds)

['Cloudy', 'Cloudy', 'Cloudy', 'Cloudy', 'Cloudy', 'Cloudy', 'Mostly Cloudy', 'Mostly Cloudy', 'Mostly Cloudy', 'Mostly Cloudy', 'Overcast', 'Overcast', 'Partly Cloudy', 'Partly Cloudy', 'Partly Cloudy', 'Partly Cloudy', 'Partly Cloudy', 'Partly Cloudy', 'Scattered Clouds', 'Scattered Clouds', 'Scattered Clouds']


In [17]:
modified_conds = [obs['conds'] for obs in observations if obs['visi'] == '10.0']
modified_conds = ['Cloudy' if cond == 'Clear' else cond for cond in modified_conds]
modified_conds = sorted(modified_conds)
assert_equal(query_conds, modified_conds)

## Deleting Data

- Delete all documents whose `conds` attribute is equal to `"Cloudy"` from our collection.
- Return the number of documents deleted as a validation of the process.

In [18]:
def delete_data(collection):
    '''
    Deletes all documents whose "conds" == "Cloudy".
    
    Parameters
    ----------
    collection: A pymongo.collection.Collection instance.

    Returns
    -------
    An int. The number of documents deleted.
    '''
    
    #YOUR CODE HERE
    # delete documents in a single call
    result = collection.delete_many({'conds': 'Cloudy'})
    
    # the number of deleted documents, as reported by the server
    return result.deleted_count

In [19]:
n_deleted = delete_data(client[dbname]['0101'])
print('{0} records deleted.'.format(n_deleted))

6 records deleted.


In [20]:
deleted_obs = [obs for obs in modified_conds if obs == 'Cloudy']
assert_equal(n_deleted, len(deleted_obs))

for cond in set(conds):
    if cond != 'Clear' and cond != 'Cloudy':
        r = retrieve_data(client[dbname]['0101'], cond)
        n = [obs['_id'] for obs in observations if obs['conds'] == cond]
        assert_equal(len(r), len(n))
        assert_equal(set(r), set(n))

## Cleanup

When you are done or if you want to start over with a clean database, run the following code cell.

**Please make sure to run this cell before restarting the kernel and rerunning your code!**

In [21]:
if dbname in client.database_names():
    client.drop_database(dbname)
    
assert_true(dbname not in client.database_names())