# Flexibly Structured Data

## Intro to MongoDB and the Nobel Prize dataset

MongoDB is a document database that lets you explore data without requiring it to have a strict, predefined structure. You can store differently shaped records together and analyze them with the same queries.

Most application programming interfaces (APIs) on the web today exchange data in a common format. JavaScript is the language of web browsers, and JSON (JavaScript Object Notation) is a common way for web services and client code to pass data. It is also the basis of MongoDB's data format.

JSON has two collection structures: objects and arrays.

Objects map string keys to values, and the order of the keys is not significant. In arrays, the order of the values is significant. JSON objects are like Python dictionaries, arrays are like Python lists, and null maps to None in Python.
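As a quick check of this mapping, here is a minimal sketch using Python's standard json module. The JSON text is a made-up example, not taken from the Nobel Prize data.

```python
import json

# Hypothetical JSON text containing an object, an array, and null.
text = '{"name": "example", "tags": ["a", "b"], "note": null}'

doc = json.loads(text)

print(type(doc))          # the JSON object becomes a dict
print(type(doc["tags"]))  # the JSON array becomes a list
print(doc["note"])        # JSON null becomes None
print(doc["tags"][0])     # arrays keep their order, so index 0 is the first element
```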

A database maps names to collections. Collections can be accessed by name the same way values in Python dictionaries can be accessed. A collection is like a list of dictionaries, which MongoDB calls "documents". When a dictionary is a value within a document, that's a subdocument. MongoDB also supports some types beyond JSON that correspond to native Python types, such as dates and regular expressions.

To access databases and collections by name, you can use bracket notation ('[ ]') or dot notation.

To count documents, use the "count_documents" collection method. It takes a filter document that restricts which documents are counted; an empty filter ({}) counts them all.

To inspect one document, use ".find_one()".

In [None]:
# Start a local MongoDB server (mongod.exe) before running the cells below; see https://www.youtube.com/watch?v=D0U8vD8m1I0

In [4]:
import requests
from pymongo import MongoClient

client = MongoClient()
print(client)
db = client["nobel"]

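# Download each collection from the Nobel Prize API and load it into MongoDB.
# The API endpoints use the singular names ("prize", "laureate"), hence the [:-1].
for collection_name in ["prizes", "laureates"]:
    response = requests.get("http://api.nobelprize.org/v1/{}.json".format(collection_name[:-1]))
    documents = response.json()[collection_name]
    db[collection_name].insert_many(documents)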

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)


### Count documents in a collection


In [20]:
print(db.prizes.count_documents({}))
# Equivalent: client.nobel.prizes.count_documents({})

print(db.laureates.count_documents({}))

652
955


### Listing databases and collections

In [32]:
db_names = client.list_database_names()
print(db_names)

nobel_col_names = client.nobel.list_collection_names()
print(nobel_col_names)

['admin', 'config', 'local', 'nobel']
['laureates', 'prizes']



### List fields of a document

The .find_one() method of a collection can be used to retrieve a single document. 

In [37]:
db = client.nobel

In [42]:
prize = db.prizes.find_one()
laureate = db.laureates.find_one()

print(prize)
print(laureate)
print(type(laureate))

prize_fields = list(prize.keys())
laureate_fields = list(laureate.keys())

print(prize_fields)
print(laureate_fields)

{'_id': ObjectId('614dc00ea86e98712247ad2f'), 'year': '2020', 'category': 'chemistry', 'laureates': [{'id': '991', 'firstname': 'Emmanuelle', 'surname': 'Charpentier', 'motivation': '"for the development of a method for genome editing"', 'share': '2'}, {'id': '992', 'firstname': 'Jennifer A.', 'surname': 'Doudna', 'motivation': '"for the development of a method for genome editing"', 'share': '2'}]}
{'_id': ObjectId('614dc045a86e98712247afbb'), 'id': '1', 'firstname': 'Wilhelm Conrad', 'surname': 'Röntgen', 'born': '1845-03-27', 'died': '1923-02-10', 'bornCountry': 'Prussia (now Germany)', 'bornCountryCode': 'DE', 'bornCity': 'Lennep (now Remscheid)', 'diedCountry': 'Germany', 'diedCountryCode': 'DE', 'diedCity': 'Munich', 'gender': 'male', 'prizes': [{'year': '1901', 'category': 'physics', 'share': '1', 'motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"', 'affiliations': [{'name': 'Munich Un

## Finding documents


In [115]:
print("gender = female:", db.laureates.count_documents({'gender': 'female'}))

print("diedCountry = France:",db.laureates.count_documents({'diedCountry': 'France'}))

print("bornCity = Warsaw:",db.laureates.count_documents({'bornCity': 'Warsaw'}))

filter_doc = {'gender': 'female', 'diedCountry': 'France','bornCity': 'Warsaw'} # Marie Curie

print("gender= female, diedCountry = France, bornCity = Warsaw :", db.laureates.count_documents(filter_doc))

# Use $in with a list to match any one of several values

print("diedCountry = France or USA:", db.laureates.count_documents({"diedCountry": {"$in": ["France", "USA"]}})) # either France or USA
# Operators in MongoDB are prefixed with a dollar sign.

# Use $ne for not equal
print("diedCountry not France:",db.laureates.count_documents({"diedCountry": {"$ne": "France"}})) 

# Comparison operators: $gt (>), $gte (>=), $lt (<), $lte (<=)

print("diedCountry gt Belgium lt USA:",db.laureates.count_documents({"diedCountry": {"$gt": "Belgium", "$lte": "USA"}}))
# Strings are compared lexicographically. The upper bound here uses $lte, so "USA" itself is included.

gender = female: 57
diedCountry = France: 51
bornCity = Warsaw: 2
gender= female, diedCountry = France, bornCity = Warsaw : 1
diedCountry = France or USA: 278
diedCountry not France: 904
diedCountry gt Belgium lt USA: 489
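The last filter in the cell above relies on string ordering. As a rough illustration, Python's own string comparison gives the same ordering for plain names like these. The sketch assumes MongoDB's default comparison with no collation configured, and the country list is a small hand-picked sample of values that appear in this notebook.

```python
# Country names that appear in filters and documents in this notebook.
countries = ["Austria", "Belgium", "France", "Germany", "USA"]

lower, upper = "Belgium", "USA"

# Mirrors {"$gt": "Belgium", "$lte": "USA"}:
# the lower bound is exclusive and the upper bound is inclusive.
in_range = [c for c in countries if lower < c <= upper]
print(in_range)

# Uppercase letters sort before lowercase ones in this ordering,
# so capitalization affects which side of a bound a value falls on.
print("Z" < "a")
```

Because the comparison is purely character by character, such a range filter is only meaningful when the stored strings are written consistently.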


### "born" approximation

In [131]:
print(db.laureates.count_documents({"born": {"$lt": "1800"}}))
print(db.laureates.count_documents({"born": {"$lt": "1700"}}))

1
1
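The "born" field is stored as a string, so these filters compare text, not dates. A minimal sketch of why that still approximates a year cutoff, using "born" strings from documents printed in this notebook:

```python
# 'born' strings from laureate documents printed in this notebook.
born_values = ["1845-03-27", "1923-03-09", "1863-00-00"]

# Zero-padded, year-first strings sort in the same order as the dates themselves,
# so a bare year such as "1900" can serve as an approximate cutoff.
before_1900 = [b for b in born_values if b < "1900"]
print(before_1900)

# A placeholder such as '0000-00-00' (seen in a 'died' field in this notebook)
# sorts below every real year.
print("0000-00-00" < "1700")
```

A placeholder value of that kind in "born" would satisfy any "$lt" cutoff, which may explain why both counts above are 1. That is a guess and is worth confirming by inspecting the matching document with find_one.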


### Composing filters

It is often useful to incrementally build up a filter document in order to see the effect of adding constraints one at a time.

In [136]:
criteria = {"diedCountry": "USA"}
count = db.laureates.count_documents(criteria)
print(count)

criteria = {"diedCountry": "USA",
           "bornCountry": "Germany"}
count = db.laureates.count_documents(criteria)
print(count)

criteria = {"diedCountry": "USA",
           "bornCountry": "Germany",
           "firstname": "Albert"}
count = db.laureates.count_documents(criteria)
print(count)

227
5
1
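Each key added to a filter document is an extra condition that must also hold (an implicit AND), which is why the counts can only shrink. The pure-Python sketch below imitates that behaviour for simple top-level equality only. The matches helper is hypothetical and not part of pymongo, and the documents are trimmed copies of two laureates printed in this notebook.

```python
def matches(doc, criteria):
    """Return True if every key in criteria is present in doc with an equal value."""
    return all(key in doc and doc[key] == value for key, value in criteria.items())

# Trimmed copies of two laureate documents printed in this notebook.
docs = [
    {"firstname": "Wilhelm Conrad", "bornCountry": "Prussia (now Germany)", "diedCountry": "Germany"},
    {"firstname": "Walter", "bornCountry": "Austria", "diedCountry": "USA"},
]

criteria = {}
counts = []
for key, value in [("diedCountry", "USA"), ("bornCountry", "Austria"), ("firstname", "Albert")]:
    criteria[key] = value  # add one constraint at a time
    counts.append(sum(matches(d, criteria) for d in docs))

print(counts)
```

Real MongoDB matching has more rules (arrays, operators, nested paths), so this is only meant to illustrate how constraints combine.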


### We've got options

In [139]:
criteria = {"bornCountry": {"$in": ["Canada", "Mexico", "USA"]}}
count = db.laureates.count_documents(criteria)
print(count)

criteria = {"diedCountry": "USA", "bornCountry": {"$ne": "USA"}}  # $ne saves you from listing every other country in an $in
count = db.laureates.count_documents(criteria)
print(count)

302
73
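One detail worth knowing about "$ne": it also matches documents where the field is missing. The earlier counts are consistent with this, since 955 laureates minus 51 with diedCountry France gives the 904 reported for "not France". The sketch below imitates "$in" and "$ne" on a top-level field in plain Python. The field_matches helper is hypothetical, and the documents are trimmed from ones printed in this notebook.

```python
def field_matches(doc, field, condition):
    """Evaluate a single {"$in": [...]} or {"$ne": value} condition on a top-level field."""
    value = doc.get(field)  # None stands in for a missing field
    if "$in" in condition:
        return value in condition["$in"]
    if "$ne" in condition:
        return value != condition["$ne"]
    raise ValueError("only $in and $ne are handled in this sketch")

docs = [
    {"surname": "Kohn", "bornCountry": "Austria", "diedCountry": "USA"},
    {"surname": "Röntgen", "bornCountry": "Prussia (now Germany)", "diedCountry": "Germany"},
    {"firstname": "International Committee of the Red Cross"},  # no country fields at all
]

in_count = sum(field_matches(d, "diedCountry", {"$in": ["France", "USA"]}) for d in docs)
ne_count = sum(field_matches(d, "diedCountry", {"$ne": "France"}) for d in docs)
print(in_count, ne_count)
```

If documents without the field should be excluded, the filter can combine "$ne" with an "$exists" check on the same field.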


## Dot notation: reach into substructure

In [165]:
# Dot notation is how MongoDB allows us to query document substructure. 

print(db.laureates.find_one({"firstname": "Walter", "surname": "Kohn"}))

# Count laureates by fields nested inside the prizes array.
print("University of California", db.laureates.count_documents({"prizes.affiliations.name":"University of California"}))

print("Berkeley, CA", db.laureates.count_documents({"prizes.affiliations.city": "Berkeley, CA"}))

# Not every field is present in every document.
# Use the $exists operator to check whether a field is present.

print("bornCountry False", db.laureates.count_documents({"bornCountry":{"$exists": False}}))
print("prizes True", db.laureates.count_documents({"prizes":{"$exists": True}}))

# Every document has a prizes field, but it could be an empty array for some.
# To check for at least one element, test whether prizes.0 exists.
print("Non empty array:",db.laureates.count_documents({"prizes.0": {"$exists":True}}))

# Some laureates have more than one prize
print("More than one:",db.laureates.count_documents({"prizes.1": {"$exists":True}}))

{'_id': ObjectId('614dc045a86e98712247b0d7'), 'id': '290', 'firstname': 'Walter', 'surname': 'Kohn', 'born': '1923-03-09', 'died': '2016-04-19', 'bornCountry': 'Austria', 'bornCountryCode': 'AT', 'bornCity': 'Vienna', 'diedCountry': 'USA', 'diedCountryCode': 'US', 'diedCity': 'Santa Barbara, CA', 'gender': 'male', 'prizes': [{'year': '1998', 'category': 'chemistry', 'share': '2', 'motivation': '"for his development of the density-functional theory"', 'affiliations': [{'name': 'University of California', 'city': 'Santa Barbara, CA', 'country': 'USA'}]}]}
University of California 37
Berkeley, CA 21
bornCountry False 25
prizes True 955
Non empty array: 955
More than one: 6
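To make the path semantics concrete, here is a simplified pure-Python sketch of how a dotted path such as "prizes.affiliations.name" or "prizes.0" can be followed through subdocuments and arrays. The resolve function is a hypothetical helper, not MongoDB's actual implementation, and the document is a trimmed copy of the Walter Kohn document printed above.

```python
def resolve(value, path):
    """Collect every value reachable from `value` by following a dotted path."""
    if not path:
        return [value]
    head, _, rest = path.partition(".")
    if isinstance(value, dict):
        # Step into the subdocument if it has the key.
        return resolve(value[head], rest) if head in value else []
    if isinstance(value, list):
        if head.isdigit():
            # A numeric component is treated as an array index.
            index = int(head)
            return resolve(value[index], rest) if index < len(value) else []
        # Otherwise, apply the same path to every element of the array.
        results = []
        for item in value:
            results.extend(resolve(item, path))
        return results
    return []

# Trimmed copy of the Walter Kohn document printed above.
kohn = {
    "firstname": "Walter",
    "surname": "Kohn",
    "prizes": [
        {
            "year": "1998",
            "category": "chemistry",
            "affiliations": [
                {"name": "University of California", "city": "Santa Barbara, CA", "country": "USA"}
            ],
        }
    ],
}

print(resolve(kohn, "prizes.affiliations.name"))
print(bool(resolve(kohn, "prizes.0")))  # at least one prize
print(bool(resolve(kohn, "prizes.1")))  # no second prize
```

In this sketch, an equality filter on a dotted path corresponds to asking whether any of the collected values equals the target, and {"$exists": True} to asking whether the collected list is non-empty.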


### Choosing tools


In [191]:
# Prize winners born in Austria with no prize affiliation in Austria
db.laureates.count_documents({"bornCountry": "Austria", "prizes.affiliations.country": {"$ne":"Austria"}})

11

### Starting our ascent

In [192]:
criteria = {"bornCountry": "Austria", "prizes.affiliations.country": {"$ne": "Austria"}}
count = db.laureates.count_documents(criteria)
print(count)

11


### Our 'born' approximation, and a special laureate

In [195]:
criteria = {"prizes.2": {"$exists": True}}
doc = db.laureates.find_one(criteria)
print(doc)

{'_id': ObjectId('614dc045a86e98712247b196'), 'id': '482', 'firstname': 'International Committee of the Red Cross', 'born': '1863-00-00', 'died': '0000-00-00', 'gender': 'org', 'prizes': [{'year': '1917', 'category': 'peace', 'share': '1', 'motivation': '"for the efforts to take care of wounded soldiers and prisoners of war and their families"', 'affiliations': [[]]}, {'year': '1944', 'category': 'peace', 'share': '1', 'motivation': '"for the great work it has performed during the war on behalf of humanity"', 'affiliations': [[]]}, {'year': '1963', 'category': 'peace', 'share': '2', 'motivation': '"for promoting the principles of the Geneva Convention and cooperation with the UN"', 'affiliations': [[]]}]}
