### MongoDB

 - Create and compose query filters and operators
 - Use dot notation
 - Fetch values, arrays, use regex
 - Project, sort, index
 - Aggregate

 - JSON is the basis of MongoDB's data format
 - JSON has two collection structures
     - objects map string keys to values
     - arrays order values
 - Values are strings, numbers, true, false, null, or another object or array
 - Each JSON data type has a Python equivalent, as shown below
 

<img src="assets/mongodb/python_json_maps.png" style="width: 300px;"/>
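The mapping pictured above can be checked with Python's standard json module. This is a minimal sketch; the sample document is made up for illustration and is not taken from the Nobel data.

```python
import json

# A made-up JSON document that exercises every JSON value type
raw = (
    '{"name": "Marie", "prizes": 2, "ratio": 0.5, "alive": false, '
    '"spouse": null, "fields": ["physics", "chemistry"], '
    '"born": {"country": "Poland"}}'
)

doc = json.loads(raw)

# Each JSON value arrives as its Python counterpart
python_types = {key: type(value).__name__ for key, value in doc.items()}
print(python_types)
```

Objects become dict, arrays become list, and true, false and null become True, False and None.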
    
  - JSON/Python data types map onto MongoDB structures as follows:
  - A database maps names to collections. You can access collections by name the same way you access values in a Python dictionary.
  - A collection is like a list of dictionaries, which MongoDB calls documents
  - When a dictionary is a value within a document, it's called a sub-document
  - Values in a document can be any of the above types, as well as others such as dates and regular expressions

<img src="assets/mongodb/python_mongodb_maps.png" style="width: 600px;"/>

  - Access databases by name as attributes of the client, e.g. client.my_database
  - Access collections by name as attributes of databases, e.g. my_database.my_collection

In [1]:
import requests
from pymongo import MongoClient
import os
from dotenv import load_dotenv
load_dotenv(dotenv_path=os.getcwd()+'/.env')

True

In [2]:
# Connect using MONGO_URI; if the variable is unset, MongoClient(None) falls back to localhost
client = MongoClient(os.environ.get("MONGO_URI"))

# Get database names
client.list_database_names()

['nobel', 'time_series', 'admin', 'local']

In [4]:
# Get the "nobel" database (it is created lazily on the first insert)
db = client["nobel"]

for collection_name in ["prizes", "laureates"]:

    # Fetch the data from the API; the endpoint name is the singular
    # form of the collection name ("prize.json", "laureate.json")
    response = requests.get("http://api.nobelprize.org/v1/{}.json".format(collection_name[:-1]))
    
    # Parse the JSON response and pull out the list of documents
    documents = response.json()[collection_name]
    
    # Collections are created on the fly by the first insert
    db[collection_name].insert_many(documents)

In [5]:
client.list_database_names()

['nobel', 'time_series', 'admin', 'local']

In [6]:
# Save a list of names of the databases managed by client
db_names = client.list_database_names()
print(db_names)

# Save a list of names of the collections managed by the "nobel" database
nobel_coll_names = client.nobel.list_collection_names()
print(nobel_coll_names)

['nobel', 'time_series', 'admin', 'local']
['prizes', 'laureates']


In [7]:
# Connect to the "nobel" database
db = client.nobel

# Retrieve sample prize and laureate documents
prize = db.prizes.find_one()
laureate = db.laureates.find_one()

# Print the sample prize and laureate documents
print(prize)
print(laureate)
print(type(laureate))

# Get the fields present in each type of document
prize_fields = list(prize.keys())
laureate_fields = list(laureate.keys())

print(prize_fields)
print(laureate_fields)

{'_id': ObjectId('61fc097abfc7010032f63c6f'), 'year': '2021', 'category': 'chemistry', 'laureates': [{'id': '1002', 'firstname': 'Benjamin', 'surname': 'List', 'motivation': '"for the development of asymmetric organocatalysis"', 'share': '2'}, {'id': '1003', 'firstname': 'David', 'surname': 'MacMillan', 'motivation': '"for the development of asymmetric organocatalysis"', 'share': '2'}]}
{'_id': ObjectId('61fc097bbfc7010032f63f01'), 'id': '1', 'firstname': 'Wilhelm Conrad', 'surname': 'Röntgen', 'born': '1845-03-27', 'died': '1923-02-10', 'bornCountry': 'Prussia (now Germany)', 'bornCountryCode': 'DE', 'bornCity': 'Lennep (now Remscheid)', 'diedCountry': 'Germany', 'diedCountryCode': 'DE', 'diedCity': 'Munich', 'gender': 'male', 'prizes': [{'year': '1901', 'category': 'physics', 'share': '1', 'motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"', 'affiliations': [{'name': 'Munich University', '

#### Comparisons and filtering

In [8]:
# Greater than ("born" is stored as a string, so the comparison is lexicographic)
db.laureates.count_documents({'born':{'$gt':'1700'}})

# Less than
db.laureates.count_documents({'born':{'$lt':'1700'}})

# Create a filter for laureates named "Albert" who were born in Germany and died in the USA
criteria = {'firstname':'Albert', 
            'bornCountry': 'Germany', 
            'diedCountry': 'USA'}

# Save the count
count = db.laureates.count_documents(criteria)
print(count)

1
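Fields such as born and year hold strings in this dataset (for example '1845-03-27' and '2021' in the sample documents above), so $gt and $lt compare them as strings rather than as numbers or dates. The sketch below uses plain Python string comparison as a stand-in, on the assumption that it orders these ASCII values the same way; the shares list is hypothetical.

```python
# ISO-style date strings sort chronologically because the zero-padded
# year comes first
assert "1845-03-27" > "1700"
assert "1699-12-31" < "1700"

# Unpadded numeric strings do not sort numerically
shares = ["1", "2", "3", "4", "10"]  # hypothetical values
string_order = sorted(shares)
numeric_order = sorted(shares, key=int)
print(string_order)
print(numeric_order)

# A string never equals a number, so a filter must use the stored type
assert "4" != 4
```

The same reasoning applies to equality filters: a value stored as the string "4" is not matched by the integer 4.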


In [9]:
# Save a filter for laureates born in the USA, Canada, or Mexico
criteria = { 'bornCountry': 
                { "$in": ['USA','Canada','Mexico']}
             }

# Count them and save the count
count = db.laureates.count_documents(criteria)
print(count)

305


In [10]:
# Save a filter for laureates who died in the USA and were not born there
criteria = { 'diedCountry': 'USA',
               'bornCountry': { "$ne": 'USA'}, 
             }

# Count them
count = db.laureates.count_documents(criteria)
print(count)

73


#### Dot notation
 - Dot notation lets us reach into a document's substructure
 - A dotted name is the full path to a field from the document's root
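To make the idea of a path concrete, here is a simplified pure-Python sketch of how a dotted name can be walked through nested dictionaries and lists. The resolve helper and the sample document are hypothetical illustrations, not part of PyMongo, and the real server handles more cases than this.

```python
def resolve(value, path):
    """Collect every value reachable from `value` along a dotted path."""
    if not path:
        return [value]
    head, _, rest = path.partition(".")
    if isinstance(value, dict):
        return resolve(value[head], rest) if head in value else []
    if isinstance(value, list):
        if head.isdigit():
            # A numeric component selects an array position (zero-based)
            index = int(head)
            return resolve(value[index], rest) if index < len(value) else []
        # Otherwise the same path is applied to every array element
        found = []
        for item in value:
            found.extend(resolve(item, path))
        return found
    return []


# Abridged, hypothetical laureate document
laureate = {
    "firstname": "Ada",
    "prizes": [
        {"year": "1903", "affiliations": [{"country": "France"}]},
        {"year": "1911", "affiliations": [{"country": "Sweden"}]},
    ],
}

countries = resolve(laureate, "prizes.affiliations.country")
second_year = resolve(laureate, "prizes.1.year")
third_prize = resolve(laureate, "prizes.2")
print(countries, second_year, third_prize)
```

An empty result for "prizes.2" corresponds to a document that would fail an {"$exists": True} test on that path.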

In [11]:
# Filter for laureates born in Austria with no prize affiliation in Austria
criteria = {'bornCountry': 'Austria', 
              'prizes.affiliations.country': {"$ne": 'Austria'}}

# Count the number of such laureates
count = db.laureates.count_documents(criteria)
print(count)

11


In [12]:
# Filter for documents without a "born" field
criteria = {"born": {"$exists": False}}

# Save count
count = db.laureates.count_documents(criteria)
print(count)

2


In [13]:
# Filter for laureates with at least three prizes ("prizes.2" is the third array element)
criteria = {"prizes.2": {'$exists': True}}

# Find one laureate with at least three prizes
doc = db.laureates.find_one(criteria)

# Print the document
print(doc)

{'_id': ObjectId('61fc097bbfc7010032f640dc'), 'id': '482', 'firstname': 'International Committee of the Red Cross', 'born': '1863-00-00', 'died': '0000-00-00', 'gender': 'org', 'prizes': [{'year': '1917', 'category': 'peace', 'share': '1', 'motivation': '"for the efforts to take care of wounded soldiers and prisoners of war and their families"', 'affiliations': [[]]}, {'year': '1944', 'category': 'peace', 'share': '1', 'motivation': '"for the great work it has performed during the war on behalf of humanity"', 'affiliations': [[]]}, {'year': '1963', 'category': 'peace', 'share': '2', 'motivation': '"for promoting the principles of the Geneva Convention and cooperation with the UN"', 'affiliations': [[]]}]}


#### Distinct()

In [14]:
# Countries recorded as countries of death but not as countries of birth
countries = set(db.laureates.distinct('diedCountry')) - set(db.laureates.distinct('bornCountry'))
print(countries)

# The number of distinct countries of laureate affiliation for prizes
count = len(db.laureates.distinct('prizes.affiliations.country'))
print(count)

{'Puerto Rico', 'East Germany (now Germany)', 'Northern Rhodesia (now Zambia)', 'Israel', 'Barbados', 'Tunisia', 'Greece', 'Jamaica', 'Gabon', 'Yugoslavia (now Serbia)', 'Singapore'}
29


In [21]:
# Save a filter for prize documents with three or more laureates
criteria = {"laureates.2": {"$exists": True}}

# Save the set of distinct prize categories in documents satisfying the criteria
triple_play_categories = set(db.prizes.distinct("category", criteria))
assert set(db.prizes.distinct("category")) - triple_play_categories == {"literature"}

In [23]:
set(db.prizes.distinct("category"))

{'chemistry', 'economics', 'literature', 'medicine', 'peace', 'physics'}

In [22]:
triple_play_categories

{'chemistry', 'economics', 'medicine', 'peace', 'physics'}

#### Print top-level fields for each collection

In [15]:
for i in db.prizes.find_one({}): print(i)

_id
year
category
laureates


In [16]:
for i in db.laureates.find_one({}): print(i)

_id
id
firstname
surname
born
died
bornCountry
bornCountryCode
bornCity
diedCountry
diedCountryCode
diedCity
gender
prizes


#### Filter for laureates with a prize category in physics

In [24]:
db.laureates.count_documents({"prizes.category": "physics"})

218

#### Filter for laureates with no prize in physics

In [25]:
db.laureates.count_documents({"prizes.category": {"$ne": "physics"}})


750

#### Filter for laureates with a prize category in either physics, chemistry or medicine

In [26]:
db.laureates.count_documents({"prizes.category": {
        "$in": ["physics", "chemistry", "medicine"]}})

628

#### Filter for laureates with no prize in physics, chemistry, or medicine

In [27]:
db.laureates.count_documents({
    "prizes.category": {
    "$nin": ["physics", "chemistry", "medicine"]}})

340

#### Filter for laureates with at least one unshared prize in physics

In [28]:
db.laureates.count_documents({
    "prizes": {"$elemMatch": {"category": "physics", "share": "1"}}})

47

In [29]:
db.laureates.count_documents({
    "prizes": {"$elemMatch": {
        "category": "physics", "share": "1", "year": {"$lt": "1945"},}}})

29

In [30]:
db.laureates.count_documents({
    "prizes": {"$elemMatch": {
        "category": "physics", "share": {"$ne": "1"}, "year": {"$gte": "1945"}}}})

152
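$elemMatch requires a single array element to satisfy all conditions at once, whereas separate dotted conditions may each be satisfied by a different element. The following pure-Python sketch models only equality tests; the helper names and the sample prizes are hypothetical.

```python
def elem_match(prizes, **conditions):
    """True if one prize satisfies every condition (like $elemMatch)."""
    return any(
        all(prize.get(key) == value for key, value in conditions.items())
        for prize in prizes
    )


def dotted_match(prizes, **conditions):
    """True if each condition is met by some prize, not necessarily the same one."""
    return all(
        any(prize.get(key) == value for prize in prizes)
        for key, value in conditions.items()
    )


# Hypothetical laureate with a shared physics prize and an unshared chemistry prize
prizes = [
    {"category": "physics", "share": "2"},
    {"category": "chemistry", "share": "1"},
]

same_element = elem_match(prizes, category="physics", share="1")
any_element = dotted_match(prizes, category="physics", share="1")
print(same_element, any_element)
```

Here the dotted form matches even though no single prize is an unshared physics prize, which is why the counts above use $elemMatch.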

#### More complex filtering

In [31]:
# Save a filter for laureates with unshared prizes outside physics, chemistry and medicine, in or after 1945
unshared = {
    "prizes": {"$elemMatch": {
        "category": {"$nin": ["physics", "chemistry", "medicine"]},
        "share": "1",
        "year": {"$gte": "1945"},
    }}}

# Save a filter for laureates with shared prizes in the same categories and period
shared = {
    "prizes": {"$elemMatch": {
        "category": {"$nin": ["physics", "chemistry", "medicine"]},
        "share": {"$ne": "1"},
        "year": {"$gte": "1945"},
    }}}

ratio = db.laureates.count_documents(unshared) / db.laureates.count_documents(shared)
print(ratio)

1.2982456140350878


In [32]:
# Save a filter for organization laureates with prizes won before 1945
before = {
    'gender': 'org',
    'prizes.year': {'$lt': "1945"},
    }

# Save a filter for organization laureates with prizes won in or after 1945
in_or_after = {
    'gender': 'org',
    'prizes.year': {'$gte': "1945"},
    }

n_before = db.laureates.count_documents(before)
n_in_or_after = db.laureates.count_documents(in_or_after)
ratio = n_in_or_after / (n_in_or_after + n_before)
print(ratio)

0.8461538461538461


#### Filtering with regex

In [38]:
case_sensitive = db.laureates.distinct(
    "bornCountry",
    {"bornCountry": {"$regex": "Poland"}})
display(case_sensitive)

case_insensitive = db.laureates.distinct(
    "bornCountry",
    {"bornCountry": {"$regex": "poland", "$options": "i"}})

case_insensitive

['Austria-Hungary (now Poland)',
 'Free City of Danzig (now Poland)',
 'German-occupied Poland (now Poland)',
 'Germany (now Poland)',
 'Poland',
 'Poland (now Belarus)',
 'Poland (now Lithuania)',
 'Poland (now Ukraine)',
 'Prussia (now Poland)',
 'Russian Empire (now Poland)']

['Austria-Hungary (now Poland)',
 'Free City of Danzig (now Poland)',
 'German-occupied Poland (now Poland)',
 'Germany (now Poland)',
 'Poland',
 'Poland (now Belarus)',
 'Poland (now Lithuania)',
 'Poland (now Ukraine)',
 'Prussia (now Poland)',
 'Russian Empire (now Poland)']

In [39]:
assert set(case_sensitive) == set(case_insensitive)

In [40]:
from bson.regex import Regex

db.laureates.distinct("bornCountry",
                      {"bornCountry": Regex("poland", "i")})

['Austria-Hungary (now Poland)',
 'Free City of Danzig (now Poland)',
 'German-occupied Poland (now Poland)',
 'Germany (now Poland)',
 'Poland',
 'Poland (now Belarus)',
 'Poland (now Lithuania)',
 'Poland (now Ukraine)',
 'Prussia (now Poland)',
 'Russian Empire (now Poland)']

In [41]:
db.laureates.count_documents({"firstname": Regex('^G'), "surname": Regex('^S')})


10

In [44]:
# Filter for countries of birth that start with "Germany (now"; the raw string keeps the backslash that escapes the parenthesis
criteria = {"bornCountry": Regex(r"^Germany \(now")}

print(set(db.laureates.distinct("bornCountry", criteria)))

{'Germany (now Poland)', 'Germany (now France)', 'Germany (now Russia)'}


In [43]:
# Filter for countries of birth that are now in Germany: the value must end with "now Germany)"
criteria = {"bornCountry": Regex("now Germany\\)$")}

print(set(db.laureates.distinct("bornCountry", criteria)))

{'West Germany (now Germany)', 'Bavaria (now Germany)', 'East Friesland (now Germany)', 'Prussia (now Germany)', 'Schleswig (now Germany)', 'Württemberg (now Germany)', 'Hesse-Kassel (now Germany)', 'Mecklenburg (now Germany)'}
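The two anchored patterns can be tried out locally with Python's re module before sending them to the server. This is a sketch under the assumption that such simple patterns behave the same in MongoDB's regex engine; the short list of country strings is taken from the outputs above.

```python
import re

countries = [
    "Germany",
    "Germany (now Poland)",
    "Prussia (now Germany)",
    "West Germany (now Germany)",
]

# "^" anchors at the start; "\(" matches a literal opening parenthesis
starts_in_germany = re.compile(r"^Germany \(now")

# "$" anchors at the end; "\)" matches a literal closing parenthesis
now_germany = re.compile(r"now Germany\)$")

born_then = [c for c in countries if starts_in_germany.search(c)]
born_now = [c for c in countries if now_germany.search(c)]
print(born_then)
print(born_now)

# re.escape adds the backslashes for us
escaped = re.escape("(now")
```

Raw strings (r"...") avoid having to double each backslash in the pattern.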


In [42]:
# Save a filter for laureates with prize motivation values containing "transistor" as a substring
criteria = {"prizes.motivation": Regex("transistor")}

# Save the field names corresponding to a laureate's first name and last name
first, last = "firstname", "surname"
print([(laureate[first], laureate[last]) for laureate in db.laureates.find(criteria)])

[('William B.', 'Shockley'), ('John', 'Bardeen'), ('Walter H.', 'Brattain')]


#### Projection in MongoDB
 - Projection reduces multidimensional data to the fields of interest
 - Fetch a projection by specifying the document fields that interest us
 - Do this by passing a dictionary to the find method as its second argument
 - Give each field we want to include in the projection a value of 1
 - Fields left out of the dictionary are excluded, except _id, which is returned unless it is explicitly given a value of 0
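The rules above can be sketched in plain Python. PyMongo also accepts a list of field names as the projection, which behaves like a dictionary that gives each listed field a value of 1. The helpers below are hypothetical illustrations that handle top-level fields only, and the sample document is abridged and made up.

```python
def as_projection(fields, include_id=True):
    """Turn a list of field names into the equivalent projection dict."""
    projection = {field: 1 for field in fields}
    if not include_id:
        projection["_id"] = 0
    return projection


def project(doc, projection):
    """Apply a flat inclusion projection to one document."""
    keep_id = projection.get("_id", 1)
    result = {}
    for key, value in doc.items():
        if key == "_id":
            if keep_id:
                result[key] = value
        elif projection.get(key):
            result[key] = value
    return result


# Abridged, made-up document
sample = {"_id": 1, "firstname": "Henri", "surname": "Becquerel", "gender": "male"}

with_id = project(sample, as_projection(["firstname", "surname"]))
without_id = project(sample, as_projection(["firstname", "surname"], include_id=False))
print(with_id)
print(without_id)
```

The list form is concise, but excluding _id requires the dictionary form.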
 

In [46]:
# include only prizes.affiliations
# exclude _id

docs = db.laureates.find(
         filter={},
         projection={"prizes.affiliations": 1, "_id": 0})

# convert to list and slice

list(docs)[:3]

[{'prizes': [{'affiliations': [{'name': 'Munich University',
      'city': 'Munich',
      'country': 'Germany'}]}]},
 {'prizes': [{'affiliations': [{'name': 'Leiden University',
      'city': 'Leiden',
      'country': 'the Netherlands'}]}]},
 {'prizes': [{'affiliations': [{'name': 'Amsterdam University',
      'city': 'Amsterdam',
      'country': 'the Netherlands'}]}]}]

In [49]:
# Projection that gives only the firstname, surname and prize share info
db.laureates.find_one({"prizes":
                       {"$elemMatch": {"category": "physics", "year": "1903"}}
                      },
                      {"firstname": 1,
                       "surname": 1,
                       "prizes.share": 1,
                       "_id": 0} )

{'firstname': 'Henri', 'surname': 'Becquerel', 'prizes': [{'share': '2'}]}

In [50]:
# Use projection to select only firstname and surname
docs = db.laureates.find(
       filter= {"firstname" : {"$regex" : "^G"},
                "surname" : {"$regex" : "^S"}  },
   projection= ["firstname", "surname"]  )

# Iterate over docs and concatenate first name and surname
full_names = [doc["firstname"] + " " + doc['surname']  for doc in docs]

# Print the full names
print(full_names)

['Glenn T. Seaborg', 'George D. Snell', 'Gustav Stresemann', 'George Bernard Shaw', 'Giorgos Seferis', 'George J. Stigler', 'George F. Smoot', 'George E. Smith', 'George P. Smith', 'Gregg Semenza']


In [51]:
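# Fetch prize documents, projecting only each laureate's share.
# Note: at least one prize document has no "laureates" field,
# so the loop below ends with a KeyError (see the output).
prizes = db.prizes.find({}, ["laureates.share"])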

# Iterate over prizes
for prize in prizes:
    # Initialize total share
    total_share = 0
    
    # Iterate over laureates for the prize
    for laureate in prize["laureates"]:
        # add the share of the laureate to total_share
        total_share += 1 / float(laureate['share'])
        
    # Print the total share    
    print(total_share)


1.0
1.0
1.0
...


KeyError: 'laureates'
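The KeyError shows that at least one prize document lacks a "laureates" field. One way to make the loop robust is dict.get with an empty default, sketched below on hypothetical sample documents; another option would be to add {"laureates": {"$exists": True}} to the query filter. The total_share helper is an illustration, not part of the original notebook.

```python
def total_share(prize):
    """Sum 1/share over a prize's laureates; 0.0 when there are none."""
    return sum(
        (1 / float(laureate["share"]) for laureate in prize.get("laureates", [])),
        0.0,
    )


# Hypothetical prize documents, projected down to the shares
sample_prizes = [
    {"laureates": [{"share": "2"}, {"share": "4"}, {"share": "4"}]},
    {"laureates": [{"share": "1"}]},
    {},  # no "laureates" field at all
]

totals = [total_share(prize) for prize in sample_prizes]
print(totals)
```

A total of 0.0 then flags a prize without laureates instead of stopping the loop.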

#### Sorting in MongoDB

In [52]:
for doc in db.prizes.find(
    {"year": {"$gt": "1966", "$lt": "1970"}},
    ["category", "year"],
    sort=[("year", 1), ("category", -1)]):

    print("{year} {category}".format(**doc))

1967 physics
1967 peace
1967 medicine
1967 literature
1967 chemistry
1968 physics
1968 peace
1968 medicine
1968 literature
1968 chemistry
1969 physics
1969 peace
1969 medicine
1969 literature
1969 economics
1969 chemistry


In [53]:
docs = list(db.laureates.find(
    {"born": {"$gte": "1900"}, "prizes.year": {"$gte": "1954"}},
    {"born": 1, "prizes.year": 1, "_id": 0},
    sort=[("prizes.year", 1), ("born", -1)]))

for doc in docs[:5]:
    print(doc)

{'born': '1950-12-14', 'prizes': [{'year': '1954'}, {'year': '1981'}]}
{'born': '1916-08-25', 'prizes': [{'year': '1954'}]}
{'born': '1915-06-15', 'prizes': [{'year': '1954'}]}
{'born': '1901-02-28', 'prizes': [{'year': '1962'}, {'year': '1954'}]}
{'born': '1913-07-12', 'prizes': [{'year': '1955'}]}


In [54]:
from operator import itemgetter

def all_laureates(prize):  
    # sort the laureates by surname
    sorted_laureates = sorted(prize["laureates"], key=itemgetter("surname"))

    # extract surnames
    surnames = [laureate["surname"] for laureate in sorted_laureates]

    # concatenate surnames separated with " and " 
    all_names = " and ".join(surnames)

    return all_names

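# find physics prizes, project year and name, and sort by year
# (a prize with no "laureates" field makes all_laureates raise a KeyError;
# the first such prize follows 1915 in the output below)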
docs = db.prizes.find(
           filter= {"category": "physics"}, 
           projection= ["year", "laureates.firstname", "laureates.surname"], 
           sort= [("year", 1)])

# print the year and laureate names (from all_laureates)
for doc in docs:
    print("{year}: {names}".format(year=doc["year"], names=all_laureates(doc)))

1901: Röntgen
1902: Lorentz and Zeeman
1903: Becquerel and Curie and Curie
1904: Rayleigh
1905: Lenard
1906: Thomson
1907: Michelson
1908: Lippmann
1909: Braun and Marconi
1910: van der Waals
1911: Wien
1912: Dalén
1913: Kamerlingh Onnes
1914: von Laue
1915: Bragg and Bragg


KeyError: 'laureates'

In [55]:
# original categories from 1901
original_categories = db.prizes.distinct("category", {"year": "1901"})
print(original_categories)

# project year and category, and sort
docs = db.prizes.find(
        filter={},
        projection={"year":1, "category":1, "_id":0},
        sort=[("year", -1), ("category", 1)]
)

# print the documents
for doc in docs:
    print(doc)

['chemistry', 'literature', 'medicine', 'peace', 'physics']
{'year': '2021', 'category': 'chemistry'}
{'year': '2021', 'category': 'economics'}
{'year': '2021', 'category': 'literature'}
{'year': '2021', 'category': 'medicine'}
{'year': '2021', 'category': 'peace'}
{'year': '2021', 'category': 'physics'}
{'year': '2020', 'category': 'chemistry'}
{'year': '2020', 'category': 'economics'}
{'year': '2020', 'category': 'literature'}
{'year': '2020', 'category': 'medicine'}
{'year': '2020', 'category': 'peace'}
{'year': '2020', 'category': 'physics'}
{'year': '2019', 'category': 'chemistry'}
{'year': '2019', 'category': 'economics'}
{'year': '2019', 'category': 'literature'}
{'year': '2019', 'category': 'medicine'}
{'year': '2019', 'category': 'peace'}
{'year': '2019', 'category': 'physics'}
{'year': '2018', 'category': 'chemistry'}
{'year': '2018', 'category': 'economics'}
{'year': '2018', 'category': 'literature'}
{'year': '2018', 'category': 'medicine'}
{'year': '2018', 'category': 'peac

#### Indexes in MongoDB
 - An index is like a book's index
 - Think of each collection as a book, each document as a page, and each field as a type of content
 - Indexes are most useful for highly specific queries, large documents, or large collections
 - Use index_information() to list the available indexes
 - Use explain() to look at the query plan
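Reading a query plan by eye gets tedious, so a small helper can walk the winning plan and report whether an index scan is involved. This is a sketch that assumes the explain layout shown in this section (a "stage" key with a nested "inputStage"); other server versions or plan shapes may differ. The helper names and the abridged sample plans are hypothetical.

```python
def plan_stages(plan):
    """List stage names from the outermost stage down to the innermost input."""
    stages = []
    while plan is not None:
        stages.append(plan["stage"])
        plan = plan.get("inputStage")
    return stages


def uses_index(explain_output):
    """True if the winning plan contains an index scan (IXSCAN)."""
    winning_plan = explain_output["queryPlanner"]["winningPlan"]
    return "IXSCAN" in plan_stages(winning_plan)


# Abridged explain outputs in the shape shown in this section
collection_scan = {"queryPlanner": {"winningPlan": {
    "stage": "PROJECTION_SIMPLE",
    "inputStage": {"stage": "COLLSCAN"}}}}
index_scan = {"queryPlanner": {"winningPlan": {
    "stage": "PROJECTION_COVERED",
    "inputStage": {"stage": "IXSCAN"}}}}

print(plan_stages(collection_scan["queryPlanner"]["winningPlan"]))
print(uses_index(collection_scan), uses_index(index_scan))
```

COLLSCAN means every document was examined; IXSCAN means the query was answered through an index.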

In [56]:
db.laureates.index_information()

{'_id_': {'v': 2, 'key': [('_id', 1)]}}

In [57]:
db.laureates.find(
    {"firstname": "Marie"}, {"bornCountry": 1, "_id": 0}).explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'nobel.laureates',
  'indexFilterSet': False,
  'parsedQuery': {'firstname': {'$eq': 'Marie'}},
  'winningPlan': {'stage': 'PROJECTION_SIMPLE',
   'transformBy': {'bornCountry': 1, '_id': 0},
   'inputStage': {'stage': 'COLLSCAN',
    'filter': {'firstname': {'$eq': 'Marie'}},
    'direction': 'forward'}},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 1,
  'executionTimeMillis': 1,
  'totalKeysExamined': 0,
  'totalDocsExamined': 968,
  'executionStages': {'stage': 'PROJECTION_SIMPLE',
   'nReturned': 1,
   'executionTimeMillisEstimate': 0,
   'works': 970,
   'advanced': 1,
   'needTime': 968,
   'needYield': 0,
   'saveState': 0,
   'restoreState': 0,
   'isEOF': 1,
   'transformBy': {'bornCountry': 1, '_id': 0},
   'inputStage': {'stage': 'COLLSCAN',
    'filter': {'firstname': {'$eq': 'Marie'}},
    'nReturned': 1,
    'executionTimeMillisEstimate': 0,
    'works': 970,
    'advanced': 1,
 

In [59]:
db.laureates.create_index([("firstname", 1), ("bornCountry", 1)])
db.laureates.find(
    {"firstname": "Marie"}, {"bornCountry": 1, "_id": 0}).explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'nobel.laureates',
  'indexFilterSet': False,
  'parsedQuery': {'firstname': {'$eq': 'Marie'}},
  'winningPlan': {'stage': 'PROJECTION_COVERED',
   'transformBy': {'bornCountry': 1, '_id': 0},
   'inputStage': {'stage': 'IXSCAN',
    'keyPattern': {'firstname': 1, 'bornCountry': 1},
    'indexName': 'firstname_1_bornCountry_1',
    'isMultiKey': False,
    'multiKeyPaths': {'firstname': [], 'bornCountry': []},
    'isUnique': False,
    'isSparse': False,
    'isPartial': False,
    'indexVersion': 2,
    'direction': 'forward',
    'indexBounds': {'firstname': ['["Marie", "Marie"]'],
     'bornCountry': ['[MinKey, MaxKey]']}}},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 1,
  'executionTimeMillis': 1,
  'totalKeysExamined': 1,
  'totalDocsExamined': 0,
  'executionStages': {'stage': 'PROJECTION_COVERED',
   'nReturned': 1,
   'executionTimeMillisEstimate': 0,
   'works': 2,
   'advanced': 1,

In [None]:
# For a distinct query, the field to collect is passed first and the
# filter second. Which index is best suited to speeding up this operation?

db.prizes.distinct("category", {"laureates.share": {"$gt": "3"}})

# Answer: index the filtered field first, then the collected field
[("laureates.share", 1), ("category", 1)]

 - Specify an index model that indexes first on category (ascending) and second on year (descending).
 - Save a string report for printing the last single-laureate year for each distinct category, one category per line. To do this, for each distinct prize category, find the latest-year prize (requiring a descending sort by year) of that category (so, find matches for that category) with a laureate share of "1".

In [None]:
# Specify an index model for compound sorting
index_model = [('category', 1), ('year', -1)]
db.prizes.create_index(index_model)

# Collect the last single-laureate year for each category
report = ""
for category in sorted(db.prizes.distinct("category")):
    doc = db.prizes.find_one(
        {"category": category, "laureates.share": "1"},
        sort=[("year", -1)]
    )
    report += "{category}: {year}\n".format(**doc)

print(report)

 - Create an index on country of birth ("bornCountry") for db.laureates to ensure efficient gathering of distinct values and counting of documents
 
 - Complete the skeleton dictionary comprehension to construct n_born_and_affiliated: for each distinct country of birth, the count of laureates who were born in that country and also have a prize affiliation there. For each call to count_documents, ensure that you use the value of country to filter documents properly.

In [None]:
from collections import Counter

# Ensure an index on country of birth
db.laureates.create_index([('bornCountry', 1)])

# Collect a count of laureates for each country of birth
n_born_and_affiliated = {
    country: db.laureates.count_documents({
        "bornCountry": country,
        "prizes.affiliations.country": country
    })
    for country in db.laureates.distinct("bornCountry")
}

five_most_common = Counter(n_born_and_affiliated).most_common(5)
print(five_most_common)

#### Limits in MongoDB

In [60]:
# Chained limit() calls do not stack: the last one (5) wins
list(db.prizes.find({"category": "economics"},
                    {"year": 1, "_id": 0})
     .sort("year")
     .limit(3)
     .limit(5))

[{'year': '1969'},
 {'year': '1970'},
 {'year': '1971'},
 {'year': '1972'},
 {'year': '1973'}]

 - Save to filter_ the filter document to fetch only prizes with one or more quarter-share laureates, i.e. with a "laureates.share" of "4".
 - Save to projection the list of field names so that prize category, year and laureates' motivations ("laureates.motivation") may be fetched for inspection.
 - Save to cursor a cursor that will yield prizes, sorted by ascending year. Limit this to five prizes, and sort using the most concise specification.

In [62]:
from pprint import pprint

# Fetch prizes with quarter-share laureate(s).
# Shares are stored as strings, so match "4" rather than the integer 4.
filter_ = {'laureates.share': '4'}

# Save the list of field names
projection = ['laureates.motivation', 'category', 'year']

# Save a cursor to yield the first five prizes
cursor = db.prizes.find(filter_, projection).sort('year').limit(5)
pprint(list(cursor))


 - Complete the function get_particle_laureates that, given page_number and page_size, retrieves a given page of prize data on laureates who have the word "particle" (use $regex) in their prize motivations ("prizes.motivation"). Sort laureates first by ascending "prizes.year" and next by ascending "surname".
 - Collect and save the first nine pages of laureate data to pages.

In [61]:
from pprint import pprint

# Write a function to retrieve a page of data
def get_particle_laureates(page_number=1, page_size=3):
    # Check the type first so that a non-integer raises ValueError, not TypeError
    if not isinstance(page_number, int) or page_number < 1:
        raise ValueError("Pages are natural numbers (starting from 1).")
    particle_laureates = list(
        db.laureates.find(
            {'prizes.motivation': {'$regex': "particle"}},
            ["firstname", "surname", "prizes"])
        .sort([('prizes.year', 1), ('surname', 1)])
        .skip(page_size * (page_number - 1))
        .limit(page_size))
    return particle_laureates

# Collect and save the first nine pages
pages = [get_particle_laureates(page_number=page) for page in range(1, 10)]
pprint(pages[0])

[{'_id': ObjectId('61fc097bbfc7010032f63f21'),
  'firstname': 'C.T.R.',
  'prizes': [{'affiliations': [{'city': 'Cambridge',
                                'country': 'United Kingdom',
                                'name': 'University of Cambridge'}],
              'category': 'physics',
              'motivation': '"for his method of making the paths of '
                            'electrically charged particles visible by '
                            'condensation of vapour"',
              'share': '2',
              'year': '1927'}],
  'surname': 'Wilson'},
 {'_id': ObjectId('61fc097bbfc7010032f63f37'),
  'firstname': 'John',
  'prizes': [{'affiliations': [{'city': 'Harwell, Berkshire',
                                'country': 'United Kingdom',
                                'name': 'Atomic Energy Research '
                                        'Establishment'}],
              'category': 'physics',
              'motivation': '"for their pioneer work on the transmutati
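The paging arithmetic in get_particle_laureates can be checked without a database. The sketch below applies the same skip and limit values to a plain list; page_window, get_page and the items list are hypothetical names used only for illustration.

```python
def page_window(page_number, page_size):
    """Return the (skip, limit) pair for a 1-based page number."""
    if not isinstance(page_number, int) or page_number < 1:
        raise ValueError("Pages are natural numbers (starting from 1).")
    return page_size * (page_number - 1), page_size


def get_page(items, page_number, page_size=3):
    """Slice a list the way skip() and limit() slice a sorted cursor."""
    skip, limit = page_window(page_number, page_size)
    return items[skip:skip + limit]


items = list(range(1, 11))  # ten stand-in "documents"

# range(1, 5) yields pages 1 through 4; the stop value is exclusive
pages = [get_page(items, page) for page in range(1, 5)]
print(pages)
```

Because the stop value of range is exclusive, collecting the first n pages needs range(1, n + 1).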

#### Aggregations in MongoDB
 - An aggregation pipeline is an explicit list of stages
 - Each stage uses a stage operator that represents a function
 - Stage operators exist for matching (filtering), sorting, limiting, projecting, skipping, grouping, and more

In [5]:
from collections import OrderedDict

list(db.laureates.aggregate([
    {"$match": {"bornCountry": "USA"}},
    {"$project": {"prizes.year": 1, "_id": 0}},
    {"$sort": OrderedDict([("prizes.year", 1)])},
    {"$skip": 1},
    {"$limit": 3}
]))

[{'prizes': [{'year': '1912'}]},
 {'prizes': [{'year': '1914'}]},
 {'prizes': [{'year': '1919'}]}]

#### Same output: Sequencing stages vs Cursor

In [6]:
# The server sorts before it limits, whatever order the cursor methods are chained in
cursor = (db.laureates.find(
    projection={"firstname": 1, "prizes.year": 1, "_id": 0},
    filter={"gender": "org"})
 .limit(3).sort("prizes.year", -1))

In [8]:
for c in cursor:
    print(c)

{'firstname': 'World Food Programme', 'prizes': [{'year': '2020'}]}
{'firstname': 'International Campaign to Abolish Nuclear Weapons', 'prizes': [{'year': '2017'}]}
{'firstname': 'National Dialogue Quartet', 'prizes': [{'year': '2015'}]}


In [10]:
project_stage = {"$project": {"firstname": 1, "prizes.year": 1, "_id": 0}}
match_stage = {"$match": {"gender": "org"}}
limit_stage = {"$limit": 3}
sort_stage = {"$sort": {"prizes.year": -1}}
list(db.laureates.aggregate([match_stage, project_stage, sort_stage, limit_stage]))

[{'firstname': 'World Food Programme', 'prizes': [{'year': '2020'}]},
 {'firstname': 'International Campaign to Abolish Nuclear Weapons',
  'prizes': [{'year': '2017'}]},
 {'firstname': 'National Dialogue Quartet', 'prizes': [{'year': '2015'}]}]
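The correspondence between cursor arguments and pipeline stages can be written down as a small translation function. This is a sketch, not a PyMongo feature: find_to_pipeline is a hypothetical helper, and placing $project last is a choice made here so that the sort key is still available when sorting.

```python
def find_to_pipeline(filter_=None, projection=None, sort=None, skip=0, limit=0):
    """Build an aggregation pipeline equivalent to a simple find() call."""
    pipeline = []
    if filter_:
        pipeline.append({"$match": filter_})
    if sort:
        # sort is a list of (field, direction) pairs, as in find()
        pipeline.append({"$sort": dict(sort)})
    if skip:
        pipeline.append({"$skip": skip})
    if limit:
        pipeline.append({"$limit": limit})
    if projection:
        pipeline.append({"$project": projection})
    return pipeline


pipeline = find_to_pipeline(
    filter_={"gender": "org"},
    projection={"firstname": 1, "prizes.year": 1, "_id": 0},
    sort=[("prizes.year", -1)],
    limit=3,
)
for stage in pipeline:
    print(stage)
```

Unlike cursor methods, pipeline stages run in the order listed, so $sort must come before $limit to get the three most recent documents.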

#### Sequencing stages

In [None]:
# Translate cursor to aggregation pipeline
pipeline = [
    {"$match": {"gender": {"$ne": "org"}}},
    {"$project": {"bornCountry": 1, "prizes.affiliations.country": 1}},
    {"$limit": 3}
]

for doc in db.laureates.aggregate(pipeline):
    print("{bornCountry}: {prizes}".format(**doc))

In [24]:
from collections import OrderedDict
from itertools import groupby
from operator import itemgetter

original_categories = set(db.prizes.distinct("category", {"year": "1902"}))

# Save a pipeline to collect original-category prizes
pipeline = [
    {"$match": {"category": {"$in": list(original_categories)}}},
    {"$project": {"year": 1, "category": 1}},
    {"$sort": OrderedDict([("year", -1)])}
]

cursor = db.prizes.aggregate(pipeline)



In [25]:
for key, group in groupby(cursor, key=itemgetter("year")):
    
    missing = original_categories - {doc["category"] for doc in group}
    print(key, missing)
    if missing:
        print("{year}: {missing}".format(year=key, missing=", ".join(sorted(missing))))

2021 set()
2020 set()
2019 set()
2018 set()
2017 set()
2016 set()
2015 set()
2014 set()
2013 set()
2012 set()
2011 set()
2010 set()
2009 set()
2008 set()
2007 set()
2006 set()
2005 set()
2004 set()
2003 set()
2002 set()
2001 set()
2000 set()
1999 set()
1998 set()
1997 set()
1996 set()
1995 set()
1994 set()
1993 set()
1992 set()
1991 set()
1990 set()
1989 set()
1988 set()
1987 set()
1986 set()
1985 set()
1984 set()
1983 set()
1982 set()
1981 set()
1980 set()
1979 set()
1978 set()
1977 set()
1976 set()
1975 set()
1974 set()
1973 set()
1972 set()
1971 set()
1970 set()
1969 set()
1968 set()
1967 set()
1966 set()
1965 set()
1964 set()
1963 set()
1962 set()
1961 set()
1960 set()
1959 set()
1958 set()
1957 set()
1956 set()
1955 set()
1954 set()
1953 set()
1952 set()
1951 set()
1950 set()
1949 set()
1948 set()
1947 set()
1946 set()
1945 set()
1944 set()
1943 set()
1942 set()
1941 set()
1940 set()
1939 set()
1938 set()
1937 set()
1936 set()
1935 set()
1934 set()
1933 set()
1932 set()
1931 set()

Field paths in operator expressions are prefixed with "$" to distinguish them from literal string values. JSON has no set type, so MongoDB set operators take their "sets" as arrays written in square brackets, just like lists.
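The set operators used in this section, $setEquals and $setIsSubset, treat arrays as sets, so duplicates and order are ignored. The pure-Python sketch below mirrors that behavior with Python sets; the function names and the share lists are hypothetical.

```python
def set_equals(a, b):
    """Like $setEquals: same distinct elements, ignoring order and repeats."""
    return set(a) == set(b)


def set_is_subset(a, b):
    """Like $setIsSubset: every element of a also appears in b."""
    return set(a) <= set(b)


def mixed_thirds(shares):
    """True when some, but not all, laureates hold a one-third share."""
    all_three = set_equals(shares, ["3"])
    none_three = not set_is_subset(["3"], shares)
    return not (all_three or none_three)


print(set_equals(["3", "3", "3"], ["3"]))
print(mixed_thirds(["3", "3", "3"]))
print(mixed_thirds(["2", "4", "4"]))
print(mixed_thirds(["3", "3", "2"]))  # hypothetical inconsistent prize
```

A prize whose shares mix "3" with other values is the only case that passes, which is the kind of inconsistency a $nor match on allThree and noneThree would surface.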

In [31]:
list(db.prizes.aggregate([
    {"$project": {
        "allThree": {"$setEquals": ["$laureates.share", ["3"]]},
        "noneThree": {"$not": {"$setIsSubset": [["3"], "$laureates.share"]}}}},
    {"$match": {"$nor": [{"allThree": True}, {"noneThree": True}]}}]))

 - Fill out the pipeline to count the prizes awarded (at least partly) to organizations. First, $match on the "gender" value that designates organizations.
 
 - Then, use a field path to project the number of prizes for each organization as the "$size" of its "prizes" array. Recall that to refer to the value of a field "<my_field>", you use the field path "$<my_field>".
 
 - Finally, use a single group ({"_id": None}) to sum the prize counts over all organizations.

In [32]:
# Count prizes awarded (at least partly) to organizations as a sum over sizes of "prizes" arrays.
pipeline = [
    {"$match": {"gender": "org"}},
    {"$project": {"n_prizes": {"$size": "$prizes"}}},
    {"$group": {"_id": None, "n_prizes_total": {"$sum": "$n_prizes"}}}
]

print(list(db.laureates.aggregate(pipeline)))

[{'_id': None, 'n_prizes_total': 28}]


Implement the missing-categories check as an aggregation pipeline that:

 - Filters for original prize categories (i.e. sans economics),
 - Projects category and year,
 - Groups distinct prize categories awarded by year,
 - Projects prize categories not awarded by year,
 - Filters for years with missing prize categories, and
 - Returns a cursor of documents in reverse chronological order, one per year, each with a list of missing prize categories for that year.
 - Remember to use field paths (precede field names with "$") to extract field values in expressions.

 - Make the $group stage output a document for each prize year (set "_id" to the field path for year) with the set of categories awarded that year.
 
 - Given your intermediate collection of year-keyed documents, $project a field named "missing" with the (original) categories not awarded that year.
 
 - Use a $match stage to only pass through documents with at least one missing prize category.
 
 - Finally, sort the documents in descending order of year.

In [33]:
from collections import OrderedDict

original_categories = sorted(set(db.prizes.distinct("category", {"year": "1901"})))
pipeline = [
    {"$match": {"category": {"$in": original_categories}}},
    {"$project": {"category": 1, "year": 1}},
    
    # Collect the set of category values for each prize year.
    {"$group": {"_id": "$year", "categories": {"$addToSet": "$category"}}},
    
    # Project categories *not* awarded (i.e., that are missing this year).
    {"$project": {"missing": {"$setDifference": [original_categories, "$categories"]}}},
    
    # Only include years with at least one missing category
    {"$match": {"missing.0": {"$exists": True}}},
    
    # Sort in reverse chronological order. Note that "_id" is a distinct year at this stage.
    {"$sort": OrderedDict([("_id", -1)])},
]
for doc in db.prizes.aggregate(pipeline):
    print("{year}: {missing}".format(year=doc["_id"],missing=", ".join(sorted(doc["missing"]))))

In [35]:
original_categories

['chemistry', 'literature', 'medicine', 'peace', 'physics']

#### Array fields

 - The $expr operator allows embedding aggregation expressions in a normal query (or in a $match stage).
 

In [37]:
db.laureates.count_documents({"bornCountry": {"$in": db.laureates.distinct("bornCountry")}})

942

In [38]:
db.laureates.count_documents({"$expr": {"$in": ["$bornCountry", db.laureates.distinct("bornCountry")]}})

942

In [39]:
db.laureates.count_documents({"$expr": {"$eq": [{"$type": "$bornCountry"}, "string"]}})

942

In [40]:
db.laureates.count_documents({"bornCountry": {"$type": "string"}})

942
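
The four counts above agree because each filter can also be written without `$expr`. What `$expr` adds is the ability to compare two fields of the same document. A minimal sketch, assuming the `db` handle from earlier cells; the helper name `count_died_where_born` is hypothetical, and no result is shown because this query was not run here.

```python
# Filter matching laureates whose country of death equals their country of birth.
# Both operands are field paths, so the comparison is evaluated per document.
same_country = {"$expr": {"$eq": ["$bornCountry", "$diedCountry"]}}

def count_died_where_born(db):
    """Count documents whose "bornCountry" and "diedCountry" compare equal."""
    return db.laureates.count_documents(same_country)
```

Calling `count_died_where_born(db)` would run the query. Documents that lack both fields may also compare equal, so a real query might combine this with a `$type` check like the one above.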

Build an aggregation pipeline to get the count of laureates who either did or did not win a prize with an affiliation country that is a substring of their country of birth -- for example, the prize affiliation country "Germany" should match the country of birth "Prussia (now Germany)".

 - Use $unwind stages to ensure a single prize affiliation country per pipeline document.
 
 - Filter out prize-affiliation-country values that are "empty" (null, not present, etc.) by ensuring that values are "$in" the list of known values.
 
 - Produce a count of documents for each value of "affilCountrySameAsBorn" (a field we've projected for you using the $indexOfBytes operator) by adding 1 to the running sum.
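
To see what the two $unwind stages do to a nested document, here is a rough pure-Python analogue. It is a sketch only: `unwind` is a hypothetical helper (not part of pymongo), the sample laureate is made up, and parent fields along the path are assumed to be sub-documents rather than arrays.

```python
import copy

def unwind(docs, path):
    """Rough analogue of {"$unwind": "$<path>"} for a list of dicts."""
    *parents, leaf = path.split(".")
    out = []
    for doc in docs:
        # Walk down to the sub-document that holds the array field
        holder = doc
        for key in parents:
            holder = holder.get(key) if isinstance(holder, dict) else None
        values = holder.get(leaf) if isinstance(holder, dict) else None
        if values is None or values == []:
            continue  # missing, null, or empty arrays yield no output
        if not isinstance(values, list):
            values = [values]  # a non-array value passes through as one document
        for value in values:
            new_doc = copy.deepcopy(doc)
            target = new_doc
            for key in parents:
                target = target[key]
            target[leaf] = value  # replace the array with a single element
            out.append(new_doc)
    return out

# Hypothetical laureate with two prizes and three affiliations in total
laureate = {"id": "x", "prizes": [
    {"year": "1901", "affiliations": [{"country": "Germany"}, {"country": "France"}]},
    {"year": "1911", "affiliations": [{"country": "Germany"}]},
]}
per_prize = unwind([laureate], "prizes")
per_affiliation = unwind(per_prize, "prizes.affiliations")
print(len(per_prize), len(per_affiliation))
```

After both steps, each document carries exactly one affiliation, so "prizes.affiliations.country" is a single string rather than an array of arrays.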

In [41]:
key_ac = "prizes.affiliations.country"
key_bc = "bornCountry"
pipeline = [
    {"$project": {key_bc: 1, key_ac: 1}},

    # Ensure a single prize affiliation country per pipeline document
    {"$unwind": "$prizes"},
    {"$unwind": "$prizes.affiliations"},

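    # Ensure values in the list of distinct values (so not empty)
    {"$match": {key_ac: {"$in": db.laureates.distinct(key_ac)}}},

    # $indexOfBytes takes [string, substring]: as written, this checks whether
    # bornCountry occurs within the affiliation country. Swap the two field
    # paths to test the direction described in the task.
    {"$project": {"affilCountrySameAsBorn": {
        "$gte": [{"$indexOfBytes": ["$"+key_ac, "$"+key_bc]}, 0]}}},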

    # Count by "$affilCountrySameAsBorn" value (True or False)
    {"$group": {"_id": "$affilCountrySameAsBorn",
                "count": {"$sum": 1}}},
]
for doc in db.laureates.aggregate(pipeline): print(doc)

{'_id': False, 'count': 269}
{'_id': True, 'count': 505}
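
`$indexOfBytes` takes the string to search first and the substring to look for second, and it returns -1 when the substring is absent. A rough Python analogue makes the argument order easy to check. This is a sketch assuming UTF-8 encoding and non-null string inputs; `index_of_bytes` is a hypothetical helper.

```python
def index_of_bytes(string, substring):
    # Byte offset of the first occurrence, or -1 if the substring is absent
    return string.encode("utf-8").find(substring.encode("utf-8"))

born, affiliation = "Prussia (now Germany)", "Germany"
print(index_of_bytes(born, affiliation) >= 0)   # affiliation found inside born country
print(index_of_bytes(affiliation, born) >= 0)   # born country is not inside "Germany"
```

The order of the two field paths therefore decides which containment the `$gte ... 0` test reports.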


Some prize categories have laureates hailing from a greater number of countries than do other categories. Build an aggregation pipeline for the prizes collection to collect these numbers, using a $lookup stage to obtain laureate countries of birth.

 - $unwind the laureates array field to output one pipeline document for each array element.

 - After pulling in laureate bios with a $lookup stage, unwind the new laureate_bios array field (each laureate has only a single biography document).

 - Collect the set of bornCountries associated with each prize category.

 - Project out the size of each category's set of bornCountries.
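
The join performed by $lookup can be pictured as attaching, to each local document, an array of all foreign documents whose key matches. A rough pure-Python analogue, with hypothetical helpers `get_path` and `lookup` and made-up sample documents (real $lookup has additional rules, for example for array-valued keys, that this sketch ignores):

```python
def get_path(doc, path):
    """Follow a dotted path through nested dicts; return None if a step is missing."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc

def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Rough analogue of $lookup: attach all matching foreign documents as an array."""
    joined = []
    for doc in local_docs:
        key = get_path(doc, local_field)
        matches = [f for f in foreign_docs if get_path(f, foreign_field) == key]
        joined.append({**doc, as_field: matches})
    return joined

# Hypothetical, already-unwound prize documents and laureate biographies
prizes = [{"category": "physics", "laureates": {"id": "1"}},
          {"category": "peace", "laureates": {"id": "999"}}]
bios = [{"id": "1", "bornCountry": "Prussia (now Germany)"}]

joined = lookup(prizes, bios, "laureates.id", "id", "laureate_bios")
for doc in joined:
    print(doc["category"], len(doc["laureate_bios"]))
```

The unmatched prize keeps an empty array, which a following $unwind on "laureate_bios" would drop by default.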

In [43]:
pipeline = [
    # Unwind the laureates array
    {"$unwind": "$laureates"},
    {"$lookup": {
        "from": "laureates", "foreignField": "id",
        "localField": "laureates.id", "as": "laureate_bios"}},

    # Unwind the new laureate_bios array
    {"$unwind": "$laureate_bios"},
    {"$project": {"category": 1,
                  "bornCountry": "$laureate_bios.bornCountry"}},

    # Collect bornCountry values associated with each prize category
    {"$group": {"_id": "$category",
                "bornCountries": {"$addToSet": "$bornCountry"}}},

    # Project out the size of each category's (set of) bornCountries
    {"$project": {"category": 1,
                  "nBornCountries": {"$size": "$bornCountries"}}},
    {"$sort": {"nBornCountries": -1}},
]

for doc in db.prizes.aggregate(pipeline):
    print(doc)

 - In your aggregation pipeline, use the "gender" field to limit results to people (that is, not organizations).

 - Count prizes for which the laureate's "bornCountry" is not also the "country" of any of their affiliations for the prize. Be sure to use field paths (precede a field name with "$") when appropriate.

In [44]:
pipeline = [
    # Limit results to people; project needed fields; unwind prizes
    {"$match": {"gender": {"$ne": "org"}}},
    {"$project": {"bornCountry": 1, "prizes.affiliations.country": 1}},
    {"$unwind": "$prizes"},
  
    # Count prizes with no country-of-birth affiliation
    {"$addFields": {"bornCountryInAffiliations": {"$in": ["$bornCountry", "$prizes.affiliations.country"]}}},
    {"$match": {"bornCountryInAffiliations": False}},
    {"$count": "awardedElsewhere"},
]

print(list(db.laureates.aggregate(pipeline)))

[{'awardedElsewhere': 477}]


In [46]:
db.laureates.find_one()

{'_id': ObjectId('61fc097bbfc7010032f63f01'),
 'id': '1',
 'firstname': 'Wilhelm Conrad',
 'surname': 'Röntgen',
 'born': '1845-03-27',
 'died': '1923-02-10',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'bornCity': 'Lennep (now Remscheid)',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'diedCity': 'Munich',
 'gender': 'male',
 'prizes': [{'year': '1901',
   'category': 'physics',
   'share': '1',
   'motivation': '"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"',
   'affiliations': [{'name': 'Munich University',
     'city': 'Munich',
     'country': 'Germany'}]}]}

 - Construct a stage added_stage that filters for laureate "prizes.affiliations.country" values that are non-empty, that is, are $in a list of the distinct values that the field takes in the collection.

 - Insert this stage into the pipeline so that it filters individual prizes (that is, after the $unwind, not whole "prizes" arrays) and precedes any test for membership in an array of countries. Recall that the first parameter to <list>.insert is the (zero-based) index for insertion.
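
As a quick check of where index 3 lands, the sketch below uses only stage names as hypothetical placeholders for the full stage documents:

```python
# Stage names only (placeholders for the full stage documents)
stages = ["$match", "$project", "$unwind", "$addFields", "$match", "$count"]

# Index 3 places the new stage right after "$unwind" and before "$addFields"
stages.insert(3, "$match (non-empty country)")
print(stages)
```

The new filter therefore sees one prize per document and runs before the `$in` membership test in `$addFields`.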

In [47]:
pipeline = [
    {"$match": {"gender": {"$ne": "org"}}},
    {"$project": {"bornCountry": 1, "prizes.affiliations.country": 1}},
    {"$unwind": "$prizes"},
    {"$addFields": {"bornCountryInAffiliations": {"$in": ["$bornCountry", "$prizes.affiliations.country"]}}},
    {"$match": {"bornCountryInAffiliations": False}},
    {"$count": "awardedElsewhere"},
]

# Construct the additional filter stage
added_stage = {"$match": {"prizes.affiliations.country": {"$in": db.laureates.distinct("prizes.affiliations.country")}}}

# Insert this stage into the pipeline
pipeline.insert(3, added_stage)
print(list(db.laureates.aggregate(pipeline)))

[{'awardedElsewhere': 247}]
