# Working with Distinct Values and Sets

## Survey Distinct Values

Using the distinct() method, you can collect the set of values assigned to a field across all documents. Like count_documents(), it is a convenience method for a common aggregation. An aggregation processes data across a collection and produces a computed result.
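To make the behaviour concrete, here is a minimal pure-Python sketch of what distinct() computes. It is an illustration only, not PyMongo's implementation: the helpers `iter_values` and `distinct` and the three sample documents are hypothetical. The sketch descends into arrays along a dotted path and does not handle numeric index segments such as "prizes.1". The first-seen ordering is a choice made here, not something the database promises.

```python
def iter_values(doc, path):
    """Yield every value found at a dotted path, descending into lists."""
    if isinstance(doc, list):
        # An array of subdocuments: look inside each member.
        for item in doc:
            yield from iter_values(item, path)
        return
    key, _, rest = path.partition(".")
    if not isinstance(doc, dict) or key not in doc:
        return
    value = doc[key]
    if rest:
        yield from iter_values(value, rest)
    elif isinstance(value, list):
        yield from value  # array members count as individual values
    else:
        yield value


def distinct(docs, path):
    """Return the unique values at `path`, in first-seen order."""
    seen = []
    for doc in docs:
        for value in iter_values(doc, path):
            if value not in seen:
                seen.append(value)
    return seen


# Hypothetical miniature of the laureates collection.
laureates = [
    {"surname": "A", "prizes": [{"category": "physics"}, {"category": "chemistry"}]},
    {"surname": "B", "prizes": [{"category": "physics"}, {"category": "physics"}]},
    {"surname": "C", "prizes": [{"category": "literature"}]},
]

print(distinct(laureates, "prizes.category"))
```

Each category appears once in the result even though "physics" occurs in three prize subdocuments.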

In [2]:
import requests
from pymongo import MongoClient

client = MongoClient()
print(client)
db = client["nobel"]

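# Load both collections from the Nobel Prize API.
# Note: insert_many() appends, so re-running this cell duplicates the documents.
for collection_name in ["prizes", "laureates"]:
    # The endpoint uses the singular name: "prizes" -> "prize.json".
    response = requests.get("http://api.nobelprize.org/v1/{}.json".format(collection_name[:-1]))
    documents = response.json()[collection_name]
    db[collection_name].insert_many(documents)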

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)


### Categorical data validation

In [8]:
print(db.prizes.distinct("category"))
print(db.laureates.distinct("prizes.category"))

assert set(db.prizes.distinct("category")) == set(db.laureates.distinct("prizes.category"))

['chemistry', 'economics', 'literature', 'peace', 'physics', 'medicine']
['physics', 'chemistry', 'peace', 'medicine', 'literature', 'economics']
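The two lists printed above hold the same values in a different order, which is why the assertion compares sets. When such an assertion fails, it helps to see which values are missing on each side. The following is a small sketch of that idea. The function name `category_mismatches` and the extra value "maths" are made up for illustration.

```python
# The two lists as printed above.
prize_categories = ['chemistry', 'economics', 'literature', 'peace', 'physics', 'medicine']
laureate_categories = ['physics', 'chemistry', 'peace', 'medicine', 'literature', 'economics']

# The lists differ in order, so a list comparison fails while a set comparison passes.
print(prize_categories == laureate_categories)
print(set(prize_categories) == set(laureate_categories))


def category_mismatches(left, right):
    """Return (values only in left, values only in right) as sorted lists."""
    left, right = set(left), set(right)
    return sorted(left - right), sorted(right - left)


# No mismatch for the real output.
print(category_mismatches(prize_categories, laureate_categories))

# A hypothetical stray category shows up on the side that contains it.
print(category_mismatches(prize_categories + ["maths"], laureate_categories))
```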


### Never from there, but sometimes there at last

In [9]:
countries = set(db.laureates.distinct("diedCountry")) - set(db.laureates.distinct("bornCountry"))
print(countries)

{'Israel', 'Barbados', 'Greece', 'Tunisia', 'Philippines', 'Puerto Rico', 'Northern Rhodesia (now Zambia)', 'Yugoslavia (now Serbia)', 'Singapore', 'East Germany (now Germany)', 'Jamaica', 'Gabon'}


### Countries of affiliation

In [13]:
count = len(db.laureates.distinct("prizes.affiliations.country"))
print(count)

29


## Distinct Values Given Filters


In [18]:
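# The distinct() method takes an optional filter argument.

# Categories of prizes in which some laureate has share "4".
print(db.prizes.distinct("category", {"laureates.share": "4"}))

# Categories won by laureates who hold more than one prize
# ("prizes.1" exists only when the prizes array has a second element).
print(db.laureates.distinct("prizes.category", {"prizes.1": {"$exists": True}}))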


['physics', 'chemistry', 'medicine']
['chemistry', 'physics', 'peace']
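The filter {"prizes.1": {"$exists": True}} relies on zero-based array positions: index 1 exists only if the array has at least two members. Below is a rough pure-Python equivalent. It covers only the case where the field holds an array. The helper `has_element_at` and the sample documents are hypothetical.

```python
def has_element_at(doc, field, index):
    """Mimic {"<field>.<index>": {"$exists": True}} when <field> holds an array."""
    value = doc.get(field)
    return isinstance(value, list) and len(value) > index


# Hypothetical documents; only the length of "prizes" matters here.
laureates = [
    {"surname": "A", "prizes": [{"category": "physics"}]},
    {"surname": "B", "prizes": [{"category": "physics"}, {"category": "chemistry"}]},
    {"surname": "C", "prizes": [{"category": "peace"}] * 3},
]

# Index 1 exists: at least two prizes.
multi = [doc["surname"] for doc in laureates if has_element_at(doc, "prizes", 1)]
# Index 2 exists: at least three prizes.
triple = [doc["surname"] for doc in laureates if has_element_at(doc, "prizes", 2)]

print(multi)
print(triple)
```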


### Born here, went there

In [19]:
db.laureates.distinct("prizes.affiliations.country", {"bornCountry": "USA"})

['USA', 'Denmark', 'Australia', 'United Kingdom']

### Triple plays (mostly) all around


In [32]:
criteria = {"laureates.2": {"$exists": True}}
triple_play_categories = set(db.prizes.distinct("category", criteria))
assert set(db.prizes.distinct("category")) - triple_play_categories == {"literature"}

## Filter Arrays using Distinct Values

In [52]:
# For array fields, an equality or $in filter matches if any member of the array
# matches; $ne and $nin match only if no member matches.

print("Physics:", db.laureates.count_documents({"prizes.category" : "physics"}))

print("Not physics:", db.laureates.count_documents({"prizes.category" : {"$ne": "physics"}}))

print("Phy, Chem and Med:", db.laureates.count_documents({"prizes.category" : {"$in": ["physics", "chemistry", "medicine"]}}))

print("Not Phy, Chem and Med:", db.laureates.count_documents({"prizes.category" : {"$nin": ["physics", "chemistry", "medicine"]}}))

Physics: 430
Not physics: 1480
Phy, Chem and Med: 1242
Not Phy, Chem and Med: 668
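In the output above, each pair of counts adds up to the same total (430 + 1480 = 1242 + 668 = 1910), because $ne and $nin are the exact complements of equality and $in. The sketch below mimics the four filters in plain Python over a handful of hypothetical documents. The helper names are invented, and every sample document has a "prizes" array, which sidesteps how the operators treat missing fields.

```python
def categories(doc):
    """All categories held by one laureate."""
    return [prize["category"] for prize in doc["prizes"]]


def matches_eq(doc, value):
    return value in categories(doc)  # any member equals the value


def matches_ne(doc, value):
    return value not in categories(doc)  # no member equals the value


def matches_in(doc, values):
    return any(c in values for c in categories(doc))


def matches_nin(doc, values):
    return not matches_in(doc, values)


# Hypothetical laureates.
laureates = [
    {"surname": "A", "prizes": [{"category": "physics"}]},
    {"surname": "B", "prizes": [{"category": "physics"}, {"category": "chemistry"}]},
    {"surname": "C", "prizes": [{"category": "peace"}]},
    {"surname": "D", "prizes": [{"category": "literature"}]},
    {"surname": "E", "prizes": [{"category": "chemistry"}]},
]
sciences = ["physics", "chemistry", "medicine"]

n_eq = sum(matches_eq(doc, "physics") for doc in laureates)
n_ne = sum(matches_ne(doc, "physics") for doc in laureates)
n_in = sum(matches_in(doc, sciences) for doc in laureates)
n_nin = sum(matches_nin(doc, sciences) for doc in laureates)

print(n_eq, n_ne, n_in, n_nin)
print(n_eq + n_ne == n_in + n_nin == len(laureates))
```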


### Sharing in physics after World War II


In [76]:
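# Ratio of laureates with a shared physics prize before 1945 to those with one
# after 1945. Note that "$lt" and "$gt" both exclude 1945 itself.
shared_before = {"prizes": {"$elemMatch": {"category": "physics",
                                           "share": {"$ne": "1"},
                                           "year": {"$lt": "1945"}}}}
shared_after = {"prizes": {"$elemMatch": {"category": "physics",
                                          "share": {"$ne": "1"},
                                          "year": {"$gt": "1945"}}}}
db.laureates.count_documents(shared_before) / db.laureates.count_documents(shared_after)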

0.12751677852348994
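The $elemMatch operator used above requires a single array member to satisfy all of the listed conditions. Separate dotted conditions, by contrast, may each be satisfied by a different member. The difference can be sketched in plain Python as follows. The helpers `elem_match` and `dotted_match` and the sample laureate are hypothetical, and only positive equality conditions are modelled.

```python
def elem_match(doc, predicate):
    """One prize must satisfy the whole predicate (like $elemMatch)."""
    return any(predicate(prize) for prize in doc["prizes"])


def dotted_match(doc, predicates):
    """Each condition may be met by a different prize (like separate dotted keys)."""
    return all(any(p(prize) for prize in doc["prizes"]) for p in predicates)


# Hypothetical laureate: a shared physics prize and an unshared chemistry prize.
laureate = {
    "surname": "X",
    "prizes": [
        {"category": "physics", "share": "4"},
        {"category": "chemistry", "share": "1"},
    ],
}

# "Unshared physics prize" asked two ways.
strict = elem_match(laureate, lambda p: p["category"] == "physics" and p["share"] == "1")
loose = dotted_match(laureate, [lambda p: p["category"] == "physics",
                                lambda p: p["share"] == "1"])

print(strict)  # no single prize is both physics and unshared
print(loose)   # one prize is physics, another is unshared
```

The loose form reports a match even though this laureate never won an unshared physics prize, which is why the per-prize questions in this section use $elemMatch.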

### Meanwhile, in other categories...


In [80]:
unshared = {"prizes": {"$elemMatch": {"category": {"$nin" : ["physics", "chemistry", "medicine"]},
                                     "share": "1", "year": {"$gte": "1945"}}}}

shared = {"prizes": {"$elemMatch": {"category": {"$nin" : ["physics", "chemistry", "medicine"]},
                                     "share": {"$ne": "1"}, "year": {"$gte": "1945"}}}}
ratio = db.laureates.count_documents(unshared) / db.laureates.count_documents(shared)
print(ratio)

1.348623853211009


### Organizations and prizes over time


In [85]:
before = {"gender": "org", "prizes.year": {"$lt": "1945"}}

after = {"gender": "org", "prizes.year": {"$gte": "1945"}}

n_before = db.laureates.count_documents(before)
n_after = db.laureates.count_documents(after)

ratio = n_after / (n_after + n_before)
print(ratio)

0.8461538461538461
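The year filters in this section compare strings such as "1945", not numbers. That works because all the years have four digits, so lexicographic order agrees with numeric order. A short sketch with made-up year values shows both the boundary behaviour of the operators and the pitfall when string widths differ.

```python
# Hypothetical year strings, as stored in the documents.
years = ["1901", "1944", "1945", "1946", "2016"]

before = [y for y in years if y < "1945"]           # like {"$lt": "1945"}
on_or_after = [y for y in years if y >= "1945"]     # like {"$gte": "1945"}
strictly_after = [y for y in years if y > "1945"]   # like {"$gt": "1945"}

print(before)
print(on_or_after)
print(strictly_after)

# Pitfall: strings of different lengths do not compare numerically.
three_digit_is_smaller_as_text = "945" < "1945"
three_digit_is_smaller_as_number = int("945") < int("1945")
print(three_digit_is_smaller_as_text, three_digit_is_smaller_as_number)
```

Pairing "$lt" with "$gte" covers every year exactly once, whereas pairing "$lt" with "$gt" leaves the boundary year out of both groups.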


### Distinct As You Like It

In [96]:
# Use regular expressions to find substrings. The $regex operator can be combined
# with the $options operator; the "i" option makes matching case-insensitive.

case_sensitive = set(db.laureates.distinct("bornCountry", {"bornCountry": {"$regex": "Poland"}}))
case_insensitive = set(db.laureates.distinct("bornCountry", {"bornCountry": {"$regex": "poland", "$options": "i"}}))
assert case_sensitive == case_insensitive

# To match the beginning of a field's value, put the caret "^" at the start of
# the pattern. To escape a character, use "\". To match the end of a field's
# value, use "$".

In [101]:
# A raw string keeps the backslash intact and avoids Python's invalid-escape warning.
db.laureates.distinct("bornCountry", {"bornCountry": {"$regex": r"Poland \("}})

['German-occupied Poland (now Poland)',
 'Poland (now Ukraine)',
 'Poland (now Lithuania)',
 'Poland (now Belarus)']
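The same patterns can be tried locally with Python's re module before sending them to the database. This is only an approximation: MongoDB evaluates $regex with its own regular-expression engine, although simple patterns like these behave the same way. The list of country names below is a small illustrative sample, and re.IGNORECASE stands in for the "i" option.

```python
import re

# Illustrative sample of bornCountry values.
countries = [
    "Poland",
    "German-occupied Poland (now Poland)",
    "Poland (now Ukraine)",
    "Russian Empire (now Poland)",
]

# An unescaped "(" opens a group, so the literal parenthesis must be escaped.
# The raw string (r"...") stops Python from interpreting the backslash itself.
literal_paren = [c for c in countries if re.search(r"Poland \(", c)]
print(literal_paren)

# re.escape builds an equivalent pattern from plain text.
escaped = re.escape("Poland (")
via_escape = [c for c in countries if re.search(escaped, c)]
print(via_escape == literal_paren)

# Case sensitivity: "poland" matches nothing unless case is ignored.
sensitive = [c for c in countries if re.search("poland", c)]
insensitive = [c for c in countries if re.search("poland", c, re.IGNORECASE)]
print(len(sensitive), len(insensitive))
```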

### Glenn, George, and others in the G.S. crew

In [106]:
from bson.regex import Regex
print(db.laureates.count_documents({"firstname": {"$regex": "^G"}, "surname":{"$regex": "^S"}}))

print(db.laureates.count_documents({"firstname": Regex("^G"), "surname": Regex("^S")}))

20
20


### Germany, then and now

In [112]:
criteria = {"bornCountry": Regex("Germany")}
print(set(db.laureates.distinct("bornCountry", criteria)))
print()
criteria = {"bornCountry": Regex("^Germany")}
print(set(db.laureates.distinct("bornCountry", criteria)))
print()
criteria = {"bornCountry": Regex(r"^Germany \(now")}
print(set(db.laureates.distinct("bornCountry", criteria)))
print()
criteria = {"bornCountry": Regex(r"now Germany\)$")}
print(set(db.laureates.distinct("bornCountry", criteria)))

{'Bavaria (now Germany)', 'Germany', 'Prussia (now Germany)', 'Schleswig (now Germany)', 'Germany (now France)', 'Germany (now Russia)', 'Mecklenburg (now Germany)', 'Germany (now Poland)', 'Hesse-Kassel (now Germany)', 'West Germany (now Germany)', 'East Friesland (now Germany)', 'Württemberg (now Germany)'}

{'Germany (now France)', 'Germany', 'Germany (now Poland)', 'Germany (now Russia)'}

{'Germany (now Russia)', 'Germany (now Poland)', 'Germany (now France)'}

{'Bavaria (now Germany)', 'Prussia (now Germany)', 'Schleswig (now Germany)', 'Mecklenburg (now Germany)', 'Hesse-Kassel (now Germany)', 'West Germany (now Germany)', 'East Friesland (now Germany)', 'Württemberg (now Germany)'}
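The four result sets above differ only in where the pattern is anchored. The effect of "^" and "$" can be reproduced with Python's re module on a few of the country names from that output. The helper `matching` is hypothetical, and re.search is used here as a stand-in for the server-side matching.

```python
import re

# A few bornCountry values taken from the output above.
born = [
    "Bavaria (now Germany)",
    "Germany",
    "Germany (now France)",
    "West Germany (now Germany)",
]


def matching(pattern):
    """Return the names in `born` that contain a match for `pattern`."""
    return {name for name in born if re.search(pattern, name)}


anywhere = matching(r"Germany")            # substring anywhere
at_start = matching(r"^Germany")           # value begins with "Germany"
start_now = matching(r"^Germany \(now")    # begins with "Germany (now"
at_end = matching(r"now Germany\)$")       # value ends with "now Germany)"

print(sorted(anywhere))
print(sorted(at_start))
print(sorted(start_now))
print(sorted(at_end))
```

Note that "West Germany (now Germany)" is excluded by "^Germany" because the value starts with "West", even though it contains "Germany" twice.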


### The prized transistor


In [117]:
criteria = {"prizes.motivation": Regex("transistor")}

first, last = "firstname", "surname"

print([(laureate[first], laureate[last]) for laureate in db.laureates.find(criteria)])

[('William B.', 'Shockley'), ('John', 'Bardeen'), ('Walter H.', 'Brattain'), ('William B.', 'Shockley'), ('John', 'Bardeen'), ('Walter H.', 'Brattain')]
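Each name appears twice in the output above. One plausible cause, stated here as an assumption, is that the loading cell ran more than once and inserted the documents again. Whatever the cause, the repeated tuples can be collapsed on the client side while keeping their original order, as in this small sketch.

```python
# The list exactly as printed above.
names = [('William B.', 'Shockley'), ('John', 'Bardeen'), ('Walter H.', 'Brattain'),
         ('William B.', 'Shockley'), ('John', 'Bardeen'), ('Walter H.', 'Brattain')]

# dict keys are unique and keep insertion order, so this removes repeats
# without reshuffling the names the way set() might.
unique_names = list(dict.fromkeys(names))

print(unique_names)
```

This only tidies the printed result. Duplicate documents in the collection itself would still inflate the counts reported by count_documents().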
