# Advanced Python
## `PyMongo` practice with non-relational databases

This notebook walks through examples of **data analytics** on student grade data, social media posts, restaurant ratings, and bank transaction data. All of the analysis is done by **querying** different collections of a **non-relational database** with the `PyMongo` library.

In [1]:
# !pip install pymongo

In [1]:
import pymongo
from pymongo import MongoClient

cluster = MongoClient('mongodb+srv://admin:admin@pythontest.l4aoup6.mongodb.net/?retryWrites=true&w=majority')

In [2]:
print(cluster.list_database_names())  # let's see what kind of databases we have

['sample_airbnb', 'sample_analytics', 'sample_geospatial', 'sample_guides', 'sample_mflix', 'sample_restaurants', 'sample_supplies', 'sample_training', 'sample_weatherdata', 'test', 'admin', 'local']


# Student grade data analysis

## Task №1

Output all unique city-state rows on the zips dataset in the `sample_training` database (note that just distinct won't help here).
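
To see why `distinct` alone falls short: it returns the unique values of a single field, so calling it on `city` and on `state` separately loses the pairing between them. The sketch below illustrates the difference in plain Python on a few made-up rows (the rows are hypothetical and not taken from the `zips` collection).

```python
# Hypothetical rows shaped like documents in the zips collection.
rows = [
    {"city": "SPRINGFIELD", "state": "IL", "zip": "62701"},
    {"city": "SPRINGFIELD", "state": "IL", "zip": "62702"},
    {"city": "SPRINGFIELD", "state": "MA", "zip": "01103"},
    {"city": "BOSTON", "state": "MA", "zip": "02108"},
]

# What distinct on one field gives: unique values of that field only.
distinct_cities = sorted({row["city"] for row in rows})
distinct_states = sorted({row["state"] for row in rows})

# What the task asks for: unique (city, state) combinations.
unique_pairs = sorted({(row["city"], row["state"]) for row in rows})

print(distinct_cities)
print(distinct_states)
print(unique_pairs)
```

Two cities and two states still produce three distinct pairs here, which is why the task calls for grouping on both fields at once.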

In [5]:
db = cluster["sample_training"]
print(db.list_collection_names())  # let's see what collections there are

collection = db["zips"]

['zips', 'trips', 'inspections', 'grades', 'companies', 'posts', 'myOutput', 'collection', 'routes']


In [6]:
# look at some sample data
collection.find_one()

{'_id': ObjectId('5c8eccc1caa187d17ca6ed19'),
 'city': 'BAILEYTON',
 'zip': '35019',
 'loc': {'y': 34.268298, 'x': 86.621299},
 'pop': 1781,
 'state': 'AL'}

In [7]:
# how many documents
collection.count_documents({})

29470

In [8]:
# source for column concatenation:
# https://stackoverflow.com/questions/12820253/mongodb-concatenate-strings-from-two-fields-into-a-third-field

# make a pipeline to find unique city-state pairs
pipeline = [
    {"$group": {
        "_id": {"city": "$city", 
                "state": "$state"}
        }},
    {"$project": {
        "_id": 0,  # hide the _id field
        "city_state": {"$concat": ["$_id.city", " - ", "$_id.state"]}}}
]

# aggregate
unique_city_states = collection.aggregate(pipeline)

# collect the results into a list
unique_city_state_list = [row['city_state'] for row in unique_city_states]
unique_city_state_list

['IRVINGTON - AL',
 'NASHVILLE - MI',
 'SAN JON - NM',
 'HOLLYWOOD PARK - TX',
 'BARDWELL - KY',
 'JACKSON JUNCTION - IA',
 'PLANO - TX',
 'NEW ERA - MI',
 'MADAWASKA - ME',
 'BLOOMVILLE - NY',
 'STEAMBOAT ROCK - IA',
 'PEKIN - ND',
 'FLORA - MS',
 'MILLIS - MA',
 'MASHPEE - MA',
 'MARION STATION - MD',
 'BYRON - NE',
 'SALEMBURG - NC',
 'HOLMESVILLE - OH',
 'GOODVIEW - VA',
 'ROCKHOLDS - KY',
 'PORT TOWNSEND - WA',
 'CLINTON - MS',
 'TOMBALL - TX',
 'LENORAH - TX',
 'SELAH - WA',
 'BANKS - ID',
 'ABERDEEN - NC',
 'GOLD RIVER - CA',
 'RADIUM - KS',
 'STURDIVANT - MO',
 'BETHEL HEIGHTS - AR',
 'ALVATON - KY',
 'WILDER - MN',
 'BAR HARBOR - ME',
 'PURDUM - NE',
 'MUNSTER - IN',
 'BLACK ROCK - AR',
 'OSAGE - OK',
 'DOYLINE - LA',
 'MERIGOLD - MS',
 'FORT RILEY - KS',
 'VOLBORG - MT',
 'READING - MA',
 'HOOKSETT - NH',
 'NEW ENTERPRISE - PA',
 'WILMOT - SD',
 'MASPETH - NY',
 'BRACEVILLE - IL',
 'HOLLYWOOD - AL',
 'CARMAN - IL',
 'CALMAR - IA',
 'PARK FLETCHER - IN',
 'MANILLA - IN',
 'ALI

In [10]:
print(f"There are a total of {len(unique_city_state_list)} unique city-state pairs")

There are a total of 25818 unique city-state pairs


## Task №2

Collect for each student the average grade for the course (aggregation by `student_id` + `class_id`) from the `grades` dataset in the `sample_training` database. Output students in descending order of average grade, output only the first 100 students.
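
As a rough mental model of what this task needs, the same steps (split the scores array, group by student and class, average, sort, limit) can be written in plain Python. This is only a sketch on two hypothetical documents, not the PyMongo query itself.

```python
from collections import defaultdict
from statistics import mean

# Two hypothetical documents shaped like those in the grades collection.
docs = [
    {"student_id": 1, "class_id": 10,
     "scores": [{"type": "exam", "score": 80.0}, {"type": "quiz", "score": 90.0}]},
    {"student_id": 1, "class_id": 20,
     "scores": [{"type": "exam", "score": 50.0}, {"type": "quiz", "score": 70.0}]},
]

# Like "$unwind": one row per element of the scores array.
unwound = [
    {"student_id": d["student_id"], "class_id": d["class_id"], "score": s["score"]}
    for d in docs
    for s in d["scores"]
]

# Like "$group" with "$avg": collect scores per (student_id, class_id) and average them.
buckets = defaultdict(list)
for row in unwound:
    buckets[(row["student_id"], row["class_id"])].append(row["score"])
averages = {key: mean(values) for key, values in buckets.items()}

# Like "$sort" in descending order followed by "$limit".
top = sorted(averages.items(), key=lambda item: item[1], reverse=True)[:100]
print(top)
```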

In [13]:
# set a new database and dataset
db = cluster["sample_training"]
collection = db["grades"]

In [14]:
# look at some sample data
collection.find_one()

{'_id': ObjectId('56d5f7eb604eb380b0d8d8cf'),
 'student_id': 0.0,
 'scores': [{'type': 'exam', 'score': 91.97520018439039},
  {'type': 'quiz', 'score': 95.80410375967175},
  {'type': 'homework', 'score': 89.62485475572984},
  {'type': 'homework', 'score': 51.621532832724846}],
 'class_id': 350.0}

In [16]:
# this is what the data looks like for one student
res = collection.find({'student_id': 0.0})
for k in res:
    print(k)

{'_id': ObjectId('56d5f7eb604eb380b0d8d8cf'), 'student_id': 0.0, 'scores': [{'type': 'exam', 'score': 91.97520018439039}, {'type': 'quiz', 'score': 95.80410375967175}, {'type': 'homework', 'score': 89.62485475572984}, {'type': 'homework', 'score': 51.621532832724846}], 'class_id': 350.0}
{'_id': ObjectId('56d5f7eb604eb380b0d8d8d3'), 'student_id': 0.0, 'scores': [{'type': 'exam', 'score': 11.182574562228819}, {'type': 'quiz', 'score': 8.819662605640733}, {'type': 'homework', 'score': 90.85883793911141}, {'type': 'homework', 'score': 16.263573466709346}], 'class_id': 7.0}
{'_id': ObjectId('56d5f7eb604eb380b0d8d8d7'), 'student_id': 0.0, 'scores': [{'type': 'exam', 'score': 57.44037561654658}, {'type': 'quiz', 'score': 57.0987819661993}, {'type': 'homework', 'score': 11.046726329813572}, {'type': 'homework', 'score': 63.127706923208194}], 'class_id': 331.0}
{'_id': ObjectId('56d5f7eb604eb380b0d8d8ce'), 'student_id': 0.0, 'scores': [{'type': 'exam', 'score': 78.40446309504266}, {'type': 'qu

In [15]:
# how many documents
collection.count_documents({})

100000

In [17]:
# interesting (and obvious) fact I discovered while experimenting:
# instead of pymongo.ASCENDING and pymongo.DESCENDING you can specify 1 and -1 respectively

pymongo.ASCENDING, pymongo.DESCENDING

(1, -1)

In [18]:
# make a pipeline to find the average course grade across students
pipeline = [
    # break down grades by course to then aggregate by student and course
    # and calculate a grade point average for the course
    {"$unwind": "$scores"},
    # group by student_id and class_id and compute the average score for the course
    {"$group": {
        "_id": {"student_id": "$student_id", "class_id": "$class_id"},
        "average_score": {"$avg": "$scores.score"}
    }},
    # sort by grade point average
    {"$sort": {"average_score": pymongo.DESCENDING}},
    # keep only the first 100 rows (otherwise the results take a long time to print)
    {"$limit": 100}
]

# aggregation
top_100_grades = collection.aggregate(pipeline)

# save the result to a list (a cursor can be iterated only once)
top_100_grades_list = list(top_100_grades)

# let's see the result
for k in top_100_grades_list:
    print(k)

{'_id': {'student_id': 2366.0, 'class_id': 451.0}, 'average_score': 97.5472912437137}
{'_id': {'student_id': 5548.0, 'class_id': 202.0}, 'average_score': 96.49566049977099}
{'_id': {'student_id': 3885.0, 'class_id': 237.0}, 'average_score': 96.02445227759756}
{'_id': {'student_id': 1591.0, 'class_id': 213.0}, 'average_score': 95.89534559668589}
{'_id': {'student_id': 5535.0, 'class_id': 50.0}, 'average_score': 95.46935163406641}
{'_id': {'student_id': 2913.0, 'class_id': 431.0}, 'average_score': 95.2840006524414}
{'_id': {'student_id': 1433.0, 'class_id': 383.0}, 'average_score': 95.2289041631754}
{'_id': {'student_id': 8216.0, 'class_id': 345.0}, 'average_score': 94.8722830346203}
{'_id': {'student_id': 8493.0, 'class_id': 122.0}, 'average_score': 94.67189087560995}
{'_id': {'student_id': 7885.0, 'class_id': 161.0}, 'average_score': 94.48312202896804}
{'_id': {'student_id': 7764.0, 'class_id': 44.0}, 'average_score': 94.3921193079823}
{'_id': {'student_id': 8790.0, 'class_id': 493.0},
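
One thing to keep in mind with `aggregate`: the cursor it returns behaves like a one-shot iterator, so once `list(...)` has consumed it, a second loop over the same cursor yields nothing. The sketch below uses a plain generator as a stand-in for the cursor (`fake_cursor` is a hypothetical name, and no database is involved).

```python
def fake_cursor():
    """Stand-in for a cursor: yields documents once, then is exhausted."""
    yield {"_id": 1}
    yield {"_id": 2}


cursor = fake_cursor()
first_pass = list(cursor)    # consumes every document
second_pass = list(cursor)   # nothing is left to iterate over

print(len(first_pass), len(second_pass))
```

Saving the documents to a list and iterating over that list avoids the problem.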

# Social media posts

## Task №3

In the `posts` dataset of the `sample_training` database, output the number of posts for each tag (tags are separate). Output all tags that occur only once and output post texts (body) by them.
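
Before turning to the aggregation pipeline, here is a plain-Python sketch of the same idea: count every tag separately, then keep the tags seen exactly once together with the bodies of their posts. The posts are hypothetical and contain only the fields used here.

```python
from collections import Counter

# Hypothetical posts; only the fields used here are included.
posts = [
    {"body": "first post", "tags": ["cat", "dog"]},
    {"body": "second post", "tags": ["cat", "bird"]},
]

# Count each tag separately, as "$unwind" followed by "$group" would.
tag_counts = Counter(tag for post in posts for tag in post["tags"])

# Keep tags seen exactly once and collect the bodies of the posts carrying them.
single_use = {
    tag: [post["body"] for post in posts if tag in post["tags"]]
    for tag, count in tag_counts.items()
    if count == 1
}

print(tag_counts)
print(single_use)
```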

In [20]:
# set a new database and dataset
db = cluster["sample_training"]
collection = db["posts"]

In [21]:
# look at some sample data
collection.find_one()

{'_id': ObjectId('50ab0f8bbcf1bfe2536dc3f9'),
 'body': 'Amendment I\n<p>Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.\n<p>\nAmendment II\n<p>\nA well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.\n<p>\nAmendment III\n<p>\nNo Soldier shall, in time of peace be quartered in any house, without the consent of the Owner, nor in time of war, but in a manner to be prescribed by law.\n<p>\nAmendment IV\n<p>\nThe right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place

In [22]:
# how many documents
collection.count_documents({})

500

In [24]:
# make a pipeline to find the number of posts for each tag
pipeline = [
    # split an array with tags
    {"$unwind": "$tags"},
    # group by tags and count the number of posts
    {"$group": {
        "_id": "$tags",
        "count": {"$sum": 1}
    }},
    # sort by post count
    {"$sort": {"count": pymongo.DESCENDING}},
    # limit the first 10 lines for clarity
    {"$limit": 10}
]

# aggregate
count_posts_by_tags = collection.aggregate(pipeline)

# let's see the result
for k in count_posts_by_tags:
    print(k)

{'_id': 'toad', 'count': 8}
{'_id': 'hair', 'count': 8}
{'_id': 'forest', 'count': 8}
{'_id': 'footnote', 'count': 7}
{'_id': 'puppy', 'count': 7}
{'_id': 'bead', 'count': 7}
{'_id': 'leo', 'count': 7}
{'_id': 'sphynx', 'count': 7}
{'_id': 'flat', 'count': 7}
{'_id': 'magician', 'count': 7}


My impression is that, because a post can carry several tags, posts with the same set of tags ended up at the top.

Now I will filter for the tags that occur only once and display the corresponding post texts:

In [45]:
# source for the $push accumulator
# https://www.mongodb.com/docs/manual/reference/operator/aggregation/push/

# make a pipeline to find the tags that occur once, together with their post texts
pipeline = [
    # split an array with tags
    {"$unwind": "$tags"},
    # group by tags and count the number of posts
    {"$group": {
        "_id": "$tags",
        "count": {"$sum": 1},
        # the $push accumulator collects values into an array;
        # after the count filter below, each array holds a single text
        "posts": {"$push": "$body"}
    }},
    # keep the tags that have been seen once
    {"$match": {"count": 1}},
    # keep the first 100 tags to speed things up
    {"$limit": 100}
]

# aggregate
tags_one_post = collection.aggregate(pipeline)

# let's see the result
for tag in tags_one_post:
    print(f"Tag: {tag['_id']}")
    print(f"Post: {tag['posts']}")  # always a one-element list; I tried printing element 0 instead,
    print()                         # but its internal `\n` characters broke up the layout of the output

Tag: deal
Post: ['We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.\n<p>Article. I.<p><p>Section. 1.<p><p>All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.<p><p>Section. 2.<p><p>The House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.<p><p>No Person shall be a Representative who shall not have attained to the Age of twenty five Years, and been seven Years a Citizen of the United States, and who shall not, when elect

# Restaurant reviews

## Task №4

For restaurants from the `restaurants` dataset of the `sample_restaurants` database, output the most frequent grade (`grade`) given in 2014. If the restaurant does not have such grades, it should not be output. Use `name`, not `id`, for aggregation.
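
The logic of the task can be sketched in plain Python first: keep only grades dated in 2014, count them per restaurant name, and take the most common one. The restaurants below are hypothetical; restaurants without any 2014 grade never enter the result.

```python
from collections import Counter
from datetime import datetime

# Hypothetical restaurants shaped like documents in the restaurants collection.
restaurants = [
    {"name": "Cafe One", "grades": [
        {"date": datetime(2014, 5, 29), "grade": "A"},
        {"date": datetime(2014, 1, 14), "grade": "A"},
        {"date": datetime(2014, 9, 2), "grade": "B"},
        {"date": datetime(2013, 8, 3), "grade": "B"},
    ]},
    {"name": "Cafe Two", "grades": [
        {"date": datetime(2013, 3, 9), "grade": "A"},
    ]},
]

start, end = datetime(2014, 1, 1), datetime(2015, 1, 1)

# Count grades per restaurant name, keeping only those dated in 2014.
counts = {}
for doc in restaurants:
    for g in doc["grades"]:
        if start <= g["date"] < end:
            counts.setdefault(doc["name"], Counter())[g["grade"]] += 1

# Restaurants with no 2014 grades never enter `counts`, so they are not output.
most_common = {name: c.most_common(1)[0][0] for name, c in counts.items()}
print(most_common)
```

Because the key is `name`, restaurants that share a name are counted together, which is what the task asks for.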

### Calculate the most frequent grades

In [3]:
# set a new database and dataset
db = cluster["sample_restaurants"]
collection = db["restaurants"]

In [46]:
# look at some sample data
collection.find_one()

{'_id': ObjectId('5eb3d668b31de5d588f4292b'),
 'address': {'building': '7114',
  'coord': [-73.9068506, 40.6199034],
  'street': 'Avenue U',
  'zipcode': '11234'},
 'borough': 'Brooklyn',
 'cuisine': 'Delicatessen',
 'grades': [{'date': datetime.datetime(2014, 5, 29, 0, 0),
   'grade': 'A',
   'score': 10},
  {'date': datetime.datetime(2014, 1, 14, 0, 0), 'grade': 'A', 'score': 10},
  {'date': datetime.datetime(2013, 8, 3, 0, 0), 'grade': 'A', 'score': 8},
  {'date': datetime.datetime(2012, 7, 18, 0, 0), 'grade': 'A', 'score': 10},
  {'date': datetime.datetime(2012, 3, 9, 0, 0), 'grade': 'A', 'score': 13},
  {'date': datetime.datetime(2011, 10, 14, 0, 0), 'grade': 'A', 'score': 9}],
 'name': "Wilken'S Fine Food",
 'restaurant_id': '40356483'}

In [43]:
# how many documents
collection.count_documents({})

25359

In [4]:
from datetime import datetime

In [5]:
# make a pipeline to find the most frequent ratings given in 2014 for each restaurant
pipeline = [
    # first keep only restaurants that have at least one rating in 2014, to speed things up
    {"$match": {"grades.date": {"$gte": datetime(2014, 1, 1), "$lt": datetime(2015, 1, 1)}}},
    # split the array of ratings
    {"$unwind": "$grades"},
    # keep only the ratings from 2014
    {"$match": {"grades.date": {"$gte": datetime(2014, 1, 1), "$lt": datetime(2015, 1, 1)}}},
    # group by restaurant name and grade, counting how often each grade occurred
    {"$group": {
        "_id": {"name": "$name", "grade": "$grades.grade"},
        "count": {"$sum": 1}
    }},
    # sort the frequency of ratings to then take the most frequent one
    {"$sort": {
        "count": pymongo.DESCENDING
    }},
    # take the first (most frequent) grade per restaurant; ties are resolved arbitrarily
    {"$group": {
        "_id": "$_id.name",
        "most_common_grade": {"$first": "$_id.grade"}
    }}
]

# aggregate
most_common_grades = collection.aggregate(pipeline)

# let's see the result
for k in most_common_grades:
    print(k)

{'_id': 'Villa Castillo Restaurant Corp', 'most_common_grade': 'A'}
{'_id': 'Fratelli Pizza', 'most_common_grade': 'A'}
{'_id': 'La Slowteria', 'most_common_grade': 'A'}
{'_id': 'Spicy Mela', 'most_common_grade': 'A'}
{'_id': 'Ting Fai Cuisine Inc', 'most_common_grade': 'A'}
{'_id': 'Brooklyn Mac', 'most_common_grade': 'A'}
{'_id': 'Piquant', 'most_common_grade': 'B'}
{'_id': 'Marcha Cocina Bar', 'most_common_grade': 'A'}
{'_id': 'B-Hive Lounge', 'most_common_grade': 'B'}
{'_id': 'Kavkazkiy Dvorik', 'most_common_grade': 'A'}
{'_id': 'Bay Street Luncheonette & Soda Fountain', 'most_common_grade': 'A'}
{'_id': 'Microchip Cafe', 'most_common_grade': 'Z'}
{'_id': 'Aux Epices', 'most_common_grade': 'A'}
{'_id': 'Van Leeuwen Artisan Ice Cream', 'most_common_grade': 'A'}
{'_id': 'Wanka', 'most_common_grade': 'A'}
{'_id': 'Eatery Restaurant', 'most_common_grade': 'A'}
{'_id': 'Bistro Citron', 'most_common_grade': 'A'}
{'_id': 'China Glatt', 'most_common_grade': 'A'}
{'_id': 'Globe Coffee Shop 

### Add the result to the `test` database and `test` collection

In [6]:
# connect to the sources
test_db = cluster['test']
test_collection = test_db['test']

In [57]:
# someone's already added something
test_collection.find_one()

{'_id': 'Dake Sushi', 'grade': 'A', 'most_common_grade': 'A'}

In [8]:
# make a pipeline to find the most frequent grade given in 2014 for each restaurant
pipeline = [
    # first keep only restaurants that have at least one rating in 2014, to speed things up
    {"$match": {"grades.date": {"$gte": datetime(2014, 1, 1), "$lt": datetime(2015, 1, 1)}}},
    # split the array of ratings
    {"$unwind": "$grades"},
    # keep only the ratings from 2014
    {"$match": {"grades.date": {"$gte": datetime(2014, 1, 1), "$lt": datetime(2015, 1, 1)}}},
    # group by restaurant name and grade, counting how often each grade occurred
    {"$group": {
        "_id": {"name": "$name", "grade": "$grades.grade"},
        "count": {"$sum": 1}
    }},
    # sort the frequency of ratings to then take the most frequent one
    {"$sort": {
        "count": pymongo.DESCENDING
    }},
    # take the first (most frequent) grade per restaurant; ties are resolved arbitrarily
    {"$group": {
        "_id": "$_id.name",
        "most_common_grade": {"$first": "$_id.grade"}
    }}
]

# aggregate
most_common_grades = collection.aggregate(pipeline)

# collect the result into a list of dictionaries to add to the test collection
most_common_grades_list = list(most_common_grades)

In [9]:
# first example
most_common_grades_list[0]

{'_id': 'Villa Castillo Restaurant Corp', 'most_common_grade': 'A'}

When I tried to add all the rows at once with the `insert_many` method, I got a `BulkWriteError`.

I couldn't figure out the cause, so I added the rows one at a time in a loop with the `insert_one` method:

In [74]:
# test_collection.insert_many(most_common_grades_list)

# add the result to the test collection
for row in most_common_grades_list:
    test_collection.insert_one(row)
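
One cause worth ruling out, offered as a guess rather than a diagnosis: the aggregation puts the restaurant name into `_id`, and the `test` collection already held documents keyed the same way, so some rows may have collided with existing `_id` values. The sketch below separates such rows before inserting. The names `split_by_existing_id` and `existing_ids` are hypothetical, the data is illustrative, and the commented-out PyMongo lines are an untested assumption about how this could be wired up.

```python
def split_by_existing_id(rows, existing_ids):
    """Separate rows whose _id is already taken from rows that are safe to insert."""
    fresh, clashing = [], []
    for row in rows:
        (clashing if row["_id"] in existing_ids else fresh).append(row)
    return fresh, clashing


# Illustrative data: one name is assumed to be present in the target collection.
rows = [
    {"_id": "Dake Sushi", "most_common_grade": "A"},
    {"_id": "Fratelli Pizza", "most_common_grade": "A"},
]
existing_ids = {"Dake Sushi"}

fresh, clashing = split_by_existing_id(rows, existing_ids)
print(len(fresh), len(clashing))

# With a live connection this could become (untested assumption):
# existing_ids = set(test_collection.distinct("_id"))
# if fresh:
#     test_collection.insert_many(fresh)
# for row in clashing:
#     test_collection.replace_one({"_id": row["_id"]}, row, upsert=True)
```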

Let's look up the first inserted row in the `test` collection:

In [10]:
cols = {
    'most_common_grade': 1,
    '_id': 1
}
filters = {'_id': "Villa Castillo Restaurant Corp", 
           "most_common_grade": 'A'}

res = test_collection.find(filters, cols)
for k in res:
    print(k)

{'_id': 'Villa Castillo Restaurant Corp', 'most_common_grade': 'A'}


*It worked!*

# Bank transaction analysis

## Task №5

From the `sample_analytics` database, output for each `customer_id` (also output the `username` of the account):

* The sum of the user's limits (across all of their accounts).

* The total number of the user's transactions (across all accounts).

* The sum of all replenishments and spending across transactions.

In [11]:
# set a new database and dataset
db = cluster["sample_analytics"]

In [13]:
# the database has these collections
db.list_collection_names()

['customers', 'accounts', 'transactions']

Let's look at the sample data in each of the collections:

In [14]:
collection_customers = db["customers"]
collection_accounts = db["accounts"]
collection_transactions = db["transactions"]

In [78]:
collection_customers.find_one()

{'_id': ObjectId('5ca4bbcea2dd94ee58162a6a'),
 'username': 'hillrachel',
 'name': 'Katherine David',
 'address': '55711 Janet Plaza Apt. 865\nChristinachester, CT 62716',
 'birthdate': datetime.datetime(1988, 6, 20, 22, 15, 34),
 'email': 'timothy78@hotmail.com',
 'accounts': [462501, 228290, 968786, 515844, 377292],
 'tier_and_details': {}}

In [79]:
collection_accounts.find_one()

{'_id': ObjectId('5ca4bbc7a2dd94ee58162398'),
 'account_id': 976027,
 'limit': 10000,
 'products': ['Brokerage', 'InvestmentStock']}

In [80]:
collection_transactions.find_one()

{'_id': ObjectId('5ca4bbc1a2dd94ee58161cb3'),
 'account_id': 557378,
 'transaction_count': 56,
 'bucket_start_date': datetime.datetime(1990, 6, 11, 0, 0),
 'bucket_end_date': datetime.datetime(2016, 11, 6, 0, 0),
 'transactions': [{'date': datetime.datetime(2006, 10, 6, 0, 0),
   'amount': 2561,
   'transaction_code': 'sell',
   'symbol': 'adbe',
   'price': '38.236619210617988073863671161234378814697265625',
   'total': '97923.98179839266745716486184'},
  {'date': datetime.datetime(2000, 6, 19, 0, 0),
   'amount': 9153,
   'transaction_code': 'sell',
   'symbol': 'adbe',
   'price': '31.12236744839008650842515635304152965545654296875',
   'total': '284863.0292551144618116154561'},
  {'date': datetime.datetime(2013, 11, 6, 0, 0),
   'amount': 18,
   'transaction_code': 'buy',
   'symbol': 'amzn',
   'price': '356.639066345529272439307533204555511474609375',
   'total': '6419.503194219526903907535598'},
  {'date': datetime.datetime(2006, 5, 5, 0, 0),
   'amount': 6818,
   'transaction_c

We see that the structures are not complex, but there may be many transactions in `transactions`.

### Sum of limits per user

In [15]:
# make a pipeline to find the sum of the limits for each user
pipeline = [
    # no "$unwind": "$accounts" is needed at the first step,
    # because $lookup matches each element of the accounts array on its own;
    # use $lookup to pull in the account data
    {"$lookup": {
        "from": "accounts",
        "localField": "accounts",
        "foreignField": "account_id",
        "as": "account_details"
    }},
    # split account information
    {"$unwind": "$account_details"},
    # group by customer using _id (it is more reliable) and sum their limits
    {"$group": {
        "_id": "$_id",
        "username": {"$first": "$username"},
        "name": {"$first": "$name"},
        "total_limit": {"$sum": "$account_details.limit"}
    }},
    # drop _id from the output
    {"$project": {
        "_id": 0,
        "username": 1,
        "name": 1,
        "total_limit": 1
    }},
    # sort for convenience
    {"$sort": {"total_limit": pymongo.DESCENDING}}
]

# aggregate
limits_by_clients = collection_customers.aggregate(pipeline)

# we can compile the result into a list
# limits_by_clients_list = list(limits_by_clients)

for k in limits_by_clients:
    print(k)

{'username': 'zcole', 'name': 'Shawn Austin', 'total_limit': 70000}
{'username': 'tammygonzalez', 'name': 'Ashley Rodriguez', 'total_limit': 70000}
{'username': 'nguyenjulie', 'name': 'Danielle Hancock', 'total_limit': 60000}
{'username': 'xgrant', 'name': 'Angela Jones', 'total_limit': 60000}
{'username': 'utorres', 'name': 'Stanley Bishop', 'total_limit': 60000}
{'username': 'leekara', 'name': 'Lorraine Mullen', 'total_limit': 60000}
{'username': 'alexsanders', 'name': 'Annette Watts', 'total_limit': 60000}
{'username': 'petergilbert', 'name': 'Angela Campbell', 'total_limit': 60000}
{'username': 'johnkrause', 'name': 'Jennifer Keller MD', 'total_limit': 60000}
{'username': 'ianjones', 'name': 'Douglas Johnson', 'total_limit': 60000}
{'username': 'kylejenkins', 'name': 'Christine Brown', 'total_limit': 60000}
{'username': 'xgarcia', 'name': 'Shelley Watson', 'total_limit': 60000}
{'username': 'jacksoncolleen', 'name': 'Susan Davis', 'total_limit': 60000}
{'username': 'nmason', 'name'

Let's check the result manually for one of the top clients, Ashley Rodriguez:

In [87]:
# look at the accounts
collection_customers.find_one({'name': 'Ashley Rodriguez'})

{'_id': ObjectId('5ca4bbcea2dd94ee58162b90'),
 'username': 'tammygonzalez',
 'name': 'Ashley Rodriguez',
 'address': '94038 Luis Garden\nWilliamsstad, MI 51943',
 'birthdate': datetime.datetime(1969, 11, 11, 11, 57, 37),
 'email': 'gnichols@gmail.com',
 'accounts': [249078, 660047, 627788, 428217, 526519, 814901],
 'tier_and_details': {'95ceb97e3ffc4b47965572259062920e': {'tier': 'Bronze',
   'benefits': ['airline lounge access', 'dedicated account representative'],
   'active': True,
   'id': '95ceb97e3ffc4b47965572259062920e'},
  '64314ecf7ed74cb2ad483f2f5b74dea0': {'tier': 'Silver',
   'benefits': ['shopping discounts', 'financial planning assistance'],
   'active': True,
   'id': '64314ecf7ed74cb2ad483f2f5b74dea0'},
  'a1a45827be424d73bf61ebde661e2ef2': {'tier': 'Silver',
   'benefits': ['airline lounge access', 'travel insurance'],
   'active': True,
   'id': 'a1a45827be424d73bf61ebde661e2ef2'}}}

In [88]:
# search for user accounts manually using the $in operator
# (named account_filter to avoid shadowing the built-in filter)
account_filter = {
    'account_id': {'$in': [249078, 660047, 627788, 428217, 526519, 814901]}
}

res = collection_accounts.find(account_filter)
for k in res:
    print(k)

{'_id': ObjectId('5ca4bbc7a2dd94ee58162718'), 'account_id': 627788, 'limit': 10000, 'products': ['CurrencyService', 'Brokerage', 'Commodity', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee58162812'), 'account_id': 627788, 'limit': 10000, 'products': ['Brokerage', 'InvestmentStock', 'CurrencyService', 'Commodity']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581627bd'), 'account_id': 814901, 'limit': 10000, 'products': ['Brokerage', 'Commodity', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581627e8'), 'account_id': 526519, 'limit': 10000, 'products': ['CurrencyService', 'Brokerage', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581627e6'), 'account_id': 428217, 'limit': 10000, 'products': ['Commodity', 'Brokerage', 'InvestmentFund', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581627e7'), 'account_id': 249078, 'limit': 10000, 'products': ['Derivatives', 'InvestmentFund', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581629e5'), 'account_id': 660047, 'limi

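The numbers add up. Note that `account_id` 627788 appears in two documents, so the lookup sees seven account documents for six account IDs, which would explain a total of 70000 at 10000 per document.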

As I understand, this is an artificial dataset, but apparently it (or the real service) had a maximum account limit of $10,000$.
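
The manual check above shows `account_id` 627788 in two separate documents. A rough plain-Python model of the lookup makes the effect visible: every account document whose `account_id` is in the customer's `accounts` array is attached, so a duplicated `account_id` contributes its limit twice. The data below is hypothetical.

```python
# Hypothetical customer and account documents (not taken from the database).
customer = {"username": "demo", "accounts": [1, 2]}
accounts = [
    {"account_id": 1, "limit": 10000},
    {"account_id": 1, "limit": 10000},  # second document with the same account_id
    {"account_id": 2, "limit": 10000},
]

# Rough equivalent of the lookup: attach every account document whose
# account_id appears in the customer's accounts array.
wanted = set(customer["accounts"])
account_details = [doc for doc in accounts if doc["account_id"] in wanted]

# Rough equivalent of unwinding the matches and summing their limits.
total_limit = sum(doc["limit"] for doc in account_details)
print(len(account_details), total_limit)
```

Two account IDs yield three matched documents here, so the summed limit is higher than the number of IDs alone would suggest.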

### Total number of user transactions

In [16]:
# make a pipeline to find the number of transactions for each user
pipeline = [
    # as in the example above, no "$unwind": "$accounts" is needed;
    # use $lookup to pull in the transaction data for each account
    # (the direction of a transaction, incoming or outgoing, does not matter here)
    {"$lookup": {
        "from": "transactions",
        "localField": "accounts",
        "foreignField": "account_id",
        "as": "transaction_details"
    }},
    # split transaction data by account
    {"$unwind": "$transaction_details"},
    # group by customer using _id (it is more reliable) and sum their transaction counts;
    # conveniently, the transactions collection already has a transaction_count field :)
    {"$group": {
        "_id": "$_id",
        "username": {"$first": "$username"},
        "name": {"$first": "$name"},
        "total_transactions": {"$sum": "$transaction_details.transaction_count"}
    }},
    # drop _id from the output
    {"$project": {
        "_id": 0,
        "username": 1,
        "name": 1,
        "total_transactions": 1
    }},
    # sort for convenience
    {"$sort": {"total_transactions": pymongo.DESCENDING}}
]

# aggregate
transactions_count = collection_customers.aggregate(pipeline)

# let's see the result
for k in transactions_count:
    print(k)

{'username': 'hrogers', 'name': 'John Williams', 'total_transactions': 443}
{'username': 'ronald62', 'name': 'Phillip Obrien', 'total_transactions': 421}
{'username': 'emilybrooks', 'name': 'Travis White', 'total_transactions': 409}
{'username': 'david77', 'name': 'Aaron Perez', 'total_transactions': 404}
{'username': 'steve73', 'name': 'Stacey Mccall', 'total_transactions': 403}
{'username': 'uortiz', 'name': 'Megan Tanner', 'total_transactions': 401}
{'username': 'kelly68', 'name': 'Angela Leblanc', 'total_transactions': 401}
{'username': 'tammygonzalez', 'name': 'Ashley Rodriguez', 'total_transactions': 395}
{'username': 'skinnercraig', 'name': 'Ashley Lindsey', 'total_transactions': 381}
{'username': 'james47', 'name': 'Brian Griffin', 'total_transactions': 381}
{'username': 'odonnellbrandon', 'name': 'Justin Thompson', 'total_transactions': 373}
{'username': 'sydney77', 'name': 'Anthony Spencer', 'total_transactions': 373}
{'username': 'william51', 'name': 'Daniel Edwards', 'total

Let's check the data for the top client again:

In [92]:
# look at the accounts
collection_customers.find_one({'name': 'John Williams'})

{'_id': ObjectId('5ca4bbcea2dd94ee58162c0d'),
 'username': 'hrogers',
 'name': 'John Williams',
 'address': 'USCGC Parker\nFPO AE 20182',
 'birthdate': datetime.datetime(1987, 1, 16, 20, 44, 49),
 'email': 'joseph11@yahoo.com',
 'accounts': [632807, 470615, 630132, 215284, 129932, 879426],
 'tier_and_details': {}}

In [93]:
# search for the user's transaction documents manually using the $in operator
# (named account_filter to avoid shadowing the built-in filter)
account_filter = {
    'account_id': {'$in': [632807, 470615, 630132, 215284, 129932, 879426]}
}

res = collection_transactions.find(account_filter)
for k in res:
    print(k)

{'_id': ObjectId('5ca4bbc1a2dd94ee58162285'), 'account_id': 129932, 'transaction_count': 76, 'bucket_start_date': datetime.datetime(1973, 8, 11, 0, 0), 'bucket_end_date': datetime.datetime(2016, 12, 31, 0, 0), 'transactions': [{'date': datetime.datetime(2014, 11, 28, 0, 0), 'amount': 8073, 'transaction_code': 'sell', 'symbol': 'fb', 'price': '77.29894756696734248180291615426540374755859375', 'total': '624034.4037081273558555949421'}, {'date': datetime.datetime(2015, 6, 16, 0, 0), 'amount': 5823, 'transaction_code': 'sell', 'symbol': 'fb', 'price': '80.8882496325885114174525369890034198760986328125', 'total': '471012.2776105629019838261229'}, {'date': datetime.datetime(2015, 11, 17, 0, 0), 'amount': 5847, 'transaction_code': 'sell', 'symbol': 'fb', 'price': '105.446354534842299699448631145060062408447265625', 'total': '616544.8349652229263426761463'}, {'date': datetime.datetime(2012, 8, 2, 0, 0), 'amount': 2354, 'transaction_code': 'sell', 'symbol': 'adbe', 'price': '30.3706909637257780

In [94]:
76 + 39 + 42 + 95 + 100 + 91

443

*Hooray!*

### Sum of all replenishments and spending by transaction

Additionally, I'll keep the number of transactions from the previous step, broken down by purchases and sales.

#### Note

As I understood, in the `transactions` collection the `amount` field is responsible for the number of some purchased/sold assets, `price` is the value of each asset, and `total` = `price` * `amount`, i.e. the total amount of the transaction. Therefore, I will use `total` for aggregation.
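
This reading can be checked on the first transaction of the sample document shown earlier. The sketch below uses `Decimal` to compare `price * amount` with `total`, and also shows that `total` is stored as a string, so it has to be converted before it can be summed.

```python
from decimal import Decimal

# First transaction from the sample document shown earlier; price and total are strings.
transaction = {
    "amount": 2561,
    "price": "38.236619210617988073863671161234378814697265625",
    "total": "97923.98179839266745716486184",
}

# Check the assumed relation total = price * amount.
product = Decimal(transaction["price"]) * transaction["amount"]
difference = abs(product - Decimal(transaction["total"]))
print(difference < Decimal("0.000001"))

# Because total is a string, it must be converted to a number before summing.
total_as_number = float(transaction["total"])
print(total_as_number)
```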

In [18]:
# source for $cond (similar to IF in Excel)
# https://www.mongodb.com/docs/manual/reference/operator/aggregation/cond/
# source for $toDouble
# https://www.mongodb.com/docs/manual/reference/operator/aggregation/toDouble/

# make a pipeline to find the totals bought and sold for each user
pipeline = [
    # as in the example above, no "$unwind": "$accounts" is needed;
    # use $lookup to pull in the transaction data for each account
    {"$lookup": {
        "from": "transactions",
        "localField": "accounts",
        "foreignField": "account_id",
        "as": "transaction_details"
    }},
    # the individual transactions are nested inside each looked-up document,
    # so first split the looked-up documents
    {"$unwind": "$transaction_details"},
    # and then split the transactions inside each of them to see their direction
    {"$unwind": "$transaction_details.transactions"},
    # group by user and count the transactions (manually this time,
    # since the rows are now individual transactions rather than buckets);
    # also sum the buy and sell totals with $cond
    {"$group": {
        "_id": "$_id",
        "username": {"$first": "$username"},
        "name": {"$first": "$name"},
        "total_transactions": {"$sum": 1},
        # purchases (buy)
        "count_buy": {
            "$sum": {
                "$cond": [{"$eq": ["$transaction_details.transactions.transaction_code", "buy"]}, 
                          1, 0]
            }
        },
        "sum_buy": {
            "$sum": {
                "$cond": [{"$eq": ["$transaction_details.transactions.transaction_code", "buy"]}, 
                          {"$toDouble": "$transaction_details.transactions.total"}, 0]  # total is stored as a string; without $toDouble the sum was 0
            }
        },
        # sales (sell)
        "count_sell": {
            "$sum": {
                "$cond": [{"$eq": ["$transaction_details.transactions.transaction_code", "sell"]}, 
                          1, 0]
            }
        },
        "sum_sell": {
            "$sum": {
                "$cond": [{"$eq": ["$transaction_details.transactions.transaction_code", "sell"]}, 
                          {"$toDouble": "$transaction_details.transactions.total"}, 0]  # total is stored as a string; without $toDouble the sum was 0
            }
        }
    }},
    # drop _id from the output
    {"$project": {
        "_id": 0,
        "username": 1,
        "name": 1,
        "total_transactions": 1,
        "count_buy": 1,
        "sum_buy": 1,
        "count_sell": 1,
        "sum_sell": 1
    }},
    # sort for convenience
    {"$sort": {"total_transactions": pymongo.DESCENDING}}
]

# aggregate
transactions_data = collection_customers.aggregate(pipeline)

# let's see the result
for k in transactions_data:
    print(k)

{'username': 'hrogers', 'name': 'John Williams', 'total_transactions': 443, 'count_buy': 214, 'sum_buy': 40669474.78686364, 'count_sell': 229, 'sum_sell': 35823960.705088235}
{'username': 'ronald62', 'name': 'Phillip Obrien', 'total_transactions': 421, 'count_buy': 214, 'sum_buy': 158078854.4386863, 'count_sell': 207, 'sum_sell': 152605031.56651777}
{'username': 'emilybrooks', 'name': 'Travis White', 'total_transactions': 409, 'count_buy': 221, 'sum_buy': 182755931.63638356, 'count_sell': 188, 'sum_sell': 173288305.67396381}
{'username': 'david77', 'name': 'Aaron Perez', 'total_transactions': 404, 'count_buy': 209, 'sum_buy': 212795293.36108452, 'count_sell': 195, 'sum_sell': 164182794.20104754}
{'username': 'steve73', 'name': 'Stacey Mccall', 'total_transactions': 403, 'count_buy': 200, 'sum_buy': 126014304.03956747, 'count_sell': 203, 'sum_sell': 118212301.92461564}
{'username': 'kelly68', 'name': 'Angela Leblanc', 'total_transactions': 401, 'count_buy': 200, 'sum_buy': 29416658.4911

I will not check this task manually, I will trust `pymongo` :)