In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("proj4.ipynb")

# Project 4: Mongo 

## Due Date: Wed 11/20, 5:00 PM

In this project, we will be investigating how different database systems handle semi-structured JSON data. In particular, we will be placing emphasis on the use of MongoDB: a database system that stores data in a construct known as documents. These documents are very similar to the JSON objects we've explored in lecture, with a few differences in representation and indexing that we will explore in the following questions. 

In this project, we will be working with the **Yelp Academic Dataset** which contains the datasets `businesses`, `reviews`, and `users`. Due to the limitations of JupyterHub and the Mongo instances we are working with, `reviews` and `users` are truncated to 7500 reviews and 1000 users. We will be using the full `businesses` dataset, however.

By the end of this project, you should understand what Mongo can (and cannot) do with regard to its documents as a NoSQL datastore, and be able to compare and contrast this with other data representation formats such as the relational model.

## Logistics & Scoring Breakdown

Please read the submission instructions carefully and double check that your submission is not throwing any errors. Please ensure that public tests pass upon submission. It is your responsibility to wait until the autograder finishes running. We will not be accepting regrade requests for submission issues.

Each coding question has **both public tests and hidden tests**. Roughly 50% of your coding grade will be made up of your score on the public tests released to you, while the remaining 50% will be made up of unreleased hidden tests. **Free-response questions (marked 'm' in the table below) are manually graded.**

This is an **individual project**. However, you’re welcome to collaborate with any other student in the class as long as it’s within the academic honesty guidelines.

Question | Points
--- | ---
1a	| 1
1b  | 1
1c	| 2
1d	| 1
1e	| 2
1f  | 1
2a	| m: 2
2b	| 1
2c  | 1
2d  | m: 2
3a	| m: 1
3b	| 1
3c	| 1
3d  | 1
3e  | 1
3f  | 3
4a	| 1
4b	| 2
4c	| 2
4d  | 1
**Total** | 28

**Grand Total:** 28 points (autograded: 23, manual: 5) 

## Loading Up Mongo
We will be using [PyMongo](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/), the Python driver for MongoDB, for this project. **Note that the method names and syntax in PyMongo may differ slightly from those of the MongoDB shell, for example `update_one` instead of `updateOne`.**

You should have access to your own MongoDB instance, running on the localhost of your Datahub server. After running the following cell, you can use the Python variables `business`, `review`, and `user` to access the corresponding collections for the rest of the project.

To prevent bracket mismatches while creating your queries, it is recommended to **turn on "Auto Close Brackets"** via Settings (top left) in JupyterHub.

Furthermore, since we are using Python dictionaries as our query filters, make sure to wrap all keys (and string values) inside quotes.

In [None]:
import pickle
import pandas as pd
import pymongo
from pymongo import TEXT
import numpy as np

myclient = pymongo.MongoClient("mongodb://localhost")
mydb = myclient["yelp"]
business = mydb["business"]
review = mydb["review"]
user = mydb["user"]

## Troubleshooting

**PLEASE READ:** Please avoid printing too many debugging query outputs—it may crash your JupyterHub if your file size becomes too large! It's recommended to use the `limit()` method and delete any debugging query cells if no longer needed as you go through the project.

You might run into issues on the project where you are certain your code works but the output is incorrect. This may be because your collections have been corrupted. Run the following cell and uncomment the specific collections you would like to drop if you would like to remake your collections from scratch. **Be sure to re-run the Load Datasets cells below if you drop your collections so you aren't working with empty collections!**

In [None]:
# UNCOMMENT AND RUN THIS CELL IF YOU WOULD LIKE TO REMAKE YOUR COLLECTIONS FROM SCRATCH. 
# IF YOU DROP ANY COLLECTIONS, RE-RUN THE NEXT TWO CELLS TO LOAD IN THE DATA.

# review.drop()
# business.drop()
# user.drop()

## Load Datasets
The following two cells will load the JSON datasets into the appropriate Mongo collections. You will only need to run them once unless you drop the collections above. The second cell **may take a couple of minutes to run** if you are running it for the first time or are running it after you dropped the collections.

In [None]:
import zipfile
import os.path

if not os.path.isfile('data/yelp_academic_dataset_review.json'):
    with zipfile.ZipFile('data/yelp_academic_dataset_review.json.zip', 'r') as zip_ref:
        zip_ref.extractall('data')

if not os.path.isfile('data/yelp_academic_dataset_user.json'):
    with zipfile.ZipFile('data/yelp_academic_dataset_user.json.zip', 'r') as zip_ref:
        zip_ref.extractall('data')

if not os.path.isfile('data/yelp_academic_dataset_business.json'):
    with zipfile.ZipFile('data/yelp_academic_dataset_business.json.zip', 'r') as zip_ref:
        zip_ref.extractall('data')

In [None]:
# THIS CELL MAY TAKE AT MOST 5 MINUTES. BUT HOPEFULLY YOU WILL ONLY NEED TO RUN IT ONCE.
import json

if business.count_documents({}) == 0:
    print("Loading business collection...")
    with open('data/yelp_academic_dataset_business.json', encoding='utf-8') as f:
        for line in f:
            business.insert_one(json.loads(line))

if review.count_documents({}) == 0:
    print("Loading review collection...")
    with open('data/yelp_academic_dataset_review.json', encoding='utf-8') as f:
        for line in f:
            review.insert_one(json.loads(line))
            
if user.count_documents({}) == 0:
    print("Loading user collection...")
    with open('data/yelp_academic_dataset_user.json', encoding='utf-8') as f:
        for line in f:
            user.insert_one(json.loads(line))

Let's take a quick look at our collections. For the command below, replace `user` with `review` or `business` to count the number of documents in each collection.

In [None]:
user.count_documents({})

Now let's inspect our collections. The code below retrieves the first document in `business`.

In [None]:
business.find_one()

If you see a document containing a business named `Oskar Blues Taproom` when you run the command above, it means that our JSON data has successfully been imported into the collection!

You should be able to similarly view the first document in `user` and `review` by running the code below.

In [None]:
user.find_one()

In [None]:
review.find_one()

Assuming there were no errors, now we can get started with exploring Mongo in a bit more detail.

## Connect to the grader

Run the following cell for grading purposes.

In [None]:
# Just run the following cell, no further action is needed.
from data101_utils import GradingUtil
grading_util = GradingUtil("proj4")
grading_util.prepare_autograder()

In [None]:
# Do not delete/edit this cell
import pickle
import pandas as pd

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 1: Basic MQL

### Question 1a

In lecture, we discussed how one could find specific attributes from a JSON object using dot (`.`) notation. 

- While you can still use dot notation in queries, PyMongo represents documents returned from Mongo queries as Python dictionaries, making it convenient to manipulate JSON using a mix of Mongo queries and Python indexing.
    - Specifically, given the result of a retrieval `find` query, you can look up the third document by indexing with `[2]`; indexing is 0-based, as with Python lists. Then, given this document, you can look up the field `'amount'` by appending `['amount']`, adding more square brackets as needed to "walk down" the JSON tree, e.g. `collection.find(...)[2]['amount']`. This returns the `'amount'` field from the third document returned by the query. This combination of query and indexing will be useful for obtaining the information you need in this and other questions.
- In order to get a visual output of the query results, you will need to wrap `collection.find(...)` inside `list()`, e.g. `list(collection.find(...))`. This is because `collection.find(...)` returns a **Cursor** object, which is an iterator. **An important consequence** is that if we set `result = collection.find(...)`, then calling `list(result)` for the first time will get you the expected list of documents in the query result, but calling `list(result)` for a second time will give you an empty list! So wrapping `collection.find(...)` directly inside `list()` would avoid this issue. With that in mind, you may not *always* need to obtain a visual output of the results.
- Be aware of the distinction between querying with Mongo and Python-based indexing into your Mongo query results (i.e. wrapping your query inside `list()` and *then* indexing into that list).
- **As a reminder, since we are using Python dictionaries as our query filter, make sure to wrap all keys (and applicable values) inside quotes. Otherwise, errors will occur.**
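
The cursor behavior described above can be reproduced without a database. The sketch below is an illustration only: it uses a plain Python iterator as a stand-in for a PyMongo cursor, and the documents and field names (`name`, `hours`) are made up for the example.

```python
# Hypothetical documents standing in for a query result; not taken from the Yelp data.
docs = [
    {"name": "A", "hours": {"Monday": "9:0-17:0"}},
    {"name": "B", "hours": {"Monday": "10:0-18:0"}},
    {"name": "C", "hours": {"Monday": "11:0-19:0"}},
]

# iter(...) plays the role of a cursor: it can be consumed only once.
result = iter(docs)
first_pass = list(result)   # all three documents
second_pass = list(result)  # empty, because the iterator is exhausted

print(len(first_pass), len(second_pass))

# Materialize once, then index with ordinary list and dict lookups.
materialized = list(iter(docs))
third_monday = materialized[2]["hours"]["Monday"]
print(third_monday)
```

Storing the materialized list in a variable, rather than the iterator itself, is what lets you index into the same result repeatedly.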

As a warmup to get you familiarized with PyMongo syntax in combination with Python-based array indexing, find the **Tuesday hours** for the restaurant named **Legal Sea Foods** at **100 Huntington Ave** in **Boston**. Be careful—there are many Legal Sea Foods in Boston!

In [None]:
result_1a = ...
result_1a

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_1a", result_1a);

In [None]:
grader.check("q1a")

<br>

---

### Question 1b
Now, let's get some practice with aggregation and filtering. Our goal is to write a query that computes the average star rating for all businesses in Colorado with 30 or more reviews. However, this won't be as easy as setting the state to CO! If we inspect this dataset more closely, we will notice that some cities are not matched up with the right states. As an example, run the query below.

In [None]:
list(business.find({"state": "CA"}).limit(3))

Notice how cities like Portland, Atlanta, and Orlando are classified as California cities! However, the latitude and longitude are generally correct. The latitude of Colorado is between 37 and 41 **inclusive** and the longitude is between -109 and -102 **inclusive**. Now, use this to **find the average star rating** of all businesses in this range with **30 or more reviews**.

Recall that in SQL, we would use a `GROUP BY` with the `AVG` aggregation function. In Mongo, we use an aggregation pipeline [(documentation here)](https://www.mongodb.com/docs/manual/reference/method/db.collection.aggregate/), comprised of multiple stages (e.g., `$match` followed by `$group`). Each stage transforms the documents in some way. Pipeline stages do not need to produce one output document for every input document. For example, some stages may generate new documents or filter out documents.
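
To make the idea of stages concrete, here is a minimal pure-Python sketch of a filtering stage followed by a grouping stage. It is not PyMongo code and does not touch the Yelp data: the helper functions `match_stage` and `group_avg_stage` and the fields `dept` and `amount` are hypothetical, chosen only to show that each stage consumes the previous stage's output and may emit a different number of documents.

```python
from collections import defaultdict

# Hypothetical documents; the fields `dept` and `amount` are invented for this sketch.
docs = [
    {"dept": "toys", "amount": 10},
    {"dept": "toys", "amount": 30},
    {"dept": "books", "amount": 5},
    {"dept": "books", "amount": 50},
]

def match_stage(documents, predicate):
    """Filtering stage: may emit fewer documents than it receives."""
    return [d for d in documents if predicate(d)]

def group_avg_stage(documents, key, field):
    """Grouping stage: emits one new document per distinct key value."""
    buckets = defaultdict(list)
    for d in documents:
        buckets[d[key]].append(d[field])
    return [{"_id": k, "avg": sum(v) / len(v)} for k, v in buckets.items()]

# Stages run in order; each one consumes the previous stage's output.
matched = match_stage(docs, lambda d: d["amount"] >= 10)
grouped = group_avg_stage(matched, "dept", "amount")
print(grouped)
```

Four input documents become three after filtering and two after grouping, which mirrors how a real pipeline reshapes its input from stage to stage.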

**Hints:**
- As in the previous question, you may find it helpful to use Python indexing to extract the pertinent information/document once you have composed the right Mongo aggregation query. You are required to wrap `collection.aggregate(...)` inside `list()`, e.g. `list(collection.aggregate(...))`, before indexing/visualizing the output. Similar to `collection.find(...)`, `collection.aggregate(...)` returns a cursor object (in PyMongo, a **CommandCursor**), which is an iterator and cannot be indexed directly.

- You can set multiple conditions for a given field within the same object, e.g. `{"$gte": 0, "$lte": 10}`. This is the recommended approach, or else you may need to worry about the ordering between the conditions.
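
One likely reason for this recommendation is that a Python dict literal cannot hold the same key twice, so writing the field name twice silently keeps only the last condition. The sketch below uses a hypothetical field `x` and involves no database.

```python
# Both bounds in one sub-document: the filter keeps both conditions.
combined = {"x": {"$gte": 0, "$lte": 10}}

# The same key written twice: Python keeps only the last value for "x".
repeated = {"x": {"$gte": 0}, "x": {"$lte": 10}}

print(combined)
print(repeated)
```

The second dictionary ends up with only the `$lte` bound, so a filter written that way would silently drop the lower bound.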

In [None]:
result_1b = ...
result_1b

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_1b", result_1b);

In [None]:
grader.check("q1b")

<br>

---
### Question 1c

In this question, we will explore aggregation and grouping further. We will also make use of the `$project` operator which allows us to output documents with certain fields of our choosing. 

For this question, we would like to create an aggregation pipeline to find the town in each state with the highest average number of stars. **We will only consider towns with greater than or equal to 5 reviews in total across all the businesses in that town so that the average is meaningful (i.e. there should be at least 5 reviews in that city).** Your final output should contain exactly two fields:

- `average_stars` which contains the average number of stars for the corresponding town.
- `city_state` which is the name of the town with the highest value of average stars in the state concatenated with a comma followed by the state initials. For example, `Berkeley, CA`.


To ensure your output is consistent with the autograder, **sort in descending order by `average_stars` and break ties by sorting on `city_state` in alphabetical (ascending) order.** This is real-world data, so it may be messy and contain typos, and those typos may show up in your output. However, there is no need to clean the data. You may also ignore upper and lower case discrepancies.

As a concrete example, imagine that Berkeley and Austin have the highest average stars in California and Texas respectively (and both have at least 5 total reviews in this *truncated* dataset). If Berkeley and Austin both have an average star rating of 5.0, your final output should be:

```
[
    {'average_stars': 5.0, 'city_state': 'Austin, TX'},
    {'average_stars': 5.0, 'city_state': 'Berkeley, CA'}
]

```

**Note:** You will provide a pipeline to `business.aggregate(...)` as your solution. Save your pipeline to `q1c_pipeline`.

**Hint:** You may find the `$concat` operator helpful [(documentation here)](https://docs.mongodb.com/manual/reference/operator/aggregation/concat/).

In [None]:
q1c_pipeline = ...

result_1c = list(business.aggregate(q1c_pipeline))
result_1c

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_1c", result_1c);

In [None]:
grader.check("q1c")

<br>

---
### Question 1d

In class, we've described structured (rectangular) data as well as semi-structured data. We haven't quite covered unstructured data—this is basically free-form text. Often, in semi-structured JSON you may have unstructured text data embedded within, such as the text field in the review collection.

MongoDB allows us to build a so-called **text index** to retrieve relevant documents based on keywords found in the text of a predefined field. This index converts our free-form text into a structure that allows us to easily look up documents by their contents. To leverage this text search capability, we build a text index on the `text` field in the `review` collection. This has been done for you.

We will then use this text index to do basic sentiment analysis and find all the restaurants we should avoid! Using the text index given, write a query to find all the reviews with "disgusting", "horrible", "horrid", "gross", "bad", or "hate". To use the text index, use the keywords `$text` and `$search` as detailed [here](https://www.mongodb.com/docs/manual/core/text-search-operators/).

Fill in your query into `result_1d` to count how many reviews contain any of these 6 words.

**Hint:** In general, you can count the number of documents returned by a `find` query result via `len(list(collection.find(...)))` or more simply `collection.count_documents(...)`. To count the number of documents returned by an `aggregate` query result, the best way is to directly use `len(list(collection.aggregate(...)))`.

In [None]:
# We create a text index here
if 'text_text' not in review.index_information():
    review.create_index([('text', TEXT)])

result_1d = ...
result_1d

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_1d", result_1d);

In [None]:
grader.check("q1d")

<br>

---
### Question 1e

Now, let's learn Mongo updates, deletions, and creation. Create a new collection called `review_boolean` which is exactly the same as `review` EXCEPT there is a new field called `to_avoid` which is the ***string*** `"true"` if the review `text` contains the words "disgusting", "horrid", "horrible", "gross", "bad", or "hate" and the ***string*** `"false"` if not.

This is a tricky task! We have not discussed creation, updates, or insertions in great detail during lecture but luckily, Mongo uses a similar approach to SQL.

***Insertions***: In order to insert documents into a collection, you may use the functions [review_boolean.insert_one(...)](https://docs.mongodb.com/manual/reference/method/db.collection.insertOne/) or [review_boolean.insert_many(...)](https://docs.mongodb.com/manual/reference/method/db.collection.insertMany/). These functions take in a document or a list of documents, respectively, and insert them into the collection.

***Updates***: In order to update a document, you may use the functions [review_boolean.update_one(...)](https://docs.mongodb.com/manual/reference/method/db.collection.updateOne/) or [review_boolean.update_many(...)](https://docs.mongodb.com/manual/reference/method/db.collection.updateMany/). These functions take in two parameters.
1. The first parameter specifies which documents should be modified. For `update_many`, if the first parameter is `{}`, this indicates that all documents should be updated. However, you can add a more specific filter here if you would like.
2. The second parameter specifies what you would like to update your field to (the [$set](https://docs.mongodb.com/manual/reference/operator/update/set/) operator may come in handy here). Recall that in our SQL model, updates are performed as `UPDATE ... SET ... WHERE ...`. In our case, the first ellipsis corresponds to `review_boolean`, the second ellipsis corresponds to the second parameter of `update_*` where `*` can be `one` or `many`, and the third ellipsis corresponds to the first parameter of `update_*`.
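
The `UPDATE ... SET ... WHERE ...` analogy above can be sketched as follows. This is only an illustration under stated assumptions: it runs the SQL side against an in-memory SQLite database (not Postgres or Mongo), and the table `items`, its columns, and the two dictionaries are hypothetical. The dictionaries show the shape of the two parameters and are not executed against any collection.

```python
import sqlite3

# Hypothetical table and values, used only to show the UPDATE ... SET ... WHERE shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, dept TEXT, flag TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("ball", "toys", "false"), ("novel", "books", "false")],
)

# SQL: UPDATE <table> SET <assignment> WHERE <condition>
conn.execute("UPDATE items SET flag = 'true' WHERE dept = 'toys'")
rows = conn.execute("SELECT name, flag FROM items ORDER BY name").fetchall()
print(rows)

# The same two pieces written as the dictionaries an update_many call would take.
# These are not executed here; `items` would be a hypothetical collection.
filter_doc = {"dept": "toys"}            # plays the role of WHERE (first parameter)
update_doc = {"$set": {"flag": "true"}}  # plays the role of SET (second parameter)
conn.close()
```

Note the order swap: SQL states the assignment before the condition, while `update_*` takes the filter first and the update second.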

***Creation***: We handle creation of the collection for you. But in PyMongo, creation of a collection is as simple as writing `variable_name = db[collection_name]` where `db` is the PyMongo database object variable you have already created.

Some additional reminders and hints:
- The empty collection `review_boolean` has already been created for you and is stored in the variable of the same name.
- A text index has been created for you. You can use a similar search approach as the last question.
- We want to start by inserting the documents from the `review` collection into the `review_boolean` collection.
- Don't forget that in order to pass the hidden tests, the `to_avoid` field must exist for every document in `review_boolean`! The [$exists](https://www.mongodb.com/docs/manual/reference/operator/query/exists/) operator may be helpful.

In [None]:
review_boolean = mydb["review_boolean"]
review_boolean.drop()

# We create a text index here
if 'text_text' not in review_boolean.index_information():
    review_boolean.create_index([('text', TEXT)])

# YOUR ANSWER BEGINS HERE

In [None]:
review_boolean = mydb["review_boolean"]
review_boolean.find_one()

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
review_boolean = mydb["review_boolean"]
grading_util.save_results("result_1e", list(review_boolean.find({}, {'_id': 0})));

In [None]:
grader.check("q1e")

<br>

---
### Question 1f

Now, you have a change of heart: you decide that it's unfair to label restaurants as `to_avoid` without at least giving them a chance! Remove the `to_avoid` field from the `review_boolean` collection. Calculate the `difference` between the data size of `review_boolean` with the `to_avoid` field and without it. The code for making this calculation is provided but it is up to you to actually remove the field.

***Deletions***: Deleting a field in Mongo makes use of the `review_boolean.update_one(...)` or `review_boolean.update_many(...)` functionality discussed in Question 1e. However, this time, instead of using the `$set` operator, which allows for the creation of new fields, we will use the [$unset](https://docs.mongodb.com/manual/reference/operator/update/unset/) operator, which deletes them! (Very tidy!)

**Before running the next cell, make sure to re-run your cell for Question 1e so you don't get a difference of 0!**

In [None]:
with_avoid = mydb.command("collstats", "review_boolean")['size']

# YOUR ANSWER BEGINS HERE
# END

without_avoid = mydb.command("collstats", "review_boolean")['size']
difference = with_avoid - without_avoid
difference

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_1f", difference);

In [None]:
grader.check("q1f")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 2: JSON and Relational Models

### Question 2a

Now we have a good idea of how to do retrieval, aggregation, and updates in Mongo. But, we haven't talked about why we
would want to use Mongo to store JSON! In order to explore this, let's take another look at the `business`
collection. We will look at the first two entries.

In [None]:
list(business.find({}).limit(2))

<!-- BEGIN QUESTION -->

What are **two** benefits of storing this data in MongoDB with JSON over a relational database management system such as Postgres?
Please reference specific examples from the `business` collection to back up your claims. 

Format your answer as follows:

1. Benefit #1, Example #1.
2. Benefit #2, Example #2.

**Limit each benefit to one sentence and each example to one sentence for a total of at most four sentences.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br>

---
### Question 2b

It seems like MongoDB is getting all the love when it comes to JSON support! However, modern iterations of relational databases
such as Postgres 9.3+ also have [excellent JSON functionality](https://www.postgresql.org/docs/9.3/functions-json.html) as we will soon explore in this task. First, let's set up a
bit of scaffolding. The following cell will import the `yelp_academic_dataset_review.json` data into a table called `reviews` in the Postgres Yelp database.

In [None]:
%reload_ext sql
%sql postgresql://jovyan@127.0.0.1:5432/postgres

!psql -h localhost -c 'DROP DATABASE IF EXISTS yelp'
!psql -h localhost -c 'CREATE DATABASE yelp'
!psql -h localhost -d yelp -c 'DROP TABLE IF EXISTS reviews'
!psql -h localhost -d yelp -c 'CREATE TABLE reviews(data TEXT);'
!cat data/yelp_academic_dataset_review.json | psql -h localhost -d yelp -c "COPY reviews (data) FROM STDIN;"
%sql \l

Now, run the following cell to connect to the Postgres Yelp database. There should be no errors after running the following cell.

In [None]:
%sql postgresql://jovyan@127.0.0.1:5432/yelp

Run the following cell to observe how this new `reviews` table looks. Note that the `data` column is stored as TEXT and not as JSON.

In [None]:
%%sql
SELECT * FROM reviews LIMIT 2;

Observe how the reviews table consists of one column named `data`. This column contains all the JSON documents in the 
reviews collection *in text format*. Use [Postgres' JSON functions](https://www.postgresql.org/docs/9.3/functions-json.html) to write a query that converts the JSON object fields into their own `TEXT` columns. (**Hint:** One of the operators in Table 9-40 may be useful). To be more concrete, your query should contain 8 columns in this particular order:

| review_id | user_id | business_id | stars | useful | funny | cool | text |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

Each row should correspond to one JSON document. Some skeleton code (that does the mundane work of converting data to JSON properly) is provided to you. You will only need to fill in the `SELECT` clause.

Hint: The `values` column is what stores the JSON data for each row. 

In [None]:
%%sql --save query_2b result_2b <<
...
FROM (SELECT CAST(regexp_replace(data, E'[\\n\\r]+', '','g') AS JSON) AS values FROM reviews) AS b
ORDER BY review_id
LIMIT 10;

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_2b = %sqlcmd snippets query_2b
grading_util.save_results("result_2b", query_2b, result_2b)
result_2b.DataFrame()

In [None]:
grader.check("q2b")

<br>

---
### Question 2c

One important aspect of data engineering that we have not touched on yet is joins. We have seen that, through the use of indices, selection/projection predicate pushdowns, and various physical implementations (as well as join orderings), joins can be done quite efficiently in relational SQL-based databases. How do joins fare in Mongo, where the stored data is inherently semi-structured? Let's investigate! For this question, we have provided you access to the tables `business_complete` and `review_complete`, which contain the collections `business` and `review` in relational form as described in Question 2b (the columns of the relations are fields in the JSON documents). Each relation has its respective `id` column (`business_id` or `review_id`) as its primary key.

In [None]:
!psql -h localhost -d yelp -c 'DROP TABLE IF EXISTS business_complete'
!psql -h localhost -d yelp -c 'CREATE TABLE business_complete(business_id TEXT PRIMARY KEY, name TEXT, address TEXT, city TEXT, state TEXT, postal_code TEXT, latitude TEXT,longitude TEXT, stars TEXT, review_count TEXT, is_open TEXT, attributes TEXT, categories TEXT, hours TEXT);'
!psql -h localhost -d yelp -c 'DROP TABLE IF EXISTS review_complete'
!psql -h localhost -d yelp -c 'CREATE TABLE review_complete(review_id TEXT PRIMARY KEY, user_id TEXT, business_id TEXT, stars TEXT, useful TEXT, funny TEXT, cool TEXT,text TEXT);'
!cat data/business.csv | psql -h localhost -d yelp -c "COPY business_complete (business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours) FROM STDIN CSV HEADER;"
!cat data/review.csv | psql -h localhost -d yelp -c "COPY review_complete (review_id, user_id, business_id, stars, useful, funny, cool, text) FROM STDIN CSV HEADER;"

Let's take a look at how `review_complete` looks.

In [None]:
%%sql
SELECT * FROM review_complete LIMIT 1;

Currently, Mongo only supports left joins (via the `$lookup` aggregation stage). This is what we will compare against SQL.
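
As a rough mental model of a left join whose matches are collected into an array field, consider the pure-Python sketch below. It is not PyMongo code: the helper `left_join_as_array`, the collections `orders` and `customers`, and all field names are hypothetical and chosen only to show what happens to a left-side document with no match.

```python
# Hypothetical collections; the field names are invented for this sketch.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "z"},  # no matching customer
]
customers = [
    {"customer_id": "a", "name": "Ada"},
]

def left_join_as_array(left, right, local_field, foreign_field, as_field):
    """Attach to each left document the list of right documents whose key matches."""
    joined = []
    for doc in left:
        matches = [r for r in right if r.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})
    return joined

result = left_join_as_array(orders, customers, "customer_id", "customer_id", "customer_info")
for row in result:
    print(row)
```

Every left-side document survives the join; the one without a match simply carries an empty list, which is what makes this a left join rather than an inner join.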

Let's start by writing a SQL query (as a Python string below) that displays all the reviews along with their associated business information. You should perform a **left join** between the `review_complete` table and the `business_complete` table on the `business_id` column, and then project all columns. Keep a mental note of the **execution time** that you see in the query plan.

In [None]:
result_2c_str = ...
!psql -h localhost -d yelp -c "explain analyze $result_2c_str"

Now, let's perform the equivalent left join in Mongo between `review` and `business`. **The output array field should be named as `business_info`**. Feel free to refer to the `$lookup` [documentation](https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/).

**Note:** You will provide a single-stage pipeline to `review.aggregate(...)` as your solution. Save your pipeline to `q2c_pipeline`.

In [None]:
# We first create an index on business_id in the business collection
business.create_index('business_id', unique=True)

q2c_pipeline = ...

result_2c = list(review.aggregate(q2c_pipeline))[:5]
# Uncomment the line below to see your output
# result_2c

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
result_2c = list(review.aggregate(q2c_pipeline))[:5]
grading_util.save_results("result_2c", result_2c);

In [None]:
grader.check("q2c")

Run the following cell to examine the query plan for the Mongo query that you just wrote. Again, make a mental note of the execution time that you see. (Look at the value corresponding to the key `executionTimeMillis`, **NOT TO BE CONFUSED WITH `executionTimeMillisEstimate`**).

In [None]:
mydb.command('explain', {'aggregate': 'review', 'pipeline': q2c_pipeline, 'cursor': {}}, verbosity='executionStats')

<!-- BEGIN QUESTION -->

<br>

---
### Question 2d

In the last question, you performed equivalent left joins in both Postgres and Mongo. Now, examine their query plans, paying special attention to the execution time reported in each (`executionTimeMillis` for Mongo). Which database system was faster? What gives the database system you chose an advantage over the other? Keep your response to at most three sentences.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 3: Dataframes / Pandas

### Question 3a

So far, we've talked about NoSQL / document databases like Mongo and relational databases like Postgres. Now, we will explore data transformation with a different data model: dataframes. Dataframes are similar to relations; we will dive into the differences here. To this end, we will use Pandas, a Python package that allows us to work with dataframes. Pandas is widely adopted by data scientists for data loading, wrangling, cleaning, and analysis. To start, let us export our MongoDB collections into Pandas using a function called `json_normalize`. To meet the memory constraints set by Jupyter, we need to truncate `business` before we can use it. The variable `business_trunc` will contain the reference to the truncated business collection.

In [None]:
business_trunc = mydb["business_trunc"]
business_trunc.drop()
count = 0
for document in business.find({}):
    business_trunc.insert_one(document)
    count += 1
    if count == 1000:
        break

business_cursor = business_trunc.find({})
# The collections were created above as "review" and "user" (singular).
review_cursor = mydb["review"].find({})
user_cursor = mydb["user"].find({})

# Load the collections into Pandas. 
from pandas import json_normalize
user_df = json_normalize(user_cursor)
review_df = json_normalize(review_cursor)
business_df = json_normalize(business_cursor)

For the rest of Question 3, please use the three dataframes we just created: `user_df`, `review_df`, and `business_df`. Let's take a look at the first five rows of `business_df`.

In [None]:
business_df.head()

<!-- BEGIN QUESTION -->

1. What do you notice about how the columns of `business_df` are constructed, e.g. how are fields and subfields represented in the dataframe?
2. How are values that are not found in every document handled in the pandas dataframe?

Hint: You will need to horizontally scroll to view all the column names and values. 

**Format your answer as follows and use one sentence per question:**

1. Sentence 1
2. Sentence 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br>

---
### Question 3b

In the previous question, we talked about how Mongo and Postgres approach joins. Pandas is also capable of performing joins using the [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function! For this task, perform an inner join on `business_df` with itself on the column `stars`. The final dataframe should be saved to a variable called `result_3b` and should only contain 3 columns in this particular order: the name of the first restaurant, the name of the second restaurant, and the number of stars. Your dataframe header should look like this:

| name_x | name_y | stars |
| :--- | :--- | :--- |

Note that the `_x` and `_y` get appended automatically by Pandas during the `merge`. You don't have to do anything special to achieve this.

**Hint:** Check out [this tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html) on selecting a subset of the dataframe. This will be helpful in the rest of Question 3 as well!

In [None]:
result_3b = ...
result_3b

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_3b", result_3b.sort_values(['name_x', 'name_y', 'stars'])[:50]);

In [None]:
grader.check("q3b")

<br>

---
### Question 3c

Due to the nested representation of the data, there are a lot of missing fields with `NaN` values in the `business_df` dataframe as you may have noticed in Question 3a. Construct a dataframe `missing_value_df` with two columns: `column_name` and `percent_missing`. `percent_missing` should be the percentage of `NaN` values in the corresponding column in `business_df`. 

For example, if 25% of values are `NaN` in the `name` column, then `percent_missing` would have the value `25.0` (***not*** 0.25).

Your dataframe header should look like this:

| column_name | percent_missing |
| :--- | :--- |

**Hint:** Use Pandas' [isnull](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html) function followed by [sum](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html).
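
As a rough illustration of the hint, the sketch below chains `isnull` and `sum` on a toy dataframe (the columns `a`, `b`, and `c` are hypothetical). It produces raw counts only; turning counts into percentages is left to you.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing entries.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Same shape as `toy`; True wherever a value is missing (NaN or None).
null_mask = toy.isnull()

# Summing booleans column-wise counts the True values.
# The result is a Series indexed by column name.
null_counts = null_mask.sum()
print(null_counts)
```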

In [None]:
missing_value_df = pd.DataFrame({'column_name': business_df.columns,
                                 ...
missing_value_df.reset_index(drop=True, inplace=True)
missing_value_df

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_3c", missing_value_df);

In [None]:
grader.check("q3c")

<br>

---
### Question 3d

Plot a histogram distribution of the percentage of NaN values across all columns (via Pandas' [hist](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) function). Don't worry about adding titles / making it look nice—we won't be grading the plot.

In [None]:
# PLOT HERE

Examine the histogram that you just plotted. How many columns are 90%+ `NaN`? Input your answer into `result_q3d` as an integer (e.g. if your answer is 6, then **explicitly** declare/hard-code `result_q3d = 6`. Do not refer to any values in the dataframe; self-created dataframes are not saved in the export file, so any references to them will not work.)

In [None]:
result_q3d = ...

In [None]:
grader.check("q3d")

<br>

---
### Question 3e

Let us now alter `business_df` to exclude the columns with more than 80% `NaN` values (keeping columns with 80% `NaN` values or less). A column that is mostly missing is likely not an important attribute for most businesses, so we can drop it from `business_df`. Create a new dataframe called `important_attribute_business_df` which contains only the columns that are kept.

**Hint:** Check out [this section](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-select-specific-rows-and-columns-from-a-dataframe) from the tutorial linked in Question 3b.
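
One way to select columns by a condition is a boolean mask passed to `.loc`. The sketch below uses a toy dataframe and a made-up per-column `score` with an arbitrary threshold; none of these names or values come from the project.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Hypothetical per-column score, indexed by column name.
score = pd.Series({"a": 10.0, "b": 95.0, "c": 40.0})

# Boolean Series whose index lines up with the column labels of `toy`.
keep = score <= 50.0

# Keep every row, but only the columns where `keep` is True.
trimmed = toy.loc[:, keep]
print(list(trimmed.columns))
```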

In [None]:
important_attribute_business_df = ...
important_attribute_business_df.head()

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_3e", important_attribute_business_df);

In [None]:
grader.check("q3e")

<br>

---
### Question 3f

At this point, you have had experience manipulating data in Mongo, Postgres, and Pandas. In this question, we will give you three scenarios. Using the lessons you've learned so far, specify which of the three (Mongo, Postgres, or Pandas) would work best for each use case.

#### Question 3fi

You are doing a data journalism piece on college sports. You collect a list of colleges and for each collegiate sport program within that college, you find the budget assigned for that program. Each college may offer a different variety of sports programs. You have a choice between the following:

A) Representing this data in JSON (e.g. 
```
{
    "UC Berkeley": {
        "football": "10000000", 
        "wrestling": "344582", 
        ...}
}
```
) and importing into Mongo.

B) Representing this data as a schema in Postgres where the columns are the names of the sports.

C) Representing this data as a dataframe in Pandas where the columns are the names of the sports.

You would like to find the aggregate of budgets across different sports (average, sum, median, mode). What would be the best option for storing this data?

**Set the value of `q3fi` to `'A'`, `'B'`, or `'C'`.**

In [None]:
q3fi = ...

In [None]:
grader.check("q3fi")

#### Question 3fii

You would now like to investigate the effect of budgets on student-athlete scholarships. After doing some research, you find a dataset that contains a list of every single athlete at every single college and their sport and scholarship levels (this is a massive 10GB+ dataset with millions of rows). You find another dataset that contains a list of colleges, their sports programs, and the program budget. This is another massive dataset with hundreds of thousands of rows. You would like to perform an inner join between the two datasets on school and program so you can view each student-athlete's scholarship with their sport's budget. You have a choice between the following:

A) Representing each dataset in JSON (e.g. 
```
{"athletes": [
    {"Chase Garbers": {
        "school": "UC Berkeley", 
        "scholarship": "full", 
        "sport": "football", 
        ...
        }
    }, 
    ...
]}
```
and 
```
{"schools": [
    {"UC Berkeley": {
        "football": {
            "budget": "10000000"
         }, 
         ...
         }
    }, 
    ...
 ]}
 ```
), importing into Mongo, and doing a join there.

B) Representing this data as 2 schemas in Postgres where the columns for the first schema are 
[`student_name`, `school`, `sport`, `scholarship`] and for the second [`school`, `sport`, `budget`].

C) Representing this data as 2 dataframes in Pandas with the same columns as Postgres.

What would be the best option for storing this data?

**Set the value of `q3fii` to `'A'`, `'B'`, or `'C'`.**

In [None]:
q3fii = ...

In [None]:
grader.check("q3fii")

#### Question 3fiii

Finally, you are ready to start writing your article! You decide to focus on just the data from UC Berkeley. You have access to a dataset of just UC Berkeley athletes along with their sports and scholarship levels. The scholarship level data was improperly cleaned: some scholarships are recorded as strings "full", "half", or "none" and some are recorded as integer percentages 0-100. You would like to provide this data to your readers in a format that lends itself to easy visualization, e.g. graphs that show how many athletes have a full vs. half vs. no scholarship, which sports have the highest percentages of athletes with full scholarships, etc. What is the best way to store this data for this purpose?

A) Represent the dataset in JSON e.g.
```
{"athletes": [
    {
       "Chase Garbers": {
         "scholarship": "full", 
         "sport": "football"
       }
    },
    {
        "Danielle Vosk": {
          "scholarship": 25,
          "sport": "basketball"
        }
    },
    ...
    ]
}
```
B) Represent this data as a schema in Postgres where the columns are [`student_name`, `sport`, `scholarship`]

C) Represent this data as a dataframe in Pandas with the same columns as Postgres.
    
**Set the value of `q3fiii` to `'A'`, `'B'`, or `'C'`.**

In [None]:
q3fiii = ...

In [None]:
grader.check("q3fiii")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 4: Messy JSON

Many of the queries you've seen or written thus far were relatively reliable: aggregating and collecting over fields
that you know exist for sure. But the nature of Mongo documents is that they are inherently flexible and semi-structured. Not every document will share every single field! In this question, we will explore how Mongo handles these use cases using the `business` collection.

### Question 4a

Imagine you are in charge of managing your family reunion. You would like to book a private room at a restaurant.
However, you would also like to optimize for chaos. You notice that there is an attribute called `RestaurantsGoodForGroups`. You would like to write a query that returns all restaurants that **do not** have the `RestaurantsGoodForGroups` attribute so that the trajectory of the reunion is determined by fate. (**Hint:** Look up the `$exists` keyword.)

How many restaurants do not have the `RestaurantsGoodForGroups` attribute? Ensure that your output for the autograder is the **number of restaurants that do not have the `RestaurantsGoodForGroups` attribute** stored in `q4a` as an integer.

**Note:** You would like this list to consist solely of restaurants. We define a restaurant as a business that has the word `Restaurants` in the `categories` field. You should perform a similar `$text` `$search` as in Question 1d. **We will work with this definition for the rest of Question 4 as well!**
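
To build intuition for `$exists`, the sketch below shows the shape of a filter document on a hypothetical field `attributes.WiFi`, along with a pure-Python stand-in that mimics `{"$exists": False}` on plain nested dictionaries. The stand-in is only an approximation for illustration (it ignores arrays, for instance) and is not how Mongo evaluates the query. The key point is that a field whose value is `None` still exists.

```python
# Toy documents, not the Yelp data; `attributes.WiFi` is a hypothetical field.
docs = [
    {"name": "A", "attributes": {"WiFi": "free"}},
    {"name": "B", "attributes": {"WiFi": None}},   # field present, value null
    {"name": "C", "attributes": {}},               # field absent
    {"name": "D"},                                 # parent field absent too
]

# Filter document in the shape one would pass to `collection.find(...)`.
missing_wifi_filter = {"attributes.WiFi": {"$exists": False}}

def lacks_field(doc, dotted_path):
    """Approximate `{path: {"$exists": False}}` for plain nested dicts."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return True
        current = current[key]
    return False

missing = [d["name"] for d in docs if lacks_field(d, "attributes.WiFi")]
print(missing)
```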

In [None]:
# The following text index may be useful!
if 'categories_text' not in business.index_information():
    business.create_index([('categories', TEXT)])

q4a_cursor = ...
q4a = len(list(q4a_cursor))

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_4a", q4a)[0]

In [None]:
grader.check("q4a")

<br>

---
### Question 4b

Your relatives inform you that they would like to be at the restaurant when it opens to beat the crowds. Furthermore, after gathering their availabilities, most of your relatives would prefer for the meal to be on a Friday and for the start time of the meal to be between 5:00 - 6:59 PM (17:00-18:59). Find the number of restaurants that open on Fridays between 17:00-18:59 and store this in a variable labeled `q4b`.

As a reminder, **in order for a business to be a restaurant, it must have `Restaurants` in its categories.**

Also, be aware that **`hours` can either be a dictionary or `None`**. Depending on how you write your aggregation pipeline, you may or may not need to explicitly deal with this. Hint: Read the [\$match](https://www.mongodb.com/docs/manual/reference/operator/aggregation/match/) documentation carefully. **What does it automatically filter out?**

**Hint**:
- Once you have the Friday hours of a particular restaurant, you only need to look at the first time listed. For example, suppose Restaurant A has Friday hours like so: `15:00-16:00`. Restaurant A opens at 15:00, which is outside the range from 17:00-18:59, inclusive. Thus, we would not want to include this in our count. This is because your relatives want to be **at the restaurant when it opens.**
- Set up an aggregation pipeline using the [\$set](https://www.mongodb.com/docs/manual/reference/operator/aggregation/set/) and [\$match](https://www.mongodb.com/docs/manual/reference/operator/aggregation/match/) stage operators. You may also want to use the [\$split](https://www.mongodb.com/docs/manual/reference/operator/aggregation/split/) operator to parse out the Friday opening hour as an integer and then use comparison operators to find the restaurants that open during the specified time. Note that using dot notation for array indexing in aggregation pipelines may not work as expected, so we recommend using the [\$arrayElemAt](https://www.mongodb.com/docs/manual/reference/operator/aggregation/arrayElemAt/) operator.
- You may assume that the open hour times are well formed, e.g. the format is HH:MM-HH:MM where HH represents the hour (from 0-23) and MM represents the minutes (from 0-59).
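
The parsing logic described in the hints can be sketched in plain Python before translating it into pipeline stages. This is only an illustration of the logic, assuming the `HH:MM-HH:MM` format stated above; the function names are hypothetical. Roughly, `str.split` plays the role of `$split`, indexing with `[0]` plays the role of `$arrayElemAt`, and `int(...)` corresponds to an integer conversion such as Mongo's `$toInt`.

```python
def opening_hour(hours_string):
    """Return the opening hour (0-23) from an 'HH:MM-HH:MM' string."""
    opening_time = hours_string.split("-")[0]   # the 'HH:MM' before the dash
    hour_text = opening_time.split(":")[0]      # the 'HH' before the colon
    return int(hour_text)

def opens_in_window(hours_string, first_hour=17, last_hour=18):
    """True if the opening hour falls within [first_hour, last_hour]."""
    return first_hour <= opening_hour(hours_string) <= last_hour

# The example from the hint: opens at 15:00, so it is outside the window.
print(opens_in_window("15:00-16:00"))
print(opens_in_window("17:30-22:00"))
```

Comparing only the hour is enough here because every minute from 17:00 through 18:59 has an hour of 17 or 18.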

In [None]:
q4b_cursor = ...
q4b = len(list(q4b_cursor))

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_4b", q4b)[0]

In [None]:
grader.check("q4b")

<br>

---
### Question 4c

Some members of your family are vegetarian so you would like to only eat at restaurants with the `Vegetarian` category. 
However, the `categories` are stored as a single string! You would like to make it easy to access `Vegetarian` as a separate field. Write a query that does the following: for every category in `categories`, add a new document that contains the `ObjectId` of the restaurant (labeled `_id`), the name of the business (labeled `name`), and the category (labeled `category`).

For example, a document 
```
{
    "_id": ObjectId('606ffb0123cf2e5079dbd91f'), 
    "name": "Wendy's", 
    ..., 
    "categories": "Salad, Vegetarian, Restaurants"
} 
```
would become 
```
{
    "_id": ObjectId('606ffb0123cf2e5079dbd91f'), 
    "name": "Wendy's",
    "category": "Salad"
}
```
and 
```
{
    "_id": ObjectId('606ffb0123cf2e5079dbd91f'), 
    "name": "Wendy's",
    "category": "Vegetarian"
}
```
and
```
{
    "_id": ObjectId('606ffb0123cf2e5079dbd91f'), 
    "name": "Wendy's",
    "category": "Restaurants"
}
```

Finally, to ensure your output is consistent with the autograder, sort in ascending order by `name` and break ties on `category`. Save your pipeline to a variable called `q4c_pipeline`.

**Hints:**
- Before doing anything else, **make sure that you filter businesses to only include restaurants, as defined in Question 4a and 4b**
- The `$unwind` operator may be helpful here. You can find the documentation [here](https://www.mongodb.com/docs/manual/reference/operator/aggregation/unwind/). Be sure to check what object type `$unwind` operates on and watch out to make sure you don't have any unnecessary space in the `category` field.
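
For comparison, Pandas has a loosely analogous operation: `explode` turns a list-valued column into one row per element. The sketch below uses toy data (the second business is made up) and is a Pandas analogy only, not the Mongo pipeline this question asks for. It also shows the stray-whitespace pitfall mentioned in the hint.

```python
import pandas as pd

# Toy data; "Cafe Z" is a hypothetical business.
toy = pd.DataFrame({
    "name": ["Wendy's", "Cafe Z"],
    "categories": ["Salad, Vegetarian, Restaurants", "Coffee, Restaurants"],
})

# Split the comma-separated string into a list, then emit one row per element.
toy["category"] = toy["categories"].str.split(",")
long_form = toy.explode("category")

# Splitting on "," alone leaves a leading space on later elements; strip it.
long_form["category"] = long_form["category"].str.strip()
long_form = long_form[["name", "category"]].reset_index(drop=True)
print(long_form)
```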

In [None]:
q4c_pipeline = ...
result_4c = list(business.aggregate(q4c_pipeline))

In [None]:
result_4c[:5]

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
result_4c = list(business.aggregate(q4c_pipeline))[:50]
grading_util.save_results("result_4c", result_4c);

In [None]:
grader.check("q4c")

<br>

---
### Question 4d
This change in representation has made it super easy to view all the vegetarian restaurants and count them without the use of an index; this is because we can now simply filter by whether or not `'Vegetarian'` is the value of the `category` field in our document! Write some code to count how many vegetarian restaurants are in our dataset—notice that there is only one more pipeline stage you'll need to add on top of the pipeline stages in Question 4c.

In [None]:
q4d_pipeline = q4c_pipeline[:] # copy pipeline from Q4c

extra_stage = ...
q4d_pipeline.append(extra_stage)

result_4d = list(business.aggregate(q4d_pipeline))

veg_count = len(result_4d)

In [None]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
grading_util.save_results("result_4d", veg_count)[0]

In [None]:
grader.check("q4d")

## Congratulations! You have finished Project 4.

Run the following cell to zip and download the results of your queries. You will also need to run the export cell at the end of the notebook.

**Please save your notebook before exporting (this is a good time to do it!)** Otherwise, we may not be able to export your written responses to `proj4.pdf`. We will not be accepting regrade requests for failure to render written responses.

**For your submission on Gradescope, you will only need to submit the single `proj4.zip` file generated by the export cell.** Please ensure that your submission `proj4.zip` file includes `proj4.pdf`, `proj4.ipynb`, and `results.zip`. 

**Please ensure that public tests pass upon submission.** It is your responsibility to wait until the autograder finishes running. We will not be accepting regrade requests for submission issues.

**Common submission issues:** You MUST submit the generated zip file to the autograder. However, Safari is known to automatically unzip files upon downloading. You can fix this by going into Safari preferences and deselecting the box with the text "Open safe files after downloading" under the "General" tab. If you experience issues with downloading via clicking on the link, you can also navigate to the project 4 directory within JupyterHub (remove `proj4.ipynb` from the URL) and manually download the generated zip files. Please post on Ed if you encounter any other submission issues.

In [None]:
grading_util.prepare_submission_and_cleanup()

Here are some cute pet photos from Data 101 students and staff for all your hard work!

If you have pet photos to submit, you can upload them to [this Ed post](https://edstem.org/us/courses/63937/discussion/5425084).

<div>
<img src="https://static.us.edusercontent.com/files/h0bgtw1IJe2uVeKe5hMJ0Ssf" alt="gray cat on windowsill" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/9ygfWbUocYqiuDZE6CIsfdXJ" alt="gray cat on desk" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/xY3ttl3FgNVl2hrY37mJBBt3" alt="sleeping dog" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/4l4Y8SXIlE9bCmCq2VtNLeLJ" alt="border collie next to tree" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/KeBFG4hReyeggsPIAstZoWyZ" alt="orange cat on desk" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/hdTEmYpesthqbLhvaXd3wJuv" alt="smiling akita dog" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/HX4fUThex0fvkUPog0AAOoqX" alt="puppy in the snow" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/5cJZhXpRAn2wT3DhGQv9cpsJ" alt="happy dog looking up at you" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/RYNels3EAOl3S4ebPJaDf7es" alt="happy dog looking up at you" width="200">
</div>

<div>
<img src="https://static.us.edusercontent.com/files/QMJdSuSKKW4V3fxBfVdAcNvK" alt="happy dog looking up at you" width="200">
</div>

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(files=['results.zip'])