# Lab 8 - MongoDB

## Disclaimer

This lab has you work with current data about COVID-19 infections in the
United States. This includes information about both infections and deaths
due to COVID-19. Some of the data is used to analyze fatalities
due to COVID-19 and to compare them across time and across states.
I am giving you this assignment because I feel that learning about engineering and technology
should be done with a greater purpose in mind. I fully understand that this
may be a sensitive issue for some of you. If it is, please reach out to me.

## Introduction to pymongo

We will be using pymongo for our mongo labs. This lab focuses on queries and aggregation in MongoDB.

I have loaded ``data/daily.json`` into a collection called ``daily`` in a database called ``csc-369``. You can access it using the following commands:

In [1]:
from pymongo import MongoClient
client = MongoClient()

In [2]:
db = client["csc-369"]

col = db["daily"]

You can take a look at one of the records using

In [3]:
import pprint
pprint.pprint(col.find_one())

{'_id': ObjectId('60392e3656264fee961ca816'), 'state': 'AK'}


## Information about the data
The collection contains information about COVID-19
infections in the United States. The data comes from the COVID Tracking
Project web site, specifically, from this URL:

https://covidtracking.com/api

We will be using the JSON version of the daily US States data, available
directly at this endpoint:

https://covidtracking.com/api/states/daily

For the sake of reproducibility, we will be using a data file that Dr. Dekhtyar downloaded
on April 5. It includes all available data from the beginning of the tracking (March 3, 2020) through April 5, 2020.

The data file is available for download from the course website.
The COVID Tracking Project website describes the format of each JSON
object in the collection as follows:
* state - State or territory postal code abbreviation.
* positive - Total cumulative positive test results.
* positiveIncrease - Increase from the day before.
* negative - Total cumulative negative test results.
* negativeIncrease - Increase from the day before.
* pending - Tests that have been submitted to a lab but no results have
been reported yet.
* totalTestResults - Calculated value (positive + negative) of total test
results.
* totalTestResultsIncrease - Increase from the day before.
* hospitalized - Total cumulative number of people hospitalized.
* hospitalizedIncrease - Increase from the day before.
* death - Total cumulative number of people that have died.
* deathIncrease - Increase from the day before.
* dateChecked - ISO 8601 date and time at which the project last visited the state's website
* total - DEPRECATED Will be removed in the future. (positive + negative + pending). Pending has been an unstable value and should not count in any totals.

In addition to these attributes, the JSON objects will contain the following
attributes (explained elsewhere in the API documentation):
* date - date for which the data is provided, in the YYYYMMDD format
(note: JSON treats this value as a number - make sure you parse it
correctly).
* fips - Federal Information Processing Standard state code
* hash - the hash code of the record
* hospitalizedCurrently - number of people currently hospitalized
* hospitalizedCumulative - appears to be the new name for the hospitalized attribute
* inIcuCurrently - number of people currently in the ICU
* inIcuCumulative - total cumulative number of people who required ICU hospitalization
* onVentilatorCurrently - number of people currently on a ventilator
* onVentilatorCumulative - total cumulative number of people who at some point were on a ventilator
* recovered - total cumulative number of people who recovered from COVID-19

Note: a "DEPRECATED" attribute is one that can be found
in some of the earlier JSON records but is not found in the most
recent ones.
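Since ``date`` arrives as an integer such as ``20200405``, it may help to convert it explicitly when you need real date arithmetic. The following is a minimal sketch using only the standard library; the function names ``parse_yyyymmdd`` and ``to_yyyymmdd`` are hypothetical and are not part of the lab's helper file.

```python
from datetime import date, datetime


def parse_yyyymmdd(value):
    """Convert an integer such as 20200405 into a datetime.date."""
    return datetime.strptime(str(value), "%Y%m%d").date()


def to_yyyymmdd(day):
    """Convert a datetime.date back into the integer form used by the collection."""
    return int(day.strftime("%Y%m%d"))


parsed = parse_yyyymmdd(20200405)
round_trip = to_yyyymmdd(date(2020, 3, 15))
print(parsed, round_trip)
```

Because YYYYMMDD integers sort in chronological order, plain numeric comparisons on the raw field (for example, less than ``20200315``) also work without any conversion.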

I've noticed that some folks are skipping the cell below, and it is my fault for not explaining it. In Python, once a module has been imported, importing it again does not reload it, even if the file's contents have changed on disk. If you run the cell below before the import, the module will be reloaded automatically whenever it changes.

In [4]:
%load_ext autoreload
%autoreload 2
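To see the import caching that ``autoreload`` works around, here is a small sketch that runs outside a notebook. It writes a throwaway module to a temporary directory, edits it, and shows that a second import returns the cached copy until ``importlib.reload`` is called. The module name ``reload_demo_helper`` is made up for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_reload():
    """Return the value seen on first import, after re-import, and after reload."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "reload_demo_helper.py"
        source.write_text("VALUE = 1\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("reload_demo_helper")
            first = module.VALUE

            # Change the file on disk, then import again: the cached module is reused.
            source.write_text("VALUE = 'edited'\n")
            again = importlib.import_module("reload_demo_helper")
            cached = again.VALUE

            # An explicit reload re-executes the source file.
            importlib.reload(module)
            reloaded = module.VALUE
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("reload_demo_helper", None)
    return first, cached, reloaded


first, cached, reloaded = demo_reload()
print(first, cached, reloaded)
```

The ``%autoreload 2`` magic performs the reload step for you each time you run a cell.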

In [5]:
# make sure you run the cell above before running this
import Lab8_helper

**Exercise 1:** Use find_one to find a record with an object ID equal to 60392e3656264fee961ca817. 

In [6]:
record = Lab8_helper.exercise_1(col,'60392e3656264fee961ca817')
record

{'_id': ObjectId('60392e3656264fee961ca817'),
 'date': 20200405,
 'state': 'AL',
 'positive': 1796,
 'negative': 11282,
 'pending': None,
 'hospitalizedCurrently': None,
 'hospitalizedCumulative': 231,
 'inIcuCurrently': None,
 'inIcuCumulative': None,
 'onVentilatorCurrently': None,
 'onVentilatorCumulative': None,
 'recovered': None,
 'hash': '3f2c1f28926eeadf623d04aeb3716d29c5394d3c',
 'dateChecked': '2020-04-05T20:00:00Z',
 'death': 45,
 'hospitalized': 231,
 'total': 13078,
 'totalTestResults': 13078,
 'posNeg': 13078,
 'fips': '01',
 'deathIncrease': 2,
 'hospitalizedIncrease': 19,
 'negativeIncrease': 2009,
 'positiveIncrease': 216,
 'totalTestResultsIncrease': 2225}

**Exercise 2:** Use count_documents to count the number of records/documents that have ``state`` equal to 'CA'. 

In [7]:
record = Lab8_helper.exercise_2(col,'CA')
record

33

**Exercise 3:** Write a function that returns all of the documents that have a date less than ``d``. Sort the documents by the date, and convert the result to a list.

In [8]:
d = 20200315 # YYYYMMDD
record = Lab8_helper.exercise_3(col,d)
record[:3]

[{'_id': ObjectId('60392e3656264fee961caeb5'),
  'date': 20200304,
  'state': 'AZ',
  'positive': 2,
  'negative': 27,
  'pending': 5,
  'hash': 'f9b5336be00388e0549a6e35cfbe7ec911597df2',
  'dateChecked': '2020-03-04T21:00:00Z',
  'total': 34,
  'totalTestResults': 29,
  'posNeg': 29,
  'fips': '04',
  'deathIncrease': None,
  'hospitalizedIncrease': None,
  'negativeIncrease': None,
  'positiveIncrease': None,
  'totalTestResultsIncrease': None},
 {'_id': ObjectId('60392e3656264fee961caeb6'),
  'date': 20200304,
  'state': 'CA',
  'positive': 53,
  'negative': 462,
  'hash': 'e89c69dcaf7f202257af579a58f86c340eee886a',
  'dateChecked': '2020-03-04T21:00:00Z',
  'total': 515,
  'totalTestResults': 515,
  'posNeg': 515,
  'fips': '06',
  'deathIncrease': None,
  'hospitalizedIncrease': None,
  'negativeIncrease': None,
  'positiveIncrease': None,
  'totalTestResultsIncrease': None},
 {'_id': ObjectId('60392e3656264fee961caeb7'),
  'date': 20200304,
  'state': 'FL',
  'positive': 2,
  'n

**Exercise 4:** Write a function that returns the total number of positive cases and the number of new cases
in New York state on April 1.

In [9]:
record = Lab8_helper.exercise_4(col)
record

83712

**Exercise 5:** Write a function that returns the number of deaths in the state of New Jersey on the earliest day on which the total cumulative number of deaths (i.e., the ``death`` column) exceeded 500.

> In pymongo, ``.sort()`` takes the key and the direction as parameters.
> So if you want to sort by, let's say, ``_id`` in ascending order, then you should call ``.sort("_id", 1)``.
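
As an illustration of the key-and-direction form of ``.sort()``, here is a hedged sketch that combines a comparison filter with an ascending sort on ``date`` and a limit of one. It deliberately uses a different state and field from the exercise; ``threshold_query`` and ``first_day_over`` are hypothetical names, and ``col`` is assumed to be the pymongo collection created earlier.

```python
def threshold_query(state, field, threshold):
    """Filter for one state's records where `field` is greater than `threshold`."""
    return {"state": state, field: {"$gt": threshold}}


def first_day_over(col, state, field, threshold):
    """Return the earliest matching document, or None if nothing matches.

    Assumes `col` is a pymongo collection; direction 1 means ascending order.
    """
    cursor = (
        col.find(threshold_query(state, field, threshold))
        .sort("date", 1)
        .limit(1)
    )
    return next(iter(cursor), None)


# Building the filter needs no database connection.
query = threshold_query("WA", "positive", 1000)
print(query)
```

With a live connection, ``first_day_over(col, "WA", "positive", 1000)`` would return a single document; passing ``-1`` as the direction instead would return the latest matching day.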

In [10]:
record = Lab8_helper.exercise_5(col)
record

537

**Exercise 6:** Write a function using ``aggregate``. For each state, the function reports the count of records and the cumulative increase in positive cases (when there were positive cases) within the date range (inclusive). Do not include missing days or values (i.e., keep only records with positive cases > 0). I used ``$match``, ``$group``, and ``$and`` within ``aggregate``. The columns I used are ``date``, ``state``, and ``positiveIncrease``.
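The general shape of a ``$match`` followed by a ``$group`` stage can be sketched as below. This is an illustration under assumptions, not the reference solution: it is shown on ``deathIncrease`` rather than the exercise's column, and ``build_pipeline`` and ``run_pipeline`` are hypothetical names.

```python
def build_pipeline(start, end, field):
    """Build a two-stage pipeline: filter by date range and value, then group by state."""
    return [
        {"$match": {"$and": [
            {"date": {"$gte": start, "$lte": end}},  # inclusive date range
            {field: {"$gt": 0}},                     # skips zero and missing (None) values
        ]}},
        {"$group": {
            "_id": {"state": "$state"},
            "count": {"$sum": 1},                    # one per matching document
            "sum": {"$sum": f"${field}"},            # total of the chosen field
        }},
    ]


def run_pipeline(col, start, end, field):
    """Run the pipeline; assumes `col` is a pymongo collection."""
    return col.aggregate(build_pipeline(start, end, field))


pipeline = build_pipeline(20200401, 20200402, "deathIncrease")
print(pipeline)
```

Keeping the pipeline in its own function makes it easy to print and inspect the stages before sending them to the server.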

In [11]:
result = list(Lab8_helper.exercise_6(col,20200401,20200402))
import pprint
pprint.pprint(result)

record = Lab8_helper.process_exercise_6(result)
record

[{'_id': {'state': 'NH'}, 'count': 1, 'sum': 101},
 {'_id': {'state': 'WA'}, 'count': 2, 'sum': 1088},
 {'_id': {'state': 'IA'}, 'count': 2, 'sum': 117},
 {'_id': {'state': 'VI'}, 'count': 1, 'sum': 3},
 {'_id': {'state': 'MS'}, 'count': 2, 'sum': 240},
 {'_id': {'state': 'LA'}, 'count': 2, 'sum': 3913},
 {'_id': {'state': 'KY'}, 'count': 2, 'sum': 200},
 {'_id': {'state': 'IL'}, 'count': 2, 'sum': 1701},
 {'_id': {'state': 'GA'}, 'count': 2, 'sum': 1419},
 {'_id': {'state': 'ME'}, 'count': 2, 'sum': 73},
 {'_id': {'state': 'AR'}, 'count': 2, 'sum': 120},
 {'_id': {'state': 'OH'}, 'count': 2, 'sum': 703},
 {'_id': {'state': 'MA'}, 'count': 2, 'sum': 2346},
 {'_id': {'state': 'SD'}, 'count': 2, 'sum': 57},
 {'_id': {'state': 'WY'}, 'count': 2, 'sum': 41},
 {'_id': {'state': 'ID'}, 'count': 2, 'sum': 254},
 {'_id': {'state': 'AL'}, 'count': 2, 'sum': 252},
 {'_id': {'state': 'NV'}, 'count': 2, 'sum': 345},
 {'_id': {'state': 'PA'}, 'count': 2, 'sum': 2173},
 {'_id': {'state': 'VA'}, 'cou

{'NH': 101.0,
 'WA': 544.0,
 'IA': 58.5,
 'VI': 3.0,
 'MS': 120.0,
 'LA': 1956.5,
 'KY': 100.0,
 'IL': 850.5,
 'GA': 709.5,
 'ME': 36.5,
 'AR': 60.0,
 'OH': 351.5,
 'MA': 1173.0,
 'SD': 28.5,
 'WY': 20.5,
 'ID': 127.0,
 'AL': 126.0,
 'NV': 172.5,
 'PA': 1086.5,
 'VA': 228.0,
 'CT': 348.0,
 'AZ': 154.5,
 'MD': 335.5,
 'CO': 357.5,
 'KS': 62.0,
 'IN': 440.0,
 'AK': 12.0,
 'DC': 79.0,
 'NC': 179.5,
 'VT': 22.5,
 'DE': 37.0,
 'GU': 6.5,
 'FL': 836.0,
 'MO': 253.5,
 'UT': 93.5,
 'TN': 303.0,
 'MP': 3.0,
 'WI': 189.5,
 'MT': 21.5,
 'MI': 1588.0,
 'NJ': 3447.0,
 'ND': 16.5,
 'WV': 27.5,
 'RI': 84.5,
 'NE': 37.0,
 'PR': 38.5,
 'SC': 235.5,
 'NM': 41.0,
 'MN': 56.5,
 'NY': 8293.0,
 'CA': 854.5,
 'HI': 27.0,
 'OK': 157.0,
 'OR': 68.0,
 'TX': 701.5}

In [12]:
record['AZ'],record['AL']

(154.5, 126.0)
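The two outputs above suggest that ``process_exercise_6`` divides each state's ``sum`` by its ``count`` (for example, WA: 1088 / 2 = 544.0). Here is a minimal sketch of that post-processing, assuming this reading is correct; ``average_by_state`` is a hypothetical name and not necessarily how the helper is written.

```python
def average_by_state(rows):
    """Map each state code to sum / count, given rows shaped like the aggregate output."""
    return {row["_id"]["state"]: row["sum"] / row["count"] for row in rows}


# Three rows copied from the aggregate output shown above.
sample = [
    {"_id": {"state": "NH"}, "count": 1, "sum": 101},
    {"_id": {"state": "WA"}, "count": 2, "sum": 1088},
    {"_id": {"state": "IA"}, "count": 2, "sum": 117},
]
averages = average_by_state(sample)
print(averages)
```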

**Exercise 7:** Repeat exercise 6, but instead of using aggregate you must use map-reduce.
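MongoDB runs the map and reduce steps as JavaScript functions on the server. The pure-Python sketch below only mimics the data flow (emit key/value pairs, group by key, reduce each group) so you can reason about it before writing the real functions. The key format follows the output shown for this exercise; the sample documents and their values are invented, and all function names are hypothetical.

```python
from collections import defaultdict


def map_record(doc):
    """Emit (key, value) pairs for one document, like a map function calling emit()."""
    yield f"{doc['state']}: count", 1
    yield f"{doc['state']}: sum", doc["positiveIncrease"]


def reduce_values(key, values):
    """Combine all values emitted for one key into a single number."""
    return float(sum(values))


def map_reduce(docs):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_record(doc):
            grouped[key].append(value)
    return [
        {"_id": key, "value": reduce_values(key, values)}
        for key, values in sorted(grouped.items())
    ]


# Invented sample documents, for illustration only.
sample_docs = [
    {"state": "AK", "date": 20200401, "positiveIncrease": 5},
    {"state": "AK", "date": 20200402, "positiveIncrease": 7},
]
reduced = map_reduce(sample_docs)
print(reduced)
```

Note that ``Collection.map_reduce`` is available in PyMongo 3.x; it was removed in PyMongo 4.0, so this exercise assumes the older driver.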

In [13]:
result = list(Lab8_helper.exercise_7(col,20200401,20200402).find())
import pprint
pprint.pprint(result)

record = Lab8_helper.process_exercise_7(result)
record

[{'_id': 'AK: count', 'value': 2.0},
 {'_id': 'AK: sum', 'value': 24.0},
 {'_id': 'AL: count', 'value': 2.0},
 {'_id': 'AL: sum', 'value': 252.0},
 {'_id': 'AR: count', 'value': 2.0},
 {'_id': 'AR: sum', 'value': 120.0},
 {'_id': 'AZ: count', 'value': 2.0},
 {'_id': 'AZ: sum', 'value': 309.0},
 {'_id': 'CA: count', 'value': 2.0},
 {'_id': 'CA: sum', 'value': 1709.0},
 {'_id': 'CO: count', 'value': 2.0},
 {'_id': 'CO: sum', 'value': 715.0},
 {'_id': 'CT: count', 'value': 2.0},
 {'_id': 'CT: sum', 'value': 696.0},
 {'_id': 'DC: count', 'value': 2.0},
 {'_id': 'DC: sum', 'value': 158.0},
 {'_id': 'DE: count', 'value': 2.0},
 {'_id': 'DE: sum', 'value': 74.0},
 {'_id': 'FL: count', 'value': 2.0},
 {'_id': 'FL: sum', 'value': 1672.0},
 {'_id': 'GA: count', 'value': 2.0},
 {'_id': 'GA: sum', 'value': 1419.0},
 {'_id': 'GU: count', 'value': 2.0},
 {'_id': 'GU: sum', 'value': 13.0},
 {'_id': 'HI: count', 'value': 2.0},
 {'_id': 'HI: sum', 'value': 54.0},
 {'_id': 'IA: count', 'value': 2.0},
 {

{'AK': 12.0,
 'AL': 126.0,
 'AR': 60.0,
 'AZ': 154.5,
 'CA': 854.5,
 'CO': 357.5,
 'CT': 348.0,
 'DC': 79.0,
 'DE': 37.0,
 'FL': 836.0,
 'GA': 709.5,
 'GU': 6.5,
 'HI': 27.0,
 'IA': 58.5,
 'ID': 127.0,
 'IL': 850.5,
 'IN': 440.0,
 'KS': 62.0,
 'KY': 100.0,
 'LA': 1956.5,
 'MA': 1173.0,
 'MD': 335.5,
 'ME': 36.5,
 'MI': 1588.0,
 'MN': 56.5,
 'MO': 253.5,
 'MP': 3.0,
 'MS': 120.0,
 'MT': 21.5,
 'NC': 179.5,
 'ND': 16.5,
 'NE': 37.0,
 'NH': 101.0,
 'NJ': 3447.0,
 'NM': 41.0,
 'NV': 172.5,
 'NY': 8293.0,
 'OH': 351.5,
 'OK': 157.0,
 'OR': 68.0,
 'PA': 1086.5,
 'PR': 38.5,
 'RI': 84.5,
 'SC': 235.5,
 'SD': 28.5,
 'TN': 303.0,
 'TX': 701.5,
 'UT': 93.5,
 'VA': 228.0,
 'VI': 3.0,
 'VT': 22.5,
 'WA': 544.0,
 'WI': 189.5,
 'WV': 27.5,
 'WY': 20.5}

In [14]:
record['AZ'],record['AL']

(154.5, 126.0)

In [15]:
# Good job!
# Don't forget to push with ./submit.sh