# Homework 2 - MapReduce

This homework explores [Safegraph data](https://www.safegraph.com/covid-19-data-consortium) to better understand how NYC responded to the COVID-19 pandemic. We will be looking at the [Places](https://docs.safegraph.com/docs/core-places) data set and the [Weekly Pattern](https://docs.safegraph.com/docs/weekly-patterns) data set to answer the following inquiry:

> How many restaurants were closed from 03/17/20 (when the lockdown started), and how many were closed from 04/01/20?

### Notes

* *NYC*: we only consider restaurants in NYC, which means those with the city listed as `'New York'`, `'Brooklyn'`, `'Queens'`, `'Bronx'`, or `'Staten Island'` (we will miss a lot of Queens restaurants whose city is listed under a name other than `'Queens'`).

* *Closed*: a restaurant is closed from a date, e.g. 03/17/20, if there were visits to the restaurant right before that date (03/16/20) but none afterwards. Note that if the restaurant is closed for an entire week, there would be no report (instead of 7 zeros `[0,0,0,0,0,0,0]`) in the *Weekly Pattern* data set.
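
The *Closed* rule can be sketched as a small check on one week's `visits_by_day` list. This is only an illustration under stated assumptions: the helper name `closed_from` is hypothetical, the weekly report is assumed to start on 03/16/20 (so index 0 is 03/16 and index 1 is 03/17), and the check looks at a single week only, not at whether later weekly reports exist.

```python
from typing import Sequence


def closed_from(visits_by_day: Sequence[int], day_index: int) -> bool:
    """Hypothetical helper: True if the day before `day_index` had visits
    and no day from `day_index` to the end of the week had any."""
    if not 1 <= day_index < len(visits_by_day):
        raise ValueError("day_index must be after the first day of the week")
    had_visits_before = visits_by_day[day_index - 1] > 0
    no_visits_after = sum(visits_by_day[day_index:]) == 0
    return had_visits_before and no_visits_after


# Assumption: the week starts on 03/16, so index 1 is 03/17.
print(closed_from([5, 0, 0, 0, 0, 0, 0], 1))  # True
print(closed_from([5, 0, 0, 2, 0, 0, 0], 1))  # False: visits later in the week
print(closed_from([0, 0, 0, 0, 0, 0, 0], 1))  # False: no visits on 03/16
```

A complete answer would also have to confirm that the restaurant has no report in any later week, which this single-week check does not attempt.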

### Requirements: 
You must use MRJob and MapReduce in a similar fashion as in Lab.

### INPUT:
To make it easier, we have already joined (and filtered) the two provided data sets into `nyc_restaurant_pattern.csv`, which has the visit patterns of all NYC restaurants. In other words, you only need to deal with a single input file, `nyc_restaurant_pattern.csv`, and do not need to fetch the original Safegraph data.
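
As a rough sketch of how one line of this file might be parsed, the snippet below uses Python's `csv` and `json` modules. The column positions (12 for `date_range_start`, 16 for `visits_by_day`, counting from 0) follow the header row printed in the download cell; the sample row, its values, and the helper name `parse_line` are made up for illustration.

```python
import csv
import io
import json

# Zero-based column positions, read off the file's header row.
DATE_RANGE_START = 12
VISITS_BY_DAY = 16


def parse_line(line):
    """Hypothetical helper: return (week start, list of daily visit counts)."""
    # csv.reader keeps quoted fields such as "[5,0,0,0,0,0,0]" in one piece.
    fields = next(csv.reader([line]))
    return fields[DATE_RANGE_START], json.loads(fields[VISITS_BY_DAY])


# Build a made-up row with 19 columns; only the fields used here are filled in.
fields = [""] * 19
fields[6] = "Brooklyn"                        # city
fields[12] = "2020-03-16T00:00:00-04:00"      # date_range_start (assumed format)
fields[16] = "[5,0,0,0,0,0,0]"                # visits_by_day
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(fields)
sample_line = buffer.getvalue().strip()

week_start, visits = parse_line(sample_line)
print(week_start, visits)

# A plain split on commas also cuts inside the array, giving too many pieces.
print(len(sample_line.split(",")), "pieces from a naive split vs 19 columns")
```

The second print shows why a plain `split(',')` is not enough: the commas inside `visits_by_day` would shift every later column.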

### OUTPUT:
Your MRJob only needs to output two rows as follows, each consisting of a label (e.g. `"The number ..."`) and a count (e.g. `"49"`):
```
"The number of restaurants in NYC closed from March 17, 2020" 49
"The number of restaurants in NYC closed from April 01, 2020" 496
```

## Download Data and Packages

In [1]:
!gdown --id 1NeXqsAeIJ8zukHt5cR2s19beDoz2Xw5d -O nyc_restaurant_pattern.csv
!curl -L "https://drive.google.com/uc?id=1TVhZgb1SWZbQB21J1hadcW-AIMnRiCL4&confirm=t" -o mapreduce.py 
!pip install mrjob

!head -n 3 nyc_restaurant_pattern.csv

Downloading...
From: https://drive.google.com/uc?id=1NeXqsAeIJ8zukHt5cR2s19beDoz2Xw5d
To: /content/nyc_restaurant_pattern.csv
Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4
"placekey","safegraph_place_id","parent_placekey","parent_safegraph_place_id","location_name","street_address","city","region","postal_code","iso_country_code","safegraph_brand_ids","brands","date_range_start","date_range_end","raw_visit_counts","raw_visitor_counts","visits_by_day","visits_by_each_hour","poi_cbg",

# Task 1
You must complete the **MRHW2** class below (which inherits from MRJob), and your code must run with **mr.runJob()** from the **mapreduce.py** package as provided. The expected output is:
```
"The number of restaurants in NYC closed from March 17, 2020" 49
"The number of restaurants in NYC closed from April 01, 2020" 496
```

#### Rationale
Please edit the cell below to add the rationale for your `map` and `reduce` logic. This will help if your output is different from the expected one. A few sentences to explain your strategy for map and reduce would be sufficient.

*your rationale*\
\
The mapper splits each row on commas while ignoring commas inside arrays and similar quoted fields. This way every row has the same column indexes, and `visits_by_day`, whose daily counts are separated by commas, is kept together as one field.\
The row is read by a CSV reader, which yields the fields as a list so that each column can be accessed by index.\
Next, `json.loads()` converts the `visits_by_day` JSON string into a Python list of numbers so that arithmetic can be performed on it later.\
The goal is to count how many restaurants closed from March 17 and how many closed from April 1, so the mapper looks at the rows for those two weeks and checks whether there were visits on the previous day (March 16 and March 31, respectively). If the visit count is greater than 0 on that day but 0 for the rest of the week, the restaurant is taken to have closed from March 17 or April 1.
For each row, which represents one restaurant in one week, if the restaurant closed from March 17 or April 1, the start date of the corresponding week is passed to the reducer as the key and the number 1 as the value.\
\
The reducer groups the emitted values by date and sums the 1s to count how many restaurants closed from the corresponding date.
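
The strategy described above can be simulated in plain Python, without MRJob, to see how the map, group, and reduce phases fit together. This is a minimal sketch on made-up records; the function names (`map_week`, `reduce_counts`, `run`) are hypothetical, and like the rationale it inspects only a single weekly report per row.

```python
from collections import defaultdict


def map_week(week_start, visits_by_day):
    """Emit (week key, 1) when the row matches the closing rule for that week."""
    if (week_start.startswith("2020-03-16")
            and visits_by_day[0] > 0 and sum(visits_by_day[1:]) == 0):
        yield "2020-03-16", 1
    elif (week_start.startswith("2020-03-30")
            and visits_by_day[1] > 0 and sum(visits_by_day[2:]) == 0):
        yield "2020-03-30", 1


def reduce_counts(key, values):
    """Sum the 1s emitted for one key."""
    yield key, sum(values)


def run(records):
    # "Shuffle" phase: group mapper output by key.
    groups = defaultdict(list)
    for week_start, visits in records:
        for key, value in map_week(week_start, visits):
            groups[key].append(value)
    # Reduce phase: one call per key.
    results = {}
    for key in sorted(groups):
        for out_key, total in reduce_counts(key, groups[key]):
            results[out_key] = total
    return results


# Made-up records: (date_range_start, visits_by_day).
toy = [
    ("2020-03-16", [4, 0, 0, 0, 0, 0, 0]),  # counted for March 17
    ("2020-03-16", [4, 1, 0, 0, 0, 0, 0]),  # visit on 03/17, not counted
    ("2020-03-30", [2, 3, 0, 0, 0, 0, 0]),  # counted for April 1
    ("2020-03-30", [2, 0, 0, 0, 0, 0, 0]),  # no visits on 03/31, not counted
    ("2020-03-30", [0, 7, 0, 0, 0, 0, 0]),  # counted for April 1
    ("2020-03-23", [1, 0, 0, 0, 0, 0, 0]),  # other week, ignored
]
print(run(toy))  # {'2020-03-16': 1, '2020-03-30': 2}
```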


In [22]:
import csv
import datetime
import json
import mapreduce as mr
from mrjob.job import MRJob
from mrjob.step import MRStep

################################
### YOUR WORK SHOULD BE HERE ###
################################
class MRHW2(MRJob):
    '''
    PLEASE COMPLETE THIS CLASS. THIS SHOULD BE THE ONLY PLACE THAT YOU CAN EDIT.
    THE INPUT OF YOUR MAPREDUCE JOB WOULD BE LINE OF TEXT WITHOUT '\n'.
    '''  
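    def mapper(self, _, row):
        # Parse the line with csv.reader so that commas inside quoted fields
        # (such as the visits_by_day array) are not treated as separators.
        # str.split() takes a literal separator, not a regex, so it is not used.
        fields = next(csv.reader([row]))

        visits_by_day = json.loads(fields[16])  # seven daily visit counts
        date = fields[12]                       # date_range_start

        # Week of 03/16: visits on 03/16 (index 0), none from 03/17 onward.
        if '2020-03-16' in date and visits_by_day[0] > 0 and sum(visits_by_day[1:7]) == 0:
            yield date, 1
        # Week of 03/30: visits on 03/31 (index 1), none from 04/01 onward.
        elif '2020-03-30' in date and visits_by_day[1] > 0 and sum(visits_by_day[2:7]) == 0:
            yield date, 1

    def reducer(self, day, count):
        # Each key is a week start date; sum the 1s emitted for that week.
        if '2020-03-16' in day:
            yield "The number of restaurants in NYC closed from March 17, 2020", sum(count)
        if '2020-03-30' in day:
            yield "The number of restaurants in NYC closed from April 01, 2020", sum(count)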

###################################
### DO NOT EDIT BELOW THIS LINE ###
###################################
job = MRHW2(args=[])
with open('nyc_restaurant_pattern.csv', 'r') as fi:
  next(fi)
  output = list(mr.runJob(enumerate(map(lambda x: x.strip(), fi)), job))

print(len(output))
output

2


[('The number of restaurants in NYC closed from March 17, 2020', 203),
 ('The number of restaurants in NYC closed from April 01, 2020', 277)]

# Task 2
You are asked to convert the MRJob class from Task 1 into a stand-alone `BDM_HW2_NetID.py` file that can be run directly with `python`, similar to Labs 3 and 4.


In [23]:
# !python BDM_HW2_NetID.py nyc_restaurant_pattern.csv
!python BDM_HW2_Atkins.py nyc_restaurant_pattern.csv

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/BDM_HW2_Atkins.root.20220316.154323.306790
Running step 1 of 1...
job output is in /tmp/BDM_HW2_Atkins.root.20220316.154323.306790/output
Streaming final output from /tmp/BDM_HW2_Atkins.root.20220316.154323.306790/output...
"The number of restaurants in NYC closed from April 01, 2020"	277
"The number of restaurants in NYC closed from March 17, 2020"	203
Removing temp directory /tmp/BDM_HW2_Atkins.root.20220316.154323.306790...
