# Homework 2 - MapReduce

This homework explores [Safegraph data](https://www.safegraph.com/covid-19-data-consortium) to better understand how NYC responded to the COVID-19 pandemic. We will look at the [Places](https://docs.safegraph.com/docs/core-places) and [Weekly Pattern](https://docs.safegraph.com/docs/weekly-patterns) data sets to answer the following question:

> How many restaurants were closed by 03/17/20 (when the lockdown started), and how many were closed by 04/01/20?

### Notes

* *NYC*: we only consider restaurants in NYC, meaning those whose city is listed as `'New York'`, `'Brooklyn'`, `'Queens'`, `'Bronx'`, or `'Staten Island'` (this misses many Queens restaurants whose city is listed under a name other than `'Queens'`).

* *Closed*: a restaurant is closed for the listed period if there were visits to the restaurant before 03/17/20 but none afterwards. Note that if a restaurant is closed for an entire week, the *Weekly Pattern* data set has no record for that week (rather than a record with 7 zeros, `[0,0,0,0,0,0,0]`).
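
The *Closed* rule can be sketched in plain Python as follows. This is a minimal illustration, not the required solution: it assumes each weekly record is a `(week_start, visits_by_day)` pair with seven daily counts, and it treats the cutoff day itself as part of "afterwards". The function name `closed_by` and the sample records are hypothetical, and this simplified reading may not reproduce the expected counts exactly.

```python
from datetime import date, timedelta

def closed_by(records, cutoff):
    """Return True if there were visits before `cutoff` and none on or after it.

    records: iterable of (week_start: date, visits_by_day: list of 7 ints).
    A week with no record contributes no visits at all.
    """
    before = after = 0
    for week_start, visits_by_day in records:
        for offset, visits in enumerate(visits_by_day):
            day = week_start + timedelta(days=offset)
            if day < cutoff:
                before += visits
            else:
                after += visits
    return before > 0 and after == 0

# Hypothetical records for one restaurant (weeks start on Monday).
records = [
    (date(2020, 3, 9), [5, 4, 6, 3, 7, 9, 8]),
    (date(2020, 3, 16), [2, 0, 0, 0, 0, 0, 0]),  # last visit on Monday 03/16
]
print(closed_by(records, date(2020, 3, 17)))  # True
```

Because missing weeks simply produce no records, the check only needs to sum whatever records exist on each side of the cutoff.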

### Requirements:
You must use MRJob and MapReduce in a similar fashion as in the labs.

### INPUT:
To make it easier, we have already joined (and filtered) the two provided data sets into `nyc_restaurant_pattern.csv`, which has the visit patterns of all NYC restaurants. In other words, you only need to deal with a single input file, `nyc_restaurant_pattern.csv`, and do not need to fetch the original Safegraph data.
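
As an illustration of reading one line of this file, the sketch below uses Python's `csv` module to pull out the fields relevant to the question. The column positions follow the header of `nyc_restaurant_pattern.csv`; the helper name `parse_row` and the sample line are made up for the example, and the code assumes `date_range_start` begins with a `YYYY-MM-DD` date and that `visits_by_day` is a JSON-style list.

```python
import csv
import json

NYC_CITIES = {'New York', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island'}

# Column positions taken from the file header:
# placekey, city, date_range_start, visits_by_day.
COL_PLACEKEY, COL_CITY, COL_START, COL_VISITS = 0, 6, 12, 16

def parse_row(line):
    """Return (placekey, week_start, visits_by_day), or None for non-NYC rows."""
    fields = next(csv.reader([line]))  # handles quoted fields containing commas
    if fields[COL_CITY] not in NYC_CITIES:
        return None
    visits_by_day = json.loads(fields[COL_VISITS])
    return fields[COL_PLACEKEY], fields[COL_START][:10], visits_by_day

# Hypothetical line in the same column order as the real file (truncated
# after visits_by_day for brevity).
sample = ('"zzz-222@abc","","","","Sample Diner","1 Main St","Brooklyn","NY",'
          '"11201","US","","","2020-03-16T00:00:00-04:00",'
          '"2020-03-23T00:00:00-04:00","2","2","[2,0,0,0,0,0,0]"')
print(parse_row(sample))
```

Matching on the exact `city` column avoids accidental hits on other fields, at the cost of depending on the column order staying fixed.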

### OUTPUT:
Your MRJob only needs to output two rows as follows, each consisting of a label (e.g. `"The number ..."`) and a count (e.g. `"49"`):
```
"The number of restaurants in NYC closed by March 17, 2020" "49"
"The number of restaurants in NYC closed by April 01, 2020" "496"
```

## Download Data and Packages

In [1]:
!gdown --id 1NeXqsAeIJ8zukHt5cR2s19beDoz2Xw5d -O nyc_restaurant_pattern.csv
!curl -L "https://drive.google.com/uc?id=1TVhZgb1SWZbQB21J1hadcW-AIMnRiCL4&confirm=t" -o mapreduce.py 
!pip install mrjob

!head -n 3 nyc_restaurant_pattern.csv

Downloading...
From: https://drive.google.com/uc?id=1NeXqsAeIJ8zukHt5cR2s19beDoz2Xw5d
To: /content/nyc_restaurant_pattern.csv
"placekey","safegraph_place_id","parent_placekey","parent_safegraph_place_id","location_name","street_address","city","region","postal_code","iso_country_code","safegraph_brand_ids","brands","date_range_start","date_range_end","raw_visit_counts","raw_visitor_counts","visits_by_day","visits_by_each_hour","poi_cbg","visitor_home_cbgs","visitor_daytime_cbgs","visitor_country_of_origin","distance_from_home","median_dwell","bucketed_dwell_times","related_same_day_brand","related_same_week_brand","device_type"
22f-225@

# Task 1
You must complete the **MRHW2** class below (which inherits from MRJob), and your code must run with **mr.runJob()** from the provided **mapreduce.py** package. The expected output is:
```
"The number of restaurants in NYC closed by March 17, 2020" "49"
"The number of restaurants in NYC closed by April 01, 2020" "496"
```

#### Rationale
Please edit the cell below to add the rationale for your `map` and `reduce` logic. This will help if your output is different from the expected one. A few sentences explaining your strategy for map and reduce are sufficient.

map: parse each line and, for NYC restaurants only, emit `placekey` as the key and `(date_range_start, visit_by_day)` as the value.

combiner: for each `placekey`, compare its weekly records against the two cutoff dates and emit a `1` under each date by which the restaurant had closed.

reducer: sum the `1`s for each date to get the number of closed restaurants.
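
The three stages above can be mimicked without MRJob to check the counting logic on a handful of records. This is a minimal sketch under simplifying assumptions: weeks are represented by an integer index and a total visit count, and the names `run_pipeline` and `decide` and the sample data are hypothetical. The explicit grouping step stands in for the shuffle, which is what guarantees that all records for a key are examined together.

```python
from collections import defaultdict

def run_pipeline(mapped, decide):
    """Group (key, value) pairs by key, then count the labels `decide` emits."""
    # Shuffle: collect every value emitted for the same placekey.
    groups = defaultdict(list)
    for placekey, value in mapped:
        groups[placekey].append(value)

    # Per-key step emits labels; the final step sums one count per label.
    counts = defaultdict(int)
    for values in groups.values():
        for label in decide(values):
            counts[label] += 1
    return dict(counts)

def decide(values):
    """Toy rule: closed at week k if visits exist before k and none from k on."""
    labels = []
    for label, k in (('week 2', 2), ('week 4', 4)):
        before = sum(v for w, v in values if w < k)
        after = sum(v for w, v in values if w >= k)
        if before > 0 and after == 0:
            labels.append(label)
    return labels

# Hypothetical mapper output: (placekey, (week_index, total_visits)).
mapped = [('a', (0, 5)), ('a', (1, 3)),
          ('b', (0, 2)), ('b', (3, 1)),
          ('c', (0, 4)), ('c', (2, 6))]
print(run_pipeline(mapped, decide))  # {'week 2': 1, 'week 4': 3}
```

Only restaurant `a` has no visits from week 2 onward, while all three have none from week 4 onward, which gives the counts shown.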

In [2]:
import csv
import re
import mapreduce as mr
from datetime import date
from mrjob.job import MRJob
from mrjob.step import MRStep

################################
### YOUR WORK SHOULD BE HERE ###
################################
class MRHW2(MRJob):
    '''
    PLEASE COMPLETE THIS CLASS. THIS SHOULD BE THE ONLY PLACE THAT YOU CAN EDIT.
    THE INPUT OF YOUR MAPREDUCE JOB WOULD BE LINE OF TEXT WITHOUT '\n'.
    '''
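    def mapper(self, _, line):
      # Everything before the first '[' holds the leading scalar columns;
      # the first bracketed list after it is visits_by_day.
      info = line.split('[')
      placekey = info[0].split(',')[0]
      NYC = ['New York', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island']
      # Substring match on the leading columns, so a city name appearing in
      # another field (e.g. the street address) also passes this filter.
      if any(city in info[0] for city in NYC):
        # The first YYYY-MM-DD in the leading columns is date_range_start.
        date_range_start = re.search(r'\d{4}-\d{2}-\d{2}', info[0]).group()
        visit_by_day = info[1].split(']')[0].split(',')
        yield (placekey, (date_range_start, visit_by_day))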

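    def combiner(self, _, start_time_and_visit_by_day):
      # NOTE: this logic assumes a single call receives every weekly record
      # for a placekey; on a real cluster a combiner may see only part of them.
      # Flags: [0] visits before 03/17,  [1] visits on/after 03/17,
      #        [2] visits before 04/01,  [3] visits on/after 04/01,
      #        [4] record exists for the week of 03/16,
      #        [5] record exists for the week of 03/30.
      visit_flag = [False] * 6

      for date_range_start, visit_by_day in start_time_and_visit_by_day:
        visit_by_day = list(map(int, visit_by_day))
        year, month, day = map(int, date_range_start.split('-'))
        # Days between this week's start and Monday 2020-03-02.
        diff = date(year, month, day) - date(2020, 3, 2)

        # diff == 14 is the week of 03/16 (index 1 is 03/17);
        # diff == 28 is the week of 03/30 (index 2 is 04/01).
        # A record for an earlier week counts as evidence of visits.
        if diff.days < 14 or (diff.days == 14 and visit_by_day[0] > 0): visit_flag[0] = True
        if diff.days < 28 or (diff.days == 28 and sum(visit_by_day[:2]) > 0): visit_flag[2] = True
        if diff.days > 14 or (diff.days == 14 and sum(visit_by_day[1:]) > 0): visit_flag[1] = True
        if diff.days == 28 and sum(visit_by_day[2:]) > 0: visit_flag[3] = True
        if diff.days == 14: visit_flag[4] = True
        if diff.days == 28: visit_flag[5] = True

      # Closed by a date: visits before it, none on/after it, and a record
      # for the week containing it.
      by17 = visit_flag[0] and not visit_flag[1] and visit_flag[4]
      by1 = visit_flag[2] and not visit_flag[3] and visit_flag[5]

      if by17: yield 'March 17, 2020', 1
      if by1: yield 'April 01, 2020', 1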
      

    def reducer(self, times, types):
      yield "The number of restaurants in NYC closed by " + times, sum(types)

###################################
### DO NOT EDIT BELOW THIS LINE ###
###################################
job = MRHW2(args=[])
with open('nyc_restaurant_pattern.csv', 'r') as fi:
  next(fi)
  output = list(mr.runJob(enumerate(map(lambda x: x.strip(), fi)), job))

print(len(output))
output

2


[('The number of restaurants in NYC closed by April 01, 2020', 496),
 ('The number of restaurants in NYC closed by March 17, 2020', 49)]

# Task 2
You are asked to convert the MRJob class from Task 1 into a stand-alone `BDM_HW2_NetID.py` file that can be run directly with `python`, as in Labs 3 and 4.


In [3]:
!python BDM_HW2_xl4230.py nyc_restaurant_pattern.csv 2>/dev/null

('The number of restaurants in NYC closed by March 17, 2020', 49)
('The number of restaurants in NYC closed by April 01, 2020', 496)
