## Data Check and Reclassify
### Primary Author
Maia Guo

### Description:
This notebook inspects the data extracted from HDFS, reclassifies the POI categories, and splits the joined weekly patterns and core places table into smaller tables for analysis.

### Inputs:
weekly_and_core_with_area.csv

### Output:

poi_info.csv
weekly_patterns_with_general_info.csv
weekly_brand_info.csv

### PySpark Running Instructions 

**Way 1:** run the Python file on the Hadoop cluster (extracts data from HDFS).

`cluster $ spark-submit --num-executors 10 --executor-cores 5 safegraph_process.py`

**Way 2:** run in the pyspark shell on the cluster (more convenient for debugging).

`cluster $ pyspark`  prompts you to choose a Python version

`cluster $ 2`  enter 2 to choose the Python 3 version

`>>> ` write or paste Python code to run

**Way 3:** run a PySpark Jupyter notebook locally in Docker.

`terminal $ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook`

More information: https://medium.com/@suci/running-pyspark-on-jupyter-notebook-with-docker-602b18ac4494

In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import csv
import warnings
warnings.filterwarnings("ignore")

## weekly_and_core_with_area.csv

Weekly patterns joined with core POI information (NYC, food-related POIs only).

In [2]:
weekly_and_core = pd.read_csv("../weekly_and_core_with_area.csv", dtype={'brands': 'object',
                                                            'distance_from_home': 'float64',
                                                            'opened_on': 'object',
                                                            'parent_placekey': 'object',
                                                            'safegraph_brand_ids': 'object',
                                                            'tracking_opened_since': 'object'}) 

In [20]:
weekly_and_core.head(1)

Unnamed: 0,placekey,parent_placekey,safegraph_brand_ids,date_range_start,date_range_end,raw_visit_counts,raw_visitor_counts,visits_by_day,visits_by_each_hour,poi_cbg,...,open_hours,category_tags,opened_on,closed_on,tracking_opened_since,tracking_closed_since,category,date,safegraph_place_id,area_square_feet
0,222-222@627-s94-nwk,,,2020-12-21 05:00:00+00:00,2020-12-28T00:00:00-05:00,39,24,"[7,9,6,5,3,5,4]","[0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,4,0,0,0,0,0,0,0...",360470395002,...,"{ ""Mon"": [[""8:00"", ""19:00""]], ""Tue"": [[""8:00"",...",,,,,2019-07,Supermarkets and Specialty Food Stores,2020-12-21,sg:bbe025bf97774f46b165507367517013,3177.0


In [14]:
weekly_and_core.columns

Index(['placekey', 'parent_placekey', 'safegraph_brand_ids',
       'date_range_start', 'date_range_end', 'raw_visit_counts',
       'raw_visitor_counts', 'visits_by_day', 'visits_by_each_hour', 'poi_cbg',
       'visitor_home_cbgs', 'visitor_daytime_cbgs',
       'visitor_country_of_origin', 'distance_from_home', 'median_dwell',
       'bucketed_dwell_times', 'related_same_day_brand',
       'related_same_week_brand', 'location_name', 'brands', 'top_category',
       'sub_category', 'naics_code', 'latitude', 'longitude', 'street_address',
       'city', 'region', 'postal_code', 'iso_country_code', 'open_hours',
       'category_tags', 'opened_on', 'closed_on', 'tracking_opened_since',
       'tracking_closed_since', 'category', 'date', 'safegraph_place_id',
       'area_square_feet'],
      dtype='object')

## reclassify categories

In [15]:
NAICS = {'Specialty Food Stores': [445210, 445220, 445230, 445291, 445292, 445299],
         'Supermarkets': [445110],
         'Convenience Stores': [445120],
         'General Merchandise Stores': [452319, 453998],
         'Big Box Grocers': [452210],
         'Full-Service Restaurants': [722511],
         'Limited-Service Restaurants': [722513],
         'Snack and Bakeries': [722515, 311811],
         'Food Services': [624210],
         'Pharmacies and Drug Stores': [446110, 446191],
         'Beer, Wine, and Liquor Stores': [445310],
         'Tobacco Stores': [453991],
         'Drinking Places': [722410]
        }

In [16]:
weekly_and_core = weekly_and_core[weekly_and_core.naics_code != 722320].copy()  # exclude 272 caterers
# assign the new label to every row whose NAICS code belongs to that label
for label, codes in NAICS.items():
    weekly_and_core.loc[weekly_and_core.naics_code.isin(codes), 'category'] = label
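
The loop above relies on each NAICS code belonging to exactly one label. If a code were listed under two labels, the later label would silently overwrite the earlier one. A minimal sketch of that sanity check, using a shortened copy of the mapping (`invert_mapping` is a hypothetical helper, not part of the notebook):

```python
def invert_mapping(label_to_codes):
    """Return {code: label}, raising if a code is listed under two labels."""
    code_to_label = {}
    for label, codes in label_to_codes.items():
        for code in codes:
            if code in code_to_label:
                raise ValueError(
                    f"NAICS code {code} is listed under both "
                    f"{code_to_label[code]!r} and {label!r}"
                )
            code_to_label[code] = label
    return code_to_label


# Shortened copy of the notebook's NAICS dictionary, for illustration only.
naics_subset = {
    'Supermarkets': [445110],
    'Convenience Stores': [445120],
    'Snack and Bakeries': [722515, 311811],
}

code_to_label = invert_mapping(naics_subset)
print(code_to_label[311811])
```

The inverted dictionary could also be passed to `Series.map` on `naics_code`, which leaves unmapped codes as NaN instead of keeping their previous category, so the two approaches are not interchangeable.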

In [17]:
cateCount = weekly_and_core[['placekey', 'category']].drop_duplicates()
cateCount = cateCount.groupby('category').count()
cateCount

category,placekey
"Beer, Wine, and Liquor Stores",1012
Big Box Grocers,190
Convenience Stores,1058
Drinking Places,2831
Food Services,2
Full-Service Restaurants,13363
General Merchandise Stores,680
Limited-Service Restaurants,4642
Pharmacies and Drug Stores,3244
Snack and Bakeries,5316


In [18]:
cateCount.placekey.sum()

36195
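
The total above equals the number of unique POIs only if every `placekey` carries a single category across all of its weekly rows. A small sketch of how that could be checked, on a made-up toy frame rather than the real data:

```python
import pandas as pd

# Toy stand-in for weekly_and_core: one row per POI per week (values are made up).
toy = pd.DataFrame({
    'placekey': ['a', 'a', 'b', 'b', 'c'],
    'category': ['Supermarkets', 'Supermarkets',
                 'Drinking Places', 'Drinking Places',
                 'Convenience Stores'],
})

# Number of distinct categories attached to each POI.
labels_per_poi = toy.groupby('placekey')['category'].nunique()

# POIs whose category differs between rows (empty when the data are consistent).
inconsistent = labels_per_poi[labels_per_poi > 1]

# Same per-category count as in the notebook cell.
per_category = toy[['placekey', 'category']].drop_duplicates().groupby('category').count()

print(inconsistent.empty, per_category.placekey.sum() == toy.placekey.nunique())
```

Note that `groupby` drops rows with a missing key by default, so POIs without a category would not be counted in either version.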

### Split into tables
### 1. Weekly brand information

In [24]:
wk_col = ['placekey', 'date_range_start', 'date_range_end',
            'location_name', 'latitude', 'longitude',
            'city', 'region', 'postal_code',
            'safegraph_brand_ids', 'brands', 'top_category', 'sub_category', 'naics_code',
            'related_same_day_brand', 'related_same_week_brand',
            'category_tags']
weekly_brand_info = weekly_and_core[wk_col]

In [36]:
weekly_brand_info.shape

(3664420, 17)

In [25]:
weekly_brand_info.to_csv("../weekly_brand_info.csv", header=True, sep=',', quoting=csv.QUOTE_ALL, index=None)

### 2. Unique POI information

In [19]:
poi_col = ['placekey', 'poi_cbg', 'category', 'location_name', 'latitude', 'longitude', 'street_address',
            'city', 'region', 'postal_code', 'open_hours',
            'safegraph_brand_ids', 'brands', 'top_category', 'sub_category', 'naics_code',
            'category_tags', 'opened_on', 'closed_on', 'tracking_opened_since',
            'tracking_closed_since']
poi = weekly_and_core[poi_col].drop_duplicates(subset=['placekey'])
poi.shape

(36195, 21)

In [20]:
poi.head(3)

Unnamed: 0,placekey,poi_cbg,category,location_name,latitude,longitude,street_address,city,region,postal_code,...,safegraph_brand_ids,brands,top_category,sub_category,naics_code,category_tags,opened_on,closed_on,tracking_opened_since,tracking_closed_since
0,222-222@627-s94-nwk,360470395002,Specialty Food Stores,Broadway Meats,40.691436,-73.924891,1259 Broadway,Brooklyn,NY,11221,...,,,Specialty Food Stores,Meat Markets,445210,,,,,2019-07
110,223-222@627-rw6-zfz,360050386008,Supermarkets,Foodtown,40.87689,-73.847776,3471 Boston Rd,Bronx,NY,10469,...,SG_BRAND_6370839ae545be53e6ac733009a92d31,Foodtown,Grocery Stores,Supermarkets and Other Grocery (except Conveni...,445110,,,,,2019-07
220,223-222@627-rwq-vcq,360050117001,Supermarkets,Kirsch Mushroom Company,40.816779,-73.883401,751 Drake St,Bronx,NY,10474,...,,,Grocery Stores,Supermarkets and Other Grocery (except Conveni...,445110,,,,,2019-07


In [21]:
poi.to_csv("../poi_info.csv", header=True, sep=',', quoting=csv.QUOTE_ALL, index=None)

### 3. Weekly trips and general information

In [None]:
trim_col = ['placekey', 'date_range_start', 'date_range_end', 
            'raw_visit_counts', 'raw_visitor_counts', 'visits_by_day', 'visits_by_each_hour', 
            'poi_cbg', 'visitor_home_cbgs', 'visitor_daytime_cbgs',
            'distance_from_home', 'median_dwell', 'bucketed_dwell_times', 
            'naics_code', 
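            'latitude', 'longitude', 'city', 'postal_code', 'area_square_feet',
            'category']  # 'category' is the 20th column in the output shown below
weekly_trips = weekly_and_core[trim_col].copy()  # copy so the date columns can be reassigned safely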

In [None]:
# reformat dates: parse as UTC (the raw strings can carry different offsets), then keep only the calendar date
for col in ['date_range_start', 'date_range_end']:
    weekly_trips[col] = pd.to_datetime(weekly_trips[col], utc=True).dt.date
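
The date cell above parses with `utc=True` and then keeps only the calendar date. The following sketch, on made-up timestamps, illustrates what that does to strings with different UTC offsets. For negative offsets such as US Eastern, midnight local time stays on the same UTC date, but a positive offset moves the date back by one day:

```python
import pandas as pd

# Made-up timestamps in the same style as date_range_start in the raw files.
raw = pd.Series([
    '2020-11-23T00:00:00-05:00',   # US Eastern
    '2020-11-23T00:00:00-08:00',   # US Pacific
    '2020-11-23T00:00:00+10:00',   # positive offset
])

# utc=True converts every value to UTC, so mixed offsets end up in one dtype.
as_utc = pd.to_datetime(raw, utc=True)
utc_dates = as_utc.dt.date.tolist()
print(utc_dates)
```

Since this notebook works with NYC POIs only, the shift should not matter here, but it is worth keeping in mind if the same cell is reused on data from other time zones.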

In [17]:
weekly_trips.shape

(3664420, 20)

In [18]:
weekly_trips.head(3)

Unnamed: 0,placekey,date_range_start,date_range_end,raw_visit_counts,raw_visitor_counts,visits_by_day,visits_by_each_hour,poi_cbg,visitor_home_cbgs,visitor_daytime_cbgs,distance_from_home,median_dwell,bucketed_dwell_times,naics_code,latitude,longitude,city,postal_code,area_square_feet,category
0,222-222@627-s94-nwk,2020-12-21,2020-12-28,39,24,"[7,9,6,5,3,5,4]","[0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,4,0,0,0,0,0,0,0...",360470395002,"{""420950106007"":5,""360470399002"":4,""3604711420...","{""360470411002"":5,""360470385001"":4,""3604703950...",1911.0,49.0,"{""<5"":0,""5-10"":4,""11-20"":12,""21-60"":6,""61-120""...",445210,40.691436,-73.924891,Brooklyn,11221,3177.0,Specialty Food Stores
1,222-222@627-s94-nwk,2021-01-11,2021-01-18,41,27,"[3,5,7,4,6,13,3]","[0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0...",360470395002,"{""360470397003"":8,""360810142011"":4,""3604703950...","{""360470395002"":4,""360810008002"":4,""3608110290...",5109.0,34.0,"{""<5"":0,""5-10"":7,""11-20"":9,""21-60"":9,""61-120"":...",445210,40.691436,-73.924891,Brooklyn,11221,3177.0,Specialty Food Stores
2,222-222@627-s94-nwk,2021-01-18,2021-01-25,39,21,"[3,6,7,6,7,8,2]","[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1...",360470395002,"{""360470377002"":4,""360470251001"":4,""3604703710...","{""360470363002"":4,""360470573001"":4,""5165001140...",4406.0,22.0,"{""<5"":0,""5-10"":4,""11-20"":15,""21-60"":5,""61-120""...",445210,40.691436,-73.924891,Brooklyn,11221,3177.0,Specialty Food Stores
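
Several columns (`visits_by_day`, `visits_by_each_hour`, the `*_cbgs` columns) are stored as JSON strings. One possible data check, assuming that `raw_visit_counts` is meant to equal the sum of `visits_by_day` (which holds for the three rows shown above), is sketched below with the values copied from that output:

```python
import json

# (raw_visit_counts, visits_by_day) pairs copied from the three rows shown above.
rows = [
    (39, "[7,9,6,5,3,5,4]"),
    (41, "[3,5,7,4,6,13,3]"),
    (39, "[3,6,7,6,7,8,2]"),
]


def daily_total(visits_by_day):
    """Parse the JSON-encoded list of daily visits and return its sum."""
    return sum(json.loads(visits_by_day))


# Rows where the weekly total disagrees with the sum of the daily counts.
mismatches = [(total, days) for total, days in rows if daily_total(days) != total]
print(mismatches)
```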


In [29]:
weekly_trips.to_csv("../weekly_patterns_with_general_info.csv", header=True, sep=',', quoting=csv.QUOTE_ALL, index=None)

### Dates check

In [33]:
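datesu = weekly_trips['date_range_start'].unique()  # `weekly` is not defined above; weekly_trips is assumed here
dates = pd.DataFrame(data=datesu, columns=['date_range_start'])
dates.date_range_start = pd.to_datetime(dates.date_range_start)
dates['week'] = dates.date_range_start.dt.isocalendar().week  # .dt.week was removed in pandas 2.0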
dates['year'] = dates.date_range_start.dt.year
dates['month'] = dates.date_range_start.dt.month
dates = dates.sort_values(by='date_range_start', ascending=True)
dates.head()

Unnamed: 0,date_range_start,week,year,month
66,2018-12-31,1,2018,12
4,2019-01-07,2,2019,1
5,2019-01-14,3,2019,1
67,2019-01-21,4,2019,1
68,2019-01-28,5,2019,1


In [34]:
dates[dates.year==2021]

Unnamed: 0,date_range_start,week,year,month
62,2021-01-04,1,2021,1
1,2021-01-11,2,2021,1
2,2021-01-18,3,2021,1
3,2021-01-25,4,2021,1
63,2021-02-01,5,2021,2
64,2021-02-08,6,2021,2
65,2021-02-15,7,2021,2


In [35]:
dates[dates.year==2020] # two weeks missing

Unnamed: 0,date_range_start,week,year,month
87,2020-01-06,2,2020,1
88,2020-01-13,3,2020,1
89,2020-01-20,4,2020,1
90,2020-01-27,5,2020,1
91,2020-02-03,6,2020,2
92,2020-02-10,7,2020,2
93,2020-02-17,8,2020,2
38,2020-02-24,9,2020,2
39,2020-03-02,10,2020,3
40,2020-03-09,11,2020,3


In [36]:
dates[dates.year==2019]

Unnamed: 0,date_range_start,week,year,month
4,2019-01-07,2,2019,1
5,2019-01-14,3,2019,1
67,2019-01-21,4,2019,1
68,2019-01-28,5,2019,1
69,2019-02-04,6,2019,2
70,2019-02-11,7,2019,2
71,2019-02-18,8,2019,2
6,2019-02-25,9,2019,2
7,2019-03-04,10,2019,3
72,2019-03-11,11,2019,3


Check the bridge between the backfill and new-version data

In [31]:
nov20 = pd.read_csv('2020-11-23-weekly-patterns.csv') #backfill
nov20.date_range_start.unique()

array(['2020-11-23T00:00:00-06:00', '2020-11-23T00:00:00-08:00',
       '2020-11-23T00:00:00-05:00', '2020-11-23T00:00:00-07:00',
       '2020-11-23T00:00:00-09:00', '2020-11-23T00:00:00-10:00',
       '2020-11-23T00:00:00-04:00', '2020-11-23T00:00:00+10:00',
       '2020-11-23T00:00:00-11:00', 'date_range_start'], dtype=object)

In [32]:
dec20 = pd.read_csv('2020-12-02-19-weekly-patterns.csv') # new version
dec20.date_range_start.unique()

array(['2020-11-23T00:00:00-06:00', '2020-11-23T00:00:00-08:00',
       '2020-11-23T00:00:00-05:00', '2020-11-23T00:00:00-07:00',
       '2020-11-23T00:00:00-10:00', '2020-11-23T00:00:00-04:00',
       '2020-11-23T00:00:00-09:00', '2020-11-23T00:00:00+10:00',
       'date_range_start', '2020-11-23T00:00:00-11:00'], dtype=object)
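
Two things stand out in the outputs above: the literal string `'date_range_start'` appears among the values, which suggests header rows are repeated inside the files, and both files contain the week starting 2020-11-23. A small sketch of how the overlap could be computed while skipping repeated headers, using two tiny made-up files (`week_starts` is a hypothetical helper):

```python
import csv
import io


def week_starts(csv_text, column='date_range_start'):
    """Return the distinct values of `column`, skipping rows that repeat the header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader if row[column] != column}


# Two tiny made-up files; the second contains a repeated header row.
backfill = "placekey,date_range_start\np1,2020-11-23T00:00:00-05:00\n"
new_version = (
    "placekey,date_range_start\n"
    "p1,2020-11-23T00:00:00-05:00\n"
    "placekey,date_range_start\n"
    "p2,2020-11-30T00:00:00-05:00\n"
)

# Weeks present in both files would be duplicated if the files were concatenated.
overlap = week_starts(backfill) & week_starts(new_version)
print(sorted(overlap))
```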

## brand_info_food.csv

Food-related brand information.

In [7]:
brand_food = pd.read_csv("../brand_info_food.csv")
brand_food.shape

(2661, 6)

In [8]:
brand_food.columns

Index(['safegraph_brand_id', 'brand_name', 'parent_safegraph_brand_id',
       'naics_code', 'top_category', 'sub_category'],
      dtype='object')

In [4]:
brand_food.head()

Unnamed: 0,safegraph_brand_id,brand_name,parent_safegraph_brand_id,naics_code,top_category,sub_category
0,SG_BRAND_978f38b5082c6e3f,Alfred,,722515,Restaurants and Other Eating Places,Snack and Nonalcoholic Beverage Bars
1,SG_BRAND_cb7d114f36c4df5,Andy's Frozen Custard,,722511,Restaurants and Other Eating Places,Full-Service Restaurants
2,SG_BRAND_2f1156f5e6773b54,Bill's Dollar Store,SG_BRAND_fdf4ed272e37e611,452319,"General Merchandise Stores, including Warehous...",All Other General Merchandise Stores
3,SG_BRAND_16d951ef6fde6d7a,Brewer Convenience & Gas,,445120,Grocery Stores,Convenience Stores
4,SG_BRAND_69579a9540cd05a2a08e66204792944c,Capriotti's Sandwich Shop,,722513,Restaurants and Other Eating Places,Limited-Service Restaurants


## social_distancing_nyc.csv

In [31]:
social = dd.read_csv('../social_distancing_nyc.csv',
                     dtype={'mean_distance_traveled_from_home': 'float64',
                               'mean_home_dwell_time': 'float64',
                               'mean_non_home_dwell_time': 'float64',
                               'distance_traveled_from_home': 'float64'})

In [32]:
n_rows, n_cols = social.shape  # the row count is lazy in dask and must be computed
n_rows.compute(), n_cols

(4938323, 23)

In [33]:
social.columns

Index(['origin_census_block_group', 'date_range_start', 'date_range_end',
       'device_count', 'distance_traveled_from_home',
       'bucketed_distance_traveled',
       'median_dwell_at_bucketed_distance_traveled',
       'completely_home_device_count', 'median_home_dwell_time',
       'bucketed_home_dwell_time', 'at_home_by_each_hour',
       'part_time_work_behavior_devices', 'full_time_work_behavior_devices',
       'destination_cbgs', 'delivery_behavior_devices',
       'median_non_home_dwell_time', 'candidate_device_count',
       'bucketed_away_from_home_time', 'median_percentage_time_home',
       'bucketed_percentage_time_home', 'mean_home_dwell_time',
       'mean_non_home_dwell_time', 'mean_distance_traveled_from_home'],
      dtype='object')

In [34]:
social.head()

Unnamed: 0,origin_census_block_group,date_range_start,date_range_end,device_count,distance_traveled_from_home,bucketed_distance_traveled,median_dwell_at_bucketed_distance_traveled,completely_home_device_count,median_home_dwell_time,bucketed_home_dwell_time,...,destination_cbgs,delivery_behavior_devices,median_non_home_dwell_time,candidate_device_count,bucketed_away_from_home_time,median_percentage_time_home,bucketed_percentage_time_home,mean_home_dwell_time,mean_non_home_dwell_time,mean_distance_traveled_from_home
0,360470064002,2019-01-01T00:00:00-05:00,2019-01-02T00:00:00-05:00,60,620.0,"{""16001-50000"":7,""0"":25,"">50000"":1,""2001-8000""...","{""16001-50000"":77,"">50000"":1086,""<1000"":38,""20...",29,795,"{""721-1080"":6,""361-720"":9,""61-360"":13,""<60"":8,...",...,"{""360810846021"":1,""360470138001"":1,""3604700660...",4,16,159,"{""21-45"":6,""481-540"":2,""721-840"":1,""301-360"":2...",98,"{""0-25"":4,""76-100"":41,""51-75"":5,""26-50"":5}",772.0,174.0,2320.0
1,360810384001,2019-01-01T00:00:00-05:00,2019-01-02T00:00:00-05:00,105,2253.0,"{""16001-50000"":3,""0"":55,"">50000"":2,""2001-8000""...","{""16001-50000"":153,"">50000"":338,""<1000"":264,""2...",55,910,"{""721-1080"":13,""361-720"":10,""61-360"":17,""<60"":...",...,"{""360470293002"":1,""361031587081"":1,""3608106800...",1,0,298,"{""21-45"":6,""481-540"":1,""46-60"":2,""721-840"":2,""...",100,"{""0-25"":16,""76-100"":70,""51-75"":13,""26-50"":6}",812.0,156.0,3211.0
2,360850170103,2019-01-01T00:00:00-05:00,2019-01-02T00:00:00-05:00,199,1952.0,"{""16001-50000"":14,""0"":88,"">50000"":11,""2001-800...","{""16001-50000"":82,"">50000"":82,""<1000"":258,""200...",89,984,"{""721-1080"":32,""361-720"":19,""61-360"":26,""<60"":...",...,"{""360470140001"":1,""360850319012"":1,""1209900690...",2,16,425,"{""21-45"":14,""481-540"":1,""541-600"":5,""46-60"":4,...",98,"{""0-25"":26,""76-100"":146,""51-75"":17,""26-50"":10}",812.0,152.0,52224.0
3,360050177022,2019-01-01T00:00:00-05:00,2019-01-02T00:00:00-05:00,85,1859.0,"{""16001-50000"":1,""0"":46,""2001-8000"":10,""1-1000...","{""16001-50000"":679,""<1000"":205,""2001-8000"":231...",45,638,"{""721-1080"":14,""361-720"":11,""61-360"":14,""<60"":...",...,"{""360050431008"":1,""360050155001"":1,""3608104670...",1,0,270,"{""21-45"":1,""721-840"":1,""1201-1320"":1,""301-360""...",100,"{""0-25"":19,""76-100"":59,""51-75"":3,""26-50"":1}",669.0,240.0,3549.0
4,360050248002,2019-01-01T00:00:00-05:00,2019-01-02T00:00:00-05:00,51,428.0,"{""16001-50000"":1,""0"":27,""2001-8000"":9,""1-1000""...","{""16001-50000"":21,""<1000"":188,""2001-8000"":145,...",30,487,"{""721-1080"":5,""361-720"":6,""61-360"":12,""<60"":10...",...,"{""360050248001"":5,""360050110001"":1,""3600500430...",3,0,172,"{""21-45"":3,""481-540"":4,""541-600"":1,""46-60"":1,""...",100,"{""0-25"":4,""76-100"":32,""51-75"":2,""26-50"":7}",653.0,121.0,3623.0
