# Extract NYC Sample 
YOU WILL NOT BE ABLE TO RUN THIS CODE AND REPRODUCE THE RESULTS: the footfall input is not provided, and the counts are randomized without a fixed seed.
This file only documents how I extracted the NYC sample data.

Input files include:
    - NYC_sample.csv: a list of sample AreaIds, selected and generated using QGIS
    - ny_hexagon_geometries.csv: full hexagons set. Visit https://github.com/teralytics/geohex for hexagon generation.
    - ny_flow.csv: the footfall data. This file is not provided.
Output files include:
    - NYC_sample_geometries.csv: 
        - AreaId: Unique identifier of a hexagon.
        - Geometry: Well-known text (WKT) representation of the hexagons.
    - NYC_sample_flow.csv:
        - AreaId: Hexagon ID as in NYC_sample_geometries.csv.
        - NewCount: Number of persons who entered the hexagon during the time interval [StartTime, StartTime + 1 hour]. The values in this sample are randomly generated, not the original counts.
        - StartTime: "yyyy-mm-dd HH:MM:SS" representation in local time (daylight saving time, UTC−04:00).
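
For anyone reading the output files, here is a minimal sketch of loading the flow file and attaching a timezone to StartTime. It assumes the column layout listed above. The inline CSV text is a stand-in for NYC_sample_flow.csv, and the zone name `America/New_York` is my assumption for matching the stated UTC−04:00 offset.

```python
import io
import pandas as pd

# Stand-in for NYC_sample_flow.csv (same columns as described above)
csv_text = u"""AreaId,StartTime,NewCount
229777,2015-09-01 00:00:00,304
229777,2015-09-01 01:00:00,100
"""

flow = pd.read_csv(io.StringIO(csv_text), parse_dates=['StartTime'])

# StartTime is naive local time: attach the zone (assumed name), then convert to UTC
flow['StartLocal'] = flow['StartTime'].dt.tz_localize('America/New_York')
flow['StartUTC'] = flow['StartLocal'].dt.tz_convert('UTC')
```

With the real file, replace `io.StringIO(csv_text)` with the path to NYC_sample_flow.csv.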


In [1]:
import os
import csv
import pandas as pd
import datetime as dt
import numpy as np

# Create the sample geometry file

In [2]:
# Read NYC_sample.csv, which contains only the sample AreaIds (selected and generated using QGIS)
sample_AreaId = pd.read_csv('NYC_sample.csv')
sample_Ids = set(sample_AreaId['AreaId'])

In [3]:
full_geom = pd.read_csv('ny_hexagon_geometries.csv')

In [4]:
sample_geom = full_geom[full_geom.AreaId.isin(sample_Ids)]

In [5]:
# index=False keeps only the AreaId and Geometry columns in the output file
sample_geom.to_csv('NYC_sample_geometries.csv', index=False)

In [6]:
sample_geom.head()

Unnamed: 0,AreaId,Geometry
17387,229777,MULTIPOLYGON(((-74.017713561525 40.69983361954...
17388,229778,MULTIPOLYGON(((-74.017713561525 40.70210146199...
17389,229779,MULTIPOLYGON(((-74.017713561525 40.70436922723...
17390,229780,MULTIPOLYGON(((-74.017713561525 40.70663691526...
17391,229781,MULTIPOLYGON(((-74.017713561525 40.70890452607...
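
A quick sanity check that could follow the `isin` filter: confirm that every sample AreaId actually has a geometry and that no AreaId appears twice. This is only a sketch on toy data. The ids and the `wkt-…` strings are hypothetical placeholders, not values from the real files.

```python
import pandas as pd

# Toy stand-ins for the sample id list and the full geometry table (hypothetical values)
sample_ids = {1, 2, 5}
full_geom = pd.DataFrame({
    'AreaId': [1, 2, 3, 4],
    'Geometry': ['wkt-1', 'wkt-2', 'wkt-3', 'wkt-4'],
})

# Same filter as in the notebook
sample_geom = full_geom[full_geom['AreaId'].isin(sample_ids)]

# Ids requested in the sample but absent from the geometry table
missing_ids = sample_ids - set(sample_geom['AreaId'])

# Duplicate ids would produce repeated hexagons in the output file
has_duplicates = sample_geom['AreaId'].duplicated().any()
```

In this toy case `missing_ids` contains the id 5, because it has no row in `full_geom`.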


# Create the sample flow file

In [7]:
# Read the full flow file, which includes AreaId, Count and StartTime
full_flow = pd.read_csv('ny_flow.csv')

In [8]:
sample_flow = full_flow[full_flow.AreaId.isin(sample_Ids)]

In [9]:
# Drop the original Count column; it is replaced by randomized values below
sample_flow = sample_flow.drop('Count', axis=1)

In [10]:
sample_flow.head()

Unnamed: 0,AreaId,StartTime
3882927,229777,2015-09-01 00:00:00
3882928,229777,2015-09-01 01:00:00
3882929,229777,2015-09-01 02:00:00
3882930,229777,2015-09-01 03:00:00
3882931,229777,2015-09-01 04:00:00


In [11]:
sample_flow.dtypes

AreaId        int64
StartTime    object
dtype: object

In [12]:
# Convert StartTime from string to datetime
sample_flow = sample_flow.copy()  # work on a copy, not a view of full_flow
sample_flow['StartTime'] = pd.to_datetime(sample_flow['StartTime'])

In [13]:
end_time = dt.datetime(2015,9,8,0,0)

In [14]:
sample_flow = sample_flow[sample_flow.StartTime < end_time]
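
The cut-off keeps StartTime values strictly before 2015-09-08 00:00. Assuming the data starts at 2015-09-01 00:00:00 (as in the head() output above) and has no missing hours, that leaves 7 × 24 = 168 rows per AreaId. A minimal sketch of that check on toy data, where the two AreaIds and the nine-day span are made up:

```python
import datetime as dt
import pandas as pd

start = dt.datetime(2015, 9, 1)
end_time = dt.datetime(2015, 9, 8, 0, 0)

# Toy hourly timestamps covering nine days, for two hypothetical AreaIds
hours = pd.date_range(start, periods=9 * 24, freq=dt.timedelta(hours=1))
toy_flow = pd.DataFrame({
    'AreaId': [1] * len(hours) + [2] * len(hours),
    'StartTime': list(hours) * 2,
})

# Same filter as in the notebook: keep rows strictly before the cut-off
week = toy_flow[toy_flow['StartTime'] < end_time]

# One full week of hourly rows per AreaId
rows_per_area = week.groupby('AreaId').size()
```

An AreaId with fewer than 168 rows after the filter would point to gaps in the hourly data.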

In [15]:
# Randomize the count: random integers in [0, 500), upper bound excluded
sample_flow['NewCount'] = np.random.randint(0, 500, size=len(sample_flow))

In [16]:
# Pick 5 distinct AreaIds at random (replace=False avoids drawing the same id twice)
high_flow_ids = set(np.random.choice(sample_flow['AreaId'].unique(), size=5, replace=False))

In [17]:
print "high flow ids are: ", high_flow_ids

high flow ids are:  set([232200, 229784, 236330, 231460, 233511])


In [18]:
# Assign random integers in [1000, 5000) to the rows of these AreaIds
for area_id in high_flow_ids:
    mask = sample_flow['AreaId'] == area_id
    sample_flow.loc[mask, 'NewCount'] = np.random.randint(1000, 5000, size=mask.sum())

In [19]:
sample_flow.head()

Unnamed: 0,AreaId,StartTime,NewCount
3882927,229777,2015-09-01 00:00:00,304
3882928,229777,2015-09-01 01:00:00,100
3882929,229777,2015-09-01 02:00:00,52
3882930,229777,2015-09-01 03:00:00,304
3882931,229777,2015-09-01 04:00:00,248


In [22]:
sample_flow[sample_flow.AreaId==232200].head()

Unnamed: 0,AreaId,StartTime,NewCount
4459084,232200,2015-09-01 00:00:00,3516
4459085,232200,2015-09-01 01:00:00,3130
4459086,232200,2015-09-01 02:00:00,3058
4459087,232200,2015-09-01 03:00:00,4884
4459088,232200,2015-09-01 04:00:00,2268
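
The randomization above uses no seed, so each run gives different counts. If repeatable synthetic counts were wanted, one option is a seeded generator. The sketch below is not what I ran: the seed value, the toy AreaIds and the choice to draw the high-flow ids without replacement are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)  # hypothetical seed, not used in the original run

# Toy table: four hypothetical AreaIds with three rows each
toy_flow = pd.DataFrame({'AreaId': np.repeat([10, 20, 30, 40], 3)})

# Baseline: random integers in [0, 500) for every row
toy_flow['NewCount'] = rng.randint(0, 500, size=len(toy_flow))

# Draw distinct AreaIds (without replacement) to act as high-flow hexagons
high_flow_ids = set(rng.choice(toy_flow['AreaId'].unique(), size=2, replace=False))

# Overwrite their rows with random integers in [1000, 5000)
for area_id in high_flow_ids:
    mask = toy_flow['AreaId'] == area_id
    toy_flow.loc[mask, 'NewCount'] = rng.randint(1000, 5000, size=mask.sum())
```

Re-running this with the same seed yields the same high-flow ids and the same counts.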


In [24]:
sample_flow.to_csv('NYC_sample_flow.csv', index=False)