This notebook performs the following tasks and relies on the following dataset conventions:
    1. Download .csv file(s) for an ARTCC from an AWS S3 bucket into a single pandas DataFrame, then select a range of dates/periods
    2. The datasets already include geohash encoding (geoEncode column)
    3. The datasets contain 6-hour ISSR counts
    4. Each date/time in the datasets marks the start of a 6-hour period in local time (start included, end excluded). Examples:
        2020-05-31 00:00:00 means the period on May 31, 2020 from midnight to 6 AM (6 AM not included)
        2020-05-31 06:00:00 means the period on May 31, 2020 from 6 AM to noon (noon not included)
        2020-05-31 12:00:00 means the period on May 31, 2020 from noon to 6 PM (6 PM not included)
        2020-05-31 18:00:00 means the period on May 31, 2020 from 6 PM to midnight (midnight not included)
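
The period convention above can be sketched in code. This is a minimal illustration, assuming that each timestamp labels the start of its half-open 6-hour period as described; the helper name `period_bounds` is hypothetical and not part of the datasets or any library.

```python
import pandas as pd

PERIOD_HOURS = 6  # length of one period, per the dataset description


def period_bounds(local_time):
    """Return (start, end) of the 6-hour period containing local_time.

    The period is half-open: start is included, end is excluded.
    """
    ts = pd.Timestamp(local_time)
    # Round the hour down to 0, 6, 12 or 18 and drop minutes/seconds
    start = ts.normalize() + pd.Timedelta(hours=(ts.hour // PERIOD_HOURS) * PERIOD_HOURS)
    end = start + pd.Timedelta(hours=PERIOD_HOURS)
    return start, end


start, end = period_bounds("2020-05-31 07:30:00")
print(start, end)  # 2020-05-31 06:00:00 2020-05-31 12:00:00
```

A local time of 07:30 therefore belongs to the row labelled `2020-05-31 06:00:00`.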

### 1. Install/Load Libraries

In [17]:
pip install awswrangler

Note: you may need to restart the kernel to use updated packages.


In [18]:
import sagemaker
import boto3
import awswrangler as wr

import pandas as pd
import numpy as np
from datetime import datetime

In [19]:
from sagemaker import get_execution_role
role = get_execution_role()

### 2. Specify Input/Output S3 Buckets

Always set output_bucket = 'partly-cloudy-common-area' or output_bucket = your own bucket.

Please do NOT store anything new in (i.e., do not set output_bucket to) any of the following buckets: (1) partly-cloudy-asdb, (2) partly-cloudy-rap-csv, (3) partly-cloudy-rap-parquet, (4) partly-cloudy-common-area/proof-of-concept/, and (5) partly-cloudy-ml-inputs.

In [20]:
input_bucket = 'partly-cloudy-ml-inputs'
subfolder = ""

### 3. Select Specific File(s) from the Input S3 Bucket for a Single DataFrame

In [21]:
# List the files in input_bucket (i.e., "partly-cloudy-ml-inputs")
conn = boto3.client('s3')
# .get() avoids a KeyError when no object matches the prefix
contents = conn.list_objects_v2(Bucket=input_bucket, Prefix=subfolder).get('Contents', [])
for f in contents:
    print(f['Key'])

JACKSONVILLE_6_hour_flat.csv
MIAMI_6_hour_flat.csv
SEATTLE_6_hour_flat.csv
WASHINGTON_6_hour_flat.csv
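
A single S3 list call returns at most 1,000 keys, so a larger bucket or prefix needs pagination. The following is a minimal sketch of that, using boto3's `list_objects_v2` paginator; the helper names `keys_from_pages` and `list_keys` are hypothetical, and `list_keys` assumes valid AWS credentials and bucket access.

```python
def keys_from_pages(pages):
    """Collect object keys from an iterable of list_objects_v2 response pages."""
    keys = []
    for page in pages:
        # A page has no 'Contents' entry when nothing matches the prefix
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def list_keys(bucket, prefix=""):
    """List every key under a prefix, following pagination (needs AWS access)."""
    import boto3  # imported here so keys_from_pages works without boto3 installed

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return keys_from_pages(paginator.paginate(Bucket=bucket, Prefix=prefix))
```

In this notebook the call would look like `list_keys(input_bucket, subfolder)`.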


In [22]:
data_locations = ['s3://partly-cloudy-ml-inputs/JACKSONVILLE_6_hour_flat.csv'] # <---------------- Specify

for data_location in data_locations:
    print(data_location)

s3://partly-cloudy-ml-inputs/JACKSONVILLE_6_hour_flat.csv
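
The file names listed above follow the pattern `<ARTCC>_6_hour_flat.csv`, so the S3 paths can also be built from ARTCC names instead of typed by hand. This is a sketch under that assumption; `artcc_locations` is a hypothetical helper, and the pattern may not hold for files added later.

```python
input_bucket = "partly-cloudy-ml-inputs"


def artcc_locations(names, bucket=input_bucket, suffix="_6_hour_flat.csv"):
    """Build s3:// paths for the given ARTCC names (assumed naming pattern)."""
    return [f"s3://{bucket}/{name.upper()}{suffix}" for name in names]


data_locations = artcc_locations(["JACKSONVILLE", "MIAMI"])
for data_location in data_locations:
    print(data_location)
```

Passing more than one name yields a list of paths, which `wr.s3.read_csv` accepts and reads into one DataFrame.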


### 5. Ingest Selected Files into a Single Dataframe
This section works with the output of either Section 3 or Section 4.

In [23]:
%%time
df = wr.s3.read_csv(path= data_locations, parse_dates= ['LocalDateTimePeriod'])

CPU times: user 3.02 s, sys: 533 ms, total: 3.55 s
Wall time: 2.83 s


In [24]:
df

,LocalDateTimePeriod,Nx,Ny,Lat,Lon,geoEncode,NAME,pdISSRs200430,pdISSRs300370
0,2020-05-31 18:00:00,233,75,32.531814,-80.364342,djyc5,JACKSONVILLE,0,0
1,2020-05-31 18:00:00,228,75,32.625822,-81.433045,djy1d,JACKSONVILLE,0,0
2,2020-05-31 18:00:00,205,59,30.068127,-86.582947,dj67j,JACKSONVILLE,3,2
3,2020-05-31 18:00:00,213,49,28.151308,-85.059156,dj580,JACKSONVILLE,4,4
4,2020-05-31 18:00:00,216,59,29.926908,-84.276855,djk46,JACKSONVILLE,0,0
...,...,...,...,...,...,...,...,...,...
1839305,2021-05-31 18:00:00,240,81,33.466085,-78.717015,dmbn2,JACKSONVILLE,0,0
1839306,2021-05-31 18:00:00,234,86,34.487209,-79.888016,dnphn,JACKSONVILLE,0,0
1839307,2021-05-31 18:00:00,200,60,30.302749,-87.621421,dj3sq,JACKSONVILLE,0,0
1839308,2021-05-31 18:00:00,204,52,28.806484,-86.880965,dj45y,JACKSONVILLE,0,0


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1839310 entries, 0 to 1839309
Data columns (total 9 columns):
 #   Column               Dtype         
---  ------               -----         
 0   LocalDateTimePeriod  datetime64[ns]
 1   Nx                   int64         
 2   Ny                   int64         
 3   Lat                  float64       
 4   Lon                  float64       
 5   geoEncode            object        
 6   NAME                 object        
 7   pdISSRs200430        int64         
 8   pdISSRs300370        int64         
dtypes: datetime64[ns](1), float64(2), int64(4), object(2)
memory usage: 126.3+ MB


### 6. Data Filter

In [34]:
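# Period start times to keep; both bounds are inclusive
start_period = '2021-01-01 0:00:00'
end_period = '2021-01-01 18:00:00'

# Keep rows whose period start falls within [start_period, end_period]
in_range = (df['LocalDateTimePeriod'] >= pd.to_datetime(start_period)) & (df['LocalDateTimePeriod'] <= pd.to_datetime(end_period))
df_select = df.loc[in_range].copy()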
df_select

,LocalDateTimePeriod,Nx,Ny,Lat,Lon,geoEncode,NAME,pdISSRs200430,pdISSRs300370
1080310,2021-01-01 00:00:00,232,82,33.809626,-80.416475,dnnb6,JACKSONVILLE,1,1
1080311,2021-01-01 00:00:00,233,72,31.991301,-80.433450,djwvc,JACKSONVILLE,6,6
1080312,2021-01-01 00:00:00,221,56,29.307158,-83.284950,djhy8,JACKSONVILLE,2,2
1080313,2021-01-01 00:00:00,233,73,32.171550,-80.410487,djwz4,JACKSONVILLE,6,6
1080314,2021-01-01 00:00:00,220,72,32.220099,-83.202559,djsz6,JACKSONVILLE,2,2
...,...,...,...,...,...,...,...,...,...
1085365,2021-01-01 18:00:00,201,63,30.837379,-87.375508,dj3z7,JACKSONVILLE,3,3
1085366,2021-01-01 18:00:00,219,59,29.882497,-83.648942,djk9b,JACKSONVILLE,0,0
1085367,2021-01-01 18:00:00,224,59,29.802874,-82.603518,djm38,JACKSONVILLE,2,2
1085368,2021-01-01 18:00:00,233,63,30.365916,-80.636928,djqst,JACKSONVILLE,0,0
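
Because the filter compares period start times, an end bound given as a bare date (e.g. '2021-01-01') would keep only the 00:00 period of that day. One possible way to select whole days is sketched below; `select_days` is a hypothetical helper, and the small frame is made-up illustration data, not values from the dataset.

```python
import pandas as pd


def select_days(df, first_day, last_day, column="LocalDateTimePeriod"):
    """Keep all 6-hour periods that start on first_day through last_day."""
    start = pd.Timestamp(first_day).normalize()
    # Exclusive upper bound: midnight after last_day
    stop = pd.Timestamp(last_day).normalize() + pd.Timedelta(days=1)
    mask = (df[column] >= start) & (df[column] < stop)
    return df.loc[mask].copy()


# Made-up frame with one row per period, for illustration only
toy = pd.DataFrame({
    "LocalDateTimePeriod": pd.to_datetime([
        "2020-12-31 18:00:00", "2021-01-01 00:00:00", "2021-01-01 06:00:00",
        "2021-01-01 12:00:00", "2021-01-01 18:00:00", "2021-01-02 00:00:00",
    ]),
    "pdISSRs200430": [0, 1, 2, 3, 4, 5],
})

one_day = select_days(toy, "2021-01-01", "2021-01-01")
print(one_day["LocalDateTimePeriod"].tolist())
```

Here `one_day` holds the four periods starting at 00:00, 06:00, 12:00 and 18:00 on 2021-01-01.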


In [35]:
df_select['LocalDateTimePeriod'].value_counts()

2021-01-01 00:00:00    1265
2021-01-01 18:00:00    1265
2021-01-01 12:00:00    1265
2021-01-01 06:00:00    1265
Name: LocalDateTimePeriod, dtype: int64
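
The counts above are equal for all four periods (1265 rows each), which suggests one row per grid cell per period. A quick completeness check along those lines could look like the sketch below. The helper names are hypothetical, the toy frame is made-up illustration data, and the equal-count expectation is an assumption drawn from this output rather than a documented guarantee.

```python
import pandas as pd


def rows_per_period(df, column="LocalDateTimePeriod"):
    """Count rows per period, ordered chronologically rather than by count."""
    return df[column].value_counts().sort_index()


def has_uniform_counts(counts):
    """True when every period has the same number of rows."""
    return bool(counts.nunique() == 1)


# Made-up frame: two periods with two rows each, for illustration only
toy = pd.DataFrame({
    "LocalDateTimePeriod": pd.to_datetime([
        "2021-01-01 06:00:00", "2021-01-01 00:00:00",
        "2021-01-01 00:00:00", "2021-01-01 06:00:00",
    ]),
    "geoEncode": ["djyc5", "djyc5", "djy1d", "djy1d"],
})

counts = rows_per_period(toy)
print(counts)
print(has_uniform_counts(counts))
```

On the notebook's data, `rows_per_period(df_select)` would list the four periods in time order, which is easier to scan than the default ordering by count.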