This notebook performs the following tasks:
    1. Download .csv file(s) from an AWS S3 bucket, either for specific date(s)/hour(s) or for a range of dates/hours, into a single pandas DataFrame
    2. Filter the records to keep only those flagged as ISSR ('IsISSR' == 1)
    3. Save the filtered DataFrame to the "partly-cloudy-common-area" S3 bucket for idv input

References:

https://github.com/awslabs/aws-data-wrangler

### 1. Install/Load Libraries

In [1]:
pip install awswrangler

Note: you may need to restart the kernel to use updated packages.


In [2]:
import sagemaker
import boto3
import awswrangler as wr

import pandas as pd
import numpy as np
from datetime import datetime

In [3]:
from sagemaker import get_execution_role
role = get_execution_role()

### 2. Specify Input/Output S3 Buckets

Always set output_bucket = 'partly-cloudy-common-area' or output_bucket = your own bucket.

Please do NOT store anything new in (i.e., do not set output_bucket to) any of the following buckets: (1) partly-cloudy-asdb, (2) partly-cloudy-rap-csv, or (3) partly-cloudy-rap-parquet

In [4]:
input_bucket = 'partly-cloudy-rap-csv' # <<<<<<<<<<<<<<<<<<<<< Use when reading from master .csv RAP data
#input_bucket = 'partly-cloudy-common-area' # <<<<<<<<<<<<<<<<< Use when reading file(s) from the team common area

output_bucket = 'partly-cloudy-common-area'
subfolder = ''
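
The rule about protected buckets can also be enforced in code rather than by convention. The following is a minimal sketch and not part of the original notebook: `PROTECTED_BUCKETS` and `check_output_bucket` are hypothetical names, and the bucket list is copied from the note above.

```python
# Hypothetical helper (not in the original notebook): refuse to write into
# the source buckets named in the note above.
PROTECTED_BUCKETS = {
    'partly-cloudy-asdb',
    'partly-cloudy-rap-csv',
    'partly-cloudy-rap-parquet',
}

def check_output_bucket(bucket_name):
    """Return bucket_name unchanged, or raise ValueError if it is protected."""
    if bucket_name in PROTECTED_BUCKETS:
        raise ValueError(
            f"'{bucket_name}' is a source bucket; choose another output bucket"
        )
    return bucket_name

output_bucket = check_output_bucket('partly-cloudy-common-area')
```

Calling the check at assignment time means a mistyped bucket name fails here instead of at the final write step.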

### 3. Read Specific File(s) from the Input S3 Bucket into a Single Dataframe (run this cell and skip Step 4)

To read RAP .csv data for a range of dates/times, skip Step 3 and run Step 4 instead.

In [5]:
data_locations = ['s3://partly-cloudy-rap-csv/2021_05_01_13.csv',
                  's3://partly-cloudy-common-area/2021_05_02_20.csv'] # <---------------- Specify

for data_location in data_locations:
    print(data_location)

s3://partly-cloudy-rap-csv/2021_05_01_13.csv
s3://partly-cloudy-common-area/2021_05_02_20.csv
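
Because the file names encode the date and hour as `<yyyy>_<mm>_<dd>_<hh>.csv`, a hand-typed list can be sanity-checked by parsing each name back into a datetime. This is a sketch under that naming assumption, which is inferred from the paths shown in this notebook. `parse_hour_from_uri` is a hypothetical helper, not part of the original.

```python
from datetime import datetime

def parse_hour_from_uri(s3_uri):
    """Extract the datetime encoded in a '<yyyy>_<mm>_<dd>_<hh>.csv' file name."""
    file_name = s3_uri.rsplit('/', 1)[-1]           # e.g. '2021_05_01_13.csv'
    stem = file_name[:-len('.csv')] if file_name.endswith('.csv') else file_name
    return datetime.strptime(stem, '%Y_%m_%d_%H')   # raises ValueError on a bad name

data_locations = ['s3://partly-cloudy-rap-csv/2021_05_01_13.csv',
                  's3://partly-cloudy-common-area/2021_05_02_20.csv']

for data_location in data_locations:
    print(data_location, '->', parse_hour_from_uri(data_location))
```

A typo such as an hour of `24` or a missing underscore raises a `ValueError` here, before any S3 request is made.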


### 4. Reading RAP .csv data for a range of dates/times

In [6]:
def get_dataLocations(bucket_name, firstDT, lastDT):
    """Return the S3 URIs of the hourly .csv files from firstDT through lastDT (inclusive)."""
    # Compare as timestamps, not strings, so differently formatted inputs still order correctly
    firstDT, lastDT = pd.Timestamp(firstDT), pd.Timestamp(lastDT)

    # If the range is empty or reversed, fall back to the single hour at firstDT
    if firstDT >= lastDT:
        lastDT = firstDT

    # One timestamp per hour; a Timedelta step does not depend on how the
    # installed pandas version spells the hourly frequency alias
    dateTimes = pd.date_range(firstDT, lastDT, freq=pd.Timedelta(hours=1))

    # Files are named <yyyy>_<mm>_<dd>_<hh>.csv
    dat_locs = ['s3://' + bucket_name + '/' + dateTime.strftime('%Y_%m_%d_%H') + '.csv'
                for dateTime in dateTimes]

    return dat_locs

In [7]:
beginDT = '2020-06-01 00:00:00' # <---------------- Specify (between '2020-06-01 00:00:00' and '2021-05-30 23:00:00')
endDT =   '2020-06-01 23:00:00' # <---------------- Specify (between '2020-06-01 00:00:00' and '2021-05-30 23:00:00')

data_locations = get_dataLocations(input_bucket, beginDT, endDT)

for data_location in data_locations:
    print(data_location)

s3://partly-cloudy-rap-csv/2020_06_01_00.csv
s3://partly-cloudy-rap-csv/2020_06_01_01.csv
s3://partly-cloudy-rap-csv/2020_06_01_02.csv
s3://partly-cloudy-rap-csv/2020_06_01_03.csv
s3://partly-cloudy-rap-csv/2020_06_01_04.csv
s3://partly-cloudy-rap-csv/2020_06_01_05.csv
s3://partly-cloudy-rap-csv/2020_06_01_06.csv
s3://partly-cloudy-rap-csv/2020_06_01_07.csv
s3://partly-cloudy-rap-csv/2020_06_01_08.csv
s3://partly-cloudy-rap-csv/2020_06_01_09.csv
s3://partly-cloudy-rap-csv/2020_06_01_10.csv
s3://partly-cloudy-rap-csv/2020_06_01_11.csv
s3://partly-cloudy-rap-csv/2020_06_01_12.csv
s3://partly-cloudy-rap-csv/2020_06_01_13.csv
s3://partly-cloudy-rap-csv/2020_06_01_14.csv
s3://partly-cloudy-rap-csv/2020_06_01_15.csv
s3://partly-cloudy-rap-csv/2020_06_01_16.csv
s3://partly-cloudy-rap-csv/2020_06_01_17.csv
s3://partly-cloudy-rap-csv/2020_06_01_18.csv
s3://partly-cloudy-rap-csv/2020_06_01_19.csv
s3://partly-cloudy-rap-csv/2020_06_01_20.csv
s3://partly-cloudy-rap-csv/2020_06_01_21.csv
s3://partl
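
The comments on `beginDT` and `endDT` give the window for which data exist. A small check can confirm that a requested range falls inside that window before any files are requested. This is a minimal sketch using only the standard library. `check_range`, `DATA_START`, and `DATA_END` are hypothetical names, and the bounds are copied from those comments. Unlike `get_dataLocations`, which silently collapses a reversed range to a single hour, this sketch raises an error.

```python
from datetime import datetime

# Bounds copied from the comments next to beginDT / endDT.
DATA_START = datetime(2020, 6, 1, 0)
DATA_END = datetime(2021, 5, 30, 23)
FMT = '%Y-%m-%d %H:%M:%S'

def check_range(begin_str, end_str):
    """Validate the range and return the number of hourly files it spans."""
    begin = datetime.strptime(begin_str, FMT)
    end = datetime.strptime(end_str, FMT)
    if not (DATA_START <= begin <= end <= DATA_END):
        raise ValueError(
            f'range {begin_str} .. {end_str} is reversed or outside the available data'
        )
    # Both endpoints are included, hence the + 1.
    return int((end - begin).total_seconds() // 3600) + 1

print(check_range('2020-06-01 00:00:00', '2020-06-01 23:00:00'))  # 24
```

The returned count can be compared with `len(data_locations)` as a quick consistency check.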

### 5. Ingest Selected Files into a Single Dataframe
This section works with the `data_locations` list produced by either Section 3 or Section 4.

In [8]:
df = wr.s3.read_csv(path= data_locations)

In [9]:
df

Unnamed: 0,dateTime,hPa,FLevel,Nx,Ny,Lat,Lon,Temperature,RH_ice,IsISSR
0,2020-06-01 00:00:00,150,440,1,1,16.281000,-126.138000,206.817,56.84,0
1,2020-06-01 00:00:00,150,440,2,1,16.322011,-125.954684,206.817,56.39,0
2,2020-06-01 00:00:00,150,440,3,1,16.362789,-125.771252,206.880,55.47,0
3,2020-06-01 00:00:00,150,440,4,1,16.403332,-125.587705,206.942,54.55,0
4,2020-06-01 00:00:00,150,440,5,1,16.443642,-125.404045,206.942,53.44,0
...,...,...,...,...,...,...,...,...,...,...
21130195,2020-06-01 23:00:00,450,210,297,225,55.648911,-58.431595,251.913,74.46,0
21130196,2020-06-01 23:00:00,450,210,298,225,55.607604,-58.167947,252.225,82.05,0
21130197,2020-06-01 23:00:00,450,210,299,225,55.565986,-57.904583,252.475,88.87,0
21130198,2020-06-01 23:00:00,450,210,300,225,55.524058,-57.641507,252.600,92.11,0


### 6. Data Manipulation

In [10]:
df_select = df.loc[df['IsISSR'] == 1].copy()

In [11]:
df_select

Unnamed: 0,dateTime,hPa,FLevel,Nx,Ny,Lat,Lon,Temperature,RH_ice,IsISSR
67910,2020-06-01 00:00:00,175,410,186,1,19.654416,-90.942443,213.866,100.55,1
67911,2020-06-01 00:00:00,175,410,187,1,19.648843,-90.749348,213.616,100.73,1
68211,2020-06-01 00:00:00,175,410,186,2,19.836295,-90.936654,213.804,100.17,1
68212,2020-06-01 00:00:00,175,410,187,2,19.830712,-90.743284,213.616,100.30,1
69191,2020-06-01 00:00:00,175,410,263,5,19.192999,-76.074239,212.429,100.29,1
...,...,...,...,...,...,...,...,...,...,...
20926743,2020-06-01 23:00:00,375,250,20,225,55.087685,-134.973239,232.474,100.51,1
20926744,2020-06-01 23:00:00,375,250,21,225,55.132717,-134.713112,232.474,100.51,1
20926745,2020-06-01 23:00:00,375,250,22,225,55.177439,-134.452684,232.411,100.56,1
20926746,2020-06-01 23:00:00,375,250,23,225,55.221851,-134.191956,232.349,100.62,1
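
The filter is a boolean mask: the comparison yields one True/False value per row, and `.loc` keeps the rows marked True. The sketch below reproduces this on a four-row frame whose values are copied from the outputs above. `sample`, `mask`, and `sample_select` are illustrative names only.

```python
import pandas as pd

# Four rows copied from the outputs above: two with IsISSR == 0, two with IsISSR == 1.
sample = pd.DataFrame({
    'dateTime': ['2020-06-01 00:00:00'] * 4,
    'hPa': [150, 150, 175, 175],
    'RH_ice': [56.84, 56.39, 100.55, 100.73],
    'IsISSR': [0, 0, 1, 1],
})

mask = sample['IsISSR'] == 1            # boolean Series, one entry per row
sample_select = sample.loc[mask].copy() # .copy() detaches the result from `sample`

print(f"{len(sample_select)} of {len(sample)} rows kept")
print(sample_select.index.tolist())     # original row labels are preserved
```

The filtered frame keeps the row labels of the source frame, which is why the labels in `df_select` are not consecutive. If consecutive labels are wanted, `reset_index(drop=True)` renumbers them.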


### 7. Store Output to "partly-cloudy-common-area" Bucket

In [12]:
# List the files that are already in the output_bucket (i.e., "partly-cloudy-common-area")
conn = boto3.client('s3')
# The 'Contents' key is absent when no object matches the prefix, so default to an empty list
contents = conn.list_objects(Bucket= output_bucket, Prefix= subfolder).get('Contents', [])
for f in contents:
    print(f['Key'])

JuneFirst2020_24hr_issr.csv
hourly_issr_summary.csv
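
A single `list_objects` call returns at most 1,000 keys, so the listing would be incomplete once the bucket grows past that size. The sketch below follows pagination with boto3's `list_objects_v2` paginator. `list_keys` is a hypothetical helper, not part of the original notebook, and takes the client as an argument so it can be exercised without AWS credentials.

```python
def list_keys(s3_client, bucket, prefix=''):
    """Return every object key under `prefix`, following pagination."""
    keys = []
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches the prefix.
        for obj in page.get('Contents', []):
            keys.append(obj['Key'])
    return keys

# Usage in the notebook (requires AWS credentials):
# for key in list_keys(boto3.client('s3'), output_bucket, subfolder):
#     print(key)
```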


In [13]:
outputFileName = 'selectAfileName.csv' # <--------------------------------------------- Specify

wr.s3.to_csv(df_select, f"s3://{output_bucket}/{outputFileName}", index=False)
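
The write above does not use the `subfolder` variable set in Section 2, so the file lands at the bucket root. If a subfolder is wanted, the destination can be assembled as in the sketch below. `build_output_uri` is a hypothetical helper and `'issr/'` is an example subfolder name, neither of which appears in the original notebook.

```python
def build_output_uri(bucket, subfolder, file_name):
    """Join bucket, optional subfolder, and file name into an s3:// URI."""
    parts = [bucket, subfolder.strip('/'), file_name]
    # Empty parts (e.g. subfolder == '') are skipped so no '//' appears in the key.
    return 's3://' + '/'.join(p for p in parts if p)

print(build_output_uri('partly-cloudy-common-area', '', 'selectAfileName.csv'))
print(build_output_uri('partly-cloudy-common-area', 'issr/', 'selectAfileName.csv'))
```

The resulting string can be passed to `wr.s3.to_csv` in place of the f-string used above.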