<a href="https://colab.research.google.com/github/canfielder/DSBA-6190_Proj4_Serverless-Pipeline/blob/master/notebooks/Lambda_Output_Formatting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import

## Install

In [0]:
!pip -q install boto3
!pip -q install stringcase

## Packages

In [0]:
# General
import os
import io
import pandas as pd
import stringcase
from IPython.display import display, HTML

# AWS Connection
from google.colab import drive
import boto3


# Set Up AWS Connection

## Mount Google Drive

In [3]:
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [0]:
path = "/content/gdrive/My Drive/aws/credentials/"

aws_dir = os.listdir(path)

aws_credentials = aws_dir.pop(0)
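`os.listdir` returns entries in arbitrary order, so `pop(0)` only picks the right file while the folder holds a single file. A minimal sketch of selecting the file by name instead, assuming it is called `credentials` (as the printed path further down suggests); `find_credentials_file` is a hypothetical helper, not part of the original notebook:

```python
import os


def find_credentials_file(directory, name="credentials"):
    """Return the path to `name` inside `directory`, or raise if it is missing."""
    candidate = os.path.join(directory, name)
    if not os.path.isfile(candidate):
        raise FileNotFoundError(f"No file named {name!r} in {directory!r}")
    return candidate
```

In the notebook this could be called as `find_credentials_file(path)`, which yields the full source path directly.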

Establish the source location of the AWS credential file.

In [5]:
aws_credentials_src = os.path.join(path, aws_credentials)
# Escape spaces so the path survives interpolation into a shell command
aws_credentials_src = aws_credentials_src.replace(" ", "\\ ")
print(aws_credentials_src)

/content/gdrive/My\ Drive/aws/credentials/credentials
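Escaping spaces by hand covers this path, but other special characters would still slip through. As a hedged alternative, the standard library's `shlex.quote` quotes a whole string for shell use. The sketch below reuses the path printed above:

```python
import shlex

# Path taken from the notebook output above; it contains a space ("My Drive").
src = "/content/gdrive/My Drive/aws/credentials/credentials"

# shlex.quote wraps the string in single quotes when it contains characters
# the shell would otherwise treat specially.
quoted_src = shlex.quote(src)
print(quoted_src)
```

Note that the destination `~/.aws/credentials` should not be quoted this way, because the shell does not expand `~` inside quotes.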


Establish the destination of where to copy the AWS credential file.

In [6]:
aws_credentials_dst = "~/.aws/credentials"
print(aws_credentials_dst)

~/.aws/credentials


Copy the credentials from the mounted Google Drive to the local folder.

In [0]:
!mkdir -p ~/.aws && cp -r {aws_credentials_src} {aws_credentials_dst}
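The same copy can be done without the shell, which also removes the need to escape spaces in the source path. This is a minimal sketch using only the standard library; `copy_credentials` is a hypothetical helper name, and the default destination mirrors the one used above:

```python
import os
import shutil


def copy_credentials(src, dst="~/.aws/credentials"):
    """Copy a credentials file to dst, creating the parent directory if needed."""
    dst = os.path.expanduser(dst)  # expand "~" to the home directory
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy(src, dst)
    return dst
```

Called with the unescaped source path (the result of `os.path.join(path, aws_credentials)`), it returns the expanded destination path.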

Verify the credentials were correctly copied.

In [8]:
!ls -R {aws_credentials_dst}

/root/.aws/credentials


# Establish Boto3 Session

Creating the Boto3 session with a profile and a region means every client created from that session inherits both values, so the region does not have to be defined again downstream.

In [0]:
profile = 'dsba_6190_proj_4'
region = 'us-east-1'

session = boto3.Session(profile_name=profile, region_name=region)

# Processing

## Generate List of All Files

In [0]:
s3 = session.client(service_name="s3")
bucket = "dsba-6190-project4-serverless-data-engineering-pipeline"

The following code chunk comes from this blog post, accessed on 3/16/2020:

[https://alexwlchan.net/2017/07/listing-s3-keys/](https://alexwlchan.net/2017/07/listing-s3-keys/)


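The code generates a list of all files in a bucket. It is not recursive. An issue with S3 buckets is that a single `list_objects_v2` call returns at most 1,000 keys, so the function follows the continuation token until every page has been read.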

In [0]:
def get_all_s3_keys(bucket):
    """Get a list of all keys in an S3 bucket."""
    keys = []

    kwargs = {'Bucket': bucket}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        # 'Contents' is missing when the bucket (or page) has no objects
        for obj in resp.get('Contents', []):
            keys.append(obj['Key'])

        # Keep requesting pages until no continuation token is returned
        try:
            kwargs['ContinuationToken'] = resp['NextContinuationToken']
        except KeyError:
            break

    return keys

In [12]:
lambda_output = get_all_s3_keys(bucket)
lambda_output

['entity_barkley_marathons.csv',
 'entity_hardrock_hundred_mile_endurance_run.csv',
 'entity_leadville_trail_100.csv',
 'entity_ultra-trail_du_mont-blanc.csv',
 'entity_vermont_100_mile_endurance_run.csv',
 'entity_western_states_endurance_run.csv']
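
Boto3 clients also provide a built-in paginator that handles the continuation token internally. The sketch below shows that variant. `list_keys` takes the client as an argument, and `FakePaginator` and `FakeS3Client` are hypothetical stand-ins (not part of Boto3) that only exist so the example runs without AWS access:

```python
def list_keys(client, bucket):
    """Collect every key in `bucket` using the client's list_objects_v2 paginator."""
    keys = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from a page with no objects.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# --- Hypothetical stand-ins so the sketch runs without AWS access ---
class FakePaginator:
    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        return iter(self._pages)


class FakeS3Client:
    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        return FakePaginator(self._pages)


fake = FakeS3Client([
    {"Contents": [{"Key": "entity_a.csv"}, {"Key": "entity_b.csv"}]},
    {"Contents": [{"Key": "entity_c.csv"}]},
    {},  # an empty page has no "Contents" entry
])
print(list_keys(fake, "demo-bucket"))
```

With the real client from this notebook, the call would be `list_keys(s3, bucket)`.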

## Import CSV Files
The following function returns a dictionary of race names and dataframes, read from the associated race CSV files.

In [0]:
def import_csv(s3_bucket):
  """Return a dictionary mapping race names to dataframes read from S3."""
  # Generate list of files
  s3_key_list = get_all_s3_keys(s3_bucket)

  # Import each file into a dictionary keyed by race name
  df_dict = {}
  for item in s3_key_list:
    kwargs = {
      'Bucket': s3_bucket,  # use the argument, not the global `bucket`
      'Key': item
    }
    item_clipped = item.replace("entity_","").replace(".csv","")
    obj = s3.get_object(**kwargs)
    df_buffer = pd.read_csv(io.BytesIO(obj['Body'].read()))
    df_dict[item_clipped] = df_buffer

  return df_dict

In [0]:
df_dict = import_csv(bucket)
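
The chained `replace` calls in `import_csv` remove `entity_` and `.csv` wherever they occur in a key, not only at its start and end. That is harmless for the six keys listed above, but a stricter variant could look like the sketch below, where `race_name_from_key` is a hypothetical helper:

```python
def race_name_from_key(key, prefix="entity_", suffix=".csv"):
    """Strip a leading prefix and trailing suffix from an S3 key, if present."""
    if key.startswith(prefix):
        key = key[len(prefix):]
    if key.endswith(suffix):
        key = key[:-len(suffix)]
    return key


print(race_name_from_key("entity_barkley_marathons.csv"))
```

For a key such as `entity_my_entity_notes.csv`, this keeps the inner `entity_` and returns `my_entity_notes`, whereas the chained `replace` version would return `my_notes`.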

# Print Output
With the data for each race imported, we print the dataframe for each race. Some of these outputs will be used as images in the README for the GitHub repo.

In [15]:
for race in df_dict:
  race_print = race.replace("_"," ")
  race_print = stringcase.titlecase(race_print)
  print(race_print)
  df = df_dict[race]
  display(HTML(df.to_html()))
  print("")
  print("")

Barkley Marathons


Unnamed: 0,Text,Type,Score,BeginOffset,EndOffset
0,Barkley Marathons,ORGANIZATION,0.877157,4,21
1,Frozen Head State Park,LOCATION,0.99956,61,83
2,"Wartburg, Tennessee",LOCATION,0.941214,89,108




Hardrock Hundred Mile Endurance Run


Unnamed: 0,Text,Type,Score,BeginOffset,EndOffset
0,Hardrock Hundred Mile Endurance Run,EVENT,0.996339,4,39
1,100.5 miles,QUANTITY,0.999797,60,71
2,161.7 km,QUANTITY,0.999766,73,81
3,"33,000 feet",QUANTITY,0.999909,99,110
4,"10,000 m",QUANTITY,0.999669,112,120
5,"over 11,000 feet",QUANTITY,0.670727,158,174
6,"3,400 m",QUANTITY,0.999529,176,183




Leadville Trail 100


Unnamed: 0,Text,Type,Score,BeginOffset,EndOffset
0,Leadville Trail 100 Run,EVENT,0.880859,4,27
1,The Race Across The Sky,EVENT,0.886408,33,56
2,LT100,EVENT,0.921593,64,69
3,annually,QUANTITY,0.549734,96,104
4,"Leadville, Colorado",LOCATION,0.949732,142,161
5,Rocky Mountains,LOCATION,0.994487,188,203




Ultra Trail Du Mont Blanc


Unnamed: 0,Text,Type,Score,BeginOffset,EndOffset
0,Ultra,EVENT,0.635489,4,9
1,-Trail du Mont-Blanc,ORGANIZATION,0.662964,9,29
2,UTMB,ORGANIZATION,0.710219,31,35
3,single-stage,QUANTITY,0.8118,42,54
4,first,QUANTITY,0.937365,78,83
5,2003,DATE,0.999278,92,96




Vermont 100 Mile Endurance Run


Unnamed: 0,Text,Type,Score,BeginOffset,EndOffset
0,Vermont 100 Mile Endurance Run,EVENT,0.998929,4,34
1,Vermont 100,EVENT,0.978381,37,48
2,100-mile,QUANTITY,0.999935,56,64
3,162 km,QUANTITY,0.999739,66,72
4,July,DATE,0.998822,110,114
5,Silver Hill Meadow,LOCATION,0.998826,118,136
6,"West Windsor, Vermont",LOCATION,0.703432,140,161




Western States Endurance Run


Unnamed: 0,Text,Type,Score,BeginOffset,EndOffset
0,Western States Endurance Run,EVENT,0.993702,4,32
1,Western States 100,EVENT,0.996007,56,74
2,100-mile,QUANTITY,0.999894,81,89
3,161 km,QUANTITY,0.998989,91,97
4,California,LOCATION,0.997917,133,143
5,Sierra Nevada Mountains,LOCATION,0.986145,146,169
6,each year,QUANTITY,0.988641,177,186
7,last full weekend of June,DATE,0.899184,194,219
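
The race titles printed above come from `stringcase.titlecase`. If that extra dependency is unwanted, Python's built-in string methods give a similar result. This is a minimal sketch, and `display_title` is a hypothetical helper name:

```python
def display_title(race_key):
    """Turn a key such as 'barkley_marathons' into a display title."""
    return race_key.replace("_", " ").title()


print(display_title("barkley_marathons"))
print(display_title("leadville_trail_100"))
```

One difference from the output above: `str.title` keeps hyphens, so `ultra-trail_du_mont-blanc` becomes `Ultra-Trail Du Mont-Blanc` rather than `Ultra Trail Du Mont Blanc`.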




