<!--
#  Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
#    Licensed under the Apache License, Version 2.0 (the "License").
#    You may not use this file except in compliance with the License.
#    You may obtain a copy of the License at
#
#        http://www.apache.org/licenses/LICENSE-2.0
#
#    Unless required by applicable law or agreed to in writing, software
#    distributed under the License is distributed on an "AS IS" BASIS,
#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#    See the License for the specific language governing permissions and
#    limitations under the License.
-->

# Orchestration notebook for building the lake
***End-to-End Orchestration for Building Out a Data Lake in Orbit Workbench***

---

## Content
1. [Introduction](#Orchestration-notebook-for-building-the-lake)
2. [Set Up](#Set-Up)
 1. [Imports](#Imports)
 2. [Locate Bucket Paths](#Locate-Bucket-Paths)
 3. [Create Databases](#Create-Databases)
 4. [Get Parameters](#Get-Parameters)
3. [S3 Clean Up](#Step-2:-S3-Clean-Up)
4. [Extract Zip Files in Parallel](#Step-3:-Extract-Zip-Files-in-Parallel)
5. [Read the CSV Files and Create Glue tables with Parquet format according to schema](#Step-4:-Read-the-CSV-Files-and-Create-Glue-tables-with-Parquet-format-according-to-schema)
 1. [Connect to Spark and Access Cluster](#Connect-to-Spark-and-Access-Cluster)
 2. [Create Glue Tables](#Create-Glue-Tables)
 3. [Check Tables are Created](#Check-that-Glue-Tables-are-Created)

---

## Introduction
This notebook orchestrates the creation of the Data Lake end to end, starting with zipped CSV data files and finishing with a data lake on AWS. To do so, we execute a set of notebooks, one for each step of creating the lake:

* **Example-2-Extract-Files** - Extracting Zip Files Data to a Target s3 Bucket in Parallel
* **Example-3-Load-Database-Athena** - Find File Schema and Create Glue Tables with Parquet Output

To successfully run this notebook, you must be the "Lake Creator" user in your environment. Take a look at the following steps to see how to orchestrate your own data lake build, and feel free to look at the other 2 notebooks for a more in-depth understanding of the process!


## Set Up

#### Imports
First, let's import all of the modules we will need for building our Data Lake, including boto3, JSON, and the Orbit SDK controller. Let's store our session state so that we can create service clients for S3 and Glue.

Next, let's look up our workspace, which gives us our scratch bucket in S3, and check our team space (we **MUST** be the lake-creator to orchestrate our data lake!):


In [1]:
import os
import sys
import boto3
import time
import json
from aws_orbit_sdk import controller
from aws_orbit_sdk.common import get_workspace
from pathlib import Path

!env | grep AWS

# import aws.utils.notebooks.controller as controller
# from aws.utils.notebooks.common import get_workspace
# import aws.utils.notebooks.spark.emr as sparkConnection
my_session = boto3.session.Session()
my_region = my_session.region_name
s3 = boto3.client('s3')
glue = boto3.client('glue')

AWS_DEFAULT_REGION=us-west-2
AWS_ROLE_ARN=arn:aws:iam::495869084367:role/orbit-dev-env-lake-creator-role
AWS_ORBIT_S3_BUCKET=orbit-dev-env-toolkit-495869084367-004229
AWS_ORBIT_ENV=dev-env
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_ORBIT_TEAM_SPACE=lake-creator
AWS_STS_REGIONAL_ENDPOINTS=regional


In [2]:
workspace = get_workspace()
notebook_bucket = workspace['ScratchBucket']
team_space = workspace['team_space']
env_name = workspace['env_name']
workspace

{'base-image-address': '495869084367.dkr.ecr.us-west-2.amazonaws.com/orbit-dev-env-jupyter-user',
 'base-spark-image-address': '495869084367.dkr.ecr.us-west-2.amazonaws.com/orbit-dev-env-jupyter-user-spark',
 'bootstrap-s3-prefix': 'teams/lake-creator/bootstrap/',
 'container-defaults': {'cpu': 4, 'memory': 16384},
 'container-runner-arn': 'arn:aws:lambda:us-west-2:495869084367:function:orbit-dev-env-lake-creator-container-runner',
 'ecs-cluster-name': 'orbit-dev-env-lake-creator-cluster',
 'efs-ap-id': 'fsap-06f23cdb204106822',
 'efs-id': 'fs-ca758dce',
 'eks-k8s-api-arn': 'arn:aws:states:us-west-2:495869084367:stateMachine:orbit-dev-env-lake-creator-eks-k8s-api',
 'eks-nodegroup-role-arn': 'arn:aws:iam::495869084367:role/orbit-dev-env-lake-creator-role',
 'elbs': {},
 'final-image-address': '495869084367.dkr.ecr.us-west-2.amazonaws.com/orbit-dev-env-lake-creator',
 'final-spark-image-address': '495869084367.dkr.ecr.us-west-2.amazonaws.com/orbit-dev-env-lake-creator-spark',
 'grant-su

#### Locate Bucket Paths
Now, let's use AWS Systems Manager (SSM) Parameter Store to get the name of our lake bucket, and use the scratch bucket from our workspace as the users bucket
(**Note:** we will use these bucket names to locate and store data later on):

In [3]:
ssm = boto3.client('ssm')
def get_ssm_parameters(ssm_string, ignore_not_found=False):
    ssm = boto3.client('ssm')
    
    try:
        return json.loads(ssm.get_parameter(Name=ssm_string)['Parameter']['Value'])
    except Exception as e:
        if ignore_not_found:
            return {}
        else:
            raise e

        
def get_demo_configuration():
    return get_ssm_parameters(f"/orbit/{env_name}/demo", True)

demo_config = get_demo_configuration()
lake_bucket = demo_config.get("LakeBucket")
users_bucket = notebook_bucket
(lake_bucket,users_bucket)
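
The lookup above tolerates a missing parameter by returning an empty dict. A minimal sketch of that fallback pattern follows, with the SSM call replaced by a hypothetical `fetch` callable so it runs without AWS access. The names `load_json_parameter`, `fake_fetch`, and the sample parameter value are illustrative assumptions, not part of the Orbit SDK:

```python
import json
from typing import Callable, Dict


def load_json_parameter(fetch: Callable[[str], str], name: str,
                        ignore_not_found: bool = False) -> Dict:
    """Fetch a JSON-encoded parameter; optionally fall back to {} when it is missing."""
    try:
        return json.loads(fetch(name))
    except KeyError:
        if ignore_not_found:
            return {}
        raise


# Hypothetical in-memory store standing in for SSM Parameter Store.
_store = {"/orbit/dev-env/demo": '{"LakeBucket": "example-lake-bucket"}'}


def fake_fetch(name: str) -> str:
    # Raises KeyError when the parameter is absent.
    return _store[name]


demo = load_json_parameter(fake_fetch, "/orbit/dev-env/demo", True)
missing = load_json_parameter(fake_fetch, "/orbit/other-env/demo", True)
print(demo.get("LakeBucket"), missing.get("LakeBucket"))
```

With the fallback enabled, `.get("LakeBucket")` yields `None` when the parameter does not exist, so it may be worth checking `lake_bucket` before using it to build database locations.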


#### Create Databases
We have to create 3 databases: 1 for our raw data, 1 default database (both on the lake_bucket path), and 1 for users in our users_bucket.

We can do so with the following function, which:

  **1.** Deletes the existing database with the given name (if it exists)
  
  **2.** Creates a new one located in the designated S3 bucket

In [4]:
def create_db(name, location, description=''):
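    try:
        # Drop any existing database of the same name so we start clean.
        response = glue.delete_database(
            Name=name
        )
    except glue.exceptions.EntityNotFoundException:
        # Nothing to delete on a first run.
        pass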
    response = glue.create_database(
        DatabaseInput={
            'Name': name,
            'Description': description,
            'LocationUri': f's3://{location}/{name}'
        }
    )

In [5]:
create_db('cms_raw_db', lake_bucket,'lake: claims data from cms')
create_db('default', lake_bucket)
create_db('users', users_bucket)

#### Get Parameters
Lastly, we set the source path of our Zip files, the path to our extracted data folder, and the name of the database holding our raw data:

In [6]:
location = glue.get_database(Name='cms_raw_db')['Database']['LocationUri']
bucket = location[5:].split('/')[0]
(bucket, location)

('orbit-dev-env-scratch-495869084367-004229',
 's3://orbit-dev-env-scratch-495869084367-004229/cms_raw_db')
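
The slice `location[5:].split('/')[0]` above relies on the URI starting with `s3://`. A slightly more defensive sketch using only the standard library is shown below; `split_s3_uri` is a hypothetical helper name and the bucket is a placeholder:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into (bucket, key), rejecting other schemes."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key_prefix = split_s3_uri("s3://example-bucket/cms_raw_db")
print(bucket_name, key_prefix)
```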

In [7]:
sourcePrefix = "cms/"
sourceFolder = "landing/data/" + sourcePrefix
bucketName = bucket
extractedPrefix = "extracted/"
extractedFolder = "s3://{}/{}".format(bucketName,extractedPrefix)
database_name = "cms_raw_db"
team_space=workspace['team_space']

---
## Step 2: S3 Clean Up

Before we begin extracting our data and orchestrating our data lake, we must clean out any existing data files in S3. Let's remove the existing content in the following locations:
* Files stored in our Extracted Data Folder
* Data stored in our 'cms_raw_db' database
* Test output, so we can populate it with new test results


In [8]:
!aws s3 ls $extractedFolder
!aws s3 rm --recursive $extractedFolder
!aws s3 ls $extractedFolder

                           PRE Beneficiary_Summary/
                           PRE Carrier_Claims/
                           PRE Inpatient_Claims/
                           PRE Outpatient_Claims/
                           PRE Prescription_Drug_Events/
delete: s3://orbit-dev-env-scratch-495869084367-004229/extracted/Inpatient_Claims/DE1_0_2008_to_2010_Inpatient_Claims_Sample_1.csv
delete: s3://orbit-dev-env-scratch-495869084367-004229/extracted/Carrier_Claims/DE1_0_2008_to_2010_Carrier_Claims_Sample_1A.csv
delete: s3://orbit-dev-env-scratch-495869084367-004229/extracted/Beneficiary_Summary/DE1_0_2009_Beneficiary_Summary_File_Sample_1.csv
delete: s3://orbit-dev-env-scratch-495869084367-004229/extracted/Prescription_Drug_Events/DE1_0_2008_to_2010_Prescription_Drug_Events_Sample_1.csv
delete: s3://orbit-dev-env-scratch-495869084367-004229/extracted/Outpatient_Claims/DE1_0_2008_to_2010_Outpatient_Claims_Sample_1.csv
delete: s3://orbit-dev-env-scratch-495869084367-004229/extracted/Benefic

In [9]:
!aws s3 ls s3://$bucketName/$database_name/
!aws s3 rm --recursive  s3://$bucketName/$database_name/
!aws s3 ls s3://$bucketName/$database_name/

                           PRE Beneficiary_Summary/
                           PRE Carrier_Claims/
                           PRE Inpatient_Claims/
                           PRE Outpatient_Claims/
                           PRE Prescription_Drug_Events/
delete: s3://orbit-dev-env-scratch-495869084367-004229/cms_raw_db/Beneficiary_Summary/20210211_212112_00103_j65b6_6b48deb8-ca75-4877-bb8f-eddaad27bf7e
delete: s3://orbit-dev-env-scratch-495869084367-004229/cms_raw_db/Carrier_Claims/20210211_212114_00049_tceut_1743117b-298b-453f-a59e-3a514d076834
delete: s3://orbit-dev-env-scratch-495869084367-004229/cms_raw_db/Carrier_Claims/20210211_212114_00049_tceut_350661b3-e74e-465a-b9ef-e54710853cfe
delete: s3://orbit-dev-env-scratch-495869084367-004229/cms_raw_db/Carrier_Claims/20210211_212114_00049_tceut_05ea217c-3d3e-4dd5-98d6-c32e514d67ff
delete: s3://orbit-dev-env-scratch-495869084367-004229/cms_raw_db/Carrier_Claims/20210211_212114_00049_tceut_147d5ceb-bd50-4698-8614-8e9a90ac4f5b
delete: s3

***Let's Copy our Sample Data to our New S3 Bucket and Begin our Orchestration:***

In [10]:
res=!orbit list env --variable=toolkitbucket
dm_s3_bucket = res[0]
print(dm_s3_bucket)
!aws s3 sync s3://$dm_s3_bucket/data s3://$bucketName/landing/data

In [11]:
!aws s3 cp s3://$dm_s3_bucket/cms/schema s3://$bucketName/landing/cms/schema/ --recursive

copy: s3://orbit-dev-env-toolkit-495869084367-004229/cms/schema/Beneficiary_Summary.json to s3://orbit-dev-env-scratch-495869084367-004229/landing/cms/schema/Beneficiary_Summary.json
copy: s3://orbit-dev-env-toolkit-495869084367-004229/cms/schema/Carrier_Claims.json to s3://orbit-dev-env-scratch-495869084367-004229/landing/cms/schema/Carrier_Claims.json
copy: s3://orbit-dev-env-toolkit-495869084367-004229/cms/schema/Outpatient_Claims.json to s3://orbit-dev-env-scratch-495869084367-004229/landing/cms/schema/Outpatient_Claims.json
copy: s3://orbit-dev-env-toolkit-495869084367-004229/cms/schema/Prescription_Drug_Events.json to s3://orbit-dev-env-scratch-495869084367-004229/landing/cms/schema/Prescription_Drug_Events.json
copy: s3://orbit-dev-env-toolkit-495869084367-004229/cms/schema/Inpatient_Claims.json to s3://orbit-dev-env-scratch-495869084367-004229/landing/cms/schema/Inpatient_Claims.json


In [12]:
def get_schemas(source_bucket_name, prefix='', suffix=''):
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(name=source_bucket_name) 
    schemas = []
    # Let S3 filter by prefix instead of listing every object in the bucket.
    for o in bucket.objects.filter(Prefix=prefix):
        if o.key.endswith(suffix):
            name = os.path.basename(o.key).split(".")[0]
            schemaStr = o.get()['Body'].read().decode('utf-8') 
            schema = json.loads(schemaStr) #StructType.fromJson(json.loads(schemaStr))
            schemas.append((name, schema))
    return schemas

def get_schema(schemas, filename):
    for (schema_name, schema) in schemas:
        #print(f"{schema_name} in {filename} : {schema_name in filename}")
        if schema_name in filename:
            return schema_name, schema
    return None, None

schemas = get_schemas(bucketName, 'landing/cms/schema/')
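
`get_schema` pairs a file with a schema by checking whether the schema name appears anywhere in the file name, and it returns the first match. A small standalone sketch of that lookup is given below; the schema contents and file names are made-up placeholders:

```python
from typing import Any, List, Optional, Tuple


def match_schema(schemas: List[Tuple[str, Any]],
                 filename: str) -> Tuple[Optional[str], Optional[Any]]:
    """Return the first (name, schema) whose name is a substring of filename."""
    for name, schema in schemas:
        if name in filename:
            return name, schema
    return None, None


# Placeholder schemas; the real ones are loaded from JSON files in S3.
sample_schemas = [
    ("Inpatient_Claims", {"fields": []}),
    ("Carrier_Claims", {"fields": []}),
]

print(match_schema(sample_schemas, "landing/data/cms/Carrier_Claims_Sample.zip")[0])
print(match_schema(sample_schemas, "landing/data/cms/unknown.zip"))
```

Because the first substring match wins, a schema whose name is contained in another schema's name could shadow it, so distinct names are the safer choice.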

***
## Step 3: Extract Zip Files in Parallel

We have completed all of the necessary setup and S3 clean up and are ready to move into the first phase of our data lake orchestration. Here we will:

* Handle Zipped Files and Extract their CSV Data
* Migrate the Extracted Data back to s3 in a new Target Bucket
* Schedule Multiple Notebooks to Execute in Parallel

We will schedule separate runs of the **Example-2-Extract-Files** notebook and execute them in parallel. You can refer to that notebook, which goes step-by-step through handling zipped files, unzipping them, and migrating their content back to S3.

We will also define an error-checking function, **checkNotebooks**, to assert that our extraction was successful. It ensures that the number of executions matches what is expected and that there are no errors in our execution history:


In [13]:
def run_file_extraction():
    notebooks = []
    for key in s3.list_objects_v2(Bucket=bucketName, Prefix=sourceFolder)['Contents']:
        file = key['Key']
        schema = get_schema(schemas, file)
        s3_data_folder = os.path.join(extractedFolder, schema[0] if schema[0] else "")
        notebook = {
          "notebookName": "Example-2-Extract-Files.ipynb",
          "sourcePath": "/efs/shared/samples/notebooks/A-LakeCreator",
          "targetPath": "/efs/shared/regression/notebooks/A-LakeCreator",
          "params": {
            "bucketName": bucketName,
            "zipFileName": file,
            "targetFolder": s3_data_folder,
            "use_subdirs" : False if schema[0] else True
          },
        }
        notebooks.append(notebook)

    notebooksToRun = {
      "compute": {
          "container" : {
              "p_concurrent": "10"
          },
          "node_type": "fargate"
      },
      "tasks":  notebooks  
    }
    # notebooks
    containers = controller.run_notebooks(notebooksToRun)
    print (containers)
    controller.wait_for_tasks_to_complete([containers], 60,10, False)


In [14]:
extractedFolder

's3://orbit-dev-env-scratch-495869084367-004229/extracted/'

In [15]:
def checkNotebooks(executions, expected_count):
    assert len(executions) == expected_count
    for index, row in executions.iterrows():
        if 'error@' in row['relativePath']:
            raise AssertionError('error in ' + row['relativePath'])
    print("SUCCESS")
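
The same check can be sketched without a DataFrame. This version assumes each execution is a plain dict with a `relativePath` key (an assumption made for illustration, with hypothetical paths) and reports every failing path rather than only the first:

```python
from typing import Dict, List


def check_executions(rows: List[Dict[str, str]], expected_count: int) -> None:
    """Raise AssertionError on a count mismatch or on any path marked 'error@'."""
    if len(rows) != expected_count:
        raise AssertionError(
            f"expected {expected_count} executions, found {len(rows)}")
    failed = [r["relativePath"] for r in rows if "error@" in r["relativePath"]]
    if failed:
        raise AssertionError("error in " + ", ".join(failed))


# Hypothetical execution records.
ok_rows = [{"relativePath": "A-LakeCreator/run-1.ipynb"},
           {"relativePath": "A-LakeCreator/run-2.ipynb"}]
check_executions(ok_rows, 2)
print("SUCCESS")
```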

In [16]:
%%time

run_file_extraction()

INFO:root:using default profile {'display_name': 'Micro', 'slug': 'micro', 'description': '2 CPU + 2G MEM', 'properties': {'cpu_guarantee': 2, 'cpu_limit': 2, 'mem_guarantee': '2G', 'mem_limit': '2G'}, 'default': True}
INFO:root:Waiting for 1 tasks [{'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-fargate-runner-j4kx4', 'NodeType': 'fargate'}]
INFO:root:job-status={'active': None,
 'completion_time': None,
 'conditions': None,
 'failed': None,
 'start_time': None,
 'succeeded': None}
INFO:root:Running: 1 Completed: 0 Errored: 0
INFO:root:waiting...


{'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-fargate-runner-j4kx4', 'NodeType': 'fargate'}


INFO:root:job-status={'active': 1,
 'completion_time': None,
 'conditions': None,
 'failed': None,
 'start_time': datetime.datetime(2021, 2, 11, 21, 23, 36, tzinfo=tzlocal()),
 'succeeded': None}
INFO:root:Running: 1 Completed: 0 Errored: 0
INFO:root:waiting...
INFO:root:job-status={'active': 1,
 'completion_time': None,
 'conditions': None,
 'failed': None,
 'start_time': datetime.datetime(2021, 2, 11, 21, 23, 36, tzinfo=tzlocal()),
 'succeeded': None}
INFO:root:Running: 1 Completed: 0 Errored: 0
INFO:root:waiting...
INFO:root:job-status={'active': None,
 'completion_time': datetime.datetime(2021, 2, 11, 21, 25, 56, tzinfo=tzlocal()),
 'conditions': [{'last_probe_time': datetime.datetime(2021, 2, 11, 21, 25, 56, tzinfo=tzlocal()),
                 'last_transition_time': datetime.datetime(2021, 2, 11, 21, 25, 56, tzinfo=tzlocal()),
                 'message': None,
                 'reason': None,
                 'status': 'True',
                 'type': 'Complete'}],
 'failed': Non

CPU times: user 259 ms, sys: 11.8 ms, total: 271 ms
Wall time: 3min


In [17]:
!kubectl get job -n lake-creator

NAME                                      COMPLETIONS   DURATION   AGE
orbit-lake-creator-ec2-runner-526gr       1/1           33s        5m38s
orbit-lake-creator-ec2-runner-8fdh5       1/1           24s        5m38s
orbit-lake-creator-ec2-runner-f25px       1/1           32s        5m37s
orbit-lake-creator-ec2-runner-mp9cx       1/1           31s        5m37s
orbit-lake-creator-ec2-runner-xr4qn       1/1           28s        5m37s
orbit-lake-creator-fargate-runner-2k9ln   1/1           2m22s      16m
orbit-lake-creator-fargate-runner-j4kx4   1/1           2m20s      3m1s
orbit-lake-creator-fargate-runner-qtr59   1/1           2m24s      11m
orbit-lake-creator-fargate-runner-xd4j2   1/1           2m20s      14m


In [18]:
executions = controller.get_execution_history("/efs/shared/regression/notebooks/A-LakeCreator", "Example-2-Extract-Files.ipynb")
executions

Unnamed: 0,relativePath,timestamp,path
0,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:17:50.477,/efs/shared/regression/notebooks/A-LakeCreator...
1,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:25:55.139,/efs/shared/regression/notebooks/A-LakeCreator...
2,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:24:42.482,/efs/shared/regression/notebooks/A-LakeCreator...
3,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:17:22.367,/efs/shared/regression/notebooks/A-LakeCreator...
4,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:16:31.475,/efs/shared/regression/notebooks/A-LakeCreator...
5,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:25:01.725,/efs/shared/regression/notebooks/A-LakeCreator...
6,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:16:30.973,/efs/shared/regression/notebooks/A-LakeCreator...
7,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:24:43.697,/efs/shared/regression/notebooks/A-LakeCreator...
8,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:17:50.837,/efs/shared/regression/notebooks/A-LakeCreator...
9,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:24:41.841,/efs/shared/regression/notebooks/A-LakeCreator...


In [19]:
try: 
    checkNotebooks(executions, 8)
except AssertionError as e:
    print("Failed once, lets give one more try")
    !rm -rf /efs/shared/regression/notebooks/A-LakeCreator/Example-2*
    run_file_extraction()
    executions = controller.get_execution_history("/efs/shared/regression/notebooks/A-LakeCreator", "Example-2-Extract-Files.ipynb")
    checkNotebooks(executions,8)
    

Failed once, lets give one more try


INFO:root:using default profile {'display_name': 'Micro', 'slug': 'micro', 'description': '2 CPU + 2G MEM', 'properties': {'cpu_guarantee': 2, 'cpu_limit': 2, 'mem_guarantee': '2G', 'mem_limit': '2G'}, 'default': True}
INFO:root:Waiting for 1 tasks [{'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-fargate-runner-rhr87', 'NodeType': 'fargate'}]
INFO:root:job-status={'active': None,
 'completion_time': None,
 'conditions': None,
 'failed': None,
 'start_time': None,
 'succeeded': None}
INFO:root:Running: 1 Completed: 0 Errored: 0
INFO:root:waiting...


{'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-fargate-runner-rhr87', 'NodeType': 'fargate'}


INFO:root:job-status={'active': 1,
 'completion_time': None,
 'conditions': None,
 'failed': None,
 'start_time': datetime.datetime(2021, 2, 11, 21, 26, 38, tzinfo=tzlocal()),
 'succeeded': None}
INFO:root:Running: 1 Completed: 0 Errored: 0
INFO:root:waiting...
INFO:root:job-status={'active': 1,
 'completion_time': None,
 'conditions': None,
 'failed': None,
 'start_time': datetime.datetime(2021, 2, 11, 21, 26, 38, tzinfo=tzlocal()),
 'succeeded': None}
INFO:root:Running: 1 Completed: 0 Errored: 0
INFO:root:waiting...
INFO:root:job-status={'active': None,
 'completion_time': datetime.datetime(2021, 2, 11, 21, 29, tzinfo=tzlocal()),
 'conditions': [{'last_probe_time': datetime.datetime(2021, 2, 11, 21, 29, tzinfo=tzlocal()),
                 'last_transition_time': datetime.datetime(2021, 2, 11, 21, 29, tzinfo=tzlocal()),
                 'message': None,
                 'reason': None,
                 'status': 'True',
                 'type': 'Complete'}],
 'failed': None,
 'start_t

SUCCESS


***We will check our Output in our extractedFolder and split the output into an array of File Outputs:***

In [20]:
%%bash --out output --err error -s $extractedFolder

aws s3 ls "$1"

In [21]:
print(output)
files = output.split('\n')
print("total files: " + str(len(files)))

                           PRE Beneficiary_Summary/
                           PRE Carrier_Claims/
                           PRE Inpatient_Claims/
                           PRE Outpatient_Claims/
                           PRE Prescription_Drug_Events/

total files: 6
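
Note that the count above is 6 although only 5 prefixes are listed: splitting on `'\n'` also yields an empty string for the trailing newline. A minimal sketch that counts only the `PRE` entries is shown below; `list_prefixes` is a hypothetical helper and the sample text is shortened:

```python
from typing import List


def list_prefixes(ls_output: str) -> List[str]:
    """Extract prefix names from `aws s3 ls` output, skipping blank and file lines."""
    prefixes = []
    for line in ls_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("PRE "):
            prefixes.append(stripped[4:].strip())
    return prefixes


sample = "      PRE Beneficiary_Summary/\n      PRE Carrier_Claims/\n"
# Naive split counts the trailing empty string; the helper does not.
print(len(sample.split("\n")), len(list_prefixes(sample)))
```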



***
## Step 4: Read the CSV Files and Create Glue tables with Parquet format according to schema

Our Zipped files have been successfully extracted and their CSV data content is placed in our **extractedFolder**. We now must collect the schema for each of our data tables saved as CSV files and create Parquet output Glue Tables with that schema, located in our target directory on S3.


#### Create Glue Tables
We will now create our Parquet formatted Glue Tables by scheduling and executing the notebook, **Example-3-Load-Database-Athena**. For each extracted csv data file in our extracted folder our notebook will perform the following:

* Find the corresponding schema for each file

* Read the file using Spark

* Create external tables to produce the Glue table and Parquet output

You can refer to that notebook, which goes step-by-step through the process of schema detection and table creation. We will execute **run_glue_table_loading**, which launches the notebook in batches of concurrent containers (4 by default; the cell below passes 10), and then check the execution history to ensure that the code ran for all the data files:

In [22]:
load_data_notebook = "Example-3-Load-Database-Athena"

def run_glue_table_loading(concurrent_containers=4):
    containers = []
    found_schemas = []
    i = 0
    for key in s3.list_objects_v2(Bucket=bucketName, Prefix=extractedPrefix)['Contents']:
        file = key['Key']
        p = Path(file).parent
        schema = get_schema(schemas, file)
        if schema in found_schemas:
            continue
        i = i + 1
        print(f"Found schema: {schema[0]}")
        found_schemas.append(schema)
        notebooksToRun = {
          "compute": {
              "node_type": "ec2",
              "container": {
                  "p_concurrent" :1
              }
          },
          "tasks":  [
                {
                      "notebookName": f"{load_data_notebook}.ipynb",
                      "sourcePath": "/efs/shared/samples/notebooks/A-LakeCreator",
                      "targetPath": "/efs/shared/regression/notebooks/A-LakeCreator",
                      "targetPrefix": "unsecured-{}".format(i),
                      "params": {
                            "source_bucket_name" : bucketName,
                            "target_bucket_name" : bucketName,
                            "database_name" : "cms_raw_db",
                            "schema_dir" : "landing/cms/schema",
                            "file_path": str(p),
                            "region": my_region
                      }      
                }
          ],
          "env_vars": [
                {
                    'name': 'AWS_ORBIT_S3_BUCKET',
                    'value': bucketName
                }
          ]
        }

        t = time.localtime()
        current_time = time.strftime("%H:%M:%S", t)

        container = controller.run_notebooks(notebooksToRun)
        containers.append(container)
        print("task : ", current_time , str(container), "-->", notebooksToRun['tasks'][0]['params']['file_path'])
        if i%concurrent_containers == 0:
            print(f"Now waiting for {str(len(containers))} tasks to complete before spawning new ones")
            controller.wait_for_tasks_to_complete(containers, 60,30, False)
            print("task : ", containers, " done ")
            containers = []
    
    if len(containers) > 0: 
        print(f"Now waiting for {str(len(containers))} tasks to complete before spawning new ones")
        controller.wait_for_tasks_to_complete(containers, 60,15, False)
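
`run_glue_table_loading` launches one container per schema and pauses after every `concurrent_containers` launches until that batch finishes. The control flow can be sketched independently of Orbit as follows; `launch` and `wait` are hypothetical stand-ins for `controller.run_notebooks` and `controller.wait_for_tasks_to_complete`:

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
H = TypeVar("H")


def run_in_batches(items: Iterable[T], batch_size: int,
                   launch: Callable[[T], H],
                   wait: Callable[[List[H]], None]) -> int:
    """Launch items one by one, waiting whenever batch_size handles are pending."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending: List[H] = []
    batches = 0
    for item in items:
        pending.append(launch(item))
        if len(pending) == batch_size:
            wait(pending)
            batches += 1
            pending = []
    if pending:
        # Wait for the final, possibly smaller, batch.
        wait(pending)
        batches += 1
    return batches


waited: List[List[str]] = []
count = run_in_batches(["a", "b", "c", "d", "e"], 2,
                       launch=str.upper,
                       wait=lambda batch: waited.append(list(batch)))
print(count, waited)
```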

In [23]:
%%time

concurrent_containers = 10
for retry in range(0, 3):
    try: 
        !rm -rf /efs/shared/regression/notebooks/A-LakeCreator/$load_data_notebook/*
        run_glue_table_loading(concurrent_containers)
        executions = controller.get_execution_history("/efs/shared/regression/notebooks/A-LakeCreator", f"{load_data_notebook}.ipynb")  
        display(executions)
        checkNotebooks(executions,5)
        break
    except AssertionError as e:
        print(f"Failed {retry}, lets give one more try")
        #concurrent_containers = concurrent_containers - 1



Found schema: Beneficiary_Summary


INFO:root:using default profile {'display_name': 'Micro', 'slug': 'micro', 'description': '2 CPU + 2G MEM', 'properties': {'cpu_guarantee': 2, 'cpu_limit': 2, 'mem_guarantee': '2G', 'mem_limit': '2G'}, 'default': True}


task :  21:29:39 {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-575r5', 'NodeType': 'ec2'} --> extracted/Beneficiary_Summary
Found schema: Carrier_Claims


INFO:root:using default profile {'display_name': 'Micro', 'slug': 'micro', 'description': '2 CPU + 2G MEM', 'properties': {'cpu_guarantee': 2, 'cpu_limit': 2, 'mem_guarantee': '2G', 'mem_limit': '2G'}, 'default': True}


task :  21:29:40 {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-jlbjl', 'NodeType': 'ec2'} --> extracted/Carrier_Claims
Found schema: Inpatient_Claims


INFO:root:using default profile {'display_name': 'Micro', 'slug': 'micro', 'description': '2 CPU + 2G MEM', 'properties': {'cpu_guarantee': 2, 'cpu_limit': 2, 'mem_guarantee': '2G', 'mem_limit': '2G'}, 'default': True}


task :  21:29:40 {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-kdhng', 'NodeType': 'ec2'} --> extracted/Inpatient_Claims
Found schema: Outpatient_Claims


INFO:root:using default profile {'display_name': 'Micro', 'slug': 'micro', 'description': '2 CPU + 2G MEM', 'properties': {'cpu_guarantee': 2, 'cpu_limit': 2, 'mem_guarantee': '2G', 'mem_limit': '2G'}, 'default': True}
INFO:root:using default profile {'display_name': 'Micro', 'slug': 'micro', 'description': '2 CPU + 2G MEM', 'properties': {'cpu_guarantee': 2, 'cpu_limit': 2, 'mem_guarantee': '2G', 'mem_limit': '2G'}, 'default': True}
INFO:root:Waiting for 5 tasks [{'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-575r5', 'NodeType': 'ec2'}, {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-jlbjl', 'NodeType': 'ec2'}, {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-kdhng', 'NodeType': 'ec2'}, {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-9mrg8', 'NodeType': 'ec2'}, {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-l5vg2', 'NodeType': 'ec2'}]
INFO:root:job-status={'active': 1,
 'comp

task :  21:29:40 {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-9mrg8', 'NodeType': 'ec2'} --> extracted/Outpatient_Claims
Found schema: Prescription_Drug_Events
task :  21:29:40 {'ExecutionType': 'eks', 'Identifier': 'orbit-lake-creator-ec2-runner-l5vg2', 'NodeType': 'ec2'} --> extracted/Prescription_Drug_Events
Now waiting for 5 tasks to complete before spawning new ones


INFO:root:job-status={'active': None,
 'completion_time': datetime.datetime(2021, 2, 11, 21, 30, 12, tzinfo=tzlocal()),
 'conditions': [{'last_probe_time': datetime.datetime(2021, 2, 11, 21, 30, 12, tzinfo=tzlocal()),
                 'last_transition_time': datetime.datetime(2021, 2, 11, 21, 30, 12, tzinfo=tzlocal()),
                 'message': None,
                 'reason': None,
                 'status': 'True',
                 'type': 'Complete'}],
 'failed': None,
 'start_time': datetime.datetime(2021, 2, 11, 21, 29, 40, tzinfo=tzlocal()),
 'succeeded': 1}
INFO:root:Running: 0 Completed: 1 Errored: 0
INFO:root:All tasks stopped


Unnamed: 0,relativePath,timestamp,path
0,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:30:09.997,/efs/shared/regression/notebooks/A-LakeCreator...
1,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:30:07.226,/efs/shared/regression/notebooks/A-LakeCreator...
2,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:30:12.345,/efs/shared/regression/notebooks/A-LakeCreator...
3,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:30:11.659,/efs/shared/regression/notebooks/A-LakeCreator...
4,/efs/shared/regression/notebooks/A-LakeCreator...,2021-02-11 21:30:11.156,/efs/shared/regression/notebooks/A-LakeCreator...


SUCCESS
CPU times: user 469 ms, sys: 59.5 ms, total: 529 ms
Wall time: 1min 1s


#### Check that Glue Tables are Created
We will ensure that each data set now has both a raw table and a Parquet table (10 tables in total for the 5 data sets we extracted earlier):

In [24]:
glue = boto3.client('glue')
res = glue.get_tables(DatabaseName='cms_raw_db')
tables = res['TableList']
raw_count = 0
parq_count = 0
for t in tables:
    if t['Name'].endswith('_raw'):
        raw_count += 1
    else:
        parq_count += 1
    print(t['Name'])
print(f"Total tables: {str(len(tables))}. Raw tables: {raw_count}. Final tables: {parq_count}")
assert raw_count == parq_count and raw_count > 0
!echo "PASSED" >> /efs/shared/regression/PASSED
!ls /efs/shared/regression

beneficiary_summary
beneficiary_summary_raw
carrier_claims
carrier_claims_raw
inpatient_claims
inpatient_claims_raw
outpatient_claims
outpatient_claims_raw
prescription_drug_events
prescription_drug_events_raw
Total tables: 10. Raw tables: 5. Final tables: 5
notebooks  PASSED
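
The assertion above compares counts only, so it would also pass if a raw table and a final table with unrelated names happened to balance out. A stricter check, sketched under the assumption that every raw table is named `<final>_raw` (the helper name `unpaired_tables` is hypothetical), compares the names themselves:

```python
from typing import Iterable, List


def unpaired_tables(names: Iterable[str], raw_suffix: str = "_raw") -> List[str]:
    """Return base names that have a raw table or a final table, but not both."""
    all_names = set(names)
    raw = {n[: -len(raw_suffix)] for n in all_names if n.endswith(raw_suffix)}
    final = {n for n in all_names if not n.endswith(raw_suffix)}
    return sorted(raw ^ final)


print(unpaired_tables(["carrier_claims", "carrier_claims_raw", "inpatient_claims_raw"]))
```

An empty list means every table has its counterpart.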


In [25]:
!kubectl get job -n lake-creator

NAME                                      COMPLETIONS   DURATION   AGE
orbit-lake-creator-ec2-runner-526gr       1/1           33s        9m43s
orbit-lake-creator-ec2-runner-575r5       1/1           28s        62s
orbit-lake-creator-ec2-runner-8fdh5       1/1           24s        9m43s
orbit-lake-creator-ec2-runner-9mrg8       1/1           32s        62s
orbit-lake-creator-ec2-runner-f25px       1/1           32s        9m42s
orbit-lake-creator-ec2-runner-jlbjl       1/1           32s        62s
orbit-lake-creator-ec2-runner-kdhng       1/1           31s        62s
orbit-lake-creator-ec2-runner-l5vg2       1/1           33s        62s
orbit-lake-creator-ec2-runner-mp9cx       1/1           31s        9m42s
orbit-lake-creator-ec2-runner-xr4qn       1/1           28s        9m42s
orbit-lake-creator-fargate-runner-2k9ln   1/1           2m22s      20m
orbit-lake-creator-fargate-runner-j4kx4   1/1           2m20s      7m6s
orbit-lake-creator-fargate-runner-qtr59   1/1           2m24s     



# End of notebook

**Congratulations! You have just built your very own Data Lake with countless Amazon Web Services integrations through AWS Orbit Workbench**

<img style="width:100px;height:100px;border:0;" src="https://images-na.ssl-images-amazon.com/images/I/71pHyDfdXwL._SL1500_.jpg" />