# LakeFormation Example Notebook
***Creating LakeFormation and Secured Database with Granular Security Access***

___


## Contents

1. [Introduction](#Introduction)
2. [Setup](#Setup)
  1. [Imports](#Imports)
  2. [Create Low-Level Clients](#Create-Low-Level-Clients)
  3. [Athena Connection](#Athena-Connection)
3. [Create Secured Database](#Create-Secured-Database)
  1. [Create Database In Glue](#Create-Database-In-Glue)
  2. [Create Tables](#Create-Tables)
4. [Adding Lakeformation policy tags to the resources - Database, Tables and Columns](#Adding-Lakeformation-policy-tags-to-the-resources---Database,-Tables-and-Columns.)
  1. [Database Level Tagging](#Database-Level-Tagging)
  2. [Table Level Tagging](#Table-Level-Tagging)
  3. [Column Level Tagging](#Column-Level-Tagging)
5. [Securing the Database Using LakeFormation](#Securing-the-Database-Using-LakeFormation)
  1. [Registering Database](#Registering-Database)


___
## Introduction

This notebook dives deep into tag-based security access in AWS Lake Formation. It illustrates the following:

* Ability to create a new database that is secured by AWS Lake Formation and managed by the AWS Glue Data Catalog.

* Ability to tag databases, tables and columns with user-defined security tags.


This is the second step in setting up our Data Lake before we can securely start analyzing our data, typically through reporting, visualization, advanced analytics and machine learning methodologies.

---

#### Author: AWS Professional Services Emerging Technology and Intelligent Platforms Group
#### Date: June 10 2021


## Setup

#### Imports and Parameters
First, let's import all of the modules we will need for our lake formation, including Pandas DataFrames, Athena, etc. Let's store our session state so that we can create service clients and resources later on.

Next, let's define the location of our unsecured database and of our secured database, and assert that we are indeed the lake-creator
(**Note:** We cannot run this notebook if we are not the lake-creator):

In [None]:
import json
import boto3
from pandas import DataFrame
# Import orbit helpers
from aws_orbit_sdk.database import get_athena
from aws_orbit_sdk.common import get_workspace

my_session = boto3.session.Session()
my_region = my_session.region_name
print(my_region)

In [None]:
# Clients
lfc = boto3.client('lakeformation')
iamc = boto3.client('iam')
ssmc = boto3.client('ssm')
gluec = boto3.client('glue')


In [None]:
workspace = get_workspace()

catalog_id = workspace['EksPodRoleArn'].split(':')[-2] 
orbit_lake_creator_role_arn = workspace['EksPodRoleArn']
orbit_env_admin_role_arn = orbit_lake_creator_role_arn.replace("-lake-creator-role", "-admin")
env_name = workspace['env_name']
team_space = workspace['team_space']
assert team_space == 'lake-creator'
workspace
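The `catalog_id` above relies on the account ID being the second-to-last colon-separated field of the role ARN. Below is a minimal sketch of the same parsing with a sanity check, assuming the standard `arn:partition:service:region:account-id:resource` layout. The helper name `account_id_from_arn` and the sample ARN are illustrative and not part of the Orbit SDK.

```python
def account_id_from_arn(arn: str) -> str:
    """Return the 12-digit account ID embedded in an ARN."""
    # maxsplit=5 keeps any colons inside the resource part intact.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    account_id = parts[4]
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"ARN carries no account ID: {arn!r}")
    return account_id


# Hypothetical role ARN, for illustration only.
sample_arn = "arn:aws:iam::123456789012:role/orbit-dev-lake-creator-role"
print(account_id_from_arn(sample_arn))
```

In the notebook this could be called as `account_id_from_arn(workspace['EksPodRoleArn'])`, which fails loudly if the workspace ever returns an ARN without an account field.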

In [None]:
# Define parameters
unsecured_glue_db = f"cms_raw_db_{env_name}".replace('-', '_')
secured_glue_db = f"cms_secured_db_{env_name}".replace('-', '_')

#### Create Low-Level Clients
We created clients for the AWS services we need above: Lake Formation, IAM, Glue, and AWS Systems Manager (SSM). We will now use the SSM client to get the location of our secured bucket:


In [None]:
def get_ssm_parameters(ssm_string, ignore_not_found=False):
    try:
        return json.loads(ssmc.get_parameter(Name=ssm_string)['Parameter']['Value'])
    except Exception as e:
        if ignore_not_found:
            return {}
        else:
            raise e

def get_demo_configuration():
    return get_ssm_parameters(f"/orbit/{env_name}/demo", True)

demo_config = get_demo_configuration()
lake_bucket = demo_config.get("LakeBucket").split(':::')[1]
secured_lake_bucket = demo_config.get("SecuredLakeBucket").split(':::')[1]
secured_location = f"s3://{secured_lake_bucket}/{secured_glue_db}/"

(lake_bucket,secured_lake_bucket, secured_location)
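Because `get_demo_configuration()` returns an empty dict when the SSM parameter is missing, `demo_config.get("LakeBucket")` can be `None`, and the `.split` call then fails with an `AttributeError` that does not name the real problem. A small sketch of a more explicit parser follows; the helper name `bucket_name_from_arn` and the sample bucket are assumptions made for illustration.

```python
def bucket_name_from_arn(bucket_arn):
    """Extract the bucket name from an S3 bucket ARN such as arn:aws:s3:::name."""
    if not bucket_arn:
        raise ValueError("bucket ARN is missing from the demo configuration")
    # S3 bucket ARNs have empty region and account fields, hence ':::'.
    _, separator, name = bucket_arn.partition(":::")
    if not separator or not name:
        raise ValueError(f"not an S3 bucket ARN: {bucket_arn!r}")
    return name


# Placeholder ARN, not a real bucket.
print(bucket_name_from_arn("arn:aws:s3:::example-secured-bucket"))
```

Used in place of the inline `.split(':::')[1]`, this turns a missing `/orbit/{env_name}/demo` parameter into a clear error message.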

#### Athena Connection
Our last setup step is to connect to Athena with a default database and check our connection by running a simple SQL query in our notebook:

In [None]:
%reload_ext sql
%config SqlMagic.autocommit=False # for engines that do not support autocommit
athena = get_athena()
%connect_to_athena -database default

In [None]:
%%sql

SELECT 1 as "Test"


# Create Secured Database

Let's begin by deregistering our secured bucket ARN, if it is registered, so that Lake Formation removes the path from the inline policy attached to the service-linked role.

**Note:** We will re-register the bucket location later so that Lake Formation permissions provide fine-grained access control to AWS Glue Data Catalog objects.

Afterwards, let's drop our secured Glue database if it exists and empty its S3 location to prepare for creating the new database (the **CASCADE** clause tells Athena to drop all tables along with the database):


In [None]:
# Deregister the Lake Formation location if it is already registered
try:
    deregister_resource_response = lfc.deregister_resource(ResourceArn=f"arn:aws:s3:::{secured_lake_bucket}")
    print(deregister_resource_response['ResponseMetadata']['HTTPStatusCode'])
except Exception as e:
    print("location was not yet registered")
    print(e)
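The cell above treats every exception as "not yet registered", which can hide unrelated failures such as missing permissions. Here is a hedged sketch that only swallows the not-found case, assuming Lake Formation reports an unregistered location as `EntityNotFoundException` (the `DeregisterResource` API lists it among its errors). The function name is hypothetical, and the client is passed in so the logic can be exercised without AWS access.

```python
def deregister_if_registered(lf_client, bucket_name):
    """Deregister an S3 bucket from Lake Formation.

    Returns True if the location was registered and is now removed,
    False if it was not registered. Any other error propagates.
    """
    resource_arn = f"arn:aws:s3:::{bucket_name}"
    try:
        lf_client.deregister_resource(ResourceArn=resource_arn)
    except lf_client.exceptions.EntityNotFoundException:
        return False
    return True
```

With the notebook's client this would be called as `deregister_if_registered(lfc, secured_lake_bucket)`.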

In [None]:
# Drop the previously created database and clean its S3 location

%sql drop database if exists $secured_glue_db CASCADE
!aws s3 rm --recursive $secured_location --quiet

#### Create Database In Glue
We are all set to create our secured database in our secured S3 location by running an Athena SQL query. We will then quickly check our database list to ensure it was created successfully:

In [None]:
try:
    gluec.get_database(Name=secured_glue_db)
except gluec.exceptions.EntityNotFoundException as err:
    print(f"Database {secured_glue_db} doesn't exist. Creating {secured_glue_db}")
    create_db = f"create database {secured_glue_db} LOCATION '{secured_location}'"
    athena.current_engine.execute(create_db)

In [None]:
%sql show databases

## Create Tables
It's time to create new tables in our secured database from the data in our unsecured database. We will run a load_tables() function which iterates over all of the tables.

The load_tables() function performs the following steps:

- Retrieves the definitions of all the tables in our unsecured db as a list of Table objects
- For each table object, creates a new Parquet-formatted table in our secured database, stored in our secured S3 location
- Runs a count query on the secured table to check that the creation was successful

In [None]:
import time

def load_tables():
    response = gluec.get_tables(
        DatabaseName=unsecured_glue_db
    )
    response
    for table in response['TableList']:
        createTable = """
                CREATE TABLE {}.{}
                WITH (
                    format = 'Parquet',
                    parquet_compression = 'SNAPPY',
                    external_location = '{}/{}'
                )
                AS
                (select * from {}.{})                      
            """.format(secured_glue_db,table['Name'], secured_location,table['Name'],unsecured_glue_db,table['Name'])

        print(f'creating table {table["Name"]}...')
        athena.current_engine.execute(createTable)
        print(f'created table {table["Name"]}')
        query = f"select count(*) as {table['Name']}_count from {secured_glue_db}.{table['Name']}"
        try:
            res = athena.current_engine.execute(query)
        except Exception as err:
            print("Unexpected error:", type(err).__name__, err)
            print("Try again to run query...")
            %sql drop database if exists $secured_glue_db CASCADE 
            !aws s3 rm --recursive $secured_location --quiet
            !sleep 10s
            # try one more time
            res = athena.current_engine.execute(query)

        df = DataFrame(res.fetchall())
        print(df)


In [None]:
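for i in range(0, 3):
    try:
        load_tables()
        break  # stop retrying once the tables have loaded
    except Exception as err:
        # wait, then try again
        print(f"load_tables attempt {i + 1} failed: {err}")
        time.sleep(60)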
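A retry loop like the one above should stop as soon as a call succeeds and should surface the last error if every attempt fails. Here is a generic sketch of that pattern. The `retry` helper and its parameters are assumptions for illustration, and the `sleep` argument exists only so the wait can be replaced in tests.

```python
import time


def retry(action, attempts=3, delay_seconds=60, sleep=time.sleep):
    """Call action() until it succeeds or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as err:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            print(f"attempt {attempt} failed ({err}); retrying in {delay_seconds}s")
            sleep(delay_seconds)
```

In the notebook this would replace the loop with `retry(load_tables)`.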

In [None]:
%%sql

SHOW TABLES IN {secured_glue_db};

# Adding Lakeformation policy tags to the resources - Database, Tables and Columns.

Our secured database is filled with all of our data, but we must now configure security and access permissions for our different tables. By default, columns in a table have the lowest security tagging. To fix this, we must tag the columns and tables with higher security access.

**Note:** Policy tag usage in this example: sec-1 (more secure) > sec-5 (less secure)
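Lake Formation treats tag values as plain labels, so the ordering in the note is a convention of this example and has to be encoded by whoever grants access. The sketch below shows one way to express it: a principal cleared for a given level may read that level and every less secure one. The helper names and the full `sec-1` to `sec-5` range are assumptions; the notebook itself only applies `sec-2`, `sec-4` and `sec-5`.

```python
# Assumed full range of values for the security-level tag.
ALL_LEVELS = ["sec-1", "sec-2", "sec-3", "sec-4", "sec-5"]


def security_level(tag_value):
    """Parse 'sec-N' into the integer N (lower N means more secure)."""
    prefix, _, number = tag_value.partition("-")
    if prefix != "sec" or not number.isdigit():
        raise ValueError(f"unexpected tag value: {tag_value!r}")
    return int(number)


def readable_tag_values(clearance, all_levels=ALL_LEVELS):
    """Tag values a principal with the given clearance is allowed to read."""
    limit = security_level(clearance)
    return [value for value in all_levels if security_level(value) >= limit]


print(readable_tag_values("sec-4"))
```

A list like this is what a tag-based grant for that principal would need to enumerate.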

In [None]:
orbit_env_lf_tag_key = workspace['env_name']+'-security-level'

# Database Level Tagging

Adding a policy tag to the database allows all of its tables, and their columns, to inherit the tag.

In [None]:
db_add_lf_tags_to_resource_response = lfc.add_lf_tags_to_resource(
    CatalogId=catalog_id,
    Resource={
        'Database': {
            'CatalogId': catalog_id,
            'Name': secured_glue_db
        },
    },
    LFTags=[
        {
            'CatalogId': catalog_id,
            'TagKey': orbit_env_lf_tag_key,
            'TagValues': [
                'sec-5',
            ]
        },
    ]
)


In [None]:
assert 200 == db_add_lf_tags_to_resource_response['ResponseMetadata']['HTTPStatusCode']
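An HTTP 200 status only shows that the request was accepted. The `add_lf_tags_to_resource` response also carries a `Failures` list with any tags that could not be attached. The following sketch summarises that list; the helper name and the sample responses are made up for illustration.

```python
def tagging_failures(response):
    """Return one readable line per tag that Lake Formation failed to attach."""
    return [
        f"{failure['LFTag']['TagKey']}: "
        f"{failure.get('Error', {}).get('ErrorMessage', 'unknown error')}"
        for failure in response.get("Failures", [])
    ]


# Illustrative responses, not captured from a real call.
ok_response = {"Failures": [], "ResponseMetadata": {"HTTPStatusCode": 200}}
bad_response = {
    "Failures": [
        {
            "LFTag": {"TagKey": "dev-security-level", "TagValues": ["sec-9"]},
            "Error": {"ErrorCode": "InvalidInput", "ErrorMessage": "tag value not defined"},
        }
    ],
    "ResponseMetadata": {"HTTPStatusCode": 200},
}
print(tagging_failures(ok_response))
print(tagging_failures(bad_response))
```

In the notebook the check could read `assert not tagging_failures(db_add_lf_tags_to_resource_response)`.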

# Table Level Tagging

One way to increase security is to tag an entire table with a higher security level. Here we give the `inpatient_claims` table a sec-4 security level, which overrides the tag inherited from the database.

In [None]:
table_add_lf_tags_to_resource_response = lfc.add_lf_tags_to_resource(
    CatalogId=catalog_id,
    Resource={
        'Table': {
            'CatalogId': catalog_id,
            'DatabaseName': secured_glue_db,
            'Name': 'inpatient_claims',
        },
    },
    LFTags=[
        {
            'CatalogId': catalog_id,
            'TagKey': orbit_env_lf_tag_key,
            'TagValues': [
                'sec-4',
            ]
        },
    ]
)

In [None]:
assert 200 == table_add_lf_tags_to_resource_response['ResponseMetadata']['HTTPStatusCode']

## Column Level Tagging

We tag two columns, 'sp_depressn' and 'sp_diabetes', with a higher security level (sec-2), while the rest of the table keeps security level sec-5 (inherited from the database):

In [None]:
table_columns_add_lf_tags_to_resource_response = lfc.add_lf_tags_to_resource(
    CatalogId=catalog_id,
    Resource={
        'TableWithColumns': {
            'CatalogId': catalog_id,
            'DatabaseName': secured_glue_db,
            'Name': 'beneficiary_summary',
            'ColumnNames': [
                'sp_depressn',
                'sp_diabetes'
            ]
        },
    },
    LFTags=[
        {
            'CatalogId': catalog_id,
            'TagKey': orbit_env_lf_tag_key,
            'TagValues': [
                'sec-2',
            ]
        },
    ]
)

In [None]:
assert 200 == table_columns_add_lf_tags_to_resource_response['ResponseMetadata']['HTTPStatusCode']
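The three tagging calls above differ only in the `Resource` they target. Below is a sketch of a helper that builds the keyword arguments for any of the three cases, so the shape of each request can be checked without calling AWS. The function name `lf_tag_request` and the sample values are hypothetical.

```python
def lf_tag_request(catalog_id, tag_key, tag_value, database, table=None, columns=None):
    """Build kwargs for add_lf_tags_to_resource at database, table or column level."""
    if columns and table is None:
        raise ValueError("columns can only be tagged within a named table")
    if table is None:
        resource = {"Database": {"CatalogId": catalog_id, "Name": database}}
    else:
        target = {"CatalogId": catalog_id, "DatabaseName": database, "Name": table}
        if columns:
            target["ColumnNames"] = list(columns)
            resource = {"TableWithColumns": target}
        else:
            resource = {"Table": target}
    return {
        "CatalogId": catalog_id,
        "Resource": resource,
        "LFTags": [
            {"CatalogId": catalog_id, "TagKey": tag_key, "TagValues": [tag_value]}
        ],
    }


# Placeholder account ID, tag key and database name.
request = lf_tag_request(
    "123456789012", "dev-security-level", "sec-2",
    database="cms_secured_db_dev",
    table="beneficiary_summary",
    columns=["sp_depressn", "sp_diabetes"],
)
print(request["Resource"])
```

The result would be passed to the client as `lfc.add_lf_tags_to_resource(**request)`.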


---
## Securing the Database Using LakeFormation

Lastly, after tagging the tables in our database, we have a few more steps to finalize our Lake Formation setup.

#### Registering Database

Registering our S3 bucket ARN registers the resource as managed by the Data Catalog. Setting **UseServiceLinkedRole=True** designates an AWS IAM service-linked role by registering this role with the Data Catalog.

Lake Formation can now access our secured bucket and work with our data:

In [None]:
reg_s3_location_response = lfc.register_resource(ResourceArn=f"arn:aws:s3:::{secured_lake_bucket}",UseServiceLinkedRole=True)


In [None]:
assert 200 == reg_s3_location_response['ResponseMetadata']['HTTPStatusCode']

#### Revoking IAM Default Permissions

In our default account settings, we are using "Use only IAM access control for new databases". Therefore, our new database grants Super access to all IAM users. In the next cells, we will revoke these privileges to leave only the specific Orbit Lake User IAM role.

In [None]:
def revoke_database_tables_super_permissions(database_name):
    response = gluec.get_tables(
        DatabaseName=database_name
    )
    for table in response['TableList']:
        try:
            response = lfc.revoke_permissions(
                Principal={
                    'DataLakePrincipalIdentifier': 'IAM_ALLOWED_PRINCIPALS'
                },
                Resource={
                    'Table': {
                        'DatabaseName': database_name,
                        'Name': table['Name']
                    }
                },
                Permissions=[
                    'ALL'
                ]
            )
        except lfc.exceptions.InvalidInputException as err:
            print(err)
revoke_database_tables_super_permissions(secured_glue_db)
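Glue's `get_tables` returns results in pages, so a single call may not list every table in a large database. Here is a sketch of a generator that follows `NextToken` until the listing is complete. The function name is hypothetical, and the client is passed in as an argument so the paging logic can be tried against a stub.

```python
def iter_table_names(glue_client, database_name):
    """Yield every table name in a Glue database, following pagination."""
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page["TableList"]:
            yield table["Name"]
        token = page.get("NextToken")
        if not token:
            break
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token
```

For example, `for name in iter_table_names(gluec, secured_glue_db): ...` could drive the per-table revoke calls.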

In [None]:
def revoke_database_super_permissions(database_name):
    try:
        response = lfc.revoke_permissions(
            Principal={
                'DataLakePrincipalIdentifier': 'IAM_ALLOWED_PRINCIPALS'
            },
            Resource={
                'Database': {
                    'CatalogId': catalog_id,
                    'Name': database_name
                },
            },
            Permissions=[
                'ALL'
            ]
        )
    except lfc.exceptions.InvalidInputException as err:
        print(err)
revoke_database_super_permissions(secured_glue_db)


In [None]:
# Grant DROP to the lake-creator role so it can clean up the database later.
def grant_creator_drop_permission(database_name):
    response = lfc.grant_permissions(
        CatalogId=catalog_id,
        Principal={
            'DataLakePrincipalIdentifier': orbit_lake_creator_role_arn
        },
        Resource={
            'Database': {
                'CatalogId': catalog_id,
                'Name': database_name
            }
        },
        Permissions=[
            'DROP'
        ]
    )
    print(response)
grant_creator_drop_permission(secured_glue_db)



# Quick check on the created tables.


In [None]:
%reload_ext sql
%config SqlMagic.autocommit=False # for engines that do not support autocommit
athena = get_athena()


In [None]:
%connect_to_athena -database $secured_glue_db


In [None]:
%sql select * from {secured_glue_db}.inpatient_claims limit 1

In [None]:
%sql select sp_depressn, sp_diabetes from {secured_glue_db}.beneficiary_summary limit 1

In [None]:
%sql select clm_pmt_amt, nch_prmry_pyr_clm_pd_amt from {secured_glue_db}.outpatient_claims limit 1

# End of orbit lake creator demo notebook.