# ANT404 Lab #4: Set Up AWS Lake Formation

### Why use AWS Lake Formation?

AWS Lake Formation makes it easier to build, secure, and manage a data lake. It simplifies and automates the steps required to create a data lake, especially cataloging data and making it available securely for analytics and machine learning.

Lake Formation provides a permissions model that can be enforced at the table and column level and works across the full portfolio of AWS analytics and machine learning services, including Amazon Redshift Spectrum and Amazon Athena.

This centrally defined permissions model enables fine-grained access to data stored in your data lake through a simple grant/revoke mechanism.

The AWS Glue Data Catalog integrates these data access policies, so they are enforced regardless of the data’s origin.

<img src="https://docs.aws.amazon.com/lake-formation/latest/dg/images/overview-diagram.png" width="700"  />



### Steps for creating a Lake Formation Data Lake
You will complete the following steps to create your Lake Formation data lake.
* Attach `AWSLakeFormationDataAdmin` policy to your current user
* Create a new policy and role for Redshift to access Lake Formation and Glue
* Copy your parquet data to a new Data Lake bucket
* Create a new external database and new tables in AWS Glue
* Register the bucket with Lake Formation and grant `SELECT` table permissions
* Create a Redshift external schema for the Lake Formation data lake
* Check that the restricted columns are _NOT_ visible in new tables
* Compare the output of the previous tables to new Lake Formation tables


## 1. Check for credentials file
Check for the credentials created in the `START_HERE` notebook.

In [None]:
%%bash
cat ant404-lab.creds

## 2. Set local variables from credentials file
Run this cell to import the credentials created in the `START_HERE` notebook into this notebook. Later cells rely on these variables.

In [None]:
import simplejson
with open("ant404-lab.creds") as fh:
    creds = simplejson.loads(fh.read())
username=creds["user_name"]
password=creds["password"]
host_name=creds["host_name"]
port_num=creds["port_num"]
db_name=creds["db_name"]

# Example Account, Region, and Cluster values for this lab
log_account=123456789101
region="us-east-1"
cluster_name="reporting-cluster"

# Default date values used to get sample files
audit_year=2019
audit_month=11
audit_day=10 

%set_env username={username}
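
If the credentials file is incomplete, the cell above fails with a bare `KeyError` partway through. The following is a minimal sketch of a loader that checks all expected keys up front. It assumes the file is plain JSON with the five keys read above, and it uses the standard-library `json` module instead of `simplejson`. The `load_creds` helper and `REQUIRED_KEYS` name are hypothetical and not part of the lab.

```python
import json

# Keys that the cell above reads from the credentials file.
REQUIRED_KEYS = ("user_name", "password", "host_name", "port_num", "db_name")


def load_creds(path="ant404-lab.creds"):
    """Load the credentials file and fail early if an expected key is missing."""
    with open(path) as fh:
        creds = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"{path} is missing keys: {', '.join(missing)}")
    return creds
```

Calling `creds = load_creds()` would then replace the `open`/`loads` lines, and the variable assignments that follow stay the same.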


## 3. Create a `DataLakeUserPolicy` policy
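This policy lets the role request temporary data access from Lake Formation (`lakeformation:GetDataAccess`) and read table metadata from the AWS Glue Data Catalog.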

In [None]:
%%bash
policy='{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "lakeformation:GetDataAccess",
      "glue:GetTable",
      "glue:GetTables",
      "glue:SearchTables",
      "glue:GetDatabase",
      "glue:GetDatabases",
      "glue:GetPartitions"],
    "Resource": "*"}]}'

aws iam create-policy --policy-name DataLakeUserPolicy --policy-document "$policy"

aws iam get-policy --policy-arn arn:aws:iam::080945919444:policy/DataLakeUserPolicy
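
The policy ARN above embeds a specific account ID. As a hedged sketch, the helper below builds the same kind of ARN from an account ID and a name, so the value is easy to change if you run the lab in another account. The function name `iam_arn` is hypothetical. The account ID itself could be looked up with `aws sts get-caller-identity --query Account --output text`.

```python
def iam_arn(account_id, resource_type, name, partition="aws"):
    """Build the ARN of an IAM policy or role from its parts."""
    if resource_type not in ("policy", "role"):
        raise ValueError("resource_type must be 'policy' or 'role'")
    # Account IDs are 12 digits; zfill restores a leading zero that is
    # lost if the ID was stored as an integer.
    account = str(account_id).zfill(12)
    return f"arn:{partition}:iam::{account}:{resource_type}/{name}"


policy_arn = iam_arn("080945919444", "policy", "DataLakeUserPolicy")
role_arn = iam_arn("080945919444", "role", "RedshiftDataLakeUserRole")
```

Keeping the account ID as a string avoids the leading-zero problem entirely. The `zfill` call is only a safety net.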

## 4. Create a new `RedshiftDataLakeUserRole` role

In [None]:
%%bash
trust_policy='{
"Version": "2012-10-17",
"Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "redshift.amazonaws.com"},
      "Action": "sts:AssumeRole",
      "Condition": {}
}]}'

aws iam create-role --role-name RedshiftDataLakeUserRole --assume-role-policy-document "$trust_policy"

aws iam get-role --role-name RedshiftDataLakeUserRole

## 5. Attach the `DataLakeUserPolicy` to the role 

In [None]:
%%bash
aws iam attach-role-policy --role-name RedshiftDataLakeUserRole \
    --policy-arn arn:aws:iam::080945919444:policy/DataLakeUserPolicy
    
aws iam list-attached-role-policies --role-name RedshiftDataLakeUserRole

## 6. Associate the `RedshiftDataLakeUserRole` role to your cluster

In [None]:
%%bash
aws redshift modify-cluster-iam-roles --cluster-identifier mod-27c4c61fae3b42fe-redshiftcluster-bz825ah27i69 \
    --add-iam-roles arn:aws:iam::080945919444:role/RedshiftDataLakeUserRole \
    --query 'Cluster.IamRoles[]'

## 7. Copy your parquet data to a new Data Lake bucket

In [None]:
%%bash
aws s3 mb s3://ant404-datalake-86feeb76
aws s3 cp s3://ant404-lab-86feeb76/data_lake s3://ant404-datalake-86feeb76/ \
    --recursive --acl bucket-owner-full-control
aws s3 ls s3://ant404-datalake-86feeb76/

## 8. Create a new external database and tables in AWS Glue

In [None]:
%%bash
aws glue create-database --database-input '{"Name": "lakeformation","CreateTableDefaultPermissions": []}'
aws glue get-database --name "lakeformation"

### 8.1. Create `useractivitylog` table
This structured JSON table definition for Glue can be retrieved from an existing table using:  
`aws glue get-table --database-name "glue_db" --name "tbl_name"`

In [None]:
%%bash
tabledef='{ "Name": "useractivitylog",
"StorageDescriptor": {
    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "SerdeInfo": {
        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
        "Parameters": {"serialization.format": "1" } },
    "Location": "s3://ant404-datalake-86feeb76/table=useractivitylog/region=us-east-1/log_year=2019/log_month=11/",
    "Columns": [
        { "Type": "varchar(32)", "Name": "recordtime" },
        { "Type": "varchar(64)", "Name": "db" },
        { "Type": "varchar(64)", "Name": "username" },
        { "Type": "bigint", "Name": "pid" },
        { "Type": "int", "Name": "userid" },
        { "Type": "bigint", "Name": "xid" },
        { "Type": "varchar(65535)", "Name": "query" },
        { "Type": "int", "Name": "log_day" } ]
}
}'
aws glue create-table --database-name "lakeformation" --table-input "$tabledef"
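
The output of `aws glue get-table` cannot be passed straight back to `create-table`: it is wrapped in a `Table` key and carries read-only fields such as `DatabaseName` and `CreateTime`. The sketch below keeps only the fields that a Glue `TableInput` accepts. The key list reflects my reading of the Glue API and should be checked against the current API reference. The `to_table_input` name is hypothetical.

```python
# Fields accepted by the Glue TableInput structure (assumption: verify
# against the Glue API reference for your CLI/SDK version).
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def to_table_input(get_table_response):
    """Reduce a parsed `aws glue get-table` response to a TableInput dict."""
    table = get_table_response["Table"]
    return {key: value for key, value in table.items() if key in TABLE_INPUT_KEYS}
```

You would parse the CLI output with `json.loads`, pass it through `to_table_input`, and serialize the result with `json.dumps` for `--table-input`.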

### 8.2. Create `connectionlog` table

In [None]:
%%bash
tabledef='{ "Name": "connectionlog",
  "StorageDescriptor": {
      "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
      "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
      "SerdeInfo": {
          "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
          "Parameters": {"serialization.format": "1" }},
      "Location": "s3://ant404-datalake-86feeb76/table=connectionlog/region=us-east-1/log_year=2019/log_month=11/",
      "Columns": [
          { "Type": "varchar(64)", "Name": "event" },
          { "Type": "varchar(32)", "Name": "recordtime" },
          { "Type": "varchar(64)", "Name": "remotehost" },
          { "Type": "int", "Name": "remoteport" },
          { "Type": "int", "Name": "pid" },
          { "Type": "varchar(64)", "Name": "dbname" },
          { "Type": "varchar(64)", "Name": "username" },
          { "Type": "varchar(64)", "Name": "authmethod" },
          { "Type": "bigint", "Name": "duration" },
          { "Type": "varchar(32)", "Name": "sslversion" },
          { "Type": "varchar(32)", "Name": "sslcipher" },
          { "Type": "int", "Name": "mtu" },
          { "Type": "varchar(16)", "Name": "sslcompression" },
          { "Type": "varchar(16)", "Name": "sslexpansion" },
          { "Type": "varchar(64)", "Name": "iamauthguid" },
          { "Type": "varchar(64)", "Name": "application_name" },
          { "Type": "int", "Name": "log_day" } ]
      }
}'
aws glue create-table --database-name "lakeformation" --table-input "$tabledef"

### 8.3. Create `cloudtrail` table

In [None]:
%%bash
tabledef='{ "Name": "cloudtrail",
  "StorageDescriptor": {
      "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
      "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
      "SerdeInfo": {
          "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
          "Parameters": {"serialization.format": "1"}},
      "Location": "s3://ant404-datalake-86feeb76/table=cloudtrail/region=us-east-1/log_year=2019/log_month=11/",
      "Columns": [
          { "Type": "varchar(8)", "Name": "event_version" },
          { "Type": "varchar(16)", "Name": "user_identity_type" },
          { "Type": "varchar(128)", "Name": "user_identity_principalid" },
          { "Type": "varchar(256)", "Name": "user_identity_arn" },
          { "Type": "varchar(16)", "Name": "user_identity_accountid" },
          { "Type": "varchar(64)", "Name": "user_identity_invokedby" },
          { "Type": "varchar(32)", "Name": "user_identity_accesskeyid" },
          { "Type": "varchar(32)", "Name": "user_identity_username" },
          { "Type": "varchar(8)", "Name": "session_context_mfa_authenticated" },
          { "Type": "varchar(32)", "Name": "session_context_creation_date" },
          { "Type": "varchar(8)", "Name": "session_issuer_type" },
          { "Type": "varchar(32)", "Name": "session_issuer_principal_id" },
          { "Type": "varchar(256)", "Name": "session_issuer_arn" },
          { "Type": "varchar(16)", "Name": "session_issuer_account_id" },
          { "Type": "varchar(64)", "Name": "session_issuer_user_name" },
          { "Type": "varchar(32)", "Name": "event_time" },
          { "Type": "varchar(64)", "Name": "event_source" },
          { "Type": "varchar(64)", "Name": "event_name" },
          { "Type": "varchar(16)", "Name": "aws_region" },
          { "Type": "varchar(64)", "Name": "source_ipaddress" },
          { "Type": "varchar(256)", "Name": "user_agent" },
          { "Type": "varchar(64)", "Name": "error_code" },
          { "Type": "varchar(512)", "Name": "error_message" },
          { "Type": "int", "Name": "request_param_duration_seconds" },
          { "Type": "varchar(256)", "Name": "request_param_role_arn" },
          { "Type": "varchar(64)", "Name": "request_param_role_session_name" },
          { "Type": "varchar(16)", "Name": "request_param_database_name" },
          { "Type": "varchar(64)", "Name": "request_param_table_name" },
          { "Type": "varchar(128)", "Name": "assumed_role_user_arn" },
          { "Type": "varchar(64)", "Name": "assumed_role_user_assumed_role_id" },
          { "Type": "varchar(32)", "Name": "credentials_access_key_id" },
          { "Type": "varchar(32)", "Name": "credentials_expiration" },
          { "Type": "varchar(2048)", "Name": "credentials_session_token" },
          { "Type": "varchar(128)", "Name": "lake_formation_principal" },
          { "Type": "varchar(64)", "Name": "request_id" },
          { "Type": "varchar(64)", "Name": "event_id" },
          { "Type": "varchar(256)", "Name": "resource_arn" },
          { "Type": "varchar(16)", "Name": "resource_accountid" },
          { "Type": "varchar(32)", "Name": "resource_type" },
          { "Type": "varchar(32)", "Name": "event_type" },
          { "Type": "varchar(16)", "Name": "api_version" },
          { "Type": "varchar(8)", "Name": "read_only" },
          { "Type": "varchar(16)", "Name": "recipient_account_id" },
          { "Type": "varchar(1024)", "Name": "service_event_details" },
          { "Type": "varchar(64)", "Name": "shared_event_id" },
          { "Type": "varchar(16)", "Name": "vpc_endpoint_id" },
          { "Type": "int", "Name": "log_day" } ]
    }
}'
aws glue create-table --database-name "lakeformation" --table-input "$tabledef"
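
The three table definitions above differ only in name, location, and column list. As a rough sketch, they could be generated from one helper. The function name `parquet_table_input` and its parameters are hypothetical, and the location pattern is copied from the definitions above. How single-digit months appear in the S3 path is not shown in this lab, so the helper does not pad them.

```python
import copy
import json

# Storage settings shared by all three Parquet tables above.
PARQUET_STORAGE = {
    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "SerdeInfo": {
        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
        "Parameters": {"serialization.format": "1"},
    },
}


def parquet_table_input(name, bucket, columns, region="us-east-1", year=2019, month=11):
    """Build a Glue table-input dict from a list of (column_name, type) pairs."""
    storage = copy.deepcopy(PARQUET_STORAGE)  # avoid sharing nested dicts
    storage["Location"] = (
        f"s3://{bucket}/table={name}/region={region}/log_year={year}/log_month={month}/"
    )
    storage["Columns"] = [{"Type": col_type, "Name": col_name} for col_name, col_type in columns]
    return {"Name": name, "StorageDescriptor": storage}


tabledef = parquet_table_input(
    "useractivitylog",
    "ant404-datalake-86feeb76",
    [("recordtime", "varchar(32)"), ("db", "varchar(64)"), ("log_day", "int")],
)
tabledef_json = json.dumps(tabledef)  # value for --table-input
```

The column list here is shortened for illustration. A real call would pass the full list from the matching cell.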

### 8.4. List the tables in the AWS Glue database to confirm they were created 

In [None]:
%%bash
aws glue get-tables --database-name "lakeformation" --query 'TableList[].{DbName:DatabaseName,TableName:Name}'

## 9. Register the bucket with Lake Formation and set table permissions

To restrict access, you must first register the S3 location with Lake Formation as a "resource". Then you grant `SELECT` permissions on the tables in that location to the `RedshiftDataLakeUserRole`.



### 9.1. Register the Data Lake bucket
------
**DO NOT RUN** this step until you have created the Glue tables above. Once you register the resource you will need to either grant additional explicit permissions or de-register it to make further changes. 

In [None]:
%%bash
aws lakeformation register-resource \
    --resource-arn "arn:aws:s3:::ant404-datalake-86feeb76/" \
    --use-service-linked-role

### 9.2. Define the `DataLakePrincipalIdentifier` 
This is the IAM principal (here, the `RedshiftDataLakeUserRole` role) that is authorized to access the data lake.

In [None]:
principal='''{"DataLakePrincipalIdentifier":"arn:aws:iam::080945919444:role/RedshiftDataLakeUserRole"}'''
%set_env principal={principal}

### 9.3. Grant `SELECT` on `cloudtrail` and exclude 2 `credentials_` columns

In [None]:
%%bash
cloudtrail='{"TableWithColumns":{
    "DatabaseName":"lakeformation",
    "Name":"cloudtrail",
    "ColumnWildcard":{"ExcludedColumnNames":["credentials_access_key_id","credentials_session_token"]}}}'
aws lakeformation grant-permissions --principal "$principal" --resource "$cloudtrail" --permissions '["SELECT"]'

### 9.4. Grant `SELECT` on `useractivitylog` and exclude the `query` column

In [None]:
%%bash
useractivitylog='{"TableWithColumns":{
    "DatabaseName":"lakeformation",
    "Name":"useractivitylog",
    "ColumnWildcard":{"ExcludedColumnNames":["query"]}}}'
aws lakeformation grant-permissions --principal "$principal" --resource "$useractivitylog" --permissions '["SELECT"]'

### 9.5. Grant `SELECT` on `connectionlog` and exclude the `iamauthguid` column

In [None]:
%%bash
connectionlog='{"TableWithColumns":{
    "DatabaseName":"lakeformation",
    "Name":"connectionlog",
    "ColumnWildcard":{"ExcludedColumnNames":["iamauthguid"]}}}'
aws lakeformation grant-permissions --principal "$principal" --resource "$connectionlog" --permissions '["SELECT"]'
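
The three grants above follow one pattern: a `TableWithColumns` resource with a `ColumnWildcard` that lists the excluded columns. A minimal sketch that builds all three resource documents from a single mapping is shown below. The `EXCLUDED_COLUMNS` mapping and the `table_with_columns` helper are hypothetical names, and the values are copied from the cells above.

```python
import json

# Columns hidden from the Redshift role, per table (from steps 9.3 to 9.5).
EXCLUDED_COLUMNS = {
    "cloudtrail": ["credentials_access_key_id", "credentials_session_token"],
    "useractivitylog": ["query"],
    "connectionlog": ["iamauthguid"],
}


def table_with_columns(database, table, excluded):
    """Build the --resource document that grants every column except `excluded`."""
    return {
        "TableWithColumns": {
            "DatabaseName": database,
            "Name": table,
            "ColumnWildcard": {"ExcludedColumnNames": list(excluded)},
        }
    }


# One JSON string per table, ready to pass as --resource.
resources = {
    table: json.dumps(table_with_columns("lakeformation", table, excluded))
    for table, excluded in EXCLUDED_COLUMNS.items()
}
```

Keeping the exclusions in one mapping makes it easier to reuse the same list when you later verify that the columns are hidden.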

## 10. Connect to your Redshift cluster

You will use the `sqlalchemy` and `ipython-sql` Python libraries to manage the Redshift connection. 

This cell creates a `%sql` element so we can use the connection in other cells in the notebook.

-------
**Note:** _Please ignore the pink error message that says: "UserWarning: The psycopg2 wheel package will be renamed from release 2.8"_   
**Look for** 'Connected: ant404@dev' in the 'Out [ ]' section below the warning.

In [None]:
import sqlalchemy
import psycopg2
import simplejson

%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = 'postgresql+psycopg2://'+username+':'+password+'@'+host_name+':'+port_num+'/'+db_name
%sql $connect_to_db
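
The connection string above is built by plain concatenation, which breaks if the password contains characters such as `@`, `:` or `/`, or if `port_num` is an integer. A hedged alternative is to URL-encode the credentials with the standard library, as sketched below. The `redshift_url` helper is a hypothetical name and the example values are placeholders.

```python
from urllib.parse import quote_plus


def redshift_url(username, password, host_name, port_num, db_name):
    """Build a SQLAlchemy URL, escaping characters that are special in URLs."""
    return (
        f"postgresql+psycopg2://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host_name}:{port_num}/{db_name}"
    )


# Placeholder values for illustration only.
example_url = redshift_url("ant404", "p@ss:w/rd", "example-host", 5439, "dev")
```

In the notebook you would assign the result to `connect_to_db` and pass it to `%sql` exactly as before.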

## 11. Create a Redshift external schema for the Data Lake
You are creating a separate external schema for the Data Lake user. This schema uses the new IAM role with restricted access permissions. You can then `GRANT` this restricted access to specific Redshift users and groups. For instance, you may want to allow analysts to query the data lake but restrict access to data fields that contain PII.

In [None]:
%%sql
CREATE EXTERNAL SCHEMA IF NOT EXISTS lakeformation
FROM DATA CATALOG
DATABASE 'lakeformation'
IAM_ROLE 'arn:aws:iam::080945919444:role/RedshiftDataLakeUserRole'
;
SELECT * FROM svv_external_schemas WHERE schemaname = 'lakeformation';
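
For an external schema, Redshift access is given with `GRANT USAGE ON SCHEMA`. The sketch below generates that statement for a user or group. The group name `analysts` is purely hypothetical (this lab does not create it), as is the `grant_schema_usage` helper. The identifier check is deliberately strict so that the generated SQL cannot contain unexpected characters.

```python
import re

# Accept only simple, unquoted SQL identifiers.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def grant_schema_usage(schema, grantee, is_group=True):
    """Return a GRANT USAGE statement for an (external) schema."""
    for name in (schema, grantee):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a simple SQL identifier: {name!r}")
    target = f"GROUP {grantee}" if is_group else grantee
    return f"GRANT USAGE ON SCHEMA {schema} TO {target};"


statement = grant_schema_usage("lakeformation", "analysts")
```

The resulting statement can be run in a `%%sql` cell by a user who owns the schema or is a superuser.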


## 12. Check that the restricted columns are _NOT_ visible

---
**NOTE** This query _should_ return no rows

In [None]:
%%sql 
SELECT schemaname, tablename, columnname
FROM svv_external_columns
WHERE schemaname = 'lakeformation'
  AND tablename IN ('cloudtrail','connectionlog','useractivitylog')
-- # Excluded columns list
  AND columnname IN ('iamauthguid','query','credentials_access_key_id','credentials_session_token')
;
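
The same check can be expressed in Python over the rows returned from `svv_external_columns`. This is a sketch only: `leaked_columns` and `EXCLUDED` are hypothetical names, and the rows are assumed to be `(schemaname, tablename, columnname)` tuples. Unlike the SQL above, it matches each excluded column against its own table only.

```python
# Columns that Lake Formation should hide, per table (from steps 9.3 to 9.5).
EXCLUDED = {
    "cloudtrail": {"credentials_access_key_id", "credentials_session_token"},
    "useractivitylog": {"query"},
    "connectionlog": {"iamauthguid"},
}


def leaked_columns(rows, schema="lakeformation"):
    """Return the rows that expose a column which should have been excluded."""
    return [
        (schemaname, tablename, columnname)
        for schemaname, tablename, columnname in rows
        if schemaname == schema and columnname in EXCLUDED.get(tablename, set())
    ]
```

An empty list means the restrictions are in effect. Any returned row names a column that is still visible to the role.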

In [None]:
%%sql
SELECT * FROM lakeformation.useractivitylog LIMIT 10;

In [None]:
%%sql
SELECT * FROM lakeformation.connectionlog LIMIT 10;

In [None]:
%%sql
SELECT * FROM lakeformation.cloudtrail LIMIT 10;

## 13. Compare the output of the Admin access tables to new Data Lake tables

###  Test the Data Lake table

In [None]:
%%sql
SELECT user_identity_principalid, COUNT(*)
FROM  lakeformation.cloudtrail
-- log_year and log_month are not columns of this Glue table; its Location already fixes them to 2019/11
WHERE log_day = 11 
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
;

### Test the table with full Admin access
This is the view created in Lab #3 over the table created in Lab #4.

In [None]:
%%sql
SELECT user_identity_principalid, COUNT(*)
FROM  public.v_export_cloudtrail
WHERE log_year = 2019 
  AND log_month = 11 
  AND log_day = 11 
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
;
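
To compare the two result sets without reading them by eye, you could capture each query result and diff the counts per principal. The sketch below assumes each result is an iterable of `(user_identity_principalid, count)` rows. With ipython-sql a result can typically be captured with `result = %sql SELECT ...`. The `diff_counts` helper is a hypothetical name.

```python
def diff_counts(restricted_rows, admin_rows):
    """Return {principal: (restricted_count, admin_count)} where the two differ."""
    restricted = {row[0]: row[1] for row in restricted_rows}
    admin = {row[0]: row[1] for row in admin_rows}
    return {
        principal: (restricted.get(principal), admin.get(principal))
        for principal in sorted(restricted.keys() | admin.keys(), key=str)
        if restricted.get(principal) != admin.get(principal)
    }
```

An empty dictionary means both schemas returned the same counts. Because both queries use `LIMIT 10`, ties at the cutoff can produce differences that do not reflect a real data mismatch.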

### Further Info on Lake Formation

* Lake Formation Documentation: ["What Is AWS Lake Formation?"](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html)
* Lake Formation Tutorial: [Creating a Data Lake from an AWS CloudTrail Source](https://docs.aws.amazon.com/lake-formation/latest/dg/getting-started-cloudtrail-tutorial.html)
* AWS Blog: [Getting started with AWS Lake Formation](https://aws.amazon.com/blogs/big-data/getting-started-with-aws-lake-formation/)