# 0. Load sagemaker_studio_analytics_extension

In [None]:
%pip install sagemaker-studio-analytics-extension

In [None]:
%load_ext sagemaker_studio_analytics_extension.magics

# 1. Act as data engineer 


<div class="alert alert-block alert-success">
In this section, we will create a Spark application with the *data engineer* EMR runtime role, which is designed to create Lake Formation databases and tables. Then, we will perform create, read, insert, and drop actions on the databases and tables that are governed by Lake Formation.
</div>

## 1.1. Connect to EMR cluster via Livy as the ENGINEER_ROLE

In [None]:
%%sh

source ~/.bash_profile
EMR_CLUSTER_ID=$(aws emr list-clusters --active  --query 'Clusters[?contains(Name,`emr-roadshow-runtime-role-lf`)].Id' --output text)
echo "CLUSTER_ID:   $EMR_CLUSTER_ID"
echo "IAM_ARN:      $ENGINEER_ROLE"
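
The `--query` expression in the cell above keeps only the clusters whose name contains `emr-roadshow-runtime-role-lf` and returns their IDs. As a rough illustration of that filter, here is a minimal Python sketch. The `sample_response` data and the `matching_cluster_ids` helper are hypothetical and exist only for this example; the real response comes from the AWS CLI.

```python
# Hypothetical sample shaped like the output of `aws emr list-clusters --active`.
# The cluster IDs and the second cluster name are made up for illustration.
sample_response = {
    "Clusters": [
        {"Id": "j-AAAAAAAAAAAA1", "Name": "emr-roadshow-runtime-role-lf"},
        {"Id": "j-BBBBBBBBBBBB2", "Name": "some-other-cluster"},
    ]
}


def matching_cluster_ids(response, name_fragment):
    """Return the IDs of clusters whose name contains name_fragment."""
    return [
        cluster["Id"]
        for cluster in response.get("Clusters", [])
        if name_fragment in cluster.get("Name", "")
    ]


ids = matching_cluster_ids(sample_response, "emr-roadshow-runtime-role-lf")
print(ids)
```

If more than one active cluster matches, the shell variable holds several IDs, so it is worth checking that exactly one ID is printed before continuing.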


<div class="alert alert-block alert-warning">
<b>Note:</b> In the following command, replace <b>&lt;CLUSTER_ID&gt;</b> with the cluster ID and <b>&lt;IAM_ARN&gt;</b> with the IAM role ARN, using the output from the above cell.

    
Example:<br>
<code>
%sm_analytics emr connect \
--cluster-id j-1NEOGU3MXB8YT \
--auth-type Basic_Access \
--emr-execution-role-arn arn:aws:iam::012345678:role/lf-data-access-engineer
</code>
</div>
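
The substitution described in the note can be sketched in plain Python. The `build_connect_command` helper below is hypothetical and not part of the analytics extension; it only renders the command text and refuses values that still look like placeholders. The sample values are the ones from the example in the note.

```python
def build_connect_command(cluster_id, role_arn):
    """Render the connect line, rejecting values that are still placeholders."""
    for label, value in (("cluster_id", cluster_id), ("role_arn", role_arn)):
        if not value or "<" in value or ">" in value:
            raise ValueError(f"{label} still looks like a placeholder: {value!r}")
    return (
        "%sm_analytics emr connect "
        f"--cluster-id {cluster_id} "
        "--auth-type Basic_Access "
        f"--emr-execution-role-arn {role_arn}"
    )


# Values copied from the example in the note.
command = build_connect_command(
    "j-1NEOGU3MXB8YT",
    "arn:aws:iam::012345678:role/lf-data-access-engineer",
)
print(command)
```

The printed text can then be pasted into the next cell in place of the placeholder version.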



In [None]:
%sm_analytics emr connect \
--cluster-id <CLUSTER_ID> \
--auth-type Basic_Access \
--emr-execution-role-arn <IAM_ARN>

## 1.2. Configure parameters for the S3 data lake

In [None]:
%%sh

source ~/.bash_profile
echo "DATALAKE_BUCKET:    $DATALAKE_BUCKET"

<div class="alert alert-block alert-warning">
<b>Note:</b> Replace <code>"DATALAKE_BUCKET"</code> with the output from the above cell.<br><b>For example:</b> DATALAKE_BUCKET="lf-datalake-676072755675-us-east-1"
</div>


In [None]:
DATALAKE_BUCKET="DATALAKE_BUCKET"

In [None]:
import os
from pyspark.sql.functions import concat, col, lit, to_timestamp, dense_rank, desc, count, rand, when
from pyspark.sql.window import Window
from pyspark.sql.types import StringType


rawS3TablePath = f"s3://{DATALAKE_BUCKET}/raw/ticket_purchase_hist/"
outputS3Path = f"s3://{DATALAKE_BUCKET}/output/"

targetDBName = 'sample'
targetTableName = 'ticket_purchase_hist'
targetPath = os.path.join(outputS3Path, targetDBName, targetTableName)

targetDBName2 = 'sample2'
targetTableName2 = 'ticket_purchase_hist2'
targetPath2 = os.path.join(outputS3Path, targetDBName2, targetTableName2)
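
To make the resulting paths concrete, the following sketch rebuilds them outside Spark and fails early if the bucket placeholder was never replaced. The `datalake_paths` helper is hypothetical. It uses `posixpath.join`, which is what `os.path.join` resolves to on Linux, so the sketch assumes the notebook code runs on a Linux host.

```python
import posixpath


def datalake_paths(bucket, db_name, table_name):
    """Build the raw, output, and table paths used in this notebook."""
    if not bucket or bucket == "DATALAKE_BUCKET":
        raise ValueError("DATALAKE_BUCKET has not been replaced with a real bucket name")
    output_path = f"s3://{bucket}/output/"
    return {
        "raw": f"s3://{bucket}/raw/ticket_purchase_hist/",
        "output": output_path,
        # The trailing slash on output_path means join adds no extra separator.
        "target": posixpath.join(output_path, db_name, table_name),
    }


# Bucket name taken from the example in the note above.
paths = datalake_paths(
    "lf-datalake-676072755675-us-east-1", "sample", "ticket_purchase_hist"
)
for name, value in paths.items():
    print(f"{name}: {value}")
```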


## 1.3. Create 2 databases and 2 tables governed by Lake Formation

In [None]:
spark.sql(f"""
    CREATE DATABASE IF NOT EXISTS {targetDBName} LOCATION '{outputS3Path}{targetDBName}'
""")

In [None]:
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {targetDBName}.{targetTableName} (
        sporting_event_ticket_id string,
        purchased_by_id string,
        transaction_date_time string,
        transferred_from_id string,
        purchase_price string
    )
    USING PARQUET
    OPTIONS(
        'path' '{targetPath}'
    )
""")
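
When the column list grows, the same statement can be generated from data instead of being typed by hand. The sketch below is one possible way to do that. `create_table_ddl` is a hypothetical helper and the bucket in the sample path is a made-up name; the column names and types are the ones used in the cell above.

```python
# Column names and types copied from the CREATE TABLE statement above.
COLUMNS = [
    ("sporting_event_ticket_id", "string"),
    ("purchased_by_id", "string"),
    ("transaction_date_time", "string"),
    ("transferred_from_id", "string"),
    ("purchase_price", "string"),
]


def create_table_ddl(db_name, table_name, columns, path):
    """Render a CREATE TABLE statement for a Parquet table at the given path."""
    column_lines = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {db_name}.{table_name} (\n"
        f"{column_lines}\n"
        ")\n"
        "USING PARQUET\n"
        f"OPTIONS ('path' '{path}')"
    )


# "example-bucket" is a placeholder bucket name for illustration only.
ddl = create_table_ddl(
    "sample",
    "ticket_purchase_hist",
    COLUMNS,
    "s3://example-bucket/output/sample/ticket_purchase_hist",
)
print(ddl)
```

In the notebook, the rendered string would be passed to `spark.sql(ddl)`.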


In [None]:
spark.sql(f"""
    CREATE DATABASE IF NOT EXISTS {targetDBName2} LOCATION '{outputS3Path}{targetDBName2}'
""")

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {targetDBName2}.{targetTableName2} LIKE {targetDBName}.{targetTableName}
""")


## 1.4. Insert data into a table governed by Lake Formation

In [None]:
inputDf = spark.read.option("header", True).csv(rawS3TablePath).limit(10)

inputDf.write.insertInto(f"{targetDBName}.{targetTableName}")
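
Note that `insertInto` matches columns by position rather than by name, so the CSV header order has to line up with the table definition. A minimal sketch of a pre-insert check follows. `positional_mismatches` is a hypothetical helper, and the swapped CSV header is invented to show what a mismatch looks like.

```python
from itertools import zip_longest


def positional_mismatches(source_columns, table_columns):
    """Return (position, source_name, table_name) wherever the two lists differ."""
    return [
        (position, source, target)
        for position, (source, target) in enumerate(
            zip_longest(source_columns, table_columns)
        )
        if source != target
    ]


# Column order from the table definition in section 1.3.
table_columns = [
    "sporting_event_ticket_id",
    "purchased_by_id",
    "transaction_date_time",
    "transferred_from_id",
    "purchase_price",
]

# Hypothetical CSV header with the last two columns swapped.
csv_columns = [
    "sporting_event_ticket_id",
    "purchased_by_id",
    "transaction_date_time",
    "purchase_price",
    "transferred_from_id",
]

mismatches = positional_mismatches(csv_columns, table_columns)
print(mismatches)
```

In the notebook, the two lists could come from `inputDf.columns` and `spark.table(f"{targetDBName}.{targetTableName}").columns`. An empty result means the positions agree.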



## 1.5. Read data from the table

In [None]:
spark.sql(f"""
    SELECT * FROM {targetDBName}.{targetTableName}
""").show(10, False)


## 1.6. Drop a database & table

In [None]:
spark.sql(f"""
    DROP TABLE {targetDBName2}.{targetTableName2}
""")

spark.sql(f"""
    DROP DATABASE {targetDBName2}
""")


# 2. Act as data analyst 

<div class="alert alert-block alert-warning">
<b>Note:</b> Before running the following cells, go back to the workshop instructions, <b>Lab: EMR on EC2</b>, Section: <b>Use EMR Runtime Roles with Sagemaker notebook</b>, Step: <b>Grant read-only permission to Data Analyst role on Lake Formation</b>, and configure the data access.
</div>

<div class="alert alert-block alert-success">
In this section, we will connect to an existing EMR cluster with the *data analyst* role, which represents a business consumer whose access is restricted by Lake Formation fine-grained access control.
</div>



## 2.1. Connect to EMR cluster via Livy as ANALYST_ROLE

In [None]:
%%sh

source ~/.bash_profile
EMR_CLUSTER_ID=$(aws emr list-clusters --active  --query 'Clusters[?contains(Name,`emr-roadshow-runtime-role-lf`)].Id' --output text)
echo "CLUSTER_ID:   $EMR_CLUSTER_ID"
echo "IAM_ARN:      $ANALYST_ROLE"


In [None]:
%sm_analytics emr connect \
--cluster-id <CLUSTER_ID> \
--auth-type Basic_Access \
--emr-execution-role-arn <IAM_ARN>

In [None]:
%%sh

source ~/.bash_profile
echo "DATALAKE_BUCKET:    $DATALAKE_BUCKET"

<div class="alert alert-block alert-warning">
<b>Note:</b> Replace <code>"DATALAKE_BUCKET"</code> with the output from the above cell. 
    <br><b>For example:</b> DATALAKE_BUCKET="lf-datalake-676072755675-us-east-1"
</div>


In [None]:
DATALAKE_BUCKET="DATALAKE_BUCKET"

In [None]:
import os
from pyspark.sql.functions import concat, col, lit, to_timestamp, dense_rank, desc, count, rand, when
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
outputS3Path = f"s3://{DATALAKE_BUCKET}/output/"

rawS3TablePath = f"s3://{DATALAKE_BUCKET}/raw/ticket_purchase_hist/"

targetDBName = 'sample'
targetTableName = 'ticket_purchase_hist'
targetPath = os.path.join(outputS3Path, targetDBName, targetTableName)

## 2.2. Test the column-level permission


<div class="alert alert-block alert-success">
The expected output shows only the 2 columns that Lake Formation grants to the analyst role.
</div>

In [None]:
spark.sql(f"""
    SELECT * FROM {targetDBName}.{targetTableName}
""").show(10, False)
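
To confirm the result programmatically rather than by eye, the visible columns can be compared with the full table definition from section 1.3. The sketch below assumes a hypothetical `hidden_columns` helper. The two columns listed as visible are placeholders chosen for illustration; which columns are actually granted depends on the Lake Formation grant made in the workshop step.

```python
# Full column list from the table definition in section 1.3.
TABLE_COLUMNS = [
    "sporting_event_ticket_id",
    "purchased_by_id",
    "transaction_date_time",
    "transferred_from_id",
    "purchase_price",
]


def hidden_columns(all_columns, visible_columns):
    """Return the columns of the full schema that are not visible to the caller."""
    visible = set(visible_columns)
    return [name for name in all_columns if name not in visible]


# Hypothetical visible columns; the real pair depends on the Lake Formation grant.
visible = ["sporting_event_ticket_id", "purchase_price"]
hidden = hidden_columns(TABLE_COLUMNS, visible)
print(f"visible: {visible}")
print(f"hidden:  {hidden}")
```

In the notebook, `visible` could be taken from `spark.table(f"{targetDBName}.{targetTableName}").columns` while connected as the analyst role.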

## 2.3. Test insert permissions

In [None]:
rand_df = spark.read.option("header", True).csv(rawS3TablePath).sample(fraction=0.1).limit(10)

In [None]:
rand_df.createOrReplaceTempView("tmp_table")

spark.sql(f"""
    INSERT INTO {targetDBName}.{targetTableName}
    SELECT *
    FROM tmp_table
""")


<div class="alert alert-block alert-success">
The expected output from the above cell is an error message such as <code>Permission Denied: User XXXX does not have INSERT permission on sample/ticket_purchase_hist</code>
</div>
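
Because the error is the expected result here, it can be useful to treat it as a passing check instead of a failed cell. The following is a minimal sketch of that idea. `expect_failure` and `fake_insert` are hypothetical; `fake_insert` stands in for the real `spark.sql` call and simply raises the message quoted above.

```python
def expect_failure(action, expected_fragment):
    """Run action and return its error message; fail if it unexpectedly succeeds."""
    try:
        action()
    except Exception as error:
        message = str(error)
        # Re-raise anything that is not the error we are looking for.
        if expected_fragment.lower() not in message.lower():
            raise
        return message
    raise AssertionError("action succeeded, but a permission error was expected")


def fake_insert():
    # Stand-in for the INSERT statement, which cannot run outside the cluster.
    raise RuntimeError(
        "Permission Denied: User XXXX does not have INSERT permission "
        "on sample/ticket_purchase_hist"
    )


message = expect_failure(fake_insert, "permission denied")
print(message)
```

In the notebook, the insert statement could be wrapped in a lambda and passed as `action`. The exact wording of the real error may differ, so the fragment to match is an assumption.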

