# 0. Load sagemaker_studio_analytics_extension

In [None]:
%pip install sagemaker-studio-analytics-extension

In [1]:
%load_ext sagemaker_studio_analytics_extension.magics

# 1. Act as data engineer 


>**In this section, we create a Spark application with the *data engineer* EMR runtime role, which is designed to create Lake Formation databases and tables. We then perform create, read, insert, and drop (CRUD-style) actions on the databases and tables governed by Lake Formation.**

## 1.1. Connect to EMR cluster via Livy as the ENGINEER_ROLE

In [2]:
%%sh

source ~/.bash_profile
EMR_CLUSTER_ID=$(aws emr list-clusters --active  --query 'Clusters[?contains(Name,`emr-roadshow-runtime-role-lf`)].Id' --output text)
echo "cluster ID:   $EMR_CLUSTER_ID"
echo "IAM ARN:      $ENGINEER_ROLE"


cluster ID:   j-1NEOGU3MXB8YT
IAM ARN:      arn:aws:iam::010261715632:role/lf-data-access-engineer
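
The `--query` argument in the cell above is a JMESPath filter that keeps only the clusters whose `Name` contains the given fragment and returns their `Id`. As a minimal sketch of the same filter in plain Python, assuming a response shaped like the JSON that `aws emr list-clusters` returns (the cluster entries and the helper name `find_cluster_ids` are hypothetical):

```python
def find_cluster_ids(response, name_fragment):
    """Return the Id of every cluster whose Name contains name_fragment."""
    return [
        cluster["Id"]
        for cluster in response.get("Clusters", [])
        if name_fragment in cluster.get("Name", "")
    ]


# Hypothetical response, shaped like the output of `aws emr list-clusters`.
sample_response = {
    "Clusters": [
        {"Id": "j-EXAMPLE1", "Name": "emr-roadshow-runtime-role-lf"},
        {"Id": "j-EXAMPLE2", "Name": "some-other-cluster"},
    ]
}

matching_ids = find_cluster_ids(sample_response, "emr-roadshow-runtime-role-lf")
print(matching_ids)
```

If the list holds more than one ID, the name fragment is ambiguous and the cluster to connect to should be picked by hand.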


>**In the following command, replace `<CLUSTER_ID>` and `<IAM_ARN>` with the `cluster ID` and `IAM ARN` values printed by the cell above.**

Example:
```
%sm_analytics emr connect \
--cluster-id j-1NEOGU3MXB8YT \
--auth-type Basic_Access \
--emr-execution-role-arn arn:aws:iam::012345678901:role/lf-data-access-engineer
```

In [None]:
%sm_analytics emr connect \
--cluster-id <CLUSTER_ID> \
--auth-type Basic_Access \
--emr-execution-role-arn <IAM_ARN>
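
A common slip at this step is running the command with a placeholder still in place. The sketch below shows one way to check both values before connecting. It only builds the command string, using the flags from the cell above. The helper name `build_connect_command` and both regular expressions are assumptions for illustration; the ARN pattern covers only the standard `aws` partition.

```python
import re

# AWS account IDs are 12 digits. The role-name character set is an assumption
# that is loose enough for the role names used in this workshop.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")
CLUSTER_ID_PATTERN = re.compile(r"^j-[A-Z0-9]+$")


def build_connect_command(cluster_id, role_arn):
    """Return the %sm_analytics connect command, or raise if a value looks wrong."""
    if not CLUSTER_ID_PATTERN.match(cluster_id):
        raise ValueError(f"Not an EMR cluster ID: {cluster_id!r}")
    if not ROLE_ARN_PATTERN.match(role_arn):
        raise ValueError(f"Not an IAM role ARN: {role_arn!r}")
    return (
        "%sm_analytics emr connect "
        f"--cluster-id {cluster_id} "
        "--auth-type Basic_Access "
        f"--emr-execution-role-arn {role_arn}"
    )


print(build_connect_command(
    "j-1NEOGU3MXB8YT",
    "arn:aws:iam::010261715632:role/lf-data-access-engineer",
))
```

Passing an unreplaced placeholder such as `<CLUSTER_ID>` raises `ValueError` instead of producing a command that fails later.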

## 1.2. Configure parameters for the S3 data lake

In [4]:
%%sh

source ~/.bash_profile
echo "DATALAKE_BUCKET:    $DATALAKE_BUCKET"

DATALAKE_BUCKET:    lf-datalake-010261715632-us-east-1


**In the next cell, replace `<DATALAKE_BUCKET>` with the bucket name printed by the cell above.**

Example:
```
DATALAKE_BUCKET="lf-datalake-012345678901-us-east-1"
```

In [None]:
DATALAKE_BUCKET="<DATALAKE_BUCKET>"

In [23]:
import os
from pyspark.sql.functions import concat, col, lit, to_timestamp, dense_rank, desc, count, rand, when
from pyspark.sql.window import Window
from pyspark.sql.types import StringType


rawS3TablePath = f"s3://{DATALAKE_BUCKET}/raw/ticket_purchase_hist/"
outputS3Path = f"s3://{DATALAKE_BUCKET}/output/"

targetDBName = 'sample'
targetTableName = 'ticket_purchase_hist'
targetPath = os.path.join(outputS3Path, targetDBName, targetTableName)

targetDBName2 = 'sample2'
targetTableName2 = 'ticket_purchase_hist2'
targetPath2 = os.path.join(outputS3Path, targetDBName2, targetTableName2)
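
The cell above builds S3 locations with `os.path.join`, which joins with the separator of the operating system the code runs on. That separator is `/` on Linux, so the paths come out as expected on an EMR cluster. As a minimal sketch of a platform-independent alternative, `posixpath.join` always joins with `/`. The bucket name below is a hypothetical stand-in for `DATALAKE_BUCKET`.

```python
import posixpath

bucket = "example-datalake-bucket"  # hypothetical stand-in for DATALAKE_BUCKET
output_s3_path = f"s3://{bucket}/output/"

# posixpath.join uses "/" regardless of the host operating system.
target_path = posixpath.join(output_s3_path, "sample", "ticket_purchase_hist")
print(target_path)  # s3://example-datalake-bucket/output/sample/ticket_purchase_hist
```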



## 1.3. Create database and table governed by Lake Formation

In [13]:
spark.sql(f"""
    CREATE DATABASE IF NOT EXISTS {targetDBName}
""")


DataFrame[]

In [14]:
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {targetDBName}.{targetTableName} (
        sporting_event_ticket_id string,
        purchased_by_id string,
        transaction_date_time string,
        transferred_from_id string,
        purchase_price string
    )
    USING PARQUET
    OPTIONS(
        'path' '{targetPath}'
    )
""")



DataFrame[]
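
The statements in this section interpolate Python variables directly into SQL text. One way to guard against a mistyped or unsafe name is to validate each identifier before building the statement. The sketch below assumes plain alphanumeric identifiers are sufficient for this workshop. The helper names `checked_identifier` and `build_create_table_sql` are hypothetical, and column types and the path are passed through unchecked.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def checked_identifier(name):
    """Return name unchanged if it is a plain SQL identifier, else raise."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return name


def build_create_table_sql(db_name, table_name, columns, path):
    """Build the CREATE TABLE statement from a list of (column, type) pairs."""
    column_lines = ",\n".join(
        f"    {checked_identifier(col_name)} {col_type}"
        for col_name, col_type in columns
    )
    qualified = f"{checked_identifier(db_name)}.{checked_identifier(table_name)}"
    return (
        f"CREATE TABLE IF NOT EXISTS {qualified} (\n"
        f"{column_lines}\n"
        ")\n"
        "USING PARQUET\n"
        f"OPTIONS('path' '{path}')"
    )


ddl = build_create_table_sql(
    "sample",
    "ticket_purchase_hist",
    [("sporting_event_ticket_id", "string"), ("purchase_price", "string")],
    "s3://example-datalake-bucket/output/sample/ticket_purchase_hist",  # hypothetical
)
print(ddl)
```

In the notebook, the resulting string could be passed to `spark.sql(ddl)`.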

In [30]:
spark.sql(f"""
    CREATE DATABASE IF NOT EXISTS {targetDBName2}
""")

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {targetDBName2}.{targetTableName2} LIKE {targetDBName}.{targetTableName}
""")



DataFrame[]

## 1.4. Load data into the created table governed by Lake Formation

In [15]:
inputDf = spark.read.option("header", True).csv(rawS3TablePath)
inputDf.printSchema()
inputDf.show(10, False)


root
 |-- sporting_event_ticket_id: string (nullable = true)
 |-- purchased_by_id: string (nullable = true)
 |-- transaction_date_time: string (nullable = true)
 |-- transferred_from_id: string (nullable = true)
 |-- purchase_price: string (nullable = true)

+------------------------+-----------------------+---------------------+-------------------+--------------+
|sporting_event_ticket_id|purchased_by_id        |transaction_date_time|transferred_from_id|purchase_price|
+------------------------+-----------------------+---------------------+-------------------+--------------+
|+1.2295361000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |47.30         |
|+1.2294881000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |37.84         |
|+1.2294891000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |37.84         |
|+1.2295351000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |4

In [16]:
inputDf.write.insertInto(f"{targetDBName}.{targetTableName}")
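
`DataFrameWriter.insertInto` matches columns by position and ignores their names, so the DataFrame's column order must match the table's. A minimal sketch of a pre-insert check follows. The helper name `positional_mismatches` is hypothetical, and the column lists are copied from the schema printed above.

```python
from itertools import zip_longest


def positional_mismatches(source_columns, target_columns):
    """Return (position, source, target) for every position where names differ."""
    return [
        (index, source, target)
        for index, (source, target) in enumerate(
            zip_longest(source_columns, target_columns)
        )
        if source != target
    ]


table_columns = [
    "sporting_event_ticket_id",
    "purchased_by_id",
    "transaction_date_time",
    "transferred_from_id",
    "purchase_price",
]

# Same names, same order: nothing to report.
print(positional_mismatches(table_columns, table_columns))  # []

# Two columns swapped: both positions are reported.
swapped = [table_columns[1], table_columns[0]] + table_columns[2:]
print(positional_mismatches(swapped, table_columns))
```

In the notebook, the two lists could come from `inputDf.columns` and `spark.table(f"{targetDBName}.{targetTableName}").columns`.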


## 1.5. Read data from the created table governed by Lake Formation

In [22]:
spark.sql(f"""
    SELECT * FROM {targetDBName}.{targetTableName}
""").show(10, False)



+------------------------+-----------------------+---------------------+-------------------+--------------+
|sporting_event_ticket_id|purchased_by_id        |transaction_date_time|transferred_from_id|purchase_price|
+------------------------+-----------------------+---------------------+-------------------+--------------+
|+1.2295361000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |47.30         |
|+1.2294881000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |37.84         |
|+1.2294891000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |37.84         |
|+1.2295351000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |47.30         |
|+1.2294871000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |37.84         |
|+1.2294861000000000e+07 |+4.5526330000000000e+06|2023-01-23 06:36:25  |null               |37.84         |
|+1.2295321000000000e+07 |+4

## 1.6. Insert data into the created table governed by Lake Formation

In [31]:
rand_df = spark.read.option("header", True).csv(rawS3TablePath).sample(fraction=0.1).limit(10)


In [32]:
rand_df.createOrReplaceTempView("tmp_table")

spark.sql(f"""
    INSERT INTO {targetDBName}.{targetTableName}
    SELECT *
    FROM tmp_table
""")



DataFrame[]

## 1.7. Drop database and table governed by Lake Formation

In [33]:
spark.sql(f"""
    DROP TABLE {targetDBName2}.{targetTableName2}
""")

spark.sql(f"""
    DROP DATABASE {targetDBName2}
""")



DataFrame[]

# 2. Act as data analyst 


>**In this section, we create a Spark application with the *data analyst* EMR runtime role, which is designed as a consumer of the table that Lake Formation protects with fine-grained access control.**


## 2.1. Connect to EMR cluster via Livy as the ANALYST_ROLE

In [2]:
%%sh
source ~/.bash_profile
EMR_CLUSTER_ID=$(aws emr list-clusters --active  --query 'Clusters[?contains(Name,`emr-roadshow-runtime-role-lf`)].Id' --output text)
echo $EMR_CLUSTER_ID
echo $ANALYST_ROLE


j-1NEOGU3MXB8YT
arn:aws:iam::010261715632:role/lf-data-access-analyst


In [None]:
%sm_analytics emr connect \
--cluster-id <CLUSTER_ID> \
--auth-type Basic_Access \
--emr-execution-role-arn <ANALYST_ROLE>

In [4]:
%%sh

source ~/.bash_profile
echo "DATALAKE_BUCKET:    $DATALAKE_BUCKET"

DATALAKE_BUCKET:    lf-datalake-010261715632-us-east-1


**In the next cell, replace `<DATALAKE_BUCKET>` with the bucket name printed by the cell above.**

Example:
```
DATALAKE_BUCKET="lf-datalake-012345678901-us-east-1"
```

In [None]:
DATALAKE_BUCKET="<DATALAKE_BUCKET>"

In [5]:
import os
from pyspark.sql.functions import concat, col, lit, to_timestamp, dense_rank, desc, count, rand, when
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
outputS3Path = f"s3://{DATALAKE_BUCKET}/output/"

rawS3TablePath = f"s3://{DATALAKE_BUCKET}/raw/ticket_purchase_hist/"

targetDBName = 'sample'
targetTableName = 'ticket_purchase_hist'
targetPath = os.path.join(outputS3Path, targetDBName, targetTableName)

targetDBName2 = 'sample2'
targetTableName2 = 'ticket_purchase_hist2'
targetPath2 = os.path.join(outputS3Path, targetDBName2, targetTableName2)


## 2.2. Test the column-level permission

> Before running this cell, grant the ANALYST_ROLE permissions in Lake Formation by following the instructions in the workshop.

In [6]:
spark.sql(f"""
    SELECT * FROM {targetDBName}.{targetTableName}
""").show(10, False)


+------------------------+--------------+
|sporting_event_ticket_id|purchase_price|
+------------------------+--------------+
|+1.2295361000000000e+07 |47.30         |
|+1.2294881000000000e+07 |37.84         |
|+1.2294891000000000e+07 |37.84         |
|+1.2295351000000000e+07 |47.30         |
|+1.2294871000000000e+07 |37.84         |
|+1.2294861000000000e+07 |37.84         |
|+1.2295321000000000e+07 |47.30         |
|+1.2295341000000000e+07 |47.30         |
|+1.2295331000000000e+07 |47.30         |
|+1.2294851000000000e+07 |37.84         |
+------------------------+--------------+
only showing top 10 rows

**The output of the above cell should contain only the columns that Lake Formation grants to the analyst role.**
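
To make this check explicit, the columns the analyst sees can be compared with the table's full column list. The sketch below uses the five columns from the engineer's schema and the two columns in the output above. The helper name `hidden_columns` is hypothetical.

```python
ALL_COLUMNS = [
    "sporting_event_ticket_id",
    "purchased_by_id",
    "transaction_date_time",
    "transferred_from_id",
    "purchase_price",
]


def hidden_columns(all_columns, visible_columns):
    """Return the columns of the table that the current role cannot see."""
    visible = set(visible_columns)
    return [column for column in all_columns if column not in visible]


# Columns returned to the analyst role in the output above.
analyst_columns = ["sporting_event_ticket_id", "purchase_price"]

print(hidden_columns(ALL_COLUMNS, analyst_columns))
# ['purchased_by_id', 'transaction_date_time', 'transferred_from_id']
```

In the notebook, `analyst_columns` could be taken from the `.columns` attribute of the DataFrame returned by the `SELECT *` query.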

## 2.3. Test insert permissions

In [7]:
rand_df = spark.read.option("header", True).csv(rawS3TablePath).sample(fraction=0.1).limit(10)


In [8]:
rand_df.createOrReplaceTempView("tmp_table")

spark.sql(f"""
    INSERT INTO {targetDBName}.{targetTableName}
    SELECT *
    FROM tmp_table
""")



An error was encountered:
com.amazonaws.emr.recordserver.remote.AuthorizationException: Permission Denied: User WWGEYX2KNJRORBC2XW2IVZ422YM62EZK does not have INSERT permission on sample/ticket_purchase_hist
	at com.amazonaws.emr.recordserver.authz.AuthorizerImpl.authorize(Authorizer.scala:392)
	at com.amazonaws.emr.recordserver.handler.AuthorizationHandler.$anonfun$channelRead0$1(AuthorizationHandler.scala:33)
	at com.amazonaws.emr.recordserver.metrics.Metrics$.withMetering(Metrics.scala:49)
	at com.amazonaws.emr.recordserver.handler.AuthorizationHandler.channelRead0(AuthorizationHandler.scala:33)
	at com.amazonaws.emr.recordserver.handler.AuthorizationHandler.channelRead0(AuthorizationHandler.scala:21)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractCh

**The above cell is expected to fail with an error like `Permission Denied: User XXXX does not have INSERT permission on sample/ticket_purchase_hist`.**
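
Because the failure is the expected result here, it can help to turn it into an explicit check rather than read the stack trace. The sketch below assumes the exception's string form contains the `Permission Denied` text shown above. The helper name `is_permission_denied` and the stand-in functions are hypothetical.

```python
def is_permission_denied(action):
    """Run action; return True if it fails with a permission error, False if it succeeds."""
    try:
        action()
    except Exception as error:
        if "Permission Denied" in str(error):
            return True
        raise  # Any other failure is unexpected, so surface it unchanged.
    return False


def denied_insert():
    # Stand-in for the INSERT statement run as the analyst role.
    raise RuntimeError(
        "Permission Denied: User XXXX does not have INSERT permission "
        "on sample/ticket_purchase_hist"
    )


def allowed_insert():
    # Stand-in for an INSERT that succeeds.
    return None


print(is_permission_denied(denied_insert))   # True
print(is_permission_denied(allowed_insert))  # False
```

In the notebook, the stand-in could be replaced by a function that runs the `INSERT INTO` statement through `spark.sql`.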
