# 0. Connect to EMR Cluster with Analyst Runtime Role

<div class="alert alert-block alert-success">
In this section, we connect to the EMR cluster and create a Spark session with the *data analyst* EMR runtime role, which is set up as a Lake Formation database and table reader.
</div>

## 0.1 Install and load the SageMaker Studio analytics extension

In [None]:
#%pip uninstall sagemaker-studio-analytics-extension -y

In [None]:
#%pip install sagemaker-studio-analytics-extension==0.0.17

In [None]:
%load_ext sagemaker_studio_analytics_extension.magics

## 0.2 Get EMR cluster ID and EMR runtime role

<div class="alert alert-block alert-warning">
<b>Note:</b> If the following `sm_analytics emr connect` cell fails with the message
 <b>Warning: The Spark session does not have enough YARN resources to start.</b>
terminate unneeded Livy sessions to free cluster resources, then run the cell again.
</div>


In [None]:
%%sh

source ~/.bash_profile
EMR_CLUSTER_ID=$(aws emr list-clusters --active  --query 'Clusters[?contains(Name,`emr-bootcamp-runtime-role-lf`)].Id' --output text)
echo "ACCOUNT_ID:   $ACCOUNTID"
echo "REGION:       $REGION"
echo "CLUSTER_ID:   $EMR_CLUSTER_ID"
echo "IAM_ARN:      $ANALYST_ROLE"
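
The same lookup can be sketched in Python. This is a minimal sketch, not part of the original workshop: the helper names `find_cluster_ids` and `list_active_clusters` are hypothetical, and it assumes `boto3` is installed and AWS credentials are available wherever the code runs (for example in a `%%local` cell when the notebook uses a sparkmagic kernel). The state list mirrors what the CLI `--active` shortcut selects.

```python
def find_cluster_ids(clusters, name_fragment):
    """Return the IDs of clusters whose name contains name_fragment."""
    return [c["Id"] for c in clusters if name_fragment in c.get("Name", "")]


def list_active_clusters(region_name=None):
    """Fetch active EMR clusters with boto3 (needs AWS credentials)."""
    import boto3  # imported lazily so find_cluster_ids works without boto3

    emr = boto3.client("emr", region_name=region_name)
    # States covered by the CLI shortcut `aws emr list-clusters --active`
    active_states = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING", "TERMINATING"]
    clusters = []
    for page in emr.get_paginator("list_clusters").paginate(ClusterStates=active_states):
        clusters.extend(page["Clusters"])
    return clusters


# Example with a hand-written response instead of a live API call.
# The cluster IDs below are made up for illustration.
sample_clusters = [
    {"Id": "j-EXAMPLE1", "Name": "emr-bootcamp-runtime-role-lf-demo"},
    {"Id": "j-EXAMPLE2", "Name": "some-other-cluster"},
]
print(find_cluster_ids(sample_clusters, "emr-bootcamp-runtime-role-lf"))
```

With live credentials, `find_cluster_ids(list_active_clusters(), "emr-bootcamp-runtime-role-lf")` would be the equivalent of the shell command above.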

## 0.3 Connect to EMR cluster with runtime role and create Spark Session


In [None]:
%sm_analytics emr connect \
--cluster-id <CLUSTER_ID> \
--auth-type Basic_Access \
--emr-execution-role-arn <ANALYST_ROLE_ARN>

<div class="alert alert-block alert-warning">
<b>Note:</b> Before executing the `%%configure` cell, ensure that all placeholders are replaced.
</div>

* Replace `<ACCOUNT_ID>` with your AWS account ID
* Replace `<REGION>` with your region, e.g. `us-east-1`

In [None]:
%%configure -f
{
"conf":{
         "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension",
         "spark.sql.catalog.iceberg_catalog":"org.apache.iceberg.spark.SparkCatalog",
         "spark.sql.catalog.iceberg_catalog.warehouse":"s3://lf-datalake-<ACCOUNT_ID>-us-east-1/",
         "spark.sql.catalog.iceberg_catalog.catalog-impl":"org.apache.iceberg.aws.glue.GlueCatalog", 
         "spark.sql.catalog.iceberg_catalog.io-impl":"org.apache.iceberg.aws.s3.S3FileIO",
         "spark.sql.catalog.iceberg_catalog.glue.account-id":"<ACCOUNT_ID>",
         "spark.sql.catalog.iceberg_catalog.glue.id":"<ACCOUNT_ID>",
         "spark.sql.catalog.iceberg_catalog.client.assume-role.region":"<REGION>",
         "spark.sql.catalog.iceberg_catalog.lf.managed":"true",
    
         "spark.dynamicAllocation.enabled": "true",
         "spark.dynamicAllocation.minExecutors": "3",
         "spark.dynamicAllocation.maxExecutors": "5"
        }
}
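
A forgotten placeholder in the configuration above is easy to miss, so a small check can help before pasting the JSON into the `%%configure` cell. The following is a minimal sketch under stated assumptions: `fill_placeholders` and `unreplaced_placeholders` are hypothetical helpers, the account ID `123456789012` is a dummy value, and only two of the configuration keys are reproduced for brevity.

```python
import json
import re

# Matches placeholders such as <ACCOUNT_ID>, <REGION> or <ACCOUNT-ID>
PLACEHOLDER = re.compile(r"<[A-Z][A-Z_-]*>")


def fill_placeholders(template, values):
    """Replace each <NAME> in template with values[NAME]."""
    for name, value in values.items():
        template = template.replace(f"<{name}>", value)
    return template


def unreplaced_placeholders(text):
    """Return the sorted, de-duplicated placeholders still present in text."""
    return sorted(set(PLACEHOLDER.findall(text)))


# Shortened copy of the configuration (two keys only, for illustration)
template = json.dumps(
    {
        "conf": {
            "spark.sql.catalog.iceberg_catalog.glue.account-id": "<ACCOUNT_ID>",
            "spark.sql.catalog.iceberg_catalog.client.assume-role.region": "<REGION>",
        }
    },
    indent=2,
)

# Dummy values; substitute your own account ID and region
filled = fill_placeholders(template, {"ACCOUNT_ID": "123456789012", "REGION": "us-east-1"})

print(unreplaced_placeholders(template))  # ['<ACCOUNT_ID>', '<REGION>']
print(unreplaced_placeholders(filled))    # []
```

An empty list from `unreplaced_placeholders` indicates that the text is ready to be used as the body of the `%%configure -f` cell.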

In [None]:
from datetime import datetime
from pyspark.sql.functions import col, lit, current_timestamp, unix_timestamp, min, when, desc, split

# 1. Configure Parameters for Iceberg Data Lake

<div class="alert alert-block alert-success">
In this section, we configure the parameters for the source data and for the Iceberg database and table that will be created.
</div>

<div class="alert alert-block alert-warning">
<b>Note:</b> Replace the following parameters
</div>

* Replace `<ACCOUNT-ID>` with your AWS account ID

In [None]:
LF_S3_BUCKET_NAME = "lf-datalake-<ACCOUNT-ID>-us-east-1"


In [None]:
VERSION = 1

# source data variables
SRC_DB_NAME = "tpcparquet"
SRC_TABLE_NAME = "dl_tpc_customer"

# Iceberg variables
ICEBERG_CATALOG = "iceberg_catalog"
ICEBERG_DATABASE = f"emr_bootcamp_iceberg_db_{VERSION}"
ICEBERG_DATABASE_LOCATION = f"s3://{LF_S3_BUCKET_NAME}/{ICEBERG_DATABASE}"
ICEBERG_TABLE_NAME = f"emr_bootcamp_iceberg_sql_{SRC_TABLE_NAME}_{VERSION}"
ICEBERG_TABLE_LOCATION = f"{ICEBERG_DATABASE_LOCATION}/{ICEBERG_TABLE_NAME}"
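
To make the naming scheme concrete, the sketch below resolves the same f-strings for a dummy bucket name. The function `iceberg_names` is a hypothetical helper that only repeats the expressions from the cell above, and the account ID `123456789012` is a made-up value.

```python
def iceberg_names(bucket, src_table, version):
    """Derive the Iceberg database/table names and S3 locations."""
    database = f"emr_bootcamp_iceberg_db_{version}"
    database_location = f"s3://{bucket}/{database}"
    table = f"emr_bootcamp_iceberg_sql_{src_table}_{version}"
    return {
        "database": database,
        "database_location": database_location,
        "table": table,
        "table_location": f"{database_location}/{table}",
    }


# Dummy bucket name; the real one contains your own account ID
names = iceberg_names("lf-datalake-123456789012-us-east-1", "dl_tpc_customer", 1)
for key, value in names.items():
    print(f"{key:18} {value}")
```

Changing `VERSION` therefore changes both the database name and the table name, which keeps successive runs of the workshop from colliding.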

In [None]:
# sparkmagic SQL configs

spark.conf.set('iceberg_catalog', ICEBERG_CATALOG)
spark.conf.set('iceberg_db', ICEBERG_DATABASE)
spark.conf.set('iceberg_table_name', ICEBERG_TABLE_NAME)

print ("iceberg_catalog:              "+ICEBERG_CATALOG)
print ("iceberg_db:                   "+ICEBERG_DATABASE)
print ("iceberg_table_name:           "+ICEBERG_TABLE_NAME)
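
These settings exist so that the `%%sql` cells in section 2 can refer to `${iceberg_catalog}`, `${iceberg_db}`, and `${iceberg_table_name}`: Spark SQL replaces `${name}` references with the matching session configuration value when variable substitution is enabled. The sketch below only imitates that behaviour with Python's `string.Template`; it is an analogy for illustration, not how Spark implements it, and the values are the ones derived above for `VERSION = 1`.

```python
from string import Template

# Same keys and values that the cell above stores in the Spark session config
settings = {
    "iceberg_catalog": "iceberg_catalog",
    "iceberg_db": "emr_bootcamp_iceberg_db_1",
    "iceberg_table_name": "emr_bootcamp_iceberg_sql_dl_tpc_customer_1",
}

query = Template("SELECT * FROM ${iceberg_catalog}.${iceberg_db}.${iceberg_table_name}")

# substitute() raises KeyError if a referenced name is missing from settings
resolved = query.substitute(settings)
print(resolved)
```

The printed statement shows the fully qualified `catalog.database.table` name that the later queries are expected to run against.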

# 2. Query Iceberg table

<div class="alert alert-block alert-success">
In this section, we query the Iceberg table with the data analyst EMR runtime role to check that the three PII columns (c_customer_id, c_email_address, and c_last_name) have been excluded.
</div>
<div class="alert alert-block alert-warning">
<b>Note:</b> If you see the timeout error "RecordServerException: Could not fetch metadata after maximum number of retries", run the following cells again.
</div>

In [None]:
%%sql
-- the full table contains 7 columns

DESCRIBE TABLE ${iceberg_catalog}.${iceberg_db}.${iceberg_table_name} 

In [None]:
%%sql
-- the analyst role can see only 4 columns

SELECT *  
FROM ${iceberg_catalog}.${iceberg_db}.${iceberg_table_name}
WHERE 
    c_birth_country = 'INDIA'
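
Instead of inspecting the output by eye, the column filter can also be checked programmatically. This is a minimal sketch: `exposed_pii_columns` is a hypothetical helper, and apart from `c_birth_country` and the three PII columns named above, the column names in the example are made up for illustration.

```python
PII_COLUMNS = ["c_customer_id", "c_email_address", "c_last_name"]


def exposed_pii_columns(visible_columns, pii_columns=PII_COLUMNS):
    """Return the PII columns that are still visible (case-insensitive)."""
    visible = {name.lower() for name in visible_columns}
    return [name for name in pii_columns if name.lower() in visible]


# In the notebook, the visible columns could come from the Spark session, e.g.:
#   table = f"{ICEBERG_CATALOG}.{ICEBERG_DATABASE}.{ICEBERG_TABLE_NAME}"
#   visible_columns = spark.table(table).columns
# Here we use hand-written lists so the example runs without a cluster.
analyst_view = ["c_birth_country", "example_col_a", "example_col_b", "example_col_c"]
unfiltered_view = analyst_view + PII_COLUMNS

print(exposed_pii_columns(analyst_view))     # []
print(exposed_pii_columns(unfiltered_view))  # all three PII columns
```

An empty result for the analyst session would indicate that the Lake Formation column filter is in effect.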