<img src="https://lakefs.io/wp-content/uploads/2022/09/lakeFS-Logo.svg" alt="lakeFS logo" width=200/>

# Integration of lakeFS with Trino and Glue Catalog

[📚 Docs](https://docs.lakefs.io/integrations/presto_trino.html)

## Use Case: Isolated Dev/Test/ETL Environments

## Config

### Spark Configuration: Set the lakeFS Endpoint, Access Key, and Secret Key

In [None]:
%%configure -f
{
    "conf": {
        "spark.hadoop.fs.s3a.endpoint": "<lakeFS Endpoint URL>",
        "spark.hadoop.fs.s3a.access.key": "<lakeFS Access Key>",
        "spark.hadoop.fs.s3a.secret.key": "<lakeFS Secret Key>",
        "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
    }
}

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = '<lakeFS Endpoint URL>' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'

### Storage Information

If you are not using the lakeFS sample repository, change the storage namespace to a location in the bucket you've configured. The storage namespace is the location in the underlying storage where data for this repository will be stored.

In [None]:
storageNamespace = 's3://<Bucket Name>' # e.g. 's3://bucket'
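
The repository created later in this notebook uses `f"{storageNamespace}/{repo_name}"` as its storage namespace. As a minimal sketch of that composition, the hypothetical helper `build_storage_namespace` below (not part of lakeFS or of this notebook) joins the two parts and guards against a trailing slash or a missing `s3://` prefix:

```python
def build_storage_namespace(bucket_uri, repo_name):
    """Join a bucket URI and a repository name into a storage namespace."""
    if not bucket_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {bucket_uri!r}")
    # Drop trailing slashes so the joined path does not contain an empty segment.
    return f"{bucket_uri.rstrip('/')}/{repo_name}"


print(build_storage_namespace("s3://bucket/", "trino-glue-demo"))
# s3://bucket/trino-glue-demo
```

The check only covers the `s3://` scheme because that is the only scheme this notebook uses.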

## Install and configure [lakectl](https://docs.lakefs.io/reference/cli.html), the lakeFS command-line tool, on your computer

---

## Setup

**(You shouldn't need to change anything in this section; just run it.)**

### Glue database name

In [None]:
glueDatabaseName = "trino_glue_demo" # This notebook will create this database

### lakeFS repository name

In [None]:
repo_name = "trino-glue-demo"

### Versioning Information

In [None]:
mainBranch = "main"
etlBranch = "etl_branch"
customersTable = "customers"
ordersTable = "orders"

### Install Python libraries

In [None]:
sc.install_pypi_package("urllib3==1.25.3")

In [None]:
sc.install_pypi_package("lakefs==0.6.0")

In [None]:
sc.install_pypi_package("pyhive")

In [None]:
sc.install_pypi_package("requests")

### Import Python libraries

In [None]:
import lakefs
import os
from pyspark.sql.types import ByteType, IntegerType, LongType, StringType, StructType, StructField
from pyspark.sql.functions import *
from pyhive import trino
import requests

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Create lakeFSClient

In [None]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:
    # Catch Exception instead of using a bare except, and show the cause
    print(f"Failed to get lakeFS version: {e}")
else:
    print(f"lakeFS credentials verified\n\nlakeFS version {v}")

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

### Connect to Trino using `PyHive`

Connect to Trino with the `PyHive` library. Replace the values of `hostName`, `userName`, `schemaName`, and `catalogName` as needed for your environment. The port is set to the EMR default, 8889.

In [None]:
hostName = "127.0.0.1"
userName = "hadoop"
schemaName = "default"
catalogName = "glue"
trinoPort = 8889

headers = {
    'X-Trino-User': userName,
    'X-Trino-Schema': schemaName,
    'X-Trino-Catalog': catalogName
}

trinoSession = requests.Session()
trinoSession.headers.update(headers)

conn = trino.connect(requests_session=trinoSession,
                     host=hostName,
                     port=trinoPort
                    )

### Define some helper functions

In [None]:
def execute_trino_query(query):
    cur = conn.cursor()
    cur.execute(query)
    result = cur.fetchall()

    return result

def print_commit(log):
    from datetime import datetime, timezone
    from pprint import pprint

    print('Message:', log.message)
    print('ID:', log.id)
    print('Committer:', log.committer)
    # Format the creation timestamp as UTC (utcfromtimestamp is deprecated since Python 3.12)
    print('Creation Date:', datetime.fromtimestamp(log.creation_date, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S'))
    print('Parents:', log.parents)
    print('Metadata:')
    pprint(log.metadata)

### Create Glue Database

In [None]:
execute_trino_query(f"CREATE SCHEMA IF NOT EXISTS {glueDatabaseName} WITH (location = 's3a://{repo_name}/{mainBranch}')")

### Define CUSTOMER.csv data file schema

In [None]:
customersSchema = StructType([
  StructField("Customer_ID", IntegerType(), False),
  StructField("Country", StringType(), False),
  StructField("Gender", StringType(), False),
  StructField("Personal_ID", IntegerType(), True),
  StructField("Customer_Name", StringType(), False),
  StructField("Customer_FirstName", StringType(), False),
  StructField("Customer_LastName", StringType(), False),
  StructField("Birth_Date", StringType(), False),
  StructField("Customer_Address", StringType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Street_Number", IntegerType(), False),
  StructField("Customer_Type_ID", IntegerType(), False)
])

In [None]:
customersSchemaForGlue = "Customer_ID int, \
  Country varchar, \
  Gender varchar, \
  Personal_ID int, \
  Customer_Name varchar, \
  Customer_FirstName varchar, \
  Customer_LastName varchar, \
  Birth_Date varchar, \
  Customer_Address varchar, \
  Street_ID bigint, \
  Street_Number int, \
  Customer_Type_ID int"

### Define ORDER_FACT.csv data file schema

In [None]:
ordersSchema = StructType([
  StructField("Customer_ID", IntegerType(), False),
  StructField("Employee_ID", IntegerType(), False),
  StructField("Street_ID", LongType(), False),
  StructField("Order_Date", StringType(), False),
  StructField("Delivery_Date", StringType(), False),
  StructField("Order_ID", LongType(), False),
  StructField("Order_Type", ByteType(), False),
  StructField("Product_ID", LongType(), False),
  StructField("Quantity", ByteType(), False),
  StructField("Total_Retail_Price", StringType(), False),
  StructField("CostPrice_Per_Unit", StringType(), False),
  StructField("Discount", LongType(), True)
])

In [None]:
ordersSchemaForGlue = "Customer_ID int, \
  Employee_ID int, \
  Street_ID bigint, \
  Order_Date varchar, \
  Delivery_Date varchar, \
  Order_ID bigint, \
  Order_Type int, \
  Product_ID bigint, \
  Quantity int, \
  Total_Retail_Price varchar, \
  CostPrice_Per_Unit varchar, \
  Discount bigint"
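
Each data file above has two hand-written schemas: a Spark `StructType` and a column string for Glue. As a hedged sketch of one way to keep the two in sync, the hypothetical helper `to_trino_columns` renders the column string from (name, Spark type name) pairs. The mapping mirrors the choices made in this notebook and is not a general Spark-to-Trino type mapping:

```python
# Mapping that mirrors this notebook's schemas
# (for example, Spark ByteType columns are declared as int for Trino).
SPARK_TO_TRINO = {
    "IntegerType": "int",
    "LongType": "bigint",
    "StringType": "varchar",
    "ByteType": "int",
}


def to_trino_columns(columns):
    """Render (column name, Spark type name) pairs as a Trino column list."""
    return ", ".join(
        f"{name} {SPARK_TO_TRINO[spark_type]}" for name, spark_type in columns
    )


# A few columns from ORDER_FACT.csv, for illustration only.
sample_columns = [
    ("Customer_ID", "IntegerType"),
    ("Order_ID", "LongType"),
    ("Order_Date", "StringType"),
    ("Quantity", "ByteType"),
]

print(to_trino_columns(sample_columns))
# Customer_ID int, Order_ID bigint, Order_Date varchar, Quantity int
```

An unknown type name raises `KeyError`, which makes a missing mapping visible immediately instead of producing invalid DDL.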

---

# Main demo starts here 🚦 👇🏻

This demo uses the [Orion Star - Sports and outdoors RDBMS dataset](https://www.kaggle.com/datasets/chethanp11/orion-star-sports-and-outdoors-rdbms-dataset) from [Kaggle](https://www.kaggle.com/).

## Run the following command on your computer to clone the lakeFS samples repo, which includes the sample data used by this notebook:

### `git clone https://github.com/treeverse/lakeFS-samples.git`

## Print the upload command, then run it on your computer to upload the sample data to the lakeFS repository

In [None]:
print(f"cd lakeFS-samples && lakectl fs upload -s ./data/OrionStar lakefs://{repo_name}/{mainBranch}/data/OrionStar --recursive && lakectl commit lakefs://{repo_name}/{mainBranch} -m 'Uploaded sample data'")

## Create Customers table in the main branch (using [CUSTOMER.csv](https://github.com/treeverse/lakeFS-samples/blob/040ce6fd2a2f45bd991dd17c8e9ad1d88887cdae/data/OrionStar/CUSTOMER.csv) file)

#### Register table in Glue catalog

In [None]:
customersTablePath = f"s3a://{repo_name}/{mainBranch}/{customersTable}"

execute_trino_query(f" \
          CREATE TABLE IF NOT EXISTS {glueDatabaseName}.{customersTable}( \
              {customersSchemaForGlue} \
          ) \
          WITH ( \
              format = 'Parquet', \
              external_location = '{customersTablePath}' \
              ) \
          ")

#### Read CSV file and write data to Customers table in the main branch

In [None]:
df = spark.read.csv(f"s3a://{repo_name}/{mainBranch}/data/OrionStar/CUSTOMER.csv",header=True,schema=customersSchema)
df.write.format("parquet").mode("append").save(f"{customersTablePath}")
df.show(10)
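
The `CREATE TABLE` statement used to register the Customers table is repeated in this notebook for each table and each branch, with only the schema, table name, column list, and location changing. A minimal sketch of factoring that out is shown below. `build_create_table_sql` is a hypothetical helper, and the argument values are examples taken from this notebook's defaults:

```python
def build_create_table_sql(schema, table, columns, location, file_format="Parquet"):
    """Build the CREATE TABLE statement for an external table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {schema}.{table} ({columns}) "
        f"WITH (format = '{file_format}', external_location = '{location}')"
    )


sql = build_create_table_sql(
    schema="trino_glue_demo",
    table="customers",
    columns="Customer_ID int, Country varchar",  # shortened column list
    location="s3a://trino-glue-demo/main/customers",
)
print(sql)
```

The resulting string could be passed to `execute_trino_query` in place of the inline f-string.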

## Create Orders table in the main branch (using [ORDER_FACT.csv](https://github.com/treeverse/lakeFS-samples/blob/040ce6fd2a2f45bd991dd17c8e9ad1d88887cdae/data/OrionStar/ORDER_FACT.csv) file)

#### Register table in Glue catalog

In [None]:
ordersTablePath = f"s3a://{repo_name}/{mainBranch}/{ordersTable}"

execute_trino_query(f" \
          CREATE TABLE IF NOT EXISTS {glueDatabaseName}.{ordersTable}( \
              {ordersSchemaForGlue} \
          ) \
          WITH ( \
              format = 'Parquet', \
              external_location = '{ordersTablePath}' \
              ) \
          ")

#### Read CSV file and write to Orders table in the main branch

In [None]:
df = spark.read.csv(f"s3a://{repo_name}/{mainBranch}/data/OrionStar/ORDER_FACT.csv",header=True,schema=ordersSchema)
df.write.format("parquet").mode("append").save(f"{ordersTablePath}")
df.show(10)

## Commit changes and attach some metadata

In [None]:
ref = branchMain.commit(message='Added customers and orders tables!', 
        metadata={'using': 'python_api'})
print_commit(ref.get_commit())

## Execute Trino queries to read the data

In [None]:
spark.createDataFrame(execute_trino_query(f'SELECT * FROM "{glueDatabaseName}"."{customersTable}"'), schema=customersSchema).show(10) 

In [None]:
spark.createDataFrame(execute_trino_query(f'SELECT * FROM "{glueDatabaseName}"."{ordersTable}"'), schema=ordersSchema).show(10) 
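
The queries above wrap the schema and table names in double quotes, which is how Trino delimits identifiers. As a small sketch, the hypothetical helpers `quote_ident` and `qualified_name` build such a name and double any embedded quote character:

```python
def quote_ident(name):
    """Wrap an identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def qualified_name(schema, table):
    """Return a quoted schema.table reference."""
    return f"{quote_ident(schema)}.{quote_ident(table)}"


print(f"SELECT * FROM {qualified_name('trino_glue_demo', 'customers')}")
# SELECT * FROM "trino_glue_demo"."customers"
```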

# 🟢 ETL Job Starts

## Create an ETL branch

In [None]:
branchETL = repo.branch(etlBranch).create(source_reference=mainBranch, exist_ok=True)
print(f"{etlBranch} ref:", branchETL.get_commit().id)

### Create Glue Database for the ETL branch

In [None]:
execute_trino_query(f"CREATE SCHEMA IF NOT EXISTS {glueDatabaseName}_{etlBranch} WITH (location = 's3a://{repo_name}/{etlBranch}')")
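
The cell above gives the ETL branch its own Glue schema: the schema name is the base name plus the branch name, and its location is the branch path in the lakeFS repository. The sketch below captures that naming pattern in two hypothetical helpers. Replacing hyphens with underscores is an added assumption so that the schema name stays usable without quoting; the notebook's `etl_branch` does not need it:

```python
def branch_schema_name(base_schema, branch, main_branch="main"):
    """Return the Glue schema name used for a given lakeFS branch."""
    if branch == main_branch:
        return base_schema
    # Assumption: replace hyphens so the name is a valid unquoted identifier.
    return f"{base_schema}_{branch.replace('-', '_')}"


def branch_location(repo_name, branch, path=""):
    """Return the s3a:// location of a path on a lakeFS branch."""
    location = f"s3a://{repo_name}/{branch}"
    return f"{location}/{path}" if path else location


print(branch_schema_name("trino_glue_demo", "etl_branch"))
# trino_glue_demo_etl_branch
print(branch_location("trino-glue-demo", "etl_branch", "customers"))
# s3a://trino-glue-demo/etl_branch/customers
```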

### Register tables in Glue catalog for the ETL branch

In [None]:
customersTablePathETLBranch = f"s3a://{repo_name}/{etlBranch}/{customersTable}"

execute_trino_query(f" \
          CREATE TABLE IF NOT EXISTS {glueDatabaseName}_{etlBranch}.{customersTable}( \
              {customersSchemaForGlue} \
          ) \
          WITH ( \
              format = 'Parquet', \
              external_location = '{customersTablePathETLBranch}' \
              ) \
          ")

In [None]:
ordersTablePathETLBranch = f"s3a://{repo_name}/{etlBranch}/{ordersTable}"

execute_trino_query(f" \
          CREATE TABLE IF NOT EXISTS {glueDatabaseName}_{etlBranch}.{ordersTable}( \
              {ordersSchemaForGlue} \
          ) \
          WITH ( \
              format = 'Parquet', \
              external_location = '{ordersTablePathETLBranch}' \
              ) \
          ")

## Execute Trino query to insert Customers data in the ETL branch

In [None]:
execute_trino_query(f"INSERT INTO {glueDatabaseName}_{etlBranch}.{customersTable} VALUES (1,'US','M',2,'Scott Gibbs','Scott','Gibbs','12APR1970','556 Greywood Rd',9260103713,1068,1030)")

## Execute Trino query to read Customers data from the ETL branch

In [None]:
spark.createDataFrame(execute_trino_query(f'SELECT * FROM "{glueDatabaseName}_{etlBranch}"."{customersTable}" ORDER BY Customer_ID'), schema=customersSchema).show(10) 

## Execute Trino query to read Customers data from the main branch

In [None]:
spark.createDataFrame(execute_trino_query(f'SELECT * FROM "{glueDatabaseName}"."{customersTable}" ORDER BY Customer_ID'), schema=customersSchema).show(10) 
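
Comparing the two query results shows the isolation: the inserted row appears in the ETL branch but not in main. A minimal sketch of that comparison follows, using a hypothetical helper `rows_only_in`. The rows are made-up two-column placeholders, not real query output; actual rows would come from `execute_trino_query`:

```python
def rows_only_in(left_rows, right_rows):
    """Return rows present in left_rows but not in right_rows, in order."""
    right = {tuple(row) for row in right_rows}
    return [tuple(row) for row in left_rows if tuple(row) not in right]


# Made-up placeholder rows (Customer_ID, Country), for illustration only.
main_rows = [(4, "US"), (5, "US")]
etl_rows = [(1, "US"), (4, "US"), (5, "US")]

print(rows_only_in(etl_rows, main_rows))  # [(1, 'US')]
print(rows_only_in(main_rows, etl_rows))  # []
```

Rows are converted to tuples first so the helper also works if the cursor returns lists.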

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack