# Hive Catalog
In this notebook, we set up the Hive Metastore as the Iceberg catalog.


## Importing Required Libraries
We import `SparkSession` and `os`; `os` is used to read the environment variables for the Minio access key and secret.

We also set some styling to display tables better.

In [1]:
from pyspark.sql import SparkSession
import os

# this is to better display pyspark dataframes
from IPython.core.display import HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))
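
The Minio credentials are expected to come from the environment. As a minimal sketch, assuming they are exposed as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (the names used in the Trino catalog properties later in this notebook), a small hypothetical helper, `missing_env_vars`, can report when one of them is not set:

```python
import os

# Assumed variable names; adjust them to match your environment.
REQUIRED_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_env_vars(names=REQUIRED_ENV_VARS, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this check before building the Spark session makes a missing credential visible early, rather than as an S3 error during the first write.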

## Setting up Spark Session
We set up the Spark session with the configs required to connect to the Hive Metastore. Here we register `iceberg` as the Iceberg catalog. The `s3.endpoint` and `s3.path-style-access` configs are specifically for connecting to the local Minio setup.

In [2]:
iceberg_catalog_name = "iceberg"
spark = SparkSession.builder \
  .appName("iceberg-hive") \
  .config("spark.driver.memory", "4g") \
  .config("spark.executor.memory", "4g") \
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.jars", "/opt/extra-jars/iceberg-spark-runtime.jar,/opt/extra-jars/iceberg-aws-bundle.jar") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.type", "hive") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.uri", "thrift://hive-metastore:9083") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.warehouse", "s3://warehouse/iceberg-hive/") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.endpoint", "http://minio:9000") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.path-style-access", "true") \
  .getOrCreate()


24/09/02 15:31:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
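
The catalog settings above all share the `spark.sql.catalog.<name>` prefix. The sketch below collects them in a plain dictionary through a hypothetical helper, `hive_catalog_conf`, so the expanded keys can be inspected before they are handed to the builder. The values are the ones from the cell above; the helper itself is an illustration, not part of Spark or Iceberg.

```python
def hive_catalog_conf(name, uri, warehouse, s3_endpoint):
    """Build the Spark settings for an Iceberg catalog backed by a Hive Metastore."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",
        f"{prefix}.uri": uri,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


conf = hive_catalog_conf(
    name="iceberg",
    uri="thrift://hive-metastore:9083",
    warehouse="s3://warehouse/iceberg-hive/",
    s3_endpoint="http://minio:9000",
)
# Print every expanded key so typos in the prefix are easy to spot.
for key, value in conf.items():
    print(f"{key} = {value}")
```

Each pair could then be applied with `builder.config(key, value)` in a loop, which is equivalent to the chained calls above.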


## Setting up the test parquet file as a dataframe

In [3]:
df = spark.read.parquet("file:///home/iceberg/workspace/downloaded-data/yellow_tripdata_2024-01.parquet")

Now we check the data to get an idea of the size, structure and the actual data.

In [4]:
print(f"Number of rows: {df.count()}")
print("Schema:")
df.printSchema()
print("Data:")
df.show(5)

Number of rows: 2964624
Schema:
root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp_ntz (nullable = true)
 |-- tpep_dropoff_datetime: timestamp_ntz (nullable = true)
 |-- passenger_count: long (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: long (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- Airport_fee: double (nullable = true)

Data:
+--------+--------------------+---------------------+---------------+-------------+----



## Creating Iceberg namespace under the catalog
We create the `hive` namespace (schema) under the `iceberg` catalog that we defined in the Spark session configs, and assign it the S3 (Minio) location.

In [5]:
spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg.hive LOCATION 's3://warehouse/iceberg-hive/'")

DataFrame[]

## Writing the data to Iceberg Table
Finally, we write the data to the Iceberg table.

In [6]:
df.writeTo("iceberg.hive.yellow_tripdata_2024_01").create()
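
Before stopping the session, it can be worth confirming that the table holds as many rows as the source dataframe. A minimal sketch, assuming the `spark` session and `df` from the cells above; `row_counts_match` is a hypothetical helper, not part of PySpark or Iceberg:

```python
def row_counts_match(spark, table_name, expected_rows):
    """Return True when `table_name` contains exactly `expected_rows` rows."""
    actual_rows = spark.table(table_name).count()
    return actual_rows == expected_rows


# Usage in this notebook (needs the live Spark session, so left commented out):
# row_counts_match(spark, "iceberg.hive.yellow_tripdata_2024_01", df.count())
```

The helper only relies on `spark.table(...)` and `count()`, so it reads the table back through the catalog rather than reusing the in-memory dataframe.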

                                                                                

We can now stop the Spark session, since we will be using Trino to query the data.

In [7]:
spark.stop()

We then check the data saved to Minio. 

In [8]:
!mc ls --recursive minio/warehouse/iceberg-hive

[2024-09-02 15:31:34 UTC]     0B STANDARD /
[2024-09-02 15:31:49 UTC]  16MiB STANDARD yellow_tripdata_2024_01/data/00001-16-637cf551-c56b-42e0-9b07-bf4d792a1f39-0-00001.parquet
[2024-09-02 15:31:49 UTC]  16MiB STANDARD yellow_tripdata_2024_01/data/00003-18-637cf551-c56b-42e0-9b07-bf4d792a1f39-0-00001.parquet
[2024-09-02 15:31:46 UTC]  13MiB STANDARD yellow_tripdata_2024_01/data/00006-21-637cf551-c56b-42e0-9b07-bf4d792a1f39-0-00001.parquet
[2024-09-02 15:31:50 UTC] 3.7KiB STANDARD yellow_tripdata_2024_01/metadata/00000-0bbcfcba-4820-4b7a-b7ec-e7ea66492479.metadata.json
[2024-09-02 15:31:49 UTC] 8.4KiB STANDARD yellow_tripdata_2024_01/metadata/19735b9c-3dd8-4629-8246-f5efacd5163b-m0.avro
[2024-09-02 15:31:50 UTC]

## Querying with Trino
To query the data with Trino, we first need to configure Trino to connect to the Hive catalog using the following catalog properties (these have already been set up in the Trino configuration folder):

```
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.path-style-access=true
s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
s3.region=${ENV:AWS_REGION}
```

We then use the Trino Python client, together with pandas, to read the data back. First we set up the connection:

In [9]:
from trino.dbapi import connect
import pandas as pd

conn = connect(
    host="trino",
    port=8080,
    user="user"
)

Then we read the data into a pandas dataframe.

In [11]:
df_from_trino = pd.read_sql_query('select * from "iceberg-hive".hive.yellow_tripdata_2024_01 limit 10', conn)
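
The catalog name is wrapped in double quotes in the query because it contains a hyphen, which is not valid in an unquoted SQL identifier. As a sketch, the hypothetical helpers `quote_identifier` and `qualified_name` below build the fully qualified table name and quote every part, so the same pattern keeps working if a schema or table name ever needs quoting too:

```python
def quote_identifier(name):
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified_name(catalog, schema, table):
    """Join catalog, schema and table into a quoted, dot-separated name."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


table_ref = qualified_name("iceberg-hive", "hive", "yellow_tripdata_2024_01")
query = f"select * from {table_ref} limit 10"
print(query)
```

The resulting string can be passed to `pd.read_sql_query` in place of the hand-written query.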



In [12]:
df_from_trino

Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2024-01-01 00:57:55,2024-01-01 01:17:43,1,1.72,1,N,186,79,2,17.7,1.0,0.5,0.0,0.0,1.0,22.7,2.5,0.0
1,1,2024-01-01 00:03:00,2024-01-01 00:09:36,1,1.8,1,N,140,236,1,10.0,3.5,0.5,3.75,0.0,1.0,18.75,2.5,0.0
2,1,2024-01-01 00:17:06,2024-01-01 00:35:01,1,4.7,1,N,236,79,1,23.3,3.5,0.5,3.0,0.0,1.0,31.3,2.5,0.0
3,1,2024-01-01 00:36:38,2024-01-01 00:44:56,1,1.4,1,N,79,211,1,10.0,3.5,0.5,2.0,0.0,1.0,17.0,2.5,0.0
4,1,2024-01-01 00:46:51,2024-01-01 00:52:57,1,0.8,1,N,211,148,1,7.9,3.5,0.5,3.2,0.0,1.0,16.1,2.5,0.0
5,1,2024-01-01 00:54:08,2024-01-01 01:26:31,1,4.7,1,N,148,141,1,29.6,3.5,0.5,6.9,0.0,1.0,41.5,2.5,0.0
6,2,2024-01-01 00:49:44,2024-01-01 01:15:47,2,10.82,1,N,138,181,1,45.7,6.0,0.5,10.0,0.0,1.0,64.95,0.0,1.75
7,1,2024-01-01 00:30:40,2024-01-01 00:58:40,0,3.0,1,N,246,231,2,25.4,3.5,0.5,0.0,0.0,1.0,30.4,2.5,0.0
8,2,2024-01-01 00:26:01,2024-01-01 00:54:12,1,5.44,1,N,161,261,2,31.0,1.0,0.5,0.0,0.0,1.0,36.0,2.5,0.0
9,2,2024-01-01 00:28:08,2024-01-01 00:29:16,1,0.04,1,N,113,113,2,3.0,1.0,0.5,0.0,0.0,1.0,8.0,2.5,0.0
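
pandas is convenient here, but the Trino client also follows the Python DB-API, so rows can be fetched with a plain cursor. The sketch below uses a hypothetical helper, `fetch_rows`, and demonstrates it against an in-memory `sqlite3` database purely as a stand-in, so that it runs without a Trino server. With the `conn` object created above, the same call pattern should apply, although that is an assumption to verify against the installed client version.

```python
import sqlite3


def fetch_rows(conn, sql):
    """Run `sql` on a DB-API connection and return (column_names, rows)."""
    cur = conn.cursor()
    cur.execute(sql)
    # Each entry of cursor.description starts with the column name.
    columns = [col[0] for col in cur.description]
    rows = cur.fetchall()
    return columns, rows


# Stand-in database with a tiny illustrative table; this is not the taxi data.
demo = sqlite3.connect(":memory:")
demo.execute("create table trips (vendorid integer, trip_distance real)")
demo.execute("insert into trips values (2, 1.72), (1, 1.8)")
columns, rows = fetch_rows(demo, "select * from trips order by vendorid")
print(columns, rows)
```

Returning the column names alongside the rows keeps the result easy to turn into a dataframe later, for example with `pd.DataFrame(rows, columns=columns)`.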

