# Iceberg with Spark and Hive Metastore Catalog
Here we are using Iceberg with Spark, and using Hive Metastore as the Catalog. This is the OC catalog.


## Importing Required Libraries
We will be importing `SparkSession` and `os`, which is used to read environment variable for the Minio access key and secret.

We also set some styling to display tables better.

In [2]:
from pyspark.sql import SparkSession
import os

# this is to better display pyspark dataframes
from IPython.core.display import HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

## Setting up Spark Session
We set up Spark Session with with the configs required to connect to the Hive Metastore. Here we are setting up `spark_iceberg_hive` as the iceberg catalog

In [3]:
iceberg_catalog_name = "spark_iceberg_hive"
spark = SparkSession.builder \
  .appName("iceberg-hive") \
  .config("spark.driver.memory", "4g") \
  .config("spark.executor.memory", "4g") \
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.jars", "/opt/extra-jars/iceberg-spark-runtime.jar,/opt/extra-jars/iceberg-aws-bundle.jar") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.type", "hive") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.uri", "thrift://hive-metastore:9083") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.warehouse", "s3://warehouse/spark-iceberg-hive/") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.endpoint", "http://minio:9000") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.path-style-access", "true") \
  .getOrCreate()


24/08/24 16:35:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/08/24 16:35:54 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


## Setting up the test parquet file as a dataframe

In [4]:
df = spark.read.parquet("file:///home/iceberg/workspace/downloaded-data/yellow_tripdata_2024-01.parquet")

                                                                                

Now we check the data to get an idea of the size, structure and the actual data.

In [7]:
print(f"Number of rows: {df.count()}")
print("Schema:")
df.printSchema()
print("Data:")
df.show(5)

Number of rows: 2964624
Schema:
root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp_ntz (nullable = true)
 |-- tpep_dropoff_datetime: timestamp_ntz (nullable = true)
 |-- passenger_count: long (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: long (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- Airport_fee: double (nullable = true)

Data:
+--------+--------------------+---------------------+---------------+-------------+----

## Creating Iceberg namespace under the catalog
We create the namespace (schema) under the iceberg catalog `spark_iceberg_hive` we created in the Spark Session configs, and assign the s3 (minio) location.

In [6]:
spark.sql("create namespace spark_iceberg_hive.spark_iceberg_hive LOCATION 's3://warehouse/spark-iceberg-hive/'")

DataFrame[]

## Writing the data to Iceberg Table
Finally, writing the data to the Iceberg table and timing it.

In [7]:
start = timer()
cc_main_df.writeTo("spark_iceberg_hive.spark_iceberg_hive.cc_main_2024_26_part_00000").create()
end = timer()
print(end - start)

                                                                                

58.95098574800022
