# Iceberg with Spark and Rest-Java Catalog
This notebook uses Iceberg with Spark, with a REST service as the catalog. It mostly follows the official Iceberg tutorial and uses the provided Rest-Java catalog.

## Importing Required Libraries
We import the same libraries as before.

In [1]:
from pyspark.sql import SparkSession
from timeit import default_timer as timer
import os

# Stop <pre> output from wrapping so wide PySpark DataFrames stay readable
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

## Setting up Spark Session
We set up the Spark session with the configs required to connect to the REST catalog. The Iceberg catalog is registered under the name `iceberg_rest`.

In [2]:
iceberg_catalog_name = "iceberg_rest"
spark = SparkSession.builder \
  .appName("iceberg-rest-java") \
  .config("spark.driver.memory", "4g") \
  .config("spark.executor.memory", "4g") \
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.jars", "/opt/extra-jars/iceberg-spark-runtime.jar,/opt/extra-jars/iceberg-aws-bundle.jar") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.type", "rest") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.uri", "http://iceberg-rest-java:8181") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.warehouse", "s3://warehouse/iceberg-rest/") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.endpoint", "http://minio:9000") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.path-style-access", "true") \
  .getOrCreate()

24/08/01 08:35:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
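The catalog settings above all share the prefix `spark.sql.catalog.<name>`, so they can be generated from one place. The following is a minimal sketch of that idea, not part of the original notebook: the helper name `rest_catalog_conf` and its parameters are hypothetical, and the values are the ones used in the cell above.

```python
def rest_catalog_conf(name, uri, warehouse, s3_endpoint):
    """Return Spark config key/value pairs for an Iceberg REST catalog.

    Hypothetical helper: it only builds the dictionary and does not
    touch Spark itself.
    """
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


# Same values as in the session setup above
conf = rest_catalog_conf(
    "iceberg_rest",
    "http://iceberg-rest-java:8181",
    "s3://warehouse/iceberg-rest/",
    "http://minio:9000",
)
for key, value in conf.items():
    print(key, "=", value)
```

Each pair could then be applied to the builder in a loop with `.config(key, value)` before calling `.getOrCreate()`, which keeps the catalog name in a single argument.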


## Setting up the test parquet file as a dataframe

In [3]:
cc_main_df = spark.read.parquet("file:///home/iceberg/workspace/downloaded-data/cc-main-2024-06-26-warc-part-00000.parquet")


In [4]:
cc_main_df.show(5)

24/08/01 08:35:57 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+--------------------+--------------------+-----------------+------------+----------------------+----------------------+----------------------+----------------------+------------------------+--------------------------+-----------------------+-----------------------+----------------------+------------+--------+--------------------+---------+-------------------+------------+--------------+--------------------+-----------------+---------------------+---------------+-----------------+-----------------+--------------------+------------------+------------------+----------------+
|         url_surtkey|                 url|    url_host_name|url_host_tld|url_host_2nd_last_part|url_host_3rd_last_part|url_host_4th_last_part|url_host_5th_last_part|url_host_registry_suffix|url_host_registered_domain|url_host_private_suffix|url_host_private_domain|url_host_name_reversed|url_protocol|url_port|            url_path|url_query|         fetch_time|fetch_status|fetch_redirect|      content_digest|content_m

24/08/01 08:36:00 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


## Creating Iceberg namespace under the catalog
We create a namespace (schema) under the `iceberg_rest` catalog that we configured in the Spark session, to separate the data saved through the Rest-Java catalog.

In [5]:
spark.sql("create namespace iceberg_rest.java")

DataFrame[]
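Running the `create namespace` cell a second time would typically fail because the namespace already exists. A small sketch of an idempotent variant follows, using Spark SQL's `IF NOT EXISTS` clause. The helper name `create_namespace_sql` is hypothetical and only builds the statement text.

```python
def create_namespace_sql(catalog, namespace):
    """Build an idempotent CREATE NAMESPACE statement (hypothetical helper)."""
    return f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{namespace}"


stmt = create_namespace_sql("iceberg_rest", "java")
print(stmt)
# In the notebook, the statement would be executed with: spark.sql(stmt)
```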

## Writing the data to Iceberg Table
Finally, we write the data to an Iceberg table and time the write.

In [6]:
start = timer()
cc_main_df.writeTo("iceberg_rest.java.cc_main_2024_26_part_00000").create()
end = timer()
print(end - start)


52.386883415005286
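The start/end timing pattern above can be wrapped in a context manager so that several steps are timed the same way. This is a sketch under assumptions: the name `timed` and the `timings` dictionary are hypothetical, and a trivial computation stands in for the Iceberg write so the example runs without Spark.

```python
from contextlib import contextmanager
from timeit import default_timer as timer


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = timer()
    try:
        yield
    finally:
        results[label] = timer() - start


timings = {}
with timed("write", timings):
    # Stand-in workload; in the notebook this would be the
    # cc_main_df.writeTo(...).create() call
    sum(range(1000))

print(timings)
```

Note that `create()` fails if the table already exists, so when rerunning the timing cell, `createOrReplace()` on the same writer is one option.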
