# Iceberg with Spark and JDBC Catalog
This notebook uses Iceberg with Spark, with the Iceberg tables tracked in a JDBC catalog (backed here by PostgreSQL).

## Importing Required Libraries
We import the same libraries as before.

In [3]:
from pyspark.sql import SparkSession
from timeit import default_timer as timer
import os

# Keep wide PySpark DataFrame output on one line instead of wrapping it
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

## Setting up Spark Session
We set up the Spark session with the configs required to connect to the JDBC catalog. The Iceberg catalog is registered under the name `iceberg_jdbc`.

In [4]:
iceberg_catalog_name = "iceberg_jdbc"
spark = SparkSession.builder \
  .appName("iceberg-jdbc") \
  .config("spark.driver.memory", "4g") \
  .config("spark.executor.memory", "4g") \
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.jars", "/opt/extra-jars/iceberg-spark-runtime.jar,/opt/extra-jars/iceberg-aws-bundle.jar,/opt/extra-jars/postgresql.jar") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.type", "jdbc") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.uri", "jdbc:postgresql://postgres:5432/postgres?currentSchema=iceberg") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.jdbc.user", "postgres") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.jdbc.password", "postgres") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.jdbc.schema-version", "V1") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.warehouse", "s3://warehouse/iceberg/") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.endpoint", "http://minio:9000") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.path-style-access", "true") \
  .getOrCreate()
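
The catalog settings above all share the prefix `spark.sql.catalog.<catalog name>`. As a minimal sketch, the core entries can be generated from one place. The helper name `jdbc_catalog_conf` is hypothetical and not part of Spark or Iceberg; it only builds a plain dictionary and covers just the JDBC-specific keys, not the S3 settings.

```python
def jdbc_catalog_conf(catalog_name, uri, user, password, warehouse):
    """Return the core Spark config entries for an Iceberg JDBC catalog.

    Hypothetical helper: it mirrors the keys used in the cell above and
    leaves out the S3/MinIO settings.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "jdbc",
        f"{prefix}.uri": uri,
        f"{prefix}.jdbc.user": user,
        f"{prefix}.jdbc.password": password,
        f"{prefix}.warehouse": warehouse,
    }


conf = jdbc_catalog_conf(
    "iceberg_jdbc",
    "jdbc:postgresql://postgres:5432/postgres?currentSchema=iceberg",
    "postgres",
    "postgres",
    "s3://warehouse/iceberg/",
)

# Print the entries, masking the password so it does not end up in notebook output.
for key, value in conf.items():
    print(key, "=", "***" if key.endswith(".password") else value)
```

Each entry could then be passed to the builder in a loop with `.config(key, value)` before calling `.getOrCreate()`.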

## Loading the test Parquet file as a DataFrame

In [5]:
cc_main_df = spark.read.parquet("file:///home/iceberg/workspace/downloaded-data/cc-main-2024-06-26-warc-part-00000.parquet")

                                                                                

In [6]:
cc_main_df.show(5)

24/08/02 07:14:03 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+--------------------+--------------------+-----------------+------------+----------------------+----------------------+----------------------+----------------------+------------------------+--------------------------+-----------------------+-----------------------+----------------------+------------+--------+--------------------+---------+-------------------+------------+--------------+--------------------+-----------------+---------------------+---------------+-----------------+-----------------+--------------------+------------------+------------------+----------------+
|         url_surtkey|                 url|    url_host_name|url_host_tld|url_host_2nd_last_part|url_host_3rd_last_part|url_host_4th_last_part|url_host_5th_last_part|url_host_registry_suffix|url_host_registered_domain|url_host_private_suffix|url_host_private_domain|url_host_name_reversed|url_protocol|url_port|            url_path|url_query|         fetch_time|fetch_status|fetch_redirect|      content_digest|content_m
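
The DataFrame has too many columns for `show()` to render readably. One option, sketched below, is to preview only a few columns and print one field per line with `vertical=True`. The helper name `preview` is hypothetical, and the column names in the usage note are taken from the header row of the output above.

```python
def preview(df, columns, n=5):
    """Show the first n rows of the chosen columns, one field per line.

    `df` is expected to be a PySpark DataFrame; `preview` is a hypothetical
    convenience wrapper around select() and show().
    """
    df.select(*columns).show(n, truncate=False, vertical=True)
```

For example, `preview(cc_main_df, ["url", "fetch_time", "fetch_status"])` would print those three fields for the first five rows.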

## Creating Iceberg namespace under the catalog
We create a namespace (schema) under the `iceberg_jdbc` catalog that we configured in the Spark session, to keep the data written through the JDBC catalog separate.

In [7]:
spark.sql("create namespace iceberg_jdbc.jdbc")

24/08/02 07:14:08 WARN JdbcCatalog: JDBC catalog is initialized without view support. To auto-migrate the database's schema and enable view support, set jdbc.schema-version=V1


DataFrame[]
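
Running the `create namespace` cell a second time fails because the namespace already exists. A minimal sketch of an idempotent variant, assuming Spark SQL's `IF NOT EXISTS` clause is acceptable here; the helper name `ensure_namespace` is hypothetical.

```python
def ensure_namespace(spark, catalog, namespace):
    """Create catalog.namespace if it is missing and return the qualified name.

    `spark` is expected to be a SparkSession; `ensure_namespace` is a
    hypothetical helper, not part of Spark or Iceberg.
    """
    qualified = f"{catalog}.{namespace}"
    spark.sql(f"CREATE NAMESPACE IF NOT EXISTS {qualified}")
    return qualified
```

In this notebook it would be called as `ensure_namespace(spark, iceberg_catalog_name, "jdbc")`, which also avoids hardcoding the catalog name a second time.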

## Writing the data to Iceberg Table
Finally, we write the data to an Iceberg table and time the write.

In [8]:
start = timer()
cc_main_df.writeTo("iceberg_jdbc.jdbc.cc_main_2024_26_part_00000").create()
end = timer()
print(end - start)

                                                                                

54.81681328100967
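
The start/end timing pattern above can be wrapped in a small context manager so the same measurement is easy to reuse across cells. This is a sketch only: the name `timed` is hypothetical, and the workload in the example is a stand-in for the Iceberg write.

```python
from contextlib import contextmanager
from timeit import default_timer as timer


@contextmanager
def timed(label):
    """Measure and print the wall-clock seconds spent inside the with-block."""
    result = {"seconds": None}
    start = timer()
    try:
        yield result
    finally:
        # Record the elapsed time even if the block raises.
        result["seconds"] = timer() - start
        print(f"{label}: {result['seconds']:.2f} s")


# Stand-in workload; in the notebook this would be the writeTo(...).create() call.
with timed("sum of squares") as t:
    total = sum(i * i for i in range(100_000))
```

Note that a single run like the one above includes one-off costs such as session warm-up, so repeated runs would give a more representative number.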
