# REST Catalog
Now let's have a look at the REST catalog. Here we will use the [Python REST Catalog by Kevin Liu](https://github.com/kevinjqliu/iceberg-rest-catalog), which uses PyIceberg internally to proxy a SQL catalog. Because of this, we will run the tests slightly differently: we will set up this catalog to proxy the JDBC catalog created previously, and read the data that was written into it.

## Catalog Configuration
As this REST catalog is a proxy for a JDBC/SQL catalog, we need to configure it so that it can connect to the JDBC catalog we created. This is done through environment variables on the container, as set up in the Docker Compose file:

```
    environment:
      CATALOG_NAME: iceberg
      CATALOG_JDBC_URI: postgresql://postgres:postgres@postgres:5432/iceberg
      CATALOG_WAREHOUSE: s3://warehouse/iceberg-jdbc/
      CATALOG_S3_ENDPOINT: http://minio:9000
      AWS_ACCESS_KEY_ID: admin
      AWS_SECRET_ACCESS_KEY: password
      AWS_REGION: us-east-1
```
Of particular importance is `CATALOG_NAME`, which has to match the name we set when creating the JDBC catalog using Spark.
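
Since a typo in `CATALOG_JDBC_URI` is an easy way to end up with a catalog that cannot reach Postgres, it can help to check what the URI actually resolves to. The following is a minimal sketch using only the Python standard library; the helper name `describe_jdbc_uri` is hypothetical and is not part of the REST catalog project.

```python
from urllib.parse import urlsplit


def describe_jdbc_uri(uri: str) -> dict:
    """Split a SQL catalog URI into its connection parts (hypothetical helper)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        # The path is "/<database>", so drop the leading slash.
        "database": parts.path.lstrip("/"),
    }


# Same value as CATALOG_JDBC_URI in the Docker Compose snippet above.
info = describe_jdbc_uri("postgresql://postgres:postgres@postgres:5432/iceberg")
print(info)
```

The host here is the Docker Compose service name, so it only resolves from inside the Compose network.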

## Importing Required Libraries
As before, we import all the necessary libraries and set up the display styling.

In [1]:
from pyspark.sql import SparkSession
from trino.dbapi import connect
import pandas as pd

# this is to better display pyspark and pandas dataframes
from IPython.core.display import HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

pd.set_option('display.max_colwidth', None)

## Setting up Spark Session
Detailed docs of the Spark configs to use with the REST catalog can be found [here](https://iceberg.apache.org/docs/latest/spark-configuration/).
We will set up `iceberg` as the catalog name, but this only serves as a reference on the Spark side. The REST catalog container has already been set up with its own environment variables to connect to the JDBC catalog, so that is what it will use.

Here we only need to configure the REST catalog URL and the MinIO-specific configs, since the connection to Postgres is handled by the REST catalog.
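
Before building the session, it may be useful to confirm that the REST catalog answers on that URL. The sketch below builds the `/v1/config` and `/v1/namespaces/{namespace}/tables` paths defined by the Iceberg REST catalog specification. The helper names `rest_url` and `get_json` are hypothetical, and whether this particular server implements every endpoint is an assumption.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def rest_url(base_uri: str, *segments: str) -> str:
    """Join path segments under the /v1 prefix of a REST catalog (hypothetical helper)."""
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base_uri.rstrip('/')}/v1/{path}"


def get_json(url: str, timeout: float = 5.0):
    """Fetch a URL and decode the JSON body. Needs network access to the catalog."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


base = "http://iceberg-rest-catalog:8000"
config_url = rest_url(base, "config")
tables_url = rest_url(base, "namespaces", "jdbc", "tables")
print(config_url)
print(tables_url)
```

From a container on the same Docker network, `get_json(config_url)` should return the catalog's default and override properties, and `get_json(tables_url)` should list the tables in the `jdbc` namespace.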

In [2]:
iceberg_catalog_name = "iceberg"
spark = SparkSession.builder \
  .appName("iceberg-rest") \
  .config("spark.driver.memory", "4g") \
  .config("spark.executor.memory", "4g") \
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.jars", "/opt/extra-jars/iceberg-spark-runtime.jar,/opt/extra-jars/iceberg-aws-bundle.jar") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.type", "rest") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.uri", "http://iceberg-rest-catalog:8000") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.warehouse", "s3://warehouse/iceberg-jdbc/") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.endpoint", "http://minio:9000") \
  .config(f"spark.sql.catalog.{iceberg_catalog_name}.s3.path-style-access", "true") \
  .getOrCreate()

24/09/21 03:15:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/09/21 03:15:41 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
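
The catalog settings in the cell above all share the `spark.sql.catalog.<name>` prefix, so they can also be generated from a small function. This is a sketch of one way to organise them, assuming the same values as above; `rest_catalog_conf` is a hypothetical helper, not a Spark or Iceberg API.

```python
def rest_catalog_conf(name: str, uri: str, warehouse: str, s3_endpoint: str) -> dict:
    """Build the Spark properties for an Iceberg REST catalog (hypothetical helper)."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


conf = rest_catalog_conf(
    name="iceberg",
    uri="http://iceberg-rest-catalog:8000",
    warehouse="s3://warehouse/iceberg-jdbc/",
    s3_endpoint="http://minio:9000",
)
for key, value in conf.items():
    print(f"{key} = {value}")
```

Each pair can then be applied with `builder = builder.config(key, value)` in a loop before calling `getOrCreate()`, which keeps the catalog name in a single place.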


## Reading Data Using Spark

So far we have only seen how to write data with Spark. We can use this opportunity to test reading data from Iceberg with Spark, using the `spark.table()` [method](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.table.html):

In [9]:
df = spark.table("iceberg.jdbc.yellow_tripdata")
df.show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       2| 2024-01-31 23:59:53|  2024-02-01 00:18:35|              1|         6.95|         1|                 N|         249|         166|           1|       30.3|  1.0|    0.5|      7.0
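
The table name passed to Spark follows the `catalog.namespace.table` pattern, and Iceberg also exposes metadata tables such as `snapshots` by appending one more part to the name. The sketch below builds both identifiers and wraps a Spark SQL read; `table_identifier` and `preview` are hypothetical helpers, and `preview` needs the live `spark` session created earlier.

```python
def table_identifier(catalog, namespace, table, metadata_table=None):
    """Build a dotted identifier, optionally for an Iceberg metadata table."""
    parts = [catalog, namespace, table]
    if metadata_table is not None:
        parts.append(metadata_table)
    return ".".join(parts)


def preview(spark, identifier, limit=5):
    """Return the first rows of a table through Spark SQL (requires a SparkSession)."""
    return spark.sql(f"SELECT * FROM {identifier} LIMIT {int(limit)}")


data_table = table_identifier("iceberg", "jdbc", "yellow_tripdata")
snapshots_table = table_identifier("iceberg", "jdbc", "yellow_tripdata", "snapshots")
print(data_table)
print(snapshots_table)
```

With a running session, `preview(spark, snapshots_table).show()` would be one way to inspect the snapshots that earlier writes produced.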

## Querying with Trino
The configuration required to query through Trino is the [REST Catalog configs](https://trino.io/docs/current/object-storage/metastores.html#rest-catalog), which have been set up in our Trino deployment:

```
# iceberg-rest.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-rest-catalog:8000
iceberg.rest-catalog.warehouse=s3://warehouse/iceberg-jdbc/
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.path-style-access=true
s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
s3.region=${ENV:AWS_REGION}
```

As before, we set up the Trino Python client, run the queries, and load the results into a pandas DataFrame.

In [2]:
conn = connect(
    host="trino",
    port=8080,
    user="user"
)

In [5]:
# The Trino catalog name comes from the properties file name (iceberg-rest.properties)
df_from_trino = pd.read_sql_query('select * from "iceberg-rest".jdbc.yellow_tripdata limit 10', conn)
df_from_trino


Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2024-01-01 00:57:55,2024-01-01 01:17:43,1,1.72,1,N,186,79,2,17.7,1.0,0.5,0.0,0.0,1.0,22.7,2.5,0.0
1,1,2024-01-01 00:03:00,2024-01-01 00:09:36,1,1.8,1,N,140,236,1,10.0,3.5,0.5,3.75,0.0,1.0,18.75,2.5,0.0
2,1,2024-01-01 00:17:06,2024-01-01 00:35:01,1,4.7,1,N,236,79,1,23.3,3.5,0.5,3.0,0.0,1.0,31.3,2.5,0.0
3,1,2024-01-01 00:36:38,2024-01-01 00:44:56,1,1.4,1,N,79,211,1,10.0,3.5,0.5,2.0,0.0,1.0,17.0,2.5,0.0
4,1,2024-01-01 00:46:51,2024-01-01 00:52:57,1,0.8,1,N,211,148,1,7.9,3.5,0.5,3.2,0.0,1.0,16.1,2.5,0.0
5,1,2024-01-01 00:54:08,2024-01-01 01:26:31,1,4.7,1,N,148,141,1,29.6,3.5,0.5,6.9,0.0,1.0,41.5,2.5,0.0
6,2,2024-01-01 00:49:44,2024-01-01 01:15:47,2,10.82,1,N,138,181,1,45.7,6.0,0.5,10.0,0.0,1.0,64.95,0.0,1.75
7,1,2024-01-01 00:30:40,2024-01-01 00:58:40,0,3.0,1,N,246,231,2,25.4,3.5,0.5,0.0,0.0,1.0,30.4,2.5,0.0
8,2,2024-01-01 00:26:01,2024-01-01 00:54:12,1,5.44,1,N,161,261,2,31.0,1.0,0.5,0.0,0.0,1.0,36.0,2.5,0.0
9,2,2024-01-01 00:28:08,2024-01-01 00:29:16,1,0.04,1,N,113,113,2,3.0,1.0,0.5,0.0,0.0,1.0,8.0,2.5,0.0
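
pandas documents `read_sql_query` for SQLAlchemy connectables and `sqlite3` connections, and may emit a warning for other DB-API connections such as the Trino client. One alternative is to go through the DB-API cursor directly. The sketch below assumes only the standard cursor interface (`execute`, `description`, `fetchall`) and is demonstrated against an in-memory SQLite table with made-up rows so that it runs without a Trino server; `rows_as_dicts` is a hypothetical helper.

```python
import sqlite3


def rows_as_dicts(conn, sql):
    """Run a query on any DB-API connection and return rows keyed by column name."""
    cur = conn.cursor()
    cur.execute(sql)
    # cursor.description holds one tuple per column; the first item is the name.
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# Stand-in connection with illustrative data; replace with the Trino `conn` in the notebook.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE trips (vendorid INTEGER, trip_distance REAL)")
demo.execute("INSERT INTO trips VALUES (2, 1.72), (1, 1.8)")

rows = rows_as_dicts(demo, "SELECT * FROM trips ORDER BY trip_distance")
print(rows)
```

In the notebook, `pd.DataFrame(rows_as_dicts(conn, query))` with the Trino connection and a query string of your choice would give a DataFrame comparable to the one above.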
