# Installing prerequisites

We are going to configure this Colab instance by:

* downloading Spark 3.1.2
* downloading a binary distribution of the Nessie Quarkus server (0.9.2)

We will access the web UIs through tunnels provided by ngrok. Register with your GitHub or Google account to get an auth token.

Replace `YOUR_TOKEN_HERE` in the cell below with your actual auth token.

In [1]:
%%shell
mkdir -p build
cd build
echo "Installing SPARK"
wget -q https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar xf spark-3.1.2-bin-hadoop3.2.tgz
echo "Installing FINDSPARK"
pip -q install findspark
echo "Installing NESSIE"
wget -q https://github.com/andrea-rockt/colab-notebooks/raw/main/data/nessie-quarkus-0.9.2.tar.gz
tar xf nessie-quarkus-0.9.2.tar.gz
chmod +x nessie-quarkus-0.9.2.bin
wget -q https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.tgz
tar xf ngrok-stable-linux-amd64.tgz

./ngrok authtoken YOUR_TOKEN_HERE

Installing SPARK
Installing FINDSPARK
Installing NESSIE
Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml
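
As an alternative to pasting the token into the notebook, the `authtoken` command can be built from an environment variable. The sketch below is only an illustration: the variable name `NGROK_AUTH_TOKEN` and the helper `ngrok_authtoken_command` are hypothetical and not part of the original notebook. The `build/ngrok` path assumes the download step above was run from `/content`.

```python
import os


def ngrok_authtoken_command(ngrok_path="build/ngrok", env_var="NGROK_AUTH_TOKEN"):
    """Build the `ngrok authtoken` command from an environment variable.

    Both the helper and the variable name are hypothetical, chosen for
    this example only.
    """
    token = os.environ.get(env_var, "").strip()
    # Refuse to continue with an empty value or the untouched placeholder.
    if not token or token == "YOUR_TOKEN_HERE":
        raise ValueError(f"set {env_var} to your ngrok auth token first")
    return [ngrok_path, "authtoken", token]
```

In the notebook, the result could be passed to `subprocess.run(ngrok_authtoken_command(), check=True)`.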




We will start Nessie as a background process. Nessie serves its web UI at localhost:19120.

Nessie uses in-memory persistence here, so everything we do is ephemeral.

In [2]:
import os
# Redirect stdout to the log file first, then point stderr at the same file.
os.system("/content/build/nessie-quarkus-0.9.2.bin > nessie.log 2>&1 &")

0
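
Because `os.system` returns as soon as the background process is launched, the return value `0` does not show that Nessie is already accepting connections. A minimal sketch of a readiness check follows. The helper `wait_for_port` is hypothetical, not part of the original notebook, and it only checks that the TCP port is open, not that the API is healthy.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet: wait a little and retry.
            time.sleep(interval)
    return False
```

In this notebook it could be called as `wait_for_port("localhost", 19120)` before the Spark session is created.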

In [3]:
import findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/build/spark-3.1.2-bin-hadoop3.2"

# Full URL of the Nessie API endpoint
url = "http://localhost:19120/api/v1"
# Where to store nessie tables
full_path_to_warehouse = '/warehouse/'
# The ref or context that nessie will operate on (if different from default branch).
# Can be the name of a Nessie branch or tag or a Nessie commit SHA.
ref = "main"
# Nessie authentication type (BASIC, NONE or AWS)
auth_type = "NONE"

findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Our First Spark example") \
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark3-runtime:0.12.0,org.projectnessie:nessie-spark-extensions:0.18.0") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
    .config("spark.sql.catalog.nessie.uri", url) \
    .config("spark.sql.catalog.nessie.ref", ref) \
    .config("spark.sql.catalog.nessie.authentication.type", auth_type) \
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog") \
    .config("spark.sql.catalog.nessie.warehouse", full_path_to_warehouse) \
    .config("spark.sql.catalog.nessie",
            "org.apache.iceberg.spark.SparkCatalog") \
    .getOrCreate()
spark
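
All catalog settings in the cell above share the prefix `spark.sql.catalog.nessie`, where `nessie` is the catalog name used later in the SQL statements. The sketch below collects the same keys and values in a dictionary so that this pattern is easier to see. The helper `nessie_catalog_conf` is hypothetical and only restates the options already used in this notebook.

```python
def nessie_catalog_conf(url, ref, auth_type, warehouse, catalog="nessie"):
    """Return the Spark options used above for a Nessie-backed Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": url,
        f"{prefix}.ref": ref,
        f"{prefix}.authentication.type": auth_type,
        f"{prefix}.warehouse": warehouse,
    }


conf = nessie_catalog_conf("http://localhost:19120/api/v1", "main", "NONE", "/warehouse/")
for key, value in sorted(conf.items()):
    print(f"{key} = {value}")
```

Each entry could then be applied with `builder.config(key, value)` in a loop instead of one chained `.config(...)` call per option.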

In [4]:
!wget -q https://github.com/sivabalanb/Data-Analysis-with-Pandas-and-Python/raw/master/nba.csv

In [5]:
spark.sql('CREATE BRANCH IF NOT EXISTS initial_load IN nessie FROM main').collect()
spark.sql('USE REFERENCE initial_load IN nessie').collect()
createStatement = 'CREATE TABLE nessie.nba.player (Name STRING, Team STRING, Number STRING, Position STRING, Age STRING, Height STRING, Weight STRING, College STRING, Salary STRING) USING iceberg PARTITIONED BY (Team)'
spark.sql(createStatement).collect()
playersDf = spark.read.csv('nba.csv', header=True)
playersDf.write.format('iceberg').mode('append').partitionBy('Team').save('nessie.nba.player')

In [6]:
spark.sql('MERGE BRANCH initial_load INTO main IN nessie').collect()

[Row(name='main', hash='419f926faf2704ed6521607316b7e97bbd2e820e341e9f019041e0c3377efd0f')]

In [7]:
spark.sql('CREATE BRANCH IF NOT EXISTS add_row IN nessie FROM main').collect()
spark.sql('USE REFERENCE add_row IN nessie').collect()
spark.sql("INSERT INTO nessie.nba.player VALUES ('Name', 'Team', 'Number', 'Position', '12', '12', '12', 'College', '12')")
spark.sql('USE REFERENCE main IN nessie').collect()
print('count is: ' + str(spark.sql('select * from nessie.nba.player').count()))
spark.sql('USE REFERENCE add_row IN nessie').collect()
print('count is: ' + str(spark.sql('select * from nessie.nba.player').count()))
spark.sql('MERGE BRANCH add_row INTO main IN nessie').collect()
spark.sql('USE REFERENCE main IN nessie').collect()
print('count is: ' + str(spark.sql('select * from nessie.nba.player').count()))

count is: 458
count is: 459
count is: 459
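
The three counts show that the inserted row is visible only on `add_row` until that branch is merged into `main`. The cell above repeats the same two steps, switching the reference and counting rows, so they can be wrapped in a small helper. This is a sketch: `count_on_ref` is a hypothetical name, and it reuses only the SQL statements already shown in this notebook.

```python
def count_on_ref(spark, ref, table="nessie.nba.player", catalog="nessie"):
    """Switch the catalog to `ref` and return the row count of `table` there.

    Note: the session stays on `ref` after the call.
    """
    spark.sql(f"USE REFERENCE {ref} IN {catalog}").collect()
    return spark.sql(f"SELECT * FROM {table}").count()
```

With the session from above, `count_on_ref(spark, 'main')` and `count_on_ref(spark, 'add_row')` would replace the paired `USE REFERENCE` and `print` lines.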


In [9]:
!cat nessie.log

__  ____  __  _____   ___  __ ____  ______ 
 --/ __ \/ / / / _ | / _ \/ //_/ / / / __/ 
 -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\ \   
--\___\_\____/_/ |_/_/|_/_/|_|\____/___/   
2022-01-17 10:52:00,057 INFO  [org.pro.ser.pro.ConfigurableVersionStoreFactory] (main) Using INMEMORY Version store
2022-01-17 10:52:00,065 INFO  [io.quarkus] (main) nessie-quarkus 0.9.2 native (powered by Quarkus 2.2.0.Final) started in 0.063s. Listening on: http://0.0.0.0:19120
2022-01-17 10:52:00,065 INFO  [io.quarkus] (main) Profile prod activated. 
2022-01-17 10:52:00,065 INFO  [io.quarkus] (main) Installed features: [amazon-dynamodb, cdi, hibernate-validator, jaeger, jgit, micrometer, resteasy, resteasy-jackson, security, security-properties-file, sentry, smallrye-context-propagation, smallrye-health, smallrye-openapi, smallrye-opentracing, swagger-ui, vertx, vertx-web]
2022-01-17 10:52:19,915 INFO  [io.qua.htt.access-log] (executor-thread-0) 127.0.0.1 - - 17/Jan/2022:10:52:19 +0000 "GET /api/v1/trees/tree/

In [11]:
!build/ngrok http 19120
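
The `ngrok http` command runs in the foreground, so the public address of the tunnel may be hard to read from the cell output. The sketch below assumes that ngrok is running and that its local agent API is reachable at the default address `http://localhost:4040/api/tunnels`. The helper names `public_urls` and `fetch_public_urls` are hypothetical.

```python
import json
import urllib.request


def public_urls(payload):
    """Extract the public URLs from a parsed /api/tunnels response."""
    return [tunnel["public_url"] for tunnel in payload.get("tunnels", [])]


def fetch_public_urls(api="http://localhost:4040/api/tunnels"):
    """Query the local ngrok agent API (assumed default address) for its tunnels."""
    with urllib.request.urlopen(api, timeout=5) as resp:
        return public_urls(json.load(resp))
```

Calling `fetch_public_urls()` from another cell, with ngrok started in the background, should list the addresses under which the Nessie UI on port 19120 is exposed.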