<a href="https://colab.research.google.com/github/andrea-rockt/colab-notebooks/blob/main/project_nessie_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing prerequisites

We are going to configure this Colab instance by:

* downloading spark-3.1.2
* installing findspark
* downloading a binary distribution of Quarkus Nessie
* downloading ngrok

We are going to access the web UIs through tunnels provided by ngrok. Register with your GitHub or Google account to get an auth token.

Replace `THE_AUTH_TOKEN_FOR_NGROK` in the cell below with your actual auth token.

In [1]:
%%shell
mkdir -p build
cd build
echo "Installing SPARK"
wget -q https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar xf spark-3.1.2-bin-hadoop3.2.tgz
echo "Installing FINDSPARK"
pip -q install findspark
echo "Installing NESSIE"
wget -q https://github.com/andrea-rockt/colab-notebooks/raw/main/data/nessie-quarkus-0.9.2.tar.gz
tar xf nessie-quarkus-0.9.2.tar.gz
chmod +x nessie-quarkus-0.9.2.bin
wget -q https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.tgz
tar xf ngrok-stable-linux-amd64.tgz

./ngrok authtoken THE_AUTH_TOKEN_FOR_NGROK

Installing SPARK
Installing FINDSPARK
Installing NESSIE
Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml




We will start Nessie as a background process. It serves its web UI at localhost:19120.

Nessie uses in-memory persistence here, so everything we do is ephemeral.

In [2]:
import os
# Redirect stdout to the log file first, then send stderr to the same place.
os.system("/content/build/nessie-quarkus-0.9.2.bin > nessie.log 2>&1 &")

0
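
`os.system` returns as soon as the background process is launched, so the `0` above only means the shell command was started, not that Nessie is already accepting connections. A minimal sketch of a readiness check, assuming the port 19120 mentioned above; `wait_for_port` is a hypothetical helper and not part of Nessie or Spark:

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("localhost", 19120)` before creating the Spark session would avoid racing the server start. It only checks that the port is open, not that the Nessie API itself is healthy.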

In [3]:
import findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/build/spark-3.1.2-bin-hadoop3.2"

# Full URL of the Nessie API endpoint
url = "http://localhost:19120/api/v1"
# Where to store nessie tables
full_path_to_warehouse = '/warehouse/'
# The ref or context that nessie will operate on (if different from default branch).
# Can be the name of a Nessie branch or tag or a Nessie commit SHA.
ref = "main"
# Nessie authentication type (BASIC, NONE or AWS)
auth_type = "NONE"

findspark.init()
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("Our First Spark example")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark3-runtime:0.12.0,org.projectnessie:nessie-spark-extensions:0.18.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.catalog.nessie.uri", url)
    .config("spark.sql.catalog.nessie.ref", ref)
    .config("spark.sql.catalog.nessie.authentication.type", auth_type)
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.warehouse", full_path_to_warehouse)
    .config("spark.sql.catalog.nessie",
            "org.apache.iceberg.spark.SparkCatalog")
    .getOrCreate()
)
spark
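
Most of the `.config(...)` calls above share the `spark.sql.catalog.nessie` prefix, where `nessie` is the catalog name that later appears in table identifiers such as `nessie.nba.player`. As an illustrative sketch, the same settings can be collected into a dictionary. `nessie_catalog_conf` is a hypothetical helper; the keys and class names are copied from the cell above:

```python
def nessie_catalog_conf(uri, warehouse, ref="main", auth_type="NONE", catalog="nessie"):
    """Build the Spark settings for an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.authentication.type": auth_type,
        f"{prefix}.warehouse": warehouse,
    }


# Same values as in the notebook cell above.
conf = nessie_catalog_conf("http://localhost:19120/api/v1", "/warehouse/")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each pair could then be passed to the session builder with `builder.config(key, value)` in a loop, which keeps the catalog name in one place if it ever changes.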

# Input dataset preparation

We are going to prepare the data directly on the main branch to simulate the starting state of our data pipeline.

In [4]:
# Download a dataset of NBA players
!wget -q https://github.com/sivabalanb/Data-Analysis-with-Pandas-and-Python/raw/master/nba.csv

In [34]:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, DoubleType, DecimalType
from pyspark.sql.functions import mean
playersSchema = StructType([
    StructField("Name", StringType(), False),
    StructField("Team", StringType(), True),
    StructField("Number", StringType(), True),
    StructField("Position", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Height", StringType(), True),
    StructField("Weight", DoubleType(), True),
    StructField("College", StringType(), True),
    StructField("Salary", DecimalType(14, 2), True),
])


In [41]:
playersDfRaw = spark.read.csv('nba.csv', header=True, schema=playersSchema)
playersDf = playersDfRaw.select(playersDfRaw.Name,
                                playersDfRaw.Team,
                                playersDfRaw.Number.cast(IntegerType()),
                                playersDfRaw.Position,
                                playersDfRaw.Age.cast(IntegerType()),
                                playersDfRaw.Height,
                                playersDfRaw.Weight,
                                playersDfRaw.College,
                                playersDfRaw.Salary)
createPlayersTableStatement = """
CREATE TABLE if not exists nessie.nba.player (
  Name STRING,
  Team STRING,
  Number INTEGER,
  Position STRING,
  Age INTEGER,
  Height STRING,
  Weight DOUBLE,
  College STRING,
  Salary DECIMAL(14,2)
) USING iceberg PARTITIONED BY (Team)
"""

createSalaryTableStatement = """
CREATE TABLE if not exists nessie.nba.salary (
  Position STRING,
  MeanSalary DECIMAL(14,2)
) USING iceberg
"""

spark.sql(createPlayersTableStatement).collect()
spark.sql(createSalaryTableStatement).collect()

playersDf.write.format('iceberg').mode('overwrite').partitionBy('Team').save('nessie.nba.player')
playersDf.groupBy('Position').agg(mean('Salary').alias('MeanSalary')).write.format('iceberg').mode('overwrite').save('nessie.nba.salary')
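
The last line fills `nessie.nba.salary` with the mean salary per position. Spark's `mean` skips NULL values, so players without a salary do not pull the average down. A plain-Python sketch of that aggregation, using made-up rows rather than the NBA dataset:

```python
from collections import defaultdict
from decimal import Decimal

# Made-up (position, salary) rows; None stands for a NULL salary.
rows = [
    ("PG", Decimal("100.00")),
    ("PG", Decimal("300.00")),
    ("PG", None),
    ("C", Decimal("50.00")),
]


def mean_salary_by_position(rows):
    """Average salaries per position, skipping None the way SQL AVG skips NULL."""
    totals = defaultdict(lambda: [Decimal("0"), 0])  # position -> [sum, count]
    for position, salary in rows:
        if salary is None:
            continue
        totals[position][0] += salary
        totals[position][1] += 1
    # Positions with no non-null salary are simply omitted in this sketch.
    return {position: total / count for position, (total, count) in totals.items()}


print(mean_salary_by_position(rows))
```

Here the `PG` mean is computed over two rows, not three, because the row with a missing salary is ignored.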

In [53]:
spark.sql('SHOW LOG main IN nessie').selectExpr('author', 'message','hash', 'explode(properties)').show()
spark.sql('CREATE TAG initial_state IN nessie')

+------+--------------------+--------------------+----------------+-------------------+
|author|             message|                hash|             key|              value|
+------+--------------------+--------------------+----------------+-------------------+
|  root|      iceberg commit|52a6e2d4b6dc3ebaa...|application-type|            iceberg|
|  root|      iceberg commit|52a6e2d4b6dc3ebaa...|          app-id|local-1642440435531|
|  root|      iceberg commit|b0cad5c1d6da44ccc...|application-type|            iceberg|
|  root|      iceberg commit|b0cad5c1d6da44ccc...|          app-id|local-1642440435531|
|  root|      iceberg commit|79fc232b74ac5844c...|application-type|            iceberg|
|  root|      iceberg commit|79fc232b74ac5844c...|          app-id|local-1642440435531|
|  root|      iceberg commit|f87fd480bbbd6cdd4...|application-type|            iceberg|
|  root|      iceberg commit|f87fd480bbbd6cdd4...|          app-id|local-1642440435531|
|  root|delete table nba....|2c9

DataFrame[refType: string, name: string, hash: string]

In [66]:
spark.sql('CREATE BRANCH IF NOT EXISTS fix_null_salaries IN nessie FROM main').collect()
spark.sql('USE REFERENCE fix_null_salaries IN nessie').collect()

spark.read.format('iceberg').load('nessie.nba.player').where('salary is NULL').show()

spark.sql("""
UPDATE nessie.nba.player
SET Salary = 100000.00
WHERE Salary is NULL
""")


spark.sql('select * from nessie.nba.player').groupBy('Position').agg(mean('Salary').alias('MeanSalary')).write.format('iceberg').mode('overwrite').save('nessie.nba.salary')

+----+----+------+--------+---+------+------+-------+------+
|Name|Team|Number|Position|Age|Height|Weight|College|Salary|
+----+----+------+--------+---+------+------+-------+------+
+----+----+------+--------+---+------+------+-------+------+



In [102]:
spark.sql('SHOW LOG IN nessie').show()
spark.sql('SHOW LOG main IN nessie').selectExpr('1 as main', '*').join(
spark.sql('SHOW LOG fix_null_salaries IN nessie').selectExpr('0 as main', '*'), 'hash', 'fullouter').show(100000)


+------+---------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|author|committer|                hash|             message|signedOffBy|          authorTime|       committerTime|          properties|
+------+---------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|  root|         |b0ab72989f633e084...|      iceberg commit|           |2022-01-17 17:52:...|2022-01-17 17:52:...|{application-type...|
|  root|         |1651611a3dc4686c7...|      iceberg commit|           |2022-01-17 17:52:...|2022-01-17 17:52:...|{application-type...|
|  root|         |0bdd10b8996805e35...|      iceberg commit|           |2022-01-17 17:52:...|2022-01-17 17:52:...|{application-type...|
|  root|         |4df705506a9c57ce9...|      iceberg commit|           |2022-01-17 17:51:...|2022-01-17 17:51:...|{application-type...|
|  root|         |0869a2badb0e28801...|      ice

In [103]:
spark.sql('MERGE BRANCH fix_null_salaries INTO main IN nessie')
spark.sql('SHOW LOG main IN nessie').selectExpr('1 as main', '*').join(
spark.sql('SHOW LOG fix_null_salaries IN nessie').selectExpr('0 as main', '*'), 'hash', 'fullouter').show(100000)

+--------------------+----+------+---------+--------------------+-----------+--------------------+--------------------+--------------------+----+------+---------+--------------------+-----------+--------------------+--------------------+--------------------+
|                hash|main|author|committer|             message|signedOffBy|          authorTime|       committerTime|          properties|main|author|committer|             message|signedOffBy|          authorTime|       committerTime|          properties|
+--------------------+----+------+---------+--------------------+-----------+--------------------+--------------------+--------------------+----+------+---------+--------------------+-----------+--------------------+--------------------+--------------------+
|6a9835a88742cbc8e...|   1|  root|         |delete table nba....|           |2022-01-17 17:31:...|2022-01-17 17:31:...|{application-type...|   0|  root|         |delete table nba....|           |2022-01-17 17:31:...|2022-01

In [None]:
spark.sql('CREATE BRANCH IF NOT EXISTS add_row IN nessie FROM main').collect()
spark.sql('USE REFERENCE add_row IN nessie').collect()
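# Literals are typed to match the table schema (Number and Age are INTEGER, Weight is DOUBLE, Salary is DECIMAL).
spark.sql("INSERT INTO nessie.nba.player VALUES ('Name', 'Team', 12, 'Position', 12, '12', 12.0, 'College', 12.00)")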
spark.sql('USE REFERENCE main IN nessie').collect()
print('count is: ' + str(spark.sql('select * from nessie.nba.player').count()))
spark.sql('USE REFERENCE add_row IN nessie').collect()
print('count is: ' + str(spark.sql('select * from nessie.nba.player').count()))
spark.sql('MERGE BRANCH add_row INTO main IN nessie').collect()
spark.sql('USE REFERENCE main IN nessie').collect()
print('count is: ' + str(spark.sql('select * from nessie.nba.player').count()))

count is: 458
count is: 459
count is: 459
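
The three counts show branch isolation: the row inserted on `add_row` is not visible on `main` until the merge. A toy model of that behaviour, assuming a branch is just a name pointing at a snapshot of the data. `ToyCatalog` is invented for illustration and says nothing about how Nessie is actually implemented:

```python
class ToyCatalog:
    """Toy stand-in for a versioned catalog: each branch maps to a tuple of rows."""

    def __init__(self, rows):
        self.branches = {"main": tuple(rows)}

    def create_branch(self, name, source="main"):
        # A new branch starts at the state the source branch currently points to.
        # An existing branch is left untouched, similar to IF NOT EXISTS.
        self.branches.setdefault(name, self.branches[source])

    def insert(self, branch, row):
        # Writing produces a new state for this branch only.
        self.branches[branch] = self.branches[branch] + (row,)

    def merge(self, source, target):
        # Simplification: fast-forward the target to the source state.
        self.branches[target] = self.branches[source]

    def count(self, branch):
        return len(self.branches[branch])


catalog = ToyCatalog(range(458))
catalog.create_branch("add_row")
catalog.insert("add_row", "new player")
print(catalog.count("main"))     # 458
print(catalog.count("add_row"))  # 459
catalog.merge("add_row", "main")
print(catalog.count("main"))     # 459
```

The fast-forward merge is the simplest possible case; it ignores what happens when `main` has also changed since the branch was created.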


In [1]:
!cat nessie.log

cat: nessie.log: No such file or directory


In [None]:
!build/ngrok http 19120