<a href="https://colab.research.google.com/github/andrea-rockt/colab-notebooks/blob/main/lakefs_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# lakeFS lab

We need to build our own environment to test things out and learn about Git-like file systems:

* As data engineers, we want to try things out in order to properly understand the systems we are building.

* As data scientists, we want an environment that can support our experiments.

* As developers, we want reproducible environments to validate our code on.

Let's begin by configuring our environment. Nobody has ever been fired for defining a bit of infrastructure.

We are going to create an environment based on:

* Apache Spark: our distributed execution engine. It is the compute layer of our lab environment: it shuffles data around the cluster and crunches the numbers.
* The local filesystem: the actual data would normally live on a distributed filesystem, but since we are only going to use one node we will use the local filesystem, treating it as a *special* case of a more general distributed filesystem.
* lakeFS: our metadata management solution. Table formats describe plain files as a collection of related content by attaching metadata to those files; we will store this metadata inside lakeFS in order to get time travel on it.
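
The cells further down address the same lakeFS objects in two ways: Spark uses `s3a://<repository>/<branch>/<path>` and `lakectl` uses `lakefs://<repository>/<branch>/<path>`. The sketch below only illustrates that naming scheme; the helper names `s3a_uri` and `lakefs_uri` are hypothetical and not part of lakeFS or Spark.

```python
def _build_uri(scheme: str, repository: str, branch: str, path: str = "") -> str:
    """Join repository, branch and an optional object path under the given scheme."""
    parts = [repository.strip("/"), branch.strip("/")]
    if path.strip("/"):
        parts.append(path.strip("/"))
    return f"{scheme}://" + "/".join(parts)


def s3a_uri(repository: str, branch: str, path: str = "") -> str:
    """URI form used by Spark in this notebook (hypothetical helper)."""
    return _build_uri("s3a", repository, branch, path)


def lakefs_uri(repository: str, branch: str, path: str = "") -> str:
    """URI form used by lakectl in this notebook (hypothetical helper)."""
    return _build_uri("lakefs", repository, branch, path)


print(s3a_uri("colab", "dev", "nba/player"))  # s3a://colab/dev/nba/player
print(lakefs_uri("colab", "main"))            # lakefs://colab/main
```

Because the branch is just a path segment, the same read can be pointed at `dev` or `main` by changing one argument.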

# Installing prerequisites

We are going to configure this Colab instance by:

* downloading `spark-3.1.2`
* installing PostgreSQL 11
* downloading a binary distribution of `lakefs`
* downloading `ngrok`

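We are going to access the web UIs through tunnels provided by `ngrok`. Register on `ngrok.com` with your GitHub or Google account to get an auth token.

Replace `THE_AUTH_TOKEN_FOR_NGROK` in the `ngrok` cell with your actual auth token. Likewise, replace `PUT_ACCESS_KEY_HERE` and `PUT_SECRET_ACCESS_KEY_HERE` with your lakeFS credentials. The `lakectl` and Spark cells also assume that a lakeFS repository named `colab` already exists.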

In [None]:
%%shell
echo "Installing SPARK"
# Superseded Spark releases are served from archive.apache.org rather than downloads.apache.org
wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar xf spark-3.1.2-bin-hadoop3.2.tgz
echo "Installing FINDSPARK"
pip -q install findspark 

Installing SPARK
Installing FINDSPARK




In [None]:
%%shell
echo "Installing POSTGRESQL 11"
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
RELEASE=$(lsb_release -cs)
echo "deb http://apt.postgresql.org/pub/repos/apt/ ${RELEASE}-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
sudo apt update -qq > /dev/null
sudo apt -y -qq install postgresql-11 > /dev/null
sudo service postgresql start
sudo -u postgres -- psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
sudo -u postgres -- psql -U postgres -c "CREATE DATABASE lakefs;"

Installing POSTGRESQL 11
OK
deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main




debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 16.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
 * Starting PostgreSQL 11 database server
   ...done.
ALTER ROLE
CREATE DATABASE




In [None]:
%%shell
mkdir -p /lakefs
wget -q https://github.com/treeverse/lakeFS/releases/download/v0.57.2/lakeFS_0.57.2_Linux_x86_64.tar.gz
tar xf lakeFS_0.57.2_Linux_x86_64.tar.gz
export LAKEFS_LOGGING_OUTPUT='-'
export LAKEFS_BLOCKSTORE_LOCAL_PATH='/lakefs'
export LAKEFS_DATABASE_CONNECTION_STRING='postgres://postgres:postgres@localhost:5432/lakefs?sslmode=disable'
export LAKEFS_LOGGING_FORMAT='text'
export LAKEFS_BLOCKSTORE_TYPE='local'
export LAKEFS_GATEWAYS_S3_REGION='us-east-1'
export LAKEFS_AUTH_ENCRYPT_SECRET_KEY='10a718b3f285d89c36e9864494cdd1507f3bc85b342df24736ea81f9a1134bcc09e90b6641'
export LAKEFS_LOGGING_LEVEL='DEBUG'
export LAKEFS_LISTEN_ADDRESS='0.0.0.0:8000'
nohup ./lakefs run > lakefs.log &

nohup: redirecting stderr to stdout




In [None]:
%%shell
wget -q https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.tgz
tar xf ngrok-stable-linux-amd64.tgz
./ngrok authtoken THE_AUTH_TOKEN_FOR_NGROK
nohup ./ngrok http 8000 > ngrok.log &

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml
nohup: redirecting stderr to stdout
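
Once the tunnel is running, the public address of the lakeFS UI has to be looked up somewhere. A minimal sketch, assuming the ngrok agent exposes its local inspection API on the default port 4040 and returns a JSON document with a `tunnels` list whose entries carry a `public_url` field; the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: default address of the ngrok agent's local API.
NGROK_API = "http://localhost:4040/api/tunnels"


def public_urls(payload: dict) -> list:
    """Extract the public URL of every tunnel listed in an ngrok API response."""
    return [tunnel["public_url"] for tunnel in payload.get("tunnels", [])]


def fetch_public_urls(api_url: str = NGROK_API) -> list:
    """Query the local ngrok agent and return the public tunnel URLs."""
    with urlopen(api_url, timeout=5) as response:
        return public_urls(json.load(response))
```

Calling `fetch_public_urls()` in a later cell should list the address that forwards to port 8000, provided the `ngrok` process started above is still alive.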




In [None]:
%env LAKECTL_CREDENTIALS_ACCESS_KEY_ID=PUT_ACCESS_KEY_HERE
%env LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY=PUT_SECRET_ACCESS_KEY_HERE
%env LAKECTL_SERVER_ENDPOINT_URL=http://localhost:8000

env: LAKECTL_CREDENTIALS_ACCESS_KEY_ID=PUT_ACCESS_KEY_HERE
env: LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY=PUT_SECRET_ACCESS_KEY_HERE
env: LAKECTL_SERVER_ENDPOINT_URL=http://localhost:8000
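
Both `lakectl` and the Spark session read these variables, so a forgotten placeholder only surfaces later as an authentication error. The sketch below is one possible sanity check; the function name `unset_lakectl_variables` is hypothetical and the placeholder strings are the ones used in the cell above.

```python
import os

# Placeholder values used in the %env cell of this notebook.
PLACEHOLDERS = {"PUT_ACCESS_KEY_HERE", "PUT_SECRET_ACCESS_KEY_HERE"}

REQUIRED = (
    "LAKECTL_CREDENTIALS_ACCESS_KEY_ID",
    "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY",
    "LAKECTL_SERVER_ENDPOINT_URL",
)


def unset_lakectl_variables(env=os.environ) -> list:
    """Return the required variables that are missing, empty, or still placeholders."""
    return [
        name
        for name in REQUIRED
        if not env.get(name) or env[name] in PLACEHOLDERS
    ]
```

An empty list from `unset_lakectl_variables()` means all three variables carry real values; it does not verify that the credentials are accepted by lakeFS.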


In [None]:
!./lakectl branch create --source lakefs://colab/main lakefs://colab/dev

Source ref: lakefs://colab/main
created branch 'dev' 0705f889b77d38695689641ff4dfce94a967a837aa1d76cdad8d5c6769087256


In [None]:
import os

import findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"

# Reuse the lakeFS credentials exported for lakectl in the %env cell.
CREDENTIALS_ACCESS_KEY_ID = os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"]
CREDENTIALS_SECRET_ACCESS_KEY = os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"]

# findspark.init() must run before pyspark is imported so that SPARK_HOME is picked up.
findspark.init()
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-lakefs-training")
    # hadoop-aws provides the s3a:// filesystem used to reach the lakeFS endpoint.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    # Note: no Iceberg runtime jar is listed in spark.jars.packages above.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.hadoop.fs.s3a.access.key", CREDENTIALS_ACCESS_KEY_ID)
    .config("spark.hadoop.fs.s3a.secret.key", CREDENTIALS_SECRET_ACCESS_KEY)
    # Path-style access keeps the repository name in the URL path, not in the host name.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:8000")
    .getOrCreate()
)
spark

In [None]:
!wget -q https://github.com/sivabalanb/Data-Analysis-with-Pandas-and-Python/raw/master/nba.csv

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DecimalType
from pyspark.sql.functions import mean

# Number and Age are read as strings here and cast to integers after loading.
playersSchema = StructType([
    StructField("Name", StringType(), False),
    StructField("Team", StringType(), True),
    StructField("Number", StringType(), True),
    StructField("Position", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Height", StringType(), True),
    StructField("Weight", DoubleType(), True),
    StructField("College", StringType(), True),
    StructField("Salary", DecimalType(14, 2), True),
])


In [None]:
playersDfRaw = spark.read.csv('nba.csv', header=True, schema=playersSchema)
playersDf = playersDfRaw.select(playersDfRaw.Name,
                                playersDfRaw.Team,
                                playersDfRaw.Number.cast(IntegerType()),
                                playersDfRaw.Position,
                                playersDfRaw.Age.cast(IntegerType()),
                                playersDfRaw.Height,
                                playersDfRaw.Weight,
                                playersDfRaw.College,
                                playersDfRaw.Salary)

playersDf.write.parquet('s3a://colab/dev/nba/player')
playersDf.groupBy('Position').agg(mean('Salary').alias('MeanSalary')).write.parquet('s3a://colab/dev/nba/salary')

In [None]:
%%shell
./lakectl commit lakefs://colab/dev -m "Initial load of nba tables"

Branch: lakefs://colab/dev
Commit for branch "dev" completed.

ID: 6fbd5a3b2da0c8d2101eae304d9fb3098e6cbb2cf20499b49105ad8737828295
Message: Initial load of nba tables
Timestamp: 2022-01-20 12:11:41 +0000 UTC
Parents: 0705f889b77d38695689641ff4dfce94a967a837aa1d76cdad8d5c6769087256





In [None]:
%%shell
./lakectl merge  lakefs://colab/dev lakefs://colab/main 

Source: lakefs://colab/dev
Destination: lakefs://colab/main
Merged "dev" into "main" to get "2312bb907b4b5806ba5db756fb4546e957222025492a83a99855291c176d1d81".

Added: 0
Changed: 0
Removed: 0





In [None]:
spark.read.parquet('s3a://colab/dev/nba/player').createOrReplaceTempView('player')

spark.sql("""
SELECT 
  SUM(CAST (Salary   is NULL as INTEGER)) as null_salaries,
  SUM(CAST (College  is NULL as INTEGER)) as null_college,
  SUM(CAST (Weight   is NULL as INTEGER)) as null_weight,
  SUM(CAST (Height   is NULL as INTEGER)) as null_height,
  SUM(CAST (Age      is NULL as INTEGER)) as null_age,
  SUM(CAST (Position is NULL as INTEGER)) as null_position,
  SUM(CAST (Number   is NULL as INTEGER)) as null_number,
  SUM(CAST (Team     is NULL as INTEGER)) as null_team,
  SUM(CAST (Name     is NULL as INTEGER)) as null_name
FROM 
  player 
""").toPandas()

| | null_salaries | null_college | null_weight | null_height | null_age | null_position | null_number | null_team | null_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 85 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |


In [None]:
%%shell
./lakectl fs rm --recursive lakefs://colab/dev/nba



In [None]:
playersDfRaw = spark.read.csv('nba.csv', header=True, schema=playersSchema)
playersDf = playersDfRaw.select(playersDfRaw.Name,
                                playersDfRaw.Team,
                                playersDfRaw.Number.cast(IntegerType()),
                                playersDfRaw.Position,
                                playersDfRaw.Age.cast(IntegerType()),
                                playersDfRaw.Height,
                                playersDfRaw.Weight,
                                playersDfRaw.College,
                                playersDfRaw.Salary)
# createOrReplaceTempView returns None, so register the view in a separate statement.
playersDf.createOrReplaceTempView('player')


# Keep every row except those in which all columns are NULL.
updatedDF = spark.sql(
"""
SELECT *
FROM player
WHERE
NOT(
Salary is NULL AND
College is NULL AND
Weight is NULL AND
Height is NULL AND
Age is NULL AND
Position is NULL AND
Number is NULL AND
Team is NULL AND
Name is NULL
) 
""")

updatedDF.write.parquet('s3a://colab/dev/nba/player')
updatedDF.groupBy('Position').agg(mean('Salary').alias('MeanSalary')).write.parquet('s3a://colab/dev/nba/salary')


In [None]:
spark.read.parquet('s3a://colab/main/nba/player').where('Weight IS NULL').toPandas()

| | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|---|


In [None]:
!./lakectl commit lakefs://colab/dev -m "Removed null rows"


Branch: lakefs://colab/dev
Commit for branch "dev" completed.

ID: 001e1575df80ff6116ee20d05c1ec3c023e47a1fd7c8816ffda0dbd2d3b68e5c
Message: Removed null rows
Timestamp: 2022-01-20 12:14:00 +0000 UTC
Parents: 6fbd5a3b2da0c8d2101eae304d9fb3098e6cbb2cf20499b49105ad8737828295



In [None]:
!./lakectl merge lakefs://colab/dev lakefs://colab/main 

Source: lakefs://colab/dev
Destination: lakefs://colab/main
Merged "dev" into "main" to get "7e11f996d79dfc8b6b52644be9de69422d1637ea83c593a9b1d6cbff0ac599b8".

Added: 0
Changed: 0
Removed: 0

