<img src="../images/logo.svg" alt="lakeFS logo" width=300/> 

# Data + AI Summit 2022 - Chaos Engineering: Books Demo

_🚧 This notebook may have existing environment or data requirements; it's included here so that you can see the contents and be inspired by it—but it may not run properly.🚧_

----

_This is an updated version of the original notebooks which can be found [here](https://github.com/treeverse/lakeFS-samples/commit/607beb6ae1af48261b60a8c1a36c580ddbc5036a)._ 

🎥 The video of the talk that this notebook accompanies is [here](https://youtu.be/jWxdi5Ya05I).

## Config

### lakeFS endpoint and credentials

Change these if you are using a lakeFS server other than the one provided in the samples repo. 

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFODNN7EXAMPLE'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Storage Information

If you're not using the lakeFS server from the samples repo, change the storage namespace to a location in the bucket you’ve configured. 
The storage namespace is the location in the underlying object store where the data for this repository is kept.
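
As a rough illustration, each repository created later in this notebook gets its own prefix under the namespace, built as `{storageNamespace}/{repo_name}`. The helper below is a hypothetical convenience (it is not part of lakeFS) that performs the same concatenation while tolerating a trailing slash:

```python
def repo_storage_namespace(base_namespace: str, repo_name: str) -> str:
    """Return the per-repository storage location under a base namespace.

    Hypothetical helper for illustration; it mirrors the f-string used
    when the repository is created later in this notebook.
    """
    # Strip a trailing slash so "s3://example/" and "s3://example" agree.
    return f"{base_namespace.rstrip('/')}/{repo_name}"


print(repo_storage_namespace("s3://example", "dais-2022-chaos-engineering-books-demo"))
```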

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

In [None]:
repo_name = "dais-2022-chaos-engineering-books-demo"

In [None]:
main_repo_path = f"s3a://{repo_name}/main/"

## Setup

### Configuring lakeFSClient

In [None]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

In [None]:
print(f"lakeFS client version: {lakefs_client.__version__}")

### Define lakeFS Repository

In [None]:
import os

from lakefs_client.exceptions import NotFoundException

try:
    # Reuse the repository if it already exists.
    repo = lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        # The repository's data lives under its own prefix in the storage namespace.
        repo = lakefs.repositories.create_repository(
            repository_creation=RepositoryCreation(
                name=repo_name,
                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(0)  # stop the kernel: the remaining cells need the repository
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(0)  # stop the kernel: the remaining cells need the repository

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

## Generate data

In [None]:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
books = [("54278345","Building Resilient Data Pipelines","Iron Man"),
    ("15678345","Building Data Platforms","Gamora"),
    ("89898782","Scaling metadata","catwoman"),
    ("32278345","Beyond the clean data curve","batman"),
    ("31478888","Project lightspeed - the definitive guide","Databricks"),
    ("32278888","Hello Spark Fans","Advanced Analytics"),
    ("73825104","Fundamentals of Data Observability","Andy Petrella"),
    ("73825103","High Performance Spark","Holden Karau"),
    ("73341143","Data Engineering with Apache Spark, Delta Lake, and Lakehouse","Manoj Kukreja"),
    ("54725104","Fundamentals of Data Observability","Andy Petrella"),
    ("54725222","Designing Data-Intensive Applications","Martin Kleppmann"),
    ("54725283","Data Management at Scale","Piethein Strengholt"),
    ("29829283","Database Internals","Alex Petrov"),

    ("25678345","Scaling Data Platforms","Gamora"),
    ("39898782","Project metadata","catwoman"),
    ("42278345","Intro to Hive metastore","she-hulk"),
    ("52278888","Reviving zookeper","dr-strange"),
    ("62278888","Life after Hadoop","Green Arrow"),
    ("83825104","Fundamentals of Lakehouse","Barry Allen"),
    ("93825104","High Performance Yarn","Harley Quinn"),  
  ]

schema = StructType([ \
    StructField("isbn",StringType(),True), \
    StructField("name",StringType(),True), \
    StructField("author",StringType(),True) \
  ])
 
books_df = spark.createDataFrame(data=books,schema=schema)
books_df.printSchema()
books_df.show(truncate=False)

In [None]:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
genre = [("68978345","Building Resilient Data Pipelines","fiction"),
    ("15678345","Building Data Platforms","drama"),
    ("89898782","Scaling metadata","mystery"),
    ("32278345","Beyond the clean data curve","tragedy"),
    # ("31478888","Project lightspeed - the definitive guide","classics"),
    ("32278888","Hello Spark Fans","classics"),
    ("73825104","Fundamentals of Data Observability","adventure"),
    # ("73825103","High Performance Spark","classics"),
    ("73341143","Data Engineering with Apache Spark, Delta Lake, and Lakehouse","adventure"),
    ("54725104","Fundamentals of Data Observability","classics"),
    ("54725222","Designing Data-Intensive Applications","classics"),
    ("54725283","Data Management at Scale","classics"),
    # ("29829283","Database Internals","classics"),

    ("25678345","Scaling Data Platforms","drama"),
    ("39898782","Project metadata","adventure"),
    ("42278345","Intro to Hive metastore","mystery"),
    ("52278888","Reviving zookeper","drama"),
    ("62278888","Life after Hadoop","crime"),
    ("83825104","Fundamentals of Lakehouse","adventure"),
    ("93825104","High Performance Yarn","fiction"),  
  ]

schema = StructType([ \
    StructField("isbn",StringType(),True), \
    StructField("name",StringType(),True), \
    StructField("genre",StringType(),True) \
  ])
 
genre_df = spark.createDataFrame(data=genre,schema=schema)
genre_df.printSchema()
genre_df.show(truncate=False)

### Write data to lakeFS

In [None]:
# main_repo_path already ends with "/", so append the object path directly
books_df.write.mode("overwrite").parquet(f"{main_repo_path}books")
genre_df.write.mode("overwrite").parquet(f"{main_repo_path}genres")

## List branches

In [None]:
lakefs.branches.list_branches(repo.id)

### Commit new files

In [None]:
lakefs.commits.commit(repository=repo_name,
                      branch='main',
                      commit_creation=CommitCreation(
                          message="Add books and genre data")
                     )

## Create new branch

In [None]:
experiment_branch='experiment-chaos'
lakefs.branches.create_branch(repository=repo.id, branch_creation=BranchCreation(name=experiment_branch, source='main'))

In [None]:
chaos_repo_path=f"s3a://{repo_name}/{experiment_branch}/"

## Diffing a single branch will show all uncommitted changes on that branch

_There are no uncommitted changes yet as all we've done is create the branch_

In [None]:
lakefs.branches.diff_branch(repository=repo.id, branch='experiment-chaos').results

## Load the data from the new branch

Although we're reading from a different path, the data is identical to what we wrote to the `main` branch above, because this branch was created from `main`.
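
The paths in this notebook follow the pattern `s3a://<repository>/<branch>/<object path>`, so pointing a read at another branch only changes the middle segment. Here is a small sketch of that convention, where `lakefs_path` is a hypothetical helper rather than a lakeFS API:

```python
def lakefs_path(repo: str, ref: str, path: str = "") -> str:
    """Build an s3a:// URI of the form s3a://<repo>/<ref>/<path>.

    Hypothetical helper for illustration only.
    """
    # Strip surrounding slashes from each segment so joining never yields "//".
    parts = [segment.strip("/") for segment in (repo, ref, path) if segment]
    return "s3a://" + "/".join(parts)


demo_repo = "dais-2022-chaos-engineering-books-demo"
print(lakefs_path(demo_repo, "main", "books"))
print(lakefs_path(demo_repo, "experiment-chaos", "books"))
```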

In [None]:
books_df = spark.read.format("parquet").load(chaos_repo_path+"books")
genre_df = spark.read.format("parquet").load(chaos_repo_path+"genres")

### Inspect loaded data

In [None]:
books_df.show(10, truncate=False)

In [None]:
genre_df.show(10, truncate=False)

## Load data into tables

In [None]:
%%sql
DROP TABLE IF EXISTS books

In [None]:
books_df.write.saveAsTable("books")

In [None]:
%%sql
DROP TABLE IF EXISTS genre

In [None]:
genre_df.write.saveAsTable("genre")

## Join operation

In [None]:
data = genre_df.join( books_df, genre_df.isbn ==  books_df.isbn, "left" ).select(books_df.isbn, books_df.name, books_df.author, genre_df.genre)

### Save the materialized view

In [None]:
# chaos_repo_path already ends with "/", so append the object path directly
data.write.mode('overwrite').parquet(f"{chaos_repo_path}books-dataset")

In [None]:
data.show(20,truncate=False)

## Run quality checks on the experiment branch

In [None]:
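from pyspark.sql.functions import col, count, when
# Count the nulls in each column. Every column here is a string, so a NaN
# check (NaN only occurs in float/double columns) is not needed.
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]
   ).show()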

In [None]:
data.count()

In [None]:
genre_df.count()

In [None]:
books_df.count()

In [None]:
genre_df.show()

## Join operation: second try
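
A left join keeps every row of its left-hand table, so the side you start from decides which records can go missing. The plain-Python sketch below is a stand-in for the Spark joins in this notebook and uses a three-book subset of the data generated earlier. `left_join` is a hypothetical helper that assumes ISBNs are unique on the right-hand side.

```python
def left_join(left, right, key, right_cols):
    """Left-join two lists of dicts on `key`, copying `right_cols` from `right`.

    Hypothetical helper for illustration; assumes `key` is unique in `right`.
    Unmatched left rows get None for every column in `right_cols`.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key], {})
        joined.append({**row, **{c: match.get(c) for c in right_cols}})
    return joined


sample_books = [
    {"isbn": "15678345", "author": "Gamora"},
    {"isbn": "73825103", "author": "Holden Karau"},  # has no genre row
    {"isbn": "54278345", "author": "Iron Man"},      # its genre row uses another ISBN
]
sample_genres = [
    {"isbn": "15678345", "genre": "drama"},
    {"isbn": "68978345", "genre": "fiction"},        # matches no book
]

# Genres on the left: one row per genre, so books without a genre row disappear.
genre_first = left_join(sample_genres, sample_books, "isbn", ["author"])
# Books on the left: one row per book, and a missing genre shows up as None.
books_first = left_join(sample_books, sample_genres, "isbn", ["genre"])

print(len(genre_first), [row["author"] for row in genre_first])
print(len(books_first), [row["genre"] for row in books_first])
```

Starting from the books keeps every book in the result, and the gaps become explicit nulls in the `genre` column, which can then be filled in.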

In [None]:
data_v2 = books_df.join( genre_df, genre_df.isbn ==  books_df.isbn, "left" ).select(books_df.isbn, books_df.name, books_df.author, genre_df.genre)

In [None]:
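from pyspark.sql.functions import col, count, when
# Count the nulls in each column. Every column here is a string, so a NaN
# check (NaN only occurs in float/double columns) is not needed.
data_v2.select([count(when(col(c).isNull(), c)).alias(c) for c in data_v2.columns]
   ).show()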

In [None]:
genre_df.count()

In [None]:
books_df.count()

In [None]:
data_v2.count()

## Fix missing data

In [None]:
data_v3 = data_v2.fillna("classics",subset=["genre"])
data_v3.write.mode('overwrite').parquet(chaos_repo_path+"books-dataset")

In [None]:
lakefs.branches.diff_branch(repository=repo_name, branch=experiment_branch).results

# Git-like interface - Branching out

![](https://docs.lakefs.io/assets/img/branching_7.png)

## Cross-collection consistency
We often need consistency between different data collections. A few examples:

* To join different collections in order to create a unified view of an account, a user, or another entity we measure.
* To introduce the same data in different formats.
* To introduce the same data with a different leading index or sort order for performance reasons.

![](https://docs.lakefs.io/assets/img/branching_8.png)
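
One way to picture cross-collection consistency is a toy model in which a branch is just a name pointing at a snapshot of every collection. The sketch below illustrates the idea under that assumption; it is not how lakeFS is implemented, and `ToyRepo` and its methods are hypothetical.

```python
class ToyRepo:
    """Toy model: each branch name maps to one snapshot, a dict of
    collection name -> contents. Hypothetical, for illustration only."""

    def __init__(self, initial):
        self.branches = {"main": dict(initial)}

    def create_branch(self, name, source):
        # A new branch starts from the same snapshot as its source.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, collection, contents):
        # Writing builds a new snapshot; other branches are unaffected.
        self.branches[branch] = {**self.branches[branch], collection: contents}

    def merge(self, source, destination):
        # Simplification: the destination adopts the source snapshot in one
        # assignment (no conflict handling), so all collections change together.
        self.branches[destination] = dict(self.branches[source])


toy = ToyRepo({"books": "v1", "genres": "v1"})
toy.create_branch("experiment", "main")
toy.write("experiment", "books", "v2")
toy.write("experiment", "genres", "v2")

before_merge = dict(toy.branches["main"])  # main still sees v1 of both collections
toy.merge("experiment", "main")
after_merge = dict(toy.branches["main"])   # main now sees v2 of both collections

print(before_merge, after_merge)
```

In this model a reader of `main` never observes a new `books` alongside an old `genres`, because both become visible in the same step.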

## More Questions?

**👉🏻 Join the lakeFS Slack group - https://lakefs.io/slack**