<img src="../images/logo.svg" alt="lakeFS logo" width=300/> 

# ML Experimentation 01 (Dogs)

In this tutorial, you will learn how to version your ML training data, model artifacts, metrics, and training code together with lakeFS.

We will use the [Stanford-Dogs-Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/) (aka ImageNetDogs) for image classification.

_🚧 This notebook may have existing environment or data requirements; it's included here so that you can see the contents and be inspired by it—but it may not run properly.🚧_

----

## Before you start!

Download the [Stanford-Dogs-Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/) to your MinIO bucket. We will import this data into a lakeFS repository and use it for ML model training.

1. Create a new MinIO bucket for the lakeFS repository
2. Create a lakeFS repository (`ml-demo`)
3. Import the dataset into the lakeFS repo (branch `_main_imported`)

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFODNN7EXAMPLE'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"
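
Hard-coded credentials are convenient for the bundled demo environment, but when you point the notebook at your own lakeFS server it can help to read them from the environment instead. The following is a minimal sketch, assuming hypothetical variable names (`LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, `LAKEFS_SECRET_KEY`, `STORAGE_NAMESPACE`) that are not defined anywhere in this notebook; the demo values above are used as fallbacks.

```python
import os


def load_config(env=os.environ):
    """Read lakeFS settings from environment variables, with demo defaults.

    The variable names are illustrative assumptions, not something
    lakeFS or this notebook defines.
    """
    return {
        "lakefsEndPoint": env.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
        "lakefsAccessKey": env.get("LAKEFS_ACCESS_KEY", "AKIAIOSFODNN7EXAMPLE"),
        "lakefsSecretKey": env.get(
            "LAKEFS_SECRET_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
        ),
        "storageNamespace": env.get("STORAGE_NAMESPACE", "s3://example"),
    }


config = load_config()
lakefsEndPoint = config["lakefsEndPoint"]
lakefsAccessKey = config["lakefsAccessKey"]
lakefsSecretKey = config["lakefsSecretKey"]
storageNamespace = config["storageNamespace"]
```

Passing the environment as an argument keeps the function easy to check with a plain dictionary.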

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "ml-demo"

### Create lakeFSClient

In [None]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.config.get_lake_fs_version()
except Exception as e:
    # Report the underlying error instead of silently swallowing it
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v.version}")

### Define lakeFS Repository

In [None]:
import os  # needed for os._exit() below; the main imports cell runs later
from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

### Install libraries

In [None]:
! pip install opencv-python tensorflow nbimporter s3fs

### Imports

In [None]:
import os
import json
import boto3
import s3fs
import joblib
import tempfile
from io import BytesIO
import nbimporter
import pprint

In [None]:
from ml_reproducibility.ml_utils import *
from ml_reproducibility.file_utils import *


In [None]:
from datetime import date, time, datetime

import cv2
import numpy as np
from PIL import Image, ImageDraw
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from tensorflow import keras
from keras.models import Sequential
from keras.layers import Conv2D,MaxPooling2D,Dense,Flatten,Dropout
from tensorflow.keras.layers import BatchNormalization

In [None]:
from pyspark.sql.types import StructType,StructField, StringType

### Configure boto3 client

In [None]:
# Use the endpoint from the Config section so a custom lakeFS server is honored
s3_client = boto3.client('s3',
    endpoint_url=lakefsEndPoint,
    aws_access_key_id=lakefsAccessKey,
    aws_secret_access_key=lakefsSecretKey)

s3_resource = boto3.resource('s3',
    endpoint_url=lakefsEndPoint,
    aws_access_key_id=lakefsAccessKey,
    aws_secret_access_key=lakefsSecretKey)

In [None]:
s3 = s3fs.S3FileSystem(anon=False,
                      key=lakefsAccessKey,
                      secret=lakefsSecretKey,
                      client_kwargs={'endpoint_url': lakefsEndPoint})


---

# Main Tutorial starts here 🚦 👇🏻

## Experiment Configs

In [None]:
ingest_branch = "_main_imported"
exp1_branch = "experiment-1"
exp2_branch = "experiment-2"

prod_branch = "main"

In [None]:
file_path = f"s3a://{repo_name}"

images_path = "dogs_dataset_/images/Images"
annotations = "dogs_dataset_/annotations/Annotations"

raw_path = "raw"
processed_path = "processed"
config_path = "config"
artifact_path = "artifacts"
metrics_path = "metrics"
training_code_path = "src"
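
When lakeFS is reached through its S3-compatible endpoint (as Spark, boto3 and s3fs are configured above), the repository name takes the place of the bucket, and the branch, tag or commit ID is the first component of the object key. The sketch below shows how the path fragments defined above combine into a full address. `lakefs_uri` is a hypothetical helper name used only for illustration.

```python
def lakefs_uri(repo, ref, *parts, scheme="s3a"):
    """Build '<scheme>://<repo>/<ref>/<path>' from path fragments."""
    # Drop empty fragments and stray slashes so fragments join cleanly
    path = "/".join(p.strip("/") for p in parts if p)
    base = f"{scheme}://{repo}/{ref}"
    return f"{base}/{path}" if path else base


repo_name = "ml-demo"
raw_path = "raw"
images_path = "dogs_dataset_/images/Images"

uri = lakefs_uri(repo_name, "experiment-1", raw_path, images_path)
print(uri)
```

Because the ref is part of the path, switching an experiment from one branch to another only changes that one component.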


# Experimentation Begins

## Experiment #1

In [None]:
params_exp1 ={
    'repo_name': repo_name,
    'branch': exp1_branch,
    'image_path': f"{exp1_branch}/{raw_path}/{images_path}",
    'artifacts_path': f"{exp1_branch}/{artifact_path}",
    'metrics_path': f"{exp1_branch}/{metrics_path}",
    'config_path': f"{exp1_branch}/{config_path}",
    'model_name': "model.pkl",
    'delimiter': "/",
    'n_cats': 3,
    'n_images': 100,
    'is_shuffle':True,
    'is_normalize': False,
    'epochs': 200,
    'train_test_split_ratio': 0.2,
    'optimizer': "adam",
    'loss': "sparse_categorical_crossentropy",
    'metrics': ["accuracy"]
}
params = params_exp1
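
Every path in the parameters embeds the branch name, so data, artifacts and metrics land on that experiment's own branch. One way to avoid retyping these paths for each experiment is a small factory function. `make_params` below is a hypothetical helper, not part of `ml_reproducibility`; it derives only the path keys and takes the hyperparameters as keyword arguments.

```python
def make_params(repo_name, branch, **hyperparams):
    """Return an experiment parameter dict whose paths are scoped to `branch`."""
    raw_path = "raw"
    images_path = "dogs_dataset_/images/Images"
    params = {
        "repo_name": repo_name,
        "branch": branch,
        "image_path": f"{branch}/{raw_path}/{images_path}",
        "artifacts_path": f"{branch}/artifacts",
        "metrics_path": f"{branch}/metrics",
        "config_path": f"{branch}/config",
        "model_name": "model.pkl",
        "delimiter": "/",
    }
    # Hyperparameters are passed through unchanged
    params.update(hyperparams)
    return params


example_params = make_params(
    "ml-demo", "experiment-1",
    n_cats=3, n_images=100, epochs=200, optimizer="adam",
)
print(example_params["image_path"])
```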

### Set up lakeFS for experiment #1

#### Create a new branch: `experiment-1` from `_main_imported`

In [None]:
lakefs.branches.list_branches(repo_name)

In [None]:
lakefs.branches.create_branch(repository=repo_name, 
                              branch_creation=BranchCreation(name=exp1_branch, 
                                                                    source=ingest_branch)
                             )
lakefs.branches.list_branches(repo_name)

In [None]:
with open('config.json', 'w') as fp:
    json.dump(params, fp)
    
with open(f'./config.json', 'rb') as f:
    lakefs.objects.upload_object(repository=repo_name, 
                                 branch=exp1_branch, 
                                 path=f"{config_path}/config.json", 
                                 content=f
                                )

#### Load training data from lakeFS. 
#### Generate images and labels for training and Commit.

In [None]:
print("Loading training data")
images, labels = load_training_data(params)

In [None]:
#TODO: Commit the training data after preprocessing under /processed

#### Train the model. 
#### Upload model metrics to lakeFS and commit. 

In [None]:
model1, metrics1 = ml_pipeline(params, images, labels)

In [None]:
save_metrics(metrics1, repo_name, params['metrics_path'])

In [None]:
params['loss'], params['accuracy'] = load_metrics(repo_name, params['metrics_path'])
pprint.pprint(params)

In [None]:
lakefs.branches.diff_branch(repository=repo_name, 
                            branch=exp1_branch).results

commit_meta_params = {}
for k,v in params.items():
    commit_meta_params[k]=str(v)

lakefs.commits.commit(repository=repo_name,
                      branch=exp1_branch,
                      commit_creation=CommitCreation(
                          message=f"Saving model metrics to {exp1_branch}",
                          metadata=commit_meta_params)
                     )
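
The loop above converts every parameter with `str()` because the commit metadata is passed as a flat map of string keys to string values. That is fine for display, but lists and booleans do not round-trip cleanly from their `str()` form. A sketch of one alternative, assuming you want to recover the original types later, is to encode each value as JSON. `to_commit_metadata` and `from_commit_metadata` are hypothetical helper names.

```python
import json


def to_commit_metadata(params):
    """Encode each value as a JSON string so it can be decoded later."""
    # default=str covers values json cannot encode natively (e.g. NumPy floats)
    return {str(k): json.dumps(v, default=str) for k, v in params.items()}


def from_commit_metadata(metadata):
    """Decode metadata produced by to_commit_metadata back into Python values."""
    return {k: json.loads(v) for k, v in metadata.items()}


example = {
    "optimizer": "adam",
    "epochs": 200,
    "is_shuffle": True,
    "metrics": ["accuracy"],
    "train_test_split_ratio": 0.2,
}
meta = to_commit_metadata(example)
restored = from_commit_metadata(meta)
```

The trade-off is that plain strings are stored with surrounding quotes (`"adam"` becomes `"\"adam\""`), which is slightly noisier in the lakeFS UI.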

#### Upload model artifacts to lakeFS and commit. 

In [None]:
model_save(model1, 
           params['model_name'], 
           params['repo_name'], 
           params['artifacts_path'])


In [None]:
lakefs.branches.diff_branch(repository=repo_name, 
                            branch=exp1_branch).results

commit_meta_params = {}
for k,v in params.items():
    commit_meta_params[k]=str(v)
print(commit_meta_params)

lakefs.commits.commit(repository=repo_name,
                      branch=exp1_branch,
                      commit_creation=CommitCreation(
                          message=f"Saving model artifacts to {exp1_branch}",
                          metadata=commit_meta_params)
                     )

#### Load the pickle file from lakeFS, and run predictions.

In [None]:
model1_reloaded = model_load(params['model_name'], 
           params['repo_name'], 
           params['artifacts_path'])

In [None]:
x_train, x_test, y_train, y_test = split_train_test(images, labels, params['train_test_split_ratio'])
pred = model1_reloaded.predict(x_test)

pred.shape
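
With `sparse_categorical_crossentropy` as the loss, the labels are integer class indices, and the model output is presumably one score per class for each image, so `pred.shape` would be `(n_test_images, n_cats)`. That output layout is an assumption about the model built inside `ml_pipeline`, whose code is not shown here. Under that assumption, the predicted class is the index of the largest score in each row. The sketch uses plain lists and made-up scores; for a NumPy array, `np.argmax(pred, axis=1)` gives the same indices.

```python
def predicted_classes(scores):
    """Return the index of the highest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted class equals the true class."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up scores for three images and three classes (not real model output)
example_pred = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
]
example_labels = [1, 0, 1]

y_hat = predicted_classes(example_pred)
acc = accuracy(example_labels, y_hat)
```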

In [None]:
plt.figure(1 , figsize = (19 , 10))
n = 0 

for i in range(9):
    n += 1 
    r = np.random.randint( 0, x_test.shape[0], 1)
    
    plt.subplot(3, 3, n)
    plt.subplots_adjust(hspace = 0.3, wspace = 0.3)
    
    plt.imshow(x_test[r[0]])
    # Predicted class = index of the highest score (assumes one score per class)
    plt.title('Actual = {}, Predicted = {}'.format(y_test[r[0]], np.argmax(pred[r[0]])))
    plt.xticks([]) , plt.yticks([])

plt.show()

## Experiment #2

In [None]:
params_exp2 ={
    'repo_name': repo_name,
    'branch': exp2_branch,
    'image_path': f"{exp2_branch}/{raw_path}/{images_path}",
    'artifacts_path': f"{exp2_branch}/{artifact_path}",
    'metrics_path': f"{exp2_branch}/{metrics_path}",
    'config_path': f"{exp2_branch}/{config_path}",
    'model_name': "model.pkl",
    'delimiter': "/",
    'n_cats': 3,
    'n_images': 50,
    'is_shuffle': True,
    'is_normalize': True,
    'epochs': 10,
    'train_test_split_ratio': 0.15,
    'optimizer': "adagrad",
    'loss': "sparse_categorical_crossentropy",
    'metrics': ["accuracy"]
}
params = params_exp2
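
Both experiments branch from the same imported data, so the parameter dictionaries capture what changes between them. A small sketch that lists the differing keys follows. `diff_params` is a hypothetical helper, and the two dictionaries repeat only the hyperparameters from the experiment #1 and experiment #2 cells.

```python
def diff_params(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


exp1 = {
    "n_cats": 3, "n_images": 100, "is_shuffle": True, "is_normalize": False,
    "epochs": 200, "train_test_split_ratio": 0.2, "optimizer": "adam",
}
exp2 = {
    "n_cats": 3, "n_images": 50, "is_shuffle": True, "is_normalize": True,
    "epochs": 10, "train_test_split_ratio": 0.15, "optimizer": "adagrad",
}

changes = diff_params(exp1, exp2)
for key, (old, new) in changes.items():
    print(f"{key}: {old} -> {new}")
```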

### Set up lakeFS for experiment #2

1. Create a new branch: `experiment-2` from `_main_imported`

In [None]:
lakefs.branches.list_branches(repo_name)

lakefs.branches.create_branch(repository=repo_name, 
                              branch_creation=BranchCreation(name=exp2_branch, 
                                                                    source=ingest_branch)
                             )

lakefs.branches.list_branches(repo_name)

In [None]:
with open('config.json', 'w') as fp:
    json.dump(params, fp)
    
with open(f'./config.json', 'rb') as f:
    lakefs.objects.upload_object(repository=repo_name, 
                                 branch=exp2_branch, 
                                 path=f"{config_path}/config.json", 
                                 content=f
                                )

#### Load training data from lakeFS. 
#### Generate images and labels for training and Commit.

In [None]:
images, labels = load_training_data(params)

In [None]:
# TODO: Commit training data

#### Train the model. 
#### Upload model metrics to lakeFS and commit.

In [None]:
model2, metrics2 = ml_pipeline(params, images, labels)

In [None]:
save_metrics(metrics2, repo_name, params['metrics_path'])

In [None]:
params['loss'], params['accuracy'] = load_metrics(repo_name, params['metrics_path'])
pprint.pprint(params)

In [None]:
lakefs.branches.diff_branch(repository=repo_name, 
                            branch=exp2_branch).results

commit_meta_params = {}
for k,v in params.items():
    commit_meta_params[k]=str(v)
pprint.pprint(commit_meta_params)

lakefs.commits.commit(repository=repo_name,
                      branch=exp2_branch,
                      commit_creation=CommitCreation(
                          message=f"Saving model metrics to {exp2_branch}",
                          metadata=commit_meta_params)
                     )

#### Upload model artifacts to lakeFS and commit.

In [None]:
model_save(model2, 
           params['model_name'], 
           params['repo_name'], 
           params['artifacts_path'])

In [None]:
lakefs.branches.diff_branch(repository=repo_name, 
                            branch=exp2_branch).results

commit_meta_params = {}
for k,v in params.items():
    commit_meta_params[k]=str(v)
pprint.pprint(commit_meta_params)

lakefs.commits.commit(repository=repo_name,
                      branch=exp2_branch,
                      commit_creation=CommitCreation(
                          message=f"Saving model artifacts to {exp2_branch}",
                          metadata=commit_meta_params)
                     )

#### Load the pickle file from lakeFS, and run predictions.

In [None]:
model2_reloaded = model_load(params['model_name'], 
           params['repo_name'], 
           params['artifacts_path'])

In [None]:
x_train, x_test, y_train, y_test = split_train_test(images, labels, params['train_test_split_ratio'])
pred = model2_reloaded.predict(x_test)

pred.shape

In [None]:
plt.figure(1 , figsize = (19 , 10))
n = 0 

for i in range(9):
    n += 1 
    r = np.random.randint( 0, x_test.shape[0], 1)
    
    plt.subplot(3, 3, n)
    plt.subplots_adjust(hspace = 0.3, wspace = 0.3)
    
    plt.imshow(x_test[r[0]])
    # Predicted class = index of the highest score (assumes one score per class)
    plt.title('Actual = {}, Predicted = {}'.format(y_test[r[0]], np.argmax(pred[r[0]])))
    plt.xticks([]) , plt.yticks([])

plt.show()

### Compare the metrics from both branches. Merge the winning model to Prod.

In [None]:
win_branch = exp2_branch
if metrics1['accuracy']> metrics2['accuracy']:
    win_branch = exp1_branch

In [None]:
win_branch
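
The comparison above handles exactly two experiments and falls back to `experiment-2` on a tie. With more experiments, the same idea can be written over a mapping from branch name to accuracy. This is a sketch only: `pick_winner` is a hypothetical helper, and the accuracy values are placeholders, not results from this notebook.

```python
def pick_winner(accuracy_by_branch):
    """Return the branch with the highest accuracy.

    On a tie, the branch that appears first in the mapping wins.
    """
    if not accuracy_by_branch:
        raise ValueError("no experiments to compare")
    return max(accuracy_by_branch, key=accuracy_by_branch.get)


# Placeholder values for illustration only
example_results = {"experiment-1": 0.5, "experiment-2": 0.75}
example_winner = pick_winner(example_results)
print(example_winner)
```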

In [None]:
lakefs.refs.merge_into_branch(repository=repo_name, 
                              source_ref=win_branch, 
                              destination_branch=prod_branch)

## Reproducing ML experiments with lakeFS tags

In [None]:
tag_branch = exp1_branch
tag = f'{datetime.now().strftime("%Y_%m_%d_%H_%M_%S")}_{tag_branch}'
tag
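
A lakeFS tag is an immutable pointer to a commit, so unlike a branch it keeps referring to the same data, metrics and model even if `experiment-1` later receives more commits. The timestamp prefix keeps tag names unique and sortable. The same naming scheme is sketched below as a function, with the clock passed in so the result can be checked deterministically. `make_tag` is a hypothetical name.

```python
from datetime import datetime


def make_tag(branch, now=None):
    """Build a '<YYYY_MM_DD_HH_MM_SS>_<branch>' tag name."""
    now = now or datetime.now()
    return f'{now.strftime("%Y_%m_%d_%H_%M_%S")}_{branch}'


example_tag = make_tag("experiment-1", datetime(2024, 1, 31, 9, 5, 0))
print(example_tag)
```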

In [None]:
lakefs.tags.create_tag(
    repository=repo_name,
    tag_creation=TagCreation(
        id=tag, 
        ref=tag_branch))

In [None]:
params_tag ={
    'repo_name': repo_name,
    'branch': tag,
    'image_path': f"{tag}/{raw_path}/{images_path}",
    'artifacts_path': f"{tag}/{artifact_path}",
    'metrics_path': f"{tag}/{metrics_path}",
    'model_name': "model.pkl",
    'delimiter': "/",
    'n_cats': 3,
    'n_images': 50,
    'is_shuffle': True,
    'is_normalize': True,
    'epochs': 10,
    'train_test_split_ratio': 0.15,
    'optimizer': "adagrad",
    'loss': "sparse_categorical_crossentropy",
    'metrics': ["accuracy"]
}
pprint.pprint(params_tag)

In [None]:
images, labels = load_training_data(params_tag)

In [None]:
images, labels = preprocess(images, labels, params_tag['is_shuffle'], params_tag['is_normalize'])

In [None]:
tag_model_reloaded = model_load(params_tag['model_name'], 
           params_tag['repo_name'], 
           params_tag['artifacts_path'])

In [None]:
pred = tag_model_reloaded.predict(images)

pred.shape

In [None]:
plt.figure(1 , figsize = (19 , 10))
n = 0 

for i in range(9):
    n += 1 
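    # pred was computed on `images`, so index images/labels rather than x_test/y_test
    r = np.random.randint( 0, len(images), 1)
    
    plt.subplot(3, 3, n)
    plt.subplots_adjust(hspace = 0.3, wspace = 0.3)
    
    plt.imshow(images[r[0]])
    # Predicted class = index of the highest score (assumes one score per class)
    plt.title('Actual = {}, Predicted = {}'.format(labels[r[0]], np.argmax(pred[r[0]])))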
    plt.xticks([]) , plt.yticks([])

plt.show()

## DONE!!