# Isolated Reproducible Unstructured Datasets for ML

### Prerequisites

###### This notebook requires a connection to a lakeFS server.
###### To spin up lakeFS quickly, use [lakeFS Cloud](https://lakefs.cloud), which provides a lakeFS server on demand with a single click.
###### Alternatively, refer to the [lakeFS Quickstart doc](https://docs.lakefs.io/quickstart/installing.html).

## Setup Task: Download the Images and Annotations datasets used for this demo from [http://vision.stanford.edu/aditya86/ImageNetDogs/](http://vision.stanford.edu/aditya86/ImageNetDogs/) and upload them to an S3 bucket
#### Change sample-dog-images-bucket-name

In [0]:
bucketURLforImages = 's3://sample-dog-images-bucket-name'

## Setup Task: Download the [changed Images and Annotations datasets](https://github.com/treeverse/lakeFS-samples/tree/main/01_standalone_examples/aws-databricks/data/stanforddogsdataset/changed) and upload them to a different S3 bucket
#### Change sample-dog-images-changed-bucket-name

In [0]:
bucketURLforChangedImages = 's3://sample-dog-images-changed-bucket-name'

## Setup Task: Change your lakeFS credentials

In [0]:
lakefsEndPoint = 'https://YourEndPoint/' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAlakeFSAccessKey'
lakefsSecretKey = 'lakeFSSecretKey'

## Setup Task: Optionally change the lakeFS repo name

In [0]:
repo = "images-repo"

## Setup Task: Storage Information
#### Change the Storage Namespace to a location you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored.

In [0]:
import random
storageNamespace = 's3://lakefs-repository-bucket-name/'+repo+'/'+str(random.randint(1,100000000))

## Define variables

In [0]:
mainBranch = "main"
emptyBranch = "empty"
AnnotationsFolderName = "Annotations"
ImagesFolderName = "Images"

AfghanHoundSourcePath = "n02088094-Afghan_hound"
AfghanHoundFileName = "n02088094-Afghan_hound/n02088094_115.jpg"
WalkerHoundSourcePath = "n02089867-Walker_hound"
WalkerHoundFileName = "n02089867-Walker_hound/n02089867_24.jpg"

## Setup Task: Run additional [Setup](./?o=8376305627582670#notebook/3797698953092686) tasks here

In [0]:
%run ./unstructuredDataMLDemoSetup

## Setup Task: Import Images and Annotations datasets to lakeFS repository

In [0]:
from lakefs_client import models
paths=[
    models.ImportLocation(type="common_prefix", path=bucketURLforImages, destination=""),
]
commitMessage='Imported all annotations and images'
commitMetadata={'version': '1.0'}
import_objects(repo, mainBranch, paths, commitMessage, commitMetadata)

# Project Starts

## Project label and version information

In [0]:
classLabel = "_hound"
version = "v1"

## Create empty Project v1 branch

In [0]:
projectBranchV1 = "project"+classLabel+"_"+version

client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=projectBranchV1,
        source=emptyBranch))

## Get list of all Annotation folders

In [0]:
AnnotationsFolders = client.objects.list_objects(
    repository=repo,
    ref=mainBranch,
    prefix=AnnotationsFolderName+'/',
    delimiter='/')

## Import all annotation and images for a particular class label

In [0]:
paths=[]

for AnnotationsFolder in AnnotationsFolders.results:
    
    # If folder name ends with classLabel
    if AnnotationsFolder['path'].endswith(classLabel+'/'):
        print("Importing annotation and images in folder: " + AnnotationsFolder['path'])
                                         
        paths.append(models.ImportLocation(type="common_prefix", path=bucketURLforImages+'/'+AnnotationsFolder['path'], destination=AnnotationsFolder['path']))
        paths.append(models.ImportLocation(type="common_prefix", path=bucketURLforImages+'/'+AnnotationsFolder['path'].replace(AnnotationsFolderName, ImagesFolderName),
                                           destination=AnnotationsFolder['path'].replace(AnnotationsFolderName, ImagesFolderName)))

commitMessage='Imported annotation and images for class label ending with '+classLabel
commitMetadata={'classLabel': classLabel,'version': version}
import_objects(repo, projectBranchV1, paths, commitMessage, commitMetadata)
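
The loop above keeps only the annotation folders whose name ends with the class label, then derives the matching images folder by swapping the top-level folder name. The following is a minimal sketch of that mapping using plain strings, with no lakeFS connection. The helper `import_pairs` and the sample folder list are hypothetical and exist only for illustration; each returned pair corresponds to the `path` and `destination` of one `ImportLocation`.

```python
def import_pairs(folder_paths, class_label, bucket_url,
                 annotations_name="Annotations", images_name="Images"):
    """Return (source, destination) pairs for folders ending with class_label.

    Hypothetical helper: mirrors the string handling in the cell above.
    """
    pairs = []
    for folder in folder_paths:
        # Folder paths returned with delimiter='/' end with a trailing slash
        if folder.endswith(class_label + "/"):
            images_folder = folder.replace(annotations_name, images_name)
            pairs.append((bucket_url + "/" + folder, folder))
            pairs.append((bucket_url + "/" + images_folder, images_folder))
    return pairs


# Sample folder names (assumed for illustration)
sample_folders = [
    "Annotations/n02085620-Chihuahua/",
    "Annotations/n02088094-Afghan_hound/",
]

for source, destination in import_pairs(
        sample_folders, "_hound", "s3://sample-dog-images-bucket-name"):
    print(source, "->", destination)
```

Note that `str.replace` swaps every occurrence of `Annotations` in the path. That is safe here because the name only appears as the top-level folder.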

## Some of the images changed

## Changed images

<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02088094-Afghan_hound/n02088094_26.jpg" width=150/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02088094-Afghan_hound/n02088094_60.jpg" width=330/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02088094-Afghan_hound/n02088094_93.jpg" width=310/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02088094-Afghan_hound/n02088094_115.jpg" width=310/>

## Upload changed annotations and images

In [0]:
paths=[
    models.ImportLocation(type="common_prefix", path=bucketURLforChangedImages+'/'+AnnotationsFolderName+'/'+AfghanHoundSourcePath, destination=AnnotationsFolderName),
    models.ImportLocation(type="common_prefix", path=bucketURLforChangedImages+'/'+ImagesFolderName+'/'+AfghanHoundSourcePath, destination=ImagesFolderName),
]
commitMessage='Uploaded changed annotation and images for class label ending with '+classLabel+' and version '+version
commitMetadata={'classLabel': classLabel, 'version': version}
import_objects(repo, projectBranchV1, paths, commitMessage, commitMetadata)

## Get stats for image on main branch

In [0]:
client.objects.stat_object(
    repository=repo,
    ref=mainBranch,
    path=ImagesFolderName+'/'+AfghanHoundFileName)

## Get stats for image on project branch

In [0]:
client.objects.stat_object(
    repository=repo,
    ref=projectBranchV1,
    path=ImagesFolderName+'/'+AfghanHoundFileName)
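
One way to confirm that the image really differs between the two branches is to compare selected fields of the two results. This is a rough sketch using stand-in objects with made-up values. It assumes the stat results expose `checksum` and `size_bytes` attributes, and the helper `differing_fields` is hypothetical.

```python
from types import SimpleNamespace


def differing_fields(left, right, fields=("checksum", "size_bytes")):
    """Return the names of the fields whose values differ between two stat results."""
    return [name for name in fields if getattr(left, name) != getattr(right, name)]


# Stand-in objects with made-up values; in the notebook these would be
# the return values of the two client.objects.stat_object(...) calls.
stat_on_main = SimpleNamespace(checksum="aaa111", size_bytes=20480)
stat_on_project = SimpleNamespace(checksum="bbb222", size_bytes=18432)

print(differing_fields(stat_on_main, stat_on_project))
```

An empty list would mean that both branches point at identical content for that path.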

## Add v1 tag for future use. You can also run your model by using this tag.

In [0]:
import datetime
tagV1 = datetime.datetime.now().strftime("%Y_%m_%d")+f"_{projectBranchV1}"

client.tags.create_tag(
    repository=repo,
    tag_creation=models.TagCreation(
        id=tagV1, 
        ref=projectBranchV1))
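
The tag ID combines the current date with the branch name. The sketch below reproduces that format with a fixed date so the result is predictable. The helper `make_tag_id` and the example date are assumptions made for illustration.

```python
import datetime


def make_tag_id(branch_name, day):
    """Build a tag ID of the form YYYY_MM_DD_<branch name> (hypothetical helper)."""
    return day.strftime("%Y_%m_%d") + f"_{branch_name}"


# Fixed example date instead of datetime.datetime.now(), for a stable result
example_tag = make_tag_id("project_hound_v1", datetime.date(2024, 1, 15))
print(example_tag)  # prints 2024_01_15_project_hound_v1
```

Because the ID depends only on the date and the branch name, re-running the tagging cell on the same day yields the same ID, so the second create call is likely to be rejected as a duplicate.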

## Create Project v2 branch sourced from v1 branch

In [0]:
version = "v2"

In [0]:
projectBranchV2 = "project"+classLabel+"_"+version

client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=projectBranchV2,
        source=projectBranchV1))

## Some of the images changed

## Changed images

<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_24.jpg" width=150/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_31.jpg" width=295/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_42.jpg" width=295/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_55.jpg" width=295/>
<img src="https://raw.githubusercontent.com/treeverse/lakeFS-samples/main/01_standalone_examples/azure-databricks/data/stanforddogsdataset/changed/Images/n02089867-Walker_hound/n02089867_90.jpg" width=295/>

## Upload changed annotations and images

In [0]:
paths=[
    models.ImportLocation(type="common_prefix", path=bucketURLforChangedImages+'/'+AnnotationsFolderName+'/'+WalkerHoundSourcePath, destination=AnnotationsFolderName),
    models.ImportLocation(type="common_prefix", path=bucketURLforChangedImages+'/'+ImagesFolderName+'/'+WalkerHoundSourcePath, destination=ImagesFolderName),
]
commitMessage='Uploaded changed annotation and images for class label ending with '+classLabel+' and version '+version
commitMetadata={'classLabel': classLabel, 'version': version}
import_objects(repo, projectBranchV2, paths, commitMessage, commitMetadata)

## Review commit log

In [0]:
results = map(
    lambda n:[n.message,n.metadata,n.id],
    client.refs.log_commits(
        repository=repo,
        ref=projectBranchV2).results)

from tabulate import tabulate
print(tabulate(
    results,
    headers=['Message','Metadata','Commit Id']))
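
A detail worth knowing about the cell above: in Python 3, `map` returns a one-shot iterator, so `results` is empty once it has been printed. If the rows are needed again, a list comprehension keeps them around. The sketch below uses stand-in commit objects with made-up values rather than a real commit log.

```python
from types import SimpleNamespace

# Stand-in commit records (made-up values) with the same attribute names
# that the cell above reads: message, metadata, id.
commits = [
    SimpleNamespace(message="second commit", metadata={"version": "v2"}, id="c2"),
    SimpleNamespace(message="first commit", metadata={"version": "v1"}, id="c1"),
]

lazy_rows = map(lambda n: [n.message, n.metadata, n.id], commits)
first_pass = list(lazy_rows)   # consumes the iterator
second_pass = list(lazy_rows)  # nothing left to read

# A list comprehension builds a reusable list of rows
rows = [[n.message, n.metadata, n.id] for n in commits]

print(len(first_pass), len(second_pass), len(rows))  # prints 2 0 2
```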

## Add v2 tag for future use. You can also run your model by using this tag.

In [0]:
tagV2 = datetime.datetime.now().strftime("%Y_%m_%d")+f"_{projectBranchV2}"

client.tags.create_tag(
    repository=repo,
    tag_creation=models.TagCreation(
        id=tagV2, 
        ref=projectBranchV2))

## Get image stats using v1 tag

In [0]:
client.objects.stat_object(
    repository=repo,
    ref=tagV1,
    path=ImagesFolderName+'/'+WalkerHoundFileName)

## Get image stats using v2 tag

In [0]:
client.objects.stat_object(
    repository=repo,
    ref=tagV2,
    path=ImagesFolderName+'/'+WalkerHoundFileName)

## Diff between v1 and v2 project branches

In [0]:
results = map(
    lambda n:[n.path,n.path_type,n.size_bytes,n.type],
    client.refs.diff_refs(
        repository=repo,
        left_ref=projectBranchV1,
        right_ref=projectBranchV2).results)

from tabulate import tabulate
print(tabulate(
    results,
    headers=['Path','Path Type','Size(Bytes)','Type']))
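
For a long diff, a summary by folder and change type can be easier to read than the full table. The following is a small sketch over stand-in diff entries. The entries and their values are made up, and only the `path` and `type` attribute names are taken from the cell above.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in diff entries (made-up values)
diff_entries = [
    SimpleNamespace(path="Images/n02089867-Walker_hound/n02089867_24.jpg", type="changed"),
    SimpleNamespace(path="Images/n02089867-Walker_hound/n02089867_31.jpg", type="changed"),
    SimpleNamespace(path="Annotations/n02089867-Walker_hound/n02089867_24", type="changed"),
]

# Count entries per (top-level folder, change type)
counts = Counter((entry.path.split("/")[0], entry.type) for entry in diff_entries)

for (folder, change_type), count in sorted(counts.items()):
    print(folder, change_type, count)
```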

## If you made a mistake, you can atomically roll back all changes

### Roll back changes in v2 branch by using v2 tag

In [0]:
client.branches.revert_branch(
    repository=repo,
    branch=projectBranchV2, 
    revert_creation=models.RevertCreation(
        ref=tagV2, parent_number=1))

## Diff between v1 and v2 project branches after the rollback

In [0]:
results = map(
    lambda n:[n.path,n.path_type,n.size_bytes,n.type],
    client.refs.diff_refs(
        repository=repo,
        left_ref=projectBranchV1,
        right_ref=projectBranchV2).results)

from tabulate import tabulate
print(tabulate(
    results,
    headers=['Path','Path Type','Size(Bytes)','Type']))

# Project Completes

## More Questions?

###### Join the [lakeFS Slack group](https://lakefs.io/slack)