# Integration of lakeFS with Labelbox

**Use Case**: ML Reproducibility

## Before you start

* ☑️ Sign up for a Labelbox account at https://app.labelbox.com/signup and create a Labelbox API key.
* You will need an S3 bucket that both lakeFS and Labelbox can access. The provided MinIO storage won't work for this because it's not accessible by Labelbox.

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### Labelbox API key

In [None]:
LB_API_KEY = '<Labelbox API key>' # paste your own key here; avoid committing real keys to source control

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
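
Hard-coded credentials are easy to leak when a notebook is shared. As a minimal sketch, assuming you export the values before starting Jupyter, the settings could instead be read from environment variables. The helper `read_setting` and the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY_ID` and `LAKEFS_SECRET_ACCESS_KEY` are hypothetical choices for this illustration, not names that lakeFS requires; the defaults are the sample values from the cell above.

```python
import os

def read_setting(name, default=None):
    """Return environment variable `name`, or `default` if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if value:
        return value
    if default is not None:
        return default
    raise RuntimeError(f"Set the {name} environment variable before running this notebook")

# Hypothetical variable names; defaults are the sample values used in this notebook.
lakefsEndPoint = read_setting("LAKEFS_ENDPOINT", "http://lakefs:8000")
lakefsAccessKey = read_setting("LAKEFS_ACCESS_KEY_ID", "AKIAIOSFOLKFSSAMPLES")
lakefsSecretKey = read_setting("LAKEFS_SECRET_ACCESS_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
```

Calling `read_setting` without a default raises an error when the variable is missing, which may suit secrets such as the Labelbox API key that have no safe sample value.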

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

### S3 Storage Information for storing images

Provide an S3 bucket, AWS region and access key information. This demo uploads a few images to this S3 bucket, and both lakeFS and Labelbox access those images to create the datasets.

Enable the AWS S3 integration in Labelbox and grant Labelbox access to this S3 bucket by following the instructions in the Labelbox docs: https://docs.labelbox.com/docs/import-aws-s3-data

lakeFS should also be able to read from this S3 bucket.

In [None]:
bucketName = '<S3 Bucket Name>' # e.g. labelbox-geospatial-vessel-detection
awsRegion = '<AWS region name>' # e.g. us-east-1
aws_access_key_id = 'aaaaaaaaaaaaa'
aws_secret_access_key = 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'
contentTypeJPG = "image/jpeg"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "labelbox"

In [None]:
mainBranch = "main"
emptyBranch = "empty"

### Labelbox dataset name

In [None]:
lbDataSetName = 'lakeFS Geospatial Vessel Detection'

### Create lakeFSClient

In [None]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.config.get_lake_fs_version()
except Exception as e:
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v.version}")

### Define lakeFS Repository

In [None]:
import os  # needed for os._exit() below
from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

---

# Main demo starts here 🚦 👇🏻

## Setup Task: Run additional [Setup](./Labelbox/LabelBoxSetup.ipynb) tasks here

In [None]:
%run ./Labelbox/LabelBoxSetup.ipynb

# The next few steps are the same as those in the Labelbox tutorial:
#### https://docs.labelbox.com/reference/import-a-labeled-dataset-images-1

## Upload images to S3 bucket
#### These images are the same as those used in the Labelbox tutorial: https://docs.labelbox.com/reference/import-a-labeled-dataset-images-1
#### This process may take a few minutes

In [None]:
for path, subdirs, files in os.walk(os.path.expanduser('~')+'/Images/Labelbox/AllImages/'):
    for file in files:
        if file.endswith(".jpg"):
            folder = path.rsplit("/")[-1]
            s3.upload_file(Filename=path+'/'+file, Bucket=bucketName, Key=folder+'/'+file)
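
The object key used for each upload is the image's parent folder name followed by its file name. A minimal sketch of that mapping, using `pathlib` instead of string splitting, is shown below. The helper name `list_uploads` is hypothetical, and it only builds the list of `(local_path, key)` pairs so that the mapping can be checked before anything is sent to S3.

```python
from pathlib import Path

def list_uploads(root):
    """Return sorted (local_path, s3_key) pairs for every .jpg file under `root`.

    The key is '<parent folder name>/<file name>', as in the upload cell above.
    """
    pairs = []
    for image in Path(root).rglob("*.jpg"):
        pairs.append((str(image), f"{image.parent.name}/{image.name}"))
    return sorted(pairs)
```

Each pair could then be passed to `s3.upload_file(Filename=local_path, Bucket=bucketName, Key=key)`, assuming `s3` is the client created by the setup notebook.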

## Read annotations file
#### This annotations file is the same as the one used in the Labelbox tutorial: https://docs.labelbox.com/reference/import-a-labeled-dataset-images-1

In [None]:
with open(os.path.expanduser('~')+'/Images/Labelbox/geospatial_annotations.json', 'r') as fp:
    annotations = json.load(fp)

## Create Labelbox dataset

In [None]:
dataset = lb_client.create_dataset(name=lbDataSetName)
data_rows = []

for path, subdirs, files in os.walk(os.path.expanduser('~')+'/Images/Labelbox/AllImages/'):
    for file in files:
        if file.endswith(".jpg"):
            folder = path.rsplit("/")[-1]
            data_row_dict = {'row_data': "https://"+ bucketName + ".s3." + awsRegion + ".amazonaws.com/" + folder + '/' + file,
                "global_key": "https://" + bucketName + ".s3." + awsRegion + ".amazonaws.com/" + folder + '/' + file  + str(uuid4()),
                "external_id": folder + "/" + file,
                'media_type': 'IMAGE',
                "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
                "attachments": [{"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB" }],
            }
            data_rows.append(data_row_dict)

task = dataset.create_data_rows(data_rows)
task.wait_till_done()
print(task.errors)
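
Each data row points Labelbox at the image through a virtual-hosted-style S3 URL of the form `https://<bucket>.s3.<region>.amazonaws.com/<key>`. The sketch below factors that string building into one function. The name `s3_object_url` is hypothetical, and percent-encoding the key with `urllib.parse.quote` is an addition that the cell above does not perform; it only matters for keys containing spaces or other special characters.

```python
from urllib.parse import quote

def s3_object_url(bucket, region, key):
    """Build the virtual-hosted-style HTTPS URL for an S3 object.

    quote() leaves '/' untouched, so folder separators in the key survive.
    """
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"
```

With this helper, `row_data` in the cell above could be written as `s3_object_url(bucketName, awsRegion, folder + '/' + file)`.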

## Set up a labeling project in Labelbox

In [None]:
ontology = OntologyBuilder()

for tool in annotations['categories']:
  print(tool['name'])
  ontology.add_tool(Tool(tool = Tool.Type.BBOX, name = tool['name']))

ontology = lb_client.create_ontology(lbDataSetName + " ontology", ontology.asdict())
project = lb_client.create_project(name = lbDataSetName)
project.setup_editor(ontology)
ontology_from_project = OntologyBuilder.from_project(project)

## Prepare and queue batch of Data Rows to the Labelbox project

In [None]:
data_rows = [dr.uid for dr in list(dataset.export_data_rows())]

# Randomly select 200 Data Rows
sampled_data_rows = random.sample(data_rows, 200)

batch = project.create_batch(
  "Initial batch", # name of the batch
  sampled_data_rows, # list of Data Rows
  1 # priority between 1-5
)
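
`random.sample` raises a `ValueError` when asked for more items than the list contains, so the cell above fails if the dataset holds fewer than 200 data rows. A minimal sketch of a more forgiving sampler follows. The helper `sample_rows` and its optional `seed` argument are hypothetical additions; a seed makes the selection repeatable between runs.

```python
import random

def sample_rows(row_ids, k=200, seed=None):
    """Pick up to `k` ids at random without replacement.

    If fewer than `k` ids exist, all of them are returned in random order.
    """
    ids = list(row_ids)
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return rng.sample(ids, min(k, len(ids)))
```

For example, `sampled_data_rows = sample_rows(data_rows, 200, seed=42)` could stand in for the direct `random.sample` call.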

## Process ground truth annotations for import

In [None]:
queued_data_rows = project.export_queued_data_rows()
ground_truth_list = LabelList()

for datarow in queued_data_rows:
  annotations_list = []
  folder = datarow['externalId'].split("/")[0]
  file_name = datarow['externalId'].split("/")[1]
  if folder == "positive_image_set":
    for image in annotations['images']:
      if (image['file_name']==file_name):
        for annotation in annotations['annotations']:
          if annotation['image_id'] == image['id']:
            bbox = annotation['bbox']
            # category_id - 1 indexes the tools added to the ontology in category order
            tool_index = annotation['category_id'] - 1
            class_name = ontology_from_project.tools[tool_index].name
            annotations_list.append(ObjectAnnotation(
                name = class_name,
                value = Rectangle(start = Point(x = bbox[0], y = bbox[1]), end = Point(x = bbox[2]+bbox[0], y = bbox[3]+bbox[1])),
            ))
  image = ImageData(uid = datarow['id'])
  ground_truth_list.append(Label(data = image, annotations = annotations_list))
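
The arithmetic in the cell above implies that each `bbox` is stored as `[x, y, width, height]`, while the `Rectangle` needs a start corner and an end corner. The sketch below isolates that conversion and also groups annotations by file name once, which avoids rescanning every image and annotation for each data row. Both helper names are hypothetical, and the dictionary layout (`images`, `annotations`, `image_id`, `file_name`) is taken from the keys the cell above reads.

```python
from collections import defaultdict

def bbox_to_corners(bbox):
    """Convert [x, y, width, height] into ((x_min, y_min), (x_max, y_max))."""
    x, y, width, height = bbox
    return (x, y), (x + width, y + height)

def annotations_by_file(annotations):
    """Group annotation records by the file name of the image they belong to."""
    file_by_image_id = {img["id"]: img["file_name"] for img in annotations["images"]}
    grouped = defaultdict(list)
    for ann in annotations["annotations"]:
        file_name = file_by_image_id.get(ann["image_id"])
        if file_name is not None:  # skip annotations that reference an unknown image
            grouped[file_name].append(ann)
    return dict(grouped)
```

A data row's annotations could then be looked up with `annotations_by_file(annotations).get(file_name, [])`, building the index once before the loop.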

## Import ground truth annotation

In [None]:
ground_truth_list.assign_feature_schema_ids(OntologyBuilder.from_project(project))
ground_truth_ndjson = list(NDJsonConverter.serialize(ground_truth_list))

start_time = time.time()
# Upload annotations
upload_task = LabelImport.create_from_objects(lb_client, project.uid, "geospatial-import-job-1", ground_truth_ndjson)
print(upload_task)

# Wait for the upload to finish (may take up to five minutes)
upload_task.wait_until_done()
print(upload_task.errors)
print("--- Finished in %s mins ---" % ((time.time() - start_time)/60))

## Create a new repo in lakeFS

In [None]:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=repo,
        storage_namespace=storageNamespace,
        default_branch=mainBranch))

## Create an empty branch in lakeFS

In [None]:
client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=emptyBranch,
        source=mainBranch))

# Project Starts

## Labelbox Slice
#### Create a Slice in Labelbox for a particular annotation, e.g. bridge, and save the Slice

![Bridge Slice](./Images/Labelbox/BridgeSlice1.png)

## Copy the Slice ID

![Bridge Slice](./Images/Labelbox/BridgeSlice2.png)

## Paste Labelbox Slice ID

In [None]:
catalog_slice_id = "<Labelbox slice id>"

## Class label and version information
#### The class label can be an annotation name, e.g. vehicle, or any other label you want to use in lakeFS

In [None]:
classLabel = "bridge"
version = "v1"

## Create empty Project v1 branch

In [None]:
projectBranchV1 = "project_"+classLabel+"_"+version

client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=projectBranchV1,
        source=emptyBranch))

## Read Labelbox slice and stage/import those images to Project v1 branch in lakeFS repo

In [None]:
catalog_slice = lb_client.get_catalog_slice(catalog_slice_id) #-> CatalogSlice

# Get data row ids in a slice
slice_data_rows_ids = catalog_slice.get_data_row_ids()
for data_row_id in slice_data_rows_ids:
    datarow = lb_client.get_data_row(data_row_id)
    filename = datarow.external_id #.split("/")[1]
    filesize = datarow.media_attributes["contentLength"]

    # Stage image file
    stage_objects(repo, projectBranchV1, "s3://"+bucketName+"/"+filename, filename, filesize, contentTypeJPG)

## Commit changes and attach some metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=projectBranchV1,
    commit_creation=models.CommitCreation(
        message='Uploaded images for class label '+classLabel+' and version '+version,
        metadata={'classLabel': classLabel,
            'version': version}))

## Add v1 tag for future use. You can also run your model by using this tag.

In [None]:
tagV1 = datetime.datetime.now().strftime("%Y_%m_%d")+f"_{projectBranchV1}"

client.tags.create_tag(
    repository=repo,
    tag_creation=models.TagCreation(
        id=tagV1, 
        ref=projectBranchV1))
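
The tag ID combines the current date with the branch name. A minimal sketch of that naming scheme as a function is given below; `make_tag_name` is a hypothetical helper, and passing the date explicitly is an addition that keeps the result predictable in tests.

```python
import datetime

def make_tag_name(branch, today=None):
    """Return '<YYYY_MM_DD>_<branch>', using today's date unless one is supplied."""
    today = today or datetime.date.today()
    return f"{today.strftime('%Y_%m_%d')}_{branch}"
```

Because the ID has day granularity, tagging the same branch twice on one day yields the same ID, and the second `create_tag` call would likely be rejected because the tag already exists.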

## Read images using v1 tag

In [None]:
dataPath = f"s3a://{repo}/{tagV1}/positive_image_set/"

df= spark.read.format("image").load(dataPath)
df.select("image.origin", "image.width", "image.height").show(truncate=False)
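
Through the lakeFS S3 gateway, the repository name takes the place of the bucket and the first path element is a ref, which can be a branch, a tag or a commit ID. The sketch below builds such a path; the helper name `lakefs_path` is hypothetical.

```python
def lakefs_path(repository, ref, prefix="", scheme="s3a"):
    """Build '<scheme>://<repository>/<ref>/<prefix>/' for reading through lakeFS."""
    prefix = prefix.strip("/")
    base = f"{scheme}://{repository}/{ref}/"
    return f"{base}{prefix}/" if prefix else base
```

Reading from a tag rather than a branch pins the job to an immutable snapshot, so rerunning it later sees the same images even if the branch has moved on.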

## Create Project v2 branch sourced from v1 branch

In [None]:
version = "v2"

In [None]:
projectBranchV2 = "project_"+classLabel+"_"+version

client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name=projectBranchV2,
        source=projectBranchV1))

## Upload changed and new images

In [None]:
directory = os.path.expanduser('~')+'/Images/Labelbox/ChangedImages/positive_image_set/'
files = Path(directory).glob('*.jpg')
path = 'positive_image_set'

upload_files(repo, projectBranchV2, path, files)

## Commit changes and attach some metadata

In [None]:
client.commits.commit(
    repository=repo,
    branch=projectBranchV2,
    commit_creation=models.CommitCreation(
        message='Uploaded changed images for class label '+classLabel+' and version '+version,
        metadata={'classLabel': classLabel,
            'version': version}))

## Review commit log

In [None]:
results = map(
    lambda n:[n.message,n.metadata,n.id],
    client.refs.log_commits(
        repository=repo,
        ref=projectBranchV2).results)

print(tabulate(
    results,
    headers=['Message','Metadata','Commit Id']))

## Add v2 tag for future use. You can also run your model by using this tag.

In [None]:
tagV2 = datetime.datetime.now().strftime("%Y_%m_%d")+f"_{projectBranchV2}"

client.tags.create_tag(
    repository=repo,
    tag_creation=models.TagCreation(
        id=tagV2, 
        ref=projectBranchV2))

## Read images using v2 tag

In [None]:
dataPath = f"s3a://{repo}/{tagV2}/positive_image_set/"

df= spark.read.format("image").load(dataPath)
df.select("image.origin", "image.width", "image.height").show(truncate=False)

## Diff between v1 and v2 project branch

In [None]:
results = map(
    lambda n:[n.path,n.path_type,n.size_bytes,n.type],
    client.refs.diff_refs(
        repository=repo,
        left_ref=projectBranchV1,
        right_ref=projectBranchV2).results)

print(tabulate(
    results,
    headers=['Path','Path Type','Size(Bytes)','Type']))
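
For a large diff, a count per change type can be easier to read than the full table. The sketch below tallies the `type` attribute of each diff entry. `summarize_diff` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the entries returned by `diff_refs`, used here so the example runs without a lakeFS server.

```python
from collections import Counter
from types import SimpleNamespace

def summarize_diff(entries):
    """Count diff entries per change type (for example 'added' or 'changed')."""
    return dict(Counter(entry.type for entry in entries))

# Stand-in entries for illustration only
sample_entries = [
    SimpleNamespace(path="positive_image_set/a.jpg", type="changed"),
    SimpleNamespace(path="positive_image_set/b.jpg", type="added"),
    SimpleNamespace(path="positive_image_set/c.jpg", type="added"),
]
print(summarize_diff(sample_entries))
```

In the notebook, the same function could be applied to the `.results` list returned by `client.refs.diff_refs(...)`.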

## If you made mistakes, you can atomically roll back all changes in the v2 branch

### Roll back changes in the v2 branch by using the v2 tag

In [None]:
client.branches.revert_branch(
    repository=repo,
    branch=projectBranchV2, 
    revert_creation=models.RevertCreation(
        ref=tagV2, parent_number=1))

## Diff between v1 and v2 project branch
#### There will be no difference now as you rolled back the changes in the previous step

In [None]:
results = map(
    lambda n:[n.path,n.path_type,n.size_bytes,n.type],
    client.refs.diff_refs(
        repository=repo,
        left_ref=projectBranchV1,
        right_ref=projectBranchV2).results)

print(tabulate(
    results,
    headers=['Path','Path Type','Size(Bytes)','Type']))

# Project Completes

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack