<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Import into a lakeFS repository from multiple paths

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

Each lakeFS repository needs its own unique storage namespace.

The `storageNamespace` value given here is combined with the repo name to create the storage namespace that the repository uses.
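
As a minimal sketch of how that combination works, the snippet below joins a base storage URI with a repository name, the same way the Setup section does with an f-string. The helper name `build_storage_namespace` and the second repository name are hypothetical, and stripping a trailing slash is an extra safeguard that the notebook itself does not apply.

```python
def build_storage_namespace(base: str, repo_name: str) -> str:
    """Join a base storage URI and a repository name into one namespace.

    Hypothetical helper for illustration; the notebook builds the same
    string inline with f"{storageNamespace}/{repo_name}".
    """
    # Strip a trailing slash so "s3://example/" and "s3://example" behave alike
    return f"{base.rstrip('/')}/{repo_name}"


# Two repositories sharing one bucket still get distinct namespaces
namespaces = [
    build_storage_namespace("s3://example", name)
    for name in ("multi-bucket-import", "another-repo")  # second name is made up
]
print(namespaces)
```

Because the repo name is part of the path, several repositories can share one bucket as long as their names differ.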

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

## Setup

**(You shouldn't need to change anything in this section; just run it.)**

In [None]:
repo_name = "multi-bucket-import"

### Import libraries

In [None]:
import os
import lakefs

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.client.Client().version
except Exception as e:
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

### Define lakeFS Repository

In [None]:
sourceBranch = "main"
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=sourceBranch, exist_ok=True)
branchMain = repo.branch(sourceBranch)
print(repo)

## Import to a single repository from multiple paths

### Configure the source/target paths

In [None]:
# Import Sources and Destinations
importSource1 = "s3://sample-data/stanfordogsdataset/Images" # e.g. "s3://sample-dog-images/Images/n02085620-Chihuahua/"
importSource2 = "s3://sample-data/stanfordogsdataset/Annotation" # e.g. "s3://sample-dog-images/Annotation/n02085620-Chihuahua/"
importDestination = "raw/" # will keep the original files in the raw directory
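
Both sources above are imported into the same `raw/` destination, so it can be worth checking up front that they will not map to the same lakeFS paths. The sketch below assumes that a prefix import keeps each object's path relative to its source prefix and places it under the destination; that mapping is an assumption here, not something this notebook verifies. The helper `destination_path` and the object names are hypothetical.

```python
from collections import Counter


def destination_path(source_prefix: str, object_uri: str, destination: str) -> str:
    """Map a source object URI to its assumed path inside the lakeFS branch."""
    prefix = source_prefix.rstrip("/") + "/"
    if not object_uri.startswith(prefix):
        raise ValueError(f"{object_uri} is not under {source_prefix}")
    relative = object_uri[len(prefix):]
    return destination.rstrip("/") + "/" + relative


# Hypothetical object listings for the two source prefixes
images_prefix = "s3://sample-data/stanfordogsdataset/Images"
annotation_prefix = "s3://sample-data/stanfordogsdataset/Annotation"
sources = {
    images_prefix: [f"{images_prefix}/n02085620-Chihuahua/example_1.jpg"],
    annotation_prefix: [f"{annotation_prefix}/n02085620-Chihuahua/example_1"],
}

# Compute every destination path and report any path produced more than once
mapped = [
    destination_path(prefix, uri, "raw/")
    for prefix, uris in sources.items()
    for uri in uris
]
collisions = [path for path, count in Counter(mapped).items() if count > 1]
print(mapped)
print("collisions:", collisions)
```

If the two sources could contain objects with identical relative paths, giving each source its own destination (for example `raw/images/` and `raw/annotations/`) avoids the overlap.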

### Do the import

In [None]:
import time

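# Build one import job that pulls objects from both source prefixes into the same destination
importer = branchMain.import_data(commit_message="import objects", metadata={"key": "value"}) \
    .prefix(importSource1, destination=importDestination) \
    .prefix(importSource2, destination=importDestination)

# Start the import and fetch its initial status
importer.start()
status = importer.status()
print(status)

# Poll every 2 seconds until the import completes or reports an error
while not status.completed and status.error is None:
    time.sleep(2)
    status = importer.status()
    print(status)

# Stop here if the import failed
if status.error:
    raise Exception(status.error)
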
print(f"\nImported a total of {status.ingested_objects} objects into branch {sourceBranch}")
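
The polling loop above runs until the import completes or reports an error, with no upper bound on how long it waits. A minimal sketch of the same loop with a timeout is shown below. The helper `wait_for_import` is hypothetical, and it only assumes a status object with `completed`, `error`, and `ingested_objects` attributes, as used in the cell above. A stand-in status sequence replaces the real importer so the snippet runs without a lakeFS server.

```python
import time
from types import SimpleNamespace


def wait_for_import(get_status, poll_seconds=2.0, timeout_seconds=600.0):
    """Poll get_status() until the import completes, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while not status.completed and status.error is None:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"import did not finish within {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = get_status()
    if status.error:
        raise RuntimeError(str(status.error))
    return status


# Stand-in for importer.status: one in-progress status, then a completed one
fake_statuses = iter([
    SimpleNamespace(completed=False, error=None, ingested_objects=10),
    SimpleNamespace(completed=True, error=None, ingested_objects=25),
])
final = wait_for_import(lambda: next(fake_statuses), poll_seconds=0.01, timeout_seconds=5)
print(f"Imported a total of {final.ingested_objects} objects")
```

In the notebook, the equivalent call would be `wait_for_import(importer.status)` after `importer.start()`.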

In [None]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"### 👉🏻 View the objects in [lakeFS web UI]({lakeFSWebUI}/repositories/{repo_name}/objects)")

## More Questions?

**👉🏻 Join the lakeFS Slack group - https://lakefs.io/slack**