<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Import into a lakeFS repository from multiple paths

## Config

### lakeFS endpoint and credentials

Change these if you are using a lakeFS server other than the one provided in the samples repo.

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
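
The values above are the sample credentials. When pointing the notebook at another lakeFS server, one option is to read the settings from environment variables instead of editing the cell. The following is a minimal sketch of that idea; the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY` and `LAKEFS_SECRET_KEY` and the helper `load_lakefs_settings` are hypothetical and are not defined by lakeFS or by this notebook.

```python
import os

# Sample-repo defaults, copied from the cell above.
# The environment variable names are assumptions made for this sketch.
DEFAULTS = {
    "LAKEFS_ENDPOINT": "http://lakefs:8000",
    "LAKEFS_ACCESS_KEY": "AKIAIOSFOLKFSSAMPLES",
    "LAKEFS_SECRET_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}


def load_lakefs_settings(env=None):
    """Return (endpoint, access_key, secret_key), preferring values found in env."""
    env = os.environ if env is None else env
    # Fall back to the sample default when a variable is unset or empty.
    return tuple(env.get(name) or default for name, default in DEFAULTS.items())


lakefsEndPoint, lakefsAccessKey, lakefsSecretKey = load_lakefs_settings()
```

With none of the variables set, the three names keep the sample values, so the rest of the notebook runs unchanged.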

### Storage Information

If you are not using the lakeFS server from the samples repo, change the storage namespace to a location in the bucket you have configured. The storage namespace is the location in the underlying storage where the data for this repository is stored.

In [2]:
storageNamespace = 's3://example/import/' # e.g. "s3://bucket"

In [3]:
repo_name = "multi-bucket-import"

## Setup

### Configuring lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

In [5]:
print(f"lakeFS client version: {lakefs_client.__version__}")

lakeFS client version: 0.102.2


### Define lakeFS Repository

In [6]:
import os  # needed for the os._exit calls below

from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(
            repository_creation=RepositoryCreation(name=repo_name,
                                                   storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

Repository multi-bucket-import does not exist, so going to try and create it now.
Created new repo multi-bucket-import using storage namespace s3://example/import//multi-bucket-import
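
The storage namespace in the output above contains a double slash (`import//multi-bucket-import`) because `storageNamespace` already ends with `/` and the f-string adds another one. The repository was still created in this run, but if a single separator is preferred, the join can be normalised. This is a minimal sketch with a hypothetical helper, `join_namespace`, which is not part of `lakefs_client`; it assumes the base namespace includes at least a bucket name after the scheme.

```python
def join_namespace(base: str, name: str) -> str:
    """Join a storage namespace and a repository name with exactly one slash.

    Assumes base looks like 's3://bucket' or 's3://bucket/prefix/'.
    """
    return f"{base.rstrip('/')}/{name.lstrip('/')}"


print(join_namespace("s3://example/import/", "multi-bucket-import"))
# prints: s3://example/import/multi-bucket-import
```

The result could be passed as `storage_namespace` in place of the f-string used when creating the repository.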


## Import to a single repository from multiple paths

### Configure the source/target paths

In [7]:
sourceBranch = "main"

# Import Sources and Destinations
importSource1 = "s3://sample-data/stanfordogsdataset/Images" # e.g. "s3://sample-dog-images/Images/n02085620-Chihuahua/"
importSource2 = "s3://sample-data/stanfordogsdataset/Annotation" # e.g. "s3://sample-dog-images/Annotation/n02085620-Chihuahua/"
importDestination = "raw/" # destination prefix in the lakeFS branch
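
Both sources above share a single destination prefix. If each source should instead land under its own sub-prefix, the destination can be derived from the last component of the source path. This is only a sketch: `destination_for` is a hypothetical helper, and whether separate sub-prefixes are needed depends on how the import lays out objects under the destination, which this notebook does not state.

```python
import posixpath


def destination_for(source: str, base: str = "raw/") -> str:
    """Map a source URI to '<base>/<last path component>/'."""
    # rstrip first so a trailing slash on the source does not yield an empty name
    leaf = posixpath.basename(source.rstrip("/"))
    return f"{base.rstrip('/')}/{leaf}/"


sources = [
    "s3://sample-data/stanfordogsdataset/Images",
    "s3://sample-data/stanfordogsdataset/Annotation",
]
destinations = {src: destination_for(src) for src in sources}
for src, dst in destinations.items():
    print(src, "->", dst)
```

Each value in `destinations` could then be used as the `destination` for the matching source.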


### Do the import

In [8]:
import time

# Start Import
import_api = lakefs.__dict__["import"]
commit = CommitCreation(message="import objects", metadata={"key": "value"})
paths=[
    # Both sources are imported under the same destination prefix
    ImportLocation(type="common_prefix", path=importSource1, destination=importDestination),
    ImportLocation(type="common_prefix", path=importSource2, destination=importDestination)
]
import_creation = ImportCreation(paths=paths, commit=commit)
create_resp = import_api.import_start(repo.id, sourceBranch, import_creation)

# Wait for import to finish
while True:
    status_resp = import_api.import_status(repo.id, sourceBranch, create_resp.id)
    print(status_resp)
    # Stop polling if the status reports an error
    if hasattr(status_resp, "error"):
        raise Exception(status_resp.error)
    if status_resp.completed:
        print("Import completed Successfully. Data imported into branch:", sourceBranch)
        break
    time.sleep(2)

{'completed': False,
 'ingested_objects': 0,
 'update_time': datetime.datetime(2023, 6, 16, 15, 5, 39, 447398, tzinfo=tzlocal())}
{'commit': {'committer': 'everything-bagel',
            'creation_date': 1686927939,
            'id': '45c87b4bf61e9d83c56ea8e57da4e7b83884aa7ad108b0c7836f9c604bc59707',
            'message': 'import objects',
            'meta_range_id': '59f2bd443383773de2aed4b7a000a38332323bbc37fa504921639788f97f3978',
            'metadata': {'.lakefs.merge.strategy': 'source-wins',
                         'key': 'value'},
            'parents': ['1a485f5b85d57a7d0e5cba8ab0603c9306786d9e9fdd5790e5a7d0330843e714']},
 'completed': True,
 'ingested_objects': 1518,
 'metarange_id': '59f2bd443383773de2aed4b7a000a38332323bbc37fa504921639788f97f3978',
 'update_time': datetime.datetime(2023, 6, 16, 15, 5, 40, 466746, tzinfo=tzlocal())}
Import completed Successfully. Data imported into branch: main
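
The loop above polls every two seconds until the import completes, with no upper bound on how long it waits. A bounded variant is sketched below as a generic helper; `wait_until` and its parameters are hypothetical names introduced here, and the demo uses a stand-in check rather than a real lakeFS call.

```python
import time


def wait_until(check, timeout_s=300.0, interval_s=2.0):
    """Call check() until it returns a truthy value or timeout_s elapses.

    Returns the truthy value; raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)


# Demo with a stand-in check that succeeds on the third call.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


wait_until(fake_check, timeout_s=5, interval_s=0)
print("checks made:", calls["n"])
```

In this notebook, `check` could be a small function that calls `import_api.import_status(repo.id, sourceBranch, create_resp.id)` and returns the status object once `completed` is true, and `None` otherwise.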


In [9]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"### 👉🏻 View the objects in [lakeFS web UI]({lakeFSWebUI}/repositories/{repo_name}/objects)")

### 👉🏻 View the objects in [lakeFS web UI](http://localhost:8000/repositories/multi-bucket-import/objects)

## More Questions?

**👉🏻 Join the lakeFS Slack group - https://lakefs.io/slack**