<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Import into a lakeFS repository from multiple paths

## Config

### lakeFS endpoint and credentials

Change these if you are using a lakeFS instance other than the one provided in the samples repo.

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFODNN7EXAMPLE'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
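
If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables and fall back to the sample values. The following is a minimal sketch only: the helper `load_lakefs_settings` and the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY`, and `LAKEFS_SECRET_KEY` are hypothetical names chosen for illustration, not names that lakeFS itself defines.

```python
import os


def load_lakefs_settings(env=os.environ):
    """Return (endpoint, access_key, secret_key).

    The environment variable names are hypothetical; the fallbacks are the
    sample values used elsewhere in this notebook.
    """
    return (
        env.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
        env.get("LAKEFS_ACCESS_KEY", "AKIAIOSFODNN7EXAMPLE"),
        env.get("LAKEFS_SECRET_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"),
    )


lakefsEndPoint, lakefsAccessKey, lakefsSecretKey = load_lakefs_settings()
```

With none of the variables set, the three names hold the same sample values as the cell above.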

### Storage Information

If you're not using the lakeFS instance from the samples repo, change the storage namespace to a location in the bucket you’ve configured. The storage namespace is the location in the underlying storage where this repository's data will be stored.

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

In [3]:
repo_name = "multi-bucket-import"
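
Later in this notebook the repository is created under the storage namespace joined with the repository name. As a rough illustration of that join, the hypothetical helper `repo_storage_namespace` below (not part of `lakefs_client`) strips any trailing slash from the base so the result contains exactly one separator.

```python
def repo_storage_namespace(base, repo_name):
    """Join a base storage namespace and a repository name with a single slash."""
    return f"{base.rstrip('/')}/{repo_name}"


# Same values as the cells above
print(repo_storage_namespace("s3://example", "multi-bucket-import"))
# prints: s3://example/multi-bucket-import
```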

## Setup

### Configuring lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

In [5]:
print(f"lakeFS client version: {lakefs_client.__version__}")

lakeFS client version: 0.101.0


### Define lakeFS Repository

In [6]:
import os  # needed for os._exit below

from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

Repository multi-bucket-import-repo does not exist, so going to try and create it now.
Created new repo multi-bucket-import-repo using storage namespace s3://example/multi-bucket-import-repo


## Import to a single repository from multiple paths

### Configure the source/target paths

In [7]:
sourceBranch = "main"

# Import Sources and Destinations
importSource1 = "s3://sample-data/nyc_film_permits.json" # e.g. "s3://sample-dog-images/Images/n02085620-Chihuahua/"
importSource2 = "s3://sample-data/movies.csv" # e.g. "s3://sample-dog-images/Annotation/n02085620-Chihuahua/"
importDestination1 = "dataset01/" # e.g. "Images/"
importDestination2 = "dataset02/" # e.g. "Annotations/"

### Create a set of imports

In [8]:
importPaths=[
    ImportLocation(type="common_prefix", path=importSource1, destination=importDestination1),
    ImportLocation(type="common_prefix", path=importSource2, destination=importDestination2)
]
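
With more than two sources, it may be easier to drive the list from a mapping of source path to destination. The sketch below is an assumption-laden illustration: `build_import_specs` is a hypothetical helper that produces plain dictionaries whose keys mirror the `ImportLocation` keyword arguments used above, and it rejects duplicate destinations on the assumption that two sources sharing a destination could shadow each other's objects.

```python
def build_import_specs(source_to_destination, location_type="common_prefix"):
    """Turn {source: destination} into a list of ImportLocation-style dicts."""
    specs = []
    seen = set()
    for source, destination in source_to_destination.items():
        if destination in seen:
            raise ValueError(f"Destination {destination!r} is used by more than one source")
        seen.add(destination)
        specs.append({"type": location_type, "path": source, "destination": destination})
    return specs


specs = build_import_specs({
    "s3://sample-data/nyc_film_permits.json": "dataset01/",
    "s3://sample-data/movies.csv": "dataset02/",
})
for spec in specs:
    print(spec)
```

In the notebook, each dictionary could then be expanded into a model object with `ImportLocation(**spec)`.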

### Do the import

In [10]:
import_instance = lakefs.import_api.import_start(
                       repo.id,
                       sourceBranch,
                       ImportCreation(paths=importPaths,
                                      commit=CommitCreation(message="import objects", 
                                                            metadata={"example_key": "example_value",
                                                                      "source_code": "jupyter notebook"})))

### Wait for import to finish

In [15]:
import time

while True:
    status_resp = lakefs.import_api.import_status(repo.id, sourceBranch, import_instance.id)
    print(status_resp)
    # Assumption: a failed import is reported in the status model's `error` field
    error = getattr(status_resp, "error", None)
    if error is not None:
        raise Exception(error)
    if status_resp.completed:
        print(f"\nImport completed Successfully 🎉\n\nData imported into branch: {status_resp.import_branch}")
        break
    time.sleep(2)

{'commit': {'committer': 'everything-bagel',
            'creation_date': 1685465437,
            'id': '7d061e7d3509dda4b94981bfba27ee9b42e5bc287acc9824300545b63a2ac72d',
            'message': 'import objects',
            'meta_range_id': '',
            'metadata': {'example_key': 'example_value',
                         'source_code': 'jupyter notebook'},
            'parents': ['70df95adc7dbc65eaaeb72bbb3bf6ba9fc57539df6f4f6c32becbad2588b1646']},
 'completed': True,
 'import_branch': '_main_imported',
 'ingested_objects': 2,
 'metarange_id': 'ff46b5743d0032782c9a1100f9bdcdb3b95a1f8823f40a60a3dac144660e2cbf',
 'update_time': datetime.datetime(2023, 5, 30, 16, 50, 38, 515743, tzinfo=tzlocal())}

Import completed Successfully 🎉

Data imported into branch: _main_imported
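
The polling loop above runs until the import completes or fails, with no upper bound on the wait. A possible variation, sketched here under stated assumptions, adds a timeout. `wait_for_import` is a hypothetical helper, not part of `lakefs_client`; it assumes the status object exposes `completed` and, on failure, an `error` attribute.

```python
import time


def wait_for_import(fetch_status, timeout_s=300.0, interval_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the import completes, fails, or times out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        # Assumption: failures are surfaced via an `error` attribute
        error = getattr(status, "error", None)
        if error is not None:
            raise RuntimeError(f"Import failed: {error}")
        if getattr(status, "completed", False):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Import did not complete within {timeout_s} seconds")
        sleep(interval_s)
```

In this notebook it could be called as `wait_for_import(lambda: lakefs.import_api.import_status(repo.id, sourceBranch, import_instance.id))`.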


## Merge import branch into main

In [19]:
lakefs.refs.merge_into_branch(
    repository=repo.id,
    source_ref=status_resp.import_branch, 
    destination_branch=sourceBranch)

{'reference': 'ff32b352a8cf4a85e1c38339118c959fad54b50f77a24fa46e51a8bf8a7133cf',
 'summary': {'added': 0, 'changed': 0, 'conflict': 0, 'removed': 0}}

## More Questions?

**👉🏻 Join the lakeFS Slack group - https://lakefs.io/slack**