In [2]:
from pyiceberg.catalog import load_catalog
import pyarrow as pa
import dlt
import pandas as pd
import pyarrow.dataset as ds
import pyarrow.fs as fs
from dlt.sources.filesystem import filesystem

## **Connect to the Nessie catalog**

We reconnect to the **Nessie REST catalog** to access the Iceberg tables created in the previous notebook.

Listing the namespaces ensures that the connection is active and the catalog is available for querying.


In [None]:

# Configure the connection to the Nessie REST catalog
catalog = load_catalog(
    "nessie",
    **{
        "uri": "http://nessie:19120/iceberg/main/",
    }
)

# Verify the connection by listing the namespaces
namespaces = catalog.list_namespaces()
print("Namespaces:", namespaces)

Namespaces: [('taxis-project',)]
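
A small guard can turn this visual check into an explicit one. The sketch below assumes `list_namespaces()` returns a list of tuples, as in the output above; the helper name `has_namespace` is hypothetical and not part of PyIceberg.

```python
def has_namespace(namespaces, name):
    """Return True if `name` appears in a list of namespace tuples.

    Dotted names such as "a.b" are compared as the tuple ("a", "b").
    """
    wanted = tuple(name.split("."))
    return any(tuple(ns) == wanted for ns in namespaces)


# Sample value copied from the output above; in the notebook, pass
# the result of `catalog.list_namespaces()` instead.
sample = [("taxis-project",)]
print(has_namespace(sample, "taxis-project"))
```

Failing fast here avoids a less obvious error later, when the table is loaded inside the DLT resource.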


## **Extract Iceberg data and load it into Azure Storage**

This step performs the **third stage** of the data pipeline:  
**moving the Iceberg data stored in MinIO into Azure Blob Storage** using DLT.

**Process overview:**

1. **Configure the MinIO filesystem**  
   - Establish an S3-compatible connection to MinIO with credentials and endpoint details.

2. **Define the DLT resource (`iceberg_df`)**  
   - Load the `taxis-project.taxis` table from the Nessie catalog.  
   - Execute a scan to retrieve the data as an Arrow Table.  
   - Yield the data in Arrow batches to be consumed by DLT.

3. **Initialize the pipeline**  
   - Name: `s3_to_adls`  
   - Destination: `filesystem` (configured for Azure via secrets).  
   - Dataset name: `azure`.

4. **Run the pipeline**  
   - Writes the extracted data from Iceberg (MinIO) into Azure Data Lake Storage (ADLS) in **Parquet** format.  
   - Uses `write_disposition="replace"` to overwrite any existing dataset.
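
The destination settings mentioned in step 3 can also be supplied as environment variables. The sketch below assumes dlt's documented convention of upper-casing section names and joining them with double underscores, and it mirrors the `s3_to_adls` → `destination` → `filesystem` layout that this notebook reads from `dlt.secrets` in the verification cell. The helper `to_env_vars` is hypothetical and every value is a placeholder.

```python
def to_env_vars(sections, values):
    """Flatten nested settings into dlt-style environment variable names."""
    env = {}
    for key, value in values.items():
        path = [*sections, key]
        if isinstance(value, dict):
            # Recurse into sub-sections such as "credentials"
            env.update(to_env_vars(path, value))
        else:
            env["__".join(part.upper() for part in path)] = str(value)
    return env


# Placeholder values only; replace them with your own container, account and key.
settings = {
    "bucket_url": "abfss://<container>@<account>.dfs.core.windows.net/<folder>",
    "credentials": {
        "azure_storage_account_name": "<account>",
        "azure_storage_account_key": "<key>",
    },
}

env = to_env_vars(["s3_to_adls", "destination", "filesystem"], settings)
for name in sorted(env):
    print(name)
```

To apply the values, they would be copied into `os.environ` before `dlt.pipeline(...)` is created; keeping keys out of the notebook itself is the main reason to prefer secrets or environment variables over inline literals.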



In [None]:
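# Direct S3-compatible handle to MinIO. The resource below reads through the
# catalog and does not reference `s3` or `iceberg_table_path`.
s3 = fs.S3FileSystem(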
    endpoint_override="http://minio:9000",  # inside Docker: use "minio:9000", from local: "localhost:9000"
    access_key="admin",
    secret_key="password",
    region="us-east-1"
)

iceberg_table_path = "my-bucket/taxis-project/taxis"

@dlt.resource(table_name="taxis")
def iceberg_df():
    # Load the Iceberg table from the Nessie catalog
    taxis = catalog.load_table("taxis-project.taxis")
    
    # Execute the scan and get an Arrow Table
    arrow_table = taxis.scan().to_arrow()
    
    # Iterate over Arrow batches
    for batch in arrow_table.to_batches():
        yield batch

# Define the DLT pipeline
pipeline = dlt.pipeline(
    pipeline_name="s3_to_adls",
    destination="filesystem",
    dataset_name="azure"
)

# Instantiate the filesystem source (it is not passed to pipeline.run below)
source = filesystem()

# Run the pipeline
load_info = pipeline.run(
    iceberg_df,             # the Iceberg resource defined above
    loader_file_format="parquet",
    write_disposition="replace"
)
print(load_info)

Pipeline s3_to_adls load step completed in 1 minute and 38.66 seconds
1 load package(s) were loaded to destination filesystem and into dataset azure
The filesystem destination used abfss://clase-4-dlt@fhbd.dfs.core.windows.net/GRUPO_4 location to store data
Load package 1757017205.9115121 is LOADED and contains no failed jobs


The pipeline **successfully transferred data** from MinIO to Azure Blob Storage.

**Key details:**
- Runtime: ~1 minute 38 seconds.  
- Destination: ADLS (`abfss://clase-4-dlt@fhbd.dfs.core.windows.net/GRUPO_4`).  
- Dataset: `azure`.  
- No failed jobs were detected.

## **Verify the uploaded files in Azure**

After the DLT pipeline finishes, we connect directly to **Azure Data Lake Storage (ADLS)**  
using `fsspec` to verify that the Parquet files were successfully uploaded.

Steps:
1. Retrieve Azure credentials (account name and key) from the DLT secrets.  
2. Initialize an `abfss` filesystem connection.  
3. List all files under the `azure` dataset path to confirm the data transfer.

In [27]:
import fsspec
secrets = dlt.secrets["s3_to_adls"]["destination"]["filesystem"]

path = secrets["bucket_url"] + "/azure"
account_name = secrets["credentials"]["azure_storage_account_name"]
account_key = secrets["credentials"]["azure_storage_account_key"]

# Use a distinct name so the `pyarrow.fs` alias imported earlier is not shadowed
adls_fs = fsspec.filesystem(
    "abfss",
    account_name=account_name,       # your storage account
    credential=account_key
)

# List the files that DLT uploaded
files = adls_fs.ls(path)
print("Files in Azure:", files)

Files in Azure: ['clase-4-dlt/GRUPO_4/azure/_dlt_loads', 'clase-4-dlt/GRUPO_4/azure/_dlt_pipeline_state', 'clase-4-dlt/GRUPO_4/azure/_dlt_version', 'clase-4-dlt/GRUPO_4/azure/filesystem', 'clase-4-dlt/GRUPO_4/azure/init', 'clase-4-dlt/GRUPO_4/azure/post_2020', 'clase-4-dlt/GRUPO_4/azure/taxis']


The output confirms that multiple DLT-related folders and datasets were created in Azure, including:
- `_dlt_loads`, `_dlt_pipeline_state`, `_dlt_version` → pipeline metadata.
- `taxis` → actual data folder.

The presence of the `taxis` folder indicates the **successful ingestion of the Iceberg data into Azure Data Lake Storage (ADLS)**.
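
The listing above stops at folder level. To confirm that Parquet files were actually written, the paths can be listed recursively (fsspec filesystems expose `find(path)` for this) and filtered. The sketch below is a minimal filter under that assumption; the helper `parquet_files` is hypothetical and the sample paths use placeholders rather than real file names.

```python
def parquet_files(paths, table_folder):
    """Keep paths that end in .parquet and sit under `<...>/<table_folder>/`."""
    marker = f"/{table_folder}/"
    return [p for p in paths if marker in p and p.endswith(".parquet")]


# Hypothetical recursive listing; in the notebook, pass the result of the
# fsspec filesystem's `find(path)` call instead.
listing = [
    "clase-4-dlt/GRUPO_4/azure/_dlt_loads/<load_id>.jsonl",
    "clase-4-dlt/GRUPO_4/azure/taxis/<load_id>.<file_id>.parquet",
]

print(parquet_files(listing, "taxis"))
```

An empty result would mean the `taxis` folder exists but holds no Parquet files, which the folder listing alone cannot reveal.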