In [None]:
!pip install "dlt[filesystem]==1.15.0" s3fs adlfs

In [None]:
!pip install pyarrow pandas

In [7]:
import dlt
import pyarrow.parquet as pq
import fsspec

## **Load a remote Parquet file into MinIO using DLT**

This first script **extracts a Parquet file from a public URL**  
and **loads it into MinIO (S3-compatible storage)** using a DLT pipeline.

**Process overview:**
1. **Define the resource:**  
   - The function `my_df()` opens the Parquet file directly from a remote URL using `fsspec`.  
   - The file is read with `pyarrow` and converted into a pandas DataFrame, which is then yielded to DLT and loaded into the `df_data` table.

2. **Configure the pipeline:**  
   - The pipeline `parquet_to_minio` uses the `filesystem` destination.  
   - Credentials and the bucket location are read automatically from the `.dlt/secrets.toml` file.  
   - The data will be stored in a dataset named `taxis_parquet`.

3. **Execute the pipeline:**  
   - With `loader_file_format="parquet"`, the pipeline writes its output as Parquet files inside the MinIO bucket.  
   - The `write_disposition="replace"` option replaces the data already loaded for the table on each run, which keeps the environment clean for testing.
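
As an alternative to `.dlt/secrets.toml`, dlt can also resolve the same settings from environment variables, with configuration sections joined by double underscores. The sketch below configures the filesystem destination that way. The bucket URL, endpoint, and credentials are placeholder values assumed for a local MinIO instance, not values taken from this notebook's secrets file.

```python
import os

# Hypothetical settings for a local MinIO instance; replace with your own values.
minio_settings = {
    "DESTINATION__FILESYSTEM__BUCKET_URL": "s3://taxis",
    "DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID": "minioadmin",
    "DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY": "minioadmin",
    "DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL": "http://localhost:9000",
}

# Export the settings before the pipeline is created so dlt can pick them up.
os.environ.update(minio_settings)
```

Each variable name mirrors the corresponding TOML path, so `DESTINATION__FILESYSTEM__BUCKET_URL` plays the role of `bucket_url` under `[destination.filesystem]`.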

In [None]:
# Define a resource that reads data from a remote Parquet file
@dlt.resource(table_name="df_data")
def my_df():
    parquet_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet"
    with fsspec.open(parquet_url, mode="rb") as f:
        table = pq.read_table(f)
        df = table.to_pandas()
        yield df   # yield the DataFrame to DLT


# Pipeline configured with a filesystem destination
# Credentials will be automatically read from `.dlt/secrets.toml`
pipeline = dlt.pipeline(
    pipeline_name="parquet_to_minio",
    destination="filesystem",
    dataset_name="taxis_parquet",
)

# Run the pipeline
load_info = pipeline.run(
    my_df,
    loader_file_format="parquet",
    write_disposition="replace"
)
print(load_info)

Pipeline parquet_to_minio load step completed in 1.21 seconds
1 load package(s) were loaded to destination filesystem and into dataset taxis_parquet
The filesystem destination used s3://taxis location to store data
Load package 1761058478.437103 is LOADED and contains no failed jobs


The load step completed in approximately **1.2 seconds**.  
The output confirms that:
- The data was loaded to the **filesystem destination** (MinIO) at `s3://taxis`.  
- The dataset **`taxis_parquet`** was created and contains the ingested Parquet file.  
- No failed jobs were detected.
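
To see where the files landed, it can help to build the expected object path. This is a minimal sketch that assumes dlt's default filesystem layout, `{table_name}/{load_id}.{file_id}.{ext}`, under the dataset folder. `expected_glob` is a hypothetical helper, not part of dlt.

```python
def expected_glob(bucket_url: str, dataset_name: str, table_name: str) -> str:
    """Build a glob pattern for a table's Parquet files (default layout assumed)."""
    return f"{bucket_url.rstrip('/')}/{dataset_name}/{table_name}/*.parquet"


# Values taken from the pipeline definition and the load output above.
pattern = expected_glob("s3://taxis", "taxis_parquet", "df_data")
print(pattern)
```

The resulting pattern can be passed to the `glob` method of an `s3fs` filesystem configured with the same MinIO endpoint to list the loaded files.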

