# Building Custom Sources with the Filesystem in `dlt` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dlt-hub/dlt/blob/6be4aaac807414ae6100691174c5babcd6a87736/docs/education/dlt-advanced-course/Lesson_3_Custom_sources_Filesystem_and_cloud_storage.ipynb) [![GitHub badge](https://img.shields.io/badge/github-view_source-2b3137?logo=github)](https://github.com/dlt-hub/dlt/blob/6be4aaac807414ae6100691174c5babcd6a87736/docs/education/dlt-advanced-course/Lesson_3_Custom_sources_Filesystem_and_cloud_storage.ipynb)

## What you will learn

You will learn how to:

- Use the `filesystem` resource to build real custom sources
- Apply filters to file metadata (name, size, date)
- Implement and register custom transformers
- Enrich records with file metadata
- Use incremental loading for both files and their content


## Setup: Download real data

Install `dlt` with the DuckDB extra:

In [None]:
%%capture
!pip install "dlt[duckdb]"

We’ll use a real `.parquet` file from [TimeStored.com](https://www.timestored.com/data/sample/userdata.parquet)

In [None]:
!mkdir -p local_data && wget -O local_data/userdata.parquet https://www.timestored.com/data/sample/userdata.parquet

## Step 1: Load Parquet file from Local Filesystem

**What the script below does**: Lists and reads all `.parquet` files in `./local_data` and loads them into a table named `userdata`.

In [None]:
import dlt
from dlt.sources.filesystem import filesystem, read_parquet

# Point to the local file directory
fs = filesystem(bucket_url="./local_data", file_glob="**/*.parquet")

# Pipe the listed files into the built-in Parquet reader transformer
parquet_data = fs | read_parquet()

# Create and run pipeline
pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
load_info = pipeline.run(parquet_data.with_name("userdata"))
print(load_info)

# Inspect data
pipeline.dataset().userdata.df().head()
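
The `file_glob` pattern decides which files the `filesystem` resource lists. As a rough illustration of the difference between `**/*.parquet` (recursive) and `*.parquet` (top level only), the sketch below uses Python's `pathlib` as a stand-in. It assumes that `dlt`'s fsspec-based matching behaves the same way for these two patterns; the helper `match_glob` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def match_glob(root: Path, pattern: str) -> list:
    """Return sorted POSIX-style paths (relative to root) of files matching pattern."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one file at the top level, one in a subfolder
    (root / "nested").mkdir()
    (root / "userdata.parquet").touch()
    (root / "nested" / "more.parquet").touch()
    (root / "notes.txt").touch()

    recursive = match_glob(root, "**/*.parquet")  # top level and subfolders
    top_level = match_glob(root, "*.parquet")     # top level only

print(recursive)
print(top_level)
```

With a single file directly in `./local_data`, both patterns select the same file, which is why the steps in this lesson can use either one.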

### **Question 1**:

In the `userdata` table loaded by the `my_pipeline` pipeline, what is the ratio of men to women, expressed as a decimal?

In [None]:
# check out the numbers below and answer 👀
df = pipeline.dataset().userdata.df()
df.groupby("gender").describe()
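
If you prefer to compute the ratio directly instead of reading it off the summary, a minimal sketch could look like the one below. It assumes the `gender` column uses the labels `"Male"` and `"Female"`; the helper `gender_ratio` and the sample values are hypothetical and are not taken from the real file.

```python
from collections import Counter


def gender_ratio(genders, numerator="Male", denominator="Female"):
    """Return count(numerator) / count(denominator), ignoring any other values."""
    counts = Counter(genders)
    if counts[denominator] == 0:
        raise ValueError(f"no rows with gender == {denominator!r}")
    return counts[numerator] / counts[denominator]


# Hypothetical sample values, including one empty entry
sample = ["Male", "Female", "Female", "", "Male", "Female", "Female"]
ratio = gender_ratio(sample)
print(round(ratio, 2))
```

On the real data you could pass `df["gender"].tolist()` instead of `sample`.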

## Step 2: Enrich records with file metadata

Let’s add the file name to every record to track the data origin.

In [None]:
@dlt.transformer()
def read_parquet_with_filename(files):
    import pyarrow.parquet as pq

    for file_item in files:
        # Open the file and read it into a pandas DataFrame
        with file_item.open() as f:
            table = pq.read_table(f).to_pandas()
            # Record which file each row came from
            table["source_file"] = file_item["file_name"]
            # Yield the rows as a list of dicts
            yield table.to_dict(orient="records")

fs = filesystem(bucket_url="./local_data", file_glob="*.parquet")
pipeline = dlt.pipeline("meta_pipeline", destination="duckdb")

load_info = pipeline.run((fs | read_parquet_with_filename()).with_name("userdata"))
print(load_info)
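
The same enrichment idea works on plain dictionaries, without pandas. The sketch below is a hypothetical helper (`enrich_with_file_metadata` is not part of `dlt`) that copies selected file metadata fields onto every record. It assumes the file item exposes `file_name` and `size_in_bytes`, the two keys this lesson uses elsewhere; the record values are made up.

```python
from typing import Any, Dict, Iterable, Iterator, Mapping, Sequence


def enrich_with_file_metadata(
    records: Iterable[Dict[str, Any]],
    file_item: Mapping[str, Any],
    fields: Sequence[str] = ("file_name", "size_in_bytes"),
) -> Iterator[Dict[str, Any]]:
    """Yield a copy of each record with `source_<field>` keys added."""
    metadata = {f"source_{key}": file_item[key] for key in fields}
    for record in records:
        # Build a new dict so the input records are left untouched
        yield {**record, **metadata}


# Hypothetical records and file item, for illustration only
rows = [{"id": 1, "first_name": "Amanda"}, {"id": 2, "first_name": "Albert"}]
file_item = {"file_name": "userdata.parquet", "size_in_bytes": 1024}

enriched = list(enrich_with_file_metadata(rows, file_item))
print(enriched[0])
```

Inside a transformer, you would call such a helper on the records read from each file before yielding them.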

## Step 3: Filter files by metadata



Load only the files that match custom logic:

In [None]:
fs = filesystem(bucket_url="./local_data", file_glob="**/*.parquet")

# Only include files that contain "user" and are < 1MB
fs.add_filter(lambda f: "user" in f["file_name"] and f["size_in_bytes"] < 1_000_000)

pipeline = dlt.pipeline("filtered_pipeline", destination="duckdb")
load_info = pipeline.run((fs | read_parquet()).with_name("userdata_filtered"))
print(load_info)
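
The lambda above filters on name and size. To also filter on date, a named predicate is easier to read and test. The sketch below is one possible shape; `make_file_filter` is a hypothetical helper, the file entries are made-up dictionaries, and it assumes that `modification_date` is a timezone-aware datetime.

```python
from datetime import datetime, timezone


def make_file_filter(name_contains, max_bytes, modified_after):
    """Build a predicate that keeps matching, small, recently modified files."""

    def _keep(file_item):
        return (
            name_contains in file_item["file_name"]
            and file_item["size_in_bytes"] < max_bytes
            and file_item["modification_date"] > modified_after
        )

    return _keep


cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
keep = make_file_filter("user", 1_000_000, cutoff)

# Hypothetical file metadata, for illustration only
files = [
    {"file_name": "userdata.parquet", "size_in_bytes": 100_000,
     "modification_date": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"file_name": "userdata_old.parquet", "size_in_bytes": 100_000,
     "modification_date": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"file_name": "orders.parquet", "size_in_bytes": 100_000,
     "modification_date": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"file_name": "users_big.parquet", "size_in_bytes": 5_000_000,
     "modification_date": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]

kept = [f["file_name"] for f in files if keep(f)]
print(kept)
```

A predicate like this can be passed to `fs.add_filter(keep)` in place of the lambda.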

## Step 4: Load files incrementally

Track each file's `modification_date` so that later runs skip files that were already processed.

In [None]:
fs = filesystem(bucket_url="./local_data", file_glob="**/*.parquet")
fs.apply_hints(incremental=dlt.sources.incremental("modification_date"))

data = (fs | read_parquet()).with_name("userdata")
pipeline = dlt.pipeline("incremental_pipeline", destination="duckdb")
load_info = pipeline.run(data)
print(load_info)
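
To build intuition for what the incremental hint does, here is a simplified model of cursor-based filtering written in plain Python. It is not `dlt`'s implementation: the real one persists the cursor in pipeline state and handles more cases (for example, items that share the boundary value). The function `select_new_files` and the file entries are hypothetical.

```python
from datetime import datetime, timezone


def select_new_files(files, last_value):
    """Return (files newer than last_value, updated cursor value)."""
    new_files = [
        f for f in files
        if last_value is None or f["modification_date"] > last_value
    ]
    candidates = [f["modification_date"] for f in new_files]
    if last_value is not None:
        candidates.append(last_value)
    # The cursor only moves forward; it stays None if nothing was ever seen
    return new_files, max(candidates, default=None)


jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
feb = datetime(2024, 2, 1, tzinfo=timezone.utc)
file_a = {"file_name": "a.parquet", "modification_date": jan}
file_b = {"file_name": "b.parquet", "modification_date": feb}

# Run 1: only a.parquet exists, and there is no stored cursor yet
run1, cursor = select_new_files([file_a], None)
# Run 2: b.parquet has appeared; a.parquet is skipped
run2, cursor = select_new_files([file_a, file_b], cursor)
# Run 3: nothing changed, so nothing is selected
run3, cursor = select_new_files([file_a, file_b], cursor)

print([f["file_name"] for f in run1])
print([f["file_name"] for f in run2])
print([f["file_name"] for f in run3])
```

You can observe the equivalent behavior in the cell above by running it a second time: the unchanged file is not expected to be read again.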

## Step 5: Create a custom transformer

Let’s read structured data from `.json` files.

In [None]:
@dlt.transformer(standalone=True)
def read_json(items):
    from dlt.common import json
    for file_obj in items:
        with file_obj.open() as f:
            yield json.load(f)

# Download a JSON file
!wget -O local_data/sample.json https://jsonplaceholder.typicode.com/users

fs = filesystem(bucket_url="./local_data", file_glob="sample.json")
pipeline = dlt.pipeline("json_pipeline", destination="duckdb")

load_info = pipeline.run((fs | read_json()).with_name("users"))
print(load_info)

📁 The downloaded `sample.json` file is now also in your `local_data` directory.

> A **standalone** resource is defined on a function that is top-level in a module (not an inner function) that accepts config and secrets values. Additionally, if the standalone flag is specified, the decorated function signature and docstring will be preserved. `dlt.resource` will just wrap the decorated function, and the user must call the wrapper to get the actual resource.

Let's inspect the `users` table in your DuckDB dataset:

In [None]:
pipeline.dataset().users.df().head()
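
The same pattern extends to other formats. As a sketch, the generator below reads newline-delimited JSON, one record per line. `dlt` already ships a `read_jsonl` transformer, so this is only meant to illustrate the shape of a custom reader. The decorator is left out so that the example runs without `dlt`, and `FakeFileItem` is a hypothetical stand-in that models only the `.open()` method of a file item.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator


def read_jsonl_lines(items: Iterable[Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of each file item."""
    for file_obj in items:
        with file_obj.open() as f:
            for raw_line in f:
                line = raw_line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)


class FakeFileItem:
    """Hypothetical stand-in for a file item; only `.open()` is modeled."""

    def __init__(self, content: bytes) -> None:
        self._content = content

    def open(self):
        return io.BytesIO(self._content)


items = [FakeFileItem(b'{"id": 1}\n\n{"id": 2}\n')]
records = list(read_jsonl_lines(items))
print(records)
```

Decorating such a function with `@dlt.transformer` and piping `fs` into it would presumably work the same way as `read_json` above.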

## Step 6: Copy files before loading

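Copy files to a local folder as part of the pipeline run. This is useful for backups or post-processing. The pipeline below copies each file into `copied/` and loads the file listing (the file metadata) into the `copied_files` table.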


In [None]:
import os
from dlt.sources.filesystem import filesystem
from dlt.common.storages.fsspec_filesystem import FileItemDict

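def copy_local(item: FileItemDict) -> FileItemDict:
    # Build the destination path inside the "copied" folder
    local_path = os.path.join("copied", item["file_name"])
    # Create the destination folder if it does not exist yet
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    # Copy the file through the item's fsspec client
    item.fsspec.download(item["file_url"], local_path)
    # Return the file item unchanged so it continues through the pipeline
    return item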

fs = filesystem(bucket_url="./local_data", file_glob="**/*.parquet").add_map(copy_local)
pipeline = dlt.pipeline("copy_pipeline", destination="duckdb")
load_info = pipeline.run(fs.with_name("copied_files"))
print(load_info)
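
After the run, you may want to confirm that the copies actually exist. The sketch below walks a folder and reports each file's size; `list_files` is a hypothetical helper, and the temporary folder only stands in for `copied/` so that the example is self-contained.

```python
import os
import tempfile


def list_files(root: str) -> dict:
    """Map each file's path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            sizes[rel_path] = os.path.getsize(full_path)
    return sizes


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical copied file, five bytes long
    os.makedirs(os.path.join(tmp, "nested"))
    with open(os.path.join(tmp, "nested", "a.parquet"), "wb") as fh:
        fh.write(b"12345")
    found = list_files(tmp)

print(found)
```

In the notebook, calling `list_files("copied")` after the pipeline run should list the copied Parquet file.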

## Next steps

- Try building a transformer for `.xml` using `xmltodict`
- Combine multiple directories or buckets in a single pipeline
- Explore [more examples](https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem/advanced)


✅ ▶ Proceed to the [next lesson](https://colab.research.google.com/drive/14br3TZTRFwTSwpDyom7fxlZCeRF4efMk#forceEdit=true&sandboxMode=true)!

![Lesson_3_Custom_sources_Filesystem_and_cloud_storage_img1](https://storage.googleapis.com/dlt-blog-images/dlt-advanced-course/Lesson_3_Custom_sources_Filesystem_and_cloud_storage_img1.webp)