Docs: Extend layout placeholder params for filesystem destinations (#1220)

* Add section about new placeholders

* Add basic information about additional placeholders

* Add more examples of layout configuration

* Add code snippet examples

* Remove typing info

* Add note

* Add note about auto_mkdir

* Try concurrent snippet linting

* Try concurrent snippet linting

* Adjust wording and format check_embedded_snippets.py

* Uncomment examples and submit task to pool properly

* Submit snippets to workers

* Revert parallelization stuff

* Comment out unused layouts

* Fix mypy issues

* Add a section about the recommended layout

* Adjust text

* Better text

* Adjust section titles

* Adjust code section language identifier

* Fix mypy errors

* More cosmetic changes for the doc

---------

Co-authored-by: Violetta Mishechkina <sansiositres@gmail.com>
2 people authored and zem360 committed Apr 17, 2024
1 parent d2cb6c0 commit 652bbfa
Showing 2 changed files with 193 additions and 47 deletions.
12 changes: 10 additions & 2 deletions docs/tools/check_embedded_snippets.py
@@ -1,14 +1,21 @@
"""
Walks through all markdown files, finds all code snippets, and checks whether they are parseable.
"""
from typing import List, Dict, Optional
import os
import ast
import subprocess
import argparse

import os, ast, json, yaml, tomlkit, subprocess, argparse # noqa: I251
from dataclasses import dataclass
from textwrap import dedent
from typing import List

import tomlkit
import yaml
import dlt.cli.echo as fmt

from dlt.common import json

from utils import collect_markdown_files


@@ -295,6 +302,7 @@ def typecheck_snippets(snippets: List[Snippet], verbose: bool) -> None:
python_snippets = [s for s in filtered_snippets if s.language == "py"]
if args.command in ["lint", "full"]:
lint_snippets(python_snippets, args.verbose)

if ENABLE_MYPY and args.command in ["typecheck", "full"]:
typecheck_snippets(python_snippets, args.verbose)

228 changes: 183 additions & 45 deletions docs/website/docs/dlt-ecosystem/destinations/filesystem.md
@@ -1,7 +1,5 @@
# Filesystem & buckets
Filesystem destination stores data in remote file systems and bucket storages like **S3**, **google storage** or **azure blob storage**.
Underneath, it uses [fsspec](https://github.com/fsspec/filesystem_spec) to abstract file operations.
Its primary role is to be used as a staging for other destinations, but you can also quickly build a data lake with it.
The Filesystem destination stores data in remote file systems and bucket storages like **S3**, **Google Storage**, or **Azure Blob Storage**. Underneath, it uses [fsspec](https://github.com/fsspec/filesystem_spec) to abstract file operations. Its primary role is to be used as a staging for other destinations, but you can also quickly build a data lake with it.

> 💡 Please read the notes on the layout of the data files. Currently, we are getting feedback on it. Please join our Slack (icon at the top of the page) and help us find the optimal layout.
@@ -15,8 +13,7 @@ This installs `s3fs` and `botocore` packages.

:::caution

You may also install the dependencies independently.
Try:
You may also install the dependencies independently. Try:
```sh
pip install dlt
pip install s3fs
@@ -28,16 +25,18 @@ so pip does not fail on backtracking.

### 1. Initialise the dlt project

Let's start by initialising a new dlt project as follows:
Let's start by initializing a new dlt project as follows:
```sh
dlt init chess filesystem
```
> 💡 This command will initialise your pipeline with chess as the source and the AWS S3 filesystem as the destination.
:::note
This command will initialize your pipeline with chess as the source and the AWS S3 filesystem as the destination.
:::

### 2. Set up bucket storage and credentials

#### AWS S3
The command above creates sample `secrets.toml` and requirements file for AWS S3 bucket. You can install those dependencies by running:
The command above creates a sample `secrets.toml` and requirements file for AWS S3 bucket. You can install those dependencies by running:
```sh
pip install -r requirements.txt
```
@@ -52,9 +51,7 @@ aws_access_key_id = "please set me up!" # copy the access key here
aws_secret_access_key = "please set me up!" # copy the secret access key here
```

If you have your credentials stored in `~/.aws/credentials` just remove the **[destination.filesystem.credentials]** section above
and `dlt` will fall back to your **default** profile in local credentials.
If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`):
If you have your credentials stored in `~/.aws/credentials`, just remove the **[destination.filesystem.credentials]** section above, and `dlt` will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`):
```toml
[destination.filesystem.credentials]
profile_name="dlt-ci-user"
@@ -66,7 +63,7 @@ You can also pass an AWS region:
region_name="eu-central-1"
```

You need to create a S3 bucket and a user who can access that bucket. `dlt` is not creating buckets automatically.
You need to create an S3 bucket and a user who can access that bucket. `dlt` does not create buckets automatically.

1. You can create the S3 bucket in the AWS console by clicking on "Create Bucket" in S3 and assigning the appropriate name and permissions to the bucket.
2. Once the bucket is created, you'll have the bucket URL. For example, if the bucket name is `dlt-ci-test-bucket`, then the bucket URL will be:
@@ -76,7 +73,7 @@ You need to create a S3 bucket and a user who can access that bucket. `dlt` is n
```

3. To grant permissions to the user being used to access the S3 bucket, go to IAM > Users and click on “Add Permissions”.
4. Below you can find a sample policy that gives a minimum permission required by `dlt` to a bucket we created above. The policy contains permissions to list files in a bucket, get, put and delete objects. **Remember to place your bucket name in Resource section of the policy!**
4. Below you can find a sample policy that gives a minimum permission required by `dlt` to a bucket we created above. The policy contains permissions to list files in a bucket, get, put, and delete objects. **Remember to place your bucket name in the Resource section of the policy!**

```json
{
@@ -105,7 +102,7 @@

##### Using S3 compatible storage

To use an S3 compatible storage other than AWS S3 like [MinIO](https://min.io/) or [Cloudflare R2](https://www.cloudflare.com/en-ca/developer-platform/r2/) you may supply an `endpoint_url` in the config. This should be set along with aws credentials:
To use an S3 compatible storage other than AWS S3 like [MinIO](https://min.io/) or [Cloudflare R2](https://www.cloudflare.com/en-ca/developer-platform/r2/), you may supply an `endpoint_url` in the config. This should be set along with AWS credentials:

```toml
[destination.filesystem]
@@ -123,12 +120,12 @@ To pass any additional arguments to `fsspec`, you may supply `kwargs` and `clien

```toml
[destination.filesystem]
kwargs = '{"use_ssl": true}'
kwargs = '{"use_ssl": true, "auto_mkdir": true}'
client_kwargs = '{"verify": "public.crt"}'
```
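
As a sketch of an alternative to `config.toml`, the same options can usually be supplied through environment variables, since dlt also resolves configuration from the environment using double-underscore-separated section names (treat the exact variable names below as an assumption):

```py
import os

# hypothetical equivalent of the toml settings above; values stay JSON-encoded strings
os.environ["DESTINATION__FILESYSTEM__KWARGS"] = '{"use_ssl": true, "auto_mkdir": true}'
os.environ["DESTINATION__FILESYSTEM__CLIENT_KWARGS"] = '{"verify": "public.crt"}'
```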

#### Google Storage
Run `pip install dlt[gs]` which will install `gcfs` package.
Run `pip install dlt[gs]`, which will install the `gcsfs` package.

To edit the `dlt` credentials file with your secret info, open `.dlt/secrets.toml`.
You'll see AWS credentials by default.
Expand All @@ -142,8 +139,9 @@ project_id = "project_id" # please set me up!
private_key = "private_key" # please set me up!
client_email = "client_email" # please set me up!
```

> 💡 Note that you can share the same credentials with BigQuery, replace the **[destination.filesystem.credentials]** section with less specific one: **[destination.credentials]** which applies to both destinations
:::note
Note that you can share the same credentials with BigQuery. Replace the `[destination.filesystem.credentials]` section with a less specific one: `[destination.credentials]`, which applies to both destinations.
:::

If you have default Google Cloud credentials in your environment (e.g., on a cloud function), remove the credentials sections above and `dlt` will fall back to the available default.

@@ -171,18 +169,18 @@ you can omit both `azure_storage_account_key` and `azure_storage_sas_token` and
Note that `azure_storage_account_name` is still required as it can't be inferred from the environment.

#### Local file system
If for any reason you want to have those files in local folder, set up the `bucket_url` as follows (you are free to use `config.toml` for that as there are no secrets required)
If for any reason you want to have those files in a local folder, set up the `bucket_url` as follows (you are free to use `config.toml` for that as there are no secrets required)

```toml
[destination.filesystem]
bucket_url = "file:///absolute/path" # three / for absolute path
bucket_url = "file:///absolute/path" # three / for an absolute path
# bucket_url = "file://relative/path" # two / for a relative path
```
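
If you prefer to configure the local destination in code instead of `config.toml`, a minimal sketch might look like the following (the path, pipeline name, and dataset name are placeholders):

```py
import dlt
from dlt.destinations import filesystem

# hypothetical local path; any absolute path on your machine works
local_destination = filesystem(bucket_url="file:///tmp/dlt_chess_data")

pipeline = dlt.pipeline(
    pipeline_name="chess_local",
    destination=local_destination,
    dataset_name="chess_players_games_data",
)
```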

## Write disposition
`filesystem` destination handles the write dispositions as follows:
- `append` - files belonging to such tables are added to dataset folder
- `replace` - all files that belong to such tables are deleted from dataset folder, and then the current set of files is added.
The filesystem destination handles the write dispositions as follows:
- `append` - files belonging to such tables are added to the dataset folder
- `replace` - all files that belong to such tables are deleted from the dataset folder, and then the current set of files is added.
- `merge` - falls back to `append`
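
For example, a minimal sketch of switching dispositions on `pipeline.run` (the pipeline name, table name, and rows are made up; `bucket_url` and credentials are assumed to be configured in `secrets.toml`/`config.toml`):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="disposition_demo",
    destination="filesystem",
    dataset_name="chess_players_games_data",
)

rows = [{"player": "magnus", "rating": 2830}]

# append: new files are added next to the existing ones for the table
pipeline.run(rows, table_name="players", write_disposition="append")

# replace: existing files for the table are deleted first, then the new files are written
pipeline.run(rows, table_name="players", write_disposition="replace")
```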

## File Compression
@@ -192,47 +190,99 @@ The filesystem destination in the dlt library uses `gzip` compression by default
To handle compressed files:

- To disable compression, you can modify the `data_writer.disable_compression` setting in your `config.toml` file. This can be useful if you want to access the files directly without needing to decompress them. For example:
```toml
[normalize.data_writer]
disable_compression=true
```

```toml
[normalize.data_writer]
disable_compression=true
```

- To decompress a `gzip` file, you can use tools like `gunzip`. This will convert the compressed file back to its original format, making it readable.

For more details on managing file compression, please visit our documentation on performance optimization: [Disabling and Enabling File Compression](https://dlthub.com/docs/reference/performance#disabling-and-enabling-file-compression).
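
If you keep compression enabled, the produced `jsonl` files can also be read back programmatically with the standard library; a small sketch (the file path is illustrative, real names follow the configured layout):

```py
import gzip
import json

# illustrative path; dlt writes gzip-compressed content without a .gz extension
path = "chess_players_games_data/players_games/1685299832.0.jsonl"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record)
```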

## Data loading
All the files are stored in a single folder with the name of the dataset that you passed to the `run` or `load` methods of `pipeline`. In our example chess pipeline it is **chess_players_games_data**.
## Files layout
All the files are stored in a single folder with the name of the dataset that you passed to the `run` or `load` methods of the `pipeline`. In our example chess pipeline, it is **chess_players_games_data**.

> 💡 Note that bucket storages are in fact key-blob storage so folder structure is emulated by splitting file names into components by `/`.
:::note
Bucket storages are, in fact, key-blob storages, so the folder structure is emulated by splitting file names into components by a separator (`/`).
:::

### Files layout
You can control the file layout by specifying the desired configuration. There are several ways to do this.

The name of each file contains essential metadata on the content:
### Default layout

- **schema_name** and **table_name** identify the [schema](../../general-usage/schema.md) and table that define the file structure (column names, data types, etc.)
- **load_id** is the [id of the load package](../../general-usage/destination-tables.md#load-packages-and-load-ids) form which the file comes from.
- **file_id** is there are many files with data for a single table, they are copied with different file id.
- **ext** a format of the file i.e. `jsonl` or `parquet`
Current default layout: `{table_name}/{load_id}.{file_id}.{ext}`

Current default layout: **{table_name}/{load_id}.{file_id}.{ext}`**
:::note
The default layout format has changed from `{schema_name}.{table_name}.{load_id}.{file_id}.{ext}` to `{table_name}/{load_id}.{file_id}.{ext}` in dlt 0.3.12. You can revert to the old layout by setting it manually.
:::

### Available layout placeholders

#### Standard placeholders

> 💡 Note that the default layout format has changed from `{schema_name}.{table_name}.{load_id}.{file_id}.{ext}` to `{table_name}/{load_id}.{file_id}.{ext}` in dlt 0.3.12. You can revert to the old layout by setting the old value in your toml file.
* `schema_name` - the name of the [schema](../../general-usage/schema.md)
* `table_name` - the table name
* `load_id` - the id of the [load package](../../general-usage/destination-tables.md#load-packages-and-load-ids) from which the file comes
* `file_id` - the id of the file; if there are many files with data for a single table, they are copied with different file ids
* `ext` - the format of the file, e.g. `jsonl` or `parquet`
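
To make the naming concrete, here is a purely illustrative sketch of how these placeholders compose into a file path (plain string formatting with made-up values, not dlt's internal code):

```py
# the default layout, filled in with example values
layout = "{table_name}/{load_id}.{file_id}.{ext}"

path = layout.format(
    schema_name="chess",        # not used by this particular layout
    table_name="players_games",
    load_id="1685299832",
    file_id="0",
    ext="jsonl",
)
print(path)  # players_games/1685299832.0.jsonl
```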

#### Date and time placeholders
:::tip
Keep in mind all values are lowercased.
:::

* `timestamp` - the current timestamp as a Unix timestamp, rounded to minutes
* `load_package_timestamp` - the timestamp of the [load package](../../general-usage/destination-tables.md#load-packages-and-load-ids) as a Unix timestamp, rounded to minutes
* Years
* `YYYY` - 2024, 2025
* `Y` - 2024, 2025
* Months
* `MMMM` - January, February, March
* `MMM` - Jan, Feb, Mar
* `MM` - 01, 02, 03
* `M` - 1, 2, 3
* Days of the month
* `DD` - 01, 02
* `D` - 1, 2
* Hours 24h format
* `HH` - 00, 01, 02...23
* `H` - 0, 1, 2...23
* Minutes
* `mm` - 00, 01, 02...59
* `m` - 0, 1, 2...59
* Days of the week
* `dddd` - Monday, Tuesday, Wednesday
* `ddd` - Mon, Tue, Wed
* `dd` - Mo, Tu, We
* `d` - 0-6
* `Q` - quarter of the year: 1, 2, 3, 4
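
These tokens follow pendulum-style formatting; a quick illustrative sketch of the values they produce (assuming the tokens behave like `pendulum.DateTime.format`, and keeping in mind that dlt lowercases the substituted values):

```py
import pendulum

# the raw token output is shown; dlt lowercases the values it substitutes into the layout
now = pendulum.datetime(2024, 4, 14, 9, 5)

print(now.format("YYYY"))   # 2024
print(now.format("MMM"))    # Apr
print(now.format("DD"))     # 14
print(now.format("HH:mm"))  # 09:05
print(now.format("dddd"))   # Sunday
print(now.format("Q"))      # 2
```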

You can change the file name format by providing the layout setting for the filesystem destination like so:
```toml
[destination.filesystem]
layout="{table_name}/{load_id}.{file_id}.{ext}" # current preconfigured naming scheme
# layout="{schema_name}.{table_name}.{load_id}.{file_id}.{ext}" # naming scheme in dlt 0.3.11 and earlier

# More examples
# With timestamp
# layout = "{table_name}/{timestamp}/{load_id}.{file_id}.{ext}"

# With timestamp of the load package
# layout = "{table_name}/{load_package_timestamp}/{load_id}.{file_id}.{ext}"

# Parquet-like layout (note: it is not compatible with the internal datetime of the parquet file)
# layout = "{table_name}/year={year}/month={month}/day={day}/{load_id}.{file_id}.{ext}"

# Custom placeholders
# extra_placeholders = { "owner" = "admin", "department" = "finance" }
# layout = "{table_name}/{owner}/{department}/{load_id}.{file_id}.{ext}"
```

A few things to know when specifying your filename layout:
- If you want a different base path that is common to all filenames, you can suffix your `bucket_url` rather than prefix your `layout` setting.
- If you do not provide the `{ext}` placeholder, it will automatically be added to your layout at the end with a dot as separator.
- It is the best practice to have a separator between each placeholder. Separators can be any character allowed as a filename character, but dots, dashes and forward slashes are most common.
- When you are using the `replace` disposition, `dlt`` will have to be able to figure out the correct files to delete before loading the new data. For this
to work, you have to
- If you do not provide the `{ext}` placeholder, it will automatically be added to your layout at the end with a dot as a separator.
- It is best practice to have a separator between each placeholder. Separators can be any character allowed as a filename character, but dots, dashes, and forward slashes are the most common.
- When you are using the `replace` disposition, `dlt` will have to be able to figure out the correct files to delete before loading the new data. For this to work, you have to
- include the `{table_name}` placeholder in your layout
- not have any other placeholders except for the `{schema_name}` placeholder before the `{table_name}` placeholder, and
- have a separator after the `{table_name}` placeholder
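
As an illustration of these constraints (the layouts below are examples made up for this sketch, not an exhaustive list):

```py
# layouts that work with the replace disposition
ok_layouts = [
    "{table_name}/{load_id}.{file_id}.{ext}",
    "{schema_name}/{table_name}/{load_id}.{file_id}.{ext}",
]

# layouts that break replace: a non-schema placeholder before {table_name},
# or no separator right after {table_name}
broken_layouts = [
    "{load_id}/{table_name}/{file_id}.{ext}",
    "{table_name}_{load_id}.{file_id}.{ext}",
]
```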
@@ -241,6 +291,94 @@ Please note:
- `dlt` will not dump the current schema content to the bucket
- `dlt` will mark complete loads by creating an empty file that corresponds to the `_dlt_loads` table. For example, if the file `chess._dlt_loads.1685299832` is present in the dataset folder, you can be sure that all files for the load package `1685299832` are completely loaded

### Advanced layout configuration

The filesystem destination configuration supports advanced layout customization and the inclusion of additional placeholders. This can be done through `config.toml` or programmatically when initializing via a factory method.

:::tip
For handling deeply nested layouts, consider enabling automatic directory creation for the local filesystem destination by setting `kwargs = '{"auto_mkdir": true}'`.
:::

#### Configuration via `config.toml`

To configure the layout and placeholders using `config.toml`, use the following format:

```toml
layout = "{table_name}/{test_placeholder}/{YYYY}-{MM}-{DD}/{ddd}/{mm}/{load_id}.{file_id}.{ext}"
extra_placeholders = { "test_placeholder" = "test_value" }
current_datetime="2024-04-14T00:00:00"
```

:::note
Ensure that the placeholder names used in `layout` exactly match the keys defined in `extra_placeholders`; a typo such as `{test_placeholer}` instead of `{test_placeholder}` will not be resolved.
:::

#### Dynamic configuration in the code

Configuration options, including layout and placeholders, can be overridden dynamically when initializing and passing the filesystem destination directly to the pipeline.

```py
import pendulum

import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
pipeline_name="data_things",
destination=filesystem(
layout="{table_name}/{test_placeholder}/{timestamp}/{load_id}.{file_id}.{ext}",
current_datetime=pendulum.now(),
extra_placeholders={
"test_placeholder": "test_value",
}
)
)
```

Furthermore, it is possible to:

1. Customize the behavior with callbacks for extra placeholder functionality. Each callback must accept the positional arguments shown in the example below (`schema_name`, `table_name`, `load_id`, `file_id`, `ext`) and return a string.
2. Customize `current_datetime`, which can also be a callback function that is expected to return a `pendulum.DateTime` instance.

```py
import pendulum

import dlt
from dlt.destinations import filesystem

def placeholder_callback(schema_name: str, table_name: str, load_id: str, file_id: str, ext: str) -> str:
# Custom logic here
return "custom_value"

def get_current_datetime() -> pendulum.DateTime:
return pendulum.now()

pipeline = dlt.pipeline(
pipeline_name="data_things",
destination=filesystem(
layout="{table_name}/{placeholder_x}/{timestamp}/{load_id}.{file_id}.{ext}",
current_datetime=get_current_datetime,
extra_placeholders={
"placeholder_x": placeholder_callback
}
)
)
```
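
Continuing the example above, running such a pipeline is no different from any other; a small usage sketch with made-up rows (assuming `bucket_url` and credentials are configured elsewhere):

```py
# reuses the `pipeline` object defined in the previous snippet
load_info = pipeline.run(
    [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    table_name="example_table",
)
print(load_info)
```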

### Recommended layout

The currently recommended layout structure is straightforward:

```toml
layout="{table_name}/{load_id}.{file_id}.{ext}"
```

Adopting this layout offers several advantages:
1. **Efficiency:** it's fast and simple to process.
2. **Compatibility:** supports `replace` as the write disposition method.
3. **Flexibility:** compatible with various destinations, including Athena.
4. **Performance:** a deeply nested structure can slow down file navigation, whereas a simpler layout mitigates this issue.

## Supported file formats
You can choose the following file formats:
* [jsonl](../file-formats/jsonl.md) is used by default
@@ -250,6 +388,6 @@ You can choose the following file formats:

## Syncing of `dlt` state
This destination does not support restoring the `dlt` state. You can change that by requesting the [feature](https://github.com/dlt-hub/dlt/issues/new/choose) or contributing to the core library 😄
You can however easily [backup and restore the pipeline working folder](https://gist.github.com/rudolfix/ee6e16d8671f26ac4b9ffc915ad24b6e) - reusing the bucket and credentials used to store files.
You can, however, easily [backup and restore the pipeline working folder](https://gist.github.com/rudolfix/ee6e16d8671f26ac4b9ffc915ad24b6e) - reusing the bucket and credentials used to store files.

<!--@@@DLT_TUBA filesystem-->
<!--@@@DLT_TUBA filesystem-->
