Releases: dlt-hub/dlt
1.3.0
Core Library
- Fix try/except in from_reference shadowing MissingDependencyException by @burnash in #1939
- prefers uv over pip if found (when creating virtual envs) by @rudolfix in #1940
- allows to plug new or updated dlt cli commands by @sh-rp in #1938
- Feat/557 rest api add oauth2clientcredentials to built in auth methods by @willi-mueller in #1871
- uses path normalize for columns in arrow tables by @rudolfix in #1947
- Added extended jsonpath_ng parser (rest_api) by @francescomucio in #1941
- Fix/1897 support https endpoints clickhouse by @sh-rp in #1931
- Fix for multiple ignores is not working (rest_api) by @burnash in #1956
- SQL Database: Support including/excluding NULL cursor values by @steinitzu in #1946
- Add `references` table hint and reflect them in `sql_database` by @steinitzu in #1925
- only truncate or delete from existing tables in refresh modes by @sh-rp in #1926
- adds bigquery partition expiration and motherduck connection string by @rudolfix in #1968
Experimental interfaces
Below we expose new `pipeline._dataset` and `dlt._dataset` interfaces that provide unified access to data loaded into a destination. We also implement a `duckdb`-based SQL client on the `filesystem` destination to access data in data lakes. We'll add documentation once we stabilize the dataset interface. However, you can already benefit from the new `cursor` implementation of `sql_client`, which can return data frames and arrow tables, also in batches:
- dataset factory by @sh-rp in #1945
- expose readable datasets as dataframes and arrow tables by @sh-rp in #1507
The PRs below add `pluggy` and a few first plugin hooks. The idea is to make a lot of functionality in dlt pluggable. Currently you can plug in a new cli command (or upgrade an existing one) and you can also plug in your own runtime environment (how dlt looks for data, secrets etc.)
- adds registries and plugins by @rudolfix in #1894
- unifies run configuration and run context by @rudolfix in #1944
Docs
- Update url in deploy-with-airflow-composer.md by @FriedrichtenHagen in #1942
- Added info about backend kwargs in pyarrow by @dat-a-man in #1903
- Docs: sync styles with dlthub by @burnash in #1936
- Docs: styles: remove underline for cards in dark mode by @burnash in #1967
New Contributors
- @FriedrichtenHagen made their first contribution in #1942
Full Changelog: 1.2.0...1.3.0
1.2.0
Core Library
- Sqlalchemy merge support by @steinitzu in #1842
- Fix config sections for synching destinations and accessing destination clients by @sh-rp in #1887
- incremental `scd2` with `merge_key` by @jorritsandbrink in #1818
- fix: UUIDs are not an unknown data type (logging) by @neuromantik33 in #1914
- fix: PageNumberPaginator not reset when iterating through multiple pa… by @paul-godhouse in #1924
- Feat/1922 rest api source add multiple path parameters by @TheOneTrueAnt in #1923
- enables gcs staging for databricks by @rudolfix in #1933
Docs
- Update weaviate reference by @emmanuel-ferdman in #1896
- Docs: Add sftp option for filesystem source by @VioletM in #1845
- Update installation.md by @erikjamesmason in #1899
- Added troubleshooting section to filesystem docs by @dat-a-man in #1900
- Docs: make naming consistent in the cloud storage & file system source by @burnash in #1835
- Docs: add section on resolving multiple path parameters by @burnash in #1929
New Contributors
- @emmanuel-ferdman made their first contribution in #1896
- @erikjamesmason made their first contribution in #1899
- @neuromantik33 made their first contribution in #1914
- @paul-godhouse made their first contribution in #1924
Full Changelog: 1.1.0...1.2.0
1.1.0
What's Changed
- fix intermittent `delta` panic issue by @jorritsandbrink in #1832
- Sqlalchemy staging dataset support and docs by @steinitzu in #1841
- rest_api: allow specifying custom session (feat/1843) by @willi-mueller in #1844
- Allows any duckdb version, fixes databricks az credentials by @rudolfix in #1854
- Fix/1849 Do Not Parse Ignored Empty Responses by @TheOneTrueAnt in #1851
- feat: filesystem delete old pipeline state files by @donotpush in #1838
- supports adding DltResource in RESTAPIConfig dict by @willi-mueller in #1865
- Fix/1858 make all connection string credentials optional by @rudolfix in #1867
Docs
- sqlalchemy destination docs @steinitzu in #1841
- Docs: move REST API helpers to the REST API category by @burnash in #1852
- Docs: rest_api: document `processing_steps` by @burnash in #1872
- Fix the paginator's doc heading by @burnash in #1869
Verified Sources
- Custom filter clauses supported, pyarrow/arrowmongo requirement optional for Mongo by @Pipboyguy
New Contributors
- @TheOneTrueAnt made their first contribution in #1851
Full Changelog: 1.0.0...1.1.0
1.0.0
This is a major `dlt` release. Please check the list of breaking changes and deprecations: #1778
Core Library
- move rest_api, sql_database and filesystem sources to dlt core by @willi-mueller in #1728
- drops `foreign_key`, adds nested references (`row_key` - `parent_key`) by @rudolfix in #1774
- deprecates `complex` data type, changes to `json` by @rudolfix in #1792
- Feat/1749 abort load package and raise exception on terminal errors in jobs by @willi-mueller in #1781
- Feat/1492 extend timestamp config to handle naive timestamps (without timezone) by @donotpush in #1669
- Fix/1571 Incremental: Optionally load or ignore/exclude/include records with `cursor_path` missing or None value by @willi-mueller in #1576
- creates a single source in extract for all resource instances passed as list by @rudolfix in #1535
- Enable BigQuery schema auto-detection with partitioning and clustering hints by @Pipboyguy in #1806
- Sqlalchemy destination (merge support and docs still in progress) by @steinitzu in #1734
- Feat/1730 extend filesystem sftp by @donotpush in #1769
- Stops dumping secrets to dlt traces. by @willi-mueller in #1797
- Don't use Custom Embedding Functions on LanceDB by @Pipboyguy in #1771
- sets default concurrency for blob upload for adlfs to 1 to avoid massive memory usage on large files by @rudolfix in #1779
- Fix/1790 support incremental load with arrow when cursor column is not nullable by @willi-mueller in #1791
- controls row group size and empty tables in memory buffer when writing parquet by @rudolfix in #1782
- fix installation command by @novica in #1741
- skips tables without jobs when merging delta tables by @rudolfix in #1803
Docs
- display past versions of the documentation (0.5.x / 1.0.0 / devel) by @sh-rp in #1770
- Refactor filesystem doc by @VioletM in #1745
- Update REST API docs by @akelad in #1795
- Add filesystem tutorial by @VioletM in #1775
- adding the sql_database tutorial by @rahuljo in #1796
- structural and content changes to the sql_database doc by @rahuljo in #1623
- Docs: update the introduction, add the rest_api tutorial by @burnash in #1729
- Docs/update deploy dagster by @mariarice15 in #1761
- Correct wrong code example for apply_hints( incremental(xx) ) by @w0ut0 in #1785
- Moves sources and destinations to the top level in docs navigation by @VioletM in #1750
- Fix typo "frequenly" by @ruudwelten in #1800
- Reorder sidebar by @mariarice15 in #1787
New Contributors
- @novica made their first contribution in #1741
- @mariarice15 made their first contribution in #1761
- @w0ut0 made their first contribution in #1785
- @ruudwelten made their first contribution in #1800
Full Changelog: 0.5.4...1.0.0
0.5.4
Core Library
- BigQuery project_id may be different from credentials project_id by @VioletM in #1680
- Enable schema evolution for `merge` write disposition with `delta` table format by @jorritsandbrink in #1742
- Add `storage_options` to `DeltaTable.create` by @jorritsandbrink in #1686
- Fix `delta` table dangling Parquet file bug by @jorritsandbrink in #1695
- Add `delta` table partitioning support by @jorritsandbrink in #1696
- fixes load job counter displayed in progress by @rudolfix in #1702
- RESTClient: stops pagination after empty page (Feat/1637) by @willi-mueller in #1677
- Enable `scd2` record reinsert by @jorritsandbrink in #1707
- `scd2` custom "valid from" / "valid to" value feature by @jorritsandbrink in #1709
- feat/1681 collects load job metrics and adds remote url to traces by @rudolfix in #1708
- locks trace format with a contract @rudolfix in #1708
- Feat/1711 create with not exists for dlt tables to reduce racing conditions by @rudolfix in #1740
- provides detail exception messages when cursor stored value cannot be coerced to data by @rudolfix in #1748
- Allows to configure if staging destination is truncated or left intact to config by @VioletM in #1717
- enables external location and named credential in databricks, allows abfss://container@account Azure urls by @rudolfix in #1755
- fixes #1703 and #1754 by @rudolfix in #1755
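The two `scd2` entries above concern validity windows on historized records. As a rough, library-independent illustration (not dlt's implementation, which works on destination tables), the sketch below applies snapshots to an in-memory history. The function `scd2_load`, the row keys `valid_from` / `valid_to`, and the `active_marker` parameter are all hypothetical names chosen for this example; values are assumed to be non-`None`.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def scd2_load(
    history: List[Dict[str, Any]],
    snapshot: Dict[str, Any],
    load_time: datetime,
    active_marker: Optional[datetime] = None,
) -> List[Dict[str, Any]]:
    """Apply one snapshot to an SCD2 history (hypothetical helper).

    Rows whose valid_to equals active_marker are the currently active versions.
    """
    active = {row["key"]: row for row in history if row["valid_to"] == active_marker}
    for key, row in active.items():
        # Retire active versions that changed or disappeared from the snapshot.
        if key not in snapshot or snapshot[key] != row["value"]:
            row["valid_to"] = load_time
    for key, value in snapshot.items():
        current = active.get(key)
        # Insert new keys, changed values, and retired keys that reappear.
        if current is None or current["valid_to"] != active_marker:
            history.append(
                {"key": key, "value": value, "valid_from": load_time, "valid_to": active_marker}
            )
    return history


t1, t2, t3 = datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)
history: List[Dict[str, Any]] = []
scd2_load(history, {"a": 1}, t1)  # "a" appears
scd2_load(history, {}, t2)        # "a" disappears and is retired
scd2_load(history, {"a": 1}, t3)  # "a" reappears and is inserted again

# A custom "valid to" value for active rows instead of None.
far_future = datetime(9999, 12, 31)
custom = scd2_load([], {"a": 1}, t1, active_marker=far_future)
```

After the three loads, `history` holds a retired version of `a` (valid from `t1` to `t2`) and a new active version starting at `t3`, which is one way to read "record reinsert". Passing `active_marker` shows how a fixed far-future date could stand in for an open-ended "valid to".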
Docs:
- rest_api: documents pluggable custom auth by @willi-mueller in #1690
- Update Snowflake docs by @akelad in #1747
- Docs/issue 1661 add tip to source docs and update weaviate docs by @dat-a-man in #1662
- Add custom parent-child relationships example by @dat-a-man in #1678
- Correct the library name for mem stats to `psutil` by @deepyaman in #1733
- Replaced "full_refresh" with "dev_mode" by @dat-a-man in #1735
New Contributors
- @deepyaman made their first contribution in #1733
Full Changelog: 0.5.3...0.5.4
0.5.3
Core Library
- Add support for continuously starting load jobs as slots free up in the loader. This will significantly speed up loading packages with many files. by @sh-rp in #1494
- Add `get_delta_tables` helper function to optimize and vacuum tables by @jorritsandbrink in #1664
- Raise/warn on incomplete columns in normalize by @steinitzu in #1504
- Add enable_dataset_name_normalization option by @VioletM in #1676
- updates duckdb/motherduck load job to match parquet by column names by @rudolfix in #1674
- updates duckdb/motherduck load job to fully allow jsonl file format by @rudolfix in #1674
- removes internal locks when loading parquet from multiple threads (duckdb got fixed) #1674
- enables multi transactions statements for Motherduck #1674
- fixes dbt logs line endings
Docs
Verified Sources
- Column selector added to `sql_database` @steinitzu
New Contributors
Full Changelog: 0.5.2...0.5.3
0.5.2
Core Library
- Add `upsert` merge strategy for Postgres and Snowflake by @jorritsandbrink in #1466
- Add basic `upsert` support for `delta` table format in `filesystem` destination by @jorritsandbrink in #1600
- query tagging for snowflake by @rudolfix in #1582
- Support Open Source ClickHouse Deployments (MergeTree engine and more) by @Pipboyguy in #1496
- allows nested types in BigQuery via native `autodetect_schema` by @rudolfix in #1591
- Enable `upsert` merge strategy for more SQL destinations (Athena, BigQuery, Databricks, mssql) by @jorritsandbrink in #1628
- Fix/1512 fixes `current.pipeline()` access by @rudolfix in #1581
- feat: add config dataset_name_prefix to set custom staging dataset name by @donotpush in #1563
- fix: add airflow db reset for all tests by @donotpush in #1559
- Enable S3 compatible storage for `delta` table format by @jorritsandbrink in #1586
- feat/1495 rest_client: renames JSONResponsePaginator to JSONLinkPaginator by @willi-mueller in #1558
- Feat/1596 adds custom config providers + example of yaml config provider supporting profiles and jinja placeholders by @rudolfix in #1642
- Feat/1583 rest client session timeout configuration by @willi-mueller in #1590
- Add clarification for add_limit by @VioletM in #1594
- Fix/1606 fixes validator incremental step order to keep it always last in the pipe by @rudolfix in #1641
- Feat/1593 rest_client: allow setting of request kwargs by @willi-mueller in #1609
- prevent accidental wrapping of sources in resources when using adapters by @sh-rp in #1645
- Add empty source handling for `delta` table format on `filesystem` destination by @jorritsandbrink in #1617
- Surface original err msg from pydantic as extended_info on DataValidationError by @codingcyclist in #1569
- fix(dockerfile): remove extra spaces around equals sign in LABEL inst… by @thisisdope in #1573
- Qdrant uncommitted state restore and test by @steinitzu in #1545
- fix: suppress alembic logs for tests by @donotpush in #1578
Docs
- Document sql source reflection level and type adapter by @steinitzu in #1467
- Add to docs configuring file format options by @VioletM in #1543
- Added how dlt uses arrow by jorrit by @dat-a-man in #1577
- docs/514 rest_api: docs on pluggable paginators by @willi-mueller in #1557
- docs: documents new `convert` parameter in rest_api source incremental config by @willi-mueller in #1649
- Docs/1571 docs on handling NULL values at incremental cursor path by @willi-mueller in #1650
- Add note that pg_replication doesn't support scd2 by @akelad in #1608
- docs/505 updates documentation on custom hooks in response_actions by @willi-mueller in #1524
New Contributors
- @donotpush made their first contribution in #1559
- @thisisdope made their first contribution in #1573
- @akelad made their first contribution in #1608
Full Changelog: 0.5.1...0.5.2
0.5.1
This is a major release (0.4 -> 0.5) in our versioning scheme, so please review the breaking changes below. Most of them are relevant only for platform builders that use `dlt` internals. Some of the long-deprecated components were removed as well.
Breaking Changes
- `PageNumberPaginator` takes `base_page` and `page` arguments instead of `initial_page`. This makes it possible to paginate APIs that number pages from 0 as well as those that start from 1. #1509
- the deprecated `credentials` argument was removed from `dlt.pipeline`. #1537 Please use destination factories to instantiate destinations with explicit credentials. (https://dlthub.com/devel/general-usage/destination#pass-explicit-credentials)
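To make the `base_page` / `page` distinction above concrete, here is a minimal, library-independent sketch. It does not reproduce dlt's `PageNumberPaginator`; the helper `page_numbers` and its `total_pages` parameter are hypothetical. It assumes `base_page` is the number the API assigns to its first page and `page` is the page to start requesting from (falling back to `base_page` when omitted).

```python
from typing import Iterator, Optional


def page_numbers(
    total_pages: int, base_page: int = 0, page: Optional[int] = None
) -> Iterator[int]:
    """Yield the page numbers to request (hypothetical helper, not dlt's API).

    base_page: the number the API gives to its first page (typically 0 or 1).
    page: the page to start from; defaults to base_page.
    total_pages: how many pages the API reports in total.
    """
    current = base_page if page is None else page
    # The last valid page number depends on where the numbering starts.
    last_page = base_page + total_pages - 1
    while current <= last_page:
        yield current
        current += 1


zero_based = list(page_numbers(total_pages=3, base_page=0))
one_based = list(page_numbers(total_pages=3, base_page=1))
resumed = list(page_numbers(total_pages=3, base_page=1, page=2))
```

With a single `initial_page` argument, "where numbering starts" and "where to begin requesting" cannot be told apart; keeping them separate lets the last page be computed correctly for both 0-based and 1-based APIs.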
Breaking Changes (internals)
- if a `dlt.source` or `dlt.resource` decorated function is passed a `None` in a default argument during a function call, it will be handled exactly like in a regular Python function call. Previously such a `None` would request argument injection from configuration. Please read more here: (#1430)
- `dlt.config.value` and `dlt.secrets.value` were evaluating to `None` at runtime. Now they will evaluate to a sentinel value. All the existing code should be backward compatible. (#1430)
- the `full_refresh` flag of `dlt.pipeline` will be deprecated and replaced with `dev_mode`. (#1063) and (https://dlthub.com/devel/general-usage/pipeline#do-experiments-with-dev-mode)
- the default resource extraction sequence has changed from `fifo` to `round_robin`. You can switch back to the previous behavior and learn more about what this means here: (https://dlthub.com/docs/reference/performance#resources-extraction-fifo-vs-round-robin)
- if you create an instance of a SPEC (e.g. `SnowflakeCredentials`) it will not be marked as resolved even if all required fields are provided. Previously some were resolving and some were not. #1489
- `parse_native_representation` never marks config as resolved. Previously some were resolving and some were not. #1489
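The change of the default extraction sequence from `fifo` to `round_robin` can be pictured with plain Python iterables. This is only a conceptual sketch under the assumption that `fifo` drains one resource before starting the next and `round_robin` takes one item from each resource in turn; the functions below are hypothetical and dlt's actual scheduling may differ in detail.

```python
from itertools import chain
from typing import Any, Iterable, Iterator, List


def fifo(resources: List[Iterable[Any]]) -> Iterator[Any]:
    """Exhaust each resource fully before moving on to the next one."""
    return chain.from_iterable(resources)


def round_robin(resources: List[Iterable[Any]]) -> Iterator[Any]:
    """Take one item from each resource in turn until all are exhausted."""
    iterators = [iter(resource) for resource in resources]
    while iterators:
        still_active = []
        for iterator in iterators:
            try:
                item = next(iterator)
            except StopIteration:
                continue  # this resource is done; drop it from the rotation
            still_active.append(iterator)
            yield item
        iterators = still_active


users = ["u1", "u2", "u3"]
orders = ["o1", "o2"]
fifo_order = list(fifo([users, orders]))
round_robin_order = list(round_robin([users, orders]))
```

Here `fifo_order` lists all users before any order, while `round_robin_order` interleaves the two until the shorter resource runs out.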
Core Library
- support `delta` tables with `delta-rs` on top of `filesystem` destination. (#1382)
- `LanceDB` destination and examples (#1375)
- external files may be imported and loaded without extraction and normalization (https://dlthub.com/devel/general-usage/resource#import-external-files) - includes jsonl, csv, and parquet
- pick the loader file format for particular resource (https://dlthub.com/devel/general-usage/resource#pick-loader-file-format-for-a-particular-resource)
- extended support for various csv formats (https://dlthub.com/devel/dlt-ecosystem/file-formats/csv#change-settings)
- csv support for snowflake (#1470 https://dlthub.com/devel/dlt-ecosystem/destinations/snowflake#custom-csv-formats)
- support case sensitive and insensitive modes for our destinations ie. snowflake, redshift, bigquery, mssql etc. may work in both modes (#998 https://dlthub.com/devel/general-usage/naming-convention)
- you'll be able to fully change naming convention ie. to have LATIN-1 character set or create collision-free names (https://dlthub.com/devel/general-usage/naming-convention#write-your-own-naming-convention)
- two new naming conventions: `sql_cs_v1` (case sensitive) and `sql_ci_v1` (case insensitive) to create SQL safe identifiers without snake case transformation (https://dlthub.com/devel/general-usage/naming-convention#available-naming-conventions)
- you'll be able to modify destination capabilities via destination factories (https://dlthub.com/devel/general-usage/destination#inspect-destination-capabilities)
- schemas will be reflected with a single SQL statement which will make schema migrations faster
- loader can handle many more jobs (files) than before. we tested with 30k jobs and it looks fine
- we are adding `refresh` modes to `pipeline.run` that allow dropping and recreating tables with different granularity. (https://dlthub.com/devel/general-usage/pipeline#refresh-pipeline-data-and-state)
- when generating fingerprint for `filesystem` destination only the bucket component is taken into account #1516
- 1272 Support ClickHouse GCS S3 compatibility mode in filesystem destination by @Pipboyguy in #1423
- Ensure arrow field's nullable flag matches the schema column by @steinitzu in #1429
- Fix streamlit bug on chess example by @sh-rp in #1425
- Fix databricks pandas error by @steinitzu in #1443
- Extend orjson dependency allowed range with excluded versions by @steinitzu in #1501
- Fix/1465 fixes snowflake auth credentials by @rudolfix in #1489
- skips non resolvable fields from appearing in sample secrets.toml by @rudolfix in #1432
- RESTClient: pass environment settings to `requests.Session.send` by @burnash in #1452
- fix: service principal auth support for synapse copy job by @jorritsandbrink in #1472
- docs: Fixed markdown issue in duckdb.md by @PabloCastellano in #1528
- Loader parallelism strategies (destination can request the loading strategy ie. sequential or parallel) by @sh-rp in #1457
- Migrate to sentry sdk 2.0 by @sh-rp in #1477
- fix: allow loggeradapter in addition to logger in logcollector by @matsmhans1 in #1483
- Add load_id to arrow tables in extract step instead of normalize by @steinitzu in #1449
- #1356 implements OAuth2 Client Credentials flow by @willi-mueller in #1357
- Add LanceDB custom destination example code by @Pipboyguy in #1323
- fix(incremental): don't filter Arrow tables with empty filters by @IlyaFaer in #1480
- fix: `Pipeline.sql_client` credentials forwarding by @jorritsandbrink in #1499
- RESTClient: fix duplicate params in URL in JSONResponsePaginator by @burnash in #1515
- Update default log output to not have padding on log level by @sh-rp in #1517
- fix: remove obsolete `dremio` destination capabilities by @jorritsandbrink in #1527
- feat(filesystem): use only netloc and scheme for fingerprint by @IlyaFaer in #1516
- removes deprecated credentials argument from Pipeline by @rudolfix in #1537
- improves collision detection when naming convention changes by @rudolfix in #1536
- Fix/1542 rest client: makes request parameters optional by @willi-mueller in #1544
- RESTClient: add integrations tests for paginators by @burnash in #1509
- selects all tables from info schema if number of tables > threshold by @rudolfix in #1547
- configurable staging dataset name by @rudolfix in #1555
Docs
- naming conventions documentation (https://dlthub.com/docs/general-usage/naming-convention)
- methods to manipulate schema settings (https://dlthub.com/docs/general-usage/schema#schema-settings)
- rest_api: add troubleshooting section by @burnash in #1371
- RESTClient: add docs for `init_request` by @burnash in #1442
- Example: fast postgres to postgres by @AstrakhantsevaAA in #1428
- Docs: Updated filesystem docs with explanations for bucket URLs by @dat-a-man in #1435
- docs for loading with contracts to existing tables by @sh-rp in #1441
- Add troubleshooting to incremental docs by @burnash in #1458
- Docs: cover custom authentication, rework paginators section by @burnash in #1493
- rest_api: add an example to the incremental load section by @burnash in #1502
- rest_api: add a quick example to rest_api docs by @burnash in #1531
- Update grouping-resources.md docs by @axellpadilla in #1538
- adds examples and step by step explanation for refresh modes by @rudolfix in #1560
Verified Sources
We worked intensively on `rest_api` and `sql_database`:
- Add fallback value for tz in row_tuples_to_arrow (sql_database helpers) @khoadaniel dlt-hub/verified-sources#493
- allows SqlAlchemy engine to be passed to sql_table by @rudolfix dlt-hub/verified-sources#498
- Feat/505 rest api hooks in response actions @willi-mueller dlt-hub/verified-sources#512
- Feat/507 transformation function for incremental cursor @willi-mueller dlt-hub/verified-sources#515
- Allows incremental loading to be configured per resource in `sql_database` @rudolfix dlt-hub/verified-sources#478
- Allows to set the reflection level for tables: minimal (names/nullability), full (data types) and full_with_precision (with ie. varchar length). @steinitzu https://github.com/dl...
0.4.12
Core Library
- feat(pipeline): add an ability to auto truncate staging dataset by @IlyaFaer in #1292
- Feat/1406 bumps duckdb 0.10 + dbt to <=1.8.x by @rudolfix in #1407
- Azure service principal credentials support by @steinitzu in #1377
- Support partitioning hints for athena iceberg by @steinitzu in #1403
- Add recommended_file_size cap to limit data writer file size and cap BigQuery to 4gb by @steinitzu in #1368
- limits mssql query size to fit network buffer to prevent errors on large inserts by @rudolfix in #1372
- allows to bubble up exceptions when standalone resource returns by @rudolfix in #1374
- Fix: use .get on column in mssql destination for cases where the yaml… by @Daniel-Vetter-Coverwhale in #1380
- Make path tests Windows compatible by @jorritsandbrink in #1384
- RESTClient: Added "values" to the data pattern of the rest_api helper by @francescomucio in #1399
- corrects single entity path detection by @rudolfix in #1394
- RESTClient: implement AuthConfigBase.bool + update docs by @burnash in #1413
- Fix: ensure custom session can be provided to rest client by @z3z1ma in #1396
Docs
- RESTClient: add an example for creating a custom POST paginator by @burnash in #1358
- Add rest_api verified source documentation by @burnash in #1308
- Fix typo in Slack Docs by @cybermaxs in #1369
- RESTClient: docs: add the troubleshooting section by @burnash in #1367
- Replace weather api example with github in create a pipeline walkthrough by @sultaniman in #1351
- RESTClient: docs: Fixed snippet definition by @burnash in #1373
- docs: destination tables: elaborate on example code by @burnash in #1386
- add naming rules to contributing by @sh-rp in #1291
- Added info about how to reorder the columns to adjust a schema by @dat-a-man in #1364
- rest_api: add response_actions documentation by @burnash in #1362
- Update the tutorial to use `rest_client.paginate` for pagination by @burnash in #1287
- fix command to install dlt by @Benjamin0313 in #1404
- improves sql database docs by @rudolfix in #1383
- add typing classifier and update maintainers in pyproject by @sh-rp in #1391
- Updated installation command in destination docs and a few others by @dat-a-man in #1410
- Update filesystem docs with auto mkdir config by @VioletM in #1416
- add page to docs for openapi generator by @sh-rp in #1417
New Contributors
- @cybermaxs made their first contribution in #1369
- @Daniel-Vetter-Coverwhale made their first contribution in #1380
- @francescomucio made their first contribution in #1399
- @Benjamin0313 made their first contribution in #1404
Full Changelog: 0.4.11...0.4.12
0.4.11
Core Library
- RESTClient: building blocks (auths, paginators, response extractors etc.) to write REST API pipelines by @burnash
- Enable `merge` write disposition for `athena` Iceberg by @jorritsandbrink in #1315
- adds std pipe iterator for stdout and stderr by @rudolfix in #1321
- adds _impl_cls to dlt.resource and dynamic config section to standalone resources with dynamic names by @rudolfix in #1324
- Accept :memory: mode for credentials parameter in duckdb factory by @sultaniman in #1297
- allows windows native, UNC and extended paths in filesystem source and destination by @rudolfix in #1335
- improves union validation: user friendly exceptions by @rudolfix in #1327
- improves instantiation and shutdown of thread pools for telemetry trackers by @rudolfix in #1340
- feat(airflow): pass data sources as callables and additional initializers for delayed source evaluation by @IlyaFaer in #1318
- Fix: ignores table options on ALTER TABLE in BigQuery by @rudolfix in #1306
- Fix: use correct check for column prop in column schema by @z3z1ma in #1347
- Streamlit caching and session state store fixes by @sultaniman in #1326
- implements method to merge columns in two table schemas by @rudolfix in #1348
- Extend motherduck client configuration to pass custom user agent by @sultaniman in #1284
- allows fsspec until 2023.1.0 by @rudolfix in #1305
Docs
- REST Client documentation by @burnash https://dlthub.com/docs/general-usage/http/rest-client
- REST API verified source documentation by @burnash @willi-mueller @francescomucio https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api
- Docs/google ads by @dat-a-man in #1313
- Docs: Freshdesk documentation by @dat-a-man in #1228
- Add instruction on installing dlt via pixi and conda by @sultaniman in #1332
Verified Sources
- rest_api verified source: quickly declare REST API endpoints and convert it into regular dlt source by @burnash @willi-mueller @francescomucio
- rest_api launch blog by @adrianbr in #1355
Full Changelog: 0.4.10...0.4.11