Restructure intro, getting started and tutorial #702

Merged · 34 commits · Jan 16, 2024

Commits
- 0ee4647 Restructure intro, getting started and tutorial (burnash, Oct 19, 2023)
- bec2624 Fix wording (burnash, Oct 19, 2023)
- 9491f08 Add a dlt source example to the intro (burnash, Oct 19, 2023)
- a406da1 Updated wording in the tutorial (burnash, Oct 23, 2023)
- eb45bd3 Fixed link (burnash, Oct 23, 2023)
- a70bb1c Fix intro snippets (burnash, Oct 24, 2023)
- ff8f10e Update intro snippets (burnash, Oct 24, 2023)
- 5cec470 Clean tutorial snippets (burnash, Oct 24, 2023)
- 747bd3a Fix typos and style (burnash, Oct 24, 2023)
- 3d42f35 Clean the snippets (burnash, Oct 24, 2023)
- cd112c0 Fix naming (burnash, Oct 24, 2023)
- 2ef3490 Rename how-tos (burnash, Oct 24, 2023)
- 00e57fa Update formatting and snippets (burnash, Oct 25, 2023)
- b02e679 Change wording (burnash, Oct 26, 2023)
- 67ff67e Fix one more title (burnash, Oct 26, 2023)
- 5f5cc40 Make title more specific (burnash, Oct 26, 2023)
- bfa8265 Reword the titles; capitalization (burnash, Oct 26, 2023)
- 2613020 Fix links in the tutorial (burnash, Oct 26, 2023)
- 5e2f473 Add missing example + fix typos (burnash, Oct 30, 2023)
- ecfb139 Extend _examples-header, fix path in transformers (burnash, Nov 19, 2023)
- 45f5253 Fix rebase errors (burnash, Jan 16, 2024)
- cceec59 add destination to pdf to fix the example header (burnash, Jan 16, 2024)
- 17b5b66 Update the weaviate adapter import path (burnash, Jan 16, 2024)
- a73ac3d Add the generated example file (burnash, Jan 16, 2024)
- e98dd18 Remove duplicated content (burnash, Jan 16, 2024)
- 7156f57 Fix pdf_to_weaviate snippet module import (burnash, Jan 16, 2024)
- 98d1200 Fix a typo (burnash, Jan 16, 2024)
- 56605bb Arrange files, remove duplicated content (burnash, Jan 16, 2024)
- 9422eb0 Add dispatch to multiple tables to the sidebar (burnash, Jan 16, 2024)
- 05ed976 Move pdf to weaviate assets (burnash, Jan 16, 2024)
- 5d4cdf4 Add steps to dispatch-to-multiple-tables guide (burnash, Jan 16, 2024)
- 23f9a90 Add PyPDF2 install instructions (burnash, Jan 16, 2024)
- 29ee711 Update the tutorial's TOC (burnash, Jan 16, 2024)
- 9cd4131 update wording (burnash, Jan 16, 2024)
61 changes: 61 additions & 0 deletions docs/examples/pdf_to_weaviate/pdf_to_weaviate.py
@@ -0,0 +1,61 @@
import os

import dlt
from dlt.destinations.impl.weaviate import weaviate_adapter
from PyPDF2 import PdfReader


@dlt.resource(selected=False)
def list_files(folder_path: str):
folder_path = os.path.abspath(folder_path)
for filename in os.listdir(folder_path):
file_path = os.path.join(folder_path, filename)
yield {
"file_name": filename,
"file_path": file_path,
"mtime": os.path.getmtime(file_path)
}


@dlt.transformer(primary_key="page_id", write_disposition="merge")
def pdf_to_text(file_item, separate_pages: bool = False):
if not separate_pages:
raise NotImplementedError()
# extract data from PDF page by page
reader = PdfReader(file_item["file_path"])
for page_no in range(len(reader.pages)):
# add page content to file item
page_item = dict(file_item)
page_item["text"] = reader.pages[page_no].extract_text()
page_item["page_id"] = file_item["file_name"] + "_" + str(page_no)
yield page_item

pipeline = dlt.pipeline(
pipeline_name='pdf_to_text',
destination='weaviate'
)

# this constructs a simple pipeline that: (1) reads files from the "assets/invoices"
# folder, (2) filters only those ending with ".pdf", and (3) sends them to the
# pdf_to_text transformer with the pipe (|) operator
pdf_pipeline = list_files("assets/invoices").add_filter(
lambda item: item["file_name"].endswith(".pdf")
) | pdf_to_text(separate_pages=True)

# set the name of the destination table to receive pages
# NOTE: in Weaviate, dlt's tables are mapped to classes
pdf_pipeline.table_name = "InvoiceText"

# use weaviate_adapter to tell destination to vectorize "text" column
load_info = pipeline.run(
weaviate_adapter(pdf_pipeline, vectorize="text")
)
row_counts = pipeline.last_trace.last_normalize_info
print(row_counts)
print("------")
print(load_info)

import weaviate

client = weaviate.Client("http://localhost:8080")
# get text of all the invoices in InvoiceText class we just created above
print(client.query.get("InvoiceText", ["text", "file_name", "mtime", "page_id"]).do())
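Not part of this diff, but worth noting for reviewers: the v3 `weaviate` client used above also supports filtered reads, so the loaded class can be spot-checked one document at a time. A minimal sketch — the file name is a placeholder:

```py
import weaviate

client = weaviate.Client("http://localhost:8080")

# a sketch of narrowing the query above to a single source PDF;
# "invoice_1.pdf" is an illustrative file name
result = (
    client.query.get("InvoiceText", ["text", "page_id"])
    .with_where({
        "path": ["file_name"],
        "operator": "Equal",
        "valueString": "invoice_1.pdf",
    })
    .do()
)
print(result)
```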
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/athena.md
@@ -92,7 +92,7 @@ athena_work_group="my_workgroup"

## Data loading

-Data loading happens by storing parquet files in an s3 bucket and defining a schema on athena. If you query data via sql queries on athena, the returned data is read by
+Data loading happens by storing parquet files in an s3 bucket and defining a schema on athena. If you query data via SQL queries on athena, the returned data is read by
scanning your bucket and reading all relevant parquet files in there.

`dlt` internal tables are saved as Iceberg tables.
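A minimal sketch of the setup this paragraph implies (not from this PR): since Athena loads via parquet files staged in a bucket, the pipeline is typically configured with a filesystem staging destination. Names below are placeholders:

```py
import dlt

# assumes the s3 bucket and Athena credentials are set in .dlt/secrets.toml
pipeline = dlt.pipeline(
    pipeline_name="athena_example",  # illustrative name
    destination="athena",
    staging="filesystem",  # parquet files are staged in the s3 bucket
    dataset_name="athena_data",
)
```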
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/mssql.md
@@ -42,7 +42,7 @@ or run:
```
pip install dlt[mssql]
```
-This will install dlt with **mssql** extra which contains all the dependencies required by the sql server client.
+This will install dlt with **mssql** extra which contains all the dependencies required by the SQL server client.

**3. Enter your credentials into `.dlt/secrets.toml`.**

6 changes: 3 additions & 3 deletions docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md
@@ -1,16 +1,16 @@
---
-title: Transforming data with dbt
+title: Transform the data with dbt
description: Transforming the data loaded by a dlt pipeline with dbt
keywords: [transform, dbt, runner]
---

-# Transforming data using dbt
+# Transform the data with dbt

[dbt](https://github.com/dbt-labs/dbt-core) is a framework that allows simple structuring of your transformations into DAGs. The benefits of
using dbt include:

- End-to-end cross-db compatibility for dlt→dbt pipelines.
-- Easy to use by sql analysts, low learning curve.
+- Easy to use by SQL analysts, low learning curve.
- Highly flexible and configurable in usage, supports templating, can run backfills etc.
- Supports testing and accelerates troubleshooting.

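For context on the renamed page: dlt ships a dbt runner that executes a dbt package against the pipeline's destination. A minimal sketch, assuming a local dbt package at `./dbt_project` (path and names are illustrative):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="events",
    destination="duckdb",
    dataset_name="events_data",
)

# run all models in the dbt package and report their status
dbt = dlt.dbt.package(pipeline, "./dbt_project")
models = dbt.run_all()
for m in models:
    print(f"{m.model_name}: {m.status} in {m.time}")
```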
6 changes: 3 additions & 3 deletions docs/website/docs/dlt-ecosystem/transformations/pandas.md
@@ -1,10 +1,10 @@
---
-title: Transforming data with Pandas
-description: Transforming the data loaded by a dlt pipeline with Pandas
+title: Transform the data with Pandas
+description: Transform the data loaded by a dlt pipeline with Pandas
keywords: [transform, pandas]
---

-# Transforming the data using Pandas
+# Transform the data with Pandas

You can fetch results of any SQL query as a dataframe. If the destination is supporting that
natively (i.e. BigQuery and DuckDB), `dlt` uses the native method. Thanks to that, reading
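The paragraph above is the crux of the renamed page; a minimal sketch of fetching a query result as a dataframe through the pipeline's sql_client (table and column names are illustrative):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="duckdb",
    dataset_name="chess_data",
)

# execute a query and read the result as a pandas dataframe
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM players_games") as cursor:
        df = cursor.df()

print(df.head())
```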
6 changes: 3 additions & 3 deletions docs/website/docs/dlt-ecosystem/transformations/sql.md
@@ -1,10 +1,10 @@
---
-title: Transforming data with SQL
-description: Transforming the data loaded by a dlt pipeline with SQL client
+title: Transform the data with SQL
+description: Transforming the data loaded by a dlt pipeline with the dlt SQL client
keywords: [transform, sql]
---

-# Transforming data using the `dlt` SQL client
+# Transform the data using the `dlt` SQL client

A simple alternative to dbt is to query the data using the `dlt` SQL client and then performing the
transformations using Python. The `execute_sql` method allows you to execute any SQL statement,
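To round out the `execute_sql` mention above, a minimal sketch of running a transformation directly on the destination (table and column names are illustrative):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="duckdb",
    dataset_name="chess_data",
)

# run a SQL transformation on the destination with execute_sql
with pipeline.sql_client() as client:
    client.execute_sql(
        "INSERT INTO player_totals (player_id, total_games) "
        "SELECT player_id, COUNT(*) FROM players_games GROUP BY player_id"
    )
```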
5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/airtable.md
@@ -82,8 +82,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -137,7 +136,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `airtable`, you
may also use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline](../../walkthroughs/run-a-pipeline).
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/asana.md
@@ -71,8 +71,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -110,7 +109,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `asana`, you may also use any
custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/chess.md
@@ -51,8 +51,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source.md)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source.md).

### Add credentials

@@ -87,7 +86,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `chess_pipeline`, you may also
use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

docs/website/docs/dlt-ecosystem/verified-sources/facebook_ads.md
@@ -116,8 +116,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credential

@@ -174,7 +173,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `facebook_ads`, you may also
use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/github.md
@@ -82,8 +82,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -126,7 +125,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `github_reactions`, you may
also use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md
@@ -143,8 +143,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -230,7 +229,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is
`dlt_google_analytics_pipeline`, you may also use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md
@@ -231,8 +231,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -319,7 +318,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `google_sheets_pipeline`, you
may also use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline](../../walkthroughs/run-a-pipeline).
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Data types

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/hubspot.md
@@ -89,8 +89,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -131,7 +130,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `hubspot_pipeline`, you may
also use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/jira.md
@@ -66,8 +66,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -118,7 +117,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `jira_pipeline`, you may also
use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/matomo.md
@@ -59,8 +59,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credential

@@ -118,7 +117,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `matomo`, you may also
use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/mongodb.md
@@ -130,8 +130,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -190,7 +189,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `local_mongo`, you may also
use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/mux.md
@@ -61,8 +61,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).


### Add credentials
@@ -104,7 +103,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is
`mux`, you may also use any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/notion.md
@@ -65,8 +65,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -109,7 +108,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `notion`, you may also use any
custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources

5 changes: 2 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md
@@ -68,8 +68,7 @@ To get started with your data pipeline, follow these steps:
1. After running this command, a new directory will be created with the necessary files and
configuration settings to get started.

-For more information, read the
-[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)
+For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).

### Add credentials

@@ -109,7 +108,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage
For example, the `pipeline_name` for the above pipeline example is `pipedrive`, you may also use
any custom name instead.

-For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
+For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).

## Sources and resources
