# **Recap of [Lesson 3](https://github.com/dlt-hub/dlt/blob/master/docs/education/dlt-fundamentals-course/lesson_3_pagination_and_authentication_and_dlt_configuration.ipynb) 👩‍💻🚀**

1. Used pagination with REST APIs.  
2. Applied authentication for REST APIs.  
3. Tried the dlt `RESTClient`.  
4. Used environment variables to manage secrets and configuration.  
5. Learned how to add values to `secrets.toml` and `config.toml`.  
6. Used the special `secrets.toml` environment variable setup for Colab.

---
# **`dlt`’s pre-built Sources and Destinations** [![Open in molab](https://marimo.io/molab-shield.svg)](https://molab.marimo.io/github/dlt-hub/dlt/blob/master/docs/education/dlt-fundamentals-course/lesson_4_using_pre_build_sources_and_destinations.py) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dlt-hub/dlt/blob/master/docs/education/dlt-fundamentals-course/lesson_4_using_pre_build_sources_and_destinations.ipynb) [![GitHub badge](https://img.shields.io/badge/github-view_source-2b3137?logo=github)](https://github.com/dlt-hub/dlt/blob/master/docs/education/dlt-fundamentals-course/lesson_4_using_pre_build_sources_and_destinations.ipynb)


**Here, you will learn:**
- How to initialize verified sources.
- The built-in `rest_api` source.
- The built-in `sql_database` source.
- The built-in `filesystem` source.
- How to switch between destinations.

---

Our verified sources are the simplest way to start building your stack. Choose from 30+ fully customizable pre-built sources, such as SQL databases, Google Sheets, Salesforce, and more.

With our numerous destinations, you can load data into a local database, data warehouse, or data lake. Choose from Snowflake, Databricks, and many others.

![Lesson_4_Using_pre_build_sources_and_destinations_img1](https://storage.googleapis.com/dlt-blog-images/dlt-fundamentals-course/Lesson_4_Using_pre_build_sources_and_destinations_img1.png)

# **Existing verified sources**
To use an [existing verified source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/), just run the `dlt init` command.





There's a base project for each `dlt` verified source + destination combination, which you can adjust according to your needs.

These base projects can be initialized with a simple command:

```
dlt init <verified-source> <destination>
```

In [None]:
%%capture
!pip install "dlt[duckdb]"

List all verified sources:


In [None]:
!dlt init --list-sources

This command shows all available verified sources and their short descriptions. For each source, it checks if your local `dlt` version requires an update and prints the relevant warning.

Consider an example pipeline for the GitHub API:

```
Available dlt single file templates:
---
arrow: The Arrow Pipeline Template will show how to load and transform arrow tables.
dataframe: The DataFrame Pipeline Template will show how to load and transform pandas dataframes.
debug: The Debug Pipeline Template will load a column with each datatype to your destination.
default: The Intro Pipeline Template contains the example from the docs intro page
fruitshop: The Default Pipeline Template provides a simple starting point for your dlt pipeline

---> github_api: The Github API templates provides a starting point to read data from REST APIs with REST Client helper
requests: The Requests Pipeline Template provides a simple starting point for a dlt pipeline with the requests library
```

### Step 1. Initialize the source

This command will initialize the pipeline example with the GitHub API as the source and DuckDB as the destination:

In [None]:
!dlt --non-interactive init github_api duckdb

Now, check your files in the left sidebar. It should contain all the files needed to run your GitHub API -> DuckDB pipeline:

- The `.dlt` folder containing `secrets.toml` and `config.toml`
- The pipeline script `github_api_pipeline.py`
- `requirements.txt`
- `.gitignore`
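
A quick way to confirm the scaffold is complete is to check for these files programmatically. The sketch below assumes the file names listed above; `missing_project_files` is a hypothetical helper for illustration, not part of `dlt`.

```python
from pathlib import Path

# File names taken from the list above; adjust them for other sources.
EXPECTED_FILES = [
    ".dlt/secrets.toml",
    ".dlt/config.toml",
    "github_api_pipeline.py",
    "requirements.txt",
    ".gitignore",
]


def missing_project_files(project_dir: str = ".") -> list[str]:
    """Return the expected project files that do not exist in project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means every expected file is present.
print(missing_project_files())
```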

In [None]:
!ls -a

What you would normally do with the project:
- Add your credentials and define configurations
- Adjust the pipeline script as needed
- Run the pipeline script

> If needed, you can adjust the verified source code.

In [None]:
!cat github_api_pipeline.py

From the code, we can see that this pipeline loads **only the `"issues"` endpoint**.  
You can adjust this code as needed: add new endpoints, include additional logic, apply transformations, and more.

### Step 2. Add credentials

In Colab (or Molab), it is more convenient to use environment variables or `dlt.secrets`.

In the pipeline above, the `access_token` parameter is set to `dlt.secrets.value`, which means you need to configure this variable:


```python
@dlt.resource(write_disposition="replace")
def github_api_resource(access_token: Optional[str] = dlt.secrets.value):
  ...
```

In [None]:
import os
from google.colab import userdata

# Environment variables are inherited by scripts started with `!python`
os.environ["SOURCES__ACCESS_TOKEN"] = userdata.get("SECRET_KEY")
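
The name `SOURCES__ACCESS_TOKEN` follows the convention `dlt` uses for environment variables: each section of the dotted key `sources.access_token` is upper-cased, and the sections are joined with double underscores. The sketch below illustrates that naming rule with a hypothetical helper, `to_env_var_name`, which is not part of the `dlt` API.

```python
def to_env_var_name(key: str) -> str:
    """Convert a dotted config key into the matching environment variable name.

    Hypothetical helper for illustration only: it upper-cases each section
    and joins the sections with a double underscore.
    """
    return "__".join(section.upper() for section in key.split("."))


print(to_env_var_name("sources.access_token"))  # SOURCES__ACCESS_TOKEN
```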

### Step 3. Run the pipeline

Let's run the pipeline!

In [None]:
!python github_api_pipeline.py

From the pipeline output, we can get information such as the pipeline name, dataset name, destination path, and more.

> Pipeline **github_api_pipeline** load step completed in 1.23 seconds  
> 1 load package was loaded to the DuckDB destination and into the dataset **github_api_data**.  
> The DuckDB destination used `duckdb:////content/github_api_pipeline.duckdb` as the storage location.  
> Load package `1733848559.8195539` is **LOADED** and contains no failed jobs.



### Step 4. Explore your data

Let's explore what tables were created in the destination.

In [None]:
import duckdb

conn = duckdb.connect("github_api_pipeline.duckdb")
conn.sql("SET search_path = 'github_api_data'")
conn.sql("DESCRIBE").df()

In [None]:
data_table = conn.sql("SELECT * FROM github_api_resource").df()
data_table

# **Built-in sources: REST API, SQL database & Filesystem**

## **[REST API source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api/basic)**

`rest_api` is a generic source that lets you create a `dlt` source from any REST API using a declarative configuration. Since most REST APIs follow similar patterns, this source provides a convenient way to define your integration declaratively.

Using a [declarative configuration](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api/basic#source-configuration), you can specify:

- the API endpoints to pull data from,
- their relationships,
- how to handle pagination,
- authentication.

`dlt` handles the rest for you: **unnesting the data, inferring the schema**, and **writing it to the destination**.
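
To make "unnesting" concrete, the sketch below flattens one nested record roughly the way `dlt` normalizes data: nested dictionaries become columns joined with a double underscore, and lists become rows of a child table named `<parent>__<field>`. This is a simplified illustration, not `dlt`'s implementation. The helper name `unnest` and the sample record are made up, and real `dlt` tables also carry bookkeeping columns such as `_dlt_id`.

```python
from typing import Any


def unnest(table: str, record: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Split one nested record into flat rows grouped by table name."""
    tables: dict[str, list[dict[str, Any]]] = {table: []}
    row: dict[str, Any] = {}

    def flatten(prefix: str, value: dict[str, Any]) -> None:
        for key, item in value.items():
            column = f"{prefix}{key}"
            if isinstance(item, dict):
                # Nested dict: keep it in the same row with prefixed columns.
                flatten(f"{column}__", item)
            elif isinstance(item, list):
                # List: move its items into a child table named parent__field.
                child = f"{table}__{column}"
                for element in item:
                    if isinstance(element, dict):
                        child_tables = unnest(child, element)
                    else:
                        child_tables = {child: [{"value": element}]}
                    for name, rows in child_tables.items():
                        tables.setdefault(name, []).extend(rows)
            else:
                row[column] = item

    flatten("", record)
    tables[table].append(row)
    return tables


# Made-up record shaped like a GitHub issue.
sample_issue = {
    "number": 1,
    "user": {"login": "alice"},
    "labels": [{"name": "bug"}, {"name": "docs"}],
}
print(unnest("issues", sample_issue))
```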

In the previous lesson, you already used the REST API Client. `dlt`’s **[RESTClient](https://dlthub.com/docs/general-usage/http/rest-client)** is the **low-level abstraction** that powers the REST API source.


### Initialize the `rest_api` template

You can initialize the `rest_api` **template** using the `init` command:


In [None]:
!yes | dlt init rest_api duckdb

In the `rest_api_pipeline.py` script, you will find sources for both the GitHub API and the PokeAPI, defined using the `rest_api` source and `RESTAPIConfig`.

Since the `rest_api` source is a **built-in source**, you don't need to initialize it. You can simply **import** it from `dlt.sources` and start using it.

### Example

Here is a simplified example of how to configure the REST API source to load `issues` and issue `comments` from the GitHub API:


In [None]:
import dlt
from dlt.sources.rest_api import RESTAPIConfig, rest_api_source
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

config: RESTAPIConfig = {
    "client": {
        "base_url": "https://api.github.com",
        "auth": {
            "token": dlt.secrets["sources.access_token"],
        },
        "paginator": "header_link",
    },
    "resources": [
        {
            "name": "issues",
            "endpoint": {
                "path": "repos/dlt-hub/dlt/issues",
                "params": {
                    "state": "open",
                },
            },
        },
        {
            "name": "issue_comments",
            "endpoint": {
                "path": "repos/dlt-hub/dlt/issues/{issue_number}/comments",
                "params": {
                    "issue_number": {
                        "type": ("resolve"),
                        "resource": "issues",
                        "field": "number",
                    },
                },
            },
        },
    ],
}

github_source = rest_api_source(config)


rest_api_pipeline = dlt.pipeline(
    pipeline_name="rest_api_github",
    destination="duckdb",
    dataset_name="rest_api_data",
    dev_mode=True,
)

load_info = rest_api_pipeline.run(github_source)
print(load_info)

In [None]:
rest_api_pipeline.dataset().issues.df()
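
The `resolve` parameter links the two resources: for every record produced by `issues`, the value of its `number` field is substituted into the `{issue_number}` placeholder of the `issue_comments` path. The sketch below mimics that expansion in plain Python. `expand_child_paths` and the sample issue numbers are hypothetical and only illustrate the idea; the real source also performs the requests and handles pagination.

```python
from typing import Any, Iterable, Iterator


def expand_child_paths(
    path_template: str,
    param_name: str,
    parent_records: Iterable[dict[str, Any]],
    field: str,
) -> Iterator[str]:
    """Yield one child endpoint path per parent record."""
    for record in parent_records:
        yield path_template.format(**{param_name: record[field]})


sample_issues = [{"number": 101}, {"number": 102}]  # made-up parent records
child_paths = list(
    expand_child_paths(
        "repos/dlt-hub/dlt/issues/{issue_number}/comments",
        "issue_number",
        sample_issues,
        "number",
    )
)
print(child_paths)
```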

### **Exercise 1: Run `rest_api` source**

Explore the cells above and answer the question below using `sql_client` or `pipeline.dataset()`.

#### **Question**
How many columns does the `issues` table have?

### **Exercise 2: Create a dlt source with `rest_api`**

Add the `contributors` endpoint for the `dlt` repository to the `rest_api` configuration:

- Resource name: **"contributors"**
- Endpoint path: **"repos/dlt-hub/dlt/contributors"**
- No parameters

#### **Question**
How many columns does the `contributors` table have?


---
## **[SQL Databases source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database/)**

SQL databases are relational database management systems (RDBMS) that store data in structured tables and are widely used for efficient, reliable data retrieval.

The `sql_database` verified source loads data to your specified destination using one of the following backends:
* SQLAlchemy,
* PyArrow,
* pandas,
* ConnectorX.

### Initialize the `sql_database` template

Initialize the `dlt` template for `sql_database` using the `init` command:


In [None]:
!yes | dlt init sql_database duckdb

The `sql_database` source is also a **built-in source**, so you don't have to initialize it. Just **import** it from `dlt.sources`.

### Example

The example below shows how you can use dlt to load data from a SQL database (PostgreSQL, MySQL, SQLite, Oracle, IBM DB2, etc.) into a destination.

To make it easy to reproduce, we will load data from the [public MySQL Rfam database](https://docs.rfam.org/en/latest/database.html) into a local DuckDB instance.

In [None]:
%%capture
!pip install pymysql

In [None]:
from dlt.sources.sql_database import sql_database

sql_source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam",
    table_names=[
        "family",
    ],
)

sql_db_pipeline = dlt.pipeline(
    pipeline_name="sql_database_example",
    destination="duckdb",
    dataset_name="sql_data",
    dev_mode=True,
)

load_info = sql_db_pipeline.run(sql_source)
print(load_info)
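
Conceptually, a database source reads a table in chunks and hands the rows to the pipeline. The sketch below shows that pattern with Python's built-in `sqlite3` module so it runs without a server. It is not how `sql_database` is implemented; the toy `family` table, the `read_table_in_chunks` helper, and the chunk size are assumptions made for illustration.

```python
import sqlite3
from typing import Any, Iterator


def read_table_in_chunks(
    connection: sqlite3.Connection, table: str, chunk_size: int = 2
) -> Iterator[list[dict[str, Any]]]:
    """Yield the rows of a table as lists of dicts, chunk_size rows at a time."""
    # The table name is interpolated directly, so only pass trusted names.
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield [dict(zip(columns, row)) for row in rows]


# Build a tiny in-memory table to read from.
sqlite_conn = sqlite3.connect(":memory:")
sqlite_conn.execute("CREATE TABLE family (rfam_acc TEXT, rfam_id TEXT)")
sqlite_conn.executemany(
    "INSERT INTO family VALUES (?, ?)",
    [("RF00001", "5S_rRNA"), ("RF00002", "5_8S_rRNA"), ("RF00003", "U1")],
)
chunks = list(read_table_in_chunks(sqlite_conn, "family"))
print([len(chunk) for chunk in chunks])
```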

### **Exercise 3: Run `sql_database` source**

Explore the cells above and answer the question below using `sql_client` or `pipeline.dataset()`.

#### **Question**
How many columns does the `family` table have?

---
## **[Filesystem source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem/)**

The filesystem source allows seamless loading of files from the following locations:

* AWS S3
* Google Cloud Storage
* Google Drive
* Azure Blob Storage
* remote filesystem (via SFTP)
* local filesystem

The filesystem source natively supports CSV, Parquet, and JSONL files and allows customization for loading any type of structured file.


**How filesystem source works**

The filesystem source doesn't just give you an easy way to load data from both remote and local files. It also comes with a powerful set of tools that let you customize the loading process to fit your specific needs.

The filesystem source loads data in two steps:

1. It accesses the files in your remote or local file storage **without** actually **reading** the content yet. At this point, you can filter files by metadata or name. You can also set up incremental loading to load only new files.
2. The **transformer** **reads** the files' content and yields the records. At this step, you can filter out the actual data, enrich records with metadata from files, or perform incremental loading based on the file content.
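
The two steps can be sketched in plain Python. The first generator only lists files and their metadata, and the second opens each listed file and yields its records. This mirrors the idea rather than `dlt`'s implementation: `list_files`, `read_csv_records`, and the sample files are hypothetical, and CSV is used so the sketch needs only the standard library.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Iterable, Iterator


def list_files(root: str, file_glob: str) -> Iterator[dict[str, Any]]:
    """Step 1: yield file metadata without opening the files."""
    for path in sorted(Path(root).glob(file_glob)):
        if path.is_file():
            yield {
                "file_path": str(path),
                "file_name": path.name,
                "size_in_bytes": path.stat().st_size,
            }


def read_csv_records(files: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Step 2: open each listed file and yield its rows, enriched with the file name."""
    for item in files:
        with open(item["file_path"], newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                yield {**row, "source_file": item["file_name"]}


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "users.csv").write_text("id,name\n1,Ada\n2,Linus\n", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored", encoding="utf-8")
    # Filtering happens on metadata (here: the glob) before any file is read.
    records = list(read_csv_records(list_files(tmp, "**/*.csv")))

print(records)
```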

### Initialize the `filesystem` template

Initialize the dlt template for `filesystem` using the `init` command:


In [None]:
!yes | dlt init filesystem duckdb

The `filesystem` source is also a **built-in source**, so you don't have to initialize it. Just **import** it from `dlt.sources`.

### Example

To illustrate how this **built-in source** works, we first download a sample file to the local (Colab) filesystem.

In [None]:
import os
import requests

folder_name = "local_data"
os.makedirs(folder_name, exist_ok=True)
full_path = os.path.abspath(folder_name)

url = "https://www.timestored.com/data/sample/userdata.parquet"
resp = requests.get(url)
resp.raise_for_status()

with open(f"{full_path}/userdata.parquet", "wb") as f:
    f.write(resp.content)

In [None]:
import dlt
from dlt.sources.filesystem import filesystem, read_parquet

filesystem_resource = filesystem(bucket_url=full_path, file_glob="**/*.parquet")
filesystem_pipe = filesystem_resource | read_parquet()

# Load the data into the `userdata` table
fs_pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
load_info = fs_pipeline.run(filesystem_pipe.with_name("userdata"))
print(load_info)

### **Exercise 4: Run `filesystem` source**

Explore the cells above and answer the question below using `sql_client` or `pipeline.dataset()`.

#### **Question**
How many columns does the `userdata` table have?


You can read how to configure **Cloud Storage** in the official  
[dlt documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem/basic#configuration).


# [**Built-in Destinations**](https://dlthub.com/docs/dlt-ecosystem/destinations/)


![Lesson_4_Using_pre_build_sources_and_destinations_img2](https://storage.googleapis.com/dlt-blog-images/dlt-fundamentals-course/Lesson_4_Using_pre_build_sources_and_destinations_img2.png)

---
## **Exploring `dlt` destinations**


To be honest, this is simply a matter of going through the  
[documentation](https://dlthub.com/docs/dlt-ecosystem/destinations/) 👀, but to sum it up:

- Most likely, the destination where you want to load data is already a `dlt` integration that undergoes several hundred automated tests every day.
- If not, you can define a custom destination and still benefit from most `dlt`-specific features.  
  *FYI: custom destinations will be covered in the next Advanced course, so we expect you to come back for part two…*


## **Choosing a destination**

Switching between destinations in `dlt` is incredibly straightforward. Simply modify the `destination` parameter in your pipeline configuration. For example:

In [None]:
data_pipeline = dlt.pipeline(
    pipeline_name="data_pipeline",
    destination="duckdb",
    dataset_name="data",
)
print(data_pipeline.destination.destination_type)

data_pipeline = dlt.pipeline(
    pipeline_name="data_pipeline",
    destination="bigquery",
    dataset_name="data",
)
print(data_pipeline.destination.destination_type)

This flexibility allows you to easily transition from local development to production-grade environments.
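
One common pattern is to pick the destination from the environment, so the same script runs against DuckDB locally and a warehouse in production. The sketch below assumes a hypothetical `PIPELINE_ENV` variable and a made-up mapping; neither is a `dlt` convention.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping from environment name to dlt destination name.
DESTINATIONS = {"dev": "duckdb", "prod": "bigquery"}


def choose_destination(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the destination name for the current environment (default: dev)."""
    env = os.environ if env is None else env
    return DESTINATIONS[env.get("PIPELINE_ENV", "dev")]


print(choose_destination({"PIPELINE_ENV": "prod"}))  # bigquery
```

The returned name can then be passed to `dlt.pipeline` as `destination=choose_destination()`.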



## **Filesystem destination**

The `filesystem` destination enables you to load data into **files stored locally** or in **cloud storage** solutions, making it an excellent choice for lightweight testing, prototyping, or file-based workflows.

Below is an **example** demonstrating how to use the `filesystem` destination to load data in **Parquet** format:

* Step 1: Set up a local bucket or cloud directory for storing files


In [None]:
import os

os.environ["BUCKET_URL"] = "./content"

* Step 2: Define the data source

In [None]:
import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam",
    table_names=[
        "family",
    ],
)


pipeline = dlt.pipeline(
    pipeline_name="fs_pipeline",
    destination="filesystem",
    dataset_name="fs_data",
)

load_info = pipeline.run(source, loader_file_format="parquet")
print(load_info)

Look at the files:

In [None]:
! ls ./content/fs_data/family
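
The same check can be done from Python, which is handy outside a notebook. The sketch assumes the folder layout used in the `ls` command above (`<bucket_url>/<dataset_name>/<table_name>`); `list_table_files` is a hypothetical helper, not part of `dlt`.

```python
from pathlib import Path


def list_table_files(bucket_url: str, dataset: str, table: str) -> list[str]:
    """Return the names of the files written for one table, sorted by name."""
    folder = Path(bucket_url) / dataset / table
    if not folder.is_dir():
        return []  # nothing loaded yet, or a different layout is in use
    return sorted(path.name for path in folder.iterdir() if path.is_file())


print(list_table_files("./content", "fs_data", "family"))
```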

Look at the loaded data:

In [None]:
# explore loaded data
pipeline.dataset().family.df()

### **Table formats: [Delta tables & Iceberg](https://dlthub.com/docs/dlt-ecosystem/destinations/delta-iceberg)**

dlt supports writing **Delta** and **Iceberg** tables when using the `filesystem` destination.

**How it works:**

dlt uses the `deltalake` and `pyiceberg` libraries to write Delta and Iceberg tables, respectively. One or multiple Parquet files are prepared during the extract and normalize steps. In the load step, these Parquet files are exposed as an Arrow data structure and fed into `deltalake` or `pyiceberg`.

In [None]:
%%capture
!pip install "dlt[pyiceberg]"

In [None]:
load_info = pipeline.run(
    source,
    loader_file_format="parquet",
    table_format="iceberg",
)
print(load_info)

**Note:**

The open-source version of dlt supports basic functionality for **Iceberg**, but the dltHub team is currently working on an **extended** and **more powerful** Iceberg integration.

[Join the waiting list to learn more about dltHub and Iceberg.](https://info.dlthub.com/waiting-list)


# **Spoiler: Custom Sources & Destinations**

`dlt` aims to simplify the process of creating both custom sources  
([REST API Client](https://dlthub.com/docs/general-usage/http/rest-client),  
[`rest_api` source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/rest_api))  
and [custom destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/destination).

We will explore this topic in more detail in the next Advanced course.


✅ ▶ Proceed to the [next lesson](https://github.com/dlt-hub/dlt/blob/master/docs/education/dlt-fundamentals-course/lesson_5_write_disposition_and_incremental_loading.ipynb)!