# Building custom sources using SQL Databases [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_2_custom_sources_sql_databases_.ipynb) [![GitHub badge](https://img.shields.io/badge/github-view_source-2b3137?logo=github)](https://github.com/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_2_custom_sources_sql_databases_.ipynb)

This lesson covers building flexible and powerful custom sources using the `sql_database` verified source.


![Lesson_2_Custom_sources_SQL_Databases_img1](https://storage.googleapis.com/dlt-blog-images/dlt-advanced-course/Lesson_2_Custom_sources_SQL_Databases_img1.png)


## What you will learn

- How to build a custom pipeline using SQL sources
- How to use `query_adapter_callback`, `table_adapter_callback`, and `type_adapter_callback`
- How to load only new data with incremental loading


Set up and install dlt:

In [None]:
%%capture
!pip install pymysql duckdb dlt

## Step 1: Load data from SQL Databases

We’ll use the [Rfam MySQL public DB](https://docs.rfam.org/en/latest/database.html) and load it into DuckDB:

In [None]:
from typing import Any
from dlt.sources.sql_database import sql_database
import dlt

source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam",
    table_names=["family"],
)

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example",
    destination="duckdb",
    dataset_name="sql_data",
    dev_mode=True,
)

load_info = pipeline.run(source)
print(load_info)

Explore the `family` table:

In [None]:
pipeline.dataset().family.df().head()

## Step 2: Customize SQL queries with `query_adapter_callback`

You can fully rewrite or modify the SQL SELECT statement per table.


### Filter rows using a WHERE clause

In [None]:
from sqlalchemy import text
from dlt.sources.sql_database.helpers import SelectClause, Table


def query_adapter_callback(query: SelectClause, table: Table) -> SelectClause:
    return text(f"SELECT * FROM {table.fullname} WHERE rfam_id like '%bacteria%'")

To avoid passing the connection string on every `sql_database` call, we store it in an environment variable. Preferably, keep it in `secrets.toml` instead.

In [None]:
import os

os.environ[
    "SOURCES__SQL_DATABASE__CREDENTIALS"
] = "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"
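
dlt reads configuration from environment variables whose names are the upper-cased key path with sections joined by double underscores. As a rough illustration of that naming convention, the sketch below maps the variable above to its `secrets.toml` location. The helper `env_key_to_config_path` is hypothetical and not part of dlt; dlt performs this lookup internally.

```python
from typing import Tuple


def env_key_to_config_path(env_key: str) -> Tuple[str, str]:
    """Split a dlt-style environment variable name into (toml_section, key).

    Hypothetical helper for illustration only.
    """
    parts = env_key.lower().split("__")
    # Everything except the last part is the section path, the last part is the key
    return ".".join(parts[:-1]), parts[-1]


section, key = env_key_to_config_path("SOURCES__SQL_DATABASE__CREDENTIALS")
print(f"[{section}]")
print(f'{key} = "<connection string>"')
```

The printed lines show the shape of the matching `secrets.toml` entry: a `credentials` key inside the `[sources.sql_database]` section.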

In [None]:
filtered_source = sql_database(
    query_adapter_callback=query_adapter_callback, table_names=["family"]
)

Let's save this filtered data:

In [None]:
info = pipeline.run(filtered_source, table_name="bacterias")
print(info)

Explore the data:

In [None]:
pipeline.dataset().bacterias.df().head()

### **Question 1**:

How many rows are present in the `bacterias` table?

>Answer this question and select the correct option in the homework Quiz.


## Step 3: Modify table schema with `table_adapter_callback`

Add columns, change types, or transform schema using this hook.


### Example: Add computed column `max_timestamp`

In [None]:
import sqlalchemy as sa


def add_max_timestamp(table: Table) -> Any:
    max_ts = sa.func.greatest(table.c.created, table.c.updated).label("max_timestamp")
    subq = sa.select(*table.c, max_ts).subquery()
    return subq

Use it with `sql_table`:

In [None]:
from dlt.sources.sql_database import sql_table

table = sql_table(
    table="family",
    table_adapter_callback=add_max_timestamp,
    incremental=dlt.sources.incremental("max_timestamp"),
)

info = pipeline.run(table, table_name="family_with_max_timestamp")
print(info)

Let's check whether this column exists!

In [None]:
pipeline.dataset().family_with_max_timestamp.df().head()
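
To see what the computed column from this step contains without connecting to MySQL, here is a minimal sketch using Python's built-in `sqlite3`. It assumes a toy `family` table with made-up timestamps, and it swaps in SQLite's two-argument scalar `max()` for MySQL's `GREATEST`, so it only illustrates the shape of the query, not what dlt or SQLAlchemy emit.

```python
import sqlite3

# In-memory stand-in for the Rfam `family` table (timestamps are made up)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE family (rfam_acc TEXT, created TEXT, updated TEXT)")
conn.executemany(
    "INSERT INTO family VALUES (?, ?, ?)",
    [
        ("RF00001", "2013-10-03 20:41:44", "2024-09-10 09:00:00"),
        ("RF00002", "2020-01-01 00:00:00", "2019-06-30 12:00:00"),
    ],
)

# SQLite's scalar max(a, b) plays the role of GREATEST(a, b) here
rows = conn.execute(
    "SELECT rfam_acc, max(created, updated) AS max_timestamp "
    "FROM family ORDER BY rfam_acc"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Each row gets the later of its two timestamps, which is what makes `max_timestamp` usable as an incremental cursor.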

## Step 4: Adapt column data types with `type_adapter_callback`

When the default types don’t match what you want in the destination, you can remap them.

Let's look at the schema that has already been loaded:

In [None]:
schema = pipeline.default_schema.to_dict()["tables"]["family"]["columns"]
for column in schema:
    print(schema[column]["name"], ":", schema[column]["data_type"])

Let's change `hmm_lambda` from decimal to float.

💡 Quick FYI: the `float` data type is:
- Fast and compact
- Approximate: you may get 0.30000000000000004 instead of 0.3
- A poor fit for money, but a good fit for probabilities, large numeric ranges, and scientific values
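
The approximation mentioned above is easy to reproduce in plain Python. This small sketch compares binary floats with the standard library's `Decimal`; it is independent of dlt and only illustrates the trade-off.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly
float_sum = 0.1 + 0.2
# Decimal keeps exact base-10 digits when built from strings
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum)                      # 0.30000000000000004
print(float_sum == 0.3)               # False
print(decimal_sum)                    # 0.3
print(decimal_sum == Decimal("0.3"))  # True
```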

### Example: Change data types

In [None]:
import sqlalchemy as sa
from sqlalchemy.types import Float


def type_adapter_callback(sql_type: Any) -> Any:
    # Map every numeric (decimal) SQLAlchemy type to a float type
    if isinstance(sql_type, sa.Numeric):
        return Float()
    # Leave all other types unchanged
    return sql_type

Use it with `sql_database`:

In [None]:
new_source = sql_database(
    type_adapter_callback=type_adapter_callback, table_names=["family"]
)

info = pipeline.run(new_source, table_name="type_changed_family")
print(info)

👀 Can you see how the column data types have changed?

In [None]:
schema1 = pipeline.default_schema.to_dict()["tables"]["family"]["columns"]
schema2 = pipeline.default_schema.to_dict()["tables"]["type_changed_family"]["columns"]
column = "trusted_cutoff"

print("For table 'family':", schema1[column]["name"], ":", schema1[column]["data_type"])
print(
    "For table 'type_changed_family':",
    schema2[column]["name"],
    ":",
    schema2[column]["data_type"],
)

### **Question 2**:

How many columns had their type changed in the `type_changed_family` table?


## Step 5: Incremental loads with `sql_database`

Track only new rows using a timestamp or ID column.

We'll also be looking at where these incremental values are stored.

Hint: they are stored in [dlt state](https://dlthub.com/docs/general-usage/state).

In [None]:
import json

with open(
    "/var/dlt/pipelines/sql_database_example/state.json", "r", encoding="utf-8"
) as f:
    data = json.load(f)

data["sources"]["sql_database"]["resources"]["family"]["incremental"].keys()

In [None]:
from dlt.sources.sql_database import sql_database
import pendulum

source = sql_database(table_names=["family"])
source.family.apply_hints(
    incremental=dlt.sources.incremental(
        "updated", initial_value=pendulum.datetime(2024, 1, 1)
    )
)

info = pipeline.run(source)
print(info)

In [None]:
import json

with open(
    "/var/dlt/pipelines/sql_database_example/state.json", "r", encoding="utf-8"
) as f:
    data = json.load(f)

data["sources"]["sql_database"]["resources"]["family"]["incremental"].keys()
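
The idea behind this step can be sketched in plain Python: remember the highest cursor value seen so far and, on the next run, keep only rows beyond it. This is a simplified illustration, not dlt's implementation. The function `load_new_rows`, the `state` dictionary, and the sample rows are all hypothetical, and dlt's own incremental logic also handles details such as rows that share the boundary value.

```python
from datetime import datetime
from typing import Any, Dict, List


def load_new_rows(
    rows: List[Dict[str, Any]],
    state: Dict[str, Any],
    cursor_column: str,
    initial_value: Any,
) -> List[Dict[str, Any]]:
    """Return rows whose cursor value is greater than the stored last value."""
    last_value = state.get("last_value", initial_value)
    new_rows = [row for row in rows if row[cursor_column] > last_value]
    if new_rows:
        # Persist the highest cursor value for the next run
        state["last_value"] = max(row[cursor_column] for row in new_rows)
    return new_rows


state: Dict[str, Any] = {}  # stand-in for the persisted pipeline state
rows = [
    {"rfam_acc": "A", "updated": datetime(2023, 6, 1)},
    {"rfam_acc": "B", "updated": datetime(2024, 3, 1)},
]

# First run: only rows newer than the initial value are loaded
first = load_new_rows(rows, state, "updated", datetime(2024, 1, 1))

# A new row arrives; the second run picks up only that row
rows.append({"rfam_acc": "C", "updated": datetime(2024, 5, 1)})
second = load_new_rows(rows, state, "updated", datetime(2024, 1, 1))

print([row["rfam_acc"] for row in first])   # ['B']
print([row["rfam_acc"] for row in second])  # ['C']
```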

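## Rename tables for the `sql_database` source

You can change the destination table names by applying a `table_name` hint to each resource in the source:
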
In [None]:
import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(table_names=["family"])

# Loop through each resource (table) in the source
for _resource_name, resource in source.resources.items():
    # Rename the target table by prefixing with "xxxx__"
    resource.apply_hints(table_name=f"xxxx__{resource.name}")


pipeline = dlt.pipeline(
    pipeline_name="sql_db_prefixed_tables",
    destination="duckdb",
    dataset_name="renamed_tables",
)


print(pipeline.run(source))
pipeline.dataset().row_counts().df()

✅ ▶ Proceed to the [next lesson](https://colab.research.google.com/drive/1P8pOw9C6J9555o2jhZydESVuVb-3z__y#forceEdit=true&sandboxMode=true)!