chDB

📢 chDB joins the ClickHouse family 🐍+🚀

chDB is an in-process SQL OLAP Engine powered by ClickHouse.¹ For more details, see The birth of chDB.

Features

In-process SQL OLAP Engine, powered by ClickHouse
No need to install ClickHouse
Minimized data copy from C++ to Python with Python memoryview
Input and output support for Parquet, CSV, JSON, Arrow, ORC and 60+ more formats (see samples)
Support for Python DB API 2.0 (see example)

Arch

Get Started

Get started with chdb using our Installation and Usage Examples

Installation

Currently, chDB supports Python 3.8+ on macOS and Linux (x86_64 and ARM64).

pip install chdb

Usage

Run in command line

python3 -m chdb SQL [OutputFormat]

python3 -m chdb "SELECT 1,'abc'" Pretty

Data Input

The following methods are available to access on-disk and in-memory data formats:

🗂️ Connection based API (recommended)

import chdb

# Create a connection (in-memory by default)
conn = chdb.connect(":memory:")
# Or use file-based: conn = chdb.connect("test.db")

# Create a cursor
cur = conn.cursor()

# Execute queries
cur.execute("SELECT number, toString(number) as str FROM system.numbers LIMIT 3")

# Fetch data in different ways
print(cur.fetchone())    # Single row: (0, '0')
print(cur.fetchmany(2))  # Multiple rows: ((1, '1'), (2, '2'))

# Get column information
print(cur.column_names())  # ['number', 'str']
print(cur.column_types())  # ['UInt64', 'String']

# Use the cursor as an iterator
cur.execute("SELECT number FROM system.numbers LIMIT 3")
for row in cur:
    print(row)

# Always close resources when done
cur.close()
conn.close()

For more details, see examples/connect.py.

🗂️ Query On File

(Parquet, CSV, JSON, Arrow, ORC and 60+)

You can execute SQL and get the result back in the output format you choose.

import chdb
res = chdb.query('select version()', 'Pretty'); print(res)

Work with Parquet or CSV

# See more data type format in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV');  print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, storage read {res.storage_rows_read()} rows, {res.storage_bytes_read()} bytes, elapsed {res.elapsed()} seconds")

Pandas dataframe output

# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')

🗂️ Query On Table

(Pandas DataFrame, Parquet file/bytes, Arrow bytes)

Query On Pandas DataFrame

import chdb.dataframe as cdf
import pandas as pd
# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
                  tbl1=df1, tbl2=df2)
print(ret_tbl)
# Query on the DataFrame Table
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))
# Pandas DataFrames are automatically registered as temporary tables in ClickHouse
chdb.query("SELECT * FROM Python(df1) t1 JOIN Python(df2) t2 ON t1.a = t2.c").show()

🗂️ Query with Stateful Session

from chdb import session as chs

## Create DB, Table, View in temp session, auto cleanup when session is deleted.
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = Log;")
sess.query("INSERT INTO db_xxx.log_table_xxx VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5);")
sess.query(
    "CREATE VIEW db_xxx.view_xxx AS SELECT * FROM db_xxx.log_table_xxx LIMIT 4;"
)
print("Select from view:\n")
print(sess.query("SELECT * FROM db_xxx.view_xxx", "Pretty"))

🗂️ Query with Python DB-API 2.0

import chdb.dbapi as dbapi
print("chdb driver version: {0}".format(dbapi.get_client_info()))

conn1 = dbapi.connect()
cur1 = conn1.cursor()
cur1.execute('select version()')
print("description: ", cur1.description)
print("data: ", cur1.fetchone())
cur1.close()
conn1.close()
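
In the DB-API 2.0 specification (PEP 249), `cursor.description` is a sequence of 7-item sequences whose first item is the column name, or `None` when there is no result set. The sketch below extracts the names; `column_names` is a hypothetical helper and `sample_description` is hand-written illustration data, not actual chDB output.

```python
def column_names(description):
    """Return the column names from a PEP 249 cursor.description value."""
    if description is None:      # PEP 249: None when no result set is available
        return []
    return [column[0] for column in description]


# Hand-written sample in the PEP 249 shape:
# (name, type_code, display_size, internal_size, precision, scale, null_ok)
sample_description = [
    ("version()", "String", None, None, None, None, None),
]
print(column_names(sample_description))  # ['version()']
```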

🗂️ Query with UDF (User Defined Functions)

from chdb.udf import chdb_udf
from chdb import query

@chdb_udf()
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

print(query("select sum_udf(12,22)"))

Some notes on the chDB Python UDF (User Defined Function) decorator:

The function should be stateless, so only UDFs are supported, not UDAFs (User Defined Aggregation Functions).
The default return type is String. To change it, pass the return type as an argument to the decorator. The return type should be one of the following: https://clickhouse.com/docs/en/sql-reference/data-types
The function should take arguments of type String. Because the input is TabSeparated, all arguments are strings.

The function will be called for each line of input. Something like this:

import sys

def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

for line in sys.stdin:
    args = line.strip().split('\t')
    lhs = args[0]
    rhs = args[1]
    print(sum_udf(lhs, rhs))
    sys.stdout.flush()

The function should be a pure Python function. You SHOULD import all Python modules used IN THE FUNCTION.
```
def func_use_json(arg):
    import json
    ...
```
The Python interpreter used is the same as the one running the script, obtained from sys.executable.
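
To see why every argument arrives as a string, the per-line loop described above can be tried out without chDB. This is a minimal sketch under that description: `run_udf_over_lines` is a hypothetical helper, and an `io.StringIO` buffer stands in for `sys.stdin`.

```python
import io


def sum_udf(lhs, rhs):
    # Arguments arrive as strings, so convert them explicitly.
    return int(lhs) + int(rhs)


def run_udf_over_lines(func, stream):
    """Call func once per TabSeparated line and collect the text it would print."""
    results = []
    for line in stream:
        args = line.rstrip("\n").split("\t")
        results.append(str(func(*args)))
    return results


fake_stdin = io.StringIO("12\t22\n1\t2\n")
print(run_udf_over_lines(sum_udf, fake_stdin))  # ['34', '3']
```

Dropping the `int(...)` conversions would concatenate the strings instead (`'1222'`), which is the usual symptom of forgetting that the inputs are text.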

🗂️ Streaming Query

Process large datasets with constant memory usage through chunked streaming.

from chdb import session as chs

sess = chs.Session()

# Example 1: Basic example of using streaming query
rows_cnt = 0
with sess.send_query("SELECT * FROM numbers(200000)", "CSV") as stream_result:
    for chunk in stream_result:
        rows_cnt += chunk.rows_read()

print(rows_cnt) # 200000

# Example 2: Manual iteration with fetch()
rows_cnt = 0
stream_result = sess.send_query("SELECT * FROM numbers(200000)", "CSV")
while True:
    chunk = stream_result.fetch()
    if chunk is None:
        break
    rows_cnt += chunk.rows_read()

print(rows_cnt) # 200000

# Example 3: Early cancellation demo
rows_cnt = 0
stream_result = sess.send_query("SELECT * FROM numbers(200000)", "CSV")
while True:
    chunk = stream_result.fetch()
    if chunk is None:
        break
    if rows_cnt > 0:
        stream_result.close()
        break
    rows_cnt += chunk.rows_read()

print(rows_cnt) # 65409

# Example 4: Using PyArrow RecordBatchReader for batch export and integration with other libraries
import pyarrow as pa
from deltalake import write_deltalake

# Get streaming result in arrow format
stream_result = sess.send_query("SELECT * FROM numbers(100000)", "Arrow")

# Create RecordBatchReader with custom batch size (default rows_per_batch=1000000)
batch_reader = stream_result.record_batch(rows_per_batch=10000)

# Use RecordBatchReader with external libraries like Delta Lake
write_deltalake(
    table_or_uri="./my_delta_table",
    data=batch_reader,
    mode="overwrite"
)

stream_result.close()

sess.close()

Important Note: When using streaming queries, if the StreamingResult is not fully consumed (due to errors or early termination), you must explicitly call stream_result.close() to release resources, or use the with statement for automatic cleanup. Failure to do so may block subsequent queries.

For more details, see test_streaming_query.py and test_arrow_record_reader_deltalake.py.

🗂️ Python Table Engine

Query on Pandas DataFrame

import chdb
import pandas as pd
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'tags': ['urgent', 'important'], 'metadata': {'created': '2024-01-01'}},
            {'id': 2, 'tags': ['normal'], 'metadata': {'created': '2024-02-01'}},
            {'id': 3, 'name': 'tom'},
            {'id': 4, 'value': '100'},
            {'id': 5, 'value': 101},
            {'id': 6, 'value': 102},
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(df) WHERE dict_col.value='100'").show()

Query on Arrow Table

import chdb
import pyarrow as pa
arrow_table = pa.table(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'value': 'tom'},
            {'id': 2, 'value': 'jerry'},
            {'id': 3, 'value': 'auxten'},
            {'id': 4, 'value': 'tom'},
            {'id': 5, 'value': 'jerry'},
            {'id': 6, 'value': 'auxten'},
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(arrow_table) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(arrow_table) WHERE dict_col.value='tom'").show()

Query on chdb.PyReader class instance

You must inherit from chdb.PyReader class and implement the read method.
The read method should:
1. return a list of lists, where the first dimension is the column and the second dimension is the row; the column order should match col_names, the first argument of read.
2. return an empty list when there is no more data to read.
3. be stateful, the cursor should be updated in the read method.
An optional get_schema method can be implemented to return the schema of the table. The prototype is def get_schema(self) -> List[Tuple[str, str]]:, the return value is a list of tuples, each tuple contains the column name and the column type. The column type should be one of the following: https://clickhouse.com/docs/en/sql-reference/data-types

import chdb

class myReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

    def read(self, col_names, count):
        print("Python func read", col_names, count, self.cursor)
        if self.cursor >= len(self.data["a"]):
            self.cursor = 0
            return []
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block

    def get_schema(self):
        return [
            ("a", "int"),
            ("b", "str"),
            ("dict_col", "json")
        ]

reader = myReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'tags': ['urgent', 'important'], 'metadata': {'created': '2024-01-01'}},
            {'id': 2, 'tags': ['normal'], 'metadata': {'created': '2024-02-01'}},
            {'id': 3, 'name': 'tom'},
            {'id': 4, 'value': '100'},
            {'id': 5, 'value': 101},
            {'id': 6, 'value': 102}
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(reader) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(reader) WHERE dict_col.value='100'").show()

See also: test_query_py.py and test_query_json.py.
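
The three rules for read can be exercised without chDB. The following sketch returns data in batches and assumes that `count` is an upper bound on the number of rows per call. `BatchedColumns` and `drain` are hypothetical names; in real use the class would inherit from chdb.PyReader as in the example above.

```python
class BatchedColumns:
    """Stand-in with the read() contract described above (column-major blocks)."""

    def __init__(self, data):
        self.data = data      # dict: column name -> list of values
        self.cursor = 0       # index of the next unread row

    def read(self, col_names, count):
        total_rows = len(next(iter(self.data.values())))
        if self.cursor >= total_rows:
            return []                                   # rule 2: no more data
        end = min(self.cursor + count, total_rows)
        # rule 1: one inner list per requested column, in col_names order
        block = [self.data[name][self.cursor:end] for name in col_names]
        self.cursor = end                               # rule 3: advance the cursor
        return block


def drain(reader, col_names, count):
    """Call read() repeatedly until it returns an empty list."""
    blocks = []
    while True:
        block = reader.read(col_names, count)
        if not block:
            break
        blocks.append(block)
    return blocks


reader = BatchedColumns({"a": [1, 2, 3, 4, 5], "b": ["v", "w", "x", "y", "z"]})
for block in drain(reader, ["b", "a"], 2):
    print(block)
# [['v', 'w'], [1, 2]]
# [['x', 'y'], [3, 4]]
# [['z'], [5]]
```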

JSON Type Inference

chDB automatically converts Python dictionary objects to ClickHouse JSON types from these sources:

Pandas DataFrame
- Columns with object dtype are sampled (default 10,000 rows) to detect JSON structures.
- Control sampling via SQL settings:
```
SET pandas_analyze_sample = 10000  -- Default sampling
SET pandas_analyze_sample = 0      -- Force String type
SET pandas_analyze_sample = -1     -- Force JSON type
```
- Columns are converted to String if sampling finds non-dictionary values.
Arrow Table
- struct type columns are automatically mapped to JSON columns.
- Nested structures preserve type information.

chdb.PyReader

Implement custom schema mapping in get_schema():

def get_schema(self):
    return [
        ("c1", "JSON"),  # Explicit JSON mapping
        ("c2", "String")
    ]

Column types declared as "JSON" will bypass auto-detection.
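
To make the pandas sampling rule above concrete, here is a minimal pure-Python sketch of the documented decision. It is an illustration, not chDB's implementation: `infer_object_column_type` is a hypothetical name, and skipping `None` values and treating an empty sample as String are assumptions.

```python
def infer_object_column_type(values, sample_size=10000):
    """Return 'JSON' or 'String' for an object column, following the rules above."""
    if sample_size == 0:
        return "String"                  # pandas_analyze_sample = 0: force String
    if sample_size < 0:
        return "JSON"                    # pandas_analyze_sample = -1: force JSON
    # Assumption: missing values do not count for or against JSON.
    sample = [v for v in values[:sample_size] if v is not None]
    # Any non-dictionary value in the sample means the column becomes String.
    if sample and all(isinstance(v, dict) for v in sample):
        return "JSON"
    return "String"


print(infer_object_column_type([{"id": 1}, {"id": 2}]))               # JSON
print(infer_object_column_type([{"id": 1}, "plain text"]))            # String
print(infer_object_column_type([{"id": 1}, "plain text"], sample_size=1))  # JSON
```

The last call shows the trade-off of sampling: a non-dictionary value outside the sampled rows is not seen by the check.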

When converting Python dictionary objects to JSON columns:

Nested Structures
- Recursively process nested dictionaries, lists, tuples and NumPy arrays.
Primitive Types
- Automatic type recognition for basic types such as integers, floats, strings, and booleans.
Complex Objects
- Non-primitive types will be converted to strings.
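
The three conversion rules above can be pictured with a small recursive function. This is a sketch of the idea only, not chDB's code: `to_json_value` is a hypothetical name, NumPy arrays are left out to keep the snippet free of dependencies, and the string form that chDB produces for complex objects may differ from Python's `str()`.

```python
import datetime


def to_json_value(value):
    """Recursively convert a Python object into JSON-friendly values."""
    if isinstance(value, dict):
        # Nested structures: process each entry recursively.
        return {str(key): to_json_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_value(item) for item in value]
    if value is None or isinstance(value, (bool, int, float, str)):
        return value                     # primitive types are kept as they are
    return str(value)                    # complex objects fall back to a string


record = {
    "id": 1,
    "tags": ("urgent", "important"),
    "metadata": {"created": datetime.date(2024, 1, 1)},
}
print(to_json_value(record))
# {'id': 1, 'tags': ['urgent', 'important'], 'metadata': {'created': '2024-01-01'}}
```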

Limitations

Column types supported: pandas.Series, pyarrow.array, chdb.PyReader
Data types supported: Int, UInt, Float, String, Date, DateTime, Decimal
Python Object type will be converted to String
Performance: Pandas DataFrame is the fastest, followed by Arrow Table, then PyReader

For more examples, see examples and tests.

Demos and Examples

Project Documentation and Usage Examples
Colab Notebooks and other Script Examples

Benchmark

Documentation

For chdb specific examples and documentation refer to chDB docs
For SQL syntax, please refer to ClickHouse SQL Reference

Events

Demo chDB at ClickHouse v23.7 livehouse! and Slides

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Here are some ways you can help:

Help test and report bugs
Help improve documentation
Help improve code quality and performance

Bindings

We welcome bindings for other languages. Please refer to bindings for more details.

Version Guide

Please refer to VERSION-GUIDE.md for more details.

Paper

ClickHouse - Lightning Fast Analytics for Everyone

License

Apache 2.0, see LICENSE for more information.

Acknowledgments

chDB is mainly based on ClickHouse.¹ For trademark and other reasons, I named it chDB.

Contact

¹ ClickHouse® is a trademark of ClickHouse Inc. All trademarks, service marks, and logos mentioned or depicted are the property of their respective owners. The use of any third-party trademarks, brand names, product names, and company names does not imply endorsement, affiliation, or association with the respective owners.
