After splitting "data structures and data storage" out of the "database management system", a wave of big data products emerged (DataFusion, arrow-rs, databend, etc.)

Author

digoal

Date

2023-03-28

Tags

PostgreSQL , PolarDB , data structures , data storage , database


Background

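The core of a database management system is a set of heavyweight capabilities: reliability, transaction management, concurrency management, backup and recovery management, data navigation, and so on. That makes it a very complex application.

Strip away that complex shell and what remains is the ability to store, modify, and read data.

If the complex management is not needed and the scenario is simple enough, can the storage structure and the storage itself be split out of the management system?

  • Data storage (put simply, data files with a defined format), for example parquet and arrow.
  • Object storage: a common access protocol, high reliability achieved through parity blocks (for example, emulating the effect of N replicas), and a low price.

When these two are combined well, they may beat a traditional database management system in some append-only big data analytics workloads: lower cost, a better development experience, a simpler architecture, and easier scaling.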

Examples include arrow-rs and LANCE.
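The parity-block idea mentioned for object storage above can be illustrated with a minimal sketch. It assumes the simplest possible scheme, a single XOR parity block over equally sized data blocks; real object stores typically use stronger erasure codes, and the function names `xor_parity` and `recover_missing` are hypothetical, not taken from any library.

```rust
/// Compute one parity block as the byte-wise XOR of equally sized data blocks.
/// Assumption: single XOR parity, the simplest form of erasure coding.
fn xor_parity(blocks: &[Vec<u8>]) -> Vec<u8> {
    let len = blocks.first().map_or(0, |b| b.len());
    let mut parity = vec![0u8; len];
    for block in blocks {
        assert_eq!(block.len(), len, "all blocks must have the same length");
        for (p, byte) in parity.iter_mut().zip(block) {
            *p ^= *byte;
        }
    }
    parity
}

/// Rebuild the one missing data block from the surviving blocks and the parity block.
/// XOR-ing the parity with every surviving block cancels them out, leaving the lost block.
fn recover_missing(surviving: &[Vec<u8>], parity: &[u8]) -> Vec<u8> {
    let mut missing = parity.to_vec();
    for block in surviving {
        assert_eq!(block.len(), missing.len(), "all blocks must have the same length");
        for (m, byte) in missing.iter_mut().zip(block) {
            *m ^= *byte;
        }
    }
    missing
}
```

With three data blocks and one parity block, the stored volume is about 1.33 times the raw data, while three full replicas cost 3 times. The trade-off in this simplified scheme is that it only survives the loss of a single block.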

"DuckDB storage ecosystem: lance (vector storage engine): Modern columnar data format for ML / beyond parquet"

"DuckDB parquet partitioned tables / Delta Lake (data lake) in practice"

"PostgreSQL deltaLake data lake usage: arrow + parquet fdw"

https://github.com/eto-ai/lance

https://github.com/apache/arrow-rs

Similar products: dremio, databend

Native Rust implementation of Apache Arrow and Parquet

Welcome to the implementation of Arrow, the popular in-memory columnar format, in Rust.

This repo contains the following main components:

| Crate | Description | Documentation |
| --- | --- | --- |
| arrow | Core functionality (memory layout, arrays, low level computations) | (README) |
| parquet | Support for Parquet columnar file format | (README) |
| arrow-flight | Support for Arrow-Flight IPC protocol | (README) |
| object-store | Support for object store interactions (aws, azure, gcp, local, in-memory) | (README) |

There are two related crates in a different repository

| Crate | Description | Documentation |
| --- | --- | --- |
| DataFusion | In-memory query engine with SQL support | (README) |
| Ballista | Distributed query execution | (README) |

Collectively, these crates support a vast array of functionality for analytic computations in Rust.

For example, you can write an SQL query or a DataFrame (using the datafusion crate), run it against a parquet file (using the parquet crate), evaluate it in-memory using Arrow's columnar format (using the arrow crate), and send to another process (using the arrow-flight crate).

Generally speaking, the arrow crate offers functionality for using Arrow arrays, and datafusion offers most operations typically found in SQL, including joins and window functions.
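To make "evaluate it in-memory using a columnar format" more concrete, here is a rough sketch using only the Rust standard library. It does not use the arrow or datafusion APIs; the `Orders` type and `sum_amount_for_region` function are hypothetical and only mimic the shape of the work: each field lives in its own contiguous vector, and a filter plus aggregate touches only the columns it needs.

```rust
/// A tiny column-oriented table: each field is stored as its own vector.
/// Hypothetical type for illustration; it is not the arrow crate's RecordBatch.
struct Orders {
    region: Vec<String>,
    amount: Vec<f64>,
}

/// Roughly `SELECT sum(amount) FROM orders WHERE region = ?`.
/// Step 1 scans only the `region` column to build a selection mask;
/// step 2 sums the `amount` values whose mask entry is true.
fn sum_amount_for_region(orders: &Orders, region: &str) -> f64 {
    let mask: Vec<bool> = orders
        .region
        .iter()
        .map(|r| r.as_str() == region)
        .collect();
    orders
        .amount
        .iter()
        .zip(&mask)
        .filter(|(_, keep)| **keep)
        .map(|(amount, _)| *amount)
        .sum()
}
```

Splitting the query into a mask-building pass and an aggregation pass is the pattern columnar engines generally follow; a real engine would add typed arrays, null handling, and vectorized kernels on top of this idea.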

You can find more details about each crate in their respective READMEs.

Arrow Rust Community

The dev@arrow.apache.org mailing list serves as the core communication channel for the Arrow community. Instructions for signing up and links to the archives can be found at the Arrow Community page. All major announcements and communications happen there.

The Rust Arrow community also uses the official ASF Slack for informal discussions and coordination. This is a great place to meet other contributors and get guidance on where to contribute. Join us in the #arrow-rust channel and feel free to ask for an invite via:

  1. the dev@arrow.apache.org mailing list
  2. the GitHub Discussions
  3. the Discord channel

Unlike other parts of the Arrow ecosystem, the Rust implementation uses GitHub issues as the system of record for new features and bug fixes, and this plays a critical role in the release process.

For design discussions we generally collaborate on Google documents and file a GitHub issue linking to the document.

There is more information in the contributing guide.
