Skip to content

Latest commit

 

History

History
275 lines (201 loc) · 11.6 KB

20240108_04.md

File metadata and controls

275 lines (201 loc) · 11.6 KB

分库架构全局唯一序列生成方案: Snowflake ID implementation for PostgreSQL

作者

digoal

日期

2024-01-08

标签

PostgreSQL , PolarDB , DuckDB , 分库分表 , 全局唯一 , snowflake ID


背景

Snowflake ID implementation for PostgreSQL

当业务使用了分库特性时, 如果业务有全局唯一的需求, 需要能保证全局唯一的序列生成器, 传统方法为 UUID.

也可以使用多个字段来实现, 例如一个nodeid字段, 一个dbid字段, 一个schemaid字段, 一个partitionid字段, 一个sequence字段.

nodeid代表实例 ID(实例之间唯一), dbid代表数据库ID(同一台实例的不同数据库之间唯一), schemaid在同一个数据库内不同schema之间唯一, partitionId代表表分区ID(同一个schema下不同分区之间唯一).

不论是使用Uuid还是多字段实现, 都有其缺陷, uuid属于字符串, 占用空间更大, 由于其无序性容易造成索引膨胀. 多字段需要集中管理元数据, 也比较容易出错, 配置错误会导致不唯一.

snowflake id是int8类型的全局唯一序列生成器, 支持1023个实例.

  • bit 63 - unused (sign of int8)
  • bits 22-62 - timestamp with millisecond precision
  • bits 10-21 - counter within one millisecond
  • bits 0-9 - unique PostgreSQL node number set in postgresql.conf

除了snowflake id, 也有其他的方案, 例如 Instagram 的方案:

安装snowflake id:

git clone --depth 1 https://github.com/pgEdge/snowflake  
cd snowflake  
USE_PGXS=1 make install  

已集成到学习镜像:

Snowflake Sequences for PostgreSQL

Implementation

Snowflake is a PostgreSQL extension providing an int8
and sequence based unique ID solution to optionally replace
the PostgreSQL built-in bigserial data type.

Internally Snowflakes are 64 bit integers represented externally as bigint values. The 64 bits are divided into bit fields

bit  63    - unused (sign of int8)  
bits 22-62 - timestamp with millisecond precision  
bits 10-21 - counter within one millisecond  
bits 0-9   - unique PostgreSQL node number set in postgresql.conf  
  • The timestamp is a 41-bit unsigned value representing millisecond
    precision and an epoch of 2023-01-01.

  • The counter is a 12-bit unsigned value that increments per ID allocation.
    This provides for 4096 unique IDs per millisecond, or 4 million IDs per
    second. Even the most aggressive allocation of sequences cannot reach
    one million per second and a modified pgbench with a Snowflake based
    primary key on the history table can only generate IDs in the double
    digit per millisecond range.

    Should it be possible in the future to generate more than 4096
    snowflakes per millisecond the algorithm is going to bump the
    timestamp one millisecond into the future to keep snowflakes
    unique.

  • The node number is a 10-bit unique identifier of the PostgreSQL
    instance inside a global cluster. This value must be set with the
    GUC snowflake.node in the postgresql.conf file.

With this design a Snowflake ID is unique within one sequence
across multiple PostgreSQL instances in a distributed cluster.

Installation

Installation from pgEdge binaries

Go to pgedge.org and run install.py (takes two seconds) to install pgEdge nodectl
After running the installer simply run the following command, as a non root user, from your home directory

./nodectl install pg17 --start : install snowflake  

Installation from source code

For installation from source code it is assumed that the user is
familiar with how to build standard PostgreSQL extensions from a
source.

cd contrib  
git clone https://github.com/pgEdge/snowflake.git  
cd snowflake  
make  
make install  

Configuration

The Snowflake extension uses a custom GUC snowflake.node.
This configuration variable determines the node part of every
snowflake, generated by this PostgreSQL instance. The
permitted values are 1 thru 1023.

This configuration option has an invalid default value (on purpose).
If not set, the Snowflake extension will throw an exception on
a call to snowflake.nextval(). This (hopefully) prevents you from
accidentally missing this GUC in your postgresql.conf file.

If you ever intend to use Snowflakes in a multi-node, distributed
or replicated setup, it is important to set the GUC to a unique value
for each PostgreSQL instance. There is nothing
in place to prevent you from shooting yourself in the foot by
assigning multiple PostgreSQL instances in a multi-master cluster
the same node number.

API

After creating the extension via

CREATE EXTENSION snowflake;  

the following functions become available:

  • snowflake.nextval([sequence regclass])
    Generates the next snowflake for the given sequence. If no
    sequence is specified the internal, database-wide sequence
    snowflake.id_seq will be used.

    NOTE: snowflakes are only unique per database, per sequence.
    If you need snowflakes to be unique across all snowflake
    columns within a database, you need to use the built-in sequence
    snowflake.id_seq. However, that may interfere with the usage
    of currval() of your application logic.

  • snowflake.currval([sequence regclass])
    Returns the current value of the given sequence (or the default, internal
    sequence). Like for PostgreSQL sequences this value is undefined until
    the function snowflake.nextval() has been called for the sequence in
    the current session.

  • snowflake.get_epoch(snowflake int8)
    Returns the timestamp part of the given snowflake as EPOCH
    (seconds since 2023-01-01) as a NUMERIC value with precision of
    three digits. One can use to_timestamp(snowflake.get_epoch(<value>))
    to convert this into an actual timestamp.

  • snowflake.get_count(snowflake int8)
    Returns the count part of the given snowflake as int4 value.
    This is a unique value within the milliseconds of the snowflake's
    timestamp.

  • snowflake.get_node(snowflake int8)
    Returns the setting of GUC snowflake.node in postgresql.conf at
    the time, this snowflake was allocated.

  • snowflake.format(snowflake int8)
    Returns a jsonb object of the given snowflake like:
    {"node": 1, "ts": "2023-10-16 17:57:26.361+00", "count": 0}

Examples

New table using a snowflake as PK

This SQL code shows how to use snowflake in a newly created table.

-- Assuming extension snowflake has been installed  
  
CREATE TABLE table1 (  
    id          bigint PRIMARY KEY DEFAULT snowflake.nextval(),  
    some_data   text  
);  
  
INSERT INTO table1 (some_data) VALUES ('first row');  
INSERT INTO table1 (some_data) VALUES ('second row');  
  
SELECT id, snowflake.format(id), some_data FROM table1;  

Result:

        id         |                          format                           | some_data  
-------------------+-----------------------------------------------------------+------------  
 18014518154714241 | {"node": 1, "ts": "2023-10-16 18:47:12.257+00", "count": 0} | first row  
 18014518154714242 | {"node": 1, "ts": "2023-10-16 18:47:12.258+00", "count": 0} | second row  
(2 rows)  

Converting an existing bigserial column into a snowflake

This SQL code demonstrates how to convert an existing bigserial or
serial8 column into a snowflake.

CREATE TABLE table2 (  
    id          bigserial PRIMARY KEY,  
    some_data   text  
);  
  
INSERT INTO table2 (some_data) VALUES ('first row');  
INSERT INTO table2 (some_data) VALUES ('second row');  
  
ALTER TABLE table2 ALTER COLUMN id SET DEFAULT snowflake.nextval();  
  
-- Alternatively, if you need to retain the ability to retrieve  
-- the snowflake.currval('table2_id_seq') of the sequence individually, use  
-- ALTER TABLE table2 ALTER COLUMN id SET DEFAULT snowflake.nextval('table2_id_seq');  
  
INSERT INTO table2 (some_data) VALUES ('third row');  
INSERT INTO table2 (some_data) VALUES ('fourth row');  
  
SELECT id, snowflake.format(id), some_data FROM table2;  

Result:

        id         |                          format                           | some_data    
-------------------+-----------------------------------------------------------+------------  
                 1 | {"node": 0, "ts": "2023-01-01 00:00:00.001+00", "count": 0} | first row  
                 2 | {"node": 0, "ts": "2023-01-01 00:00:00.002+00", "count": 0} | second row  
 18014518155600128 | {"node": 1, "ts": "2023-10-16 19:01:58.144+00", "count": 0} | third row  
 18014518155600129 | {"node": 1, "ts": "2023-10-16 19:01:58.145+00", "count": 0} | fourth row  
(4 rows)  

NOTE: The int8 value of ID remains unique, although it does
jump ahead quite a bit (which is in compliance with standard PostgreSQL
sequences that may have gaps). Because the snowflake EPOCH is
2023-01-01; your existing database would have had to use over 18 trillion
sequence numbers before this conversion to cause any trouble with
possible duplicate key values.

digoal's wechat