digoal
2024-01-08
PostgreSQL , PolarDB , DuckDB , 分库分表 , 全局唯一 , snowflake ID
Snowflake ID implementation for PostgreSQL
当业务使用了分库特性时, 如果业务有全局唯一的需求, 需要能保证全局唯一的序列生成器, 传统方法为 UUID.
也可以使用多个字段来实现, 例如一个nodeid字段, 一个dbid字段, 一个schemaid字段, 一个partitionid字段, 一个sequence字段.
nodeid代表实例 ID(实例之间唯一), dbid代表数据库ID(同一台实例的不同数据库之间唯一), schemaid在同一个数据库内不同schema之间唯一, partitionId代表表分区ID(同一个schema下不同分区之间唯一).
不论是使用Uuid还是多字段实现, 都有其缺陷, uuid属于字符串, 占用空间更大, 由于其无序性容易造成索引膨胀. 多字段需要集中管理元数据, 也比较容易出错, 配置错误会导致不唯一.
snowflake id是int8类型的全局唯一序列生成器, 支持1023个实例.
- bit 63 - unused (sign of int8)
- bits 22-62 - timestamp with millisecond precision
- bits 10-21 - counter within one millisecond
- bits 0-9 - unique PostgreSQL node number set in postgresql.conf
除了snowflake id, 也有其他的方案, 例如 Instagram 的方案:
- 《PostgreSQL 全局ID分配(数据字典化)服务 设计实践》
- 《PostgreSQL sharding有序UUID最佳实践 - serial global uuid stored in 64bit int8》
安装snowflake id:
git clone --depth 1 https://github.com/pgEdge/snowflake
cd snowflake
USE_PGXS=1 make install
已集成到学习镜像:
Snowflake is a PostgreSQL extension providing an int8
and sequence
based unique ID solution to optionally replace
the PostgreSQL built-in bigserial
data type.
Internally Snowflakes are 64 bit integers represented externally as bigint
values. The 64 bits are divided into bit fields
bit 63 - unused (sign of int8)
bits 22-62 - timestamp with millisecond precision
bits 10-21 - counter within one millisecond
bits 0-9 - unique PostgreSQL node number set in postgresql.conf
-
The timestamp is a 41-bit unsigned value representing millisecond
precision and an epoch of 2023-01-01. -
The counter is a 12-bit unsigned value that increments per ID allocation.
This provides for 4096 unique IDs per millisecond, or 4 million IDs per
second. Even the most aggressive allocation of sequences cannot reach
one million per second and a modified pgbench with a Snowflake based
primary key on thehistory
table can only generate IDs in the double
digit per millisecond range.Should it be possible in the future to generate more than 4096
snowflakes per millisecond the algorithm is going to bump the
timestamp one millisecond into the future to keep snowflakes
unique. -
The
node number
is a 10-bit unique identifier of the PostgreSQL
instance inside a global cluster. This value must be set with the
GUCsnowflake.node
in thepostgresql.conf
file.
With this design a Snowflake ID is unique within one sequence
across multiple PostgreSQL instances in a distributed cluster.
Go to pgedge.org
and run install.py
(takes two seconds) to install pgEdge nodectl
After running the installer simply run the following command, as a non root user, from your home directory
./nodectl install pg17 --start : install snowflake
For installation from source code it is assumed that the user is
familiar with how to build standard PostgreSQL extensions from a
source.
cd contrib
git clone https://github.com/pgEdge/snowflake.git
cd snowflake
make
make install
The Snowflake extension uses a custom GUC snowflake.node
.
This configuration variable determines the node
part of every
snowflake, generated by this PostgreSQL instance. The
permitted values are 1 thru 1023.
This configuration option has an invalid default value (on purpose).
If not set, the Snowflake extension will throw an exception on
a call to snowflake.nextval()
. This (hopefully) prevents you from
accidentally missing this GUC in your postgresql.conf file.
If you ever intend to use Snowflakes in a multi-node, distributed
or replicated setup, it is important to set the GUC to a unique value
for each PostgreSQL instance. There is nothing
in place to prevent you from shooting yourself in the foot by
assigning multiple PostgreSQL instances in a multi-master cluster
the same node number.
After creating the extension
via
CREATE EXTENSION snowflake;
the following functions become available:
-
snowflake.nextval([sequence regclass])
Generates the next snowflake for the given sequence. If no
sequence is specified the internal, database-wide sequence
snowflake.id_seq
will be used.NOTE: snowflakes are only unique per database, per sequence.
If you need snowflakes to be unique across all snowflake
columns within a database, you need to use the built-in sequence
snowflake.id_seq
. However, that may interfere with the usage
ofcurrval()
of your application logic. -
snowflake.currval([sequence regclass])
Returns the current value of the given sequence (or the default, internal
sequence). Like for PostgreSQL sequences this value is undefined until
the functionsnowflake.nextval()
has been called for the sequence in
the current session. -
snowflake.get_epoch(snowflake int8)
Returns the timestamp part of the given snowflake as EPOCH
(seconds since 2023-01-01) as a NUMERIC value with precision of
three digits. One can useto_timestamp(snowflake.get_epoch(<value>))
to convert this into an actual timestamp. -
snowflake.get_count(snowflake int8)
Returns the count part of the given snowflake as int4 value.
This is a unique value within the milliseconds of the snowflake's
timestamp. -
snowflake.get_node(snowflake int8)
Returns the setting of GUCsnowflake.node
in postgresql.conf at
the time, this snowflake was allocated. -
snowflake.format(snowflake int8)
Returns ajsonb
object of the given snowflake like:
{"node": 1, "ts": "2023-10-16 17:57:26.361+00", "count": 0}
This SQL code shows how to use snowflake in a newly created table.
-- Assuming extension snowflake has been installed
CREATE TABLE table1 (
id bigint PRIMARY KEY DEFAULT snowflake.nextval(),
some_data text
);
INSERT INTO table1 (some_data) VALUES ('first row');
INSERT INTO table1 (some_data) VALUES ('second row');
SELECT id, snowflake.format(id), some_data FROM table1;
Result:
id | format | some_data
-------------------+-----------------------------------------------------------+------------
18014518154714241 | {"node": 1, "ts": "2023-10-16 18:47:12.257+00", "count": 0} | first row
18014518154714242 | {"node": 1, "ts": "2023-10-16 18:47:12.258+00", "count": 0} | second row
(2 rows)
This SQL code demonstrates how to convert an existing bigserial
or
serial8
column into a snowflake.
CREATE TABLE table2 (
id bigserial PRIMARY KEY,
some_data text
);
INSERT INTO table2 (some_data) VALUES ('first row');
INSERT INTO table2 (some_data) VALUES ('second row');
ALTER TABLE table2 ALTER COLUMN id SET DEFAULT snowflake.nextval();
-- Alternatively, if you need to retain the ability to retrieve
-- the snowflake.currval('table2_id_seq') of the sequence individually, use
-- ALTER TABLE table2 ALTER COLUMN id SET DEFAULT snowflake.nextval('table2_id_seq');
INSERT INTO table2 (some_data) VALUES ('third row');
INSERT INTO table2 (some_data) VALUES ('fourth row');
SELECT id, snowflake.format(id), some_data FROM table2;
Result:
id | format | some_data
-------------------+-----------------------------------------------------------+------------
1 | {"node": 0, "ts": "2023-01-01 00:00:00.001+00", "count": 0} | first row
2 | {"node": 0, "ts": "2023-01-01 00:00:00.002+00", "count": 0} | second row
18014518155600128 | {"node": 1, "ts": "2023-10-16 19:01:58.144+00", "count": 0} | third row
18014518155600129 | {"node": 1, "ts": "2023-10-16 19:01:58.145+00", "count": 0} | fourth row
(4 rows)
NOTE: The int8
value of ID remains unique, although it does
jump ahead quite a bit (which is in compliance with standard PostgreSQL
sequences that may have gaps). Because the snowflake EPOCH is
2023-01-01; your existing database would have had to use over 18 trillion
sequence numbers before this conversion to cause any trouble with
possible duplicate key values.