Skip to content

Latest commit

 

History

History
366 lines (266 loc) · 12 KB

20200310_01.md

File metadata and controls

366 lines (266 loc) · 12 KB

PostgreSQL 物理流复制从库 - 冲突判定, 谁堵塞了wal replay, 等了多久

作者

digoal

日期

2020-03-10

标签

PostgreSQL , 回放 , redo , 流复制


背景

流复制相关的冲突、回放控制参数:

  • hot_standby_feedback: 从库反馈快照, 上游vacuum时不回收最老快照之后产生的垃圾
  • vacuum_defer_cleanup_age: 主动不回收比这个年轻的垃圾
  • recovery_min_apply_delay: 主动不replay比这个时间新的wal
  • max_standby_streaming_delay: 当startup replay stream wal record时, 如果遇到回复冲突, startup最多等多久
  • max_standby_archive_delay: 当startup replay archive(restore command) wal record时, 如果遇到回复冲突, startup最多等多久

几个疑问:
1、冲突了多少次
2、分别是什么冲突导致
3、当前数据库redo回放有没有被冲突
4、谁和redo回放发生了冲突

冲突次数

db1=# select * from pg_stat_database_conflicts ;  
 datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock   
-------+-----------+------------------+------------+----------------+-----------------+----------------  
 14187 | postgres  |                0 |          0 |              0 |               0 |              0  
 16385 | db_video  |                0 |          0 |              0 |               0 |              0  
     1 | template1 |                0 |          0 |              0 |               0 |              0  
 14186 | template0 |                0 |          0 |              0 |               0 |              0  
 16387 | db1       |                0 |          0 |              4 |               0 |              0  
 17527 | db123     |                0 |          0 |              0 |               0 |              0  
(6 rows)  

当前是否被冲突

当startup进程的等待事件为空, 表示它被堵塞了.

postgres=# select * from pg_stat_activity where backend_type ='startup' where wait_event is null;  
-[ RECORD 1 ]----+---------------------------------  
datid            |   
datname          |   
pid              | 21060  
usesysid         |   
usename          |   
application_name |   
client_addr      |   
client_hostname  |   
client_port      |   
backend_start    | 2020-02-29 00:26:28.478013+08  
xact_start       |   
query_start      |   
state_change     |   
wait_event_type  |   
wait_event       |   
state            |   
backend_xid      |   
backend_xmin     |   
query            |   
backend_type     | startup  

当前startup等待事件

当startup在回放wal时, 它在等待什么事件, 通常有io等操作. 这个不是conflict堵塞.

postgres=# select * from pg_stat_activity where backend_type ='startup';  
-[ RECORD 1 ]----+------------------------------  
datid            |   
datname          |   
pid              | 21060  
usesysid         |   
usename          |   
application_name |   
client_addr      |   
client_hostname  |   
client_port      |   
backend_start    | 2020-02-29 00:26:28.478013+08  
xact_start       |   
query_start      |   
state_change     |   
wait_event_type  | IO  
wait_event       | DataFileExtend  
state            |   
backend_xid      |   
backend_xmin     |   
query            |   
backend_type     | startup  

RecoveryWalAll 通常表示startup进程正在等待wal, 通常此时standby处于未delay状态.

postgres=# select * from pg_stat_activity where backend_type ='startup';  
-[ RECORD 1 ]----+------------------------------  
datid            |   
datname          |   
pid              | 21060  
usesysid         |   
usename          |   
application_name |   
client_addr      |   
client_hostname  |   
client_port      |   
backend_start    | 2020-02-29 00:26:28.478013+08  
xact_start       |   
query_start      |   
state_change     |   
wait_event_type  | Activity  
wait_event       | RecoveryWalAll  
state            |   
backend_xid      |   
backend_xmin     |   
query            |   
backend_type     | startup  

startup 可能被哪个query堵塞了

通常是时间越早, 越可能是堵塞startup的query.

或者xmin, xid越早, 越可能是堵塞startup的query.

db1=# select *,xact_start,query_start,state,user,query from pg_stat_activity where datname=current_database() and state<>'idle' order by xact_start limit 5;  
-[ RECORD 1 ]----+---------------------------------------------------------------------------------------------------------------------------  
datid            | 16387  
datname          | db1  
pid              | 29015  
usesysid         | 10  
usename          | postgres  
application_name | psql  
client_addr      |   
client_hostname  |   
client_port      | -1  
backend_start    | 2020-03-10 19:01:22.577305+08  
xact_start       | 2020-03-10 19:01:42.257888+08  
query_start      | 2020-03-10 19:01:43.750416+08  
state_change     | 2020-03-10 19:01:43.750577+08  
wait_event_type  | Client  
wait_event       | ClientRead  
state            | idle in transaction  
backend_xid      |   
backend_xmin     | 4556  
query            | select * from abc limit 1;  
backend_type     | client backend  
xact_start       | 2020-03-10 19:01:42.257888+08  
query_start      | 2020-03-10 19:01:43.750416+08  
state            | idle in transaction  
user             | postgres  
query            | select * from abc limit 1;  

通常可以使用以下query进行查询

1、时间最老

select a.* from   
(  
select *,row_number() over (partition by state order by xact_start) as rn   
from pg_stat_activity   
where datname=current_database()   
and pid<>pg_backend_pid()   
and state<>'idle'   
) a,  
(  
select * from pg_stat_activity where backend_type ='startup' and wait_event is null   
)b  
where a.rn <= 1   
order by a.xact_start;   

2、或 (事务号最老)

select a.* from   
(  
select *,row_number() over (partition by state order by least(backend_xid::text::int8,backend_xmin::text::int8)) as rn   
from pg_stat_activity   
where datname=current_database()   
and pid<>pg_backend_pid()   
and state<>'idle'   
) a,  
(  
select * from pg_stat_activity where backend_type ='startup' and wait_event is null   
)b  
where a.rn <= 1   
order by least(a.backend_xid::text::int8,a.backend_xmin::text::int8);   

可能堵塞了wal apply 的 query 如下

-[ RECORD 1 ]----+------------------------------  
datid            | 16387  
datname          | db1  
pid              | 30448  
usesysid         | 10  
usename          | postgres  
application_name | psql  
client_addr      |   
client_hostname  |   
client_port      | -1  
backend_start    | 2020-03-10 19:13:36.670184+08  
xact_start       | 2020-03-10 19:13:38.696822+08  
query_start      | 2020-03-10 19:13:40.856399+08  
state_change     | 2020-03-10 19:13:40.85716+08  
wait_event_type  | Client  
wait_event       | ClientRead  
state            | idle in transaction  
backend_xid      |   
backend_xmin     | 4561  
query            | select * from abc limit 1;  
backend_type     | client backend  
rn               | 1  

注意, 目前只能找到疑似堵塞了startup replay的query, 不精确.

也许未来内核会支持这样的接口, 可以找到精确的堵塞wal replay的query.

目前PG提供的查询blocking的系统函数只能用于查询重量级锁冲突:

-[ RECORD 1 ]-------+-----------------------------------------------------------------------------------------------
Schema              | pg_catalog
Name                | pg_blocking_pids
Result data type    | integer[]
Argument data types | integer
Type                | func
Volatility          | volatile
Parallel            | safe
Owner               | postgres
Security            | invoker
Access privileges   | 
Language            | internal
Source code         | pg_blocking_pids
Description         | get array of PIDs of sessions blocking specified backend PID from acquiring a heavyweight lock
-[ RECORD 2 ]-------+-----------------------------------------------------------------------------------------------
Schema              | pg_catalog
Name                | pg_isolation_test_session_is_blocked
Result data type    | boolean
Argument data types | integer, integer[]
Type                | func
Volatility          | volatile
Parallel            | safe
Owner               | postgres
Security            | invoker
Access privileges   | 
Language            | internal
Source code         | pg_isolation_test_session_is_blocked
Description         | isolationtester support function
-[ RECORD 3 ]-------+-----------------------------------------------------------------------------------------------
Schema              | pg_catalog
Name                | pg_safe_snapshot_blocking_pids
Result data type    | integer[]
Argument data types | integer
Type                | func
Volatility          | volatile
Parallel            | safe
Owner               | postgres
Security            | invoker
Access privileges   | 
Language            | internal
Source code         | pg_safe_snapshot_blocking_pids
Description         | get array of PIDs of sessions blocking specified backend PID from acquiring a safe snapshot

堵塞了多少wal没有被replay

db1=# select pg_is_wal_replay_paused(),  
pg_last_wal_receive_lsn(),pg_last_wal_replay_lsn(),  
pg_size_pretty(pg_wal_lsn_diff(pg_last_wal_receive_lsn(),pg_last_wal_replay_lsn()));  
 pg_is_wal_replay_paused | pg_last_wal_receive_lsn | pg_last_wal_replay_lsn | pg_size_pretty   
-------------------------+-------------------------+------------------------+----------------  
 f                       | 4/BCF4A338              | 4/BCF49EE8             | 1104 bytes  
(1 row)  

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

digoal's wechat