从csv文件导入传感器数据到PostgreSQL的case, 每个传感器仅保留最新记录, 怎么做到最快?

作者

digoal

日期

2023-03-16

背景

背景:

需要导入的是 csv 文件
内容可能是传感器的上报数据
append only 形式追加csv
ts 表示上报时间 (递增)

需求:

数据库中只想保留每个传感器最后一条数据
怎么以最快的速度导入到数据库中?

demo

如果file_fdw支持多个文件, 支持按文件名指定顺序扫描, 支持指定从文件末尾开始扫描或者从文件开头开始扫描. 这个方法肯定是最快的, 文件只读一遍, 也不涉及排序之类的额外操作.

create table t (id int primary key, info text, ts timestamp);

假设file_fdw支持多个文件, 支持按文件名指定顺序扫描, 支持指定从文件末尾开始扫描或者从文件开头开始扫描.

1、如果是空表怎么做?

insert into t select * from file_fdw(指定从文件末尾开始扫描, 指定倒序扫描文件) on conflict (id) do nothing ;    
  
或者  
  
insert into t select * from file_fdw order by ts desc on conflict (id) do nothing ;

2、如果不是空表怎么做?

需要考虑不能对已有数据do nothing, 得刷新它. 而对之前表里没有的新传感器数据的多条, do nothing只取最后一条.

insert into t select * from file_fdw(指定从文件末尾开始扫描, 指定倒序扫描文件)   
on conflict (id)   
do update set info=excluded.info, ts=excluded.ts where ts<excluded.info ;   
  
或者  
  
insert into t select * from file_fdw order by ts desc   
on conflict (id)   
do update set info=excluded.info, ts=excluded.ts where ts<excluded.info ;

解释一下on conflict是不能多次更新同一条记录的问题

create table a (id int primary key, info text, ts date);  

-- SQL1  
insert into a select * from (values (1,'a',date '2022-01-01'),(1,'b',date '2022-01-02'),(1,'c',date '2022-01-03')) as t   
on conflict (id)   
do update set info=excluded.info, ts=excluded.ts where a.ts<excluded.ts ;   
  
  
-- SQL2
insert into a   
select * from (values (1,'a',date '2022-01-01'),(1,'b',date '2022-01-02'),(1,'c',date '2022-01-03')) as t (id,info,ts) order by ts desc  
on conflict (id)   
do update set info=excluded.info, ts=excluded.ts where a.ts < excluded.ts ;   
  
  
                                QUERY PLAN                                  
--------------------------------------------------------------------------  
 Insert on a  (cost=0.06..0.10 rows=0 width=0)  
   Conflict Resolution: UPDATE  
   Conflict Arbiter Indexes: a_pkey  
   Conflict Filter: (a.ts < excluded.ts)  
   ->  Sort  (cost=0.06..0.07 rows=3 width=40)  
         Sort Key: "*VALUES*".column3 DESC  
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=40)  
(7 rows)  
  
  
ERROR:  21000: ON CONFLICT DO UPDATE command cannot affect row a second time  
HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.  
LOCATION:  ExecOnConflictUpdate, nodeModifyTable.c:2054

按理说以上测试的SQL2这条不应该报错, 因为先insert进去的那条ts已经是最大的了, 后进去的因为加了where a.ts < excluded.ts条件, 不会触发更新动作, 也就不存在同一条记录再同一条SQL里面被upsert多次的情况, 不应该报错.

针对这个问题, 先报个 bug: bugid, #17845

以下do nothing不会报错, 因为不涉及同一条记录的多次更新:

postgres=# insert into a   
postgres-# select * from (values (1,'a',date '2022-01-01'),(1,'b',date '2022-01-02'),(1,'c',date '2022-01-03')) as t (id,info,ts) order by ts desc  
postgres-# on conflict (id)  do nothing;  
INSERT 0 1  
postgres=# select * from a;  
 id | info |     ts       
----+------+------------  
  1 | c    | 2022-01-03  
(1 row)

吐槽

抓住机会吐槽一下PG, 因为PG实在是有些太完美了, 能吐槽的机会不多(除了老大难的那些就不说了, 都在我的数据库吐槽系列里面).

1、copy不支持on conflict的功能, 本来就不需要file_fdw. copy支持program, 例如tac file这样不就可以实现倒序输入, 然后on conflict do nothing不就搞定了么. 强烈期待copy支持on conflict这种操作.

2、file_fdw实际上也是调用了copy来进行导入, 所以copy本质上也可以支持倒序读取.

3、insert into on conflict的bug, 实在是不应该, 都已经指定了顺序, 也指定了where过滤, 怎么还能给报多次更新的错呢.

社区几位committer的回复是即使用户能保证插入顺序, 但是从产品设计的角度, 他们依旧要认定这里是需要抛出错误的. 这就太保守了吧, 保守到影响产品功能、用户体验了.
https://www.postgresql.org/message-id/flat/5a1545f3.1e8a.186ed6b8959.Coremail.digoal%40126.com#3365ce8af1e5e3b540b7f2c9b5f13834

4、file_fdw也太不强大了, 支持一下通配符, 或者支持多个文件配置, 多香啊. 用户体验立即上升100倍.

chatgpt

你信不信chatgpt能给你实现以上功能, 反正我现在不信, 就算它能做, 微软为什么不拿来收钱呢? 这可是生产力. 但是目前chatgpt给的思路倒是挺多的.

请你给出修改postgresql file_fdw 代码的示例, 让它支持读取csv文件时从末尾的行开始读到头部的行.

很抱歉，作为AI语言模型，我无法为您提供修改PostgreSQL file_fdw代码的实际示例。但是，我可以为您提供一些可能有用的提示：

首先，在file_fdw的代码中，找到代码段，该代码段处理读取CSV文件的逻辑。

然后，找到文件的读取部分，并检查是否有任何地方在读取文件之前将文件指针设置为文件的末尾。

如果没有，请在读取文件之前将文件指针设置为文件的末尾。这样就可以从文件的末尾开始读取。

接下来，您需要从文件的末尾向前读取每一行。这可以通过使用fseek和fgets函数来完成。

在读取完最后一行后，需要重置文件指针，以便从文件的开头读取CSV文件的其余部分。

最后，您需要测试您所做的更改并确保它们正常工作。

请注意，这只是一个大体的指导方针，具体实现方法可能会因代码的结构和其他因素而有所不同。建议您在进行任何更改之前备份所有相关的文件。此外，如果您不是熟练的编程人员，建议您寻求有经验的人的帮助。

抱歉，我之前的回答有误。实际上，file_fdw是可以支持从末尾的行开始读取csv文件的。您可以按照以下步骤进行修改：

打开file_fdw源代码中的file_fdw.c文件。

找到fileGetOptions函数，该函数用于解析外部表选项。在该函数中添加一个新的选项，例如"reverse"，用于表示从末尾开始读取文件。

static void
fileGetOptions(Oid foreigntableid, char **filename, bool *header, bool *reverse)
{
 ForeignTable *table;
 List *options;
 ListCell *lc;

 /* Fetch options from FDW table */
 table = GetForeignTable(foreigntableid);
 options = NIL;
 options = list_concat(options, table->options);
 options = list_concat(options, GetForeignServer(table->serverid)->options);

 /* Parse options */
 *filename = NULL;
 *header = false;
 *reverse = false;
 foreach(lc, options)
 {
 DefElem *def = (DefElem *) lfirst(lc);

 if (strcmp(def->defname, "filename") ==0)
 {
 if (*filename)
 ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
 errmsg("conflicting or redundant options")));
 *filename = defGetString(def);
 }
 else if (strcmp(def->defname, "header") ==0)
 *header = defGetBoolean(def);
 else if (strcmp(def->defname, "reverse") ==0) // add a new option "reverse"
 *reverse = defGetBoolean(def);
 else
 ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
 errmsg("invalid option "%s"", def->defname)));
 }
 if (*filename == NULL)
 ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
 errmsg("filename is required for file_fdw foreign tables")));
}
找到fileExecuteRead函数，该函数用于执行文件读取操作。在该函数中添加从末尾开始读取文件的操作。具体来说，可以使用fseek函数将文件指针移动到文件末尾，然后使用fgets函数从文件末尾开始逐行读取数据。
static void
fileExecuteRead(FileFdwExecutionState *festate,
 TupleDesc tupdesc,
 ScanDirection direction,
 FILE *file)
{
 char *buffer = NULL;
 size_t buflen =0;
 ssize_t linelen;
 HeapTuple tuple;
 Datum *values;
 bool *nulls;
 bool eof = false;
 bool eof_reached = false;

 if (direction == BackwardScanDirection) // if the scan direction is backward
 {
 if (fseek(file,0L, SEEK_END) !=0) // move file pointer to end of file
 ereport(ERROR,
 (errcode_for_file_access(),
 errmsg("could not seek to end of file: %m")));

 eof_reached = true; // set flag to indicate end of file is reached

 while (!eof_reached)
 {
 if (buffer != NULL)
 {
 pfree(buffer);
 buffer = NULL;
 }

 /* allocate memory for new line */
 buflen =1024;
 buffer = (char *) palloc(buflen);

 /* read a line from file */
 linelen = getline(&buffer, &buflen, file);

 if (linelen == -1)
 {
 if (errno == EINTR)
 continue;
 else if (errno == EOF)
 {
 /* end of file reached */
 eof_reached = true;
 break;
 }
 else
 ereport(ERROR,
 (errcode_for_file_access(),
 errmsg("could not read from file: %m")));
 }

 /* parse line and construct a tuple */
 tuple = fileParseLine(festate, buffer, tupdesc, &values, &nulls);

 /* return tuple if it matches the scan direction */
 if (tuple && tupdesc)
 {
 if (heap_tuple_matches_direction(tuple, tupdesc, direction))
 ExecStoreTuple(tuple, festate->scanstate->ss.ss_ScanTupleSlot, InvalidBuffer, true);
 }

 /* clean up memory */
 if (tuple)
 heap_freetuple(tuple);
 if (values)
 pfree(values);
 if (nulls)
 pfree(nulls);
 }

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

20230316_01.md

20230316_01.md

从csv文件导入传感器数据到PostgreSQL的case, 每个传感器仅保留最新记录, 怎么做到最快?

作者

日期

标签

背景

demo

解释一下on conflict是不能多次更新同一条记录的问题

吐槽

chatgpt

期望 PostgreSQL|开源PolarDB 增加什么功能?

PolarDB 云原生分布式开源数据库

PolarDB 学习图谱: 训练营、培训认证、在线互动实验、解决方案、内核开发公开课、生态合作、写心得拿奖品

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

购买PolarDB云服务折扣活动进行中, 55元起

About 德哥

Files

20230316_01.md

Latest commit

History

20230316_01.md

File metadata and controls

从csv文件导入传感器数据到PostgreSQL的case, 每个传感器仅保留最新记录, 怎么做到最快?

作者

日期

标签

背景

demo

解释一下on conflict是不能多次更新同一条记录的问题

吐槽

chatgpt

期望 PostgreSQL|开源PolarDB 增加什么功能?

PolarDB 云原生分布式开源数据库

PolarDB 学习图谱: 训练营、培训认证、在线互动实验、解决方案、内核开发公开课、生态合作、写心得拿奖品

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

购买PolarDB云服务折扣活动进行中, 55元起

About 德哥