Skip to content

Latest commit

 

History

History
85 lines (52 loc) · 3.97 KB

20240422_01.md

File metadata and controls

85 lines (52 loc) · 3.97 KB

使用DuckDB sniff_csv 解析csv文件元数据格式

作者

digoal

日期

2024-04-22

标签

PostgreSQL , PolarDB , DuckDB , sniff_csv , csv , 元数据格式


背景

如果要把csv导入数据库, 需要先解析csv内容的元数据格式, 例如是否有列名, 每列什么类型, 分隔符是什么, 换行符是什么, 逃逸字符是什么等等.

DuckDB的csv解析做得比较丝滑, 可以配置采样条数, 并自动解析出元数据的内容. 使用DuckDB read_csv时我们几乎不需要手工填写任何csv元数据信息.

https://duckdb.org/docs/data/csv/auto_detection

同时DuckDB也提供了csv文件的解析函数sniff_csv, 配置采样条数, 自动解析出元数据的内容.

1、sniff_csv Function

It is possible to run the CSV sniffer as a separate step using the sniff_csv(filename) function, which returns the detected CSV properties as a table with a single row. The sniff_csv function accepts an optional sample_size parameter to configure the number of rows sampled.

FROM sniff_csv('my_file.csv');  
FROM sniff_csv('my_file.csv', sample_size = 1000);  

返回元数据

Column name Description Example
Delimiter delimiter ,
Quote quote character "
Escape escape \
NewLineDelimiter new-line delimiter \r\n
SkipRow number of rows skipped 1
HasHeader whether the CSV has a header true
Columns column types encoded as a LIST of STRUCTs ({'name': 'VARCHAR', 'age': 'BIGINT'})
DateFormat date Format %d/%m/%Y
TimestampFormat timestamp Format %Y-%m-%dT%H:%M:%S.%f
UserArguments arguments used to invoke sniff_csv sample_size = 1000
Prompt prompt ready to be used to read the CSV FROM read_csv('my_file.csv', auto_detect=false, delim=',', ...)

2、Prompt, 返回元数据提示

The Prompt column contains a SQL command with the configurations detected by the sniffer.

-- use line mode in CLI to get the full command  
.mode line  
SELECT Prompt FROM sniff_csv('my_file.csv');  
返回  
Prompt = FROM read_csv('my_file.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\n', skip=0, header=true, columns={...});  

digoal's wechat