digoal
2024-04-22
PostgreSQL , PolarDB , DuckDB , sniff_csv , csv , 元数据格式
如果要把csv导入数据库, 需要先解析csv内容的元数据格式, 例如是否有列名, 每列什么类型, 分隔符是什么, 换行符是什么, 逃逸字符是什么等等.
DuckDB的csv解析做得比较丝滑, 可以配置采样条数, 并自动解析出元数据的内容. 使用DuckDB read_csv时我们几乎不需要手工填写任何csv元数据信息.
https://duckdb.org/docs/data/csv/auto_detection
同时DuckDB也提供了csv文件的解析函数sniff_csv, 配置采样条数, 自动解析出元数据的内容.
1、sniff_csv
Function
It is possible to run the CSV sniffer as a separate step using the sniff_csv(filename) function, which returns the detected CSV properties as a table with a single row. The sniff_csv function accepts an optional sample_size parameter to configure the number of rows sampled.
FROM sniff_csv('my_file.csv');
FROM sniff_csv('my_file.csv', sample_size = 1000);
返回元数据
Column name | Description | Example |
---|---|---|
Delimiter |
delimiter | , |
Quote |
quote character | " |
Escape |
escape | \ |
NewLineDelimiter |
new-line delimiter | \r\n |
SkipRow |
number of rows skipped | 1 |
HasHeader |
whether the CSV has a header | true |
Columns |
column types encoded as a LIST of STRUCT s |
({'name': 'VARCHAR', 'age': 'BIGINT'}) |
DateFormat |
date Format | %d/%m/%Y |
TimestampFormat |
timestamp Format | %Y-%m-%dT%H:%M:%S.%f |
UserArguments |
arguments used to invoke sniff_csv |
sample_size = 1000 |
Prompt |
prompt ready to be used to read the CSV | FROM read_csv('my_file.csv', auto_detect=false, delim=',', ...) |
2、Prompt, 返回元数据提示
The Prompt column contains a SQL command with the configurations detected by the sniffer.
-- use line mode in CLI to get the full command
.mode line
SELECT Prompt FROM sniff_csv('my_file.csv');
返回
Prompt = FROM read_csv('my_file.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\n', skip=0, header=true, columns={...});