-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
Thanks to the great work from @metesynnada @devinjdangelo and others, it is now possible to have DataFusion insert data into CSV and JSON tables.
However, the User Experience / UX is tough as it is not easy to write datasets because: External tables require an existing location to exist, even if empty
So for example, if I wanted to write to json files in /tmp/my_table I currently need to create a the target file / directory externally
mkdir /tmp/my_table
echo "" > /tmp/my_table/1.json
datafusion-cli❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.
❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2 |
+-------+
1 row in set. Query took 0.006 seconds.
❯
\q$ cat /tmp/my_table/1.json
{"x":1,"y":2}
{"x":3,"y":4}Describe the solution you'd like
I would like to be able to have datafusion create the targets directly
Something like this, without any setup:
❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.
❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2 |
+-------+
1 row in set. Query took 0.006 seconds.$ cat /tmp/my_table/1.json
{"x":1,"y":2}
{"x":3,"y":4}Describe alternatives you've considered
Option 1: Automatically create the target
One option is simply to remove the existence check on CREATE EXTERNAL TABLE
❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.
❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2 |
+-------+
1 row in set. Query took 0.006 seconds.The downside is that the reason CREATE EXTERNAL TABLE errors if the target doesn't exist is to help people debug errors when reading
Option 2: Add new DDL to CREATE EXTERNAL TABLE
Perhaps we could add a phrase to CREATE EXTERNAL TABLE like FOR WRITE
❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table' FOR WRITE;
0 rows in set. Query took 0.002 seconds.The semantics would be if the target file/directory doesn't already exist, then create it rather than error
Option 3: Add a config parameter to control the default behavior
We could also add a config parameter like
❯ set catalog.auto_create_tables = true;
SET
❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table' FOR WRITE;
0 rows in set. Query took 0.002 seconds.with the same semantics as option 2 (create it target doesn't exist)
Additional context
No response