Skip to content

Create new empty external table #7228

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

Thanks to the great work from @metesynnada @devinjdangelo and others, it is now possible to have DataFusion insert data into CSV and JSON tables.

However, the User Experience / UX is tough as it is not easy to write datasets because: External tables require an existing location to exist, even if empty

So for example, if I wanted to write to json files in /tmp/my_table I currently need to create a the target file / directory externally

mkdir /tmp/my_table
echo "" > /tmp/my_table/1.json
datafusion-cli
❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.

❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.006 seconds.

❯
\q
$ cat /tmp/my_table/1.json

{"x":1,"y":2}
{"x":3,"y":4}

Describe the solution you'd like

I would like to be able to have datafusion create the targets directly

Something like this, without any setup:

❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.

❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.006 seconds.
$ cat /tmp/my_table/1.json

{"x":1,"y":2}
{"x":3,"y":4}

Describe alternatives you've considered

Option 1: Automatically create the target

One option is simply to remove the existence check on CREATE EXTERNAL TABLE

❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table';
0 rows in set. Query took 0.002 seconds.

❯ insert into my_table values (1,2), (3, 4);
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.006 seconds.

The downside is that the reason CREATE EXTERNAL TABLE errors if the target doesn't exist is to help people debug errors when reading

Option 2: Add new DDL to CREATE EXTERNAL TABLE

Perhaps we could add a phrase to CREATE EXTERNAL TABLE like FOR WRITE

❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table' FOR WRITE;
0 rows in set. Query took 0.002 seconds.

The semantics would be if the target file/directory doesn't already exist, then create it rather than error

Option 3: Add a config parameter to control the default behavior

We could also add a config parameter like

set catalog.auto_create_tables = true;
SET

❯ create external table my_table(x int, y int) stored as JSON location '/tmp/my_table' FOR WRITE;
0 rows in set. Query took 0.002 seconds.

with the same semantics as option 2 (create it target doesn't exist)

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions