Add support for \copy command via script/trigger #82

jasonmp85 · 2015-03-09T15:36:12Z

The script provides an easy way for users to COPY to a distributed table. It accepts a filename and table name, prepares the table for COPY, performs the COPY, then outputs the number of rows copied. Flags are supported to enable various COPY OPTIONS in the underlying SQL statement.

This branch still needs Makefile changes, better documentation, and a unit test, but I wanted to get the code out there to kick off a review. I'll be pushing the remaining changes here as they come up.

Fixes #61

Code Review Tasks

Switch back to providing sequence to function
Make sequence argument optional
Add regression test for sequence behavior
Switch to distributed table for regression tests
Add COPY test with partial failure
Modify script to use sequence-based UDF
Remove as much shell interpolation as possible (favor psql variables)
Swap order of arguments in usage shell function

Gets the COPY function and script in for initial review.

jasonmp85 · 2015-03-09T15:36:53Z

scripts/copy_to_distributed_table.sh

+			options="${options}, HEADER true"
+			;;
+		h)
+			usage


Just realized I didn't write a usage function yet. Fixing in the afternoon.

jasonmp85 · 2015-03-09T15:43:45Z

High-level description of the approach: prepare_distributed_table_for_copy accepts a table and creates a function specifically for that table based on its columns. Ideally we'd use INSERT INTO foo VALUES (($1).*) USING NEW or even INSERT INTO foo VALUES ($1.bar, $1.baz) USING NEW, but these types of field accesses aren't permitted by pg_shard (at the moment). It rejects them as not being constant expressions.

So the only type of statement that could work is: INSERT INTO foo VALUES ($1, $2) USING NEW.bar, NEW.baz. Since the dynamic text of that statement isn't in the INSERT string but in the function itself (in the USING clause), this necessitates generating a complete function. So that's what we do.

Once the function has been generated for the given table, it is installed as a trigger on a temporary table exactly like the target table. Then any COPY executing against the temporary table will be redirected to the distributed table in a manner accepted by pg_shard.

sumedhpathak · 2015-03-09T21:45:07Z

Ideally we'd use INSERT INTO foo VALUES (($1).*) USING NEW or even INSERT INTO foo VALUES ($1.bar, $1.baz) USING NEW,

Is USING an INSERT keyword? I don't see it in the Postgres insert documentation?

jasonmp85 · 2015-03-09T21:46:58Z

It's part of the EXECUTE statement in PL/pgSQL.

sumedhpathak · 2015-03-09T21:48:37Z

pg_shard--1.1.sql

+		SELECT relname INTO STRICT table_name FROM pg_class WHERE oid = relation;
+		temp_table_name = format('%s_copy_facade', table_name);
+
+		SELECT array_agg(attname) INTO STRICT attr_names FROM pg_attribute WHERE attrelid = relation AND


Does this mean we only support COPY of all attributes?

This means the INSERT trigger will generate a statement with all user-visible attributes fully specified. If the overarching COPY does not specify all columns, one of two things could happen:

If NEW.missing_col is provided as NULL, we'd explicitly include a NULL in the INSERT. This could be wrong if there is a default value specified

If NEW.missing_col already replaces unspecified columns with their default values, the trigger would just explicitly specify the default value, which would be correct.

The copy shell script presently doesn't allow omitting columns: it's expected the CSV or whatnot has the same dimensions as the table. So this is kind of a moot point… I'll check about the NULL vs default value distinction.

jasonmp85 · 2015-03-09T21:51:41Z

We don't frequently write this heavily in PL/pgSQL, so here is a brief primer of features used to help in review:

Variables must be listed in DECLARE blocks at the top of functions (though you can also have nested blocks in the function itself to have limited scope variables)
They can be marked constant to prevent reassignment
Double dollar signs can be used to delimit strings (and they can have a token within the dollar signs to allow nested quoting; quotes don't end until a matching dollar symbol is found)
format has %I and %L specifiers for "identifier" and "literal", respectively. Helps format things that need to appear in an SQL statement as an identifier (INSERT INTO identifier) or literal (VALUES ('literal')
SELECT INTO variable allows the result of a SELECT to be assigned to a variable. STRICT means we expect a single row
|| is string concatenation
EXECUTE is basically eval: it executes a string which may contain positional parameters ($1, $2, etc.) bound at runtime to values supplied in a USING clause.

Based on review discussions.

sumedhpathak · 2015-03-10T21:29:49Z

pg_shard--1.1.sql

+											END;
+										$cti$ LANGUAGE plpgsql VOLATILE;
+									$$;
+		table_tmpl CONSTANT text := 'CREATE TEMPORARY TABLE %I (LIKE %s)';


Why does this statement use quotes instead of $$? Could we be consistent?

I guess I had a style that I used normal quotes if it fit on one line and $$ for multi-line things. Kind of like how you wouldn’t use a here doc for every string declaration in a shell script…

—Jason

On Mar 10, 2015, at 3:29 PM, sumedhpathak notifications@github.com wrote:

In pg_shard--1.1.sql #82 (comment):

temp_table_name text;

attr_names text[];

attr_list text;

param_list text;

using_list text;

insert_command text;

func_tmpl CONSTANT text := $$ CREATE FUNCTION pg_temp.copy_to_insert() RETURNS trigger

AS $cti$

BEGIN

EXECUTE %L USING %s;

PERFORM nextval(%L);

RETURN NULL;

END;

$cti$ LANGUAGE plpgsql VOLATILE;

$$;

table_tmpl CONSTANT text := 'CREATE TEMPORARY TABLE %I (LIKE %s)';
Why does this statement use quotes instead of $$? Could we be consistent?

—
Reply to this email directly or view it on GitHub https://github.com/citusdata/pg_shard/pull/82/files#r26168240.

Changing type of quotes, expanding 'trg_tmpl'.

Makes it more standalone for direct use by users.

Now the shell script can handle counting successfully-INSERTed rows itself. It just installs a trigger _after_ the proxy one and enables write through. This allows it to see the successful rows, which it counts and discards to prevent any writes to the temp table.

See http://redsymbol.net/articles/unofficial-bash-strict-mode/

Without this, we insert NULL in missing columns. With this, unspecified columns will have the correct default values.

Looks nicer on the command line without suffix, and other PostgreSQL extensions use a bin folder for scripts, so we'll do the same.

PGXS installs scripts into the PostgreSQL install's bin directory.

Tests INSERT/COPY functionality on proxy table, once with writethrough off (default) and once with it on.

Oops, meant to do this long ago.

Changed suffix used by proxy generation. Should we use COPY instead of \copy to permit interpolation of colon variables?

Gives better error messages and statuses.

Using -f prohibits the use of pipes, devices, or sockets. While the latter two might be of questionable use, prohibiting the first prevents users from using named pipes or process substitutions, which could be very useful in certain instances.

sumedhpathak · 2015-03-11T23:18:09Z

bin/copy_to_distributed_table

+      Default: 'public'
+E_O_USAGE
+
+	exit $1;


Minor. Should the order of arguments be inverted? (Output device first, followed by exit status?)

I initially just had one argument (status) but realized that commands I consider "well-behaved" output to stdout (and exit 0) when you ask them for help (i.e. pass -h) but output to stderr (with non-zero status) when you make some usage mistake. So that necessitated adding a second argument.

I agree that reversing the order makes more sense.

sumedhpathak · 2015-03-11T23:21:37Z

I had three higher-level things:

I think I am preferring the previous solution of giving a sequence to the function, which it increments. Is there a reason to not do this? i.e., Is there a use-case for the write-through mode in the function apart from being able to install another trigger on the temp-table to count the rows? If not, then I think just passing a sequence makes sense. If no sequence is specified the function doesn't increment it maybe?

I like that the function returns the tablename, so it makes it easier to de-couple the calls and run them manually.
Could you add a regression test which supplies a sequence (if you make the above change), and then select's curr_val to validate the function incremented the sequence?
Could you also add a regression test after COPY'ing some good rows followed by junk data to validate that the correct rows are recorded as inserted? (Maybe also followed by a select to validate that we inserted the previous 'x' good rows?

jasonmp85 · 2015-03-12T01:43:28Z

On each of your high-level points:

I agree it is more straightforward to explain, but think it's more modular. It just routes the INSERT and knows how to do nothing else. Why pass a sequence? What's special about that, as opposed to—say—passing a function name and having it call that after each row?

Ultimately, I'm going to change it, but do think the "INSERT-only trigger with an option to chain more triggers" is a more composable building block than a "INSERT trigger that can update a sequence, if supplied"
Yes
This is a little trickier. You might have overlooked that I'm not actually using a distributed table in my test. I'm just making a proxy to a normal table. This was because I wanted the test to check the INSERT logic alone and not actually set up a pg_shard table.

I find it best to have the behavior that if a feature breaks, only the unit tests related to that feature break. It helps diagnosis. By using a real distributed table, if table distribution breaks, a bunch of tests break, which doesn't aid in finding the problem.

Of course I'm still playing devil's advocate. I think the INSERT trigger is enough of a hack that we definitely want a test to actually test it against a distributed table. So I'll update that too, even though it'll complicate the test.

jasonmp85 · 2015-03-12T02:03:48Z

Since you didn't mention the naming, I'm assuming you're fine with the names of:

The shell script
The UDF

I'll make a checklist at the top of this PR to keep track of the things I need to do and when they're done it sounds like this is a .

I had one concern: because the UDF returns the proxy table name, I should be able to store it in a variable and interpolate it, or at least I would were I using COPY (server-side). From the documentation (emphasis mine):

The syntax of this command is similar to that of the SQL COPY command. All options other than the data source/destination are as specified for COPY. Because of this, special parsing rules apply to the \copy command. In particular, psql's variable substitution rules and backslash escapes do not apply.

This is why I'm directly interpolating the table name myself in the shell script, which I don't like. Though \copy can be slower than COPY (data must pass through the client), it has two benefits:

It understands relative paths (@rsolari had mentioned this being annoying with the previous solution)
You can easily connect from a remote machine and copy over a file from that machine rather than doing it directly on the master
Permissions are easier to understand (your user must be able to read the file, not the server)

So COPY might be something we explore if the performance of \copy is an issue, but otherwise the usability of the latter wins out so I'm sticking with it. I am going to update this to avoid so much direct interpolation, but wanted to call out the tradeoffs in either direction.

sumedhpathak · 2015-03-12T21:03:15Z

@jasonmp85 I am OK with the names of both the script and the UDF.

Decided it was easiest to understand. Added test for sequence functionality.

Lets us check partial failure modes of copy workflow.

Kills off trigger complication.

Casting to regclass requires that the input string already be properly quoted as an identifier. If there is a space or quote in the input to the cast, it fails. Explicitly quoting the identifier fixes this, and I added a test to verify the fix.

Oops.

Safer to let the PostgreSQL-aware environment do our quoting and do as little quoting ourselves as possible.

Reordered arguments and defined some constants to clarify calls.

@sumedhpathak

Add support for \copy command via script/trigger cr: @sumedhpathak

Initial COPY trigger work

9a652f2

Gets the COPY function and script in for initial review.

jasonmp85 reviewed Mar 9, 2015
View reviewed changes

jasonmp85 assigned sumedhpathak Mar 9, 2015

sumedhpathak reviewed Mar 9, 2015
View reviewed changes

jasonmp85 mentioned this pull request Mar 9, 2015

Generalize COPY-to-INSERT trigger #61

Closed

Fix formatting, add documentation and comments

df73ddf

Based on review discussions.

sumedhpathak reviewed Mar 10, 2015
View reviewed changes

Code review feedback

39467e5

Changing type of quotes, expanding 'trg_tmpl'.

jasonmp85 force-pushed the feature-copy_trigger-#61 branch from 5e2498c to 39467e5 Compare March 10, 2015 23:49

jasonmp85 added 11 commits March 10, 2015 17:52

Rework UDF

b0a8cb0

Makes it more standalone for direct use by users.

Use "unofficial bash strict mode"

64be963

See http://redsymbol.net/articles/unofficial-bash-strict-mode/

Use INCLUDING DEFAULTS during proxy table creation

90d5023

Without this, we insert NULL in missing columns. With this, unspecified columns will have the correct default values.

Kill sh suffix on copy script, move to bin

e399bb0

Looks nicer on the command line without suffix, and other PostgreSQL extensions use a bin folder for scripts, so we'll do the same.

Add Makefile line to copy script

c5b922d

PGXS installs scripts into the PostgreSQL install's bin directory.

Add unit test for proxy creation function

4a80048

Tests INSERT/COPY functionality on proxy table, once with writethrough off (default) and once with it on.

Switch to bash

6bfcaa4

Oops, meant to do this long ago.

Fix shell script

8766e5b

Changed suffix used by proxy generation. Should we use COPY instead of \copy to permit interpolation of colon variables?

Add argument validation to script

668ce60

Gives better error messages and statuses.

Relax file existence check

c5ac977

Using -f prohibits the use of pipes, devices, or sockets. While the latter two might be of questionable use, prohibiting the first prevents users from using named pipes or process substitutions, which could be very useful in certain instances.

sumedhpathak reviewed Mar 11, 2015
View reviewed changes

jasonmp85 changed the title ~~Add support for COPY command via script/trigger~~ Add support for \copy command via script/trigger Mar 12, 2015

jasonmp85 added 7 commits March 12, 2015 17:09

Switch back to sequence implementation

1e7df47

Decided it was easiest to understand. Added test for sequence functionality.

Add copy proxy test using distributed table

8649466

Lets us check partial failure modes of copy workflow.

Update script to use new latest UDF

d0a95ab

Kills off trigger complication.

Fix quoting bug in proxy creation

accdd99

Casting to regclass requires that the input string already be properly quoted as an identifier. If there is a space or quote in the input to the cast, it fails. Explicitly quoting the identifier fixes this, and I added a test to verify the fix.

Add 's' to list of allowed flags

ae59184

Oops.

Rewrite shell script to use psql variables

bcbe3f4

Safer to let the PostgreSQL-aware environment do our quoting and do as little quoting ourselves as possible.

Refactor usage impl. and call sites

cc096c7

Reordered arguments and defined some constants to clarify calls.

jasonmp85 added a commit that referenced this pull request Mar 13, 2015

Merge pull request #82 from citusdata/feature-copy_trigger-#61

8e35b7f

Add support for \copy command via script/trigger cr: @sumedhpathak

jasonmp85 merged commit 8e35b7f into develop Mar 13, 2015

jasonmp85 deleted the feature-copy_trigger-#61 branch March 13, 2015 04:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for \copy command via script/trigger #82

Add support for \copy command via script/trigger #82

jasonmp85 commented Mar 9, 2015

jasonmp85 Mar 9, 2015

jasonmp85 commented Mar 9, 2015

sumedhpathak commented Mar 9, 2015

jasonmp85 commented Mar 9, 2015

sumedhpathak Mar 9, 2015

jasonmp85 Mar 9, 2015

jasonmp85 commented Mar 9, 2015

sumedhpathak Mar 10, 2015

jasonmp85 Mar 10, 2015

sumedhpathak Mar 11, 2015

jasonmp85 Mar 12, 2015

sumedhpathak commented Mar 11, 2015

jasonmp85 commented Mar 12, 2015

jasonmp85 commented Mar 12, 2015

sumedhpathak commented Mar 12, 2015

Add support for \copy command via script/trigger #82

Add support for \copy command via script/trigger #82

Conversation

jasonmp85 commented Mar 9, 2015

jasonmp85 Mar 9, 2015

Choose a reason for hiding this comment

jasonmp85 commented Mar 9, 2015

sumedhpathak commented Mar 9, 2015

jasonmp85 commented Mar 9, 2015

sumedhpathak Mar 9, 2015

Choose a reason for hiding this comment

jasonmp85 Mar 9, 2015

Choose a reason for hiding this comment

jasonmp85 commented Mar 9, 2015

sumedhpathak Mar 10, 2015

Choose a reason for hiding this comment

jasonmp85 Mar 10, 2015

Choose a reason for hiding this comment

sumedhpathak Mar 11, 2015

Choose a reason for hiding this comment

jasonmp85 Mar 12, 2015

Choose a reason for hiding this comment

sumedhpathak commented Mar 11, 2015

jasonmp85 commented Mar 12, 2015

jasonmp85 commented Mar 12, 2015

sumedhpathak commented Mar 12, 2015