Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for \copy command via script/trigger #82

Merged
merged 21 commits into from Mar 13, 2015

Conversation

jasonmp85
Copy link
Collaborator

The script provides an easy way for users to COPY to a distributed table. It accepts a filename and table name, prepares the table for COPY, performs the COPY, then outputs the number of rows copied. Flags are supported to enable various COPY OPTIONS in the underlying SQL statement.

This branch still needs Makefile changes, better documentation, and a unit test, but I wanted to get the code out there to kick off a review. I'll be pushing the remaining changes here as they come up.

Fixes #61

Code Review Tasks

  • Switch back to providing sequence to function
  • Make sequence argument optional
  • Add regression test for sequence behavior
  • Switch to distributed table for regression tests
  • Add COPY test with partial failure
  • Modify script to use sequence-based UDF
  • Remove as much shell interpolation as possible (favor psql variables)
  • Swap order of arguments in usage shell function

Gets the COPY function and script in for initial review.
options="${options}, HEADER true"
;;
h)
usage
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized I didn't write a usage function yet. Fixing in the afternoon.

@jasonmp85
Copy link
Collaborator Author

High-level description of the approach: prepare_distributed_table_for_copy accepts a table and creates a function specifically for that table based on its columns. Ideally we'd use INSERT INTO foo VALUES (($1).*) USING NEW or even INSERT INTO foo VALUES ($1.bar, $1.baz) USING NEW, but these types of field accesses aren't permitted by pg_shard (at the moment). It rejects them as not being constant expressions.

So the only type of statement that could work is: INSERT INTO foo VALUES ($1, $2) USING NEW.bar, NEW.baz. Since the dynamic text of that statement isn't in the INSERT string but in the function itself (in the USING clause), this necessitates generating a complete function. So that's what we do.

Once the function has been generated for the given table, it is installed as a trigger on a temporary table exactly like the target table. Then any COPY executing against the temporary table will be redirected to the distributed table in a manner accepted by pg_shard.

@sumedhpathak
Copy link
Contributor

Ideally we'd use INSERT INTO foo VALUES (($1).*) USING NEW or even INSERT INTO foo VALUES ($1.bar, $1.baz) USING NEW,

Is USING an INSERT keyword? I don't see it in the Postgres insert documentation?

@jasonmp85
Copy link
Collaborator Author

It's part of the EXECUTE statement in PL/pgSQL.

SELECT relname INTO STRICT table_name FROM pg_class WHERE oid = relation;
temp_table_name = format('%s_copy_facade', table_name);

SELECT array_agg(attname) INTO STRICT attr_names FROM pg_attribute WHERE attrelid = relation AND
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we only support COPY of all attributes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means the INSERT trigger will generate a statement with all user-visible attributes fully specified. If the overarching COPY does not specify all columns, one of two things could happen:

  • If NEW.missing_col is provided as NULL, we'd explicitly include a NULL in the INSERT. This could be wrong if there is a default value specified
  • If NEW.missing_col already replaces unspecified columns with their default values, the trigger would just explicitly specify the default value, which would be correct.

The copy shell script presently doesn't allow omitting columns: it's expected the CSV or whatnot has the same dimensions as the table. So this is kind of a moot point… I'll check about the NULL vs default value distinction.

@jasonmp85
Copy link
Collaborator Author

We don't frequently write this heavily in PL/pgSQL, so here is a brief primer of features used to help in review:

  • Variables must be listed in DECLARE blocks at the top of functions (though you can also have nested blocks in the function itself to have limited scope variables)
  • They can be marked constant to prevent reassignment
  • Double dollar signs can be used to delimit strings (and they can have a token within the dollar signs to allow nested quoting; quotes don't end until a matching dollar symbol is found)
  • format has %I and %L specifiers for "identifier" and "literal", respectively. Helps format things that need to appear in an SQL statement as an identifier (INSERT INTO identifier) or literal (VALUES ('literal')
  • SELECT INTO variable allows the result of a SELECT to be assigned to a variable. STRICT means we expect a single row
  • || is string concatenation
  • EXECUTE is basically eval: it executes a string which may contain positional parameters ($1, $2, etc.) bound at runtime to values supplied in a USING clause.

END;
$cti$ LANGUAGE plpgsql VOLATILE;
$$;
table_tmpl CONSTANT text := 'CREATE TEMPORARY TABLE %I (LIKE %s)';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this statement use quotes instead of $$? Could we be consistent?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I had a style that I used normal quotes if it fit on one line and $$ for multi-line things. Kind of like how you wouldn’t use a here doc for every string declaration in a shell script…

—Jason

On Mar 10, 2015, at 3:29 PM, sumedhpathak notifications@github.com wrote:

In pg_shard--1.1.sql #82 (comment):

  •   temp_table_name text;
    
  •   attr_names text[];
    
  •   attr_list text;
    
  •   param_list text;
    
  •   using_list text;
    
  •   insert_command text;
    
  •   func_tmpl CONSTANT text  := $$  CREATE FUNCTION pg_temp.copy_to_insert() RETURNS trigger
    
  •                                   AS $cti$
    
  •                                       BEGIN
    
  •                                           EXECUTE %L USING %s;
    
  •                                           PERFORM nextval(%L);
    
  •                                           RETURN NULL;
    
  •                                       END;
    
  •                                   $cti$ LANGUAGE plpgsql VOLATILE;
    
  •                               $$;
    
  •   table_tmpl CONSTANT text := 'CREATE TEMPORARY TABLE %I (LIKE %s)';
    
    Why does this statement use quotes instead of $$? Could we be consistent?


Reply to this email directly or view it on GitHub https://github.com/citusdata/pg_shard/pull/82/files#r26168240.

Changing type of quotes, expanding 'trg_tmpl'.
Makes it more standalone for direct use by users.
Now the shell script can handle counting successfully-INSERTed rows
itself. It just installs a trigger _after_ the proxy one and enables
write through. This allows it to see the successful rows, which it
counts and discards to prevent any writes to the temp table.
Without this, we insert NULL in missing columns. With this, unspecified
columns will have the correct default values.
Looks nicer on the command line without suffix, and other PostgreSQL
extensions use a bin folder for scripts, so we'll do the same.
PGXS installs scripts into the PostgreSQL install's bin directory.
Tests INSERT/COPY functionality on proxy table, once with writethrough
off (default) and once with it on.
Oops, meant to do this long ago.
Changed suffix used by proxy generation. Should we use COPY instead of
\copy to permit interpolation of colon variables?
Gives better error messages and statuses.
Using -f prohibits the use of pipes, devices, or sockets. While the
latter two might be of questionable use, prohibiting the first prevents
users from using named pipes or process substitutions, which could be
very useful in certain instances.
Default: 'public'
E_O_USAGE

exit $1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. Should the order of arguments be inverted? (Output device first, followed by exit status?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially just had one argument (status) but realized that commands I consider "well-behaved" output to stdout (and exit 0) when you ask them for help (i.e. pass -h) but output to stderr (with non-zero status) when you make some usage mistake. So that necessitated adding a second argument.

I agree that reversing the order makes more sense.

@sumedhpathak
Copy link
Contributor

I had three higher-level things:

  • I think I am preferring the previous solution of giving a sequence to the function, which it increments. Is there a reason to not do this? i.e., Is there a use-case for the write-through mode in the function apart from being able to install another trigger on the temp-table to count the rows? If not, then I think just passing a sequence makes sense. If no sequence is specified the function doesn't increment it maybe?

    I like that the function returns the tablename, so it makes it easier to de-couple the calls and run them manually.

  • Could you add a regression test which supplies a sequence (if you make the above change), and then select's curr_val to validate the function incremented the sequence?

  • Could you also add a regression test after COPY'ing some good rows followed by junk data to validate that the correct rows are recorded as inserted? (Maybe also followed by a select to validate that we inserted the previous 'x' good rows?

@jasonmp85
Copy link
Collaborator Author

On each of your high-level points:

  • I agree it is more straightforward to explain, but think it's more modular. It just routes the INSERT and knows how to do nothing else. Why pass a sequence? What's special about that, as opposed to—say—passing a function name and having it call that after each row?

    Ultimately, I'm going to change it, but do think the "INSERT-only trigger with an option to chain more triggers" is a more composable building block than a "INSERT trigger that can update a sequence, if supplied"

  • Yes

  • This is a little trickier. You might have overlooked that I'm not actually using a distributed table in my test. I'm just making a proxy to a normal table. This was because I wanted the test to check the INSERT logic alone and not actually set up a pg_shard table.

    I find it best to have the behavior that if a feature breaks, only the unit tests related to that feature break. It helps diagnosis. By using a real distributed table, if table distribution breaks, a bunch of tests break, which doesn't aid in finding the problem.

    Of course I'm still playing devil's advocate. I think the INSERT trigger is enough of a hack that we definitely want a test to actually test it against a distributed table. So I'll update that too, even though it'll complicate the test.

@jasonmp85
Copy link
Collaborator Author

Since you didn't mention the naming, I'm assuming you're fine with the names of:

  • The shell script
  • The UDF

I'll make a checklist at the top of this PR to keep track of the things I need to do and when they're done it sounds like this is a :shipit:.

I had one concern: because the UDF returns the proxy table name, I should be able to store it in a variable and interpolate it, or at least I would were I using COPY (server-side). From the documentation (emphasis mine):

The syntax of this command is similar to that of the SQL COPY command. All options other than the data source/destination are as specified for COPY. Because of this, special parsing rules apply to the \copy command. In particular, psql's variable substitution rules and backslash escapes do not apply.

This is why I'm directly interpolating the table name myself in the shell script, which I don't like. Though \copy can be slower than COPY (data must pass through the client), it has two benefits:

  • It understands relative paths (@rsolari had mentioned this being annoying with the previous solution)
  • You can easily connect from a remote machine and copy over a file from that machine rather than doing it directly on the master
  • Permissions are easier to understand (your user must be able to read the file, not the server)

So COPY might be something we explore if the performance of \copy is an issue, but otherwise the usability of the latter wins out so I'm sticking with it. I am going to update this to avoid so much direct interpolation, but wanted to call out the tradeoffs in either direction.

@jasonmp85 jasonmp85 changed the title Add support for COPY command via script/trigger Add support for \copy command via script/trigger Mar 12, 2015
@sumedhpathak
Copy link
Contributor

@jasonmp85 I am OK with the names of both the script and the UDF.

Decided it was easiest to understand. Added test for sequence
functionality.
Lets us check partial failure modes of copy workflow.
Kills off trigger complication.
Casting to regclass requires that the input string already be properly
quoted as an identifier. If there is a space or quote in the input to
the cast, it fails.

Explicitly quoting the identifier fixes this, and I added a test to
verify the fix.
Safer to let the PostgreSQL-aware environment do our quoting and do as
little quoting ourselves as possible.
Reordered arguments and defined some constants to clarify calls.
jasonmp85 added a commit that referenced this pull request Mar 13, 2015
Add support for \copy command via script/trigger

cr: @sumedhpathak
@jasonmp85 jasonmp85 merged commit 8e35b7f into develop Mar 13, 2015
@jasonmp85 jasonmp85 deleted the feature-copy_trigger-#61 branch March 13, 2015 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generalize COPY-to-INSERT trigger
2 participants