Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow modification commands which don't contain unpure functions #240

Closed
citus-github-bot opened this issue Feb 4, 2016 · 24 comments
Closed
Labels

Comments

@citus-github-bot
Copy link

Issue by gmcquillan
Tuesday May 05, 2015 at 22:53 GMT
Originally opened as citusdata/pg_shard#108


I'm experimenting with using HyperLogLog (HLL) data types in some columns. One problem with these is that they take up quite a lot more space than a BIGINT. pg_shard potentially allays a lot of those issues. The extension that provides the HLL datatype is this one by aggregateknowledge. These two extensions seem complimentary for warehousing purposes.

Surprisingly, these data types work with sharded tables for most types of reads, but not for writes (see below). When I attempt an update like so:

update test_hll_shard set users = users||hll_hash_text('foobar') where date = date('2015-01-08');

I get:

ERROR:  cannot plan sharded modification containing values which are not constants or constant expressions

I can sort of work around this by setting the literal bytes in this field. Which works fine -- but adding HLL values requires a read and a write. Since pg_shard (understandably) doesn't allow for more than a single statement transaction, this leaves my use-case vulnerable to race conditions in multi-writer environments.

Since this function is available on the workers, and is deterministic based on the value of the existing row and the new HLL value to be added, there shouldn't be any issue with dispatching this expression through to the workers.

Is there a hard limitation preventing pg_shard from dispatching modifications for non-constant expressions?

@citus-github-bot citus-github-bot added this to the Next milestone Feb 4, 2016
@citus-github-bot
Copy link
Author

Comment by gmcquillan
Wednesday May 06, 2015 at 18:40 GMT


Similarly, as mentioned in the pg_shard-users list, subqueries might be another work around to the function issue, but they are also not supported yet. (e.g. perform the function on data from a different table on the master and insert only constant values into the sharded table).

@citus-github-bot
Copy link
Author

Comment by jasonmp85
Wednesday May 13, 2015 at 18:31 GMT


Since this function is available on the workers, and is deterministic based on the value of the existing row and the new HLL value to be added, there shouldn't be any issue with dispatching this expression through to the workers.

Are you sure this issue is due to the hll_hash_text function and not the date one? The HLL function is marked as IMMUTABLE, which means it should be folded into a constant before being pushed to the workers…

This might be an argument that the error message should be clarified to say which expression is non-constant.

Is the HLL extension installed on the master? Does it know the details of these functions?

@citus-github-bot
Copy link
Author

Comment by jasonmp85
Wednesday May 13, 2015 at 18:33 GMT


Is there a hard limitation preventing pg_shard from dispatching modifications for non-constant expressions?

At the moment, yes. pg_shard performs UPDATEs by grabbing a lock for a shard and updating all replicas of that shard. This means there is no first-class replication happening but rather a replica-by-replica execution of the query at hand. If that query had e.g. a now() call, the output might differ among replicas, resulting in data divergence.

So unless we can fold an expression into a constant expression, modifications are unsafe if replication is in play.

@citus-github-bot
Copy link
Author

Comment by gmcquillan
Wednesday May 13, 2015 at 20:26 GMT


To usefully combine or update HLL datastructures, you need to call hll_add, which is not immutable.

I didn't think about replication. That's interesting. I assumed that each shard was responsible for its own replication (streaming WAL logs), not that it was something that pg_shard handled for me (if I'm understanding you correctly).

Thanks for taking the time to correct my misunderstanding.

@citus-github-bot
Copy link
Author

Comment by jasonmp85
Wednesday May 13, 2015 at 20:33 GMT


Streaming replication replicates the entire database which is incompatible with our use of many small "logical" shards. Because of this we've been keeping an eye on BDR/UDR, but don't have a timeline or even any concrete designs at this point.

We could probably expand the use of functions to those which are STABLE so long as we can ensure the snapshots on each replica are identical when we run the command… We'll definitely take your input into consideration during the next cycle, as we like the HLL extension a lot, too and want it to work well with our software.

@citus-github-bot
Copy link
Author

Comment by gmcquillan
Wednesday May 13, 2015 at 20:57 GMT


Yeah, one really, really nice property of storing data as HLL data types is that mutation is idempotent, which has a nice resiliency in distributed systems.

@citus-github-bot
Copy link
Author

Comment by jasonmp85
Thursday May 14, 2015 at 21:35 GMT


Hm… so I just checked hll_add and it is also listed as IMMUTABLE

We find the HLL extension very useful for a number of our customers, so I want to make sure it's working well with pg_shard. I'm curious about the specifics of what you're doing, especially since you've earned yourself the distinction of being the first external party to open a pull request against pg_shard 😁 🏆!

Could you shoot me an email at engage at citusdata dot com to have a quick chat about the problem you're working on?

@citus-github-bot
Copy link
Author

Comment by jasonmp85
Monday Aug 17, 2015 at 18:58 GMT


OK, so I've investigated what's going on here. Previously we had the assumption that IMMUTABLE functions would be collapsed into constants during our call to eval_const_expressions. Because of that, we expected all allowable function calls to have been reduced to constant expressions by the time our planner gets a hold of a query.

This is obviously true when transforming something like 2 + 2 into a constant 4. But if the query were something like UPDATE table SET counter = counter + 1 then the presence of a Var (counter) would keep eval_const_expressions from doing its thing (rightly so).

I think we can relax this check by replacing it with a call to contain_mutable_functions instead, allowing immutable functions to be pushed down to remote nodes even if they can't be fully reduced at the master.

Because UPDATEs execute against a single shard and its replicas, and because UPDATE is one-at-a-time (due to locking), it's safe to do this pushdown.

We'll look into this during a future cycle.

@ozgune
Copy link
Contributor

ozgune commented Mar 16, 2016

@jasonmp85 - how long would it take to incorporate contain_mutable_functions to resolve this issue?

(Also, we're tracking performance / locking improvements for UPDATE and DELETE commands in #370.)

@anarazel
Copy link
Contributor

I think we should fix this by evaluating all functions on the master,
because that'll allow sequences, now(), et al. to work. At least until
we have multi-master/masterless. That seems like a big advantage.

@jasonmp85
Copy link
Contributor

@anarazel — I agree, but that would be issue #213 :-D.

@anarazel
Copy link
Contributor

On 2016-03-17 19:39:52 -0700, Jason Petersen wrote:

@anarazel — I agree, but that would be issue #213 :-D.

Yea, but going for a borked architecture because it's separate tickets
doesn't seem like the right approach ;)

@samay-sharma
Copy link
Contributor

Ran into this during a customer engagement. The use case was:

UPDATE table SET updated_at = now() WHERE ....

@jasonmp85
Copy link
Contributor

@samay-sharma: I'm not sure that applies here, as now is decidedly VOLATILE.

@anarazel
Copy link
Contributor

anarazel commented Apr 12, 2016 via email

@samay-sharma
Copy link
Contributor

@jasonmp85 : Oops, Should I move my comment to #213, then ?

@lithp
Copy link
Contributor

lithp commented Apr 21, 2016

This just came up in engage@, someone wanted to run a query like

UPDATE bobby_tables SET blob = replace(blob, 'hello', 'goodbye') WHERE part_key = 10;

The workaround I suggested was to use an upsert:

INSERT INTO bobby_tables (part_key) VALUES (10) ON CONFLICT (part_key) DO UPDATE SET blob = replace(bobby_tables.blob, 'hello', 'goodbye');

Which obviously isn't ideal.

Here's how upsert handles it.

@jasonmp85
Copy link
Contributor

I assume upsert doesn't actually check this part of the query? Can you use random in an upsert? That seems problematic.

@jasonmp85 jasonmp85 self-assigned this Apr 21, 2016
@marcocitus
Copy link
Member

Upsert uses !contain_mutable_functions whereas update uses !IsA(targetEntry->expr, Const).

postgres=# INSERT INTO test VALUES ('foo','bar') ON CONFLICT (key) DO UPDATE SET value = random();
ERROR:  cannot plan sharded modification containing values which are not constants or constant expressions
postgres=# INSERT INTO test VALUES ('foo','bar') ON CONFLICT (key) DO UPDATE SET value = test.key;
INSERT 0 1

@jasonmp85
Copy link
Contributor

Ah, ok, good. Yeah, we should just fix this.

@lithp
Copy link
Contributor

lithp commented Apr 21, 2016

I walked to Sumedh and he seemed okay with fixing this given that it
wouldn't take too long. With my current understanding, this might take up
to a day or two?

Does that sound like a reasonable estimate to someone who knows more than
me?

21 Nis 2016 Per 20:03 tarihinde Jason Petersen notifications@github.com
şunu yazdı:

Ah, ok, good. Yeah, we should just fix this.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#240 (comment)

@marcocitus
Copy link
Member

Sounds reasonable. It's probably 5 lines of code plus a bunch of testing.

@anarazel
Copy link
Contributor

Yes, sounds reasonable here.

On April 21, 2016 1:57:58 PM CDT, Brian Cloutier notifications@github.com wrote:

I walked to Sumedh and he seemed okay with fixing this given that it
wouldn't take too long. With my current understanding, this might take
up
to a day or two?

Does that sound like a reasonable estimate to someone who knows more
than
me?

21 Nis 2016 Per 20:03 tarihinde Jason Petersen
notifications@github.com
şunu yazdı:

Ah, ok, good. Yeah, we should just fix this.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub

#240 (comment)


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#240 (comment)

Sent from my Android device with K-9 Mail. Please excuse my brevity.

@marcocitus
Copy link
Member

Once we solve this issue we can also provide a sneaky workaround for #211 by defining a PL/pgSQL wrapper for nextval that is marked as immutable. In that case it will get evaluated on the master as if it was a constant expression.

@lithp lithp changed the title Allow IMMUTABLE functions in modification commands Allow modification commands which don't contain unpure functions Apr 22, 2016
@lithp lithp removed the in progress label Apr 28, 2016
@jasonmp85 jasonmp85 removed their assignment Apr 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants