JAMES-3435 Cassandra: No longer rely on LWT for domain and users #255

chibenwa · 2020-10-23T04:37:17Z

No description provided.

rouazana · 2020-10-23T08:02:26Z

I'm not sure I see the value of this change. Managing domains and users is not a very frequent operation, so why do you expect a performance improvement from relaxing these constraints?

chibenwa · 2020-10-23T08:17:10Z

Managing domain and users is not a very recurring operation

Reading them is.

And SERIAL consistency is required in such cases, which implies a read of the Paxos system table.

That's why I do expect some performance improvement from relaxing these constraints.

That being said, it looks like the SERIAL consistency level was not even specified there (!!!), giving a false feeling of safety (we likely had untested concurrency issues with this validation behavior).

mbaechler

How do you know you are not overwriting an existing domain? From a user's point of view there's no such thing as an upsert command, only create/update, right?

mbaechler · 2020-10-27T10:28:53Z

...data-library/src/test/java/org/apache/james/domainlist/lib/DomainListIdempotentContract.java

-        assertThatThrownBy(() -> domainList().addDomain(DOMAIN_UPPER_5))
-            .isInstanceOf(DomainListException.class);
+
+        assertThat(domainList().getDomains().stream().filter(domain -> domain.equals(DOMAIN_5) || domain.equals(DOMAIN_UPPER_5)))


You don't know which domain to expect?

chibenwa · 2020-10-28T01:20:48Z

How do you know you are not overwriting an existing domain?

WebAdmin, the de facto API for creating users and domains, only exposes an UPSERT.

Why ask this question at the data layer and not at the presentation layer?
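
To make the create-versus-upsert distinction concrete, here is a minimal in-memory sketch. It is not James code: `DomainListSketch`, `createDomain`, `upsertDomain` and the sample domain names are hypothetical, and the set only stands in for the Cassandra table. In CQL terms, the "create" variant corresponds to a conditional `INSERT ... IF NOT EXISTS` (a lightweight transaction), while the "upsert" variant corresponds to a plain `INSERT`.

```java
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a domain list; not the James DomainList API.
class DomainListSketch {
    private final Set<String> domains = ConcurrentHashMap.newKeySet();

    // Assumption: domains are compared case-insensitively, so normalise before storing.
    private static String normalize(String domain) {
        return domain.toLowerCase(Locale.US);
    }

    // "Create" semantics: rejects a domain that is already present.
    // On Cassandra this check-then-write would need INSERT ... IF NOT EXISTS (an LWT).
    void createDomain(String domain) {
        if (!domains.add(normalize(domain))) {
            throw new IllegalStateException(domain + " already exists");
        }
    }

    // "Upsert" semantics: idempotent, adding the same domain twice is a no-op.
    // On Cassandra a plain INSERT without a condition behaves this way.
    void upsertDomain(String domain) {
        domains.add(normalize(domain));
    }

    Set<String> getDomains() {
        return Set.copyOf(domains);
    }

    public static void main(String[] args) {
        DomainListSketch list = new DomainListSketch();
        list.upsertDomain("domain5.tld");
        list.upsertDomain("DOMAIN5.TLD"); // no exception, still a single entry
        System.out.println(list.getDomains());
    }
}
```

With the upsert variant the caller cannot tell whether the domain already existed, which is the trade-off being discussed here.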

chibenwa · 2020-12-01T08:06:31Z

For one of our upcoming deployments, we are performing a load-testing campaign against a testing infrastructure. This load testing campaign aims at finding the limits of the aforementioned platform.

We succeeded in loading the James JMAP endpoint up to a breaking point at 5400 users (in isolation).

Above that number, evidence suggests that we are CPU-bound (requests )

From Cassandra's standpoint, there is high CPU usage (load of 10) that we linked to the use of lightweight transactions / Paxos for ACLs [1] [2] [3] [4]. A detailed analysis is in the references.

This is a topic I have been arguing about for months [5]; we need to make a firm decision and enforce it.

Infrastructure:

3x Cassandra nodes (8 cores, 32 GB RAM, 200 GB SSD)
4x James servers (4 cores, 8 GB RAM)
ElasticSearch servers: not measured.

Actions to conduct

Perform a test run with ACL Paxos turned off.
-> This aims at confirming the deleterious impact of its usage
-> Benoit & René are responsible for deploying and testing a modified instance of James on PRE-PROD, with ACL Paxos turned off
-> Benoit will continue lobbying AGAINST the usage of strong consistency in the community [5], which is overall a Cassandra bad practice and a misfit.
-> If conclusive, Benoit will present a data-race-proof ACL implementation on top of Cassandra leveraging CRDTs and eventual consistency.
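
For readers unfamiliar with CRDTs, the sketch below shows one possible shape such a structure could take: a last-write-wins map of rights, where merging two replicas keeps the entry with the highest timestamp. This is only an illustration under stated assumptions, not the ACL design announced above: `LwwAclSketch`, the `"user:right"` key format, the caller-supplied timestamps and the revoke-wins tie-break are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical last-write-wins (LWW) map of ACL entries; not James code.
class LwwAclSketch {
    // One entry per "user:right" key: the latest decision and when it was taken.
    static final class Entry {
        final long timestamp;
        final boolean granted;

        Entry(long timestamp, boolean granted) {
            this.timestamp = timestamp;
            this.granted = granted;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Higher timestamp wins; on a tie, revocation wins so every replica picks the same entry.
    static Entry latest(Entry a, Entry b) {
        if (a.timestamp != b.timestamp) {
            return a.timestamp > b.timestamp ? a : b;
        }
        return a.granted ? b : a;
    }

    // Local update: record a grant (true) or a revocation (false).
    void set(String userRight, boolean granted, long timestamp) {
        entries.merge(userRight, new Entry(timestamp, granted), LwwAclSketch::latest);
    }

    // Merge another replica's state; the result does not depend on merge order.
    void mergeFrom(LwwAclSketch other) {
        other.entries.forEach((key, entry) -> entries.merge(key, entry, LwwAclSketch::latest));
    }

    boolean isGranted(String userRight) {
        Entry entry = entries.get(userRight);
        return entry != null && entry.granted;
    }

    public static void main(String[] args) {
        LwwAclSketch a = new LwwAclSketch();
        LwwAclSketch b = new LwwAclSketch();
        a.set("bob:read", true, 1L);   // replica A grants
        b.set("bob:read", false, 2L);  // replica B revokes later
        a.mergeFrom(b);
        b.mergeFrom(a);
        System.out.println(a.isGranted("bob:read") == b.isGranted("bob:read"));
    }
}
```

Because `latest` is commutative, associative and idempotent, replicas converge without any coordination round, which is the property that makes Paxos unnecessary for this kind of data.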

Runs details

[6] [7] show a (successful!) run of the JMAP scenario alone on top of James.

[8] [9] show a run hitting a throughput limit (5400 simultaneous users, 320 req/s) beyond which performance degrades sharply. This is the system's breaking point.

References

[1] https://blog.pythian.com/lightweight-transactions-cassandra/ documents the CPU / memory / bandwidth impact of using LWT.

[2] dstat-cassandra.txt highlights CPU over-usage on a Cassandra node. This behavior is NOT NORMAL. Read-heavy workloads are not supposed to be CPU-bound.

[3] cassandra-tablestats.txt shows table usage. Our most used table, BY FAR, is the system.paxos table.

[4] compaction-history.txt highlights how often we compact the Paxos system table in comparison to other tables, further highlighting that it is a hot spot.

[5] Benoit's proposal to review lightweight transaction / Paxos usage in James: #255

[6] 4000-stats.png shows good statistics of a run with 4000 users
[7] 4000-latency.png shows how latency evolves with the number of users, up to 4000 users
[8] 6000-stats.png shows good statistics of a run with 6000 users
[9] 6000-latency.png shows how latency evolves with the number of users, up to 6000 users. The performance breakdown can be seen at 5400 users.

chibenwa · 2021-04-01T11:26:01Z

LWTs are only performed on modifications, which are rare for users and domains.

Let's close this for now.

JAMES-3435 Cassandra: No longer rely on LWT for domain and users

b1fe140

chibenwa mentioned this pull request Oct 23, 2020

[NO_REVIEW] JAMES-3435 Cassandra: No longer rely on LWT for domain and users linagora/james-project#3941

Closed

fixup! JAMES-3435 Cassandra: No longer rely on LWT for domain and users

c4bf19d

Arsnael approved these changes Oct 23, 2020

mbaechler reviewed Oct 27, 2020

chibenwa mentioned this pull request Dec 1, 2020

Turn off PAXOS on ACLs table linagora/james-project#4098

Closed

chibenwa closed this Apr 1, 2021
