
Performance meta issue #39

Closed

plexus opened this issue Apr 30, 2018 · 7 comments

Comments

@plexus
Member

plexus commented Apr 30, 2018

While the site is usable at the moment, pages often take longer to load than is comfortable. This is a meta issue to track some of the performance issues and improvements.

What's happened so far

  • Use a CDN, not just for assets but also for rendered HTML (Cloudflare)
  • Add caching headers to all responses to indicate pages can be cached up to 30 days (see the sketch after this list)
  • Improve markdown parser performance
  • Increase the memory available to the datomic transactor and peer
  • Add Memcached to Datomic
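
For reference, those caching headers can be set with a small piece of Ring middleware roughly like this (a sketch; the actual middleware in the app may look different):

;; Sketch of Ring middleware that adds a Cache-Control header so
;; responses can be cached for up to 30 days (2592000 seconds).
;; Placeholder code, not necessarily what the app does.
(defn wrap-cache-control [handler]
  (fn [request]
    (let [response (handler request)]
      (assoc-in response [:headers "Cache-Control"]
                "public, max-age=2592000"))))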

With that, the performance of pages that have been visited somewhat recently is spiffy, but cold pages still take a pretty long time. I think this is mostly due to Datomic queries being slow.

Things to try/do

  • Get some benchmarks in production to get a better sense of where the time is spent
    • maybe collect ongoing stats regarding page and query load times to see where to focus
  • Save rendered pages to disk as HTML files, so nginx will pick them up
    • this would be cool because it's a cache that survives restarts/reboots (see the sketch after this list)
  • Share a Datomic db instance across requests. The DB only really changes when we do an import.
    • at the moment the peer's memory usage stays well below its allotted memory; this should prevent it from garbage-collecting data it will need later on.
  • Further tweak the memory settings for the datomic transactor / peer / memcached. We have 12GB available and I'm not sure how best to allocate it. Memcached currently gets 4GB; the peer object cache is set to 7GB but seems to stay around 4GB.
  • Add our own caching of certain queries, or preload certain data
    • in particular the list of channels and the number of messages per channel per day could easily be computed once and held in memory.
  • Further tweak/optimize existing queries
  • Review Datomic schema. I think we have all the important indexes in there, and we're not doing stuff like plain text indexing, but I'd love to have a second opinion on this.
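
As a rough sketch of the "save rendered pages to disk" idea, something like this could write each rendered page under nginx's docroot so it gets served as a static file (the docroot path and render-page function here are placeholders, not the app's actual code):

;; Sketch: write a rendered channel/day page to disk so nginx serves it
;; directly. docroot and render-page are placeholders.
(require '[clojure.java.io :as io])

(def docroot "/var/www/clojurians-log")

(defn dump-page! [render-page channel-name day]
  (let [file (io/file docroot channel-name (str day ".html"))]
    (io/make-parents file)
    (spit file (render-page channel-name day))))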
@Odie
Contributor

Odie commented May 5, 2018

I've mostly been getting a "504 Gateway timed out" error from cloudflare whenever I hit a production link. How difficult would it be to bring up a clone of the production environment to test things on? Is it just a matter of running your ansible deployment script?

@plexus
Member Author

plexus commented May 6, 2018

Yes, the ansible script should contain everything that's needed. The hardware we're using is this OVH "VPS Cloud RAM 2" https://www.ovh.com/world/vps/vps-cloud-ram.xml

I've mostly been getting a "504 Gateway timed out" error

At the moment the site is keeping up again; I finally managed to get it to actually make use of the RAM it has. Turns out it's not enough to tweak the Datomic parameters, you also need to tell the JVM it's OK to use more heap space than the default 2GB.

I also dumped all the index pages as HTML files, so they are served directly from Nginx, and memoized all queries as well as the datomic DB.

;; Memoize the query functions and datomic.api/db so repeated requests
;; reuse their results and the whole app sees the same db value.
(doseq [v [#'clojurians-log.db.queries/user-names
           #'clojurians-log.db.queries/channel-thread-messages-of-day
           #'clojurians-log.db.queries/channel
           #'clojurians-log.db.queries/channel-id-map
           #'clojurians-log.db.queries/channel-list
           #'clojurians-log.db.queries/channel-days
           #'clojurians-log.db.queries/channel-day-messages
           #'datomic.api/db]]
  (alter-var-root v (fn [f] (memoize f))))

This way the whole app always uses the same db instance.

I've also been keeping an eye on the server logs. The thing is, we're getting very little real traffic, maybe a page per minute, but several bots are trying to crawl the site, which it's having a hard time with: Google, Yandex, Semrush and moz.com. Those last two are marketing tools, which I've disallowed with a robots.txt; that might take a while to take effect. I also added rel="nofollow" to the timestamp links, because if Google and Yandex are going to crawl each individual message they'll never finish.

I'll keep an eye on it over the coming days and see if we stay up. I'm starting to think we should probably go back to an approach where we use the app to generate HTML files up front.

@Odie
Contributor

Odie commented May 7, 2018

So, I tried running the ansible script against a local vm.

2 quick things:

  1. vars/clojurians_log_secrets.yml is encrypted
    Can you post an example/redacted version of this file? It's not clear if you're using mostly defaults or if there are vars in the file that might affect the execution of the various ansible roles being run.

  2. Repo "github.com/plexus/clojurians-log" is not public
    The plexus.clojurians-log role attempts to check out the existing logs by cloning said git repo, but it does not appear to be publicly accessible.

@plexus
Member Author

plexus commented May 7, 2018

These are the keys in clojurians_log_secrets:

datomic_license_key: 
datomic_pro_download_key: 
datomic_pro_email: 
database_password: 
slack_api_token: 
clojure_app_ssh_id_rsa_pub: 
clojure_app_ssh_id_rsa: 

The database password can be anything. The SSH key is used to access the github repo with the logs. You'll need the Slack API token if you want to do a full import, since for that it first needs to fetch users and channels before it can import the messages.

@Odie
Contributor

Odie commented May 16, 2018

So, I looked into the perf issue a bit.

I didn't end up trying to set up a test environment with ansible. What I did do was import all the log data into the local dev environment, which means everything runs off an in-memory Datomic database.

Here's some of the timing data when visiting: http://localhost:4983/figwheel/2018-04-05

Process                   Time
log-route                 2356 ms   (100%)
messages query             8.7 ms   (0.3%)
thread-messages query     2327 ms   (98%)
log page templating         17 ms   (0.7%)

Without the thread-messages query, page generation would take a quite reasonable ~29ms.

Though this isn't a measurement off an exact replica of the production environment, it still lets us spot the bottleneck: 98% of the time is spent querying thread messages. Looking at the query itself, clojurians-log.db.queries/channel-thread-messages-of-day, it looks like the query basically has to perform some kind of processing on every message entity in the db before any filtering can be done. This means the query is only going to get slower as the number of messages in the db grows.

The Fix

I think we should store references to the child messages on the parent message. That way, retrieving all child messages should be fast. I guess we need to update the schema with something like:

   #:db {:ident       :message/child-messages
         :valueType   :db.type/ref
         :cardinality :db.cardinality/many}

The import code will have to change a bit also.
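
Roughly, the import could assert an extra datom per threaded reply, along these lines (a sketch; it assumes :message/ts is a unique identity attribute and that the raw Slack message carries ts and thread_ts keys, which may not match the actual import code):

;; Sketch: when importing a threaded reply, also add a ref from the
;; parent message to the child via lookup refs on :message/ts.
;; The :ts / :thread_ts keys and the uniqueness of :message/ts are
;; assumptions about the real schema and import data.
(defn thread-ref-tx [{:keys [ts thread_ts]}]
  (when (and thread_ts (not= ts thread_ts))
    [[:db/add [:message/ts thread_ts]
      :message/child-messages [:message/ts ts]]]))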

@Odie
Contributor

Odie commented May 19, 2018

So, the previously proposed change requires:

  1. schema migration
  2. reimport and/or reprocess all existing messages

It's a little hard to work on this without properly setting up a test environment. But it appears there is a fix that may be "good enough" with minimal code changes. The basic idea is to write a new query that retrieves all thread messages based on the :message/ts of the thread parents.
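
The new query would look roughly like this (a sketch; the :message/thread-ts attribute name is an assumption and the actual query may differ):

;; Sketch: fetch all thread replies for a set of parent timestamps.
;; :message/thread-ts is an assumed attribute name.
(require '[datomic.api :as d])

(defn thread-messages-by-parent-ts [db parent-tss]
  (d/q '[:find [(pull ?msg [*]) ...]
         :in $ [?parent-ts ...]
         :where [?msg :message/thread-ts ?parent-ts]]
       db parent-tss))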

This change improves the query performance significantly.

Process                   Time
log-route                 123 ms   (100%)
messages query             10 ms   (8%)
thread-messages query      92 ms   (73%)

With this change the thread-messages query is roughly 25x faster, bringing the overall response time down to a reasonable range.

Will open a PR when the code is cleaned up.

@Odie Odie mentioned this issue May 19, 2018
@plexus
Member Author

plexus commented Sep 11, 2019

Performance is pretty OK now. It turned out there were a few queries that gathered stats across channels, or across all dates for a single channel, and these were really slow. I addressed it by running these queries regularly (once every hour) and caching the results.
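
For reference, the approach is roughly the following (a sketch, not the actual code; stats-query is a placeholder for the slow query function):

;; Sketch: run the slow stats query once an hour and keep the latest
;; result in an atom; request handlers read the atom instead of hitting
;; Datomic. stats-query is a placeholder name.
(require '[datomic.api :as d])
(import '(java.util.concurrent Executors TimeUnit))

(defonce stats-cache (atom nil))

(defn start-stats-refresher! [conn stats-query]
  (doto (Executors/newSingleThreadScheduledExecutor)
    (.scheduleAtFixedRate #(reset! stats-cache (stats-query (d/db conn)))
                          0 1 TimeUnit/HOURS)))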

@plexus plexus closed this as completed Sep 11, 2019