
Performance meta issue #39

Closed

plexus opened this issue Apr 30, 2018 · 7 comments

Comments

@plexus
Member

plexus commented Apr 30, 2018

While the site is usable at the moment, pages often take longer to load than is comfortable. This is a meta issue to track some of the performance issues and improvements.

What's happened so far

  • Use a CDN, not just for assets but also for rendered HTML (Cloudflare)
  • Add caching headers to all responses to indicate pages can be cached up to 30 days (see the sketch after this list)
  • Improve markdown parser performance
  • Increase the memory available to the datomic transactor and peer
  • Add Memcached to Datomic
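
For reference, those caching headers can be set with a small piece of Ring middleware roughly like this (a sketch; the actual middleware in the app may look different):

;; Sketch of Ring middleware that adds a Cache-Control header so
;; responses can be cached for up to 30 days (2592000 seconds).
;; Placeholder code, not necessarily what the app does.
(defn wrap-cache-control [handler]
  (fn [request]
    (let [response (handler request)]
      (assoc-in response [:headers "Cache-Control"]
                "public, max-age=2592000"))))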

With that, the performance of pages that have been visited somewhat recently is spiffy, but cold pages still take a pretty long time. I think this is mostly due to Datomic queries being slow.

Things to try/do

  • Get some benchmarks in production to get a better sense of where the time is spent
    • maybe collect ongoing stats regarding page and query load times to see where to focus
  • Save rendered pages to disk as HTML files, so nginx will pick them up
    • this would be cool because it's a cache that survives restarts/reboots (see the sketch after this list)
  • Share a Datomic db instance across requests. The DB only really changes when we do an import.
    • at the moment the peer's memory usage stays well below its allotted memory; this should prevent it from garbage-collecting data it will need later on.
  • Further tweak the memory settings for the datomic transactor / peer / memcached. We have 12GB available and I'm not sure how best to allocate it. Memcached currently gets 4GB; the peer object cache is set to 7GB but seems to stay around 4GB.
  • Add our own caching of certain queries, or preload certain data
    • in particular the list of channels and the number of messages per channel per day could easily be computed once and held in memory.
  • Further tweak/optimize existing queries
  • Review Datomic schema. I think we have all the important indexes in there, and we're not doing stuff like plain text indexing, but I'd love to have a second opinion on this.
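
As a rough sketch of the "save rendered pages to disk" idea, something like this could write each rendered page under nginx's docroot so it gets served as a static file (the docroot path and render-page function here are placeholders, not the app's actual code):

;; Sketch: write a rendered channel/day page to disk so nginx serves it
;; directly. docroot and render-page are placeholders.
(require '[clojure.java.io :as io])

(def docroot "/var/www/clojurians-log")

(defn dump-page! [render-page channel-name day]
  (let [file (io/file docroot channel-name (str day ".html"))]
    (io/make-parents file)
    (spit file (render-page channel-name day))))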
@Odie
Contributor

Odie commented May 5, 2018

I've mostly been getting a "504 Gateway timed out" error from cloudflare whenever I hit a production link. How difficult would it be to bring up a clone of the production environment to test things on? Is it just a matter of running your ansible deployment script?

@plexus
Member Author

plexus commented May 6, 2018

Yes, the ansible script should contain everything that's needed. The hardware we're using is this OVH "VPS Cloud RAM 2" https://www.ovh.com/world/vps/vps-cloud-ram.xml

I've mostly been getting a "504 Gateway timed out" error

At the moment the site is keeping up again; I finally managed to get it to actually make use of the RAM it has. Turns out it's not enough to tweak the Datomic parameters, you also need to tell the JVM it's OK to use more heap space than the default 2GB.

I also dumped all the index pages as HTML files, so they are served directly from Nginx, and memoized all queries as well as the datomic DB.

;; Memoize the query functions and datomic.api/db so repeated requests
;; reuse their results and the whole app sees the same db value.
(doseq [v [#'clojurians-log.db.queries/user-names
           #'clojurians-log.db.queries/channel-thread-messages-of-day
           #'clojurians-log.db.queries/channel
           #'clojurians-log.db.queries/channel-id-map
           #'clojurians-log.db.queries/channel-list
           #'clojurians-log.db.queries/channel-days
           #'clojurians-log.db.queries/channel-day-messages
           #'datomic.api/db]]
  (alter-var-root v (fn [f] (memoize f))))

This way the whole app always uses the same db instance.

I've also been keeping an eye on the server logs. The thing is, we're getting very little real traffic, maybe a page per minute, but several bots are trying to crawl the site, which it's having a hard time with: Google, Yandex, Semrush and moz.com. Those last two are marketing tools, which I've disallowed with a robots.txt; that might take a while to take effect. I also added rel="nofollow" to the timestamp links, because if Google and Yandex are going to crawl each individual message they'll never finish.

I'll keep an eye on it over the coming days and see if we stay up. I'm starting to think we should probably go back to an approach where we use the app to generate HTML files up front.

@Odie
Contributor

Odie commented May 7, 2018

So, I tried running the ansible script against a local vm.

2 quick things:

  1. vars/clojurians_log_secrets.yml is encrypted
    Can you post an example/redacted version of this file? It's not clear if you're using mostly defaults or if there are vars in the file that might affect the execution of the various ansible roles being run.

  2. Repo "github.com/plexus/clojurians-log" is not public
    The plexus.clojurians-log role attempts to check out the existing logs by cloning said git repo, but it does not appear to be publicly accessible.

@plexus
Member Author

plexus commented May 7, 2018

These are the keys in clojurians_log_secrets:

datomic_license_key: 
datomic_pro_download_key: 
datomic_pro_email: 
database_password: 
slack_api_token: 
clojure_app_ssh_id_rsa_pub: 
clojure_app_ssh_id_rsa: 

The database password can be anything. The SSH key is used to access the github repo with the logs. You'll need the Slack API token if you want to do a full import, since for that it first needs to fetch users and channels before it can import the messages.

@Odie
Contributor

Odie commented May 16, 2018

So, I looked into the perf issue a bit.

I didn't end up trying to set up a test environment with ansible. What I did do was import all the log data into the local dev environment, which means everything runs off an in-memory Datomic database.

Here's some of the timing data when visiting: http://localhost:4983/figwheel/2018-04-05

Process                   Time
log-route                 2356 ms   (100%)
messages query             8.7 ms   (0.3%)
thread-messages query     2327 ms   (98%)
log page templating         17 ms   (0.7%)

Without the thread-messages query, page generation would take a quite reasonable ~29ms.

Though this isn't a measurement off an exact replica of the production environment, it still lets us spot the bottleneck: 98% of the time is spent querying thread messages. Looking at the query itself, clojurians-log.db.queries/channel-thread-messages-of-day, it looks like the query basically has to perform some kind of processing on every message entity in the db before any filtering can be done. This means the query is only going to get slower as the number of messages in the db grows.

The Fix

I think we should store references to the child messages on the parent message. That way, retrieving all child messages should be fast. I guess we need to update the schema with something like:

   #:db {:ident       :message/child-messages
         :valueType   :db.type/ref
         :cardinality :db.cardinality/many}

The import code will have to change a bit also.
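
Roughly, the import could assert an extra datom per threaded reply, along these lines (a sketch; it assumes :message/ts is a unique identity attribute and that the raw Slack message carries ts and thread_ts keys, which may not match the actual import code):

;; Sketch: when importing a threaded reply, also add a ref from the
;; parent message to the child via lookup refs on :message/ts.
;; The :ts / :thread_ts keys and the uniqueness of :message/ts are
;; assumptions about the real schema and import data.
(defn thread-ref-tx [{:keys [ts thread_ts]}]
  (when (and thread_ts (not= ts thread_ts))
    [[:db/add [:message/ts thread_ts]
      :message/child-messages [:message/ts ts]]]))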

@Odie
Contributor

Odie commented May 19, 2018

So, the previously proposed change requires:

  1. schema migration
  2. reimport and/or reprocess all existing messages

It's a little hard to work on this without properly setting up a test environment. But it appears there is a fix that may be "good enough" with minimal code changes. The basic idea is to write a new query that retrieves all thread messages based on the :message/ts of the thread parents.
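
The new query would look roughly like this (a sketch; the :message/thread-ts attribute name is an assumption and the actual query may differ):

;; Sketch: fetch all thread replies for a set of parent timestamps.
;; :message/thread-ts is an assumed attribute name.
(require '[datomic.api :as d])

(defn thread-messages-by-parent-ts [db parent-tss]
  (d/q '[:find [(pull ?msg [*]) ...]
         :in $ [?parent-ts ...]
         :where [?msg :message/thread-ts ?parent-ts]]
       db parent-tss))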

This change improves the query performance significantly.

Process                   Time
log-route                 123 ms   (100%)
messages query             10 ms   (8%)
thread-messages query      92 ms   (73%)

With this change the thread-messages query is roughly 25x faster, bringing the overall response time down to a reasonable range.

Will open a PR when the code is cleaned up.

@Odie Odie mentioned this issue May 19, 2018
@plexus
Member Author

plexus commented Sep 11, 2019

Performance is pretty OK now. It turned out there were a few queries that gathered stats across channels, or across all dates for a single channel, and these were really slow. I addressed it by running these queries regularly (once every hour) and caching the results.
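
For reference, the approach is roughly the following (a sketch, not the actual code; stats-query is a placeholder for the slow query function):

;; Sketch: run the slow stats query once an hour and keep the latest
;; result in an atom; request handlers read the atom instead of hitting
;; Datomic. stats-query is a placeholder name.
(require '[datomic.api :as d])
(import '(java.util.concurrent Executors TimeUnit))

(defonce stats-cache (atom nil))

(defn start-stats-refresher! [conn stats-query]
  (doto (Executors/newSingleThreadScheduledExecutor)
    (.scheduleAtFixedRate #(reset! stats-cache (stats-query (d/db conn)))
                          0 1 TimeUnit/HOURS)))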

@plexus plexus closed this as completed Sep 11, 2019