Performance meta issue #39
I've mostly been getting a "504 Gateway Timeout" error from Cloudflare whenever I hit a production link. How difficult would it be to bring up a clone of the production environment to test things on? Is it just a matter of running your ansible deployment script?
Yes, the ansible script should contain everything that's needed. The hardware we're using is OVH's "VPS Cloud RAM 2": https://www.ovh.com/world/vps/vps-cloud-ram.xml
At the moment the site is keeping up again; I finally managed to get it to actually make use of the RAM it has. It turns out it's not enough to tweak the Datomic parameters, you also need to tell the JVM it's OK to use more heap space than the default 2GB. I also dumped all the index pages as HTML files, so they are served directly from Nginx, and memoized all queries as well as the Datomic DB:

```clojure
(doseq [v [#'clojurians-log.db.queries/user-names
           #'clojurians-log.db.queries/channel-thread-messages-of-day
           #'clojurians-log.db.queries/channel
           #'clojurians-log.db.queries/channel-id-map
           #'clojurians-log.db.queries/channel-list
           #'clojurians-log.db.queries/channel-days
           #'clojurians-log.db.queries/channel-day-messages
           #'datomic.api/db]]
  (alter-var-root v (fn [f] (memoize f))))
```

This way the whole app always uses the same db instance.

I've also been keeping an eye on the server logs. The thing is, we're getting an extremely low amount of real traffic, maybe a page per minute, but several bots are trying to crawl the site, which it's having a hard time with: Google, Yandex, Semrush, and moz.com. Those last two are marketing tools, which I've disallowed with a robots.txt; that might take a while to take effect.

I'll keep an eye on it the coming days, see if we stay up. I'm starting to think that we probably should go back to an approach where we use the app to generate HTML files up front.
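For reference, raising the heap ceiling is a one-line JVM option. A minimal sketch, assuming a Leiningen project (the heap values are illustrative, not the production setting):

```clojure
;; project.clj (sketch) -- the heap values here are illustrative.
(defproject clojurians-log "0.1.0"
  :jvm-opts ["-Xms1g" "-Xmx3g"]) ; allow the heap to grow past the 2GB default
```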
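And for the crawler blocking, a robots.txt along these lines would do it (the user-agent strings below are the crawlers' commonly documented names; the actual file is an assumption):

```
# Sketch: refuse the two marketing crawlers mentioned above.
User-agent: SemrushBot
Disallow: /

# Moz's crawler
User-agent: rogerbot
Disallow: /
```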
So, I tried running the ansible script against a local VM. Two quick things:
These are the keys in clojurians_log_secrets:

The database password can be anything. The SSH key is used to access the GitHub repo with the logs. You'll need the token if you want to do a full import, since for that it needs to first fetch users and channels before it can import the messages.
So, looked into the perf issue a bit. I didn't end up trying to set up a test environment with ansible. What I did do was import all the log data into the local dev environment. This means that everything is just off of an in-memory Datomic database. Here's some of the timing data when visiting: http://localhost:4983/figwheel/2018-04-05
Without the thread-messages query, page generation would take a quite reasonable ~29ms. Though this isn't a measurement off an exact replica of the production environment, it still lets us spot the bottleneck: 98% of the time is spent querying thread messages.

The fix

I think we should store references to the child messages on the parent message. This way, retrieving all child messages should be fast. I guess we need to update the schema with something like:
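A hedged guess at the shape of that attribute (the ident name and doc string are assumptions, not the app's actual schema):

```clojure
;; Hypothetical schema addition: a many-cardinality ref from the thread's
;; parent message to each child message. The ident name is an assumption.
{:db/ident       :message/thread-children
 :db/valueType   :db.type/ref
 :db/cardinality :db.cardinality/many
 :db/doc         "Messages that are replies in this message's thread."}
```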
The import code will have to change a bit also.
So, the previously proposed change requires a schema change, plus corresponding updates to the import code.

It's a little hard to work with this without properly setting up a test environment. But it appears that there is a fix that may be "good enough" with minimal code changes. The basic idea is to write a new query that retrieves all of a day's thread messages in one go, against the existing schema. This change improves the query performance significantly.
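A sketch of the idea (all attribute names below are assumptions inferred from this thread, not the actual code): join the day's parent messages to their replies by timestamp, in a single query.

```clojure
(require '[datomic.api :as d])

;; Sketch: fetch every thread reply for one channel-day in a single query,
;; joining children to parents on the parent's timestamp. Attribute names
;; are assumptions, not the app's actual schema.
(defn thread-messages-of-day [db channel day]
  (d/q '[:find [(pull ?child [*]) ...]
         :in $ ?channel ?day
         :where
         [?parent :message/channel ?channel]
         [?parent :message/day ?day]
         [?parent :message/ts ?ts]
         [?child :message/thread-ts ?ts]]
       db channel day))
```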
With this change, we're improving the performance by 25x, bringing the response time to a reasonable range. Will open a PR when the code is cleaned up.
Performance is pretty OK now. It turned out there were a few queries that got stats across channels, or across all dates for a single channel, and these were really slow. I addressed it by running these queries regularly (once every hour) and caching the result.
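A minimal sketch of that pattern (the function and var names are made up; the real implementation may differ):

```clojure
(require '[datomic.api :as d])

(defn cross-channel-stats
  "Placeholder for one of the slow cross-channel stats queries."
  [db]
  ;; ... expensive d/q call goes here ...
  )

(defonce stats-cache (atom nil))

(defn start-stats-refresher!
  "Recompute the cached stats once an hour on a daemon thread, so request
  handlers only ever read the (cheap) atom."
  [conn]
  (doto (Thread.
         (fn []
           (while true
             (reset! stats-cache (cross-channel-stats (d/db conn)))
             (Thread/sleep (* 60 60 1000)))))
    (.setDaemon true)
    (.start)))
```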
While the site is usable at the moment, pages often take longer to load than is comfortable. This is a meta issue to track some of the performance issues and improvements.
What's happened so far
With that, performance of pages that have been visited somewhat recently is spiffy, but pages that are cold still take a pretty long time. I think this is mostly due to Datomic queries being slow.
Things to try/do