
Monitoring #56

Closed
tsujigiri opened this Issue · 29 comments

3 participants

Helge Rausch

To get better insight into how the whole thing is performing over time, and which parts are using the most resources, we should (re)employ some monitoring.

I have had good experiences with Munin, which creates nice graphs from existing and custom scripts (which are easy to write in a language of one's own choice), and with New Relic, which gives an idea of which parts of the app are consuming the most resources.

More suggestions are welcome.

Helge Rausch

We should also consider using the pg_stat_statements module for Postgres. It makes statistics such as query counts and execution times queryable. These stats could also be monitored over time, e.g. via Munin.
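
For reference, a rough sketch of what enabling and querying it could look like (assuming Postgres 9.1+; the exact columns vary between versions):

    -- in postgresql.conf (requires a server restart):
    --   shared_preload_libraries = 'pg_stat_statements'

    -- once, in the database:
    CREATE EXTENSION pg_stat_statements;

    -- slowest statements by total time:
    SELECT query, calls, total_time, rows
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;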

Philipp Bayer
Collaborator

Seconding the New Relic suggestion. I still haven't been able to upgrade Postgres (stupid Ubuntu always wants to replace the installation in place, which might result in loss of data; will probably just install Postgres manually and then do the pg_upgrade).

I still have my New Relic account, want me to install the gem again and see what happens? Something broke last time, I can't remember what. I know it conflicted with some gem... But that was ages ago and might be fixed now.

Edit: Installed the gem. It seems to work, but the newest Rails versions somehow broke things. Investigating.

Philipp Bayer
Collaborator

Tried to install pg_stat_statements, got this:

   * The PostgreSQL server failed to start. Please check the log output:
2013-02-03 03:48:50 CET LOG:  loaded library "pg_stat_statements"
2013-02-03 03:48:50 CET FATAL:  could not create shared memory segment: Invalid argument
2013-02-03 03:48:50 CET DETAIL:  Failed system call was shmget(key=5432001, size=40484864, 03600).
2013-02-03 03:48:50 CET HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter.  You can either reduce the request size or reconfigure the kernel with larger SHMMAX.  To reduce the request size (currently 40484864 bytes), reduce PostgreSQL's shared_buffers parameter (currently 3072) and/or its max_connections parameter (currently 103).
    If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
    The PostgreSQL documentation contains more information about shared memory configuration

However, increasing SHMMAX requires a reboot of the entire server, something I'd rather not do right now, but rather in conjunction with a kernel upgrade etc.

Do we have any procedures for an entire system upgrade? It could be that an upgrade of PostgreSQL leads to overwriting of the PGDATA folder (oh God, that would be scary).

Helge Rausch

Then let's keep that in mind for the next reboot.
And, re: the Postgres data, as long as we have the SQL dumps, we should be fine, right? :)

Philipp Bayer
Collaborator

Theoretically we should be fine :D

Currently trying to downgrade my PostgreSQL to 8.4 so I can see what happens when I use yum to update to 9.2... It should behave identically to apt-get upgrade. Hopefully.

Philipp Bayer
Collaborator

Have now added pg_stat_statements. I had to change kernel.shmmax from 33554432 to 17179869184 and kernel.shmall from 2097152 to 4194304. If that breaks anything, you can always reset both values using sysctl -w, for example: sysctl -w kernel.shmall=2097152
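
To make the new values survive the next reboot, something like this in /etc/sysctl.conf should do (a sketch, using the values above; apply immediately with sysctl -p):

    # /etc/sysctl.conf
    kernel.shmmax = 17179869184
    kernel.shmall = 4194304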

Will monitor what the module does.

Philipp Bayer
Collaborator

Here's some output from a query I just ran:

https://gist.github.com/drsnuggles/5881974

It's formatted a bit weirdly, mostly because the INSERT statement from the genotyping parsing is so long. It looks like the SELECT on user_snps takes the largest amount of time, which is interesting and looks fixable; the same goes for the two COUNTs. There should also be a better way to get all SNPs, namely to fetch only the subset that is displayed on the current page.

I don't think we can improve on the INSERT statement.

Philipp Bayer
Collaborator

Here's a screenshot from inside New Relic: https://twitter.com/PhilippBayer/status/350475349425016832/photo/1

As you can see, the show action takes most of the time (over the last 24 hours), and if I compare that with the PostgreSQL output in the last comment, I think the line
@user_count = @snp.user_snps.select('distinct(user_id)').count in the snps_controller is what's so slow... Have to check more.

Helge Rausch

Can you find out (maybe from the server logs) how often this is called on distinct SNPs within 24 hours?

Depending on whether this number needs to be a hundred percent accurate all the time, we could either cache it for some time or keep track of it in the Snp model, incrementing or decrementing a Redis counter each time a UserSnp gets added or removed.
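
A rough sketch of the Redis counter variant (hypothetical names; assumes a $redis client is set up somewhere, e.g. in an initializer, and glosses over the fact that the association currently goes over snp_name):

    # Hypothetical sketch, not actual snpr code.
    class UserSnp < ActiveRecord::Base
      after_create  { $redis.incr("snp:#{snp_id}:user_count") }
      after_destroy { $redis.decr("snp:#{snp_id}:user_count") }
    end

    class Snp < ActiveRecord::Base
      def user_count
        cached = $redis.get("snp:#{id}:user_count")
        # fall back to the slow query until the counter has been seeded
        cached ? cached.to_i : UserSnp.where(snp_id: id).count('distinct user_id')
      end
    end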

Helge Rausch

New Relic also has a view telling you exactly which queries are the slowest. I just can never remember where to find it.

Helge Rausch

Secondly, the user_snps table doesn't have an index on user_id. That should knock the first query in your list from its place.

Philipp Bayer
Collaborator

The query view in New Relic is (I think) only in the paid, non-lite version, and our trial has long run out... But it would be better to cache the number of users per SNP; possibly the index will already help enough to make that unnecessary.

Helge Rausch

For that to work, it would need an additional index on user_id and snp_id. But you are right, we should try that first.

Postgres can even build indexes without locking the table. We should consider doing that, as the table is fairly big and building two indexes would take some time.

Philipp Bayer
Collaborator

I'm having a look now; there's already an index_user_snps_on_user_id_and_snp_name, so I guess we made a small mistake in the past :)

Trying this now:

    CREATE INDEX CONCURRENTLY index_user_snps_on_user_id_and_snp_id ON user_snps (user_id, snp_id)

Helge Rausch

I'm very curious how long this takes. How many user_snps are there currently?

Bastian Greshake
Owner

I don't have the exact number, but there are 450 genotyping files. Not all of them contain a full 1 million SNPs, so let's say it's more like 400 full files, which makes around 400 million user_snps by my estimate. But Philipp might be able to give you the real number.

Regarding that call: do we really need it? Displaying that number might look nice, but in the end we wouldn't save that SNP if we didn't have data for at least a single person. And as you can't sort our views or search for SNPs by the number of users having data for them, I guess it doesn't help too much. Or am I missing something?

Helge Rausch

I'm against removing it, as I would like to see how much of an improvement adding the indexes brings. :)

Philipp Bayer
Collaborator

We could slightly change the user count: make it appear only when the user has clicked on the "Users sharing this SNP..." tab, because then we load the entire table via AJAX anyway.

Right now we load that table twice: once to count the entries, and once when the user clicks that tab.

(UserSnp.count is currently running)

Philipp Bayer
Collaborator

Currently, "only" 326,630,389 user_snps, but of the 450 genotyping files not all files have been parsed, still around 700 parsing jobs left.

Helge Rausch

While we are optimizing this table:

  • Why does it have an snp_name column?
  • local_genotype can be stored as an ENUM in Postgres (rough sketch below)
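
A rough sketch of the enum idea (type name and value list are just assumptions; the real set would have to match what's actually in the data, and rewriting a ~330M-row table like this takes a lock for a while):

    -- Illustrative only; names and values are assumptions.
    CREATE TYPE local_genotype_enum AS ENUM (
      'AA', 'AC', 'AG', 'AT', 'CC', 'CG', 'CT', 'GG', 'GT', 'TT', '--'
    );

    ALTER TABLE user_snps
      ALTER COLUMN local_genotype TYPE local_genotype_enum
      USING local_genotype::local_genotype_enum;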
Philipp Bayer
Collaborator

The indexing took about 4 hours.

Not sure why the snp_name column exists. Aren't some relations followed over snp_name rather than snp_id because of friendly_id? Not sure anymore.

Helge Rausch

I remember something like that, yes. I'm sure friendly_id's functionality can be made to work without that column fairly easily. The snp_name column should take up about 3 GB at the moment (the average snp_name length is 9.19). Also, looking up the name in the snps table, which is a lot shorter, and then joining it with user_snps on snp_id would probably be much faster than scanning all the snp_names in that huge table.
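
Roughly like this (assuming snps.name holds the SNP name and user_snps.snp_id references snps.id):

    -- Sketch of the lookup-then-join; 'rs12345' is just an example name.
    SELECT us.*
    FROM user_snps us
    JOIN snps s ON s.id = us.snp_id
    WHERE s.name = 'rs12345';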

local_genotype, at 2 bytes per row, comes to about 653 MB. With an enum (16 permutations -> 4 bits, assuming it works that way within Postgres) that would be about 163 MB.

Helge Rausch

Snp is also associated with UserSnp via snp_name. That would have to be fixed as well, then.

Helge Rausch

Postgres' enum type still takes 4 bytes... but there are bit strings. :)

Bastian Greshake
Owner

I think we did the association over snp_name on purpose, and if I remember correctly we did so because it made querying user_snps easier/faster in some instances? I would have to go back into the mailing list archives to figure that out. I think @tsujigiri implemented it back then :D

Helge Rausch

Haha, might be. I wasn't aware that this table would eventually grow so much. I guess this is assigned, then? :D

tsujigiri referenced this issue from a commit in tsujigiri/snpr: "dropping snp_name column from user_snps" (f44e493)
Helge Rausch

This, among other things, changes the DasController, for which there are still no tests.
Also: :exclamation: IT DROPS THE SNP_NAME COLUMN FROM USER_SNPS :exclamation:

:)

I will add tests when I have the time. Unless someone else does it first. :wink:

Philipp Bayer
Collaborator

Looks very nice. Can you send a pull request for that? Then I can merge it and Travis can have a look at the tests.

Philipp Bayer
Collaborator

I think all of this is now implemented.
