To get a better insight into how the whole thing is performing over time and which parts are using the most resources we should (re)employ some monitoring.
I have good experiences with Munin, which creates nice graphs from existing and custom scripts that are easy to write in a language of one's own choice, and with New Relic, which gives an idea of which parts of the app consume the most resources.
More suggestions are welcome.
We should also consider using the pg_stat_statements module for Postgres. It makes statistics about e.g. query counts and times queryable. These stats could also be monitored over time, e.g. via Munin.
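To make this concrete, here is a sketch of the kind of query pg_stat_statements makes possible (column names as documented for the module in the PostgreSQL 9.x docs; the module also needs `shared_preload_libraries = 'pg_stat_statements'` in postgresql.conf before it can be loaded):

```ruby
# A query against pg_stat_statements to find the most expensive statements,
# e.g. to run from psql or via ActiveRecord's connection. Column names are
# taken from the pg_stat_statements documentation for PostgreSQL 9.x.
TOP_QUERIES_SQL = <<-SQL
  SELECT calls, total_time, query
  FROM pg_stat_statements
  ORDER BY total_time DESC
  LIMIT 10;
SQL

puts TOP_QUERIES_SQL
```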
Seconding the New Relic thing; I still haven't been able to upgrade Postgres (stupid Ubuntu always wants to replace the installation in place, which might result in loss of data. I will probably just install Postgres manually and then run pg_upgrade).
I still have my New Relic account, want me to install the gem again and see what happens? Something broke last time; I can't remember what. I know it conflicted with some gem... but that was ages ago and might be fixed by now.
Edit: Installed the gem - it seems to work, but the newest Rails versions somehow broke things. Investigating.
Tried to install pg_stat_statements, got this:
* The PostgreSQL server failed to start. Please check the log output:
2013-02-03 03:48:50 CET LOG: loaded library "pg_stat_statements"
2013-02-03 03:48:50 CET FATAL: could not create shared memory segment: Invalid argument
2013-02-03 03:48:50 CET DETAIL: Failed system call was shmget(key=5432001, size=40484864, 03600).
2013-02-03 03:48:50 CET HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 40484864 bytes), reduce PostgreSQL's shared_buffers parameter (currently 3072) and/or its max_connections parameter (currently 103).
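The numbers in that HINT line up with the default SHMMAX mentioned further down in this thread (33554432 bytes); a quick sanity check, assuming the gap between the buffer pool and the total request is other shared structures (lock table, and here pg_stat_statements' own shared area):

```ruby
# Figures from the error message: shared_buffers is 3072 pages of 8 kB each,
# and the total shared memory request is 40484864 bytes. The kernel's SHMMAX
# was still at the common 32 MiB default, so shmget() had to fail.
shared_buffers_bytes = 3072 * 8192   # 25165824 bytes (~24 MiB) for the buffer pool
request_bytes        = 40_484_864    # total request from the error message
default_shmmax       = 33_554_432    # 32 MiB, the value before it was raised

puts request_bytes > default_shmmax  # true: hence the "Invalid argument" from shmget
```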
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration
However, increasing SHMMAX requires a reboot of the entire server, which I'd rather not do right now, but rather in conjunction with a kernel upgrade etc.
Do we have any procedures for an entire system upgrade? It could be that an upgrade of PostgreSQL leads to overwriting the PGDATA directory (oh God, that would be scary).
Then let's keep that in mind for the next reboot.
And, re: the Postgres data, as long as we have the sql dumps, we should be fine, right? :)
Theoretically we should be fine :D
Currently trying to downgrade my PostgreSQL to 8.4 so I can see what happens when I use yum to update to 9.2... It should behave identically to apt-get upgrade. Hopefully.
Have now added pg_stat_statements - had to change kernel.shmmax from 33554432 to 17179869184 and kernel.shmall from 2097152 to 4194304. If that breaks anything you can always reset both values using sysctl -w, for example: sysctl -w kernel.shmall=2097152
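For the record, the two new values are consistent with each other; kernel.shmmax is in bytes while kernel.shmall is in pages (4096 bytes on x86):

```ruby
# Sanity check on the new kernel settings: with these values a single shared
# memory segment (shmmax) and all shared memory together (shmall * page size)
# are both capped at 16 GiB.
page_size  = 4096
new_shmmax = 17_179_869_184  # bytes (16 GiB)
new_shmall = 4_194_304       # pages

puts new_shmall * page_size == new_shmmax  # true: the two limits agree
```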
Will monitor what the module does.
Here's some output I just ran:
It's formatted a bit weirdly, mostly because the INSERT statement from genotyping parsing is so long. It looks like the SELECT on user_snps takes the most time, which is interesting and looks fixable; the same goes for the two COUNTs. There should also be a better way to get all SNPs, namely to fetch only the subset that is displayed on the current page.
I don't think we can improve on the INSERT statement.
Here's a screenshot from inside New Relic: https://twitter.com/PhilippBayer/status/350475349425016832/photo/1
As you can see, the show action takes most of the time (over the last 24 hours), and if I compare that with the PostgreSQL output in the last comment, I think the line @user_count = @snp.user_snps.select('distinct(user_id)').count in the snps_controller is what's so slow... I have to check more.
Can you find out (maybe from the server logs) how often this is called on distinct SNPs within 24 hours?
Depending on whether this number needs to be a hundred percent accurate all the time, we could either cache it for some time or keep track of it in the Snp model, incrementing or decrementing a Redis counter each time a UserSnp gets added or removed.
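A sketch of that counter idea, with a plain Hash standing in for Redis (with redis-rb this would be INCR/DECR on a key like "snp:&lt;id&gt;:user_count", hooked into ActiveRecord's after_create/after_destroy callbacks on UserSnp). The class and method names here are hypothetical, not from the actual app:

```ruby
# Hypothetical counter cache: instead of running COUNT(DISTINCT user_id) on the
# huge user_snps table for every page view, keep a running per-SNP counter.
# A Hash with a default of 0 stands in for Redis in this sketch.
class SnpUserCounter
  def initialize(store = Hash.new(0))
    @store = store
  end

  def key(snp_id)
    "snp:#{snp_id}:user_count"
  end

  def user_snp_added(snp_id)    # would go in an after_create callback
    @store[key(snp_id)] += 1
  end

  def user_snp_removed(snp_id)  # would go in an after_destroy callback
    @store[key(snp_id)] -= 1
  end

  def count(snp_id)
    @store[key(snp_id)]
  end
end

counter = SnpUserCounter.new
3.times { counter.user_snp_added(42) }
counter.user_snp_removed(42)
puts counter.count(42)  # 2
```

The trade-off is exactly the accuracy question above: the counter can drift if rows are changed outside the model callbacks, so it would need an occasional recount to resync.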
New Relic also has a view that tells you exactly which queries are the slowest. I just can never remember where to find it.
Secondly, the user_snps table doesn't have an index on user_id. That should knock the first query in your list from its place.
The view in New Relic for queries is (I think) only in the paid, non-lite version, and our trial time has long run out... But it would be better to cache the number of users per SNP; possibly the index will already help enough to make it not that important.
For that to work it would need an additional index on user_id and snp_id. But you are right, we should try that first.
Postgres can even build indexes without locking the table. We should consider doing that, as the table is fairly big and building two indexes would take some time.
I'm having a look now; there's already an index_user_snps_on_user_id_and_snp_name, so I guess we made a small mistake in the past :)
trying this now:
CREATE INDEX CONCURRENTLY index_user_snps_on_user_id_and_snp_id ON user_snps (user_id, snp_id)
I'm very curious how long this takes. How many user_snps are there currently?
I don't have the exact number, but there are 450 genotyping files. Not all of them contain a full 1 million SNPs, so let's say it's more like 400 full files, which makes around 400 million user_snps by my estimate. But Philipp might be able to give you the real number.
Regarding that call: do we really need it? Displaying that number might look nice, but in the end we wouldn't save that SNP unless we had data for at least a single person. And since you can't sort our views or search for SNPs by the number of users having data for them, I guess it doesn't help much. Or am I missing something?
I'm against removing it, as I would like to see how much of an improvement adding the indexes brings. :)
We could slightly change the user count - make it appear only when the user has clicked on the "Users sharing this SNP..." tab, because then we load the entire table via AJAX anyway.
Right now, we load that table twice - once to count the entries, and once when the user clicks that tab.
(UserSnp.count is currently running)
Currently, "only" 326,630,389 user_snps, but not all of the 450 genotyping files have been parsed yet; there are still around 700 parsing jobs left.
While we are optimizing this table:
the indexing took about 4 hours.
I'm not sure why the snp_name column exists - aren't some relations followed over snp_name rather than snp_id because of friendly_id? I'm not sure anymore.
I remember something like that, yes. I'm sure friendly_id's functionality can be made to work without that column fairly easily. The snp_name column should take about 3 GB at the moment (the average snp_name length is 9.19). Also, looking up the name in the snps table, which is a lot shorter, and then joining that with user_snps by their snp_id would probably be much faster than scanning all the snp_names in that huge table.
local_genotype, at 2 bytes per row, comes to 653 MB. With an enum (16 permutations -> 4 bits, assuming it works that way within Postgres) that would be about 163 MB.
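Those back-of-the-envelope figures check out against the row count from earlier in the thread (326,630,389 rows); a quick sketch, ignoring Postgres' per-value and per-row overhead:

```ruby
# Storage estimates for the user_snps columns discussed above, using the row
# count reported earlier in the thread. These ignore Postgres' per-value
# length headers and per-row overhead, so they are lower bounds.
rows = 326_630_389

two_byte_mb = rows * 2   / 1_000_000.0        # local_genotype at 2 bytes/row
four_bit_mb = rows * 0.5 / 1_000_000.0        # 4 bits/row, if it could be packed
name_gb     = rows * 9.19 / 1_000_000_000.0   # snp_name at avg. 9.19 chars

puts two_byte_mb.round    # 653
puts four_bit_mb.round    # 163
puts name_gb.round(1)     # 3.0
```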
Snp is also associated to UserSnp via the snp_name. That would have to be fixed as well, then.
Postgres' enum type still takes 4 bytes... but there are bit strings. :)
I think we did the association over the snp_name on purpose, and if I remember correctly we did so because it made querying user_snps easier/faster in some instance? I would have to go back into the mailing list archives to figure that out. I think @tsujigiri implemented it back then :D
Haha, might be. I wasn't aware that this table would eventually grow so much. I guess this is assigned, then? :D
dropping snp_name column from user_snps
This, among other things, changes the DasController, for which there are still no tests.
Also: IT DROPS THE SNP_NAME COLUMN FROM USER_SNPS
I will add tests when I have the time. Unless someone else does it first.
Looks very nice - can you send a pull request for that? Then I can merge it and Travis can have a look at the tests.
I think all of this is now implemented.