and then I loaded 300 projects / 6,815 repos into facade and tried building cache 🤣 #31
I do think I can put the MySQL temp directory on another device and squeeze more out of the database, btw ... but I think I have reached a scale where db optimizations alone are not going to be sufficient. As an old DBA, I recognize that when I can't blame the network, I can definitely blame the app 🤣
I continue to be astounded by how you're using this at scale, @sgoggins. Very, very impressive (and way more than I've ever tried!) I would agree, it's probably the database that's the issue. If I had to point my finger at one suspect, it's the way that Facade neurotically stuffs data into the database at every opportunity. I had originally done it this way so that if it failed mid-run, not much would be lost. That said, the sheer volume of database transactions creates a maaaaaaasive amount of overhead in big databases.

I think there's a fairly simple fix that will be low-hanging fruit, whereby the database transactions for a single repo are accumulated in a temporary in-memory database and then only pushed into the analysis_data table at the very end. In the case where facade-worker.py fails, you'll still lose the stuff that was in memory, but the performance gains should make up for it. The reason I say this is the lowest-hanging fruit is that by reducing database traffic, it may become feasible to use the native Python MySQL connector (…)
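The batching idea described above could be sketched roughly as follows. This is a hypothetical illustration, not Facade's actual code: `analyze_repo`, the column list, and the use of SQLite for both the staging and main databases are all stand-ins (Facade itself targets MySQL/MariaDB). The point is simply that per-commit rows accumulate in an in-memory table and hit the real `analysis_data` table only once, at the end of the repo.

```python
import sqlite3

def analyze_repo(commits, main_conn):
    """Stage a repo's rows in an in-memory database, then bulk-push them.

    commits: iterable of (repo_id, commit_hash, added) tuples.
    main_conn: connection to the main database (sqlite3 here for illustration).
    """
    # Stage rows in memory; nothing touches the main database yet.
    staging = sqlite3.connect(":memory:")
    staging.execute(
        "CREATE TABLE analysis_data (repo_id INTEGER, commit_hash TEXT, added INTEGER)"
    )
    staging.executemany("INSERT INTO analysis_data VALUES (?, ?, ?)", commits)

    # One bulk push into the real table when the repo completes. If the
    # worker dies before this point, only this repo's in-memory rows are lost.
    rows = staging.execute("SELECT * FROM analysis_data").fetchall()
    main_conn.executemany("INSERT INTO analysis_data VALUES (?, ?, ?)", rows)
    main_conn.commit()
    staging.close()
    return len(rows)
```

The trade-off is exactly the one described above: a mid-repo crash loses the staged rows, but thousands of tiny transactions collapse into one.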
@brianwarner: I did a lot of optimization of the MariaDB parameters (which are the same as MySQL's, but you know that story, I'm sure). I have them in a file in my fork right now, but perhaps they belong in a README instead:

```ini
# Speeds up sorting when building the facade cache
tmp_table_size = 16106127360

# Timeouts deal with long-running connections; mostly needed
# when first scanning a large number of repos
thread_pool_idle_timeout = 40000

# 128 GB of RAM on the server
innodb_buffer_pool_size = 107374182400

# Helps with sorting
key_buffer = 256M

# Connections are not a facade issue
max_connections = 250

# A little caching helps with some of the queries
table_cache = 16K

# Nice to know
log_slow_queries = /var/log/mysql/mysql-slow.log
```
128 GB of RAM, solid-state drives … got all the repos and analysis_data. It's been 21 hours loading project_weekly_cache, with no CPU usage, so I am guessing I have the database eating disk …
I've already made a set of modifications and database config notes on my fork at the sgoggins/facade:augur branch. I'm thinking I can rewrite the query that loads the cache to go after one repository or project group at a time. Since this is a nominal, 4-hour thing for me (a very experienced database guy / formerly well-compensated Oracle DBA), I thought I would circle back and see if you would approve a pull request that modularized some of the functions in facade-worker.py into another Python file, or how you would recommend doing this.
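The per-repository cache idea above could be sketched like this. Everything here is illustrative: the table names, columns, and aggregation are hypothetical stand-ins for Facade's real schema (and sqlite3 stands in for the MySQL connector). The point is chunking the cache build so each query's working set is one repo, not the whole corpus, and committing per repo.

```python
import sqlite3

def build_cache_per_repo(conn):
    """Build the weekly cache one repo at a time instead of in one big query."""
    cur = conn.cursor()
    # Materialize the repo list up front so we are not iterating a cursor
    # we are about to reuse.
    repo_ids = [r[0] for r in cur.execute("SELECT DISTINCT repo_id FROM analysis_data")]
    for repo_id in repo_ids:
        # Aggregate a single repo's rows; the working set stays small,
        # so sorts are less likely to spill to disk.
        cur.execute(
            """INSERT INTO project_weekly_cache (repo_id, week, added)
               SELECT repo_id, week, SUM(added)
               FROM analysis_data
               WHERE repo_id = ?
               GROUP BY repo_id, week""",
            (repo_id,),
        )
        conn.commit()  # a failure loses at most one repo's cache entries
    return len(repo_ids)
```

Committing per repo also keeps the failure mode graceful: a crashed run can resume by skipping repos whose cache rows already exist.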
The refactoring would change how cache is built and have options for execution. I think:
What do you think @brianwarner ?