Improvements to speed by inserting genotype calls via \copy #19
Conversation
…original add record helper function
…w function meant for copying
…ngs up (for our expected cases >;-))
…ing calls to be copied in. Cleaned up a couple of other funcs.
…rrently hard-coded.
…10,000 now that testing is done.
Overall the code looks good :-) See specific comments.
api/genotypes_loader.api.inc
Outdated
 * either "handle errors" or not - this way, if we see a certain error (such as not unique), we can handle it
 * from outside of the function.
 *
 * Also - don't forget to refer to Lacey's new helper function on Github. This function needs to evolve
 * regardless! Whether or not we go with copying from a file remains to be seen...
 */
This function docblock needs a lot more documentation ;-)... Please say what the function does and describe its parameters. Also, make sure there's no empty line between the docblock and its function.
elseif ($mode == $both) {

  // If we want to insert values, we can merge our values to have all the information we need
  $values = array_merge($select_values, $insert_values);
So $insert_values doesn't stand alone? This should be documented in the function docblock.
Addressed in commit 1277bb2 👍
api/genotypes_loader.api.inc
Outdated
function genotypes_loader_helper_copy_records($record_count, $record_type, $table, $insert_values = array(), $file_name = NULL, $final_chunk = FALSE) {

  // The number of records we want to reach before copying them in all at once
  $copy_chunk_size = 10000;
I would allow this to be set via variable_get/set().
- Change this line to
$copy_chunk_size = variable_get('genotypes_loader_cp_chunk_size', 10000);
- Set via variable_set() in the first drush function after exposing an option to the user.
This way you don't have to worry about trying to pass the value through the chain of functions while still providing a way for your user to change this value per file.
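The suggestion above could look roughly like this. It's a sketch assuming Drupal 7's variable API; the drush option name `chunk-size` is hypothetical, not the module's actual option:

```php
// In the drush command callback: read a hypothetical --chunk-size option
// from the user and persist it with Drupal 7's variable API.
$chunk_size = (int) drush_get_option('chunk-size', 10000);
variable_set('genotypes_loader_cp_chunk_size', $chunk_size);

// Later, inside genotypes_loader_helper_copy_records(), read it back with
// the same default so the function still works if it was never set:
$copy_chunk_size = variable_get('genotypes_loader_cp_chunk_size', 10000);
```

Because the value lives in the variables table rather than a parameter, none of the intermediate functions need to pass it along.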
I created issue #20 to address this!
api/genotypes_loader.api.inc
Outdated
$record_type = str_replace(' ', '_', $record_type);
$record_type = strtolower($record_type);
$file_name = $file_stub . $table . '-' . $record_type . '.csv';
}
The generated filename should include the database you're copying into; otherwise you will end up with collisions when loading a file on portal while testing the next one on clone. The connection information is in `global $databases;`, so just dpm() that to find the name.
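A sketch of what that could look like, assuming the standard Drupal 7 `$databases` array layout (`$file_stub`, `$record_type` and `$table` come from the surrounding function):

```php
// Pull the active database name out of Drupal's connection info so the
// generated CSV filename is unique per site (portal vs. clone, etc.).
global $databases;
$db_name = $databases['default']['default']['database'];

$record_type = strtolower(str_replace(' ', '_', $record_type));
$file_name = $file_stub . $db_name . '-' . $table . '-' . $record_type . '.csv';
```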
// Otherwise, open the file to write to since this is needed by fputcsv.
// @TODO "Lacey: I'm concerned this might be a point of slowness. Perhaps there is a better
// way to create CSV that doesn't require fputcsv and is more reliable
// than simply imploding the values with commas in-between."
I don't know of a better way to do this :-( We'll have to see how it goes...
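For context, a sketch of the trade-off being discussed (`$fh` and `$values` stand in for the loader's file handle and current row):

```php
// fputcsv() does proper quoting/escaping of every field, at some cost:
fputcsv($fh, $values);

// Imploding is faster but only safe if no value can ever contain a
// comma, double quote, or newline:
fwrite($fh, implode(',', $values) . "\n");
```

Since genotype calls are tightly constrained values, the implode shortcut may well be safe here, but fputcsv is the defensive default.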
genotypes_loader_remote_copy($copy_command, $file_name);

// Wipe the file clean by reopening it as write-only
This might be another point of slowness although less of a concern since it only happens once per chunk.
api/genotypes_loader.api.inc
Outdated
 * security risks in a web setting. We use a combination of drush sql-cli,
 * which opens a psql session, and psql's \copy command to take advantage of
 * the fact that Tripal Jobs are run on the command-line.
 *
This needs to be re-worded as it's currently specific to nd_genotypes. Perhaps:
PostgreSQL COPY is extremely effective at copying/inserting large amounts of data. This module uses it to make inserting massive genotype files more efficient. However, using chado_query('COPY...') poses many security risks in a web setting. Instead, we use Drupal to determine the psql connection string, and psql's \copy command to take advantage of the fact that Tripal Jobs are run on the command-line.
Note: the bolded sections indicate my changes.
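For illustration, the command-line pattern described above might look like this; the table and file names are examples only, not the module's actual values:

```
# Pipe a \copy meta-command into the psql session that drush sql-cli opens.
# \copy reads the CSV client-side, so no server-side COPY privilege is needed.
echo "\\copy chado.genotype_call FROM '/tmp/genotypes_loader.genotype_call.csv' WITH CSV" | drush sql-cli
```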
Also addressed in commit 1277bb2
includes/genotypes_loader.vcf.inc
Outdated
// Set the file name to be used for remote copy
if ($options['storage_method'] == 'genotype_call') {
  $remote_copy_filename = '/tmp/genotypes_loader.remotecopy.csv';
For consistency, I would follow the same convention that you use at https://github.com/UofS-Pulse-Binfo/genotypes_loader/pull/19/files?utf8=%E2%9C%93&diff=unified&w=1#diff-eb8a4ddc56c5aa596c7c31e403e0f285R138
Done in commit 034a87b. I opted to keep only the table name since I knew in this particular case it would be redundant (table name and record name would both be 'genotype_call'). I am still opting to use both table and record in the API function since I want to keep that function generic for the future and thus they are both important to ensure uniqueness.
…clared to stick to a consistent naming scheme
…UofS-Pulse-Binfo/genotypes_loader into 4-Improve-insert-speed-via-copy
I successfully tested this branch on a development site with genotypic data from our diversity panel (the driving force for this module, in fact!) spread over 8 files (1 per chromosome):
Stats:
Overall, I would say this is a great improvement over my previous benchmark of ~2,000 SNPs/minute!!
Issue #4
Description
Two new functions were added to the API to handle copying genotype calls from VCF format into the genotype_call table. I opted for the COPY method as per Lacey's suggestion. Genotype calls get saved to a CSV file and loaded into the database once 10,000 records are reached; rinse and repeat until all calls are loaded. The module ensures that any calls remaining in the file once the VCF is fully processed also get loaded into the database.
This change was crucial to ensure we can load enormous amounts of genotypic data in a reasonable amount of time. ;-)
Please note the following:
Testing?
I implemented my changes in a Cloud9 environment (see pull request #18 for how I set up my environment); however, Cloud9 appears to struggle with command-line execution of the copy command (this was also Lacey's experience). My final testing took place on the clone site using cats.list and cats.vcf as my input files. No cvterms need to be added/configured, but I added organisms, a chromosome and a project as demonstrated in #18. I also set up a .pgpass file in my home directory with connection details for the clone's database (see https://gist.github.com/sabman/978352); this ensures you aren't prompted for a connection password on every batch being loaded. This step is technically optional, since the functionality still works without it, but it is annoying to omit. I loaded genotypes using the command:
drush load-genotypes sample_files/cats.vcf sample_files/cats.list --marker-type="genetic_marker" --variant-type="SNP" --organism="Felis catus" --project-name="My SNP Discovery Project" --ndgeolocation="here"
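For reference, the .pgpass file mentioned above is one line per connection in the format `hostname:port:database:username:password`, and psql ignores it unless its permissions are 0600. All values below are placeholders:

```
# ~/.pgpass  (chmod 600 ~/.pgpass)
localhost:5432:clone_db:drupal:s3cret
```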
So far, I have tested with copy_chunk_size both greater than and less than the number of records I expect (with cats.vcf: 22 records expected). More test cases to be reported soon! For now, a code review would be immensely helpful 😊