
Improvements to speed by inserting genotype calls via /COPY #19

Merged
carolyncaron merged 19 commits into master from 4-Improve-insert-speed-via-copy on Jul 3, 2018

Conversation

carolyncaron
Member

Issue #4

Description

Two new functions were added to the API to handle copying genotypic calls from VCF format into the genotype_call table. I opted for the COPY method, as per Lacey's suggestion: genotype calls get saved to a CSV file and loaded into the database once 10,000 records are reached, then rinse and repeat until all calls are loaded. The module ensures that any calls remaining in the file once the VCF is fully processed also get loaded into the database.
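As a rough sketch of that flow (hypothetical names like flush_chunk_via_copy, $csv_path and $vcf_calls - the real implementation lives in the two new API functions):

$chunk_size = 10000;
$buffered = 0;
$csv = fopen($csv_path, 'w');

foreach ($vcf_calls as $call) {
  // One genotype call per CSV row, in genotype_call column order.
  fputcsv($csv, $call);
  $buffered++;
  if ($buffered >= $chunk_size) {
    // Hypothetical helper: runs psql's \copy on the CSV, then empties the file.
    flush_chunk_via_copy($csv, $csv_path);
    $buffered = 0;
  }
}

// Any calls remaining once the VCF is fully processed get loaded too.
if ($buffered > 0) {
  flush_chunk_via_copy($csv, $csv_path);
}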

This change was crucial to ensure we can load enormous amounts of genotypic data in a reasonable amount of time. ;-)

Please note the following:

  • Testing with very large datasets is currently in progress, and further optimization may be required
  • This functionality is currently specific to the genotype_call table method ONLY, and will only work for VCF file inputs (implementing this for the other input formats should be trivial, however - let me know if there is demand for it!)

Testing?

I implemented my changes in a Cloud9 environment (see pull request #18 for how I set up my environment); however, Cloud9 appears to struggle with command-line execution of the copy command (also Lacey's experience). My final testing took place on the clone site using cats.list and cats.vcf as my input files. No cvterms need to be added/configured, but I added organisms, a chromosome and a project as demonstrated in #18. I also set up a .pgpass file in my home directory with connection details for the clone's database (see https://gist.github.com/sabman/978352, and the example after the command below); this ensures you don't get prompted for a connection password for every batch being loaded. This step is technically optional, since the functionality still works without it, but it is annoying to omit. I loaded genotypes using the command:
drush load-genotypes sample_files/cats.vcf sample_files/cats.list --marker-type="genetic_marker" --variant-type="SNP" --organism="Felis catus" --project-name="My SNP Discovery Project" --ndgeolocation="here"
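For reference, the .pgpass file mentioned above uses PostgreSQL's standard hostname:port:database:username:password format (the values here are placeholders):

localhost:5432:clone_db:drupal_user:secret_password

followed by chmod 600 ~/.pgpass, since psql ignores the file if it is group- or world-readable.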

So far I have tested with copy_chunk_size both greater than and less than the number of records I expect (with cats.vcf: 22 records expected). More test cases will be reported soon! For now, a code review would be immensely helpful 😊

Member

@laceysanderson laceysanderson left a comment


Overall the code looks good :-) See specific comments.

* either "handle errors" or not- this way, if we see a certain error (such as not unique), we can handle it
* from outside of the function.
*
* Also - don't forget to refer to Lacey's new helper function on Github. This function needs to evolve
* regardless! Whether or not we go with copying from a file remains to be seen...
*/

Member


This function docblock needs a lot more documentation ;-)... Please say what the function does and describe its parameters. Also, make sure there's no empty line between the docblock and its function.

Member Author


The docblock above this part describes the function - let me know if it needs more description. I moved the @todo into the previous block and removed the empty line in commit 1277bb2.

elseif ($mode == $both) {

  // If we want to insert values, we can merge our values to have all the information we need.
  $values = array_merge($select_values, $insert_values);
Member


So $insert_values doesn't stand-alone? This should be documented in the function docblock.

Member Author


Addressed in commit 1277bb2 👍
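For anyone reading along, a docblock along these lines (a sketch inferred from the signature below, not the committed text) would cover both points:

/**
 * Buffers records into a CSV file and COPYs them into a chado table in chunks.
 *
 * @param $record_count
 *   The running count of records buffered so far.
 * @param $record_type
 *   A human-readable name for the type of record being copied (e.g. 'genotype call').
 * @param $table
 *   The chado table the records will be copied into.
 * @param $insert_values
 *   The values for a single record to be buffered, keyed by column.
 * @param $file_name
 *   The CSV file to buffer records into; generated from the table and record
 *   type when NULL.
 * @param $final_chunk
 *   TRUE once the input file is fully processed, forcing any remaining
 *   records to be copied regardless of the chunk size.
 */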

function genotypes_loader_helper_copy_records($record_count, $record_type, $table, $insert_values = array(), $file_name = NULL, $final_chunk = FALSE) {

  // The number of records we want to reach before copying them all in at once.
  $copy_chunk_size = 10000;
Member


I would allow this to be set via variable_get/set().

  1. Change this line to $copy_chunk_size = variable_get('genotypes_loader_cp_chunk_size', 10000);
  2. Set it via variable_set() in the first drush function after exposing an option to the user (see the sketch after this list).
    This way you don't have to worry about passing the value through the chain of functions while still giving your user a way to change it per file.
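For what it's worth, a sketch of that pattern (the --chunk-size option name is hypothetical):

// In the drush command callback, expose an option and store it once per run.
$chunk_size = drush_get_option('chunk-size', 10000);
variable_set('genotypes_loader_cp_chunk_size', $chunk_size);

// Then inside genotypes_loader_helper_copy_records(), read it back.
$copy_chunk_size = variable_get('genotypes_loader_cp_chunk_size', 10000);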

Member Author


I created issue #20 to address this!

  $record_type = str_replace(' ', '_', $record_type);
  $record_type = strtolower($record_type);
  $file_name = $file_stub . $table . '-' . $record_type . '.csv';
}
Member


The generated filename should include the database you're copying into; otherwise you will end up with collisions when loading a file on portal and testing the next one on clone. The connection information is in global $databases, so just dpm() that to find the name.
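Something along these lines would do it (a sketch, not the committed change):

// Pull the database name from Drupal's connection info and include it in the
// generated filename so portal and clone never write to the same file.
global $databases;
$db_name = $databases['default']['default']['database'];
$file_name = $file_stub . $db_name . '.' . $table . '-' . $record_type . '.csv';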

Member Author

@carolyncaron carolyncaron Jun 14, 2018


Done in commits b0c01bd and 034a87b!

// Otherwise, open the file to write to, since this is needed by fputcsv.
// @TODO "Lacey: I'm concerned this might be a point of slowness. Perhaps there is a better
// way to create the CSV that doesn't require fputcsv and is more reliable
// than simply imploding the values with commas in between."
Member


I don't know of a better way to do this :-( We'll have to see how it goes...
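For the record, the implode alternative from the @TODO would look like this (assuming $csv_handle is the open file handle); it skips fputcsv's quoting, so it is only safe if the values can never contain commas, quotes or newlines:

// Write the row directly, without CSV escaping.
fwrite($csv_handle, implode(',', $insert_values) . "\n");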


genotypes_loader_remote_copy($copy_command, $file_name);

// Wipe the file clean by reopening it as write-only
Member


This might be another point of slowness although less of a concern since it only happens once per chunk.
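If reopening ever does show up as a bottleneck, truncating the existing handle in place is one alternative (a sketch; assumes $csv_handle is the open file handle):

// Empty the file without a fresh fopen(), then reset the pointer for the next chunk.
ftruncate($csv_handle, 0);
rewind($csv_handle);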

* security risks in a web setting. We use a combination of drush sql-cli,
* which opens a psql session, and psql's \copy command to take advantage of
* the fact that Tripal Jobs are run on the command-line.
*
Member


This needs to be re-worded as it's currently specific to nd_genotypes. Perhaps

PostgreSQL COPY is extremely effective at copying/inserting large amounts of data. This module uses it to make inserting massive genotype files more efficient. However, using chado_query('COPY...') poses many security risks in a web setting. Instead, we use Drupal to determine the psql connection string, and psql's \copy command, to take advantage of the fact that Tripal Jobs are run on the command line.

Note: Bolded sections indicate my changes.
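As a sketch of how that looks on the command-line side (not necessarily the module's exact code; connection details come from .pgpass rather than being hard-coded):

// Build the client-side \copy command for this CSV chunk.
$copy_command = "\\copy " . $table . " FROM '" . $file_name . "' WITH CSV";
// drush sql-connect prints the connection string for the current site's database.
$psql = trim(shell_exec('drush sql-connect'));
shell_exec($psql . ' -c "' . $copy_command . '"');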

Member Author


Also addressed in commit 1277bb2


// Set the file name to be used for remote copy.
if ($options['storage_method'] == 'genotype_call') {
  $remote_copy_filename = '/tmp/genotypes_loader.remotecopy.csv';

Member Author


Done in commit 034a87b. I opted to keep only the table name, since in this particular case including both would be redundant (the table name and record name would both be 'genotype_call'). I am still using both table and record in the API function, since I want to keep that function generic for the future, where both are important to ensure uniqueness.

@carolyncaron
Member Author

I successfully tested this branch on a development site with genotypic data from our diversity panel (the driving force behind this module, in fact!), spread over 8 files (one per chromosome):

  • 324 samples and germplasm (stocks were added by module, germplasm was pre-existing)
  • 327,564 genetic markers + 327,564 sequence variants
  • 104,723,915 non-missing SNPs
  • Chunk size: 1,000,000

Stats:

  • Database increased by ~29GB
  • Total cumulative time for loading: 23h 13m
  • Average SNPs/minute: 75,179

Overall, I would say this is a great improvement over my previous benchmark of ~2,000 SNPs/minute!!

@carolyncaron carolyncaron merged commit 94741e1 into master Jul 3, 2018
@carolyncaron carolyncaron deleted the 4-Improve-insert-speed-via-copy branch March 22, 2019 20:40