
Improvements to speed by inserting genotype calls via /COPY #19

Merged
carolyncaron merged 19 commits into master from 4-Improve-insert-speed-via-copy on Jul 3, 2018

Conversation

carolyncaron
Member

Issue #4

Description

Two new functions were added to the API to handle copying genotypic calls from VCF format into the genotype_call table. I opted for the COPY method, as per Lacey's suggestion: genotype calls get saved to a CSV file and loaded into the database once 10,000 records are reached, then rinse and repeat until all calls are loaded. The module ensures that any calls remaining in the file once the VCF is fully processed also get loaded into the database.
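As a rough sketch of that flow (hypothetical names like flush_chunk_via_copy, $csv_path and $vcf_calls - the real implementation lives in the two new API functions):

$chunk_size = 10000;
$buffered = 0;
$csv = fopen($csv_path, 'w');

foreach ($vcf_calls as $call) {
  // One genotype call per CSV row, in genotype_call column order.
  fputcsv($csv, $call);
  $buffered++;
  if ($buffered >= $chunk_size) {
    // Hypothetical helper: runs psql's \copy on the CSV, then empties the file.
    flush_chunk_via_copy($csv, $csv_path);
    $buffered = 0;
  }
}

// Any calls remaining once the VCF is fully processed get loaded too.
if ($buffered > 0) {
  flush_chunk_via_copy($csv, $csv_path);
}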

This change was crucial to ensure we can load enormous amounts of genotypic data in a reasonable amount of time. ;-)

Please note the following:

  • Testing with very large datasets is currently in progress, and further optimization may be required
  • This functionality is currently specific to the genotype_call table method ONLY, and will only work for VCF file inputs (implementing this for the other input formats should be trivial, however - let me know if there is demand for it!)

Testing?

I implemented my changes in a Cloud9 environment (see pull request #18 for how I set up my environment); however, Cloud9 appears to struggle with command-line execution of the copy command (also Lacey's experience). My final testing took place on the clone site using cats.list and cats.vcf as my input files. No cvterms need to be added/configured, but I added organisms, a chromosome and a project as demonstrated in #18. I also set up a .pgpass file in my home directory with connection details for the clone's database (see https://gist.github.com/sabman/978352, and the example after the command below); this ensures you don't get prompted for a connection password for every batch being loaded. This step is technically optional, since the functionality still works without it, but it is annoying to omit. I loaded genotypes using the command:
drush load-genotypes sample_files/cats.vcf sample_files/cats.list --marker-type="genetic_marker" --variant-type="SNP" --organism="Felis catus" --project-name="My SNP Discovery Project" --ndgeolocation="here"
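For reference, the .pgpass file mentioned above uses PostgreSQL's standard hostname:port:database:username:password format (the values here are placeholders):

localhost:5432:clone_db:drupal_user:secret_password

followed by chmod 600 ~/.pgpass, since psql ignores the file if it is group- or world-readable.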

So far I have tested with copy_chunk_size both greater than and less than the number of records I expect (with cats.vcf: 22 records expected). More test cases will be reported soon! For now, a code review would be immensely helpful 😊

Member

@laceysanderson laceysanderson left a comment


Overall the code looks good :-) See specific comments.

* either "handle errors" or not- this way, if we see a certain error (such as not unique), we can handle it
* from outside of the function.
*
* Also - don't forget to refer to Lacey's new helper function on Github. This function needs to evolve
* regardless! Whether or not we go with copying from a file remains to be seen...
*/

Member


This function docblock needs a lot more documentation ;-)... Please say what the function does and describe its parameters. Also, make sure there's no empty line between the docblock and its function.

Member Author


The docblock above this part describes the function - let me know if it needs more description. I moved the @todo into the previous block and removed the empty line in commit 1277bb2.

elseif ($mode == $both) {

  // If we want to insert values, we can merge our values to have all the information we need.
  $values = array_merge($select_values, $insert_values);
Member


So $insert_values doesn't stand-alone? This should be documented in the function docblock.

Member Author


Addressed in commit 1277bb2 👍
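For anyone reading along, a docblock along these lines (a sketch inferred from the signature below, not the committed text) would cover both points:

/**
 * Buffers records into a CSV file and COPYs them into a chado table in chunks.
 *
 * @param $record_count
 *   The running count of records buffered so far.
 * @param $record_type
 *   A human-readable name for the type of record being copied (e.g. 'genotype call').
 * @param $table
 *   The chado table the records will be copied into.
 * @param $insert_values
 *   The values for a single record to be buffered, keyed by column.
 * @param $file_name
 *   The CSV file to buffer records into; generated from the table and record
 *   type when NULL.
 * @param $final_chunk
 *   TRUE once the input file is fully processed, forcing any remaining
 *   records to be copied regardless of the chunk size.
 */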

function genotypes_loader_helper_copy_records($record_count, $record_type, $table, $insert_values = array(), $file_name = NULL, $final_chunk = FALSE) {

  // The number of records we want to reach before copying them all in at once.
  $copy_chunk_size = 10000;
Member


I would allow this to be set via variable_get/set().

  1. Change this line to $copy_chunk_size = variable_get('genotypes_loader_cp_chunk_size', 10000);
  2. Set it via variable_set() in the first drush function after exposing an option to the user (see the sketch after this list).
    This way you don't have to worry about passing the value through the chain of functions while still giving your user a way to change it per file.
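For what it's worth, a sketch of that pattern (the --chunk-size option name is hypothetical):

// In the drush command callback, expose an option and store it once per run.
$chunk_size = drush_get_option('chunk-size', 10000);
variable_set('genotypes_loader_cp_chunk_size', $chunk_size);

// Then inside genotypes_loader_helper_copy_records(), read it back.
$copy_chunk_size = variable_get('genotypes_loader_cp_chunk_size', 10000);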

Member Author


I created issue #20 to address this!

  $record_type = str_replace(' ', '_', $record_type);
  $record_type = strtolower($record_type);
  $file_name = $file_stub . $table . '-' . $record_type . '.csv';
}
Member


The generated filename should include the database you're copying into; otherwise you will end up with collisions when loading a file on portal and testing the next one on clone. The connection information is in global $databases, so just dpm() that to find the name.
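Something along these lines would do it (a sketch, not the committed change):

// Pull the database name from Drupal's connection info and include it in the
// generated filename so portal and clone never write to the same file.
global $databases;
$db_name = $databases['default']['default']['database'];
$file_name = $file_stub . $db_name . '.' . $table . '-' . $record_type . '.csv';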

Member Author

@carolyncaron carolyncaron Jun 14, 2018


Done in commits b0c01bd and 034a87b!

// Otherwise, open the file to write to, since this is needed by fputcsv.
// @TODO "Lacey: I'm concerned this might be a point of slowness. Perhaps there is a better
// way to create the CSV that doesn't require fputcsv and is more reliable
// than simply imploding the values with commas in between."
Member


I don't know of a better way to do this :-( We'll have to see how it goes...
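For the record, the implode alternative from the @TODO would look like this (assuming $csv_handle is the open file handle); it skips fputcsv's quoting, so it is only safe if the values can never contain commas, quotes or newlines:

// Write the row directly, without CSV escaping.
fwrite($csv_handle, implode(',', $insert_values) . "\n");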


genotypes_loader_remote_copy($copy_command, $file_name);

// Wipe the file clean by reopening it as write-only
Member


This might be another point of slowness although less of a concern since it only happens once per chunk.
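If reopening ever does show up as a bottleneck, truncating the existing handle in place is one alternative (a sketch; assumes $csv_handle is the open file handle):

// Empty the file without a fresh fopen(), then reset the pointer for the next chunk.
ftruncate($csv_handle, 0);
rewind($csv_handle);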

* security risks in a web setting. We use a combination of drush sql-cli,
* which opens a psql session, and psql's \copy command to take advantage of
* the fact that Tripal Jobs are run on the command-line.
*
Member


This needs to be re-worded as it's currently specific to nd_genotypes. Perhaps

PostgreSQL COPY is extremely effective at copying/inserting large amounts of data. This module uses it to make inserting massive genotype files more efficient. However, using chado_query('COPY...') poses many security risks in a web setting. Instead, we use Drupal to determine the psql connection string, and psql's \copy command, to take advantage of the fact that Tripal Jobs are run on the command line.

Note: Bolded sections indicate my changes.
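As a sketch of how that looks on the command-line side (not necessarily the module's exact code; connection details come from .pgpass rather than being hard-coded):

// Build the client-side \copy command for this CSV chunk.
$copy_command = "\\copy " . $table . " FROM '" . $file_name . "' WITH CSV";
// drush sql-connect prints the connection string for the current site's database.
$psql = trim(shell_exec('drush sql-connect'));
shell_exec($psql . ' -c "' . $copy_command . '"');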

Member Author


Also addressed in commit 1277bb2


// Set the file name to be used for remote copy.
if ($options['storage_method'] == 'genotype_call') {
  $remote_copy_filename = '/tmp/genotypes_loader.remotecopy.csv';

Member Author


Done in commit 034a87b. I opted to keep only the table name, since in this particular case including both would be redundant (the table name and record name would both be 'genotype_call'). I am still using both table and record in the API function, since I want to keep that function generic for the future, where both are important to ensure uniqueness.

@carolyncaron
Member Author

I successfully tested this branch on a development site with genotypic data from our diversity panel (the driving force behind this module, in fact!), spread over 8 files (one per chromosome):

  • 324 samples and germplasm (stocks were added by module, germplasm was pre-existing)
  • 327,564 genetic markers + 327,564 sequence variants
  • 104,723,915 non-missing SNPs
  • Chunk size: 1,000,000

Stats:

  • Database increased by ~29GB
  • Total cumulative time for loading: 23h 13m
  • Average SNPs/minute: 75,179

Overall, I would say this is a great improvement over my previous benchmark of ~2,000 SNPs/minute!!

@carolyncaron carolyncaron merged commit 94741e1 into master Jul 3, 2018
@carolyncaron carolyncaron deleted the 4-Improve-insert-speed-via-copy branch March 22, 2019 20:40