[issue/273] dbSNP 2 VCF reformat util #300

akotlar · 2023-10-18T04:43:20Z

Takes N dbSNP 2 VCF files (https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/), extracts every population from the FREQ=(.*) field into its own INFO field, drops the FREQ field, and drops the first allele for each population (which is the reference). It then writes the population-specific fields to the INFO header and updates the yml config file to point to the newly formatted VCF (the original VCF is not overwritten).

This is necessary because the dbSNP 2 VCF does not make good use of the VCF spec: the FREQ field combines multiple fields, each of which is an allelic type, but with the reference as the first allele, which is not the standard use.

This will enable us to reproducibly fetch, transform, and build dbSNP files from one yaml config, once we add one more utility that translates the RefSeq NC_* chromosome identifiers to chr1-22,X,Y,M.
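A minimal sketch of what that follow-up translation could look like. This is not part of this PR; the function name `refseq_to_chr` is hypothetical, and it assumes the human RefSeq accessions NC_000001 to NC_000024 for chromosomes 1-22, X, Y and NC_012920 for the mitochondrial genome.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of this PR: map a human RefSeq chromosome
# accession such as "NC_000001.11" to a "chr"-prefixed name.
sub refseq_to_chr {
    my ($accession) = @_;

    # The mitochondrial genome has its own accession.
    return 'chrM' if $accession =~ /\ANC_012920(?:\.\d+)?\z/;

    # NC_000001 .. NC_000024, with an optional ".version" suffix.
    return unless $accession =~ /\ANC_0000(\d\d)(?:\.\d+)?\z/;

    my $n = $1 + 0;
    return "chr$n" if $n >= 1 && $n <= 22;
    return 'chrX'  if $n == 23;
    return 'chrY'  if $n == 24;

    # Anything else (unplaced contigs, other accessions) is left to the caller.
    return;
}

print refseq_to_chr('NC_000001.11'), "\n";    # prints chr1
```

Returning nothing for unknown accessions leaves the caller free to decide whether to skip the record or fail loudly.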

Will remove [wip] once test added


poneill · 2023-10-20T15:37:33Z

perl/t/utils/dbsnp2FormatInfo.t

+ok(
+  <$fh>
+    == "NC_000001.11    10001   rs1570391677    T       A,C     .       .       RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;KOREAN=0.0109,.;SGDP_PRJ=0,.;dbGaP_PopFreq=.,0",
+  '1st data row wiht KOREAN, SGDP_PRJ, dbGap freqs are correctly processed'


nit: wiht -> with

bah :) need CI spellcheck, clearly

thanks

poneill · 2023-10-20T15:40:14Z

perl/lib/Utils/DbSnp2FormatInfo.pm

+  $self->{_localFiles} = $localFilesAref;
+}
+
+# TODO: error check opening of file handles, write tests


is this comment still in force? (looks like we have tests now for this module)

no it isn't, thanks

poneill · 2023-10-20T15:43:43Z

perl/lib/Utils/DbSnp2FormatInfo.pm

+
+    my $base_name = basename($input_vcf);
+    $base_name =~ s/\.[^.]+$//; # Remove last file extension (if present)
+    $base_name


how many different kinds of file extensions are we likely to have to handle here, and could we match explicitly on those instead of sedding off two extensions? The worry is someone uploading a file named something like foo.bar.vcf, which would get pared down to foo [?]
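A sketch of the explicit-match idea, assuming the only extensions to handle are .vcf optionally followed by .gz or .bgz. That list, and the helper name `vcf_base_name`, are assumptions for illustration rather than what the module does.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: strip only the extensions we expect, so that
# "foo.bar.vcf.gz" becomes "foo.bar" rather than "foo".
# The accepted suffixes are an assumption; adjust them to whatever the
# pipeline really accepts.
sub vcf_base_name {
    my ($path) = @_;
    my $name = basename($path);

    # Remove ".vcf", ".vcf.gz" or ".vcf.bgz" at the very end of the name only.
    $name =~ s/\.vcf(?:\.(?:gz|bgz))?\z//i;

    return $name;
}

print vcf_base_name('/data/foo.bar.vcf.gz'), "\n";    # prints foo.bar
```

A name with an unexpected extension is returned unchanged, which makes the unhandled case visible instead of silently trimming it.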

poneill · 2023-10-20T15:59:54Z

perl/lib/Utils/DbSnp2FormatInfo.pm

+      }
+
+      # If it's an INFO line
+      if (/FREQ=/) {


I'm probably missing context here but this comment / condition pair is a little surprising because we're not matching against INFO explicitly-- do all INFO lines in this file match /FREQ=/?

Yeah, I found this curious too. This is actually a clever solution, unless I'm missing something. ChatGPT came up with this, and I ended up keeping it because it was cute and quite elegant.

It checks rows for FREQ=. If it doesn't find that, there's nothing to do but add the row as is. FREQ isn't guaranteed to be in the INFO field. It is also not found in any other field, and never will be.

Then we split on ";". That will result in 1 field that contains everything up until the first INFO value, and the rest of the info values. We find the field containing FREQ=, extract the FREQ=(.*) value, expand the individual population POP=VAL as new INFO fields, join by ";" to the first field, which results in a complete file with correct delimiters.

I would have written this by splitting each row on "\t" and then searching INFO, but didn't see any downsides with this approach. FREQ is guaranteed to never appear 2x, though it could potentially appear 0 times, and this handles that case :)
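The steps described above can be sketched roughly as follows. This is an illustration, not the module's code: `reformat_line` is a hypothetical name, the line is assumed to be chomped already, and the FREQ layout is assumed to be POP:ref,alt,...|POP:ref,alt,... with the reference value first.

```perl
use strict;
use warnings;

# Hypothetical sketch of the split-on-";" approach.
sub reformat_line {
    my ($line) = @_;

    # Rows without FREQ= are passed through untouched.
    return $line unless $line =~ /FREQ=/;

    my @out;
    for my $piece ( split /;/, $line, -1 ) {

        # The first piece carries every column up to the first INFO entry,
        # so keep whatever precedes FREQ= as a prefix.
        if ( $piece =~ /\A(.*?)FREQ=(.*)\z/s ) {
            my ( $prefix, $freq ) = ( $1, $2 );
            my @expanded;

            for my $pop_entry ( split /\|/, $freq ) {
                my ( $pop, $values ) = split /:/, $pop_entry, 2;
                next unless defined $values;

                my @per_allele = split /,/, $values, -1;
                shift @per_allele;    # drop the reference value

                push @expanded, "$pop=" . join( ',', @per_allele );
            }

            push @out, $prefix . join( ';', @expanded );
        }
        else {
            push @out, $piece;
        }
    }

    return join( ';', @out );
}
```

For a made-up row ending in `RS=1;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.;VC=SNV`, this yields `RS=1;KOREAN=0.0109,.;SGDP_PRJ=1,.;VC=SNV`, with the leading columns untouched.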

INFO is a positional field so what Alex is searching for is some value within that field. I see the logic. I would use index to find "FREQ=" rather than a regular expression but that's just my preference.

My reading of the regexps is that they would catch some non-numeric characters and your tests suggest it catches trailing stuff. I'm not sure how important that is for you.
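To make both points concrete, here is a small sketch. `has_freq` and `is_valid_freq` are hypothetical names, and the accepted value format (a lone "." for missing, or a plain decimal number with an optional exponent) is an assumption about what the frequencies look like.

```perl
use strict;
use warnings;

# index() returns -1 when the substring is absent, so this is a plain
# substring test with no regular expression involved.
sub has_freq {
    my ($line) = @_;
    return index( $line, 'FREQ=' ) != -1 ? 1 : 0;
}

# Hypothetical stricter check: a per-allele value is either "." (missing)
# or a decimal number such as 0, 1, 0.0109 or 1e-05. Anything trailing,
# including a newline, makes the value invalid.
sub is_valid_freq {
    my ($value) = @_;
    return $value =~ /\A(?:\.|\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\z/ ? 1 : 0;
}

print is_valid_freq('0.0109'),  "\n";    # prints 1
print is_valid_freq('0.01abc'), "\n";    # prints 0
```

Whether invalid values should be rejected, logged, or passed through is a separate decision that this sketch does not make.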

poneill · 2023-10-20T16:07:01Z

perl/lib/Utils/DbSnp2FormatInfo.pm

+}
+
+# TODO: error check opening of file handles, write tests
+sub go {


This function seems like it's doing a lot of different things at a few different levels of abstraction. One way to address that might be to break it up into helpers roughly along the lines of:

input validation

setting up filepaths

the core nested loop where we process $in_fh into $output_data_fh

updating the VCF header

writing everything out and cleaning up

But I defer to your judgment as to whether the juice is worth the squeeze there?
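As an illustration of how small one of these helpers could be, here is a sketch of the header-update step. The name `info_header_lines`, the Description text, and the choice of Number=A with Type=Float are all assumptions (Number=A because the reference value is dropped, leaving one value per alternate allele); the module may well write something different.

```perl
use strict;
use warnings;

# Hypothetical helper for the "updating the VCF header" step: build one
# ##INFO line per population name. Number, Type and Description are
# assumptions, not what the module is known to emit.
sub info_header_lines {
    my (@populations) = @_;
    my @lines;

    for my $pop (@populations) {
        push @lines,
          sprintf(
            '##INFO=<ID=%s,Number=A,Type=Float,Description="Alternate allele frequencies for %s">',
            $pop, $pop );
    }

    return @lines;
}

print "$_\n" for info_header_lines( 'KOREAN', 'SGDP_PRJ' );
```

A pure function like this takes no file handles, so it can be unit tested without creating any VCF files.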

wingolab · 2023-10-20T21:48:59Z

perl/lib/Seq/Role/IO.pm

What do you think about adding or switching to checking magic strings to deduce the compression type, rather than matching extensions exclusively? File::LibMagic is a Perl package that binds to the C library; it seems like it would be a good fit and probably much easier to use than our own implementation for checking magic strings.

Sounds reasonable, though beyond the scope of this PR, as the file extension solution is used everywhere (the change here was to remove accepting $innerFile, which was not handled if provided). I have a tracking ticket #312 to evaluate the switch to File::LibMagic, scheduled for Sprint 4.
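For a sense of scale, a hand-rolled check for gzip alone is short. The sketch below does not use File::LibMagic; it only relies on the fact that gzip streams (bgzip output included) begin with the bytes 0x1f 0x8b. The name `looks_gzipped` is hypothetical, and other compression formats would each need their own signature.

```perl
use strict;
use warnings;

# Minimal sketch without File::LibMagic: report whether a file starts with
# the two-byte gzip signature 0x1f 0x8b.
sub looks_gzipped {
    my ($path) = @_;

    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    my $bytes_read = read( $fh, my $magic, 2 );
    close $fh;

    # Files shorter than two bytes cannot carry the signature.
    return 0 unless defined $bytes_read && $bytes_read == 2;

    return $magic eq "\x1f\x8b" ? 1 : 0;
}
```

This could sit alongside the extension check as a sanity test, for example to catch a .gz file that is actually plain text.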

wingolab · 2023-10-20T21:53:49Z

perl/lib/Utils/DbSnp2FormatInfo.pm

+use 5.10.0;
+use strict;
+use warnings;
+use DDP;


Do you still need DDP after development is done?

wingolab · 2023-10-20T21:57:44Z

perl/lib/Utils/DbSnp2FormatInfo.pm

+
+    my $output_data_fh = $self->getWriteFh($output_data_path);
+
+    $self->log( 'info', "Writing to $output_data_path" );


Everything to this point is getting setup to do actual work. This seems like a logical place to split apart the function into two sections, which will also aid testing.

wingolab · 2023-10-20T22:00:41Z

perl/lib/Utils/DbSnp2FormatInfo.pm

+            my $freq_data = $1;
+            my @pops      = split( /\|/, $freq_data );
+
+            foreach my $pop (@pops) {


This seems like the heart of what you're doing - how about making it a func that you'd test?

I sent you my suggestion for sharding the functions and testing via teams.

wingolab · 2023-10-21T02:13:06Z

perl/t/utils/dbsnp2FormatInfo.t

+# Read the processed file to check the INFO field
+$fh = path($expected_output_vcf_path)->openr;
+
+ok( <$fh> == "##fileformat=VCFv4.1", 'VCF fileformat is correctly processed' );


You're comparing strings using ==, and the interpreter gave warnings with prove -lv, which makes me concerned that the ok wasn't really evaluated or something along those lines, since you should expect '\n' characters if you're using <>. My suggestion is to write these as arrays of expected lines and cycle through them. Alternatively, I think we should use the eq operator and chomp before comparing the string.

e.g.,

$fh = path($expected_output_vcf_path)->openr;
my $str = <$fh>;
chomp $str;
ok( $str eq "##fileformat=VCFv4.1", 'VCF fileformat is correctly processed' );
$str = <$fh>;
chomp $str;
ok( $str eq "##INFO=<ID=RS,Number=1,Type=String,Description=\"dbSNP ID\">", 'RS population is correctly processed' );
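The array-of-expected-lines variant could look roughly like this. It is a self-contained sketch: a temporary file stands in for the reformatted VCF, and `read_chomped_lines` is a hypothetical helper. Note that == compares numerically, so two non-numeric strings both evaluate to 0 and compare as equal, which is likely why the original assertions passed with warnings.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Test::More;

# Hypothetical helper: read a file and return its lines without newlines.
sub read_chomped_lines {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    chomp( my @lines = <$fh> );
    close $fh;
    return @lines;
}

# Stand-in for the reformatted VCF that the real test would produce.
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "##fileformat=VCFv4.1\n",
  qq{##INFO=<ID=RS,Number=1,Type=String,Description="dbSNP ID">\n};
close $tmp_fh;

my @expected = (
    '##fileformat=VCFv4.1',
    '##INFO=<ID=RS,Number=1,Type=String,Description="dbSNP ID">',
);
my @got = read_chomped_lines($tmp_path);

# is() compares with eq and prints both values when they differ.
is( scalar @got, scalar @expected, 'same number of lines' );
is( $got[$_], $expected[$_], 'line ' . ( $_ + 1 ) . ' matches' )
  for 0 .. $#expected;

done_testing();
```

Using is() rather than ok() with eq means a failure shows the got and expected strings side by side.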

wingolab · 2023-10-21T02:20:31Z

perl/lib/Utils/DbSnp2FormatInfo.pm

+      }
+
+      # If it's an INFO line
+      if (/FREQ=/) {


INFO is a positional field so what Alex is searching for is some value within that field. I see the logic. I would use index to find "FREQ=" rather than a regular expression but that's just my preference.

My reading of the regexps is that they would catch some non-numeric characters and your tests suggest it catches trailing stuff. I'm not sure how important that is for you.

Addressed your comments Thomas

wingolab

Added commit to include the spec URL and a note on processing in the module. Testing could be more transparent but that could be addressed later with some minor refactoring.

akotlar added 3 commits October 17, 2023 00:39

wip dbsnp util

e381a56

rename utility for clarity

221ec00

working dbsnp2 reformatter

a104a5e

akotlar changed the title ~~[wip] Feature/273 dbSNP 2 VCF reformat utils~~ [wip] Feature/273 dbSNP 2 VCF reformat util Oct 18, 2023

akotlar added 3 commits October 18, 2023 00:45

cleanup

a9b0614

Merge branch 'master' of github.com:bystrogenomics/bystro into featur…

e81355d

…e/273_dbSNP_vcf_reformat_utils

add dbsnp reformat test

dbf6d92

akotlar changed the title ~~[wip] Feature/273 dbSNP 2 VCF reformat util~~ [issue/273] dbSNP 2 VCF reformat util Oct 19, 2023

akotlar requested a review from wingolab October 19, 2023 01:10

akotlar assigned wingolab Oct 19, 2023

cleanup

70d7abf

akotlar requested a review from cristinaetrv October 19, 2023 01:12

akotlar assigned cristinaetrv Oct 19, 2023

akotlar mentioned this pull request Oct 19, 2023

Annotation Sprint 2 Task List #273

Closed

5 tasks

akotlar added 3 commits October 20, 2023 00:20

Merge branch 'master' into feature/273_dbSNP_vcf_reformat_utils

63f8a69

tidy

3bddd45

Merge branch 'feature/273_dbSNP_vcf_reformat_utils' of github.com:ako…

6009aee

…tlar/bystro into feature/273_dbSNP_vcf_reformat_utils

poneill approved these changes Oct 20, 2023

View reviewed changes

wingolab previously requested changes Oct 21, 2023

View reviewed changes

akotlar added 3 commits October 21, 2023 18:53

address comments

fbb00bd

fix typo

0038814

Merge branch 'master' into feature/273_dbSNP_vcf_reformat_utils

ae3962f

akotlar requested a review from wingolab October 21, 2023 18:54

add specificaiton description

a977d43

wingolab approved these changes Oct 23, 2023

View reviewed changes

akotlar merged commit 37b4cfe into bystrogenomics:master Oct 23, 2023
4 checks passed
