Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[issue/273] dbSNP 2 VCF reformat util #300

Merged

Conversation

akotlar
Copy link
Collaborator

@akotlar akotlar commented Oct 18, 2023

  • Takes N DbSNP 2 VCF files (https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/), extracts every population from the Freq=(.*) field in a separate INFO field, drops the Freq field, drops the first allele for each population (which is reference). Then it writes the population-specific fields to the info header, and updates the yml config file to point to the newly formatted vcf (the original vcf will not be overwritten).

This is necessary because the dbSNP 2 vcf does not make good use of the VCF spec; the Freq field is the combination of multiple fields, each of which is an Allelic type, but where the first allele is the reference, which is not the standard use.

This will enable us to reproducibly fetch, transform, build dbSNP files, from 1 yaml config, once we add 1 more utility, which translates the RefSeq NC_* chromosome identifiers to chr1-22,X,Y,M.

Will remove [wip] once test added

@akotlar akotlar changed the title [wip] Feature/273 dbSNP 2 VCF reformat utils [wip] Feature/273 dbSNP 2 VCF reformat util Oct 18, 2023
@akotlar akotlar changed the title [wip] Feature/273 dbSNP 2 VCF reformat util [issue/273] dbSNP 2 VCF reformat util Oct 19, 2023
@akotlar akotlar requested a review from wingolab October 19, 2023 01:10
ok(
<$fh>
== "NC_000001.11 10001 rs1570391677 T A,C . . RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;KOREAN=0.0109,.;SGDP_PRJ=0,.;dbGaP_PopFreq=.,0",
'1st data row wiht KOREAN, SGDP_PRJ, dbGap freqs are correctly processed'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wiht -> with

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bah :) need CI spellcheck, clearly

thanks

$self->{_localFiles} = $localFilesAref;
}

# TODO: error check opening of file handles, write tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment still in force? (looks like we have tests now for this module)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it isn't, thanks


my $base_name = basename($input_vcf);
$base_name =~ s/\.[^.]+$//; # Remove last file extension (if present)
$base_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how many different kinds of file extensions are we likely to have to handle here, and could we match explicitly on those instead of sedding off two extensions? The worry is someone uploading a file named something like foo.bar.vcf, which would get pared down to foo [?]

}

# If it's an INFO line
if (/FREQ=/) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing context here but this comment / condition pair is a little surprising because we're not matching against INFO explicitly-- do all INFO lines in this file match /FREQ=/?

Copy link
Collaborator Author

@akotlar akotlar Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I found this curious too. This is actually a clever solution, unless I'm missing something. ChatGPT came up with this, and I ended up keeping it because it was cute and quite elegant.

It checks rows for the FREQ=. If it doesn't find that, there's nothing to do but add the row as is. FREQ isn't guaranteed to be in the INFO field. It is also not found any other field, and never will be.

Then we split on ";". That will result in 1 field that contains everything up until the first INFO value, and the rest of the info values. We find the field containing FREQ=, extract the FREQ=(.*) value, expand the individual population POP=VAL as new INFO fields, join by ";" to the first field, which results in a complete file with correct delimiters.

I would have written this by splitting each field by "\t", then search INFO, but didn't see any downsides with this approach. FREQ is guaranteed to never appear 2x, though it could potentially appear 0 times, and this handles that case :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INFO is a positional field so what Alex is searching for is some value within that field. I see the logic. I would use index to find "FREQ=" rather than a regular expression but that's just my preference.

My reading of the regexps is that they would catch some non-numeric characters and your tests suggest it catches trailing stuff. I'm not sure how important that is for you.

}

# TODO: error check opening of file handles, write tests
sub go {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems like it's doing a lot of different things at a few different levels of abstraction. One way to address that might be to break it up into helpers roughly along the lines of:

  • input validation
  • setting up filepaths
  • the core nested loop where we process $in_fh into $output_data_fh
  • updating the VCF header
  • writing everything out and cleaning up

But I defer to your judgment as to whether the juice is worth the squeeze there?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding or switching to checking magic strings to deduce the compression type rather than matching extensions exclusively? File::LibMagic is a Perl package that binds to the c library that seems like it would be a good fit and probably much easier to use than our own implementation for checking magic strings.

Copy link
Collaborator Author

@akotlar akotlar Oct 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable, though beyond the scope of this PR, as the file extension solution is used everywhere (the change here was to remove accepting $innerFile, which was not handled if provided). I have a tracking ticket #312 to evaluate the switch to File::LibMagic, scheduled for Sprint 4.

use 5.10.0;
use strict;
use warnings;
use DDP;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need DDP after development is done?


my $output_data_fh = $self->getWriteFh($output_data_path);

$self->log( 'info', "Writing to $output_data_path" );
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything to this point is getting setup to do actual work. This seems like a logical place to split apart the function into two sections, which will also aid testing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

my $freq_data = $1;
my @pops = split( /\|/, $freq_data );

foreach my $pop (@pops) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the heart of what you're doing - how about making it a func that you'd test?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sent you my suggestion for sharding the functions and testing via teams.

# Read the processed file to check the INFO field
$fh = path($expected_output_vcf_path)->openr;

ok( <$fh> == "##fileformat=VCFv4.1", 'VCF fileformat is correctly processed' );
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're comparing a string using == the interpreter gave warnings with prove -lv, which makes me concerned that the ok wasn't really evaluated or something along those lines since you should expected '\n' characters if you're using <>. My suggestion is to write these as arrays of expected lines and cycle through them. Alternatively, I think we should use the eq operator and chomp before comparing the string.

e.g.,

$fh = path($expected_output_vcf_path)->openr;

my $str = <$fh>;
chomp $str;
ok( $str eq "##fileformat=VCFv4.1", 'VCF fileformat is correctly processed' );

$str = <$fh>;
chomp $str;
ok( $str eq "##INFO=<ID=RS,Number=1,Type=String,Description=\"dbSNP ID\">",
  'RS population is correctly processed' );

}

# If it's an INFO line
if (/FREQ=/) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INFO is a positional field so what Alex is searching for is some value within that field. I see the logic. I would use index to find "FREQ=" rather than a regular expression but that's just my preference.

My reading of the regexps is that they would catch some non-numeric characters and your tests suggest it catches trailing stuff. I'm not sure how important that is for you.

@akotlar akotlar dismissed wingolab’s stale review October 21, 2023 18:54

Addressed your comments Thomas

@akotlar akotlar requested a review from wingolab October 21, 2023 18:54
Copy link
Collaborator

@wingolab wingolab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added commit to include the spec URL and a note on processing in the module. Testing could be more transparent but that could be addressed later with some minor refactoring.

@akotlar akotlar merged commit 37b4cfe into bystrogenomics:master Oct 23, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants