Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert mutated_peptide in peptide_info failed #6

Closed
choosehappy opened this issue Oct 8, 2017 · 13 comments
Closed

Assert mutated_peptide in peptide_info failed #6

choosehappy opened this issue Oct 8, 2017 · 13 comments

Comments

@choosehappy
Copy link

In on some samples (it works on others) the python script failed with this error:

File "MuPeXI.py", line 79, in main
peptide_info, peptide_counters, fasta_printout, pepmatch_file_names = peptide_extraction(peptide_length, vep_info, proteome_reference, genome_reference, reference_peptides, reference_peptide_file_names, input_.fasta_file_name, paths.peptide_match, tmp_dir, input_.webserver, input_.print_mismatch, input_.keep_temp, input_.prefix, input_.outdir, input_.num_mismatches)
File "MuPeXI.py", line 720, in peptide_extraction
peptide_info, pepmatch_file_names = normal_peptide_correction(mutated_peptides_missing_normal, mutation_info, p_length, reference_peptide_file_names, peptide_info, peptide_match, tmp_dir, pepmatch_file_names, webserver, print_mismatch, num_mismatches)
File "MuPeXI.py", line 825, in normal_peptide_correction
assert mutated_peptide in peptide_info_

I’ve narrowed it down to one indel:
chr14 22482290 . T TCCG

and use the following command line:
python MuPeXI.py -v /data/DO/p2/final/2017-09-21_p2/batch2-ensemble-annotated_indel.vcf.recode.vcf -a HLA-A02:06 -c ./config.ini -d /data/DO/p2/final/2017-09-21_p2/mupexi/ -l 9

The value of mutated_peptide is “XIRQGAQKL”

and the value of peptide_info is (which doesn't contain the peptide):

defaultdict(<type 'dict'>, {'RQGAQKLVF': {'IQGAQKLVF': [mutation_info(gene_id='ENSG00000211836', trans_id='ENST00000390484',
mutation_consequence='inframe_insertion', chr='14', pos='22482290-22482291', cdna_pos='4-5', prot_pos=2, prot_pos_to=None, aa_normal='I', aa_mut='IR',
codon_normal='att', codon_mut='atCCGt', alt_allele='CCG', symbol='TRAJ54'), peptide_sequence_info(chop_normal_sequence='-----------',
mutation_sequence='XIRQGAQKLVF', normal_sequence='XIQGAQKLVF', mutation_position='2:4', consequence='I'), '1:2', pep_match_info(normal_peptide='IQGAQKLVF',
mismatch=1, mismatch_peptide='---------')]}, 'IRQGAQKLV': {'IRQGKAKLV': [mutation_info(gene_id='ENSG00000211836', trans_id='ENST00000390484',
mutation_consequence='inframe_insertion', chr='14', pos='22482290-22482291', cdna_pos='4-5', prot_pos=2, prot_pos_to=None, aa_normal='I', aa_mut='IR',
codon_normal='att', codon_mut='atCCGt', alt_allele='CCG', symbol='TRAJ54'), peptide_sequence_info(chop_normal_sequence='-----------',
mutation_sequence='XIRQGAQKLVF', normal_sequence='XIQGAQKLVF', mutation_position='2:4', consequence='I'), '1:3', pep_match_info(normal_peptide='IRQGKAKLV',
mismatch=2, mismatch_peptide='---------')]}})

Any thoughts?

Cheers,
Andrew

@choosehappy
Copy link
Author

By the way, the same issue occurs when submitting that VCF to your web-server:

image

with the same vcf file (attached, rename from txt to vcf):

batch2-ensemble-annotated_indel.vcf.recode.txt

Cheers,
Andrew

@ambj
Copy link
Owner

ambj commented Oct 9, 2017

Hi Andrew

It looks like an insertion exception that I have not accounted for. I can also see that it must be an somewhat odd transcript since it includes X in the normal sequence.
I will look into this, and try to fix the bug. Would you be able to provide me with a test VCF file including the error prone insertion?

Best,
Anne-Mette

@choosehappy
Copy link
Author

Thanks for responding so quickly

Just to confirm, do you see the vcf file attached in the second comment (as a txt file; it won't be let me upload .vcf files)? If not I can email or host it somewhere else.

That file contains only the variant which is causing the error, if you'd like I can include a larger VCF?

From my side, I don't think anything exceptional has been done to generate these variants. I used a straightforward normal vs tumor calling on WES NGS reads pipeline to produce the vcf

@choosehappy
Copy link
Author

Actually, I also made a small change to the python script to support more types of variant callers (freebayes, varscan, vardict, mutect).

You can find the gist of the difference here:

https://gist.github.com/choosehappy/e54f23e41ad9246bdafd4d4ceb4e97d7

Feel free to incorporate those changes if you'd like (i can submit a pull request if you prefer)

@ambj
Copy link
Owner

ambj commented Oct 9, 2017

Ah yes, thanks! and it is fine with the single mutation.
No, it dont think anything is out of the ordinary with the variant calling, but sometimes the variants are called in for example pseudo genes where the reference sequence is odd (including X's). Here some exceptions needs to be taken into account. The more VCF files are run trough MuPeXI the higher the likelihood of finding an exception to the ordinary which is not encountered for in the current version.
And thank you for the edits - that have been on my todo for some time.

@choosehappy
Copy link
Author

Not a problem, glad to help.

In particular, those edits were necessary because we're using ensemble calls from the bcbio variant calling pipeline, and i discovered that the FORMAT fields for each mutation are not of the same format. whichever caller "first" called the mutation has its format put into the final file once confirmed by a second caller, so a robust approach which accepts "any" format on the fly was necessary

@choosehappy
Copy link
Author

So I've stumbled across a set of peptides which also fail that assertion.
I looked into trying to figure out which mutations are causing them but they're more difficult to track down than the one mentioned above

Anyway, these are the ones which raise an error, is this sufficient information?

CISTSRSSGGVS
CAPCISTSR
SSG
ISTSRSSGGVSI
PCISTSR
SSGGV
SRSSGGVSILHC
STSR
SSGGVSIL
SSGGVSILHCHN
TSR
SSGGVSILH
RSSGGVSILHCH
APCISTSR
SSGG
VCAPCISTSR*SS

@ambj
Copy link
Owner

ambj commented Oct 27, 2017

Dear Andrew

I'm very sorry for not solving the course of this error sooner. I'm currently writing up my PhD thesis which is to be handed in, in the middle of december. Fore this reason im very sorry to inform you that i will not be able to solve this problem in the near future, but it is on my todo and will be solved as soon as possible.
I hope this does not course you, or any other user experiencing the same error, too much trouble.

Best,
Anne-Mette

@choosehappy
Copy link
Author

Thank you for updating us on your situation; it is certainly understandable. Is it possible for you to provide a time estimate? say January 2018?

In the meantime, when that assertion is triggered, I've had Mupexi simply skip it. Besides not considering that particular neuropeptide, are there any other (unforeseen) consequences?

@ambj
Copy link
Owner

ambj commented Oct 27, 2017

As a time estimate I hope to have MuPeXI updated by the beginning of February 2018.

I would not recommend avoiding the assertions in the code, as these check that the information you seek in the different sets or directories actually exists. I would rater identify the mutations in the VCF file causing the errors and exclude these from the analysis until the bug is fixed.
With that said I think skipping these assertions would only leave you with odd outputs with empty fields in the output file but, but I'm not completely confident in this statement. In the worst case a shift would occur which would make the entire output useless, instead of just missing out in a few peptides predicted.

@choosehappy
Copy link
Author

I see, thank you for that insight. I'll have to be careful.

I was wondering if you knew of a quick way to identify which mutation is causing the error? The way the normal_peptide_correction function is written, which is at the peptide level, its not obvious to me which mutation from the vcf file created the problematic peptide?

For the first error i reported, it was easy to find since "pep_match" only contained a single mutation, but now that variable contains many mutations.

Is there a way to know which one should be removed from the vcf file?

@ambj
Copy link
Owner

ambj commented Oct 30, 2017

It is not strait forward - but it would be a great idea to print the valuable information for the assertion errors. What i normally to when i trouble shoot is print information needed to find the troubling mutation.

From the mutation info generated from the VEP file you cen print the information needd from the VCF file.
build_vep_info (line 539):
Mutation_Info = namedtuple('mutation_info', ['gene_id', 'trans_id', 'mutation_consequence','chr', 'pos', 'cdna_pos', 'prot_pos', 'prot_pos_to', 'aa_normal', 'aa_mut', 'codon_normal', 'codon_mut', 'alt_allele', 'symbol'])

the mutation info is also used in the normal_peptide_correction function. therefor i would put in the following line, right before the error occurs. the last information printet is the the mutation causing the problem.
print( '{}/t{}/t{}/t{}/t{}'.format(mutation_info.chr, mutation_info.pos, mutation_info.codon_normal, mutation_info.codon_mut, mutation_info.alt_allele))

From this you can look it up in your VCF file. There might be some incompatibilities between VEP and VCF, how the exact genomic position is counted.

I know its not the smoothes solution, but it might help you along the way to begin with.

@ambj
Copy link
Owner

ambj commented Apr 17, 2018

Hi Andrew
I have fixed this bug in the newest version of MuPeXI on the dev branch. This will be merged with the master branch later this week with the new release.
I tested the mutation you reported and could not reproduce the error in the newest version og MuPeXI.
I hope this have solved your problem

Best,
/AM

@ambj ambj closed this as completed Apr 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants