Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Got wrong positions on chomped query maf subset #68

Closed
songtaogui opened this issue Sep 18, 2020 · 2 comments
Closed

Got wrong positions on chomped query maf subset #68

songtaogui opened this issue Sep 18, 2020 · 2 comments

Comments

@songtaogui
Copy link

Hi,

Thank you for developing this nice package.

However, I have encountered issues when subset maf with maf_extract_ranges_indexed.py, the query position of chomped maf seems incorrect:

For example, I have a REF.maf like:

##maf version=1
a score=1643
s Ref.3                            219977717 1643 + 235667834 CCTGTAAAATATATAATCTTGAGCAAACTAGTTAGTCCAATTATTTGTGTTGGGCATTTCAACCACCAAAATCATTTAGGAAAAAGTTTGACCCTATTTCCCTTTCAGGTATCAATATAATATAAAATTTGTTAATGTTTATATGTTTGTCATGAAATAAATATATGAATGAATAAGATATATTGTGTCTATAAATTTTTATTTTCATGAATGATGAAGGTGTGTCCTTCATGACCTTCGTCTGAAGATCGTTATATCCGAAGGAGATAATGCTTCGAAGAACAAAGGTCCTTAATACTTAACGATTGTGTTGTCTTGTTCTTGATTCACAACATTTGAGAACAAGTGACCAACAGCTCTTTTGAGCCTAACTCCTCTAACTTTTTTGGCATTTCTCGAGGATTCCTTACGACTTAGACAAACATAGTTAGCTTGTAAAACAATTGACTAAGCCTAGGAAACGTACCTTTTTTCACATAGTTTTTATAGGATTTGCAATATGCTCAAAACTAAGTCTAAAAGTCAATTTTCTTGCACAAAAGAGTCAGTAAATCAAGGTGTAGGAAAGATAGTATAAGGTGTGTGC-AACTTATTCAAACACTTAAATCTCATGTTTTATAGTTTTACCCAAAGTTGCACTTTTGGCCTTTATTTCCATTCTTTTGAACTTGTTGCCTTTAATTTGCTTTTGCTAGAAGCCTAGAACGTGTGCTCAAGCTGGCATCAAAGTAATTAGTCTGATGTGCTATCACTATGGGTGACAATAAGCTCTAAATTTTACGCTATAAGATTTAAGAATCAGATCAGATTAGAATCAGACCATATTTTTATTCATTTTTGAACTAAAATTTATTTAGGGCACTACCATTTTGTGAAGAAACATTTGGATCGTGATCCATTACCACTCCTAACTGTCACTTGAATGTGAATCATTACCATTGTCATTTGCGTTACAATTCGTAAATCAAACAACTTGTTTGAACCAAATAACTTTCTACGTCATCCTCTACCGTGCACAACTTACATCAACTTTTTCACAAGCCACGACTTAAGGGGTGTTTGAATGCAATAGAGCTAATAGTTAGCTGTTAAAATTAGCTGAAGACATCTAAACAATATAGCTAATAGTTCAGCTATTAGCTATTTTTAGCAAATTAGCTAATAGTTAGCTAGCTATTTGTTAGCTAGCTAATTTCACCGGCAA-TTTTTTGTCAACTAACTATTAGCTCTAATGCATTGAAACACCATCTTTATTAAGGCCCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAAAAGAGCTAATGATGGTGTGATAAAATTTAGCACATATTATCTCATAAACTGCTCAAGAGATGCTAAAAGTAGCTAAACGTGGGCCAGGAACCCATTTTTCATCCATACCCTCCAACATCATTCCCTCT--CTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTACTAAACTTTATTAGAGTAGTCCAAATACCATAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTGAAGTTTAGCAGGAAGCTAAAATTTATCCTGTGAGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC
s query                                 2314 1632 -      3946 CCTGTAGAATATATAATCTTGAGCAAACTAGTTAGTCCAATTATTTGTGTTGGGCATTTCAACCACCAAAATCATTTAGGAAAAGGTTTGACCCTATTTCCCTTTCAGATATCAATATAATATAAACTTTGTTAATGTTTATATGTTTGTCATGAAATAAATATATGAATGAATAAGATATATTGTGTGTATAAATTTTTATTTTCATGAATGATGAAGGTGTGTCCTTCATGACCTTCGTCTGAAGATCGTTATATCCGAAGGAGATAATGCTTCGAAGGACGAAGGTCCTTAATACTTAACGATTGTGTTGTCTTGTTCTTGATTCACAACATTTGAGAACAAGTGACCAACAGCTCTTTTGAGCCTAACTCCTCTAACTTTTTTGGCATTTCTCGAGGGTTCCTTACGACTTAGACAAACATAGTTAGCTTGTAAAACAATTGACTAAGCCTAGGAAACATACC-TTTTTCACATAGTTTTTATAGGATTTGCAATATGCTCAAAACTAAGTCCAAAAGTCAATTTTCTTGCACAAAAGAGTTAGTAAATCAAGGTGCAGGAAAGATAGTATAAGATGTGTGCTAACTTATTCAAACACTTAAATCTCATGTTTTATAGTTTTACCCAAAGTTGCACTTTTGGCCTTCATTTCCATTCTTTTGAACTTGTTGCCTTTAATTTGCTTTTGCTAGAAGCCTAGAACGTGTGCTCAAGCTGACATCAAAGTAATTAGTCTGATGTGCTATCACTATGGGTGACAATAAGCTCTAAATTTTACAATATAAGATTTAATGATCAGATCAGATTAGAATCGGACCATATTTTTATTCATTTTTGAACTAAAATTTATTTAGGGCACTACCATTTTGTGAAGAAACATTTGGATCGTGATCCATTACCACTCCTAGCTGTCACTTGAGTGTGAATCATTATCATTGTCATTTGCGTTACAATTCGTAAATCAAACAACTTGTTTGAACCAAATAATTTTCTACGTCATCCTCTACCGTGCACAACTTACATCAACTTTTTCACAAGCCACGACTTAAGGGGTGTTTGAATGCAATAGAGCTAATAGTTAGCTGTTAAAATTAGCTGAAGACATCTAAACAATATAGCTAATAGTTCAGCTATTAGCTATTTTTAGCAAATTAGCTAATA-------------GTTAGCTAGCTAATTCCACCGACAATTTTTTTGTCAACTAACTATTAGCTCTAATGCATTTAAACACCATCTTTATTAAGGCTCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAGGAGAGCTAATGGTGGTGTGATAAAATTTAGCACATATTATCTCATAAGCTGCTCAAGAGGTGCTAAAAGTAGCTAAACGTGGACCAGGAAACCCCATTTTCCATCCATA-CCTCCAACATCATTCCCTCTCCCTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTGCTAAACTTTATTAGAGTAGTCCAAACACCACAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTAAAGTTTAGCAAGAAGCTAAAATTTATCCTGTGGGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC

And I was trying to subset the maf based on a region with command:

# >cat test.bed
# Ref.3   219978958       219979958

maf_extract_ranges_indexed.py -c REF.maf < test.bed > test.maf

The output test.maf was like:

##maf version=1
a score=1643
s Ref.3 219978958 402 + 235667834 ACACCATCTTTATTAAGGCCCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAAAAGAGCTAATGATGGTGTGATAAAATTTAGCACATATTATCTCATAAACTGCTCAAGAGATGCTAAAAGTAGCTAAACGTGGGCCAGGAACCCATTTTTCATCCATACCCTCCAACATCATTCCCTCT--CTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTACTAAACTTTATTAGAGTAGTCCAAATACCATAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTGAAGTTTAGCAGGAAGCTAAAATTTATCCTGTGAGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC
s query      3543 403 -      3946 ACCATCTTTATTAAGGCTCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAGGAGAGCTAATGGTGGTGTGATAAAATTTAGCACATATTATCTCATAAGCTGCTCAAGAGGTGCTAAAAGTAGCTAAACGTGGACCAGGAAACCCCATTTTCCATCCATA-CCTCCAACATCATTCCCTCTCCCTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTGCTAAACTTTATTAGAGTAGTCCAAACACCACAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTAAAGTTTAGCAAGAAGCTAAAATTTATCCTGTGGGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC

However, the exact position of the aligned query sequence was nothing like the position reported in the test.maf:

# the query whole sequence was stored in query.fa, locating the aligned block in the test.maf with seqkit locate function:
# > seqkit locate --gtf -p "ACCATCTTTATTAAGGCTCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAGGAGAGCTAATGGTGGTGTGATAAAATTTAGCACATATTATCTCATAAGCTGCTCAAGAGGTGCTAAAAGTAGCTAAACGTGGACCAGGAAACCCCATTTTCCATCCATACCTCCAACATCATTCCCTCTCCCTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTGCTAAACTTTATTAGAGTAGTCCAAACACCACAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTAAAGTTTAGCAAGAAGCTAAAATTTATCCTGTGGGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC" query.fa -o test_location.gtf
# > cat test_location.gtf
query   SeqKit  location        2315    2717    0       -       .       gene_id "ACCATCTTTATTAAGGCTCTGTTTCAATCCCACTAGATAAACTTTAGCAACTTTTTTAGCTACTTTTAGCCATTTAGTGATCTAAACAGGAGAGCTAATGGTGGTGTGATAAAATTTAGCACATATTATCTCATAAGCTGCTCAAGAGGTGCTAAAAGTAGCTAAACGTGGACCAGGAAACCCCATTTTCCATCCATACCTCCAACATCATTCCCTCTCCCTCCGCATTAAGGGCATATGTGTCTTTTTATTCCAATGTGCTAAACTTTATTAGAGTAGTCCAAACACCACAAGGAGCTAAACTTTAGCACTTCAATTTATATGGCTAAAGTTTAGCAAGAAGCTAAAATTTATCCTGTGGGATTGAAACAGGGCCTAAATATATAGCCTCCCTTATCCGTAC";

It seems to me that maf_extract_ranges_indexed.py takes the left part of the chomped maf block as the right part when the query strand is "-".

Did I misunderstand something ? How could I get the correct postions when chomping maf ?

Please find attached the aforementioned files to reproduce the result:
chomp_test.tar.gz

Any suggestion would be grateful.

Best wishes,

Songtao Gui

@rsharris
Copy link
Contributor

Usually reports like this are due to a user misunderstanding of the MAF format, for which treatment of the reverse strand is not necessarily obvious. See http://genome.ucsc.edu/FAQ/FAQformat#format5 . I'm pretty sure MAF originated at UCSC, so this should be considered as the formal specification of the format.

I haven't looked at the details of your post, so I'm not 100% positive this is what is happening. And I do appreciate that your post seems to be thorough. But your conclusion is consistent with the usual misunderstanding of MAF. I had similar problems in my own early encounters with MAF. Could you check to see whether this explains what you are seeing?

@songtaogui
Copy link
Author

@rsharris

Yes, you are right.

After another look at my MAF file, it turns out that the start postion of the query was always based on the positive line regardless of the strand. After changing to the reverse complement line based start position, everything just works fine !

Sorry for the bothering, and thank you for the help.

Best,

Songtao Gui

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants