Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First trimmer script not completing #1

Closed
Astahlke opened this issue Jun 25, 2021 · 12 comments
Closed

First trimmer script not completing #1

Astahlke opened this issue Jun 25, 2021 · 12 comments

Comments

@Astahlke
Copy link

Hello!

Thanks for building this.

I can't figure out why the first trimmer script is not completing. From what I can tell there's no error generated. It appears that mummer successfully runs twice, but the expected ${FNAME}_polish2_10x1_trim1.fasta isn't generated at the end for map10x2 to proceed. Any suggestions to resolve this?

Here's the relevant standard output:

BEGIN1: 5
BEGIN2: 18010
END1: 15213

1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "Pectinophora_gossypiella_2/Pgos/assembly_MT_rockefeller/intermediate
s/trimmed/Pgos.tig00000001_polish2_10x1.ntref" of length 18007
# construct suffix tree for sequence of length 18007
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# CONSTRUCTIONTIME /home/amanda.stahlke/.conda/envs/mitoVGP_pacbio/opt/mummer-3.23/mummer
Pectinophora_gossypiella_2/Pgos/assembly_MT_rockefeller/intermediates/trimmed/Pgos.tig0000
0001_polish2_10x1.ntref 0.00
# reading input file "/90daydata/project/ag100pest/Pgos/RawData/MT_Contig/Pectinophora_gos
sypiella_2/Pgos/assembly_MT_rockefeller/intermediates/trimmed/Pgos.tig00000001_polish2_10x
1_new.fasta" of length 18006
# matching query-file "/90daydata/project/ag100pest/Pgos/RawData/MT_Contig/Pectinophora_go
ssypiella_2/Pgos/assembly_MT_rockefeller/intermediates/trimmed/Pgos.tig00000001_polish2_10
x1_new.fasta"
# against subject-file "Pectinophora_gossypiella_2/Pgos/assembly_MT_rockefeller/intermedia
tes/trimmed/Pgos.tig00000001_polish2_10x1.ntref"
# COMPLETETIME /home/amanda.stahlke/.conda/envs/mitoVGP_pacbio/opt/mummer-3.23/mummer Pect
inophora_gossypiella_2/Pgos/assembly_MT_rockefeller/intermediates/trimmed/Pgos.tig00000001
_polish2_10x1.ntref 0.01
# SPACE /home/amanda.stahlke/.conda/envs/mitoVGP_pacbio/opt/mummer-3.23/mummer Pectinophor
a_gossypiella_2/Pgos/assembly_MT_rockefeller/intermediates/trimmed/Pgos.tig00000001_polish
2_10x1.ntref 0.03
4: FINISHING DATA


++++ running: map10x2 ++++

Species: -s Pectinophora_gossypiella_2

Species ID: -i Pgos

Contig number: -n tig00000001

Number of threads: -t 30

Working directory: Pectinophora_gossypiella_2/Pgos/assembly_MT_rockefeller/intermediates


--Generate sorted alignment:

Align...
Error: could not open Pectinophora_gossypiella_2/Pgos/assembly_MT_rockefeller/intermediate
s/trimmed/Pgos.tig00000001_polish2_10x1_trim1.fasta

I've attached a couple of other log files here - not sure what's most helpful. Thanks in advance for any ideas.

Amanda

@gf777
Copy link
Owner

gf777 commented Jul 2, 2021

Hi Amanda,

can you generate a self alignment of the canu contig selected by mitoVGP (or attach the fasta from canu). Usually this is very informative of repeats and other structures that may cause issues to mummer. We can increase the processivity of mummer with the option -z (e.g. 5000 could be a good value), if the problem is just finding the overlap within a nasty repeat.

I close the other issue so we keep everything in one place :-)

@gf777
Copy link
Owner

gf777 commented Jul 2, 2021

ps. a simple way to generate the self alignment is using online blast (paste the sequence twice)

@Astahlke
Copy link
Author

Astahlke commented Jul 2, 2021

Ah interesting. Is it possible to add an error output that mummer failed before going on to create the next directories, etc?
Just to clarify, are you asking for a self-alignment of ${W_URL}/trimmed/intermediate.fasta processed by mummer or the assembled canu contig?

Here are self-alignments for both
${W_URL}/trimmed/intermediate.fasta and ${W_URL}/canu/${ID}.contigs.fasta

Thanks so much for your help resolving this!

@Astahlke
Copy link
Author

Astahlke commented Jul 6, 2021

Following up here, increasing the sensitivity "Mummer sensitivity: -z 5000" didn't resolve the issue.

The self-alignment links are no longer live on ncbi - what format would you like them in? Here's the canu fasta (renamed as .txt for uploading) instead in case that's easier.
Pgos.contigs.fasta.txt

@Astahlke
Copy link
Author

Astahlke commented Jul 8, 2021

Hello! Another follow-up as I continue to tug at this.

Line 172 of the trimmer script

NUCMER_OUT=$(show-coords "${W_URL}/trimmed/${FNAME}.delta" -lrcTHo | grep "BEGIN" | head -1)

is looking for those terminal ends, including BEGIN, but show-cords is not reporting this. There's no BEGIN to grep.

$ nucmer --maxmatch --nosimplify Pgos.tig00000001_polish2_10x1_new.fasta Pgos.tig00000001_polish2_10x1_new.fasta -f -p test_polish2_10x1 -b 500
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "test_polish2_10x1.ntref" of length 18007
# construct suffix tree for sequence of length 18007
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# CONSTRUCTIONTIME /home/amanda.stahlke/.conda/envs/mitoVGP_pacbio/opt/mummer-3.23/mummer test_polish2_10x1.ntref 0.00
# reading input file "/90daydata/project/ag100pest/Pgos/RawData/MT_Contig/mitoVGP/Pectinophora_gossypiella_3/Pgos/assembly_MT_rockefeller/intermediates/trimmed/Pgos.tig00000001_polish2_10x1_new.fasta" of length 18006
# matching query-file "/90daydata/project/ag100pest/Pgos/RawData/MT_Contig/mitoVGP/Pectinophora_gossypiella_3/Pgos/assembly_MT_rockefeller/intermediates/trimmed/Pgos.tig00000001_polish2_10x1_new.fasta"
# against subject-file "test_polish2_10x1.ntref"
# COMPLETETIME /home/amanda.stahlke/.conda/envs/mitoVGP_pacbio/opt/mummer-3.23/mummer test_polish2_10x1.ntref 0.00
# SPACE /home/amanda.stahlke/.conda/envs/mitoVGP_pacbio/opt/mummer-3.23/mummer test_polish2_10x1.ntref 0.03
4: FINISHING DATA
$ NUCMER_OUT=$(show-coords test_polish2_10x1.delta -lrcTHo)
$ echo $NUCMER_OUT
1 18006 1 18006 18006 18006 100.00 18006 18006 100.00 100.00 Pgos.tig00000001_polish2_10x1 Pgos.tig00000001_polish2_10x1 [IDENTITY]

Does this imply that the sequence is sufficiently trimmed? Googling around some more it seems that I may have identified the issue as something you also encountered (mummer4/mummer#110). Were you able to resolve this? Is there a way to bypass this?

P.S. Dotplot attached here.
test

Thank you!

@gf777
Copy link
Owner

gf777 commented Jul 9, 2021

Hi Amanda!

Blast still has the self-alignment function (need to check the option under the sequence box). When run on the first of your canu contigs (tig00000001) it produces this:

hit_matrix

This thus seems a perfectly legitimate/regular mitochondrial concatamer for what looks like Pectinophora gossypiella. Solving the concatamer is sometimes tricky as you noted, but the specific issue in mummer should be fixed in mitoVGP by now. I am a bit perplexed by your dotplot because based on blast dotplot it looks like the mitogenome is about 15kb while your dotplot shows 18k of unique sequence. Can you share tig00000001_polish2_10x1.fasta?

Are you sure you have passed -z 5000? because your nucmer says -b 500 (so at least in this run you did not)

Best

@Astahlke
Copy link
Author

Astahlke commented Jul 9, 2021

Hi Giulio!

Yes, sorry, I was playing directly in the command line to see if I could figure out what was happening. Whether I use -z 500 or -z 5000, the show-coords and trimming process results in that 18007 length fasta, attached here. Very circular looking blast self-alignment attached here too.

It could be possible that we need to update programs or scripts from older versions - I could try that too.

Thanks again for all of the help!
Amanda

Pgos.tig00000001_polish2_10x1_new.fasta.txt
hit_matrix

@gf777
Copy link
Owner

gf777 commented Jul 10, 2021

Hi Amanda,

yes you got this figured out. Mummer is a little sensitive, so it appears in this case is not able to trim the ends correctly, albeit the issue you pointed to should have been fixed by now. I wonder how the output of show coords look like (i.e. what is confusing the trimming script here). 2 quick options: trim manually and you are done (most of the hard work is done at this point anyway), or trim manually while leave 100 bp only for the next step and run the final steps one by one using the scripts included in mitoVGP. Note that since this is not a vertebrate the final orientation step wouldn't work anyway with default settings.

All the best,

Giulio

@Astahlke
Copy link
Author

Progress! At least we've determined the issue here.

Here's the output of show-coords:

$show-coords Pectinophora_gossypiella/Pgos/assembly_MT_rockefeller/intermediates/trimmed/Pgos.tig00000001_polish2_10x1.delta -lrcTHo
1       18006   1       18006   18006   18006   100.00  18006   18006   100.00  10
0.00  Pgos.tig00000001_polish2_10x1   Pgos.tig00000001_polish2_10x1   [IDENTITY]

And this is the case whether I use mummer/4.0.0beta2 or mummer=3.23 as provided in mitoVGP_conda_env_pacbio.yml. Is it possibe that the issue isn't fixed?

To confirm, if I manually trim, is it still important to polish the trimmed mitocontig with short reads again? Can you explain the rationale behind this?

Thank you!!

Amanda

@gf777
Copy link
Owner

gf777 commented Jul 16, 2021

Interesting. It could be that my trick to fix mummer doesn't work in all cases. Basically mummer fails to find any overlap, even obvious ones, if the beginning of the two sequences being considered is exactly identical. Rather than trying to find the bug in mummer, my fix was introducing a small difference in one of the ends HOWEVER what it's probably happening here is that this worked for removing the first concatamers, but the internal ones are still perfectly identical and the fix doesn't work anymore. You could introduce a base change to one end in this partially trimmed sequence, I guess

@Astahlke
Copy link
Author

Astahlke commented Jul 26, 2021

I found that I could circularize that contig with blast, using https://github.com/Ag100Pest/modified_mitoVGP/blob/master/circularizationCheck.py, modified from the mitofinder pipeline. I also added an exit statement to my mitoVGP script (https://github.com/Ag100Pest/modified_mitoVGP/blob/master/mitoVGP) to suggest that circularization strategy if the trimmer fails.
Merqury didn't find any false kmers after the first rounds of polishing, so I didn't bring that contig back for another round of polishing and trimming in that case.
I've run into this one other time so far, this time in the second round of trimming. Seems to be just a matter of chance as far as I can tell.
Going to close this issue - thanks again for all the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants