Minimum length of scaffolds? #10

Open
MrOlm opened this issue Feb 9, 2017 · 4 comments

Comments

@MrOlm

MrOlm commented Feb 9, 2017

Hello,

I was wondering if there is a minimum length requirement for scaffolds somewhere hard-coded in the program?

I ran a test comparing the length of each genome before and after running Panseq on it (the "after" length is the sum of the lengths of all fragments that genome has in the pan-genome). For most genomes the two numbers agreed very well (1-2% difference), but for highly fragmented genomes (those with a lot of small pieces) there was as much as a 13% reduction in genome size.

The smallest scaffolds I used as input are 1kb, but I was wondering if I should make that even higher?

Thank you for the wonderful tool,
-Matt

@chadlaing
Owner

Hi,

There is no hard-coded minimum length. All the cutoffs can be specified in the parameters file that is given to Panseq. The minimumNovelRegionSize will determine what the threshold for keeping a region is.

If this doesn't answer your question, would you be able to provide a small example?

Thanks,
Chad

@MrOlm
Author

MrOlm commented Feb 13, 2017

Hi Chad,

Below is a list of genomes with a minimum scaffold length of 1kb. Each line shows the true length of the input genome vs. the length of the genome as interpreted from the binary_table (the sum of all fragments in the core and accessory genomes), followed by the difference in genome size, the percent reduction relative to the true length, and the number of scaffolds in the input genome.

conN1_009_006G1_concoct_25  : 2473919 vs 2473919 (      0bp -  0.00% reduction) ( 53 scaffolds)
con_SP_CRL_000G1_concoct_31 : 2455342 vs 2401612 ( 53,730bp -  2.19% reduction) ( 65 scaffolds)
conN4_155_006G1_concoct_13  : 2501838 vs 2451576 ( 50,262bp -  2.01% reduction) ( 45 scaffolds)
conN4_210_008G1_concoct_0   : 2851395 vs 2766652 ( 84,743bp -  2.97% reduction) (157 scaffolds)
conN4_206_000G1_concoct_30  : 2135556 vs 1857730 (277,826bp - 13.01% reduction) (723 scaffolds)
conN4_116_010G1_concoct_3   : 2567613 vs 2528031 ( 39,582bp -  1.54% reduction) ( 31 scaffolds)
conN4_139_007G1_concoct_4   : 2597855 vs 2531798 ( 66,057bp -  2.54% reduction) (136 scaffolds)
conN4_207_000G1_concoct_32  : 2556963 vs 2504248 ( 52,715bp -  2.06% reduction) ( 38 scaffolds)
conS2_002_011G1_concoct_8   : 2144299 vs 1869622 (274,677bp - 12.81% reduction) (699 scaffolds)
conS2_012_013G1_concoct_9   : 2469268 vs 2403776 ( 65,492bp -  2.65% reduction) ( 99 scaffolds)

As you can see, two of the genomes show a pretty large difference in size (>10%).
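
For reference, the reduction figures in the list above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `size_reduction` is made up for illustration and is not part of Panseq.

```python
def size_reduction(true_length, table_length):
    """Return (bp lost, percent reduction) relative to the true genome length."""
    lost = true_length - table_length
    percent = 100.0 * lost / true_length
    return lost, percent


# Values taken from the conN4_206_000G1_concoct_30 line above.
lost, percent = size_reduction(2135556, 1857730)
print(f"{lost:,}bp - {percent:.2f}% reduction")  # 277,826bp - 13.01% reduction
```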

However, below is the same analysis, but only counting scaffolds greater than 1.5kb when calculating the length of the original genome:

conN4_129_012G1_concoct_19  : 2505570 vs 2464111 ( 41,459bp -  1.65% reduction) ( 40 scaffolds)
conN1_009_006G1_concoct_25  : 2473919 vs 2473919 (      0bp -  0.00% reduction) ( 53 scaffolds)
con_SP_CRL_000G1_concoct_31 : 2453052 vs 2401612 ( 51,440bp -  2.10% reduction) ( 63 scaffolds)
conN4_155_006G1_concoct_13  : 2501838 vs 2451576 ( 50,262bp -  2.01% reduction) ( 45 scaffolds)
conN4_210_008G1_concoct_0   : 2848748 vs 2766652 ( 82,096bp -  2.88% reduction) (155 scaffolds)
conN4_206_000G1_concoct_30  : 1880704 vs 1857730 ( 22,974bp -  1.22% reduction) (517 scaffolds)
conN4_116_010G1_concoct_3   : 2567613 vs 2528031 ( 39,582bp -  1.54% reduction) ( 31 scaffolds)
conN4_139_007G1_concoct_4   : 2596594 vs 2531798 ( 64,796bp -  2.50% reduction) (135 scaffolds)
conN4_207_000G1_concoct_32  : 2556963 vs 2504248 ( 52,715bp -  2.06% reduction) ( 38 scaffolds)
conS2_002_011G1_concoct_8   : 1964334 vs 1869622 ( 94,712bp -  4.82% reduction) (555 scaffolds)
conS2_012_013G1_concoct_9   : 2465376 vs 2403776 ( 61,600bp -  2.50% reduction) ( 96 scaffolds)

The two outliers now show much less difference in size, which is what made me think that there was a hard-coded cutoff in the code.

This is really no longer a problem for me, as I can just set my minimum contig length threshold to 2kb, but I was just wondering if you had any thoughts about why this could be happening?
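
A minimal sketch of how the "true" genome length can be recomputed with a scaffold length cutoff, assuming plain FASTA input. The helper names (`read_fasta_lengths`, `genome_length`) and the toy records are hypothetical, and the cutoff is treated as inclusive (`>=`), which is an assumption rather than something stated in this thread.

```python
def read_fasta_lengths(lines):
    """Yield (header, sequence length) for each record in an iterable of FASTA lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


def genome_length(lines, min_length=0):
    """Return (total bp, scaffolds kept) counting only scaffolds >= min_length bp."""
    kept = [n for _, n in read_fasta_lengths(lines) if n >= min_length]
    return sum(kept), len(kept)


# Toy input: scaffolds of 2000, 1200 and 2400 bp.
example = [">scaffold_1", "ACGT" * 500, ">scaffold_2", "ACGT" * 300, ">scaffold_3", "ACGT" * 600]
print(genome_length(example, min_length=1000))  # (5600, 3)
print(genome_length(example, min_length=1500))  # (4400, 2)
```

With a real file, something like `genome_length(open("genome.fasta"), min_length=2000)` would give the total for a 2kb threshold (the file name is a placeholder).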

Thanks for your help. I'm really liking the program a lot.
-Matt

@chadlaing
Owner

Hi Matt,

Just to check:
You are calculating the size of each genome by summing the fragment sizes given in column 1 of the binary_table.txt file?

Chad

@MrOlm
Author

MrOlm commented Feb 24, 2017

Hi Chad,

Yes, that's exactly right.

-Matt
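
For anyone repeating the check confirmed above, here is a rough sketch of summing fragment sizes per genome from a presence/absence table. The layout is an assumption, not Panseq's documented format: a tab-separated file with a header row of genome names, one row per fragment, and cells of `1` where the fragment is present. How the fragment size is encoded in column 1 is not specified in this thread, so the caller passes a `fragment_length` function; the toy table below simply stores the size as an integer.

```python
import csv
import io


def genome_sizes_from_table(handle, fragment_length=int):
    """Sum fragment sizes per genome from a tab-separated presence/absence table.

    fragment_length converts the column-1 value into a size in bp
    (hypothetical; adapt it to however the size is encoded in the real file).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genomes = header[1:]
    totals = dict.fromkeys(genomes, 0)
    for row in reader:
        if not row:
            continue
        size = fragment_length(row[0])
        for genome, cell in zip(genomes, row[1:]):
            if cell.strip() == "1":
                totals[genome] += size
    return totals


# Toy table (assumed layout): column 1 holds the fragment size directly.
table = io.StringIO("\tgenomeA\tgenomeB\n1000\t1\t1\n500\t1\t0\n250\t0\t1\n")
print(genome_sizes_from_table(table))  # {'genomeA': 1500, 'genomeB': 1250}
```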
