
Choosing the best genome assembly #220

Closed · mmokrejs opened this issue Apr 28, 2018 · 17 comments

@mmokrejs commented Apr 28, 2018

Hi,
I am moving my posts here from a thread which I somewhat hijacked in the past (#178), for easier reading. I will add @sjackman's comments here too.

Interestingly, the k=128 assembly turned out better, in my opinion, than the k=64 assembly. These are the contigs, but I think it is correct to check these well-assembled segments before scaffolding happens. What am I missing here? Do you think the counts are larger because the k=128 assembly contains more redundant contigs? Or was our discussion only about sealing gaps?

$ BBmap-37.90/stats.sh in=abyss_k64/tt_16D1C3L12__abyss_64-3.fa 
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3026	0.1979	0.1980	0.3014	0.0000	0.0000	0.0000	0.3959	0.1092

Main genome scaffold total:         	8970038
Main genome contig total:           	8970038
Main genome scaffold sequence total:	1545.997 MB
Main genome contig sequence total:  	1545.997 MB  	0.000% gap
Main genome scaffold N/L50:         	984481/207
Main genome contig N/L50:           	984481/207
Main genome scaffold N/L90:         	6786697/66
Main genome contig N/L90:           	6786697/66
Max scaffold length:                	48.544 KB
Max contig length:                  	48.544 KB
Number of scaffolds > 50 KB:        	0
% main genome in scaffolds > 50 KB: 	0.00%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	     8,970,038	     8,970,038	 1,545,997,227	 1,545,997,227	 100.00%
     50 	     8,970,038	     8,970,038	 1,545,997,227	 1,545,997,227	 100.00%
    100 	     4,217,230	     4,217,230	 1,211,849,941	 1,211,849,941	 100.00%
    250 	       763,487	       763,487	   724,479,476	   724,479,476	 100.00%
    500 	       386,281	       386,281	   597,014,844	   597,014,844	 100.00%
   1 KB 	       218,620	       218,620	   479,261,380	   479,261,380	 100.00%
 2.5 KB 	        56,120	        56,120	   220,798,793	   220,798,793	 100.00%
   5 KB 	         9,411	         9,411	    66,580,089	    66,580,089	 100.00%
  10 KB 	           875	           875	    11,287,016	    11,287,016	 100.00%
  25 KB 	            12	            12	       371,039	       371,039	 100.00%

$ BBmap-37.90/stats.sh in=abyss_k128/tt_16D1C3L12__abyss_128-3.fa 
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3002	0.2001	0.2002	0.2995	0.0000	0.0000	0.0000	0.4003	0.0996

Main genome scaffold total:         	3835048
Main genome contig total:           	3835048
Main genome scaffold sequence total:	1651.869 MB
Main genome contig sequence total:  	1651.869 MB  	0.000% gap
Main genome scaffold N/L50:         	280737/938
Main genome contig N/L50:           	280737/938
Main genome scaffold N/L90:         	2578210/142
Main genome contig N/L90:           	2578210/142
Max scaffold length:                	56.412 KB
Max contig length:                  	56.412 KB
Number of scaffolds > 50 KB:        	1
% main genome in scaffolds > 50 KB: 	0.00%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	     3,835,048	     3,835,048	 1,651,869,106	 1,651,869,106	 100.00%
    100 	     3,835,048	     3,835,048	 1,651,869,106	 1,651,869,106	 100.00%
    250 	     1,733,198	     1,733,198	 1,333,132,564	 1,333,132,564	 100.00%
    500 	       475,698	       475,698	   956,024,894	   956,024,894	 100.00%
   1 KB 	       267,415	       267,415	   813,134,944	   813,134,944	 100.00%
 2.5 KB 	       116,134	       116,134	   571,167,119	   571,167,119	 100.00%
   5 KB 	        38,065	        38,065	   299,800,293	   299,800,293	 100.00%
  10 KB 	         6,544	         6,544	    88,583,607	    88,583,607	 100.00%
  25 KB 	           125	           125	     3,756,838	     3,756,838	 100.00%
  50 KB 	             1	             1	        56,412	        56,412	 100.00%

$

And when I compare the final scaffolds:

$ BBmap-37.90/stats.sh in=abyss_k64/tt_16D1C3L12__abyss_64-8.fa
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3023	0.1982	0.1984	0.3011	0.1032	0.0001	0.0000	0.3966	0.1107

Main genome scaffold total:         	8066430
Main genome contig total:           	8196768
Main genome scaffold sequence total:	1675.455 MB
Main genome contig sequence total:  	1502.500 MB  	10.323% gap
Main genome scaffold N/L50:         	276947/388
Main genome contig N/L50:           	703390/232
Main genome scaffold N/L90:         	5555783/69
Main genome contig N/L90:           	6079298/67
Max scaffold length:                	369.235 KB
Max contig length:                  	59.732 KB
Number of scaffolds > 50 KB:        	2906
% main genome in scaffolds > 50 KB: 	15.71%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	     8,066,430	     8,196,768	 1,675,455,273	 1,502,499,899	  89.68%
     50 	     8,066,430	     8,196,768	 1,675,455,273	 1,502,499,899	  89.68%
    100 	     3,838,952	     3,969,290	 1,376,296,357	 1,203,340,983	  87.43%
    250 	       515,159	       645,453	   910,327,281	   737,374,137	  81.00%
    500 	       203,301	       333,571	   805,706,044	   632,758,384	  78.53%
   1 KB 	        75,439	       205,708	   715,848,362	   542,901,177	  75.84%
 2.5 KB 	        34,397	       160,621	   648,229,490	   475,601,705	  73.37%
   5 KB 	        25,225	       149,598	   618,496,895	   446,269,819	  72.15%
  10 KB 	        16,226	       128,799	   555,879,949	   400,660,524	  72.08%
  25 KB 	         7,213	        90,509	   413,531,573	   302,275,101	  73.10%
  50 KB 	         2,906	        54,936	   263,145,032	   196,921,549	  74.83%
 100 KB 	           806	        23,619	   118,660,722	    91,293,900	  76.94%
 250 KB 	            36	         1,952	    10,260,581	     8,133,203	  79.27%


$ BBmap-37.90/stats.sh in=tt_16D1C3L12__abyss_k128-8.fa
A	C	G	T	N	IUPAC	Other	GC	GC_stdev
0.3001	0.2002	0.2003	0.2994	0.0776	0.0001	0.0000	0.4005	0.1032

Main genome scaffold total:         	2876731
Main genome contig total:           	3015816
Main genome scaffold sequence total:	1703.964 MB
Main genome contig sequence total:  	1571.653 MB  	7.765% gap
Main genome scaffold N/L50:         	11981/19.507 KB
Main genome contig N/L50:           	108612/2.736 KB
Main genome scaffold N/L90:         	1639503/171
Main genome contig N/L90:           	1867631/160
Max scaffold length:                	873.806 KB
Max contig length:                  	90.766 KB
Number of scaffolds > 50 KB:        	5362
% main genome in scaffolds > 50 KB: 	37.88%


Minimum 	Number        	Number        	Total         	Total         	Scaffold
Scaffold	of            	of            	Scaffold      	Contig        	Contig  
Length  	Scaffolds     	Contigs       	Length        	Length        	Coverage
--------	--------------	--------------	--------------	--------------	--------
    All 	     2,876,731	     3,015,816	 1,703,964,196	 1,571,652,579	  92.24%
    100 	     2,876,731	     3,015,816	 1,703,964,196	 1,571,652,579	  92.24%
    250 	     1,268,766	     1,407,851	 1,454,714,840	 1,322,403,223	  90.90%
    500 	       189,163	       328,132	 1,135,210,242	 1,002,913,588	  88.35%
   1 KB 	        52,595	       191,516	 1,042,703,436	   910,423,098	  87.31%
 2.5 KB 	        32,095	       169,028	 1,006,465,234	   874,336,179	  86.87%
   5 KB 	        23,821	       159,541	   978,974,988	   846,946,092	  86.51%
  10 KB 	        17,880	       148,430	   935,974,349	   809,391,570	  86.48%
  25 KB 	         9,965	       122,063	   807,426,899	   701,763,980	  86.91%
  50 KB 	         5,362	        93,327	   645,398,771	   565,717,262	  87.65%
 100 KB 	         2,428	        61,150	   440,176,973	   389,086,844	  88.39%
 250 KB 	           382	        17,275	   130,980,817	   117,113,064	  89.41%
 500 KB 	            28	         2,206	    17,115,130	    15,343,274	  89.65%

@sjackman commented on Feb 8

k=64 Main genome contig N/L50: 984481/207
k=128 Main genome contig N/L50: 280737/938

k=64 is the more contiguous assembly. Try k=80,96,112.
I'd suggest using ntCard and GenomeScope first before assembling with ABySS to get a sense of the coverage and k-mer distribution of your data.
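For reference, a minimal sketch of that pre-assembly profiling, assuming hypothetical read files reads_1.fq.gz/reads_2.fq.gz and 150 bp reads (GenomeScope v1 takes the histogram, k, read length, and an output directory):

$ ntcard -t 16 -k 64,80,96,112,128 -p freq reads_1.fq.gz reads_2.fq.gz
$ grep -v '^F' freq_k96.hist > freq_k96.histo    # drop ntCard's F0/F1 summary lines
$ Rscript genomescope.R freq_k96.histo 96 150 genomescope_k96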

@mmokrejs commented on Feb 8
Those numbers were from step -3. In the final scaffolds (the -8 files) there is:

k=64 Main genome contig N/L50: 703390/232
k=128 Main genome contig N/L50: 108612/2.736 KB

That "KB" is certainly not kilobytes; it is just weird.

@sjackman commented on Feb 8
Perhaps it's a bug in BBmap-37.90/stats.sh. I'm guessing it should be 2,736.
k=64 is more contiguous for both unitigs and scaffolds.

@mmokrejs commented on Feb 8
I asked Brian Bushnell about the meaning of the MB and KB in the output; here is his answer.

Indeed, those are Kbp and Mbp; I should change them. You can add the flag "format=2" to print whole numbers of bp with no commas or decimals, instead of Kbp or Mbp. Also, note that "L50" for stats.sh means a Length and "N50" means a Number of contigs, so the L50 (length of 50th percentile contig) increased from 232 to 2736 in that case, when K increased, meaning continuity improved. There are some programs that have L50 and N50 reversed.
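For the record, the format=2 flag Brian mentions would be appended to the same invocation as above to get plain base-pair counts:

$ BBmap-37.90/stats.sh in=abyss_k128/tt_16D1C3L12__abyss_128-8.fa format=2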

@sjackman commented on Feb 9
I'm afraid Brian has the definition of N50 and L50 reversed. It's a common misunderstanding due to an unfortunate nomenclature. N50 is the length of the contig, and L50 is the number of the contigs whose size is N50 or larger. Yes it's weird, but that's the way it is.
See https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics
and http://quast.bioinf.spbau.ru/manual.html#sec3.1.1
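To make the definitions concrete, here is a toy worked example with hypothetical contig lengths of 80, 70, 50, 40, 30 and 20 bp (total 290 bp, half of which is 145 bp):

$ printf '%s\n' 80 70 50 40 30 20 | sort -rn | awk '{len[NR]=$1; tot+=$1} END {for (i=1; i<=NR; i++) {cum+=len[i]; if (cum>=tot/2) {print "N50=" len[i], "L50=" i; exit}}}'
N50=70 L50=2

The cumulative sum first reaches half the total (80+70=150 >= 145) at the second contig, so N50 is that contig's length (70) and L50 is its rank (2).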

@mmokrejs commented on Feb 11
You asked for the numbers from abyss-fac, so that we do not have to wonder which statistics Brian may have swapped. The mapper used was abyss-map (I have stuck with the default so far).

Based on uncorrected input reads:

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_64-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
17.02e6	400495	97124	400495	500	500	929	1797	3307	2389	40026	576.5e6	tt_16D1C3L12__abyss_64-1.fa
9269779	400495	97124	400495	500	500	929	1797	3307	2389	40026	576.5e6	tt_16D1C3L12__abyss_64-2.fa
8970038	386281	91280	386281	500	500	1008	1971	3663	2649	48544	597e6	tt_16D1C3L12__abyss_64-3.fa
3036	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_64-4.fa
5148	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_64-5.fa
8178432	310669	63719	310669	500	500	1495	2882	5772	4006	59732	633.7e6	tt_16D1C3L12__abyss_64-6.fa
150	23	9	23	519	519	675	918	1344	1079	2400	20808	tt_16D1C3L12__abyss_64-7.fa
8066430	203300	7924	203300	500	500	1783	15776	59085	34309	290392	632.8e6	tt_16D1C3L12__abyss_64-8.fa

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_80-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
13.12e6	446607	100548	356890	643	500	968	2003	3865	2719	40026	684.4e6	tt_16D1C3L12__abyss_80-1.fa
7098314	446607	100548	356890	643	500	968	2003	3865	2719	40026	684.4e6	tt_16D1C3L12__abyss_80-2.fa
6781726	423529	92333	302719	757	500	1071	2247	4374	3073	48480	707.5e6	tt_16D1C3L12__abyss_80-3.fa
4693	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_80-4.fa
6110	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_80-5.fa
5820652	308849	55950	157302	1610	500	1887	3813	7894	5337	59232	755e6	tt_16D1C3L12__abyss_80-6.fa
308	99	34	99	501	501	707	977	1455	1144	2825	94745	tt_16D1C3L12__abyss_80-7.fa
5696462	191129	5702	43046	1934	500	2941	28235	91696	52410	565062	753.4e6	tt_16D1C3L12__abyss_80-8.fa

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_96-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
10.67e6	480137	100288	266417	964	500	1006	2224	4498	3088	40026	779.6e6	tt_16D1C3L12__abyss_96-1.fa
5767063	480137	100288	266417	964	500	1006	2224	4498	3088	40026	779.6e6	tt_16D1C3L12__abyss_96-2.fa
5440940	445527	89508	221230	1188	500	1142	2565	5220	3560	57744	802.7e6	tt_16D1C3L12__abyss_96-3.fa
6539	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_96-4.fa
6512	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_96-5.fa
4474821	304908	49734	107774	2642	500	2204	4824	10075	6702	61751	854.7e6	tt_16D1C3L12__abyss_96-6.fa
515	208	68	208	501	501	722	1144	1782	1295	3032	218062	tt_16D1C3L12__abyss_96-7.fa
4344599	184154	4671	14726	10390	500	5460	40717	121439	70922	851373	852.6e6	tt_16D1C3L12__abyss_96-8.fa

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_112-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
8861505	509781	99296	212832	1294	500	1031	2434	5108	3438	45517	864.1e6	tt_16D1C3L12__abyss_112-1.fa
4792535	509781	99296	212832	1294	500	1031	2434	5108	3438	45517	864.1e6	tt_16D1C3L12__abyss_112-2.fa
4462563	461389	85975	173733	1635	500	1207	2887	6078	4039	49970	884.7e6	tt_16D1C3L12__abyss_112-3.fa
8534	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_112-4.fa
6145	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_112-5.fa
3561326	305368	45960	82288	3652	500	2455	5703	11978	7851	79396	936.6e6	tt_16D1C3L12__abyss_112-6.fa
663	311	99	311	500	500	798	1193	1927	1362	3365	340199	tt_16D1C3L12__abyss_112-7.fa
3427972	184013	4178	9292	21190	500	7950	50996	146429	85078	664640	934.1e6	tt_16D1C3L12__abyss_112-8.fa

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_128-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
7603647	540270	98987	178531	1617	500	1042	2609	5644	3732	45325	940.8e6	tt_16D1C3L12__abyss_128-1.fa
4164814	540270	98987	178531	1617	500	1042	2609	5644	3732	45325	940.8e6	tt_16D1C3L12__abyss_128-2.fa
3835048	475698	83169	143642	2074	500	1260	3192	6791	4462	56412	956e6	tt_16D1C3L12__abyss_128-3.fa
11372	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_128-4.fa
5300	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_128-5.fa
3011140	310186	43902	67976	4608	500	2643	6415	13450	8757	104425	1.006e9	tt_16D1C3L12__abyss_128-6.fa
772	429	133	429	500	500	786	1261	2004	1414	3662	482664	tt_16D1C3L12__abyss_128-7.fa
2876731	189115	4029	7223	30658	500	9361	56647	164847	94949	793681	1.003e9	tt_16D1C3L12__abyss_128-8.fa

@sjackman commented on Feb 13

n	n:500	L50	min	N80	N50	N20	E-size	max	sum	name
8066430	203300	7924	500	1783	15776	59085	34309	290392	632.8e6	tt_16D1C3L12__abyss_64-8.fa
2876731	189115	4029	500	9361	56647	164847	94949	793681	1.003e9	tt_16D1C3L12-8.fa

k=128 looks like the better of these two assemblies to me. I'd suggest trying larger values of k. Note that the assembly size differs quite a bit between the two assemblies. That's something to keep in mind when interpreting these numbers. I'd suggest adding a -G parameter to abyss-fac that is your best estimate of the genome size to calculate NG50, which is generally better for comparing the contiguity of two assemblies. Note that abyss-fac can accept multiple FASTA files on its command line.

@mmokrejs commented on Feb 13

$ abyss-fac -G 1267403131 ./abyss_k64/tt_16D1C3L12__abyss_64-scaffolds.fa ./abyss_k80/tt_16D1C3L12__abyss_80-scaffolds.fa ./abyss_k96/tt_16D1C3L12__abyss_96-scaffolds.fa tt_16D1C3L12__abyss_128-scaffolds.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
8066430	203300	7924	203300	500	500	1783	15776	59085	34309	290392	632.8e6	./abyss_k64/tt_16D1C3L12__abyss_64-scaffolds.fa
5696462	191129	5702	43046	1934	500	2941	28235	91696	52410	565062	753.4e6	./abyss_k80/tt_16D1C3L12__abyss_80-scaffolds.fa
4344599	184154	4671	14726	10390	500	5460	40717	121439	70922	851373	852.6e6	./abyss_k96/tt_16D1C3L12__abyss_96-scaffolds.fa
2876731	189115	4029	7223	30658	500	9361	56647	164847	94949	793681	1.003e9	./abyss_k128/tt_16D1C3L12__abyss_128-scaffolds.fa

I am glad that you now agree the k128 result is better than k64. A pity I confused you with the output from stats.sh. The k112 and k156 computations are ongoing; due to cluster maintenance, the k156 run will take 2 extra days on top of the usual ~35 hours of computation.

@sjackman commented on Feb 13
You could try some ABySS 2 Bloom-filter dBG assemblies in the meantime if you liked; they require about a tenth of the RAM. Perhaps that'd be easier for you to get scheduled on your cluster.
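For reference, a minimal sketch of such a Bloom-filter run; the library names and the Bloom settings (B = filter memory, H = hash functions, kc = minimum k-mer count) are placeholders that would need tuning for a ~1.3 Gbp genome:

$ abyss-pe name=tt_16D1C3L12__abyss_144 k=144 B=50G H=4 kc=3 \
    lib='pe1' mp='mp1' \
    pe1='pe_1.fq.gz pe_2.fq.gz' mp1='mp_1.fq.gz mp_2.fq.gz'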

@mmokrejs commented on Mar 19
So, finally I have the assemblies with k160 and k192. I updated the tables above to reflect the new values.

@sjackman commented on Mar 19
k=160 has the peak NG50. k=128 has the peak E-size, N50, and largest scaffold. Either of those assemblies looks good to me.

@mmokrejs commented on Mar 20
I had the impression I should try k144 too. The input for k160 and k192 was error-corrected reads; for assemblies <=k128 I used uncorrected reads. I will use the error-corrected reads for k144 as well. Error correction decreases the memory usage and increases the average coverage only a tiny bit, but maybe it still helps the assembly overall.
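For completeness, a sketch of that tadpole.sh correction step with hypothetical file names (k=64 is from the text; the memory cap would need adjusting to the host):

$ BBmap-37.90/tadpole.sh -Xmx200g mode=correct k=64 \
    in=reads_1.fq.gz in2=reads_2.fq.gz out=ecc_1.fq.gz out2=ecc_2.fq.gz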

Based on uncorrected input reads:

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_128-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
7603647	540270	98987	178531	1617	500	1042	2609	5644	3732	45325	940.8e6	tt_16D1C3L12__abyss_128-1.fa
4164814	540270	98987	178531	1617	500	1042	2609	5644	3732	45325	940.8e6	tt_16D1C3L12__abyss_128-2.fa
3835048	475698	83169	143642	2074	500	1260	3192	6791	4462	56412	956e6	tt_16D1C3L12__abyss_128-3.fa
11372	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_128-4.fa
5300	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_128-5.fa
3011140	310186	43902	67976	4608	500	2643	6415	13450	8757	104425	1.006e9	tt_16D1C3L12__abyss_128-6.fa
772	429	133	429	500	500	786	1261	2004	1414	3662	482664	tt_16D1C3L12__abyss_128-7.fa
2876731	189115	4029	7223	30658	500	9361	56647	164847	94949	793681	1.003e9	tt_16D1C3L12__abyss_128-8.fa

Based on error-corrected input reads (using tadpole.sh and k=64):

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_128-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
7278490	540152	103434	200909	1417	500	1008	2430	5164	3450	42126	905.4e6	tt_16D1C3L12__abyss_128-1.fa
3606504	540152	103434	200909	1417	500	1008	2430	5164	3450	42126	905.4e6	tt_16D1C3L12__abyss_128-2.fa
3282085	479704	87706	162732	1801	500	1203	2936	6202	4112	57779	921.7e6	tt_16D1C3L12__abyss_128-3.fa
8089	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_128-4.fa
5252	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_128-5.fa
2524721	320183	47968	79411	3892	500	2439	5684	11904	7801	84004	972.2e6	tt_16D1C3L12__abyss_128-6.fa
1093	563	182	563	503	503	784	1227	1854	1371	3813	624124	tt_16D1C3L12__abyss_128-7.fa
2383961	195687	4463	8913	23272	500	7332	48678	144350	83684	910150	968.8e6	tt_16D1C3L12__abyss_128-8.fa

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_144-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
6286575	575941	104590	176213	1671	500	995	2539	5531	3653	42158	972.9e6	tt_16D1C3L12__abyss_144-1.fa
3128233	575941	104590	176213	1671	500	995	2539	5531	3653	42158	972.9e6	tt_16D1C3L12__abyss_144-2.fa
2809379	496455	86074	141270	2144	500	1227	3151	6754	4426	56465	980.4e6	tt_16D1C3L12__abyss_144-3.fa
10335	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_144-4.fa
4403	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_144-5.fa
2130364	332146	47269	70164	4540	500	2517	6090	12741	8324	109348	1.027e9	tt_16D1C3L12__abyss_144-6.fa
1223	658	201	658	501	501	783	1263	1935	1431	3985	736598	tt_16D1C3L12__abyss_144-7.fa
1988503	207153	4441	7594	29632	500	7711	51296	153330	89693	730307	1.023e9	tt_16D1C3L12__abyss_144-8.fa

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_160-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
5378996	618241	106811	157364	1918	500	965	2616	5854	3810	42190	1.041e9	tt_16D1C3L12__abyss_160-1.fa
2722381	618241	106811	157364	1918	500	965	2616	5854	3810	42190	1.041e9	tt_16D1C3L12__abyss_160-2.fa
2411515	516769	85211	125222	2482	500	1231	3345	7223	4688	57811	1.037e9	tt_16D1C3L12__abyss_160-3.fa
12737	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_160-4.fa
3101	1	1	1	501	501	501	501	501	501	501	501	tt_16D1C3L12__abyss_160-5.fa
1807627	348775	47265	63826	5107	500	2543	6370	13388	8717	82251	1.078e9	tt_16D1C3L12__abyss_160-6.fa
1263	717	215	717	502	502	756	1240	1970	1481	7966	794484	tt_16D1C3L12__abyss_160-7.fa
1665496	223249	4585	6863	34706	500	7480	51976	156903	91040	724617	1.074e9	tt_16D1C3L12__abyss_160-8.fa

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_192-?.fa
n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
3646761	728516	121430	143397	2194	500	874	2503	5844	3754	51286	1.165e9	tt_16D1C3L12__abyss_192-1.fa
2014683	728516	121430	143397	2194	500	874	2503	5844	3754	51286	1.165e9	tt_16D1C3L12__abyss_192-2.fa
1731479	596980	93440	114146	2829	500	1112	3294	7312	4686	60880	1.141e9	tt_16D1C3L12__abyss_192-3.fa
33949	0	0	0	0	0	0	0	0	0	0	0	tt_16D1C3L12__abyss_192-4.fa
1201	1	1	1	507	507	507	507	507	507	507	507	tt_16D1C3L12__abyss_192-5.fa
1361846	455052	61113	71142	4723	500	1948	5287	11270	7278	98989	1.167e9	tt_16D1C3L12__abyss_192-6.fa
767	579	167	579	500	500	749	1314	2091	1500	6596	650527	tt_16D1C3L12__abyss_192-7.fa
1213455	319826	6263	7774	30914	500	2610	38588	129848	74019	751657	1.163e9	tt_16D1C3L12__abyss_192-8.fa

@sjackman, would you please comment on the k144 and k160 assemblies? I think k144 is the best in the k128-k192 range (it has a very high actually-assembled genome size and still a very high max scaffold; L50 is even better than in k128, and NG50 is somewhere in between). Given that the gap-closing steps are so time consuming, wouldn't it be a better approach to opt for the k160 assembly? It provides an almost complete genome (unless the actually assembled genome size is inflated by redundant contigs) and is maybe a better base for future work.

I am running

java -Xmx2800G -jar pilon-1.22.jar --output abyss_ecc_k128_ecc_N10_pilon --diploid --changes --vcf --fix all,amb --threads 104 ...

but it will last for about 1200 wall-clock hours. As we discussed elsewhere, ABySS's Sealer runs multithreaded at the very beginning but then switches to a single-threaded job. Moreover, I have the impression it would just produce a consensus of my sequences inside the "gaps", whereas Pilon scans the BAM files and works with the spanning read pairs (so it could pick the right set of two alleles). Unfortunately, Pilon works only with proper pairs, and way too many of the read pairs I obtained from BBMap have the "wrong" orientation or insert size (note the mix in the Nextera mate-pair libraries), so it works with far fewer reads. What is your strategy? I shall try the BAMs created by abyss-map too, if it cares to set the proper-pair flags; I think BWA does not, which is why I did not try it.
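As an aside, one quick way to see what fraction of pairs Pilon can actually use is the proper-pair count reported per BAM, e.g. for a hypothetical mp1.bam:

$ samtools flagstat mp1.bam | grep 'properly paired'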

@sjackman commented

would you please comment on the k144 and k160 assemblies.

Both the k=144 and k=160 assemblies look good. k=144 is more contiguous. k=160 has a larger assembly by 50 Mbp (5%), which is pretty significant.

Given that the gap-closing steps are so time consuming, wouldn't it be a better approach to opt for the k160 assembly? It provides an almost complete genome (unless the actually assembled genome size is inflated by redundant contigs) and is maybe a better base for future work.

KAT is a great tool to compare k-mer spectra and assess whether the genome is over collapsed or over expanded. It may help you pick the better assembly. https://github.com/TGAC/KAT#readme
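A minimal kat comp sketch for one such comparison, with hypothetical PE read file names (kat comp writes a <prefix>-main.mx matrix that kat plot renders as a spectra-cn plot):

$ kat comp -t 16 -o pe2016_vs_k144 'pe2016_1.fq.gz pe2016_2.fq.gz' tt_16D1C3L12__abyss_144-8.fa
$ kat plot spectra-cn -o pe2016_vs_k144.spectra-cn.png pe2016_vs_k144-main.mx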

but it will last for about 1200 wall-clock hours.

Does Pilon take 1,200 wall hours (50 days!) for a 1 Gbp genome? That seems high to me.

@mmokrejs commented

So it appears you agree k128 < k144 < k160, right?
Oh yes, we work with KAT a lot here (TGAC/KAT#94), but I have not gotten around to trying it yet.

Regarding Pilon: yes, it really runs on all 104 available cores; Java grew to those 2.8 TB after two days and does not ask for more (good). Given that Pilon occasionally writes a few lines to the log file about regions processed, it appears those 45k regions are about a quarter of all my 190k gaps in scaffolds. For example:

Total Reads: 41, Coverage: 3, minDepth: 5
Confirmed 41 of 251 bases (16.33%)
Corrected 0 snps; corrected 0 small insertions totaling 0 bases, 0 small deletions totaling 0 bases
Finished processing 441078:1-251
Processing 4150108:1-223
898210:1-364 log:

When I count the Corrected lines, I think I am getting the number of gaps attempted. That means it will really need about 50 days (if there is no other iteration or additional work ongoing). Sadly I picked the k128 assembly for this; I wanted to be on the conservative side, but that was maybe too hasty. Sadly, Pilon does not write any intermediate output files, so if the analysis gets interrupted ... we are out of luck.

@sjackman commented

Java grew to those 2.8 TB after two days

Do you have 2.8 TB of memory installed on that machine? If not, I'd recommend reducing the number of cores so that the job fits in available memory. It'll take forever when swapping to disk.

@sjackman commented

So it appears you agree k128 < k144 < k160, right?

I agree that both k=144 and k=160 are better than k=128. I don't see a clear winner between k=144 and k=160: k=144 is more contiguous, while k=160 appears more complete. All things being equal, I usually go for the more contiguous assembly, so I'd prefer k=144 here. The results of KAT may help assess whether k=160 truly is more complete or just overexpanded. At this point I'd suggest running BUSCO on both k=144 and k=160, and picking whichever assembly is more complete.
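A sketch of that BUSCO comparison in BUSCO v3 syntax; the eukaryota_odb9 lineage is a placeholder that should be matched to the organism:

$ run_BUSCO.py -i tt_16D1C3L12__abyss_144-8.fa -o busco_k144 -l eukaryota_odb9 -m genome -c 16
$ run_BUSCO.py -i tt_16D1C3L12__abyss_160-8.fa -o busco_k160 -l eukaryota_odb9 -m genome -c 16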

@mmokrejs commented Apr 28, 2018

Do you have 2.8 TB of memory installed on that machine?

The machine has 3.2 TB of RAM and is a NUMA architecture with 112 CPU cores.

@sjackman commented

What's the structure of the NUMA memory? How much memory is available as a single uniform-access unit? I'm not familiar with NUMA terminology, so I'm not sure what the typical name for that unit is.

@mmokrejs commented Apr 28, 2018

Tough question for a molecular biologist.

It is an SGI UV200 with 14 Intel Xeon E5-4627v2 CPUs (3.3 GHz, 8 cores each), one per "node"/"chunk".
It is composed of 14 chunks, somehow interconnected. Accesses to memory, and maybe other tasks (IRQs?), need to be proxied via CPU0. This is what our admins complain about: for example, ABYSS-P runs at a quarter of the speed when 104 of the 112 local CPUs talk to each other via OpenMPI, whereas jobs from other programs can talk across our cluster of normal nodes 4x faster through InfiniBand. The admins say this is the price for CPU0 proxying the requests. Unfortunately, as you may remember from another thread (now closed), I had issues with ABYSS-P communication on this host when the k-mer size was higher than 128. I did not manage to figure out what the issue is, nor did I manage to run ABYSS-P across InfiniBand between our normal cluster nodes. At least, on this NUMA host with mpirun --mca btl vader,self, mpirun and ABYSS-P worked together at k>128. So it did its job, although a much faster k-mer counting step could probably be achieved using just a few of our 24-core nodes across InfiniBand (well, they have only 128 GB each, so I would need more of them to reach the 800 GB-1.2 TB needed, depending on the k-mer size chosen).

Does the output below answer your question better?

$ lstopo
Machine (3100GB total)
  Group0 L#0
    NUMANode L#0 (P#0)
    NUMANode L#1 (P#1 248GB) + Package L#0 + L3 L#0 (16MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#8)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#9)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#10)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#11)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#12)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#13)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#14)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#15)
  Group0 L#1
    NUMANode L#2 (P#2 248GB) + Package L#1 + L3 L#1 (16MB)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#16)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#17)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#18)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#19)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#20)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#21)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#22)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#23)
    NUMANode L#3 (P#3 248GB) + Package L#2 + L3 L#2 (16MB)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#24)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#25)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#26)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#27)
      L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#28)
      L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#29)
      L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#30)
      L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#31)
  Group0 L#2
    NUMANode L#4 (P#4 248GB) + Package L#3 + L3 L#3 (16MB)
      L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#32)
      L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#33)
      L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#34)
      L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#35)
      L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#36)
      L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#37)
      L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#38)
      L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#39)
    NUMANode L#5 (P#5 248GB) + Package L#4 + L3 L#4 (16MB)
      L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#40)
      L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#41)
      L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#42)
      L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#43)
      L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#44)
      L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#45)
      L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#46)
      L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#47)
  NUMANode L#6 (P#6 248GB) + Package L#5 + L3 L#5 (16MB)
    L2 L#40 (256KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#48)
    L2 L#41 (256KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#49)
    L2 L#42 (256KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#50)
    L2 L#43 (256KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#51)
    L2 L#44 (256KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#52)
    L2 L#45 (256KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#53)
    L2 L#46 (256KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#54)
    L2 L#47 (256KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#55)
  Group0 L#3
    NUMANode L#7 (P#7 248GB) + Package L#6 + L3 L#6 (16MB)
      L2 L#48 (256KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48 + PU L#48 (P#56)
      L2 L#49 (256KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49 + PU L#49 (P#57)
      L2 L#50 (256KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50 + PU L#50 (P#58)
      L2 L#51 (256KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51 + PU L#51 (P#59)
      L2 L#52 (256KB) + L1d L#52 (32KB) + L1i L#52 (32KB) + Core L#52 + PU L#52 (P#60)
      L2 L#53 (256KB) + L1d L#53 (32KB) + L1i L#53 (32KB) + Core L#53 + PU L#53 (P#61)
      L2 L#54 (256KB) + L1d L#54 (32KB) + L1i L#54 (32KB) + Core L#54 + PU L#54 (P#62)
      L2 L#55 (256KB) + L1d L#55 (32KB) + L1i L#55 (32KB) + Core L#55 + PU L#55 (P#63)
    NUMANode L#8 (P#8 248GB) + Package L#7 + L3 L#7 (16MB)
      L2 L#56 (256KB) + L1d L#56 (32KB) + L1i L#56 (32KB) + Core L#56 + PU L#56 (P#64)
      L2 L#57 (256KB) + L1d L#57 (32KB) + L1i L#57 (32KB) + Core L#57 + PU L#57 (P#65)
      L2 L#58 (256KB) + L1d L#58 (32KB) + L1i L#58 (32KB) + Core L#58 + PU L#58 (P#66)
      L2 L#59 (256KB) + L1d L#59 (32KB) + L1i L#59 (32KB) + Core L#59 + PU L#59 (P#67)
      L2 L#60 (256KB) + L1d L#60 (32KB) + L1i L#60 (32KB) + Core L#60 + PU L#60 (P#68)
      L2 L#61 (256KB) + L1d L#61 (32KB) + L1i L#61 (32KB) + Core L#61 + PU L#61 (P#69)
      L2 L#62 (256KB) + L1d L#62 (32KB) + L1i L#62 (32KB) + Core L#62 + PU L#62 (P#70)
      L2 L#63 (256KB) + L1d L#63 (32KB) + L1i L#63 (32KB) + Core L#63 + PU L#63 (P#71)
  Group0 L#4
    NUMANode L#9 (P#9 248GB) + Package L#8 + L3 L#8 (16MB)
      L2 L#64 (256KB) + L1d L#64 (32KB) + L1i L#64 (32KB) + Core L#64 + PU L#64 (P#72)
      L2 L#65 (256KB) + L1d L#65 (32KB) + L1i L#65 (32KB) + Core L#65 + PU L#65 (P#73)
      L2 L#66 (256KB) + L1d L#66 (32KB) + L1i L#66 (32KB) + Core L#66 + PU L#66 (P#74)
      L2 L#67 (256KB) + L1d L#67 (32KB) + L1i L#67 (32KB) + Core L#67 + PU L#67 (P#75)
      L2 L#68 (256KB) + L1d L#68 (32KB) + L1i L#68 (32KB) + Core L#68 + PU L#68 (P#76)
      L2 L#69 (256KB) + L1d L#69 (32KB) + L1i L#69 (32KB) + Core L#69 + PU L#69 (P#77)
      L2 L#70 (256KB) + L1d L#70 (32KB) + L1i L#70 (32KB) + Core L#70 + PU L#70 (P#78)
      L2 L#71 (256KB) + L1d L#71 (32KB) + L1i L#71 (32KB) + Core L#71 + PU L#71 (P#79)
    NUMANode L#10 (P#10 248GB) + Package L#9 + L3 L#9 (16MB)
      L2 L#72 (256KB) + L1d L#72 (32KB) + L1i L#72 (32KB) + Core L#72 + PU L#72 (P#80)
      L2 L#73 (256KB) + L1d L#73 (32KB) + L1i L#73 (32KB) + Core L#73 + PU L#73 (P#81)
      L2 L#74 (256KB) + L1d L#74 (32KB) + L1i L#74 (32KB) + Core L#74 + PU L#74 (P#82)
      L2 L#75 (256KB) + L1d L#75 (32KB) + L1i L#75 (32KB) + Core L#75 + PU L#75 (P#83)
      L2 L#76 (256KB) + L1d L#76 (32KB) + L1i L#76 (32KB) + Core L#76 + PU L#76 (P#84)
      L2 L#77 (256KB) + L1d L#77 (32KB) + L1i L#77 (32KB) + Core L#77 + PU L#77 (P#85)
      L2 L#78 (256KB) + L1d L#78 (32KB) + L1i L#78 (32KB) + Core L#78 + PU L#78 (P#86)
      L2 L#79 (256KB) + L1d L#79 (32KB) + L1i L#79 (32KB) + Core L#79 + PU L#79 (P#87)
  NUMANode L#11 (P#11 124GB) + Package L#10 + L3 L#10 (16MB)
    L2 L#80 (256KB) + L1d L#80 (32KB) + L1i L#80 (32KB) + Core L#80 + PU L#80 (P#88)
    L2 L#81 (256KB) + L1d L#81 (32KB) + L1i L#81 (32KB) + Core L#81 + PU L#81 (P#89)
    L2 L#82 (256KB) + L1d L#82 (32KB) + L1i L#82 (32KB) + Core L#82 + PU L#82 (P#90)
    L2 L#83 (256KB) + L1d L#83 (32KB) + L1i L#83 (32KB) + Core L#83 + PU L#83 (P#91)
    L2 L#84 (256KB) + L1d L#84 (32KB) + L1i L#84 (32KB) + Core L#84 + PU L#84 (P#92)
    L2 L#85 (256KB) + L1d L#85 (32KB) + L1i L#85 (32KB) + Core L#85 + PU L#85 (P#93)
    L2 L#86 (256KB) + L1d L#86 (32KB) + L1i L#86 (32KB) + Core L#86 + PU L#86 (P#94)
    L2 L#87 (256KB) + L1d L#87 (32KB) + L1i L#87 (32KB) + Core L#87 + PU L#87 (P#95)
  Group0 L#5
    NUMANode L#12 (P#12 248GB) + Package L#11 + L3 L#11 (16MB)
      L2 L#88 (256KB) + L1d L#88 (32KB) + L1i L#88 (32KB) + Core L#88 + PU L#88 (P#96)
      L2 L#89 (256KB) + L1d L#89 (32KB) + L1i L#89 (32KB) + Core L#89 + PU L#89 (P#97)
      L2 L#90 (256KB) + L1d L#90 (32KB) + L1i L#90 (32KB) + Core L#90 + PU L#90 (P#98)
      L2 L#91 (256KB) + L1d L#91 (32KB) + L1i L#91 (32KB) + Core L#91 + PU L#91 (P#99)
      L2 L#92 (256KB) + L1d L#92 (32KB) + L1i L#92 (32KB) + Core L#92 + PU L#92 (P#100)
      L2 L#93 (256KB) + L1d L#93 (32KB) + L1i L#93 (32KB) + Core L#93 + PU L#93 (P#101)
      L2 L#94 (256KB) + L1d L#94 (32KB) + L1i L#94 (32KB) + Core L#94 + PU L#94 (P#102)
      L2 L#95 (256KB) + L1d L#95 (32KB) + L1i L#95 (32KB) + Core L#95 + PU L#95 (P#103)
    NUMANode L#13 (P#13 248GB) + Package L#12 + L3 L#12 (16MB)
      L2 L#96 (256KB) + L1d L#96 (32KB) + L1i L#96 (32KB) + Core L#96 + PU L#96 (P#104)
      L2 L#97 (256KB) + L1d L#97 (32KB) + L1i L#97 (32KB) + Core L#97 + PU L#97 (P#105)
      L2 L#98 (256KB) + L1d L#98 (32KB) + L1i L#98 (32KB) + Core L#98 + PU L#98 (P#106)
      L2 L#99 (256KB) + L1d L#99 (32KB) + L1i L#99 (32KB) + Core L#99 + PU L#99 (P#107)
      L2 L#100 (256KB) + L1d L#100 (32KB) + L1i L#100 (32KB) + Core L#100 + PU L#100 (P#108)
      L2 L#101 (256KB) + L1d L#101 (32KB) + L1i L#101 (32KB) + Core L#101 + PU L#101 (P#109)
      L2 L#102 (256KB) + L1d L#102 (32KB) + L1i L#102 (32KB) + Core L#102 + PU L#102 (P#110)
      L2 L#103 (256KB) + L1d L#103 (32KB) + L1i L#103 (32KB) + Core L#103 + PU L#103 (P#111)
$
$ numactl --show
policy: default
preferred node: current
physcpubind: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 
cpubind: 1 2 3 4 5 6 7 8 9 10 11 12 13 
nodebind: 1 2 3 4 5 6 7 8 9 10 11 12 13 
membind: 1 2 3 4 5 6 7 8 9 10 11 12 13 
$

@mmokrejs commented

Maybe the short answer could be: the whole 3.2 TB of RAM appears as local memory.

@sjackman commented Apr 28, 2018

Neat. I haven't yet used a NUMA system. As an experiment you may want to try Pilon with 8 threads and 248 GB of RAM, ensuring that the 8 threads are bound to the same socket.
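A sketch of that experiment, assuming NUMA node 1 and a heap sized to fit within its 248 GB:

$ numactl --cpunodebind=1 --membind=1 \
    java -Xmx240G -jar pilon-1.22.jar --threads 8 --diploid --changes --vcf --fix all,amb ...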

@mmokrejs commented Apr 28, 2018

But Pilon seems to require at least 1.2 TB for my data; I ran out of memory a few times when I tested with no extra java -Xmx flag.

Maybe Pilon is just unpacking my BAM files all the time, only to realize there is not enough coverage? Working with SAM files would probably be much faster; I have again forgotten the difference in their indexing approaches.

@sjackman commented

1.2 TB seems high to me for polishing. Reducing the number of threads may also reduce the memory requirement. The authors of Pilon may have more suggestions. You could split your assembly up into 12 equal-sized chunks, and polish each chunk separately.
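One way to do that splitting without rewriting the FASTA is Pilon's --targets option, which accepts a file listing one scaffold name per line; a sketch with hypothetical assembly.fa and pe.bam inputs:

$ grep '^>' assembly.fa | sed 's/^>//; s/ .*//' > scaffold_names.txt   # extract scaffold names
$ split -n l/12 -d scaffold_names.txt targets_                         # 12 roughly equal lists
$ for t in targets_*; do
    java -Xmx240G -jar pilon-1.22.jar --genome assembly.fa --frags pe.bam \
        --targets "$t" --output pilon_"$t" --threads 8
  done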

@mmokrejs commented Apr 28, 2018

Maybe. I could try another host with 16 cores and 512GB RAM.

Anyway, here are some figures from the current java -Xmx2800G -jar pilon-1.22.jar --output abyss_ecc_k128_ecc_N10_pilon --diploid --changes --vcf --fix all,amb --threads 104 job at two zoom-levels:

[Figures: Pilon memory usage and CPU usage, over the first 3 days and over 11 days.]

@mmokrejs commented Apr 28, 2018

And some intermediate results, if I am parsing/interpreting the log file properly. There are 195687 scaffolds involved; remember the line from abyss-fac for this k128 assembly:

2383961	195687	4463	8913	23272	500	7332	48678	144350	83684	910150	968.8e6	tt_16D1C3L12__abyss_128-8.fa

[Figures: gap-size density and infilled relative-size density for the abyss_k128_ecc_n10 Pilon run. Attachments: Rplots.pdf, pilon_stats.sh.txt, pilon_stats.R.txt]

@mmokrejs commented May 9, 2018

I cancelled the Pilon process. I think that in every gap region it was re-aligning not only the reads flanking the region but also all unaligned reads from the BAM files. Given that the long mate-pair datasets are a mixture of FR and RF reads (and are badly separated into subgroups by every tool), I think it was spending an unnecessary amount of time on this.

@mmokrejs commented May 9, 2018

I agree that both k=144 and k=160 are better than k=128. I don't see a clear winner between k=144 and k=160: k=144 is more contiguous, while k=160 appears more complete. All things being equal, I usually go for the more contiguous assembly, so I'd prefer k=144 here. The results of KAT may help assess whether k=160 truly is more complete or just overexpanded. At this point I'd suggest running BUSCO on both k=144 and k=160, and picking whichever assembly is more complete.

[KAT spectra-cn figures:
PE2016 vs. assembly_k128 (pe_vs_assembly-main.mx)
PE2016 vs. PE2017 (gcp__both_pe_datasets.mx)
assembly_k128 vs. assembly_k128 (assembly_vs_assembly-main.mx)
PE2016 vs. assembly_k128 (pe2016_vs_assembly_k128-main.mx)
PE2016 vs. assembly_k144 (pe2016_vs_assembly_k144-main.mx)
PE2016 vs. assembly_k160 (pe2016_vs_assembly_k160-main.mx)]

@sjackman commented May 14, 2018

If bubble popping were perfect (which it usually is not), the black and red components at 0.5x copy number (in your case ~20x depth) would be roughly equal in size for a diploid organism. So it looks somewhat overexpanded, but not badly so.

stale bot commented Jun 4, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
