The aim here is to produce a whole-genome phylogeny of the outbreak strains against the existing (non-draft) NCBI E. coli genomes. The following are the results of my attempt to analyse the two sequenced E. coli outbreak isolates (identified as serotype O104), TY2482 and LB226692. Both were sequenced using Life Technologies Ion Torrent technology.
The TY2482 reads were assembled by Nick Loman using MIRA: http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt
The annotation for TY2482 was obtained from http://www.era7bioinformatics.com/en/E_Coli_EHEC_O104_STRAIN_EU_OUTBREAK_era7bioinformatics.html
The LB226692 reads were assembled by Life Tech and the University of Muenster: http://www.ncbi.nlm.nih.gov/nuccore/AFOB00000000
Note that these are also now available on GitHub.
TY2482 was used as the 'reference strain' here.
Full MUMmer alignments and SNP calls are available at http://bio-ruby.ex.ac.uk/ecoli_outbreak
A SNP table listing synonymous/non-synonymous changes and gene IDs is available for download here (though it is 270Mb).
The following is the phylogeny produced by the above analysis. Highlighted in red are the outbreak strains.
First of all, I should point out that Kat Holt's excellent SNP analysis of TY2482 and LB226692 uses some filtering which the analysis here would definitely benefit from, especially given the homopolymer issues Kat noted: http://bacpathgenomics.wordpress.com/2011/06/04/ehec-genomes/ Even with this relatively dirty dataset, it's reassuring that the results here agree with David Studholme's analysis, which found that the closest relative is Escherichia coli 55989 (NC_011748).
According to MUMmer, the TY2482 strain shares 97.23% of its sequence with LB226692. However, LB226692 only shares 95.56% of its sequence with TY2482. It is possible there are one or two plasmids lurking there, but it could also be an artefact of the different methods used to assemble these two isolates. The isolates have around 1500 SNPs between them. Of these, 1281 are within coding regions and 239 are classified as synonymous changes. This seems a rather high number to me if they are merely different clinical isolates of the same strain, but I don't think we have enough background knowledge to really get a handle on how common and/or significant this is. Certainly a good proportion are due to crappy filtering on my part.
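To illustrate why the two "shared" percentages need not match, here is a minimal Python sketch. It is not how MUMmer derives its figures, and the function name, the interval format and the toy numbers are all my own assumptions. The idea is simply that the same set of aligned blocks is divided by a different total length on each side, so extra sequence in one assembly (a plasmid, say, or unmerged contigs) lowers that assembly's percentage only.

```python
def covered_fraction(intervals, seq_length):
    """Fraction of a sequence covered by at least one alignment block.

    intervals: (start, end) pairs, 1-based inclusive, in either orientation.
    Overlapping or adjacent blocks are merged so no base is counted twice.
    This is an illustrative helper, not part of MUMmer.
    """
    normalised = sorted((min(s, e), max(s, e)) for s, e in intervals)
    covered = 0
    cur_start = cur_end = None
    for start, end in normalised:
        if cur_end is None or start > cur_end + 1:
            # Gap before this block: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Block overlaps or touches the current one: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / seq_length


# Toy example (made-up numbers): assembly B carries 200 extra bases
# that have no counterpart in assembly A.
blocks_on_a = [(1, 900)]
blocks_on_b = [(1, 900)]
fraction_a = covered_fraction(blocks_on_a, 1000)  # A is 1000 bp long
fraction_b = covered_fraction(blocks_on_b, 1200)  # B is 1200 bp long
print(f"A shared with B: {fraction_a:.2%}, B shared with A: {fraction_b:.2%}")
```

In the toy case the same 900 aligned bases give 90% for A but only 75% for B, which is the same kind of asymmetry as the 97.23% versus 95.56% above.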
The same comparison has been done for each GenBank E. coli genome with respect to TY2482.
It may be worthwhile checking out other closely related species in case cross-species hybridisation has occurred. By trade I am a facility manager and a bioinformatician rather than a bona fide pathogenomicist, so I apologise if there are any issues I have neglected to take into account or anything which would be obvious to a proper pathogen researcher. Please let me know if you think there is anything clearly wrong with this approach and I will do my best to correct it.
From a bioinformatics point of view, it's interesting just to see that our current tools and knowledge mean that it is quite difficult to rapidly pin down exactly what happened to form this strain.
I'll try to look at the differences between the two strains in more detail and perhaps see whether the latest BGI assembly significantly changes anything. It would also be really useful if someone could repeat the above analysis using the raw reads (if the reads for LB226692 become available), aligning them to Escherichia coli 55989 with TMAP, Newbler or some other homopolymer-aware alignment suite. I'd be interested to know if the SNPs identified by MUMmer (when suitably filtered) give comparable results to those called by alignment of the reads to a reference.
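As a rough idea of what "suitably filtered" could mean for the homopolymer problem, here is a minimal Python sketch that drops SNP calls sitting in, or directly next to, a run of identical bases in the reference. This is not the filtering Kat Holt used and not something I have run on the real data. The function names, the 0-based positions and the run-length threshold of 4 are all assumptions chosen for illustration.

```python
def run_length_at(seq, pos):
    """Length of the run of identical bases that contains 0-based position pos."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def near_homopolymer(seq, pos, min_run=4):
    """True if pos, or a base directly adjacent to it, lies in a run of
    at least min_run identical bases. min_run=4 is an arbitrary assumption."""
    neighbours = [p for p in (pos - 1, pos, pos + 1) if 0 <= p < len(seq)]
    return any(run_length_at(seq, p) >= min_run for p in neighbours)


def filter_snps(seq, snp_positions, min_run=4):
    """Keep only the SNP positions that are not in or beside a homopolymer run."""
    return [p for p in snp_positions if not near_homopolymer(seq, p, min_run)]


# Toy reference with a run of five A's at positions 4-8 (0-based).
reference = "ACGTAAAAAGCTAGC"
calls = [1, 3, 6, 9, 12]
print(filter_snps(reference, calls))  # [1, 12]
```

Positions 3 and 9 are removed because they flank the run, and 6 because it is inside it. On real data the threshold and the size of the flanking window would need tuning against SNPs called from read alignments.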