TY2482, LB226692 vs Genbank Ecoli

konradpaszkiewicz edited this page Jun 5, 2011 · 12 revisions

Whole genome phylogenies (03/06/2011)

Konrad Paszkiewicz, University of Exeter Sequencing Service khp204@ex.ac.uk

The following are the results of my attempt to analyse the two sequenced E. coli outbreak isolates (identified as serotype O104), TY2482 and LB226692. Both were sequenced using Life Technologies Ion Torrent technology.

Datasets:

    The TY2482 reads were assembled by Nick Loman using MIRA  [[http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt]]
    The annotation for TY2482 was obtained from [[http://www.era7bioinformatics.com/en/E_Coli_EHEC_O104_STRAIN_EU_OUTBREAK_era7bioinformatics.html]]
    LB226692 reads were assembled by Life Tech and University of Muenster [[http://www.ncbi.nlm.nih.gov/nuccore/AFOB00000000]]

Objective: To produce a whole genome phylogeny of the outbreak strains against the existing (non-draft) NCBI E. coli genomes.

Methods:

    TY2482 was used as the 'reference strain' here.

    1. The dnadiff command from the MUMmer 3.21 [[http://mummer.sourceforge.net/]] package was used to generate whole-genome alignments. As part of this process the MUMmer show-snps command is executed and 'calls' (if you can define it that way) SNPs between the two genomes. dnadiff was run for TY2482 against each of the other E. coli genomes in NCBI Genbank and against the LB226692 assembly of another O104 isolate.

    2.  The out.snps files for each TY2482 vs Query alignment were parsed and SNPs from all alignments extracted into a single file.

    3.  The GFF annotation performed by BG7 was used to identify putative gene locations and determine whether a SNP would cause a synonymous or non-synonymous change.

    4.  Only SNPs classified as synonymous changes were used to generate a pseudo-sequence.

    5.  The program FastTreeMP was used to generate a tree using a generalised time-reversible model (options: FastTreeMP -nucleotide -gtr) (http://www.microbesonline.org/fasttree/)

    6. The resulting tree can be visualised in MEGA (http://www.megasoftware.net/)
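
As a rough illustration of steps 2 and 4 (not the scripts actually used here), the sketch below collects substitutions from several out.snps files and builds one pseudo-sequence per genome. It assumes the tab-delimited, headerless show-snps layout that dnadiff is understood to write (reference position, reference base, query base, ..., with the reference and query sequence IDs in the last two columns). The function names and the `reference_name` parameter are hypothetical. Filtering to synonymous sites would happen before the pseudo-sequences are built.

```python
def parse_snps(text):
    """Yield (ref_contig, ref_pos, ref_base, query_base) for substitutions.

    Assumed layout (tab-delimited, no header): P1, ref base, query base, P2,
    ..., reference ID, query ID. Indels are shown with '.' and are skipped.
    """
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6 or not fields[0].isdigit():
            continue  # not a SNP record in the assumed layout
        ref_base, query_base = fields[1].upper(), fields[2].upper()
        if ref_base == "." or query_base == ".":
            continue  # insertion or deletion, not a substitution
        yield fields[-2], int(fields[0]), ref_base, query_base


def build_pseudo_sequences(snps_by_genome, reference_name="TY2482"):
    """Concatenate the bases at every variable reference site, per genome.

    snps_by_genome maps a genome name to a list of parse_snps() tuples.
    Caveat: a genome with no SNP at a site is given the reference base,
    even if that region did not align at all.
    """
    ref_bases = {}
    for snps in snps_by_genome.values():
        for contig, pos, ref_base, _ in snps:
            ref_bases[(contig, pos)] = ref_base
    sites = sorted(ref_bases)

    sequences = {reference_name: "".join(ref_bases[s] for s in sites)}
    for genome, snps in snps_by_genome.items():
        alt = {(contig, pos): base for contig, pos, _, base in snps}
        sequences[genome] = "".join(alt.get(s, ref_bases[s]) for s in sites)
    return sequences


def to_fasta(sequences):
    """Format the pseudo-sequences as FASTA text for a tree-building program."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())
```

Each genome would contribute `list(parse_snps(open(path).read()))` for its own out.snps file, and the FASTA text from `to_fasta` is the kind of alignment a tree builder takes as input.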

    Steps 2 and 3 used custom scripts partly based on work originally performed by David Studholme.
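
The custom scripts themselves are not reproduced on this page. As a minimal sketch of the idea behind step 3, assuming the standard codon assignments and a gene's coding sequence already extracted in coding orientation, a SNP can be classified by comparing the amino acid before and after the substitution. The names `classify_snp` and `cds_offset` are hypothetical, and codons containing ambiguous bases are not handled.

```python
BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (standard codon assignments).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def cds_offset(gene_start, gene_end, strand, pos):
    """Map a 1-based genome position to a 0-based offset within a gene.

    gene_start and gene_end are the 1-based inclusive GFF coordinates.
    For a '-' strand gene the offset counts back from gene_end, and the
    SNP base must be complemented before calling classify_snp().
    """
    return pos - gene_start if strand == "+" else gene_end - pos


def classify_snp(cds, offset, alt_base):
    """Return 'synonymous' or 'non-synonymous' for one substitution.

    cds is the coding sequence in coding orientation, offset is 0-based
    within it, and alt_base is given in the same orientation.
    """
    frame = offset % 3
    start = offset - frame
    codon = cds[start:start + 3].upper()
    if len(codon) < 3:
        raise ValueError("SNP falls in an incomplete codon")
    mutated = codon[:frame] + alt_base.upper() + codon[frame + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "non-synonymous"
```

For example, in the toy sequence `ATGCTTTAA`, changing the third base of the `CTT` codon to `C` leaves leucine unchanged, while changing its first base to `A` gives isoleucine.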

Results:

   Available at [[http://bio-ruby.ex.ac.uk/ecoli_outbreak]]

Comments:

    First of all I should point out that Kat Holt's excellent SNP analysis uses some filtering which the analysis here could benefit from. [[http://bacpathgenomics.wordpress.com/2011/06/04/ehec-genomes/]] 

    According to MUMmer the TY2482 strain shares 97.23% of its sequence with LB226692. However, LB226692 only shares 95.56% of its sequence with TY2482. It's possible there are one or two plasmids lurking there, but it could also be an artefact of the different methods used to assemble these two isolates. The isolates have around 1500 SNPs between them. 1281 are within coding regions and 239 are classified as synonymous changes. This seems a rather high number to me if they are merely different clinical isolates of the same strain but I don't think we have enough background
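
One rough way to read the asymmetry in the two shared-sequence percentages: if both were computed from roughly the same number of aligned bases (an assumption; the aligned totals reported for reference and query can differ slightly), then their ratio approximates the ratio of the two assembly sizes.

```python
# Shared-sequence percentages quoted above from the MUMmer comparison.
ty2482_shared = 97.23    # % of TY2482 that aligns to LB226692
lb226692_shared = 95.56  # % of LB226692 that aligns to TY2482

# Assumption: one aligned length A underlies both figures, so
# A / len(TY2482) = 0.9723 and A / len(LB226692) = 0.9556.
size_ratio = ty2482_shared / lb226692_shared  # ~ len(LB226692) / len(TY2482)
extra_percent = (size_ratio - 1) * 100

print(f"LB226692 assembly is ~{extra_percent:.1f}% larger under this assumption")
```

Under that assumption the LB226692 assembly would carry roughly 1.7% more sequence than TY2482, which fits either extra plasmid content or assembly differences.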

    The same comparison has been done for each Genbank E. coli genome.

    It may be worthwhile checking out other closely related species in case cross-species hybridisation has occurred.

    By trade I am a facility manager and a bioinformatician, so I apologise if there are any issues I have neglected to take into account or anything which would be obvious to a bona-fide pathogen researcher. Please let me know if you think there is anything clearly wrong with this approach and I will do my best to correct it.

    It is fascinating how difficult it actually is to pin down exactly what happened to form this strain.

    I'll try to look at the differences between the two strains in more detail.
