Era7 annotation of BGI v2 assembly of E. coli TY-2482

marina-manrique edited this page Jun 10, 2011 · 2 revisions

Era7 annotation of the BGI V2 assembly of E. coli TY-2482 with the BG7 system

Annotation

The BG7 system, developed by Oh no sequences!, was used to produce this annotation.

The BG7 system is specifically designed to handle NGS data. One of its most important features is that ORFs are predicted by searching for protein similarity: we first search for the reference proteins in the contigs and then define the ORFs. We keep every CDS found, even if it lacks a canonical start or stop codon or contains frameshifts or internal stop codons, so the system is fairly robust to NGS errors that may cause the loss of start/stop signals or changes in the reading frame.
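As a rough illustration of the CDS features mentioned above, the sketch below flags whether a nucleotide CDS has a canonical start and stop codon, how many internal stop codons it contains, and whether its length is a multiple of three. It is not BG7 code: the function name `classify_cds`, the start-codon set (ATG, GTG, TTG) and the use of sequence length as a crude stand-in for frameshift detection are assumptions made for this example. BG7 itself defines genes from similarity to reference proteins.

```python
# Illustrative only; not part of BG7.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons


def classify_cds(seq):
    """Return simple quality flags for a nucleotide CDS given 5'->3'."""
    seq = seq.upper()
    # Split into complete codons; 1-2 trailing bases are left out.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    in_frame = len(seq) % 3 == 0
    # If the length is in frame, the last codon is the expected terminal stop;
    # otherwise every complete codon is treated as internal.
    body = codons[:-1] if in_frame else codons
    return {
        "canonical_start": bool(codons) and codons[0] in START_CODONS,
        "canonical_stop": in_frame and bool(codons) and codons[-1] in STOP_CODONS,
        "internal_stops": sum(c in STOP_CODONS for c in body),
        "length_multiple_of_3": in_frame,
    }


if __name__ == "__main__":
    print(classify_cds("ATGAAATAA"))     # short CDS with start and stop
    print(classify_cds("ATGTAAAAATAA"))  # contains one internal stop codon
```

A CDS that fails any of these checks would still be kept under the policy described above; the flags only record why it is not canonical.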

Dataset

These are the datasets we used in the annotation:

Assembly

We used BGI's assembly (6 June) of Illumina (200x, single-end reads) and Ion Torrent (12x) data. This is the first BGI assembly that is completely de novo. It was assembled with Newbler v. 2.0.00.22, SOAPdenovo v. 1.06 and AMOS minimus2 v. 1.59. See the assemblies page for more details.

Reference proteins

We used a set of 137,063 reference proteins. This set includes:

  • The representative UniProt proteins of all UniRef90 clusters for all Escherichia coli proteins
  • All UniProt proteins from organisms whose name includes the term “EHEC” or “EAEC”
  • All UniProt proteins from bacteria that have the term “toxin” in any UniProt field
  • All UniProt proteins from bacteria that have the term “hemolysin” in any UniProt field
  • All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae
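The selection rules above can be sketched as a simple filter. This is only an illustration under stated assumptions, not the procedure actually used to build the set: the record layout (a dict with `organism`, `is_bacterial`, `uniref90_representative` and `annotation_text` keys) and the function name `is_reference_protein` are hypothetical, and plain substring matching is assumed, so "toxin" would also match terms such as "antitoxin" or "enterotoxin".

```python
# Hypothetical record layout: each protein is a dict with the keys
# "organism", "is_bacterial", "uniref90_representative", "annotation_text".
KEY_ORGANISMS = ("salmonella typhi", "yersinia pestis", "shigella dysenteriae")
NAME_TERMS = ("ehec", "eaec")
FIELD_TERMS = ("toxin", "hemolysin")


def is_reference_protein(rec):
    """Return True if a record matches any of the five selection rules."""
    organism = rec["organism"].lower()
    text = rec.get("annotation_text", "").lower()
    # Rule 1: UniRef90 cluster representatives for E. coli proteins.
    if "escherichia coli" in organism and rec.get("uniref90_representative", False):
        return True
    # Rule 2: organism name mentions EHEC or EAEC.
    if any(term in organism for term in NAME_TERMS):
        return True
    # Rules 3 and 4: bacterial proteins mentioning "toxin" or "hemolysin".
    if rec.get("is_bacterial", False) and any(term in text for term in FIELD_TERMS):
        return True
    # Rule 5: all proteins from the three named pathogens.
    return any(name in organism for name in KEY_ORGANISMS)


if __name__ == "__main__":
    example = {"organism": "Yersinia pestis", "is_bacterial": True}
    print(is_reference_protein(example))
```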

Results

The result files of this annotation are available in the repo: https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/tree/master/strains/TY2482/seqProject/BGI/annotations/era7bioinformatics/BGI_V2

We have predicted 5,982 genes:

  • 5,849 protein encoding genes
  • 133 RNA genes (rRNA and tRNA)

4,797 of the 5,849 protein-encoding genes (82.01%) have canonical start and stop codons and contain neither frameshifts nor intragenic stop codons.

658 of the 5,849 protein-encoding genes (11.24%) contain frameshifts or intragenic stop codons, probably caused by errors inherent to the sequencing technology. However, our system tolerates the errors of massive sequencing technologies and was able to detect this rich set of genes even from very preliminary sequencing results.
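The counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures reported on this page; the variable names are ours. It also shows that 394 protein-encoding genes fall into neither of the two categories. The page does not describe them; presumably they lack a canonical start or stop codon without having a frameshift or intragenic stop. Note that 658 / 5,849 rounds to 11.25%, so the reported 11.24% appears to be truncated rather than rounded.

```python
# Figures as reported on this page.
total_genes = 5982
protein_genes = 5849
rna_genes = 133
clean = 4797        # canonical start/stop, no frameshift or intragenic stop
with_errors = 658   # at least one frameshift or intragenic stop codon

# Protein-encoding and RNA genes should add up to the total.
assert protein_genes + rna_genes == total_genes

# Protein-encoding genes in neither reported category.
other = protein_genes - clean - with_errors


def pct(n, d=protein_genes):
    """Percentage of n over d, rounded to two decimals."""
    return round(100 * n / d, 2)


if __name__ == "__main__":
    print("clean:", clean, pct(clean))
    print("with errors:", with_errors, pct(with_errors))
    print("other:", other, pct(other))
```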

Some of the detected proteins are probably fragmented and may appear as two different predicted genes if they lie at the ends of different contigs.

Annotation of Restriction-Modification systems

Analysing our automatic annotation of the second BGI assembly of the TY-2482 genome, we found that this isolate has 3 restriction-modification systems:

  • Type I restriction-modification system encoded in an operon in contig 42: the specificity protein is encoded by gene 79712, the modification protein by gene 84400 and the restriction protein by gene 66267
  • Type II system encoded in an operon in contig 486: the nuclease protein is encoded by gene 21919 and the methyltransferase protein by gene 23135
  • Type III system encoded in contig 493: the nuclease protein is encoded by gene 3634 and the methyltransferase protein by gene 5265

Type I restriction-modification system encoded in contig 42

Contig42_TypeI

Type II restriction-modification system encoded in contig 486

Contig486_TypeII

Type III restriction-modification system encoded in contig 493

Contig493_TypeIII