Home

Pablo Pareja Tobes edited this page May 31, 2013 · 37 revisions

This is the wiki for the project BG7.

Pipeline schema:

Click here for the linkable SVG file.

[Figure: BG7 pipeline schema]

Input data format

RNA sequence constraints:

The headers of the FASTA file containing the RNA sequences must follow the format of the .frn files found in RefSeq; that is, they should look something like this:

>ref|NC_011283|:75804-75898|Sec tRNA| [locus_tag=KPK_0076]

You can find an example here for the RNA file of a Clostridium strain: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Clostridium_SY8519_uid68705/NC_015737.frn
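To make the header constraint concrete, here is a minimal sketch of a header check in Python. The regular expression is inferred from the single example header above and is not taken from the BG7 source; the function name, the field names and the optional `c` prefix for complement-strand coordinates are assumptions.

```python
import re

# Pattern inferred from the one example header on this page; real .frn
# files may contain variations that this sketch does not cover.
HEADER_RE = re.compile(
    r"^>ref\|(?P<accession>[^|]+)\|:(?P<strand>c?)(?P<start>\d+)-(?P<end>\d+)"
    r"\|(?P<product>[^|]*)\|\s*\[locus_tag=(?P<locus_tag>[^\]]+)\]\s*$"
)


def parse_rna_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    # Assumption: a leading 'c' before the coordinates marks the complement strand.
    fields["strand"] = "-" if fields["strand"] == "c" else "+"
    return fields


example = ">ref|NC_011283|:75804-75898|Sec tRNA| [locus_tag=KPK_0076]"
print(parse_rna_header(example))
```

A header that returns None here would need to be fixed before running the pipeline, assuming the pattern above reflects what BG7 actually expects.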

Features

BG7

This program is the main entry point for the project. It relies on the 'executions.xml' file, which specifies the sub-programs and their arguments so that, in the end, the whole annotation process is performed. The corresponding jar file can be found in the /jars project folder.

Expected execution times:

  • FixFastaHeaders: almost instantaneous
  • PredictGenes: about 10 minutes
  • RemoveDuplicatedGenes: 5 to 10 minutes
  • SolveOverlappings: 10 to 15 minutes
  • FillDataFromUniprot: directly proportional to the number of proteins (with many proteins it sometimes stalls for a while; we suspect UniProt temporarily cuts off access from our IP)
  • FillDataFromBio4j: ~ 1 minute
  • GenerateCSVFile: almost instantaneous

FixFastaHeaders

Associates a unique ID with each FASTA header.
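As a rough illustration of the idea, the sketch below prepends a sequential ID to every header line. The ID scheme (`seq1`, `seq2`, ... followed by `|`) and the function name are hypothetical; the format that FixFastaHeaders actually writes is not documented on this page.

```python
def add_unique_ids(lines, prefix="seq"):
    """Yield FASTA lines, prepending a sequential ID to every header line."""
    counter = 0
    for line in lines:
        if line.startswith(">"):
            counter += 1
            # Hypothetical ID scheme: <prefix><counter>, separated by '|'.
            yield f">{prefix}{counter}|{line[1:]}"
        else:
            # Sequence lines pass through unchanged.
            yield line


fasta = [">contig A", "ACGT", ">contig A", "GGCC"]
fixed = list(add_unique_ids(fasta))
print("\n".join(fixed))
```

Even when two input headers are identical, as in the example, the resulting headers are distinct.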

FillDataFromUniprot

Completes protein data by performing HTTP requests to the UniProt site.

FillDataFromBio4j

Completes protein data by retrieving it from the Bio4j DB.

RemoveDuplicatedGenes

Removes all genes that are duplicated.

SolveOverlappings

Resolves every overlap found between genes and RNAs.
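The detection half of this step can be sketched as a plain interval test. The sketch assumes features are inclusive (start, end) coordinate pairs on the same contig, and the names are illustrative. How BG7 decides which feature to keep or trim once an overlap is found is not described here, so that part is left out.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_overlaps(genes, rnas):
    """Return every (gene_id, rna_id) pair whose intervals overlap."""
    return [
        (gene_id, rna_id)
        for gene_id, gene_span in genes.items()
        for rna_id, rna_span in rnas.items()
        if overlaps(gene_span, rna_span)
    ]


# Made-up coordinates for illustration only.
genes = {"gene1": (100, 500), "gene2": (600, 900)}
rnas = {"rna1": (450, 520), "rna2": (1000, 1100)}
pairs = find_overlaps(genes, rnas)
print(pairs)
```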

PredictGenes

This is one of the most important programs/steps in the semi-automatic annotation process. It carries out the gene prediction phase of the process.

GenerateFastaFiles

Generates two multifasta files for the genes that have been predicted by the end of the process: one with the nucleotide sequences and the other with the amino acid sequences.

GetIntergenicSequences

Generates both an XML file and a multifasta file containing every intergenic sequence.
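One way to picture what an intergenic sequence is: the stretches of a contig not covered by any annotated feature. The sketch below computes those gaps from inclusive 1-based coordinates. The function name and the toy data are assumptions, and the real program may treat strands or contig ends differently.

```python
def intergenic_regions(contig_length, features):
    """Return inclusive 1-based (start, end) gaps not covered by any feature."""
    gaps = []
    next_free = 1  # first position not yet covered by a feature
    for start, end in sorted(features):
        if start > next_free:
            gaps.append((next_free, start - 1))
        # max() keeps the cursor correct when features overlap or nest.
        next_free = max(next_free, end + 1)
    if next_free <= contig_length:
        gaps.append((next_free, contig_length))
    return gaps


contig = "AAACCCGGGTTTAAACCC"  # toy contig, 18 bases
features = [(4, 6), (5, 9), (13, 15)]
regions = intergenic_regions(len(contig), features)
# Convert 1-based inclusive coordinates to Python slices.
sequences = [contig[start - 1:end] for start, end in regions]
print(regions, sequences)
```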

GenerateGffFile

Generates the GFF file corresponding to the final XML results file.

GenerateCSVFile

Exports the final XML results file to a CSV file.

RemoveDismissedGenes

Creates a new annotation XML file that omits the dismissed genes present in the input annotation XML file.

Exporting data to other formats

Export Embl files

Exports the final XML annotation file to EMBL format (one file per contig).

Export GenBank files

Exports the final XML annotation file to GenBank format.

Export 5 columns GenBank files

Exports the final XML annotation file to '5 columns' GenBank format (the one used for genome submissions).

Test programs

CheckForIterationQueryDefErrors

Looks for malformed <Iteration_query-def> values in BLAST output XML files, specifically values with the wrong number of '|' characters.
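A check of this kind can be sketched with Python's standard XML parser. The expected number of '|' characters is not stated on this page, so it is passed in as a parameter; the function name and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def find_bad_query_defs(xml_text, expected_pipes):
    """Return the <Iteration_query-def> values whose '|' count is unexpected."""
    root = ET.fromstring(xml_text)
    bad = []
    for node in root.iter("Iteration_query-def"):
        value = node.text or ""
        if value.count("|") != expected_pipes:
            bad.append(value)
    return bad


# Tiny hand-written stand-in for a BLAST XML output file.
sample = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration><Iteration_query-def>a|b|c</Iteration_query-def></Iteration>
    <Iteration><Iteration_query-def>a|b</Iteration_query-def></Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

# expected_pipes=2 is an arbitrary choice for this toy input.
bad = find_bad_query_defs(sample, expected_pipes=2)
print(bad)
```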

GetOrganismFreqFromBlastIterations

Generates some statistics about proteins grouped by organism.

Quality control programs

BasicQualityControl

Performs a very basic (but still useful) quality control on the final annotation results XML file.

AutomaticQualityControl

Performs an automatic quality control on a random selection of results from the final annotation XML file.

Control GenBank files quality

Quality control program for the GenBank file exporter program: Export GenBank files

Control 5 columns GenBank files quality

Quality control program for the '5 columns' GenBank file (those used for genome submissions) exporter program: Export 5 columns GenBank files

FixFastaHeadersQC

Quality control program for the file generated by the program 'FixFastaHeaders'.