# task_One 
## Testing EVM_Augustus pipeline
_Date: 03/14/2015_

## GeneMark Commands
_Building the HMM and gene predictions using the program gmes_petap_

`perl gmes_petap.pl --ES --cores 2 --sequence sequence.fasta`

_Generating only gene predictions for organisms using the program gmhmme3_

`gmhmme3 -f gff3 -o predictions.gff3 -m Moryzae.mod ~/contig.fasta`

### Options :
*-f FORMAT* : Use FORMAT as the output format. "gff3" and "gtf" are accepted, as well as the default "lst". The last of these three formats is specific to GeneMark.

*-n* : Include gene nucleotide sequences in FASTA format. Only works with the "lst" format.

*-p* : Include gene protein sequences in FASTA format. Only works with the "lst" format.

*-o* FILENAME : Write the output to the specified file. The default depends on the combination of options used (hmm_model).

*-m* FILENAME : The name of the parameter file. This option is required.

<center>$\rightarrow$ Output : <b>gtf file</b></center>

## Have to convert gtf into gff3

`perl /home/savandara/bin/gtf2gff3/gtf2gff3.pl gene_prediction.gtf > gene_prediction.gff3`

_Date: 03/15/2016_
## Exonerate Command

`exonerate --showtargetgff yes --showalignment no --verbose 0 --showvulgar no -n 1 --model protein2genome --percent 50 --ryo "AveragePercentIdentity: %pi\n" -t genome_sequence.fasta -q uniprot_reviewed_allProteins.fasta > protein_alignments.gff`

### Options : 
*--showtargetgff yes* : Include a GFF dump of the target sequence in the output.

*--showalignment no* : Disable the human-readable alignment.

*-n 1* : Report the best N results for each query (here, only the best one). This option reduces the amount of output generated and also allows exonerate to speed up the search.

*--showvulgar no* : Disable the "vulgar" format line in the output.

*--model protein2genome* : Align a protein sequence to genomic DNA.

*--percent 50* : Report only alignments scoring at least this percentage of the maximal score for each query.

<center>$\rightarrow$ Output : <b>gff file</b> (but not clean)</center>
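
The `--ryo` string in the command above makes exonerate print one `AveragePercentIdentity:` line per reported alignment, which gives a quick way to sanity-check the raw output before converting it. A minimal sketch, assuming the output file is named `protein_alignments.gff` as above; `summarize_identity` is a hypothetical helper, not part of exonerate:

```shell
#!/bin/sh
# Summarise the "AveragePercentIdentity: <value>" lines produced by --ryo.
# summarize_identity is a hypothetical helper, not part of exonerate.
summarize_identity() {
    awk '/^AveragePercentIdentity:/ { sum += $2; n++ }
         END {
             if (n > 0) {
                 printf "%d alignments, mean identity %.2f\n", n, sum / n
             } else {
                 print "0 alignments"
             }
         }' "$1"
}

# Only run when the exonerate output is actually present.
if [ -f protein_alignments.gff ]; then
    summarize_identity protein_alignments.gff
fi
```

A count of zero usually means the search reported nothing above the `--percent` threshold, or the file name differs from the one assumed here.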

## Have to convert gff into gff3
`perl exonerate_gff_to_alignment_gff3.pl exonerate.output > protein_alignments.gff3`

_Date: 03/16/2016_

## Part 1 : Creating the inputs for EVM
- Using __GeneMark-ES__ to have ab initio gene prediction
    - Input: __DNA sequence (FASTA file)__
    - Command line : ```perl /home/savandara/bin/gmes_petap/gmes_petap.pl --ES --cores 2 --sequence genome_sequence.fasta```
    - Output : __GeneMark\_hmm.mod__ (used in gmhmme3) + __genemark\_prediction.gtf__
    - $\rightarrow$ Has to be converted to a GFF3 file
- Using __Exonerate__ to have protein alignments
    - Input: __Protein sequence (FASTA file)__ + __AllProtein sequence (FASTA file)__
    - Command line : `exonerate --showtargetgff yes --verbose 0 --showalignment no --showvulgar no --model protein2genome --percent 50 -q uniprot_reviewed_allProteins.fasta -t genome_sequence.fasta > protein_alignments.gff`
    - $\rightarrow$ Has to be converted to a GFF3 file
- Creating the CSV file with the __weight values__
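
To my understanding of the EVM documentation, the weights file is a plain three-column text file (evidence class, source, weight) separated by tabs rather than commas, and the source has to match column 2 of the corresponding GFF3 file. A minimal sketch under those assumptions; the source names `GeneMark.hmm` and `exonerate` and the weights 1 and 5 are placeholders, not values from these notes:

```shell
#!/bin/sh
# Write a tab-delimited weights file: evidence class, source, weight.
# Source names and weight values are placeholders (assumptions); the source
# must match column 2 of the corresponding GFF3 input.
printf 'ABINITIO_PREDICTION\tGeneMark.hmm\t1\n' >  weights.txt
printf 'PROTEIN\texonerate\t5\n'                >> weights.txt

# Fail loudly if any line does not have exactly three tab-separated fields.
awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad + 0 }' weights.txt \
    && echo "weights.txt: OK"
```

Running `cut -f2 gene_predictions.gff3 | sort -u` on each GFF3 input lists the source names that actually need a line in this file.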


_Date: 03/17/2017_

## Part 2 : Prepare set of training and test files with EVM

1. Partitioning the Inputs
    * `perl partition_EVM_inputs.pl --genome genome.fasta 
     --gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3
     --segmentSize 100000 --overlapSize 10000 --partition_listing partitions_list.out`
2. Generating the EVM Command Set
    * `perl write_EVM_commands.pl --genome genome.fasta --weights $(pwd)/weights.txt --gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 --output_file_name evm.out --partitions partitions_list.out > commands.list`
3. Run the commands (locally)
    * `perl execute_EVM_commands.pl commands.list | tee run.log`
4. Combining the partitions
    * `perl recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out`
5. Converting to a GFF3 file
    * `perl convert_EVM_outputs_to_GFF3.pl  --partitions partitions_list.out --output evm.out  --genome genome.fasta` (for each evm.out file)
    * `find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all.gff3`
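
Once the per-partition files are concatenated, counting the gene records is a cheap check that the combined file is not empty or truncated. A minimal sketch, assuming the combined file is `EVM.all.gff3` as above and that gene models appear as `gene` in column 3; `count_features` is a hypothetical helper, not part of EVM:

```shell
#!/bin/sh
# count_features FILE TYPE: count GFF3 records whose third column equals TYPE.
# count_features is a hypothetical helper, not part of EVM.
count_features() {
    awk -F'\t' -v type="$2" '!/^#/ && $3 == type { n++ } END { print n + 0 }' "$1"
}

# Only run when the combined EVM output is actually present.
if [ -f EVM.all.gff3 ]; then
    echo "gene models: $(count_features EVM.all.gff3 gene)"
fi
```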

## Part 3 : Training Augustus

1. Obtain a set of training and test files
    * Comes from the EVM output or already exists in NCBI
    * Split the file with `randomSplit.pl` $\rightarrow$ `bug.gb.test` and `bug.gb.train`
2. Create meta parameters file for your species
    * Command line : `new_species.pl --species=bug`
3. Initial training
    * `etraining --species=bug gene.gb.train`
    * `augustus --species=bug gene.gb.test`
4. Retraining Augustus
    * `etraining --species=bug gene.gb.train`
    * `augustus --species=bug gene.gb.test`

$\Rightarrow$ Will obtain the accuracy of the prediction (should be up to 20% if it is a good model)
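
Assuming the accuracy here refers to the sensitivity and specificity that Augustus prints when run on an annotated test set (at the nucleotide, exon and gene level), the usual gene-prediction definitions are as follows, where $TP$, $FP$ and $FN$ are the true positives, false positives and false negatives at the chosen level. Note that "specificity" in this convention is what is elsewhere called precision.

$$
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TP}{TP + FP}
$$

As a purely hypothetical illustration: if the test set holds 200 annotated genes and 40 of them are predicted exactly, the gene-level sensitivity is $40 / 200 = 20\%$.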


## Part 4 : Gene prediction with Augustus

1. Gene prediction step
    * `augustus --species=bug genome.fasta > final_gene_prediction.gff`
2. Obtain the protein sequences
    * `getAnnoFasta.pl final_gene_prediction.gff`
    