# Full annotation pipeline from Max

This is a parameterised notebook (the parameters are in the next cell) that runs Max's full annotation pipeline on a given *A. thaliana* reference FASTA file.

In [None]:
ACCESSION=at6137
SCAFFOLD_TITLE="unmasked"
SCAFFOLDS=NOTFOUND.fasta
OUTDIR=NOTFOUND
NCPUS=${NSLOTS:-64}
REFERENCE_GENES_GFF=output/01_assembly/01_pansn-named/Araport.gff3
REFERENCE_SCAFFOLDS=output/01_assembly/01_pansn-named/Araport.scaffolds.fasta
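
The `NOTFOUND` defaults for `SCAFFOLDS` and `OUTDIR` look like placeholders that are meant to be overridden when the notebook is parameterised. A minimal sketch of a guard that could run right after the parameters cell is shown here; `check_params` is a hypothetical helper for illustration and is not part of Max's pipeline.

```shell
# Hypothetical helper (not in the original pipeline): fail early when the
# placeholder defaults were left in place or the input FASTA is unusable.
check_params() {
    local scaffolds="$1" outdir="$2"
    if [ "$outdir" = "NOTFOUND" ]; then
        echo "OUTDIR still has its placeholder value" >&2
        return 1
    fi
    if [ ! -s "$scaffolds" ]; then
        echo "SCAFFOLDS is missing or empty: $scaffolds" >&2
        return 1
    fi
    return 0
}

# Intended use in the notebook (assumption): check_params "$SCAFFOLDS" "$OUTDIR"
```

With the defaults above the check would fail, which is the point: the notebook should not proceed until real values are supplied.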

In [None]:
conda activate dl20-annotate

# Step 1: liftoff

This recreates Max's `6_gene_annotation/2_liftoff/2_masked/run_liftoff.sh`. We use `Araport.gff3` and `Araport.scaffolds.fasta` (set as `REFERENCE_GENES_GFF` and `REFERENCE_SCAFFOLDS` above) as the TAIR10 reference.

In [None]:
mkdir -p ${OUTDIR}/01_liftoff/
mkdir -p ${OUTDIR}/tmp/01_liftoff

In [None]:
liftoff \
    -exclude_partial \
    -dir ${OUTDIR}/tmp/01_liftoff/ \
    -g $REFERENCE_GENES_GFF \
    -o ${OUTDIR}/01_liftoff/${ACCESSION}~${SCAFFOLD_TITLE}.liftoff.gff \
    -u ${OUTDIR}/01_liftoff/${ACCESSION}~${SCAFFOLD_TITLE}.liftoff.unmapped.gff \
    -copies \
    -p $NCPUS \
    $SCAFFOLDS \
    $REFERENCE_SCAFFOLDS

In [None]:
awk 'BEGIN{FS=OFS="\t"}$3 == "exon" || $3 == "CDS"{print $1, $2, $3, $4, $5, $6, $7, $8, $9 ",source=T"}' \
    < ${OUTDIR}/01_liftoff/${ACCESSION}~${SCAFFOLD_TITLE}.liftoff.gff \
    > ${OUTDIR}/01_liftoff/${ACCESSION}~${SCAFFOLD_TITLE}.liftoff.hints
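
To make the effect of the `awk` filter above concrete, the following sketch applies the same filter to a few made-up liftoff-style records. The gene and feature IDs are invented for illustration; only the filter itself is taken from the cell above.

```shell
# Toy tab-separated GFF3 records (invented for illustration).
# printf reuses the format string for every group of nine arguments.
sample_gff=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    Chr1 Liftoff gene 100 900 . + . 'ID=AT1G01010' \
    Chr1 Liftoff mRNA 100 900 . + . 'ID=AT1G01010.1;Parent=AT1G01010' \
    Chr1 Liftoff exon 100 400 . + . 'ID=exon_1;Parent=AT1G01010.1' \
    Chr1 Liftoff CDS  150 400 . + 0 'ID=cds_1;Parent=AT1G01010.1')

# Same filter as in the notebook: keep only exon and CDS rows and
# append ",source=T" to the attribute column (column 9).
hints=$(printf '%s\n' "$sample_gff" \
    | awk 'BEGIN{FS=OFS="\t"}$3 == "exon" || $3 == "CDS"{print $1, $2, $3, $4, $5, $6, $7, $8, $9 ",source=T"}')

printf '%s\n' "$hints"
```

Only the exon and CDS rows survive, each with `,source=T` appended to its attributes; the gene and mRNA rows are dropped. How the `T` source is weighted is presumably governed by the extrinsic config file passed to Augustus in Step 2, which is not shown in this notebook.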


# Step 2: Augustus

This is derived from `6_gene_annotation/4_augustus/2_annotation/runAugustus.sh`.

## Step 2.1: Augustus annotation from inbuilt `arabidopsis` trained species dataset

This runs Augustus with `--species=arabidopsis`, i.e. the pre-trained weights that the Augustus authors trained on TAIR10.

In [None]:
mkdir -p ${OUTDIR}/02_augustus/
mkdir -p ${OUTDIR}/tmp/02_augustus/01_seqsplit/

In [None]:
seqkit split -i -f -O ${OUTDIR}/tmp/02_augustus/01_seqsplit/ $SCAFFOLDS

In [None]:
mkdir -p ${OUTDIR}/tmp/02_augustus/02_spp-arabidopsis/
# One Augustus job per split sequence; \> is escaped so that each job
# redirects its own output to a per-sequence GFF3 file.
parallel -j $NCPUS augustus \
    --species=arabidopsis \
    --softmasking=1 \
    --hintsfile=${OUTDIR}/01_liftoff/${ACCESSION}~${SCAFFOLD_TITLE}.liftoff.hints \
    --extrinsicCfgFile=input/max-augustus-config_newtest.cfg \
    --gff3=on \
    {} \> ${OUTDIR}/tmp/02_augustus/02_spp-arabidopsis/{/}.gff3 \
    ::: ${OUTDIR}/tmp/02_augustus/01_seqsplit/*.fasta

# Merge the per-sequence predictions into a single GFF3.
cat ${OUTDIR}/tmp/02_augustus/02_spp-arabidopsis/*.gff3 \
    > ${OUTDIR}/02_augustus/${ACCESSION}~${SCAFFOLD_TITLE}_augustus_spp-arabidopsis.gff3

Fix up the IDs: Augustus ran once per sequence, so its IDs restart on every chromosome and collide in the merged file.

In [None]:
conda activate dl20-gff

In [None]:
agat_sq_manage_IDs.pl \
    --gff "${OUTDIR}/02_augustus/${ACCESSION}~${SCAFFOLD_TITLE}_augustus_spp-arabidopsis.gff3" \
    -o "${OUTDIR}/02_augustus/${ACCESSION}~${SCAFFOLD_TITLE}_augustus_spp-arabidopsis.idfix.gff3"
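
One way to check for colliding IDs, before and after the fix, is to list every `ID=` value that occurs more than once. The sketch below does this with plain `awk` on invented records; `duplicate_ids` is a hypothetical helper and is not part of AGAT or of Max's pipeline.

```shell
# Hypothetical helper: print each ID= value that appears more than once.
# Reads the GFF3 files given as arguments, or stdin when none are given.
duplicate_ids() {
    awk -F'\t' '
        /^#/ { next }                      # skip header and comment lines
        match($9, /ID=[^;]+/) {            # locate the ID attribute in column 9
            id = substr($9, RSTART + 3, RLENGTH - 3)
            if (++seen[id] == 2) print id  # report each duplicated ID once
        }' "$@"
}

# Toy merged output (invented): the same gene ID appears on two chromosomes.
toy=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    Chr1 AUGUSTUS gene 100 900  . + . 'ID=g1' \
    Chr2 AUGUSTUS gene 200 800  . - . 'ID=g1' \
    Chr2 AUGUSTUS gene 900 1500 . + . 'ID=g2')

dups=$(printf '%s\n' "$toy" | duplicate_ids)
printf '%s\n' "$dups"
```

Run against the merged `..._augustus_spp-arabidopsis.gff3` this would be expected to list the colliding IDs, and against the `.idfix.gff3` output it should print nothing if the renumbering worked as intended.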