# Sm GFF HHPred table

## Aim
Generate a new annotation database from the latest GFF file (v7.1), following the same procedure as for previous versions. The steps are to extract the protein sequences, run the HHPred analysis on them, and build a final table that combines the GFF and HHPred annotations.

## Protocol
* Generate HHPred database

In [None]:
# Download the protein sequences
wget "ftp://ftp.sanger.ac.uk/pub/pathogens/Schistosoma/mansoni/v7/annotation/Sm_v7.1.pep.fa.gz"
gunzip Sm_v7.1.pep.fa.gz

# Split the protein FASTA file into one file per sequence
mkdir data
mv Sm_v7.1.pep.fa data/
cd data
splitfasta.pl Sm_v7.1.pep.fa
cd ..
mv data/Sm_v7.1.pep.fa .

# List the sequence files to process and split the list into chunks of 100 for parallelization
ls -1 data/* > list
mkdir -p list.d   # split does not create the output directory
split -l 100 -a 3 -d list list.d/list.

# Run jobs in parallel (one job per chunk)
for i in list.d/* ; do qsub -V -cwd -o status -j y -r y -S /bin/bash hhpred-ann.sh "$i" ; done

# Build the database
cat results/list.* > hhpred_ann_v7.1.db

# Append .1 to transcript names that have no isoform suffix, so they can later be joined with the GFF annotation
sed -i -r "s/(Smp_[0-9]{6})\t/\1.1\t/g" hhpred_ann_v7.1.db

# Comparing GFF and HHPred transcript names shows one difference, which is corrected on the HHPred side.
diff <(cut -f 1 ~/data/sm_Gene_table/Sm_v7.1_ann/Sm_v7.1_transcript_table.tsv | sort) <(cut -f 1 hhpred_ann_v7.1.db | sort)
# Dots are escaped so that they match a literal "." only
sed -i "s/Smp_210550\.1/Smp_210550.2/" hhpred_ann_v7.1.db
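
The renaming and comparison steps above can be tried on a toy table before touching the real database. The following is only a sketch: the IDs, annotations and file names are made up for illustration, and it uses a field-aware `awk` rule as an alternative to the `sed` substitution, under the assumption that the transcript ID is always the first tab-separated column.

```shell
export LC_ALL=C   # same collation for sort and comm
tmp=$(mktemp -d)

# Toy HHPred table and toy list of GFF transcript IDs (hypothetical content)
printf 'Smp_000020\tkinase-like\nSmp_000030.2\ttransporter-like\n' > "$tmp/hhpred.tsv"
printf 'Smp_000020.1\nSmp_000030.2\nSmp_000040.1\n' > "$tmp/gff_ids.txt"

# Append ".1" only when the first column has no isoform suffix
awk -F '\t' -v OFS='\t' '$1 ~ /^Smp_[0-9]+$/ { $1 = $1 ".1" } 1' \
    "$tmp/hhpred.tsv" > "$tmp/hhpred_fixed.tsv"

# List IDs found in the GFF list but absent from the HHPred table
sort "$tmp/gff_ids.txt" > "$tmp/gff_ids.sorted"
cut -f 1 "$tmp/hhpred_fixed.tsv" | sort > "$tmp/hhpred_ids.sorted"
missing=$(comm -23 "$tmp/gff_ids.sorted" "$tmp/hhpred_ids.sorted")
echo "$missing"
```

Compared with `diff`, `comm -23` reports only the IDs unique to the first file, which makes it easier to see which side needs correcting.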

* Extract GFF annotations from the GFF file.

In [None]:
# Download the GFF
wget ftp://ftp.sanger.ac.uk/pub/pathogens/Schistosoma/mansoni/v7/annotation/Sm_v7.1.gff.gz
gunzip Sm_v7.1.gff.gz

# Extract transcript ID and product (assumes they are the 1st and 3rd attributes of each mRNA feature)
awk '$3 == "mRNA" {print $0}' Sm_v7.1.gff | cut -f 9 | awk -F ';' '{print $1"\t"$3}' | sed -r "s/^.*=(.*)\t.*=(.*)/\1\t\2/g" > Sm_v7.1_transcript_table.tsv
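
The extraction above relies on the position of the attributes in column 9. If the attribute order is not guaranteed, the values can be looked up by key instead. The sketch below runs on a two-line toy GFF; the attribute keys `ID` and `product`, the toy records and the file names are assumptions made for illustration and should be checked against the real file.

```shell
tmp=$(mktemp -d)

# Toy GFF with one gene and one mRNA feature (hypothetical content)
printf 'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=Smp_000020\n' > "$tmp/toy.gff"
printf 'chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=Smp_000020.1;Parent=Smp_000020;product=hypothetical protein\n' >> "$tmp/toy.gff"

# For each mRNA feature, search the attribute column for ID= and product=
awk -F '\t' -v OFS='\t' '
    $3 == "mRNA" {
        id = ""; product = ""
        n = split($9, attr, ";")
        for (i = 1; i <= n; i++) {
            if (attr[i] ~ /^ID=/)      id = substr(attr[i], 4)       # skip "ID="
            if (attr[i] ~ /^product=/) product = substr(attr[i], 9)  # skip "product="
        }
        print id, product
    }' "$tmp/toy.gff" > "$tmp/table.tsv"
cat "$tmp/table.tsv"
```

A feature without a `product` attribute yields an empty second column rather than a value taken from another attribute.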

* Merge GFF annotations and newly generated HHPred annotations.

In [None]:
# Join GFF annotation and HHPred annotation
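# Both inputs are sorted on the first tab-separated field only, as join expects
join -t $'\t' <(sort -t $'\t' -k 1,1 Sm_v7.1_transcript_table.tsv) <(sort -t $'\t' -k 1,1 hhpred_ann_v7.1.db) > Sm_v7.1_transcript_table_gff-hhpred.tsv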

# Add header
sed -i -r "1s/^/#Transcript_ID\tGFF_annotation\tHHPred_annotation\n/" Sm_v7.1_transcript_table_gff-hhpred.tsv
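
By default `join` keeps only the transcripts present in both tables, so a transcript without an HHPred result disappears from the merged file. If every GFF transcript should be kept, an outer join is one option. The sketch below uses toy tables; the IDs, annotations, file names and the `NA` placeholder are assumptions for illustration.

```shell
export LC_ALL=C   # same collation for sort and join
tmp=$(mktemp -d)
tab=$(printf '\t')

# Toy tables: the second transcript has no HHPred annotation (hypothetical content)
printf 'Smp_000020.1\tprotein A\nSmp_000040.1\tprotein B\n' > "$tmp/gff.tsv"
printf 'Smp_000020.1\tkinase-like\n' > "$tmp/hhpred.tsv"

sort -t "$tab" -k 1,1 "$tmp/gff.tsv" > "$tmp/gff.sorted"
sort -t "$tab" -k 1,1 "$tmp/hhpred.tsv" > "$tmp/hhpred.sorted"

# -a 1 keeps unmatched lines from the first file; -e fills the missing field;
# -o fixes the output columns (join field, GFF annotation, HHPred annotation)
join -t "$tab" -a 1 -e NA -o 0,1.2,2.2 "$tmp/gff.sorted" "$tmp/hhpred.sorted" > "$tmp/merged.tsv"
cat "$tmp/merged.tsv"
```

Whether unmatched transcripts belong in the final table depends on how the database is used downstream; the inner join remains the simpler choice when only fully annotated transcripts matter.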