# Sm GFF HHPred table

## Aim
Generate a new annotation database from the uncurated GFF file (obtained from Alan in September 2017), as previously done in 2015. The goal is to extract the protein sequences, run an HHPred analysis on them, and build a final database that combines the GFF and HHPred annotations.

## Protocol
* Generate HHPred database

In [None]:
# Extract protein sequences from the genome using the gff file (which is still a work in progress)
./gff2fasta.pl ~/data/sm_genome/Smansoni_v7_renamed.fa ~/data/sm_Gene_table/Sm_V7_r8-add_renamed_corrected.gff Sm_v7

# Split the protein fasta file into individual files (one per sequence)
mkdir data
mv Sm_v7.pep.fasta data/
cd data
splitfasta.pl Sm_v7.pep.fasta
cd ..
mv data/Sm_v7.pep.fasta .
rm data/alt_test_1.seq  # Useless extra sequence

# Create a list of the sequence files to process and split it into chunks of 100 for parallelization
mkdir -p list.d  # split does not create the output directory
ls -1 data/* > list
split -l 100 -a 3 -d list list.d/list.

# Run the jobs in parallel (one job per chunk)
for i in list.d/*
do
    qsub -V -cwd -o status -j y -r y -S /bin/bash hhpred-ann.sh "$i"
done

# Build the database
cat results/list.* > hhpred_ann_v7.db
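
The chunking step above can be checked on toy input before submitting real jobs. The following is a minimal sketch, assuming GNU `split` (for the `-d` option); the 250 file names are made up for illustration and nothing here touches the real `data/` folder.

```shell
# Toy demonstration of the chunking step (file names are invented)
workdir=$(mktemp -d)
cd "$workdir"

# Pretend we have 250 single-sequence files
mkdir data list.d
for n in $(seq 1 250); do touch "data/seq_$n.seq"; done

# Same list and split commands as in the protocol
ls -1 data/* > list
split -l 100 -a 3 -d list list.d/list.

# Inspect the chunks: numeric suffixes on 3 digits, at most 100 entries each
chunks=$(ls -1 list.d | tr '\n' ' ')
n_entries=$(cat list.d/list.* | wc -l)
echo "chunks: $chunks"
echo "total entries: $n_entries"
```

With 250 entries and `-l 100`, this gives three chunk files (`list.000`, `list.001`, `list.002`), so three jobs would be submitted. Comparing the total entry count with the number of sequence files is a cheap way to confirm that no sequence was dropped.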

* Extract GFF annotations from the GFF file.

Because the new GFF does not have annotations yet, we use the annotations from the v5.2 version instead.

In [None]:
## Gene table done using v5.2 gff file
## source: ftp://ftp.sanger.ac.uk/pub/pathogens/Schistosoma/mansoni/Latest_assembly_annotation_others/

# Prepare working directory 
old_pwd=$(pwd)
mkdir -p ~/data/sm_Gene_table/Sm_v5.2_ann/
cd ~/data/sm_Gene_table/Sm_v5.2_ann/

# Download the annotation
wget ftp://ftp.sanger.ac.uk/pub/pathogens/Schistosoma/mansoni/Latest_assembly_annotation_others/Schistosoma_mansoni_v5.2.gff.gz
gunzip Schistosoma_mansoni_v5.2.gff.gz

# Remove the fasta sequences: keep everything up to and including the ##FASTA marker
# (prints the whole file unchanged if the marker is absent)
sed '/^##FASTA/q' Schistosoma_mansoni_v5.2.gff > Schistosoma_mansoni_v5.2_ann_only.gff

# Extract product and locus
grep -o "product=.*locus_tag=.*$" Schistosoma_mansoni_v5.2_ann_only.gff > Schistosoma_mansoni_v5.2_gene_tmp.txt 

# Parse locus name
grep -o "locus_tag=.*$" Schistosoma_mansoni_v5.2_gene_tmp.txt | sed "s/locus_tag=//g" > Schistosoma_mansoni_v5.2_gene_name.txt

# Parse product name
cut -d ";" -f 1 Schistosoma_mansoni_v5.2_gene_tmp.txt | sed "s/product=//g" > Schistosoma_mansoni_v5.2_gene_prod.txt 

# Build the table
paste Schistosoma_mansoni_v5.2_gene_name.txt Schistosoma_mansoni_v5.2_gene_prod.txt > Schistosoma_mansoni_v5.2_gene_table.tsv

# Return to the project folder
cd "$old_pwd"
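
To see what the extraction steps produce, here is a minimal sketch run on a toy GFF file. All records, coordinates and product names below are invented; only the `product=` and `locus_tag=` attribute names come from the commands above. Note that the `grep` pattern assumes `product=` comes before `locus_tag=` in the attribute column and that `locus_tag=` is the last attribute on the line.

```shell
# Toy GFF3 file; every record below is invented for illustration
tmp=$(mktemp -d)
{
    printf '##gff-version 3\n'
    printf 'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1;product=hypothetical protein;locus_tag=Smp_000010\n'
    printf 'chr1\tsrc\tgene\t1000\t1900\t.\t-\t.\tID=g2;product=actin;locus_tag=Smp_000020\n'
    printf '##FASTA\n>chr1\nACGTACGT\n'
} > "$tmp/toy.gff"

# Truncate at the ##FASTA marker so that only annotation lines remain
sed '/^##FASTA/q' "$tmp/toy.gff" > "$tmp/ann_only.gff"

# Extract the product ... locus_tag part of each annotated line
grep -o "product=.*locus_tag=.*$" "$tmp/ann_only.gff" > "$tmp/gene_tmp.txt"

# Parse locus and product names, then paste them side by side
grep -o "locus_tag=.*$" "$tmp/gene_tmp.txt" | sed "s/locus_tag=//g" > "$tmp/gene_name.txt"
cut -d ";" -f 1 "$tmp/gene_tmp.txt" | sed "s/product=//g" > "$tmp/gene_prod.txt"
paste "$tmp/gene_name.txt" "$tmp/gene_prod.txt" > "$tmp/gene_table.tsv"

cat "$tmp/gene_table.tsv"
```

The resulting table has one tab-separated line per annotated record, with the locus tag in the first column and the product in the second. Lines whose attributes do not fit the assumed order are silently skipped, so comparing the number of table rows with the number of `locus_tag=` occurrences in the GFF is a useful sanity check.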

* Merge GFF annotations (from v5.2) and newly generated HHPred annotations.

In [None]:
mkdir "0-merged"

cd "0-merged"

ln -s ../hhpred_ann_v7.db .
ln -s ~/data/sm_Gene_table/Sm_v5.2_ann/Schistosoma_mansoni_v5.2_gene_table.tsv .

In [None]:
gtdb="Schistosoma_mansoni_v5.2_gene_table.tsv"
hhdb="hhpred_ann_v7.db"

# Generate table header
echo -e "#Gene_nb\tGFF_annotation\tHHPred_annotation" > "Schistosoma_mansoni_v7.0_gene_table_hhpred.tsv"

while IFS= read -r line
do
    gene=$(printf '%s\n' "$line" | cut -f 1)

    # Look up the v5.2 annotation: first line containing the gene ID (literal match)
    gtan=$(grep -F -m 1 -- "$gene" "$gtdb" | cut -f 2)
    [[ -z "$gtan" ]] && gtan="NA"

    hhan=$(printf '%s\n' "$line" | cut -f 2)

    printf '%s\t%s\t%s\n' "$gene" "$gtan" "$hhan" >> "Schistosoma_mansoni_v7.0_gene_table_hhpred.tsv"
done < "$hhdb"
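
The loop above calls `grep` once per gene, which can be slow on a full gene set. As a possible alternative (not the method used in this notebook), the same merge can be sketched as a single `awk` pass. This sketch assumes that gene identifiers are identical in both tables, because it matches the first column exactly instead of searching the whole line as `grep` does. The toy tables and their contents are invented.

```shell
# Toy input tables; identifiers and annotations are invented
tmp3=$(mktemp -d)
printf 'Smp_000010\thypothetical protein\nSmp_000020\tactin\n' > "$tmp3/gtdb.tsv"
printf 'Smp_000010\tkinase domain\nSmp_000030\tzinc finger\n' > "$tmp3/hhdb.tsv"

out="$tmp3/merged.tsv"
printf '#Gene_nb\tGFF_annotation\tHHPred_annotation\n' > "$out"

# First file: load the GFF annotations (first hit per gene is kept).
# Second file: print gene, GFF annotation (or NA), HHPred annotation.
# Caveat: the NR == FNR idiom assumes the first file is not empty.
awk -F '\t' -v OFS='\t' '
    NR == FNR { if (!($1 in ann)) ann[$1] = $2; next }
    {
        a = ($1 in ann) ? ann[$1] : ""
        if (a == "") a = "NA"
        print $1, a, $2
    }
' "$tmp3/gtdb.tsv" "$tmp3/hhdb.tsv" >> "$out"

cat "$out"
```

Genes present only in the HHPred table get `NA` in the GFF column, as in the loop. If the v7 identifiers differ from the v5.2 locus tags (for example by a suffix), the exact match would miss them and the `grep`-based loop remains the safer choice.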