Identification and functional investigation of genome-encoded, small, secreted peptides in plants

Protocol #1: Small Secreted Peptide (SSP) gene discovery from genomic sequences

Traditional genome annotation pipelines are biased toward discovering long genes, which causes some small secreted peptide (SSP) genes to be missed. The following workflow is optimized to identify SSP genes from assembled genomic sequences, using SSP-specific RNA-seq data as expression evidence together with conserved SSP motifs.

1.1. Prerequisites

1.1.1. Suggestions:

  1. We recommend an X-windows desktop (such as GNOME/XFCE/MATE) rather than a plain SSH terminal because it is more convenient for editing files.
  2. All commands below are typed in a Linux terminal.
  3. A line starting with # in a Linux command block is explanatory information only.
  4. You need a non-root user with sudo privileges on the host system to install the Docker packages and enable the Docker service. Ask your system administrator to install the Docker service and add your user to the docker group if you cannot obtain sudo privileges.
  5. You need to be a sudo user or a member of the docker group on the host system to start the Docker container and attach to the container terminal.
  6. The default user inside the Docker container is test.

1.1.2. Computer:
A high-performance computer (i7/Xeon processor and >16 GB RAM) with CentOS 7, Ubuntu 16.04 or later as the host operating system (OS).

1.1.3. Work folder
The work folder holds all raw input data (genomic sequences, GFF, RNA-seq, protein sequences, etc.) and the analysis results on your host OS. We recommend placing it under your home directory. For example, if your username on the host OS is test, the recommended work folder is /home/test/work. To create the work folder in your home directory on the host OS:

cd ~   # ~ means your home directory, e.g. /home/test
mkdir work

1.1.4. Input data

  • Genomic sequences in FASTA format
  • Reference annotation in GFF format, if available
  • SSP-specific RNA-seq data (expression evidence) in compressed FASTQ format
  • Protein sequences of known SSP genes and other related proteins in FASTA format
  • Other EST/transcript sequences from the same species

1.1.5. Demo data
The demo data is available for download. In the host OS, change to your work folder and run the following commands to download and unpack it:

cd ~/work
wget http://bioinfo.noble.org/manuscript-support/ssp-protocol/ssp-demo.tar.gz
tar -xzvf ssp-demo.tar.gz

The above commands download the demo file ssp-demo.tar.gz into the work folder and uncompress it, creating the ssp folder under work.

In the ~/work/ssp/data folder, ssp_family.fa contains the protein sequences of known SSP genes. This known SSP file is used in the MAKER genome annotation (Protocol #1) and the SSP gene annotation (Protocol #2).

1.1.6. Software installation
All required software has been configured and packed into a Docker image hosted on Docker Hub. First, install the Docker packages and enable/start the Docker service on your host OS:

Under CentOS 7, install docker packages:

sudo yum install docker

If you are using Ubuntu, install docker packages as below:

sudo apt install docker.io

Enable and start the Docker service (CentOS/Ubuntu):

sudo systemctl enable docker
sudo systemctl start docker    

Then, start a container from the SSP-mining image to obtain a Linux command line:

sudo docker run -d -it -e "uid=$(id -u)" -e "gid=$(id -g)" --name sspvm -v $(pwd)/work:/work docker.io/noblebioinfo/sspgene
sudo docker attach sspvm

The above commands start a Docker container named sspvm using docker.io/noblebioinfo/sspgene as the template image. This step will take a while, depending on your network download speed.

In -v $(pwd)/work:/work, $(pwd)/work is the path of the work folder in your host OS, i.e. work under your current directory; $(pwd) is expanded by the Bash interpreter to your current folder, e.g. your home folder. The work folder on the host OS is mounted at /work inside the Docker container, which makes it possible to exchange data between the host computer ($(pwd)/work) and the Docker container (/work). You can copy the demo data or your own research data to the work folder on the host OS ($(pwd)/work) and access it under /work inside the Docker container.
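
For example, a file copied into the work folder on the host appears immediately inside the container (mydata.fastq.gz is only a placeholder name):

# In the host OS terminal:
cp mydata.fastq.gz ~/work/
# In the attached container terminal:
ls /work/mydata.fastq.gz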

The attach subcommand links your current Linux terminal to the running Docker container (sspvm in this case).
Tip: to detach from the container terminal and return to the host OS, hold the Ctrl key and press P, then Q.

Type the following command in the attached Docker container terminal to enter the demo data folder:

cd /work/ssp

All Linux commands below should be typed in this container terminal.

1.2. Prepare RNA-seq based gene expression evidence for the MAKER pipeline

Some plant SSP genes are expressed only under specific conditions or in specific tissues, such as nutrient deficiency or root tissue. Matching RNA-seq data therefore helps to improve the performance of SSP gene mining. The following sample code performs reference-based transcriptome assembly and generates a GFF file for the MAKER genome annotation.

1.2.1. Prepare work folder

cd /work/ssp
mkdir transcriptome
cd transcriptome/

1.2.2. Index the genomic sequences using HISAT2

hisat2-build /work/ssp/data/genome.fa genome_hisat2

1.2.3. Extract splicing sites (if reference annotation is available) using HISAT2

gffread /work/ssp/data/maker/ref.gff3 -T -o ref.gtf
hisat2_extract_splice_sites.py ref.gtf > splicesites.txt

1.2.4. Map RNA-seq reads onto the genomic sequences

time hisat2 -p 20 -x genome_hisat2 --known-splicesite-infile splicesites.txt --dta --dta-cufflinks -1 /work/ssp/data/RNA-seq/root_R1.fq.gz,/work/ssp/data/RNA-seq/bud_R1.fq.gz -2 /work/ssp/data/RNA-seq/root_R2.fq.gz,/work/ssp/data/RNA-seq/bud_R2.fq.gz | samtools view -bS - > all_runs.bam

-1 and -2 are the input parameters for paired-end libraries; -U is the input parameter for single-end libraries.
all_runs.bam is the resulting mapping file.
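
If your libraries are single-end, the mapping command would instead look like the following sketch (single.fq.gz is a placeholder file name):

hisat2 -p 20 -x genome_hisat2 --known-splicesite-infile splicesites.txt --dta --dta-cufflinks -U /work/ssp/data/RNA-seq/single.fq.gz | samtools view -bS - > all_runs.bam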

1.2.5. Sort BAM file using sambamba

sambamba sort -m 40G --tmpdir tmp/ -o all_runs.sorted.bam -p -t 20 all_runs.bam

all_runs.sorted.bam is the sorted BAM file.

1.2.6. Generate reference-based transcriptome file

stringtie all_runs.sorted.bam -o transcriptome_models.gtf -p 20
cufflinks2gff3 /work/ssp/transcriptome/transcriptome_models.gtf > /work/ssp/transcriptome/transcriptome_models.gff3

In the above commands, -p 20 or -t 20 is the number of CPU cores assigned to the program; type nproc to check the maximum number available on your computer. -m 40G is the maximum RAM assigned to the sorting step; type free to check your computer's RAM size.
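
For example:

nproc     # number of available CPU cores
free -g   # total and free RAM in gigabytes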

transcriptome_models.gff3 is the output file containing the assembled transcriptome. It will be used as expression evidence in the MAKER genome annotation (step 1.3.2.).

1.3. Genome annotation procedure for mining SSP genes using the MAKER pipeline

The general genome annotation procedure can be optimized to identify more SSP genes by including SSP-specific expression evidence and conserved domains of known SSPs.

1.3.1. Prepare MAKER configuration file

The protocol for genome annotation using MAKER is well documented (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). We installed and tested the MAKER pipeline in the Docker image.

Users need to generate three MAKER configuration files of the maker_opts type: maker_opts_1.ctl, maker_opts_2.ctl and maker_opts_3.ctl. In addition, MAKER also needs the maker_bopts.ctl and maker_exe.ctl configuration files. These files contain the paths of the input data files and other settings for the genome annotation. MAKER takes these files as input and generates the final GFF file with the genome annotation. The annotation procedure is run for three rounds to generate optimized results.

The transcriptome GFF file generated in the previous step (1.2.6.) and the known SSP protein sequences (as of 01/2019, under /work/ssp/data) are included in the three maker_opts files; an illustrative excerpt is shown below. This additional information helps MAKER identify novel SSP genes.
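
In these lines, the option names (est_gff, protein, snaphmm) are standard MAKER keys, while the exact paths shown here are assumptions that depend on your setup and the demo data:

est_gff=/work/ssp/transcriptome/transcriptome_models.gff3   #RNA-seq expression evidence from step 1.2.6.
protein=/work/ssp/data/ssp_family.fa   #known SSP protein sequences
snaphmm=/work/ssp/snap/mt.hmm   #SNAP HMM file (rounds 2 and 3 only)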

1.3.2. Run MAKER pipeline

We generated optimized gene models with the SNAP gene predictor for Medicago truncatula. If you want to use these optimized gene models, skip to Round 3. If you are going to generate these files yourself (for another species, for example), run rounds 1, 2 and 3.

1.3.2.1. Round 1

We included all MAKER configuration files in the demo data. Users should be able to run MAKER for the first round using the following command:

cd /work/ssp
time /usr/lib64/mpich/bin/mpiexec -n 20 maker -fix_nucleotides /work/ssp/data/maker/maker_opts_1.ctl /work/ssp/data/maker/maker_bopts.ctl /work/ssp/data/maker/maker_exe.ctl 1>&2 2>log_round1

-n 20 is the number of CPU cores. Please check the log file if the pipeline fails.

Then, generate the GFF file using the following command:

gff3_merge -d genome.maker.output/genome_master_datastore_index.log -s > maker_all_round1.gff
(head -1 maker_all_round1.gff;cat maker_all_round1.gff|grep -P $'\t(CDS|contig|exon|five_prime_UTR|gene|mRNA|three_prime_UTR)\t') > maker_round1.gff

Then, generate optimized gene models with the SNAP gene predictor using the following commands:

mkdir snap
cd snap
maker2zff -d /work/ssp/genome.maker.output/genome_master_datastore_index.log
/opt/maker/exe/snap/fathom -categorize 1000 genome.ann genome.dna
/opt/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna   # standard SNAP training step; produces export.ann/export.dna
/opt/maker/exe/snap/forge export.ann export.dna
/opt/maker/exe/snap/hmm-assembler.pl mt . > mt.hmm
cd ..

1.3.2.2. Round 2

In the second round, users need to include the HMM file generated by the SNAP gene predictor in Round 1. This HMM file (mt.hmm) is included in the demo data, as is the maker_opts_2.ctl file.

Run MAKER again:

time /usr/lib64/mpich/bin/mpiexec -n 20 maker -fix_nucleotides /work/ssp/data/maker/maker_opts_2.ctl /work/ssp/data/maker/maker_bopts.ctl /work/ssp/data/maker/maker_exe.ctl  1>&2 2>log_round2

Then, generate the GFF file again using the following command:

gff3_merge -d genome.maker.output/genome_master_datastore_index.log -s > maker_all_round2.gff
(head -1 maker_all_round2.gff;cat maker_all_round2.gff|grep -P $'\t(CDS|contig|exon|five_prime_UTR|gene|mRNA|three_prime_UTR)\t') > maker_round2.gff

Then, generate optimized gene models again with the SNAP gene predictor using the following commands:

mkdir snap2
cd snap2
maker2zff -d /work/ssp/genome.maker.output/genome_master_datastore_index.log
/opt/maker/exe/snap/fathom -categorize 1000 genome.ann genome.dna
/opt/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna   # standard SNAP training step; produces export.ann/export.dna
/opt/maker/exe/snap/forge export.ann export.dna
/opt/maker/exe/snap/hmm-assembler.pl mt2 . > mt2.hmm
cd ..

1.3.2.3. Round 3

In the third round, users need to include the HMM file generated by the SNAP gene predictor in Round 2. This HMM file (mt2.hmm) is included in the demo data, as is the maker_opts_3.ctl file.

Run MAKER again:

time /usr/lib64/mpich/bin/mpiexec -n 20 maker -fix_nucleotides /work/ssp/data/maker/maker_opts_3.ctl /work/ssp/data/maker/maker_bopts.ctl /work/ssp/data/maker/maker_exe.ctl  1>&2 2>log_round3

Then, generate the GFF file again using the following command:

gff3_merge -d genome.maker.output/genome_master_datastore_index.log -s > maker_all_round3.gff
(head -1 maker_all_round3.gff;cat maker_all_round3.gff|grep -P $'\t(CDS|contig|exon|five_prime_UTR|gene|mRNA|three_prime_UTR)\t') > maker_round3.gff

1.4. Genome annotation procedure for mining SSP genes using the SPADA pipeline

The SPADA pipeline uses conserved-domain information (in HMM format) of known SSP families to identify SSP genes from genomic sequences (Zhou et al., 2013). The Docker image includes a copy of SPADA together with a comprehensive HMM dataset built from the PlantSSP database (Ghorbani et al., 2015) and our curated known SSPs.

Here is an example of SPADA analysis:

perl /opt/spada_soft/spada/spada.pl --cfg /opt/spada_soft/spada/cfg.txt -d sspanno -p /opt/spada_soft/spada/CRP_PlantSSPv1_Noble -f data/genome.fa -t 20 -o arabidopsis 1>&2 2>spada_log

In this example:
/opt/spada_soft/spada/: the location where the SPADA pipeline is installed.
/opt/spada_soft/spada/CRP_PlantSSPv1_Noble: the location of the above-mentioned comprehensive HMM dataset. You can change it to any HMM dataset of your choice.
data/genome.fa: the genomic sequences to be analyzed.
/opt/spada_soft/augustus/config/species/: contains the AUGUSTUS species models; choose the species closest to yours.
arabidopsis: the folder name under /opt/spada_soft/augustus/config/species/ passed to -o.
The resulting GFF file is available at sspanno/31_model_evaluation/61_final.gff

1.5. Merge the annotation results from MAKER and SPADA

There are (partially) duplicated genes between the MAKER and SPADA outputs, so users need to decide which gene model to remove. If you are working with a model plant, it is better to retain the official annotation.
Here we describe the procedure to identify duplicated genes:

1.5.1. Generate CDS sequences from the different annotations using the GFF files and genomic sequences

Generate CDS sequences for SPADA gene models:

gffread sspanno/31_model_evaluation/61_final.gff -g data/genome.fa -x spada_cds.fa

Replace the transcript ID with the gene ID:

sed -ri 's/^>\S+\s+gene=/>/' spada_cds.fa     

1.5.2. Run NCBI BLASTN between the two generated CDS files and keep only the query-hit gene pairs with a >50% overlapping region; a sketch of this step is shown below.
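
A minimal sketch of this step is shown below. It assumes the MAKER CDS file is generated in the same way as the SPADA one (step 1.5.1), and the awk filter on alignment length versus query/subject length is only one way to express the >50% overlap criterion:

# Generate CDS sequences for the MAKER gene models (analogous to step 1.5.1)
gffread maker_round3.gff -g data/genome.fa -x maker_cds.fa
sed -ri 's/^>\S+\s+gene=/>/' maker_cds.fa
# Build a BLAST database from the SPADA CDS file and search the MAKER CDS against it
makeblastdb -in spada_cds.fa -dbtype nucl
blastn -query maker_cds.fa -db spada_cds.fa -evalue 1e-10 -outfmt '6 qseqid sseqid pident length qlen slen' -out maker_vs_spada.blastn
# Keep query-hit pairs whose alignment covers more than 50% of the query or the hit
awk '($4/$5)>0.5 || ($4/$6)>0.5' maker_vs_spada.blastn > maker_vs_spada.overlap50.txt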

1.5.3. Check the coordinates of the query-hit pairs in the MAKER and SPADA GFF files

If both genes of a pair overlap and have the same coordinates, treat one of them as redundant.

First, create BED files from the MAKER and SPADA GFF files:

cat sspanno/31_model_evaluation/61_final.gff | awk 'BEGIN{FS="\t"}; BEGIN{OFS="\t"}; $3=="gene"' | tr '; ' '\t' | awk 'BEGIN{FS="\t"}; BEGIN{OFS="\t"};{print $1,$4,$5,$9}' | sed 's/ID=//' > spada.bed
cat maker_round3.gff | awk 'BEGIN{FS="\t"}; BEGIN{OFS="\t"}; $3=="gene"' | tr '; ' '\t' | awk 'BEGIN{FS="\t"}; BEGIN{OFS="\t"};{print $1,$4,$5,$9}' | sed 's/ID=//' > maker.bed

Then, check the overlaps between MAKER and SPADA and extract the duplicated genes from SPADA:

bedtools intersect -wa -wb -a maker.bed -b spada.bed -filenames | cut -f8 > dup_gene_in_spada.txt

1.5.4. Remove duplicate genes from the corresponding GFF file

gffremove.py --infile sspanno/31_model_evaluation/61_final.gff --outfile new_spada.gff --genefile dup_gene_in_spada.txt

dup_gene_in_spada.txt contains the list of genes to be removed. new_spada.gff is the new GFF file.

1.5.5. Merge MAKER and SPADA GFF files without redundant genes

head -1 maker_round3.gff  > header
cat maker_round3.gff new_spada.gff | grep -v "#" > maker_spada.gff
cat header maker_spada.gff > all.gff

1.5.6. Generate protein and transcript files

The protein and transcript files will be used to further annotate gene functions in Protocol #2.

Protein sequences:

gffread all.gff -g data/genome.fa -y all_protein.fa
sed -ri 's/^>\S+\s+gene=/>/' all_protein.fa

Transcript sequences:

gffread all.gff -g data/genome.fa -w all_transcript.fa

Protocol #2: Functional annotation and family classification of SSP genes

Because of the short conserved regions and the low homology among members of the same gene family, a standard NCBI BLASTP search is not efficient for identifying SSP proteins. Here we introduce a comprehensive annotation procedure to identify SSPs among candidate genes.

2.1. Prerequisites

The prerequisites are almost the same as in Protocol #1 (see 1.1), except for the input data.

2.1.1. Input data

  • SSP gene expression evidence (RNA-seq data) in compressed FASTQ format
  • Protein sequences (with the gene ID as the protein ID) of SSP gene candidates in FASTA format
  • Transcript sequences of SSP gene candidates and a two-column mapping file between gene and transcript IDs
  • Known SSP protein sequences and an HMM library; both are available in the Docker image and the demo data

2.2. Keep only short sequences (< 250 a.a.)

keepshortseq all_protein.fa 250 > short-seq.fa

2.3. Smith-Waterman search against known SSP proteins

Smith-Waterman alignment is a more accurate way to search for sequence homologs than BLAST. The wrapped shell script swsearch uses the FASTA software package to perform a Smith-Waterman search against the known SSP proteins (ssp_family.fa in the demo data) and reports the two top hits (e-value < 0.01). The script removes the hit sequence IDs and outputs only the family names.

swsearch short-seq.fa /work/ssp/data/ssp_family.fa 0.01 > sw.txt

2.4. HMM search against HMMs of known SSP families

2.4.1. Generate an HMM library of all known SSP families from the SPADA installation

cat /opt/spada_soft/spada/CRP_PlantSSPv1_Noble/15_hmm/*.hmm > all.hmm

2.4.2. Compile the HMM library

/opt/spada_soft/hmmer/bin/hmmpress all.hmm

2.4.3. Search your protein sequences against the HMM library

/opt/spada_soft/hmmer/bin/hmmscan --cpu 20 -E 0.01 --tblout hmm_output.txt all.hmm short-seq.fa > /dev/null

The expectation value cutoff (-E) for hmmscan is 0.01. short-seq.fa is the input protein file and all.hmm is the HMM library file.

The above command generates a whitespace-delimited table file, hmm_output.txt. In this table, column #1 is the HMM family name and column #3 is the gene ID from the input protein sequence file (column #2 is the HMM accession).
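
As a small post-processing sketch (assuming the default HMMER3 --tblout layout), a simple family-to-gene table can be extracted as follows:

# Skip the '#' comment lines; field 1 is the HMM (family) name, field 3 is the query (gene) ID
grep -v '^#' hmm_output.txt | awk '{print $1 "\t" $3}' | sort -u > hmm_family_gene.txt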

2.5. Signal peptide detection

We use SignalP to predict signal peptides in the SSP candidates. We recommend using the "no TM" network and the long output format. The D-score threshold is usually 0.45 or 0.5, depending on the type of network chosen (with or without transmembrane segments), but we recommend a D-score ≥ 0.45 for known SSPs and ≥ 0.25 for putative SSPs.

/opt/spada_soft/signalp-4.1/signalp -t euk -f long -s notm short-seq.fa > signalp_long.txt
cat signalp_long.txt | singalP_parser  > sp.txt

signalp_long.txt: the output prediction result file.
singalP_parser: a script that parses the long-format SignalP output.
The final signal peptide prediction result, sp.txt, includes six columns: gene ID, start coordinate, end coordinate, D-score, cut-off, and SSP prediction (YES/NO).

2.6. Identification of novel SSP gene families using MCL analysis

SSP candidates (SignalP D-score > 0.25) can be clustered into candidate SSP families using the Markov Cluster (MCL) algorithm. The clustering is performed on the last 50 a.a., and candidate peptides should be shorter than 230 a.a.

2.6.1. Create an index to retrieve protein sequences by sequence ID

cdbfasta short-seq.fa

2.6.2. Select proteins with a D-score > 0.45

cat sp.txt | awk '{if($4>0.45) print $1}' | cdbyank short-seq.fa.cidx  > all_putative_ssp.fa

2.6.3. Select proteins shorter than 230 a.a. and take only the last 50 a.a.

shortseqtail all_putative_ssp.fa 230 50 > peptide-tail.fa

2.6.4. Generate the protein vs. protein relationship file

/opt/bin/ssearch35_t -T 20 -Q -H -m 9 -b 100 -d 100 peptide-tail.fa peptide-tail.fa > sw_for_peptidetail

bioparser -t ssearch -m sw_for_peptidetail | awk '{print $3,$6,$14}' FS="\t" | sort | uniq | awk '{ if($1!=$2&&$3 < 0.01) print $0; }' FS=" " > protein-protein-rel.txt

The protein-protein-rel.txt file has three columns: two protein/gene IDs and their relationship expressed as an e-value. All protein-protein pairs with an e-value ≥ 0.01 have been removed.

2.6.5. Cluster the proteins with MCL

mcxload -abc protein-protein-rel.txt --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o protein-protein.mci -write-tab protein-protein.tab
mcl protein-protein.mci -I 1.4 -use-tab protein-protein.tab

In this case, the mcl command generates the output file out.protein-protein.mci.I14, in which all genes/proteins belonging to the same cluster are placed on one line.

Refer to the MCL manual for adjusting -I. The current value of -I generates larger clusters; consider increasing it to obtain smaller clusters.
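
For example, re-running the clustering with a higher inflation value produces finer-grained clusters (the output file name follows mcl's default naming convention):

mcl protein-protein.mci -I 2.0 -use-tab protein-protein.tab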

Next, type the following command to generate a table mapping cluster names to gene/protein IDs:

cat out.protein-protein.mci.I14 | awk '{for(i=1;i<=NF;i++) print "Cluster_" NR "\t" $i}' > mclcluster_protein.txt

mclcluster_protein.txt is the result file containing cluster-protein mapping.

2.7. Gene expression analysis of RNA-seq data

Use the RSEM software (Li & Dewey, 2011) to generate a gene expression table from the RNA-seq data. In the generated table, each gene occupies one row and each RNA-seq sample one column. We recommend using transcripts per million (TPM) values in this table.
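
A minimal RSEM sketch is shown below. It assumes bowtie2 is available, that the reads have been decompressed, and that gene_transcript_map.txt is the two-column gene/transcript mapping file listed in the input data (2.1.1.); all file and sample names are placeholders:

# Build the RSEM reference from the candidate transcripts
rsem-prepare-reference --bowtie2 --transcript-to-gene-map gene_transcript_map.txt all_transcript.fa rsem_ref
# Quantify one paired-end sample; repeat for each sample, then collect the TPM column
# from the resulting *.genes.results files into one table
rsem-calculate-expression --paired-end --bowtie2 -p 20 root_R1.fq root_R2.fq rsem_ref root_sample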

2.8. Perform transmembrane helix (TMH) prediction

TMH prediction is a criterion used to classify the putative SSPs (de Bang et al., 2017), because a gene harboring TMHs cannot be considered an SSP.

2.8.1. Generate putative SSP sequences without the signal peptide regions

Run the following command to generate mature sequences, with the signal peptide regions removed, from the putative SSP protein sequences:

/opt/spada_soft/signalp-4.1/signalp -t euk -f short -m processed_putative.fa -u 0.449 -s notm all_putative_ssp.fa > signalp_putative.txt

processed_putative.fa is the output file that will be used in the next step.

2.8.2. Perform TMH prediction

We recommend the TMHMM server because it is easy to use and accurate, as it is based on a hidden Markov model (Krogh et al., 2001). Use the processed FASTA sequences generated in the previous step (2.8.1.).
The output result file (tm_putative.txt) lists the genes and their numbers of predicted transmembrane helices.

Run the following command:

/opt/tmhmm-2.0c/bin/tmhmm --short processed_putative.fa > tmhmm.out
cat tmhmm.out | cut -f1,5 | sed 's/PredHel=//' > tm_putative.txt

2.9. Perform ER-retention signal search (KDEL, HDEL or KQEL) in the C-terminal region

Only the last 50 amino acids are used in this step. Run the following command to list the peptides carrying ER-retention signals:

cat peptide-tail.fa | sed -r 's/^\s+|\s+$//g'|awk '{if($0~/^>/) print  "\n" $0 "\t"; else print $0;}' ORS=""|grep -Pv '^\s*$' | grep -i  'KDEL\|HDEL\|KQEL'  

Then, generate a FASTA file of the ER-retention peptides to be excluded:

cat peptide-tail.fa | sed -r 's/^\s+|\s+$//g'|awk '{if($0~/^>/) print  "\n" $0 "\t"; else print $0;}' ORS=""|grep -Pv '^\s*$' | grep -i  'KDEL\|HDEL\|KQEL' | awk 'BEGIN{FS="\t"}; BEGIN{OFS="\t"}; {print $1,"\n"$2}' > peptide-ERretention.fa

2.10. Comprehensive table of gene annotation evidence

The result files generated in Sections 2.3 to 2.8 should be merged into a comprehensive data table in which each gene occupies one row and each type of annotation evidence (RNA-seq sample, family name from the Smith-Waterman and HMM searches, D-score from SignalP, cluster ID from the MCL analysis, and number of predicted transmembrane helices) occupies one column. Microsoft Excel is a good option for merging and generating such a table; a command-line sketch is shown below.
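
A minimal command-line alternative is sketched below; it assumes each evidence file has first been reduced to a tab-delimited table with the gene ID in the first column (GNU join requires input sorted on the join field):

# Sort two evidence tables on the gene ID, then outer-join them; missing evidence becomes "NA"
sort -k1,1 sw.txt > sw.sorted
sort -k1,1 sp.txt > sp.sorted
join -t $'\t' -a 1 -a 2 -e NA -o auto sw.sorted sp.sorted > evidence_table.txt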

Further curation based on this table helps to screen known and putative SSP genes. Genes with Smith-Waterman or HMM hits are considered known SSP genes. Other genes with a D-score > 0.25 and shorter than 230 a.a. are considered putative SSP genes. The gene expression values are also helpful for identifying genes with high confidence.

Troubleshooting

The Docker image was tested only on CentOS 7 and Ubuntu 16.04 LTS. We do not recommend using it on Windows or Mac, although it can be installed on those systems. Installing the Docker software and starting its backend service require root or sudo privileges. Downloading the Docker image and starting a container only require membership in the docker user group or root/sudo privileges. Contact your Linux administrator if you are using a virtual Linux machine without root or sudo privileges (for example, in a data center) and run into permission problems when running the Docker container.
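
On most Linux distributions, an administrator can add a user to the docker group with the following command (replace test with the actual username; the user must log out and back in for the change to take effect):

sudo usermod -aG docker test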

Help with MAKER or SPADA usage errors can be found at:
https://groups.google.com/forum/#!forum/maker-devel https://groups.google.com/forum/#!forum/SPADA

HMMER help page can be found at: https://www.ebi.ac.uk/Tools/hmmer/help
Be advised that HMM libraries compiled by different versions of HMMER are incompatible with each other. The HMMER installed in the Docker image is version 3.0. If you plan to compile an HMM library from your own SSP family data, please use the same version.

SignalP frequently asked questions can be found at:
http://www.cbs.dtu.dk/services/SignalP/faq.php

TMHMM instruction information can be found at http://www.cbs.dtu.dk/services/TMHMM/TMHMM2.0b.guide.php

MCL frequently asked questions can be found at: https://micans.org/mcl/

References

  • de Bang, T. C., Lundquist, P. K., Dai, X., Boschiero, C., Zhuang, Z., Pant, P., … Scheible, W.-R. (2017). Plant Physiology, 175(4), 1669-1689.

  • Ghorbani, S., et al. (2015). Expanding the repertoire of secretory peptides controlling root development with comparative genome analysis and functional assays. Journal of Experimental Botany, 66(17), 5257-5269.

  • Krogh, A., et al. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305(3), 567-580.

  • Li, B., & Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12, 323.

  • Zhou, P., Silverstein, K. A., Gao, L., Walton, J. D., Nallu, S., Guhlin, J., & Young, N. D. (2013). Detecting small plant peptides using SPADA (Small Peptide Alignment Discovery Application). BMC Bioinformatics, 14, 335.
