fpozoc/hp-pfamscan

High-Performance PfamScan

This pure Python implementation of PfamScan parallelizes pfam_scan.pl so that a complete proteome can be scanned.

Installation instructions

If Miniconda is not already installed in your Linux environment, run its silent installation:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3

Once Miniconda or Anaconda is installed, clone the repository and create the Python 3 environment:

git clone https://github.com/fpozoc/hp-pfamscan.git
cd hp-pfamscan
conda env create --file environment.yml
conda activate pfamscan

If you prefer not to use Conda, you can follow the instructions described here instead.

Disclaimer: Pfam-B has not been released since version 27. You can take Pfam-A.hmm from current_release and Pfam-B.hmm from version 27, or use only Pfam-A.hmm. More info here.

mkdir -p pfam_db
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz | gunzip > pfam_db/Pfam-A.hmm
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz | gunzip > pfam_db/Pfam-A.hmm.dat
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/active_site.dat.gz | gunzip > pfam_db/active_site.dat
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-B.hmm.gz | gunzip > pfam_db/Pfam-B.hmm
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-B.hmm.dat.gz | gunzip > pfam_db/Pfam-B.hmm.dat

hmmpress pfam_db/Pfam-A.hmm
hmmpress pfam_db/Pfam-B.hmm # Optional

### Download GRCh38 Gencode v33
mkdir -p genome_annotation/GRCh38/g33/
curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.pc_transcripts.fa.gz > genome_annotation/GRCh38/g33/gencode.v33.pc_transcripts.fa.gz
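Before scanning, it can help to confirm that pfam_db holds every file downloaded and pressed above. The following is a minimal sketch, not part of this repository: the function name `missing_pfam_files` and its `use_pfam_b` parameter are hypothetical, and the suffix list assumes the four index files that hmmpress writes next to each .hmm file.

```python
from pathlib import Path

# Index files that hmmpress writes next to each .hmm file.
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pfam_files(pfam_dir, use_pfam_b=False):
    """Return the names of expected database files absent from pfam_dir."""
    pfam_dir = Path(pfam_dir)
    expected = ["Pfam-A.hmm", "Pfam-A.hmm.dat", "active_site.dat"]
    if use_pfam_b:
        # Pfam-B is optional and only available up to release 27.
        expected += ["Pfam-B.hmm", "Pfam-B.hmm.dat"]
    # Every .hmm file also needs its pressed index files.
    for hmm in [name for name in expected if name.endswith(".hmm")]:
        expected += [hmm + suffix for suffix in HMMPRESS_SUFFIXES]
    return [name for name in expected if not (pfam_dir / name).is_file()]
```

Calling `missing_pfam_files("pfam_db")` returns an empty list when the database directory is complete, and otherwise the names of the files that still need to be downloaded or pressed.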

Split the multi-FASTA file into one file per sequence, each named after its transcript ID. The resulting files are stored in genomes/GRCh38/g33/seqs.
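The splitting step could be sketched as follows. This is an illustration rather than the code shipped in this repository: `split_fasta` is a hypothetical helper, and it assumes GENCODE-style headers in which the transcript ID is the first `|`-separated field of a non-empty header line.

```python
import gzip
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a FASTA file (plain or gzipped) to its own file.

    Returns the list of paths written, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    written = []
    handle = None
    with opener(fasta_path, "rt") as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A new record starts: close the previous output file.
                if handle is not None:
                    handle.close()
                # Assumption: the transcript ID is the first "|"-separated field.
                transcript_id = line[1:].strip().split("|")[0].split()[0]
                path = out_dir / f"{transcript_id}.fa"
                handle = path.open("w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written
```

For example, `split_fasta("genome_annotation/GRCh38/g33/gencode.v33.pc_transcripts.fa.gz", "genomes/GRCh38/g33/seqs")` would write one `.fa` file per record. Records that share a transcript ID would overwrite each other in this sketch.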

Once that is done, run src/pfamscan.py to process the sequences locally in batches.

python -m src.run --seqs genome_annotation/GRCh38/g33/gencode.v33.pc_transcripts.fa.gz --outdir out/GRCh38/g33 --pfamdb pfam_db --jobs 10
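To illustrate what the --jobs option implies, here is a minimal sketch of running one pfam_scan.pl call per sequence file with a bounded number of concurrent workers. It is not the implementation in this repository: `build_command`, `run_all` and the `.pfam` output suffix are hypothetical, and the sketch assumes that pfam_scan.pl is on the PATH and accepts its usual -fasta, -dir and -outfile options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(fasta_file, pfam_db, out_dir):
    """Build one pfam_scan.pl call for a single FASTA file."""
    # Hypothetical naming scheme: <input stem>.pfam inside out_dir.
    out_file = Path(out_dir) / (Path(fasta_file).stem + ".pfam")
    return ["pfam_scan.pl", "-fasta", str(fasta_file),
            "-dir", str(pfam_db), "-outfile", str(out_file)]


def run_all(fasta_files, pfam_db, out_dir, jobs=10, runner=subprocess.run):
    """Run one scan per FASTA file, at most `jobs` at a time.

    Returns a dict mapping each input file to its process return code.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = {str(f): build_command(f, pfam_db, out_dir) for f in fasta_files}
    # Threads suffice here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {name: pool.submit(runner, cmd) for name, cmd in commands.items()}
        return {name: future.result().returncode for name, future in futures.items()}
```

The `runner` argument exists only so the scheduling logic can be exercised without pfam_scan.pl installed. With the default, a non-zero return code in the result marks an input file whose scan failed.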

Links of interest

  • Some old pfam_scan.pl starting instructions here.