TBD
- Python >=3.5 (for subprocess.run())
- Bio::HMM::Logo Perl library for calculating HMM information content profiles locally, or the Requests Python 3 library for making skylign.org requests
- HMMER v3.1b2 program binaries in $PATH
- phmmer/hmmsearch target sequence database files; examples can be downloaded using database/sequence/update_databases.sh
- Up-to-date PDB structure residue scheme files; can be generated using database/PDB_mmCIFs/update_mmCIFs.sh
- run_fasta_local.py requires run_hmm_local.py to be present in the same directory; both programs require the hmm_to_logo.pl script
- Write permission in the working directory
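A quick pre-flight check can catch a missing dependency before a long search is started. The following is a minimal sketch and not part of the repository: the function name missing_requirements is hypothetical, and hmmalign is included on the assumption that it is installed alongside phmmer and hmmsearch as part of HMMER.

```python
import shutil
import sys

# Assumed tool list: phmmer and hmmsearch come from the requirements above,
# hmmalign is used by run_fasta_local.py when no hmmsearch database is given.
REQUIRED_BINARIES = ("phmmer", "hmmsearch", "hmmalign")


def missing_requirements(binaries=REQUIRED_BINARIES, min_python=(3, 5)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python >= %d.%d is required" % min_python)
    for name in binaries:
        # shutil.which returns None when the executable is not on $PATH
        if shutil.which(name) is None:
            problems.append("%s not found in $PATH" % name)
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print("WARNING:", problem)
```

The check only covers the Python version and the HMMER binaries; the Perl library, the database files, and the residue scheme files still have to be verified separately.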
Use the file code/run_hmm_local.py as
python3 run_hmm_local.py PROFILE_HMM_FILE HMMSEARCH_TARGET_DATABASE_FILE STRUCTURE_RESIDUE_SCHEMES_DIRECTORY
The HMMSEARCH_TARGET_DATABASE_FILE should be a file containing sequences of PDB chains. Assuming this file and the PDB structure residue schemes have been generated as described in the Requirements, the following example should work:
mkdir test
cd test/
wget http://pfam.xfam.org/family/PF00046/hmm # Downloads the Pfam Homeobox profile HMM
mv hmm Homeobox.hmm # For clarity, has no effect on naming the output files
python3 ../code/run_hmm_local.py Homeobox.hmm ../database/sequence/pdb_seqres_prot.txt ../database/PDB_mmCIFs/schemes/
The directory test/HMM_NAME/ contains, for each matched domain, a JSON file with the color mask. These JSON files can be loaded into the LiteMol plugin included in the html_demo/ directory. The gathering threshold of the input profile HMM determines which domain matches are reported. The numbers in the names of the JSON files indicate the first and last residues of the PDB chain sequence included in the alignment to the profile HMM.
The files test/HMM_NAME.dom and test/HMM_NAME.icp contain a summary of the matched domains and the profile HMM information content profile, respectively. HMM_NAME (e.g., Homeobox) is read from the contents of the input profile HMM file.
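To predict where the output will land before running the script, the profile name can be looked up in the HMM file itself. HMMER3 ASCII profiles carry it on a header line that starts with NAME. The sketch below is an illustration only: read_hmm_name is a hypothetical helper, and how run_hmm_local.py actually extracts the name is not shown here.

```python
def read_hmm_name(lines):
    """Return the value of the first NAME header line, or None if absent.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    for line in lines:
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "NAME":
            return fields[1].strip()
        if line.startswith("HMM "):
            break  # the header is over; the model body starts here
    return None


if __name__ == "__main__":
    # Illustrative header lines, not a complete profile
    sample = [
        "HMMER3/f [3.1b2 | February 2015]\n",
        "NAME  Homeobox\n",
    ]
    print(read_hmm_name(sample))
```

With a real file, the call would be wrapped as `with open("Homeobox.hmm") as handle: read_hmm_name(handle)`.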
Use the file code/run_fasta_local.py as
python3 run_fasta_local.py FASTA_FILE PHMMER_TARGET_DATABASE_FILE (HMMSEARCH_TARGET_DATABASE_FILE) STRUCTURE_RESIDUE_SCHEMES_DIRECTORY
FASTA_FILE must contain a header and a single protein sequence. The header must be in the format used by the Protein Data Bank in Europe: >pdb|PDB_ID|CHAIN_ID. PDB_ID and CHAIN_ID are used for naming output files.
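The header format can be checked before a search is launched. This is a minimal sketch, assuming that the header is split on the | character and that any text after the chain ID can be ignored; parse_pdbe_header is a hypothetical name and does not necessarily match the parsing in run_fasta_local.py.

```python
def parse_pdbe_header(header):
    """Split '>pdb|PDB_ID|CHAIN_ID' into a (pdb_id, chain_id) tuple."""
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    fields = header[1:].strip().split("|")
    if len(fields) < 3 or fields[0] != "pdb" or not fields[1]:
        raise ValueError("expected '>pdb|PDB_ID|CHAIN_ID', got %r" % header)
    # Assumption: anything after the chain ID (e.g. a description) is ignored.
    chain_tokens = fields[2].split()
    if not chain_tokens:
        raise ValueError("missing CHAIN_ID in %r" % header)
    return fields[1], chain_tokens[0]


if __name__ == "__main__":
    print(parse_pdbe_header(">pdb|1ubq|A"))
```

A header that does not follow the format raises ValueError, which makes a malformed input file visible before any HMMER program is run.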
PHMMER_TARGET_DATABASE_FILE should be a file containing high-quality sequences covering the protein sequence space. The script database/sequence/update_databases.sh can be used to download the latest release of the UniProtKB/Swiss-Prot database; the UniProt reference proteomes are a good alternative.
HMMSEARCH_TARGET_DATABASE_FILE is an optional argument. If it is present, it should be a file containing sequences of PDB chains; hmmsearch will be run against these sequences to create mark-up for all PDB structures similar to the FASTA query. If the argument is absent, then the input sequence will be aligned to the profile HMM generated from the phmmer search results using hmmalign; the mark-up will be generated only for this structure.
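Because the optional argument sits in the middle of the argument list, the number of arguments decides how they are interpreted. The sketch below shows one way to express this rule; split_arguments and the dictionary keys are hypothetical and are not taken from run_fasta_local.py.

```python
def split_arguments(args):
    """Map three or four positional arguments onto named inputs.

    'hmmsearch_db' is None when the optional argument is absent, in which
    case the 'mode' entry is 'hmmalign' instead of 'hmmsearch'.
    """
    if len(args) == 3:
        fasta, phmmer_db, schemes_dir = args
        hmmsearch_db = None
    elif len(args) == 4:
        fasta, phmmer_db, hmmsearch_db, schemes_dir = args
    else:
        raise ValueError("expected 3 or 4 arguments, got %d" % len(args))
    return {
        "fasta": fasta,
        "phmmer_db": phmmer_db,
        "hmmsearch_db": hmmsearch_db,
        "schemes_dir": schemes_dir,
        "mode": "hmmalign" if hmmsearch_db is None else "hmmsearch",
    }


if __name__ == "__main__":
    import sys
    print(split_arguments(sys.argv[1:]))
```

With three arguments the structure residue schemes directory is still the last one, so the same rule covers both invocations.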
If all necessary files have been prepared, the following example should work:
mkdir test2
cd test2/
wget http://www.ebi.ac.uk/pdbe/entry/pdb/1ubq/fasta # Downloads FASTA file containing the sequence of ubiquitin
mv fasta 1ubq_A.fasta # For clarity, has no effect on naming the output files
python3 ../code/run_fasta_local.py 1ubq_A.fasta ../database/sequence/uniprot_sprot.fasta ../database/sequence/pdb_seqres_prot.txt ../database/PDB_mmCIFs/schemes/
In this case, the directory test2/PDB_CHAIN_ID/ contains the color mask JSON files corresponding to domains identified using hmmsearch. These are the significant hits that pass the per-target and per-domain inclusion thresholds (E-values of 0.01 and 0.03, respectively).
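The two ways of selecting hits (the gathering threshold for an input profile HMM, inclusion thresholds for a FASTA query) correspond to standard hmmsearch options. The sketch below builds such a command line. The options --cut_ga, --incE, --incdomE and --domtblout exist in HMMER 3.1b2, but whether the scripts pass exactly these options is an assumption, and hmmsearch_command and the file names are hypothetical.

```python
def hmmsearch_command(hmm_file, target_db, domtbl_out, use_gathering=False):
    """Return an argument list suitable for subprocess.run()."""
    cmd = ["hmmsearch", "--domtblout", domtbl_out]
    if use_gathering:
        # Use the GA cutoffs stored in the profile (e.g. a Pfam HMM)
        cmd.append("--cut_ga")
    else:
        # Per-target and per-domain inclusion E-value thresholds
        cmd += ["--incE", "0.01", "--incdomE", "0.03"]
    cmd += [hmm_file, target_db]
    return cmd


if __name__ == "__main__":
    # Printed rather than executed, so the sketch runs without HMMER installed
    print(" ".join(hmmsearch_command("query.hmm", "pdb_seqres_prot.txt", "hits.dom")))
```

To execute the command, the returned list can be handed to subprocess.run(cmd, check=True).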