inmembrane

inmembrane is a proteome annotation pipeline that predicts whether a protein is exposed on the surface of a bacterium.

Installation and Configuration

Download the latest version of inmembrane from the GitHub repository: https://github.com/boscoh/inmembrane/zipball/master.

This package includes tests, examples, data files, docs and a few bundled libraries (Beautiful Soup, mechanize and twill).

The editable parameters of inmembrane are found in inmembrane.config, which is always located in the same directory as the main script. If no such file exists, a default inmembrane.config will be generated. The parameters are:

  • the paths of the binaries for SignalP, LipoP, TMHMM, HMMSEARCH, and MEMSAT. Each can be the full path, or just the binary name if it is on the system PATH. Use which to check.
  • 'protocol' to indicate which analysis you want to use. Currently, we support:
    • gram_pos the analysis of surface-exposed proteins of Gram+ bacteria;
    • gram_neg annotation of subcellular localization and inner membrane topology classification for Gram- bacteria
  • 'hmm_profiles_dir': the location of the HMMER profiles for any HMM peptide sequence motifs
  • for HMMER, you can set the cutoffs for significance, the E-value 'hmm_evalue_max', and the score 'hmm_score_min'
  • the shortest length of a loop that sticks out of the peptidoglycan layer of a Gram+ bacterium. SurfG+ determined this to be 50 amino acids for terminal loops, and twice that, 100, for internal loops
  • 'helix_programs' you can choose which of the transmembrane-helix prediction programs you want to use
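
As a rough illustration of the parameters listed above, the sketch below builds a parameter dictionary and checks whether each configured binary can be located. Only 'protocol', 'hmm_profiles_dir', 'hmm_evalue_max', 'hmm_score_min' and 'helix_programs' are names taken from this README; the keys ending in `_bin`, the two loop-length keys and all of the values are assumptions made for the example. Check the generated inmembrane.config for the real names and file syntax.

```python
import os

# Hypothetical parameter set; see the note above about which names are assumed.
params = {
    'signalp_bin': 'signalp',
    'lipop_bin': 'LipoP',
    'tmhmm_bin': 'tmhmm',
    'hmmsearch_bin': 'hmmsearch',
    'memsat_bin': 'runmemsat',
    'protocol': 'gram_pos',
    'hmm_profiles_dir': 'hmm_profiles',
    'hmm_evalue_max': 0.1,
    'hmm_score_min': 10,
    'terminal_exposed_loop_min': 50,
    'internal_exposed_loop_min': 100,
    'helix_programs': ['tmhmm'],
}


def find_binary(name):
    """Return the path of an executable, or None if it cannot be found."""
    if os.path.dirname(name):
        # A path was given, so test it directly instead of searching PATH.
        if os.path.isfile(name) and os.access(name, os.X_OK):
            return name
        return None
    for directory in os.environ.get('PATH', '').split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_binaries(params):
    """List the '*_bin' keys whose program cannot be located."""
    return sorted(key for key, value in params.items()
                  if key.endswith('_bin') and find_binary(value) is None)


if __name__ == '__main__':
    for key in missing_binaries(params):
        print('# not found: %s = %s' % (key, params[key]))
```

Since only one of TMHMM or MEMSAT3 is needed, a missing binary reported here is not necessarily an error.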

We provide a number of unit tests for inmembrane:

python run_tests.py

As inmembrane has a lot of dependencies, these tests are really useful in working out if the dependencies are installed in a way that is compatible with inmembrane. Since not all the binaries are needed, not all tests (and corresponding dependencies) are required for inmembrane to work.

Execution

inmembrane was written in Python 2.7. It takes a FASTA input file and runs a number of external bioinformatic programs on the sequences. It then collects the output to make the final analysis, which is printed out and stored in a CSV file.

inmembrane can be run in two modes. It can be run as a command-line program:

python inmembrane.py your_fasta_file

If run in this mode, the CSV will be given the same basename as the FASTA file.
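
A minimal sketch of that naming rule, assuming the FASTA extension is simply replaced by `.csv` and that the CSV is written alongside the input (both are assumptions; `csv_name_for` is a hypothetical helper, not part of inmembrane):

```python
import os


def csv_name_for(fasta_path):
    """Swap the extension of the FASTA file for '.csv' (assumed behaviour)."""
    base, _ext = os.path.splitext(fasta_path)
    return base + '.csv'


print(csv_name_for('your_fasta_file.fasta'))  # prints your_fasta_file.csv
```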

The other way of running inmembrane.py is with a custom script, such as run_example.py, where all pertinent input is in the script itself. You can either run this on the command line like this:

python run_example.py

or simply double-click run_example.py in a file manager. You can customize the run by duplicating run_example.py and editing the parameters in a text editor. In particular, the parameters include:

  • 'fasta' the input FASTA file
  • 'out_dir' the directory that stores intermediate output
  • 'csv' the output CSV file

Output format

The output of the inmembrane gram_pos protocol consists of four columns. It is printed to stdout and written as a CSV file, which can be opened in spreadsheet software such as Excel. The standard text output can be parsed using space delimiters (empty fields in the third column are indicated with a "."). Logging information is prefaced by a '#' character and is sent to stderr.

Here's an example:

  SPy_0008  CYTOPLASM(non-PSE)  .                         SPy_0008 from AE004092
  SPy_0009  CYTOPLASM(non-PSE)  .                         SPy_0009 from AE004092
  SPy_0010  PSE-Membrane        tmhmm(1)                  SPy_0010 from AE004092
  SPy_0012  PSE-Cellwall        hmm(GW2|GW3|GW1);signalp  SPy_0012 from AE004092
  SPy_0013  PSE-Membrane        tmhmm(1)                  SPy_0013 from AE004092
  SPy_0015  PSE-Membrane        tmhmm(2)                  SPy_0015 from AE004092
  SPy_0016  MEMBRANE(non-PSE)   tmhmm(12)                 SPy_0016 from AE004092
  SPy_0019  SECRETED            signalp                   SPy_0019 from AE004092
  • the first column is the SeqID which is the first token in the identifier line of the sequence in the FASTA file

  • the second column is the prediction, which is one of CYTOPLASM(non-PSE), MEMBRANE(non-PSE), PSE-Cellwall, PSE-Membrane, PSE-Lipoprotein or SECRETED. Any 'PSE' (Potentially Surface Exposed) annotation means that, based on the predicted topology, the protein is likely to be surface exposed and will be protease accessible in a membrane-shaving experiment.

  • the third column is a summary of features detected by external tools:

    • tmhmm(2) means 2 transmembrane helices were found by TMHMM
    • hmm(GW2|GW3|GW1) means that the GW1, GW2 and GW3 motifs were found by HMMER hmmsearch
    • signalp means a secretion signal was found by SignalP
    • lipop means a Sp II secretion signal was found by LipoP with an appropriate CYS residue at the cleavage site, which will be attached to a phospholipid in the membrane
  • the rest of the line gives the full identifier of the sequence in the FASTA file.
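
The column layout described above can be read back with a few lines of Python. This is only a sketch: `parse_line` and `parse_features` are hypothetical helpers, and the ';' separator between features and the '|' separator between motif names are inferred from the example output rather than from a format specification.

```python
def parse_features(field):
    """Turn 'hmm(GW2|GW3|GW1);signalp' into a dict of feature -> arguments."""
    features = {}
    if field == '.':
        # A lone '.' marks an empty third column.
        return features
    for item in field.split(';'):
        if '(' in item and item.endswith(')'):
            name, args = item[:-1].split('(', 1)
            features[name] = args.split('|')
        else:
            features[item] = []
    return features


def parse_line(line):
    """Split one result line into four columns; None for log or blank lines."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    # The fourth column is the rest of the line and may contain spaces.
    seqid, prediction, feature_field, description = line.split(None, 3)
    return {
        'seqid': seqid,
        'prediction': prediction,
        'features': parse_features(feature_field),
        'description': description,
    }


example = """\
# a logging line
  SPy_0008  CYTOPLASM(non-PSE)  .                         SPy_0008 from AE004092
  SPy_0012  PSE-Cellwall        hmm(GW2|GW3|GW1);signalp  SPy_0012 from AE004092
  SPy_0015  PSE-Membrane        tmhmm(2)                  SPy_0015 from AE004092
"""

records = [r for r in map(parse_line, example.splitlines()) if r is not None]
exposed = [r['seqid'] for r in records if r['prediction'].startswith('PSE')]
print(exposed)  # prints ['SPy_0012', 'SPy_0015']
```

Feature arguments are kept as strings, so tmhmm(2) becomes {'tmhmm': ['2']}; convert to int where a count is needed. A line with fewer than four fields raises ValueError in this sketch.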

Installing dependencies

Bioinformatic programs often change substantially between versions, so stable APIs with consistent output formats are the exception rather than the norm. It is very important that you have the exact versions that we have programmed against.

Required dependencies, and their versions:

  • TMHMM 2.0 or MEMSAT3
  • SignalP 4.0
  • LipoP 1.0
  • HMMER 3.0

These instructions have been tailored for Debian-based systems, in particular Ubuntu 11.10. Each of these dependencies is licensed free to academic users.

TMHMM 2.0

Only one of TMHMM or MEMSAT3 is required, but users who want to compare transmembrane segment predictions can install both.

SignalP 4.0

HMMER 3.0

  • Download HMMER 3.0 from http://hmmer.janelia.org/software.
  • The HMMER user guide describes how to install it. For the pre-compiled packages, this is as simple as putting the binaries on your PATH.

LipoP 1.0

MEMSAT3

Note that the 'runmemsat' script refers to PSIPRED v2, but it means MEMSAT3; PSIPRED is not required.

Modification guide

It is a fact of life in bioinformatics that new versions of basic tools change output formats and APIs. We believe that it is an essential skill to rewrite parsers to handle the subtle but significant changes in different versions. We have written inmembrane to be easily modifiable and extensible. Protocols, which embody a particular high-level workflow, are found in inmembrane/protocols.

All interaction with a specific external program or web site has been wrapped into a single Python plugin module and placed in the plugins directory. Each module contains the code both to run the program and to parse its output. We have tried to make the parsing code as concise as possible. Specifically, by using the native Python dictionary, which allows an enormous amount of flexibility, we can extract the analysis with very little code.

inmembrane development style guide:

Here are some guidelines for understanding and extending the code.

  • Confidence: Plugins that wrap an external program should have at least one high level test which is executed by run_tests.py. This allows new users to immediately determine if their dependencies are operating as expected.
  • Interface: A plugin that wraps an external program must receive a params data structure (derived from inmembrane.config) and a proteins data structure (which is a dictionary keyed by sequence id). Plugins should return a 'proteins' object.
  • Flexibility: Plugins should have a 'force' boolean argument that will force the analysis to re-run and overwrite output files.
  • Efficiency: All plugins should write an output file which is read upon invocation to avoid the analysis being re-run.
  • Documentation: A plugin must have a Python docstring describing what it does, what parameters it requires in the params dictionary and what it adds to the proteins data structure. See the code for examples.
  • Anal: Unique sequence ID strings (e.g. gi|1234567) are called 'seqid'. 'name' is ambiguous. 'prot_id' is reasonable, however conceptually a 'protein' is not the same thing as a string that represents its 'sequence', hence the preference for 'seqid'.
  • Anal: All file handles should be closed when they are no longer needed.
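
To make the guidelines concrete, here is a minimal sketch of a plugin that follows them. It is not taken from the inmembrane source: the function name `annotate`, the 'seq' key in each protein entry, the output file name and the JSON storage format are all assumptions, and computing the sequence length stands in for running a real external program.

```python
import json
import os


def annotate(params, proteins, force=False):
    """
    Hypothetical plugin that records the length of every sequence.

    Requires in params: 'out_dir', the directory for intermediate output.
    Adds to proteins: proteins[seqid]['seq_length'], an int.
    """
    out_path = os.path.join(params['out_dir'], 'seq_length.out')

    if force or not os.path.isfile(out_path):
        # Stand-in for running an external program and capturing its output.
        results = dict((seqid, len(protein['seq']))
                       for seqid, protein in proteins.items())
        with open(out_path, 'w') as handle:  # closed on leaving the block
            json.dump(results, handle)

    # Always parse the stored output, so cached and fresh runs share one path.
    with open(out_path) as handle:
        results = json.load(handle)

    for seqid, length in results.items():
        if seqid in proteins:
            proteins[seqid]['seq_length'] = length
    return proteins
```

Calling `annotate(params, proteins)` a second time reads the stored file instead of recomputing; passing `force=True` overwrites it.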