Merge pull request #56 from danielparton/master

MolProbity validation feature addition
choderalab · Sep 24, 2015 · commit 3f59bfb
2 parents 2e6a120 + c2d5696

Showing 23 changed files with 722 additions and 142 deletions.
diff --git a/docs/cli_docs.rst b/docs/cli_docs.rst
@@ -25,9 +25,14 @@ The ``ensembler`` tool is operated via a number of subcommands, which should be
   ensembler refine_explicit
   ensembler package_models
 
-Furthermore, the ``ensembler quickmodel`` subcommand allows the entire modeling
-pipeline to be run in one go for a single target and a small number of
-templates. Note that this command will not work with MPI.
+The optional ``ensembler validate`` subcommand uses the
+`MolProbity <http://molprobity.biochem.duke.edu/>`_ command-line tools to
+conduct model quality validation based on criteria such as Ramachandran angles,
+backbone distortion, and atom clashes.
+
+The ``ensembler quickmodel`` subcommand allows the entire modeling pipeline to
+be run in one go for a single target and a small number of templates. Note that
+this command will not work with MPI.
 
 To print helpstrings for each subcommand, pass the ``-h`` flag.
 
@@ -96,7 +101,7 @@ Additional Tools
 Ensembler includes a ``tools`` submodule, which allows the user to conduct
 various useful tasks which are not considered core pipeline functions. The
 use-cases for many of these tools are quite specific, so they may not be
-applicable to every project, and should also be used with caution.
+applicable to every project, and should be used with caution.
 
 Residue renumbering according to UniProt sequence coordinates
 -------------------------------------------------------------

diff --git a/docs/examples.rst b/docs/examples.rst
@@ -94,6 +94,12 @@ Determines the number of waters to add when solvating models with explicit water
 
 Solvates models using the number of waters determined in the previous step, then performs a short molecular dynamics simulation (default: 100 ps), using ``OpenMM``. The final structure is written to the compressed PDB file: ``explicit-refined.pdb.gz``, as well as serialized versions of the OpenMM System, State and Integrator objects.
 
+::
+
+  $ ensembler validate
+
+(Optional; requires `MolProbity <http://molprobity.biochem.duke.edu/>`_ command-line tools) Validates model quality using MolProbity, which uses criteria such as Ramachandran angles, backbone distortions, and atom clashes. The ``package_models`` command can filter models based on validation score, using the ``--model_validation_score_cutoff`` and ``--model_validation_score_percentile`` flags.
+
 ::
 
   $ ensembler package_models --package_for FAH --nfahclones 3
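
Note on the two filtering flags documented above: the help text says models are kept when their MolProbity score is "less than" a cutoff, or "less than the value at the given percentile". A minimal sketch of that selection logic follows. It is not Ensembler's implementation: the function name, the nearest-rank percentile definition, and the choice to apply the stricter threshold when both flags are given are all assumptions made for illustration.

```python
import math


def filter_models_by_score(scores, cutoff=None, percentile=None):
    """Return sorted IDs of models whose score is strictly below the threshold.

    scores: dict mapping model ID -> validation score (lower is better).
    cutoff: absolute score threshold, or None.
    percentile: 0-100 percentile of the score distribution, or None.
    """
    thresholds = []
    if cutoff is not None:
        thresholds.append(float(cutoff))
    if percentile is not None and scores:
        ranked = sorted(scores.values())
        # Nearest-rank percentile (assumed here; an interpolating
        # definition would give slightly different thresholds).
        rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
        thresholds.append(ranked[rank - 1])
    if not thresholds:
        # No filter requested: keep every model.
        return sorted(scores)
    # Assumption: when both filters are given, the stricter one wins.
    threshold = min(thresholds)
    return sorted(m for m, s in scores.items() if s < threshold)
```

With hypothetical scores `{'A': 1.2, 'B': 2.5, 'C': 3.1, 'D': 4.0}`, a cutoff of 3.0 keeps `A` and `B`, while a percentile of 50 keeps only `A` under the nearest-rank definition, because the comparison is strict.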

diff --git a/docs/installation.rst b/docs/installation.rst
@@ -34,7 +34,7 @@ Then, to install Ensembler with Conda, use the following commands ::
   $ conda config --add channels http://conda.anaconda.org/salilab
   $ conda install ensembler
 
-Conda will automatically install all dependencies except for the optional dependency `Rosetta <https://www.rosettacommons.org/software>`_. This requires a license (free for academic non-profit use), and will have to be installed according to the instructions for that package.
+Conda will automatically install all dependencies except for the optional dependencies `Rosetta <https://www.rosettacommons.org/software>`_ and `MolProbity <http://molprobity.biochem.duke.edu/>`_. These require licenses (free for academic non-profit use), and will have to be installed according to the instructions for those packages. Some limited installation instructions are included below, but these are not guaranteed to be up to date.
 
 Install from Source
 -------------------
@@ -113,6 +113,10 @@ Optional packages:
         Some functionality, including the ``quickmodel`` and ``inspect``
         functions, requires pandas.
 
+    `MolProbity <http://molprobity.biochem.duke.edu/>`_
+        For model validation. The ``package_models`` function can use this
+        data to filter models by validation score.
+
 Manually Installing the Dependencies
 ------------------------------------
 
@@ -159,3 +163,27 @@ databases such as UniProt, or are excluded from the unit tests due to being
 slow. To run them: ::
 
   $ nosetests ensembler -a non_conda_dependencies -a network -a slow
+
+Installation of Dependencies Unavailable Through Conda
+======================================================
+
+(Note: only limited instructions are included here, and these are not guaranteed to be up to date. If you encounter problems, please consult the relevant support or installation instructions for that software dependency.)
+
+MolProbity
+----------
+
+Download the `MolProbity 4.2 release source <https://github.com/rlabduke/MolProbity/archive/molprobity_4.2.zip>`_ from the GitHub repo.
+
+Extract the zip file, enter the created directory, and run the following command: ::
+
+  $ ./configure.sh
+
+This was all that was required when tested on a MacBook running OS X 10.8.
+
+On a Linux cluster, it was first necessary to edit ``configure.sh``, uncommenting the following line and commenting out the ``make`` command: ::
+
+  $ ./binlibtbx.scons -j 1
+
+This forces the build to use only a single core. This ran rather slowly, but using more cores resulted in build failure, likely due to memory issues. After running ``./configure.sh`` it was then also necessary to run ``./setup.sh``.
+
+Binaries can be found in the ``[MolProbity source dir]/cmdline`` directory.
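
Note on the install instructions above: since the build steps differ by platform, it may help to confirm that the ``cmdline`` directory actually contains executables after building. The sketch below lists the executable files in a given directory. The function name is hypothetical and no particular MolProbity tool names are assumed; pass whatever path your MolProbity source directory's ``cmdline`` folder has.

```python
import os


def list_executables(directory):
    """Return sorted names of regular files in `directory` that are executable."""
    names = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Keep only regular files the current user is allowed to execute.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            names.append(name)
    return sorted(names)
```

An empty result would suggest that the build did not complete, or that ``./setup.sh`` still needs to be run.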
diff --git a/ensembler/cli_commands/__init__.py b/ensembler/cli_commands/__init__.py
@@ -10,6 +10,7 @@
     'refine_implicit',
     'solvate',
     'refine_explicit',
+    'validate',
     'package_models',
     'quickmodel',
     'renumber_residues',
@@ -27,6 +28,7 @@
 from . import refine_implicit
 from . import solvate
 from . import refine_explicit
+from . import validate
 from . import package_models
 from . import quickmodel
 from . import renumber_residues
diff --git a/ensembler/cli_commands/build_models.py b/ensembler/cli_commands/build_models.py
@@ -16,7 +16,7 @@
                                     model for a protein kinase domain target""",
 
     """\
-  --template_seqid_cutoff <cutoff>  Select only templates with sequence identity (percentage)
+  --model_seqid_cutoff <cutoff>     Select only templates with sequence identity (percentage)
                                     greater than the given cutoff.""",
 ]
 
@@ -62,10 +62,10 @@ def dispatch(args):
     else:
         templates = False
 
-    if args['--template_seqid_cutoff']:
-        template_seqid_cutoff = float(args['--template_seqid_cutoff'])
+    if args['--model_seqid_cutoff']:
+        model_seqid_cutoff = float(args['--model_seqid_cutoff'])
     else:
-        template_seqid_cutoff = False
+        model_seqid_cutoff = False
 
     if args['--verbose']:
         loglevel = 'debug'
@@ -75,7 +75,7 @@ def dispatch(args):
     ensembler.modeling.build_models(
         process_only_these_targets=targets,
         process_only_these_templates=templates,
-        template_seqid_cutoff=template_seqid_cutoff,
+        model_seqid_cutoff=model_seqid_cutoff,
         write_modeller_restraints_file=args['--write_modeller_restraints_file'],
         loglevel=loglevel
     )
diff --git a/ensembler/cli_commands/cluster.py b/ensembler/cli_commands/cluster.py
@@ -52,4 +52,8 @@ def dispatch(args):
     else:
         loglevel = 'info'
 
-    ensembler.modeling.cluster_models(process_only_these_targets=targets, loglevel=loglevel, **dispatch_args)
+    ensembler.modeling.cluster_models(
+        process_only_these_targets=targets,
+        loglevel=loglevel,
+        **dispatch_args
+    )
diff --git a/ensembler/cli_commands/general.py b/ensembler/cli_commands/general.py
@@ -16,33 +16,37 @@
   ensembler align [-h | --help] [--targets <targets>] [--targetsfile <targetsfile>]
       [--templates <templates>] [--templatesfile <templatesfile>] [--substitution_matrix <matrix>]
       [-v | --verbose]
-  ensembler build_models [-h | --help] [--targets <target>] [--targetsfile <targetsfile>]
-      [--templates <template>] [--templatesfile <templatesfile>] [--template_seqid_cutoff <cutoff>]
+  ensembler build_models [-h | --help] [--targets <targets>] [--targetsfile <targetsfile>]
+      [--templates <template>] [--templatesfile <templatesfile>] [--model_seqid_cutoff <cutoff>]
       [--write_modeller_restraints_file] [-v | --verbose]
-  ensembler cluster [-h | --help] [--targets <target>] [--targetsfile <targetsfile>]
+  ensembler cluster [-h | --help] [--targets <targets>] [--targetsfile <targetsfile>]
       [--cutoff <cutoff>] [-v | --verbose]
-  ensembler refine_implicit [-h | --help] [--targets <target>] [--targetsfile <targetsfile>]
-      [--templates <template>] [--templatesfile <templatesfile>] [--template_seqid_cutoff <cutoff>]
+  ensembler refine_implicit [-h | --help] [--targets <targets>] [--targetsfile <targetsfile>]
+      [--templates <template>] [--templatesfile <templatesfile>] [--model_seqid_cutoff <cutoff>]
       [--gpupn <gpupn>] [--openmm_platform <platform>] [--simlength <simlength>]
       [--retry_failed_runs] [--ff <ffname>] [--water_model <modelname>] [--api_params <params>]
       [-v | --verbose]
-  ensembler solvate [-h | --help] [--targets <target>] [--targetsfile <targetsfile>]
-      [--templates <template>] [--templatesfile <templatesfile>] [--template_seqid_cutoff <cutoff>]
+  ensembler solvate [-h | --help] [--targets <targets>] [--targetsfile <targetsfile>]
+      [--templates <template>] [--templatesfile <templatesfile>] [--model_seqid_cutoff <cutoff>]
       [--padding <padding>] [--select_nwaters_at_percentile <value>] [--ff <ffname>]
       [--water_model <modelname>] [-v | --verbose]
-  ensembler refine_explicit [-h | --help] [--targets <target>] [--targetsfile <targetsfile>]
-      [--templates <template>] [--templatesfile <templatesfile>] [--template_seqid_cutoff <cutoff>]
+  ensembler refine_explicit [-h | --help] [--targets <targets>] [--targetsfile <targetsfile>]
+      [--templates <template>] [--templatesfile <templatesfile>] [--model_seqid_cutoff <cutoff>]
       [--gpupn <gpupn>] [--openmm_platform <platform>] [--simlength <simlength>]
       [--retry_failed_runs] [--write_solvated_model] [--ff <ffname>] [--water_model <modelname>]
       [--api_params <params>] [-v | --verbose]
-  ensembler package_models [-h | --help] [--package_for <choice>] [--targets <target>]
+  ensembler validate [-h | --help] [--targets <targets>] [--targetsfile <targetsfile>]
+      [--method <method>] [--modeling_stage <stage>] [-v | --verbose]
+  ensembler package_models [-h | --help] [--package_for <choice>] [--targets <targets>]
       [--targetsfile <targetsfile>] [--templates <template>] [--templatesfile <templatesfile>]
-      [--template_seqid_cutoff <cutoff>] [--nfahclones <n>] [--compressruns] [-v | --verbose]
+      [--model_seqid_cutoff <cutoff>] [--model_validation_score_cutoff <cutoff>]
+      [--model_validation_score_percentile <percentile>] [--nfahclones <n>] [--compressruns]
+      [-v | --verbose]
   ensembler testrun_pipeline [-h | --help]
   ensembler quickmodel [-h | --help] [--targetid <id>] [--templateids <ids>]
       [--target_uniprot_entry_name <entry_name>] [--uniprot_domain_regex <regex>]
       [--template_pdbids <pdbids>] [--template_chainids <chainids>]
-      [--template_uniprot_query <query>] [--template_seqid_cutoff <cutoff>] [--no-loopmodel]
+      [--template_uniprot_query <query>] [--model_seqid_cutoff <cutoff>] [--no-loopmodel]
       [--package_for_fah] [--nfahclones <nfahclones>] [--structure_dirs <structure_dirs>]
   ensembler renumber_residues [-h | --help] [--target <targetid>] [-v | --verbose]
 

diff --git a/ensembler/cli_commands/package_models.py b/ensembler/cli_commands/package_models.py
@@ -12,43 +12,51 @@
 
 helpstring_unique_options = [
     """\
-  --package_for <choice>                Specify which packaging method to use (required).
-                                        - transfer: compress results into a single .tgz file
-                                        - FAH: set up the input files and directory structure
-                                          necessary to start a Folding@Home project.""",
+  --package_for <choice>                                Specify which packaging method to use (required).
+                                                        - transfer: compress results into a single .tgz file
+                                                        - FAH: set up the input files and directory structure
+                                                          necessary to start a Folding@Home project.""",
 
     """\
-  --nfahclones <n>                      If packaging for Folding@Home, select the number of clones
-                                        to use for each model [default: 1].""",
+  --nfahclones <n>                                      If packaging for Folding@Home, select the number of clones
+                                                        to use for each model [default: 1].""",
 
     """\
-  --compressruns                        If packaging for Folding@Home, choose whether to compress
-                                        each RUN into a .tgz file.""",
+  --compressruns                                        If packaging for Folding@Home, choose whether to compress
+                                                        each RUN into a .tgz file. [default: False]""",
+
+    """\
+  --model_validation_score_cutoff <cutoff>              Select only models with MolProbity validation score
+                                                        less than the given cutoff.""",
+
+    """\
+  --model_validation_score_percentile <percentile>      Select only models with MolProbity validation score
+                                                        less than the value at the given percentile.""",
 ]
 
 helpstring_nonunique_options = [
     """\
-  --targetsfile <targetsfile>  File containing a list of target IDs to work on (newline-separated).
-                               Comment targets out with "#".""",
+  --targetsfile <targetsfile>                           File containing a list of target IDs to work on (newline-separated).
+                                                        Comment targets out with "#".""",
 
     """\
-  --targets <target>           Define one or more target IDs to work on (comma-separated), e.g.
-                               "--targets ABL1_HUMAN_D0,SRC_HUMAN_D0" (default: all targets)""",
+  --targets <target>                                    Define one or more target IDs to work on (comma-separated), e.g.
+                                                        "--targets ABL1_HUMAN_D0,SRC_HUMAN_D0" (default: all targets)""",
 
     """\
-  --templates <template>            Define one or more template IDs to work on (comma-separated), e.g.
-                                    "--templates ABL1_HUMAN_D0_1OPL_A" (default: all templates)""",
+  --templates <template>                                Define one or more template IDs to work on (comma-separated), e.g.
+                                                        "--templates ABL1_HUMAN_D0_1OPL_A" (default: all templates)""",
 
     """\
-  --templatesfile <templatesfile>   File containing a list of template IDs to work on (newline-separated).
-                                    Comment targets out with "#".""",
+  --templatesfile <templatesfile>                       File containing a list of template IDs to work on (newline-separated).
+                                                        Comment templates out with "#".""",
 
     """\
-  --template_seqid_cutoff <cutoff>  Select only templates with sequence identity (percentage)
-                                    greater than the given cutoff.""",
+  --model_seqid_cutoff <cutoff>                         Select only models with sequence identity (percentage)
+                                                        greater than the given cutoff.""",
 
     """\
-  -v --verbose                 """,
+  -v --verbose                                          """,
 ]
 
 helpstring = '\n\n'.join([helpstring_header, '\n\n'.join(helpstring_unique_options), '\n\n'.join(helpstring_nonunique_options)])
@@ -77,10 +85,20 @@ def dispatch(args):
     else:
         templates = False
 
-    if args['--template_seqid_cutoff']:
-        template_seqid_cutoff = float(args['--template_seqid_cutoff'])
+    if args['--model_seqid_cutoff']:
+        model_seqid_cutoff = float(args['--model_seqid_cutoff'])
     else:
-        template_seqid_cutoff = False
+        model_seqid_cutoff = False
+
+    if args['--model_validation_score_cutoff']:
+        model_validation_score_cutoff = float(args['--model_validation_score_cutoff'])
+    else:
+        model_validation_score_cutoff = None
+
+    if args['--model_validation_score_percentile']:
+        model_validation_score_percentile = int(args['--model_validation_score_percentile'])
+    else:
+        model_validation_score_percentile = None
 
     if args['--nfahclones']:
         n_fah_clones = int(args['--nfahclones'])
@@ -107,8 +125,10 @@ def dispatch(args):
         ensembler.packaging.package_for_fah(
             process_only_these_targets=targets,
             process_only_these_templates=templates,
-            template_seqid_cutoff=template_seqid_cutoff,
+            model_seqid_cutoff=model_seqid_cutoff,
+            model_validation_score_cutoff=model_validation_score_cutoff,
+            model_validation_score_percentile=model_validation_score_percentile,
             nclones=n_fah_clones,
             archive=archive,
             loglevel=loglevel,
-        )
+        )
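
Note on the ``dispatch`` changes above: the same "cast if present, else default" pattern is now repeated for three options, with ``False`` as the default for the sequence identity cutoff and ``None`` for the two validation options. One way to express it once is sketched below. The helper name ``parse_optional`` is hypothetical and is not part of Ensembler; the ``args`` dict stands in for the docopt-style mapping that ``dispatch`` receives.

```python
def parse_optional(args, flag, cast, default=None):
    """Return cast(args[flag]) if the flag was given, otherwise `default`."""
    value = args.get(flag)
    # Mirrors the truthiness check used in dispatch(): a missing or
    # None/False/empty value falls back to the default.
    if value:
        return cast(value)
    return default


# Example with a hypothetical parsed-arguments dict.
args = {
    '--model_seqid_cutoff': '80',
    '--model_validation_score_cutoff': None,
    '--model_validation_score_percentile': '25',
}
model_seqid_cutoff = parse_optional(args, '--model_seqid_cutoff', float, default=False)
score_cutoff = parse_optional(args, '--model_validation_score_cutoff', float)
score_percentile = parse_optional(args, '--model_validation_score_percentile', int)
```

Keeping the default as a parameter preserves the existing ``False`` versus ``None`` distinction between options rather than silently unifying it.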
diff --git a/ensembler/cli_commands/quickmodel.py b/ensembler/cli_commands/quickmodel.py
@@ -60,7 +60,7 @@
                                            /Users/partond/tmp/kinome-MSMSeeder/structures/sifts\"""",
 
     """\
-  --template_seqid_cutoff <cutoff>         e.g. "80\"""",
+  --model_seqid_cutoff <cutoff>         e.g. "80\"""",
 ]
 
 helpstring = '\n\n'.join([helpstring_header, '\n\n'.join(helpstring_unique_options), '\n\n'.join(helpstring_nonunique_options)])
@@ -88,10 +88,10 @@ def dispatch(args):
     else:
         chainids_dict = None
 
-    if args['--template_seqid_cutoff']:
-        template_seqid_cutoff = float(args['--template_seqid_cutoff'])
+    if args['--model_seqid_cutoff']:
+        model_seqid_cutoff = float(args['--model_seqid_cutoff'])
     else:
-        template_seqid_cutoff = None
+        model_seqid_cutoff = None
 
     if args['--nfahclones']:
         nfahclones = int(args['--nfahclones'])
@@ -103,4 +103,4 @@ def dispatch(args):
     else:
         structure_paths = None
 
-    QuickModel(targetid=args['--targetid'], templateids=templateids, target_uniprot_entry_name=args['--target_uniprot_entry_name'], uniprot_domain_regex=args['--uniprot_domain_regex'], pdbids=pdbids, chainids=chainids_dict, template_uniprot_query=args['--template_uniprot_query'], template_seqid_cutoff=template_seqid_cutoff, loopmodel=not args['--no-loopmodel'], package_for_fah=args['--package_for_fah'], nfahclones=nfahclones, structure_dirs=structure_paths)
+    QuickModel(targetid=args['--targetid'], templateids=templateids, target_uniprot_entry_name=args['--target_uniprot_entry_name'], uniprot_domain_regex=args['--uniprot_domain_regex'], pdbids=pdbids, chainids=chainids_dict, template_uniprot_query=args['--template_uniprot_query'], model_seqid_cutoff=model_seqid_cutoff, loopmodel=not args['--no-loopmodel'], package_for_fah=args['--package_for_fah'], nfahclones=nfahclones, structure_dirs=structure_paths)
diff --git a/ensembler/cli_commands/refine_explicit.py b/ensembler/cli_commands/refine_explicit.py
@@ -61,7 +61,7 @@
                                     See OpenMM documentation for other water model options""",
 
     """\
-  --template_seqid_cutoff <cutoff>  Select only templates with sequence identity (percentage)
+  --model_seqid_cutoff <cutoff>     Select only templates with sequence identity (percentage)
                                     greater than the given cutoff.""",
 
     """\
@@ -93,10 +93,10 @@ def dispatch(args):
     else:
         templates = False
 
-    if args['--template_seqid_cutoff']:
-        template_seqid_cutoff = float(args['--template_seqid_cutoff'])
+    if args['--model_seqid_cutoff']:
+        model_seqid_cutoff = float(args['--model_seqid_cutoff'])
     else:
-        template_seqid_cutoff = False
+        model_seqid_cutoff = False
 
     if args['--gpupn']:
         gpupn = int(args['--gpupn'])
@@ -121,7 +121,7 @@ def dispatch(args):
         sim_length=sim_length,
         process_only_these_targets=targets,
         process_only_these_templates=templates,
-        template_seqid_cutoff=template_seqid_cutoff,
+        model_seqid_cutoff=model_seqid_cutoff,
         retry_failed_runs=args['--retry_failed_runs'],
         write_solvated_model=args['--write_solvated_model'],
         ff=args['--ff'],