update documentation and fix blacklisting bug
hexylena committed Aug 4, 2017
1 parent 16f17ef commit 70f4bc9
Showing 2 changed files with 119 additions and 29 deletions.
124 changes: 99 additions & 25 deletions doc/source/admin/special_topics/grt.rst
@@ -11,31 +11,105 @@ Registration
------------

You will need to register your Galaxy instance with the Galactic Radio
Telescope (GRT). This can be done at `https://telescope.galaxyproject.org
<https://telescope.galaxyproject.org>`__.

About the Script
----------------

Once you've registered your Galaxy instance, you'll receive an instance ID and
an API key which are used to run ``scripts/grt.py``. The tool itself is very
simple to run: it produces a directory of reports that can be synced with the
GRT server. Each run processes only the jobs recorded since the previous run.
On the first run, GRT will attempt to export all job data for your instance,
which may be very slow depending on your instance size. We have attempted to
optimize this as much as is feasible.
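The incremental behaviour described above can be sketched as a simple
checkpoint on the highest exported job id. This is an illustrative sketch, not
the script's actual bookkeeping; the file name and functions are hypothetical.

```python
import json
import os

# Hypothetical checkpoint file; the real script's bookkeeping may differ.
CHECKPOINT = '.grt_checkpoint.json'


def last_exported_job_id():
    """Return the highest job id exported so far, or 0 on the first run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as handle:
            return json.load(handle)['last_job_id']
    return 0


def save_checkpoint(job_id):
    """Record the highest job id after a successful export."""
    with open(CHECKPOINT, 'w') as handle:
        json.dump({'last_job_id': job_id}, handle)


def jobs_to_export(all_job_ids):
    """Only jobs newer than the checkpoint are processed on this run."""
    start = last_exported_job_id()
    return [job_id for job_id in all_job_ids if job_id > start]
```

On the first run there is no checkpoint, so everything is exported; subsequent
runs only pick up new jobs, which is why only the initial export is slow.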

Data Privacy
------------

All data submitted to the GRT will be released into the public domain. If there
are certain tools you do not want included, or certain parameters you wish to
hide (e.g. because they contain API keys), then you can take advantage of the
built-in sanitization. The ``scripts/grt.yml.sample`` file allows you to
configure sanitization of the job logs.

.. code-block:: yaml

    sanitization:
      # Blacklist the entire tool from appearing
      tools:
        - __SET_METADATA__
        - upload1
      # Or you can blacklist individual parameters from being submitted, e.g. if
      # you have API keys as a tool parameter.
      tool_params:
        # To blacklist under a specific tool, just specify its ID
        some_tool_id:
          - dbkey
          # If you need to specify a parameter multiple levels deep, you can
          # do that as well. Currently we only support blacklisting via the
          # full path, rather than just a path component. So everything under
          # `path.to.parameter` will be blacklisted.
          - path.to.parameter
          # However you could not do "parameter" and have everything under
          # `path.to.parameter` be removed.
          # Repeats are rendered as an *, e.g.: repeat_name.*.values

To blacklist specific tools from appearing in the results, just add each
tool ID under the ``tools`` list.

Blacklisting tool parameters is more complex. Under the ``tool_params`` key,
supply per-tool lists of the parameters you wish to blacklist. *NB: this will
slow down processing of records associated with that tool.* Selecting keys is
done identically to writing test cases, except that for a repeat element you
replace the numeric identifier in the path with ``*``, e.g.
``repeat_name.*.some_subkey``.
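The ``*`` convention for repeats can be matched with a small helper. This is a
hypothetical sketch of the matching rule, not the script's implementation: each
``*`` stands in for exactly one path component (the repeat index), and the
whole path must match, consistent with the full-path restriction noted above.

```python
import re


def path_is_blacklisted(path, blacklist):
    """Check a dotted parameter path against blacklist entries.

    A '*' in an entry matches a single path component (e.g. a numeric
    repeat index); everything else must match literally, end to end.
    """
    for entry in blacklist:
        # Escape literal dots, then turn each '*' into a one-component wildcard.
        pattern = '^' + re.escape(entry).replace('\\*', '[^.]+') + '$'
        if re.match(pattern, path):
            return True
    return False
```

So ``repeat_name.*.some_subkey`` matches ``repeat_name.0.some_subkey`` and
``repeat_name.7.some_subkey``, but a bare ``some_subkey`` entry would not match
anything nested under a repeat.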

Data Collection Process
-----------------------

.. code-block:: console

    cd $GALAXY; python scripts/grt.py -l debug

``grt.py`` connects to your Galaxy database and queries three primary tables:

- job
- job_parameter
- job_metric_numeric

These are exported, with very little processing, as tabular files to the GRT
reports directory, ``$GALAXY/reports/``. (This script could really just be a
set of SQL queries, but it has been written in Python to be database agnostic.)
Once the files have been exported, they are put in a compressed archive, and
some metadata about the export process is written to a JSON file with the same
name as the report archive.
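The export step can be pictured as follows. This is a minimal sketch under
stated assumptions: the file names, the ``gzip`` compression choice, and the
metadata fields are illustrative, not GRT's exact output format.

```python
import gzip
import json
import shutil


def archive_report(tsv_path, report_base):
    """Compress an exported table and write a metadata file alongside it.

    Illustrative only: the real export covers several tables and may use a
    different archive layout and metadata schema.
    """
    archive_path = report_base + '.tsv.gz'
    # Stream the exported tabular file into a compressed archive.
    with open(tsv_path, 'rb') as src, gzip.open(archive_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    # Record what the archive contains, next to it, under the same base name.
    metadata = {
        'archive': archive_path,
        'tables': ['job', 'job_parameter', 'job_metric_numeric'],
    }
    with open(report_base + '.json', 'w') as handle:
        json.dump(metadata, handle)
    return archive_path
```

Because both the archive and the metadata file sit in ``$GALAXY/reports/``,
you can decompress and inspect them before anything leaves your server.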

You may wish to inspect these files to be sure that you're comfortable with the
information being sent.

Once you're happy with the data, you can submit it with the GRT submission tool.

Data Submission
---------------

.. code-block:: console

    cd $GALAXY; python scripts/grt-submit.py

``scripts/grt-submit.py`` submits your data to the configured GRT server. You
must first register with the server, which will also walk you through the
setup process.

Submitting your reports is very simple: the script will log in to the server,
determine which reports the server does not yet have, and begin uploading
those.
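The "which reports are missing" step amounts to a set difference between the
local report directory and the server's inventory. A hypothetical helper
(names are illustrative; the real script queries the GRT API for this list):

```python
def reports_to_upload(local_reports, server_reports):
    """Return the local reports the server does not yet have, oldest first.

    Hypothetical helper: in practice the server-side list would come from
    an API call made after logging in.
    """
    missing = set(local_reports) - set(server_reports)
    # Sorting keeps uploads in chronological order for date-named reports.
    return sorted(missing)
```

This also makes the script safe to re-run: reports the server already holds
are simply skipped.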

For administrators with firewalled Galaxy instances and no internet access: if
you are able to exfiltrate your report files to somewhere with internet access,
then you can still take advantage of GRT. Alternatively, you can deploy GRT on
your own infrastructure if you don't want to share your job logs.
24 changes: 20 additions & 4 deletions scripts/grt.py
@@ -82,20 +82,29 @@ def __init__(self, sanitization_config, model, sa_session):
        self.sanitization_config['tool_params'] = {}

    def blacklisted_tree(self, path):
        if self.tool_id in self.sanitization_config['tool_params']:
            if path.lstrip('.') in self.sanitization_config['tool_params'][self.tool_id]:
                return True
        return False
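The blacklisting fix in this commit is the outer membership check in
``blacklisted_tree``. It can be exercised in isolation; here is a standalone
sketch of the fixed logic with the class dependency removed (the function
signature is adapted for illustration):

```python
def blacklisted_tree(tool_id, path, tool_params):
    """Fixed lookup: the outer membership check is the bug fix.

    Without it, looking up a tool that has no entry in the sanitization
    config at all would raise a KeyError instead of returning False.
    """
    if tool_id in tool_params:
        if path.lstrip('.') in tool_params[tool_id]:
            return True
    return False
```

A tool absent from the config is now treated as "nothing blacklisted" rather
than crashing the export.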

    def sanitize_data(self, tool_id, key, value):
        # If the tool is blacklisted, skip it.
        if tool_id in self.sanitization_config['tools']:
            return 'null'
        # Thus, all tools below here are not blacklisted at the top level.

        # If it isn't in tool_params, there are no keys being sanitized for
        # this tool so we can return quickly without parsing.
        if tool_id not in self.sanitization_config['tool_params']:
            return value

        # If the key is listed precisely (not a sub-tree), we can also return
        # slightly more quickly.
        if key in self.sanitization_config['tool_params'][tool_id]:
            return 'null'

        # If the key isn't a prefix for any of the keys being sanitized, then
        # this value is safe to return unparsed.
        if not any(san_key.startswith(key) for san_key in self.sanitization_config['tool_params'][tool_id]):
            return value

        # Slow path: parse the value and walk the tree.
        unsanitized = {key: json.loads(value)}
        self.tool_id = tool_id
@@ -237,6 +246,7 @@ def annotate(label, human_label=None):

# Unfortunately we have to keep this mapping for the sanitizer to work properly.
job_tool_map = {}
blacklisted_tools = config['sanitization']['tools']

annotate('export_jobs_start', 'Exporting Jobs')
handle_job = open(REPORT_BASE + '.jobs.tsv', 'w')
@@ -276,6 +286,9 @@ def annotate(label, human_label=None):
            .filter(model.JobMetricNumeric.job_id > offset_start) \
            .filter(model.JobMetricNumeric.job_id <= min(end_job_id, offset_start + args.batch_size)) \
            .all():
        # If the tool is blacklisted, exclude it everywhere
        if job_tool_map[metric[0]] in blacklisted_tools:
            continue

        handle_metric_num.write(str(metric[0]))
        handle_metric_num.write('\t')
@@ -297,6 +310,9 @@ def annotate(label, human_label=None):
            .filter(model.JobParameter.job_id > offset_start) \
            .filter(model.JobParameter.job_id <= min(end_job_id, offset_start + args.batch_size)) \
            .all():
        # If the tool is blacklisted, exclude it everywhere
        if job_tool_map[param[0]] in blacklisted_tools:
            continue

        sanitized = san.sanitize_data(job_tool_map[param[0]], param[1], param[2])