update documentation and fix blacklisting bug
hexylena committed Aug 4, 2017
1 parent 16f17ef commit 70f4bc9
Showing 2 changed files with 119 additions and 29 deletions.
124 changes: 99 additions & 25 deletions doc/source/admin/special_topics/grt.rst
@@ -11,31 +11,105 @@ Registration
------------

You will need to register your Galaxy instance with the Galactic Radio
Telescope (GRT). This can be done at `https://telescope.galaxyproject.org
<https://telescope.galaxyproject.org>`__.

About the Script
----------------

Once you've registered your Galaxy instance, you'll receive an instance ID and
an API key which are used to run ``scripts/grt.py``. The tool itself is very
simple to run: it produces a directory of reports that can be synced with the
GRT server. Each run processes only the jobs recorded since the previous run.
On the first run, GRT will attempt to export all job data for your instance,
which may be very slow depending on your instance size. We have attempted to
optimize this as much as is feasible.
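The incremental behaviour described above can be sketched as a simple
checkpoint on the highest exported job id. This is an illustrative sketch, not
the script's actual bookkeeping; the file name and functions are hypothetical.

```python
import json
import os

# Hypothetical checkpoint file; the real script's bookkeeping may differ.
CHECKPOINT = '.grt_checkpoint.json'


def last_exported_job_id():
    """Return the highest job id exported so far, or 0 on the first run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as handle:
            return json.load(handle)['last_job_id']
    return 0


def save_checkpoint(job_id):
    """Record the highest job id after a successful export."""
    with open(CHECKPOINT, 'w') as handle:
        json.dump({'last_job_id': job_id}, handle)


def jobs_to_export(all_job_ids):
    """Only jobs newer than the checkpoint are processed on this run."""
    start = last_exported_job_id()
    return [job_id for job_id in all_job_ids if job_id > start]
```

On the first run there is no checkpoint, so everything is exported; subsequent
runs only pick up new jobs, which is why only the initial export is slow.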

Data Privacy
------------

All data submitted to the GRT will be released into the public domain. If there
are certain tools you do not want included, or certain parameters you wish to
hide (e.g. because they contain API keys), then you can take advantage of the
built-in sanitization. The ``scripts/grt.yml.sample`` file allows you to
configure sanitization of the job logs.

.. code-block:: yaml

    sanitization:
      # Blacklist the entire tool from appearing
      tools:
        - __SET_METADATA__
        - upload1
      # Or you can blacklist individual parameters from being submitted, e.g. if
      # you have API keys as a tool parameter.
      tool_params:
        # To blacklist under a specific tool, just specify its ID
        some_tool_id:
          - dbkey
          # If you need to specify a parameter multiple levels deep, you can
          # do that as well. Currently we only support blacklisting via the
          # full path, rather than just a path component. So everything under
          # `path.to.parameter` will be blacklisted.
          - path.to.parameter
          # However you could not do "parameter" and have everything under
          # `path.to.parameter` be removed.
          # Repeats are rendered as an *, e.g.: repeat_name.*.values

To blacklist specific tools from appearing in the results, just add each
tool ID under the ``tools`` list.

Blacklisting tool parameters is more complex. Under the ``tool_params`` key,
supply per-tool lists of the parameters you wish to blacklist. *NB: this will
slow down processing of records associated with that tool.* Selecting keys is
done identically to writing test cases, except that for a repeat element you
replace the numeric identifier in the path with ``*``, e.g.
``repeat_name.*.some_subkey``.
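The ``*`` convention for repeats can be matched with a small helper. This is a
hypothetical sketch of the matching rule, not the script's implementation: each
``*`` stands in for exactly one path component (the repeat index), and the
whole path must match, consistent with the full-path restriction noted above.

```python
import re


def path_is_blacklisted(path, blacklist):
    """Check a dotted parameter path against blacklist entries.

    A '*' in an entry matches a single path component (e.g. a numeric
    repeat index); everything else must match literally, end to end.
    """
    for entry in blacklist:
        # Escape literal dots, then turn each '*' into a one-component wildcard.
        pattern = '^' + re.escape(entry).replace('\\*', '[^.]+') + '$'
        if re.match(pattern, path):
            return True
    return False
```

So ``repeat_name.*.some_subkey`` matches ``repeat_name.0.some_subkey`` and
``repeat_name.7.some_subkey``, but a bare ``some_subkey`` entry would not match
anything nested under a repeat.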

Data Collection Process
-----------------------

.. code-block:: console

    cd $GALAXY; python scripts/grt.py -l debug

``grt.py`` connects to your Galaxy database and queries three primary tables:

- job
- job_parameter
- job_metric_numeric

These are exported, with very little processing, as tabular files to the GRT
reports directory, ``$GALAXY/reports/``. (This script could really just be a
set of SQL queries, but it has been written in Python to be database agnostic.)
Once the files have been exported, they are put in a compressed archive, and
some metadata about the export process is written to a JSON file with the same
name as the report archive.
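The export step can be pictured as follows. This is a minimal sketch under
stated assumptions: the file names, the ``gzip`` compression choice, and the
metadata fields are illustrative, not GRT's exact output format.

```python
import gzip
import json
import shutil


def archive_report(tsv_path, report_base):
    """Compress an exported table and write a metadata file alongside it.

    Illustrative only: the real export covers several tables and may use a
    different archive layout and metadata schema.
    """
    archive_path = report_base + '.tsv.gz'
    # Stream the exported tabular file into a compressed archive.
    with open(tsv_path, 'rb') as src, gzip.open(archive_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    # Record what the archive contains, next to it, under the same base name.
    metadata = {
        'archive': archive_path,
        'tables': ['job', 'job_parameter', 'job_metric_numeric'],
    }
    with open(report_base + '.json', 'w') as handle:
        json.dump(metadata, handle)
    return archive_path
```

Because both the archive and the metadata file sit in ``$GALAXY/reports/``,
you can decompress and inspect them before anything leaves your server.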

You may wish to inspect these files to be sure that you're comfortable with the
information being sent.

Once you're happy with the data, you can submit it with the GRT submission tool.

Data Submission
---------------

.. code-block:: console

    cd $GALAXY; python scripts/grt-submit.py

``scripts/grt-submit.py`` submits your data to the configured GRT server. You
must first register with the server, which will also walk you through the
setup process.

Submitting your reports is very simple: the script will log in to the server,
determine which reports the server does not yet have, and begin uploading
those.
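The "which reports are missing" step amounts to a set difference between the
local report directory and the server's inventory. A hypothetical helper
(names are illustrative; the real script queries the GRT API for this list):

```python
def reports_to_upload(local_reports, server_reports):
    """Return the local reports the server does not yet have, oldest first.

    Hypothetical helper: in practice the server-side list would come from
    an API call made after logging in.
    """
    missing = set(local_reports) - set(server_reports)
    # Sorting keeps uploads in chronological order for date-named reports.
    return sorted(missing)
```

This also makes the script safe to re-run: reports the server already holds
are simply skipped.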

For administrators with firewalled Galaxy instances and no internet access: if
you are able to exfiltrate your report files to somewhere with internet access,
then you can still take advantage of GRT. Alternatively, you can deploy GRT on
your own infrastructure if you don't want to share your job logs.
24 changes: 20 additions & 4 deletions scripts/grt.py
@@ -82,20 +82,29 @@ def __init__(self, sanitization_config, model, sa_session):
        self.sanitization_config['tool_params'] = {}

    def blacklisted_tree(self, path):
        if self.tool_id in self.sanitization_config['tool_params']:
            if path.lstrip('.') in self.sanitization_config['tool_params'][self.tool_id]:
                return True
        return False
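The blacklisting fix in this commit is the outer membership check in
``blacklisted_tree``. It can be exercised in isolation; here is a standalone
sketch of the fixed logic with the class dependency removed (the function
signature is adapted for illustration):

```python
def blacklisted_tree(tool_id, path, tool_params):
    """Fixed lookup: the outer membership check is the bug fix.

    Without it, looking up a tool that has no entry in the sanitization
    config at all would raise a KeyError instead of returning False.
    """
    if tool_id in tool_params:
        if path.lstrip('.') in tool_params[tool_id]:
            return True
    return False
```

A tool absent from the config is now treated as "nothing blacklisted" rather
than crashing the export.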

    def sanitize_data(self, tool_id, key, value):
        # If the tool is blacklisted, skip it.
        if tool_id in self.sanitization_config['tools']:
            return 'null'
        # Thus, all tools below here are not blacklisted at the top level.

        # If it isn't in tool_params, there are no keys being sanitized for
        # this tool so we can return quickly without parsing.
        if tool_id not in self.sanitization_config['tool_params']:
            return value

        # If the key is listed precisely (not a sub-tree), we can also return
        # slightly more quickly.
        if key in self.sanitization_config['tool_params'][tool_id]:
            return 'null'

        # If the key isn't a prefix for any of the keys being sanitized, then
        # this value is safe to return unparsed.
        if not any(san_key.startswith(key) for san_key in self.sanitization_config['tool_params'][tool_id]):
            return value

        # Slow path: parse the value and walk the tree.
        unsanitized = {key: json.loads(value)}
        self.tool_id = tool_id
@@ -237,6 +246,7 @@ def annotate(label, human_label=None):

# Unfortunately we have to keep this mapping for the sanitizer to work properly.
job_tool_map = {}
blacklisted_tools = config['sanitization']['tools']

annotate('export_jobs_start', 'Exporting Jobs')
handle_job = open(REPORT_BASE + '.jobs.tsv', 'w')
@@ -276,6 +286,9 @@ def annotate(label, human_label=None):
            .filter(model.JobMetricNumeric.job_id > offset_start) \
            .filter(model.JobMetricNumeric.job_id <= min(end_job_id, offset_start + args.batch_size)) \
            .all():
        # If the tool is blacklisted, exclude it everywhere
        if job_tool_map[metric[0]] in blacklisted_tools:
            continue

        handle_metric_num.write(str(metric[0]))
        handle_metric_num.write('\t')
@@ -297,6 +310,9 @@ def annotate(label, human_label=None):
            .filter(model.JobParameter.job_id > offset_start) \
            .filter(model.JobParameter.job_id <= min(end_job_id, offset_start + args.batch_size)) \
            .all():
        # If the tool is blacklisted, exclude it everywhere
        if job_tool_map[param[0]] in blacklisted_tools:
            continue

        sanitized = san.sanitize_data(job_tool_map[param[0]], param[1], param[2])