Add some docs
seanh committed Oct 23, 2014
1 parent 21432f9 commit 0401317
Showing 2 changed files with 269 additions and 27 deletions.
259 changes: 256 additions & 3 deletions README.markdown
Losser
======

A little UNIX command and Python library for lossy filter, transform, and
export of JSON to Excel-compatible CSV.

Losser can be run either as a UNIX command or used as a Python library
(see [Usage](#usage) below). It takes a JSON-formatted list of objects
(or a list of Python dicts) as input and produces a "table" as output.

The input objects do not all have to have the same keys as each other, and may
contain sub-lists and sub-objects arbitrarily nested.

The output "table" is a list of objects that all have the same keys in the same
order, and with sub-objects and sub-lists nested no more than one level deep.
It can be output as:

* A list of Python OrderedDicts each having the same keys in the same order
* A string of JSON-formatted text representing a list of objects each having
the same keys in the same order
([TODO](https://github.com/ckan/losser/issues/3))
* A string of CSV-formatted text, one object per CSV row. The rows of the CSV
correspond to the objects in the list of objects, and the columns correspond
to the object's keys.

The input objects can be filtered and transformed before producing the output
table. You provide a list of "column query" objects in a `columns.json` file
that specifies what columns the output table should have, and how the values
for those columns should be retrieved from the input objects.

For example, if you had some input objects that looked like this:

    [
        {
            "author": "Sean Hammond",
            "title": "An Example Input Object",
            "extras": {
                "Delivery Unit": "Commissioning"
            }
        },
        ...
    ]

You might transform them using a `columns.json` file like this:

    {
        "Data Owner": {
            "pattern_path": "^author$"
        },
        "Title": {
            "pattern_path": "^title$"
        },
        "Delivery Unit": {
            "pattern_path": ["^extras$", "^Delivery Unit$"]
        }
    }

This would output a CSV file like this:

    Data Owner,Title,Delivery Unit
    Sean Hammond,An Example Input Object,Commissioning
    Frank Black,Another Example Object,Some Other Unit
    ...

The `columns.json` file above specifies three column headings for the output
table:

1. Data Owner
2. Title
3. Delivery Unit

The values for each column are retrieved from the input objects by following a
"pattern path": a list of regular expressions that are matched against the keys
of the input object and its sub-objects in turn to find a value.

For example, the "Data Owner" column above has the pattern path `"^author$"`,
which matches the string "author". This will find top-level keys named
"author" in the input objects and output their values in the "Data Owner"
column of the output table.
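In plain Python the single-pattern case can be sketched like this (a
simplified illustration with a made-up function name, not losser's actual
implementation):

```python
import re

def match_column(pattern, obj):
    """Return the values of all top-level keys matching the regex pattern."""
    # Losser matches patterns case-insensitively by default.
    regex = re.compile(pattern, re.IGNORECASE)
    return [value for key, value in obj.items() if regex.search(key)]

obj = {"author": "Sean Hammond", "title": "An Example Input Object"}
print(match_column("^author$", obj))  # ['Sean Hammond']
```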

The "Delivery Unit" column above has a more complex pattern path:
`["^extras$", "^Delivery Unit$"]`. This will find the top-level key "extras" in
an input object and, assuming the value for the "extras" key is a sub-object,
will find and return the value for the "Delivery Unit" key in the sub-object.

Pattern paths can be arbitrarily long, recursing into arbitrarily deeply nested
sub-objects.

Losser processes an input object as follows:

1. Any pre-processor functions are applied to the input object.
2. Each column query in the `columns.json` file is applied to the input object
   in turn to produce the values for the corresponding row in the output table.
   For each column query:
   1. When it hits an object/dictionary (either the top-level input object
      itself or a sub-object) losser pops the next regular expression off the
      pattern path, matches it against each of the object's keys, and recurses
      on each key that matches.
   2. When it hits a list losser iterates over the items in the list, recursing
      on each of them and collecting the results into a list.
   3. When it hits a string, number, boolean or `None`/`null` value the
      recursion bottoms out and returns the value.
3. Once all of the column queries have been run on the input object and the
   results collected, any post-processor functions are called on the output
   object.
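The recursion described above can be sketched roughly like this (a simplified
stand-in for losser's real query code; options like `unique`, `deduplicate`
and string transformations are omitted):

```python
import re

def query(pattern_path, node):
    """Follow a list of regexes down through nested dicts/lists (sketch)."""
    if isinstance(node, dict):
        if not pattern_path:
            return None  # ran out of patterns before reaching a leaf value
        pattern, rest = pattern_path[0], pattern_path[1:]
        regex = re.compile(pattern, re.IGNORECASE)
        # Recurse on the value of every key that matches the pattern.
        matches = [query(rest, v) for k, v in node.items() if regex.search(k)]
        if not matches:
            return None
        return matches[0] if len(matches) == 1 else matches
    elif isinstance(node, list):
        # Recurse on each item, collecting the results into a list.
        return [query(pattern_path, item) for item in node]
    else:
        # String, number, boolean or None: the recursion bottoms out.
        return node

obj = {"extras": {"Delivery Unit": "Commissioning"}}
print(query(["^extras$", "^Delivery Unit$"], obj))  # Commissioning
```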

A pattern may match more than one key in an object, in which case each of the
matching keys' values will be recursed into, and a list of matching values
will eventually be returned instead of a single value. For example, given this
input object:

    {
        "update": "yearly",
        "update frequency": "monthly"
    }

The pattern path `"^update.*"` (which matches both "update" and "update
frequency") would output `"yearly, monthly"` (a quoted, comma-separated list)
in the corresponding cell in the output table.
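Assuming the multiple matches are joined with `", "` and then quoted by the
CSV writer (a sketch of the behaviour, not losser's exact code):

```python
import csv
import io
import re

obj = {"update": "yearly", "update frequency": "monthly"}
regex = re.compile("^update.*", re.IGNORECASE)
values = [v for k, v in obj.items() if regex.search(k)]  # both keys match

# Join the list into one cell; the CSV writer quotes the cell because it
# contains a comma.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow([", ".join(values)])
print(out.getvalue().strip())  # "yearly, monthly"
```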

This also happens when a pattern path matches a single field in the input
object and the field's value is a list.

Nested lists can occur (when the input object contains a list of lists, for
example). These are flattened and optionally deduplicated in the output cells.
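Flattening and deduplication can be sketched like this (hypothetical helper
names; losser's real implementation may differ):

```python
def flatten(value):
    """Flatten arbitrarily nested lists into a single flat list."""
    if not isinstance(value, list):
        return [value]
    result = []
    for item in value:
        result.extend(flatten(item))
    return result

def deduplicate(values):
    """Drop duplicate values while preserving their order."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen

nested = [["CSV", "JSON"], ["CSV"]]
print(deduplicate(flatten(nested)))  # ['CSV', 'JSON']
```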

Some of the filtering and transformations you can do with losser include:

* Extract some fields from the objects (by matching regular expression
patterns) and filter out others.

Any fields in an input object that do not match any of the pattern paths in
the `columns.json` file are filtered out.

  ([TODO](https://github.com/ckan/losser/issues/2): Support appending unmatched
  fields to the end of the output table as additional columns).

* Specify the order of the columns in the output table.

Columns are output in the same order that they appear in the `columns.json`
file, which does not have to be the same order as the corresponding fields in
the input objects.

* Rename fields, using a different name for the column in the output table than
for the field in the input objects.

  For example, to get the "notes" field from each input object and place the
  values in a "Description" column in the output table, put this object in
  your `columns.json`:

      "Description": {
          "pattern_path": "^notes$"
      }


* Match patterns case-sensitively.

By default patterns are matched case-insensitively. To do case-sensitive
matching put `"case_sensitive": true` in a column query in your
`columns.json` file:

      "Title": {
          "pattern_path": "^title$",
          "case_sensitive": true
      },

* Transform the matched values, for example truncating or stripping whitespace
from strings.

* Provide arbitrary pre-processor and post-processor functions to do custom
transformations on the input and output objects
([TODO](https://github.com/ckan/losser/issues/1)).

* Find inconsistently-named fields using a pattern that matches any of the
names and combine them into a single column in the output table.

For example you can provide a pattern like "^update.*" that will find keys
named "update", "Update", "Update Frequency" etc. in different input objects
and collect their values in a single "Update Frequency" column.

* Recurse into sub-objects and extract fields from the sub-objects, promoting
them to top-level keys in the output table.

  This is done using "pattern paths", ordered lists of regexes. For example
  the pattern path `["^resources$", "^format$"]` will find the values of the
  "format" fields of the "resources" sub-objects in each of the input
  objects.

* You can specify that a pattern path should find a unique value in the object,
and if more than one value in the object matches the pattern (and a list
would be returned) an exception will be raised.

Use `"unique": true` in a column query in your `columns.json` file:

      "Title": {
          "pattern_path": "^title$",
          "unique": true
      },

This is useful for debugging pattern paths that you expect to be unique.

* You can specify that a pattern path *must* match a value in the object, and
an exception will be raised if there's no matching path through the object
([TODO](https://github.com/ckan/losser/issues/4)).

* When a pattern matches multiple paths through the input object, or matches a
path going through a sub-list, the resulting list of values in the output
table cell can be deduplicated. Put `"deduplicate": true` in a column query
in your `columns.json` file:

      "Format": {
          "pattern_path": ["^resources$", "^format$"],
          "deduplicate": true
      },


What it can't do (yet):

* Pattern match against the values of items (as opposed to their keys).

  When following a pattern path through an input object, whenever losser hits
  an object/dictionary (either one of the top-level objects in the list of
  input objects or a sub-object) it matches the relevant regex against the
  object's keys and then recurses on the values of each of the matched keys.
  You can't also specify a pattern to match those values against.

  When it hits a string, number, boolean or `None`/`null` losser returns
  it. You can't give it a pattern to match the value against to decide whether
  to return it or not.

When it hits a list losser iterates over the items in the list and for each
item either returns it or, if it's a sub-list or sub-object, recurses.
(When sub-lists or sub-objects would cause a nested list to be returned it's
flattened into a single list and optionally deduplicated.) Again, you can't
provide a pattern to be matched against each item to decide whether to
return/recurse or not.

Adding pattern matching against values as well as keys would add a lot of
power.


Requirements
then do:
    git clone https://github.com/ckan/losser.git
    cd losser
    python setup.py develop
    pip install -r dev-requirements.txt


Usage
-----

On the command-line losser reads input objects from stdin and writes the output
table to stdout, making it composable with other UNIX commands. For example:

    losser --columns columns.json < input.json > output.csv

This will read input objects from `input.json`, read column queries from
`columns.json`, and write output objects to `output.csv`.

To use losser as a Python library:

    import losser.losser as losser
    table = losser.table(input_objects, columns)

`input_objects` should be a list of dicts. `columns` can be either a list of
dicts or the path to a `columns.json` file (string). The returned `table` will
be a list of dicts. If you pass `csv=True` to `table()` it'll return a
CSV-formatted string instead. See `table()`'s docstring for more arguments.
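The `csv=True` output can be approximated with the standard library (a sketch
of what the returned string looks like; the `table` data below is invented,
and this is not losser's implementation):

```python
import csv
import io
from collections import OrderedDict

# A hypothetical table as losser.table() would return it: every row has the
# same keys in the same order.
table = [
    OrderedDict([("Data Owner", "Sean Hammond"),
                 ("Title", "An Example Input Object")]),
    OrderedDict([("Data Owner", "Frank Black"),
                 ("Title", "Another Example Object")]),
]

# Serialise the table to a CSV-formatted string: one header row, then one
# row per input object.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(table[0].keys()))
writer.writeheader()
writer.writerows(table)
print(out.getvalue())
```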


Running the Tests
37 changes: 13 additions & 24 deletions losser/losser.py
@@ -1,23 +1,3 @@
"""Filter, transform and export a list of JSON objects to CSV.

A list of objects (as JSON text) is read from stdin, filtered and transformed,
and the resulting table is written to stdout as CSV text.

(TODO: Also support CSV input and JSON output.)

A JSON file specifying the columns to output (and how to extract the values
for the columns from the input data) must be provided as a --columns argument.

Example usage:

    losser --columns columns.json < input.json > output.csv

This will:

1. Read data from input.json
2. Transform and filter it according to the columns specified in columns.json
3. Write the result as UTF8-encoded, CSV-formatted text to output.csv

"""
import re
import collections
import sys
@@ -39,6 +19,19 @@ def table(dicts, columns, csv=False):
A "table" is a list of OrderedDicts each having the same keys in the same
order.

:param dicts: the list of input dicts
:type dicts: list of dicts

:param columns: the list of column query dicts, or the path to a JSON file
    containing the list of column query dicts
:type columns: list of dicts, or string

:param csv: return a UTF8-encoded, CSV-formatted string instead of a list
    of dicts
:type csv: bool

:rtype: list of dicts, or CSV string
"""
# Optionally read columns from file.
if isinstance(columns, basestring):
@@ -102,10 +95,6 @@ def query(pattern_path, dict_, max_length=None, strip=False,
If the dict contains sub-lists or sub-dicts values from these will be
flattened into a simple flat list to be returned.
# FIXME: If the pattern path doesn't match the keys in the dict then None
# is returned, which is indistinguishable from if it matched a path to a
# key whose value was None. Raise an UnmatchedPathError instead.
"""
if string_transformations is None:
string_transformations = []
