Add some docs
seanh committed Oct 23, 2014
1 parent 21432f9 commit 0401317
Showing 2 changed files with 269 additions and 27 deletions.
259 changes: 256 additions & 3 deletions README.markdown
Losser
======

A little UNIX command and Python library for lossy filter, transform, and
export of JSON to Excel-compatible CSV.

Losser can be run either as a UNIX command or used as a Python library
(see [Usage](#usage) below). It takes a JSON-formatted list of objects
(or a list of Python dicts) as input and produces a "table" as output.

The input objects do not all have to have the same keys as each other, and may
contain sub-lists and sub-objects arbitrarily nested.

The output "table" is a list of objects that all have the same keys in the same
order, and with sub-objects and sub-lists nested no more than one level deep.
It can be output as:

* A list of Python OrderedDicts each having the same keys in the same order
* A string of JSON-formatted text representing a list of objects each having
the same keys in the same order
([TODO](https://github.com/ckan/losser/issues/3))
* A string of CSV-formatted text, one object per CSV row. The rows of the CSV
correspond to the objects in the list of objects, and the columns correspond
to the object's keys.

The input objects can be filtered and transformed before producing the output
table. You provide a list of "column query" objects in a `columns.json` file
that specifies what columns the output table should have, and how the values
for those columns should be retrieved from the input objects.

For example, if you had some input objects that looked like this:

    [
        {
            "author": "Sean Hammond",
            "title": "An Example Input Object",
            "extras": {
                "Delivery Unit": "Commissioning"
            }
        },
        ...
    ]

You might transform them using a `columns.json` file like this:

    {
        "Data Owner": {
            "pattern_path": "^author$"
        },
        "Title": {
            "pattern_path": "^title$"
        },
        "Delivery Unit": {
            "pattern_path": ["^extras$", "^Delivery Unit$"]
        }
    }

This would output a CSV file like this:

    Data Owner,Title,Delivery Unit
    Sean Hammond,An Example Input Object,Commissioning
    Frank Black,Another Example Object,Some Other Unit
    ...

The `columns.json` file above specifies three column headings for the output
table:

1. Data Owner
2. Title
3. Delivery Unit

The values for each column are retrieved from the input objects by following a
"pattern path": a list of regular expressions that are matched against the keys
of the input object and its sub-objects in turn to find a value.

For example, the "Data Owner" column above has the pattern path `"^author$"`,
which matches the string "author". This will find top-level keys named
"author" in the input objects and output their values in the "Data Owner"
column of the output table.
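In plain Python the single-pattern case can be sketched like this (a
simplified illustration with a made-up function name, not losser's actual
implementation):

```python
import re

def match_column(pattern, obj):
    """Return the values of all top-level keys matching the regex pattern."""
    # Losser matches patterns case-insensitively by default.
    regex = re.compile(pattern, re.IGNORECASE)
    return [value for key, value in obj.items() if regex.search(key)]

obj = {"author": "Sean Hammond", "title": "An Example Input Object"}
print(match_column("^author$", obj))  # ['Sean Hammond']
```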

The "Delivery Unit" column above has a more complex pattern path:
`["^extras$", "^Delivery Unit$"]`. This will find the top-level key "extras" in
an input object and, assuming the value for the "extras" key is a sub-object,
will find and return the value for the "Delivery Unit" key in the sub-object.

Pattern paths can be arbitrarily long, recursing into arbitrarily deeply nested
sub-objects.

Losser processes an input object as follows:

1. Any pre-processor functions are applied to the input object.
2. Each column query in the `columns.json` file is applied to the input object
   in turn to produce the values for the corresponding row in the output table.
   For each column query:
   1. When it hits an object/dictionary (either the top-level input object
      itself or a sub-object) losser pops the next regular expression off the
      pattern path, matches it against each of the object's keys, and recurses
      on each key that matches.
   2. When it hits a list losser iterates over the items in the list, recursing
      on each of them and collecting the results into a list.
   3. When it hits a string, number, boolean or `None`/`null` value the
      recursion bottoms out and returns the value.
3. Once all of the column queries have been run on the input object and the
   results collected, any post-processor functions are called on the output
   object.
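The recursion described above can be sketched roughly like this (a simplified
stand-in for losser's real query code; options like `unique`, `deduplicate`
and string transformations are omitted):

```python
import re

def query(pattern_path, node):
    """Follow a list of regexes down through nested dicts/lists (sketch)."""
    if isinstance(node, dict):
        if not pattern_path:
            return None  # ran out of patterns before reaching a leaf value
        pattern, rest = pattern_path[0], pattern_path[1:]
        regex = re.compile(pattern, re.IGNORECASE)
        # Recurse on the value of every key that matches the pattern.
        matches = [query(rest, v) for k, v in node.items() if regex.search(k)]
        if not matches:
            return None
        return matches[0] if len(matches) == 1 else matches
    elif isinstance(node, list):
        # Recurse on each item, collecting the results into a list.
        return [query(pattern_path, item) for item in node]
    else:
        # String, number, boolean or None: the recursion bottoms out.
        return node

obj = {"extras": {"Delivery Unit": "Commissioning"}}
print(query(["^extras$", "^Delivery Unit$"], obj))  # Commissioning
```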

A pattern may match more than one key in an object, in which case each of the
matching keys' values will be recursed into, and a list of matching values
will eventually be returned instead of a single value. For example, given this
input object:

    {
        "update": "yearly",
        "update frequency": "monthly"
    }

The pattern path `"^update.*"` (which matches both "update" and "update
frequency") would output `"yearly, monthly"` (a quoted, comma-separated list)
in the corresponding cell in the output table.
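Assuming the multiple matches are joined with `", "` and then quoted by the
CSV writer (a sketch of the behaviour, not losser's exact code):

```python
import csv
import io
import re

obj = {"update": "yearly", "update frequency": "monthly"}
regex = re.compile("^update.*", re.IGNORECASE)
values = [v for k, v in obj.items() if regex.search(k)]  # both keys match

# Join the list into one cell; the CSV writer quotes the cell because it
# contains a comma.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow([", ".join(values)])
print(out.getvalue().strip())  # "yearly, monthly"
```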

This also happens when a pattern path matches a single field in the input
object and the field's value is a list.

Nested lists can occur (when the input object contains a list of lists, for
example). These are flattened and optionally deduplicated in the output cells.
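Flattening and deduplication can be sketched like this (hypothetical helper
names; losser's real implementation may differ):

```python
def flatten(value):
    """Flatten arbitrarily nested lists into a single flat list."""
    if not isinstance(value, list):
        return [value]
    result = []
    for item in value:
        result.extend(flatten(item))
    return result

def deduplicate(values):
    """Drop duplicate values while preserving their order."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen

nested = [["CSV", "JSON"], ["CSV"]]
print(deduplicate(flatten(nested)))  # ['CSV', 'JSON']
```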

Some of the filtering and transformations you can do with losser include:

* Extract some fields from the objects (by matching regular expression
patterns) and filter out others.

Any fields in an input object that do not match any of the pattern paths in
the `columns.json` file are filtered out.

  ([TODO](https://github.com/ckan/losser/issues/2): Support appending unmatched
  fields to the end of the output table as additional columns).

* Specify the order of the columns in the output table.

Columns are output in the same order that they appear in the `columns.json`
file, which does not have to be the same order as the corresponding fields in
the input objects.

* Rename fields, using a different name for the column in the output table than
for the field in the input objects.

  For example, to get the "notes" field from each input object and place the
  values in a "Description" column in the output table, put this object in
  your `columns.json`:

      "Description": {
          "pattern_path": "^notes$"
      }


* Match patterns case-sensitively.

By default patterns are matched case-insensitively. To do case-sensitive
matching put `"case_sensitive": true` in a column query in your
`columns.json` file:

      "Title": {
          "pattern_path": "^title$",
          "case_sensitive": true
      },

* Transform the matched values, for example truncating or stripping whitespace
from strings.

* Provide arbitrary pre-processor and post-processor functions to do custom
transformations on the input and output objects
([TODO](https://github.com/ckan/losser/issues/1)).

* Find inconsistently-named fields using a pattern that matches any of the
names and combine them into a single column in the output table.

For example you can provide a pattern like "^update.*" that will find keys
named "update", "Update", "Update Frequency" etc. in different input objects
and collect their values in a single "Update Frequency" column.

* Recurse into sub-objects and extract fields from the sub-objects, promoting
them to top-level keys in the output table.

  This is done using "pattern paths", ordered lists of regexes. For example
  the pattern path `["^resources$", "^format$"]` will find the values of the
  "format" fields of the "resources" sub-objects in each of the input
  objects.

* You can specify that a pattern path should find a unique value in the object,
and if more than one value in the object matches the pattern (and a list
would be returned) an exception will be raised.

Use `"unique": true` in a column query in your `columns.json` file:

      "Title": {
          "pattern_path": "^title$",
          "unique": true
      },

This is useful for debugging pattern paths that you expect to be unique.

* You can specify that a pattern path *must* match a value in the object, and
an exception will be raised if there's no matching path through the object
([TODO](https://github.com/ckan/losser/issues/4)).

* When a pattern matches multiple paths through the input object, or matches a
path going through a sub-list, the resulting list of values in the output
table cell can be deduplicated. Put `"deduplicate": true` in a column query
in your `columns.json` file:

      "Format": {
          "pattern_path": ["^resources$", "^format$"],
          "deduplicate": true
      },


What it can't do (yet):

* Pattern match against the values of items (as opposed to their keys).

  When following a pattern path through an input object, whenever losser hits
  an object/dictionary (either one of the top-level objects in the list of
  input objects or a sub-object) it matches the relevant regex against the
  object's keys and then recurses on the values of each of the matched keys.
  You can't also specify a pattern to match those values against.

  When it hits a string, number, boolean or `None`/`null` losser returns
  it. You can't give it a pattern to match the value against to decide whether
  to return it or not.

When it hits a list losser iterates over the items in the list and for each
item either returns it or, if it's a sub-list or sub-object, recurses.
(When sub-lists or sub-objects would cause a nested list to be returned it's
flattened into a single list and optionally deduplicated.) Again, you can't
provide a pattern to be matched against each item to decide whether to
return/recurse or not.

Adding pattern matching against values as well as keys would add a lot of
power.


Requirements
then do:
    git clone https://github.com/ckan/losser.git
    cd losser
    python setup.py develop
    pip install -r dev-requirements.txt


Usage
-----

On the command-line losser reads input objects from stdin and writes the output
table to stdout, making it composable with other UNIX commands. For example:

    losser --columns columns.json < input.json > output.csv

This will read input objects from `input.json`, read column queries from
`columns.json`, and write output objects to `output.csv`.

To use losser as a Python library:

    import losser.losser as losser
    table = losser.table(input_objects, columns)

`input_objects` should be a list of dicts. `columns` can be either a list of
dicts or the path to a `columns.json` file (string). The returned `table` will
be a list of dicts. If you pass `csv=True` to `table()` it'll return a
CSV-formatted string instead. See `table()`'s docstring for more arguments.
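The `csv=True` output can be approximated with the standard library (a sketch
of what the returned string looks like; the `table` data below is invented,
and this is not losser's implementation):

```python
import csv
import io
from collections import OrderedDict

# A hypothetical table as losser.table() would return it: every row has the
# same keys in the same order.
table = [
    OrderedDict([("Data Owner", "Sean Hammond"),
                 ("Title", "An Example Input Object")]),
    OrderedDict([("Data Owner", "Frank Black"),
                 ("Title", "Another Example Object")]),
]

# Serialise the table to a CSV-formatted string: one header row, then one
# row per input object.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(table[0].keys()))
writer.writeheader()
writer.writerows(table)
print(out.getvalue())
```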


Running the Tests
37 changes: 13 additions & 24 deletions losser/losser.py
@@ -1,23 +1,3 @@
"""Filter, transform and export a list of JSON objects to CSV.

A list of objects (as JSON text) is read from stdin, filtered and transformed,
and the resulting table is written to stdout as CSV text.

(TODO: Also support CSV input and JSON output.)

A JSON file specifying the columns to output (and how to extract the values
for the columns from the input data) must be provided as a --columns argument.

Example usage:

    losser --columns columns.json < input.json > output.csv

This will:

1. Read data from input.json
2. Transform and filter it according to the columns specified in columns.json
3. Write the result as UTF8-encoded, CSV-formatted text to output.csv

"""
import re
import collections
import sys
@@ -39,6 +19,19 @@ def table(dicts, columns, csv=False):
A "table" is a list of OrderedDicts each having the same keys in the same
order.

:param dicts: the list of input dicts
:type dicts: list of dicts

:param columns: the list of column query dicts, or the path to a JSON file
    containing the list of column query dicts
:type columns: list of dicts, or string

:param csv: return a UTF8-encoded, CSV-formatted string instead of a list
    of dicts
:type csv: bool

:rtype: list of dicts, or CSV string
"""
# Optionally read columns from file.
if isinstance(columns, basestring):
@@ -102,10 +95,6 @@ def query(pattern_path, dict_, max_length=None, strip=False,
If the dict contains sub-lists or sub-dicts values from these will be
flattened into a simple flat list to be returned.
# FIXME: If the pattern path doesn't match the keys in the dict then None
# is returned, which is indistinguishable from if it matched a path to a
# key whose value was None. Raise an UnmatchedPathError instead.
"""
if string_transformations is None:
string_transformations = []
