Update to readme, samples
akariv committed Oct 23, 2018
1 parent 62f4c9f commit 9c26255
Showing 2 changed files with 191 additions and 111 deletions.
283 changes: 190 additions & 93 deletions README.md
```diff
@@ -40,15 +40,12 @@ worldbank-co2-emissions:
         title: 'CO2 emissions (metric tons per capita)'
         homepage: 'http://worldbank.org/'
     -
-      run: add_resource
+      run: load
       parameters:
+        from: "http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel"
         name: 'global-data'
-        url: "http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel"
         format: xls
         headers: 4
-    -
-      run: stream_remote_resources
-      cache: True
     -
       run: set_types
       parameters:
```
@@ -65,9 +62,8 @@ worldbank-co2-emissions:
In this example we see one pipeline called `worldbank-co2-emissions`. It consists of 4 steps:

- `metadata`: This is a library processor (see below), which modifies the data-package's descriptor (in our case: the initial, empty descriptor) - adding `name`, `title` and other properties to the datapackage.
- `add_resource`: This is another library processor, which adds a single resource to the data-package.
This resource has a `name` and a `url`, pointing to the remote location of the data.
- `stream_remote_resources`: This processor will stream data from resources (like the one we defined in the 1st step) into the pipeline, on to processors further down the pipeline (see more about streaming below).
- `load`: This is another library processor, which loads data into the data-package.
  The loaded resource gets a `name`, and the `from` parameter points to the remote location of the data.
- `set_types`: This processor assigns data types to fields in the data. In this example, field headers looking like years will be assigned the `number` type (a parameter sketch follows this list).
- `dump_to_zip`: Create a zipped and validated datapackage with the provided file name.
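
For reference, here is a sketch of how such a `set_types` step might be written in the pipeline spec (the year-matching pattern is taken from the sample spec in this repository; treat it as an illustration rather than the exact step used above):

```yaml
- run: set_types
  parameters:
    types:
      # field headers that look like years become numbers
      "[12][0-9]{3}":
        type: number
```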

@@ -94,23 +90,37 @@

```shell
$ dpp
Available Pipelines:
- ./worldbank-co2-emissions (*)

$ dpp run ./worldbank-co2-emissions
INFO :Main:RUNNING ./worldbank-co2-emissions
INFO :Main:- lib/update_package.py
INFO :Main:- lib/add_resource.py
INFO :Main:- lib/stream_remote_resources.py
INFO :Main:- lib/dump/to_zip.py
INFO :Main:DONE lib/update_package.py
INFO :Main:DONE lib/add_resource.py
INFO :Main:stream_remote_resources: OPENING http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel
INFO :Main:stream_remote_resources: TOTAL 264 rows
INFO :Main:stream_remote_resources: Processed 264 rows
INFO :Main:DONE lib/stream_remote_resources.py
INFO :Main:dump_to_zip: INFO :Main:Processed 264 rows
INFO :Main:DONE lib/dump/to_zip.py
INFO :Main:RESULTS:
INFO :Main:SUCCESS: ./worldbank-co2-emissions
{'dataset-name': 'co2-emissions', 'total_row_count': 264}
$ dpp run --verbose ./worldbank-co2-emissions
RUNNING ./worldbank-co2-emissions
Collecting dependencies
Running async task
Waiting for completion
Async task starting
Searching for existing caches
Building process chain:
- update_package
- load
- set_types
- dump_to_zip
- (sink)
DONE /Users/adam/code/dhq/specstore/dpp_repo/datapackage_pipelines/specs/../lib/update_package.py
load: DEBUG :Starting new HTTP connection (1): api.worldbank.org:80
load: DEBUG :http://api.worldbank.org:80 "GET /v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel HTTP/1.1" 200 308736
load: DEBUG :http://api.worldbank.org:80 "GET /v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel HTTP/1.1" 200 308736
load: DEBUG :Starting new HTTP connection (1): api.worldbank.org:80
load: DEBUG :http://api.worldbank.org:80 "GET /v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel HTTP/1.1" 200 308736
load: DEBUG :http://api.worldbank.org:80 "GET /v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel HTTP/1.1" 200 308736
set_types: INFO :(<dataflows.processors.set_type.set_type object at 0x10a5c79b0>,)
load: INFO :Processed 264 rows
set_types: INFO :Processed 264 rows
DONE /Users/adam/code/dhq/specstore/dpp_repo/datapackage_pipelines/specs/../lib/load.py
DONE /Users/adam/code/dhq/specstore/dpp_repo/datapackage_pipelines/specs/../lib/set_types.py
dump_to_zip: INFO :Processed 264 rows
DONE /Users/adam/code/dhq/specstore/dpp_repo/datapackage_pipelines/manager/../lib/internal/sink.py
DONE /Users/adam/code/dhq/specstore/dpp_repo/datapackage_pipelines/specs/../lib/dump_to_zip.py
DONE V ./worldbank-co2-emissions {'bytes': 692741, 'count_of_rows': 264, 'dataset_name': 'co2-emissions', 'hash': '4dd18effcdfbf5fc267221b4ffc28fa4'}
INFO :RESULTS:
INFO :SUCCESS: ./worldbank-co2-emissions {'bytes': 692741, 'count_of_rows': 264, 'dataset_name': 'co2-emissions', 'hash': '4dd18effcdfbf5fc267221b4ffc28fa4'}
```

Alternatively, you could use our [Docker](https://www.docker.com/) image:
@@ -272,77 +282,36 @@ Any allowed property (according to the [spec](http://specs.frictionlessdata.io/

```yaml
- samwise gamgee <samwise1992@yahoo.com>
```

### ***`add_resource`***
### ***`load`***

Adds a new external tabular resource to the data-package.
Loads data into the package, infers the schema and optionally casts values. An example step is sketched after the parameter list below.

_Parameters_:

You should provide `name` and `url` attributes, and other optional attributes as defined in the [spec](http://specs.frictionlessdata.io/data-packages/#resource-information).

`url` indicates where the data for this resource resides. Later on, when `stream_remote_resources` runs, it will use the `url` (which is stored in the resource in the `dpp:streamedFrom` property) to read the data rows and push them into the pipeline.

Note that `url` also supports `env://<environment-variable>`, which indicates that the resource url should be fetched from the indicated environment variable. This is useful in case you are supplying a string with sensitive information (such as an SQL connection string for streaming from a database table).

Parameters are basically arguments that are passed to a `tabulator.Stream` instance (see the [API](https://github.com/frictionlessdata/tabulator-py#api-reference)).
Other than those, you can pass a `constants` parameter which should be a mapping of headers to string values.
When used in conjunction with `stream_remote_resources`, these constant values will be added to each generated row
(as well as to the default schema).

You may also provide a schema here, or use the default schema generated by the `stream_remote_resources` processor.
In case `path` is specified, it will be used. If not, the `stream_remote_resources` processor will assign a `path` for you with a `csv` extension.

*Example*:

```yaml
- run: add_resource
parameters:
url: http://example.com/my-excel-file.xlsx
sheet: 1
headers: 2
- run: add_resource
parameters:
url: http://example.com/my-csv-file.csv
encoding: "iso-8859-2"
```

### ***`stream_remote_resources`***

Converts external resources to streamed resources.

External resources are ones that link to a remote data source (url or file path), but are not processed by the pipeline and are kept as-is.

Streamed resources are ones that can be processed by the pipeline, and their output is saved as part of the resulting datapackage.

In case a resource has no schema, a default one is generated automatically here by creating a `string` field from each column in the data source.
- `from` - location of the data that is to be loaded. This can be either:
- a local path (e.g. /path/to/the/data.csv)
- a remote URL (e.g. https://path.to/the/data.csv)
- Other supported links, based on the current support of schemes and formats in [tabulator](https://github.com/frictionlessdata/tabulator-py#schemes)
- a local path or remote URL to a datapackage.json file (e.g. https://path.to/data_package/datapackage.json)
- a reference to an environment variable containing the source location, in the form of `env://ENV_VAR`
- a tuple containing (datapackage_descriptor, resources_iterator)
- `resources` - optional, relevant only if source points to a datapackage.json file or datapackage/resource tuple. Value should be one of the following:
- Name of a single resource to load
- A regular expression matching resource names to load
- A list of resource names to load
- 'None' indicates to load all resources
- The index of the resource in the package
- `validate` - whether data should be cast to the inferred data types. Relevant only when not loading data from a datapackage.
- other options - extra options based on the loaded file format (e.g. `sheet` for Excel files; see the tabulator link above)
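
To illustrate, a minimal sketch of two `load` steps (the URLs, sheet number and resource names below are placeholders, not values from this repository):

```yaml
- run: load
  parameters:
    from: http://example.com/my-excel-file.xlsx
    sheet: 1
    name: my-excel-data
- run: load
  parameters:
    from: https://path.to/data_package/datapackage.json
    resources: '201[67]-data'  # regular expression matching resource names
```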

### ***`printer`***

Just prints whatever it sees. Good for debugging. (A usage sketch follows the parameter list below.)

_Parameters_:

- `resources` - Which resources to stream. Can be:

- List of strings, interpreted as resource names to stream
- String, interpreted as a regular expression to be used to match resource names

If omitted, all resources in datapackage are streamed.

- `ignore-missing` - if true, then missing resources won't raise an error but will be treated as 'empty' (i.e. with zero rows).
Resources with empty URLs will be treated the same (i.e. will generate an 'empty' resource).

- `limit-rows` - if provided, will limit the number of rows fetched from the source. Takes an integer value which specifies how many rows of the source to stream.

*Example*:

```yaml
- run: stream_remote_resources
parameters:
resources: ['2014-data', '2015-data']
- run: stream_remote_resources
parameters:
resources: '201[67]-data'
```

This processor also supports loading plain-text resources (e.g. html pages) and handling them as tabular data - split into rows with a single "data" column.
To enable this behavior, add the following attribute to the resource: `"format": "txt"`.
- `num_rows` - the number of rows to preview; the printer prints several samples of this many rows, taken from different places in the stream
- `last_rows` - optional; how many of the last rows in the stream to print. Defaults to the value of `num_rows`
- `fields` - optional; list of field names to preview
- `resources` - optional; limits which resources are printed. Same semantics as the `load` processor's `resources` argument
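
For example, a possible `printer` step might look like this (the field and resource names are illustrative):

```yaml
- run: printer
  parameters:
    num_rows: 2        # print a few samples of 2 rows each
    last_rows: 1       # also show the final row of the stream
    fields: ['Country Name', '2014']
    resources: 'global-data'
```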

### ***`set_types`***

@@ -1047,6 +1016,134 @@ Saves the datapackage to a filesystem path.

_Parameters_:

- `out-path` - Name of the output path where `datapackage.json` will be stored.

This path will be created if it doesn't exist, as well as internal data-package paths.

If omitted, then `.` (the current directory) will be assumed.

- `force-format` - Specifies whether to force all output files to be generated with the same format
- if `True` (the default), all resources will use the same format
- if `False`, format will be deduced from the file extension. Resources with unknown extensions will be discarded.
- `format` - Specifies the type of output files to be generated (if `force-format` is true): `csv` (the default) or `json`
- `add-filehash-to-path`: Specifies whether to include the file's md5 hash in the resource path. Defaults to `False`. If `True`, embeds the hash in the path like so:
- If original path is `path/to/the/file.ext`
- Modified path will be `path/to/the/HASH/file.ext`
- `counters` - Specifies whether to compute row counts, byte counts and an md5 hash of the data, and where they should be stored. An object with the following properties:
- `datapackage-rowcount`: Where should a total row count of the datapackage be stored (default: `count_of_rows`)
- `datapackage-bytes`: Where should a total byte count of the datapackage be stored (default: `bytes`)
- `datapackage-hash`: Where should an md5 hash of the datapackage be stored (default: `hash`)
- `resource-rowcount`: Where should a total row count of each resource be stored (default: `count_of_rows`)
- `resource-bytes`: Where should a total byte count of each resource be stored (default: `bytes`)
- `resource-hash`: Where should an md5 hash of each resource be stored (default: `hash`)
  Each of these attributes can be set to null in order to disable that counter.
  Each property can be a dot-separated string, for storing the value inside a nested object (e.g. `stats.rowcount`)
- `pretty-descriptor`: Specifies how the datapackage descriptor (`datapackage.json`) file will be formatted:
- `False` (default) - descriptor will be written in one line.
- `True` - descriptor will have indents and new lines for each key, so it becomes more human-readable.
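
Putting a few of these parameters together, a sketch of a `dump_to_path` step (the output path and counter targets are illustrative):

```yaml
- run: dump_to_path
  parameters:
    out-path: output/co2-data
    pretty-descriptor: true
    counters:
      # store each resource's row count in a nested "stats" object
      resource-rowcount: stats.rowcount
      # disable the per-resource hash counter
      resource-hash: null
```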

### ***`dump_to_zip`***

Saves the datapackage to a zipped archive (a sketch follows the parameter list below).

_Parameters_:

- `out-file` - Name of the output file where the zipped data will be stored
- `force-format` and `format` - Same as in `dump_to_path`
- `add-filehash-to-path` - Same as in `dump_to_path`
- `counters` - Same as in `dump_to_path`
- `pretty-descriptor` - Same as in `dump_to_path`
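
For instance, a minimal sketch (the output file name is illustrative):

```yaml
- run: dump_to_zip
  parameters:
    out-file: co2-emissions-wb.zip
    pretty-descriptor: true
```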

## Deprecated Processors

These processors will be removed in the next major version.

### ***`add_metadata`***

Alias for `update_package`, kept for backwards compatibility.

### ***`add_resource`***

Adds a new external tabular resource to the data-package.

_Parameters_:

You should provide `name` and `url` attributes, and other optional attributes as defined in the [spec](http://specs.frictionlessdata.io/data-packages/#resource-information).

`url` indicates where the data for this resource resides. Later on, when `stream_remote_resources` runs, it will use the `url` (which is stored in the resource in the `dpp:streamedFrom` property) to read the data rows and push them into the pipeline.

Note that `url` also supports `env://<environment-variable>`, which indicates that the resource url should be fetched from the indicated environment variable. This is useful in case you are supplying a string with sensitive information (such as an SQL connection string for streaming from a database table).

Parameters are basically arguments that are passed to a `tabulator.Stream` instance (see the [API](https://github.com/frictionlessdata/tabulator-py#api-reference)).
Other than those, you can pass a `constants` parameter which should be a mapping of headers to string values.
When used in conjunction with `stream_remote_resources`, these constant values will be added to each generated row
(as well as to the default schema).

You may also provide a schema here, or use the default schema generated by the `stream_remote_resources` processor.
In case `path` is specified, it will be used. If not, the `stream_remote_resources` processor will assign a `path` for you with a `csv` extension.

*Example*:

```yaml
- run: add_resource
parameters:
url: http://example.com/my-excel-file.xlsx
sheet: 1
headers: 2
- run: add_resource
parameters:
url: http://example.com/my-csv-file.csv
encoding: "iso-8859-2"
```

### ***`stream_remote_resources`***

Converts external resources to streamed resources.

External resources are ones that link to a remote data source (url or file path), but are not processed by the pipeline and are kept as-is.

Streamed resources are ones that can be processed by the pipeline, and their output is saved as part of the resulting datapackage.

In case a resource has no schema, a default one is generated automatically here by creating a `string` field from each column in the data source.

_Parameters_:

- `resources` - Which resources to stream. Can be:

- List of strings, interpreted as resource names to stream
- String, interpreted as a regular expression to be used to match resource names

If omitted, all resources in datapackage are streamed.

- `ignore-missing` - if true, then missing resources won't raise an error but will be treated as 'empty' (i.e. with zero rows).
Resources with empty URLs will be treated the same (i.e. will generate an 'empty' resource).

- `limit-rows` - if provided, will limit the number of rows fetched from the source. Takes an integer value which specifies how many rows of the source to stream.

*Example*:

```yaml
- run: stream_remote_resources
parameters:
resources: ['2014-data', '2015-data']
- run: stream_remote_resources
parameters:
resources: '201[67]-data'
```

This processor also supports loading plain-text resources (e.g. html pages) and handling them as tabular data - split into rows with a single "data" column.
To enable this behavior, add the following attribute to the resource: `"format": "txt"`.
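
A sketch of what that might look like when declaring the resource (the URL and resource name are placeholders; the `format` attribute is the part this note documents):

```yaml
- run: add_resource
  parameters:
    name: raw-page
    url: http://example.com/some-page.html
    format: txt
```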

### ***`dump.to_sql`***

Alias for `dump_to_sql`, kept for backwards compatibility.

### ***`dump.to_path`***

Saves the datapackage to a filesystem path.

_Parameters_:

- `out-path` - Name of the output path where `datapackage.json` will be stored.

This path will be created if it doesn't exist, as well as internal data-package paths.
@@ -1079,7 +1176,7 @@ _Parameters_:
- Note that such changes may make the resulting datapackage incompatible with the frictionlessdata specs and may cause interoperability problems.
- Example usage: [pipeline-spec.yaml](tests/cli/pipeline-spec.yaml) (under the `custom-formatters` pipeline), [XLSXFormat class](tests/cli/custom_formatters/xlsx_format.py)

### ***`dump_to_zip`***
### ***`dump.to_zip`***

Saves the datapackage to a zipped archive.

@@ -1095,7 +1192,7 @@ _Parameters_:

#### *Note*

`dump_to_path` and `dump_to_zip` processors will handle non-tabular resources as well.
`dump.to_path` and `dump.to_zip` processors will handle non-tabular resources as well.
These resources must have both `url` and `path` properties, and _must not_ contain a `schema` property.
In such cases, the file will be downloaded from the `url` and placed in the provided `path`.

19 changes: 1 addition & 18 deletions samples/pipeline-spec.yaml
```diff
@@ -22,24 +22,7 @@ worldbank-co2-emissions:
         types:
           "[12][0-9]{3}":
             type: number
-    -
-      run: add_constant
-      parameters:
-        column-name: the_constant
-        value: the value
     -
       run: dump_to_zip
       parameters:
-        out-file: co2-emisonss-wb.zip
-        force-format: false
-    -
-      run: dump_to_path
-      parameters:
-        out-path: co2-emisonss-wb
-        force-format: false
-    -
-      run: dump_to_sql
-      parameters:
-        tables:
-          co2_emisonss_wb:
-            resource-name: global-data
+        out-file: co2-emissions-wb.zip
```
