Refactor Package #46

Merged
merged 7 commits into from
Aug 12, 2017
173 changes: 119 additions & 54 deletions README.md
@@ -6,7 +6,7 @@
[![SemVer](https://img.shields.io/badge/versions-SemVer-brightgreen.svg)](http://semver.org/)
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)

A ruby library for working with [Data Packages](http://dataprotocols.org/data-packages/).
A ruby library for working with [Data Packages](https://specs.frictionlessdata.io/data-package/).

The library intends to support:

@@ -35,127 +35,192 @@ Require the gem, if you need to:
require 'datapackage'
```

Parsing a Data Package from a remote location:
Parsing a data package descriptor from a remote location:

```ruby
package = DataPackage::Package.new( "http://example.org/datasets/a" )
package = DataPackage::Package.new( "http://example.org/datasets/a/datapackage.json" )
```

This assumes that `http://example.org/datasets/a/datapackage.json` exists, or specifically load a JSON file:
This assumes that `http://example.org/datasets/a/datapackage.json` exists.
Similarly you can load a package descriptor from a local JSON file.

```ruby
package = DataPackage::Package.new( "http://example.org/datasets/a/datapackage.json" )
package = DataPackage::Package.new( "/my/data/package/datapackage.json" )
```

Similarly you can load a package from a local JSON file, or specify a directory:
The data package descriptor, i.e. the `datapackage.json` file, is expected to be at the _root_ directory
of the data package, and the `path` attribute of each of the package's `resources` will be resolved
relative to it.
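
As a rough illustration of what "resolved relative to it" means for a local package, the sketch below joins a resource `path` onto the directory that holds `datapackage.json`. The helper `resolve_resource_path` and the `data/test.csv` path are hypothetical and are not part of the gem; the gem's own resolution logic may differ.

```ruby
require 'pathname'

# Hypothetical helper (not part of the gem): resolve a resource `path`
# against the directory that contains the descriptor file.
def resolve_resource_path(descriptor_location, resource_path)
  base_dir = Pathname.new(descriptor_location).dirname
  (base_dir + resource_path).to_s
end

resolve_resource_path('/my/data/package/datapackage.json', 'data/test.csv')
# => "/my/data/package/data/test.csv"
```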

You can also load a data package descriptor directly from a Hash:

```ruby
package = DataPackage::Package.new( "/my/data/package" )
package = DataPackage::Package.new( "/my/data/package/datapackage.json" )
descriptor = {
'resources'=> [
{
'name'=> 'example',
'profile'=> 'tabular-data-resource',
'data'=> [
['height', 'age', 'name'],
['180', '18', 'Tony'],
['192', '32', 'Jacob'],
],
'schema'=> {
'fields'=> [
{'name'=> 'height', 'type'=> 'integer'},
{'name'=> 'age', 'type'=> 'integer'},
{'name'=> 'name', 'type'=> 'string'},
],
}
}
]
}

package = DataPackage::Package.new(descriptor)
```

There is a set of helper methods for accessing data from the package, e.g.:

```ruby
package = DataPackage::Package.new( "/my/data/package" )
package.name
package.title
package.description
package.homepage
package.license
```

## Reading a Data Package and its resources
## Reading Data Resources

```ruby
require 'datapackage'
A data package must contain an array of [Data Resources](https://specs.frictionlessdata.io/data-resource).
You can access the resources in your Data Package either by their name or by their index in the `resources` array:

dp = DataPackage::Package.new('http://data.okfn.org/data/core/gdp/datapackage.json')
```ruby
first_resource = package.resources[0]
first_resource = package.get_resource('example')

data = CSV.parse(dp.resources[0].data, headers: true)
brazil_gdp = data.select { |r| r["Country Code"] == "BRA" }.
map { |row| { year: Integer(row["Year"]), value: Float(row['Value']) } }
# Get info about the data source of this resource
first_resource.source_type
first_resource.source
```

max_gdp = brazil_gdp.max_by { |r| r[:value] }
min_gdp = brazil_gdp.min_by { |r| r[:value] }
You can then read the source depending on its `source_type`: `inline`, `remote` or `local`.
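
To make the three source types concrete, here is a minimal sketch of how a descriptor could be classified. The helper `guess_source_type` is hypothetical and the rules it applies (a `data` key means inline, an `http(s)` path means remote, anything else is local) are an assumption, not the gem's actual implementation.

```ruby
# Hypothetical helper (not the gem's implementation): classify a resource
# descriptor as 'inline', 'remote' or 'local'.
def guess_source_type(resource_descriptor)
  # Assumption: a `data` property means the data is embedded in the descriptor
  return 'inline' if resource_descriptor.key?('data')

  path = resource_descriptor['path'].to_s
  # Assumption: only http(s) URLs count as remote sources
  path.match?(%r{\Ahttps?://}i) ? 'remote' : 'local'
end

guess_source_type('data' => [['height'], ['180']])        # => "inline"
guess_source_type('path' => 'https://example.org/a.csv')  # => "remote"
guess_source_type('path' => 'data/a.csv')                 # => "local"
```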

percentual_increase = (max_gdp[:value] / min_gdp[:value]).round(2)
max_gdp_val = max_gdp[:value].to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
If a resource complies with the [Tabular Data Resource spec](https://specs.frictionlessdata.io/tabular-data-resource/) or uses the
`tabular-data-resource` [profile](#profiles) you can make a [TableSchema::Table](https://github.com/frictionlessdata/tableschema-rb) for it:

msg = "The highest Brazilian GDP occurred in #{max_gdp[:year]}, when it peaked at US$ " +
"#{max_gdp_val}. This was #{percentual_increase}% more than its minimum GDP " +
"in #{min_gdp[:year]}"
```ruby
package.resources[0].tabular?
table = package.resources[0].table

print msg
# Read the entire table at once
data = table.read

# The highest Brazilian GDP occurred in 2011, when it peaked at US$ 2,615,189,973,181. This was 172.44% more than its minimum GDP in 1960.
# Or iterate through it
data = table.iter {|row| print row}
```

See the [TableSchema](https://github.com/frictionlessdata/tableschema-rb) documentation for other things you can do with a tabular resource.

## Creating a Data Package

```ruby
package = DataPackage::Package.new

# Add package properties
package.name = 'my_sleep_duration'
package.resources = [
{'name': 'data'}
]

resource = package.resources[0]
resource.descriptor['data'] = [
7, 8, 5, 6, 9, 7, 8
]

File.open('datapackage.json', 'w') do |f|
f.write(package.to_json)
end

# {"name": "my_sleep_duration", "resources": [{"name": "data", "data": [7, 8, 5, 6, 9, 7, 8]}]}
# Add a resource
package.add_resource(
{
'name'=> 'sleep_durations_this_week',
'data'=> [7, 8, 5, 6, 9, 7, 8],
}
)
```

## Validating a Data Package

Data Package descriptors can be validated against a [JSON schema](https://tools.ietf.org/html/draft-zyp-json-schema-04) that we call `profile`.

By default, the gem uses the standard [Data Package profile](http://specs.frictionlessdata.io/schemas/data-package.json), but alternative profiles are available.
If the resource is valid it will be added to the `resources` array of the Data Package;
if it's invalid it will not be added and you should try creating and [validating](#validating-a-resource) your resource to see why it fails.

```ruby
package = DataPackage::Package.new('http://data.okfn.org/data/core/gdp/datapackage.json')
# Update a resource
my_resource = package.get_resource('sleep_durations_this_week')
my_resource['schema'] = {
'fields'=> [
{'name'=> 'number_hours', 'type'=> 'integer'},
]
}

# Save the Data Package descriptor to the target file
package.save('datapackage.json')

package.valid?
#=> true
package.errors
#=> [] # An array of errors
# Remove a resource
package.remove_resource('sleep_durations_this_week')
```

## Using a different profile
## Profiles

Data Package and Data Resource descriptors can be validated against [JSON schemas](https://tools.ietf.org/html/draft-zyp-json-schema-04) that we call `profiles`.

By default, this gem uses the standard [Data Package profile](http://specs.frictionlessdata.io/schemas/data-package.json) and [Data Resource profile](http://specs.frictionlessdata.io/schemas/data-resource.json) but alternative profiles are available for both.

According to the [specs](https://specs.frictionlessdata.io/profiles/) the value of
the `profile` property can be either a URL or an identifier from [the registry](https://specs.frictionlessdata.io/schemas/registry.json).

### Profiles in the local cache

The profiles from the registry come bundled with the gem. You can reference them in your DataPackage descriptor by their identifier in [the registry](https://specs.frictionlessdata.io/schemas/registry.json):
The profiles from the registry come bundled with the gem. You can reference them in your Data Package descriptor by their identifier in [the registry](https://specs.frictionlessdata.io/schemas/registry.json):

- `tabular-data-package` for a [Tabular Data Package](http://specs.frictionlessdata.io/tabular-data-package/)
- `fiscal-data-package` for a [Fiscal Data Package](http://fiscal.dataprotocols.org/spec/)
- `tabular-data-package` for a [Tabular Data Package](http://specs.frictionlessdata.io/tabular-data-package/)
- `fiscal-data-package` for a [Fiscal Data Package](https://specs.frictionlessdata.io/fiscal-data-package/)
- `tabular-data-resource` for a [Tabular Data Resource](https://specs.frictionlessdata.io/tabular-data-resource/)

```ruby
{
"profile": "tabular-data-package" #or "fiscal-data-package"
"profile": "tabular-data-package"
}
```

### Profiles from elsewhere

If you have a custom profile schema you can reference it by its URL
If you have a custom profile schema you can reference it by its URL:

```ruby
{
"profile": "https://specs.frictionlessdata.io/schemas/tabular-data-package.json"
}
```
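
Since a `profile` value can be either a registry identifier or a URL, a consumer has to tell the two apart. The sketch below shows one way to do that with Ruby's standard `uri` library; `profile_kind` is a hypothetical helper and does not reflect how the gem itself makes this decision.

```ruby
require 'uri'

# Hypothetical helper (not part of the gem): decide whether a `profile`
# value looks like an http(s) URL or like a registry identifier.
def profile_kind(profile)
  uri = URI.parse(profile)
  uri.is_a?(URI::HTTP) && uri.host ? :url : :identifier
rescue URI::InvalidURIError
  # Anything that cannot be parsed as a URI is treated as an identifier
  :identifier
end

profile_kind('tabular-data-package')
# => :identifier
profile_kind('https://specs.frictionlessdata.io/schemas/tabular-data-package.json')
# => :url
```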

## Validation

Data Resources and Data Packages are validated against their profiles to ensure they respect the expected structure.

### Validating a Resource

```ruby
descriptor = {
'name'=> 'incorrect name',
'path'=> 'https://cdn.rawgit.com/frictionlessdata/datapackage-rb/master/spec/fixtures/test-pkg/test.csv',
}
# The second positional argument is the base path
resource = DataPackage::Resource.new(descriptor, '')

# Returns true if resource is valid, false otherwise
resource.valid?

# Returns true or raises DataPackage::ValidationError
resource.validate

# Iterate through validation errors
resource.iter_errors{ |err| p err}
```
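
The descriptor above is presumably invalid because of its `name`. As a sketch of the kind of rule a profile enforces, the check below assumes that the Data Resource profile restricts `name` to lowercase alphanumeric characters plus `.`, `-`, `_` and `/`; both the pattern and the helper `valid_resource_name?` are assumptions for illustration, and real validation runs the whole descriptor against the JSON schema.

```ruby
# Hypothetical helper: checks a resource name against an assumed pattern
# (lowercase alphanumerics plus ".", "-", "_" and "/").
def valid_resource_name?(name)
  %r{\A[-a-z0-9._/]+\z}.match?(name)
end

valid_resource_name?('incorrect name') # => false (contains a space)
valid_resource_name?('test-data')      # => true
```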

### Validating a Package

The same methods used to check the validity of a Resource (`valid?`, `validate` and `iter_errors`) are also available for a Package.
The difference is that after a Package descriptor is validated against its `profile`, each of its `resources` is also validated against its own `profile`.

In order for a Package to be valid all its Resources have to be valid.
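
The rule above can be sketched with stand-in objects. `SketchResource` and `package_valid?` are hypothetical names used only for this illustration and are not the gem's classes; the point is simply that one invalid resource makes the whole package invalid.

```ruby
# Hypothetical stand-in for a resource that knows its own validation errors.
SketchResource = Struct.new(:name, :errors) do
  def valid?
    errors.empty?
  end
end

# A package is valid only if its own descriptor has no errors
# and every one of its resources is valid too.
def package_valid?(package_errors, resources)
  package_errors.empty? && resources.all?(&:valid?)
end

good = SketchResource.new('sleep_durations_this_week', [])
bad  = SketchResource.new('incorrect name', ['name does not match the profile'])

package_valid?([], [good])       # => true
package_valid?([], [good, bad])  # => false
```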

## Developer notes

These notes are intended to help people that want to contribute to this package itself. If you just want to use it, you can safely ignore them.
2 changes: 2 additions & 0 deletions lib/datapackage/exceptions.rb
@@ -3,4 +3,6 @@ class Exception < ::Exception; end
class RegistryException < Exception; end
class ResourceException < Exception; end
class ProfileException < Exception; end
class PackageException < Exception; end
class ValidationError < Exception; end
end
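
One consequence of this hierarchy is worth spelling out: because the base class inherits from `::Exception` rather than `StandardError`, a bare `rescue` (which only catches `StandardError`) will not catch the new `ValidationError`. The sketch below re-declares the two relevant classes as stand-ins so it runs without the gem, assuming they live in the `DataPackage` module as the README's `DataPackage::ValidationError` suggests; `caught_by` is a hypothetical helper.

```ruby
# Stand-in for lib/datapackage/exceptions.rb so the snippet runs without the gem.
module DataPackage
  class Exception < ::Exception; end
  class ValidationError < Exception; end
end

# Hypothetical helper: reports which rescue clause handled the error.
def caught_by(&block)
  block.call
  :nothing_raised
rescue StandardError
  # This is what a bare `rescue` would catch
  :bare_rescue
rescue DataPackage::Exception
  # The gem's errors need to be rescued explicitly
  :explicit_rescue
end

caught_by { raise DataPackage::ValidationError, 'invalid descriptor' }
# => :explicit_rescue
caught_by { raise ArgumentError, 'bad argument' }
# => :bare_rescue
```

Callers that wrap `validate` in a generic `rescue => e` would therefore let a `ValidationError` propagate; rescuing `DataPackage::ValidationError` (or `DataPackage::Exception`) by name avoids that.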