Implemented groups of resources (#243)
* Implemented Group class

* Implemented package.get_group

* Added package.get_group test

* Required tabulator>1.24.1

* Fixed linting

* Skip compression tests in Python2

* Added `tableschema-sql` to tests require

* Added resource.group

* Fixed package.save for storage

* Implemented `merge_groups` for `package_save`

* Added documentation

* Updated readme

* Updated readme

* Updated readme

* Updated readme

* Fixed tests

* Updated readme

* Fixed remote schema/dialect
roll authored Aug 27, 2019
1 parent effd75b commit 503412e
Showing 14 changed files with 385 additions and 10 deletions.
143 changes: 142 additions & 1 deletion README.md
@@ -25,6 +25,7 @@ A library for working with [Data Packages](http://specs.frictionlessdata.io/data
- [Documentation](#documentation)
- [Package](#package)
- [Resource](#resource)
- [Group](#group)
- [Profile](#profile)
- [validate](#validate)
- [infer](#infer)
@@ -215,6 +216,13 @@ Remove data package resource by name. The data package descriptor will be valida
- `(exceptions.DataPackageException)` - raises error if something goes wrong
- `(Resource/None)` - returns removed `Resource` instances or null if not found

#### `package.get_group(name)`

Returns a group of tabular resources by name. For more information about groups see [Group](#group).

- `name (str)` - name of a group of resources
- `(exceptions.DataPackageException)` - raises error if something goes wrong
- `(Group/None)` - returns a `Group` instance or null if not found
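
The lookup can be pictured on plain descriptor dictionaries. The sketch below is not the library's code: `select_group` is a hypothetical helper, and it assumes a resource counts as tabular only when its `profile` is `tabular-data-resource`, which may be narrower than the library's own `resource.tabular` check. The `notes` resource is also made up for the illustration.

```python
def select_group(descriptor, name):
    """Return the resource descriptors that belong to group `name`, or None."""
    members = [
        resource
        for resource in descriptor.get('resources', [])
        # Assumption: "tabular" is approximated by the profile value only.
        if resource.get('profile') == 'tabular-data-resource'
        and resource.get('group') == name
    ]
    # An empty selection is reported as None, like `package.get_group`.
    return members or None


descriptor = {
    'name': 'datapackage',
    'resources': [
        {'group': 'cars', 'name': 'cars-2017', 'profile': 'tabular-data-resource'},
        {'group': 'cars', 'name': 'cars-2018', 'profile': 'tabular-data-resource'},
        {'name': 'notes', 'profile': 'data-resource'},  # hypothetical, ungrouped
    ],
}

cars = select_group(descriptor, 'cars')
print([resource['name'] for resource in cars])
print(select_group(descriptor, 'bikes'))
```

Resources without a `group` field never match a name, so they stay outside every group.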

#### `package.infer(pattern=False)`

@@ -246,12 +254,13 @@ package.commit()
package.name # renamed-package
```

-#### `package.save(target=None, storage=None, **options)`
+#### `package.save(target=None, storage=None, merge_groups=False, **options)`

Saves this data package to storage if the `storage` argument is passed, saves the data package's descriptor to a JSON file if the `target` argument ends with `.json`, or saves the data package to a zip file otherwise.

- `target (string/filelike)` - the file path or a file-like object where the contents of this Data Package will be saved into.
- `storage (str/tableschema.Storage)` - storage name like `sql` or storage instance
- `merge_groups (bool)` - save all of a group's tabular resources into one bucket if a storage is provided (for example, into one SQL table). Read more about [Group](#group).
- `options (dict)` - storage options to use for storage creation
- `(exceptions.DataPackageException)` - raises if there was some error writing the package
- `(bool)` - return true on success
@@ -552,6 +561,138 @@ Saves this resource into storage if `storage` argument is passed or saves this r
- `(exceptions.DataPackageException)` - raises error if something goes wrong
- `(bool)` - returns true on success

### Group

A class representing a group of tabular resources. Groups can be used to read multiple resources as one or to export them, for example, to a database as one table. To define a group, add the `group: <name>` field to the corresponding resources. The group's metadata is taken from the "leading" resource's metadata (the first resource with the group name).

Suppose we have a data package with two tables partitioned by year and a shared schema stored separately:

> cars-2017.csv
```csv
name,value
bmw,2017
tesla,2017
nissan,2017
```

> cars-2018.csv
```csv
name,value
bmw,2018
tesla,2018
nissan,2018
```

> cars.schema.json
```json
{
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "value",
            "type": "integer"
        }
    ]
}
```

> datapackage.json
```json
{
    "name": "datapackage",
    "resources": [
        {
            "group": "cars",
            "name": "cars-2017",
            "path": "cars-2017.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        },
        {
            "group": "cars",
            "name": "cars-2018",
            "path": "cars-2018.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        }
    ]
}
```

Let's read the resources separately:

```python
from datapackage import Package

package = Package('datapackage.json')
package.get_resource('cars-2017').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
]
package.get_resource('cars-2018').read(keyed=True) == [
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
```

These resources are also defined with a `group: cars` field, which means we can treat them as a group:

```python
package = Package('datapackage.json')
package.get_group('cars').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
```
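
To see why the grouped read returns the 2017 rows followed by the 2018 rows, the same result can be reproduced with the standard library alone. This is a rough sketch, not the library's implementation: `iter_keyed`, `read_group` and the `CASTS` table are hypothetical names, and only the two field types from `cars.schema.json` are handled.

```python
import csv
import io
from itertools import chain

CARS_2017 = "name,value\nbmw,2017\ntesla,2017\nnissan,2017\n"
CARS_2018 = "name,value\nbmw,2018\ntesla,2018\nnissan,2018\n"

# Casts for the two field types used in cars.schema.json (assumption:
# no other types are needed for this example).
CASTS = {'string': str, 'integer': int}
SCHEMA = {'fields': [{'name': 'name', 'type': 'string'},
                     {'name': 'value', 'type': 'integer'}]}


def iter_keyed(text, schema):
    """Yield one dict per CSV row, cast according to the shared schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {field['name']: CASTS[field['type']](row[field['name']])
               for field in schema['fields']}


def read_group(texts, schema):
    """Chain the rows of every table in order, as if they were one table."""
    return list(chain.from_iterable(iter_keyed(text, schema) for text in texts))


rows = read_group([CARS_2017, CARS_2018], SCHEMA)
print(rows[0], rows[-1])
```

The order of the combined rows follows the order of the tables, which is why the order of resources in the descriptor matters for a group.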

We can use this approach when we need to save the data package to a storage, for example, to a SQL database. The `merge_groups` flag enables this grouping behaviour:

```python
package = Package('datapackage.json')
package.save(storage='sql', engine=engine)
# SQL tables:
# - cars-2017
# - cars-2018
package.save(storage='sql', engine=engine, merge_groups=True)
# SQL tables:
# - cars
```
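
As an illustration of the end state with `merge_groups=True`, the snippet below builds a single `cars` table by hand with the standard-library `sqlite3` module. It is only a sketch of the outcome under simplified assumptions (an in-memory database, hard-coded column types) and does not show how the SQL storage plugin actually creates or fills tables.

```python
import csv
import io
import sqlite3

# The two partitions from the example above, keyed by resource name.
TABLES = {
    'cars-2017': "name,value\nbmw,2017\ntesla,2017\nnissan,2017\n",
    'cars-2018': "name,value\nbmw,2018\ntesla,2018\nnissan,2018\n",
}

connection = sqlite3.connect(':memory:')
# One table named after the group instead of one table per resource.
connection.execute('CREATE TABLE cars (name TEXT, value INTEGER)')

# Append every partition to the same table, in resource order.
for text in TABLES.values():
    reader = csv.DictReader(io.StringIO(text))
    connection.executemany(
        'INSERT INTO cars (name, value) VALUES (?, ?)',
        [(row['name'], int(row['value'])) for row in reader],
    )

count = connection.execute('SELECT COUNT(*) FROM cars').fetchone()[0]
years = [row[0] for row in connection.execute(
    'SELECT DISTINCT value FROM cars ORDER BY value')]
print(count, years)
```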

#### `Group`

This class doesn't have any public constructor. Use `package.get_group`.

#### `group.name`

- `(str)` - returns the group name

#### `group.headers`

The same as `resource.headers`

#### `group.schema`

The same as `resource.schema`

#### `group.iter(...)`

The same as `resource.iter`

#### `group.read(...)`

The same as `resource.read`

### Profile

A component to represent a JSON Schema profile from the [Profiles Registry](https://specs.frictionlessdata.io/schemas/registry.json):
4 changes: 4 additions & 0 deletions data/datapackage-groups/cars-2016.csv
@@ -0,0 +1,4 @@
name,value
bmw,2016
tesla,2016
nissan,2016
4 changes: 4 additions & 0 deletions data/datapackage-groups/cars-2017.csv
@@ -0,0 +1,4 @@
name,value
bmw,2017
tesla,2017
nissan,2017
4 changes: 4 additions & 0 deletions data/datapackage-groups/cars-2018.csv
@@ -0,0 +1,4 @@
name,value
bmw,2018
tesla,2018
nissan,2018
Binary file added data/datapackage-groups/cars-2018.csv.zip
Binary file not shown.
12 changes: 12 additions & 0 deletions data/datapackage-groups/cars.schema.json
@@ -0,0 +1,12 @@
{
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "value",
            "type": "integer"
        }
    ]
}
28 changes: 28 additions & 0 deletions data/datapackage-groups/datapackage.json
@@ -0,0 +1,28 @@
{
    "name": "datapackage",
    "resources": [
        {
            "group": "cars",
            "name": "cars-2016",
            "path": "cars-2016.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        },
        {
            "group": "cars",
            "name": "cars-2017",
            "path": "cars-2017.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        },
        {
            "group": "cars",
            "name": "cars-2018",
            "path": "cars-2018.csv.zip",
            "profile": "tabular-data-resource",
            "compression": "zip",
            "format": "csv",
            "schema": "cars.schema.json"
        }
    ]
}
54 changes: 54 additions & 0 deletions datapackage/group.py
@@ -0,0 +1,54 @@
from itertools import chain


# Module API

class Group(object):

    # Public

    def __init__(self, resources):

        # Contract checks
        assert resources
        assert all([resource.tabular for resource in resources])
        assert all([resource.group for resource in resources])

        # Get props from the resources
        self.__name = resources[0].group
        self.__headers = resources[0].headers
        self.__schema = resources[0].schema
        self.__resources = resources

    @property
    def name(self):
        """https://github.com/frictionlessdata/datapackage-py#group
        """
        return self.__name

    @property
    def headers(self):
        """https://github.com/frictionlessdata/datapackage-py#group
        """
        return self.__headers

    @property
    def schema(self):
        """https://github.com/frictionlessdata/datapackage-py#group
        """
        return self.__schema

    def iter(self, **options):
        """https://github.com/frictionlessdata/datapackage-py#group
        """
        return chain(*[resource.iter(**options) for resource in self.__resources])

    def read(self, limit=None, **options):
        """https://github.com/frictionlessdata/datapackage-py#group
        """
        rows = []
        for count, row in enumerate(self.iter(**options), start=1):
            rows.append(row)
            if count == limit:
                break
        return rows
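
The `limit` handling in `read` can be tried out on plain lists. In the sketch below, `read_rows` is a hypothetical stand-in that copies the loop from `Group.read` but takes ready-made iterables instead of resources. It also shows one consequence of counting from 1: a `limit` of `0` never matches the counter, so nothing is cut off.

```python
from itertools import chain


def read_rows(sources, limit=None):
    """Copy of the loop in `Group.read`, applied to plain iterables."""
    rows = []
    for count, row in enumerate(chain(*sources), start=1):
        rows.append(row)
        if count == limit:
            break
    return rows


sources = [[1, 2, 3], [4, 5, 6]]

# The limit applies to the combined stream, so it can span two sources.
first_four = read_rows(sources, limit=4)

# Without a limit every row from every source is returned.
everything = read_rows(sources)

# count starts at 1, so it never equals 0 and the loop never breaks early.
limit_zero = read_rows(sources, limit=0)

print(first_four, everything, limit_zero)
```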
7 changes: 5 additions & 2 deletions datapackage/helpers.py
@@ -114,9 +114,12 @@ def dereference_resource_descriptor(descriptor, base_path, base_descriptor=None)
)

         # URI -> Remote
-        elif value.startswith('http'):
+        elif base_path.startswith('http') or value.startswith('http'):
             try:
-                response = requests.get(value)
+                fullpath = value
+                if not value.startswith('http'):
+                    fullpath = os.path.join(base_path, value)
+                response = requests.get(fullpath)
                 response.raise_for_status()
                 descriptor[property] = response.json()
             except Exception as error:
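
The new branch resolves a relative `schema` or `dialect` reference against a remote base path. A minimal sketch of that decision follows; `resolve_reference` is a hypothetical helper, the `example.com` URLs are placeholders, and `posixpath.join` stands in for `os.path.join` so that the result does not depend on the platform. For comparison, the last two lines show how `urllib.parse.urljoin` treats the same inputs.

```python
import posixpath
from urllib.parse import urljoin


def resolve_reference(base_path, value):
    """Return the URL to fetch, or None when neither part is remote."""
    if not (base_path.startswith('http') or value.startswith('http')):
        return None
    if value.startswith('http'):
        # An absolute URL is used as-is.
        return value
    # A relative reference is appended to the remote base path.
    return posixpath.join(base_path, value)


base = 'http://example.com/data'  # placeholder base path
joined = resolve_reference(base, 'cars.schema.json')
absolute = resolve_reference(base, 'http://example.org/other.json')
local = resolve_reference('data', 'cars.schema.json')

# urljoin treats the last path segment as a file unless the base ends with "/".
with_slash = urljoin(base + '/', 'cars.schema.json')
without_slash = urljoin(base, 'cars.schema.json')

print(joined, absolute, local, with_slash, without_slash)
```

Path joining and URL joining agree only when the base ends with a slash, which is worth keeping in mind when the base path is itself a URL.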
39 changes: 33 additions & 6 deletions datapackage/package.py
@@ -19,6 +19,7 @@
from tableschema import Storage
from .resource import Resource
from .profile import Profile
from .group import Group
from . import exceptions
from . import helpers
from . import config
@@ -180,6 +181,16 @@ def remove_resource(self, name):
        self.__build()
        return resource

    def get_group(self, name):
        """https://github.com/frictionlessdata/datapackage-py#package
        """
        resources = [resource
            for resource in self.resources
            if resource.tabular and resource.group == name]
        if not resources:
            return None
        return Group(resources)

    def infer(self, pattern=False):
        """https://github.com/frictionlessdata/datapackage-py#package
        """
@@ -222,7 +233,7 @@ def commit(self, strict=None):
         self.__build()
         return True

-    def save(self, target=None, storage=None, **options):
+    def save(self, target=None, storage=None, merge_groups=False, **options):
         """https://github.com/frictionlessdata/datapackage-py#package
         """

@@ -232,16 +243,32 @@ def save(self, target=None, storage=None, **options):
                 storage = Storage.connect(storage, **options)
             buckets = []
             schemas = []
+            sources = []
+            group_names = []
             for resource in self.resources:
-                if resource.tabular:
+                if not resource.tabular:
+                    continue
+                if merge_groups and resource.group:
+                    if resource.group in group_names:
+                        continue
+                    group = self.get_group(resource.group)
+                    name = group.name
+                    schema = group.schema
+                    source = group.iter
+                    group_names.append(name)
+                else:
                     resource.infer()
-                    buckets.append(_slugify_resource_name(resource.name))
-                    schemas.append(resource.schema.descriptor)
+                    name = resource.name
+                    schema = resource.schema
+                    source = resource.iter
+                buckets.append(_slugify_resource_name(name))
+                schemas.append(schema.descriptor)
+                sources.append(source)
             schemas = list(map(_slugify_foreign_key, schemas))
             storage.create(buckets, schemas, force=True)
             for bucket in storage.buckets:
-                resource = self.resources[buckets.index(bucket)]
-                storage.write(bucket, resource.iter())
+                source = sources[buckets.index(bucket)]
+                storage.write(bucket, source())

         # Save descriptor to json
         elif str(target).endswith('.json'):
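
The bucket selection in the storage branch of `save` can be followed on plain data. The sketch below is an approximation with hypothetical names: `plan_buckets` takes `(name, group)` pairs instead of `Resource` objects, skips the slugifying step, and assumes every listed resource is tabular. The `owners` resource is made up to show an ungrouped entry.

```python
def plan_buckets(resources, merge_groups=False):
    """Return bucket names in the order the storage branch would create them.

    `resources` is a list of (name, group) pairs; group may be None.
    """
    buckets = []
    seen_groups = []
    for name, group in resources:
        if merge_groups and group:
            if group in seen_groups:
                continue  # the group already has its bucket
            seen_groups.append(group)
            buckets.append(group)
        else:
            # Ungrouped resources, or any resource when merging is off,
            # keep a bucket of their own.
            buckets.append(name)
    return buckets


resources = [('cars-2016', 'cars'), ('cars-2017', 'cars'),
             ('cars-2018', 'cars'), ('owners', None)]  # 'owners' is hypothetical

separate = plan_buckets(resources)
merged = plan_buckets(resources, merge_groups=True)
print(separate, merged)
```

Because a group's bucket is created when its first member is encountered, the merged bucket takes the position of the leading resource.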
6 changes: 6 additions & 0 deletions datapackage/resource.py
@@ -93,6 +93,12 @@ def descriptor(self):
        # Never use self.descriptor inside self class (!!!)
        return self.__next_descriptor

    @property
    def group(self):
        """https://github.com/frictionlessdata/datapackage-py#resource
        """
        return self.__current_descriptor.get('group')

    @property
    def name(self):
        """https://github.com/frictionlessdata/datapackage-py#resource
3 changes: 2 additions & 1 deletion setup.py
@@ -29,9 +29,10 @@ def read(*paths):
     'unicodecsv>=0.14',
     'jsonpointer>=1.10',
     'tableschema>=1.1.0',
-    'tabulator>=1.20',
+    'tabulator>=1.24.1',
 ]
 TESTS_REQUIRE = [
+    'tableschema-sql',
     'pylama',
     'pytest',
     'mock',