Implemented groups of resources (#243)

* Implemented Group class
* Implemented package.get_group
* Added package.get_group test
* Required tabulator>1.24.1
* Fixed linting
* Skip compression tests in Python2
* Added `tableschema-sql` to tests require
* Added resource.group
* Fixed package.save for storage
* Implemented `merge_groups` for `package_save`
* Added documentation
* Updated readme
* Updated readme
* Updated readme
* Updated readme
* Fixed tests
* Updated readme
* Fixed remote schema/dialect
frictionlessdata · Aug 27, 2019 · 503412e
1 parent effd75b
Showing 14 changed files with 385 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -25,6 +25,7 @@ A library for working with [Data Packages](http://specs.frictionlessdata.io/data
   - [Documentation](#documentation)
     - [Package](#package)
     - [Resource](#resource)
+    - [Group](#group)
     - [Profile](#profile)
     - [validate](#validate)
     - [infer](#infer)
@@ -215,6 +216,13 @@ Remove data package resource by name. The data package descriptor will be valida
 - `(exceptions.DataPackageException)` - raises error if something goes wrong
 - `(Resource/None)` - returns removed `Resource` instances or null if not found
 
+#### `package.get_group(name)`
+
+Returns a group of tabular resources by name. For more information about groups see [Group](#group).
+
+- `name (str)` - name of a group of resources
+- `(exceptions.DataPackageException)` - raises error if something goes wrong
+- `(Group/None)` - returns a `Group` instance or null if not found
 
 #### `package.infer(pattern=False)`
 
@@ -246,12 +254,13 @@ package.commit()
 package.name # renamed-package
 ```
 
-#### `package.save(target=None, storage=None, **options)`
+#### `package.save(target=None, storage=None, merge_groups=False, **options)`
 
 Saves this data package to storage if the `storage` argument is passed, saves this data package's descriptor to a JSON file if the `target` argument ends with `.json`, or saves this data package to a zip file otherwise.
 
 - `target (string/filelike)` - the file path or a file-like object where the contents of this Data Package will be saved into.
 - `storage (str/tableschema.Storage)` - storage name like `sql` or storage instance
+- `merge_groups (bool)` - save all of a group's tabular resources into one bucket if a storage is provided (for example, into one SQL table). Read more about [Group](#group).
 - `options (dict)` - storage options to use for storage creation
 - `(exceptions.DataPackageException)` - raises if there was some error writing the package
 - `(bool)` - return true on success
@@ -552,6 +561,138 @@ Saves this resource into storage if `storage` argument is passed or saves this r
 - `(exceptions.DataPackageException)` - raises error if something goes wrong
 - `(bool)` - returns true on success
 
+### Group
+
+A class representing a group of tabular resources. Groups can be used to read multiple resources as one or to export them, for example, to a database as one table. To define a group, add the `group: <name>` field to the corresponding resources. The group's metadata will be created from the "leading" resource's metadata (the first resource with the group name).
+
+Suppose we have a data package with two tables partitioned by year and a shared schema stored separately:
+
+> cars-2017.csv
+
+```csv
+name,value
+bmw,2017
+tesla,2017
+nissan,2017
+```
+
+> cars-2018.csv
+
+```csv
+name,value
+bmw,2018
+tesla,2018
+nissan,2018
+```
+
+> cars.schema.json
+
+```json
+{
+    "fields": [
+        {
+            "name": "name",
+            "type": "string"
+        },
+        {
+            "name": "value",
+            "type": "integer"
+        }
+    ]
+}
+```
+
+> datapackage.json
+
+```json
+{
+    "name": "datapackage",
+    "resources": [
+        {
+            "group": "cars",
+            "name": "cars-2017",
+            "path": "cars-2017.csv",
+            "profile": "tabular-data-resource",
+            "schema": "cars.schema.json"
+        },
+        {
+            "group": "cars",
+            "name": "cars-2018",
+            "path": "cars-2018.csv",
+            "profile": "tabular-data-resource",
+            "schema": "cars.schema.json"
+        }
+    ]
+}
+```
+
+Let's read the resources separately:
+
+```python
+from datapackage import Package
+
+package = Package('datapackage.json')
+package.get_resource('cars-2017').read(keyed=True) == [
+    {'name': 'bmw', 'value': 2017},
+    {'name': 'tesla', 'value': 2017},
+    {'name': 'nissan', 'value': 2017},
+]
+package.get_resource('cars-2018').read(keyed=True) == [
+    {'name': 'bmw', 'value': 2018},
+    {'name': 'tesla', 'value': 2018},
+    {'name': 'nissan', 'value': 2018},
+]
+```
+
+On the other hand, these resources are defined with a `group: cars` field, which means we can treat them as a group:
+
+```python
+package = Package('datapackage.json')
+package.get_group('cars').read(keyed=True) == [
+    {'name': 'bmw', 'value': 2017},
+    {'name': 'tesla', 'value': 2017},
+    {'name': 'nissan', 'value': 2017},
+    {'name': 'bmw', 'value': 2018},
+    {'name': 'tesla', 'value': 2018},
+    {'name': 'nissan', 'value': 2018},
+]
+```
+
+We can use this approach when we need to save the data package to a storage, for example, to a SQL database. The `merge_groups` flag enables the grouping behaviour:
+
+```python
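+from sqlalchemy import create_engine
+
+# Assumption for illustration: an in-memory SQLite engine; any SQLAlchemy engine should work
+engine = create_engine('sqlite://')
+package = Package('datapackage.json')
+package.save(storage='sql', engine=engine)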
+# SQL tables:
+# - cars-2017
+# - cars-2018
+package.save(storage='sql', engine=engine, merge_groups=True)
+# SQL tables:
+# - cars
+```
+
+#### `Group`
+
+This class doesn't have any public constructor. Use `package.get_group`.
+
+#### `group.name`
+
+- `(str)` - returns the group name
+
+#### `group.headers`
+
+The same as `resource.headers`
+
+#### `group.schema`
+
+The same as `resource.schema`
+
+#### `group.iter(...)`
+
+The same as `resource.iter`
+
+#### `group.read(...)`
+
+The same as `resource.read`
+
 ### Profile
 
 A component to represent a JSON Schema profile from the [Profiles Registry](https://specs.frictionlessdata.io/schemas/registry.json):

diff --git a/data/datapackage-groups/cars-2016.csv b/data/datapackage-groups/cars-2016.csv
@@ -0,0 +1,4 @@
+name,value
+bmw,2016
+tesla,2016
+nissan,2016
diff --git a/data/datapackage-groups/cars-2017.csv b/data/datapackage-groups/cars-2017.csv
@@ -0,0 +1,4 @@
+name,value
+bmw,2017
+tesla,2017
+nissan,2017
diff --git a/data/datapackage-groups/cars-2018.csv b/data/datapackage-groups/cars-2018.csv
@@ -0,0 +1,4 @@
+name,value
+bmw,2018
+tesla,2018
+nissan,2018
diff --git a/data/datapackage-groups/cars-2018.csv.zip b/data/datapackage-groups/cars-2018.csv.zip
diff --git a/data/datapackage-groups/cars.schema.json b/data/datapackage-groups/cars.schema.json
@@ -0,0 +1,12 @@
+{
+    "fields": [
+        {
+            "name": "name",
+            "type": "string"
+        },
+        {
+            "name": "value",
+            "type": "integer"
+        }
+    ]
+}
diff --git a/data/datapackage-groups/datapackage.json b/data/datapackage-groups/datapackage.json
@@ -0,0 +1,28 @@
+{
+    "name": "datapackage",
+    "resources": [
+        {
+            "group": "cars",
+            "name": "cars-2016",
+            "path": "cars-2016.csv",
+            "profile": "tabular-data-resource",
+            "schema": "cars.schema.json"
+        },
+        {
+            "group": "cars",
+            "name": "cars-2017",
+            "path": "cars-2017.csv",
+            "profile": "tabular-data-resource",
+            "schema": "cars.schema.json"
+        },
+        {
+            "group": "cars",
+            "name": "cars-2018",
+            "path": "cars-2018.csv.zip",
+            "profile": "tabular-data-resource",
+            "compression": "zip",
+            "format": "csv",
+            "schema": "cars.schema.json"
+        }
+    ]
+}
diff --git a/datapackage/group.py b/datapackage/group.py
@@ -0,0 +1,54 @@
+from itertools import chain
+
+
+# Module API
+
+class Group(object):
+
+    # Public
+
+    def __init__(self, resources):
+
+        # Contract checks
+        assert resources
+        assert all([resource.tabular for resource in resources])
+        assert all([resource.group for resource in resources])
+
+        # Get props from the resources
+        self.__name = resources[0].group
+        self.__headers = resources[0].headers
+        self.__schema = resources[0].schema
+        self.__resources = resources
+
+    @property
+    def name(self):
+        """https://github.com/frictionlessdata/datapackage-py#group
+        """
+        return self.__name
+
+    @property
+    def headers(self):
+        """https://github.com/frictionlessdata/datapackage-py#group
+        """
+        return self.__headers
+
+    @property
+    def schema(self):
+        """https://github.com/frictionlessdata/datapackage-py#group
+        """
+        return self.__schema
+
+    def iter(self, **options):
+        """https://github.com/frictionlessdata/datapackage-py#group
+        """
+        return chain(*[resource.iter(**options) for resource in self.__resources])
+
+    def read(self, limit=None, **options):
+        """https://github.com/frictionlessdata/datapackage-py#group
+        """
+        rows = []
+        for count, row in enumerate(self.iter(**options), start=1):
+            rows.append(row)
+            if count == limit:
+                break
+        return rows
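
The `Group` class above reads its resources back to back: `iter` chains the per-resource row iterators, and `read` stops once `limit` rows have been collected. The following is a minimal, self-contained sketch of that behaviour. `StubResource`, `iter_group`, and `read_group` are hypothetical names used only for illustration; they are not part of the library.

```python
from itertools import chain


class StubResource(object):
    """Hypothetical stand-in for a tabular resource (illustration only)."""

    def __init__(self, group, rows):
        self.group = group
        self.rows = rows

    def iter(self, **options):
        # A real resource streams rows from its source; here we replay a list.
        return iter(self.rows)


def iter_group(resources, **options):
    # Same chaining as Group.iter: rows of each resource follow one another.
    return chain(*[resource.iter(**options) for resource in resources])


def read_group(resources, limit=None, **options):
    # Same loop as Group.read: collect rows until `limit` rows are gathered.
    rows = []
    for count, row in enumerate(iter_group(resources, **options), start=1):
        rows.append(row)
        if count == limit:
            break
    return rows


resources = [
    StubResource('cars', [['bmw', 2017], ['tesla', 2017]]),
    StubResource('cars', [['bmw', 2018], ['tesla', 2018]]),
]
all_rows = read_group(resources)
first_three = read_group(resources, limit=3)
```

Because the counter starts at 1, a `limit` of `0` never matches in this loop, so it behaves like no limit at all.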
diff --git a/datapackage/helpers.py b/datapackage/helpers.py
@@ -114,9 +114,12 @@ def dereference_resource_descriptor(descriptor, base_path, base_descriptor=None)
                 )
 
         # URI -> Remote
-        elif value.startswith('http'):
+        elif base_path.startswith('http') or value.startswith('http'):
             try:
-                response = requests.get(value)
+                fullpath = value
+                if not value.startswith('http'):
+                    fullpath = os.path.join(base_path, value)
+                response = requests.get(fullpath)
                 response.raise_for_status()
                 descriptor[property] = response.json()
             except Exception as error:
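
The hunk above lets a relative schema or dialect reference be fetched when the package itself was loaded from a remote base path. A rough sketch of just the path resolution is shown below. `resolve_remote` is a hypothetical helper, it assumes both arguments are strings, and it uses `posixpath.join` as a platform-independent stand-in for the `os.path.join` call in the diff.

```python
import posixpath


def resolve_remote(base_path, value):
    """Return the URL to fetch, or None if the reference is not remote.

    Hypothetical helper mirroring the branch condition in the diff.
    """
    if not (base_path.startswith('http') or value.startswith('http')):
        return None
    if value.startswith('http'):
        # Absolute URLs are used as-is.
        return value
    # Relative references are joined onto the remote base path.
    return posixpath.join(base_path, value)


relative = resolve_remote('http://example.com/data', 'cars.schema.json')
absolute = resolve_remote('data', 'http://example.com/other.json')
local = resolve_remote('data', 'cars.schema.json')
```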

diff --git a/datapackage/package.py b/datapackage/package.py
@@ -19,6 +19,7 @@
 from tableschema import Storage
 from .resource import Resource
 from .profile import Profile
+from .group import Group
 from . import exceptions
 from . import helpers
 from . import config
@@ -180,6 +181,16 @@ def remove_resource(self, name):
             self.__build()
         return resource
 
+    def get_group(self, name):
+        """https://github.com/frictionlessdata/datapackage-py#package
+        """
+        resources = [resource
+            for resource in self.resources
+            if resource.tabular and resource.group == name]
+        if not resources:
+            return None
+        return Group(resources)
+
     def infer(self, pattern=False):
         """https://github.com/frictionlessdata/datapackage-py#package
         """
@@ -222,7 +233,7 @@ def commit(self, strict=None):
         self.__build()
         return True
 
-    def save(self, target=None, storage=None, **options):
+    def save(self, target=None, storage=None, merge_groups=False, **options):
         """https://github.com/frictionlessdata/datapackage-py#package
         """
 
@@ -232,16 +243,32 @@ def save(self, target=None, storage=None, **options):
                 storage = Storage.connect(storage, **options)
             buckets = []
             schemas = []
+            sources = []
+            group_names = []
             for resource in self.resources:
-                if resource.tabular:
+                if not resource.tabular:
+                    continue
+                if merge_groups and resource.group:
+                    if resource.group in group_names:
+                        continue
+                    group = self.get_group(resource.group)
+                    name = group.name
+                    schema = group.schema
+                    source = group.iter
+                    group_names.append(name)
+                else:
                     resource.infer()
-                    buckets.append(_slugify_resource_name(resource.name))
-                    schemas.append(resource.schema.descriptor)
+                    name = resource.name
+                    schema = resource.schema
+                    source = resource.iter
+                buckets.append(_slugify_resource_name(name))
+                schemas.append(schema.descriptor)
+                sources.append(source)
             schemas = list(map(_slugify_foreign_key, schemas))
             storage.create(buckets, schemas, force=True)
             for bucket in storage.buckets:
-                resource = self.resources[buckets.index(bucket)]
-                storage.write(bucket, resource.iter())
+                source = sources[buckets.index(bucket)]
+                storage.write(bucket, source())
 
         # Save descriptor to json
         elif str(target).endswith('.json'):
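
The storage branch above decides, per tabular resource, whether to create a bucket for the resource itself or a single bucket for its whole group. The bucket selection alone can be sketched as follows. `StubResource` and `plan_buckets` are hypothetical names for illustration, and the sketch leaves out the name slugification, schema handling, and writing done by the real method.

```python
class StubResource(object):
    """Hypothetical stand-in exposing only the attributes the loop reads."""

    def __init__(self, name, group=None, tabular=True):
        self.name = name
        self.group = group
        self.tabular = tabular


def plan_buckets(resources, merge_groups=False):
    buckets = []
    seen_groups = []
    for resource in resources:
        if not resource.tabular:
            continue  # non-tabular resources are never written to storage
        if merge_groups and resource.group:
            if resource.group in seen_groups:
                continue  # the group already has its bucket
            seen_groups.append(resource.group)
            buckets.append(resource.group)
        else:
            buckets.append(resource.name)
    return buckets


resources = [
    StubResource('cars-2017', group='cars'),
    StubResource('cars-2018', group='cars'),
    StubResource('owners'),
    StubResource('logo', tabular=False),
]
separate = plan_buckets(resources)
merged = plan_buckets(resources, merge_groups=True)
```

Resources without a `group` field keep their own bucket even when `merge_groups` is enabled.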

diff --git a/datapackage/resource.py b/datapackage/resource.py
@@ -93,6 +93,12 @@ def descriptor(self):
         # Never use self.descriptor inside self class (!!!)
         return self.__next_descriptor
 
+    @property
+    def group(self):
+        """https://github.com/frictionlessdata/datapackage-py#resource
+        """
+        return self.__current_descriptor.get('group')
+
     @property
     def name(self):
         """https://github.com/frictionlessdata/datapackage-py#resource

diff --git a/setup.py b/setup.py
@@ -29,9 +29,10 @@ def read(*paths):
     'unicodecsv>=0.14',
     'jsonpointer>=1.10',
     'tableschema>=1.1.0',
-    'tabulator>=1.20',
+    'tabulator>=1.24.1',
 ]
 TESTS_REQUIRE = [
+    'tableschema-sql',
     'pylama',
     'pytest',
     'mock',