Feature/plugins and storage #51

Merged
merged 28 commits into from
Mar 26, 2016

Conversation

@roll (Member) commented Mar 5, 2016

Changes:

@roll roll added this to the 0.6.0 milestone Mar 5, 2016
@roll roll self-assigned this Mar 5, 2016

Note: `validate()` validates whether a **schema** is a valid JSON Table Schema. It does **not** validate data against a schema.

### Export/import
Member

I don't think we should include this as part of the default package. We should refer to the plugins, but not include them by default. We can still have this section in the README, but make it clearer as a Plugins section and explain the downloads required for this to work.

Member Author

I thought both you and @vitorbaptista were leaning more toward an out-of-the-box solution (next step - #52).

Personally, I also prefer a distinction between core and plugins, where the user needs to install a plugin themselves, this description lives in the plugins section, and the export/import section has just general info about it.

We should sync on it with datapackage-py, so @vitorbaptista's opinion is required here.

Member

@roll no: definitely stuff like this as plugins, meaning, install them if you need them. The out-of-the-box thing was more about the registry and validate stuff, where I finally agreed that @vitorbaptista was right that they are too core to be external modules.

Member Author

I'm +1 on it - I'll update the PR.

Then datapackage-py should follow this logic too.

Contributor

We should strive to make the "most common path" as easy as possible. For example, I think importing and exporting to a zip file is pretty common with datapackage-py, so it's in core, but I wouldn't add (say) BigQuery.

Here maybe SQLAlchemy is common enough to be in core, but I don't know. I'm happy with whatever you guys decide.

@pwalsh (Member) commented Mar 6, 2016

@roll this is good for me. I also want @vitorbaptista to look at it, as we've all been in this conversation since the beginning. After his comments we can merge.

One note is that I think the original model class I wrote was a bit all over the place, and thinking in wider terms, we should probably break this up into two things soon (ref. here):

  1. A model class for the schema object (similar to most of what we have there already, exposing the schema and various helper methods)
  2. A Resource class which takes a SchemaModel and a data source, and looks something like the resource class in datapackage-py (data iterator, etc).

This pushes most of the Resource logic into this package, where it can be used standalone, and also as part of a data package via datapackage-py.

This would pretty much finalise the API of this package as far as I see.

@pwalsh pwalsh mentioned this pull request Mar 6, 2016
@roll (Member Author) commented Mar 6, 2016

I need some time to read the other things you've added, but I want to say right away that it's awesome you've mentioned the Resource class - I've been thinking about it a lot, because it's really the missing part of the abstractions (schema, resource = schema + data, datapackage = [schema + data]).

@pwalsh (Member) commented Mar 6, 2016

@roll yes, great. Actually, @vitorbaptista probably has most of the required Resource logic in datapackage-py, so it should be quite simple to do this, and it really is the missing link here. If you have a direction on that, I'd say go for it and do a PR on this repo with it.

@roll (Member Author) commented Mar 6, 2016

I've updated the README in the PR to make the distinction between core and plugins.

@pwalsh (Member) commented Mar 6, 2016

@roll ok great. We'll wait for @vitorbaptista to look early in the week, and then merge.

@roll (Member Author) commented Mar 7, 2016

Great! For datapackage-py the implementation could be the same - just copying, with minor changes, the code from package.py and helpers.py - https://github.com/okfn/datapackage-storage-py/tree/master/datapackage_storage - into things like dp.to_storage and a classmethod dp.from_storage - it's a question for Vitor what the API could be.

Some notes:

  • import and export should be swapped (in datapackage-storage-py we export from storage to the filesystem, but here we export a resource or datapackage to storage - the OOP approach)
  • instead of passing storage we're passing backend and **backend_options. To get a storage:
# I don't think we should catch an exception here because JTS plugin system
# raises exception with message describing lack of plugin problem
plugin = import_module('jsontableschema.plugins.%s' % backend)
storage = plugin.Storage(**backend_options)
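The two lines above can be wrapped into a small helper. This is a sketch only; `load_storage` is a hypothetical name, assuming the `jsontableschema.plugins.<backend>` module convention described in the comment:

```python
from importlib import import_module

def load_storage(backend, **backend_options):
    # Hypothetical helper illustrating the convention above: plugins live
    # under jsontableschema.plugins.<backend> and expose a Storage class.
    # No try/except here: a missing plugin surfaces as an ImportError
    # whose message already describes the problem.
    plugin = import_module('jsontableschema.plugins.%s' % backend)
    return plugin.Storage(**backend_options)
```

A caller would then do something like `storage = load_storage('sql', **options)`, and a missing plugin fails loudly at import time.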

with io.open(filepath) as stream:
    schema = json.load(stream)

is_valid = jsontableschema.validate(schema)
Contributor

With the current API, jsontableschema.validate is only used to catch exceptions. What I mean is that is_valid will never be False. This method should always be inside a try/except block, so I suggest changing this example to something like:

try:
    jsontableschema.validate(schema)
except jsontableschema.exceptions.SchemaValidationError as e:
    # handle errors
    pass

For this reason, the return value of jsontableschema.validate doesn't matter. I wouldn't return anything to make this clearer.

Member Author

It's not part of this PR's scope, but I think it's OK to fix it here - I've updated the PR.

@vitorbaptista (Contributor)

I always mix up importing and exporting resources. Importing a resource is going from a JSON Table Schema file to (say) a SQL table, and exporting is going from a SQL table to a JTS file, right?

Reading the code, I've seen that files are being created both while importing and exporting. I couldn't figure out why.

I'm also concerned about the tests. This PR is mostly about creating methods that delegate to other libraries, so it's OK to assume that those libraries work and just test that we're calling them correctly. I'd prefer to have fewer mocks, testing at least that the CSV files were written correctly, but it's OK.

There are some naming issues, with the test class named Test_export_resource. It should be TestExportResource but, looking into the code, I see that it tests both exporting and importing resources, so it should be TestResource. Also, the tests use unittest instead of py.test (check https://github.com/okfn/coding-standards#python-1). The mock package was moved into the standard library in Python 3 as well, so it should be imported as:

try:
    import mock
except ImportError:
    import unittest.mock as mock

To support both Python versions.

There are a few tests missing as well, especially for the utilities.py file. Ideally, we would have many small unit tests.

Overall, the changes look good. I would just like to improve the tests a bit before merging.

@pwalsh (Member) commented Mar 7, 2016

@roll good points from @vitorbaptista, as always. It would be great to prioritize this when you return from holiday so we can merge.

@roll (Member Author) commented Mar 8, 2016

Fixed test name - thanks.

About the unittest usage - it's not part of this PR (the resource code doesn't use it) - we could update tests across the whole codebase in a separate PR if needed (but I don't think it's high-priority work).

And about mock - we use the Python 3 mock backport (because mock wasn't part of the Python 2 stdlib at all) - https://pypi.python.org/pypi/mock - a cross-version way to have mock in py2/py3 projects.

@roll (Member Author) commented Mar 8, 2016

The most important thing - naming =)

When a user has a database, it's natural to export the database to the filesystem or import it from the filesystem.
When a user has a datapackage/resource, it's natural to export the resource to a database/filesystem etc., or import(?) it from a database/filesystem.

Here I really don't know which would be better, but I know we should be consistent across our whole codebase. So what do you (@vitorbaptista @pwalsh) think about import/export naming for:

  • the Datapackage class
  • the Resource class

Because if we have, for example, a Datapackage.export function exporting a datapackage to a database (which is natural for the OOP approach), we really can't have an export_resource function with the opposite semantics (db to dp).

Maybe a better way would be to not use the import/export words at all and use something else.

@vitorbaptista (Contributor)

Agreed that moving from unittest to py.test should be in a separate PR (and isn't high priority).

About naming: as the code is related to a jsontableschema resource, even though it isn't OOP, "export" and "import" should be from the point of view of a resource. Exporting makes more sense to me, as we're exporting a resource to another place (DB, ...), but importing is a bit tricky because we're not really importing a resource from someplace (DB) to someplace else (filesystem), but actually converting something (DB) into a resource.

Thinking in OOP terms, if I saw resource.export(), I would expect it to export the resource object to some place. Then Resource.import() would be a class method and would return a resource object with its data imported from the params.

In the end, export gives me a DB (or something else) with the resource's data, and import gives me a resource with the source's data.

Confusing...

@roll (Member Author) commented Mar 8, 2016

In our final design (with the Resource class) these import_resource and export_resource functions won't exist at all - maybe I'll try to provide an initial Resource class with this functionality tomorrow morning, because there seems to be no reason to release these functions only to remove them in the next version.

The other way is to release with better naming and mark this as an unstable part of the API. But I don't see much better naming for now.

@pwalsh (Member) commented Mar 8, 2016

@roll @vitorbaptista how about push (from JSON+CSV to database) and pull (from database to JSON+CSV)?

@roll (Member Author) commented Mar 8, 2016

Oh, that's much more interesting.

Across our codebase it could be:

  • save (save to zip?)
  • push (push to storage)
  • pull (pull from storage)
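As a sketch of how the push/pull vocabulary could read, here is a toy class with hypothetical names (not the library's actual API, using a plain dict in place of a real storage backend):

```python
class Resource:
    def __init__(self, schema, data):
        self.schema = schema
        self.data = data

    def push(self, storage):
        # push: write this resource's rows into a storage backend
        storage[self.schema['name']] = list(self.data)

    @classmethod
    def pull(cls, storage, name):
        # pull: build a resource from a storage backend; a classmethod
        # avoids the problem that "import" is a reserved word in Python
        return cls({'name': name}, list(storage[name]))
```

Reading `resource.push(storage)` and `Resource.pull(storage, name)` keeps both directions unambiguous from the resource's point of view.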

@vitorbaptista (Contributor)

I'm happy for this to be merged and to work on the naming afterwards. Maybe it'll become clearer when the code is changed to a Resource class. Overall, I prefer import/export over push/pull.

@roll (Member Author) commented Mar 8, 2016

A big downside of export/import is Python's problem with the single word import (it's a keyword). It's very annoying in API design when I can only use it in things like Resource.import_resource or other non-semantic hacks that aren't symmetrical with the export usage. So push/pull looks better to me.

But in our situation maybe we don't need the import word at all, because we have two use cases (as I see it):

datapackage/resource.export/to_storage()

and

instance = Datapackage/Resource.from_storage()
instance.save_to_filesystem()

If we have classes like Datapackage/Resource, the second one is more natural for me than a single import_resource()/Datapackage.pull() call in a procedural manner.
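The two use cases above could be sketched like this (a toy illustration with hypothetical names, using a plain dict as the storage; not the actual datapackage-py API):

```python
import json

class Datapackage:
    def __init__(self, descriptor):
        self.descriptor = descriptor

    def to_storage(self, storage):
        # use case 1: export this datapackage into a storage backend
        storage['descriptor'] = dict(self.descriptor)

    @classmethod
    def from_storage(cls, storage):
        # use case 2, step 1: construct an instance from a storage backend...
        return cls(storage['descriptor'])

    def save_to_filesystem(self, path):
        # use case 2, step 2: ...then explicitly save it to the filesystem
        with open(path, 'w') as f:
            json.dump(self.descriptor, f)
```

The point is that from_storage is a constructor (classmethod) rather than a procedural import_* function, so saving to the filesystem becomes a separate, explicit step.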

@roll (Member Author) commented Mar 9, 2016

I thought a little bit more about it and I have to agree with Vitor - without the Resource design we can't yet say how it should look.

For example, a use case: a very big resource stored in BigQuery - we probably want to provide an interface to work with it without saving it to the filesystem or into memory (the Storage interface supports this at a low level). This use case really affects the Resource design. We need to check this and other scenarios before making any decision on naming.

So I see two options for now:

  • be more iterative
    • rename export_resource to push_resource and import_resource to pull_resource (we still need this renaming because currently we can't even tell which direction export/import means). Another option is upload/download instead of push/pull, because these functions really do upload and download (filesystem to storage and back).
    • mark this API as unstable in the docs
    • merge this
    • add push_package and pull_package to datapackage-py as standalone functions, without
      integrating them into the Datapackage class (just as side utils)
    • mark this API as unstable in the docs
    • deprecate datapackage-storage-py
    • with this, our consolidation card will be done - we will have real working tools (we will need them for OpenTrials soon)
    • design Resource, integrate storages with Datapackage etc. in the next iteration
  • be less iterative
    • do not merge
    • design Resource first

@pwalsh (Member) commented Mar 11, 2016

@roll I'm happy to go with your last points. I'd rather we design Resource now for use across both the jts and dp libs, but I think your analysis and approach is more reasonable given our current resources. Please do this so we can tidy up the APIs of all these libs, and open a new issue with everything we've said about Resource, and we'll target that for the subsequent release.

@roll roll removed their assignment Mar 12, 2016
@coveralls

Coverage Status

Changes unknown when pulling bd21ab4 on feature/plugins-and-storage into master.

@coveralls

Coverage Status

Coverage decreased (-3.7%) to 91.737% when pulling 21d73b5 on feature/plugins-and-storage into 1e2445b on master.

@coveralls

Coverage Status

Coverage decreased (-4.9%) to 90.572% when pulling 442d22d on feature/plugins-and-storage into 1e2445b on master.

@roll roll merged commit 6a46f66 into master Mar 26, 2016
@roll roll deleted the feature/plugins-and-storage branch March 26, 2016 19:14
@roll roll removed this from the 0.6.0 milestone Mar 28, 2016
4 participants