Feature/plugins and storage #51

Merged
merged 28 commits into from
Mar 26, 2016

Conversation

@roll (Member) commented Mar 5, 2016

Changes:

@roll roll added this to the 0.6.0 milestone Mar 5, 2016
@roll roll self-assigned this Mar 5, 2016

Note: `validate()` validates whether a **schema** is a valid JSON Table Schema. It does **not** validate data against a schema.

### Export/import
Member

I don't think we should include this as part of the default package. We should refer to the plugins, but not include them by default. We can still have this section in the README, but make it clearer as a Plugins section and explain the downloads required for this to work.

Member Author

I thought both you and @vitorbaptista were leaning more toward an out-of-the-box solution (next step - #52).

Personally, I also prefer a distinction between core and plugins, where the user needs to install a plugin themselves, this description lives in the plugins section, and the export/import section has just general info about it.

We should sync on it with datapackage-py, so @vitorbaptista's opinion is required here.

Member

@roll no: definitely stuff like this as plugins, meaning, install them if you need them. The out-of-the-box thing was more about the registry and validate stuff, where I finally agreed that @vitorbaptista was right that they are too core to be external modules.

Member Author

I'm +1 on it - I'll update the PR.

Then datapackage-py should follow this logic too.

Contributor

We should strive to make the "most common path" as easy as possible. For example, I think importing and exporting to a zip file is pretty common with datapackage-py, so it's in core, but I wouldn't add (say) BigQuery.

Here maybe SQLAlchemy is common enough to be in core, but I don't know. I'm happy with whatever you guys decide.

@pwalsh (Member) commented Mar 6, 2016

@roll this is good for me. I also want @vitorbaptista to look at it, as we've all been in this conversation since the beginning. After his comments we can merge.

One note is that I think the original model class I wrote was a bit all over the place, and thinking in wider terms, we should probably break this up into two things soon (ref. here):

  1. A model class for the schema object (similar to most of what we have there already, exposing the schema and various helper methods)
  2. A Resource class which takes a SchemaModel and a data source, and looks something like the resource class in datapackage-py (data iterator, etc).

This pushes most of the Resource logic into this package, where it can be used standalone, and also as part of a data package via datapackage-py.

This would pretty much finalise the API of this package as far as I see.

@pwalsh pwalsh mentioned this pull request Mar 6, 2016
@roll (Member Author) commented Mar 6, 2016

I need some time to read the other things you've added, but I want to say right away that it's awesome you've mentioned the Resource class - I've been thinking about it a lot, because it's really the missing part of the abstractions (schema, resource = schema + data, datapackage = [schema + data]).

@pwalsh (Member) commented Mar 6, 2016

@roll yes, great. Actually, @vitorbaptista probably has most of the required Resource logic in datapackage-py, so it should be quite simple to do this, and it really is the missing link here. If you have a direction on that, I'd say go for it and do a PR on this repo with it.

@roll (Member Author) commented Mar 6, 2016

I've updated the README in the PR to make the distinction between core and plugins.

@pwalsh (Member) commented Mar 6, 2016

@roll ok great. We'll wait for @vitorbaptista to look early in the week, and then merge.

@roll (Member Author) commented Mar 7, 2016

Great! For datapackage-py the implementation could be the same - just copying, with minor changes, the code from package.py and helpers.py - https://github.com/okfn/datapackage-storage-py/tree/master/datapackage_storage - into things like dp.to_storage and a classmethod dp.from_storage - it's a question for Vitor what the API could be.

Some notes:

  • import and export should be swapped (in datapackage-storage-py we export from storage to the filesystem, but here we export a resource or datapackage to storage - the OOP approach)
  • instead of passing storage we're passing backend and **backend_options. To get a storage:
# I don't think we should catch an exception here because JTS plugin system
# raises exception with message describing lack of plugin problem
plugin = import_module('jsontableschema.plugins.%s' % backend)
storage = plugin.Storage(**backend_options)
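The two lines above can be wrapped into a small helper. This is a sketch only; `load_storage` is a hypothetical name, assuming the `jsontableschema.plugins.<backend>` module convention described in the comment:

```python
from importlib import import_module

def load_storage(backend, **backend_options):
    # Hypothetical helper illustrating the convention above: plugins live
    # under jsontableschema.plugins.<backend> and expose a Storage class.
    # No try/except here: a missing plugin surfaces as an ImportError
    # whose message already describes the problem.
    plugin = import_module('jsontableschema.plugins.%s' % backend)
    return plugin.Storage(**backend_options)
```

A caller would then do something like `storage = load_storage('sql', **options)`, and a missing plugin fails loudly at import time.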

with io.open(filepath) as stream:
    schema = json.load(stream)

is_valid = jsontableschema.validate(schema)
Contributor

With the current API, jsontableschema.validate is only used to catch exceptions. What I mean is that is_valid will never be False. This method should always be inside a try/except block, so I suggest changing this example to something like:

try:
    jsontableschema.validate(schema)
except jsontableschema.exceptions.SchemaValidationError as e:
    # handle errors
    pass

For this reason, the return value of jsontableschema.validate doesn't matter. I wouldn't return anything to make this clearer.

Member Author

It's not part of this PR's scope, but I think it's OK to fix it here - I've updated the PR.

@vitorbaptista (Contributor)

I always mix up importing and exporting resources. Importing a resource is going from a JSON Table Schema file to (say) a SQL table, and exporting is going from a SQL table to a JTS file, right?

Reading the code, I've seen that files are being created both while importing and exporting. I couldn't figure out why.

I'm also concerned about the tests. This PR is mostly about creating methods that delegate to other libraries, so it's OK to assume that those libraries work and just test that we're calling them correctly. I'd prefer to have fewer mocks, testing at least that the CSV files were written correctly, but it's OK.

There are some naming issues, with the test class named Test_export_resource. It should be TestExportResource but, looking into the code, I see that it tests both exporting and importing resources, so it should be TestResource. Also, the tests use unittest instead of py.test (check https://github.com/okfn/coding-standards#python-1). The mock package was moved into the standard library in Python 3 as well, so it should be imported as:

try:
    import mock
except ImportError:
    import unittest.mock as mock

To support both Python versions.

There are a few tests missing as well, especially for the utilities.py file. Ideally, we would have many small unit tests.

Overall, the changes look good. I would just like to improve the tests a bit before merging.

@pwalsh (Member) commented Mar 7, 2016

@roll good points from @vitorbaptista, as always. It would be great to prioritize this when you return from holiday so we can merge.

@roll (Member Author) commented Mar 8, 2016

Fixed test name - thanks.

About the unittest usage - it's not part of this PR (the resource code doesn't use it) - we could update tests across the whole codebase in a separate PR if needed (but I don't think it's high-priority work).

And about mock - we use the Python 3 mock backport (because mock wasn't part of the Python 2 stdlib at all) - https://pypi.python.org/pypi/mock - a cross-version way to have mock in py2/py3 projects.

@roll (Member Author) commented Mar 8, 2016

The most important thing - naming =)

When a user has a database, it's natural to export the database to the filesystem or import it from the filesystem.
When a user has a datapackage/resource, it's natural to export the resource to a database/filesystem etc., or import(?) it from a database/filesystem.

Here I really don't know which would be better, but I know we should be consistent across our whole codebase. So what do you (@vitorbaptista @pwalsh) think about import/export naming for:

  • the Datapackage class
  • the Resource class

Because if we have, for example, a Datapackage.export function exporting a datapackage to a database (which is natural for the OOP approach), we really can't have an export_resource function with the opposite semantics (db to dp).

Maybe a better way would be to not use the import/export words at all and use something else.

@vitorbaptista (Contributor)

Agreed that moving from unittest to py.test should be in a separate PR (and isn't high priority).

About naming: as the code is related to a jsontableschema resource, even though it isn't OOP, "export" and "import" should be from the point of view of a resource. Exporting makes more sense to me, as we're exporting a resource to another place (DB, ...), but importing is a bit tricky because we're not really importing a resource from someplace (DB) to someplace else (filesystem), but actually converting something (DB) into a resource.

Thinking in OOP terms, if I saw resource.export(), I would expect it to export the resource object to some place. Then Resource.import() would be a class method and would return a resource object with its data imported from the params.

In the end, export gives me a DB (or something else) with the resource's data, and import gives me a resource with the source's data.

Confusing...

@roll (Member Author) commented Mar 8, 2016

In our final design (with the Resource class) these import_resource and export_resource functions won't exist at all - maybe I'll try to provide an initial Resource class with this functionality tomorrow morning, because there seems to be no reason to release these functions only to remove them in the next version.

The other way is to release with better naming and mark this as an unstable part of the API. But I don't see much better naming for now.

@pwalsh (Member) commented Mar 8, 2016

@roll @vitorbaptista how about push (from JSON+CSV to database) and pull (from database to JSON+CSV)?

@roll (Member Author) commented Mar 8, 2016

Oh, that's much more interesting.

Across our codebase it could be:

  • save (save to zip?)
  • push (push to storage)
  • pull (pull from storage)
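As a sketch of how the push/pull vocabulary could read, here is a toy class with hypothetical names (not the library's actual API, using a plain dict in place of a real storage backend):

```python
class Resource:
    def __init__(self, schema, data):
        self.schema = schema
        self.data = data

    def push(self, storage):
        # push: write this resource's rows into a storage backend
        storage[self.schema['name']] = list(self.data)

    @classmethod
    def pull(cls, storage, name):
        # pull: build a resource from a storage backend; a classmethod
        # avoids the problem that "import" is a reserved word in Python
        return cls({'name': name}, list(storage[name]))
```

Reading `resource.push(storage)` and `Resource.pull(storage, name)` keeps both directions unambiguous from the resource's point of view.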

@vitorbaptista (Contributor)

I'm happy for this to be merged and to work on the naming afterwards. Maybe it'll become clearer when the code is changed to a Resource class. Overall, I prefer import/export over push/pull.

@roll (Member Author) commented Mar 8, 2016

A big downside of export/import is Python's problem with the single word import (it's a keyword). It's very annoying in API design when I can only use it in things like Resource.import_resource or other non-semantic hacks that aren't symmetrical with the export usage. So push/pull looks better to me.

But in our situation maybe we don't need the import word at all, because we have two use cases (as I see it):

datapackage/resource.export/to_storage()

and

instance = Datapackage/Resource.from_storage()
instance.save_to_filesystem()

If we have classes like Datapackage/Resource, the second one is more natural for me than a single import_resource()/Datapackage.pull() call in a procedural manner.
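The two use cases above could be sketched like this (a toy illustration with hypothetical names, using a plain dict as the storage; not the actual datapackage-py API):

```python
import json

class Datapackage:
    def __init__(self, descriptor):
        self.descriptor = descriptor

    def to_storage(self, storage):
        # use case 1: export this datapackage into a storage backend
        storage['descriptor'] = dict(self.descriptor)

    @classmethod
    def from_storage(cls, storage):
        # use case 2, step 1: construct an instance from a storage backend...
        return cls(storage['descriptor'])

    def save_to_filesystem(self, path):
        # use case 2, step 2: ...then explicitly save it to the filesystem
        with open(path, 'w') as f:
            json.dump(self.descriptor, f)
```

The point is that from_storage is a constructor (classmethod) rather than a procedural import_* function, so saving to the filesystem becomes a separate, explicit step.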

@roll (Member Author) commented Mar 9, 2016

I thought a little bit more about it and I have to agree with Vitor - without the Resource design we can't yet say how it should look.

For example, a use case: a very big resource stored in BigQuery - we probably want to provide an interface to work with it without saving it to the filesystem or into memory (the Storage interface supports this at a low level). This use case really affects the Resource design. We need to check this and other scenarios before making any decision on naming.

So I see two options for now:

  • be more iterative
    • rename export_resource to push_resource and import_resource to pull_resource (we still need this renaming because currently we can't even tell which direction export/import means). Another option is upload/download instead of push/pull, because these functions really do upload and download (filesystem to storage and back).
    • mark this API as unstable in the docs
    • merge this
    • add push_package and pull_package to datapackage-py as standalone functions, without
      integrating them into the Datapackage class (just as side utils)
    • mark this API as unstable in the docs
    • deprecate datapackage-storage-py
    • with this, our consolidation card will be done - we will have real working tools (we will need them for OpenTrials soon)
    • design Resource, integrate storages with Datapackage etc. in the next iteration
  • be less iterative
    • do not merge
    • design Resource first

@pwalsh (Member) commented Mar 11, 2016

@roll I'm happy to go with your last points. I'd rather we design Resource now for use across both the jts and dp libs, but I think your analysis and approach is more reasonable given our current resources. Please do this so we can tidy up the APIs of all these libs, and open a new issue with everything we've said about Resource, and we'll target that for the subsequent release.

@roll roll removed their assignment Mar 12, 2016
@coveralls

Coverage Status

Changes unknown when pulling bd21ab4 on feature/plugins-and-storage into master.

@coveralls

Coverage Status

Coverage decreased (-3.7%) to 91.737% when pulling 21d73b5 on feature/plugins-and-storage into 1e2445b on master.

@coveralls

Coverage Status

Coverage decreased (-4.9%) to 90.572% when pulling 442d22d on feature/plugins-and-storage into 1e2445b on master.

@roll roll merged commit 6a46f66 into master Mar 26, 2016
@roll roll deleted the feature/plugins-and-storage branch March 26, 2016 19:14
@roll roll removed this from the 0.6.0 milestone Mar 28, 2016
4 participants