Metadata version control with git #170

Open
isedwards opened this issue Jul 23, 2013 · 8 comments

Comments

@isedwards

Try using a Git repository as a pycsw backend, with CSW providing the search interface and Git serving as an alternative to CSW-T.

Raised here: http://osgeo-org.1560.x6.nabble.com/General-CSW-questions-tp5067534p5067661.html

@isedwards
Author

I prefer Git for most things, but perhaps fossil-scm is also worthy of attention on this ticket (with its single-file repository based on an SQLite database, and its immutable history)?

The geonetwork experience (using svn) is here:
"Not all records in GeoNetwork are tracked as the compute and systems admin cost of this tracking for every record, particularly in large catalogs, is too high."
http://geonetwork-opensource.org/manuals/trunk/eng/users/managing_metadata/versioning/index.html

@tomkralidis
Member

Git seems like a good first step to implement an scm backend design pattern, which we can then apply to fossil-scm, svn, etc.

Some options/thinking out loud:

  • manage metadata in Git, and simply have a process/script to update the underlying pycsw repository from Git periodically, or as a post-commit hook. People could interact with Git by other means?
  • enhance CSW-T to additionally transact with the scm (thereby having a managed copy of the metadata in Git as well as the CSW repository), when a user does insert/update/delete
  • Implement CSW extensions kind of like GeoServer does (http://geoserver.org/display/GEOS/Versioning+WFS+-+Extensions), with GetLog, GetDiff, and enhancing GetRecordById to fetch by a given version (recordVersion)
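A minimal sketch of the first option, assuming the hook has access to the output of `git diff --name-status <old> <new>` and that each record lives in its own `.xml` file. The function name `classify_changes` and the action labels are hypothetical; applying the resulting actions to the pycsw repository is deliberately left out.

```python
from typing import Dict, List


def classify_changes(name_status: str, suffix: str = ".xml") -> Dict[str, List[str]]:
    """Group paths from `git diff --name-status` output by the action to apply.

    Hypothetical helper: the action names are illustrative, not pycsw API.
    """
    actions: Dict[str, List[str]] = {"insert": [], "update": [], "delete": []}
    for line in name_status.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status, paths = parts[0], parts[1:]
        if status.startswith("R") and len(paths) == 2:
            # A rename removes the old record path and creates the new one.
            old, new = paths
            if old.endswith(suffix):
                actions["delete"].append(old)
            if new.endswith(suffix):
                actions["insert"].append(new)
            continue
        path = paths[-1]
        if not path.endswith(suffix):
            continue  # ignore files that are not metadata records
        if status[0] in ("A", "C"):  # added, or copied to a new path
            actions["insert"].append(path)
        elif status[0] == "M":
            actions["update"].append(path)
        elif status[0] == "D":
            actions["delete"].append(path)
    return actions
```

A periodic job or a post-commit hook could feed the diff between the last synchronized commit and the current one into this function, then insert, update, or delete the corresponding records in the pycsw repository.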

Auth: I haven't given much thought yet to access control against specific elements; however, it would be best to leverage an existing auth mechanism rather than creating one inline.

Migrations: the way the pycsw repository works, it is kind of agnostic to the structure of metadata records per se, but we should look into DB migrations regardless, for times where the underlying model itself changes.

@rclark

rclark commented Aug 21, 2013

The first bullet seems very tractable, and would make for a great demonstration of the idea.

The second point would be required in the end, although honestly the major benefit of a git backend would be that you could manage the metadata content without CSW-T.

Third point is less intriguing to me -- again, less interested in CSW-based access to versioning. CSW's primary focus should be on search and discovery, and we can let real-life version control systems do the version control.

It would also be worth exploring Git as a more efficient mechanism for harvesting than CSW's protocol.

What would be stellar would be a git repo as a replacement for, not in addition to, the spatial database, but then you would certainly need some other mechanism for indexing... Maybe something like CouchDB is another backend to consider?

@kalxas
Member

kalxas commented Aug 21, 2013

Mercurial would also be a good choice as a back-end, since it is written in Python and is very similar to Git.

Regarding CouchDB, there is an open issue #120 :)

@tomkralidis
Member

@rclark good points here. I think a Git repo as the backend is a good next step.

Backends in pycsw are extensible. So something like pycsw/plugins/repository/git/git.py would be required, with the same setup/signatures as https://github.com/geopython/pycsw/blob/master/pycsw/plugins/repository/geonode/geonode_.py or https://github.com/geopython/pycsw/blob/master/pycsw/plugins/repository/odc/odc.py, adding insert, update, delete functions which would be the CSW-T functions to interact with Git.

I think this would be very easy to do for Git transactions, with a few config switches to detect it's a git backend, as well as u/p credentials.
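As a rough illustration of the Git side of such a plugin, here is a sketch that stores one XML file per record and makes one commit per transaction. Everything in it is an assumption for illustration: the class name `GitRecordStore`, the method signatures, and the file naming scheme are hypothetical and do not mirror the actual pycsw repository plugin interface. Credentials and pushing to a remote are not covered.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

Runner = Callable[[Sequence[str], Path], None]


def run_git(args: Sequence[str], cwd: Path) -> None:
    """Run a git command inside the working tree, raising on failure."""
    subprocess.run(["git", *args], cwd=str(cwd), check=True)


class GitRecordStore:
    """Hypothetical store: one XML file per record, one commit per change."""

    def __init__(self, workdir: Path, runner: Runner = run_git) -> None:
        self.workdir = Path(workdir)
        self.runner = runner  # injectable so the git calls can be stubbed

    def _path(self, identifier: str) -> Path:
        # Replace characters that are unsafe in file names (":" or "/").
        safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in identifier)
        return self.workdir / f"{safe}.xml"

    def _write_and_commit(self, identifier: str, xml: str, verb: str) -> None:
        path = self._path(identifier)
        path.write_text(xml, encoding="utf-8")
        self.runner(["add", "--", path.name], self.workdir)
        self.runner(["commit", "-m", f"{verb} {identifier}"], self.workdir)

    def insert(self, identifier: str, xml: str) -> None:
        if self._path(identifier).exists():
            raise FileExistsError(identifier)
        self._write_and_commit(identifier, xml, "insert")

    def update(self, identifier: str, xml: str) -> None:
        if not self._path(identifier).exists():
            raise FileNotFoundError(identifier)
        self._write_and_commit(identifier, xml, "update")

    def delete(self, identifier: str) -> None:
        path = self._path(identifier)
        if not path.exists():
            raise FileNotFoundError(identifier)
        # `git rm` removes the file from the working tree and the index.
        self.runner(["rm", "--", path.name], self.workdir)
        self.runner(["commit", "-m", f"delete {identifier}"], self.workdir)
```

Passing the runner in makes it possible to exercise the insert/update/delete logic without a real Git checkout; a real plugin would also need to map these calls onto whatever signatures the existing backends use.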

The question then becomes how do we index and make the repository searchable.

Some options / further thinking out loud:

  • one could use, say, the GitHub API to search a repository, but this would only loosely work for freetext-style searching, so you would have to post-process the API response for finer-grained searching like CSW can do (i.e. dc:title = 'foo'). This also goes for SFSQL spatial predicates
  • use a parallel indexing system like CouchDB. This would also require SFSQL spatial predicate support. Does anyone know if GeoCouch supports this?
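To make the post-processing step in the first option concrete, here is a small sketch that refines a list of candidate records (for example, the results of a freetext search) down to those where a Dublin Core property equals a given value. It assumes the candidates are available as XML strings using the standard Dublin Core elements namespace; the function names `property_equals` and `refine` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Standard Dublin Core elements namespace (the usual binding for "dc:").
DC_NS = "http://purl.org/dc/elements/1.1/"


def property_equals(record_xml: str, prop: str, value: str, namespace: str = DC_NS) -> bool:
    """Return True if any <namespace:prop> element's text equals value exactly."""
    root = ET.fromstring(record_xml)
    tag = f"{{{namespace}}}{prop}"
    return any((el.text or "").strip() == value for el in root.iter(tag))


def refine(candidates: Iterable[str], prop: str, value: str) -> List[str]:
    """Keep only the candidate records that satisfy prop = value."""
    return [xml for xml in candidates if property_equals(xml, prop, value)]
```

A freetext search for `foo` would also return records titled `foo bar`; calling `refine(results, "title", "foo")` keeps only the exact matches. Parsing every candidate on each request is only reasonable for small result sets, which is part of why a proper index is still needed.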

@rclark

rclark commented Aug 22, 2013

GeoNetwork and ESRI Geoportal both utilize Lucene for indexing, if I'm not mistaken. I think CouchDB has validity as its own backend for pycsw, but maybe not so much for this purpose.

Even more thinking out loud

  • Wouldn't want to rely on GitHub API unless you were explicitly making it a "GitHub" and not just a "Git" backend.
  • Lucene (or something along those lines) can abstract the search/indexing away from your backend implementation, and that's really intriguing, but at the same time one of the great things about pycsw is how light-weight it is in comparison to the other Java-based CSW servers. For a file-based backend though, you more or less have to rely on some other piece of the stack to index/search I guess?

@tomkralidis
Member

@rclark thanks for the info. Agreed, lightweight is a rule of pycsw.

Has anyone tried Whoosh (http://whoosh.ca)? From what I can see, it is a pure Python index/search library, and I think it would be a great fit. The only thing is that it doesn't do spatial. What would be really cool is for Whoosh to support Shapely (as optional spatial support, even if that part is not pure Python).
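One way to work around the missing spatial support, sketched here under assumptions, is to let the text index return candidate record identifiers and then apply a bounding-box test as a second pass. This uses neither Whoosh nor Shapely: it is a plain-Python envelope intersection, and the names `bbox_intersects`, `spatial_filter`, and the `extents` lookup table are hypothetical. It does not handle extents that cross the antimeridian.

```python
from typing import Dict, Iterable, List, Tuple

# (minx, miny, maxx, maxy) in the same coordinate reference system
BBox = Tuple[float, float, float, float]


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True if the two envelopes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def spatial_filter(hit_ids: Iterable[str], extents: Dict[str, BBox], query: BBox) -> List[str]:
    """Keep text-search hits whose stored extent intersects the query box.

    Hits without a stored extent are dropped.
    """
    return [
        rid for rid in hit_ids
        if rid in extents and bbox_intersects(extents[rid], query)
    ]
```

This covers only a BBOX-style predicate on envelopes; the other spatial predicates on real geometries would need a geometry library such as Shapely.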

@kalxas
Member

kalxas commented Oct 9, 2022

Update: External git workflow is being used in ESA's Open Science Catalogue https://opensciencedata.esa.int/ with pycsw as the Catalogue backend.

Records are stored/manipulated on GitHub, and there is a hook that triggers pycsw harvesting from GitHub Pages to synchronize the records in the db.
