Extend Coopy Highlighter Diff format with column type changes #164

Closed
edwindj opened this issue Feb 25, 2015 · 13 comments

Comments

@edwindj

edwindj commented Feb 25, 2015

A useful addition to the coopy highlighter diff format would be column type changes.

For example:

dataset a
A,B
1.1,1

and

dataset b
A,B,C
1,"1",2.1

The Coopy Diff is:

!,,,+++
@@,A,B,C
->,1.1->1,1,2.1

A typed version of the format could be:

!,{number->integer},{integer->string},+++{number}
@@,A,B,C
->,1.1->1,1,2.1

In this version the schema row can carry a column type change. IMHO type information should not be obligatory; an implementation should interpret it as a type suggestion, since types differ across programming languages. The types of JSON Table Schema seem like a good candidate for denoting common types.
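A minimal sketch of how a consumer might read the inline annotations in the typed schema row above. The function name and the regular expression are hypothetical; this syntax is only a proposal in this thread, not part of the Coopy highlighter diff spec or of daff.

```python
import re

# Hypothetical pattern: an optional action marker (e.g. "+++") followed by an
# optional "{...}" type annotation, as in the proposed typed schema row.
_TYPED_CELL = re.compile(r"^(?P<action>[^{}]*)(?:\{(?P<types>[^{}]*)\})?$")


def parse_typed_schema_cell(cell):
    """Split a schema-row cell into (action, old_type, new_type)."""
    match = _TYPED_CELL.match(cell)
    if match is None:
        # Unrecognized syntax: keep the cell as the action, carry no types.
        return cell, None, None
    action = match.group("action")
    types = match.group("types")
    if types is None:
        return action, None, None
    if "->" in types:
        old_type, new_type = types.split("->", 1)
        return action, old_type, new_type
    # A single type, e.g. for an added column, is treated as the new type.
    return action, None, types


schema_row = "!,{number->integer},{integer->string},+++{number}".split(",")
parsed = [parse_typed_schema_cell(cell) for cell in schema_row[1:]]
```

Because the types are only suggestions, a consumer that does not understand them can keep the action part and drop the rest.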

@paulfitz
Contributor

Thanks @edwindj. There's also work on refining the types in JSON Table Schema in #159.

What do you think about leaving types in a separate optional row, like:

!,,,+++
@type,number->integer,integer->string,number
@@,A,B,C
->,1.1->1,1,2.1

I'm thinking that the spec could leave space for meta-data associated with columns via a series of @foo lines (say @type, @precision, @special_stuff_for_R, etc.). Conforming consumers of diffs can ignore all that stuff, or try to use it. Conforming producers of diffs can add some of that stuff, or none of it.

The advantage of the separate rows is that the cells can behave exactly as in ordinary rows and be parsed in just the same way.
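To illustrate the point about parsing, here is a rough sketch of a consumer that reads every row with the same CSV parser and sets the @foo rows aside. The function name and return shape are assumptions made for this example; they are not taken from daff.

```python
import csv
import io


def split_highlighter_diff(text):
    """Separate a diff into schema row, @foo meta rows, header, and data rows."""
    schema, header, meta, rows = None, None, {}, []
    for cells in csv.reader(io.StringIO(text)):
        if not cells:
            continue
        tag = cells[0]
        if tag == "!":
            schema = cells[1:]
        elif tag == "@@":
            header = cells[1:]
        elif tag.startswith("@"):
            # Proposed per-column meta-data row such as @type or @precision.
            # A consumer that does not care can simply ignore this dict.
            meta[tag[1:]] = cells[1:]
        else:
            rows.append(cells)
    return schema, meta, header, rows


diff_text = (
    "!,,,+++\n"
    "@type,number->integer,integer->string,number\n"
    "@@,A,B,C\n"
    "->,1.1->1,1,2.1\n"
)
schema, meta, header, rows = split_highlighter_diff(diff_text)
```

The meta rows need no special cell syntax: `meta["type"]` holds ordinary cells, one per column, lined up with the header.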

@paulfitz
Contributor

Also, I understand from edwindj/daff#6 that you like having a single file for expressing diffs, and that may be the way to go. But just as the Tabular Data Package spec proposes data in CSV and schema in JSON, there may be something to be said for expressing schema differences in a hierarchical format like JSON rather than trying to flatten types out.

@edwindj
Author

edwindj commented Feb 27, 2015

@paulfitz I like your syntax for extra lines that may be ignored by consumers.
Maybe we can add this to the spec of Coopy Highlighter Diff nonetheless.

Regarding type changes in one file or two: should we follow the diff paradigm of storing all changes in one text, or should we follow the JSON Table Schema paradigm of describing meta-data (changes) in a JSON file? The latter option would force all users to use JSON Table Schema, which I find too strict. Maybe we should support both, with a preference for JSON Table Schema: when a schema is available it should be used; otherwise a less expressive form can be used with the @type syntax.

Note that a solution in the spirit of datapackage probably would not calculate a diff, but just reference two resources: table remote and table local.

@paulfitz
Copy link
Contributor

I agree it would make sense to stick the new syntax in. I could also take a shot at adding support for it in daff. What I'd do is just ask the source of the tables if there's any meta-data, diff that, and pass it along. For patching, I'm not 100% clear what would happen, but basically daff should tell you what meta-data changes happened and let you take care of taking action based on them.
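As a rough illustration of "ask the source for meta-data, diff that, and pass it along", the sketch below builds an @type row from two per-column type mappings. The function and its inputs are hypothetical and do not reflect how daff implements this; the cells follow the same convention as ordinary diff rows, with `old->new` for a change and the plain value otherwise.

```python
def type_row(old_types, new_types, columns):
    """Build a hypothetical @type row from the types each table reports."""
    cells = ["@type"]
    for name in columns:
        old, new = old_types.get(name), new_types.get(name)
        if old is None:
            # Column added (or type unknown on both sides): show the new type.
            cells.append(new or "")
        elif new is None:
            # Column removed: show the old type.
            cells.append(old)
        elif old != new:
            cells.append(f"{old}->{new}")
        else:
            # Unchanged: show the type as-is, like an unchanged data cell.
            cells.append(old)
    return cells


row = type_row(
    {"A": "number", "B": "integer"},
    {"A": "integer", "B": "string", "C": "number"},
    ["A", "B", "C"],
)
```

Here `row` lines up with the `@@,A,B,C` header of the example discussed earlier in the thread.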

This feature should make diffs more useful within an environment with a single kind of data source, even if it wouldn't be very useful for interchange between different kinds of data sources.

@edwindj
Author

edwindj commented Mar 1, 2015

Great! I will follow your changes and implement them in daff for R.

@rufuspollock
Contributor

@paulfitz should this remain open? Are there pending changes? Otherwise let's close with a summary.

@paulfitz
Contributor

@rgrp can we keep it open a while longer? I've been plugging away on this, close to maturing.

paulfitz added a commit to paulfitz/daff that referenced this issue May 31, 2015
Tables with meta-data that can be expressed in tabular form
can now have changes in that meta-data included in diffs
and applied in patches, following ideas in:

  frictionlessdata/specs#164

Example implementation for Sqlite tables to follow soon (I hope).
@rufuspollock
Contributor

@paulfitz fantastic!

@edwindj
Author

edwindj commented May 31, 2015

@paulfitz Great!

@paulfitz
Contributor

I implemented a version of this some time back, and then got distracted working on a demo for it with sqlite. Suppose we have a birds table as follows:

# schema: id INTEGER PRIMARY KEY, name TEXT, count TEXT
id,name,count
-------------
1,robin,251
2,eagle,10
3,pigeon,140

And we modify the type of a column, add another column, and add a row:

# schema: id INTEGER PRIMARY KEY, name TEXT, count INTEGER, weather TEXT
id,name,count,weather
---------------------
1,robin,251,warm
2,eagle,10,
3,pigeon,140,
4,penguin,5,cold

Then daff would report this diff:

[image: sqlite_diff]

To use this in R, you'd need to implement some code that reports the properties of each column that you care about. That is sufficient for diffing. For patching, you'd need to be able to accept a description of the changes in a particular format and make them happen. I'll need to document this better if you're still interested in pursuing this @edwindj.
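For the SQLite case, the "report the properties of each column" step might look like the sketch below, shown in Python rather than R. It uses the standard `sqlite3` module and SQLite's `PRAGMA table_info`; the table names `birds_old` and `birds_new` and the helper `column_types` are made up for this example, and this is not the interface daff itself expects.

```python
import sqlite3


def column_types(conn, table):
    """Report the declared type of each column, keyed by column name."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    # Table names cannot be bound as parameters, so pass trusted names only.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {name: decl_type for _cid, name, decl_type, *_rest in rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE birds_old (id INTEGER PRIMARY KEY, name TEXT, count TEXT)"
)
conn.execute(
    "CREATE TABLE birds_new "
    "(id INTEGER PRIMARY KEY, name TEXT, count INTEGER, weather TEXT)"
)

old = column_types(conn, "birds_old")
new = column_types(conn, "birds_new")
# Columns whose declared type differs, or that exist only in the new table.
changed = {c: (old.get(c), new[c]) for c in new if old.get(c) != new[c]}
```

With the two schemas from the birds example, `changed` picks out the `count` column (TEXT to INTEGER) and the added `weather` column.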

@edwindj
Author

edwindj commented Oct 11, 2015

@paulfitz, I'm still interested :-), documentation helps, but I will update my R code so this example works. Won't be until end of this week.

@rufuspollock
Contributor

@edwindj @paulfitz can this be closed?

@danfowler
Contributor

This issue was moved to okfn/specs#3
