Tabular JSON format for Wikipedia collaboration #265

Closed
nyurik opened this Issue Jun 6, 2016 · 26 comments

4 participants
@nyurik

commented Jun 6, 2016

Hi, I am currently developing a tabular data format for Wikipedia (ability to store tables of data on Wiki pages), and you might be interested in participating. Thanks!

https://phabricator.wikimedia.org/T134426

@rufuspollock

Contributor

commented Jun 7, 2016

@nyurik great to hear about this. I would encourage taking a good look at:

JSON Table Schema: http://dataprotocols.org/json-table-schema/

It's very simple and already does exactly what you want :-)

I'd definitely recommend adopting that model for describing headers and types rather than rolling your own if you could.

In addition, for the overall structure you could adopt the "resource" part of a (Tabular) Data Package model.

What this would look like

Here's your example from your proposal redone using this approach.

As you can see it's just as simple and if anything a bit more expressive -- plus you can leverage all the work that has already been done developing these (and the tooling)!

Note: the data could be a simple array of arrays rather than an array of objects if that were preferred (it is more concise but a bit less "json-ic"). I've done a 2nd example showing that ...

{
    "title": "Some good fruits for you",
    "title@es": "Algunas buenas frutas para ti",
    "schema": {
       "fields": [
          {
            "name": "label",
            "type": "string"
          },
          {
            "name": "value",
            "type": "number"
          },
          {
            "name": "stored",
            "type": "boolean"
          },
          {
            "name": "localName",
            "type": "localized"
          }
      ]
    },
    "data": [
        {
          "label": "peaches",
          "value": 100,
          "stored": true,
          "localName": {
            "en": "in english",
            "es": "esto puede estar en español",
            "fr": "this could be in french"
          }
        },
        {
          "label": "plums",
          "value": 32,
          "stored": false,
          "localName": {
            "en": "in english",
            "es": "esto también está en español",
            "fr": "this is also in french",
            "gr": "this could be in greek"
          }
        },
        ...
    ]
}

Example with row data as arrays rather than objects

{
    "title": "Some good fruits for you",
    "title@es": "Algunas buenas frutas para ti",
    "schema": {
       "fields": [
          {
            "name": "label",
            "title": "My fancy title for this field", #either
            "title": { "en": "My fancy title for this field", "fr": "Mon title"}, #or
            "type": "string"
          },
          {
            "name": "value",
            "type": "number"
          },
          {
            "name": "stored",
            "type": "boolean"
          },
          {
            "name": "localName",
            "type": "localized"
          }
      ]
    },
    "data": [
        [
            "peaches",
            100,
            true,
            {
                "en": "in english",
                "es": "esto puede estar en español",
                "fr": "this could be in french"
            }
        ],
        [
            "plums",
            32,
            false,
            {
                "en": "in english",
                "es": "esto también está en español",
                "fr": "this is also in french",
                "gr": "this could be in greek"
            }
        ],
        ...
    ]
}
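
For reference, the two layouts carry the same information as long as the field order in the schema is known. A minimal Python sketch of converting between them follows; the function names `rows_to_arrays` and `arrays_to_rows` are made up for illustration and are not part of any existing tooling.

```python
def rows_to_arrays(schema, rows):
    """Convert a list of row objects into row arrays, in schema field order."""
    names = [field["name"] for field in schema["fields"]]
    # dict.get returns None for an absent key, so a missing value becomes null
    return [[row.get(name) for name in names] for row in rows]


def arrays_to_rows(schema, arrays):
    """Convert a list of row arrays back into a list of row objects."""
    names = [field["name"] for field in schema["fields"]]
    return [dict(zip(names, values)) for values in arrays]


schema = {"fields": [{"name": "label", "type": "string"},
                     {"name": "value", "type": "number"},
                     {"name": "stored", "type": "boolean"}]}
rows = [{"label": "peaches", "value": 100, "stored": True},
        {"label": "plums", "value": 32, "stored": False}]

arrays = rows_to_arrays(schema, rows)
print(arrays)
```

The array form relies entirely on position, which is why it is more concise but only readable together with the schema.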
@nyurik

Author

commented Jun 7, 2016

@rgrp thanks, let's continue the discussion in https://phabricator.wikimedia.org/T134426. Do you want to keep this issue open so that others may notice it?

@pwalsh

Member

commented Jun 8, 2016

Hey @nyurik I'm working on the specs with @rgrp and I'll jump in on the other thread too. We'll leave this open for a while, sure.

@roll removed the backlog label Aug 29, 2016

@rufuspollock

Contributor

commented Sep 27, 2016

@nyurik ping ;-)

@rufuspollock

Contributor

commented Sep 27, 2016

@danfowler I wonder if this issue should move to the general FD issue tracker, since it is more about outreach and collaboration ...

@nyurik

Author

commented Sep 27, 2016

@rgrp hi, sorry for not replying earlier, tabular (and other) data on wiki will be my priority starting next week. There are some minor cleanups needed, but otherwise I feel pretty good about the spec and implementation. I have been thinking about i18n stuff some more, as we might want to allow an alternative interface for localizers - TBD. Switching back to phabricator :)

@rufuspollock

Contributor

commented Sep 27, 2016

@nyurik so did you adopt tabular data package / json table schema :-) ?

PS: I tried to comment on phabricator a while back and seem to have got locked out and could not work out how to log in. Just flagging so you know why I did not follow up to your last comment there.

@nyurik

Author

commented Sep 27, 2016

Hm, you should be able to create an account, or even easier - create an account in Wikipedia and log in to phabricator via openauth. As for the spec - I am trying to get it more in line with what you have - I will be putting finishing touches on it once I start back on that project

@rufuspollock

Contributor

commented Sep 27, 2016

@nyurik i had an account - hence i commented earlier. The challenge was getting access to the account.

@nyurik it would be great if we could fully align, as it would allow us to share tooling 😄

@nyurik

Author

commented Sep 27, 2016

Strange, we really should fix that - phab is our primary communication tool, both inside and outside the WMF. And yes, it would be good to be able to reuse tools. We could schedule a voice chat sometime later this week or early next (everyone interested is welcome to join), and iron out the spec.

@rufuspollock

Contributor

commented Sep 30, 2016

@nyurik great suggestion for a chat - sorry somehow missed this in my notification queue when you posted. The person we might want there is @pwalsh or @danfowler and i'm not sure about availability today - but i could do afternoon. You could ping on the chat channel here and we can arrange https://gitter.im/frictionlessdata/chat

@rufuspollock

Contributor

commented Nov 17, 2016

@nyurik any update on the structure? It would be great to align this and tabular data package spec. Please ping on channel or let us know how to reach you to have a chat ...

@nyurik

Author

commented Nov 21, 2016

Hi @rgrp, yes, I am thinking of launching it fairly soon - https://commons.wikimedia.org/wiki/Commons:Village_pump#Launching_shared_maps_and_data_on_Commons -- see the modifications I have made, and feel free to experiment with it at https://commons.wikimedia.beta.wmflabs.org/wiki/Data:Sample.tab
Let me know if you think any urgent changes are needed.

@nyurik

Author

commented Nov 21, 2016

P.S. I began writing documentation at https://www.mediawiki.org/wiki/User:Yurik/Tabular - feel free to correct or improve it if you want

@rufuspollock

Contributor

commented Nov 30, 2016

@nyurik does it build off tabular json format or is it ab initio? From what I can see it is fairly different ... (I do not think it has changed much since my original comments 6 months ago)

So the urgent changes would be to see if we could still converge in some way 😄

As I said it would be great to have a chat about this. I hang out here https://gitter.im/frictionlessdata/chat and would be happy to skype 😄

@nyurik

Author

commented Nov 30, 2016

@rgrp a few questions/comments:

  • Why does your second example use a triple array? Shouldn't it be [ ["x","y","z"], ["a","b","c"] ] for a two-row, three-column table? Your example uses [[[...],[...]]]
  • I don't want to have a top level "title", because that implies a single language being "main", while all others being "translations" (e.g. "title@es"). Our construct (named "info", but we can rename it to "title" if you want) does not dictate which languages must be provided:
    "info": {
        "en": "Some good fruits for you",
        "es": "Algunas buenas frutas para ti"
    },
  • The schema could be defined similar to yours, but what should we do with the field translation? How about this - the name will be a C-style identifier, type - one of the types we allow ("number", "boolean", "string", or "localized"), and the optional header will be a key-value set of localization strings.
    "schema": {
       "fields": [
          {
            "name": "fruit_id",
            "type": "string",
            "header": { "en": "Fruit ID", "fr": "ID de fruit" }
          },
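
The constraints proposed in the last bullet could be checked along these lines. This is only a Python sketch under my own assumptions: `field_errors`, `ALLOWED_TYPES` and `IDENTIFIER` are hypothetical names, and the exact identifier rule (ASCII letters, digits and underscore, not starting with a digit) is one reading of "C-style identifier".

```python
import re

# The four types named above; a tuple avoids hashing unexpected values
ALLOWED_TYPES = ("number", "boolean", "string", "localized")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def field_errors(field):
    """Return a list of problems found in one schema field definition."""
    errors = []
    name = field.get("name")
    if not isinstance(name, str) or not IDENTIFIER.fullmatch(name):
        errors.append("name must be a C-style identifier")
    if field.get("type") not in ALLOWED_TYPES:
        errors.append("type must be one of: " + ", ".join(ALLOWED_TYPES))
    header = field.get("header")
    if header is not None:  # the header is optional
        ok = isinstance(header, dict) and all(
            isinstance(k, str) and isinstance(v, str) for k, v in header.items())
        if not ok:
            errors.append("header must map language codes to strings")
    return errors


print(field_errors({"name": "fruit_id", "type": "string",
                    "header": {"en": "Fruit ID", "fr": "ID de fruit"}}))
```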
@rufuspollock

Contributor

commented Dec 1, 2016

@nyurik

  • Re the triple array: good spot and that was a bug I have now corrected
  • Re the title point:
    • Renaming to title at your end would be good if that is possible
    • I take your point about not wanting to select one specific language. The route we are proposing for i18n in #42 allows either a simple title if there is only one language, or exactly your structure in the case of multiple languages 😄 👍 🎱 (and our choice there was influenced by your work and choice here!)
  • Re the field translations: we could do exactly the same with the title there. Our fields have titles (which I think is the same as your header) and so what you propose and we propose is exactly the same 👍 🎱
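
A consumer that accepts both forms of title (plain string or language map) might resolve a display string roughly like this. It is a Python sketch only: `resolve_title` is a hypothetical helper, and the fallback order (requested language, then "en", then any available translation) is my assumption, not something either spec prescribes.

```python
def resolve_title(title, lang, fallback="en"):
    """Pick a display string from a plain title or a {language: text} map."""
    if isinstance(title, str):
        return title  # single-language form: nothing to choose
    if lang in title:
        return title[lang]
    if fallback in title:
        return title[fallback]
    # last resort: any available translation, or None for an empty map
    return next(iter(title.values()), None)


print(resolve_title({"en": "My fancy title for this field", "fr": "Mon title"}, "fr"))
```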
@rufuspollock

Contributor

commented Dec 1, 2016

{
    "description": {"en": "Some good fruits for you", "es": ... },
    "schema": {
       "fields": [
          {
            "name": "label",
            "title": { "en": "My fancy title for this field", "fr": "Mon title"},
            "type": "string"
          },
          {
            "name": "value",
            "type":  "number"
          },
          {
            "name": "stored",
            "type": "boolean"
          },
          {
            "name": "localName",
            "type": "localized"
          }
      ]
    },
    "data": [
        [
            "peaches",
            100,
            true,
            {
                "en": "in english",
                "es": "esto puede estar en español",
                "fr": "this could be in french"
            }
        ],
        [
            "plums",
            32,
            false,
            {
                "en": "in english",
                "es": "esto también está en español",
                "fr": "this is also in french",
                "gr": "this could be in greek"
            }
        ],
        ...
    ]
}

wmfgerrit pushed a commit to wikimedia/mediawiki-extensions-JsonConfig that referenced this issue Dec 5, 2016

Switching to a more standard compliant schema
* all schema information is now part of the top level "schema" element:

    "schema": {
        "fields": [
            {
                "name": "col1",
                "type": "string"
            }, ...

* Optional "title" inside the above schema-fields allows for column labeling
* Top level "description" instead of "info"
* Top level "data" instead of "rows"
* Replaced CC0-1.0 with CC0-1.0+  (version 1 or later, per SPDX specification)
* Render better license HTML - link to license separate from "or later version"

See frictionlessdata/specs#265 (comment)

Bug: T152184
Change-Id: I2dc08940cf9e314d5d822dce8c7cb97052aee99b
@nyurik

Author

commented Dec 6, 2016

@rgrp please take a look at this sample, and let me know if you see anything else missing, or feel free to close the issue.

{
    "license": "CC0-1.0+",
    "description": {
        "en": "Some good fruits for you",
        "es": "Algunas buenas frutas para ti"
    },
    "sources": "Copied verbatim from [https://meta.wikimedia.org/wiki/User:Yurik my head]",
    "schema": {
        "fields": [
            {
                "name": "id",
                "type": "string",
                "title": {
                    "en": "Fruit ID",
                    "fr": "ID de fruit"
                }
            },
            {
                "name": "count",
                "type": "number",
                "title": {
                    "en": "Count"
                }
            },
            {
                "name": "liked",
                "type": "boolean",
                "title": {
                    "en": "Do I like it?"
                }
            },
            {
                "name": "description",
                "type": "localized",
                "title": {
                    "en": "Description"
                }
            }
        ]
    },
    "data": [
        [
            "peaches",
            100,
            true,
            {
                "en": "in english",
                "es": "esto puede estar en español",
                "fr": "this could be in french"
            }
        ],
        [
            "plums",
            32,
            false,
            {
                "en": "in english",
                "es": "esto también está en español",
                "fr": "this is also in french",
                "gr": "this could be in greek"
            }
        ],
        [
            "blueberries",
            180,
            true,
            {
                "en": "in english",
                "ru": "это может быть и по-русски"
            }
        ],
        [
            "strawberries",
            46,
            false,
            {
                "en": "in english",
                "he": "this could be in hebrew"
            }
        ],
        [
            "bananas",
            21,
            true,
            {
                "cn": "this could be in chinese",
                "en": "in english"
            }
        ]
    ]
}
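
To sanity-check a document shaped like the sample above, each row can be compared against the schema's field types. The following Python sketch assumes only the four types used in this thread; `value_matches` and `row_errors` are hypothetical names, and null cells are not handled here.

```python
def value_matches(value, type_name):
    """Check one cell against one of the four types used in the sample."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "localized":
        return isinstance(value, dict) and all(
            isinstance(k, str) and isinstance(v, str) for k, v in value.items())
    return False


def row_errors(schema, data):
    """Yield (row_index, field_name) for every cell that does not fit the schema."""
    fields = schema["fields"]
    for i, row in enumerate(data):
        if len(row) != len(fields):
            yield (i, None)  # wrong number of columns
            continue
        for field, value in zip(fields, row):
            if not value_matches(value, field["type"]):
                yield (i, field["name"])
```

Calling `list(row_errors(doc["schema"], doc["data"]))` on a parsed document returns an empty list when every cell fits its declared type.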
@rufuspollock

Contributor

commented Dec 7, 2016

@nyurik i've copied inline for ease of reference.

I think this looks really good! 😄 and is really compatible with the Data Package spec.

Couple of asides which are not of major import:

@nyurik

Author

commented Dec 9, 2016

@rgrp - I just enabled it in production, accessible from all Wikipedias, but haven't advertised yet - will need to build a good demo first. Could you take a look at https://phabricator.wikimedia.org/T152753 -- null values proposal. See how it fits with your model.

https://phabricator.wikimedia.org/T152753

We should support null to indicate that the value is not available. We should also support notNull field schema parameter to indicate that the value must exist.

The W3C version of the tabular schema also supports a "json-y" way of storing values: instead of storing data as a list of lists [ [1,2,3], [4,5,6] ], it also allows an object approach, [ {"hdr1": 1, "hdr2": 2, "hdr3": 3}, { ... } ]. This allows for an easy way to indicate a missing value: simply don't specify it. That is almost the same as setting it to null, but not entirely: missing != null. If we don't support the object approach, we basically equate null and missing.
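
The difference between missing and null is easy to see once the JSON is parsed. A small Python illustration, reusing the hdr1/hdr2 names from the example above (the data values themselves are made up):

```python
import json

rows = json.loads('[{"hdr1": 1, "hdr2": null}, {"hdr1": 2}]')

for row in rows:
    if "hdr2" not in row:
        state = "missing"
    elif row["hdr2"] is None:
        state = "null"
    else:
        state = "present"
    print(row["hdr1"], state)

# With rows stored as arrays every position must be filled, so both cases
# collapse into null.
names = ["hdr1", "hdr2"]
as_arrays = [[row.get(name) for name in names] for row in rows]
print(as_arrays)
```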

@rufuspollock

Contributor

commented Dec 10, 2016

@nyurik first, Data Package supports data structured like this:

 [ {"hdr1": 1, "hdr2": 2, "hdr3": 3}, { ... } ]

In fact, that is the default way for doing data - rather than "raw" arrays.

Nulls

Re nulls I think you mean missing values. For this, check out "missing values" in JSON Table Schema spec:

http://specs.frictionlessdata.io/json-table-schema/#missing-values

By “missing” we simply mean null or “not present for whatever reason”. Many datasets arrive with missing data values, either because a value was not collected or it never existed.

Missing values may be indicated simply by the value being empty; in other cases a special value may have been used, e.g. -, NaN, 0, -9999 etc.

The missingValue property provides a way to indicate that these values should be interpreted as equivalent to null.

Summary: you can define per field what values are counted as missing values using missingValue
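
As a rough illustration of that summary, a reader could normalize per-field markers to null like this. It is a Python sketch under assumptions: `normalize_row` is a hypothetical helper, and it accepts `missingValue` as either a single marker or a list of markers, which may not match the exact shape the spec requires.

```python
def normalize_row(schema, row):
    """Replace per-field missing-value markers in a row object with None."""
    cleaned = {}
    for field in schema["fields"]:
        name = field["name"]
        markers = field.get("missingValue", [])
        if not isinstance(markers, list):
            markers = [markers]  # assumption: a single marker may be given bare
        value = row.get(name)  # an absent key also ends up as None
        cleaned[name] = None if value in markers else value
    return cleaned


schema = {"fields": [
    {"name": "label", "type": "string"},
    {"name": "count", "type": "number", "missingValue": ["-", "NaN", -9999]},
]}
print(normalize_row(schema, {"label": "plums", "count": -9999}))
```

Because the markers are declared per field, a "-" in the label column above is kept as ordinary data.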

If you wanted more info you can see the original issue about this: #97

Edited: to more clearly point to JTS definition

@nyurik

Author

commented Dec 29, 2016

nulls are now supported in Commons Datasets. No way to restrict them at the moment. Should this issue be closed?

@rufuspollock

Contributor

commented Jan 6, 2017

@nyurik yes we can close.

Did you use missingValue for defining nulls or how did you do it?

@pwalsh closed this Feb 5, 2017

@nyurik

Author

commented Feb 8, 2017

@rufuspollock I basically allowed null on everything, without any restrictions for now

@rufuspollock

Contributor

commented Feb 8, 2017

@nyurik i think that is a sensible default.
