Example of programmatically creating metadata from scratch #48

brockfanning · 2021-01-04T14:35:29Z

Hello,

I was wondering if it would be possible to provide an example of creating the metadata from scratch. My goal is to create a metadata file by programmatically examining a CSV file to determine the schema. No worries if this is out of scope for the library.

Thank you!

LinguList · 2021-01-04T15:54:18Z

There are plenty of examples. Fairly customized metadata files are used in the concepticon project: check the folder concepticondata/conceptlists/ there, where each TSV file is accompanied by a metadata file.

brockfanning · 2021-01-04T16:26:24Z

@LinguList Thanks for the quick reply! I apologize that my question was unnecessarily vague. I'm actually looking for an example of programmatically (with Python code) creating metadata from scratch, using this library. To elaborate, I'm looking for either of these things, in order of preference:

1. Python example of loading a CSV file and then generating the JSON metadata based on the CSV contents
2. Python example of generating the JSON metadata using this library

The first item above is my ultimate goal, but if that is not possible with this library then I could get there with the help of the second item.

xrotwang · 2021-01-04T17:06:01Z

Hm. For 1, i.e. inferring metadata from CSV, you may be better off using the frictionless framework, even if you want to end up with CSVW; that would only require a simple transformation of the JSON metadata.

brockfanning · 2021-01-04T17:09:18Z

@xrotwang That's a great lead, thank you! Do you happen to know of any Frictionless-to-CSVW metadata converters out there? As you say it should be simple, though I would need to ramp up on both specs.

xrotwang · 2021-01-04T17:16:24Z

I'd be willing to help with writing the converter :)

brockfanning · 2021-01-04T17:26:59Z

@xrotwang That's amazing! Should I kick things off by creating a separate repo for this, and then ping you when I get stuck? Or were you imagining this being added to an existing repo?

xrotwang · 2021-01-04T18:22:20Z

As far as I'm concerned, such a converter could live in this repo as well, possibly as a class method Dataset.from_frictionless_metadata or similar.

brockfanning · 2021-01-04T18:49:19Z

That works for me. I'm not sure where to start here, so any help you can provide, even if only partial progress, would be greatly appreciated!

xrotwang · 2021-01-04T18:53:51Z

Will try to make a start this week.

xrotwang · 2021-01-04T20:22:33Z

Skimming the documentation at https://frictionlessdata.io/tooling/python/describing-data/#describing-schema, it seems that frictionless can infer the CSV dialect and (partially) the data types, but not things like foreign key relations. But we could start out with the output of frictionless describe PATH/TO/TABLE --json --source-type package, i.e. something like

{
  "profile": "data-package",
  "resources": [
    {
      "path": "forms.tsv",
      "stats": {
        "hash": "91c89a7d4fe4d5d55ad8a383a64ea047",
        "bytes": 192127,
        "fields": 13,
        "rows": 2249
      },
      "control": {
        "newline": ""
      },
      "encoding": "utf-8",
      "dialect": {
        "delimiter": "\t"
      },
      "schema": {
        "fields": [
          {
            "name": "ID",
            "type": "string"
          },
          {
            "name": "Language_ID",
            "type": "string"
          },
          {
            "name": "Parameter_ID",
            "type": "integer"
          },
          {
            "name": "Value",
            "type": "string"
          },
          {
            "name": "Comment",
            "type": "any"
          },
          {
            "name": "Source",
            "type": "string"
          },
          {
            "name": "Graphemes",
            "type": "string"
          },
          {
            "name": "Profile",
            "type": "string"
          }
        ]
      },
      "name": "forms",
      "profile": "tabular-data-resource",
      "scheme": "file",
      "format": "tsv",
      "hashing": "md5",
      "compression": "no",
      "compressionPath": "",
      "query": {}
    }
  ]
}

and convert this into a csvw.TableGroup - which could then be enhanced programmatically.
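
For illustration, the transformation could look roughly like the sketch below. It works on plain dicts only and does not use the csvw library; the helper name package_to_csvw and the type mapping are assumptions made for this example, not part of either library. Field types that are not in the mapping (including "any") fall back to "string".

```python
import json

# Assumed mapping from Frictionless field types to CSVW datatype names.
# Anything not listed here (e.g. "any") is treated as "string".
FRICTIONLESS_TO_CSVW = {
    "string": "string",
    "integer": "integer",
    "number": "number",
    "boolean": "boolean",
    "date": "date",
    "datetime": "datetime",
    "time": "time",
}


def package_to_csvw(package):
    """Convert a Frictionless data package descriptor (dict) to CSVW metadata (dict)."""
    tables = []
    for resource in package.get("resources", []):
        # CSVW keeps the encoding inside the dialect description.
        dialect = {}
        if "delimiter" in resource.get("dialect", {}):
            dialect["delimiter"] = resource["dialect"]["delimiter"]
        if "encoding" in resource:
            dialect["encoding"] = resource["encoding"]
        columns = [
            {
                "name": field["name"],
                "datatype": FRICTIONLESS_TO_CSVW.get(field.get("type"), "string"),
            }
            for field in resource.get("schema", {}).get("fields", [])
        ]
        table = {"url": resource["path"], "tableSchema": {"columns": columns}}
        if dialect:
            table["dialect"] = dialect
        tables.append(table)
    return {"@context": "http://www.w3.org/ns/csvw", "tables": tables}


# A trimmed-down version of the descriptor shown above.
package = {
    "profile": "data-package",
    "resources": [
        {
            "path": "forms.tsv",
            "encoding": "utf-8",
            "dialect": {"delimiter": "\t"},
            "schema": {
                "fields": [
                    {"name": "ID", "type": "string"},
                    {"name": "Parameter_ID", "type": "integer"},
                    {"name": "Comment", "type": "any"},
                ]
            },
        }
    ],
}

print(json.dumps(package_to_csvw(package), indent=2))
```

Keys such as stats, hashing or compression have no obvious counterpart in CSVW table descriptions, so this sketch simply drops them.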

xrotwang · 2021-01-04T21:01:33Z

@brockfanning could you give an example of the kind of CSV you'd want to create CSVW for? Does it use any naming scheme to give type or foreign key hints?

brockfanning · 2021-01-04T22:15:28Z

@xrotwang In my case it's pretty simple: each data package has exactly one standalone CSV file. So I don't believe there is any concern about foreign key hints (at least in my case). We don't have any naming scheme related to types. Here's an example:

Year	Location	Value
2010		10
2011		20
2012		30
2010	Urban	12
2011	Urban	22
2012	Urban	32
2010	Rural	14
2011	Rural	24
2012	Rural	34

xrotwang · 2021-01-05T08:53:17Z

Ok, so below is what frictionless describe makes of such data. It's enough to use csvw to read the data correctly, i.e. we have the info about

file path
CSV dialect
column names and datatypes

Once we can read the data with csvw, we could add information like

required - if there are no empty values for a column
unique
minimum or maximum.

OTOH, that would make the data more difficult to edit - e.g. adding a row with a value below an inferred minimum would make the data invalid.

So I guess, we could/should distinguish two use cases:

Seeding CSVW metadata for data which is still being added to.
Adding CSVW metadata to "finished" data for publication (in which case inferring properties like uniqueness, etc. would be useful).
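
As a rough idea of what inferring such properties for the second use case might involve, here is a minimal sketch using only the standard library. The function name infer_constraints and the shape of its result are made up for this example; it assumes tab-separated input, treats empty strings as missing values, and only computes minimum/maximum for columns where every non-empty value parses as an integer.

```python
import csv
import io

# The example data from above, tab-separated.
SAMPLE = (
    "Year\tLocation\tValue\n"
    "2010\t\t10\n2011\t\t20\n2012\t\t30\n"
    "2010\tUrban\t12\n2011\tUrban\t22\n2012\tUrban\t32\n"
    "2010\tRural\t14\n2011\tRural\t24\n2012\tRural\t34\n"
)


def infer_constraints(text, delimiter="\t"):
    """Return a dict mapping column name to inferred properties."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    result = {}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        non_empty = [v for v in values if v != ""]
        info = {
            # required: no empty values in this column
            "required": len(non_empty) == len(values),
            # unique: no repeated non-empty values
            "unique": len(set(non_empty)) == len(non_empty),
        }
        try:
            numbers = [int(v) for v in non_empty]
        except ValueError:
            numbers = []  # not an integer column, so no range is inferred
        if numbers:
            info["minimum"] = min(numbers)
            info["maximum"] = max(numbers)
        result[name] = info
    return result


for column, info in infer_constraints(SAMPLE).items():
    print(column, info)
```

For the example data this reports Value as required and unique with a range of 10 to 34, Year as required but not unique, and Location as neither, which also shows why such inferred properties are better suited to "finished" data.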

---
metadata: test.csv
---

compression: 'no'
compressionPath: ''
control:
  newline: ''
dialect:
  delimiter: '|'
encoding: utf-8
format: csv
hashing: md5
name: test
path: test.csv
profile: tabular-data-resource
query: {}
schema:
  fields:
    - name: Year
      type: integer
    - name: Location
      type: string
    - name: Value
      type: integer
scheme: file
stats:
  bytes: 131
  fields: 3
  hash: b36e8c21563ab32645052c11510bddb7
  rows: 9

brockfanning · 2021-01-05T11:39:31Z

@xrotwang Just my 2 cents: If it helps keep things simple I would be fine with assuming that all inferring of the schema, and adjustments to the schema, will happen in the Frictionless object, before it gets to this library. For example if the schema needs constraints added, they'd be added according to the Frictionless table schema. In that case this library could focus on converting (as faithfully as possible) that metadata to meet the CSVW spec.

Side note: I agree with your distinction of use cases. I expect many providers want to publish CSVW just to make their data "interoperable", while others want the metadata for validation so that they can avoid maintenance problems in the future. My users definitely need the "infer from CSV" interoperability approach (hence this issue). To that end I'm imagining automating some of the adjustments you mention, like minimum/maximum, and turning values into an "enum". But again, I am fine with doing all of that to the Frictionless object before sending it to this library.
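
To make this concrete, here is a sketch of how constraints set on a Frictionless field might carry over to a CSVW column description. It is only an illustration on plain dicts: field_to_column is a hypothetical helper, it assumes the Frictionless type name can be used directly as the CSVW base datatype (true for "string" and "integer", not for every type), and since CSVW has no "enum" it expresses one as a regular expression in the datatype's format, which is only one possible choice.

```python
import re


def field_to_column(field):
    """Convert one Frictionless field descriptor (dict) to a CSVW column (dict)."""
    constraints = field.get("constraints", {})
    # Assumption: the Frictionless type name is also a valid CSVW datatype name.
    datatype = {"base": field.get("type", "string")}
    for key in ("minimum", "maximum"):
        if key in constraints:
            datatype[key] = constraints[key]
    if "enum" in constraints and datatype["base"] == "string":
        # For strings, "format" is a regular expression in CSVW.
        datatype["format"] = "|".join(re.escape(str(v)) for v in constraints["enum"])
    column = {"name": field["name"]}
    # Use the short form when nothing but the base type is known.
    column["datatype"] = datatype if len(datatype) > 1 else datatype["base"]
    if constraints.get("required"):
        column["required"] = True
    return column


fields = [
    {"name": "Year", "type": "integer"},
    {"name": "Location", "type": "string", "constraints": {"enum": ["Urban", "Rural"]}},
    {
        "name": "Value",
        "type": "integer",
        "constraints": {"required": True, "minimum": 10, "maximum": 34},
    },
]

columns = [field_to_column(f) for f in fields]
for column in columns:
    print(column)
```

A Frictionless "unique" constraint is left out here because it does not map onto a single column property in the same way.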

xrotwang · 2021-01-05T11:59:38Z

Of course, the adjustments can also be made in the CSVW metadata, either by manually editing the serialized JSON or programmatically on the csvw.TableGroup object, which csvw.TableGroup.from_frictionless_metadata would return.

xrotwang · 2021-01-05T12:01:44Z

And I agree, getting the simple case up and running would not only be the first step, but presumably useful functionality already. I have a proof-of-concept in my head :) - hope to find the time later today to push for you to review.

brockfanning changed the title ~~Example of creating metadata from scratch~~ Example of programmatically creating metadata from scratch Jan 4, 2021

xrotwang mentioned this issue Jan 5, 2021

first stab at a fritionless converter #49

Merged

xrotwang closed this as completed in #49 Jan 8, 2021
