Example of programmatically creating metadata from scratch #48

Closed
brockfanning opened this issue Jan 4, 2021 · 16 comments · Fixed by #49

@brockfanning

Hello,

I was wondering if it would be possible to provide an example of creating the metadata from scratch. My goal is to create a metadata file by programmatically examining a CSV file to determine the schema. No worries if this is out of scope for the library.

Thank you!

@LinguList

There are plenty of examples. Pretty customized metadata files are used in the concepticon project. Just check the folder concepticondata/conceptlists/ there, where each TSV file is accompanied by a metadata file.

@brockfanning changed the title from "Example of creating metadata from scratch" to "Example of programmatically creating metadata from scratch" on Jan 4, 2021

@brockfanning
Author

@LinguList Thanks for the quick reply! I apologize that my question was unnecessarily vague. I'm actually looking for an example of programmatically (with Python code) creating metadata from scratch, using this library. To elaborate, I'm looking for either of these things, in order of preference:

  1. Python example of loading a CSV file and then generating the JSON metadata based on the CSV contents
  2. Python example of generating the JSON metadata using this library

The first item above is my ultimate goal, but if that is not possible with this library then I could get there with the help of the second item.

@xrotwang
Contributor

xrotwang commented Jan 4, 2021

Hm. For 1, i.e. inferring metadata from CSV, you may be better off using the frictionless framework, even if you want to end up with CSVW; that would only require a simple transformation of the JSON metadata.

@brockfanning
Author

@xrotwang That's a great lead, thank you! Do you happen to know of any Frictionless-to-CSVW metadata converters out there? As you say it should be simple, though I would need to ramp up on both specs.

@xrotwang
Contributor

xrotwang commented Jan 4, 2021

I'd be willing to help with writing the converter :)

@brockfanning
Author

@xrotwang That's amazing! Should I kick things off by creating a separate repo for this, and then ping you when I get stuck? Or were you imagining this being added to an existing repo?

@xrotwang
Contributor

xrotwang commented Jan 4, 2021

As far as I'm concerned, such a converter could live in this repo as well, possibly as a class method Dataset.from_frictionless_metadata or similar.

@brockfanning
Author

That works for me. I'm not sure where to start here, so any help you can provide, even if only partial progress, would be greatly appreciated!

@xrotwang
Contributor

xrotwang commented Jan 4, 2021 via email

@xrotwang
Contributor

xrotwang commented Jan 4, 2021

Skimming the documentation at https://frictionlessdata.io/tooling/python/describing-data/#describing-schema it seems that frictionless can infer the CSV dialect and the data types (partially), but not things like foreign key relations. But we could start out with the output of frictionless describe PATH/TO/TABLE --json --source-type package, i.e. something like

{
  "profile": "data-package",
  "resources": [
    {
      "path": "forms.tsv",
      "stats": {
        "hash": "91c89a7d4fe4d5d55ad8a383a64ea047",
        "bytes": 192127,
        "fields": 13,
        "rows": 2249
      },
      "control": {
        "newline": ""
      },
      "encoding": "utf-8",
      "dialect": {
        "delimiter": "\t"
      },
      "schema": {
        "fields": [
          {
            "name": "ID",
            "type": "string"
          },
          {
            "name": "Language_ID",
            "type": "string"
          },
          {
            "name": "Parameter_ID",
            "type": "integer"
          },
          {
            "name": "Value",
            "type": "string"
          },
          {
            "name": "Comment",
            "type": "any"
          },
          {
            "name": "Source",
            "type": "string"
          },
          {
            "name": "Graphemes",
            "type": "string"
          },
          {
            "name": "Profile",
            "type": "string"
          }
        ]
      },
      "name": "forms",
      "profile": "tabular-data-resource",
      "scheme": "file",
      "format": "tsv",
      "hashing": "md5",
      "compression": "no",
      "compressionPath": "",
      "query": {}
    }
  ]
}

and convert this into a csvw.TableGroup - which could then be enhanced programmatically.
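
A minimal sketch of that conversion, working on plain dicts rather than on csvw.TableGroup objects. The function name `frictionless_to_csvw` and the `TYPE_MAP` table are hypothetical, and the type mapping is an assumption for illustration, not something taken from either library:

```python
import json

# Assumed mapping from Frictionless field types to CSVW datatype names.
# Illustrative and incomplete; unknown types fall back to "string".
TYPE_MAP = {
    "string": "string",
    "integer": "integer",
    "number": "decimal",
    "boolean": "boolean",
    "date": "date",
    "datetime": "datetime",
    "any": "string",
}


def frictionless_to_csvw(package):
    """Turn a Frictionless data package descriptor (dict) into CSVW metadata (dict)."""
    tables = []
    for resource in package.get("resources", []):
        columns = [
            {
                "name": field["name"],
                "datatype": TYPE_MAP.get(field.get("type"), "string"),
            }
            for field in resource.get("schema", {}).get("fields", [])
        ]
        table = {"url": resource["path"], "tableSchema": {"columns": columns}}
        # Carry over the dialect information that CSVW also knows about.
        dialect = {}
        if "delimiter" in resource.get("dialect", {}):
            dialect["delimiter"] = resource["dialect"]["delimiter"]
        if "encoding" in resource:
            dialect["encoding"] = resource["encoding"]
        if dialect:
            table["dialect"] = dialect
        tables.append(table)
    return {"@context": "http://www.w3.org/ns/csvw", "tables": tables}


# Trimmed version of the descriptor shown above.
package = {
    "profile": "data-package",
    "resources": [
        {
            "path": "forms.tsv",
            "encoding": "utf-8",
            "dialect": {"delimiter": "\t"},
            "schema": {
                "fields": [
                    {"name": "ID", "type": "string"},
                    {"name": "Parameter_ID", "type": "integer"},
                    {"name": "Comment", "type": "any"},
                ]
            },
        }
    ],
}

csvw_metadata = frictionless_to_csvw(package)
print(json.dumps(csvw_metadata, indent=2))
```

Stats, hashing and compression info are dropped here because this sketch only keeps what is needed to read the file.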

@xrotwang
Contributor

xrotwang commented Jan 4, 2021

@brockfanning could you give an example of the kind of CSV you'd want to create CSVW for? Does it use any naming scheme to give type or foreign key hints?

@brockfanning
Author

@xrotwang In my case it's pretty simple: each data package has exactly one standalone CSV file. So I don't believe there is any concern about foreign key hints (at least in my case). We don't have any naming scheme related to types. Here's an example:

Year | Location | Value
2010 |          | 10
2011 |          | 20
2012 |          | 30
2010 | Urban    | 12
2011 | Urban    | 22
2012 | Urban    | 32
2010 | Rural    | 14
2011 | Rural    | 24
2012 | Rural    | 34

@xrotwang
Contributor

xrotwang commented Jan 5, 2021

Ok, so below is what frictionless describe makes of such data. It's enough to use csvw to read the data correctly, i.e. we have the info about

  • file path
  • CSV dialect
  • column names and datatypes

Once we can read the data with csvw, we could add information like

  • required - if there are no empty values for a column
  • unique
  • minimum or maximum.

OTOH, that would make the data more difficult to edit - e.g. adding a row with a value below an inferred minimum would make the data invalid.
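
As a rough sketch of how the properties listed above could be inferred, using only the standard library. The helper `infer_column_properties` is hypothetical, the sample is the example data from this thread written with the `|` delimiter that frictionless reports for it, and treating an empty string as "no value" is an assumption:

```python
import csv
import io

SAMPLE = """Year|Location|Value
2010||10
2011||20
2012||30
2010|Urban|12
2011|Urban|22
2012|Urban|32
2010|Rural|14
2011|Rural|24
2012|Rural|34
"""


def infer_column_properties(rows, column, numeric=False):
    """Infer required/unique (and minimum/maximum for integer columns)."""
    values = [row[column] for row in rows]
    non_empty = [value for value in values if value != ""]
    props = {
        # required: no row has an empty value in this column
        "required": len(non_empty) == len(values),
        # unique: no value occurs more than once
        "unique": len(set(values)) == len(values),
    }
    if numeric and non_empty:
        numbers = [int(value) for value in non_empty]
        props["minimum"] = min(numbers)
        props["maximum"] = max(numbers)
    return props


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="|"))
year_props = infer_column_properties(rows, "Year", numeric=True)
location_props = infer_column_properties(rows, "Location")
value_props = infer_column_properties(rows, "Value", numeric=True)
print(year_props, location_props, value_props)
```

On this sample only Value comes out as unique, and Location is not required because of the three rows without a location.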

So I guess, we could/should distinguish two use cases:

  1. Seeding CSVW metadata for data that is still being added to.
  2. Adding CSVW metadata to "finished" data for publication (in which case inferring properties like uniqueness, etc. would be useful).

---
metadata: test.csv
---

compression: 'no'
compressionPath: ''
control:
  newline: ''
dialect:
  delimiter: '|'
encoding: utf-8
format: csv
hashing: md5
name: test
path: test.csv
profile: tabular-data-resource
query: {}
schema:
  fields:
    - name: Year
      type: integer
    - name: Location
      type: string
    - name: Value
      type: integer
scheme: file
stats:
  bytes: 131
  fields: 3
  hash: b36e8c21563ab32645052c11510bddb7
  rows: 9
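
To illustrate why these three pieces of information are enough to read the data, here is a small standard-library sketch (it does not use csvw itself). The `read_rows` helper and the `CASTS` table are hypothetical, and mapping empty cells to `None` is an assumption:

```python
import csv
import io

# Subset of the resource descriptor shown above.
descriptor = {
    "dialect": {"delimiter": "|"},
    "schema": {
        "fields": [
            {"name": "Year", "type": "integer"},
            {"name": "Location", "type": "string"},
            {"name": "Value", "type": "integer"},
        ]
    },
}

# Assumed converters per field type; anything unknown stays a string.
CASTS = {"integer": int, "string": str}


def read_rows(text, descriptor):
    """Yield rows as dicts, typed according to the descriptor's schema."""
    delimiter = descriptor["dialect"]["delimiter"]
    fields = descriptor["schema"]["fields"]
    for raw in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        row = {}
        for field in fields:
            value = raw[field["name"]]
            cast = CASTS.get(field["type"], str)
            # Empty cells become None instead of being cast.
            row[field["name"]] = cast(value) if value != "" else None
        yield row


text = "Year|Location|Value\n2010||10\n2011|Urban|22\n"
rows = list(read_rows(text, descriptor))
print(rows)
```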

@brockfanning
Author

@xrotwang Just my 2 cents: If it helps keep things simple I would be fine with assuming that all inferring of the schema, and adjustments to the schema, will happen in the Frictionless object, before it gets to this library. For example if the schema needs constraints added, they'd be added according to the Frictionless table schema. In that case this library could focus on converting (as faithfully as possible) that metadata to meet the CSVW spec.

Side note: I agree with your distinction of use cases. I expect many providers want to publish CSVW just to make their data "interoperable", while others want the metadata for validation so that they can avoid maintenance problems in the future. My users definitely need the "infer from CSV" interoperability approach (hence this issue). To that end I'm imagining automating some of the adjustments you mention, like minimum/maximum, and turning values into an "enum". But again I am fine with doing all of that to the Frictionless object before sending it to this library.
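
For the "enum" part, a sketch of what that adjustment might look like on a plain Table Schema dict, before any conversion to CSVW. The helper `add_enum_constraints` and the `max_enum_size` threshold are hypothetical; this works on a dict, not on a Frictionless object:

```python
def add_enum_constraints(schema, rows, max_enum_size=5):
    """Add an enum constraint to string fields with few distinct values."""
    for field in schema["fields"]:
        if field.get("type") != "string":
            continue
        # Ignore missing values when collecting the distinct values.
        distinct = {
            row[field["name"]] for row in rows if row[field["name"]] is not None
        }
        if distinct and len(distinct) <= max_enum_size:
            field.setdefault("constraints", {})["enum"] = sorted(distinct)
    return schema


schema = {
    "fields": [
        {"name": "Year", "type": "integer"},
        {"name": "Location", "type": "string"},
        {"name": "Value", "type": "integer"},
    ]
}
rows = [
    {"Year": 2010, "Location": None, "Value": 10},
    {"Year": 2010, "Location": "Urban", "Value": 12},
    {"Year": 2010, "Location": "Rural", "Value": 14},
]

schema = add_enum_constraints(schema, rows)
print(schema["fields"][1])
```

Integer fields are left alone here; minimum and maximum would go into the same "constraints" object of a field.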

@xrotwang
Contributor

xrotwang commented Jan 5, 2021

Of course, the adjustments can also be made in the CSVW metadata - either manually editing the serialized JSON, or programmatically on the csvw.TableGroup object, which csvw.TableGroup.from_frictionless_metadata would return.
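
A sketch of the "edit the serialized JSON" route, done with the standard library instead of by hand. The metadata dict is a made-up minimal example for the test.csv file from this thread, and the bounds are simply the values seen in that sample:

```python
import json

# Hypothetical serialized CSVW metadata for test.csv.
serialized = json.dumps(
    {
        "@context": "http://www.w3.org/ns/csvw",
        "tables": [
            {
                "url": "test.csv",
                "dialect": {"delimiter": "|"},
                "tableSchema": {
                    "columns": [
                        {"name": "Year", "datatype": "integer"},
                        {"name": "Location", "datatype": "string"},
                        {"name": "Value", "datatype": "integer"},
                    ]
                },
            }
        ],
    }
)

metadata = json.loads(serialized)
for column in metadata["tables"][0]["tableSchema"]["columns"]:
    if column["name"] == "Year":
        # Mark the column as required and restrict its value range by
        # turning the plain datatype name into a derived datatype.
        column["required"] = True
        column["datatype"] = {
            "base": column["datatype"],
            "minimum": 2010,
            "maximum": 2012,
        }

print(json.dumps(metadata, indent=2))
```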

@xrotwang
Contributor

xrotwang commented Jan 5, 2021

And I agree, getting the simple case up and running would not only be the first step, but presumably useful functionality already. I have a proof-of-concept in my head :) - hope to find the time later today to push something for you to review.
