Example of programmatically creating metadata from scratch #48
There are plenty of examples. Pretty customized metadata files are used in the concepticon project; just check the folder there.
@LinguList Thanks for the quick reply! I apologize, my question was unnecessarily vague. I'm actually looking for an example of programmatically (with Python code) creating metadata from scratch, using this library. To elaborate, I'm looking for either of these things, in order of preference:
1. an example of inferring the metadata from an existing CSV file, or
2. an example of constructing the metadata objects in code from scratch.
The first item above is my ultimate goal, but if that is not possible with this library then I could get there with the help of the second item.
Hm. For 1, i.e. inferring metadata from CSV, you may be better off using the frictionless framework - even if you'd want to go with CSVW, which would only require a simple transformation of the JSON metadata.
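For anyone landing here later: the kind of schema inference frictionless performs can be approximated with the standard library alone. This is a deliberately naive sketch (the function names are made up for illustration, and the type detection only distinguishes integer/number/string), not the actual frictionless implementation:

```python
import csv

def infer_type(values):
    # Naive type detection: try integer, then number, else fall back to string.
    for caster, name in ((int, "integer"), (float, "number")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(path, delimiter=","):
    # Read the CSV and guess a frictionless-style field type per column.
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [()] * len(header)
    return {
        "fields": [
            {"name": name, "type": infer_type(col)}
            for name, col in zip(header, columns)
        ]
    }
```

Run against a file with Year, Location and Value columns like the test.csv discussed further down in this thread, this would report integer, string and integer respectively.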
@xrotwang That's a great lead, thank you! Do you happen to know of any Frictionless-to-CSVW metadata converters out there? As you say it should be simple, though I would need to ramp up on both specs.
I'd be willing to help with writing the converter :)
@xrotwang That's amazing! Should I kick things off by creating a separate repo for this, and then ping you when I get stuck? Or were you imagining this being added to an existing repo?
As far as I'm concerned, such a converter could live in this repo as well, possibly as a class method.
That works for me. I'm not sure where to start here, so any help you can provide, even if only partial progress, would be greatly appreciated!
Will try to make a start this week.
Skimming the documentation at https://frictionlessdata.io/tooling/python/describing-data/#describing-schema it seems that describing a file like forms.tsv yields metadata along these lines:

```json
{
  "profile": "data-package",
  "resources": [
    {
      "path": "forms.tsv",
      "stats": {
        "hash": "91c89a7d4fe4d5d55ad8a383a64ea047",
        "bytes": 192127,
        "fields": 13,
        "rows": 2249
      },
      "control": {"newline": ""},
      "encoding": "utf-8",
      "dialect": {"delimiter": "\t"},
      "schema": {
        "fields": [
          {"name": "ID", "type": "string"},
          {"name": "Language_ID", "type": "string"},
          {"name": "Parameter_ID", "type": "integer"},
          {"name": "Value", "type": "string"},
          {"name": "Comment", "type": "any"},
          {"name": "Source", "type": "string"},
          {"name": "Graphemes", "type": "string"},
          {"name": "Profile", "type": "string"}
        ]
      },
      "name": "forms",
      "profile": "tabular-data-resource",
      "scheme": "file",
      "format": "tsv",
      "hashing": "md5",
      "compression": "no",
      "compressionPath": "",
      "query": {}
    }
  ]
}
```

The task, then, would be to convert this into the corresponding CSVW metadata.
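A first building block for such a conversion could be a mapping from frictionless field types (as they appear in the descriptor above) to CSVW datatypes. The mapping below is my own assumption, sketched from reading the two specs; it is incomplete and not something either library ships:

```python
# Assumed correspondence between frictionless field types and CSVW
# built-in datatypes; partial, for illustration only.
FRICTIONLESS_TO_CSVW = {
    "string": "string",
    "integer": "integer",
    "number": "decimal",
    "boolean": "boolean",
    "date": "date",
    "time": "time",
    "datetime": "datetime",
    "any": "any",
}

def convert_field(field):
    # Turn one frictionless field descriptor into a CSVW column spec,
    # falling back to "string" for unknown types.
    return {
        "name": field["name"],
        "datatype": FRICTIONLESS_TO_CSVW.get(field["type"], "string"),
    }
```

With this, the Parameter_ID field above would come out as a CSVW column with datatype "integer".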
@brockfanning could you give an example of the kind of CSV you'd want to create CSVW for? Does it use any naming scheme to give type or foreign key hints?
@xrotwang In my case it's pretty simple: each data package has exactly one standalone CSV file. So I don't believe there is any concern about foreign key hints (at least in my case). We don't have any naming scheme related to types. Here's an example:
Ok, so below is what we get when describing your example file.
Once we can read the data, we could also infer stricter constraints from the actual values.
OTOH, that would make the data more difficult to edit - e.g. adding a row with a value below an inferred minimum would make the data invalid. So I guess we could/should distinguish two use cases: inferring metadata just to make the data interoperable, and inferring (and then tightening) metadata to validate the data.
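For the validation use case, the inferred schema could be tightened from the actual column values. A rough sketch, where the constraint names follow the frictionless table-schema spec and the enum threshold is an arbitrary choice of mine:

```python
def add_constraints(field, values, enum_threshold=10):
    # Attach value-derived constraints to an inferred field descriptor.
    if field["type"] == "integer":
        nums = [int(v) for v in values]
        field["constraints"] = {"minimum": min(nums), "maximum": max(nums)}
    elif field["type"] == "string":
        distinct = sorted(set(values))
        # Few distinct values: treat the column as an enumeration.
        if len(distinct) <= enum_threshold:
            field["constraints"] = {"enum": distinct}
    return field
```

As noted above, such constraints cut both ways: a new row with a value below the inferred minimum would turn previously valid data invalid.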
```yaml
---
metadata: test.csv
---
compression: 'no'
compressionPath: ''
control:
  newline: ''
dialect:
  delimiter: '|'
encoding: utf-8
format: csv
hashing: md5
name: test
path: test.csv
profile: tabular-data-resource
query: {}
schema:
  fields:
    - name: Year
      type: integer
    - name: Location
      type: string
    - name: Value
      type: integer
scheme: file
stats:
  bytes: 131
  fields: 3
  hash: b36e8c21563ab32645052c11510bddb7
  rows: 9
```
@xrotwang Just my 2 cents: if it helps keep things simple, I would be fine with assuming that all inference of the schema, and all adjustments to it, happen in the Frictionless object before it gets to this library. For example, if the schema needs constraints added, they'd be added according to the Frictionless table schema. This library could then focus on converting that metadata, as faithfully as possible, to the CSVW spec.

Side note: I agree with your distinction of use cases. I expect many providers want to publish CSVW just to make their data "interoperable", while others want the metadata for validation, so that they can avoid maintenance problems in the future. My users definitely need the "infer from CSV" interoperability approach (hence this issue). To that end I'm imagining automating some of the adjustments you mention, like minimum/maximum, and turning values into an "enum". But again, I am fine with doing all of that to the Frictionless object before sending it to this library.
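Under that division of labour (frictionless infers and adjusts, this library translates faithfully), the translation step could look roughly like the sketch below. The function name and type mapping are made up; CSVW does allow minimum/maximum on a datatype description, but please treat the details as assumptions rather than as this library's API:

```python
def frictionless_to_csvw(descriptor):
    # Translate a frictionless data-package descriptor into a
    # CSVW TableGroup-style metadata dict (partial sketch).
    typemap = {"integer": "integer", "number": "decimal", "any": "any"}
    tables = []
    for res in descriptor.get("resources", []):
        columns = []
        for field in res["schema"]["fields"]:
            datatype = typemap.get(field["type"], "string")
            constraints = field.get("constraints", {})
            if any(k in constraints for k in ("minimum", "maximum")):
                # CSVW expresses value bounds on the datatype description.
                datatype = {"base": datatype,
                            **{k: v for k, v in constraints.items()
                               if k in ("minimum", "maximum")}}
            columns.append({"name": field["name"], "datatype": datatype})
        table = {"url": res["path"], "tableSchema": {"columns": columns}}
        delimiter = res.get("dialect", {}).get("delimiter")
        if delimiter and delimiter != ",":
            # Non-default delimiters need a CSVW dialect description.
            table["dialect"] = {"delimiter": delimiter}
        tables.append(table)
    return {"@context": "http://www.w3.org/ns/csvw", "tables": tables}
```

Fed the forms.tsv descriptor from earlier in the thread, this would emit a table group with one table, a tab delimiter dialect, and one column per field.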
Of course, the adjustments can also be made in the CSVW metadata - either by manually editing the serialized JSON, or programmatically on the object in Python.
And I agree, getting the simple case up and running would not only be the first step, but presumably useful functionality already. I have a proof-of-concept in my head :) - hope to find the time later today to push it for you to review.
Hello,
I was wondering if it would be possible to provide an example of creating the metadata from scratch. My goal is to create a metadata file by programmatically examining a CSV file to determine the schema. No worries if this is out of scope for the library.
Thank you!