Refactor repo to be cldfbench compliant #32
This would come in very handy for our work on the sound inventory paper: I had to use some workarounds to make this dataset behave like the jipa and lapsyd inventories. It would allow us to quickly adjust the code in the inventory study to work well with both pyclts and the Inventory class, as well as the new object-style handling of pycldf. So another very nice showcase!
@LinguList Why would you think PHOIBLE is "currently not in generic CLDF", though?
Because the Language_ID is not an identifier for a variety (a doculect) but rather a Glottolog code. This is, in my opinion, not the idea we had about Language_ID in the CLDF spec, and if this is not specified, I'd say it should be adjusted.
So I have to work around this by looking up the Contribution_ID and constructing a new language ID before I get the expected behavior that allows me to use pycldf.objects("LanguageTable").
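A minimal sketch of the kind of workaround described here, not PHOIBLE's actual code: one Glottocode can cover several contributions (inventories), so a doculect-level ID has to be constructed from Language_ID plus Contribution_ID. The column names follow the layout of cldf-datasets/phoible's values.csv; the sample rows themselves are invented.

```python
# Sketch: build doculect-level language IDs from (Language_ID, Contribution_ID).
# Column names assume phoible's values.csv layout; the rows are invented.
import csv
import io
from collections import defaultdict

SAMPLE = """\
ID,Language_ID,Parameter_ID,Value,Contribution_ID
1,kore1280,par1,p,411
2,kore1280,par2,t,411
3,kore1280,par1,p,422
"""

# One Glottocode (kore1280) spans two contributions, i.e. two inventories.
inventories = defaultdict(set)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    doculect_id = "{}-{}".format(row["Language_ID"], row["Contribution_ID"])
    inventories[doculect_id].add(row["Value"])

print(sorted(inventories))  # ['kore1280-411', 'kore1280-422']
```

With real data, the SAMPLE string would be replaced by reading values.csv; the grouping logic stays the same.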
I think the "expectation" that PHOIBLE would list exactly one inventory per variety listed in the language table does not hold. So it can neither be assumed that entries in the value table belong to exactly one inventory per language. I see this issue as comparable to APiCS' vs. WALS' features: while WALS only allows one value per (feature, language) pair, APiCS allows multiple. But you'd have to read up on the design of each project to understand what that means, and the same goes for non-CLDF-specified details such as the "Frequency" column in APiCS.
Isn't that because phoible has in some cases multiple doculects for the same language ID? https://raw.githubusercontent.com/cldf-datasets/phoible/master/cldf/values.csv And I always like clicking on URLs that are private. :)
@LinguList I understand that it would be nice to shave off what seems like "special-case handling", and also the frustration that, despite CLDF, there are still many different ways to model what seems like very similar data. But the appearance that PHOIBLE is the special case here (the "non-generic" one) largely comes from the fact that the other datasets you look at are custom-built by the same people for your particular use cases.
@bambooforest, to repeat the argument: the current CLDF representation uses the Glottocode as the Language_ID.
So, in general, I think the "know your data" principle still holds, even in the CLDF era. But I still hope that CLDF makes it simpler to "get to know your data".
@xrotwang, I would say that the particular use cases outnumber the non-particular datasets by now. We already have three phoneme inventory datasets, and all lexical datasets in lexibank conform to the expectation that the Language_ID is not the Glottocode.
Whether the current phoible counts as real CLDF or not is secondary to me, but I think that for the sake of comparability it would be useful to adjust the current language table, as that would make the data more consistent with many other datasets that we have already coded, including the ones we collect in lexibank.
The question is not whether "the Language_ID is the Glottocode", but rather whether it can be assumed that there's only one value/measurement per (language, parameter) pair. And I think there are tons of valid use cases where this is not the case; see the APiCS example above.
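The WALS-vs-APiCS distinction can be checked mechanically by counting values per (Language_ID, Parameter_ID) pair. A minimal sketch with invented rows:

```python
# Sketch: WALS-style data has at most one value per (language, parameter)
# pair; APiCS-style data may have several. All IDs and values are invented.
from collections import Counter

rows = [
    {"Language_ID": "lang1", "Parameter_ID": "feat1", "Value": "a"},
    {"Language_ID": "lang1", "Parameter_ID": "feat1", "Value": "b"},
    {"Language_ID": "lang2", "Parameter_ID": "feat1", "Value": "a"},
]

counts = Counter((r["Language_ID"], r["Parameter_ID"]) for r in rows)
multivalued = sorted(pair for pair, n in counts.items() if n > 1)
print(multivalued)  # [('lang1', 'feat1')]
```

An empty `multivalued` list would indicate WALS-style data; any entries indicate that, as in APiCS, one (language, parameter) pair carries multiple measurements.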
Yes, homogeneous datasets would be useful for comparability. But pushing use-case-driven demands on data layout by saying something isn't "generic CLDF" doesn't seem very helpful.
Okay, sorry about the "generic cldf", it was not my intention to open a debate about this here. My intention was rather to point out the discrepancy that I find in this particular phoible CLDF dataset with respect to the definition of language identifiers and the inventories.
And with respect to this CLDF structure, I'd say that it would make more sense to adjust the language identifiers to represent one inventory each.
The thing is, the current CLDF structure does have a use case, one which predates any other use case: feeding the clld web application. So there's more to take into account when thinking about what makes sense than your new use case. Right now, CLDF doesn't specify which parameter might correspond to a phoneme in an inventory, nor whether it makes sense for such a parameter to have multiple values per language. So on pure CLDF grounds, any such assumption cannot be justified. So it comes down to "it would save you a couple lines of code if PHOIBLE CLDF were modeled differently". To which I might respond: "it would take me a couple lines of code if PHOIBLE CLDF were modeled differently".
If I recall correctly, I created the phoible CLDF data in accordance with @xrotwang's suggestions. We have an InventoryID in our data that maps in a one-to-many relationship to the doculect(s) that went into that inventory. This mainly comes from databases like UPSID. Language_ID is then mapped to the Glottolog code, since, as far as I understand, that's a language (name) identifier and it's more fine-grained than an ISO code in our circumstances.
Btw. I also wouldn't find it completely outlandish if PHOIBLE had only one inventory per language.
I understand that perspective completely. But just to add to this: by "generic cldf", I merely meant "the CLDF we have been using for sound inventories with cldfbench by now". Since I think that we can in the future build a rather "generic" data type here, similar to wordlists, that has some characteristics of its own, I am quite interested in seeing how this can be further unified. The "generic cldf" was just bad wording, not meant to sound disrespectful or anything.

But in favor of the structure I propose, I would say that we have a growing number of datasets coming in which show some similar structures, which have quite a few advantages, and this does not only amount to the four datasets we have in CLDF for inventories, but also to lexibank datasets (beidasinitic, allenbai), where I extracted inventory data and added it to the lexibank datasets as a structure dataset, after @xrotwang showed me how best to do so. In the long run, the fact that we collect different datasets in CLDF may thus even feed back into phoible (if @bambooforest is interested in fresh inventories for Bai, Sinitic, and in the future also Hmong-Mien). If you consider that this should rather be discussed and done in the future, it is also fine by me. But when I saw that there is a plan of making a cldfbench script for phoible, I thought it was useful to start discussing how to handle inventory data in CLDF. And I'd definitely argue that a convenient structure for inventory datasets is currently emerging.
I think I wanted to have the first CLDF version fairly minimal. But in hindsight, in particular given the experience with clld, we should have added a ContributionTable. I would not want to merge this info into the LanguageTable, though. With a "Contribution" component, it may also be simpler to convey that a dataset is already an aggregation.

@LinguList This may still mean two cases you need to handle: if "ContributionTable" is present, you have to aggregate phonemes per contribution; if not, you can keep the simpler code. But at least the "special case" would have broader applicability, and wouldn't be called a workaround.

Of course, it would still be a leap to assume that "Contribution" always signals the doculect level. E.g. in WALS, a contribution is always all datapoints for just one parameter, not the other way round, as in APiCS. But if we know something is an inventory, the assumption may be valid, assuming there are no inventories where one phoneme comes from one contribution and others from a different one.
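The two cases described here could be handled by one grouping function whose key depends on whether a ContributionTable is present. A sketch under those assumptions; the function name and sample data are hypothetical:

```python
# Sketch of the two aggregation cases: with a ContributionTable, a
# (Language_ID, Contribution_ID) pair identifies one inventory; without
# one, Language_ID alone does. Names and sample rows are hypothetical.
from collections import defaultdict

def phonemes_per_inventory(rows, has_contribution_table):
    if has_contribution_table:
        key = lambda r: (r["Language_ID"], r["Contribution_ID"])
    else:
        key = lambda r: r["Language_ID"]
    inventories = defaultdict(set)
    for row in rows:
        inventories[key(row)].add(row["Value"])
    return dict(inventories)

rows = [
    {"Language_ID": "abcd1234", "Contribution_ID": "1", "Value": "p"},
    {"Language_ID": "abcd1234", "Contribution_ID": "2", "Value": "b"},
]
print(phonemes_per_inventory(rows, True))
# {('abcd1234', '1'): {'p'}, ('abcd1234', '2'): {'b'}}
print(phonemes_per_inventory(rows, False))
# one merged inventory keyed by 'abcd1234'
```

Without a ContributionTable the two rows collapse into one inventory, which is exactly the ambiguity the "Contribution" component would make explicit.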
Note also that my team is working on CLDF transformations of:

https://github.com/segbo-db/segbo
https://github.com/bdproto/bdproto

which may have the same multiple-doculects issue that phoible raises.

@LinguList -- always happy to incorporate more data and give credit where credit is due. You've probably also seen the inventory data by San Duanmu: http://www-personal.umich.edu/~duanmu/Duanmu2015PIDOC.pdf which he once sent me in a spreadsheet. POI.
Yes, I am fine with this solution. What I ask myself now, however (sorry @bambooforest for using this issue for non-phoible-related discussions): should we then also convert the other datasets that have been prepared in CLDF to this form? I mean, it is cheaper to do it now.
Here's what I propose:
Ah, and @bambooforest, we provide links to phoible, bdproto, and segbo in our most recent CLTS version, to be published now. These could be included as additional parameter information in bdproto and segbo as well (see https://github.com/cldf-datasets/jipa/blob/main/cldf/features.csv for an example of how features currently link to CLTS).
- `create.py` should be refactored to be the `makecldf` method of a `cldfbench.Dataset`. This will make it easier
- The raw data from `phoible/dev` should be pulled in as a git submodule.