
Refactor repo to be cldfbench compliant #32

Closed
xrotwang opened this issue Jan 19, 2021 · 25 comments · Fixed by #35

@xrotwang (Member)

xrotwang commented Jan 19, 2021

create.py should be refactored to be the makecldf method of a cldfbench.Dataset. This will make it easier

  • to tie in versioned catalog data from Glottolog
  • to create metadata for Zenodo

The raw data from phoible/dev should be pulled in as git submodule.
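
Below is a minimal sketch of what the refactored entry point could look like, assuming the standard cldfbench API (cldf_specs, cmd_makecldf); the class name, the raw file name, and the ported logic are placeholders:

import pathlib

from cldfbench import CLDFSpec, Dataset


class PhoibleDataset(Dataset):  # placeholder name
    dir = pathlib.Path(__file__).parent
    id = "phoible"

    def cldf_specs(self):
        return CLDFSpec(dir=self.cldf_dir, module="StructureDataset")

    def cmd_makecldf(self, args):
        # args.glottolog exposes a pinned Glottolog catalog checkout -
        # this is what ties versioned catalog data into the build.
        args.writer.cldf.add_component("LanguageTable")
        # raw data comes from the phoible/dev git submodule under raw/
        for row in self.raw_dir.read_csv("phoible.csv", dicts=True):
            ...  # port the logic from create.py here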

@xrotwang xrotwang self-assigned this Jan 19, 2021
@LinguList

This would come in very handy for our work on the sound inventory paper: I had to go through some workarounds to handle PHOIBLE similarly to the jipa and lapsyd inventories.

https://github.com/cldf-datasets/inventory-study/blob/ddc464ce2175d16617bb1476af7478bfa6cad56a/prepare.py#L88-L122

It would allow us to quickly adjust the code in the inventory study to work well with both pyclts and the Inventory-class, as well as the new object-style handling of pycldf. So another very nice showcase!

@xrotwang (Member Author)

@LinguList Why would you think PHOIBLE is "currently not in generic CLDF", though?

@LinguList

Because the Language_ID is not an identifier for a variety (a doculect) but rather a Glottolog code. This is, in my opinion, not the idea we had about Language_ID in the CLDF spec, and if this is not specified, I'd say it should be adjusted.

@LinguList

So I have to work around this by looking at the Contribution_ID and constructing a new language ID before I get the expected behavior that would allow me to use pycldf.objects("LanguageTable").
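
(For context, a sketch of the object-style access in question, assuming the standard metadata filename:

import pycldf

ds = pycldf.Dataset.from_metadata("cldf/StructureDataset-metadata.json")
for language in ds.objects("LanguageTable"):
    # with one inventory per LanguageTable row, each ORM object
    # would correspond to one doculect
    print(language.cldf.id, language.cldf.name)

)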

@xrotwang (Member Author)

I think the "expectation" that PHOIBLE would list exactly one inventory per variety listed in LanguageTable is actually non-generic. In fact, since PHOIBLE is already an aggregation of inventories from different sources, I'd rather expect the opposite.

So it can neither be assumed that entries in LanguageTable correspond one-to-one to Glottolog languoids, nor that they are the most fine-grained division of objects-under-study in a CLDF dataset.

I see this issue as comparable to APiCS' vs. WALS features: While WALS only allows one value per (feature, language) pair, APiCS allows multiple. But you'd have to read up on the design of each project to understand what that means, and for non-CLDF-specified details such as the "Frequency" column in ValueTable. So if you'd want to aggregate data from WALS and APiCS, you'd have to make an informed decision about which data model you want to go with. It's not a question of which one is "generic CLDF" or not.

@bambooforest (Collaborator)

Isn't that because phoible has in some cases multiple doculects for the same language ID?

https://raw.githubusercontent.com/cldf-datasets/phoible/master/cldf/values.csv

And I always like clicking on URLs that are private. :)

@LinguList

# Imports inferred from usage in the linked prepare.py; "nfd" and
# "progressbar" are small helpers, reconstructed here as stand-ins.
from collections import defaultdict
from pathlib import Path
from unicodedata import normalize

import pybtex.database
import pycldf
from pycldf.sources import Source
from pyclts import CLTS
from tqdm import tqdm as progressbar


def nfd(string):
    # compare segments in NFD normalization
    return normalize("NFD", string)


def get_phoible_varieties(
    subsets,
    path=Path.home().joinpath(
        "data", "datasets", "cldf", "cldf-datasets", "phoible", "cldf"
    ),
):
    """
    Load phoible data (currently not in generic CLDF).
    """
    bipa = CLTS().bipa  # BIPA transcription system (used further down in prepare.py)
    phoible = pycldf.Dataset.from_metadata(
        path.joinpath("StructureDataset-metadata.json")
    )
    # read sources.bib and index the references by citation key
    bib = pybtex.database.parse_file(
        path.joinpath("sources.bib").as_posix(), bib_format="bibtex"
    )
    bib = {
        source.id: source
        for source in (Source.from_entry(k, e) for k, e in bib.entries.items())
    }
    gcodes = {row["ID"]: row for row in phoible.iter_rows("LanguageTable")}
    params = {row["Name"]: row for row in phoible.iter_rows("ParameterTable")}
    contributions = {
        row["ID"]: row["Contributor_ID"]
        for row in phoible.iter_rows("contributions.csv")
    }
    languages = {}
    varieties = defaultdict(list)
    sources = defaultdict(set)
    for row in progressbar(phoible.iter_rows("ValueTable"), desc="load values"):
        if contributions[row["Contribution_ID"]] in subsets:
            # the workaround: build a variety-level ID from Glottocode + contribution
            lid = row["Language_ID"] + "-" + row["Contribution_ID"]
            varieties[lid] += [nfd(row["Value"])]
            languages[lid] = gcodes[row["Language_ID"]]
            source = row["Source"][0] if row["Source"] else ""
            sources[lid].add(source)
    return languages, params, varieties, sources, bib

@xrotwang (Member Author)

@LinguList I understand that it would be nice to shave off what seems like "special-case handling", and also the frustration that, despite CLDF, there are still many different ways to model what seems like very similar data. But the appearance that PHOIBLE is the special case here (the "non-generic" one) largely comes from the fact that the other datasets you look at are custom-built by the same people for your particular use cases.

@LinguList

@bambooforest, to repeat the argument: the current CLDF representation uses the Glottocode as Language_ID. However, in our CLDF specs, the Language_ID is usually an internal identifier that links a language variety (a doculect) to the data for this doculect, while the Glottocode is something extra. To get this behavior, one now has to load all data and assign new Language_IDs by combining, e.g., the Contribution_ID with the Glottocode.

@xrotwang (Member Author)

So, in general, I think the "know your data" principle still holds, even in the CLDF era. But I still hope that CLDF makes it simpler to "get to know your data".

@LinguList

@xrotwang, I would say that the particular use cases outnumber the non-particular datasets by now. We already have three phoneme inventory datasets, and all lexical datasets in lexibank conform to the expectation that the Language_ID is not the Glottocode.

@LinguList

Whether the current phoible counts as real CLDF or not is secondary to me, but I think that for the sake of comparability it would be useful to adjust the current language table, as that would make the data more consistent with many other datasets we have already coded, including the ones we collect in cldf-datasets. Or is there a good argument to keep Glottocodes as language identifiers in Phoible?

@xrotwang (Member Author)

The question is not whether "the Language_ID is the Glottocode", but rather whether it can be assumed that there's only one value/measurement per (language, parameter) pair. And I think there are tons of valid use cases where this is not the case; see the APiCS example above.
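
(One can check this directly - a sketch, assuming the standard metadata path: counting ValueTable rows per (language, parameter) pair yields counts above one wherever several inventories cover the same language.

from collections import Counter

import pycldf

ds = pycldf.Dataset.from_metadata("cldf/StructureDataset-metadata.json")
pairs = Counter(
    (row["Language_ID"], row["Parameter_ID"])
    for row in ds.iter_rows("ValueTable")
)
# counts > 1 mean a (language, phoneme) datapoint is reported by
# more than one contributing inventory
print(pairs.most_common(5))

)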

@xrotwang (Member Author)

xrotwang commented Jan 19, 2021

Yes, homogeneous datasets would be useful for comparability. But pushing use-case-driven demands on data layout by saying something isn't "generic CLDF" doesn't seem very helpful.

@LinguList

Okay, sorry about "generic cldf"; it was not my intention to open a debate about this here. My intention was rather to point out the discrepancy I see in this particular Phoible CLDF dataset with respect to the definition of language identifiers and the inventories.

@LinguList

And with respect to this CLDF structure, I'd say that it would make more sense to adjust the language identifiers to represent one inventory each.

@xrotwang (Member Author)

The thing is, the current CLDF structure does have a use case - one which predates any other: feeding the clld web application. So there's more to take into account when thinking about what makes sense than your new use case.

Right now, CLDF doesn't specify which parameter might correspond to a phoneme in an inventory, nor whether it makes sense for such a parameter to have multiple values per language. So on pure CLDF grounds, any such assumption cannot be justified. So it comes down to "it would save you a couple lines of code if PHOIBLE CLDF were modeled differently". To which I might respond "it would cost me a couple lines of code if PHOIBLE CLDF were modeled differently".

@bambooforest (Collaborator)

If I recall correctly, I created the phoible CLDF data in accordance with @xrotwang's suggestions.

We have an InventoryID in our data that maps in a one-to-many relationship to the doculect(s) that went into that inventory. This mainly concerns databases like UPSID.

LanguageID is then mapped to the Glottolog code, since, as far as I understand, that's a language (name) identifier, and it's more fine-grained than an ISO code in our circumstances.

@xrotwang (Member Author)

Btw. I also wouldn't find it completely outlandish if PHOIBLE had only one Value for "Phoneme X being attested in language Y", and then linked this value to one or more source inventories - hiding the "doculect" aspect even more. So I really think there is no valid generic assumption about what inventory data should look like. We might still try to come up with a specification for the next version of CLDF, but until that happens, I'd consider the current PHOIBLE CLDF a valid alternative - not something already non-standard.

@LinguList

I understand that perspective completely.

But just to add to this: by "generic cldf", I merely meant "the CLDF we have been using for sound inventories with cldfbench so far". Since I think that we can in the future build a rather "generic" data type here, similar to wordlists, with some characteristics of its own, I am quite interested in seeing how this can be further unified. The "generic cldf" was just bad wording, not meant to sound disrespectful or anything.

But in favor of the structure I propose: we have an increasing number of incoming datasets that share similar structures, with quite a few advantages. This does not only include the four inventory datasets we have in CLDF, but also lexibank datasets (beidasinitic, allenbai), from which I extracted inventory data and added it as a structure dataset after @xrotwang showed me how best to do so.

In the long run, the fact that we collect different datasets in CLDF may thus even feed back into phoible (if @bambooforest is interested in fresh inventories for Bai, Sinitic, and in the future also Hmong-Mien).

If you think this should rather be discussed and done in the future, that is also fine by me. But when I saw that there is a plan to make a cldfbench script for Phoible, I thought it would be useful to start discussing how to handle inventory data in CLDF.

And I'd definitely argue that a convenient structure for inventory datasets is currently emerging.

@xrotwang (Member Author)

xrotwang commented Jan 19, 2021

I think I wanted to have the first CLDF version fairly minimal. But in hindsight - in particular given the experience with clld - we should have added a Contribution component right away. So now, I'd say this should be added to the next CLDF iteration. With a "contribution" context it would seem a lot easier to argue that there can be only one datapoint of the "has phoneme X" kind, i.e. that it would be a logical error if "has X" and "not has X" appeared in the same contribution.

I would not want to merge this info into LanguageTable. In the PHOIBLE case this would add quite a bit of duplication. Also, my interpretation of "what goes into LanguageTable" is merely "whatever needs to be put as a dot on maps".

With a "Contribution" component, it may also be simpler to convey that a dataset is already an aggregation.

@LinguList This may still mean two cases you need to handle: if "ContributionTable" is present, you have to aggregate phonemes per Contribution; if not, you can keep the simpler code (sketched at the end of this comment). But at least the "special case" would have broader applicability - and wouldn't be called get_phoible_varieties.

Of course, it would still be a leap to assume that "Contribution" always signals the doculect level. E.g. in WALS, a contribution is always all datapoints for just one parameter - not the other way round, as in APiCS. But if we know something is an inventory, the assumption may be valid, assuming there are no inventories where one phoneme comes from one contribution and another from a different one.
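
A sketch of this two-case handling, assuming a pycldf StructureDataset (note that ContributionTable is only a proposed component at this point):

from collections import defaultdict

import pycldf


def load_inventories(ds):
    try:
        ds["ContributionTable"]  # proposed component, not yet in the spec
        per_contribution = True
    except KeyError:
        per_contribution = False
    inventories = defaultdict(list)
    for row in ds.iter_rows("ValueTable"):
        # aggregate phonemes per contribution if the dataset distinguishes
        # contributions, otherwise per language
        key = (
            (row["Language_ID"], row["Contribution_ID"])
            if per_contribution
            else row["Language_ID"]
        )
        inventories[key].append(row["Value"])
    return inventories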

@bambooforest (Collaborator)

Note also that my team is working on CLDF transformations of:

https://github.com/segbo-db/segbo

https://github.com/bdproto/bdproto

which may have the same multiple doculects issue that phoible raises.

@LinguList -- always happy to incorporate more data and give credit where credit is due. You've probably also seen the inventory data by San Duanmu:

http://www-personal.umich.edu/~duanmu/Duanmu2015PIDOC.pdf

which he once sent me in a spreadsheet. POI.

@LinguList

Yes, I am fine with this solution. What I ask myself now, however (sorry @bambooforest for using this for non-phoible-related discussion), is: should we then also convert the other datasets that have been prepared in CLDF to this form? I mean, it is cheaper to do it now.

@xrotwang (Member Author)

Here's what I propose:
cldf/cldf#102

@LinguList

Ah, and @bambooforest, we provide links to phoible, bdproto, and segbo in our most recent CLTS version, to be published now. These could be included as additional parameter information in bdproto and segbo as well (see https://github.com/cldf-datasets/jipa/blob/main/cldf/features.csv for an example of how features currently link to CLTS).
