
Refactor repo to be cldfbench compliant #32

Closed
xrotwang opened this issue Jan 19, 2021 · 25 comments · Fixed by #35

@xrotwang (Member)

xrotwang commented Jan 19, 2021

create.py should be refactored to be the makecldf method of a cldfbench.Dataset. This will make it easier

  • to tie in versioned catalog data from Glottolog
  • to create metadata for Zenodo

The raw data from phoible/dev should be pulled in as git submodule.
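
Below is a minimal sketch of what the refactored entry point could look like, assuming the standard cldfbench API (cldf_specs, cmd_makecldf); the class name, the raw file name, and the ported logic are placeholders:

import pathlib

from cldfbench import CLDFSpec, Dataset


class PhoibleDataset(Dataset):  # placeholder name
    dir = pathlib.Path(__file__).parent
    id = "phoible"

    def cldf_specs(self):
        return CLDFSpec(dir=self.cldf_dir, module="StructureDataset")

    def cmd_makecldf(self, args):
        # args.glottolog exposes a pinned Glottolog catalog checkout -
        # this is what ties versioned catalog data into the build.
        args.writer.cldf.add_component("LanguageTable")
        # raw data comes from the phoible/dev git submodule under raw/
        for row in self.raw_dir.read_csv("phoible.csv", dicts=True):
            ...  # port the logic from create.py here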

@xrotwang xrotwang self-assigned this Jan 19, 2021
@LinguList

This would come in very handy for our work on the sound inventory paper: I had to go through some workarounds to handle PHOIBLE similarly to the jipa and lapsyd inventories.

https://github.com/cldf-datasets/inventory-study/blob/ddc464ce2175d16617bb1476af7478bfa6cad56a/prepare.py#L88-L122

It would allow us to quickly adjust the code in the inventory study to work well with both pyclts and the Inventory-class, as well as the new object-style handling of pycldf. So another very nice showcase!

@xrotwang (Member Author)

@LinguList Why would you think PHOIBLE is "currently not in generic CLDF", though?

@LinguList

Because the Language_ID is not an identifier for a variety (a doculect) but rather a Glottolog code. This is, in my opinion, not the idea we had about Language_ID in the CLDF spec, and if this is not specified, I'd say it should be adjusted.

@LinguList

So I have to work around this by looking at the Contribution_ID and constructing a new language ID before I get the expected behavior that would allow me to use pycldf.objects("LanguageTable").
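
(For context, a sketch of the object-style access in question, assuming the standard metadata filename:

import pycldf

ds = pycldf.Dataset.from_metadata("cldf/StructureDataset-metadata.json")
for language in ds.objects("LanguageTable"):
    # with one inventory per LanguageTable row, each ORM object
    # would correspond to one doculect
    print(language.cldf.id, language.cldf.name)

)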

@xrotwang (Member Author)

I think the "expectation" that PHOIBLE would list exactly one inventory per variety listed in LanguageTable is actually non-generic. In fact, since PHOIBLE is already an aggregation of inventories from different sources, I'd rather expect the opposite.

So it can neither be assumed that entries in LanguageTable correspond one-to-one to Glottolog languoids, nor that they are the most fine-grained division of objects-under-study in a CLDF dataset.

I see this issue as comparable to APiCS' vs. WALS features: While WALS only allows one value per (feature, language) pair, APiCS allows multiple. But you'd have to read up on the design of each project to understand what that means, and for non-CLDF-specified details such as the "Frequency" column in ValueTable. So if you'd want to aggregate data from WALS and APiCS, you'd have to make an informed decision about which data model you want to go with. It's not a question of which one is "generic CLDF" or not.

@bambooforest (Collaborator)

Isn't that because phoible has in some cases multiple doculects for the same language ID?

https://raw.githubusercontent.com/cldf-datasets/phoible/master/cldf/values.csv

And I always like clicking on URLs that are private. :)

@LinguList

# Imports inferred from usage in the linked prepare.py; "nfd" and
# "progressbar" are small helpers, reconstructed here as stand-ins.
from collections import defaultdict
from pathlib import Path
from unicodedata import normalize

import pybtex.database
import pycldf
from pycldf.sources import Source
from pyclts import CLTS
from tqdm import tqdm as progressbar


def nfd(string):
    # compare segments in NFD normalization
    return normalize("NFD", string)


def get_phoible_varieties(
    subsets,
    path=Path.home().joinpath(
        "data", "datasets", "cldf", "cldf-datasets", "phoible", "cldf"
    ),
):
    """
    Load phoible data (currently not in generic CLDF).
    """
    bipa = CLTS().bipa  # BIPA transcription system (used further down in prepare.py)
    phoible = pycldf.Dataset.from_metadata(
        path.joinpath("StructureDataset-metadata.json")
    )
    # read sources.bib and index the references by citation key
    bib = pybtex.database.parse_file(
        path.joinpath("sources.bib").as_posix(), bib_format="bibtex"
    )
    bib = {
        source.id: source
        for source in (Source.from_entry(k, e) for k, e in bib.entries.items())
    }
    gcodes = {row["ID"]: row for row in phoible.iter_rows("LanguageTable")}
    params = {row["Name"]: row for row in phoible.iter_rows("ParameterTable")}
    contributions = {
        row["ID"]: row["Contributor_ID"]
        for row in phoible.iter_rows("contributions.csv")
    }
    languages = {}
    varieties = defaultdict(list)
    sources = defaultdict(set)
    for row in progressbar(phoible.iter_rows("ValueTable"), desc="load values"):
        if contributions[row["Contribution_ID"]] in subsets:
            # the workaround: build a variety-level ID from Glottocode + contribution
            lid = row["Language_ID"] + "-" + row["Contribution_ID"]
            varieties[lid] += [nfd(row["Value"])]
            languages[lid] = gcodes[row["Language_ID"]]
            source = row["Source"][0] if row["Source"] else ""
            sources[lid].add(source)
    return languages, params, varieties, sources, bib

@xrotwang (Member Author)

@LinguList I understand that it would be nice to shave off what seems like "special-case handling", and also the frustration that, despite CLDF, there are still many different ways to model what seems like very similar data. But the appearance that PHOIBLE is the special case here (the "non-generic" one) largely comes from the fact that the other datasets you look at are custom-built by the same people for your particular use cases.

@LinguList

@bambooforest, to repeat the argument: the current CLDF representation uses the Glottocode as Language_ID. However, in our CLDF specs, the Language_ID is usually an internal identifier that links a language variety (a doculect) to the data for this doculect, while the Glottocode is something extra. To get this behavior, one now has to load all data and assign new Language_IDs by combining, e.g., the Contribution_ID with the Glottocode.

@xrotwang (Member Author)

So, in general, I think the "know your data" principle still holds, even in the CLDF era. But I still hope that CLDF makes it simpler to "get to know your data".

@LinguList

@xrotwang, I would say that the particular use cases outnumber the non-particular datasets by now. We already have three phoneme inventory datasets, and all lexical datasets in lexibank conform to the expectation that the Language_ID is not the Glottocode.

@LinguList

Whether the current phoible counts as real CLDF or not is secondary to me, but I think that for the sake of comparability it would be useful to adjust the current language table, as that would make the data more consistent with many other datasets we have already coded, including the ones we collect in cldf-datasets. Or is there a good argument to keep Glottocodes as language identifiers in Phoible?

@xrotwang (Member Author)

The question is not whether "the Language_ID is the Glottocode", but rather whether it can be assumed that there's only one value/measurement per (language, parameter) pair. And I think there are tons of valid use cases where this is not the case; see the APiCS example above.
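
(One can check this directly - a sketch, assuming the standard metadata path: counting ValueTable rows per (language, parameter) pair yields counts above one wherever several inventories cover the same language.

from collections import Counter

import pycldf

ds = pycldf.Dataset.from_metadata("cldf/StructureDataset-metadata.json")
pairs = Counter(
    (row["Language_ID"], row["Parameter_ID"])
    for row in ds.iter_rows("ValueTable")
)
# counts > 1 mean a (language, phoneme) datapoint is reported by
# more than one contributing inventory
print(pairs.most_common(5))

)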

@xrotwang (Member Author)

xrotwang commented Jan 19, 2021

Yes, homogeneous datasets would be useful for comparability. But pushing use-case-driven demands on data layout by saying something isn't "generic CLDF" doesn't seem very helpful.

@LinguList

Okay, sorry about "generic cldf"; it was not my intention to open a debate about this here. My intention was rather to point out the discrepancy I see in this particular Phoible CLDF dataset with respect to the definition of language identifiers and the inventories.

@LinguList

And with respect to this CLDF structure, I'd say that it would make more sense to adjust the language identifiers to represent one inventory each.

@xrotwang (Member Author)

The thing is, the current CLDF structure does have a use case - one which predates any other: feeding the clld web application. So there's more to take into account when thinking about what makes sense than your new use case.

Right now, CLDF doesn't specify which parameter might correspond to a phoneme in an inventory, nor whether it makes sense for such a parameter to have multiple values per language. So on pure CLDF grounds, any such assumption cannot be justified. So it comes down to "it would save you a couple lines of code if PHOIBLE CLDF were modeled differently". To which I might respond "it would cost me a couple lines of code if PHOIBLE CLDF were modeled differently".

@bambooforest (Collaborator)

If I recall correctly, I created the phoible CLDF data in accordance with @xrotwang's suggestions.

We have an InventoryID in our data that maps in a one-to-many relationship to the doculect(s) that went into that inventory. This mainly concerns databases like UPSID.

LanguageID is then mapped to the Glottolog code, since, as far as I understand, that's a language (name) identifier, and it's more fine-grained than an ISO code in our circumstances.

@xrotwang (Member Author)

Btw. I also wouldn't find it completely outlandish if PHOIBLE had only one Value for "Phoneme X being attested in language Y", and then linked this value to one or more source inventories - hiding the "doculect" aspect even more. So I really think there is no valid generic assumption about what inventory data should look like. We might still try to come up with a specification for the next version of CLDF, but until that happens, I'd consider the current PHOIBLE CLDF a valid alternative - not something already non-standard.

@LinguList

I understand that perspective completely.

But just to add to this: by "generic cldf", I merely meant "the CLDF we have been using for sound inventories with cldfbench so far". Since I think that we can in the future build a rather "generic" data type here, similar to wordlists, with some characteristics of its own, I am quite interested in seeing how this can be further unified. The "generic cldf" was just bad wording, not meant to sound disrespectful or anything.

But in favor of the structure I propose: we have an increasing number of incoming datasets that share similar structures, with quite a few advantages. This does not only include the four inventory datasets we have in CLDF, but also lexibank datasets (beidasinitic, allenbai), from which I extracted inventory data and added it as a structure dataset after @xrotwang showed me how best to do so.

In the long run, the fact that we collect different datasets in CLDF may thus even feed back into phoible (if @bambooforest is interested in fresh inventories for Bai, Sinitic, and in the future also Hmong-Mien).

If you think this should rather be discussed and done in the future, that is also fine by me. But when I saw that there is a plan to make a cldfbench script for Phoible, I thought it would be useful to start discussing how to handle inventory data in CLDF.

And I'd definitely argue that a convenient structure for inventory datasets is currently emerging.

@xrotwang (Member Author)

xrotwang commented Jan 19, 2021

I think I wanted to have the first CLDF version fairly minimal. But in hindsight - in particular given the experience with clld - we should have added a Contribution component right away. So now, I'd say this should be added to the next CLDF iteration. With a "contribution" context it would seem a lot easier to argue that there can be only one datapoint of the "has phoneme X" kind, i.e. that it would be a logical error if "has X" and "not has X" appeared in the same contribution.

I would not want to merge this info into LanguageTable. In the PHOIBLE case this would add quite a bit of duplication. Also, my interpretation of "what goes into LanguageTable" is merely "whatever needs to be put as a dot on maps".

With a "Contribution" component, it may also be simpler to convey that a dataset is already an aggregation.

@LinguList This may still mean two cases you need to handle: if "ContributionTable" is present, you have to aggregate phonemes per Contribution; if not, you can keep the simpler code (sketched at the end of this comment). But at least the "special case" would have broader applicability - and wouldn't be called get_phoible_varieties.

Of course, it would still be a leap to assume that "Contribution" always signals the doculect level. E.g. in WALS, a contribution is always all datapoints for just one parameter - not the other way round, as in APiCS. But if we know something is an inventory, the assumption may be valid, assuming there are no inventories where one phoneme comes from one contribution and another from a different one.
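
A sketch of this two-case handling, assuming a pycldf StructureDataset (note that ContributionTable is only a proposed component at this point):

from collections import defaultdict

import pycldf


def load_inventories(ds):
    try:
        ds["ContributionTable"]  # proposed component, not yet in the spec
        per_contribution = True
    except KeyError:
        per_contribution = False
    inventories = defaultdict(list)
    for row in ds.iter_rows("ValueTable"):
        # aggregate phonemes per contribution if the dataset distinguishes
        # contributions, otherwise per language
        key = (
            (row["Language_ID"], row["Contribution_ID"])
            if per_contribution
            else row["Language_ID"]
        )
        inventories[key].append(row["Value"])
    return inventories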

@bambooforest (Collaborator)

Note also that my team is working on CLDF transformations of:

https://github.com/segbo-db/segbo

https://github.com/bdproto/bdproto

which may have the same multiple doculects issue that phoible raises.

@LinguList -- always happy to incorporate more data and give credit where credit is due. You've probably also seen the inventory data by San Duanmu:

http://www-personal.umich.edu/~duanmu/Duanmu2015PIDOC.pdf

which he once sent me in a spreadsheet. POI.

@LinguList

Yes, I am fine with this solution. What I ask myself now, however (sorry @bambooforest for using this for non-phoible-related discussion), is: should we then also convert the other datasets that have been prepared in CLDF to this form? I mean, it is cheaper to do it now.

@xrotwang (Member Author)

Here's what I propose:
cldf/cldf#102

@LinguList

Ah, and @bambooforest, we provide links to phoible, bdproto, and segbo in our most recent CLTS version, to be published now. These could be included as additional parameter information in bdproto and segbo as well (see https://github.com/cldf-datasets/jipa/blob/main/cldf/features.csv for an example of how features currently link to CLTS).
