Phoible 3.0 #35

Merged
merged 4 commits into from Oct 27, 2021
xrotwang
Member

@xrotwang xrotwang commented Apr 9, 2021

I tried to make the CLDF creation and dataset as transparent as possible, by

@xrotwang xrotwang marked this pull request as draft April 9, 2021 06:26
@xrotwang
Member Author

xrotwang commented Apr 9, 2021

Currently I'm working on translating the PHOIBLE FAQ to work from the CLDF data. Since I'm not much of an R coder, I'm doing it with SQL - but it would be awesome if we also had an R version. This shouldn't be too difficult, I think. All it may take is joining data from a couple of CSV files to get the representation the FAQ starts with, here http://phoible.github.io/faq/#how-do-i-get-the-data

@xrotwang
Member Author

xrotwang commented Apr 9, 2021

I don't have a nice SQL solution for the sampling examples (maybe this should be left for downstream analysis anyway, with SQL confined to basic data assembly), but the rest seems straightforward: https://github.com/cldf-datasets/phoible/blob/206ea83807e259bd0d52c016c657962daa62a7f6/faq.md
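
For the sampling step, a downstream script may be simpler than SQL. Here is a minimal Python sketch, assuming the goal is one randomly chosen inventory per language. The function name, the (language, inventory) pair layout, and the toy IDs are all hypothetical and not taken from the dataset:

```python
import random
from collections import defaultdict


def sample_one_inventory_per_language(rows, seed=0):
    """Pick one inventory ID at random for each language.

    `rows` is an iterable of (language_id, inventory_id) pairs. Both
    names are placeholders, not the dataset's actual column names.
    """
    by_language = defaultdict(set)
    for language_id, inventory_id in rows:
        by_language[language_id].add(inventory_id)

    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return {
        language_id: rng.choice(sorted(inventory_ids))
        for language_id, inventory_ids in sorted(by_language.items())
    }


# Toy data: invented language and inventory IDs, not real PHOIBLE rows.
toy_rows = [
    ("lang_a", "inv1"), ("lang_a", "inv2"),
    ("lang_b", "inv3"),
    ("lang_c", "inv4"), ("lang_c", "inv5"), ("lang_c", "inv6"),
]
sample = sample_one_inventory_per_language(toy_rows, seed=42)
print(sample)
```

Sorting before drawing keeps the result independent of input order, so the same seed gives the same sample on every run.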

@bambooforest
Collaborator

@xrotwang -- so there will be a FAQ page on the website and another FAQ page in the CLDF repo?

@xrotwang
Member Author

xrotwang commented Apr 10, 2021

@bambooforest I'm not sure. There's definitely potential for confusion. So here's what I'd like to see:

  • a version of the FAQ (AFAIC this could be the only version) which accesses the data from released versions of the CLDF dataset
  • a CLDF dataset that is fairly self-contained, i.e. can be used and understood without needing other resources.

Having an FAQ in this repo would meet these requirements.

@xrotwang
Member Author

OTOH, maybe a smaller/shorter description of the dataset with one usage example would be sufficient to meet the "used and understood by itself" requirement. But then, I'd like the "official" FAQ to access "official" data, i.e. longterm archived, versioned releases at Zenodo.

@bambooforest
Collaborator

I'd rather not have two FAQs, as you point out @xrotwang , because it may cause confusion.

On the other hand, perhaps there's already confusion between the "not official" CSV data that we create here:

and that we have released on GitHub and Zenodo:

in contrast to the "official" CLDF version:

Maybe best then to keep them separate and have two FAQs -- one for each data type?

@drammock ?

@drammock

I'd vote for updating our FAQ to start from the clld version of the data. The SQL version is cool and I'm considering keeping it as an "orphan" page (not accessible from the nav, only linked from the main R FAQ). But this adds some overhead each time we want to update the FAQ content... 🤔

@xrotwang
Member Author

Ok, if the R version would use the CLDF data that would be perfect from my POV. In that case, I'd reduce the SQL variant to just

  • a description of how to get started and then
  • one example of a "translation" of the equivalent R code
  • and one example highlighting how SQLite can be integrated in shell pipelines, possibly using the nice termgraph package.

Then, there should be minimal overlap between the text/content of the two pages, and thus not much overhead in terms of maintenance.

@bambooforest
Collaborator

So just to be clear, the updated R code should read the CLDF CSV files from here (assuming these will be the standardized file names moving forward):

https://github.com/cldf-datasets/phoible/tree/phoible-3.0/cldf

and, I suppose, joined into the CSV file that we use and maintain in dev, for simplicity's sake (we won't have to update the rest of the code to work with different CLDF CSV files).

We also need to add a field for categories specifying source gaps (#333) and there are a few data issues in dev's issue trackers that I would like to fix before releasing an official 3.0.

@xrotwang
Member Author

@bambooforest yes, that's what I propose. While this slightly complicates some things, it simplifies others:

  • the links to the bibliographic sources are already in inventories.csv,
  • Glottolog metadata of a specified version is already in languages.csv.

@xrotwang
Member Author

I think values.csv, with languages.csv and inventories.csv joined, is basically an enriched phoible.csv as in phoible/dev.
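
A minimal sketch of that join, using Python's built-in sqlite3 module with a few invented rows. The table layout and the column names (ID, Language_ID, Inventory_ID, Value, Glottocode, Source) are assumptions for illustration and should be checked against the dataset's actual CLDF metadata:

```python
import sqlite3

# In-memory database standing in for the three CSV files.
# All table and column names below are assumed, not confirmed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (ID TEXT PRIMARY KEY, Name TEXT, Glottocode TEXT);
CREATE TABLE inventories (ID TEXT PRIMARY KEY, Name TEXT, Source TEXT);
CREATE TABLE "values" (
    ID TEXT PRIMARY KEY,
    Language_ID TEXT,
    Inventory_ID TEXT,
    Value TEXT
);
""")

# Toy rows: invented identifiers, not real PHOIBLE or Glottolog data.
conn.execute(
    "INSERT INTO languages VALUES (?, ?, ?)",
    ("lang_a", "Language A", "aaaa1234"),
)
conn.execute(
    "INSERT INTO inventories VALUES (?, ?, ?)",
    ("inv1", "Inventory 1", "source_key_1"),
)
conn.executemany(
    'INSERT INTO "values" VALUES (?, ?, ?, ?)',
    [("v1", "lang_a", "inv1", "a"), ("v2", "lang_a", "inv1", "t")],
)

# One row per value, enriched with language and inventory metadata.
rows = conn.execute("""
    SELECT v.ID, l.Name, l.Glottocode, i.Source, v.Value
    FROM "values" AS v
    JOIN languages AS l ON l.ID = v.Language_ID
    JOIN inventories AS i ON i.ID = v.Inventory_ID
    ORDER BY v.ID
""").fetchall()

for row in rows:
    print(row)
```

The table name is quoted because VALUES is an SQL keyword. With real data, the three tables would be loaded from the CSV files rather than inserted by hand.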

@bambooforest
Collaborator

OK, will look into it. It would also motivate users to use the CLDF version instead of the dev CSV if it contains additional Glottolog metadata, which we typically just merge into the dev CSV anyway.

@xrotwang
Member Author

Yes, considering that both PHOIBLE and Glottolog are moving targets, getting full transparency about particular versions makes the added complexity worthwhile.

@bambooforest
Collaborator

I'm going to merge this and I created an issue (#36) to update the R code in the FAQ, since I need to do this elsewhere and having this PR merged gives me the submodule, etc.

@bambooforest bambooforest marked this pull request as ready for review October 27, 2021 10:48
@bambooforest bambooforest merged commit b261644 into master Oct 27, 2021
@bambooforest bambooforest deleted the phoible-3.0 branch October 27, 2021 10:48
Development

Successfully merging this pull request may close these issues.

Make CLDF creation independent of phoible-scripts
Refactor repo to be cldfbench compliant
3 participants