This repository has been archived by the owner on Oct 22, 2019. It is now read-only.

Factor out pyclts #137

Closed · xrotwang opened this issue Oct 18, 2019 · 53 comments

@xrotwang
Collaborator

So while we are at making breaking changes elsewhere, why not

  • factor out pyclts to its own repository, trying to cleanly separate data and code
  • move these repositories into their own github org
@LinguList
Collaborator

We discussed this, didn't we? The problem is that the code and data are tightly intertwined here, so finding a way around this is almost impossible.

@tresoldi
Contributor

I would be in favor; separating data and code makes a lot of sense for CLTS (not sure it "deserves" its own organization, however). It could certainly attract more phonologists/phoneticians.

However, as @LinguList just pointed out, the code and data are somewhat co-dependent, and it would take some time to separate them. Should we treat this as low priority?

@LinguList
Collaborator

I guess it depends on how fast we have a working example here.

Since pyclts is deeply intertwined with lexibank, and I also use it in other applications, not having to configure the path where the package lies is a great advantage of not separating code and data.

I'd say: if somebody can provide an example of how this could be achieved, one could review it and see how well it works. But I would not want to lose the functionality we have by now, since adding new transcription datasets in particular is quite tedious, and the code written for that should not be thrown away when changing the structure here.

@xrotwang
Collaborator Author

I'll try to come up with a solution. While having the data in the package may be simpler in applications, the fact that CLTS is set up differently from Glottolog and Concepticon makes it more complicated as well :)

I also think it's a bit misleading if code-bugfix releases litter the data release timeline.

@tresoldi
Contributor

By the way, maybe we should separate not only code from data, but app from library as well? In terms of repository, I mean.

@SimonGreenhill
Collaborator

Note also that the code in ./app/ is py2-only, so that should be updated if it's being refactored.

@LinguList
Collaborator

Ah, sorry, that's dead code; it can be safely removed, all of the py2 code, as it was just used for a CGI server that only runs py2.6.

@xrotwang
Collaborator Author

So here's the state of affairs:

What I did:

  • moved the data directories from https://github.com/cldf/clts/tree/master/src/pyclts to https://github.com/cldf-clts/clts/tree/master/pkg
  • copied the remaining directories to either pyclts or clts
  • rewrote the tests such that they only test the code (against a mock data repository at https://github.com/cldf-clts/pyclts/tree/master/tests/repos)
  • the tests which actually tested the data can now be run with clts --repos ../clts/ test. Running this command currently leads to
    WARNING lˠ	Sound not generated 
    WARNING ᴀ̃/ã̱	Not matching codepoints
    WARNING ï	Sound not an alias
    
  • the commands to create app and package data still exist and do the right thing now.
    • running clts make_app results in the following diff, probably due to newer lingpy:
      diff --git a/pkg/soundclasses/lingpy.tsv b/pkg/soundclasses/lingpy.tsv
      index db29f90..c2467c3 100644
      --- a/pkg/soundclasses/lingpy.tsv
      +++ b/pkg/soundclasses/lingpy.tsv
      ...
      -rounded close with-frication central vowel     ʯ       0       0       0       0       0       0
      +rounded close with-frication central vowel     ʯ       Y       V       7       V       i       #c86464^M
      ...
  • The convenience of basically importing TranscriptionSystem('bipa') is gone, of course, but it isn't too hard now, either:
    >>> from clts import CLTS
    >>> CLTS('path/to/clts').bipa

Overall, I still like this setup a lot more. Everything is much more where it belongs:

  • pyclts/tests actually test the code, not the repository data
  • the convolution of using pyclts to create parts of pyclts is gone
  • the double purpose of pyclts - curating clts as well as providing read access to it - is still there, but that's ok and to be expected, since it's the same for the other catalogs as well.

@LinguList
Collaborator

I'll need to have a closer look at everything later, and I am generally okay with this, but I'd say it is extremely important to find a way to globally fix the path problems, as this is already making it very difficult to present the code in tutorials (also for concepticon):

  1. either I tell users to git-clone from scratch, which is getting repetitive, and make sure the repositories sit next to each other,
  2. or I have to hard-code the path, which is also very problematic for tutorials, or for writing a paper that uses this code, since then all of the data would have to be submitted along with it.

So the general advantage of installing clts and having a tool set for handling linguistic data in code is definitely not there at the moment.

While I agree with the advantages of separating the two, I'd suggest making a compatibility package on top: say, one command that lets me create an init file, as we do now for cldfbench, and stores it, so that upon installation or first run of pyclts we can ask users to provide their repo path, and pyclts would read it and raise an error otherwise.

If we make this initialization library or script generic enough, we could also use it for pyconcepticon and pyglottolog (pyconcepticon in particular gives me a difficult time now; I started copy-pasting parts of it to people who wanted to work with it, as they did not see that they had to change the path). And we could use that script in cldfbench as well, checking whether paths are properly set.

So to summarize: I have felt a major inconvenience since pyconcepticon was introduced, but I think it can be addressed, and it should be addressed, specifically as I saw many users (also in our group) ending up confused. And I'd argue that we should not leave it to cldfbench but have a light-weight init-organizer on top (that could also be used from lingpy, where we also have some paths and data).
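To make this concrete, here is a minimal sketch of what such an init-organizer could look like - all names (configure, repo_path, the config location) are hypothetical; this is not an existing package:

import configparser
import pathlib

CFG = pathlib.Path.home() / '.config' / 'cldf' / 'catalog.ini'

def configure(name, path):
    # remember where the local clone of catalog `name` lives
    cfg = configparser.ConfigParser()
    cfg.read(CFG)
    if not cfg.has_section('clones'):
        cfg.add_section('clones')
    cfg.set('clones', name, str(pathlib.Path(path).resolve()))
    CFG.parent.mkdir(parents=True, exist_ok=True)
    with CFG.open('w') as fp:
        cfg.write(fp)

def repo_path(name):
    # look up the configured clone, raising a helpful error otherwise
    cfg = configparser.ConfigParser()
    cfg.read(CFG)
    if not cfg.has_option('clones', name):
        raise ValueError('no clone registered for {0}, run configure() first'.format(name))
    return pathlib.Path(cfg.get('clones', name))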

@LinguList
Collaborator

Ah, in some sense you could say this is now similar to nltk and nltk_data: users can do something with nltk, but in fact they also need to decide what data they want.
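For reference, the nltk pattern alluded to here downloads data on first use into a known location - code from pip, data fetched separately:

import nltk

nltk.download('stopwords')  # fetches the corpus into ~/nltk_data by default
from nltk.corpus import stopwords
print(stopwords.words('english')[:5])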

@LinguList
Collaborator

Or could we add a little bit of code to concepticon-data, to clts-data, and to glottolog-data, so that they could be shared on pip as well (maybe in condensed form) and would handle the path problem upon installation? I could then import pyclts and also clts_data, which is nothing more than a path to the data?

@xrotwang
Collaborator Author

I'm not sure I like the compat-package idea. To me it seems as if most of the path problems come from the fact that so far we tried to hide the fact that an API needs a path at initialization. The convenience of inferring default locations, config files, etc. led to intransparency. Since there are many ways of trying to be smart about discovering default locations, and we tried a couple, things broke: not passing a path worked in some situations and didn't in others.

So my idea for a better world:

  • Explicit is better than implicit.
  • Make it known more aggressively that "using a catalog" means having a (potentially long-lived) clone of the repository locally.

Also, I think that with the convenience before, we basically traded "path problems" for "sync problems" - which don't show up as Exceptions, typically.

@xrotwang
Collaborator Author

Regarding making catalog data installable via pip: I think that adds yet another layer of obscurity. Also, catalog data will typically be used across multiple virtual environments. With the proposed method of discovery, that would mean installing the catalog for each env and making sure it's always installed from the same place on disk - so even more potential for confusion.

@tresoldi
Contributor

The difference is that nltk does not depend on nltk_data, while libraries like pyclts are effectively useless without the accompanying data (even though one could technically change the code and load only some personal IPA extension, without BIPA).

Maybe it is not the most elegant solution, but I still prefer to have a data or resources folder, install it alongside the code (listing the package in setup.py, setting zip_safe=False, and adding the files to the manifest), and load the data transparently with some _RESOURCE_DIR = path.join(path.dirname(path.dirname(__file__)), "resources") magic.

@LinguList
Collaborator

But the problem is that explaining to non-expert Python users that they need to git-clone, download, and pip-install things at the same time is too much for many people, and it is also getting too much for me, as I have had many such requests in the past. So even if one is explicit, users may not necessarily see that. Also, judging from the tutorials I wrote in the past, it has become quite difficult that one can no longer write a Python script that just runs, provided the data are installed, but has to explain that users should change path names according to their machines.

We may not need to install things via pip - I mean the data - but then we need to think about adding functionality like what nltk offers for nltk_data, where the data is downloaded directly, i.e., users don't need to run through git clone.

@LinguList
Collaborator

Without making it a bit easier to manage data, and to store where the data is in an init file, I see the following problems:

  1. tutorials become tedious to write, as scripts don't work without users changing them, and users don't like making changes when they run a tutorial for the first time
  2. code for replicability needs to either submit all data every time, to keep relative paths possible, or it loses replicability
  3. we get an increase in emails from people who fail to use the tools

And all are cases that actually happen, and quite a lot.

So if we imagine a simple package that downloads the data, pointing to the latest releases - i.e., it would download the data the first time for the users and remember where it was put - this may already be of help.

In fact, what we have in cldfbench is largely doing the trick, and I ended up using from pylexibank.__main__ import configure and then pulled concepticon and glottolog from there, since it was so convenient - but one would like the same functionality independent of cldfbench/lexibank.
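A sketch of the download-and-remember idea; the GitHub archive URL pattern is real, but the function name and target layout are made up for illustration:

import io
import pathlib
import urllib.request
import zipfile

def download_catalog(org, repo, tag, target):
    # GitHub serves source archives for tags at this URL pattern
    url = 'https://github.com/{0}/{1}/archive/{2}.zip'.format(org, repo, tag)
    with urllib.request.urlopen(url) as response:
        zipfile.ZipFile(io.BytesIO(response.read())).extractall(target)
    # the archive unpacks into a single top-level directory; return it
    return next(pathlib.Path(target).iterdir())

Combined with an init file as sketched above, such a helper could register the returned path so that scripts never need a hard-coded location.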

@xrotwang
Collaborator Author

I'm not sure such a package will be simple. As explained here, such code has the potential of masking/hiding problems beyond our control - such as missing git executables, problems with internet connectivity, etc. If this happens, we add even more confusion. I would have hoped that

  • telling people to either clone or download a GitHub repository and
  • expecting them to find their way around their local file system

would be a reasonable minimal requirement.

@LinguList
Collaborator

It is a major hurdle that prevents people from coding, I am afraid, and it makes things more difficult for us, as we will have to keep explaining what people should do, etc. And the article-submission-with-code case is similar.

In fact, even if one says "assuming you have set up everything correctly [instructions here ...]", I'd just like to avoid having to write a "YOUR PATH HERE" into my Python scripts.

So if a user downloads clts-data, why not add a simple setup.py that registers a path to the repository? Or use the cldfbench configuration and make it independent of cldfbench?

@chrzyki
Collaborator

chrzyki commented Oct 21, 2019

I'm fully with @xrotwang here. I think this particular issue (code/data separation and accessibility) is something we've got to handle with good tutorials and documentation, as well as examples and cookbooks. Moving this pain into just another Python library is, in my opinion, just asking for more trouble further down the road.

Re "YOUR PATH HERE": There are multiple ways we can solve this, e.g. configuration files (see pylexibank for script, notebook, and tutorial environments) or cli arguments.

@xrotwang
Collaborator Author

Ok, so what exactly can we expect? From what I understand, we cannot expect people to clone repositories. If so, how should people get the catalog data? I'd guess, if we cut git out of the picture, we should do the same with GitHub, and thus direct people to download released catalog versions from Zenodo. Correct? If we limit the requirements for some sort of package to

  • list catalog versions on Zenodo
  • download a catalog version from Zenodo
  • make this download known to processing software (presumably via some config file, to be platform-agnostic)

then I'd be somewhat optimistic that we can do that.

But the price that comes with it is that we disconnect the catalogs from git, which will make it harder for people to actually collaborate - something I would hope for, at least for Glottolog.

@SimonGreenhill
Collaborator

SimonGreenhill commented Oct 21, 2019

Personally, I've found running out of a git clone leads to syncing issues (do I have the latest commit, the right branch, etc.?). I can remember to git pull; git checkout master, but I suspect most of our collaborators will find that complicated.

Sorry for jumping in here, but I like the idea of a first-run download (or an explicit .install(version=x) command) that puts the version into an app dir. The user then just needs to know what version of the data they're using, which is good. I'd rather they were explicit about the version of the data than about the path to the data :)

@xrotwang
Collaborator Author

@SimonGreenhill so that sounds like a "yes" to my comment above? Some tool that

  • knows what's released (by looking up Zenodo)
  • knows how to download that stuff into appdirs.user_config_dir.cldf / <catalog> / <version> - maybe?
  • writes a config file with these paths in addition?

@xrotwang
Collaborator Author

@SimonGreenhill I'd be sort-of ok with that. The price - again - is that people won't be set up to collaborate on the catalogs - and, what could be worse, people will use a different tool than we do (or at least than I do).

@SimonGreenhill
Collaborator

Yes, I think that's the easiest solution??

I've had a few troubles recently with collaborators not knowing which version of the data they're using ("I downloaded glottolog/dplace a few months ago, has it been updated?", "I clicked download on the website. Where do I put it?"). Something like this would mean they have to explicitly ask for version x, so they must know what version they're using, while the mechanics are abstracted away from them (and I'd rather our collaborators outside the core team used an explicit released version rather than git revision xyz).

It is annoying that they won't be able to PR as easily, but perhaps a 'dev' version could run a git clone (though that adds complexity back). We could also flag it in the README and in messages, e.g. "These data are $Package $Version from $URL. To add or correct please see $URL".

@xrotwang
Collaborator Author

Ok. Say we go with this idea, I'd propose that cldfbench be this tool. Admittedly, it will pull in quite a few dependencies not needed for this particular task (although we can probably make a couple of these optional), but having yet another tool seems overkill.

@xrotwang
Collaborator Author

This would also mean that clldutils.apilib.API would gain a factory method like

@classmethod
def from_name_and_version(cls, name, version):
    ...
    return cls(data_path)

which looks up a cldfbench config file, I guess.
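Usage would then presumably look like this (assuming pyclts.CLTS gains this factory; the call is hypothetical):

from pyclts import CLTS

clts = CLTS.from_name_and_version('clts', 'v1.0')
bipa = clts.bipa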

@xrotwang
Collaborator Author

And I guess the Catalog class in https://github.com/cldf/cldfcatalog should then also be able to deal with exports/Zenodo downloads of the catalog data - and not expect a git repository that can be queried for metadata using git describe, git tags and git remote.

@SimonGreenhill
Collaborator

I think it makes sense for cldfbench to have this as an 'install catalog' idea. The Zenodo API looks flaky (most options, incl. listing releases, are 'under development'), so it may be easier to use GitHub, in which case the catalog could just run git clone on a specific tag?

@xrotwang
Collaborator Author

xrotwang commented Oct 21, 2019

@SimonGreenhill I'm confused now. Didn't we want to cut git out for simplicity? I don't think running git clone from GitPython is the way to do this - we'd still have to deal with "is git installed?" issues, just obscured by two layers of Python code.

Regarding Zenodo: I think the OAI-PMH interface they offer for communities is stable - and this would be all we need. We'd make sure to add catalog versions to a specific community, then go through OAI-PMH to discover these.
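A sketch of what that discovery could look like, using the sickle OAI-PMH client; the community name 'cldf' is a placeholder:

from sickle import Sickle  # pip install sickle

sickle = Sickle('https://zenodo.org/oai2d')
for record in sickle.ListRecords(metadataPrefix='oai_dc', set='user-cldf'):
    # each record's Dublin Core metadata includes title and identifier (DOI)
    print(record.metadata['title'], record.metadata['identifier'])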

@LinguList
Collaborator

Hi all, I would like to say: I am fine with git, and users are too nowadays, I think, but I am not fine with all of us having our personal ways of remembering paths to concepticon and pyclts: we need a general config, so that I can write code and, e.g., Mei-Shin can pull it and apply it without changing the path in the script. That concrete and maybe seemingly silly problem is the biggest burden for me.

@LinguList
Collaborator

One of the reasons why lexibank repos are so convenient is that we can work on a repo, modify it, and then access it again, without having the data in the same place - again, functionality that is shipped with the lexibank setup.py scripts. I can ask people to follow instructions to install sagartst, and once I can assume that (them following tutorials etc. on how to do it), I can send them code snippets that work from any folder on their machine, without having to adjust paths.

That's my major issue.

I think the easiest way to solve it is to add a setup.py to concepticon-data, clts-data, etc., so I have access to its path. I mean: we do the same with cldf-datasets, right? Why not do it with our catalogs?

@tresoldi
Contributor

Just to make my suggestion clearer with an example:

  • put all data in a resources folder in the same git repository
  • in setup.py, add that directory as one of the packages to be included in the wheel along with the code (there are many ways of doing it, but the simplest is to just list it manually, i.e., packages=["pyclts", "resources"]) - this also implies that the package won't be zip-safe, which can be set with zip_safe=False
  • add all the files to the MANIFEST.in (just an include resources/* line)
  • in the code, functions would read and return the data, either internally or exposing it to the user, as in this pseudo-code (here only reading vowels):
import csv
from os import path

def read_ts_vowels(name):
    filename = path.join(path.dirname(path.dirname(__file__)), "resources", name, "vowels.tsv")
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        data = [row for row in reader]
    return data
  • The functions above would be called internally, so that the API does not change (i.e., things like bipa = TranscriptionSystem("bipa") would still work), and if people want access to them they just need pyclts.read_ts_vowels("bipa")

This is what I use in most of my projects, such as the alteruphono sound changer https://github.com/tresoldi/alteruphono . I find it easier because code and data are separated but live in the same repository; when releasing a wheel, the data is installed automatically with pip, allowing easier branching/versioning as well as testing.
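For completeness, the setup.py side of this could look roughly like the following - a sketch of the settings named above, not the actual pyclts setup:

from setuptools import setup

setup(
    name='pyclts',
    version='0.1',
    # ship the data dir alongside the code; for this dumb-but-simple
    # approach, resources/ needs an (empty) __init__.py
    packages=['pyclts', 'resources'],
    zip_safe=False,             # the data files must exist on disk
    include_package_data=True,  # honor the MANIFEST.in entries
)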

@LinguList
Collaborator

LinguList commented Oct 21, 2019

Please check, for convenience, this tutorial on clts I wrote one month ago. My code says:

from pyclts import TranscriptionSystem
bipa = TranscriptionSystem('bipa')
sound = bipa['ts']
for i, (k, v) in enumerate(sorted(sound.featuredict.items())):
    print('{0:5} | {1:22} | {2:10}'.format(i+1, k, v or '-'))

Now all I want to keep is that I never have to put a direct path from my system, or a relative path, into tutorials or code I write together with other people. So I am fine with a solution like:

from pyclts import TranscriptionSystem
from cltsdata import data_path as clts_path
bipa = TranscriptionSystem(clts_path('bipa'))
sound = bipa['ts']
for i, (k, v) in enumerate(sorted(sound.featuredict.items())):
    print('{0:5} | {1:22} | {2:10}'.format(i+1, k, v or '-'))

But I think we need at least this functionality for all of our catalogs.
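A hypothetical cltsdata package as wished for above could be tiny - nothing more than a pointer to the installed data:

from os import path

# data files shipped next to this module, e.g. cltsdata/data/...
_DATA = path.join(path.dirname(__file__), 'data')

def data_path(*comps):
    # return a path below the bundled data directory
    return path.join(_DATA, *comps)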

@LinguList
Collaborator

So, one more attempt, to be precise, @chrzyki said:

Re "YOUR PATH HERE": There are multiple ways we can solve this, e.g. configuration files (see pylexibank for script, notebook, and tutorial environments) or cli arguments.

And this is exactly where I want ONE solution that handles this for all catalogs, so we can use it when teaching and when collaborating on our tools.

@xrotwang
Collaborator Author

@tresoldi distributing data in Python packages would work for CLTS, but I wouldn't bet on pypi to host 0.5 GB Python packages for the Glottolog data. It also means different sets of instructions depending on the platform you are using (pypi won't work for R), more complicated releases (no release when pypi is down), etc.

@xrotwang
Collaborator Author

@SimonGreenhill @LinguList So, maybe, having cldfbench config take care of running git clone would be the cheapest first stab at a solution? Maybe I'm overestimating the problems - I don't know how ubiquitous git is nowadays, and internet tends to be fast everywhere.

This would then make appdirs.user_config_dir('cldfbench') / catalog.ini the standard location to look up such paths from all packages, pyclts, pyglottolog, etc.?

@xrotwang
Collaborator Author

Maybe make it appdirs.user_config_dir('cldf') / catalog.ini?
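Resolution from the various packages could then be as simple as the following sketch; the section/key layout of catalog.ini is an assumption:

import configparser
import os

import appdirs  # https://pypi.org/project/appdirs/

def catalog_dir(name):
    # return the registered clone directory for catalog `name`
    cfg = configparser.ConfigParser()
    cfg.read(os.path.join(appdirs.user_config_dir('cldf'), 'catalog.ini'))
    return cfg['clones'][name]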

@xrotwang
Collaborator Author

Looking at gitpython-developers/GitPython#26 , GitPython seems to work reasonably well on Windows - and I guess macOS won't be too problematic. So, @SimonGreenhill, do you think we can/should expect users to have git installed? I like this solution, since it provides for a smoother learning curve: once you know enough about git, just consult the config file for the locations of your clones and you're ready to contribute to glottolog - well, not quite, because you won't have a fork ... Anyway, I like it.

@tresoldi
Contributor

@xrotwang I was proposing it for pyclts, given the issue of intertwined code & data pointed out by @LinguList - it can be hard enough to explain transcription systems and transcription data; adding path configuration on top would make it even worse for beginners. As an API approach (not passing a clts_path or similar), I would still favor such a solution, even with the data downloaded manually or automatically.

No releases when pypi is down does not seem a problem to me, but you are completely right about the interoperability issues with R, as we'd make it harder to actually reuse the data; I hadn't considered that. But again, either we keep all the data in an independent repository that needs to be cloned somehow, or we periodically release the data somewhere so it can be downloaded when needed by the Python and R libraries (hopefully in an automatic way, similar to the nltk approach).

@xrotwang
Collaborator Author

@tresoldi yes, I agree that pyclts considered in isolation looks somewhat different. But the power-users of all these tools will be us, and most of us are involved in the curation of more than this one dataset/catalog. And to me, homogeneity of these datasets and tools is mission-critical. If you come back to pyclts after a couple of months (as I did this weekend), and what you know boils down to "it's different" and "it's intertwined", that's not a good situation. So I'd ask to sacrifice a bit of end-user experience in order to keep us sane.

@LinguList
Collaborator

The path statement - as long as it is configurable, and we have a tutorial that runs users through it so I can point them to it - is for me the most burning issue, as its absence does not only break UI but also current collaboration (we're using pylexibank.__main__.configure for the time being in our group). But I guess we're starting to agree on a solution here, right?

@tresoldi
Contributor

tresoldi commented Oct 21, 2019

@xrotwang on a more practical matter, couldn't we just build the wheels for the data as normal Python packages (like a pyclts_data) and host them somewhere other than pypi? With GitHub itself as the immediate first alternative, even if it's half a GB.

Having the data as normal packages, with version numbers, dependencies, etc., would make me worry less about debugging cases where the code and the data are out of sync, or about people manually sharing zip files with the entire data, and the installation should be easier (as pip would and should take care of downloading and installing everything).

@xrotwang
Collaborator Author

xrotwang commented Oct 21, 2019

@tresoldi I'm really not too keen on providing yet another (in this case even platform-specific) release version of our catalogs, in addition to the one on Zenodo, the one on the GitHub release page, and a clone with the proper tag checked out. I see that it would be helpful if pyclts could declare "works with data better than v1.3", but then I also think being very conservative about backwards compatibility of the data is good anyway, and that makes this point less relevant.

@LinguList Here's my (slightly biased) understanding of what agreement could look like:

  1. We expect users to have git installed.
  2. We expect users to install cldfbench (hopefully with stripped down requirements).

Then they'd run cldfbench config to

  • either point to clones or create new ones in appdirs.user_config_dir('cldf') / catalogs,
  • register these paths in appdirs.user_config_dir('cldf') / catalog.ini.

Other packages would consult this config file for default locations. To make this as simple as possible, I'd propose adding the code to do this to cldfcatalog, which is very lightweight and can just be added as a dependency to all packages. (I wouldn't want to add a separate CLI to cldfcatalog, though - maybe a bit of documentation on how to write such a config file by hand.)
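Such a hand-written config file might look like this (section and key names are assumptions, not a final format):

# appdirs.user_config_dir('cldf') / catalog.ini
[clones]
clts = /home/user/.config/cldf/catalogs/clts
concepticon = /home/user/.config/cldf/catalogs/concepticon
glottolog = /home/user/.config/cldf/catalogs/glottolog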

That leaves the question of how to handle different versions of the data. With cldfcatalog functionality it's easy to do stuff like:

with Catalog('clts', 'v1.0'):
    ...

but the question is

  • should this be done outside of instantiating the particular API - e.g. in a wrapper script?
    with Catalog('clts', 'v1.0') as cat:
        clts = pyclts.CLTS(cat.dir)
  • or should the API know how to do this, e.g.
    with pyclts.CLTS(version='4.0') as clts:
        ...

The first option would be simpler and compatible with current versions of the packages. The second option would have the advantage that the package knows which version of data it deals with - so could possibly adapt to differences.

@tresoldi
Contributor

+1 for the first option, instantiating the API from outside.

@xrotwang
Collaborator Author

xrotwang commented Oct 21, 2019

I think I'm for option 1 as well, since this would mean the API just expects a directory - wherever it came from: git export, git clone, or an unzipped Zenodo download. Also, the APIs are typically used from other Python code - the exception being user-facing code like the concepticon CLI. But for this, future releases of pyconcepticon could just include a small wrapper, exposing a --version option and wrapping API instantiation in the with statement as above.

@xrotwang
Collaborator Author

@SimonGreenhill @tresoldi @LinguList @chrzyki I propose to leave this issue now to @LinguList answering #137 (comment), and to refer all further discussion to a PR I'm going to open on https://github.com/cldf/cldfbench

@xrotwang
Collaborator Author

one last question, though, before I hack away: clones in appdirs.user_config_dir or appdirs.user_data_dir? (see https://pypi.org/project/appdirs/)

@LinguList
Collaborator

I'd say: config. As these are config files, right?

@xrotwang
Collaborator Author

Unless there are some expectations on some platforms that "config" must be small, I'd say config, too.

@LinguList
Collaborator

As to #137: I am fine with this suggestion; I just had a quick check, and I think this will (provided the paths are solved) even increase convenience.

@xrotwang
Collaborator Author

@LinguList I just released cldf-clts/clts: https://doi.org/10.5281/zenodo.3515745 . pyclts 2.0 is also on pypi. Should we archive this repo - i.e. mark as archived and read-only?

@xrotwang
Collaborator Author

@LinguList I just realized, the easiest (?) way to do this may be:

  • rename this repo to clts-legacy
  • move it to the cldf-clts org (then we can transfer issues to the two active repos there)
  • then archive it.

ok?

@xrotwang
Collaborator Author

Ah, that's a bit of a trap. I watched my email, waiting for an answer - but adding only reaction emojis doesn't seem to trigger an email :)
