Factor out pyclts #137
We discussed this, didn't we? The problem is that the code and data are specifically intertwined here, so finding a way to circumvent this is almost impossible.
I would be in favor; separating data and code makes a lot of sense for CLTS (not sure if it "deserves" its own organization, however). It could certainly attract more phonologists/phoneticians. However, as @LinguList just pointed out, the code and data are somewhat co-dependent and it would take some time to separate them. Should we have this as a low priority?
I guess it depends on how fast we have a working example here. As pyclts is deeply intertwined in lexibank, and I also use it in other applications, it is a great advantage not having to configure the path where the package lies, which is one of the advantages of not separating code and data. I'd say: if somebody can provide an example for how this could be achieved, one could review that and see how well it works, but I would not want to lose some of the functionality we have there by now, since specifically the adding of new transcription datasets is quite tedious, and the code written there should not be thrown away when changing the structure here.
I'll try to come up with a solution. While having data in the package may be simpler in applications, the fact that CLTS is different when compared to Glottolog and Concepticon makes it more complicated as well :) I also think it's a bit misleading if code-bugfix-releases litter the data release timeline.
By the way, maybe we should separate not only code from data, but app from library as well? In terms of repository, I mean.
Note also that the code in ./app/ is py2 only so that should be updated if it's being refactored.
Ah, sorry, that's dead code, it can be safely removed, all the py2 code, as it was just used for a cgi server that only runs py2.6.
So here's the state of affairs:
What I did:
Overall, I still like this setup a lot more. Everything seems a lot more where it belongs:
I'll need to have a closer look at everything later, and I am generally okay with that, but I'd say that it is extremely important to find a way to globally fix the path problems, as this already makes it very difficult to present the code in tutorials (also for concepticon).
So the general advantage of installing clts and having a toolset for handling linguistic data in code is definitely not there at the moment. While I agree with the advantages of separating the two, I'd suggest making a compatibility package on top, say, one command that allows me to make an init file, as we do now for cldf-bench, and stores this, so we can, upon installation or first run of pyclts, ask the users to provide their repo, and pyclts would use it and raise an error otherwise. If we make this initialization library or script generic enough, we could then also use it for pyconcepticon and pyglottolog (specifically pyconcepticon gives me a difficult time now, I started copy-pasting parts of it to people who wanted to work with it as they did not see that they had to change the path). And we could use that script then also in cldf-bench, checking if paths are properly set. To summarize: I have felt a major inconvenience since pyconcepticon was introduced, but I think it can be addressed, and I think it should be addressed, specifically as I saw many users (also in our group) end up confused. And I'd argue that we should not leave it to cldf-bench but have a light-weight init-organizer on top (that could also be used from lingpy, where we also have some paths and data).
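To make the idea concrete, here is a minimal sketch of such a light-weight init-organizer; the module, function names and config file location are assumptions, not part of any existing package:

```python
# Hypothetical "init-organizer": remembers where each catalog
# (clts, concepticon, glottolog, ...) lives on the local machine.
import configparser
from pathlib import Path

CONFIG_FILE = Path.home() / '.config' / 'catalogs.ini'  # assumed location

def set_catalog_path(name, path):
    """Store the local clone/download path of a catalog."""
    cfg = configparser.ConfigParser()
    if CONFIG_FILE.exists():
        cfg.read(CONFIG_FILE)
    if not cfg.has_section('catalogs'):
        cfg.add_section('catalogs')
    cfg.set('catalogs', name, str(Path(path).resolve()))
    CONFIG_FILE.parent.mkdir(parents=True, exist_ok=True)
    with CONFIG_FILE.open('w') as fp:
        cfg.write(fp)

def get_catalog_path(name):
    """Look up a catalog path, failing loudly if it was never configured."""
    cfg = configparser.ConfigParser()
    cfg.read(CONFIG_FILE)
    if not cfg.has_option('catalogs', name):
        raise ValueError('no path configured for {0!r}; run set_catalog_path first'.format(name))
    return Path(cfg.get('catalogs', name))
```

With something like this, a tutorial script could call get_catalog_path('clts') instead of hard-coding a machine-specific path.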
Ah, in some sense, you can say this is similar to nltk and nltk_data now: users can do something with nltk, but in fact, they need to decide also what data they want.
Or can we add a little bit of code to concepticon-data, to clts-data, and to glottolog-data, so that it could be shared on pip as well (maybe in condensed form), so that upon installation, it would handle the path problem? I could then import
I'm not sure I like the compat-package idea. To me it seems as if most of the path problems come from the fact that so far we tried to hide the fact that an API needs a path at initialization. So the convenience of inferring default locations, config files, etc. led to a lack of transparency. Since there are many ways of trying to be smart about discovering default locations, and we tried a couple, things broke - not passing a path worked in some situations and didn't in others. So my idea for a better world:
Also, I think that with the convenience before, we basically traded "path problems" for "sync problems" - which don't show up as
Regarding making catalog data installable via pip: I think that adds again a layer of obscurity. Also, catalog data will typically be used across multiple virtual environments. With the proposed method of discovery that would mean installing the catalog for each env, making sure it's always installed from the same place on disk - so even more potential for confusion.
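To illustrate the "no hidden discovery" idea in isolation, here is a minimal sketch; the class below is hypothetical and not the actual pyclts API, the point is only that the caller always passes the data location explicitly and gets a loud error otherwise:

```python
from pathlib import Path

class CatalogAPI:
    """Hypothetical catalog API: no magic lookup of default data locations."""

    def __init__(self, repos):
        self.repos = Path(repos)
        if not self.repos.exists():
            # Fail early and explicitly instead of silently guessing a fallback.
            raise FileNotFoundError('catalog data not found at {0}'.format(self.repos))

# The caller decides where the data lives: a git clone, an unzipped release, etc.
# api = CatalogAPI(Path.home() / 'data' / 'clts')
```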
The difference is that
Maybe it is not the most elegant solution, but I still prefer to have a
But the problem of explaining to non-expert python users that they need to git-clone, download, and pip-install things at the same time is too much for many people, and it is also getting too much for me, as I had many of these requests in the past. So even if one is explicit, users may not necessarily see that. Also, based on the tutorials I wrote in the past, it became quite difficult that one can no longer write a Python script that just runs, provided data are installed, but that one has to explain that users should change path names according to their machines. We may not need to install things via pip, I mean, the data, but then we need to think of adding functionality as they offer for
Without making it a bit easier to manage data, and to store where the data is in an init file, I see the following problems:
And all are cases that actually happen, and quite a lot. So if we imagine a simple package that downloads the data, points to the latest releases, and stores the paths, so that it would download the data the first time for the users and remember where it was put, this may already be of help. In fact, what we have in cldf-bench is largely doing the trick, and I ended up using
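As a rough sketch of the kind of helper meant here, downloading a tagged release archive from GitHub and unpacking it into a per-user data directory; the organization, repository and tag in the usage comment are only placeholders:

```python
import io
import zipfile
import urllib.request
from pathlib import Path

def download_catalog(org, repo, tag,
                     target=Path.home() / '.local' / 'share' / 'catalogs'):
    """Download a tagged source archive from GitHub and unpack it locally."""
    url = 'https://github.com/{0}/{1}/archive/{2}.zip'.format(org, repo, tag)
    target.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extractall(str(target))
    # GitHub archives typically unpack to "<repo>-<tag without leading v>".
    return target / '{0}-{1}'.format(repo, tag.lstrip('v'))

# e.g.: path = download_catalog('cldf-clts', 'clts', 'v1.0')  # placeholder names
```

The returned path could then be stored with something like the set_catalog_path sketch above, so later scripts find the data without hard-coded locations.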
I'm not sure such a package will be simple. As explained here, such code has the potential of masking/hiding problems beyond our control - such as missing
would be a reasonable minimal requirement.
It is a major hurdle that prevents people from coding, I am afraid, and makes it more difficult for us, as we will have to keep explaining what people should do, etc. And the article-submission-with-code is a similar case. In fact, if one says: "assuming you have set up everything correctly [instructions here ...]" I'd just like to avoid that I have to write a "YOUR PATH HERE" into my python scripts. So if a user downloads clts-data, why not add a simple setup.py that registers a path to the repository? Or use the cldf-bench configuration and make it independent of cldf-bench?
I'm fully with @xrotwang here. I think that this particular issue (code/data separation and accessibility) is something we've got to handle with good tutorials and documentation, as well as examples and cookbooks. Moving this pain into just another Python library is in my opinion just asking for more trouble further down the road. Re "YOUR PATH HERE": There are multiple ways we can solve this, e.g. configuration files (see
Ok, so what exactly can we expect? From what I understand, we cannot expect people to clone repositories. If so, how should people get the catalog data? I'd guess, if we cut
then I'd be somewhat optimistic that we can do that. But the price that comes with it is that we disconnect the catalogs from
Personally, I've found running out of a git clone leads to syncing issues (do I have the latest commit, right branch, etc.?). I can remember to
Sorry for jumping in here, but I like the idea of first-run (or an explicit
@SimonGreenhill so that sounds like a "yes" to my comment above? Some tool that
@SimonGreenhill I'd be sort-of ok with that. The price - again - is that people won't be set up to collaborate on the catalogs - and, what could be worse, people will use a different tool than we do (or at least than I do).
Yes, I think that's the easiest solution? I've had a few troubles recently with collaborators not knowing which version of the data they're using ("I downloaded glottolog/dplace a few months ago, has it been updated?", "I clicked download on the website. Where do I put it?"). Something like this would mean they have to explicitly ask for version
It is annoying that they won't be able to PR as easily, but perhaps a 'dev' version could run a
Ok. Say we go with this idea; I'd propose that
This would also mean that

```python
@classmethod
def from_name_and_version(cls, name, version):
    ...
    return cls(data_path)
```

which looks up a
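A fleshed-out sketch of what such a classmethod could do, reusing the hypothetical registry-file idea from above; none of this is the actual cldfbench or pyclts code, and the file locations are assumptions:

```python
import json
from pathlib import Path

class Catalog:
    def __init__(self, repos):
        self.repos = Path(repos)

    @classmethod
    def from_name_and_version(cls, name, version,
                              registry=Path.home() / '.config' / 'catalogs.json'):
        """Resolve (or record) the local path of a given catalog version."""
        index = json.loads(registry.read_text()) if registry.exists() else {}
        key = '{0}@{1}'.format(name, version)
        if key not in index:
            # A real tool would fetch the tagged release here (see the download
            # sketch above); this stub only records a conventional location.
            index[key] = str(Path.home() / '.local' / 'share' / name / version)
            registry.parent.mkdir(parents=True, exist_ok=True)
            registry.write_text(json.dumps(index, indent=2))
        return cls(index[key])
```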
And I guess the
I think it makes sense for cldfbench to have it as an 'install catalog' idea. The Zenodo API looks flaky (most options, incl. listing releases, are 'under development'), so it may be easier to use github, in which case the catalog could just run
@SimonGreenhill I'm confused now. Didn't we want to cut
Regarding Zenodo: I think the OAI-PMH interface they offer for communities is stable - and this would be all we need. We'd make sure to add catalog versions to a specific community, then go through OAI-PMH to discover these.
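For reference, a rough sketch of what discovering the records of a Zenodo community through OAI-PMH could look like; the endpoint and verb are standard OAI-PMH, but the community name is a placeholder and pagination via resumption tokens is ignored:

```python
import urllib.request
import xml.etree.ElementTree as ET

DC_TITLE = '{http://purl.org/dc/elements/1.1/}title'

def list_community_records(community):
    """Return the titles of the first page of records in a Zenodo community."""
    url = ('https://zenodo.org/oai2d?verb=ListRecords'
           '&metadataPrefix=oai_dc&set=user-{0}'.format(community))
    with urllib.request.urlopen(url) as response:
        tree = ET.fromstring(response.read())
    return [el.text for el in tree.iter(DC_TITLE)]

# e.g.: list_community_records('some-community')  # placeholder community name
```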
Hi all, I would like to say: I am fine with git, and so are users nowadays, I think, but I am not fine with all of us having our personal ways to remember paths to concepticon and pyclts: we need a general config, so I can write code and, e.g., Mei-Shin can pull it and apply it without changing the path in the script. Yet that concrete and maybe seemingly silly problem is the biggest burden for me.
One of the reasons why lexibank repos are so convenient is that we can work on a repo, modify it, and then access it again, without having the data in the same place. Again: a functionality that is shipped with the lexibank setup.py scripts. I can ask people to follow instructions to install
That's my major issue. I think the easiest way to solve it is to add a setup.py to concepticon-data, clts-data, etc. so I have access to its path. I mean: we do the same with cldf-datasets, right? Why not do it with our catalogs?
Just to make my suggestion clearer with an example:
```python
import csv
from os import path

def read_ts_vowels(name):
    filename = path.join(path.dirname(path.dirname(__file__)), "resources", name, "vowels.tsv")
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile, delimiter="\t")
        data = [row for row in reader]
    return data
```
This is what I use in most of my projects, such as the
Please check, for convenience, this tutorial on clts I wrote one month ago. My code says:

```python
from pyclts import TranscriptionSystem

bipa = TranscriptionSystem('bipa')
sound = bipa['ts']
for i, (k, v) in enumerate(sorted(sound.featuredict.items())):
    print('{0:5} | {1:22} | {2:10}'.format(i+1, k, v or '-'))
```

Now all I want to keep is that I never have to put a direct path from my system, or a relative path, into my tutorials or into code I write together with other people. So I am fine with a solution like:

```python
from pyclts import TranscriptionSystem
from cltsdata import data_path as clts_path

bipa = TranscriptionSystem(clts_path('bipa'))
sound = bipa['ts']
for i, (k, v) in enumerate(sorted(sound.featuredict.items())):
    print('{0:5} | {1:22} | {2:10}'.format(i+1, k, v or '-'))
```

But I think we need at least this functionality for all of our catalogs.
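To make the cltsdata idea above concrete, a minimal sketch of what such a data package's __init__.py could look like; the package does not exist as sketched, it simply resolves paths relative to wherever it was installed together with the data:

```python
# Hypothetical cltsdata/__init__.py: the data files are bundled next to this
# module, so consumers never have to hard-code machine-specific paths.
from pathlib import Path

_REPO = Path(__file__).parent

def data_path(*parts):
    """Return an absolute path below the bundled data directory."""
    return _REPO.joinpath(*parts)
```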
So, one more attempt, to be precise, @chrzyki said:
And this is exactly where I want ONE solution that handles this for all catalogs, so we can use it when teaching and when collaborating on our tools.
@tresoldi distributing data in python packages would work with CLTS, but I wouldn't bet on pypi to host 0.5 GB python packages for the Glottolog data. It also means different sets of instructions depending on the platform you are using (pypi won't work for R), more complicated releases (no release when pypi is down), etc.
@SimonGreenhill @LinguList So, maybe, having
This would then make
Maybe make it
Looking at gitpython-developers/GitPython#26,
@xrotwang I was proposing it for
No releases when pypi is down does not seem a problem to me, but you are completely right about the interoperability issues with R, as we'd make it harder to actually reuse data. I didn't consider it. But again, either we keep all the data in an independent repository that needs to be cloned somehow, or periodically release the data somewhere so it can be downloaded when needed by the Python and R libraries (hopefully in an automatic way, similar to the
@tresoldi yes, I agree that considering
@xrotwang to a more practical matter, couldn't we just build the wheels for the data as normal Python packages (like a
Having the data as normal packages, with version number, dependencies, etc., would worry me less about problems when debugging cases where the code and the data are out of sync, or people manually sharing zip files with the entire data, and the installation should be easier (as
@tresoldi I'm really not too keen on providing yet another (in this case even platform-specific) release version of our catalogs, in addition to the one on Zenodo, the one on the GitHub release page and a clone with the proper tag checked out. I see that it would be helpful, if
@LinguList Here's my (slightly biased) understanding of what agreement could look like:
Then they'd run
Other packages would consult this config file for default locations. To make this as simple as possible, I'd propose to add code to do this to
That leaves the question of how to handle different versions of the data. With

```python
with Catalog('clts', 'v1.0'):
    ...
```

but the question is
The first option would be simpler and compatible with current versions of the packages. The second option would have the advantage that the package knows which version of data it deals with - so could possibly adapt to differences.
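A rough sketch of what such a context-manager-based Catalog helper could look like under the first option (check out the requested tag in a local clone, hand the path to the caller, and restore the previous state afterwards); this is not the actual cldfbench implementation, and the clone location is an assumption:

```python
import subprocess
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def catalog(name, version, clones=Path.home() / '.config' / 'cldf'):
    """Temporarily check out a tagged version of a locally cloned catalog."""
    repo = clones / name  # assumes the catalog was cloned here beforehand
    previous = subprocess.check_output(
        ['git', 'rev-parse', '--abbrev-ref', 'HEAD'], cwd=str(repo)).decode().strip()
    subprocess.check_call(['git', 'checkout', version], cwd=str(repo))
    try:
        yield repo  # pass this path to the API instance
    finally:
        subprocess.check_call(['git', 'checkout', previous], cwd=str(repo))
```

With option 1, the body of the with-block would simply instantiate the API with the yielded directory.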
+1 for the first option, of instantiating the API.
I think I'm for option 1 as well, since this would mean the API just expects a directory - wherever that came from - git export, git clone or unzipped zenodo download. Also, the APIs are typically used from other python code - the exception being user-facing code like the
@SimonGreenhill @tresoldi @LinguList @chrzyki I propose to leave this issue now to @LinguList answering to #137 (comment)
one last question, though, before I hack away: clones in
I'd say: config. As these are config files, right?
Unless there are some expectations on some platforms that "config" must be small, I'd say config, too.
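For orientation, a tiny sketch of how the two conventional per-user locations differ; the paths follow the XDG convention on Linux, and a library such as appdirs would abstract over platforms:

```python
from pathlib import Path

# XDG-style defaults on Linux; macOS and Windows use different conventions.
CONFIG_DIR = Path.home() / '.config' / 'cldf'         # small config files
DATA_DIR = Path.home() / '.local' / 'share' / 'cldf'  # potentially large clones
```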
As to #137: I am fine with this suggestion, just had a quick check, and think this will (provided the paths are solved) even increase convenience.
@LinguList I just released cldf-clts/clts: https://doi.org/10.5281/zenodo.3515745 . pyclts 2.0 is also on pypi. Should we archive this repo - i.e. mark as archived and read-only?
@LinguList I just realized, the easiest (?) way to do this may be:
ok?
Ah, that's a bit of a trap. I watched my email, waiting for an answer - but only adding reaction emojis doesn't seem to trigger an email :)
So while we are at making breaking changes elsewhere, why not factor out pyclts to its own repository, trying to cleanly separate data and code?