Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cltk data directory location #196

Open
nelson-liu opened this issue Mar 14, 2016 · 21 comments
Open

cltk data directory location #196

nelson-liu opened this issue Mar 14, 2016 · 21 comments

Comments

@nelson-liu
Copy link
Contributor

@nelson-liu nelson-liu commented Mar 14, 2016

Hi,
CLTK by default places corpuses that are cloned (e.g. when running tests) in ~/cltk_data. I personally don't like putting datafiles in my home directory, and would prefer to move it somewhere else. Perhaps we could integrate an option for changing where CLTK puts corpuses by default?

@kylepjohnson
Copy link
Member

@kylepjohnson kylepjohnson commented Mar 14, 2016

Hi Nelson, this is a good topic to bring up. For the sake of simplicity, I decided upon the home dir when started development.

If you want to take this on, I encourage you to look into all the places in the codebase you find cltk_data. Then, you'll need to propose a manner that users can customize the location. To be honest, I don't want to clutter up the code too much to make this option happen. And I will still want to keep a default of the home directory.

@hemantpugaliya
Copy link

@hemantpugaliya hemantpugaliya commented Mar 16, 2016

Hi,
I have an idea to solve this problem. We can have config file in the home directory : ~/.cltk_config (a hidden file) which is made during the initialization of library . This file can be used to store the path (or any other configurations in the future) . For the data directory, we can have a variable DATA_DIR which is initialized to ~/cltk_data by default.
Next we provide a utility function to change the data directory . It takes, a base directory(say BASE_DIR) as input and moves the current data directory into the base directory and changes ~/.cltk_config . After using this function the data directory will be now found at BASE_DIR/cltk_data .
We will need another function which will source ~/.cltk_config and use that returned value , where ever we are currently referring to ~/cltk_data . We will have to take care of a few other cases like , the user deleting ~/.cltk_config or edits it such that no DATA_DIR variable is found.
I would like to work on it but may not be able to start the work right away as i have tests in the upcoming week.

@hemantpugaliya
Copy link

@hemantpugaliya hemantpugaliya commented Mar 17, 2016

@kylepjohnson @nelson-liu How is the idea ??
Can i work on it ??

@nelson-liu
Copy link
Contributor Author

@nelson-liu nelson-liu commented Mar 18, 2016

hmm I know of several other libraries that have functionality like this, but I haven't had the time to look into them. Your idea seems sound, but please take a look at what others are doing first and see if you can compare/contrast the approaches. Feel free to work on this if you want.

@kylepjohnson
Copy link
Member

@kylepjohnson kylepjohnson commented Mar 19, 2016

@nelson-liu Does this solution of a ~/.cltk_config file meet your desire of not writing anything to your home dir?

I'm open to this, but I want to warn everyone that I'll be picky about the implementation in code. Of particular concern is the possibility that this option (which very few of our users will want) will clutter our code. For a linguistic FOSS project like ours, I am very mindful of the technical debt that "nice to haves" like this can create.

Please don't misunderstand me – I agree that this would be a nice function, but I'm only willing to trade so much complexity in exchange for me. So let me know if one of you are serious about it and I'll assign you the issue, however just be aware that I'll scrutinize it quite carefully.

@nelson-liu
Copy link
Contributor Author

@nelson-liu nelson-liu commented Mar 19, 2016

yeah, I definitely echo your sentiment regarding the pile up of technical debt with the implementation of features like this. I ended up just symlinking the data directory to where I wanted it to be, which isn't an ideal solution but suffices. I've reversed my stance on this and now don't think this is worth implementing, mainly due to its niche nature.

@kylepjohnson
Copy link
Member

@kylepjohnson kylepjohnson commented Mar 19, 2016

@nelson-liu Thanks for helping me think this through. To be honest, I grappled with the idea ~3 years ago, when I first experimented with allowing alternate dir locations … but I just ended up confusing my self more often than helping.

I encourage you to keep raising these kinds of issues, however. As the project grows, my old assumptions need to be challenged. Closing issue but dialog can continue.

@hemantpugaliya
Copy link

@hemantpugaliya hemantpugaliya commented Mar 19, 2016

@nelson-liu @kylepjohnson
Just in case if this issue is again taken up in future, this may be another possible option.
I have investigated about how this feature has been implemented by NLTK.They have provided the following options .

  1. Define an NLTK_DATA environment variable.
  2. Separate parts of data(eg corpus,models etc) can be stored in different locations(need not be within the same parent directory) and these paths can be added to $PATH environment variable. NLTK looks up all the directories in PATH while searching for a resource.
@kylepjohnson
Copy link
Member

@kylepjohnson kylepjohnson commented Mar 19, 2016

@hemantpugaliya Thanks. The first is what I did initially, following the NLTK's lead. I forget the details of the problem, but it was basically this: When importing the CLTK as a library, how/where do we persist this CLTK_DATA variable?

@diyclassics
Copy link
Contributor

@diyclassics diyclassics commented May 29, 2016

Checking in on this issue—I wrote some code to use NLTK's PlaintextCorpusReader to load CLTK corpora using the 'home directory' default for the path (cf. https://github.com/diyclassics/cltk/blob/master/cltk/corpus/latin/__init__.py). I feel like it should check other locations before raising an error and @hemantpugaliya's environment variable solution seems like a good idea.

@TinaRussell
Copy link

@TinaRussell TinaRussell commented Dec 8, 2017

I hope this issue is still being worked on, somewhere—I love the idea of CLTK but I can’t stand it when programs add new directories to my home directory and won’t let me specify somewhere else. I’m particular about how my computer is organized, and this kind of thing feels like a precocious child is trying to remodel part of my house.

@almostearthling
Copy link
Contributor

@almostearthling almostearthling commented Jun 2, 2019

Hi, I see that this issue has been closed but @kylepjohnson also said that the discussion can continue. I think that one of the primary discrepancies in not allowing customization of ~/cltk_data lies in the use of Python Virtual Environments, which are often suggested throughout the documentation. An user might want to use a virtual environment to keep such a dynamically developed library as CLTK in its own space, along with its library. Moreover, someone might want to use several copies of the data at different stages of development for testing purposes, which is legitimate in my opinion.

That said, I forked CLTK and (re?)implemented the lookup of an environment variable - CLTK_DATA as you can imagine - to determine the position of the data directory: it uses the builtins functionality provided in Python 3 to spread a builtin-like variable, namely get_cltk_data_dir, throughout the whole package. I changed all locations where ~/cltk_data was either mentioned, or constructed, or prepended to other strings. I also changed one of the unit tests to reflect that ~/cltk_data is no more the expected value: all provided tests succeed. Unfortunately many files from many contributors had to be changed: probably if such a solution is to be adopted there should be some sort of "protocol" for newly contributed code, such as suggesting to use utilities like make_cltk_path() from cltk/utils/file_operations.py to determine the data path.

The reason why I chose such a long and ugly name for the variable is to make it easy to find and replace with a better one, CLTK_DATA itself being for instance a good candidate.

Of course, when no CLTK_DATA environment variable is provided, the package falls back to the good old ~/cltk_data position, so no current installation should be broken by using the forked version. In case anyone would like to give it a try, the branch is located at the following URL:

https://github.com/almostearthling/cltk/tree/CLTK_DATA

If this could be a viable solution I'll propose a pull request.

Note that the environment variable solution might be of a certain appeal to Python Virtual Environment users, as the variable can be easily set in the venv/bin/activate script.

@todd-cook
Copy link
Collaborator

@todd-cook todd-cook commented Jun 4, 2019

I like this idea. It is a little bit fringe, but as someone who works with corpora, I think it could be useful. Unless somebody chimes in against it (thoughts @kylepjohnson ? )
@almostearthling I would say go ahead and PR it.

@kylepjohnson
Copy link
Member

@kylepjohnson kylepjohnson commented Jun 4, 2019

@almostearthling thanks for the input.

I don't know how I feel about it … I lean more against than for. If I explain my logic, perhaps someone can correct a flaw in it:

  • I understand why people don't like having software litter their home dir (as @TinaRussell mentions above). However two points: (1) a home dir is the only common denominator we can count on all of our users having access to; (2) there must ultimately be a hard coded filepath somewhere (though see below, I don't understand the builtins idea very well).
  • A contrib once mentioned using a "hidden" file or dir, though this seems to me just as clutter-y -- and even worse, because it's out of obvious view. Whenever anyone downloads code off the Internet, there's a big risk involved for the user.
  • Help me understand the builtins idea -- I still do not have a clear idea of how this would solve one's problem. Scanning each user's hard drive for cltk_data seems clunky and inefficient. (Crazy idea: spacy installs its models into your virtualenv … though in this case the model is a separate Python package, eg env/lib/python3.7/site-packages/en_core_web_lg/en_core_web_lg-2.1.0/.) I wouldn't want this as a default, though, as it seems likely to me that a user would have only a few corpus and many virtualenvs.
  • As @todd-cook implies it is fair to assume that not all of our users will passively take in our default corpora. In fact, even if they're in the minority, those who customize corpora would likely be counted among our "power users". It would be right to write code to accommodate them, though I don't see how an arbitrary data dir does so. For example, couldn't one just put arbitrary corpus folders within ~/cltk_data -- or anywhere else, for that matter?

If corpus management, and not file system hygiene, is the driving concern, then we could take this conversation in another direction. What patterns may users follow when doing experiments? To give my 2 cents, I would rather a user use a standard corpus and customize the corpus with filepath selection -- for example, getting all of the First1KGreek repo [https://github.com/OpenGreekAndLatin/First1KGreek] and then opening only (say) the files of historians.

I could say a little more … please @todd-cook and @almostearthling let me learn more, especially about the builtins idea.

@kylepjohnson
Copy link
Member

@kylepjohnson kylepjohnson commented Jun 4, 2019

Update: @almostearthling I didn't notice that you had already written some code. I see this diff -- is it the correct one?

almostearthling@4bab520

In this case, do I understand the following right?

  1. CLTK looks in an OS's PATH for $CLTK_DATA.
  2. If $CLTK_DATA found, use that; else use ~/cltk_data/

If so, I am fine with this, as an idea. Next time please send a link so I can see the code better.

@todd-cook in your opinion, is the implementation optimal?

@todd-cook
Copy link
Collaborator

@todd-cook todd-cook commented Jun 4, 2019

@todd-cook in your opinion, is the implementation optimal?
Yeah, the code looks good and this is how I would do it.
I like the edge case ability to override the data dir, good idea.
I made a comment on the code branch that pathlib and Path may provide cleaner path part concatenation and may allow users to better type hint function interfaces, but that could be done later. Pathlib is great, but sometimes you still have to cast the Path object back to a string to use it.

@almostearthling
Copy link
Contributor

@almostearthling almostearthling commented Jun 4, 2019

First I'll try to explain the builtins mechanism.

The builtins module allows for definition, at package level, of objects, variables and functions that become the same level as Python's intrinsic builtin items - such as the print() function or the None object. In Python 2.x it was called __builtins__, thus interaction with it was considered something dangerous to say the least, if not some sort of witchcraft. Of course the builtins module is originally not intended for this purpose, but it serves as a quite common hack to provide global-like items to contributors of a Python package. I proposed this solution in order not to change too much the way contributors provide their code to the CLTK, and to avoid to drastically change the semantic of code where the data directory is mentioned: I saw that many times the contributed code uses the ~/cltk_data literal to build a more complex path using common Python functions - mostly os.path.join(), so most of the times the substitution of the literal with get_cltk_data_dir didn't change the sense of the code so much.

I see think diff -- is it the correct one?
almostearthling@4bab520

Indeed it is. Sorry for not pointing out directly to the diff and just providing a link to the branch. However, as you can see, my changes touched a lot of files, and a lot of contributions, not only the core of CLTK. This is the main drawback of my proposed solution: many contributors will see their provided code changed (well, this is however the spirit of Github itself, isn't it?) and future contributors will have to be warned that they will have to refer to a global item (the get_cltk_data_dir variable, or even better a builtin-like function to avoid the possibility of overwriting its value) instead of assuming that $HOME/cltk_data is the place where everything can be found or written. It's not a minor change, in fact, and it can dissuade to accept a PR because of its invasiveness. I say this because everything has to be considered before such a radical modification.

In this case, do I understand the following right?

  1. CLTK looks in an OS's PATH for $CLTK_DATA.
  2. If $CLTK_DATA found, use that; else use ~/cltk_data/

Yes, exactly. $CLTK_DATA is a name similar to $NLTK_DATA, so it seemed consistent to use it - and it's pretty straightforward as a name. This piece of code is where it happens:

if 'CLTK_DATA' in os.environ:
    get_cltk_data_dir = os.path.expanduser(
        os.path.normpath(os.environ['CLTK_DATA']))
else:
    get_cltk_data_dir = os.path.expanduser(
        os.path.normpath("~/cltk_data"))
builtins.get_cltk_data_dir = get_cltk_data_dir

and the fallback mechanism makes it possible to keep using existing installations of ~/cltk_data for users that have never defined (and possibly will never define) the $CLTK_DATA environment variable, just because they don't care.

While I admitted that the changes introduced by my implementation can cause such a revision of the contribution habits that might let you opt for not accepting a PR, there are at least two use cases that make the solution interesting:

  1. Testing. A researcher or developer might want to test something in the CLTK data directory - be it a corpus, a data set, or even a piece of code: there is a lot of Python code there - before promoting it to production, and in such a case it can be useful to compare results with a working version of CLTK. The possibility to have, say, both ~/cltk_data and ~/cltk_data_TEST in one's home directory can be helpful. As I said above, it can also be useful to test recent changes to the data contributed by others without interference with a "known working copy" of the CLTK.

  2. Automation. If there is a process that runs automatically on a system, that ingests data and uses CLTK as a library to provide results to be used by other software, the account that runs the process might not have access to its home directory (some accounts have it set to /nonexistent, or even worse /tmp). In this case it would be useful to set $CLTK_DATA to something like /var/lib/some_path in order to let the CLTK data directory persist in a consolidated fashion - even on POSIX systems that delete everything in /tmp on every reboot. This use case is particularly significant to me, and in fact is one of the main reasons why I propose the change: a project that I am working on will do exactly this.

However, I still have something to do before proposing a PR: first I want to test if everything works on Windows too - I don't know however if the test suite actually succeeds on Windows as well, but I'd like to see at least if there are major issues. Secondly, as I said to @todd-cook and above, I'd like to see if I can provide the value via a function instead of a global variable, without cluttering the global namespace with too many names.

Sorry for being so lengthy, as it seems obvious conciseness is not my gift. :-)

almostearthling pushed a commit to almostearthling/cltk that referenced this issue Jun 4, 2019
@almostearthling
Copy link
Contributor

@almostearthling almostearthling commented Jun 4, 2019

All tests succeed after commit fb191f8 too: variable get_cltk_data_dir is substituted by function get_cltk_data_dir().

@kylepjohnson kylepjohnson reopened this Jun 5, 2019
@kylepjohnson
Copy link
Member

@kylepjohnson kylepjohnson commented Jun 5, 2019

future contributors will have to be warned that they will have to refer to a global item

They already need to know of the current data dir convention, so this is not any worse.

it can also be useful to test recent changes to the data contributed by others without interference with a "known working copy" of the CLTK.

the account that runs the process might not have access to its home directory

Both excellent points.

However, I still have something to do before proposing a PR: first I want to test if everything works on Windows too - I don't know however if the test suite actually succeeds on Windows as well, but I'd like to see at least if there are major issues.

I really appreciate the offer, however we do not support Windows (or do you refer to Windows Subsystem for Linux?). ... I am open to reevaluating Windows support, however I think this deserves its own issue

thus probably a builtin-like function instead of a global variable might work better for the purpose.

This is what you have already done, with get_cltk_data_dir()?

Sorry for being so lengthy, as it seems obvious conciseness is not my gift. :-)

Ha, me neither. Story of my life 😅

Go ahead and make the PR when you're ready. I'll let you and @todd-cook sort out the rest of the details.

@almostearthling
Copy link
Contributor

@almostearthling almostearthling commented Jun 5, 2019

This is what you have already done, with get_cltk_data_dir()?

Yes, I couldn't wait.

Go ahead and make the PR when you're ready. I'll let you and @todd-cook sort out the rest of the details.

Fine, thank you. I'll PR as soon as I get back to my Linux workstation to do all of the mandatory merge/rebase work.

@almostearthling
Copy link
Contributor

@almostearthling almostearthling commented Jun 5, 2019

The die is cast. 😁

kylepjohnson added a commit that referenced this issue Jun 8, 2019
* Use a global variable that infers the cltk_data directory from an environment variable instead of forcing ~/cltk_data

* Convert references to the variable to function calls to avoid possibility to overwrite the value (#196)

* Use a global function to provide cltk_data that queries CLTK_DATA envvar and replace all references to literal

* Better comment code, and skip coverage test on unreachable branch

* Check existence and writeability of data directory when CLTK_DATA envvar is set
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.