cltk data directory location #196

Closed
nelson-liu opened this Issue Mar 14, 2016 · 11 comments

Member

nelson-liu commented Mar 14, 2016

Hi,
CLTK by default places corpuses that are cloned (e.g. when running tests) in ~/cltk_data. I personally don't like putting datafiles in my home directory, and would prefer to move it somewhere else. Perhaps we could integrate an option for changing where CLTK puts corpuses by default?

Member

kylepjohnson commented Mar 14, 2016

Hi Nelson, this is a good topic to bring up. For the sake of simplicity, I settled on the home dir when I started development.

If you want to take this on, I encourage you to look at every place in the codebase where cltk_data appears. Then you'll need to propose a way for users to customize the location. To be honest, I don't want to clutter up the code too much to make this option happen, and I will still want to keep a default of the home directory.

hemantpugaliya commented Mar 16, 2016

Hi,
I have an idea for solving this problem. We can keep a config file in the home directory, ~/.cltk_config (a hidden file), which is created when the library is initialized. This file can store the path (or any other configuration in the future). For the data directory, we can have a variable DATA_DIR which is initialized to ~/cltk_data by default.
Next, we provide a utility function to change the data directory. It takes a base directory (say BASE_DIR) as input, moves the current data directory into the base directory, and updates ~/.cltk_config. After this function is called, the data directory will be found at BASE_DIR/cltk_data.
We will need another function that reads ~/.cltk_config and returns that value for use wherever we currently refer to ~/cltk_data. We will also have to handle a few other cases, such as the user deleting ~/.cltk_config or editing it so that no DATA_DIR variable is found.
I would like to work on it, but I may not be able to start right away as I have tests in the upcoming week.
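
The lookup described in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from CLTK: the function names `read_data_dir` and `write_data_dir`, the `[cltk]` section name, and the choice of an INI-style file parsed with `configparser` are all hypothetical. The sketch falls back to ~/cltk_data when the config file is missing, unreadable, or has no DATA_DIR entry, and it leaves out the step of moving the existing data directory.

```python
import configparser
from pathlib import Path

# Default location named in the thread.
DEFAULT_DATA_DIR = Path.home() / "cltk_data"


def read_data_dir(config_path: Path) -> Path:
    """Return the data directory from the config file, or the default."""
    # interpolation=None so that a '%' in a path is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    try:
        # read() silently skips a file that does not exist.
        parser.read(config_path)
    except configparser.Error:
        # Malformed file (e.g. no section header): use the default.
        return DEFAULT_DATA_DIR
    # Option names are case-insensitive, so DATA_DIR matches data_dir.
    value = parser.get("cltk", "DATA_DIR", fallback=None)
    if not value:
        return DEFAULT_DATA_DIR
    return Path(value).expanduser()


def write_data_dir(config_path: Path, base_dir: Path) -> Path:
    """Record BASE_DIR/cltk_data in the config file and return that path."""
    data_dir = Path(base_dir).expanduser() / "cltk_data"
    parser = configparser.ConfigParser(interpolation=None)
    # Hypothetical section name; stored on disk as a lowercase key.
    parser["cltk"] = {"DATA_DIR": str(data_dir)}
    with open(config_path, "w") as handle:
        parser.write(handle)
    return data_dir
```

Every place that currently hard-codes ~/cltk_data would then call something like `read_data_dir(Path.home() / ".cltk_config")` instead.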

hemantpugaliya commented Mar 17, 2016

@kylepjohnson @nelson-liu What do you think of the idea?
Can I work on it?

Member

nelson-liu commented Mar 18, 2016

Hmm, I know of several other libraries that have functionality like this, but I haven't had the time to look into them. Your idea seems sound, but please take a look at what others are doing first and see if you can compare and contrast the approaches. Feel free to work on this if you want.

Member

kylepjohnson commented Mar 19, 2016

@nelson-liu Does this solution of a ~/.cltk_config file meet your desire of not writing anything to your home dir?

I'm open to this, but I want to warn everyone that I'll be picky about the implementation in code. Of particular concern is the possibility that this option (which very few of our users will want) will clutter our code. For a linguistic FOSS project like ours, I am very mindful of the technical debt that "nice to haves" like this can create.

Please don't misunderstand me: I agree that this would be a nice function, but I'm only willing to trade so much complexity in exchange for it. So let me know if one of you is serious about it and I'll assign you the issue; just be aware that I'll scrutinize it quite carefully.

Member

nelson-liu commented Mar 19, 2016

Yeah, I definitely echo your sentiment regarding the pile-up of technical debt that comes with implementing features like this. I ended up just symlinking the data directory to where I wanted it to be, which isn't an ideal solution but suffices. I've reversed my stance and no longer think this is worth implementing, mainly because of its niche nature.
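
For anyone who wants the same workaround, the symlink can be created along these lines. This is a sketch only: the helper name `link_data_dir` is hypothetical, and it assumes nothing already exists at the link location (an existing ~/cltk_data would have to be moved to the new location first). On Windows, creating symlinks may require extra privileges.

```python
import os
from pathlib import Path


def link_data_dir(real_dir: Path, link_path: Path) -> None:
    """Create link_path as a symlink pointing at real_dir."""
    real_dir = Path(real_dir)
    link_path = Path(link_path)
    # Make sure the real storage location exists.
    real_dir.mkdir(parents=True, exist_ok=True)
    # Refuse to overwrite an existing directory, file, or (broken) link.
    if link_path.is_symlink() or link_path.exists():
        raise FileExistsError(f"{link_path} already exists; move it first")
    os.symlink(real_dir, link_path, target_is_directory=True)
```

A call such as `link_data_dir(Path("/some/other/place/cltk_data"), Path.home() / "cltk_data")` (the first path is a placeholder) leaves CLTK reading and writing ~/cltk_data as usual while the files live elsewhere.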

Member

kylepjohnson commented Mar 19, 2016

@nelson-liu Thanks for helping me think this through. To be honest, I grappled with the idea ~3 years ago, when I first experimented with allowing alternate dir locations … but I just ended up confusing myself more often than helping.

I encourage you to keep raising these kinds of issues, however. As the project grows, my old assumptions need to be challenged. Closing issue but dialog can continue.

hemantpugaliya commented Mar 19, 2016

@nelson-liu @kylepjohnson
In case this issue is taken up again in the future, here is another possible option.
I looked into how NLTK implements this feature. It provides the following options:

  1. Define an NLTK_DATA environment variable.
  2. Store separate parts of the data (e.g. corpora, models) in different locations, which need not share a parent directory, and add those paths to NLTK's search path, the nltk.data.path list. NLTK looks through every directory in that list when searching for a resource.
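
Applied to CLTK, the first option might look something like the sketch below. The variable name `CLTK_DATA` is modeled on NLTK_DATA and the function `get_data_dir` is hypothetical; neither is presented here as an existing CLTK setting. The `environ` parameter exists only so the function can be exercised without touching the real environment.

```python
import os
from pathlib import Path


def get_data_dir(environ=None) -> Path:
    """Return the data directory, honoring a CLTK_DATA override if set."""
    env = os.environ if environ is None else environ
    # Hypothetical variable name, by analogy with NLTK_DATA.
    override = env.get("CLTK_DATA")
    if override:
        return Path(override).expanduser()
    # Default location named in the thread.
    return Path.home() / "cltk_data"
```

Because the variable is read each time the function is called, the library itself would not need to store anything; a user who wants a different location would set the variable in their shell profile or before starting Python.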

Member

kylepjohnson commented Mar 19, 2016

@hemantpugaliya Thanks. The first is what I did initially, following NLTK's lead. I forget the details of the problem, but it was basically this: when the CLTK is imported as a library, how and where do we persist this CLTK_DATA variable?

Contributor

diyclassics commented May 29, 2016

Checking in on this issue: I wrote some code that uses NLTK's PlaintextCorpusReader to load CLTK corpora, with the 'home directory' default as the path (cf. https://github.com/diyclassics/cltk/blob/master/cltk/corpus/latin/__init__.py). I feel like it should check other locations before raising an error, and @hemantpugaliya's environment variable solution seems like a good idea.
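
Checking several locations before raising an error could be sketched as follows. This is an illustration, not the code in the linked file: `find_corpus_dir` and the relative path used in the usage note are hypothetical, and the caller decides which base directories to search and in what order.

```python
from pathlib import Path
from typing import Iterable


def find_corpus_dir(relative: str, search_paths: Iterable[Path]) -> Path:
    """Return the first existing <base>/<relative> directory.

    Raises FileNotFoundError listing every location that was tried.
    """
    tried = []
    for base in search_paths:
        candidate = Path(base).expanduser() / relative
        if candidate.is_dir():
            return candidate
        tried.append(str(candidate))
    raise FileNotFoundError(
        "Corpus not found; looked in: " + ", ".join(tried)
    )
```

For example, `find_corpus_dir("latin/text/example_corpus", [override_dir, Path.home() / "cltk_data"])` would prefer a user-specified `override_dir` and fall back to the home directory default; the path it returns could then be passed to a corpus reader as the root.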

TinaRussell commented Dec 8, 2017

I hope this issue is still being worked on, somewhere—I love the idea of CLTK but I can’t stand it when programs add new directories to my home directory and won’t let me specify somewhere else. I’m particular about how my computer is organized, and this kind of thing feels like a precocious child is trying to remodel part of my house.
