Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update RA manual and standardize language across documentation #20

Closed
z-y-huang opened this issue May 1, 2019 · 19 comments · Fixed by #22
Closed

Update RA manual and standardize language across documentation #20

z-y-huang opened this issue May 1, 2019 · 19 comments · Fixed by #22
Assignees

Comments

@z-y-huang
Copy link
Contributor

  • Update RA manual to incorporate from procedures from latest version of gslab_make
  • Standardize language across documentation
@z-y-huang z-y-huang self-assigned this May 1, 2019
@z-y-huang
Copy link
Contributor Author

Hi @gentzkow:

To follow up on this comment here, I've synced any changes to gslab_make with template and updated the RA manual in the wiki.

Let me know if there are any changes you would like me to make before opening an issue to have the other RAs review the template.

@gentzkow
Copy link
Owner

gentzkow commented May 1, 2019

Thanks @z-y-huang. I've read over the wiki and made some minor edits.

Here are some additional to dos I'd suggest we knock off before sending for review.

  1. I realize we have not included /temp/ directories in the template. These can be pretty important in my experience as a place to store intermediate files that are only ever needed within the module. I added discussion of these to the repository structure wiki page. Why don't you also add /temp/ to the template's .gitignore. When we create the live working examples we should make sure and include one that uses the /temp/ directory.

  2. I would like to allow code to reference /raw/ directories directly. I removed the note in the manual about only referencing these via /input/. I'd appreciate it if you could update the template to reflect this (or let me know if you disagree with the choice).

  3. I think we should implement in the template the decision we made in SocialMediaEffects to copy and commit inputs for paper_slides to the repo rather than making them links. I'd appreciate it if you could implement this in the template and update the wiki to explain it.

  4. We need a more detailed discussion of the protocol for releasing output to an external directory (e.g., like we do with output_local). Let's make this a separate section of the repository structure wiki page and provide more detailed instructions for how exactly this works. Then when we discuss output_local we can just reference this section.

  5. We need to do some work on readme.md. This should focus on steps needed to setup the repository and get it running. It should refer via a link to the wiki for instructions on using the repository (e.g., how to run the build scripts). We should try to get rid of anything that is redundant with the wiki.

    I would also suggest that we include separate sections at the end called "Specific Instructions for Mac" and "Specific Instructions for Windows" that collect everything platform specific. The main section on command line usage can be very short and just give the list of commands that should be able to run the software by default, then say how to change the names for executables, then point to the mac / windows sections for more details on setup.

@z-y-huang
Copy link
Contributor Author

z-y-huang commented May 1, 2019

@gentzkow:

Got it.

It sounds like from the emails that we want to support Jupyter Notebooks, which I just added with a run function called run_jupyter. The run_jupyter function is currently executing notebooks via the command line, though we can also change it to be a wrapper for the Python API interface. The advantage of executing via command line is that it keeps the codebase simple as run_jupyter will mirror the other run functions. The advantage of executing via Python API interface is that it's one less thing users have to set up command line usage for. Do you have a preference either way?

z-y-huang pushed a commit that referenced this issue May 1, 2019
@gentzkow
Copy link
Owner

gentzkow commented May 1, 2019 via email

@z-y-huang
Copy link
Contributor Author

@gentzkow:

I believe we can add Jupyter to the requirements.txt and preface the jupyter command with python -m. I'll need to double-check this won't potentially lead to conflicts with Anaconda (since that's how most people have Jupyter installed), but will go with that if there isn't.

@gentzkow
Copy link
Owner

gentzkow commented May 2, 2019 via email

z-y-huang pushed a commit that referenced this issue May 2, 2019
z-y-huang pushed a commit that referenced this issue May 2, 2019
z-y-huang pushed a commit that referenced this issue May 3, 2019
@z-y-huang
Copy link
Contributor Author

z-y-huang commented May 3, 2019

Hi @gentzkow:

I pushed a branch addressing the items here.

Four comments:

  1. My original reasoning behind imposing the rule that /code/ only ever reference /input/ (and not /raw/) was that doing so makes it more transparent what source files are used for analysis (since often times, /raw/ contains files that aren't used).

  2. The current procedure for saving external outputs requires manually uploading the output externally and saving the appropriate documentation. This seems like it could lead to room for error. One thought was to write an optional utility function that automatically uploads the output and saves the appropriate documentation.

  3. I ended up using the Python API interface for Jupyter notebook since it's more robust when it comes to Python version compatibility. The API lets you run notebooks created with Python 2 using Python 3 and vice-versa.

  4. From the testing I did, gslab_make should play nicely with Anaconda and run on different IDEs. One quibble is that Anaconda creates environments where the PATH variable may be different from your system environment. For example, Matlab is added to my PATH for my system environment but not my Anaconda environment, which means run_matlab works when running it via shell but not through Spyder. This is an idiosyncratic issue though, but something I can note in the RA manual.

@gentzkow
Copy link
Owner

gentzkow commented May 5, 2019

Great.

On (1), I understand that reasoning. I think the rule we want to stick to is just that code can never reference a directory outside its own module.

I continue to go back and forth between these two structures we've discussed before

/data/ with subdirectory /raw/
/analysis/
etc.

or

/raw/
/data/
/analysis/
etc.

The former would correspond to a case where /raw/ is consumed only by code within /data/ and then (most) everything downstream consumes /data/output/ but not /data/raw/. The latter would correspond to a case where many different modules read from /raw/.

An annoying detail is what to do about things like image files we want to use only for slides. Current plan would be those go in /paper_slides/raw/. In the second scheme, are we OK having /raw/ both as a standalone module and as a sub within other modules? Or do we want to force those image files to all be in the top-level /raw/ directory.

Let me know your views.

On (2), I like the idea of a utility script but let's sequence that later after we finish tidying up and do the next round of review. For now let's just make sure the manual instructions are clear.

Good on (3)-(4).

One note: on the hard-committed input files for paper_slides what we currently have doesn't work. I still can't compile paper.lyx without running make.py first. I think the input files were committed as symbolic links rather than actual files.

@gentzkow
Copy link
Owner

gentzkow commented May 5, 2019

Also, this looks like a typo in setup_repository.py...

$ python setup_repository.py
Traceback (most recent call last):
  File "setup_repository.py", line 83, in <module>
    configuration()
  File "setup_repository.py", line 78, in configuration
    check_software(config, config_user)
  File "setup_repository.py", line 59, in check_software
    check_executable(default_executabless[os.name]['git-lfs'])
NameError: global name 'default_executabless' is not defined

@gentzkow
Copy link
Owner

gentzkow commented May 5, 2019

Can we rename setup_repository.py -> check_setup.py?

That seems more accurate if the script is not doing any actual setup just checking if it's correct.

@gentzkow
Copy link
Owner

gentzkow commented May 5, 2019

I get the following error even when I run make.py in /analysis/, where there are no files >0.5mb.

WARNING! Certain files tracked by git exceed config limit (0.5 MB). See logs for more detail.

@z-y-huang
Copy link
Contributor Author

@gentzkow:

My mistake, I didn't push a bugfix. I fixed the typo and committed the hard files in paper_slides.

The size check function is currently repository-wide, but I can make it only walk through the module if that's what we want.

@gentzkow
Copy link
Owner

gentzkow commented May 6, 2019

Ah, I see. Yes, I think it would make sense to have the size check function be module-specific. It would also be great if it could list the offending files.

Do you know why I would have been getting that warning on a clean checkout of the template? Presumably it should be set up so all large files are tracked w/ LFS.

@z-y-huang
Copy link
Contributor Author

z-y-huang commented May 6, 2019

@gentzkow:

The list of offending files are saved in make.log (I can reword my warning message to make that clearer). I didn't output the file list into the console because it seemed like it could get annoying if you intentionally wanted to keep those files.

The data-generating example script is saving TXTs instead of a CSVs, which we don't have added to LFS. It looks like gslab_make also has a tar file for testing, which I can start adding to LFS.

@z-y-huang
Copy link
Contributor Author

@gentzkow:

I personally like the idea of a centralized /raw/ folder with a /general/ subdirectory for any files that are used across multiple repositories and /module/ subdirectories for files that are only used in a specific module. There's a small cost to having to create input links/copies in all of the modules, but I think a single place for all of the data is useful for cases where we may want to release the code for a repository but not any of the data.

I think either way, we do want a clear distinction between files that are used only for a specific module vs. files that are used across multiple modules. That way, the dependencies are clearly outlined and it's much less likely that someone forgets to run all of the modules that rely on a file.

@gentzkow
Copy link
Owner

gentzkow commented May 6, 2019

Great. Thanks.

On the large files, saving in make.log is fine. Let's update the warning message to point the user to make.log for the list. Let's switch the data-generating example so it saves .csv rather than .txt and add .tar to LFS. (We might want to make a note to clean up the previously committed versions of these at some point so the repository isn't heavier-weight than necessary; will matter since this repo will be a base for all other repos.)

On raw, let's go to a centralized /raw/ directory then. I don't think we need separate /module/ and /general/ subs -- if all other modules call from /raw/ using inputs.txt the record of which files are used where will be clear.

There is a general question of how we look at dependencies downstream. inputs.txt makes it easy to ask which other files F module M depends on. But it is hard to ask which modules M depend on some arbitrary file F. In the past we'd written a repo mapping tool that would crawl the inputs.txt and externals.txt files and make a nice network map of the relationships. Once we finish the first round of review we can think about creating an updated version of that.

z-y-huang pushed a commit that referenced this issue May 6, 2019
z-y-huang pushed a commit that referenced this issue May 7, 2019
…s to CSV, rewrite warning/error messages, and add tar files to LFS
@z-y-huang
Copy link
Contributor Author

@gentzkow:

I have addressed the comments here and here and have updated all of the documentation accordingly.

I can put (i) cleaning the commit history (the .git folder for a clean checkout is currently about ~4 MB, so not egregious yet), (ii) adding a utility function for uploading external files, and (iii) adding a utility function for mapping dependencies on the long-term todo list.

gentzkow added a commit that referenced this issue May 8, 2019
@gentzkow
Copy link
Owner

gentzkow commented May 8, 2019

Great! I think we're basically done and ready to go to review.

The only small thing I'd suggest is that we store some actual raw data in /raw/ that is then consumed by /data/ (rather than having the code in /data/ generate the data itself).

We might as well make the example code a little more live by having a raw data set in /raw/ then doing some transformations of it in /data/.

When we send for review we should make clear to everyone that it's next on the to do list to replace all of this with a more interesting / rich set of example scripts.

Oh, and also: I think I'm going to backtrack on my suggestion to have the .ipynb code as part of the example code in the main repo. It's great that you wrote the run_jupyter function and we definitely will want to put a jupyter notebook directory as one of the add-on examples we include but I think it will be simpler if all the examples are plain vanilla python.

@z-y-huang
Copy link
Contributor Author

@gentzkow:

I added a dataset in /raw/ and removed the notebook in /analysis/. Closing this issue and will open a new one for review.

z-y-huang pushed a commit that referenced this issue May 8, 2019
z-y-huang added a commit that referenced this issue May 9, 2019
…ation (#22)

* Add Jupyter notebook support for #20

* Update run_jupyter to use Python API for #20

* Test out run_jupyter

* Make run_jupyter Python version agnostic for #20

* Update README and setup_repository for #20

* Make edits to README for #20

* Simplify setup_repository

* Track /input/ and /external/ for /paper_slides/, generalize executable for git-lfs for #20

* Bugfix and commit paper_slides input for #20

* Typechange for symlink conversion for #20

* Change check_repo_size to check_module size for #20

* Commit everything from 882ef90

* #20: Update documentation, standardize language, change TXT data files to CSV, rewrite warning/error messages, and add tar files to LFS

* #20: Reword README to be repository agnostic and update setup scripts

* Fix typo

* #20 correct typo in readme.md

* Add context to administrator mode in README

* #20: Restructure repository to use already generated data file in /raw/

* #20: Delete notebook in analysis

* #20: Set merge strategy to ours for log files

* Fix .gitattributes reference

* #20: Add merge strategy ours to config file

* #20: Set merge strategy to binary for log files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants