Make lab notebooks collaborative, reproducible, maintainable, and navigable #18

Open
shntnu opened this issue Aug 14, 2019 · 8 comments
Labels
Rule Discussing possible rule

Comments

@shntnu
Member

shntnu commented Aug 14, 2019

Here, we define the elements of an effective lab notebook.

@shntnu shntnu added the Rule Discussing possible rule label Aug 14, 2019
@gwaybio
Member

gwaybio commented Mar 20, 2020

in broadinstitute/hepatocyte_cellpainting_sigma#25 @shntnu writes:

issues to discuss the experiment, and PRs to describe the implementation

Based on this approach, you would create an issue for the experiment, describe your results there, and seek feedback via the issue, not the PR.

@shntnu
Member Author

shntnu commented Apr 13, 2020

I have copied below a set of our (past) electronic lab notebook (ELN) requirements. These have evolved since then but are still more or less representative.

| Priority | Feature | Minimal requirements |
|---|---|---|
| High | Tagging | Anyone can create tags, no constraint on no. of tags |
| High | WYSIWYG | Drag drop images |
| High | Searching | Searchable using tags, save searches, boolean operators |
| High | Backup | Should be able to backup using Crashplan or Broad servers |
| High | Export | Should be able to export notes in a common format |
| High | Linking | Should be able to link to other notes |
| High | Inline comments | Comments should not be editable by others, but others should be able to "resolve" or reply to it. Comment should automatically be associated with author. |
| High | Large amounts of content | Each note needs to be able to handle an appropriate amount of images/data |
| Medium | Collaborative editing | No conflicts if multiple people edit |
| Medium | Modularity | Each note should be a separate entity |
| Medium | Notification | Be notified if you are tagged, or a page is updated |
| Medium | Task management | Assign tasks to yourself or others |
| Medium | Access to outsiders | When we occasionally want to share a result with someone not regularly using the system, we want to send them a link they can access via web browser |
| Low | Offline access/editing | This would be convenient |
| Low | Revision history (esp in case of accidents) | Be able to view changes to notes |
| Low | Smooth transition to writing papers | Be able to construct a paper outline based on notes |
| Really low | Permissions/security | We probably don't care about this feature; nearly all content would be available to all |

@shntnu
Member Author

shntnu commented May 13, 2021

When creating a note (=issue), give the title some thought. Here's what we've followed:

  1. If you've already completed the analysis when creating the note, the title is a sentence outlining the conclusion of the analysis.
  2. If not, the title is the hypothesis or the question being addressed; update the title later with your conclusion.

@shntnu
Member Author

shntnu commented Mar 3, 2023

Here's a great template for lab notebooks on GitHub: https://github.com/uwescience/shablona

@shntnu
Member Author

shntnu commented Mar 6, 2024

Guidelines for using a GitHub repo as a lab notebook

  • Repository Structure
    • Readme: Landing page for the project, detailing the project and relevant links
    • Multiple folders, each corresponding to a module of analysis
    • Folder naming convention: 00.<name>, 01.<name>, ...
    • Each folder contains:
      • Multiple notebooks or scripts named in the same manner, i.e., 00.<name>, 01.<name>, ...
      • Three subfolders: input, output, and figures
  • Notebooks
    • Most of the analysis is done in notebooks
    • We may also use scripts that are convertible to notebooks using jupytext instead of notebooks themselves: this makes developing and editing pipelines much friendlier to the git + plain-text toolchain.
    • Black is used for formatting. TODO: Decide if we should use ruff instead.
  • Scripts
    • Typically, computationally intensive code is written in scripts instead of notebooks. TODO: Explain rationale.
  • Environments
    • Conda is used for managing environments
    • Each folder has its own Conda environment
  • Figures
    • Figures are saved and committed
    • Final versions of figures are saved as SVG
    • Figures are generated using x.generate-figures.ipynb in each folder
      • Reads data frame from the output folder
      • Produces figures
      • Allows reproducing figures without redoing all the analysis
  • Issues
    • Used for making individual notes and discussions
    • Figures and snippets of analysis are pasted into issues for discussion
    • Most discussions happen in issues, not pull requests
    • Be deliberate with the issue title
      • the title is the hypothesis or the question being addressed
      • once completed, the title is the conclusion of the analysis.
  • Pull Requests
    • Used for discussions that require implementation-specific details
    • Avoid having too many discussions in pull requests; use issues instead
    • Pull requests should be small
    • Use fork-and-branch approach
    • TODO: Adopt a naming convention for pull requests
  • Data Versioning
    • Preferably use DVC (Data Version Control) for versioning data
    • If DVC is not possible, use GitLFS (Git Large File Storage)
    • Only use one of the two: either DVC or GitLFS, not both
  • No junk
    • Use pre-commit hooks to enforce rules, such as no large files
  • .gitignore
    • Have a standard for this
  • Style guide
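
The folder convention under "Repository Structure" above can be sketched in a few lines of Python. This is only an illustration, not a lab-standard tool: the function name `scaffold_modules` and the module names passed to it are hypothetical, and only the `NN.<name>` numbering and the `input`, `output`, and `figures` subfolders come from the guidelines.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence

# The three subfolders each module folder is expected to contain.
SUBFOLDERS = ("input", "output", "figures")


def scaffold_modules(root: Path, module_names: Sequence[str]) -> List[Path]:
    """Create numbered module folders (00.<name>, 01.<name>, ...) under root.

    Hypothetical helper: the numbering follows the order of module_names.
    """
    created = []
    for index, name in enumerate(module_names):
        module_dir = root / f"{index:02d}.{name}"
        for sub in SUBFOLDERS:
            # exist_ok=True makes the helper safe to re-run on an existing repo.
            (module_dir / sub).mkdir(parents=True, exist_ok=True)
        created.append(module_dir)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory; the module names are made up.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_modules(Path(tmp), ["preprocess", "analysis"]):
            print(path.name, sorted(p.name for p in path.iterdir()))
```

Notebooks or scripts inside each module folder would follow the same numbering by hand; the sketch deliberately does not create them.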

Pitfalls

  • @afermg said: The folder structure doesn't cover multiple iterations of a given analysis. I will exemplify this with a use-case: Let us say I perform an analysis and, after discussion, we conclude that we should try changing some parameters and see if it improves. We then need to compare these different results. My proposed solution to this is to have (optional) datetimes for every time an analysis was run.
  • @afermg said: Using GitHub Issues for project management is far from ideal, but I understand it is the best we have. My biggest gripe is that they are not searchable from external search engines. I think automating the conversion of GitHub issues to SQLite files, coupled with good old Datasette, would make what is going on in all projects more accessible. As a long-term solution, I think writing project management in Markdown documents (I've been pleased with HedgeDoc, a FOSS alternative to Docs for collaborative Markdown editing) will be the way once RAG (retrieval-augmented generation) tools, such as greptile, become more mature.
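
To make the issues-to-SQLite idea in the last pitfall concrete, here is a rough sketch using only Python's standard `sqlite3` module. Everything specific in it is an assumption for illustration: the sample records are invented, the table layout is one possible choice, and the helper names `issues_to_sqlite` and `search_issues` are hypothetical. Fetching real issues (for example through the GitHub API) is left out on purpose.

```python
import sqlite3

# Invented sample records; real ones would come from the GitHub API or an export.
SAMPLE_ISSUES = [
    {"number": 1, "title": "Does normalization change clustering?",
     "state": "open", "body": "Comparing two parameter settings."},
    {"number": 2, "title": "Batch effects are small after correction",
     "state": "closed", "body": "Figures pasted below."},
]


def issues_to_sqlite(issues, db_path=":memory:"):
    """Store issue records in a SQLite table and return the open connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS issues ("
        "number INTEGER PRIMARY KEY, title TEXT, state TEXT, body TEXT)"
    )
    # INSERT OR REPLACE lets the same export be loaded again without duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO issues (number, title, state, body) "
        "VALUES (:number, :title, :state, :body)",
        issues,
    )
    conn.commit()
    return conn


def search_issues(conn, term):
    """Return (number, title) for issues whose title or body contains term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT number, title FROM issues "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY number",
        (pattern, pattern),
    )
    return rows.fetchall()


if __name__ == "__main__":
    connection = issues_to_sqlite(SAMPLE_ISSUES)
    print(search_issues(connection, "batch"))
```

Passing a file path instead of `":memory:"` would produce a database file that a tool such as Datasette could then serve for browsing.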

@afermg

afermg commented Mar 8, 2024

I just have a few comments on this, though I am aware that it is a complex challenge:

  • I'd like to have the option of using scripts that are convertible to notebooks using jupytext instead of notebooks: this makes developing and editing pipelines much friendlier to the git + plain-text toolchain.
  • Using GitHub Issues for project management is far from ideal, but I understand it is the best we have. My biggest gripe is that they are not searchable from external search engines. I think automating the conversion of GitHub issues to SQLite files, coupled with good old Datasette, would make what is going on in all projects more accessible. As a long-term solution, I think writing project management in Markdown documents (I've been pleased with HedgeDoc, a FOSS alternative to Docs for collaborative Markdown editing) will be the way once RAG (retrieval-augmented generation) tools, such as greptile, become more mature.
  • As an add-on: issues are flat, while project analyses are trees. Any solution that models the problem correctly requires directed acyclic graphs (DAGs); there is a discussion of that here, for anyone interested. The mismatch may be alleviated by using GitHub's "track dependency" system.
  • Regarding Python style: Black used to be the big shark until LSP came about; I'd suggest also considering ruff (or ruff-lsp) nowadays. In my experience it offers the best speed-to-quality ratio of all the LSPs that exist, and it does everything Black does and more, covering both formatting and linting.
  • The folder structure doesn't cover multiple iterations of a given analysis. I will exemplify this with a use-case: Let us say I perform an analysis and, after discussion, we conclude that we should try changing some parameters and see if it improves. We then need to compare these different results. My proposed solution to this is to have (optional) datetimes for every time an analysis was run.
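
One possible reading of the "optional datetimes" proposal in the last bullet, sketched with the standard library only. The layout (one timestamped subfolder per run inside a module's `output` folder), the timestamp format, and the helper names `new_run_dir` and `latest_run_dir` are all assumptions made for illustration, not an agreed convention.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def new_run_dir(output_dir: Path, when: Optional[datetime] = None) -> Path:
    """Create a subfolder named after the run's timestamp and return it.

    The name format (e.g. 20240308T120000) sorts chronologically as text.
    """
    when = when or datetime.now(timezone.utc)
    run_dir = output_dir / when.strftime("%Y%m%dT%H%M%S")
    # exist_ok=False: refuse to silently overwrite an earlier run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def latest_run_dir(output_dir: Path) -> Optional[Path]:
    """Return the most recent run folder, or None if there are no runs yet."""
    runs = sorted(p for p in output_dir.iterdir() if p.is_dir())
    return runs[-1] if runs else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        output = Path(tmp) / "output"
        new_run_dir(output, datetime(2024, 3, 6, 9, 30, 0))
        new_run_dir(output, datetime(2024, 3, 8, 12, 0, 0))
        print(latest_run_dir(output).name)
```

Keeping each iteration in its own folder means results from different parameter settings can sit side by side and be compared, at the cost of committing (or data-versioning) more output.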

I still have to digest some of the ideas and challenges, so I may come back to this issue down the line.

@shntnu
Member Author

shntnu commented Mar 8, 2024

I just have a few comments on this, though I am aware that it is a complex challenge:

Thanks for sharing your thoughts, @afermg! I've updated my comment.

@shntnu
Member Author

shntnu commented Jun 24, 2024

@afermg I've now created https://github.com/broadinstitute/carpenter-singh-lab-standards/blob/main/01-lab-notebook.md, so let's continue the discussion in that repo.

3 participants