Make lab notebooks collaborative, reproducible, maintainable, and navigable #18

shntnu · 2019-08-14T23:37:00Z

Here, we define the elements of an effective lab notebook.

gwaybio · 2020-03-20T13:04:04Z

In broadinstitute/hepatocyte_cellpainting_sigma#25, @shntnu writes:

> issues to discuss the experiment, and PRs to describe the implementation

Based on this approach, you would create an issue for this experiment, describe your results there, and seek feedback via the issue, not the PR.

shntnu · 2020-04-13T13:21:53Z

I have copied below a set of our (past) ELN requirements. These have evolved since then, but are still more or less representative.

| Priority | Feature | Minimal requirements |
| --- | --- | --- |
| High | Tagging | Anyone can create tags; no constraint on the number of tags |
| High | WYSIWYG | Drag-and-drop images |
| High | Searching | Searchable using tags, saved searches, boolean operators |
| High | Backup | Should be able to back up using CrashPlan or Broad servers |
| High | Export | Should be able to export notes in a common format |
| High | Linking | Should be able to link to other notes |
| High | Inline comments | Comments should not be editable by others, but others should be able to "resolve" or reply to them. Each comment should automatically be associated with its author. |
| High | Large amounts of content | Each note needs to be able to handle an appropriate amount of images/data |
| Medium | Collaborative editing | No conflicts if multiple people edit |
| Medium | Modularity | Each note should be a separate entity |
| Medium | Notification | Be notified if you are tagged, or a page is updated |
| Medium | Task management | Assign tasks to yourself or others |
| Medium | Access to outsiders | When we occasionally want to share a result with someone not regularly using the system, we want to send them a link they can access via web browser |
| Low | Offline access/editing | This would be convenient |
| Low | Revision history (esp. in case of accidents) | Be able to view changes to notes |
| Low | Smooth transition to writing papers | Be able to construct a paper outline based on notes |
| Really low | Permissions/security | We probably don't care about this feature; nearly all content would be available to all |

shntnu · 2021-05-13T12:35:35Z

When creating a note (=issue), give the title some thought. Here's what we've followed:

If you've already completed the analysis when creating the note, the title is a sentence outlining the conclusion of the analysis.
If not, the title is the hypothesis or the question being addressed (and then later update the title with your conclusion).

shntnu · 2023-03-03T19:40:40Z

Here's a great template for lab notebooks on GitHub: https://github.com/uwescience/shablona

shntnu · 2024-03-06T21:40:24Z

Guidelines for using a GitHub repo as a lab notebook

Repository Structure
- Readme: Landing page for the project, detailing the project and relevant links
- Multiple folders, each corresponding to a module of analysis
- Folder naming convention: 00.<name>, 01.<name>, ...
- Each folder contains:
  - Multiple notebooks or scripts named in the same manner, i.e., 00.<name>, 01.<name>, ...
  - Three subfolders: input, output, and figures
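
The layout above can be sketched as a small scaffolding script. This is a minimal sketch, not an agreed lab tool: the function name `scaffold_module` and the module names `load-data` and `normalize` are hypothetical. Only the `NN.<name>` numbering and the `input`, `output`, and `figures` subfolders come from the guidelines.

```python
from pathlib import Path
import tempfile

# The three subfolders named in the guidelines.
SUBFOLDERS = ("input", "output", "figures")


def scaffold_module(repo_root: Path, index: int, name: str) -> Path:
    """Create one analysis module folder (e.g. 00.<name>) with its subfolders."""
    module_dir = repo_root / f"{index:02d}.{name}"
    for sub in SUBFOLDERS:
        (module_dir / sub).mkdir(parents=True, exist_ok=True)
    return module_dir


# Example run in a throwaway directory; the module names are placeholders.
repo_root = Path(tempfile.mkdtemp())
created = [
    scaffold_module(repo_root, i, name)
    for i, name in enumerate(["load-data", "normalize"])
]

# List everything that was created, relative to the repository root.
layout = sorted(p.relative_to(repo_root).as_posix() for p in repo_root.rglob("*"))
print("\n".join(layout))
```

Zero-padding the index keeps folders sorted in execution order in both the file browser and `ls`.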
Notebooks
- Most of the analysis is done in notebooks
- We may also use jupytext scripts, which are plain-text files convertible to notebooks. This makes developing and editing pipelines much friendlier to git and plain-text toolchains.
- Black is used for formatting. TODO: Decide if we should use ruff instead.
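
As an illustration of the jupytext-convertible scripts mentioned above: in jupytext's `percent` format, a notebook is an ordinary Python file whose cells are delimited by `# %%` comments, so it runs as a plain script and diffs cleanly in git. The sketch below is illustrative only; the file name `00.summarize.py` and the toy values are made up.

```python
# %% [markdown]
# # 00.summarize
# Toy example of a script that jupytext can pair with a notebook.

# %%
from statistics import mean

# Placeholder measurements; a real module would read from its input folder.
values = [0.8, 1.1, 0.9, 1.2]

# %%
# Each "# %%" marker starts a new code cell when the file is opened as a notebook.
summary = {"n": len(values), "mean": round(mean(values), 3)}
print(summary)
```

Assuming jupytext is installed, `jupytext --to notebook 00.summarize.py` converts the script to an `.ipynb` file.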
Scripts
- Typically, computationally intensive code is written in scripts instead of notebooks. TODO: Explain rationale.
Environments
- Conda is used for managing environments
- Each folder has its own Conda environment
Figures
- Figures are saved and committed
- Final versions of figures are saved as SVG
- Figures are generated using x.generate-figures.ipynb in each folder
  - Reads data frame from the output folder
  - Produces figures
  - Allows reproducing figures without redoing all the analysis
Issues
- Used for making individual notes and discussions
- Figures and snippets of analysis are pasted into issues for discussion
- Most discussions happen in issues, not pull requests
- Be deliberate with the issue title
  - While the analysis is in progress, the title is the hypothesis or the question being addressed
  - Once the analysis is completed, the title is its conclusion
Pull Requests
- Used for discussions that require implementation-specific details
- Avoid having too many discussions in pull requests; use issues instead
- Pull requests should be small
- Use fork-and-branch approach
- TODO: Adopt a naming convention for pull requests
Data Versioning
- Preferably use DVC (Data Version Control) for versioning data
- If DVC is not possible, use Git LFS (Git Large File Storage)
- Only use one of the two: either DVC or Git LFS, not both
No junk
- Use pre-commit hooks to enforce rules, such as no large files
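
In practice, the `check-added-large-files` hook from the pre-commit project's standard hook collection already covers the large-file rule. To show what such a rule checks, here is a minimal standalone sketch. The function name `find_large_files` and the 500 kB limit are assumptions for illustration, not lab policy.

```python
from pathlib import Path
import tempfile


def find_large_files(paths, max_kb=500):
    """Return the paths whose size exceeds max_kb kilobytes (assumed limit)."""
    limit_bytes = max_kb * 1024
    return [
        p for p in paths
        if Path(p).is_file() and Path(p).stat().st_size > limit_bytes
    ]


# Demonstration on two temporary files: one small, one just over the limit.
tmp = Path(tempfile.mkdtemp())
small = tmp / "notes.md"
small.write_text("short note\n")
big = tmp / "raw_dump.bin"
big.write_bytes(b"\0" * (500 * 1024 + 1))

offenders = find_large_files([small, big])
print([p.name for p in offenders])
```

A hook built on this idea would exit with a non-zero status whenever the list of offenders is non-empty, which blocks the commit.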
.gitignore
- Have a standard for this
Style guide
- TODO: Adopt a style guide, e.g., https://google.github.io/styleguide/pyguide.html
- Use pre-commit hooks to enforce it

Pitfalls

@afermg said: The folder structure doesn't cover multiple iterations of a given analysis. To illustrate with a use case: say I perform an analysis and, after discussion, we conclude that we should try changing some parameters to see if the results improve. We then need to compare these different results. My proposed solution is to have (optional) datetimes for every time an analysis was run.
@afermg said: Using GitHub Issues for project management is far from ideal, but I understand it is the best we have. My biggest gripe is that they are not searchable from external search engines. I think automating the conversion of GitHub issues to SQLite files, coupled with the good old Datasette, would provide more accessibility to what is going on in all projects. As a long-term solution, I think writing project management in markdown documents (I've been pleased with HedgeDoc, a FOSS alternative to Docs for collaborative markdown editing) will be the way once RAG (retrieval-augmented generation) tools, such as greptile, become more mature.
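
The dated-runs proposal above could look like the following. This is only a sketch of one possible convention: the `runs/<timestamp>` layout, the `params.json` file, the function name `new_run_dir`, and the `n_components` parameter are all hypothetical and not part of the current guidelines.

```python
from datetime import datetime, timezone
from pathlib import Path
import json
import tempfile


def new_run_dir(output_dir, params, now=None):
    """Create output/runs/<UTC timestamp>/ and record the parameters used.

    `now` is expected to be a timezone-aware UTC datetime; it defaults to the
    current time and is only a parameter so the example is reproducible.
    """
    now = now or datetime.now(timezone.utc)
    run_dir = Path(output_dir) / "runs" / now.strftime("%Y%m%dT%H%M%SZ")
    # exist_ok=False: refuse to silently overwrite an earlier run.
    run_dir.mkdir(parents=True, exist_ok=False)
    (run_dir / "params.json").write_text(json.dumps(params, indent=2, sort_keys=True))
    return run_dir


# Two iterations of the same analysis with different (placeholder) parameters.
output_dir = Path(tempfile.mkdtemp()) / "output"
first = new_run_dir(
    output_dir, {"n_components": 10},
    now=datetime(2024, 3, 8, 15, 0, tzinfo=timezone.utc),
)
second = new_run_dir(
    output_dir, {"n_components": 20},
    now=datetime(2024, 3, 8, 16, 30, tzinfo=timezone.utc),
)

runs = sorted(p.name for p in (output_dir / "runs").iterdir())
print(runs)
```

Because the timestamp format sorts lexicographically, listing the `runs` folder shows iterations in chronological order, and each run's `params.json` records what changed between them.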

afermg · 2024-03-08T15:43:43Z

I just have a few comments on this, though I am aware that it is a complex challenge:

I'd like to have the option of using scripts that are convertible to notebooks using jupytext instead of scripts: This makes developing and editing pipelines much friendlier to git and plain-text toolchains.
Using GitHub Issues for project management is far from ideal, but I understand it is the best we have. My biggest gripe is that they are not searchable from external search engines. I think automating the conversion of GitHub issues to SQLite files, coupled with the good old Datasette, would provide more accessibility to what is going on in all projects. As a long-term solution, I think writing project management in markdown documents (I've been pleased with HedgeDoc, a FOSS alternative to Docs for collaborative markdown editing) will be the way once RAG (retrieval-augmented generation) tools, such as greptile, become more mature.
Just as an add-on, issues are flat, while project analyses are trees. Any solution that actually models the problem correctly requires the use of Directed Acyclic Graphs; there is a discussion of that here, for anyone interested. It may be alleviated by using the "track dependency" system for GitHub.
Regarding Python style, Black used to be the big shark until LSP came about; I'd suggest also considering ruff (or ruff-lsp) nowadays. It offers, IME, the best speed-to-quality ratio of all the LSPs that exist, and does the same as Black and more. This covers both formatting and linting.
The folder structure doesn't cover multiple iterations of a given analysis. I will exemplify this with a use-case: Let us say I perform an analysis and, after discussion, we conclude that we should try changing some parameters and see if it improves. We then need to compare these different results. My proposed solution to this is to have (optional) datetimes for every time an analysis was run.
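
The issues-to-SQLite idea above could start from something like the sketch below. It assumes the issues have already been downloaded as dictionaries with `number`, `title`, `state`, and `body` fields (fetching them is out of scope here). The three records, the table layout, and the helper names `load_issues` and `search` are made up for illustration.

```python
import sqlite3

# Hand-written placeholder records, not real issues from any repository.
issues = [
    {"number": 1, "title": "Does batch correction improve replicate correlation?",
     "state": "open", "body": "Placeholder description."},
    {"number": 2, "title": "Plate layout for pilot screen",
     "state": "closed", "body": "Discussion of batch effects across plates."},
    {"number": 3, "title": "Figure style",
     "state": "open", "body": "SVG export settings."},
]


def load_issues(conn, issues):
    """Create the issues table if needed and insert or update each record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS issues "
        "(number INTEGER PRIMARY KEY, title TEXT, state TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO issues (number, title, state, body) "
        "VALUES (:number, :title, :state, :body)",
        issues,
    )
    conn.commit()


def search(conn, term):
    """Return (number, title) for issues whose title or body contains term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT number, title FROM issues "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY number",
        (pattern, pattern),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
load_issues(conn, issues)
hits = search(conn, "batch")
print(hits)
```

Connecting to a file path instead of `:memory:` produces a database file that Datasette can serve for browsing and searching.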

I still have to digest some of the ideas and challenges, so I may come back to this issue down the line.

shntnu · 2024-03-08T21:24:44Z

> I just have a few comments on this, though I am aware that it is a complex challenge:

Thanks for sharing your thoughts, @afermg! I've updated my comment.

shntnu · 2024-06-24T14:03:42Z

@afermg I've now created https://github.com/broadinstitute/carpenter-singh-lab-standards/blob/main/01-lab-notebook.md, so let's continue the discussion in that repo.

shntnu added the Rule Discussing possible rule label Aug 14, 2019

gwaybio mentioned this issue Aug 21, 2019

Make lab notebook available #8

Open
