# Organizing Computational Science Projects


## Folder Structure


```
<project_name>
|
├── data/
|   ├── raw/
|   |   └── <session_name>
|   |       ├── <session_file>.nlx
|   |       ├── <session_file>.dat
|   |       ├── <session_file>.xlsx
|   |       └── <session_name>.tif
|   ├── preprocessed/
|   |   └── <session_name>
|   |       ├── <description>.npy
|   |       ├── <description>.h5
|   |       └── <description>.mat
|   |
|   ├── processed/
|   |   ├── <session_name1>.nix
|   |   └── <session_name2>.nix
|   |
|   └── final/
|       └── <dataset_name>.parquet
|
├── reports/
|   └── <report-group>/
|       ├── <report>.png
|       └── <report>.pdf
|
├── logs/
|   └── <log-group>/
|       └── <log>.txt
|
├── scripts/
|   ├── <script>.py
|   ├── <script>.r
|   └── <script>.m
|
├── scratch/
|   ├── <researcher1>
|   |   └── <notebook>.ipynb
|   └── <researcher2>
|       └── <notebook>.ipynb
|
├── notebooks/
|   └── <notebook>.ipynb
|
├── dodo.py
├── Snakefile
├── Makefile
|
├── src/
|   ├── <my_package>/
|   |   ├── __init__.py
|   |   └── <module>.py
|   |
|   └── <module>.py
|
├── pyproject.toml
├── environment.yml
├── Dockerfile
├── compose.yml
├── Makefile
├── .github/
|   └── workflows/
|       └── <workflow>.yml
|
├── examples/
|   ├── <example1>.ipynb
|   └── <example2>.ipynb
|    
├── docs/
|   ├── <doc-section>.md
|   └── <doc-section>.rst
|
├── README.md
├── LICENSE.txt
├── CONTRIBUTORS.txt
├── CODE_OF_CONDUCT.txt
└── datacite.xml

  
```
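
The folder layout above can be created by hand, but a short script makes it repeatable. The following is a minimal sketch using only Python's standard library; the function name `create_skeleton` and the exact folder list are assumptions for illustration, not part of any tool.

```python
from pathlib import Path

# Folder names taken from the tree above; adjust to fit your project.
PROJECT_FOLDERS = [
    "data/raw",
    "data/preprocessed",
    "data/processed",
    "data/final",
    "reports",
    "logs",
    "scripts",
    "scratch",
    "notebooks",
    "src",
    "examples",
    "docs",
]


def create_skeleton(project_root):
    """Create the empty project folders under project_root and return their paths."""
    root = Path(project_root)
    created = []
    for folder in PROJECT_FOLDERS:
        path = root / folder
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_skeleton("my_project")` builds the folders; running it again on an existing project leaves existing files untouched.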

### Data Files


#### Raw

Raw data is the original data, exactly as it was collected. It doesn't have to be pretty, just complete.
Experimental raw data is organized by session: what data was collected, and when.

```
|
├── data/
|   ├── raw/
|   |   └── <session_name>
|   |       ├── <session_file>.nlx  
|   |       ├── <session_file>.dat
|   |       ├── <session_file>.xlsx
|   |       └── <session_name>.tif
|   |
```


#### Preprocessed

Data is complex, and extracting variables out of raw data can be some work.  The **"preprocessed"** section of data pipelines is where intermediate files can go; they tend to be focused on individual variables of each session and byproducts of third-party tools, stored in a way that makes the data easy to read in for later processing steps.  Don't worry if the folder organization here is fairly messy--data extraction is a messy business!

```
|   |
|   ├── preprocessed/
|   |   └── <session_name>
|   |       ├── <description>.npy
|   |       ├── <description>.h5
|   |       └── <description>.mat
|   |
```



#### Processed

How do all these different variables relate to each other?  **"Processed"** data includes the data's schema and is meant to be complete: as much of the data as possible is accessed in the same way.  Note that the data is still in a "records" format, organized by collection date--this makes it easy to add new processed data files without having to touch the old ones.

```
|   |
|   ├── processed/
|   |   ├── <session_name1>.nix
|   |   └── <session_name2>.nix
|   |
```
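
Because each session gets its own processed file, a pipeline can work out which sessions still need processing by comparing folder contents. Here is a minimal sketch of that idea, assuming the `data/raw/<session_name>/` and `data/processed/<session_name>.nix` layout shown above; the function name `sessions_to_process` is hypothetical.

```python
from pathlib import Path


def sessions_to_process(data_dir):
    """Return the names of raw sessions that have no processed file yet."""
    data_dir = Path(data_dir)
    # Each raw session is a subfolder of data/raw/.
    raw_sessions = {p.name for p in (data_dir / "raw").iterdir() if p.is_dir()}
    # Each processed session is a single file: data/processed/<session_name>.nix
    done = {p.stem for p in (data_dir / "processed").glob("*.nix")}
    return sorted(raw_sessions - done)
```

Only the sessions in the returned list need to be run; existing processed files are never rewritten.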



#### Final 

What data structure makes the data as easy to analyze as possible?  These files contain data grouped in ways that make them easy to analyze; multiple sessions are combined together, only specific variables are extracted, data and metadata may be duplicated in the files, and variables may appear in multiple files.  The goal here is to have files that someone can just read into R, Pandas, etc, and get started with statistics, data visualization, and machine learning!  

This folder can also get complex, and that's okay--data analysis is complex, and this folder is a representation of that data analysis.  These files tend to be much smaller in size than the previous steps.

```
|   |
|   └── final/
|       └── <dataset_name>.parquet
|
```
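
To illustrate what "combined sessions with duplicated metadata" can look like, here is a small sketch in plain Python. The field names (`subject`, `trial`, `rt`) and the example values are made up for illustration; real projects will have their own variables.

```python
def flatten_sessions(sessions):
    """Turn per-session records into one flat list of rows, one row per trial."""
    rows = []
    for session in sessions:
        for trial in session["trials"]:
            # Session-level metadata is copied onto every row (deliberate duplication).
            rows.append({
                "session": session["name"],
                "subject": session["subject"],
                **trial,
            })
    return rows


# Made-up example data, for illustration only.
sessions = [
    {"name": "s1", "subject": "A", "trials": [{"trial": 1, "rt": 0.42}, {"trial": 2, "rt": 0.39}]},
    {"name": "s2", "subject": "B", "trials": [{"trial": 1, "rt": 0.51}]},
]
rows = flatten_sessions(sessions)
```

A flat list of rows like this can be handed to a table library (for example `pandas.DataFrame(rows)`) and saved in a columnar format such as Parquet.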



### Code Files: Scripts

```
|
├── scripts/
|   ├── <script>.py
|   ├── <script>.r
|   └── <script>.m
|
├── scratch/
|   ├── <researcher1>
|   |   └── <notebook>.ipynb
|   └── <researcher2>
|       └── <notebook>.ipynb
|
├── notebooks/
|   └── <notebook>.ipynb
|
├── dodo.py
├── Snakefile
├── Makefile
|
```

  - **`scripts/`** and **`notebooks/`**: Scripts and notebooks that are meant to be run directly as a protocol belong together; their steps tend to be referenced in the methods section of a research paper.  Sometimes people will separate them by programming language (e.g. `scripts_python/`), but it's usually not necessary.  

  - **`scratch/`**: Just playing around, and don't want to worry about code quality or maintenance?  Keep a `scratch` folder (alternatively, sometimes called `sandbox` or `playground`) for that!  If working with multiple colleagues, each researcher can have their own subfolder, as in the tree above.

  - **`dodo.py`**, **`Snakefile`**, **`Makefile`**: What order are these scripts supposed to be run in?   What inputs and outputs are needed from each file?  **Workflow management tools** like [doit](https://pydoit.org/), [Snakemake](https://snakemake.readthedocs.io/en/stable/), and **Make** are meant for directly describing these steps, and can run the full processing and analysis pipeline in the right order.
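
As a rough illustration of how a workflow file describes steps, here is a sketch of a `dodo.py` for doit. The script and data file names are hypothetical; the `actions`, `file_dep`, and `targets` keys are the doit task fields for the command to run, its input files, and its output files.

```python
# dodo.py: a sketch; the script and data file names below are hypothetical.

def task_preprocess():
    """Extract variables from the raw data."""
    return {
        "actions": ["python scripts/preprocess.py"],
        "file_dep": ["scripts/preprocess.py"],
        "targets": ["data/preprocessed/session1/spikes.npy"],
    }


def task_process():
    """Combine the preprocessed variables into one file per session."""
    return {
        "actions": ["python scripts/process.py"],
        # Listing the previous task's target as an input links the two steps.
        "file_dep": ["scripts/process.py", "data/preprocessed/session1/spikes.npy"],
        "targets": ["data/processed/session1.nix"],
    }
```

Running `doit` in the project root should then run both steps, with the second depending on the output of the first.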



### Code Files: Libraries (Functions, Classes, Constants, etc)


```
|
├── src/
|   ├── <my_package>/
|   |   ├── __init__.py
|   |   └── <module>.py
|   |
|   └── <module>.py
|
├── pyproject.toml
|
```

This is where the custom project code that your scripts reference lives.  It comes complete with an installer file ([**`pyproject.toml`**](https://packaging.python.org/en/latest/tutorials/packaging-projects/) shown here, for Python projects), which installs the package into a location where your scripts can easily import it.

##### Pyproject.toml Minimal Example

```toml
[project]
name = "project-name"
# Version string; the leading "v" is dropped to match the normalized form.
version = "0.0.1"
requires-python = ">=3.10"
dependencies = ["matplotlib", "numpy>=1.26"]
```

| **`Command`** | **`Description`** |
| :-- | :-- |
| `pip install -e .` | Install the package and its dependencies into the current Python environment in editable mode, so changes to the source files take effect without reinstalling. |
| `pip uninstall project-name` | Remove this package from the current Python environment (use the `name` from `pyproject.toml`).  Note: won't uninstall the dependencies. |


###### Additional Fields

| in the **`[project]`** section |  |
| :-- | :-- |
| `description = "A short description of the project's purpose."` |  A short description, appears in `pip show`. |
| `authors = [{name="Nicholas DG", email="dg@email.com"}]` |  The authors of the project. |
| `maintainers = [{name="Nicholas DG", email="dg@email.com"}]` |  The people responsible for keeping the project going. |
| `readme = "README.md"` | Where to find the readme file. |
| `license = "MIT"` | What license the project uses, as a license identifier.  Note: older build backends may only accept the table form below. |
| `license = {file = "LICENSE.txt"}` | What license the project uses, if it's found in a file. |


| **Build Systems** |   |
| :-- | :-- |
| <code>[build-system]<br>requires = ["setuptools >= 61.0"]<br>build-backend = "setuptools.build_meta"</code> | Use `setuptools` (the default). |
| <code>[build-system]<br>requires = ["hatchling"] <br>build-backend = "hatchling.build"</code> | Use `hatchling`, the build backend of the Hatch project and a modern alternative. |

There is a lot more one can put into the file--more fields and explanations of the `pyproject.toml` format can be found at the official guide:  https://packaging.python.org/en/latest/guides/writing-pyproject-toml/


#### Aside: What if I don't want an installer file?

That's okay, but you'll need to tell your scripts how to find your library code somehow.  Most scripting languages offer a way to do this inside your scripts by modifying their import search path, so they know what folders to search in.  Here's the relevant code for Python:

| **Python Code** | **Description** |
| :-- | :-- |
| <code>import sys<br>sys.path</code> | View Python's import search path |
| <code>import sys<br>sys.path.append('../src')</code> | Add the `src` folder to Python's import search path |
| <code>import os<br>os.environ['PATH']</code> | View the operating system's search path for executables |
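
A relative path such as `'../src'` is resolved against the folder the script is launched from, so it can break when the script is run from somewhere else. One possible workaround, sketched below, is to build an absolute path first; the helper name `add_src_to_path` is an assumption for illustration.

```python
import sys
from pathlib import Path


def add_src_to_path(project_root):
    """Put <project_root>/src at the front of Python's import search path."""
    # resolve() turns the path into an absolute one, independent of later cwd changes.
    src = str(Path(project_root).resolve() / "src")
    if src not in sys.path:
        sys.path.insert(0, src)
    return src
```

From a file in `scripts/`, this could be called as `add_src_to_path(Path(__file__).parent.parent)`, so the project root is found relative to the script itself.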



### Computational Environment Setup Files

```
|
├── environment.yml
├── Dockerfile
├── compose.yml
├── project.sif
├── Makefile
├── .github/
|   └── workflows/
|       └── <workflow>.yml
|
```

These files are commonly placed in the root directory, because they are used by software that helps set up the computational environment (installing libraries, setting up the operating system, downloading data, configuring environment variables, etc.) for the entire project.  

  - **environment.yml**: used by the **Conda, Mamba, and Micromamba** package managers:
     - Cheat sheet: https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf
     - Download sources:
        - **Miniforge**, where you can download conda without needing a paid license: https://conda-forge.org/miniforge/
        - **Mamba**: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html
        - **Micromamba**: https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html
  
  - **Dockerfile** and **compose.yml** are used by [Docker](https://docs.docker.com/get-started/), and **.sif** files are used by Singularity and [Apptainer](https://apptainer.org/docs/user/latest/introduction.html). These tools can additionally set up your project in its own sandboxed operating system (called a "container").
      
  - **Makefile** is used by **Make** (for example, GNU Make), and can run any kind of recipe you give it; sometimes just installing things, but sometimes also running a script pipeline.  It's very generic and flexible that way.


#### Environment.yml Reference

##### Minimal Example:

```yaml
# environment.yml
dependencies:  [python=3.11]
```

##### Useful `conda` terminal commands:

| **Command** | **Description** |
| :-- | :-- |
| **`conda env create -f environment.yml`** |   Create an environment from a file. |
| **`conda create -n <name>`** | Create an environment without a file. |
| **`conda env remove --name <name>`** | Delete an environment. |
| **`conda env export > environment-lock.yml`** | Have conda tell you what it installed into the environment. |


##### Optional Fields :

| **Field** | **Example Values** | **Description** |
| :-- | :-- | :-- |
| `channels: ` | `[defaults, conda-forge]` | Where `conda` should look to download dependencies |
| `name: ` | `my-env` | A name to use to activate the environment, without knowing the path:  `conda activate my-env`  |
| `prefix: ` | `C:\Users\nickdg\miniconda3` | An absolute path saying where on the computer to install the environment. Note: not great for cross-computer usage.  It's better to specify the path when building the environment with `conda env create -f environment.yml -p ./env`, so the path is resolved on the computer doing the install. |

#### Operating System-Level Package Managers

| **Operating System** | **Package Manager** | **Search Command** | **Install Command** |
| :-- | :-- | :-- | :-- |
| Windows | **WinGet** | `winget search <name>` | `winget install --id=<Id>` |
| Windows | **Chocolatey** | `choco search <name>` | `choco install <name>` |
| Mac | **Homebrew** | `brew search <name>` | `brew install <name>` |
| Linux | **APT** | `apt search <name>` | `apt-get install <name>` |
| Linux | **Yum** | `yum search <name>` | `yum install <name>` |

#### Virtual Machines: Vagrant

| **Command** | **Description** |
| :-- | :-- |
| `vagrant init generic/ubuntu2204` | Make a Vagrantfile that will specify Ubuntu 22.04 as the virtual machine. |
| `vagrant up` | Create the virtual machine. |
| `vagrant ssh` | Log in to your virtual machine on the terminal. |

### Documentation

```
|
├── examples/
|   ├── <example1>.ipynb
|   └── <example2>.ipynb
|    
├── docs/
|   ├── <doc-section>.md
|   └── <doc-section>.rst
|
├── README.md
├── LICENSE.txt
├── datacite.xml
|
```

These files help others understand how to use your project.  Written explanations, interactive examples, license information, etc., all help tell people what your project does and how they can use it.

##### Readme File: Essential Parts

A useful reference: https://www.makeareadme.com/

| **Section** | What goes here |
| :-- | :-- |
| `# <project name>` | The title.  Put the name of the project there. |
| `## Installation` | How to install the project.  Best to include copy-pastable code in [code blocks](https://www.markdownguide.org/basic-syntax/#code-blocks-1) |
| `## Usage` | The main ways the project is run, and what to expect when it works properly.  Include [code blocks](https://www.markdownguide.org/basic-syntax/#code-blocks-1) here, too. |

### Collaboration

```
| 
├── CONTRIBUTORS.txt
└── CODE_OF_CONDUCT.txt
```

These files explain to other collaborators how to work on the project; they're meant for your internal team.  