# Practical Research Computing Workshop
Goal: Improve reproducibility, collaboration, and computational confidence


# 1.0 Command-Line Literacy (Why the Shell Matters)

- Most research computing workflows eventually leave the IDE. Comfort with the shell is a force multiplier.

## Core Concepts

- Filesystems are hierarchical, not project-aware

- The shell is a scripting language, not just a launcher

- Most research infrastructure assumes CLI fluency


## Essential Commands

### Prints the current working directory
`pwd`

### Opens the current directory in File Explorer (Windows only)

`explorer .`

### Lists the files in the current directory
`ls` 

### Lists files with details such as size and modification date (bash; in PowerShell, plain `ls` already shows these)

`ls -l`

### Changes the current directory

`cd /path/to/directory`

### Goes back one directory level

`cd ..`

### Creates a new directory

`mkdir new_directory`

### Creates multiple directories at once (bash syntax; in PowerShell, separate the names with commas)

`mkdir data_raw data_processed outputs`

### Removes an empty directory

`rmdir old_folder`

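### Creates a new, empty file (PowerShell; use `touch newfile.txt` in bash)

`New-Item newfile.txt`

### Moves a file (and renames it to destination.txt)

`mv newfile.txt /new/destination/folder/destination.txt`

### Removes a file (permanently; it does not go to the recycle bin)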

`rm newfile.txt`
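
Once a workflow outgrows interactive typing, the same filesystem operations can be scripted. The sketch below mirrors the commands above using Python's standard-library `pathlib` module. It runs inside a temporary directory so that no real project files are touched; the function name `demo_filesystem_ops` is hypothetical and only serves the illustration.

```python
import tempfile
from pathlib import Path


def demo_filesystem_ops(root: Path) -> list[str]:
    """Mirror the shell commands above inside `root` and return what is left."""
    # mkdir data_raw data_processed outputs
    for name in ("data_raw", "data_processed", "outputs"):
        (root / name).mkdir()

    # Create an empty file (New-Item in PowerShell, touch in bash)
    new_file = root / "newfile.txt"
    new_file.touch()

    # mv newfile.txt outputs/destination.txt (rename returns the new path)
    moved = new_file.rename(root / "outputs" / "destination.txt")

    # rmdir: like the shell command, Path.rmdir() only removes empty directories
    (root / "data_processed").rmdir()

    # rm outputs/destination.txt
    moved.unlink()

    # ls: list what remains, sorted for a stable order
    return sorted(p.name for p in root.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    remaining = demo_filesystem_ops(Path(tmp))
    print(remaining)
```

Scripting these steps means the directory setup itself becomes part of the recorded workflow rather than something typed by hand.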

---

# 2.0 Git Fundamentals

## Git, Source Control, and Project History

- Git is a distributed version control system that records the history of a project as a sequence of commits. Each commit is a snapshot of the project state at a point in time.

**It's how we iteratively save changes to our code.**

- The **main branch** represents the authoritative history of the project. It should always reflect a coherent, working state and typically corresponds to results you trust, share, or publish.

- A **branch** is a separate line of development created from an existing commit. Branches allow you to experiment, refactor, or test ideas without modifying main. This enables safe, parallel development while preserving full history and provenance.

- When work on a branch is complete, it can be merged back into main, explicitly incorporating its changes into the official project record.

## In source control terms:

- Commits track change

- Branches isolate intent

- Merges formalize decisions


# Visual of a working repo
```text
(main)
  |
  A --- B --- C -------------------- D
              \                    /
               E --- F --- G ------
            (feature_branch)

A → B → C → D are commits on main

feature_branch is created at commit C

E → F → G are commits made during experimentation

The merge integrates feature_branch back into main at D
```
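
One way to read the diagram is as a graph in which every commit points back to its parent commit(s), and a branch is just a name attached to one commit. The following is a toy model of that idea in Python, not a description of how Git stores its objects; the `Commit` class and `history` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy stand-in for a Git commit: a label plus links to its parents."""
    label: str
    parents: tuple["Commit", ...] = ()


def history(tip: Commit) -> list[str]:
    """Return the labels of every commit reachable from `tip`, sorted."""
    seen: set[str] = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit.label in seen:
            continue
        seen.add(commit.label)
        stack.extend(commit.parents)
    return sorted(seen)


# main: A -> B -> C
a = Commit("A")
b = Commit("B", (a,))
c = Commit("C", (b,))

# feature_branch is created at C: E -> F -> G
e = Commit("E", (c,))
f = Commit("F", (e,))
g = Commit("G", (f,))

# D is the merge commit: it has two parents, C (main) and G (feature_branch)
d = Commit("D", (c, g))

# A branch is a movable name pointing at a single commit
branches = {"main": d, "feature_branch": g}

print(history(branches["main"]))
print(history(branches["feature_branch"]))
```

After the merge, everything from the feature branch is reachable from `main`, while the feature branch itself still does not contain the merge commit D.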

---

## Git Basic Controls

## Create your own GitHub Repo

1. To make a new repo, go to www.github.com and create a new repository:

![make repo](images/new_repo.png)

2. Set the permissions and settings

![settings](images/setting_up_repo.png)

3. Clone the repo: `git clone https://github.com/cscarpon/Science_Workshop.git`

![clone](images/git_clone.png)

4. Change to the newly created directory (dir)

   `cd Science_Workshop`

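5. Make a branch

   `git checkout -b 2026-02-11-InitialCommit`

6. Make a change (for example, add a file named `test.py`)

7. Stage the change: `git add -A`

8. Commit it: `git commit -m "Add the test.py file"`

9. Push the new branch and set its upstream: `git push --set-upstream origin 2026-02-11-InitialCommit`

10. On the GitHub repo, open a pull request (PR) for the branch and review the changes

11. Merge.

---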

## Other handy commands

### Initialize a new Git repository in the current directory
`git init`

Alternatively, create a new repository on the GitHub website and clone it.

### Check the current status of the repository (Shows tracked, untracked, and modified files)
`git status`

### Add a file to the staging area (Staging marks files to be included in the next commit)
`git add analysis.R`

### Add all modified and new files to the staging area
`git add -A`

### Create a commit (A commit records a snapshot of the project state)
`git commit -m "Add initial analysis script"`

### Show the commit history
`git log`

### Create a new branch for experimental work without switching to it
`git branch feature_experiment`

### Switch to the new branch
`git checkout feature_experiment`

### Shortcut: create and switch to a branch in one command
`git checkout -b yyyy-mm-dd-feature_experiment`

For example: `git checkout -b 2026-01-16-yml`

### Switch back to the main branch
`git checkout main`

### Merge changes from the feature branch into main
`git merge feature_experiment`

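### Connect the local repository to a remote (only needed after `git init`; `git clone` sets `origin` automatically)

`git remote add origin https://github.com/cscarpon/Science_Workshop.git`

### Once the remote is set, push your changes (the first push of a branch needs `git push -u origin <branch-name>`)
`git push`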

### Get the latest version of main from the remote repository
`git checkout main`  

`git pull`

---


# 3.0 Reproducible Research Foundations
## Reproducibility as a Workflow Property

Reproducibility is not achieved at publication time; it is a consequence of how a project is structured and executed. A reproducible project allows someone else, or your future self, to regenerate results using the same code, data, and computational environment.

## This section focuses on:

- Structuring projects so intent is clear

- Separating raw data from derived outputs

- Capturing the computational context in which results were produced

Reproducible workflows reduce ambiguity, prevent accidental data corruption, and make research easier to resume after interruptions.
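
Capturing the computational context can be as simple as writing a small record next to each set of outputs. The sketch below uses only the Python standard library; the function name `capture_context`, the field names, and the choice of packages to record are assumptions made for illustration rather than an established convention.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def capture_context(packages: list[str]) -> dict:
    """Collect a small, JSON-serialisable description of the current run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Record the versions of the packages the analysis depends on
context = capture_context(["numpy", "pandas"])
print(json.dumps(context, indent=2))
```

Saving this record alongside the outputs (for example as a JSON file) gives a later reader a starting point for rebuilding the environment that produced them.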

---

# 3.1 Environment Management (Conda)
## Isolating Computational Context

An environment defines the exact versions of software and libraries used in a project. Without explicit environment management, analyses become fragile and difficult to reproduce across machines or over time.

## Conda environments:

- Isolate dependencies from the system and other projects

- Allow exact recreation of software stacks

- Capture part of the research method, not just implementation details

The environment specification file should be treated as part of the project’s methodology and versioned alongside the code.

---

### Create a new Conda environment named "research_env" (isolated from the base environment)
`conda create -n research_env python=3.11`

### Activate the Conda environment
`conda activate research_env`

### Install common scientific Python packages
`conda install numpy pandas matplotlib`

### Export the full environment specification (records exact package versions and sources)
`conda env export > environment.yml`

### Recreate the environment later or on another machine
`conda env create -f environment.yml`

### Remove an environment when it is no longer needed
`conda remove -n research_env --all`
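
A script can also check that it is running inside the intended environment before doing any work. The sketch below assumes that `conda activate` sets the `CONDA_DEFAULT_ENV` environment variable, which should be verified for the Conda installation in use; the function names are hypothetical.

```python
import os
import sys


def active_conda_env(environ=os.environ):
    """Return the name of the active Conda environment, or None if unset."""
    # Assumption: `conda activate` exports CONDA_DEFAULT_ENV
    return environ.get("CONDA_DEFAULT_ENV")


def check_environment(expected_env, expected_python,
                      environ=os.environ, version_info=sys.version_info):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    env = active_conda_env(environ)
    if env != expected_env:
        problems.append(f"expected Conda env {expected_env!r}, found {env!r}")
    if tuple(version_info[:2]) != tuple(expected_python):
        found = ".".join(str(part) for part in version_info[:2])
        wanted = ".".join(str(part) for part in expected_python)
        problems.append(f"expected Python {wanted}, found {found}")
    return problems


# Warn (rather than crash) if the script is run outside research_env / Python 3.11
for problem in check_environment("research_env", (3, 11)):
    print("WARNING:", problem)
```

Passing `environ` and `version_info` as parameters keeps the check easy to test without changing the real environment.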

---


# 3.2 R Projects and Research-Oriented Structure
## R Projects as Analytical Units

- An R project defines a self-contained analytical workspace. It establishes a fixed project root, making all file paths relative, portable, and reproducible. This avoids hard-coded absolute paths and reduces ambiguity when moving between machines or collaborators.

- An R project is not just an IDE convenience. It is a logical boundary for a research task, experiment, or manuscript. Treating each project as an independent unit improves clarity, reproducibility, and version control.

- R projects work naturally with Git, renv, and standardized directory layouts.

---

# Creating an R Project

## In RStudio:
`File → New Project → New Directory → New Project`

## Project Structure and source files

```text
project_name/
└── project_name.Rproj
```

## Recommended Project Structure
```text
project_name/
├── data/
│   ├── raw/
│   └── processed/
├── scripts/
├── outputs/
│   ├── figures/
│   └── tables/
├── tests/
│   └── firsttest.R
├── renv/
├── renv.lock
├── README.md
└── project_name.Rproj
```

## Directory Roles

`/data/raw`

- Original input data

- Read-only

- Never overwritten or modified

`/data/processed`

- Cleaned or derived datasets

- Fully reproducible from scripts

- Can be regenerated at any time

`/scripts`

- All analysis and processing code

- No manual data manipulation

- Scripts should be executable end-to-end

`/outputs`

- Figures, tables, model results

- Generated programmatically

- Safe to delete and recreate

`renv/` and `renv.lock`

- Captures the R package environment

- Part of the research method

`README.md`

- Explains purpose, structure, and workflow

- Entry point for collaborators and reviewers
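
The directory layout itself is language-agnostic, so it can be created by a short script instead of by hand. The following Python sketch builds the folders described above; `scaffold_project` and `LAYOUT` are hypothetical names, and `renv/`, `renv.lock`, and the `.Rproj` file are deliberately left out because renv and RStudio create them.

```python
import tempfile
from pathlib import Path

# Folders from the recommended structure (tool-managed entries are omitted)
LAYOUT = (
    "data/raw",
    "data/processed",
    "scripts",
    "outputs/figures",
    "outputs/tables",
    "tests",
)


def scaffold_project(root: Path) -> list[str]:
    """Create the folder skeleton under `root` and return the directories found."""
    root = Path(root)
    for rel in LAYOUT:
        # parents=True creates data/ and outputs/; exist_ok=True makes reruns safe
        (root / rel).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n")
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_project(Path(tmp) / "project_name")
    print("\n".join(created))
```

Because existing folders and an existing README are left alone, the function can be rerun on a project without overwriting anything.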

---


# 3.3 Environment Management with renv in R

## Initialize a reproducible R environment
`renv::init()`

## Install a package into the project library
`renv::install("dplyr")`

## Record package versions used by the project
`renv::snapshot()`

## Restore the exact package environment later
`renv::restore()`

---

# 4.0 Testing and Validation 
## Making Assumptions Explicit

Testing in research code is not about achieving full test coverage. It is about making assumptions explicit and checking them automatically.

## Tests help ensure that:

- Inputs conform to expected ranges or structures

- Outputs remain scientifically plausible

- Changes to code do not silently alter results

Lightweight testing reduces the risk of unnoticed errors and increases confidence in analytical pipelines, especially as projects grow or are revisited.

Tests usually come in tiers: checking that the Conda environment is active, checking that simple Python logic works, and then testing functions from the smallest to the largest. The largest tests are often a full run of the pipeline on sample data.

---

# 4.1 Testing with pytest

```python
# File: test_metrics.py
# pytest collects files named test_*.py and, inside them, functions named test_*.
#
# Install pytest into the active Python environment (run in a terminal):
#     pip install pytest
# Run all tests in the project:
#     pytest
# Run tests with verbose output:
#     pytest -v


def test_positive_values():
    values = [1, 2, 3]
    assert all(v > 0 for v in values)


# Example of validating a scientific assumption
def test_no_negative_heights():
    heights = [10.2, 5.4, 3.1]
    assert min(heights) >= 0
```
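
The tiers described in section 4.0 (environment, simple logic, then small to large functions) could be laid out in one test file as sketched below. `mean_height` and `summarise` are hypothetical analysis functions invented for this example, and the minimum Python version is an assumption to adjust per project.

```python
import sys


def mean_height(heights):
    """Hypothetical analysis function: mean of a non-empty list of heights."""
    if not heights:
        raise ValueError("heights must not be empty")
    if min(heights) < 0:
        raise ValueError("heights must be non-negative")
    return sum(heights) / len(heights)


def summarise(plots):
    """Hypothetical pipeline step: mean height per plot."""
    return {name: mean_height(values) for name, values in plots.items()}


# Tier 1: the environment is the one we expect
def test_python_version():
    assert sys.version_info >= (3, 9)


# Tier 2: a small function gives a known answer on a known input
def test_mean_height_known_value():
    assert mean_height([2.0, 4.0]) == 3.0


# Tier 2: invalid input is rejected instead of silently accepted
def test_mean_height_rejects_negative():
    try:
        mean_height([1.0, -0.5])
    except ValueError:
        return
    raise AssertionError("negative height was accepted")


# Tier 3: a small end-to-end run on sample data
def test_summarise_end_to_end():
    result = summarise({"plot_a": [1.0, 3.0], "plot_b": [10.0]})
    assert result == {"plot_a": 2.0, "plot_b": 10.0}
```

Ordering the tests from cheapest to most expensive means a broken environment is reported before time is spent on a full pipeline run.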

---

# 4.2 Testing and Validation in R (testthat)
```r
library(testthat)

test_that("addition works correctly", {
  result <- 2 + 3
  expect_equal(result, 5)
})
```

---

# 5.0 Loop-Based Scripts vs Structured Code
## Why Loops Limit Reproducibility

Many research scripts rely on hard-coded loops over files and folders. While this works initially, it tightly couples the analysis logic to a specific directory layout, file naming scheme, and execution context.

This creates several problems:

- The code only works for one exact folder structure

- Re-running the analysis requires manual edits

- Sharing the code requires others to mirror your filesystem

- Provenance is implicit rather than explicit

In practice, the loop becomes the method, rather than the computation itself.

---

## A typical loop-based script

The loop below runs one specific set of commands against one specific set of paths, which makes it hard to apply to other use cases.

```python
import os

data_dir = "data/raw"
out_dir = "outputs"

for fname in os.listdir(data_dir):
    if fname.endswith(".csv"):
        in_file = os.path.join(data_dir, fname)
        out_file = os.path.join(out_dir, fname.replace(".csv", "_processed.csv"))

        with open(in_file) as f:
            data = f.read()

        processed = data.upper()

        with open(out_file, "w") as f:
            f.write(processed)
```

## Issues:

- Paths are embedded in the logic

- Cannot easily reuse on another dataset

- No record of configuration or intent

- Difficult to test or extend

---

## Creating a class to hold your paths and processing steps

- `__init__` runs whenever a new object is created from the class and sets up its starting state
- `self.` refers to the object itself; `self.input_dir` and `self.output_dir` are attributes stored on it (comparable to slots in R's S4 objects)
- Functions defined with `def` inside the class are methods: they are called on the object and can use everything stored on `self`

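```python
from pathlib import Path


class DataProcessor:
    """Apply one processing step to every CSV file in a directory."""

    def __init__(self, input_dir, output_dir):
        # Store the configuration on the object instead of hard-coding it
        self.input_dir = Path(input_dir)
        self.output_dir = Path(output_dir)
        # Create the output directory (and any missing parents) up front
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def process_file(self, path):
        """Return the processed contents of a single file."""
        data = path.read_text()
        return data.upper()

    def run(self):
        """Process every CSV in input_dir and write the results to output_dir."""
        # sorted() makes the processing order the same on every platform
        for path in sorted(self.input_dir.glob("*.csv")):
            result = self.process_file(path)
            out_path = self.output_dir / f"{path.stem}_processed.csv"
            out_path.write_text(result)
```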

## How to run the `DataProcessor` class

```python
# Execution is now explicit and configurable
processor = DataProcessor(
    input_dir="data/raw",
    output_dir="outputs"
)

processor.run()
```

## Advantages:

- Directory structure is configurable, not hard-coded

- Logic is isolated and reusable

- Inputs and outputs are explicit

- Easier to test, document, and version
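
To make the "easier to test" point concrete, the sketch below runs the processor against a throwaway directory and checks both the output and the untouched raw file. It repeats a compact copy of `DataProcessor` so that it runs on its own; the helper name `check_run_writes_processed_copy` and the sample CSV contents are hypothetical.

```python
import tempfile
from pathlib import Path


class DataProcessor:
    """Compact copy of the class above so that this example is self-contained."""

    def __init__(self, input_dir, output_dir):
        self.input_dir = Path(input_dir)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def process_file(self, path):
        return path.read_text().upper()

    def run(self):
        for path in self.input_dir.glob("*.csv"):
            out_path = self.output_dir / f"{path.stem}_processed.csv"
            out_path.write_text(self.process_file(path))


def check_run_writes_processed_copy(tmp_dir):
    """The pipeline should upper-case each CSV and leave the raw file untouched."""
    raw = tmp_dir / "raw"
    raw.mkdir()
    (raw / "sample.csv").write_text("id,name\n1,oak\n")

    DataProcessor(raw, tmp_dir / "out").run()

    assert (tmp_dir / "out" / "sample_processed.csv").read_text() == "ID,NAME\n1,OAK\n"
    assert (raw / "sample.csv").read_text() == "id,name\n1,oak\n"


# Under pytest, a test function could receive the built-in tmp_path fixture instead;
# here a temporary directory is passed by hand so the file also runs as a script.
with tempfile.TemporaryDirectory() as tmp:
    check_run_writes_processed_copy(Path(tmp))
    print("ok")
```

The loop-based script cannot be tested this way without editing its hard-coded paths, which is the practical difference the class makes.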

---

# 6.0 Documentation and Knowledge Transfer
## Writing for Your Future Self

- Documentation is a form of communication between collaborators separated by time. In most projects, your future self is the primary audience.

## Clear documentation:

- Explains what the project does and why

- Describes how to run analyses end-to-end

- Records assumptions, decisions, and limitations

A well-maintained README and clean project structure dramatically reduce the cost of onboarding new collaborators or returning to old work.

---

### Create a README file for the project (PowerShell; use `touch README.md` in bash)
`New-Item README.md`

A minimal README can follow the template below:

## Project Title

### Description
Brief explanation of the research question and approach.

### Requirements
How to install dependencies.

### How to Run
Exact commands needed to reproduce results.

### Outputs
Description of generated files and figures.

### Create a scripts directory for executable analysis code
`mkdir scripts`

`scripts/`

- `run_analysis.py`: executable pipeline

- `helpers.py`: supporting functions
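
An executable pipeline script is easiest to document when its inputs are command-line arguments, because the README can then state the exact command to run. The following is a minimal sketch of what `run_analysis.py` might look like using the standard-library `argparse` module; the option names and defaults are assumptions, and the actual processing step is left as a comment.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the analysis end to end.")
    parser.add_argument("--input-dir", type=Path, default=Path("data/raw"),
                        help="directory containing the raw CSV files")
    parser.add_argument("--output-dir", type=Path, default=Path("outputs"),
                        help="directory where results are written")
    return parser


def main(argv=None) -> int:
    """Parse arguments, validate paths, and return a process exit code."""
    args = build_parser().parse_args(argv)
    if not args.input_dir.is_dir():
        print(f"Input directory not found: {args.input_dir}")
        return 1
    args.output_dir.mkdir(parents=True, exist_ok=True)
    n_files = sum(1 for _ in args.input_dir.glob("*.csv"))
    # The real processing step would be called here (e.g. functions from helpers.py)
    print(f"Found {n_files} CSV file(s); outputs go to {args.output_dir}")
    return 0


# In a real script, finish with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
# Here main() is called with explicit arguments on a throwaway directory instead.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw"
    raw.mkdir(parents=True)
    (raw / "plot_01.csv").write_text("height\n10.2\n")
    exit_code = main(["--input-dir", str(raw),
                      "--output-dir", str(Path(tmp) / "outputs")])
```

With a script like this, the "How to Run" section of the README could reduce to a single line such as `python scripts/run_analysis.py --input-dir data/raw --output-dir outputs`.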