# Practical Research Computing Workshop
Goal: Improve reproducibility, collaboration, and computational confidence


# 1. Command-Line Literacy (Why the Shell Matters)

Most research computing workflows eventually leave the IDE. Comfort with the shell is a force multiplier.

## Core Concepts

Filesystems are hierarchical, not project-aware

The shell is a scripting language, not just a launcher

Most research infrastructure assumes CLI fluency


## Essential Commands

In [None]:
# Print the current working directory
pwd

# List the files in the current directory (add -a to include hidden files)
ls

# List files with details such as size and modification date
ls -l

# Change the current directory
cd /path/to/directory

# Go up one directory level
cd ..

# Create a new directory
mkdir new_directory

# Create multiple directories at once
mkdir data_raw data_processed outputs

# Remove an empty directory
rmdir old_folder

# Create a new, empty file (in PowerShell: New-Item newfile.txt)
touch newfile.txt

# Move a file; giving a new file name also renames it
mv newfile.txt /new/destination/folder/destination.txt

# Remove a file permanently (there is no recycle bin)
rm /new/destination/folder/destination.txt
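
The same file operations can also be scripted, which is useful once a sequence of commands has to be repeated. The sketch below mirrors the commands above with Python's standard `pathlib` and `shutil` modules. The function name `demo_file_operations` and the choice of `outputs/` as the move destination are assumptions made for illustration, and everything runs inside a temporary directory so no real files are touched.

```python
import shutil
import tempfile
from pathlib import Path


def demo_file_operations(base: Path) -> list:
    """Repeat the basic shell steps inside `base` and return what is left."""
    # Create several directories (shell: mkdir data_raw data_processed outputs)
    for name in ("data_raw", "data_processed", "outputs"):
        (base / name).mkdir()

    # Create a new, empty file
    new_file = base / "newfile.txt"
    new_file.touch()

    # Move and rename the file (shell: mv newfile.txt outputs/destination.txt)
    moved = Path(shutil.move(str(new_file), str(base / "outputs" / "destination.txt")))

    # Remove the moved file (shell: rm outputs/destination.txt)
    moved.unlink()

    # Remove an empty directory (shell: rmdir data_processed)
    (base / "data_processed").rmdir()

    # List what remains (shell: ls)
    return sorted(p.name for p in base.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    print(demo_file_operations(Path(tmp)))  # ['data_raw', 'outputs']
```

Note that `rmdir()` and `unlink()` behave like their shell counterparts: the first refuses to delete a non-empty directory, and neither moves anything to a recycle bin.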


# 2. Git Fundamentals

## Git, Source Control, and Project History

Git is a distributed version control system that records the history of a project as a sequence of commits. Each commit is a snapshot of the project state at a point in time.

The main branch represents the authoritative history of the project. It should always reflect a coherent, working state and typically corresponds to results you trust, share, or publish.

A branch is a separate line of development created from an existing commit. Branches allow you to experiment, refactor, or test ideas without modifying main. This enables safe, parallel development while preserving full history and provenance.

When work on a branch is complete, it can be merged back into main, explicitly incorporating its changes into the official project record.

## In source control terms:

Commits track change

Branches isolate intent

Merges formalize decisions


## Visual of a Working Repository

(main)
  |
  A --- B --- C -------------------- D
              \                    /
               E --- F --- G ------
            (feature_branch)

A → B → C → D are commits on main

feature_branch is created at commit C

E → F → G are commits made during experimentation

D is the merge commit: it integrates feature_branch back into main and has two parents, C and G
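
One way to read the diagram is as a graph in which every commit points back to its parent commits. The sketch below is a toy model of that idea, not a description of how Git stores objects internally; the `PARENTS` table and the `history` function are names invented for this example.

```python
# Each commit maps to the list of its parent commits.
# D is the merge commit, so it has two parents: C (on main) and G (on feature_branch).
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "E": ["C"],
    "F": ["E"],
    "G": ["F"],
    "D": ["C", "G"],
}


def history(commit):
    """Return the commit plus every commit reachable through parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


print(sorted(history("C")))  # ['A', 'B', 'C']
print(sorted(history("G")))  # ['A', 'B', 'C', 'E', 'F', 'G']
print(sorted(history("D")))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']

# History shared by both lines of work, up to the point where the branch was created
print(sorted(history("C") & history("G")))  # ['A', 'B', 'C']
```

Before the merge, main (at C) knows nothing about E, F, or G. After the merge, the history of D contains every commit from both lines of work.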

In [None]:
## Git Basic Controls

## insert picture for the clone button ...

# Clone a remote repository to your local machine
git clone https://github.com/cscarpon/Science_Workshop.git

cd Science_Workshop

# Alternatively, start a new Git repository in the current directory
# (not needed inside a repository you have just cloned)
git init

# Check the current status of the repository
# Shows staged, modified, and untracked files
git status

# Add a file to the staging area
# Staging marks files to be included in the next commit
git add analysis.R

# Add all modified and new files to the staging area
git add .

# Create a commit
# A commit records a snapshot of the project state
git commit -m "Add initial analysis script"

# Show the commit history
git log

# Create a new branch for experimental work
git branch feature_experiment

# Switch to the new branch
git checkout feature_experiment

# Shortcut: create and switch to a branch in one command
# (use this instead of the two commands above; it fails if the branch already exists)
git checkout -b feature_experiment

# Switch back to the main branch
git checkout main

# Merge changes from the feature branch into main
git merge feature_experiment

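## Push to Remote Repository

# Link a repository created with git init to a remote
# (a cloned repository already has "origin" configured, so skip this step)
git remote add origin https://github.com/user/repo.git

# The first push of a branch sets its upstream
git push -u origin main

# After that, a plain push is enough
git push

# Fetch and merge updates from the remote repository
git pull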


# 3. Reproducible Research Foundations
## Reproducibility as a Workflow Property

Reproducibility is not achieved at publication time; it is a consequence of how a project is structured and executed. A reproducible project allows someone else, or your future self, to regenerate results using the same code, data, and computational environment.

This section focuses on:

Structuring projects so intent is clear

Separating raw data from derived outputs

Capturing the computational context in which results were produced

Reproducible workflows reduce ambiguity, prevent accidental data corruption, and make research easier to resume after interruptions.
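
As a minimal sketch of what capturing the computational context can look like, the snippet below writes a small JSON record next to the outputs: a checksum of the raw input file, the Python version, the platform, and a timestamp. The function names, the record fields, and the file name `run_record.json` are assumptions chosen for illustration, not an established convention.

```python
import hashlib
import json
import platform
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_record(raw_file):
    """Describe one analysis run: which input was used, where, and when."""
    return {
        "input_file": raw_file.name,
        "input_sha256": sha256_of(raw_file),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


def write_run_record(raw_file, output_dir):
    """Write the record as JSON into the output directory and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / "run_record.json"
    target.write_text(json.dumps(run_record(raw_file), indent=2))
    return target


# Demonstration with a throwaway file standing in for real raw data
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "measurements.csv"
    raw.write_text("height\n10.2\n5.4\n3.1\n")
    record_path = write_run_record(raw, Path(tmp) / "outputs")
    print(record_path.read_text())
```

If the checksum stored with an old result no longer matches the file in the raw data folder, the raw data has changed since that result was produced.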

# 4. Environment Management (Conda)
## Isolating Computational Context

An environment defines the exact versions of software and libraries used in a project. Without explicit environment management, analyses become fragile and difficult to reproduce across machines or over time.

Conda environments:

Isolate dependencies from the system and other projects

Allow exact recreation of software stacks

Capture part of the research method, not just implementation details

The environment specification file should be treated as part of the project’s methodology and versioned alongside the code.

In [None]:
# Create a new Conda environment named "research_env"
# This environment is isolated from the base environment
conda create -n research_env python=3.11

# Activate the Conda environment
conda activate research_env

# Install common scientific Python packages
conda install numpy pandas matplotlib

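# Export the full environment specification
# This records exact package versions, builds, and channels, so it is
# most reliable on the same operating system
conda env export > environment.yml

# For a more portable file, export only the packages you asked for
conda env export --from-history > environment.yml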

# Recreate the environment later or on another machine
conda env create -f environment.yml

# Remove an environment when it is no longer needed
# (deactivate it first; the active environment cannot be removed)
conda deactivate
conda remove -n research_env --all
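
A versioned specification is only useful if the running environment actually matches it. The sketch below compares a set of pinned versions against what is installed, using the standard-library `importlib.metadata` module. The function name `find_mismatches` and the pinned versions are hypothetical, and the check only sees Python packages, not other software managed by Conda.

```python
from importlib import metadata


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    `get_version` is replaceable so the check can be exercised without
    depending on what happens to be installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Hypothetical pins; in a real project these would come from environment.yml
PINNED = {"numpy": "1.26.4", "pandas": "2.2.2"}

problems = find_mismatches(PINNED)
if problems:
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found}")
else:
    print("environment matches the pinned versions")
```

Running a check like this at the top of an analysis script turns a silent version drift into a visible message.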




# R Projects and Research-Oriented Structure
## R Projects as Analytical Units

An R project defines a self-contained analytical workspace. It establishes a fixed project root that becomes the working directory when the project is opened, so file paths can be written relative to that root and stay portable across machines and collaborators. This avoids hard-coded absolute paths.

An R project is not just an IDE convenience. It is a logical boundary for a research task, experiment, or manuscript. Treating each project as an independent unit improves clarity, reproducibility, and version control.

R projects work naturally with Git, renv, and standardized directory layouts.
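
The project-root idea carries over to any Python scripts that live in the same project. As a rough sketch, a script can locate the root by walking upward until it finds the `.Rproj` file and then build every path relative to that folder. The function name `find_project_root` and the demo folder names are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker_suffix=".Rproj"):
    """Walk upward from the directory `start` until a marker file is found."""
    start = Path(start).resolve()
    for folder in (start, *start.parents):
        if any(folder.glob(f"*{marker_suffix}")):
            return folder
    raise FileNotFoundError(f"no *{marker_suffix} file found at or above {start}")


# Demonstration with a throwaway project layout
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project_name"
    (project / "scripts").mkdir(parents=True)
    (project / "project_name.Rproj").touch()

    root = find_project_root(project / "scripts")
    raw_dir = root / "data" / "raw"  # built relative to the root, never absolute
    print(root.name, raw_dir.relative_to(root))
```

Because the root is discovered at run time, the same script works regardless of where the project folder sits on a given machine.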

In [None]:
# Creating an R Project

# In RStudio:
# File → New Project → New Directory → New Project

# Project Structure and source files
project_name/
  project_name.Rproj


## Recommended Project Structure

project_name/
├── data/
│   ├── raw/
│   └── processed/
├── scripts/
├── outputs/
│   ├── figures/
│   └── tables/
├── renv/
├── renv.lock
├── README.md
└── project_name.Rproj

## Directory Roles

### /data/raw

Original input data

Read-only

Never overwritten or modified

### /data/processed

Cleaned or derived datasets

Fully reproducible from scripts

Can be regenerated at any time

### /scripts

All analysis and processing code

No manual data manipulation

Scripts should be executable end-to-end

### /outputs

Figures, tables, model results

Generated programmatically

Safe to delete and recreate

### renv/ and renv.lock

Capture the R package environment

Part of the research method

### README.md

Explains purpose, structure, and workflow

Entry point for collaborators and reviewers
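
Creating the same layout by hand for every project invites small inconsistencies, so it can help to script it. The sketch below builds the directories described above and a starter README. The `LAYOUT` list and the function name `scaffold_project` are assumptions made for illustration; `renv/`, `renv.lock`, and the `.Rproj` file are left out because renv and RStudio create them.

```python
import tempfile
from pathlib import Path

# Directories from the recommended structure, relative to the project root
LAYOUT = [
    "data/raw",
    "data/processed",
    "scripts",
    "outputs/figures",
    "outputs/tables",
]


def scaffold_project(root):
    """Create the standard directories and a starter README under `root`."""
    created = []
    for relative in LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)  # safe to run more than once
        created.append(folder)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text(f"# {root.name}\n")
    return created


# Demonstration in a temporary location
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "project_name"
    scaffold_project(root)
    for path in sorted(root.rglob("*")):
        print(path.relative_to(root))
```

The function is deliberately idempotent: running it again on an existing project adds any missing folders and leaves everything else alone.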


In [None]:
# Environment Management with renv in R

# Initialize a reproducible R environment (creates renv/ and renv.lock)
renv::init()

# Install a package into the project library
renv::install("dplyr")

# Record the package versions used by the project in renv.lock
renv::snapshot()

# Restore the recorded package environment later or on another machine
renv::restore()


# 5. Testing and Validation (pytest)
## Making Assumptions Explicit

Testing in research code is not about achieving full test coverage. It is about making assumptions explicit and checking them automatically.

Tests help ensure that:

Inputs conform to expected ranges or structures

Outputs remain scientifically plausible

Changes to code do not silently alter results

Lightweight testing reduces the risk of unnoticed errors and increases confidence in analytical pipelines, especially as projects grow or are revisited.

In [None]:
# Install pytest into the active Python environment (shell command)
pip install pytest

# Create a test file; pytest collects files named test_*.py or *_test.py
# File: test_metrics.py

def test_positive_values():
    values = [1, 2, 3]
    assert all(v > 0 for v in values)

# Example of validating a scientific assumption (same file)
def test_no_negative_heights():
    heights = [10.2, 5.4, 3.1]
    assert min(heights) >= 0

# Run all tests in the project (shell command)
pytest

# Run tests with verbose output (shell command)
pytest -v
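
The tests above check literal lists. In a real project the same pattern is applied to the project's own functions. The sketch below assumes a hypothetical function `mean_height` and shows one test per goal listed earlier: a regression check on a known result, a check that invalid input is rejected, and a plausibility check on the output. It uses plain `assert` so that it runs without pytest installed; with pytest, the second test would normally be written with `pytest.raises(ValueError)`.

```python
import math


def mean_height(heights):
    """Return the mean of a non-empty sequence of non-negative heights."""
    if len(heights) == 0:
        raise ValueError("heights must not be empty")
    if min(heights) < 0:
        raise ValueError("heights must be non-negative")
    return sum(heights) / len(heights)


def test_mean_height_matches_known_value():
    # Regression check: a later code change must not silently alter this result
    assert math.isclose(mean_height([10.2, 5.4, 3.1]), 18.7 / 3, rel_tol=1e-9)


def test_mean_height_rejects_negative_input():
    # Input check: physically impossible values should stop the analysis
    try:
        mean_height([10.2, -1.0])
    except ValueError:
        pass
    else:
        raise AssertionError("a negative height was accepted")


def test_mean_height_stays_within_input_range():
    # Plausibility check: a mean can never fall outside the observed values
    heights = [10.2, 5.4, 3.1]
    assert min(heights) <= mean_height(heights) <= max(heights)
```

Saved in a file such as `test_metrics.py`, these functions are collected and run by the `pytest` command like any other test.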

# 6. Documentation and Knowledge Transfer
## Writing for Your Future Self

Documentation is a form of communication between collaborators separated by time. In most projects, your future self is the primary audience.

Clear documentation:

Explains what the project does and why

Describes how to run analyses end-to-end

Records assumptions, decisions, and limitations

A well-maintained README and clean project structure dramatically reduce the cost of onboarding new collaborators or returning to old work.

In [None]:
# Create a README file for the project (in PowerShell: New-Item README.md)
touch README.md

# Project Title

## Description
Brief explanation of the research question and approach.

## Requirements
How to install dependencies.

## How to Run
Exact commands needed to reproduce results.

## Outputs
Description of generated files and figures.

# Create a scripts directory for executable analysis code
mkdir scripts

scripts/
  run_analysis.py    # Executable pipeline
  helpers.py         # Supporting functions
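
To make "executable pipeline" concrete, the sketch below shows one possible shape for `run_analysis.py`: a helper function, a `run` step that reads from the raw data folder and writes to the outputs folder, and a command-line entry point. The input file `measurements.csv`, the column name `height`, and all function names are assumptions made for illustration. The helper is kept in the same block here so the example is self-contained, and the demonstration at the end runs in a temporary directory.

```python
import argparse
import csv
import tempfile
from pathlib import Path


def summarise(values):
    """Supporting function (would live in helpers.py): basic summary statistics."""
    return {
        "n": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


def run(input_path, output_path, column):
    """Read one numeric column from a CSV file and write its summary as a CSV table."""
    with input_path.open(newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    summary = summarise(values)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(summary))
        writer.writeheader()
        writer.writerow(summary)
    return summary


def main(argv=None):
    """Command-line entry point; paths and column name can be overridden."""
    parser = argparse.ArgumentParser(description="Summarise one column of a raw data file.")
    parser.add_argument("--input", type=Path, default=Path("data/raw/measurements.csv"))
    parser.add_argument("--output", type=Path, default=Path("outputs/tables/summary.csv"))
    parser.add_argument("--column", default="height")
    args = parser.parse_args(argv)
    return run(args.input, args.output, args.column)


# Demonstration in a throwaway directory so that no real project files are touched
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw" / "measurements.csv"
    raw.parent.mkdir(parents=True)
    raw.write_text("height\n10.2\n5.4\n3.1\n")
    out = Path(tmp) / "outputs" / "tables" / "summary.csv"
    result = main(["--input", str(raw), "--output", str(out)])
    print(result["n"], out.exists())  # 3 True
```

In an actual script the demonstration block would be replaced by `if __name__ == "__main__": main()`, so that `python scripts/run_analysis.py` is the single command the README needs to list under "How to Run".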
