# Session 4: Organising Bioinformatics Projects

## Learning Objectives

By the end of this session, you will be able to:

1. Design a logical directory structure for bioinformatics projects
2. Apply consistent naming conventions for files and samples
3. Document your analysis workflow effectively
4. Implement version control basics with Git
5. Create reproducible analysis environments

## 1. Why Organisation Matters

Poor project organisation leads to:

- Lost or overwritten data
- Inability to reproduce results
- Wasted time searching for files
- Confusion when collaborating
- Difficulty publishing or sharing work

### The Reproducibility Crisis

Many published bioinformatics analyses cannot be reproduced, sometimes not even by their original authors, commonly because the exact data, code, or software versions were not kept together. Good organisation is the foundation of reproducible research.

### Principles of Good Organisation

1. **Separation**: Keep raw data, code, and results separate
2. **Documentation**: Record what you did and why
3. **Consistency**: Use the same structure across projects
4. **Automation**: Reduce manual steps that introduce errors
5. **Version control**: Track changes to code and documents

## 2. Standard Directory Structure

### Recommended Project Layout

```
project_name/
├── README.md                 # Project overview and instructions
├── LICENSE                   # Licence for sharing
├── config/                   # Configuration files
│   ├── samples.tsv          # Sample metadata
│   └── config.yaml          # Pipeline parameters
├── data/
│   ├── raw/                  # Original, immutable data
│   │   ├── fastq/           # Raw sequencing reads
│   │   └── checksums.md5    # Data integrity verification
│   ├── reference/            # Reference genomes, annotations
│   │   ├── genome.fasta
│   │   ├── genome.fasta.fai
│   │   └── annotation.gtf
│   └── processed/            # Cleaned/filtered data
│       ├── trimmed/
│       └── aligned/
├── scripts/                  # Analysis scripts
│   ├── 01_quality_control.sh
│   ├── 02_alignment.sh
│   ├── 03_variant_calling.sh
│   └── utils/               # Helper functions
├── envs/                     # Environment specifications
│   ├── environment.yaml     # Conda environment
│   └── requirements.txt     # Python packages
├── results/                  # Analysis outputs
│   ├── figures/
│   ├── tables/
│   └── reports/
├── docs/                     # Documentation
│   ├── methods.md
│   └── analysis_log.md
└── notebooks/                # Jupyter notebooks for exploration
    ├── 01_eda.ipynb
    └── 02_visualisation.ipynb
```

### Setting Up a New Project

```bash
# Create project structure in one command
mkdir -p my_project/{config,data/{raw/fastq,reference,processed/{trimmed,aligned}},scripts/utils,envs,results/{figures,tables,reports},docs,notebooks}

# Create essential files
touch my_project/README.md
touch my_project/config/samples.tsv
touch my_project/docs/analysis_log.md

# View the structure (if tree is not installed, use: find my_project -type d)
tree my_project/
```

## 3. Protecting Raw Data

Raw data is sacred. Once modified, you may never be able to get it back.

### Golden Rules for Raw Data

1. **Never modify raw files** - always work on copies
2. **Store checksums** - verify data integrity
3. **Make read-only** - prevent accidental changes
4. **Back up** - store copies in multiple locations

### Implementing Data Protection

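```bash
# Generate checksums for all raw FASTQ files
# (the subshell leaves you in the project root afterwards)
(cd data/raw/fastq/ && md5sum *.fastq.gz > ../checksums.md5)

# Verify checksums later
(cd data/raw/fastq/ && md5sum -c ../checksums.md5)

# Make raw data read-only
chmod -R a-w data/raw/

# If you need to add more raw data later
chmod u+w data/raw/ data/raw/fastq/ data/raw/checksums.md5
# ... add files and append their checksums ...
chmod -R a-w data/raw/
```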
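
The golden rules can also be re-checked at any time. The following is a minimal sketch, assuming `md5sum` is available and using a throwaway folder with made-up file names; it verifies the recorded checksums and lists any raw file that is still writable.

```bash
# Demo raw-data folder in a temporary location (file names are made up)
raw_dir="$(mktemp -d)/data/raw"
mkdir -p "$raw_dir/fastq"
printf 'ACGT\n' > "$raw_dir/fastq/sample_001_R1.fastq"
printf 'TTGA\n' > "$raw_dir/fastq/sample_001_R2.fastq"

# Record checksums next to fastq/, then lock everything
(cd "$raw_dir/fastq" && md5sum *.fastq > ../checksums.md5)
chmod -R a-w "$raw_dir"

# Check 1: every file still matches its recorded checksum
if (cd "$raw_dir/fastq" && md5sum -c ../checksums.md5 > /dev/null); then
    checksum_status="ok"
else
    checksum_status="FAILED"
fi

# Check 2: list raw files whose owner write bit is still set
writable_files="$(find "$raw_dir" -type f -perm -200)"

echo "checksums: $checksum_status"
echo "writable files: ${writable_files:-none}"
```

Running both checks before each analysis is one way to catch accidental edits early; how often to run them is a project decision.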

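### Sample Checksum File

Each line of `checksums.md5` holds a 32-character hexadecimal MD5 digest, two spaces, and the file name. The digests below are illustrative placeholders:

```
a1b2c3d4e5f60718293a4b5c6d7e8f90  sample_001_R1.fastq.gz
b2c3d4e5f60718293a4b5c6d7e8f90a1  sample_001_R2.fastq.gz
c3d4e5f60718293a4b5c6d7e8f90a1b2  sample_002_R1.fastq.gz
d4e5f60718293a4b5c6d7e8f90a1b2c3  sample_002_R2.fastq.gz
```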

## 4. File Naming Conventions

Good file names are descriptive, consistent, and machine-readable.

### Naming Guidelines

| Rule | Bad Example | Good Example |
|------|-------------|-------------|
| No spaces | `sample 1.fastq` | `sample_001.fastq` |
| No special characters | `sample#1.fastq` | `sample_001.fastq` |
| Use underscores or hyphens | `sampleone.fastq` | `sample_001.fastq` |
| Include leading zeros | `sample_1.fastq` | `sample_001.fastq` |
| Use ISO dates | `sample_15-1-24.fastq` | `sample_2024-01-15.fastq` |
| Be descriptive | `data.bam` | `sheep_liver_001_aligned.bam` |

### Recommended Naming Patterns

**For sequencing data:**
```
{sample_id}_{condition}_{replicate}_{read}.fastq.gz

Examples:
sheep_liver_rep1_R1.fastq.gz
sheep_liver_rep1_R2.fastq.gz
sheep_muscle_rep1_R1.fastq.gz
```

**For processed files:**
```
{sample_id}_{processing_step}_{date_or_version}.{extension}

Examples:
sheep_liver_aligned_oar4.bam
sheep_liver_variants_filtered.vcf
cohort_analysis_v2.csv
```

**For scripts:**
```
{step_number}_{action}_{description}.{extension}

Examples:
01_quality_control.sh
02_trim_adapters.sh
03_align_reads.sh
```
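
A consistent pattern is what makes names machine-readable: a script can split a name back into its fields. The following is a small sketch for the sequencing-data pattern, assuming no field contains an underscore; the variable names are chosen here for illustration.

```bash
# Split a read file name of the form
#   {sample_id}_{condition}_{replicate}_{read}.fastq.gz
# into its fields. Assumes no field contains an underscore.
file="sheep_liver_rep1_R1.fastq.gz"

# Strip the extension: sheep_liver_rep1_R1
base="${file%.fastq.gz}"

# Split on underscores into four variables
IFS=_ read -r sample_id condition replicate read_pair <<EOF
$base
EOF

echo "sample_id=$sample_id condition=$condition replicate=$replicate read=$read_pair"
# prints: sample_id=sheep condition=liver replicate=rep1 read=R1
```

If a field may itself contain underscores (for example `sheep_L_1`), a different delimiter between fields, such as a hyphen or a dot, keeps the split unambiguous.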

## 5. Sample Metadata Management

A well-organised sample sheet is essential for tracking samples through analysis.

### Sample Sheet Format (TSV/CSV)

```
sample_id	condition	replicate	read1	read2	batch	sequencing_date
sheep_L_1	liver	1	data/raw/fastq/sheep_L_1_R1.fastq.gz	data/raw/fastq/sheep_L_1_R2.fastq.gz	batch1	2024-01-15
sheep_L_2	liver	2	data/raw/fastq/sheep_L_2_R1.fastq.gz	data/raw/fastq/sheep_L_2_R2.fastq.gz	batch1	2024-01-15
sheep_M_1	muscle	1	data/raw/fastq/sheep_M_1_R1.fastq.gz	data/raw/fastq/sheep_M_1_R2.fastq.gz	batch1	2024-01-15
sheep_M_2	muscle	2	data/raw/fastq/sheep_M_2_R1.fastq.gz	data/raw/fastq/sheep_M_2_R2.fastq.gz	batch1	2024-01-15
```

### Essential Metadata Fields

| Field | Description | Example |
|-------|-------------|--------|
| sample_id | Unique identifier | sheep_L_1 |
| condition | Experimental group | liver, muscle, treated |
| replicate | Biological replicate number | 1, 2, 3 |
| batch | Processing batch | batch1, batch2 |
| sequencing_date | When sequenced | 2024-01-15 |
| platform | Sequencing platform | Illumina NovaSeq |
| library_type | Library preparation | RNA-Seq, WGS |

### Tips for Sample Management

1. Create the sample sheet before starting analysis
2. Store it in version control
3. Never manually rename files - use the sample sheet to link IDs
4. Include all relevant experimental factors for downstream analysis
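
Because the sample sheet links IDs to files, scripts can loop over it instead of over file names. The sketch below builds a two-sample demo sheet in a temporary folder (a reduced set of columns and made-up files, for illustration only) and reports any read file that the sheet lists but that does not exist.

```bash
# Demo project in a temporary directory (files are empty placeholders)
proj="$(mktemp -d)"
mkdir -p "$proj/config" "$proj/data/raw/fastq"
tab="$(printf '\t')"

# Two-sample sheet; only the first sample's files are created below
{
    printf 'sample_id\tcondition\tread1\tread2\n'
    printf 'sheep_L_1\tliver\tdata/raw/fastq/sheep_L_1_R1.fastq.gz\tdata/raw/fastq/sheep_L_1_R2.fastq.gz\n'
    printf 'sheep_M_1\tmuscle\tdata/raw/fastq/sheep_M_1_R1.fastq.gz\tdata/raw/fastq/sheep_M_1_R2.fastq.gz\n'
} > "$proj/config/samples.tsv"
: > "$proj/data/raw/fastq/sheep_L_1_R1.fastq.gz"
: > "$proj/data/raw/fastq/sheep_L_1_R2.fastq.gz"

# Read the sheet row by row and count the files that are missing
missing=0
while IFS="$tab" read -r sample_id condition read1 read2; do
    [ "$sample_id" = "sample_id" ] && continue   # skip the header row
    for f in "$read1" "$read2"; do
        if [ ! -e "$proj/$f" ]; then
            echo "MISSING for $sample_id: $f"
            missing=$((missing + 1))
        fi
    done
done < "$proj/config/samples.tsv"
echo "missing files: $missing"
```

One caveat of splitting on tabs with `read`: consecutive tabs are treated as a single separator, so empty cells shift the remaining columns. Filling empty cells with a placeholder such as `NA` avoids this.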

## 6. Documentation Best Practices

### The README File

Every project should have a `README.md` covering the following sections:

```markdown
# Project Title

Brief description of the project and its goals.

## Data

- Source of raw data
- Number of samples
- Sequencing platform and parameters

## Methods

Overview of the analysis pipeline.

## Directory Structure

Brief explanation of folder organisation.

## Usage

How to reproduce the analysis.

## Dependencies

Required software and versions.

## Authors

Who worked on this project.

## Licence

Terms for using this work.
```

### Analysis Log

Keep a running log of your analysis:

```markdown
# Analysis Log

## 2024-01-15

### Quality Control
- Ran FastQC on all samples
- Sample sheep_003 showed adapter contamination
- Decision: Include in trimming, monitor downstream

### Reference Download
- Downloaded Oar_v4.0 from NCBI
- Accession: GCF_000298735.2
- Command: wget [URL]

## 2024-01-16

### Trimming
- Used Trimmomatic v0.39
- Parameters: ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3
- Results: 95% reads retained on average
```
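
Keeping the log is easier when adding an entry takes one command. The following is a sketch of a hypothetical helper, `log_entry`, that appends a bullet under a heading for today's date and creates that heading the first time it is needed; the log path is a temporary file here.

```bash
# Hypothetical helper: append a dated entry to an analysis log.
log_entry() {
    log_file="$1"; shift
    today="$(date +%Y-%m-%d)"
    # Add the date heading only once per day
    if ! grep -q "^## $today\$" "$log_file" 2>/dev/null; then
        printf '\n## %s\n\n' "$today" >> "$log_file"
    fi
    # Record the entry itself as a bullet point
    printf '%s\n' "- $*" >> "$log_file"
}

log_dir="$(mktemp -d)"
log_entry "$log_dir/analysis_log.md" "Ran FastQC on all samples"
log_entry "$log_dir/analysis_log.md" "Trimmed adapters with Trimmomatic v0.39"
cat "$log_dir/analysis_log.md"
```

In a real project the first argument would be `docs/analysis_log.md`, and decisions and their reasons still need to be written by hand.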

## 7. Managing Software Environments

Reproducibility requires recording exact software versions.

### Using Conda Environments

```yaml
# envs/environment.yaml
name: sheep_rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - bwa=0.7.17
  - samtools=1.17
  - bcftools=1.17
  - fastqc=0.12.1
  - multiqc=1.14
  - hisat2=2.2.1
```

### Creating and Using Environments

```bash
# Create environment from file
conda env create -f envs/environment.yaml

# Activate environment
conda activate sheep_rnaseq

# Export current environment (for sharing)
conda env export > envs/environment_exported.yaml

# List installed packages with versions
conda list > envs/package_versions.txt
```
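
An environment file only records exact versions if every dependency is pinned. The sketch below scans a demo environment file (written to a temporary location, with one deliberately unpinned package) and prints the dependencies that lack an `=version` pin. It assumes a flat dependency list and does not handle nested `pip:` sections.

```bash
# Demo environment file with one unpinned dependency
env_file="$(mktemp)"
cat > "$env_file" <<'EOF'
name: sheep_rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - samtools=1.17
  - multiqc
EOF

# Print dependencies that have no "=version" pin (here: multiqc)
unpinned="$(awk '
    /^dependencies:/ { in_deps = 1; next }   # start of the dependency list
    /^[^ #-]/         { in_deps = 0 }         # any other top-level key ends it
    in_deps && /^ *- / {
        sub(/^ *- */, "")                     # drop the list marker
        if ($0 !~ /=/) print                  # no "=" means no pinned version
    }
' "$env_file")"
echo "unpinned: ${unpinned:-none}"
```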

### Recording Software Versions in Scripts

```bash
#!/bin/bash
# scripts/02_alignment.sh

# Log software versions
echo "=== Software Versions ===" > alignment.log
bwa 2>&1 | head -3 >> alignment.log
samtools --version | head -2 >> alignment.log
echo "========================" >> alignment.log

# Run alignment
bwa mem reference.fasta reads_R1.fastq reads_R2.fastq > aligned.sam 2>> alignment.log
```
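
When several tools are involved, the version lines can be factored into a small function. This is a sketch with a hypothetical helper, `log_version`, that records the first line a tool prints for the given version flag and notes when the tool is not installed; the second tool name below is made up to show that fallback.

```bash
# Hypothetical helper: print "<tool>: <first line of its version output>".
log_version() {
    tool="$1"; shift
    if command -v "$tool" > /dev/null 2>&1; then
        printf '%s: %s\n' "$tool" "$("$tool" "$@" 2>&1 | head -n 1)"
    else
        printf '%s: NOT FOUND\n' "$tool"
    fi
}

version_log="$(mktemp)"
{
    echo "=== Software Versions ==="
    log_version bash --version
    log_version no_such_aligner --version   # made-up name to show the fallback
} > "$version_log"
cat "$version_log"
```

Tools differ in how they report versions, so the flag passed after the tool name, and which output line is informative, need checking for each tool.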

## 8. Version Control with Git

Git tracks changes to your code and documents, allowing you to revert mistakes and collaborate.

### Essential Git Commands

```bash
# Initialise a new repository
cd my_project/
git init

# Create .gitignore to exclude large data files
echo "data/raw/" >> .gitignore
echo "data/processed/" >> .gitignore
echo "*.fastq*" >> .gitignore
echo "*.bam" >> .gitignore
echo "*.bam.bai" >> .gitignore

# Add files to staging (code, configuration, environment specs, documentation)
git add scripts/ config/ envs/ docs/ README.md .gitignore

# Commit changes
git commit -m "Initial project setup"

# View history
git log --oneline

# Check status
git status
```

### What to Track vs Ignore

| Track (git add) | Ignore (.gitignore) |
|----------------|--------------------|
| Scripts (.sh, .py, .R) | Raw data (.fastq, .bam) |
| Configuration files | Processed data |
| Sample sheets | Large reference files |
| Documentation | Log files |
| Environment specs | Temporary files |
| Small result tables | Figures (regenerate) |
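
File size is a practical first filter for the right-hand column. The following sketch lists files above a size threshold so they can be reviewed before running `git add`; the 1 MiB threshold and the demo files are arbitrary choices for illustration, and the `k` size suffix assumes GNU or BSD `find`.

```bash
# Demo folder with one small script and one large "data" file (made-up names)
work="$(mktemp -d)"
mkdir -p "$work/scripts" "$work/data/processed"
printf '#!/bin/bash\necho hello\n' > "$work/scripts/01_quality_control.sh"
# 2 MiB of zeros standing in for a BAM file
dd if=/dev/zero of="$work/data/processed/aligned.bam" bs=1024 count=2048 2>/dev/null

# List files larger than the threshold (1 MiB here)
large_files="$(find "$work" -type f -size +1024k | sort)"
echo "Candidates for .gitignore:"
echo "$large_files"
```

Only the demo BAM file is listed; the script stays below the threshold and would be tracked.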

### Sample .gitignore for Bioinformatics

```
# Data files
*.fastq
*.fastq.gz
*.fq
*.fq.gz
*.bam
*.bam.bai
*.cram
*.vcf
*.vcf.gz
*.bcf

# Reference files
*.fasta
*.fa
*.fasta.fai
*.dict

# Indices
*.bt2
*.bwt
*.pac
*.ann
*.amb
*.sa

# Logs and temp
*.log
*.tmp
.snakemake/

# System files
.DS_Store
Thumbs.db
```

## 9. Hands-On Exercise

### Exercise: Organise the Training Data

Create a properly organised project structure for analysing our sheep training data:

```bash
# Step 1: Create project structure
PROJECT="sheep_training_analysis"
mkdir -p $PROJECT/{config,data/{raw,reference,processed},scripts,envs,results/{figures,tables},docs}

# Step 2: Create README
cat > $PROJECT/README.md << 'EOF'
# Sheep Training Data Analysis

Analysis of sheep genomic data for bioinformatics training.

## Data

- FASTQ: SRR10532784 (truncated)
- BAM: u9_liver_100.bam (sheep liver RNA-seq)
- VCF: sheep.snp.vcf.gz (~300 samples)

## Methods

Training exercises for file format exploration.
EOF

# Step 3: Create a sample sheet listing the training files
cat > $PROJECT/config/samples.tsv << 'EOF'
file_type	file_name	description
FASTQ	SRR10532784_1.fastq.gz	RNA-seq reads (truncated)
BAM	u9_liver_100.bam	Sheep liver aligned reads
VCF	sheep.snp.vcf.gz	Sheep population SNPs
EOF

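# Step 4: Link the training data (or copy)
ln -s "$(pwd)"/training_datasets/* "$PROJECT/data/raw/"

# Step 5: Generate checksums (on macOS, use `md5 -r` in place of md5sum)
# The subshell keeps the working directory unchanged for the next steps
(cd "$PROJECT/data/raw" && md5sum *.gz *.bam > checksums.md5)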

# Step 6: Create .gitignore
cat > $PROJECT/.gitignore << 'EOF'
data/raw/
data/processed/
*.fastq*
*.bam*
*.vcf*
*.log
EOF

# Step 7: View final structure
tree $PROJECT/
```

## 10. Summary

In this session, we covered:

1. **Directory Structure**: Separating raw data, code, and results
2. **Raw Data Protection**: Checksums, read-only permissions, backups
3. **Naming Conventions**: Machine-readable, descriptive file names
4. **Sample Metadata**: Tracking samples with structured sheets
5. **Documentation**: README files and analysis logs
6. **Environments**: Conda for reproducible software stacks
7. **Version Control**: Git basics for tracking changes

### Checklist for Every New Project

- [ ] Create standard directory structure
- [ ] Write README.md
- [ ] Create sample sheet
- [ ] Set up conda environment
- [ ] Initialise git repository
- [ ] Create .gitignore
- [ ] Generate checksums for raw data
- [ ] Make raw data read-only
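
Parts of this checklist can be verified by a script. The sketch below uses a hypothetical helper, `check_project`, that looks for the files the checklist implies, using the paths from the recommended layout; it does not check permissions or whether the files have sensible content.

```bash
# Hypothetical helper: report which checklist files exist in a project.
# Sets missing_items to the number of files that were not found.
check_project() {
    project="$1"
    missing_items=0
    for item in README.md config/samples.tsv envs/environment.yaml \
                .git .gitignore data/raw/checksums.md5; do
        if [ -e "$project/$item" ]; then
            echo "[x] $item"
        else
            echo "[ ] $item"
            missing_items=$((missing_items + 1))
        fi
    done
}

# Demo: a project that so far only has a README and a sample sheet
demo="$(mktemp -d)"
mkdir -p "$demo/config"
: > "$demo/README.md"
: > "$demo/config/samples.tsv"
check_project "$demo"
echo "items still missing: $missing_items"
```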

## Next Steps

With your project properly organised, you are ready to begin the actual bioinformatics analysis. The following sessions will cover quality control, alignment, and downstream analysis.

## Additional Resources

- [A Quick Guide to Organizing Computational Biology Projects (PLoS Computational Biology)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424)
- [Good Enough Practices in Scientific Computing](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510)
- [The Turing Way - Guide for Reproducible Research](https://the-turing-way.netlify.app/)
- [Version Control with Git (Software Carpentry)](https://swcarpentry.github.io/git-novice/)