# Session 2: Linux Command Line Essentials for Bioinformatics

## Welcome to KCGEB Computing

At the **Khalifa Centre for Genetic Engineering & Biotechnology (KCGEB)**, we process terabytes of genomic data daily. Mastering the Linux command line is essential for working efficiently with sequencing data, running analysis pipelines, and managing files on our high-performance computing cluster.

## Learning Objectives

By the end of this session, you will be able to:

1. Navigate the Linux file system confidently
2. Create, move, copy, and delete files and directories
3. View and manipulate text files
4. Use wildcards and pipes for efficient workflows
5. Apply these skills to common bioinformatics tasks at KCGEB

## 1. Navigating the File System

When you log into a KCGEB workstation or the computing cluster, you start in your home directory. Think of the file system as a tree structure where you can move between branches.

### Essential Navigation Commands

| Command | Description | Example |
|---------|-------------|--------|
| `pwd` | Print working directory (where am I?) | `pwd` |
| `ls` | List directory contents | `ls -l` |
| `cd` | Change directory | `cd /data/projects` |

### Practical Examples at KCGEB

```bash
# Check your current location
pwd
# Output: /home/trainee

# List files in your home directory
ls
# Output: Documents  Downloads  projects  scripts

# List with details (permissions, size, date)
ls -lh
# Output:
# drwxr-xr-x 2 trainee kcgeb 4.0K Jan 15 10:30 Documents
# drwxr-xr-x 3 trainee kcgeb 4.0K Jan 15 11:45 projects

# Navigate to the KCGEB shared data directory
cd /data/kcgeb/shared

# Go back to your home directory
cd ~
# or simply
cd

# Go up one directory level
cd ..

# Go to the previous directory
cd -
```
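
The navigation commands above can be tried safely in a scratch directory. The following is a minimal sketch that assumes nothing about the KCGEB file system: it builds a throwaway tree with `mktemp -d` (the directory names `projects` and `rice` are illustrative only) and shows how relative paths, `..`, and `cd -` behave.

```bash
# Remember where we started and build a throwaway directory tree
start=$PWD
scratch=$(mktemp -d)
mkdir -p "$scratch/projects/rice"

cd "$scratch/projects/rice"   # absolute path: begins with /
cd ..                         # relative path: up one level, now in projects
here=$(basename "$PWD")
echo "Now in: $here"

cd rice                       # relative path: down into rice
cd - > /dev/null              # back to the previous directory (cd - prints the path; hidden here)
back=$(basename "$PWD")
echo "Back in: $back"

# Return to the starting directory and clean up
cd "$start"
rm -r "$scratch"
```

Both `echo` lines should report `projects`: the first because `..` moved up from `rice`, the second because `cd -` returned to the directory visited just before.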

## 2. File and Directory Management

At KCGEB, proper file organisation is critical. With multiple researchers working on different projects, keeping data organised prevents confusion and data loss.

### Creating Directories and Files

```bash
# Create a new project directory for a rice genome study
mkdir rice_genome_2024

# Create nested directories in one command
mkdir -p rice_genome_2024/data/raw rice_genome_2024/data/processed rice_genome_2024/results

# Create an empty file (useful for logs or placeholder files)
touch rice_genome_2024/README.txt
```

### Copying and Moving Files

```bash
# Copy a reference genome to your project
cp /data/kcgeb/references/rice_IR64.fasta rice_genome_2024/data/

# Copy an entire directory with all contents
cp -r /data/kcgeb/templates/pipeline_scripts rice_genome_2024/scripts/

# Move (rename) a file
mv sample1.fastq sample_001_R1.fastq

# Move files to a different directory
mv *.fastq.gz rice_genome_2024/data/raw/
```

### Removing Files and Directories

```bash
# Remove a file
rm temporary_file.txt

# Remove an empty directory
rmdir empty_folder

# Remove a directory and all its contents (use with caution!)
rm -r old_project

# Interactive removal (asks for confirmation)
rm -i important_file.txt
```

:::{warning}
The `rm` command permanently deletes files. There is no recycle bin on the Linux command line! Always double-check before using `rm -r`.
:::
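
One way to reduce the risk of deleting the wrong files is to preview exactly what a pattern matches before running `rm` with it. The sketch below works in a scratch directory; the file names are hypothetical.

```bash
# Work in a throwaway directory so no real data is at risk
start=$PWD
scratch=$(mktemp -d)
cd "$scratch"
touch sample_1.tmp sample_2.tmp results.txt

# Step 1: preview what the pattern matches; ls changes nothing
ls *.tmp

# Step 2: delete only after the preview looks right
rm *.tmp

# Step 3: confirm what is left
remaining=$(ls)
echo "Remaining: $remaining"

# Return to the starting directory and clean up
cd "$start"
rm -r "$scratch"
```

Only `results.txt` should remain, because the `*.tmp` pattern never matched it.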

## 3. Viewing and Inspecting Files

Bioinformatics files can be massive. A single whole-genome sequencing FASTQ file can be 50 GB or more. These commands help you peek at files without loading them entirely into memory.

### Quick File Inspection

```bash
# View the first 10 lines of a FASTQ file
head sample_001_R1.fastq

# View the first 20 lines
head -n 20 sample_001_R1.fastq

# View the last 10 lines (useful for checking that a file was written completely)
tail sample_001_R1.fastq

# Watch a file in real-time (great for monitoring running jobs)
tail -f alignment.log

# View an entire small file
cat README.txt

# Page through a large file (use q to quit)
less reference_genome.fasta
```

### Counting and Summarising

```bash
# Count lines, words, and characters
wc sample_001_R1.fastq
# Output: 4000000  4500000 180000000 sample_001_R1.fastq

# Count only lines (divide by 4 to get number of reads in FASTQ)
wc -l sample_001_R1.fastq
# Output: 4000000 (= 1,000,000 reads)

# Check file size
ls -lh sample_001_R1.fastq
# Output: -rw-r--r-- 1 trainee kcgeb 172M Jan 15 14:30 sample_001_R1.fastq
```
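
The divide-by-four step can be done by the shell itself. This is a minimal sketch on a two-read demonstration file created on the spot; it assumes the standard layout of exactly four lines per FASTQ record.

```bash
# Build a tiny two-read FASTQ file for demonstration (4 lines per read)
scratch=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$scratch/demo.fastq"

# Reading from standard input (<) makes wc print only the number, without the file name
lines=$(wc -l < "$scratch/demo.fastq")

# Shell arithmetic: each FASTQ record occupies exactly 4 lines
reads=$(( lines / 4 ))
echo "Lines: $lines, reads: $reads"

rm -r "$scratch"
```

The demonstration file has 8 lines, so the sketch reports 2 reads.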

## 4. Searching and Filtering

Finding specific sequences or patterns in large files is a daily task at KCGEB. The `grep` command is your best friend.

### Using grep for Pattern Matching

```bash
# Find all headers in a FASTA file
grep ">" reference_genome.fasta

# Count the number of sequences in a FASTA file
grep -c ">" reference_genome.fasta
# Output: 12 (12 chromosomes)

# Find a specific gene in a GFF annotation file
grep "gene_id=Os01g0100100" rice_annotation.gff

# Case-insensitive search
grep -i "chromosome" reference_genome.fasta

# Show line numbers with matches
grep -n "ATGCATGC" sequences.fasta

# Find lines that do NOT match a pattern
# (^ anchors the match to the start of the line, so only VCF header lines are skipped)
grep -v "^#" variants.vcf
```
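
A pattern can be anchored to the start of a line with `^`, which matters when the character you are filtering on may also appear inside data lines. The sketch below uses a small VCF-like file invented for illustration (the `NOTE=probe#7` value is hypothetical) to compare an unanchored and an anchored filter.

```bash
scratch=$(mktemp -d)
# Two header lines, plus one data line whose last field happens to contain '#'
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tINFO\nchr1\t100\tNOTE=probe#7\nchr1\t200\tDP=30\n' > "$scratch/demo.vcf"

# Unanchored: drops every line containing '#' anywhere, including a data line
unanchored=$(grep -vc "#" "$scratch/demo.vcf")

# Anchored with ^: drops only lines that START with '#'
anchored=$(grep -vc "^#" "$scratch/demo.vcf")

echo "Unanchored keeps $unanchored line(s); anchored keeps $anchored line(s)"
rm -r "$scratch"
```

Here `-v` inverts the match and `-c` counts the lines that survive: the unanchored filter keeps 1 data line, the anchored filter keeps both.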

### Finding Files

```bash
# Find all FASTQ files in the current directory and subdirectories
find . -name "*.fastq"

# Find all files modified in the last 7 days
find /data/kcgeb/projects -mtime -7

# Find large files (over 1 GB)
find . -size +1G
```
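
Tests in `find` can be combined, and `-exec` can run another command on whatever is found. A minimal sketch on a scratch tree follows; the `run1` and `run2` directory names are illustrative.

```bash
scratch=$(mktemp -d)
mkdir -p "$scratch/run1" "$scratch/run2"
touch "$scratch/run1/a.fastq" "$scratch/run2/b.fastq" "$scratch/run2/notes.txt"

# -type f restricts matches to regular files; -name filters by file name pattern.
# The pattern is quoted so that find, not the shell, interprets the *.
found=$(find "$scratch" -type f -name "*.fastq" | wc -l)
echo "FASTQ files found: $found"

# -exec runs a command on the matches; {} stands for the matched paths
find "$scratch" -type f -name "*.fastq" -exec ls -lh {} +

rm -r "$scratch"
```

Two files match, since `notes.txt` fails the `-name` test.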

## 5. Wildcards and Patterns

Wildcards allow you to work with multiple files at once, saving time when processing batches of samples.

### Common Wildcards

| Wildcard | Meaning | Example |
|----------|---------|--------|
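| `*` | Matches any number of characters (including none) | `*.fastq` (all FASTQ files) |
| `?` | Matches exactly one character | `sample_?.fastq` (sample_1.fastq, sample_2.fastq) |
| `[...]` | Matches any one character listed in the brackets | `sample_[123].fastq` |
| `{...}` | Brace expansion (not a true wildcard): expands to each listed alternative, whether or not a matching file exists | `*.{fastq,fq}` (same as `*.fastq *.fq`) |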

### Practical Examples

```bash
# List all FASTQ files (compressed or not)
ls *.fastq *.fastq.gz

# Move all R1 (forward) reads to a folder
mv *_R1*.fastq.gz data/forward/

# Count lines in each FASTQ file (divide each count by 4 to get the number of reads)
wc -l *.fastq

# Compress all FASTA files
gzip *.fasta
```

## 6. Pipes and Redirection

Pipes (`|`) connect commands together, passing the output of one command as input to the next. This is the essence of the Unix philosophy: small tools that do one thing well, combined to perform complex tasks.

### Using Pipes

```bash
# Count sequences in a FASTA file
grep ">" reference.fasta | wc -l

# List the 10 largest BAM files in the current directory (-S sorts by size, largest first)
ls -lS *.bam | head -10

# Extract chromosome names and sort them
grep ">" reference.fasta | cut -d " " -f1 | sort

# Count how often each feature type (column 3) occurs in a GFF file, skipping comment lines
grep -v "^#" annotation.gff | cut -f3 | sort | uniq -c | sort -rn
```
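
A longer pipeline is easier to read one stage at a time. The sketch below runs a feature-count pipeline on a small tab-separated, GFF-like file invented for illustration, with a comment for each stage.

```bash
scratch=$(mktemp -d)
# A tiny GFF-like file: one comment line, then rows whose column 3 holds the feature type
printf '##gff-version 3\nchr1\tdemo\tgene\t1\t100\nchr1\tdemo\tmRNA\t1\t100\nchr1\tdemo\texon\t1\t50\nchr1\tdemo\texon\t60\t100\n' > "$scratch/demo.gff"

# Stage 1: grep -v drops comment lines
# Stage 2: cut -f3 keeps only column 3
# Stage 3: sort puts identical values next to each other (uniq only merges neighbours)
# Stage 4: uniq -c counts each value
# Stage 5: sort -rn orders by count, largest first
summary=$(grep -v "^#" "$scratch/demo.gff" | cut -f3 | sort | uniq -c | sort -rn)
echo "$summary"

# The most frequent feature type is on the first line of the summary
top=$(echo "$summary" | head -n 1 | awk '{print $2}')
echo "Most frequent feature: $top"

rm -r "$scratch"
```

In this file `exon` appears twice and every other type once, so `exon` is reported as the most frequent feature. Running the pipeline with the later stages removed, one at a time, is a quick way to see what each stage contributes.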

### Redirection

```bash
# Save output to a file (overwrites existing file)
grep ">" reference.fasta > chromosome_headers.txt

# Append output to a file
echo "Analysis completed at $(date)" >> analysis.log

# Redirect error messages to a file (the alignments themselves still print to the terminal)
bwa mem reference.fasta reads.fastq 2> alignment_errors.log

# Redirect output and errors to separate files
bwa mem reference.fasta reads.fastq > aligned.sam 2> alignment.log
```
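
Programs write to two separate streams: standard output (stream 1) and standard error (stream 2). The sketch below makes this visible with `demo_tool`, a hypothetical shell function standing in for a real program; it is not a `bwa` or KCGEB command.

```bash
scratch=$(mktemp -d)

# A stand-in for a real tool: one line to standard output, one to standard error
demo_tool() {
    echo "result line"            # stream 1 (stdout)
    echo "progress message" >&2   # stream 2 (stderr)
}

# > captures stdout and 2> captures stderr, each in its own file
demo_tool > "$scratch/out.txt" 2> "$scratch/err.txt"

# 2>&1 sends stderr to wherever stdout currently points, so both land in one file.
# Order matters: the file redirection must come before 2>&1.
demo_tool > "$scratch/both.txt" 2>&1

out=$(cat "$scratch/out.txt")
err=$(cat "$scratch/err.txt")
both_lines=$(wc -l < "$scratch/both.txt")
echo "stdout file: $out | stderr file: $err | combined lines: $both_lines"

rm -r "$scratch"
```

Keeping the two streams apart is usually what you want for tools that write results to standard output, because log messages mixed into a results file can corrupt it.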

## 7. Working with Compressed Files

Sequencing data is almost always compressed to save storage space. At KCGEB, we use gzip compression for most files.

### Compression Commands

```bash
# Compress a file (replaces original with .gz version)
gzip sample_001.fastq

# Decompress a file
gunzip sample_001.fastq.gz

# Keep the original file while compressing
gzip -k sample_001.fastq

# View compressed file without decompressing (use gzcat on macOS)
zcat sample_001.fastq.gz | head
gzcat sample_001.fastq.gz | head  # macOS

# Search in compressed files
zgrep "@SRR" sample_001.fastq.gz

# Count lines in a compressed file (use gzcat on macOS)
zcat sample_001.fastq.gz | wc -l
```
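
If you move between Linux and macOS, `gzip -dc` is an alternative that behaves the same on both: `-d` decompresses and `-c` writes to standard output, leaving the `.gz` file untouched. A minimal sketch on a two-read demonstration file:

```bash
scratch=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$scratch/demo.fastq"

# -k keeps the original alongside the new demo.fastq.gz
gzip -k "$scratch/demo.fastq"

# Stream the decompressed content into wc without writing it to disk
lines=$(gzip -dc "$scratch/demo.fastq.gz" | wc -l)
reads=$(( lines / 4 ))
echo "Reads in compressed file: $reads"

# Both the original and the compressed file still exist
kept=$(ls "$scratch" | wc -l)
echo "Files in scratch directory: $kept"

rm -r "$scratch"
```

The sketch reports 2 reads and 2 files, confirming that streaming with `-c` did not remove or replace anything.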

## 8. File Permissions

On shared systems like the KCGEB cluster, understanding permissions ensures your data is protected while allowing collaborators appropriate access.

### Understanding Permission Strings

```
-rw-r--r-- 1 trainee kcgeb 2.5G Jan 15 14:30 sample.fastq
```

| Position | Meaning |
|----------|--------|
| `-` | File type (- = file, d = directory) |
| `rw-` | Owner permissions (read, write, no execute) |
| `r--` | Group permissions (read only) |
| `r--` | Others permissions (read only) |

### Changing Permissions

```bash
# Make a script executable
chmod +x run_pipeline.sh

# Give group members read and write access
# (X adds execute only for directories, which is what lets the group enter them)
chmod g+rwX project_data/

# Remove read access for others
chmod o-r sensitive_data.txt

# Set specific permissions (owner: rwx, group: rx, others: none)
chmod 750 scripts/
```
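
Numeric modes are sums of read (4), write (2), and execute (1), with one digit each for owner, group, and others. The sketch below applies a numeric and a symbolic mode to a scratch file and reads the result back from `ls -l`; the file is an empty placeholder, not a real pipeline script.

```bash
scratch=$(mktemp -d)
touch "$scratch/run_pipeline.sh"

# 7 = 4+2+1 (rwx) for the owner, 5 = 4+1 (r-x) for the group, 0 (---) for others
chmod 750 "$scratch/run_pipeline.sh"

# The first 10 characters of ls -l are the file type plus the permission string
perms=$(ls -l "$scratch/run_pipeline.sh" | cut -c1-10)
echo "$perms"

# A symbolic mode changes only what it names: remove execute from the group
chmod g-x "$scratch/run_pipeline.sh"
perms_after=$(ls -l "$scratch/run_pipeline.sh" | cut -c1-10)
echo "$perms_after"

rm -r "$scratch"
```

The two lines printed are `-rwxr-x---` and `-rwxr-----`: the numeric mode sets all nine permission bits at once, while the symbolic mode leaves the owner and others untouched.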

## 9. Hands-On Exercise

Using the training datasets, practise your Linux skills:

### Exercise: Working with the Training Data

```bash
# 1. Navigate to the training datasets directory
cd training_datasets
ls -lh

# 2. Count how many lines are in the compressed FASTQ file (on macOS, use gzcat instead of zcat)
zcat SRR10532784_1.fastq.gz | wc -l
# Divide by 4 to get number of reads

# 3. Extract the first 100 reads (4 lines each = 400 lines) to a new file
zcat SRR10532784_1.fastq.gz | head -400 > first_100_reads.fastq

# 4. Count the lines that contain the sequence "ATCGATCG"
zcat SRR10532784_1.fastq.gz | grep -c "ATCGATCG"

# 5. View the VCF header and count the number of samples
bcftools view -h sheep.snp.vcf.gz | tail -1 | tr '\t' '\n' | tail -n +10 | wc -l

# 6. Use samtools to view BAM statistics
samtools flagstat u9_liver_100.bam

# 7. Create a summary report
echo "Training Data Summary" > summary.txt
echo "===================" >> summary.txt
echo "FASTQ reads: $(zcat SRR10532784_1.fastq.gz | wc -l | awk '{print $1/4}')" >> summary.txt
echo "BAM alignments: $(samtools view -c u9_liver_100.bam)" >> summary.txt
cat summary.txt
```

## 10. Summary

In this session, we covered essential Linux commands for bioinformatics work at KCGEB:

| Category | Commands |
|----------|----------|
| Navigation | `pwd`, `ls`, `cd` |
| File Management | `mkdir`, `touch`, `cp`, `mv`, `rm`, `rmdir` |
| Viewing Files | `head`, `tail`, `cat`, `less`, `wc` |
| Searching | `grep`, `find` |
| Pipes and Text Processing | `cut`, `sort`, `uniq` |
| Compression | `gzip`, `gunzip`, `zcat` (`gzcat` on macOS), `zgrep` |
| Permissions | `chmod` |

### Key Takeaways

1. Always know where you are (`pwd`) before making changes
2. Use `ls -l` to check file details before operations
3. Be careful with `rm` - there is no undo
4. Combine commands with pipes for powerful workflows
5. Compress files to save storage space

## Next Session

In the next session, we will explore NCBI and UCSC databases, learning how to download reference genomes, annotations, and public sequencing data for your analyses.

## Additional Resources

- [GNU Coreutils Manual](https://www.gnu.org/software/coreutils/manual/)
- [Linux Command Line Basics - Ubuntu Tutorial](https://ubuntu.com/tutorials/command-line-for-beginners)
- [The Linux Command Line Book (free PDF)](https://linuxcommand.org/tlcl.php)
- [Explain Shell - Breaks down complex commands](https://explainshell.com/)