# UMR QB1 - Seminar on Gene Expression Analysis

---

## Medical Problem & Research Question

### **Central Clinical Question:**
**"_What are the molecular differences between cancer cells and normal human tissue, and how can we use these differences to identify new therapeutic targets for cancer treatment?_"**

### **Why This Matters for Medicine:**
- **Cancer heterogeneity:** Different cancers have distinct molecular signatures
- **Precision medicine:** Treatments must be tailored to specific cancer types
- **Drug resistance:** Cancer cells evolve to evade treatment
- **Therapeutic targets:** New drugs are desperately needed for better patient outcomes

### **What We'll Discover:**
1. **Molecular cancer signatures:** Genes consistently altered in cancer vs normal tissue
2. **Therapeutic vulnerabilities:** Pathways that could be targeted with drugs
3. **Drug repositioning opportunities:** Existing drugs that might treat cancer
4. **Biomarker identification:** Genes that could predict treatment response

---


## Dataset: A Real Cancer vs Normal Tissue Study

**Clinical Context:** Universal Human Reference (UHR) vs Human Brain Reference (HBR)  
**Medical Relevance:** Cancer cell lines vs normal human brain tissue<br>
**Sample Size:** 6 samples (3 cancer replicates vs 3 normal brain replicates)  
**Data Type:** Paired-end RNA-sequencing (chromosome 22 subset)  
**Reference:** Griffith M, Walker JR, Spies NC, Ainscough BJ, Griffith OL (2015) Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud. *PLoS Comput Biol* 11(8): e1004393. https://doi.org/10.1371/journal.pcbi.1004393



### **Medical Samples**

#### **UHR (Universal Human Reference) = CANCER SAMPLES**
- **Composition:** Total RNA from 10 different human cancer cell lines
- **Cancer types included:** Breast, liver, cervix, testis, brain, and skin cancers, plus cancers derived from immune cells (T cell, B cell, macrophage/histiocyte)
- **Why this matters:** Represents the common molecular features shared across different cancer types
- **Clinical relevance:** Helps identify pan-cancer therapeutic targets

#### **HBR (Human Brain Reference) = NORMAL TISSUE CONTROLS**
- **Composition:** Total RNA from brains of 23 healthy Caucasians, mostly 60-80 years old
- **Why brain tissue:** Provides a normal tissue baseline for comparison
- **Clinical relevance:** Shows what "healthy" gene expression looks like

### **The Biological Hypothesis:**
**Cancer cells will show systematic changes in gene expression compared to normal tissue, revealing:**
1. **Oncogenes** (cancer-promoting genes) that are overexpressed
2. **Tumor suppressors** (cancer-preventing genes) that are silenced
3. **Metabolic pathways** altered to support cancer growth
4. **Drug targets** that could selectively kill cancer cells

---

## Learning Objectives

By the end of this seminar, students will:

### **Technical Skills:**
1. Execute differential expression analysis to identify cancer biomarkers
2. Run pathway enrichment to understand cancer biology
3. Use computational drug repositioning for therapeutic discovery

### **Medical Understanding:**
1. **Interpret cancer gene signatures** in clinical context
2. **Identify potential biomarkers** for cancer diagnosis/prognosis
3. **Understand drug repositioning** as a strategy for faster therapeutic development
4. **Connect computational findings** to real cancer treatment decisions

### **Clinical Translation:**
1. **Evaluate therapeutic targets** identified through RNA-seq
2. **Assess drug candidates** for cancer treatment potential
3. **Understand precision medicine** approaches to cancer care


## Part 0: Software Installation and Setup (10 minutes)

### Step 0.1: Install Conda in Google Colab

After running this cell, the runtime will restart automatically. Wait for it to complete, then continue.

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

### Step 0.2: Verify Conda installation

In [None]:
import condacolab
condacolab.check()

### Step 0.3: Install all required software and packages

In [None]:
%%bash
conda install -c bioconda -c conda-forge salmon gffread bioconductor-deseq2 r-optparse r-ggplot2 r-gprofiler2 r-dplyr r-tidyr -y -q 2>&1

### Step 0.4: Download Data and script repository

**Background: Version Control and Reproducible Research**

In modern bioinformatics, we use version control systems like Git to manage our analysis scripts and data. This ensures reproducibility and allows collaboration. We're downloading a repository that contains:
- Pre-processed RNA-seq data files
- Analysis scripts for each step
- Reference genome files and annotations

In [None]:
%%bash
git clone https://gitlab.uni-rostock.de/wb283/qb1rnaseq.git

In [None]:
%%bash
ls

In [None]:
%%bash
ls -lht

In [None]:
# Change working directory permanently
import os
os.chdir('qb1rnaseq')

In [None]:
%%bash
ls

In [None]:
%%bash
tar xzf griffith-data.tar.gz

In [None]:
%%bash
ls -lht

In [None]:
%%bash
head samples.tsv

---

## Part 1: Gene Expression Datasets (2 minutes)

All data and scripts are now included in the repository! No separate download needed.

### Step 1.1: Verify Dataset Contents

**🧬 Background: FASTQ Files and RNA-seq Data Format**

**What are FASTQ files?**
FASTQ files contain the raw sequencing data from RNA-seq experiments. Each sequencing read is represented by 4 lines:
1. **Header line**: Starts with `@`, contains read identifier
2. **Sequence line**: The actual DNA/RNA sequence (A, T, G, C)
3. **Plus line**: Starts with `+`, sometimes repeats the identifier
4. **Quality line**: ASCII characters representing base quality scores
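
The four-line record structure above can be illustrated with a few lines of Python. This is a minimal sketch, not part of the course repository: the function name `read_fastq` and the two reads in the example are made up for illustration, and it assumes each record occupies exactly four lines (no wrapped sequences).

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality

# Two made-up reads, for illustration only
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nTTGCAAGC\n+\nIIII####\n"
)
records = list(read_fastq(example))
print(len(records))    # number of reads = number of lines / 4
print(records[1][1])   # sequence of the second read
```

To try it on the real data, replace the `io.StringIO` object with `open('reads/UHR_1_R1.fq')`.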

**Why paired-end sequencing?**
Each RNA fragment is sequenced from both ends (R1 and R2), providing:
- Better mapping accuracy
- Detection of splice junctions
- Improved quantification of gene expression

**Our dataset:**
- **UHR samples**: Mixed cancer cell lines (Universal Human Reference)
- **HBR samples**: Normal brain tissue (Human Brain Reference)
- **Chr22 subset**: Educational dataset focusing on chromosome 22 genes


In [None]:
%%bash
ls reads/

In [None]:
%%bash
# expression data
head reads/UHR_1_R1.fq

In [None]:
%%bash
wc -l reads/HBR_1_R1.fq

In [None]:
%%bash
wc -l reads/UHR_1_R1.fq | awk '{print $1/4}'

In [None]:
%%bash
# Count reads per sample
for sample in UHR_1 UHR_2 UHR_3 HBR_1 HBR_2 HBR_3; do
    count=$(wc -l reads/${sample}_R1.fq | awk '{print $1/4}')
    echo "$sample: $count reads"
done

### Step 1.2: Examine Input Data Formats

**Background: Understanding Bioinformatics File Formats**

Let's examine the three fundamental file types used in RNA-seq analysis to understand their structure and biological meaning:

#### 1.2.1 FASTQ Format - Raw Sequencing Data

- **Line 1** (`@`): Read identifier with sequencing machine info, flow cell coordinates
- **Line 2**: DNA sequence (A, T, G, C, or N for an uncalled base); RNA is reverse-transcribed to cDNA before sequencing, so reads contain T rather than U
- **Line 3** (`+`): Separator (sometimes repeats identifier)
- **Line 4**: Quality scores - each character represents confidence in the corresponding base

**Quality scores (Phred scores):**
- High quality: `I` (Phred 40) = 99.99% accuracy
- Medium quality: `B` (Phred 33) = 99.95% accuracy  
- Low quality: `#` (Phred 2) = 37% accuracy
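
The conversion from quality character to accuracy can be sketched as follows. The sketch assumes the common Phred+33 encoding (quality = ASCII code minus 33), which matches the three examples above; the function names are illustrative.

```python
def phred_score(char, offset=33):
    """Phred quality of one quality character (Phred+33 encoding assumed)."""
    return ord(char) - offset

def base_accuracy(char, offset=33):
    """Probability that the base call is correct: 1 - 10^(-Q/10)."""
    q = phred_score(char, offset)
    return 1 - 10 ** (-q / 10)

for c in "IB#":
    print(c, phred_score(c), f"{base_accuracy(c):.2%}")
```

Each step of 10 in the Phred score corresponds to a tenfold drop in the error probability.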

In [None]:
%%bash
ls -lht

In [None]:
%%bash
# expression data
head reads/UHR_1_R2.fq

#### 1.2.2 GTF Format - Gene Annotations

- **Column 1**: Chromosome name (22)
- **Column 2**: Annotation source (HAVANA, ENSEMBL)
- **Column 3**: Feature type (gene, transcript, exon, CDS)
- **Columns 4-5**: Start and end coordinates (1-based, both inclusive)
- **Column 6**: Score (usually `.` for missing)
- **Column 7**: Strand (`+` forward, `-` reverse)
- **Column 8**: Frame (for coding sequences)
- **Column 9**: Attributes (gene_id, gene_name, transcript_id, etc.)
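
To make the nine columns concrete, here is a small Python sketch that splits one GTF line into named fields. The example line and the function name `parse_gtf_line` are made up for illustration; the sketch assumes tab-separated columns and `key "value";` attribute pairs, and it keeps only the last value when a key such as `tag` appears more than once.

```python
def parse_gtf_line(line):
    """Split one GTF line into its nine columns and an attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    return {
        "chrom": chrom, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame, "attributes": attributes,
    }

# Made-up example line (not taken from refs/22.gtf)
line = '22\tHAVANA\texon\t100\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "DEMO";'
record = parse_gtf_line(line)
print(record["feature"], record["start"], record["end"])
print(record["attributes"]["gene_name"])
```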

**Medical relevance:**
- Tells us where genes are located on the chromosome
- Defines exon boundaries (the parts of a gene retained in mature mRNA)
- Links gene IDs to human-readable gene names
- Essential for quantifying expression accurately


In [None]:
%%bash
ls refs/

In [None]:
%%bash
# transcriptome annotations
head refs/22.gtf

#### 1.2.3 FASTA Format - Genome Sequence

- **Header line** (`>`): Sequence identifier and description
- **Sequence lines**: Raw DNA sequence (A, T, G, C, N)
  - `A` = Adenine
  - `T` = Thymine  
  - `G` = Guanine
  - `C` = Cytosine
  - `N` = Unknown/ambiguous base

**Biological context:**
- This is the reference genome sequence for chromosome 22
- Contains ~51 million base pairs
- Includes both coding (genes) and non-coding regions
- Used as a template to identify where RNA-seq reads originated

**How these files work together:**
1. **FASTQ files**: Contain the experimental expression data (what we measured)
2. **GTF file**: Tells us where genes are located (the map)
3. **FASTA file**: Provides the reference sequence (the genome sequence blueprint)

Together, they allow us to determine which genes the RNA-seq reads came from and quantify their expression levels.

In [None]:
%%bash
# sequence data
head refs/22.fa

## Part 2: Build Transcriptome and Quantify Expression (15 minutes)

### Step 2.1: Extract Transcriptome and Build Salmon Index

**🧬 Background: Transcriptomes vs Genomes for RNA-seq Analysis**

**Why extract a transcriptome?**
- **Genome**: Contains all DNA sequences including introns, exons, and regulatory regions
- **Transcriptome**: Contains only the mature mRNA sequences (exons spliced together)
- **RNA-seq reads**: Come from mature mRNA, so they match transcriptome sequences

**What is gffread doing?**
The `gffread` tool reads the GTF annotation file to identify exon coordinates, then extracts and splices these exonic sequences from the genome to create a transcriptome FASTA file.
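
The idea of cutting exons out of the genome and joining them can be sketched in Python. This is a simplified illustration of the concept, not `gffread`'s actual implementation: the toy genome, the exon coordinates, and the function name are made up, and coordinates are 1-based and inclusive as in GTF.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def spliced_transcript(genome, exons, strand):
    """Join exon sequences in genomic order; reverse-complement on the '-' strand.

    exons: list of (start, end) pairs, 1-based and inclusive as in GTF.
    """
    ordered = sorted(exons)
    sequence = "".join(genome[start - 1:end] for start, end in ordered)
    if strand == "-":
        sequence = sequence.translate(COMPLEMENT)[::-1]
    return sequence

# Toy 18-base 'chromosome' with two exons separated by an intron
genome = "AAACCCGGGTTTAAACCC"
exons = [(1, 3), (7, 9)]
plus = spliced_transcript(genome, exons, "+")
minus = spliced_transcript(genome, exons, "-")
print(plus, minus)
```

The intron (`CCC` at positions 4-6) is absent from both outputs, which is why RNA-seq reads from mature mRNA match the transcriptome directly.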

**What is a Salmon index?**
Salmon uses a sophisticated indexing strategy that creates:
- A compressed representation of all transcripts
- K-mer (short sequence) lookup tables for fast mapping
- This allows ultra-fast quantification without traditional alignment
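
A k-mer lookup table can be illustrated with a toy example. This sketch only conveys the idea; Salmon's real index uses far more compact data structures and a different matching procedure. The transcript names, sequences, and the small `k` are made up for illustration.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index

def candidate_transcripts(read, index, k):
    """Return the transcripts that contain every k-mer of the read."""
    hits = None
    for i in range(len(read) - k + 1):
        names = index.get(read[i:i + k], set())
        hits = set(names) if hits is None else hits & names
    return hits or set()

transcripts = {"tx1": "ACGTACGGA", "tx2": "TTACGGACC"}
index = build_kmer_index(transcripts, k=4)
shared = candidate_transcripts("ACGGA", index, k=4)   # found in both transcripts
unique = candidate_transcripts("GTACG", index, k=4)   # found only in tx1
print(sorted(shared), sorted(unique))
```

Looking up k-mers in a table is much faster than aligning each read base by base, which is the main reason index-based quantification is quick.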

**Clinical relevance:**
Accurate quantification is essential for identifying biomarkers and therapeutic targets in cancer research.


In [None]:
%%bash
gffread -w chr22_transcriptome.fa -g refs/22.fa refs/22.gtf

In [None]:
%%bash
head chr22_transcriptome.fa

In [None]:
%%bash
salmon index -t chr22_transcriptome.fa -i salmon_index -k 31 2>&1

### Step 2.2: Quantify All Samples

**Background: RNA-seq Quantification with Salmon**

**What is RNA-seq quantification?**
RNA-seq quantification estimates how abundant each transcript's RNA was in each sample (its expression level). This process involves:

1. **Mapping reads**: Determining which gene/transcript each sequencing read came from
2. **Counting**: Tallying how many reads map to each gene
3. **Normalization**: Adjusting for sequencing depth and gene length differences

**How does Salmon work?**
Unlike traditional aligners, Salmon uses:
- **Lightweight mapping**: Fast probabilistic assignment of reads to transcripts
- **EM algorithm**: Handles reads that map to multiple transcripts
- **Bias correction**: Accounts for sequence composition and positional biases
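
The EM idea for reads that match several transcripts can be shown with a toy example. This is a deliberately simplified sketch: it ignores transcript length and bias models, which Salmon does take into account, and the read sets below are invented.

```python
def em_read_assignment(read_compat, transcripts, n_iter=100):
    """Toy EM: split ambiguous reads in proportion to current abundance estimates.

    read_compat: one set of compatible transcript names per read.
    Returns the expected number of reads per transcript.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: share each read among its compatible transcripts
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            if total == 0:
                continue
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: abundances become the fraction of reads assigned
        n_reads = len(read_compat)
        theta = {t: counts[t] / n_reads for t in transcripts}
    return counts

# 6 reads unique to A, 2 unique to B, 4 that match both
reads = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 4
estimated = em_read_assignment(reads, ["A", "B"])
print({t: round(v, 3) for t, v in estimated.items()})
```

Because A has three times as many unique reads as B, the four ambiguous reads end up split roughly 3:1, giving about 9 reads for A and 3 for B. This is also why estimated read counts can be fractional.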

**Command parameters explained:**
- `-i salmon_index`: Use our pre-built index
- `-l A`: Auto-detect library type (paired-end, strand-specific, etc.)
- `-1` and `-2`: Forward and reverse read files (paired-end)
- `-o`: Output directory for results
- `-q`: Quiet mode (less verbose output)
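
As an illustration of how these parameters fit together, the sketch below assembles one `salmon quant` command per sample in Python without running it. The contents of `mapping.sh` are not shown here, so this is an assumption about what such a loop might look like; in particular, the output directory naming (`salmon_quant/<sample>`) is hypothetical and may differ from what the script produces.

```python
def salmon_quant_command(sample, index_dir="salmon_index",
                         reads_dir="reads", out_dir="salmon_quant"):
    """Build the argument list for one paired-end salmon quant run."""
    return [
        "salmon", "quant",
        "-i", index_dir,                       # pre-built index
        "-l", "A",                             # auto-detect library type
        "-1", f"{reads_dir}/{sample}_R1.fq",   # forward reads
        "-2", f"{reads_dir}/{sample}_R2.fq",   # reverse reads
        "-o", f"{out_dir}/{sample}",           # hypothetical output folder name
        "-q",                                  # quiet mode
    ]

samples = ["UHR_1", "UHR_2", "UHR_3", "HBR_1", "HBR_2", "HBR_3"]
commands = [salmon_quant_command(s) for s in samples]
print(" ".join(commands[0]))
```

Each list could be passed to `subprocess.run(command, check=True)` to execute it; building the commands first makes them easy to inspect before anything runs.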

**Medical significance:**
Gene expression differences between cancer and normal samples reveal:
- Oncogenes (overexpressed in cancer)
- Tumor suppressors (underexpressed in cancer)
- Potential therapeutic targets

In [None]:
%%bash
mkdir -p salmon_quant

In [None]:
%%bash
ls -lht

In [None]:
%%bash
bash mapping.sh

In [None]:
%%bash
cat mapping.sh

### Step 2.3: Examine Salmon Quantification Results

**Background: Understanding Salmon Output Format**

**What is in the `quant.sf` file?**
Salmon produces a quantification file (quant.sf) for each sample containing the estimated expression levels for all transcripts. This file is the core output that we'll use for downstream differential expression analysis.

**Let's examine the quantification results:**

In [None]:
%%bash
ls -lht salmon_quant/

In [None]:
%%bash
ls -lht salmon_quant/HBR_rep1/

In [None]:
%%bash
head salmon_quant/HBR_rep1/quant.sf

In [None]:
%%bash
tail salmon_quant/HBR_rep1/quant.sf

**Understanding the quant.sf format:**

**Column explanations:**
- **Name**: Transcript identifier (from our GTF annotation file)
- **Length**: Transcript length in base pairs (from the annotation)
- **EffectiveLength**: Length used for TPM calculation (accounts for the fragment length distribution and, if bias correction is enabled, sequencing biases)
- **TPM**: **Transcripts Per Million** - normalized expression measure
- **NumReads**: Estimated number of reads assigned to this transcript (what we'll use for DESeq2)

**Why these metrics matter:**

**TPM (Transcripts Per Million):**
- **Normalized measure**: Accounts for both sequencing depth and transcript length
- **Comparable across samples**: TPM values sum to one million in every sample, so they can be compared across samples as relative shares of the transcript pool
- **Range**: 0 to very high (no upper limit)
- **Interpretation**: TPM of 1 means 1 transcript per million transcripts in the sample
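
The TPM calculation itself is short enough to write out. The sketch below uses the standard definition (reads per effective base, rescaled to sum to one million); the three transcripts and their numbers are invented, and it assumes at least one transcript has a non-zero read count.

```python
def tpm(num_reads, effective_lengths):
    """Transcripts Per Million from read counts and effective lengths."""
    # Reads per effective base removes the advantage of long transcripts
    rates = {t: num_reads[t] / effective_lengths[t] for t in num_reads}
    total = sum(rates.values())
    # Rescale so that the values sum to one million
    return {t: rate / total * 1e6 for t, rate in rates.items()}

num_reads = {"TX_A": 100, "TX_B": 100, "TX_C": 300}
effective_lengths = {"TX_A": 1000, "TX_B": 2000, "TX_C": 3000}
values = tpm(num_reads, effective_lengths)
print({t: round(v) for t, v in values.items()})
```

Note that `TX_A` and `TX_C` receive the same TPM even though `TX_C` has three times as many reads, because `TX_C` is also three times as long.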

**NumReads (Read Counts):**
- **Estimated abundance**: Estimated number of sequencing reads assigned to each transcript
- **Statistical analysis**: Used by DESeq2 for differential expression testing
- **Not always whole numbers**: Reads that match several transcripts are split between them, so values can be fractional and are rounded to integers for count-based statistical models
- **Not normalized**: Higher values could mean higher expression OR deeper sequencing

**Length vs EffectiveLength:**
- **Length**: Actual transcript length from annotation
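- **EffectiveLength**: Number of positions where a fragment could start, given the fragment length distribution; further adjusted for biases (GC content, positional bias) when bias correction is enabled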
- **Importance**: Longer transcripts naturally get more reads, so normalization is crucial

**Clinical Relevance:**
- **Biomarker discovery**: TPM values help identify consistently expressed genes
- **Drug targets**: NumReads provides statistical power for finding significant differences
- **Personalized medicine**: Expression levels guide treatment decisions

**What to expect:**
- **High TPM/NumReads**: Highly expressed genes (often housekeeping genes)
- **Zero values**: Genes not expressed in this tissue/condition
- **Intermediate values**: Tissue-specific or condition-dependent expression

---

## Part 3: Differential Expression Analysis (20 minutes)

### Step 3.1: Load Libraries and Data

**Background: Data Import and Preparation for Statistical Analysis**

**What is our R script doing?**
The `load_data.R` script performs several critical steps:

1. **Data Integration**: Combines Salmon results from all 6 samples into a single count matrix
2. **Sample Annotation**: Creates metadata linking each sample to its experimental condition
3. **Quality Control**: Checks data integrity and reports basic statistics
4. **Format Preparation**: Converts data into the format required for DESeq2 analysis

**Why use count data?**
- **Raw counts**: Represent the actual number of sequencing reads per gene
- **Statistical requirements**: Count-based statistical models (like DESeq2) need integer counts
- **Comparability**: Counts can be normalized across samples for fair comparison

**Medical context:**
This step transforms raw sequencing data into a format suitable for identifying genes that are differentially expressed between cancer and normal tissue.


In [None]:
%%bash
ls -lht

In [None]:
%%bash
cat samples.tsv

In [None]:
%%bash
# 2025-07-14
Rscript load_data.R --input salmon_quant/ --metadata samples.tsv --output counts.tsv

### Step 3.1.1: Examine the Count Matrix

**Background: Understanding the Gene Expression Count Matrix**

**What is a count matrix?**
The count matrix is the fundamental data structure for RNA-seq analysis. It's a table where:
- **Rows**: Represent genes/transcripts
- **Columns**: Represent samples (cancer_rep1, cancer_rep2, etc.)
- **Values**: Number of sequencing reads assigned to each gene in each sample
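
How per-sample `quant.sf` files turn into one matrix can be sketched in Python. This is not the code in `load_data.R`, whose details are not shown here; it is an illustration that assumes the standard `quant.sf` columns (`Name`, `NumReads`) and rounds estimated counts to integers. The transcript names and numbers are made up.

```python
import csv
import io

def read_numreads(quant_text):
    """Map transcript name -> NumReads from the text of one quant.sf file."""
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}

def build_count_matrix(quant_by_sample):
    """Return (sample names, {transcript: [rounded count per sample]})."""
    samples = list(quant_by_sample)
    per_sample = {s: read_numreads(quant_by_sample[s]) for s in samples}
    transcripts = sorted(set().union(*per_sample.values()))
    matrix = {
        tx: [round(per_sample[s].get(tx, 0.0)) for s in samples]
        for tx in transcripts
    }
    return samples, matrix

HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
quants = {  # made-up quant.sf contents for two samples
    "cancer_rep1": HEADER + "TX_A\t1000\t850.0\t600000.0\t120.3\n"
                          + "TX_B\t2000\t1850.0\t400000.0\t80.0\n",
    "normal_rep1": HEADER + "TX_A\t1000\t850.0\t100000.0\t9.7\n"
                          + "TX_B\t2000\t1850.0\t900000.0\t190.2\n",
}
samples, matrix = build_count_matrix(quants)
print(samples)
for transcript, row in matrix.items():
    print(transcript, row)
```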

**Why is this important?**
- **Statistical foundation**: All differential expression analysis starts with this matrix
- **Data quality**: Allows us to spot potential issues before analysis
- **Biological insight**: Shows which genes are highly vs lowly expressed
- **Clinical relevance**: Forms the basis for identifying cancer biomarkers

**Let's examine our count matrix:**


In [None]:
%%bash
wc -l counts.tsv

In [None]:
%%bash
head counts.tsv

In [None]:
%%bash
tail counts.tsv

**Understanding the count matrix structure:**

**Gene_ID Column:**
- **Transcript identifiers**: From our GTF annotation file
- **ENST format**: Ensembl transcript IDs (e.g., ENST00000215832.4)
- **Version numbers**: The `.4` suffix is the version of that transcript record, which increases when the annotated transcript changes
- **Biological meaning**: Each ID represents a specific mRNA transcript

**Sample Columns:**
- **cancer_rep1, cancer_rep2, cancer_rep3**: Three replicates of the pooled cancer cell line RNA (UHR)
- **normal_rep1, normal_rep2, normal_rep3**: Three replicates of the pooled normal brain RNA (HBR)
- **Replication importance**: Multiple replicates allow statistical testing; because each group comes from a single RNA pool, these are technical rather than biological replicates

**Count Values:**
- **Integer numbers**: Whole read counts (0, 1, 2, 150, 5000, etc.)
- **Dynamic range**: From 0 (not expressed) to tens of thousands (highly expressed)
- **Raw counts**: Not yet normalized for library size or gene length

**What the numbers tell us:**

| Genes | High count values | Medium count values | Low count values | Zero count values |
| --- | --- | --- | --- | --- |
| Counts range | >1000 | 10-1000 | 1-9 | 0 |
| Typical content | <ul><li>Housekeeping genes: Essential cellular functions (ribosomal proteins, metabolism)</li><li>Tissue-specific genes: Genes characteristic of the tissue type</li><li>Abundant transcripts: Genes producing lots of mRNA</li></ul> | <ul><li>Regulated genes: Genes that may change between conditions</li><li>Functional specialization: Genes involved in specific biological processes</li><li>Potential biomarkers: Candidates for differential expression</li></ul> | <ul><li>Lowly expressed genes: Genes with minimal transcription</li><li>Tissue-inappropriate genes: Genes not typically expressed in this tissue</li><li>Technical noise: Some low counts may be background</li></ul> | <ul><li>Not expressed: Gene is turned off in this sample/condition</li><li>Below detection: Expression too low to detect reliably</li><li>Technical artifacts: Missing due to technical limitations</li></ul> |


**Clinical and Research Implications:**

**Quality indicators:**
- **Similar patterns**: Replicates should show similar count patterns
- **Dynamic range**: Good-quality samples should show diverse expression levels
- **Library sizes**: Total counts should be similar across samples

**Biological insights:**
- **Condition differences**: Cancer vs normal samples may show different patterns
- **Gene expression hierarchy**: Most genes lowly expressed, few highly expressed
- **Functional categories**: Different gene types have characteristic expression levels

This count matrix serves as the foundation for identifying genes that are differentially expressed between cancer and normal tissue, which will guide our search for therapeutic targets.

### Step 3.2: Run Differential Expression Analysis

**Background: Statistical Analysis with DESeq2**

**What is differential expression analysis?**
Differential expression analysis identifies genes whose expression levels significantly differ between experimental conditions (cancer vs normal). This involves:

1. **Normalization**: Adjusting for differences in sequencing depth and composition
2. **Variance modeling**: Estimating biological and technical variability
3. **Statistical testing**: Using negative binomial models to test for significant differences
4. **Multiple testing correction**: Adjusting p-values for testing thousands of genes simultaneously

**How does DESeq2 work?**
DESeq2 uses sophisticated statistical methods:
- **Size factors**: Normalize for sequencing depth differences
- **Dispersion estimation**: Model gene-specific variance
- **Wald test**: Test for significant expression differences
- **Benjamini-Hochberg**: Control false discovery rate
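
The size-factor step can be made concrete with a short Python sketch of the median-of-ratios idea that DESeq2 uses for normalization. This is a simplified re-implementation for teaching purposes, not DESeq2's own code; the gene names and counts are invented, and the second sample was constructed to be sequenced exactly twice as deeply as the first.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of counts (one per sample).
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(next(iter(counts.values())))
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

counts = {
    "gene1": [10, 20],
    "gene2": [100, 200],
    "gene3": [50, 100],
    "gene4": [0, 5],     # skipped: contains a zero
}
factors = size_factors(counts)
print([round(f, 3) for f in factors])
```

Dividing each sample's counts by its size factor puts the two samples on the same scale. Using the median makes the estimate robust to a handful of strongly changed genes.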

**Key outputs:**
- **Log2 fold change**: Magnitude of expression difference
- **P-value**: Statistical significance of the difference
- **Adjusted p-value**: Corrected for multiple testing
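
To show where adjusted p-values come from, here is a minimal Python sketch of the Benjamini-Hochberg procedure. It covers only the correction step; DESeq2 additionally filters low-count genes before adjusting, which this sketch does not attempt. The p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvalues = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvalues)
print([round(p, 4) for p in padj])
```

Each adjusted value is at least as large as the raw p-value, which reflects the extra caution needed when thousands of genes are tested at once.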

**Clinical interpretation:**
- **Positive fold change**: Higher expression in cancer (potential oncogenes)
- **Negative fold change**: Lower expression in cancer (potential tumor suppressors)

In [None]:
%%bash
# 2025-07-14
Rscript run_deseq2.R \
    --metadata samples.tsv \
    --expression counts.tsv \
    --output-degs degs.tsv \
    --output-plots pca_plot.png,heatmap.png,volcano_plot.png \
    --output-image deseq2_results.RData

In [None]:
%%bash
head degs.tsv

In [None]:
%%bash
grep "ENST00000328933.9" refs/22.gtf | cut -f9

### Step 3.3: Create Visualization Plots

**Background: Data Visualization for Biological Interpretation**

**Why visualize RNA-seq results?**
Visualization helps us understand complex datasets and communicate findings effectively. Our three key plots serve different purposes:

**1. PCA Plot (Principal Component Analysis):**
- **Purpose**: Shows overall similarity/differences between samples
- **What it reveals**: Whether cancer and normal samples cluster separately
- **Clinical significance**: Validates that cancer has distinct molecular signatures

**2. Volcano Plot:**
- **Purpose**: Displays both statistical significance and biological magnitude
- **X-axis**: Log2 fold change (magnitude of difference)
- **Y-axis**: -log10 p-value (statistical significance)
- **Clinical significance**: Identifies the most promising therapeutic targets

**3. Heatmap:**
- **Purpose**: Shows expression patterns of top differentially expressed genes
- **Colors**: Red = high expression, Blue = low expression
- **Clinical significance**: Reveals potential biomarkers for cancer diagnosis

**Our visualization script:**
The three plots were written by `run_deseq2.R` in Step 3.2 (see its `--output-plots` argument); the script uses ggplot2 to produce figures suitable for scientific papers and clinical presentations.

In [None]:
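from IPython.display import Image, display

# Show the plots written by run_deseq2.R in Step 3.2 (heatmap last)
for plot in ['pca_plot.png', 'volcano_plot.png', 'heatmap.png']:
    display(Image(plot))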

Now let's see the annotations of one of these transcripts, e.g. `ENST00000390323.2`:

In [None]:
%%bash
# Annotations from differentially expressed genes
grep "ENST00000390323.2" refs/22.gtf | cut -f9

---

## Part 4: Pathway Analysis (15 minutes)

### Step 4.1: Run Pathway Analysis

**Background: Understanding Biological Pathways in Cancer**

**What is pathway enrichment analysis?**
Instead of looking at individual genes, pathway analysis examines groups of genes that work together in biological processes. This helps us understand:

1. **Biological mechanisms:** How cancer disrupts normal cellular functions
2. **Therapeutic targets:** Which pathways could be targeted with drugs
3. **Disease understanding:** The underlying biology of cancer progression

**How does pathway enrichment work?**
1. **Gene sets**: Pre-defined groups of genes involved in specific biological processes
2. **Statistical testing**: Determines if cancer-associated genes are overrepresented in specific pathways
3. **Databases used**:
   - **GO (Gene Ontology)**: Biological processes, molecular functions
   - **KEGG**: Metabolic and signaling pathways
   - **Reactome**: Detailed biological reactions
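
The statistical test behind over-representation analysis can be sketched with the hypergeometric distribution. This is a generic illustration with invented numbers; the repository's `pathway_analysis.R` may use a different tool-specific procedure and multiple-testing correction.

```python
from math import comb

def enrichment_pvalue(n_background, n_pathway, n_degs, n_overlap):
    """P(overlap >= n_overlap) when n_degs genes are drawn at random.

    n_background: all genes tested, n_pathway: genes in the pathway,
    n_degs: differentially expressed genes, n_overlap: DEGs in the pathway.
    """
    upper = min(n_pathway, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# 20 genes tested, 5 belong to the pathway, 4 are DEGs, 3 of those are in the pathway
p = enrichment_pvalue(n_background=20, n_pathway=5, n_degs=4, n_overlap=3)
print(round(p, 4))
```

A small p-value means the overlap is larger than expected by chance. In practice this test is repeated for every pathway, so the resulting p-values also need multiple-testing correction.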

**Expected cancer pathways:**
- **Upregulated**: Cell cycle, DNA replication, metabolic reprogramming
- **Downregulated**: Apoptosis, DNA repair, immune response

**Clinical relevance:**
Pathway analysis reveals which biological systems are disrupted in cancer, guiding therapeutic strategies and drug development.

In [None]:
%%bash
Rscript pathway_analysis.R --input degs.tsv --output pathway_results.tsv

In [None]:
%%bash
head pathway_results.tsv

---

## Part 5: Drug Repositioning (10 minutes)

### Step 5.1: Drug Repositioning in Cancer Research

**What is drug repositioning?**
Drug repositioning (also called drug repurposing) involves finding new therapeutic uses for existing drugs. This approach offers several advantages:

1. **Faster development**: 5-10 years vs 15-20 years for new drugs
2. **Known safety profiles**: Existing drugs have established safety data
3. **Lower costs**: Reduces the risk and expense of drug development
4. **Immediate clinical application**: Can be prescribed off-label in some cases

**How does computational drug repositioning work?**
Our approach uses gene expression signatures:
1. **Cancer signature**: Lists of oncogenes (upregulated) and tumor suppressors (downregulated)
2. **Drug effects database**: How thousands of drugs affect gene expression
3. **Signature matching**: Find drugs that reverse cancer gene expression patterns

**Success stories:**
- **Metformin**: Diabetes drug → Cancer prevention (200+ clinical trials)
- **Aspirin**: Pain relief → Colorectal cancer prevention research (not FDA-approved for this use)
- **Rapamycin**: Immunosuppressant → Cancer and aging research

**Our script preparation:**
Converts our DESeq2 results into a format compatible with L1000CDS2, a major drug repositioning database.
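
The conversion from a DEG table to an up/down signature can be sketched as follows. This is not the code in `drug_repositioning.py`; the layout of `degs.tsv` is not shown here, so the `gene` column name is an assumption, while `log2FoldChange` and `padj` follow DESeq2's usual result column names. The cutoffs and the gene names in the example are illustrative.

```python
import csv
import io

def split_signature(deg_text, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Split a tab-separated DEG table into up- and down-regulated gene lists."""
    up, down = [], []
    for row in csv.DictReader(io.StringIO(deg_text), delimiter="\t"):
        if row["padj"] in ("", "NA"):
            continue  # no adjusted p-value available for this gene
        padj = float(row["padj"])
        lfc = float(row["log2FoldChange"])
        if padj >= padj_cutoff or abs(lfc) < lfc_cutoff:
            continue  # not significant or change too small
        (up if lfc > 0 else down).append(row["gene"])
    return up, down

# Made-up table using the assumed column names
table = (
    "gene\tlog2FoldChange\tpadj\n"
    "GENE_A\t3.2\t0.001\n"
    "GENE_B\t-2.5\t0.01\n"
    "GENE_C\t0.4\t0.0001\n"
    "GENE_D\t4.0\tNA\n"
    "GENE_E\t-1.8\t0.2\n"
)
up_genes, down_genes = split_signature(table)
print(up_genes, down_genes)
```

Signature-matching tools generally expect gene symbols rather than transcript IDs, so a real pipeline also needs an ID-to-symbol mapping step.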

### Step 5.2: Querying the L1000CDS2 Drug Database

**Background: The L1000CDS2 Drug Repositioning Platform**

**What is L1000CDS2?**
L1000CDS2 (L1000 Characteristic Direction Signature Search Engine) is a computational tool developed by the Ma'ayan Laboratory that:

1. **Database scope**: Contains gene expression signatures for >20,000 drugs tested on human cell lines
2. **Signature matching**: Uses mathematical algorithms to find drugs that reverse disease signatures
3. **LINCS program**: Part of the NIH Library of Integrated Network-based Cellular Signatures initiative

**How does the algorithm work?**
1. **Input signature**: Our cancer gene signature (oncogenes + tumor suppressors)
2. **Database search**: Compares against drug-induced expression changes
3. **Scoring system**: Calculates how well each drug reverses the cancer signature
4. **Ranking**: Returns drugs ranked by their potential to counteract cancer
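
The four steps above can be illustrated with a toy scoring function. This is a conceptual sketch only and does not reproduce L1000CDS2's actual scoring; it uses plain cosine similarity between log fold-change vectors, and the drug names, genes, and values are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two {gene: log fold change} signatures."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def rank_drugs(disease_signature, drug_signatures):
    """Sort drugs so that the strongest reversers (most negative) come first."""
    scores = {
        name: cosine_similarity(disease_signature, signature)
        for name, signature in drug_signatures.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

disease = {"G1": 2.0, "G2": -1.0}            # up in cancer, down in cancer
drugs = {                                    # hypothetical drug-induced changes
    "drug_X": {"G1": -2.0, "G2": 1.0},       # opposite direction: reverses
    "drug_Y": {"G1": 1.0, "G2": -0.5},       # same direction: mimics
}
ranked = rank_drugs(disease, drugs)
for name, score in ranked:
    print(name, round(score, 3))
```

In this toy setup a score near -1 means the drug pushes expression in the opposite direction to the disease, and a score near +1 means it pushes in the same direction.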

**Interpreting the results:**
- **Negative scores**: Drugs that reverse cancer signatures (high therapeutic potential)
- **Positive scores**: Drugs that mimic cancer signatures (avoid these)
- **Score magnitude**: Larger absolute values indicate stronger effects

**Clinical validation:**
The system has successfully identified:
- Known cancer drugs (validates the approach)
- Repositioned drugs already in cancer trials
- Novel repositioning opportunities for further investigation


In [None]:
%%bash
python drug_repositioning.py --input degs.tsv --output drug_candidates.txt

In [None]:
%%bash
cat drug_candidates.txt

### Step 5.3: Interpret Drug Repositioning Results

**💊 Background: Clinical Interpretation of Drug Repositioning Results**

**Understanding L1000CDS2 Scores:**

The drug repositioning analysis produces a ranked list of compounds based on their ability to reverse cancer gene signatures. Here's how to interpret the results:

**🎯 Score Interpretation:**
- **Negative scores**: High therapeutic potential (drugs that reverse cancer signatures)
- **Positive scores**: Avoid these drugs (they mimic or worsen cancer signatures)
- **Score magnitude**: Larger absolute values indicate stronger predicted effects

**✅ Validation Categories:**

**1️⃣ Known Cancer Drugs (Positive Controls):**
- **Examples**: Doxorubicin, Paclitaxel, Cisplatin, Tamoxifen, Imatinib
- **Significance**: Validates our computational approach
- **Clinical meaning**: Confirms the cancer signature is biologically relevant
- **Research value**: Shows the method can identify established cancer therapeutics

**2️⃣ Successfully Repositioned Drugs:**
- **Metformin**: Originally for diabetes → Now in 200+ cancer clinical trials
- **Aspirin**: Originally for pain/inflammation → Studied for colorectal cancer prevention (not FDA-approved for this use)
- **Clinical success**: These drugs show how repositioning hypotheses move into clinical testing
- **Patient benefit**: Already available for off-label use in some cases

**3️⃣ Promising Repositioning Candidates:**
- **Statins** (cholesterol drugs): Anti-cancer properties discovered in studies
- **Rapamycin** (immunosuppressant): Active cancer and aging research
- **Chloroquine** (antimalarial): Being investigated for cancer applications
- **Research opportunity**: Novel applications requiring further validation

**🚀 Clinical Translation Success Stories:**

**📈 Metformin (Diabetes → Cancer Prevention):**
- **Original discovery**: Diabetic patients had unexpectedly lower cancer rates
- **Mechanism**: Targets cancer cell metabolism through AMPK activation and mTOR pathway inhibition
- **Current status**: Phase III clinical trials for cancer prevention
- **Clinical application**: Already prescribed to diabetic cancer patients
- **Evidence**: Some studies show 30% reduction in cancer risk

**📈 Aspirin (Pain Relief → Cancer Prevention):**
- **Original discovery**: Regular aspirin users had reduced colorectal cancer incidence
- **Mechanism**: Inhibits chronic inflammation that promotes cancer development
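- **Current status**: Not FDA-approved for cancer prevention; guidance on preventive use has varied over time and by risk group
- **Clinical application**: Preventive use is decided case by case, weighing the benefit against bleeding risk
- **Evidence**: Some long-term studies report reduced colorectal cancer incidence and deaths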

**🔬 Why Drug Repositioning Works:**

**Scientific advantages:**
- **Faster development timeline**: 5-10 years vs 15-20 years for new drugs
- **Known safety profiles**: Existing drugs have established safety and side effect data
- **Lower development costs**: Reduces financial risk for pharmaceutical companies
- **Regulatory advantages**: Faster approval process for new indications

**Biological rationale:**
- **Pathway targeting**: Many diseases share common molecular pathways
- **Polypharmacology**: Single drugs often affect multiple biological targets
- **Network effects**: Drugs can impact interconnected cellular systems
- **Serendipitous discoveries**: Unexpected beneficial effects in different diseases

**🏥 Clinical Translation Process:**

**Validation pipeline:**
1. **In vitro testing**: Test drug effects on cancer cell lines
2. **Animal studies**: Validate efficacy and safety in cancer models
3. **Phase I trials**: Test safety and dosing in cancer patients
4. **Phase II trials**: Evaluate efficacy in specific cancer types
5. **Phase III trials**: Compare to standard treatments
6. **Clinical implementation**: Integration into treatment guidelines

**Patient selection strategies:**
- **Biomarker-guided therapy**: Use gene expression to select patients
- **Combination approaches**: Pair repositioned drugs with standard treatments
- **Risk stratification**: Target high-risk patients for prevention
- **Personalized dosing**: Optimize doses based on individual characteristics


---

## Part 6: Summary and Clinical Applications (5 minutes)

### What We Accomplished Today

**🎉 RNA-seq Analysis Pipeline Complete!**

**Technical achievements:**
- ✅ Downloaded and verified real cancer vs normal tissue data
- ✅ Built transcriptome index and quantified gene expression using Salmon
- ✅ Identified cancer-associated genes (oncogenes & tumor suppressors) with DESeq2
- ✅ Created publication-quality visualizations (PCA, volcano plot, heatmap)
- ✅ Performed pathway enrichment analysis to understand biological mechanisms
- ✅ Discovered drug repositioning candidates using computational approaches
- ✅ Learned professional bioinformatics workflow and command-line tools

**From FASTQ files → Therapeutic insights in 90 minutes!**

### Clinical Translation and Medical Impact

**🏥 How This Analysis Guides Cancer Medicine:**

#### **1. Biomarker Discovery**
**Diagnostic markers:** Genes consistently different between cancer and normal tissue can serve as diagnostic biomarkers. For example, if certain genes are always overexpressed in cancer samples, they could be used in blood tests or tissue biopsies to detect cancer early.

**Prognostic indicators:** Expression patterns that predict patient outcomes help physicians make treatment decisions. Patients with high expression of certain oncogenes might need more aggressive treatment, while those with favorable signatures might do well with less intensive therapy.

**Predictive biomarkers:** Signatures that indicate drug response likelihood enable personalized medicine. Before prescribing a targeted therapy, doctors could test whether a patient's tumor has the molecular signature that predicts response to that specific drug.
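
As a rough illustration of how a single gene could be screened as a diagnostic marker, the sketch below computes the ROC AUC: the probability that a randomly chosen cancer sample shows higher expression than a randomly chosen normal sample. The function name `biomarker_auc` and the expression values are hypothetical and are not taken from the seminar data.

```python
from itertools import product


def biomarker_auc(cancer_values, normal_values):
    """Return the ROC AUC of one gene as a cancer-vs-normal classifier.

    Computed as the fraction of (cancer, normal) sample pairs in which the
    cancer sample has the higher expression; ties count as half a win.
    """
    if not cancer_values or not normal_values:
        raise ValueError("both groups need at least one sample")
    wins = 0.0
    for c, n in product(cancer_values, normal_values):
        if c > n:
            wins += 1.0
        elif c == n:
            wins += 0.5
    return wins / (len(cancer_values) * len(normal_values))


# Hypothetical normalized expression values for one gene (assumption).
cancer = [8.1, 7.9, 8.4]
normal = [5.2, 5.0, 8.0]

auc = biomarker_auc(cancer, normal)
print(f"AUC = {auc:.2f}")  # 8 of 9 pairs favor cancer -> AUC = 0.89
```

An AUC near 1.0 suggests good separation and 0.5 means no separation. With only three replicates per group, such an estimate is very unstable, so any candidate would need testing in an independent patient cohort.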

#### **2. Drug Target Identification**
**Oncogenes:** Genes overexpressed in cancer represent potential targets for inhibition. For example, kinase inhibitors can block overactive growth signals in cancer cells.

**Tumor suppressors:** Pathways that are suppressed in cancer could be restored through therapeutic intervention. For instance, drugs that reactivate p53 tumor suppressor function are being developed.

**Metabolic vulnerabilities:** Cancer-specific dependencies can be exploited therapeutically. If cancer cells become dependent on certain metabolic pathways, drugs targeting those pathways could selectively kill cancer cells while sparing normal cells.
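
The split into candidate inhibition targets (overexpressed) and candidate restoration targets (suppressed) can be sketched as a simple filter over a differential expression table. This is a minimal sketch, assuming rows shaped like a DESeq2 results table (`log2FoldChange`, `padj`), a positive fold change meaning higher expression in the cancer samples, and illustrative cutoffs. The gene names and values are placeholders.

```python
def classify_genes(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Split significant genes into up- and down-regulated candidates.

    results: list of dicts with keys "gene", "log2FoldChange", "padj".
    Rows with a missing padj (reported as NA by DESeq2) are skipped.
    """
    up, down = [], []
    for row in results:
        padj = row["padj"]
        if padj is None or padj >= padj_cutoff:
            continue  # not significant, or filtered out upstream
        lfc = row["log2FoldChange"]
        if lfc >= lfc_cutoff:
            up.append(row["gene"])    # overexpressed in cancer
        elif lfc <= -lfc_cutoff:
            down.append(row["gene"])  # suppressed in cancer
    return up, down


# Placeholder rows (assumption), not results from the seminar analysis.
results = [
    {"gene": "GENE_A", "log2FoldChange": 3.2, "padj": 1e-8},
    {"gene": "GENE_B", "log2FoldChange": -2.5, "padj": 0.002},
    {"gene": "GENE_C", "log2FoldChange": 0.4, "padj": 0.01},
    {"gene": "GENE_D", "log2FoldChange": 4.0, "padj": None},
    {"gene": "GENE_E", "log2FoldChange": 1.8, "padj": 0.2},
]

up, down = classify_genes(results)
print("Candidate inhibition targets:", up)
print("Candidate restoration targets:", down)
```

Differential expression alone does not establish that a gene is an oncogene or a tumor suppressor. The two lists are starting points for literature checks and experimental follow-up.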

#### **3. Drug Repositioning Success Stories**

**Metformin: Diabetes drug → Cancer prevention**
- **Discovery process:** Epidemiological studies noticed that diabetic patients taking metformin had lower cancer rates than expected
- **Mechanism understanding:** Research indicates that metformin alters cancer cell metabolism by activating AMPK, which in turn inhibits mTOR signaling
- **Clinical validation:** Over 200 clinical trials are now testing metformin for cancer prevention and treatment
- **Current status:** Some observational studies report roughly 30% lower cancer risk among metformin users, but randomized trials have so far not confirmed a clear benefit; metformin remains approved for diabetes, not for cancer indications

**Aspirin: Pain relief → Cancer prevention**
- **Discovery process:** Large epidemiological studies found that regular aspirin users had significantly reduced colorectal cancer incidence
- **Mechanism understanding:** Aspirin reduces chronic inflammation, which is known to promote cancer development
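- **Clinical validation:** Randomized evidence is mixed: some trials and long-term follow-up analyses support reduced colorectal cancer incidence, while others found no benefit
- **Current status:** Not FDA-approved for cancer prevention; low-dose aspirin is considered case by case for high-risk patients, with bleeding risk weighed against potential benefit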

#### **4. Real-World Clinical Applications**

**Patient care example:**
A 65-year-old patient with a newly diagnosed tumor could benefit from this analysis workflow:
1. **Tumor biopsy** → RNA-seq analysis performed in hospital laboratory
2. **Gene signature analysis** → Compare the patient's tumor to a normal tissue reference database
3. **Therapeutic target identification** → Identify which oncogenes are overactive
4. **Treatment selection** → Choose targeted therapy based on molecular profile
5. **Response monitoring** → Follow-up RNA-seq to assess treatment effectiveness
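
Steps 2 and 3 of this workflow can be sketched as a per-gene z-score of the patient's tumor against normal reference samples. This is a simplified illustration, not a clinical method: the function name `overactive_genes`, the cutoff of 2 standard deviations, and all expression values are assumptions made for the example.

```python
from statistics import mean, stdev


def overactive_genes(patient, reference, z_cutoff=2.0):
    """Flag genes whose tumor expression lies far above the normal reference.

    patient:   dict gene -> normalized expression in the tumor sample
    reference: dict gene -> list of normalized values from normal samples
    Returns a dict gene -> z-score for genes with z >= z_cutoff.
    """
    flagged = {}
    for gene, value in patient.items():
        ref = reference.get(gene)
        if ref is None or len(ref) < 2:
            continue  # no usable reference for this gene
        sd = stdev(ref)
        if sd == 0:
            continue  # z-score undefined without reference variation
        z = (value - mean(ref)) / sd
        if z >= z_cutoff:
            flagged[gene] = round(z, 2)
    return flagged


# Hypothetical log-scale expression values (assumption).
reference = {
    "GENE_A": [5.0, 5.2, 4.8],
    "GENE_B": [7.0, 7.1, 6.9],
}
patient = {"GENE_A": 8.0, "GENE_B": 7.05, "GENE_C": 9.0}

flagged = overactive_genes(patient, reference)
print("Overactive in tumor:", flagged)
```

Here `GENE_A` is flagged, `GENE_B` falls within the normal range, and `GENE_C` is skipped because it has no reference values. A real reference would need far more than three normal samples, processed with the same protocol as the tumor.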

**Precision medicine implementation:**
- **Tumor boards** use RNA-seq results to make treatment recommendations
- **Clinical trials** enroll patients based on molecular signatures
- **Therapeutic monitoring** tracks changes in gene expression during treatment
- **Resistance prediction** identifies mechanisms before clinical resistance develops



---

### Professional Bioinformatics Skills Learned

**🔧 Technical Skills Acquired:**
- **RNA-seq data processing pipeline:** Complete workflow from FASTQ to results
- **Differential expression analysis:** Statistical methods for identifying significant changes
- **Data visualization:** Creating publication-quality scientific figures
- **Pathway enrichment analysis:** Understanding biological mechanisms
- **Command-line proficiency:** Professional bioinformatics tool usage
- **Quality control:** Assessing data reliability and identifying potential issues

**🎯 Professional Skills Developed:**
- **Version control understanding:** Using Git repositories for reproducible research
- **Workflow management:** Organizing complex multi-step analyses
- **Error handling strategies:** Dealing with technical challenges professionally
- **Results interpretation:** Connecting statistical results to biological meaning
- **Scientific communication:** Presenting findings to clinical audiences

**🏥 Medical Applications Explored:**
- **Cancer biomarker identification:** Finding genes useful for diagnosis and prognosis
- **Therapeutic target discovery:** Identifying molecular vulnerabilities in cancer
- **Drug repositioning strategies:** Finding new uses for existing medications
- **Precision medicine approaches:** Tailoring treatment to individual molecular profiles

### Key Learning Outcomes

**Students now understand:**

1. **How cancer differs molecularly from normal tissue:** Cancer involves systematic changes in gene expression that affect multiple cellular pathways, creating vulnerabilities that can be targeted therapeutically.

2. **Why RNA-seq is essential for precision medicine:** Gene expression profiling provides the molecular information needed to classify cancers, predict treatment responses, and monitor therapeutic efficacy.

3. **How computational biology accelerates drug discovery:** Bioinformatics approaches can prioritize therapeutic targets and repositioning candidates faster than experimental screening alone, although every prediction still requires experimental validation.

4. **The complete pipeline from sequencing to therapeutics:** How raw sequencing data becomes actionable clinical information through systematic computational analysis.

5. **Professional bioinformatics workflow practices:** How to work with industry-standard tools, reproducible research methods, and systematic approaches to biological data analysis.

6. **How to handle technical challenges in research:** How to apply problem-solving skills and backup strategies when analyses don't go as planned.

### Next Steps in Cancer Research

**🔬 Experimental Validation:**
- **Cell line studies:** Test drug predictions in cancer cell cultures
- **Animal model validation:** Confirm therapeutic effects in mouse cancer models
- **Biomarker validation:** Test identified genes in independent patient cohorts
- **Mechanism studies:** Understand how identified drugs affect cancer cells

**🏥 Clinical Translation:**
- **Clinical trial design:** Develop protocols to test promising drug candidates
- **Diagnostic test development:** Create clinical assays for identified biomarkers
- **Treatment guideline integration:** Incorporate findings into clinical practice standards
- **Regulatory approval:** Navigate FDA processes for new therapeutic indications

**📊 Computational Extensions:**
- **Full transcriptome analysis:** Analyze complete gene expression profiles, not just chromosome 22
- **Multi-cancer studies:** Compare our findings across different cancer types
- **Multi-omics integration:** Combine RNA-seq with DNA, protein, and metabolite data
- **Machine learning applications:** Use AI approaches for more sophisticated pattern recognition

**🌍 Global Health Impact:**
- **Accessible diagnostics:** Develop cost-effective tests for resource-limited settings
- **Prevention strategies:** Implement repositioned drugs for cancer prevention programs
- **Personalized medicine democratization:** Make precision medicine available to more patients
- **Training programs:** Educate healthcare workers in genomic medicine approaches


---

## Assessment Questions for Medical Students

### **Clinical Reasoning Questions:**

1. **Therapeutic Application:**
   - "Based on our analysis, how would you prioritize the drug repositioning candidates for a cancer patient? Consider efficacy, safety, and current approval status."

2. **Biomarker Development:**
   - "Which genes from our analysis would make the best diagnostic biomarkers for cancer? What additional validation would be needed?"

3. **Precision Medicine:**
   - "How could this RNA-seq approach be integrated into routine cancer care? What are the practical challenges and benefits?"

### **Data Interpretation:**

1. **Results Analysis:**
   - "Why do known cancer drugs (doxorubicin, paclitaxel) appear in our repositioning results? What does this tell us about our computational approach?"

2. **Clinical Translation:**
   - "Metformin shows promise for cancer treatment. Design a clinical trial to test this hypothesis, including patient selection criteria and endpoints."

### **Professional Skills:**

1. **Workflow Management:**
   - "The L1000CDS2 API was unavailable during our analysis. How did we handle this professionally, and why is this important in research?"

2. **Scientific Communication:**
   - "Present our key findings to a hospital tumor board. Focus on actionable insights for patient care."

---

**Congratulations! You've completed a professional-level RNA-seq analysis and learned how computational biology directly contributes to cancer treatment and drug discovery.**

*This hands-on experience demonstrates the complete pipeline from raw sequencing data to therapeutic insights, preparing future physicians for the era of genomic medicine and personalized cancer care.*

---

## Contact

Dr. rer. nat. Israel Barrantes <br>
Junior Research Group Translational Bioinformatics (head)<br>
Institute for Biostatistics and Informatics in Medicine and Ageing Research, Office 3017<br>
Rostock University Medical Center<br>
Ernst-Heydemann-Str. 8<br>
18057 Rostock, Germany<br>

Email: israel.barrantes[bei]uni-rostock.de

---
Last update 2025/10/25
