# Case 1: Sparse Genetic Architecture (Few Strong-Effect SNPs)

We will simulate a simple genetic trait where only a few (5) out of 1000 SNPs really matter. Those few SNPs have noticeable (moderate) effects. Because only a handful matter, they are easier to spot and a score built from just them can predict the trait well.

What you will learn:
- How we create (simulate) genotype data (numbers 0,1,2 for each SNP)
- How we build a trait (phenotype) from a few “causal” SNPs plus random noise
- Why we standardize (turn values into z-scores)
- How a simple genome-wide scan finds important SNPs
- How to build a Polygenic Score (PGS)
- How to see if the PGS is useful (decile plot & variance explained)

Key vocabulary (plain words):
- SNP: A spot in the genome that can vary between people
- Genotype value: 0, 1, or 2 copies of the effect allele
- Phenotype: The trait we measure (simulated)
- Causal SNP: A SNP we chose to actually influence the trait
- Noise: Random variation not explained by genetics

Try asking Copilot (beginner prompts):
- "What does 0/1/2 mean in genotype data?"
- "Why do we add noise to the phenotype?"
- "Explain what a causal SNP is in simple terms."

### Step 0: Imports 

In [None]:
# Step 0: Imports & Base Parameters
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dataclasses import dataclass
from typing import List, Dict, Tuple

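# For nicer plots in some environments.
# The 'seaborn-v0_8-*' style names exist in Matplotlib 3.6+; older releases use 'seaborn-whitegrid'.
try:
    plt.style.use('seaborn-v0_8-whitegrid')
except OSError:
    plt.style.use('seaborn-whitegrid')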



### Step 1: Simulate Genotypes & Phenotypes

**Configuration (dataclass):**  
We define all simulation knobs once:

- num_snps: total SNPs (default 1000)  
- num_individuals: people in discovery & target (each)  
- num_causal: how many SNPs truly affect the phenotype (first indices)  
- effect_size: per‑allele additive effect for each causal SNP  
- noise_std: standard deviation of random noise added to phenotype  
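- n_permutations: how many phenotype shuffles are used to build the empirical threshold that decides which SNPs enter the PGS  
- broad_PGS_k: how many top SNPs (by |r|) are kept for the optional broad PGS used later (default 100)  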

(You can change these in the SimulationConfig block below and re‑run.)

**What are we making?**  
We’re creating a simple, realistic genetic dataset for learning. Each “person” gets:
- **Genotypes:** 0 / 1 / 2 copies of an effect allele at 1,000 SNPs
- **Phenotype:** Built from a few causal SNPs + noise

**Why two groups?**
We split the data so we can discover patterns on one set and test them fairly on a fresh set.

- **Discovery set:** used to find SNP–trait signals and build the PGS
- **Target set:** used for honest evaluation of the PGS

**How do we simulate?**
1. Pick allele frequency per SNP  
2. Draw genotypes (0/1/2) via binomial draws  
3. Mark the first `num_causal` SNPs as causal (these are the only SNPs that influence the phenotype)  
4. Phenotype = sum(effect_size * genotype at causal SNPs) + noise  
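
As a rough illustration of steps 1 and 2, the sketch below compares genotype proportions from binomial draws with the expected proportions (1 − p)², 2p(1 − p), and p². The allele frequency `p = 0.3`, the sample size, the seed, and the name `rng_demo` are arbitrary choices for this example, not values taken from the notebook.

```python
import numpy as np

p = 0.3             # illustrative allele frequency (assumption for this sketch)
n_people = 100_000  # large sample so observed proportions sit close to expectation
rng_demo = np.random.default_rng(0)

# Each person carries two alleles; each one is the effect allele with probability p
genotypes = rng_demo.binomial(2, p, size=n_people)

# Expected proportions of 0, 1, and 2 copies when the two alleles are drawn independently
expected = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
observed = np.bincount(genotypes, minlength=3) / n_people

for g in range(3):
    print(f"genotype {g}: expected {expected[g]:.3f}, observed {observed[g]:.3f}")
```

A higher allele frequency shifts weight from genotype 0 toward genotypes 1 and 2, which is why common SNPs show more variation across people than rare ones.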

---

**Beginner prompts (try these for deeper understanding):**
- "How does allele frequency affect the chance of 0/1/2?"
- "What happens if I change the number of causal SNPs?"
- "Why do we need separate discovery and target sets?"
- "What real-world traits might follow this 'few causal SNPs' pattern?"

In [None]:
# Step 1: Config + Simulation + Train Preview (single cell)

from dataclasses import dataclass
import numpy as np
import pandas as pd
from IPython.display import display

# --- Configuration ---
@dataclass
class SimulationConfig:
    num_snps: int = 1000
    num_individuals: int = 1000
    num_causal: int = 5
    effect_size: float = 0.5   # per allele effect for each causal SNP
    noise_std: float = 1.0
    n_permutations: int = 100  # used later for permutation threshold
    broad_PGS_k: int = 100     # used later for optional broad PGS

config = SimulationConfig()

# --- Simulate Genotype & Phenotype (Discovery/Target) ---
rng = np.random.default_rng(42)

# Causal SNP indices (first num_causal)
causal_indices = np.arange(config.num_causal)

# Allele frequencies (0.1 to 0.5)
allele_freqs = rng.uniform(0.1, 0.5, size=config.num_snps)

# Genotypes 0/1/2 for train and test
geno_train = np.empty((config.num_individuals, config.num_snps), dtype=np.int8)
geno_test  = np.empty_like(geno_train)
for j, p in enumerate(allele_freqs):
    geno_train[:, j] = rng.binomial(2, p, size=config.num_individuals)
    geno_test[:, j]  = rng.binomial(2, p, size=config.num_individuals)

# Phenotypes: sum over causal SNPs + noise
phen_train = np.zeros(config.num_individuals, dtype=float)
phen_test  = np.zeros(config.num_individuals, dtype=float)
for snp in causal_indices:
    phen_train += config.effect_size * geno_train[:, snp]
    phen_test  += config.effect_size * geno_test[:, snp]
phen_train += rng.normal(0, config.noise_std, size=config.num_individuals)
phen_test  += rng.normal(0, config.noise_std, size=config.num_individuals)

# Package for later steps
data = {
    "allele_freqs": allele_freqs,
    "geno_train": geno_train,
    "geno_test": geno_test,
    "phen_train": phen_train,
    "phen_test": phen_test,
    "causal_snps": causal_indices,
}

print(f"Train phenotype mean/var: {phen_train.mean():.3f} / {phen_train.var():.3f}")
print(f"Causal SNP indices: {data['causal_snps']}")

# --- Preview (TRAIN ONLY) ---
n_rows = 5
n_snps = 8
snp_cols = [f"SNP_{i:04d}" for i in range(n_snps)]

df_train = pd.DataFrame(data["geno_train"][:n_rows, :n_snps], columns=snp_cols)
df_train["phen_train"] = data["phen_train"][:n_rows]

print("\nTrain (wide) first rows:")
display(df_train)

### Interpreting the Training Set Preview

You’re seeing only the discovery (training) table. The target (test) data exists but isn’t shown here.

- Columns
  - SNP_0000, SNP_0001, …: Genotypes coded 0/1/2 (0 = no copies of the effect allele, 1 = heterozygous, 2 = homozygous for the effect allele).
  - phen_train: Simulated trait = sum(effect_size × genotype at causal SNPs) + random noise. Higher = “more” of the trait.

- Reading a row (example)
  - SNP_0000=1, SNP_0001=2, …, phen_train=3.47 → genotypes at causal SNPs plus noise produce a phenotype of 3.47.

- Key points
  - Only the first num_causal SNPs truly affect the phenotype; others are noise.
  - Train and test are separate draws; don’t use test data until evaluation.

- effect_size (what it means)
  - Per-allele additive change in the phenotype (slope). Moving 0→1→2 copies adds ≈ effect_size each step.
  - Simulation vs real data: in this notebook we choose effect_size (β). In real studies, β is unknown and estimated (β̂) via GWAS (per‑SNP regression), with standard errors and p‑values.
  - Larger |effect_size| generally produces bigger |r| and a higher R² for the polygenic score (PGS, also written PRS later in this notebook), but detectability also depends on sample size, allele frequency, and noise. In this notebook we use r as a simple proxy weight for β̂.

- Quick checks
  - Genotypes are only 0/1/2.
  - Train and test (when viewed) should be on similar scales given the same settings.
  - Causal SNPs may not be visually obvious; association appears via correlation.

- Next
  - Standardize (z-score) to compare SNP–trait correlations on a common scale.
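
To make the link between effect_size (β) and r more concrete, here is a minimal sketch with a single causal SNP. It estimates the per-allele slope by least squares and shows that the same number can be recovered from the correlation as r × SD(y) / SD(x). All values (`n_people`, `p`, `beta`, `noise_sd`, the seed) are illustrative assumptions, not the notebook's data.

```python
import numpy as np

rng_demo = np.random.default_rng(1)
n_people, p, beta, noise_sd = 50_000, 0.3, 0.5, 1.0  # illustrative values

x = rng_demo.binomial(2, p, size=n_people).astype(float)    # one causal SNP
y = beta * x + rng_demo.normal(0, noise_sd, size=n_people)  # trait = beta * genotype + noise

# Per-allele slope from simple least squares: cov(x, y) / var(x)
beta_hat = np.cov(x, y, ddof=0)[0, 1] / x.var()

# Pearson correlation between genotype and trait
r = np.corrcoef(x, y)[0, 1]

# Slope and correlation differ only by the ratio of standard deviations
beta_from_r = r * y.std() / x.std()

print(f"beta_hat = {beta_hat:.3f}, r = {r:.3f}, r * sd(y)/sd(x) = {beta_from_r:.3f}")
```

Because r rescales the slope by SD(x)/SD(y), a SNP with the same β but a lower allele frequency (smaller SD(x)) will show a smaller |r|.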

### Step 2: Standardization

We now convert raw values into **z-scores** (also called standardization) in the **discovery set** (training data).

**What is a z-score (plain words)?**  
“How many standard deviations above (+) or below (–) the mean is this value?”

**Formula:**  
z = (value − mean) / standard_deviation

**Why we do this before correlations:**
- Puts every SNP column and the phenotype on the *same scale* (mean 0, SD 1)
- Makes Pearson r just the **average product** of two z-scored variables (acts like an effect size)
- Lets us compare SNP signals fairly (rare and common SNPs end up on the same scale)
- Keeps the *shape* and *ordering* of values (only shifts & rescales)

**What changes vs what stays the same:**
- Changes: mean becomes 0, spread becomes 1
- Stays the same: who is larger/smaller than whom; histogram shape (aside from axis numbers)

**Tiny example:**  
Raw values: 2, 4, 6 (mean=4, population SD≈1.633).  
z for 6 = (6 − 4)/1.633 ≈ +1.22 → “1.22 SD above average.”
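
The tiny example can be checked in a few lines. This sketch assumes the population SD (divide by N), which is what `np.std` computes by default and what the SD of about 1.633 corresponds to.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0])

# np.std divides by N by default (population SD), giving about 1.633 here
z = (values - values.mean()) / values.std()

print("z-scores:", np.round(z, 3))
print("mean, SD after z-scoring:", round(float(z.mean()), 3), round(float(z.std()), 3))
print("ordering preserved:", bool(np.all(np.argsort(values) == np.argsort(z))))
```

Using the sample SD (`ddof=1`) instead would give slightly smaller z-scores; the ordering of values is the same either way.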

**Intuition:**  
Standardization is like switching from different local units (inches, meters, feet) to a common “distance-from-average” unit so comparisons are clean.

**Optional prompts to explore:**
- “Why doesn’t standardization change ordering?”
- “Why is correlation simpler after z-scoring both variables?”

In [None]:
# Step 2: Standardize Genotypes and Phenotypes

def standardize_matrix(mat: np.ndarray):  # -> Tuple[np.ndarray, np.ndarray, np.ndarray]
    mean = mat.mean(axis=0)
    std = mat.std(axis=0)
    std[std == 0] = 1.0
    return (mat - mean) / std, mean, std


Z_geno_train, geno_mean_train, geno_std_train = standardize_matrix(data['geno_train'])
Z_geno_test, geno_mean_test_raw, geno_std_test_raw = standardize_matrix(data['geno_test'])  # independent standardization

phen_train = data['phen_train']
phen_test = data['phen_test']
Z_phen_train = (phen_train - phen_train.mean()) / phen_train.std()
Z_phen_test = (phen_test - phen_test.mean()) / phen_test.std()

# Diagnostics to show why raw vs standardized look similar
print("Raw phenotype   mean/std = {:.3f} / {:.3f}".format(phen_train.mean(), phen_train.std()))
print("Standardized    mean/std = {:.3f} / {:.3f}".format(Z_phen_train.mean(), Z_phen_train.std()))
print("Raw   min/max = {:.3f} / {:.3f}".format(phen_train.min(), phen_train.max()))
print("Z     min/max = {:.3f} / {:.3f}".format(Z_phen_train.min(), Z_phen_train.max()))

fig, axes = plt.subplots(1,2, figsize=(10,4))
# Raw distribution
axes[0].hist(phen_train, bins=30, color='skyblue', edgecolor='black')
axes[0].axvline(phen_train.mean(), color='k', linestyle='--', linewidth=1, label='Mean')
axes[0].set_title('Raw Phenotype (Discovery)')
axes[0].set_xlabel('Phenotype')
axes[0].legend()

# Standardized distribution (fixed axis to emphasize z-scale)
axes[1].hist(Z_phen_train, bins=30, color='salmon', edgecolor='black', density=True)
axes[1].axvline(0, color='k', linestyle='--', linewidth=1, label='Mean=0')
axes[1].set_xlim(-4,4)
axes[1].set_title('Standardized Phenotype (Discovery)')
axes[1].set_xlabel('Z-Phenotype')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Additional viz: one SNP before/after standardization (discovery set)
# Pick a SNP with allele frequency near 0.30 for variability
p = data['allele_freqs']
idx_snp = int(np.argmin(np.abs(p - 0.30)))
x_raw = data['geno_train'][:, idx_snp]
x_z = Z_geno_train[:, idx_snp]

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))

# Left: raw genotype distribution (0/1/2)
axes[0].hist(x_raw, bins=[-0.5, 0.5, 1.5, 2.5], color='steelblue', edgecolor='black')
axes[0].set_xticks([0,1,2])
axes[0].axvline(x_raw.mean(), color='k', linestyle='--', linewidth=1, label=f"Mean={x_raw.mean():.2f}")
axes[0].set_title(f"SNP_{idx_snp:04d} (raw)")
axes[0].set_xlabel('Genotype (0/1/2)')
axes[0].set_ylabel('Count')
axes[0].legend(frameon=False)

# Right: standardized genotype distribution (z-scores)
axes[1].hist(x_z, bins=30, color='indianred', edgecolor='black', density=True)
axes[1].axvline(0, color='k', linestyle='--', linewidth=1, label='Mean=0')
axes[1].axvline(1, color='gray', linestyle=':', linewidth=1, label='±1 SD')
axes[1].axvline(-1, color='gray', linestyle=':', linewidth=1)
axes[1].set_title(f"SNP_{idx_snp:04d} (standardized)")
axes[1].set_xlabel('Z-score')
axes[1].legend(frameon=False)

plt.tight_layout()
plt.show()

### Step 3: GWAS Correlations

**Goal.** See which SNPs move with (are associated with) the trait (phenotype).

**How we compute.** With z-scores, the per-SNP Pearson correlation is the average product:
$$
r_j = \frac{1}{N}\sum_{i=1}^{N} G^{(z)}_{ij}\,y^{(z)}_i
$$
Where N = number of individuals, G^{(z)}_{ij} = standardized genotype at SNP j, y^{(z)}_i = standardized phenotype.
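
As a quick sanity check on the formula, the sketch below computes the average product of z-scores for one toy SNP and compares it with `np.corrcoef`. The toy data, the seed, and the variable names are assumptions for illustration only.

```python
import numpy as np

rng_demo = np.random.default_rng(2)
n_people = 500

g = rng_demo.binomial(2, 0.3, size=n_people).astype(float)  # toy genotype column
y = 0.5 * g + rng_demo.normal(0, 1.0, size=n_people)         # toy phenotype

# z-score both variables with the population SD (ddof=0), the np.std default
g_z = (g - g.mean()) / g.std()
y_z = (y - y.mean()) / y.std()

r_avg_product = float((g_z * y_z).mean())  # average product of z-scores
r_numpy = float(np.corrcoef(g, y)[0, 1])   # NumPy's Pearson correlation

print(f"average product = {r_avg_product:.6f}")
print(f"np.corrcoef     = {r_numpy:.6f}")
```

The two numbers agree only when the z-scores use the divide-by-N standard deviation; with `ddof=1` the average product would be off by a factor of (N − 1)/N.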

**Reading r.**
- r_j ≈ 0: no detectable association
- r_j > 0: higher genotype → higher trait
- r_j < 0: higher genotype → lower trait
- Larger |r_j| ⇒ stronger SNP–trait link (r_j² is in-sample variance explained by SNP j)

**What to expect in Case 1 (sparse).** Because only 5 SNPs matter, those should show noticeably larger |r| than the rest.

**Next.** Use permutations to see how large the max |r| can be by chance and set a threshold.

In [None]:
# Step 3: GWAS via Per-SNP Correlations
r_values = (Z_geno_train * Z_phen_train[:, None]).mean(axis=0)
r_abs = np.abs(r_values)
max_r = r_abs.max()
max_r_snp = r_abs.argmax()
count_modest = (r_abs > 0.1).sum()
print(f"Maximum |r| = {max_r:.3f} at SNP {max_r_snp}")
print(f"Number of SNPs with |r| > 0.1: {count_modest} / {config.num_snps}")
print(f"Causal SNPs: {data['causal_snps']}")

In [None]:
# Optional Step 3 viz: one SNP vs phenotype (discovery set)

import numpy as np
import matplotlib.pyplot as plt

idx = int(np.argmax(r_abs))  # SNP with largest |r|
snp_name = f"SNP_{idx:04d}"

x_raw = data['geno_train'][:, idx]
y_raw = phen_train
x_z = Z_geno_train[:, idx]
y_z = Z_phen_train
r = float(r_values[idx]); r2 = r*r

jit_rng = np.random.default_rng(123)  # separate generator so the simulation `rng` from Step 1 is not overwritten
jit = (jit_rng.random(x_raw.size) - 0.5) * 0.12  # jitter to separate 0/1/2 columns

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Raw scale: shows per-allele shifts
axes[0].scatter(x_raw + jit, y_raw, s=8, alpha=0.5)
means = [y_raw[x_raw==g].mean() if np.any(x_raw==g) else np.nan for g in [0,1,2]]
axes[0].plot([0,1,2], means, 'r-o', lw=2, label='Group means')
axes[0].set_xticks([0,1,2]); axes[0].legend(frameon=False)
axes[0].set_xlabel('Genotype (0/1/2)'); axes[0].set_ylabel('Phenotype (raw)')
axes[0].set_title(f'{snp_name} vs phen_train (raw)')

# Z scale: matches how r is computed
axes[1].scatter(x_z + jit, y_z, s=8, alpha=0.5)
# Group means in z space (plot at mean x_z for each genotype 0/1/2)
x_means = [x_z[x_raw==g].mean() if np.any(x_raw==g) else np.nan for g in [0,1,2]]
y_means = [y_z[x_raw==g].mean() if np.any(x_raw==g) else np.nan for g in [0,1,2]]
axes[1].plot(x_means, y_means, 'r-o', lw=2, label='Group means')
axes[1].legend(frameon=False)
axes[1].set_xlabel('Genotype (z-scored)'); axes[1].set_ylabel('Phenotype (z-scored)')
axes[1].set_title(f'{snp_name} (|r|={abs(r):.2f}, R²={r2:.2f})')

plt.tight_layout(); plt.show()

### How the scatterplot connects to Step 3 correlations

- Left panel (raw):
  - Three vertical bands (genotype 0/1/2). The red “group means” line shows the average phenotype for each genotype.
  - A steeper upward line means a larger per‑allele shift in the raw phenotype (mirroring the causal effect_size β in our simulation).

- Right panel (z-scored):
  - Both axes are standardized, which is exactly how we compute r in Step 3.
  - Pearson r for this SNP is the average product r = mean(x_z · y_z). A tighter, more tilted cloud (and a rising red line) yields larger |r|.
  - The title displays this SNP’s |r| and R² = r² (in‑sample variance explained by this SNP).

- Why this matters:
  - The r shown here is the same r_values[idx] used in the Manhattan plot and later as the PRS weight for this SNP.
  - Larger |r| places a higher dot in the Manhattan plot, increases the chance of selection, and gives more weight in the PRS.

Note: Jitter only separates overlapping 0/1/2 points visually; it does not change r.

### Mini-GWAS framing (what we’re mimicking)

- A GWAS tests each SNP across the genome for association with a phenotype (one SNP at a time).  
- Real GWAS uses regression (**β, SE, p-value**) and includes covariates (age, sex, ancestry PCs) to control confounding.  
- In this exercise, we use **z-scores** and compute a simple **per-SNP correlation (r)** as a stand-in for GWAS effect size.  
- Output of this step is a “summary stats–like” table: **SNP ID, r, |r|**.  
- Next, we’ll visualize all SNPs together (Manhattan plot) and set a significance cut line via **permutation**—analogous to GWAS genome-wide thresholds.  

**TL;DR:** Step 3 ≈ a **mini-GWAS** pass over SNPs to get per-SNP effects we can carry forward.  


### Step 4: Permutation (simple) + Manhattan Plot

From one SNP to all SNPs
- The Step 3 scatterplot shows how a single SNP’s correlation (r) is computed on z-scored data.
- The Manhattan plot stacks this same |r| for every SNP to give a genome-wide view.

What we do (simple permutation)
- We ask: “How tall could the biggest |r| be just by chance?”
- Shuffle the phenotype B times.
- Each time, compute |r| across all SNPs and record the maximum.
- Use the 95th percentile of these maxima as the threshold line (controls the “tallest-by-chance” effect when scanning many SNPs).
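
To see why the threshold is built from the maximum rather than from a single SNP, here is a small sketch on pure-noise data (no SNP affects the trait). It compares the 95% cutoff for one fixed SNP with the 95% cutoff for the largest |r| across all SNPs. The sizes, seed, and variable names are illustrative assumptions and are smaller than the notebook's defaults to keep it fast.

```python
import numpy as np

rng_demo = np.random.default_rng(3)
n_people, n_snps, n_perm = 500, 200, 200  # small illustrative sizes

# Null data: genotypes and phenotype are generated independently
G = rng_demo.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)
G_z = (G - G.mean(axis=0)) / G.std(axis=0)
y = rng_demo.normal(size=n_people)
y_z = (y - y.mean()) / y.std()

single_abs = np.empty(n_perm)  # |r| at one fixed SNP, per shuffle
max_abs = np.empty(n_perm)     # largest |r| across all SNPs, per shuffle
for b in range(n_perm):
    y_shuf = rng_demo.permutation(y_z)
    r_shuf = (G_z * y_shuf[:, None]).mean(axis=0)
    single_abs[b] = abs(r_shuf[0])
    max_abs[b] = np.abs(r_shuf).max()

thr_single = float(np.quantile(single_abs, 0.95))
thr_max = float(np.quantile(max_abs, 0.95))
print(f"95% cutoff, one SNP: {thr_single:.3f}")
print(f"95% cutoff, max over {n_snps} SNPs: {thr_max:.3f}")
```

The max-based cutoff comes out higher, because scanning many SNPs gives chance more opportunities to produce one large |r|.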

How to read the Manhattan plot
- Dot height = |r| for a SNP; the dashed orange line is the permutation-based threshold.
- Orange dots = SNPs above the threshold (selected candidates).
- Red stars = true causal SNPs (unknown in real data).
- In this sparse case, a few clear peaks should rise above the background.

Beginner tip
- Like a “tallest building contest” among 1000 buildings: the threshold sits just above how tall the tallest would be under random heights. Peaks well above it are likely real.


In [None]:
# Step 4: Compact permutation + Manhattan (single threshold line)
def perm_threshold(Z_geno: np.ndarray, y: np.ndarray, B: int = 100, q: float = 0.95, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    max_abs = np.empty(B, dtype=float)
    for b in range(B):
        y_shuf = rng.permutation(y)
        r_shuf = (Z_geno * y_shuf[:, None]).mean(axis=0)
        max_abs[b] = np.abs(r_shuf).max()
    return float(np.quantile(max_abs, q))

ALPHA = 0.05
B = config.n_permutations if hasattr(config, "n_permutations") else 100
threshold = perm_threshold(Z_geno_train, Z_phen_train, B=B, q=1-ALPHA, seed=0)
selected_idx = np.flatnonzero(r_abs >= threshold)

print(f"Permutation threshold (95%): {threshold:.3f} | Selected SNPs: {selected_idx.size}")

# Manhattan plot (simple)
plt.figure(figsize=(12, 5))
x = np.arange(config.num_snps)
plt.scatter(x, r_abs, c='black', s=8, alpha=0.7, label='All SNPs')
if selected_idx.size:
    plt.scatter(selected_idx, r_abs[selected_idx], c='tab:orange', s=18, alpha=0.9, label='Selected (≥ threshold)')
plt.scatter(data['causal_snps'], r_abs[data['causal_snps']], c='red', s=80, marker='*', label='True causal SNPs')
plt.axhline(threshold, color='orange', linestyle='--', linewidth=1.5, label=f'95% perm threshold = {threshold:.3f}')
plt.xlabel('SNP Index'); plt.ylabel('|r| (association strength)')
plt.title('Manhattan Plot: |r| per SNP with Empirical Threshold')
plt.legend(loc='upper right', frameon=False)
plt.tight_layout(); plt.show()


### From GWAS results to Manhattan (our version)

- Classic GWAS Manhattan plots **−log10(p)** by genomic position; taller peaks = stronger evidence.  
- Here, we plot **|r|** instead of −log10(p). The goal is the same: a genome-wide view of signal strength.  
- GWAS uses fixed significance lines (e.g., *5×10⁻⁸*). We use a **permutation-based line** that reflects our data.  
- Selection rule: take all SNPs above the line; if none cross, use a **top-K by |r| fallback** (keep the sign of r for direction).  
- Same idea, simpler ingredients: our permutation line plays the role of a **GWAS significance threshold**.  
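
For readers who want to relate |r| to the −log10(p) axis of a classic Manhattan plot, the sketch below uses a rough large-sample approximation: with no association, r × √N is approximately standard normal. The helper `approx_neg_log10_p` is hypothetical and written for this example; a real GWAS would take p-values from regression output instead.

```python
import math

def approx_neg_log10_p(r: float, n: int) -> float:
    """Rough -log10(p) for a correlation r from n people (hypothetical helper).

    Assumes r * sqrt(n) is roughly standard normal under no association,
    which is only a large-sample approximation.
    """
    z = abs(r) * math.sqrt(n)
    p_two_sided = math.erfc(z / math.sqrt(2))
    # Guard against p underflowing to 0 for very large z
    return -math.log10(max(p_two_sided, 1e-300))

n_people = 1000  # same as the notebook's default num_individuals
for r in (0.05, 0.10, 0.20):
    print(f"|r| = {r:.2f} -> approx -log10(p) = {approx_neg_log10_p(r, n_people):.1f}")
```

The mapping is monotone, so ranking SNPs by |r| or by this approximate −log10(p) gives the same order when every SNP has the same sample size.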


### Step 5: Build and Evaluate a Polygenic Score

**Goal:** Turn SNP-trait associations into a predictive score and evaluate performance.

**What we do:**
1. **Select SNPs:** Use SNPs that passed the permutation threshold from Step 4
2. **Fallback strategy:** If no SNPs pass threshold, use top-K SNPs by |r| (K=100)
3. **Calculate PRS:** Weighted sum of standardized target-set genotypes (weights = correlations r from the discovery set)
4. **Evaluate performance:** Correlation metrics (r, R²) and decile stratification
5. **Visualize:** Scatter plot and decile plot showing PRS-phenotype relationship

The decile plot is particularly useful for seeing the practical impact of the PRS: it shows how the average trait value differs across individuals grouped by their genetic risk score.
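
The "weighted sum" in item 3 can be pictured with a toy example. The numbers below are made up for illustration (4 people, 3 selected SNPs), and `Z_toy` and `w_toy` are hypothetical names, not notebook variables.

```python
import numpy as np

# Toy standardized genotypes: 4 people x 3 selected SNPs (made-up numbers)
Z_toy = np.array([
    [ 1.2, -0.5,  0.3],
    [-0.8,  1.0, -1.1],
    [ 0.1, -0.5,  1.4],
    [-0.5,  0.0, -0.6],
])
# Toy discovery weights (correlations); the sign sets each SNP's direction
w_toy = np.array([0.20, -0.10, 0.15])

# PRS = weighted sum across SNPs for each person (matrix-vector product)
prs_toy = Z_toy @ w_toy

# Same result written as an explicit sum for the first person
first_by_hand = sum(Z_toy[0, j] * w_toy[j] for j in range(3))

print("PRS per person:", np.round(prs_toy, 3))
print("First person by hand:", round(float(first_by_hand), 3))
```

Keeping the sign of each weight matters: a SNP with negative r lowers the score for people carrying more copies of its effect allele.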

In [None]:
# Step 5: Build and Evaluate Polygenic Score

# Primary selection: permutation-thresholded set
selected_idx = np.flatnonzero(r_abs >= threshold)
label = "Selected (perm-threshold)"

# Fallback to top-K by |r| if nothing selected
topK = 100  # K for the fallback, matching the K=100 stated in the Step 5 text
if selected_idx.size == 0:
    selected_idx = np.argsort(r_abs)[-topK:]
    label = f"Top-{topK} by |r| (fallback)"

# Build PRS with training weights (r-values) on standardized target genotypes
prs_raw = Z_geno_test[:, selected_idx] @ r_values[selected_idx]

# Standardize PRS
prs = (prs_raw - prs_raw.mean()) / prs_raw.std()

# Evaluate against standardized target phenotype
R = float(np.corrcoef(prs, Z_phen_test)[0, 1])
R2 = R * R

print(f"PRS type: {label}")
print(f"SNPs used: {selected_idx.size} | Pearson r = {R:.3f} | R^2 = {R2:.3f}")

# Visualization: Scatter plot
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(prs, Z_phen_test, s=8, alpha=0.5)
plt.xlabel('PRS (z-score)')
plt.ylabel('Phenotype (z-score)')
plt.title(f'PRS vs Phenotype (r={R:.2f}, R²={R2:.2f})')

# Visualization: Decile plot
plt.subplot(1, 2, 2)
# Create deciles
edges = np.quantile(prs, np.linspace(0, 1, 11))
dec = np.digitize(prs, edges[1:-1], right=True)
mean_per_decile = [Z_phen_test[dec == d].mean() if np.sum(dec==d) > 0 else np.nan for d in range(10)]
se_per_decile = [Z_phen_test[dec == d].std(ddof=1)/np.sqrt(np.sum(dec==d)) 
                if np.sum(dec==d) > 1 else np.nan for d in range(10)]
gap = mean_per_decile[-1] - mean_per_decile[0] if not np.isnan(mean_per_decile[0]) and not np.isnan(mean_per_decile[-1]) else np.nan

plt.errorbar(range(1, 11), mean_per_decile, yerr=se_per_decile, fmt='-o', capsize=3)
plt.title(f'Phenotype by PRS Decile (Δ₁₀–₁ ≈ {gap:.2f} SD)')
plt.xlabel('PRS Decile (1=lowest, 10=highest)')
plt.ylabel('Mean Phenotype (z-score)')

plt.tight_layout()
plt.show()

### Step 5 Interpretation

**Performance metrics (from one run of the cells above; your numbers may differ if you change the settings or seeds):**
- PRS type: Selected (perm-threshold)
- SNPs used: 5
- Correlation: r = 0.557, R² = 0.310

**What this means:**
- **Selected SNPs:** Only variants that cleared the permutation threshold are included, reducing false positives
- **Strong predictive power:** Explaining ~31% of trait variance with just 5 SNPs is substantial and is consistent with the sparse architecture we simulated
- **Direct genetic insight:** The significant SNPs likely represent true causal variants (in real-world data, they might tag causal variants)
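
To put the ~31% in context, here is a back-of-the-envelope sketch of how much trait variance the causal SNPs contribute on average under the default settings. It assumes independent causal SNPs and averages the genotype variance 2p(1 − p) over allele frequencies drawn uniformly from 0.1 to 0.5; the value for any single simulated dataset depends on the frequencies actually drawn for the five causal SNPs.

```python
num_causal, effect_size, noise_std = 5, 0.5, 1.0  # notebook defaults

# Variance of a 0/1/2 genotype with allele frequency p is 2p(1-p).
# Averaged over p ~ Uniform(a, b), this equals 2 * (E[p] - E[p^2]).
a, b = 0.1, 0.5
mean_p = (a + b) / 2
mean_p2 = (a * a + a * b + b * b) / 3
mean_geno_var = 2 * (mean_p - mean_p2)

# Independent causal SNPs: their variance contributions add up
genetic_var = num_causal * effect_size ** 2 * mean_geno_var
h2_expected = genetic_var / (genetic_var + noise_std ** 2)

print(f"Expected genetic variance: {genetic_var:.3f}")
print(f"Expected share of variance from causal SNPs: {h2_expected:.3f}")
```

The expected share comes out at roughly one third, so a PRS R² near 0.31 is close to what the causal SNPs can explain under these settings.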

**Understanding the decile plot:**
- Each point shows the average trait value for individuals in that PRS decile
- Error bars represent standard error of the mean
- The steep slope demonstrates the strong predictive power of the PRS
- The gap between lowest and highest deciles (Δ₁₀-₁) shows the practical effect size
- In this sparse case, we see a strong, clear gradient across deciles - individuals with high PRS have substantially higher trait values

**Why this matters:**
The strong performance with few SNPs is characteristic of traits with a simple genetic architecture. This approach efficiently identifies the key genetic drivers and builds a highly predictive, interpretable model.

**Comparison to Other Cases**
- Unlike Case 2 (polygenic), signals here are strong enough to pass the permutation threshold
- Unlike Case 4 (hybrid), all genetic signal comes from a few large-effect variants
- A strict PRS approach works best since additional SNPs add only noise

### Conclusion: What We Learned from Case 1

**Journey summary**  
In this case study, we explored a sparse genetic architecture where only a few variants strongly influence a trait. This mirrors certain real-world traits governed by a small number of impactful genetic factors.

**What we accomplished**
1. **Simulation setup**: Created genetic data with just 5 causal SNPs (out of 1000) affecting the trait
2. **Data preparation**: Standardized all variables to enable fair comparisons on a common scale
3. **Signal detection**: Used correlation analysis to identify the SNPs that truly matter
4. **Statistical rigor**: Established an empirical significance threshold through permutation testing
5. **Visual insight**: Demonstrated how causal SNPs stand out as peaks in a Manhattan plot
6. **Prediction model**: Built a focused polygenic score using only significant SNPs
7. **Validation**: Confirmed strong predictive power (~31% variance explained) in independent data

**Key insights**
- In sparse architectures, true signals can be clearly distinguished from background noise
- Permutation-based thresholds provide protection against false positives
- A small set of correctly identified SNPs can deliver substantial predictive power
- Standardization creates a level playing field for comparing and weighting genetic effects

**Why this matters**
This approach excels for traits with simple genetic architectures—like certain Mendelian disorders or traits influenced by a few major loci. The ability to pinpoint specific causal variants makes these models both interpretable and potentially actionable in clinical settings.

**Taking it further**
What happens when a trait is influenced by hundreds of variants each with tiny effects? This polygenic scenario requires different strategies, which we explore in Case 2.

# Complete the Reflection & Comparison Questions in Shared Slides in Groups in Canvas