# Case 2: Diffuse Polygenicity (Many Tiny Effects)

This notebook follows the GWAS→PRS workflow but simulates a trait with many tiny genetic effects.  Unlike Case 1, no single SNP will stand out as genome‑wide significant, yet aggregating weak signals into a polygenic score yields modest predictive power.


### Step 0: Imports

Import required libraries for simulation, computation and plotting.


In [None]:
# Step 0: Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dataclasses import dataclass
from IPython.display import display

# Style for plots
plt.style.use('seaborn-v0_8-whitegrid')


### Step 1: Simulate Genotypes & Phenotypes

This case simulates a highly polygenic trait: many SNPs each contribute a tiny effect. We split individuals 70/30 into discovery (training) and target (testing) sets, and scale environmental noise so that the trait’s narrow‑sense heritability is about 30%.

**Configuration (dataclass):**
We define all simulation knobs once:

- num_snps: total SNPs (default 1000)
- num_individuals: total people (split 70/30 into discovery/target)
- num_causal: how many SNPs truly affect the phenotype (randomly chosen indices)
- effect_size: SD for tiny additive effects of causal SNPs (effects drawn ~ N(0, effect_size²) with mild clipping and re-scaling)
- n_permutations: number of shuffles to build the empirical max |r| threshold (used in Step 4)
- broad_PGS_k: how many top-|r| SNPs to aggregate into the PRS (used in Step 5)
- noise_std: kept in the config but not used in this case; the noise SD is instead derived from the target heritability (H2_TARGET ≈ 0.30)

(You can change these in the SimulationConfig block below and re‑run.)

**What are we making?**
A simple, realistic dataset for learning polygenic prediction. Each “person” gets:
- Genotypes: 0 / 1 / 2 copies of an allele at 1,000 SNPs
- Phenotype: Genetic component from many tiny-effect causal SNPs + environmental noise scaled to target h² ≈ 0.30
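
The noise scaling follows from the definition of narrow-sense heritability. As a short derivation, writing $\sigma_g^2$ for the variance of the genetic component and $\sigma_e^2$ for the variance of the environmental noise (notation introduced here for illustration, and assuming the two components are independent):

$$h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2} \quad\Longrightarrow\quad \sigma_e^2 = \sigma_g^2\,\frac{1 - h^2}{h^2}$$

With $h^2 = 0.30$ this gives $\sigma_e^2 = \tfrac{7}{3}\,\sigma_g^2 \approx 2.33\,\sigma_g^2$, so the noise carries a bit more than twice the variance of the genetic component.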

**Why two groups?**
We discover patterns on one set and test them fairly on a fresh set.
- Discovery set (train): estimate SNP–trait associations used to create the PRS
- Target set (test): honest evaluation of the PRS

**How do we simulate?**
1) Draw an allele frequency for each SNP (uniform between 0.05 and 0.5).  
2) Generate genotypes (0/1/2) via binomial draws given each SNP’s allele frequency.  
3) Randomly choose `num_causal` causal SNPs; draw tiny effects for them ~ N(0, effect_size²), clip to ±2×effect_size, and re-standardize to keep the intended SD.  
4) Compute the genetic component g = G·β.  
5) Scale environmental noise so that h² ≈ 0.30 in the discovery set, then form phenotype = g + noise.  
6) Use the discovery set to compute SNP–trait correlations; the target set is held out for PRS evaluation.
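
As a small illustration of steps 1 and 2, the sketch below shows how one allele frequency translates into genotype probabilities when the two alleles are drawn independently, which is what a binomial(2, p) draw assumes. The names `genotype_probs` and `rng_demo` and the value p = 0.30 are illustrative choices, not part of the notebook's code.

```python
import numpy as np

def genotype_probs(p: float) -> np.ndarray:
    """P(0), P(1), P(2) copies when each of two alleles is drawn independently with frequency p."""
    return np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])

p = 0.30  # illustrative allele frequency (assumption; any value in [0.05, 0.5] works)
probs = genotype_probs(p)

# Compare with simulated draws from the same binomial(2, p) model the notebook uses
rng_demo = np.random.default_rng(0)
draws = rng_demo.binomial(2, p, size=100_000)
observed = np.bincount(draws, minlength=3) / draws.size

print("theoretical:", probs.round(3))
print("simulated:  ", observed.round(3))
```

Rarer alleles push most individuals into genotype 0, which lowers the genotype variance 2p(1−p) and makes an effect of the same size harder to detect.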

Note: In a diffuse polygenic architecture, no single SNP is expected to clear a strict significance threshold; the useful information lies in the ranking of many small signals that we aggregate into a PRS.

**Beginner prompts (try these for deeper understanding):**
- “How do allele frequencies determine the probabilities of 0/1/2 genotypes?”  
- “What happens if I change the number of causal SNPs (e.g., 50 vs 300 vs 600)?”  
- “How does the target heritability (H2_TARGET) affect PRS performance?”  
- “Why do many tiny effects make it hard for any single SNP to be significant?”  
- “Why do we need separate discovery and target sets?”

In [None]:
# Step 1: Config + Polygenic Simulation + Train Preview

from dataclasses import dataclass
import numpy as np
import pandas as pd
from IPython.display import display

@dataclass
class SimulationConfig:
    num_snps: int = 1000
    num_individuals: int = 600
    num_causal: int = 300
    effect_size: float = 0.02
    noise_std: float = 1.0
    n_permutations: int = 100
    broad_PGS_k: int = 50

config = SimulationConfig()
rng = np.random.default_rng(42)
H2_TARGET = 0.30

# 1. Draw allele frequencies and genotypes (0/1/2)
allele_freqs = rng.uniform(0.05, 0.5, size=config.num_snps)
geno = np.empty((config.num_individuals, config.num_snps), dtype=np.int8)
for j, p in enumerate(allele_freqs):
    geno[:, j] = rng.binomial(2, p, size=config.num_individuals)

# 2. Split into discovery (train) and target (test) sets
n_train = int(0.7 * config.num_individuals)
train_idx = np.arange(n_train)
test_idx = np.arange(n_train, config.num_individuals)
geno_train, geno_test = geno[train_idx], geno[test_idx]

# 3. Assign causal SNPs and tiny effects
causal_idx = rng.choice(config.num_snps, size=config.num_causal, replace=False)
beta = np.zeros(config.num_snps)
beta_draw = rng.normal(0.0, config.effect_size, size=config.num_causal)
beta_draw = np.clip(beta_draw, -2*config.effect_size, 2*config.effect_size)
# re-standardize to maintain target SD
beta_draw *= config.effect_size / beta_draw.std()
beta[causal_idx] = beta_draw

# 4. Genetic component and noise scaled to target h^2
g_train = geno_train @ beta
g_test  = geno_test  @ beta

var_g = g_train.var()
var_e = max(var_g * (1 - H2_TARGET) / H2_TARGET, 1e-12)
env_sd = np.sqrt(var_e)
phen_train = g_train + rng.normal(0.0, env_sd, size=n_train)
phen_test  = g_test  + rng.normal(0.0, env_sd, size=config.num_individuals - n_train)

# Package for later steps
data = {
    "allele_freqs": allele_freqs,
    "geno_train": geno_train,
    "geno_test": geno_test,
    "phen_train": phen_train,
    "phen_test": phen_test,
    "causal_snps": np.sort(causal_idx),
}

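# Diagnostics
# var_g / (var_g + var_e) equals H2_TARGET by construction, so use the realized
# phenotypic variance to get an empirical (sample) estimate instead.
emp_h2 = var_g / phen_train.var()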
print(f"Train phenotype mean/var: {phen_train.mean():.3f} / {phen_train.var():.3f}")
print(f"Empirical h^2 ≈ {emp_h2:.3f}")
print(f"Number of causal SNPs: {len(causal_idx)}")

# Preview first 5 rows
preview_snps = 8
df_train = pd.DataFrame(geno_train[:5, :preview_snps],
                        columns=[f"SNP_{i:04d}" for i in range(preview_snps)])
df_train["phen_train"] = phen_train[:5]
print("Train (wide) first rows:")
display(df_train)


### Interpreting the Training Set Preview

You’re seeing only the discovery (training) table. The target (test) data exists but isn’t shown here.

- Columns
  - SNP_0000, SNP_0001, …: Genotypes coded as the number of copies of the counted allele (0 = no copies; 1 = one copy, heterozygous; 2 = two copies, homozygous).
  - phen_train: Simulated trait = genetic component g + noise, where g = Σ_j (β_j × genotype_ij) over many tiny-effect causal SNPs. Noise is scaled so target h² ≈ 0.30. Higher = “more” of the trait.

- Reading a row (example)
  - SNP_0000=1, SNP_0001=2, …, phen_train=3.47 → genotypes at many causal SNPs (with small β_j of mixed signs) plus noise produce a phenotype of 3.47.

- Key points
  - Causal SNPs are randomly chosen indices; most β_j are tiny and may be positive or negative.
  - Train and test are separate draws; don’t use test data until evaluation.
  - In this diffuse polygenic case, no single SNP is expected to be genome‑wide significant; signal is in ranking many small |r|.

- effect_size (what it means)
  - In this case, effect_size is the SD of the distribution from which the per-allele effects β_j of the causal SNPs are drawn, not one shared slope. For causal SNP j, moving 0→1→2 copies shifts the phenotype by β_j at each step; a typical |β_j| is on the order of effect_size, and non-causal SNPs have β_j = 0.
  - Simulation vs real data: in this notebook we choose the β_j. In real studies, β is unknown and estimated (β̂) via GWAS (per‑SNP regression), with standard errors and p‑values.
  - Larger |β_j| generally produces bigger |r| and higher PRS R², but detectability also depends on sample size, allele frequency, and noise. In this notebook we use r as a simple proxy weight for β̂.

- Quick checks
  - Genotypes are only 0/1/2.
  - Train and test (when viewed) should be on similar scales given the same settings.
  - Causal SNPs won’t be visually obvious; association appears via correlation.

- Next
  - Standardize (z-score) to compare SNP–trait correlations on a common scale.
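
To make the effect_size notes above concrete, here is a rough back-of-the-envelope sketch of how large the true correlation of one typical causal SNP is under the default settings. It assumes independent SNPs, effects with SD equal to effect_size, and 420 discovery individuals (70% of 600); `expected_r` and the other names are hypothetical helpers, not part of the notebook.

```python
import math

# Defaults copied from SimulationConfig / H2_TARGET in this notebook
num_causal, effect_size, h2_target = 300, 0.02, 0.30
n_train = 420          # 70% of 600 individuals
lo, hi = 0.05, 0.5     # allele-frequency range used in the simulation

# Mean genotype variance E[2p(1-p)] for p ~ Uniform(lo, hi), in closed form
mean_geno_var = ((hi**2 - 2 * hi**3 / 3) - (lo**2 - 2 * lo**3 / 3)) / (hi - lo)

# Approximate variance budget (assumes independent SNPs and effect SD = effect_size)
var_g = num_causal * effect_size**2 * mean_geno_var
var_y = var_g / h2_target

def expected_r(beta: float, p: float) -> float:
    """Population correlation between one SNP (effect beta, frequency p) and the trait."""
    return beta * math.sqrt(2 * p * (1 - p)) / math.sqrt(var_y)

r_typical = expected_r(beta=effect_size, p=0.30)   # a "one-SD" effect at p = 0.30
null_sd = 1 / math.sqrt(n_train)                   # rough sampling SD of r with no effect

print(f"expected r ≈ {r_typical:.3f}, sampling SD of r ≈ {null_sd:.3f}")
```

Under these assumptions the true correlation of a typical causal SNP is smaller than the sampling noise in r, which is why individual SNPs are hard to pick out in this case.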

In [None]:
# Visualize the ground truth effect sizes
plt.figure(figsize=(8, 3))
causal_idx = data['causal_snps']
effect_sizes = beta[causal_idx]
plt.stem(np.arange(len(causal_idx)), effect_sizes)
plt.axhline(0, color='gray', linestyle='-', alpha=0.3)
plt.ylabel('Effect Size (per allele)')
plt.xlabel('Causal SNP Index')
plt.title('True Effect Sizes (Polygenic Architecture)')
plt.tight_layout()
plt.show()

### Step 2: Standardization

We now convert raw genotype and phenotype values into **z-scores** in the discovery set.  A z-score answers the question: *"How many standard deviations above (+) or below (-) the mean is this value?"*

**Formula:**  
$z = (value - mean) / \mathrm{SD}$

**Why we do this before computing correlations:**
- Puts every SNP column and the phenotype on the **same scale** (mean 0, SD 1).
- Makes Pearson’s $r$ simply the **average product** of two standardized variables, which acts like an effect size.
- Allows us to compare SNP signals fairly — rare and common variants are no longer unfairly scaled.
- Keeps the *ordering* of individuals unchanged; we merely shift and rescale.

In a polygenic context with many small effects (here, 300 causal SNPs), standardization is even more important: it ensures that none of the weak signals is artificially inflated or deflated simply due to allele frequency or measurement scale differences.
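
A minimal sketch of the points above on toy numbers (the arrays and the helper `zscore` are made up for illustration): after z-scoring, the mean is 0, the SD is 1, the ordering is unchanged, and the average product of the two standardized series equals Pearson's r.

```python
import numpy as np

x = np.array([0, 1, 2, 1, 0, 2, 1, 1], dtype=float)       # toy genotypes
y = np.array([1.2, 2.0, 3.1, 1.8, 0.9, 2.7, 2.2, 1.5])    # toy phenotype

def zscore(v: np.ndarray) -> np.ndarray:
    # np.std defaults to the population SD (ddof=0), matching the notebook's standardization
    return (v - v.mean()) / v.std()

zx, zy = zscore(x), zscore(y)

r_from_products = (zx * zy).mean()        # average product of standardized values
r_from_numpy = np.corrcoef(x, y)[0, 1]    # reference value

print(f"mean / SD after z-scoring: {zx.mean():.3f} / {zx.std():.3f}")
print(f"r (average product) = {r_from_products:.4f}, r (np.corrcoef) = {r_from_numpy:.4f}")
```

The equality holds exactly only when the SD uses ddof=0, as `np.std` does by default.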


In [None]:
# Step 2: Standardize Genotypes and Phenotypes

def standardize_matrix(mat: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Z-score each column; return (standardized matrix, column means, column SDs)."""
    mean = mat.mean(axis=0)
    std = mat.std(axis=0)
    std[std == 0] = 1.0
    return (mat - mean) / std, mean, std

Z_geno_train, geno_mean_train, geno_std_train = standardize_matrix(data['geno_train'])
Z_geno_test, geno_mean_test_raw, geno_std_test_raw = standardize_matrix(data['geno_test'])  # independent standardization

phen_train = data['phen_train']
phen_test = data['phen_test']
Z_phen_train = (phen_train - phen_train.mean()) / phen_train.std()
Z_phen_test = (phen_test - phen_test.mean()) / phen_test.std()

# Diagnostics to show why raw vs standardized look similar
print("Raw phenotype   mean/std = {:.3f} / {:.3f}".format(phen_train.mean(), phen_train.std()))
print("Standardized    mean/std = {:.3f} / {:.3f}".format(Z_phen_train.mean(), Z_phen_train.std()))
print("Raw   min/max = {:.3f} / {:.3f}".format(phen_train.min(), phen_train.max()))
print("Z     min/max = {:.3f} / {:.3f}".format(Z_phen_train.min(), Z_phen_train.max()))

fig, axes = plt.subplots(1,2, figsize=(10,4))
# Raw distribution
axes[0].hist(phen_train, bins=30, color='skyblue', edgecolor='black')
axes[0].axvline(phen_train.mean(), color='k', linestyle='--', linewidth=1, label='Mean')
axes[0].set_title('Raw Phenotype (Discovery)')
axes[0].set_xlabel('Phenotype')
axes[0].legend()

# Standardized distribution (fixed axis to emphasize z-scale)
axes[1].hist(Z_phen_train, bins=30, color='salmon', edgecolor='black', density=True)
axes[1].axvline(0, color='k', linestyle='--', linewidth=1, label='Mean=0')
axes[1].set_xlim(-4,4)
axes[1].set_title('Standardized Phenotype (Discovery)')
axes[1].set_xlabel('Z-Phenotype')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Additional viz: one SNP before/after standardization (discovery set)
# Pick a SNP with allele frequency near 0.30 for variability
p = data['allele_freqs']
idx_snp = int(np.argmin(np.abs(p - 0.30)))
x_raw = data['geno_train'][:, idx_snp]
x_z = Z_geno_train[:, idx_snp]

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))

# Left: raw genotype distribution (0/1/2)
axes[0].hist(x_raw, bins=[-0.5, 0.5, 1.5, 2.5], color='steelblue', edgecolor='black')
axes[0].set_xticks([0,1,2])
axes[0].axvline(x_raw.mean(), color='k', linestyle='--', linewidth=1, label=f"Mean={x_raw.mean():.2f}")
axes[0].set_title(f"SNP_{idx_snp:04d} (raw)")
axes[0].set_xlabel('Genotype (0/1/2)')
axes[0].set_ylabel('Count')
axes[0].legend(frameon=False)

# Right: standardized genotype distribution (z-scores)
axes[1].hist(x_z, bins=30, color='indianred', edgecolor='black', density=True)
axes[1].axvline(0, color='k', linestyle='--', linewidth=1, label='Mean=0')
axes[1].axvline(1, color='gray', linestyle=':', linewidth=1, label='±1 SD')
axes[1].axvline(-1, color='gray', linestyle=':', linewidth=1)
axes[1].set_title(f"SNP_{idx_snp:04d} (standardized)")
axes[1].set_xlabel('Z-score')
axes[1].legend(frameon=False)

plt.tight_layout()
plt.show()

### Step 3: GWAS Correlations

**Goal.** Quantify the relationship between each SNP and the trait.

**What we do.** For each SNP $j$, we compute the Pearson correlation $r_j$ between its standardized genotypes and the standardized phenotype in the discovery set.  Because both are z-scored, $r_j$ is just the average product of two series of standardized numbers.

**How to read $r$:**
- $r_j \approx 0$ means no detectable association.
- $r_j > 0$ means individuals with more minor alleles tend to have **higher** trait values.
- $r_j < 0$ means individuals with more minor alleles tend to have **lower** trait values.
- Larger $|r_j|$ implies a stronger SNP–trait link on a common scale (and $r_j^2$ is the in-sample variance explained by SNP $j$).

**Why so small?**  In a diffuse polygenic architecture, each true causal SNP has a tiny effect.  Their $|r_j|$ values are buried in sampling noise, so the largest observed $|r_j|$ will typically be modest (≈ 0.17) and may not exceed the permutation threshold.  The important information is in the **ranking** of $|r_j|$, not the individual magnitude.

**Intuition.**  To decide if a SNP and trait move together, imagine multiplying their standardized values person-by-person and averaging: if the values tend to be above average together, the average product is positive; if one tends to be high when the other is low, the average product is negative; if there’s no pattern, the average product is near zero.
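
The "average product" idea extends to all SNPs at once by broadcasting the phenotype down the columns of the genotype matrix. The sketch below checks this on toy data against one `np.corrcoef` call per SNP; the sizes, the effect of 0.5 on SNP 0, and the name `rng_demo` are illustrative assumptions, not the notebook's settings.

```python
import numpy as np

rng_demo = np.random.default_rng(1)
n, m = 200, 5                                   # toy sizes, not the notebook's
G = rng_demo.binomial(2, 0.3, size=(n, m)).astype(float)
y = 0.5 * G[:, 0] + rng_demo.normal(size=n)     # only SNP 0 affects the toy trait

Z = (G - G.mean(axis=0)) / G.std(axis=0)        # z-score each SNP column
zy = (y - y.mean()) / y.std()                   # z-score the phenotype

# One pass over all SNPs: multiply person-by-person, then average down each column
r_vectorized = (Z * zy[:, None]).mean(axis=0)

# Reference: one np.corrcoef call per SNP
r_loop = np.array([np.corrcoef(G[:, j], y)[0, 1] for j in range(m)])

print(np.round(r_vectorized, 3))
```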


In [None]:
# Step 3: GWAS via per-SNP correlations

r_values = (Z_geno_train * Z_phen_train[:, None]).mean(axis=0)
r_abs = np.abs(r_values)
max_r = r_abs.max()
print(f"Maximum |r| = {max_r:.3f}")


In [None]:
# Optional Step 3 viz: one SNP vs phenotype (discovery set)

import numpy as np
import matplotlib.pyplot as plt

idx = int(np.argmax(r_abs))  # SNP with largest |r|
snp_name = f"SNP_{idx:04d}"

x_raw = data['geno_train'][:, idx]
y_raw = phen_train
x_z = Z_geno_train[:, idx]
y_z = Z_phen_train
r = float(r_values[idx]); r2 = r*r

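# Use a separate generator for plot jitter so the simulation's `rng` is not overwritten
jitter_rng = np.random.default_rng(123)
jit = (jitter_rng.random(x_raw.size) - 0.5) * 0.12  # jitter to separate 0/1/2 columns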

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Raw scale: shows per-allele shifts
axes[0].scatter(x_raw + jit, y_raw, s=8, alpha=0.5)
means = [y_raw[x_raw==g].mean() if np.any(x_raw==g) else np.nan for g in [0,1,2]]
axes[0].plot([0,1,2], means, 'r-o', lw=2, label='Group means')
axes[0].set_xticks([0,1,2]); axes[0].legend(frameon=False)
axes[0].set_xlabel('Genotype (0/1/2)'); axes[0].set_ylabel('Phenotype (raw)')
axes[0].set_title(f'{snp_name} vs phen_train (raw)')

# Z scale: matches how r is computed
axes[1].scatter(x_z + jit, y_z, s=8, alpha=0.5)
# Group means in z space (plot at mean x_z for each genotype 0/1/2)
x_means = [x_z[x_raw==g].mean() if np.any(x_raw==g) else np.nan for g in [0,1,2]]
y_means = [y_z[x_raw==g].mean() if np.any(x_raw==g) else np.nan for g in [0,1,2]]
axes[1].plot(x_means, y_means, 'r-o', lw=2, label='Group means')
axes[1].legend(frameon=False)
axes[1].set_xlabel('Genotype (z-scored)'); axes[1].set_ylabel('Phenotype (z-scored)')
axes[1].set_title(f'{snp_name} (|r|={abs(r):.2f}, R²={r2:.2f})')

plt.tight_layout(); plt.show()

### How the scatterplot connects to Step 3 correlations - Polygenic Case

- Left panel (raw):
  - Three vertical bands (genotype 0/1/2). The red "group means" line shows the average phenotype for each genotype.
  - In this polygenic architecture, even the SNP with largest |r| shows only a mild slope, reflecting its small effect size.
  - The difference between genotype groups is modest compared to the within-group variation.

- Right panel (z-scored):
  - Both axes are standardized, which is exactly how we compute r in Step 3.
  - Pearson r for this SNP is the average product r = mean(x_z · y_z).
  - Notice the r value is small (typically ~0.15-0.17) - this is expected in polygenic traits where each SNP has a tiny effect.
  - In polygenic architectures, no single SNP explains much variance (low R²).

- Why this matters:
  - Even the "best" SNP shows a weak association - yet collectively, many such weak signals can be informative.
  - The value of r becomes our weight when building the polygenic score in Step 5.
  - In polygenic traits, we rely on aggregating many weak signals rather than finding a few strong ones.

Note: The cloud of points lacks the clear upward trend seen in Case 1 (sparse architecture), illustrating why polygenic traits need different analytical approaches.

### Mini-GWAS framing (what we’re mimicking)

- A GWAS tests each SNP across the genome for association with a phenotype (one SNP at a time).  
- Real GWAS uses regression (**β, SE, p-value**) and includes covariates (age, sex, ancestry PCs) to control confounding.  
- In this exercise, we use **z-scores** and compute a simple **per-SNP correlation (r)** as a stand-in for GWAS effect size.  
- Output of this step is a “summary stats–like” table: **SNP ID, r, |r|**.  
- Next, we’ll visualize all SNPs together (Manhattan plot) and set a significance cut line via **permutation**—analogous to GWAS genome-wide thresholds.  
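
To connect the correlation used here with the regression output of a real GWAS, the sketch below recovers a per-allele slope, its standard error, and a t statistic from r for a single SNP, and checks the slope against a direct least-squares fit. The toy data and the per-allele effect of 0.05 are assumptions for illustration, and no covariates are included.

```python
import numpy as np

rng_demo = np.random.default_rng(2)
n = 420
x = rng_demo.binomial(2, 0.3, size=n).astype(float)   # toy genotypes
y = 0.05 * x + rng_demo.normal(size=n)                # toy trait with a small per-allele effect

r = np.corrcoef(x, y)[0, 1]
beta_hat = r * y.std() / x.std()                                # OLS slope implied by r
se_beta = (y.std() / x.std()) * np.sqrt((1 - r**2) / (n - 2))   # standard error of the slope
t_stat = beta_hat / se_beta

slope_check = np.polyfit(x, y, 1)[0]                  # direct least-squares fit for comparison
print(f"beta_hat = {beta_hat:.4f} (polyfit: {slope_check:.4f}), SE = {se_beta:.4f}, t = {t_stat:.2f}")
```

For one SNP without covariates, r and β̂ differ only by the scale factor SD(y)/SD(x), so ranking SNPs by |r| on z-scored data is the same as ranking by the standardized slope.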

**TL;DR:** Step 3 ≈ a **mini-GWAS** pass over SNPs to get per-SNP effects we can carry forward.  


### Step 4: Permutation (simple) + Manhattan Plot

From one SNP to all SNPs
- Step 3 computed an r for each SNP; the Manhattan plot shows the |r| values of all SNPs side by side on one common scale.

What we do (simple permutation)
- We ask: “How tall could the biggest |r| be just by chance?”
- Shuffle the phenotype B times.
- Each time, compute |r| across all SNPs and record the maximum.
- Use the 95th percentile of these maxima as the dashed threshold line.

How to read the Manhattan
- Dot height = |r|; the dashed line is the permutation-based threshold.
- Red stars = Top‑K SNPs by |r| (used for PRS), even if none pass the threshold.
- In a polygenic trait, most points sit near zero; few (if any) exceed the line.

Polygenic interpretation
- Lack of peaks above the threshold does not mean “no genetics.”
- The useful signal is in the ranking of many small effects; we aggregate Top‑K into a PRS in Step 5.

Beginner tip
- Think of a “tallest building contest”: the line sits just above how tall the tallest |r| would typically be if there were no real signal. For the PRS that follows, think “many pennies add up.”
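
As a rough cross-check on the permutation line, an analytic approximation can be sketched as follows. It assumes that, with no signal, each r is approximately Normal(0, 1/n) and that SNPs are independent, then applies a Bonferroni-style split of α across SNPs; the function name `approx_max_r_threshold` is hypothetical and this is not the method the notebook uses.

```python
from math import sqrt
from statistics import NormalDist

def approx_max_r_threshold(n: int, m: int, alpha: float = 0.05) -> float:
    """Bonferroni-style |r| cutoff, assuming r ~ Normal(0, 1/n) under the null and independent SNPs."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))   # two-sided, alpha split across m tests
    return z / sqrt(n)

# Notebook defaults: 420 discovery individuals (70% of 600) and 1000 SNPs
approx_thr = approx_max_r_threshold(n=420, m=1000)
print(f"approximate 95% threshold on max |r|: {approx_thr:.3f}")
```

The permutation threshold should land in the same neighborhood; it makes fewer assumptions because it reuses the actual genotype matrix.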

In [None]:
# Step 4: Compact permutation + Manhattan (simple line + Top‑K overlay)
def perm_threshold(Z_geno: np.ndarray, y: np.ndarray, B: int = 100, q: float = 0.95, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    max_abs = np.empty(B, dtype=float)
    for b in range(B):
        y_shuf = rng.permutation(y)
        r_shuf = (Z_geno * y_shuf[:, None]).mean(axis=0)
        max_abs[b] = np.abs(r_shuf).max()
    return float(np.quantile(max_abs, q))

ALPHA = 0.05
B = config.n_permutations if hasattr(config, "n_permutations") else 100
threshold = perm_threshold(Z_geno_train, Z_phen_train, B=B, q=1-ALPHA, seed=0)

# Selection for display and PRS
r_abs = np.abs(r_values)
topK = config.broad_PGS_k
topK_idx = np.argsort(r_abs)[-topK:]
selected_idx = np.flatnonzero(r_abs >= threshold)  # often 0 in polygenic case

print(f"Permutation threshold (95%): {threshold:.3f} | Above threshold: {selected_idx.size} | Top-K (PRS) = {topK}")

# Manhattan plot
plt.figure(figsize=(10, 5))
x = np.arange(config.num_snps)
plt.scatter(x, r_abs, s=10, c='black', alpha=0.6, label='All SNPs')
plt.scatter(topK_idx, r_abs[topK_idx], s=40, c='red', alpha=0.9, marker='*', label=f'Top {topK} by |r| (PRS)')
plt.axhline(threshold, color='tab:blue', linestyle='--', linewidth=1.5, label='95% permutation threshold')
plt.xlabel('SNP index'); plt.ylabel('Absolute correlation |r|')
plt.title('Manhattan Plot (Permutation line + Top‑K overlay)')
plt.legend(loc='upper right', frameon=False)
plt.tight_layout(); plt.show()


### From GWAS results to Manhattan (our version)

- Classic GWAS Manhattan plots **−log10(p)** by genomic position; taller peaks = stronger evidence.  
- Here, we plot **|r|** instead of −log10(p). The goal is the same: a genome-wide view of signal strength.  
- GWAS uses fixed significance lines (e.g., *5×10⁻⁸*). We use a **permutation-based line** that reflects our data.  
- Selection rule: SNPs above the line would count as significant hits. In this diffuse case none are expected to cross, so the PRS in Step 5 is built from the **top-K SNPs by |r|** (keeping the sign of r for direction).  
- Same idea, simpler ingredients: our permutation line plays the role of a **GWAS significance threshold**.  
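
To relate |r| to the −log10(p) axis of a classic Manhattan plot, the sketch below converts a correlation into an approximate two-sided p-value. It uses a normal approximation to the t statistic, which is reasonable for a few hundred individuals; the helper `neg_log10_p` is hypothetical and not part of the notebook.

```python
import math

def neg_log10_p(r: float, n: int) -> float:
    """Approximate -log10(two-sided p) for a correlation r from n individuals (normal approximation)."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    p = math.erfc(abs(t) / math.sqrt(2))      # two-sided tail area of a standard normal
    return -math.log10(p)

gwas_line = -math.log10(5e-8)                 # conventional genome-wide threshold, about 7.3
example = neg_log10_p(r=0.17, n=420)          # |r| of roughly the size seen in this case
print(f"-log10(p) ≈ {example:.2f} vs genome-wide line ≈ {gwas_line:.2f}")
```

With 420 discovery individuals, an |r| near 0.17 falls well short of the conventional genome-wide line, in keeping with the flat Manhattan plot in this case.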


### Step 5: Build and Evaluate a Polygenic Score

**Goal:** Aggregate many weak genetic signals into a single predictive score and evaluate its performance.

**What we do:**
1. **Select SNPs:** Since no SNPs pass the permutation threshold, we use the top-K SNPs by |r| (K=50)
2. **Calculate PRS:** Weighted sum of standardized genotypes (weights = correlations from discovery)
3. **Evaluate performance:** Correlation metrics (r, R²) and decile stratification
4. **Visualize:** Scatter plot and decile plot showing PRS-phenotype relationship

In polygenic traits, the decile plot is especially valuable - it demonstrates that even when no individual SNP is significant, aggregating many weak signals can still stratify individuals by genetic risk.
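
Written out, the score in item 2 is a signed weighted sum. Here $S$ is the set of selected SNPs, $r_j$ is the discovery-set correlation of SNP $j$, and $z_{ij}$ is the standardized genotype of target individual $i$ at SNP $j$ (the symbols $S$ and $z_{ij}$ are notation introduced here for illustration):

$$\mathrm{PRS}_i = \sum_{j \in S} r_j \, z_{ij}$$

The raw score is then z-scored across target individuals before it is compared with the standardized phenotype. Keeping the sign of $r_j$ matters: alleles associated with lower trait values pull the score down rather than up.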

In [None]:
# Step 5: Build and evaluate polygenic score (PRS)

# Always use top-K SNPs (no genome-wide hits)
selected_idx = topK_idx
label = f'Top-{topK} by |r|'

# Build raw PRS for target samples
prs_raw = Z_geno_test[:, selected_idx] @ r_values[selected_idx]
# Standardize PRS
prs = (prs_raw - prs_raw.mean()) / prs_raw.std()

# Correlation with phenotype
R = float(np.corrcoef(prs, Z_phen_test)[0,1]) if prs.std() > 0 else 0.0
R2 = R * R
print(f"PRS type: {label}")
print(f"SNPs used: {len(selected_idx)} | Pearson r = {R:.3f} | R^2 = {R2:.3f}")

# Visualization: Scatter plot
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(prs, Z_phen_test, s=8, alpha=0.4)
plt.xlabel('PRS (z-score)')
plt.ylabel('Phenotype (z-score)')
plt.title(f'PRS vs Phenotype (r={R:.2f}, R²={R2:.2f})')

# Visualization: Decile plot
plt.subplot(1, 2, 2)
# Create deciles
edges = np.quantile(prs, np.linspace(0,1,11))
dec = np.digitize(prs, edges[1:-1], right=True)
mean_per_decile = [Z_phen_test[dec == d].mean() if np.sum(dec==d) > 0 else np.nan for d in range(10)]
se_per_decile = [Z_phen_test[dec == d].std(ddof=1)/np.sqrt(np.sum(dec==d)) if np.sum(dec==d) > 1 else np.nan for d in range(10)]
gap = mean_per_decile[-1] - mean_per_decile[0] if not np.isnan(mean_per_decile[0]) and not np.isnan(mean_per_decile[-1]) else np.nan

plt.errorbar(range(1,11), mean_per_decile, yerr=se_per_decile, fmt='-o', capsize=3)
plt.title(f'Phenotype by PRS Decile (Δ₁₀–₁ ≈ {gap:.2f} SD)')
plt.xlabel('PRS Decile (1=lowest, 10=highest)')
plt.ylabel('Mean Phenotype (z-score)')

plt.tight_layout()
plt.show()

### Step 5 Interpretation

**Performance metrics:**
- PRS type: Top-50 by |r|
- SNPs used: 50
- Correlation: r ≈ 0.16, R² ≈ 0.02

**What this means:**
- **No genome-wide significant SNPs:** In this polygenic architecture, no single variant has a large enough effect to pass the threshold
- **Modest but meaningful prediction:** Despite small individual effects, aggregating the top 50 SNPs produces a score that correlates with the trait
- **Small R²:** The PRS explains only a small fraction of variance, reflecting the challenging nature of polygenic prediction

**Understanding the decile plot:**
- Each point shows the average trait value for individuals in that PRS decile
- Error bars represent standard error of the mean
- The gradual upward slope is consistent with the PRS carrying real predictive signal, though each decile holds few individuals, so single points are noisy
- The gap between lowest and highest deciles (Δ₁₀-₁ ≈ 0.5 SD) represents the practical effect size
- In this polygenic case, the gradient is more modest than in a sparse architecture, but still shows clear stratification
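
The decile assignment itself can be sketched on toy data. The example below uses the same `np.quantile` plus `np.digitize` pattern as the Step 5 cell, applied to 100 made-up scores (`score` and `rng_demo` are illustrative names), and counts how many individuals land in each bin.

```python
import numpy as np

rng_demo = np.random.default_rng(3)
score = rng_demo.normal(size=100)                        # toy stand-in for a PRS

edges = np.quantile(score, np.linspace(0, 1, 11))        # 11 edges define 10 bins
decile = np.digitize(score, edges[1:-1], right=True)     # labels 0..9 (0 = lowest scores)

counts = np.bincount(decile, minlength=10)               # individuals per decile
print(counts)
```

Using only the nine interior edges keeps the minimum and maximum scores inside the first and last bins. With the default 180 target individuals, each decile holds about 18 people, which is why the error bars in the decile plot are wide.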

**Why this matters:**
This demonstrates a key principle of complex traits: even when no single variant reaches significance, the aggregated small effects can still provide useful prediction. Most human traits and diseases follow this pattern, where many tiny genetic effects combine to influence outcomes.

### Conclusion: What We Learned from Case 2

**Journey summary**  
We explored a diffuse polygenic architecture where many SNPs each have tiny effects. No single SNP clears a strict threshold, yet aggregating weak signals into a PRS yields measurable prediction.

**What we accomplished**
1. Simulated a polygenic trait (h² ≈ 0.30) with many small-effect causal SNPs
2. Standardized genotypes and phenotypes for fair, comparable correlations
3. Computed per-SNP r values and used permutations to set a family-wise threshold
4. Observed a “flat sea” Manhattan plot with no genome-wide hits (as expected)
5. Built a PRS from the top-K SNPs by |r| and evaluated it on a held-out set
6. Visualized performance via scatter and decile plots (modest r, small R², clear but gradual stratification)

**Key insights**
- In polygenic traits, the useful signal lies in the ranking of many small effects; aggregation beats single-marker significance
- Lack of significant peaks does not imply a non-genetic trait
- PRS performance is modest per individual but informative at the group level; it improves with larger training N and better methods
- Honest holdout evaluation prevents leakage and overstatement of accuracy; standardization keeps SNP–trait correlations comparable across SNPs

**Why this matters**  
Most complex traits and common diseases are polygenic. Even modest PRS can stratify risk and inform research, screening, and trial enrichment when used responsibly.

**Taking it further**
- Tune K via cross-validation; add LD-aware/shrinkage methods (clump+threshold, ridge/BLUP, LDpred, PRS-CS, lassosum)
- Increase discovery sample size or use external GWAS summary stats
- Include covariates (age/sex/PCs), check ancestry transferability, and assess calibration/clinical utility.
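
As one possible starting point for the first suggestion, the sketch below tunes K by K-fold cross-validation inside the discovery set only, so the target set stays untouched. The function `choose_k_by_cv`, the toy data, and the grid of K values are all hypothetical; the weights are within-fold covariances of z-scored inputs, a close stand-in for the r values used in the notebook.

```python
import numpy as np

def choose_k_by_cv(Z, y, k_grid, n_folds=5, seed=0):
    """Return (best_k, mean validation correlation per K) using K-fold CV on discovery data."""
    rng_cv = np.random.default_rng(seed)
    folds = np.array_split(rng_cv.permutation(len(y)), n_folds)
    scores = {k: [] for k in k_grid}
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        Zt, yt = Z[train_mask], y[train_mask]
        # Per-SNP association estimated on the training folds only
        w = ((Zt - Zt.mean(axis=0)) * (yt - yt.mean())[:, None]).mean(axis=0)
        order = np.argsort(np.abs(w))[::-1]              # strongest |association| first
        for k in k_grid:
            top = order[:k]
            prs = Z[val_idx][:, top] @ w[top]            # signed weighted sum on held-out fold
            scores[k].append(np.corrcoef(prs, y[val_idx])[0, 1])
    mean_scores = {k: float(np.mean(v)) for k, v in scores.items()}
    best_k = max(mean_scores, key=mean_scores.get)
    return best_k, mean_scores

# Toy discovery data (assumption: every SNP has a small effect)
rng_demo = np.random.default_rng(4)
n, m = 300, 200
G = rng_demo.binomial(2, 0.3, size=(n, m)).astype(float)
Z_toy = (G - G.mean(axis=0)) / G.std(axis=0)
g = Z_toy @ rng_demo.normal(0, 0.1, size=m)
y_toy = g + rng_demo.normal(0, 1.5 * g.std(), size=n)
y_toy = (y_toy - y_toy.mean()) / y_toy.std()

k_grid = [5, 20, 50, 100, 200]
best_k, cv_scores = choose_k_by_cv(Z_toy, y_toy, k_grid)
print("best K:", best_k)
```

The chosen K would then be applied once to the held-out target set, so the final R² remains an honest estimate.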

# Complete the Reflection & Comparison Questions in Shared Slides in Groups in Canvas