# Exercise 2: Exploring Mutational Signatures in PCAWG Data

## Overview

In this exercise, we explore real mutational signature data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium. The PCAWG project analyzed over 2,600 cancer samples across multiple cancer types, providing a comprehensive landscape of mutational signatures in human cancer. Here we work with a 146-sample subset spanning 15 cancer types.

## Learning Objectives

By the end of this exercise, you will be able to:
1. Load and explore large-scale mutational signature datasets
2. Compare signature patterns across different cancer types
3. Interpret signature activities using the COSMIC SBS database
4. Perform statistical analysis of signature distributions
5. Identify clinically relevant signature associations

## Part 1: Data Loading and Initial Exploration

In [45]:
# Import required libraries
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import musical
import warnings
from scipy.spatial.distance import cosine
warnings.filterwarnings('ignore')

# Set plotting parameters
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

In [None]:
# Load PCAWG mutational signature profiles
# Rows are the 96 trinucleotide mutation channels; columns are samples labeled CancerType::SampleID
pcawg_profiles = pd.read_csv('data/PCAWG-146_profiles.csv', index_col=0)

# Display basic information about the dataset
print(f"Dataset shape: {pcawg_profiles.shape}")

Dataset shape: (96, 146)


Type,Biliary-AdenoCA::SP99325,Bladder-TCC::SP1059,Bladder-TCC::SP96136,Breast-AdenoCA::SP117369,Breast-AdenoCA::SP2293,CNS-GBM::SP25494,ColoRect-AdenoCA::SP18310,ColoRect-AdenoCA::SP17172,ColoRect-AdenoCA::SP96133,ColoRect-AdenoCA::SP110242,...,Stomach-AdenoCA::SP84392,Stomach-AdenoCA::SP105018,Stomach-AdenoCA::SP84439,Stomach-AdenoCA::SP84922,Uterus-AdenoCA::SP90209,Uterus-AdenoCA::SP93540,Uterus-AdenoCA::SP93227,Uterus-AdenoCA::SP94933,Uterus-AdenoCA::SP92364,Uterus-AdenoCA::SP92659
A[C>A]A,542,883,268,92,61,190,239,321,674,554,...,473,1155,1171,276,291,197,648,528,457,692
A[C>A]C,595,486,206,48,31,211,261,280,368,423,...,375,1012,910,193,1028,222,5227,326,358,1040
A[C>A]G,72,146,53,9,7,25,25,32,64,55,...,32,134,115,25,40,27,168,59,62,119
A[C>A]T,895,605,155,40,39,221,462,491,495,723,...,426,1059,1664,183,6354,268,31709,303,386,3914
C[C>A]A,1571,966,252,86,92,213,930,1109,978,1742,...,910,1086,2812,239,2288,653,7809,305,473,923
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
G[T>G]T,659,154,150,12,9,61,280,1046,622,787,...,724,314,3917,4277,1183,146,4113,479,744,926
T[T>G]A,812,207,183,25,13,67,42,170,133,77,...,101,559,685,299,65,24,129,159,194,4524
T[T>G]C,256,83,91,11,3,62,54,137,91,116,...,144,210,576,514,290,41,1250,203,248,1237
T[T>G]G,459,167,136,15,7,76,210,278,192,239,...,214,407,768,339,129,124,211,340,339,594
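
Before moving on, it can help to summarize the count matrix in two ways: the total number of mutations per sample, and the counts collapsed from 96 trinucleotide channels into the six substitution classes. The following is a minimal sketch on a toy table; `toy_profiles` and its values are made up for illustration and stand in for `pcawg_profiles`, assuming channel labels of the form `A[C>A]A` as shown above.

```python
import pandas as pd

# Toy stand-in for pcawg_profiles (hypothetical values):
# rows are trinucleotide channels, columns are samples.
toy_profiles = pd.DataFrame(
    {
        "TypeA::S1": [10, 5, 3, 2],
        "TypeB::S2": [1, 0, 4, 15],
    },
    index=["A[C>A]A", "A[C>A]C", "A[C>T]G", "T[T>G]G"],
)

# Total number of single-base substitutions per sample (column sums).
burden = toy_profiles.sum(axis=0)

# Collapse channels into substitution classes: characters 2..4 of a
# label such as "A[C>A]A" are the substitution itself ("C>A").
class_counts = toy_profiles.groupby(lambda channel: channel[2:5]).sum()

print(burden)
print(class_counts)
```

The same two lines should apply to `pcawg_profiles` unchanged, provided its index holds the channel labels.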


In [11]:
# Cancer types in the dataset
pcawg_profiles.columns.map(lambda x: x.split('::')[0]).value_counts()

Skin-Melanoma       65
Lung-SCC            17
ColoRect-AdenoCA    16
Eso-AdenoCA         10
Stomach-AdenoCA     10
Lung-AdenoCA         8
Uterus-AdenoCA       6
Head-SCC             3
Bladder-TCC          2
Breast-AdenoCA       2
Lymph-BNHL           2
Panc-AdenoCA         2
Biliary-AdenoCA      1
CNS-GBM              1
Liver-HCC            1
Name: count, dtype: int64
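
Because the sample labels encode the cancer type, a small metadata table makes later group-wise comparisons easier. The sketch below uses a toy table with invented values; the column names `cancer_type`, `sample_id`, and `burden` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for pcawg_profiles (hypothetical values).
toy_profiles = pd.DataFrame(
    [[10, 1, 7], [5, 0, 3]],
    index=["A[C>A]A", "A[C>A]C"],
    columns=["Lung-SCC::SP1", "Lung-SCC::SP2", "CNS-GBM::SP3"],
)

# Split each "CancerType::SampleID" label into its two parts.
meta = pd.DataFrame(
    [label.split("::", 1) for label in toy_profiles.columns],
    columns=["cancer_type", "sample_id"],
    index=toy_profiles.columns,
)

# Attach the total mutation count per sample (aligned on the sample label).
meta["burden"] = toy_profiles.sum(axis=0)

# Number of samples and median burden per cancer type.
summary = meta.groupby("cancer_type")["burden"].agg(["count", "median"])
print(summary)
```

With only one or two samples for several cancer types in this subset, per-type medians should be read with caution.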

In [None]:
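# Load the precomputed MuSiCal de novo model (a pickled musical.denovo.DenovoSig object;
# the musical package must be importable for unpickling to work)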
with open('./data/PCAWG-146_model.pkl', 'rb') as f:
    model = pickle.load(f)

<musical.denovo.DenovoSig at 0x17eb3cbe0>

In [43]:
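W = model.W_df               # de novo signatures (channels x signatures)
H = model.H_df               # exposures (signatures x samples)
W_catalog = model.W_catalog  # reference COSMIC SBS catalog (channels x signatures)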
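
A quick sanity check on a factorization is to see how well the product of signatures and exposures reproduces the observed counts. The sketch below does this on synthetic matrices, under the assumption (consistent with the tables shown in this notebook, but not verified against the MuSiCal documentation) that each signature column sums to 1 and that exposures are in units of mutations. All names ending in `_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
channels = [f"ch{i}" for i in range(6)]
sigs = ["Sig1", "Sig2"]
samples = ["S1", "S2", "S3"]

# Synthetic signatures: non-negative columns normalized to sum to 1.
W_toy = pd.DataFrame(rng.random((6, 2)), index=channels, columns=sigs)
W_toy = W_toy / W_toy.sum(axis=0)

# Synthetic exposures in units of mutations.
H_toy = pd.DataFrame(
    [[800.0, 100.0, 0.0], [200.0, 900.0, 500.0]], index=sigs, columns=samples
)

# Reconstructed counts, and "observed" counts drawn with Poisson noise.
X_hat = W_toy @ H_toy
X_toy = pd.DataFrame(
    rng.poisson(X_hat.to_numpy()), index=channels, columns=samples
)

# Per-sample cosine similarity between observed and reconstructed profiles.
num = (X_toy * X_hat).sum(axis=0)
den = np.sqrt((X_toy**2).sum(axis=0) * (X_hat**2).sum(axis=0))
recon_cosine = num / den
print(recon_cosine)
```

On the real data the analogous comparison would be between `pcawg_profiles` and `W @ H`; samples with low similarity are the ones the model explains poorly.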

In [53]:
# Print de novo signatures
W

Unnamed: 0,Sig1,Sig2,Sig3,Sig4,Sig5,Sig6,Sig7,Sig8,Sig9,Sig10,Sig11,Sig12,Sig13,Sig14,Sig15,Sig16,Sig17
A[C>A]A,0.001453,0.005154,2.777027e-03,0.000822,0.000802,0.003175,5.096567e-04,0.001915,0.005053,0.001096,0.053333,0.018554,0.002990,2.093870e-03,0.002781,0.000485,0.000720
A[C>A]C,0.000801,0.002891,1.069933e-03,0.000563,0.000970,0.000878,3.465061e-03,0.002676,0.002719,0.001425,0.028191,0.014620,0.002468,1.187869e-03,0.002437,0.001409,0.006133
A[C>A]G,0.000178,0.000670,1.556058e-04,0.000093,0.000109,0.000206,1.206555e-07,0.000223,0.000300,0.000323,0.012906,0.001785,0.000314,1.087896e-04,0.000119,0.000055,0.000192
A[C>A]T,0.000712,0.003841,2.084579e-03,0.001111,0.001024,0.001262,2.377382e-04,0.011501,0.002187,0.021487,0.033100,0.012940,0.000301,1.206473e-07,0.000500,0.007699,0.037702
C[C>A]A,0.001670,0.004663,1.926408e-07,0.002308,0.000934,0.003219,2.707843e-03,0.002899,0.002849,0.004347,0.072207,0.010944,0.004621,1.205542e-02,0.012000,0.003630,0.009492
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
G[T>G]T,0.000162,0.007487,5.285886e-03,0.002043,0.000227,0.005358,3.768347e-03,0.003668,0.071338,0.011885,0.000051,0.003996,0.004222,2.823110e-02,0.005268,0.001908,0.004953
T[T>G]A,0.000371,0.007842,7.887094e-03,0.001084,0.000281,0.004687,2.668796e-03,0.016589,0.005673,0.017006,0.000878,0.011893,0.001238,2.705641e-03,0.000102,0.000001,0.000103
T[T>G]C,0.000123,0.005014,8.710143e-03,0.000957,0.000283,0.004416,8.938456e-03,0.005874,0.007627,0.016258,0.000378,0.004193,0.000846,2.411043e-03,0.000305,0.000340,0.001504
T[T>G]G,0.000200,0.003962,4.763181e-03,0.000753,0.000346,0.004324,1.743479e-03,0.002418,0.005202,0.006887,0.002548,0.010034,0.002412,2.540582e-03,0.002912,0.000224,0.000207


In [54]:
# Print reference signature catalog
W_catalog

Type,SBS1,SBS2,SBS3,SBS4,SBS5,SBS6,SBS7a,SBS7b,SBS7c,SBS7d,...,SBS91,SBS92,SBS93,SBS94,SBS95,SBS96,SBS98,SBS97,SBS99,SBS100
A[C>A]A,8.861572e-04,5.800168e-07,0.020808,0.042196,0.011998,0.000425,6.704351e-05,0.002329,0.004830,0.000040,...,0.002945,0.011329,0.011573,0.015580,0.014191,0.002303,0.013372,3.393376e-04,0.010959,0.004213
A[C>A]C,2.280405e-03,1.480043e-04,0.016507,0.033297,0.009438,0.000524,1.791162e-04,0.000461,0.001150,0.000765,...,0.052997,0.009745,0.008096,0.024746,0.004125,0.000252,0.010144,4.083738e-03,0.008626,0.020254
A[C>A]G,1.770314e-04,5.230151e-05,0.001751,0.015599,0.001850,0.000052,7.124623e-05,0.000186,0.000377,0.000250,...,0.000204,0.004697,0.001761,0.001574,0.001476,0.000000,0.002156,4.285896e-04,0.000852,0.023699
A[C>A]T,1.280227e-03,9.780282e-05,0.012205,0.029498,0.006609,0.000180,2.481610e-04,0.000710,0.001960,0.004049,...,0.000131,0.007758,0.008421,0.011076,0.001789,0.000000,0.012239,1.545020e-03,0.022806,0.039299
C[C>A]A,3.120554e-04,2.080060e-04,0.022509,0.080693,0.007429,0.001821,4.552955e-04,0.001140,0.000109,0.014498,...,0.008191,0.018550,0.006498,0.079926,0.002319,0.000000,0.007782,3.545584e-03,0.014845,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
G[T>G]T,1.460259e-05,2.230064e-16,0.005832,0.000252,0.002350,0.000787,8.355422e-04,0.001830,0.006751,0.006839,...,0.000054,0.000200,0.006148,0.002121,0.003204,0.004054,0.004398,1.199683e-07,0.012050,0.017845
T[T>G]A,2.230396e-16,1.670048e-05,0.007253,0.000377,0.005219,0.000105,1.280831e-04,0.000955,0.019302,0.000211,...,0.005955,0.002208,0.053674,0.004072,0.001073,0.003347,0.008400,2.296746e-04,0.069147,0.004601
T[T>G]C,5.510978e-05,7.040203e-05,0.006283,0.000174,0.006559,0.000287,1.160753e-04,0.001550,0.017401,0.000115,...,0.000143,0.000301,0.013276,0.001235,0.002069,0.001813,0.005467,3.090285e-04,0.015748,0.000000
T[T>G]G,5.831035e-04,9.540276e-05,0.008053,0.002320,0.006939,0.000324,2.231448e-16,0.001350,0.007641,0.000125,...,0.000628,0.001743,0.012705,0.003048,0.006678,0.000000,0.007531,3.340668e-03,0.021036,0.019134


In [55]:
# Exposure matrix: estimated number of mutations attributed to each signature in each sample
H

Unnamed: 0,Biliary-AdenoCA::SP99325,Bladder-TCC::SP1059,Bladder-TCC::SP96136,Breast-AdenoCA::SP117369,Breast-AdenoCA::SP2293,CNS-GBM::SP25494,ColoRect-AdenoCA::SP18310,ColoRect-AdenoCA::SP17172,ColoRect-AdenoCA::SP96133,ColoRect-AdenoCA::SP110242,...,Stomach-AdenoCA::SP84392,Stomach-AdenoCA::SP105018,Stomach-AdenoCA::SP84439,Stomach-AdenoCA::SP84922,Uterus-AdenoCA::SP90209,Uterus-AdenoCA::SP93540,Uterus-AdenoCA::SP93227,Uterus-AdenoCA::SP94933,Uterus-AdenoCA::SP92364,Uterus-AdenoCA::SP92659
Sig1,3634.550421,24646.366227,29683.325454,54999.173197,66237.709946,0.0,377.438604,550.234451,0.0,0.0,...,377.072633,6842.701158,0.0,1450.493847,0.0,0.0,0.0,0.0,0.0,0.0
Sig2,7163.846364,754.539087,0.0,0.0,0.0,412.803271,0.0,2980.707202,1195.004099,0.0,...,2803.289484,3472.971718,9755.462111,0.0,0.0,0.0,0.0,3682.543905,0.0,0.0
Sig3,0.0,0.0,7366.93527,7292.272768,3002.674888,0.0,0.0,246.002896,401.859867,1961.355355,...,0.0,0.0,2105.901145,0.0,0.0,0.0,0.0,2214.722906,0.0,0.0
Sig4,0.0,0.0,3612.336094,2585.669972,1559.568737,11049.442569,485.757713,25.998119,246.898125,0.0,...,0.0,0.0,0.0,281.880282,1304.316796,0.0,0.0,0.0,954.759452,0.0
Sig5,4278.068532,0.0,0.0,0.0,0.0,237879.471277,277.072541,0.0,1476.931492,2343.970208,...,1794.660988,0.0,5287.104187,0.0,0.0,3889.462207,0.0,0.0,255.177696,0.0
Sig6,0.0,0.0,0.0,0.0,0.0,0.0,1797.736705,549.883906,0.0,15.838711,...,0.0,84.493597,7602.061258,28.676633,2262.895554,0.0,0.0,0.0,0.0,0.0
Sig7,5621.365506,0.0,0.0,0.0,0.0,0.0,1015.160644,1526.723787,228.65669,0.0,...,472.526352,33.953288,0.0,153.899886,0.0,0.0,0.0,0.0,132.578266,0.0
Sig8,0.0,321.930479,0.0,0.0,0.0,0.0,1073.682492,954.493802,1526.645321,0.0,...,684.507835,73.597204,1692.048008,71.532428,7926.531924,0.0,2391.933702,487.792373,1336.949478,281834.474304
Sig9,0.0,841.503116,148.792343,0.0,0.0,0.0,0.0,4771.235731,4638.086293,2903.367416,...,5617.882928,1391.534094,4853.612899,58287.785982,0.0,0.0,0.0,911.7483,2397.418958,0.0
Sig10,1961.997166,0.0,749.483914,0.0,0.0,0.0,0.0,1262.344727,0.0,0.0,...,607.635728,942.284143,0.0,8014.062276,0.0,0.0,0.0,3599.693567,1422.669364,0.0


## Part 2: Cancer Type-Specific Signature Analysis

In [56]:
# TODO: Assign each de novo (W) signature to the closest COSMIC signature based on cosine similarity, if possible. Rename the others as New_{i} (i in 1,2,3,...). Visualize pairs of de novo and reference signatures with high and low matching.
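
One possible starting point for this task is plain best-match cosine similarity, sketched below on toy signatures. This is a simplified stand-in and not necessarily how MuSiCal itself matches signatures. The function `match_signatures`, the threshold of 0.8, and the toy names `SBS_A` and `SBS_B` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def match_signatures(W_denovo, W_ref, threshold=0.8):
    """Assign each de novo signature to its most similar reference signature.

    Signatures whose best cosine similarity falls below `threshold`
    (an arbitrary cutoff) are renamed New_1, New_2, ...
    """
    # Reorder the reference rows so channels line up with the de novo matrix.
    W_ref = W_ref.loc[W_denovo.index]
    A = W_denovo.to_numpy(dtype=float)
    B = W_ref.to_numpy(dtype=float)
    # Normalize columns to unit length; dot products are then cosine similarities.
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    B = B / np.linalg.norm(B, axis=0, keepdims=True)
    sim = pd.DataFrame(A.T @ B, index=W_denovo.columns, columns=W_ref.columns)

    names, n_new = {}, 0
    for sig in sim.index:
        if sim.loc[sig].max() >= threshold:
            names[sig] = sim.loc[sig].idxmax()
        else:
            n_new += 1
            names[sig] = f"New_{n_new}"
    return sim, pd.Series(names, name="assigned")

# Toy example with four channels (hypothetical values).
channels = ["c1", "c2", "c3", "c4"]
W_ref_toy = pd.DataFrame(
    {"SBS_A": [0.7, 0.1, 0.1, 0.1], "SBS_B": [0.1, 0.1, 0.1, 0.7]}, index=channels
)
W_denovo_toy = pd.DataFrame(
    {"Sig1": [0.68, 0.12, 0.10, 0.10], "Sig2": [0.10, 0.45, 0.45, 0.00]},
    index=channels,
)

sim, assigned = match_signatures(W_denovo_toy, W_ref_toy)
print(sim.round(3))
print(assigned)
```

Note that `scipy.spatial.distance.cosine`, imported at the top of this notebook, returns a distance, so the similarity between two vectors is `1 - cosine(u, v)`. Several de novo signatures can map to the same reference under this scheme, which is worth checking before renaming.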

In [57]:
# TODO: Display the total burden per patient (H columns) grouped by re-annotated signatures
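
Once each de novo signature has a new name, exposures of signatures that share a name can be summed. A minimal sketch, using a toy exposure matrix and a hypothetical renaming dictionary `assigned` (the mapping shown is invented, not a result from the PCAWG model):

```python
import pandas as pd

# Toy exposure matrix (signatures x samples), hypothetical values.
H_toy = pd.DataFrame(
    [[100.0, 0.0], [50.0, 20.0], [0.0, 300.0]],
    index=["Sig1", "Sig2", "Sig3"],
    columns=["Lung-SCC::SP1", "Skin-Melanoma::SP2"],
)

# Hypothetical renaming: two de novo signatures map to the same reference.
assigned = {"Sig1": "SBS4", "Sig2": "SBS4", "Sig3": "SBS7a"}

# Group rows by their assigned name and sum exposures within each group.
H_annot = H_toy.groupby(assigned).sum()

# Total burden per patient across all signatures.
total_burden = H_annot.sum(axis=0)

print(H_annot)
print(total_burden)
```

From here, `H_annot.T.plot(kind="bar", stacked=True)` is one way to draw the per-patient burden as stacked bars, one color per signature.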

## Part 3: COSMIC SBS Database Integration

In [58]:
# COSMIC SBS signature annotations
cosmic_sbs_annotations = {
    'SBS1': 'Age-related (spontaneous deamination of 5-methylcytosine)',
    'SBS2': 'APOBEC cytidine deaminase activity',
    'SBS3': 'Homologous recombination deficiency',
    'SBS4': 'Tobacco smoking',
    'SBS5': 'Age-related (unknown mechanism)',
    'SBS6': 'Mismatch repair deficiency',
    'SBS7a': 'UV radiation exposure',
    'SBS7b': 'UV radiation exposure',
    'SBS8': 'Unknown etiology',
    'SBS9': 'Polymerase η somatic hypermutation',
    'SBS10a': 'POLE exonuclease domain mutations',
    'SBS10b': 'POLE exonuclease domain mutations',
    'SBS11': 'Alkylating agents',
    'SBS12': 'Unknown etiology',
    'SBS13': 'APOBEC cytidine deaminase activity',
    'SBS14': 'Concurrent POLE mutation and mismatch repair deficiency',
    'SBS15': 'Mismatch repair deficiency',
    'SBS16': 'Unknown etiology',
    'SBS17a': 'Unknown etiology',
    'SBS17b': 'Unknown etiology',
    'SBS18': 'Damage by reactive oxygen species',
    'SBS19': 'Unknown etiology',
    'SBS20': 'Mismatch repair deficiency and POLD1 mutations',
    # Add more signatures as needed...
}

# TODO: Map signature names to COSMIC annotations
# Interpret mutation burdens stratified by signature contribution in different cancer types.
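
A possible way to approach this step is to convert exposures to per-sample fractions, average them within each cancer type, and attach the etiology text. The sketch below uses toy values and a two-entry `annotations` dictionary as a stand-in for `cosmic_sbs_annotations`; the fallback label for unmatched signatures is an illustrative choice.

```python
import pandas as pd

# Stand-in for cosmic_sbs_annotations (two entries copied from above).
annotations = {"SBS4": "Tobacco smoking", "SBS7a": "UV radiation exposure"}

# Toy re-annotated exposure matrix (signatures x samples), hypothetical values.
H_annot = pd.DataFrame(
    [[150.0, 90.0, 0.0], [0.0, 10.0, 300.0], [50.0, 0.0, 0.0]],
    index=["SBS4", "SBS7a", "New_1"],
    columns=["Lung-SCC::SP1", "Lung-SCC::SP2", "Skin-Melanoma::SP3"],
)

# Fraction of each sample's mutations attributed to each signature.
fractions = H_annot / H_annot.sum(axis=0)

# Average the fractions over samples of the same cancer type.
mean_by_type = (
    fractions.T.groupby(lambda sample: sample.split("::")[0]).mean().T
)

# Attach the etiology; signatures without an entry get a placeholder label.
etiology = (
    mean_by_type.index.to_series()
    .map(annotations)
    .fillna("No annotation (novel or unlisted)")
)
mean_by_type.insert(0, "etiology", etiology)
print(mean_by_type)
```

Working with fractions rather than raw counts keeps hypermutated samples from dominating the per-type averages, although raw burdens remain informative in their own right.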

## Expected Outputs

By the end of this exercise, you should have generated:

1. **Descriptive Statistics**: Summary tables of signature activities across cancer types
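2. **Visualizations**: Plots of matched de novo and reference signature pairs, and per-patient mutation burdens stratified by signature
3. **Statistical Results**: Comparisons of signature distributions across cancer types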
4. **Biological Interpretations**: COSMIC-based annotation of active signatures
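
For the statistical results, one option among many is a Kruskal-Wallis test, which asks whether the fraction of mutations attributed to a given signature differs across cancer types without assuming normality. The sketch below runs it on invented fractions for a single signature; the sample labels and values are hypothetical.

```python
import pandas as pd
from scipy.stats import kruskal

# Hypothetical per-sample fractions for one signature.
fractions = pd.Series(
    [0.80, 0.90, 0.85, 0.00, 0.05, 0.10, 0.15, 0.20, 0.30],
    index=[
        "Lung-SCC::SP1", "Lung-SCC::SP2", "Lung-SCC::SP3",
        "Skin-Melanoma::SP4", "Skin-Melanoma::SP5", "Skin-Melanoma::SP6",
        "Stomach-AdenoCA::SP7", "Stomach-AdenoCA::SP8", "Stomach-AdenoCA::SP9",
    ],
)

# One array of fractions per cancer type.
groups = [
    values.to_numpy()
    for _, values in fractions.groupby(lambda sample: sample.split("::")[0])
]

# Rank-based test for a difference in distribution between the groups.
stat, pvalue = kruskal(*groups)
print(f"H = {stat:.2f}, p = {pvalue:.4f}")
```

In the 146-sample subset many cancer types have only one to three samples, so such a test is only meaningful for the larger groups, and p-values should be corrected for multiple testing when repeated over many signatures.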

## Resources

- [COSMIC Mutational Signatures Database](https://cancer.sanger.ac.uk/signatures/)
- [PCAWG Consortium Papers](https://www.nature.com/collections/afdejfafdb/)
- [Mutational Signatures Analysis Guidelines](https://github.com/AlexandrovLab/SigProfiler)