# 01 - Chimeric Spectra Analysis (PXD037527)

**Dataset**: WWA_30m (Wide Window Acquisition)

**Objective**: Analyze the distribution of chimeric spectra as a function of the isolation window width.

**Hypothesis**: Wider windows → more co-isolated peptides → more chimeric spectra → more fragmentation competition.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Paths - auto-detect
PROJECT_DIR = Path('.').resolve()
while PROJECT_DIR.name != 'v.3.0.0' and PROJECT_DIR != PROJECT_DIR.parent:
    PROJECT_DIR = PROJECT_DIR.parent

DATA_DIR = PROJECT_DIR / 'processed_data'
PLOT_DIR = PROJECT_DIR / 'plots' / '01_chimeric_analysis'
PLOT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Project: {PROJECT_DIR}")
print(f"Data: {DATA_DIR}")
print(f"Plots: {PLOT_DIR}")

mkdir -p failed for path /user/antwerpen/211/vsc21150/.cache/matplotlib: [Errno 122] Disk quota exceeded: '/user/antwerpen/211/vsc21150/.cache/matplotlib'
Matplotlib created a temporary cache directory at /tmp/matplotlib-quexufss because there was an issue with the default path (/user/antwerpen/211/vsc21150/.cache/matplotlib); it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Fontconfig error: No writable cache directories


Project: /data/antwerpen/211/vsc21150/Exploring-Fragmentation-Competion-in-Proteomics-Data-to-Decode-Chimeric-Spectra/v.3.0.0
Data: /data/antwerpen/211/vsc21150/Exploring-Fragmentation-Competion-in-Proteomics-Data-to-Decode-Chimeric-Spectra/v.3.0.0/processed_data
Plots: /data/antwerpen/211/vsc21150/Exploring-Fragmentation-Competion-in-Proteomics-Data-to-Decode-Chimeric-Spectra/v.3.0.0/plots/01_chimeric_analysis
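
The import output above recommends setting `MPLCONFIGDIR` to a writable directory. A minimal sketch of one way to do that follows; the helper name `configure_matplotlib_cache` and the `matplotlib-cache` folder name are hypothetical, and the system temp directory is only an assumed fallback (a scratch directory on the cluster may be a better choice).

```python
import os
import tempfile
from pathlib import Path


def configure_matplotlib_cache(base_dir=None):
    """Create a writable cache directory and expose it via MPLCONFIGDIR."""
    # Hypothetical helper: falls back to the system temp directory
    # when no base directory is given.
    base = Path(base_dir) if base_dir is not None else Path(tempfile.gettempdir())
    cache_dir = base / "matplotlib-cache"
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["MPLCONFIGDIR"] = str(cache_dir)
    return cache_dir


# Must run before the first `import matplotlib` to take effect.
cache_dir = configure_matplotlib_cache()
print(f"MPLCONFIGDIR={cache_dir}")
```

Placing this call at the very top of the first cell, before the plotting imports, should avoid the fallback to a temporary cache on each kernel start.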


## 1. Load Data

In [2]:
# Load PSM data
df = pd.read_csv(DATA_DIR / 'psm_clean.csv')

print(f"Total PSMs: {len(df):,}")
print(f"Columns: {list(df.columns)}")
df.head()

Total PSMs: 1,276,641
Columns: ['Spectrum', 'Spectrum File', 'Peptide', 'Modified Peptide', 'Extended Peptide', 'Prev AA', 'Next AA', 'Peptide Length', 'Charge', 'Retention', 'Observed Mass', 'Calibrated Observed Mass', 'Observed M/Z', 'Calibrated Observed M/Z', 'Calculated Peptide Mass', 'Calculated M/Z', 'Delta Mass', 'SpectralSim', 'RTScore', 'Expectation', 'Hyperscore', 'Nextscore', 'Probability', 'Number of Enzymatic Termini', 'Number of Missed Cleavages', 'Protein Start', 'Protein End', 'Intensity', 'Assigned Modifications', 'Observed Modifications', 'Purity', 'Is Unique', 'Protein', 'Protein ID', 'Entry Name', 'Gene', 'Protein Description', 'Mapped Genes', 'Mapped Proteins', 'source_folder', 'window_mz', 'replicate', 'mzml_name', 'scan_number', 'charge_from_spectrum', 'spectrum_key', 'window_category', 'n_psm', 'is_chimeric', 'peptide_length']


Unnamed: 0,Spectrum,Spectrum File,Peptide,Modified Peptide,Extended Peptide,Prev AA,Next AA,Peptide Length,Charge,Retention,...,window_mz,replicate,mzml_name,scan_number,charge_from_spectrum,spectrum_key,window_category,n_psm,is_chimeric,peptide_length
0,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,interact-Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15...,REMDQTMAANAQK,REM[147]DQTM[147]AANAQK,EITISIIK.REMDQTMAANAQK.NKFIIDGF,K,N,13,3,2168.482,...,12.0,2,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,1173,3,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,medium,1,False,13
1,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,interact-Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15...,VKEDPDGEHAR,,SISGRPIK.VKEDPDGEHAR.RAMQKVMA,K,R,11,3,2170.9443,...,12.0,2,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,1183,3,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,medium,1,False,11
2,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,interact-Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15...,NEEDEGHSNSSPR,,GAKIDASK.NEEDEGHSNSSPR.HSEAATAQ,K,H,13,3,2171.4343,...,12.0,2,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,1185,3,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,medium,1,False,13
3,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,interact-Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15...,GTSPSSSSRPQR,,FTQFKRIK.GTSPSSSSRPQR.VIEDRDSQ,K,V,12,3,2174.5068,...,12.0,2,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,1198,3,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,medium,2,True,12
4,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,interact-Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15...,VKEDPDGEHAR,,SISGRPIK.VKEDPDGEHAR.RAMQKVMA,K,R,11,3,2174.5068,...,12.0,2,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,1198,3,Ex_AuLC1_30m_2D19_3_20um30cm_SPE50_15118120_OT...,medium,2,True,11
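
In the preview above, rows 3 and 4 share scan 1198 and both carry `n_psm = 2` and `is_chimeric = True`. The sketch below shows one way such columns could be derived from a PSM table. It is an assumption about the preprocessing, not the pipeline that produced `psm_clean.csv`, and the shortened `spectrum_key` values are made up for illustration.

```python
import pandas as pd

# Toy PSM table: peptides taken from the preview, keys shortened (hypothetical)
psms = pd.DataFrame({
    "spectrum_key": ["f1_1173", "f1_1183", "f1_1185", "f1_1198", "f1_1198"],
    "Peptide": ["REMDQTMAANAQK", "VKEDPDGEHAR", "NEEDEGHSNSSPR",
                "GTSPSSSSRPQR", "VKEDPDGEHAR"],
})

# Number of PSMs that share the same MS2 spectrum, repeated on every row
psms["n_psm"] = psms.groupby("spectrum_key")["Peptide"].transform("count")

# A spectrum is treated as chimeric when it has at least two PSMs
psms["is_chimeric"] = psms["n_psm"] >= 2

print(psms)
```

Because the counts are broadcast back to every PSM row, any later `mean()` over these columns is weighted by the number of PSMs per spectrum.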


In [3]:
# Basic stats
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"Total PSMs:          {len(df):,}")
print(f"Unique spectra:      {df['spectrum_key'].nunique():,}")
print(f"Unique peptides:     {df['Peptide'].nunique():,}")
print(f"mzML files:          {df['mzml_name'].nunique()}")
print(f"Window sizes:        {sorted(df['window_mz'].unique())}")
print(f"\nChimeric spectra:    {df[df['is_chimeric']]['spectrum_key'].nunique():,} ({100*df['is_chimeric'].mean():.1f}% of PSMs)")

DATASET OVERVIEW
Total PSMs:          1,276,641
Unique spectra:      693,596
Unique peptides:     26,266
mzML files:          58
Window sizes:        [1.6, 2.0, 4.0, 8.0, 12.0, 18.0, 24.0, 48.0]

Chimeric spectra:    355,592 (73.5% of PSMs)
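
The percentage printed next to the chimeric-spectra count is the mean of `is_chimeric` over PSM rows, so it is a PSM-level share rather than a share of spectra. The sketch below contrasts the two on a made-up toy table, then applies the spectrum-level ratio to the two counts reported above.

```python
import pandas as pd

# Toy data: three spectra with 1, 2 and 3 PSMs respectively
toy = pd.DataFrame({
    "spectrum_key": ["s1", "s2", "s2", "s3", "s3", "s3"],
    "is_chimeric": [False, True, True, True, True, True],
})

# PSM-level share: every PSM row counts once
psm_level = toy["is_chimeric"].mean()

# Spectrum-level share: every spectrum counts once
spectrum_level = toy.groupby("spectrum_key")["is_chimeric"].first().mean()

print(f"PSM-level:      {100 * psm_level:.1f}%")
print(f"Spectrum-level: {100 * spectrum_level:.1f}%")

# Same idea with the counts from the dataset overview
spectrum_pct = 100 * 355_592 / 693_596
print(f"Chimeric share of unique spectra: {spectrum_pct:.1f}%")
```

With the reported counts, roughly half of the unique spectra are chimeric, while about three quarters of the PSMs come from chimeric spectra.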


## 2. Summary Statistics per Window Size

In [4]:
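# Aggregate by window size.
# Note: df has one row per PSM, so n_chimeric, pct_chimeric and
# avg_psm_per_spectrum are PSM-level quantities (a spectrum with k PSMs
# contributes k rows to each sum and mean).
window_stats = df.groupby('window_mz').agg(
    n_psm=('spectrum_key', 'count'),
    n_spectra=('spectrum_key', 'nunique'),
    n_chimeric=('is_chimeric', 'sum'),
    pct_chimeric=('is_chimeric', 'mean'),
    avg_psm_per_spectrum=('n_psm', 'mean'),
    n_peptides=('Peptide', 'nunique'),
).reset_index()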

window_stats['pct_chimeric'] = (window_stats['pct_chimeric'] * 100).round(1)
window_stats['avg_psm_per_spectrum'] = window_stats['avg_psm_per_spectrum'].round(2)

# Add window category (bins are right-inclusive: (0, 4], (4, 12], (12, 100])
window_stats['category'] = pd.cut(
    window_stats['window_mz'],
    bins=[0, 4, 12, 100],
    labels=['Narrow (≤4)', 'Medium (8-12)', 'Wide (≥18)']
)

print("Statistics per Window Size:")
window_stats

Statistics per Window Size:


Unnamed: 0,window_mz,n_psm,n_spectra,n_chimeric,pct_chimeric,avg_psm_per_spectrum,n_peptides,category
0,1.6,23514,19623,7343,31.2,1.37,10015,Narrow (≤4)
1,2.0,92099,73211,34638,37.6,1.49,15378,Narrow (≤4)
2,4.0,130544,87566,74649,57.2,1.87,17970,Narrow (≤4)
3,8.0,176538,99058,127749,72.4,2.28,19018,Medium (8-12)
4,12.0,197742,103229,152800,77.3,2.45,17782,Medium (8-12)
5,18.0,222201,107548,181045,81.5,2.64,17149,Wide (≥18)
6,24.0,220587,104971,181471,82.3,2.68,15111,Wide (≥18)
7,48.0,213416,98390,178942,83.8,2.75,11244,Wide (≥18)
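
Since the aggregation runs over PSM rows, `n_chimeric` counts PSMs and both `pct_chimeric` and `avg_psm_per_spectrum` are weighted by the number of PSMs per spectrum. For the 1.6 m/z window, for example, 23,514 PSMs over 19,623 spectra is about 1.20 PSMs per spectrum, whereas the table reports 1.37. A possible spectrum-level variant is sketched below on made-up toy data; the function name `spectrum_level_stats` is hypothetical.

```python
import pandas as pd


def spectrum_level_stats(psm_df):
    """Aggregate per window with one row per spectrum instead of one per PSM."""
    # Keep a single row per spectrum; n_psm and is_chimeric are
    # constant within a spectrum, so any row will do.
    spectra = psm_df.drop_duplicates("spectrum_key")
    out = spectra.groupby("window_mz").agg(
        n_spectra=("spectrum_key", "size"),
        n_chimeric_spectra=("is_chimeric", "sum"),
        pct_chimeric_spectra=("is_chimeric", "mean"),
        avg_psm_per_spectrum=("n_psm", "mean"),
    ).reset_index()
    out["pct_chimeric_spectra"] = out["pct_chimeric_spectra"] * 100
    return out


# Toy PSM table (made-up values)
toy = pd.DataFrame({
    "spectrum_key": ["a", "b", "b", "c", "d", "d", "d"],
    "window_mz": [2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0],
    "n_psm": [1, 2, 2, 1, 3, 3, 3],
    "is_chimeric": [False, True, True, False, True, True, True],
})

result = spectrum_level_stats(toy)
print(result)
```

On the toy table the spectrum-level chimeric share is 50% in both windows, while the PSM-level mean of `is_chimeric` would give about 67% and 75%.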


## 3. Visualization: Chimericity by Window Size

### Plot 1: % Chimeric Spectra vs Window Size

In [5]:
fig, ax = plt.subplots(figsize=(10, 6))

# Colors by category
colors = {'Narrow (≤4)': '#2ecc71', 'Medium (8-12)': '#f39c12', 'Wide (≥18)': '#e74c3c'}
bar_colors = [colors[cat] for cat in window_stats['category']]

bars = ax.bar(window_stats['window_mz'].astype(str), 
              window_stats['pct_chimeric'],
              color=bar_colors,
              edgecolor='black',
              linewidth=1.2)

# Add value labels
for bar, val in zip(bars, window_stats['pct_chimeric']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            f'{val:.1f}%', ha='center', va='bottom', fontsize=11, fontweight='bold')

ax.set_xlabel('Isolation Window (m/z)', fontsize=14)
ax.set_ylabel('Chimeric Spectra (%)', fontsize=14)
ax.set_title('Chimericity Increases with Isolation Window Width', fontsize=16, fontweight='bold')
ax.set_ylim(0, 100)

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=l, edgecolor='black') for l, c in colors.items()]
ax.legend(handles=legend_elements, loc='upper left', fontsize=11)

plt.tight_layout()
plt.savefig(PLOT_DIR / 'plot1_chimericity_by_window.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✅ Saved: {PLOT_DIR / 'plot1_chimericity_by_window.png'}")

<Figure size 1000x600 with 1 Axes>

✅ Saved: /data/antwerpen/211/vsc21150/Exploring-Fragmentation-Competion-in-Proteomics-Data-to-Decode-Chimeric-Spectra/v.3.0.0/plots/01_chimeric_analysis/plot1_chimericity_by_window.png


### Plot 2: Distribution of PSMs per Spectrum (by Window Category)

In [6]:
# Get spectrum-level data (one row per spectrum)
spectrum_df = df.groupby(['spectrum_key', 'window_mz', 'window_category']).agg(
    n_psm=('Spectrum', 'count')
).reset_index()

# Cap at 5+ for visualization (cast to str first so the column keeps a single dtype)
spectrum_df['n_psm_capped'] = (
    spectrum_df['n_psm'].clip(upper=5).astype(str).replace('5', '5+')
)

print(f"Spectra with 5+ PSMs: {(spectrum_df['n_psm'] >= 5).sum():,}")

Spectra with 5+ PSMs: 13,884
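
The capping step can also be written so that the labels keep their natural order when plotted. The following is a small sketch on made-up counts; `cap_counts` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def cap_counts(counts, cap=5):
    """Turn integer counts into ordered labels '1', '2', ..., '<cap>+'."""
    labels = [str(i) for i in range(1, cap)] + [f"{cap}+"]
    # Clip first, then cast to str, so no string is written into an int column
    capped = counts.clip(upper=cap).astype(str).replace(str(cap), f"{cap}+")
    return pd.Categorical(capped, categories=labels, ordered=True)


toy_counts = pd.Series([1, 2, 5, 7, 3])
capped = cap_counts(toy_counts)
print(list(capped))
```

An ordered categorical keeps `'5+'` after `'4'` on a bar chart axis even when some intermediate counts are absent from a subset.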




In [7]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)

categories = ['narrow', 'medium', 'wide']
titles = ['Narrow (1.6-4 m/z)', 'Medium (8-12 m/z)', 'Wide (18-48 m/z)']
colors_cat = ['#2ecc71', '#f39c12', '#e74c3c']

for ax, cat, title, color in zip(axes, categories, titles, colors_cat):
    subset = spectrum_df[spectrum_df['window_category'] == cat]
    
    # Count distribution
    dist = subset['n_psm'].clip(upper=5).value_counts().sort_index()
    dist.index = [str(i) if i < 5 else '5+' for i in dist.index]
    
    bars = ax.bar(dist.index, dist.values, color=color, edgecolor='black', linewidth=1.2)
    
    # Add percentages
    total = dist.sum()
    for bar, val in zip(bars, dist.values):
        pct = 100 * val / total
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + total*0.01,
                f'{pct:.1f}%', ha='center', va='bottom', fontsize=10)
    
    ax.set_xlabel('PSMs per Spectrum', fontsize=12)
    ax.set_title(title, fontsize=13, fontweight='bold')
    
    # Stats annotation
    chimeric_pct = 100 * (subset['n_psm'] >= 2).mean()
    ax.text(0.95, 0.95, f'Chimeric: {chimeric_pct:.1f}%\nN={len(subset):,}',
            transform=ax.transAxes, ha='right', va='top', fontsize=10,
            bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

axes[0].set_ylabel('Number of Spectra', fontsize=12)

fig.suptitle('Distribution of PSMs per Spectrum by Window Category', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(PLOT_DIR / 'plot2_psm_distribution_by_category.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✅ Saved: {PLOT_DIR / 'plot2_psm_distribution_by_category.png'}")

<Figure size 1500x500 with 3 Axes>

✅ Saved: /data/antwerpen/211/vsc21150/Exploring-Fragmentation-Competion-in-Proteomics-Data-to-Decode-Chimeric-Spectra/v.3.0.0/plots/01_chimeric_analysis/plot2_psm_distribution_by_category.png
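
The percentages annotated on each panel can also be tabulated directly, which may be handy for quoting numbers in text. Below is a sketch using `pd.crosstab` with row normalization on made-up toy data; the column names mirror `spectrum_df`, but the values are illustrative only.

```python
import pandas as pd

# Toy spectrum-level table (made-up values)
toy = pd.DataFrame({
    "window_category": ["narrow"] * 4 + ["wide"] * 4,
    "n_psm": [1, 1, 1, 2, 1, 2, 3, 6],
})

# Cap counts at 5, then compute the share of each count within a category
capped = toy["n_psm"].clip(upper=5)
share = pd.crosstab(toy["window_category"], capped, normalize="index") * 100

# Relabel the capped bin so it reads '5+'
share.columns = [str(c) if c < 5 else "5+" for c in share.columns]
print(share.round(1))
```

Each row sums to 100, so the table reads as the per-category distribution shown in the bar charts.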


### Plot 3: Chimericity Trend Line with Confidence

In [9]:
fig, ax = plt.subplots(figsize=(10, 6))

# Main line
ax.plot(window_stats['window_mz'], window_stats['pct_chimeric'], 
        'o-', color='#3498db', markersize=12, linewidth=2.5, 
        markeredgecolor='black', markeredgewidth=1.5, label='Observed')

# Fill area
ax.fill_between(window_stats['window_mz'], 0, window_stats['pct_chimeric'],
                alpha=0.3, color='#3498db')

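# Trend line (polynomial fit)
from numpy.polynomial import polynomial as P
x = window_stats['window_mz'].values
y = window_stats['pct_chimeric'].values
# Degree 2 is an assumption; the fit degree is not specified anywhere in this notebook
coef = P.polyfit(x, y, 2)
x_smooth = np.linspace(x.min(), x.max(), 100)
y_smooth = P.polyval(x_smooth, coef)  # evaluated on a grid but not drawn below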

# Annotations
ax.axhline(y=50, color='gray', linestyle=':', alpha=0.7)
ax.text(48, 51, '50% threshold', fontsize=10, color='gray')

# Labels
ax.set_xlabel('Isolation Window (m/z)', fontsize=14)
ax.set_ylabel('Chimeric Spectra (%)', fontsize=14)
ax.set_title('Chimericity vs Isolation Window Width\nWWA Dataset (PXD037527)', 
             fontsize=16, fontweight='bold')
ax.set_ylim(0, 100)
ax.set_xlim(0, 52)
ax.legend(loc='lower right', fontsize=11)

# Add window category regions
ax.axvspan(0, 4, alpha=0.1, color='green', label='Narrow')
ax.axvspan(4, 12, alpha=0.1, color='orange', label='Medium')
ax.axvspan(12, 52, alpha=0.1, color='red', label='Wide')

plt.tight_layout()
plt.savefig(PLOT_DIR / 'plot3_chimericity_trend.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✅ Saved: {PLOT_DIR / 'plot3_chimericity_trend.png'}")

<Figure size 1000x600 with 1 Axes>

✅ Saved: /data/antwerpen/211/vsc21150/Exploring-Fragmentation-Competion-in-Proteomics-Data-to-Decode-Chimeric-Spectra/v.3.0.0/plots/01_chimeric_analysis/plot3_chimericity_trend.png
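
For reference, a standalone polynomial trend fit on the per-window percentages could look like the sketch below. The x and y values are copied from the statistics table; the degree (2) is an assumption, and a quadratic is not necessarily a good model for a curve that saturates at wide windows.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Values copied from the per-window statistics table
x = np.array([1.6, 2.0, 4.0, 8.0, 12.0, 18.0, 24.0, 48.0])
y = np.array([31.2, 37.6, 57.2, 72.4, 77.3, 81.5, 82.3, 83.8])

deg = 2  # assumed degree
coef = P.polyfit(x, y, deg)  # coefficients ordered from low to high power

# Evaluate the fitted polynomial on a dense grid for plotting
x_smooth = np.linspace(x.min(), x.max(), 100)
y_smooth = P.polyval(x_smooth, coef)

# Coefficient of determination as a rough check of fit quality
residuals = y - P.polyval(x, coef)
r2 = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

print(f"coefficients (low to high order): {coef}")
print(f"R^2 = {r2:.3f}")
```

Drawing the curve would then only need something like `ax.plot(x_smooth, y_smooth, '--', label='Polynomial fit')` before the legend call.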


## 4. Summary Table for Thesis

In [10]:
# Create clean summary table
summary_table = window_stats[['window_mz', 'category', 'n_spectra', 'n_psm', 'pct_chimeric', 'avg_psm_per_spectrum']].copy()
summary_table.columns = ['Window (m/z)', 'Category', 'Spectra', 'PSMs', '% Chimeric', 'Avg PSM/Spectrum']

print("\n" + "="*80)
print("TABLE: Chimericity Statistics by Isolation Window (WWA Dataset)")
print("="*80)
print(summary_table.to_string(index=False))
print("="*80)

# Save as CSV
summary_table.to_csv(DATA_DIR / 'chimericity_by_window.csv', index=False)
print(f"\n✅ Saved: {DATA_DIR / 'chimericity_by_window.csv'}")


TABLE: Chimericity Statistics by Isolation Window (WWA Dataset)
 Window (m/z)      Category  Spectra   PSMs  % Chimeric  Avg PSM/Spectrum
          1.6   Narrow (≤4)    19623  23514        31.2              1.37
          2.0   Narrow (≤4)    73211  92099        37.6              1.49
          4.0   Narrow (≤4)    87566 130544        57.2              1.87
          8.0 Medium (8-12)    99058 176538        72.4              2.28
         12.0 Medium (8-12)   103229 197742        77.3              2.45
         18.0    Wide (≥18)   107548 222201        81.5              2.64
         24.0    Wide (≥18)   104971 220587        82.3              2.68
         48.0    Wide (≥18)    98390 213416        83.8              2.75

✅ Saved: /data/antwerpen/211/vsc21150/Exploring-Fragmentation-Competion-in-Proteomics-Data-to-Decode-Chimeric-Spectra/v.3.0.0/processed_data/chimericity_by_window.csv
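
The table reports point estimates only. A rough uncertainty range per window can be sketched with a normal-approximation interval for a proportion, using the `n_chimeric` and `n_psm` counts from the statistics table. This treats PSMs as independent draws, which is an assumption that does not strictly hold because PSMs from the same spectrum are correlated, so the intervals are likely too narrow.

```python
import math


def proportion_ci(successes, total, z=1.96):
    """Normal-approximation confidence interval for a proportion (about 95% for z=1.96)."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# (n_chimeric, n_psm) for the narrowest and widest windows, from the table
counts = {1.6: (7343, 23514), 48.0: (178942, 213416)}

intervals = {w: proportion_ci(k, n) for w, (k, n) in counts.items()}
for window, (p, lo, hi) in intervals.items():
    print(f"{window:>5} m/z: {100*p:.1f}% [{100*lo:.1f}, {100*hi:.1f}]")
```

With counts this large the intervals are narrow, so the difference between windows is unlikely to be a sampling artifact even under a more conservative model.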


## 5. Key Findings Summary

In [11]:
# Calculate key metrics
narrow = window_stats[window_stats['window_mz'] <= 4]['pct_chimeric'].mean()
medium = window_stats[(window_stats['window_mz'] > 4) & (window_stats['window_mz'] <= 12)]['pct_chimeric'].mean()
wide = window_stats[window_stats['window_mz'] > 12]['pct_chimeric'].mean()

print("\n" + "="*60)
print("KEY FINDINGS")
print("="*60)
print(f"""
1. CHIMERICITY INCREASES WITH WINDOW SIZE:
   • Narrow (1.6-4 m/z):   {narrow:.1f}% chimeric
   • Medium (8-12 m/z):    {medium:.1f}% chimeric
   • Wide (18-48 m/z):     {wide:.1f}% chimeric

2. INCREASE FACTOR:
   • Wide vs Narrow: {wide/narrow:.1f}x more chimeric
   • From 1.6 m/z to 48 m/z: {window_stats['pct_chimeric'].iloc[-1] - window_stats['pct_chimeric'].iloc[0]:.1f} percentage-point increase

3. IMPLICATIONS FOR FRAGMENTATION COMPETITION:
   • More co-isolated peptides → more competition for MS2 signal
   • MS1share may NOT equal fragShare in wide windows
   • LASSO deconvolution needed to separate peptide contributions

4. DATASET SIZE:
   • Total spectra: {df['spectrum_key'].nunique():,}
   • Chimeric spectra: {df[df['is_chimeric']]['spectrum_key'].nunique():,} (for analysis)
   • Singleton spectra: {df[~df['is_chimeric']]['spectrum_key'].nunique():,} (for validation)
""")
print("="*60)


KEY FINDINGS

1. CHIMERICITY INCREASES WITH WINDOW SIZE:
   • Narrow (1.6-4 m/z):   42.0% chimeric
   • Medium (8-12 m/z):    74.8% chimeric
   • Wide (18-48 m/z):     82.5% chimeric

2. INCREASE FACTOR:
   • Wide vs Narrow: 2.0x more chimeric
   • From 1.6 m/z to 48 m/z: 52.6 percentage-point increase

3. IMPLICATIONS FOR FRAGMENTATION COMPETITION:
   • More co-isolated peptides → more competition for MS2 signal
   • MS1share may NOT equal fragShare in wide windows
   • LASSO deconvolution needed to separate peptide contributions

4. DATASET SIZE:
   • Total spectra: 693,596
   • Chimeric spectra: 355,592 (for analysis)
   • Singleton spectra: 338,004 (for validation)
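
The category percentages in finding 1 are unweighted means of the per-window values, so a small window (1.6 m/z, 23,514 PSMs) counts as much as a large one. The sketch below reproduces that calculation from the statistics table and, for comparison, computes a pooled figure from the summed counts. The pooled variant is an illustration, not a number reported in this notebook.

```python
# (window_mz, n_psm, n_chimeric, pct_chimeric) copied from the statistics table
rows = [
    (1.6, 23514, 7343, 31.2),
    (2.0, 92099, 34638, 37.6),
    (4.0, 130544, 74649, 57.2),
    (8.0, 176538, 127749, 72.4),
    (12.0, 197742, 152800, 77.3),
    (18.0, 222201, 181045, 81.5),
    (24.0, 220587, 181471, 82.3),
    (48.0, 213416, 178942, 83.8),
]


def category(window_mz):
    """Same thresholds as the key-metrics cell: <=4 narrow, <=12 medium, else wide."""
    if window_mz <= 4:
        return "narrow"
    if window_mz <= 12:
        return "medium"
    return "wide"


unweighted, pooled = {}, {}
for cat in ("narrow", "medium", "wide"):
    subset = [r for r in rows if category(r[0]) == cat]
    # Unweighted mean of per-window percentages (what the notebook reports)
    unweighted[cat] = sum(r[3] for r in subset) / len(subset)
    # Pooled percentage: total chimeric PSMs over total PSMs in the category
    pooled[cat] = 100 * sum(r[2] for r in subset) / sum(r[1] for r in subset)

for cat in unweighted:
    print(f"{cat:>6}: unweighted {unweighted[cat]:.1f}%, pooled {pooled[cat]:.1f}%")
print(f"wide / narrow (unweighted): {unweighted['wide'] / unweighted['narrow']:.1f}x")
```

For the narrow category the pooled value is noticeably higher than the unweighted mean, because the 4.0 m/z window contributes the most PSMs.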



## 6. Export Plots Summary

In [12]:
print("\n" + "="*60)
print("GENERATED PLOTS")
print("="*60)
for f in sorted(PLOT_DIR.glob('*.png')):
    print(f"  📊 {f.name}")
print("="*60)


GENERATED PLOTS
  📊 plot1_chimericity_by_window.png
  📊 plot2_psm_distribution_by_category.png
  📊 plot3_chimericity_trend.png
