# WELCOME TO THE ORCA NOTEBOOK - MS2 Auxiliary

Pipeline for the **O**bjective **R**elational **C**omparative **A**nalyses of mass spectral data, along with other data sources. All you need is a directory of mzML files to get started!

To run a cell of code, select the cell and then press **Shift + Enter**. The first cell loads the Python modules that the rest of the code needs in order to function.

In some of the cells below, you will need to supply information such as paths and parameters. These cells list the variables to be set at the top, followed by a line of '###', with the rest of the code below it. Please set all applicable variables above the '###' line. Tinkering with the code below the '###' line is highly encouraged (that is precisely why we chose to make the code available as a Jupyter Notebook), but it could break ORCA. If that appears to be the case, simply clone the repository again from GitHub to get back to working code.


This notebook was created several years after ORCA in order to dig deeper into MS2 fragmentation patterns. Initial analyses using GNPS Classic Molecular Networking (https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp) revealed a cluster of MS2 nodes that were networked together, suggesting that they were analogs. We put together the code below in an effort to understand how these nodes might be related.

In [None]:
import pandas as pd
import numpy as np
from pyteomics import mzml, auxiliary
import glob
import matplotlib 
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet, set_link_color_palette, fcluster
from collections import Counter

%matplotlib inline

**path_sample_directory**: Path to the directory that contains the sample files to be loaded. Note: these files must be in mzML format, NOT mzXML.

In [None]:
path_sample_directory = './CL001_ms'

################################################################################################################

# Collect only mzML files (mzXML files in the same directory are skipped)
fns = glob.glob(path_sample_directory + '/*.mzML')
fns.sort()
fns

Running the cell below pulls MS2 scans from every mzML file in the designated sample directory and arranges them into a dataframe in which each peak in each scan occupies one row. To facilitate the aggregate analyses that follow, each peak is also assigned a rounded m/z. The order of magnitude to which peak values are rounded is set with the **bin_OOM** variable. For example, if a peak has a recorded value of 100.222, setting **bin_OOM** to 0 will return 100, setting it to -1 will return 100.2, setting it to -2 will return 100.22, etc.
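As a quick check of the rounding convention described above, the sketch below applies the same `round(-...)` call used later in this notebook to a made-up peak value of 100.222. The names `example_mz` and `rounded_values` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical peak value, used only to illustrate the bin_OOM convention
example_mz = pd.Series([100.222])

# Series.round expects a number of decimal places, hence the sign flip
rounded_values = {}
for oom in (0, -1, -2):
    rounded_values[oom] = float(example_mz.round(-oom).iloc[0])

print(rounded_values)
```

An order of magnitude of 0 therefore bins peaks to whole m/z units, while negative values keep one or more decimal places.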

The resultant table will have 8 columns:

file: The name of the file from which the peak originates

scan: The number of the scan from which the peak originates

pmz: The precursor m/z for the scan from which the peak originates

mz: The m/z of the peak

rt: The retention time for the scan from which the peak originates

inten: The intensity of the peak

pmz_round: The precursor m/z for the scan from which the peak originates, rounded via the bin_OOM parameter

mz_round: The m/z of the peak, rounded via the bin_OOM parameter


In [None]:
bin_OOM = 0

#######################################################################################################################

pmz = []
rt = []
mz = []
inten = []
scan = []
file = []
for file_name in fns:
    print(file_name)
    with mzml.read(file_name) as reader:
        for spectrum in reader:
            if spectrum.get('ms level') == 2:
                pmz.extend([spectrum.get('precursorList').get('precursor')[0].get('isolationWindow').get('isolation window target m/z')] * len(spectrum.get('m/z array')))
                rt.extend([spectrum.get('scanList').get('scan')[0].get('scan start time')] * len(spectrum.get('m/z array')))
                mz.extend(spectrum.get('m/z array'))
                inten.extend(spectrum.get('intensity array'))
                scan.extend([spectrum.get('id').split('scan=')[1]] * len(spectrum.get('m/z array')))
                file.extend([file_name] * len(spectrum.get('m/z array')))

# Build the table column by column, then make sure every numeric column
# (including retention time) has a numeric dtype
data = pd.DataFrame({'file': file, 'scan': scan, 'pmz': pmz, 'mz': mz, 'rt': rt, 'inten': inten})
for col in ['pmz', 'mz', 'rt', 'inten']:
    data[col] = pd.to_numeric(data[col])

data['pmz_round'] = data['pmz'].round(-bin_OOM)
data['mz_round'] = data['mz'].round(-bin_OOM)
data

### Querying the MS2 data
The number of ways someone might want to query their MS2 data is endless, so we have included just a couple of examples below. If you are having trouble building off what is there, check out the documentation for the Pandas 'query' function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

In [None]:
# Only show rows where pmz_round is equal to 457
data.query('pmz_round == 457')


In [None]:
# Only show rows where pmz_round is equal to 465, and mz is between 307 and 309
data.query('pmz_round == 465 and mz > 307 and mz < 309')
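Queries like the two above can also reference ordinary Python variables by prefixing them with `@`, which avoids editing the query string each time. The sketch below runs on a small made-up table (`toy_data`) that mimics a few columns of `data`; the variable names `target_pmz`, `mz_low`, and `mz_high` are hypothetical.

```python
import pandas as pd

# Made-up rows that mimic a few columns of the `data` table
toy_data = pd.DataFrame({
    'pmz_round': [457.0, 457.0, 465.0, 465.0],
    'mz':        [168.1, 308.2, 307.9, 321.0],
    'inten':     [1200.0, 300.0, 950.0, 40.0],
})

# Hypothetical thresholds, referenced inside the query string with '@'
target_pmz = 465
mz_low, mz_high = 307, 309

hits = toy_data.query('pmz_round == @target_pmz and mz > @mz_low and mz < @mz_high')
print(hits)
```

Only the row with precursor 465 and a fragment between 307 and 309 is returned.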


### Filtering the MS2 data, based on a fragment peak of interest
This code generates a list of precursor m/z values whose scans include a particular fragment peak of interest.

**frag_peak_OI**: Input the rounded fragment peak m/z that you are interested in.

In [None]:
frag_peak_OI = 168

###########################################################################################

pmzs = list(data[data['mz_round'] == frag_peak_OI]['pmz_round'].unique())
pmzs.sort()
pmzs

Once it is known which precursor m/z's include a particular fragment peak of interest, those precursor m/z's and their fragmentation patterns can be further investigated...

### Cluster fragmentation patterns
Designate a rounded precursor m/z of interest, as well as a cutoff for delineating clusters, and then run the code below to perform hierarchical clustering on all scans with that particular precursor m/z. The fragmentation pattern in each applicable scan is compared to all others using cosine distance, and the scans are then clustered.
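To make the cosine-distance comparison concrete, here is a minimal sketch on three made-up intensity vectors (`toy_scans`), using the same average-linkage settings as the cell below. Cosine distance depends only on the shape of a fragmentation pattern, not its absolute intensity, so a scan and a scaled copy of it sit at distance 0. All values and the 0.15 cutoff are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cosine
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up intensities over the same three m/z bins (one row per scan)
toy_scans = np.array([
    [3.0, 4.0, 0.0],   # scan A
    [6.0, 8.0, 0.0],   # scan B: same pattern as A at twice the intensity
    [0.0, 4.0, 3.0],   # scan C: a different pattern
])

d_ab = cosine(toy_scans[0], toy_scans[1])  # identical shape, distance ~0
d_ac = cosine(toy_scans[0], toy_scans[2])  # 1 - 16/25 = 0.36

# Same linkage method and metric as the notebook cell, with an example cutoff
toy_link = linkage(toy_scans, 'average', metric='cosine')
toy_clusters = fcluster(toy_link, 0.15, criterion='distance')
print(d_ab, d_ac, toy_clusters)
```

With this cutoff, scans A and B fall into one cluster and scan C into another.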

In [None]:
peak_of_interest = 457
cutoff = 0.15

###########################################
peak = pd.pivot_table(data.query('pmz_round == @peak_of_interest'),index=['file','scan'],columns='mz_round',values='inten',aggfunc='max', fill_value=0)
peak.drop(peak.columns[peak.columns >= peak_of_interest],axis=1,inplace=True)

#Clustering linkages followed by dendrogram construction
matplotlib.rcParams['lines.linewidth'] = 3
link = linkage(peak, 'average', metric = 'cosine')
plt.figure(figsize=(15,8))
plt.title('Hierarchical Clustering of m/z '+str(peak_of_interest)+' MS2 spectra', fontsize=20)
plt.ylabel('Cosine Distance', fontsize=18)
plt.yticks(fontsize=14)
dendrogram(
    link,
    color_threshold=cutoff,
    above_threshold_color='k',
    )
plt.show()


Run the cell below to get details on how many scans are in each cluster generated above. The result will be a list of pairs of numbers (called tuples). The first number in each pair is the cluster number, while the second number indicates how many scans are included in that cluster.
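The sketch below shows what that output looks like for a made-up set of cluster labels, and how the same labels can be used to list which scans belong to a given cluster. The labels (`toy_labels`) and the (file, scan) index (`toy_index`) are hypothetical stand-ins for the output of `fcluster` and the index of the clustered table.

```python
from collections import Counter
import numpy as np
import pandas as pd

# Hypothetical cluster labels, as fcluster might return for six scans
toy_labels = np.array([1, 1, 2, 1, 3, 2])

# Hypothetical (file, scan) index, in the same row order as the clustered table
toy_index = pd.MultiIndex.from_tuples(
    [('a.mzML', '10'), ('a.mzML', '11'), ('a.mzML', '12'),
     ('b.mzML', '7'), ('b.mzML', '8'), ('b.mzML', '9')],
    names=['file', 'scan'])

# (cluster number, number of scans), largest cluster first
cluster_sizes = Counter(toy_labels.tolist()).most_common()
print(cluster_sizes)  # [(1, 3), (2, 2), (3, 1)]

# Scans assigned to cluster 2
members = toy_index[toy_labels == 2].tolist()
print(members)  # [('a.mzML', '12'), ('b.mzML', '9')]
```

In the notebook itself, the analogous lookup would presumably be `peak.index[C == 2]`, since `C` follows the row order of `peak`.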


In [None]:
C = fcluster(link, cutoff, criterion="distance")
Counter(C).most_common()

Based on the outputs above, one cluster may have caught your attention. Input the cluster number below and then run the cell in order to generate a consensus fragmentation pattern representing the scans in that cluster.
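The consensus pattern computed in the cell below amounts to two steps: each scan is scaled so that its most intense peak equals 1, and the median relative intensity is then taken for each m/z bin across the scans. Here is a minimal sketch of those two steps on a made-up table (`toy_peak`, with scans as rows and rounded m/z bins as columns); all values are illustrative.

```python
import pandas as pd

# Made-up intensities: rows are scans, columns are rounded m/z bins
toy_peak = pd.DataFrame(
    [[50.0, 100.0, 0.0],
     [20.0, 40.0, 4.0],
     [30.0, 50.0, 0.0]],
    index=['scan_1', 'scan_2', 'scan_3'],
    columns=[168.0, 321.0, 323.0])

# Step 1: scale each scan so that its most intense peak equals 1
toy_norm = toy_peak.div(toy_peak.max(axis=1), axis=0)

# Step 2: the consensus intensity of each bin is the median across scans
toy_consensus = toy_norm.median(axis=0)
print(toy_consensus)
```

Because the median is used, a peak that shows up in only a minority of scans (m/z 323 here) drops out of the consensus.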

In [None]:
cluster = 1
x_min = 160
x_max = 460

##################################################################################################################

peak_plot = peak[C == cluster].T/peak[C == cluster].T.max()

# Add every whole-number m/z between the lowest and highest observed bins so the
# x-axis is continuous (assumes bins fall on whole numbers, i.e. bin_OOM >= 0).
# reindex is used because DataFrame.append was removed in pandas 2.0.
full_index = np.arange(int(peak_plot.index.min()), int(peak_plot.index.max()) + 1, dtype=float)
peak_plot = peak_plot.reindex(peak_plot.index.union(full_index)).fillna(0)

# Keep only the m/z window between x_min and x_max (inclusive)
peak_plot = peak_plot[(peak_plot.index >= x_min) & (peak_plot.index <= x_max)]

# Label only the bins whose consensus (median) relative intensity exceeds 0.01
consensus = peak_plot.T.median()
xticklab = [str(int(x)) if value > .01 else ' ' for x, value in consensus.items()]

plt.figure(figsize=(15,8))
ax = peak_plot.T.median().plot(kind='bar', color='k', width=1.1)
ax.tick_params(axis='y', which='major', labelsize=14)
ax.set_xticklabels(xticklab, rotation=90, fontsize=14)
ax.set_title('Consensus MS2 spectrum for cluster '+str(cluster)+' of m/z '+str(peak_of_interest)+', based on '+str(len(peak_plot.T))+' scans', fontsize=20)

In the consensus spectrum above, only peaks that are greater than 0.01 in relative intensity are labelled. For a full accounting of the peaks in the consensus spectrum, run the cell below.

In [None]:
peaks = list(peak_plot.T.median().T[peak_plot.T.median() > .00001].index)
peaks

In some cases, it may be useful to calculate the ratios of two fragment peaks across all of the scans in a cluster. This can be done in the cell below.

In [None]:
peak_1 = 321
peak_2 = 323

##################################################################################################################

ratios = peak_plot.T[peak_1]/peak_plot.T[peak_2]
ratios
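When the second peak is absent from a scan, the division above yields `inf` (or `NaN` if both peaks are absent), which can distort summary statistics. The sketch below shows one possible way to drop those scans before summarizing, using a made-up table (`toy_ratio_table`) of relative intensities; it is an optional clean-up step, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Made-up relative intensities for two fragment bins across four scans
toy_ratio_table = pd.DataFrame(
    {321.0: [1.0, 0.8, 0.5, 0.0],
     323.0: [0.5, 0.4, 0.0, 0.0]},
    index=['scan_1', 'scan_2', 'scan_3', 'scan_4'])

# Same division as in the notebook: scan_3 gives inf, scan_4 gives NaN
raw_ratios = toy_ratio_table[321.0] / toy_ratio_table[323.0]

# Drop scans where the ratio is undefined, then summarize the rest
clean_ratios = raw_ratios.replace([np.inf, -np.inf], np.nan).dropna()
print(clean_ratios)
print(clean_ratios.median())
```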