# Mass spectrometry data

The objective of this exercise is to read raw peptide MS/MS spectrum information and output it as a dataframe.
The .msp file can be downloaded [here](https://chemdata.nist.gov/download/peptide_library/libraries/cptaclib/2015/cptac2_mouse_hcd_selected.msp.tar.gz).

The information in this ASCII-based text file is organized spectrum by spectrum.
The first line of each spectrum is formatted like this:

&emsp;<code>Name: sequence/charge_nmods_collisionenergy</code>

followed by a comment section, which can be disregarded, and the actual spectrum data, which is tab-separated:

&emsp;<code>m/z&emsp;intensity&emsp;additional_info</code>

Spectra are separated by an empty line.
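
As a minimal sketch of how one header line could be split into its parts, assuming the `Name: sequence/charge_nmods_collisionenergy` layout described above with the collision energy written with an `eV` suffix. `parse_name_line` is a hypothetical helper and the sample line is made up for illustration, not taken from the file.

```python
def parse_name_line(line):
    """Split 'Name: SEQ/charge_nmods_CEeV' into (sequence, charge, collision_energy)."""
    # Drop the 'Name:' prefix and the surrounding whitespace / newline.
    payload = line.split(":", 1)[1].strip()
    sequence, rest = payload.split("/", 1)
    fields = rest.split("_")
    charge = int(fields[0])
    # Assumption: the collision energy is the last underscore-separated field, e.g. '38.2eV'.
    collision_energy = float(fields[-1].replace("eV", ""))
    return sequence, charge, collision_energy


# Made-up header line for illustration only.
example = parse_name_line("Name: AAAGELQEDSGLHVLAR/2_0_38.2eV\n")
print(example)
```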

Code a function that returns two DataFrames or arrays containing the processed and filtered data. The first one should contain the spectrum information (n_spectra, n_m/z_features) and the second one the sequences per row (n_spectra).

Here are some general guidelines:

* The m/z values need to be binned to integer values (mathematically rounded), otherwise the dataframe size would get out of hand. As a result, several peaks can map to the same bin (e.g. peaks at 145.1 and 145.2 both land in bin 145). In that case, only the maximum intensity should be kept in the final dataframe.

* Rows that are all-zero should be dropped.
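
The binning rule can be sketched with NumPy for a single spectrum. This is one possible approach, not the required solution: `bin_peaks` is a hypothetical helper, and treating the m/z range as half-open, `[mz_min, mz_max)`, is an assumption. Note that `np.rint`, like Python's built-in `round`, sends exact ties to the even neighbour.

```python
import numpy as np


def bin_peaks(mz_values, intensities, mz_min, mz_max):
    """Return a vector of length mz_max - mz_min holding the max intensity per integer bin."""
    mz_values = np.asarray(mz_values, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    # Round every m/z value to its nearest integer bin.
    bins = np.rint(mz_values).astype(int)
    # Keep only peaks whose bin lies inside [mz_min, mz_max) (assumed convention).
    keep = (bins >= mz_min) & (bins < mz_max)
    row = np.zeros(mz_max - mz_min)
    # np.maximum.at is unbuffered, so several peaks hitting the same bin are all considered.
    np.maximum.at(row, bins[keep] - mz_min, intensities[keep])
    return row


# Peaks at 145.1 and 145.2 share bin 145; the peak at 90.0 is outside the range.
row = bin_peaks([145.1, 145.2, 150.7, 90.0], [10.0, 25.0, 5.0, 99.0], mz_min=135, mz_max=160)
print(row[145 - 135], row[151 - 135])
```

A row that stays all-zero after this step corresponds to a spectrum with no peaks in range, which is exactly the kind of row the second guideline asks to drop.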

Your function should allow for selecting a range on the x-axis (m/z-range). All peaks outside this range can be disregarded. Furthermore, only spectra within a set collision energy range and a maximum sequence length should be contained in the output dataframe.
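
The spectrum-level filters can be expressed as a single predicate that is evaluated once per `Name:` line. This is only a sketch: `keep_spectrum` is a hypothetical helper, the default values are illustrative, and treating the collision energy bounds as exclusive and the length limit as inclusive is an assumption.

```python
def keep_spectrum(sequence, collision_energy, max_seq_len=30, min_ce=36, max_ce=40):
    """Return True if a spectrum passes the sequence-length and collision-energy filters."""
    # Assumption: length limit is inclusive, collision energy bounds are exclusive.
    return len(sequence) <= max_seq_len and min_ce < collision_energy < max_ce


print(keep_spectrum("AAAGELQEDSGLHVLAR", 38.2))  # short sequence, energy in range
print(keep_spectrum("A" * 31, 38.2))             # sequence too long
print(keep_spectrum("AAAGELQEDSGLHVLAR", 44.0))  # energy out of range
```

Rejecting a spectrum at its header line means none of its peak lines need to be parsed, which also helps with run time.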

The faster your function runs, the better. I will time them all in the end.
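
For a rough idea of how a run could be timed, here is a small sketch based on the standard-library `timeit` module. `time_call` is a hypothetical helper, and the built-in `sum` is used as a stand-in workload so the snippet runs without the data file.

```python
import timeit


def time_call(func, *args, repeat=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeat` single executions."""
    timings = timeit.repeat(lambda: func(*args, **kwargs), number=1, repeat=repeat)
    # The minimum is the least disturbed by other processes on the machine.
    return min(timings)


# Stand-in workload; replace with the parsing function and the path to the .msp file.
best = time_call(sum, range(1_000_000))
print(f"best of 3: {best:.4f} s")
```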

Made with &#10084;&#65039;&nbsp; by Michelle and me

In [34]:
import numpy as np
import pandas as pd
import timeit
import plotly.express as px
from pathlib import Path
# Use the full option name; the short form is ambiguous in recent pandas versions.
pd.set_option("display.max_columns", 3000)

In [35]:
def msp_to_df(
    input_file,
    max_seq_len=30,
    min_ce=36,
    max_ce=40,
    mz_min=135,
    mz_max=1400,
):
    """
    Function to read spectrum data from .msp file and convert to dataframe.
    Args:
        input_file (str): path to .msp file
        max_seq_len (int): maximum acceptable sequence length
        min_ce (int): minimum collision energy of spectra to be included in df
        max_ce (int): maximum collision energy of spectra to be included in df
        mz_min (int): lower boundary for m/z to be included in df
        mz_max (int): upper boundary for m/z to be included in df

    Returns:
        df (pd.DataFrame or np.array): spectrum information within defined parameters [n_spectra, n_features]
        seqs (list of str): sequences
    """
    df = pd.DataFrame(columns=range(mz_min,mz_max))
    seqs = []

    with open(input_file, "r") as file:
        continue_to_next_name = False
        index_counter = -1
        for line in file:
            # Match only real header lines, not other lines that happen to contain "Name".
            if line.startswith("Name"):
                continue_to_next_name = False
                split = line.split(" ")[1].split("/")
                name = split[0]
                ce = split[1].rsplit("_")[-1].replace("eV","")
                ce = float(ce)
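                # Skip spectra outside the collision energy range or with too long a sequence.
                if not (min_ce < ce < max_ce) or len(name) > max_seq_len: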
                    continue_to_next_name = True
                else:
                    seqs.append(name)
                    index_counter += 1
                    df.loc[index_counter] = np.zeros(df.shape[1])
            if continue_to_next_name:
                continue
            # Skip the remaining header lines and the blank line between spectra.
            if line.startswith(("MW", "Comment", "Num peaks", "Name")) or not line.strip():
                continue
            else:
                split2 = line.split("\t")
                mz = round(float(split2[0]))
                intensity = float(split2[1])
                # Columns cover range(mz_min, mz_max), so the lower bound is inclusive.
                if mz_min <= mz < mz_max:
                    if intensity > df.at[index_counter,mz]:
                        df.at[index_counter,mz] = intensity
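    # Drop all-zero rows and keep seqs aligned with the remaining rows.
    df = df.loc[~(df==0).all(axis=1)]
    seqs = [seqs[i] for i in df.index]
    return df, seqs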

In [36]:
df,seqs = msp_to_df("../../data/cptac2_mouse_hcd_selected.msp")

In [37]:
def create_plots(df,seqs, directory):
    Path(directory).mkdir(parents=True, exist_ok=True)
    df = df.copy()
    df["Name"] = seqs
    df = pd.melt(df, id_vars=["Name"], var_name="mz", value_name="Intensity")
    for name in set(df["Name"]):
        subdf = df[df["Name"]==name].copy()
        fig = px.line(subdf,x="mz",y="Intensity",title=name)
        fig.write_image(str(Path(directory) / f"{name}.png"))

In [38]:
create_plots(df,seqs,"plots")