# Using `aiproteomics` to build a spectral library

In this demo notebook we will see how to use the `aiproteomics` infrastructure to wrap a prediction model (e.g. MSMS) and generate spectral library entries.

First, we import the `aiproteomics` library, and check the version. This can be helpful for debugging by making sure you are looking at the correct documentation version.

In [9]:
import aiproteomics
print(aiproteomics.__VERSION__)

0.6.0


## Creating an `AIProteomicsModel`

To create an `AIProteomicsModel`, we need three things:

* A `SequenceMapper` that knows how to map the string representation of a peptide sequence (e.g. _AEEDEILNRS(UniMod:21)PR_ ) to an integer array suitable for input to the AI model.
* A `ModelParams` instance, that defines how the output of the AI model should be interpreted.
* The AI model itself (currently a `tensorflow` model, but could be extended easily to include `pytorch` models too)

Essentially, the `SequenceMapper` is used to prepare input for the prediction model, and the `ModelParams` is used to interpret and post-process the output of the prediction model.

#### Making a `SequenceMapper`

All (deep learning) prediction models used with `aiproteomics` are assumed to take a peptide sequence (in some form) as input. A `SequenceMapper` is responsible for pre-processing the string representation of a peptide sequence (including `UniMod` modifications) into a format that is ready to be given straight to the model.

Currently, this means that there are three parameters for creating a sequence mapper for your model:
* `min_seq_len` and `max_seq_len`, to only accept sequences in a length range appropriate for the model
* `mapping`, a `Dict` that maps each amino acid (and modification, if applicable) to a particular integer value.

Of course, there are many ways of mapping the sequence to an array of integers. `aiproteomics` offers two ready-made mappings: `PHOSPHO_MAPPING`, which handles most amino acids as well as several modifications (including phosphorylation), and `PROSIT_MAPPING`, which uses the same mapping as the original 2019 Prosit MSMS model.

The mapping implementation is relatively simple (see the module code `aiproteomics.core.sequence`), essentially just a dictionary, so it is possible to make your own such mapping and use that to initialize a `SequenceMapper` instead.

Below, we accept only peptides of 7 to 50 amino acids and use the `PHOSPHO_MAPPING`.

In [23]:
from aiproteomics.core.sequence import SequenceMapper, PHOSPHO_MAPPING

# Choose what sequence mapping you want to use
seqmap = SequenceMapper(min_seq_len=7, max_seq_len=50, mapping=PHOSPHO_MAPPING)

You can see the contents of this `SequenceMapper` by printing it out:

In [24]:
seqmap

SequenceMapper(min_seq_len=7, max_seq_len=50, mapping=SequenceMapping(description='Mapping for the phospho model', aa_int_map={' ': 0, 'A': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'K': 9, 'L': 10, 'M': 11, 'N': 12, 'P': 13, 'Q': 14, 'R': 15, 'S': 16, 'T': 17, 'V': 18, 'W': 19, 'Y': 20, '1': 21, '2': 22, '3': 23, '4': 24, '*': 25}, aa_mod_map={'M(UniMod:35)': '1', 'S(UniMod:21)': '2', 'T(UniMod:21)': '3', 'Y(UniMod:21)': '4', '(UniMod:1)': '*', 'C(UniMod:4)': 'C'}))

In the above you should see exactly how each amino acid is mapped to an integer, as well as how modifications are handled.
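
To make the two dictionaries concrete, here is a minimal sketch of how such a mapping could be applied to a peptide string. It is an illustration only, not the `aiproteomics` implementation: the function `encode_peptide` is hypothetical, and right-padding with the integer for `' '` is an assumption. The dictionaries themselves are copied from the printed output above.

```python
# Illustrative sketch only; not the aiproteomics implementation.
AA_INT_MAP = {
    ' ': 0, 'A': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8,
    'K': 9, 'L': 10, 'M': 11, 'N': 12, 'P': 13, 'Q': 14, 'R': 15, 'S': 16,
    'T': 17, 'V': 18, 'W': 19, 'Y': 20, '1': 21, '2': 22, '3': 23, '4': 24,
    '*': 25,
}
AA_MOD_MAP = {
    'M(UniMod:35)': '1', 'S(UniMod:21)': '2', 'T(UniMod:21)': '3',
    'Y(UniMod:21)': '4', '(UniMod:1)': '*', 'C(UniMod:4)': 'C',
}


def encode_peptide(sequence, max_seq_len=50):
    """Hypothetical helper: turn a peptide string into a list of integers."""
    # Step 1: replace each modified residue by its single-character token
    for modification, token in AA_MOD_MAP.items():
        sequence = sequence.replace(modification, token)
    if len(sequence) > max_seq_len:
        raise ValueError(f"Sequence longer than {max_seq_len} residues")
    # Step 2: pad on the right with spaces (assumed padding convention)
    padded = sequence.ljust(max_seq_len, ' ')
    # Step 3: look up the integer for every character
    return [AA_INT_MAP[character] for character in padded]


encoded = encode_peptide("AEEDEILNRS(UniMod:21)PR", max_seq_len=15)
print(encoded)
```

In this sketch the phosphorylated serine `S(UniMod:21)` becomes the token `'2'` and therefore the integer 22, while unmodified residues map directly through `aa_int_map`.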

#### Defining the `ModelParams` for your model

We now must define what kind of prediction model we will use. In practice, this is mainly used for post-processing the model output, since it determines how to interpret the output layer(s).

In the following we will show this for the most complex model type, a fragmentation (MSMS) model. A `ModelParamsMSMS` object can be constructed with the following info:
* `seq_len`: the maximum peptide length accepted by the model (this determines the number of breakage points and therefore the number of possible fragments)
* `ions`: a list of the ion types supported by the model. It is common to use just 'y' and 'b', but more can be specified (e.g. 'a').
* `max_charge`: the maximum charge on any fragment. If set to 3, for example, then only fragments with charges up to 3+ will appear in the output.
* `neutral_losses`: a list of the neutral losses to include. Each entry can be `H2O`, `NH3` or (for phosphoproteomics) `H3PO4`; in the example below, the empty string `''` appears to stand for fragments with no neutral loss.

Note that all the above choices determine the possible fragments and therefore the size of the output layer of the AI model (in `aiproteomics` we currently support models that predict one intensity value per possible peptide fragment).

In [11]:
from aiproteomics.core.modeltypes import ModelParamsMSMS

# Choose your model type (msms) and the parameters for its output
params = ModelParamsMSMS(seq_len=50, ions=['y','b'], max_charge=2, neutral_losses=['', 'H3PO4'])

If you wish to check what fragments are defined by your `ModelParamsMSMS` object, and the ordering they will have in the output layer, you can print out the `.fragments` member:

In [15]:
params.fragments

array([Fragment(fragment_type='y', fragment_charge=1, fragment_series_number=1, fragment_loss_type=''),
       Fragment(fragment_type='y', fragment_charge=2, fragment_series_number=1, fragment_loss_type=''),
       Fragment(fragment_type='b', fragment_charge=1, fragment_series_number=1, fragment_loss_type=''),
       Fragment(fragment_type='b', fragment_charge=2, fragment_series_number=1, fragment_loss_type=''),
       Fragment(fragment_type='y', fragment_charge=1, fragment_series_number=2, fragment_loss_type=''),
       Fragment(fragment_type='y', fragment_charge=2, fragment_series_number=2, fragment_loss_type=''),
       Fragment(fragment_type='b', fragment_charge=1, fragment_series_number=2, fragment_loss_type=''),
       Fragment(fragment_type='b', fragment_charge=2, fragment_series_number=2, fragment_loss_type=''),
       Fragment(fragment_type='y', fragment_charge=1, fragment_series_number=3, fragment_loss_type=''),
       Fragment(fragment_type='y', fragment_charge=2, fragment_s

Checking the number of fragments will show you the (expected) length of your model's output layer:

In [17]:
len(params.fragments)

392
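
The number 392 can be reproduced by hand from the parameters chosen above: a peptide of length 50 has 49 breakage points, and each one yields a fragment for every combination of ion type, charge and neutral loss. The sketch below checks this arithmetic in plain Python. The enumeration order (loss type outermost, then series number, ion type and charge) is an assumption; it agrees with the truncated listing shown above but is not taken from the library code.

```python
from itertools import product

# Same choices as passed to ModelParamsMSMS above
seq_len = 50
ions = ['y', 'b']
max_charge = 2
neutral_losses = ['', 'H3PO4']

# A peptide of length seq_len can break between any two adjacent residues
n_breakage_points = seq_len - 1
n_fragments = n_breakage_points * len(ions) * max_charge * len(neutral_losses)
print(n_fragments)

# Enumerate (type, charge, series number, loss) tuples in the assumed order
fragments = [
    (ion, charge, series_number, loss)
    for loss, series_number, ion, charge in product(
        neutral_losses, range(1, seq_len), ions, range(1, max_charge + 1)
    )
]
print(fragments[:4])
```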

#### Create a (`tensorflow`) model

Finally, having configured the infrastructure for input (sequence mapping) and output (model params) of the AI model, we now need to provide the model itself. In the following we just use a "dummy" model with the input/output layers expected of an MSMS model. In practice you can use your own (new) model here, or generate a new one with the functions provided in `aiproteomics.models`.

Currently, only `tensorflow` models are directly supported, but `pytorch` models could be handled in the code without too much work.

In [18]:
from aiproteomics.models.dummy_models import generate_dummy_msms_model

# Make a compatible NN model
nn_model, creation_meta = generate_dummy_msms_model(
    params=params,
    num_layers=6,
    num_heads=8,
    d_ff=2048,
    dropout_rate=0.1
)


#### Build your `AIProteomicsModel`

Finally, with the above three objects, we can now build our `AIProteomicsModel`. This is really just a convenient class that keeps all relevant parts of the model together, and also provides utility functionality such as saving the model to a directory, or reloading an existing model.

In [25]:
from aiproteomics.core.modeltypes import AIProteomicsModel

# Build the model
msmsmodel = AIProteomicsModel(seq_map=seqmap, model_params=params, nn_model=nn_model, nn_model_creation_metadata=creation_meta)

In the following, we can print out the `AIProteomicsModel` we have created, which should contain all the information required to process input and output of the given MSMS prediction model:

In [28]:
from pprint import pprint
pprint(msmsmodel)

AIProteomicsModel(seq_map=SequenceMapper(min_seq_len=7,
                                         max_seq_len=50,
                                         mapping=SequenceMapping(description='Mapping '
                                                                             'for '
                                                                             'the '
                                                                             'phospho '
                                                                             'model',
                                                                 aa_int_map={' ': 0,
                                                                             '*': 25,
                                                                             '1': 21,
                                                                             '2': 22,
                                                                             '3': 23,
                            

### Saving and reloading the `AIProteomicsModel`

A useful feature of an `AIProteomicsModel` is the ability to save it to a directory. This means it can be reloaded into Python later (for example, in another script).

In [31]:
# Save the model
msmsmodel.to_dir("testmodelfrag/", overwrite=True)

# Load the model back in as a new AIProteomicsModel instance
reloaded_msms = AIProteomicsModel.from_dir("testmodelfrag/")

# Check what's inside
pprint(reloaded_msms)



AIProteomicsModel(seq_map=SequenceMapper(min_seq_len=7,
                                         max_seq_len=50,
                                         mapping=SequenceMapping(description='Mapping '
                                                                             'for '
                                                                             'the '
                                                                             'phospho '
                                                                             'model',
                                                                 aa_int_map={' ': 0,
                                                                             '*': 25,
                                                                             '1': 21,
                                                                             '2': 22,
                                                                             '3': 23,
                            

### Making an `AIProteomicsModel` for iRT and CCS

Briefly, the following shows that retention time and ion mobility models can be wrapped as `AIProteomicsModel` instances in the same way:

In [36]:
from aiproteomics.core.modeltypes import ModelParamsRT, ModelParamsCCS
from aiproteomics.models.dummy_models import generate_dummy_iRT_model, generate_dummy_ccs_model

# Try making a phospho-style iRT model
params = ModelParamsRT(seq_len=50, iRT_rescaling_mean=101.11514, iRT_rescaling_var=46.5882)
nn_model, creation_meta = generate_dummy_iRT_model(
    params=params,
    num_layers=6,
    num_heads=8,
    d_ff=2048,
    dropout_rate=0.1
)
rtmodel = AIProteomicsModel(seq_map=seqmap, model_params=params, nn_model=nn_model, nn_model_creation_metadata=creation_meta)

# Try making a phospho-style ion mobility model
params = ModelParamsCCS(seq_len=50)
nn_model, creation_meta = generate_dummy_ccs_model(
    params=params,
    num_layers=6,
    num_heads=8,
    d_ff=2048,
    dropout_rate=0.1
)
ccsmodel = AIProteomicsModel(seq_map=seqmap, model_params=params, nn_model=nn_model, nn_model_creation_metadata=creation_meta)

### Use the `AIProteomicsModel` to generate a spectral library
Now we can use the models to generate a spectral library for a given list of input peptides (in `peptides_to_predict.tsv`).

In [38]:
import pandas as pd

input_peptides = pd.read_csv('peptides_to_predict.tsv', sep='\t')

input_peptides

Unnamed: 0,peptide,charge
0,STADAADA,1
1,ADSDDDDD,1
2,DDDDDDDD,2
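
To follow along without an existing input file, one can be created beforehand. The sketch below writes the three peptides shown above to `peptides_to_predict.tsv` using only the standard library. It assumes that the `peptide` and `charge` columns are all that is needed; the original file may contain additional columns.

```python
import csv

# The three example peptides and their precursor charges, as displayed above
peptides = [
    ("STADAADA", 1),
    ("ADSDDDDD", 1),
    ("DDDDDDDD", 2),
]

# Write a tab-separated file with a header row
with open("peptides_to_predict.tsv", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["peptide", "charge"])
    writer.writerows(peptides)
```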


In [41]:
from aiproteomics.core.utils import build_spectral_library

speclib_df = build_spectral_library(input_peptides, msms=msmsmodel, rt=rtmodel, ccs=ccsmodel)

speclib_df



Unnamed: 0,index,PrecursorMz,ProductMz,Annotation,ProteinId,GeneName,PeptideSequence,ModifiedPeptideSequence,PrecursorCharge,LibraryIntensity,NormalizedRetentionTime,PrecursorIonMobility,FragmentType,FragmentCharge,FragmentSeriesNumber,FragmentLossType
0,0,721.299866,88.039307,b1,,,STADAADA,STADAADA,1,1.131502,0.032277,-2.011733,b,1,1,
1,1,721.299866,260.124084,b3,,,STADAADA,STADAADA,1,1.014848,0.032277,-2.011733,b,1,3,
2,2,721.299866,375.151031,b4,,,STADAADA,STADAADA,1,1.587804,0.032277,-2.011733,b,1,4,
3,3,721.299866,462.183075,y5,,,STADAADA,STADAADA,1,0.944723,0.032277,-2.011733,y,1,5,
4,4,721.299866,517.225281,b6,,,STADAADA,STADAADA,1,2.516924,0.032277,-2.011733,b,1,6,
5,5,721.299866,91.110085,b2-H3PO4,,,STADAADA,STADAADA,1,3.544521,0.032277,-2.011733,b,1,2,H3PO4
6,6,721.299866,178.14212,y3-H3PO4,,,STADAADA,STADAADA,1,1.795422,0.032277,-2.011733,y,1,3,H3PO4
7,7,721.299866,162.147202,b3-H3PO4,,,STADAADA,STADAADA,1,2.130765,0.032277,-2.011733,b,1,3,H3PO4
8,8,721.299866,435.243286,y6-H3PO4,,,STADAADA,STADAADA,1,1.329103,0.032277,-2.011733,y,1,6,H3PO4
9,9,721.299866,419.248383,b6-H3PO4,,,STADAADA,STADAADA,1,1.953735,0.032277,-2.011733,b,1,6,H3PO4
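
The `PrecursorMz` and `ProductMz` columns for the plain b and y fragments can be checked by hand from monoisotopic residue masses. The following sketch does this for `STADAADA`; it is a worked example with hypothetical helper names, not the code `aiproteomics` uses internally, and the mass table covers only the four residues in this peptide. Rows with a neutral loss are not covered.

```python
# Monoisotopic residue masses (Da) for the residues in STADAADA only
MONOISOTOPIC_MASS = {"A": 71.03711, "D": 115.02694, "S": 87.03203, "T": 101.04768}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of a proton


def residue_sum(sequence):
    """Sum of residue masses for an unmodified sequence."""
    return sum(MONOISOTOPIC_MASS[aa] for aa in sequence)


def precursor_mz(sequence, charge):
    """m/z of the intact peptide carrying `charge` protons."""
    return (residue_sum(sequence) + WATER + charge * PROTON) / charge


def fragment_mz(sequence, ion_type, series_number, charge):
    """m/z of a b or y fragment without neutral loss."""
    if ion_type == "b":
        # b ions contain the first `series_number` residues
        mass = residue_sum(sequence[:series_number])
    elif ion_type == "y":
        # y ions contain the last `series_number` residues plus water
        mass = residue_sum(sequence[-series_number:]) + WATER
    else:
        raise ValueError(f"Unsupported ion type: {ion_type}")
    return (mass + charge * PROTON) / charge


print(precursor_mz("STADAADA", 1))
print(fragment_mz("STADAADA", "b", 1, 1))
print(fragment_mz("STADAADA", "y", 5, 1))
```

The three printed values agree with the `PrecursorMz` column and with the `b1` and `y5` rows of the table above to about three decimal places.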


Finally, you can output the generated spectral library to `.tsv` format:

In [42]:
speclib_df.to_csv('speclib.tsv', sep='\t')

Or, more efficiently, to `parquet`:

In [43]:
speclib_df.to_parquet('speclib.parquet')
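
One detail worth knowing when reading a TSV export back in: `DataFrame.to_csv` writes the index as an extra first column by default, which `pandas.read_csv` then names `Unnamed: 0`. The sketch below demonstrates this on a small stand-in DataFrame (two rows copied from the table above, not the full library) and shows that passing `index=False` avoids the extra column.

```python
import io

import pandas as pd

# Small stand-in for the spectral library DataFrame
df = pd.DataFrame({
    "PeptideSequence": ["STADAADA", "STADAADA"],
    "ProductMz": [88.039307, 260.124084],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index, sep="\t")
with_index.seek(0)
reread_with_index = pd.read_csv(with_index, sep="\t")

# With index=False only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, sep="\t", index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index, sep="\t")

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```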