# UniProt data pre-processing for binding site prediction downstream task

This notebook guides you through:

* 📥 **Downloading Data**: Retrieve information from the UniProt website, including details on protein families, binding sites, active sites, and amino acid sequences.
* 🛠️ **Processing Data**: Handle special symbols (angle brackets and question marks) in binding/active site information and convert this data into binary labels. Each amino acid position in the protein sequences is marked as 1 (binding/active site) or 0 (non-binding/active site).
* ✂️ **Splitting Data**: Divide amino acid sequences and their labels into train/test sets by whole UniProt protein families, so that no family appears in both sets (this is not a stratified split).
* 🔄 **Chunking Sequences**: Split sequences and their labels into non-overlapping chunks of a specified length to define a context window for the ESM-2 model.

This tutorial runs without GPU support and can be used in Google Colab. To open this notebook in Colab, use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/UniProt_Data_Preprocessing_for_Binding_Sites.ipynb)

## Download from UniProt

Let's first download a dataset of proteins from UniProt. We will obtain a TSV (Tab-Separated Values) file with specific columns such as Protein families, Binding site, Active site, and Sequence. You can do this by following these steps:

- Go to the [UniProt website](https://www.uniprot.org/) and perform a search to query for the proteins of interest (you can search by organism, protein name, function, etc). Filter your results with the filters on the left-hand side to refine your results further if necessary. Here I performed the search: (organism_id:9606) AND (family:kinase) AND (existence:1 OR existence:2) in UniProtKB.

- Select columns: Above the search results, there is an option to select the columns you want to be included in your download. Click on the 'Columns' button and a dropdown menu will appear.

- Customize columns: In the dropdown menu, you can check the boxes next to the columns you want to include in your TSV file. Look for the 'Protein families', 'Binding site', 'Active site', and 'Sequence' options. I also added further info such as entry name, protein name, gene name, organism, sequence length and whether the entry has been reviewed.

- Download the file: After selecting the desired columns, click the 'Download' button located above the search results. Choose the 'Tab-separated' format from the list of available formats. You may also have the option to select the number of entries you want to download (e.g., all entries, displayed entries, or a custom range).
Click on the 'Download' button to start the download process and your browser will prompt you to save the TSV file.
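
The same download can in principle be scripted instead of clicked through. The sketch below only builds the request URL for the query used in this tutorial. The endpoint (`https://rest.uniprot.org/uniprotkb/stream`) and the field names (`ft_binding`, `ft_act_site`, `protein_families`, and so on) are assumptions based on the public UniProt REST API and should be checked against the UniProt documentation before use. The helper `build_uniprot_tsv_url` and the output file name are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProt REST "stream" endpoint (verify in the UniProt API docs)
UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_uniprot_tsv_url(query, fields):
    """Build a URL that requests the given query as TSV with the given columns."""
    params = {"query": query, "format": "tsv", "fields": ",".join(fields)}
    return f"{UNIPROT_STREAM_URL}?{urlencode(params)}"


# The search performed in this tutorial
query = "(organism_id:9606) AND (family:kinase) AND (existence:1 OR existence:2)"

# Assumed API names for the columns selected in the web interface
fields = [
    "accession", "reviewed", "id", "protein_name", "gene_names", "organism_name",
    "protein_families", "sequence", "length", "ft_binding", "ft_act_site",
]

url = build_uniprot_tsv_url(query, fields)
print(url)

# To actually download the file (requires network access):
# import requests
# response = requests.get(url, timeout=60)
# response.raise_for_status()
# with open("uniprotkb_kinases.tsv", "w") as handle:  # hypothetical file name
#     handle.write(response.text)
```

Column headers in a scripted download may differ from those of the web export, so check them before running the rest of the notebook.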

## Process data

Now, let's process the downloaded UniProt TSV file with columns (Protein families, Binding site, Active site, Sequence). Entries with a missing family annotation, binding site annotation, or sequence are filtered out. Entries with a missing Active site annotation are kept.

But first, let's set up the environment:

In [None]:
!pip install pandas
!pip install numpy
!pip install requests



In [None]:
# I/O
import pandas as pd
import numpy as np
import re
import random
import pickle
import os
import requests
import xml.etree.ElementTree as ET
# set seed
random.seed(42)
np.random.seed(42)

If you upload the downloaded file from UniProt to Google Drive, you should be able to access it by first mounting your Google Drive and then loading it:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# Load the dataset
file_path = "/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29.tsv"
data = pd.read_csv(file_path, sep='\t')
data.head()

Unnamed: 0,Entry,Reviewed,Entry Name,Protein names,Gene Names,Organism,Protein families,Sequence,Length,Binding site,Active site
0,A0A087WV00,unreviewed,A0A087WV00_HUMAN,Diacylglycerol kinase (DAG kinase) (EC 2.7.1.107),DGKI,Homo sapiens (Human),Eukaryotic diacylglycerol kinase family,MDAAGRGCHLLPLPAARGPARAPAAAAAAAASPPGPCSGAACAPSA...,1057,,
1,A0A090N7W4,unreviewed,A0A090N7W4_HUMAN,Cell division protein kinase 5,CDK5 hCG_18690 tcag7.772,Homo sapiens (Human),"Protein kinase superfamily, CMGC Ser/Thr prote...",MQKYEKLEKIGEGTYGTVFKAKNRETHEIVALKRVRLDDDDEGVPS...,292,"BINDING 33; /ligand=""ATP""; /ligand_id=""ChEBI:C...",
2,A0A0S2Z310,unreviewed,A0A0S2Z310_HUMAN,Serine/threonine-protein kinase receptor (EC 2...,ACVRL1,Homo sapiens (Human),"Protein kinase superfamily, TKL Ser/Thr protei...",MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTC...,503,"BINDING 229; /ligand=""ATP""; /ligand_id=""ChEBI:...",
3,A0A0S2Z4D1,unreviewed,A0A0S2Z4D1_HUMAN,non-specific serine/threonine protein kinase (...,STK11,Homo sapiens (Human),"Protein kinase superfamily, CAMK Ser/Thr prote...",MEVVDPQQLGMFTEGELMSVGMDTFIHRIDSTEVIYQPRRKRAKLI...,433,"BINDING 78; /ligand=""ATP""; /ligand_id=""ChEBI:C...",
4,A0A2P9DU05,unreviewed,A0A2P9DU05_HUMAN,Rho-associated protein kinase (EC 2.7.11.1),ROCK2,Homo sapiens (Human),"Protein kinase superfamily, AGC Ser/Thr protei...",MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,1388,"BINDING 121; /ligand=""ATP""; /ligand_id=""ChEBI:...","ACT_SITE 214; /note=""Proton acceptor""; /eviden..."


Now let's extract the required information for the purposes of this task: Protein families, Binding site, Active site, Sequence. Also, let's filter out entries without binding site or protein families information.

In [None]:
data["Binding site"]

0                                                     NaN
1       BINDING 33; /ligand="ATP"; /ligand_id="ChEBI:C...
2       BINDING 229; /ligand="ATP"; /ligand_id="ChEBI:...
3       BINDING 78; /ligand="ATP"; /ligand_id="ChEBI:C...
4       BINDING 121; /ligand="ATP"; /ligand_id="ChEBI:...
                              ...                        
2186                                                  NaN
2187                                                  NaN
2188                                                  NaN
2189    BINDING 73; /ligand="ATP"; /ligand_id="ChEBI:C...
2190    BINDING 165; /ligand="ATP"; /ligand_id="ChEBI:...
Name: Binding site, Length: 2191, dtype: object

In [None]:
data = data[["Entry", "Protein families", "Binding site", "Active site", "Sequence"]]
# Keep only rows with non-missing values in the 'Protein families', 'Binding site', and 'Sequence' columns
# (.copy() avoids pandas chained-assignment warnings when new columns are added later)
data = data[pd.notna(data['Protein families']) & pd.notna(data['Binding site']) & pd.notna(data['Sequence'])].copy()
print(data.shape)
data.head()

(1406, 5)


Unnamed: 0,Entry,Protein families,Binding site,Active site,Sequence
1,A0A090N7W4,"Protein kinase superfamily, CMGC Ser/Thr prote...","BINDING 33; /ligand=""ATP""; /ligand_id=""ChEBI:C...",,MQKYEKLEKIGEGTYGTVFKAKNRETHEIVALKRVRLDDDDEGVPS...
2,A0A0S2Z310,"Protein kinase superfamily, TKL Ser/Thr protei...","BINDING 229; /ligand=""ATP""; /ligand_id=""ChEBI:...",,MTLGSPRKGLLMLLMALVTQGDPVKPSRGPLVTCTCESPHCKGPTC...
3,A0A0S2Z4D1,"Protein kinase superfamily, CAMK Ser/Thr prote...","BINDING 78; /ligand=""ATP""; /ligand_id=""ChEBI:C...",,MEVVDPQQLGMFTEGELMSVGMDTFIHRIDSTEVIYQPRRKRAKLI...
4,A0A2P9DU05,"Protein kinase superfamily, AGC Ser/Thr protei...","BINDING 121; /ligand=""ATP""; /ligand_id=""ChEBI:...","ACT_SITE 214; /note=""Proton acceptor""; /eviden...",MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...
5,A3QNQ0,"Protein kinase superfamily, TKL Ser/Thr protei...","BINDING 250..258; /ligand=""ATP""; /ligand_id=""C...","ACT_SITE 379; /note=""Proton acceptor""; /eviden...",MGRGLLRGLWPLHIVLWTRIASTIPPHVQKSVNNDMIVTDNNGAVK...


So we have a dataset of 1406 proteins, each with binding site annotations, an amino acid sequence, and a protein family. We downloaded human proteins from the kinase family; however, the annotations still distinguish many families and subfamilies:

In [None]:
# Group the data by 'Protein families' and get the size of each group
family_sizes = data.groupby('Protein families').size()
print(family_sizes.sort_values(ascending=False))

# Create a new column with the size of each family and sort by 'Family size' in descending order and then by 'Protein families'
data['Family size'] = data['Protein families'].map(family_sizes)
data = data.sort_values(by=['Family size', 'Protein families'], ascending=[False, True])
data.drop(columns='Family size', inplace=True) # Drop the 'Family size' column as it is no longer needed
data

Protein families
Protein kinase superfamily                                                             164
Protein kinase superfamily, CMGC Ser/Thr protein kinase family, CDC2/CDKX subfamily     96
Protein kinase superfamily, STE Ser/Thr protein kinase family, STE20 subfamily          78
Protein kinase superfamily, Tyr protein kinase family, Insulin receptor subfamily       73
Protein kinase superfamily, CAMK Ser/Thr protein kinase family                          56
                                                                                      ... 
GHMP kinase family, Mevalonate kinase subfamily                                          1
Protein kinase superfamily, TKL Ser/Thr protein kinase family, ROCO subfamily            1
Glutamate 5-kinase family; Gamma-glutamyl phosphate reductase family                     1
Guanylate kinase family                                                                  1
GHMP kinase family                                                       

Unnamed: 0,Entry,Protein families,Binding site,Active site,Sequence
359,Q504Y2,Protein kinase superfamily,"BINDING 144..152; /ligand=""ATP""; /ligand_id=""C...","ACT_SITE 278; /note=""Proton acceptor""; /eviden...",MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...
414,Q8IWB6,Protein kinase superfamily,"BINDING 233..241; /ligand=""ATP""; /ligand_id=""C...",,MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...
427,Q8NB16,Protein kinase superfamily,"BINDING 209..217; /ligand=""ATP""; /ligand_id=""C...",,MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...
778,A0A7P0T838,Protein kinase superfamily,"BINDING 71; /ligand=""ATP""; /ligand_id=""ChEBI:C...",,MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...
779,A0A7P0T952,Protein kinase superfamily,"BINDING 71; /ligand=""ATP""; /ligand_id=""ChEBI:C...",,MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...
...,...,...,...,...,...
1770,M1VPF4,"Protein kinase superfamily, Tyr protein kinase...","BINDING 358; /ligand=""ATP""; /ligand_id=""ChEBI:...",,MMEAIKKKMQMLKLDKENALDRAEQAEAEQKQAEERSKQLEDELAA...
21,O00764,Pyridoxine kinase family,"BINDING 12; /ligand=""pyridoxal""; /ligand_id=""C...","ACT_SITE 235; /note=""Proton acceptor""; /eviden...",MEEECRVLSIQSHVIRGYVGNRAATFPLQVLGFEIDAVNSVQFSNH...
1017,M1V485,SLC34A transporter family; Protein kinase supe...,"BINDING 906; /ligand=""ATP""; /ligand_id=""ChEBI:...",,MAPWPELGDAQPNPDKYLEGAAGQQPTAPDKSKETNKTDNTEAPVT...
82,P04183,Thymidine kinase family,"BINDING 26..33; /ligand=""ATP""; /ligand_id=""ChE...","ACT_SITE 98; /note=""Proton acceptor""; /evidenc...",MSCINLPTVLPGSPSKTRGQIQVILGPMFSGKSTELMRRVRRFQIA...
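
The family strings in the output above appear to follow a pattern: commas separate hierarchy levels (superfamily, family, subfamily) and semicolons separate memberships in more than one family. Assuming that pattern holds, a minimal sketch for splitting an annotation into its levels could look like this (`parse_family_annotation` is a hypothetical helper, not part of the notebook):

```python
def parse_family_annotation(annotation):
    """Split a 'Protein families' string into one list of hierarchy levels per family.

    Assumes ';' separates distinct families and ',' separates hierarchy levels.
    """
    families = []
    for entry in annotation.split(";"):
        levels = [level.strip() for level in entry.split(",") if level.strip()]
        if levels:
            families.append(levels)
    return families


example = "Protein kinase superfamily, CMGC Ser/Thr protein kinase family, CDC2/CDKX subfamily"
parsed = parse_family_annotation(example)
print(parsed[0][0])   # top-level group: Protein kinase superfamily
print(parsed[0][-1])  # most specific level: CDC2/CDKX subfamily
```

Grouping by the top-level entry instead of the full string would give coarser groups. This tutorial keeps the full string as the family label.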


Now let's make the binding and active sites information clearer:

In [None]:
# Extract the location from the binding and active site columns
def extract_location(site_info):
    if pd.isnull(site_info):
        return None
    locations = []
    for info in site_info.split(';'):
        if 'BINDING' in info or 'ACT_SITE' in info:
            locations.append(info.split()[1])
    return '; '.join(locations)

# Apply the function to the 'Binding site' and 'Active site' columns to extract the locations
data['Binding site'] = data['Binding site'].apply(extract_location)
data['Active site'] = data['Active site'].apply(extract_location)

# Display the first few rows of the modified dataframe
data.head()

Unnamed: 0,Entry,Protein families,Binding site,Active site,Sequence
359,Q504Y2,Protein kinase superfamily,144..152; 166,278.0,MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...
414,Q8IWB6,Protein kinase superfamily,233..241; 273,,MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...
427,Q8NB16,Protein kinase superfamily,209..217; 230,,MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...
778,A0A7P0T838,Protein kinase superfamily,71,,MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...
779,A0A7P0T952,Protein kinase superfamily,71,,MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...
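
To see what the extraction does on a single annotation, here is a standalone sketch that uses a regular expression instead of string splitting. The annotation string is a made-up example modelled on the UniProt format shown above, and `extract_positions` is a hypothetical name. It is an alternative formulation, not the function used in this notebook.

```python
import re

# Capture the position token that follows a BINDING or ACT_SITE keyword
POSITION_PATTERN = re.compile(r"(?:BINDING|ACT_SITE)\s+([^;\s]+)")


def extract_positions(site_info):
    """Return the annotated positions as a '; '-separated string, or None if missing."""
    if not isinstance(site_info, str):
        return None
    return "; ".join(POSITION_PATTERN.findall(site_info))


# Made-up annotation in the style of the 'Binding site' column
annotation = 'BINDING 144..152; /ligand="ATP"; BINDING 166; /ligand="ATP"'
print(extract_positions(annotation))  # 144..152; 166
```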


In [None]:
# Create a new column that combines the 'Binding site' and 'Active site' columns
data['Binding-Active site'] = data['Binding site'].astype(str) + '; ' + data['Active site'].astype(str)
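# A missing active site shows up as the string 'None' (e.g. '71; None'); such
# non-numeric tokens are ignored later when the binary labels are built.
# If both annotations are missing, store None instead of a placeholder string
data['Binding-Active site'] = data['Binding-Active site'].replace({'nan; nan': None, 'None; None': None})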

data.head()

Unnamed: 0,Entry,Protein families,Binding site,Active site,Sequence,Binding-Active site
359,Q504Y2,Protein kinase superfamily,144..152; 166,278.0,MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...,144..152; 166; 278
414,Q8IWB6,Protein kinase superfamily,233..241; 273,,MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...,233..241; 273; None
427,Q8NB16,Protein kinase superfamily,209..217; 230,,MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...,209..217; 230; None
778,A0A7P0T838,Protein kinase superfamily,71,,MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...,71; None
779,A0A7P0T952,Protein kinase superfamily,71,,MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...,71; None


### Angle bracket symbols in Binding/Active site

In biological databases like UniProt, you may encounter entries in the "Binding site" or "Active site" columns (or any other feature-related columns) that contain symbols like '<' or '>'. These typically indicate positional uncertainty or boundaries that lie outside the range of the sequence currently being annotated:

- '<': This symbol is used to indicate that the feature (such as a binding or active site) starts before the position given. For example, if you see "<5" in the context of a binding site, it suggests that the binding site starts before amino acid position 5 in the protein sequence.

- '>': Conversely, this symbol is used to show that the feature extends beyond the position given. If you see ">200" for an active site, it implies that the active site extends beyond amino acid position 200.

These annotations provide information about the location of certain functional sites within a protein, but with an acknowledgment of some level of uncertainty or incompleteness in the data that could be due to various reasons, such as limitations in experimental data, partial protein sequences, or predictions based on related proteins rather than direct evidence.

We will filter out entries containing these symbols so as to work with a dataset with certainty on the binding/active sites.

In [None]:
# Find entries containing '<' or '>'
entries_angles = data['Binding-Active site'].str.contains('<|>', na=False)
print(f"Number of entries with angle brackets: {entries_angles.sum()}")

# Remove all rows where the "Binding-Active site" column contains '<' or '>'
data = data[~entries_angles]
print(f"Number of remaining rows: {data.shape[0]}")


Number of entries with angle brackets: 0
Number of remaining rows: 1406


### Question mark ("?") symbols in Binding/Active site

In biological databases like UniProt, a question mark ("?") in the "Binding site" or "Active site" columns typically indicates uncertainty or incomplete information about the feature. The exact position of the binding or active site within the protein sequence may not be clearly determined, or the feature may be predicted by computational models or inferred from homologous proteins without experimental verification. It can also reflect conflicting data or interpretations about the presence or characteristics of the site, or an incomplete annotation process. As with angle brackets, we remove these entries.

In [None]:
# Find rows where the "Binding-Active site" column contains a literal "?"
entries_question_mark = data[data['Binding-Active site'].str.contains('?', na=False, regex=False)]
print(f"Number of entries with question marks: {entries_question_mark.shape[0]}")

# Remove all rows containing '?' in the "Binding-Active site" column
data = data.drop(entries_question_mark.index)
print(f"Number of remaining rows: {data.shape[0]}")


Number of entries with question marks: 0
Number of remaining rows: 1406
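
Neither filter removed anything in this dataset, so a toy example helps show what they would catch. The sketch below combines both checks into one pure-Python predicate. The location strings are invented for illustration and `has_uncertain_position` is a hypothetical helper.

```python
import re

# Any of '<', '>' or '?' marks an uncertain or out-of-range position
UNCERTAIN = re.compile(r"[<>?]")


def has_uncertain_position(locations):
    """True if the location string contains '<', '>' or '?'; missing values count as False."""
    return isinstance(locations, str) and UNCERTAIN.search(locations) is not None


# Invented location strings: one clean, three uncertain, one missing
toy_locations = ["144..152; 166", "<1..20", "233..>241", "?..85", None]
kept = [loc for loc in toy_locations if not has_uncertain_position(loc)]
print(kept)  # ['144..152; 166', None]
```

Missing values are kept here, mirroring the `na=False` argument used in the cells above.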


### Binding/active sites labels

Now let's list every amino acid involved in a binding/active site by expanding each range into the individual residue positions it covers:

In [None]:
def expand_ranges(s):
    """Expand ranges into a comma-separated string."""
    return re.sub(r'(\d+)\.\.(\d+)', lambda m: ', '.join(map(str, range(int(m.group(1)), int(m.group(2))+1))), str(s))

data['Binding-Active site'] = data['Binding-Active site'].apply(expand_ranges)
print(data.head())

          Entry            Protein families   Binding site Active site  \
359      Q504Y2  Protein kinase superfamily  144..152; 166         278   
414      Q8IWB6  Protein kinase superfamily  233..241; 273        None   
427      Q8NB16  Protein kinase superfamily  209..217; 230        None   
778  A0A7P0T838  Protein kinase superfamily             71        None   
779  A0A7P0T952  Protein kinase superfamily             71        None   

                                              Sequence  \
359  MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...   
414  MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...   
427  MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...   
778  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...   
779  MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...   

                                   Binding-Active site  
359  144, 145, 146, 147, 148, 149, 150, 151, 152; 1...  
414  233, 234, 235, 236, 237, 238, 239, 240, 241; 2...  
427  209, 210, 211, 212, 213, 214, 

You can now convert the binding/active site information into binary labels: 1 where there is a binding/active site and 0 elsewhere. The code retrieves the 1-based positions in the 'Binding-Active site' column and sets the corresponding positions in the protein sequence to 1. All other amino acids of the sequence are set to 0:





In [None]:
def convert_to_binary_list(binding_active_str, sequence_len):
    """Convert a Binding-Active site string to a binary list based on the sequence length."""
    binary_list = [0] * sequence_len
    # Retrieve the 1-based positions of binding/active sites and set the matching 0-based indices to 1
    if pd.notna(binding_active_str):
        indices = [int(x) - 1 for segment in binding_active_str.split(';') for x in segment.split(',') if x.strip().isdigit()]
        for idx in indices:
            if 0 <= idx < sequence_len: # Ensure the index is within the valid range
                binary_list[idx] = 1

    return binary_list

# Apply the function to each row of the dataset
data['Binding-Active site'] = data.apply(lambda row: convert_to_binary_list(row['Binding-Active site'], len(row['Sequence'])), axis=1)
data.head()

Unnamed: 0,Entry,Protein families,Binding site,Active site,Sequence,Binding-Active site
359,Q504Y2,Protein kinase superfamily,144..152; 166,278.0,MRRRRAAVAAGFCASFLLGSVLNVLFAPGSEPPRPGQSPEPSPAPG...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
414,Q8IWB6,Protein kinase superfamily,233..241; 273,,MSRAVRLPVPCPVQLGTLRNDSLEAQLHEYVKQGNYVKVKKILKKG...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
427,Q8NB16,Protein kinase superfamily,209..217; 230,,MENLKHIITLGQVIHKRCEEMKYCKKQCRRLGHRVLGLIKPLEMLQ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
778,A0A7P0T838,Protein kinase superfamily,71,,MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
779,A0A7P0T952,Protein kinase superfamily,71,,MPRVKAAQAGRQSSAKRHLAEQFAVGEIITDMAKKEWKVGLPIGQG...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
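
Because the label lists above are mostly zeros, the effect is easier to see on a short example. The following standalone sketch repeats the two steps (range expansion, then binary labelling) on a made-up 10-residue sequence. `to_binary_labels` is a simplified stand-in for `convert_to_binary_list`, written without pandas.

```python
import re


def expand_ranges(s):
    """Expand 'a..b' ranges into a comma-separated list of positions."""
    return re.sub(
        r"(\d+)\.\.(\d+)",
        lambda m: ", ".join(str(i) for i in range(int(m.group(1)), int(m.group(2)) + 1)),
        str(s),
    )


def to_binary_labels(expanded, sequence_len):
    """Mark each listed 1-based position with 1; ignore non-numeric tokens such as 'None'."""
    labels = [0] * sequence_len
    for segment in expanded.split(";"):
        for token in segment.split(","):
            token = token.strip()
            if token.isdigit() and 1 <= int(token) <= sequence_len:
                labels[int(token) - 1] = 1  # convert 1-based position to 0-based index
    return labels


sequence = "MKTAYIAKQR"  # made-up sequence of 10 residues
expanded = expand_ranges("3..5; 8; None")
labels = to_binary_labels(expanded, len(sequence))
print(expanded)  # 3, 4, 5; 8; None
print(labels)    # [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
```

The trailing `None` token, which appears when an entry has no active site annotation, is skipped because it is not numeric.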


## Split train/test sets

Let's split the data into training and test sets based on UniProt protein families, so that each protein family falls entirely in the training set or entirely in the test set. The goal is for the test set to contain only families that are not seen in the training set, so that the evaluation reflects the model's ability to generalize to entirely new protein families.

Notably, this is different from the traditional stratified split, which aims to preserve the distribution of classes across both sets.

In [None]:
# Get the number of distinct protein families
num_families = data['Protein families'].nunique()
print(f"Number of distinct protein families: {num_families}")

Number of distinct protein families: 126


In [None]:
def split_data_by_family(data, test_ratio=0.20):
    """
    Splits the dataset into train and test sets by entire protein families (not a family-stratified split!).

    Parameters:
    - data: pandas DataFrame containing the dataset with a 'Protein families' column.
    - test_ratio: float, the proportion of the dataset to include in the test split.

    Returns:
    - test_df: pandas DataFrame containing the test set.
    - train_df: pandas DataFrame containing the training set.
    """
    # Get unique protein families and shuffle them to randomize the selection
    unique_families = data['Protein families'].unique()
    np.random.shuffle(unique_families)

    # Loop through the shuffled families and add rows to the test set
    test_rows = []
    current_test_rows = 0
    for family in unique_families:
        family_rows = data[data['Protein families'] == family].index.tolist()
        if current_test_rows + len(family_rows) <= int(test_ratio * data.shape[0]):
            test_rows.extend(family_rows)
            current_test_rows += len(family_rows)
        else:
            # This family would push the test set past the target size: add it anyway
            # (families are never split) and stop, so the test set can exceed test_ratio
            test_rows.extend(family_rows)
            break

    # Create the test and train datasets
    train_rows = [i for i in data.index if i not in test_rows]
    test_df = data.loc[test_rows]
    train_df = data.loc[train_rows]

    return test_df, train_df

test_df, train_df = split_data_by_family(data, test_ratio=0.20)
print(test_df.shape[0], train_df.shape[0])

392 1014
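
Note that 392 of 1406 entries is about 28%, more than the requested 20%, because the function adds the family that crosses the target before stopping. If staying at or below the target matters, one option is to skip families that would overshoot. The sketch below shows that variant on toy data, together with a check that no family ends up on both sides. It is an illustrative alternative with hypothetical names, not the function used in this notebook.

```python
import random


def split_by_family(records, test_ratio=0.2, seed=0):
    """Split (entry, family) records by whole families without exceeding the test target."""
    families = sorted({family for _, family in records})
    random.Random(seed).shuffle(families)
    target = int(test_ratio * len(records))

    test_families, n_test = set(), 0
    for family in families:
        size = sum(1 for _, fam in records if fam == family)
        if n_test + size > target:
            continue  # skip families that would overshoot the target
        test_families.add(family)
        n_test += size

    test = [r for r in records if r[1] in test_families]
    train = [r for r in records if r[1] not in test_families]
    return test, train


# Toy data: four families of sizes 5, 3, 2 and 10 (20 records in total)
sizes = {"A": 5, "B": 3, "C": 2, "D": 10}
records = [(f"{fam}{i}", fam) for fam, n in sizes.items() for i in range(n)]

test, train = split_by_family(records, test_ratio=0.2)

# Sanity check: no family is shared between the two sets
shared = {fam for _, fam in test} & {fam for _, fam in train}
print(len(test), len(train), shared)
```

The trade-off is that this variant can undershoot the target, and with very uneven family sizes the test set may end up small.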


In [None]:
test_df.head()

Unnamed: 0,Entry,Protein families,Binding site,Active site,Sequence,Binding-Active site
39,O43252,APS kinase family; Sulfate adenylyltransferase...,62..67; 89..92; 101; 106..109; 132..133; 171; ...,,MEIPGSLCKKVKLSNNAQNWGMQRATNVTYQAHHVSRNKRGQVVGT...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
68,O95340,APS kinase family; Sulfate adenylyltransferase...,52..57; 79..82; 91; 96..99; 122..123; 161; 174...,,MSGIKKQKTENQQKSTNVVYQAHHVSRNKRGQVVGTRGGFRGCTVW...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,A0A2P9DU05,"Protein kinase superfamily, AGC Ser/Thr protei...",121,214.0,MSRPPPTGKMPGAPETAPGDGAGASRQRKLEALIRDPRSPINVESL...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
12,O00141,"Protein kinase superfamily, AGC Ser/Thr protei...",104..112; 127,222.0,MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
22,O14578,"Protein kinase superfamily, AGC Ser/Thr protei...",103..111; 126,221.0,MLKFKYGARNPLDAGAAEPIASRASRLNLFFQGKPPFMTQQQMSPL...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


If you don't want to keep the entire train/test datasets, you can create smaller versions by randomly subsampling each set. Uncomment the code below if that is the case:

In [None]:
# # Percentage of data you want to keep
# k = 0.05  # for keeping 5% of the data

# # Generate random indices representing a percentage of each dataset
# train_df = train_df.sample(frac=k, random_state=42)
# test_df = test_df.sample(frac=k, random_state=42)

## Split sequences into chunks

Sequences aren't always the same length. We will split the longer protein sequences and their labels into non-overlapping chunks of at most a given length, to fit a given context window of the ESM-2 models. Protein sequences average around 350 residues, so longer context windows are often unnecessary, but keep in mind that the context length affects training time and batch size. Here, we pick a context of 1000.

In [None]:
def split_into_chunks(sequences, labels, chunk_size = 1000):
    """Split sequences and labels into chunks of size "chunk_size" or less."""
    new_sequences = []
    new_labels = []
    for seq, lbl in zip(sequences, labels):
        if len(seq) > chunk_size:
            # Split the sequence and labels into chunks of size "chunk_size" or less
            for i in range(0, len(seq), chunk_size):
                new_sequences.append(seq[i:i+chunk_size])
                new_labels.append(lbl[i:i+chunk_size])
        else:
            new_sequences.append(seq)
            new_labels.append(lbl)

    return new_sequences, new_labels
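
As a quick illustration of the chunking logic, here is a compact standalone sketch applied to a made-up 7-residue sequence with a chunk size of 3. The helper name `chunk` and the toy data are hypothetical.

```python
def chunk(seq, labels, chunk_size):
    """Return (sequence, labels) pairs covering seq in non-overlapping windows."""
    return [
        (seq[i:i + chunk_size], labels[i:i + chunk_size])
        for i in range(0, len(seq), chunk_size)
    ]


seq = "MKTAYIA"
labels = [0, 0, 1, 1, 0, 0, 1]
chunks = chunk(seq, labels, chunk_size=3)
for sub_seq, sub_labels in chunks:
    print(sub_seq, sub_labels)
# MKT [0, 0, 1]
# AYI [1, 0, 0]
# A [1]

# Joining the chunks back together recovers the original sequence and labels
rejoined_seq = "".join(sub_seq for sub_seq, _ in chunks)
rejoined_labels = [x for _, sub_labels in chunks for x in sub_labels]
```

The last chunk is shorter than the others, and each chunk keeps its labels aligned with its residues.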


In [None]:
# Create lists of sequences and labels
test_seq = test_df['Sequence'].tolist()
test_labels = test_df['Binding-Active site'].tolist()
train_seq = train_df['Sequence'].tolist()
train_labels = train_df['Binding-Active site'].tolist()

In [None]:
# Apply the function to create new datasets with chunks of size "chunk_size" or less
chunk_size = 1000
test_seq_chunked, test_labels_chunked = split_into_chunks(test_seq, test_labels, chunk_size=chunk_size)
train_seq_chunked, train_labels_chunked = split_into_chunks(train_seq, train_labels, chunk_size=chunk_size)

The resulting train and test files will be exported to the same path where the input data file was located:

In [None]:
filename = os.path.splitext(os.path.basename(file_path))[0]
out_dir = os.path.dirname(file_path)  # named out_dir to avoid shadowing the built-in dir()

# Paths to save the new chunked pickle files
test_labels_path =  os.path.join(out_dir, filename + "_test_labels_chunked_" + str(chunk_size) + ".pkl")
test_seq_path = os.path.join(out_dir, filename + "_test_sequences_chunked_" + str(chunk_size) + ".pkl")
train_labels_path = os.path.join(out_dir, filename + "_train_labels_chunked_" + str(chunk_size) + ".pkl")
train_seq_path = os.path.join(out_dir, filename + "_train_sequences_chunked_" + str(chunk_size) + ".pkl")

# Save the chunked datasets as new pickle files
with open(test_labels_path, 'wb') as file:
    pickle.dump(test_labels_chunked, file)
with open(test_seq_path, 'wb') as file:
    pickle.dump(test_seq_chunked, file)
with open(train_labels_path, 'wb') as file:
    pickle.dump(train_labels_chunked, file)
with open(train_seq_path, 'wb') as file:
    pickle.dump(train_seq_chunked, file)

test_labels_path, test_seq_path, train_labels_path, train_seq_path


('/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_test_labels_chunked_1000.pkl',
 '/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_test_sequences_chunked_1000.pkl',
 '/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_train_labels_chunked_1000.pkl',
 '/content/gdrive/MyDrive/ESMbind/data/uniprotkb_data_2024_05_29_train_sequences_chunked_1000.pkl')
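
Before using the exported files downstream, it can be worth reloading them and checking that sequences and labels still line up. The sketch below demonstrates the round trip on toy data written to a temporary directory. The file names and data are placeholders, and the same checks could be applied to the real paths above.

```python
import os
import pickle
import tempfile

# Toy chunked data standing in for the real sequences and labels
sequences = ["MKT", "AYI", "A"]
labels = [[0, 0, 1], [1, 0, 0], [1]]

with tempfile.TemporaryDirectory() as tmp_dir:
    seq_path = os.path.join(tmp_dir, "toy_sequences_chunked_3.pkl")
    labels_path = os.path.join(tmp_dir, "toy_labels_chunked_3.pkl")

    # Save both lists as pickle files
    with open(seq_path, "wb") as handle:
        pickle.dump(sequences, handle)
    with open(labels_path, "wb") as handle:
        pickle.dump(labels, handle)

    # Load them back
    with open(seq_path, "rb") as handle:
        loaded_sequences = pickle.load(handle)
    with open(labels_path, "rb") as handle:
        loaded_labels = pickle.load(handle)

# Every chunk must have exactly one label per residue
same_count = len(loaded_sequences) == len(loaded_labels)
aligned = all(len(s) == len(l) for s, l in zip(loaded_sequences, loaded_labels))
print(same_count, aligned)  # True True
```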

# Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:


## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.


## Join the DeepChem Discord
The DeepChem [Discord](https://discord.gg/cGzwCdrUqS) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

# Citing this tutorial
If you found this tutorial useful, please consider citing it using the provided BibTeX.




```
@manual{Bioinformatics,
 title={UniProt data pre-processing for binding site prediction downstream task},
 organization={DeepChem},
 author={Gómez de Lope, Elisa},
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/UniProt_Data_Preprocessing_for_Binding_Sites.ipynb}},
 year={2024},
}
```

