# Data processing

In [4]:
# !python -m pip install -r requirements.txt

In [1]:
import sidechainnet as scn
import pickle
from utils.preprocess_sidechainnet import combine_data, preprocess, cut_missing_ends



Take the 70% identity dataset and expand it with sequences from the same PDBs in the 100% identity dataset. 

This is by far the most memory-intensive step, so we save the resulting intermediate file. My output from this step is also available on S3.

To download it, run:

In [2]:
# !aws s3 cp s3://ml4-main-storage/updated_70.pkl sidechainnet_data

To compute, run:

In [3]:
# data_70 = scn.load(casp_version=12, thinning=70)
# data_100 = scn.load(casp_version=12, thinning=100)
# for dataset in ["train"]:
#     data_70 = combine_data(
#         main_data_scn=data_70, 
#         add_data_scn=data_100,
#         dataset=dataset,
#     )
# with open("./sidechainnet_data/updated_70.pkl", "wb") as f:
#     pickle.dump(data_70, f)

SidechainNet was loaded from ./sidechainnet_data/sidechainnet_casp12_70.pkl.
SidechainNet was loaded from ./sidechainnet_data/sidechainnet_casp12_100.pkl.
Adding 11332 chains...
Adding 0 chains...
Adding 0 chains...
Adding 0 chains...
Adding 0 chains...
Adding 0 chains...
Adding 0 chains...
Adding 0 chains...
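
The idea behind this step can be illustrated on toy data. The snippet below is not the `combine_data` implementation. It assumes chain IDs of the form `<PDB>_<model>_<chain>`, and `pdb_id` and `find_extra_chains` are hypothetical helpers that pick chains from the 100% set whose PDB ID already appears in the 70% set.

```python
def pdb_id(chain_id):
    """Return the PDB part of an ID such as '1ABC_1_A' (assumed ID format)."""
    return chain_id.split("_")[0]


def find_extra_chains(ids_70, ids_100):
    """Hypothetical helper: chains in ids_100 that share a PDB ID with a chain
    in ids_70 but are not already present in ids_70."""
    known_pdbs = {pdb_id(i) for i in ids_70}
    present = set(ids_70)
    return [i for i in ids_100 if pdb_id(i) in known_pdbs and i not in present]


# Toy IDs, made up for illustration
ids_70 = ["1ABC_1_A", "2XYZ_1_A"]
ids_100 = ["1ABC_1_A", "1ABC_1_B", "2XYZ_1_B", "3DEF_1_A"]

extra = find_extra_chains(ids_70, ids_100)
print(extra)  # ['1ABC_1_B', '2XYZ_1_B']
```

`3DEF_1_A` is skipped because its PDB ID is not in the 70% set, and `1ABC_1_A` is skipped because it is already there.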


Now we will open that file, filter the chains by name, resolution, length and fraction of missing values and combine them in a new multi-chain dataset.

In [None]:
with open("./sidechainnet_data/updated_70.pkl", "rb") as f:
    data = pickle.load(f)

train = preprocess(data, dataset="train")
validation = preprocess(data, dataset="valid-30")

Searching for shortened chains...


100%|██████████| 32446/32446 [00:00<00:00, 628871.34it/s]


Searching for duplicates...


100%|██████████| 6684/6684 [00:00<00:00, 333941.56it/s]


Calculating sequence similarities...


100%|██████████| 4136/4136 [00:08<00:00, 515.87it/s]


Finding chains to combine...


100%|██████████| 3317/3317 [00:00<00:00, 1017665.60it/s]


Filtering by resolution, length and missing values...


100%|██████████| 52712/52712 [00:01<00:00, 28451.69it/s]
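
As a rough sketch of what the last filtering stage might look like for a single chain: the function names and all threshold values below are hypothetical placeholders, not the ones used by `preprocess`.

```python
def missing_fraction(mask):
    """Fraction of residues marked '-' (missing) in a mask string."""
    return mask.count("-") / len(mask) if mask else 1.0


def keep_chain(mask, resolution, min_length=30, max_resolution=3.5, max_missing=0.1):
    """Hypothetical filter; every threshold here is a placeholder value."""
    if resolution is None or resolution > max_resolution:
        return False  # unknown or too coarse resolution
    if len(mask) < min_length:
        return False  # chain too short
    return missing_fraction(mask) <= max_missing


print(keep_chain("+" * 40, resolution=2.0))              # True
print(keep_chain("-" * 10 + "+" * 30, resolution=2.0))   # False: 25% missing
```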


Optionally, we can cut the ends where there is structure information missing.

In [None]:
train = cut_missing_ends(train)
validation = cut_missing_ends(validation)

Cutting missing ends...


100%|██████████| 24276/24276 [00:00<00:00, 192420.28it/s]

Cut 18474 start intervals (mean length 10.8) and 19713 end intervals (mean length 15.9)
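
To make the operation concrete, here is a minimal sketch of trimming one chain, assuming the mask uses `'-'` for missing residues and the coordinate array has 14 rows per residue. `trim_missing_ends` is a hypothetical stand-in, not the code behind `cut_missing_ends`.

```python
import numpy as np

ATOMS_PER_RESIDUE = 14  # rows per residue in the coordinate array


def trim_missing_ends(seq, msk, crd):
    """Hypothetical helper: drop leading and trailing residues whose mask is '-'."""
    start = len(msk) - len(msk.lstrip("-"))  # number of missing residues at the start
    end = len(msk.rstrip("-"))               # index just past the last observed residue
    if start >= end:                         # nothing observed at all
        return "", "", crd[:0]
    rows = slice(start * ATOMS_PER_RESIDUE, end * ATOMS_PER_RESIDUE)
    return seq[start:end], msk[start:end], crd[rows]


# Toy chain, made up for illustration
seq = "ACDEFGH"
msk = "--++-+-"
crd = np.zeros((len(seq) * ATOMS_PER_RESIDUE, 3))

new_seq, new_msk, new_crd = trim_missing_ends(seq, msk, crd)
print(new_seq, new_msk, new_crd.shape)  # DEFG ++-+ (56, 3)
```

Only the ends are cut; the missing residue in the middle of the chain is kept.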





This is the final dataset. We will save it in the same folder.

In [None]:
with open("./sidechainnet_data/multichain_casp12_70.pkl", "wb") as f:
    pickle.dump(train, f)
with open("./sidechainnet_data/multichain_casp12_70_val10.pkl", "wb") as f:
    pickle.dump(validation, f)

This file is organised similarly to SidechainNet datasets but there are some important differences.

It is a dictionary with the following keys:
- `'ids'`: a list of PDB IDs,
- `'scn'`: a list of lists of SidechainNet IDs that were combined here,
- `'seq'`: a list of lists of residue sequences,
- `'msk'`: a list of lists of string masks (e.g. `'+++-----+++++++++++++++'`; `'-'` indicates missing structural information),
- `'crd'`: a list of `numpy` coordinate arrays of shape `(L * 14, 3)` (each residue occupies 14 consecutive rows; the first 4 correspond to N, Ca, C, O, in that order, and the remaining 10 hold side chain information).
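
The flat coordinate layout is easier to work with once reshaped to one block per residue. The sketch below uses a toy array in place of a real `'crd'` entry, and the variable names are illustrative only.

```python
import numpy as np

L = 5  # toy chain length
# Stand-in for one 'crd' entry: shape (L * 14, 3)
crd = np.arange(L * 14 * 3, dtype=float).reshape(L * 14, 3)

per_residue = crd.reshape(-1, 14, 3)  # (L, 14, 3): one block of 14 atoms per residue
backbone = per_residue[:, :4]         # N, Ca, C, O for every residue
ca = per_residue[:, 1]                # Ca atoms only, shape (L, 3)
side_chain = per_residue[:, 4:]       # remaining 10 side chain slots per residue

print(backbone.shape, ca.shape, side_chain.shape)  # (5, 4, 3) (5, 3) (5, 10, 3)
```

Row `i * 14 + 1` of the flat array is the Ca atom of residue `i`, so `ca[i]` and `crd[i * 14 + 1]` refer to the same coordinates.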