# Data processing

In [None]:
!python -m pip install -r requirements.txt

In [1]:
import sidechainnet as scn
import pickle
from utils.preprocess_sidechainnet import combine_data, preprocess, cut_missing_ends



Take the 70% identity dataset and expand it with sequences from the same PDBs in the 100% identity dataset. 

This is by far the most memory-intensive step, so we will save the resulting intermediate file. I have uploaded my output from this step to `s3://ml4-main-storage/updated_70.pkl`.

In [2]:
data_70 = combine_data(
    main_data_scn=scn.load(casp_version=12, thinning=70), 
    add_data_scn=scn.load(casp_version=12, thinning=100)
)
with open("./sidechainnet_data/updated_casp12_70.pkl", "wb") as f:
    pickle.dump(data_70, f)

Now we will open that file, filter the chains by name, resolution, length and fraction of missing values, and combine them into a new multi-chain dataset.

In [3]:
with open("./sidechainnet_data/updated_casp12_70.pkl", "rb") as f:
    data = pickle.load(f)

data = preprocess(data, dataset="train")

Searching for shortened chains...


100%|██████████| 32446/32446 [00:00<00:00, 455538.18it/s]


Searching for duplicates...


100%|██████████| 6684/6684 [00:00<00:00, 305755.57it/s]


Calculating sequence similarities...


100%|██████████| 4136/4136 [00:07<00:00, 520.05it/s]


Finding chains to combine...


100%|██████████| 3317/3317 [00:00<00:00, 1132755.77it/s]


Filtering by resolution, length and missing values...


100%|██████████| 52712/52712 [00:01<00:00, 30859.65it/s] 


Removing 24505 chains...
Recombining chains...


100%|██████████| 1092/1092 [00:00<00:00, 1193.47it/s]


Generating multi-chain entries...


100%|██████████| 24276/24276 [00:13<00:00, 1846.08it/s]


Optionally, we can cut off the chain ends where structural information is missing.

In [4]:
data = cut_missing_ends(data)

Cutting missing ends...


100%|██████████| 24276/24276 [00:00<00:00, 192420.28it/s]

Cut 18474 start intervals (mean length 10.8) and 19713 end intervals (mean length 15.9)
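
To make the trimming concrete, here is a minimal sketch of the idea for a single chain. It is not the actual `cut_missing_ends` implementation: the helper name `trim_missing_ends`, the toy inputs and the assumption of 14 coordinate rows per residue are all illustrative. It strips the leading and trailing `'-'` runs from the mask and slices the sequence and coordinates to match.

```python
import numpy as np

ATOMS_PER_RESIDUE = 14  # assumption: 14 coordinate rows per residue


def trim_missing_ends(seq, msk, crd):
    """Hypothetical helper: drop residues with missing structure at both ends."""
    # Number of leading '-' characters in the mask.
    start = len(msk) - len(msk.lstrip("-"))
    # Index just past the last residue that has structure.
    end = len(msk.rstrip("-"))
    if end <= start:
        # The whole chain is missing: return empty results.
        return "", "", crd[:0]
    rows = slice(start * ATOMS_PER_RESIDUE, end * ATOMS_PER_RESIDUE)
    return seq[start:end], msk[start:end], crd[rows]


# Toy chain: 2 missing residues at the start, 1 at the end.
seq = "MKTAYIAK"
msk = "--+++-+-"
crd = np.arange(len(seq) * ATOMS_PER_RESIDUE * 3, dtype=float).reshape(-1, 3)

new_seq, new_msk, new_crd = trim_missing_ends(seq, msk, crd)
print(new_seq, new_msk, new_crd.shape)
```

Note that gaps in the middle of a chain are left untouched in this sketch; only the ends are trimmed.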





This is the final dataset. We will save it in the same folder.

In [None]:
with open("./sidechainnet_data/multichain_casp12_70.pkl", "wb") as f:
    pickle.dump(data, f)

This file is organised similarly to SidechainNet datasets but there are some important differences.

It is a dictionary with the following keys:
- `'ids'`: a list of PDB IDs,
- `'scn'`: a list of lists of SidechainNet IDs that were combined here,
- `'seq'`: a list of lists of residue sequences,
- `'msk'`: a list of lists of string masks (e.g. `'+++-----+++++++++++++++'`; `'-'` indicates missing structural information),
- `'crd'`: a list of `numpy` coordinate arrays of shape `(L * 14, 3)` (for each residue, the first 4 of its 14 sets of coordinates correspond to N, Ca, C, O, in that order; the rest is side chain information).
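
As a rough illustration of how this layout can be used, the sketch below builds a toy single-chain entry and splits its coordinates into backbone and side chain atoms. All values, including the IDs, are made up, and how the chains of a multi-chain entry map onto `L` is not covered here.

```python
import numpy as np

# Toy dataset with one single-chain entry (made-up values, not real data).
L = 5
data = {
    "ids": ["1abc"],
    "scn": [["1ABC_1_A"]],
    "seq": [["MKTAY"]],
    "msk": [["++-++"]],
    "crd": [np.zeros((L * 14, 3))],
}

i = 0  # index of the entry to inspect

# Reshape the flat (L * 14, 3) array so each residue has its own axis.
crd = data["crd"][i].reshape(-1, 14, 3)  # shape (L, 14, 3)
backbone = crd[:, :4]    # N, Ca, C, O for every residue
side_chain = crd[:, 4:]  # the remaining 10 side chain slots

# Fraction of residues with missing structure in the first chain.
mask = data["msk"][i][0]
known = np.array([c == "+" for c in mask])
missing_fraction = 1 - known.mean()

print(backbone.shape, side_chain.shape, missing_fraction)
```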