# Building an initial collection of PK domains

Here, we'll build a collection of protein kinase (PK) domains from scratch.
We'll use the SIFTS database to map UniProt sequences to PDB IDs.
We'll then find domains in these sequences and fetch the associated PDB structures for successful hits.
Finally, we'll transfer the discovered domain boundaries to the PDB structures and extract each sequence and structure domain.
The accompanying paper provides a more detailed description of this process. Also, don't hesitate to inspect the [docs](https://kinactive.readthedocs.io/en/latest/index.html) (they also provide links to the relevant source code) or [raise an issue](https://github.com/edikedik/kinactive/issues).
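
To make the boundary-transfer step concrete, here is a minimal sketch in plain Python. It assumes a SIFTS-like mapping from UniProt positions to PDB residue numbers is already available as a dictionary. The function name `transfer_boundaries` and the toy mapping are hypothetical and are not part of kinactive, which handles this step internally.

```python
from __future__ import annotations


def transfer_boundaries(
    seq_start: int, seq_end: int, seq_to_pdb: dict[int, int]
) -> tuple[int, int] | None:
    """Map inclusive domain boundaries from sequence to structure numbering.

    Positions missing from ``seq_to_pdb`` are treated as unresolved in the
    structure. Returns None if no domain position is resolved.
    """
    mapped = [seq_to_pdb[i] for i in range(seq_start, seq_end + 1) if i in seq_to_pdb]
    if not mapped:
        return None
    return mapped[0], mapped[-1]


# Toy mapping (hypothetical): positions 5..12 are resolved and numbered 105..112.
toy_map = {i: i + 100 for i in range(5, 13)}
print(transfer_boundaries(3, 10, toy_map))   # (105, 110): the domain start is unresolved
print(transfer_boundaries(20, 30, toy_map))  # None: nothing is resolved
```

The sketch only shows why transferred boundaries can be narrower than the sequence-level ones when parts of a domain are missing from a structure.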

The time needed to complete this notebook depends on the internet connection and the hardware used.
Here, we'll use a laptop with a 24-core 13th-gen Intel processor and 32 GB of RAM.

In [1]:
import logging
import warnings
from pathlib import Path

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from kinactive import DB, DBConfig

In [2]:
logging.basicConfig(level=logging.INFO)

In [3]:
DATA = Path('../data')  # A path to the directory where data will be stored.
DATA.mkdir(exist_ok=True)
REPRODUCE = False
N_SEQ_DOMAINS = 3  # Restrict the number of processed canonical sequence domains for demonstration

if REPRODUCE:
    from kinactive.io import load_txt_lines
    # Replace with your paths if needed
    uni_list_path = Path('../data/submit/IDlists/UniProt_ids.txt')
    pdb_list_path = Path('../data/submit/IDlists/PDB_ids.txt')
    
    uni_ids = load_txt_lines(uni_list_path)
    pdb_ids = load_txt_lines(pdb_list_path)
else:
    uni_ids, pdb_ids = None, None

cfg = DBConfig(
    verbose=True,
    target_dir=DATA / 'lXt-PK',
    pdb_dir=DATA / 'pdb' / 'cif',
    pdb_dir_info=DATA / 'pdb' / 'info',
    seq_dir=DATA / 'uniprot' / 'fasta',
    io_cpus=10,
    init_map_numbering_cpus=10,
    init_cpus=10
)
db = DB(cfg)

The `DB` is built according to the settings specified in a `DBConfig` dataclass.
Consult the [docs](https://kinactive.readthedocs.io/en/latest/kinactive.config.html#kinactive.config.DBConfig) to see what the various options mean.
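
The three `*_cpus` values in the configuration above were picked for a 24-core laptop. On other machines, one possible way to choose a value is to derive it from the available core count. The helper `pick_cpus` below is a hypothetical convenience and not part of kinactive. Its result could be passed to `io_cpus`, `init_map_numbering_cpus`, and `init_cpus`.

```python
import os


def pick_cpus(reserve: int = 2, cap: int = 10) -> int:
    """Return a worker count that leaves ``reserve`` cores free and never exceeds ``cap``."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, available - reserve))


n_cpus = pick_cpus()
print(n_cpus)
```

The cap of 10 simply mirrors the values used in this notebook. Whether more workers help depends on the machine and the network.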

In [4]:
?DBConfig

Init signature:
DBConfig(
    verbose: bool = True,
    target_dir: pathlib.Path = PosixPath('db'),
    pdb_dir: pathlib.Path = PosixPath('pdb/structures'),
    pdb_dir_info: pathlib.Path = PosixPath('pdb/info'),
    seq_dir: pathlib.Path = PosixPath('uniprot/fasta'),
    max_fetch_trials: int =

In [5]:
?db.build

Signature:
db.build(
    uniprot_ids: collections.abc.Collection[str] | None = None,
    pdb_chain_ids: collections.abc.Collection[str] | None = None,
    n_domains: int = 0,
) -> lXtractor.core.chain.list.ChainList[lXtractor.core.chain.chain.Chain]
Docstrin

In [6]:
%%time

db.build(uni_ids, pdb_ids, n_domains=N_SEQ_DOMAINS);

INFO:kinactive.db:205 remaining sequences to fetch.


Fetching:   0%|          | 0/3 [00:00<?, ?it/s]

Saving fetched sequences: 0it [00:00, ?it/s]

Initializing objects:   0%|          | 0/61750 [00:00<?, ?it/s]

INFO:kinactive.db:Got 61750 seqs from ../data/uniprot/fasta
INFO:kinactive.db:Filtered to 49701 seqs in [150, 3000]


Annotating sequence domains: 0it [00:00, ?it/s]

INFO:kinactive.db:Found 680 PK domains within 666 seqs.
INFO:kinactive.db:Sampled to 3 random initial domains.
INFO:kinactive.db:Fetching info for 19 PDB IDs.


Fetching trials:   0%|          | 0/2 [00:00<?, ?it/s]

Fetching:   0%|          | 0/19 [00:00<?, ?it/s]

INFO:kinactive.db:Filtered to 18 X-ray PDB IDs out of 19.
INFO:kinactive.db:Fetching 18 X-ray structures


Fetching trials:   0%|          | 0/2 [00:00<?, ?it/s]

Fetching:   0%|          | 0/18 [00:00<?, ?it/s]

Initializing sequences:   0%|          | 0/2 [00:00<?, ?it/s]

Initializing structures: 0it [00:00, ?it/s]

Mapping numberings: 0it [00:00, ?it/s]

INFO:kinactive.db:Initialized 2 `Chain` objects.
INFO:kinactive.db:Filtered to 29 out of 29 domain structures having >=100 extracted domain size and >=0.9 canonical seq match fraction.
INFO:kinactive.db:Filtered to 2 out of 2 domains with at least one valid structures.
INFO:kinactive.db:Filtered to 2 chains out of 2 with at least one extracted domains.


CPU times: user 10 s, sys: 732 ms, total: 10.7 s
Wall time: 42.5 s
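
The log above mentions two thresholds applied to extracted structure domains: a minimum extracted domain size of 100 and a minimum canonical sequence match fraction of 0.9. The sketch below illustrates such a filter on toy data. The `DomainHit` class, its field names, and the toy values are assumptions made for illustration only. How kinactive actually computes these quantities is described in its docs and source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """A toy stand-in for an extracted structure domain (hypothetical)."""

    name: str
    size: int              # number of extracted residues
    match_fraction: float  # fraction of residues matching the canonical sequence


def passes(hit: DomainHit, min_size: int = 100, min_match: float = 0.9) -> bool:
    """Keep a domain only if it satisfies both thresholds."""
    return hit.size >= min_size and hit.match_fraction >= min_match


# Toy data: only "a" satisfies both thresholds.
hits = [
    DomainHit("a", 260, 0.99),
    DomainHit("b", 80, 1.00),   # too short
    DomainHit("c", 255, 0.75),  # too many mismatches
]
kept = [h.name for h in hits if passes(h)]
print(f"Filtered to {len(kept)} out of {len(hits)}: {kept}")
```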


In [7]:
%%time

if len(db.chains) > 0:
    db.save(overwrite=True)

Writing objects: 0it [00:00, ?it/s]

INFO:kinactive.db:Saved summary file initial_seq_summary.csv to ../data/lXt-PK
INFO:kinactive.db:Saved summary file initial_str_summary.csv to ../data/lXt-PK
INFO:kinactive.db:Saved summary file domain_seq_summary.csv to ../data/lXt-PK
INFO:kinactive.db:Saved summary file domain_str_summary.csv to ../data/lXt-PK


CPU times: user 94.8 ms, sys: 67.7 ms, total: 162 ms
Wall time: 1.87 s
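
The summary files listed in the log can be inspected with the standard library. The sketch below counts data rows in each `*_summary.csv` file. The helper name `summarize_csvs` is hypothetical, the path mirrors `target_dir` from the configuration above, and the code assumes each file starts with a header row. It makes no assumptions about column names.

```python
import csv
from pathlib import Path


def summarize_csvs(directory: Path) -> dict[str, int]:
    """Return {file name: number of data rows} for each *_summary.csv in ``directory``."""
    counts = {}
    for path in sorted(directory.glob("*_summary.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the (assumed) header row
            counts[path.name] = sum(1 for _ in reader)
    return counts


target = Path("../data/lXt-PK")  # same as `target_dir` above
if target.is_dir():
    for name, n_rows in summarize_csvs(target).items():
        print(f"{name}: {n_rows} rows")
```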
