# Install

## Via venv locally

```
python3.11 -m venv VE
source VE/bin/activate

# if repo cloned locally
cd missense_kinase_toolkit/schema
pip install -e .
pip install ipython jupyter
cd ../../notebooks
jupyter-lab

# if not cloned locally
pip install git+https://github.com/choderalab/missense-kinase-toolkit.git#subdirectory=missense_kinase_toolkit/schema
pip install ipython jupyter
#TODO: navigate to this notebook
```

## Via colab

In [None]:
!pip install git+https://github.com/choderalab/missense-kinase-toolkit.git#subdirectory=missense_kinase_toolkit/schema

# Load package

In [1]:
from mkt.schema import io_utils

# Load data

`mkt.schema` provides a Pydantic model, `KinaseInfo`, that stores the data gathered and harmonized by our `databases` sub-package. The resulting `KinaseInfo` objects are serialized and stored in the `KinaseInfo` sub-directory of `mkt.schema`. They can be loaded as a dictionary keyed by HGNC gene name using the following command.

In [2]:
dict_kinase = io_utils.deserialize_kinase_dict()

Deserializing KinaseInfo objects in memory...: 100%|█| 566/566 [00:02<00:00, 1


We also support serializing to and deserializing from `json`, `yaml`, and `toml` files. `yaml` files take a long time to serialize and deserialize, so they are not demonstrated below. Additionally, `toml` serialization is not supported on Windows machines.

In [3]:
import shutil

for suffix in ["json", "toml"]:
    print(f"Format: {suffix}")
    io_utils.serialize_kinase_dict(dict_kinase, suffix=suffix, str_path=f"./{suffix}")
    dict_temp = io_utils.deserialize_kinase_dict(suffix=suffix, str_path=f"./{suffix}", bool_remove=True)
    print(f"Serialized object matches stored: {dict_kinase == dict_temp}")
    print()

Format: json


Serializing KinaseInfo objects...: 100%|████| 566/566 [00:12<00:00, 46.98it/s]
Deserializing KinaseInfo objects from files...: 100%|█| 566/566 [00:03<00:00, 


Serialized object matches stored: True

Format: toml


Serializing KinaseInfo objects...: 100%|████| 566/566 [00:33<00:00, 17.13it/s]
Deserializing KinaseInfo objects from files...: 100%|█| 566/566 [01:54<00:00, 


Serialized object matches stored: True
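
Because `toml` serialization is not supported on Windows, the list of formats used in the loop above can be chosen at run time. The following is a minimal sketch using only the standard library; the helper name `supported_suffixes` is hypothetical and not part of `mkt.schema`.

```python
import platform


def supported_suffixes(system_name=None):
    """Return the file formats to round-trip on the given operating system.

    `system_name` defaults to the current platform. This helper is
    illustrative only and is not part of mkt.schema.
    """
    if system_name is None:
        system_name = platform.system()
    # yaml is left out because it is slow; toml is skipped on Windows
    suffixes = ["json"]
    if system_name != "Windows":
        suffixes.append("toml")
    return suffixes


print(supported_suffixes())
```

The returned list could then replace the hard-coded `["json", "toml"]` in the loop.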



# Examine `KinaseInfo` object

The `KinaseInfo` object contains the following relevant fields:

| Field        | Description                                                                                                                  |
| :-:          | :-                                                                                                                           |
| `hgnc_name`  | HUGO Gene Nomenclature Committee gene name                                                                                   |
| `uniprot_id` | UniProt ID                                                                                                                   |
| `kinhub`     | Information scraped from [KinHub](http://www.kinhub.org/)                                                                    |
| `uniprot`    | Canonical sequence from UniProt                                                                                              |
| `klifs`      | Information from KLIFS API query, including KLIFS pocket sequence                                                            |
| `pfam`       | Annotated kinase domain from Pfam (includes "Protein kinase domain" and "Protein tyrosine and serine/threonine kinase" only) |
| `kincore`    | Annotated kinase domain and active state structures from Dunbrack lab's [KinCore](http://dunbrack.fccc.edu/kincore/activemodels)                         |

In [4]:
[i for i in dir(dict_kinase["ABL1"]) if not i.startswith("_")]

['KLIFS2UniProtIdx',
 'KLIFS2UniProtSeq',
 'construct',
 'copy',
 'dict',
 'from_orm',
 'hgnc_name',
 'json',
 'kincore',
 'kinhub',
 'klifs',
 'model_computed_fields',
 'model_config',
 'model_construct',
 'model_copy',
 'model_dump',
 'model_dump_json',
 'model_extra',
 'model_fields',
 'model_fields_set',
 'model_json_schema',
 'model_parametrized_name',
 'model_post_init',
 'model_rebuild',
 'model_validate',
 'model_validate_json',
 'model_validate_strings',
 'parse_file',
 'parse_obj',
 'parse_raw',
 'pfam',
 'schema',
 'schema_json',
 'uniprot',
 'uniprot_id',
 'update_forward_refs',
 'validate',
 'validate_klifs2uniprotidx',
 'validate_klifs2uniprotseq']

## Missing fields

In [5]:
n_klifs = len([i.hgnc_name for i in dict_kinase.values() if i.klifs is not None])
print(f"{n_klifs} KLIFS entries\n")

n_pocket = len([i.hgnc_name for i in dict_kinase.values() \
                if i.klifs is not None and i.klifs.pocket_seq is not None])
print(f"{n_pocket} KLIFS pocket sequences\n")

dict_kincore = {k: v for k,v in dict_kinase.items() if v.kincore is not None}
print(f"{len(dict_kincore)} KinCore entries")
n_fasta = len([i.hgnc_name for i in dict_kincore.values() if i.kincore.fasta is not None])
print(f"{n_fasta} KinCore FASTA sequences")
n_cif = len([i.hgnc_name for i in dict_kincore.values() if i.kincore.cif is not None])
print(f"{n_cif} KinCore CIF structures\n")

n_pfam = len([i.hgnc_name for i in dict_kinase.values() if i.pfam is not None])
print(f"{n_pfam} Pfam kinase domain annotations\n")

n_klif2uniprot = len([i.hgnc_name for i in dict_kinase.values() if i.KLIFS2UniProtIdx is not None])
print(f"{n_klif2uniprot} KLIFS pocket to UniProt alignment")

555 KLIFS entries

519 KLIFS pocket sequences

492 KinCore entries
492 KinCore FASTA sequences
437 KinCore CIF structures

490 Pfam kinase domain annotations

519 KLIFS pocket to UniProt alignment
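
The counts above show that some fields are `None` for some kinases, so chained attribute access such as `kincore.cif.cif` can raise an `AttributeError`. Below is a minimal sketch of a guard for this. It uses `SimpleNamespace` stand-ins rather than real `KinaseInfo` objects, and both the helper name `get_nested` and the sample values are hypothetical.

```python
from types import SimpleNamespace


def get_nested(obj, *attrs):
    """Follow a chain of attributes, returning None as soon as one is missing or None."""
    for attr in attrs:
        if obj is None:
            return None
        obj = getattr(obj, attr, None)
    return obj


# Stand-ins for KinaseInfo objects; the values are placeholders, not real data
with_fasta = SimpleNamespace(kincore=SimpleNamespace(fasta="placeholder", cif=None))
without_kincore = SimpleNamespace(kincore=None)

print(get_nested(with_fasta, "kincore", "fasta"))
print(get_nested(with_fasta, "kincore", "cif", "cif"))
print(get_nested(without_kincore, "kincore", "fasta"))
```

With real data, the same call pattern would be `get_nested(dict_kinase[hgnc], "kincore", "cif", "cif")`.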


## Use UniProt ID as key

In [6]:
uniprot_id = "P00519"
dict_inv = {val.uniprot_id: val for val in dict_kinase.values()}
dict_inv[uniprot_id].hgnc_name

'ABL1'
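
A dictionary comprehension keeps only the last value seen for a repeated key, so if two entries ever shared a UniProt ID, one would be dropped silently. The sketch below builds the same kind of inverted index but raises on duplicates. It uses stand-in objects, and the helper name `invert_by` and the second entry are hypothetical.

```python
from types import SimpleNamespace


def invert_by(dict_in, attr):
    """Re-key a dictionary of objects by one of their attributes, refusing duplicates."""
    dict_out = {}
    for obj in dict_in.values():
        key = getattr(obj, attr)
        if key in dict_out:
            raise ValueError(f"Duplicate {attr}: {key}")
        dict_out[key] = obj
    return dict_out


# Stand-ins for KinaseInfo objects; the second entry is a made-up placeholder
dict_demo = {
    "ABL1": SimpleNamespace(hgnc_name="ABL1", uniprot_id="P00519"),
    "FAKE1": SimpleNamespace(hgnc_name="FAKE1", uniprot_id="FAKE_ID"),
}

dict_demo_inv = invert_by(dict_demo, "uniprot_id")
print(dict_demo_inv["P00519"].hgnc_name)
```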

## Contents per field

### API query or scraper fields

In [7]:
hgnc = "ABL1"

print(f"HGNC name: {dict_kinase[hgnc].hgnc_name}\n")
print(f"UniProt ID: {dict_kinase[hgnc].uniprot_id}\n")
print(f"KLIFS object:\n{dict_kinase[hgnc].klifs}\n")
print(f"KinCore sequence:\n{dict_kinase[hgnc].kincore.fasta}\n")
print(f"KinHub object:\n{dict_kinase[hgnc].kinhub}\n")
print(f"UniProt object:\n{dict_kinase[hgnc].uniprot}\n")
print(f"Pfam object:\n{dict_kinase[hgnc].pfam}\n")
print(f"KinCore CIF dictionary:\n{dict_kinase[hgnc].kincore.cif.cif}\n")

HGNC name: ABL1

UniProt ID: P00519

KLIFS object:
gene_name='ABL1' name='ABL1' full_name='ABL proto-oncogene 1, non-receptor tyrosine kinase' group='TK' family='Other' iuphar=1923 kinase_id=392 pocket_seq='HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIITEFMTYGNLLDYLREYLEKKNFIHRDLAARNCLVVADFGLS'

KinCore sequence:
seq='KWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSIS' group='TK' hgnc={'ABL1'} swissprot='ABL1_HUMAN' uniprot='P00519' start_md=242 end_md=495 length_md=254 start_af2=234 end_af2=503 length_af2=270 length_uniprot=1130 source_file='Faezov-Dunbrack_2023' start=234 end=503 mismatch=None

KinHub object:
hgnc_name='ABL1' kinase_name='Tyrosine-protein kinase ABL1' manning_name='ABL' xname='ABL1' group='TK' family='Other'

UniProt object:
header='>sp|P00519|ABL1_HUMA

### KLIFS2UniProt Alignment

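`KLIFS2UniProtSeq` includes the full continuous sequence spanning the KLIFS pocket, including the residues that fall between pocket positions.

These discontinuities are keyed in one of two ways: residues between two regions use a `:` in the key (e.g., `II:III`), and residues within a single region use the `_intra` suffix (e.g., `b.l_intra`).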

In [8]:
dict_inv[uniprot_id].KLIFS2UniProtSeq

{'I': 'HKL',
 'g.l': 'GGGQYG',
 'II': 'EVYE',
 'II:III': 'GVWKKYSLT',
 'III': 'VAVKTL',
 'III:αC': 'KEDTMEVE',
 'αC': 'EFLKEAAVMKE',
 'b.l_1': 'IK',
 'b.l_intra': 'H',
 'b.l_2': 'PNLVQ',
 'IV': 'LLGV',
 'IV:V': 'CTREPPF',
 'V': 'YII',
 'GK': 'T',
 'hinge': 'EFM',
 'hinge:linker': None,
 'linker_1': 'T',
 'linker_intra': None,
 'linker_2': 'YGN',
 'αD': 'LLDYLRE',
 'αD:αE': 'CNRQEVNAVVLLYMATQISSAME',
 'αE': 'YLEKK',
 'αE:VI': None,
 'VI': 'NFI',
 'c.l': 'HRDLAARN',
 'VII': 'CLV',
 'VII:VIII': 'GENHLVK',
 'VIII': 'V',
 'xDFG': 'ADFG',
 'a.l': 'LS'}

Dropping the entries whose keys mark discontinuities (those containing `:` or `_intra`) and joining the remaining values yields an 85-residue sequence that matches the KLIFS binding pocket.

In [9]:
str_dict = "".join([v for k, v in dict_kinase["ADCK1"].KLIFS2UniProtSeq.items() \
                    if v is not None and ":" not in k and "_intra" not in k])

print(str_dict)
print(dict_kinase["ADCK1"].klifs.pocket_seq)

TPLGTASLAQVHKVAVKVQDFLNEGRNAEKVSLKVPRIHWDERVLLMEFGQVNDRDYMEFVNG--FVHCDPHPGNVLVLLDHGLY
TPLGTASLAQVHKVAVKVQDFLNEGRNAEKVSLKVPRIHWDERVLLMEFGQVNDRDYMEFVNG--FVHCDPHPGNVLVLLDHGLY
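
The same filter can be checked against the ABL1 values printed earlier. This sketch copies the ABL1 `KLIFS2UniProtSeq` dictionary and `pocket_seq` from the outputs above into plain Python objects, so it runs without `mkt.schema` installed. The variable names are illustrative.

```python
# ABL1 KLIFS2UniProtSeq, copied from the output shown earlier
abl1_klifs2uniprot_seq = {
    "I": "HKL", "g.l": "GGGQYG", "II": "EVYE", "II:III": "GVWKKYSLT",
    "III": "VAVKTL", "III:αC": "KEDTMEVE", "αC": "EFLKEAAVMKE",
    "b.l_1": "IK", "b.l_intra": "H", "b.l_2": "PNLVQ",
    "IV": "LLGV", "IV:V": "CTREPPF", "V": "YII", "GK": "T",
    "hinge": "EFM", "hinge:linker": None,
    "linker_1": "T", "linker_intra": None, "linker_2": "YGN",
    "αD": "LLDYLRE", "αD:αE": "CNRQEVNAVVLLYMATQISSAME",
    "αE": "YLEKK", "αE:VI": None, "VI": "NFI", "c.l": "HRDLAARN",
    "VII": "CLV", "VII:VIII": "GENHLVK", "VIII": "V",
    "xDFG": "ADFG", "a.l": "LS",
}

# ABL1 KLIFS pocket sequence, copied from the KLIFS object shown earlier
abl1_pocket_seq = (
    "HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGV"
    "YIITEFMTYGNLLDYLREYLEKKNFIHRDLAARNCLVVADFGLS"
)

# Keep only in-pocket segments: skip None values and discontinuity keys
rebuilt = "".join(
    v for k, v in abl1_klifs2uniprot_seq.items()
    if v is not None and ":" not in k and "_intra" not in k
)

print(len(rebuilt))
print(rebuilt == abl1_pocket_seq)
```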


`KLIFS2UniProtIdx` maps each position of the 85-residue KLIFS pocket to its index in the canonical UniProt sequence, without any of the discontinuous regions.
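
One way to use such an index map is to pull the pocket residues back out of a sequence. The sketch below does this for the first 19 ABL1 pocket positions, with the start of the ABL1 KinCore sequence shown earlier standing in for the UniProt sequence. It assumes that the indices are 1-based UniProt positions and that the KinCore fragment begins at UniProt residue 234 (its reported `start`). The ABL1 values in this notebook are consistent with both assumptions, but neither is confirmed here as documented behavior.

```python
# First 40 residues of the ABL1 KinCore sequence; assumed to cover UniProt
# positions 234-273 based on the reported start=234
fragment = "KWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTL"
fragment_start = 234

# First 19 entries of the ABL1 KLIFS2UniProtIdx mapping from this notebook
idx_map = {
    "I:1": 246, "I:2": 247, "I:3": 248,
    "g.l:4": 249, "g.l:5": 250, "g.l:6": 251,
    "g.l:7": 252, "g.l:8": 253, "g.l:9": 254,
    "II:10": 255, "II:11": 256, "II:12": 257, "II:13": 258,
    "III:14": 268, "III:15": 269, "III:16": 270,
    "III:17": 271, "III:18": 272, "III:19": 273,
}

# Convert each assumed 1-based UniProt position to an offset into the fragment
pocket_prefix = "".join(fragment[pos - fragment_start] for pos in idx_map.values())

print(pocket_prefix)  # HKLGGGQYGEVYEVAVKTL
```

The result matches the first 19 residues of the ABL1 `pocket_seq` shown earlier.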

In [10]:
dict_inv[uniprot_id].KLIFS2UniProtIdx

{'I:1': 246,
 'I:2': 247,
 'I:3': 248,
 'g.l:4': 249,
 'g.l:5': 250,
 'g.l:6': 251,
 'g.l:7': 252,
 'g.l:8': 253,
 'g.l:9': 254,
 'II:10': 255,
 'II:11': 256,
 'II:12': 257,
 'II:13': 258,
 'III:14': 268,
 'III:15': 269,
 'III:16': 270,
 'III:17': 271,
 'III:18': 272,
 'III:19': 273,
 'αC:20': 282,
 'αC:21': 283,
 'αC:22': 284,
 'αC:23': 285,
 'αC:24': 286,
 'αC:25': 287,
 'αC:26': 288,
 'αC:27': 289,
 'αC:28': 290,
 'αC:29': 291,
 'αC:30': 292,
 'b.l:31': 293,
 'b.l:32': 294,
 'b.l:33': 296,
 'b.l:34': 297,
 'b.l:35': 298,
 'b.l:36': 299,
 'b.l:37': 300,
 'IV:38': 301,
 'IV:39': 302,
 'IV:40': 303,
 'IV:41': 304,
 'V:42': 312,
 'V:43': 313,
 'V:44': 314,
 'GK:45': 315,
 'hinge:46': 316,
 'hinge:47': 317,
 'hinge:48': 318,
 'linker:49': 319,
 'linker:50': 320,
 'linker:51': 321,
 'linker:52': 322,
 'αD:53': 323,
 'αD:54': 324,
 'αD:55': 325,
 'αD:56': 326,
 'αD:57': 327,
 'αD:58': 328,
 'αD:59': 329,
 'αE:60': 353,
 'αE:61': 354,
 'αE:62': 355,
 'αE:63': 356,
 'αE:64': 357,
 'VI:65': 3