# Database module

The database module lets users query external databases to extract clinically relevant mutational data and protein annotations from the following sources:
+ **cBioPortal:** multi-institutional repository of cancer genomics sequencing data
+ **UniProt:** canonical protein sequence information
+ **Pfam:** protein domain annotations
+ **HGNC:** standardized gene names
+ **KLIFS:** kinase-ligand interaction annotations
+ **KinHub:** curated list of human kinases

## KinHub

In this vignette, we extract the list of human kinases from [KinHub](http://www.kinhub.org/kinases.html). We then use this list to obtain additional protein annotations and to extract the corresponding mutations from a cBioPortal cohort.

In [9]:
from missense_kinase_toolkit.databases import scrapers

In [10]:
# Scrape the KinHub table of human kinases into a pandas DataFrame
df_kinhub = scrapers.kinhub()

In [KinHub](http://www.kinhub.org/kinases.html), 13 kinases possess more than one kinase domain and are therefore listed as separate entries, one per domain, despite having a single UniProt ID. To remedy this, we aggregate entries by HGNC gene name so that each row represents a unique protein rather than a unique kinase domain.

In [11]:
# Show aggregated entries whose Manning Name includes a second kinase domain ("Domain2_...")
df_kinhub.loc[df_kinhub["Manning Name"].apply(lambda x: "DOMAIN2" in x.upper())]

|  | HGNC Name | xName | Manning Name | Kinase Name | Group | Family | SubFamily | UniprotID |
|---|---|---|---|---|---|---|---|---|
| 141 | JAK1 | JAK1_b, JAK1 | Domain2_JAK1, JAK1 | Tyrosine-protein kinase JAK1 | TK | Jak, JakB |  | P23458 |
| 142 | JAK2 | JAK2, JAK2_b | JAK2, Domain2_JAK2 | Tyrosine-protein kinase JAK2 | TK | Jak, JakB |  | O60674 |
| 143 | JAK3 | JAK3, JAK3_b | JAK3, Domain2_JAK3 | Tyrosine-protein kinase JAK3 | TK | Jak, JakB |  | P52333 |
| 191 | RPS6KA5 | MSK1_b, MSK1 | Domain2_MSK1, MSK1 | Ribosomal protein S6 kinase alpha-5 | AGC, CAMK | RSK, RSKb | MSK, MSKb | O75582 |
| 192 | RPS6KA4 | MSK2_b, MSK2 | Domain2_MSK2, MSK2 | Ribosomal protein S6 kinase alpha-4 | AGC, CAMK | RSK, RSKb | MSK, MSKb | O75676 |
| 268 | RPS6KA2 | RSK3_b, RSK3 | Domain2_RSK1, RSK1 | Ribosomal protein S6 kinase alpha-2 | AGC, CAMK | RSK, RSKb | RSKp90, RSKb | Q15349 |
| 269 | RPS6KA3 | RSK2_b, RSK2 | Domain2_RSK2, RSK2 | Ribosomal protein S6 kinase alpha-3 | AGC, CAMK | RSK, RSKb | RSKp90, RSKb | P51812 |
| 270 | RPS6KA1 | RSK1, RSK1_b | Domain2_RSK3, RSK3 | Ribosomal protein S6 kinase alpha-1 | AGC, CAMK | RSK, RSKb | RSKp90, RSKb | Q15418 |
| 305 | TYK2 | TYK2, TYK2_b | TYK2, Domain2_TYK2 | Non-receptor tyrosine-protein kinase TYK2 | TK | Jak, JakB |  | P29597 |
| 395 | EIF2AK4 | GCN2_b, GCN2 | Domain2_GCN2, GCN2 | Eukaryotic translation initiation factor 2-alp... | Other, STE | PEK, STE-Unique | GCN2, nan | Q9P2K8 |
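
The aggregation by HGNC gene name described above can be sketched with a pandas `groupby`. This is a minimal illustration on a toy table, not the toolkit's actual implementation: the `df_raw` rows and the `join_unique` helper are hypothetical, and only the column names are taken from the output shown here.

```python
import pandas as pd

# Toy stand-in for a raw, per-domain kinase table (hypothetical rows).
df_raw = pd.DataFrame(
    {
        "HGNC Name": ["JAK1", "JAK1", "ABL1"],
        "Manning Name": ["JAK1", "Domain2_JAK1", "ABL"],
        "Family": ["Jak", "JakB", "Abl"],
        "UniprotID": ["P23458", "P23458", "P00519"],
    }
)


def join_unique(values: pd.Series) -> str:
    """Join the distinct values of a column, preserving first-seen order."""
    return ", ".join(dict.fromkeys(str(v) for v in values))


# Collapse rows that share a gene name into a single row per protein.
df_agg = df_raw.groupby("HGNC Name", as_index=False, sort=False).agg(join_unique)
print(df_agg)
```

Values that differ between domains (such as `Manning Name` or `Family`) end up as comma-separated strings, while values shared by both domains (such as `UniprotID`) appear only once.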


## Pfam

In [12]:
from missense_kinase_toolkit.databases import pfam

In [13]:
# Retrieve Pfam domain annotations for UniProt ID P00519 (human ABL1)
pfam.Pfam("P00519")._pfam

|  | uniprot | protein_length | source_database | organism | in_alphafold | pfam_accession | name | source_database.1 | type | integrated | member_databases | go_terms | start | end | dc-status | representative | representative.1 | model | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p00519 | 1130 | reviewed | 9606 | True | PF00017 | SH2 domain | pfam | domain | IPR000980 |  |  | 127 | 202 | CONTINUOUS | False | False | PF00017 | 9.3e-21 |
| 1 | p00519 | 1130 | reviewed | 9606 | True | PF00018 | SH3 domain | pfam | domain | IPR001452 |  |  | 67 | 113 | CONTINUOUS | False | False | PF00018 | 2e-09 |
| 2 | p00519 | 1130 | reviewed | 9606 | True | PF07714 | Protein tyrosine and serine/threonine kinase | pfam | domain | IPR001245 |  |  | 242 | 492 | CONTINUOUS | False | False | PF07714 | 1.4e-97 |
| 3 | p00519 | 1130 | reviewed | 9606 | True | PF08919 | F-actin binding | pfam | domain | IPR015015 |  |  | 1026 | 1130 | CONTINUOUS | False | False | PF08919 | 3.5000000000000004e-28 |
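
As a sketch of how these domain annotations might be combined with mutation data, the example below looks up which Pfam domain, if any, contains a given residue position. The `domains_at` helper is hypothetical and not part of the toolkit; the toy `df_pfam` table reuses only the `name`, `start`, and `end` values from the output for P00519.

```python
import pandas as pd

# Domain coordinates copied from the Pfam output for P00519.
df_pfam = pd.DataFrame(
    {
        "name": [
            "SH2 domain",
            "SH3 domain",
            "Protein tyrosine and serine/threonine kinase",
            "F-actin binding",
        ],
        "start": [127, 67, 242, 1026],
        "end": [202, 113, 492, 1130],
    }
)


def domains_at(df: pd.DataFrame, position: int) -> list:
    """Return the names of all domains whose [start, end] range contains position."""
    mask = (df["start"] <= position) & (position <= df["end"])
    return df.loc[mask, "name"].tolist()


print(domains_at(df_pfam, 315))  # residue inside the kinase domain
print(domains_at(df_pfam, 220))  # residue outside every listed domain
```

Treating `start` and `end` as inclusive 1-based residue coordinates is an assumption here; it is worth confirming against the Pfam/InterPro documentation before relying on boundary positions.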


## UniProt

In [14]:
from missense_kinase_toolkit.databases import uniprot