# Creating Datasets from the PDB

Graphein provides a utility for curating and splitting datasets from the [RCSB PDB](https://www.rcsb.org/).


Initialising a `PDBManager` downloads PDB metadata, which we can use to make complex selections of protein structures.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb)

In [1]:
!pip install graphein
from rich import inspect
from graphein.ml.datasets import PDBManager

manager = PDBManager(root_dir=".")
inspect(manager, methods=True)

{'1bos': False, '1vvj': False, '1vy4': False, '1vy5': False, '1vy6': False, '1vy7': False, '2btj': False, '2vvj': False, '3j3q': False, '3j3y': False, '3j6b': False, '3j6x': False, '3j6y': False, '3j77': False, '3j78': False, '3j79': False, '3j7o': False, '3j7p': False, '3j7q': False, '3j7r': False, '3j8h': False, '3j92': False, '3j9g': False, '3j9k': False, '3j9l': False, '3j9m': False, '3j9r': False, '3j9w': False, '3j9y': False, '3j9z': False, '3ja1': False, '3jag': False, '3jah': False, '3jai': False, '3jaj': False, '3jan': False, '3jbn': False, '3jbo': False, '3jbp': False, '3jbu': False, '3jbv': False, '3jc1': False, '3jc8': False, '3jc9': False, '3jcd': False, '3jce': False, '3jcj': False, '3jcn': False, '3jco': False, '3jcp': False, '3jcs': False, '3jct': False, '3k1q': False, '3whe': False, '4abz': False, '4bp7': False, '4bts': False, '4ctf': False, '4ctg': False, '4d5y': False, '4d67': False, '4dcb': False, '4fqr': False, '4frt': False, '4l47': False, '4l71': False, '4lel': F

The manager wraps two dataframes:

* `manager.df` - the working selection.
* `manager.source` - a clean copy of the original metadata.

In [2]:
manager.df

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
0,100d_A,100d,A,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,2,[SPM],,1.90,1994-12-05,diffraction,True
1,100d_B,100d,B,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,2,[SPM],,1.90,1994-12-05,diffraction,True
2,101d_A,101d,A,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,2,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
3,101d_B,101d,B,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,2,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
4,101m_A,101m,A,154,protein,MYOGLOBIN,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,,1,"[HEM, NBN, SO4]",Physeter catodon,2.07,1997-12-13,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
780157,9xia_A,9xia,A,388,protein,XYLOSE ISOMERASE,MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...,,1,"[DFR, MN]",Streptomyces rubiginosus,1.90,1990-10-11,diffraction,True
780158,9xim_A,9xim,A,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780159,9xim_B,9xim,B,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780160,9xim_C,9xim,C,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True


In [3]:
manager.source

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
0,100d_A,100d,A,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,2,[SPM],,1.90,1994-12-05,diffraction,True
1,100d_B,100d,B,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,2,[SPM],,1.90,1994-12-05,diffraction,True
2,101d_A,101d,A,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,2,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
3,101d_B,101d,B,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,2,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
4,101m_A,101m,A,154,protein,MYOGLOBIN,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,,1,"[HEM, NBN, SO4]",Physeter catodon,2.07,1997-12-13,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
780157,9xia_A,9xia,A,388,protein,XYLOSE ISOMERASE,MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...,,1,"[DFR, MN]",Streptomyces rubiginosus,1.90,1990-10-11,diffraction,True
780158,9xim_A,9xim,A,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780159,9xim_B,9xim,B,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780160,9xim_C,9xim,C,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True


## Selection Properties

In [4]:
print("Num chains: ", manager.get_num_chains())
print("Num unique pdbs: ", manager.get_num_unique_pdbs())
print("Longest chain: ", manager.get_longest_chain())
print("Shortest chain: ", manager.get_shortest_chain())
print("Best Resolution: ", manager.get_best_resolution())
print("Worst Resolution: ", manager.get_worst_resolution())
print("Experiment Types: ", manager.get_experiment_types())
print("Molecule Types: ", manager.get_molecule_types())

Num chains:  780162
Num unique pdbs:  202738
Longest chain:  16181
Shortest chain:  1
Best Resolution:  -1.0
Worst Resolution:  70.0
Experiment Types:  ['diffraction' 'NMR' 'other' 'EM']
Molecule Types:  ['na' 'protein']


# Making Selections

Selection functions return a `pd.DataFrame`. Each one accepts an `update: bool` argument that controls whether `manager.df` is updated in place:

In [5]:
print("Number of chains: ", len(manager.df))

print("Hi res structure: ", len(manager.resolution_better_than_or_equal_to(2.0)))
print(f"Not modified: {len(manager.df)}")

# Update in place
manager.resolution_better_than_or_equal_to(2.0, update=True)
print(f"Modified inplace: {len(manager.df)}")

Number of chains:  780162
Hi res structure:  202476
Not modified: 780162
Modified inplace: 202476


If you want to reset the selection:

In [6]:
manager.reset()
print(len(manager.df))

780162
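
The relationship between `manager.source`, `manager.df`, the `update` flag and `reset()` can be illustrated with a small stand-in. This is only a sketch of the pattern, not Graphein's implementation: the class `ToySelectionManager` is hypothetical, and it stores rows as plain dictionaries rather than a DataFrame so that it runs without any dependencies.

```python
from copy import deepcopy


class ToySelectionManager:
    """Hypothetical stand-in for the source / working-copy pattern (not Graphein code)."""

    def __init__(self, rows):
        self.source = deepcopy(rows)  # pristine copy, never filtered
        self.df = deepcopy(rows)      # working selection

    def resolution_better_than_or_equal_to(self, threshold, update=False):
        # Lower resolution values mean better-resolved structures.
        selected = [row for row in self.df if row["resolution"] <= threshold]
        if update:
            self.df = selected  # narrow the working selection
        return selected

    def reset(self):
        # Restore the working selection from the pristine copy.
        self.df = deepcopy(self.source)


rows = [
    {"id": "100d_A", "resolution": 1.90},
    {"id": "101d_A", "resolution": 2.25},
    {"id": "101m_A", "resolution": 2.07},
]
toy = ToySelectionManager(rows)

# Without update=True the working selection is left untouched.
preview = toy.resolution_better_than_or_equal_to(2.0)
size_after_preview = len(toy.df)

# With update=True the working selection shrinks.
toy.resolution_better_than_or_equal_to(2.0, update=True)
size_after_update = len(toy.df)

# reset() brings back every row from the source copy.
toy.reset()
size_after_reset = len(toy.df)

print(len(preview), size_after_preview, size_after_update, size_after_reset)
```

Keeping an untouched source copy is what makes `reset()` cheap: no metadata has to be downloaded or parsed again.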


Here is an example selection:

In [7]:
manager.length_shorter_than(6, update=True)
manager.length_longer_than(4, update=True)
manager.molecule_type("protein", update=True)
manager.resolution_better_than_or_equal_to(1.5, update=True)
manager.experiment_type("diffraction", update=True)
manager.remove_non_standard_alphabet_sequences(update=True)
manager.df

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
19383,1hhz_D,1hhz,D,5,protein,CELL WALL PEPTIDE,AEKAA,,6,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction,True
19384,1hhz_E,1hhz,E,5,protein,CELL WALL PEPTIDE,AEKAA,,6,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction,True
19385,1hhz_F,1hhz,F,5,protein,CELL WALL PEPTIDE,AEKAA,,6,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction,True
50598,1sha_B,1sha,B,5,protein,PHOSPHOPEPTIDE A,YVPML,,2,[PTR],Rous sarcoma virus,1.50,1992-08-18,diffraction,True
50810,1skg_B,1skg,B,5,protein,VAFRS,VAFRS,,2,"[MOH, SO4]",Daboia russellii pulchella; SYNTHETIC CONSTRUCT,1.21,2004-03-04,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
653289,7kpu_B,7kpu,B,5,protein,bisubstrate analogue (CMC-ACE-SER-GLY-ARG-GLY-...,SGRGK,,4,"[ACE, BTB, GOL, NH2, SO4, WZG]",Homo sapiens; SYNTHETIC CONSTRUCT,1.43,2020-11-12,diffraction,True
682420,7oju_H,7oju,H,5,protein,MVNAL Peptide,MVNAL,,2,"[CMC, GOL, P6G]",Chaetomium thermophilum (strain DSM 1495 / CBS...,1.10,2021-05-17,diffraction,True
694077,7pul_P,7pul,P,5,protein,GLY-ALA-GLY-ALA-ALA,GAGAA,,2,"[CA, MG]",Enterococcus faecalis,1.40,2021-09-30,diffraction,True
745627,7x70_B,7x70,B,5,protein,peptide,AVKLQ,,2,[],Homo sapiens; SYNTHETIC CONSTRUCT,1.25,2022-03-08,diffraction,True


## Creating train/val/test splits

Splits can be created using a few different strategies:

* temporal
* sequence similarity (using MMSeqs2)
* structural similarity (coming soon)


Multiple types of splitting operations can be composed. For example, one can perform a time-based split after clustering sequences into train/val/test splits without corrupting the sequence-based splits.

### Temporal splits

We can create train, val and test splits based on deposition time:

In [8]:
import numpy as np

splits = ["train", "val", "test"]

pdb_manager = PDBManager(
    root_dir=".",
    splits=splits,
    split_ratios=[0.8, 0.1, 0.1],
    split_time_frames=[
        np.datetime64("2020-01-01"),
        np.datetime64("2021-01-01"),
        np.datetime64("2023-03-01"),
    ],
)
pdb_manager.split_by_deposition_date(df=pdb_manager.df, update=True)
pdb_manager.df_splits["train"]


Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
0,100d_A,100d,A,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,2,[SPM],,1.90,1994-12-05,diffraction,True
1,100d_B,100d,B,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,2,[SPM],,1.90,1994-12-05,diffraction,True
2,101d_A,101d,A,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,2,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
3,101d_B,101d,B,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,2,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
4,101m_A,101m,A,154,protein,MYOGLOBIN,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,,1,"[HEM, NBN, SO4]",Physeter catodon,2.07,1997-12-13,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
780157,9xia_A,9xia,A,388,protein,XYLOSE ISOMERASE,MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...,,1,"[DFR, MN]",Streptomyces rubiginosus,1.90,1990-10-11,diffraction,True
780158,9xim_A,9xim,A,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780159,9xim_B,9xim,B,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780160,9xim_C,9xim,C,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,4,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True


In [9]:
pdb_manager.df_splits["val"]

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
408858,5qub_A,5qub,A,229,protein,RadA,MATIGRISTGSKSLDKLLGGGIETQAITEVFGEFGSGKTQLAHTLA...,,1,"[D48, PO4]",Pyrococcus furiosus,1.35,2020-01-27,diffraction,True
408859,5quc_A,5quc,A,229,protein,RadA,MATIGRISTGSKSLDKLLGGGIETQAITEVFGEFGSGKTQLAHTLA...,,1,"[D48, PO4]",Pyrococcus furiosus,1.43,2020-01-27,diffraction,True
408860,5qud_A,5qud,A,229,protein,RadA,MATIGRISTGSKSLDKLLGGGIETQAITEVFGEFGSGKTQLAHTLA...,,1,"[D48, PO4]",Pyrococcus furiosus,1.30,2020-01-27,diffraction,True
408861,5que_A,5que,A,229,protein,RadA,MATIGRISTGSKSLDKLLGGGIETQAITEVFGEFGSGKTQLAHTLA...,,1,"[D48, PO4]",Pyrococcus furiosus,1.30,2020-01-27,diffraction,True
408862,5quf_A,5quf,A,229,protein,RadA,MATIGRISTGSKSLDKLLGGGIETQAITEVFGEFGSGKTQLAHTLA...,,1,"[D48, PO4]",Pyrococcus furiosus,1.35,2020-01-27,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
656479,7l8t_B,7l8t,B,153,protein,BG505 SOSIP.v5.2 N241/N289 - gp41,AVGIGAVFLGFLGAAGSTMGAASMTLTVQARNLLSGIVQQQSNLLR...,,8,"[BMA, MAN, NAG]",Human immunodeficiency virus 1; Macaca mulatta,3.70,2020-12-31,EM,True
656480,7l8t_D,7l8t,D,153,protein,BG505 SOSIP.v5.2 N241/N289 - gp41,AVGIGAVFLGFLGAAGSTMGAASMTLTVQARNLLSGIVQQQSNLLR...,,8,"[BMA, MAN, NAG]",Human immunodeficiency virus 1; Macaca mulatta,3.70,2020-12-31,EM,True
656481,7l8t_F,7l8t,F,153,protein,BG505 SOSIP.v5.2 N241/N289 - gp41,AVGIGAVFLGFLGAAGSTMGAASMTLTVQARNLLSGIVQQQSNLLR...,,8,"[BMA, MAN, NAG]",Human immunodeficiency virus 1; Macaca mulatta,3.70,2020-12-31,EM,True
656482,7l8t_H,7l8t,H,116,protein,Rh.33311 pAbC-1 - Heavy Chain,XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...,,8,"[BMA, MAN, NAG]",Human immunodeficiency virus 1; Macaca mulatta,3.70,2020-12-31,EM,True


In [10]:
pdb_manager.df_splits["test"]

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
340899,5bkf_A,5bkf,A,364,protein,Glycine receptor subunit alpha-2,KDHDSRSGKQPSQTLSPSDFLDKLMGRTSGYDARIRPNFKGPPVNV...,,5,[NAG],Aequorea victoria; Homo sapiens,3.60,2021-03-19,EM,True
340900,5bkf_B,5bkf,B,364,protein,Glycine receptor subunit alpha-2,KDHDSRSGKQPSQTLSPSDFLDKLMGRTSGYDARIRPNFKGPPVNV...,,5,[NAG],Aequorea victoria; Homo sapiens,3.60,2021-03-19,EM,True
340901,5bkf_C,5bkf,C,364,protein,Glycine receptor subunit alpha-2,KDHDSRSGKQPSQTLSPSDFLDKLMGRTSGYDARIRPNFKGPPVNV...,,5,[NAG],Aequorea victoria; Homo sapiens,3.60,2021-03-19,EM,True
340902,5bkf_D,5bkf,D,364,protein,Glycine receptor subunit alpha-2,KDHDSRSGKQPSQTLSPSDFLDKLMGRTSGYDARIRPNFKGPPVNV...,,5,[NAG],Aequorea victoria; Homo sapiens,3.60,2021-03-19,EM,True
340903,5bkf_E,5bkf,E,702,protein,"Glycine receptor subunit beta,Green fluorescen...",GVAMPGAEDDVVAALEVLFQGPKSSKKGKGKKKQYLCPSQQSAEDL...,,5,[NAG],Aequorea victoria; Homo sapiens,3.60,2021-03-19,EM,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
779966,8ika_Au,8ika,Au,265,protein,Type 1 encapsulin shell protein,MNNLYRDLAPVTEAAWAEIELEAARTFKRHIAGRRVVDVSDPGGPV...,,60,[],Mycobacterium tuberculosis (strain ATCC 25618 ...,2.75,2023-02-28,EM,False
779967,8ika_Av,8ika,Av,265,protein,Type 1 encapsulin shell protein,MNNLYRDLAPVTEAAWAEIELEAARTFKRHIAGRRVVDVSDPGGPV...,,60,[],Mycobacterium tuberculosis (strain ATCC 25618 ...,2.75,2023-02-28,EM,False
779968,8ika_Aw,8ika,Aw,265,protein,Type 1 encapsulin shell protein,MNNLYRDLAPVTEAAWAEIELEAARTFKRHIAGRRVVDVSDPGGPV...,,60,[],Mycobacterium tuberculosis (strain ATCC 25618 ...,2.75,2023-02-28,EM,False
779969,8ika_Ax,8ika,Ax,265,protein,Type 1 encapsulin shell protein,MNNLYRDLAPVTEAAWAEIELEAARTFKRHIAGRRVVDVSDPGGPV...,,60,[],Mycobacterium tuberculosis (strain ATCC 25618 ...,2.75,2023-02-28,EM,False
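
The tables above are consistent with each entry of `split_time_frames` acting as an upper bound for its split: training chains were deposited up to 2020-01-01, validation chains during 2020, and test chains between 2021 and the end of February 2023. The following is a minimal sketch of such an assignment rule, not Graphein's implementation. The helper `assign_split` is hypothetical, and treating each cutoff as inclusive is an assumption.

```python
from bisect import bisect_left
from datetime import date

splits = ["train", "val", "test"]
# Upper bound for each split, mirroring split_time_frames above.
cutoffs = [date(2020, 1, 1), date(2021, 1, 1), date(2023, 3, 1)]


def assign_split(deposition_date):
    """Return the first split whose cutoff is on or after the date, else None."""
    index = bisect_left(cutoffs, deposition_date)
    return splits[index] if index < len(splits) else None


# Deposition dates taken from the tables above.
examples = {
    "100d_A": date(1994, 12, 5),
    "5qub_A": date(2020, 1, 27),
    "5bkf_A": date(2021, 3, 19),
}
assigned = {chain: assign_split(day) for chain, day in examples.items()}
print(assigned)
```

A date later than the last cutoff maps to `None` in this sketch, so such a chain would belong to no split.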


We can also perform additional structural filters:

In [11]:
pdb_manager.molecule_type(type="protein", update=True)
pdb_manager.experiment_type(type="diffraction", update=True)
pdb_manager.resolution_better_than_or_equal_to(2.0, update=True)
pdb_manager.length_longer_than(40, update=True)
pdb_manager.length_shorter_than(401, update=True)
pdb_manager.has_ligands(["SO4"], update=True) # Must contain an SO4 ligand
pdb_manager.oligomeric(1, update=True) # Get only monomers
print("Number of chains: ", len(pdb_manager.df))

Number of chains:  5547


We also want to restrict ourselves to proteins that contain only standard residues and are available to download in PDB format:

In [12]:
pdb_manager.remove_non_standard_alphabet_sequences(update=True)
pdb_manager.remove_unavailable_pdbs(update=True)
print("Number of chains: ", len(pdb_manager.df))

Number of chains:  5461
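
To make the idea of a standard-alphabet filter concrete, here is a small sketch. It assumes that "standard" means the 20 canonical amino-acid one-letter codes. The criteria Graphein actually applies may differ, and the helper `is_standard_sequence` is hypothetical.

```python
# Assumption: the standard alphabet is the 20 canonical amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_standard_sequence(sequence):
    """True if the sequence is non-empty and uses only canonical residue letters."""
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS


# 'AEKAA' appears in the selection above; a run of 'X' marks unknown residues.
checks = {seq: is_standard_sequence(seq) for seq in ["AEKAA", "XXXXXXXX", "MVNAL"]}
print(checks)
```

Under this assumption, a chain such as `7l8t_H` in the validation table above, whose sequence consists of `X` placeholders, would be filtered out.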


### Sequence similarity splits

We can cluster sequences by sequence similarity using [MMSeqs2](https://github.com/soedinglab/MMseqs2):

MMSeqs2 can be installed with `conda install -c conda-forge -c bioconda mmseqs2`.

In [13]:
pdb_manager.cluster(update=True, overwrite=True)
print("Number of chains: ", len(pdb_manager.df))

easy-cluster pdb.fasta pdb_cluster tmp --min-seq-id 0.3 -c 0.8 --cov-mode 1 

MMseqs Version:                     	14.7e284
Substitution matrix                 	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix            	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                         	4
k-mer length                        	0
k-score                             	seq:2147483647,prof:2147483647
Alphabet size                       	aa:21,nucl:5
Max sequence length                 	65535
Max results per query               	20
Split database                      	0
Split mode                          	2
Split memory limit                  	0
Coverage threshold                  	0.8
Coverage mode                       	1
Compositional bias                  	1
Compositional bias                  	1
Diagonal scoring                    	true
Exact k-mer matching                	0
Mask residues                       	1
Mask residues probability           	0.9
Mask lower case r

Number of chains:  1346
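
MMseqs2's `easy-cluster` reports cluster membership as a two-column, tab-separated table in which each row pairs a cluster representative with one of its members. The sketch below groups such a table by representative. The identifiers are made up for illustration, and how Graphein itself consumes the MMseqs2 output is not shown here. The drop from 5461 to 1346 chains is consistent with keeping one chain per cluster, but that reading is an assumption.

```python
import csv
import io
from collections import defaultdict

# Made-up excerpt in the (representative, member) layout.
cluster_tsv = (
    "chain_1\tchain_1\n"
    "chain_1\tchain_2\n"
    "chain_3\tchain_3\n"
    "chain_1\tchain_4\n"
)

clusters = defaultdict(list)
for representative, member in csv.reader(io.StringIO(cluster_tsv), delimiter="\t"):
    clusters[representative].append(member)

# One entry per cluster: the representative sequence.
representatives = sorted(clusters)
print(representatives)
print(dict(clusters))
```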


In [16]:
splits = pdb_manager.split_clusters(pdb_manager.df)

In [18]:
splits["train"]

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
279258,4pqa_A,4pqa,A,381,protein,Succinyl-diaminopimelate desuccinylase,MTETQSLELAKELISRPSVTPDDRDCQKLLAERLHKIGFAAEELHF...,,1,"[SO4, X8Z, ZN]",Neisseria meningitidis,1.780,2014-03-01,diffraction,True
486946,6hpf_A,6hpf,A,312,protein,endo-b-mannanase,APSTTPVNEKATDAAKNLLSYLVEQAANGVTLSGQQDLESAQWVSD...,,1,"[ACY, BMA, CL, GLA, MAN, NAG, SO4]",Yunnania penicillata,1.360,2018-09-20,diffraction,True
93196,2hm7_A,2hm7,A,310,protein,Carboxylesterase,MPLDPVIQQVLDQLNRMPAPDYKHLSAQQFRSQQSLFPPVKKEPVA...,,1,[SO4],Alicyclobacillus acidocaldarius,2.000,2006-07-11,diffraction,True
209977,3u62_A,3u62,A,253,protein,Shikimate dehydrogenase,MKFCIIGYPVRHSISPRLYNEYFKRAGMNHSYGMEEIPPESFDTEI...,,1,[SO4],Thermotoga maritima,1.450,2011-10-12,diffraction,True
75453,2b1y_A,2b1y,A,104,protein,hypothetical protein Atu1913,MARPNFRYTHYDLKELRAGTTLEISLSSVNNVRLMTGANFQRFTEL...,,1,[SO4],Agrobacterium tumefaciens str.,1.800,2005-09-16,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
510388,6mgc_A,6mgc,A,361,protein,Capsule polysaccharide export protein KpsC,MGIGIYSPGIWRIPHLEKFLAQPCQKLSLLRPVPQEVNAIAVWGHR...,,1,"[C5P, CL, SO4]",Escherichia coli O1:K1 / APEC,1.350,2018-09-13,diffraction,True
772192,8dv0_A,8dv0,A,199,protein,Dephospho-CoA kinase,MAHHHHHHMLAIGITGSYASGKTFILDYLAEKGYKTFCADRCIKEL...,,1,"[EDO, SO4]",Rickettsia felis URRWXCal2,1.400,2022-07-27,diffraction,True
471269,6evl_A,6evl,A,102,protein,Prolyl 4-hydroxylase subunit alpha-2,MHHHHHHMLSVDDCFGMGRSAYNEGDYYHTVLWMEQVLKQLDAGEE...,,1,"[DMS, SO4]",Homo sapiens,1.870,2017-11-02,diffraction,True
151770,3exq_A,3exq,A,161,protein,NUDIX family hydrolase,MSLTRTQPVELVTMVMVTDPETQRVLVEDKVNVPWKAGHSFPGGHV...,,1,[SO4],Lactobacillus brevis ATCC 367,2.000,2008-10-16,diffraction,True


In [19]:
splits["val"]

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
131877,2z0z_A,2z0z,A,194,protein,Putative uncharacterized protein TTHA1799,MWAFPERFEGRHVRLEPLALAHLPAFLRHYDPEVYRFLSRAPVAPT...,,1,[SO4],Thermus thermophilus,2.000,2007-05-07,diffraction,True
43204,1py0_A,1py0,A,125,protein,Pseudoazurin,ASENIEVHMLNKGAEGAMVFEPAYIKANPGDTVTFIPVDKGHNVES...,,1,"[SO4, Y1, YMA, ZN]",Alcaligenes faecalis,2.000,2003-07-07,diffraction,True
4961,1bx0_A,1bx0,A,314,protein,PROTEIN (FERREDOXIN:NADP+ OXIDOREDUCTASE),QIASDVEAPPPAPAKVEKHSKKMEEGITVNKFKPKTPYVGRCLLNT...,,1,"[FAD, PO4, SO4]",Spinacia oleracea,1.900,1998-10-10,diffraction,True
436695,5xj5_A,5xj5,A,201,protein,Glycerol-3-phosphate acyltransferase,MGSALFLVIFAYLLGSITFGEVIAKLKGVDLRNVGSGNVGATNVTR...,,1,"[78M, FME, PGE, SO4]",Aquifex aeolicus,1.481,2017-04-30,diffraction,True
159885,3hbm_A,3hbm,A,282,protein,UDP-sugar hydrolase,MKVLFRSDSSSQIGFGHIKRDLVLAKQYSDVSFACLPLEGSLIDEI...,,1,[SO4],Campylobacter jejuni subsp. jejuni,1.800,2009-05-04,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
422066,5uqj_A,5uqj,A,221,protein,U6 snRNA phosphodiesterase,GMSRFWRSFTYFEWRPTPAIHRQLQKIICKYKETFMKQEYTNPYQL...,,1,"[ACT, GOL, SO4]",Saccharomyces cerevisiae,1.800,2017-02-08,diffraction,True
158414,3gy9_A,3gy9,A,150,protein,GCN5-related N-acetyltransferase,GMDVTIERVNDFDGYNWLPLLAKSSQEGFQLVERMLRNRREESFQE...,,1,"[COA, GOL, SO4]",Exiguobacterium sibiricum 255-15,1.520,2009-04-03,diffraction,True
540484,6qu6_A,6qu6,A,212,protein,Fiber,MRGSHHHHHHGSGDLVAWNKKDDRRTLWTTPDTSPNCKMSTEKDSK...,,1,"[EDO, PEG, SIA, SLB, SO4]",Human adenovirus 26,1.030,2019-02-26,diffraction,True
231831,4cg3_A,4cg3,A,313,protein,CUTINASE,MKYLLPTAAAGLLLLAAQPAMAMDIGINSDPANPYERGPNPTDALL...,,1,[SO4],THERMOBIFIDA FUSCA,1.550,2013-11-20,diffraction,True


In [20]:
splits["test"]

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,n_chains,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
137068,3al2_A,3al2,A,235,protein,DNA topoisomerase 2-binding protein 1,GPLGSLKKQYIFQLSSLNPQERIDYCHLIEKLGGLVIEKQCFDPTC...,,1,[SO4],Homo sapiens,2.000,2010-07-22,diffraction,True
25399,1jov_A,1jov,A,270,protein,HI1317,MKTTLLKTLTPELHLVQHNDIPVLHLKHAVGTAKISLQGAQLISWK...,,1,"[SO4, TRS]",Haemophilus influenzae,1.570,2001-07-31,diffraction,True
381776,5kta_A,5kta,A,188,protein,FdhC,MVNFNLKANTTYLRLVEENDAEFICTLRNNDKLNTYISKSTGDIKS...,,1,"[CSX, DMS, SO4]",Acinetobacter nosocomialis,1.890,2016-07-11,diffraction,True
165255,3iwl_A,3iwl,A,68,protein,Copper transport protein ATOX1,MPKHEFSVDMTCGGCAEAVSRVLNKLGGVKYDIDLPNKKVCIESEH...,,1,"[PT, SO4, TCE]",Homo sapiens,1.600,2009-09-02,diffraction,True
228154,4bi7_A,4bi7,A,257,protein,TRIOSEPHOSPHATE ISOMERASE,MPARRPFIGGNFKCNGSLDFIKSHVAAIAAHKIPDSVDVVIAPSAV...,,1,"[PGA, SO4]",GIARDIA INTESTINALIS,1.600,2013-04-09,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
454968,6btd_A,6btd,A,222,protein,Fuculose phosphate aldolase,MLLQKEREEIVAYGKKMISSGLTKGTGGNISIFNREQGLVAISPSG...,,1,"[MN, SO4]",Bacillus thuringiensis,1.551,2017-12-06,diffraction,True
479055,6gg1_A,6gg1,A,154,protein,Interleukin-24,AFHFGPCRVEGVVPQELWEAFWAVRDTLQAQDNITDVRLLRAEVLQ...,,1,"[NI, SO4]",Homo sapiens,1.300,2018-05-02,diffraction,True
696311,7q68_A,7q68,A,172,protein,Thioredoxin domain-containing protein,MAPLQPGDSFPANVVFSYIPPTGSLDLTVCGRPIEYNASEALAKGT...,,1,"[GOL, SO4]",Chaetomium thermophilum var. thermophilum DSM ...,1.750,2021-11-05,diffraction,True
269201,4n02_A,4n02,A,357,protein,Isopentenyl-diphosphate delta-isomerase,MRGSHHHHHHGSGSGSGIEGRITTNRKDEHILYALEQKSSYNSFDE...,,1,"[FNR, SO4]",Streptococcus pneumoniae,1.400,2013-09-30,diffraction,True


Finally, we can download and export our splits:

In [21]:
pdb_manager.download_pdbs("./pdb", splits=splits)
pdb_manager.export_pdbs(
    pdb_dir="./pdb", splits=splits, max_num_chains_per_pdb_code=3
)

  0%|          | 0/994 [00:00<?, ?it/s]

# I/O

We can write our selections to CSV or FASTA files, or download the relevant PDB files in our selection to disk:

## CSV

In [15]:
import os
import pandas as pd

os.makedirs("tmp/", exist_ok=True)
# Write selection to disk
manager.to_csv("tmp/test.csv")

# Read selection from disk
sel = pd.read_csv("tmp/test.csv")
sel

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
0,1hhz_D,1hhz,D,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction,True
1,1hhz_E,1hhz,E,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction,True
2,1hhz_F,1hhz,F,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction,True
3,1sha_B,1sha,B,5,protein,PHOSPHOPEPTIDE A,YVPML,,['PTR'],Rous sarcoma virus,1.50,1992-08-18,diffraction,True
4,1skg_B,1skg,B,5,protein,VAFRS,VAFRS,,"['MOH', 'SO4']",Daboia russellii pulchella; SYNTHETIC CONSTRUCT,1.21,2004-03-04,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,7kpu_B,7kpu,B,5,protein,bisubstrate analogue (CMC-ACE-SER-GLY-ARG-GLY-...,SGRGK,,"['ACE', 'BTB', 'GOL', 'NH2', 'SO4', 'WZG']",Homo sapiens; SYNTHETIC CONSTRUCT,1.43,2020-11-12,diffraction,True
76,7oju_H,7oju,H,5,protein,MVNAL Peptide,MVNAL,,"['CMC', 'GOL', 'P6G']",Chaetomium thermophilum (strain DSM 1495 / CBS...,1.10,2021-05-17,diffraction,True
77,7pul_P,7pul,P,5,protein,GLY-ALA-GLY-ALA-ALA,GAGAA,,"['CA', 'MG']",Enterococcus faecalis,1.40,2021-09-30,diffraction,True
78,7x70_B,7x70,B,5,protein,peptide,AVKLQ,,[],Homo sapiens; SYNTHETIC CONSTRUCT,1.25,2022-03-08,diffraction,True
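
Note that in the table above the `ligands` column comes back from the CSV as text such as `"['MOH', 'SO4']"` rather than as a Python list. If you need lists again, one option is to parse each value with `ast.literal_eval`. The sketch below does this on a small in-memory excerpt using only the standard library; the same function could be applied to the column after `pd.read_csv`.

```python
import ast
import csv
import io

# Small excerpt shaped like the CSV written above (two columns only).
csv_text = """id,ligands
1sha_B,['PTR']
1skg_B,"['MOH', 'SO4']"
7x70_B,[]
"""

parsed_rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in parsed_rows:
    # literal_eval safely turns the list-like string back into a list.
    row["ligands"] = ast.literal_eval(row["ligands"])

print(parsed_rows)
```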


## FASTA

In [17]:
from graphein.protein.utils import read_fasta
# Write selection to a fasta file
manager.to_fasta("tmp/test.fasta")

# Load selection from a fasta file
fs = read_fasta("tmp/test.fasta")
print(fs)

{'1hhz_D': 'AEKAA', '1hhz_E': 'AEKAA', '1hhz_F': 'AEKAA', '1sha_B': 'YVPML', '1skg_B': 'VAFRS', '1tjk_I': 'FLSTK', '2d5w_C': 'ASKPK', '2d5w_D': 'ASKTK', '3drj_B': 'AHAKA', '3hds_E': 'ASWSA', '3hds_F': 'ASWSA', '4j78_B': 'KTKLL', '4j82_C': 'KSHQE', '4j82_D': 'KSHQE', '4j84_C': 'ARKLD', '4j84_D': 'ARKLD', '4l9p_C': 'KCVVM', '4olr_A': 'YVVFV', '4olr_B': 'YVVFV', '4qxx_Z': 'GNLVS', '4v3i_B': 'DLTRP', '4x3o_C': 'PKKTG', '4zhb_B': 'VDAVN', '5ctv_C': 'AEKAA', '5ctv_E': 'AEKAA', '5n99_C': 'NQPWQ', '5n99_E': 'NQPWQ', '5n99_F': 'NQPWQ', '5n99_H': 'NQPWQ', '5n99_J': 'NQPWQ', '5n99_L': 'NQPWQ', '5n99_N': 'NQPWQ', '5n99_P': 'NQPWQ', '5n99_R': 'NQPWQ', '5n99_T': 'NQPWQ', '5n99_V': 'NQPWQ', '5n99_X': 'NQPWQ', '5nf0_F': 'GGGGG', '5njf_E': 'AAAAA', '5njf_F': 'AAAAA', '5onp_B': 'GAIIG', '5onq_B': 'GAIIG', '5r42_E': 'TPGVY', '5r43_E': 'TPGVY', '5r44_E': 'TPGVY', '5r45_E': 'TPGVY', '5r46_E': 'TPGVY', '5r47_E': 'TPGVY', '5r48_E': 'TPGVY', '5r49_E': 'TPGVY', '5r4a_E': 'TPGVY', '5r4b_E': 'TPGVY', '5r4c_E': '
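
For readers unfamiliar with the format, the round trip above amounts to writing one `>identifier` header line per chain followed by its sequence, then reading those pairs back into a dictionary. The following is a dependency-free sketch of that round trip. The helpers `to_fasta_text` and `parse_fasta_text` are hypothetical, and the exact header layout Graphein writes is an assumption.

```python
def to_fasta_text(records):
    """Serialise an {identifier: sequence} mapping as FASTA text."""
    return "".join(f">{identifier}\n{sequence}\n" for identifier, sequence in records.items())


def parse_fasta_text(text):
    """Parse FASTA text back into an {identifier: sequence} mapping."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = ""
        elif current is not None:
            # Sequences may span several lines; concatenate them.
            records[current] += line
    return records


selection = {"1hhz_D": "AEKAA", "1sha_B": "YVPML"}
fasta_text = to_fasta_text(selection)
round_trip = parse_fasta_text(fasta_text)
print(fasta_text)
```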

## Downloading PDBs

In [18]:
manager.download_pdbs("tmp/pdbs/")

  0%|          | 0/55 [00:00<?, ?it/s]

In [19]:
os.listdir("tmp/pdbs")

['5r46.pdb',
 '4j84.pdb',
 '6y3m.pdb',
 '5ctv.pdb',
 '4qxx.pdb',
 '5njf.pdb',
 '7etu.pdb',
 '6f4s.pdb',
 '6z00.pdb',
 '7oju.pdb',
 '1hhz.pdb',
 '4j82.pdb',
 '5r48.pdb',
 '1skg.pdb',
 '5r45.pdb',
 '4zhb.pdb',
 '5r42.pdb',
 '4v3i.pdb',
 '5onp.pdb',
 '5r47.pdb',
 '1sha.pdb',
 '4x3o.pdb',
 '5r4c.pdb',
 '5onq.pdb',
 '5n99.pdb',
 '3drj.pdb',
 '7kd7.pdb',
 '5r4a.pdb',
 '7z5z.pdb',
 '7kpu.pdb',
 '5r4b.pdb',
 '5r4d.pdb',
 '3hds.pdb',
 '5r44.pdb',
 '6diy.pdb',
 '6rd2.pdb',
 '6fbb.pdb',
 '5r43.pdb',
 '6f4r.pdb',
 '2d5w.pdb',
 '7ett.pdb',
 '5r49.pdb',
 '7etv.pdb',
 '6eaw.pdb',
 '7pul.pdb',
 '6slg.pdb',
 '6ax4.pdb',
 '6f4t.pdb',
 '1tjk.pdb',
 '4l9p.pdb',
 '5nf0.pdb',
 '4olr.pdb',
 '4j78.pdb',
 '7x70.pdb']
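
The selection holds one row per chain, while the files listed above are named by PDB code, which is consistent with each structure being fetched only once even when several of its chains are selected. The sketch below derives the unique PDB codes from chain identifiers of the form `<pdb>_<chain>` seen in the `id` column. It illustrates the bookkeeping only and is not Graphein's download logic.

```python
from collections import defaultdict

# Chain identifiers from the selection above.
chain_ids = ["1hhz_D", "1hhz_E", "1hhz_F", "1sha_B", "1skg_B"]

chains_by_pdb = defaultdict(list)
for chain_id in chain_ids:
    # Split on the first underscore: PDB code before it, chain label after it.
    pdb_code, chain = chain_id.split("_", 1)
    chains_by_pdb[pdb_code].append(chain)

pdb_codes = sorted(chains_by_pdb)
print(pdb_codes)
print(dict(chains_by_pdb))
```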

## Writing Individual Chains

We can also extract the individual chains from the PDB files in our selection:

In [21]:
manager.remove_unavailable_pdbs(update=True)
manager.write_chains()

  0%|          | 0/54 [00:00<?, ?it/s]

  warn(
100%|██████████| 54/54 [00:05<00:00, 10.80it/s]


[PosixPath('pdb/1hhz_D.pdb'),
 PosixPath('pdb/1hhz_E.pdb'),
 PosixPath('pdb/1hhz_F.pdb'),
 PosixPath('pdb/1sha_B.pdb'),
 PosixPath('pdb/1skg_B.pdb'),
 PosixPath('pdb/1tjk_I.pdb'),
 PosixPath('pdb/2d5w_C.pdb'),
 PosixPath('pdb/2d5w_D.pdb'),
 PosixPath('pdb/3drj_B.pdb'),
 PosixPath('pdb/3hds_E.pdb'),
 PosixPath('pdb/3hds_F.pdb'),
 PosixPath('pdb/4j78_B.pdb'),
 PosixPath('pdb/4j82_C.pdb'),
 PosixPath('pdb/4j82_D.pdb'),
 PosixPath('pdb/4j84_C.pdb'),
 PosixPath('pdb/4j84_D.pdb'),
 PosixPath('pdb/4l9p_C.pdb'),
 PosixPath('pdb/4olr_A.pdb'),
 PosixPath('pdb/4olr_B.pdb'),
 PosixPath('pdb/4qxx_Z.pdb'),
 PosixPath('pdb/4v3i_B.pdb'),
 PosixPath('pdb/4x3o_C.pdb'),
 PosixPath('pdb/4zhb_B.pdb'),
 PosixPath('pdb/5ctv_C.pdb'),
 PosixPath('pdb/5ctv_E.pdb'),
 PosixPath('pdb/5n99_C.pdb'),
 PosixPath('pdb/5n99_E.pdb'),
 PosixPath('pdb/5n99_F.pdb'),
 PosixPath('pdb/5n99_H.pdb'),
 PosixPath('pdb/5n99_J.pdb'),
 PosixPath('pdb/5n99_L.pdb'),
 PosixPath('pdb/5n99_N.pdb'),
 PosixPath('pdb/5n99_P.pdb'),
 PosixPath