# Creating Datasets from the PDB

Graphein provides a utility for curating and splitting datasets from the [RCSB PDB](https://www.rcsb.org/).


Initialising a `PDBManager` downloads PDB metadata, which we can use to make complex selections of protein structures.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb)

In [1]:
from rich import inspect
from graphein.ml.datasets import PDBManager

manager = PDBManager(root_dir=".")
inspect(manager, methods=True)

{'1bos': False, '1vvj': False, '1vy4': False, '1vy5': False, '1vy6': False, '1vy7': False, '2btj': False, '2vvj': False, '3j3q': False, '3j3y': False, '3j6b': False, '3j6x': False, '3j6y': False, '3j77': False, '3j78': False, '3j79': False, '3j7o': False, '3j7p': False, '3j7q': False, '3j7r': False, '3j8h': False, '3j92': False, '3j9g': False, '3j9k': False, '3j9l': False, '3j9m': False, '3j9r': False, '3j9w': False, '3j9y': False, '3j9z': False, '3ja1': False, '3jag': False, '3jah': False, '3jai': False, '3jaj': False, '3jan': False, '3jbn': False, '3jbo': False, '3jbp': False, '3jbu': False, '3jbv': False, '3jc1': False, '3jc8': False, '3jc9': False, '3jcd': False, '3jce': False, '3jcj': False, '3jcn': False, '3jco': False, '3jcp': False, '3jcs': False, '3jct': False, '3k1q': False, '3whe': False, '4abz': False, '4bp7': False, '4bts': False, '4ctf': False, '4ctg': False, '4d5y': False, '4d67': False, '4dcb': False, '4fqr': False, '4frt': False, '4l47': False, '4l71': False, '4lel': F

The manager wraps two dataframes:

* `manager.df` - the working selection, which selection functions can modify.
* `manager.source` - a clean copy of the original metadata.

In [2]:
manager.df

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
0,100d_A,100d,A,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,[SPM],,1.90,1994-12-05,diffraction,True
1,100d_B,100d,B,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,[SPM],,1.90,1994-12-05,diffraction,True
2,101d_A,101d,A,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
3,101d_B,101d,B,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
4,101m_A,101m,A,154,protein,MYOGLOBIN,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,,"[HEM, NBN, SO4]",Physeter catodon,2.07,1997-12-13,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
780157,9xia_A,9xia,A,388,protein,XYLOSE ISOMERASE,MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...,,"[DFR, MN]",Streptomyces rubiginosus,1.90,1990-10-11,diffraction,True
780158,9xim_A,9xim,A,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780159,9xim_B,9xim,B,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780160,9xim_C,9xim,C,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True


In [3]:
manager.source

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
0,100d_A,100d,A,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,[SPM],,1.90,1994-12-05,diffraction,True
1,100d_B,100d,B,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,[SPM],,1.90,1994-12-05,diffraction,True
2,101d_A,101d,A,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
3,101d_B,101d,B,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction,True
4,101m_A,101m,A,154,protein,MYOGLOBIN,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,,"[HEM, NBN, SO4]",Physeter catodon,2.07,1997-12-13,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
780157,9xia_A,9xia,A,388,protein,XYLOSE ISOMERASE,MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...,,"[DFR, MN]",Streptomyces rubiginosus,1.90,1990-10-11,diffraction,True
780158,9xim_A,9xim,A,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780159,9xim_B,9xim,B,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True
780160,9xim_C,9xim,C,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction,True


## Selection Properties

In [4]:
print("Num chains: ", manager.get_num_chains())
print("Num unique pdbs: ", manager.get_num_unique_pdbs())
print("Longest chain: ", manager.get_longest_chain())
print("Shortest chain: ", manager.get_shortest_chain())
print("Best Resolution: ", manager.get_best_resolution())
print("Worst Resolution: ", manager.get_worst_resolution())
print("Experiment Types: ", manager.get_experiment_types())
print("Molecule Types: ", manager.get_molecule_types())

Num chains:  780162
Num unique pdbs:  202738
Longest chain:  16181
Shortest chain:  1
Best Resolution:  -1.0
Worst Resolution:  70.0
Experiment Types:  ['diffraction' 'NMR' 'other' 'EM']
Molecule Types:  ['na' 'protein']
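
The summary getters above can be approximated with plain pandas operations. The sketch below uses a small made-up frame as a stand-in for `manager.df`; the rows, and the reading of `-1.0` as a placeholder for entries without a measured resolution, are assumptions for illustration and not something the output above confirms.

```python
import pandas as pd

# Hypothetical stand-in for `manager.df`; all rows are invented.
toy = pd.DataFrame(
    {
        "id": ["1abc_A", "1abc_B", "2xyz_A", "3nmr_A"],
        "pdb": ["1abc", "1abc", "2xyz", "3nmr"],
        "length": [154, 154, 12, 60],
        # -1.0 is used here as an assumed "no resolution" placeholder.
        "resolution": [1.9, 1.9, 2.25, -1.0],
    }
)

num_chains = len(toy)                    # one row per chain
num_unique_pdbs = toy["pdb"].nunique()   # several chains can share a PDB code
longest_chain = int(toy["length"].max())
shortest_chain = int(toy["length"].min())

# A minimum over every row picks up the placeholder value ...
naive_best = float(toy["resolution"].min())
# ... so restrict to strictly positive values for a physically meaningful answer.
measured = toy.loc[toy["resolution"] > 0, "resolution"]
best_resolution = float(measured.min())

print(num_chains, num_unique_pdbs, longest_chain, shortest_chain)
print(naive_best, best_resolution)
```

If the placeholder reading is right, filtering on `resolution > 0` before ranking structures by resolution avoids treating such entries as the "best" ones.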


# Making Selections

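Selection functions return a `pd.DataFrame` containing the selected rows. Every selection function also accepts an `update: bool` argument that controls whether `manager.df` is replaced by the result. Without `update=True`, the working selection is left unchanged: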

In [5]:
print("Number of chains: ", len(manager.df))

print("Hi res structure: ", len(manager.resolution_better_than_or_equal_to(2.0)))
print(len(manager.df))

# Update inplace
print("Modified inplace:")
manager.resolution_better_than_or_equal_to(2.0, update=True)
print(len(manager.df))

Number of chains:  780162
Hi res structure:  202476
780162
Modified inplace:
202476


If you want to reset the selection:

In [6]:
manager.reset()
print(len(manager.df))

780162
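
The relationship between `df`, `source`, `update` and `reset()` can be pictured with a small pandas class. This is a minimal sketch of the pattern only: `ToyManager` is a hypothetical name, the data is invented, and nothing here is Graphein's actual implementation.

```python
import pandas as pd


class ToyManager:
    """Hypothetical illustration of the df/source/update/reset pattern."""

    def __init__(self, metadata: pd.DataFrame) -> None:
        self.source = metadata.copy()  # pristine copy, never filtered
        self.df = metadata.copy()      # working selection

    def resolution_better_than_or_equal_to(
        self, threshold: float, update: bool = False
    ) -> pd.DataFrame:
        # Lower values mean better resolution, so "better or equal" is <=.
        selected = self.df.loc[self.df["resolution"] <= threshold]
        if update:
            self.df = selected  # only now does the working selection change
        return selected

    def reset(self) -> None:
        # Discard every filter applied so far.
        self.df = self.source.copy()


toy = ToyManager(
    pd.DataFrame({"id": ["a_A", "b_A", "c_A"], "resolution": [1.2, 2.0, 3.5]})
)

preview = toy.resolution_better_than_or_equal_to(2.0)
size_after_preview = len(toy.df)   # still the full table

toy.resolution_better_than_or_equal_to(2.0, update=True)
size_after_update = len(toy.df)    # now narrowed

toy.reset()
size_after_reset = len(toy.df)     # back to the full table
```

Calling a filter without `update=True` is a cheap way to check how many rows a criterion would keep before committing to it.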


Here is an example selection:

In [7]:
manager.length_shorter_than(6, update=True)
manager.length_longer_than(4, update=True)
manager.molecule_type("protein", update=True)
manager.resolution_better_than_or_equal_to(1.5, update=True)
manager.experiment_type("diffraction", update=True)
manager.remove_non_standard_alphabet_sequences(update=True)
manager.df

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
19383,1hhz_D,1hhz,D,5,protein,CELL WALL PEPTIDE,AEKAA,,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction,True
19384,1hhz_E,1hhz,E,5,protein,CELL WALL PEPTIDE,AEKAA,,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction,True
19385,1hhz_F,1hhz,F,5,protein,CELL WALL PEPTIDE,AEKAA,,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction,True
50598,1sha_B,1sha,B,5,protein,PHOSPHOPEPTIDE A,YVPML,,[PTR],Rous sarcoma virus,1.50,1992-08-18,diffraction,True
50810,1skg_B,1skg,B,5,protein,VAFRS,VAFRS,,"[MOH, SO4]",Daboia russellii pulchella; SYNTHETIC CONSTRUCT,1.21,2004-03-04,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
653289,7kpu_B,7kpu,B,5,protein,bisubstrate analogue (CMC-ACE-SER-GLY-ARG-GLY-...,SGRGK,,"[ACE, BTB, GOL, NH2, SO4, WZG]",Homo sapiens; SYNTHETIC CONSTRUCT,1.43,2020-11-12,diffraction,True
682420,7oju_H,7oju,H,5,protein,MVNAL Peptide,MVNAL,,"[CMC, GOL, P6G]",Chaetomium thermophilum (strain DSM 1495 / CBS...,1.10,2021-05-17,diffraction,True
694077,7pul_P,7pul,P,5,protein,GLY-ALA-GLY-ALA-ALA,GAGAA,,"[CA, MG]",Enterococcus faecalis,1.40,2021-09-30,diffraction,True
745627,7x70_B,7x70,B,5,protein,peptide,AVKLQ,,[],Homo sapiens; SYNTHETIC CONSTRUCT,1.25,2022-03-08,diffraction,True
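
Every chain in the result has length 5, which suggests that `length_shorter_than(6)` and `length_longer_than(4)` are strict bounds. Under that assumption, the chained calls correspond roughly to a single boolean mask in pandas. The frame below is invented, and treating the "standard alphabet" as the 20 canonical amino acids is also an assumption rather than Graphein's documented definition.

```python
import pandas as pd

# Assumed definition of the standard alphabet: the 20 canonical amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical stand-in for the metadata table.
toy = pd.DataFrame(
    {
        "id": ["t1_A", "t2_A", "t3_A", "t4_A", "t5_A", "t6_A"],
        "length": [5, 5, 4, 6, 5, 5],
        "molecule_type": ["protein", "protein", "protein", "protein", "na", "protein"],
        "sequence": ["AEKAA", "AXKAA", "AEKA", "AEKAAG", "CCGGC", "YVPML"],
        "resolution": [0.99, 1.2, 1.0, 1.1, 1.3, 1.9],
        "experiment_type": ["diffraction"] * 6,
    }
)

mask = (
    (toy["length"] < 6)                       # strictly shorter than 6
    & (toy["length"] > 4)                     # strictly longer than 4
    & (toy["molecule_type"] == "protein")
    & (toy["resolution"] <= 1.5)              # lower value = better resolution
    & (toy["experiment_type"] == "diffraction")
    & toy["sequence"].map(lambda s: set(s) <= STANDARD_AMINO_ACIDS)
)

selection = toy.loc[mask]
print(selection["id"].tolist())
```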


## Creating train/val/test splits

### Temporal splits

We can create train, val and test splits based on deposition time:

In [8]:
import numpy as np

splits = ["train", "val", "test"]

pdb_manager = PDBManager(
    root_dir=".",
    splits=splits,
    split_ratios=[0.8, 0.1, 0.1],
    split_time_frames=[
        np.datetime64("2020-01-01"),
        np.datetime64("2021-01-01"),
        np.datetime64("2023-03-01"),
    ],
)
pdb_manager.split_time_frames


[numpy.datetime64('2020-01-01'),
 numpy.datetime64('2021-01-01'),
 numpy.datetime64('2023-03-01')]
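
One way to read `split_time_frames` is as a list of upper bounds, one per split: structures deposited before the first date go to `train`, those before the second to `val`, and so on. The sketch below implements that reading with the standard library. The helper name `assign_split` and the handling of dates that fall exactly on a boundary are assumptions, not Graphein's confirmed behaviour.

```python
from bisect import bisect_right
from datetime import date

splits = ["train", "val", "test"]
cutoffs = [date(2020, 1, 1), date(2021, 1, 1), date(2023, 3, 1)]


def assign_split(deposited: date):
    """Return the split whose time frame contains `deposited`, or None."""
    # bisect_right counts the cutoffs that are <= deposited, so a structure
    # deposited exactly on a cutoff lands in the later split (an assumption).
    index = bisect_right(cutoffs, deposited)
    return splits[index] if index < len(splits) else None


# Deposition dates taken from the tables in this tutorial.
examples = {
    "1wd5_A": date(2004, 5, 11),
    "6xfu_A": date(2020, 6, 16),
    "7u0u_A": date(2022, 2, 18),
}
assigned = {chain: assign_split(day) for chain, day in examples.items()}
print(assigned)
```

Splitting by time rather than at random keeps structures deposited after the training cutoff out of the training set, which mimics how a model would be used on newly released structures.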

We can then perform additional structural filters:

In [9]:
pdb_manager.molecule_type(type="protein", update=True)
pdb_manager.experiment_type(type="diffraction", update=True)
pdb_manager.resolution_better_than_or_equal_to(2.0, update=True)
pdb_manager.length_longer_than(40, update=True)
pdb_manager.length_shorter_than(401, update=True)
pdb_manager.has_ligands(["SO4"], update=True) # Must contain an SO4 ligand
pdb_manager.oligomeric(1, update=True) # Get only monomers
print("Number of chains: ", len(pdb_manager.df))

Number of chains:  5959


We also want to restrict ourselves to proteins that contain only standard residues and are available to download in PDB format:

In [10]:
pdb_manager.remove_non_standard_alphabet_sequences(update=True)
pdb_manager.remove_unavailable_pdbs(update=True)
print("Number of chains: ", len(pdb_manager.df))

Number of chains:  5873


We can then cluster by sequence similarity using [MMSeqs2](https://github.com/soedinglab/MMseqs2):

(Install with `conda install -c conda-forge -c bioconda mmseqs2`.)

In [11]:
pdb_manager.cluster(update=True, overwrite=True)
print("Number of chains: ", len(pdb_manager.df))

easy-cluster pdb.fasta pdb_cluster tmp --min-seq-id 0.3 -c 0.8 --cov-mode 1 

MMseqs Version:                     	14.7e284
Substitution matrix                 	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix            	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                         	4
k-mer length                        	0
k-score                             	seq:2147483647,prof:2147483647
Alphabet size                       	aa:21,nucl:5
Max sequence length                 	65535
Max results per query               	20
Split database                      	0
Split mode                          	2
Split memory limit                  	0
Coverage threshold                  	0.8
Coverage mode                       	1
Compositional bias                  	1
Compositional bias                  	1
Diagonal scoring                    	true
Exact k-mer matching                	0
Mask residues                       	1
Mask residues probability           	0.9
Mask lower case r

Number of chains:  1433
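
The log shows the MMseqs2 command that was run: clustering at 30% minimum sequence identity (`--min-seq-id 0.3`) and 80% coverage (`-c 0.8`). `easy-cluster` writes a two-column `<prefix>_cluster.tsv` file mapping each cluster representative to its members. The sketch below builds the same command as an argument list (without running it) and parses a made-up excerpt of such a file. That the manager keeps exactly one representative per cluster is an assumption, although the drop from 5873 to 1433 chains is consistent with it.

```python
import csv
import io

# The command echoed in the log above, as an argument list for subprocess.run.
command = [
    "mmseqs", "easy-cluster", "pdb.fasta", "pdb_cluster", "tmp",
    "--min-seq-id", "0.3",  # minimum sequence identity within a cluster
    "-c", "0.8",            # minimum alignment coverage
    "--cov-mode", "1",      # coverage mode, as shown in the log
]

# Invented excerpt in the layout of `pdb_cluster_cluster.tsv`:
# representative ID, then member ID (a representative also lists itself).
cluster_tsv = "1abc_A\t1abc_A\n1abc_A\t2def_B\n3ghi_A\t3ghi_A\n"

clusters = {}
for representative, member in csv.reader(io.StringIO(cluster_tsv), delimiter="\t"):
    clusters.setdefault(representative, []).append(member)

# Keeping one chain per cluster removes near-duplicate sequences.
representatives = sorted(clusters)
print(representatives)
```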


In [21]:
#split_data = pdb_manager.filter_by_deposition_date(max_deposition_date=np.datetime64('2023-03-01'), update=True)
#split_data["train"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_split[list_column] = df_split[list_column].apply(tuple)

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available,split
0,1wd5_A,1wd5,A,208,protein,hypothetical protein TT1426,MRFRDRRHAGALLAEALAPLGLEAPVVLGLPRGGVVVADEVARRLG...,"(MES,)",Thermus thermophilus,2.000,2004-05-11,diffraction,True,train
1,2vdf_A,2vdf,A,253,protein,OUTER MEMBRANE PROTEIN,AQELQTANEFTVHTDLSSISSTRAFLKEKHKAAKHIGVRADIPFDA...,"(OCT, SO4)",NEISSERIA MENINGITIDIS,1.950,2007-10-05,diffraction,True,train
2,4gak_A,4gak,A,250,protein,Acyl-ACP thioesterase,SNAMAFIQTDTFTLRGYECDAFGRMSIPALMNLMQESANRNAIDYG...,"(CL, GOL)",Spirosoma linguale,1.900,2012-07-25,diffraction,True,train
3,4doi_A,4doi,A,246,protein,Chalcone--flavonone isomerase 1,MSSSNACASPSPFPAVTKLHVDSVTFVPSVKSPASSNPLFLGGAGV...,"(NO3,)",Arabidopsis thaliana,1.550,2012-02-09,diffraction,True,train
4,4mi4_A,4mi4,A,197,protein,Spermidine n1-acetyltransferase,MHHHHHHSSGVDLGTENLYFQSNAMNSQLTLRALERGDLRFIHNLN...,"(SPM,)",Vibrio cholerae O1 biovar El tor,1.848,2013-08-30,diffraction,True,train
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7974,5wcx_A,5wcx,A,383,protein,Phosphoglycerol transferase GacH,GAMEKPTNYSQETIASIAQKYQKLAEDINKDRKNNIADQTVIYLLS...,"(1GP, CA, MN)",Streptococcus pyogenes serotype M1,2.000,2017-07-02,diffraction,True,train
7975,3k3v_A,3k3v,A,100,protein,Protein SMY2,GSNGMSQLPAPVSVESSWRYIDTQGQIHGPFTTQMMSQWYIGGYFA...,(),Saccharomyces cerevisiae,1.800,2009-10-05,diffraction,True,train
7976,5wfl_A,5wfl,A,336,protein,Kelch-like ECH-associated protein 1,MGSSHHHHHHSSGGENLYFQGHMKPTQVMPSRAPKVGRLIYTAGGY...,"(SO4,)",Homo sapiens,1.930,2017-07-12,diffraction,True,train
7977,4rw0_A,4rw0,A,185,protein,Lipolytic protein G-D-S-L family,SNAMEIICFGDSITRGYDVPYGRGWVEICDASIENVNFTNYGEDGC...,"(GOL, NA)",Veillonella parvula DSM 2008,2.000,2014-11-30,diffraction,True,train


In [14]:
#split_data["val"].head()

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available,split
0,6w34_A,6w34,A,315,protein,Beta-lactamase,SNAMEGMMILKNKRMLKIGICVGILGLSITSLEAFTGGALQVEAKE...,"(CL, EDO, SO4)",Bacillus cereus (strain ATCC 14579 / DSM 31 / ...,1.45,2020-03-08,diffraction,True,val
1,6xfu_A,6xfu,A,70,protein,ABC transporter ATP-binding protein,GSHMSDTTIVTVDHKDFDRTEKYLAEHFQLQNVDKADGHLMINAQK...,"(SCN,)",Staphylococcus aureus,1.4,2020-06-16,diffraction,True,val
2,7aqx_A,7aqx,A,364,protein,Variant surface glycoprotein MITAT 1.2,AAEKGFKQAFWQPLCQVSEELDDQPKGALFTLQAAASKIQKMRDAA...,"(BMA, CIT, MAN, NA, NAG)",Lama glama; Trypanosoma brucei brucei,1.499,2020-10-23,diffraction,True,val
3,6was_G,6was,G,151,protein,1FD6 16055 V1V2 scaffold,MTTFKLAACVTLECRQVNTTNATSSVNVTNGEEIKNCSFNATTEIR...,"(NAG, PCA)",Homo sapiens; SYNTHETIC CONSTRUCT,1.904,2020-03-26,diffraction,True,val
4,6ze0_A,6ze0,A,270,protein,alcohol dehydrogenase,MTNLPTALITGASSGIGATYAERFARRGHNLVMVARDKVRMDVLAS...,(),Comamonas sp. 26,1.99,2020-06-15,diffraction,True,val
5,6wfv_A,6wfv,A,247,protein,Multifunctional procollagen lysine hydroxylase...,MGSSHHHHHHASDPVNPEKLLVITVATAETEGYLRFLRSAEFFNYT...,"(MG, MN, TRS, UDP)",Homo sapiens,1.7,2020-04-04,diffraction,True,val
6,6z3j_C,6z3j,C,96,protein,RGM domain family member B,ETGQCRIQKCTTDFVSLTSHLNSAVDGFDSEFCKALRAYAGCTQRT...,"(CL, EDO, GOL, NAG, SO4)",Homo sapiens,1.65,2020-05-20,diffraction,True,val
7,6yud_A,6yud,A,111,protein,Csx3/Crn3,GANAMASMKFAVIDRKNFTLIHFEIEKPIKPEILKEIEIPSVDTRK...,(),Archaeoglobus fulgidus; SYNTHETIC CONSTRUCT,1.84,2020-04-27,diffraction,True,val
8,7d2w_A,7d2w,A,190,protein,PRESAN domain-containing protein,MGRNEIHKNSLGFGKNDNIYSRNLAQLKLSNYSLLKDSKFESVTEQ...,"(HG,)",Plasmodium falciparum (isolate 3D7),1.97,2020-09-17,diffraction,True,val
9,6ype_A,6ype,A,205,protein,Neuronal pentraxin-1,GDKFQLTFPLRTNYMYAKVKKSLPEMYAFTVCMWLKSSATPGVGTP...,"(CA, CAC)",Homo sapiens,1.45,2020-04-15,diffraction,True,val


In [15]:
#split_data["test"].head()

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available,split
0,7u0u_A,7u0u,A,268,protein,Serine/threonine-protein phosphatase 2B cataly...,MSGSHHHHHHHHGGENLYFQGSTPHPYWLPNFMDVFTWSLPFVGEK...,"(CA, FK5, PG4, PO4)",Aspergillus fumigatus,1.90000,2022-02-18,diffraction,True,test
1,8ehe_B,8ehe,B,152,protein,Potempin C (PotC),MKQKIILWISTLLLLTAGAGCKKETLPPNQAKGKVLGPTGPCQGYA...,"(CA, EDO, GOL, NA, PCA)",Tannerella forsythia,1.10000,2022-09-14,diffraction,True,test
2,7ml9_A,7ml9,A,325,protein,Insecticidal protein,MKKFASLILTSVFLFSSTQFVHASSTDVQERLRDLAREDEAGTFNE...,"(ACT, EDO, NA, SM)",Brevibacillus laterosporus,1.94000,2021-04-27,diffraction,True,test
3,7prs_DDD,7prs,DDD,104,protein,"Heat-labile enterotoxin IIA, B chain",GVSKTFKDKCASTTAKLVQSVQLVNISSDVNKDSKGIYISSSAGKT...,"(GAL, GLC, NAG, SIA, SO4)",Escherichia coli,2.00000,2021-09-22,diffraction,False,test
4,7rlm_B,7rlm,B,142,protein,Coat protein,PSSETFVFTKDNLVGNTQGSFTFGPSLSDCPAFKDGILKAYHEYKI...,"(BR,)",Potato leafroll virus,1.80000,2021-07-25,diffraction,True,test
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,7fct_A,7fct,A,246,protein,Metal beta-lactamase,MFKFIGCGSAFNTRLGNNSAYIKEDGILFMIDCGSANFDRIMRSDL...,"(GOL, MPD, ZN)",Bacillus phage vB_BtM_BMBsp2,1.80000,2021-07-15,diffraction,True,test
82,7vtk_A,7vtk,A,373,protein,"bata-1,3-glucanase",ATCELALENKSLPGTVHAYVTGHEQGTDRWVLLRPDGSVYRPDSPG...,(),Streptomyces pratensis,1.90697,2021-10-29,diffraction,True,test
83,7w4z_B,7w4z,B,162,protein,Actin-binding protein fragmin P,GPMQKQKEYNIADSNIANLGTELEKKVKLEASQHEDAWKGAGKQVG...,"(ANP, CA, EDO, HIC, MG, PO4)",Gallus gallus; Physarum polycephalum,1.15000,2021-11-29,diffraction,True,test
84,7wgu_A,7wgu,A,358,protein,Iron uptake system protein EfeO,MADVPQVKVTVTDKQCEPMTITVNAGKTQFIIQNHSQKALEWEILK...,"(ACT, EDO, PGE, SO4, ZN)",Escherichia coli,1.85000,2021-12-29,diffraction,True,test


In [12]:
#print(f"num_unique_pdbs: {pdb_manager.get_num_unique_pdbs(splits=splits)}")
#print(f"best_resolution: {pdb_manager.get_best_resolution(splits=splits)}")

AssertionError: Requested splits must be non-empty.

Finally, we can download the structures in our splits and export them:

In [14]:
pdb_manager.download_pdbs("./pdb", splits=splits)
pdb_manager.export_pdbs(
    pdb_dir="./pdb", splits=splits, max_num_chains_per_pdb_code=3
)

AssertionError: Requested splits must be non-empty.

# I/O

We can write our selection to CSV or FASTA files, or download the relevant PDB files and write them (or their individual chains) to disk:

## CSV

In [15]:
import os
import pandas as pd

os.makedirs("tmp/", exist_ok=True)
# Write selection to disk
manager.to_csv("tmp/test.csv")

# Read selection from disk
sel = pd.read_csv("tmp/test.csv")
sel

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type,pdb_file_available
0,1hhz_D,1hhz,D,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction,True
1,1hhz_E,1hhz,E,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction,True
2,1hhz_F,1hhz,F,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction,True
3,1sha_B,1sha,B,5,protein,PHOSPHOPEPTIDE A,YVPML,,['PTR'],Rous sarcoma virus,1.50,1992-08-18,diffraction,True
4,1skg_B,1skg,B,5,protein,VAFRS,VAFRS,,"['MOH', 'SO4']",Daboia russellii pulchella; SYNTHETIC CONSTRUCT,1.21,2004-03-04,diffraction,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,7kpu_B,7kpu,B,5,protein,bisubstrate analogue (CMC-ACE-SER-GLY-ARG-GLY-...,SGRGK,,"['ACE', 'BTB', 'GOL', 'NH2', 'SO4', 'WZG']",Homo sapiens; SYNTHETIC CONSTRUCT,1.43,2020-11-12,diffraction,True
76,7oju_H,7oju,H,5,protein,MVNAL Peptide,MVNAL,,"['CMC', 'GOL', 'P6G']",Chaetomium thermophilum (strain DSM 1495 / CBS...,1.10,2021-05-17,diffraction,True
77,7pul_P,7pul,P,5,protein,GLY-ALA-GLY-ALA-ALA,GAGAA,,"['CA', 'MG']",Enterococcus faecalis,1.40,2021-09-30,diffraction,True
78,7x70_B,7x70,B,5,protein,peptide,AVKLQ,,[],Homo sapiens; SYNTHETIC CONSTRUCT,1.25,2022-03-08,diffraction,True
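
Note that the `ligands` column comes back from `pd.read_csv` as strings such as `"['CA', 'MG']"` rather than Python lists, because CSV has no list type. A minimal sketch of restoring the lists with `ast.literal_eval` follows; the frame is a made-up stand-in for the selection, and an in-memory buffer replaces the file on disk.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row selection with a list-valued column.
toy = pd.DataFrame({"id": ["7pul_P", "7x70_B"], "ligands": [["CA", "MG"], []]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # lists are written as their string repr

# A plain read returns the column as strings.
buffer.seek(0)
naive = pd.read_csv(buffer)

# A converter parses each cell back into a Python list.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"ligands": ast.literal_eval})

print(type(naive["ligands"].iloc[0]), type(restored["ligands"].iloc[0]))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for cells read from a file.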


## FASTA

In [17]:
from graphein.protein.utils import read_fasta
# Write selection to a fasta file
manager.to_fasta("tmp/test.fasta")

# Load selection from a fasta file
fs = read_fasta("tmp/test.fasta")
print(fs)

{'1hhz_D': 'AEKAA', '1hhz_E': 'AEKAA', '1hhz_F': 'AEKAA', '1sha_B': 'YVPML', '1skg_B': 'VAFRS', '1tjk_I': 'FLSTK', '2d5w_C': 'ASKPK', '2d5w_D': 'ASKTK', '3drj_B': 'AHAKA', '3hds_E': 'ASWSA', '3hds_F': 'ASWSA', '4j78_B': 'KTKLL', '4j82_C': 'KSHQE', '4j82_D': 'KSHQE', '4j84_C': 'ARKLD', '4j84_D': 'ARKLD', '4l9p_C': 'KCVVM', '4olr_A': 'YVVFV', '4olr_B': 'YVVFV', '4qxx_Z': 'GNLVS', '4v3i_B': 'DLTRP', '4x3o_C': 'PKKTG', '4zhb_B': 'VDAVN', '5ctv_C': 'AEKAA', '5ctv_E': 'AEKAA', '5n99_C': 'NQPWQ', '5n99_E': 'NQPWQ', '5n99_F': 'NQPWQ', '5n99_H': 'NQPWQ', '5n99_J': 'NQPWQ', '5n99_L': 'NQPWQ', '5n99_N': 'NQPWQ', '5n99_P': 'NQPWQ', '5n99_R': 'NQPWQ', '5n99_T': 'NQPWQ', '5n99_V': 'NQPWQ', '5n99_X': 'NQPWQ', '5nf0_F': 'GGGGG', '5njf_E': 'AAAAA', '5njf_F': 'AAAAA', '5onp_B': 'GAIIG', '5onq_B': 'GAIIG', '5r42_E': 'TPGVY', '5r43_E': 'TPGVY', '5r44_E': 'TPGVY', '5r45_E': 'TPGVY', '5r46_E': 'TPGVY', '5r47_E': 'TPGVY', '5r48_E': 'TPGVY', '5r49_E': 'TPGVY', '5r4a_E': 'TPGVY', '5r4b_E': 'TPGVY', '5r4c_E': '
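
The returned dictionary maps each chain ID to its sequence. For readers unfamiliar with the format, the round trip can be sketched in a few lines of standard-library Python. `write_fasta` and `parse_fasta` are hypothetical helpers written for this illustration; they are not the Graphein functions used above.

```python
import io


def write_fasta(records: dict, handle) -> None:
    """Write {identifier: sequence} pairs as FASTA records."""
    for identifier, sequence in records.items():
        handle.write(f">{identifier}\n{sequence}\n")


def parse_fasta(handle) -> dict:
    """Read FASTA records into an {identifier: sequence} dictionary."""
    records = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first token after '>'.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            # Sequences may span several lines; concatenate them.
            records[current] += line
    return records


records = {"1hhz_D": "AEKAA", "1sha_B": "YVPML"}
buffer = io.StringIO()
write_fasta(records, buffer)
buffer.seek(0)
round_tripped = parse_fasta(buffer)
print(round_tripped)
```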

## Downloading PDBs

In [18]:
manager.download_pdbs("tmp/pdbs/")

  0%|          | 0/55 [00:00<?, ?it/s]

In [19]:
os.listdir("tmp/pdbs")

['5r46.pdb',
 '4j84.pdb',
 '6y3m.pdb',
 '5ctv.pdb',
 '4qxx.pdb',
 '5njf.pdb',
 '7etu.pdb',
 '6f4s.pdb',
 '6z00.pdb',
 '7oju.pdb',
 '1hhz.pdb',
 '4j82.pdb',
 '5r48.pdb',
 '1skg.pdb',
 '5r45.pdb',
 '4zhb.pdb',
 '5r42.pdb',
 '4v3i.pdb',
 '5onp.pdb',
 '5r47.pdb',
 '1sha.pdb',
 '4x3o.pdb',
 '5r4c.pdb',
 '5onq.pdb',
 '5n99.pdb',
 '3drj.pdb',
 '7kd7.pdb',
 '5r4a.pdb',
 '7z5z.pdb',
 '7kpu.pdb',
 '5r4b.pdb',
 '5r4d.pdb',
 '3hds.pdb',
 '5r44.pdb',
 '6diy.pdb',
 '6rd2.pdb',
 '6fbb.pdb',
 '5r43.pdb',
 '6f4r.pdb',
 '2d5w.pdb',
 '7ett.pdb',
 '5r49.pdb',
 '7etv.pdb',
 '6eaw.pdb',
 '7pul.pdb',
 '6slg.pdb',
 '6ax4.pdb',
 '6f4t.pdb',
 '1tjk.pdb',
 '4l9p.pdb',
 '5nf0.pdb',
 '4olr.pdb',
 '4j78.pdb',
 '7x70.pdb']
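
There are fewer files than chains because several chains in the selection belong to the same PDB entry, and each entry is downloaded once. The sketch below derives the unique PDB codes from a few chain IDs taken from the selection. The URL pattern is the public RCSB file server's; whether Graphein fetches from this exact endpoint is an assumption.

```python
# Chain IDs follow the `<pdb code>_<chain>` pattern used in the `id` column.
chain_ids = ["1hhz_D", "1hhz_E", "1hhz_F", "1sha_B", "5n99_C", "5n99_E"]

# One download per PDB entry, regardless of how many chains were selected.
pdb_codes = sorted({chain_id.split("_")[0] for chain_id in chain_ids})

# Assumed download location for each entry (no request is made here).
urls = [f"https://files.rcsb.org/download/{code}.pdb" for code in pdb_codes]

print(pdb_codes)
```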

## Writing Individual Chains

We can also extract the individual chains from the PDB files in our selection. Each chain is written to its own file:

In [21]:
manager.remove_unavailable_pdbs(update=True)
manager.write_chains()

  0%|          | 0/54 [00:00<?, ?it/s]

100%|██████████| 54/54 [00:05<00:00, 10.80it/s]


[PosixPath('pdb/1hhz_D.pdb'),
 PosixPath('pdb/1hhz_E.pdb'),
 PosixPath('pdb/1hhz_F.pdb'),
 PosixPath('pdb/1sha_B.pdb'),
 PosixPath('pdb/1skg_B.pdb'),
 PosixPath('pdb/1tjk_I.pdb'),
 PosixPath('pdb/2d5w_C.pdb'),
 PosixPath('pdb/2d5w_D.pdb'),
 PosixPath('pdb/3drj_B.pdb'),
 PosixPath('pdb/3hds_E.pdb'),
 PosixPath('pdb/3hds_F.pdb'),
 PosixPath('pdb/4j78_B.pdb'),
 PosixPath('pdb/4j82_C.pdb'),
 PosixPath('pdb/4j82_D.pdb'),
 PosixPath('pdb/4j84_C.pdb'),
 PosixPath('pdb/4j84_D.pdb'),
 PosixPath('pdb/4l9p_C.pdb'),
 PosixPath('pdb/4olr_A.pdb'),
 PosixPath('pdb/4olr_B.pdb'),
 PosixPath('pdb/4qxx_Z.pdb'),
 PosixPath('pdb/4v3i_B.pdb'),
 PosixPath('pdb/4x3o_C.pdb'),
 PosixPath('pdb/4zhb_B.pdb'),
 PosixPath('pdb/5ctv_C.pdb'),
 PosixPath('pdb/5ctv_E.pdb'),
 PosixPath('pdb/5n99_C.pdb'),
 PosixPath('pdb/5n99_E.pdb'),
 PosixPath('pdb/5n99_F.pdb'),
 PosixPath('pdb/5n99_H.pdb'),
 PosixPath('pdb/5n99_J.pdb'),
 PosixPath('pdb/5n99_L.pdb'),
 PosixPath('pdb/5n99_N.pdb'),
 PosixPath('pdb/5n99_P.pdb'),
 PosixPath
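
Conceptually, writing a single chain means keeping only the coordinate records whose chain identifier matches. In the PDB fixed-width format the chain identifier is the single character in column 22 of `ATOM` and `HETATM` records. The sketch below applies that rule to a made-up three-atom structure; `atom_line` and `extract_chain` are hypothetical helpers, not the routine Graphein uses.

```python
def atom_line(serial: int, chain: str) -> str:
    """Build an invented CA atom record in PDB fixed-width layout."""
    return (
        f"ATOM  {serial:>5}  CA  ALA {chain}{serial:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}           C"
    )


def extract_chain(pdb_text: str, chain: str) -> str:
    """Keep the coordinate records that belong to one chain."""
    kept = [
        line
        for line in pdb_text.splitlines()
        # Column 22 (index 21) holds the one-character chain identifier.
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21 and line[21] == chain
    ]
    return "\n".join(kept + ["END"])


structure = "\n".join(
    ["HEADER    TOY STRUCTURE", atom_line(1, "D"), atom_line(2, "D"), atom_line(3, "E"), "END"]
)
chain_d = extract_chain(structure, "D")
print(chain_d)
```

Because the format reserves a single column for the chain identifier, this approach only covers one-character chain IDs.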