**Table of contents**<a id='toc0_'></a>    
- [Importing and Reading Data](#toc1_1_)    
  - [Fragmenting PDB below Tanimoto 0.85](#toc1_2_)    
  - [Saving](#toc1_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Importing and Reading Data](#toc0_)

In [9]:
import sys
# Make the local helper modules in ../utils/ (e.g. ring_fragmenter) importable
sys.path.append('../utils/')
import pandas as pd
from rdkit.Chem.PandasTools import AddMoleculeColumnToFrame
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import HTML
from ring_fragmenter import get_ring_systems, get_ring_adjacent, get_ring_fragments

IPythonConsole.molSize = (600,300)

# Render a DataFrame as HTML so that rdkit.Chem.Mol columns are displayed
def show_df(df):
    return HTML(df.to_html(notebook=True))

In [10]:
pdb_below_0_85_df = pd.read_csv('../../data/negative_datasets/cleaned_datasets/pdb_cleaned.csv')

print(pdb_below_0_85_df.shape)
pdb_below_0_85_df.columns

(12246, 1)


Index(['clean_smiles'], dtype='object')

In [11]:
list_of_smiles = pdb_below_0_85_df['clean_smiles'].to_list()

## <a id='toc1_2_'></a>[Fragmenting PDB below Tanimoto 0.85](#toc0_)

In [12]:
ring_fragments_pdb = get_ring_fragments(list_of_smiles=list_of_smiles, no_rings_list=False)
ring_fragments_pdb

|  | parent_smiles | ring_fragment |
|---|---|---|
| 0 | `CC1(C)O[C@H]2[C@@H]3OS(=O)(=O)O[C@@H]3CO[C@@]2...` | `CC1(C)O[C@H]2[C@@H]3OS(=O)(=O)O[C@@H]3CO[C@@]2...` |
| 1 | `CNC(=O)c1cccc(C)c1Nc1nc(N2CCN(c3ccccc3Cl)CC2)n...` | `Cc1cccc(C)c1N` |
| 2 | `CNC(=O)c1cccc(C)c1Nc1nc(N2CCN(c3ccccc3Cl)CC2)n...` | `Nc1nc(N)nc(N)n1` |
| 3 | `CNC(=O)c1cccc(C)c1Nc1nc(N2CCN(c3ccccc3Cl)CC2)n...` | `CN1CCN(C)CC1` |
| 4 | `CNC(=O)c1cccc(C)c1Nc1nc(N2CCN(c3ccccc3Cl)CC2)n...` | `Nc1ccccc1Cl` |
| ... | ... | ... |
| 29222 | `NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=...` | `Cc1ccccc1` |
| 29223 | `COCCOc1cnc2ccc([C@H](C)c3nnc4c(F)cc(-c5cc(C)no...` | `Cc1ccc2ncc(O)cc2c1` |
| 29224 | `COCCOc1cnc2ccc([C@H](C)c3nnc4c(F)cc(-c5cc(C)no...` | `Cc1cc(F)c2nnc(C)n2c1` |
| 29225 | `COCCOc1cnc2ccc([C@H](C)c3nnc4c(F)cc(-c5cc(C)no...` | `Cc1cc(C)on1` |
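
The source of `ring_fragmenter` is not shown in this notebook, so the following is only a sketch of one common way to group rings into ring systems, not the module's actual implementation. It assumes rings are given as tuples of atom indices (the shape returned by RDKit's `mol.GetRingInfo().AtomRings()`); the function name `merge_ring_systems` and the example indices are hypothetical.

```python
def merge_ring_systems(atom_rings, include_spiro=False):
    """Group rings into ring systems by merging rings that share atoms.

    atom_rings: iterable of tuples of atom indices, one tuple per ring.
    Rings sharing two or more atoms (fused or bridged) are merged. Rings
    sharing exactly one atom (spiro) are merged only if include_spiro is True.
    """
    systems = []
    for ring in atom_rings:
        current = set(ring)
        untouched = []
        for system in systems:
            shared = len(current & system)
            if shared > 1 or (shared == 1 and include_spiro):
                # Absorb the overlapping system into the one being built
                current |= system
            else:
                untouched.append(system)
        untouched.append(current)
        systems = untouched
    return systems


# Hypothetical atom indices: two fused six-membered rings (sharing atoms 4
# and 5) plus one isolated six-membered ring
rings = [(0, 1, 2, 3, 4, 5), (4, 5, 6, 7, 8, 9), (11, 12, 13, 14, 15, 16)]
systems = merge_ring_systems(rings)
print(len(systems), [sorted(s) for s in systems])
```

Under this grouping, the two fused rings collapse into a single system while the isolated ring stays separate, which matches the table above, where a single parent contributes one row per separate ring system.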


In [13]:
# Optional preview of the first 1000 rows with molecule drawings ("amostra" means "sample")
# amostra_ring_fragments = ring_fragments_pdb[0:1000]
# AddMoleculeColumnToFrame(smilesCol='parent_smiles', molCol='parent_mol', frame=amostra_ring_fragments)
# AddMoleculeColumnToFrame(smilesCol='ring_fragment', molCol='fragment_mol', frame=amostra_ring_fragments)

# show_df(amostra_ring_fragments)

Keep in mind that if a molecule contains the same ring more than once, the table holds one row per occurrence, so the same (parent, ring fragment) pair appears several times.
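
The repeated rows described above can be reproduced on a small toy table. This is a minimal sketch: the rows are hand-written for illustration and are not real output of `get_ring_fragments`; the assumption is that diphenylmethane would yield the same toluene-like fragment once per phenyl ring.

```python
import pandas as pd

# Hand-written toy rows (an assumption, not real fragmenter output):
# the first parent has two identical rings, so its fragment appears twice.
toy = pd.DataFrame({
    "parent_smiles": ["c1ccc(Cc2ccccc2)cc1", "c1ccc(Cc2ccccc2)cc1", "Cc1ccncc1"],
    "ring_fragment": ["Cc1ccccc1", "Cc1ccccc1", "Cc1ccncc1"],
})

# duplicated() marks every repeat of an earlier identical row as True
is_repeat = toy.duplicated()
n_repeats = int(is_repeat.sum())

# Dropping the repeats keeps one row per distinct (parent, fragment) pair
toy_unique = toy[~is_repeat].reset_index(drop=True)
print(n_repeats, len(toy_unique))
```

Filtering with `~toy.duplicated()` keeps the first occurrence of each row, which is the same result `toy.drop_duplicates()` gives with its default settings.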

## <a id='toc1_3_'></a>[Saving](#toc0_)

Saving the non-unique relation (one row per parent and ring-fragment occurrence, repeats included):

In [14]:
ring_fragments_pdb.to_csv('../../data/fragments/non_unique/pdb_below_0_85_fragments_non_unique.csv', index=False)

Saving the unique relation (each parent paired once with each of its distinct ring fragments):

In [15]:
# Keep only the first occurrence of each (parent_smiles, ring_fragment) row
ring_fragments_pdb_unique = ring_fragments_pdb.drop_duplicates().reset_index(drop=True)
ring_fragments_pdb_unique

|  | parent_smiles | ring_fragment |
|---|---|---|
| 0 | `CC1(C)O[C@H]2[C@@H]3OS(=O)(=O)O[C@@H]3CO[C@@]2...` | `CC1(C)O[C@H]2[C@@H]3OS(=O)(=O)O[C@@H]3CO[C@@]2...` |
| 1 | `CNC(=O)c1cccc(C)c1Nc1nc(N2CCN(c3ccccc3Cl)CC2)n...` | `Cc1cccc(C)c1N` |
| 2 | `CNC(=O)c1cccc(C)c1Nc1nc(N2CCN(c3ccccc3Cl)CC2)n...` | `Nc1nc(N)nc(N)n1` |
| 3 | `CNC(=O)c1cccc(C)c1Nc1nc(N2CCN(c3ccccc3Cl)CC2)n...` | `CN1CCN(C)CC1` |
| 4 | `CNC(=O)c1cccc(C)c1Nc1nc(N2CCN(c3ccccc3Cl)CC2)n...` | `Nc1ccccc1Cl` |
| ... | ... | ... |
| 27410 | `NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=...` | `Cc1ccccc1` |
| 27411 | `COCCOc1cnc2ccc([C@H](C)c3nnc4c(F)cc(-c5cc(C)no...` | `Cc1ccc2ncc(O)cc2c1` |
| 27412 | `COCCOc1cnc2ccc([C@H](C)c3nnc4c(F)cc(-c5cc(C)no...` | `Cc1cc(F)c2nnc(C)n2c1` |
| 27413 | `COCCOc1cnc2ccc([C@H](C)c3nnc4c(F)cc(-c5cc(C)no...` | `Cc1cc(C)on1` |


In [16]:
ring_fragments_pdb_unique.to_csv('../../data/fragments/unique/pdb_below_0_85_fragments_no_duplicated.csv', index=False)