**Table of contents**<a id='toc0_'></a>    
- [Importing and Reading Data](#toc1_1_)    
  - [Fragmenting Approved Drugs](#toc1_2_)    
  - [Saving](#toc1_3_)    

## <a id='toc1_1_'></a>[Importing and Reading Data](#toc0_)

In [1]:
import sys
sys.path.append('../utils/')
import pandas as pd
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import HTML
from ring_fragmenter import get_ring_fragments

IPythonConsole.molSize = (600,300)

# to show rdkit.Chem.Mol objects
def show_df(df):
    return HTML(df.to_html(notebook=True))

In [2]:
fda_drugs_df = pd.read_csv('../../data/fda_approved_datasets/fda_approved_drugs.csv')
list_of_smiles = fda_drugs_df['clean_smiles'].to_list()
print(f'The approved drugs dataframe has the following shape {fda_drugs_df.shape}')
print(f'Number of duplicated SMILES: {fda_drugs_df.clean_smiles.duplicated().sum()}')
fda_drugs_df.head()

The approved drugs dataframe has the following shape (1895, 14)
Number of duplicated SMILES: 0


Unnamed: 0,name,chembl_id,clean_smiles,first_approval_year,indication_class,molecule_type,withdrawn_flag,therapeutic_flag,polymer_flag,inorganic_flag,natural_product_flag,oral,parenteral,topical
0,GUANIDINE HYDROCHLORIDE,CHEMBL1200728,N=C(N)N,1939,,Small molecule,False,True,False,False,False,True,False,False
1,ACETOHYDROXAMIC ACID,CHEMBL734,CC(=O)NO,1983,Enzyme Inhibitor (urease),Small molecule,False,True,False,False,False,True,False,False
2,HYDROXYUREA,CHEMBL467,NC(=O)NO,1967,Antineoplastic,Small molecule,False,True,False,False,False,True,False,False
3,CYSTEAMINE,CHEMBL602,NCCS,1994,CYSTEAMINE HYDROCHLORIDE,Small molecule,False,True,False,False,False,True,False,True
4,DIMETHYL SULFOXIDE,CHEMBL504,C[S+](C)[O-],1978,Anti-Inflammatory (topical),Small molecule,False,True,False,False,False,False,True,False


## <a id='toc1_2_'></a>[Fragmenting Approved Drugs](#toc0_)

In [3]:
ring_fragments_drugs = get_ring_fragments(list_of_smiles=list_of_smiles, no_rings_list=False)
ring_fragments_drugs

Unnamed: 0,parent_smiles,ring_fragment
0,Cc1cn[nH]c1,Cc1cn[nH]c1
1,C1CNCCN1,C1CNCCN1
2,Nc1ccncc1,Nc1ccncc1
3,N[C@@H]1CONC1=O,N[C@@H]1CONC1=O
4,Nc1ccncc1N,Nc1ccncc1N
...,...,...
3764,CCCCCCCCCCNCCN[C@@]1(C)C[C@H](O[C@H]2[C@H](Oc3...,Cc1c(O)cc2c(c1O)-c1cc(ccc1O)[C@H]1NC(=O)[C@@H]...
3765,CC(C)CC(NC(=O)C(C)NC(=O)CNC(=O)C(NC=O)C(C)C)C(...,Cc1c[nH]c2ccccc12
3766,CC(C)CC(NC(=O)C(C)NC(=O)CNC(=O)C(NC=O)C(C)C)C(...,Cc1c[nH]c2ccccc12
3767,CC(C)CC(NC(=O)C(C)NC(=O)CNC(=O)C(NC=O)C(C)C)C(...,Cc1c[nH]c2ccccc12


In [4]:
fda_drugs_df['parent_smiles'] = fda_drugs_df['clean_smiles']

# Merge chembl_id into the fragments table via parent_smiles so parent information can be retrieved later
ring_fragments_drugs = ring_fragments_drugs.merge(fda_drugs_df[['parent_smiles', 'chembl_id']], on='parent_smiles', how='left')

# Final dataframe (with chembl_id)
ring_fragments_drugs

Unnamed: 0,parent_smiles,ring_fragment,chembl_id
0,Cc1cn[nH]c1,Cc1cn[nH]c1,CHEMBL1308
1,C1CNCCN1,C1CNCCN1,CHEMBL1412
2,Nc1ccncc1,Nc1ccncc1,CHEMBL284348
3,N[C@@H]1CONC1=O,N[C@@H]1CONC1=O,CHEMBL771
4,Nc1ccncc1N,Nc1ccncc1N,CHEMBL354077
...,...,...,...
3764,CCCCCCCCCCNCCN[C@@]1(C)C[C@H](O[C@H]2[C@H](Oc3...,Cc1c(O)cc2c(c1O)-c1cc(ccc1O)[C@H]1NC(=O)[C@@H]...,CHEMBL507870
3765,CC(C)CC(NC(=O)C(C)NC(=O)CNC(=O)C(NC=O)C(C)C)C(...,Cc1c[nH]c2ccccc12,CHEMBL1201469
3766,CC(C)CC(NC(=O)C(C)NC(=O)CNC(=O)C(NC=O)C(C)C)C(...,Cc1c[nH]c2ccccc12,CHEMBL1201469
3767,CC(C)CC(NC(=O)C(C)NC(=O)CNC(=O)C(NC=O)C(C)C)C(...,Cc1c[nH]c2ccccc12,CHEMBL1201469
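
The left join on `parent_smiles` keeps the fragment table's row count only because every parent SMILES occurs once in the drugs table (the duplicate check earlier reports 0). A minimal sketch of how this could be guarded, assuming pandas' `validate` and `indicator` options of `merge`; the toy frames below are illustrative stand-ins, not the real data:

```python
import pandas as pd

# Toy stand-ins for the real tables (small illustrative subset).
fragments = pd.DataFrame({
    "parent_smiles": ["C1CNCCN1", "Nc1ccncc1", "Nc1ccncc1"],
    "ring_fragment": ["C1CNCCN1", "Nc1ccncc1", "Nc1ccncc1"],
})
drugs = pd.DataFrame({
    "parent_smiles": ["C1CNCCN1", "Nc1ccncc1"],
    "chembl_id": ["CHEMBL1412", "CHEMBL284348"],
})

# validate="many_to_one" raises pandas.errors.MergeError if a parent SMILES
# appears more than once in `drugs`, which would otherwise multiply rows.
merged = fragments.merge(
    drugs,
    on="parent_smiles",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# With unique keys on the right, a left join preserves the left row count.
same_row_count = len(merged) == len(fragments)

# "left_only" in the _merge column flags fragments that found no chembl_id.
unmatched = int((merged["_merge"] == "left_only").sum())
print(same_row_count, unmatched)
```

If `unmatched` were non-zero on the real data, the corresponding rows would carry a missing `chembl_id` and would be worth inspecting before saving.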


Note that if a molecule contains the same ring more than once, that ring fragment is listed once per occurrence, so the table contains repeated (parent, fragment) rows such as the last three rows above.
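
To see which parents contribute a ring more than once, the repeated pairs can be counted. This is a sketch on toy data, assuming the `parent_smiles` and `ring_fragment` column names shown above; the fragment strings are illustrative and are not actual output of `get_ring_fragments`:

```python
import pandas as pd

# Toy fragment table: the second parent contains the same ring twice.
toy = pd.DataFrame({
    "parent_smiles": ["C1CNCCN1", "c1ccc(-c2ccccc2)cc1", "c1ccc(-c2ccccc2)cc1"],
    "ring_fragment": ["C1CNCCN1", "c1ccccc1", "c1ccccc1"],
})

# Count how many times each (parent, fragment) pair occurs.
counts = (
    toy.groupby(["parent_smiles", "ring_fragment"])
    .size()
    .reset_index(name="n_occurrences")
)

# Keep only the pairs where a ring is repeated within one parent.
repeated = counts[counts["n_occurrences"] > 1]
print(repeated)
```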

## <a id='toc1_3_'></a>[Saving](#toc0_)

Save the non-unique table, in which a ring fragment is listed once per occurrence in its parent SMILES.

In [5]:
ring_fragments_drugs.to_csv('../../data/fragments/non_unique/drug_fragments_non_unique.csv', index=False)

Save the deduplicated table, in which each parent is paired with each of its distinct ring fragments only once.

In [6]:
ring_fragments_drugs_unique = ring_fragments_drugs.drop_duplicates().reset_index(drop=True)
ring_fragments_drugs_unique

Unnamed: 0,parent_smiles,ring_fragment,chembl_id
0,Cc1cn[nH]c1,Cc1cn[nH]c1,CHEMBL1308
1,C1CNCCN1,C1CNCCN1,CHEMBL1412
2,Nc1ccncc1,Nc1ccncc1,CHEMBL284348
3,N[C@@H]1CONC1=O,N[C@@H]1CONC1=O,CHEMBL771
4,Nc1ccncc1N,Nc1ccncc1N,CHEMBL354077
...,...,...,...
3585,CCC(C)CCCCC(=O)NC(CCNCS(=O)(=O)O)C(=O)NC(C(=O)...,CC1NC(=O)C(N)CCNC(=O)C(C)NC(=O)C(C)NC(=O)C(C)N...,CHEMBL1201441
3586,CCCCCCCCCCNCCN[C@@]1(C)C[C@H](O[C@H]2[C@H](Oc3...,C[C@@H]1O[C@@H](O)C[C@](C)(N)[C@@H]1O,CHEMBL507870
3587,CCCCCCCCCCNCCN[C@@]1(C)C[C@H](O[C@H]2[C@H](Oc3...,C[C@H]1O[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1O,CHEMBL507870
3588,CCCCCCCCCCNCCN[C@@]1(C)C[C@H](O[C@H]2[C@H](Oc3...,Cc1c(O)cc2c(c1O)-c1cc(ccc1O)[C@H]1NC(=O)[C@@H]...,CHEMBL507870
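
The deduplication above compares entire rows, so a ring shared by two different parents is kept once for each parent. A small sketch of that behavior on toy data; the `P1`/`R1`-style values and the `CHEMBL_X`/`CHEMBL_Y` identifiers are hypothetical placeholders:

```python
import pandas as pd

# Toy table: (P1, R1) is repeated; R1 also occurs in a second parent, P2.
toy = pd.DataFrame({
    "parent_smiles": ["P1", "P1", "P1", "P2"],
    "ring_fragment": ["R1", "R1", "R2", "R1"],
    "chembl_id": ["CHEMBL_X", "CHEMBL_X", "CHEMBL_X", "CHEMBL_Y"],
})

# Full-row deduplication: drops only the repeated (P1, R1) row.
per_parent = toy.drop_duplicates().reset_index(drop=True)

# Deduplicating on the fragment column alone answers a different question:
# which distinct rings occur anywhere in the table.
distinct_rings = toy.drop_duplicates(subset="ring_fragment")["ring_fragment"].tolist()

print(len(toy), len(per_parent), distinct_rings)
```

Full-row deduplication keeps the parent-to-fragment relation intact, which is what the saved file relies on; deduplicating on `ring_fragment` alone would discard that relation.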


In [7]:
ring_fragments_drugs_unique.to_csv('../../data/fragments/unique/drug_fragments_no_duplicated.csv', index=False)