## Double check: we only need to gather the small molecules

We retrieved a .csv from DrugBank containing only the DrugBank IDs of small molecules, so we do not need to worry about manually curating the biological ones: we removed them in the first step (`1_lipinski_fda_1997_2021`).

First we import pandas, the only library we need for this data wrangling.

In [1]:
import pandas as pd

We load the `drug_bank_small_molecules` dataset from our local folder.

In [2]:
small_molecules = pd.read_csv("../data/RAW_datasets/RAW_drug_bank_small_molecules.csv")
small_molecules.head(5)

Unnamed: 0,DrugBank ID,Name,Drug Type
0,DB00006,Bivalirudin,SmallMoleculeDrug
1,DB00007,Leuprolide,SmallMoleculeDrug
2,DB00014,Goserelin,SmallMoleculeDrug
3,DB00027,Gramicidin D,SmallMoleculeDrug
4,DB00035,Desmopressin,SmallMoleculeDrug


The dataset has 3 variables: the `DrugBank ID`, the `Name` of the molecule, and the `Drug Type`. Before loading the manually curated dataset generated in the first step, we check the shape of the small_molecules dataset:

In [3]:
print(f"The small_molecules dataset retrieved from DB has the shape/form: {small_molecules.shape}")

The small_molecules dataset retrieved from DB has the shape/form: (11912, 3)


Next we merge the FDA dataset with the small molecules retrieved from DrugBank, matching on `DrugBank ID`. We load the FDA dataset (with SMILES, approved between 1997 and 2021):

In [4]:
#final_fda = pd.read_excel("https://github.com/arturcgs/shared-side-projects/blob/main/_Lipinski/data/manually_curated_datasets/fda_approved_1997_2021_with_all_smiles.xlsx?raw=true", sheet_name = "fda_approved_97_21")
final_fda = pd.read_csv("../data/manually_curated_datasets/fda_approved_1997_2021_only_small_molecules.csv")
final_fda.head(5)

Unnamed: 0.1,Unnamed: 0,active_ingredient_moiety,nda_bla,approval_year,active,DrugBank ID,Drug Groups,SMILES
0,0,troglitazone,NDA,1997,troglitazone,DB00197,approved; investigational; withdrawn,CC1=C(C)C2=C(CCC(C)(COC3=CC=C(CC4SC(=O)NC4=O)C...
1,1,imiquimod,NDA,1997,imiquimod,DB00724,approved; investigational,CC(C)CN1C=NC2=C1C1=C(C=CC=C1)N=C2N
2,2,anagrelide hydrochloride,NDA,1997,anagrelide,DB00261,approved,ClC1=CC=C2N=C3NC(=O)CN3CC2=C1Cl
3,3,nelfinavir mesylate,NDA,1997,nelfinavir,DB00220,approved,[H][C@@]12CCCC[C@]1([H])CN(C[C@@H](O)[C@H](CSC...
4,4,delavirdine mesylate,NDA,1997,delavirdine,DB00705,approved,CC(C)NC1=C(N=CC=C1)N1CCN(CC1)C(=O)C1=CC2=C(N1)...


We remove the `Unnamed: 0` variable, which is there because the file was saved to csv without the argument `index = False`:

In [5]:
final_fda = final_fda.drop("Unnamed: 0", axis = 1)
print(final_fda.shape)
final_fda.head(5)

(565, 7)


Unnamed: 0,active_ingredient_moiety,nda_bla,approval_year,active,DrugBank ID,Drug Groups,SMILES
0,troglitazone,NDA,1997,troglitazone,DB00197,approved; investigational; withdrawn,CC1=C(C)C2=C(CCC(C)(COC3=CC=C(CC4SC(=O)NC4=O)C...
1,imiquimod,NDA,1997,imiquimod,DB00724,approved; investigational,CC(C)CN1C=NC2=C1C1=C(C=CC=C1)N=C2N
2,anagrelide hydrochloride,NDA,1997,anagrelide,DB00261,approved,ClC1=CC=C2N=C3NC(=O)CN3CC2=C1Cl
3,nelfinavir mesylate,NDA,1997,nelfinavir,DB00220,approved,[H][C@@]12CCCC[C@]1([H])CN(C[C@@H](O)[C@H](CSC...
4,delavirdine mesylate,NDA,1997,delavirdine,DB00705,approved,CC(C)NC1=C(N=CC=C1)N1CCN(CC1)C(=O)C1=CC2=C(N1)...
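
To illustrate where the `Unnamed: 0` column comes from, here is a minimal sketch on a toy DataFrame (the values and in-memory buffers are illustrative assumptions, not the project's files). It writes the same frame with and without `index=False` and reads both back:

```python
import io

import pandas as pd

# Toy frame standing in for the curated FDA table (illustrative values only).
toy = pd.DataFrame({"active": ["troglitazone", "imiquimod"],
                    "approval_year": [1997, 1997]})

# By default, to_csv writes the index as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False, only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # includes 'Unnamed: 0'
print(list(reloaded_clean.columns))       # only 'active' and 'approval_year'
```

Passing `index_col=0` to `pd.read_csv` is another way to handle a file that was already saved with its index.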


We then remove the rows that do not have a `DrugBank ID` in our dataset:

In [6]:
print(f"Before removal of the NA rows of DrugBank IDs, the final_fda dataframe has: {final_fda.shape}\n")
final_fda = final_fda[~final_fda["DrugBank ID"].isna()]
print(f"\nAfter the removal of the NA rows, we're left with: {final_fda.shape}")

Before removal of the NA rows of DrugBank IDs, the final_fda dataframe has: (565, 7)


After the removal of the NA rows, we're left with: (563, 7)
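
The same removal can also be written with `dropna`. The sketch below uses a toy frame (the row `hypothetical_a` is made up for illustration) to show that the boolean mask and `dropna(subset=...)` give the same result:

```python
import numpy as np
import pandas as pd

# Toy frame; "hypothetical_a" is an invented row with a missing ID.
toy = pd.DataFrame({
    "active": ["troglitazone", "hypothetical_a", "imiquimod"],
    "DrugBank ID": ["DB00197", np.nan, "DB00724"],
})

# Boolean-mask version, as used in the cell above.
mask_version = toy[~toy["DrugBank ID"].isna()]

# Equivalent dropna version, restricted to the ID column.
dropna_version = toy.dropna(subset=["DrugBank ID"])

print(mask_version.equals(dropna_version))
```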


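There are only two rows that do not have any SMILES: <b>fish oil triglycerides</b> and <b>air polymer-type A</b>.
We now merge the two datasets on `DrugBank ID`. Because this is a left join, every FDA row is kept, and rows whose ID is not in the small_molecules list get NA in `Name` and `Drug Type`, which lets us spot the entries that are not small molecules.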

In [7]:
final_fda_only_small = pd.merge(final_fda, small_molecules, how = "left", on = "DrugBank ID")
final_fda_only_small.head(5)
print(f"After merging the two datasets, the shape is: {final_fda_only_small.shape}")

After merging the two datasets, the shape is: (563, 9)
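
As a hedged illustration of how a left join behaves here, the sketch below merges two toy frames (the `hypothetical_salt` row and its ID are invented). The `indicator` and `validate` arguments are optional extras, not something the notebook uses: `indicator=True` adds a `_merge` column marking unmatched rows, and `validate="many_to_one"` raises an error if the right-hand table has duplicate IDs.

```python
import pandas as pd

# Toy stand-ins for final_fda and small_molecules (illustrative values only).
fda = pd.DataFrame({
    "active": ["troglitazone", "imiquimod", "hypothetical_salt"],
    "DrugBank ID": ["DB00197", "DB00724", "DBSALT000000"],
})
small = pd.DataFrame({
    "DrugBank ID": ["DB00197", "DB00724"],
    "Name": ["Troglitazone", "Imiquimod"],
    "Drug Type": ["SmallMoleculeDrug", "SmallMoleculeDrug"],
})

# Left join: every FDA row survives; unmatched rows get NaN on the right side.
merged = pd.merge(fda, small, how="left", on="DrugBank ID",
                  validate="many_to_one", indicator=True)

print(merged.shape)
print(merged["_merge"].value_counts())
```

The row count of a left join only grows when the right-hand table has duplicate keys, which is why the merged shape above still has 563 rows.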


Checking the rows where the `Drug Type` column was not found (7 structures come back as NA):

In [8]:
final_fda_only_small[final_fda_only_small["Drug Type"].isna()]

Unnamed: 0,active_ingredient_moiety,nda_bla,approval_year,active,DrugBank ID,Drug Groups,SMILES,Name,Drug Type
530,eptifibatide,NDA,1998,eptifibatide,DB00063,approved; investigational,NC(N)=NCCCCC1NC(=O)CCSSCC(NC(=O)C2CCCN2C(=O)C(...,,
531,fomivirsen sodium,NDA,1998,fomivirsen,DB06759,approved; investigational; withdrawn,CC1=CN([C@H]2C[C@H](O[P](O)(=S)OC[C@H]3O[C@H](...,,
537,unoprostone isopropyl,NDA,2000,unoprostone isopropyl,DBSALT001760,approved,CCCCCCCC(=O)CC[C@H]1[C@@H](C[C@@H]([C@@H]1C/C=...,,
539,cefditoren pivoxil,NDA,2001,cefditoren pivoxil,DBSALT001811,approved,[H][C@]12SCC(\C=C/C3=C(C)N=CS3)=C(N1C(=O)[C@H]...,,
543,enfuvirtide,NDA,2003,enfuvirtide,DB00109,approved; investigational,CC[C@H](C)[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H]...,,
544,pentetate calcium trisodium,NDA,2004,pentetate,DB06806,approved,[O-]C(=O)CN(CCN(CC([O-])=O)CC([O-])=O)CCN(CC([...,,
549,pegaptanib sodium,NDA,2004,pegaptanib,DB04895,approved,COCCOC(=O)NCCCC[C@@H](C(=O)NCCCCCCCOP(=O)(C)O)...,,
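
Two of the unmatched rows carry `DBSALT...` identifiers instead of plain `DB...` ones. Whether that is the reason they have no match is an assumption on our part, but it is easy to flag such rows. A minimal sketch, using three rows copied from the table above:

```python
import pandas as pd

# Three of the unmatched rows from the table above.
unmatched = pd.DataFrame({
    "active": ["eptifibatide", "unoprostone isopropyl", "cefditoren pivoxil"],
    "DrugBank ID": ["DB00063", "DBSALT001760", "DBSALT001811"],
})

# Flag identifiers that use the DBSALT prefix.
is_salt_id = unmatched["DrugBank ID"].str.startswith("DBSALT")
salt_rows = unmatched.loc[is_salt_id, "active"].tolist()

print(salt_rows)
```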


In [9]:
final_fda_only_small.shape

(563, 9)

Now we remove the molecules that, on manual inspection, are not small molecules: fomivirsen (531), enfuvirtide (543), and pegaptanib (549).

In [10]:
biological_to_remove = [531, 543, 549]
final_fda_only_small.drop(biological_to_remove, axis = 0, inplace = True)

In [11]:
final_fda_only_small.shape

(560, 9)
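
Hard-coded index labels only stay valid as long as the row order does not change. As an alternative sketch (not what the notebook does), the same rows can be selected by name with `isin`; the toy frame below reuses four `active` values and index labels from the table above:

```python
import pandas as pd

# Toy frame with the original index labels of four unmatched rows.
toy = pd.DataFrame(
    {"active": ["eptifibatide", "fomivirsen", "enfuvirtide", "pegaptanib"]},
    index=[530, 531, 543, 549],
)

# Names identified manually as not being small molecules.
non_small = ["fomivirsen", "enfuvirtide", "pegaptanib"]

# Keep every row whose active ingredient is not in that list.
kept = toy[~toy["active"].isin(non_small)]

print(kept)
```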

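After dropping those rows, we reset the index. Since `drop = True` is not passed, the old index is kept as a new `index` column, so the column count goes from 9 to 10: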

In [12]:
final_fda_only_small.reset_index(inplace = True)
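
For reference, a small sketch on toy data of the difference between `reset_index()` and `reset_index(drop=True)`:

```python
import pandas as pd

# Toy frame with gaps in the index, as left behind by a row drop.
toy = pd.DataFrame({"active": ["a", "b", "c"]}, index=[0, 2, 5])

kept_index = toy.reset_index()              # old index becomes an "index" column
dropped_index = toy.reset_index(drop=True)  # old index is discarded

print(kept_index.shape, dropped_index.shape)
print(list(kept_index.columns))
```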

As a final sanity check, we print the shape and then filter out any rows whose `Drug Groups` are tagged as biological or polymer before saving:

In [13]:
print(f'The curated small-molecule dataset has the shape: {final_fda_only_small.shape}')

The curated small-molecule dataset has the shape: (560, 10)


In [14]:
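# Flag rows whose Drug Groups mention "biological" or "polymer"; na = False treats missing values as non-matches
filters = final_fda_only_small["Drug Groups"].str.contains(pat = "biological|polymer", na = False)
# Assign the result back, otherwise the filtered frame is discarded
final_fda_only_small = final_fda_only_small[~filters].reset_index(drop = True)
print(f"The shape of the final csv is: {final_fda_only_small.shape}")
# Saving the file; index = False avoids writing an extra "Unnamed: 0" column
final_fda_only_small.to_csv("../data/manually_curated_datasets/fda_small_molecules_smiles.csv", index = False)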

The shape of the final csv is: (560, 10)
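
Boolean indexing returns a new DataFrame, so a filter only takes effect when its result is assigned. A minimal sketch on toy data (the `Drug Groups` values are illustrative, and the pattern `"investigational"` is chosen only so that the toy filter matches something):

```python
import pandas as pd

toy = pd.DataFrame({"Drug Groups": ["approved", "approved; investigational", None]})

# na=False makes missing values count as "no match" instead of NaN.
mask = toy["Drug Groups"].str.contains("investigational", na=False)

# Without assignment the filtered frame is thrown away and toy is unchanged.
toy[~mask].reset_index(drop=True)
rows_without_assignment = len(toy)

# With assignment the filter is actually applied.
filtered = toy[~mask].reset_index(drop=True)
rows_with_assignment = len(filtered)

print(rows_without_assignment, rows_with_assignment)
```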
