## <a id='toc1_1_'></a>[Merging the curated data](#toc0_)
In this notebook we concatenate the two manually curated datasets (curated fragments, and curated SMILES and approval years) with the main dataset.

**Table of contents**<a id='toc0_'></a>      
  - [Importing datasets:](#toc1_2_)    
  - [Concatenate](#toc1_3_)    
  - [Saving the dataset](#toc1_4_)    


## <a id='toc1_2_'></a>[Importing datasets:](#toc0_)

In [1]:
# Importing libraries
import pandas as pd
import session_info 

In [2]:
# We need to import three datasets:
main_df = pd.read_csv('../data/pre_manual_curation/main_df.csv')
manual_fragments = pd.read_csv('../data/manual_curated_datasets/curated_fragments.csv')
manual_smiles_year = pd.read_csv('../data/manual_curated_datasets/curated_smiles_year.csv')

In [3]:
main_df.query("pref_name == 'MESNA'")

Unnamed: 0,pref_name,SMILES,molecule_chembl_id,first_approval,molecule_type,indication_class,polymer_flag,withdrawn_flag,inorganic_flag,therapeutic_flag,natural_product,oral,parenteral,topical,mw
39,MESNA,O=S(=O)(O)CCS,CHEMBL1098319,1988,Small molecule,Detoxifying Agent,0,False,0,True,0,True,True,False,141.975836


In [4]:
main_df.iloc[39, 1]

'O=S(=O)(O)CCS'

In [5]:
print("Main dataframe columns: \n",main_df.shape,main_df.columns,"\n")
print("Manual fragments columns: \n",manual_fragments.shape,manual_fragments.columns,"\n")
print("Manual smiles and year columns: \n",manual_smiles_year.shape,manual_smiles_year.columns,"\n")

Main dataframe columns: 
 (2047, 15) Index(['pref_name', 'SMILES', 'molecule_chembl_id', 'first_approval',
       'molecule_type', 'indication_class', 'polymer_flag', 'withdrawn_flag',
       'inorganic_flag', 'therapeutic_flag', 'natural_product', 'oral',
       'parenteral', 'topical', 'mw'],
      dtype='object') 

Manual fragments columns: 
 (341, 15) Index(['pref_name', 'SMILES', 'molecule_chembl_id', 'first_approval',
       'molecule_type', 'indication_class', 'polymer_flag', 'withdrawn_flag',
       'inorganic_flag', 'therapeutic_flag', 'natural_product', 'oral',
       'parenteral', 'topical', 'mw'],
      dtype='object') 

Manual smiles and year columns: 
 (436, 15) Index(['pref_name', 'SMILES', 'molecule_chembl_id', 'first_approval',
       'molecule_type', 'indication_class', 'polymer_flag', 'withdrawn_flag',
       'inorganic_flag', 'therapeutic_flag', 'natural_product', 'oral',
       'parenteral', 'topical', 'mw'],
      dtype='object') 
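
The printed column lists above can also be compared programmatically before concatenating. The following is a minimal sketch, not part of the original notebook: the helper `same_columns` and the tiny frames `a`, `b`, `c` are hypothetical stand-ins for `main_df` and the two curated datasets.

```python
import pandas as pd


def same_columns(frames):
    """Return True when every DataFrame has the same columns in the same order."""
    reference = list(frames[0].columns)
    return all(list(df.columns) == reference for df in frames[1:])


# Hypothetical stand-ins for main_df, manual_fragments and manual_smiles_year
a = pd.DataFrame({"pref_name": ["X"], "SMILES": ["C"]})
b = pd.DataFrame({"pref_name": ["Y"], "SMILES": ["CC"]})
c = pd.DataFrame({"pref_name": ["Z"], "smiles": ["CCC"]})  # mismatched column name

columns_match = same_columns([a, b])     # identical schemas
columns_mismatch = same_columns([a, c])  # "smiles" differs from "SMILES"
```

In the notebook this would presumably be called as `same_columns([main_df, manual_fragments, manual_smiles_year])`; a mismatch would otherwise show up after `pd.concat` as extra columns filled with `NaN`.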



In [6]:
main_df.head(3)

Unnamed: 0,pref_name,SMILES,molecule_chembl_id,first_approval,molecule_type,indication_class,polymer_flag,withdrawn_flag,inorganic_flag,therapeutic_flag,natural_product,oral,parenteral,topical,mw
0,ACETOHYDROXAMIC ACID,CC(=O)NO,CHEMBL734,1983,Small molecule,Enzyme Inhibitor (urease),0,False,0,True,0,True,False,False,75.032028
1,HYDROXYUREA,NC(=O)NO,CHEMBL467,1967,Small molecule,Antineoplastic,0,False,0,True,0,True,False,False,76.027277
2,CYSTEAMINE,NCCS,CHEMBL602,1994,Small molecule,Anti-Urolithic (cystine calculi),0,False,0,True,0,True,False,True,77.02992


In [7]:
manual_fragments.head(3)

Unnamed: 0,pref_name,SMILES,molecule_chembl_id,first_approval,molecule_type,indication_class,polymer_flag,withdrawn_flag,inorganic_flag,therapeutic_flag,natural_product,oral,parenteral,topical,mw
0,MERCAPTOPURINE,O.S=c1[nH]cnc2nc[nH]c12,CHEMBL1200751,1953.0,Small molecule,Antineoplastic,0,False,0,True,0,True,False,False,170.026232
1,CARBACHOL,C[N+](C)(C)CCOC(N)=O.[Cl-],CHEMBL14,1972.0,Small molecule,Cholinergic (ophthalmic),0,False,0,True,0,False,True,False,182.082205
2,AMPHETAMINE SULFATE,CC(N)Cc1ccccc1.O=S(=O)(O)O,CHEMBL501,1960.0,Small molecule,Stimulant (central),0,False,0,True,0,True,False,False,233.072179


In [8]:
manual_smiles_year.head(3)

Unnamed: 0,pref_name,SMILES,molecule_chembl_id,first_approval,molecule_type,indication_class,polymer_flag,withdrawn_flag,inorganic_flag,therapeutic_flag,natural_product,oral,parenteral,topical,mw
0,BISMUTH SUBSALICYLATE,O[Bi]1OC(=O)C2=CC=CC=C2O1,CHEMBL1120,1939.0,Small molecule,Antidiarrheal; Anti-Ulcerative; Antacid,0,False,0,True,0,True,False,False,
1,CARBOPLATIN,[H][N]([H])([H])[Pt]1(OC(=O)C2(CCC2)C(=O)O1)[N...,CHEMBL1351,1989.0,Small molecule,Antineoplastic,0,False,0,True,0,False,True,False,
2,AUROTHIOGLUCOSE,OC[C@H]1O[C@H](S[Au])[C@H](O)[C@@H](O)[C@@H]1O,CHEMBL2354773,not found,Small molecule,Rheumatoid Arthritis,0,False,0,True,1,False,False,False,
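
The previews show `first_approval` as integers in `main_df`, as values such as `1953.0` in the curated fragments, and as the string `not found` in the curated SMILES/year dataset, so the concatenated column will likely hold mixed types. The notebook does not normalize this column; the sketch below is one possible approach, under the assumption that a nullable integer column is wanted downstream.

```python
import pandas as pd

# Hypothetical sample mirroring the three kinds of values seen in the previews
years = pd.Series([1983, "1953.0", "not found"])

# Unparseable entries such as "not found" become NaN, then <NA> in the nullable Int64 dtype
clean_years = pd.to_numeric(years, errors="coerce").astype("Int64")
```

Whether missing years should stay as `not found` or become missing values depends on how later notebooks read this column, which is not shown here.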


## <a id='toc1_3_'></a>[Concatenate](#toc0_)


In [9]:
dataset_list = [main_df, manual_fragments, manual_smiles_year]
final_df = pd.concat(dataset_list)

print(f"Previous dataset shapes: \nMain df: {main_df.shape}\nManual fragments: {manual_fragments.shape}\nManual Smiles and Year: {manual_smiles_year.shape}\n")
print("Expected number of rows after concatenating:",(main_df.shape[0]+manual_fragments.shape[0]+manual_smiles_year.shape[0]))

print("Final shape:", final_df.shape)

Previous dataset shapes: 
Main df: (2047, 15)
Manual fragments: (341, 15)
Manual Smiles and Year: (436, 15)

Expected number of rows after concatenating: 2824
Final shape: (2824, 15)
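
The row count confirms that nothing was lost, but it does not show whether the same molecule appears in more than one source. As a hedged sketch (the notebook does not state whether the datasets overlap, and the frames and identifiers below are hypothetical), duplicates can be listed by `molecule_chembl_id` after concatenating:

```python
import pandas as pd

# Hypothetical frames standing in for main_df and one curated dataset
main = pd.DataFrame({"molecule_chembl_id": ["CHEMBL1", "CHEMBL2"], "SMILES": ["C", "CC"]})
curated = pd.DataFrame({"molecule_chembl_id": ["CHEMBL2", "CHEMBL3"], "SMILES": ["CC", "CCC"]})

# ignore_index=True renumbers the rows 0..n-1 during concatenation
combined = pd.concat([main, curated], ignore_index=True)

# keep=False marks every occurrence of a repeated identifier, not only the later ones
dupes = combined[combined.duplicated(subset="molecule_chembl_id", keep=False)]

n_rows = len(combined)
dup_ids = sorted(dupes["molecule_chembl_id"].unique())
```

Passing `ignore_index=True` has the same effect on the index as a separate `reset_index(drop=True)` call after `pd.concat`.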


In [10]:
# Resetting the index
final_df = final_df.reset_index(drop=True)

# Dropping molecular weight; it is recalculated in a later step.
# drop() returns a new DataFrame, so the result is assigned back.
final_df = final_df.drop(columns=["mw"])
final_df

Unnamed: 0,pref_name,SMILES,molecule_chembl_id,first_approval,molecule_type,indication_class,polymer_flag,withdrawn_flag,inorganic_flag,therapeutic_flag,natural_product,oral,parenteral,topical
0,ACETOHYDROXAMIC ACID,CC(=O)NO,CHEMBL734,1983,Small molecule,Enzyme Inhibitor (urease),0,False,0,True,0,True,False,False
1,HYDROXYUREA,NC(=O)NO,CHEMBL467,1967,Small molecule,Antineoplastic,0,False,0,True,0,True,False,False
2,CYSTEAMINE,NCCS,CHEMBL602,1994,Small molecule,Anti-Urolithic (cystine calculi),0,False,0,True,0,True,False,True
3,DIMETHYL SULFOXIDE,C[S+](C)[O-],CHEMBL504,1978,Small molecule,Anti-Inflammatory (topical),0,False,0,True,0,False,True,False
4,FOMEPIZOLE,Cc1cn[nH]c1,CHEMBL1308,1997,Small molecule,Antidote (alcohol dehydrogenase inhibitor),0,False,0,True,0,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2819,VINDESINE SULFATE,CC[C@]1(O)C[C@@H]2CN(CCc3c([nH]c4ccccc34)[C@@]...,CHEMBL3989543,not found,Small molecule,Antineoplastic,0,False,0,True,1,False,False,False
2820,CYCLOGUANIL PAMOATE,CC1(C)N=C(N)N=C(N)N1c1ccc(Cl)cc1.CC1(C)N=C(N)N...,CHEMBL3989825,not found,Small molecule,,0,False,0,True,0,True,False,False
2821,LENACAPAVIR,CC(C)(C#Cc1ccc(-c2ccc(Cl)c3c(NS(C)(=O)=O)nn(CC...,CHEMBL4594438,2022.0,Small molecule,Antiretroviral,0,False,0,True,0,True,False,False
2822,LANATOSIDE C,CC(=O)O[C@H]1C[C@H](O[C@H]2[C@@H](O)C[C@H](O[C...,CHEMBL506569,not found,Small molecule,,0,False,0,True,1,False,False,False


## <a id='toc1_4_'></a>[Saving the dataset](#toc0_)

In [None]:
final_df.to_csv("../data/manual_curated_datasets/concatenated_dataset.csv", index=False)
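
The `index=False` argument matters when the file is read back: without it, pandas writes the index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below illustrates this with an in-memory buffer instead of the real output path; the one-row frame is a stand-in built from the MESNA record shown earlier.

```python
import io

import pandas as pd

df = pd.DataFrame({"pref_name": ["MESNA"], "SMILES": ["O=S(=O)(O)CCS"]})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
```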

In [12]:
session_info.show()