# Creating and adding to the multitask database
Individual datasets and their targets are cleaned and linked to a compounds codex, which stores each unique molecule together with its already computed features. Because each molecule is featurized only once, data in the database can be loaded quickly, and the stored features can also speed up work with data that is not yet in the database.
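To make the idea concrete, here is a minimal sketch of a codex as a cache of features keyed by compound. It is not the cytoxnet implementation: `toy_featurize` and `add_to_codex` are hypothetical names, the featurizer is a trivial stand-in, and the SMILES strings are assumed to be canonical already.

```python
# Count featurizer calls so the saving from the cache is visible.
calls = {"n": 0}


def toy_featurize(smiles):
    """Hypothetical stand-in for a real (expensive) featurizer."""
    calls["n"] += 1
    return [len(smiles), smiles.count("C")]


def add_to_codex(codex, smiles_list):
    """Featurize only compounds not yet in the codex; return each row's codex id."""
    ids = []
    for smi in smiles_list:
        if smi not in codex:
            # New compound: assign the next id and compute its features once.
            codex[smi] = {"id": len(codex), "features": toy_featurize(smi)}
        ids.append(codex[smi]["id"])
    return ids


codex = {}
rat_ids = add_to_codex(codex, ["CCO", "BrCCBr", "c1ccoc1"])
algae_ids = add_to_codex(codex, ["BrCCBr", "c1ccsc1"])

print(rat_ids)     # [0, 1, 2]
print(algae_ids)   # [1, 3]
print(calls["n"])  # 4 featurizer calls for 5 dataset rows
```

The second dataset shares `BrCCBr` with the first, so that compound is looked up rather than featurized again. The returned ids play the role of the foreign key shown later in this notebook.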

In [1]:
import cytoxnet.dataprep.io as io
import pandas as pd

## Initialize the compounds database

In [2]:
help(io.create_compound_codex)

Help on function create_compound_codex in module cytoxnet.dataprep.io:

create_compound_codex(db_path='./database', id_col='smiles', featurizers=None, **kwargs)
    Create a compound codex for a combined database.
    
    Creates a master csv file that tracks the unique canonicalized smiles of
    all data in the database, and stores deatures for those data.
    
    Parameters
    ----------
    db_path : str
        The path to the folder to contain database files. Will create direcory
        if it does not exist.
    id_col : str
        The column in all dataframes representing the compound id.
    featurizers : str or list of str
        The featurizer/s to initialize the compounds codex with.



- The folder that stores the cleaned data files and the master compounds codex is set with db_path, default './database'.
- The column that contains the unique compound identifiers is set with id_col, default 'smiles'.
- The featurizers specified here will also be computed for any data added later.

In [3]:
io.create_compound_codex(featurizers=['CircularFingerprint'])

In [4]:
pd.read_csv('./database/compounds.csv', index_col=0)

Unnamed: 0,smiles,CircularFingerprint


The codex is an empty file, ready to have datasets added.

## Adding data to the database

In [5]:
help(io.add_datasets)

Help on function add_datasets in module cytoxnet.dataprep.io:

add_datasets(dataframes, names, id_col='smiles', db_path='./database', new_featurizers=None, **kwargs)
    Add a new set of data to the tracked database.
    
    Update the compounds csv with new dataset/s canonicalized, and saves
    csvs to the database folder with foreign keys tracked.
    
    Parameters
    ----------
    dataframes : dataframe or string or list of them
        The datasets to add. If it is a string object, we will attempt to load
        the file at the string path or a file in the package data.
    id_col : str
        The column in all dataframes representing the compound id
    db_path : str
        The path to the folder containing database files.
    new_featurizers : str or list of str, default None
        Featurizer names to apply to the new data as well as all current data.



- Datasets can be added from dataframes in memory, csv files, or datasets already in the package.
- New features can be requested with new_featurizers.
- Each dataset added must be given a name.

In [6]:
## adding datasets from the package
io.add_datasets(['zhu_rat_LD50', 'lunghini_algea_EC50'], names = ['rat', 'algea'])

We can see that the compounds were added to the codex and featurized.

In [7]:
codex = pd.read_csv('./database/compounds.csv', index_col=0)
codex

Unnamed: 0,smiles,CircularFingerprint
0,[O-][N+](=Nc1ccccc1)c1ccccc1,[0. 0. 0. ... 0. 0. 0.]
1,BrC(Br)Br,[0. 1. 0. ... 0. 0. 0.]
2,C=CBr,[0. 0. 0. ... 0. 0. 0.]
3,Brc1ccc(-c2ccc(Br)c(Br)c2Br)c(Br)c1Br,[0. 0. 0. ... 0. 0. 0.]
4,S=C=Nc1ccc(Br)cc1,[0. 0. 0. ... 0. 0. 0.]
...,...,...
8239,c1ccc2c3c(ccc2c1)-c1cccc2cccc-3c12,[0. 0. 0. ... 0. 0. 0.]
8240,c1ccc2cc3c(ccc4ccccc43)cc2c1,[0. 0. 0. ... 0. 0. 0.]
8241,c1ccoc1,[0. 0. 0. ... 0. 0. 0.]
8242,c1ccc2[nH]cnc2c1,[0. 0. 0. ... 0. 0. 0.]


Additionally, the cleaned datasets were added to the database under the specified names, and contain the foreign key that matches them to the compounds codex.

In [8]:
algea = pd.read_csv('./database/algea.csv', index_col=0)
algea

Unnamed: 0,chemical_formula,smiles,casnum,molecular_weight,species,algea_EC50,units,source,foreign_key
0,C10H10Br2O2,BrC(Br)c1ccccc1OCC1CO1,30171-80-3,321.993195,algea,-0.879477,log(mg/L),"NITE, Literature set",7342
1,C8H7Br,BrC=Cc1ccccc1,103-64-0,183.045181,algea,3.919991,log(mg/L),ECHA,4438
2,C9H15Br6O4P,O=P(OCC(Br)CBr)(OCC(Br)CBr)OCC(Br)CBr,126-72-7,697.610779,algea,0.875469,log(mg/L),"NITE, ECOTOX, OASIS, Literature set",10
3,C9H9Br,BrCC=Cc1ccccc1,4392-24-9,197.071762,algea,2.940220,log(mg/L),Literature set,7343
4,C2H4Br2,BrCCBr,106-93-4,187.861160,algea,3.255786,log(mg/L),"ECHA, ECOTOX",12
...,...,...,...,...,...,...,...,...,...
1435,C4H4S,c1ccsc1,110-02-1,84.139557,algea,4.382027,log(mg/L),"NITE, VEGA, Literature set",3468
1436,C7H6N2,c1ccc2[nH]cnc2c1,51-17-2,118.135941,algea,3.288402,log(mg/L),ECHA,8242
1437,C7H5NS,c1ccc2scnc2c1,95-16-9,135.186295,algea,3.885679,log(mg/L),"ECHA, NITE, ECOTOX, OASIS, Literature set, VEGA",3469
1438,C2H3N3,c1nc[nH]n1,288-88-0,69.065323,algea,3.942552,log(mg/L),"ECHA, ECOTOX",2862


In [9]:
## a specific example in the dataset
print('Example in dataset: ')
print('=====================')
print(algea.iloc[1200])
print('Foreign key: ')
print('=====================')
print(algea.iloc[1200]['foreign_key'])
print('That compound with features in the codex: ')
print('=====================')
print(codex.loc[algea.iloc[1200]['foreign_key']])

Example in dataset: 
chemical_formula                        C9H9N5
smiles                 Nc1nc(N)nc(-c2ccccc2)n1
casnum                                 91-76-9
molecular_weight                       187.201
species                                  algea
algea_EC50                             4.23048
units                                log(mg/L)
source              ECHA, NITE, Literature set
foreign_key                               8085
Name: 1200, dtype: object
Foreign key: 
8085
That compound with features in the codex: 
smiles                 Nc1nc(N)nc(-c2ccccc2)n1
CircularFingerprint    [0. 0. 0. ... 0. 0. 0.]
Name: 8085, dtype: object
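The lookup above resolves one row at a time. The same foreign key can attach codex features to an entire dataset with a pandas merge. The sketch below uses small in-memory frames with made-up values rather than the real database files; the column names `toy_feature` and `toy_EC50` are illustrative, and it assumes, as in the output above, that `foreign_key` matches the codex index.

```python
import pandas as pd

# Toy stand-in for the codex: the index is the compound id.
codex = pd.DataFrame(
    {"smiles": ["CCO", "BrCCBr", "c1ccsc1"], "toy_feature": [3, 6, 7]},
    index=[0, 1, 2],
)

# Toy stand-in for a cleaned dataset carrying a foreign key into the codex.
dataset = pd.DataFrame(
    {
        "smiles": ["c1ccsc1", "BrCCBr"],
        "toy_EC50": [4.38, 3.26],
        "foreign_key": [2, 1],
    }
)

# Left join: keep every dataset row and pull in the matching codex features.
joined = dataset.merge(
    codex[["toy_feature"]],
    left_on="foreign_key",
    right_index=True,
    how="left",
)
print(joined)
```

Only the feature column is selected from the codex before merging, so the result does not end up with two `smiles` columns.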


We can also add datasets held in memory and add new features.

In [10]:
my_data = pd.DataFrame({'smiles': ['O', 'C'], 'target': [1, 2]})

In [11]:
io.add_datasets([my_data], names = ['my_data'], new_featurizers=['RDKitDescriptors'])
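According to the help text, new featurizers are applied to the new data as well as to all current data. The following is a rough sketch of what that kind of backfill could look like on a toy codex. It is an assumption about the behavior, not the package's code: `toy_length` and `toy_carbons` are hypothetical stand-in featurizers, and the SMILES strings are treated as already canonical.

```python
import pandas as pd


def toy_length(smiles):
    """Hypothetical stand-in featurizer: length of the SMILES string."""
    return len(smiles)


def toy_carbons(smiles):
    """Hypothetical stand-in featurizer: count of aliphatic 'C' characters."""
    return smiles.count("C")


# Existing codex with one feature already computed.
codex = pd.DataFrame({"smiles": ["CCO", "BrCCBr"]})
codex["toy_length"] = codex["smiles"].map(toy_length)

# New data: append only compounds the codex has not seen yet.
new = pd.DataFrame({"smiles": ["O", "C"], "target": [1, 2]})
unseen = new.loc[~new["smiles"].isin(codex["smiles"]), ["smiles"]]
codex = pd.concat([codex, unseen], ignore_index=True)

# Fill in every feature wherever it is missing: the old feature for the new
# rows, and the new feature for all rows.
featurizers = {"toy_length": toy_length, "toy_carbons": toy_carbons}
for name, func in featurizers.items():
    if name not in codex.columns:
        codex[name] = float("nan")
    missing = codex[name].isna()
    codex.loc[missing, name] = codex.loc[missing, "smiles"].map(func)

print(codex)
```

Values that were already stored are left untouched, so only the missing cells trigger a featurizer call.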