# Enrich the data with compound attributes extracted from the PubChem database

In particular, the so-called "fingerprint" of a compound might yield a distance measure between compounds that is informative with respect to their toxicity.
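One common way to turn two binary fingerprints into a distance is the Tanimoto (Jaccard) coefficient: the number of bits set in both fingerprints divided by the number set in either. The sketch below is an illustration only, not necessarily the measure used later in this analysis. It assumes that the fingerprint is a hex string whose first 8 characters are a length prefix rather than structural bits; the function names are hypothetical.

```python
def fingerprint_bits(fp_hex):
    """Return the fingerprint bits as one integer, without the prefix."""
    # Assumption: the first 8 hex characters encode the bit length
    # and carry no structural information, so they are dropped.
    return int(fp_hex[8:], 16)


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two hex-encoded fingerprints."""
    a = fingerprint_bits(fp_a)
    b = fingerprint_bits(fp_b)
    shared = bin(a & b).count("1")   # bits set in both fingerprints
    either = bin(a | b).count("1")   # bits set in at least one
    # Two all-zero fingerprints are treated as dissimilar here (a convention).
    return shared / either if either else 0.0


def tanimoto_distance(fp_a, fp_b):
    """Distance in [0, 1]; 0 means identical bit patterns."""
    return 1.0 - tanimoto(fp_a, fp_b)


# Toy fingerprints (not real PubChem data): 0xf0 = 11110000, 0x30 = 00110000
print(tanimoto("00000371" + "f0", "00000371" + "30"))  # 0.5
```

Trailing padding bits, if present, are zero in every fingerprint and therefore do not change the result.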

In [115]:
import pubchempy as pcp
import pandas as pd

final_db = pd.read_csv('data/processed/final_db_processed.csv')
cas_to_pubchemcid = pd.read_csv('data/processed/cas_to_pubchemcid.csv')
# Drop the first column; only the CAS number and the CID are needed.
cas_to_pubchemcid = cas_to_pubchemcid.drop(columns=cas_to_pubchemcid.columns[0])
# Use -1 for CAS numbers without a CID so the column can be cast to int.
cas_to_pubchemcid['cid'] = cas_to_pubchemcid['cid'].fillna(-1).astype(int)
cas_to_pubchemcid.head()

,cas,cid
0,100-00-5,7474
1,100-01-6,7475
2,100-02-7,980
3,100-44-7,7503
4,100-47-0,7505


Extract the fingerprint of each compound from the PubChem database. This might take a few hours, so each result is appended to a file as soon as it is retrieved; if something goes wrong, the progress made so far is kept.

In [100]:
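def get_finger(x):
    # A CID of -1 marks a CAS number without a PubChem match; skip the lookup.
    if x != -1:
        y = pcp.Compound.from_cid(x).fingerprint
    else:
        y = "none"
    # Append each result immediately so an interrupted run keeps its progress.
    with open('data/processed/pubchemcid_finger.csv', 'a') as fd:
        fd.write(f'{x},{y}\n')
    return y

# One PubChem request per row; this is the slow step.
c = cas_to_pubchemcid.cid.apply(func=get_finger)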

KeyboardInterrupt: 
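Since the results file is append-only, an interrupted run like the one above could in principle be resumed without repeating requests that already succeeded. The following is a minimal sketch of that idea, not the code used in this notebook: `load_done` and `fetch_missing` are hypothetical helpers, and `fetch` stands for any function that maps a CID to its fingerprint.

```python
import csv
import os


def load_done(path):
    """Return the set of CIDs already present in the results file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as fd:
        return {int(row[0]) for row in csv.reader(fd) if row}


def fetch_missing(cids, path, fetch):
    """Fetch and append fingerprints only for CIDs not yet in the file."""
    done = load_done(path)
    with open(path, "a", newline="") as fd:
        writer = csv.writer(fd, lineterminator="\n")
        for cid in cids:
            if cid in done:
                continue  # already written by this or an earlier run
            # -1 marks a CAS number without a PubChem match, as above.
            fingerprint = fetch(cid) if cid != -1 else "none"
            writer.writerow([cid, fingerprint])
            fd.flush()  # push the row to disk before the next request
            done.add(cid)
```

With pubchempy, `fetch` could be `lambda cid: pcp.Compound.from_cid(int(cid)).fingerprint`. Note that this variant writes each CID only once, so repeated CIDs (including -1) produce a single row.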

Here we read a temporary file containing the partial results obtained so far from the code above, which has not finished running yet.

In [101]:
# Read the fingerprint as text so that leading zeros are preserved.
tmp = pd.read_csv('data/processed/pubchemcid_finger_temp.csv', names=['cid','fingerprint'], dtype={'fingerprint': str})
tmp.head()

,cid,fingerprint
0,7474,0000037180623000040000000000000000000000000000...
1,7475,0000037180633000000000000000000000000000000000...
2,980,0000037180623000000000000000000000000000000000...
3,7503,0000037180600000040000000000000000000000000000...
4,7505,0000037180620000000000000000000000000000000000...


Join the CAS numbers to the fingerprints on the CID, then join the ecotoxicological data to the result on the CAS number. Both are inner joins (the `pd.merge` default), so records without a matching fingerprint are dropped.

In [113]:
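# Attach the fingerprint to each CAS number via the CID, then drop the CID.
merged = (pd.merge(cas_to_pubchemcid, tmp, on='cid')
          .drop(columns=['cid'])
          .rename(columns={'cas': 'test_cas'}))
# Attach the fingerprint to each ecotoxicological record via the CAS number.
final_db_update = pd.merge(final_db, merged, on='test_cas')
final_db_update.columns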

Index(['test_cas', 'species', 'conc1_type', 'exposure_type',
       'obs_duration_mean', 'conc1_mean', 'atom_number', 'alone_atom_number',
       'bonds_number', 'doubleBond', 'tripleBond', 'ring_number', 'Mol',
       'MorganDensity', 'LogP', 'class', 'tax_order', 'family', 'genus',
       'fingerprint'],
      dtype='object')
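Because `pd.merge` performs an inner join by default, rows whose key has no fingerprint silently drop out of a merge like the one above. A minimal sketch for counting such rows, using small made-up frames (not the real data) and a left join with `indicator=True`:

```python
import pandas as pd

# Toy stand-ins for the CAS-to-CID table and the partial fingerprint file.
cas_to_cid = pd.DataFrame({
    "cas": ["100-00-5", "100-01-6", "999-99-9"],
    "cid": [7474, 7475, -1],
})
fingers = pd.DataFrame({
    "cid": [7474],
    "fingerprint": ["00000371f0"],
})

# A left join keeps every CAS number; the _merge column records whether
# a fingerprint was found ("both") or not ("left_only").
merged = pd.merge(cas_to_cid, fingers, on="cid", how="left", indicator=True)
n_missing = int((merged["_merge"] == "left_only").sum())
print(f"{n_missing} of {len(merged)} CAS numbers have no fingerprint yet")
```

The same check could be applied to the merge on `test_cas` to see how many ecotoxicological records are lost while the fingerprint file is still incomplete.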