## Purpose of this notebook

Once conformer generation is complete, we analyze the activity of the building blocks at each position. The activity of a building block is summarized by a "P(active)" value: the fraction of all compounds containing that building block at a given position that are active.

$$ P(active) = \frac{n_{active}}{n_{active} + n_{inactive}} $$

With a sufficient number of observations per building block, this metric should be quite robust. The analysis notebooks provide more intuition for the metric and demonstrate its value.
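As a small worked example of the formula above, the sketch below counts one hypothetical building block in a toy table. The column names `bb1` and `read_count` follow the cells further down; the SMILES strings and read counts are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): three of the four compounds contain 'CCO' at position 1
toy = pd.DataFrame({
    'bb1': ['CCO', 'CCO', 'CCO', 'CCN'],
    'read_count': [5, 0, 0, 2],
})

# A compound is treated as active when read_count > 0, as in the cells below
with_bb = toy.loc[toy['bb1'] == 'CCO']
n_active = int((with_bb['read_count'] > 0).sum())
n_inactive = int((with_bb['read_count'] == 0).sum())

# P(active) = n_active / (n_active + n_inactive)
p_active = n_active / (n_active + n_inactive)
print(n_active, n_inactive, p_active)
```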

Input:
- `total_compounds.csv`: list of all compounds that made it through the data cleaning procedure
- `bb1_info_row.csv`: SMILES of building blocks in position 1 that made it through conformer generation
- `bb2_info_row.csv`: SMILES of building blocks in position 2 that made it through conformer generation
- `bb3_info_row.csv`: SMILES of building blocks in position 3 that made it through conformer generation

(If the matrix is not symmetric, the corresponding `*info_col.csv` files must also be loaded.)

Output:
- `bb1_pactive.csv`: SMILES, P(active) value, and stereo SMILES for each building block in `bb1_info_row.csv`
- `bb2_pactive.csv`: SMILES, P(active) value, and stereo SMILES for each building block in `bb2_info_row.csv`
- `bb3_pactive.csv`: SMILES, P(active) value, and stereo SMILES for each building block in `bb3_info_row.csv`

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Split total compounds into two separate dataframes, active and inactive compounds
total_compounds = pd.read_csv('../../output/total_compounds.csv')
actives = total_compounds.loc[total_compounds['read_count'] > 0]
inactives = total_compounds.loc[total_compounds['read_count'] == 0]
print(len(actives))
print(len(inactives))

105543
4745191
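
For context, the two counts printed above give the overall fraction of active compounds, which is a useful baseline when reading the per-building-block values. A quick check using only those two numbers:

```python
n_active_total = 105543      # len(actives), from the output above
n_inactive_total = 4745191   # len(inactives), from the output above

# Overall fraction of active compounds across the whole library
baseline = n_active_total / (n_active_total + n_inactive_total)
print(f"{baseline:.4f}")
```

Roughly 2% of all compounds are active, so a building block with a P(active) well above that level is enriched relative to the library as a whole.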


#### Calculating P(active) value for each building block at each position

Code source: `notebooks/04_04_22_laplaces_rule_of_succession`
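
The source notebook is named after Laplace's rule of succession, which adds one pseudo-count to each outcome so that building blocks with few observations are not assigned extreme values. The cells in this notebook use the unsmoothed ratio; the function below is only a sketch of the smoothed variant, and its name is hypothetical.

```python
def laplace_pactive(n_active: int, n_inactive: int) -> float:
    """Rule-of-succession estimate: (n_active + 1) / (n_active + n_inactive + 2)."""
    return (n_active + 1) / (n_active + n_inactive + 2)

# With no observations the estimate is 0.5 rather than a division by zero
print(laplace_pactive(0, 0))

# With many observations it stays close to the raw ratio n_active / (n_active + n_inactive)
print(laplace_pactive(20, 980), 20 / (20 + 980))
```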

Each loop below filters the full compound table once per building block, so calculating all the P(active) values takes a while to complete.
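
If the per-building-block loops become a bottleneck, the same ratio can be computed in a single pass with a group-by. This is a sketch rather than the method used in the cells below; the helper name `pactive_by_position` is hypothetical, and it assumes the `read_count` and `bb1`/`bb2`/`bb3` columns used in this notebook.

```python
import pandas as pd

def pactive_by_position(compounds: pd.DataFrame, column: str) -> pd.Series:
    """Return P(active) per building block in `column` (hypothetical helper)."""
    is_active = compounds['read_count'] > 0
    # The mean of a boolean flag within each group equals
    # n_active / (n_active + n_inactive) for that building block
    return is_active.groupby(compounds[column]).mean()

# Toy data (made up) to show the call
toy = pd.DataFrame({
    'bb1': ['A', 'A', 'B', 'B', 'B', 'B'],
    'read_count': [3, 0, 0, 0, 1, 0],
})
bb1_pactive_toy = pactive_by_position(toy, 'bb1')
print(bb1_pactive_toy)
```

One difference to keep in mind: a building block that never appears in the compound table is simply absent from the group-by result, whereas the loop would raise a `ZeroDivisionError` for it.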

In [9]:
# Calculate the P(active) value of each building block in position 1 by counting
# how often it occurs in active compounds and in inactive compounds
bb1 = pd.read_csv('../files/bb1_info_row.csv')
bb1_pactive = {}
print(len(bb1))
for index, bb in enumerate(bb1['SMILES']):
    if index % 100 == 0:
        print(index)
    n_actives = len(actives.loc[actives['bb1'] == bb])
    n_inactives = len(inactives.loc[inactives['bb1'] == bb])
    bb1_pactive[bb] = n_actives/(n_actives + n_inactives)

645
0
100
200
300
400
500
600


In [10]:
# Same calculation for the building blocks in position 2
bb2 = pd.read_csv('../files/bb2_info_row.csv')
bb2_pactive = {}
print(len(bb2))
for index, bb in enumerate(bb2['SMILES']):
    if index % 100 == 0:
        print(index)
    n_actives = len(actives.loc[actives['bb2'] == bb])
    n_inactives = len(inactives.loc[inactives['bb2'] == bb])
    bb2_pactive[bb] = n_actives/(n_actives + n_inactives)

349
0
100
200
300


In [11]:
bb3 = pd.read_csv('../files/bb3_info_row.csv')
# This building block has repeatedly caused issues. The removal is left commented out,
# so it is still included in the calculation below.
#bb3 = bb3.loc[bb3['SMILES'] != 'NC1[C@H]2CC3C[C@H]1CC(O)(C3)C2']
bb3_pactive = {}
print(len(bb3))
for index, bb in enumerate(bb3['SMILES']):
    if index % 100 == 0:
        print(index)
    n_actives = len(actives.loc[actives['bb3'] == bb])
    n_inactives = len(inactives.loc[inactives['bb3'] == bb])
    bb3_pactive[bb] = n_actives/(n_actives + n_inactives)

4572
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900
4000
4100
4200
4300
4400
4500


In [12]:
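# Collect the building block SMILES, its P(active) value, and the SMILES of the best stereoisomer (if applicable).
# Assumes the SMILES in each *_info_row.csv are unique, so the rows line up with bb1, bb2, and bb3.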
bb1_info = pd.DataFrame(list(bb1_pactive.items()), columns=['SMILES', 'P(active)'])
bb1_info['stereo_SMILES'] = bb1['enumerated_SMILES']

bb2_info = pd.DataFrame(list(bb2_pactive.items()), columns=['SMILES', 'P(active)'])
bb2_info['stereo_SMILES'] = bb2['enumerated_SMILES']

bb3_info = pd.DataFrame(list(bb3_pactive.items()), columns=['SMILES', 'P(active)'])
bb3_info['stereo_SMILES'] = bb3['enumerated_SMILES']
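
The assignments above rely on the rows of each new table lining up with the rows of `bb1`, `bb2`, and `bb3`. A sketch of an alternative that looks each value up by SMILES instead, using toy stand-ins for `bb1` and `bb1_pactive` (the values are made up):

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for bb1 and bb1_pactive
bb1_toy = pd.DataFrame({
    'SMILES': ['CCO', 'CCN', 'CCC'],
    'enumerated_SMILES': ['CCO', 'CCN', 'CCC'],
})
bb1_pactive_toy = {'CCO': 0.10, 'CCN': 0.02, 'CCC': 0.00}

# Look up P(active) by SMILES, so the result does not depend on row order
bb1_info_toy = pd.DataFrame({
    'SMILES': bb1_toy['SMILES'],
    'P(active)': bb1_toy['SMILES'].map(bb1_pactive_toy),
    'stereo_SMILES': bb1_toy['enumerated_SMILES'],
})
print(bb1_info_toy)
```

The column order matches the tables built above, so the saved CSV files would keep the same layout.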

In [13]:
bb1_info.to_csv('../../output/bb1_pactive.csv', index=False)
bb2_info.to_csv('../../output/bb2_pactive.csv', index=False)
bb3_info.to_csv('../../output/bb3_pactive.csv', index=False)