<center><font color="blue"><h1>Mutagenicity Dataset</h1></font></center>


    
In this experiment I look at the Mutagenicity dataset, first described in the following PNAS paper:
    
R.D. King, S.H. Muggleton, A. Srinivasan, and M.J.E. Sternberg. Structure-activity relationships derived by machine learning: the use of atoms and their bond connectives to predict mutagenicity by Inductive Logic Programming. Proceedings of the National Academy of Sciences, 93:438-442, 1996.
https://www.doc.ic.ac.uk/~shm/Papers/pnas96.pdf
    
Some drugs are mutagenic (mutagenically active), which means that they could lead to cancer. The aim of the machine learning task is to build a set of rules for predicting the mutagenicity of chemical compounds, using a set of known active and inactive molecules as positive and negative examples and the properties of the molecules (e.g. atom-bond structures) as the background knowledge.
    
<img src="image-1.png" style="width:200px;height:300px;">

The examples and the background knowledge have been encoded as relations in first-order logic. For example, the following molecule (d1):
    
<img src="image-2.png" style="width:200px;height:300px;">
    
has been encoded using the following relations:
    
```    
atm(1, cl).
atm(2, c). 
atm(3, c). 
atm(4, c).
atm(5, c). 
atm(6, c).
atm(8, o).
...
bond(3, 4, s). 
bond(1, 2, s). 
bond(2, 3, d).
...
```  

and, after adding the molecule id, atom type (21, 52, ...), e-charge (0.297, ...) and bond type (single or double), we have:
    
```  
atm(d1, d1_1, cl, 21, 0.297).
atm(d1, d1_2, c, 21, 0.187). 
atm(d1, d1_3, c, 21, -0.143). 
atm(d1, d1_4, c, 21, -0.143).
atm(d1, d1_5, c, 21, -0.143). 
atm(d1, d1_6, c, 21, -0.143).
atm(d1, d1_8, o, 52, 0.98).
...
bond(d1, d1_3, d1_4, s). 
bond(d1, d1_1, d1_2, s). 
bond(d1, d1_2, d1_3, d).
...
```    
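The step from plain atom and bond tuples to facts keyed by molecule id can be sketched in a few lines of Python. This is only an illustration of the encoding shown above, not the script used to build the dataset; the function name `encode_molecule` and the input layout (a dict of atoms and a list of bonds) are assumptions made for the example, and the values are taken from the d1 facts listed above.

```python
def encode_molecule(mol_id, atoms, bonds):
    """Return Prolog-style facts for one molecule (hypothetical helper).

    atoms: dict mapping atom index -> (element, atom_type, charge)
    bonds: list of (atom_index_1, atom_index_2, bond_type) tuples
    """
    facts = []
    # One atm/5 fact per atom; the atom id is prefixed with the molecule id.
    for idx, (element, atom_type, charge) in sorted(atoms.items()):
        facts.append(
            f"atm({mol_id}, {mol_id}_{idx}, {element}, {atom_type}, {charge})."
        )
    # One bond/4 fact per bond, using the same prefixed atom ids.
    for a, b, bond_type in bonds:
        facts.append(f"bond({mol_id}, {mol_id}_{a}, {mol_id}_{b}, {bond_type}).")
    return facts


# A few atoms and one bond of molecule d1, copied from the facts above.
atoms = {1: ("cl", 21, 0.297), 3: ("c", 21, -0.143), 4: ("c", 21, -0.143)}
bonds = [(3, 4, "s")]

for fact in encode_molecule("d1", atoms, bonds):
    print(fact)
```

Prefixing each atom id with the molecule id keeps atom identifiers unique across the whole background-knowledge file, so facts from different molecules can be stored together.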
 
Additional chemical functional groups and background knowledge, e.g. benzene rings, can also be added. The table below shows some of the rules from the 188-compound dataset used in the PNAS paper mentioned above (the head of each rule is 'a molecule is mutagenically active if'):
    
<img src="image-3.png" style="width:700px;height:500px;">

### Import package from root folder

In [1]:
import sys
sys.path.insert(0, '../../')
from pygol import *

### Pre-Processing
1. Define the constants
2. Generate the bottom clauses

In [2]:
constant_1=['a','c','h','o','n','cl']
#P, N = bottom_clause_generation(constant_set = constant_1,  container = "dict",positive_example="pos_example.f", negative_example="neg_example.n")

### Fold Preparation

In [3]:
folds=pygol_folds(folds=10)
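
For readers unfamiliar with k-fold cross-validation, the idea behind splitting the examples into 10 folds can be sketched in plain Python. This is a generic illustration and not how `pygol_folds` is implemented; the helper names `make_folds` and `train_test_split_for_fold`, the shuffling seed and the use of 188 example indices are assumptions made for the sketch.

```python
import random


def make_folds(n_examples, k=10, seed=0):
    """Shuffle example indices and deal them into k roughly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_split_for_fold(folds, test_fold):
    """Use one fold for testing and the remaining folds for training."""
    test = folds[test_fold]
    train = [i for j, fold in enumerate(folds) if j != test_fold for i in fold]
    return train, test


# 188 examples, matching the size of the dataset described above.
folds_sketch = make_folds(188, k=10)
train_idx, test_idx = train_test_split_for_fold(folds_sketch, test_fold=0)
print(len(train_idx), len(test_idx))
```

Each example appears in exactly one test fold, so averaging the per-fold test accuracy gives an estimate that uses every example for testing once.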

### Modelling

In [4]:
model=pygol_cross_validation(folds, file="BK.pl",  k_fold=10, min_pos=2,  
                             constant_set=constant_1, 
                             set_chain=True, max_literals=2,  
                             distinct=True,   max_neg=5)

100%|██████████| 113/113 [00:18<00:00,  6.24it/s]
100%|██████████| 57/57 [00:04<00:00, 12.62it/s]
100%|██████████| 113/113 [00:00<00:00, 1342652.56it/s]
100%|██████████| 57/57 [00:00<00:00, 863087.83it/s]

+----------+ Training +----------+
['active(A):-ind1(A,1.0)', 'active(A):-atm(A,B,c,10,0.108)&bond(A,C,D,2)', 'active(A):-logp(A,3.52)', 'active(A):-logp(A,1.49)', 'active(A):-atm(A,B,c,22,-0.116)&bond(A,C,B,1)', 'active(A):-logp(A,2.52)', 'active(A):-atm(A,B,c,21,-0.115)', 'active(A):-logp(A,3)', 'active(A):-atm(A,B,n,38,0.817)&bond(A,B,C,1)']
+---------------------+------------------+------------------+
|       n = 170       | Positive(Actual) | Negative(Actual) |
| Positive(Predicted) | 103              | 4                |
+---------------------+------------------+------------------+
| Negative(Predicted) | 10               | 53               |
+---------------------+------------------+------------------+
+-------------+-------+
|   Metric    |   #   |
| Accuracy    | 0.918 |
+-------------+-------+
| Precision   | 0.963 |
+-------------+-------+
| Sensitivity | 0.912 |
+-------------+-------+
| Specificity | 0.930 |
+-------------+-------+
| F1 Score    | 0.936 |
+-------------+-------+



+----------+ Testing +----------+
+---------------------+------------------+------------------+
|       n = 18        | Positive(Actual) | Negative(Actual) |
| Positive(Predicted) | 11               | 1                |
+---------------------+------------------+------------------+
| Negative(Predicted) | 1                | 5                |
+---------------------+------------------+------------------+
+-------------+-------+
|   Metric    |   #   |
| Accuracy    | 0.889 |
+-------------+-------+
| Precision   | 0.917 |
+-------------+-------+
| Sensitivity | 0.917 |
+-------------+-------+
| Specificity | 0.833 |
+-------------+-------+
| F1 Score    | 0.917 |
+-------------+-------+


In [5]:
import numpy as np  # imported explicitly rather than relying on the star import above

np.mean(model.accuracy)

0.8889999999999999
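
The metrics reported in the output tables follow directly from the confusion-matrix counts. The sketch below recomputes them with the standard definitions; the function name `confusion_metrics` is an assumption made for the example and is not part of pygol, and the counts are copied from the training and testing tables above.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute standard classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Counts taken from the training (n = 170) and testing (n = 18) tables.
training = confusion_metrics(tp=103, fp=4, fn=10, tn=53)
testing = confusion_metrics(tp=11, fp=1, fn=1, tn=5)

for name, value in testing.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values agree with the tables: for example, the testing accuracy is 16/18, which is the 0.889 also returned by `np.mean(model.accuracy)`.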