<center><font color="blue"><h1>Mutagenicity Dataset</h1></font></center>


    
In this experiment I look at the Mutagenicity dataset, first described in the following PNAS paper:
    
R.D. King, S.H. Muggleton, A. Srinivasan, and M.J.E. Sternberg. Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by Inductive Logic Programming. Proceedings of the National Academy of Sciences, 93:438-442, 1996.
https://www.doc.ic.ac.uk/~shm/Papers/pnas96.pdf
    
Some drugs are mutagenic (mutagenicity active), which means that they could lead to cancer. The aim of the machine learning task is to build a set of rules for predicting the mutagenicity of chemical compounds, using a set of known active and inactive molecules as positive and negative examples and the properties of the molecules (e.g. atom-bond structures) as the background knowledge.
    
<img src="image-1.png" style="width:200px;height:300px;">

The examples and the background knowledge have been encoded as relations in First-Order Logic. For example, the following molecule (d1):
    
<img src="image-2.png" style="width:200px;height:300px;">
    
has been encoded using the following relations:
    
```    
atm(1, cl).
atm(2, c). 
atm(3, c). 
atm(4, c).
atm(5, c). 
atm(6, c).
atm(8, o).
...
bond(3, 4, s). 
bond(1, 2, s). 
bond(2, 3, d).
…
```  

After adding the molecule id, atom type (21, 52, ..), e-charge (0.297, ..) and bond type (single or double), we have:
    
```  
atm(d1, d1_1, cl, 21, 0.297).
atm(d1, d1_2, c, 21, 0.187). 
atm(d1, d1_3, c, 21, -0.143). 
atm(d1, d1_4, c, 21, -0.143).
atm(d1, d1_5, c, 21, -0.143). 
atm(d1, d1_6, c, 21, -0.143).
atm(d1, d1_8, o, 52, 0.98).
...
bond(d1, d1_3, d1_4, s). 
bond(d1, d1_1, d1_2, s). 
bond(d1, d1_2, d1_3, d).
...
```    
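To make the encoding above concrete, here is a minimal Python sketch that reads a few of these facts back into ordinary data structures. The helper `parse_fact` and the `atoms`/`bonds` containers are hypothetical names introduced only for illustration; they are not part of PyGol, and the fact list is just the subset shown above.

```python
import re

# Hypothetical helper (not part of PyGol): matches ground facts of the
# form "name(arg1, arg2, ...)." and captures the name and argument text.
FACT_RE = re.compile(r"^\s*(\w+)\((.*)\)\.\s*$")

def parse_fact(line):
    """Return (predicate, [args]) for a ground fact, or None if no match."""
    match = FACT_RE.match(line)
    if match is None:
        return None
    name, arg_text = match.groups()
    return name, [arg.strip() for arg in arg_text.split(",")]

# A subset of the facts for molecule d1 shown above.
facts_text = """
atm(d1, d1_1, cl, 21, 0.297).
atm(d1, d1_3, c, 21, -0.143).
atm(d1, d1_8, o, 52, 0.98).
bond(d1, d1_3, d1_4, s).
bond(d1, d1_1, d1_2, s).
bond(d1, d1_2, d1_3, d).
"""

atoms = {}  # atom id -> (element, atom type, e-charge)
bonds = []  # (atom id, atom id, bond type)
for line in facts_text.strip().splitlines():
    parsed = parse_fact(line)
    if parsed is None:
        continue
    name, args = parsed
    if name == "atm":
        _molecule, atom_id, element, atom_type, charge = args
        atoms[atom_id] = (element, int(atom_type), float(charge))
    elif name == "bond":
        _molecule, first, second, bond_type = args
        bonds.append((first, second, bond_type))

print(atoms["d1_1"])
print(bonds)
```

Keeping the molecule id as the first argument of every fact is what lets a single background file hold all molecules at once.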
 
Additional chemical functional groups and background knowledge, e.g. benzene rings, can also be added. The table below shows some of the rules from the 188-compound dataset used in the PNAS paper mentioned above (the head of each rule is 'a molecule is mutagenicity active if'):
    
<img src="image-3.png" style="width:700px;height:500px;">

### Import package from root folder

In [1]:
import sys
sys.path.insert(0, '../../')
from pygol import *

### Pre-Processing
1. Define the constants
2. Generate the bottom clauses

In [2]:
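# 1. Define the constants
constant_1 = ['a', 'c', 'h', 'o', 'n', 'cl']
# 2. Generate the bottom clauses for the positive and negative examples
P, N = bottom_clause_generation(constant_set=constant_1, container="dict",
                                positive_example="pos_example.f",
                                negative_example="neg_example.n")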

100%|██████████| 125/125 [00:04<00:00, 28.09it/s]
100%|██████████| 63/63 [00:01<00:00, 38.38it/s]


### Fold Preparation

In [3]:
folds=pygol_folds(folds=10)
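The notebook does not show how `pygol_folds` partitions the examples. As an illustration only, the sketch below assumes a stratified round-robin split of the 125 positive and 63 negative examples into 10 folds. The function `stratified_folds` is a hypothetical stand-in, not PyGol's implementation.

```python
import random

def stratified_folds(n_pos, n_neg, k=10, seed=0):
    """Hypothetical sketch: split example indices into k test folds,
    dealing positives and negatives out separately so that every fold
    keeps roughly the same class ratio."""
    rng = random.Random(seed)
    pos = list(range(n_pos))
    neg = list(range(n_neg))
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [(pos[i::k], neg[i::k]) for i in range(k)]

folds_sketch = stratified_folds(n_pos=125, n_neg=63, k=10)
for i, (test_pos, test_neg) in enumerate(folds_sketch):
    train_pos = 125 - len(test_pos)
    train_neg = 63 - len(test_neg)
    print(f"fold {i}: test {len(test_pos)}+/{len(test_neg)}-, "
          f"train {train_pos}+/{train_neg}-")
```

Under this assumption some folds hold out 12 positives and 6 negatives, leaving 113 positives and 57 negatives for training, which matches the sizes in the output of the modelling step.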

### Modelling

In [4]:
model=pygol_cross_validation(folds, file="BK.pl",  k_fold=10, min_pos=2,  
                             constant_set=constant_1, 
                             set_chain=True, max_literals=2,  
                             distinct=True, optimize=True, max_neg=5)

100%|██████████| 113/113 [00:02<00:00, 47.18it/s]
100%|██████████| 57/57 [00:00<00:00, 61.67it/s]
100%|██████████| 113/113 [00:00<00:00, 1377780.09it/s]
100%|██████████| 57/57 [00:00<00:00, 658609.72it/s]

+----------+ Training +----------+
['active(A):-ind1(A,1.0)', 'active(A):-atm(A,B,c,10,0.108),bond(A,C,C,2)', 'active(A):-atm(A,B,c,29,0.01),bond(A,C,D,1)', 'active(A):-atm(A,B,o,40,-0.388),bond(A,C,D,2)', 'active(A):-atm(A,B,n,38,0.813),bond(A,B,C,2)', 'active(A):-atm(A,B,o,40,-0.386),bond(A,C,D,7)', 'active(A):-atm(A,B,n,38,0.814),bond(A,C,D,2)', 'active(A):-atm(A,B,o,40,-0.382),bond(A,C,D,2)', 'active(A):-atm(A,B,c,21,-0.115),bond(A,C,C,7)', 'active(A):-logp(A,1.49)', 'active(A):-atm(A,B,o,40,-0.383),bond(A,B,C,2)', 'active(A):-atm(A,B,c,29,0.017),bond(A,C,D,1)', 'active(A):-atm(A,B,n,38,0.817),bond(A,C,B,2)', 'active(A):-atm(A,B,o,40,-0.383),bond(A,C,D,2)']
+---------------------+------------------+------------------+
|       n = 170       | Positive(Actual) | Negative(Actual) |
| Positive(Predicted) | 107              | 4                |
+---------------------+------------------+------------------+
| Negative(Predicted) | 6                | 53               |
+---------------------+------------------+------------------+
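The learned hypothesis is printed as a list of rule strings. The sketch below shows one way to split such a string into its head and body literals, for example to check that the `max_literals=2` setting is respected. `split_literals` and `parse_rule` are hypothetical helpers written for this illustration, not PyGol functions.

```python
def split_literals(body):
    """Split a clause body on commas that are not inside parentheses."""
    literals, depth, current = [], 0, ""
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            literals.append(current)
            current = ""
        else:
            current += ch
    if current:
        literals.append(current)
    return literals

def parse_rule(rule):
    """Return (head, [body literals]) for a 'head:-body' rule string."""
    head, _, body = rule.partition(":-")
    return head.strip(), [lit.strip() for lit in split_literals(body)]

# Two rules copied from the training output above.
rules = [
    "active(A):-atm(A,B,c,10,0.108),bond(A,C,C,2)",
    "active(A):-logp(A,1.49)",
]
for rule in rules:
    head, body = parse_rule(rule)
    print(head, "<-", body, f"({len(body)} body literal(s))")
```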


+----------+ Testing +----------+
+---------------------+------------------+------------------+
|       n = 18        | Positive(Actual) | Negative(Actual) |
| Positive(Predicted) | 10               | 1                |
+---------------------+------------------+------------------+
| Negative(Predicted) | 2                | 5                |
+---------------------+------------------+------------------+
+-------------+-------+
|   Metric    |   #   |
| Accuracy    | 0.833 |
+-------------+-------+
| Precision   | 0.909 |
+-------------+-------+
| Sensitivity | 0.833 |
+-------------+-------+
| Specificity | 0.833 |
+-------------+-------+
| F1 Score    | 0.870 |
+-------------+-------+
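The reported metrics follow directly from the test confusion matrix (10 true positives, 1 false positive, 2 false negatives, 5 true negatives). The sketch below recomputes them using the standard definitions. The function name `classification_metrics` is an assumption made for this example and is not taken from PyGol.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Counts read from the testing confusion matrix above (n = 18).
metrics = classification_metrics(tp=10, fp=1, fn=2, tn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With only 18 test molecules in the fold, a single misclassified example shifts accuracy by about 0.056, so these figures should be read with that granularity in mind.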
