# Tutorial: Bag of Bonds 
To learn more about the Bag of Bonds representation, please read the literature reference below:
- DOI: 10.1021/acs.jpclett.5b00831


To build a Bag of Bonds representation with chemreps, start by importing `BagMaker` from `chemreps.bagger` and `bag_of_bonds` from `chemreps.bag_of_bonds`, as shown below.

In [1]:
from chemreps.bagger import BagMaker
from chemreps.bag_of_bonds import bag_of_bonds

Next, we need to make the bags for our dataset. The dataset used here is in the data directory of this repository; if you cloned the repository locally, you should be able to set the path to `'../data/sdf/'`. We then pass `BagMaker` the type of representation we want along with the path to the dataset. Because we want the Bag of Bonds representation, we pass the string `'BoB'`.

Note: For larger datasets this step can take a while, because it has to iterate through the entire dataset to find bag sizes that fit every molecule in it.

In [2]:
dataset = '../data/sdf/'
bagger = BagMaker('BoB', dataset)

Now that we have made our bags and stored them in the object called `bagger`, we can inspect the empty bags with `bagger.bags` and the size of each bag with `bagger.bag_sizes`.

In [3]:
bagger.bags

{'C': [],
 'CC': [],
 'CH': [],
 'H': [],
 'HH': [],
 'O': [],
 'OC': [],
 'OH': [],
 'OO': []}

In [4]:
bagger.bag_sizes

OrderedDict([('C', 7),
             ('CC', 21),
             ('CH', 42),
             ('H', 10),
             ('HH', 45),
             ('O', 2),
             ('OC', 14),
             ('OH', 12),
             ('OO', 1)])
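
The sizes above are consistent with a simple counting rule: a single-element bag must hold the largest number of atoms of that element found in any one molecule, and a pair bag must hold the largest number of atom pairs of that type found in any one molecule. The sketch below reproduces the numbers from atom counts alone. It is not the `BagMaker` implementation: `bag_sizes_sketch` is a hypothetical helper, the molecular formulas are typed in by hand instead of being read from the SDF files, and pair keys are ordered alphabetically (`CO`, `HO`) where `BagMaker` reports `OC` and `OH`.

```python
from collections import Counter
from itertools import combinations_with_replacement


def bag_sizes_sketch(molecules):
    """Return the largest atom and atom-pair counts over all molecules.

    `molecules` is an iterable of element-symbol lists, one per molecule.
    Hypothetical helper for illustration, not part of chemreps.
    """
    sizes = {}
    for atoms in molecules:
        counts = Counter(atoms)
        for a, b in combinations_with_replacement(sorted(counts), 2):
            if a == b:
                # Single-element bag: one slot per atom of this element.
                sizes[a] = max(sizes.get(a, 0), counts[a])
                n_pairs = counts[a] * (counts[a] - 1) // 2
            else:
                n_pairs = counts[a] * counts[b]
            if n_pairs:
                # Pair bag: one slot per distinct atom pair of this type.
                sizes[a + b] = max(sizes.get(a + b, 0), n_pairs)
    return sizes


# Atom lists written out from the molecular formulas of the three files.
molecules = [
    ['C'] * 7 + ['H'] * 6 + ['O'] * 2,   # benzoic acid, C7H6O2
    ['C'] * 4 + ['H'] * 10,              # butane, C4H10
    ['H'] * 2 + ['O'],                   # water, H2O
]

sizes = bag_sizes_sketch(molecules)
print(dict(sorted(sizes.items())))
# {'C': 7, 'CC': 21, 'CH': 42, 'CO': 14, 'H': 10, 'HH': 45, 'HO': 12, 'O': 2, 'OO': 1}
```

Benzoic acid sets most of the sizes, while butane, with its ten hydrogens, sets `H` and `HH`.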

Once we have the bags and bag sizes for the dataset, we can start making our representations. To make a Bag of Bonds representation using `chemreps`, all we need to do is pass `bag_of_bonds` the path to the molecule file, `bagger.bags`, and `bagger.bag_sizes`.

In [5]:
mfiles = dataset + 'butane.sdf'
print(mfiles)
rep = bag_of_bonds(mfiles, bagger.bags, bagger.bag_sizes)
rep

../data/sdf/butane.sdf


array([36.84  , 36.84  , 36.84  , 36.84  ,  0.    ,  0.    ,  0.    ,
        0.    , 23.38  , 23.38  , 23.33  , 14.15  , 14.15  ,  9.195 ,
        0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
        0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
        0.    ,  0.    ,  5.492 ,  5.492 ,  5.492 ,  5.492 ,  5.492 ,
        5.492 ,  5.492 ,  5.492 ,  5.492 ,  5.492 ,  2.775 ,  2.775 ,
        2.775 ,  2.775 ,  2.762 ,  2.762 ,  2.762 ,  2.762 ,  2.76  ,
        2.76  ,  2.752 ,  2.752 ,  2.752 ,  2.752 ,  2.162 ,  2.162 ,
        2.162 ,  2.162 ,  2.145 ,  2.145 ,  2.145 ,  2.145 ,  1.717 ,
        1.717 ,  1.42  ,  1.42  ,  1.42  ,  1.42  ,  1.272 ,  1.272 ,
        0.    ,  0.    ,  0.    ,  0.5   ,  0.5   ,  0.5   ,  0.5   ,
        0.5   ,  0.5   ,  0.5   ,  0.5   ,  0.5   ,  0.5   ,  0.    ,
        0.567 ,  0.567 ,  0.565 ,  0.565 ,  0.565 ,  0.565 ,  0.5635,
        0.5635,  0.4016,  0.4016,  0.4016,  0.4016,  0.3982,  0.3982,
        0.3982,  0.3
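
The values in this output are consistent with the Bag of Bonds definition in the reference above: each atom contributes 0.5·Z^2.4 to its element bag (about 36.8 for carbon, 0.5 for hydrogen), each atom pair contributes Z_i·Z_j/R_ij to its pair bag, and every bag is sorted in descending order and padded with zeros up to its bag size. The following is a minimal pure-Python sketch of that recipe, not the `chemreps` implementation. `bob_sketch` is a hypothetical helper, the three-atom geometry is made up for illustration, the bag sizes are chosen larger than needed to show the padding, and the bag order and padding used by `bag_of_bonds` may differ.

```python
import math

# Nuclear charges for the elements that appear in this tutorial.
Z = {'H': 1, 'C': 6, 'O': 8}


def bob_sketch(symbols, coords, bag_sizes):
    """Build a Bag of Bonds style vector (hypothetical helper, not chemreps)."""
    bags = {key: [] for key in bag_sizes}
    n_atoms = len(symbols)
    for i in range(n_atoms):
        # Element bag: 0.5 * Z^2.4 for each atom.
        bags[symbols[i]].append(0.5 * Z[symbols[i]] ** 2.4)
        for j in range(i + 1, n_atoms):
            # Pair bag: Z_i * Z_j / R_ij, keyed by the sorted element pair.
            r_ij = math.dist(coords[i], coords[j])
            key = ''.join(sorted((symbols[i], symbols[j])))
            bags[key].append(Z[symbols[i]] * Z[symbols[j]] / r_ij)
    rep = []
    for key in sorted(bag_sizes):
        # Sort each bag in descending order, then pad it with zeros.
        values = sorted(bags[key], reverse=True)
        rep.extend(values + [0.0] * (bag_sizes[key] - len(values)))
    return rep


# Made-up right-angle geometry (in angstrom) for an H-O-H molecule.
symbols = ['O', 'H', 'H']
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (0.0, 0.96, 0.0)]

# Bag sizes chosen by hand, larger than this molecule needs.
toy_sizes = {'H': 3, 'HH': 3, 'HO': 3, 'O': 1}

rep = bob_sketch(symbols, coords, toy_sizes)
print(len(rep))  # 10
print(rep)
```

Because every molecule is padded to the same bag sizes, all representations in a dataset have the same length and line up entry by entry.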

### Make representations for multiple molecules
Disclaimer: There may be better ways to accomplish the same objective. You are welcome to use your own method, and to submit an issue or PR if you think we should adopt it.

To make representations for all the molecules in our directory, we will use `glob` and `pandas`. To find out more about these libraries, see the [glob documentation](https://docs.python.org/3/library/glob.html) or [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). We first create an empty list called `rep_list` to hold the filename and representation of each molecule. Next, we loop over the files in the directory, using glob to match our pattern (e.g. all sdf files in our data/sdf/ directory), and build each representation the same way as above. The filename and representation go into a dictionary that is appended to `rep_list`. Once the loop is complete, we store the information in a pandas dataframe.


In [6]:
import glob
import pandas as pd

rep_list = []
# dataset already ends with '/', so only the file pattern is appended
for fname in sorted(glob.iglob(dataset + '*.sdf')):
    print(fname)
    rep = bag_of_bonds(fname, bagger.bags, bagger.bag_sizes)
    # keep each filename next to its representation
    rep_list.append({'Name': fname, 'Rep': rep})

df = pd.DataFrame(rep_list, columns=['Name', 'Rep'])
df

../data/sdf/benzoic_acid.sdf
../data/sdf/butane.sdf
../data/sdf/water.sdf


|   | Name | Rep |
|---|------|-----|
| 0 | ../data/sdf/benzoic_acid.sdf | [36.84, 36.84, 36.84, 36.84, 36.84, 36.84, 36.... |
| 1 | ../data/sdf/butane.sdf | [36.84, 36.84, 36.84, 36.84, 0.0, 0.0, 0.0, 0.... |
| 2 | ../data/sdf/water.sdf | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |


Once our representation information is stored in the pandas dataframe, we can use NumPy to collect the representations into an array that we can then pass to our machine learning method.
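
Converting a column whose cells are arrays with `np.asarray` gives a one-dimensional array of objects, one array per molecule. Many machine learning libraries expect a two-dimensional matrix of shape (molecules, features) instead, and because all representations share one length they can be stacked into that shape. The sketch below shows the difference on stand-in vectors; `df_demo` and its values are hypothetical and are not chemreps output.

```python
import numpy as np
import pandas as pd

# Stand-in data: three "molecules", each with a length-4 vector.
rep_list_demo = [
    {'Name': 'a.sdf', 'Rep': np.zeros(4)},
    {'Name': 'b.sdf', 'Rep': np.ones(4)},
    {'Name': 'c.sdf', 'Rep': np.full(4, 0.5)},
]
df_demo = pd.DataFrame(rep_list_demo, columns=['Name', 'Rep'])

# One-dimensional object array: each element is itself an array.
as_objects = np.asarray(df_demo['Rep'])
print(as_objects.shape, as_objects.dtype)

# Two-dimensional float matrix: one row per molecule.
X = np.stack(df_demo['Rep'].tolist())
print(X.shape, X.dtype)
```

Which form to use depends on what the downstream method accepts.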

In [7]:
import numpy as np
reps = np.asarray(df['Rep'])
reps

array([array([36.84  , 36.84  , 36.84  , 36.84  , 36.84  , 36.84  , 36.84  ,
        0.    , 25.81  , 25.81  , 25.81  , 25.81  , 25.81  , 25.81  ,
       24.75  , 14.9   , 14.9   , 14.9   , 14.9   , 14.9   , 14.9   ,
       14.59  , 14.586 , 12.91  , 12.91  , 12.91  ,  9.61  ,  9.6   ,
        8.48  ,  0.    ,  5.523 ,  5.523 ,  5.523 ,  5.523 ,  5.52  ,
        3.074 ,  2.82  ,  2.8   ,  2.785 ,  2.785 ,  2.785 ,  2.785 ,
        2.785 ,  2.785 ,  2.77  ,  2.752 ,  2.219 ,  2.203 ,  1.87  ,
        1.773 ,  1.769 ,  1.765 ,  1.765 ,  1.765 ,  1.765 ,  1.765 ,
        1.765 ,  1.761 ,  1.757 ,  1.643 ,  1.549 ,  1.549 ,  1.549 ,
        1.549 ,  1.548 ,  1.363 ,  1.299 ,  1.299 ,  1.188 ,  1.125 ,
        1.069 ,  1.0205,  0.    ,  0.5   ,  0.5   ,  0.5   ,  0.5   ,
        0.5   ,  0.5   ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
        0.4097,  0.406 ,  0.403 ,  0.403 ,  0.3047,  0.234 ,  0.2334,
        0.2328,  0.2308,  0.2162,  0.2015,  0.2015,  0.1765,  0.1528,
        0.144