# Tutorial: Morgan Fingerprints
Morgan fingerprints are commonly used in cheminformatics to represent the topology of a molecule as a hashed, fixed-length bit vector. More information on how Morgan fingerprints are constructed and how to use them with RDKit can be found below:

   - DOI: 10.1021/ci100050t    
   - [RDKit Morgan Fingerprints](http://rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints "RDKit Morgan Fingerprints")

We begin by importing the Morgan fingerprint bit vector constructor from `chemreps.fingerprints`:

In [None]:
from chemreps.fingerprints import morganfp  # RDKit dependency
import pandas as pd
import numpy as np
import glob
import sys

np.set_printoptions(threshold=sys.maxsize)

As an example, we will build a Morgan fingerprint bit vector for butane. The data set that we will be using can be found in the data directory of this repository. If you cloned this repository locally, then you should be able to set the path as '../data/sdf/'.

To create the bit vector, pass the sdf or mol file containing coordinates and connectivity to the `morganfp` function. Two parameters can be adjusted: `radius`, which controls the size of the atom environment (a radius of 2 bonds corresponds to a diameter of 4 bonds), and `nBits`, which sets the number of bits in the vector (the length of the hashed fingerprint).
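To make the roles of the radius and the bit count more concrete, here is a toy sketch of a circular fingerprint written in plain Python. It is not the RDKit or `chemreps` implementation: real ECFP-style fingerprints use richer atom invariants, take bond orders into account, and remove duplicate environments. The function name `toy_circular_fp`, the CRC32 hash, and the hand-written butane graph (heavy atoms only) are all assumptions made for illustration.

```python
import zlib


def toy_circular_fp(atoms, bonds, radius=2, n_bits=1024):
    """Toy circular fingerprint: returns a list of 0/1 values of length n_bits."""
    # Build an adjacency list from the bond list.
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def stable_hash(text):
        # CRC32 is used only because it is deterministic across runs.
        return zlib.crc32(text.encode("utf-8"))

    # Radius 0: each atom is described by its element and number of neighbors.
    ids = [stable_hash(f"{symbol}:{len(neighbors[i])}")
           for i, symbol in enumerate(atoms)]
    seen = set(ids)

    # Each iteration grows every atom environment by one bond.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            around = sorted(ids[j] for j in neighbors[i])
            new_ids.append(stable_hash(f"{ids[i]}|{around}"))
        ids = new_ids
        seen.update(ids)

    # Fold the collected identifiers into a fixed-length bit vector.
    bits = [0] * n_bits
    for identifier in seen:
        bits[identifier % n_bits] = 1
    return bits


# Carbon skeleton of butane: C-C-C-C
butane_atoms = ["C", "C", "C", "C"]
butane_bonds = [(0, 1), (1, 2), (2, 3)]

fp_r0 = toy_circular_fp(butane_atoms, butane_bonds, radius=0)
fp_r2 = toy_circular_fp(butane_atoms, butane_bonds, radius=2)
print(len(fp_r2), sum(fp_r0), sum(fp_r2))
```

A larger radius can only add environments, so every bit set at radius 0 is still set at radius 2. A smaller `n_bits` makes it more likely that two different environments fold onto the same bit.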

To recreate the common ECFP4 representation (diameter of 4 bonds), as in the cell below, we need a radius of 2 and 1024 bits.

In [None]:
mfile = '../data/sdf/butane.sdf'
fp = morganfp(mfile, radius=2, nBits=1024)

The bit vector we have now created is stored in the variable `fp`.

In [None]:
print(fp)
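Printing all 1024 entries is hard to read, so it can help to look only at the bits that are switched on. The sketch below assumes the fingerprint is a 1D NumPy array of 0s and 1s. To stay runnable without RDKit it uses a stand-in vector, `fp_demo`, which is not real butane output.

```python
import numpy as np

# Stand-in for a fingerprint: a 1024-bit vector with a few bits switched on.
fp_demo = np.zeros(1024, dtype=int)
fp_demo[[3, 80, 512]] = 1

on_bits = np.flatnonzero(fp_demo)        # indices of the bits that are set
density = fp_demo.sum() / fp_demo.size   # fraction of bits that are set

print(on_bits)
print(density)
```

If `fp` from the cell above is such an array, the same two calls should work on it directly.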

## Make representations for multiple molecules
Disclaimer: There may be better ways to accomplish the same objective. You are welcome to use your own method, and to submit an issue/PR if you think we should use that method.

To make representations for all the molecules in our directory, we are going to use `glob` and `pandas`. To find out more about these libraries, see the [glob documentation](https://docs.python.org/3/library/glob.html) or [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html).

We first create an empty list called `rep_list`, in which we will store the filename and the representation for each molecule. Next, we loop over the files in the directory, using glob to match our pattern (e.g. all sdf files in our data/sdf/ directory). Inside the loop we build each representation with the same method as above, store the filename and the representation in a dictionary, and append that dictionary to `rep_list`. Once the loop is complete, we store the information in a pandas dataframe.


In [None]:
dataset = '../data/sdf/'

rep_list = []
for fname in sorted(glob.iglob(dataset + '*.sdf')):  # all sdf files in the data directory
    print(fname)
    # Build the fingerprint for the current file
    rep = morganfp(fname, radius=2, nBits=1024)
    rep_list.append({'Name': fname, 'Rep': rep})

df = pd.DataFrame(rep_list, columns=['Name', 'Rep'])
df
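The loop over files can also be wrapped in a small helper, which makes it easy to swap the pattern or the featurizer. This is only a sketch of one possible layout: `build_rep_list` and `fake_featurize` are hypothetical names, and the stand-in featurizer returns the file size rather than a fingerprint so that the example runs without RDKit or `chemreps`.

```python
import tempfile
from pathlib import Path


def build_rep_list(directory, featurize, pattern="*.sdf"):
    """Return a list of {'Name', 'Rep'} dicts, one per matching file."""
    return [
        {"Name": path.name, "Rep": featurize(str(path))}
        for path in sorted(Path(directory).glob(pattern))
    ]


def fake_featurize(path):
    # Stand-in for morganfp: returns the file size in bytes, not a fingerprint.
    return Path(path).stat().st_size


# Demonstrate on a throwaway directory holding two sdf files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("butane.sdf", "ethane.sdf", "notes.txt"):
        Path(tmp, name).write_text("placeholder")
    rows = build_rep_list(tmp, fake_featurize)

print([row["Name"] for row in rows])
```

With the real function, the featurizer would presumably be something like `lambda f: morganfp(f, radius=2, nBits=1024)`, and the returned list can be passed to `pd.DataFrame` in the same way as `rep_list`.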

Once the representations are stored in the pandas dataframe, we can use numpy to build an array of representations that can then be passed to a machine learning method.

In [None]:
reps = np.asarray(df['Rep'])
reps
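Many machine learning libraries expect a 2D array with one row per molecule. A dataframe column in which each entry is itself an array converts to a 1D array of objects, so an extra stacking step may be needed. The sketch below assumes every representation is a 1D NumPy array of the same length. The three 8-bit vectors are made-up stand-ins, not real fingerprints.

```python
import numpy as np

# Hypothetical stand-ins for three fingerprints of 8 bits each.
rep_a = np.array([0, 1, 0, 0, 1, 0, 0, 0])
rep_b = np.array([1, 0, 0, 0, 1, 0, 1, 0])
rep_c = np.array([0, 0, 0, 1, 0, 0, 0, 1])

# An object array holding one array per molecule, similar to what a
# dataframe column of arrays converts to.
column = np.empty(3, dtype=object)
for index, rep in enumerate([rep_a, rep_b, rep_c]):
    column[index] = rep

# Stack into a 2D array: one row per molecule, one column per bit.
X = np.stack(list(column))

print(column.shape)
print(X.shape)
```

Here `column` has shape `(3,)` while `X` has shape `(3, 8)`. Under the same assumption, `np.stack(list(reps))` would be the equivalent step for the array built above.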