# Step 1)
Download the Elasticsearch version that's compatible with your system from Elastic here: https://www.elastic.co/downloads/elasticsearch (Elasticsearch is built on top of Apache Lucene.)

I also installed the Python bindings for Elasticsearch using pip. The Python bindings allow for rapid integration with the other Python packages that I am using for chemistry (RDKit, sklearn, etc.).

Once you have downloaded Elasticsearch, make sure the Elasticsearch server is up and running (on my Windows machine it listens on port 9200).
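
One quick way to confirm the server is reachable is to send it an HTTP request. The snippet below is a minimal sketch using only the standard library; the helper name `server_is_up` and the default URL `http://localhost:9200` are assumptions for illustration, so adjust them to your setup.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if something answers HTTP requests at `url` (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, but with an error status (for example 401
        # when security is enabled), so it is at least running.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar problems.
        return False


print(server_is_up())
```

If this prints `False`, the server is probably not started yet or is listening on a different host or port.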


# Step 2)

In this step we define a Python class that holds information about a fragmentized molecule. Working with Python objects helps in later steps when we bulk load the data. You can customize the class however you want and add further properties such as molecular weight or ALogP. Extra properties give Elasticsearch more fields to index, which may make it easier to find your molecule of choice.

In [4]:
import textwrap
class FragmentData():
    """
    Basic information about a fragmentized molecule: the SMILES string,
    the list of fragments, the ChEMBL ID, the standard InChI, and the mol_regno.
    """
    def __init__(self, chembl_id, smiles, fragments, standard_inchi, mol_regno):
        self.id = chembl_id
        self.smiles = smiles
        self.fragments = fragments
        self.standard_inchi = standard_inchi
        self.mol_regno = mol_regno

    def __str__(self):
        return textwrap.dedent("""\
            Id: {}
            smiles: {}
            fragments: {}
            standard_inchi: {}
            mol_regno: {}
        """).format(self.id, self.smiles, self.fragments, self.standard_inchi,
                    self.mol_regno)
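
As a sketch of the customization mentioned above, the same kind of record could also be written as a dataclass with one extra property. The class name `FragmentRecord`, the `mol_weight` field, and the sample values are all hypothetical and are not part of the original data set.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class FragmentRecord:
    """Hypothetical variant of FragmentData with one extra, optional property."""
    chembl_id: str
    smiles: str
    fragments: List[str]
    standard_inchi: str
    mol_regno: int
    mol_weight: Optional[float] = None  # assumed extra field, not in the original data


record = FragmentRecord(
    chembl_id="CHEMBL0000000",       # made-up identifier
    smiles="CCO",                    # ethanol, used only as a small example
    fragments=["[Xe]CC", "[Xe]O"],   # illustrative placeholders, not real output
    standard_inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    mol_regno=0,                     # made-up number
    mol_weight=46.07,
)

# asdict() turns the record into a plain dict that could serve as a document body.
doc = asdict(record)
print(doc)
```

A dataclass generates `__init__` and a readable `__repr__` automatically, so adding a property is a one-line change.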

# Step 3) 

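In the following step we create the Elasticsearch index and document type where we store our fragmentized molecules. You can learn more about the basics of Elasticsearch here: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html. Note that document types (mapping types) were deprecated in Elasticsearch 7 and removed in Elasticsearch 8, so the `_type` field used below only applies to older versions.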

In [14]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from typing import List
from elasticsearch.helpers import bulk
import json

In [7]:
INDEX_NAME = 'chembl_data'  ### this would be analogous to a database
DOC_TYPE = 'mol_frags'   ### this would be analogous to a collection within a database of similar "documents"
es = Elasticsearch()

In [8]:
es.indices.delete(index=INDEX_NAME, ignore=404)  # delete the index if it already exists
# Create the index with dynamic mappings, meaning we don't define the document field types up front
es.indices.create(             
        index=INDEX_NAME,
        body={
            'mappings': {},
            'settings': {},
        },
    )

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'chembl_data'}
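
For contrast with the dynamic mapping used above, here is a sketch of what an explicit mapping for the same fields might look like. The field types are my own guess at sensible choices (`keyword` for exact-match strings, `long` for the integer ID), and the layout assumes the typeless mapping format of Elasticsearch 7 and later.

```python
# Hypothetical explicit mapping for the fields used in this tutorial.
# Assumes the typeless mapping format of Elasticsearch 7 and later.
explicit_body = {
    "mappings": {
        "properties": {
            "smiles": {"type": "keyword"},
            "fragments": {"type": "keyword"},   # a field can hold a list of values
            "standard_inchi": {"type": "keyword"},
            "mol_regno": {"type": "long"},
        }
    },
    "settings": {},
}

# It could be passed in place of the empty body, for example:
# es.indices.create(index=INDEX_NAME, body=explicit_body)
print(sorted(explicit_body["mappings"]["properties"]))
```

With `keyword` fields the strings are stored as single exact terms rather than being split into words, which is usually what you want for SMILES and fragment strings.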

### Here we define some helper functions to help load the data

In [9]:
def load(es, input_data):
    success, _ = bulk(es, set_data(input_data))
    print(success)

In [10]:
def set_data(fragment_data):
    for frag in fragment_data:
        yield {
            "_op_type": "create",
            "_index": INDEX_NAME,
            "_type": DOC_TYPE,
            "_id":frag.id,
            "_source":{
                "smiles": frag.smiles,
                "fragments": frag.fragments,
                "standard_inchi": frag.standard_inchi,
                "mol_regno": frag.mol_regno,
            }
        }

In [11]:
def fragment_loader(fragment_json):
    _all_frags = []
    with open(fragment_json) as fragment_file:
        for frag in json.load(fragment_file):
            json_data = {key.lower():val for key,val in frag.items()}
            frag_data = FragmentData(**json_data)
            _all_frags.append(frag_data)
    return _all_frags
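
The loader relies on each JSON entry having keys that, once lower-cased, match the `FragmentData` constructor arguments. The sketch below illustrates that step on an in-memory record; the upper-case key spelling and the sample values are assumptions, since the real file is not shown here.

```python
import json

# Made-up record; the upper-case key spelling is an assumption about the file.
raw = json.loads("""
[{"CHEMBL_ID": "CHEMBL0000000",
  "SMILES": "CCO",
  "FRAGMENTS": ["[Xe]CC", "[Xe]O"],
  "STANDARD_INCHI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
  "MOL_REGNO": 0}]
""")

expected = {"chembl_id", "smiles", "fragments", "standard_inchi", "mol_regno"}
normalized = []
for entry in raw:
    # Same lower-casing step as in fragment_loader.
    lowered = {key.lower(): val for key, val in entry.items()}
    missing = expected - set(lowered)
    if missing:
        raise ValueError(f"record is missing keys: {sorted(missing)}")
    normalized.append(lowered)

print(normalized[0]["chembl_id"])
```

Checking the keys up front gives a clearer error than the `TypeError` that `FragmentData(**json_data)` would raise for a missing or unexpected key.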

### The set_data function is a generator: it takes the list returned by fragment_loader and builds the bulk-action documents on the fly. The example below shows what the first one looks like

In [15]:
next(set_data(fragment_loader('CHEMBL_Molecules_fragments_950000_1000000.json')))

{'_op_type': 'create',
 '_index': 'chembl_data',
 '_type': 'mol_frags',
 '_id': 'CHEMBL1644781',
 '_source': {'smiles': 'C\\C(=C/c1cc(F)c(OCCC(F)F)cc1F)\\C(=O)N[C@@H]2[C@H](O)[C@@H](O)[C@H]3OCO[C@H]3[C@@H]2O',
  'fragments': ['[Xe]O[Xe]',
   '[Xe]c1cc(F)c([Xe])cc1F',
   '[Xe]C[Xe]',
   '[Xe][C@@H]1[C@H](O)[C@@H](O)[C@H]2OCO[C@H]2[C@@H]1O',
   '[Xe]C(=O)C([Xe])C',
   '[Xe]N[Xe]',
   '[Xe]CCC(F)F'],
  'standard_inchi': 'InChI=1S/C20H23F4NO7/c1-8(4-9-5-11(22)12(6-10(9)21)30-3-2-13(23)24)20(29)25-14-15(26)17(28)19-18(16(14)27)31-7-32-19/h4-6,13-19,26-28H,2-3,7H2,1H3,(H,25,29)/b8-4+/t14-,15+,16-,17-,18+,19-/m1/s1',
  'mol_regno': 1059788}}

### Finally, we bulk load the saved fragmentized molecules into elasticsearch 

In [16]:
fragment_json = 'CHEMBL_Molecules_fragments_950000_1000000.json'
load(es,fragment_loader(fragment_json))

48520


Yay! This means 48520 documents (the same number as entries in the file) were loaded into our Elasticsearch index!
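
To double-check the number from the server side, you could ask Elasticsearch how many documents the index holds. This is a small sketch: `count_documents` is a hypothetical helper name, and it assumes the client exposes `indices.refresh` and `count` as the official Python client does.

```python
def count_documents(client, index_name):
    """Refresh the index, then return how many documents it holds."""
    # Newly indexed documents only become visible to search after a refresh.
    client.indices.refresh(index=index_name)
    return client.count(index=index_name)["count"]


# Usage with the objects defined earlier (needs a running server):
# print(count_documents(es, INDEX_NAME))
```

If the returned count matches the number printed by the bulk load, every document made it into the index.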