# Getting Started with MolAlchemy Bingo - SQLAlchemy ORM

This tutorial demonstrates how to use MolAlchemy with the Bingo PostgreSQL cartridge to perform cheminformatics operations through the SQLAlchemy ORM. You'll learn how to:

- Set up a PostgreSQL database with the Bingo extension
- Define SQLAlchemy models with chemical data types
- Store and query molecular structures
- Perform substructure searches
- Calculate Tanimoto similarity scores and rank molecules by them

The SQLAlchemy ORM provides a high-level, Pythonic interface for working with databases. MolAlchemy extends it with chemical data types and functions that are backed by the Bingo cartridge.

## Prerequisites

Before starting this tutorial, make sure you have:
- A running PostgreSQL instance with the Bingo extension installed (you can use the provided Docker setup)
- MolAlchemy installed
- The `psycopg` driver installed, since the connection URL in this tutorial uses `postgresql+psycopg`

The Bingo cartridge provides fast and efficient chemical search capabilities including exact structure search, substructure search, and similarity search with various fingerprint algorithms.

## Database Setup and Connection

First, let's establish a connection to PostgreSQL and confirm that the Bingo extension is available by querying its version. We'll import the necessary modules from MolAlchemy and SQLAlchemy, then create a database engine and a session factory.

In [1]:
from sqlalchemy import (
    Boolean,
    Integer,
    String,
    engine,
    select,
    text,
)
from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    MappedAsDataclass,
    mapped_column,
    sessionmaker,
)

from molalchemy.bingo import functions, index, types

eng = engine.create_engine(
    "postgresql+psycopg://postgres:example@localhost:5432/postgres"
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=eng)
with SessionLocal() as session:
    print(session.execute(text("SELECT bingo.getversion()")).all())

[('1.34.0.0-g7a99033d2',)]
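
The connection URL above hard-codes the credentials, which is fine for a local Docker container. As a minimal sketch of an alternative, the URL can be assembled from environment variables instead. The helper name `build_db_url` and the `PG*` variable names are assumptions made for this illustration (they follow libpq conventions and are not part of MolAlchemy); the defaults reproduce the URL used in this tutorial.

```python
import os
from urllib.parse import quote


def build_db_url(env=None):
    """Assemble a SQLAlchemy URL for the psycopg driver.

    Hypothetical helper: the PG* variable names are an assumption of this
    sketch, and the defaults match the URL used in this tutorial.
    """
    env = os.environ if env is None else env
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "example")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "postgres")
    # Percent-encode the credentials so characters such as "@" or "/"
    # cannot be mistaken for URL delimiters.
    return (
        f"postgresql+psycopg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# With an empty environment mapping, the tutorial's defaults are used.
print(build_db_url({}))
```

The resulting string can be passed to `create_engine` in place of the literal URL.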


## Defining a Molecule Model with Bingo

Now let's create a SQLAlchemy model to store molecular data. The model uses:

- `molalchemy.bingo.types.BingoMol` for storing molecular structures
- `molalchemy.bingo.index.BingoMolIndex` to create a Bingo index for efficient chemical searches
- Standard SQLAlchemy columns for metadata such as the name and the NSAID flag

The index on the `mol` column lets Bingo accelerate exact, substructure, and similarity searches.

In [2]:
class Base(MappedAsDataclass, DeclarativeBase):
    pass


class Molecule(Base):
    __tablename__ = "molecules"
    __table_args__ = (index.BingoMolIndex("mol_gist_idx", "mol"),)
    id: Mapped[int] = mapped_column(
        Integer, primary_key=True, autoincrement=True, init=False
    )
    name: Mapped[str] = mapped_column(String(100), unique=True)
    mol: Mapped[bytes] = mapped_column(types.BingoMol)
    is_nsaid: Mapped[bool] = mapped_column(Boolean, default=False)


Molecule.__table__.drop(eng, checkfirst=True)
Molecule.metadata.create_all(eng, checkfirst=False)
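
Because `Base` mixes in `MappedAsDataclass`, the model gets a generated `__init__`, and `init=False` on `id` keeps the database-assigned key out of it. The sketch below is only an analogy that uses the standard-library `dataclasses` module rather than SQLAlchemy itself; the class name `MoleculeRecord` is hypothetical and exists purely to show how `init=False` and `default=False` shape the constructor.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MoleculeRecord:
    # Analogous to mapped_column(..., init=False): the id is assigned later
    # (by the database in the real model), so __init__ does not accept it.
    id: Optional[int] = field(init=False, default=None)
    name: str
    mol: str
    # Analogous to mapped_column(Boolean, default=False).
    is_nsaid: bool = False


record = MoleculeRecord(name="Aspirin", mol="CC(=O)OC1=CC=CC=C1C(=O)O")
print(record.id, record.is_nsaid)

try:
    # Passing id explicitly is rejected because the field has init=False.
    MoleculeRecord(id=1, name="Aspirin", mol="CC(=O)OC1=CC=CC=C1C(=O)O")
except TypeError as exc:
    print("rejected:", exc)
```

This is why the sample records later in the tutorial can be unpacked with `Molecule(**d)` without supplying an `id`.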

## Adding Sample Molecular Data

Let's insert some sample pharmaceutical compounds with their SMILES representations. Our dataset includes:

- **Aspirin**: A common NSAID (non-steroidal anti-inflammatory drug)
- **Loratadine**: An antihistamine for allergy treatment  
- **Rofecoxib**: A withdrawn COX-2 inhibitor NSAID
- **Captopril**: An ACE inhibitor for treating hypertension
- **Thalidomide**: A medication with a complex history, now used for certain cancers

Each molecule is stored with its SMILES string, name, and NSAID classification.

In [3]:
data = [
    {"name": "Aspirin", "mol": "CC(=O)OC1=CC=CC=C1C(=O)O", "is_nsaid": True},
    {
        "name": "Loratadine",
        "mol": "O=C(OCC)N4CC/C(=C2/c1ccc(Cl)cc1CCc3cccnc23)CC4",
        "is_nsaid": False,
    },
    {
        "name": "Rofecoxib",
        "mol": "O=C2OC\\C(=C2\\c1ccccc1)c3ccc(cc3)S(C)(=O)=O",
        "is_nsaid": True,
    },
    {"name": "Captopril", "mol": "C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O", "is_nsaid": False},
    {
        "name": "Talidomide",
        "mol": "O=C1c2ccccc2C(=O)N1C3CCC(=O)NC3=O",
        "is_nsaid": False,
    },
]

session = SessionLocal()
mols = [Molecule(**d) for d in data]
session.add_all(mols)
session.commit()
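
The `name` column is declared `unique=True`, so a duplicate name in the input would make the commit fail with an integrity error. As a small, optional sketch, the records can be checked before they are inserted. The helper `find_duplicate_names` and the `sample` list are hypothetical and use only the standard library.

```python
from collections import Counter


def find_duplicate_names(records):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(record["name"] for record in records)
    return [name for name, count in counts.items() if count > 1]


# Illustrative input: the third record repeats the name "Aspirin".
sample = [
    {"name": "Aspirin", "mol": "CC(=O)OC1=CC=CC=C1C(=O)O", "is_nsaid": True},
    {"name": "Captopril", "mol": "C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O", "is_nsaid": False},
    {"name": "Aspirin", "mol": "CC(=O)Oc1ccccc1C(=O)O", "is_nsaid": True},
]
print(find_duplicate_names(sample))  # → ['Aspirin']
```

An empty result means the batch should not violate the unique constraint on its own, although names already stored in the table can still collide.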

## Basic Chemical Queries

Now let's explore different ways to query our molecular data:

### 1. Standard Database Queries
First, a simple query using regular SQLAlchemy operations to find non-NSAID molecules:

In [4]:
# Simple query to get all non-NSAID molecules; .is_(False) renders as "IS false"
session.execute(select(Molecule).where(Molecule.is_nsaid.is_(False))).all()

[(Molecule(id=2, name='Loratadine', mol='O=C(OCC)N4CC/C(=C2/c1ccc(Cl)cc1CCc3cccnc23)CC4', is_nsaid=False),),
 (Molecule(id=4, name='Captopril', mol='C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O', is_nsaid=False),),
 (Molecule(id=5, name='Talidomide', mol='O=C1c2ccccc2C(=O)N1C3CCC(=O)NC3=O', is_nsaid=False),)]

### 2. Exact Molecular Structure Matching
Use `molalchemy.bingo.functions.mol.equals()` to find molecules that exactly match a given SMILES string:

In [5]:
session.execute(
    select(Molecule).where(
        functions.mol.equals(Molecule.mol, "CC(=O)OC1=CC=CC=C1C(=O)O")
    )
).all()

[(Molecule(id=1, name='Aspirin', mol='CC(=O)OC1=CC=CC=C1C(=O)O', is_nsaid=True),)]

### 3. Substructure Searches
Find molecules containing a specific substructure. Here we search for molecules containing sulfur (`S`):

In [6]:
session.execute(
    select(Molecule).where(functions.mol.has_substructure(Molecule.mol, "S"))
).all()

[(Molecule(id=3, name='Rofecoxib', mol='O=C2OC\\C(=C2\\c1ccccc1)c3ccc(cc3)S(C)(=O)=O', is_nsaid=True),),
 (Molecule(id=4, name='Captopril', mol='C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O', is_nsaid=False),)]

## Similarity Searching

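With Bingo, there is no need to create a separate fingerprint column: similarity searches run directly against the `BingoMol` column. The model below therefore has the same columns as `Molecule` and is simply stored in a separate table.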

In [7]:
class Base(MappedAsDataclass, DeclarativeBase):
    pass


class MoleculeFP(Base):
    __tablename__ = "molecules_fp"
    __table_args__ = (index.BingoMolIndex("mol_gist_idx_2", "mol"),)
    id: Mapped[int] = mapped_column(
        Integer, primary_key=True, autoincrement=True, init=False
    )
    name: Mapped[str] = mapped_column(String(100), unique=True)
    mol: Mapped[bytes] = mapped_column(types.BingoMol)
    is_nsaid: Mapped[bool] = mapped_column(Boolean, default=False)


MoleculeFP.__table__.drop(eng, checkfirst=True)
MoleculeFP.__table__.create(eng, checkfirst=True)

### Inserting Data

We insert the same sample molecules into the new table. No fingerprint values are supplied, because the model has no fingerprint column and Bingo searches the stored structures directly:

In [8]:
session = SessionLocal()
session.add_all([MoleculeFP(**d) for d in data])
session.commit()

### Similarity Searching with Tanimoto Coefficient

Now we can perform similarity searches using the Tanimoto coefficient, which measures the overlap between two molecular fingerprints on a scale from 0 to 1.

In this example, we:
1. Pass the SMILES of a query molecule (Desloratadine, the active metabolite of Loratadine) directly to `functions.mol.get_similarity()`
2. Calculate the Tanimoto similarity between the query and every stored molecule
3. Rank results by similarity score in descending order

In [9]:
target_smi = "Clc4cc2c(C(c1ncccc1CC2)=C3CCNCC3)cc4"
similarity_expr = functions.mol.get_similarity(MoleculeFP.mol, target_smi).label(
    "similarity"
)
final_query = select(similarity_expr, MoleculeFP).order_by(similarity_expr.desc())
session.execute(final_query).all()

[(0.8, MoleculeFP(id=2, name='Loratadine', mol='O=C(OCC)N4CC/C(=C2/c1ccc(Cl)cc1CCc3cccnc23)CC4', is_nsaid=False)),
 (0.28846154, MoleculeFP(id=5, name='Talidomide', mol='O=C1c2ccccc2C(=O)N1C3CCC(=O)NC3=O', is_nsaid=False)),
 (0.24369748, MoleculeFP(id=3, name='Rofecoxib', mol='O=C2OC\\C(=C2\\c1ccccc1)c3ccc(cc3)S(C)(=O)=O', is_nsaid=True)),
 (0.17821783, MoleculeFP(id=1, name='Aspirin', mol='CC(=O)OC1=CC=CC=C1C(=O)O', is_nsaid=True)),
 (0.17117117, MoleculeFP(id=4, name='Captopril', mol='C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O', is_nsaid=False))]
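
For intuition about these scores, the Tanimoto coefficient of two fingerprints is the number of bits set in both, divided by the number of bits set in either. The sketch below computes it in plain Python on made-up bit positions. These toy fingerprints are assumptions for illustration only and are not the fingerprints Bingo computes internally.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints share nothing; report 0.0 instead of dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Toy fingerprints (hypothetical bit positions, not real Bingo output).
query_bits = {1, 4, 7, 9}
stored = {
    "close_analogue": {1, 4, 7, 9, 15},
    "unrelated": {2, 3, 9, 20, 21, 22},
}

# Score every stored fingerprint and rank in descending order of similarity.
ranked = sorted(
    ((tanimoto(query_bits, bits), name) for name, bits in stored.items()),
    reverse=True,
)
print(ranked[0])  # → (0.8, 'close_analogue')
```

Here the closest match shares 4 of 5 distinct bits with the query, giving 4 / 5 = 0.8, while the unrelated entry shares only 1 of 9.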

### Inspecting the Generated SQL

We can examine the actual SQL query that SQLAlchemy generates for our similarity search:

In [12]:
print(final_query.compile(eng, compile_kwargs={"literal_binds": True}))

SELECT bingo.getsimilarity(molecules_fp.mol, 'Clc4cc2c(C(c1ncccc1CC2)=C3CCNCC3)cc4', 'Tanimoto') AS similarity, molecules_fp.id, molecules_fp.name, molecules_fp.mol, molecules_fp.is_nsaid 
FROM molecules_fp ORDER BY similarity DESC


## Summary

This tutorial demonstrated the key features of MolAlchemy with the Bingo cartridge:

1. **Database Setup**: Connecting to PostgreSQL with the Bingo extension
2. **Model Definition**: Creating SQLAlchemy models with chemical data types
3. **Data Storage**: Storing molecular structures with metadata
4. **Basic Queries**: Exact matching and substructure searches  
5. **Similarity Search**: Using Tanimoto coefficients to rank molecular similarity


### Key Differences from RDKit

TODO

### Next Steps

- Explore more Bingo functions available in `molalchemy.bingo.functions`
- Try different fingerprint types and similarity metrics
- Implement custom molecular descriptors as computed columns
- Scale up with larger molecular databases
- Compare performance between Bingo and RDKit cartridges

### Additional Resources

- [Bingo PostgreSQL Manual](https://lifescience.opensource.epam.com/bingo/bingo-postgres.html)
- [SQLAlchemy Documentation](https://docs.sqlalchemy.org/)
- [EPAM Indigo Toolkit](https://github.com/epam/Indigo)