This is the first part of the iSynthesis tutorial.

# Prepare a database of initial reactants with [CGRdb](https://github.com/Pandylandy/CGRdb#readme)

## Setup PostgreSQL
1. Install PostgreSQL 10:

```
sudo apt-get update && sudo apt-get install postgresql-server-dev-10 postgresql-plpython3-10
```

2. Create the schema in psql:

```
sudo -u postgres psql

create schema "zinc";
\q
```
or, as a single command:
```
psql -U postgres -h localhost -c 'create schema "zinc";'
```

3. Build the patched smlar extension:


```
git clone https://github.com/stsouko/smlar.git
cd smlar
sudo su
export USE_PGXS=1
make
make install
```

4. Uncomment and change the following line in /etc/postgresql/10/main/postgresql.conf:

```deadlock_timeout = 10s```

5. Add the following line to /etc/postgresql/10/main/environment:

```PATH = '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'```

6. Restart PostgreSQL:

```
sudo systemctl restart postgresql
```
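
Step 4 can also be scripted rather than edited by hand. The following is a minimal sketch, not part of the original setup: `set_pg_setting` is a hypothetical helper that uncomments a `key = value` line in the text of a configuration file and sets a new value, keeping any trailing comment. The sample text below is illustrative; the real postgresql.conf is much longer, and values containing `#` inside quotes are not handled.

```python
import re


def set_pg_setting(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key = value` set, uncommenting the line if needed."""
    # Match an optionally commented-out assignment of this key;
    # a trailing "# comment" on the same line is captured and preserved.
    pattern = re.compile(
        rf"^[ \t]*#?[ \t]*{re.escape(key)}[ \t]*=[ \t]*[^#\n]*?(?P<tail>[ \t]*#.*)?$",
        re.MULTILINE,
    )
    replacement = f"{key} = {value}"
    new_text, count = pattern.subn(
        lambda m: replacement + (m.group("tail") or ""), conf_text
    )
    if count == 0:
        # The key is not present at all: append it at the end.
        new_text = conf_text.rstrip("\n") + f"\n{replacement}\n"
    return new_text


# Illustrative input, not the real file.
sample = "#deadlock_timeout = 1s\nmax_connections = 100\n"
print(set_pg_setting(sample, "deadlock_timeout", "10s"))
```

The function works on a string, so the file can be read, transformed, and written back with root privileges in whatever way suits your system; PostgreSQL still has to be restarted afterwards, as in step 6.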

### WARNING: PostgreSQL uses the system Python, so these libraries must be installed into the system Python (>=3.7) via sudo:
```
sudo pip install CGRtools==4.1.34
sudo pip install compress-pickle
sudo pip install git+https://github.com/cimm-kzn/CIMtools.git@master
```
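
To confirm that the packages ended up in the interpreter PostgreSQL will use, a short check can be run with the system Python. This is a sketch under assumptions: the helper names are hypothetical, and the import names `CGRtools`, `compress_pickle` and `CIMtools` are assumed to correspond to the packages installed above.

```python
import sys
from importlib.util import find_spec


def python_is_recent(min_version=(3, 7)):
    """True if the running interpreter is at least min_version."""
    return sys.version_info >= min_version


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed above.
    required = ["CGRtools", "compress_pickle", "CIMtools"]
    print("interpreter:", sys.executable)
    print("Python >= 3.7:", python_is_recent())
    print("missing modules:", missing_modules(required))
```

Run it with the same interpreter that `sudo pip` installed into (for example `/usr/bin/python3`), not with the interpreter of a virtual environment.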

## Setup CGRdb
1. Initialize CGRdb (only once)

```
cgrdb init -c '{"host": "localhost", "password": "your password", "user": "postgres"}'
```

2. Create CGRdb schema with all tables: 

```
cgrdb create -c '{"host": "localhost", "password": "password", "user": "postgres"}' -n zinc -f config.json 
```

Example of config.json:

```
{
 "molecule": {"bits_count": 4, "max_length": 6, "min_length": 2, "bits_active": 2, "fingerprint_size": 12},
 "reaction": {"bits_count": 4, "max_length": 6, "min_length": 2, "bits_active": 2, "cgr_dynbonds": 1, "fingerprint_size": 12},
 "packages": ["CGRdbData==4.0.0"],
 "cache_size": 1024,
 "environment": "/path/to/your/project/.venv"
}
```

**NOTE**: the "environment" value in config.json must be the path to the Poetry-created virtual environment of your project.
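
One way to avoid a wrong environment path is to generate config.json from inside the project environment. The sketch below mirrors the example values shown above; `render_config` is a hypothetical helper, and using `sys.prefix` (the root of the virtual environment the interpreter runs in) as the environment path is an assumption about your setup.

```python
import json
import sys


def render_config(environment, fingerprint_size=12, cache_size=1024):
    """Return the text of a config.json that mirrors the example above."""
    fingerprint = {
        "bits_count": 4,
        "max_length": 6,
        "min_length": 2,
        "bits_active": 2,
        "fingerprint_size": fingerprint_size,
    }
    config = {
        "molecule": dict(fingerprint),
        # The reaction section carries one extra key.
        "reaction": dict(fingerprint, cgr_dynbonds=1),
        "packages": ["CGRdbData==4.0.0"],
        "cache_size": cache_size,
        "environment": str(environment),
    }
    return json.dumps(config, indent=1)


if __name__ == "__main__":
    # Inside a virtual environment, sys.prefix points at its root directory.
    print(render_config(sys.prefix))
```

Saved as a script (the name `make_config.py` is only an example), it could be run as `poetry run python make_config.py > config.json` from the project directory.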


## 1. Fill database with building blocks

In [1]:
from CGRdb import load_schema
from CGRtools import SMILESRead, SDFRead
from pony.orm import db_session
from multiprocessing import Queue, Process
from time import sleep

#### Configure settings/paths:

In [2]:
N_CPU = 12
STANDARDIZED_ALL = '../data/standardized_zinc/zinc.smi'
functional_groups_path = '../data/rules/groups.pickle'

### 1.1 Create database connection

In [3]:
db = load_schema('zinc', user='postgres', password='password', host='localhost', database='postgres', port=5432)

  warn('for many-to-many relationship if schema used NEED to define m2m table name')


In [4]:
db.Molecule.select().count()

0

### 1.2 Insert molecules from the SMILES file into the CGRdb database

In [5]:
for n, molecule in enumerate(SMILESRead(STANDARDIZED_ALL)):
    try:
        with db_session:
            db.Molecule(molecule)
    except Exception as e:
        print(n, e)
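
With several hundred thousand records, printed errors are easy to lose. A possible variation, sketched here with plain Python and no database, collects the failures so they can be inspected afterwards. `insert_all` and `fake_insert` are hypothetical names; in the tutorial, the callable passed as `insert_one` would wrap `with db_session: db.Molecule(molecule)`.

```python
def insert_all(records, insert_one):
    """Call insert_one on every record; return (number inserted, list of failures)."""
    inserted = 0
    failures = []
    for n, record in enumerate(records):
        try:
            insert_one(record)
        except Exception as e:
            # Keep the position, the exception type and its message.
            failures.append((n, type(e).__name__, str(e)))
        else:
            inserted += 1
    return inserted, failures


# Stand-in for the database: rejects records it has already seen.
seen = set()


def fake_insert(record):
    if record in seen:
        raise ValueError("duplicate")
    seen.add(record)


count, failed = insert_all(["CCO", "CC", "CCO"], fake_insert)
print(count, failed)
```

The failure list can then be written to a file or retried, instead of being scrolled through in the cell output.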

## 2. Index building blocks for functional groups from reaction rules


In [6]:
from pickle import dump, load

### 2.1 Load functional groups from a downloaded or previously prepared pickle file:


In [7]:
with open(functional_groups_path, 'rb') as f:  # close the file once loading is done
    group_dict = load(f)

### 2.2 Insert functional groups into CGRdb database


Check the number of molecules in the database:

In [10]:
db.Molecule.select().count()

604785

In [15]:
from CGRtools import MoleculeContainer

In [16]:
def query2molecule(queries):
    molecules = []
    for q in queries:
        m = MoleculeContainer()
        for a in q.atoms():
            m.add_atom(a[1].atomic_symbol, a[0])
        for b in q.bonds():
            # m.add_bond(*b)  # for CGRtools <=4.0
            m.add_bond(b[0], b[1], b[2].order[0])  # for CGRtools >=4.1
        molecules.append(m)
    return molecules

@db_session
def insert_groups():
    # class ids are assigned 1..N in the iteration order of group_dict
    for i, group in enumerate(group_dict.values(), start=1):
        db.MoleculeClass(id=i, name='{}, {}'.format(str(query2molecule([group])[0]), group), _type=0)

insert_groups()

### 2.3 Indexing building blocks

In [17]:
groups = load(open(functional_groups_path, 'rb'))

with db_session:
    mss = {m.id: m.structure for m in db.Molecule.select()}


def worker(q):
    for k, v in mss.items():
        q.put((k, v))
    for _ in range(N_CPU):
        q.put('done')
    print('worker done')


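def index_mol(q):
    for mid, ms in iter(q.get, 'done'):  # consume until the 'done' sentinel arrives
        # ms.reset_query_marks()
        # group_dict keys are used as MoleculeClass ids, so they must match the ids created in insert_groups()
        new_classes = {i for i, g in group_dict.items() if g <= ms}  # groups matched in this molecule
        with db_session:  # keep all database access inside one session
            mol = db.Molecule[mid]
            indexed = {c.id for c in mol.classes}  # groups indexed earlier, if any
            mol.classes = [db.MoleculeClass[i] for i in sorted(new_classes | indexed)]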


q = Queue()
Process(target=worker, args=[q]).start()
pr = [Process(target=index_mol, args=[q]) for _ in range(N_CPU)]
for p in pr:
    p.start()
for p in pr:
    p.join()  # wait until every molecule has been indexed

worker done
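
The indexing code above relies on a sentinel pattern: the producer puts one `'done'` marker per consumer on the queue, and each consumer stops at the first marker it reads. The sketch below shows the same pattern in isolation. It deliberately swaps processes for threads so that it runs anywhere without a database; `multiprocessing.Queue` and `Process` offer the same `put`/`get`/`start`/`join` calls, but a shared result list as used here would not work across processes. All function names are hypothetical.

```python
import queue
import threading

SENTINEL = "done"


def producer(q, items, n_consumers):
    """Put all items on the queue, then one sentinel per consumer."""
    for item in items:
        q.put(item)
    for _ in range(n_consumers):
        q.put(SENTINEL)


def consumer(q, handle, results):
    """Process items until the first sentinel is read."""
    for item in iter(q.get, SENTINEL):
        results.append(handle(item))  # list.append is thread-safe in CPython


def run_pipeline(items, handle, n_consumers=4):
    q = queue.Queue()
    results = []
    workers = [
        threading.Thread(target=consumer, args=(q, handle, results))
        for _ in range(n_consumers)
    ]
    for w in workers:
        w.start()
    producer(q, items, n_consumers)
    for w in workers:
        w.join()  # every consumer has seen its sentinel
    return results


print(sorted(run_pipeline(range(10), lambda x: x * x, n_consumers=3)))
```

Because each consumer takes exactly one sentinel, the number of sentinels must equal the number of consumers; with fewer, some consumers would block forever on `q.get`.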


## 3. Preload molecule fingerprints for quick selection of initial reactants based on the Tversky index

In [18]:
from StructureFingerprint import LinearFingerprint


Create a dictionary that maps each molecule id to its fingerprint:

In [19]:
fingerprinter = LinearFingerprint(max_radius=6, length=4096, number_bit_pairs=4)  # create once, reuse for every molecule
fingerprints = {}
for mid, ms in mss.items():
    fingerprints[mid] = set(fingerprinter.transform_bitset([ms])[0])

Save the preloaded fingerprints to a file:

In [20]:
with open('../data/zinc/zinc_fps.pickle', 'wb') as file:
    dump(fingerprints, file)

In [21]:
len(fingerprints)

604785
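
Since the fingerprints are stored as sets of active bits, the Tversky index can be computed with ordinary set operations: T(A, B) = |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|). The sketch below is an illustration only. The function names and the default weights are assumptions, and this part of the tutorial does not state which α and β iSynthesis uses. With α = β = 1 the index equals the Tanimoto coefficient, and with α = β = 0.5 it equals the Dice coefficient.

```python
import heapq


def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky index of two sets of active fingerprint bits."""
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denominator = common + alpha * only_a + beta * only_b
    return common / denominator if denominator else 0.0


def top_candidates(target_fp, fingerprints, n=5, alpha=1.0, beta=1.0):
    """Return ids of the n fingerprints most similar to target_fp."""
    scored = (
        (tversky(target_fp, fp, alpha, beta), mid)
        for mid, fp in fingerprints.items()
    )
    return [mid for _, mid in heapq.nlargest(n, scored)]


# Toy data in the same shape as the `fingerprints` dictionary above.
toy = {10: {1, 2, 3, 4}, 11: {3, 4, 5}, 12: {9}}
print(top_candidates({1, 2, 3, 4}, toy, n=2))
```

Setting α and β to different values makes the measure asymmetric, which is useful when a building block should be contained in the target rather than resemble it as a whole.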