# DB Starter documentation

### About
This IPython sample notebook is a generator tool for the database folder-structure and input files for the supported quantum-chemistry packages. Note that this DB Starter version is set up to perform a chemical space scan.<br>

Since writing all structures / inputs at once is not necessarily useful before one knows whether any of the properties can be learned efficiently, all generator steps are performed on random samples of the database entries. The notebook has one general "Input Section" (which has to be run every time to initialize the Generator), as well as three different Parts (A-C) that may be used independently. A guide to which calculation types are available for auto-generation and how you can include new calculation types can be found in the documentation. The documentation also explains how to correctly set up your required "CoreStructure.xyz" file and the Fragment library.
<br><br>
Note that several code cells can be deactivated (i.e. "commented out") after they have been run once, which can save a considerable amount of time. Therefore, please read each cell's contents and comments carefully, as this will help you execute the notebook properly.
<br><br>

### Input Section
In this section you define how the database folder-structure should be set up. It includes generating/loading all possible fingerprints, and the symmetry-reduction of the fingerprints is also performed here.
<br><br>

### Part A
Part A samples a random part of all available database entries for performing unbiased, randomized training. This way you can increase the amount of training data step by step. The output of this part is a library file containing the desired number of samples in a format compatible with everything else that ArchOnML requires.
<br><br>

### Part B
Part B sets up the database folder-structure and generates the starting Guess xyz-structures for all subsequent machine-learning tasks of a specific sample that was obtained within Part A.
<br><br>

### Part C
Part C can be used to set up specific calculation inputs for the external quantum chemistry program packages that are supported by ArchOnML. It reads a user-specified library file to write new input files that can carry out your desired calculation type in a way that can be directly read by ArchOnML's Machine Learning routines.

In [None]:
import os, sys, time
import concurrent.futures
from copy import deepcopy as dc

import numpy as np
from IPython.display import display, clear_output
np.set_printoptions(suppress=True)

from archonml import generate
from archonml.utils import timeconv
from archonml.common import PSEDICT, PSEIDICT
from archonml.generate import full_junc_once, junction

# Input Section Cells
These have to be run any time you want to use the Generator.

In [None]:
# The following _has to be_ user defined.
ProjName     = "PROJ"                                   # Name Prefix for the Database.
FragStrucLib = "../Substituents.library"                   # File containing the Fragment folders and names.
CoreStrucFID = "../Sample_CoreStructure.xyz"                   # File containing the substituent-marked core structure.

# This _can_ be changed by the user, if desired.
StrucLib     = "./Guesses_{}.library".format(ProjName)  # This file contains the names and addresses of all Guess structures to be written.
                                                        # It is created _after_ database-reduction, so that the reduction may be skipped after performing it once.
MergePath    = "../SemiEmpData_{}/".format(ProjName)    # Path for the MergeFiles.
GuessPath    = "../Guesses_{}/".format(ProjName)        # Path for the Guess structures to be generated.
DBPath       = "../../DATABASE/"                        # Main Path for the QC Data.

# Folder generation for the Database. The parameters Fold_L1 and Fold_L2 give the depths of the folder hierarchy.
# If deeper hierarchies are required, this should be implemented here at some point.
# A reasonable choice is to keep level one (L1) at 1, and choose level two (L2) to be half of the maximum fingerprint length.
Fold_L1      = 1                                        # Tree hierarchy for subfolder-generation; level one depth.
Fold_L2      = 4                                        # Tree hierarchy for subfolder-generation; level two depth.

GenInstance  = generate.DBGenerator(ProjName, FragStrucLib, CoreStrucFID, StrucLib, MergePath, GuessPath, DBPath, Fold_L1, Fold_L2)
FINGERPRINTS = GenInstance.gen_fingers()

### Database-reduction Input
In case the core structure follows some symmetry rules or you want to specify additional rules for the database generation ("any structure should have one phenyl group at most"), please use the templates below to adapt the Fingerprints of the database. Note that the reduction of the Database only needs to be performed once, since afterwards all entries can be read from the "Guesses" library.
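The idea of symmetry-reduction and rule-based filtering can be tried out on a small toy system before adapting the templates. The following is only a sketch under stated assumptions: the four positions, the three substituent indices, the permutations in `SYMMETRY_OPS` and the index `PHENYL` are all hypothetical, and it uses a canonical-representative approach (keep the smallest symmetry-equivalent fingerprint) rather than the mapping-matrix approach of the template cells.

```python
from itertools import product

# Hypothetical toy system: 4 substituent positions, substituent indices 1..3.
N_POS, N_SUB = 4, 3

# Hypothetical symmetry operations written as index permutations:
# entry k names the old position whose substituent ends up at position k.
SYMMETRY_OPS = [
    (3, 2, 1, 0),  # mirror 1: reverse the position order
    (1, 0, 3, 2),  # mirror 2: swap positions within each pair
    (2, 3, 0, 1),  # combination of both mirrors
]

def canonical(fp):
    """Return the lexicographically smallest symmetry-equivalent fingerprint."""
    images = [fp] + [tuple(fp[k] for k in op) for op in SYMMETRY_OPS]
    return min(images)

def at_most_one(fp, sub_index):
    """Rule: the substituent with index sub_index may occur at most once."""
    return fp.count(sub_index) <= 1

# Enumerate every fingerprint, then keep one representative per symmetry class.
all_fps = list(product(range(1, N_SUB + 1), repeat=N_POS))
unique_fps = sorted({canonical(fp) for fp in all_fps})

# Apply an additional rule on top of the symmetry reduction.
PHENYL = 3  # hypothetical index of the phenyl fragment in the fragment library
filtered_fps = [fp for fp in unique_fps if at_most_one(fp, PHENYL)]

print(len(all_fps), len(unique_fps), len(filtered_fps))
```

Because a symmetry operation only moves substituents between positions, a counting rule such as "at most one phenyl group" gives the same result whether it is applied before or after the symmetry reduction.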

In [None]:
# Helper function for database symmetry-reduction
# This code cell can be commented out after reduction has been performed.
def TransformFingerprint(FingP):
    Tuple = ()
    for ii in range(len(FingP)):
        CurInt = int(FingP[ii])-1
        Tuple = Tuple + (CurInt,)
    return Tuple

In [None]:
# Symmetry Reduction Example for anthraquinone Below
# This code cell can be commented out after reduction has been performed.

# Initialize a mapping-matrix with all possible combinations of position and substituent as "active".
# Every mapping address that reads "one" after reduction will be considered "alive" for generation.
ValidMatrix = np.ones((7, 7, 7, 7, 7, 7, 7, 7))

# Now, go through all FINGERPRINTS and translate the current ii-th fingerprint into a mapping address.
for ii in range(len(FINGERPRINTS)):
    CURMAP = TransformFingerprint(FINGERPRINTS[ii])
    # See, if this entry is currently "alive"
    if ValidMatrix[CURMAP] == 1.0:
        # If yes, generate the symmetry equivalent maps from the current map by defining what map each symmetry element would result in.
        
        # Symmetry Element #1 - Mirror 1
        # R_1 would become R_8, R_2 -> R_7, ... for the anthraquinone example, see publication. Remember, python indices start from 0.
        MIR1 = (CURMAP[7], CURMAP[6], CURMAP[5], CURMAP[4], CURMAP[3], CURMAP[2], CURMAP[1], CURMAP[0])
        
        # Symmetry Element #2 - Mirror 2; 1 -> 4, 2 -> 3, ...
        MIR2 = (CURMAP[3], CURMAP[2], CURMAP[1], CURMAP[0], CURMAP[7], CURMAP[6], CURMAP[5], CURMAP[4])
        
        # Symmetry Element #3 - Inversion
        INV  = (CURMAP[4], CURMAP[5], CURMAP[6], CURMAP[7], CURMAP[0], CURMAP[1], CURMAP[2], CURMAP[3])
        
        # Rotation around Oxygen-Oxygen Axis == MIR1
        # Rotation around perpendicular Axis == MIR2
        # Rotation inside the z-plane        == INV
        
        # Remove all "redundant" structures that can be generated by symmetry from the current fingerprint:
        # if the symmetry-map is _not_ the same as the current map, set its ValidMatrix element to 0.
        if MIR1 != CURMAP:
            ValidMatrix[MIR1] = 0
        if MIR2 != CURMAP:
            ValidMatrix[MIR2] = 0
        if INV  != CURMAP:
            ValidMatrix[INV]  = 0

# Collect all non-zero maps and transform back to a list of valid fingerprints in the same, previous data format
# (i.e. a list of nparrays that contain floats)
Valids = np.argwhere(ValidMatrix == 1.0)          # One row of eight indices per surviving map.
ReFing = [Row.astype(float) for Row in Valids]
    
# Overwrite the fingerprints - _both_ the local copy as well as the object attribute!
FINGERPRINTS = dc(ReFing)
GenInstance.FINGERPRINTS = dc(ReFing)

In [None]:
# Write the Guess Structure Library to file after database reduction.
# This code cell can be commented out after reduction has been performed.
StrucLib = "./Guesses_{}.library".format(ProjName)
LID   = open(StrucLib, "w")
for ii in range(len(FINGERPRINTS)):
    LocFing    = FINGERPRINTS[ii]
    FingString = ""
    LocLead    = FINGERPRINTS[ii][0]
    for jj in range(len(LocFing)-1):
        FingString += str(int(LocFing[jj]))+","
    FingString += str(int(LocFing[-1]))
    # Generate the Level 1 path depending on Fold_L1
    Aux = ""
    for jj in range(Fold_L1):
        Aux += "{}_".format(int(FINGERPRINTS[ii][jj]))
    LocPL1 = Aux.rstrip(Aux[-1])+"/"
    # Generate the Level 2 path depending on Fold_L2
    Aux = ""
    for jj in range(Fold_L2):
        Aux += "{}_".format(int(FINGERPRINTS[ii][jj]))
    LocPL2 = Aux.rstrip(Aux[-1])+"/"
    LocPath = GuessPath+LocPL1+LocPL2+"Guess_"+FingString+".xyz"
    LocLib  = "{}{}\t\t\t{}\t\t\t{}\n".format(ProjName, ii+1, FingString, LocPath)
    LID.write(LocLib)
LID.close()

### Database retrieval
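If the database was reduced by symmetry or other rules, this lengthy process can be skipped after it has been performed once: comment out the reduction cells above and activate the cell below, which reads the reduced fingerprints back from the "Guesses" library.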
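The retrieval step can also be written as a small helper function. This is a minimal sketch that assumes the library layout written by the cell above (entry name, comma-separated fingerprint and path, separated by whitespace); the function name `read_fingerprints` and the demo file contents are hypothetical.

```python
import os
import tempfile

import numpy as np

def read_fingerprints(lib_path):
    """Read the fingerprint column of a Guess library into a list of float arrays."""
    fingerprints = []
    with open(lib_path, "r") as fid:
        for line in fid:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            fingerprints.append(np.asarray([float(x) for x in fields[1].split(",")]))
    return fingerprints

# Demo with a small, made-up library file.
demo_lines = [
    "PROJ1\t\t\t1,1,2\t\t\t./Guess_1,1,2.xyz\n",
    "PROJ2\t\t\t1,3,2\t\t\t./Guess_1,3,2.xyz\n",
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Guesses_PROJ.library")
    with open(demo_path, "w") as fid:
        fid.writelines(demo_lines)
    demo_fps = read_fingerprints(demo_path)

print("Keeping {} entries that have unique fingerprints.".format(len(demo_fps)))
```

The returned list has the same format as FINGERPRINTS (a list of arrays that contain floats), so it could be assigned to both the local copy and the object attribute.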

In [None]:
# # Skip symmetry determination, if all (symmetry)-reduced structures have been written to GuessLib file already.
# FID  = open(StrucLib, "r")
# FLoc = FID.readlines()
# FID.close()
# FList = []
# FingLen = len((FLoc[0].split()[1]).split(","))
# for ii in range(len(FLoc)):
#     Aux   = (FLoc[ii].split()[1]).split(",")
#     LocFP = [int(x)*1.0 for x in Aux]
#     FList.append(np.asanyarray(LocFP))
# FINGERPRINTS = dc(FList)
# GenInstance.FINGERPRINTS = dc(FList)
# print("Keeping {} entries that have unique fingerprints.".format(len(FList)))

# PART A Cells

### Sampling N random structures for Training / Testing / Predictions
This part may be run independently from Part B or C.
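The bookkeeping idea behind a static "Sampled" library can be illustrated as follows. This is a sketch of sampling without repetition across several draws, not a description of how `GenInstance.sample` is implemented; the function name `draw_new_samples` and the entry names are hypothetical.

```python
import random

def draw_new_samples(all_names, already_sampled, n_samp, fixed_seed=False):
    """Draw up to n_samp entries that were not sampled before and record them."""
    # A fixed seed makes the draw reproducible, which is mainly useful for debugging.
    rng = random.Random(0) if fixed_seed else random.Random()
    remaining = [name for name in all_names if name not in already_sampled]
    picked = rng.sample(remaining, min(n_samp, len(remaining)))
    already_sampled.update(picked)
    return picked

# Demo: ten made-up database entries, sampled in two rounds of three.
names = ["PROJ{}".format(ii + 1) for ii in range(10)]
sampled_so_far = set()
first = draw_new_samples(names, sampled_so_far, 3, fixed_seed=True)
second = draw_new_samples(names, sampled_so_far, 3, fixed_seed=True)
print(first, second)
```

Keeping the set of already sampled entries separate from each local subset is what allows the training data to grow stepwise without drawing the same structure twice.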

In [None]:
# Specify the static "Sampled" library. This file will keep track of which structures have already been sampled before.
# Do not change this name after the first sampling.
SampLib   = "./SampleLib_{}".format(ProjName)

# Specify the local subset library name. These should be unique every time you want to sample a subset.
LocSetLib = "./Sample_1k_1_{}".format(ProjName)

# Specify the number of desired samples drawn from the full library.
NSamp     = 1000

# Flavor of Randomness. False means that system time is used for ensuring randomness. True sets a fixed RandomSeed; which is only really useful for debugging purposes.
FixedSeed  = False

In [None]:
# Generate a sample.
GenInstance.sample(SampLib, LocSetLib, NSamp, FixedSeed)

# PART B Cells

### Generation of Guess Structures
Parallelised, recursive Generation of .xyz files for all Fingerprints given in the current sample. To save memory, each fingerprint is resolved in an "individual iterative way" before dumping the structure to disk.
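The folder layout used when a structure is written to disk follows from Fold_L1 and Fold_L2. The helper below is a sketch that reproduces the path construction of the cells in this notebook in a compact form; the function name `guess_file_path` is hypothetical and not part of ArchOnML.

```python
def guess_file_path(fingerprint, guess_root, fold_l1, fold_l2):
    """Build the path of a Guess structure from its fingerprint.

    The level-one folder joins the first fold_l1 fingerprint entries with "_",
    the level-two folder joins the first fold_l2 entries in the same way.
    """
    ints = [str(int(value)) for value in fingerprint]
    level1 = "_".join(ints[:fold_l1]) + "/"
    level2 = "_".join(ints[:fold_l2]) + "/"
    return guess_root + level1 + level2 + "Guess_" + ",".join(ints) + ".xyz"

# Demo with the default hierarchy depths of the Input Section.
demo_path = guess_file_path([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 1.0],
                            "../Guesses_PROJ/", 1, 4)
print(demo_path)
```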

In [None]:
# Read in a selected Sample File to limit the generator to a subset.
GenSample  = "./Sample_1k_1_{}".format(ProjName)              # Sample, for which Guess structures shall be generated.
SubFings   = []

FID  = open(GenSample, "r")
FLoc = FID.readlines()
FID.close()
for ii in range(len(FLoc)):
    LocFing = FLoc[ii].split()[1]
    Aux     = [float(LocFing.split(",")[jj]) for jj in range(len(LocFing.split(",")))]
    SubFings.append(Aux)

In [None]:
# This cell generates all xyz structures in memory.
cnt   = 0
tTot  = 0
incnt = 0
now   = time.time()
JUNCS = ["0"]*len(SubFings)
MAPS  = ["0"]*len(SubFings)

# Parallelized in-memory generation of all structures. In principle, this could be rewritten to dump each structure directly to disk.
with concurrent.futures.ProcessPoolExecutor() as executor:
    results = executor.map(full_junc_once, [(SubFings[ii], GenInstance.CORE, GenInstance.FRAGMENTS, GenInstance.NSub, GenInstance.FLAGS) for ii in range(len(SubFings))])
    cnt     = 0
    for result in results:
        JUNCS[cnt] = result[0]
        MAPS[cnt]  = result[1]
        cnt  += 1
        Perc  = cnt / len(SubFings)
        UpdThrsh = 0.001
        if Perc > UpdThrsh+(incnt*UpdThrsh):
            then  = time.time()
            tReq  = then-now
            tTot += tReq
            mReq  = tTot / cnt
            Rem = float(len(SubFings)-cnt)*mReq
            clear_output(wait=True)
            STR1 = "Finished {:.2f} % ({}) of all structures ({}) in {:.1f} seconds ({}).\n".format(Perc*100, cnt, len(SubFings), tTot, timeconv(tTot),)
            STR2 = "Required {:.3f} seconds on average for each structure.\n".format(mReq)
            STR3 = "Expecting {:.1f} seconds remaining.({})\n".format(Rem, timeconv(Rem))
            print(STR1+STR2+STR3)
            incnt += 1
            now    = time.time()

In [None]:
# This cell will save all xyz structures to hard-disk. On-the-fly writing of the structure library at the same time.
cnt   = 0
tTot  = 0
incnt = 0
now   = time.time()

for ii in range(len(JUNCS)):
    LocNAt     = len(JUNCS[ii])
    LocFing    = MAPS[ii]
    LocGeom    = JUNCS[ii]
    FingString = ""
    LocLead    = MAPS[ii][0]
    for jj in range(len(LocFing)-1):
        FingString += str(int(LocFing[jj]))+","
    FingString += str(int(LocFing[-1]))
    # Generate the Level 1 path depending on Fold_L1
    Aux = ""
    for jj in range(Fold_L1):
        Aux += "{}_".format(int(MAPS[ii][jj]))
    LocPL1 = Aux.rstrip(Aux[-1])+"/"
    # Generate the Level 2 path depending on Fold_L2
    Aux = ""
    for jj in range(Fold_L2):
        Aux += "{}_".format(int(MAPS[ii][jj]))
    LocPL2 = Aux.rstrip(Aux[-1])+"/"
    LocPath = GuessPath+LocPL1+LocPL2+"Guess_"+FingString+".xyz"
    OID = open(LocPath, 'w')
    OID.write('{}\n'.format(LocNAt))
    OID.write(FingString+"\n")
    for jj in range(len(LocGeom)):
        OID.write('{}   {:>10.7f}   {:>10.7f}   {:>10.7f}   !{}\n'.format(PSEIDICT[LocGeom[jj][3]], LocGeom[jj][0], LocGeom[jj][1],
                                                                          LocGeom[jj][2], int(LocGeom[jj][4])))
    OID.close()
    cnt  += 1
    Perc  = cnt / len(SubFings)
    UpdThrsh = 0.001
    if Perc > UpdThrsh+(incnt*UpdThrsh):
        then  = time.time()
        tReq  = then-now
        tTot += tReq
        mReq  = tTot / cnt
        Rem   = float(len(SubFings)-cnt)*mReq
        clear_output(wait=True)
        STR1  = "Finished {:.2f} % ({}) of all structures ({}) in {:.1f} seconds ({}).\n".format(Perc*100, cnt, len(SubFings), tTot, timeconv(tTot),)
        STR2  = "Required {:.3f} seconds on average for each structure.\n".format(mReq)
        STR3  = "Expecting {:.1f} seconds remaining.({})\n".format(Rem, timeconv(Rem))
        print(STR1+STR2+STR3)
        incnt += 1
        now    = time.time()

# PART C Cells
This part may be run independently from Part A or B - but requires the existence of some libraries that were generated with Part B at some point.
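The four steps below differ only in the calculation type, the flavor and a few keyword arguments, so they can be summarized in a small table. This is a sketch only: `RecordingGenerator` is a stand-in so that the example runs without ArchOnML, and `STEPS` and `generate_step` are hypothetical names. The calculation types, flavors and keyword arguments are the ones used in the cells of this part.

```python
class RecordingGenerator:
    """Stand-in for GenInstance that only records the requested calculations."""
    def __init__(self):
        self.calls = []

    def gen_calcs(self, gen_lib, qc_pack, cal_type, cal_flav, cal_path_lib, **kwargs):
        self.calls.append((cal_type, cal_flav, kwargs))

PROJ = "PROJ"
GEN_LIB = "./Sample_1k_1_{}".format(PROJ)
CAL_PATH_LIB = "./SampleCalcs_1k_1_{}".format(PROJ)
QC_PACK = "g16"

# One entry per step; "extra" holds the optional keyword arguments.
STEPS = [
    {"CalType": "PreOpt",    "CalFlav": "PM6",    "extra": {}},
    {"CalType": "OrbEns",    "CalFlav": "PM6",    "extra": {}},
    {"CalType": "Opt_Solv",  "CalFlav": "B3LGDS", "extra": {"solvent": "Benzene"}},
    {"CalType": "TDSn_Solv", "CalFlav": "B3LGDS", "extra": {"nstates": 6, "solvent": "Benzene"}},
]

def generate_step(generator, step_index):
    """Generate the inputs of a single step; run the external calculations afterwards."""
    step = STEPS[step_index]
    generator.gen_calcs(GEN_LIB, QC_PACK, step["CalType"], step["CalFlav"],
                        CAL_PATH_LIB, **step["extra"])

demo = RecordingGenerator()
generate_step(demo, 0)   # Pre-Optimization inputs
generate_step(demo, 3)   # TDDFT inputs
print(demo.calls)
```

The steps are generated one at a time on purpose, since the external calculations of one step have to be run before the inputs of the next step are generated.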

#### Step 1 - Use a selected Subset for generating Pre-Optimization calculations

In [None]:
# Specify, for which (sub)library to generate the input files.
GenLib     = "./Sample_1k_1_{}".format(ProjName)
# Specify, for which external quantum-chemistry program to generate calculation inputs.
QCPack     = "g16"
# Specify, which calculation type to generate.
CalType    = "PreOpt"
# Specify, which calculation flavor to use.
CalFlav    = "PM6"
# Specify a name of the calculation path library that is to be written. (This may make it easier for you to start thousands of calculations on a HPC system)
CalPathLib = "./SampleCalcs_1k_1_{}".format(ProjName)

In [None]:
# Generate the input files for PreOpts and path library.
GenInstance.gen_calcs(GenLib, QCPack, CalType, CalFlav, CalPathLib)

In [None]:
# Run PreOpt Calculations with the external software now.

#### Step 2 - Run the Semi-Empirical Orbital Energy calculations

In [None]:
# Specify, which calculation type to generate.
CalType    = "OrbEns"
# Specify, which calculation flavor to use.
CalFlav    = "PM6"

In [None]:
# Generate the input files and path library.
GenInstance.gen_calcs(GenLib, QCPack, CalType, CalFlav, CalPathLib)

In [None]:
# Run OrbEns Calculations with the external software now.

#### Step 3 - After Pre-Optimization, run the Optimization

In [None]:
# Specify, which calculation type to generate.
CalType    = "Opt_Solv"
# Specify, which calculation flavor to use.
CalFlav    = "B3LGDS"
CalSolv    = "Benzene"

In [None]:
# Generate the input files and path library.
GenInstance.gen_calcs(GenLib, QCPack, CalType, CalFlav, CalPathLib, solvent=CalSolv)

In [None]:
# Run Optimization Calculations with the external software now.

#### Step 4 - Run the TDDFT calculations (here, singlets only)

In [None]:
# Specify, which calculation type to generate.
CalType    = "TDSn_Solv"
# Specify, which calculation flavor to use.
CalFlav    = "B3LGDS"
CalSolv    = "Benzene"
# Specify further arguments for your desired calculation.
# Here: the number of excited states to be calculated in the TDDFT calculation.
nstates    = 6

In [None]:
# Generate the input files and path library.
GenInstance.gen_calcs(GenLib, QCPack, CalType, CalFlav, CalPathLib, nstates=nstates, solvent=CalSolv)

In [None]:
# Run TDDFT Calculations with the external software now.