# DB Starter documentation

### About
This IPython sample notebook is a generator tool for the database folder-structure and input files for the supported quantum-chemistry packages. Note that this DB Starter version will perform a conformer scan.<br>

Since writing all structures / inputs at once is not necessarily useful before one knows whether any of the properties can be learned efficiently, all generator steps are performed on random subsets of the database entries. The notebook has one general "Input Section" (which has to be run every time to initialize the Generator), as well as three different Parts (A-C) that may be used independently. A guide to which calculation types are available for auto-generation and how you can include new calculation types can be found in the documentation. The documentation also explains how to correctly set up your required "CoreStructure.xyz" file and the Fragment library.
<br><br>
Please read the contents and comments of every cell carefully, as this will help you execute the notebook properly.
<br><br>

### Input Section
A section that defines how the database folder-structure should be set up.
<br><br>
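
For a conformer scan, the two folder-depth parameters of the Input Section (`Fold_L1` and `Fold_L2`) only spread the structures evenly over `Fold_L1 * Fold_L2` folders. The following minimal sketch illustrates one way such an even distribution could work. The function name `assign_folders`, the round-robin rule and the 1-based folder indices are assumptions made for illustration; the rule that ArchOnML's `gen_fingers` actually applies is not shown in this notebook.

```python
def assign_folders(n_structures, fold_l1, fold_l2):
    """Spread structures evenly over fold_l1 * fold_l2 folders.

    Hypothetical round-robin rule, not ArchOnML's own implementation.
    Returns one (level1, level2, frame) pointer per structure.
    """
    n_folders = fold_l1 * fold_l2
    pointers = []
    for frame in range(1, n_structures + 1):      # frames are counted from 1
        slot = (frame - 1) % n_folders            # folder slot 0 .. n_folders - 1
        level1 = slot // fold_l2 + 1              # first-level folder index
        level2 = slot % fold_l2 + 1               # subfolder index inside level1
        pointers.append((level1, level2, frame))
    return pointers


# 100 conformers in 3 * 4 = 12 folders: every folder receives 8 or 9 structures.
pointers = assign_folders(100, 3, 4)
```

With this rule, the number of structures per folder never differs by more than one, whatever the total number of conformers is.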

### Part A
Part A samples a random subset of all available database entries for unbiased, randomized training. This way, you can increase the amount of training data step by step. The output of this part is a library file containing the desired number of samples in a format compatible with everything else that ArchOnML requires.
<br><br>

### Part B
Part B sets up the database folder-structure and generates the starting Guess xyz-structures for all subsequent machine-learning tasks for a specific sample that was obtained in Part A.
<br><br>

### Part C
Part C can be used to set up specific calculation inputs for the external quantum chemistry program packages that are supported by ArchOnML. It reads a user-specified library file to write new input files that can carry out your desired calculation type in a way that can be directly read by ArchOnML's Machine Learning routines.

In [None]:
import os
import sys
import time
import concurrent.futures
from copy import deepcopy as dc

import numpy as np
from IPython.display import display, clear_output
np.set_printoptions(suppress=True)

from archonml import generate
from archonml.utils import timeconv
from archonml.common import PSEDICT, PSEIDICT
from archonml.generate import full_junc_once, junction

# Input Section Cells
These have to be run any time you want to use the Generator.

In [None]:
# The following _has to be_ user defined.
ProjName          = "PROJ"                                   # Name Prefix for the Database.
FragStrucLib      = []                                       # Library of Substituent Fragments.
                                                             # Leave empty for conformer scan !!!
    
ConformerStrucFID = "../Sample_Conformers.xyz"               # File containing the Conformer Scan from (for example) a crest calculation.
                                                             # Expects xyz format with all structures in one file. Energetic ordering not required.

StrucLib     = "./Guesses_{}.library".format(ProjName)  # This file contains the names and addresses of all individual structures to be written.
MergePath    = "../SemiEmpData_{}/".format(ProjName)    # Path for the MergeFiles.
GuessPath    = "../Guesses_{}/".format(ProjName)        # Path for the Guess structures to be generated.
DBPath       = "../../DATABASE/"                        # Main Path for the QC Data.

# Folder generation for the Database.
# Note that for a conformer scan, the "fingerprints" get replaced by mere pointers. For a conformer scan, the parameters Fold_Lx will just break down the amount of structures evenly.
# Example : Setting to 3 and 4 will create 3 folders on the first level, and 4 subfolders in each of these first-level folders.
#           All structures are then equally distributed into the 3 * 4 = 12 total folders.

Fold_L1      = 3                                        # Tree hierarchy for subfolder-generation; level one depth.
Fold_L2      = 4                                        # Tree hierarchy for subfolder-generation; level two depth.

GenInstance  = generate.DBGenerator(ProjName, FragStrucLib, ConformerStrucFID, StrucLib, MergePath, GuessPath, DBPath, Fold_L1, Fold_L2)
FINGERPRINTS = GenInstance.gen_fingers()

In [None]:
# This cell needs to be run only once at Database initialisation. Afterwards, it should be "commented out" with a "#" for this project folder.
GenInstance.gen_db_folders()

### Guess storage
Entries are written to the Guesses.library file. This also needs to be done only once!
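
As a minimal sketch, the helpers below show the layout of a single library line as the writer cell of this section builds it (entry name, comma-separated fingerprint, path to the Guess structure, separated by tabs) and how such a line can be split again. The helper names `format_library_line` and `parse_library_line` are hypothetical and not part of ArchOnML; the parser assumes that paths contain no whitespace.

```python
def format_library_line(proj_name, index, fingerprint, guess_path):
    """Build one library line: name, fingerprint and Guess path, tab-separated."""
    fing_string = ",".join(str(int(x)) for x in fingerprint)
    level1 = "{}/".format(int(fingerprint[0]))
    level2 = "{}_{}/".format(int(fingerprint[0]), int(fingerprint[1]))
    path = guess_path + level1 + level2 + "Guess_" + fing_string + ".xyz"
    return "{}{}\t\t\t{}\t\t\t{}\n".format(proj_name, index, fing_string, path)


def parse_library_line(line):
    """Split a library line back into name, integer fingerprint and path."""
    name, fing_string, path = line.split()
    return name, [int(x) for x in fing_string.split(",")], path


line = format_library_line("PROJ", 1, [2.0, 3.0, 17.0], "../Guesses_PROJ/")
name, fingerprint, path = parse_library_line(line)
```

Here, `path` points to `../Guesses_PROJ/2/2_3/Guess_2,3,17.xyz`: the first two fingerprint entries select the folder levels, and the full fingerprint names the file.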

In [None]:
# Write the Guess Structure Library to file after database reduction - needs to be done only once!
with open(StrucLib, "w") as LID:
    for ii in range(len(FINGERPRINTS)):
        LocFing    = FINGERPRINTS[ii]
        # Comma-separated integer representation of the fingerprint.
        FingString = ",".join(str(int(x)) for x in LocFing)
        # Generate the Level 1 path depending on Fold_L1
        LocPL1 = str(int(LocFing[0]))+"/"
        # Generate the Level 2 path depending on Fold_L2
        LocPL2 = str(int(LocFing[0]))+"_"+str(int(LocFing[1]))+"/"
        LocPath = GuessPath+LocPL1+LocPL2+"Guess_"+FingString+".xyz"
        LocLib  = "{}{}\t\t\t{}\t\t\t{}\n".format(ProjName, ii+1, FingString, LocPath)
        LID.write(LocLib)

### Database retrieval
If the database was reduced by symmetry or other rules, this lengthy process can be skipped once it has been performed: the fingerprints are then simply read back from the library file.

In [None]:
# Skip symmetry determination, if all (symmetry)-reduced structures have been written to GuessLib file already.
FID  = open(StrucLib, "r")
FLoc = FID.readlines()
FID.close()
FList = []
FingLen = len((FLoc[0].split()[1]).split(","))
for ii in range(len(FLoc)):
    Aux   = (FLoc[ii].split()[1]).split(",")
    LocFP = [int(x)*1.0 for x in Aux]
    FList.append(np.asanyarray(LocFP))
FINGERPRINTS = dc(FList)
GenInstance.FINGERPRINTS = dc(FList)
print("Keeping {} entries that have unique fingerprints.".format(len(FList)))

# PART A Cells

### Sampling N random structures for Training / Testing / Predictions
This part may be run independently from Part B or C.
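
The idea of stepwise sampling, where every new subset excludes the entries that were drawn before, can be sketched as follows. This is a minimal illustration using Python's standard `random` module; the function `draw_sample` and its arguments are hypothetical and do not reproduce the internals of `GenInstance.sample`.

```python
import random


def draw_sample(all_names, already_sampled, n_samp, seed=None):
    """Draw n_samp entries that have not been sampled before.

    seed=None lets Python seed the generator itself (non-reproducible);
    a fixed integer seed makes the draw reproducible, e.g. for debugging.
    """
    rng = random.Random(seed)
    taken = set(already_sampled)
    remaining = [name for name in all_names if name not in taken]
    if n_samp > len(remaining):
        raise ValueError("Only {} unsampled entries left.".format(len(remaining)))
    return rng.sample(remaining, n_samp)


names = ["PROJ{}".format(ii) for ii in range(1, 11)]
first = draw_sample(names, [], 4, seed=1)
second = draw_sample(names, first, 4, seed=1)   # never repeats entries of `first`
```

Keeping track of the already-sampled entries is what allows the training set to grow step by step without duplicates.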

In [None]:
# Specify the static "Sampled" library. This file will keep track of which structures have already been sampled before.
# Do not change this name after the first sampling.
SampLib   = "./SampleLib_{}".format(ProjName)

# Specify the local subset library name. These should be unique every time you want to sample a subset.
LocSetLib = "./Sample_1_1k_{}".format(ProjName)

# Specify the number of desired samples drawn from the full library.
NSamp     = 1000

# Flavor of Randomness. False means that system time is used for ensuring randomness. True sets a fixed RandomSeed; which is only really useful for debugging purposes.
FixedSeed  = False

In [None]:
# Generate a sample.
GenInstance.sample(SampLib, LocSetLib, NSamp, FixedSeed)

# PART B Cells

### Extraction of Guess Structures
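
Extracting a Guess structure amounts to cutting one frame out of the multi-structure xyz file. Each frame spans the number of atoms plus two lines (atom count and comment line), and frames are counted from 1. The sketch below shows this indexing together with two sanity checks; the helper name `extract_frame` and the tiny demo data are made up for illustration and are not part of ArchOnML.

```python
def extract_frame(lines, frame):
    """Return the lines of one frame from a multi-frame xyz file.

    Assumes every frame has the same number of atoms as the first one.
    """
    n_atoms = int(lines[0].split()[0])
    block_length = n_atoms + 2                  # atom count + comment + atoms
    if frame < 1:
        raise IndexError("Frames are counted from 1.")
    start = (frame - 1) * block_length
    block = lines[start:start + block_length]
    if len(block) != block_length:
        raise IndexError("Frame {} is not contained in the file.".format(frame))
    if int(block[0].split()[0]) != n_atoms:
        raise ValueError("Frame {} has a different number of atoms.".format(frame))
    return block


# Two made-up H2 "conformers" in a single xyz file.
demo = [
    "2\n", "conformer 1\n", "H 0.0 0.0 0.00\n", "H 0.0 0.0 0.74\n",
    "2\n", "conformer 2\n", "H 0.0 0.0 0.00\n", "H 0.0 0.0 0.80\n",
]
second_frame = extract_frame(demo, 2)
```

The atom-count check catches files in which the frames do not all have the same length, which would otherwise shift every subsequent frame silently.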

In [None]:
# This cell will save the selected xyz structures to hard-disk. On-the-fly writing of the structure library at the same time.
cnt   = 0
tTot  = 0
incnt = 0
now   = time.time()

GenSample  = "./Sample_1_1k_{}".format(ProjName)   # Sample, for which Guess structures shall be generated.
FID  = open(GenSample, "r")
FLoc = FID.readlines()
FID.close()
GID  = open(ConformerStrucFID, "r")
GLoc = GID.readlines()
GID.close()

NAt  = int(GLoc[0].split()[0])                  # This assumes that each conformer (frame) has the same number of atoms
BlockLength = NAt + 2

for ii in range(len(FLoc)):
    LocFing = FLoc[ii].split()[1]               # Current Fingerprint
    Frame   = int(LocFing.split(",")[-1])       # "Frame" of the xyz File (first frame is counted at 1 !)
    
    # Generate Path
    LocPL1  = "{}/".format(LocFing.split(",")[0])
    LocPL2  = "{}_{}/".format(LocFing.split(",")[0], LocFing.split(",")[1])
    LocPath = GuessPath + LocPL1 + LocPL2 + "Guess_" + LocFing + ".xyz"
    
    # Move "line" to current molecule "frame"
    line    = (Frame-1) * BlockLength
    OID = open(LocPath, 'w')
    for _ in range(BlockLength):
        OID.write(GLoc[line])
        line += 1
    OID.close()
    cnt  += 1
    Perc  = cnt / len(FLoc)
    UpdThrsh = 0.001
    if Perc > UpdThrsh+(incnt*UpdThrsh):
        then  = time.time()
        tReq  = then-now
        tTot += tReq
        mReq  = tTot / cnt
        Rem   = float(len(FLoc)-cnt)*mReq
        clear_output(wait=True)
        STR1  = "Finished {:.2f} % ({}) of all structures ({}) in {:.1f} seconds ({}).\n".format(Perc*100, cnt, len(FLoc), tTot, timeconv(tTot),)
        STR2  = "Required {:.3f} seconds on average for each structure.\n".format(mReq)
        STR3  = "Expecting {:.1f} seconds remaining.({})\n".format(Rem, timeconv(Rem))
        print(STR1+STR2+STR3)
        incnt += 1
        now    = time.time()

# PART C Cells
This part may be run independently from Part A or B - but requires the existence of some libraries that were generated with Part B at some point.
Note that the "Conf" calculation types below directly use the geometries from the provided conformer ".xyz" file.

In other words, no further (pre-)optimizations are to be performed, since this would (most likely) change the conformer space drastically.

#### Step 1 - Run the Semi-Empirical Orbital Energy calculations

In [None]:
# Specify, for which (sub)library to generate the input files.
GenLib     = "./Sample_1_1k_{}".format(ProjName)
# Specify, for which external quantum-chemistry program to generate calculation inputs.
QCPack     = "orca"
# Specify, which calculation type to generate.
CalType    = "OrbEns_Solv_Conf"
# Specify, which calculation flavor to use.
CalFlav    = "PM3"
# Specify a name for the calculation path library that is to be written. (This may make it easier for you to start thousands of calculations on an HPC system)
CalPathLib = "./SampleCalcs_1_1k_{}".format(ProjName)

# Epsilon and Refraction parameters for the CPCM model - here, settings for diethyl ether (as in the sample).
CalEps     = 4.34
CalRefrac  = 1.3497

In [None]:
# Generate the input files and path library.
GenInstance.gen_calcs(GenLib, QCPack, CalType, CalFlav, CalPathLib, epsilon=CalEps, refrac=CalRefrac)

In [None]:
# Run OrbEns Calculations with the external software now.

#### Step 2 - Run the TDDFT calculations (here, singlets only)

In [None]:
# Specify, for which (sub)library to generate the input files.
GenLib     = "./Sample_1_1k_{}".format(ProjName)
# Specify, for which external quantum-chemistry program to generate calculation inputs.
QCPack     = "orca"
# Specify, which calculation type to generate.
CalType    = "TDSn_Solv_Conf"
# Specify, which calculation flavor to use.
CalFlav    = "CB3LG"
# Specify a name for the calculation path library that is to be written. (This may make it easier for you to start thousands of calculations on an HPC system)
CalPathLib = "./SampleCalcs_1_1k_{}".format(ProjName)

# Specify further arguments for your desired calculation, for example the number of excited states to calculate.
# Epsilon and Refraction parameters for the CPCM model - here, settings for diethyl ether (as in the sample).
CalEps     = 4.34
CalRefrac  = 1.3497

# Number of states to be calculated in the TDDFT calculation.
nstates    = 12

In [None]:
# Generate the input files and path library.
GenInstance.gen_calcs(GenLib, QCPack, CalType, CalFlav, CalPathLib, epsilon=CalEps, refrac=CalRefrac, nstates=nstates)

In [None]:
# Run TDSn Calculations with the external software now.