# Tutorial for Creating Descriptor Sets with Auto-QChem at UCLA

This tutorial will walk you through how to use Auto-QChem at UCLA. Before you begin, you will need a [user account on Hoffman2](https://www.hoffman2.idre.ucla.edu/Accounts/Requesting-an-account.html) (UCLA's computing cluster) with [access to Gaussian 16](https://www.hoffman2.idre.ucla.edu/Accounts/Users-managing-your-account.html). You will also need to [install Auto-QChem](https://github.com/doyle-lab-ucla/auto-qchem/blob/master/Install.md) on your local computer with the required Python packages.

Given a string representation of a molecule, Auto-QChem generates 3D conformers and runs DFT calculations on them on a remote computing cluster. Once the calculations have finished, Auto-QChem collects the DFT features and stores them in a database accessible through a web interface.

***


## Standard Workflow


### Import classes from autoqchem

Here we reference components of Auto-QChem so they're available for us to use later:

In [19]:
from autoqchem.molecule import molecule
from autoqchem.sge_manager import sge_manager
from autoqchem.draw_utils import draw

### Set level of logging
Auto-QChem prints out messages with helpful information. You can set the verbosity level of the messages: 
- "INFO" is a good setting for first time users. 
- "WARNING" or "ERROR" are appropriate once you become more comfortable.

In [20]:
import logging
logging.basicConfig(level=logging.INFO)
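
Note that `logging.basicConfig` only configures the root logger if it has no handlers yet, so rerunning the cell with a different level may have no effect within the same kernel session. As a minimal sketch, verbosity can instead be adjusted per library through named loggers. The logger names `autoqchem` and `paramiko` are inferred from the prefixes of the log messages shown later in this tutorial, so treat the exact logger hierarchy as an assumption.

```python
import logging

# Configure the root logger once; later calls are ignored if handlers already exist.
logging.basicConfig(level=logging.INFO)

# Keep Auto-QChem messages at INFO, but hide routine SSH messages from paramiko.
# Both logger names are assumptions based on the log prefixes in this tutorial.
logging.getLogger("autoqchem").setLevel(logging.INFO)
logging.getLogger("paramiko").setLevel(logging.WARNING)
```

Child loggers such as `paramiko.transport` inherit the level set on their parent, so a single call per library is usually enough.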

### Provide SMILES string(s)

You can choose any molecule well-represented by a SMILES string. This representation is easily obtained from ChemDraw by selecting a drawn molecule, then "Edit > Copy As > SMILES". 

*List your SMILES string(s) below within single quotes, separated by commas if multiple molecules.*

In [21]:
smiles_str_list = ['CC', 'CCC']

### Make molecule(s) from SMILES string(s)

The following command turns each SMILES string into conformations with 3D coordinates.

In [22]:
mols = [molecule(s, num_conf=5) for s in smiles_str_list]

### Visualize molecule(s)

The molecule(s) should look reasonable. If a structure seems off, there may be an issue with the OpenBabel installation (OpenBabel performs the MMFF94 optimization with conformer search).

In [1]:
for m in mols:
    draw(m.mol) 


### Initialize job manager

The job manager will manage the jobs that you currently have running or waiting to run on Hoffman2. It remembers what stage your jobs are at in a cache, so you can close the notebook, turn off your computer, or go on vacation, and later pick up right where you left off. 

Here we initialize `sge_manager` as the job manager because the Hoffman2 cluster uses an SGE/UGE scheduler.

*Replace `userID` with your Hoffman2 username.* We also specify the Hoffman2 cluster as the remote host with ```hoffman2.idre.ucla.edu```.

In [33]:
sm = sge_manager(user='userID', host='hoffman2.idre.ucla.edu')

### Connect to Hoffman2

We now create a connection to the Hoffman2 cluster. This SSH connection to the remote cluster will be used to manage jobs from within this notebook.

*Provide your password when prompted.*

In [35]:
sm.connect()

INFO:autoqchem.sge_manager:Creating connection to hoffman2.idre.ucla.edu as wang10
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_7.4)
INFO:paramiko.transport:Authentication (password) successful!
INFO:autoqchem.sge_manager:Connected to hoffman2.idre.ucla.edu as wang10.


### Create Gaussian job files

Gaussian job files for every conformation will be created locally based on provided specifications.

*Specify your desired level of theory for the DFT calculations, including the functional, basis sets, solvent, etc.*

In [43]:
for mol in mols:
    sm.create_jobs_for_molecule(mol, theory="B3LYP", light_basis_set='6-31G**', solvent='methanol')

INFO:autoqchem.gaussian_input_generator:Generating Gaussian input files for 1 conformations.


### View workflow statistics

You can view the stage your jobs are at in the automated workflow. The count of jobs classified as created, submitted, done, uploaded, etc. is tabulated by canonical SMILES string. 

In [None]:
sm.get_job_stats(split_by_can=True)

'No jobs in queue'

### Submit jobs

The following command will transfer the Gaussian input files to the Hoffman2 cluster and submit them to the queue.

In [44]:
sm.submit_jobs()

INFO:autoqchem.sge_manager:Submitting 1 jobs.
INFO:autoqchem.sge_manager:Submitted job e7e716719dcaf24de944ee06f7c5a24c, job_id: 7360980.


### View queue status

You can check whether you have any jobs running on the cluster. The `qstat` ('**q**ueue **stat**us') command will display a table of currently running jobs or notify you that no jobs are queued.

In [51]:
sm.qstat(summary=True)

'No jobs in queue'

You can continue to monitor your running jobs with ```qstat()``` or check the workflow statistics with ```get_job_stats()``` at any time.
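
For long-running batches, the check can be repeated on a timer rather than by hand. The following is a minimal sketch of such a polling loop, not part of Auto-QChem: `wait_for_empty_queue` is a hypothetical helper, and it assumes that an empty queue is reported as the string `'No jobs in queue'`, as in the output shown above. A stand-in function replaces the real `sm.qstat(summary=True)` call so the sketch runs without a cluster connection.

```python
import time


def wait_for_empty_queue(check_queue, interval_s=600, max_checks=144, sleep=time.sleep):
    """Poll check_queue() until it reports an empty queue.

    Returns the number of checks performed, or None if max_checks is reached.
    """
    for attempt in range(1, max_checks + 1):
        result = check_queue()
        # Assumption: an empty queue is reported as this exact string;
        # anything else (e.g. a table of jobs) means jobs are still present.
        if isinstance(result, str) and result == "No jobs in queue":
            return attempt
        if attempt < max_checks:
            sleep(interval_s)
    return None


# Stand-in for `lambda: sm.qstat(summary=True)` so the sketch runs offline.
fake_responses = iter(["2 jobs running", "1 job running", "No jobs in queue"])
checks_needed = wait_for_empty_queue(lambda: next(fake_responses), interval_s=0)
print(checks_needed)  # 3
```

With a live connection, the call would look like `wait_for_empty_queue(lambda: sm.qstat(summary=True))`. A generous interval keeps the load on the cluster's login node low.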

### Retrieve finished jobs

After jobs finish running, you can retrieve completed jobs from the remote cluster with the following command.

In [52]:
sm.retrieve_jobs()

INFO:autoqchem.sge_manager:There are 0 running/pending jobs, 1 finished jobs.
INFO:autoqchem.sge_manager:Retrieving log files of finished jobs.
INFO:autoqchem.sge_manager:1 jobs finished successfully (all Gaussian steps finished normally). 0 jobs failed.


### Upload completed calculations to database

Once all conformers for a molecule are done, you can upload the finished molecule to the web database.

*Label your collection of molecules with a dataset tag by replacing ```tutorial_INITIALS``` with a brief description and/or your initials.*

In [53]:
sm.upload_done_molecules_to_db(tags=["tutorial_INITIALS"])

INFO:autoqchem.sge_manager:There are 1 finished molecules ['CC'].
INFO:autoqchem.sge_manager:Molecule CC has 0 / 1 duplicate conformers.
INFO:autoqchem.sge_manager:Removing 0 / 1 jobs and log files that contain duplicate conformers.
INFO:autoqchem.sge_manager:Uploaded descriptors to DB for smiles: CC, number of conformers: 1, DB molecule id 646801973834ff8df6a882de.


### All done!

You can now find the computed descriptors for your molecules at [https://autoqchem.org](https://autoqchem.org)

***


## Additional commands

Here are some additional commands for the job manager that you may find helpful if you run into any trouble.

### Resubmit incomplete jobs

If any jobs are classified as incomplete, you can resubmit them with the following command. 

If a job has failed because the optimization did not complete and a log file has been retrieved, the last geometry will be used for the next submission. For jobs classified as failed, the job input files will need to be fixed manually and submitted using the function `sge_manager.submit_jobs_from_jobs_dict`. The same job can be submitted at most 3 times.

You may specify a wall time of the job in HH:MM:SS format (default: `wall_time="23:59:00"`).

In [None]:
sm.resubmit_incomplete_jobs()
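
As a small sketch, the wall time string can be built from numbers rather than typed by hand, which avoids malformed values such as `"24:60:00"`. `format_wall_time` is a hypothetical helper, and passing `wall_time` to `resubmit_incomplete_jobs` is inferred from the description above rather than confirmed against the library.

```python
def format_wall_time(hours, minutes=0, seconds=0):
    """Return a wall time string in HH:MM:SS format."""
    if hours < 0 or not (0 <= minutes < 60) or not (0 <= seconds < 60):
        raise ValueError("hours must be >= 0; minutes and seconds must be in 0..59")
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


wall_time = format_wall_time(12)
print(wall_time)  # 12:00:00

# With a live connection, this could then be passed along (assumed keyword):
# sm.resubmit_incomplete_jobs(wall_time=wall_time)
```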

### Remove molecule(s) from job manager

You can remove molecules from the workflow with the following command. 

It's particularly useful if you need to redo a molecule with the same Gaussian configuration before it's uploaded to the database (since the submission of duplicates isn't permitted). You can also remove done molecules from the job manager once they've been uploaded to the database.

*Specify which jobs to remove from management with either a canonical SMILES string (or list of strings) or job status classification.*

In [None]:
sm.remove_jobs(sm.get_jobs(can="CCC"))
# sm.remove_jobs(sm.get_jobs(status="done"))

### Cancel all running/queued jobs

The `_qdel` ('**q**ueue **del**ete') command removes any queued or running jobs.

Beware that all of a user's jobs will be canceled (including jobs submitted independently of Auto-QChem).

In [None]:
sm._qdel()

***


## Tips for larger batches

This workflow can readily scale up to dozens or hundreds of molecules. You will find the following tips helpful for working with larger batches of molecules.

### Importing SMILES strings in batch

A list of SMILES strings can be read from a CSV file. The following assumes that the file is named ```smiles_strings.csv```, the strings are in the first column, and there is no header row.

In [None]:
import pandas as pd
smiles_str_list = pd.read_csv("smiles_strings.csv", header=None, usecols=[0])[0].tolist()
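
Spreadsheet exports often contain stray whitespace, empty cells, or repeated rows. The following is a minimal cleanup sketch using only the standard library. `clean_smiles_list` is a hypothetical helper, and it removes only text-identical duplicates: two different SMILES strings for the same molecule (for example `OCC` and `CCO`) are not detected.

```python
def clean_smiles_list(raw_strings):
    """Strip whitespace, drop empty or missing entries, and remove exact duplicates."""
    # Non-string entries (e.g. None or NaN from empty cells) are skipped.
    stripped = (s.strip() for s in raw_strings if isinstance(s, str))
    # dict.fromkeys keeps the first occurrence of each string and preserves order.
    return list(dict.fromkeys(s for s in stripped if s))


raw = ["CC", " CCC ", "", "CC", None]
smiles_str_list = clean_smiles_list(raw)
print(smiles_str_list)  # ['CC', 'CCC']
```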

Alternatively, multiple SMILES strings can be copied from ChemDraw at once, then split into a list here.

In [2]:
my_smiles = "CC.CCC"
smiles_str_list = my_smiles.split(".")
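
One caveat with this approach: in SMILES notation a period separates disconnected components, so a single multi-component species such as a salt is also written with a period. The sketch below illustrates the effect with a hypothetical helper, `split_copied_smiles`. If your selection contains salts, it may be safer to copy those molecules individually.

```python
def split_copied_smiles(copied):
    """Split a period-joined SMILES string into its period-separated parts."""
    return [part for part in copied.split(".") if part]


print(split_copied_smiles("CC.CCC"))  # ['CC', 'CCC']

# Caveat: sodium chloride is one species but is also written with a period,
# so its two ions end up as separate list items.
print(split_copied_smiles("CC.[Na+].[Cl-]"))  # ['CC', '[Na+]', '[Cl-]']
```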

### Visualizing batch SMILES strings

You probably don't want to visualize all the molecules at once, so the following will let you look at just a few.

In [3]:
# Slicing past the end of a list is safe, so this shows at most five molecules.
for m in mols[:5]:
    draw(m.mol) 


### Other comments

You may find that switching `summary` to `False` makes `qstat` significantly quicker when there are large numbers of running/queued jobs.

Otherwise, working with large batches should be just the same!
