## Installation

In [None]:
!pip install git+https://github.com/denovochem/cholla_chem.git@v0.1.0

In [None]:
from cholla_chem import (
    OpsinNameResolver,
    PubChemNameResolver,
    CIRpyNameResolver,
    resolve_compounds_to_smiles,
)

## Basic usage
Provide a list of chemical names. By default, `resolve_compounds_to_smiles` returns a dictionary of {name: smiles} pairs. If `detailed_name_dict` is set to `True`, it instead returns a dictionary with information about the SMILES that each resolver returned.
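
As a minimal sketch of working with the default {name: smiles} output, the snippet below separates resolved names from unresolved ones. It uses a hard-coded stand-in dictionary so it runs without network access. The helper `split_resolved` is hypothetical, and the idea that an unresolved name maps to an empty value is an assumption, not documented behavior.

```python
# Stand-in for the default {name: smiles} return value, hard-coded so the
# sketch runs offline. The SMILES shown is one valid way to write aspirin;
# the string actually returned by the library may be written differently.
resolved_smiles = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "not a real compound": "",  # assumption: unresolved names map to an empty value
}


def split_resolved(results):
    """Hypothetical helper: separate names with a SMILES from names without one."""
    found = {name: smi for name, smi in results.items() if smi}
    missing = [name for name, smi in results.items() if not smi]
    return found, missing


found, missing = split_resolved(resolved_smiles)
print(found)
print(missing)
```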

In [None]:
resolved_smiles = resolve_compounds_to_smiles(compounds_list=['aspirin'])

resolved_smiles

In [None]:
resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['2-acetyloxybenzoic acid'], 
    detailed_name_dict=True
)

resolved_smiles

## Customizing resolvers
Initialize the resolvers you want to use, assigning each a resolver name and weight. Pass them as a list to `resolve_compounds_to_smiles` via the `resolvers_list` argument.

In [None]:
opsin_resolver = OpsinNameResolver(
    resolver_name='opsin', 
    resolver_weight=4
)
pubchem_resolver = PubChemNameResolver(
    resolver_name='pubchem', 
    resolver_weight=3
)
cirpy_resolver = CIRpyNameResolver(
    resolver_name='cirpy', 
    resolver_weight=2
)

resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['2-acetyloxybenzoic acid'],
    resolvers_list=[opsin_resolver, pubchem_resolver, cirpy_resolver],
    smiles_selection_mode='weighted',
    detailed_name_dict=True
)

resolved_smiles

## SMILES selection
The `smiles_selection_mode` parameter controls how the best SMILES is chosen when the resolvers return more than one candidate. By default, a weighted consensus is used, where each resolver is assigned a weight based on its reliability. The available modes are:

- 'consensus': Pick the SMILES string returned by the most resolvers. Tie-breaker: lexicographical order.
- 'ordered': Pick the SMILES from the highest-priority resolver that returned one. Priority follows the order of the resolvers in the `resolvers_list` argument of `resolve_compounds_to_smiles` (highest to lowest).
- 'weighted': Assign weights to resolvers. Sum weights per SMILES. Pick highest total. Custom weights can be assigned at resolver initialization. See Resolvers for default weights.
- 'rdkit_standardized': Pick the SMILES that is most standardized by RDKit. Penalizes SMILES with more fragments, formal charges, radicals, and isotopes.
- 'fewest_fragments': Pick the SMILES with the fewest fragments (separated by '.').
- 'longest_smiles': Pick the longest SMILES.
- 'shortest_smiles': Pick the shortest SMILES.
- 'random': Pick a random SMILES.
- 'highest_symmetry': Pick the SMILES with the highest symmetry.
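
The 'consensus', 'weighted', and 'ordered' modes described above can be sketched in plain Python. This is an illustration of the selection logic only, not cholla_chem's implementation: the function names, the resolver names `a`, `b`, `c`, and their weights are hypothetical, the candidates are compared as raw strings, and the lexicographic tie-break in the weighted case is an assumption (the list above states it only for 'consensus').

```python
from collections import Counter, defaultdict


def pick_consensus(candidates):
    """Most frequently returned SMILES wins; ties go to the lexicographically smallest."""
    counts = Counter(candidates.values())
    return min(counts, key=lambda smi: (-counts[smi], smi))


def pick_weighted(candidates, weights):
    """Sum resolver weights per SMILES and return the highest total.

    Tie-break by lexicographic order is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for resolver, smi in candidates.items():
        totals[smi] += weights[resolver]
    return min(totals, key=lambda smi: (-totals[smi], smi))


def pick_ordered(candidates, priority):
    """Return the SMILES from the first resolver in `priority` that produced one."""
    for resolver in priority:
        if resolver in candidates:
            return candidates[resolver]
    return None


# Hypothetical resolver outputs: resolver name -> SMILES it returned.
candidates = {"a": "CCO", "b": "CO", "c": "CO"}
weights = {"a": 4, "b": 2, "c": 1}  # hypothetical weights

print(pick_consensus(candidates))                 # CO  (returned by two resolvers)
print(pick_weighted(candidates, weights))         # CCO (weight 4 beats 2 + 1)
print(pick_ordered(candidates, ["a", "b", "c"]))  # CCO (resolver "a" has top priority)
```

The example is chosen so that the modes disagree: two low-weight resolvers outvote one high-weight resolver under 'consensus', but not under 'weighted'.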

In [None]:
resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['aspirin'], 
    smiles_selection_mode='random'
)

resolved_smiles

## Name correction
cholla_chem can automatically correct common OCR errors, typos, mojibake, and pagination errors in chemical names. It can also resolve names that use common delimiters to denote mixtures or combinations of compounds, and expand peptide shorthand into full names for easier resolution.
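
To make two of these corrections concrete, here is a minimal standard-library sketch of how mojibake arises and can be reversed, and of splitting a name on a mixture delimiter. The helpers `fix_mojibake` and `split_mixture` are hypothetical and are not part of cholla_chem; the library's own correction logic may work differently.

```python
import re


def fix_mojibake(text, wrong="windows-1252", right="utf-8"):
    """Reverse text that was UTF-8 but got decoded with the wrong codec.

    Returns the input unchanged if the round trip is not possible.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def split_mixture(name):
    """Split a name on bullet or middle-dot delimiters (a small, assumed delimiter set)."""
    return [part for part in re.split(r"\s*[•·]\s*", name) if part]


print(fix_mojibake("Î±-Terpineol"))  # α-Terpineol
print(split_mixture("BH3•THF"))      # ['BH3', 'THF']
```

The Greek letter α is stored in UTF-8 as the two bytes `CE B1`; reading those bytes as Windows-1252 yields `Î±`, which is why re-encoding and decoding recovers the original character.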

In [None]:
## Resolving mojibake - common when text is decoded with the wrong encoding, e.g. utf-8 read as windows-1252
resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['Î±-Terpineol'],
    normalize_unicode=True,
    detailed_name_dict=True
)

resolved_smiles

In [None]:
## Name correction (l->1, Z->2 OCR errors)
resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['l-mercapto-Z-thiapropane'],
    attempt_name_correction=True,
    detailed_name_dict=True
)

resolved_smiles

In [None]:
## Peptide shorthand expansion
resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['cyclo(Asp-Arg-Val-Tyr-Ile-His-Pro-Phe)'],
    resolve_peptide_shorthand=True,
    detailed_name_dict=True
)

resolved_smiles

In [None]:
## Splitting on delimiters
resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['BH3•THF'],
    split_names_to_solve=True,
    detailed_name_dict=True
)

resolved_smiles

## Command line interface
cholla_chem also provides a command line interface. Compound names can be passed either as quoted strings or via a file.
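
For the file-based invocation, an input file can be created from Python before calling the CLI. The sketch below writes one compound name per line to `names.txt` in the current working directory. The one-name-per-line layout is an assumption about the expected input format, so check the project documentation before relying on it.

```python
from pathlib import Path

# Names to resolve; the same three used in the quoted-string example.
names = ["aspirin", "ibuprofen", "sodium chloride"]

# Assumption: the CLI expects one compound name per line.
input_path = Path("names.txt")
input_path.write_text("\n".join(names) + "\n", encoding="utf-8")

# Show where the file was written, to confirm it sits next to the notebook.
print(input_path.resolve())
```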

In [None]:
!cholla-chem "aspirin" "ibuprofen" "sodium chloride"

In [None]:
## Note - names.txt needs to be in the same directory as this notebook
!cholla-chem --input names.txt --output results.csv