<a href="https://colab.research.google.com/github/architvasan/MichiganTutorialMolecularData/blob/main/ChemicalData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to chemical data

## Background information
* All molecules are built from combinations of atoms found in the periodic table!
* Rules that depend on each element's position in the periodic table dictate how elements combine to form these molecules.

\# insert image of periodic table here



## General types of molecules

#### Organic small molecules
* Organic chemistry is the study of the chemistry of carbon-containing compounds.
* Carbon is singled out because it has a chemical diversity unrivaled by any other chemical element.
* Its diversity is based on the following:
  * Carbon atoms bond reasonably strongly with other carbon atoms.
  * Carbon atoms bond reasonably strongly with atoms of other elements.
  * Carbon atoms are the building blocks of the vast majority of the molecules of life!
* In general, organic chemistry is dominated by the following elements:
  * Carbon
  * Nitrogen
  * Oxygen
  * Phosphorus
  * Sulfur
  * Hydrogen
* Organic compounds are represented in 2D by the following diagrams. Because carbon and hydrogen are so ubiquitous, carbon atoms and the hydrogens bonded to them are usually not drawn explicitly in these diagrams.

\# Insert diagram of a few organic compounds


#### Inorganic small molecules

#### Polymers

## Types of molecular data

### 2-dimensional data

Arguably the most common molecular data type is the 2-dimensional representation, in which atoms are connected to one another by bonds. This data type has both advantages and disadvantages:

\# Add 2D-graph data here

These 2D-graphs can be represented compactly using adjacency matrices and node lists:

\# add adjacency matrix
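
As a minimal sketch of this idea, the snippet below writes out a node list and an adjacency matrix by hand for acetic acid (SMILES `CC(=O)O`). The atom ordering, the `bond_list` layout, and the helper `build_adjacency_matrix` are illustrative assumptions rather than a standard format; hydrogens are left implicit.

```python
# Heavy atoms of acetic acid (SMILES: CC(=O)O); hydrogens are left implicit.
node_list = ["C", "C", "O", "O"]

# Bonds as (atom index, atom index, bond order); the layout is an assumption.
bond_list = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]


def build_adjacency_matrix(n_atoms, bonds):
    """Return a symmetric matrix whose entries are bond orders (0 = no bond)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        matrix[i][j] = order
        matrix[j][i] = order  # bonds are undirected
    return matrix


adjacency = build_adjacency_matrix(len(node_list), bond_list)
for symbol, row in zip(node_list, adjacency):
    print(symbol, row)
```

Row `i` of the matrix lists the bonds of atom `i`, so the node list carries the atom types while the matrix carries the connectivity.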

**Advantages**

* Simplicity and Clarity
* Ease of Visualization
* Compatibility with Databases
* Integration with Analytical Data

**Disadvantages**
* Lack of Spatial Information
* Stereochemistry
* Conformational Flexibility
* Computational Modeling Limitations

These data types are typically incorporated into deep learning models using graph-based architectures such as graph neural networks (GNNs) and Graph Transformers.


In [None]:
from rdkit import Chem
import networkx as nx

smiles = 'CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1'
mol = Chem.MolFromSmiles(smiles)

# Get adjacency matrix; with useBO=True the entries hold bond orders instead of 0/1
adjacency_matrix = Chem.GetAdjacencyMatrix(mol, useBO=True)

# Convert adjacency matrix to NetworkX graph
G = nx.from_numpy_array(adjacency_matrix)

### 3-dimensional data

3D chemical data offers numerous advantages over 2D data, especially in contexts where spatial information is critical.

\# add image of 3D chemical data

**Advantages**

* Accurate Representation of Molecular Geometry
* Stereochemistry
* Conformational Analysis
* Visualization

**Disadvantages**

* Computational Resource Requirements
* Data Accuracy and Quality
* Limited Availability
* Overemphasis on Static Structures

Common models used to incorporate these 3D structures are GNNs and 3D convolutional networks.
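
To make the spatial information concrete, here is a small sketch that stores 3D coordinates for a water molecule and derives a bond length and a bond angle from them. The geometry values (an O-H length of about 0.9572 Å and an H-O-H angle of about 104.52°) are approximate literature values used as assumptions, and the helpers `distance` and `angle_deg` are illustrative names.

```python
import math

# Assumed, approximate geometry for water (illustrative values).
BOND_LENGTH = 0.9572   # O-H distance in angstroms
ANGLE_DEG = 104.52     # H-O-H angle in degrees

half = math.radians(ANGLE_DEG) / 2
atoms = ["O", "H", "H"]
coords = [
    (0.0, 0.0, 0.0),
    (BOND_LENGTH * math.sin(half), BOND_LENGTH * math.cos(half), 0.0),
    (-BOND_LENGTH * math.sin(half), BOND_LENGTH * math.cos(half), 0.0),
]


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)


def angle_deg(a, center, b):
    """Angle a-center-b in degrees."""
    v1 = [x - c for x, c in zip(a, center)]
    v2 = [x - c for x, c in zip(b, center)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(dot / norm))


print("O-H distance:", distance(coords[0], coords[1]))
print("H-H distance:", distance(coords[1], coords[2]))
print("H-O-H angle:", angle_deg(coords[1], coords[0], coords[2]))
```

None of these quantities can be read off a 2D graph, which only records that each hydrogen is bonded to the oxygen.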


## Sequential data types

In addition to 2-D and 3-D data types, molecules can also be represented as text! Since molecules have no obvious grammar to their structures, chemical languages have been developed with their own rules to account for different chemical information such as atom types, bonds, rings, and charges.

## Simplified molecular-input line-entry system (SMILES)

SMILES (Simplified molecular-input line-entry system) is a line notation for representing molecules as well as reactions. It is one of the most common methods for representing molecules because it is simple and readable to the human eye.

Examples:

Propane: CCC

Butane:  CCCC

Ethene:  C=C

### Atoms

All non-Hydrogen atoms are represented by their atomic symbols. Any unfilled valence of an atom is assumed to be filled by Hydrogen. For example, writing a bare C actually means CH4 (Methane), not elemental Carbon. Similarly, N is NH3 (Ammonia) and O is H2O (Water).

To represent elemental atoms, a [ ] (Square bracket) notation is used. For example, [S] is elemental Sulfur. In case you want to explicitly add the Hydrogens to a SMILES string, the square bracket can be used here as well. For example, Methane and Ethane can be written as [CH4] and [CH3][CH3] respectively.
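
The implicit-hydrogen rule can be sketched in a few lines. The table `NORMAL_VALENCES` and the function `implicit_hydrogens` below are illustrative assumptions: they cover only neutral atoms written without brackets and use the common convention of filling up to the lowest normal valence that accommodates the existing bonds. Real toolkits handle many more cases.

```python
# Normal valences for atoms that can be written without brackets
# (assumed table for illustration; charged and bracket atoms are not covered).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Hydrogens implied for an unbracketed atom with the given bond order sum."""
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bonds already exceed every normal valence


print(implicit_hydrogens("C", 0))  # bare C, i.e. methane
print(implicit_hydrogens("N", 0))  # bare N, i.e. ammonia
print(implicit_hydrogens("O", 0))  # bare O, i.e. water
print(implicit_hydrogens("C", 2))  # each carbon in C=C
```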

### Bonds

Single, double, and triple bonds are represented by the symbols -, =, and #, respectively. Like Hydrogens, single bonds are often omitted for simplicity: adjacent atoms are assumed to be connected by a single or aromatic bond.

The same molecule can often be written as several different SMILES strings. For example, all of the following notations are correct for Ethane: CC, C-C, [CH3]-[CH3], [CH3]-C.

### Charged molecules

Charged atoms or molecules are also written with the square bracket notation. A positive charge is represented by a + sign and a negative charge by a - sign. For example, the ammonium ion is written [NH4+].
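
As a rough sketch of how a bracket atom can be read, the snippet below pulls the element symbol, explicit Hydrogen count, and formal charge out of a single bracket token with a regular expression. The pattern `BRACKET_ATOM` and the function `parse_bracket_atom` are hypothetical helpers written for this tutorial; they deliberately ignore isotopes, chirality marks, aromatic lowercase symbols, and atom classes.

```python
import re

# Simplified bracket-atom pattern (assumption: no isotopes, chirality, or classes).
BRACKET_ATOM = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?)"       # element symbol, e.g. N, Na, Fe
    r"(?:H(?P<hcount>\d*))?"           # optional explicit hydrogens, e.g. H or H4
    r"(?P<charge>\+\d+|-\d+|\++|-+)?"  # optional charge, e.g. +, -, +2, ++
    r"\]"
)


def parse_bracket_atom(token):
    """Return (symbol, hydrogen count, formal charge) for one bracket atom."""
    match = BRACKET_ATOM.fullmatch(token)
    if match is None:
        raise ValueError(f"unsupported bracket atom: {token!r}")
    hcount = match.group("hcount")
    n_hydrogens = 0 if hcount is None else int(hcount or 1)
    charge_text = match.group("charge") or ""
    if not charge_text:
        charge = 0
    else:
        sign = 1 if charge_text[0] == "+" else -1
        digits = charge_text[1:]
        # "+2" carries an explicit magnitude; "++" counts the repeated signs.
        magnitude = int(digits) if digits.isdigit() else len(charge_text)
        charge = sign * magnitude
    return match.group("symbol"), n_hydrogens, charge


for token in ["[S]", "[CH4]", "[NH4+]", "[OH-]", "[Na+]"]:
    print(token, parse_bracket_atom(token))
```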

### Branched molecules

Parentheses are used to create a branch in the SMILES string. For example, isobutane is written CC(C)C.

### Cyclic (ring) structures

Ring structures are written by breaking each ring at an arbitrary point to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.

If a molecule has two rings, one bond is broken in each ring and each ring closure is given its own number.
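
The branch and ring-closure rules can be illustrated with a toy parser. The function `parse_simple_smiles` below is a hypothetical sketch, not how RDKit or any other toolkit parses SMILES: it assumes a restricted subset (unbracketed atoms, the bond symbols -, =, #, parentheses, and single-digit ring closures) and treats the implicit bond between two lowercase atoms as order 1.5.

```python
BOND_ORDERS = {"-": 1, "=": 2, "#": 3}


def default_order(a, b):
    """Implicit bond: aromatic (1.5) between two lowercase atoms, else single."""
    return 1.5 if a.islower() and b.islower() else 1


def parse_simple_smiles(smiles):
    """Parse a restricted SMILES subset into (atoms, bonds)."""
    atoms, bonds = [], []   # bonds are (i, j, order) tuples
    branch_stack = []       # atoms to return to when a branch closes
    open_rings = {}         # ring digit -> (atom index, explicit order or None)
    prev = None             # atom the next atom will be bonded to
    pending = None          # explicit bond order waiting for its second atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDERS:
            pending = BOND_ORDERS[ch]
        elif ch == "(":
            branch_stack.append(prev)   # remember the branch point
        elif ch == ")":
            prev = branch_stack.pop()   # jump back to the branch point
        elif ch.isdigit():
            if prev is None:
                raise ValueError("ring closure before any atom")
            if ch in open_rings:        # second occurrence closes the ring
                start, start_order = open_rings.pop(ch)
                order = pending if pending is not None else start_order
                if order is None:
                    order = default_order(atoms[start], atoms[prev])
                bonds.append((start, prev, order))
            else:                       # first occurrence opens the ring
                open_rings[ch] = (prev, pending)
            pending = None
        elif ch.isalpha():
            symbol = smiles[i:i + 2] if smiles[i:i + 2] in ("Cl", "Br") else ch
            i += len(symbol) - 1
            atoms.append(symbol)
            index = len(atoms) - 1
            if prev is not None:
                order = pending if pending is not None else default_order(atoms[prev], symbol)
                bonds.append((prev, index, order))
            prev, pending = index, None
        else:
            raise ValueError(f"unsupported character: {ch!r}")
        i += 1
    return atoms, bonds


print(parse_simple_smiles("CC(C)C"))    # branch
print(parse_simple_smiles("C1CCCCC1"))  # ring closure between first and last atom
```

The stack handles branches because a closing parenthesis returns to the atom where the branch started, and the dictionary handles rings because each digit links two atoms that are not adjacent in the string.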

### Aromatic structures

The preferred way to represent aromatic structures is to write aromatic atoms as lowercase letters. For example, aromatic Carbon is written as c, Nitrogen as n, Boron as b, and so on.

Example: benzene

Benzene can be written as c1ccccc1. Here, adjacent atoms are not assumed to be connected by a single bond. Instead, the lowercase letters tell us that this is an aromatic ring, conventionally drawn with alternating single and double bonds.

### Disconnected structures

The ions in an ionic compound are not connected to each other by covalent bonds. Disconnected compounds are written as individual structures separated by a . (period). For example, Sodium Hydroxide in its ionized form is written as [Na+].[OH-].

In [None]:
import pubchempy as pcp

# Look up compounds by name on PubChem (requires network access)
results = pcp.get_compounds('Glucose', 'name')
for compound in results:
    print(compound.isomeric_smiles)

## SELFIES
A limitation of SMILES strings is that they have a fairly complex grammar. When machine learning models generate SMILES, many of the output strings can be syntactically or chemically invalid.
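
To see how easily a SMILES string becomes invalid, consider a simple syntax check. The function `has_balanced_syntax` below is an illustrative sketch that only tests two necessary conditions (balanced parentheses and paired single-digit ring closures); passing it does not guarantee a chemically valid molecule, and two-digit `%nn` ring closures are not handled.

```python
def has_balanced_syntax(smiles):
    """Check parentheses, brackets, and single-digit ring closures for pairing."""
    depth = 0            # current parenthesis nesting depth
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # True while inside [ ... ]
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue     # digits in brackets are H counts or charges, not rings
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False   # branch closed before it was opened
        elif ch.isdigit():
            open_rings.symmetric_difference_update({ch})  # toggle open/closed
    return depth == 0 and not open_rings and not in_bracket


print(has_balanced_syntax("c1ccccc1"))  # intact benzene ring
print(has_balanced_syntax("c1ccccc"))   # last character dropped: ring never closes
print(has_balanced_syntax("CC(C"))      # branch never closes
```

Deleting or changing a single character is enough to break the pairing, which is one reason generative models that emit SMILES character by character often produce unusable strings.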

An alternative is SELF-referencIng Embedded Strings (SELFIES), a representation that is claimed to be 100% robust. Even entirely random SELFIES strings represent valid molecular graphs, which makes the language well suited to generative machine learning models. The SELFIES grammar can be thought of as an elementary computer program that a simple compiler transforms into a molecular graph.

SELFIES are designed with two general ideas:

First, the non-local features in SMILES (rings and branches) are localized. SELFIES represents rings and branches by their length. After a ring and branch symbol, the subsequent symbol is interpreted as a number that stands for a length. This circumvents many syntactical issues with non-local features.

Second, physical constraints are encoded by different states of the formal grammar. For example, a molecule of the form C=C=C is possible (three carbons connected via double bonds). However, F=O=F is not possible, because fluorine can only form one bond (not two) and oxygen can only form two bonds (not four as in this example). In SELFIES, after compiling a symbol into a part of the graph, the derivation state changes. This can be considered as a minimal memory that ensures the fulfilment of physical constraints.
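
The "derivation state as memory" idea can be sketched with a toy chain builder. This is not the actual SELFIES algorithm or the `selfies` library API: the table `MAX_BONDS` and the function `toy_constrained_chain` are assumptions made for illustration, and only linear chains without rings or branches are handled. The state is simply the number of bonds the previous atom can still form, and every requested bond order is clipped to fit it.

```python
# Maximum number of bonds per element (assumed values for neutral atoms).
MAX_BONDS = {"H": 1, "F": 1, "Cl": 1, "Br": 1, "I": 1, "O": 2, "N": 3, "C": 4}
BOND_PREFIX = {"=": 2, "#": 3}


def toy_constrained_chain(symbols):
    """Build a linear chain, clipping bond orders so no valence is exceeded."""
    atoms, bond_orders = [], []
    remaining = None  # the "state": bonds the previous atom can still form
    for token in symbols:
        body = token.strip("[]")
        requested = BOND_PREFIX.get(body[0], 1)
        element = body.lstrip("=#")
        capacity = MAX_BONDS[element]
        if remaining is None:          # first atom: nothing to bond to yet
            atoms.append(element)
            remaining = capacity
            continue
        if remaining == 0:             # previous atom is saturated: stop
            break
        order = min(requested, remaining, capacity)
        atoms.append(element)
        bond_orders.append(order)
        remaining = capacity - order   # update the state for the next symbol
    return atoms, bond_orders


print(toy_constrained_chain(["[C]", "[=C]", "[=C]"]))  # double bonds are kept
print(toy_constrained_chain(["[F]", "[=O]", "[=F]"]))  # double bonds are clipped
```

In this sketch the impossible request F=O=F is reduced to single bonds because the state never allows fluorine or oxygen to exceed its bond limit, so every input sequence yields a chain that respects the assumed valences.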

\# Put picture of SELFIES here.

In [None]:
! pip install selfies

In [None]:
import selfies as sf

benzene = "c1ccccc1"

# SMILES -> SELFIES -> SMILES translation
try:
    benzene_sf = sf.encoder(benzene)  # [C][=C][C][=C][C][=C][Ring1][=Branch1]
    benzene_smi = sf.decoder(benzene_sf)  # C1=CC=CC=C1
except sf.EncoderError:
    pass  # sf.encoder error!
except sf.DecoderError:
    pass  # sf.decoder error!

len_benzene = sf.len_selfies(benzene_sf)  # 8

symbols_benzene = list(sf.split_selfies(benzene_sf))
# ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']