# Cleaning and canonicalizing the DB


Author: AlvaroVM [https://alvarovm.github.io](http://alvarovm.github.io)
Version: 0.0.1

## Example 1: PCA to distinguish between rings and chains

In this example we define, as SMILES strings, two groups of six-carbon molecules: 1) rings and 2) chains, each carrying different substituents such as -CH3, -O, -F, -Cl, and -I. The molecules are added to a list, and each one also gets an extra property that can be used later as a flag.
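
A minimal sketch of the setup described above, written in plain Python so that it runs without RDKit. The cyclohexane and n-hexane scaffolds, the way the substituent is attached, and the `is_ring` flag name are illustrative assumptions, not taken from the original notebook.

```python
# Substituents written as SMILES fragments (illustrative choices).
substituents = {"CH3": "C", "O": "O", "F": "F", "Cl": "Cl", "I": "I"}

# Assumed six-carbon scaffolds: a cyclohexane ring and an n-hexane chain.
RING = "C1CCCCC1"
CHAIN = "CCCCCC"

molecules = []
for name, frag in substituents.items():
    # Prepending the fragment bonds it to the first carbon of the scaffold.
    molecules.append({"smiles": frag + RING, "substituent": name, "is_ring": True})
    molecules.append({"smiles": frag + CHAIN, "substituent": name, "is_ring": False})

for entry in molecules:
    print(entry)
```

Each SMILES string can then be parsed with `Chem.MolFromSmiles` from RDKit, and the `is_ring` flag (a hypothetical name) can be used to color the points once the molecules are projected.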

In [28]:
import sys
import os
SRC_DIR='..'

In [29]:
sys.path.append(os.path.join(SRC_DIR, 'code'))
import utils

In [30]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
from pandas.plotting import scatter_matrix
# Optional: swifter (https://github.com/jmcarpenter2/swifter) can speed up DataFrame.apply
# import swifter

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.rdMolDescriptors import GetHashedMorganFingerprint
from rdkit.DataStructs import ConvertToNumpyArray

from sklearn.manifold import TSNE

import hdbscan

utils.plot_settings2()

results_path = os.path.join(SRC_DIR,'results')

In [31]:
# Load the database; missing values are replaced by 0
db_path = os.path.join(SRC_DIR, 'data', 'extended_db_Zindo_Nov_2019_V5_cannfp.pkl')
df = pd.read_pickle(db_path).fillna(value=0)
print('Column names: {}'.format(str(df.columns.tolist())))
print('Table Shape: {}'.format(df.shape))


Column names: ['smiles', 'inchikey', 'smi_pre', 'smi_post', 'lambda_sTDA (nm)', 'f1_sTDA', 'lumo_dft', 'homo_dft', 'dmom_dft (D)', 'lambda_z (nm)', 'f1_z', 'lumo_z', 'homo_z', 'dmom_z (D)', 'lumo_mopac', 'homo_mopac', 'dmom_mopac (D)', 'lambda_tddft (nm)', 'f1_tddft', 'lambda_exp_max (nm)', 'epsilon_exp_max ', 'lambda_exp_min (nm)', 'epsilon_exp_min ', 'solvent', 'nogood', 'nogoodpost', 'mol', 'morganfps', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRadicalElectrons', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'NumValenceElectrons']
Table Shape: (9870, 45)
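
Because the table is loaded with `.fillna(value = 0)`, missing entries become zeros, so `df.count()` reports the full row count for every column. The sketch below shows one way to count the non-zero entries instead. It runs on a toy frame whose values are invented purely for illustration; only the column names come from the table above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the numbers are invented for illustration.
toy = pd.DataFrame({
    "lambda_sTDA (nm)": [310.0, np.nan, 455.0],
    "lambda_exp_max (nm)": [np.nan, np.nan, 470.0],
})

filled = toy.fillna(value=0)

# After fillna, count() no longer reveals the missing entries.
print(filled.count())

# Counting non-zero entries recovers the number of values actually present.
available = (filled != 0).sum()
print(available)
```

Note that this approach also treats a genuine value of exactly zero as missing, which may or may not be acceptable depending on the column.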


### Exercises
* Use `df.count()` and `df.hist()` to get an overview of the dataset.
* Find the molecules with the largest `lambda_sTDA (nm)`, for example those with values above 630 nm. Do they have anything in common?
* Find the molecules with the smallest `lambda_sTDA (nm)`, for example those with values below 200 nm. Do they have anything in common?
* Plot the distributions of `lambda_exp_min (nm)` and `lambda_exp_max (nm)` with `sns.distplot()` (deprecated in recent seaborn releases in favor of `sns.histplot()` or `sns.displot()`).
* Compute the difference between `lambda_sTDA (nm)` and `lambda_exp_min (nm)`, and plot the distribution of this difference.
* Plot `lambda_exp_min (nm)` vs `lambda_exp_max (nm)`.
* Plot `lambda_exp_min (nm)` vs `lambda_sTDA (nm)`.
* Plot the distribution of the molecules that absorb light in the UV/Vis range, e.g. 200 < `lambda_sTDA (nm)` < 800.
* Compare and plot the difference between `lambda_sTDA (nm)` and `lambda_z (nm)`.
* Use `scatter_matrix` to find the relation among the excitation energies predicted by the different methods: 'gapdft', 'gapz', 'gapmopac', 'lambda_z (nm)', 'lambda_sTDA (nm)', 'lambda_tddft (nm)', 'lambda_exp_max (nm)', 'lambda_exp_min (nm)'. Which values correlate best? (The 'gap*' columns are not in the table as loaded and need to be computed first, presumably from the corresponding 'lumo_*' and 'homo_*' columns.)
* Use `scatter_matrix` to find the relation among the absorption intensities given by the different methods: 'f1_sTDA', 'f1_z', 'f1_tddft', 'epsilon_exp_max ' (note the trailing space in the last column name).
* Make a bar plot of the distribution of the number of aromatic rings (`NumAromaticRings`) with `sns.barplot`, using `df['NumAromaticRings'].value_counts()`.
* Make a bar plot of the distribution of the number of aromatic heterocycles (`NumAromaticHeterocycles`) with `sns.barplot`.
* Compare the absorption 'f1_sTDA' with 'NumAromaticRings'.
* Make a scatter plot that compares 'gapdft' with 'lambda_tddft (nm)' and color the points by 'NumAromaticRings'.
* Find the systems with more than 20 aromatic rings. Do they have anything in common? Do they absorb more light or have a darker color?
* Find the systems with more than 10 aromatic rings that have non-zero values in both 'lambda_z (nm)' and 'lambda_exp_min (nm)'. Do those values correlate?
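
As a possible starting point for several of the exercises, the sketch below runs on a small toy table so that it is self-contained; on the real data, `toy` would be replaced by `df`. All numeric values are invented for illustration. Defining the gap as LUMO minus HOMO is an assumption about how 'gapdft' is meant to be built, and `diff_sTDA_exp` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for df; all values are invented for illustration.
toy = pd.DataFrame({
    "lambda_sTDA (nm)": [180.0, 420.0, 650.0, 510.0],
    "lambda_exp_min (nm)": [0.0, 400.0, 0.0, 530.0],
    "homo_dft": [-7.1, -5.9, -5.0, -5.4],
    "lumo_dft": [-0.5, -2.0, -2.9, -2.4],
    "NumAromaticRings": [0, 2, 4, 2],
})

# Assumed definition: HOMO-LUMO gap as LUMO minus HOMO.
toy["gapdft"] = toy["lumo_dft"] - toy["homo_dft"]

# Molecules at the long- and short-wavelength ends.
red_end = toy[toy["lambda_sTDA (nm)"] > 630]
uv_end = toy[toy["lambda_sTDA (nm)"] < 200]

# Difference between computed and experimental wavelengths, only where an
# experimental value exists (zeros mark missing data after fillna).
has_exp = toy[toy["lambda_exp_min (nm)"] != 0].copy()
has_exp["diff_sTDA_exp"] = has_exp["lambda_sTDA (nm)"] - has_exp["lambda_exp_min (nm)"]

# Number of molecules per aromatic-ring count, sorted by ring count.
ring_counts = toy["NumAromaticRings"].value_counts().sort_index()

print(red_end.index.tolist(), uv_end.index.tolist())
print(has_exp["diff_sTDA_exp"].tolist())
print(ring_counts.to_dict())
```

The `ring_counts` series can be passed to `sns.barplot(x=ring_counts.index, y=ring_counts.values)`, and the same boolean-mask pattern extends to the filters on the number of aromatic rings.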