# Cleaning and canonicalizing the DB


Author: AlvaroVM [https://alvarovm.github.io](http://alvarovm.github.io)
Version: 0.0.1

## Example 1: PCA to distinguish between rings and chains

For this example we define, as SMILES strings, two groups of molecules with different substituents (-CH3, -O, -F, -Cl, and -I) on six-carbon scaffolds: 1) in a ring and 2) in a chain. The molecules are added to a list, and each one is given an additional 'certain' property that can be used later as a flag.
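
A minimal sketch of such a list is shown below. It assumes cyclohexane and n-hexane as the six-carbon scaffolds and a hypothetical flag named `is_ring`; the scaffolds and property name actually intended for this example may differ.

```python
# Assumption: saturated six-carbon scaffolds (cyclohexane ring, n-hexane chain).
# Keys are substituent labels from the text, values are their SMILES atoms.
substituents = {"CH3": "C", "O": "O", "F": "F", "Cl": "Cl", "I": "I"}

molecules = []
for label, atom in substituents.items():
    # Ring: the substituent is attached to one carbon of cyclohexane.
    molecules.append({"smiles": f"{atom}C1CCCCC1", "substituent": label, "is_ring": 1})
    # Chain: the same substituent on the terminal carbon of n-hexane.
    molecules.append({"smiles": f"{atom}CCCCCC", "substituent": label, "is_ring": 0})

# The hypothetical 'is_ring' property serves as the flag mentioned above.
n_rings = sum(m["is_ring"] for m in molecules)
n_chains = len(molecules) - n_rings
```

The flag can later be used to color or split the points of a projection by group.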

In [None]:
import sys
import os
SRC_DIR='../..'

In [None]:
sys.path.append(os.path.join(SRC_DIR, 'code'))
import utils

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
from pandas.plotting import scatter_matrix
#https://github.com/jmcarpenter2/swifter
#import swifter
#2-TSNE-UMAP-map-cuda-Copy1

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs 
from rdkit.Chem import Draw
from rdkit.Chem.rdMolDescriptors import  GetHashedMorganFingerprint
from rdkit.DataStructs import ConvertToNumpyArray

from sklearn.manifold import TSNE

import hdbscan

utils.plot_settings2()

results_path = os.path.join(SRC_DIR,'results')

In [None]:
df = pd.read_pickle('../../data/extended_db_Zindo_Nov_2019_V5_cannfp.pkl').fillna(value = 0)
print('Column names: {}'.format(str(df.columns.tolist())))
print('Table Shape: {}'.format(df.shape))


### Exercises
* Use `df.count()` and `df.hist()` to get an idea of the dataset
* Find the molecules with the largest `lambda_sTDA (nm)`, for example those with values above 630 nm. Do they have anything in common?
* Find the molecules with the smallest `lambda_sTDA (nm)`, for example those with values below 200 nm. Do they have anything in common?
* Plot a distribution function with `sns.distplot()` for `lambda_exp_min (nm)` and `lambda_exp_max (nm)`
* Compute the difference between `lambda_sTDA (nm)` and `lambda_exp_min (nm)`, and plot the distribution of this difference
* Plot `lambda_exp_min (nm)` vs `lambda_exp_max (nm)`
* Plot `lambda_exp_min (nm)` vs `lambda_sTDA (nm)`
* Plot the distribution function of the molecules that absorb light in the UV/Vis range, e.g. 200 < `lambda_sTDA (nm)` < 800
* Compare and plot the difference between `lambda_sTDA (nm)` and `lambda_z (nm)`
* Use `scatter_matrix` to find the relation among the excitation energies predicted with the methods 'gapdft', 'gapz', 'gapmopac', 'lambda_z (nm)', 'lambda_sTDA (nm)', 'lambda_tddft (nm)', 'lambda_exp_max (nm)', 'lambda_exp_min (nm)'. Which values correlate better?
* Use `scatter_matrix` to find the relation among the absorption values obtained with the methods 'f1_sTDA', 'f1_ZINDO', 'f1_TDDFT', 'ε_Exp_max'
* Make a bar plot of the distribution of the number of rings (`NumAromaticRings`) using `sns.barplot`; use `df['NumAromaticRings'].value_counts()`
* Make a bar plot of the distribution of the number of rings (`NumAromaticHeterocycles`) using `sns.barplot`
* Compare the absorption 'f1_sTDA' with 'NumAromaticRings'
* Make a scatter plot that compares 'gapdft' with 'lambda_tddft (nm)' and color the points by 'NumAromaticRings'
* Find the systems with more than 20 aromatic rings. Do they have anything in common? Do they absorb more light or have a darker color?
* Find the systems with more than 10 aromatic rings and non-zero values in 'lambda_z (nm)' and 'lambda_exp_min (nm)'. Do those values correlate?

### Use df.count() and df.hist() to get an idea of the dataset

In [None]:
df.count()

In [None]:
df.hist(bins=50, figsize=(15,15))
plt.show()
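
Because the table was loaded with `.fillna(value = 0)`, `df.count()` reports the same number of entries for every column, so missing measurements are hidden as zeros. The sketch below, on made-up values rather than the real dataset, shows one way to count only the non-zero entries. It assumes that a zero never represents a genuine measurement in these columns.

```python
import numpy as np
import pandas as pd

# Toy table with made-up wavelengths; NaN marks a missing experimental value.
toy = pd.DataFrame({
    "lambda_sTDA (nm)": [250.0, 410.0, 655.0, 180.0],
    "lambda_exp_min (nm)": [np.nan, 400.0, np.nan, np.nan],
})
filled = toy.fillna(value=0)

# After fillna, count() no longer distinguishes missing from present values.
counts_after_fill = filled.count()

# Counting non-zero entries recovers how many real values each column holds.
nonzero_counts = (filled != 0).sum()
```

On the real table, `(df != 0).sum()` gives a quick view of which columns are sparsely populated before comparing methods.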

### Find the molecules with the largest lambda_sTDA (nm), for example those with values above 630 nm. Do they have anything in common?

In [None]:
#tag='lambda_exp_max (nm)'
dfc=df.copy()
tag='lambda_sTDA (nm)'
dfc=dfc[dfc['lambda_sTDA (nm)']<1200]
dfc=dfc[dfc['lambda_sTDA (nm)']>630]
print('Table Shape: {}'.format(dfc.shape))

In [None]:
mollist=dfc.mol.tolist()
Draw.MolsToGridImage(mollist, legends=['stda={0:.1f} / emin={1:.1f} '.format(dfc['lambda_sTDA (nm)'][x],dfc['lambda_exp_min (nm)'][x]) for x, row in dfc.iterrows()])
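
The two successive filters above can also be written as a single boolean mask. The sketch below uses a toy table with made-up values and hypothetical row labels, and sorts the selection so the largest wavelengths come first.

```python
import pandas as pd

# Toy table; the values and index labels are illustrative only.
toy = pd.DataFrame(
    {"lambda_sTDA (nm)": [150.0, 640.0, 700.0, 1500.0, 631.0]},
    index=["a", "b", "c", "d", "e"],
)
col = "lambda_sTDA (nm)"

# Combine both bounds with & (each condition needs its own parentheses).
mask = (toy[col] > 630) & (toy[col] < 1200)

# Sort descending so the molecules with the largest wavelength are listed first.
selected = toy[mask].sort_values(col, ascending=False)
```

Sorting before drawing the grid makes it easier to see whether the structures change systematically with wavelength.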

### Find the molecules with the smallest lambda_sTDA (nm), for example those with values below 200 nm. Do they have anything in common?

In [None]:
dfc=df.copy()
tag='lambda_sTDA (nm)'
dfc=dfc[dfc['lambda_sTDA (nm)']<200]
dfc=dfc[dfc['lambda_sTDA (nm)']>10]
print('Table Shape: {}'.format(dfc.shape))
mollist=dfc.mol.tolist()
Draw.MolsToGridImage(mollist, legends=['stda={0:.1f} / emin={1:.1f} '.format(dfc['lambda_sTDA (nm)'][x],dfc['lambda_exp_min (nm)'][x]) for x, row in dfc.iterrows()])


### Plot a distribution function with sns.distplot() for lambda_exp_min (nm) and lambda_exp_max (nm)

In [None]:
import seaborn as sns
plt.figure(figsize=(6,4))
sns.distplot( df['lambda_exp_max (nm)'])
sns.distplot( df['lambda_exp_min (nm)'])
plt.show()

### Compute the difference between lambda_exp_max (nm) and lambda_exp_min (nm), and plot the distribution of this difference

In [None]:
df['diffminmax']=df['lambda_exp_max (nm)']-df['lambda_exp_min (nm)']
plt.figure(figsize=(6,4))
sns.distplot(df[df['diffminmax'] > 1]['diffminmax'])
plt.show()


### Compute the difference between lambda_sTDA (nm) and lambda_exp_min (nm), and plot the distribution of this difference

In [None]:
plt.figure(figsize=(6,4))
df['stdamin']=df['lambda_sTDA (nm)']-df['lambda_exp_min (nm)']
sns.distplot(df[df['stdamin'] > 1]['stdamin'])
plt.show()

### Plot lambda_exp_min (nm) vs lambda_exp_max (nm)

In [None]:
plt.figure(figsize=(6,4))
plt.scatter(df['lambda_exp_min (nm)'].values[:],df['lambda_exp_max (nm)'].values[:],s=3)
plt.xlabel('lambda exp min')
plt.ylabel('lambda exp max')
plt.show()
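
The exercise list also asks for `lambda_exp_min (nm)` against `lambda_sTDA (nm)`. The sketch below prepares that pair on made-up values, not on the real dataset. It assumes that zeros are placeholders left by `fillna`, so rows where either value is zero are dropped before comparing.

```python
import pandas as pd

# Toy values for illustration only; 0.0 stands for a missing entry.
toy = pd.DataFrame({
    "lambda_exp_min (nm)": [0.0, 400.0, 520.0, 0.0, 310.0],
    "lambda_sTDA (nm)": [250.0, 410.0, 500.0, 0.0, 330.0],
})
x_col, y_col = "lambda_exp_min (nm)", "lambda_sTDA (nm)"

# Keep only rows where both the experimental and the computed value exist.
both = toy[(toy[x_col] != 0) & (toy[y_col] != 0)]

# Mean absolute deviation between computed and experimental wavelengths.
mean_abs_dev = (both[y_col] - both[x_col]).abs().mean()
```

The two columns of `both` can then be passed to `plt.scatter` in the same way as in the cell above.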

### Plot the distribution function of the molecules that absorb light in the UV/Vis range, e.g. 200 < lambda_sTDA (nm) < 800

In [None]:
dfc=df.copy()
tag='lambda_sTDA (nm)'
dfc=dfc[dfc['lambda_sTDA (nm)']>200]
dfc=dfc[dfc['lambda_sTDA (nm)']<800]

print('Table Shape: {}'.format(dfc.shape))
import seaborn as sns
plt.figure(figsize=(6,4))
sns.distplot(dfc['lambda_sTDA (nm)'].tolist())

### Compare and plot the difference between lambda_sTDA (nm) and lambda_z (nm)

In [None]:
dfc['stdaz']=dfc['lambda_sTDA (nm)']-dfc['lambda_z (nm)']
plt.figure(figsize=(6,4))
sns.distplot(dfc['stdaz'])

### Use scatter_matrix to find the relation among the excitation energies predicted with the methods 'gapdft', 'gapz', 'gapmopac', 'lambda_z (nm)', 'lambda_sTDA (nm)', 'lambda_tddft (nm)', 'lambda_exp_max (nm)', 'lambda_exp_min (nm)'. Which values correlate better?

In [None]:
attributes = [ 'gapdft', 
               #'gapz', 
               'gapmopac',
               'lambda_z (nm)',
               'lambda_sTDA (nm)',
               'lambda_tddft (nm)', 
               'lambda_exp_max (nm)', 'lambda_exp_min (nm)']
scatter_matrix(dfc[attributes], figsize=(12, 12))
plt.show()
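
To put numbers on "which values correlate better", the pairwise Pearson coefficients can be ranked. The sketch below uses a toy table with made-up values and only three of the columns. It assumes zeros are `fillna` placeholders and turns them back into NaN, so that `DataFrame.corr` skips them pair by pair.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy wavelengths for illustration; 0.0 marks a missing value.
toy = pd.DataFrame({
    "lambda_z (nm)": [300.0, 400.0, 500.0, 600.0, 0.0],
    "lambda_sTDA (nm)": [310.0, 410.0, 510.0, 610.0, 450.0],
    "lambda_exp_max (nm)": [0.0, 420.0, 480.0, 640.0, 455.0],
})

# Zeros would distort the coefficients, so treat them as missing.
corr = toy.replace(0, np.nan).corr()

# Collect each unordered pair of columns once and rank by correlation.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
ranked = sorted(pairs.items(), key=lambda item: item[1], reverse=True)
best_pair, best_r = ranked[0]
```

On the real table the same ranking can be applied to `dfc[attributes]`, keeping in mind that columns with few non-zero entries give unreliable coefficients.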

### Use scatter_matrix to find the relation among the absorption values obtained with the methods 'f1_sTDA', 'f1_ZINDO', 'f1_TDDFT', 'ε_Exp_max'

In [None]:
columns = { "epsilon_exp_max ":"ε_Exp_max",
           'f1_z':'f1_ZINDO',
           'f1_tddft':'f1_TDDFT'
          }

dfc.rename(columns, axis=1, inplace=True)

attributes = [ 'ε_Exp_max', 
               'f1_ZINDO',
               'f1_TDDFT']
scatter_matrix(dfc[attributes], figsize=(3, 3))
plt.show()


### Make a bar plot of the distribution of the number of rings (NumAromaticRings) using sns.barplot, use df['NumAromaticRings'].value_counts()

In [None]:
#tag='lambda_exp_max (nm)'
dfr=df.copy()
dfr=dfr[dfr['NumAromaticRings']<20]
print('Table Shape: {}'.format(dfr.shape))
#plt.figure(figsize=(6,4))
sns.set()
sns.set(style='white', palette='deep', font='sans-serif', font_scale=1., color_codes=True, rc=None)

this = dfr['NumAromaticRings'].value_counts()
plt.figure(figsize=(5,5))
sns.barplot(x=this.keys(), y=this.tolist(), color='b')
#plt.xlabel(r'Number Aromatic Rings in Molecule')
plt.xlabel(r'Number of Rings')
plt.ylabel('Frequency')
#plt.title('Number Aromatic Rings')

#utils.save_figure(results_path,'NumAromaticRings')
plt.show()
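
`value_counts()` returns the counts ordered by frequency, not by ring count. The sketch below, on a made-up series, applies `sort_index()` so that the order of the categories is explicit before plotting.

```python
import pandas as pd

# Made-up ring counts for eight hypothetical molecules.
rings = pd.Series([1, 2, 2, 3, 1, 2, 0, 5], name="NumAromaticRings")

# Default ordering: most frequent value first.
by_frequency = rings.value_counts()

# Reordered by the number of rings, which reads more naturally on an x axis.
by_ring_count = by_frequency.sort_index()
```

The index and values of `by_ring_count` can be passed to `sns.barplot` as `x` and `y`, as in the cell above.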

### Make a bar plot of the distribution of the number of rings (NumAromaticHeterocycles) using sns.barplot

In [None]:
this = dfr['NumAromaticHeterocycles'].value_counts()
plt.figure(figsize=(5,5))
sns.barplot(x=this.keys(), y=this.tolist(),color='b')
#plt.xlabel(r'Number Aromatic Carbocycles in Molecule')
plt.xlabel(r'Number of Rings')
plt.ylabel('Frequency')
#plt.title('Number Aromatic Carbocycles')
#utils.save_figure(results_path,'NumAromaticCarbocycles')

### Compare the absorption 'f1_sTDA' with 'NumAromaticRings'

In [None]:
dfc.plot.scatter(x='f1_sTDA', y='NumAromaticRings')

### Make a scatter plot that compares 'gapdft' with 'lambda_tddft (nm)' and color the points by 'NumAromaticRings'

In [None]:
plt.figure(figsize=(6,8))
plot_kwds={'alpha':.8, 's':30, 'linewidths':.1}

#plt.scatter(tsne_X.T[0], tsne_X.T[1], c=df[ 'lambda_sTDA (nm)'].values[:], cmap='rainbow' )
plt.scatter(dfc['gapdft'].values[:], dfc[ 'lambda_tddft (nm)'].values[:], c=dfc[ 'NumAromaticRings'], **plot_kwds )

plt.xlabel('gapdft')
plt.ylabel('lambda_tddft (nm)')
plt.title('gapdft vs lambda_tddft (nm), colored by NumAromaticRings')

cbar = plt.colorbar(orientation='horizontal')
cbar.set_label('rings')
plt.show()
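
A gap and a wavelength are inversely related through E = hc/λ, so a plot of one against the other is expected to be curved, not linear. The sketch below converts between the two with hc ≈ 1239.84 eV·nm. It assumes that 'gapdft' is stored in eV, which is not stated in this notebook, and the helper names are hypothetical.

```python
# Planck constant times the speed of light, expressed in eV·nm.
HC_EV_NM = 1239.841984


def nm_to_ev(wavelength_nm):
    """Convert a wavelength in nm to a photon energy in eV."""
    return HC_EV_NM / wavelength_nm


def ev_to_nm(energy_ev):
    """Convert a photon energy in eV to a wavelength in nm."""
    return HC_EV_NM / energy_ev


# Example: red light at 620 nm corresponds to roughly 2 eV.
energy_620 = nm_to_ev(620.0)
```

Zeros left by `fillna` must be filtered out before applying either conversion, since both divide by the input.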

### Find the systems with more than 20 aromatic rings. Do they have anything in common? Do they absorb more light or have a darker color?

In [None]:
#tag='lambda_exp_max (nm)'
dfr=df.copy()
tag='lambda_sTDA (nm)'
#dfr=dfr[dfr['lambda_sTDA (nm)']>0]
# NOTE: the exercise asks for more than 20 rings; this cell uses a threshold of 10.
dfr=dfr[dfr['NumAromaticRings']>10]
print('Table Shape: {}'.format(dfr.shape))
mollist = dfr.mol.tolist()
Draw.MolsToGridImage(mollist, molsPerRow=5, subImgSize=(350,350), legends=['stda={0:.1f} / emin={1:.1f} '.format(dfr['lambda_sTDA (nm)'][x],dfr['lambda_exp_min (nm)'][x]) for x, row in dfr.iterrows()])
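
The last exercise asks whether 'lambda_z (nm)' and 'lambda_exp_min (nm)' correlate for systems with more than 10 aromatic rings. The sketch below applies the three conditions on a toy table with made-up values and computes the Pearson coefficient with `Series.corr`.

```python
import pandas as pd

# Toy table for illustration; 0.0 marks a missing value.
toy = pd.DataFrame({
    "NumAromaticRings": [12, 3, 15, 11, 14, 20],
    "lambda_z (nm)": [400.0, 350.0, 0.0, 500.0, 600.0, 450.0],
    "lambda_exp_min (nm)": [410.0, 340.0, 520.0, 505.0, 0.0, 470.0],
})

# More than 10 rings, and both wavelengths actually available.
mask = (
    (toy["NumAromaticRings"] > 10)
    & (toy["lambda_z (nm)"] != 0)
    & (toy["lambda_exp_min (nm)"] != 0)
)
subset = toy[mask]

# Pearson correlation between the ZINDO and the experimental wavelengths.
r = subset["lambda_z (nm)"].corr(subset["lambda_exp_min (nm)"])
```

With only a handful of molecules passing the filter, the coefficient should be read as indicative at best.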