# Molecular Maps: PCA using RDKit, part 2

Author: AlvaroVM [https://alvarovm.github.io](http://alvarovm.github.io)
Version: 0.0.1

## Example 1: PCA to distinguish between rings and chains

For this example we define, as SMILES strings, two groups of six-carbon molecules: (1) rings and (2) chains, each carrying a different substituent such as -CH3, -O, -F, -Cl, or -I. These molecules are added to a list. We also attach a 'certain' property to each molecule, which can be used later as a flag.
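The two groups described above can be sketched as a plain Python list. This is only an illustration: the specific SMILES strings, the choice of a saturated ring (cyclohexane) as the scaffold, and the field name `is_ring` are assumptions, not the molecules or property used in the rest of this notebook.

```python
# Hypothetical example molecules; these SMILES are illustrative assumptions,
# not the data set loaded later in this notebook.
SUBSTITUENTS = ["C", "O", "F", "Cl", "I"]  # -CH3, -O(H), -F, -Cl, -I

RING = "C1CCCCC1"   # six carbons in a ring (assumed saturated: cyclohexane)
CHAIN = "CCCCCC"    # six carbons in a chain (hexane)


def build_molecules():
    """Return a list of dicts, one per molecule, with a ring/chain flag."""
    molecules = []
    for sub in SUBSTITUENTS:
        # Appending the substituent to the scaffold attaches it to the last carbon.
        molecules.append({"smiles": RING + sub, "is_ring": 1})
        molecules.append({"smiles": CHAIN + sub, "is_ring": 0})
    return molecules


molecules = build_molecules()
for entry in molecules:
    print(entry["smiles"], entry["is_ring"])
```

Each SMILES string could then be parsed with `Chem.MolFromSmiles` before computing fingerprints; the flag is kept alongside so it can be used later to colour the map.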

In [None]:
import sys
import os
SRC_DIR='../..'

In [None]:
sys.path.append(os.path.join(SRC_DIR, 'code'))
import utils

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
#https://github.com/jmcarpenter2/swifter
#import swifter

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs 
from rdkit.Chem import Draw
from rdkit.Chem.rdMolDescriptors import  GetHashedMorganFingerprint
from rdkit.DataStructs import ConvertToNumpyArray

from sklearn.manifold import TSNE

import hdbscan

utils.plot_settings2()

results_path = os.path.join(SRC_DIR,'results')

In [None]:
df = pd.read_pickle('../../data/extended_db_Zindo_Nov_2019_V5_cannfp_clust.pkl').fillna(value = 0)
print('Column names: {}'.format(str(df.columns.tolist())))
print('Table Shape: {}'.format(df.shape))
#df.head(2)

### Exercises

* Clean the table by removing the rows with zeros in 'lambda_tddft (nm)' and 'lambda_sTDA (nm)'.

In [None]:
#tag='lambda_exp_max (nm)'
tag='lambda_sTDA (nm)'
df=df[df['lambda_sTDA (nm)']>0]
df=df[df['lambda_tddft (nm)']>0]
print('Table Shape: {}'.format(df.shape))

* Compare 'lambda_tddft (nm)' vs 'lambda_sTDA (nm)', using the t-SNE clusters to colour the points. Is there any relationship?

* Compare 'lambda_tddft (nm)' vs 'lambda_sTDA (nm)' in the most populated clusters. Are there clusters that correlate better?

In [None]:
df['stda_dft'] = df['lambda_tddft (nm)'].values[:]- df[ 'lambda_sTDA (nm)'].values[:]

In [None]:
plt.figure(figsize=(8,6))
# Marker area scales with |TDDFT - sTDA|; the absolute value is used because
# scatter marker sizes must be non-negative.
plt.scatter(df['lambda_tddft (nm)'].values, df['lambda_sTDA (nm)'].values,
            marker='o', c=df['cluster'], cmap='brg',
            s=df['stda_dft'].abs(), alpha=.8)

plt.xlabel('TDDFT')
plt.ylabel('sTDA')
plt.title('Lambda TDDFT vs sTDA, coloured with Cluster#')
cbar = plt.colorbar()
cbar.set_label('Cluster')
#utils.save_figure(results_path,'tddft-stda-diff_lem')
plt.show()
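To address the question of which clusters correlate better, one option is to compute a Pearson correlation coefficient per cluster. The sketch below is a minimal, library-free version; the helper names `pearson` and `per_cluster_pearson` and the `min_size` cutoff are assumptions introduced here for illustration, and the toy values are made up rather than taken from the data set.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def per_cluster_pearson(clusters, x, y, min_size=3):
    """Group (x, y) pairs by cluster label and correlate within each group."""
    groups = defaultdict(lambda: ([], []))
    for label, a, b in zip(clusters, x, y):
        groups[label][0].append(a)
        groups[label][1].append(b)
    # Skip clusters with too few points for a meaningful coefficient.
    return {label: pearson(xs, ys)
            for label, (xs, ys) in groups.items() if len(xs) >= min_size}


# Toy data (made up): cluster 0 is perfectly correlated, cluster 1 anti-correlated.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_tddft = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
toy_stda = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0]
toy_r = per_cluster_pearson(toy_clusters, toy_tddft, toy_stda)
print(toy_r)
```

With the notebook's table, the same helper could be called as `per_cluster_pearson(df['cluster'].tolist(), df['lambda_tddft (nm)'].tolist(), df['lambda_sTDA (nm)'].tolist())`.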

* Compute the difference between 'lambda_tddft (nm)' and 'lambda_sTDA (nm)'. Analyze the differences by cluster.

In [None]:
import matplotlib as mpl
import seaborn as sns
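# value_counts() sorts by frequency (descending), so positions 1 and 2 are
# the second and third most populated clusters; .iloc makes the slice positional.
counts = df.cluster.value_counts()
counts.iloc[1:3]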

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8,7))
for e in counts.iloc[1:3].index:
    label = 'Cluster {}'.format(e)
    # sns.distplot is deprecated; histplot with kde=True (seaborn >= 0.11)
    # draws a comparable density histogram with a KDE curve.
    sns.histplot(df[df['cluster']==e]['stda_dft'].values, kde=True,
                 stat='density', label=label, ax=ax)

ax.set_xlabel(r'$\Delta\lambda$ (nm)')
ax.set_ylabel('Density')
ax.set_title('TDDFT-sTDA, Clusterwise')

ax.legend()
#header_legend('',  title='Clusters', loc='upper right',bbox_to_anchor=(1.15,1.0))
#utils.save_figure(results_path,'tddft-stda-diff-cluster-lem')
plt.show()
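Alongside the distribution plot, a small numeric summary of the TDDFT-sTDA difference per cluster can make the comparison easier to read. The following is a minimal sketch using only the standard library; the helper name `summarize_by_cluster` and the toy numbers are assumptions for illustration, not results from the data set.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def summarize_by_cluster(clusters, differences):
    """Return count, mean, median and sample std of the differences per cluster."""
    grouped = defaultdict(list)
    for label, diff in zip(clusters, differences):
        grouped[label].append(diff)
    summary = {}
    for label, values in grouped.items():
        summary[label] = {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            # The sample standard deviation needs at least two points.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Toy data (made up), in nm.
toy_labels = [1, 1, 1, 2, 2]
toy_diffs = [10.0, 20.0, 30.0, -5.0, 5.0]
toy_summary = summarize_by_cluster(toy_labels, toy_diffs)
for label, stats in toy_summary.items():
    print(label, stats)
```

With the notebook's table, this could be called as `summarize_by_cluster(df['cluster'].tolist(), df['stda_dft'].tolist())`. A mean far from zero would suggest a systematic shift between the two methods in that cluster, while a large standard deviation would suggest scatter.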

* Repeat the same analysis with the ZINDO method