## Unsupervised ML Example: Cyclohexane

In [None]:
import ase
import os
import sys
import numpy as np
import pandas as pd
from ase.visualize import view
import plotly.express as px
from plotly.offline import init_notebook_mode, plot
init_notebook_mode(connected=True)
from ase.io import read
from soapml.descriptors.soap import SOAP
from sklearn.preprocessing import normalize
from openTSNE import TSNE

%config Completer.use_jedi = False
%load_ext autoreload
%autoreload 2

In [None]:
import soapml
soapml.__version__

### Read Data


In [None]:
from ase.io import read
traj=read('../unsupervised-ml-main/cyclohexane_data/MD/trajectory.xyz',index=':')
v=view(traj,viewer='ngl')
v

In [None]:
traj[0].info

In [None]:
energy=[a.info['energy_ryd'] for a in traj]

### Analysis at global structure level 


#### Create SOAP descriptors 
We create the SOAP descriptors with the `soapml` library, as shown below. Setting `rcut=4.5` means that each atomic environment is described within a sphere of radius 4.5 Å around the atom. This yields one environment descriptor per atom, so a structure with N atoms produces N descriptors. Because we need a single descriptor per structure, we use one of the `site_to_structure` methods to build a structure-level descriptor from the local-environment ones.
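To make the site-to-structure step concrete, here is a minimal NumPy sketch that collapses per-atom descriptors into one structure-level vector by plain averaging. This is an illustrative stand-in, not necessarily what `soapml` does for `site_to_structure_method='inner'`; the function name `average_site_descriptors` and the toy numbers are assumptions made up for this example.

```python
import numpy as np

def average_site_descriptors(site_descriptors):
    """Collapse an (n_atoms, n_features) array into one (n_features,) vector
    by averaging over atoms. Hypothetical helper, for illustration only."""
    site_descriptors = np.asarray(site_descriptors, dtype=float)
    return site_descriptors.mean(axis=0)

# Toy stand-in for per-atom descriptors: 3 atoms, 4 features each (made-up values)
toy_sites = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [3.0, 0.0, 0.0, 1.0],
    [2.0, 0.0, 1.0, 1.0],
])

structure_vec = average_site_descriptors(toy_sites)

# Scale to unit length, mirroring the row-wise sklearn normalize(..., axis=1) step
structure_vec_unit = structure_vec / np.linalg.norm(structure_vec)
print(structure_vec, structure_vec_unit)
```

Averaging makes the structure-level vector independent of the number of atoms and of their ordering, which is the property a structure descriptor needs.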

In [None]:
from soapml.descriptors.soap import SOAP
from sklearn.preprocessing import normalize
import numpy as np

    
soap=SOAP(rcut=4.5,nmax=8,lmax=4,sigma=0.5,periodic=False,rbf='gto',crossover=True)
soap.fit(traj,site_to_structure_method='inner')
soapdesc=soap.featurize_many(traj,n_jobs=40)
soapdesc = normalize(soapdesc, axis=1)

In [None]:
from sklearn.decomposition import PCA
pca=PCA(n_components=10)
pca.fit(soapdesc)

In [None]:
px.line(y=pca.explained_variance_ratio_,markers=True)

In [None]:
pca_comp=pca.transform(soapdesc)
df_pca=pd.DataFrame({'x':pca_comp[:,0],'y':pca_comp[:,1],'z':pca_comp[:,2],'energy':energy})
fig=px.scatter(data_frame=df_pca,x='x',y='y')#,z='z',color='energy')
# NOTE: `conf_soapdesc` and `names` are defined in the conformer cells further down;
# run those cells first, then re-run the lines below to overlay the reference conformers.
pca_comp_conf=pca.transform(conf_soapdesc)
df_pca_conf=pd.DataFrame({'x':pca_comp_conf[:,0],'y':pca_comp_conf[:,1],'z':pca_comp_conf[:,2],'name':names})
fig.add_traces(px.scatter(data_frame=df_pca_conf,x='x',y='y',color='name',size=[20]*len(names)).data)

#### Visualizing descriptor space
Now that we have computed the `SOAP` descriptors, let's find out what the descriptor space looks like. There are two main ways of doing this for a high-dimensional descriptor space like ours: `clustering` and `dimensionality reduction`. We can do both, and they should lead to the same insights. Here we use one of the most popular non-linear dimensionality reduction algorithms in ML, `T-distributed Stochastic Neighbor Embedding (t-SNE)`, to obtain a 2-dimensional representation of our descriptor space.
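The claim that clustering and dimensionality reduction should agree can be checked on toy data. The sketch below is only an illustration under stated assumptions: the data are synthetic blobs rather than SOAP descriptors, and PCA is used as a fast linear stand-in for t-SNE. It clusters once in the full space and once after reduction, then compares the two groupings with the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for a descriptor matrix: 3 well-separated groups in 20 dimensions
centers = np.zeros((3, 20))
centers[0, 0] = centers[1, 1] = centers[2, 2] = 20.0
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=1.0, random_state=0)

# Route 1: cluster directly in the full descriptor space
labels_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Route 2: reduce to 2 dimensions first, then cluster the reduced points
X_2d = PCA(n_components=2).fit_transform(X)
labels_2d = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

# Adjusted Rand index: 1.0 means identical groupings (up to label names)
agreement = adjusted_rand_score(labels_full, labels_2d)
print(f"agreement between the two routes: {agreement:.2f}")
```

On real descriptors the agreement is usually lower than on such clean toy data, so a low score is a prompt to inspect both views rather than proof that one of them is wrong.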

In [None]:
tsne = TSNE(
    n_components=2,
    perplexity=50,
    metric="euclidean",
    n_jobs=40,
    random_state=42,
    verbose=True,
)

In [None]:
embedding=tsne.fit(pca_comp)

In [None]:
px.defaults.width=800
px.defaults.height=600
px.scatter(x=embedding[:,0],y=embedding[:,1],color=energy)

In [None]:
conf_traj=[]
names=['boat','chair','half-boat','half-chair','planar','twist-boat']
for conformer in names:
    atoms=read('../unsupervised-ml-main/cyclohexane_data/conformers/{}.xyz'.format(conformer))
    atoms.info['name']=conformer
    conf_traj.append(atoms)

In [None]:
view(conf_traj[2],viewer='ngl')

In [None]:
view(conf_traj[3],viewer='ngl')

In [None]:
conf_soapdesc=soap.featurize_many(conf_traj,n_jobs=4)
conf_soapdesc = normalize(conf_soapdesc, axis=1)
conf_embedding=embedding.transform(pca.transform(conf_soapdesc))

In [None]:
conf_df=pd.DataFrame({'TSNE_0':conf_embedding[:,0],'TSNE_1':conf_embedding[:,1],'name':names})

In [None]:
conf_df

In [None]:
fig=px.scatter(x=embedding[:,0],y=embedding[:,1])
fig.add_traces(px.scatter(data_frame=conf_df,x='TSNE_0',y='TSNE_1',color='name',size=[20 for i in range(len(conf_embedding))]).data)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.cluster as cluster
import time
%matplotlib inline
sns.set_context('poster')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.25, 's' : 80, 'linewidths':0}

In [None]:
plt.scatter(embedding[:,0], embedding[:,1], c='b', **plot_kwds)
frame = plt.gca()
frame.axes.get_xaxis().set_visible(False)
frame.axes.get_yaxis().set_visible(False)

In [None]:
def plot_clusters(embedding, algorithm, args, kwds):
    start_time = time.time()
    labels = algorithm(*args, **kwds).fit_predict(embedding)
    end_time = time.time()
    palette = sns.color_palette('deep', np.unique(labels).max() + 1)
    colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]
    plt.scatter(embedding[:,0], embedding[:,1], c=colors, **plot_kwds)
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    frame.axes.get_yaxis().set_visible(False)
    plt.title('Clusters found by {}'.format(str(algorithm.__name__)), fontsize=24)
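    # Place the timing label in axes coordinates so it stays in the top-left corner for any data range
    plt.text(0.02, 0.95, 'Clustering took {:.2f} s'.format(end_time - start_time), fontsize=14, transform=frame.transAxes)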

In [None]:
plot_clusters(embedding, cluster.KMeans, (), {'n_clusters':7})

In [None]:
plot_clusters(embedding, cluster.MeanShift, (), {'cluster_all':False})

In [None]:
plot_clusters(embedding, cluster.DBSCAN, (), {'eps':0.5})

In [None]:
pca_comp

In [None]:
import hdbscan
plot_clusters(pca_comp[:,0:4], hdbscan.HDBSCAN, (), {'min_cluster_size':40})

We plot the two t-SNE components in 2D and see three major clusters in our data, indicating three major structural groups. Since we have the energy of each configuration, we can color the plot by energy, or by any other property we consider important for understanding structure-property relations, which is the key understanding needed for materials design.

Here we see that there is indeed a good correlation between the clusters and their energies, and even within a given cluster there are visible trends.
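One way to put numbers on the visual correlation between clusters and energy is to summarize the energy per cluster label. The following is a minimal sketch with made-up labels and energies; the helper name `energy_by_cluster` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def energy_by_cluster(labels, energies):
    """Return count, mean and standard deviation of the energy for each cluster label.
    Hypothetical helper, for illustration only."""
    df = pd.DataFrame({"cluster": labels, "energy": energies})
    return df.groupby("cluster")["energy"].agg(["count", "mean", "std"])

# Made-up example: 9 structures in 3 clusters with distinct energy levels
toy_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_energy = np.array([-1.0, -1.1, -0.9, -0.5, -0.4, -0.6, 0.0, 0.1, -0.1])

summary = energy_by_cluster(toy_labels, toy_energy)
print(summary)
```

In the notebook, the same helper could be applied to labels from any of the clustering cells, for example `cluster.KMeans(n_clusters=7).fit_predict(embedding)`, together with the `energy` list. Well-separated cluster means with small within-cluster spread would support the visual impression.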

There are no units on the axes of the plot because the output of non-linear dimensionality reduction algorithms like `t-SNE` does not have a fixed length scale across the plot. The algorithm gives higher priority to preserving proximity than to reproducing accurate distances from the high-dimensional space, because the objective is only to create a meaningful visual representation. Each point in this plot represents one of the 300 structures; if two points are close, the structures are `similar` and should therefore also be `similar` in, for example, energy.
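Since an embedding is judged by how well it preserves neighborhoods rather than distances, a simple diagnostic is the fraction of each point's nearest neighbors shared between the two spaces. The sketch below is a hand-rolled NumPy illustration on random data; the function names `knn_indices` and `neighbour_overlap` are assumptions made up for this example.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of every point, excluding the point itself."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]

def neighbour_overlap(high_dim, low_dim, k=5):
    """Mean fraction of each point's k nearest neighbours that is shared between the two spaces."""
    nn_high = knn_indices(high_dim, k)
    nn_low = knn_indices(low_dim, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
high = rng.normal(size=(40, 6))

# A uniformly rescaled copy has different distances but identical neighbourhoods
overlap_rescaled = neighbour_overlap(high, 2.0 * high, k=5)

# An unrelated random 2D layout shares only a few neighbours by chance
overlap_random = neighbour_overlap(high, rng.normal(size=(40, 2)), k=5)
print(overlap_rescaled, overlap_random)
```

The rescaled copy illustrates why the axes carry no units: changing the length scale leaves the neighborhoods untouched. In the notebook, a call such as `neighbour_overlap(pca_comp, np.asarray(embedding), k=10)` would give a rough measure of how faithfully the t-SNE map keeps local structure.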
