# Intro to HPC: Solar Cell project

Project: Solar Power for Affordable Housing through Computational Design of Low-Cost/High-Efficiency Solar Cells.
Author: AlvaroVM [https://alvarovm.github.io](http://alvarovm.github.io)

Version: 0.0.2

## Part 3: Molecular Maps: t-SNE Clustering for POST Trimming

Version: 0.0.1

## Example 1: PCA to distinguish between rings and chains

For this example, we define two groups of molecules as SMILES strings. Every molecule has six carbons, arranged either 1) in a ring or 2) in a chain, and carries one of several substituents: -CH3, -O, -F, -Cl, or -I. The molecules are added to a list, and we also attach a 'certain' property to each one, which can be used later as a flag.
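
A minimal sketch of how such a list could be built is shown here. It rests on several assumptions that are not in the original notebook: the ring is taken to be a saturated cyclohexane (`C1CCCCC1`), the chain is hexane (`CCCCCC`), each substituent is attached to the first carbon, and the flag is a hypothetical `is_ring` field. Only the standard library is used, so the SMILES strings are not validated here.

```python
# Substituent fragments written as SMILES atoms (assumed attachment at carbon 1).
SUBSTITUENTS = ["C", "O", "F", "Cl", "I"]

RING = "C1CCCCC1"   # assumption: saturated six-carbon ring
CHAIN = "CCCCCC"    # assumption: linear six-carbon chain

molecules = []
for fragment in SUBSTITUENTS:
    # 'is_ring' is a hypothetical flag: 1 for ring molecules, 0 for chains.
    molecules.append({"smiles": fragment + RING, "is_ring": 1})
    molecules.append({"smiles": fragment + CHAIN, "is_ring": 0})

for entry in molecules:
    print(entry["smiles"], entry["is_ring"])
```

Each SMILES string could then be turned into a molecule with `Chem.MolFromSmiles` from RDKit, and the flag used to color a PCA plot.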

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import sys
import os
SRC_DIR='..'
sys.path.append(os.path.join(SRC_DIR, 'code'))
import utils

import pandas as pd
#https://github.com/jmcarpenter2/swifter
#import swifter
#2-TSNE-UMAP-map-cuda-Copy1

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs 
from rdkit.Chem import Draw
from rdkit.Chem.rdMolDescriptors import  GetHashedMorganFingerprint
from rdkit.DataStructs import ConvertToNumpyArray

from sklearn.manifold import TSNE

import hdbscan

import seaborn as sns

utils.plot_settings2()

results_path = os.path.join(SRC_DIR,'results')

In [None]:
df = pd.read_pickle('../data/extended_db_Zindo_Nov_2019_V5_cannfp_clust.pkl').fillna(value = 0)
print('Column names: {}'.format(str(df.columns.tolist())))
print('Table Shape: {}'.format(df.shape))
#df.head(2)

### Exercises
* Take a sample of 1000 rows from the data set


* Select only the rows with `lambda_exp_max (nm)` between 200 and 800 nm


* Plot a histogram (use Seaborn's `distplot`, or `histplot` in Seaborn 0.11 and later, where `distplot` is deprecated)


* Take the fingerprint column as `X = df['morganfps'].values.tolist()` and perform a 2D PCA ([see PCA in ::wiki::](https://github.com/alvarovm/solarcelldata/wiki/Working-with-Data#unsupervised-learning---principal-component-analysis))

* Perform a t-SNE cluster analysis, for example with `tsne = TSNE(n_components=2, metric='jaccard', perplexity=15, n_iter=3000, init='pca', random_state=23)` ([see t-SNE in ::wiki::](https://github.com/alvarovm/solarcelldata/wiki/Working-with-Data#cluster-analysis-t-sne))


* Plot the t-SNE result as a 2D graph colored by `lambda_exp_max (nm)`; you could use a scatter plot with `matplotlib` or `seaborn`.


* Identify how many clusters are in the data by applying HDBSCAN to the t-SNE output; try a minimum cluster size of `15`, as in `cluster_tsne = hdbscan.HDBSCAN(min_cluster_size=15, gen_min_span_tree=True)`


* Add the cluster labels to the `DataFrame` and visualize the molecules; compare the molecules within each cluster and determine whether they have something in common
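
To make the PCA and `metric='jaccard'` steps of the exercises more concrete, the sketch below computes a pairwise Jaccard distance matrix and a 2D PCA projection with plain NumPy. It is an illustration under stated assumptions, not the notebook's method: the fingerprints are random synthetic bit vectors rather than the `morganfps` column, PCA is done through an SVD instead of scikit-learn, and the helper names `jaccard_distance_matrix` and `pca_2d` are hypothetical.

```python
import numpy as np

def jaccard_distance_matrix(bits):
    """Pairwise Jaccard distance between the rows of a 0/1 matrix."""
    a = np.asarray(bits, dtype=bool).astype(np.int64)
    intersection = a @ a.T                      # bits set in both rows
    counts = a.sum(axis=1)                      # bits set in each row
    union = counts[:, None] + counts[None, :] - intersection
    # Two all-zero rows have an empty union; treat them as identical.
    with np.errstate(divide="ignore", invalid="ignore"):
        similarity = np.where(union > 0, intersection / union, 1.0)
    return 1.0 - similarity

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Synthetic stand-in for the fingerprint list: 20 molecules, 64 bits each.
rng = np.random.default_rng(23)
X_demo = rng.integers(0, 2, size=(20, 64))

D = jaccard_distance_matrix(X_demo)
coords = pca_2d(X_demo)
print(D.shape, coords.shape)
```

The Jaccard distance only looks at bits that are set in at least one of the two fingerprints, which is why it suits sparse binary fingerprints better than the Euclidean distance.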

## Optional: Other cluster analysis. Butina analysis of a cluster

In [None]:
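def ClusterFps(fps, cutoff=0.2):
    """Cluster RDKit fingerprints with the Butina algorithm.

    Returns a tuple of clusters, each a tuple of indices into `fps`.
    """
    from rdkit import DataStructs
    from rdkit.ML.Cluster import Butina

    # First generate the distance matrix as a flattened lower triangle:
    # for each i, the Tanimoto distances to all fingerprints before it.
    dists = []
    nfps = len(fps)
    for i in range(1, nfps):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend([1 - x for x in sims])

    # Now cluster the data; `cutoff` is a distance threshold, so 0.2 treats
    # fingerprints as neighbors when their Tanimoto distance is at most 0.2.
    cs = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
    return cs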
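
The loop in `ClusterFps` stores the distances as a flattened lower triangle, in the order (1,0), (2,0), (2,1), (3,0), and so on. The sketch below reproduces that layout with the standard library only, assuming Python sets of on-bit indices as a stand-in for RDKit fingerprints; the helper names `tanimoto`, `condensed_distances`, and `condensed_index` are hypothetical and not part of RDKit.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def condensed_distances(fps):
    """Flattened lower-triangle distances, in the same order as ClusterFps."""
    dists = []
    for i in range(1, len(fps)):
        dists.extend(1.0 - tanimoto(fps[i], fps[j]) for j in range(i))
    return dists

def condensed_index(i, j):
    """Position of the pair (i, j), i != j, in the flattened list."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i < j:
        i, j = j, i
    return i * (i - 1) // 2 + j

# Toy fingerprints (assumption: sets of on-bit indices, not RDKit objects).
toy_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {1, 2, 3, 4}]
dists = condensed_distances(toy_fps)

print(len(dists))                      # n * (n - 1) / 2 pairs for n = 4
print(dists[condensed_index(3, 0)])    # distance between fingerprints 3 and 0
```

Storing only one triangle halves the memory needed for the distance matrix, which matters because the number of pairs grows quadratically with the number of fingerprints.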