add something for statistical classifier, clustering analysis and mcs…

…/scaffold based hierarchy, more coming in the next few days
chemhack committed Sep 16, 2011
1 parent 09d4daa commit 5aedeedfffa22c50e90e2c3e2bae809609d7c645
Showing with 72 additions and 2 deletions.
  1. +55 −1 strucontjcheminf.bib
  2. +17 −1 strucontjcheminf.tex
@@ -265,7 +265,61 @@ @INPROCEEDINGS{dumontier2007
of chemical compounds},
booktitle = {Proc. of OWL: Experiences and Directions {(OWLED 2007)}},
year = {2007}
+author = {Todeschini, Roberto and Consonni, Viviana},
+address = {New York},
+doi = {10.1002/9783527613106},
+editor = {Todeschini, Roberto and Consonni, Viviana},
+isbn = {3527299130},
+issn = {35229913},
+pages = {688},
+publisher = {Wiley-VCH},
+series = {Methods and Principles in Medicinal Chemistry},
+title = {{Handbook of Molecular Descriptors}},
+url = {},
+volume = {11},
+year = {2000}
+ author = {Miklos Vargyas and Gabor Imre},
+ title = {{ChemAxon Library MCS}},
+ howpublished = "\url{}",
+ year = {2008},
+ note = "[Online]"
+author = {Schuffenhauer, Ansgar and Ertl, Peter and Roggo, Silvio and Wetzel, Stefan and Koch, Marcus A. and Waldmann, Herbert},
+doi = {10.1021/ci600338x},
+issn = {1549-9596},
+journal = {Journal of Chemical Information and Modeling},
+keywords = {Classification,Databases, Factual,Ligands,Molecular Structure,Organic Chemicals,Organic Chemicals: chemistry,Pesticides,Pyruvate Kinase},
+number = {1},
+pages = {47--58},
+pmid = {17238248},
+title = {{The scaffold tree--visualization of the scaffold universe by hierarchical scaffold classification.}},
+url = {},
+volume = {47},
+year = {2007}
+author = {Adamson, G.W. and Bawden, D.},
+journal = {Journal of Chemical Information and Computer Sciences},
+number = {4},
+pages = {204--209},
+publisher = {ACS Publications},
+title = {{Comparison of hierarchical cluster analysis techniques for automatic classification of chemical structures}},
+url = {},
+volume = {21},
+year = {1981}
@comment{jabref-meta: selector_publisher:}
@@ -276,7 +276,23 @@ \subsection*{Classification in chemistry}
The relationship between structure and function (SARs; but also cliffs in structure-activity space)
The focus in the remainder of this document will be on crisply defined structure-based classes in chemistry, which will be discussed further in the section \textit{\nameref{sec:resultsclasses}} below. %below
+% Review of algorithm-based chemical classification. Needs to be relocated.
+Since computers were introduced to chemistry research, chemical structures have been formalized as graphs in computer systems, and most classification algorithms cannot operate directly on such graph data. To apply classification algorithms, molecular descriptors are calculated. As defined by Todeschini and Consonni\cite{Todeschini2000}, a molecular descriptor is the final result of a logical and mathematical procedure that transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment. Among the calculated descriptors that focus on structural features, the molecular fingerprint is a special form, defined as a string of Boolean values in which each bit represents a feature. In the most common types of fingerprint, a feature is either a pre-defined substructure or an arbitrary substructure mapped to a bit position by a hashing algorithm.
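The two fingerprint types described above can be illustrated with a minimal Python sketch. It is not taken from any cheminformatics toolkit: the function names, the 64-bit length, the use of CRC32 as the hash, and the feature strings standing in for enumerated substructures are all assumptions made for illustration.

```python
import zlib


def keyed_fingerprint(features, keys):
    """One bit per pre-defined substructure key, in key order."""
    present = set(features)
    return [key in present for key in keys]


def hashed_fingerprint(features, n_bits=64):
    """Map each feature to a bit position with a hash (collisions are possible)."""
    bits = [False] * n_bits
    for feature in features:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        position = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[position] = True
    return bits


# Hypothetical feature strings standing in for substructures of ethanol.
ethanol = ["C", "O", "CC", "CO", "CCO"]
# Hypothetical pre-defined key set.
keys = ["C", "N", "O", "CO", "C=O"]

print(keyed_fingerprint(ethanol, keys))  # [True, False, True, True, False]
print(sum(hashed_fingerprint(ethanol)), "bits set in the hashed fingerprint")
```

In the keyed variant every bit has a fixed meaning, whereas in the hashed variant two different features may collide on the same bit, so a set bit cannot always be traced back to a single substructure.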
+Structure-activity relationship (SAR) studies are among the most active fields in which classification algorithms are applied. Supervised machine learning methods such as Bayesian classifiers, decision trees and support vector machines are employed to classify compounds into a positive or a negative class. The classification problem discussed here, however, is to organize chemical structures in an understandable way. As mentioned above, hierarchical organization is crucial for human understanding, but none of these supervised methods can extract such a hierarchical structure. Supervised machine learning also requires a reasonably sized training set of chemicals that have already been classified. Although existing databases such as ChEBI and MeSH could act as training sets, the size of such data is still a tiny fraction of the enormous chemical space. The problem is made more challenging by the fact that the leaf nodes of such a classification tree normally contain few structures.
+Other statistical methods such as hierarchical clustering\cite{Adamson1981} do give a hierarchy. In general, hierarchical clustering relies on a distance matrix composed of pairwise distances. Either agglomerative or divisive methods can be used. In agglomerative clustering, each instance is initially assigned to its own class. Merging then proceeds step by step: in each step, the two nearest classes are merged, until all classes have been merged into one. Divisive clustering proceeds in the opposite manner: all instances start in one class, and each step splits a class instead of merging two.
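The agglomerative procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method evaluated in the cited comparison: the function name, the choice of single linkage as the default (the distance between two classes is the smallest pairwise distance), and the toy distance matrix for four hypothetical compounds are all invented for the example.

```python
def agglomerative(dist, linkage=min):
    """Merge the two nearest clusters until a single cluster remains.

    dist is a symmetric matrix of pairwise distances. The return value is
    the merge history as (cluster_a, cluster_b, distance) tuples, which
    encodes the hierarchy from the leaves up to the root.
    """
    clusters = [(i,) for i in range(len(dist))]  # every instance starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance: linkage over all cross-cluster pairs
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy pairwise distances between four hypothetical compounds.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.85],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.85, 0.2, 0.0],
]

for left, right, d in agglomerative(dist):
    print(left, right, d)
# (0,) (1,) 0.1
# (2,) (3,) 0.2
# (0, 1) (2, 3) 0.7
```

Passing `linkage=max` switches the sketch to complete linkage. On this toy matrix the merge order stays the same and only the final merge distance changes (0.9 instead of 0.7), but in general a different linkage, or a different input set, can produce a different tree.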
+In the hierarchical tree constructed by cluster analysis, each node represents a class. However, the meaning of such nodes is unclear, and changes in the input data (or structures) can result in a different hierarchical tree, which makes it more difficult to formalize the class definitions.
+Maximum common substructure (MCS) based clustering and the scaffold tree are more promising for representing the intermediate nodes of a hierarchy tree.
+LibraryMCS\cite{librarymcs} is a commercial application that can perform MCS-based clustering on a set of structures. No technical details of the underlying implementation are available, but it can be observed from the output that structures sharing a common substructure are organized in the same class, and that the common substructure defines the scope of each class. The scaffold tree\cite{Schuffenhauer2007} is a hierarchical classification of scaffolds (molecular frameworks obtained by removing terminal side chains). By recursively removing rings from scaffolds, scaffolds are decomposed into smaller ones, which form the higher levels of the hierarchy tree.
+Both MCS-based and scaffold-based methods are very helpful for visualizing a dataset and giving an overview of it. From the hierarchy tree output, a clear representation can be extracted, as either an MCS or a scaffold, and then used as the class representation. However, the output is highly dependent on the input and thus cannot act as a universal chemical classification.
% This is a background section.
\subsection*{Logic-based reasoning and ontology}
