# Clustering Algorithms: Their Application to Gene Expression Data
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5135122/

Applications:
- Group gene expression profiles into clusters whose members are similar in expression to each other and dissimilar to members of other clusters.

Clustering is based on genes, samples, and/or time.
- Genes display related expression across conditions.
- Samples display related expression across all genes.
- Gene-based clustering: genes are regarded as the objects and samples as the features.
- Sample-based clustering: samples are regarded as the objects and genes as the features.
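
The two views differ only in which axis of the expression matrix supplies the objects. A minimal sketch, assuming a made-up matrix with genes as rows and samples as columns (the values and the names `expr`, `gene_vectors`, and `sample_vectors` are illustrative, not from the paper):

```python
# Toy expression matrix: rows are genes, columns are samples (made-up values).
expr = [
    [2.0, 2.1, 8.0],  # gene A
    [1.9, 2.0, 7.9],  # gene B
    [5.0, 0.5, 0.4],  # gene C
]

# Gene-based clustering: each gene is an object, its samples are the features.
gene_vectors = [list(row) for row in expr]

# Sample-based clustering: transpose, so each sample is an object
# and its genes are the features.
sample_vectors = [list(col) for col in zip(*expr)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Genes A and B have similar profiles across samples; gene C does not.
d_ab = euclidean(gene_vectors[0], gene_vectors[1])
d_ac = euclidean(gene_vectors[0], gene_vectors[2])
```

Any clustering routine can then be run on `gene_vectors` or on `sample_vectors`, depending on which view is wanted.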

Types:
- Partial clustering: does not allocate every gene to a cluster. It is more suitable for gene expression data, which often includes irrelevant genes or samples. Genes that amount to noise can be left unassigned, which reduces their impact on the outcome and helps neglect irrelevant contributions.
- Complete clustering: allocates every gene to a cluster.
- Hard clustering: assigns each gene to a single cluster.
- Overlapping clustering: assigns each gene degrees of membership in several clusters. Can be converted to hard clustering by assigning each gene to the cluster with the dominant degree of membership.
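
The conversion from overlapping to hard clustering is an argmax over each gene's membership degrees. A small sketch, assuming hypothetical gene names and membership values (nothing here comes from the paper's data):

```python
# Hypothetical membership degrees: one entry per gene, one degree per cluster.
memberships = {
    "gene_a": [0.7, 0.2, 0.1],
    "gene_b": [0.1, 0.3, 0.6],
    "gene_c": [0.4, 0.5, 0.1],
}


def to_hard(memberships):
    """Assign each gene to the cluster with the largest membership degree."""
    return {
        gene: max(range(len(degrees)), key=lambda k: degrees[k])
        for gene, degrees in memberships.items()
    }


hard_labels = to_hard(memberships)
```

Ties are resolved in favor of the lowest cluster index here, which is an arbitrary choice of this sketch.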

## Traditional Techniques
**Hierarchical methods**
- Agglomerative nesting (AGNES)
    - Initially, each object forms a small cluster by itself. The two most similar clusters are then merged, repeatedly. A merge cannot be undone.
    - The use of different metrics may generate different results.
- Divisive Analysis (DIANA)
    - Not appropriate: splitting a cluster requires computing its diameter, an assumption that gene expression data does not follow.
- Clustering Using Representatives (CURE):
    - Compromise between centroid-based and all-point extreme approaches.
    - Handles nonspherical clusters.
    - Less sensitive to outliers.
- CHAMELEON
    - Outperforms CURE, ROCK, DBSCAN
- BIRCH
    - Handles outliers and large datasets, and its output is not affected by the order of the input data, which makes it a good fit for gene expression data clustering.
    - The quality of the result depends largely on parameter settings.
    - Performs poorly on nonspherical clusters (it is biased toward spherical ones) because it uses the concept of radius or diameter to control the boundary of clusters.
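
The bottom-up merging described for AGNES can be sketched as below. This is a generic agglomerative loop, not the AGNES implementation from any package. The 1-D points are made up, and the sketch varies the linkage criterion (single vs. complete) rather than the distance metric to show how such choices change the outcome.

```python
def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def agglomerate(points, n_clusters, linkage=min):
    """Bottom-up clustering: start from singletons, merge the closest pair.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a; the merge is never revisited.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Made-up points on a line with slowly growing gaps.
points = [[0.0], [1.0], [2.1], [3.3], [4.6], [6.5]]

single = agglomerate(points, 2, linkage=min)
complete = agglomerate(points, 2, linkage=max)
```

On these points single linkage chains the first five points together and isolates the last one, while complete linkage groups the last two points together, so the two criteria disagree on the same data.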
    
**Partitioning methods**
- Intelligent Kernel K-means (IKKM)
    - Good candidate for gene expression clustering, as it overcomes most challenges.
        - finds clusters itself
    - Issue of high dimensionality
    
**Model-based methods**
- Self-organizing maps (SOMs)
    - On par with Bayesian clustering, hierarchical clustering (HC), and K-means clustering.
- Chinese restaurant clustering (CRC)

**Density-based methods**
- DENsity-based CLUstEring (DENCLUE)
    - Good for gene expression clustering. Superior compared to DBSCAN.
    
## Recent techniques
- Binary matrix factorization (BMF): 
    - Good for clustering.
- Ensemble clustering
- MST:
    - Does not rely on the detailed geometrical shape of a cluster.
- Dual-rooted MST
- M-CLUBS (Microarray data CLustering Using Binary Splitting)
    - Combines two goals of a clustering process: the efficiency of a divisive technique and the accuracy of an agglomerative technique.
- KnA
- CAS-C
    - Reported to perform well.
- Hierarchical Dirichlet process (HDP)
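
For the MST entry in the list above, one common scheme is to build a minimum spanning tree over the objects and cut its heaviest edges, so that the remaining connected components become the clusters. The sketch below assumes that scheme, and the MST method discussed in the paper may differ in its details. The points and the function names `mst_edges` and `mst_clusters` are illustrative.

```python
def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def mst_edges(points):
    """Prim's algorithm on the complete graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        w, i, j = min(
            (euclidean(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((w, i, j))
        in_tree.add(j)
    return edges


def mst_clusters(points, n_clusters):
    """Cut the (n_clusters - 1) heaviest MST edges; return the components."""
    kept = sorted(mst_edges(points))[: len(points) - n_clusters]
    parent = list(range(len(points)))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Two elongated (nonspherical) groups of made-up points.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 5), (1, 5), (2, 5), (3, 5)]
clusters = mst_clusters(points, 2)
```

The two rows are recovered as separate clusters even though neither is compact around a centroid, which is the sense in which the approach does not depend on cluster shape.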

## Challenges and Issues
See table 2.

**Hierarchical clustering**:
- Sensitive to noise.
- Does not handle missing data well.
- Difficult to provide information such as the number of clusters required and individual cluster confidence measures.
- Major limitation: as soon as two points are linked, they cannot move to another group in the hierarchy or tree.

**HAC (hierarchical agglomerative clustering)**:
- The structure of patterns is fixed to a binary tree.
- Lacks robustness when dealing with data containing noise.

**K-means**:
- Very sensitive to noise
- Very sensitive to outliers
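
The outlier sensitivity can be seen on a tiny example. The sketch below runs plain Lloyd iterations on made-up 1-D values with fixed starting centroids. The function name `kmeans_1d`, the data, and the starting centroids are all assumptions chosen for illustration.

```python
def kmeans_1d(values, centroids, n_iter=20):
    """Plain Lloyd iterations on 1-D data with given starting centroids."""
    centroids = list(centroids)
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(
                range(len(centroids)), key=lambda k: abs(v - centroids[k])
            )
            groups[nearest].append(v)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if a group ends up empty.
        centroids = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centroids)
        ]
    return centroids, groups


clean = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # two well-separated groups
start = [0.0, 6.0]

clean_centroids, _ = kmeans_1d(clean, start)

# Add a single extreme value and rerun from the same starting centroids.
noisy_centroids, noisy_groups = kmeans_1d(clean + [50.0], start)
```

With the clean data the centroids settle near 1 and 5. After adding the single value 50, one centroid moves onto the outlier and the two real groups are merged under the other, because the mean is pulled strongly by extreme values.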