<a href="https://colab.research.google.com/github/ghasemieh/Data-Structure-and-Algorithms/blob/master/quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ClusterEnsembles

A Python package for cluster ensembles. A cluster ensemble combines the base labels produced by multiple clustering algorithms (or by one algorithm run with different settings) into a single consensus clustering label. The consensus label is intended to deliver consistently high clustering performance, more robustly than relying on any single base label.

<p align="center">
  <img width="600" src="https://user-images.githubusercontent.com/60049342/115107122-deb7b880-9fa3-11eb-98d6-9d1d25bf3ae8.png">
</p>
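To make the idea of combining base labels concrete, the sketch below builds a co-association matrix: for every pair of samples, the fraction of base labels that place them in the same cluster. This is the pairwise similarity that CSPA [1] starts from. It is only an illustration, not the package's implementation, and the function name `co_association` and the toy `base_labels` array are assumptions made for this example.

```python
import numpy as np

def co_association(labels):
    """Fraction of base labels in which each pair of samples shares a cluster.

    labels: array of shape (n_clusterings, n_samples), one base label per row.
    """
    labels = np.asarray(labels)
    n_clusterings, n_samples = labels.shape
    similarity = np.zeros((n_samples, n_samples))
    for row in labels:
        # True where samples i and j received the same cluster id in this base label
        similarity += (row[:, None] == row[None, :])
    return similarity / n_clusterings

# Three hypothetical base labels for five samples
base_labels = np.array([
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
])
S = co_association(base_labels)
print(S.round(2))
```

Samples 0 and 1 share a cluster in all three base labels, so `S[0, 1]` is 1.0, while samples 2 and 3 agree in two of three, giving 2/3. A consensus method then partitions the samples so that pairs with high similarity end up together.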

Installation
------------

In [16]:
# !pip install ClusterEnsembles numpy

Usage
-----

`CE.cluster_ensembles` is used as follows.

In [14]:
import pandas as pd
import numpy as np
import ClusterEnsembles as CE
from sklearn import metrics

In [21]:
X = pd.read_csv("gc_score.csv")
X.drop(columns="CompanyName", inplace=True)
X = X.to_numpy()

In [22]:
labels = pd.read_csv("label_aggregator.csv")
labels.head()

Unnamed: 0,0,0.1,1,2,3,4,5,6,7,8,...,1662,1663,1664,1665,1666,1667,1668,1669,1670,1671
0,K_Means_2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,K_Means_3,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,K_Means_4,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,K_Means_5,0,0,0,0,0,4,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,K_Means_6,0,0,0,5,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
labels.drop(columns=["0"], inplace=True)

In [24]:
labels = labels.values

In [25]:
label_aggregator = []  # (name, consensus label) for each solver
score_ensemble = []    # (name, silhouette, Calinski-Harabasz, Davies-Bouldin) for each solver

solvers = ['cspa', 'hgpa', 'mcla', 'hbgf', 'nmf', 'all']

for solver in solvers:
    # Combine the base labels into one consensus label with the current solver
    label_ce = CE.cluster_ensembles(labels, solver=solver)

    # Internal validity metrics computed on the original features X
    silhouette_score = metrics.silhouette_score(X, label_ce, metric='euclidean')
    calinski_harabasz_score = metrics.calinski_harabasz_score(X, label_ce)
    davies_bouldin_score = metrics.davies_bouldin_score(X, label_ce)

    label_aggregator.append((f"Ensemble_{solver}", label_ce))
    score_ensemble.append((f"Ensemble_{solver}", silhouette_score, calinski_harabasz_score, davies_bouldin_score))

In [26]:
score_aggregate_df = pd.DataFrame(score_ensemble, columns=["Model", "silhouette_score", "calinski_harabasz_score","davies_bouldin_score"])
score_aggregate_df.to_csv("score_ensemble.csv", index=False)

In [27]:
score_aggregate_df

|   | Model | silhouette_score | calinski_harabasz_score | davies_bouldin_score |
|---|---|---|---|---|
| 0 | Ensemble_cspa | -0.20173 | 45.050406 | 162.597607 |
| 1 | Ensemble_hgpa | -0.118685 | 66.309057 | 91.265408 |
| 2 | Ensemble_mcla | 0.867982 | 399.448345 | 1.212863 |
| 3 | Ensemble_hbgf | -0.089887 | 130.736047 | 30.987142 |
| 4 | Ensemble_nmf | -0.38227 | 11.1961 | 12.026666 |
| 5 | Ensemble_all | 0.859148 | 378.397265 | 1.256947 |
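The three metrics point in different directions: higher is better for the silhouette and Calinski-Harabasz scores, while lower is better for the Davies-Bouldin score. The short sketch below picks the best solver per metric from the values in the table above. The `scores` list is copied by hand for illustration, and the variable names are assumptions of this example.

```python
# Scores copied from the table above:
# (model, silhouette, calinski_harabasz, davies_bouldin)
scores = [
    ("Ensemble_cspa", -0.20173, 45.050406, 162.597607),
    ("Ensemble_hgpa", -0.118685, 66.309057, 91.265408),
    ("Ensemble_mcla", 0.867982, 399.448345, 1.212863),
    ("Ensemble_hbgf", -0.089887, 130.736047, 30.987142),
    ("Ensemble_nmf", -0.38227, 11.1961, 12.026666),
    ("Ensemble_all", 0.859148, 378.397265, 1.256947),
]

# Silhouette and Calinski-Harabasz: larger is better
best_silhouette = max(scores, key=lambda row: row[1])[0]
best_calinski = max(scores, key=lambda row: row[2])[0]
# Davies-Bouldin: smaller is better
best_davies = min(scores, key=lambda row: row[3])[0]

print(best_silhouette, best_calinski, best_davies)
```

On this data all three metrics agree on `Ensemble_mcla`, with `Ensemble_all` a close second.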


#### Parameters

- `labels`: *numpy.ndarray*
  
  Labels generated by multiple clustering algorithms such as K-Means, stacked so that each row is one base label. 
  
  **Note:** All base labels must have the same length. 

- `nclass`: *int, default=None*
  
  Number of classes in the consensus clustering label. 
  If `nclass=None`, it is set to the largest number of classes found in any single base label, ignoring missing values. 
  In the usage example above, `nclass` is not passed, so it is inferred from `labels` in this way.

- `solver`: *{'cspa', 'hgpa', 'mcla', 'hbgf', 'nmf', 'all'}, default='hbgf'*
    
    'cspa': Cluster-based Similarity Partitioning Algorithm [1].

    'hgpa': HyperGraph Partitioning Algorithm [1].

    'mcla': Meta-CLustering Algorithm [1].
    
    'hbgf': Hybrid Bipartite Graph Formulation [2].

    'nmf': NMF-based consensus clustering [3].

    'all': Runs every solver and returns the consensus clustering label with the largest objective function value [1]. 
    
    <p align="center">
      <img width="600" src="https://user-images.githubusercontent.com/60049342/116185712-20dbb980-a75d-11eb-87cb-ae0e68179674.png">
    </p>
    
    **Note:** Please use 'hbgf' for large-scale `labels`.

- `random_state`: *int, default=None*
  
  Used for 'hgpa', 'mcla', and 'nmf'. Please pass an integer for reproducible results.

- `verbose`: *bool, default=False*
  
  Whether to be verbose.
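The default for `nclass` described above can be sketched as follows. This is a minimal illustration of the stated rule, not the package's own code. It assumes missing values are encoded as `np.nan`, and the helper name `infer_nclass` and the `labels_demo` array are hypothetical.

```python
import numpy as np

def infer_nclass(labels):
    """Largest number of distinct classes in any single base label.

    labels: array of shape (n_clusterings, n_samples).
    Assumption: missing values are encoded as np.nan and are ignored.
    """
    labels = np.asarray(labels, dtype=float)
    if labels.ndim != 2:
        raise ValueError("labels must be 2-D: (n_clusterings, n_samples)")
    counts = []
    for row in labels:
        observed = row[~np.isnan(row)]  # drop missing entries
        counts.append(len(np.unique(observed)))
    return max(counts)

labels_demo = np.array([
    [0, 0, 1, 1, 2, 2],
    [0, 1, 0, 1, np.nan, np.nan],
    [1, 1, 1, 0, 0, 0],
])
print(infer_nclass(labels_demo))  # 3
```

Storing the base labels in one 2-D array also enforces the requirement that every base label has the same length.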

#### Return

- `label_ce`: *numpy.ndarray*
  
  A consensus clustering label generated by cluster ensembles. 
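For `solver='all'`, the returned label is the candidate with the largest objective function value from [1]. Assuming that objective is the average normalized mutual information (ANMI) between a candidate and the base labels, as defined in [1], the selection step can be sketched in plain Python. The helpers `entropy`, `nmi`, and `average_nmi` and the toy data are hypothetical, and this sketch assumes there are no missing values.

```python
import math
from collections import Counter

def entropy(label):
    """Shannon entropy (natural log) of a label vector."""
    n = len(label)
    return -sum((c / n) * math.log(c / n) for c in Counter(label).values())

def nmi(a, b):
    """Normalized mutual information: I(a; b) / sqrt(H(a) * H(b))."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
    denom = math.sqrt(entropy(a) * entropy(b))
    return mutual_info / denom if denom > 0 else 0.0

def average_nmi(candidate, base_labels):
    """Mean NMI between one candidate label and every base label."""
    return sum(nmi(candidate, base) for base in base_labels) / len(base_labels)

# Hypothetical base labels and two candidate consensus labels
base_labels = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 0, 1, 1, 1],
]
candidates = {
    "A": [0, 0, 1, 1, 2, 2],
    "B": [0, 1, 0, 1, 0, 1],
}
best = max(candidates, key=lambda name: average_nmi(candidates[name], base_labels))
print(best)
```

NMI is invariant to how clusters are numbered, so candidate `A` matches the first two base labels perfectly even though the second uses different cluster ids.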

References
----------

[1] A. Strehl and J. Ghosh, 
"Cluster ensembles -- a knowledge reuse framework for combining multiple partitions,"
Journal of Machine Learning Research, vol. 3, pp. 583-617, 2002.

[2] X. Z. Fern and C. E. Brodley, 
"Solving cluster ensemble problems by bipartite graph partitioning,"
In Proceedings of the Twenty-First International Conference on Machine Learning, p. 36, 2004.

[3] T. Li, C. Ding, and M. I. Jordan, 
"Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization," 
In Proceedings of the Seventh IEEE International Conference on Data Mining, pp. 577-582, 2007.

[4] J. Ghosh and A. Acharya, 
"Cluster ensembles," 
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 4, pp. 305-315, 2011. 