This scalable algorithm was developed during an internship at the Computer Science Laboratory of the University of Paris 13 (Laboratoire d'Informatique de Paris Nord, LIPN), with Mustapha Lebbah, Tarn Duong, Hanene Azzag and Gaël Beck. Its purpose is to provide an efficient distributed implementation for clustering large multivariate, multidimensional data sets (Big Data).

Nearest neighbour mean shift (NNMS) defines clusters as locally dense regions in the data density. The main advantages of NNMS over k-means clustering are that it detects the number of clusters automatically and that it can recover non-ellipsoidal clusters. The exact nearest neighbour calculations in standard NNMS prevent it from being used on Big Data, so we introduce approximate nearest neighbours via Locality Sensitive Hashing (LSH), based on random scalar projections of the data. To further improve scalability, we implement NNMS-LSH in the distributed Spark/Scala ecosystem.
- G. Beck, T. Duong, H. Azzag and M. Lebbah, "Distributed mean shift clustering with approximate nearest neighbours," 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, 2016, pp. 3110-3115. doi: 10.1109/IJCNN.2016.7727595
- T. Duong, G. Beck, H. Azzag and M. Lebbah, "Nearest neighbour estimators of density derivatives, with application to mean shift clustering," Pattern Recognition Letters, 2016. doi: 10.1016/j.patrec.2016.06.021
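The core NNMS iteration can be sketched in plain Scala without Spark: each candidate point repeatedly moves to the mean of its k nearest neighbours until the shift falls below a threshold. This is only an illustrative sketch with a brute-force k-NN search (the library replaces it with LSH); all names here are hypothetical, not the library's API.

```scala
// Illustrative nearest-neighbour mean shift sketch (brute-force k-NN, no Spark).
// The actual library uses LSH approximate neighbours instead of exact search.
object NnmsSketch {
  type Point = Array[Double]

  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Mean of the k nearest neighbours of y among the data points.
  def knnMean(y: Point, data: Array[Point], k: Int): Point = {
    val nn = data.sortBy(p => dist2(y, p)).take(k)
    nn.transpose.map(_.sum / k)
  }

  // Gradient ascent: move y to its k-NN mean until the shift drops below
  // a convergence threshold (playing the role of epsilon1) or maxIter is hit.
  def ascend(y0: Point, data: Array[Point], k: Int,
             epsilon1: Double, maxIter: Int): Point = {
    var y = y0
    var it = 0
    var moved = true
    while (moved && it < maxIter) {
      val yNew = knnMean(y, data, k)
      moved = math.sqrt(dist2(y, yNew)) > epsilon1
      y = yNew
      it += 1
    }
    y
  }
}
```

Points started in the same dense region converge to the same mode, which is how clusters and their number emerge without being specified in advance.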
- k is the number of nearest neighbours used to compute each centroid.
- nbseg is the number of segments into which the data vectors are projected during LSH. Its value should usually be larger than 20, depending on the data set.
- nbblocs1 is a crucial parameter: larger values give faster but less accurate LSH approximate nearest neighbours, and smaller values give slower but more accurate approximations.
- nbblocs2 behaves in the same way: larger values give faster but less accurate approximate nearest neighbours, and smaller values give slower but more accurate approximations.
- cmin is the threshold under which clusters with fewer than cmin members are merged with the next nearest cluster.
- normalisation is a flag indicating whether the data should first be normalised as (X - Xmin)/(Xmax - Xmin) before clustering.
- w is a uniformisation constant for LSH.
- ratioToStop is the proportion of unconverged data points below which the gradient ascent iterations stop; set it to 1.0 to disable this criterion.
- yStarIter is the maximum number of gradient ascent iterations in the mean shift update.
- epsilon1 is the threshold under which a mode is considered to have converged.
- epsilon2 is the threshold under which two final mean shift iterates are considered to belong to the same cluster.
- epsilon3 is the threshold under which two final clusters are considered to be the same.
- nbLabelIter is the number of iterations of the labelling step; it determines the number of final models.
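To make the nbseg and w parameters concrete, here is a toy scalar-projection LSH hash in plain Scala: a point is projected onto a random direction, offset by a uniform shift in [0, w), and assigned to one of nbseg segments. The bucketing details below are illustrative only and are not the library's exact scheme.

```scala
import scala.util.Random

// Toy scalar-projection LSH sketch. Parameter names mirror nbseg and w from the
// README, but the exact bucketing used by the library may differ.
object LshSketch {
  // Build one hash function: nearby points tend to fall in the same segment.
  def hasher(dim: Int, w: Double, nbseg: Int, rng: Random): Array[Double] => Int = {
    val dir = Array.fill(dim)(rng.nextGaussian())   // random projection direction
    val b   = rng.nextDouble() * w                  // uniform offset in [0, w)
    (p: Array[Double]) => {
      val proj = p.zip(dir).map { case (x, d) => x * d }.sum
      // Quantise the shifted projection by w, then wrap into nbseg segments.
      (math.floor((proj + b) / w).toInt % nbseg + nbseg) % nbseg
    }
  }
}
```

Nearest neighbours are then only searched for within the same bucket, which is why more, smaller buckets (larger nbblocs values) are faster but less accurate.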
Multivariate multidimensional clustering
Unlike image analysis, which has its own data pre-processing step before the mean shift, for general multivariate multidimensional data sets it is recommended to normalise the data so that each variable has a magnitude comparable to the others; this improves the distance computations used to determine nearest neighbours.
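The per-dimension rescaling applied by the normalisation flag can be sketched in plain Scala (the library does this over a Spark RDD; the helper below is a hypothetical local version):

```scala
// Min-max normalisation sketch: rescale each dimension to [0, 1] via
// (x - xmin) / (xmax - xmin), as described for the normalisation flag.
object Normalise {
  def minMax(data: Array[Array[Double]]): Array[Array[Double]] = {
    val dims = data.head.indices
    val mins = dims.map(j => data.map(_(j)).min).toArray
    val maxs = dims.map(j => data.map(_(j)).max).toArray
    data.map(_.zipWithIndex.map { case (x, j) =>
      val range = maxs(j) - mins(j)
      if (range == 0.0) 0.0 else (x - mins(j)) / range  // guard constant dimensions
    })
  }
}
```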
To carry out image analysis, it is recommended to convert the usual colour formats (e.g. RGB, CMYK) to the L\*u\*v\* colour space, since close values in L\*u\*v\* space correspond more closely to visual perceptions of colour proximity, and to add the row and column indices (x, y). Each pixel is thus transformed into a 5-dimensional vector (x, y, L\*, u\*, v\*) which is then input into the mean shift clustering.
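As an illustration of this pre-processing, the sketch below builds the 5-dimensional pixel vector using the standard sRGB → XYZ → L\*u\*v\* formulas with a D65 white point. This is a hypothetical helper, not the library's own converter.

```scala
// Sketch: convert an (x, y, R, G, B) pixel into the (x, y, L*, u*, v*) vector
// fed to the mean shift. Standard sRGB -> XYZ -> L*u*v* conversion (D65 white);
// illustrative only, not part of the library's API.
object PixelToLuv {
  private def srgbToLinear(c: Double): Double =
    if (c <= 0.04045) c / 12.92 else math.pow((c + 0.055) / 1.055, 2.4)

  // D65 reference white and its u', v' chromaticity.
  private val (xn, yn, zn) = (0.95047, 1.00000, 1.08883)
  private val unP = 4 * xn / (xn + 15 * yn + 3 * zn)
  private val vnP = 9 * yn / (xn + 15 * yn + 3 * zn)

  def rgbToLuv(r8: Int, g8: Int, b8: Int): (Double, Double, Double) = {
    val r = srgbToLinear(r8 / 255.0)
    val g = srgbToLinear(g8 / 255.0)
    val b = srgbToLinear(b8 / 255.0)
    // Linear sRGB -> XYZ (D65 matrix)
    val x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    val y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    val z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    val yr = y / yn
    val l = if (yr > 0.008856) 116 * math.cbrt(yr) - 16 else 903.3 * yr
    val denom = x + 15 * y + 3 * z
    val (uP, vP) = if (denom == 0) (unP, vnP) else (4 * x / denom, 9 * y / denom)
    (l, 13 * l * (uP - unP), 13 * l * (vP - vnP))
  }

  // Assemble the 5-d feature vector for a pixel at image coordinates (px, py).
  def pixelVector(px: Int, py: Int, r: Int, g: Int, b: Int): Array[Double] = {
    val (l, u, v) = rgbToLuv(r, g, b)
    Array(px.toDouble, py.toDouble, l, u, v)
  }
}
```

Including the (x, y) coordinates alongside the colour channels is what makes the resulting clusters spatially coherent segments rather than purely colour-based groups.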
```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(conf)
val meanShift = msLsh.MsLsh

val data = sc.textFile("myData.csv")
val parsedData = data.map(_.split(',').map(_.toDouble))
  .map(y => Vectors.dense(y))
  .zipWithIndex
  .map(_.swap)

val models = meanShift.train(
  sc, parsedData, k = 60, epsilon1 = 0.001, epsilon2 = 0.05, epsilon3 = 0.06,
  ratioToStop = 0.01, yStarIter = 10, cmin = 0, normalisation = true,
  w = 1, nbseg = 100, nbblocs1 = 50, nbblocs2 = 50, nbLabelIter = 1)

// Save result for an image as (ID, Vector, ClusterNumber)
meanShift.saveImageAnalysis(models.head, "MyImageResultDirectory", 1)

// Save result as (ID, ClusterNumber)
meanShift.savelabeling(models.head, "MyResultDirectory", 1)

// Save centroids as (NumCluster, cardinality, CentroidVector)
meanShift.saveClusterInfo(models.head, "centroidDirectory")
```
Image segmentation example
The picture in the top left corner is image #117 from the Berkeley Segmentation Dataset and Benchmark repository. The others are obtained with:
- nbblocs1 : 200 (top right), 500 (bottom left), 1000 (bottom right)
- nbblocs2 : 1
- k : 50
- epsilon2 : 0.05
- epsilon3 : 0.05 (has no effect when nbblocs2 = 1)
- yStarIter : 10
- nbLabelIter : 1
- w : 1
- cmin : 200
Please feel free to report any bugs, recommendations or requests to firstname.lastname@example.org so that we can improve this algorithm.