Scala/Spark implementation of Distributed Nearest Neighbours Mean Shift using LSH

Distributed Nearest Neighbours Mean Shift with Locality Sensitive Hashing (DNNMS-LSH)

An upgraded version will soon be available in Clustering4Ever.

This scalable algorithm was created during an internship at the Computer Science Laboratory of the University of Paris 13 (Laboratoire d'Informatique de Paris Nord, LIPN), with Mustapha Lebbah, Tarn Duong, Hanene Azzag and Gaël Beck. Its purpose is to provide an efficient distributed implementation for clustering large multivariate, multidimensional data sets (Big Data). Nearest neighbour mean shift (NNMS) defines clusters in terms of locally dense regions of the data density. Its main advantages over k-means clustering are that it detects the number of clusters automatically and can find non-ellipsoidal clusters. Exact nearest neighbour calculations prevent the standard NNMS from being used on Big Data, so we introduce approximate nearest neighbours via Locality Sensitive Hashing (LSH), based on random scalar projections of the data. To further improve scalability, we implement NNMS-LSH in the distributed Spark/Scala ecosystem.
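As a rough illustration of the NNMS update described above, here is a minimal single-machine sketch in plain Scala. The brute-force nearest-neighbour search shown here is exactly what this repository replaces with LSH and Spark; all names are illustrative, not the repository's API.

```scala
// One NNMS ascent step: each point moves to the mean of its k nearest
// neighbours. Brute-force k-NN search, for illustration only.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(, b).map { case (x, y) => (x - y) * (x - y) }.sum)

def meanShiftStep(points: Array[Array[Double]], k: Int): Array[Array[Double]] =
  points.map { p =>
    val neighbours = points.sortBy(q => euclidean(p, q)).take(k)
    // the centroid of the k nearest neighbours becomes the new position
    neighbours.transpose.map(col => col.sum / k)
  }

// Iterate until no point moves more than eps (cf. epsilon1 below),
// or until maxIter iterations (cf. yStarIter below).
def ascend(points: Array[Array[Double]], k: Int, maxIter: Int, eps: Double): Array[Array[Double]] = {
  var current = points
  var it = 0
  var moved = true
  while (it < maxIter && moved) {
    val next = meanShiftStep(current, k)
    moved =, current).exists { case (a, b) => euclidean(a, b) > eps }
    current = next
    it += 1
  }
  current
}
```

Points that ascend to the same local mode end up in the same cluster, which is why the number of clusters need not be fixed in advance.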

Linked publications

  • G. Beck, T. Duong, H. Azzag and M. Lebbah, "Distributed mean shift clustering with approximate nearest neighbours," 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, 2016, pp. 3110-3115. doi: 10.1109/IJCNN.2016.7727595
  • T. Duong, G. Beck, H. Azzag and M. Lebbah, "Nearest neighbour estimators of density derivatives, with application to mean shift clustering," Pattern Recognition Letters.

Interesting Spark algorithms done at LIPN

Parameters

  • k is the number of nearest neighbours used to compute each centroid.
  • nbseg is the number of segments on which the data vectors are projected during LSH. Its value should usually be larger than 20, depending on the data set.
  • nbblocs1 is a crucial parameter: larger values give faster but less accurate LSH approximate nearest neighbours, while smaller values give slower but more accurate approximations.
  • nbblocs2 follows the same trade-off as nbblocs1: larger values give faster but less accurate approximate nearest neighbours, smaller values slower but more accurate ones.
  • cmin is the threshold under which clusters with fewer than cmin members are merged with the next nearest cluster.
  • normalisation is a flag indicating whether the data should first be normalized as (X - Xmin)/(Xmax - Xmin) before clustering.
  • w is a uniformisation constant for LSH.
  • ratioToStop is the proportion of points that must still not have converged for the gradient ascent iterations to continue; set it to 1.0 to disable this stopping criterion.
  • yStarIter is the maximum number of iterations in the gradient ascent in the mean shift update.
  • epsilon1 is the threshold under which we consider a mode to have converged.
  • epsilon2 is the threshold under which two final mean shift iterates are considered to be in the same cluster.
  • epsilon3 is the threshold under which two final clusters are considered to be the same.
  • nbLabelIter is the number of iterations for the labelling step; it determines the number of final models.
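The LSH parameters above (nbseg projections, the uniformisation constant w) can be pictured with a common scalar-projection hashing scheme. The sketch below is only an assumption about the general technique; the exact hashing used by this repository may differ, and all names are illustrative.

```scala
import scala.util.Random

// Illustrative scalar-projection LSH: nbseg random directions, and a
// signature coordinate floor((x . v + b) / w) per direction. Points with
// equal signatures fall into the same bucket, so close points collide
// with high probability.
final case class ScalarProjectionLsh(dim: Int, nbseg: Int, w: Double, seed: Long = 42L) {
  private val rng = new Random(seed)
  // one random direction and one random offset per segment
  private val directions = Array.fill(nbseg)(Array.fill(dim)(rng.nextGaussian()))
  private val offsets = Array.fill(nbseg)(rng.nextDouble() * w)

  def signature(x: Array[Double]): Seq[Int] = { case (v, b) =>
      val dot =, v).map { case (xi, vi) => xi * vi }.sum
      math.floor((dot + b) / w).toInt
    }.toSeq
}
```

Approximate nearest neighbours are then searched only among points sharing a bucket, which is what makes larger bucket counts faster but less accurate.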


Multivariate multidimensional clustering

Unlike image analysis, which has its own specific pre-processing step before the mean shift, general multivariate multidimensional data sets should first be normalized so that each variable has a magnitude comparable to the others; this improves the distance computations used to determine nearest neighbours.
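The min-max normalisation mentioned above, (X - Xmin)/(Xmax - Xmin) applied per variable, can be sketched as a column-wise transform; the helper name below is illustrative, not part of this repository's API.

```scala
// Column-wise min-max normalisation: rescales every variable to [0, 1].
def minMaxNormalise(points: Array[Array[Double]]): Array[Array[Double]] = {
  val mins = points.transpose.map(_.min)
  val maxs = points.transpose.map(_.max)
  points.map { row => { case ((x, lo), hi) =>
      if (hi == lo) 0.0 else (x - lo) / (hi - lo) // guard constant columns
    }
  }
}
```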

Image analysis

To carry out image analysis, it is recommended to convert the usual colour formats (e.g. RGB, CMYK) to the L*u*v* colour space, as close values in L*u*v* space correspond more closely to visual perceptions of colour proximity, and to add the row and column indices (x, y). Each pixel is thus transformed into a 5-dimensional vector (x, y, L*, u*, v*) which is then input into the mean shift clustering.
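For reference, here is a hedged sketch of an sRGB to L*u*v* pixel conversion using the standard published formulas with a D65 white point; the function is illustrative and not the repository's own pre-processing code.

```scala
// sRGB (components in [0, 1]) -> CIE L*u*v*, via linearisation and CIE XYZ.
def srgbToLuv(r: Double, g: Double, b: Double): (Double, Double, Double) = {
  // undo the sRGB gamma
  def linear(c: Double): Double =
    if (c <= 0.04045) c / 12.92 else math.pow((c + 0.055) / 1.055, 2.4)

  val (rl, gl, bl) = (linear(r), linear(g), linear(b))

  // linear RGB -> CIE XYZ (sRGB primaries, D65)
  val x = 0.4124 * rl + 0.3576 * gl + 0.1805 * bl
  val y = 0.2126 * rl + 0.7152 * gl + 0.0722 * bl
  val z = 0.0193 * rl + 0.1192 * gl + 0.9505 * bl

  def uPrime(x: Double, y: Double, z: Double): Double = {
    val d = x + 15 * y + 3 * z; if (d == 0) 0.0 else 4 * x / d
  }
  def vPrime(x: Double, y: Double, z: Double): Double = {
    val d = x + 15 * y + 3 * z; if (d == 0) 0.0 else 9 * y / d
  }

  // D65 reference white
  val (xn, yn, zn) = (0.95047, 1.0, 1.08883)
  val yr = y / yn
  val l =
    if (yr > math.pow(6.0 / 29.0, 3)) 116.0 * math.cbrt(yr) - 16.0
    else math.pow(29.0 / 3.0, 3) * yr
  val u = 13.0 * l * (uPrime(x, y, z) - uPrime(xn, yn, zn))
  val v = 13.0 * l * (vPrime(x, y, z) - vPrime(xn, yn, zn))
  (l, u, v)
}
```

Each pixel (x, y, L*, u*, v*) is then a 5-dimensional vector ready for the mean shift, as in the example below.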

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.linalg.Vectors

  val conf = new SparkConf().setAppName("DNNMS-LSH")
  val sc = new SparkContext(conf)
  val meanShift = msLsh.MsLsh
  val data = sc.textFile("myData.csv")
  val parsedData =',').map(_.toDouble))
    .map(y => Vectors.dense(y))
  val models = meanShift.train(
    sc,
    parsedData
    // plus the remaining parameters (k, nbseg, nbblocs1, nbblocs2, cmin,
    // normalisation, w, yStarIter, epsilon1, epsilon2, epsilon3,
    // ratioToStop, nbLabelIter) as described in the parameter list above
  )
  // Save result for an image as (ID, Vector, ClusterNumber)
  meanShift.saveImageAnalysis(models.head, "MyImageResultDirectory",1)

  // Save result as (ID, ClusterNumber)
  meanShift.savelabeling(models.head, "MyResultDirectory", 1)

  // Save centroids result as (NumCluster, cardinality, CentroidVector)
  meanShift.saveClusterInfo(models.head, "centroidDirectory")

Image segmentation example

The picture in the top left corner is image #117 from the Berkeley Segmentation Dataset and Benchmark repository. The others are obtained with:

  • nbblocs1 : 200 (top right) , 500 (bottom left), 1000 (bottom right)
  • nbblocs2 : 1
  • k : 50
  • epsilon2 : 0.05
  • epsilon3 : 0.05 // doesn't matter with nbblocs2 = 1
  • yStarIter : 10
  • nbLabelIter : 1
  • w : 1
  • cmin : 200



Please feel free to report any bugs, recommendations or requests to improve this algorithm.