# Spark Kmeans for Spam

Implementation of the k-means algorithm on Spark for multi-dimensional clustering of spam emails.

## Getting Started

### Requirements

- **Spark 0.7.0**: an open-source cluster computing system that aims to make data analytics fast, both fast to run and fast to write.
- **sbt 0.11.3**: a build tool for Scala and Java projects; requires Java 1.6 or later.
- **Scala 2.9.2**: a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way.
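For reference, a minimal `build.sbt` pinning these versions might look like the sketch below. The repository already ships its own `build.sbt`; the `org.spark-project` artifact coordinates and the Akka resolver are assumptions based on the usual Spark 0.7.x setup, not taken from this repository:

```scala
// Hypothetical minimal build.sbt for the versions listed above -- not this repo's actual file.
name := "SparkKmeans"

version := "1.0"

scalaVersion := "2.9.2"

// Spark 0.7.0 core artifact (group id is an assumption based on the usual 0.7.x coordinates)
libraryDependencies += "org.spark-project" %% "spark-core" % "0.7.0"

// Spark 0.7.x depends on Akka artifacts hosted on the Akka repository
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```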

## Usage

Run `runKmeans.sh` with these positional parameters (the last one is optional):

- path to the SPARK root
- `0` to only run, or `1` to package into a jar and run (required to deploy on cluster nodes)
- Spark configuration file (see below for format)
- parameter file (see below for format)
- `outLabel`: optional label added to the output folder name

For example:

```sh
./runKmeans.sh /user/spark/spark-0.7.0 0 spark.conf parameter.conf kmeansOn1000spams
```

## Configuration files

### spark.conf

| Parameter | Meaning |
| --- | --- |
| `host` | master URL passed to Spark |
| `appName` | name of the job |
| `inputFile` | path to the input JSON dataset, either local on the machine or an `hdfs://` URI |
| `evalOutput` | `1` to evaluate the result, `0` otherwise |

This is an example:

```json
{
  "host": "spark://10.10.14.3:7077",
  "appName": "SparkKmeans",
  "inputFile": "hdfs://10.10.14.3:8020/user/spark/spam/spam10_6.json",
  "evalOutput": 1
}
```

Note: the master URL must follow these rules:

| Master URL | Meaning |
| --- | --- |
| `local` | Run Spark locally with one worker thread (i.e. no parallelism at all) |
| `local[K]` | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine) |
| `spark://HOST:PORT` | Connect to the given Spark standalone cluster master |
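As an illustration of where these values end up, here is a minimal sketch of how `host`, `appName`, and `inputFile` would typically be handed to a Spark 0.7.x `SparkContext`. The values are hardcoded from the example above; the object name and jar path are hypothetical, and the repository's actual wiring may differ:

```scala
import spark.SparkContext

// Hypothetical example -- not the repo's actual driver code.
object SparkConfExample {
  def main(args: Array[String]) {
    // Values as they would be read from spark.conf (hardcoded here for brevity)
    val host      = "spark://10.10.14.3:7077"
    val appName   = "SparkKmeans"
    val inputFile = "hdfs://10.10.14.3:8020/user/spark/spam/spam10_6.json"

    // Spark 0.7.x constructor: (master URL, job name, Spark home, jars to ship to workers)
    val sc = new SparkContext(host, appName,
      "/user/spark/spark-0.7.0",                            // SPARK root, as in runKmeans.sh
      Seq("target/scala-2.9.2/spark-kmeans_2.9.2-1.0.jar")) // hypothetical jar name; needed on a cluster

    // Each line of the input file is one JSON record describing an email
    val lines = sc.textFile(inputFile)
    println("loaded " + lines.count() + " records")
  }
}
```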

### parameter.conf

| Parameter | Meaning |
| --- | --- |
| `mode` | k-means method: `"PartialSums"` or `"Standard"` (see Overview on algorithm) |
| `initialCentroids` | number of initial centroids; a first approximation is sqrt(inputSize/2), e.g. 100 for an input of 20,000 points |
| `convergeDist` | stopping condition: stop when old and new centroids differ by less than this value between iterations |
| `maxIter` | stopping condition: the algorithm runs at most this many iterations |
| `numSamplesForMedoid` | number of elements per cluster used to update the centroid for the categorical part in Standard k-means (see Overview on algorithm) |
| `weights` | per-feature weights, each between 0 and 1; the weights must sum to 1 |

This is an example:

```json
{
  "mode": "PartialSums",
  "initialCentroids": 100,
  "convergeDist": 0.001,
  "maxIter": 2,
  "numSamplesForMedoid": 3,
  "weights": {
    "space": 0.20,
    "time": 0.20,
    "IP": 0.20,
    "uri": 0.20,
    "botname": 0.20
  }
}
```
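To make the two stopping conditions concrete, here is a hedged sketch of a standard k-means driver loop using `convergeDist`, `maxIter`, and weights as described above. It is plain Scala on numeric vectors; in this repository the weights apply to feature groups and the Standard mode also updates a categorical medoid part, both of which are simplified away here:

```scala
// Hypothetical sketch of the stopping logic -- not the repo's actual implementation.
object KmeansLoopSketch {
  type Point = Array[Double]

  // Weighted squared distance; w plays the role of the "weights" block above
  // (simplified to one weight per dimension, summing to 1).
  def weightedDist(a: Point, b: Point, w: Array[Double]): Double =
    (0 until a.length).map(i => w(i) * (a(i) - b(i)) * (a(i) - b(i))).sum

  def closest(p: Point, centroids: Array[Point], w: Array[Double]): Int =
    (0 until centroids.length).minBy(i => weightedDist(p, centroids(i), w))

  def mean(ps: Seq[Point]): Point = {
    val sum = new Array[Double](ps.head.length)
    for (p <- ps; i <- 0 until p.length) sum(i) += p(i)
    sum.map(_ / ps.length)
  }

  def run(points: Seq[Point], initial: Array[Point], w: Array[Double],
          convergeDist: Double, maxIter: Int): Array[Point] = {
    var centroids = initial
    var moved = Double.MaxValue
    var iter = 0
    // Stop when centroids move less than convergeDist in total,
    // or when maxIter iterations have been run -- whichever comes first.
    while (moved > convergeDist && iter < maxIter) {
      val clusters = points.groupBy(p => closest(p, centroids, w))
      val updated = centroids.zipWithIndex.map { case (c, i) =>
        clusters.get(i).map(mean).getOrElse(c) // an empty cluster keeps its old centroid
      }
      moved = centroids.zip(updated).map { case (a, b) => weightedDist(a, b, w) }.sum
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```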

## Output File

The output files are written to the folder `./out`.
Each execution gets its own output folder, `./out[outLabel_]dDATE`, containing the following files:

| File | Content |
| --- | --- |
| `out` | resulting centroids |
| `DATE.log` | Spark log file of the executed job |
| `parameter.conf` | the algorithm parameters used to run the program |