
# **Semiparametric SVM training in Spark**

This notebook shows how to run the code used in the paper:

"Distributed Non-Linear Semiparametric Support Vector Machine for Big Data Applications on Spark"

Submitted to the journal:

"IEEE Transactions on Systems, Man and Cybernetics: Systems"

This library can be used to train a Distributed Semiparametric SVM:

* Making use of a distributed k-means to obtain the elements of the semiparametric model.
* Using a distributed version of the Iterative Re-Weighted Least Squares procedure to obtain the weights of the semiparametric model.

Concretely, this example runs the algorithm on the adult dataset.
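The two steps listed above can be illustrated on a single machine with NumPy. This is only a minimal sketch, not the library's distributed implementation: the Gaussian kernel form, the function names (`gaussian_kernel`, `kmeans`, `svm_cost`, `irwls`), the toy data and the stopping rule (stop as soon as the cost no longer decreases) are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(X, centroids, sigma):
    """K[i, j] = exp(-||x_i - c_j||^2 / (2 * sigma^2)) (assumed kernel form)."""
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kmeans(X, n_centroids, n_iter=20, seed=0):
    """Plain Lloyd's k-means; a stand-in for the distributed version."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), n_centroids, replace=False)].copy()
    for _ in range(n_iter):
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = sq_dist.argmin(axis=1)
        for j in range(n_centroids):
            members = X[nearest == j]
            if len(members) > 0:  # keep the old centroid if its cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def svm_cost(beta, K, Kc, y, C):
    """Regularization term plus C times the summed hinge loss."""
    hinge = np.maximum(0.0, 1.0 - y * K.dot(beta))
    return 0.5 * beta.dot(Kc).dot(beta) + C * hinge.sum()

def irwls(X, y, centroids, sigma, C, n_iter=50, eps=1e-6):
    """Iterative re-weighted least squares for f(x) = sum_j beta_j K(x, c_j)."""
    K = gaussian_kernel(X, centroids, sigma)           # samples x centroids
    Kc = gaussian_kernel(centroids, centroids, sigma)  # centroids x centroids
    m = len(centroids)
    beta = np.zeros(m)
    best_cost = svm_cost(beta, K, Kc, y, C)
    for _ in range(n_iter):
        err = 1.0 - y * K.dot(beta)
        # Samples that violate the margin get weight C / err, the rest get 0.
        a = np.where(err > 0, C / np.maximum(err, eps), 0.0)
        # Weighted least squares: (Kc + K^T D_a K) beta = K^T D_a y.
        A = Kc + K.T.dot(a[:, None] * K) + 1e-8 * np.eye(m)
        candidate = np.linalg.solve(A, K.T.dot(a * y))
        cost = svm_cost(candidate, K, Kc, y, C)
        if cost >= best_cost:  # assumed stopping rule: no further improvement
            break
        beta, best_cost = candidate, cost
    return beta

# Toy usage on two synthetic blobs (hypothetical data, not the adult dataset).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) * 0.5 - 3.0, rng.randn(50, 2) * 0.5 + 3.0])
y = np.hstack([-np.ones(50), np.ones(50)])
centroids = kmeans(X, n_centroids=4)
beta = irwls(X, y, centroids, sigma=2.0, C=10.0)
scores = gaussian_kernel(X, centroids, 2.0).dot(beta)
print("training accuracy = %f" % np.mean(np.sign(scores) == y))
```

The model size is fixed by the number of centroids rather than by the number of training samples, which is what makes the linear system in each iteration small enough to solve directly.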



In [1]:

#########################################################################################
# If your notebook is not configured to use a previously defined spark cluster,
# you must uncomment the following lines and add your spark parameters and credentials.
#########################################################################################

#import findspark
#findspark.init()
#from pyspark import SparkConf, SparkContext

#conf = (SparkConf().setMaster("local[4]").setAppName("My app").set("spark.executor.memory", "2g"))
#sc = SparkContext(conf = conf)



In [2]:
####################################################
# Now we send our functions to the spark cluster
####################################################

from inspect import getsourcefile
from os.path import abspath,dirname,join

path_name=dirname(abspath(getsourcefile(lambda:0)))

sc.addPyFile("file://"+join(path_name,"common","lib","IRWLSUtils.py"))
sc.addPyFile("file://"+join(path_name,"common","lib","svm_utils.py"))
sc.addPyFile("file://"+join(path_name,"common","lib","KernelUtils.py"))
sc.addPyFile("file://"+join(path_name,"common","lib","ResultsUtils.py"))



In [3]:

############################################################
# Now we load our train, validation and test set
#
# The dataset files must be in the 'data' folder of this demo.
# This example uses the adult dataset; to use a different dataset,
# change the file names and the number of dimensions below.
# Files must be in libsvm format. Labels must be (0, 1) or (-1, 1).
############################################################

from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector, DenseVector
import numpy as np

# Training file
filenameTR = "adultTR"

# Test file
filenameTST = "adultTST"

dimensions=123

XtrRDD = MLUtils.loadLibSVMFile(sc, "file://"+join(path_name,"data",filenameTR),dimensions) \
    .map(lambda x: LabeledPoint(x.label, DenseVector((x.features).toArray()))) 


XtstRDD = MLUtils.loadLibSVMFile(sc, "file://"+join(path_name,"data",filenameTST),dimensions) \
    .map(lambda x: LabeledPoint(x.label, DenseVector((x.features).toArray()))) 



# Map labels to (-1, 1) if necessary.
# All distinct labels are collected so that a 0 label is not missed.
labels = set(XtrRDD.map(lambda x: x.label).distinct().collect())
if 0 in labels:
    print("Mapping labels to (-1, 1)...")
    XtrRDD = XtrRDD.map(lambda x: LabeledPoint(x.label * 2.0 - 1.0, x.features))
    XtstRDD = XtstRDD.map(lambda x: LabeledPoint(x.label * 2.0 - 1.0, x.features))


print("Loaded dataset")


Loaded dataset
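For readers unfamiliar with the input format, the following sketch shows what one libsvm line looks like and how it maps to a dense vector and a (-1, 1) label. It is an illustration only and does not replace `MLUtils.loadLibSVMFile`; the helper names `parse_libsvm_line` and `to_signed_label` and the sample line are hypothetical.

```python
import numpy as np

def parse_libsvm_line(line, dimensions):
    """Parse '<label> <index>:<value> ...' into (label, dense feature vector)."""
    parts = line.split()
    label = float(parts[0])
    features = np.zeros(dimensions)
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # libsvm indices are 1-based
    return label, features

def to_signed_label(label):
    """Map a (0, 1) label to (-1, 1), as done in the cell above."""
    return label * 2.0 - 1.0

# Hypothetical line with three active binary features out of 123.
label, features = parse_libsvm_line("0 3:1 11:1 123:1", 123)
print("label = %f, signed label = %f, non-zero features = %d"
      % (label, to_signed_label(label), int(np.count_nonzero(features))))
```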


In [4]:

#################################################################################################################
# Finally we define the parameters of our algorithm and
# we evaluate the AUC, accuracy, training time and classification time
##################################################################################################################

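NC = 150     # presumably the number of k-means centroids (size of the semiparametric model)
C = 1000.0   # SVM cost (regularization) parameter
sigma = 10   # kernel width parameter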

from IRWLSUtils import train_SVM

AUCTR, AUCTST, ACCTR, ACCTST, classificationTIME, kmeansTIME, IRWLSTIME = \
    train_SVM(sc, XtrRDD, XtstRDD, sigma, C, NC)

# Report the metrics; times are returned in seconds and shown in minutes
print("AUCtr = %f, AUCtst = %f" % (AUCTR, AUCTST))
print("ACCtr = %f, ACCtst = %f" % (ACCTR, ACCTST))
print("Elapsed minutes kmeans = %f" % (kmeansTIME / 60.0))
print("Elapsed minutes DIRWLS = %f" % (IRWLSTIME / 60.0))
print("Elapsed minutes classification = %f" % (classificationTIME / 60.0))


Obtaining weights using kmeans
Time obtaining centroids 23.5045239925
Obtaining weights using IRWLS
Iteration 1 : Cost Function 35045.6950759 , Iteration Time 5.48758792877
Iteration 2 : Cost Function 14038.9966916 , Iteration Time 3.55771303177
Iteration 3 : Cost Function 14038.7822024 , Iteration Time 3.43368315697
Iteration 4 : Cost Function 14038.7822024 , Iteration Time 3.50362801552
Time obtaining weights 16.022397995
AUCtr = 0.901157, AUCtst = 0.899044
ACCtr = 0.845490, ACCtst = 0.850255
Elapsed minutes kmeans = 0.391742
Elapsed minutes DIRWLS = 0.267040
Elapsed minutes classification = 0.029007
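To make the reported metrics concrete, here is a small sketch of how accuracy and AUC can be computed from classifier scores and (-1, 1) labels. This is not necessarily how `ResultsUtils.py` computes them: the function names, the decision threshold at zero and the toy values are assumptions.

```python
import numpy as np

def accuracy(y_true, scores):
    """Fraction of samples whose predicted sign matches the (-1, 1) label."""
    predictions = np.where(scores >= 0, 1.0, -1.0)
    return float(np.mean(predictions == y_true))

def auc(y_true, scores):
    """Probability that a positive sample scores above a negative one (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == -1]
    diff = pos[:, None] - neg[None, :]  # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins) / float(diff.size)

# Toy values for illustration only.
y_true = np.array([1.0, 1.0, -1.0, -1.0])
scores = np.array([0.9, -0.2, 0.1, -0.7])
print("ACC = %f, AUC = %f" % (accuracy(y_true, scores), auc(y_true, scores)))
```

The pairwise AUC above needs memory proportional to the number of positive samples times the number of negative samples, so it is only suitable for small sets; on large data a rank-based or distributed computation would be needed.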
