# **Semiparametric SVM training in Spark**

This example runs the code used in the paper:

"Distributed Non-Linear Semiparametric Support Vector Machine for Big Data Applications on Spark"

Submitted to the journal:

"IEEE Transactions on Systems, Man and Cybernetics: Systems"

This library can be used to train a Distributed Semiparametric SVM:

 - Making use of a distributed and stochastic version of the Sparse Greedy Matrix Approximation algorithm (called DSSGMA) to obtain the elements of the semiparametric model.
 - Using a distributed version of the Iterative Re-Weighted Least Squares procedure to obtain the weights of the semiparametric model.
 
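The two steps listed above can be illustrated on a single machine. The following NumPy sketch is not the library's distributed implementation. It assumes a Gaussian kernel of the form exp(-||x - c||^2 / (2 sigma^2)), omits the bias term, and uses hypothetical function names (`gaussian_kernel`, `select_centroids`, `irwls_weights`). The first step greedily picks centroids by sampling candidates and keeping the one with the largest error decrease (ED). The second step repeats weighted least-squares solves to obtain the weights of the resulting model.

```python
import numpy as np


def gaussian_kernel(X, Z, sigma):
    """Kernel matrix between rows of X and rows of Z (assumed Gaussian form)."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Z ** 2, axis=1)[None, :]
          - 2.0 * X.dot(Z.T))
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def select_centroids(X, n_centroids, sigma, n_candidates=10, seed=0):
    """Greedy, stochastic selection of model elements (single-machine sketch)."""
    rng = np.random.RandomState(seed)
    available = np.ones(X.shape[0], dtype=bool)
    chosen = []
    for _ in range(n_centroids):
        pool = np.flatnonzero(available)
        cand = rng.choice(pool, size=min(n_candidates, pool.size), replace=False)
        K_xc = gaussian_kernel(X, X[cand], sigma)
        if chosen:
            S = X[chosen]
            K_ss = gaussian_kernel(S, S, sigma) + 1e-10 * np.eye(len(chosen))
            K_sc = gaussian_kernel(S, X[cand], sigma)
            proj = np.linalg.solve(K_ss, K_sc)
            # Part of each candidate not yet explained by the chosen elements.
            resid = K_xc - gaussian_kernel(X, S, sigma).dot(proj)
            eta = 1.0 - np.sum(K_sc * proj, axis=0)  # k(c, c) = 1 for this kernel
        else:
            resid = K_xc
            eta = np.ones(cand.size)
        # Error decrease obtained by adding each candidate to the basis.
        ed = np.sum(resid ** 2, axis=0) / np.maximum(eta, 1e-12)
        best = int(cand[np.argmax(ed)])
        chosen.append(best)
        available[best] = False
    return X[chosen]


def irwls_weights(X, y, centroids, sigma, C, n_iter=50, eps=1e-6):
    """IRWLS sketch: repeatedly solve (K^T D_a K + K_c) beta = K^T D_a y."""
    K = gaussian_kernel(X, centroids, sigma)
    K_c = gaussian_kernel(centroids, centroids, sigma)
    ridge = 1e-8 * np.eye(centroids.shape[0])  # small jitter for stability
    beta = np.zeros(centroids.shape[0])
    for _ in range(n_iter):
        margin_err = y * (y - K.dot(beta))  # y_i * e_i = 1 - y_i * f(x_i)
        # Weight a_i = C / (y_i * e_i) for margin violators, 0 otherwise.
        a = np.where(margin_err > 0, C / np.maximum(margin_err, eps), 0.0)
        KtA = K.T * a
        beta = np.linalg.solve(KtA.dot(K) + K_c + ridge, KtA.dot(y))
    return beta


# Toy usage on two synthetic clusters (illustrative data, not the Adult dataset).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) * 0.3 - 2.0, rng.randn(20, 2) * 0.3 + 2.0])
y = np.hstack([-np.ones(20), np.ones(20)])
centroids = select_centroids(X, n_centroids=2, sigma=1.5, n_candidates=40)
beta = irwls_weights(X, y, centroids, sigma=1.5, C=10.0)
pred = np.sign(gaussian_kernel(X, centroids, 1.5).dot(beta))
print("Training accuracy: %f" % np.mean(pred == y))
```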
Concretely, this example runs the algorithm on the Adult dataset.

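The cell below reads the data with the library's `loadFile` helper, whose implementation is not shown here, and sets the kernel width from the input dimension. As a rough illustration only, the sketch below assumes the files use the LIBSVM sparse text format (`label index:value ...` with 1-based indices); the function `parse_libsvm_line` and the sample line are hypothetical. It also evaluates the kernel width formula used in the cell, `sigma = fsigma * sqrt(dimensions)`.

```python
import numpy as np


def parse_libsvm_line(line, dimensions):
    """Turn one 'label index:value ...' line into (label, dense vector)."""
    parts = line.split()
    label = float(parts[0])
    features = np.zeros(dimensions)
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # LIBSVM indices start at 1
    return label, features


# Hypothetical sample line with three active binary features.
label, x = parse_libsvm_line("-1 3:1 11:1 123:1", 123)

# Kernel width as computed in the example: fsigma * sqrt(dimensions).
fsigma = 0.9
sigma = fsigma * np.sqrt(123)
print("label = %s, active features = %d, sigma = %.4f" % (label, int(x.sum()), sigma))
```

Scaling the width with the square root of the dimension keeps it comparable to typical distances between binary feature vectors of that length.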

In [1]:
##############################
# Some Jupyter notebook installations are preconfigured to run directly on PySpark.
#
# An alternative is to install findspark and run the notebook from a regular Python kernel.
# If you have installed findspark, uncomment the following lines and change the Spark configuration
# to point to your Spark installation.
##############################

#import findspark
#findspark.init()
#from pyspark import SparkConf, SparkContext
#conf = (SparkConf().setMaster("local[4]").setAppName("My app").set("spark.executor.memory", "2g"))
#sc = SparkContext(conf = conf)


%matplotlib inline
from math import sqrt
from common.lib.IRWLSUtils import *
from pyspark.mllib.util import MLUtils
import numpy as np
from pyspark.mllib.regression import LabeledPoint
    

#sc.addPyFile("file:///export/usuarios01/navia/spark/SVM_spark/common/lib/svm_utils.py")
#sc.addPyFile("file:///export/usuarios01/navia/spark/SVM_spark/common/lib/IRWLSUtils.py")   
#sc.addPyFile("file:///export/usuarios01/navia/spark/SVM_spark/common/lib/KernelUtils.py")   
#sc.addPyFile("file:///export/usuarios01/navia/spark/SVM_spark/common/lib/ResultsUtils.py")   
#sc.addPyFile("file:///export/usuarios01/navia/spark/SVM_spark/common/lib/SGMAUtils.py")   
#from IRWLSUtils import loadFile, train_SGMA_IRWLS

    

# Initialization of the variables
Npartitions = -99 # Number of partitions in the RDD; with a negative value, the cluster chooses the number.
Samplefraction = 0.05
Niter = 300
NC = 50
C=100.0


print("Loading Adult dataset")
dimensions = 123                           
XtrRDD = loadFile('./data/' + 'adult_train',sc,dimensions,Npartitions)
XvalRDD = loadFile('./data/' + 'adult_val',sc,dimensions,Npartitions)
XtstRDD = loadFile('./data/' + 'adult_test',sc,dimensions,Npartitions)

fsigma = 0.9
sigma = fsigma * np.sqrt(dimensions)
                    

XtrRDD.cache()
XvalRDD.cache()
XtstRDD.cache()

             
auc_tr, auc_val, auc_tst, exe_time = train_SGMA_IRWLS(XtrRDD, XvalRDD, XtstRDD, sigma, C, NC, Niter)


print("AUCtr = %f, AUCval = %f, AUCtst = %f" % (auc_tr, auc_val, auc_tst))
print("Elapsed minutes = %f" % (exe_time / 60.0))



Loading Adult dataset
Centroid 1 : Taking candidates, Evaluating ED, Max ED: 781.670223008 , Updating Matrices Time 1.49642109871
Centroid 2 : Taking candidates, Evaluating ED, Max ED: 6.8187263131 , Updating Matrices Time 2.70972108841
Centroid 3 : Taking candidates, Evaluating ED, Max ED: 0.945041056222 , Updating Matrices Time 2.56803417206
Centroid 4 : Taking candidates, Evaluating ED, Max ED: 0.708839871664 , Updating Matrices Time 2.18658208847
Centroid 5 : Taking candidates, Evaluating ED, Max ED: 0.3276716505 , Updating Matrices Time 2.41872906685
Centroid 6 : Taking candidates, Evaluating ED, Max ED: 0.150301087224 , Updating Matrices Time 2.38541007042
Centroid 7 : Taking candidates, Evaluating ED, Max ED: 0.166837828351 , Updating Matrices Time 2.48313903809
Centroid 8 : Taking candidates, Evaluating ED, Max ED: 0.135551089682 , Updating Matrices Time 2.26849293709
Centroid 9 : Taking candidates, Evaluating ED, Max ED: 0.169826778195 , Updating Matrices Time 2.5240521431
Cen