## Support Vector Machine (SVM)
###### An SVM constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. LinearSVC in Spark ML supports binary classification with a linear SVM. Internally, it optimizes the hinge loss using the OWLQN optimizer.
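To make the hinge loss mentioned above concrete, here is a minimal plain-Python sketch. It is not Spark's implementation: the function names, the example margins, and the exact form of the L2 penalty (`reg_param / 2 * ||w||^2`) are assumptions chosen for illustration.

```python
def hinge_loss(label, margin):
    """Hinge loss for one example.

    label  -- class label in {0, 1}, as produced by a label indexer
    margin -- signed score w.x + b from the linear model
    """
    y = 2 * label - 1  # map {0, 1} to {-1, +1}
    return max(0.0, 1.0 - y * margin)


def objective(labels, margins, weights, reg_param):
    """Mean hinge loss plus an L2 penalty (assumed form: reg_param / 2 * ||w||^2)."""
    data_term = sum(hinge_loss(l, m) for l, m in zip(labels, margins)) / len(labels)
    penalty = 0.5 * reg_param * sum(w * w for w in weights)
    return data_term + penalty


# Hypothetical margins for four examples (not taken from the kyphosis data)
labels = [1, 1, 0, 0]
margins = [2.0, 0.5, -3.0, 0.2]
losses = [hinge_loss(l, m) for l, m in zip(labels, margins)]
print(losses)
```

An example contributes zero loss only when it lies on the correct side of the hyperplane with a margin of at least 1. The second and fourth examples are penalized, the fourth more heavily because it is on the wrong side.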

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LinearSVC

In [3]:
spark = SparkSession.builder.appName("SVM").getOrCreate()
# Raw string so the backslashes in the Windows path are not read as escape sequences
training = spark.read.csv(r"..\Data\kyphosis.csv", header = True, inferSchema = True)
training.show()

+--------+---+------+-----+
|Kyphosis|Age|Number|Start|
+--------+---+------+-----+
|  absent| 71|     3|    5|
|  absent|158|     3|   14|
| present|128|     4|    5|
|  absent|  2|     5|    1|
|  absent|  1|     4|   15|
|  absent|  1|     2|   16|
|  absent| 61|     2|   17|
|  absent| 37|     3|   16|
|  absent|113|     2|   16|
| present| 59|     6|   12|
| present| 82|     5|   14|
|  absent|148|     3|   16|
|  absent| 18|     5|    2|
|  absent|  1|     4|   12|
|  absent|168|     3|   18|
|  absent|  1|     3|   16|
|  absent| 78|     6|   15|
|  absent|175|     5|   13|
|  absent| 80|     5|   16|
|  absent| 27|     4|    9|
+--------+---+------+-----+
only showing top 20 rows



In [10]:
from pyspark.ml.feature import StringIndexer
labelIndexer = StringIndexer(inputCol = "Kyphosis", outputCol = "indexedLabel").fit(training)
trainIndexed = labelIndexer.transform(training)
trainIndexed.show()

+--------+---+------+-----+------------+
|Kyphosis|Age|Number|Start|indexedLabel|
+--------+---+------+-----+------------+
|  absent| 71|     3|    5|         0.0|
|  absent|158|     3|   14|         0.0|
| present|128|     4|    5|         1.0|
|  absent|  2|     5|    1|         0.0|
|  absent|  1|     4|   15|         0.0|
|  absent|  1|     2|   16|         0.0|
|  absent| 61|     2|   17|         0.0|
|  absent| 37|     3|   16|         0.0|
|  absent|113|     2|   16|         0.0|
| present| 59|     6|   12|         1.0|
| present| 82|     5|   14|         1.0|
|  absent|148|     3|   16|         0.0|
|  absent| 18|     5|    2|         0.0|
|  absent|  1|     4|   12|         0.0|
|  absent|168|     3|   18|         0.0|
|  absent|  1|     3|   16|         0.0|
|  absent| 78|     6|   15|         0.0|
|  absent|175|     5|   13|         0.0|
|  absent| 80|     5|   16|         0.0|
|  absent| 27|     4|    9|         0.0|
+--------+---+------+-----+------------+
only showing top 20 rows
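The output above maps `absent` to 0.0 and `present` to 1.0. By default, `StringIndexer` orders labels by descending frequency, so the most common label receives index 0. The plain-Python sketch below imitates that ordering. The helper name `frequency_index` is hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to a float index, most frequent first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Label column of the 20 rows displayed above: 17 "absent", 3 "present"
labels = ["absent"] * 17 + ["present"] * 3
mapping = frequency_index(labels)
print(mapping)  # {'absent': 0.0, 'present': 1.0}
```

Because the index depends on label frequencies in the data the indexer was fitted on, a different dataset could assign the indices the other way round. It is worth checking the mapping before interpreting a prediction of 1.0 as "present".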

In [11]:
# Assemble all features into a single vector column
from pyspark.ml.feature import VectorAssembler
featureassembler = VectorAssembler(inputCols = ["Age", "Number", "Start"], outputCol = "features")
output = featureassembler.transform(trainIndexed)
output.show()

+--------+---+------+-----+------------+----------------+
|Kyphosis|Age|Number|Start|indexedLabel|        features|
+--------+---+------+-----+------------+----------------+
|  absent| 71|     3|    5|         0.0|  [71.0,3.0,5.0]|
|  absent|158|     3|   14|         0.0|[158.0,3.0,14.0]|
| present|128|     4|    5|         1.0| [128.0,4.0,5.0]|
|  absent|  2|     5|    1|         0.0|   [2.0,5.0,1.0]|
|  absent|  1|     4|   15|         0.0|  [1.0,4.0,15.0]|
|  absent|  1|     2|   16|         0.0|  [1.0,2.0,16.0]|
|  absent| 61|     2|   17|         0.0| [61.0,2.0,17.0]|
|  absent| 37|     3|   16|         0.0| [37.0,3.0,16.0]|
|  absent|113|     2|   16|         0.0|[113.0,2.0,16.0]|
| present| 59|     6|   12|         1.0| [59.0,6.0,12.0]|
| present| 82|     5|   14|         1.0| [82.0,5.0,14.0]|
|  absent|148|     3|   16|         0.0|[148.0,3.0,16.0]|
|  absent| 18|     5|    2|         0.0|  [18.0,5.0,2.0]|
|  absent|  1|     4|   12|         0.0|  [1.0,4.0,12.0]|
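|  absent|168|     3|   18|         0.0|[168.0,3.0,18.0]|
|  absent|  1|     3|   16|         0.0|  [1.0,3.0,16.0]|
|  absent| 78|     6|   15|         0.0| [78.0,6.0,15.0]|
|  absent|175|     5|   13|         0.0|[175.0,5.0,13.0]|
|  absent| 80|     5|   16|         0.0| [80.0,5.0,16.0]|
|  absent| 27|     4|    9|         0.0|  [27.0,4.0,9.0]|
+--------+---+------+-----+------------+----------------+
only showing top 20 rows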

In [12]:
finalized_data = output.select("features", "indexedLabel")
finalized_data.show()

+----------------+------------+
|        features|indexedLabel|
+----------------+------------+
|  [71.0,3.0,5.0]|         0.0|
|[158.0,3.0,14.0]|         0.0|
| [128.0,4.0,5.0]|         1.0|
|   [2.0,5.0,1.0]|         0.0|
|  [1.0,4.0,15.0]|         0.0|
|  [1.0,2.0,16.0]|         0.0|
| [61.0,2.0,17.0]|         0.0|
| [37.0,3.0,16.0]|         0.0|
|[113.0,2.0,16.0]|         0.0|
| [59.0,6.0,12.0]|         1.0|
| [82.0,5.0,14.0]|         1.0|
|[148.0,3.0,16.0]|         0.0|
|  [18.0,5.0,2.0]|         0.0|
|  [1.0,4.0,12.0]|         0.0|
|[168.0,3.0,18.0]|         0.0|
|  [1.0,3.0,16.0]|         0.0|
| [78.0,6.0,15.0]|         0.0|
|[175.0,5.0,13.0]|         0.0|
| [80.0,5.0,16.0]|         0.0|
|  [27.0,4.0,9.0]|         0.0|
+----------------+------------+
only showing top 20 rows



In [15]:
train, test = finalized_data.randomSplit([0.7, 0.3], 12345)

In [16]:
lsvc = LinearSVC(featuresCol="features",
                 labelCol="indexedLabel", maxIter=10, regParam=0.1)


In [17]:
# Fit the model
lsvcModel = lsvc.fit(train)

In [19]:
# Print the coefficients and intercept
print("Coefficients: " + str(lsvcModel.coefficients))
print("Intercept: " + str(lsvcModel.intercept))

Coefficients: [0.0013192749316592355,0.06863346641500623,-0.030499745618597404]
Intercept: -1.1235857645035567
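The printed coefficients and intercept define the linear decision function `w.x + b`. The sketch below evaluates it in plain Python for one row of the dataset (Age 59, Number 6, Start 12). The helper names `margin` and `predict` are hypothetical, and the threshold of 0.0 is assumed to match the `LinearSVC` default.

```python
# Values copied from the output above
coefficients = [0.0013192749316592355, 0.06863346641500623, -0.030499745618597404]
intercept = -1.1235857645035567


def margin(features):
    """Signed score w.x + b of a feature vector [Age, Number, Start]."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept


def predict(features, threshold=0.0):
    """Predict class 1.0 when the margin exceeds the threshold, else 0.0."""
    return 1.0 if margin(features) > threshold else 0.0


row = [59.0, 6.0, 12.0]  # one "present" row from the dataset
m = margin(row)
print(m)             # roughly -0.99994
print(predict(row))  # 0.0
```

The margin for this row is about -0.99994, which agrees in magnitude with the first entry of `rawPrediction` for the same features in the predictions table later in this notebook. Spark reports the pair `[-m, m]` there. Since the margin is negative, the row is predicted as class 0.0 even though its true label is 1.0.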


In [20]:
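# evaluate() returns a summary object (metrics plus a predictions DataFrame), not the predictions themselves
prediction = lsvcModel.evaluate(test)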

In [23]:
print("Accuracy: " + str(prediction.accuracy))

Accuracy: 0.75


In [25]:
prediction.predictions.show()



+----------------+------------+--------------------+----------+
|        features|indexedLabel|       rawPrediction|prediction|
+----------------+------------+--------------------+----------+
|  [2.0,3.0,13.0]|         0.0|[1.31154350843698...|       0.0|
| [15.0,5.0,16.0]|         0.0|[1.24862523835119...|       0.0|
|  [18.0,5.0,2.0]|         0.0|[0.81767097489585...|       0.0|
| [36.0,4.0,13.0]|         0.0|[1.19805469434556...|       0.0|
| [37.0,3.0,16.0]|         0.0|[1.35686812268470...|       0.0|
| [59.0,6.0,12.0]|         1.0|[0.99994469246879...|       0.0|
| [78.0,6.0,15.0]|         0.0|[1.06637770562306...|       0.0|
| [105.0,6.0,5.0]|         1.0|[0.72575982628228...|       0.0|
| [114.0,7.0,8.0]|         1.0|[0.73675212233813...|       0.0|
|[140.0,5.0,11.0]|         0.0|[0.93121714380080...|       0.0|
|[158.0,3.0,14.0]|         0.0|[1.13623636471674...|       0.0|
|[195.0,2.0,17.0]|         0.0|[1.24755589551614...|       0.0|
+----------------+------------+--------------------+----------+
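The accuracy of 0.75 can be recomputed by hand from the table above. The sketch below does so in plain Python, with the labels and predictions transcribed from the 12 rows shown. The variable names are illustrative only.

```python
# indexedLabel and prediction columns, transcribed row by row from the table above
labels      = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
predictions = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall for the positive class: share of true 1s that were predicted as 1
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall_positive = true_positives / sum(labels)

print(accuracy)         # 0.75
print(recall_positive)  # 0.0
```

Every test row is predicted as class 0.0, so the accuracy simply equals the share of `absent` cases in the test set (9 of 12), and none of the three `present` cases is detected. On an imbalanced dataset like this one, accuracy alone says little about how well the positive class is recognized.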

In [26]:
prediction.areaUnderROC

0.9259259259259259
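The high area under the ROC curve may look surprising given that every test row was predicted as class 0.0. AUC depends only on how the raw scores rank positives against negatives, not on the threshold. The sketch below recomputes it in plain Python as the fraction of (positive, negative) pairs in which the positive row has the higher score. The scores are the negated first entries of `rawPrediction`, truncated as printed in the predictions table, and the function name `pairwise_auc` is hypothetical.

```python
# (label, score for class 1) per test row; scores are truncated as displayed
rows = [
    (0, -1.31154350843698), (0, -1.24862523835119), (0, -0.81767097489585),
    (0, -1.19805469434556), (0, -1.35686812268470), (1, -0.99994469246879),
    (0, -1.06637770562306), (1, -0.72575982628228), (1, -0.73675212233813),
    (0, -0.93121714380080), (0, -1.13623636471674), (0, -1.24755589551614),
]


def pairwise_auc(rows):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in rows if y == 1]
    negatives = [s for y, s in rows if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = pairwise_auc(rows)
print(auc)  # 25 of 27 pairs are ranked correctly, about 0.926
```

The three positive rows have the highest scores apart from two negatives, so 25 of the 27 pairs are ordered correctly. This suggests the model ranks the cases reasonably well and that a threshold below the default of 0.0 would detect some of the `present` cases, at the cost of more false positives.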