# Support Vector Machines: College Admission

Let's work through a classification example in Spark MLlib. We looked at the college admission dataset earlier; here we revisit it with a linear support vector machine (SVM).


In [None]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
import matplotlib.pyplot as plt

import pandas as pd

## Step 1: Load the data

In [None]:
dataset = spark.read.csv("/data/college-admissions/admission-data.csv", header=True, inferSchema=True)
dataset.show(20)

## Step 2: Build the Vector

**=> Build the vector with these three columns: rank, gre, gpa**

In [None]:
assembler = VectorAssembler(inputCols=['???', '???', '???'], outputCol="features")
featureVector = assembler.transform(dataset)
featureVector = featureVector.withColumn("label", featureVector["admit"])
featureVector.sample(False, 0.1, seed=10).show(50)

## Step 3: Split into training and test

**=> Split into training/test with an 80/20 split**

In [None]:
## Split into training and test
## TODO: create training and test with an 80/20 split
(training, test) = featureVector.randomSplit([???, ???])

print("training set count:", training.count())
print("testing set count:", test.count())
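
The counts printed above will usually not be an exact 80/20 split. As a rough mental model (an assumption about the general idea, not Spark's actual implementation), each row is assigned independently at random, with the weights normalized to probabilities. The helper `random_split` and the 400 integer "rows" below are hypothetical and use plain Python rather than Spark.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability
    proportional to that split's weight (illustrative sketch only)."""
    total = sum(weights)
    # Cumulative boundaries, e.g. roughly [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating-point rounding leftovers
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(400))  # hypothetical stand-in for DataFrame rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print("train:", len(train_rows), "test:", len(test_rows))
```

Passing a fixed seed makes the split repeatable, which helps when comparing models across runs.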

## Step 4: Build the Linear SVM model

In [None]:
lsvc = LinearSVC(maxIter=10, regParam=0.1)

# Fit the model
lsvcModel = lsvc.fit(???training set????)

# Print the coefficients and intercept of the LinearSVC model
print("Coefficients: " + str(lsvcModel.coefficients))
print("Intercept: " + str(lsvcModel.intercept))
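
To see how the coefficients and intercept turn into a prediction, here is a minimal plain-Python sketch of a linear decision rule: compute the margin `w . x + b` and predict 1 when it is above a threshold (assumed here to be 0.0). The coefficient values, the intercept, and the two applicants are made up for illustration; they are not output from this model.

```python
def svm_margin(coefficients, intercept, features):
    """Signed score of a point relative to the separating hyperplane: w . x + b."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept

def svm_predict(coefficients, intercept, features, threshold=0.0):
    """Predict class 1.0 when the margin exceeds the threshold, otherwise 0.0."""
    return 1.0 if svm_margin(coefficients, intercept, features) > threshold else 0.0

# Hypothetical values, in the feature order rank, gre, gpa
coefficients = [-0.5, 0.002, 0.4]
intercept = -2.0

strong_applicant = [1.0, 700.0, 3.8]
weak_applicant = [4.0, 500.0, 2.5]

print("margin:", svm_margin(coefficients, intercept, strong_applicant))
print("prediction:", svm_predict(coefficients, intercept, strong_applicant))
print("prediction:", svm_predict(coefficients, intercept, weak_applicant))
```

The sign of each coefficient tells you which direction that feature pushes the margin; the magnitudes are only comparable across features if the features are on similar scales.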


## Step 5: Run the test set and get the predictions

In [None]:
predictions = lsvcModel.transform(test)
predictions.sample(False, 0.2).show(50)

## Step 6: See the evaluation metrics

In [None]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)  # area under the ROC curve (AUC), the evaluator's default metric


**=> What does AUC mean?** 
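
One way to think about AUC: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The sketch below computes it by brute force over all positive/negative pairs in plain Python. The helper `auc_from_scores` and the labels and scores are hypothetical, and this is not how Spark computes the metric internally.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and raw scores
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.3, 0.2, 0.8]

auc = auc_from_scores(labels, scores)
print("AUC:", auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of the threshold used to turn scores into class predictions.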

In [None]:

# Compare predictions with the true labels and compute test accuracy
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("accuracy:", accuracy)


## Step 7: Show the confusion matrix

In [None]:
# Confusion matrix
predictions.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label').show()

**=> TODO: What is the meaning of the confusion matrix?**
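
In the table above, each row is a true label and each column is a predicted label, so the four cells are true negatives (label 0, predicted 0), false positives (label 0, predicted 1), false negatives (label 1, predicted 0), and true positives (label 1, predicted 1). As a small sketch, the plain-Python helper below derives accuracy, precision, and recall from those four counts. The helper `confusion_metrics` and the counts are hypothetical, not results from this dataset.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        # Share of all predictions that were correct
        "accuracy": (tp + tn) / total,
        # Of the rows predicted as admitted, how many really were admitted
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the rows really admitted, how many the model found
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts read off a confusion matrix
metrics = confusion_metrics(tn=50, fp=5, fn=15, tp=10)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With imbalanced classes, accuracy can look good while recall on the minority class stays low, which is why the confusion matrix is worth inspecting alongside the single accuracy number.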



## Step 8: Try running a prediction on your own data

**=> Create a few rows in your own dataframe (start with a pandas dataframe)**

**=> Run .transform from your model to see the results.**

In [None]:
newdata = pd.DataFrame({
    'gre':  [600, 700, 800],
    'gpa':  [4.0, 3.5, 3.2],
    'rank': [1, 2, 3],
})
print(newdata)

## Hint : input is 'newdata'
spark_newdata = spark.createDataFrame(???)

## Hint : spark_newdata
newfeatures = assembler.transform(???)

lsvcModel.transform(newfeatures).show()