# Support Vector Machines: College Admission

Let's look at a classification example in Spark MLlib. We worked with the college admission dataset before; here we revisit it with a linear support vector machine.


In [None]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import pandas as pd



## Step 1: Load the data

In [None]:
dataset = spark.read.csv("/data/college-admissions/admission-data.csv", header=True, inferSchema=True)


In [None]:
dataset.show(20)

## Step 2: Visualize the data

We cannot plot the data directly because it has too many dimensions. However, we can use PCA as
a dimensionality reduction technique to visualize it.

**=> TODO: Use PCA to reduce the data to 2 dimensions. Explain why 2.**

In [None]:
pca = PCA(n_components=???)   #TODO: Reduce dimensions down to 2

data = dataset.toPandas()

y = data['admit']          # Split off classifications
X = data.loc[:, 'gre':]    # Split off features (all columns from 'gre' onward)
#X_norm = (X - X.min())/(X.max() - X.min())
#transformed = pd.DataFrame(pca.fit_transform(X_norm))
transformed = pd.DataFrame(pca.fit_transform(X))

transformed

plt.scatter(transformed[y==0][0], transformed[y==0][1], label='Rejected', c='red')
plt.scatter(transformed[y==1][0], transformed[y==1][1], label='Admitted', c='blue')

plt.legend()




How separable does this data appear to be?  Do you notice any trends in the red and blue dots?
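
The commented-out `X_norm` lines in the cell above hint at one thing to try: PCA picks the directions of largest variance, so a column on a large scale (such as `gre`) can dominate columns on a small scale (such as `gpa`). The sketch below uses made-up rows, not rows from the dataset, to compare per-column variance before and after min-max scaling. The names `rows` and `min_max_scale` are illustrative assumptions.

```python
from statistics import pvariance

# Made-up sample rows for illustration only: (gre, gpa, rank)
rows = [(400, 3.6, 3), (660, 3.7, 3), (800, 4.0, 1), (640, 3.2, 4), (500, 2.9, 4)]

def min_max_scale(column):
    """Rescale a column to the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Transpose the rows into one tuple per column
columns = list(zip(*rows))

raw_var = [pvariance(c) for c in columns]
scaled_var = [pvariance(min_max_scale(c)) for c in columns]

print("variance before scaling:", raw_var)
print("variance after scaling: ", scaled_var)
```

If the first raw variance dwarfs the others, the first principal component is mostly just `gre`. After scaling, the columns contribute on comparable terms, which may change how the scatter plot looks.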

## Step 3: Split into training and test

**=> TODO: Split into training/test with an 80/20 split**

In [None]:
# Split into training and test
# TODO: create training and test with an 80/20 split
(training, test) = dataset.randomSplit([???, ???])

## Step 4: Build the Vector

**=> TODO: Build the feature vector from these three columns: gre, gpa, rank**

In [None]:
assembler = VectorAssembler(inputCols=['???', '???', '????'], outputCol="features")
featureVector = assembler.transform(training)
featureVector = featureVector.withColumnRenamed("admit", "label")
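
A note on the `randomSplit` call: the weights are relative, and each row is assigned to a split at random, so the resulting sizes are only approximately in the requested ratio. The plain-Python sketch below imitates that behavior under stated assumptions; it is not Spark's implementation, and the function name `random_split` is hypothetical.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to a split, with probability
    proportional to that split's weight (a sketch, not Spark's code)."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds of each split on the [0, 1) interval
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any leftover from rounding
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(range(1000), [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

Passing a seed makes the split repeatable, which helps when comparing runs. In Spark, `randomSplit` also accepts a `seed` argument for the same purpose.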

## Step 5: Build the Linear SVM model

In [None]:
lsvc = LinearSVC(maxIter=10, regParam=0.1)

# Fit the model
lsvcModel = lsvc.fit(featureVector)

# Print the coefficients and intercept for LinearSVC
print("Coefficients: " + str(lsvcModel.coefficients))
print("Intercept: " + str(lsvcModel.intercept))


**=> TODO: Take the coefficients and intercept and do your own prediction**

sample_gre = 700
sample_gpa = 3.5
sample_rank = 3

predicted_admit = (??? * ???) + (??? * ???) + (??? * ???) + ???


In [None]:
# TODO: do your prediction


## Step 6: Run the test set and get the predictions

**=> TODO: Rename the label from "admit" to "label"**

**=> TODO: Transform the test dataset to get predictions**



In [None]:
featureVector_test = assembler.transform(test)
featureVector_test = featureVector_test.withColumnRenamed("???", "???")

predictions = lsvcModel.transform(???)

predictions.show()

## Step 7: See the evaluation metrics

In [None]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
evaluator.evaluate(predictions)  # area under the ROC curve (AUC), the default metric


**=> What does AUC = 1 mean?** 
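
One way to build intuition for this question: AUC equals the fraction of (admitted, rejected) pairs in which the admitted example receives the higher score, with ties counted as half. The sketch below computes it that way on made-up scores. It is an equivalent pairwise definition, not how Spark computes the metric, and the function name `auc` is an assumption.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]

# Every positive outscores every negative
perfect = auc([-1.2, -0.3, 0.4, 1.5], labels)

# One negative (0.6) outscores one positive (0.4)
mixed = auc([-1.2, 0.6, 0.4, 1.5], labels)

print(perfect, mixed)
```

A value of 0.5 corresponds to random ranking, so values close to 0.5 suggest the scores carry little information about the label.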

In [None]:

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))


## Step 8: Show the confusion matrix

In [None]:
# Confusion matrix
predictions.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label').show()

**=> TODO: What is the meaning of the confusion matrix?**
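
In the pivot table above, each row is a true label and each column is a predicted label, so the four cells are the true negatives, false positives, false negatives, and true positives. The sketch below derives the same four counts, plus a few common metrics, from made-up label/prediction pairs. The name `confusion_counts` is an illustrative assumption.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count each (true label, predicted label) combination for a binary problem."""
    counts = Counter(zip(labels, predictions))
    return {
        "tn": counts[(0, 0)],   # rejected, predicted rejected
        "fp": counts[(0, 1)],   # rejected, predicted admitted
        "fn": counts[(1, 0)],   # admitted, predicted rejected
        "tp": counts[(1, 1)],   # admitted, predicted admitted
    }

# Made-up values for illustration only
labels = [0, 0, 0, 1, 1, 1, 1, 0]
preds = [0, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_counts(labels, preds)
accuracy = (cm["tp"] + cm["tn"]) / len(labels)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])

print(cm)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

Accuracy alone can hide an imbalance between the two error types, which is why the matrix is worth reading cell by cell.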



## Step 9: Try running a prediction on your own data

**=> TODO: Create a few rows in your own dataframe (start with a pandas dataframe)**

**=> TODO: Run `.transform` from your model to see the results**