# Multiple Logistic Regression in Spark - College Admission

### Overview
Predict college admission using Multiple Logistic Regression
 
### Builds on
None

### Run time
approx. 10-20 minutes

### Notes



In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])


## Step 1: College Admission Data

Let's look at the college admission data.  Here, we have some student test scores, GPA, and Rank, followed by whether the student was admitted or not.


|gre  |gpa  |rank |  admitted |
|-----|-----|-----|-----------|
|380  |3.61 | 3   |    no     |
|660  |3.67 | 1   |    yes    |
|800  |4.0  | 1   |    yes    |
|640  |3.19 | 4   |    yes    |
|520  |2.93 | 4   |    no     |
|760  |3.0  | 2   |    yes    |

In [None]:
admissions = spark.read.csv("/data/college-admissions/admission-data.csv", header=True, inferSchema=True)
admissions.show()

## Step 2: Convert dataframe to Spark and Prepare feature vector

We need to first convert the dataframe to Spark, and then prepare the feature vector.

**=> TODO: Select all columns except for "admit" to be in features**

**=> TODO: Make a new column called "label" with same value as "admit"**



In [None]:
## Hint : select columns 'gpa', 'gre', 'rank'
assembler = VectorAssembler(inputCols=["???", "???","???"], outputCol="features")
featureVector = assembler.transform(admissions)

## Hint : the label column is 'admit'
featureVector = featureVector.withColumn("label",featureVector["???"])
featureVector.show()


## Step 3: Split Data into training and Test

We will split our data into training and test so we can see how it performs.

**=> TODO: Use training / test split of 70%/30%**

In [None]:
## Split the data into train and test
## Hint : 
##     - training split is 70%  --> 0.7
##     - testing split  is 30%  --> 0.3
(train, test) = featureVector.randomSplit([??training_split??,  ??testing split??])
#(train, test) = featureVector.randomSplit([??training_split??,  ??testing split??], seed=1)

## print out record count
print ("train dataset count : " , train.???())
print ("test dataset count : " , test.???())

print("training data set")
train.show(10)

print("test data set")
test.show(10)


## Step 4: Run logistic regression

**=> TODO: Run with 50 iterations**

In [None]:
lr = LogisticRegression(maxIter=???, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(train)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
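
The `regParam` and `elasticNetParam` arguments above control regularization. The following is a minimal sketch of an elastic-net penalty, assuming the form described in the Spark ML documentation, with `regParam` as the overall strength and `elasticNetParam` as the mix between L1 and L2. The function name and the weights are hypothetical and chosen for illustration; Spark's internal handling (for example feature standardization) is not reproduced here.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    alpha = elastic_net_param
    l1 = sum(abs(w) for w in weights)          # sum of absolute weights
    l2_squared = sum(w * w for w in weights)   # sum of squared weights
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical weights, with the regParam / elasticNetParam values used above
demo_penalty = elastic_net_penalty([1.0, -2.0], reg_param=0.3, elastic_net_param=0.8)
print("penalty:", demo_penalty)
```

With `elasticNetParam` close to 1 the L1 term dominates, which tends to push some coefficients to exactly zero.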

## Step 5: Inspect Learning


In [None]:
## show predictions in training data
lrModel.summary.predictions.show()

## sample based on data
lrModel.summary.predictions.sampleBy("label", fractions={0: 0.5, 1: 0.5}, seed=0).show()

In [None]:
## How many data points where label != prediction
missed = lrModel.summary.predictions.filter("label != prediction")

print("missed ", missed.count())
missed.show()


## Step 6: Evaluate Model

### 6.1 Confusion Matrix

In [None]:
predictions = lrModel.transform(test)
predictions.groupBy('admit').pivot('prediction').count().na.fill(0).orderBy('admit').show()
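
To make the pivot table above easier to read, here is a small plain-Python sketch of the same idea: count each (label, prediction) pair and derive accuracy from the counts. The labels and predictions are made-up values, not output from the model, and all `demo_` names are hypothetical.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count (label, prediction) pairs, similar to the groupBy/pivot above."""
    return Counter(zip(labels, predictions))

# Hypothetical labels and predictions, for illustration only
demo_labels = [1, 0, 1, 1, 0, 0]
demo_preds = [1, 0, 0, 1, 1, 0]

demo_counts = confusion_counts(demo_labels, demo_preds)
tp, tn = demo_counts[(1, 1)], demo_counts[(0, 0)]   # correct predictions
fp, fn = demo_counts[(0, 1)], demo_counts[(1, 0)]   # wrong predictions
demo_accuracy = (tp + tn) / len(demo_labels)

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
print("accuracy:", demo_accuracy)
```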

### 6.2 :  ROC Curve & AUC

In [None]:
trainingSummary = lrModel.summary

# Area Under Curve is part of training summary
# use TAB completion :  trainingSummary.TAB

print("areaUnderROC: " , trainingSummary.???)

In [None]:
# ROC

roc_df = trainingSummary.roc.toPandas()

plt.plot(roc_df['FPR'], roc_df['TPR'])
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.plot([0.0, 1.0], [0.0, 1.0], 'r')
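
The sketch below shows one way an ROC curve and its area can be computed by hand: sort the rows by predicted probability, lower the threshold one row at a time, and record the false and true positive rates. It assumes no tied scores and uses made-up labels and scores; it is not how Spark computes `roc` internally, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points by sweeping the threshold over sorted scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted probabilities (no tied scores)
demo_labels = [1, 1, 0, 1, 0, 0]
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
demo_roc = roc_points(demo_labels, demo_scores)
demo_auc = area_under_curve(demo_roc)
print("AUC:", demo_auc)
```

The red diagonal in the plot corresponds to an area of 0.5, which is what random guessing would give.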

### 6.3 - Iterations and Objective History
**Q : how many iterations did we do?**  
- What does that tell you?
- Increase the maximum number of iterations from 50 to 100. Does it change the results?

In [None]:
## trainingSummary has an attribute called 'totalIterations'  
## Hint : use the TAB completion
print ("total iterations ", trainingSummary.???)

## you can uncomment this and see how the error is diminishing in each iteration
print("objectiveHistory:")
#for objective in trainingSummary.objectiveHistory:
#    print(objective)

## Step 7: Run on the test data

**=> TODO: transform the test data**

In [None]:
## what is the name of test dataframe?
predictions = lrModel.transform(???)
predictions.show()

## sample
predictions.sampleBy("label", fractions={0: 0.5, 1: 0.5}, seed=0).show()


## Step 8: Calculate Accuracy on Test Data

**=> TODO: evaluate the predictions**

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " , accuracy)

## Step 9: Run some predictions on new data

Let's take some new data, and run predictions on that.

**=> TODO: create spark dataframe from pandas dataframe**

**=> TODO: transform the new data in order to get feature vectors**

In [None]:
newdata = pd.DataFrame({'gre' : [600, 700, 800], 
                        'gpa' : [4.0, 3.5, 3.2],
                        'rank': [1,   2,   3]}
             )
print(newdata)

## Hint : input is 'newdata'
spark_newdata = spark.createDataFrame(???)

## Hint : spark_newdata
newfeatures = assembler.transform(???)

lrModel.transform(newfeatures).show()
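
As a rough illustration of what the transform does with each row, the sketch below applies the logistic function to an intercept plus a weighted sum of the features. The coefficients and intercept are hypothetical, not the fitted model's values, and the feature order (gre, gpa, rank) and the 0.5 threshold are assumptions made for this example.

```python
import math

def admit_probability(features, coefficients, intercept):
    """Logistic function applied to intercept + dot(coefficients, features)."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and intercept, NOT the fitted model's values.
# Feature order is assumed to be (gre, gpa, rank).
demo_coefficients = [0.002, 0.8, -0.6]
demo_intercept = -3.5

demo_probabilities = []
for row in [(600, 4.0, 1), (700, 3.5, 2), (800, 3.2, 3)]:
    p = admit_probability(row, demo_coefficients, demo_intercept)
    demo_probabilities.append(p)
    # Assumed threshold of 0.5 to turn a probability into a class
    print(row, "probability:", round(p, 3), "prediction:", 1.0 if p > 0.5 else 0.0)
```

To repeat this with the real model, substitute the values printed from `lrModel.coefficients` and `lrModel.intercept`, in the same order as the assembler's input columns.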

## Step 10 : Save Output
**TODO : Inspect the saved data**  
(Hint : you can open them in Excel)

In [None]:
# save data to a csv file for inspection
predictions2 = predictions.select(['admit', 'gre', 'gpa', 'rank', 'prediction'])

## Option 1 : use Spark write function
## this works for big data (writes are distributed across cluster)
output_path1="college-admissions-predictions.out1"
predictions2.write.\
    option('header', 'true').\
    mode('overwrite').\
    csv(output_path1)
print("save 1 (spark)  to : ", output_path1)


## Option 2 : convert to Pandas dataframe and save
## This is good for small amount of data
output_path2= 'college-admissions-predictions.out2.csv'
predictions2.toPandas().to_csv(output_path2 )
print("save 2 (pandas) to : ", output_path2)

## Step 11: Do a few runs and see the accuracy
Why do you think the accuracy varies for each run?  

Try this: at Step 3, supply a `seed` parameter (any number will do) and run again.  
Do you see the accuracy varying now?  
Can you explain the behaviour?
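
One way to explore the question is with a small plain-Python sketch of a seeded random split. It loosely mimics a per-row random assignment and is not Spark's `randomSplit` implementation; the function and `demo_` names are hypothetical.

```python
import random

def random_split(rows, train_fraction, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        (train_rows if rng.random() < train_fraction else test_rows).append(row)
    return train_rows, test_rows

demo_rows = list(range(20))
first = random_split(demo_rows, 0.7, seed=1)
second = random_split(demo_rows, 0.7, seed=1)
print("same split with same seed:", first == second)
print("train size:", len(first[0]), "test size:", len(first[1]))
```

Note that the split sizes are only approximately 70%/30%, since each row is assigned independently.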

In [None]:
trainingSummary = lrModel.summary
# Area Under Curve
print("areaUnderROC: " , trainingSummary.areaUnderROC)
accuracy = evaluator.evaluate(predictions) 
print("Test set accuracy = " , accuracy)