# College Type Predictions

**Goal: classify colleges as private or public. `Private` is the label; the remaining columns are the candidate features:**

- `Private` A factor with levels No and Yes indicating private or public university
- `Apps` Number of applications received
- `Accept` Number of applications accepted
- `Enroll` Number of new students enrolled
- `Top10perc` Pct. new students from top 10% of H.S. class
- `Top25perc` Pct. new students from top 25% of H.S. class
- `F.Undergrad` Number of fulltime undergraduates
- `P.Undergrad` Number of parttime undergraduates
- `Outstate` Out-of-state tuition
- `Room.Board` Room and board costs
- `Books` Estimated book costs
- `Personal` Estimated personal spending
- `PhD` Pct. of faculty with Ph.D.’s
- `Terminal` Pct. of faculty with terminal degree
- `S.F.Ratio` Student/faculty ratio
- `perc.alumni` Pct. alumni who donate
- `Expend` Instructional expenditure per student
- `Grad.Rate` Graduation rate

### Initialize and create a spark session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("college").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577772468360)
SparkSession available as 'spark'


2019-12-31 11:38:04 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@3a277130


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Using Spark to read the college data set

In [3]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("College.csv")

data: org.apache.spark.sql.DataFrame = [School: string, Private: string ... 17 more fields]


### Count

In [4]:
data.count

res1: Long = 777


### Schema

In [5]:
data.printSchema

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



### Show

In [6]:
data.show(3)

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30|     12.2|         16| 10527|       56|
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|  

### First Row of Dataframe

In [7]:
val colnames = data.columns
val firstRow = data.head(1)(0)

colnames: Array[String] = Array(School, Private, Apps, Accept, Enroll, Top10perc, Top25perc, F_Undergrad, P_Undergrad, Outstate, Room_Board, Books, Personal, PhD, Terminal, S_F_Ratio, perc_alumni, Expend, Grad_Rate)
firstRow: org.apache.spark.sql.Row = [Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60]


In [11]:
for(i <- Range(0,colnames.size)){
    println(s"Column Name: ${colnames(i)}")
    println(s"Column Value: ${firstRow(i)}")
    println()
}

Column Name: School
Column Value: Abilene Christian University

Column Name: Private
Column Value: Yes

Column Name: Apps
Column Value: 1660

Column Name: Accept
Column Value: 1232

Column Name: Enroll
Column Value: 721

Column Name: Top10perc
Column Value: 23

Column Name: Top25perc
Column Value: 52

Column Name: F_Undergrad
Column Value: 2885

Column Name: P_Undergrad
Column Value: 537

Column Name: Outstate
Column Value: 7440

Column Name: Room_Board
Column Value: 3300

Column Name: Books
Column Value: 450

Column Name: Personal
Column Value: 2200

Column Name: PhD
Column Value: 70

Column Name: Terminal
Column Value: 78

Column Name: S_F_Ratio
Column Value: 18.1

Column Name: perc_alumni
Column Value: 12

Column Name: Expend
Column Value: 7041

Column Name: Grad_Rate
Column Value: 60



## Formatting the Data

A few things need to happen before Spark ML can accept the data.
 
It needs to be in the form of two columns:
  `("label","features")`

### Importing VectorAssembler and Vectors

In [12]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


### Assembling all the feature columns into a single vector column "features"

In [13]:
data.columns

res8: Array[String] = Array(School, Private, Apps, Accept, Enroll, Top10perc, Top25perc, F_Undergrad, P_Undergrad, Outstate, Room_Board, Books, Personal, PhD, Terminal, S_F_Ratio, perc_alumni, Expend, Grad_Rate)


In [15]:
// Skip the School column: it is a string identifier and of little use as a feature
val assembler = new VectorAssembler().setInputCols(Array("Apps", "Accept", "Enroll", "Top10perc", "Top25perc", "F_Undergrad"
                                                         , "P_Undergrad", "Outstate", "Room_Board", "Books", "Personal", "PhD"
                                                         , "Terminal", "S_F_Ratio", "perc_alumni", "Expend", "Grad_Rate"))
                                     .setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_66e9bbb242bc
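The hand-typed column list passed to `setInputCols` could also be derived from the column names. The sketch below is plain Scala and assumes that the identifier `School` and the label `Private` are the only columns to leave out; `allColumns`, `excluded` and `featureColumns` are names introduced here, not part of the notebook.

```scala
// Column names copied from the `data.columns` output above.
val allColumns: Array[String] = Array(
  "School", "Private", "Apps", "Accept", "Enroll", "Top10perc", "Top25perc",
  "F_Undergrad", "P_Undergrad", "Outstate", "Room_Board", "Books", "Personal",
  "PhD", "Terminal", "S_F_Ratio", "perc_alumni", "Expend", "Grad_Rate")

// Assumption: only the identifier and the label are excluded from the features.
val excluded: Set[String] = Set("School", "Private")

// Keep every column that is not in the excluded set, preserving the original order.
val featureColumns: Array[String] = allColumns.filterNot(excluded.contains)

println(featureColumns.mkString(", "))
```

In the notebook, `data.columns` would take the place of `allColumns`, and `featureColumns` could then be handed to `setInputCols`.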


In [16]:
val output = assembler.transform(data)

output: org.apache.spark.sql.DataFrame = [School: string, Private: string ... 18 more fields]


In [17]:
output.printSchema

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)



## The label column "Private" is a categorical string, so it must be converted to a numeric type

### Using StringIndexer to convert the categorical string column to a numeric type

In [18]:
import org.apache.spark.ml.feature.StringIndexer

import org.apache.spark.ml.feature.StringIndexer


In [19]:
val indexer = new StringIndexer().setInputCol("Private").setOutputCol("PrivateInd")

indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_e0b3dac7e70d


In [20]:
val indexed_model = indexer.fit(output)

indexed_model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e0b3dac7e70d


In [21]:
val fixed_output = indexed_model.transform(output)

fixed_output: org.apache.spark.sql.DataFrame = [School: string, Private: string ... 19 more fields]


In [22]:
fixed_output.printSchema

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PrivateInd: double (nullable = false)



In [23]:
val final_data = fixed_output.select("PrivateInd","features")

final_data: org.apache.spark.sql.DataFrame = [PrivateInd: double, features: vector]


In [25]:
final_data.show(5,false)

+----------+----------------------------------------------------------------------------------------------------------+
|PrivateInd|features                                                                                                  |
+----------+----------------------------------------------------------------------------------------------------------+
|0.0       |[1660.0,1232.0,721.0,23.0,52.0,2885.0,537.0,7440.0,3300.0,450.0,2200.0,70.0,78.0,18.1,12.0,7041.0,60.0]   |
|0.0       |[2186.0,1924.0,512.0,16.0,29.0,2683.0,1227.0,12280.0,6450.0,750.0,1500.0,29.0,30.0,12.2,16.0,10527.0,56.0]|
|0.0       |[1428.0,1097.0,336.0,22.0,50.0,1036.0,99.0,11250.0,3750.0,400.0,1165.0,53.0,66.0,12.9,30.0,8735.0,54.0]   |
|0.0       |[417.0,349.0,137.0,60.0,89.0,510.0,63.0,12960.0,5450.0,450.0,875.0,92.0,97.0,7.7,37.0,19016.0,59.0]       |
|0.0       |[193.0,146.0,55.0,16.0,44.0,249.0,869.0,7560.0,4120.0,800.0,1500.0,76.0,72.0,11.9,2.0,10922.0,15.0]       |
+----------+----------------------------
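In the rows shown above, colleges with `Private = Yes` end up with `PrivateInd = 0.0`. This is consistent with `StringIndexer`'s default ordering, which gives index 0 to the most frequent label. The sketch below imitates that ordering in plain Scala; the `toyLabels` values are invented for illustration and are not taken from `College.csv`, and the code is not Spark's implementation.

```scala
// Toy labels, invented for illustration (not taken from College.csv).
val toyLabels: Seq[String] = Seq("Yes", "No", "Yes", "Yes", "No")

// Count occurrences and order labels by descending frequency,
// mirroring the idea behind StringIndexer's default "frequencyDesc" ordering.
val byFrequency: Seq[String] = toyLabels
  .groupBy(identity)
  .map { case (label, occurrences) => (label, occurrences.size) }
  .toSeq
  .sortBy { case (_, count) => -count }
  .map { case (label, _) => label }

// The position in that ordering becomes the numeric index.
val labelToIndex: Map[String, Double] =
  byFrequency.zipWithIndex.map { case (label, i) => (label, i.toDouble) }.toMap

println(labelToIndex)
```

So in everything that follows, label `0.0` stands for private colleges and `1.0` for public ones.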

### Splitting the data into training and test sets: the training set is used to fit the models, the test set to evaluate the fitted models

In [26]:
val Array(train_data,test_data) = final_data.randomSplit(Array(0.7,0.3))

train_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [PrivateInd: double, features: vector]
test_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [PrivateInd: double, features: vector]


In [30]:
final_data.count()

res16: Long = 777


In [31]:
train_data.count()

res17: Long = 528


In [32]:
test_data.count()

res18: Long = 249
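`randomSplit` assigns rows at random, so the 0.7 / 0.3 weights are targets rather than exact proportions. The short check below, in plain Scala, computes the fractions actually obtained from the counts printed above; the variable names are introduced here for illustration.

```scala
// Row counts copied from the three count() outputs above.
val totalRows = 777L
val trainRows = 528L
val testRows = 249L

// Realised fractions of the split, to compare against the requested 0.7 / 0.3.
val trainFraction: Double = trainRows.toDouble / totalRows
val testFraction: Double = testRows.toDouble / totalRows

println(f"train: $trainFraction%1.4f, test: $testFraction%1.4f")
```

Because no seed is passed, a rerun will generally produce a different split and slightly different accuracies. `randomSplit` also has an overload that takes a seed, for example `final_data.randomSplit(Array(0.7, 0.3), 42L)`, where 42 is an arbitrary value chosen here.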


### Setting up the Tree Classifier Models

In [33]:
import org.apache.spark.ml.classification.{DecisionTreeClassifier,RandomForestClassifier,GBTClassifier}

import org.apache.spark.ml.classification.{DecisionTreeClassifier, RandomForestClassifier, GBTClassifier}


### Creating all three models:

In [34]:
val dtc = new DecisionTreeClassifier().setLabelCol("PrivateInd").setFeaturesCol("features")
val rfc = new RandomForestClassifier().setLabelCol("PrivateInd").setFeaturesCol("features")
val gbt = new GBTClassifier().setLabelCol("PrivateInd").setFeaturesCol("features")

dtc: org.apache.spark.ml.classification.DecisionTreeClassifier = dtc_386de2f591e4
rfc: org.apache.spark.ml.classification.RandomForestClassifier = rfc_4e4d45bbf1eb
gbt: org.apache.spark.ml.classification.GBTClassifier = gbtc_31364d7d1dcf


### Training all three models:

In [35]:
val dtc_model = dtc.fit(train_data)
val rfc_model = rfc.fit(train_data)
val gbt_model = gbt.fit(train_data)

dtc_model: org.apache.spark.ml.classification.DecisionTreeClassificationModel = DecisionTreeClassificationModel (uid=dtc_386de2f591e4) of depth 5 with 49 nodes
rfc_model: org.apache.spark.ml.classification.RandomForestClassificationModel = RandomForestClassificationModel (uid=rfc_4e4d45bbf1eb) with 20 trees
gbt_model: org.apache.spark.ml.classification.GBTClassificationModel = GBTClassificationModel (uid=gbtc_31364d7d1dcf) with 20 trees


### Getting predictions from all three models on the test dataset

In [36]:
val dtc_predictions = dtc_model.transform(test_data)
val rfc_predictions = rfc_model.transform(test_data)
val gbt_predictions = gbt_model.transform(test_data)

dtc_predictions: org.apache.spark.sql.DataFrame = [PrivateInd: double, features: vector ... 3 more fields]
rfc_predictions: org.apache.spark.sql.DataFrame = [PrivateInd: double, features: vector ... 3 more fields]
gbt_predictions: org.apache.spark.sql.DataFrame = [PrivateInd: double, features: vector ... 3 more fields]


In [37]:
dtc_predictions.show(3)

+----------+--------------------+-------------+-----------+----------+
|PrivateInd|            features|rawPrediction|probability|prediction|
+----------+--------------------+-------------+-----------+----------+
|       0.0|[100.0,90.0,35.0,...|  [223.0,0.0]|  [1.0,0.0]|       0.0|
|       0.0|[141.0,118.0,55.0...|  [223.0,0.0]|  [1.0,0.0]|       0.0|
|       0.0|[174.0,146.0,88.0...|  [223.0,0.0]|  [1.0,0.0]|       0.0|
+----------+--------------------+-------------+-----------+----------+
only showing top 3 rows



In [38]:
rfc_predictions.show(3)

+----------+--------------------+--------------------+--------------------+----------+
|PrivateInd|            features|       rawPrediction|         probability|prediction|
+----------+--------------------+--------------------+--------------------+----------+
|       0.0|[100.0,90.0,35.0,...|[17.2117296389425...|[0.86058648194712...|       0.0|
|       0.0|[141.0,118.0,55.0...|[16.9513206143356...|[0.84756603071678...|       0.0|
|       0.0|[174.0,146.0,88.0...|[17.7267202786947...|[0.88633601393473...|       0.0|
+----------+--------------------+--------------------+--------------------+----------+
only showing top 3 rows



In [39]:
gbt_predictions.show(3)

2019-12-31 12:02:06 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-31 12:02:06 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
+----------+--------------------+--------------------+--------------------+----------+
|PrivateInd|            features|       rawPrediction|         probability|prediction|
+----------+--------------------+--------------------+--------------------+----------+
|       0.0|[100.0,90.0,35.0,...|[1.45519847123029...|[0.94835800828938...|       0.0|
|       0.0|[141.0,118.0,55.0...|[1.55287067245585...|[0.95712894965148...|       0.0|
|       0.0|[174.0,146.0,88.0...|[1.54383079619313...|[0.95638091900386...|       0.0|
+----------+--------------------+--------------------+--------------------+----------+
only showing top 3 rows
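The three tables show different-looking `rawPrediction` values that all lead to a `probability` column. The sketch below reproduces the class-0 probability of the first row for each model in plain Scala. It assumes the usual Spark 2.x behaviour: leaf counts normalised for a decision tree, per-tree probabilities averaged for a random forest, and a logistic function of twice the margin for GBT. The raw values are copied from the truncated output above, so the results are approximate, and all variable names are introduced here.

```scala
// Values copied from the first row of each predictions table above.
val dtcRaw = Array(223.0, 0.0)       // per-class counts in the matching leaf
val rfcRawClass0 = 17.2117296389425  // summed per-tree class-0 probabilities (truncated)
val gbtRawClass0 = 1.45519847123029  // GBT raw score for class 0 (truncated)
val numTrees = 20                    // from the "with 20 trees" model summary

// Decision tree: normalise the leaf counts so they sum to 1.
val dtcProbClass0: Double = dtcRaw(0) / dtcRaw.sum

// Random forest: average the per-tree probabilities.
val rfcProbClass0: Double = rfcRawClass0 / numTrees

// GBT with logistic loss: logistic function of twice the raw score.
val gbtProbClass0: Double = 1.0 / (1.0 + math.exp(-2.0 * gbtRawClass0))

println(s"DTC: $dtcProbClass0, RFC: $rfcProbClass0, GBT: $gbtProbClass0")
```

The results can be compared with the first entry of each `probability` column above.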



### Evaluating the models using MulticlassClassificationEvaluator

In [40]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator


In [41]:
val acc_evaluator = new MulticlassClassificationEvaluator().setLabelCol("PrivateInd").setPredictionCol("prediction").setMetricName("accuracy")

acc_evaluator: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_a4ff4c7f1799


In [42]:
val dtc_accuracy = acc_evaluator.evaluate(dtc_predictions)
val rfc_accuracy = acc_evaluator.evaluate(rfc_predictions)
val gbt_accuracy = acc_evaluator.evaluate(gbt_predictions)

dtc_accuracy: Double = 0.8995983935742972
rfc_accuracy: Double = 0.927710843373494
gbt_accuracy: Double = 0.8955823293172691


In [44]:
println("Here are the results!")
println("-"*80)
println(f"A single decision tree had an accuracy of: ${dtc_accuracy*100}%1.2f")
println("-"*80)
println(f"A random forest ensemble had an accuracy of: ${rfc_accuracy*100}%1.2f")
println("-"*80)
println(f"An ensemble using GBT had an accuracy of: ${gbt_accuracy*100}%1.2f")

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 89.96
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 92.77
--------------------------------------------------------------------------------
An ensemble using GBT had an accuracy of: 89.56


### Converting the predictions to an RDD and evaluating with MulticlassMetrics to print the confusion matrix

In [45]:
import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.mllib.evaluation.MulticlassMetrics


In [46]:
val dtc_predictionAndLabel = dtc_predictions.select("prediction","PrivateInd").as[(Double,Double)].rdd
val rfc_predictionAndLabel = rfc_predictions.select("prediction","PrivateInd").as[(Double,Double)].rdd
val gbt_predictionAndLabel = gbt_predictions.select("prediction","PrivateInd").as[(Double,Double)].rdd

dtc_predictionAndLabel: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[719] at rdd at <console>:49
rfc_predictionAndLabel: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[723] at rdd at <console>:50
gbt_predictionAndLabel: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[727] at rdd at <console>:51


In [47]:
val dtc_metrics = new MulticlassMetrics(dtc_predictionAndLabel)
val rfc_metrics = new MulticlassMetrics(rfc_predictionAndLabel)
val gbt_metrics = new MulticlassMetrics(gbt_predictionAndLabel)

dtc_metrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@57edf2c
rfc_metrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@6e29b7a2
gbt_metrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@161101c6


### Printing the confusion matrix

In [50]:
println(dtc_metrics.confusionMatrix)

184.0  9.0   
16.0   40.0  


In [51]:
println(rfc_metrics.confusionMatrix)

186.0  7.0   
11.0   45.0  


In [52]:
println(gbt_metrics.confusionMatrix)

183.0  10.0  
16.0   40.0  


### Printing the Accuracy

In [53]:
println(s"Accuracy of DecisionTreeClassifier Model: ${dtc_metrics.accuracy}")
println()
println(s"Accuracy of RandomForestClassifier Model: ${rfc_metrics.accuracy}")
println()
println(s"Accuracy of GBTClassifier Model: ${gbt_metrics.accuracy}")

Accuracy of DecisionTreeClassifier Model: 0.8995983935742972

Accuracy of RandomForestClassifier Model: 0.927710843373494

Accuracy of GBTClassifier Model: 0.8955823293172691


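### Printing the Precision

Note that the no-argument `precision` is deprecated in Spark 2.x and simply returns `accuracy`, which is why the values below match the accuracy figures. Per-class precision is available through `precision(label)`.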

In [54]:
println(s"Precision of DecisionTreeClassifier Model: ${dtc_metrics.precision}")
println()
println(s"Precision of RandomForestClassifier Model: ${rfc_metrics.precision}")
println()
println(s"Precision of GBTClassifier Model: ${gbt_metrics.precision}")

Precision of DecisionTreeClassifier Model: 0.8995983935742972

Precision of RandomForestClassifier Model: 0.927710843373494

Precision of GBTClassifier Model: 0.8955823293172691


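### Printing the Recall

As with precision, the no-argument `recall` is deprecated in Spark 2.x and returns `accuracy`. Per-class recall is available through `recall(label)`.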

In [55]:
println(s"Recall of DecisionTreeClassifier Model: ${dtc_metrics.recall}")
println()
println(s"Recall of RandomForestClassifier Model: ${rfc_metrics.recall}")
println()
println(s"Recall of GBTClassifier Model: ${gbt_metrics.recall}")

Recall of DecisionTreeClassifier Model: 0.8995983935742972

Recall of RandomForestClassifier Model: 0.927710843373494

Recall of GBTClassifier Model: 0.8955823293172691
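The precision and recall values printed above are identical to the accuracy, so they say nothing about each class separately. As a rough illustration, the sketch below derives per-class precision and recall for the decision tree from its confusion matrix in plain Scala. `precisionFor`, `recallFor` and `dtcConfusion` are names introduced here; in Spark itself, `dtc_metrics.precision(0.0)` and `dtc_metrics.recall(1.0)` should give the corresponding per-label values.

```scala
// Decision-tree confusion matrix copied from the output above
// (rows = actual label, columns = predicted label, ordered 0.0 then 1.0).
val dtcConfusion = Array(Array(184.0, 9.0), Array(16.0, 40.0))

// Precision for a label: correct predictions of it / everything predicted as it (column sum).
def precisionFor(matrix: Array[Array[Double]], label: Int): Double =
  matrix(label)(label) / matrix.map(row => row(label)).sum

// Recall for a label: correct predictions of it / everything that actually is it (row sum).
def recallFor(matrix: Array[Array[Double]], label: Int): Double =
  matrix(label)(label) / matrix(label).sum

for (label <- dtcConfusion.indices) {
  println(f"label $label: precision = ${precisionFor(dtcConfusion, label)}%1.4f, " +
    f"recall = ${recallFor(dtcConfusion, label)}%1.4f")
}
```

Here label 0 corresponds to private colleges and label 1 to public ones, so the lower recall for label 1 reflects the 16 public colleges that the tree classified as private.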


### Closing spark session

In [56]:
spark.stop()

## Thank You!