# Private / Public University Case

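This time, we will classify a university as private or public. The dataset has the following columns, where `Private` is the label and the rest are candidate features (the CSV loaded below uses underscores in place of the dots, e.g. `F_Undergrad`):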

    Private A factor with levels No and Yes indicating whether the university is private
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate
    
We will compare three tree-based methods:
- A single decision tree
- A random forest
- A gradient boosted tree classifier

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('UnivTree').getOrCreate()

## Load the Data

In [3]:
data = spark.read.csv('datasets/College.csv', inferSchema=True, header=True)

In [4]:
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [5]:
data.head(1)

[Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)]

## Data Wrangling

As you can see, we want to predict whether a university is private or public. However, the label column `Private` is a categorical string (`Yes`/`No`), so we need to convert it to a numeric index, 0.0 or 1.0. Note that `StringIndexer` assigns index 0.0 to the most frequent label by default, so in the output further down `Yes` (private) becomes 0.0 and `No` becomes 1.0.
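
As a rough illustration of what the indexing step does, here is a plain-Python sketch of frequency-based label indexing, which is how `StringIndexer` orders labels by default (the most frequent label gets index 0.0). The helper name `frequency_index` and the toy label list are made up for this example; this is not Spark's implementation.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Hypothetical helper that mimics StringIndexer's default ordering
    (descending frequency); ties are broken alphabetically here.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Toy data (not the real College.csv): 'Yes' is the majority class.
toy_labels = ['Yes', 'Yes', 'No', 'Yes', 'No']
mapping = frequency_index(toy_labels)
indexed = [mapping[label] for label in toy_labels]
print(mapping)
print(indexed)
```

Because `Yes` is the more common value in the toy list, it maps to 0.0. The same logic would explain why an index of 0.0 can stand for "private" rather than "public" in the notebook.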

### Handle Categorical Column

In [6]:
from pyspark.ml.feature import StringIndexer

In [7]:
indexer = StringIndexer( inputCol='Private', outputCol='private_index')

In [8]:
output = indexer.fit(data).transform(data)

In [9]:
output.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- private_index: double (nullable = false)



## Feature Selection

In [10]:
from pyspark.ml.feature import VectorAssembler

In [11]:
output.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate',
 'private_index']

In [12]:
assembler = VectorAssembler( inputCols=['Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate'], outputCol='features')

In [13]:
output_final = assembler.transform(output)

In [14]:
output_final.select('features','private_index').show()

+--------------------+-------------+
|            features|private_index|
+--------------------+-------------+
|[1660.0,1232.0,72...|          0.0|
|[2186.0,1924.0,51...|          0.0|
|[1428.0,1097.0,33...|          0.0|
|[417.0,349.0,137....|          0.0|
|[193.0,146.0,55.0...|          0.0|
|[587.0,479.0,158....|          0.0|
|[353.0,340.0,103....|          0.0|
|[1899.0,1720.0,48...|          0.0|
|[1038.0,839.0,227...|          0.0|
|[582.0,498.0,172....|          0.0|
|[1732.0,1425.0,47...|          0.0|
|[2652.0,1900.0,48...|          0.0|
|[1179.0,780.0,290...|          0.0|
|[1267.0,1080.0,38...|          0.0|
|[494.0,313.0,157....|          0.0|
|[1420.0,1093.0,22...|          0.0|
|[4302.0,992.0,418...|          0.0|
|[1216.0,908.0,423...|          0.0|
|[1130.0,704.0,322...|          0.0|
|[3540.0,2001.0,10...|          1.0|
+--------------------+-------------+
only showing top 20 rows



In [15]:
final_data = output_final.select('features','private_index')

## Train/Test Split

In [16]:
train_data, test_data = final_data.randomSplit([0.7,0.3])

## Create and Train the Model

In [17]:
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier, GBTClassifier

In [18]:
from pyspark.ml import Pipeline

In [19]:
dtc = DecisionTreeClassifier( featuresCol='features', labelCol='private_index')
rfc = RandomForestClassifier( featuresCol='features', labelCol='private_index', numTrees=150)
gbt = GBTClassifier( featuresCol='features', labelCol='private_index')


In [20]:
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

In [21]:
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)

## Evaluate the Model
### Area Under the Curve of ROC

In [22]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [23]:
binary_eval = BinaryClassificationEvaluator( labelCol='private_index')

In [24]:
print('DTC')
print(binary_eval.evaluate(dtc_preds))

DTC
0.9255533199195172


In [25]:
print('RFC')
print(binary_eval.evaluate(rfc_preds))

RFC
0.9795134443021752


In [26]:
print('GBT')
print(binary_eval.evaluate(gbt_preds))

GBT
0.96218218401317
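
The values printed above are areas under the ROC curve, the default metric of `BinaryClassificationEvaluator`. One way to read that number is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python. The function `roc_auc` and the toy data are hypothetical, and this is not how Spark computes the metric internally.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    example gets the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError('need at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, not taken from the College data.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 is what random scores would give on average.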


**Notes**:
- You should generally expect a **Random Forest Classifier to outperform a single Decision Tree** on random splits like this one, since many trees voting on the classification tend to work better than just a single tree.


- **Gradient Boosting did not outperform Random Forest** here, perhaps because the default parameters that Spark provides are not well suited to this particular dataset.
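
The voting argument in the first note can be sketched in plain Python. The example below uses three made-up "trees", each wrong on a different row, and combines them by simple majority vote. The helper names and data are hypothetical, and the errors are deliberately uncorrelated; real trees tend to make correlated mistakes, so the gain is usually smaller. Spark's random forest also aggregates tree outputs in its own way, so treat this only as an intuition aid.

```python
from collections import Counter

def majority_vote(predictions_per_tree):
    """Combine per-tree predictions (one list per tree) by majority vote."""
    combined = []
    for votes in zip(*predictions_per_tree):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Toy labels and three hypothetical trees, each wrong on a different row.
labels = [0, 1, 1, 0, 1, 0]
tree_a = [1, 1, 1, 0, 1, 0]   # wrong on row 0
tree_b = [0, 0, 1, 0, 1, 0]   # wrong on row 1
tree_c = [0, 1, 0, 0, 1, 0]   # wrong on row 2

ensemble = majority_vote([tree_a, tree_b, tree_c])
single_acc = accuracy(tree_a, labels)
ensemble_acc = accuracy(ensemble, labels)
print(single_acc, ensemble_acc)
```

Each individual tree gets 5 of 6 rows right, but the vote corrects every single-tree mistake because no two trees err on the same row.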

### Accuracy

In [27]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [28]:
acc_eval = MulticlassClassificationEvaluator( labelCol='private_index', metricName='accuracy')

In [33]:
print('A Random Forest Classifier accuracy: {}'.format(acc_eval.evaluate(rfc_preds)))
print('A Decision Tree Classifier accuracy: {}'.format(acc_eval.evaluate(dtc_preds)))
print('A Gradient Boosting Classifier accuracy: {}'.format(acc_eval.evaluate(gbt_preds)))

A Random Forest Classifier accuracy: 0.92
A Decision Tree Classifier accuracy: 0.92
A Gradient Boosting Classifier accuracy: 0.9244444444444444
