# Tree Methods Code Along

This is the code along notebook from the course with my notes.

* A single decision tree

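    A decision tree splits the data starting at the root and continues down to the leaves, which hold the predicted outcome. At each node the tree picks the feature (and threshold) that gives the best split, as measured by an impurity criterion such as entropy: the best split is the one with the highest information gain, i.e. the largest drop in entropy from parent to children. (Spark's `DecisionTreeClassifier` uses Gini impurity by default; entropy can be selected with `impurity='entropy'`.)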
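
To make the entropy and information gain idea concrete, here is a minimal sketch in plain Python. The helper names `entropy` and `information_gain` and the toy labels are my own, made up for illustration; this is not how Spark implements its split search.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


# Toy labels, invented for illustration (not taken from the College data).
parent = ["Yes", "Yes", "Yes", "No", "No", "No"]

# A split that separates the classes completely removes all the entropy.
perfect_gain = information_gain(parent, ["Yes", "Yes", "Yes"], ["No", "No", "No"])

# A split whose children have the same class mix as the parent gains nothing.
useless_gain = information_gain(parent, ["Yes", "No"], ["Yes", "Yes", "No", "No"])

print(perfect_gain, useless_gain)
```

A tree builder would evaluate this gain for every candidate split and keep the one with the largest value.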

* A random forest

    **How does it work?** A random forest trains many trees, each on a bootstrap sample of the rows, and considers only a random subset of the features at every split of every tree. For classification the trees vote on the class of a sample; for regression their predictions are averaged.

    **Why do we need random selection of features?** If the dataset has one strong feature, most trees will use it as the top split, so the trees end up similar and highly correlated. Averaging highly correlated trees reduces variance very little. Leaving out random features at each split decorrelates the trees, so averaging them does reduce variance.

    **When to use:**
    - the dataset is large and noisy
    - you want a fast and simple training process
    - you need robustness to overfitting

    **Advantages:**
    - Provides stable performance across a wide range of datasets; less sensitive to noisy data
    - Less prone to overfitting than gradient boosting
    - Easier to tune than gradient boosting

    **Disadvantages:**
    - May not reach the same accuracy as gradient boosting
    - May require more computational resources than gradient boosting on large datasets
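
The variance argument for random feature selection can be made precise with a standard result. The symbols are my own notation, and the setup is simplified: assume $B$ trees $T_1, \dots, T_B$ that are identically distributed, each with variance $\sigma^2$, and with pairwise correlation $\rho$. Then the variance of their average is

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. The first term, $\rho\,\sigma^2$, stays no matter how many trees are averaged, which is why lowering $\rho$ through random feature selection matters.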

* A gradient boosted tree classifier

    **Elements:**
    1. A loss function to be optimized: for regression it might be squared error, for classification logarithmic loss.
    2. A decision tree used as the weak learner. It is common to constrain the weak learner, e.g. by capping the number of levels, nodes, splits, or leaf nodes.
    3. An additive model that adds weak learners (trees) one at a time without changing the trees already in the model. A gradient descent procedure minimizes the loss as each tree is added.

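    **Steps** (the reweighting view of boosting, as in AdaBoost):
    1. Train a weak model M using samples drawn according to some weight distribution.
    2. Increase the weight of samples that M misclassified and decrease the weight of samples it got right.
    3. Train the next weak model with the updated weights, so that it focuses on the data that was hard to learn in the previous round. The resulting models are good at different parts of the training data.

    Gradient boosting generalizes this idea: instead of reweighting samples, each new tree is fit to the negative gradient of the loss with respect to the current predictions (for squared error, the residuals).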

    **When to use:**
    - accuracy is paramount
    - the dataset is relatively small and clean
    - a custom loss function or regularization technique is needed
    - a more interpretable ensemble is needed

    **Advantages:**
    - Often higher predictive accuracy than random forests when well tuned
    - Can capture complex patterns in the data

    **Disadvantages:**
    - More sensitive to noisy data and more prone to overfitting than random forests, especially with complex models
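
The additive, one-tree-at-a-time idea behind gradient boosting can be sketched in plain Python. This is a toy regression version with squared error and one-split "stumps" as weak learners. It is not Spark's `GBTClassifier` implementation, and every name here (`fit_stump`, `gradient_boost`, `mse`, the toy data) is made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # splitting at the max would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Additive model: start from the mean, then add stumps fit to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_trees):
        # For squared error the negative gradient is the residual y - F(x),
        # up to a constant factor.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))  # earlier stumps are never modified
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on the data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

baseline = mse(lambda x: sum(ys) / len(ys), xs, ys)  # always predict the mean
boosted = mse(gradient_boost(xs, ys), xs, ys)
print(baseline, boosted)
```

The small `learning_rate` is one of the usual ways to constrain the model: each stump only corrects part of the remaining error, so many weak learners share the work.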
    
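We use the College dataset to classify colleges as private or public based on the features below. `Private` is the label; the rest are features. The CSV loaded below names its columns with underscores instead of dots (e.g. `F_Undergrad` rather than `F.Undergrad`).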

    Private A factor with levels No and Yes indicating private or public university
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate

In [None]:
#Tree methods Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('treecode').getOrCreate()

In [80]:
# Load training data
data = spark.read.csv('College.csv',inferSchema=True,header=True)

In [81]:
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [82]:
data.head()

Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)

### Spark Formatting of Data

In [83]:
# A few things we need to do before Spark can accept the data!
# It needs to be in the form of two columns
# ("label","features")

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [84]:
data.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

In [85]:
assembler = VectorAssembler(
  inputCols=['Apps',
             'Accept',
             'Enroll',
             'Top10perc',
             'Top25perc',
             'F_Undergrad',
             'P_Undergrad',
             'Outstate',
             'Room_Board',
             'Books',
             'Personal',
             'PhD',
             'Terminal',
             'S_F_Ratio',
             'perc_alumni',
             'Expend',
             'Grad_Rate'],
  outputCol="features")

In [86]:
output = assembler.transform(data)

The `Private` column holds the strings "Yes" and "No", but Spark needs a numeric label, so we index it with `StringIndexer`:

In [87]:
from pyspark.ml.feature import StringIndexer

In [88]:
indexer = StringIndexer(inputCol="Private", outputCol="PrivateIndex")
output_fixed = indexer.fit(output).transform(output)
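
Which string ends up as `0.0` matters when reading the predictions later. By default `StringIndexer` orders labels by descending frequency, so the most common value gets index `0.0`. The sketch below mimics that rule in plain Python; `fit_string_index` and the toy column are hypothetical, not part of the Spark API.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy column, invented for illustration.
private = ["Yes", "No", "Yes", "Yes", "No"]
mapping = fit_string_index(private)
indexed = [mapping[v] for v in private]

print(mapping)  # {'Yes': 0.0, 'No': 1.0}
print(indexed)  # [0.0, 1.0, 0.0, 0.0, 1.0]
```

In the notebook itself, the fitted model returned by `indexer.fit(output)` exposes the learned ordering through its `labels` attribute, which is a quick way to confirm the mapping on the real data.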

In [89]:
final_data = output_fixed.select("features",'PrivateIndex')

In [90]:
train_data,test_data = final_data.randomSplit([0.7,0.3])
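
Two things about this split are easy to miss. As far as I understand it, `randomSplit` assigns rows at random, so the 70/30 proportions are approximate. And without a seed, every run produces a different split, so the accuracies further down will vary; `randomSplit` accepts an optional `seed` argument for that. The plain-Python sketch below illustrates both points. `random_split` is a hypothetical stand-in, not Spark's implementation.

```python
import random


def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row to train or test independently, so split sizes are only approximate."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for DataFrame rows

# The same seed gives the same split; the sizes are close to 700/300 but need not be exact.
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)
print(len(train_a), len(test_a), train_a == train_b)
```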

### The Classifiers

In [91]:
from pyspark.ml.classification import DecisionTreeClassifier,GBTClassifier,RandomForestClassifier
from pyspark.ml import Pipeline

Create all three models:

In [116]:
# Use mostly defaults to make this comparison "fair"

dtc = DecisionTreeClassifier(labelCol='PrivateIndex',featuresCol='features')
rfc = RandomForestClassifier(labelCol='PrivateIndex',featuresCol='features')
gbt = GBTClassifier(labelCol='PrivateIndex',featuresCol='features')

Train all three models:

In [117]:
# Train the models (it's three models, so it might take some time)
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

## Model Comparison

Let's compare each of these models!

In [118]:
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

**Evaluation Metrics:**

In [119]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [120]:
# Compare the prediction with the true label and compute test accuracy
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateIndex", predictionCol="prediction", metricName="accuracy")

In [121]:
dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)
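
For reference, the `accuracy` metric is just the fraction of test rows whose predicted class equals the true class. Here is a minimal plain-Python sketch; the `accuracy` helper and the toy values are made up for illustration.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Toy values, invented for illustration: 4 of 5 predictions match.
labels = [0.0, 0.0, 1.0, 1.0, 0.0]
predictions = [0.0, 1.0, 1.0, 1.0, 0.0]
print(accuracy(labels, predictions))  # 0.8
```

Accuracy alone can look good when one class dominates, so a second metric is often worth checking. Spark's `BinaryClassificationEvaluator` with `metricName="areaUnderROC"` is one option.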

In [122]:
print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('-'*80)
print('An ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 91.53%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 91.53%
--------------------------------------------------------------------------------
An ensemble using GBT had an accuracy of: 91.95%


Interesting! Optional assignment: play around with the parameters of each of these models. Can you squeeze some more accuracy out of them, or is the data the limiting factor?
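
Before tuning anything, it helps to see how small the gap between the models is. The notebook never prints the size of the test set, so the 236 below is an assumption: it is a test-set size that happens to reproduce the printed percentages exactly.

```python
TEST_ROWS = 236  # hypothetical test-set size; the notebook does not print the real count

lower = 216 / TEST_ROWS   # 216 rows classified correctly
higher = 217 / TEST_ROWS  # one more row classified correctly

print('{0:2.2f}% vs {1:2.2f}%'.format(lower * 100, higher * 100))  # 91.53% vs 91.95%
```

Under that assumption, GBT beat the other two models by a single test row. Since the split is also unseeded, a difference this small says little about which model is better.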