# [Computational Social Science]
## 3-3 Tree-Based Ensemble Methods - Student Version

In this lab, we will explore extensions of decision trees. In particular, we will introduce ensemble machine learning, which involves combining several machine learning algorithms to create a better model.

## Virtual Environment
Remember to always activate your virtual environment before you install packages or run a notebook! This helps prevent conflicts between dependencies across different projects and ensures that you are using the correct versions of packages. You should have created an Anaconda virtual environment in the `Anaconda Installation` lab. If you have not, or if you want to create a new virtual environment, follow the instructions in the `Anaconda Installation` lab. 

<br>

If you have already created a virtual environment, you can run the following command to activate it: 

<br>

`conda activate <virtual_env_name>`

<br>

For example, if your virtual environment is named CSS, run the following command. 

<br>

`conda activate CSS`

<br>

To deactivate your virtual environment after you are done working with the lab, run the following command. 

<br>

`conda deactivate`

<br>

## Data

We're going to use our [Census Income dataset](https://archive.ics.uci.edu/dataset/20/census+income) again for this lab. Let's load the dataset.

In [None]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
#import seaborn as sns
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.metrics import make_scorer, accuracy_score, recall_score, precision_score, f1_score
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier


%matplotlib inline
#sns.set_style("darkgrid")

In [None]:
# Create a list of column names, found in "adult.names"
# ----------
col_names = ['age', 
             'workclass', 
             'fnlwgt',
             'education', 
             'education-num',
             'marital-status', 
             'occupation', 
             'relationship', 
             'race', 
             'sex', 
             'capital-gain',
             'capital-loss', 
             'hours-per-week',
             'native-country', 
             'income-bracket']

# Read table from the data folder
census = pd.read_table("../../data/adult.data", 
                       sep = ',', 
                       names = col_names)
census.head()

Remember, we need to preprocess the data to binarize the target and dummify our categorical features.

In [None]:
# Target
# ----------
lb = LabelBinarizer()
y = census['income-bracket-binary'] = lb.fit_transform(census["income-bracket"])

# Features
# ----------
X = census.drop(['income-bracket', 'fnlwgt', 'income-bracket-binary'], axis = 1)
X = pd.get_dummies(X)
X.head()
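
To make the two preprocessing steps concrete, here is a minimal sketch on a three-row toy table. The toy values and the `_toy` variable names are made up for illustration; only the column names mirror the census data.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical three-row table that mimics a few census columns
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": [" Male", " Female", " Male"],
    "income-bracket": [" <=50K", " >50K", " <=50K"],
})

# Binarize the target: the second class in sorted order is coded as 1
toy_lb = LabelBinarizer()
y_toy = toy_lb.fit_transform(toy["income-bracket"])   # 2D array with one column

# Dummify the categorical features; numeric columns pass through unchanged
X_toy = pd.get_dummies(toy.drop("income-bracket", axis=1))

print(toy_lb.classes_)
print(y_toy.ravel())            # ravel() flattens the target to 1D
print(X_toy.columns.tolist())
```

Note that the binarized target comes back as a column vector, which is why some later cells call `y.ravel()` before passing it to an estimator.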

## Decision Tree Classifier

The first model we will look at is the decision tree. Using the [`tree.DecisionTreeClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) method, let's implement a cross-validation approach to predicting income. We will initialize the model with the standard configurations from the Classification lab.

In [None]:
# Initialize a Decision Tree Classifier
# ----------
dt_classifier = tree.DecisionTreeClassifier(
                       criterion='gini',              # or 'entropy' for information gain
                       splitter='best',               # or 'random' for random best split
                       max_depth=None,                # set how deep tree nodes can go
                       min_samples_split=2,           # samples needed to split node
                       min_samples_leaf=1,            # samples needed for a leaf
                       min_weight_fraction_leaf=0.0,  # weight of samples needed for a node
                       max_features=None,             # number of features to look for when splitting
                       max_leaf_nodes=None,           # max nodes
                       min_impurity_decrease=1e-07,   # early stopping
                       random_state = 10)             #random seed
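
As a reminder of what `criterion='gini'` measures, the sketch below computes Gini impurity by hand for a few made-up label lists. The helper names `gini_impurity` and `split_impurity` are hypothetical and are not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini_impurity(parent))                    # evenly mixed node
print(split_impurity([0, 0, 0], [1, 1, 1]))     # perfectly separating split
print(split_impurity([0, 0, 1], [0, 1, 1]))     # weaker split
```

A tree that splits on Gini prefers the candidate split with the lowest weighted child impurity, so the perfectly separating split would win here.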

In [None]:
# cross_val_score returns the accuracy score by default but you can change this with the "scoring" argument
scores = cross_val_score(dt_classifier,   # specify estimator 
                         X,               # specify X
                         y,               # specify y
                         cv=5)            # number of cross validation 

In [None]:
# view the scores individually
scores

In [None]:
# take the mean score from the results of cross validation
scores.mean()
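
The `scoring` argument mentioned above can be sketched on synthetic data. This example assumes a toy dataset from `make_classification` rather than the census data, and the `_toy` names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the census data)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=10)
toy_tree = DecisionTreeClassifier(max_depth=3, random_state=10)

# Default scoring for a classifier is accuracy
acc_scores = cross_val_score(toy_tree, X_toy, y_toy, cv=5)

# Switch the metric with the "scoring" argument
f1_scores = cross_val_score(toy_tree, X_toy, y_toy, cv=5, scoring="f1")

print(acc_scores.mean(), f1_scores.mean())
```

Each call returns one score per fold, so both arrays have length 5 here.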

.82 accuracy, not bad! We can also visualize the decision tree to see how it made its splits. Note that we limit the plotted depth to 3 (`max_depth = 3` in `plot_tree`) so that the code runs quickly, but in practice you might want to visualize the entire tree.

In [None]:
# calculate the length of our feature dataframe to be able to judge splits by # of observations
len(X)

In [None]:
# fit to data
# ----------
dt_classifier.fit(X, y)

# set column names as list
# ----------
column_names = X.columns.tolist()

# plot the figure
# ----------
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(dt_classifier, 
                   feature_names=column_names,      # make sure its a list
                   class_names=["<=50k", ">50k"],   # specify class names
                   filled=True,                     # paint nodes to indicate majority class 
                   fontsize = 15,                   # set fontsize
                   max_depth = 3)                   # set max depth of tree to view
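
If the plotted tree is hard to read, a plain-text view of the splits is one possible alternative. The sketch below uses scikit-learn's `export_text` on a small synthetic dataset; the feature names `f0` to `f3` and the `_toy` variables are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic problem with four features (not the census data)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=10)
toy_names = ["f0", "f1", "f2", "f3"]

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=10).fit(X_toy, y_toy)

# Print the split rules as indented text, showing only the top levels
tree_text = export_text(toy_tree, feature_names=toy_names, max_depth=2)
print(tree_text)
```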

In [None]:
# we can use the .max_depth attribute to check out the depth of our entire tree
# ----------
dt_classifier.tree_.max_depth

In [None]:
# remind ourselves how many samples in our negative class
# ----------
np.count_nonzero(y==0)

In [None]:
# check the samples after root node
# ----------
X['marital-status_ Married-civ-spouse'].value_counts()

In [None]:
# identify the most informative features
# ----------
importances = pd.DataFrame({'feature':X.columns,'importance':np.round(dt_classifier.feature_importances_,3)})
importances = importances.sort_values('importance',
                                      ascending=False)
importances

# Ensemble Learning
Ensemble learning is a machine learning paradigm where multiple learners (also known as base or individual models) are trained to solve the same problem. The main idea behind ensemble learning is that a group of "weak learners" can come together to form a "strong learner". Each weak learner makes a prediction, and then the ensemble model makes its final prediction based on the votes or the outputs of all the weak learners.

Ensemble learning often significantly improves machine learning results by combining several models. This approach allows the production of better predictive performance compared to a single model.

## Random Forest

Next, we'll take a look at the [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Random forest is an extension of the decision tree method. Rather than construct just one tree, a random forest grows many trees, each on a bootstrap sample of the data and typically considering only a random subset of features at each split. The trees then make predictions, and the random forest takes a majority vote from the trees to determine the winner. Random forest is known as a ["bagging"](https://en.wikipedia.org/wiki/Bootstrap_aggregating) method. Fill in the code below to train a random forest using cross-validation.

In [None]:
# initialize a random forest classifier
# ----------
rf_classifier = ...

In [None]:
# specify cross-validation
# ----------
scores = cross_val_score(..., 
                         ..., 
                         ...,  # Some algorithms will expect you to ravel the target
                         ...)

In [None]:
# calculate the average score across models
# ----------
scores.mean()
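
To see what bagging does under the hood, here is a hand-rolled sketch on synthetic data: draw bootstrap samples, grow one tree per sample, and take a majority vote. It is only an illustration of the idea, not how `RandomForestClassifier` is implemented, and the `_toy` names and the choice of 25 trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the census data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

toy_trees = []
for i in range(25):  # odd number of trees, so the vote cannot tie
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    # Each tree considers a random subset of features at each split
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    t.fit(X_toy[idx], y_toy[idx])
    toy_trees.append(t)

# Collect every tree's vote, then take the majority class per observation
all_votes = np.array([t.predict(X_toy) for t in toy_trees])   # shape (25, 300)
majority = (all_votes.mean(axis=0) > 0.5).astype(int)

print("training accuracy of the vote:", (majority == y_toy).mean())
```

Because the accuracy here is measured on the training data, it is optimistic; cross-validation, as in the cells above, gives a fairer estimate.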

Although it is difficult to visualize a forest of trees, we *can* still visualize the feature importances. Use the code below to look at the top 10 most important features. 

**QUESTION:** What do you notice? Do you think we actually need a large feature space?

In [None]:
# fit the random forest on data to get feature importance
# ----------
rf_classifier.fit(X, y.ravel())

# import library
import seaborn as sns

# create feature importance dataframe
feat_importances = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(rf_classifier.feature_importances_))], axis = 1)
feat_importances.columns = ["Feature", "Importance"]

# plot
sns.barplot(x = "Importance", 
            y = "Feature",  
            data = feat_importances.nlargest(10, 'Importance')) # identify the 10 most important features
plt.show()

**ANSWER:** ...

We see that only a handful of features are contributing a lot to the model. We could probably simplify the decision-making considerably. Try training a new decision tree with max depth 5 and only use the 10 most important features.

In [None]:
# refit a basic decision tree using reduced number of features 
# ----------
dt_reduced_classifier = ...

# pull out the most important features
important_features = feat_importances.nlargest(10, 'Importance')['Feature']

# create new dataset with only most important features
X_reduced = X[X.columns[X.columns.isin(important_features)]]

# fit the model on the new reduced model
dt_reduced_classifier.fit(..., 
                          ...)

# fit the model on the new reduced model
dt_reduced_classifier.fit(X_reduced,
                          y)
# set column names as list
reduced_column_names = X_reduced.columns.tolist()
fig = plt.figure(figsize=(45,40))
_ = tree.plot_tree(decision_tree = ...,
                   feature_names=reduced_column_names,  # make sure its a list
                   class_names=...,                     # specify class names
                   filled=...,                          # paint nodes to indicate majority class 
                   fontsize = 25,                       # set fontsize
                   max_depth = 3)                       # set max depth of tree to view      

Looks a lot more interpretable than a random forest! How did we do on accuracy?

In [None]:
# calculate accuracy using cross validation
# ----------
scores = cross_val_score(..., 
                         ..., 
                         ..., 
                         ...)

# find the mean score across models
scores.mean()

Almost .85! Slightly better than the whole random forest and better than our original decision tree. Growing a random forest and then simplifying down to a more basic decision tree is the basic procedure recommended by the [select-regress-round](https://arxiv.org/pdf/1702.04690.pdf) framework.

**Question**: Why did a simplified decision tree get better accuracy than the first one we ran?

**Answer**:

## Adaptive Boosting

The other approach for ensembling decision trees is called ["boosting"](https://en.wikipedia.org/wiki/Boosting_(machine_learning)). Whereas random forests grow many decision trees independently and take a vote from them, boosting algorithms train weak learners sequentially: each new learner concentrates on the observations that the previous learners got wrong, and the learners are then combined into a single strong classifier. Because the weak learners are trained on the errors, boosting algorithms are well suited to making classifications in difficult cases. Try filling in the code below to train an [`AdaBoostClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html).
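
To see what training on the errors can mean in practice, here is a minimal NumPy sketch of one reweighting round in the classic discrete AdaBoost algorithm, assuming binary labels coded as -1/+1. The toy labels and predictions are made up for illustration, and scikit-learn's `AdaBoostClassifier` may differ in its implementation details.

```python
import numpy as np

# Hypothetical true labels and one weak learner's predictions (coded -1/+1)
y_toy = np.array([1, 1, -1, -1, 1])
pred_toy = np.array([1, 1, -1, 1, 1])     # the fourth observation is misclassified

# Start with equal weights on every observation
weights = np.full(len(y_toy), 1 / len(y_toy))

# Weighted error rate of the weak learner
miss = pred_toy != y_toy
err = weights[miss].sum()

# Learner weight: larger when the weighted error is smaller
alpha = 0.5 * np.log((1 - err) / err)

# Up-weight mistakes, down-weight correct predictions, then renormalize
new_weights = weights * np.exp(-alpha * y_toy * pred_toy)
new_weights /= new_weights.sum()

print(alpha, new_weights)
```

After the update, the single misclassified observation carries half of the total weight, so the next weak learner is pushed to classify it correctly.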

In [None]:
# initialize an adaptive boosting classifier 
# ----------
ada_classifier = ...   # specify 100 estimators

In [None]:
# calculate accuracy using cross validation
# ----------
scores = ...          # specify 5-fold cross-validation 

**QUESTION:** How does adaptive boosting compare to the random forest, the reduced feature tree, and the basic tree?

In [None]:
# calculate accuracy using cross validation
# ----------
scores.mean()

**ANSWER:**

**QUESTION:** What are some pros and cons of adaptive boosting?

**ANSWER:** 

Here is a [link to a tutorial](https://medium.com/@chaudhurysrijani/tuning-of-adaboost-with-computational-complexity-8727d01a9d20) on `AdaBoost()` that uses a data visualization workflow (box-plots) to visually compare model accuracy of different hyperparameters. This particular workflow relies on user-written functions to create the dataframes necessary for visualization. 

So, while it is not as code-efficient as using `GridSearchCV()`, it could be helpful for understanding differences in model accuracy across a particular hyperparameter, and it could be a workflow you might want to use to illustrate a point in a paper. 

## XGBoost

[XGBoost](https://xgboost.readthedocs.io/en/stable/) also trains weak learners sequentially rather than growing independent trees as a random forest does. 

One key difference from `AdaBoost()` is that XGBoost uses gradient descent to minimize a loss function and improve fit, whereas `AdaBoost()` assigns larger weights to incorrectly classified observations so that subsequent models focus on classifying those observations. 
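
The gradient-descent idea can be sketched with plain gradient boosting for regression under squared-error loss, where each new tree is fit to the current residuals (the negative gradient of the loss). This is a simplified illustration on made-up data, not XGBoost itself, which adds regularization and other refinements; the learning rate, tree depth, and number of rounds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up one-dimensional regression data: a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())   # start from a constant model
losses = []

for _ in range(50):
    # For squared-error loss, the negative gradient is just the residual
    residuals = y_toy - prediction
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_tree.fit(X_toy, residuals)
    # Take a small step in the direction the weak learner suggests
    prediction += learning_rate * weak_tree.predict(X_toy)
    losses.append(np.mean((y_toy - prediction) ** 2))

print(losses[0], losses[-1])
```

The training loss shrinks as rounds are added, which is the sense in which boosting "descends" the loss function.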

Here is a [helpful explainer](https://medium.com/@thedatabeast/adaboost-gradient-boosting-xg-boost-similarities-differences-516874d644c6#:~:text=AdaBoost%20is%20generally%20slower%20than,explicit%20imputation%20of%20missing%20values.) on the similarities and differences between these models.

In [None]:
# you will likely have to install xgboost using the command-line prompt below
# !pip install xgboost
import xgboost as xgb

In [None]:
# initialize an XGBoost classifier 
# ----------
xgb_classifier = xgb.XGBClassifier(random_state=...)


# define the scoring metrics
scoring = {
          'accuracy': make_scorer(...),
          'recall': make_scorer(...),
          'precision': make_scorer(...),
          'f1': make_scorer(...)
           }

# perform cross-validation with 5-fold and return the trained estimators
cv_results = cross_validate(...,                   # specify estimator 
                            X,                     # specify features
                            y.ravel(),             # specify outcome, and use ravel
                            cv=...,                # specify 5-fold cross validation
                            return_estimator=True, # return the estimators fitted at each split
                            scoring=...)           # which scoring metrics to return (the whole dictionary in this case)
 

# print the results for accuracy, recall, precision, and F1 score
for metric in ['test_accuracy', 'test_recall', 'test_precision', 'test_f1']:
    scores = cv_results[metric]
    print(f"{metric[5:]}: {scores.mean():.3f}")
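
As a reminder of what the four metrics measure, here is a small hand computation from confusion-matrix counts. The label arrays are hypothetical and unrelated to the census data.

```python
import numpy as np

# Made-up true labels and predictions for eight observations
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 0, 0, 1, 0, 0, 1])

# Confusion-matrix counts
tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))   # true positives
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))   # false negatives
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1))   # false positives
tn = np.sum((y_true_toy == 0) & (y_pred_toy == 0))   # true negatives

accuracy = (tp + tn) / len(y_true_toy)    # share of all predictions that are correct
precision = tp / (tp + fp)                # share of predicted positives that are right
recall = tp / (tp + fn)                   # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy: {accuracy:.3f}, precision: {precision:.3f}, "
      f"recall: {recall:.3f}, f1: {f1:.3f}")
```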


In [None]:
# Let's look at the feature importance
# ----------

# Initialize an array to hold the feature importances
importances = np.zeros(X.shape[1])

# Average the feature importances over the folds
for estimator in cv_results['estimator']:
    importances += estimator.feature_importances_
    
# Divide by the number of folds
importances /= ...  

# Create a DataFrame for visualization
feature_importance = pd.DataFrame({'feature': X.columns, 
                                   'importance': importances})

# Sort the features by importance
feature_importance = feature_importance.sort_values('importance', ascending=False)

# Take the top 10 features
feature_importance = feature_importance.head(10)

# Plot the feature importances
plt.figure(figsize=(10, 6))

# horizontal bar plot
plt.barh(feature_importance['feature'], 
         feature_importance['importance'], 
         color='maroon', 
         align='center')

# labels
plt.xlabel('Importance')
plt.ylabel('Features')
plt.title('Feature Importance')

# gca stands for "get current axes", which allows you to modify the properties of the axes.
# and then inverts the y-axis, meaning that the values that were at the bottom will now be at the top, and vice versa.
plt.gca().invert_yaxis() 
plt.show()

## Ensemble Learning Beyond Trees

You can also create ensembles with algorithms beyond decision trees. scikit-learn's ensemble module contains several different options for training ensemble models. Here, we will focus on the [`VotingClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) method. A voting classifier works in a similar fashion to random forest. However, **instead of taking a majority vote of decision trees, it takes a majority vote of various algorithms.** 

The voting can be **"hard"**, which means the ensemble uses a majority vote of the predicted classes, or **"soft"**, which means the ensemble averages the predicted class probabilities across models and predicts the class with the highest average probability. 
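
The difference between the two voting rules can be sketched with a few made-up probabilities. The array below is hypothetical: each row is one model's predicted probability of the positive class, and each column is one observation.

```python
import numpy as np

# Rows = three hypothetical models, columns = three observations
toy_probs = np.array([[0.90, 0.45, 0.20],
                      [0.80, 0.45, 0.40],
                      [0.70, 0.90, 0.10]])

# Hard voting: each model casts a class vote, and the majority wins
class_votes = (toy_probs > 0.5).astype(int)
hard_pred = (class_votes.sum(axis=0) > toy_probs.shape[0] / 2).astype(int)

# Soft voting: average the probabilities, then threshold the average
soft_pred = (toy_probs.mean(axis=0) > 0.5).astype(int)

print("hard:", hard_pred)
print("soft:", soft_pred)
```

The two rules disagree on the second observation: two models lean slightly negative, but the third is confident enough to pull the average probability above 0.5.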

Run the code below to initialize a logistic regression, a random forest, an AdaBoost model, and an XGBoost model. Pass all four of these into the `VotingClassifier` to train an ensemble model, and check out their accuracy scores.

*Note that this may take a minute.*

In [None]:
# Logistic Regression - using liblinear solver
# ----------
log_reg = LogisticRegression(random_state = 10, 
                             solver='liblinear')

# Random Forest
# ----------
rf_classifier = RandomForestClassifier(
                       criterion='gini',              # you can also use 'entropy' for information gain
                       max_depth=None,                # how deep tree nodes can go
                       min_samples_split=2,           # samples needed to split node
                       min_samples_leaf=1,            # samples needed for a leaf
                       min_weight_fraction_leaf=0.0,  # weight of samples needed for a node
                       max_features=None,             # number of features to look for when splitting
                       max_leaf_nodes=None,           # max nodes
                       min_impurity_decrease=1e-07,   # early stopping
                       random_state = 10)             # random seed

# AdaBoost
# ----------
ada_classifier = AdaBoostClassifier(n_estimators=100)

# XGBoost
# ----------
xgb_classifier = xgb.XGBClassifier(random_state=10)

# specify voting classifiers
# ----------
voting_classifier = VotingClassifier(
                        # specify estimators to use
                        estimators = [('lr', log_reg),
                                     ('rf', rf_classifier),
                                     ('ada', ada_classifier),
                                     ('xgb', xgb_classifier)],
                        # specify voting
                        voting = 'hard')

# loop through each model to report accuracy
# ----------
for clf, label in zip([log_reg, 
                       rf_classifier, 
                       ada_classifier, 
                       xgb_classifier,
                       voting_classifier], ['Logistic Regression', 
                                            'Random Forest', 
                                            'Ada Boost',
                                            'XGBoost',
                                            'Ensemble']):
    scores = cross_val_score(clf, 
                             X, 
                             y.ravel(),
                             scoring='accuracy', 
                             cv=5)
    print('Accuracy: %0.2f [%s]' % (scores.mean(), label))

**QUESTION:** How did the ensemble do? 

**ANSWER:** ...

Next, try to use a soft voting classifier to get the predicted probabilities for each prediction. Try using the `predict_proba()` method to get the predicted probabilities.

In [None]:
# specify a "soft" voting classifier in this iteration
# ----------
voting_classifier = VotingClassifier(
                        # specify estimators to use
                        estimators = [('lr', log_reg),
                                     ('rf', rf_classifier),
                                     ('ada', ada_classifier), 
                                     ('xgb', xgb_classifier)],
                        # specify voting 
                        voting = 'soft')

# fit each classifier "c" to the data, predict the positive-class probability from that classifier, and store the results as "probas"
probas = [c.fit(X, y.ravel()).predict_proba(X)[:,1] for c in (log_reg, 
                                                              rf_classifier,
                                                              ada_classifier,
                                                              xgb_classifier,
                                                              voting_classifier)]

Let's put our predicted probabilities into a dataframe so we can visualize them.

In [None]:
# create a dataset from the predicted probabilities
# ----------
probs_df = pd.DataFrame.from_records(probas).T # pulls the list of "probas" and stores as dataframe
probs_df.rename(columns = {0: 'logit',
                           1: 'rf',
                           2: 'ada',
                           3: 'xgb',
                           4: 'ensemble'}, 
                inplace = True)

# view the first few observations
# ----------
probs_df.head(10)

In [None]:
# visualize distributions
# ----------
# set figure parameters
fig = plt.figure(figsize=(15, 10))
fig.subplots_adjust(hspace=0.4, wspace=0.4)

# logit
ax = fig.add_subplot(2, 3, 1)
sns.histplot(probs_df, x="logit", ax=ax)

# random forest
ax = fig.add_subplot(2, 3, 2)
sns.histplot(probs_df, x="rf", ax=ax)

# adaptive boosting
ax = fig.add_subplot(2, 3, 3)
sns.histplot(probs_df, x="ada", ax=ax)

# xgboost
ax = fig.add_subplot(2, 3, 4)
sns.histplot(probs_df, x="xgb", ax=ax)

# ensemble
ax = fig.add_subplot(2, 3, 5)
sns.histplot(probs_df, x="ensemble", ax=ax)

# show plot 
plt.show()


**QUESTION**: Do you notice something about the distribution of the predicted probabilities? Can you explain the output of `AdaBoostClassifier`?

**ANSWER**: 

---
Authored by Aniket Kesari. Minor edits by Tom van Nuenen 2022. Notation and note about XGBoost added by Kasey Zapatka in Fall 2023.