# [Computational Social Science]
## 3-2 Tree-Based Methods - Student Version

In this lab, we will explore decision trees and their extensions. In the next lab, we will introduce ensemble machine learning, which combines several machine learning algorithms to create a better model.

## Virtual Environment
Remember to always activate your virtual environment before you install packages or run a notebook! This helps prevent conflicts between dependencies across different projects and ensures that you are using the correct versions of packages. You should have created an Anaconda virtual environment in the `Anaconda Installation` lab. If you have not, or if you want to create a new virtual environment, follow the instructions in the `Anaconda Installation` lab.

<br>

If you have already created a virtual environment, you can run the following command to activate it:

<br>

`conda activate <virtual_env_name>`

<br>

For example, if your virtual environment is named CSS, run the following command.

<br>

`conda activate CSS`

<br>

To deactivate your virtual environment when you are done working on the lab, run the following command.

<br>

`conda deactivate`

<br>

## Data

We're going to use our [Census Income dataset](https://archive.ics.uci.edu/dataset/20/census+income) again for this lab. Let's load the dataset.

In [None]:
# import libraries 
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
#import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

# settings
%matplotlib inline
#sns.set_style("darkgrid")

In [None]:
# Create a list of column names, found in "adult.names"
col_names = ['age', 
             'workclass', 
             'fnlwgt',
             'education', 
             'education-num',
             'marital-status', 
             'occupation', 
             'relationship', 
             'race', 
             'sex', 
             'capital-gain',
             'capital-loss', 
             'hours-per-week',
             'native-country', 
             'income-bracket']

# Read table from the data folder
census = pd.read_table("../../data/adult.data", 
                       sep = ',', 
                       names = ...)
census.head()

Remember, we need to preprocess the data to binarize the target and dummify our categorical features.

In [None]:
# Target
# ----------
# initialize the label binarizer and store a binary version of the outcome variable as "y"
lb_style = ...
y = census['income-bracket-binary'] = ...

# Features 
# ----------
# drop 3 variables: income-bracket, fnlwgt, and income-bracket-binary
X = census.drop(..., 
                axis = ...)
# get dummies
X = ...
X.head()
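
To make the two preprocessing steps concrete, here is a minimal sketch on a tiny made-up DataFrame. The `toy` data and the names `toy_X` and `toy_y` are assumptions for illustration only; they are not the census data and not the completed exercise above. The sketch uses `LabelBinarizer` to turn a two-class string target into 0/1 and `pd.get_dummies` to turn a categorical feature into indicator columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy data that mimics a few census columns (not the real dataset)
toy = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "sex": [" Male", " Female", " Female", " Male"],
    "income-bracket": [" <=50K", " >50K", " >50K", " <=50K"],
})

# Binarize the target: classes are sorted, so " <=50K" maps to 0 and " >50K" to 1
lb = LabelBinarizer()
toy_y = lb.fit_transform(toy["income-bracket"]).ravel()  # flatten (n, 1) to (n,)

# Dummify the features: each category becomes its own indicator column,
# while numeric columns such as "age" pass through unchanged
toy_X = pd.get_dummies(toy.drop(columns=["income-bracket"]))

print(lb.classes_)
print(toy_y)
print(toy_X.columns.tolist())
```

Note that the category values keep their leading space (for example `sex_ Female`), which is why dummy column names in this dataset contain a space after the underscore.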

## Decision Tree Classifier

The first model we will look at is the decision tree. Using the [`tree.DecisionTreeClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class, let's use cross-validation to evaluate how well a decision tree predicts income. We will initialize the model with the standard configuration from the Classification lab.

In [None]:
# Initialize a Decision Tree Classifier
# ----------
dt_classifier = tree.DecisionTreeClassifier(
                       criterion='gini',              # or 'entropy' for information gain
                       splitter='best',               # or 'random' to pick the best of randomly drawn splits
                       max_depth=None,                # maximum depth of the tree (None = no limit)
                       min_samples_split=2,           # samples (observations) needed to split a node
                       min_samples_leaf=1,            # samples (observations) needed for a leaf
                       min_weight_fraction_leaf=0.0,  # fraction of total sample weight needed for a leaf
                       max_features=None,             # number of features to consider when splitting
                       max_leaf_nodes=None,           # maximum number of leaf nodes (None = no limit)
                       min_impurity_decrease=1e-07,   # split only if impurity drops by at least this much
                       random_state = 10)             # random seed
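
The `criterion='gini'` setting refers to Gini impurity, which measures how mixed the classes in a node are. The sketch below is a hand-rolled illustration, not scikit-learn's internal code; the function names `gini_impurity` and `weighted_gini` are hypothetical helpers introduced here. It computes the impurity of a node and the size-weighted impurity of a candidate split.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def weighted_gini(left_labels, right_labels):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n_left, n_right = len(left_labels), len(right_labels)
    n_total = n_left + n_right
    return ((n_left / n_total) * gini_impurity(left_labels)
            + (n_right / n_total) * gini_impurity(right_labels))

parent = [0, 0, 0, 1, 1, 1]
print(gini_impurity(parent))                # evenly mixed two-class node
print(weighted_gini([0, 0, 0], [1, 1, 1]))  # split that separates the classes perfectly
print(weighted_gini([0, 0, 1], [0, 1, 1]))  # split that separates them only partly
```

The evenly mixed parent has impurity 0.5 and the perfect split has impurity 0. A tree grown with `splitter='best'` prefers the split with the largest drop from parent impurity to weighted child impurity.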

In [None]:
# for a classifier, cross_val_score returns accuracy by default, but you can change this with the "scoring" argument
scores = cross_val_score(...,             # specify estimator 
                         X,               # specify X
                         y,               # specify y
                         ...)             # specify 5 cross-validation folds
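
As a rough illustration of what `cross_val_score` returns, here is a minimal sketch on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, `demo_tree`, and `demo_scores` are assumptions for illustration; this is a stand-in, not the census exercise above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the census data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=10)

demo_tree = DecisionTreeClassifier(random_state=10)

# With k folds, the model is fit k times; each fold is held out once for scoring
demo_scores = cross_val_score(demo_tree, X_demo, y_demo, cv=5, scoring="accuracy")

print(demo_scores)         # one accuracy value per fold
print(demo_scores.mean())  # average accuracy across folds
```

The result is a NumPy array with one score per fold, which is why the mean is usually reported alongside the individual scores.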

In [None]:
# view accuracy
...

In [None]:
# Take the mean accuracy score from the results of cross validation
scores.mean()

.82 accuracy, not bad! We can also visualize the decision tree to see how it made its splits. Note that we limit the plotted depth to 3 (`max_depth = 3`) so that the figure renders quickly, but in practice you might want to visualize the entire tree.

In [None]:
# get the shape of the data
X.shape

In [None]:
# fit to data
# ----------
dt_classifier.fit(X, y)

# set column names as list
column_names = X.columns.tolist()

# plot the figure
fig = plt.figure(figsize=(20,20))
_ = tree.plot_tree(dt_classifier,  
                   feature_names=column_names,      # make sure it's a list
                   class_names=["<=50k", ">50k"],   # specify class names
                   filled=True,                     # paint nodes to indicate majority class 
                   fontsize = 15,                   # set fontsize
                   max_depth = 3)                   # set max depth of tree to view
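
If the plotted tree is hard to read, a text rendering of the same split rules can help. The sketch below uses `sklearn.tree.export_text` on the built-in iris data, which is a stand-in chosen for illustration rather than the census data; `iris_tree` and `rules` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in example data (an assumption for illustration, not the census data)
iris = load_iris()
iris_tree = DecisionTreeClassifier(random_state=10).fit(iris.data, iris.target)

# Render the split rules as indented text, truncated below depth 2
rules = export_text(iris_tree,
                    feature_names=list(iris.feature_names),
                    max_depth=2)
print(rules)
```

Each indented line is a split condition, and lines beginning with `class:` mark leaves. As with `plot_tree`, `max_depth` here only limits what is displayed, not how deep the fitted tree is.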

In [None]:
# We can use the tree_.max_depth attribute to check the depth of our entire fitted tree
dt_classifier.tree_.max_depth

In [None]:
# Remind ourselves how many samples are in our negative class
np.count_nonzero(y==0)

In [None]:
# Check how the samples divide on the feature used at the root node
X['marital-status_ Married-civ-spouse'].value_counts()
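
The link between a split on a dummy variable and its `value_counts()` can also be checked directly from the fitted tree. The sketch below is an illustration on made-up data: the columns `married` and `noise`, and the names `root_X`, `root_y`, and `root_tree`, are hypothetical. It reads the root node's feature and threshold from the `tree_` attribute and compares the child node sizes with the counts of that feature.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one informative dummy variable and one noise variable
rng = np.random.default_rng(10)
root_X = pd.DataFrame({
    "married": rng.integers(0, 2, size=200),
    "noise": rng.integers(0, 2, size=200),
})
# Made-up outcome that depends mostly on "married"
root_y = ((root_X["married"] == 1) & (rng.random(200) < 0.8)).astype(int)

root_tree = DecisionTreeClassifier(random_state=10).fit(root_X, root_y)

# Node 0 is the root; tree_ stores each node's feature index and threshold
root_feature = root_X.columns[root_tree.tree_.feature[0]]
root_threshold = root_tree.tree_.threshold[0]
left_child = root_tree.tree_.children_left[0]
right_child = root_tree.tree_.children_right[0]

# Samples with feature <= threshold go left, the rest go right
n_left = root_tree.tree_.n_node_samples[left_child]
n_right = root_tree.tree_.n_node_samples[right_child]

print(root_feature, root_threshold)
print(n_left, n_right)
print(root_X[root_feature].value_counts())
```

For a 0/1 dummy variable the threshold is 0.5, so the left child holds the rows where the dummy is 0 and the right child holds the rows where it is 1.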

In [None]:
# Getting the most informative features
# ----------
importances = pd.DataFrame({'feature':X.columns,'importance':np.round(dt_classifier.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)
importances
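
When reading this table, it helps to know that scikit-learn's impurity-based importances are normalized: they are non-negative and sum to 1 across features. They describe how much a feature's splits reduce impurity, not the direction of its relationship with the outcome. The sketch below checks this on synthetic data; the data and the names `X_imp`, `y_imp`, `imp_tree`, and `feature_names` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the census data)
X_imp, y_imp = make_classification(n_samples=500, n_features=6,
                                   n_informative=3, random_state=10)
feature_names = [f"feature_{i}" for i in range(X_imp.shape[1])]  # hypothetical names

imp_tree = DecisionTreeClassifier(random_state=10).fit(X_imp, y_imp)

# Pair each importance with its feature name and sort from most to least important
imp_series = pd.Series(imp_tree.feature_importances_,
                       index=feature_names).sort_values(ascending=False)

print(imp_series)
print(imp_series.sum())  # importances are normalized across features
```

Because the values are shares of the total impurity reduction, a feature with importance 0 was never used in a split.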

---
Authored by Aniket Kesari. Minor edits by Tom van Nuenen 2022 and Kasey Zapatka in 2023.