# Boosting Project - Census Data

In this project, we will be using a dataset containing census information from UCI’s Machine Learning Repository.

By using this census data with boosting algorithms, we will try to predict whether or not a person earns more than $50,000 per year.

The original data set is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/census+income

### What is Boosting?

Boosting is an ensemble learning technique that combines simple, weak learners that tend to suffer from high bias. The base models are often decision trees with only one level, also known as decision stumps. A decision stump makes its decision based on a single feature, so on its own it underfits the data substantially.

Base models are referred to as "weak learners" because they tend to have high bias or high variance and, on average, perform only slightly better than random guessing.

Weak learners are used because they keep the ensemble computationally efficient. Combining strong learners doesn’t necessarily make the final ensemble model perform better, so it makes sense to choose learners that have a lower computational cost.

Boosting is a sequential learning technique where each base model builds on the previous one. Each subsequent model aims to improve the performance of the final ensemble by attempting to fix the errors of the previous model.

For example, training instances that are misclassified by a previous decision stump are given more weight when the next decision stump is trained. This is one way boosting methods learn from their mistakes.
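
The reweighting idea can be sketched for a single round of discrete AdaBoost. This is a minimal illustration rather than scikit-learn's internal implementation; the toy labels, the stump's predictions, and the variable names are all assumptions made up for the example.

```python
import numpy as np

# Hypothetical toy labels and one stump's predictions (not the census data).
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])  # the stump misclassifies index 1

# Start with uniform instance weights.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the stump and its vote (alpha) in the final ensemble.
miss = y_pred != y_true
error = np.sum(weights[miss])
alpha = 0.5 * np.log((1 - error) / error)

# Increase the weight of misclassified instances, decrease the rest, renormalize.
weights = weights * np.exp(np.where(miss, alpha, -alpha))
weights = weights / weights.sum()

print(weights.round(3))
```

After this round the single misclassified instance carries half of the total weight, so the next stump is pushed to classify it correctly.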

Two common implementations of the boosting algorithm are Adaptive Boosting and Gradient Boosting, both of which we use here.
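
The two differ mainly in how they correct earlier errors: AdaBoost reweights misclassified instances, while gradient boosting fits each new learner to the residual errors of the current ensemble. Below is a rough sketch of the second idea for squared loss on one feature. The data and the `fit_stump` helper are hypothetical and written only for illustration; `GradientBoostingClassifier` uses a classification loss and full regression trees instead.

```python
import numpy as np

def fit_stump(x, residuals):
    """Hypothetical helper: least-squares regression stump on one feature."""
    best = None
    for t in np.unique(x)[:-1]:  # candidate thresholds; both sides stay non-empty
        left, right = residuals[x <= t], residuals[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

# Hypothetical 1-D toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 3.0, 3.1, 5.0, 5.2])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction       # negative gradient of squared loss (up to a constant)
    stump = fit_stump(x, residuals)  # weak learner fit to the current errors
    prediction = prediction + learning_rate * stump(x)
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])
```

The training error shrinks as stumps are added, and the learning rate controls how much of each stump's correction is applied.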




In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


In [3]:
path_to_data = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

col_names = [
    'age', 'workclass', 'fnlwgt','education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain','capital-loss',
    'hours-per-week','native-country', 'income'
]

df = pd.read_csv(path_to_data, header=None, names = col_names)

In [4]:
#Clean columns by stripping extra whitespace for columns of type "object"
for c in df.select_dtypes(include=['object']).columns:
    df[c] = df[c].str.strip()

target_column = "income"
raw_feature_cols = [
    'age',
    'education-num',
    'workclass',
    'hours-per-week',
    'sex',
    'race'
]

In [5]:
# Taking a look at the distribution of the target column (income)
print(df[target_column].value_counts(normalize=True))

<=50K    0.75919
>50K     0.24081
Name: income, dtype: float64
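
With roughly 76% of rows in the `<=50K` class, accuracy alone can be misleading. As a quick sanity check, here is a sketch of a "model" that always predicts the majority class. The label array is synthetic, built only to mimic the proportions printed above; it is not the real data.

```python
import numpy as np

# Synthetic labels mimicking the printed class balance: 1 means income > 50K.
n_rows = 10_000
y_true = np.zeros(n_rows, dtype=int)
y_true[:2408] = 1  # about 24.08% positives

# A "model" that always predicts the majority class (0).
y_pred = np.zeros(n_rows, dtype=int)

accuracy = np.mean(y_pred == y_true)
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(accuracy, recall)
```

This baseline reaches about 0.76 accuracy while finding none of the high earners, which is why the F1 score is reported alongside accuracy.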


In [6]:
# Taking a look at the datatypes of these columns
print(df[raw_feature_cols].dtypes)

age                int64
education-num      int64
workclass         object
hours-per-week     int64
sex               object
race              object
dtype: object


In [7]:
# Create the features dataframe and convert categorical variables to dummy variables and inspect the first few rows
X = pd.get_dummies(df[raw_feature_cols], drop_first=True)
print(X.head(n=5))

   age  education-num  hours-per-week  workclass_Federal-gov  \
0   39             13              40                      0   
1   50             13              13                      0   
2   38              9              40                      0   
3   53              7              40                      0   
4   28             13              40                      0   

   workclass_Local-gov  workclass_Never-worked  workclass_Private  \
0                    0                       0                  0   
1                    0                       0                  0   
2                    0                       0                  1   
3                    0                       0                  1   
4                    0                       0                  1   

   workclass_Self-emp-inc  workclass_Self-emp-not-inc  workclass_State-gov  \
0                       0                           0                    1   
1                       0                   
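
To see what `drop_first=True` does in isolation, here is a small sketch on a hypothetical two-column frame (not the census data). Note that recent pandas versions return boolean indicator columns rather than 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
})

# drop_first=True drops the first category in sorted order ("Female"),
# so "sex" is encoded by the single indicator column "sex_Male".
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Dropping one level per categorical feature avoids redundant columns, since the dropped category is implied when all remaining indicators are zero.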

In [8]:
# Convert the target variable to a binary value. Setting it to 0 when income <= 50K and 1 when income > 50K.
y = np.where(df.income=='<=50K', 0, 1)
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.30)

In [9]:
# Create the base estimator (a decision stump) and instances of the AdaBoost and gradient boosting classifiers
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `estimator=` on newer versions
decision_stump = DecisionTreeClassifier(max_depth=1)
ada_classifier = AdaBoostClassifier(base_estimator=decision_stump)
grad_classifier = GradientBoostingClassifier()

In [10]:
# Fit the classifiers on the training data
ada_classifier.fit(X_train, y_train)
grad_classifier.fit(X_train, y_train)

In [11]:
# Make our predictions using the classifiers
y_pred_ada = ada_classifier.predict(X_test)
y_pred_grad = grad_classifier.predict(X_test)

In [12]:
# Calculate accuracy and F1 scores
ada_acc = accuracy_score(y_test, y_pred_ada)
ada_f1 = f1_score(y_test, y_pred_ada)
grad_acc = accuracy_score(y_test, y_pred_grad)
grad_f1 = f1_score(y_test, y_pred_grad)

## Results with Default Hyperparameters

In [13]:
print("AdaBoost Classifier accuracy: {0}  F1: {1}".format(ada_acc, ada_f1))
print("GradBoost Classifier accuracy: {0}  F1: {1}".format(grad_acc, grad_f1))

AdaBoost Classifier accuracy: 0.8072474152932746  F1: 0.5238938053097345
GradBoost Classifier accuracy: 0.8125703756781656  F1: 0.5432776253429782
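
The gap between accuracy (about 0.81) and F1 (about 0.52 to 0.54) comes from how F1 is built from precision and recall on the positive class. The sketch below works through the arithmetic with made-up confusion counts; the counts are hypothetical and are not taken from the models above.

```python
# Hypothetical confusion-matrix counts for the positive class (income > 50K).
true_pos, false_pos, false_neg, true_neg = 40, 20, 60, 380

precision = true_pos / (true_pos + false_pos)  # of predicted positives, how many are right
recall = true_pos / (true_pos + false_neg)     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_pos + true_neg) / (true_pos + false_pos + false_neg + true_neg)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Here accuracy is 0.84 while F1 is only 0.5, because the many correctly classified negatives raise accuracy but do not enter the F1 calculation.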


## Hyperparameter Tuning

For AdaBoost the default n_estimators is 50 and for Gradient Boosting it is 100.

Here we'll explore different values of n_estimators for the AdaBoostClassifier, and then different values of the learning rate for the GradientBoostingClassifier.

We'll use scikit-learn's GridSearchCV to search over a range of specified hyperparameter values and find the one that performs best:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

We'll use the F1 score to evaluate each candidate, computed on the held-out fold of each cross-validation split.
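
As a rough sketch of how such a search works end to end, the example below runs a small grid on synthetic data. The dataset, the variable names, and the two-value grid are assumptions for illustration. Unlike the cells in this notebook, which fit the search on all of `X` and `y`, this sketch fits on a training split so that a test set stays unseen until the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced data standing in for the census features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# 5-fold cross-validation over a small grid, scored by F1 on each held-out fold.
search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

# Mean cross-validated F1 per candidate.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))

# F1 of the refit best model on data the search never saw.
test_f1 = f1_score(y_te, search.predict(X_te))
print("best:", search.best_params_, "test F1:", round(test_f1, 3))
```

By default `GridSearchCV` refits the best candidate on the full data passed to `fit`, which is why `search.predict` can be called directly afterwards.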

In [21]:
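# Create a new instance of the AdaBoost classifier
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `estimator=` on newer versions
ada_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1))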
# Specify our list of hyperparam values (n_estimators) to test
ada_params = {'n_estimators':[10,50,100,150,200,250,300]}

ada_gs_clf = GridSearchCV(ada_clf, ada_params, verbose=3, scoring='f1', n_jobs=-1)
ada_gs_clf.fit(X, y)

Fitting 5 folds for each of 7 candidates, totalling 35 fits


In [22]:
# Print the AdaBoost hyperparameters that performed the best.
print("Best AdaBoost parameters: ")
print(ada_gs_clf.best_params_)

Best AdaBoost parameters: 
{'n_estimators': 50}


In [23]:
# Create a new instance of a GradBoost classifier
grad_clf = GradientBoostingClassifier()
# Specify our list of hyperparam values (learning_rate) to test
grad_params = {'learning_rate':[0.001, 0.005, 0.01, 0.05, 0.1]}

grad_gs_clf = GridSearchCV(grad_clf, grad_params, verbose=3, scoring='f1', n_jobs=-1)
grad_gs_clf.fit(X, y)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


In [24]:
# Print the GradBoost hyperparameters that performed the best.
print("Best GradBoost params: ")
print(grad_gs_clf.best_params_)

Best GradBoost params: 
{'learning_rate': 0.1}
[CV 1/5] END ...................n_estimators=10;, score=0.521 total time=   0.2s
[CV 4/5] END ...................n_estimators=50;, score=0.540 total time=   0.8s
[CV 2/5] END ..................n_estimators=150;, score=0.525 total time=   2.4s
[CV 1/5] END ..................n_estimators=250;, score=0.533 total time=   3.9s
[CV 4/5] END ..................n_estimators=300;, score=0.529 total time=   3.8s
[CV 3/5] END ...............learning_rate=0.001;, score=0.000 total time=   1.9s
[CV 1/5] END ................learning_rate=0.01;, score=0.316 total time=   1.9s
[CV 4/5] END ................learning_rate=0.05;, score=0.526 total time=   1.8s
[CV 3/5] END ...................n_estimators=10;, score=0.500 total time=   0.3s
[CV 1/5] END ..................n_estimators=100;, score=0.504 total time=   2.6s
[CV 4/5] END ..................n_estimators=150;, score=0.540 total time=   3.6s
[CV 3/5] END ..................n_estimators=250;, score=0.536 

[CV 5/5] END ...................n_estimators=10;, score=0.500 total time=   0.2s
[CV 1/5] END ..................n_estimators=100;, score=0.537 total time=   1.6s
[CV 5/5] END ..................n_estimators=150;, score=0.527 total time=   2.3s
[CV 3/5] END ..................n_estimators=250;, score=0.517 total time=   4.0s
[CV 5/5] END ..................n_estimators=300;, score=0.526 total time=   3.3s
[CV 2/5] END ...............learning_rate=0.005;, score=0.000 total time=   1.9s
[CV 5/5] END ................learning_rate=0.01;, score=0.305 total time=   2.0s
[CV 3/5] END .................learning_rate=0.1;, score=0.511 total time=   1.8s
[CV 1/5] END ...................n_estimators=50;, score=0.517 total time=   1.3s
[CV 4/5] END ..................n_estimators=100;, score=0.542 total time=   2.6s
[CV 2/5] END ..................n_estimators=200;, score=0.516 total time=   4.7s
[CV 5/5] END ..................n_estimators=250;, score=0.514 total time=   5.9s
[CV 2/5] END ...............