This notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to train a simple XGBoost model and query it for predictions.

## Set values of variables

In [None]:
# Replace the following with your values
project = '<PROJECT_ID>'
bucket = '<BUCKET_ID>'  # to be created
folder = '<folder or blob name>'
region = 'us-central1'

In [None]:
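# Build the bucket path and export the settings so shell commands can read them
bucket_path = f'{bucket}/{folder}'
%env PROJECT_ID=$project
%env BUCKET_ID=$bucket
%env BUCKET_PATH=$bucket_path
%env REGION=$region
# Create the bucket
!gsutil mb -c standard -l {region} gs://{bucket}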

## Download the data
The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample
uses for training is hosted by the [UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/datasets/):

 * Training file is `adult.data`
 * Evaluation file is `adult.test`


### Disclaimer
This dataset is provided by a third party. Google provides no representation,
warranty, or other guarantees about the validity or any other aspects of this dataset.

In [None]:
# Copy the data from its public location to your bucket
!gsutil cp gs://amazing-public-data/census_income/census_income_data_adult.data gs://${BUCKET_PATH}/adult.data
!gsutil cp gs://amazing-public-data/census_income/census_income_data_adult.test gs://${BUCKET_PATH}/adult.test

## Check/Install dependencies

In [None]:
!pip install xgboost

In [None]:
import os
import json
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

Inspect a sample of the training data

In [None]:
train_data = f"gs://{bucket_path}/adult.data"

raw_training_data = pd.read_csv(train_data)

raw_training_data.head()

## Data preparation

Separate the target (label) column from the rest of the training dataset

In [None]:
train_features = raw_training_data.drop("income",
                                        axis=1)

In [None]:
# create training labels list
train_labels = raw_training_data["income"] == " >50K"
train_labels

Check the class distribution, e.g. is it balanced?

In [None]:
raw_training_data.income.value_counts()
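
Raw counts are easier to compare as proportions. The following is a minimal sketch on a small made-up frame (the `toy` data is illustrative and not taken from the census file); `value_counts(normalize=True)` returns the share of each class instead of the count.

```python
import pandas as pd

# Hypothetical toy data standing in for raw_training_data
toy = pd.DataFrame({"income": [" <=50K", " <=50K", " <=50K", " >50K"]})

# normalize=True returns fractions of the total instead of raw counts
class_share = toy["income"].value_counts(normalize=True)
print(class_share)
```

The same call on `raw_training_data.income` would show how far the real data is from a 50/50 split.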

Do the same for the test set

In [None]:
test_data = f"gs://{bucket_path}/adult.test"

raw_testing_data = pd.read_csv(test_data, skiprows=[1])

raw_testing_data.head()
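
The `skiprows=[1]` argument suggests that the test file has a non-data line right after the header. A minimal sketch of that behavior, assuming a file laid out as header, junk line, data rows (the `csv_text` contents below are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical file contents: a header, one non-data line, then two data rows
csv_text = "age,income\n|1x3 Cross validator\n25, <=50K.\n38, >50K.\n"

# skiprows=[1] drops only the line at 0-based index 1; the header at index 0 is kept
df = pd.read_csv(io.StringIO(csv_text), skiprows=[1])
print(df)
```

Passing a list skips exactly those line indices, whereas an integer such as `skiprows=1` would skip the first line, i.e. the header itself.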

In [None]:
# remove column we are trying to predict ('income') from features list
test_features = raw_testing_data.drop("income",
                                      axis=1)

In [None]:
# create labels list
test_labels = raw_testing_data["income"] == " >50K."

test_labels
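
Note that the two cells compare against different strings: `" >50K"` for the training file and `" >50K."` (with a trailing period) for the test file. As a sketch of one way to avoid keeping two literals in sync, the hypothetical helper `to_binary_label` below strips padding and a trailing period before comparing; the sample values are made up to mimic the two formats.

```python
import pandas as pd

# Hypothetical label values mimicking the two files' formats
train_income = pd.Series([" <=50K", " >50K"])
test_income = pd.Series([" <=50K.", " >50K."])


def to_binary_label(income: pd.Series) -> pd.Series:
    """Return True where income is above 50K, ignoring padding and a trailing period."""
    cleaned = income.str.strip().str.rstrip(".")
    return cleaned == ">50K"


print(to_binary_label(train_income).tolist())
print(to_binary_label(test_income).tolist())
```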

In [None]:
raw_testing_data.income.value_counts()

Check which columns are numerical, and which ones are categorical

In [None]:
train_features.dtypes
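
Instead of reading the `dtypes` output by eye, the split can also be computed. A minimal sketch using `select_dtypes` on a made-up frame (`toy_features` is illustrative only):

```python
import pandas as pd

# Hypothetical toy frame standing in for train_features
toy_features = pd.DataFrame(
    {
        "age": [25, 38],
        "workclass": [" Private", " State-gov"],
        "hours-per-week": [40, 50],
    }
)

# Numeric columns can be used by XGBoost as-is
numerical = toy_features.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical and needs encoding
categorical = toy_features.select_dtypes(exclude="number").columns.tolist()

print(numerical, categorical)
```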

### Encoding of categorical columns

In [None]:
# categorical columns contain data that need to be turned into numerical values before being used by XGBoost
CATEGORICAL_COLUMNS = (
                       "workclass",
                       "education",
                       "marital-status",
                       "occupation",
                       "relationship",
                       "race",
                       "sex",
                       "native-country"
                       )

In [None]:
# create one LabelEncoder per categorical column
encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}

encoders

In [None]:
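# Fit each encoder on the training data only
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])

# Reuse the fitted encoders so test categories map to the same integer codes
# (transform raises a ValueError if the test set contains an unseen category)
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].transform(test_features[col])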
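
`LabelEncoder` assigns integer codes by sorting the categories it sees during `fit`, so the codes depend on which data it was fitted on. A minimal sketch with made-up category values (not from the census data) showing how a mapping fitted on the training split differs from one fitted on a test split that is missing a category:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values; the test split is missing "b"
train_values = ["a", "b", "c"]
test_values = ["a", "c"]

# Encoder fitted on the training split: a -> 0, b -> 1, c -> 2
shared = LabelEncoder().fit(train_values)
shared_codes = shared.transform(test_values)

# Encoder fitted on the test split alone: a -> 0, c -> 1
refit = LabelEncoder().fit(test_values)
refit_codes = refit.transform(test_values)

print(shared_codes, refit_codes)
```

With the refitted encoder, `"c"` gets a different code than the model saw during training, which is why the encoder fitted on the training data is normally reused for the test data.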

In [None]:
train_features.dtypes

## Model training & evaluation

Use XGBoost to train a binary classifier

In [None]:
model = xgb.XGBClassifier()

model.fit(
          train_features,
          train_labels,
         )

Check whether we can evaluate the model and get probability scores

In [None]:
model.score(test_features,
            test_labels)

In [None]:
predictions = model.predict_proba(test_features)[:, 1]

type(predictions)
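
`predict_proba` returns one column per class, so `[:, 1]` keeps the probability of the positive class. As a rough sketch of how such scores relate to the accuracy reported by `score`, the example below turns made-up probabilities into hard labels and compares them with made-up ground truth; the 0.5 threshold and both arrays are assumptions for illustration.

```python
import numpy as np

# Hypothetical values standing in for predictions and test_labels
probs = np.array([0.9, 0.2, 0.6, 0.4])
labels = np.array([True, False, False, False])

# Assumed decision threshold of 0.5 for the positive class
predicted = probs >= 0.5

# Accuracy is the fraction of predictions that match the labels
accuracy = (predicted == labels).mean()
print(accuracy)
```

Lowering or raising the threshold trades false positives against false negatives, which can matter when the classes are imbalanced.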