# Adult Census Data Set

Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset, it was donated to the UCI ML Repository in 1996.

It is a classification task: predict whether an individual will earn an annual salary of >50K or <=50K.

In [None]:
from __future__ import print_function, absolute_import, division
import os
import tempfile
import shutil
import sys
from typing import Tuple

import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
print(f'Running TensorFlow version {tf.__version__} with Python {sys.version}')

## Read the U.S. Census Data

We'll be using the U.S. Census Income Dataset from 1994 and 1995 from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/adult)

The problem is to predict one of two labels from the data, which is commonly known as a binary classification problem. The two labels are whether each individual (row of data) has an annual income of more than 50K, or 50K or less.

Getting data is always the first problem in machine learning. In this case we're going to download a comma-separated values (CSV) file. If you've ever double-clicked a file like this on your computer, it probably opened as a spreadsheet in a program such as Excel. It is a very common format for distributing data and has one example per row, with each feature (column) separated by a comma.

### Download

In [None]:
def retrieve_data(cache_subdirectory: str='/tmp/datasets/census') -> Tuple[str, str]:
  """Download the census dataset to local directory
  
  Args:
    cache_subdirectory: Local directory to cache downloads
  
  Returns:
    Tuple of local paths of (train, test)
  """

  remote_files = [
      'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
      'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
  ]

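  local_files = []
  for remote_file in remote_files:
    local_fname = os.path.basename(remote_file)

    # get_file downloads the file if needed and returns its local path
    local_path = tf.keras.utils.get_file(
        local_fname, origin=remote_file, cache_subdir=cache_subdirectory)
    local_files.append(local_path)

  # Return a tuple of (train, test), as the annotation and docstring promise
  return tuple(local_files)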

In [None]:
# Set up the full path of the csv files on disk
train_local_file, test_local_file = retrieve_data()

### Inspect

The [dataset page](https://archive.ics.uci.edu/ml/datasets/adult) lists attribute information about the dataset which is copied into the table below.

| Column | Feature Description |
| -------|---------------------|
| age | continuous. |
| workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. |
| fnlwgt | continuous (the # people census takers believe that observation represents) |
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. |
| education-num | continuous (education feature in numerical form) |
| marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. |
| occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. |
| relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. |
| race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. |
| sex | Female, Male. |
| capital-gain | continuous. |
| capital-loss | continuous. |
| hours-per-week | continuous. |
| native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. |

The csv files don't have a header to label the columns so we list them out here for use in the pandas dataframe in the variable `CSV_COLUMNS`. We are also defining the column we want to predict as the `LABEL_KEY`.

In [None]:
CSV_COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
    'hours-per-week', 'native-country', 'income-bracket'
]

LABEL_KEY = 'income-bracket'

Notebooks allow us to run shell commands inline by putting an exclamation mark at the start of the line. Before we start loading the data, we'll use the `head` command to display the first few lines of each dataset file. This gives us a feel for how the files are formatted so we can read them in appropriately.

In [None]:
!head -n 3 {train_local_file}

In [None]:
!head -n 3 {test_local_file}

Viewing the dataset above, we can notice a few things:
* Fields are separated by a comma and a space, i.e. ', '
* The test file has a description on the first line that isn't part of the data
* The labels in the test file have a period whereas the labels in the train file do not

We will use these observations to apply a few transforms when loading the data so we end up with a clean and consistent dataset.

We would normally have to consider how to deal with rows that have missing values, which this dataset records as `?`. To keep things simple, in this case, we are just going to drop any rows that aren't complete.
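
The clean-up steps described above can be sketched on a tiny in-memory sample. The rows in `SAMPLE` are made up for illustration and only three columns are used; the sketch assumes missing entries appear as `?`.

```python
import io

import pandas as pd

# Made-up sample mimicking the test file: a description line, ', ' separators,
# a '?' marking a missing value, and labels ending in a period.
SAMPLE = """|description line that is not data
25, Private, <=50K.
38, ?, >50K.
"""

sample_df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=', ',            # a multi-character separator needs the python engine
    engine='python',
    header=None,
    skiprows=1,          # skip the description line
    names=['age', 'workclass', 'income-bracket'],
    na_values='?')       # treat '?' as a missing value

# Remove the trailing period so the labels match the training file
sample_df['income-bracket'] = sample_df['income-bracket'].str.rstrip('.')

# Drop incomplete rows
sample_df = sample_df.dropna()
print(sample_df)
```

Only the complete row survives, with its label reduced to `<=50K`.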

### Load

Load up the training data into a dataframe using our separator and column names

In [None]:
train_df = pd.read_csv(train_local_file, index_col=None, sep=', ',
                       header=None, engine='python', names=CSV_COLUMNS,
                       na_values='?')  # missing values are recorded as '?'

# Drop rows with missing values
train_df.dropna(inplace=True)

Load up the test data into a dataframe using our separator and column names, skipping the description line and dropping the period from the label column.

In [None]:
# skiprows=1 skips the description line; header=None keeps every data row
test_df = pd.read_csv(test_local_file, index_col=None, sep=', ',
                      header=None, skiprows=1, engine='python',
                      names=CSV_COLUMNS, na_values='?')

# Drop rows with missing values
test_df.dropna(inplace=True)

# Test dataset has periods on the end of the labels; strip them to match the train set
test_df[LABEL_KEY] = test_df[LABEL_KEY].str.rstrip('.')

Similar to our use of the `head` command previously, we can now view the first few rows of our dataframe using the `head()` method provided by pandas.

In [None]:
train_df.head(3)

Visualising the distribution of the dataset can also be useful to give some intuition about the problem. It can also provide some insights into what features may be useful as predictors.

You can also see the wide range of values that a feature can take. In machine learning this can pose problems: when one feature ranges from 1 to 1,000 and another reaches 1,000,000, it is hard for the optimiser to learn appropriate weights. To handle this, we can normalise a feature by subtracting its average and dividing by its standard deviation.
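
As a minimal sketch of that normalisation, the snippet below standardises a made-up series of values (not taken from the census data). The name `values` is purely illustrative.

```python
import pandas as pd

# Made-up values spanning several orders of magnitude
values = pd.Series([1.0, 10.0, 1000.0, 1000000.0], name='example-feature')

# Standardise: subtract the mean, then divide by the standard deviation.
# Note: pandas' std() uses the sample standard deviation (ddof=1) by default.
standardised = (values - values.mean()) / values.std()

print(standardised.mean())  # approximately 0
print(standardised.std())   # approximately 1
```

In practice the mean and standard deviation would usually be computed on the training data and then reused unchanged on the test data.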

To keep things simple here we will ignore values that don't have a simple distribution or are extremely skewed.

In [None]:
# Plot a histogram for every numeric column
train_df.hist(figsize=(15, 20))
plt.show()

In [None]:
test_df.head(3)

### Feature Columns

We have both numeric and categorical data but the model only knows how to deal with numbers.

A categorical feature is a feature that can take one of a limited number of possible values. As an example, in this dataset the 'sex' column is either Male or Female. To use this type of feature in our model, we have to transform it into numbers, which we'll show below.

The other types of features are natively numeric and their magnitude has some meaning. We will define the list of numeric and categorical feature columns by their names so we can appropriately transform each before being used in our model.

In [None]:
NUMERIC_FEATURE_KEYS = [
    'age', 'hours-per-week'
]

CATEGORICAL_FEATURE_KEYS = [
    'workclass', 'education', 'marital-status', 'occupation', 'relationship',
    'race', 'sex', 'native-country'
]

It is also handy to know the unique list of values a column can have (especially the labels). Here we'll just grab the unique list of labels which we expect to be `['<=50K', '>50K']`

In [None]:
LABEL_VOCAB = list(train_df['income-bracket'].unique())

### TensorFlow Data

TensorFlow wants a function it can call each time it needs more data. Here we'll use built-in utilities that create these functions from our pandas dataframes.

In [None]:
train_x = train_df.drop(LABEL_KEY, axis=1)
train_y = train_df[LABEL_KEY]
# Training: shuffle and repeat indefinitely; the number of steps limits training
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=train_x, y=train_y, shuffle=True, num_epochs=None, batch_size=1024)

test_x = test_df.drop(LABEL_KEY, axis=1)
test_y = test_df[LABEL_KEY]
# Evaluation: a single ordered pass over the test data in large batches
test_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=test_x, y=test_y, shuffle=False, num_epochs=1, batch_size=1024)
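
To give some intuition for what such an input function provides, here is a conceptual sketch in plain pandas. It is not how `pandas_input_fn` is implemented; the `batches` helper and the toy data are hypothetical and only illustrate handing out data one batch at a time.

```python
import pandas as pd


def batches(features: pd.DataFrame, labels: pd.Series, batch_size: int):
  """Yield (features, labels) slices of at most batch_size rows, in one pass."""
  for start in range(0, len(features), batch_size):
    end = start + batch_size
    yield features.iloc[start:end], labels.iloc[start:end]


# Toy data, made up for illustration
toy_x = pd.DataFrame({'age': [25, 38, 52, 41, 30]})
toy_y = pd.Series(['<=50K', '>50K', '>50K', '<=50K', '<=50K'])

for batch_x, batch_y in batches(toy_x, toy_y, batch_size=2):
  print(len(batch_x), list(batch_y))
```

One full pass over the data is one epoch. In `pandas_input_fn`, `num_epochs=None` means the data is repeated indefinitely, so the number of steps decides when to stop.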

As we mentioned above, numeric and categorical columns need to be fed differently into our model.

Firstly we'll just map our numeric data columns to TensorFlow feature columns directly (n.b. there are many more interesting ways that these columns could be used: look at [normalisation](https://en.wikipedia.org/wiki/Feature_scaling), [bucketized columns](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column) and others).

In [None]:
real_valued_columns = [tf.feature_column.numeric_column(key, shape=()) 
                       for key in NUMERIC_FEATURE_KEYS]
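
For intuition about the bucketized columns mentioned above, here is a small sketch in plain Python. The age boundaries are hypothetical, and `bucketize` is an illustrative helper rather than a TensorFlow function. It is intended to mirror the convention where each boundary is the inclusive lower edge of the next bucket.

```python
import bisect
from typing import List

# Hypothetical age boundaries, chosen only for illustration
AGE_BOUNDARIES = [25, 35, 45, 55, 65]


def bucketize(value: float, boundaries: List[float]) -> int:
  """Return the index of the bucket that `value` falls into.

  With N boundaries there are N + 1 buckets; each boundary is the inclusive
  lower edge of the bucket to its right.
  """
  return bisect.bisect_right(boundaries, value)


for age in [19, 25, 44, 70]:
  print(age, '->', bucketize(age, AGE_BOUNDARIES))
```

The resulting bucket index can then be treated like any other categorical value, which lets a linear model learn a separate weight per age range.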

Next we'll look at the categorical columns: the simplest method is to 'One-Hot-Encode' the data. 

One-hot encoding represents each value as a list of zeroes with the same length as the number of categories, with a single one at the position of the matching category. An example looking at countries:

If our country list = 'Australia', 'England', 'Canada', 'New Zealand'

If our row of data has 'Australia' then our one-hot encoded values are `[1, 0, 0, 0]` 

'Canada' -> `[0, 0, 1, 0]`

'England' -> `[0, 1, 0, 0]`

'New Zealand' -> `[0, 0, 0, 1]`
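
The same mapping can be sketched in a few lines of plain Python. The `one_hot` helper is illustrative only and is not part of TensorFlow.

```python
from typing import List

COUNTRIES = ['Australia', 'England', 'Canada', 'New Zealand']


def one_hot(value: str, vocabulary: List[str]) -> List[int]:
  """Return a list of zeroes with a single one at the position of `value`."""
  return [1 if item == value else 0 for item in vocabulary]


for country in COUNTRIES:
  print(country, '->', one_hot(country, COUNTRIES))
```

In this sketch a value that is not in the vocabulary produces a list of all zeroes.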

TensorFlow provides a utility function for this and the code below will create all the transforms from our categorical features to one-hot columns for each

In [None]:
categorical_columns = [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key, train_df[key].unique()))
    for key in CATEGORICAL_FEATURE_KEYS
]

Finally just create a list of inputs that has both our numeric and categorical columns

In [None]:
INPUT_COLUMNS = real_valued_columns + categorical_columns

## Linear Model

Although deep neural networks are generating a lot of amazing results, they are computationally expensive compared to a linear model. We start with the simplest model we can to:
1. Quickly see if our code works ;) 
2. Get a baseline for accuracy that a simple model can achieve
3. Use this simple model if it provides sufficient accuracy for our needs
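
When reading a baseline accuracy, it can help to compare it with an even simpler reference: always predicting the most frequent label. The sketch below is an addition for illustration; `majority_class_accuracy` is a hypothetical helper and the toy labels are made up.

```python
from collections import Counter
from typing import Sequence


def majority_class_accuracy(labels: Sequence[str]) -> float:
  """Accuracy obtained by always predicting the most frequent label."""
  counts = Counter(labels)
  _, most_common_count = counts.most_common(1)[0]
  return most_common_count / len(labels)


toy_labels = ['<=50K', '<=50K', '<=50K', '>50K']
print(majority_class_accuracy(toy_labels))  # 0.75
```

Assuming the dataframes loaded earlier, calling `majority_class_accuracy(list(test_df[LABEL_KEY]))` would give the floor that any trained model should beat.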

In [None]:
linear_model_dir = tempfile.mkdtemp(prefix='linear')

In [None]:
linear_classifier = tf.estimator.LinearClassifier(
    feature_columns=INPUT_COLUMNS, 
    label_vocabulary=LABEL_VOCAB, 
    model_dir=linear_model_dir)

In [None]:
linear_classifier.train(train_input_fn, steps=5)

In [None]:
linear_results = linear_classifier.evaluate(test_input_fn, steps=10)

In [None]:
for key, value in sorted(linear_results.items()):
    print(f'{key}: {value}')

In [None]:
train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=2000)
eval_spec = tf.estimator.EvalSpec(test_input_fn)
tf.estimator.train_and_evaluate(linear_classifier, train_spec, eval_spec)

## Deep Neural Networks

We're running a neural network with 4 hidden layers and using the TensorFlow DNNClassifier estimator

In [None]:
dnn_model_dir = tempfile.mkdtemp(prefix='dnn')

In [None]:
dnn_classifier = tf.estimator.DNNClassifier(
    hidden_units=[100, 70, 50, 25], 
    feature_columns=INPUT_COLUMNS, 
    model_dir=dnn_model_dir, 
    n_classes=len(LABEL_VOCAB), 
    label_vocabulary=LABEL_VOCAB)

In [None]:
dnn_classifier.train(train_input_fn, steps=5)

In [None]:
dnn_results = dnn_classifier.evaluate(test_input_fn, steps=10)

In [None]:
for key, value in sorted(dnn_results.items()):
  print(f'{key}: {value}')

In [None]:
train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=2000)
eval_spec = tf.estimator.EvalSpec(test_input_fn)
tf.estimator.train_and_evaluate(dnn_classifier, train_spec, eval_spec)

## Hosting

We can export our trained model and host it as a REST API so we can utilise it as a web service. To do this, you will need docker on your local machine to host the container.

In prediction (inference) mode, there are a lot of pieces we can drop out of our model. 
* We don't need to update variables so variables become constants
* We aren't updating weights so all of the gradient operations are removed
* There's no loss function so that is removed

These all help to reduce the size of the model on disk and improve its performance when its only purpose is to do inference.

TensorFlow estimators will take care of all of this for us when we call `export_savedmodel`. Below we are also defining the signature of what we expect our incoming data to be. This is so a server hosting the model can receive data and format it appropriately for our model to predict with.

In [None]:
feature_spec = tf.feature_column.make_parse_example_spec(INPUT_COLUMNS)

serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)

linear_export_dir = linear_classifier.export_savedmodel('exports', serving_input_receiver_fn)

In [None]:
if os.path.isdir('linear_census_model'):
  shutil.rmtree('linear_census_model')

shutil.copytree('exports', 'linear_census_model')

In [None]:
!zip -r linear_census_model.zip linear_census_model

In [None]:
from google.colab import files

In [None]:
files.download('linear_census_model.zip')

To run a server locally we can use the TensorFlow Serving docker image.

```bash
cd ~/Downloads
unzip linear_census_model.zip
docker pull tensorflow/serving:latest
docker run -d \
  -p 8501:8501 \
  -e MODEL_NAME=linear_census_model \
  -v $(pwd)/linear_census_model:/models/linear_census_model \
  --name serving_linear \
  tensorflow/serving:latest
```

List out the unique values for each categorical feature. We need these because requests to our service have to use values seen in the training data for them to be used by the model.

In [None]:
for feature in CATEGORICAL_FEATURE_KEYS:
  print(feature)
  print(list(train_df[feature].unique()))
  print()

With the container running, send a POST request to `http://localhost:8501/v1/models/linear_census_model:classify`

Request body
```json
{
  "examples": [
    {
      "age": 31.0,
      "hours-per-week": 40.0,
      "workclass": "Private",
      "education": "Bachelors",
      "marital-status": "Never-married",
      "occupation": "Prof-specialty",
      "relationship": "Unmarried",
      "race": "White",
      "sex": "Male",
      "native-country": "Australia"
    }
  ]
}
```

Response
```json
{
    "results": [
        [
            [
                "<=50K",
                0.818707
            ],
            [
                ">50K",
                0.181293
            ]
        ]
    ]
}
```
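
As a sketch of how the request above could be sent from Python, the snippet below uses only the standard library. The `build_request` and `classify` helpers are hypothetical names; the URL assumes the port and model name from the docker command above. Nothing is sent until `classify` is called, which requires the container to be running.

```python
import json
import urllib.request

# Assumed to match the docker command: port 8501, model linear_census_model
CLASSIFY_URL = 'http://localhost:8501/v1/models/linear_census_model:classify'


def build_request(example: dict, url: str = CLASSIFY_URL) -> urllib.request.Request:
  """Wrap one example in the {"examples": [...]} body and build a POST request."""
  body = json.dumps({'examples': [example]}).encode('utf-8')
  return urllib.request.Request(
      url, data=body, headers={'Content-Type': 'application/json'},
      method='POST')


def classify(example: dict) -> dict:
  """Send the request; only works while the serving container is running."""
  with urllib.request.urlopen(build_request(example)) as response:
    return json.loads(response.read().decode('utf-8'))


example = {
    'age': 31.0, 'hours-per-week': 40.0, 'workclass': 'Private',
    'education': 'Bachelors', 'marital-status': 'Never-married',
    'occupation': 'Prof-specialty', 'relationship': 'Unmarried',
    'race': 'White', 'sex': 'Male', 'native-country': 'Australia',
}

request = build_request(example)
print(request.get_method(), request.full_url)
```

With the server up, `classify(example)` should return a dictionary shaped like the response shown above.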