# CPS ASEC Research File Analysis

This notebook uses the 2017 CPS ASEC Research File, an experimental data set published by the U.S. Census Bureau. Source: https://www.census.gov/data/datasets/2017/demo/income-poverty/2017-cps-asec-research-file.html
The "2017 Person file (CSV)" is used.

I attempt to use some of the data fields as inputs to predict a categorical income level: poverty, low, middle, or high.

In [1]:
import pandas as pd
import tensorflow as tf
from sklearn.utils import shuffle
from sklearn import preprocessing

In [2]:
raw_data = pd.read_csv('pppub17.csv')

## Prune the data

The file has many columns, so we select only a few of them for analysis.
Our prediction target is the income category, which is stored in the column 'PTOT_R'.

In [3]:
columns = [
    'PEPAR1',  # -1 = No Parent1 present
    'PEPAR2',  # -1 = No Parent2 present
    'A_AGE',
    'A_FTPT',  # 1 = Full time, 2 = Part time
    'A_HGA',  # Educational attainment, values in order
    'A_SEX',  # 1 = Male, 2 = Female
    'SPM_HRACE',  # Race categories 1 = White alone, 2 = Black alone, 3 = Asian alone, 4 = Other
    'PRCITSHP',  # CITIZENSHIP GROUP categories
    'PTOT_R',  # Income categories; the 2017 poverty line is 12,060, so values below 6 are treated as below the poverty line
]
pruned_data = raw_data[columns].copy()
# filter out individuals who are not working (value 0)
pruned_data = pruned_data[pruned_data['A_FTPT'] != 0]
pruned_data = shuffle(pruned_data)
pruned_data.head()

Unnamed: 0,PEPAR1,PEPAR2,A_AGE,A_FTPT,A_HGA,A_SEX,SPM_HRACE,PRCITSHP,PTOT_R
129096,2,1,17,1,38,2,1,1,1
24979,-1,-1,25,1,40,2,1,1,2
98668,-1,-1,48,1,40,2,1,1,15
102711,2,1,20,1,40,2,1,1,4
82763,-1,-1,33,1,43,1,1,1,15
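
The filter on 'A_FTPT' can discard a large share of the rows, so it may be worth measuring that share before dropping them. The following is a minimal sketch on a made-up frame. The name `toy_people` and its values are illustrative assumptions, not rows from the real file.

```python
import pandas as pd

# Hypothetical toy data: 0 = not working, 1 = full time, 2 = part time.
toy_people = pd.DataFrame({'A_FTPT': [0, 0, 0, 1, 2, 0, 1, 0]})

# Share of rows the filter would remove.
not_working_share = (toy_people['A_FTPT'] == 0).mean()

# Same filter as in the cell above: keep only rows with a work status.
toy_workers = toy_people[toy_people['A_FTPT'] != 0]

print('Share not working: {0:.1%}'.format(not_working_share))
print('Rows kept:', len(toy_workers))
```

On the real data, the same two lines applied to `raw_data` would show how much of the sample the rest of the notebook actually uses.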


## Perform data mappings

Several inputs can be turned into binary fields.
There are many income buckets, but we will truncate them into just 4.

In [4]:

def map_parent(value):
    # makes 0 no parent, 1 is some kind of parent
    return 0 if value == -1 else 1


def map_ftpt(value):
    # makes 0 part time, 1 full time
    return 0 if value == 2 else 1


def map_sex(value):
    # makes 0 female, 1 male
    return 0 if value == 2 else 1


def map_income(value):
    """
    Collapse the 41 income buckets into 4:
      - 0: below poverty level
      - 1: low income
      - 2: middle income
      - 3: high income
    Values below 6 are below the poverty level.
    Value 41 is already a single bucket for more than 100k.
    """
    poverty_level = 6
    if value < poverty_level:
        value = 0
    elif value == 41:
        value = 3
    else:
        # values 6-40 become low or middle income
        # the split point used here is 16
        if value < 16:
            value = 1
        else:
            value = 2
    return value

mapped_data = pruned_data.copy()
mapped_data['PEPAR1'] = mapped_data['PEPAR1'].map(map_parent)
mapped_data['PEPAR2'] = mapped_data['PEPAR2'].map(map_parent)
mapped_data['A_FTPT'] = mapped_data['A_FTPT'].map(map_ftpt)
mapped_data['A_SEX'] = mapped_data['A_SEX'].map(map_sex)
mapped_data['PTOT_R'] = mapped_data['PTOT_R'].map(map_income)
mapped_data.head()

Unnamed: 0,PEPAR1,PEPAR2,A_AGE,A_FTPT,A_HGA,A_SEX,SPM_HRACE,PRCITSHP,PTOT_R
129096,1,1,17,1,38,0,1,1,0
24979,0,0,25,1,40,0,1,1,0
98668,0,0,48,1,40,0,1,1,1
102711,1,1,20,1,40,0,1,1,0
82763,0,0,33,1,43,1,1,1,1
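
The income mapping above is a set of range checks, so it can also be written as a single binning step. The sketch below uses `pd.cut` and assumes the 'PTOT_R' codes are integers from 1 to 41. The names `INCOME_BIN_EDGES` and `bucket_income` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Assumption: PTOT_R codes are integers from 1 to 41.
# Edges are right-inclusive: (-inf, 5], (5, 15], (15, 40], (40, 41].
INCOME_BIN_EDGES = [-np.inf, 5, 15, 40, 41]


def bucket_income(codes: pd.Series) -> pd.Series:
    """Vectorized counterpart of map_income: returns bucket ids 0 to 3."""
    # labels=False makes pd.cut return the integer position of each bin.
    return pd.cut(codes, bins=INCOME_BIN_EDGES, labels=False).astype(int)


# Codes on both sides of every boundary.
example_codes = pd.Series([1, 5, 6, 15, 16, 40, 41])
example_buckets = bucket_income(example_codes)
print(example_buckets.tolist())
```

Keeping the bin edges in one list makes the boundaries easy to adjust if different income groupings are wanted.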


### Add the dummies

Citizenship statuses and races are both categorical and converted to dummies.

In [5]:
data_with_dummies = mapped_data.copy()
citizenships = pd.get_dummies(data_with_dummies['PRCITSHP'], drop_first=True)
citizenships.columns = ['citizen_2', 'citizen_3', 'citizen_4', 'citizen_5']
data_with_dummies = data_with_dummies.drop(['PRCITSHP'], axis=1)
data_with_dummies = pd.concat([data_with_dummies, citizenships], axis=1)

races = pd.get_dummies(data_with_dummies['SPM_HRACE'], drop_first=True)
races.columns = ['black', 'asian', 'other']
data_with_dummies = data_with_dummies.drop(['SPM_HRACE'], axis=1)
data_with_dummies = pd.concat([data_with_dummies, races], axis=1)

data_with_dummies.head()

Unnamed: 0,PEPAR1,PEPAR2,A_AGE,A_FTPT,A_HGA,A_SEX,PTOT_R,citizen_2,citizen_3,citizen_4,citizen_5,black,asian,other
129096,1,1,17,1,38,0,0,0,0,0,0,0,0,0
24979,0,0,25,1,40,0,0,0,0,0,0,0,0,0
98668,0,0,48,1,40,0,1,0,0,0,0,0,0,0
102711,1,1,20,1,40,0,0,0,0,0,0,0,0,0
82763,0,0,33,1,43,1,1,0,0,0,0,0,0,0
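
Assigning the dummy column names by hand only works if every category is present in the data. As a hedged alternative, `pd.get_dummies` can derive the names from the values through its `prefix` argument. The sketch below uses a made-up series and assumes the citizenship codes run from 1 to 5, as the column names above suggest.

```python
import pandas as pd

# Hypothetical toy codes standing in for the PRCITSHP column.
toy_citizenship = pd.Series([1, 2, 3, 4, 5, 1], name='PRCITSHP')

# prefix= builds each column name from the category value, and
# drop_first=True drops the first category (code 1) as the baseline.
toy_dummies = pd.get_dummies(toy_citizenship, prefix='citizen', drop_first=True)
toy_column_names = list(toy_dummies.columns)

print(toy_column_names)
```

A row whose dummies are all zero belongs to the dropped baseline category.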


### Convert the targets to dummies

In [6]:
targets = pd.get_dummies(data_with_dummies['PTOT_R'])
targets.columns = ['poverty', 'low', 'middle', 'high']

### Remove targets from inputs

In [7]:
unscaled_inputs = data_with_dummies.drop(['PTOT_R'], axis=1)

### Scale the inputs

Standardizing the inputs keeps columns with larger numeric ranges, such as age and education, from dominating the others.

In [8]:
inputs = preprocessing.scale(unscaled_inputs)
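
To make the scaling step concrete, here is a small NumPy sketch of column-wise standardization: each column is shifted to zero mean and divided by its standard deviation. The function `standardize_columns` and the toy values are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np


def standardize_columns(values: np.ndarray) -> np.ndarray:
    """Shift each column to zero mean and scale it to unit variance."""
    means = values.mean(axis=0)
    stds = values.std(axis=0)  # population standard deviation (ddof=0)
    stds[stds == 0] = 1.0      # leave constant columns unscaled
    return (values - means) / stds


# Hypothetical rows: an age column next to a 0/1 flag column.
toy_inputs = np.array([[17.0, 1.0], [25.0, 0.0], [48.0, 1.0], [20.0, 1.0]])
toy_scaled = standardize_columns(toy_inputs)

print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

After scaling, the age column and the flag column live on comparable ranges, which is the point of this step.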

## Preprocessing Complete

### Separate train, validation, and test datasets

In [9]:
train_percent = .8
validation_percent = .1
test_percent = 1 - train_percent - validation_percent
total_samples = inputs.shape[0]

train_offset = int(total_samples*train_percent)
train_inputs = inputs[0:train_offset]
train_targets = targets[0:train_offset]

validation_offset = train_offset + int(total_samples*validation_percent)
# slice end offsets are exclusive, so each split starts exactly where the previous one ended
validation_inputs = inputs[train_offset:validation_offset]
validation_targets = targets[train_offset:validation_offset]

test_inputs = inputs[validation_offset:]
test_targets = targets[validation_offset:]
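
The 80/10/10 split can be sketched as a small helper that returns the two cut points. Because a Python slice excludes its end index, the next slice can start at that same index without dropping or repeating a row. The helper name `split_offsets` and the toy array are assumptions for illustration.

```python
import numpy as np


def split_offsets(total_samples, train_fraction=0.8, validation_fraction=0.1):
    """Return the end offsets of the train and validation slices."""
    train_end = int(total_samples * train_fraction)
    validation_end = train_end + int(total_samples * validation_fraction)
    return train_end, validation_end


# Hypothetical stand-in for the shuffled rows.
toy_rows = np.arange(25)
train_end, validation_end = split_offsets(len(toy_rows))

toy_train = toy_rows[:train_end]
toy_validation = toy_rows[train_end:validation_end]
toy_test = toy_rows[validation_end:]

print(len(toy_train), len(toy_validation), len(toy_test))
```

A quick sanity check is that the three lengths add up to the total number of samples.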

### Define the hyperparameters

In [10]:
hidden_layer_size = 100
max_epochs = 100
patience = 3
number_of_layers = 5
layer_activation_algo = 'relu'
optimizer_algo = 'adam'
loss_algo = 'categorical_crossentropy'

### Create the model

In [11]:
output_size = len(targets.columns)
layers = []

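for _ in range(number_of_layers):
    # append each hidden layer so it actually becomes part of the model
    hidden_layer = tf.keras.layers.Dense(hidden_layer_size, activation=layer_activation_algo)
    layers.append(hidden_layer)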

output_layer = tf.keras.layers.Dense(output_size, activation='softmax')
layers.append(output_layer)

model = tf.keras.Sequential(layers)

model.compile(optimizer=optimizer_algo, loss=loss_algo, metrics=['accuracy'])
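
As a rough cross-check of the architecture the hyperparameters describe (five hidden layers of 100 units feeding a 4-way softmax), the number of trainable parameters can be computed by hand. This sketch assumes 13 input columns, which is the count shown in the preprocessed table above. The function `dense_parameter_count` is a hypothetical helper.

```python
def dense_parameter_count(input_size, hidden_layer_size, number_of_layers, output_size):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous_size = input_size
    for _ in range(number_of_layers):
        # each unit has one weight per incoming value plus one bias
        total += (previous_size + 1) * hidden_layer_size
        previous_size = hidden_layer_size
    # output layer
    total += (previous_size + 1) * output_size
    return total


# Assumption: 13 input features, 5 hidden layers of 100 units, 4 output classes.
intended_parameter_count = dense_parameter_count(13, 100, 5, 4)
print(intended_parameter_count)
```

Comparing this figure against `model.count_params()` after training is one way to confirm that every intended layer made it into the model.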



## Train the model

In [12]:
early_stopping = tf.keras.callbacks.EarlyStopping(patience=patience)
model.fit(train_inputs,
          train_targets,
          epochs=max_epochs,
          validation_data=(validation_inputs, validation_targets),
          callbacks=[early_stopping],
          verbose=1)



Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100


<keras.callbacks.History at 0x154fbea60>
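
Training ended well before the 100-epoch limit because of the early-stopping callback. The idea behind `patience` can be sketched in plain Python: stop once the validation loss has failed to improve for that many epochs in a row. This is a simplified illustration with made-up loss values, not Keras's implementation.

```python
def early_stopping_epoch(validation_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # new best: remember it and reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses; the last value is never reached.
toy_losses = [1.0, 0.9, 0.8, 0.85, 0.81, 0.82, 0.7]
stop_epoch = early_stopping_epoch(toy_losses, patience=3)
print(stop_epoch)
```

A small patience stops training sooner but can give up just before a late improvement, as the final value in the toy list shows.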

## Test the model

In [13]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)

print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.58. Test accuracy: 74.82%
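
A single accuracy number can hide how the model does on each income bucket. The sketch below computes per-class recall from one-hot targets and predicted probabilities with NumPy. The function `per_class_recall` and the toy arrays are illustrative assumptions.

```python
import numpy as np


def per_class_recall(true_one_hot, predicted_probabilities):
    """Fraction of each true class that was predicted correctly."""
    true_labels = np.argmax(true_one_hot, axis=1)
    predicted_labels = np.argmax(predicted_probabilities, axis=1)
    recalls = []
    for class_index in range(true_one_hot.shape[1]):
        mask = true_labels == class_index
        if mask.any():
            recalls.append(float(np.mean(predicted_labels[mask] == class_index)))
        else:
            recalls.append(float('nan'))  # class absent from the sample
    return recalls


# Hypothetical two-class example.
toy_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
toy_predicted = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]])
toy_recalls = per_class_recall(toy_true, toy_predicted)
print(toy_recalls)
```

In this notebook the inputs would presumably be `test_targets.to_numpy()` and the output of `model.predict(test_inputs)`.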


## Post-mortem

This experimental dataset does not seem to represent the population as a whole. A large majority of the samples were not working and therefore had no income. Furthermore, of the remaining set, almost 73% were in the poverty bucket. Finally, the selected inputs showed little predictive value: the model's accuracy only reaches around 75%, barely above the roughly 73% that always predicting the poverty bucket would achieve.
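
To put the accuracy figure in context, it can be compared with a majority-class baseline, meaning the accuracy of always predicting the most common bucket. The sketch below uses made-up targets in which 73 of 100 rows fall in the first class. The remaining shares are invented for illustration.

```python
import numpy as np


def majority_class_accuracy(one_hot_targets):
    """Accuracy of always predicting the most frequent class."""
    class_shares = np.asarray(one_hot_targets, dtype=float).mean(axis=0)
    return float(class_shares.max())


# Hypothetical targets: 73 poverty, 20 low, 5 middle, 2 high.
toy_targets = np.array(
    [[1, 0, 0, 0]] * 73 + [[0, 1, 0, 0]] * 20 + [[0, 0, 1, 0]] * 5 + [[0, 0, 0, 1]] * 2
)
toy_baseline = majority_class_accuracy(toy_targets)
print('Baseline accuracy: {0:.2f}%'.format(toy_baseline * 100.))
```

Applied to `test_targets`, the same function would give the number the model has to beat.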