# Estimating Income

weaves

This is for World Remit. It's their data exercise: estimating the income of individuals. The dataset is large in both width (41 columns) and length (hundreds of thousands of rows). I've only developed a basic model.

I've added some "Insight" steps: observations about the dataset that simplify it.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

from sklearn import svm
from sklearn import datasets

%load_ext autoreload
%autoreload 2

pd.__version__

'0.24.2'

In [2]:
# Read the CSV
df0 = pd.read_csv('cache/bak/data.csv')

In [3]:
# Most fields are strings (object dtype); the rest are int64
df0.columns.to_series().groupby(df0.dtypes).groups

{dtype('int64'): Index(['age', 'detailed_industry_recode', 'detailed_occupation_recode',
        'wage_per_hour', 'capital_gains', 'capital_losses',
        'dividends_from_stocks', 'num_persons_worked_for_employer',
        'own_business_or_self_employed', 'veterans_benefits',
        'weeks_worked_in_year', 'year'],
       dtype='object'),
 dtype('O'): Index(['class_of_worker', 'education', 'enroll_in_edu_inst_last_wk',
        'marital_stat', 'major_industry_code', 'major_occupation_code', 'race',
        'hispanic_origin', 'sex', 'member_of_a_labor_union',
        'reason_for_unemployment', 'full_or_part_time_employment_stat',
        'tax_filer_stat', 'region_of_previous_residence',
        'state_of_previous_residence', 'detailed_household_and_family_stat',
        'detailed_household_summary_in_household',
        'migration_code_change_in_msa', 'migration_code_change_in_reg',
        'migration_code_move_within_reg', 'live_in_this_house_1_year_ago',
        'migration_prev_res_

## Insight 1 : children
Children are denoted by education = 'Children' and always have an
income of less than 50k, so they can be removed before modelling.
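
A claim like this can be checked with a cross-tabulation before any rows are dropped. The sketch below uses a small made-up frame, not the real data: the `toy` frame and the income labels `'<50k'` and `'>=50k'` are assumptions, since the actual values of `income_binned` are not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for df0. The income labels are hypothetical.
toy = pd.DataFrame({
    "education": ["Children", "Children", "High school graduate",
                  "Bachelors degree(BA AB BS)", "Children"],
    "income_binned": ["<50k", "<50k", "<50k", ">=50k", "<50k"],
})

# Cross-tabulate "is a child" against the income bin.
is_child = toy["education"] == "Children"
table = pd.crosstab(is_child, toy["income_binned"])
print(table)

# The insight holds if the child rows fall into exactly one income bin.
child_bins = toy.loc[is_child, "income_binned"].unique()
print(child_bins)
```

On the real data, the same two statements with `df0` in place of `toy` would show whether any child row falls outside the lower bin.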

In [4]:
df0.shape
df0 = df0[df0.education != 'Children']
df0.shape

(228421, 41)

In [5]:
df0[df0.select_dtypes(['object']).columns] = df0.select_dtypes(['object']).apply(lambda x: x.astype('category'))

In [6]:
# The groupby-on-dtypes expression used earlier doesn't work for category columns, so use info()
df0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 228421 entries, 0 to 299284
Data columns (total 41 columns):
age                                           228421 non-null int64
class_of_worker                               228421 non-null category
detailed_industry_recode                      228421 non-null int64
detailed_occupation_recode                    228421 non-null int64
education                                     228421 non-null category
wage_per_hour                                 228421 non-null int64
enroll_in_edu_inst_last_wk                    228421 non-null category
marital_stat                                  228421 non-null category
major_industry_code                           228421 non-null category
major_occupation_code                         228421 non-null category
race                                          228421 non-null category
hispanic_origin                               227391 non-null category
sex                                           228

In [7]:
# Store the categories
cats0 = dict([ (x, tuple(df0[x].cat.categories)) for x in df0.select_dtypes(['category']).columns ])
cats0
# I've displayed these so that I can see what to investigate and refine away.

{'class_of_worker': ('Federal government',
  'Local government',
  'Never worked',
  'Not in universe',
  'Private',
  'Self-employed-incorporated',
  'Self-employed-not incorporated',
  'State government',
  'Without pay'),
 'education': ('10th grade',
  '11th grade',
  '12th grade no diploma',
  '1st 2nd 3rd or 4th grade',
  '5th or 6th grade',
  '7th and 8th grade',
  '9th grade',
  'Associates degree-academic program',
  'Associates degree-occup /vocational',
  'Bachelors degree(BA AB BS)',
  'Doctorate degree(PhD EdD)',
  'High school graduate',
  'Less than 1st grade',
  'Masters degree(MA MS MEng MEd MSW MBA)',
  'Prof school degree (MD DDS DVM LLB JD)',
  'Some college but no degree'),
 'enroll_in_edu_inst_last_wk': ('College or university',
  'High school',
  'Not in universe'),
 'marital_stat': ('Divorced',
  'Married-A F spouse present',
  'Married-civilian spouse present',
  'Married-spouse absent',
  'Never married',
  'Separated',
  'Widowed'),
 'major_industry_code': ('A

In [8]:
# Make a copy and quantify the categories by using the category codes.
df2 = df0.copy(deep=True)
df2[df2.select_dtypes(['category']).columns] = df2.select_dtypes(['category']).apply(lambda x: x.cat.codes)
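
Two properties of `cat.codes` are worth keeping in mind here. Codes follow the sorted order of the category labels, so the integers carry no real ordering, and missing values are encoded as -1 (the `info()` output shows that `hispanic_origin` has some missing entries). A minimal sketch on a made-up column, where the name `region` and its labels are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
toy = pd.DataFrame({"region": ["north", np.nan, "south", "north"]})
toy["region"] = toy["region"].astype("category")

# Categories are sorted alphabetically; codes index into that list.
categories = list(toy["region"].cat.categories)
codes = toy["region"].cat.codes

print(categories)      # ['north', 'south']
print(codes.tolist())  # [0, -1, 1, 0]  (the NaN becomes -1)
```

Whether -1 is an acceptable stand-in for "missing" depends on the model. This only shows what the conversion does.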

In [9]:
# Obtain the outcome as y and the features as X.
# Note: this uses the values property to get the underlying matrix.
y = df2.income_binned.values
df2.drop(columns=['income_binned'], inplace=True)
X = df2.values

## Model Choice
These are just a couple of models to evaluate. Choose one cell and skip the others, or run all the cells, in which case only the last assignment to clf is used.

The first, a linear support vector classifier, is not practical for a dataset of this size (width and length), but works well for smaller ones (10 features by 100 samples is workable).

The Multi-Layer Perceptron classifier is a simple, useful neural network that can get reasonable results quickly because it parallelizes well. The hidden_layer_sizes can be increased to get a cross-validation score over 0.5: (10, 2) gives a reasonable result of about 60%, and the one used here, (6, 6, 2), adds another hidden layer and gives about 90%.

clf = svm.SVC(kernel='linear', C=1)

In [10]:
clf =  MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(6, 6, 2), 
                     random_state=1, warm_start=True)

To evaluate the model, I'll use five-fold cross-validation, which fits and scores the model on 5 different train/test splits.
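
To make the splitting concrete: when `cross_val_score` is given an integer `cv` and a classifier, scikit-learn uses stratified folds, so each test fold keeps roughly the class proportions of `y`. The sketch below reproduces that splitting on a tiny made-up array (`X_toy` and `y_toy` are illustrative, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Ten toy samples, two classes of five each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

splitter = StratifiedKFold(n_splits=5)

# Each fold holds out one fifth of the rows for testing.
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in splitter.split(X_toy, y_toy)]
print(fold_sizes)
```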

Run the following cell for whichever model you use.

In [11]:
scores = cross_val_score(clf, X, y, cv=5)
scores

array([0.91870417, 0.91870417, 0.91870417, 0.9187225 , 0.9187225 ])

These cross-validation scores for the MLP classifier are very good for a relatively unrefined model.
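
The five fold scores are nearly identical. One possible explanation, an assumption rather than something this notebook confirms, is that the classifier predicts the same class for every row, in which case its accuracy would simply equal the share of the majority class. Comparing against a majority-class baseline is a cheap way to check. The sketch uses synthetic data with an arbitrary 8% positive rate, not the real class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced data: 80 positives out of 1000 (an arbitrary choice).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.zeros(1000, dtype=int)
y_toy[:80] = 1

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=5)
print(baseline_scores)
```

Running the same baseline on the real `X` and `y` would show how much of the MLP's score is learned rather than inherited from class imbalance.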

## Notes

I didn't need to do any programming for this model. Usually, a lot of work is needed to categorize features, and scaling is often needed for parametric models. Note that scikit-learn's MLPClassifier does not scale its inputs itself, and its documentation recommends standardizing the features, so the unscaled run here is a shortcut rather than best practice.
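
Scaling can be made explicit by putting a scaler in front of the MLP in a pipeline, so that it is fitted on the training part of each fold only. This is a minimal sketch on synthetic data from `make_classification`. The MLP settings mirror the ones used above, while `max_iter=500` and the exaggerated first feature are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is blown up to mimic mixed scales.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_toy[:, 0] *= 1000.0

# The scaler is refitted inside every cross-validation fold.
scaled_clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(6, 6, 2),
                  random_state=1, max_iter=500),
)
scaled_scores = cross_val_score(scaled_clf, X_toy, y_toy, cv=5)
print(scaled_scores.mean())
```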

Removing the children greatly improves the model's performance. This sort of insight often comes from extensive cluster analysis.

MLP classifiers are opaque: it is difficult to get a sense of feature importance from them. To use a more interpretable model, such as a Random Forest, which gives a variable importance, one would have to do a lot more work simplifying the dataset: many features could be dropped ("year" doesn't look useful), and many could be binned and simplified (the "Education" and residency ones would be candidates).
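
For reference, this is roughly what reading variable importance from a Random Forest looks like in scikit-learn. It is a sketch on synthetic data with hypothetical feature names (`feature_0` to `feature_5`) and has not been run on the income dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 3 are informative.
X_toy, y_toy = make_classification(n_samples=500, n_features=6,
                                   n_informative=3, random_state=0)
names = [f"feature_{i}" for i in range(X_toy.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_toy, y_toy)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased towards features with many distinct values, which is another reason to bin or drop columns first.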

Another useful partition might be separate models for male and female: it is well known that women are still paid less than men for similar jobs.