# Balance survey data
Assign a weight to each survey respondent so that the weighted sample matches the census on gender, age, household income (hhi), race, and ethnicity.

1. Load data
2. Impute missing demo values based on baseline frequencies
3. Create population hooks in the survey data
4. Create an axis with all possible demographic categories
5. Join survey data and population targets to that axis, filling gaps with zeros
6. Find weights
7. Join weights with survey data and save

In [493]:
import numpy as np
import pandas as pd


In [496]:
# load data
survey_data = pd.read_csv('../data/processed/data_2019_preprocessed.csv')
population_targets = pd.read_csv('../data/processed/target_populations.csv')

# trim columns and rename
demo_cols = ['d01_gender',
             'd02_age',
             'race_hooks',
             'd04_ethnicity',
             'd08_hhi_buckets']


# .copy() gives an independent frame, so later edits don't trigger SettingWithCopyWarning
demo_data = survey_data[demo_cols].copy()
demo_data = demo_data.rename(columns={'d01_gender': 'gender',
                                      'd02_age': 'age',
                                      'race_hooks': 'race',
                                      'd04_ethnicity': 'ethnicity',
                                      'd08_hhi_buckets': 'hhi'})


In [497]:
# set "No Answer" to null so we can impute easily
demo_data = demo_data.replace({"No Answer": np.nan})


In [499]:
def impute_by_sampled_frequency(df, col):
    """Impute nulls by sampling according to the frequencies present in the data.
    Modifies the df in place."""

    # relative frequency of each observed (non-null) category
    s = df[col].value_counts(normalize=True)
    missing = df[col].isnull()
    # draw one replacement per missing row, weighted by observed frequency
    df.loc[missing, col] = np.random.choice(s.index, size=missing.sum(), p=s.values)
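
A quick check on a toy column can show what this function does. The sketch below redefines the function so that it runs on its own; the `toy` frame, its values, and the seed are made up for illustration and are not part of the survey data.

```python
import numpy as np
import pandas as pd


def impute_by_sampled_frequency(df, col):
    """Fill nulls in df[col] by sampling from the observed value frequencies."""
    freqs = df[col].value_counts(normalize=True)
    missing = df[col].isnull()
    df.loc[missing, col] = np.random.choice(
        freqs.index, size=missing.sum(), p=freqs.values
    )


# Hypothetical toy data: 3 'Female', 1 'Male', 2 missing.
toy = pd.DataFrame({"gender": ["Female", "Female", "Female", "Male", np.nan, np.nan]})

np.random.seed(0)  # arbitrary seed, only to make the draw repeatable
impute_by_sampled_frequency(toy, "gender")

# No nulls should remain, and only categories already present are used.
n_missing_after = int(toy["gender"].isnull().sum())
observed_values = set(toy["gender"])
print(n_missing_after, observed_values)
```

Because the draw is random, repeated runs without a fixed seed can give slightly different imputed values, and therefore slightly different weights downstream.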

In [500]:
# for any missing value, substitute according to the sample frequencies
impute_by_sampled_frequency(demo_data, 'gender')
impute_by_sampled_frequency(demo_data, 'hhi')
impute_by_sampled_frequency(demo_data, 'race')
impute_by_sampled_frequency(demo_data, 'ethnicity')
impute_by_sampled_frequency(demo_data, 'age')

In [501]:
# change ethnicity column from binary to string
demo_data['ethnicity'] = demo_data['ethnicity'].apply(lambda x: 'hispanic' if x==1.0 else 'not_hispanic')



In [502]:
# get race x ethnicity combinations.

demo_data['race_ethnicity'] = demo_data.apply(lambda row: "({0}, {1})".format(row['race'], row['ethnicity']), axis=1)
demo_data.drop(['race', 'ethnicity'], axis=1, inplace=True)



In [503]:
demo_data['hhi'] = demo_data['hhi'].astype(int).astype(str)



In [504]:
survey_demo_counts = pd.get_dummies(demo_data)
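
For readers unfamiliar with `pd.get_dummies`, here is a small sketch of the indicator matrix it builds. The three-row frame is invented for illustration. Each output column is named `<column>_<category>`, which is why the population targets use labels such as `hhi_1` and `gender_Male`.

```python
import pandas as pd

# Hypothetical respondents; the real demo_data has more columns and rows.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "hhi": ["1", "2", "2"],
})

# dtype=int asks for 0/1 integers; recent pandas versions return booleans by default.
toy_counts = pd.get_dummies(toy, dtype=int)

print(toy_counts.columns.tolist())
print(toy_counts.sum().to_dict())
```

Each row has exactly one 1 per original column, so the column sums are the unweighted counts of respondents in each category.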

## Population balancing

In [505]:
# index the targets by demographic category so they can be aligned with the survey columns
population_targets = population_targets.set_index('demo')


In [506]:
# Each count is the group's % share multiplied by the estimated adult population of Somerville.
# We're going to assign weights to each participant such that these numbers are approximately met.
population_targets

| demo | count |
|---|---|
| hhi_1 | 3059 |
| hhi_2 | 7307 |
| hhi_3 | 8132 |
| hhi_4 | 9511 |
| hhi_5 | 10430 |
| hhi_6 | 15904 |
| hhi_7 | 9820 |
| hhi_8 | 10833 |
| gender_Male | 37312 |
| gender_Female | 37312 |

In [507]:
population_targets = population_targets.loc[survey_demo_counts.columns]
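
Note that `.loc` with a list of labels raises a `KeyError` if any survey column is missing from the targets. One possible way to follow the "fill with zeros" step from the outline is `reindex` with `fill_value=0`. The sketch below uses made-up labels and counts, and it is an alternative under those assumptions, not what the cell above does.

```python
import pandas as pd

# Hypothetical targets and survey columns, for illustration only.
toy_targets = pd.DataFrame(
    {"count": [100, 200]},
    index=pd.Index(["gender_Female", "gender_Male"], name="demo"),
)
survey_columns = ["gender_Female", "gender_Male", "gender_Nonbinary"]

# reindex aligns to the survey columns and gives categories without a target a count of 0.
aligned = toy_targets.reindex(survey_columns, fill_value=0)
print(aligned)
```

A target of zero is itself a constraint on the weights, so whether to fill with zero or drop the category is a modelling choice.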

In [509]:
# we may want to choose a subset of categories if we find that we're overconstrained.
cols = [
    'gender_Female',
    'gender_Male',
    'gender_Nonbinary',
    'age_17 Years',
    'age_18 to 24 Years',
    'age_25 to 34 Years',
    'age_35 to 44 Years',
    'age_45 to 54 Years',
    'age_55 to 64 Years',
    'age_65 to 74 Years',
    'age_75 Years & Over',
    'hhi_1',
    'hhi_2',
    'hhi_3',
    'hhi_4',
    'hhi_5',
    'hhi_6',
    'hhi_7',
    'hhi_8',
    # 'race_ethnicity_(aa, hispanic)',
    # 'race_ethnicity_(aa, not_hispanic)',
    # 'race_ethnicity_(asian, hispanic)',
    # 'race_ethnicity_(asian, not_hispanic)',
    # 'race_ethnicity_(other, not_hispanic)',
    # 'race_ethnicity_(two_or_more, hispanic)',
    # 'race_ethnicity_(two_or_more, not_hispanic)',
    # 'race_ethnicity_(white, hispanic)',
    # 'race_ethnicity_(white, not_hispanic)',
]

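We're going to get weights by solving a linear inverse problem. This is equivalent to a least-squares (OLS) problem in which the minimum-norm solution is chosen.

The problem looks like:

$$wX = T$$

where $w$ is the $1 \times N$ vector of weights, $X$ is the $N \times M$ matrix of survey participant demo data, and $T$ is the $1 \times M$ vector of population target numbers.

$w$ is then given by

$$w = T X^{+}$$

Here, $X^{+}$ is the Moore-Penrose pseudo-inverse of $X$. `np.linalg.pinv` computes it from the SVD and discards very small singular values, which acts as a mild regularization.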
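
A tiny worked example may make the shapes concrete. The matrix and targets below are invented: four respondents covering every gender by age combination, with targets that are mutually consistent (both gender and age sum to 100), so the weighted totals can match exactly.

```python
import numpy as np

# Hypothetical indicator matrix X: 4 respondents (rows) by 4 categories (columns),
# ordered as gender_Female, gender_Male, age_young, age_old.
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
], dtype=float)

# Hypothetical population targets T, one per category.
T = np.array([60.0, 40.0, 70.0, 30.0])

# w = T X^+ : the minimum-norm least-squares solution of wX = T.
w = T @ np.linalg.pinv(X)

# Weighted category totals implied by w; these should reproduce T here
# because the targets are consistent with each other.
reproduced = w @ X
print(np.round(w, 3))
print(np.allclose(reproduced, T))
```

When the targets are not consistent, or when a category has very few respondents, the least-squares solution can only approximate $T$ and may contain negative weights.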

In [511]:

# find the pseudo-inverse of X
survey_counts_inverse = np.linalg.pinv(survey_demo_counts[cols])

# calculate the weight vector.
weights = np.dot(population_targets.loc[cols]['count'], survey_counts_inverse)

In [512]:
weights.min()

-32.49148466193866

In [513]:
# We have a small number of negative weights which we will set to zero
weights[weights < 0] = 0
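
Setting negative weights to zero moves the weighted totals away from the targets, so it is worth checking how far. The toy numbers below are invented to produce exactly one negative weight, and `np.clip` is used as an equivalent to the boolean-mask assignment above.

```python
import numpy as np

# Hypothetical 4-respondent indicator matrix:
# columns are gender_Female, gender_Male, age_young, age_old.
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
], dtype=float)
# Hypothetical targets chosen so the unconstrained solution needs a negative weight.
T = np.array([60.0, 40.0, 95.0, 5.0])

raw = T @ np.linalg.pinv(X)
clipped = np.clip(raw, 0, None)  # same effect as raw[raw < 0] = 0

# Percent error per category: 100 * (1 - weighted total / target).
pct_error_raw = 100 * (1 - (raw @ X) / T)
pct_error_clipped = 100 * (1 - (clipped @ X) / T)
print(np.round(raw, 2))
print(np.round(pct_error_clipped, 2))
```

In this toy case the unclipped weights hit every target, while clipping introduces error in the categories that the negatively weighted respondent belongs to. Small categories are affected most in relative terms.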

In [518]:
# check how close we are
res = population_targets.loc[cols].copy()
res['weighted_survey_pop'] = np.dot(weights, survey_demo_counts[cols])

# percent error relative to the target (negative means the weighted survey overshoots)
res['pct_error'] = 100 * (1 - res['weighted_survey_pop'] / res['count'])

In [519]:
res

| demo | count | weighted_survey_pop | pct_error |
|---|---|---|---|
| gender_Female | 37312 | 37553.334934 | -0.646802 |
| gender_Male | 37312 | 37311.714286 | 0.000766 |
| gender_Nonbinary | 375 | 432.424739 | -15.313264 |
| age_17 Years | 534 | 534.017857 | -0.003344 |
| age_18 to 24 Years | 12005 | 12005.017857 | -0.000149 |
| age_25 to 34 Years | 28048 | 28048.017857 | -6.4e-05 |
| age_35 to 44 Years | 12139 | 12139.731583 | -0.006027 |
| age_45 to 54 Years | 7414 | 7414.017857 | -0.000241 |
| age_55 to 64 Years | 7249 | 7255.39338 | -0.088197 |
| age_65 to 74 Years | 4351 | 4360.622097 | -0.221147 |


We see that we're a bit off on very underrepresented groups, such as `gender_Nonbinary` (about 15% over target).

In [486]:
# assign weights
survey_data['weight'] = weights

In [491]:
survey_data.to_csv('../data/processed/weighted_survey_data.csv', index=False)