*Notes.* For now, we use [Folktables](https://github.com/socialfoundations/folktables)' ACSIncome prediction task to explore the weighted (causal) partial dependence plot (PDP). I am picturing two possible approaches: 

(1) For a given year, get all 50 states and, like DADT, look at train state vs. test state comparisons. Similarly, train on all states (or a group of states) and focus on drawing insights for some state(s).

(2) For a given state (or set of states), train on one year and test on the next year. For instance, we could look at pre- and post-Covid times.

Overall, the ML pipeline is to find the best-performing classifier for the data, where the classifier is a black box. Through the wPDP we would then show the usefulness of the PDP even under the threat of, say, sample selection or any other sort of distribution shift.

The idea is that we train the (black-box) classifier $b()$ on $D_S$ and deploy it on $D_T$, where $P_{D_S}(Y, X) \neq P_{D_T}(Y, X)$. We can then use $b()$ for drawing policy-relevant decisions. Since it is a black box, we are inclined to use the PDP to draw the effects of each $X$ on $Y$. We can, in principle, use $b()$ as is, but there is the risk of bias as the weights are relative to $P_{D_S}$: we want to re-weight the PDP to approach the weights under $P_{D_T}$ without having to retrain $b()$.
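
As a minimal sketch of the re-weighting idea (not the final estimator), the snippet below computes a PDP in which each source row carries a weight. It assumes the weights approximate the density ratio $p_{D_T}(x_{-j}) / p_{D_S}(x_{-j})$ over the features other than the one being plotted. Here the ratio is known by construction; in practice it would have to be estimated. The names `weighted_pdp`, `b`, `X_S` and the 0.8 target share are hypothetical and only serve the toy example.

```python
import numpy as np


def weighted_pdp(predict, X, feature, grid, weights=None):
    """Weighted partial dependence of `predict` on column `feature`.

    For each grid value v, overwrite X[:, feature] with v in every row,
    predict, and take the weighted average of the predictions.
    With weights=None this is the usual (unweighted) PDP.
    """
    X = np.asarray(X, dtype=float)
    if weights is None:
        weights = np.ones(X.shape[0])
    weights = np.asarray(weights, dtype=float)
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        curve.append(np.average(predict(X_mod), weights=weights))
    return np.array(curve)


# Toy source sample: one continuous feature and one binary group indicator.
rng = np.random.default_rng(0)
X_S = np.column_stack([rng.normal(size=1000), rng.integers(0, 2, size=1000)])


def b(X):
    # Stand-in for the black box (hypothetical): the effect of feature 0
    # depends on the group, so the PDP depends on the group shares.
    return X[:, 0] * (1.0 + X[:, 1])


# Assumed shift: the group share is share_S in the source and 0.8 in the target.
share_S = X_S[:, 1].mean()
share_T = 0.8
w = np.where(X_S[:, 1] == 1, share_T / share_S, (1 - share_T) / (1 - share_S))

grid = np.array([0.0, 1.0, 2.0])
pdp_S = weighted_pdp(b, X_S, 0, grid)       # weights relative to P_{D_S}
pdp_T = weighted_pdp(b, X_S, 0, grid, w)    # re-weighted towards P_{D_T}
```

In this toy case the unweighted curve has slope `1 + share_S`, while the re-weighted curve has the slope `1 + share_T` that the target population would give, and $b()$ is never retrained.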

Now, here I want to push for the role of PDP overall with black-box models. I would argue that the application is similar to how, say, economists use *ceteris paribus* for interpreting regression coefficients... I think we can make the case for a linear model (a simple OLS), then one with interaction terms and/or quadratic terms, ... train the model, estimate the betas, and look at the PDP... **can we draw a heuristic here?**
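
A quick way to probe this question on simulated data (a sketch only; the data-generating process, coefficients, and the helper `pdp_slope` are all made up for illustration): for a purely additive OLS the PDP of a feature is a line whose slope is that feature's beta, whereas with an interaction term the slope becomes the beta plus the interaction coefficient times the sample mean of the other feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(loc=[0.0, 1.0], size=(n, 2))
noise = rng.normal(scale=0.1, size=n)


def pdp_slope(predict, X, feature, grid):
    """Slope of the PDP between the first and last grid points."""
    pdp = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        pdp.append(predict(X_mod).mean())
    return (pdp[-1] - pdp[0]) / (grid[-1] - grid[0])


grid = np.linspace(-2.0, 2.0, 5)

# (a) Additive model: y = 1 + 2*x0 - 0.5*x1 + noise
y_lin = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + noise
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y_lin, rcond=None)
slope_lin = pdp_slope(lambda Z: beta[0] + Z @ beta[1:], X, 0, grid)

# (b) Model with an interaction: y = 1 + 2*x0 - 0.5*x1 + 1.5*x0*x1 + noise
y_int = y_lin + 1.5 * X[:, 0] * X[:, 1]
A_int = np.column_stack([np.ones(n), X, X[:, 0] * X[:, 1]])
gamma, *_ = np.linalg.lstsq(A_int, y_int, rcond=None)


def predict_int(Z):
    return gamma[0] + Z @ gamma[1:3] + gamma[3] * Z[:, 0] * Z[:, 1]


slope_int = pdp_slope(predict_int, X, 0, grid)

print("additive:    PDP slope", slope_lin, "vs beta_0", beta[1])
print("interaction: PDP slope", slope_int,
      "vs beta_0 + beta_int * mean(x1)", gamma[1] + gamma[3] * X[:, 1].mean())
```

Case (b) is where the re-weighting matters: the PDP slope depends on the mean of the other feature, which is taken under the distribution of the data used to compute the PDP.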

If that's the case, we emphasize how important PDP is to $b()$ and also how the re-weighting could help... for this last point, we should then link wPDP with the steps of sample selection where we re-weight the coefficients! Maybe like Imbens's or Heckman's paper, use a simple (structural) model to formulate the problem in a simple way. Such an approach will help us to link the wPDP with structural causal models.

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
from folktables import ACSDataSource, ACSIncome

# all US states
states = sorted(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 
                 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 
                 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 
                 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 
                 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'])

# state codes from folktables:
_STATE_CODES = {'AL': '01', 'AK': '02', 'AZ': '04', 'AR': '05', 'CA': '06',
                'CO': '08', 'CT': '09', 'DE': '10', 'FL': '12', 'GA': '13',
                'HI': '15', 'ID': '16', 'IL': '17', 'IN': '18', 'IA': '19',
                'KS': '20', 'KY': '21', 'LA': '22', 'ME': '23', 'MD': '24',
                'MA': '25', 'MI': '26', 'MN': '27', 'MS': '28', 'MO': '29',
                'MT': '30', 'NE': '31', 'NV': '32', 'NH': '33', 'NJ': '34',
                'NM': '35', 'NY': '36', 'NC': '37', 'ND': '38', 'OH': '39',
                'OK': '40', 'OR': '41', 'PA': '42', 'RI': '44', 'SC': '45',
                'SD': '46', 'TN': '47', 'TX': '48', 'UT': '49', 'VT': '50',
                'VA': '51', 'WA': '53', 'WV': '54', 'WI': '55', 'WY': '56',
                'PR': '72'}

# data folder
data_source = ACSDataSource(survey_year='2017', horizon='1-Year', survey='person', root_dir=os.path.join('..', 'data'))

In [None]:
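# TODO: run later
# NOTE: `bible_belt` is not defined anywhere in this notebook; set it to a list of
# state abbreviations (a subset of `states`) before running this cell.
for state in bible_belt: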
    print(state)
    
    data = data_source.get_data(states=[state], download=True)
    features, labels, _ = ACSIncome.df_to_numpy(data)
    
    df = pd.DataFrame(features, columns=ACSIncome.features)
    df['Y'] = labels
    df['STATE'] = state
    
    # count only the ACSIncome features (df also holds the 'Y' and 'STATE' columns)
    print("{} features for {} individuals".format(len(ACSIncome.features), df.shape[0]))
    
    df.to_csv(os.path.join('..', 'data', '{}_adult.csv'.format(state.lower())), sep='|', index=False)
    
    del data, features, labels, df
    
print('DONE')

In [None]:
# TODO: for a given state, train an OLS and then use the PDP to see if it serves as a heuristic for the betas
# also, consider Loftus's paper: maybe it's the way to get the non-linear relationships!