## Use Case 1

Teaching recommendation: economic activity (restricted to the economically active) aggregated by occupation and gender. This three-way table is taken from pages 11 and 12 of the teaching advice.

The use case: restrict to records whose `Economic Activity` is either "Economically active: Employee" (1) or "Economically active: Self-employed" (2), then compute the gender gap for each `Occupation`.

(This notebook uses the same structure as the other use case notebooks).

In [1]:
import itertools
import json
import numpy as np
import os
import pandas as pd
import pickle
import tqdm
import tqdm.notebook

In [2]:
# Loads of warnings coming from the sub-methods, which we don't care about.
import warnings
warnings.filterwarnings("ignore")

In [3]:
from utils import load_data, load_synthetic_datasets, metadata, custom_ipf, custom_mst

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Load the data.

In [5]:
teaching_file = load_data(train=True, test=True)

In [6]:
output_size = len(teaching_file)

### Use case 

What columns are used in the task? What marginals should be preserved for this task?

In [7]:
columns = ["Economic Activity", "Occupation", "Sex"]

In [8]:
metamap = {m["name"]: m for m in metadata}

In [9]:
list_values = [metamap[c]["representation"] for c in columns]

In [10]:
num_cat = [len(c) for c in list_values]

We use every 2-way marginal between the three columns; each tuple below is a pair of indices into `columns`.

In [11]:
acceptable_marginals = [(0, 1), (0, 2), (1, 2)]
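
To make the pairs concrete, here is a minimal sketch of what each 2-way marginal looks like on a toy table. It assumes that each tuple indexes two entries of `columns`; the toy records, `toy_columns`, and the helper `two_way_marginal` are hypothetical and are not part of `utils`.

```python
import pandas as pd

# Toy records with the same three column names; the codes are made up.
toy = pd.DataFrame({
    "Economic Activity": ["1", "1", "2", "1"],
    "Occupation": ["3", "4", "3", "3"],
    "Sex": ["1", "2", "2", "1"],
})
toy_columns = ["Economic Activity", "Occupation", "Sex"]
pairs = [(0, 1), (0, 2), (1, 2)]


def two_way_marginal(df, cols, pair):
    """Joint distribution (cells sum to 1) of the two columns indexed by `pair`."""
    a, b = pair
    return pd.crosstab(df[cols[a]], df[cols[b]], normalize=True)


toy_marginals = {pair: two_way_marginal(toy, toy_columns, pair) for pair in pairs}
print(toy_marginals[(0, 2)])
```

Each table holds the share of records falling in every combination of the two columns, which is the kind of statistic a marginal-based generator is asked to preserve.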

Define the task of interest (analysis on data).

In [12]:
def task(dataset):
    # Relies on the module-level `columns`, `list_values` and `num_cat` defined above.

    # Task 1: compute the three-way marginal.
    dataset = dataset.astype(str)
    marginal = np.zeros(num_cat)
    for ix, x in enumerate(list_values[0]):
        for iy, y in enumerate(list_values[1]):
            for iz, z in enumerate(list_values[2]):
                marginal[ix,iy,iz] = np.mean(
                    (dataset[columns[0]] == x) & (dataset[columns[1]] == y) & (dataset[columns[2]] == z)
                )
    
    # Task 2: compute the gender gap for each occupation.
    # Restrict to economically active people.
    df = dataset[(dataset[columns[0]] == '1') | (dataset[columns[0]] == '2')]
    # For each occupation, compute the gender gap.
    gender_gap = []
    for occupation in list_values[1]:
        df_occ = df[df.Occupation == occupation]
        if len(df_occ) == 0:
            gender_gap.append(0)
            continue
        prop_m = np.mean(df_occ.Sex == '1')
        prop_f = np.mean(df_occ.Sex == '2')
        gender_gap.append(prop_m - prop_f)
    
    # Return the three-way marginal and the per-occupation gender gaps.
    return marginal, np.array(gender_gap)
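
The gender-gap half of `task` can also be written with a `groupby`. The following is a sketch on a toy table rather than a replacement for the function above. It assumes the same string codes as `task` ('1' and '2' for the economically active categories, '1' for male and '2' for female); the helper name `gender_gap_by_occupation` and the toy records are hypothetical.

```python
import pandas as pd

# Toy records; codes follow the conventions used in `task`.
toy = pd.DataFrame({
    "Economic Activity": ["1", "1", "2", "1", "5"],
    "Occupation": ["3", "3", "3", "4", "4"],
    "Sex": ["1", "1", "2", "2", "1"],
})


def gender_gap_by_occupation(df, occupations):
    """Proportion male minus proportion female among the economically active."""
    active = df[df["Economic Activity"].isin(["1", "2"])]
    prop_m = (active["Sex"] == "1").groupby(active["Occupation"]).mean()
    prop_f = (active["Sex"] == "2").groupby(active["Occupation"]).mean()
    # Occupations with no active records get a gap of 0, as in `task`.
    return (prop_m - prop_f).reindex(occupations, fill_value=0.0)


gap = gender_gap_by_occupation(toy, ["3", "4", "9"])
print(gap)
```

A gap of 1 means the occupation is entirely male among active records, and a gap of -1 means it is entirely female.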

Define how to measure the success of this analysis (how close it is).

In [13]:
def distance_marginals(m1, m2):
    return np.sqrt(((m1.flatten()-m2.flatten())**2).mean())

In [14]:
def distance_bb(bb1, bb2):
    return np.sqrt(((bb1-bb2)**2).mean())

In [27]:
def utility(task_output_synth, task_output_real):
    # Parse the output of each task.
    marginal_s, gg_s = task_output_synth
    marginal_r, gg_r = task_output_real
    # Measure the distance between answers.
    avg_error_marginal = distance_marginals(marginal_s, marginal_r)
    avg_error_bb = distance_bb(gg_s, gg_r)
    return avg_error_marginal, avg_error_bb
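
Both distances are root-mean-square errors over the entries of the two arrays. A small worked example with made-up 2x2 marginals shows the scale of the numbers; `rmse` is a hypothetical stand-in that mirrors `distance_marginals`.

```python
import numpy as np


def rmse(a, b):
    """Root-mean-square difference between two arrays of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.mean((a.ravel() - b.ravel()) ** 2))


# Made-up marginals: every cell is off by 0.05 in absolute value.
real_marginal = np.array([[0.25, 0.25], [0.25, 0.25]])
synth_marginal = np.array([[0.30, 0.20], [0.20, 0.30]])

err = rmse(synth_marginal, real_marginal)
print(f"{err:.2e}")
```

Because every cell differs by 0.05, the error is 0.05. Errors on a marginal with many cells are naturally small, since each cell holds only a small share of the records.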

### Generating tailored datasets

In [16]:
master_seed = 42387342

In [17]:
num_runs = 5

In [18]:
np.random.seed(master_seed)
seeds = np.random.randint(np.iinfo(np.int16).max, size=num_runs)

IPF

In [19]:
for seed in seeds:
    custom_ipf(teaching_file, columns, "use_case_1_ipf", seed, acceptable_marginals, output_size)

MST

In [20]:
for seed in seeds:
    custom_mst(teaching_file, columns, "use_case_1_mst", seed, acceptable_marginals, output_size)

### Loading synthetic datasets

Load generated datasets that are agnostic to the task, as well as the datasets from the previous step.

In [21]:
methods = [
    "use_case_1_ipf",
    "use_case_1_mst",
    "MST_eps1000",
    "CTGAN_10epochs",
    "PATEGAN_eps1000",
    "PrivBayes_eps1000",
    "SYNTHPOP"
]

### Summarising the results

In [22]:
synthetic_datasets = load_synthetic_datasets(methods)

Load previously computed task results from disk if they exist, to avoid repeating expensive computations. Otherwise, run the task on the real data and start with an empty set of results.

In [23]:
try:
    with open("use_case_1.pickle", "rb") as ff:
        task_real_data, results = pickle.load(ff)
except Exception as err:
    task_real_data = task(teaching_file)
    results = {}

The following is re-entrant: pre-existing results will not be recomputed.

In [24]:
for method in tqdm.notebook.tqdm(methods):
    # Prevent re-computation of the task.
    if method in results and len(results[method]) == len(synthetic_datasets[method]):
        continue
    print(method)
    results[method] = L = []
    for ds in synthetic_datasets[method]:
        L.append(task(ds))

use_case_1_ipf
use_case_1_mst
MST_eps1000
CTGAN_10epochs
PATEGAN_eps1000
PrivBayes_eps1000
SYNTHPOP


Save the results to disk so that later runs can load them instead of recomputing the task.

In [25]:
with open("use_case_1.pickle", "wb") as ff:
    pickle.dump((task_real_data, results), ff)
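
The load, compute and save steps above follow a common caching pattern. A minimal, self-contained sketch of that pattern is given below; the helper `load_or_compute` and the file name `demo.pickle` are hypothetical and are not used elsewhere in this notebook.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object cached at `path`; on a miss, compute and save it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    value = compute()
    with open(path, "wb") as fh:
        pickle.dump(value, fh)
    return value


calls = []


def expensive():
    # Stand-in for a costly computation; records each invocation.
    calls.append(1)
    return {"answer": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "demo.pickle")
    first = load_or_compute(cache_path, expensive)   # computes and saves
    second = load_or_compute(cache_path, expensive)  # loads from disk

print(len(calls))
```

Only load pickle files that you created yourself, since unpickling untrusted data can execute arbitrary code.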

Print the error of each method on both tasks, averaged over its synthetic datasets.

In [33]:
for method in methods:
    print('===', method, '===')
    err_marg = 0
    err_gg = 0
    for task_result in results[method]:
        em, eg = utility(task_result, task_real_data)
        err_marg += em
        err_gg += eg
    err_marg /= len(results[method])
    err_gg /= len(results[method])
    print('\tError on marginals: ', '%.2e'% err_marg)
    print('\tError on gender gap:', '%.2e'% err_gg)
    print()

=== use_case_1_ipf ===
	Error on marginals:  5.39e-04
	Error on gender gap: 4.77e-02

=== use_case_1_mst ===
	Error on marginals:  5.35e-04
	Error on gender gap: 4.94e-02

=== MST_eps1000 ===
	Error on marginals:  1.04e-03
	Error on gender gap: 8.68e-02

=== CTGAN_10epochs ===
	Error on marginals:  3.62e-03
	Error on gender gap: 1.70e-01

=== PATEGAN_eps1000 ===
	Error on marginals:  1.29e-02
	Error on gender gap: 5.70e-01

=== PrivBayes_eps1000 ===
	Error on marginals:  8.97e-03
	Error on gender gap: 2.84e-01

=== SYNTHPOP ===
	Error on marginals:  9.60e-05
	Error on gender gap: 4.61e-03

