## Use Case 2

A multinomial regression of Approximated Social Grade on Ethnic Group, Country of Birth and Family Composition, i.e. an analysis involving these four variables. In this way we can construct a narrative in which use case 1 focuses on economic variables/analyses, use case 2 on socio-demographic variables/analyses, and use case 3 is a broad analysis intended primarily to test how well the methods protect privacy and preserve utility.

(This notebook uses the same structure as the other use case notebooks).

In [1]:
import itertools
import json
import numpy as np
import os
import pandas as pd
import pickle
import tqdm
import tqdm.notebook

In [2]:
# Loads of warnings coming from the sub-methods, which we don't care about.
import warnings
warnings.filterwarnings("ignore")

In [3]:
from utils import load_data, load_synthetic_datasets, metadata, custom_ipf, custom_mst

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Load the data.

In [5]:
teaching_file_train = load_data(train=True, test=False)
teaching_file_test = load_data(train=False, test=True)

In [6]:
output_size = len(teaching_file_train) + len(teaching_file_test)

### Use case 

What columns are used in the task? What marginals should be preserved for this task?

In [7]:
columns = ["Ethnic Group", "Country of Birth", "Family Composition", "Approximated Social Grade"]

In [8]:
categoric = columns[:-1]
target = columns[-1]

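We use the 3-way marginal over the three demographic columns, and the 2-way marginals between each demographic column and the target. Each tuple below lists positions in `columns`.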

In [9]:
acceptable_marginals = [(0, 1, 2), (0, 3), (1, 3), (2, 3)]
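
Since the tuples in `acceptable_marginals` are positions in `columns`, a quick sanity check is to translate them back into column names. This is a small sketch rather than part of the original pipeline; `marginal_names` is a name introduced here for illustration.

```python
columns = ["Ethnic Group", "Country of Birth", "Family Composition", "Approximated Social Grade"]
acceptable_marginals = [(0, 1, 2), (0, 3), (1, 3), (2, 3)]

# Translate each tuple of column indices into the corresponding column names.
marginal_names = [tuple(columns[i] for i in marginal) for marginal in acceptable_marginals]

for names in marginal_names:
    print(" x ".join(names))
```

The first line printed is the 3-way demographic marginal; the remaining three pair each demographic column with the target.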

Process the test dataset.

In [10]:
df = teaching_file_test.copy()
X_test_df, y_test_df = df.drop(target, axis=1), df[target]
X_test_df[categoric] = X_test_df[categoric].astype(str)

Define the task of interest (analysis on data).

In [11]:
def task(dataset, split_to_estimate_accuracy=True):    
    # Get the categories from the metadata.
    categories = {m["name"]: m["representation"] for m in metadata if m["name"] in categoric}
    
    # Define the encoder and model (logistic regression with default hyperparameters).
    encoder = OneHotEncoder(categories=[categories[m] for m in categoric])
    preprocessor = ColumnTransformer(
        [
            ("encode", encoder, categoric),
        ],
        remainder="passthrough",
    )
    logreg = LogisticRegression(random_state=0, n_jobs=-1)
    pipe = Pipeline([("preprocess", preprocessor), ("model", logreg)])
    
    # Load the dataset, divide into train/test (to evaluate).
    df = dataset[columns].copy()
    X, y = df.drop(target, axis=1), df[target]
    X[categoric] = X[categoric].astype(str)
    
    if split_to_estimate_accuracy:
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
    else:
        X_train = X
        y_train = y
        X_test = X_test_df
        y_test = y_test_df
    
    # Apply the pipeline, and do predictions on the test set.
    model = pipe.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Return the model + the y_pred/y_true to evaluate.
    return (model, y_pred, y_test)
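
To illustrate why `task` passes an explicit category list to the encoder, here is a minimal sketch on hypothetical toy data (the columns `A`, `B`, `T` and the dictionary `toy_categories` are invented for this example and do not come from the teaching file). With fixed categories, the encoded matrix has the same width regardless of which dataset is used for fitting, even when a category never occurs in it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical features and a three-class target.
toy = pd.DataFrame({
    "A": ["1", "2", "1", "2", "1", "2"],
    "B": ["x", "x", "y", "y", "x", "y"],
    "T": ["low", "mid", "high", "mid", "low", "high"],
})
features = ["A", "B"]

# Category "3" of column "A" never appears in the toy data, but still gets a column.
toy_categories = {"A": ["1", "2", "3"], "B": ["x", "y"]}
encoder = OneHotEncoder(categories=[toy_categories[c] for c in features])
toy_pipe = Pipeline([
    ("preprocess", ColumnTransformer([("encode", encoder, features)])),
    ("model", LogisticRegression(random_state=0)),
])

toy_model = toy_pipe.fit(toy[features], toy["T"])
encoded = toy_model.named_steps["preprocess"].transform(toy[features])
print(encoded.shape)  # 3 columns for "A" plus 2 for "B"
print(toy_model.predict(toy[features]))
```

This matters when comparing models trained on different synthetic datasets: a generator that never emits a rare category would otherwise produce a model with a different input layout.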

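Define how to measure the success of this analysis: the accuracy on the real test set of the model trained on synthetic data, alongside the accuracy estimated on a held-out part of the synthetic data.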

In [12]:
def utility(task_output_synth):
    # Unpack the fitted model and the held-out predictions/labels returned by task().
    model, y_pred_est, y_test_est = task_output_synth

    # Real accuracy: the model trained on synthetic data, scored on the real test set.
    y_real_est = model.predict(X_test_df).astype(str)
    real_accuracy = accuracy_score(y_test_df, y_real_est)
    # Estimated accuracy: the same model, scored on the held-out synthetic split.
    estimated_accuracy = accuracy_score(y_test_est, y_pred_est)
    return real_accuracy, estimated_accuracy

### Generating tailored datasets

In [13]:
master_seed = 42387342

In [14]:
num_runs = 5

In [15]:
np.random.seed(master_seed)
seeds = np.random.randint(np.iinfo(np.int16).max, size=num_runs)
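
Deriving the per-run seeds from a single master seed makes the whole experiment repeatable. The sketch below wraps the two lines above in a helper (`draw_seeds` is a hypothetical name, not part of `utils`) and checks that two draws with the same master seed agree.

```python
import numpy as np

master_seed = 42387342
num_runs = 5

def draw_seeds(master_seed, num_runs):
    # Re-seed the global generator, then draw one seed per run.
    np.random.seed(master_seed)
    return np.random.randint(np.iinfo(np.int16).max, size=num_runs)

first = draw_seeds(master_seed, num_runs)
second = draw_seeds(master_seed, num_runs)
print(np.array_equal(first, second))
```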

IPF

In [16]:
for seed in seeds:
    custom_ipf(teaching_file_train, columns, "use_case_2_ipf", seed, acceptable_marginals, output_size)

MST

In [17]:
for seed in seeds:
    custom_mst(teaching_file_train, columns, "use_case_2_mst", seed, acceptable_marginals, output_size)

### Loading synthetic datasets

Load generated datasets that are agnostic to the task, as well as the datasets from the previous step.

In [18]:
methods = [
    "use_case_2_ipf",
    "use_case_2_mst",
    "MST_eps1000",
    "CTGAN_10epochs",
    "PATEGAN_eps1000",
    "PrivBayes_eps1000",
    "SYNTHPOP"
]

### Summarising the results

In [19]:
synthetic_datasets = load_synthetic_datasets(methods)

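Load previously saved task results from disk, if present, to avoid repeating expensive computations. Otherwise, run the task on the real training data and start from an empty set of results.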

In [20]:
try:
    with open("use_case_2.pickle", "rb") as ff:
        task_real_data, results = pickle.load(ff)
except Exception as err:
    task_real_data = task(teaching_file_train, split_to_estimate_accuracy=False)
    results = {}

The following is re-entrant: pre-existing results will not be recomputed.

In [21]:
for method in tqdm.notebook.tqdm(methods):
    # Prevent re-computation of the task.
    if method in results and len(results[method]) == len(synthetic_datasets[method]):
        continue
    print(method)
    results[method] = L = []
    for ds in synthetic_datasets[method]:
        L.append(task(ds))

use_case_2_ipf


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

use_case_2_mst


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
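
The warnings above report that the solver stopped at its iteration limit. One possible response, suggested by the warning text itself, is to raise `max_iter`. The following is a minimal sketch on randomly generated binary features (an assumption for illustration; this is not the teaching file, and `fit_and_check` is a hypothetical helper) showing how to detect whether a fit hit the limit.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Random stand-in data: 20 binary features and a four-class target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 20)).astype(float)
y = rng.integers(0, 4, size=500)

def fit_and_check(max_iter):
    # Record warnings raised during fitting instead of printing them.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = LogisticRegression(random_state=0, max_iter=max_iter).fit(X, y)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return clf, hit_limit

_, limited = fit_and_check(max_iter=2)
_, relaxed = fit_and_check(max_iter=1000)
print("hit limit with max_iter=2:   ", limited)
print("hit limit with max_iter=1000:", relaxed)
```

Whether a larger `max_iter` changes the accuracies reported in this notebook is not something this sketch establishes; it only shows how the condition can be checked.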

Save the results to disk, so that later runs of the notebook can load them in the cell above instead of recomputing them.

In [22]:
with open("use_case_2.pickle", "wb") as ff:
    pickle.dump((task_real_data, results), ff)

Print the accuracy of the models.

In [23]:
print("Accuracy trained on _real_ data:", accuracy_score(task_real_data[2], task_real_data[1]))
print()

for method in methods:
    print('===', method, '===')
    acc_method_real = 0
    acc_method_est = 0
    for task_result in results[method]:
        ar, ae = utility(task_result)
        acc_method_real += ar
        acc_method_est += ae
    print('\t Accuracy of classifier:  ', acc_method_real / len(results[method]))
    print('\t Accuracy estimated on SD:', acc_method_est / len(results[method]))
    print()

Accuracy trained on _real_ data: 0.3411671785870996

=== use_case_2_ipf ===
	 Accuracy of classifier:   0.3410794207985959
	 Accuracy estimated on SD: 0.3407045971524053

=== use_case_2_mst ===
	 Accuracy of classifier:   0.3410899517332163
	 Accuracy estimated on SD: 0.34018366143390716

=== MST_eps1000 ===
	 Accuracy of classifier:   0.3118946906537955
	 Accuracy estimated on SD: 0.2968645567131905

=== CTGAN_10epochs ===
	 Accuracy of classifier:   0.2881228609039052
	 Accuracy estimated on SD: 0.3799868010896122

=== PATEGAN_eps1000 ===
	 Accuracy of classifier:   0.20524089512944274
	 Accuracy estimated on SD: 0.4034162711674014

=== PrivBayes_eps1000 ===
	 Accuracy of classifier:   0.31195085563843794
	 Accuracy estimated on SD: 0.3060897525906372

=== SYNTHPOP ===
	 Accuracy of classifier:   0.3410408073716542
	 Accuracy estimated on SD: 0.34035075402285936

