# `find-quasi-ids` - Demo Jupyter Notebook

This Jupyter notebook demonstrates how to use `find-quasi-ids` as a CLI tool on the NY TAP dataset.


**NOTE:** A `!` at the beginning of a line in a code cell tells Jupyter to run the rest of that line in the system shell (such as Bash) rather than in the Python interpreter. This lets you run external commands directly from the notebook.
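
For readers who want to run the same commands from a plain Python script, a rough equivalent of a `!` cell is a call to `subprocess.run`. The sketch below is illustrative only: the helper name `run_shell_style` is hypothetical, and it runs a harmless Python one-liner as a stand-in so that it works without `find-quasi-ids` installed.

```python
import subprocess
import sys


def run_shell_style(args):
    """Run an external command and return what it printed to stdout."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


# Stand-in command (an assumption for illustration). In this notebook the
# argument list would be something like ["find-quasi-ids", "-h"].
output = run_shell_style([sys.executable, "-c", "print('hello from the shell')"])
print(output)
```

Unlike `!`, this form does not pass the command through a shell, so shell features such as pipes and globbing are not available.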

## Usage Message

In [5]:
!find-quasi-ids -h

usage: find-quasi-ids [-h] -f INPUT_FILE [--out-dir OUT_DIR]
                      {brute,greedy} ...

options:
  -h, --help            show this help message and exit
  -f INPUT_FILE, --input-file INPUT_FILE
                        Path to input CSV datafile
  --out-dir OUT_DIR     Path to output directory where to write the output CSV
                        file(s).

algorithms:
  Algorithm to use. Each has its own options.

  {brute,greedy}
    brute               Run the brute-force algorithm.
    greedy              Run the greedy algorithm.


## Brute-force Algorithm

In [6]:
!find-quasi-ids brute -h

usage: find-quasi-ids brute [-h] [--num-cols-start NUM_COLS_START]
                            [--num-cols-max NUM_COLS_MAX] [--top-k TOP_K]

options:
  -h, --help            show this help message and exit
  --num-cols-start NUM_COLS_START
                        Number of columns to start running the brute force
                        algorithm.
  --num-cols-max NUM_COLS_MAX
                        Max number of cols to use to run the brute force
                        algorithm.
  --top-k TOP_K         Output top-k combination of columns with highest
                        distinct ratio for the brute force algorithm.


In [18]:
!find-quasi-ids --input-file data/TAP_PRUNED.csv --out-dir output \
    brute --num-cols-start 3 --num-cols-max 4 --top-k 5


Top 5 combinations of size 3 by distinct ratio:

Academic Year, Income by $1,000 Range, TAP Recipient Dollars: distinct_ratio=0.6457, unique_combinations=52306
TAP Sector Group, Income by $1,000 Range, TAP Recipient Dollars: distinct_ratio=0.6012, unique_combinations=48706
Recipient Age Group, Income by $1,000 Range, TAP Recipient Dollars: distinct_ratio=0.5956, unique_combinations=48249
Academic Year, Recipient Age Group, TAP Recipient Dollars: distinct_ratio=0.5672, unique_combinations=45945
Academic Year, TAP Sector Group, TAP Recipient Dollars: distinct_ratio=0.5669, unique_combinations=45922
Saved to output/brute_force_distinct_ratio_k3.csv

Top 5 combinations of size 4 by distinct ratio:

Academic Year, TAP Sector Group, Income by $1,000 Range, TAP Recipient Dollars: distinct_ratio=0.8045, unique_combinations=65176
Academic Year, Recipient Age Group, Income by $1,000 Range, TAP Recipient Dollars: distinct_ratio=0.7829, unique_combinations=63419
TAP Sector Group, Recipient Age Gr
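
To make the brute-force idea concrete, here is a minimal sketch of scoring every column combination of a given size by distinct ratio and keeping the top `k`. It is not the tool's actual implementation: the function name `top_k_by_distinct_ratio`, the toy data, and the assumption that the distinct ratio is the number of unique value combinations divided by the number of rows are all illustrative.

```python
from itertools import combinations

import pandas as pd


def top_k_by_distinct_ratio(df, num_cols, top_k):
    """Score every combination of `num_cols` columns by distinct ratio."""
    n_rows = len(df)
    scored = []
    for combo in combinations(df.columns, num_cols):
        # Number of distinct value tuples across the chosen columns.
        unique_combinations = len(df.drop_duplicates(subset=list(combo)))
        scored.append((combo, unique_combinations / n_rows, unique_combinations))
    # Highest distinct ratio first; ties keep their original order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data (hypothetical, not the TAP dataset).
toy = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002],
    "sector": ["A", "B", "A", "B"],
    "level": ["U", "U", "U", "U"],
})

best = top_k_by_distinct_ratio(toy, num_cols=2, top_k=2)
for combo, ratio, count in best:
    print(f"{', '.join(combo)}: distinct_ratio={ratio:.4f}, unique_combinations={count}")
```

Because every combination is enumerated, the amount of work grows quickly with the number of columns and the combination size, which is presumably why the CLI lets you bound the search with `--num-cols-start` and `--num-cols-max`.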

## Greedy (eps, delta) Minimum Key Algorithm

In [11]:
!find-quasi-ids greedy -h

usage: find-quasi-ids greedy [-h] [--sample]
                             [--distinct-ratio-target DISTINCT_RATIO_TARGET]

options:
  -h, --help            show this help message and exit
  --sample              If provided input data will be sampled, before running
                        the greedy algorithm.
  --distinct-ratio-target DISTINCT_RATIO_TARGET
                        If provided the greedy algorithm will run until it
                        just achieves a higher distinct ratio percentage. If
                        not provided greedy algorithm will run fully. Expected
                        value 0-100.


In [19]:
!find-quasi-ids --input-file data/TAP_PRUNED.csv --out-dir output \
    greedy --sample

['Income by $1,000 Range', 'Academic Year', 'TAP Recipient Dollars', 'TAP Sector Group', 'Recipient Age Group', 'TAP Level of Study', 'TAP Financial Status', 'TAP Schedule Letter']: distinct_ratio=1.0, unique_combinations=4456
Took 2.294 seconds


In [26]:
!find-quasi-ids --input-file data/TAP_PRUNED.csv --out-dir output \
    greedy --sample --distinct-ratio-target 90

['Income by $1,000 Range', 'Academic Year', 'TAP Recipient Dollars']: distinct_ratio=0.9124775583482945, unique_combinations=4066
Took 1.998 seconds
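
The behaviour seen in the runs above, where columns are added one at a time until a target distinct ratio is reached, can be sketched as follows. This is a simplified illustration, not the tool's implementation: the function name `greedy_quasi_id`, the stopping rule when no column adds new distinct rows, and the use of a 0 to 1 fraction for the target (the CLI expects 0 to 100) are all assumptions.

```python
import pandas as pd


def greedy_quasi_id(df, target_ratio=1.0):
    """Add one column at a time, always picking the largest gain in distinct rows."""
    n_rows = len(df)
    chosen = []
    remaining = list(df.columns)
    best_count = 0
    while remaining:
        # Count distinct value tuples for each candidate extension.
        counts = {
            col: len(df.drop_duplicates(subset=chosen + [col]))
            for col in remaining
        }
        col = max(counts, key=counts.get)
        if counts[col] <= best_count:
            break  # no remaining column separates any more rows
        chosen.append(col)
        remaining.remove(col)
        best_count = counts[col]
        if best_count / n_rows >= target_ratio:
            break  # target distinct ratio reached
    return chosen, best_count / n_rows, best_count


# Toy data (hypothetical, not the TAP dataset).
toy = pd.DataFrame({
    "income": ["a", "a", "b", "b", "c", "c"],
    "year": [1, 2, 1, 2, 1, 1],
    "level": ["U"] * 6,
})

columns, ratio, count = greedy_quasi_id(toy)
print(columns, f"distinct_ratio={ratio}", f"unique_combinations={count}")

# With a lower target the search stops earlier.
early_columns, early_ratio, early_count = greedy_quasi_id(toy, target_ratio=0.5)
print(early_columns, f"distinct_ratio={early_ratio}", f"unique_combinations={early_count}")
```

A greedy search evaluates far fewer column sets than the brute-force approach, at the cost of possibly missing a smaller combination that a full enumeration would find.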


### Full Greedy Minimum Key Algorithm

⚠️⚠️⚠️ Running the full greedy min key algorithm (without `--sample`) takes a long time: about 200 seconds in the run below, compared with about 2 seconds for the sampled runs above ⚠️⚠️⚠️

In [38]:
!find-quasi-ids --input-file data/TAP_PRUNED.csv --out-dir output \
    greedy

['Income by $1,000 Range', 'Academic Year', 'TAP Recipient Dollars', 'TAP Sector Group', 'Recipient Age Group', 'TAP Schedule Letter', 'TAP Level of Study', 'TAP Financial Status', 'Level', 'Sector Type', 'TAP Award Schedule', 'TAP Degree or NonDegree', 'TAP Recipient FTEs']: distinct_ratio=0.9958523639056907, unique_combinations=80674
Took 200.376 seconds


## Results

Running either algorithm in `find-quasi-ids` writes out a list of column combinations, each with its distinct ratio and the number of unique value combinations found in the input dataset.
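
The two reported numbers appear to be related by `distinct_ratio = unique_combinations / number_of_rows`. This is inferred from the printed output rather than stated by the tool, and the row counts used below (81,010 rows in the full dataset, 4,456 rows in the sample) are assumptions back-calculated from those figures. A quick consistency check under those assumptions:

```python
def distinct_ratio(unique_combinations, n_rows):
    """Fraction of rows that carry a distinct combination of values."""
    return unique_combinations / n_rows


FULL_ROWS = 81010    # assumed row count of the full dataset (inferred, not stated)
SAMPLE_ROWS = 4456   # assumed row count of the sample (inferred, not stated)

# Figures taken from the runs shown in this notebook.
full_greedy = distinct_ratio(80674, FULL_ROWS)
sampled_target = distinct_ratio(4066, SAMPLE_ROWS)
brute_k3_best = distinct_ratio(52306, FULL_ROWS)

print(f"full greedy:     {full_greedy}")
print(f"sampled, target: {sampled_target}")
print(f"brute, size 3:   {brute_k3_best:.4f}")
```

A ratio close to 1 means that almost every row has its own combination of values for those columns, which is what makes the combination risky.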

## Using the results

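We can then use these column combinations to query the dataset for rows that correspond to a single, unique individual.

First, define the high-risk quasi-identifier combination based on the output of `find-quasi-ids`. Here we use the three columns returned by the greedy run with `--distinct-ratio-target 90`.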

In [29]:
risky_combo = ['Income by $1,000 Range', 'Academic Year', 'TAP Recipient Dollars']

The next cell finds the rows that are unique under this combination (group size of 1) and summarizes the most common values of each column among those at-risk rows.

In [30]:
import pandas as pd

df = pd.read_csv("data/TAP_PRUNED.csv")

# Calculate group size for each row
df['group_size'] = df.groupby(risky_combo)[risky_combo[0]].transform('count')

# Filter to find individuals in groups of size 1 (fully re-identifiable)
at_risk = df[df['group_size'] == 1]

# Summarize the most common attributes among at-risk individuals
top_income = at_risk['Income by $1,000 Range'].value_counts().head(5)
top_awards = at_risk['TAP Recipient Dollars'].value_counts().head(5)
top_years = at_risk['Academic Year'].value_counts().head(5)

# Show a few re-identifiable combinations
unique_examples = at_risk[risky_combo].value_counts().reset_index().head(10)
unique_examples.columns = risky_combo + ['count']

# Print the analysis
print("\nAt-Risk Group Summary")
print(f"Total re-identifiable individuals (group size = 1): {len(at_risk)}\n")

print("Top 5 Income Brackets:")
print(top_income.to_string(), "\n")

print("Top 5 TAP Award Amounts:")
print(top_awards.to_string(), "\n")

print("Top 5 Academic Years:")
print(top_years.to_string(), "\n")


At-Risk Group Summary
Total re-identifiable individuals (group size = 1): 46194

Top 5 Income Brackets:
Income by $1,000 Range
$ 7,001 to $ 8,000    1709
$ 8,001 to $ 9,000    1668
$ 9,001 to $10,000    1520
$13,001 to $14,000    1173
$11,001 to $12,000    1162 

Top 5 TAP Award Amounts:
TAP Recipient Dollars
750.00    309
166.67    244
125.00    236
333.34    205
500.01    199 

Top 5 Academic Years:
Academic Year
2001    2553
2003    2549
2000    2508
2002    2480
2004    2140 



With these values and a little tinkering, we can find unique individuals such as the following:

In [37]:
filtered_df = df.loc[
    (df['Academic Year'] == 2001) &
    (df['Income by $1,000 Range'] == '$ 6,001 to $ 7,000') &
    (df['TAP Recipient Dollars'] == 750)
]
filtered_df

| Unnamed: 0 | Academic Year | Level | TAP Level of Study | Sector Type | TAP Sector Group | Recipient Age Group | TAP Financial Status | TAP Award Schedule | TAP Degree or NonDegree | TAP Schedule Letter | Income by $1,000 Range | TAP Recipient FTEs | TAP Recipient Dollars | group_size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 71646 | 2001 | U | 2 yr Undergrad | PUBLIC | 1-CUNY SR | over age 50 | Financial_Independent | Independent_Schedule | Degree | K | $ 6,001 to $ 7,000 | 0.5 | 750.0 | 1 |
