## Introduction

This notebook works with a dataset that contains sensitive information. The goal is to mitigate potential data threats by analyzing its features and understanding how they could be used to target individual users.

In [42]:
import pandas as pd
import anonypy

## Possible threat analysis before anonymization

Before anonymization, we identified data features that could expose individual users, such as 'income', which becomes sensitive information as soon as a record can be linked to a person.

In [43]:
names = (
    'age',
    'workclass',  # Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
    'fnlwgt',  # "weight" of that person in the dataset (i.e. how many people that person represents) -> https://www.kansascityfed.org/research/datamuseum/cps/coreinfo/keyconcepts/weights
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income',
)

# some fields are categorical and will require special treatment
categorical = set((
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'sex',
    'native-country',
    'race',
    'income',
))

# Load the data using pandas; the multi-character separator requires the Python parsing engine
df = pd.read_csv("./adult.all.txt", sep=", ", header=None, names=names, index_col=False, engine='python')

In [44]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50k
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50k
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50k
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50k
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50k


## Anonymization process

K-anonymity is a privacy-preserving technique designed to protect individuals' identities within a dataset. It works by ensuring that each record is indistinguishable from at least k-1 other records with respect to its quasi-identifiers, that is, the attributes an attacker could combine to single someone out (here, 'age' and 'education-num'). In simpler terms, it groups individuals together so that no one can be singled out within a group of fewer than k records.
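
The definition can be checked mechanically. The sketch below is a minimal illustration, not part of anonypy: `is_k_anonymous` is a hypothetical helper name, and the `toy` table is made-up data that only borrows the column names used in this notebook.

```python
import pandas as pd


def is_k_anonymous(table: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """Return True if every quasi-identifier combination occurs at least k times."""
    # Size of each equivalence class (records sharing the same quasi-identifier values)
    group_sizes = table.groupby(quasi_identifiers, observed=True).size()
    return bool((group_sizes >= k).all())


# Made-up, already generalized records (illustration only)
toy = pd.DataFrame({
    "age": ["[20-29]", "[20-29]", "[30-39]", "[30-39]", "[30-39]"],
    "education-num": ["[9-10]", "[9-10]", "[13]", "[13]", "[13]"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

k2_ok = is_k_anonymous(toy, ["age", "education-num"], k=2)
k3_ok = is_k_anonymous(toy, ["age", "education-num"], k=3)
print(k2_ok, k3_ok)
```

The smallest group in `toy` has two records, so the table satisfies the check for k=2 but not for k=3.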

**How does it work?**

To achieve k-anonymity, data is often generalized or suppressed. This means that specific details, such as exact dates of birth or precise addresses, are replaced with broader categories (e.g., age range, city). This reduces the granularity of the data, making it more difficult to link a specific record to an individual.
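
As a rough illustration of generalization, the sketch below replaces exact ages with age ranges using `pd.cut` and compares the smallest group size before and after. The ages, bin edges, and labels are assumptions chosen for the example; anonypy chooses its own partitions, so this is not how the library picks ranges.

```python
import pandas as pd

# Made-up ages (illustration only)
ages = pd.DataFrame({"age": [17, 18, 19, 21, 34, 36, 52, 58]})

# Hypothetical bin edges; right=False makes each bin [low, high)
bins = [0, 20, 40, 60]
labels = ["0-19", "20-39", "40-59"]
ages["age_range"] = pd.cut(ages["age"], bins=bins, labels=labels, right=False)

# Smallest group when grouping by the exact value vs. by the generalized range
exact_groups = ages.groupby("age").size()
range_groups = ages.groupby("age_range", observed=True).size()
print(exact_groups.min(), range_groups.min())
```

Every exact age in this toy data is unique, while each generalized range holds at least two people. The price is precision: the wider the ranges, the less useful the data is for analysis.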

**Why is k-anonymity important?**

K-anonymity helps mitigate the risks associated with data breaches and unauthorized access. By hiding each individual within a group of at least k records, it makes it harder for malicious actors to re-identify individuals and exploit their sensitive information.

**Limitations of k-anonymity:**

While k-anonymity is a valuable tool, it has limitations. One significant drawback is its vulnerability to background knowledge attacks: if an attacker already holds additional information about an individual, they may be able to narrow down the group of k records and identify the individual or infer their sensitive value.

Despite its limitations, k-anonymity remains a fundamental technique in data privacy and is often used in conjunction with other privacy-preserving methods to provide robust protection for sensitive data.

In [45]:
for name in categorical:
    df[name] = df[name].astype('category')

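# Quasi-identifiers: attributes that are generalized until groups reach size k
feature_columns = ['age', 'education-num']
# Sensitive attribute: reported per group, not generalized
sensitive_column = 'income'

p = anonypy.Preserver(df, feature_columns, sensitive_column)
# Each returned row is one (group, income value) pair with its record count
rows = p.anonymize_k_anonymity(k=2)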

dfn = pd.DataFrame(rows)
print(dfn)

         age education-num income  count
0       [17]         [7-8]  <=50k    334
1    [18-19]         [7-8]  <=50k    451
2    [18-19]         [7-8]   >50k      1
3       [21]          [10]  <=50k    568
4       [21]          [10]   >50k      2
..       ...           ...    ...    ...
852     [73]          [16]   >50k      1
853     [74]          [16]   >50k      2
854     [88]          [15]  <=50k      2
855     [90]          [15]  <=50k      1
856     [90]          [15]   >50k      2

[857 rows x 4 columns]


## Possible threat analysis after anonymization

A single column cannot be enough to protect privacy.
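
One way to look for remaining threats is to check how strongly a single income value dominates each group. The sketch below copies the first five rows of the anonymized output printed above into a small table and computes, per group, the share of records held by the most common income value. The 0.99 threshold is an arbitrary assumption for illustration.

```python
import pandas as pd

# First five rows of the anonymized output shown above
released = pd.DataFrame({
    "age": ["[17]", "[18-19]", "[18-19]", "[21]", "[21]"],
    "education-num": ["[7-8]", "[7-8]", "[7-8]", "[10]", "[10]"],
    "income": ["<=50k", "<=50k", ">50k", "<=50k", ">50k"],
    "count": [334, 451, 1, 568, 2],
})

quasi_identifiers = ["age", "education-num"]

# Share of each income value within its group, weighted by the record count
group_total = released.groupby(quasi_identifiers)["count"].transform("sum")
released["share"] = released["count"] / group_total

# Groups where one income value covers nearly everyone reveal that value
dominant = released.groupby(quasi_identifiers)["share"].max()
risky = dominant[dominant > 0.99]  # hypothetical threshold
print(risky)
```

In these rows, anyone known to be 17 with an education-num of 7 or 8 can be assigned '<=50k' with certainty, even though the group itself is large. Group size alone therefore does not protect the sensitive value.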

## Evaluate the Anonymization

Entropy, a fundamental concept in information theory, measures the uncertainty or randomness of a system. In the context of data anonymization, entropy can be used to quantify the difficulty of identifying a specific individual within a dataset.

1. Calculating Entropy Before and After Anonymization:

    - Before: The entropy of the attribute under study is calculated on the raw data. The cells below use the sensitive attribute `income`; the higher its entropy, the less certain an attacker can be about any one person's value.
    - After: After applying anonymization techniques (generalization, suppression, perturbation), the entropy of the same attribute is recalculated on the anonymized table. The goal is for the anonymized data to show equal or higher entropy, so that an attacker gains no additional certainty about an individual.
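
The entropy formula, H = Σ p · log2(1/p), can be wrapped in a small helper. The sketch below is an illustration on made-up numbers: `shannon_entropy` is a hypothetical helper name, and the `grouped` table only mimics the shape of an aggregated output with a `count` column. It also shows that counting each aggregated row once gives a different result from weighting rows by `count`.

```python
import numpy as np
import pandas as pd


def shannon_entropy(counts) -> float:
    """Shannon entropy in bits of the distribution given by non-negative counts."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]  # zero counts contribute nothing to the sum
    p = counts / counts.sum()
    # sum(p * log2(1/p)) is the same quantity as -sum(p * log2(p))
    return float(np.sum(p * np.log2(1.0 / p)))


# Made-up aggregated table (illustration only)
grouped = pd.DataFrame({
    "income": ["<=50k", "<=50k", ">50k"],
    "count": [6, 2, 2],
})

# Each aggregated row counted once
per_row = shannon_entropy(grouped["income"].value_counts())
# Each row weighted by the number of people it represents
per_person = shannon_entropy(grouped.groupby("income")["count"].sum())
print(per_row, per_person)
```

Which variant is appropriate depends on the question being asked: the per-row figure describes the rows of the released table, while the count-weighted figure describes the people behind them.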

In [46]:
import numpy as np
# Calculate the entropy of a column in the Raw dataset.

# Count the occurrences of each value in the column
counts = df['income'].value_counts()

# Calculate probabilities
probabilities = counts / len(df)

# Calculate entropy using the formula: - Σ (p * log2(p))
entropy_df = -np.sum(probabilities * np.log2(probabilities))

In [47]:
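# Calculate the entropy of a column in the anonymized dataset.
# Note: each row of dfn is an aggregated group, so value_counts() counts rows
# (groups) here, not individuals; the 'count' column is not used as a weight.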

# Count the occurrences of each value in the column
counts = dfn['income'].value_counts()

# Calculate probabilities
probabilities = counts / len(dfn)

# Calculate entropy using the formula: - Σ (p * log2(p))
entropy_dfn = -np.sum(probabilities * np.log2(probabilities))

In [48]:
print(entropy_df)
print(entropy_dfn)

0.7938438393644256
0.9599476519619152
