
In [97]:
import random
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Silence library warnings for the rest of the notebook
warnings.simplefilter('ignore')

# Load data
url = "https://raw.githubusercontent.com/desstaw/Seminar_DataManagement23/main/datasets/heart.csv"
df = pd.read_csv(url)
# Drop the index column that was saved into the CSV
df = df.drop('Unnamed: 0', axis=1)

### K-anonymity on the original heart dataset

**Explanation**:

1. Apply generalization to the quasi-identifiers: the numeric quasi-identifiers `age`, `chol`, and `oldpeak` are generalized with the pandas `cut` function, which sorts the values of a Series into bins and labels each value with the interval it falls into. The bin edges come from the generalization hierarchy defined in the code below.

2. Define `k`, the minimum number of individuals a group must contain to avoid suppression.

3. Group the dataset by the quasi-identifiers with `groupby` and loop over the groups. If a group has fewer than `k` individuals, its row indices are added to the `suppressed_indices` list.

4. Drop the rows whose indices are in `suppressed_indices`, so that every remaining group contains at least `k` rows.
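
The four steps above can be sketched on a toy table. This is a minimal illustration, not the notebook's own code: the `toy` frame, its values, and the use of `transform("size")` in place of the explicit loop are assumptions made for brevity.

```python
import pandas as pd

# Toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "age":    [31, 35, 38, 42, 47, 30],
    "sex":    [1, 1, 1, 0, 0, 1],
    "target": [0, 1, 0, 1, 1, 0],
})
quasi_identifiers = ["age", "sex"]
k = 3  # step 2: minimum group size

# Step 1: generalize exact ages into right-closed intervals, e.g. (29, 39].
# Values outside the edges, or equal to the lowest edge, would become NaN.
toy["age"] = pd.cut(toy["age"], bins=[29, 39, 49])

# Step 3: size of the quasi-identifier group that each row belongs to.
group_sizes = toy.groupby(quasi_identifiers, observed=True)["target"].transform("size")

# Step 4: suppress (drop) rows whose group has fewer than k members.
anonymized = toy[group_sizes >= k].reset_index(drop=True)
print(anonymized)
```

In this toy table the two rows with `sex == 0` form a group of size 2 and are suppressed, while the four rows in the `(29, 39]` band with `sex == 1` are kept.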

In [98]:
# Define the sensitive attribute and the quasi-identifiers
sensitive_attribute = 'target'
quasi_identifiers = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                     'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

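# Define the generalization hierarchy (bin edges) for the numeric quasi-identifiers.
# pd.cut builds right-closed intervals, e.g. (29, 39]; values outside the edges,
# or equal to the lowest edge, become NaN.
generalization_hierarchy = {
    'age': pd.cut(df['age'], bins=[29, 39, 49, 59, 69, 79, 89]),
    'chol': pd.cut(df['chol'], bins=[100, 150, 200, 250, 300, 350, 400, 450]),
    'oldpeak': pd.cut(df['oldpeak'], bins=[-0.1, 1, 2, 3, 4, 5]),
}

# Apply generalization to the quasi-identifiers.
# Note: `bins` is an IntervalIndex here, so pandas ignores `labels` and the
# result holds the same intervals as the hierarchy itself.
for col, hierarchy in generalization_hierarchy.items():
    df[col] = pd.cut(df[col], bins=hierarchy.cat.categories, labels=hierarchy.cat.categories[:-1])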

# Define the privacy parameter
k = 3

# Group the dataset by the quasi-identifiers and suppress the groups with fewer than k rows.
# Rows with NaN in a quasi-identifier are skipped by groupby and are not checked here.
grouped = df.groupby(quasi_identifiers)
suppressed_indices = []
for group_name, group in grouped:
    if len(group) < k:
        suppressed_indices.extend(group.index)
df = df.drop(suppressed_indices)

# Keep a copy of the pre-reset index, then renumber the rows from 0
df_index = df.index
df = df.reset_index(drop=True)

In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       1021 non-null   category
 1   sex       1025 non-null   int64   
 2   cp        1025 non-null   int64   
 3   trestbps  1025 non-null   int64   
 4   chol      1022 non-null   category
 5   fbs       1025 non-null   int64   
 6   restecg   1025 non-null   int64   
 7   thalach   1025 non-null   int64   
 8   exang     1025 non-null   int64   
 9   oldpeak   1018 non-null   category
 10  slope     1025 non-null   int64   
 11  ca        1025 non-null   int64   
 12  thal      1025 non-null   int64   
 13  target    1025 non-null   int64   
dtypes: category(3), int64(11)
memory usage: 92.1 KB


In [100]:
df.head(20)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,"(49, 59]",1,0,125,"(200, 250]",0,1,168,0,"(-0.1, 1.0]",2,2,3,0
1,"(49, 59]",1,0,140,"(200, 250]",1,0,155,1,"(3.0, 4.0]",0,0,3,0
2,"(69, 79]",1,0,145,"(150, 200]",0,1,125,1,"(2.0, 3.0]",0,0,3,0
3,"(59, 69]",1,0,148,"(200, 250]",0,1,161,0,"(-0.1, 1.0]",2,1,3,0
4,"(59, 69]",0,0,138,"(250, 300]",1,1,106,0,"(1.0, 2.0]",1,3,2,0
5,"(49, 59]",0,0,100,"(200, 250]",0,0,122,0,"(-0.1, 1.0]",1,0,2,1
6,"(49, 59]",1,0,114,"(300, 350]",0,2,140,0,"(4.0, 5.0]",0,3,1,0
7,"(49, 59]",1,0,160,"(250, 300]",0,0,145,1,"(-0.1, 1.0]",1,1,3,0
8,"(39, 49]",1,0,120,"(200, 250]",0,0,144,0,"(-0.1, 1.0]",2,0,3,0
9,"(49, 59]",1,0,122,"(250, 300]",0,0,116,1,"(3.0, 4.0]",1,2,2,0


In [101]:
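# Count missing values per column (not displayed, since it is not the cell's last expression)
df.isnull().sum()
# Treat empty strings as missing, then drop every row that has a missing value
df.replace('', np.nan, inplace=True)
df.dropna(inplace=True)
df.info()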

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1011 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       1011 non-null   category
 1   sex       1011 non-null   int64   
 2   cp        1011 non-null   int64   
 3   trestbps  1011 non-null   int64   
 4   chol      1011 non-null   category
 5   fbs       1011 non-null   int64   
 6   restecg   1011 non-null   int64   
 7   thalach   1011 non-null   int64   
 8   exang     1011 non-null   int64   
 9   oldpeak   1011 non-null   category
 10  slope     1011 non-null   int64   
 11  ca        1011 non-null   int64   
 12  thal      1011 non-null   int64   
 13  target    1011 non-null   int64   
dtypes: category(3), int64(11)
memory usage: 98.7 KB


Test the k-anonymity:

**Explanation**

`for qi_vals, group in df.groupby(qi_cols):` groups the records in `df` by the values of the quasi-identifiers. For each group, `qi_vals` holds the quasi-identifier values shared by the group and `group` holds its records.

`counts = Counter(df[qi_cols].apply(tuple, axis=1))`: creates a `Counter` that counts how often each combination of quasi-identifier values occurs in `df`. `apply(tuple, axis=1)` converts the quasi-identifier values of each row into a tuple, which is hashable and can therefore be counted.

`num_violations = len([count for count in counts.values() if count < k_anonymity])`: iterates over the values of `counts` and counts how many of them are smaller than `k_anonymity`. Because `counts` has one entry per distinct combination, the result is the number of quasi-identifier combinations (groups) that violate k-anonymity, not the number of individual records in them.
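
The difference between counting violating groups and counting violating records can be seen in a small sketch. The `rows` list below is hypothetical, already-generalized data invented for illustration; only `Counter` and the threshold name `k_anonymity` are taken from the notebook.

```python
from collections import Counter

# Hypothetical quasi-identifier tuples (age band, sex), one per record.
rows = [
    ("(29, 39]", 1), ("(29, 39]", 1), ("(29, 39]", 1),
    ("(39, 49]", 0), ("(39, 49]", 0),
    ("(49, 59]", 1),
]
k_anonymity = 3

# One entry per distinct combination, mapped to its number of records.
counts = Counter(rows)

# Number of distinct combinations with fewer than k records.
violating_groups = sum(1 for c in counts.values() if c < k_anonymity)

# Number of individual records that sit inside those combinations.
violating_records = sum(c for c in counts.values() if c < k_anonymity)

print(violating_groups, violating_records)
```

Both numbers are zero exactly when the data is k-anonymous, so either one works as a pass/fail check. They differ only in how a failure is reported.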

In [102]:
from collections import Counter

# Define the quasi-identifiers
qi_cols = [col for col in df.columns if col != sensitive_attribute]

# Define the value of k for k-anonymity
k_anonymity = 3

# Check k-anonymity for each group of records
for qi_vals, group in df.groupby(qi_cols):
    if len(group) < k_anonymity:
        print("k-anonymity is not satisfied for the group:", qi_vals)

# Count the quasi-identifier combinations (groups) that occur fewer than k times.
# Note: this is a number of groups, not of individual records.
counts = Counter(df[qi_cols].apply(tuple, axis=1))
num_violations = len([count for count in counts.values() if count < k_anonymity])
print("Number of records that do not satisfy k-anonymity:", num_violations)


Number of records that do not satisfy k-anonymity: 0


Test l-diversity

In [103]:
sensitive_attribute = 'target'

# Minimum number of distinct sensitive values required in each group
l_diversity = 2

# Number of records that belong to groups violating l-diversity
count = 0

# Define the quasi-identifiers
qi_cols = [col for col in df.columns if col != sensitive_attribute]

# Check l-diversity for each group of records
for qi_vals, group in df.groupby(qi_cols):
    if group[sensitive_attribute].nunique() < l_diversity:
        count += len(group)
        # print("l-diversity is not satisfied for the group:", qi_vals)

# Print the total number of records that do not satisfy l-diversity
print(f"Total number of records that do not satisfy l-diversity: {count}")

Total number of records that do not satisfy l-diversity: 1011
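
A count equal to the full row count means that no quasi-identifier group contains more than one distinct `target` value. The sketch below shows the same check on a toy table where only one of two groups is homogeneous. The `toy` frame and the `age_band` column are assumptions for illustration, and `transform("nunique")` stands in for the explicit loop.

```python
import pandas as pd

# Toy, already-generalized data; names and values are illustrative only.
toy = pd.DataFrame({
    "age_band": ["30-39", "30-39", "30-39", "40-49", "40-49", "40-49"],
    "sex":      [1, 1, 1, 0, 0, 0],
    "target":   [0, 0, 0, 0, 1, 1],
})
qi_cols = ["age_band", "sex"]
l_diversity = 2

# Number of distinct sensitive values in the group each row belongs to.
distinct = toy.groupby(qi_cols)["target"].transform("nunique")

# Rows in groups with fewer than l distinct sensitive values.
violating = toy[distinct < l_diversity]
print(len(violating))
```

Here the first group has a single `target` value, so its three rows fail the check, while the second group holds both values and passes.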


In [104]:
#from google.colab import drive
#drive.mount('/content/drive')

Mounted at /content/drive


In [105]:
#df.to_csv('/content/drive/MyDrive/Colab Notebooks/Sepsis/v2_heart.csv', index=False)