# **Data Anonymization Process**

This section describes the anonymization techniques applied to the medical dataset, as required by the rubric. The notebook transformations implement **suppression**, **generalization**, **perturbation**, and **encoding**, which together reduce the risk that a patient can be individually re-identified.


#### **3.1 Suppression (Column Removal)**

**Technique:** Direct removal of sensitive identifiers.

**Column removed:**

* **PatientID** → *Suppressed*

Removing this column eliminates the only direct identifier that could explicitly link the data to an individual.
Suppressing unique identifiers prevents linkage attacks, where adversaries match IDs with hospital records, forms, or external databases.
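
As a minimal sketch of suppression, assuming a toy frame with made-up values in place of the real dataset, dropping the identifier column and checking that it is gone could look like this:

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "State": ["Alabama", "Ohio", "Utah"],
    "BMI": [32.1, 28.0, 22.5],
})

# Suppression: drop the direct identifier. With the default errors="raise",
# a misspelled column name fails loudly instead of being silently ignored.
suppressed = toy.drop(columns=["PatientID"])

remaining = list(suppressed.columns)
print(remaining)
```

The row count is unchanged; only the identifying column disappears.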

#### **3.2 Generalization (Reducing Granularity)**

Generalization transforms precise values into broader, less identifiable categories.

 **a) State → Region**

* The original *State* column was replaced with a mapped **Region** (Northeast, Midwest, South, West, Territories).
* Implemented through the `region_map` dictionary.

Reduces location precision from exact state to region-level, lowering re-identification risk for small populations.
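
The mapping step can be sketched as follows. The three-entry `region_map_excerpt` and the state name "Atlantis" are hypothetical, chosen only to show how values missing from the dictionary surface as `NaN`:

```python
import pandas as pd

# Hypothetical three-entry excerpt of the notebook's region_map.
region_map_excerpt = {"Alabama": "South", "Ohio": "Midwest", "Guam": "Territories"}

states = pd.Series(["Alabama", "Ohio", "Guam", "Atlantis"], name="State")
regions = states.map(region_map_excerpt)

# Series.map yields NaN for keys absent from the dictionary, so unmapped
# states can be listed before the State column is dropped.
unmapped = states[regions.isna()].unique().tolist()
print(unmapped)
```

Running a check like this before dropping `State` helps confirm that every value in the column has a region.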

**b) AgeCategory → AgeGroup**

Using the `generalize_age()` function, granular age categories like **“Age 18 to 24”** became:

* **Young Adult**
* **Adult**
* **Middle-aged**
* **Senior**

Exact age ranges can be identifying, especially for elderly patients. Grouping reduces uniqueness.
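
A dictionary-based variant of the same grouping is sketched below. It is an illustrative alternative, not the notebook's `generalize_age()` function; the names `AGE_GROUPS` and `age_lookup` are assumptions introduced here:

```python
import pandas as pd

# Illustrative lookup table (assumed names, same grouping as the notebook).
AGE_GROUPS = {
    "Young Adult": ["Age 18 to 24", "Age 25 to 29", "Age 30 to 34"],
    "Adult": ["Age 35 to 39", "Age 40 to 44", "Age 45 to 49"],
    "Middle-aged": ["Age 50 to 54", "Age 55 to 59", "Age 60 to 64"],
}
# Invert to {category: group} so Series.map can apply it directly.
age_lookup = {cat: group for group, cats in AGE_GROUPS.items() for cat in cats}

ages = pd.Series(["Age 18 to 24", "Age 60 to 64", "Age 75 to 79", "Age 65 to 69"])
# Anything not listed (including missing values) falls back to "Senior",
# mirroring the else branch of generalize_age().
age_group = ages.map(age_lookup).fillna("Senior")
print(age_group.tolist())
```

Note that the fallback also labels missing or unexpected values as "Senior", which may or may not be desirable.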

**c) HeightInMeters → HeightGroup**

Height was binned into:

* VeryShort
* Short
* Medium
* Tall
* VeryTall

Exact height can uniquely identify individuals; grouping protects against re-identification through biometric attributes.
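
The binning can be illustrated with `pd.cut` on a few made-up heights, using the same bin edges and labels as the notebook. By default the intervals are closed on the right, so a height of exactly 1.60 m lands in "Short":

```python
import pandas as pd

# Made-up heights in metres, including one value on a bin edge (1.60).
heights = pd.Series([1.45, 1.60, 1.63, 1.78, 1.90])

height_group = pd.cut(
    heights,
    bins=[0, 1.50, 1.60, 1.70, 1.85, 3],
    labels=["VeryShort", "Short", "Medium", "Tall", "VeryTall"],
)  # default right=True: intervals are (0, 1.50], (1.50, 1.60], ...

print(height_group.astype(str).tolist())
```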

**d) BMI → BMIGroup**

BMI numerical values were generalized into:

* Underweight
* Normal
* Overweight
* Obese
* ExtremelyObese

Removes sensitive numerical health details while preserving analytical usefulness.

**e) HadDiabetes → DiabetesGroup**

Instead of four detailed diabetic statuses, categories were merged into:

* NoDiabetes
* Prediabetic
* Diabetic

Reduces the granularity of a medically sensitive attribute.

#### **3.3 Perturbation (Random Noise in Weight)**

```python
df_anon["WeightInKilograms"] = df_anon["WeightInKilograms"] + np.random.normal(0, 0.8, size=len(df_anon))
```

* Adds Gaussian noise (μ = 0, σ = 0.8 kg).
* Keeps weight statistically realistic while preventing exact matching with clinical records.

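If an attacker knows a person's exact weight, the perturbed value no longer matches it exactly, which hinders direct linkage. Because σ is small, approximate matching remains possible, while the overall weight distribution is largely preserved for modeling.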
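
A reproducible version of the perturbation is sketched below. It swaps the notebook's global `np.random.normal` call for a seeded `np.random.default_rng` generator; the seed value and the sample weights are arbitrary assumptions:

```python
import numpy as np

# Seeded generator so the same noise is drawn on every run (seed is arbitrary).
rng = np.random.default_rng(seed=42)

# Made-up weights in kilograms.
weights = np.array([84.82, 71.67, 71.21, 95.25, 78.02])

# Gaussian noise with mean 0 and standard deviation 0.8 kg.
noise = rng.normal(loc=0.0, scale=0.8, size=weights.shape)
perturbed = weights + noise

# Largest absolute change introduced by the noise.
max_shift = np.abs(perturbed - weights).max()
print(max_shift)
```

Fixing the seed makes the anonymized export repeatable, at the cost that anyone who knows the seed and the row order could subtract the noise again, so the seed should be kept private.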


#### **3.4 Encoding (One-Hot Encoding / Binary Mapping)**

Several categorical variables were transformed using **one-hot encoding** (`pd.get_dummies` with `drop_first=True`), so one category per variable serves as the implicit baseline.

**Variables encoded using One-Hot Encoding**

* GeneralHealth → High / Medium / Bad
* HeightGroup
* BMIGroup
* DiabetesGroup
* SmokingGroup
* ECigGroup
* RaceGroup
* TetanusGroup
* AgeGroup
* Region (one-hot encoded inside `State_Anonimize`)

Additionally:

* **Sex** was mapped: Female → 0, Male → 1
* **Booleans** converted to 0 / 1

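One-hot encoding replaces the string values with binary indicator columns suitable for machine learning workflows. The category names survive in the column names (e.g., `RaceGroup_White`), so the privacy gain comes from the merged categories rather than from the encoding itself.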
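
The effect of `drop_first=True` can be seen on a small made-up column. The dropped category ("Bad", first in sorted order) becomes the all-zeros baseline:

```python
import pandas as pd

# Made-up values using the merged GeneralHealth labels.
toy = pd.DataFrame({"GeneralHealth": ["Bad", "Medium", "High", "Medium"]})

# One-hot encode; the first category in sorted order ("Bad") is dropped
# and is represented by zeros in both remaining indicator columns.
encoded = pd.get_dummies(toy, columns=["GeneralHealth"], drop_first=True).astype(int)

print(list(encoded.columns))
print(encoded.iloc[0].tolist())
```

A row with a missing value would also be all zeros, which makes it indistinguishable from the baseline category unless missing values are handled beforehand.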


### **3.5 Summary Table of Techniques**

| Variable / Category                   | Technique          | Description                     | Privacy Protection                      |
| ------------------------------------- | ------------------ | ------------------------------- | --------------------------------------- |
| PatientID                             | **Suppression**    | Removed entirely                | Prevents direct identification          |
| State                                 | **Generalization** | Mapped to region                | Avoids geographic singling-out          |
| AgeCategory                           | **Generalization** | Converted to broader age groups | Reduces uniqueness of sparse age bands  |
| Height, BMI                           | **Generalization** | Binned into ranges              | Removes precise biometric identifiers   |
| Weight                                | **Perturbation**   | Added Gaussian noise            | Prevents exact-record linkage           |
| Race, Smoking, Diabetes, Health, etc. | **Encoding**       | Categories merged, then one-hot | Coarser categories limit inference      |
| Boolean illnesses                     | **Encoding**       | Converted to binary             | Uniform 0 / 1 representation            |




In [1]:
import pandas as pd
import numpy as np

df = pd.read_excel("Patients Data ( Used for Heart Disease Prediction ).xlsx")
df.head()


Unnamed: 0,PatientID,State,Sex,GeneralHealth,AgeCategory,HeightInMeters,WeightInKilograms,BMI,HadHeartAttack,HadAngina,...,ECigaretteUsage,ChestScan,RaceEthnicityCategory,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,1,Alabama,Female,Fair,Age 75 to 79,1.63,84.82,32.099998,0,1,...,Never used e-cigarettes in my entire life,1,"White only, Non-Hispanic",0,0,0,1,"No, did not receive any tetanus shot in the pa...",0,1
1,2,Alabama,Female,Very good,Age 65 to 69,1.6,71.669998,27.99,0,0,...,Never used e-cigarettes in my entire life,0,"White only, Non-Hispanic",0,0,1,1,"Yes, received Tdap",0,0
2,3,Alabama,Male,Excellent,Age 60 to 64,1.78,71.209999,22.530001,0,0,...,Never used e-cigarettes in my entire life,0,"White only, Non-Hispanic",1,0,0,0,"Yes, received tetanus shot but not sure what type",0,0
3,4,Alabama,Male,Very good,Age 70 to 74,1.78,95.25,30.129999,0,0,...,Never used e-cigarettes in my entire life,0,"White only, Non-Hispanic",0,0,1,1,"Yes, received tetanus shot but not sure what type",0,0
4,5,Alabama,Female,Good,Age 50 to 54,1.68,78.019997,27.76,0,0,...,Never used e-cigarettes in my entire life,1,"Black only, Non-Hispanic",0,0,1,0,"No, did not receive any tetanus shot in the pa...",0,0


In [2]:
# Auxiliary: state-to-region lookup used to generalize the State column
region_map = {
    # --------------------------
    # NORTHEAST REGION
    # --------------------------
    "Maine": "Northeast",
    "New Hampshire": "Northeast",
    "Vermont": "Northeast",
    "Massachusetts": "Northeast",
    "Rhode Island": "Northeast",
    "Connecticut": "Northeast",
    "New York": "Northeast",
    "New Jersey": "Northeast",
    "Pennsylvania": "Northeast",

    # --------------------------
    # MIDWEST REGION
    # --------------------------
    "Ohio": "Midwest",
    "Indiana": "Midwest",
    "Illinois": "Midwest",
    "Michigan": "Midwest",
    "Wisconsin": "Midwest",
    "Minnesota": "Midwest",
    "Iowa": "Midwest",
    "Missouri": "Midwest",
    "North Dakota": "Midwest",
    "South Dakota": "Midwest",
    "Nebraska": "Midwest",
    "Kansas": "Midwest",

    # --------------------------
    # SOUTH REGION
    # --------------------------
    "Delaware": "South",
    "Maryland": "South",
    "District of Columbia": "South",
    "Virginia": "South",
    "West Virginia": "South",
    "North Carolina": "South",
    "South Carolina": "South",
    "Georgia": "South",
    "Florida": "South",
    "Kentucky": "South",
    "Tennessee": "South",
    "Mississippi": "South",
    "Alabama": "South",
    "Arkansas": "South",
    "Louisiana": "South",
    "Texas": "South",
    "Oklahoma": "South",

    # --------------------------
    # WEST REGION
    # --------------------------
    "Montana": "West",
    "Idaho": "West",
    "Wyoming": "West",
    "Colorado": "West",
    "New Mexico": "West",
    "Arizona": "West",
    "Utah": "West",
    "Nevada": "West",
    "California": "West",
    "Oregon": "West",
    "Washington": "West",
    "Alaska": "West",
    "Hawaii": "West",

    # --------------------------
    # U.S. TERRITORIES (GROUPED)
    # --------------------------
    "Puerto Rico": "Territories",
    "Guam": "Territories",
    "Virgin Islands": "Territories"
}


### **Auxiliary Functions for Anonymization and Encoding**

In [None]:
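def State_Anonimize(df_anon):
    # Generalize State to Region; states missing from region_map become NaN.
    df_anon["Region"] = df_anon["State"].map(region_map)
    df_anon = df_anon.drop(columns=["State"])
    # One-hot encode Region; drop_first=True makes the first category the baseline.
    df_anon = pd.get_dummies(df_anon, columns=["Region"], drop_first=True)
    return df_anon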

def Height_Encode(df_anon):
    # Bin height (metres) into five right-inclusive intervals, e.g. (1.50, 1.60] -> "Short".
    df_anon["HeightGroup"] = pd.cut(
        df_anon["HeightInMeters"],
        bins=[0, 1.50, 1.60, 1.70, 1.85, 3],
        labels=["VeryShort", "Short", "Medium", "Tall", "VeryTall"],
    )
    df_anon = df_anon.drop(columns=["HeightInMeters"])
    df_anon = pd.get_dummies(df_anon, columns=["HeightGroup"], drop_first=True)
    return df_anon

def Health_Encode(df_anon):
    # Merge the five self-reported health levels into three.
    df_anon["GeneralHealth"] = df_anon["GeneralHealth"].replace({
        "Poor": "Bad",
        "Fair": "Bad",
        "Good": "Medium",
        "Very good": "High",
        "Excellent": "High",
    })
    df_anon = pd.get_dummies(df_anon, columns=["GeneralHealth"], drop_first=True)
    return df_anon

def Age_Encode(df_anon):
    # Generalize the age bands, then one-hot encode the resulting groups.
    df_anon["AgeGroup"] = df_anon["AgeCategory"].apply(generalize_age)
    df_anon = df_anon.drop(columns=["AgeCategory"])
    df_anon = pd.get_dummies(df_anon, columns=["AgeGroup"], drop_first=True)
    return df_anon


def generalize_age(age):
    # Map a five-year age band to one of four broad groups.
    if age in ["Age 18 to 24", "Age 25 to 29", "Age 30 to 34"]:
        return "Young Adult"
    elif age in ["Age 35 to 39", "Age 40 to 44", "Age 45 to 49"]:
        return "Adult"
    elif age in ["Age 50 to 54", "Age 55 to 59", "Age 60 to 64"]:
        return "Middle-aged"
    else:
        # Every other value, including missing or unexpected ones, is treated as Senior.
        return "Senior"

def BMI_Encode(df_anon):
    # Bin BMI into five right-inclusive intervals, e.g. (18.5, 25] -> "Normal".
    df_anon["BMIGroup"] = pd.cut(
        df_anon["BMI"],
        bins=[0, 18.5, 25, 30, 40, 100],
        labels=["Underweight", "Normal", "Overweight", "Obese", "ExtremelyObese"],
    )
    df_anon = df_anon.drop(columns=["BMI"])
    df_anon = pd.get_dummies(df_anon, columns=["BMIGroup"], drop_first=True)
    return df_anon

def Diabetes_Encode(df_anon):
    # Merge four statuses into three; gestational diabetes is grouped with "Diabetic".
    df_anon["DiabetesGroup"] = df_anon["HadDiabetes"].replace({
        "No": "NoDiabetes",
        "No, pre-diabetes or borderline diabetes": "Prediabetic",
        "Yes": "Diabetic",
        "Yes, but only during pregnancy (female)": "Diabetic",
    })
    df_anon = df_anon.drop(columns=["HadDiabetes"])
    df_anon = pd.get_dummies(df_anon, columns=["DiabetesGroup"], drop_first=True)
    return df_anon

def SmokerStatus_Encode(df_anon):
    # Merge daily and occasional smokers into a single "Current" group.
    df_anon["SmokingGroup"] = df_anon["SmokerStatus"].replace({
        "Never smoked": "Never",
        "Former smoker": "Former",
        "Current smoker - now smokes every day": "Current",
        "Current smoker - now smokes some days": "Current",
    })
    df_anon = df_anon.drop(columns=["SmokerStatus"])
    df_anon = pd.get_dummies(df_anon, columns=["SmokingGroup"], drop_first=True)
    return df_anon

def ECigarreteUsage_Encode(df_anon):
    # Merge daily and occasional e-cigarette users into a single "Current" group.
    df_anon["ECigGroup"] = df_anon["ECigaretteUsage"].replace({
        "Never used e-cigarettes in my entire life": "Never",
        "Not at all (right now)": "FormerOrNone",
        "Use them some days": "Current",
        "Use them every day": "Current",
    })
    df_anon = df_anon.drop(columns=["ECigaretteUsage"])
    df_anon = pd.get_dummies(df_anon, columns=["ECigGroup"], drop_first=True)
    return df_anon

def RaceEtchnicity_Encode(df_anon):
    # Collapse the race/ethnicity categories into three broad groups.
    df_anon["RaceGroup"] = df_anon["RaceEthnicityCategory"].replace({
        "White only, Non-Hispanic": "White",
        "Hispanic": "Hispanic",
        "Black only, Non-Hispanic": "NonWhiteOther",
        "Other race only, Non-Hispanic": "NonWhiteOther",
        "Multiracial, Non-Hispanic": "NonWhiteOther",
    })
    df_anon = df_anon.drop(columns=["RaceEthnicityCategory"])
    df_anon = pd.get_dummies(df_anon, columns=["RaceGroup"], drop_first=True)
    return df_anon

def Tetanus_Encode(df_anon):
    # Merge the two known-vaccine answers into a single "VaccinatedKnownType" group.
    df_anon["TetanusGroup"] = df_anon["TetanusLast10Tdap"].replace({
        "No, did not receive any tetanus shot in the past 10 years": "NoShot10Years",
        "Yes, received tetanus shot but not sure what type": "VaccinatedUnknownType",
        "Yes, received Tdap": "VaccinatedKnownType",
        "Yes, received tetanus shot, but not Tdap": "VaccinatedKnownType",
    })
    df_anon = df_anon.drop(columns=["TetanusLast10Tdap"])
    df_anon = pd.get_dummies(df_anon, columns=["TetanusGroup"], drop_first=True)
    return df_anon


# **Data Anonymization Function**

In [None]:
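def anon_data(df):
    df_anon = df.copy()
    # Suppression: remove the direct identifier.
    df_anon = df_anon.drop(columns=["PatientID"])

    # Generalization and one-hot encoding, one helper per variable.
    df_anon = State_Anonimize(df_anon)
    df_anon = Health_Encode(df_anon)
    df_anon = Height_Encode(df_anon)
    df_anon = BMI_Encode(df_anon)
    df_anon = Diabetes_Encode(df_anon)
    df_anon = SmokerStatus_Encode(df_anon)
    df_anon = ECigarreteUsage_Encode(df_anon)
    df_anon = RaceEtchnicity_Encode(df_anon)
    df_anon = Tetanus_Encode(df_anon)
    df_anon = Age_Encode(df_anon)

    # Perturbation: add Gaussian noise (mean 0, standard deviation 0.8 kg) to weight.
    df_anon["WeightInKilograms"] = df_anon["WeightInKilograms"] + np.random.normal(0, 0.8, size=len(df_anon))

    # Binary mapping for Sex.
    df_anon["Sex"] = df_anon["Sex"].map({"Female": 0, "Male": 1})

    # Convert boolean columns (including the dummy columns) to 0 / 1.
    # Casting by dtype avoids the FutureWarning raised by replace({False: 0, True: 1}).
    bool_cols = df_anon.select_dtypes(include="bool").columns
    df_anon[bool_cols] = df_anon[bool_cols].astype(int)
    return df_anon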


In [17]:
df_anonymized = anon_data(df)
df_anonymized.head()


Unnamed: 0,Sex,WeightInKilograms,HadHeartAttack,HadAngina,HadStroke,HadAsthma,HadSkinCancer,HadCOPD,HadDepressiveDisorder,HadKidneyDisease,...,SmokingGroup_Never,ECigGroup_FormerOrNone,ECigGroup_Never,RaceGroup_NonWhiteOther,RaceGroup_White,TetanusGroup_VaccinatedKnownType,TetanusGroup_VaccinatedUnknownType,AgeGroup_Middle-aged,AgeGroup_Senior,AgeGroup_Young Adult
0,0,84.82,0,1,0,1,1,0,0,0,...,0,0,1,0,1,0,0,0,1,0
1,0,71.669998,0,0,0,0,0,0,0,0,...,0,0,1,0,1,1,0,0,1,0
2,1,71.209999,0,0,0,0,0,0,0,0,...,1,0,1,0,1,0,1,1,0,0
3,1,95.25,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,1,0,1,0
4,0,78.019997,0,0,0,0,0,0,0,0,...,1,0,1,1,0,0,0,1,0,0


#### **Export anonymized data**

In [None]:
df_anonymized.to_csv("anonimized_patient_data.csv", index=False)

### **Final Privacy Justification**

The anonymization pipeline reduces re-identification risk through four complementary mechanisms:

**1. Direct identifiers removed**
With `PatientID` dropped, no column explicitly names or numbers a patient.

**2. Sensitive traits generalized**
Biological and demographic details lose precision, which hinders indirect re-identification.

**3. Numerical attributes perturbed**
Gaussian noise on weight hinders exact matching with hospital records while largely preserving the weight distribution.

**4. Encoded categorical data**
Merged categories are stored as binary indicators; the coarser categories, rather than the encoding itself, limit attribute inference.
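
One way to sanity-check these claims is to count how many records share each combination of quasi-identifiers. The sketch below is not part of the notebook: it uses a made-up frame, and the choice of `Region`, `AgeGroup`, and `Sex` as quasi-identifiers is an assumption. A smallest group size of 1 means at least one record is still unique on those attributes:

```python
import pandas as pd

# Made-up records after generalization (before one-hot encoding).
toy = pd.DataFrame({
    "Region": ["South", "South", "South", "West", "West"],
    "AgeGroup": ["Senior", "Senior", "Adult", "Adult", "Adult"],
    "Sex": [0, 0, 1, 1, 1],
})

# Assumed quasi-identifiers: attributes an attacker might know from elsewhere.
quasi_identifiers = ["Region", "AgeGroup", "Sex"]

# Number of records sharing each observed combination.
group_sizes = toy.groupby(quasi_identifiers).size()

k = int(group_sizes.min())                      # smallest group size
unique_rows = int((group_sizes == 1).sum())     # combinations held by one record
print(k, unique_rows)
```

If the smallest group is too small for the intended release, the bins or category merges can be made coarser and the check repeated.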

