# Data Deposit Box
  
*Authors: Vassil Dimitrov and Makda G  
Date: June 16, 2023  
Project: DATA DEPOSIT BOX*  

---

#### Import necessary libraries

In [17]:
# Load libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
import seaborn as sns

#### Read pre-parsed tidy tables and subset

In [18]:
# Load parsed data:
X_train = pd.read_csv ('X_train.csv', index_col = 0)
y_train = pd.read_csv ('y_train.csv', index_col = 0)
y_train = y_train['prediabetes_bin_y'].squeeze()

In [19]:
# Subset data for maneuverability (keep a stratified 33% of the rows)
X_train_33, _, y_train_33, _ = train_test_split(X_train, y_train, test_size=0.67, stratify=y_train)

In [20]:
# Clean up space:
del X_train, y_train

## Anonymize

The first step of anonymization is done when the FHIR patient *json* files are parsed into tables.  
Only the patient ID is extracted; other features such as address, and potentially race, ethnicity and gender, are to be anonymized. For the sake of simplicity, these features were not extracted into the pre-parsed tables used at this point. In addition, the data were NOT anonymized upon parsing; that step will be added to the workflow. The tidy table loaded here has therefore not been anonymized and still contains the association with the patients. For the sake of illustration, the data will be anonymized twice, but the reader should keep in mind that anonymization should follow this pipeline:
- Parse *json* file to a table format (8 tables per patient)
- Anonymize patient personal information using a hash function while preserving mapping between tables
- Import tables into an SQL database for data storage
- Upon extraction for model building, load data from SQL, merge based on hashed patient_ID and clean data to obtain a table with one patient per row and one feature per column
- Anonymize the merged table for patient ID (and other potentially sensitive information) again using a hash function
- Tidy table and transform features to a numerical encoding appropriate for model building
- Add Laplace noise to the data to further mitigate any possibility of association to a specific patient based on feature values
- Randomize data by subsampling rows of the table
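
The second step of the pipeline, hashing patient identifiers while preserving the mapping between tables, can be illustrated with a minimal sketch. It assumes two toy tables (`conditions`, `medications`) and a helper `hash_patient_id`; all of these are hypothetical names standing in for the tables parsed from the FHIR files. Because the hash is deterministic, applying the same function to every table keeps the join key consistent:

```python
import hashlib

import pandas as pd


def hash_patient_id(patient_id) -> str:
    """Return the SHA-256 hex digest of a patient ID (equal IDs give equal digests)."""
    return hashlib.sha256(str(patient_id).encode()).hexdigest()


# Two toy tables standing in for tables parsed from FHIR json (hypothetical data).
conditions = pd.DataFrame({"patient_id": ["p1", "p2", "p3"], "n_conditions": [2, 0, 5]})
medications = pd.DataFrame({"patient_id": ["p3", "p1", "p2"], "n_medications": [1, 4, 0]})

# Hash the ID column of every table with the same function.
for table in (conditions, medications):
    table["patient_id"] = table["patient_id"].map(hash_patient_id)

# Merging on the hashed ID pairs the rows exactly as the raw ID would have.
merged = conditions.merge(medications, on="patient_id", how="inner")
print(merged)
```

The merged table has one row per patient and no raw identifier, which is the shape the later pipeline steps expect.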

---

#### Anonymize

This first anonymization step should normally be applied to each table parsed from the FHIR *json* files, but for the sake of time, simplicity and data availability, it is done at this stage instead.

Display 5 random columns and 6 random rows from the tidy data pre-anonymization:

In [5]:
import random

# Define how many to display
num_columns = 5  # Number of random columns to select
num_rows = 6     # Number of random rows to select

# Randomly select columns
random_columns = random.sample(list(X_train_33.columns), num_columns)

# Randomly select rows
random_rows = X_train_33.sample(num_rows)

# Display the selected columns and rows
selected_data = X_train_33[random_columns].loc[random_rows.index]
display(selected_data)

| patient_id | Acute bacterial sinusitis (disorder) | Fexofenadine hydrochloride 30 MG Oral Tablet | Estrostep Fe 28 Day Pack | Dander (animal) allergy | Chlorpheniramine 8 MG Oral Tablet |
| --- | --- | --- | --- | --- | --- |
| 857554e7-33c3-49a6-aae5-59749684eaac | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0671c11a-77d4-4118-af1c-e7fa73fb12e0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1d2646b5-bffc-4296-8898-a25e0a5239ee | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| b9832490-008c-454e-8ccf-6c872655f36f | 111 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1ef1aec7-8031-4d33-be9a-b0501fab40a4 | 43 | 0.0 | 0.0 | 0.0 | 0.0 |
| a9c93589-5436-49e5-8416-973720407dd2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |


Anonymize the data:

In [6]:
import hashlib

# Function to anonymize an index value using a hash function
def anonymize_index_value(index_value):
    anonymized_value = hashlib.sha256(str(index_value).encode()).hexdigest()
    return anonymized_value

# Anonymize the index (patient_id) of the table
X_train_33.index = X_train_33.index.map(anonymize_index_value)
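
A plain SHA-256 digest is deterministic, so anyone who holds the original patient IDs can recompute the same digests and re-link the rows. One possible hardening, not used in this notebook, is a keyed hash (HMAC) from the Python standard library. The sketch below is an illustration only; `SECRET_KEY` and `anonymize_with_key` are hypothetical names, and the key would have to be stored outside the notebook:

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be kept outside the notebook.
SECRET_KEY = b"replace-with-a-secret-key"


def anonymize_with_key(index_value, key=SECRET_KEY):
    """Return the keyed SHA-256 (HMAC) hex digest of an index value."""
    return hmac.new(key, str(index_value).encode(), hashlib.sha256).hexdigest()


example_id = "857554e7-33c3-49a6-aae5-59749684eaac"

# The unkeyed digest can be reproduced by anyone who knows the ID.
plain = hashlib.sha256(example_id.encode()).hexdigest()

# The keyed digest also requires the secret key to reproduce.
keyed = anonymize_with_key(example_id)
```

The keyed variant stays deterministic for a fixed key, so the mapping between tables is preserved in the same way as with the unkeyed function.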

Display 5 random columns and 6 random rows from the tidy data post-anonymization:

In [11]:
# Randomly select columns
random_columns = random.sample(list(X_train_33.columns), num_columns)

# Randomly select rows
random_rows = X_train_33.sample(num_rows)

# Display the selected columns and rows
selected_data = X_train_33[random_columns].loc[random_rows.index]
display(selected_data)

| patient_id | Ortho Tri-Cyclen 28 Day Pack | Protracted diarrhea | Streptococcal sore throat (disorder) | Dander (animal) allergy | Acetaminophen 325 MG / oxyCODONE Hydrochloride 2.5 MG [Percocet] |
| --- | --- | --- | --- | --- | --- |
| 4bf338b9e9bddcaef886750b8afb4c69aefffb8e0ccd978f06ff3db2e28b1db0 | 0.0 | 0.0 | 11 | 0.0 | 0.0 |
| a9728c56626b0d1927dab6430312a46134f8baecda661946bcd5bab1aff00943 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| cbe87104389177ac9fdf2f39832df40774dc6e6d1ce6161369d83545e0310485 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| 837b10934008a1e245e531be2180b622fd0ec536477d7124fdba82701bc56f8a | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| b4e604813ec1179bf3bf0704a407a86d4ee8c3c087c7c4aaaf81d025dff6f132 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| bd7fbc0a6b779e7f7bd5b93874c02500e890032a2d4a0e57dfc8cfa13ac1ece4 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |


---

#### Add Laplace noise

Laplace noise is added to the data to reduce the risk that an anonymized medical record can be associated with a specific patient based on their existing healthcare records.  
Note that there is a trade-off between privacy and data utility: stronger privacy protection with this technique means more noise, and therefore more distorted and less accurate data.  
It would therefore be a good idea to optimize the parameters of this anonymization approach using synthetic data.

In [12]:
# Set the sensitivity (maximum change in any single row/column)
sensitivity = 0.3

# Set the privacy parameter (epsilon) for differential privacy
epsilon = 0.5

# Calculate the scale parameter for Laplace distribution
scale = sensitivity / epsilon

# Generate zero-centered Laplace noise with the same shape as the feature table
noise = np.random.laplace(loc=0.0, scale=scale, size=X_train_33.shape)

# Add noise to every feature column
noisy_X_train_33 = X_train_33 + noise
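
The privacy/utility trade-off described above can be made concrete with a small sweep over `epsilon`. The sketch below is only an illustration on synthetic data: `X_synth`, the list of epsilon values and the seeded generator `rng` are assumptions, not part of the project workflow. It measures the mean absolute change per cell, which for Laplace noise has an expected value equal to the scale `sensitivity / epsilon`:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for the feature table (hypothetical counts, not the project data).
X_synth = rng.poisson(lam=1.0, size=(1000, 20)).astype(float)

sensitivity = 0.3  # same value as in the cell above
epsilons = [0.1, 0.5, 1.0, 5.0]  # hypothetical grid of privacy parameters

distortion = {}
for eps in epsilons:
    scale = sensitivity / eps
    noisy = X_synth + rng.laplace(loc=0.0, scale=scale, size=X_synth.shape)
    # Mean absolute change per cell introduced by the noise.
    distortion[eps] = np.abs(noisy - X_synth).mean()

for eps, err in distortion.items():
    print(f"epsilon={eps}: scale={sensitivity / eps:.2f}, mean |noise|={err:.3f}")
```

Smaller `epsilon` values give a larger scale and therefore more distortion, so a parameter search would weigh this distortion against the accuracy of the downstream model.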

Display the dataframe with noise:

In [13]:
# Display the selected columns and rows
selected_data = noisy_X_train_33[random_columns].loc[random_rows.index]
display(selected_data)

| patient_id | Ortho Tri-Cyclen 28 Day Pack | Protracted diarrhea | Streptococcal sore throat (disorder) | Dander (animal) allergy | Acetaminophen 325 MG / oxyCODONE Hydrochloride 2.5 MG [Percocet] |
| --- | --- | --- | --- | --- | --- |
| 4bf338b9e9bddcaef886750b8afb4c69aefffb8e0ccd978f06ff3db2e28b1db0 | -0.134025 | 0.290011 | 9.699624 | -0.584771 | 0.445114 |
| a9728c56626b0d1927dab6430312a46134f8baecda661946bcd5bab1aff00943 | -1.069658 | 0.513506 | -0.10724 | -0.781131 | -0.685442 |
| cbe87104389177ac9fdf2f39832df40774dc6e6d1ce6161369d83545e0310485 | 0.964023 | 0.15796 | -0.715596 | 0.09289 | -0.737214 |
| 837b10934008a1e245e531be2180b622fd0ec536477d7124fdba82701bc56f8a | 1.17433 | -2.456379 | -0.588613 | 0.452619 | -0.543836 |
| b4e604813ec1179bf3bf0704a407a86d4ee8c3c087c7c4aaaf81d025dff6f132 | -0.508661 | 0.134592 | 0.843128 | 0.292449 | 0.2061 |
| bd7fbc0a6b779e7f7bd5b93874c02500e890032a2d4a0e57dfc8cfa13ac1ece4 | -1.052363 | 0.124606 | -0.17767 | 0.150985 | -0.513803 |


Scramble data:

In [16]:
# Scramble rows:
noisy_X_train_33 = noisy_X_train_33.sample(frac=1)


# Display the selected columns and rows
selected_data = noisy_X_train_33[random_columns].loc[random_rows.index]
display(selected_data)

| patient_id | Ortho Tri-Cyclen 28 Day Pack | Protracted diarrhea | Streptococcal sore throat (disorder) | Dander (animal) allergy | Acetaminophen 325 MG / oxyCODONE Hydrochloride 2.5 MG [Percocet] |
| --- | --- | --- | --- | --- | --- |
| 4bf338b9e9bddcaef886750b8afb4c69aefffb8e0ccd978f06ff3db2e28b1db0 | -0.134025 | 0.290011 | 9.699624 | -0.584771 | 0.445114 |
| a9728c56626b0d1927dab6430312a46134f8baecda661946bcd5bab1aff00943 | -1.069658 | 0.513506 | -0.10724 | -0.781131 | -0.685442 |
| cbe87104389177ac9fdf2f39832df40774dc6e6d1ce6161369d83545e0310485 | 0.964023 | 0.15796 | -0.715596 | 0.09289 | -0.737214 |
| 837b10934008a1e245e531be2180b622fd0ec536477d7124fdba82701bc56f8a | 1.17433 | -2.456379 | -0.588613 | 0.452619 | -0.543836 |
| b4e604813ec1179bf3bf0704a407a86d4ee8c3c087c7c4aaaf81d025dff6f132 | -0.508661 | 0.134592 | 0.843128 | 0.292449 | 0.2061 |
| bd7fbc0a6b779e7f7bd5b93874c02500e890032a2d4a0e57dfc8cfa13ac1ece4 | -1.052363 | 0.124606 | -0.17767 | 0.150985 | -0.513803 |
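
The output after scrambling matches the output before it because `.loc` looks rows up by index label, not by position, so shuffling the row order does not change what a label-based selection returns. A minimal sketch on a toy table illustrates this; `df`, `labels` and the `random_state` argument are assumptions added for reproducibility:

```python
import pandas as pd

# Toy table with a labelled index (hypothetical data).
df = pd.DataFrame({"feature": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# Shuffle all rows; the index labels travel with their rows.
shuffled = df.sample(frac=1, random_state=0)

# Label-based lookup returns the same rows, in the requested order, in both tables.
labels = ["c", "a"]
same_lookup = df.loc[labels].equals(shuffled.loc[labels])
print(same_lookup)

# Positional access is what reveals the new row order.
print(shuffled.head())
```

To inspect the effect of the shuffle itself, a positional view such as `head()` or `iloc` is therefore more informative than a selection by label.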


The data are now ready for modeling.

---

## Modeling

### Logistic regression:

The data are now ready for logistic regression.  
Logistic regression can be used as an inferential model to identify features potentially associated with the occurrence of a selected outcome (e.g. development of pre-diabetes).  
It also has some predictive value.  
For an example, please refer to the *Logistic_Regression* notebook.
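
As a rough orientation only, the sketch below shows how the libraries imported at the top of this notebook could be combined into a scaled logistic regression with cross-validation. It is not the analysis from the *Logistic_Regression* notebook: the synthetic `X_demo` and `y_demo`, the step names and the choice of five folds are all assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the noisy feature table and the binary target (hypothetical data).
X_demo = rng.normal(size=(300, 10))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Scale every feature to [0, 1], then fit a logistic regression.
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Predictive use: 5-fold cross-validated accuracy.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")

# Inferential use: coefficients from a fit on all rows, one per feature.
pipe.fit(X_demo, y_demo)
coefficients = pipe.named_steps["logreg"].coef_.ravel()
print(coefficients)
```

The sign and relative size of each coefficient indicate the direction and strength of the association with the outcome, on the scaled features.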

## Machine Learning using NNets

The anonymized data should also be used to train neural networks as a state-of-the-art predictive model.  
This would be the natural follow-up to this project, but it is beyond the scope of this presentation due to time limitations.

---