# 2. Read Data from SQL Database

**RECAP:** 12 output directories contain 1,000,000 to 6,000,000 entries per table for patient information, allergies, encounters, medications, conditions, careplans, immunizations, procedures and observations, parsed from JSON files that follow the FHIR convention for medical records. These tables were imported into a SQL database (`prediabetes`), and faulty entries were removed from each imported table. The clean tables were then merged into a single table per category in SQL: patients, allergies, careplans, conditions, immunizations, medications and procedures. The observations table was not processed in SQL because it was too large to handle. The encounters tables were not processed either, because their information is redundant with the other tables.  
This notebook reads the SQL tables and reshapes them into dataframes whose format and values are suitable for model building.

## Prep

### Import modules

In [1]:
import numpy as np
import pandas as pd
import mysql.connector

### Functions

<a id='load_sql'></a> Define a function that imports a table from a SQL database as a pandas dataframe: **`load_sql_table`**.

In [2]:
# Read table from mysql:
def load_sql_table (sql_table, databs = 'prediabetes'):
    # Establish a connection to the MySQL database
    conn = mysql.connector.connect (
        host = 'localhost',
        user = 'root',
        password = 'DrVassil08.2018',
        database = databs
    )
    # Define SQL query
    query = 'SELECT * FROM {}'.format(sql_table)
    # Read table in df
    df = pd.read_sql (query, conn)
    # close connection
    conn.close()
    # return dataframe
    return df
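The same query-then-close pattern can be tried without a MySQL server. The sketch below is only an illustration: it uses an in-memory SQLite database from the standard library as a stand-in, and the helper `load_table` and the two made-up rows are assumptions, not part of the notebook.

```python
import sqlite3

import pandas as pd


def load_table(conn, table_name):
    """Return every row of `table_name` as a pandas DataFrame."""
    # Table names cannot be bound as query parameters, so the name is
    # interpolated into the query; only pass trusted table names.
    query = 'SELECT * FROM {}'.format(table_name)
    return pd.read_sql(query, conn)


# Build a tiny stand-in database in memory (made-up rows)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE patients (patient_id TEXT, gender TEXT)')
conn.executemany('INSERT INTO patients VALUES (?, ?)',
                 [('p1', 'F'), ('p2', 'M')])
conn.commit()

demo_df = load_table(conn, 'patients')
conn.close()
```

As in `load_sql_table`, the connection is closed as soon as the dataframe has been read.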

<a id='pivot'></a>Define a function **`pivot_L_table`** that pivots a large pandas dataframe in batches. Each category found in one column (`cols`) becomes a separate column of the new dataframe, and the values of another column (`vals`) populate it, summed per index value (`idx`). The aim is to end up with one value per category for each patient.

In [3]:
# Pivot Large table:

def pivot_L_table (df, cols, vals, idx = 'patient_id', batch_size = 10000):
    
    batch_size = min(batch_size, len(df))  # Adjust batch size dynamically
    
    # Define a dataframe place holder
    result = pd.DataFrame()

    # Iterate over the DataFrame in batches
    while len(df) > 0:
        
        batch = df[:batch_size]

        # Create a pivot table for the current batch
        pivot_table = batch.pivot_table(
            index = idx,
            columns = cols,
            values = vals,
            aggfunc = 'sum',
            fill_value = 0
        )

        # Append the pivot table to the result DataFrame
        result = pd.concat([result, pivot_table], axis=0)

        # Delete the processed rows from the original DataFrame
        df = df[batch_size:]
    # Turn the index (idx, patient_id by default) back into a regular column
    result[idx] = result.index
    result.reset_index(drop = True, inplace = True)
    return result
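Because batches are cut by row count, rows belonging to one patient can fall into different batches, so the concatenated result may hold several partial rows for the same patient. That is why the cells further down group by `patient_id` and sum after calling `pivot_L_table`. The sketch below illustrates this on made-up data with a deliberately tiny batch size; the column names mirror the medications table but the values are invented.

```python
import pandas as pd

# Made-up long-format data: patient p1 has rows in both batches
toy = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p2', 'p1'],
    'medication': ['A', 'B', 'A', 'A'],
    'days_on_medication': [10, 5, 7, 3],
})

batch_size = 2
pieces = []
for start in range(0, len(toy), batch_size):
    batch = toy.iloc[start:start + batch_size]
    # Pivot each batch on its own, as pivot_L_table does
    pieces.append(batch.pivot_table(index='patient_id',
                                    columns='medication',
                                    values='days_on_medication',
                                    aggfunc='sum',
                                    fill_value=0))

# Stacking the batches leaves two partial rows for p1
stacked = pd.concat(pieces, axis=0)

# Filling the gaps and summing per patient merges the partial rows
per_patient = stacked.fillna(0).groupby(level=0).sum()
```

After the final `groupby`, each patient appears once and the totals match what a single pivot over the whole table would give.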

---

## Read tables, tidy (exclude unnecessary columns) and save

Read tables as pandas dataframes [directly](#load_sql) from SQL database. Where applicable, pivot the table as described [above](#pivot). Remove columns deemed unnecessary. Save the processed table in *pickle* format for future processing.

### Patients

Load the *patients* table from the SQL database. Much of the information in that table, such as address and driver's license, is not kept, and some of the columns kept here will also be removed during later processing steps. The dataframe is saved as a `.pkl` file for future processing.

In [None]:
# Read patients table
patients = load_sql_table ('patients')

# Remove unnecessary columns
patients_df = patients[['patient_id', 'marital', 'race', 'ethnicity', 'gender']]

# Save the patients_df as a pickle table
patients_df.to_pickle('patients_df.pkl')

# Remove dataframe to liberate memory
del patients_df

### Allergies

In [None]:
# Read allergies data set (allegies_ds) table
allergies_df = load_sql_table ('allegies_ds')

# Count fully duplicated rows
allergies_df.duplicated().sum()

The *allergies* dataframe contains no duplicates.  
Obtain a one-hot-encoded allergies table (i.e. dummy variables for each allergy).

In [None]:
# One-hot encode allergies for every patient
description_allergy_1hot = pd.get_dummies (allergies_df['description_allergy'])

# Concatenate the one-hot encoding with the patient_id column:
allergies_df_1hot_ = pd.concat (
    [allergies_df['patient_id'], 
     description_allergy_1hot],
    axis = 1
)
allergies_df_1hot_

# Group by patient_id and sum the one-hot encoded columns
allergies_df_1hot = allergies_df_1hot_.groupby ('patient_id').sum().reset_index()

# Clean up to save memory
del allergies_df, description_allergy_1hot, allergies_df_1hot_
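As a minimal sketch of the encoding step above, the snippet below runs the same `get_dummies`, `concat` and `groupby` sequence on three made-up rows. Passing `dtype=int` is an addition for illustration: it makes the dummy columns integers explicitly, since recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Made-up allergies: p1 has two allergies, p2 has one
toy = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p2'],
    'description_allergy': ['pollen', 'peanut', 'pollen'],
})

# One dummy column per allergy, one row per original entry
dummies = pd.get_dummies(toy['description_allergy'], dtype=int)

# Attach patient_id, then collapse to one row per patient
encoded = (
    pd.concat([toy['patient_id'], dummies], axis=1)
    .groupby('patient_id')
    .sum()
    .reset_index()
)
```

Each patient ends up with a single row holding a 1 for every allergy recorded and a 0 otherwise (or a count, if the same allergy were listed more than once).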

Save the one-hot-encoded allergies table to a pickle file for later use.

In [None]:
# Save one-hot encoded dataframe in pickle format:
allergies_df_1hot.to_pickle ('allergies_df_1hot.pkl')

# Clean up to save memory
del allergies_df_1hot

### Careplans

In [4]:
df = load_sql_table('careplans_ds')


Calculate the total duration, in days, of each careplan for each patient. For instance, if a patient followed careplan1 for 10 days in 2018 and 5 days in 2019, the total for that careplan is 15 days. Then pivot the table so that each row is a single patient and each column is a careplan, with values giving the summed careplan duration in days.

In [6]:
grouped_df = df.groupby(['patient_id', 'careplan_n_reason'])['careplan_duration'].sum()

# Pivot the grouped DataFrame
pivot_table = grouped_df.reset_index().pivot(index='patient_id', 
                                             columns='careplan_n_reason', 
                                             values='careplan_duration')

Upon pivoting, there will be many patients with NaN for many careplans. The null values will be replaced with **0** indicating the patient has not undergone a single day of the specific careplan.

In [15]:
# Replace NaN with 0
pivot_table.fillna(0, inplace = True)
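The careplan steps can be checked on a small example. The sketch below reproduces the 10 + 5 = 15 days case described above on made-up rows; the careplan names are invented for illustration.

```python
import pandas as pd

# Made-up careplans: p1 followed 'diet' twice (10 and 5 days)
toy = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p2'],
    'careplan_n_reason': ['diet', 'diet', 'exercise'],
    'careplan_duration': [10, 5, 30],
})

# Total days per patient and careplan
totals = toy.groupby(['patient_id', 'careplan_n_reason'])['careplan_duration'].sum()

# One row per patient, one column per careplan, missing plans set to 0
wide = (
    totals.reset_index()
    .pivot(index='patient_id',
           columns='careplan_n_reason',
           values='careplan_duration')
    .fillna(0)
)
```

Grouping before pivoting matters here: `pivot` raises an error when the same patient and careplan pair appears more than once.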

In [17]:
# save table
pivot_table.to_pickle ('careplans_df.pkl')

# Clean up to save memory
del pivot_table

### Conditions

In [None]:
# Read conditions data set (conditions_ds) table
conditions_df = load_sql_table ('conditions_ds')

# Pivot table
df = pivot_L_table (df = conditions_df, 
               cols = 'condition_description', 
               vals = 'condition_duration'
               )

# Clean up to save memory
del conditions_df

# Replace NaN with 0
df.fillna(0, inplace = True)

# Group for each patient
df = df.groupby ('patient_id').sum()

# Save final table
df.to_pickle ('conditions_df.pkl')

# Clean up to save memory
del df

### Immunizations

<a id='immunizations'></a>For the immunizations table, the engineered feature is the number of days since the last immunization of each type. For the sake of simplicity, only the most recent immunization date is considered. Patients who never received a given immunization have NaN values after pivoting; these are replaced with 80 years expressed in days (29,200 days, a reasonable life expectancy), which is treated as the equivalent of no immunization.

In [None]:
# Read immunizations data set (immunizations_ds) table
df = load_sql_table ('immunizations_ds')

# Sort by days since
df_sort = df.sort_values ('days_since_immunization')

del df

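# Keep only the first row (latest, i.e. fewest days since) for each
# patient and immunization type, so the pivot keeps every type per patient
result = df_sort.groupby(['patient_id', 'immunization_description']).first().reset_index()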

del df_sort

# Pivot the grouped DataFrame
df = result.pivot(index='patient_id',
                  columns='immunization_description',
                  values='days_since_immunization')

# Clean up:
del result

df.reset_index(inplace = True)

# Replace the NaN (no immunization) with the number of days in 80 years
df.fillna (int(80*365), inplace = True)

# Save cleaned up dataframe
df.to_pickle ('immunizations_df.pkl')
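One way to realize the feature described above is sketched below on made-up rows: the most recent immunization of each type is the one with the fewest days since, so taking the minimum per patient and type is assumed to be equivalent to keeping the latest date. The vaccine names and values are invented.

```python
import pandas as pd

# Made-up immunizations: p1 had two flu shots and one tetanus shot
toy = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p1', 'p2'],
    'immunization_description': ['flu', 'flu', 'tetanus', 'flu'],
    'days_since_immunization': [400, 35, 2000, 90],
})

# Fewest days since = most recent shot, per patient and type
latest = toy.groupby(
    ['patient_id', 'immunization_description']
)['days_since_immunization'].min()

# One column per immunization type; a missing type becomes 80 years in days
no_immunization = 80 * 365
wide = latest.unstack().fillna(no_immunization).reset_index()
```

Here p2 never received a tetanus shot, so that cell holds the 29,200-day placeholder.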

### Medications

In [None]:
# Read medications data set (medications_ds) table
df = load_sql_table ('medications_ds')

All medication entries related to *prediabetes* and *diabetes* will be removed to avoid leaking the target into the features: a patient taking a medication for diabetes has already been diagnosed with diabetes and has potentially already passed through the prediabetes stage.

In [None]:
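# Remove all entries of medication for diabetes or prediabetes
# ('diabetes' also matches 'prediabetes'; na=False keeps rows with no reason)
is_diabetes = df['medication_reason'].str.contains('diabetes', case=False, na=False)

# Drop medication_reason (returns a new dataframe, avoiding in-place
# modification of a slice of df)
df_filtered = df[~is_diabetes].drop(columns='medication_reason')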
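A small sketch of this filter on made-up rows follows. It assumes that entries with a missing `medication_reason` should be kept, which is what `na=False` expresses; the medication names and reasons are invented for illustration.

```python
import pandas as pd

# Made-up medication entries, one with no recorded reason
toy = pd.DataFrame({
    'medication': ['metformin', 'lisinopril', 'ibuprofen', 'insulin'],
    'medication_reason': ['Diabetes', 'Hypertension', None, 'Prediabetes'],
})

# Case-insensitive match; 'diabetes' also matches 'Prediabetes'.
# na=False treats a missing reason as "not diabetes-related".
mask = toy['medication_reason'].str.contains('diabetes', case=False, na=False)

# Keep the non-matching rows and drop the reason column
kept = toy[~mask].drop(columns='medication_reason')
```

Without `na=False`, a missing reason yields a missing value in the mask, which cannot be negated or used for boolean indexing reliably.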

The engineered feature for medications is the total number of days a patient spent on each medication. After pivoting, many patients will have null values for a variety of medications, indicating they were never on the corresponding medication. These null values will be replaced with 0 (i.e. the patient has never taken the medication).

In [None]:
# Pivot table to get days of medication for each medication per patient_id
df = pivot_L_table (df = df_filtered, 
                    cols = 'medication', 
                    vals = 'days_on_medication', 
                    idx = 'patient_id', batch_size = 5000)

# Save the intermediate pivoted dataframe (overwritten by the final table below)
df.to_pickle ('medications_df.pkl')

# Clean up:
del df_filtered

# Replace NaN with 0
df.fillna(0, inplace = True)

# Group for each patient
df_final = df.groupby ('patient_id').sum()

# Clean up:
del df

# Re-create patient_id column
df_final['patient_id'] = df_final.index

# Save table
df_final.to_pickle ('medications_df.pkl')

### Procedures

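The *procedures* feature is based on days since each procedure, in the same spirit as *immunizations* [above](#immunizations). Note that the code below differs from the immunizations cell in two ways: repeated procedures are summed rather than reduced to the most recent one, and missing procedures are filled with 0 rather than with the 80-year placeholder.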

In [None]:
# Read procedures data set (procedures_ds) table
df = load_sql_table ('procedures_ds')

# Pivot table to get days since procedure for each procedure per patient_id
df_pivot = pivot_L_table (df = df, 
                    cols = 'procedure_description', 
                    vals = 'days_since_procedure', 
                    idx = 'patient_id', batch_size = 5000)

# Clean up memory
del df

# Replace NaN with 0
df_pivot.fillna(0, inplace = True)

# Group for each patient
df_final = df_pivot.groupby ('patient_id').sum()

# Clean up:
del df_pivot

# Save cleaned up dataframe
df_final.to_pickle ('procedures_df.pkl')

# Read table:
df_final = pd.read_pickle ('procedures_df.pkl')

df_final

# Reset the index to make 'patient_id' a column again
df_final['patient_id'] = df_final.index

df_final.to_pickle ('procedures_df.pkl')

---