## Build DB

This notebook builds the completed exam datasets from the raw exam data and the patient data.

In [1]:
import pandas as pd
import numpy as np

db_raw = pd.read_excel('DBRaw.xlsx', index_col=0)
all_patients_data = pd.read_excel('.\\out\\patients.xlsx')

We separate fully completed patients from completable patients. Fully completed patients go straight into DBClean. For each completable patient, we find the first row with no missing values, discard any rows before it, and forward fill the patient's exams from that row onwards. We then add that patient to DBCompleted, so their first complete row becomes their first row in the dataset. A completable patient with no complete row contributes no rows.

In [2]:
completable_patients = all_patients_data[all_patients_data['Destino'] == 'DBCompleted']
clean_patients = all_patients_data[all_patients_data['Destino'] == 'DBClean']
db_clean = db_raw[db_raw['Código'].isin(clean_patients['Código'])]
db_completed_initial = db_raw[db_raw['Código'].isin(completable_patients['Código'])]
db_completed_list = []

for idx, patient in completable_patients.iterrows():
    # All raw rows belonging to this patient, in their original order
    current_patient_data = db_completed_initial[db_completed_initial['Código'] == patient['Código']]

    # Count the rows that precede the first row with no missing values
    oldest_completed_entry_index = 0

    for index, row in current_patient_data.iterrows():
        if not row.isna().any():
            break
        oldest_completed_entry_index += 1

    # Drop the leading incomplete rows, then forward fill the remaining gaps
    current_patient_data = current_patient_data.iloc[oldest_completed_entry_index:]
    db_completed_list.append(current_patient_data.ffill())

db_completed = pd.concat(db_completed_list, ignore_index=True)
db_cleanPlusCompleted = pd.concat([db_clean, db_completed], ignore_index=True)
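
To make the trimming and forward-fill step easier to follow, here is a minimal sketch on toy data. It is not the notebook's own code: the `ExamA` and `ExamB` columns, the `trim_and_fill` helper, and the use of `groupby` in place of the nested `iterrows` loops are assumptions made for illustration. Only the `Código` column name comes from the data above.

```python
import numpy as np
import pandas as pd

# Toy data: 'ExamA' and 'ExamB' are hypothetical column names.
toy = pd.DataFrame({
    'Código': [1, 1, 1, 1, 2, 2],
    'ExamA':  [np.nan, 5.0, np.nan, 7.0, 1.0, np.nan],
    'ExamB':  [2.0, 3.0, np.nan, np.nan, np.nan, 4.0],
})

def trim_and_fill(patient_rows):
    """Drop rows before the first fully populated row, then forward fill."""
    complete = patient_rows.notna().all(axis=1).to_numpy()
    if not complete.any():
        # No complete row: this patient contributes nothing.
        return patient_rows.iloc[0:0]
    first_complete = int(complete.argmax())  # position of the first True
    return patient_rows.iloc[first_complete:].ffill()

# Process each patient separately so values never leak between patients.
pieces = [trim_and_fill(rows) for _, rows in toy.groupby('Código', sort=False)]
pieces = [p for p in pieces if not p.empty]
toy_completed = pd.concat(pieces, ignore_index=True)
print(toy_completed)
```

In this toy case, patient 1 loses its first (incomplete) row and has its later gaps filled from the preceding rows, while patient 2 has no complete row and is dropped entirely.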

Finally, we export the resulting datasets to spreadsheets.

In [3]:
db_completed.to_excel('out2/DBCompleted.xlsx')
db_clean.to_excel('out2/DBClean2.xlsx')
db_cleanPlusCompleted.to_excel('out2/DBCleanPlusCompleted.xlsx')

In [4]:
print(len(db_completed['Código'].unique()))
print(len(db_clean['Código'].unique()))
print(len(db_clean))
print(len(db_completed))
print(len(db_clean) + len(db_completed))

3994
385
4599
69321
73920
