# Data Cleaning Report

## Introduction
This notebook outlines the data issues found in the healthcare dataset and the steps taken to clean and preprocess the data.

## 1. Ingest Data

In [None]:
import pandas as pd

# Ingest raw data
input_filepath = "../data/Data Insights - Synthetic Dataset.csv"
input_df = pd.read_csv(input_filepath)
print(input_df.head())  # Display the first few rows

## 2. Data Issues
A review of the data reveals a number of issues.

### 2.1 Age discrepancy
Issue: For some patient entries, the calculated age (AdmissionDate - DateOfBirth) does not match the value in the Age column.

Solution: Create a column for the calculated age and flag the records where calculated_age != Age. Also identify where the Age value is sourced from and why the discrepancy arises (an old data source? Manual error, as patients enter both their Age and their DoB?).
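
As an illustration of the comparison described above, here is a minimal sketch of an exact age calculation (completed years at admission). The helper name `age_at_admission` and the toy rows are assumptions for illustration only and are not part of the cleansing script in this notebook.

```python
import pandas as pd

def age_at_admission(dob: pd.Series, admission: pd.Series) -> pd.Series:
    """Completed years between date of birth and admission date."""
    years = admission.dt.year - dob.dt.year
    # Subtract one year when the birthday has not yet occurred in the admission year
    before_birthday = (
        (admission.dt.month < dob.dt.month)
        | ((admission.dt.month == dob.dt.month) & (admission.dt.day < dob.dt.day))
    )
    return years - before_birthday.astype(int)

# Toy rows for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    "DateOfBirth": pd.to_datetime(["15/06/1980", "01/01/1990"], format="%d/%m/%Y"),
    "AdmissionDate": pd.to_datetime(["14/06/2020", "01/01/2020"], format="%d/%m/%Y"),
    "Age": [40, 30],
})
demo["calculated_age"] = age_at_admission(demo["DateOfBirth"], demo["AdmissionDate"])
demo["age_discrepancy"] = demo["calculated_age"] != demo["Age"]
print(demo[["calculated_age", "Age", "age_discrepancy"]])
```

The first toy patient is admitted one day before their birthday, so a year-only subtraction would report 40 while the completed-years calculation reports 39.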

### 2.2 Pharmacy Charge
Issue: The PharmacyCharge column appears to be incorrect, as its values are far too large to represent sensible dollar amounts (e.g. 1.66E+72). Comparing against the range of values typically found in the other charge-related columns confirms that this column is anomalous. There are also some string values in this column (e.g. "ERROR") that could cause issues down the line if we perform numerical operations on it.

Solution: Trace back the source of the data in this column and try to identify where these values are coming from. Understand which systems/stakeholders the data is sourced from, and clarify whether the values are indeed correct or whether there is an error, as we suspect.
From a technical perspective, we can convert the values in this column to a numeric type. However, while this would allow us to perform numeric calculations on the column, we should refrain from doing so until we clarify exactly what the values represent and apply any data fixes.
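
Before coercing the column, it may help to record which values are unparseable and which are implausibly large, so they can be raised with the data owners. The sketch below uses toy values, and the `PLAUSIBLE_MAX` ceiling is a hypothetical placeholder rather than an agreed business rule.

```python
import pandas as pd

# Toy values for illustration only (not taken from the dataset)
raw = pd.Series(["120.50", "ERROR", "1.66E+72", None, "89"], name="PharmacyCharge")
numeric = pd.to_numeric(raw, errors="coerce")

# Present in the raw data but not parseable as a number (e.g. "ERROR")
unparseable = raw.notna() & numeric.isna()

# Parseable but far larger than any plausible charge
PLAUSIBLE_MAX = 1e6  # hypothetical threshold, to be agreed with the data owners
implausible = numeric.abs() > PLAUSIBLE_MAX

report = pd.DataFrame({
    "raw": raw,
    "numeric": numeric,
    "unparseable": unparseable,
    "implausible": implausible,
})
print(report)
```

Keeping the two flags separate distinguishes values that are not numbers at all from values that are numbers but cannot be real charges.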

### 2.3 Other issues
A number of other issues are also present in the data:
Issue: Post codes don't map to any Australian suburbs (e.g. '58698'). NOTE: This could be an artefact of the synthetic dataset.

Solution: If the same issue were found in the real dataset, we would trace back the source of this data and find out why the post codes are incorrect (are patients filling them out incorrectly? Is there an issue in the ETL for this data?). We would also try to find other sources of truth for postcode (e.g. a 'home address' field).
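
A simple first check is on format alone, since Australian postcodes are four digits. The sketch below uses toy values, and the column name `Postcode` is an assumption. It only tests the format and does not confirm that a postcode maps to a real suburb.

```python
import pandas as pd

# Toy values for illustration only; "Postcode" is an assumed column name
postcodes = pd.Series(["2000", "58698", "0800", "3OOO", None], name="Postcode")

# Exactly four digits; missing values are treated as invalid
valid_format = postcodes.str.fullmatch(r"\d{4}", na=False)

print(postcodes[~valid_format])
```

Postcodes should be read as strings for this check, because reading them as integers would drop leading zeros (e.g. '0800').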

Issue: Blanks in charge-related columns. It is unclear whether these should be assumed to be 0 or whether they are due to missing data.

Solution: As part of the ETL for this data, we should clearly mark which entries are truly '0' values and which are missing values.
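
To see how large this issue is, missing and zero values can be counted separately for each charge column. This is a minimal sketch on a toy frame, and `OtherCharge` is a placeholder column name rather than a column known to be in the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; "OtherCharge" is a placeholder name
charges = pd.DataFrame({
    "PharmacyCharge": [10.0, np.nan, 0.0, np.nan],
    "OtherCharge": [0.0, 0.0, np.nan, 5.0],
})

# Count missing, explicit zero and positive values per column
summary = pd.DataFrame({
    "missing": charges.isna().sum(),
    "zero": (charges == 0).sum(),
    "positive": (charges > 0).sum(),
})
print(summary)
```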

Issue: The Readmission28days column has 'Yes' and 'No' values; however, the vast majority of entries are blank. This suggests the data is incomplete.
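
The share of blank entries can be quantified with a count that keeps missing values. The sketch below uses toy values only and is not a result from the dataset.

```python
import pandas as pd

# Toy values for illustration only (not taken from the dataset)
readmit = pd.Series(["Yes", None, "No", None, None], name="Readmission28days")

# Share of blank entries, and counts per value including blanks
blank_share = readmit.isna().mean()
counts = readmit.value_counts(dropna=False)

print(f"Blank share: {blank_share:.1%}")
print(counts)
```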

## 3. Script to cleanse file

To address some of the issues above, I have written a small script that computes the calculated age of each patient and compares it to the Age value, flagging discrepancies of more than one year.
The script also converts the PharmacyCharge column to a numeric type, addressing the issue of string values in the column. The magnitude of the values in this column is still suspicious and requires further investigation.

In [None]:
import pandas as pd
import numpy as np

def comprehensive_clean(df):
    """
    Cleanse the healthcare dataset: flag age discrepancies and
    convert PharmacyCharge to a numeric type.
    """
    # Work on a copy to avoid modifying the original DataFrame
    df_clean = df.copy()

    # 1. Recompute age from the date columns
    df_clean['DateOfBirth'] = pd.to_datetime(df_clean['DateOfBirth'], format='%d/%m/%Y')
    df_clean['AdmissionDate'] = pd.to_datetime(df_clean['AdmissionDate'], format='%d/%m/%Y')
    # Difference in calendar years only, so this can overstate the true age by
    # one year when the admission falls before the patient's birthday
    df_clean['calculated_age'] = df_clean['AdmissionDate'].dt.year - df_clean['DateOfBirth'].dt.year

    # Flag age discrepancies; the 1-year tolerance absorbs the approximation above
    df_clean['age_discrepancy'] = (df_clean['calculated_age'] - df_clean['Age']).abs() > 1

    # 2. Address PharmacyCharge data issue
    # Convert to float; blanks and non-numeric strings (e.g. "ERROR") become NaN
    df_clean['PharmacyCharge'] = pd.to_numeric(df_clean['PharmacyCharge'], errors='coerce')

    # Save the cleansed data
    df_clean.to_csv('../data/cleansed_data.csv', index=False)

    return df_clean

# Run the cleaning function over the raw data
cleansed_df = comprehensive_clean(input_df)
print(cleansed_df.head()) 


## 4. Summary


After applying the data cleansing steps outlined above, we can calculate the prevalence of each data issue.

### Age discrepancy
16.4% of total entries have been flagged as having age discrepancies. This suggests the issue is common and should be investigated. In the short term, we may choose to omit these entries from any further modelling/training until the issue is fixed or we gain confidence in the accuracy of the data.

### PharmacyCharge data type
44% of total entries had a PharmacyCharge value of type string. These have all now been converted to a numeric type, with any value that cannot be parsed as a number (e.g. "ERROR") becoming NaN. It would be wise to look upstream to the ETL process and see whether we can define the data type at the point of ingestion, to catch any issues earlier in the data pipeline.
The magnitude of these values remains an open issue. Further investigation is needed, potentially involving other areas of the data ingestion team, to understand where this data is coming from and why the values are so large.

### Further QA work
Having identified these issues and others (e.g. blank charge values), it would be valuable to build a 'data quality score' and compute it for each entry in our data table. This would give us clarity on the quality of our data, allow us to monitor it over time, identify any particular time periods or changes that worsen quality, and filter the data to use only high-quality entries where appropriate (e.g. for training models).
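
One possible shape for such a score is the share of quality checks that each row passes. The sketch below is an assumption about how the score could be defined, not an agreed design: the function `quality_score`, the check names, the toy rows and the 1e6 charge ceiling are all hypothetical.

```python
import numpy as np
import pandas as pd

def quality_score(df: pd.DataFrame, checks: dict) -> pd.Series:
    """Share of checks each row passes (1.0 means every check passed)."""
    passed = pd.DataFrame({name: check(df) for name, check in checks.items()})
    return passed.mean(axis=1)

# Toy rows for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    "age_discrepancy": [False, True, False],
    "PharmacyCharge": [120.5, np.nan, 1.66e72],
})

# Each check returns a boolean Series: True where the row passes
checks = {
    "age_consistent": lambda d: ~d["age_discrepancy"],
    "pharmacy_present": lambda d: d["PharmacyCharge"].notna(),
    "pharmacy_plausible": lambda d: d["PharmacyCharge"].abs() < 1e6,  # assumed ceiling
}

demo["quality_score"] = quality_score(demo, checks)
print(demo)
```

Equal weighting keeps the score easy to explain; checks could be weighted differently once their relative importance is agreed.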

In [None]:
# Proportion of entries flagged with an age discrepancy
age_discrepancy_proportion = cleansed_df['age_discrepancy'].mean()

# Proportion of PharmacyCharge entries stored as strings, before and after cleansing
raw_pharmacy_charge_str_proportion = input_df['PharmacyCharge'].apply(lambda x: isinstance(x, str)).mean()
cleansed_pharmacy_charge_str_proportion = cleansed_df['PharmacyCharge'].apply(lambda x: isinstance(x, str)).mean()

print(f"Proportion of 'age_discrepancy' that are True: {age_discrepancy_proportion:.2%}")
print(f"Proportion of raw 'PharmacyCharge' entries that are of type str: {raw_pharmacy_charge_str_proportion:.2%}")
print(f"Proportion of cleansed 'PharmacyCharge' entries that are of type str: {cleansed_pharmacy_charge_str_proportion:.2%}")