### Healthcare – Patient Data Accuracy

**Task 1**: Patient Record Accuracy Assessment

**Objective**: Find inaccurate or incomplete fields in patient records and define checks that catch them.

**Steps**:
1. Examine a sample patient dataset for common inaccuracies.
2. Identify at least three common issues, such as missing dosages, malformed contact numbers, or unrecognized diagnosis codes.
3. Propose validation measures to ensure data accuracy at the point of entry.

In [1]:
import pandas as pd
import re

# Step 1: Sample patient dataset
data = {
    "PatientID": [101, 102, 103, 104],
    "Name": ["Alice Smith", "Bob Jones", "Charlie Brown", "Diana Prince"],
    "DateOfBirth": ["1985-04-12", "1970-11-23", "1995-06-15", "1982-09-10"],
    "Medication": ["Aspirin", "Ibuprofen", "Paracetamol", "Aspirin"],
    "Dosage_mg": [100, 200, None, 100],  # Missing dosage for patient 103
    "DiagnosisCode": ["I10", "E11", "E11", "Z99"],  # Z99 is not in the example whitelist used by validate_diagnosis_code
    "ContactNumber": ["1234567890", "12345AB789", "9876543210", "1234567890"]  # Letters in the phone number for patient 102
}

patients_df = pd.DataFrame(data)

# Step 2: Define validation functions

def validate_dosage(row):
    # Dosage must not be null or <= 0
    return pd.notna(row['Dosage_mg']) and row['Dosage_mg'] > 0

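def validate_phone(phone):
    # Phone must be a string of exactly 10 digits; fullmatch also rejects a trailing newline
    return isinstance(phone, str) and re.fullmatch(r'\d{10}', phone) is not None

def validate_diagnosis_code(code):
    # Illustrative whitelist of ICD-10 codes; a code outside this set is flagged for review,
    # which does not by itself mean the code is invalid in the full ICD-10 code set
    valid_codes = {"I10", "E11", "J45", "K21"}
    return code in valid_codes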

# Step 3: Apply validations

patients_df['DosageValid'] = patients_df.apply(validate_dosage, axis=1)
patients_df['PhoneValid'] = patients_df['ContactNumber'].apply(validate_phone)
patients_df['DiagnosisValid'] = patients_df['DiagnosisCode'].apply(validate_diagnosis_code)

# Step 4: Print invalid records for review

print("Invalid Dosage entries:")
print(patients_df.loc[~patients_df['DosageValid'], ['PatientID', 'Dosage_mg']])

print("\nInvalid Phone entries:")
print(patients_df.loc[~patients_df['PhoneValid'], ['PatientID', 'ContactNumber']])

print("\nInvalid Diagnosis Code entries:")
print(patients_df.loc[~patients_df['DiagnosisValid'], ['PatientID', 'DiagnosisCode']])


Invalid Dosage entries:
   PatientID  Dosage_mg
2        103        NaN

Invalid Phone entries:
   PatientID ContactNumber
1        102    12345AB789

Invalid Diagnosis Code entries:
   PatientID DiagnosisCode
3        104           Z99
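
The cell above flags problems after the records already exist, whereas step 3 asks for validation at the point of entry. A minimal sketch of that idea follows, assuming each new record arrives as a plain dictionary with the same field names as the sample dataset. The function name `validate_new_record`, the error messages, and the small code whitelist are illustrative assumptions, not part of the original notebook.

```python
import re
from datetime import date, datetime

# Assumption: an illustrative whitelist; a real system would load the full ICD-10 code set.
VALID_DIAGNOSIS_CODES = {"I10", "E11", "J45", "K21"}
PHONE_PATTERN = re.compile(r"[0-9]{10}")


def validate_new_record(record):
    """Return a list of problems in one record; an empty list means it can be saved."""
    errors = []

    dosage = record.get("Dosage_mg")
    # bool is a subclass of int, so it is excluded explicitly; NaN fails the "> 0" test
    if isinstance(dosage, bool) or not isinstance(dosage, (int, float)) or not dosage > 0:
        errors.append("Dosage_mg must be a positive number")

    phone = record.get("ContactNumber")
    if not isinstance(phone, str) or not PHONE_PATTERN.fullmatch(phone):
        errors.append("ContactNumber must be exactly 10 digits")

    if record.get("DiagnosisCode") not in VALID_DIAGNOSIS_CODES:
        errors.append("DiagnosisCode is not in the accepted code list")

    try:
        dob = datetime.strptime(str(record.get("DateOfBirth")), "%Y-%m-%d").date()
        if dob.year <= 1900 or dob > date.today():
            errors.append("DateOfBirth is outside the plausible range")
    except ValueError:
        errors.append("DateOfBirth must be a valid date in YYYY-MM-DD format")

    return errors


# Hypothetical records: one clean, one with three problems
good_record = {"PatientID": 105, "Dosage_mg": 100, "ContactNumber": "1234567890",
               "DiagnosisCode": "I10", "DateOfBirth": "1985-04-12"}
bad_record = {"PatientID": 106, "Dosage_mg": 0, "ContactNumber": "555-0100",
              "DiagnosisCode": "E11", "DateOfBirth": "1990-02-30"}

for rec in (good_record, bad_record):
    problems = validate_new_record(rec)
    print(rec["PatientID"], "accepted" if not problems else problems)
```

Returning a list of messages rather than raising on the first failure lets an entry form show every problem at once, so the person entering the data can correct the record in a single pass.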


**Task 2**: Implement Healthcare Data Quality Checks

**Objective**: Keep health records accurate by running the same set of validation checks automatically across the whole patient dataset.

**Steps**:
1. Develop a validation workflow for patient data.
2. Automate the checks for common errors, here with pandas and regular expressions.

In [2]:
import pandas as pd
import re

# Sample patient dataset
data = {
    "PatientID": [101, 102, 103, 104],
    "Name": ["Alice Smith", "Bob Jones", "Charlie Brown", "Diana Prince"],
    "DateOfBirth": ["1985-04-12", "1970-11-23", "1995-06-15", "1982-09-10"],
    "Medication": ["Aspirin", "Ibuprofen", "Paracetamol", "Aspirin"],
    "Dosage_mg": [100, 200, None, 100],  # Missing dosage for patient 103
    "DiagnosisCode": ["I10", "E11", "E11", "Z99"],  # Z99 is not in the example whitelist used by validate_diagnosis_code
    "ContactNumber": ["1234567890", "12345AB789", "9876543210", "1234567890"]
}

patients_df = pd.DataFrame(data)

# Validation functions
def validate_dosage(row):
    return pd.notna(row['Dosage_mg']) and row['Dosage_mg'] > 0

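def validate_phone(phone):
    # Exactly 10 digits; fullmatch also rejects a trailing newline, and non-strings fail
    return isinstance(phone, str) and re.fullmatch(r'\d{10}', phone) is not None

def validate_diagnosis_code(code):
    # Illustrative whitelist; a real system would check against the full ICD-10 code set
    valid_codes = {"I10", "E11", "J45", "K21"}
    return code in valid_codes

def validate_name(name):
    # Simple check: letters, spaces, hyphens, and apostrophes only (e.g., "O'Brien");
    # accented characters are still rejected by this ASCII-only pattern
    return isinstance(name, str) and re.fullmatch(r"[A-Za-z][A-Za-z '\-]*", name) is not None

def validate_dob(dob):
    # Require strict YYYY-MM-DD format, a realistic year (> 1900), and no future dates
    try:
        date = pd.to_datetime(dob, format="%Y-%m-%d")
    except (ValueError, TypeError):
        return False
    return bool(pd.notna(date) and date.year > 1900 and date <= pd.Timestamp.today())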

# Workflow function to run all checks
def run_healthcare_data_quality_checks(df):
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified
    df['DosageValid'] = df.apply(validate_dosage, axis=1)
    df['PhoneValid'] = df['ContactNumber'].apply(validate_phone)
    df['DiagnosisValid'] = df['DiagnosisCode'].apply(validate_diagnosis_code)
    df['NameValid'] = df['Name'].apply(validate_name)
    df['DOBValid'] = df['DateOfBirth'].apply(validate_dob)
    
    # Collect all invalid records for each check
    invalid_records = {
        'Dosage': df.loc[~df['DosageValid'], ['PatientID', 'Dosage_mg']],
        'Phone': df.loc[~df['PhoneValid'], ['PatientID', 'ContactNumber']],
        'Diagnosis': df.loc[~df['DiagnosisValid'], ['PatientID', 'DiagnosisCode']],
        'Name': df.loc[~df['NameValid'], ['PatientID', 'Name']],
        'DOB': df.loc[~df['DOBValid'], ['PatientID', 'DateOfBirth']]
    }
    
    return invalid_records

# Run the validation workflow
invalids = run_healthcare_data_quality_checks(patients_df)

# Report invalid data for review and correction
for check, records in invalids.items():
    print(f"\nInvalid records for {check}:")
    if not records.empty:
        print(records)
    else:
        print("No issues found.")




Invalid records for Dosage:
   PatientID  Dosage_mg
2        103        NaN

Invalid records for Phone:
   PatientID ContactNumber
1        102    12345AB789

Invalid records for Diagnosis:
   PatientID DiagnosisCode
3        104           Z99

Invalid records for Name:
No issues found.

Invalid records for DOB:
No issues found.
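
The workflow above reports each check separately, so a reviewer has to scan five tables to see everything wrong with one patient. One possible extension, sketched below, drives the checks from a rule table and groups the failures by patient. To keep the sketch dependency-free it works on plain dictionaries, such as those produced by `patients_df.to_dict('records')`. The names `RULES` and `summarize_issues` are hypothetical, and only three of the five checks are reproduced.

```python
import re

# Hypothetical rule table: column name -> (check label, predicate that returns True when valid).
# Adding a check means adding one entry here; the workflow function stays unchanged.
RULES = {
    "Dosage_mg": ("Dosage", lambda v: isinstance(v, (int, float)) and v > 0),
    "ContactNumber": ("Phone", lambda v: isinstance(v, str)
                      and re.fullmatch(r"[0-9]{10}", v) is not None),
    # Assumption: the same illustrative whitelist as in the cells above
    "DiagnosisCode": ("Diagnosis", lambda v: v in {"I10", "E11", "J45", "K21"}),
}


def summarize_issues(records, rules=RULES):
    """Map each PatientID to the labels of the checks that record failed."""
    issues = {}
    for record in records:
        failed = [label for column, (label, is_valid) in rules.items()
                  if not is_valid(record.get(column))]
        if failed:
            issues[record["PatientID"]] = failed
    return issues


# The sample dataset, restated as a list of records
records = [
    {"PatientID": 101, "Dosage_mg": 100, "ContactNumber": "1234567890", "DiagnosisCode": "I10"},
    {"PatientID": 102, "Dosage_mg": 200, "ContactNumber": "12345AB789", "DiagnosisCode": "E11"},
    {"PatientID": 103, "Dosage_mg": None, "ContactNumber": "9876543210", "DiagnosisCode": "E11"},
    {"PatientID": 104, "Dosage_mg": 100, "ContactNumber": "1234567890", "DiagnosisCode": "Z99"},
]

issues = summarize_issues(records)
for patient_id, failed in issues.items():
    print(f"Patient {patient_id}: failed {', '.join(failed)}")
```

Records that pass every rule are left out of the result, so an empty dictionary means the dataset is clean with respect to the rules that were supplied.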
