# Generate Healthcare Data for Analysis

Create a mock dataset for 12 ambulatory practices including:
- Visit volume
- Average wait time
- Patient satisfaction score
- Appointment no-show rate
- Follow-up adherence rate
- Staff-to-patient ratio
- Provider productivity (visits per FTE)
- Quality measure compliance (e.g., A1C screening for diabetic patients)

Each row records the metrics for one practice in one quarter.

Refer to the [Data Dictionary](Practice/sr_data_analyst_project1_data_dictionary.md) for the data variables used in this dataset.

In [112]:
# Import Packages
import pandas as pd
import random
import numpy as np
import math
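
The generator functions in this notebook draw from both Python's `random` module and `np.random`, which keep separate internal state, so a run is only repeatable if both are seeded. A minimal sketch, assuming reproducibility is wanted; the helper name `seed_everything` and the seed value are illustrative and not part of the original notebook:

```python
import random
import numpy as np

SEED = 42  # arbitrary illustrative value, not from the original notebook

def seed_everything(seed=SEED):
    """Seed both random number sources used by the generator functions."""
    random.seed(seed)
    np.random.seed(seed)

# Two runs that start from the same seed produce the same draws.
seed_everything()
first = (random.randint(500, 5000), np.random.normal(75, 15))
seed_everything()
second = (random.randint(500, 5000), np.random.normal(75, 15))
print(first == second)
```

Calling the helper once, before any data is generated, should be enough for the rest of the notebook.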

## Define functions to generate random values within specific ranges

### Visit Volume

In [113]:
def visit_volume():
    return random.randint(500,5000)

### Average Patient Wait Time Before Being Seen (in minutes)

In [114]:
def avg_wait_time_min():
    return round(random.uniform(0,60), 1)

### Average patient satisfaction score (0-100 scale)

In [115]:
def patient_satisfaction_score():
    return round(random.uniform(0,100), 1)

### Percentage of scheduled appointments that patients missed
**Solution: Truncated Normal Distribution**

Because samples from a normal distribution can fall outside [0, 100], we need to:
1. Center the distribution at 25.
2. Constrain results between 0 and 100 (truncation).
3. Adjust the standard deviation (`std_dev`) to control the spread.

In [116]:
def no_show_rate(mean=25, std_dev=15, min_val=0, max_val=100):
    # Rejection sampling: redraw until the sample lies inside [min_val, max_val]
    while True:
        sample = np.random.normal(mean, std_dev)
        if min_val <= sample <= max_val:
            return int(round(sample))

### Percentage of patients who completed recommended follow-ups
**Solution: Truncated Normal Distribution**

Because samples from a normal distribution can fall outside [0, 100], we need to:
1. Center the distribution at 75.
2. Constrain results between 0 and 100 (truncation).
3. Adjust the standard deviation (`std_dev`) to control the spread.

In [117]:
def followup_adherence_rate(mean=75, std_dev=15, min_val=0, max_val=100):
    while True:
        sample = np.random.normal(mean, std_dev)
        if min_val <= sample <= max_val:
            return int(round(sample))
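
The no-show and follow-up functions share the same rejection-sampling loop and differ only in where the distribution is centered (25 and 75 in the text above). One way to express that, as a sketch: the helper `truncated_normal_int` and its `rng` parameter are hypothetical names, and it uses NumPy's `default_rng` generator rather than the `np.random.normal` call used in the notebook.

```python
import numpy as np

def truncated_normal_int(mean, std_dev, min_val=0, max_val=100, rng=None):
    """Draw one integer from a normal distribution, redrawing until it lies in [min_val, max_val]."""
    rng = rng if rng is not None else np.random.default_rng()
    while True:
        sample = rng.normal(mean, std_dev)
        if min_val <= sample <= max_val:
            return int(round(sample))

rng = np.random.default_rng(42)  # seed chosen arbitrarily for the example
no_show = [truncated_normal_int(25, 15, rng=rng) for _ in range(1000)]
followup = [truncated_normal_int(75, 15, rng=rng) for _ in range(1000)]
print(min(no_show), max(no_show), min(followup), max(followup))
```

Because rejected draws come mostly from the tail nearest a bound, truncation pulls the realized mean slightly toward the middle of the range, so the sample averages will not land exactly on 25 and 75.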

### Percentage of diabetic patients with documented A1C screening

In [118]:
def A1C_screening_compliance():
    return round(random.uniform(0, 100), 1)

### Total number of full-time equivalent providers

In [119]:
def total_providers_FTE():
    return round(random.uniform(1, 6), 1)

### Total number of full-time equivalent support staff

In [120]:
def total_staff_FTE():
    return round(random.uniform(10, 50), 1)

### Total number of unique patients seen during the quarter

In [121]:
def unique_patients_seen():
    return random.randint(300, 2000)

### Ratio of support staff to unique patients seen
Calculated based on the `total_staff_FTE` and `unique_patients_seen`.

In [122]:
def staff_to_patient_ratio():
    try:
        return round(total_staff_FTE() / unique_patients_seen(), 3)
    except ZeroDivisionError:
        return 0.0

### Average number of visits per full-time equivalent (FTE) provider
Calculated based on `visit_volume` and `total_providers_FTE`.

In [123]:
def provider_productivity():
    try:
        return round(visit_volume() / total_providers_FTE(), 1)
    except ZeroDivisionError:
        return 0.0

## Create lists of possible values for the categorical variables

In [124]:
practice_id = ['AP001', 'AP002', 'AP003', 'AP004', 'AP005', 'AP006', 'AP007', 'AP008', 'AP009', 'AP010', 'AP011', 'AP012']
quarter = ['Q1', 'Q2', 'Q3', 'Q4']
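
If the goal is exactly one row per practice and quarter, rather than labels picked at random for each row, the two lists can be crossed into a grid. This is only a sketch of that alternative design; the names `practice_ids`, `quarters`, and `grid` are illustrative and not from the notebook.

```python
from itertools import product
import pandas as pd

# Build the 12 practice IDs programmatically instead of typing them out.
practice_ids = [f'AP{i:03d}' for i in range(1, 13)]
quarters = ['Q1', 'Q2', 'Q3', 'Q4']

# One row for every (practice, quarter) combination.
grid = pd.DataFrame(list(product(practice_ids, quarters)),
                    columns=['Practice_ID', 'Quarter'])
print(grid.shape)
```

The metric columns could then be generated with one value per row of `grid`, which guarantees every practice appears in every quarter exactly once.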


## Generate Data for Each Column

In [125]:
N_ROWS = 5000  # number of mock rows to generate

data = {
    'Practice_ID': [random.choice(practice_id) for _ in range(N_ROWS)],
    'Quarter': [random.choice(quarter) for _ in range(N_ROWS)],
    'Visit_Volume': [visit_volume() for _ in range(N_ROWS)],
    'Avg_Wait_Time_Min': [avg_wait_time_min() for _ in range(N_ROWS)],
    'Patient_Satisfaction_Score': [patient_satisfaction_score() for _ in range(N_ROWS)],
    'No_Show_Rate': [no_show_rate() for _ in range(N_ROWS)],
    'Followup_Adherence_Rate': [followup_adherence_rate() for _ in range(N_ROWS)],
    'Staff_to_Patient_Ratio': [staff_to_patient_ratio() for _ in range(N_ROWS)],
    'Provider_Productivity': [provider_productivity() for _ in range(N_ROWS)],
    'A1C_Screening_Compliance': [A1C_screening_compliance() for _ in range(N_ROWS)],
    'Total_Providers_FTE': [total_providers_FTE() for _ in range(N_ROWS)],
    'Total_Staff_FTE': [total_staff_FTE() for _ in range(N_ROWS)],
    'Unique_Patients_Seen': [unique_patients_seen() for _ in range(N_ROWS)]
}

## Create a DataFrame

In [126]:
df = pd.DataFrame(data)
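
Note that `staff_to_patient_ratio()` and `provider_productivity()` each draw fresh random inputs, so within a row `Staff_to_Patient_Ratio` will generally not equal `Total_Staff_FTE / Unique_Patients_Seen`, and likewise for `Provider_Productivity`. If row-level consistency matters for the analysis, the derived columns could be recomputed from the row's own values. A sketch on a small stand-in frame with the same column names; the helper `add_derived_columns` and the sample values are illustrative only:

```python
import pandas as pd

# Small stand-in for `df`, using the same column names as the dataset.
demo = pd.DataFrame({
    'Visit_Volume': [1200, 3400, 560],
    'Total_Providers_FTE': [2.5, 4.0, 1.0],
    'Total_Staff_FTE': [12.0, 30.5, 10.0],
    'Unique_Patients_Seen': [800, 1500, 300],
})

def add_derived_columns(frame):
    """Return a copy with both derived metrics recomputed from each row's own columns."""
    out = frame.copy()
    out['Staff_to_Patient_Ratio'] = (
        out['Total_Staff_FTE'] / out['Unique_Patients_Seen']
    ).round(3)
    out['Provider_Productivity'] = (
        out['Visit_Volume'] / out['Total_Providers_FTE']
    ).round(1)
    return out

demo = add_derived_columns(demo)
print(demo[['Staff_to_Patient_Ratio', 'Provider_Productivity']])
```

Applied to the full dataset, the same call would be `df = add_derived_columns(df)`, run before the file is saved.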

## Save the DataFrame as a CSV file

In [127]:
df.to_csv('D:/GitHub/important-reference-repo/Data/fake_healthcare_data_v2.csv', index=False)