The outline below gives the real-world basis for the target ratio of each key field:

1. **Sex Ratio**:
   - Typically, the sex ratio in many Western countries is close to 50-50. According to the World Bank, the global average is approximately 50.4% men and 49.6% women.
   src: https://data.worldbank.org/indicator/SP.POP.TOTL
   - **Recommended Ratio**: 50% men and 50% women.

2. **Hypertension**:
   - Hypertension prevalence varies by age and sex but is generally significant in adults. According to the American Heart Association, about 45% of adults in the U.S. have hypertension.
   src: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803011/#:~:text=3.2.-,Burden%20of%20hypertension,104%E2%80%93128
   - **Recommended Ratio**: 45% with hypertension and 55% without.

3. **Cholesterol Levels**:
   - High cholesterol is a common issue, with the CDC reporting that about 38% of American adults have high cholesterol.
   src: https://www.cdc.gov/cholesterol/data-research/facts-stats/index.html
   - **Recommended Ratio**: 38% with high cholesterol and 62% with normal cholesterol levels.

4. **Glucose Levels**:
   - Diabetes and prediabetes are prevalent, with about 10.5% of the U.S. population having diabetes and 34.5% having prediabetes.
    src: https://www.cdc.gov/diabetes/php/data-research/index.html#:~:text=of%20diagnosed%20diabetes-,Among%20the%20U.S.%20population%20overall%2C%20crude%20estimates%20for%202021%20were,304%2C000%20with%20type%201%20diabetes.
   - **Recommended Ratio**: Roughly 10-15% with diabetes, 30-35% with prediabetes, and 50-60% with normal glucose levels.

5. **Smoking**:
   - Smoking rates have declined, but smoking remains a risk factor for cardiovascular disease. About 14% of adults in the U.S. smoke.
    src: https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm
   - **Recommended Ratio**: 14% smokers and 86% non-smokers.

6. **Alcohol Consumption**:
   - Alcohol consumption varies widely. Around 25% of adults engage in heavy drinking or binge drinking according to various health reports.
    src: https://www.pewresearch.org/short-reads/2024/01/03/10-facts-about-americans-and-alcohol-as-dry-january-begins/#:~:text=In%202021%2C%20the%20most%20recent,contains%200.6%20ounces%20of%20alcohol.
   - **Recommended Ratio**: 25% heavy drinkers and 75% moderate or non-drinkers.

7. **Physical Activity**:
   - Physical activity is critical for cardiovascular health. Approximately 54% of adults meet the CDC’s Physical Activity Guidelines.
    src: https://www.cdc.gov/nchs/fastats/exercise.htm
   - **Recommended Ratio**: 54% active and 46% inactive.

8. **Cardiovascular Disease (CVD) Flag**:
   - CVD prevalence varies by age and sex, and CVD is a leading cause of death. In the U.S., about 48% of adults have some form of CVD.
    src: https://www.ajpmonline.org/article/S0749-3797(23)00465-8/fulltext#:~:text=During%202010%E2%80%932022%2C%2010%2C951%2C403%20CVD,rate%20(456.7%20per%20100%2C000
   - **Recommended Ratio**: 48% with CVD and 52% without CVD.

Given these ratios, our aim is to build a dataset whose field distributions reflect the real-world figures above, so that the model is trained on a representative sample and generalizes better to new data. The target ratios are summarized below:

| Field                        | Ratio                                              |
|------------------------------|----------------------------------------------------|
| Sex                          | 50% men, 50% women                                 |
| Hypertension                 | 45% yes, 55% no                                    |
| Cholesterol                  | 38% high, 62% normal                               |
| Glucose                      | 10-15% diabetes, 30-35% prediabetes, 50-60% normal |
| Smoking                      | 14% smokers, 86% non-smokers                       |
| Alcohol Consumption          | 25% heavy drinkers, 75% moderate/non-drinkers      |
| Physical Activity            | 54% active, 46% inactive                           |
| Cardiovascular Disease (CVD) | 48% yes, 52% no                                    |
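
Before using targets like these in code, it can help to confirm that the shares for each field sum to 1. The sketch below is a minimal check under the assumption that the targets live in a plain dict keyed by field name; `check_ratio_sums`, `example_targets`, and the value labels are hypothetical names chosen for illustration, and the glucose shares are single values picked from within the ranges in the table.

```python
import math


def check_ratio_sums(targets: dict, tol: float = 1e-9) -> list:
    """Return the names of fields whose target shares do not sum to 1."""
    return [
        field
        for field, shares in targets.items()
        if not math.isclose(sum(shares.values()), 1.0, abs_tol=tol)
    ]


# Illustrative subset of the table (labels are placeholders, not dataset values).
example_targets = {
    "sex": {"men": 0.50, "women": 0.50},
    "smoking": {"smoker": 0.14, "non_smoker": 0.86},
    "glucose": {"diabetes": 0.10, "prediabetes": 0.35, "normal": 0.55},
}

bad_fields = check_ratio_sums(example_targets)
print(bad_fields)  # any field listed here needs its shares revisited
```

A check like this is cheap to run whenever a two-category target has to be spread over a field with three or more categories.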


In [1]:
from scripts.generic_methods import load_dataset
from icecream import ic
DATASET = load_dataset("../data/filtered/analysis.csv", ",")

In [2]:
DATASET.columns


Index(['Unnamed: 0', 'cardio', 'is_overweight', 'bmi_is_valid', 'is_healthy',
       'ap_lo', 'years', 'bmi', 'height_is_valid', 'alco', 'active', 'sex',
       'id', 'bmi_status', 'broader_age_group', 'ap_lo_status', 'age', 'smoke',
       'is_underweight', 'height', 'in_hypertension', 'ap_hi_status', 'gluc',
       'is_valid', 'age_is_valid', 'weight_is_valid', 'weight', 'age_group',
       'ap_hi', 'gender', 'cholesterol'],
      dtype='object')

In [3]:
from pandas import DataFrame

TARGET_FIELDS = ["gender", "in_hypertension", "cholesterol", "gluc", "smoke", "alco", "active", "cardio"]

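def get_ratios(dataset: DataFrame) -> dict[str, dict]:
    """Return, for each target field, the share of rows taking each value."""
    ratios = {}
    for field in TARGET_FIELDS:
        # Compute the normalized counts once per field rather than once per value.
        shares = dataset[field].value_counts(normalize=True)
        # Keep values in order of first appearance, as unique() returns them.
        ratios[field] = {value: shares[value] for value in dataset[field].unique()}
    return ratios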

get_ratios(DATASET)



{'gender': {'FEMALE': 0.3438776721083592, 'MALE': 0.6561223278916408},
 'in_hypertension': {False: 0.6644967705843462, True: 0.33550322941565375},
 'cholesterol': {'NORMAL': 0.7723134821746427,
  'WELL_ABOVE_NORMAL': 0.10566376782111919,
  'ABOVE_NORMAL': 0.12202275000423808},
 'gluc': {'NORMAL': 0.8600756073166184,
  'ABOVE_NORMAL': 0.06608011663191442,
  'WELL_ABOVE_NORMAL': 0.07384427605146722},
 'smoke': {False: 0.9140687246774822, True: 0.08593127532251776},
 'alco': {False: 0.9491430605706148, True: 0.05085693942938514},
 'active': {True: 0.8022004102459781, False: 0.19779958975402193},
 'cardio': {False: 0.561206326603265, True: 0.438793673396735}}

In [7]:
ACCURACY_TOLERENCE = 0.05
RECOMENDED_RATIOS = {
    'gender': {'FEMALE': 0.5, 'MALE': 0.5},
    'in_hypertension': {False: 0.55, True: 0.45},
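    # Shares within a field must sum to 1. The 38% "high" share is split evenly across
    # the two elevated categories (an assumption; the cited figure gives no split).
    'cholesterol': {'NORMAL': 0.62, 'ABOVE_NORMAL': 0.19, 'WELL_ABOVE_NORMAL': 0.19},
    # ABOVE_NORMAL is read as prediabetes (35%) and WELL_ABOVE_NORMAL as diabetes (10%).
    'gluc': {'NORMAL': 0.55, 'ABOVE_NORMAL': 0.35, 'WELL_ABOVE_NORMAL': 0.1},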
    'smoke': {False: 0.86, True: 0.14},
    'alco': {False: 0.75, True: 0.25},
    'active': {False: 0.46, True: 0.54},
    'cardio': {False: 0.52, True: 0.48}
}
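
To see which fields are furthest from their targets, the nested dict returned by `get_ratios` can be compared value by value with `RECOMENDED_RATIOS`. The sketch below is one way to do that, not the notebook's own method: `ratio_gaps` and `fields_out_of_tolerance` are hypothetical helpers, the tolerance is read here as a maximum absolute gap in share (an assumption), and the observed figures are rounded from the output of `get_ratios(DATASET)`.

```python
def ratio_gaps(observed: dict, target: dict) -> dict:
    """Absolute gap between observed and target share, per field and value."""
    return {
        field: {
            value: abs(observed.get(field, {}).get(value, 0.0) - share)
            for value, share in target_shares.items()
        }
        for field, target_shares in target.items()
    }


def fields_out_of_tolerance(observed: dict, target: dict, tolerance: float) -> list:
    """Fields where at least one value is further than `tolerance` from its target."""
    gaps = ratio_gaps(observed, target)
    return [field for field, gap in gaps.items() if max(gap.values()) > tolerance]


# Two fields only, rounded from the observed ratios shown earlier.
observed_example = {
    "smoke": {False: 0.914, True: 0.086},
    "cardio": {False: 0.561, True: 0.439},
}
target_example = {
    "smoke": {False: 0.86, True: 0.14},
    "cardio": {False: 0.52, True: 0.48},
}

flagged = fields_out_of_tolerance(observed_example, target_example, 0.05)
print(flagged)
```

With a tolerance of 0.05, `smoke` is flagged (gap of about 0.054) while `cardio` is not (gap of about 0.041).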

In [11]:
import pandas as pd

def balance_dataset(dataset: pd.DataFrame, ratios: dict[str, dict], tolerance: float) -> pd.DataFrame:
    """Resample `dataset` towards `ratios`, one field at a time.

    Each field contributes its own sampled rows, so the result stacks one
    subsample per field and the same row can appear several times.
    """
    subsets = []

    for field, field_ratios in ratios.items():
        for value, ratio in field_ratios.items():
            matching = dataset[dataset[field] == value]
            available = len(matching)
            # Target row count for this value, capped at the rows available.
            count = min(int(len(dataset) * ratio), available)
            # Never drop more than `tolerance` of the available rows.
            count = max(count, int(available * (1 - tolerance)))
            subsets.append(matching.sample(n=count, random_state=42))

    # Shuffle the stacked rows. replace=True samples with replacement,
    # so some rows repeat and others drop out.
    return pd.concat(subsets).sample(frac=1, random_state=42, replace=True).reset_index(drop=True)

balanced_dataset = balance_dataset(DATASET, RECOMENDED_RATIOS, ACCURACY_TOLERENCE)

get_ratios(balanced_dataset)


{'gender': {'FEMALE': 0.34619302322504975, 'MALE': 0.6538069767749503},
 'in_hypertension': {True: 0.33757898589388, False: 0.66242101410612},
 'cholesterol': {'NORMAL': 0.770591451664822,
  'WELL_ABOVE_NORMAL': 0.1064859930946596,
  'ABOVE_NORMAL': 0.12292255524051839},
 'gluc': {'NORMAL': 0.8587095038826004,
  'WELL_ABOVE_NORMAL': 0.07393256048787156,
  'ABOVE_NORMAL': 0.06735793562952805},
 'smoke': {False: 0.9120869914146247, True: 0.08791300858537532},
 'alco': {False: 0.948390738201172, True: 0.051609261798827945},
 'active': {True: 0.8010856288666158, False: 0.1989143711333842},
 'cardio': {True: 0.44239596791618346, False: 0.5576040320838166}}
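
The ratios after balancing are almost identical to the ratios before it (for example, `gender` stays near 0.35/0.65 against a 0.5/0.5 target). One likely reason is visible in `balance_dataset`: rows are sampled separately for each field and then stacked, so the rows drawn to adjust one field are a random sample with respect to every other field, and the tolerance floor keeps at least 95% of each over-represented group.

A different technique that adjusts several fields at once is raking (iterative proportional fitting), which assigns each row a weight so that the weighted marginals approach the targets. The sketch below is a minimal illustration on a toy frame, not the notebook's method; `rake_weights`, `toy`, and `toy_targets` are hypothetical names, and it assumes every value in a field has a target share and every combination of values occurs at least once.

```python
import pandas as pd


def rake_weights(df: pd.DataFrame, targets: dict, n_iter: int = 50) -> pd.Series:
    """Per-row weights whose weighted marginals approach `targets` (raking / IPF)."""
    weights = pd.Series(1.0, index=df.index)
    for _ in range(n_iter):
        for field, shares in targets.items():
            # Current weighted share of each value in this field.
            current = (weights.groupby(df[field]).sum() / weights.sum()).to_dict()
            # Rescale every row by target share / current share for its value.
            factors = {value: shares[value] / current[value] for value in current}
            weights = weights * df[field].map(factors)
    return weights / weights.sum()


# Toy data: 60% MALE and 20% smokers before weighting.
toy = pd.DataFrame({
    "gender": ["MALE"] * 6 + ["FEMALE"] * 4,
    "smoke": ["yes", "no", "no", "no", "no", "no", "yes", "no", "no", "no"],
})
toy_targets = {
    "gender": {"MALE": 0.5, "FEMALE": 0.5},
    "smoke": {"yes": 0.14, "no": 0.86},
}

w = rake_weights(toy, toy_targets)
print(w.groupby(toy["gender"]).sum())
print(w.groupby(toy["smoke"]).sum())

# Draw a resampled dataset using the weights (rows can repeat).
resampled = toy.sample(n=len(toy), replace=True, weights=w, random_state=42)
```

The weights can be passed directly to a model that accepts sample weights, or used to resample as shown. Resampling with replacement duplicates rows, so any train/test split should be made before it.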