
In [7]:
import numpy as np
import pandas as pd

In [9]:
synthetic_data.columns

Index(['age', 'gender', 'systolic_bp', 'diastolic_bp', 'total_cholesterol',
       'ldl_cholesterol', 'hdl_cholesterol', 'bmi', 'smoking_status',
       'diabetes', 'family_history', 'cardiovascular_risk'],
      dtype='object')

# Task
Generate a Python function that produces a synthetic dataset of potential patients with cardiovascular risk, including relevant health parameters and a calculated risk score.

## Define parameters

### Subtask:
Determine the relevant parameters and their ranges for generating the dataset (e.g., age, blood pressure, cholesterol levels, lifestyle factors).


**Reasoning**:
Define the relevant parameters and their ranges for generating the synthetic dataset based on common cardiovascular risk factors.



In [2]:
# Relevant health parameters and their ranges/values for cardiovascular risk assessment

# Continuous variables with typical ranges
age_range = (20, 80)  # Years
systolic_bp_range = (90, 180)  # mmHg
diastolic_bp_range = (60, 120)  # mmHg
total_cholesterol_range = (120, 300)  # mg/dL
ldl_cholesterol_range = (50, 200)  # mg/dL
hdl_cholesterol_range = (20, 100)  # mg/dL
bmi_range = (18.5, 40)  # kg/m^2

# Categorical variables with possible values
gender_values = ['Male', 'Female']
smoking_status_values = ['Non-smoker', 'Former smoker', 'Current smoker']
diabetes_values = [True, False]  # True if has diabetes, False otherwise
family_history_values = [True, False]  # True if family history of CVD, False otherwise

# Note potential dependencies/correlations for more realistic data:
# - Blood pressure and cholesterol levels tend to increase with age.
# - BMI is often correlated with blood pressure and cholesterol levels.
# - Smoking is a significant risk factor and can correlate with higher blood pressure and lower HDL.
# - Diabetes is a major risk factor and often associated with higher blood pressure and cholesterol.
# - Gender can influence typical ranges and risk profiles.
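
The dependencies noted above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the method used in the rest of this notebook: it draws systolic pressure as a linear trend in age plus Gaussian noise and checks that the correlation with age comes out positive. The baseline `110`, the slope `0.5` mmHg per year, and the noise scale `12` are hypothetical values chosen only for the example.

```python
import numpy as np

# Hypothetical illustration values; none of these constants are clinical estimates.
rng = np.random.default_rng(0)
n = 5000

age = rng.integers(20, 81, size=n)               # ages 20..80 inclusive
noise = rng.normal(loc=0.0, scale=12.0, size=n)  # patient-to-patient variation
systolic_bp = 110.0 + 0.5 * (age - 20) + noise   # linear age trend plus noise

# Pearson correlation between age and systolic pressure
age_bp_corr = np.corrcoef(age, systolic_bp)[0, 1]
print(f"corr(age, systolic_bp) = {age_bp_corr:.2f}")
```

The same pattern (trend term plus noise) could be reused for the other correlations listed, for instance BMI against blood pressure.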

## Generate synthetic data

### Subtask:
Create a function to generate synthetic data for each parameter based on realistic distributions or ranges.


**Reasoning**:
Define the function to generate synthetic patient data, including importing necessary libraries, initializing data storage, and generating data for each parameter based on defined ranges and distributions.



In [3]:
import numpy as np
import pandas as pd

def generate_patient_data(num_patients):
    """Generates synthetic data for potential patients with cardiovascular risk.

    Args:
        num_patients: The number of patients to generate data for.

    Returns:
        A pandas DataFrame containing the synthetic patient data.
    """
    data = {}

    # Generate data for each parameter
    data['age'] = np.random.randint(age_range[0], age_range[1] + 1, num_patients)
    data['gender'] = np.random.choice(gender_values, num_patients)

    # Generate blood pressure with some age dependency
    data['systolic_bp'] = np.random.uniform(systolic_bp_range[0], systolic_bp_range[1], num_patients) + data['age'] * 0.5
    data['diastolic_bp'] = np.random.uniform(diastolic_bp_range[0], diastolic_bp_range[1], num_patients) + data['age'] * 0.3

    # Generate cholesterol levels with some age dependency (BMI is not used here)
    data['total_cholesterol'] = np.random.uniform(total_cholesterol_range[0], total_cholesterol_range[1], num_patients) + data['age'] * 0.4
    data['ldl_cholesterol'] = np.random.uniform(ldl_cholesterol_range[0], ldl_cholesterol_range[1], num_patients) + data['age'] * 0.3
    data['hdl_cholesterol'] = np.random.uniform(hdl_cholesterol_range[0], hdl_cholesterol_range[1], num_patients) - data['age'] * 0.1

    # Generate BMI
    data['bmi'] = np.random.uniform(bmi_range[0], bmi_range[1], num_patients)

    # Generate categorical data
    data['smoking_status'] = np.random.choice(smoking_status_values, num_patients)
    data['diabetes'] = np.random.choice(diabetes_values, num_patients)
    data['family_history'] = np.random.choice(family_history_values, num_patients)

    # Convert to DataFrame
    df = pd.DataFrame(data)

    # Ensure values are within defined ranges after adding dependencies
    for col, (lower, upper) in [('systolic_bp', systolic_bp_range), ('diastolic_bp', diastolic_bp_range),
                                 ('total_cholesterol', total_cholesterol_range), ('ldl_cholesterol', ldl_cholesterol_range),
                                 ('hdl_cholesterol', hdl_cholesterol_range), ('bmi', bmi_range)]:
        df[col] = np.clip(df[col], lower, upper)

    return df

# Example usage:
# synthetic_data = generate_patient_data(100)
# display(synthetic_data.head())
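
Adding the age term to a draw that already spans the full range pushes many values past the upper bound, and `np.clip` then sets them to exactly that bound. The sketch below reproduces the systolic pressure step in isolation, with the same range and the same `0.5` age coefficient as above, and measures how many values end up at the limit. The seeded `np.random.default_rng` is an assumption added for repeatability; the cell above uses the global `np.random` state.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator: an assumption for repeatability
n = 10_000
age_range = (20, 80)
systolic_bp_range = (90, 180)

age = rng.integers(age_range[0], age_range[1] + 1, size=n)
raw = rng.uniform(systolic_bp_range[0], systolic_bp_range[1], size=n) + age * 0.5
clipped = np.clip(raw, *systolic_bp_range)

# Share of patients whose value was pushed to exactly the upper bound
share_at_upper = np.mean(clipped == systolic_bp_range[1])
print(f"share at upper bound: {share_at_upper:.1%}")
```

If this pile-up at the bound is undesirable, one option is to sample the base value from a narrower range so that the age term stays inside the limits.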

## Calculate cardiovascular risk

### Subtask:
Develop a method or formula to calculate the cardiovascular risk based on the generated parameters. This could be a simplified model or based on established risk assessment tools.


**Reasoning**:
Define a function to calculate the cardiovascular risk score based on the patient's data and apply it to the synthetic data.



In [4]:
def calculate_cardiovascular_risk(patient_data):
    """Calculates a simplified cardiovascular risk score for a patient.

    Args:
        patient_data: A pandas Series representing a single patient's data.

    Returns:
        A numerical risk score.
    """
    risk_score = 0

    # Age: Older age increases risk
    risk_score += patient_data['age'] * 0.5

    # Gender: Males generally have higher risk
    if patient_data['gender'] == 'Male':
        risk_score += 10

    # Blood Pressure: Higher blood pressure increases risk
    risk_score += (patient_data['systolic_bp'] / 10) * 2
    risk_score += (patient_data['diastolic_bp'] / 10) * 1.5

    # Cholesterol: High LDL and low HDL increase risk
    risk_score += (patient_data['ldl_cholesterol'] / 10) * 1.5
    risk_score -= (patient_data['hdl_cholesterol'] / 10) * 1

    # Smoking Status: Smoking significantly increases risk
    if patient_data['smoking_status'] == 'Current smoker':
        risk_score += 20
    elif patient_data['smoking_status'] == 'Former smoker':
        risk_score += 10

    # Diabetes: Diabetes significantly increases risk
    if patient_data['diabetes']:
        risk_score += 25

    # Family History: Family history of CVD increases risk
    if patient_data['family_history']:
        risk_score += 15

    return risk_score

# Generate synthetic data
synthetic_data = generate_patient_data(1000)

# Apply the risk calculation function to each row
synthetic_data['cardiovascular_risk'] = synthetic_data.apply(calculate_cardiovascular_risk, axis=1)

display(synthetic_data.head())

| | age | gender | systolic_bp | diastolic_bp | total_cholesterol | ldl_cholesterol | hdl_cholesterol | bmi | smoking_status | diabetes | family_history | cardiovascular_risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | Female | 152.356285 | 72.584467 | 154.370734 | 200.0 | 94.997354 | 36.599457 | Current smoker | True | True | 138.359192 |
| 1 | 67 | Female | 174.216131 | 97.015166 | 260.698344 | 200.0 | 73.526271 | 25.331393 | Former smoker | True | False | 140.542874 |
| 2 | 78 | Female | 180.0 | 108.608848 | 158.591208 | 111.221535 | 57.685058 | 32.059767 | Non-smoker | False | False | 102.206052 |
| 3 | 59 | Male | 147.315097 | 120.0 | 235.487856 | 200.0 | 67.642912 | 23.111884 | Current smoker | False | False | 130.198728 |
| 4 | 62 | Female | 157.164057 | 96.826292 | 186.929323 | 133.55013 | 25.457671 | 25.541638 | Former smoker | False | True | 119.443508 |
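
Because the score is a weighted sum, the row-by-row `apply` call can also be written as column arithmetic, which is usually faster on large frames. The sketch below compares the two on a pair of hand-made patients. `risk_row` restates the weights from `calculate_cardiovascular_risk`; `risk_vectorized` is a hypothetical helper name, not something defined in this notebook.

```python
import numpy as np
import pandas as pd

def risk_row(p):
    """Row-wise score using the same weights as calculate_cardiovascular_risk."""
    score = p['age'] * 0.5
    score += 10 if p['gender'] == 'Male' else 0
    score += (p['systolic_bp'] / 10) * 2 + (p['diastolic_bp'] / 10) * 1.5
    score += (p['ldl_cholesterol'] / 10) * 1.5 - (p['hdl_cholesterol'] / 10) * 1
    score += {'Current smoker': 20, 'Former smoker': 10}.get(p['smoking_status'], 0)
    score += 25 if p['diabetes'] else 0
    score += 15 if p['family_history'] else 0
    return score

def risk_vectorized(df):
    """Hypothetical column-wise equivalent of risk_row."""
    smoking_points = df['smoking_status'].map(
        {'Current smoker': 20, 'Former smoker': 10, 'Non-smoker': 0})
    return (
        df['age'] * 0.5
        + (df['gender'] == 'Male') * 10
        + df['systolic_bp'] * 0.2        # (x / 10) * 2
        + df['diastolic_bp'] * 0.15      # (x / 10) * 1.5
        + df['ldl_cholesterol'] * 0.15
        - df['hdl_cholesterol'] * 0.1
        + smoking_points
        + df['diabetes'] * 25            # boolean columns act as 0/1
        + df['family_history'] * 15
    )

# Two hand-made patients (illustrative values only)
patients = pd.DataFrame({
    'age': [33, 59],
    'gender': ['Female', 'Male'],
    'systolic_bp': [152.0, 147.0],
    'diastolic_bp': [72.0, 120.0],
    'ldl_cholesterol': [200.0, 200.0],
    'hdl_cholesterol': [95.0, 67.0],
    'smoking_status': ['Current smoker', 'Non-smoker'],
    'diabetes': [True, False],
    'family_history': [True, False],
})

row_scores = patients.apply(risk_row, axis=1)
vec_scores = risk_vectorized(patients)
print(np.allclose(row_scores.to_numpy(dtype=float), vec_scores.to_numpy(dtype=float)))
```

The comparison uses `np.allclose` rather than exact equality because `(x / 10) * 2` and `x * 0.2` can differ in the last floating-point digit.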


## Save dataset

### Subtask:
Save the generated dataset to a file (e.g., CSV, JSON) for further analysis or use.


**Reasoning**:
Save the synthetic_data DataFrame to a CSV file.



In [5]:
synthetic_data.to_csv('synthetic_cardiovascular_data.csv', index=False)
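
A quick way to confirm the export is to read the file back and compare it with the frame in memory. The sketch below does this on a small stand-in frame written to a temporary directory; the frame contents and the file name `roundtrip_check.csv` are illustrative assumptions. Without `index=False`, the row index would be written as an unnamed first column, which `read_csv` labels `Unnamed: 0`.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame (illustrative values, not the notebook's data)
original = pd.DataFrame({
    'age': [33, 67],
    'gender': ['Female', 'Male'],
    'systolic_bp': [152.356285, 174.216131],
    'diabetes': [True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'roundtrip_check.csv')  # hypothetical file name
    original.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

# Raises AssertionError if columns or values differ.
# Floats are compared with a tolerance by default; dtype checks are relaxed here.
pd.testing.assert_frame_equal(original, restored, check_dtype=False)
print(restored.dtypes)
```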

## Document function

### Subtask:
Add comments and docstrings to the function to explain its purpose, parameters, and return values.


**Reasoning**:
Add comprehensive docstrings and comments to the `generate_patient_data` and `calculate_cardiovascular_risk` functions as per the instructions.



In [6]:
def generate_patient_data(num_patients):
    """Generates synthetic data for potential patients with cardiovascular risk.

    This function creates a synthetic dataset of patients with various health
    parameters relevant to cardiovascular risk assessment. It applies a simple
    age dependency to blood pressure and cholesterol, and reads the value
    ranges from the module-level `*_range` and `*_values` variables defined
    in an earlier cell.

    Args:
        num_patients: The number of patients to generate data for.

    Returns:
        A pandas DataFrame containing the synthetic patient data.
        The DataFrame includes columns for age, gender, systolic and diastolic
        blood pressure, total, LDL, and HDL cholesterol, BMI, smoking status,
        diabetes status, and family history of cardiovascular disease.
    """
    data = {}

    # Generate data for each parameter
    data['age'] = np.random.randint(age_range[0], age_range[1] + 1, num_patients)
    data['gender'] = np.random.choice(gender_values, num_patients)

    # Generate blood pressure with some age dependency (increases with age)
    data['systolic_bp'] = np.random.uniform(systolic_bp_range[0], systolic_bp_range[1], num_patients) + data['age'] * 0.5
    data['diastolic_bp'] = np.random.uniform(diastolic_bp_range[0], diastolic_bp_range[1], num_patients) + data['age'] * 0.3

    # Generate cholesterol levels with some age dependency (BMI is not used here)
    # Total and LDL cholesterol increase with age, HDL decreases with age
    data['total_cholesterol'] = np.random.uniform(total_cholesterol_range[0], total_cholesterol_range[1], num_patients) + data['age'] * 0.4
    data['ldl_cholesterol'] = np.random.uniform(ldl_cholesterol_range[0], ldl_cholesterol_range[1], num_patients) + data['age'] * 0.3
    data['hdl_cholesterol'] = np.random.uniform(hdl_cholesterol_range[0], hdl_cholesterol_range[1], num_patients) - data['age'] * 0.1

    # Generate BMI within the defined range
    data['bmi'] = np.random.uniform(bmi_range[0], bmi_range[1], num_patients)

    # Generate categorical data based on defined values
    data['smoking_status'] = np.random.choice(smoking_status_values, num_patients)
    data['diabetes'] = np.random.choice(diabetes_values, num_patients)
    data['family_history'] = np.random.choice(family_history_values, num_patients)

    # Convert to DataFrame
    df = pd.DataFrame(data)

    # Ensure values are within defined ranges after adding dependencies by clipping
    for col, (lower, upper) in [('systolic_bp', systolic_bp_range), ('diastolic_bp', diastolic_bp_range),
                                 ('total_cholesterol', total_cholesterol_range), ('ldl_cholesterol', ldl_cholesterol_range),
                                 ('hdl_cholesterol', hdl_cholesterol_range), ('bmi', bmi_range)]:
        df[col] = np.clip(df[col], lower, upper)

    return df


def calculate_cardiovascular_risk(patient_data):
    """Calculates a simplified cardiovascular risk score for a patient.

    This function computes a numerical risk score based on various health
    parameters. The calculation is a simplified model and not based on a
    specific clinical risk assessment tool.

    Args:
        patient_data: A pandas Series representing a single patient's data.
                      Expected keys include 'age', 'gender', 'systolic_bp',
                      'diastolic_bp', 'ldl_cholesterol', 'hdl_cholesterol',
                      'smoking_status', 'diabetes', and 'family_history'.

    Returns:
        A numerical risk score representing the estimated cardiovascular risk.
        Higher values indicate higher risk.
    """
    risk_score = 0

    # Age: Older age increases risk
    risk_score += patient_data['age'] * 0.5

    # Gender: Males generally have higher risk
    if patient_data['gender'] == 'Male':
        risk_score += 10

    # Blood Pressure: Higher systolic and diastolic blood pressure increase risk
    risk_score += (patient_data['systolic_bp'] / 10) * 2
    risk_score += (patient_data['diastolic_bp'] / 10) * 1.5

    # Cholesterol: High LDL increases risk, low HDL increases risk
    risk_score += (patient_data['ldl_cholesterol'] / 10) * 1.5
    risk_score -= (patient_data['hdl_cholesterol'] / 10) * 1

    # Smoking Status: Smoking significantly increases risk (current more than former)
    if patient_data['smoking_status'] == 'Current smoker':
        risk_score += 20
    elif patient_data['smoking_status'] == 'Former smoker':
        risk_score += 10

    # Diabetes: Presence of diabetes significantly increases risk
    if patient_data['diabetes']:
        risk_score += 25

    # Family History: Family history of CVD increases risk
    if patient_data['family_history']:
        risk_score += 15

    return risk_score

# Re-generate synthetic data and calculate risk with the updated functions
synthetic_data = generate_patient_data(1000)
synthetic_data['cardiovascular_risk'] = synthetic_data.apply(calculate_cardiovascular_risk, axis=1)

display(synthetic_data.head())

| | age | gender | systolic_bp | diastolic_bp | total_cholesterol | ldl_cholesterol | hdl_cholesterol | bmi | smoking_status | diabetes | family_history | cardiovascular_risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | Female | 158.192503 | 92.612082 | 300.0 | 114.46851 | 54.968154 | 38.446969 | Non-smoker | True | False | 92.203774 |
| 1 | 47 | Male | 118.024744 | 76.810609 | 191.026346 | 125.208095 | 23.524737 | 31.461666 | Non-smoker | False | False | 85.055281 |
| 2 | 76 | Male | 180.0 | 87.391358 | 300.0 | 111.618654 | 60.217121 | 25.08003 | Non-smoker | False | True | 122.82979 |
| 3 | 33 | Male | 180.0 | 112.071865 | 250.106015 | 117.182211 | 85.754789 | 32.412321 | Non-smoker | False | True | 103.312632 |
| 4 | 41 | Female | 172.789667 | 112.759793 | 203.735101 | 185.090988 | 51.471562 | 28.283062 | Current smoker | True | False | 139.588394 |
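
The rows shown here differ from the earlier run because the generator draws from the global `np.random` state without a seed. A minimal sketch of a reproducible variant is given below. `sample_demo_frame` and its `seed` parameter are hypothetical and cover only three columns; the idea is to pass a seed into `np.random.default_rng` so that repeated calls return the same frame.

```python
import numpy as np
import pandas as pd

def sample_demo_frame(num_patients, seed=None):
    """Hypothetical, reduced generator: only age, gender, and BMI."""
    rng = np.random.default_rng(seed)  # seed=None gives a fresh, unseeded generator
    return pd.DataFrame({
        'age': rng.integers(20, 81, size=num_patients),
        'gender': rng.choice(['Male', 'Female'], size=num_patients),
        'bmi': rng.uniform(18.5, 40, size=num_patients),
    })

first = sample_demo_frame(5, seed=123)
second = sample_demo_frame(5, seed=123)

# Same seed, same draws: the two frames should be identical
print(first.equals(second))
```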


## Summary:

### Data Analysis Key Findings

*   The notebook defines `generate_patient_data`, which creates synthetic patient records with age, gender, blood pressure, cholesterol levels, BMI, smoking status, diabetes, and family history.
*   The only dependency between variables is on age: systolic and diastolic pressure and total and LDL cholesterol are shifted upward with age, and HDL cholesterol is shifted downward. Values are then clipped to the defined ranges, which is why several rows in the sample output sit exactly at a bound (for example `180.0` mmHg or `200.0` mg/dL).
*   A simplified `calculate_cardiovascular_risk` function assigns each patient a numerical score as a weighted sum of their health parameters; total cholesterol and BMI do not enter the score.
*   The dataset, including the risk scores, is saved to `synthetic_cardiovascular_data.csv`.
*   Both functions are documented with docstrings and comments covering purpose, arguments, and return values.

### Insights or Next Steps

*   The generated synthetic dataset can be used for training and testing simple cardiovascular risk prediction models.
*   The risk calculation formula used is simplified; future work could involve implementing more clinically validated risk assessment tools (e.g., ASCVD Risk Estimator) for a more accurate representation of risk.
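
One point to keep in mind when using this dataset for modeling: `cardiovascular_risk` is an exact weighted sum of the input columns, so a linear model can reproduce it without error. The sketch below illustrates this under stated assumptions. It builds its own small feature matrix (the sampling here is simplified and is not the notebook's generator), computes the score with the per-unit weights implied by `calculate_cardiovascular_risk`, and recovers those weights with ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Simplified stand-in features (independent draws, illustrative only)
age = rng.integers(20, 81, size=n).astype(float)
male = rng.integers(0, 2, size=n).astype(float)
systolic = rng.uniform(90, 180, size=n)
diastolic = rng.uniform(60, 120, size=n)
ldl = rng.uniform(50, 200, size=n)
hdl = rng.uniform(20, 100, size=n)
smoking = rng.integers(0, 3, size=n)          # 0 non-smoker, 1 former, 2 current
former = (smoking == 1).astype(float)
current = (smoking == 2).astype(float)
diabetes = rng.integers(0, 2, size=n).astype(float)
family = rng.integers(0, 2, size=n).astype(float)

X = np.column_stack([age, male, systolic, diastolic, ldl, hdl,
                     former, current, diabetes, family])

# Per-unit weights implied by the formula, e.g. (x / 10) * 2 == 0.2 * x
true_weights = np.array([0.5, 10, 0.2, 0.15, 0.15, -0.1, 10, 20, 25, 15])
score = X @ true_weights

# Ordinary least squares fit of score on the features
fitted_weights, *_ = np.linalg.lstsq(X, score, rcond=None)
print(np.round(fitted_weights, 3))
```

For that reason the dataset is probably best suited to exercising a modeling pipeline end to end; adding noise to the score would be one way to make the prediction task less trivial.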
