# Machine Learning Regression for Early Hypertension Risk Detection

By predicting blood pressure from lifestyle and health data, we can identify individuals at risk of hypertension earlier. This enables proactive prevention strategies such as tailored nutrition, exercise and stress management, which can reduce the long-term burden of cardiovascular disease. In line with **SDG 3** (*Good Health and Well-being*), this approach promotes healthier lives, lowers healthcare costs and supports more sustainable health systems.


## 1.0 Loading, cleaning and feature engineering

### 1.1 Loading the dataset
First, we load the dataset into a SQLite database and print the first 5 rows to get an initial look at the data.

In [50]:
import pandas as pd
import sqlite3

# Load CSV into pandas DataFrame
csv_path = "Data/health_lifestyle_classification.csv"
df = pd.read_csv(csv_path)

# Create a SQLite database connection
conn = sqlite3.connect("health_data.db")

# Write DataFrame into a SQL table
df.to_sql("health", conn, if_exists="replace", index=False)

# Example query: select first 5 rows
query = "SELECT * FROM health LIMIT 5;"
result = pd.read_sql(query, conn)
print(result)


   survey_code  age  gender      height     weight        bmi  bmi_estimated  \
0            1   56    Male  173.416872  56.886640  18.915925      18.915925   
1            2   69  Female  163.207380  97.799859  36.716278      36.716278   
2            3   46    Male  177.281966  80.687562  25.673050      25.673050   
3            4   32  Female  172.101255  63.142868  21.318480      21.318480   
4            5   60  Female  163.608816  40.000000  14.943302      14.943302   

   bmi_scaled  bmi_corrected  waist_size  ...  sunlight_exposure  \
0   56.747776      18.989117   72.165130  ...               High   
1  110.148833      36.511417   85.598889  ...               High   
2   77.019151      25.587429   90.295030  ...               High   
3   63.955440      21.177109  100.504211  ...               High   
4   44.829907      14.844299   69.021150  ...               High   

   meals_per_day  caffeine_intake  family_history  pet_owner  \
0              5         Moderate             
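
The same CSV → SQLite → pandas round trip can be sketched on a tiny hand-made frame. This is only an illustration: it assumes an in-memory database (`:memory:`) instead of the `health_data.db` file, and the sample values and the `min_age` parameter are made up. It also shows a parameterized query and an explicitly closed connection.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Tiny illustrative frame (made-up values, not the real dataset)
sample = pd.DataFrame(
    {
        "survey_code": [1, 2, 3],
        "age": [56, 69, 46],
        "gender": ["Male", "Female", "Male"],
    }
)

# closing() makes sure the connection is closed, even if a query fails
with closing(sqlite3.connect(":memory:")) as conn:
    sample.to_sql("health", conn, if_exists="replace", index=False)

    # Parameterized query: the ? placeholder is filled from params
    min_age = 50  # hypothetical threshold, for illustration only
    older = pd.read_sql(
        "SELECT * FROM health WHERE age >= ? ORDER BY age",
        conn,
        params=(min_age,),
    )

print(older)
```

Passing values through `params` instead of formatting them into the SQL string keeps the query safe if the threshold ever comes from user input.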

### 1.2 Show data types
Next, we check the data types to see which columns we need to drop, convert or feature engineer.

In [51]:
# Show data types
print(df.dtypes)

survey_code                   int64
age                           int64
gender                       object
height                      float64
weight                      float64
bmi                         float64
bmi_estimated               float64
bmi_scaled                  float64
bmi_corrected               float64
waist_size                  float64
blood_pressure              float64
heart_rate                  float64
cholesterol                 float64
glucose                     float64
insulin                     float64
sleep_hours                 float64
sleep_quality                object
work_hours                  float64
physical_activity           float64
daily_steps                 float64
calorie_intake              float64
sugar_intake                float64
alcohol_consumption          object
smoking_level                object
water_intake                float64
screen_time                 float64
stress_level                  int64
mental_health_score         
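
To turn the dtype listing into a to-do list for encoding, the text columns and their number of categories could be extracted as in the sketch below. It uses a small made-up frame with a few of the column names from above. Selecting with `exclude="number"` is a choice made here so that string columns are found regardless of how pandas stores them.

```python
import pandas as pd

# Made-up sample with a few of the dataset's column names
sample = pd.DataFrame(
    {
        "age": [56, 69, 46],
        "gender": ["Male", "Female", "Male"],
        "sleep_quality": ["Fair", "Good", "Poor"],
        "bmi": [18.9, 36.7, 25.7],
    }
)

# Non-numeric columns are the ones that need encoding before modelling
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of distinct categories per text column
cardinality = sample[text_cols].nunique()
print(cardinality)
```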

### 1.3 Drop columns we are not going to use 
We checked whether any of the columns could cause data leakage and which columns we do not need.
We dropped all columns with medical information, because we aim to predict blood pressure without relying on clinical laboratory values that are typically measured at the same time.
The following index describes which columns we kept, which we dropped and which we will feature engineer in the next step: https://www.notion.so/Dataset-index-25598c6768cd8070a09ee67bdb12f3ca

In [52]:
# Define columns to drop
drop_columns = [
    "survey_code",
    "target",
    "bmi_estimated",
    "bmi_scaled",
    "bmi_corrected",
    "weight",
    "height",
    "cholesterol",
    "glucose",
    "insulin",
    "occupation",
    "electrolyte_level",
    "gene_marker_flag"
]

# Drop the columns
df_cleaned = df.drop(columns=drop_columns, errors="ignore")

print("Remaining columns:", df_cleaned.shape[1])
print(df_cleaned.head())

Remaining columns: 35
   age  gender        bmi  waist_size  blood_pressure  heart_rate  \
0   56    Male  18.915925   72.165130      118.264254   60.749825   
1   69  Female  36.716278   85.598889      117.917986   66.463696   
2   46    Male  25.673050   90.295030      123.073698   76.043212   
3   32  Female  21.318480  100.504211      148.173453   68.781981   
4   60  Female  14.943302   69.021150      150.613181   92.335358   

   sleep_hours sleep_quality  work_hours  physical_activity  ...  \
0     6.475885          Fair    7.671313           0.356918  ...   
1     8.428410          Good    9.515198           0.568219  ...   
2     5.702164          Poor    5.829853           3.764406  ...   
3     5.188316          Good    9.489693           0.889474  ...   
4     7.912514          Good    7.275450           2.901608  ...   

   device_usage  healthcare_access  insurance sunlight_exposure meals_per_day  \
0          High               Poor         No              High          
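
Because `errors="ignore"` silently skips names that are not in the DataFrame, a typo in the drop list would go unnoticed. A small check like the following sketch could catch that. The frame and the misspelled `"wieght"` entry are made up for illustration.

```python
import pandas as pd

# Made-up sample with a few of the dataset's column names
sample = pd.DataFrame(
    {"survey_code": [1, 2], "weight": [56.9, 97.8], "bmi": [18.9, 36.7]}
)

# "wieght" is a deliberate typo to show what errors="ignore" would hide
drop_columns = ["survey_code", "wieght"]

# Names in the drop list that do not exist in the DataFrame
not_found = [c for c in drop_columns if c not in sample.columns]
print("Not in DataFrame:", not_found)

cleaned = sample.drop(columns=drop_columns, errors="ignore")
print(cleaned.columns.tolist())
```

Here `weight` survives the drop because of the typo, which is exactly the kind of leftover column the check is meant to reveal.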

### 1.4 Feature engineering

In [53]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Binary encoding (Yes/No → 0/1)

binary_cols = ["mental_health_support", "insurance", "family_history"]
for col in binary_cols:
    if col in df.columns:
        df[col] = df[col].map({"No": 0, "Yes": 1}).astype("Int64")

# Categorical encoding (unordered categories)
categorical_cols = ["education_level", "job_type", "diet_type", "exercise_type"]
for col in categorical_cols:
    if col in df.columns:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))


# Ordinal encoding (ordered categories)
ordinal_mappings = {
    "device_usage": ["Low", "Moderate", "High"],
    "healthcare_access": ["Poor", "Moderate", "Good"],
    "sunlight_exposure": ["Low", "Moderate", "High"],
    "caffeine_intake": ["None", "Low", "Moderate", "High"]
}

for col, order in ordinal_mappings.items():
    if col in df.columns:
        # Clean: strip spaces, unify capitalization
        s = df[col].astype("string").str.strip().str.title()

        # Impute missing values with the most frequent category
        if s.isna().any():
            mode_val = s.mode(dropna=True)
            fill_val = mode_val.iloc[0] if not mode_val.empty else order[0]
            s = s.fillna(fill_val)

        # Encode with fixed order; unseen categories → -1 (safe fallback)
        enc = OrdinalEncoder(
            categories=[order],
            handle_unknown="use_encoded_value",
            unknown_value=-1
        )
        df[col] = enc.fit_transform(s.to_frame()).astype("int64")

# Preview transformed columns
cols_to_show = binary_cols + categorical_cols + list(ordinal_mappings.keys())
print(df[cols_to_show].head())


   mental_health_support  insurance  family_history  education_level  \
0                      0          0               0                3   
1                      0          0               1                1   
2                      0          1               0                2   
3                      0          0               0                2   
4                      1          1               1                2   

   job_type  diet_type  exercise_type  device_usage  healthcare_access  \
0         4          2              2             2                  0   
1         2          2              0             1                  1   
2         2          2              0             2                  2   
3         1          3              1             0                  1   
4         5          2              3             0                  1   

   sunlight_exposure  caffeine_intake  
0                  2                2  
1                  2                3  
2 
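
One thing worth verifying for `caffeine_intake`: its category order contains the label `"None"`. As far as we know, recent pandas versions treat the literal text `None` in a CSV as a missing value by default, in which case that category would arrive as `NaN` and be replaced by the mode in the loop above. Whether this happens in this dataset is an assumption that should be checked against the raw file. The sketch below, on a made-up CSV snippet, shows one way to keep the label as a string while still treating empty cells as missing.

```python
import io

import pandas as pd

# Made-up CSV text: "None" is a real category, the empty cell is truly missing
csv_text = "caffeine_intake,sleep_hours\nNone,6.5\nHigh,\nLow,7.0\n"

# Disable the default NA markers and treat only empty cells as missing
kept = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print(kept["caffeine_intake"].tolist())
print(kept["sleep_hours"].isna().sum())
```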

#### Feature Encoding Reference

| Feature               | Type            | Mapping / Notes                                                                 |
|------------------------|-----------------|---------------------------------------------------------------------------------|
| mental_health_support | Binary          | 0 = No, 1 = Yes                                                                 |
| insurance             | Binary          | 0 = No, 1 = Yes                                                                 |
| family_history        | Binary          | 0 = No, 1 = Yes                                                                 |
| device_usage          | Ordinal         | 0 = Low, 1 = Moderate, 2 = High                                                 |
| healthcare_access     | Ordinal         | 0 = Poor, 1 = Moderate, 2 = Good                                                |
| sunlight_exposure     | Ordinal         | 0 = Low, 1 = Moderate, 2 = High                                                 |
| caffeine_intake       | Ordinal         | 0 = None, 1 = Low, 2 = Moderate, 3 = High                                       |
| education_level       | Categorical     | Label Encoded → depends on encoder.classes_ (e.g., High School, Bachelor, etc.) |
| job_type              | Categorical     | Label Encoded → depends on encoder.classes_ (e.g., Office, Tech, Labor, etc.)   |
| diet_type             | Categorical     | Label Encoded → depends on encoder.classes_ (e.g., Vegan, Vegetarian, Mixed)    |
| exercise_type         | Categorical     | Label Encoded → depends on encoder.classes_ (e.g., Cardio, Strength, Mixed)     |
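
For the label-encoded columns, the actual code for each category can be read back from the fitted encoder. A minimal sketch with made-up `diet_type` values (the real categories may differ):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the real diet_type categories may differ
diet = ["Vegan", "Omnivore", "Vegetarian", "Vegan"]

le = LabelEncoder()
codes = le.fit_transform(diet)

# classes_ is sorted; a label's position in it is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

In the feature engineering loop the variable `le` is overwritten on every iteration, so recovering the mappings afterwards would require keeping each encoder, for example in a dictionary keyed by column name.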


### 1.5 Check for missing values

In [54]:
# Check missing values
missing = df.isnull().sum()

# Show only columns with missing values
missing = missing[missing > 0].sort_values(ascending=False)

print("Missing values per column:")
print(missing)

# Optional: show percentage of missing values
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print("\nPercentage missing per column:")
print(missing_pct[missing_pct > 0])

Missing values per column:
alcohol_consumption    42387
insulin                15836
heart_rate             14003
gene_marker_flag       10474
income                  8470
daily_steps             8329
blood_pressure          7669
dtype: int64

Percentage missing per column:
alcohol_consumption    42.387
insulin                15.836
heart_rate             14.003
gene_marker_flag       10.474
income                  8.470
daily_steps             8.329
blood_pressure          7.669
dtype: float64
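
One possible way to act on these counts is sketched below. It assumes that rows without a `blood_pressure` value are dropped (the target cannot be learned from if it is missing) and that numeric features are filled with their median. This is an assumed strategy for illustration, not necessarily the one used later in this notebook, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in the target and in two features
sample = pd.DataFrame(
    {
        "blood_pressure": [118.3, np.nan, 123.1, 148.2],
        "heart_rate": [60.7, 66.5, np.nan, 68.8],
        "daily_steps": [5000.0, np.nan, 7000.0, 9000.0],
    }
)

# Rows without a target cannot be used for supervised training
with_target = sample.dropna(subset=["blood_pressure"]).copy()

# Fill numeric features with the column median
# (in a real pipeline, compute medians on the training split only)
feature_cols = ["heart_rate", "daily_steps"]
medians = with_target[feature_cols].median()
with_target[feature_cols] = with_target[feature_cols].fillna(medians)

print(with_target)
```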


## 2.0 Visualizations to get a better understanding of the data
In this step we make several visualizations to get a better understanding of the data and to double-check whether we missed anything in step 1.