# Logistic Regression — Heart Disease (Framingham)

Building a transparent, reproducible baseline for 10-year CHD risk classification.

**Dataset:** `framingham.csv` (demographic, behavioral, medical)  
**Target:** `TenYearCHD` (0 = no CHD event in 10 years, 1 = CHD event)


In [1]:
# ─────────────────────────────────────────────────────────────────────────────
# Environment and reproducibility setup
# Imports, seeds, and device configuration (PyTorch used later for modeling).
# ─────────────────────────────────────────────────────────────────────────────
from __future__ import annotations

import os
import random
import sys
from pathlib import Path

import numpy as np
import pandas as pd

try:
    import torch
except ImportError as e:
    raise ImportError(
        "PyTorch is required. Install via `pip install torch --index-url https://download.pytorch.org/whl/cu121` "
        "or any version matching the local CUDA/CPU setup."
    ) from e


def set_global_seeds(seed: int = 42) -> None:
    """Set random seeds for Python, NumPy, and PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and all CUDA generators
    torch.use_deterministic_algorithms(False)  # faster baseline (not fully deterministic)


def get_device() -> torch.device:
    """Select the computation device (CUDA if available, otherwise CPU)."""
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


def print_versions() -> None:
    """Print library versions for traceability."""
    print(f"Python: {sys.version.split()[0]}")
    print(f"Pandas: {pd.__version__}")
    print(f"NumPy:  {np.__version__}")
    print(f"Torch:  {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA device: {torch.cuda.get_device_name(0)}")


SEED = 42
set_global_seeds(SEED)
DEVICE = get_device()
print_versions()
print(f"Using device: {DEVICE}")


Python: 3.12.12
Pandas: 2.3.3
NumPy:  2.3.4
Torch:  2.9.0+cu128
CUDA available: True
CUDA device: NVIDIA GeForce RTX 5070 Ti Laptop GPU
Using device: cuda


## Raw data inspection

Loading the CSV, previewing the first rows, and inspecting dtypes/nulls before defining any schema.  
This confirms that the file is readable and the column names are as expected, and it surfaces missing values and other potential data issues.

In [2]:
csv_path = "../data/framingham.csv"
df = pd.read_csv(csv_path)

print("Shape:", df.shape)
display(df.head(5))
df.info()

Shape: (4238, 16)


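| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |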


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB
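
Because the schema is defined only after this inspection, a quick comparison of the loaded column names against the `df.info()` listing can catch a renamed or missing column early. The sketch below is an illustration, not part of the original notebook: `EXPECTED_COLUMNS` and `check_columns` are hypothetical names, and the expected list is copied from the output above.

```python
from collections.abc import Iterable

# Column names copied from the df.info() listing above (assumed to be the intended schema).
EXPECTED_COLUMNS = [
    "male", "age", "education", "currentSmoker", "cigsPerDay", "BPMeds",
    "prevalentStroke", "prevalentHyp", "diabetes", "totChol", "sysBP",
    "diaBP", "BMI", "heartRate", "glucose", "TenYearCHD",
]


def check_columns(columns: Iterable[str], expected: list[str] = EXPECTED_COLUMNS) -> dict[str, list[str]]:
    """Return the expected columns that are absent and the columns that are not expected."""
    actual = list(columns)
    return {
        "missing": [c for c in expected if c not in actual],
        "unexpected": [c for c in actual if c not in expected],
    }


# Illustrative input (not the real file): one column dropped, one extra column added.
demo_columns = [c for c in EXPECTED_COLUMNS if c != "glucose"] + ["notes"]
report = check_columns(demo_columns)
print(report)
```

With the DataFrame from the cell above in scope, the same helper could be called as `check_columns(df.columns)`; both lists come back empty when the file matches the listing.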


In [6]:
# ─────────────────────────────────────────────────────────────────────────────
# Missing values and descriptive statistics
# ─────────────────────────────────────────────────────────────────────────────
# Missing per column
display(df.isna().sum().to_frame("missing").T)

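| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| missing | 0 | 0 | 105 | 0 | 29 | 53 | 0 | 0 | 0 | 50 | 0 | 0 | 19 | 1 | 388 | 0 |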
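
Raw counts are easier to compare once expressed as a share of the 4238 rows. The sketch below redoes that arithmetic from the counts reported above; the variable names are illustrative, and the bound on incomplete rows assumes the worst case in which no row is missing more than one field.

```python
# Non-zero missing-value counts copied from the table above.
N_ROWS = 4238
missing_counts = {
    "education": 105,
    "cigsPerDay": 29,
    "BPMeds": 53,
    "totChol": 50,
    "BMI": 19,
    "heartRate": 1,
    "glucose": 388,
}

# Percentage of rows missing per column, largest first.
missing_pct = {
    col: 100 * n / N_ROWS
    for col, n in sorted(missing_counts.items(), key=lambda kv: kv[1], reverse=True)
}

for col, pct in missing_pct.items():
    print(f"{col:<12} {pct:5.2f}%")

# Upper bound on rows with at least one missing value
# (reached only if no row has more than one missing field).
max_incomplete_rows = min(N_ROWS, sum(missing_counts.values()))
min_complete_rows = N_ROWS - max_incomplete_rows
print(f"At most {max_incomplete_rows} incomplete rows, at least {min_complete_rows} complete rows")
```

With `df` in scope, `df.isna().mean().mul(100)` yields the same percentages directly, and `df.dropna().shape[0]` gives the exact number of complete rows rather than a bound.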


In [7]:
# Numeric descriptives
display(df.describe(include=[np.number]).T)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| male | 4238.0 | 0.429212 | 0.495022 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| age | 4238.0 | 49.584946 | 8.57216 | 32.0 | 42.0 | 49.0 | 56.0 | 70.0 |
| education | 4133.0 | 1.97895 | 1.019791 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| currentSmoker | 4238.0 | 0.494101 | 0.500024 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| cigsPerDay | 4209.0 | 9.003089 | 11.920094 | 0.0 | 0.0 | 0.0 | 20.0 | 70.0 |
| BPMeds | 4185.0 | 0.02963 | 0.169584 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| prevalentStroke | 4238.0 | 0.005899 | 0.076587 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| prevalentHyp | 4238.0 | 0.310524 | 0.462763 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| diabetes | 4238.0 | 0.02572 | 0.158316 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| totChol | 4188.0 | 236.721585 | 44.590334 | 107.0 | 206.0 | 234.0 | 263.0 | 696.0 |
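
For some columns the maximum sits far above the 75th percentile, for example `totChol` at 696.0 against 263.0. One rough way to quantify this is Tukey's 1.5 × IQR rule, sketched below using only the quartiles reported in the table. This is an illustrative check, not a step taken in the notebook; `upper_fence` is a hypothetical helper, and the choice of columns is arbitrary.

```python
# (25%, 75%, max) copied from the describe() output above for three columns.
quartiles = {
    "age": (42.0, 56.0, 70.0),
    "cigsPerDay": (0.0, 20.0, 70.0),
    "totChol": (206.0, 263.0, 696.0),
}


def upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Tukey upper fence: Q3 + k * IQR."""
    return q3 + k * (q3 - q1)


# Flag columns whose maximum lies above the upper fence.
max_above_fence = {}
for col, (q1, q3, max_val) in quartiles.items():
    fence = upper_fence(q1, q3)
    max_above_fence[col] = max_val > fence
    print(f"{col:<12} fence={fence:7.1f}  max={max_val:6.1f}  above fence: {max_above_fence[col]}")
```

A flagged maximum is only a prompt to look at the raw values; it does not by itself mean the measurement is wrong.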


In [9]:
# ─────────────────────────────────────────────────────────────────────────────
# Value counts for important binary/categorical columns
# ─────────────────────────────────────────────────────────────────────────────
key_cols = [
    "TenYearCHD", "male", "currentSmoker", "BPMeds", "prevalentStroke",
    "prevalentHyp", "diabetes", "education"
]
for col in key_cols:
    if col in df.columns:
        print(f"\n{col} — top values:")
        display(df[col].value_counts(dropna=False).head().to_frame("count"))



TenYearCHD — top values:


| TenYearCHD | count |
|---|---|
| 0 | 3594 |
| 1 | 644 |



male — top values:


| male | count |
|---|---|
| 0 | 2419 |
| 1 | 1819 |



currentSmoker — top values:


| currentSmoker | count |
|---|---|
| 0 | 2144 |
| 1 | 2094 |



BPMeds — top values:


| BPMeds | count |
|---|---|
| 0.0 | 4061 |
| 1.0 | 124 |
| NaN | 53 |



prevalentStroke — top values:


| prevalentStroke | count |
|---|---|
| 0 | 4213 |
| 1 | 25 |



prevalentHyp — top values:


| prevalentHyp | count |
|---|---|
| 0 | 2922 |
| 1 | 1316 |



diabetes — top values:


| diabetes | count |
|---|---|
| 0 | 4129 |
| 1 | 109 |



education — top values:


| education | count |
|---|---|
| 1.0 | 1720 |
| 2.0 | 1253 |
| 3.0 | 687 |
| 4.0 | 473 |
| NaN | 105 |
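
The `TenYearCHD` counts above show a clearly imbalanced target, which matters when judging a classifier's accuracy. The sketch below works out three quantities from those counts: the positive rate, the accuracy of always predicting the majority class, and the negative-to-positive ratio. That ratio is one common choice if a weighted loss were used later, for instance as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`; whether the notebook does so is not shown in this section, so treat it as an assumption.

```python
# Class counts copied from the TenYearCHD value counts above.
n_negative = 3594  # TenYearCHD == 0
n_positive = 644   # TenYearCHD == 1
n_total = n_negative + n_positive

# Share of rows with a CHD event.
positive_rate = n_positive / n_total

# Accuracy obtained by always predicting the majority class (a floor for any model).
majority_baseline_accuracy = max(n_negative, n_positive) / n_total

# Negative-to-positive ratio, a common (assumed, not source-confirmed) weight for the positive class.
pos_weight = n_negative / n_positive

print(f"Positive rate:              {positive_rate:.3f}")
print(f"Majority-baseline accuracy: {majority_baseline_accuracy:.3f}")
print(f"Negative/positive ratio:    {pos_weight:.2f}")
```

Because a constant prediction already reaches the majority-baseline accuracy, metrics such as recall, precision, or ROC AUC are usually more informative than raw accuracy on data this skewed.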
