# Heart Disease Risk Prediction: Logistic Regression

## Step 1: Load and Prepare the Dataset

**Goal**: Load the Heart Disease dataset, clean it, and prepare it for training a logistic regression model.

**What we'll do**:
1. Load CSV into pandas DataFrame
2. Binarize target (Presence → 1, Absence → 0)
3. Exploratory Data Analysis (EDA)
4. Handle missing values/outliers
5. Select features
6. Split data (70/30 stratified)
7. Normalize features
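
To make steps 2, 6, and 7 concrete before touching the real file, here is a minimal sketch on a made-up toy table. The data values, the random seed, and the variable names are assumptions for illustration only. The split is written by hand with NumPy, and "normalize" is read here as z-score standardization; the notebook's own implementation may differ on both points.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the real data (random values, 10 samples per class)
toy = pd.DataFrame({
    "Age": rng.integers(40, 80, size=20),
    "Max HR": rng.integers(100, 180, size=20),
    "Heart Disease": ["Presence"] * 10 + ["Absence"] * 10,
})

# Step 2: binarize the target (Presence -> 1, Absence -> 0)
y = toy["Heart Disease"].map({"Presence": 1, "Absence": 0}).to_numpy()
X = toy[["Age", "Max HR"]].to_numpy(dtype=float)

# Step 6: stratified 70/30 split. Each class is shuffled and split
# separately, so both subsets keep the original class ratio.
train_idx, test_idx = [], []
for label in np.unique(y):
    idx = rng.permutation(np.flatnonzero(y == label))
    n_train = int(round(0.7 * len(idx)))
    train_idx.extend(idx[:n_train])
    test_idx.extend(idx[n_train:])
train_idx = np.array(train_idx)
test_idx = np.array(test_idx)

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Step 7: standardize with statistics from the training set only,
# then apply the same transform to the test set
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma

print(f"Train: {X_train_norm.shape}, Test: {X_test_norm.shape}")
print(f"Positive rate (train/test): {y_train.mean():.2f} / {y_test.mean():.2f}")
```

Computing `mu` and `sigma` on the training rows only keeps information from the test set out of the preprocessing step.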

In [14]:
# Standard imports for data science work
import numpy as np      # Numerical operations, arrays
import pandas as pd     # Data manipulation, DataFrames
import matplotlib.pyplot as plt  # Visualization

# Configure plots to look cleaner
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["axes.grid"] = True

## 1.1 Load the Dataset

**Why pandas?** It is the standard Python library for tabular data. A DataFrame works like a spreadsheet: each row is a sample (here, one patient) and each column is a feature.
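
As a small illustration of the row/column picture, the sketch below builds a two-row DataFrame by hand. The values and the name `toy` are illustrative only and are not read from the dataset file.

```python
import pandas as pd

# Two illustrative samples with two features each
toy = pd.DataFrame({"Age": [70, 67], "BP": [130, 115]})

print(toy.shape)              # (rows, columns) = (samples, features)
print(toy["Age"].mean())      # a column behaves like a 1-D array of values
print(toy.iloc[0].to_dict())  # a row is one sample, keyed by feature name
```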


In [15]:
# Load the CSV file
# pd.read_csv() reads a CSV file and returns a DataFrame
df = pd.read_csv("dataset/Heart_Disease_Prediction.csv")

# Quick look at the data
print(f"Dataset shape: {df.shape}")  # (rows, columns)
# Note: the column count includes the target column, "Heart Disease"
print(f"Samples: {df.shape[0]}, Features: {df.shape[1]}")
df.head()  # Show first 5 rows

Dataset shape: (270, 14)
Samples: 270, Features: 14


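|   | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium | Heart Disease |
|---|-----|-----|-----------------|----|-------------|--------------|-------------|--------|-----------------|---------------|-------------|-------------------------|----------|---------------|
| 0 | 70 | 1 | 4 | 130 | 322 | 0 | 2 | 109 | 0 | 2.4 | 2 | 3 | 3 | Presence |
| 1 | 67 | 0 | 3 | 115 | 564 | 0 | 2 | 160 | 0 | 1.6 | 2 | 0 | 7 | Absence |
| 2 | 57 | 1 | 2 | 124 | 261 | 0 | 0 | 141 | 0 | 0.3 | 1 | 0 | 7 | Presence |
| 3 | 64 | 1 | 4 | 128 | 263 | 0 | 0 | 105 | 1 | 0.2 | 2 | 1 | 7 | Absence |
| 4 | 74 | 0 | 2 | 120 | 269 | 0 | 2 | 121 | 1 | 0.2 | 1 | 1 | 3 | Absence |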


In [None]:
# See all column names - important to know what you're working with
print("Columns in dataset:")
for i, col in enumerate(df.columns):
    print(f"  {i+1}. {col}")
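
Beyond the column names, it often helps to see each column's type and whether it has gaps before moving on to EDA and cleaning. The sketch below shows one way to do that. The helper `summarize_columns` is a hypothetical name, and the toy frame (including its missing cholesterol value) is made up for illustration; on the real data the helper would be called with `df`.

```python
import numpy as np
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a value
    })

# Made-up example with one deliberately missing entry
toy = pd.DataFrame({
    "Age": [70, 67, 57],
    "Cholesterol": [322.0, np.nan, 261.0],
    "Heart Disease": ["Presence", "Absence", "Presence"],
})

summary = summarize_columns(toy)
print(summary)
```

A column with only two distinct values, such as the target here, is a candidate for binary encoding, while a nonzero `n_missing` flags columns that need attention during cleaning.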