# Exercise 10: Logistic Regression Analysis
## Framingham Heart Study - Coronary Heart Disease Risk Factors

### Overview
This analysis examines risk factors for Coronary Heart Disease (CHD) using data from the 
Framingham Heart Study. I will build and evaluate a logistic regression model to identify 
significant predictors of CHD.

### Dataset Information
- **Source**: Framingham Heart Study, Levy (1999)
- **Observations**: 4,695 patients
- **Outcome**: CHD diagnosis (binary: 0 = No CHD, 1 = CHD)
- **Predictors**: Sex, blood pressure, cholesterol, age, BMI, season


## Questions 1-3: Research Question Overview

### 1. What is the outcome variable?
The outcome variable is **CHD fate (chdfate)**, representing Coronary Heart Disease diagnosis:
- **1** = CHD present
- **0** = CHD absent

### 2. What predictors are researchers interested in?
Cardiovascular risk factors being examined:
- **sex**: Gender (1 = Male, 2 = Female)
- **sbp**: Systolic Blood Pressure (mmHg)
- **dbp**: Diastolic Blood Pressure (mmHg)  
- **scl**: Serum Cholesterol Level (mg/dL)
- **age**: Age (years)
- **bmi**: Body Mass Index (kg/m²)
- **month**: Month of year at baseline (will be converted to seasons)

### 3. What is the hypothesis?
**Hypothesis**: Cardiovascular risk factors (blood pressure, cholesterol, BMI, age, and sex) 
are associated with increased risk of developing Coronary Heart Disease.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

## Question 4: Data Exploration and Summary Statistics

I begin by loading the dataset and examining its structure and descriptive statistics, then 
check for missing values in the second cell.

In [None]:
#open and read csv (which I saved from the xlsx)
framingham_df = pd.read_csv('framingham_dataset_mod.csv')

#quick look to see if we got it right
print(f"Dataset shape: {framingham_df.shape}")
print(f"Observations: {framingham_df.shape[0]:,} | Variables: {framingham_df.shape[1]}")

#only a cell's last expression is displayed, so print the intermediate views
print(framingham_df.head(10))

#check data types
print(framingham_df.dtypes)

#summary stats on the data
framingham_df.describe()


In [None]:
#missing data assessment
#identify and quantify missing values to determine the appropriate handling strategy

#count missing values per column
missing_counts = framingham_df.isnull().sum()

#build a summary dataframe of counts and percentages for columns with missing values
missing_pct = (framingham_df.isnull().sum() / len(framingham_df) * 100).round(2)
missing_summary_df = pd.DataFrame({
    'Missing_Count': missing_counts[missing_counts > 0],
    'Missing_Percentage': missing_pct[missing_counts > 0]
})

#show (printed, since only a cell's last expression is displayed)
print(missing_summary_df)

#overall totals across all cells
total_missing = missing_counts.sum()
total_cells = framingham_df.shape[0] * framingham_df.shape[1]

#show the overall totals
print(f"Total missing values: {total_missing}")
print(f"Percentage of all data: {(total_missing / total_cells * 100):.2f}%")

## Missing Data Strategy
With only 42 missing values (0.10% of all cells), I will use **complete case analysis**, dropping any observation that is missing a predictor or the outcome.

This minimal data loss should not significantly impact our results. So, onward we go...
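
Missing cells and dropped rows are counted differently, which is why the number of observations removed can be smaller than the number of missing values. The sketch below uses a small made-up table (the values and the name `toy_df` are illustrative assumptions, not the Framingham data) to show both counts side by side.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Framingham file)
toy_df = pd.DataFrame({
    'sbp': [120.0, np.nan, 135.0, 150.0],
    'bmi': [24.1, np.nan, 27.3, np.nan],
    'chdfate': [0, 1, 0, 1],
})

# Cell-level view: how many individual values are missing
missing_cells = int(toy_df.isnull().sum().sum())

# Row-level view: how many observations have at least one missing value
rows_with_missing = int(toy_df.isnull().any(axis=1).sum())

# Complete case analysis keeps only rows with no missing values
complete_df = toy_df.dropna()

print(f"Missing cells: {missing_cells}")
print(f"Rows with any missing value: {rows_with_missing}")
print(f"Complete cases: {len(complete_df)} of {len(toy_df)}")
```

A row with two missing values is dropped only once, so the row-level loss is the figure that matters for the model's sample size.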


In [None]:
#CHD outcome distribution

#count CHD outcomes and compute prevalence
chd_counts = framingham_df['chdfate'].value_counts()
chd_prevalence = framingham_df['chdfate'].mean() * 100

#show
print("CHD Outcome Distribution:")
print(f"  No CHD (0): {chd_counts[0]:,} ({chd_counts[0]/len(framingham_df)*100:.1f}%)")
print(f"  CHD (1): {chd_counts[1]:,} ({chd_counts[1]/len(framingham_df)*100:.1f}%)")
print(f"\nCHD Prevalence: {chd_prevalence:.2f}%")

In [None]:
#sex distribution

#determine sex counts
sex_counts = framingham_df['sex'].value_counts()

#show
print("Sex Distribution:")
print(f"  Males (sex=1): {sex_counts[1]:,}")
print(f"  Females (sex=2): {sex_counts[2]:,}")

## Question 5: Creating Seasonal Variables

The month variable needs transformation for regression modeling. I'll create four binary 
indicators for seasons, using winter as the reference category, as suggested.  [I love being able to format markdown cells]

**Season Definitions:**
- **Winter**: December (12), January (1), February (2)
- **Spring**: March (3), April (4), May (5)
- **Summer**: June (6), July (7), August (8)
- **Fall**: September (9), October (10), November (11)
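
One thing worth checking with indicator variables like these is that every observation lands in exactly one season. The sketch below is an illustration on made-up months (the names `toy_df` and `season_months` are assumptions for the example); it shows that a missing month gets a 0 in all four indicators, which a model with winter as the reference would silently treat as winter.

```python
import numpy as np
import pandas as pd

# Hypothetical month values, including one missing entry
toy_df = pd.DataFrame({'month': [12, 1, 4, 7, 10, np.nan]})

# Season definitions from the list above
season_months = {
    'winter': [12, 1, 2],
    'spring': [3, 4, 5],
    'summer': [6, 7, 8],
    'fall': [9, 10, 11],
}

# One 0/1 indicator per season, built the same way as in this notebook
for name, months in season_months.items():
    toy_df[name] = toy_df['month'].isin(months).astype(int)

# Each row should belong to exactly one season
season_total = toy_df[list(season_months)].sum(axis=1)

# Rows with no season at all (here, the missing month)
unassigned = toy_df[season_total == 0]
print(f"Rows not assigned to any season: {len(unassigned)}")
```

If the count is above zero on the real data, adding `month` to the `dropna` subset would be one way to keep those rows out of the reference category.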

In [None]:
#create season variables from the month
framingham_df['winter'] = framingham_df['month'].isin([12, 1, 2]).astype(int)
framingham_df['spring'] = framingham_df['month'].isin([3, 4, 5]).astype(int)
framingham_df['summer'] = framingham_df['month'].isin([6, 7, 8]).astype(int)
framingham_df['fall'] = framingham_df['month'].isin([9, 10, 11]).astype(int)

#load season variables into a separate dataframe
season_summary_df = pd.DataFrame({
    'Winter': [framingham_df['winter'].sum()],
    'Spring': [framingham_df['spring'].sum()],
    'Summer': [framingham_df['summer'].sum()],
    'Fall': [framingham_df['fall'].sum()]
})

#visual check
season_summary_df

## Question 6: Initial Logistic Regression Model

I'll fit an initial logistic regression model using all predictor variables (sex, sbp, dbp, 
scl, age, bmi) and seasonal indicators (spring, summer, fall), with winter as reference.

**Note**: I'll exclude ID and month variables. ID is not a predictor, and I'll use season indicators rather than months.

In [None]:
#all predictors, including the season indicators (winter is the reference)
predictor_vars = ['sex', 'sbp', 'dbp', 'scl', 'age', 'bmi', 'spring', 'summer', 'fall']

#start with the data as loaded
original_size = len(framingham_df)

#drop nulls into a new dataframe
framingham_complete_df = framingham_df.dropna(subset=predictor_vars + ['chdfate'])

#how many were removed by dropping the nulls
removed_count = original_size - len(framingham_complete_df)

#show
print(f"Original dataset size: {original_size:,}")
print(f"After removing missing data: {len(framingham_complete_df):,}")
print(f"Observations removed: {removed_count} ({removed_count/original_size*100:.2f}%)")


In [None]:
#outcome and predictors
outcome_var = framingham_complete_df['chdfate']
predictor_data = framingham_complete_df[predictor_vars].copy()
predictor_data_with_const = sm.add_constant(predictor_data)

#show
print(f"Outcome variable: chdfate (n={len(outcome_var):,})")
print(f"Predictor variables: {len(predictor_vars)}")

In [None]:
#build logistic regression model (so easy in python!)
initial_logit_model = sm.Logit(outcome_var, predictor_data_with_const).fit()

#display what we got
print("✓ Initial logistic regression model fitted successfully")
print(f"  Convergence status: {initial_logit_model.mle_retvals['converged']}")
print(f"  Iterations: {initial_logit_model.mle_retvals['iterations']}")

#show
print(initial_logit_model.summary())

### Initial Model Results

The model converged successfully in 6 iterations. My key findings:
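- **Pseudo R²**: 0.077 (McFadden's pseudo R²: a 7.68% improvement in log-likelihood over the intercept-only model, not a share of variance explained)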
- **Log-Likelihood**: -2,674.9
- **LLR p-value**: 3.553e-90 (highly significant overall model)

**The significant predictors (where p < 0.05):**
- sex, sbp, scl, age, bmi

**Non-significant predictors:**
- dbp, seasonal variables
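
The pseudo R² that statsmodels reports for a logit model is McFadden's, defined as 1 - (log-likelihood of the fitted model / log-likelihood of the intercept-only model). The sketch below works through that definition on made-up outcomes and probabilities; the helper names and the numbers are illustrative assumptions, not output from this analysis.

```python
import numpy as np

def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - ll_model / ll_null."""
    return 1.0 - ll_model / ll_null

# Hypothetical outcomes and fitted probabilities
y = np.array([0, 0, 0, 1, 0, 1, 1, 0])
p_model = np.array([0.1, 0.2, 0.3, 0.7, 0.4, 0.6, 0.8, 0.2])

# An intercept-only model predicts the overall event rate for everyone
p_null = np.full(len(y), y.mean())

ll_model = bernoulli_log_likelihood(y, p_model)
ll_null = bernoulli_log_likelihood(y, p_null)
r2 = mcfadden_r2(ll_model, ll_null)
print(f"McFadden pseudo R-squared: {r2:.3f}")
```

Because it compares log-likelihoods rather than sums of squares, values tend to run much lower than an OLS R² for a comparably useful model.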

## Question 7: Model Diagnostics

I'll conduct diagnostic checks to assess model assumptions and identify potential issues, 
in five parts (7a-7e), starting with:

### 7a. Distribution of Predictor Variables
Check for skewness that might require transformation.

In [None]:
#continuous variables only
continuous_vars = ['sbp', 'dbp', 'scl', 'age', 'bmi']

#build the skewness list
skewness_results = []
for var in continuous_vars:
    skew_value = stats.skew(framingham_complete_df[var].dropna())
    
    if abs(skew_value) > 1:
        assessment = "HIGHLY SKEWED (suggest transforming)"
    elif abs(skew_value) > 0.5:
        assessment = "MODERATELY SKEWED (consider transforming)"
    else:
        assessment = "APPROXIMATELY SYMMETRIC (leave as is)"

    #build the skewness list
    skewness_results.append({
        'Variable': var,
        'Skewness': round(skew_value, 3),
        'Assessment': assessment
    })

#show skewness results
skewness_df = pd.DataFrame(skewness_results)
skewness_df

In [None]:
#let's take a visual view of the various distributions

#setup charts
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Distribution of Main Predictor Variables', fontsize=16, fontweight='bold')

plot_vars = ['sbp', 'dbp', 'scl', 'age', 'bmi', 'sex']

#build the plots' visualization
for idx, var in enumerate(plot_vars):
    row = idx // 3
    col = idx % 3
    
    axes[row, col].hist(framingham_complete_df[var], bins=30, edgecolor='black', alpha=0.7)
    axes[row, col].set_title(var.upper(), fontweight='bold')
    axes[row, col].set_xlabel('Value')
    axes[row, col].set_ylabel('Frequency')
    
    skew_val = stats.skew(framingham_complete_df[var])
    axes[row, col].text(0.02, 0.98, f'Skewness: {skew_val:.2f}', 
                        transform=axes[row, col].transAxes,
                        verticalalignment='top',
                        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

#show
plt.tight_layout()
plt.show()


### 7b. Multicollinearity Assessment (VIF)

For the second part of our model assumptions testing, I'll check for multicollinearity using Variance Inflation Factors (VIF). High VIF values 
(>10) indicate problematic multicollinearity.

**How to interpret VIF:**
- VIF < 5: No concern
- VIF 5-10: Moderate multicollinearity
- VIF > 10: High multicollinearity (problematic)

In [None]:
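#predictors to check: sex plus the continuous variables
#note: no intercept column is added here, so these VIFs are computed without a constant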
vif_vars = ['sex', 'sbp', 'dbp', 'scl', 'age', 'bmi']
vif_data = framingham_complete_df[vif_vars].copy()

#build the VIF list
vif_results = []
for i, var in enumerate(vif_vars):
    vif_value = variance_inflation_factor(vif_data.values, i)
    vif_results.append({'Variable': var, 'VIF': round(vif_value, 2)})

#show our VIFs
vif_df = pd.DataFrame(vif_results).sort_values('VIF', ascending=False)
vif_df

In [None]:
#find the high VIFs, build a dataframe for just those
high_vif_df = vif_df[vif_df['VIF'] > 5]

#show them, if there are any
print(f"Variables with VIF > 5: {len(high_vif_df)}")
if len(high_vif_df) > 0:
    print("\n⚠ Multicollinearity detected:")
    for _, row in high_vif_df.iterrows():
        print(f"  - {row['Variable']}: VIF = {row['VIF']:.2f}")
else:
    print("\n✓ No concerning multicollinearity detected")

In [None]:
#create the correlation matrix
correlation_matrix = framingham_complete_df[vif_vars].corr()
print(correlation_matrix.round(3))

#plot the correlation matrix as a heatmap
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=ax)
ax.set_title('Correlation Matrix of Predictor Variables', fontsize=14, fontweight='bold')
plt.tight_layout()

#show the correlation matrix
plt.show()

### Key Finding: SBP and DBP Multicollinearity

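SBP and DBP show severe multicollinearity in the VIF table (VIF > 100) and are strongly 
correlated (r = 0.783). This is expected as both measure blood pressure. One caveat: the 
VIFs above were computed without an intercept column in the design matrix, which tends to 
inflate them; a pairwise correlation of 0.783 on its own corresponds to a VIF of about 
1 / (1 - 0.783²) ≈ 2.6.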

**Remedial Action**: Remove DBP from the model, keeping SBP as it is more clinically 
relevant and commonly used in cardiovascular risk assessment.
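
The VIFs in this notebook come from a design matrix without an intercept column. As a cross-check, the sketch below computes VIFs with an intercept included, using plain NumPy and synthetic stand-ins for `sbp`, `dbp`, and `age` (the data and the helper `vif_with_intercept` are assumptions for illustration). With statsmodels, the comparable step would be to pass a matrix that already contains a constant column, for example one built with `sm.add_constant`.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data: dbp is built to correlate with sbp at roughly 0.78
rng = np.random.default_rng(0)
n = 2000
sbp = rng.normal(130, 20, n)
dbp = 0.5 * sbp + rng.normal(17, 8, n)
age = rng.normal(45, 8, n)

vifs = vif_with_intercept(np.column_stack([sbp, dbp, age]))
r = np.corrcoef(sbp, dbp)[0, 1]

print("VIF (sbp, dbp, age):", np.round(vifs, 2))
print(f"corr(sbp, dbp) = {r:.3f}, 1/(1 - r^2) = {1 / (1 - r**2):.2f}")
```

In this synthetic setup the intercept-adjusted VIF for `sbp` stays close to 1 / (1 - r²), which is far below the thresholds listed in 7b; whether the same holds for the real data would need to be checked by rerunning the VIF cell with a constant.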

### 7c. Linearity Assessment (Box-Tidwell Test)

I'll test whether continuous predictors have linear relationships with the log-odds of CHD. 
The Box-Tidwell test adds interaction terms between each predictor and its logarithm.

**Interpretation**: If p-value > 0.05, the linearity assumption holds.

In [None]:
#Box-Tidwell setup and calculation

#predictors for the Box-Tidwell model (log terms are added for the continuous ones)
linearity_predictors = ['sex'] + continuous_vars + ['spring', 'summer', 'fall']
linearity_data = framingham_complete_df[linearity_predictors].copy()

#add log interaction terms
for var in continuous_vars:
    linearity_data[f'{var}_log_interaction'] = (
        linearity_data[var] * np.log(linearity_data[var] + 0.1)
    )

#add constant
linearity_data_with_const = sm.add_constant(linearity_data)

print(f"✓ Created interaction terms for {len(continuous_vars)} continuous variables")

#build the Box-Tidwell model
linearity_model = sm.Logit(outcome_var, linearity_data_with_const).fit(disp=0)

print("✓ Box-Tidwell model fitted successfully")

#build a list of the Box-Tidwell results
linearity_results = []
for var in continuous_vars:
    interaction_term = f'{var}_log_interaction'
    coef = linearity_model.params[interaction_term]
    pval = linearity_model.pvalues[interaction_term]
    is_linear = 'Yes' if pval > 0.05 else 'No'
    
    linearity_results.append({
        'Variable': var,
        'Coefficient': round(coef, 4),
        'P-value': round(pval, 4),
        'Linear_Relationship': is_linear
    })

#show the Box-Tidwell results
linearity_df = pd.DataFrame(linearity_results)
linearity_df


### Linearity Assessment Results

All continuous predictors show linear relationships with the log-odds of CHD (all p-values are greater than 0.05). No transformations are needed based on the linearity assumption.



### 7d. Outlier Assessment

Identify potential outliers and influential observations using:
- **Pearson residuals**: Absolute values > 3 indicate potential outliers
- **Cook's Distance**: Values > 4/n indicate influential observations
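
For a binary outcome, the Pearson residual is the raw residual scaled by the binomial standard deviation, (y - p) / sqrt(p(1 - p)). The sketch below applies that formula to made-up values (the arrays and the helper name are illustrative assumptions) to show what kind of observation crosses the threshold of 3.

```python
import numpy as np

def pearson_residuals(y, p):
    """Pearson residuals for binary outcomes y and predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return (y - p) / np.sqrt(p * (1.0 - p))

# Hypothetical outcomes and predicted probabilities
y = np.array([0, 1, 1, 0, 1])
p = np.array([0.10, 0.80, 0.05, 0.50, 0.95])

resid = pearson_residuals(y, p)
flagged = np.abs(resid) > 3

for yi, pi, ri, fi in zip(y, p, resid, flagged):
    print(f"y={yi}, p={pi:.2f}, residual={ri:+.2f}, flagged={fi}")
```

For an event (y = 1) the residual simplifies to sqrt((1 - p) / p), so it exceeds 3 only when the model assigned that event a probability below 0.1. In other words, the flagged cases are events the model considered very unlikely (or, symmetrically, non-events it considered very likely).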

In [None]:
#calculate Pearson residuals
predicted_probs = initial_logit_model.predict(predictor_data_with_const)
pearson_residuals = initial_logit_model.resid_pearson

outlier_threshold = 3
outlier_mask = np.abs(pearson_residuals) > outlier_threshold
outlier_count = outlier_mask.sum()

#show Pearson residuals
print(f"Observations with |Pearson residual| > {outlier_threshold}: {outlier_count}")
print(f"Percentage of data: {outlier_count/len(pearson_residuals)*100:.2f}%")

#calculate Cook's distance
model_influence = initial_logit_model.get_influence()
cooks_distance = model_influence.cooks_distance[0]

cook_threshold = 4 / len(framingham_complete_df)
influential_mask = cooks_distance > cook_threshold
influential_count = influential_mask.sum()

#show Cook's distance
print(f"Influential observations (Cook's D > 4/n): {influential_count}")
print(f"Percentage of data: {influential_count/len(framingham_complete_df)*100:.2f}%")

In [None]:
#now I'll visualize our outlier diagnostic information

#setup visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Diagnostic Plots for Outliers', fontsize=16, fontweight='bold')

#residual plot
axes[0, 0].scatter(predicted_probs, pearson_residuals, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].axhline(y=3, color='orange', linestyle='--', alpha=0.5)
axes[0, 0].axhline(y=-3, color='orange', linestyle='--', alpha=0.5)
axes[0, 0].set_xlabel('Predicted Probability')
axes[0, 0].set_ylabel('Pearson Residuals')
axes[0, 0].set_title('Residual Plot')

#Q-Q plot
stats.probplot(pearson_residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')

#Cook's distance
axes[1, 0].stem(range(len(cooks_distance)), cooks_distance, markerfmt=',', basefmt=" ")
axes[1, 0].axhline(y=cook_threshold, color='r', linestyle='--', label="Threshold (4/n)")
axes[1, 0].set_xlabel('Observation Index')
axes[1, 0].set_ylabel("Cook's Distance")
axes[1, 0].set_title("Cook's Distance")
axes[1, 0].legend()

#leverage vs residuals
leverage_values = model_influence.hat_matrix_diag
axes[1, 1].scatter(leverage_values, pearson_residuals, alpha=0.5)
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Pearson Residuals')
axes[1, 1].set_title('Leverage vs Residuals')

#show
plt.tight_layout()
plt.show()

### Outlier Assessment Conclusion

- **|Pearson residual| > 3**: Only 8 observations (0.17%)
- **Influential observations (Cook's D > 4/n)**: 98 observations (2.11%)

Few observations are flagged relative to the size of the dataset. Nothing in the diagnostics 
suggests they are data errors, so they will be retained.

### 7e. Adequate Outcomes Per Predictor Category

Verifying that we have at least 5 CHD outcomes in each category of the sex variable. 
This ensures stable coefficient estimation.

In [None]:
#CHD by sex crosstab
sex_chd_crosstab = pd.crosstab(
    framingham_complete_df['sex'], 
    framingham_complete_df['chdfate'], 
    margins=True
)

#show (printed, since only a cell's last expression is displayed)
print(sex_chd_crosstab)

#outcomes per sex
for sex_value in framingham_complete_df['sex'].unique():
    chd_count = len(framingham_complete_df[
        (framingham_complete_df['sex'] == sex_value) & 
        (framingham_complete_df['chdfate'] == 1)
    ])
    sex_label = "Male" if sex_value == 1 else "Female"
    status = "✓ Sufficient" if chd_count >= 5 else "✗ Insufficient"
    
    print(f"{sex_label} (sex={sex_value}): {chd_count} CHD cases - {status}")

Both sex categories have far more than the minimum 5 required outcomes, ensuring stable 
coefficient estimation (Males: 818 cases, Females: 645 cases).


## Question 8: Addressing Issues and Refitting Model

### Summary of Diagnostic Findings

The diagnostic process identified one primary issue requiring remediation:

**Multicollinearity**: Severe multicollinearity between SBP and DBP (VIF > 100, r = 0.783)

### Remedial Action

**I suggest removing DBP** from the model and retaining SBP.

**My rationale**: 
- SBP is more clinically relevant for cardiovascular disease prediction
- SBP is the standard measure used in clinical practice
- The high correlation suggests the two variables carry largely overlapping information

In [None]:
#based on our diagnostics and analysis, I'll now build the final model taking into account our above recommendations

#setup with my final model predictors in a list
final_predictor_vars = ['sex', 'sbp', 'scl', 'age', 'bmi', 'spring', 'summer', 'fall']

#show the final model predictors I've selected
print("Final model predictors:")
for var in final_predictor_vars:
    print(f"  - {var}")

#prepare final model data
final_outcome_var = framingham_complete_df['chdfate']
final_predictor_data = framingham_complete_df[final_predictor_vars].copy()
final_predictor_data_with_const = sm.add_constant(final_predictor_data)

#show the suggested model
print(f"✓ Final model data prepared")
print(f"  Observations: {len(final_outcome_var):,}")
print(f"  Predictors: {len(final_predictor_vars)}")

#fit final model
final_logit_model = sm.Logit(final_outcome_var, final_predictor_data_with_const).fit()

#show the final model
print("✓ Final logistic regression model fitted successfully")
print(f"  Convergence: {final_logit_model.mle_retvals['converged']}")
print(f"  Iterations: {final_logit_model.mle_retvals['iterations']}")

#final model summary
print(final_logit_model.summary())


### Final Model Performance

- **Pseudo R²**: 0.076 (McFadden's; a 7.65% improvement in log-likelihood over the intercept-only model)
- **Log-Likelihood**: -2,675.8
- **LLR p-value**: 1.167e-90 (highly significant)
- **Convergence**: Successful in 6 iterations

All five cardiovascular risk factors remain highly significant (p < 0.001): sex, sbp, scl, 
age, and bmi. 

Seasonal variables are not significant.

## Question 9: Odds Ratios and Interpretation

Odds ratios represent the multiplicative change in CHD odds for a one-unit increase in each 
predictor, holding all other variables constant.

**Interpretation Guide:**
- OR > 1: Increased odds of CHD
- OR < 1: Decreased odds of CHD
- OR = 1: No association with CHD

In [None]:
#setup odds calculations
odds_ratios = np.exp(final_logit_model.params)
confidence_intervals = np.exp(final_logit_model.conf_int())
confidence_intervals.columns = ['CI_Lower', 'CI_Upper']

#do the odds calcs and load into a dataframe just for this purpose
odds_ratio_results_df = pd.DataFrame({
    'Odds_Ratio': odds_ratios,
    'CI_Lower_95': confidence_intervals['CI_Lower'],
    'CI_Upper_95': confidence_intervals['CI_Upper'],
    'P_value': final_logit_model.pvalues,
    'Significant': final_logit_model.pvalues < 0.05
}).round(4)

#show
odds_ratio_results_df

### Interpretation of Significant Predictors

#### Sex (Female vs Male)
**OR = 0.447 (95% CI: 0.392 - 0.510), p < 0.001**

Females have approximately **55% lower odds** of developing CHD compared to males, holding 
the other predictors constant. This strong protective association is consistent with known 
sex differences in cardiovascular disease.

#### Systolic Blood Pressure (SBP) (per 1 mmHg)
**OR = 1.011 (95% CI: 1.007 - 1.014), p < 0.001**

Each 1 mmHg increase in SBP increases CHD odds by 1.1%. More meaningfully, a **10 mmHg 
increase** corresponds to an **11.6% increase** in CHD odds (1.011^10 = 1.116).

#### Serum Cholesterol (per 1 mg/dL)
**OR = 1.007 (95% CI: 1.005 - 1.008), p < 0.001**

Each 1 mg/dL increase in cholesterol increases CHD odds by 0.7%. A **40 mg/dL increase** 
(e.g., 200 to 240 mg/dL) corresponds to a **32% increase** in CHD odds (1.007^40 = 1.32).

#### Age (per 1 year)
**OR = 1.018 (95% CI: 1.009 - 1.026), p < 0.001**

Each additional year of age increases CHD odds by 1.8%. Over **10 years**, this translates 
to a **19.5% increase** in CHD odds (1.018^10 ≈ 1.195).

#### Body Mass Index (per 1 kg/m²)
**OR = 1.048 (95% CI: 1.031 - 1.066), p < 0.001**

Each 1-unit BMI increase raises CHD odds by 4.8%. A **5-unit BMI increase** (e.g., BMI 25 
to 30) corresponds to a **26.4% increase** in CHD odds (1.048^5 ≈ 1.264).

#### Seasonal Variables
Spring, summer, and fall showed no significant associations with CHD (all p > 0.3), 
suggesting season of baseline examination does not meaningfully predict CHD risk.
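
The multi-unit figures quoted above follow from the fact that odds ratios multiply: a k-unit increase has odds ratio exp(k × β) = OR^k. A minimal sketch of that arithmetic, using the rounded per-unit odds ratios from this section (the helper name `rescale_odds_ratio` is an assumption for illustration):

```python
import math

def rescale_odds_ratio(or_per_unit, k):
    """Odds ratio for a k-unit increase, given the per-unit odds ratio."""
    return math.exp(k * math.log(or_per_unit))

# (per-unit OR, size of the increase) taken from the rounded values above
examples = {
    'sbp, +10 mmHg': (1.011, 10),
    'scl, +40 mg/dL': (1.007, 40),
    'age, +10 years': (1.018, 10),
    'bmi, +5 kg/m2': (1.048, 5),
}

for label, (or_per_unit, k) in examples.items():
    scaled = rescale_odds_ratio(or_per_unit, k)
    print(f"{label}: OR = {scaled:.3f} ({(scaled - 1) * 100:.1f}% change in odds)")
```

Because the inputs are rounded to three decimals, results computed from the unrounded model coefficients may differ slightly in the last digit.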


In [None]:
#drop the intercept and sort by odds ratio for plotting
plot_data_df = odds_ratio_results_df[odds_ratio_results_df.index != 'const'].copy()
plot_data_df = plot_data_df.sort_values('Odds_Ratio')

#plot each odds ratio with its 95% confidence interval
fig, ax = plt.subplots(figsize=(10, 8))

for i, (idx, row) in enumerate(plot_data_df.iterrows()):
    color = 'green' if row['Significant'] else 'gray'
    ax.errorbar(
        row['Odds_Ratio'], i,
        xerr=[[row['Odds_Ratio'] - row['CI_Lower_95']], 
              [row['CI_Upper_95'] - row['Odds_Ratio']]],
        fmt='o', markersize=10, capsize=5, color=color, ecolor='black'
    )

#setup charts
ax.axvline(x=1, color='red', linestyle='--', linewidth=2, label='OR = 1 (No effect)')
ax.set_yticks(range(len(plot_data_df)))
ax.set_yticklabels(plot_data_df.index)
ax.set_xlabel('Odds Ratio', fontsize=12, fontweight='bold')
ax.set_title('Odds Ratios for CHD with 95% Confidence Intervals', 
             fontsize=14, fontweight='bold')
ax.set_xscale('log')
ax.legend()
ax.grid(True, alpha=0.3)

#show our great work
plt.tight_layout()
plt.show()

## Model Performance Evaluation

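I assess the final model's predictive performance using classification metrics and ROC 
analysis. Predictions are made on the same observations used to fit the model, so these 
are in-sample figures.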
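
The 0.5 cutoff used for classification is a convention rather than a requirement, and the confusion matrix depends on it. The sketch below, on made-up outcomes and probabilities (the arrays and the helper name are illustrative assumptions), shows how sensitivity and specificity trade off as the threshold moves.

```python
import numpy as np

def sensitivity_specificity(y_true, probs, threshold):
    """Sensitivity and specificity when classifying probs > threshold as positive."""
    y_true = np.asarray(y_true)
    predicted = (np.asarray(probs) > threshold).astype(int)
    tp = int(np.sum((predicted == 1) & (y_true == 1)))
    fn = int(np.sum((predicted == 0) & (y_true == 1)))
    tn = int(np.sum((predicted == 0) & (y_true == 0)))
    fp = int(np.sum((predicted == 1) & (y_true == 0)))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical outcomes and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.10, 0.15, 0.20, 0.25, 0.35, 0.45, 0.30, 0.40, 0.55, 0.60])

for threshold in (0.5, 0.3):
    sens, spec = sensitivity_specificity(y_true, probs, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the threshold catches more true cases at the cost of more false positives. The ROC curve summarizes this trade-off across all thresholds, which is why AUC does not depend on the cutoff chosen.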

In [None]:
#generate predictions
predicted_probabilities = final_logit_model.predict(final_predictor_data_with_const)
predicted_classes = (predicted_probabilities > 0.5).astype(int)

#show predictions
print("✓ Predictions generated")
print(f"  Prediction threshold: 0.5")

In [None]:
#let's continue our assessment with my favorite view - the confusion matrix!

#create the confusion matrix dataframe
conf_matrix = confusion_matrix(final_outcome_var, predicted_classes)

conf_matrix_df = pd.DataFrame(
    conf_matrix,
    columns=['Predicted_No_CHD', 'Predicted_CHD'],
    index=['Actual_No_CHD', 'Actual_CHD']
)

#show it
conf_matrix_df

In [None]:
#show the classification report
print(classification_report(
    final_outcome_var, 
    predicted_classes, 
    target_names=['No CHD', 'CHD']
))

In [None]:
#continue to evaluate my model by looking at AUC-ROC

#calculate AUC-ROC
auc_score = roc_auc_score(final_outcome_var, predicted_probabilities)

#show it
print(f"AUC-ROC Score: {auc_score:.4f}")
print("\nInterpretation:")
print("  0.5-0.7: Poor discrimination")
print("  0.7-0.8: Acceptable discrimination")
print("  0.8-0.9: Excellent discrimination")
print("  >0.9: Outstanding discrimination")

#ROC curve
fpr, tpr, thresholds = roc_curve(final_outcome_var, predicted_probabilities)

#plot curve
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc_score:.3f})')
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curve for CHD Prediction Model', fontsize=14, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

#show curve
plt.tight_layout()
plt.show()

## Conclusions

### Key Findings

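1. **Model Performance**: The final logistic regression model is highly statistically 
   significant overall (p < 0.001), but its discrimination is modest (AUC-ROC = 0.689, 
   just below the 0.7 cutoff for "acceptable" in the guide printed above), and that figure 
   is in-sample.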

2. **Significant Risk Factors**: Five cardiovascular risk factors showed significant 
   associations with CHD:
   - Female sex (protective)
   - Systolic blood pressure
   - Serum cholesterol
   - Age
   - Body mass index

3. **Protective Effect of Female Sex**: Females demonstrated 55% lower CHD odds compared to 
   males, consistent with known sex differences.

4. **Modifiable Risk Factors**: Blood pressure, cholesterol, and BMI represent important 
   modifiable targets for CHD prevention.

5. **Model Diagnostics**: After addressing multicollinearity by removing DBP, all diagnostic 
   checks were satisfactory with no major violations of assumptions.

### Limitations

- **Model Fit**: Pseudo R² of 7.65% (McFadden) is low, indicating that much of the variation in CHD outcomes is not captured by the model
- **Missing Data**: Complete case analysis excluded 41 observations (0.87% loss)
- **Binary Outcome**: Does not capture disease severity or time to event

### Clinical Implications

The findings support the importance of traditional cardiovascular risk factor management. 
Clinical interventions targeting blood pressure control, cholesterol reduction, and weight 
management are likely to yield meaningful reductions in CHD risk. The strong protective 
effect of female sex suggests that understanding hormonal and biological mechanisms could 
inform prevention strategies.

### What Might Improve My Model

Family history is another potential indicator for CHD. I'd love to have that data to 
include in the model and test whether it is more influential than the existing predictors, 
an enhancement to them, or of no influence at all. In my own case, my doctors have, over 
the years, keyed on my father's CHD when deciding on appropriate diagnostic examinations 
for me. This is the case even though he remains fit and healthy at the age of 96!

### FINAL CONCLUSION

I'll be spending more time in the gym!!!