---
title: "Logistic Regression and Survival Analysis"
output-file: "04_logistic_and_survival.html"
format: html
---

# 📊 4.6 Logistic Regression and Survival Analysis

This notebook introduces logistic regression and survival analysis for nutrition research, focusing on binary outcomes and time-to-event data.

**Objectives**:
- Apply logistic regression to predict binary outcomes.
- Perform survival analysis to model time-to-event data.
- Use `vitamin_trial.csv` to analyze vitamin D trial outcomes.

**Context**: In nutrition studies, logistic regression models the probability of a binary outcome, such as whether a participant's health improved, while survival analysis models the time until an event, such as a response to treatment.

<details><summary>Fun Fact</summary>
Hippos may not run clinical trials, but their vitamin D data helps us model health outcomes with statistical flair! 🦛
</details>

In [1]:
# Install required packages (e.g. when running in Colab)
%pip install pandas numpy scikit-learn lifelines matplotlib
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
from sklearn.linear_model import LogisticRegression  # For logistic regression
from sklearn.preprocessing import LabelEncoder  # For encoding categorical variables
from lifelines import KaplanMeierFitter  # For survival analysis
import matplotlib.pyplot as plt  # For visualization
print('Analysis environment ready.')

Analysis environment ready.


## Data Preparation

Load `vitamin_trial.csv`, containing vitamin D trial data, and preprocess for analysis.

In [2]:
# Load the dataset
df = pd.read_csv('data/vitamin_trial.csv')  # Path relative to notebook

print(f'Data shape: {df.shape}')  # Display rows and columns
print(f'Sample row: ID={df.iloc[0]["ID"]}, Group={df.iloc[0]["Group"]}, Vitamin_D={df.iloc[0]["Vitamin_D"]}, Time={df.iloc[0]["Time"]}, Outcome={df.iloc[0]["Outcome"]}')  # Show first row

Data shape: (200, 5)
Sample row: ID=V1, Group=Control, Vitamin_D=12.5, Time=6, Outcome=Normal
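
Before modeling, it can help to check for missing values and to see how outcomes are distributed across groups, since scikit-learn's `LogisticRegression` does not accept missing values. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up rows) that mirrors the five columns of `vitamin_trial.csv`; with the real data, the same two checks would be run on `df`.

```python
import pandas as pd

# Hypothetical stand-in rows with the same five columns as vitamin_trial.csv
toy = pd.DataFrame({
    'ID': ['V1', 'V2', 'V3', 'V4'],
    'Group': ['Control', 'Treatment', 'Treatment', 'Control'],
    'Vitamin_D': [12.5, 20.1, None, 15.0],  # one value deliberately missing
    'Time': [6, 3, 9, 12],
    'Outcome': ['Normal', 'Improved', 'Improved', 'Normal'],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Cross-tabulate group against outcome to spot empty or sparse cells
group_outcome_counts = pd.crosstab(toy['Group'], toy['Outcome'])
print(group_outcome_counts)
```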


## Logistic Regression

Model the probability of `Outcome` = Improved using `Vitamin_D` and `Group`.
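
As a sketch of the model being fitted, write $p$ for the probability that the encoded outcome equals 1. The symbols $\beta_0$, $\beta_1$, and $\beta_2$ are notation introduced here for illustration, not names used in the notebook:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 \cdot \text{Vitamin\_D} + \beta_2 \cdot \text{Group\_Encoded}
$$

Each coefficient is a change in log-odds, so $e^{\beta_1}$ is the factor by which the odds are multiplied for a one-unit increase in `Vitamin_D` with the group held fixed. Note that scikit-learn's `LogisticRegression` applies an L2 penalty by default, which shrinks the estimates somewhat relative to an unpenalized fit.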

In [3]:
# Encode categorical variables
le_group = LabelEncoder()
df['Group_Encoded'] = le_group.fit_transform(df['Group'])  # Control=0, Treatment=1
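# LabelEncoder assigns codes in sorted label order, so with the labels
# 'Improved' and 'Normal' the mapping is Improved=0, Normal=1 and the
# coefficients below describe the log-odds of Normal; check le_outcome.classes_
le_outcome = LabelEncoder()
df['Outcome_Encoded'] = le_outcome.fit_transform(df['Outcome'])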

# Prepare features and target
X = df[['Vitamin_D', 'Group_Encoded']]  # Features
y = df['Outcome_Encoded']  # Target

# Fit logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X, y)

# Print coefficients
print('Logistic Regression Coefficients:')
print(f'- Vitamin_D: {model.coef_[0][0]:.3f}')
print(f'- Group (Treatment): {model.coef_[0][1]:.3f}')

Logistic Regression Coefficients:
- Vitamin_D: 0.045
- Group (Treatment): 0.210
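
Two details are worth checking when reading these coefficients: which outcome label is coded as 1, and what the numbers mean on the odds scale. `LabelEncoder` numbers classes in sorted order, so a sketch like the one below can make the coding explicit. It uses a hypothetical toy `outcome` Series rather than the trial data, then converts the two coefficients reported above into odds ratios with `np.exp`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels standing in for the Outcome column
outcome = pd.Series(['Normal', 'Improved', 'Improved', 'Normal'])

# Sorted label order is what LabelEncoder uses to assign integer codes
sorted_labels = sorted(set(outcome))
alphabetical_codes = {label: code for code, label in enumerate(sorted_labels)}
print(alphabetical_codes)  # {'Improved': 0, 'Normal': 1}

# Explicit alternative: 1 = Improved, 0 = any other label
improved = (outcome == 'Improved').astype(int)
print(improved.tolist())  # [0, 1, 1, 0]

# Odds ratios for the coefficients printed by the notebook
coefs = {'Vitamin_D': 0.045, 'Group (Treatment)': 0.210}
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f'- {name}: odds ratio = {ratio:.3f}')
```

An odds ratio of about 1.046 for `Vitamin_D` means that each additional unit multiplies the odds of the class coded 1 by roughly 1.046, with the group held fixed.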


## Survival Analysis

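Estimate Kaplan-Meier curves for `Time` until `Outcome` = Improved, stratified by `Group`. Improvement is the event, participants without a recorded improvement are treated as censored at `Time`, and each curve shows the estimated proportion who have not yet improved.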

In [4]:
# Prepare data for survival analysis
df['Event'] = (df['Outcome'] == 'Improved').astype(int)  # 1=Improved (event observed), 0=otherwise (treated as censored)

# Initialize Kaplan-Meier fitter
kmf = KaplanMeierFitter()

# Plot survival curves for each group on one shared set of axes
fig, ax = plt.subplots(figsize=(8, 6))
for group in ['Control', 'Treatment']:
    mask = df['Group'] == group
    kmf.fit(df.loc[mask, 'Time'], event_observed=df.loc[mask, 'Event'], label=group)
    kmf.plot_survival_function(ax=ax)

plt.title('Kaplan-Meier Survival Curves by Group')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.show()  # Display survival curves

<Figure size 800x600 with 1 Axes>
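
To make the estimator behind these curves concrete, here is a minimal hand-rolled sketch of the Kaplan-Meier (product-limit) calculation. The function name `kaplan_meier` and the toy `times` and `events` are hypothetical, made up for illustration; `lifelines` remains the tool to use in practice.

```python
import pandas as pd

def kaplan_meier(times, events):
    """Product-limit estimate: a Series of S(t) indexed by each event time."""
    data = pd.DataFrame({'time': times, 'event': events})
    survival = 1.0
    curve = {}
    # Only times with an observed event change the estimate
    for t in sorted(set(data.loc[data['event'] == 1, 'time'])):
        at_risk = int((data['time'] >= t).sum())  # still under observation at t
        n_events = int(((data['time'] == t) & (data['event'] == 1)).sum())
        survival *= 1 - n_events / at_risk
        curve[t] = survival
    return pd.Series(curve, name='S(t)')

# Hypothetical follow-up times in months; event=1 means improvement was observed,
# event=0 means the participant was censored at that time
times = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
print(curve)

# Median taken here as the first time the curve falls to 0.5 or below
below_half = curve[curve <= 0.5]
median_time = below_half.index[0] if len(below_half) else float('inf')
print(f'Median time: {median_time}')
```

With these toy values the curve steps down to 5/6 at month 2, 2/3 at month 3, 4/9 at month 5, and 0 at month 8, so the median is 5 months. Censored participants never trigger a step, but they do count in the at-risk denominator until they leave.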

## Exercise: Extend the Analysis

Modify the logistic regression to include `Time` as a predictor and report the new coefficients. For survival analysis, compute the median survival time for each group.

**Guidance**:
- Add `Time` to `X` in the logistic regression model.
- After fitting `kmf` on each group, read `kmf.median_survival_time_`; it is `inf` if that group's curve never drops to 0.5.

**Answer**:

My extended analysis code and results are as follows:

```python
# Your code here
```

**Logistic Regression Coefficients**:

- Vitamin_D: [Your Result]
- Group (Treatment): [Your Result]
- Time: [Your Result]

**Median Survival Times**:

- Control: [Your Result]
- Treatment: [Your Result]

## Conclusion

You’ve applied logistic regression and survival analysis to model vitamin D trial outcomes, uncovering predictors of improvement and time-to-event patterns.

**Next Steps**: Explore advanced topics in `notebooks/05_advanced/` or revisit earlier data analysis notebooks.

**Resources**:
- [Scikit-Learn Documentation](https://scikit-learn.org/)
- [Lifelines Documentation](https://lifelines.readthedocs.io/)
- Repository: [github.com/ggkuhnle/data-analysis-toolkit-FNS](https://github.com/ggkuhnle/data-analysis-toolkit-FNS)