<a href="https://colab.research.google.com/github/g-e-mm/SupervisedLearning/blob/main/LogisticRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vaccine Usage Prediction

This notebook demonstrates a complete workflow to predict whether respondents have received the H1N1 vaccine using logistic regression. Two estimation methods are used:

- **Maximum Likelihood Estimation (MLE)** using `statsmodels`
- **Stochastic Gradient Descent (SGD)** using `scikit-learn`

The dataset is read from a CSV file into a pandas DataFrame named `data`.

The notebook contains the following sections:
1. Exploratory Data Analysis (EDA)
2. Data Preprocessing
3. Model Training
   - Logistic Regression with MLE
   - Logistic Regression with SGD
4. Model Evaluation
5. Conclusions


## Loading dataset and importing libraries
---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve

# For logistic regression
from sklearn.linear_model import SGDClassifier
import statsmodels.api as sm

# For encoding categorical variables and scaling
from sklearn.preprocessing import StandardScaler, LabelEncoder

import warnings
warnings.filterwarnings('ignore')

sns.set(style="whitegrid")


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = '/content/drive/MyDrive/Data Science/Logistic Regression/h1n1_vaccine_prediction.csv'
data = pd.read_csv(path)

## 1. Exploratory Data Analysis
---

In [None]:
print("First 5 rows of the data:")
display(data.head())

print("Data Information:")
data.info()

print("Summary Statistics:")
display(data.describe())

print("Class distribution:")
print(data['h1n1_vaccine'].value_counts())

print("Missing values per column:")
print(data.isnull().sum())

The exploratory data analysis revealed the presence of missing values across several columns, indicating the need for appropriate imputation strategies during data preprocessing.
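
To judge how much imputation matters, the share of missing values per column is often more informative than raw counts. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative only, not the survey data); the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; values are made up for illustration.
toy = pd.DataFrame({
    "h1n1_worry": [1.0, np.nan, 2.0, 3.0],
    "sex": ["Male", "Female", None, "Female"],
    "h1n1_vaccine": [0, 1, 0, 0],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Class balance of the target, as proportions rather than counts.
class_share = toy["h1n1_vaccine"].value_counts(normalize=True)
print(class_share)
```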

## 2. Data Preprocessing

In this section, I will:
- Handle missing values.
- Encode categorical variables.
- Scale the data.

### 2.1 Handling Missing Values
- For numerical variables, I'll use median imputation.
- For categorical variables, I'll use the mode (most frequent value).

### 2.2 Encoding Categorical Variables
- Ordinal variables are encoded based on the provided order.
- Nominal variables are encoded using one-hot encoding.


In [None]:
data_clean = data.copy()

# Handling missing values
for col in data_clean.columns:
    if data_clean[col].dtype == 'object':
        data_clean[col] = data_clean[col].fillna(data_clean[col].mode()[0])
    else:
        data_clean[col] = data_clean[col].fillna(data_clean[col].median())


ordinal_cols = ['h1n1_worry', 'h1n1_awareness', 'is_h1n1_vacc_effective',
                'is_h1n1_risky', 'sick_from_h1n1_vacc',
                'is_seas_vacc_effective', 'is_seas_risky', 'sick_from_seas_vacc']

for col in ordinal_cols:
    data_clean[col] = pd.to_numeric(data_clean[col], errors='coerce')

# Identify nominal categorical columns (adjust according to your dataset)
nominal_cols = ['age_bracket', 'qualification', 'race', 'sex', 'income_level',
                'marital_status', 'housing_status', 'employment', 'census_msa']

# One-hot encode nominal categorical variables
data_encoded = pd.get_dummies(data_clean, columns=nominal_cols, drop_first=True)

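# Standardize every feature column (zero mean, unit variance);
# the ID and the target are left untouched
cols_to_scale = [col for col in data_encoded.columns if col not in ['unique_id', 'h1n1_vaccine']]
scaler = StandardScaler()
data_encoded[cols_to_scale] = scaler.fit_transform(data_encoded[cols_to_scale])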

print("Data preprocessing completed. Cleaned and encoded data:")
display(data_encoded.head())
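
The cell above computes the imputation values and scaling statistics on the full dataset before the train/test split, so information from the test rows can leak into training. A minimal sketch of the alternative order (split first, then fit the imputer and scaler on the training rows only) is shown below on synthetic numeric data. The array `X_demo` and the pipeline name `numeric_prep` are hypothetical, and this is an illustration, not the workflow used for the results in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic numeric features with roughly 10% missing entries (illustration only).
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Median imputation followed by standardization.
numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Fit on the training rows only, then reuse those statistics on the test rows.
X_tr_prepared = numeric_prep.fit_transform(X_tr)
X_te_prepared = numeric_prep.transform(X_te)

print(X_tr_prepared.mean(axis=0).round(3))  # training means are ~0 by construction
```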


In [None]:
# Separate features and target
X = data_encoded.drop(columns=['unique_id', 'h1n1_vaccine'])
y = data_encoded['h1n1_vaccine']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training data shape:", X_train.shape)
print("Test data shape:", X_test.shape)
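
The `stratify=y` argument keeps the proportion of vaccinated respondents roughly equal in both splits, which matters when the classes are imbalanced. A small sketch on made-up labels (the 20% positive rate here is an assumption for illustration, not the rate in the survey data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 made-up labels with 20% positives.
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep (approximately) the same positive rate as the full data.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```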


## 3. Model Training
---

In [None]:
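# statsmodels does not add an intercept automatically
X_train_sm = sm.add_constant(X_train)

# Fit the logistic regression by maximum likelihood
logit_model = sm.Logit(y_train, X_train_sm)
result = logit_model.fit()

print(result.summary())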

- Dependent Variable: h1n1_vaccine (Binary outcome: vaccinated or not)

- Pseudo R-squared: 0.2592 → Decent explanatory power for a logistic model

- LLR p-value: 0.000 → The model is statistically significant overall

- Converged: Yes → Optimization was successful
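
For reference, the quantities in the summary can be written out. With $p_i$ denoting the predicted probability for respondent $i$, MLE chooses $\beta$ to maximize the log-likelihood:

$$
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right], \qquad p_i = \frac{1}{1 + e^{-x_i^\top \beta}}
$$

The pseudo R-squared reported by `statsmodels` is McFadden's, which compares the fitted log-likelihood $\ell(\hat{\beta})$ with the log-likelihood $\ell_0$ of an intercept-only model:

$$
R^2_{\text{McFadden}} = 1 - \frac{\ell(\hat{\beta})}{\ell_0}
$$

The LLR p-value comes from the likelihood-ratio statistic $2\,(\ell(\hat{\beta}) - \ell_0)$, compared against a chi-squared distribution with as many degrees of freedom as there are predictors.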

**Positive Influences**

| **Feature**                      | **Coefficient** | **P-value** | **Significant** | **Takes Vaccine?** | **Effect**                                                        |
|----------------------------------|-----------------|-------------|-----------------|--------------------|-------------------------------------------------------------------|
| dr_recc_h1n1_vacc                | +0.7842         | 0.000       | ✅              | ✅                 | Doctor's recommendation strongly increases vaccine uptake         |
| is_h1n1_vacc_effective           | +0.6330         | 0.000       | ✅              | ✅                 | Belief that vaccine is effective strongly encourages vaccination  |
| is_h1n1_risky                    | +0.4861         | 0.000       | ✅              | ✅                 | Belief that H1N1 is risky increases the likelihood of vaccination |
| is_health_worker                 | +0.2681         | 0.000       | ✅              | ✅                 | Health workers are more likely to be vaccinated                   |
| age_bracket_65+                  | +0.1851         | 0.000       | ✅              | ✅                 | Elderly people are more likely to vaccinate                       |
| qualification_College Graduate   | +0.0970         | 0.000       | ✅              | ✅                 | Higher education boosts vaccine uptake                            |
| bought_face_mask                 | +0.0575         | 0.002       | ✅              | ✅                 | Cautious behavior correlates positively with vaccine uptake       |
| sex_Male                         | +0.0747         | 0.000       | ✅              | ✅                 | Men are slightly more likely to get vaccinated                    |
| chronic_medic_condition          | +0.0521         | 0.000       | ✅              | ✅                 | Chronic illness increases the likelihood of vaccination           |
| cont_child_undr_6_mnths          | +0.0787         | 0.000       | ✅              | ✅                 | Living with young children increases vaccine uptake               |
| census_msa_Non-MSA               | +0.0466         | 0.020       | ✅              | ✅                 | Rural/less urban areas show slight increase in vaccination        |
| race_White, Hispanic, Other      | positive        | < 0.05      | ✅              | ✅                 | These groups are more likely to vaccinate than the reference group |

**Negative Influences**

| **Feature**                        | **Coefficient** | **P-value** | **Significant** | **Takes Vaccine?** | **Effect**                                                                 |
|-----------------------------------|-----------------|-------------|------------------|---------------------|------------------------------------------------------------------------------|
| avoid_large_gatherings           | -0.1157         | 0.000       | ✅               | ❌                  | Avoidance of gatherings correlates with less likelihood of vaccination      |
| marital_status_Not Married       | -0.0791         | 0.000       | ✅               | ❌                  | Not married people are slightly less likely to get vaccinated               |
| no_of_children                   | -0.0516         | 0.003       | ✅               | ❌                  | More children correlates slightly with less vaccination                     |
| qualification_< 12 Years         | -0.0519         | 0.021       | ✅               | ❌                  | Less education slightly reduces likelihood of vaccination                   |
| dr_recc_seasonal_vacc            | -0.2038         | 0.000       | ✅               | ❌                  | Possibly people follow only one recommendation or view H1N1 differently     |
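
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. Because the features were standardized before fitting, each ratio refers to a one-standard-deviation increase in the feature, not a one-unit change. A short sketch using a few coefficients copied from the tables above:

```python
import math

# Coefficients copied from the summary tables (standardized features).
coefficients = {
    "dr_recc_h1n1_vacc": 0.7842,
    "is_h1n1_vacc_effective": 0.6330,
    "avoid_large_gatherings": -0.1157,
    "dr_recc_seasonal_vacc": -0.2038,
}

# exp(coefficient) = multiplicative change in the odds of vaccination
# per one-standard-deviation increase in the feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<26} odds ratio = {ratio:.2f}")
```

A ratio above 1 means higher odds of vaccination as the feature increases; a ratio below 1 means lower odds.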


In [None]:
sgd_model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42)
sgd_model.fit(X_train, y_train)

print("SGD Logistic Regression training complete.")
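
Conceptually, `SGDClassifier` with `loss='log_loss'` minimizes the same logistic loss as the MLE fit, but it updates the weights one sample at a time instead of solving for the optimum on the full data. The following is a bare-bones sketch of that idea on synthetic data. It is not scikit-learn's implementation (which, by default, also applies L2 regularization and a decaying learning-rate schedule), and `sgd_logistic`, `lr`, and `epochs` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lr=0.05, epochs=20, seed=0):
    """Plain per-sample SGD on the logistic loss (no regularization)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n_samples):  # reshuffle every epoch
            p = sigmoid(X[i] @ w + b)
            error = p - y[i]                  # gradient of the log loss w.r.t. the logit
            w -= lr * error * X[i]
            b -= lr * error
    return w, b

# Synthetic, roughly standardized data generated from known weights.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 2))
true_w = np.array([1.5, -1.0])
y_demo = (rng.random(500) < sigmoid(X_demo @ true_w)).astype(int)

w, b = sgd_logistic(X_demo, y_demo)
accuracy = ((sigmoid(X_demo @ w + b) > 0.5) == y_demo).mean()
print("weights:", w.round(2), "bias:", round(float(b), 2), "train accuracy:", round(float(accuracy), 3))
```

Because each update uses a single noisy gradient, the learned weights fluctuate around the maximum-likelihood solution, which is one reason the SGD and MLE results can differ slightly.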

## 4. Model Evaluation
---

In [None]:
# --- Evaluate MLE Model ---
X_test_sm = sm.add_constant(X_test)
y_pred_prob_mle = result.predict(X_test_sm)
y_pred_mle = (y_pred_prob_mle > 0.5).astype(int)

print("MLE Logistic Regression Evaluation:")
print("Accuracy:", accuracy_score(y_test, y_pred_mle))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_mle))
print("Classification Report:\n", classification_report(y_test, y_pred_mle))

# ROC-AUC and ROC Curve for MLE model
roc_auc_mle = roc_auc_score(y_test, y_pred_prob_mle)
fpr_mle, tpr_mle, _ = roc_curve(y_test, y_pred_prob_mle)

plt.figure(figsize=(6,4))
plt.plot(fpr_mle, tpr_mle, label=f'MLE Logistic Regression (AUC = {roc_auc_mle:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - MLE Logistic Regression')
plt.legend()
plt.show()

- AUC = 0.82: the model separates the two classes well.

- The curve lies above the diagonal, so the model performs better than random guessing.

- The trade-off between true and false positive rates is reasonable across thresholds.
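
One way to read the AUC: it is the probability that a randomly chosen vaccinated respondent receives a higher predicted score than a randomly chosen non-vaccinated one. A small pure-Python sketch of that definition on made-up scores (`pairwise_auc` is a hypothetical helper; `roc_auc_score` computes the same quantity more efficiently):

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.45, 0.60, 0.80, 0.20, 0.30]

auc = pairwise_auc(y_true, scores)
print(f"AUC = {auc:.4f}")  # 13 of 16 pairs are ranked correctly -> 0.8125
```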

**Classification Metrics**

| **Metric**        | **Value** | **Interpretation**                                       |
|-------------------|-----------|----------------------------------------------------------|
| **Accuracy**      | 0.84      | High overall correctness                                 |
| **Precision**     | 0.68      | 68% of respondents predicted as vaccinated actually were |
| **Recall**        | 0.42      | Only 42% of vaccinated respondents are detected          |
| **F1-Score**      | 0.52      | Moderate balance of precision and recall                 |
| **AUC**           | 0.82      | Strong class separation ability                          |

- Recall for vaccinated respondents needs improvement: the model misses more than half of them.
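
Since the 0.5 cut-off is a choice rather than a property of the model, recall can be traded against precision by lowering the threshold. A minimal sketch on made-up probabilities follows (the labels, scores, and the `precision_recall_at` helper are illustrative assumptions); in this notebook the same idea would apply to `y_pred_prob_mle`.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall of the positive class at a given probability cut-off."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted probabilities for ten respondents.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.90, 0.55, 0.40, 0.30, 0.60, 0.35, 0.20, 0.15, 0.10, 0.05]

results = {t: precision_recall_at(y_true, probs, t) for t in (0.5, 0.3, 0.25)}
for t, (precision, recall) in results.items():
    print(f"threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, lowering the threshold raises recall from 0.50 to 1.00. On real data the gain in recall usually comes with more false positives, so the threshold should be chosen on a validation split according to which error is more costly.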

In [None]:
# --- Evaluate SGD Model ---
# Predicted probability of the positive class (available because loss='log_loss')
y_pred_prob_sgd = sgd_model.predict_proba(X_test)[:, 1]
y_pred_sgd = sgd_model.predict(X_test)

print("SGD Logistic Regression Evaluation:")
print("Accuracy:", accuracy_score(y_test, y_pred_sgd))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_sgd))
print("Classification Report:\n", classification_report(y_test, y_pred_sgd))

# ROC-AUC and ROC Curve for SGD model
roc_auc_sgd = roc_auc_score(y_test, y_pred_prob_sgd)
fpr_sgd, tpr_sgd, _ = roc_curve(y_test, y_pred_prob_sgd)

plt.figure(figsize=(6,4))
plt.plot(fpr_sgd, tpr_sgd, label=f'SGD Logistic Regression (AUC = {roc_auc_sgd:.2f})', color='green')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - SGD Logistic Regression')
plt.legend()
plt.show()

- The ROC Curve visualizes the trade-off between the True Positive Rate (Recall) and False Positive Rate at various threshold levels.

- The curve bows well above the diagonal baseline, indicating the model performs significantly better than random guessing.

- AUC = 0.81 confirms that the classifier has a strong discriminatory ability to separate vaccinated vs. non-vaccinated individuals.

- Though slightly lower than MLE Logistic Regression’s AUC (0.82), it’s still quite effective for classification tasks.

| **Metric**        | **Value** | **Interpretation**                                       |
|-------------------|-----------|----------------------------------------------------------|
| **Accuracy**      | 0.83      | High overall correctness                                 |
| **Precision**     | 0.66      | 66% of respondents predicted as vaccinated actually were |
| **Recall**        | 0.43      | Model detects 43% of vaccinated respondents              |
| **F1-Score**      | 0.52      | Moderate balance of precision and recall                 |
| **AUC**           | 0.81      | Strong class separation ability                          |
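
The F1 scores reported for the two models follow directly from their precision and recall, since F1 is the harmonic mean of the two. A quick arithmetic check using the rounded values from the metric tables:

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values reported for each model.
reported = {
    "MLE": {"precision": 0.68, "recall": 0.42},
    "SGD": {"precision": 0.66, "recall": 0.43},
}

f1 = {name: f1_from(m["precision"], m["recall"]) for name, m in reported.items()}
for name, value in f1.items():
    print(f"{name}: F1 = {value:.2f}")
```

Both models land at roughly 0.52, so the small differences in precision and recall between them cancel out in the combined score.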
