## 1. Business Understanding
### 1.1 Business Overview
#### 1.1.0 H1N1 influenza
H1N1 is a subtype of influenza A virus that infects the nose, throat, and lungs. An H1N1 virus caused the deadly "Spanish flu" pandemic of 1918. A different strain caused the 2009 pandemic; it was originally called "swine flu" because it resembled flu viruses that circulate in pigs, and it contained genetic material from avian, swine, and human influenza viruses. That 2009 strain now circulates as a regular seasonal flu virus. H1N1 is contagious and spreads from person to person through coughs and sneezes. Its symptoms resemble those of other types of flu and can include fever, cough, sore throat, body aches, and fatigue.

 The 2009 H1N1 influenza pandemic highlighted the importance of vaccination campaigns in preventing widespread illness. Despite the availability of both H1N1 and seasonal flu vaccines, vaccination rates varied significantly across demographic groups. Factors such as personal health behaviors, medical access, and perceptions of vaccine safety played a major role in whether individuals chose to get vaccinated.

As the world continues to face public health challenges, such as the COVID-19 pandemic, gaining insight into the drivers of vaccine uptake remains critical. By analyzing survey data from the 2009 National H1N1 Flu Survey, we can explore how people’s backgrounds, health conditions, opinions, and behaviors relate to their vaccination decisions.

#### 1.1.1 Seasonal flu vaccine
The seasonal flu vaccine is an annual vaccination that protects against the influenza viruses predicted to be most common in the coming season, so it is updated every year. It typically includes protection against two influenza A viruses (one H1N1 and one H3N2 subtype) and one or two influenza B viruses. It is administered either as an injection (shot) containing inactivated (killed) virus or as a nasal spray containing weakened live virus.


**NOTE: H1N1 is a type of flu, while the seasonal flu vaccine is a tool used to prevent it and other circulating flu strains.**

### 1.2 Problem Statement

* During the 2009 H1N1 influenza pandemic, vaccination was one of the most effective strategies for reducing infection and preventing severe illness. However, not everyone chose to get vaccinated, and uptake varied across demographic and socioeconomic groups. Understanding the factors that influence vaccine adoption is critical for designing effective public health interventions. This project aims to develop a machine learning model that predicts whether an individual received the H1N1 flu vaccine, based on the demographic information, health conditions, behaviors, and opinions collected in the 2009 National H1N1 Flu Survey.



### 1.3 Business Objectives

#### 1.3.0 Main objective
* To develop a model that predicts whether an individual received the H1N1 vaccine.

#### 1.3.1 Specific Objectives
* To identify which age groups, sexes, races, and chronic-condition statuses (`chronic_med_condition`) are associated with receiving the H1N1 vaccine.
* To analyze the demographic, behavioral, and medical characteristics of survey respondents and determine their association with H1N1 vaccine uptake.
* To identify the most influential features affecting vaccine uptake, providing insights that can guide targeted public health interventions and awareness campaigns.

#### 1.3.2 Research Questions
* What demographic factors influence whether an individual receives the H1N1 vaccine?
* What behavioral factors influence whether an individual receives the H1N1 vaccine?
* What medical factors influence whether an individual receives the H1N1 vaccine?
* How can the insights from the predictive model help public health officials design targeted campaigns to increase vaccination rates?

### 1.4 Success criteria
* The project succeeds if we build a model that correctly predicts whether an individual received the H1N1 vaccine.

## 2. Data Understanding

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Load the H1N1 and seasonal flu vaccine survey data (the CSV must be in the working directory)
df1 = pd.read_csv('H1N1_Flu_Vaccines.csv')
df1.head()

FileNotFoundError: [Errno 2] No such file or directory: 'H1N1_Flu_Vaccines.csv'

### Description of Data

* Data format: CSV file (H1N1_Flu_Vaccines.csv).
* Number of records (rows): about 26,000 respondents (the exact count depends on the dataset version).
* Number of fields (columns): 36 features + 2 target labels (h1n1_vaccine, seasonal_vaccine).
* Field identities: Columns represent demographic, behavioral, medical, and opinion-based survey responses (e.g., age_group, sex, h1n1_concern, doctor_recc_h1n1, employment_status).

* Missing values: Some fields (like employment_industry, employment_occupation, health_insurance) have missing data.

* Inconsistent entries: Check for unexpected values (e.g., misspelled categories, out-of-range numbers).

* Imbalanced data: If far more people did not vaccinate compared to those who did, models may get biased toward predicting "not vaccinated."

* Data types: Ensure categorical features are encoded properly (e.g., strings vs. integers).

* Duplicates: Check if any respondents appear more than once.
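
The checks listed above can be collected into one helper so they run in a single call. This is a minimal sketch, assuming a pandas DataFrame; the function name `audit_frame` and the toy rows are illustrative and are not part of the survey data.

```python
import numpy as np
import pandas as pd

def audit_frame(df: pd.DataFrame, target: str) -> dict:
    """Summarise row count, duplicates, missing values, dtypes and class balance."""
    return {
        "n_rows": len(df),
        "n_duplicates": int(df.duplicated().sum()),
        # percentage of missing values per column, worst first
        "missing_pct": (df.isna().mean() * 100).sort_values(ascending=False),
        "dtypes": df.dtypes.astype(str).value_counts().to_dict(),
        # share of each class in the target column
        "class_share": df[target].value_counts(normalize=True),
    }

# Toy stand-in for the survey data (hypothetical values)
toy = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, np.nan, 3.0, 1.0],
    "sex": ["Male", "Female", "Female", None, "Male"],
    "h1n1_vaccine": [0, 0, 1, 0, 0],
})
report = audit_frame(toy, target="h1n1_vaccine")
print(report["missing_pct"])
print(report["class_share"])
```

On the real data the same call would be `audit_frame(df1, target="h1n1_vaccine")`.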

### Vaccine Targets (Dependent Variables)

**h1n1_vaccine → Did the person receive the H1N1 vaccine? (1 = yes, 0 = no).**

**seasonal_vaccine → Did the person receive the seasonal flu vaccine? (1 = yes, 0 = no).**

The remaining columns are the respondent identifier and the predictor (independent) variables:

* respondent_id → Unique ID for each survey participant.
* h1n1_concern → Level of concern about H1N1 flu (e.g., not at all, somewhat, very concerned).
* h1n1_knowledge → Self-rated knowledge about H1N1 flu (low, medium, high).
* behavioral_antiviral_meds → Did they take antiviral medications to prevent flu?
* behavioral_avoidance → Did they avoid close contact with people who had flu-like symptoms?
* behavioral_face_mask → Did they wear a face mask?
* behavioral_wash_hands → Did they wash hands frequently?
* behavioral_large_gatherings → Did they avoid large gatherings?
* behavioral_outside_home → Did they reduce contact with people outside their own household?
* behavioral_touch_face → Did they avoid touching face?
* doctor_recc_h1n1 → Did a doctor recommend the H1N1 vaccine?
* doctor_recc_seasonal → Did a doctor recommend the seasonal flu vaccine?
* chronic_med_condition → Does the person have a chronic medical condition (e.g., diabetes, asthma)?
* child_under_6_months → Does the person have regular close contact with a child under 6 months old? (Babies that young cannot be vaccinated.)
* health_worker → Is the person a health care worker?
* health_insurance → Do they have health insurance coverage?
* opinion_h1n1_vacc_effective → Belief in H1N1 vaccine effectiveness.
* opinion_h1n1_risk → Perceived personal risk of getting H1N1 flu.
* opinion_h1n1_sick_from_vacc → Belief that H1N1 vaccine might cause sickness.
* opinion_seas_vacc_effective → Belief in seasonal flu vaccine effectiveness.
* opinion_seas_risk → Perceived personal risk of getting seasonal flu.
* opinion_seas_sick_from_vacc → Belief that seasonal flu vaccine might cause sickness.
* age_group → Age category of respondent (e.g., 18–34, 35–44, 65+).
* education → Education level (e.g., less than high school, some college, college graduate).
* race → Self-reported race/ethnicity.
* sex → Sex of respondent.
* income_poverty → Household income relative to poverty line.
* marital_status → Married, single, divorced, etc.
* rent_or_own → Housing situation (rent or own).
* employment_status → Employment status (employed, unemployed, retired).
* employment_industry → Industry where respondent works.
* employment_occupation → Occupation of respondent.
* hhs_geo_region → U.S. Department of Health and Human Services region (geographic code).
* census_msa → Metropolitan Statistical Area (urban/rural type).
* household_adults → Number of adults in the household.
* household_children → Number of children in the household.
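
The data dictionary above also tells us which values each coded column may take, which makes the "unexpected values" check concrete. The sketch below is illustrative: the helper `unexpected_values` is hypothetical, and the value ranges in `EXPECTED` (0 to 3 for concern, 0/1 for behaviors, 1 to 5 for opinions) are an assumption about the survey coding that should be confirmed against the actual data.

```python
import numpy as np
import pandas as pd

# Assumed coding of three example columns; verify against the real survey codes
EXPECTED = {
    "h1n1_concern": {0, 1, 2, 3},
    "behavioral_face_mask": {0, 1},
    "opinion_h1n1_risk": {1, 2, 3, 4, 5},
}

def unexpected_values(df, expected):
    """Return {column: values outside the expected set}, ignoring missing values."""
    problems = {}
    for col, allowed in expected.items():
        seen = {float(v) for v in df[col].dropna().unique()}
        extra = seen - allowed
        if extra:
            problems[col] = extra
    return problems

# Toy frame with one deliberately invalid entry
toy = pd.DataFrame({
    "h1n1_concern": [0, 3, 2, np.nan],
    "behavioral_face_mask": [0, 1, 2, 0],   # 2 is outside {0, 1}
    "opinion_h1n1_risk": [1, 5, 4, 2],
})
problems = unexpected_values(toy, EXPECTED)
print(problems)
```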

In [None]:
df1.info()

In [None]:
df1.columns

In [None]:
df1.isnull().sum()/df1.shape[0]*100

In [None]:
df1['h1n1_vaccine'].unique()

In [None]:
df1['employment_occupation'].unique()

In [None]:
df1['employment_industry'].unique()

In [None]:
df1.duplicated().sum()

In [None]:
# for i in [ 'h1n1_knowledge',
#        'behavioral_antiviral_meds', 'behavioral_avoidance',
#        'behavioral_face_mask', 'behavioral_wash_hands',
#        'behavioral_large_gatherings', 'behavioral_outside_home',
#        'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
#        'chronic_med_condition', 'child_under_6_months', 'health_worker',
#         'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
#        'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
#        'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'household_adults',
#        'household_children']:
#     sns.scatterplot(data=df1, x=i, y='h1n1_vaccine')
#     plt.show()

In [None]:
df1.select_dtypes(include='number').columns


In [None]:
numeric_cols = df1.select_dtypes(include='number').columns
n = len(numeric_cols)

# Create subplots (adjust rows/cols depending on number of features)
fig, axes = plt.subplots(nrows=(n // 3) + 1, ncols=3, figsize=(15, 12))
axes = axes.flatten()

for i, col in enumerate(numeric_cols):
    sns.boxplot(y=df1[col], ax=axes[i])
    axes[i].set_title(f"Boxplot of {col}")

# Remove empty subplots if any
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

In [None]:
# df1_new is only created in Section 3, so the checks in this section use the full frame df1
df1.value_counts()

In [None]:
# checking class imbalance
df1['h1n1_vaccine'].value_counts(normalize=False)


In [None]:
df1.columns

In [None]:
df1.info()

In [None]:
df1.shape

In [None]:
df1.columns

In [None]:
df1['hhs_geo_region'].unique()

In [None]:
# list the categories in each text column to decide which encoding method to use
categorical_cols = df1.select_dtypes(include='object').columns
for col in categorical_cols:
    print(f"Column: {col}")
    print(df1[col].value_counts())
    print("\n")
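
The value counts help separate columns with a natural order (for example `age_group`) from purely nominal ones (for example `race`). One option, sketched below, is to map ordered columns to integer codes and one-hot encode only the nominal ones. The label spellings in `AGE_ORDER` are an assumption and should be checked against the printed output; the notebook itself one-hot encodes every text column, so this is an alternative rather than the method used here.

```python
import pandas as pd

# Assumed label spellings; confirm them against the value_counts() output
AGE_ORDER = ["18 - 34 Years", "35 - 44 Years", "45 - 54 Years",
             "55 - 64 Years", "65+ Years"]

toy = pd.DataFrame({
    "age_group": ["65+ Years", "18 - 34 Years", "45 - 54 Years"],
    "race": ["White", "Black", "Hispanic"],
})

# Ordered column -> integer codes that preserve the ranking
toy["age_group_code"] = pd.Categorical(
    toy["age_group"], categories=AGE_ORDER, ordered=True
).codes

# Nominal column -> one indicator column per category
encoded = pd.get_dummies(toy.drop(columns="age_group"), columns=["race"], dtype=int)
print(encoded)
```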
    

## 3. Data Preparation


In [None]:
df1.head()

In [None]:
df1['household_children'].value_counts()

In [None]:
df1.info()

* Drop irrelevant columns and the columns with the most missing values
* Impute the columns with fewer missing values
* Handle class imbalance
* One-hot encode the categorical variables
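
The imputation and encoding steps in this plan can also be expressed as a single scikit-learn preprocessing object, which keeps the fitted modes and categories together. This is a minimal sketch on a toy frame, not the notebook's actual implementation; the two column lists are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one numeric code column and one text column (hypothetical rows)
toy = pd.DataFrame({
    "h1n1_concern": [1.0, np.nan, 2.0, 1.0],
    "education": ["College Graduate", np.nan, "12 Years", "12 Years"],
})
numeric_cols = ["h1n1_concern"]
categorical_cols = ["education"]

preprocess = ColumnTransformer(
    [
        # numeric survey codes: fill gaps with the most frequent value
        ("num", SimpleImputer(strategy="most_frequent"), numeric_cols),
        # text categories: fill gaps, then one-hot encode (first level dropped)
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(drop="first")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_prepared = preprocess.fit_transform(toy)
print(X_prepared)
```

Fitting the transformer on the training rows only and reusing it on the test rows keeps test information out of the imputed values.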

In [None]:
# select the columns to use in the analysis
df1_new = df1[['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds',
       'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'h1n1_vaccine']].copy()


**We dropped `employment_occupation`, `employment_industry`, and `health_insurance` because each has over 40% missing values.** The selection above also leaves out `respondent_id` and the second target, `seasonal_vaccine`.
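
The 40% rule can be applied programmatically instead of listing the columns by hand. A minimal sketch, assuming the cutoff is a simple fraction of missing rows; the helper name `columns_above_missing` and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def columns_above_missing(df, threshold=0.40):
    """Names of columns whose share of missing values exceeds `threshold`."""
    share = df.isna().mean()
    return share[share > threshold].index.tolist()

# Toy frame: one column is mostly missing, one only slightly
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, np.nan, 0.0],  # 60% missing
    "h1n1_concern": [1.0, 2.0, np.nan, 0.0, 3.0],            # 20% missing
    "h1n1_vaccine": [0, 1, 0, 0, 1],
})
to_drop = columns_above_missing(toy)
trimmed = toy.drop(columns=to_drop)
print(to_drop)
```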

In [None]:
df1_new.columns

In [None]:
df1_new.isnull().sum()

In [None]:
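# impute missing values in every column (numeric codes and text categories) with the column mode
# assigning back avoids the chained inplace fillna, which newer pandas versions warn about
for col in df1_new.columns[df1_new.isnull().any()]:
    df1_new[col] = df1_new[col].fillna(df1_new[col].mode()[0])
df1_new.isnull().sum()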



In [None]:
df1_new.head()

In [None]:
# one-hot encode the categorical (text) columns, then cast every column to int
df1_encoded = pd.get_dummies(df1_new, drop_first=True).astype(int)
df1_encoded.head()



In [None]:
df1_encoded.dtypes

In [None]:
# checking for class imbalance
df1_encoded['h1n1_vaccine'].value_counts()

In [None]:
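# handling class imbalance with SMOTE, which oversamples the minority class
# note: resampling before the train/test split puts synthetic rows in the test set,
# so the test scores reported later are likely optimistic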
from imblearn.over_sampling import SMOTE
X = df1_encoded.drop('h1n1_vaccine', axis=1)
y = df1_encoded['h1n1_vaccine']
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
X_res.shape, y_res.shape


In [None]:
# distribution of target variable after resampling
y_res.value_counts()
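
Oversampling is usually applied to the training portion only, so that the test set keeps the real class ratio. The sketch below shows that order of operations on synthetic data. It uses plain random oversampling from `sklearn.utils.resample` as a simpler stand-in for SMOTE; with imbalanced-learn, the equivalent step would be calling `smote.fit_resample` on the training split instead of on the full data. All names ending in `_demo`, `_tr`, or `_te` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for X and y (roughly 80% / 20%)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# 1. Split first, stratifying so both parts keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Oversample the minority class (label 1) in the training part only
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
n_needed = len(majority)
minority_up = resample(minority, replace=True, n_samples=n_needed, random_state=42)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate([np.zeros(n_needed, dtype=int),
                           np.ones(n_needed, dtype=int)])

print(np.bincount(y_tr_bal))  # training classes are now balanced
print(np.bincount(y_te))      # test set keeps the original imbalance
```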

## 4. Modeling: Logistic Regression

In [None]:
# importing libraries for model building
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)

In [None]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
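
One of the objectives is to find the most influential features, and a fitted logistic regression exposes them through its coefficients. The sketch below uses synthetic data and hypothetical feature names; on the notebook's model the same idea would read `clf.coef_[0]` together with `X_train.columns`. Coefficient sizes are only comparable when the features share a similar scale.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded survey features (hypothetical names)
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, random_state=42)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# exp(coefficient) is the multiplicative change in the odds per unit increase
odds = pd.DataFrame(
    {"coefficient": model.coef_[0], "odds_ratio": np.exp(model.coef_[0])},
    index=feature_names,
)
# Rank features by the absolute size of their coefficient
odds = odds.reindex(odds["coefficient"].abs().sort_values(ascending=False).index)
print(odds)
```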


## Evaluation: Logistic Regression

In [None]:
acc = accuracy_score(y_test, y_pred)
acc

In [None]:
# using Recall, Precision, F1-score for model evaluation
print(classification_report(y_test, y_pred))
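
Precision, recall, and F1 depend on the default 0.5 decision threshold. Threshold-free scores such as ROC AUC and average precision can complement them, especially with imbalanced classes. A minimal sketch on synthetic data, with illustrative variable names:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the survey features
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class (column 1), not the hard 0/1 label
proba = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
ap = average_precision_score(y_te, proba)
print(f"ROC AUC: {auc:.3f}  Average precision: {ap:.3f}")
```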


In [None]:
# CHECKING FOR OVERFITTING
print("Training Accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("Testing Accuracy:", accuracy_score(y_test, y_pred))
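
A single train/test comparison can be noisy. Cross-validation gives several estimates and shows how much the score varies between folds. A minimal sketch on synthetic data; the fold count and the F1 scoring choice are assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared features and target
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```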

In [None]:
# using confusion matrix for model evaluation
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(12,10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
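
The four cells of the confusion matrix are enough to recompute precision and recall by hand, which helps when reading the heatmap. A small worked example with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]] for labels 0 and 1
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of those predicted vaccinated, how many were
recall = tp / (tp + fn)     # of those actually vaccinated, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(tn, fp, fn, tp)
print(precision, recall, f1)
```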

## Modeling: Decision Tree

In [None]:
# developing a decision tree model
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=42, criterion='entropy', max_depth=5)



In [None]:
dt_clf.fit(X_train, y_train)    
y_dt_pred = dt_clf.predict(X_test)
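
A fitted decision tree also ranks features through `feature_importances_`, which gives a second view on which survey answers matter most. The sketch below uses synthetic data and hypothetical feature names; on the notebook's model the inputs would be `dt_clf.feature_importances_` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded survey features (hypothetical names)
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=42)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=5,
                              random_state=42).fit(X_demo, y_demo)

# Importances are normalised to sum to 1; larger means more impurity reduction
importances = pd.Series(tree.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```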

## Evaluation: Decision Tree

In [None]:
acc_dt = accuracy_score(y_test, y_dt_pred)
acc_dt

In [None]:
# CHECKING FOR OVERFITTING
print("Training Accuracy:", accuracy_score(y_train, dt_clf.predict(X_train)))
print("Testing Accuracy:", accuracy_score(y_test, y_dt_pred))
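
The gap between training and testing accuracy depends strongly on `max_depth`. Comparing a few depths side by side shows where the tree starts to memorise the training data. A minimal sketch on synthetic data; the list of depths is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42)

results = []
for depth in [2, 5, 10, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth,
                                  random_state=42)
    tree.fit(X_tr, y_tr)
    results.append((depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te)))

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth}: train={train_acc:.3f}  test={test_acc:.3f}")
```

A training score that keeps rising while the test score stalls or falls is the usual sign of overfitting.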


In [None]:
# using confusion matrix for model evaluation for decision tree
cm_dt = confusion_matrix(y_test, y_dt_pred)
plt.figure(figsize=(12,10))
sns.heatmap(cm_dt, annot=True, fmt='d', cmap='Greens', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Decision Tree)')
plt.show()


In [None]:
# checking for correlation for numerical columns
# plt.figure(figsize=(10,8))
# sns.heatmap(df1_new.corr(), annot=True, cmap='coolwarm')
# plt.title('Correlation Matrix')


In [None]:
# s = df1.select_dtypes(include='number').corr()
# s

In [None]:

# plt.figure(figsize=(25,15))
# sns.heatmap(s, annot=True, cmap='coolwarm')
