Multilabel Classification Model for Vaccine Prediction

This notebook produces submission.csv, which contains the predicted probabilities for the xyz and seasonal flu vaccines.

First, we need to load both the training and test datasets.


In [14]:
import pandas as pd

# Load the datasets
train_data = pd.read_csv('training_set_features.csv')
test_data = pd.read_csv('test_set_features.csv')


# Display the first few rows of each dataset
print(train_data.head())
print(test_data.head())


   respondent_id  xyz_concern  xyz_knowledge  behavioral_antiviral_meds  \
0              0          1.0            0.0                        0.0   
1              1          3.0            2.0                        0.0   
2              2          1.0            1.0                        0.0   
3              3          1.0            1.0                        0.0   
4              4          2.0            1.0                        0.0   

   behavioral_avoidance  behavioral_face_mask  behavioral_wash_hands  \
0                   0.0                   0.0                    0.0   
1                   1.0                   0.0                    1.0   
2                   1.0                   0.0                    0.0   
3                   1.0                   0.0                    1.0   
4                   1.0                   0.0                    1.0   

   behavioral_large_gatherings  behavioral_outside_home  \
0                          0.0                      1.0  

Checking both datasets for missing values and filling them: numeric columns with the median, non-numeric columns with the mode.


In [15]:
# Check for missing values in training data
print(train_data.isnull().sum())

# Check for missing values in test data
print(test_data.isnull().sum())

# Fill missing values for numeric columns with median
numeric_cols = train_data.select_dtypes(include=['number']).columns
train_data[numeric_cols] = train_data[numeric_cols].fillna(train_data[numeric_cols].median())
test_data[numeric_cols] = test_data[numeric_cols].fillna(test_data[numeric_cols].median())

# Fill missing values for non-numeric columns with mode
non_numeric_cols = train_data.select_dtypes(exclude=['number']).columns
train_data[non_numeric_cols] = train_data[non_numeric_cols].fillna(train_data[non_numeric_cols].mode().iloc[0])
test_data[non_numeric_cols] = test_data[non_numeric_cols].fillna(test_data[non_numeric_cols].mode().iloc[0])


respondent_id                      0
xyz_concern                       92
xyz_knowledge                    116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_xyz                 2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_xyz_vacc_effective       391
opinion_xyz_risk                 388
opinion_xyz_sick_from_vacc       395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
m
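The cell above fills the test set with the test set's own medians and modes. A common alternative is to compute the fill values once, from the training data, and reuse them for both datasets so that both are imputed identically. The sketch below illustrates that variant on toy frames; `train_toy`, `test_toy`, and their values are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_data / test_data (hypothetical values).
train_toy = pd.DataFrame({
    "xyz_concern": [1.0, 3.0, np.nan, 2.0],
    "education": ["College Graduate", None, "12 Years", "College Graduate"],
})
test_toy = pd.DataFrame({
    "xyz_concern": [np.nan, 0.0],
    "education": [None, "12 Years"],
})

# Compute the fill values once, from the training data only.
numeric_cols = train_toy.select_dtypes(include=["number"]).columns
non_numeric_cols = train_toy.select_dtypes(exclude=["number"]).columns
fill_values = {
    **train_toy[numeric_cols].median().to_dict(),
    **train_toy[non_numeric_cols].mode().iloc[0].to_dict(),
}

# Apply the same fill values to both frames.
train_filled = train_toy.fillna(fill_values)
test_filled = test_toy.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Whether this changes the final score depends on how different the two datasets are; it mainly keeps the preprocessing consistent.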

Convert categorical features into numerical values using one-hot encoding for both training and test datasets.


In [16]:
# Categorical columns to one-hot encode (same list for both datasets)
categorical_cols = ['age_group', 'education', 'race', 'sex', 'income_poverty',
                    'marital_status', 'rent_or_own', 'employment_status',
                    'hhs_geo_region', 'census_msa', 'employment_industry',
                    'employment_occupation']

# One-hot encode categorical variables for training data
train_data = pd.get_dummies(train_data, columns=categorical_cols, drop_first=True)

# One-hot encode categorical variables for test data
test_data = pd.get_dummies(test_data, columns=categorical_cols, drop_first=True)
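Encoding the two datasets separately only yields matching columns if every category appears in both. If a category is missing from one of them, the resulting frames have different columns and the model cannot score the test set. One way to guard against this, sketched here on made-up toy frames, is to reindex the encoded test frame to the training feature columns.

```python
import pandas as pd

# Toy frames (hypothetical values): the test frame lacks "Male" and
# "Other or Multiple", so its encoded columns differ from the training ones.
train_toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "Other or Multiple"],
})
test_toy = pd.DataFrame({
    "sex": ["Female", "Female"],
    "race": ["White", "Black"],
})

train_enc = pd.get_dummies(train_toy, columns=["sex", "race"], drop_first=True)
test_enc = pd.get_dummies(test_toy, columns=["sex", "race"], drop_first=True)

# Reindex the test frame to the training columns: indicator columns missing
# from the test frame are added and filled with 0, and columns that exist
# only in the test frame are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
print(list(test_aligned.columns))
```

In this notebook the same idea would apply to the feature columns only, that is, before the label columns are merged into train_data.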


Loading the labels, merging them with the training features, and then separating the features and labels.


In [17]:
# Load the labels
train_labels = pd.read_csv('training_set_labels.csv')

# Merge features and labels on the shared key column, respondent_id
train_data = pd.merge(train_data, train_labels, on='respondent_id')

# Define features and labels
X_train = train_data.drop(columns=['respondent_id', 'xyz_vaccine', 'seasonal_vaccine'])
y_train = train_data[['xyz_vaccine', 'seasonal_vaccine']]

# Define test features
X_test = test_data.drop(columns=['respondent_id'])
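The merge above assumes that each respondent_id appears exactly once in both the features and the labels. A minimal sketch of how that assumption could be checked uses the `validate` argument of `pd.merge`, which raises `pandas.errors.MergeError` when keys are duplicated. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy frames standing in for the features and labels (hypothetical values).
features_toy = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "xyz_concern": [1.0, 3.0, 1.0],
})
labels_toy = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "xyz_vaccine": [0, 0, 1],
    "seasonal_vaccine": [0, 1, 0],
})

# validate="one_to_one" makes pandas raise MergeError if either side
# contains a duplicated key.
merged = pd.merge(features_toy, labels_toy, on="respondent_id",
                  how="inner", validate="one_to_one")

# An inner merge silently drops unmatched rows, so compare row counts too.
assert len(merged) == len(features_toy), "some respondents have no labels"
print(merged)
```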


Training a Random Forest classifier wrapped in MultiOutputClassifier, which fits one forest per label.


In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Initialize the model
model = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))

# Train the model
model.fit(X_train, y_train)


We will split the training data into training and validation sets to evaluate the model's performance.


In [19]:
from sklearn.model_selection import train_test_split

# Split the data
X_train_split, X_valid, y_train_split, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
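The split above is purely random. If the label combinations are imbalanced, one optional refinement is to stratify on the combination of both labels so that the validation set keeps roughly the same label mix. The sketch below shows the idea on toy data; the `combo` helper column is an assumption introduced for illustration and is not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 40 rows, 10 for each of the 4 label combinations.
y_toy = pd.DataFrame({
    "xyz_vaccine":      [0, 0, 0, 0, 1, 1, 1, 1] * 5,
    "seasonal_vaccine": [0, 0, 1, 1, 0, 0, 1, 1] * 5,
})
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# Combine both labels into a single key such as "0_1" to stratify on.
combo = y_toy["xyz_vaccine"].astype(str) + "_" + y_toy["seasonal_vaccine"].astype(str)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=combo
)

# Label combinations present in the validation split.
print(y_va.value_counts())
```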


Training the model on the split training data and evaluating it on the validation data using ROC AUC score.


In [20]:
# Refit the model on the training split only (this replaces the earlier fit on the full training data)
model.fit(X_train_split, y_train_split)

# Make predictions
y_valid_pred_proba = model.predict_proba(X_valid)

# Extract the probabilities for each label
xyz_vaccine_valid_proba = y_valid_pred_proba[0][:, 1]
seasonal_vaccine_valid_proba = y_valid_pred_proba[1][:, 1]

# Evaluate the model using ROC AUC score
from sklearn.metrics import roc_auc_score

# Calculate ROC AUC scores
roc_auc_xyz = roc_auc_score(y_valid['xyz_vaccine'], xyz_vaccine_valid_proba)
roc_auc_seasonal = roc_auc_score(y_valid['seasonal_vaccine'], seasonal_vaccine_valid_proba)

# Calculate the mean ROC AUC score
mean_roc_auc = (roc_auc_xyz + roc_auc_seasonal) / 2

print(f'Mean ROC AUC: {mean_roc_auc}')


Mean ROC AUC: 0.8380889922596537
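The indexing `[0][:, 1]` used above works because `MultiOutputClassifier.predict_proba` returns a list with one array per label, each of shape (n_samples, 2), where column 1 holds the probability of the positive class. The sketch below illustrates this on synthetic data (not the survey data) and also shows that `roc_auc_score` with `average="macro"` on the stacked probabilities gives the same value as averaging the two per-label scores by hand.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.multioutput import MultiOutputClassifier

# Synthetic two-label data (hypothetical; not the survey data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.column_stack([
    (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int),
    (X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int),
])

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=20, random_state=42))
model.fit(X[:200], y[:200])

# One (n_samples, 2) array per label; column 1 is P(label == 1).
proba_list = model.predict_proba(X[200:])
scores = np.column_stack([p[:, 1] for p in proba_list])

# Manual mean of per-label ROC AUC, as in the cell above.
per_label = [roc_auc_score(y[200:, i], scores[:, i]) for i in range(2)]
manual_mean = sum(per_label) / 2

# Equivalent single call on the multilabel indicator matrix.
macro = roc_auc_score(y[200:], scores, average="macro")

print(len(proba_list), proba_list[0].shape)
print(manual_mean, macro)
```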


Generating the predictions for the test dataset.


In [21]:
# Make predictions on the test dataset
# (the model was last fitted on the 80% training split in In [20])
y_test_pred_proba = model.predict_proba(X_test)

# Extracting the probabilities for each label
xyz_vaccine_test_proba = y_test_pred_proba[0][:, 1]
seasonal_vaccine_test_proba = y_test_pred_proba[1][:, 1]


Preparing the submission file in the required format with respondent_id, xyz_vaccine, and seasonal_vaccine.


In [22]:
# Create submission DataFrame
submission = pd.DataFrame({
    'respondent_id': test_data['respondent_id'],
    'xyz_vaccine': xyz_vaccine_test_proba,
    'seasonal_vaccine': seasonal_vaccine_test_proba
})

# Save to CSV
submission.to_csv('submission.csv', index=False)
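Before uploading, it can help to sanity-check the submission frame. The sketch below uses a hypothetical helper, `check_submission`, that verifies the column names, the absence of missing values and duplicate ids, and that the probabilities lie in [0, 1]. The helper and the toy values are illustrative assumptions, not part of the notebook or of any official validator.

```python
import pandas as pd

EXPECTED_COLUMNS = ["respondent_id", "xyz_vaccine", "seasonal_vaccine"]

def check_submission(df: pd.DataFrame) -> bool:
    """Hypothetical helper: raise ValueError if the frame looks malformed."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if df.isnull().any().any():
        raise ValueError("submission contains missing values")
    if df["respondent_id"].duplicated().any():
        raise ValueError("duplicate respondent_id values")
    proba = df[["xyz_vaccine", "seasonal_vaccine"]]
    if not ((proba >= 0) & (proba <= 1)).all().all():
        raise ValueError("probabilities must lie in [0, 1]")
    return True

# Toy submission with made-up ids and probabilities.
submission_toy = pd.DataFrame({
    "respondent_id": [100, 101, 102],
    "xyz_vaccine": [0.12, 0.80, 0.33],
    "seasonal_vaccine": [0.45, 0.91, 0.07],
})

print(check_submission(submission_toy))
```

In the notebook, the same check could be called on `submission` just before `to_csv`.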
