# Vector Borne Disease Classification

This notebook predicts the vector-borne disease (the prognosis) from a set of symptoms.
The list of disease classes is:
- Chikungunya
- Dengue
- Zika
- Yellow Fever
- Rift Valley Fever
- West Nile Fever
- Malaria
- Tungiasis
- Japanese Encephalitis
- Plague
- Lyme Disease

[Vectors](https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases#:~:text=and%20community%20mobilisation.-,Vectors,-Vectors%20are%20living) are living organisms that can transmit infectious pathogens between humans, or from animals to humans. Many of these vectors are bloodsucking insects, which ingest disease-producing microorganisms during a blood meal from an infected host (human or animal) and later transmit it into a new host, after the pathogen has replicated. Often, once a vector becomes infectious, they are capable of transmitting the pathogen for the rest of their life during each subsequent bite/blood meal.

[Vector-borne diseases](https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases#:~:text=bite/blood%20meal.-,Vector%2Dborne%20diseases,-Vector%2Dborne%20diseases) are human illnesses caused by parasites, viruses and bacteria that are transmitted by vectors. Every year there are more than 700,000 deaths from diseases such as malaria, dengue, schistosomiasis, human African trypanosomiasis, leishmaniasis, Chagas disease, yellow fever, Japanese encephalitis and onchocerciasis.

## Resources
- [Reference Kaggle Challenge](https://www.kaggle.com/competitions/playground-series-s3e13)
- [EDA Inspirational Notebook](https://www.kaggle.com/code/sergiosaharovskiy/ps-s3e13-2023-eda-and-submission)

In [None]:
# Import Standard Libraries
import os

import mlflow

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path
from colorama import Style, Fore

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Set Pandas Options
pd.set_option('display.max_columns', 500)

In [None]:
# Define Seaborn theme parameters
theme_parameters =  {
    'axes.spines.right': False,
    'axes.spines.top': False,
    'grid.alpha':0.3,
    'figure.figsize': (16, 6),
    'font.family': 'Andale Mono',
    'axes.titlesize': 24,
    'figure.facecolor': '#E5E8E8',
    'axes.facecolor': '#E5E8E8'
}

# Set the theme
sns.set_theme(style='whitegrid',
              palette=sns.color_palette('deep'), 
              rc=theme_parameters)

In [None]:
# Define Colors
black = Style.BRIGHT + Fore.BLACK
magenta = Style.BRIGHT + Fore.MAGENTA
red = Style.BRIGHT + Fore.RED
blue = Style.BRIGHT + Fore.BLUE
reset_colors = Style.RESET_ALL

# Read Data

In [None]:
# Switch flag for Kaggle Cloud
kaggle = False

In [None]:
# Read training data
if kaggle:
    
    # Read data from Kaggle FS
    train_data = pd.read_csv('/kaggle/input/playground-series-s3e13/train.csv')
    test_data = pd.read_csv('/kaggle/input/playground-series-s3e13/test.csv')
    
else:
    
    # Define local data file paths
    train_data_file_path = Path(os.path.abspath('')).parents[1] / 'data' / 'S3E13' / 'vector_borne_disease_train.csv'
    test_data_file_path = Path(os.path.abspath('')).parents[1] / 'data' / 'S3E13' / 'vector_borne_disease_test.csv'
    train_data_original_file_path = Path(os.path.abspath('')).parents[1] / 'data' / 'S3E13' / 'vector_borne_disease_original_train.csv'
    test_data_original_file_path = Path(os.path.abspath('')).parents[1] / 'data' / 'S3E13' / 'vector_borne_disease_original_test.csv'
    
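    # Read competition data
    train_data = pd.read_csv(train_data_file_path)
    test_data = pd.read_csv(test_data_file_path)
    
    # Read original data (from the original file paths, not the competition ones)
    train_original_data = pd.read_csv(train_data_original_file_path)
    test_original_data = pd.read_csv(test_data_original_file_path)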

In [None]:
train_data.head()

In [None]:
train_data.info()

# Exploratory Data Analysis (EDA)

## Shape Information

In [None]:
# Print shapes information
print(f'{blue}Data Shapes:'
      f'{blue}\n- Train Data           -> {red}{train_data.shape}'
      f'{blue}\n- Test Data            -> {red}{test_data.shape}'
      f'{blue}\n- Train Original Data  -> {red}{train_original_data.shape}'
      f'{blue}\n- Test Original Data   -> {red}{test_original_data.shape}\n')

## Null Values Information

In [None]:
# Print the number of columns containing at least one null value
print(f'{blue}Columns with Null Values:'
      f'{blue}\n- Train Data           -> {red}{train_data.isnull().any().sum()}'
      f'{blue}\n- Test Data            -> {red}{test_data.isnull().any().sum()}'
      f'{blue}\n- Train Original Data  -> {red}{train_original_data.isnull().any().sum()}'
      f'{blue}\n- Test Original Data   -> {red}{test_original_data.isnull().any().sum()}\n')
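
Note that `isnull().any().sum()` counts the columns that contain at least one null, not the null cells themselves. A minimal sketch on a toy frame shows the difference (the `toy` DataFrame and its column names are illustrative assumptions, not part of the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two nulls in 'a', one in 'b', none in 'c'
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [0.0, 1.0, np.nan],
    'c': [1.0, 0.0, 1.0],
})

# Number of columns that contain at least one null value
columns_with_nulls = int(toy.isnull().any().sum())

# Total number of null cells across the whole frame
total_null_cells = int(toy.isnull().sum().sum())

print(f'Columns with nulls: {columns_with_nulls}')
print(f'Total null cells:   {total_null_cells}')
```

If the total number of missing cells is the quantity of interest, `isnull().sum().sum()` is the call to use.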

## Features Distributions

In [None]:
# Plot the KDE of each feature
figure, ax = plt.subplots(13, 5, figsize=(16, 40))
ax = ax.flatten()

# Fetch the data to plot (exclude the 'id' and 'prognosis' columns)
for index, column_name in enumerate(train_data.columns[1:-1]):
    
    # Plot data
    sns.kdeplot(data=train_data[column_name],
                label='Train',
                ax=ax[index])
    
    sns.kdeplot(data=test_data[column_name],
                label='Test',
                ax=ax[index])
    
    sns.kdeplot(data=train_original_data[column_name], 
                label='Original Train', 
                ax=ax[index])
    
    ax[index].set_title(column_name, fontsize=14)
    
    ax[index].tick_params(labelrotation=45)
    
    # Retrieve legend information
    handles = ax[index].get_legend_handles_labels()[0]
    labels = ax[index].get_legend_handles_labels()[1]
    ax[index].legend().remove()
    
# Remove the empty subplot
figure.delaxes(ax[-1])

# Set the legend
figure.legend(handles, 
              labels, 
              loc='upper center', 
              bbox_to_anchor=(0.5, 1.03), 
              fontsize=12,
              ncol=3)

plt.tight_layout()

- Feature distributions look the same across the Train and Test sets (both competition and original)
- A few features have skewed distributions
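
The visual comparison can be complemented with a numeric check. The sketch below is a rough illustration, assuming the symptom columns are binary (0/1) indicators. It uses small synthetic frames (`toy_train`, `toy_test` and the feature names are hypothetical) to compute the per-feature gap between train and test means, together with the skewness of each feature:

```python
import numpy as np
import pandas as pd

# Synthetic binary features drawn with the same probabilities for both sets
rng = np.random.default_rng(42)
feature_names = ['symptom_a', 'symptom_b', 'symptom_c']
probabilities = [0.5, 0.1, 0.9]

toy_train = pd.DataFrame(rng.binomial(1, probabilities, size=(500, 3)),
                         columns=feature_names)
toy_test = pd.DataFrame(rng.binomial(1, probabilities, size=(300, 3)),
                        columns=feature_names)

# Per-feature summary: means in both sets and skewness of the train set
summary = pd.DataFrame({
    'train_mean': toy_train.mean(),
    'test_mean': toy_test.mean(),
    'train_skew': toy_train.skew(),
})

# Absolute gap between train and test means (small gap -> similar distributions)
summary['abs_mean_gap'] = (summary['train_mean'] - summary['test_mean']).abs()

print(summary.round(3))
```

For a binary feature the mean is the share of ones, so a small `abs_mean_gap` suggests matching distributions, while a strongly positive or negative `train_skew` flags a feature that is rarely or almost always present.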

## Label Classes Distribution

In [None]:
# Plot the count of each 'prognosis' class
figure, ax = plt.subplots(1, 2, figsize=(16, 12))
ax = ax.flatten()

# Plot 'prognosis' count plot for competition data (horizontal bars)
sns.countplot(data=train_data, 
              y='prognosis', 
              order=train_data['prognosis'].value_counts().index,
              ax=ax[0])

# Plot 'prognosis' count plot for original data (horizontal bars)
sns.countplot(data=train_original_data, 
              y='prognosis', 
              order=train_original_data['prognosis'].value_counts().index,
              ax=ax[1])

# Set plot titles
ax[0].set_title('Train Data', fontsize=14)
ax[1].set_title('Original Train Data', fontsize=14)
    
plt.tight_layout()

- The label class distribution is identical in the competition and original data

## Symptoms per Prognosis Distributions

In [None]:
# Plot the symptoms' values as a heatmap for each prognosis
figure, ax = plt.subplots(3, 4, figsize=(16, 12))
ax = ax.flatten()

# Fetch prognosis
for index, prognosis in enumerate(train_data['prognosis'].unique()):
    
    # Plot symptoms' values per prognosis
    ax[index].imshow(train_data[train_data['prognosis'] == prognosis].iloc[:, 1:-1].values.T, cmap='Spectral')
    
    # Set y ticks and labels (fixing the ticks first avoids the Matplotlib warning)
    ax[index].set_yticks(range(len(train_data.columns[1:-1])))
    ax[index].set_yticklabels(train_data.columns[1:-1].tolist(), 
                              fontdict={'fontsize': 10})
    
    # Set titles
    ax[index].set_title(prognosis, fontsize=14)
    
# Remove the empty subplot
figure.delaxes(ax[-1])
    
plt.tight_layout()

# Data Preparation

## Train & Test Split

In [None]:
# Define X and y for the training set
X = train_data.iloc[:, 1:-1]
y = train_data.iloc[:, -1]

In [None]:
# Split training data into train and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
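
Since this is a multi-class problem, a stratified split keeps the class proportions similar in both partitions. The following is a minimal sketch using the `stratify` parameter of `train_test_split` on synthetic labels (`X_toy`, `y_toy` and the class counts are assumptions for illustration, not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one binary feature and three imbalanced classes
X_toy = pd.DataFrame({'symptom': np.arange(300) % 2})
y_toy = pd.Series(['Dengue'] * 150 + ['Zika'] * 90 + ['Malaria'] * 60,
                  name='prognosis')

# Stratify on the label so each class keeps its share in both partitions
X_tr, X_val, y_tr, y_val = train_test_split(X_toy,
                                            y_toy,
                                            test_size=0.33,
                                            random_state=42,
                                            stratify=y_toy)

# Compare class proportions in the two partitions
train_shares = y_tr.value_counts(normalize=True).sort_index()
validation_shares = y_val.value_counts(normalize=True).sort_index()

print(train_shares.round(2))
print(validation_shares.round(2))
```

Whether stratification helps here depends on how imbalanced the `prognosis` classes are. With near-uniform classes the effect is small.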

# Model Training

In [None]:
# Set MLflow Experiment
mlflow_experiment_name = 'Vector Borne Disease'

# Retrieve the experiment if it already exists, otherwise create it
mlflow_experiment = mlflow.get_experiment_by_name(mlflow_experiment_name)

if mlflow_experiment is None:
    mlflow_experiment_id = mlflow.create_experiment(name=mlflow_experiment_name)
else:
    mlflow_experiment_id = mlflow_experiment.experiment_id

In [None]:
# Define the used metrics
metrics = ['Log Loss']

In [None]:
# Initialize DataFrame of model performance
performance = pd.DataFrame(columns=metrics)

## Logistic Regression

In [None]:
%%time

# Start MLflow Run
with mlflow.start_run(experiment_id=mlflow_experiment_id, 
                      run_name='Logistic Regression'):
    
    # Define the model
    model_lr = LogisticRegression(multi_class='ovr', 
                                  solver='lbfgs')
    
    # Train the model
    model_lr.fit(X_train, 
                 y_train)
    
    # Get predicted class probabilities (log loss requires probabilities, not labels)
    predictions_lr = model_lr.predict_proba(X_test)
    
    # Compute metrics
    log_loss_lr = round(log_loss(y_test,
                                 predictions_lr,
                                 labels=model_lr.classes_), 2)

    print('Log Loss: {}'.format(log_loss_lr))
    print('\n')
    
    # Log model's evaluation metrics
    mlflow.log_metrics({'Log Loss': log_loss_lr})
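
`log_loss` is defined on predicted class probabilities rather than on hard class labels. The sketch below illustrates the computation on synthetic data (the dataset from `make_classification`, the `toy_` names and the `max_iter` value are assumptions for illustration, not the notebook's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hypothetical three-class toy dataset
X_toy, y_toy = make_classification(n_samples=600,
                                   n_features=10,
                                   n_informative=6,
                                   n_classes=3,
                                   random_state=42)

X_tr, X_val, y_tr, y_val = train_test_split(X_toy,
                                            y_toy,
                                            test_size=0.33,
                                            random_state=42,
                                            stratify=y_toy)

# Fit a logistic regression model
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_tr, y_tr)

# One row per sample, one column per class, each row summing to 1
toy_probabilities = toy_model.predict_proba(X_val)

# Pass the class order explicitly so columns and labels line up
toy_log_loss = log_loss(y_val, toy_probabilities, labels=toy_model.classes_)

print(f'Log Loss: {toy_log_loss:.2f}')
```

Passing `labels=toy_model.classes_` keeps the probability columns aligned with the classes even when a class is missing from the validation labels.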