# Vector Borne Disease Classification Challenge

### Classify patient records to detect potential vector borne infections

### Overview

- Vector borne diseases such as Dengue, Malaria, Lyme Disease, and West Nile Fever are major public health concerns globally. Early identification of these diseases can save lives and optimize treatment strategies. In this competition, your task is to build a machine learning model that can predict the most likely disease(s) a patient may have based on clinical and environmental features.

- You will work with a synthetically generated tabular dataset that mimics real-world patterns of disease outbreaks and patient characteristics. This makes the task realistic and challenging, while ensuring data privacy.

- Use case: Such models can be integrated into early warning systems for hospitals or public health departments to improve disease response strategies.


- The prediction target is `prognosis`: the vector borne disease(s) associated with each patient ID.

### Description

Synthetically-Generated Datasets
- Using synthetic data for Playground competitions allows us to strike a balance between having real-world data (with named features) and ensuring test labels are not publicly available. This allows us to host competitions with more interesting datasets than in the past. While there are still challenges with synthetic data generation, the state of the art is much better now than when we started the Tabular Playground Series two years ago, and the goal is to produce datasets that have far fewer artifacts. Please feel free to give us feedback on the datasets for the different competitions so that we can continue to improve!

### Evaluation

- Submissions will be evaluated based on MAP@3 (mean average precision at 3). Each submission can contain up to 3 predictions per id, separated by spaces, and the earlier a correct prediction occurs, the higher the score it receives.

Submission File

- For each id in the test set, you must predict the target prognosis. The file should contain a header and have the following format:

    id,prognosis
    707,Dengue West_Nile_fever Malaria
    708,Lyme_disease West_Nile_fever Dengue
    709,Dengue West_Nile_fever Lyme_disease
    etc.
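A file in this format could be assembled from per-class probabilities roughly as follows. This is a minimal sketch, not part of the competition materials: `make_submission` is a hypothetical helper, and `proba` and `class_names` are assumed to come from whatever classifier you train (for example, the output of a `predict_proba`-style call and its class ordering).

```python
import numpy as np
import pandas as pd


def make_submission(ids, proba, class_names, k=3):
    """Build an id/prognosis frame with the k most probable labels per row.

    Hypothetical helper: `proba` is an (n_rows, n_classes) array of scores and
    `class_names` lists the labels in the same column order.
    """
    proba = np.asarray(proba)
    class_names = np.asarray(class_names)
    # Sort each row by descending score and keep the first k column indices.
    top_k = np.argsort(-proba, axis=1)[:, :k]
    # Join the k labels with spaces, most probable first.
    prognosis = [" ".join(class_names[row]) for row in top_k]
    return pd.DataFrame({"id": list(ids), "prognosis": prognosis})


# Illustrative values only; these are not real model outputs.
class_names = ["Dengue", "Lyme_disease", "Malaria", "West_Nile_fever"]
proba = [
    [0.50, 0.10, 0.15, 0.25],
    [0.20, 0.40, 0.10, 0.30],
]
submission = make_submission([707, 708], proba, class_names)
print(submission.to_csv(index=False))
```

Writing the frame with `submission.to_csv("submission.csv", index=False)` keeps the header row and avoids emitting the pandas index as an extra column.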

- The goal is to maximize MAP@3, so:
    - Ranking the correct label(s) higher improves your score.
    - Submissions with the correct disease ranked in the top 3 will score, with higher weight given to those ranked earlier.
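The ranking behaviour described above can be checked locally with a small scorer. The sketch below assumes each record has exactly one true label, in which case the average precision for a record reduces to 1/rank of the first correct prediction within the top 3, or 0 if it is absent. The function names `apk` and `mapk` are illustrative and are not taken from the competition code.

```python
from typing import Sequence


def apk(actual: str, predicted: Sequence[str], k: int = 3) -> float:
    """Average precision at k for one record, assuming a single true label."""
    for rank, label in enumerate(predicted[:k], start=1):
        if label == actual:
            # Only the first hit counts, so repeated labels earn nothing extra.
            return 1.0 / rank
    return 0.0


def mapk(actuals: Sequence[str], predictions: Sequence[Sequence[str]], k: int = 3) -> float:
    """Mean of the per-record average precision at k."""
    scores = [apk(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative labels only: hit at rank 1, hit at rank 2, and a miss.
actuals = ["Dengue", "Malaria", "Lyme_disease"]
predictions = [
    ["Dengue", "West_Nile_fever", "Malaria"],
    ["Lyme_disease", "Malaria", "Dengue"],
    ["Dengue", "West_Nile_fever", "Malaria"],
]
score = mapk(actuals, predictions)
print(score)  # (1 + 1/2 + 0) / 3 = 0.5
```

A correct label in first place contributes 1, in second place 1/2, and in third place 1/3, which is why ordering the three guesses by confidence matters.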

### Timeline

- Start Date: April 18, 2023
- Entry Deadline: Same as the Final Submission Deadline
- Team Merger Deadline: Same as the Final Submission Deadline
- Final Submission Deadline: May 1, 2023

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

In [2]:
import pandas as pd 
import numpy as np

In [5]:
df = pd.read_csv('/home/dataopske/Desktop/Vector-borne-disease-classification-challenge/data/test.csv')

In [6]:
df.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,lymph_swells,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash
0,707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,708,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,709,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,710,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,711,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
df.columns

Index(['id', 'sudden_fever', 'headache', 'mouth_bleed', 'nose_bleed',
       'muscle_pain', 'joint_pain', 'vomiting', 'rash', 'diarrhea',
       'hypotension', 'pleural_effusion', 'ascites', 'gastro_bleeding',
       'swelling', 'nausea', 'chills', 'myalgia', 'digestion_trouble',
       'fatigue', 'skin_lesions', 'stomach_pain', 'orbital_pain', 'neck_pain',
       'weakness', 'back_pain', 'weight_loss', 'gum_bleed', 'jaundice', 'coma',
       'diziness', 'inflammation', 'red_eyes', 'loss_of_appetite',
       'urination_loss', 'slow_heart_rate', 'abdominal_pain',
       'light_sensitivity', 'yellow_skin', 'yellow_eyes', 'facial_distortion',
       'microcephaly', 'rigor', 'bitter_tongue', 'convulsion', 'anemia',
       'cocacola_urine', 'hypoglycemia', 'prostraction', 'hyperpyrexia',
       'stiff_neck', 'irritability', 'confusion', 'tremor', 'paralysis',
       'lymph_swells', 'breathing_restriction', 'toe_inflammation',
       'finger_inflammation', 'lips_irritation', 'itchiness', 'ul

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     303 non-null    int64  
 1   sudden_fever           303 non-null    float64
 2   headache               303 non-null    float64
 3   mouth_bleed            303 non-null    float64
 4   nose_bleed             303 non-null    float64
 5   muscle_pain            303 non-null    float64
 6   joint_pain             303 non-null    float64
 7   vomiting               303 non-null    float64
 8   rash                   303 non-null    float64
 9   diarrhea               303 non-null    float64
 10  hypotension            303 non-null    float64
 11  pleural_effusion       303 non-null    float64
 12  ascites                303 non-null    float64
 13  gastro_bleeding        303 non-null    float64
 14  swelling               303 non-null    float64
 15  nausea