# Project 4: Predict Dengue Cases

**Notebook 1.1 - Contents:**<br>
[Context](#Context)<br>
[Methodology, Models and Metrics](#Methodology,-Models-and-Metrics)<br>
[Data Collection](#Data-Collection)<br>

## Context

[Dengue fever](https://www.healthhub.sg/a-z/diseases-and-conditions/192/topic_dengue_fever_MOH) is a mosquito-borne viral disease, caused by the dengue virus and transmitted through the bite of an infected Aedes mosquito. It poses a significant public health threat in tropical and subtropical regions, including Singapore. The key strategy for preventing dengue fever is to eliminate the breeding habitats of the Aedes mosquito. The National Environment Agency (NEA) in Singapore has launched the STOP Dengue Now campaign to combat the spread of dengue by encouraging citizens to actively participate in mosquito breeding prevention.

In pursuit of effective dengue control, innovative approaches such as Project Wolbachia have been implemented. The Wolbachia method used in Singapore involves releasing male Aedes mosquitoes that carry Wolbachia bacteria. Eggs produced when these males mate with wild females do not hatch, which suppresses the Aedes population over time. This approach has shown promise in reducing mosquito populations and subsequently limiting disease transmission.

The factors influencing dengue transmission are complex and include weather patterns, particularly rainfall and temperature, which impact the availability of standing water for mosquito breeding. Additionally, there is evidence to suggest that online search behavior for dengue-related terms might be indicative of disease prevalence, as those at higher risk of infection may be more likely to search for information on the topic. The integration of these factors, along with the influence of the Wolbachia project, forms the basis for our comprehensive study to predict and address dengue cases and their impacts in Singapore.

### Background information on Dengue

Source: ["Dengue Prevention and 35 Years of Vector Control in Singapore" by Eng Eong Ooi et al](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3373041/).

Dengue fever (DF), dengue hemorrhagic fever (DHF), and dengue shock syndrome (DSS) are diseases endemic in tropical areas like Singapore. They are transmitted principally by the *Aedes aegypti* mosquito (the vector).

**4 serotypes**

The virus has 4 immunologically distinct serotypes. Infection confers lifelong immunity to the infecting serotype but not to the remaining 3; therefore, a person can be infected with dengue virus up to 4 times during his or her lifetime. Furthermore, epidemiologic observations suggest that previous infection increases risk for DHF and DSS in subsequent infections.

**Morbidity, treatment, and vaccination**

While DF may cause substantial morbidity, the fatality rate of DHF and DSS can be as high as 30%. As yet, no specific treatment for DF or DHF is available. A vaccine is available, but it does not fully protect all individuals against dengue infection. It is also not recommended for individuals with no previous dengue infection, as vaccinating them increases their risk of developing DHF/DSS if they are later infected.

**History of dengue in Singapore**

DHF appeared in Singapore in the 1960s and quickly became a major cause of childhood death. A vector control programme was completed in 1973 and the premises index (% of inspected premises found to have containers with *A. aegypti* larvae or pupae) since then has been low (below 5%). Despite the low premises index, there have been resurgences of dengue incidence rates.

### Problem Statement

There are 2 parts to our Problem Statement:

**Part 1: Short-term prediction model**<br>
Develop a reasonably accurate model to predict dengue case numbers for a subsequent short-term period (to be determined after EDA) by using:
1) Climate data; and
2) Google search trends.

Having an accurate forecast of upcoming dengue cases would allow mitigating actions to be taken by NEA, such as:
- Stepping up premises checks for potential breeding areas.
- Stepping up maintenance of public spaces that may be potential breeding grounds, such as clearing drains, cutting grass and pruning trees.
- Increasing messaging to the general public about ongoing dengue prevention campaigns.
- Increasing the rollout of Project Wolbachia.


**Part 2: Cost-Benefit Analysis of Wolbachia Implementation** <br>
Perform a cost-benefit analysis of the Wolbachia implementation by using information from external research, and determine the decision threshold for rolling out Project Wolbachia, based on predictions from our model.

## Methodology, Models and Metrics

### Sources

**Data sources**
- Weather data: [weather.gov.sg](http://www.weather.gov.sg/climate-historical-daily/)
- Weekly dengue data: [data.gov.sg](https://data.gov.sg/dataset/weekly-infectious-disease-bulletin-cases)
- Google trends: [trends.google.com](https://trends.google.com/trends/explore?date=today%205-y&geo=SG&q=%2Fm%2F09wsg)

**Information sources**

*Climate*
- Overview of Singapore's climate: [weather.gov.sg](http://www.weather.gov.sg/climate-climate-of-singapore/)

*Dengue*
- [General information on Dengue in Singapore, NEA](https://www.nea.gov.sg/dengue-zika)
- [Dengue Prevention and 35 Years of Vector Control in Singapore](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3373041/)
- [Climate variability and increase in intensity and magnitude of dengue incidence in Singapore](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2799326/#:~:text=The%20weekly%20mean%20temperature%20and,elevated%20temperature%20and%20precipitation%2C%20respectively)
- [Dengue vaccine information](https://www.healthhub.sg/a-z/medications/661/Dengue%20Vaccine)

*Dengue trend explanations*
- [Epidemic Dengue during COVID-19 Pandemic - NCID Report](https://www.ncid.sg/Health-Professionals/Articles/Pages/Epidemic-Dengue-in-Singapore-During-COVID-19-Pandemic.aspx)
- [Low dengue cases in 2017](https://www.straitstimes.com/singapore/2772-dengue-cases-in-2017-the-lowest-in-the-last-16-years-nea)
- [2022 Dengue Review by The Straits Times](https://www.straitstimes.com/singapore/health/singapore-records-19-dengue-deaths-in-2022-nearly-four-times-2021-s-toll)

*Project Wolbachia*
- [Project Wolbachia information, NEA](https://www.nea.gov.sg/corporate-functions/resources/research/wolbachia-aedes-mosquito-suppression-strategy/)
- [Project Wolbachia expansion, NEA](https://www.nea.gov.sg/media/news/news/index/nea-s-project-wolbachia-singapore-to-be-expanded-to-eight-additional-sites#:~:text=Today%2C%20Project%20Wolbachia%20has%20covered,300%2C000%20homes%20will%20be%20covered)
- [Strategies to mitigate establishment under the Wolbachia incompatible insect technique, by Stacy Soh et al](https://www.mdpi.com/1999-4915/14/6/1132)
- [Economic impact of dengue in Singapore from 2010 to 2020 and the cost effectiveness of Wolbachia interventions, by Stacy Soh et al](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10021432/)


### Workflow

Below is our project workflow:

<img src="../images/Workflow Diagram 1.png" alt="drawing" width="700"/>

### Models

The data models used can be split into 2 main categories: Time Series Forecasting models and Regression Models.

* **Time Series Forecasting Models**
    * `SARIMA` (Seasonal Autoregressive Integrated Moving Average) and `SARIMAX` (Seasonal Autoregressive Integrated Moving Average with Exogenous Variables) are used, since we are analyzing and predicting time-dependent data.
    * Instead of ARIMA and ARIMAX, we used SARIMA and SARIMAX because they can handle seasonality: our data likely has recurring patterns, such as yearly or monthly fluctuations (as dengue is related to weather conditions).
    * SARIMAX is an extension of our baseline SARIMA model, allowing for the inclusion of exogenous variables.

* **Regression Models**
    * `Decision Tree`, `Random Forest` and `SVM` (Support Vector Machines) will be tested as well.
    * Decision Trees are easy to interpret and allow us to inspect the relative importance of different features in making predictions.
    * Random Forest models are robust and provide an implicit form of feature selection, since the importance of each feature is evaluated across multiple trees.
    * SVM is effective in capturing complex relationships between predictors and the target variable.
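
To illustrate what the seasonal handling mentioned for SARIMA buys us, the sketch below applies seasonal differencing (subtracting the value one season earlier), which is the step the seasonal "I" in SARIMA performs. The series and the season length `PERIOD = 4` are made-up toy values chosen to keep the example short; they are not taken from our dengue data, where a weekly series would more plausibly use a period of 52.

```python
def seasonal_difference(series, period):
    """Return series[t] - series[t - period] for every t >= period."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical toy data: a repeating 4-step seasonal pattern plus a steady
# upward trend of 1 case per step, covering two full "seasons".
PERIOD = 4
pattern = [10, 30, 50, 30]
toy_cases = [pattern[t % PERIOD] + t for t in range(2 * PERIOD)]

diffed = seasonal_difference(toy_cases, PERIOD)

print("original:", toy_cases)
print("seasonally differenced:", diffed)
```

After differencing, the repeating pattern cancels out and only the contribution of the trend remains, which is the kind of stationary series the autoregressive and moving-average terms are designed to model.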


### Metrics

Since the audience would be a mix of technical and non-technical people, it would be beneficial to present both RMSE and MAPE.

**Root Mean Square Error (RMSE)**

![](../images/RMSE.png)

For optimisation and selection of models, we will use RMSE for the following reasons:
1) Widely used: RMSE is widely used as a metric for regression problems.  
2) Ease of interpretation: RMSE is expressed in the same units as the target variable, making it easy to interpret.
3) Sensitivity to errors: RMSE gives more weight to larger errors (as errors are squared) compared to other metrics like MAE (mean absolute error) and MAPE (mean absolute percentage error).
    - In this case where our target variable (number of dengue cases) has a large range (peaks and troughs due to seasonality), we may expect larger errors on the high peak values in absolute terms (given the higher underlying value).
    - Given the nature of the problem (dengue outbreak prediction), we would be more interested in predicting those peaks accurately.
    - Therefore, sensitivity to large errors would be especially desirable in this context.
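
As a small illustration of point 3, the sketch below compares RMSE and MAE on two sets of predictions that have the same total absolute error. The case counts are hypothetical values invented for this example, not figures from our dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly case counts with a single outbreak peak.
actual = [100, 120, 800, 150]

# Model A is off by 50 cases every week.
steady_miss = [150, 170, 850, 200]
# Model B is exact everywhere except for a 200-case miss at the peak.
peak_miss = [100, 120, 600, 150]

for name, predicted in [("steady miss", steady_miss), ("peak miss", peak_miss)]:
    print(f"{name}: MAE={mae(actual, predicted):.1f}, RMSE={rmse(actual, predicted):.1f}")
```

Both models have the same MAE, but the RMSE of the model that misses the peak is twice as large, so selecting models on RMSE favours the one that tracks the outbreak more closely.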


**Mean Absolute Percentage Error (MAPE)**

![](../images/MAPE.png)

For presentation and dashboarding, we will also use MAPE due to its ease of interpretability, especially for non-technical audiences who would intuitively make comparisons with percentage change/error. <br>

*Note: We would not use MAPE for model optimisation and selection as MAPE penalises errors on smaller values, with large error values for actual values that are close to 0 (as the actual values form the denominator of the error formula, and dividing by a small number would yield a large number). This is counter to our focus on predicting the peaks correctly (rather than the low troughs).* 
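
The effect described in the note can be seen with a minimal sketch. The two weeks below are hypothetical numbers made up for illustration: a trough week that is missed by 10 cases and a peak week that is missed by 100 cases.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is 0 (division by zero).
    """
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical trough week: 5 actual cases, 15 predicted (10 cases off).
trough_error = mape([5], [15])
# Hypothetical peak week: 500 actual cases, 400 predicted (100 cases off).
peak_error = mape([500], [400])

print(f"trough week: {trough_error:.0f}%")
print(f"peak week:   {peak_error:.0f}%")
```

The trough week scores 200% while the peak week scores only 20%, even though the peak miss is ten times larger in absolute terms. Optimising on MAPE would therefore push a model towards fitting the quiet weeks rather than the outbreaks.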


## Data Collection

### Imports

In [1]:
import pandas as pd
import numpy as np

import io
import os # to work with files/directories

import requests
import time
from bs4 import BeautifulSoup

import pickle

### Meteorological Service Singapore

While the dengue and search-trend datasets can be downloaded directly from data.gov.sg and Google Trends, weather data has to be pulled from the [Meteorological Service Singapore](http://www.weather.gov.sg/climate-historical-daily/) (MSS), where historical daily weather records are posted as one file per station per month. The code below downloads each monthly file and concatenates them into a single dataframe, which is more efficient than downloading the data manually month by month.

In [2]:
#create function to download MSS data
def weather_data(station_no):
    base_url = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_"
    stations_list = [station_no]
    
    data_frames = []
    
    for station in stations_list:
        station_string = "S" + str(station).zfill(2)
        
        for year in range(2012, 2023):
            for month in range(1, 13):
                month_string = str(month).zfill(2)
                url = f"{base_url}{station_string}_{year}{month_string}.csv"
    
                try:
                    # A timeout stops a stalled connection from hanging the whole loop
                    response = requests.get(url, allow_redirects=True, timeout=30)
                    if response.status_code == 200:
                        csv_data = response.content
                        # Use io.StringIO to create a string buffer for reading CSV data
                        buffer = io.StringIO(csv_data.decode("ISO-8859-1"))
                        df = pd.read_csv(buffer)
                        # Strip the mis-decoded UTF-8 BOM ("ï»¿") and stray "Â" left by
                        # ISO-8859-1 decoding, then convert column names to lowercase
                        df.columns = df.columns.str.replace("ï»¿", "")
                        df.columns = df.columns.str.replace("Â", "")
                        df.columns = df.columns.str.lower()
                        data_frames.append(df)
                    else:
                        print(f"Failed to download: {url} (Status code: {response.status_code})")
                except requests.exceptions.RequestException as e:
                    print(f"Error downloading {url}: {e}")
                    continue
    
    # Concatenate all data frames into a single DataFrame
    weather_df = pd.concat(data_frames, axis=0, ignore_index=True)
    return weather_df
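
The two `str.replace` calls in the function above exist because UTF-8 bytes are being decoded as ISO-8859-1. The sketch below reproduces that effect on a made-up header line and shows one possible alternative: try UTF-8 first and fall back to ISO-8859-1. The helper name `decode_csv_bytes` is hypothetical, and the idea that the MSS files mix both encodings is our assumption based on the clean-up code, not something we have verified file by file.

```python
def decode_csv_bytes(raw: bytes) -> str:
    """Decode as UTF-8 (dropping a BOM if present), else fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("ISO-8859-1")


# Made-up header line used only for this demonstration.
header = "Station,Mean Temperature (\u00b0C)"

utf8_with_bom = b"\xef\xbb\xbf" + header.encode("utf-8")
latin1_bytes = header.encode("ISO-8859-1")

# Decoding UTF-8 bytes as ISO-8859-1 turns the BOM and the degree sign
# into the stray characters that the original function strips out.
mojibake = utf8_with_bom.decode("ISO-8859-1")
print(repr(mojibake))

# The fallback decoder returns the clean header for both encodings.
print(decode_csv_bytes(utf8_with_bom) == header)
print(decode_csv_bytes(latin1_bytes) == header)
```

With a decoder like this, the column names would arrive clean and the two replacement steps would no longer be needed, although keeping them does no harm.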

Using the function above, we can download the data for various meteorological stations.
We will download data from years 2012 to 2022, as dengue data is only available for these years.<br>
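
Before running the full download, the URL pattern can be checked offline. The sketch below rebuilds the list of monthly file URLs that `weather_data` requests for one station, using the same base URL and zero-padding as the function. The helper name `monthly_urls` and its default arguments are our own additions for illustration.

```python
BASE_URL = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_"


def monthly_urls(station_no, start_year=2012, end_year=2022):
    """Build one CSV URL per month for a station, inclusive of both years."""
    station = "S" + str(station_no).zfill(2)
    return [
        f"{BASE_URL}{station}_{year}{month:02d}.csv"
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]


urls = monthly_urls(24)

print(len(urls))   # 11 years x 12 months
print(urls[0])
print(urls[-1])
```

This confirms that `range(2012, 2023)` in the function covers 2012 to 2022 inclusive, giving 132 monthly files per station.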

We picked climate data from Changi station because it is the [main climate station](http://www.weather.gov.sg/learn_climate/#:~:text=Since%201984%2C%20the%20climate%20station%20has%20been%20located%20at%20Changi.) for Singapore: it monitors the climate over the long term and has at least 30 years of rainfall and temperature data. Changi is also the closest weather station to Tampines, one of the areas of Singapore involved in Project Wolbachia since its early stages.

##### Generate weather data for Changi (station no. 24)

In [3]:
changi_df = weather_data(24)
changi_df.head()

| | station | year | month | day | daily rainfall total (mm) | highest 30 min rainfall (mm) | highest 60 min rainfall (mm) | highest 120 min rainfall (mm) | mean temperature (°c) | maximum temperature (°c) | minimum temperature (°c) | mean wind speed (km/h) | max wind speed (km/h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Changi | 2012 | 1 | 1 | 0.6 | | | | 27.2 | 31.4 | 25.2 | 8.4 | 28.4 |
| 1 | Changi | 2012 | 1 | 2 | 0.0 | | | | 27.7 | 31.3 | 25.4 | 13.6 | 33.1 |
| 2 | Changi | 2012 | 1 | 3 | 0.0 | | | | 27.6 | 30.9 | 25.7 | 15.4 | 34.6 |
| 3 | Changi | 2012 | 1 | 4 | 0.0 | | | | 27.4 | 31.0 | 25.0 | 13.3 | 33.8 |
| 4 | Changi | 2012 | 1 | 5 | 0.0 | | | | 27.0 | 30.7 | 24.5 | 12.2 | 33.8 |


### Data export

In [4]:
# Create the output directory if it does not already exist
os.makedirs('../data/weather.gov', exist_ok=True)

# Save the DataFrame to a CSV file
changi_df.to_csv('../data/weather.gov/changi_weather.csv', index=False)