# Project 4: Predict Dengue Cases

**Notebook 1.1 - Contents:**<br>
[1.1.1 Context](#1.1.1-Context)<br>
[1.1.2 Problem Statement](#1.1.2-Problem-Statement)<br>
[1.1.3 Methodology, Models and Metrics](#1.1.3-Methodology,-Models-and-Metrics)<br>
[1.1.4 Data Collection](#1.1.4-Data-Collection)<br>

## 1.1.1 Context

[Dengue fever](https://www.healthhub.sg/a-z/diseases-and-conditions/192/topic_dengue_fever_MOH) is a mosquito-borne viral disease transmitted through the bite of an infected Aedes mosquito. It poses a significant public health threat in tropical and subtropical regions, including Singapore. The key prevention strategy is to eliminate the breeding habitats of the Aedes mosquito. The National Environment Agency (NEA) in Singapore has launched the STOP Dengue Now campaign to combat the spread of dengue by encouraging citizens to actively participate in mosquito breeding prevention.

In pursuit of effective dengue control, innovative approaches such as the Wolbachia project have been implemented. The Wolbachia method involves releasing male Aedes mosquitoes that carry Wolbachia bacteria. When these males mate with wild females, the resulting eggs do not hatch, which suppresses the Aedes population. This approach has shown promise in reducing mosquito populations and subsequently limiting disease transmission.

The factors influencing dengue transmission are complex and include weather patterns, particularly rainfall and temperature, which impact the availability of standing water for mosquito breeding. Additionally, there is evidence to suggest that online search behavior for dengue-related terms might be indicative of disease prevalence, as those at higher risk of infection may be more likely to search for information on the topic. The integration of these factors, along with the influence of the Wolbachia project, forms the basis for our comprehensive study to predict and address dengue cases and their impacts in Singapore.

## 1.1.2 Problem Statement

The objective of this study is to develop a predictive model for dengue cases by combining climate data and Google search trends. The proposed solution consists of two interconnected parts:

**Part 1: Integrated Prediction Model**<br>
Develop a model that combines climate data and Google search trends to predict dengue cases and fatalities. By analyzing historical climate patterns and search trends, we aim to produce accurate forecasts that can support proactive public health strategies.

**Part 2: Cost-Based Analysis of Wolbachia Implementation** <br>
Evaluate the long-term effectiveness of the Wolbachia project in reducing dengue transmission in Tampines and Yishun. This analysis, which weighs costs against outcomes, will determine the project's economic viability and its contribution to dengue prevention in Singapore.

## 1.1.3 Methodology, Models and Metrics

Below is our project workflow:

<img src="../images/Workflow Diagram 1.png" alt="drawing" width="700"/>

### Models

The models used fall into two main categories: time series forecasting models and regression models.

* **Time Series Forecasting Models**
    * `SARIMA` (Seasonal Autoregressive Integrated Moving Average) and `SARIMAX` (Seasonal Autoregressive Integrated Moving Average with Exogenous Variables) are used, since we are analyzing and predicting time-dependent data.
    * We chose SARIMA and SARIMAX over ARIMA and ARIMAX because they can handle seasonality: our data likely has recurring patterns, such as yearly or monthly fluctuations, since dengue is related to weather conditions.
    * SARIMAX is an extension of our baseline SARIMA model, allowing for the inclusion of exogenous variables.
<br>
 
* **Regression Models**
    * `Decision Tree`, `Random Forest` and `SVM` (Support Vector Machines) will be tested as well.
    * Decision Trees are easy to interpret and allow us to inspect the relative importance of different features in making predictions.
    * Random Forest models are robust and provide feature importance scores aggregated across multiple trees, which can guide feature selection.
    * SVM is effective in capturing complex relationships between predictors and the target variable.

### Metrics

* For the time series models, the metric used will be `Mean Absolute Percentage Error (MAPE)`: Measures the average percentage difference between predicted and actual values, suitable for relative comparisons.<br>
* For the regression models, the metric used will be `Root Mean Squared Error (RMSE)`: Square root of the MSE, provides an interpretable metric in the same unit as the target variable.
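
The two metrics above can be sketched with NumPy as follows. The function names `mape` and `rmse` and the sample values are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def rmse(y_true, y_pred):
    """Root Mean Squared Error, in the same unit as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted case counts
actual = [100, 200, 400]
predicted = [110, 180, 400]

mape_value = mape(actual, predicted)  # (10% + 10% + 0%) / 3
rmse_value = rmse(actual, predicted)  # sqrt((100 + 400 + 0) / 3)
print(f"MAPE: {mape_value:.2f}%  RMSE: {rmse_value:.2f}")
```

MAPE divides by the actual values, so it is undefined for any period in which the actual count is zero. RMSE has no such restriction but penalizes large errors more heavily.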

## 1.1.4 Data Collection

### Imports

In [3]:
import pandas as pd
import numpy as np
import io
import requests
import time
from bs4 import BeautifulSoup
import pickle

### Meteorological Service Singapore

While we can simply download datasets from data.gov and Google Trends, weather data must be pulled from the [Meteorological Service Singapore](http://www.weather.gov.sg/climate-historical-daily/) (MSS), where historical daily weather records are posted. The code below downloads each monthly file and concatenates all of them into a single dataframe, which is more efficient than manually downloading the data month by month.

In [4]:
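# Create a function to download MSS daily weather data
def weather_data(station_no):
    """Download monthly CSVs (2012 to 2022) for one MSS station and return them as a single DataFrame."""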
    base_url = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_"
    stations_list = [station_no]
    
    data_frames = []
    
    for station in stations_list:
        station_string = "S" + str(station).zfill(2)
        
        for year in range(2012, 2023):
            for month in range(1, 13):
                month_string = str(month).zfill(2)
                url = f"{base_url}{station_string}_{year}{month_string}.csv"
    
                try:
                    # A timeout prevents a stalled connection from hanging the loop
                    response = requests.get(url, allow_redirects=True, timeout=30)
                    if response.status_code == 200:
                        csv_data = response.content
                        # Use io.StringIO to create a string buffer for reading CSV data
                        buffer = io.StringIO(csv_data.decode("ISO-8859-1"))
                        df = pd.read_csv(buffer)
                        # Remove BOM characters and convert column names to lowercase
                        df.columns = df.columns.str.replace("ï»¿", "")
                        df.columns = df.columns.str.replace("Â", "")
                        df.columns = df.columns.str.lower()
                        data_frames.append(df)
                    else:
                        print(f"Failed to download: {url} (Status code: {response.status_code})")
                except requests.exceptions.RequestException as e:
                    print(f"Error downloading {url}: {e}")
                    continue
    
    # Concatenate all data frames into a single DataFrame
    weather_df = pd.concat(data_frames, axis=0, ignore_index=True)
    return weather_df
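
The two `str.replace` calls in the function above strip characters that appear when UTF-8 bytes are decoded as ISO-8859-1. The sketch below reproduces the effect on a made-up header line. It assumes the file is UTF-8 with a byte order mark (BOM), which may not hold for every file on the MSS site.

```python
# A hypothetical CSV header, encoded as UTF-8 with a leading BOM
raw = "\ufeffstation,mean temperature (°C)\n".encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 maps every byte to one character:
# the BOM bytes EF BB BF become "ï»¿", and the degree sign C2 B0 becomes "Â°"
latin1_text = raw.decode("ISO-8859-1")
print(repr(latin1_text))

# The same clean-up as in weather_data(): drop the stray characters
cleaned = latin1_text.replace("ï»¿", "").replace("Â", "")

# For comparison, decoding with "utf-8-sig" strips the BOM and reads the
# degree sign correctly in a single step
utf8_text = raw.decode("utf-8-sig")
print(repr(utf8_text))
```

Decoding as `utf-8-sig` avoids the clean-up, but it raises `UnicodeDecodeError` on bytes that are not valid UTF-8, whereas ISO-8859-1 accepts any byte sequence. That makes the ISO-8859-1 approach the more tolerant choice when the encoding of individual files is uncertain.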

Using the function above, we can download the data for Changi and Yishun.
We will download data from years 2012 to 2022, as dengue data is only available for these years.<br>

We picked Changi because it is the [main climate station](http://www.weather.gov.sg/learn_climate/#:~:text=Since%201984%2C%20the%20climate%20station%20has%20been%20located%20at%20Changi.) for Singapore: it monitors the climate over the long term and has at least 30 years of rainfall and temperature data. Changi is also the closest weather station to Tampines, one of the areas of Singapore involved in Project Wolbachia since its early stages.

We also download weather data for Yishun, the second area involved in Project Wolbachia since its early stages.

##### Generate weather data for Changi (station no. 24)

In [5]:
changi_df = weather_data(24)
changi_df.head()

| | station | year | month | day | daily rainfall total (mm) | highest 30 min rainfall (mm) | highest 60 min rainfall (mm) | highest 120 min rainfall (mm) | mean temperature (°c) | maximum temperature (°c) | minimum temperature (°c) | mean wind speed (km/h) | max wind speed (km/h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Changi | 2012 | 1 | 1 | 0.6 | | | | 27.2 | 31.4 | 25.2 | 8.4 | 28.4 |
| 1 | Changi | 2012 | 1 | 2 | 0.0 | | | | 27.7 | 31.3 | 25.4 | 13.6 | 33.1 |
| 2 | Changi | 2012 | 1 | 3 | 0.0 | | | | 27.6 | 30.9 | 25.7 | 15.4 | 34.6 |
| 3 | Changi | 2012 | 1 | 4 | 0.0 | | | | 27.4 | 31.0 | 25.0 | 13.3 | 33.8 |
| 4 | Changi | 2012 | 1 | 5 | 0.0 | | | | 27.0 | 30.7 | 24.5 | 12.2 | 33.8 |


In [6]:
import os  # to work with files/directories
# Create the output directory if it does not already exist
os.makedirs('../data/weather.gov', exist_ok=True)

# Save the DataFrame to a CSV file
changi_df.to_csv('../data/weather.gov/changi_weather.csv', index=False)