# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)  Project 4 - Prediction of Dengue Cases

## Part 1 - Data Import

In this project, we will be working with three datasets:
1. Weather Data
2. Google Data
3. Dengue Data

In this notebook, we will import all three sets of data before commencing data cleaning in the next notebook.

In [3]:
# importing libraries
import requests
import os

import pandas as pd
import numpy as np
#!pip install chardet
import chardet
from io import BytesIO
from bs4 import BeautifulSoup

#!pip install pytrends
from pytrends.request import TrendReq
import time
import tqdm

import pickle

## 1. Weather Data

For weather data, we will use the [Meteorological Service Singapore website](http://www.weather.gov.sg/climate-historical-daily/). When we click 'Inspect' on the CSV link, we see that there is an underlying .csv file, structured as:

http://www.weather.gov.sg/files/dailydata/DAILYDATA_S24_202307.csv

where S24 refers to the station number, 2023 refers to the year, and 07 refers to the month. Thus, we can run a 'for' loop for all the years, months, and stations in order to create a master CSV file that concatenates all the data together.
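
As a rough sketch of the URL pattern described above, the snippet below builds the download URL for one station and month and counts how many requests a full crawl would issue. The helper names `build_daily_url` and `count_requests` are hypothetical, and the station count of 60 is an assumed figure for illustration only, not a value taken from the website.

```python
from itertools import product

# URL pattern quoted above: station ID, 4-digit year, 2-digit month
BASE_URL = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_{}_{:04d}{:02d}.csv"


def build_daily_url(station_id, year, month):
    """Return the CSV URL for one station and one month (hypothetical helper)."""
    return BASE_URL.format(station_id, year, month)


def count_requests(n_stations, years, months=range(1, 13)):
    """Number of station/year/month combinations a full crawl would request."""
    return sum(1 for _ in product(range(n_stations), years, months))


print(build_daily_url("S24", 2023, 7))
# 60 stations is an assumed figure for illustration only
print(count_requests(60, range(2000, 2024)))
```

Because the request count is the product of stations, years and months, even a modest number of stations leads to thousands of sequential requests, which helps explain the long runtime.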

<mark>**WARNING: The entire process of getting weather data takes approximately 4 hours.**</mark>

In [8]:
#test create soup object with html parser
nea_url = "http://www.weather.gov.sg/climate-historical-daily/"
weather_res = requests.get(nea_url)
weather_soup = BeautifulSoup(weather_res.content, 'html.parser')

In [9]:
#test search
weather_station = weather_soup.find_all('ul', {'class': 'dropdown-menu'})
weather_station[1]

<ul class="dropdown-menu long-dropdown" role="menu">
<li><a href="#Admiralty" onclick="setYear('S104')">Admiralty</a></li><li><a href="#Admiralty West" onclick="setYear('S105')">Admiralty West</a></li><li><a href="#Ang Mo Kio" onclick="setYear('S109')">Ang Mo Kio</a></li><li><a href="#Boon Lay (East)" onclick="setYear('S86')">Boon Lay (East)</a></li><li><a href="#Boon Lay (West)" onclick="setYear('S63')">Boon Lay (West)</a></li><li><a href="#Botanic Garden" onclick="setYear('S120')">Botanic Garden</a></li><li><a href="#Buangkok" onclick="setYear('S55')">Buangkok</a></li><li><a href="#Bukit Panjang" onclick="setYear('S64')">Bukit Panjang</a></li><li><a href="#Bukit Timah" onclick="setYear('S90')">Bukit Timah</a></li><li><a href="#Buona Vista" onclick="setYear('S92')">Buona Vista</a></li><li><a href="#Chai Chee" onclick="setYear('S61')">Chai Chee</a></li><li><a href="#Changi" onclick="setYear('S24')">Changi</a></li><li><a href="#Choa Chu Kang (Central)" onclick="setYear('S114')">Choa Chu

In [10]:
#test call list items into list
stations_list = weather_station[1].find_all('li')
stations_list[-1]

<li><a href="#Yishun" onclick="setYear('S91')">Yishun</a></li>

In [11]:
#getting station details for downloading
#create soup object with html parser
nea_url = "http://www.weather.gov.sg/climate-historical-daily/"
weather_res = requests.get(nea_url)
weather_soup = BeautifulSoup(weather_res.content, 'html.parser')

#isolate list
weather_station = weather_soup.find_all('ul', {'class': 'dropdown-menu'})

#call list items into list
stations_list = weather_station[1].find_all('li')

#strip and extract station details into dictionary
stations_dict = {}

for i, station in enumerate(stations_list):
    station_name = station.text.strip()
    # the station ID is the quoted argument of the onclick handler, e.g. setYear('S24')
    station_id = station.find('a')['onclick'].split("'")[1]
    stations_dict[i] = {'name': station_name, 'id': station_id}

The above code allows us to match the station name to the station ID. Now, we will download all the CSVs for the different stations, years and months.
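
The name-to-ID extraction can also be tried offline on a static copy of the dropdown markup. The sketch below uses a regular expression instead of BeautifulSoup, which is a swapped-in technique rather than the notebook's method. `parse_stations` and `STATION_PATTERN` are hypothetical names, and the result is keyed by station ID rather than by position as in `stations_dict`.

```python
import re

# Static sample of the dropdown markup shown in the output above
sample_html = """<li><a href="#Changi" onclick="setYear('S24')">Changi</a></li><li><a href="#Yishun" onclick="setYear('S91')">Yishun</a></li>"""

# Capture the station ID inside setYear('...') and the link text (station name)
STATION_PATTERN = re.compile(r'''onclick="setYear\('([^']+)'\)">([^<]+)</a>''')


def parse_stations(html):
    """Map station ID to station name for every matching dropdown entry."""
    return {station_id: name.strip() for station_id, name in STATION_PATTERN.findall(html)}


stations = parse_stations(sample_html)
print(stations)
```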

In [13]:
base_url = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_{}_{:04d}{:02d}.csv"
output_directory = "../data/weather_data"

if not os.path.exists(output_directory):
    os.makedirs(output_directory)

csv_files = []  # List to store downloaded file names
master_dfs = []  # List to store individual DataFrames

for n, station in stations_dict.items():
    for year in range(2000, 2024):
        for month in range(1, 13):
            url = base_url.format(station['id'], year, month)
            response = requests.get(url)

            if response.status_code == 200:
                file_name = f"DAILYDATA_{station['id']}_{year}{month:02d}.csv"
                csv_files.append(file_name)

                # Detect encoding
                encoding = chardet.detect(response.content)['encoding']

                # Read content as DataFrame
                month_df = pd.read_csv(BytesIO(response.content), encoding=encoding)
                master_dfs.append(month_df)
                print(f"Downloaded and appended: {station['name']}({station['id']})_{year}-{month:02d}")
            else:
                print(f"Failed to download: {station['name']}({station['id']})_{year}-{month:02d}")

            # optional cooldown timer per request to lower the request rate and avoid being blocked
            # (uncommenting it also requires `import random`)
#            cooldown_time = random.uniform(0.1, 2.5)
#            time.sleep(cooldown_time)

# Concatenate all individual DataFrames into a single master DataFrame
master_df = pd.concat(master_dfs, ignore_index=True)

# Save the concatenated DataFrame to the master CSV file
master_csv_path = "../data/weather_data/master_weather_data.csv"
master_df.to_csv(master_csv_path, index=False)

print("Master CSV file saved.")

Failed to download: Admiralty(S104)_2000-01
Failed to download: Admiralty(S104)_2000-02
Failed to download: Admiralty(S104)_2000-03
Failed to download: Admiralty(S104)_2000-04
Failed to download: Admiralty(S104)_2000-05
Failed to download: Admiralty(S104)_2000-06
Failed to download: Admiralty(S104)_2000-07
Failed to download: Admiralty(S104)_2000-08
Failed to download: Admiralty(S104)_2000-09
Failed to download: Admiralty(S104)_2000-10
Failed to download: Admiralty(S104)_2000-11
Failed to download: Admiralty(S104)_2000-12
Failed to download: Admiralty(S104)_2001-01
Failed to download: Admiralty(S104)_2001-02
Failed to download: Admiralty(S104)_2001-03
Failed to download: Admiralty(S104)_2001-04
Failed to download: Admiralty(S104)_2001-05
Failed to download: Admiralty(S104)_2001-06
Failed to download: Admiralty(S104)_2001-07
Failed to download: Admiralty(S104)_2001-08
Failed to download: Admiralty(S104)_2001-09
Failed to download: Admiralty(S104)_2001-10
Failed to download: Admiralty(S1
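
The commented-out cooldown in the download loop could be sketched as a small wrapper like the one below. This is an illustration only: `fetch_with_cooldown` is a hypothetical helper, and `fetch` and `sleep` are passed in as parameters so that the pacing logic can be tried without network access. In the notebook, `fetch` would wrap `requests.get`.

```python
import random
import time


def fetch_with_cooldown(urls, fetch, min_wait=0.1, max_wait=2.5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch and sleep are injected so the pacing can be exercised offline.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # wait before every request except the first
            sleep(random.uniform(min_wait, max_wait))
        results[url] = fetch(url)
    return results


# Dry run with a stand-in fetcher and a sleep that only records the waits
waits = []
demo = fetch_with_cooldown(
    ["url-a", "url-b", "url-c"],
    fetch=lambda url: f"fetched {url}",
    sleep=waits.append,
)
print(demo)
print(len(waits))
```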

In [13]:
# Load master weather file
weather = pd.read_csv('../data/weather_data/master_weather_data.csv')

## 2. Google Data

To download trend results from Google, we will use Pytrends, an unofficial API for Google Trends. We will download data for two search terms:
1. 'dengue'
2. 'dengue symptoms'

One issue is that Google Trends returns monthly search trends when we pull data for 5 years and above, as opposed to weekly search trends for shorter periods. Thus, we will do 3 pulls that are each short enough to return weekly data. Consecutive pulls overlap by one year so that we can use the overlapping weeks to rescale the pulls onto a common scale. This gives us weekly data from January 2012 to August 2023, close to 12 years.
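
As a quick sanity check on this windowing scheme, the sketch below computes the length of each pull and the overlap between consecutive pulls. The dates are the timeframes used in the cells that follow. `span_days` and `overlap_days` are hypothetical helpers, not part of Pytrends.

```python
from datetime import date

# Timeframes used in the pulls (start, end), copied from the cells below
periods = [
    (date(2012, 1, 1), date(2015, 12, 31)),
    (date(2015, 1, 1), date(2019, 12, 31)),
    (date(2019, 1, 1), date(2023, 8, 6)),
]


def span_days(period):
    """Inclusive number of days covered by a (start, end) period."""
    start, end = period
    return (end - start).days + 1


def overlap_days(first, second):
    """Inclusive number of days shared by two periods (0 if they do not overlap)."""
    start = max(first[0], second[0])
    end = min(first[1], second[1])
    return max((end - start).days + 1, 0)


for p in periods:
    print(p[0], p[1], span_days(p), "days")

for a, b in zip(periods, periods[1:]):
    print("overlap:", overlap_days(a, b), "days")
```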

### 2.1 'dengue' Search

In [2]:
# Set up the pytrends object
pytrends = TrendReq(hl='en-US', tz=360, backoff_factor=0.1)

# Define search term and geographical location
search_term = 'dengue'
geo_location = 'SG'
period1 = '2012-01-01 2015-12-31'
period2 = '2015-01-01 2019-12-31'
period3 = '2019-01-01 2023-08-06'
periods = [period1, period2, period3]

for i, period in enumerate(periods):
    # Build the payload
    pytrends.build_payload([search_term], cat=0, timeframe = period, geo = geo_location, gprop='')

    # Get the weekly search interest data
    interest_data = pytrends.interest_over_time()

    # Save the data to a CSV file
    file_path = f"../data/google_data/{search_term}_search_interest{i+1}.csv"
    interest_data.to_csv(file_path, index=True)

### 2.2 'dengue symptoms' Search

In [12]:
# Set up the pytrends object
pytrends = TrendReq(hl='en-US', tz=360, backoff_factor=0.1)

# Define search term and geographical location
search_term = 'dengue symptoms'
geo_location = 'SG'
period1 = '2012-01-01 2015-12-31'
period2 = '2015-01-01 2019-12-31'
period3 = '2019-01-01 2023-08-06'
periods = [period1, period2, period3]

for i, period in enumerate(periods):
    # Build the payload
    pytrends.build_payload([search_term], cat=0, timeframe = period, geo = geo_location, gprop='')

    # Get the weekly search interest data
    interest_data = pytrends.interest_over_time()

    # Save the data to a CSV file
    file_path = f"../data/google_data/{search_term}_search_interest{i+1}.csv"
    interest_data.to_csv(file_path, index=True)

### 2.3 Combining Multiple Calls (.csv files)
- Google Trends converts datasets spanning 5 years or more to monthly data points, which is why the data was pulled in three shorter windows.
- Since each dataset is scaled as a percentage of its own peak (labelled 100%), combining them requires a scaling factor that links each dataset to the next.
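
The linking idea can be illustrated with plain Python on made-up numbers. The sketch below computes the factor as the mean of the week-by-week ratios over the shared dates, rescales the earlier pull, and lets the later pull supply the shared dates. The values and variable names are illustrative assumptions.

```python
from statistics import mean

# Toy weekly values for two pulls that share two dates (made-up numbers)
pull_1 = {"2015-12-13": 4, "2015-12-20": 10, "2015-12-27": 20}
pull_2 = {"2015-12-20": 25, "2015-12-27": 50, "2016-01-03": 100}

# Dates present in both pulls
shared = sorted(set(pull_1) & set(pull_2))

# Scaling factor k: mean of the week-by-week ratios (second pull / first pull)
k = mean(pull_2[d] / pull_1[d] for d in shared)

# Rescale the first pull, then let the second pull supply the shared dates
combined = {d: v * k for d, v in pull_1.items() if d not in shared}
combined.update(pull_2)

print(k)
print(combined)
```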

In [23]:
def factorizer(df1, df2, search_term):
    # Rescale df1 onto the scale of df2 using the dates they share, then stack the two.
    # Note: df1 and df2 are modified in place.

    #fix datetime format and set index (skipped if the index is already a DatetimeIndex)
    for df in (df1, df2):
        if not isinstance(df.index, pd.DatetimeIndex):
            df.set_index('date', inplace=True)
            df.index = pd.to_datetime(df.index, format='%Y-%m-%d')

    #inner merge on the index keeps only the dates common to both dataframes
    df_match_year = df1.merge(df2, how = 'inner', left_index = True, right_index = True, suffixes = ('_1', '_2'))

    #calculate adjustment factor (k) as the mean of the weekly ratios over the common year
    #caution: a week with value 0 in df1 gives inf (or NaN for 0/0) for that ratio
    df_match_year['ratio'] = df_match_year[f'{search_term}_2']/df_match_year[f'{search_term}_1']
    factor = df_match_year['ratio'].mean()
    print(f"Factor (k) for adjusting scale of data in first df to second df is {factor:.4f}.")

    #apply k factor, drop rows of common year, concat dataframes vertically (axis = 0)
    df1[f'{search_term}'] *= factor
    df1.drop(index = df_match_year.index, inplace = True)
    df1 = pd.concat([df1, df2], axis = 0)

    return df1
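
One caveat with averaging weekly ratios is that a week with a value of 0 in the first dataset divides by zero, which pandas reports as `inf` or `NaN` rather than raising an error. A possible alternative, which is not what this notebook uses, is the ratio of the sums over the overlap. The sketch below shows that variant on made-up numbers. `ratio_of_sums` is a hypothetical helper.

```python
def ratio_of_sums(first_overlap, second_overlap):
    """Scaling factor from the first pull to the second over the shared weeks.

    Uses sum(second) / sum(first), which stays finite when individual weeks
    in the first pull are 0 (it only fails if every shared week is 0).
    """
    total_first = sum(first_overlap)
    if total_first == 0:
        raise ValueError("first pull is 0 over the whole overlap; no factor can be computed")
    return sum(second_overlap) / total_first


# Made-up overlap values; the second week is 0 in the first pull
first = [10, 0, 30]
second = [20, 5, 55]

print(ratio_of_sums(first, second))
```

The two estimators generally give different factors, so switching would change the combined series.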

#### 2.3.1 Combining 'dengue' Searches

In [3]:
#import 
dengue_search_interest1 = pd.read_csv('../data/google_data/dengue_search_interest1.csv')
dengue_search_interest2 = pd.read_csv('../data/google_data/dengue_search_interest2.csv')
dengue_search_interest3 = pd.read_csv('../data/google_data/dengue_search_interest3.csv')

In [4]:
#apply factorizer
dengue_search_partial = factorizer(dengue_search_interest1, dengue_search_interest2, 'dengue')
dengue_search_combined = factorizer(dengue_search_partial, dengue_search_interest3, 'dengue')
dengue_search_combined

Factor (k) for adjusting scale of data in first df to second df is 2.7819.
Factor (k) for adjusting scale of data in first df to second df is 0.5714.


date,dengue,isPartial
2012-01-01,11.127111,False
2012-01-08,6.358349,False
2012-01-15,6.358349,False
2012-01-22,11.127111,False
2012-01-29,7.947936,False
...,...,...
2023-07-09,19.000000,False
2023-07-16,19.000000,False
2023-07-23,21.000000,False
2023-07-30,17.000000,False


In [5]:
#export csv
dengue_search_combined.to_csv('../data/google_data/master_dengue_search_data.csv', index=True)

#### 2.3.2 Combining 'dengue symptoms' Searches

In [14]:
#import 
dengue_symptoms_search_interest1 = pd.read_csv('../data/google_data/dengue symptoms_search_interest1.csv')
dengue_symptoms_search_interest2 = pd.read_csv('../data/google_data/dengue symptoms_search_interest2.csv')
dengue_symptoms_search_interest3 = pd.read_csv('../data/google_data/dengue symptoms_search_interest3.csv')

In [24]:
#apply factorizer
dengue_symptoms_search_partial = factorizer(dengue_symptoms_search_interest1, 
                                            dengue_symptoms_search_interest2, 
                                            'dengue symptoms')
dengue_symptoms_search_combined = factorizer(dengue_symptoms_search_partial, 
                                             dengue_symptoms_search_interest3,
                                            'dengue symptoms')
dengue_symptoms_search_combined

Factor (k) for adjusting scale of data in first df to second df is 3.5201.
Factor (k) for adjusting scale of data in first df to second df is 0.6476.


date,dengue symptoms,isPartial
2012-01-01,6.838475,False
2012-01-08,0.000000,False
2012-01-15,18.235933,False
2012-01-22,18.235933,False
2012-01-29,9.117967,False
...,...,...
2023-07-09,20.000000,False
2023-07-16,21.000000,False
2023-07-23,28.000000,False
2023-07-30,23.000000,False


In [25]:
#export csv
dengue_symptoms_search_combined.to_csv('../data/google_data/master_dengue_symptoms_search_data.csv', index=True)

## 3. Dengue Data

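For dengue data, we will use the weekly infectious disease bulletin case counts from data.gov.sg, which list the number of cases per disease for each epidemiological week.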

In [4]:
dengue = pd.read_csv('../data/dengue_data/WeeklyInfectiousDiseaseBulletinCases.csv')

In [5]:
dengue

,epi_week,disease,no._of_cases
0,2012-W01,Acute Viral hepatitis B,0
1,2012-W01,Acute Viral hepatitis C,0
2,2012-W01,Avian Influenza,0
3,2012-W01,Campylobacterenterosis,6
4,2012-W01,Chikungunya Fever,0
...,...,...,...
20065,2022-W52,Japanese Encephalitis,0
20066,2022-W52,Tetanus,0
20067,2022-W52,Botulism,0
20068,2022-W52,Murine Typhus,0
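
Since the bulletin covers many diseases and labels weeks as `YYYY-Www`, a later step will need to keep only the dengue rows and turn `epi_week` into a date. The sketch below illustrates this on a few hand-written rows. The disease label `Dengue Fever`, the case counts, and the assumption that `epi_week` follows ISO week numbering (weeks starting on Monday) are not confirmed by the source and should be checked against the actual data. `epi_week_start` is a hypothetical helper.

```python
from datetime import date

# A few rows in the same layout as the bulletin table.
# "Dengue Fever" and the case counts are assumed values for illustration.
rows = [
    {"epi_week": "2012-W01", "disease": "Chikungunya Fever", "no._of_cases": 0},
    {"epi_week": "2012-W01", "disease": "Dengue Fever", "no._of_cases": 74},
    {"epi_week": "2022-W52", "disease": "Dengue Fever", "no._of_cases": 150},
    {"epi_week": "2022-W52", "disease": "Tetanus", "no._of_cases": 0},
]


def epi_week_start(epi_week):
    """Convert 'YYYY-Www' to the Monday of that week (assumes ISO week numbering)."""
    year, week = epi_week.split("-W")
    return date.fromisocalendar(int(year), int(week), 1)


# Keep only rows whose disease name mentions dengue (case-insensitive)
dengue_rows = [r for r in rows if "dengue" in r["disease"].lower()]

for r in dengue_rows:
    print(epi_week_start(r["epi_week"]), r["disease"], r["no._of_cases"])
```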


In [7]:
#export csv
dengue.to_csv('../data/dengue_data/WeeklyInfectiousDiseaseBulletinCases.csv', index=False)