<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4: Predicting Dengue Cases in Singapore

---
## Part 1: Background, Overview, Data Scraping and Data Cleaning
---

## Contents
---
- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Overview of Process](#Overview-of-Process)
- [Data Scraping & Data Cleaning](#Data-Scraping-&-Data-Cleaning)

---
## Problem Statement
---

Dengue fever is endemic to Singapore. Despite constant efforts by the government and the public to suppress its spread, Singapore continues to suffer periodic outbreaks. Outbreaks have become more frequent in recent years, with some years seeing record-breaking case numbers. Because many factors influence the spread of dengue and the resulting number of cases, eradicating the virus is difficult, if not virtually impossible; as a nation, we are constantly in 'defence mode', i.e. vector control measures are often carried out reactively. In light of these challenges, our group has decided to take a data-driven approach: we aim to forecast future cases so as to gain additional lead time over the virus. By doing so, we hope to give the National Environment Agency (NEA) the confidence to carry out preventive measures at the earliest possible stage based on forecasts from our models, thus improving the lives of Singaporeans and reducing the costs associated with outbreaks, such as the high cost of aggressive vector control measures.

---
## Overview of Process
---

#### Data Cleaning & Data Scraping

This segment deals with scraping, cleaning and filtering the data. The challenges involved, and our approach to resolving them, are elaborated in the relevant sections below.

#### Exploratory Data Analysis (EDA)

This segment covers the following:
- Background information on dengue cases in Singapore;
- Detailed look into factors contributing towards case numbers;
- Analysis by planning area;
- Cost-Benefit Analysis (CBA)

#### Modelling

This segment is split into two notebooks, with each notebook approaching the modelling process from a different point of view.

While both approaches involve forecasting based on historical information, they differ in the exact methods as well as predictors being used.

#### Applications, Suggestions & Recommendations

Potential applications of the models are covered in this segment, as well as our suggestions on how the forecasting process can be improved upon to maximise obtainable benefits.

---
## Data Scraping & Data Cleaning
---

#### Datasets

| Data Type | Timeframe in Original Dataset | Weekly? | Monthly? | Yearly? |
| --- | --- | --- | --- | --- |
| Climate (temperature & rainfall) | Daily | Yes | Yes | Yes |
| Infectious diseases (including dengue) | Weekly | Yes | Yes | Yes |
| Google search trends | Weekly, Monthly | Yes | Yes | No$^1$ |
| Outbound travel data | Monthly | Yes$^2$ | Yes | Yes |
| Inbound travel data | Monthly | Yes$^2$ | Yes | Yes |

$^1$ Weekly Google search trends data for the search term 'dengue'; note that the values are not absolute search counts but a relative index, according to this definition given by Google:
> Numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means that there was not enough data for this term.

Therefore, the data should be taken at face value. As such, there will be no yearly version of it since only weekly and monthly data can be extracted from the source.
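
To illustrate why a relative index cannot simply be summed into yearly figures, the sketch below rescales two made-up series to a 0 to 100 scale based on their own peaks. The function `to_relative_interest` and the sample numbers are hypothetical; Google's actual computation involves further normalisation and is not reproduced here.

```python
# Hypothetical helper: rescale raw counts so that the peak of the window becomes 100.
def to_relative_interest(raw_counts):
    peak = max(raw_counts)
    return [round(100 * count / peak) for count in raw_counts]


# Two made-up windows whose absolute search volumes differ tenfold.
window_a = [20, 40, 10]
window_b = [200, 400, 100]

# Both windows produce the same index, so the absolute scale is lost.
print(to_relative_interest(window_a))
print(to_relative_interest(window_b))
```

Since the index only preserves the shape of the series within its own window, adding up index values across periods would not yield a meaningful total.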

$^2$ Monthly travel data will be divided evenly across the weeks in each month. While this may not be the most accurate way of obtaining weekly data, the data is crucial in explaining the COVID-period numbers; we have therefore decided to prioritise including it at the expense of some accuracy.

#### Data Cleaning & Merging

The 'end goal' is to have all the datasets compiled into weekly, monthly and yearly versions. To achieve this, the individual DataFrames must first be cleaned and standardised, and finally merged. Details specific to each dataset can be found in the relevant sections dealing with each individual dataset within this notebook.

Each individual DataFrame will only contain relevant information. For example, the dengue DataFrame will only contain one column, `dengue_cases`.

Most years have 52 weeks, with a few exceptions. Unfortunately, due to inherent differences of the source data, years with 53 weeks are not always the same across different datasets. In addition, if the 53rd week is only present in certain years, it may pose a problem in certain processes. Therefore, to minimise these issues, the 53rd week, when it exists, will be 'merged' into the 52nd week:
- For data dealing with totals, such as total rainfall, `.sum()` will be applied.
- For data dealing with averages, such as mean temperature, `.mean()` will be applied.
- For data dealing with minimum values, such as minimum temperature, `.min()` will be applied.
- For data dealing with maximum values, such as maximum temperature, `.max()` will be applied.
- For 'special' data where it is inaccurate to apply mathematical operators, such as Google search trends, the 53rd week will be dropped.
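
The rules above can be sketched on a plain dictionary of weekly values. This is only an illustration: the name `merge_week_53`, its `how` argument and the sample rainfall figures are hypothetical, and the notebook's own helper (defined further down) works on DataFrames instead.

```python
# Hypothetical helper illustrating the week-53 rules on a plain dict.
def merge_week_53(values, year, how):
    """Fold '<year>-W53' into '<year>-W52' and return a new dict.

    how: 'sum', 'mean', 'min', 'max', or 'drop' (discard week 53).
    """
    merged = dict(values)
    key_52, key_53 = f'{year}-W52', f'{year}-W53'

    # Years with only 52 weeks are left untouched.
    if key_53 not in merged:
        return merged

    week_53 = merged.pop(key_53)
    week_52 = merged[key_52]

    if how == 'sum':
        merged[key_52] = week_52 + week_53
    elif how == 'mean':
        merged[key_52] = (week_52 + week_53) / 2
    elif how == 'min':
        merged[key_52] = min(week_52, week_53)
    elif how == 'max':
        merged[key_52] = max(week_52, week_53)
    elif how != 'drop':
        raise ValueError(f'unknown rule: {how}')

    return merged


# Made-up weekly rainfall totals for a year with 53 weeks.
rainfall = {'2015-W51': 10.0, '2015-W52': 30.0, '2015-W53': 12.0}
print(merge_week_53(rainfall, 2015, 'sum'))
```

Note that the 'mean' rule here averages the two weekly figures rather than the underlying daily values, which is a simplification.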

To ensure that the merge is successful, the indexes of all weekly / monthly / yearly DataFrames will be standardised. Examples are as follows:
- Weekly data: 2012-W01
- Monthly data: 2012-M01
- Yearly data: 2012
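
As a minimal sketch of the label format above, the hypothetical helper `make_label` (not part of this notebook) builds zero-padded labels with an f-string. Zero-padding matters because it keeps alphabetical order in line with chronological order.

```python
# Hypothetical helper: build a standardised index label such as '2012-W01' or '2012-M01'.
def make_label(year, period, kind):
    """kind is 'W' for weekly labels or 'M' for monthly labels."""
    return f'{year}-{kind}{period:02d}'


weekly_labels = [make_label(2012, week, 'W') for week in (10, 2, 1)]

# With zero-padding, sorting the strings also sorts the weeks chronologically.
print(sorted(weekly_labels))

# Without zero-padding, 'W10' would sort before 'W2'.
print(sorted(['2012-W10', '2012-W2']))
```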

In [1]:
import numpy as np
import pandas as pd

import calendar
from datetime import date, timedelta, datetime

import requests

from bs4 import BeautifulSoup

import re

import matplotlib.pyplot as plt

In [2]:
# Function to create a 'week' column based on the 'year' column and the consecutive nature of the data.

def insert_week_col(df=None):
    
    '''   
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be used in function.
    '''
    
    # Set a placeholder value of 1 for all cells in the new 'week' column.
    df['week'] = 1

    # These variables will be used in the for loop below.
    prev_year = None
    prev_week = None

    # Iterate over each row of the dataframe.
    for idx, row in df.iterrows():

        # If the 'year' value for the current row is different from the 'year' value for the previous row...
        if prev_year is not None and row['year'] != prev_year:

            # ...restart the numbering for the weeks.
            prev_week = None

        # If it is the start of a new year, retain the current 'week' value of 1.
        # If not, set the current 'week' value to be the previous 'week' value + 1.
        if prev_week is not None:

            df.at[idx, 'week'] = prev_week + 1

        # Update the previous year and previous week values to be the current values.
        prev_year = df.at[idx, 'year']
        prev_week = df.at[idx, 'week']

In [3]:
# Function to 'merge' data from the 53rd week into the 52nd week.

def resolve_conflicts(df=None, year=None, column=None):
    
    '''   
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be used in function.
    year : int
        Year to apply the function to.
    column : str
        Column of DataFrame to extract data from.
    '''
    
    try:
        
        # For data dealing with minimum values, obtain the lower of the two.
        if column.startswith('min'):
            df.loc[f'{year}-W52', column] = min([df.loc[f'{year}-W52', column], df.loc[f'{year}-W53', column]])
            
        # For data dealing with maximum values, obtain the higher of the two.
        elif column.startswith('max'):
            df.loc[f'{year}-W52', column] = max([df.loc[f'{year}-W52', column], df.loc[f'{year}-W53', column]])
        
        # For data dealing with totals, obtain the sum.
        elif column.startswith('total'):
            df.loc[f'{year}-W52', column] = df.loc[f'{year}-W52', column] + df.loc[f'{year}-W53', column]

        # For data dealing with averages, obtain the mean.
        elif column.startswith('mean'):
            df.loc[f'{year}-W52', column] = (df.loc[f'{year}-W52', column] + df.loc[f'{year}-W53', column])/2
        
    # A KeyError means that the year has no 53rd week, so there is nothing to merge.
    except KeyError:
        
        pass
        
    # Drop the now-irrelevant 53rd week.
    # Note that, for all data not falling under any of the above if statements, the 53rd week is simply dropped.
    df.drop(f'{year}-W53', inplace=True, errors='ignore')

In [4]:
# Function to insert a year column as well as week/month column, depending on the timeframe of the data.

def insert_time_data(df=None, timeframe=None):
    
    '''   
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be used in function.
    timeframe : str
        'Name' of resulting column.
        Values should be 'week' or 'month'.
    '''
    
    # Create a year column.
    df['year'] = df.index.map(lambda x: int(x[:4]))
    
    # Create a column based on the timeframe specified.
    df[timeframe] = df.index.map(lambda x: int(x[-2:]))

---
## Climate data
---

The daily climate dataset comprises several climate variables, of which we are only interested in the following: `total_rainfall`, `mean_temperature`, `minimum_temperature`, `maximum_temperature`, `mean_wind_speed` and `max_wind_speed`.

There are many .csv files, as each file holds a single month's worth of daily data. Each file can be downloaded from weather.gov.sg. However, as each file has its own link, we will scrape the files instead of downloading them manually.

This is the only dataset that provides daily data, which must then be aggregated further. Obtaining weekly and monthly data from it poses a few challenges, which we address below.

Firstly, to compile the data into weekly data, the `.isocalendar()` function will be used, as it follows the ISO 8601 week-numbering standard. Under ISO 8601, weeks start on Monday and week 1 is the week containing the first Thursday of the year, so any days in early January that precede the Monday of week 1 are counted as part of the final week of the *previous year*. (Conversely, the last few days of December can fall in week 1 of the following year.) For example, week 1 of 2022 starts on Monday, 3/1. After applying `.isocalendar()`, 1/1 and 2/1 will be classified under week 52, i.e. the final week of 2021.

Due to the structure of the dataset, this poses a problem - there is a `year` column originally in the dataset which we need for the sake of the index. To illustrate the problem using the example above:

| Year | Date | Week (from `.isocalendar()`) | Resulting year-week value for the index | Correct? |
| --- | --- | --- | --- | --- |
| 2022 | 3/1 | 1 | 2022-W01 | Yes |
| 2022 | 2/1 | 52 | 2022-W52 | **No** |
| 2022 | 1/1 | 52 | 2022-W52 | **No** |

Therefore, after applying `.isocalendar()`, additional code will be written to ensure that these 'extra' days belong to the previous year, i.e.:

| Original Year | _Corrected_ Year | Date | Week (from `.isocalendar()`) | Resulting year-week value for the index | Correct? |
| --- | --- | --- | --- | --- | --- |
| 2022 | - | 3/1 | 1 | 2022-W01 | Yes |
| 2022 | 2021 | 2/1 | 52 | 2021-W52 | **Yes** |
| 2022 | 2021 | 1/1 | 52 | 2021-W52 | **Yes** |

A manual adjustment will be made for the head and tail of the data:
- 1/1/2012 will be categorised as week 52 of 2011, so it will be dropped;
- 1/1/2023 will be categorised as week 52 of 2022, so it will be included.
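
To make the week-numbering behaviour above concrete, the short sketch below prints the ISO year and week that `.isocalendar()` reports for the dates discussed. The helper name `iso_year_week_label` is ours and purely illustrative; it builds the label from the ISO year returned by `.isocalendar()`, which is an alternative to the manual year correction used in this notebook rather than a description of it.

```python
from datetime import date


# Hypothetical helper: build a 'YYYY-Www' label from the ISO year and ISO week of a date.
def iso_year_week_label(d):
    iso_year, iso_week, _ = d.isocalendar()
    return f'{iso_year}-W{iso_week:02d}'


sample_dates = [
    date(2022, 1, 1),    # Saturday before the Monday of week 1
    date(2022, 1, 2),    # Sunday before the Monday of week 1
    date(2022, 1, 3),    # Monday that starts week 1 of 2022
    date(2012, 1, 1),    # head of the data
    date(2023, 1, 1),    # tail of the data
    date(2019, 12, 30),  # late-December day belonging to week 1 of the next year
]

for d in sample_dates:
    print(d.isoformat(), iso_year_week_label(d))
```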

In [5]:
# Function to create & clean weekly, monthly & yearly dataframes for temperature (mean, min, max) & rainfall data.

def group_df_climate(df=None, groupby_col=None):
    
    '''   
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be used in function.
    groupby_col : str
        Column in the DataFrame to group the data by.
        Values should be 'year_week', 'year_month' or 'year'.
    '''

    # Create separate dataframes for each column.
    # Rename columns whenever required.
    df1 = df.groupby(groupby_col)[['daily_rainfall_total']].sum()
    df1 = df1.rename(columns={'daily_rainfall_total': 'total_rainfall'})

    df2 = df.groupby(groupby_col)['mean_temperature'].mean()
    df3 = df.groupby(groupby_col)['minimum_temperature'].min()
    df4 = df.groupby(groupby_col)[['minimum_temperature']].mean()
    df4 = df4.rename(columns={'minimum_temperature': 'minimum_temperature_mean'})
    
    df5 = df.groupby(groupby_col)['maximum_temperature'].max()
    df6 = df.groupby(groupby_col)[['maximum_temperature']].mean()
    df6 = df6.rename(columns={'maximum_temperature': 'maximum_temperature_mean'})
    
    df7 = df.groupby(groupby_col)['mean_wind_speed'].mean()
    df8 = df.groupby(groupby_col)['max_wind_speed'].max()
    df9 = df.groupby(groupby_col)[['max_wind_speed']].mean()
    df9 = df9.rename(columns={'max_wind_speed': 'max_wind_speed_mean'})
    
    # Concatenate all the dataframes into one final dataframe.
    df_new = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9], axis=1)
    
    # For weekly dataframes, apply the resolve_conflicts() function to...
    # ...eliminate the issue of having a different number of weeks across years.
    if groupby_col == 'year_week':
        for year in range(2012,2023):
            for column in df_new.columns:
                resolve_conflicts(df=df_new, year=year, column=column)
            
    return df_new

In [6]:
# Function to insert a column reflecting cumulative rainfall by year.

def get_cumulative_data(df=None, col=None, timeframe=None):
    
    '''   
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be used in function.
    col : str
        Column of DataFrame to extract data from.
    timeframe : str
        'Name' of resulting column.
        Values should be 'week' or 'month'.
    '''

    # Create a new column (initialised as float so that non-integer totals can be stored).
    df[f'cumulative_{col}'] = 0.0

    # Loop through each year in the dataset.
    for year in df.year.unique():
        
        # Get the rows for the current year.
        year_rows = df[df.year == year]

        # Set the initial cumulative rainfall value to zero.
        cumulative_total = 0

        # Loop through each row in the year's data.
        for index, row in year_rows.iterrows():
            
            # If the timeframe is 1, reset the cumulative total to zero.
            if row[timeframe] == 1:
                cumulative_total = 0

            # Add the current week's rainfall to the cumulative total.
            cumulative_total += row[col]

            # Set the cumulative total rainfall value for the current row
            df.at[index, f'cumulative_{col}'] = cumulative_total

In [7]:
# Code block to scrape and clean up the climate data.

# Either read in climate.csv...
try:
    
    df_climate = pd.read_csv('../data/climate.csv')
    df_climate.drop('Unnamed: 0', axis=1, inplace=True)
    
# ...or create the file if it doesn't yet exist.
except FileNotFoundError:
    
    # Create a list of 'yyyymm' strings to match the filenames.
    year_and_month = [f'{year}{month:02d}' for year in range(1980, 2023) for month in range(1, 13)]
    
    # Include an additional month of data from 2023; rationale has been explained above.
    year_and_month.append('202301')
    
    # Ensure that there are no repeated elements in the list.
    year_and_month = list(set(year_and_month))
    
    # Using the list created above, create a list of URLs to scrape the .csv files from.
    urls = [f'http://www.weather.gov.sg/files/dailydata/DAILYDATA_S24_{yyyymm}.csv' for yyyymm in year_and_month]
    
    # Partial filepath for the destination folder.
    destination_folder = '../data/'
    
    # For each URL...
    for url in urls:

        # Extract just the variable components of the filename.
        filename = url.split('/')[-1]

        # Create a path based on the destination folder and the filename.
        path = destination_folder + filename

        try:

            # Using the path variable created above, attempt to read in the file if it already exists.
            # Note: the 'ANSI' codec name is only recognised by Python on Windows.
            pd.read_csv(path, encoding='ANSI')

        except FileNotFoundError:

            # If the file does not exist, download the .csv file with requests and save it to disk.
            response = requests.get(url)
            with open(path, 'wb') as f:
                f.write(response.content)
                
    # Create an empty dataframe to store the data in.
    df_climate = pd.DataFrame()

    # For each URL...
    for url in urls:

        # Extract just the variable components of the filename.
        filename = url.split('/')[-1]

        # Create a path based on the destination folder and the filename.
        path = destination_folder + filename

        # Read in the .csv files that were downloaded based on the code above.
        df = pd.read_csv(path, encoding='ANSI')

        # Create a new list containing the desired column names.
        # The reason this is necessary is because the naming of certain columns in the .csv files are not consistent.
        new_cols = [col.lower() for col in df.columns if len(col.split(' ')) < 2] + \
        ['_'.join(col.lower().split(' ')[:-1]) for col in df.columns if '(' in col]

        # Rename the columns to ensure standardisation.
        df.rename(columns={old_col: new_col for old_col, new_col in zip(df.columns, new_cols)}, 
                  inplace=True, 
                  errors='ignore')

        # Concatenate the newly-created dataframe with df_climate, excluding the station column.
        df_climate = pd.concat([df_climate, df.loc[:, lambda x: x.columns.str.contains('station') == False]])

    # Once all files have been read in, ensure that values are sorted by date and that the index is reset.
    df_climate = df_climate.sort_values(['year', 'month', 'day'], ascending=True).reset_index().drop('index', axis=1)

    # Save the combined data so that it can be read in directly next time.
    df_climate.to_csv('../data/climate.csv')

In [8]:
# Using the .isocalendar() function, create a 'week' column.

df_climate['week'] = df_climate.apply(lambda x: date(x['year'], x['month'], x['day']).isocalendar()[1], axis=1)

In [9]:
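# As explained above, the code here ensures that the early-January days preceding the Monday of week 1 will be...
# ...categorised under the preceding year.
# Note: late-December days that .isocalendar() places in week 1 of the following year are not reassigned here.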

df_climate['year'] = df_climate.apply(lambda x: x.year - 1 if x.month == 1 and x.week > 50 else x.year, axis=1)

In [10]:
# Coerce non-numeric values in the wind speed columns to NaN, then replace them with 0.

for col in ['mean_wind_speed', 'max_wind_speed']:
    df_climate[col] = pd.to_numeric(df_climate[col], errors='coerce')
    df_climate[col] = df_climate[col].fillna(0)

In [11]:
# Keep only the years 1982 to 2023, then standardise value types to be float.

df_climate = df_climate.loc[df_climate.year.between(1982,2023), :].copy()

for col in [
    'mean_temperature', 
    'daily_rainfall_total', 
    'minimum_temperature', 
    'maximum_temperature', 
    'mean_wind_speed', 
    'max_wind_speed']:
    
    df_climate[col] = df_climate[col].astype(float)

In [12]:
df_climate.reset_index().drop('index', axis=1).to_csv('../data/daily_weather_data.csv')

In [13]:
# As explained above, 1/1/2012 will be dropped & 1/1/2023 will be included in the data.

df_climate = df_climate.loc[df_climate.year.between(2012,2022), :]

In [14]:
# Using the 'year' & 'week' columns, create a 'year_week' column to be used as the index for the weekly dataframes.

df_climate['year_week'] = df_climate.apply(lambda x: f'{x.year}-W0{x.week}' if x.week in range(1,10) \
                                           else f'{x.year}-W{x.week}', 
                                           axis=1)

In [15]:
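# Using the 'year' & 'month' columns, create a 'year_month' column to be used as the index for the monthly dataframes.
# Note: 'year' has already been shifted for the early-January days handled above, so those days are labelled...
# ...with the previous year's January (e.g. 1/1/2022 becomes 2021-M01).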

df_climate['year_month'] = df_climate.apply(lambda x: f'{x.year}-M0{x.month}' if x.month in range(1,10) \
                                            else f'{x.year}-M{x.month}',
                                            axis=1)

In [16]:
# Using the group_df_climate function, create weekly dataframe for temperature & rainfall data.

df_climate_weekly = group_df_climate(df=df_climate, groupby_col='year_week')

In [17]:
# Using the group_df_climate function, create monthly dataframe for temperature & rainfall data.

df_climate_monthly = group_df_climate(df=df_climate, groupby_col='year_month')

In [18]:
# Using the group_df_climate function, create yearly dataframe for temperature & rainfall data.

df_climate_yearly = group_df_climate(df=df_climate, groupby_col='year')

---
## Dengue data
---

Dengue data is extracted from a dataset comprising weekly statistics for infectious diseases, including Dengue Fever & Dengue Haemorrhagic Fever. The case counts for these two diseases will be summed and used for the purposes of this project.

Unlike the (daily) climate data, this dataset does not include 'year', 'month' and 'week' columns. The year and week can be obtained from the index (although it is unknown how the source derives its week numbering), but the month has to be inferred using the `datetime.strptime()` function. Specifically, the function takes in a date string in the format `%Y-%W-%w`, where `%Y` represents the year, `%W` represents the week number of the year (with Monday as the first day of the week, and any days before the first Monday falling in week 0), and `%w` represents the day of the week. In our case, we will set `%w` to `1`, which stands for Monday.
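
As a quick check of how the `%Y-%W-%w` format behaves, the sketch below parses a few year-week pairs into the Monday they refer to. The helper name `monday_of_week` is hypothetical. The 2019 example shows that `%W` week numbers do not always agree with the ISO week numbers used for the climate data.

```python
from datetime import datetime


# Hypothetical helper: return the Monday of a given %W-style week number.
def monday_of_week(year, week):
    return datetime.strptime(f'{year}-{week}-1', '%Y-%W-%w')


# Week 1 under %W starts on the first Monday of the year.
print(monday_of_week(2022, 1))   # first Monday of 2022
print(monday_of_week(2022, 52))  # Monday of week 52 in 2022

# In 2019, the first Monday (week 1 under %W) already falls in ISO week 2.
first_monday_2019 = monday_of_week(2019, 1)
print(first_monday_2019, first_monday_2019.isocalendar()[1])
```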

In [19]:
# Function that returns the month given a year and week.

def get_month_from_week(year, week):
    
    '''   
    Parameters
    ----------
    year : int
        Year to be used in function.
    week : int
        Week to be used in function.
    '''
    
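    # Parse the Monday of the given week; named 'parsed' to avoid shadowing the imported 'date' class.
    parsed = datetime.strptime(f'{year}-{week}-1', "%Y-%W-%w")
    
    return parsed.month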

In [20]:
# Read in weekly infectious diseases data.

df_dengue = pd.read_csv('../data/weekly-infectious-disease-bulletin-cases.csv')

In [21]:
# Drop irrelevant information & merge relevant information by summing the case counts for each week.

df_dengue = df_dengue.loc[df_dengue.disease.isin(['Dengue Fever', 'Dengue Haemorrhagic Fever'])]\
.groupby('epi_week')[['no._of_cases']]\
.sum()

In [22]:
# Rename column to be more informative.

df_dengue.rename(columns={'no._of_cases': 'dengue_cases'}, inplace=True, errors='ignore')

In [23]:
# Using the index, create a 'year' column to be used as the index for the yearly dataframe.

df_dengue['year'] = df_dengue.index.map(lambda x: int(x[:4]))

In [24]:
# Using the index, create a 'week' column.

df_dengue['week'] = df_dengue.index.map(lambda x: int(x[-2:]))

In [25]:
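# Using the resolve_conflicts() function, standardise number of weeks to be 52 per year.
# Note: 'dengue_cases' matches none of the prefixes handled by resolve_conflicts(), so the 53rd week is dropped.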

for year in range(2012,2023):
    resolve_conflicts(df=df_dengue, year=year, column='dengue_cases')

In [26]:
# Using the get_month_from_week(), create a 'month' column.

df_dengue['month'] = df_dengue.apply(lambda x: get_month_from_week(x.year, x.week), axis=1)

In [27]:
# Using the 'year' & 'month' columns, create a 'year_month' column to be used as the index for the monthly dataframe.

df_dengue['year_month'] = df_dengue.apply(lambda x: f'{x.year}-M0{x.month}' if x.month in range(1,10) \
                                          else f'{x.year}-M{x.month}',
                                          axis=1)

In [28]:
# Create weekly dataframe.

df_dengue_weekly = df_dengue[['dengue_cases']]

In [29]:
# Create monthly dataframe.

df_dengue_monthly = df_dengue.groupby('year_month')[['dengue_cases']].sum()

In [30]:
# Create yearly dataframe.

df_dengue_yearly = df_dengue.groupby('year')[['dengue_cases']].sum()

---
## Google search trends data
---

As mentioned above, Google search trends data does not report the absolute number of searches, but rather an 'index' of sorts that represents the ***popularity*** of the term relative to the highest point on the chart for the given region (Singapore) and time period (2012 to 2022). To avoid misrepresenting the data, we will only make use of the available (weekly & monthly) data, and avoid using it to 'create' other (yearly) data.

As with all the other datasets, the key challenge here is to extract a week number from the available data. In the case of the trends dataset, the source (Google) considers 1/1/2012 to be the 1st day of 2012 and 31/12/2022 to be the last day of 2022. (*Note: while this seems obvious, it is not always the case, since a year has 365 or 366 days, neither of which is divisible by 7; i.e. the selected time period happens to coincide nicely with the conventional way of thinking about the days in a year.*)

Our job here is therefore to assign a week number to each row of the DataFrame conditional on its year. Then, as mentioned above, the 53rd week of years with 53 weeks will be dropped due to the nature of this data.
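
The idea of numbering weeks within each year can be sketched on a plain list of years, one entry per weekly row in chronological order. The function name `number_weeks_within_year` is hypothetical; the notebook's own helper works directly on a DataFrame.

```python
from itertools import groupby


# Hypothetical helper: number consecutive rows 1, 2, 3, ... and restart whenever the year changes.
def number_weeks_within_year(years):
    weeks = []
    # groupby() groups consecutive equal values, so the rows must already be in date order.
    for _, rows in groupby(years):
        weeks.extend(range(1, len(list(rows)) + 1))
    return weeks


years = [2012, 2012, 2012, 2013, 2013]
print(number_weeks_within_year(years))
```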

In [31]:
# Function to read in and clean each trends dataset.

def create_df_trends(path='../data/interest_trend_2012_to_2016.csv'):
    
    '''   
    Parameters
    ----------
    path : str
        Directory path to read in the .csv file from.
    '''
    
    # Read in weekly google search trends.
    df = pd.read_csv(path)
    
    # Drop irrelevant first row; copy to avoid modifying a slice of the original.
    df = df.iloc[1:, :].copy()
    
    # Rename column to be more informative.
    df.rename(columns={'Category: All categories': 'interest'}, inplace=True, errors='ignore')
    
    # Convert values to int.
    df['interest'] = df.interest.astype(int)
    
    return df

In [32]:
# Using the create_df_trends() function, create weekly dataset from 3 separate .csv files. 

df_trends_weekly = pd.concat([create_df_trends(path=f'../data/interest_trend_{filename}.csv') \
                              for filename in ['2012_to_2016', '2017_to_2021', '2022']])

In [33]:
# Using the index, create a 'year' column.

df_trends_weekly['year'] = df_trends_weekly.index.map(lambda x: int(x[:4]))

In [34]:
# Using the insert_week_col() function, create a 'week' column.

insert_week_col(df=df_trends_weekly)

In [35]:
# Using the 'year' & 'week' columns, overwrite the index for the dataframe to follow the standardised format.

df_trends_weekly.index = df_trends_weekly.apply(lambda x: f'{x.year}-W0{x.week}' if x.week in range(1,10) \
                                                       else f'{x.year}-W{x.week}', 
                                                       axis=1)

In [36]:
# Using the resolve_conflicts() function, standardise number of weeks to be 52 per year.

for year in range(2012,2023):
    resolve_conflicts(df=df_trends_weekly, year=year, column='interest')

In [37]:
# Using the create_df_trends() function, create monthly dataset. 

df_trends_monthly = create_df_trends(path='../data/interest_trend_2012_to_2022.csv')

In [38]:
# Using the index, create a 'year' column.

df_trends_monthly['year'] = df_trends_monthly.index.map(lambda x: int(x[:4]))

In [39]:
# Using the index, create a 'month' column.

df_trends_monthly['month'] = df_trends_monthly.index.map(lambda x: int(x[-2:]))

In [40]:
# Using the 'year' & 'month' columns, overwrite the index for the dataframe to follow the standardised format.

df_trends_monthly.index = df_trends_monthly.apply(lambda x: f'{x.year}-M0{x.month}' if x.month in range(1,10) \
                                                  else f'{x.year}-M{x.month}',
                                                  axis=1)

---
## Travel data
---

Travel data, comprising both inbound & outbound figures, exists at the monthly level. Since we will be dealing mostly with weekly data, some liberties will be taken here to 'convert' monthly travel data into weekly travel data by dividing each monthly figure evenly across the weeks in that month. As noted above, this compromises accuracy, but not to the extent that the data is rendered unusable.
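
As a minimal sketch of this conversion, the hypothetical helper `spread_month_over_weeks` below divides a monthly figure evenly across a given number of weeks and rounds each share. The figures are made up, and the rounding shows one source of the accuracy loss mentioned above.

```python
# Hypothetical helper: split a monthly total into equal, rounded weekly shares.
def spread_month_over_weeks(monthly_total, n_weeks):
    weekly_share = round(monthly_total / n_weeks)
    return [weekly_share] * n_weeks


# A made-up monthly figure split over 4 weeks divides cleanly.
print(spread_month_over_weeks(1000, 4))

# Split over 3 weeks, rounding means the weekly shares no longer add up to the monthly total.
shares = spread_month_over_weeks(1000, 3)
print(shares, sum(shares))
```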

In [41]:
def get_weeks_in_month(year, month):
    
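    '''   
    Returns the week number (%W) of the last day of the given month, i.e. the cumulative number of weeks elapsed.

    Parameters
    ----------
    year : int
        Year to be used in function.
    month : int
        Month to be used in function.
    '''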
    
    last_day = calendar.monthrange(year, month)[1]
    
    last_week = datetime(year, month, last_day).strftime("%W")
    
    return int(last_week)

In [42]:
# Create a dictionary of months where the keys are the months as shown in the travel datasets, and...
# ...the values are their numerical representations which we need for standardisation purposes.

months = {word: num for word, num in zip([mth[:3] for mth in list(calendar.month_name)[1:]], range(1,13))}

In [43]:
def get_travel_data(path='../data/monthly_arrivals.csv'):
    
    '''   
    Parameters
    ----------
    path : str
        Directory path to read in the .csv file from.
    '''

    if 'arrivals' in path:
        main_col = 'inbound'
        
    else:
        main_col = 'outbound'
    
    # Read in data.
    df = pd.read_csv(path)
    
    # Keep only the first row (as a single column), drop the 'Data Series' label & rename the column to be more accurate.
    df = pd.DataFrame(df.loc[0, :]).drop('Data Series').rename(columns={0: main_col})
    
    # Using the index, create a 'year' column.
    df['year'] = df.index.map(lambda x: int(x[:4]))
    
    # Using the index & months dictionary, create a 'month' column.
    df['month'] = df.index.map(lambda x: months[x[5:8]])
    
    # Sort the data, first by year, then month, in ascending order.
    df.sort_values(['year', 'month'], inplace=True)
    
    # Using the 'year' & 'month' columns, create a 'year_month' column to be used as the index for the monthly dataframe.
    df['year_month'] = df.apply(lambda x: f'{x.year}-M0{x.month}' if x.month in range(1,10) \
                                else f'{x.year}-M{x.month}',
                                axis=1)
    
    # Using the get_weeks_in_month() function, create an 'n_weeks_consec' column that reflects...
    # ...the cumulative number of weeks in each month of a given year.
    df['n_weeks_consec'] = df.apply(lambda x: get_weeks_in_month(x.year, x.month), axis=1)
    
    # Set the 53rd week of any given year to be the 52nd week for standardisation purposes.
    df['n_weeks_consec'] = df.n_weeks_consec.map(lambda x: x if x <= 52 else 52)
    
    # Set the new column, 'n_weeks', to have the same values as the column 'n_weeks_consec'.
    # This is done to ensure that the first month of each year gets the correct number of weeks.
    df['n_weeks'] = df['n_weeks_consec']

    # These variables will be used in the for loop below.
    prev_year = None
    prev_week = None

    # Iterate over each row of the dataframe.
    for idx, row in df.iterrows():

        # If the 'year' value for the current row is different from the 'year' value for the previous row...
        if prev_year is not None and row['year'] != prev_year:

            # ...restart the numbering for the weeks.
            prev_week = None

        # If it is the start of a new year, retain the current 'n_weeks' value.
        # If not, set the current 'n_weeks' value to be the difference between the current & previous value.
        if prev_week is not None:

            df.at[idx, 'n_weeks'] = row['n_weeks'] - prev_week

        # Update the previous year and previous week values to be the current values.
        prev_year = df.at[idx, 'year']
        prev_week = df.at[idx, 'n_weeks_consec']
    
    # Create an empty dataframe that has the same columns as the original df.
    df_weekly = pd.DataFrame(columns=df.columns)
    
    # For each row in the original df, create X number of duplicate rows, where X = the value in the 'n_weeks' column.
    # Append these rows to df_weekly to obtain a 'weekly' dataframe.
    for idx, row in df.iterrows():

        df_weekly = pd.concat([df_weekly, pd.concat([pd.DataFrame(df.loc[idx, :]).transpose() \
                                                     for i in range(df.loc[idx, 'n_weeks'])])])
    
    # Obtain 'weekly' data by dividing the actual number ('inbound' column) by the number of weeks ('n_weeks' column).
    df_weekly[main_col] = df_weekly.apply(lambda x: round(x[main_col]/x.n_weeks), axis=1)
    
    # Reset df_weekly's index to avoid having duplicated values in the index.
    df_weekly = df_weekly.reset_index().drop('index', axis=1)
    
    # Ensure that all columns, except for the 'year_month' column, are in integer format.
    # This must be done to ensure that the next portion of the code can be successfully carried out.
    numeric_cols = [col for col in df_weekly.columns if col != 'year_month']
    df_weekly[numeric_cols] = df_weekly[numeric_cols].astype(int)
    
    # Using the insert_week_col() function, create a 'week' column.
    insert_week_col(df=df_weekly)
    
    # Using the 'year' & 'week' columns, overwrite the index with zero-padded labels in the standardised 'YYYY-Www' format.
    df_weekly.index = df_weekly.apply(lambda x: f'{x.year}-W{int(x.week):02d}', axis=1)
    
    # Create the finalised weekly version of the travel data.
    df_weekly = df_weekly[[main_col]]
    
    # Create the finalised monthly version of the travel data.
    df_monthly = df.groupby('year_month')[[main_col]].sum()
    
    # Create the finalised yearly version of the travel data.
    df_yearly = df.groupby('year')[[main_col]].sum()
    
    return [df_weekly, df_monthly, df_yearly]
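
The month-to-week conversion can be illustrated on a toy table. This is only a minimal sketch of the idea, not the function itself: the `monthly` frame and its values are made up, and the cumulative week counts are typed in by hand instead of coming from `get_weeks_in_month()`.

```python
import pandas as pd

# Hypothetical monthly totals; 'n_weeks_consec' holds made-up cumulative week counts.
monthly = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021],
    'month': [1, 2, 3, 1],
    'inbound': [400, 412, 90, 50],
    'n_weeks_consec': [4, 8, 13, 5],
})

# Weeks per month = difference between consecutive cumulative counts within a year.
# The first month of each year has no predecessor, so it keeps its cumulative count.
monthly['n_weeks'] = (
    monthly.groupby('year')['n_weeks_consec'].diff()
    .fillna(monthly['n_weeks_consec'])
    .astype(int)
)

# Repeat each monthly row once per week, then split the monthly total evenly.
weekly = monthly.loc[monthly.index.repeat(monthly['n_weeks'])].reset_index(drop=True)
weekly['inbound'] = (weekly['inbound'] / weekly['n_weeks']).round().astype(int)

print(weekly[['year', 'month', 'inbound']])
```

Because every week in a month receives the same value, the weekly series is a step function of the monthly totals; it carries no genuine within-month variation.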

In [44]:
# Using the get_travel_data() function, create dataframes for travel data.

df_travel_all = get_travel_data(path='../data/monthly_arrivals.csv') + \
get_travel_data(path='../data/monthly_departures.csv')

In [45]:
# Extract weekly, monthly & yearly travel dataframes from df_travel_all.

df_travel_weekly = pd.concat([df_travel_all[i] for i in [0, 3]], axis=1)
df_travel_monthly = pd.concat([df_travel_all[i] for i in [1, 4]], axis=1)
df_travel_yearly = pd.concat([df_travel_all[i] for i in [2, 5]], axis=1)

---
## Combined weekly, monthly & yearly data
---

**Notes**
- Yearly data is not available for Google search trends.
- Weekly data for outbound & inbound travel is an estimate: each monthly total is divided evenly by the number of weeks in that month.

In [46]:
# Create combined weekly dataframe with additional columns - year, week, cumulative total rainfall, cumulative dengue cases.

df_weekly = pd.concat([
    df_climate_weekly,
    df_dengue_weekly,
    df_trends_weekly[['interest']],
    df_travel_weekly],
    axis=1
)

insert_time_data(df=df_weekly, timeframe='week')

for col in ['total_rainfall', 'dengue_cases']:
    get_cumulative_data(df=df_weekly, col=col, timeframe='week')
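
The column-wise `pd.concat` in the cell above relies on all source dataframes sharing the same index labels. A minimal sketch of that alignment is shown below; the two frames and their values are made up for illustration, and only the `'YYYY-Www'` label format is taken from this notebook.

```python
import pandas as pd

# Hypothetical weekly frames indexed by 'YYYY-Www' labels (values are made up).
climate = pd.DataFrame(
    {'total_rainfall': [12.0, 3.5, 8.1]},
    index=['2012-W01', '2012-W02', '2012-W03'],
)
dengue = pd.DataFrame(
    {'dengue_cases': [74, 66]},
    index=['2012-W02', '2012-W03'],
)

# axis=1 places the frames side by side and aligns rows on the index (outer join by default).
combined = pd.concat([climate, dengue], axis=1)

# Labels present in only one source become NaN, so a quick count highlights any gaps.
missing_per_column = combined.isna().sum()
print(combined)
print(missing_per_column)
```

Checking `missing_per_column` after combining is a cheap way to spot sources that cover a shorter period or use a different label format.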

In [47]:
# Create combined monthly dataframe with additional columns - year, month, cumulative total rainfall, cumulative dengue cases.

df_monthly = pd.concat([
    df_climate_monthly,
    df_dengue_monthly,
    df_trends_monthly[['interest']],
    df_travel_monthly],
    axis=1
)

insert_time_data(df=df_monthly, timeframe='month')

for col in ['total_rainfall', 'dengue_cases']:
    get_cumulative_data(df=df_monthly, col=col, timeframe='month')
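
The `get_cumulative_data()` helper is defined earlier in the notebook. As a rough sketch of what a cumulative column can look like, the example below assumes a running total that restarts at the beginning of each year; `cumulative_within_year` is a hypothetical stand-in, and the real helper may behave differently.

```python
import pandas as pd


def cumulative_within_year(df, col):
    # Hypothetical stand-in for get_cumulative_data(): adds a running total that resets each year.
    df[f'cumulative_{col}'] = df.groupby('year')[col].cumsum()


# Made-up weekly rainfall values spanning two years.
demo = pd.DataFrame({
    'year': [2021, 2021, 2021, 2022, 2022],
    'week': [1, 2, 3, 1, 2],
    'total_rainfall': [10.0, 0.0, 5.5, 2.0, 3.0],
})

cumulative_within_year(demo, 'total_rainfall')
print(demo)
```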

In [48]:
# Create combined yearly dataframe.

df_yearly = pd.concat([
    df_climate_yearly,
    df_dengue_yearly,
    df_travel_yearly],
    axis=1
)

---
## Cumulative weekly & monthly data
---

Cumulative weekly data aggregates the combined weekly data from 2012 to 2022 by week number. For example, week 1 combines the week 1 data from all 11 years: totals such as rainfall, dengue cases and travel figures are summed, while temperature readings are averaged.

Cumulative monthly data aggregates the combined monthly data from 2012 to 2022 by calendar month in the same way. For example, Jan combines the Jan data from all 11 years.

As always, Google search trends will be left out from cumulative data for reasons expounded upon above.
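
The idea of combining the same week across years can be sketched with a small `groupby`. The frame below is made up and covers only two years and two columns; it simply shows counts being summed and temperatures being averaged per week number.

```python
import pandas as pd

# Made-up weekly data for two years.
weekly = pd.DataFrame({
    'year': [2012, 2012, 2013, 2013],
    'week': [1, 2, 1, 2],
    'dengue_cases': [74, 66, 90, 100],
    'mean_temperature': [26.0, 27.0, 28.0, 27.5],
})

# One row per week number: case counts are summed, temperatures are averaged across years.
by_week = weekly.groupby('week').agg(
    dengue_cases=('dengue_cases', 'sum'),
    mean_temperature=('mean_temperature', 'mean'),
)
print(by_week)
```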

In [49]:
# Function to create an additional DataFrame that reflects cumulative weekly/monthly data from 2012 to 2022.

def create_cumulative_df(df=None, timeframe=None):
    
    '''
    Aggregate each column across all years by week number or month.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to be used in function.
    timeframe : str
        Values should be 'week' or 'month'.

    Returns
    -------
    pd.DataFrame
        One row per week/month, with each column summed, averaged or maximised across years.
    '''
    
    # Group once by the chosen timeframe, then aggregate every column in a single pass.
    df_cumulative = df.groupby(df[timeframe]).agg(
        mean_temperature=('mean_temperature', 'mean'),
        minimum_temperature=('minimum_temperature', 'mean'),
        minimum_temperature_mean=('minimum_temperature_mean', 'mean'),
        maximum_temperature=('maximum_temperature', 'mean'),
        maximum_temperature_mean=('maximum_temperature_mean', 'mean'),
        total_rainfall=('total_rainfall', 'sum'),
        cumulative_total_rainfall=('cumulative_total_rainfall', 'sum'),
        mean_wind_speed=('mean_wind_speed', 'mean'),
        max_wind_speed=('max_wind_speed', 'max'),
        max_wind_speed_mean=('max_wind_speed', 'mean'),
        dengue_cases=('dengue_cases', 'sum'),
        interest=('interest', 'mean'),
        inbound=('inbound', 'sum'),
        outbound=('outbound', 'sum'),
    )
    
    return df_cumulative

In [50]:
# Using the create_cumulative_df function, create a DataFrame that reflects cumulative weekly data.

df_weekly_cumulative = create_cumulative_df(df=df_weekly, timeframe='week')

In [51]:
# Using the create_cumulative_df function, create a DataFrame that reflects cumulative monthly data.

df_monthly_cumulative = create_cumulative_df(df=df_monthly, timeframe='month')

---
## Export data
---

A total of 5 dataframes will be exported as .csv files for use in the other notebooks:
- `df_weekly`
- `df_monthly`
- `df_yearly`
- `df_weekly_cumulative`
- `df_monthly_cumulative`

In [52]:
# Export dataframes as .csv files.

path = '../data/'

df_weekly.to_csv(path+'weekly_data.csv')
df_monthly.to_csv(path+'monthly_data.csv')
df_yearly.to_csv(path+'yearly_data.csv')
df_weekly_cumulative.to_csv(path+'weekly_data_cumulative.csv')
df_monthly_cumulative.to_csv(path+'monthly_data_cumulative.csv')
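
Since the dataframes are exported with their index, the other notebooks will probably want to restore that index when reading the files back. A minimal round-trip sketch is shown below; it writes to an in-memory buffer instead of `../data/`, and the `demo` frame is made up for illustration.

```python
import io

import pandas as pd

# Made-up frame using the same 'YYYY-Www' index labels as the weekly data.
demo = pd.DataFrame({'dengue_cases': [74, 66]}, index=['2012-W01', '2012-W02'])

# Write to an in-memory buffer (a stand-in for a .csv file on disk).
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first column as the index instead of a default RangeIndex.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```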