# Challenge Summary

Can you predict local epidemics of dengue fever?

Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death.

Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

In recent years dengue fever has been spreading. Historically, the disease has been most prevalent in Southeast Asia and the Pacific islands. These days many of the nearly half billion cases per year are occurring in Latin America:

Using environmental data collected by various U.S. Federal Government agencies—from the Centers for Disease Control and Prevention to the National Oceanic and Atmospheric Administration in the U.S. Department of Commerce—can you predict the number of dengue fever cases reported each week in San Juan, Puerto Rico and Iquitos, Peru?


# Team Information

Name: Team Fondue
Members:

- Anthony Xavier Poh Tianci (E0406854)
- Tan Jia Le Damien (E0310355)


# Imports


In [86]:
import warnings
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd

import torch
import torch.nn as nn

import statsmodels.api as sm
import statsmodels.formula.api as smf

from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-white')


warnings.filterwarnings("ignore")

rng = 0


# Data Exploration


In [87]:
# Loading of dataset

train_features = pd.read_csv('./dengue_features_train.csv')
train_labels = pd.read_csv('./dengue_labels_train.csv')
test_features = pd.read_csv('./dengue_features_test.csv')


In [88]:
# fillna

train_features.fillna(method='ffill', inplace=True)
test_features.fillna(method='ffill', inplace=True)


In [89]:
# convert week_start_date column to datetime

train_features['week_start_date'] = pd.to_datetime(
    train_features['week_start_date'])
test_features['week_start_date'] = pd.to_datetime(
    test_features['week_start_date'])


In [90]:
# extracting month to a new column

train_features['month'] = train_features.week_start_date.dt.month
test_features['month'] = test_features.week_start_date.dt.month


In [91]:
# merging features and labels

train_features = pd.merge(train_features, train_labels, on=[
                          'city', 'year', 'weekofyear'])


In [92]:
# getting the average of total_cases for each week over the years
train_features = train_features.join(train_features.groupby(['city', 'weekofyear'])[
                                     'total_cases'].mean(), on=['city', 'weekofyear'], rsuffix='_avg')
test_features = test_features.join(test_features.groupby(
    ['city', 'weekofyear']).mean(), on=['city', 'weekofyear'], rsuffix='_avg')


In [93]:
# we do rolling sum for precipitation values because precipitation builds up over time

rolling_cols_sum = [
    'precipitation_amt_mm',
    'reanalysis_sat_precip_amt_mm',
    'station_precip_mm'
]

# for the following columns, we take the average over a given duration
rolling_cols_avg = [
    'ndvi_ne',
    'ndvi_nw',
    'ndvi_se',
    'ndvi_sw',
    'reanalysis_air_temp_k',
    'reanalysis_avg_temp_k',
    'reanalysis_dew_point_temp_k',
    'reanalysis_max_air_temp_k',
    'reanalysis_min_air_temp_k',
    'reanalysis_precip_amt_kg_per_m2',
    'reanalysis_relative_humidity_percent',
    'reanalysis_specific_humidity_g_per_kg',
    'reanalysis_tdtr_k',
    'station_avg_temp_c',
    'station_diur_temp_rng_c',
    'station_max_temp_c',
    'station_min_temp_c'
]


In [94]:
# for loop to create new rolling sum columns, sum over 3 weeks
for col in rolling_cols_sum:
    train_features['rolling_sum_' + col] = train_features[col].rolling(3).sum()
    test_features['rolling_sum_' + col] = test_features[col].rolling(3).sum()

# for loop to create new rolling average columns, mean over 3 weeks
for col in rolling_cols_avg:
    train_features['rolling_avg_' +
                   col] = train_features[col].rolling(3).mean()
    test_features['rolling_avg_' + col] = test_features[col].rolling(3).mean()


In [95]:
# dengue has about 4 to 10 days of incubation of dengue
# and takes about 8 to 10 days from egg to adult, create lag of 2 weeks
# create lag features for the following columns
for col in rolling_cols_sum:
    train_features['lag_1_' + col] = train_features[col].shift(1)
    test_features['lag_1_' + col] = test_features[col].shift(1)
    train_features['lag_2_' + col] = train_features[col].shift(2)
    test_features['lag_2_' + col] = test_features[col].shift(2)

# create lag features for the following columns
for col in rolling_cols_avg:
    train_features['lag_1_' + col] = train_features[col].shift(1)
    test_features['lag_1_' + col] = test_features[col].shift(1)
    train_features['lag_2_' + col] = train_features[col].shift(2)
    test_features['lag_2_' + col] = test_features[col].shift(2)


In [96]:
# we use backward fill for missing values in the rolling sum and rolling average columns
# reason for this is because our rolling sum and rolling averages take values from the previous weeks
train_features.fillna(method='bfill', inplace=True)
test_features.fillna(method='bfill', inplace=True)


In [98]:
# save our train_features to a csv file for easier access in approach 3 and 4
train_features.to_csv('train_features_modified.csv', mode='w', index=False)
test_features.to_csv('test_features_modified.csv', mode='w', index=False)


In [99]:
# slice train_features, test_features and train_labels by city
# Seperate data for San Juan
sj_train_features = train_features[train_features['city'] == 'sj']
sj_train_labels = train_labels[train_labels['city'] == 'sj']
sj_test_features = test_features[test_features['city'] == 'sj']

# Separate data for Iquitos
iq_train_features = train_features[train_features['city'] == 'iq']
iq_train_labels = train_labels[train_labels['city'] == 'iq']
iq_test_features = test_features[test_features['city'] == 'iq']

# drop city column from train_features and test_features
sj_train_features.drop(['city', 'week_start_date'], axis=1, inplace=True)
sj_train_labels.drop(['city'], axis=1, inplace=True)
sj_test_features.drop(['city', 'week_start_date'], axis=1, inplace=True)

iq_train_features.drop(['city', 'week_start_date'], axis=1, inplace=True)
iq_train_labels.drop(['city'], axis=1, inplace=True)
iq_test_features.drop(['city', 'week_start_date'], axis=1, inplace=True)


In [100]:
# create LSTM model
class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_layer_size=100, output_size=1):
        super().__init__()
        self.hidden_layer_size = hidden_layer_size

        self.lstm = nn.LSTM(input_size, hidden_layer_size)

        self.linear = nn.Linear(hidden_layer_size, output_size)

        self.hidden_cell = (torch.zeros(1, 1, self.hidden_layer_size),
                            torch.zeros(1, 1, self.hidden_layer_size))

    # forward pass method

    def forward(self, input):
        input = torch.from_numpy(input).float()
        lstm_out, self.hidden_cell = self.lstm(
            input.view(len(input), 1, -1), self.hidden_cell)
        output = self.linear(lstm_out[-1])
        return output


In [101]:
model = LSTM()
# use MAE loss function since driven data uses MAE loss function to grade submissions
loss_function = nn.L1Loss()
# use Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


In [102]:
def train_model(features, labels):
    # train model for 10 epochs
    epochs = 10

    # train model
    for epoch in range(epochs):
        # reset hidden cell state
        model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                             torch.zeros(1, 1, model.hidden_layer_size))

        # train model
        for i in range(len(features)):
            model.zero_grad()
            model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                                 torch.zeros(1, 1, model.hidden_layer_size))

            # get input and output
            inputs = features.iloc[i, :-1].values.reshape(-1, 1)
            targets = labels.iloc[i, -1]

            # forward pass
            outputs = model.forward(inputs)

            # calculate loss
            loss = loss_function(outputs, torch.tensor(targets).float())

            # backward pass
            loss.backward()

            # update weights
            optimizer.step()

        # print loss
        print('Epoch: {}/{}, Loss: {}'.format(epoch+1, epochs, loss.item()))

    return model


In [103]:
def predict(model, features):
    predictions = []
    for i in range(len(features)):
        model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                             torch.zeros(1, 1, model.hidden_layer_size))

        # get input and output
        inputs = features.iloc[i, :-1].values.reshape(-1, 1)
        targets = features.iloc[i, -1]

        # forward pass
        outputs = model.forward(inputs)

        # append prediction to list
        predictions.append(outputs.item())

    return predictions


In [104]:
# train model on sj and iq data
sj_model = train_model(sj_train_features, sj_train_labels)
iq_model = train_model(iq_train_features, iq_train_labels)


Epoch: 1/10, Loss: 9.296796798706055
Epoch: 2/10, Loss: 9.773235321044922
Epoch: 3/10, Loss: 9.781216621398926
Epoch: 4/10, Loss: 9.809554100036621
Epoch: 5/10, Loss: 9.632328987121582
Epoch: 6/10, Loss: 9.458606719970703
Epoch: 7/10, Loss: 9.752833366394043
Epoch: 8/10, Loss: 9.754161834716797
Epoch: 9/10, Loss: 9.754855155944824
Epoch: 10/10, Loss: 9.755311012268066
Epoch: 1/10, Loss: 1.0281591415405273
Epoch: 2/10, Loss: 1.0222787857055664
Epoch: 3/10, Loss: 1.0364742279052734
Epoch: 4/10, Loss: 1.0685386657714844
Epoch: 5/10, Loss: 0.9847903251647949
Epoch: 6/10, Loss: 1.0046987533569336
Epoch: 7/10, Loss: 1.0044307708740234
Epoch: 8/10, Loss: 1.0457072257995605
Epoch: 9/10, Loss: 1.0497546195983887
Epoch: 10/10, Loss: 0.9905633926391602


In [105]:
sj_predictions = predict(sj_model, sj_test_features)
iq_predictions = predict(iq_model, iq_test_features)


In [107]:
predictions = sj_predictions + iq_predictions
# prediction as numpy of integers
predictions = np.array(predictions, dtype=int)

submission = pd.read_csv("./submission_format.csv", index_col=[0, 1, 2])
submission.total_cases = predictions
submission.to_csv("./approach_2.csv")
