# Challenge Summary

Can you predict local epidemics of dengue fever?

Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death.

Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

In recent years dengue fever has been spreading. Historically, the disease has been most prevalent in Southeast Asia and the Pacific islands. These days, many of the nearly half a billion cases per year occur in Latin America.

Using environmental data collected by various U.S. Federal Government agencies—from the Centers for Disease Control and Prevention to the National Oceanic and Atmospheric Administration in the U.S. Department of Commerce—can you predict the number of dengue fever cases reported each week in San Juan, Puerto Rico and Iquitos, Peru?


# Team Information

Name: Team Fondue
Members:

- Anthony Xavier Poh Tianci (E0406854)
- Tan Jia Le Damien (E0310355)


# Imports


In [1129]:
import warnings
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, TensorDataset

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from matplotlib import pyplot as plt
%matplotlib inline
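# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in matplotlib 3.6
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)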


warnings.filterwarnings("ignore")

rng = 0


# Data Exploration


In [1130]:
# Loading of dataset

train_features = pd.read_csv('./data/dengue_features_train.csv')
train_labels = pd.read_csv('./data/dengue_labels_train.csv')
test_features = pd.read_csv('./data/dengue_features_test.csv')


In [1131]:
# forward fill missing values with the most recent earlier observation
train_features.ffill(inplace=True)
test_features.ffill(inplace=True)


In [1132]:
# convert week_start_date column to datetime

train_features['week_start_date'] = pd.to_datetime(train_features['week_start_date'])
test_features['week_start_date'] = pd.to_datetime(test_features['week_start_date'])


In [1133]:
# extracting month to a new column

train_features['month'] = train_features.week_start_date.dt.month
test_features['month'] = test_features.week_start_date.dt.month


In [1134]:
# check the shapes of the train and test feature tables after adding the month column
print(train_features.shape)
print(test_features.shape)

(1456, 25)
(416, 25)


In [1135]:
# we do rolling sum for precipitation values because precipitation builds up over time

rolling_cols_sum = [
    'precipitation_amt_mm',
    'reanalysis_sat_precip_amt_mm',
    'station_precip_mm'
]

# for the following columns, we take the average over a given duration
rolling_cols_avg = [
    'ndvi_ne',
    'ndvi_nw',
    'ndvi_se',
    'ndvi_sw',
    'reanalysis_air_temp_k',
    'reanalysis_avg_temp_k',
    'reanalysis_dew_point_temp_k',
    'reanalysis_max_air_temp_k',
    'reanalysis_min_air_temp_k',
    'reanalysis_precip_amt_kg_per_m2',
    'reanalysis_relative_humidity_percent',
    'reanalysis_specific_humidity_g_per_kg',
    'reanalysis_tdtr_k',
    'station_avg_temp_c',
    'station_diur_temp_rng_c',
    'station_max_temp_c',
    'station_min_temp_c'
]


In [1136]:
# for loop to create new rolling sum columns, sum over 3 weeks
for col in rolling_cols_sum:
    train_features['rolling_sum_' + col] = train_features[col].rolling(3).sum()
    test_features['rolling_sum_' + col] = test_features[col].rolling(3).sum()

# for loop to create new rolling average columns, mean over 3 weeks
for col in rolling_cols_avg:
    train_features['rolling_avg_' + col] = train_features[col].rolling(3).mean()
    test_features['rolling_avg_' + col] = test_features[col].rolling(3).mean()


In [1137]:
# dengue has an incubation period of about 4 to 10 days, and mosquitoes take about
# 8 to 10 days to develop from egg to adult, so we lag every feature by 1 and 2 weeks
for col in rolling_cols_sum + rolling_cols_avg:
    train_features['lag_1_' + col] = train_features[col].shift(1)
    test_features['lag_1_' + col] = test_features[col].shift(1)
    train_features['lag_2_' + col] = train_features[col].shift(2)
    test_features['lag_2_' + col] = test_features[col].shift(2)
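
Because `train_features` and `test_features` still hold both cities at this point, `rolling(3)` and `shift` applied to the whole frame let the rows at the boundary between the two cities draw on the other city's weeks. A minimal sketch of a per-city variant, assuming the rows are sorted by week within each city. The helper `add_city_window_features` and the toy values are hypothetical and not part of the notebook, and only the rolling mean is shown (a rolling sum would follow the same pattern):

```python
import pandas as pd

def add_city_window_features(df, cols, window=3, lags=(1, 2), group_col="city"):
    """Hypothetical helper: rolling mean and lag columns computed within each city."""
    out = df.copy()
    # group the untouched input frame so each city is processed separately
    grouped = df.groupby(group_col, sort=False)
    for col in cols:
        # rolling mean restricted to each city's own rows
        out["rolling_avg_" + col] = grouped[col].transform(
            lambda s: s.rolling(window).mean()
        )
        for lag in lags:
            # shift inside the group so lag values never come from the other city
            out[f"lag_{lag}_{col}"] = grouped[col].shift(lag)
    return out

# toy frame with made-up values: three weeks per city
toy = pd.DataFrame({
    "city": ["sj", "sj", "sj", "iq", "iq", "iq"],
    "station_avg_temp_c": [25.0, 26.0, 27.0, 30.0, 31.0, 32.0],
})
toy_features = add_city_window_features(toy, ["station_avg_temp_c"])
print(toy_features)
```

In the toy frame the first Iquitos row has no lag value and no rolling mean, whereas a whole-frame `rolling(3)` would average it with the last two San Juan rows.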


In [1138]:
# the first rows of the rolling and lag columns are NaN because they depend on previous weeks,
# so we backward fill them with the next available value
train_features.bfill(inplace=True)
test_features.bfill(inplace=True)


In [1139]:
# save the modified train and test features to csv files for easier access in approaches 3 and 4
train_features.to_csv('./data/train_features_modified.csv', mode='w', index=False)
test_features.to_csv('./data/test_features_modified.csv', mode='w', index=False)


In [1140]:
# slice train_features, test_features and train_labels by city
# Separate data for San Juan
sj_train_features = train_features[train_features['city'] == 'sj']
sj_train_labels = train_labels[train_labels['city'] == 'sj']['total_cases'].values.reshape(-1, 1)
sj_test_features = test_features[test_features['city'] == 'sj']

# Separate data for Iquitos
iq_train_features = train_features[train_features['city'] == 'iq']
iq_train_labels = train_labels[train_labels['city'] == 'iq']['total_cases'].values.reshape(-1, 1)
iq_test_features = test_features[test_features['city'] == 'iq']

# drop the city and week_start_date columns, which are not numeric model inputs
# (drop returns a new frame, which avoids modifying a slice of the original in place)
drop_cols = ['city', 'week_start_date']
sj_train_features = sj_train_features.drop(columns=drop_cols)
sj_test_features = sj_test_features.drop(columns=drop_cols)

iq_train_features = iq_train_features.drop(columns=drop_cols)
iq_test_features = iq_test_features.drop(columns=drop_cols)


In [1141]:
# min-max scale the features of each city to the range [0, 1]
# each scaler is fitted on the training features only and reused for the test features,
# so that both sets share the same scale
sj_scaler = MinMaxScaler()
sj_train_features = sj_scaler.fit_transform(sj_train_features)
sj_test_features = sj_scaler.transform(sj_test_features)

iq_scaler = MinMaxScaler()
iq_train_features = iq_scaler.fit_transform(iq_train_features)
iq_test_features = iq_scaler.transform(iq_test_features)
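
To illustrate how much the choice of fitting data matters for min-max scaling, here is a small sketch in plain NumPy. It mirrors the `(x - min) / (max - min)` formula that `MinMaxScaler` applies with its default `feature_range` of (0, 1). The helpers `fit_min_max` and `apply_min_max` and the toy arrays are hypothetical and only serve the illustration:

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range of the given array."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(x, col_min, col_range):
    """Scale x with previously computed statistics."""
    return (x - col_min) / col_range

# made-up single-feature data
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [15.0]])

col_min, col_range = fit_min_max(train)
test_shared = apply_min_max(test, col_min, col_range)  # scaled with training statistics
test_refit = apply_min_max(test, *fit_min_max(test))   # scaled with statistics refitted on test

print(test_shared.ravel(), test_refit.ravel())
```

With the training statistics, a raw value of 10 maps to 0.5 in both sets. When the statistics are refitted on the test array, the same raw value maps to 0, so the network would see it as a different input.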


In [1142]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
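        # input layer: 83 features per weekly observation
        self.lin = nn.Linear(83, 256)
        # hidden activation function
        self.act = nn.ReLU()
        # remaining layers: hidden widths 128, 64 and 32, then a single regression output
        self.lin2 = nn.Linear(256, 128)
        self.lin3 = nn.Linear(128, 64)
        self.lin4 = nn.Linear(64, 32)
        self.lin5 = nn.Linear(32, 1)

    # forward pass: ReLU after every linear layer except the output layer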
    def forward(self, xb):
        xb = self.lin(xb)
        xb = self.act(xb)
        xb = self.lin2(xb)
        xb = self.act(xb)
        xb = self.lin3(xb)
        xb = self.act(xb)
        xb = self.lin4(xb)
        xb = self.act(xb)
        xb = self.lin5(xb)
        return xb
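
The input width of 83 in `nn.Linear(83, 256)` is hard-coded, so it has to match the number of feature columns produced by the preprocessing cells. A short sketch of that arithmetic, using the printed shape `(1456, 25)` and the lengths of the two column lists defined earlier; the variable names are only for illustration:

```python
# column counts taken from the preprocessing cells
n_columns_after_month = 25  # second entry of the printed shape (1456, 25)
n_dropped = 2               # 'city' and 'week_start_date'
n_sum_cols = 3              # len(rolling_cols_sum)
n_avg_cols = 17             # len(rolling_cols_avg)

n_base = n_columns_after_month - n_dropped  # raw numeric columns kept
n_rolling = n_sum_cols + n_avg_cols         # one rolling column per listed feature
n_lags = 2 * (n_sum_cols + n_avg_cols)      # lag_1 and lag_2 per listed feature

n_features = n_base + n_rolling + n_lags
print(n_features)
```

This gives 23 + 20 + 40 = 83. If columns are added or removed upstream, the first layer's input size needs to change accordingly, for example by passing the array's `shape[1]` into the constructor.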


In [1143]:
def train_model(model, train_features, train_labels):
    opt = optim.SGD(model.parameters(), lr=0.01)

    # random 80/20 split into training and validation sets (not stratified)
    X_train, X_valid, y_train, y_valid = train_test_split(
        train_features, train_labels, test_size=0.2, random_state=rng)

    # convert X_train, X_valid, y_train and y_valid to float tensors
    X_train = torch.tensor(X_train).float()
    X_valid = torch.tensor(X_valid).float()
    y_train = torch.tensor(y_train).float()
    y_valid = torch.tensor(y_valid).float()

    train_ds = TensorDataset(X_train, y_train)
    train_dl = DataLoader(train_ds, batch_size=32, shuffle=False)

    valid_ds = TensorDataset(X_valid, y_valid)
    valid_dl = DataLoader(valid_ds, batch_size=32, shuffle=False)

    epochs = 300

    for epoch in range(epochs):
        train_loss = 0.0

        model.train()
        for xb, yb in train_dl:
            pred = model(xb)
            loss = F.l1_loss(pred, yb)

            loss.backward()
            opt.step()
            opt.zero_grad()

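            # accumulate the batch loss weighted by batch size (sum of absolute errors)
            train_loss += loss.item()*xb.size(0)

        # NOTE: this divides a per-sample sum by the number of batches, so the reported
        # value is roughly batch_size times the mean absolute error per sample;
        # use len(train_dl.dataset) as the divisor to report the per-sample MAE
        train_loss = train_loss/len(train_dl)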

        model.eval()
        valid_loss = 0.0

        with torch.no_grad():
            for xb, yb in valid_dl:
                outs = model(xb)
                valid_loss += F.l1_loss(outs, yb).item()*xb.size(0)

        # sum of per-sample losses divided by the number of batches (not samples)
        valid_loss = valid_loss / len(valid_dl)

        if epoch % 50 == 0:
            print(f'Epoch {epoch+1}/{epochs} train loss: {train_loss:.3f} valid loss: {valid_loss:.3f}')
    return model
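
The logged losses depend on how the running sum is normalised. `F.l1_loss` returns the mean absolute error of a batch, the loop multiplies it by the batch size, and the epoch total is then divided by `len(train_dl)`, which is the number of batches. A minimal sketch with made-up batch values, comparing that divisor with the number of samples; `epoch_loss` is a hypothetical helper, not part of the notebook:

```python
def epoch_loss(batch_means, batch_sizes):
    """Return (per-sample mean, sum divided by number of batches)."""
    # same accumulation as the training loop: batch mean times batch size
    weighted_sum = sum(m * n for m, n in zip(batch_means, batch_sizes))
    per_sample = weighted_sum / sum(batch_sizes)      # mean absolute error per sample
    per_batch_count = weighted_sum / len(batch_sizes)  # what dividing by len(dataloader) yields
    return per_sample, per_batch_count

# made-up batch MAEs for two full batches of 32 and a final batch of 16
batch_means = [20.0, 30.0, 10.0]
batch_sizes = [32, 32, 16]

per_sample, per_batch_count = epoch_loss(batch_means, batch_sizes)
print(per_sample, per_batch_count)
```

Here the per-sample MAE is 22.0, while dividing by the number of batches reports about 586.7. This is one reason logged values in the hundreds need not mean the predictions are off by hundreds of cases per week.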


In [1144]:
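def predict(train_features, train_labels, test_features):
    # train a fresh network on one city's data, then predict its test weeks
    model = Net()
    model = train_model(model, train_features, train_labels)
    test_features_tensor = torch.tensor(test_features).float()
    # inference only: switch to eval mode and skip gradient tracking
    model.eval()
    with torch.no_grad():
        predictions = model(test_features_tensor).numpy()
    return predictions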

In [1145]:
sj_predictions = predict(sj_train_features, sj_train_labels, sj_test_features)
iq_predictions = predict(iq_train_features, iq_train_labels, iq_test_features)


Epoch 1/300 train loss: 1090.237 valid loss: 922.004
Epoch 51/300 train loss: 672.872 valid loss: 845.964
Epoch 101/300 train loss: 623.745 valid loss: 784.485
Epoch 151/300 train loss: 608.873 valid loss: 778.724
Epoch 201/300 train loss: 672.926 valid loss: 703.194
Epoch 251/300 train loss: 600.637 valid loss: 654.285
Epoch 1/300 train loss: 242.295 valid loss: 181.847
Epoch 51/300 train loss: 188.308 valid loss: 131.021
Epoch 101/300 train loss: 180.566 valid loss: 125.078
Epoch 151/300 train loss: 173.986 valid loss: 130.467
Epoch 201/300 train loss: 169.230 valid loss: 121.418
Epoch 251/300 train loss: 159.541 valid loss: 132.193


In [1146]:
# flatten the (n, 1) prediction arrays to 1-D
sj_predictions = sj_predictions.flatten()
iq_predictions = iq_predictions.flatten()


In [1147]:
# concatenate San Juan and Iquitos predictions; astype(int) truncates toward zero
predictions = np.concatenate((sj_predictions, iq_predictions)).astype(int)
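
Since the network's output layer is unconstrained, raw predictions can be fractional or negative, and `astype(int)` alone truncates toward zero and leaves negative values in place. A minimal sketch of one alternative, rounding to the nearest integer and clipping at zero; `to_case_counts` and the sample values are hypothetical and not part of the notebook:

```python
import numpy as np

def to_case_counts(raw_predictions):
    """Round to the nearest integer and clip negatives to zero."""
    rounded = np.rint(np.asarray(raw_predictions, dtype=float))
    return np.clip(rounded, 0, None).astype(int)

# made-up raw model outputs
raw = np.array([-3.7, 0.4, 2.6, 7.9])

counts = to_case_counts(raw)
truncated = raw.astype(int)  # behaviour of a plain astype(int)
print(counts, truncated)
```

For these sample values the rounded and clipped counts are 0, 0, 3 and 8, whereas plain truncation gives -3, 0, 2 and 7.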

In [1148]:
submission = pd.read_csv("./data/submission_format.csv", index_col=[0, 1, 2])
submission.total_cases = predictions
submission.to_csv("./output/approach_2.csv")
