# Challenge Summary

Can you predict local epidemics of dengue fever?

Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death.

Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

In recent years dengue fever has been spreading. Historically, the disease has been most prevalent in Southeast Asia and the Pacific islands. These days many of the nearly half a billion cases per year occur in Latin America.

Using environmental data collected by various U.S. Federal Government agencies—from the Centers for Disease Control and Prevention to the National Oceanic and Atmospheric Administration in the U.S. Department of Commerce—can you predict the number of dengue fever cases reported each week in San Juan, Puerto Rico and Iquitos, Peru?

# Team Information

Name: Team Fondue
Members:

- Anthony Xavier Poh Tianci (E0406854)
- Tan Jia Le Damien (E0310355)

# Imports

In [1]:
import numpy as np
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf

from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
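# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)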

from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator, TransformerMixin

import warnings
warnings.filterwarnings("ignore")

rng = 0


# Data Exploration


In [22]:
# Loading of dataset

train_features = pd.read_csv('./dengue_features_train.csv')
train_labels = pd.read_csv('./dengue_labels_train.csv')
test_features = pd.read_csv('./dengue_features_test.csv')

In [23]:
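# forward-fill missing values with the most recent earlier observation
# (DataFrame.ffill() replaces the deprecated fillna(method='ffill'))

train_features = train_features.ffill()
test_features = test_features.ffill()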
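
The forward fill above runs over the whole table. Assuming the two cities are stacked one after the other in the file, a gap in the first rows of the second city would be filled from the last rows of the first. A minimal sketch of a per-city variant follows; the frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two cities stacked one after the other
toy = pd.DataFrame({
    'city': ['sj', 'sj', 'sj', 'iq', 'iq', 'iq'],
    'station_avg_temp_c': [25.0, np.nan, 26.0, np.nan, 27.5, np.nan],
})

# Plain forward fill: the first 'iq' row inherits the last 'sj' value
global_fill = toy.ffill()

# Per-city forward fill: gaps are only filled from earlier rows of the same city,
# so a gap at the very start of a city stays missing
per_city_fill = toy.copy()
per_city_fill['station_avg_temp_c'] = toy.groupby('city')['station_avg_temp_c'].ffill()
```

Whether the difference matters depends on where the gaps actually fall, which can be checked with `train_features.isna().sum()` before filling.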

In [24]:
# convert week_start_date column to datetime

train_features['week_start_date'] = pd.to_datetime(train_features['week_start_date'])
test_features['week_start_date'] = pd.to_datetime(test_features['week_start_date']) 

In [25]:
# extracting month to a new column

train_features['month'] = train_features.week_start_date.dt.month
test_features['month'] = test_features.week_start_date.dt.month

In [32]:
# merging features and labels

train_features = pd.merge(train_features, train_labels, on=['city', 'year', 'weekofyear'])
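
Because `pd.merge` defaults to an inner join, rows without a matching label would be dropped silently, and duplicated keys would multiply rows. One possible safeguard, sketched here on made-up miniature tables (`features` and `labels` are hypothetical stand-ins), is the `validate` argument combined with a row-count check.

```python
import pandas as pd

# Hypothetical miniature versions of the features and labels tables
features = pd.DataFrame({
    'city': ['sj', 'sj', 'iq'],
    'year': [1990, 1990, 2000],
    'weekofyear': [18, 19, 26],
    'ndvi_ne': [0.12, 0.17, 0.19],
})
labels = pd.DataFrame({
    'city': ['sj', 'sj', 'iq'],
    'year': [1990, 1990, 2000],
    'weekofyear': [18, 19, 26],
    'total_cases': [4, 5, 0],
})

# validate='one_to_one' raises pandas.errors.MergeError if a key is duplicated on either side
merged = pd.merge(features, labels, on=['city', 'year', 'weekofyear'],
                  how='inner', validate='one_to_one')

# An inner join drops unmatched rows without warning, so compare row counts
assert len(merged) == len(features)
```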

In [34]:
# average total_cases for each (city, weekofyear) across all training years

weekly_avg = train_features.groupby(['city','weekofyear'])['total_cases'].mean().rename('total_cases_avg')

# the Series is named explicitly: rsuffix is only applied on a name clash, so joining the
# unnamed average onto test_features would create 'total_cases' instead of 'total_cases_avg'
train_features = train_features.join(weekly_avg, on=['city','weekofyear'])
test_features = test_features.join(weekly_avg, on=['city','weekofyear'])
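
A small illustration of this seasonal-average feature on made-up numbers (`history` and `future` are hypothetical stand-ins for the labelled and unlabelled tables). The averaged Series is given an explicit name so that the new column is called `total_cases_avg` in both frames, whether or not the frame already has a `total_cases` column.

```python
import pandas as pd

# Hypothetical labelled history: two years of the same two weeks for one city
history = pd.DataFrame({
    'city': ['sj', 'sj', 'sj', 'sj'],
    'year': [1990, 1990, 1991, 1991],
    'weekofyear': [1, 2, 1, 2],
    'total_cases': [4, 10, 6, 20],
})
# Hypothetical unlabelled rows, standing in for the test set
future = pd.DataFrame({'city': ['sj', 'sj'], 'year': [1992, 1992], 'weekofyear': [1, 2]})

# Mean cases per (city, week of year), with an explicit name for the new column
weekly_avg = (history.groupby(['city', 'weekofyear'])['total_cases']
              .mean()
              .rename('total_cases_avg'))

history = history.join(weekly_avg, on=['city', 'weekofyear'])
future = future.join(weekly_avg, on=['city', 'weekofyear'])
```

Weeks that never occur in the labelled data (for example a week 53) would get a missing value here and need to be filled separately.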

In [36]:
# we do rolling sum for precipitation values because precipitation builds up over time

rolling_cols_sum=[
 'precipitation_amt_mm',
 'reanalysis_sat_precip_amt_mm',
 'station_precip_mm'
]

# for the following columns, we take a rolling mean over the same 3-week window

rolling_cols_avg=[
 'ndvi_ne',
 'ndvi_nw',
 'ndvi_se',
 'ndvi_sw',
 'reanalysis_air_temp_k',
 'reanalysis_avg_temp_k',
 'reanalysis_dew_point_temp_k',
 'reanalysis_max_air_temp_k',
 'reanalysis_min_air_temp_k',
 'reanalysis_precip_amt_kg_per_m2',
 'reanalysis_relative_humidity_percent',
 'reanalysis_specific_humidity_g_per_kg',
 'reanalysis_tdtr_k',
 'station_avg_temp_c',
 'station_diur_temp_rng_c',
 'station_max_temp_c',
 'station_min_temp_c'
]

In [41]:
# for loop to create new rolling sum columns, sum over 3 weeks

for col in rolling_cols_sum:
    train_features['rolling_sum_' + col] = train_features[col].rolling(3).sum()
    test_features['rolling_sum_' + col] = test_features[col].rolling(3).sum()
    
# for loop to create new rolling average columns, mean over 3 weeks
    
for col in rolling_cols_avg:
    train_features['rolling_avg_'+col] = train_features[col].rolling(3).mean()
    test_features['rolling_avg_'+col] = test_features[col].rolling(3).mean()
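
The windows above run over the whole column. Assuming the rows of the two cities are stacked in one frame, the first two windows of the second city also include weeks from the first. A sketch of how the window could be restarted for each city, on a made-up frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame: three weeks per city, stacked
toy = pd.DataFrame({
    'city': ['sj', 'sj', 'sj', 'iq', 'iq', 'iq'],
    'station_precip_mm': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Window over the whole column: row 3 (first 'iq' week) sums two 'sj' weeks and one 'iq' week
toy['rolling_sum_global'] = toy['station_precip_mm'].rolling(3).sum()

# Window restarted for each city: the first two weeks of every city are NaN
toy['rolling_sum_by_city'] = (
    toy.groupby('city')['station_precip_mm']
       .transform(lambda s: s.rolling(3).sum())
)
```

The per-city version leaves two missing values at the start of each city rather than only at the top of the frame, so any later fill step would need to be per city as well.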

In [45]:
# rolling(3) needs three weeks of data, so the first two rows of each rolling column are empty
# we back-fill those rows with the first available value

train_features = train_features.bfill()
test_features = test_features.bfill()

In [47]:
# save our train_features to a csv file for easier access later on

train_features.to_csv('train_features_modified.csv')
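
When the file is read back later, two details are easy to miss: `to_csv` also writes the row index as an unnamed first column, and dates come back as plain strings unless they are parsed. A minimal round-trip sketch on a made-up frame, using an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as train_features
toy = pd.DataFrame({
    'city': ['sj', 'iq'],
    'week_start_date': pd.to_datetime(['1990-04-30', '2000-07-01']),
    'total_cases': [4, 0],
})

buffer = io.StringIO()
toy.to_csv(buffer)  # like the call above, this also writes the row index as the first column
buffer.seek(0)

# index_col=0 restores the index instead of creating an 'Unnamed: 0' column;
# parse_dates turns week_start_date back into datetimes
restored = pd.read_csv(buffer, index_col=0, parse_dates=['week_start_date'])
```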