## DengAI Model

In this notebook I first build an XGBoost model and then pass its predictions, along with the original features, to a neural network. The idea is that the first model's predictions will help the neural network make a more precise estimate.
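The two-stage idea can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, and it builds the first-stage column from out-of-fold predictions (an assumption on my part, chosen so that no row is predicted by a model that was trained on it). The column name `stage_one_pred` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data; the real notebook uses the DengAI features.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
y = 3.0 * X["f0"] - 2.0 * X["f1"] + rng.normal(scale=0.1, size=200)

# Stage 1: out-of-fold predictions, so each row is predicted by a model
# that did not see that row during fitting.
stage_one = GradientBoostingRegressor(random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(stage_one, X, y, cv=folds)

# Stage 2 input: the original features plus the stage-one prediction column.
X_stacked = X.copy()
X_stacked["stage_one_pred"] = oof_pred
print(X_stacked.shape)
```

The second-stage model would then be trained on `X_stacked` instead of `X`.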


In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
import tensorflow as tf



from sklearn import metrics
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel, VarianceThreshold





#### Preprocess Data

This function loads the data, fills missing values, and splits the rows by city; all feature columns are kept.
When `labels_path` is omitted it returns features only, which is how the test data is loaded.

In [2]:
# load a features CSV (and optionally its labels) and split it by city
def preprocess_data(data_path, labels_path=None):
    # load the features
    df = pd.read_csv(data_path)

    # fill missing values with the next valid observation (backward fill);
    # mean or median imputation are alternatives to try
    df = df.bfill()

    # separate San Juan and Iquitos
    sj_features = df[df.city == 'sj']
    iq_features = df[df.city == 'iq']

    # drop city (column 0) and week_start_date (column 3): the data is
    # already split by city, and year/weekofyear carry the date
    iq_features = iq_features.drop(iq_features.columns[[0, 3]], axis=1)
    sj_features = sj_features.drop(sj_features.columns[[0, 3]], axis=1)

    if labels_path:
        labels = pd.read_csv(labels_path)
        # keep only the total_cases column for each city
        sj_labels = labels[labels.city == 'sj'].total_cases
        iq_labels = labels[labels.city == 'iq'].total_cases
        return sj_features, iq_features, sj_labels, iq_labels
    return sj_features, iq_features

In [8]:
sj_features, iq_features, sj_labels, iq_labels = preprocess_data(
                                                                'data/dengue_features_train.csv',
                                                                labels_path="data/dengue_labels_train.csv")

In [9]:
#load final test data
sj_test_final, iq_test_final = preprocess_data("data/dengue_features_test.csv")

Since the data is already divided by city, I drop the `city` column. I also drop `week_start_date`: `year` and `weekofyear` already represent it, so it is redundant, and its string format cannot be used directly as a numeric feature.
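Dropping by position works as long as the column order never changes. A small sketch of the same step using column names instead, on a toy frame (the values here are made up for illustration; only the column names come from the dataset):

```python
import pandas as pd

# Toy frame with the same leading columns as the features file.
df = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "week_start_date": ["1990-04-30", "1990-05-07", "2000-07-01"],
    "ndvi_ne": [0.12, 0.17, 0.19],
})

# Select one city, then drop the two string columns by name.
sj = df[df["city"] == "sj"].drop(columns=["city", "week_start_date"])
print(list(sj.columns))
```

Naming the columns makes the intent visible in the code and fails loudly with a `KeyError` if a column is missing.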

## Features and their descriptions
Copied from the example website.

#### City and date indicators
city – City abbreviations: sj for San Juan and iq for Iquitos
week_start_date – Date given in yyyy-mm-dd format

#### NOAA's GHCN daily climate data weather station measurements
station_max_temp_c – Maximum temperature
station_min_temp_c – Minimum temperature
station_avg_temp_c – Average temperature
station_precip_mm – Total precipitation
station_diur_temp_rng_c – Diurnal temperature range

#### PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)
precipitation_amt_mm – Total precipitation

#### NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)
reanalysis_sat_precip_amt_mm – Total precipitation
reanalysis_dew_point_temp_k – Mean dew point temperature
reanalysis_air_temp_k – Mean air temperature
reanalysis_relative_humidity_percent – Mean relative humidity
reanalysis_specific_humidity_g_per_kg – Mean specific humidity
reanalysis_precip_amt_kg_per_m2 – Total precipitation
reanalysis_max_air_temp_k – Maximum air temperature
reanalysis_min_air_temp_k – Minimum air temperature
reanalysis_avg_temp_k – Average air temperature
reanalysis_tdtr_k – Diurnal temperature range

#### Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements
ndvi_se – Pixel southeast of city centroid
ndvi_sw – Pixel southwest of city centroid
ndvi_ne – Pixel northeast of city centroid
ndvi_nw – Pixel northwest of city centroid

In [10]:
sj_features.head()

Unnamed: 0,year,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,reanalysis_avg_temp_k,reanalysis_dew_point_temp_k,...,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm
0,1990,18,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,297.742857,292.414286,...,32.0,73.365714,12.42,14.012857,2.628571,25.442857,6.9,29.4,20.0,16.0
1,1990,19,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,298.442857,293.951429,...,17.94,77.368571,22.82,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6
2,1990,20,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,298.878571,295.434286,...,26.1,82.052857,34.54,16.848571,2.3,26.714286,6.485714,32.2,22.8,41.4
3,1990,21,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,299.228571,295.31,...,13.9,80.337143,15.36,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0
4,1990,22,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,299.664286,295.821429,...,12.2,80.46,7.52,17.21,3.014286,28.942857,9.371429,35.0,23.9,5.8


In [None]:
# find the most significant variables (candidates for lagging)
# using gradient boosting feature importances
clf = GradientBoostingRegressor(random_state=8001)

selector = clf.fit(sj_features, sj_labels)
importances = selector.feature_importances_
fs = SelectFromModel(selector, prefit=True)
train = fs.transform(sj_features)
print(train.shape)
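`fs.transform` returns a plain array, so the names of the selected columns are lost. A minimal sketch, on synthetic data, of recovering the names through `get_support()`; the toy columns `a` to `e` are hypothetical, and only `a` actually drives the target:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(8001)
X = pd.DataFrame(rng.normal(size=(150, 5)), columns=list("abcde"))
y = 5.0 * X["a"] + rng.normal(scale=0.1, size=150)  # only "a" matters

model = GradientBoostingRegressor(random_state=8001).fit(X, y)
selector = SelectFromModel(model, prefit=True)

# Boolean mask over the input columns; True marks a kept feature.
mask = selector.get_support()
selected = X.columns[mask].tolist()
print(selected)
```

With the default threshold, a feature is kept when its importance is at least the mean importance across all features.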



In [11]:
# columns that should not get shifted copies
not_lagged = ['year', 'weekofyear']
for column in list(sj_features.columns):
    if column in not_lagged:
        continue
    for i in range(5):
        new_var_name = column + "_lag_" + str(i + 1)
        # NOTE: a negative shift takes the value from (i+1) rows ahead,
        # i.e. a later week; shift(i+1) would take it from an earlier week
        sj_features[new_var_name] = sj_features[column].shift(-(i+1))
        iq_features[new_var_name] = iq_features[column].shift(-(i+1))
        sj_test_final[new_var_name] = sj_test_final[column].shift(-(i+1))
        iq_test_final[new_var_name] = iq_test_final[column].shift(-(i+1))
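The direction of `shift` is easy to get backwards, so here is a small check on the first four `station_precip_mm` values shown in the table above. It is only an illustration of how pandas behaves, not a change to the cell:

```python
import pandas as pd

weekly = pd.Series([16.0, 8.6, 41.4, 4.0], name="station_precip_mm")

lagged = weekly.shift(1)     # value from the previous week (a lag)
leading = weekly.shift(-1)   # value from the following week (a lead)

print(pd.DataFrame({"current": weekly,
                    "shift(1)": lagged,
                    "shift(-1)": leading}))
```

A positive shift leaves a `NaN` in the first row, a negative shift leaves one in the last row.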



In [22]:
sj_features.head()

Unnamed: 0,year,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,reanalysis_avg_temp_k,reanalysis_dew_point_temp_k,...,station_min_temp_c_lag_2,station_min_temp_c_lag_3,station_min_temp_c_lag_4,station_min_temp_c_lag_5,station_precip_mm_lag_1,station_precip_mm_lag_2,station_precip_mm_lag_3,station_precip_mm_lag_4,station_precip_mm_lag_5,xgb_pred
0,1990,18,0.585062,0.410758,0.36055,0.205706,-0.514378,-1.281232,-1.252891,-1.713274,...,22.8,23.3,23.9,23.9,8.6,41.4,4.0,5.8,39.1,8
1,1990,19,1.016554,0.829255,-0.274076,-0.190959,-0.282134,-0.764552,-0.678411,-0.733529,...,23.3,23.9,23.9,23.3,41.4,4.0,5.8,39.1,29.7,5
2,1990,20,-0.23915,1.164397,-0.36467,0.084296,-0.020413,-0.303353,-0.320827,0.211615,...,23.9,23.9,23.3,22.8,4.0,5.8,39.1,29.7,21.1,5
3,1990,21,0.640101,1.949147,0.871286,1.250091,-0.448725,-0.136906,-0.033588,0.132397,...,23.9,23.3,22.8,22.8,5.8,39.1,29.7,21.1,21.1,4
4,1990,22,1.256474,2.135629,1.286618,1.455392,-0.623801,0.293084,0.323996,0.458372,...,23.3,22.8,22.8,24.4,39.1,29.7,21.1,21.1,1.1,6


In [14]:
# randomly split each city's data into a training set and a test set

sj_train, sj_test, sj_train_target, sj_test_target = train_test_split(sj_features, sj_labels, test_size=0.2, random_state=41)

iq_train, iq_test, iq_train_target, iq_test_target = train_test_split(iq_features, iq_labels, test_size=0.5, random_state=41)
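Because the rows are weekly observations, a chronological split is a common alternative to a random one: the model is then always evaluated on weeks later than anything it trained on. A minimal sketch on toy data, assuming the rows are already sorted by time (the frame and the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy weekly data, already sorted in time order.
features = pd.DataFrame({"week": np.arange(10), "x": np.arange(10) * 0.5})
labels = pd.Series(np.arange(10) * 2, name="total_cases")

test_fraction = 0.2  # same proportion as the San Juan split above
cut = int(len(features) * (1 - test_fraction))

# Everything before the cut is training data, everything after is test data.
train_X, test_X = features.iloc[:cut], features.iloc[cut:]
train_y, test_y = labels.iloc[:cut], labels.iloc[cut:]
print(len(train_X), len(test_X))
```

`train_test_split(..., shuffle=False)` gives the same kind of split without manual slicing.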



#### XGBoost regressor fit on the training split



In [15]:
def xboostRegressor(city_feat, city_labels):
    '''
    Build and fit an XGBoost regressor for one city.
    '''
    xgbr = xgb.XGBRegressor(n_estimators=750,        # number of boosted trees
                            learning_rate=0.003057,  # step size shrinkage, used to prevent overfitting
                            max_depth=10,            # maximum depth of each tree
                            subsample=0.75,          # fraction of training rows sampled per tree
                            colsample_bytree=0.75,   # fraction of columns sampled per tree
                            gamma=.025)              # minimum loss reduction needed to split a node
    xgbr.fit(city_feat, city_labels)
    return xgbr

In [16]:
sj_model = xboostRegressor(sj_train, sj_train_target)
iq_model = xboostRegressor(iq_train, iq_train_target)

In [17]:
sj_pred = sj_model.predict(sj_test)
score = metrics.mean_absolute_error(sj_test_target,sj_pred)
print(score)

14.2202519787


In [18]:
iq_pred = iq_model.predict(iq_test)
score = metrics.mean_absolute_error(iq_test_target,iq_pred)
print(score)

6.01931702804
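Both scores above are mean absolute errors, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. As a worked example of the arithmetic only, here is the same computation on the first five San Juan labels and the first five `xgb_pred` values printed elsewhere in this notebook; it is not a model evaluation:

```python
import numpy as np

y_true = np.array([4, 5, 4, 3, 6])   # first five total_cases values
y_pred = np.array([8, 5, 5, 4, 6])   # first five xgb_pred values

# Absolute errors are 4, 0, 1, 1, 0, so the mean is 6 / 5.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 1.2
```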


In [19]:
# adding predictions to features
sj_pred = sj_model.predict(sj_features)
iq_pred = iq_model.predict(iq_features)
sj_pred_final = sj_model.predict(sj_test_final)
iq_pred_final = iq_model.predict(iq_test_final)


# convert from float to int (int() truncates toward zero)
sj_pred = [int(i) for i in sj_pred]
iq_pred = [int(i) for i in iq_pred]
sj_pred_final = [int(i) for i in sj_pred_final]
iq_pred_final = [int(i) for i in iq_pred_final]


# store the XGBoost predictions as an extra feature column
sj_features['xgb_pred'] = sj_pred
iq_features['xgb_pred'] = iq_pred

sj_test_final['xgb_pred'] = sj_pred_final
iq_test_final['xgb_pred'] = iq_pred_final
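Note that `int()` drops the fractional part rather than rounding. A small comparison on made-up values, with rounding to the nearest integer and clipping at zero shown as one possible alternative for count data (this is a suggestion, not what the cell above does):

```python
import numpy as np

raw = np.array([4.9, 5.2, -0.7, 12.6])   # hypothetical raw predictions

# int() truncates toward zero.
truncated = [int(v) for v in raw]

# Round to the nearest integer, then floor negative counts at zero.
rounded = np.clip(np.rint(raw), 0, None).astype(int)

print(truncated)
print(rounded.tolist())
```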




In [20]:
sj_features.head()

Unnamed: 0,year,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,reanalysis_avg_temp_k,reanalysis_dew_point_temp_k,...,station_min_temp_c_lag_2,station_min_temp_c_lag_3,station_min_temp_c_lag_4,station_min_temp_c_lag_5,station_precip_mm_lag_1,station_precip_mm_lag_2,station_precip_mm_lag_3,station_precip_mm_lag_4,station_precip_mm_lag_5,xgb_pred
0,1990,18,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,297.742857,292.414286,...,22.8,23.3,23.9,23.9,8.6,41.4,4.0,5.8,39.1,8
1,1990,19,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,298.442857,293.951429,...,23.3,23.9,23.9,23.3,41.4,4.0,5.8,39.1,29.7,5
2,1990,20,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,298.878571,295.434286,...,23.9,23.9,23.3,22.8,4.0,5.8,39.1,29.7,21.1,5
3,1990,21,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,299.228571,295.31,...,23.9,23.3,22.8,22.8,5.8,39.1,29.7,21.1,21.1,4
4,1990,22,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,299.664286,295.821429,...,23.3,22.8,22.8,24.4,39.1,29.7,21.1,21.1,1.1,6


In [24]:
# normalize the features table
from sklearn import preprocessing


# the shifted columns end in NaNs; forward-fill them before scaling
sj_features.ffill(inplace=True)
iq_features.ffill(inplace=True)
sj_test_final.ffill(inplace=True)
iq_test_final.ffill(inplace=True)


# standardize every column except the time indices and the XGBoost prediction;
# note that each table is scaled with its own mean and standard deviation
notnorm = ['year', 'weekofyear', 'xgb_pred']
for column in sj_features:
    if column not in notnorm:
        sj_features[column] = preprocessing.scale(sj_features[column])
        iq_features[column] = preprocessing.scale(iq_features[column])
        sj_test_final[column] = preprocessing.scale(sj_test_final[column])
        iq_test_final[column] = preprocessing.scale(iq_test_final[column])
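An alternative to scaling each table on its own is to learn the mean and standard deviation from the training data and reuse them for the test data, so both live on the same scale. A minimal sketch with `StandardScaler` on a toy column (the values and the column name `temp_k` are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"temp_k": [297.0, 298.0, 299.0, 300.0]})
test = pd.DataFrame({"temp_k": [301.0, 302.0]})

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train[["temp_k"]])

# Apply the same statistics to both tables.
train_scaled = scaler.transform(train[["temp_k"]])
test_scaled = scaler.transform(test[["temp_k"]])
print(test_scaled.ravel())
```

Here the test values stay above the training mean after scaling, which would be lost if the test table were standardized separately.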





In [25]:
sj_labels.head()

0    4
1    5
2    4
3    3
4    6
Name: total_cases, dtype: int64

In [26]:
# split data again to train this model
sj_train, sj_test, sj_train_target, sj_test_target = train_test_split(sj_features, sj_labels, test_size=0.2, random_state=41)

iq_train, iq_test, iq_train_target, iq_test_target = train_test_split(iq_features, iq_labels, test_size=0.5, random_state=41)




In [27]:
iq_feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(iq_features)
sj_feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(sj_features)



In [28]:
# Build a 5-layer DNN with 512, 256, 128, 256 and 512 units respectively
iq_regressor = tf.contrib.learn.DNNRegressor(feature_columns=iq_feature_columns, 
                                            hidden_units=[512, 256, 128, 256, 512], 
                                            optimizer=tf.train.AdamOptimizer(
                                                learning_rate=.003
                                            ))
sj_regressor = tf.contrib.learn.DNNRegressor(feature_columns=sj_feature_columns, 
                                            hidden_units=[512, 256, 128, 256, 512],
                                            optimizer=tf.train.AdamOptimizer(
                                                learning_rate=.003
                                            ))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x121e3a5c0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': None}
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x121e3a6a0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
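`tf.contrib` was removed in TensorFlow 2.x, so the cell above only runs on a 1.x install. As a rough, swapped-in illustration of the same shape of network, here is a sketch using scikit-learn's `MLPRegressor` instead of `DNNRegressor`. It is not equivalent to the TensorFlow model: the layers are scaled down to keep it fast, and the data is synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 6))
y = X[:, 0] * 2.0 + X[:, 1]

# Five hidden layers in the same wide-narrow-wide pattern, scaled down.
net = MLPRegressor(hidden_layer_sizes=(32, 16, 8, 16, 32),
                   solver="adam",
                   learning_rate_init=0.003,
                   max_iter=300,
                   random_state=0)
net.fit(X, y)
pred = net.predict(X)
print(pred.shape)
```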

In [29]:
#fitting regressor iq
iq_regressor.fit(iq_train, iq_train_target, steps=1000)
#fitting regressor sj
sj_regressor.fit(sj_train, sj_train_target, steps=1000)

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Sav

DNNRegressor(params={'head': <tensorflow.contrib.learn.python.learn.estimators.head._RegressionHead object at 0x11da9d940>, 'hidden_units': [512, 256, 128, 256, 512], 'feature_columns': (_RealValuedColumn(column_name='', dimension=123, default_value=None, dtype=tf.float64, normalizer=None),), 'optimizer': <tensorflow.python.training.adam.AdamOptimizer object at 0x11da9d748>, 'activation_fn': <function relu at 0x11cff7378>, 'dropout': None, 'gradient_clip_norm': None, 'embedding_lr_multipliers': None, 'input_layer_min_slice_size': None})

In [30]:
iq_predictions = list(iq_regressor.predict(iq_test, as_iterable=True))
iq_predictions = [int(i) for i in iq_predictions]
score = metrics.mean_absolute_error(iq_test_target, iq_predictions)
print("Mean Error: {0:f}".format(score))
#iq_labels_test.total_cases

Instructions for updating:
Please switch to predict_scores, or set `outputs` argument.
INFO:tensorflow:Restoring parameters from /var/folders/98/8l5yvjyn5hlbh6nxjddn37jm0000gn/T/tmpsixz7c5d/model.ckpt-1000
Mean Error: 6.126923


In [31]:
sj_predictions = list(sj_regressor.predict(sj_test, as_iterable=True))
sj_predictions = [int(i) for i in sj_predictions]
score = metrics.mean_absolute_error(sj_test_target, sj_predictions)
print("Mean Error: {0:f}".format(score))

INFO:tensorflow:Restoring parameters from /var/folders/98/8l5yvjyn5hlbh6nxjddn37jm0000gn/T/tmpzhnebcoy/model.ckpt-1000
Mean Error: 13.696809


In [32]:
sj_pred_final = list(sj_regressor.predict(sj_test_final, as_iterable=True))
iq_pred_final = list(iq_regressor.predict(iq_test_final, as_iterable=True))

INFO:tensorflow:Restoring parameters from /var/folders/98/8l5yvjyn5hlbh6nxjddn37jm0000gn/T/tmpzhnebcoy/model.ckpt-1000
INFO:tensorflow:Restoring parameters from /var/folders/98/8l5yvjyn5hlbh6nxjddn37jm0000gn/T/tmpsixz7

In [33]:
sj_pred_final = [int(k) for k in sj_pred_final]
iq_pred_final = [int(k) for k in iq_pred_final]

In [34]:
submission = pd.read_csv("data/dengue_labels_test.csv",
                         index_col=[0, 1, 2])


submission.total_cases = np.concatenate([sj_pred_final, iq_pred_final])
submission.to_csv("submission/submission_stacked.csv")
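The concatenation relies on the San Juan rows coming first in the template and on the prediction count matching the number of rows. A small sketch of the same assignment with an explicit length check, using an in-memory toy template rather than the real file (all values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy submission template indexed by city, year, weekofyear.
template = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [2008, 2008, 2010],
    "weekofyear": [18, 19, 26],
    "total_cases": [0, 0, 0],
}).set_index(["city", "year", "weekofyear"])

sj_pred = [5, 7]   # hypothetical San Juan predictions
iq_pred = [2]      # hypothetical Iquitos predictions

# San Juan first, then Iquitos, matching the row order of the template.
combined = np.concatenate([sj_pred, iq_pred])
if len(combined) != len(template):
    raise ValueError("prediction count does not match the template")
template["total_cases"] = combined
print(template)
```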


In [None]:
#submission