# An RNN model for temperature data

This time we will be working with real data: daily (Tmin, Tmax) temperature series from 36 weather stations spanning 50 years. Note that a fairly good predictor already exists for temperatures: the average of the temperatures recorded on the same day of the year over the N previous years. It is not clear whether RNNs can do better, but we will see how far they can go.
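
To make the baseline concrete, here is a minimal sketch of the "same day of the year, averaged over N previous years" predictor. It runs on a synthetic series, not on the dataset used in this notebook. The function name `climatology_baseline`, the 365-day year (leap days assumed dropped) and the noise level are all assumptions made for illustration.

```python
import numpy as np

def climatology_baseline(temps, n_years, days_per_year=365):
    """Predict each day as the mean of the same calendar day over the
    n_years preceding years.

    temps: 1-D array of daily values whose length is a multiple of
           days_per_year (assumption: leap days have been dropped).
    Returns an array of shape (total_years - n_years, days_per_year).
    """
    by_year = np.asarray(temps, dtype=float).reshape(-1, days_per_year)
    n_total = by_year.shape[0]
    # for year y, average the n_years rows that come just before it
    preds = [by_year[y - n_years:y].mean(axis=0) for y in range(n_years, n_total)]
    return np.stack(preds)

# Synthetic example: 10 years of a yearly sine wave plus Gaussian noise
rng = np.random.RandomState(0)
days = np.arange(10 * 365)
series = 10.0 + 12.0 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, days.size)

preds = climatology_baseline(series, n_years=5)
actual = series.reshape(-1, 365)[5:]  # the years that have 5 years of history
rmse = np.sqrt(np.mean((preds - actual) ** 2))
print("baseline predictions shape:", preds.shape)
print("baseline RMSE on synthetic data: {:.2f}".format(rmse))
```

Any RNN trained below would need to beat the error of a predictor like this one on held-out stations to be worth its cost.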

In [None]:
!pip install tensorflow==1.15.3

In [None]:
import math
import sys
import time
import numpy as np

sys.path.insert(0, "temperatures/utils/") # so Python can find the utils_ modules
import utils_batching
import utils_args

import tensorflow as tf
from tensorflow.python.lib.io import file_io as gfile
print("Tensorflow version: " + tf.__version__)

In [None]:
from matplotlib import pyplot as plt
import utils_prettystyle
import utils_display

## Download data

In [None]:
%%bash
DOWNLOAD_DIR=temperatures/data
mkdir -p "$DOWNLOAD_DIR"
gsutil -m cp gs://cloud-training-demos/courses/machine_learning/deepdive/09_sequence/temperatures/* $DOWNLOAD_DIR

## Hyperparameters

N_FORWARD = 1: works, but the model struggles to predict from some positions

N_FORWARD = 4: better, but still occasionally bad

N_FORWARD = 8: works perfectly
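
As a rough illustration of what `N_FORWARD` controls, the sketch below builds one training pair in which the target window is the input window shifted `n_forward` steps into the future. The helper `make_shifted_pairs` is hypothetical; the batching code in `utils_batching` may build its sequences differently.

```python
import numpy as np

def make_shifted_pairs(sequence, seqlen, n_forward):
    """Cut one input window and its target window from a sequence.
    The target is the same window shifted n_forward steps into the future."""
    sequence = np.asarray(sequence)
    if len(sequence) < seqlen + n_forward:
        raise ValueError("sequence too short for this seqlen and n_forward")
    inputs = sequence[:seqlen]
    targets = sequence[n_forward:n_forward + seqlen]
    return inputs, targets

data = np.arange(20)  # stand-in for a temperature series
x, y = make_shifted_pairs(data, seqlen=8, n_forward=4)
print(x)  # time steps 0 to 7
print(y)  # time steps 4 to 11
```

With `n_forward=1` the target at each position is just the next input value; larger values ask the network to look further ahead at every step.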

In [None]:
NB_EPOCHS = 5 # number of times the model sees all the data during training

N_FORWARD = 8 # train the network to predict N steps in advance (traditionally 1)
RESAMPLE_BY = 5 # averaging period in days (training on daily data is too much)
RNN_CELLSIZE = 128 # size of the RNN cells
N_LAYERS = 2 # number of stacked RNN cells
SEQLEN = 128 # unrolled sequence length
BATCHSIZE = 64 # mini-batch size
DROPOUT_KEEP = 0.7 # probability of neurons not being dropped (should be between 0.5 and 1)
ACTIVATION = tf.nn.tanh # activation function for GRU cells (tf.nn.relu or tf.nn.tanh)

JOB_DIR = "checkpoints"
DATA_DIR = "temperatures/data"

# potentially override some settings from command-line arguments
if __name__ == "__main__":
    JOB_DIR, DATA_DIR = utils_args.read_args1(JOB_DIR, DATA_DIR)
    
ALL_FILEPATTERN = DATA_DIR + "/*.csv" # pattern matches all 1666 files
EVAL_FILEPATTERN = DATA_DIR + "/USC000*2.csv" # pattern matches 8 files
# pattern USW*.csv -> 298 files, pattern USW*0.csv -> 28 files
print("Reading data from '{}'. \nWriting checkpoints to '{}'".format(DATA_DIR, JOB_DIR))
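
The `RESAMPLE_BY` setting averages the daily data over fixed periods. A minimal sketch of such a resampling step is shown here, assuming non-overlapping blocks and dropping any incomplete trailing block. The function `resample_by_average` is hypothetical and is not the implementation in `utils_batching`.

```python
import numpy as np

def resample_by_average(values, period):
    """Average consecutive blocks of `period` samples along axis 0.
    Trailing samples that do not fill a whole block are dropped."""
    values = np.asarray(values, dtype=float)
    n_blocks = values.shape[0] // period
    trimmed = values[:n_blocks * period]
    # group samples into (n_blocks, period, ...) and average within each block
    return trimmed.reshape((n_blocks, period) + values.shape[1:]).mean(axis=1)

# 12 days of made-up (Tmin, Tmax) pairs
daily = np.stack([np.arange(12.0), np.arange(12.0) + 10.0], axis=1)
resampled = resample_by_average(daily, period=5)
print(resampled.shape)
print(resampled)
```

Averaging over 5 days shortens each sequence by a factor of 5, so a `SEQLEN` of 128 steps covers 640 days of data.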

## Temperature data

This is what our temperature dataset looks like: sequences of daily (Tmin, Tmax) from 1960 to 2010. They have been cleaned up and any missing values have been filled in by interpolation. Interpolated regions of the dataset are marked in red on the graph.
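
The dataset's cleaning procedure is not described here, so the following is only a sketch of how gaps could be filled while keeping an `is_interpolated` flag. It assumes linear interpolation and at least one valid value in the series; the function `fill_gaps` is hypothetical.

```python
import numpy as np

def fill_gaps(values):
    """Linearly interpolate NaN entries of a 1-D series.
    Returns (filled_values, is_interpolated); the flag marks filled points."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    idx = np.arange(values.size)
    filled = values.copy()
    # interpolate at the missing positions from the known positions
    filled[missing] = np.interp(idx[missing], idx[~missing], values[~missing])
    return filled, missing

tmax = np.array([20.0, np.nan, np.nan, 26.0, 25.0])
filled, flag = fill_gaps(tmax)
print(filled)
print(flag)
```

Keeping the flag alongside the values is what allows interpolated regions to be highlighted on a plot or excluded from evaluation.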

In [None]:
all_filenames = gfile.get_matching_files(ALL_FILEPATTERN)
eval_filenames = gfile.get_matching_files(EVAL_FILEPATTERN)
train_filenames = list(set(all_filenames) - set(eval_filenames))

# By default, this utility function loads all the files and places their
# data as-is in an array, one file per row. Later, we will use it
# to shape the dataset as needed for training.
ite = utils_batching.rnn_multistation_sampling_temperature_sequencer(eval_filenames)
evtemps, _, evdates, _, _ = next(ite) # gets everything

print("Pattern '{}' matches {} files".format(ALL_FILEPATTERN, len(all_filenames)))
print("Pattern '{}' matches {} files".format(EVAL_FILEPATTERN, len(eval_filenames)))
print("Evaluation files: {}".format(len(eval_filenames)))
print("Training files: {}".format(len(train_filenames)))
print("Initial shape of the evaluation dataset: " + str(evtemps.shape))
print("{} files, {} data points per file, {} values per data point"
      " (Tmin, Tmax, is_interpolated) ".format(evtemps.shape[0], evtemps.shape[1], evtemps.shape[2]))