# Will it Rain Tomorrow?
### Deep Learning with Neural Networks and RNNs

For the third component of this portfolio, I wanted to explore a less abstract question about North Carolina's weather than the previous two projects. Unless someone is a time traveler, emerging from a coma, or visiting from the Southern Hemisphere, they probably have a reasonable sense of approximately what season it is. In this project, I address a much more practical and common weather question: *Will it rain tomorrow?*

To answer that question, I build two deep learning models: a standard neural network to predict whether it will rain tomorrow from all the available information on the current day's weather (14 features), and a recurrent neural network (RNN) that examines whether it has rained each day over the past week in order to predict whether it will rain each day of the following week.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

from sklearn.utils import shuffle

# workaround for MacOS/jupyter notebook bug w/ tensorflow
# https://www.programmersought.com/article/69923598438/
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

## Import Data

We are once again exploring weather data for the RDU airport in NC's Research Triangle region since 2000, as covered in `data/Data-Wrangling.ipynb` and `data/Data-Visualization.ipynb`. When we import our data, we drop variables that are missing for large portions of the data (`STP` and `GUST`) or likely irrelevant to the prediction (`DAY`). We add an indicator column (`RAIN`) for whether it rained that day.

In [2]:
np.random.seed(4471)
# https://www.tensorflow.org/tutorials/load_data/numpy
# https://www.tensorflow.org/tutorials/load_data/pandas_dataframe
weather_pd = pd.read_csv('../data/weather.csv', index_col = 0)
weather_pd = weather_pd.drop(['DAY', 'STP', 'GUST'], axis=1)

In [3]:
# whether it rained that day
# bool to int conversion: https://stackoverflow.com/questions/17506163/how-to-convert-a-boolean-array-to-an-int-array
weather_pd['RAIN'] = (weather_pd['PRCP'] > 0).astype(int)

In [4]:
weather_pd.head()

Unnamed: 0,YEAR,MONTH,SEASON,TEMP,DEWP,SLP,VISIB,WDSP,MXSPD,MAX,MIN,PRCP,SNDP,RAIN
0,2000,1,0,47.6,38.1,1023.7,8.3,3.0,10.1,66.9,33.1,0.0,0.0,0
1,2000,1,0,55.3,46.3,1024.2,9.5,4.8,14.0,70.0,33.1,0.0,0.0,0
2,2000,1,0,62.6,55.4,1021.3,8.4,8.5,14.0,73.9,43.0,0.0,0.0,0
3,2000,1,0,65.2,58.6,1014.4,9.5,15.3,28.0,73.9,55.0,0.0,0.0,0
4,2000,1,0,45.7,30.9,1019.8,9.8,6.4,11.1,57.9,37.0,0.34,0.0,1


In [5]:
weather_pd.dtypes

YEAR        int64
MONTH       int64
SEASON      int64
TEMP      float64
DEWP      float64
SLP       float64
VISIB     float64
WDSP      float64
MXSPD     float64
MAX       float64
MIN       float64
PRCP      float64
SNDP      float64
RAIN        int64
dtype: object

# Part 1: Today's Weather -> Tomorrow's Rain

Our first neural network takes all our known information about today's weather (14 features) and builds a model to predict whether it will rain tomorrow.

## Divide into Train/Test Sets

We train our model on 90% of the data and test it on the other 10%.

In [6]:
# adapted from original code in project 2
def divide_data(weather):
    '''divide dataset into two sets: 90% train and 10% test'''
    n = weather.shape[0]
    
    # shuffle data for test/train so no patterns
    # https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
    weather = shuffle(weather)
    
    # take out 10% of the data for testing
    # https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html
    ind_test = np.random.choice(n, size = n // 10, replace = False)
    weather_test = weather.iloc[ind_test]

    # take the other 90% for building the model
    # https://stackoverflow.com/questions/27824075/accessing-numpy-array-elements-not-in-a-given-index-list
    ind_train = [x for x in range(n) if x not in ind_test] # not in index
    weather_train = weather.iloc[ind_train]

    return weather_test, weather_train

In [7]:
weather_test, weather_train = divide_data(weather_pd)
print("test: ", weather_test.shape)
print("train:", weather_train.shape)
weather_train.head()

test:  (777, 14)
train: (6995, 14)


Unnamed: 0,YEAR,MONTH,SEASON,TEMP,DEWP,SLP,VISIB,WDSP,MXSPD,MAX,MIN,PRCP,SNDP,RAIN
1303,2003,7,2,79.5,72.2,1021.6,9.4,6.2,13.0,90.0,69.1,0.02,0.0,1
3355,2009,3,0,70.8,41.4,1015.0,10.0,9.7,15.9,84.0,53.1,0.0,0.0,0
52,2000,2,0,39.9,28.0,1033.2,8.9,1.1,8.0,57.0,27.0,0.0,0.0,0
2575,2007,1,0,40.6,32.0,1019.2,4.6,5.1,11.1,54.0,27.1,0.23,0.0,1
705,2001,12,3,59.1,46.5,1025.2,9.9,4.9,17.1,75.9,36.0,0.0,0.0,0


## Divide Each Set into Features/Targets

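For both the training set and testing set, our predictive features are the current day's weather, and the target output is whether it will rain tomorrow. Thus, we shift the target values over by one index, pairing each row with the `RAIN` value of the row that follows it. Because `divide_data` has already shuffled the rows, that following row is not necessarily the next calendar day, which is a limitation to keep in mind when reading the results.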

In [8]:
def separate_targets(weather):
    '''separate dataset into features and targets'''
    # target: whether next day rains
    target = weather[['RAIN']].iloc[1:, :]
    target = np.round(target.to_numpy().reshape(-1))

    # feature: today's weather (array of 14 vars)
    feature = weather.iloc[:-1].to_numpy()
    
    return feature, target

In [9]:
feature_test, target_test = separate_targets(weather_test)
feature_train, target_train = separate_targets(weather_train)

print("feature:", feature_test.shape, "target:", target_test.shape)
print("feature:", feature_train.shape, "target:", target_train.shape)

feature: (776, 14) target: (776,)
feature: (6994, 14) target: (6994,)
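
The pairing can also be done on the chronologically ordered data before any shuffling, so that every row's target is the rain flag of the actual following day. The following is a minimal sketch on toy data, not the code that produced the results in this notebook; `next_day_pairs` and `shuffled_split` are hypothetical helper names, and the single toy feature is simply the day number.

```python
import numpy as np

def next_day_pairs(features, rain):
    """Pair each day's features with the following day's rain flag.

    Assumes both arrays are in chronological order.
    """
    features = np.asarray(features)
    rain = np.asarray(rain)
    return features[:-1], rain[1:]

def shuffled_split(X, y, test_fraction=0.1, seed=4471):
    """Shuffle the (feature, target) pairs together, then split them."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])

# Toy data: 21 days, where the single feature is just the day number.
toy_rain = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
                     0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
toy_features = np.arange(len(toy_rain)).reshape(-1, 1)

X, y = next_day_pairs(toy_features, toy_rain)
(train_X, train_y), (test_X, test_y) = shuffled_split(X, y)

# Every target is still the rain flag of the day after its feature row.
print(np.array_equal(train_y, toy_rain[train_X[:, 0] + 1]))
```

Shuffling the pairs rather than the raw rows keeps the train/test split random while preserving the today-to-tomorrow relationship.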


## Build the Model

Our neural network has an input layer (not shown), a 64-node hidden layer followed by dropout, and an output layer with one node per possible outcome.

In [10]:
# how many possible outputs
num_output_vals = len(np.unique(target_test))

# num nodes in first layer
first_layer = 64

In [11]:
def build_model(first_layer, num_output_vals):
    ''' build NN model'''
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(first_layer, activation='sigmoid', dtype='float64'),
        tf.keras.layers.Dropout(0.2, dtype='float64'),
        tf.keras.layers.Dense(num_output_vals, activation='sigmoid', dtype='float64')
    ])
    
    model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
    return model

In [12]:
model = build_model(first_layer, num_output_vals)

## Example Model Output

As an example, we can apply our model to a day of weather from the training dataset. We observe that the model output has two values, corresponding with our two possible outcomes: dry or rainy. These results will not be accurate because we have not yet trained our model.

In [13]:
example = model(feature_train[88].reshape(1,14))






In [14]:
example.shape

TensorShape([1, 2])

In [15]:
example

<tf.Tensor: id=57, shape=(1, 2), dtype=float64, numpy=array([[0.59816531, 0.41840464]])>

## Train the Model

Next, we train the model on our training set.

In [16]:
EPOCHS = 30

In [17]:
history = model.fit(feature_train, target_train, epochs=EPOCHS)

Train on 6994 samples
Epoch 1/30
...
Epoch 30/30


In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                multiple                  960       
_________________________________________________________________
dropout (Dropout)            multiple                  0         
_________________________________________________________________
dense_1 (Dense)              multiple                  130       
Total params: 1,090
Trainable params: 1,090
Non-trainable params: 0
_________________________________________________________________
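
The parameter counts in the summary follow from the layer sizes: a dense layer has one weight per input-output pair plus one bias per output node. A quick check, assuming 14 input features, 64 hidden nodes, and 2 outputs as above (`dense_params` is a hypothetical helper, not part of Keras):

```python
def dense_params(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output node."""
    return n_inputs * n_outputs + n_outputs

hidden = dense_params(14, 64)   # the "dense" layer
output = dense_params(64, 2)    # the "dense_1" layer
print(hidden, output, hidden + output)
```

The dropout layer contributes no parameters, so the two dense layers account for the full total.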


## Predict for an Example Day

Now that the model is trained, we can predict whether it will rain tomorrow for any day in the dataset.

In [19]:
pred = model.predict(feature_test[88].reshape(1,14))
print(sum(sum(pred)))
print(pred.argmax())

1.588809437302793
0
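
The two outputs sum to about 1.59 rather than 1 because the output layer uses a sigmoid activation, which squashes each node independently. A softmax activation would instead normalize the pair into probabilities that sum to 1. The sketch below compares the two on hypothetical pre-activation values; `logits` is made up for illustration and is not taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Squash each value independently into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Normalize each row into probabilities that sum to 1."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical pre-activation values for [dry, rain]
logits = np.array([[0.4, -0.3]])

sig = sigmoid(logits)
soft = softmax(logits)
print("sigmoid:", sig, "sum =", sig.sum())
print("softmax:", soft, "sum =", soft.sum())
```

Both activations pick the same most likely class here, so `argmax` predictions are unaffected; the difference is only in whether the outputs can be read as probabilities.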


## Evaluate on Test Set

Finally, we can evaluate the performance of our model on the test set.

In [20]:
model.evaluate(feature_test, target_test, verbose=2)

776/1 - 0s - loss: 0.7206 - accuracy: 0.6392


[0.6544783852764011, 0.63917524]

We obtain a modest accuracy of about 64%. This is to be expected, since the weather can change rapidly in 24 hours. Additionally, this model has very little sense of movement through time, so it cannot learn general weather trends outside a 24-hour period. Below, we explore using a recurrent neural network to harness the sequential nature of our weather data in our modeling.
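
One way to put the 64% in context is to compare it with a trivial baseline that always predicts the more common outcome. A minimal sketch on made-up labels follows; `majority_baseline_accuracy` is a hypothetical helper, and a real check would pass `target_train` and `target_test` instead of the toy lists.

```python
import numpy as np

def majority_baseline_accuracy(train_targets, test_targets):
    """Accuracy of always predicting the most common training label."""
    train_targets = np.asarray(train_targets).astype(int)
    test_targets = np.asarray(test_targets).astype(int)
    majority = int(np.bincount(train_targets).argmax())
    accuracy = float(np.mean(test_targets == majority))
    return majority, accuracy

# Made-up labels for illustration only (0 = dry, 1 = rain).
toy_train = [0, 0, 0, 1, 1]
toy_test = [0, 1, 0, 0]
majority, accuracy = majority_baseline_accuracy(toy_train, toy_test)
print(majority, accuracy)
```

If the model's accuracy is not clearly above this baseline, it has learned little beyond the overall frequency of rain.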

# Part 2: Last Week's Rain -> This Week's Rain

Our second neural network takes the pattern of rain over the past week and predicts whether it will rain the following day. It is a recurrent neural network (RNN): its GRU layer carries a hidden state from one day to the next, so each prediction can draw on the days that came before it. We can also run this model iteratively, reusing our predictions as inputs, to generate rain predictions for the entire next week of weather.

## Create Features and Targets

For this neural network, we will only use the `RAIN` column of our weather data, which indicates whether or not it rained on a particular day.

The following code prepares weeks of data on which to train our RNN.

In [21]:
# extract rain column
rain = weather_pd[['RAIN']].to_numpy().reshape(-1)
rain.shape

(7772,)

In [22]:
# week length
seq_length = 7
examples_per_epoch = len(rain)//(seq_length+1)

# convert to tf
rain_tf = tf.data.Dataset.from_tensor_slices(rain)

# example data
for i in rain_tf.take(5):
  print(i.numpy())

0
0
0
0
1


In [23]:
# turn individual days into week sequences
weeks = rain_tf.batch(seq_length+1, drop_remainder=True)

for day in weeks.take(5):
  print(day.numpy())

[0 0 0 0 1 0 1 1]
[1 1 1 0 0 0 0 0]
[0 1 1 1 0 0 1 1]
[1 1 0 0 0 1 1 0]
[0 0 0 0 0 0 0 0]


In [24]:
# input week is days 0-6, target week is days 1-7
def split_input_target(week):
    '''duplicate and shift weeks to form input and target days'''
    input_days = week[:-1]
    target_days = week[1:]
    return input_days, target_days

dataset = weeks.map(split_input_target)

In [25]:
# view example inputs and targets
for input_example, target_example in  dataset.take(3):
    print ('Input data: ', input_example.numpy())
    print ('Target data:', target_example.numpy())

Input data:  [0 0 0 0 1 0 1]
Target data: [0 0 0 1 0 1 1]
Input data:  [1 1 1 0 0 0 0]
Target data: [1 1 0 0 0 0 0]
Input data:  [0 1 1 1 0 0 1]
Target data: [1 1 1 0 0 1 1]
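
The same chunk-and-shift logic can be written directly in NumPy, which may make the windowing easier to inspect. This sketch assumes non-overlapping 8-day chunks, as in the `batch(seq_length+1, drop_remainder=True)` call above; `make_week_windows` is a hypothetical helper, and the sample values are the first 18 days printed above, so the last two days form an incomplete chunk that is dropped.

```python
import numpy as np

def make_week_windows(rain, seq_length=7):
    """Split a daily rain series into (input, target) week pairs."""
    rain = np.asarray(rain)
    chunk = seq_length + 1                 # 7 input days plus 1 extra day
    n_chunks = len(rain) // chunk          # incomplete final chunk is dropped
    weeks = rain[: n_chunks * chunk].reshape(n_chunks, chunk)
    # inputs are days 0-6 of each chunk, targets are days 1-7
    return weeks[:, :-1], weeks[:, 1:]

sample_rain = [0, 0, 0, 0, 1, 0, 1, 1,
               1, 1, 1, 0, 0, 0, 0, 0,
               0, 1]
inputs, targets = make_week_windows(sample_rain)
print(inputs)
print(targets)
```

Each target row is its input row shifted one day forward, so the model is asked for a next-day prediction at every position in the week.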


In [26]:
# view example input and output
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input:", input_idx.numpy())
    print("  expected output:", target_idx.numpy())

Step    0
  input: 0
  expected output: 1
Step    1
  input: 1
  expected output: 1
Step    2
  input: 1
  expected output: 1
Step    3
  input: 1
  expected output: 0
Step    4
  input: 0
  expected output: 0


In [27]:
# create training batches of size 4
BATCH_SIZE = 4

# buffer size to shuffle dataset
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((4, 7), (4, 7)), types: (tf.int64, tf.int64)>

## Build the Model

We build our recurrent neural network with an embedding layer to process our inputs, a recurrent hidden layer, and an output layer.

In [28]:
# how many possible outputs
num_output_vals = len(np.unique(rain))

# embedding dimension
embedding_dim = 256

# number of RNN units
rnn_units = 16

In [29]:
def build_model_RNN(num_output_vals, embedding_dim, rnn_units, batch_size):
    '''build an RNN'''
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(num_output_vals, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(num_output_vals)
    ])
    return model

In [30]:
modelRNN = build_model_RNN(num_output_vals, embedding_dim, rnn_units, BATCH_SIZE)

## Example Model Output

As an example, we can apply our model to one batch from the training dataset. We observe that the model output comes in batches of 4 weeks, each 7 days long, with two values per day corresponding to the two possible output states, raining or not raining.

These results will not be accurate because we have not yet trained our model.

In [31]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = modelRNN(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, num_output_vals)")

(4, 7, 2) # (batch_size, sequence_length, num_output_vals)


In [32]:
# example prediction
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

array([1, 0, 0, 0, 1, 0, 0])

In [33]:
# show input and output
print("Input: \n", input_example_batch[0].numpy())
print()
print("Next Day Predictions: \n", sampled_indices)

Input: 
 [0 0 1 1 1 1 1]

Next Day Predictions: 
 [1 0 0 0 1 0 0]


## Prepare Model for Training

Now that we have applied our model to at least one batch of data, we can view the structure of our neural network. We also need to compile our model and configure checkpoints to save our training progress.

In [34]:
modelRNN.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (4, None, 256)            512       
_________________________________________________________________
gru (GRU)                    (4, None, 16)             13152     
_________________________________________________________________
dense_2 (Dense)              (4, None, 2)              34        
Total params: 13,698
Trainable params: 13,698
Non-trainable params: 0
_________________________________________________________________
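
The parameter counts above can be reproduced by hand. The GRU figure matches the formula for a GRU with three gates, each having an input kernel, a recurrent kernel, and two bias vectors; that this is the layer's configuration here is my assumption, inferred from the total in the summary. The helper names below are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per possible input value."""
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    """Three gates, each with input kernel, recurrent kernel, and two biases.

    Assumption: separate input and recurrent bias vectors per gate.
    """
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output node."""
    return n_inputs * n_outputs + n_outputs

counts = [embedding_params(2, 256), gru_params(256, 16), dense_params(16, 2)]
print(counts, sum(counts))
```

Almost all of the parameters sit in the GRU layer, largely because of the 256-dimensional embedding feeding into it.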


In [35]:
# set up loss function
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
example_batch_loss = loss(target_example_batch, example_batch_predictions)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, days_in_batches, num_output_vals)")
print("Mean loss:        ", mean_loss)

Prediction shape:  (4, 7, 2)  # (batch_size, days_in_batches, num_output_vals)
Mean loss:         0.691676


In [36]:
# exponential of the mean loss should ~= num outputs
tf.exp(mean_loss).numpy()

1.9970598

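As we configure our loss function, we notice that the exponential of the mean loss is about 2, the number of possible outputs. Equivalently, the loss is close to ln 2 ≈ 0.693, the value produced by a model that assigns equal probability to rain and no rain. The untrained model is therefore not confidently wrong at initialization, which is a good sign.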
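
The arithmetic behind this check is short: a model that gives each of the two outcomes probability 1/2 has a cross-entropy loss of ln 2. A small sketch using the mean loss printed above:

```python
import math

num_outputs = 2

# Cross-entropy of a model that assigns probability 1/2 to each outcome
uniform_loss = -math.log(1 / num_outputs)

# Mean loss printed by the untrained model above
observed_loss = 0.691676

print(uniform_loss)
print(math.exp(observed_loss))
```

A loss well above ln 2 at this stage would suggest the untrained model was starting out strongly biased toward wrong answers.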

In [37]:
# compile the model
modelRNN.compile(optimizer='adam', loss=loss)

In [38]:
# configure checkpoints to save during training
# directory where the checkpoints will be saved
checkpoint_dir = './training-checkpoints'
# name checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

## Train Model

Now, our model is ready to be trained on our data.

In [39]:
EPOCHS = 30

In [40]:
history = modelRNN.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/30
...
Epoch 30/30


## Restore Latest Checkpoint

Since we saved the checkpoints of our model, we do not need to re-train the model every time we open this notebook. We can restore the model from our last checkpoint, this time with a batch size of 1, to obtain a complete model that is ready to make predictions for any given week of rain patterns.

In [41]:
def restore_model(num_output_vals, embedding_dim, rnn_units):
    '''restore model from save'''
    # path of the most recent checkpoint saved during training
    latest = tf.train.latest_checkpoint(checkpoint_dir)

    # rebuild the same architecture with a batch size of 1, then load the weights
    modelRNN = build_model_RNN(num_output_vals, embedding_dim, rnn_units, batch_size=1)
    modelRNN.load_weights(latest)
    modelRNN.build(tf.TensorShape([1, None]))
    
    return modelRNN

modelRNN = restore_model(num_output_vals, embedding_dim, rnn_units)

modelRNN.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            512       
_________________________________________________________________
gru_1 (GRU)                  (1, None, 16)             13152     
_________________________________________________________________
dense_3 (Dense)              (1, None, 2)              34        
Total params: 13,698
Trainable params: 13,698
Non-trainable params: 0
_________________________________________________________________


## Generate Rain Predictions for Next Week

Now we can evaluate our model by generating rain predictions for next week! Starting with whether it has rained over each of the past seven days, our model will predict whether it will rain tomorrow. Then, assuming tomorrow is the current day, our model will predict whether it will rain the following day, and so on. Like a real weather forecast, this will likely get less accurate the further out we predict.

In [42]:
# an example week - it has not rained in the past 7 days
# as of May 20, 2021
example_week = np.array([[0, 0, 0, 0, 0, 0, 0]])
example_week.shape

(1, 7)

In [43]:
def generate_weather(model, start_weather):
    '''generate rain predictions for the next 7 days'''
    # Evaluation step (generating weather using the learned model)

    # Number of days to generate
    num_generate = 7

    # Empty list to store our results
    days_generated = []

    # Sampling temperature (not air temperature) scales the model's outputs.
    # Lower temperatures give more predictable results.
    # Higher temperatures give more surprising results.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(start_weather)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # sample the next day's rain state from a categorical distribution over the scaled model outputs
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        
        # we pass the predicted weather as the next input to the model
        # along with the previous hidden state
        start_weather = tf.expand_dims([predicted_id], 0)

        days_generated.append(predicted_id)

    return days_generated

In [44]:
print("start days:         ", list(example_week[0]))
print("predicted next days:", generate_weather(modelRNN, start_weather=example_week))

start days:          [0, 0, 0, 0, 0, 0, 0]
predicted next days: [0, 0, 0, 0, 1, 0, 0]
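
The `temperature` setting in `generate_weather` controls how random the sampled forecast is: dividing the model's outputs by a smaller value sharpens the distribution, and dividing by a larger value flattens it. A minimal NumPy sketch with hypothetical values follows; the `logits` and the `rain_probabilities` helper are made up for illustration and do not come from the trained model.

```python
import numpy as np

def rain_probabilities(logits, temperature=1.0):
    """Softmax of temperature-scaled logits for [dry, rain]."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [1.0, 0.0]  # hypothetical: the model leans toward "dry"
for temperature in (0.5, 1.0, 2.0):
    probs = rain_probabilities(logits, temperature)
    print(temperature, probs.round(3))

# Drawing one day from the distribution at temperature 1.0
rng = np.random.default_rng(0)
sampled_day = int(rng.choice(2, p=rain_probabilities(logits, 1.0)))
print(sampled_day)
```

Because each day is sampled rather than chosen by `argmax`, re-running the generation can produce a different week of predictions.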


This RNN predicts that it will be clear for the next 7 days except for 5 days from now. Currently, my weather app (Dark Sky) is predicting clear weather most of this week with about a 50% chance of rain 5 and 7 days from now, which is fairly close to what my model predicts! This could easily be by chance, since weather forecasting is fairly inaccurate so far out and each prediction here is a random sample from the model's output. Still, it is encouraging that the model expects a day of rain after so many clear days, which is a very normal weather pattern for North Carolina.

## Conclusion

It seems unlikely that either of these neural networks is extremely accurate, since we are only building from the most common weather features, on data from only the past two decades, from one single location in North Carolina. In practice, meteorology is surely much more complicated than throwing some weather data at a neural network and seeing what pops out. However, for this portfolio project, both models achieved encouraging results. The first model may have only been 64% accurate, but 64% is better than a coin flip (a stricter comparison would be against always predicting the more common outcome). And the results our RNN generated are fairly similar to my weather app's current forecast, an encouraging sign that it started to learn North Carolina's weather trends.

In future work, I would like to explore whether feeding more data into these neural networks could produce more accurate results. In the first neural network, I could try predicting whether it will rain tomorrow based on all the weather features of the past week, rather than just the weather features of the current day. In the RNN, I could explore different training sequence sizes and batch sizes for learning North Carolina's rain trends. There certainly is plenty more I would like to explore with deep learning!