# Predicting Droughts with Meteorological Data
### Pre-processing and then saving data as files to save time




## Loading the training data

We load the CSV file paths for the training and test sets into the ``files`` dictionary.

In [1]:
import numpy as np
import pandas as pd
import json
import os
import time
from tqdm.auto import tqdm
from datetime import datetime
from scipy.interpolate import interp1d
from sklearn.preprocessing import RobustScaler

from loader import *
from normalizer import *
from viz_report import *

files = {'test': 'test_timeseries.csv',
        'train': 'train_timeseries.csv'}

Now we'll define a helper method to load the datasets. It reads each requested CSV file into a DataFrame indexed by ``fips`` and ``date``.

In [2]:
# read csvs only when we need to create new data
def get_dfs(files, file_list):
    return {k: pd.read_csv(files[k]).set_index(['fips', 'date'])
            for k in file_list}

In [3]:
dfs = get_dfs(files, ['test', 'train'])

We encode the day of year using sin/cos and add the data loading function `loadXY`.
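
The sin/cos encoding mentioned above can be sketched as follows. This is a minimal illustration, not the actual code behind `loadXY`: the function name `encode_day_of_year`, the date format, and the period of 366 days are assumptions made for the example.

```python
import numpy as np
from datetime import datetime

def encode_day_of_year(date_str, fmt="%Y-%m-%d"):
    """Map a date string to (sin, cos) of its position in the year.

    Hypothetical helper: name, date format and the 366-day period are assumptions.
    """
    day = datetime.strptime(date_str, fmt).timetuple().tm_yday
    angle = 2 * np.pi * day / 366  # 366 keeps the leap day inside one cycle
    return np.sin(angle), np.cos(angle)

sin_jan, cos_jan = encode_day_of_year("2019-01-01")
sin_dec, cos_dec = encode_day_of_year("2019-12-31")

# Distance between the two encoded points on the unit circle
year_wrap_distance = np.hypot(sin_jan - sin_dec, cos_jan - cos_dec)
```

The point of using two coordinates is that 31 December and 1 January end up close together, whereas a raw day number would place them at opposite ends of the scale.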

Now we add a helper to normalise the data.
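
The internals of `Normalizer` are not shown in this notebook. Since `RobustScaler` is imported, one plausible reading is per-feature scaling by median and interquartile range, fitted on the training data only. The NumPy-only sketch below illustrates that idea; the class `RobustNormalizerSketch` and its methods are hypothetical and do not reproduce the `Normalizer` API.

```python
import numpy as np

class RobustNormalizerSketch:
    """Hypothetical stand-in: per-feature median/IQR scaling of windowed data."""

    def fit(self, X_time):
        # Collapse samples and time steps so each feature gets one statistic.
        flat = X_time.reshape(-1, X_time.shape[-1])
        self.median_ = np.median(flat, axis=0)
        q75, q25 = np.percentile(flat, [75, 25], axis=0)
        iqr = q75 - q25
        # Guard against constant features to avoid division by zero.
        self.iqr_ = np.where(iqr == 0, 1.0, iqr)
        return self

    def transform(self, X_time):
        # Broadcasting applies the per-feature statistics to every time step.
        return (X_time - self.median_) / self.iqr_

rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, size=(100, 14, 3))  # synthetic (samples, days, features)
valid = rng.normal(5.0, 2.0, size=(20, 14, 3))

scaler = RobustNormalizerSketch().fit(train)  # statistics come from train only
train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)        # reuse the training statistics
```

Fitting on the training split and reusing the statistics elsewhere mirrors the `fit=True` call on the training data and the plain `normalize` call on the validation data.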

We can now load our training data set, where X consists of static (soil) and time (meteorological) data and Y consists of the future drought values.

In [4]:
WINDOW_SIZE = 14
X_static_train, X_time_train, y_target_train = loadXY(dfs, "train",
                                                     window_size=WINDOW_SIZE)
print("train shape", X_time_train.shape)

normer = Normalizer()
X_static_train, X_time_train = normer.normalize(X_static_train, X_time_train, fit=True, with_var=True)


loaded 1366099 samples
train shape (1366099, 14, 21)



<div class="alert alert-block alert-info"><b>Note:</b> The previous blocks have (mostly) been copied from the Kaggle challenge starter notebook. My own contributions (plotting functions, models, etc.) begin here.</div>

In [5]:
print('First line of above data frame:')
print(X_time_train[0][0])
print('...')
print('Last line of above data frame:')
print(X_time_train[0][-1])
print()
print('Labels to be predicted:')
print(y_target_train[0])

First line of above data frame:
[ 0.52218561  0.80873286 -0.20999945 -0.42301141 -0.19271136 -0.1922419
 -0.36496363 -0.43362918  0.42423737 -0.41377866 -0.59771856 -0.65405067
 -0.13323254 -0.76847603 -0.41170154 -0.56177661 -0.3475337  -0.27081785
  2.13890924  0.11372939  0.97378225]
...
Last line of above data frame:
[-0.11189692  0.82182974 -0.85245538 -1.12233853 -1.26041428 -1.24217844
 -1.05793255 -1.06844306  0.16683492 -1.09781127 -0.51546371 -0.58715912
 -0.08327034 -0.69660417 -0.38980252 -0.82298484  0.18748529 -1.24678408
  2.13890924  0.31874101  0.9212121 ]

Labels to be predicted:
[2. 1. 1. 1. 1. 1.]


To create ``X_train``, we flatten the time data, pretending there is no temporal component, and then concatenate the static soil data.

In [6]:
print(X_time_train.shape)
X_train = X_time_train.reshape(len(X_time_train), -1)  # one flat row per window
print(X_train.shape)
print(X_static_train.shape)
X_train = np.concatenate((X_train, X_static_train), axis=1)
print(X_train.shape)

(1366099, 14, 21)
(1366099, 294)
(1366099, 30)
(1366099, 324)


``round_and_intify()`` rounds interpolated drought values like 1.21 into clean integers between 0 and 5.
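
The implementation of ``round_and_intify()`` is not shown here. A minimal sketch of what such a helper could look like is given below; the name `round_and_intify_sketch` and the clipping to the range 0 to 5 are assumptions based on the description above.

```python
import numpy as np

def round_and_intify_sketch(y):
    """Round to the nearest integer and clip to the drought classes 0..5.

    Hypothetical helper, written only from the description in the text.
    """
    return np.clip(np.rint(y), 0, 5).astype(int)

rounded = round_and_intify_sketch(np.array([1.21, 4.6, 5.4, -0.2]))
```

Clipping matters because interpolated or predicted values can fall slightly outside the valid label range.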


``bold()`` wraps a string in **bold** modifiers for printing.

``plot_confusion_matrix()`` plots a single seaborn confusion matrix.

``plot_confusion_matrices()`` plots a series of six seaborn confusion matrices.

``summarize()`` prints a series of confusion matrices from (rounded) true and predicted y values.

``macro_f1()`` just returns the macro F1 score
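
For reference, the macro F1 score is the unweighted mean of the per-class F1 scores. The NumPy sketch below spells out that definition; `macro_f1_sketch` is a hypothetical name, and the actual ``macro_f1()`` helper may instead call a library routine such as scikit-learn's `f1_score` with `average="macro"`.

```python
import numpy as np

def macro_f1_sketch(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either array."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.union1d(y_true, y_pred)
    scores = []
    for label in labels:
        tp = np.sum((y_true == label) & (y_pred == label))
        fp = np.sum((y_true != label) & (y_pred == label))
        fn = np.sum((y_true == label) & (y_pred != label))
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero because
        # every label occurs at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))

score = macro_f1_sketch([0, 0, 1, 2], [0, 1, 1, 2])
```

Because every class counts equally, rare severe drought levels weigh as much as the common "no drought" class, which is why this metric suits imbalanced labels.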

## Loading the validation data

Here we load the validation data (read from the ``test`` file) and flatten the time series, just like for the training data.

Then, we concatenate the fixed-size data on soil quality etc. to the flattened features.

In [7]:
X_static_valid, X_time_valid, y_target_valid = loadXY(dfs, "test",
                                                     window_size=WINDOW_SIZE)
print("test shape", X_time_valid.shape)
X_static_valid, X_time_valid = normer.normalize(X_static_valid, X_time_valid)


loaded 149685 samples
test shape (149685, 14, 21)



To create ``X_valid``, we flatten the time data, pretending there is no temporal component, and then concatenate the static data.

In [8]:
print(X_time_valid.shape)

X_valid = X_time_valid.reshape(len(X_time_valid), -1)  # one flat row per window

print(X_valid.shape)
print(X_static_valid.shape)

X_valid = np.concatenate((X_valid, X_static_valid), axis=1)

print(X_valid.shape)

(149685, 14, 21)
(149685, 294)
(149685, 30)
(149685, 324)


In [9]:
np.save('X_test' + str(WINDOW_SIZE) + 'wvar', X_valid)
np.save('y_test' + str(WINDOW_SIZE) + 'wvar', y_target_valid)



## Saving the data to files

In [9]:
# save the arrays as .npy files so we don't have to rebuild them
# (np.save appends the .npy extension to each file name)
np.save('X_train_' + str(WINDOW_SIZE) + 'wvar', X_train)
np.save('y_train_' + str(WINDOW_SIZE) + 'wvar', y_target_train)
np.save('X_valid_' + str(WINDOW_SIZE) + 'wvar', X_valid)
np.save('y_valid_' + str(WINDOW_SIZE) + 'wvar', y_target_valid)



In [10]:
#validate
# from loader.py package use np_load
#test1 = np_load('X_train')

In [11]:
#print(X_train.shape)
#print(test1.shape)
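
The commented-out check above relies on `np_load` from `loader.py`, which is not shown here. As an alternative, a round trip can be verified with plain NumPy. The sketch below uses a small stand-in array and a temporary directory, both chosen only for illustration.

```python
import os
import tempfile
import numpy as np

WINDOW_SIZE = 14
X_demo = np.arange(12, dtype=float).reshape(3, 4)  # stand-in for X_train

with tempfile.TemporaryDirectory() as tmp:
    stem = os.path.join(tmp, "X_train_" + str(WINDOW_SIZE) + "wvar")
    np.save(stem, X_demo)            # np.save appends ".npy" to the name
    saved_name = os.listdir(tmp)[0]
    X_back = np.load(stem + ".npy")  # np.load needs the full file name

# True when shape and values survived the round trip
same = np.array_equal(X_demo, X_back)
```

Note that `np.save` adds the `.npy` extension automatically, while `np.load` expects it to be spelled out.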
