# Imports

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import numpy as np
from tqdm import tqdm
import dask.dataframe as dd

from keijzer import setup_multi_gpus, create_corr_matrix, reduce_memory, resample_df

%matplotlib inline
# Keep the figure background white so plots stay readable with the dark theme in JupyterLab
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
sns.set()

# Creating an artificial dataframe

The cells below create a DataFrame whose values are known in advance.  
This makes each step in this notebook easier to validate than it would be with the real data.  
The DataFrame has 50 rows and 3 columns.

In [9]:
# Create some artificial data to make it easier to see what the reshaping actually does

# Create three numpy arrays
a = np.arange(0, 50, 1).reshape(50,1) # 0 to 49 in steps of 1
b = np.arange(2, 102, 2).reshape(50,1) # 2 to 100 in steps of 2
c = np.arange(0.5, 25.5, 0.5).reshape(50,1) # 0.5 to 25.0 in steps of 0.5

# Concatenate them together into one ndarray
data = np.concatenate([a,b,c], axis=1)
data[:5] # Print the first 5 rows

array([[ 0. ,  2. ,  0.5],
       [ 1. ,  4. ,  1. ],
       [ 2. ,  6. ,  1.5],
       [ 3. ,  8. ,  2. ],
       [ 4. , 10. ,  2.5]])

In [10]:
data.shape

(50, 3)

In [11]:
# Make it a Pandas DataFrame, because the original (real dwelling) data is still a dataframe at this point
data = pd.DataFrame(data, columns=['a', 'b', 'c'])
data.head()

     a     b    c
0  0.0   2.0  0.5
1  1.0   4.0  1.0
2  2.0   6.0  1.5
3  3.0   8.0  2.0
4  4.0  10.0  2.5
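
Because the three columns are generated from simple ranges, they are related by fixed formulas, which gives a quick way to check any row later on. A minimal sketch, assuming the same `np.arange` calls as above (the variable name `check` is only used here for illustration):

```python
import numpy as np
import pandas as pd

# Rebuild the artificial data with the same ranges as in the cells above
a = np.arange(0, 50, 1)
b = np.arange(2, 102, 2)
c = np.arange(0.5, 25.5, 0.5)
check = pd.DataFrame({'a': a, 'b': b, 'c': c}, dtype=float)

# Columns 'b' and 'c' are linear functions of 'a', so any row can be validated by hand
assert np.allclose(check['b'], 2 * check['a'] + 2)
assert np.allclose(check['c'], 0.5 * check['a'] + 0.5)

print(check.shape)              # (50, 3)
print(check.iloc[-1].tolist())  # [49.0, 100.0, 25.0]
```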


# Create a train & test set
This section gives a general idea of how the data is transformed.

## Define train size and target column number

In [12]:
train_size = 0.1 # Use 10% of the data as the train set so it is easy to display in this notebook
target_column = 2 # the column index value of the target column

split_idx = int(data.shape[0]*train_size) # index number to split at
split_idx

5

## Split the data
The DataFrame `data` is converted to a NumPy array with `data.values`.  
The array is then split by slicing along its (rows, columns) axes.

In [13]:
# ...train
X_train = data.values[:split_idx, :target_column]
y_train = data.values[:split_idx, target_column]

# ...test
X_test = data.values[split_idx:, :target_column]
y_test = data.values[split_idx:, target_column]

## Validating the row index where the data is split

In [14]:
X_train

array([[ 0.,  2.],
       [ 1.,  4.],
       [ 2.,  6.],
       [ 3.,  8.],
       [ 4., 10.]])

In [15]:
X_test[:5] # print only the first 5

array([[ 5., 12.],
       [ 6., 14.],
       [ 7., 16.],
       [ 8., 18.],
       [ 9., 20.]])

Remember, there are 50 rows and the train size is 0.1.  
So the dataset should be split at row index $50 \cdot 0.1 = 5$.  
The train set ends at row 4 and the test set starts at row 5, so the split is at the correct point.
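
The same check can be written as assertions instead of reading the printed arrays. This is a small sketch on a rebuilt copy of the artificial data; the name `values` is introduced here for illustration and stands in for `data.values`:

```python
import numpy as np

# Stand-in for data.values: the same 50 x 3 artificial array
values = np.column_stack([
    np.arange(0, 50, 1),
    np.arange(2, 102, 2),
    np.arange(0.5, 25.5, 0.5),
])

train_size = 0.1
target_column = 2
split_idx = int(values.shape[0] * train_size)

X_train = values[:split_idx, :target_column]
X_test = values[split_idx:, :target_column]

# No rows are lost or duplicated by the split
assert len(X_train) + len(X_test) == len(values)
# Column 'a' counts up by 1, so the first test row must directly follow the last train row
assert X_test[0, 0] == X_train[-1, 0] + 1

print(split_idx, X_train.shape, X_test.shape)  # 5 (5, 2) (45, 2)
```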

## Taking a look at the shape of the train set

In [16]:
X_train.shape # 5 rows by 2 columns

(5, 2)

In [17]:
X_train

array([[ 0.,  2.],
       [ 1.,  4.],
       [ 2.,  6.],
       [ 3.,  8.],
       [ 4., 10.]])

In [18]:
y_train.shape # 5 values

(5,)

In [19]:
y_train

array([0.5, 1. , 1.5, 2. , 2.5])

## Get the data in CNN/RNN format
In RNN terms:  
- format = (samples, timesteps, features)  
- or in other words (samples, lookback, features)

This translates in CNN terms to:  
- shape = (height, width, channels)

In other words, let's make small tables/images of the historical data that is used to predict the next value of the target column.

## Reshaping the train data
### Visual explanation

Let $A$ be a (5 $\times$ 3) matrix, so 5 rows by 3 columns.  
There are 2 features $X$: column 1 is the first feature $X_1$ and column 2 is the second feature $X_2$.  
Column 3 is the target $y$.  
  

$A = \begin{bmatrix}
0 & 2 & 0.5 \\ 
1 & 4 & 1.0 \\ 
2 & 6 & 1.5 \\ 
3 & 8 & 2.0 \\ 
4 & 10 & 2.5 
\end{bmatrix}$

Or, more precisely, in NumPy this has the form:

$ A = \begin{bmatrix}
\begin{bmatrix} 0 & 2 & 0.5 \end{bmatrix} \\ 
\begin{bmatrix} 1 & 4 & 1.0 \end{bmatrix} \\ 
\begin{bmatrix} 2 & 6 & 1.5 \end{bmatrix} \\ 
\begin{bmatrix} 3 & 8 & 2.0 \end{bmatrix} \\ 
\begin{bmatrix} 4 & 10 & 2.5 \end{bmatrix}
\end{bmatrix}$

This matrix is obtained with `df.values`.  
To prepare it for use with a CNN/RNN, it has to be split into $X$ and $y$,  
where $X$ has the shape $(samples, timesteps, features)$ for an RNN, or $(height, width, channels)$ for a CNN.  
For time series data the two can practically be used in the same way, so let's just pick the RNN terminology for now.  
Timesteps can be interpreted as 'lookback', i.e. the number of timesteps looked back to predict the $y$ value at a given point.  

The next step is to split $A$ into the feature matrix $X$ and the target vector $y$.

$ X = \begin{bmatrix}
\begin{bmatrix} 0 & 2 \end{bmatrix} \\ 
\begin{bmatrix} 1 & 4 \end{bmatrix} \\ 
\begin{bmatrix} 2 & 6 \end{bmatrix} \\ 
\begin{bmatrix} 3 & 8 \end{bmatrix} \\ 
\begin{bmatrix} 4 & 10 \end{bmatrix}
\end{bmatrix}$ $ \ \ \ \ \ \ \ \ \ \    y = \begin{bmatrix} 0.5 & 1.0 & 1.5 & 2.0 & 2.5 \end{bmatrix}$

Now say that timesteps (or lookback) equals 2.  
The idea is to use the two previous $X$ values to predict the current $y$ value.   

Then $\begin{bmatrix}
\begin{bmatrix} 0 & 2  \end{bmatrix} \\ 
\begin{bmatrix} 1 & 4 \end{bmatrix}
\end{bmatrix}$ will be used to predict $\begin{bmatrix}
\begin{bmatrix} 1.5 \end{bmatrix}
\end{bmatrix}$.  

Note that $\begin{bmatrix}
\begin{bmatrix} 0.5  \end{bmatrix} \\ 
\begin{bmatrix} 1.0 \end{bmatrix}
\end{bmatrix}$ cannot be predicted because the previous two $X$ values for them are not available.

To get this done, $X$ has to be reshaped to the shape $(samples, timesteps, features)$.

$ X = \begin{bmatrix} \begin{bmatrix}
\begin{bmatrix} 0 & 2 \end{bmatrix} \\
\begin{bmatrix} 1 & 4 \end{bmatrix}
\end{bmatrix} \\ 
\begin{bmatrix}
\begin{bmatrix} 1 & 4 \end{bmatrix} \\
\begin{bmatrix} 2 & 6 \end{bmatrix}
\end{bmatrix} \\
\begin{bmatrix}
\begin{bmatrix} 2 & 6 \end{bmatrix} \\
\begin{bmatrix} 3 & 8 \end{bmatrix}
\end{bmatrix} \end{bmatrix}  \ \ \ \ \ $ to predict  $ \ \ \ \ \     y = \begin{bmatrix} 1.5 & 2.0 & 2.5 \end{bmatrix}$

Finally, $X$ and $y$ can be used as input for a CNN/RNN network.
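
The worked example above can be reproduced with plain slicing. A minimal sketch, using the 5-row example matrix and a lookback of 2 (the names `X_windows` and `y_targets` are introduced here for illustration):

```python
import numpy as np

A = np.array([
    [0, 2, 0.5],
    [1, 4, 1.0],
    [2, 6, 1.5],
    [3, 8, 2.0],
    [4, 10, 2.5],
])
X, y = A[:, :2], A[:, 2]
look_back = 2

# Window i holds rows i .. i + look_back - 1 and is paired with y[i + look_back]
X_windows = np.stack([X[i:i + look_back] for i in range(len(X) - look_back)])
y_targets = y[look_back:]

print(X_windows.shape)  # (3, 2, 2) = (samples, timesteps, features)
print(y_targets)        # [1.5 2.  2.5]
```

Note that the last row of $X$, `[4, 10]`, does not appear in any window, because there is no later $y$ value it could help predict.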

### The code

In [20]:
# Define variables
look_back = 2 # look back 2 steps
n_features = 2 # using 2 features
output_dim = 1 # to predict 1 y value

In [21]:
samples = len(X_train) # total number of rows in the train set (before windowing)
samples

5

In [22]:
samples_train = X_train.shape[0] - look_back 
samples_train

3

In [23]:
# Define zeros array with the target shape
X_train_reshaped = np.zeros((samples_train, look_back, n_features))
y_train_reshaped = np.zeros((samples_train))
X_train_reshaped.shape

(3, 2, 2)

In [24]:
X_train_reshaped

array([[[0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.]]])

In [25]:
# Create the reshaped train data
for i in range(samples_train):
    y_position = i + look_back
    X_train_reshaped[i] = X_train[i:y_position]
    y_train_reshaped[i] = y_train[y_position]

In [26]:
X_train_reshaped.shape

(3, 2, 2)

In [27]:
X_train_reshaped

array([[[0., 2.],
        [1., 4.]],

       [[1., 4.],
        [2., 6.]],

       [[2., 6.],
        [3., 8.]]])

In [28]:
X_train

array([[ 0.,  2.],
       [ 1.,  4.],
       [ 2.,  6.],
       [ 3.,  8.],
       [ 4., 10.]])

In [29]:
y_train

array([0.5, 1. , 1.5, 2. , 2.5])

In [30]:
# Do the same for the test data
samples_test = X_test.shape[0] - look_back
X_test_reshaped = np.zeros((samples_test, look_back, n_features))
y_test_reshaped = np.zeros((samples_test))

for i in range(samples_test):
    y_position = i + look_back
    X_test_reshaped[i] = X_test[i:y_position]
    y_test_reshaped[i] = y_test[y_position]
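
The loop above can also be written without explicit iteration. The sketch below is an alternative, not the approach used in this notebook; it assumes NumPy 1.20 or newer for `sliding_window_view`, and the helper name `make_windows` is hypothetical:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(X, y, look_back):
    """Return (samples, look_back, features) windows and their matching targets."""
    # sliding_window_view puts the window axis last: (n - look_back + 1, features, look_back)
    windows = sliding_window_view(X, window_shape=look_back, axis=0)
    # Move the window axis to the middle and drop the last window, which has no target
    X_reshaped = windows.transpose(0, 2, 1)[:-1].copy()
    y_reshaped = y[look_back:].copy()
    return X_reshaped, y_reshaped


# The 45-row test split of the artificial data
X_test = np.column_stack([np.arange(5, 50), np.arange(12, 102, 2)]).astype(float)
y_test = np.arange(3.0, 25.5, 0.5)

X_w, y_w = make_windows(X_test, y_test, look_back=2)
print(X_w.shape, y_w.shape)  # (43, 2, 2) (43,)
```

The explicit loop is easier to read and to step through, which is why it suits a walkthrough like this one; the vectorised version mainly pays off on long series.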

## Pandas DataFrame to CNN & RNN format
Let's turn the steps explained above into a general function that transforms a DataFrame into the required CNN/RNN format.

In [None]:
from sklearn.preprocessing import StandardScaler # needed when scale_X=True


def df_to_cnn_rnn_format(df, test_size=0.5, look_back=5, target_column='target', scale_X=True):
    """
    Input is a Pandas DataFrame.
    Output is a np array in the format of (samples, timesteps, features).
    Currently this function only accepts one target variable.

    Usage example:

    # variables
    df = data # should be a pandas dataframe
    test_size = 0.5 # fraction of the rows to use for testing, the rest is used for training
    target_column = 'c' # target column name, all other columns are taken as features
    scale_X = False
    look_back = 5 # Number of previous X values to look at when predicting the current y value
    """
    # Make sure the target column is the last column in the dataframe.
    # Reordering the columns (instead of copying to a new 'target' column and dropping
    # the original) also works when the target column is already named 'target'.
    feature_columns = [col for col in df.columns if col != target_column]
    df = df[feature_columns + [target_column]]

    target_location = df.shape[1] - 1 # column index number of target
    # The last test_size fraction of the rows is the test set, everything before it is the train set
    split_index = df.shape[0] - int(df.shape[0]*test_size) # the index at which to split df into train and test

    values = df.values

    # ...train
    X_train = values[:split_index, :target_location]
    y_train = values[:split_index, target_location]

    # ...test
    X_test = values[split_index:, :target_location]
    y_test = values[split_index:, target_location]

    # Scale the features, fitting the scaler on the train set only
    if scale_X:
        scalerX = StandardScaler(with_mean=True, with_std=True).fit(X_train)
        X_train = scalerX.transform(X_train)
        X_test = scalerX.transform(X_test)

    # Reshape the arrays
    num_features = target_location # All columns before the target column are features

    samples_train = X_train.shape[0] - look_back
    X_train_reshaped = np.zeros((samples_train, look_back, num_features)) # Initialize the required shape with an 'empty' zeros array.
    y_train_reshaped = np.zeros((samples_train))

    for i in range(samples_train):
        y_position = i + look_back
        X_train_reshaped[i] = X_train[i:y_position]
        y_train_reshaped[i] = y_train[y_position]

    samples_test = X_test.shape[0] - look_back
    X_test_reshaped = np.zeros((samples_test, look_back, num_features))
    y_test_reshaped = np.zeros((samples_test))

    for i in range(samples_test):
        y_position = i + look_back
        X_test_reshaped[i] = X_test[i:y_position]
        y_test_reshaped[i] = y_test[y_position]

    return X_train_reshaped, y_train_reshaped, X_test_reshaped, y_test_reshaped
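
The scaler in the function is fitted on `X_train` only and then applied to both splits, so no statistics from the test set leak into training. The sketch below mimics that step with plain NumPy; it assumes the default `StandardScaler` behaviour of subtracting the column mean and dividing by the population standard deviation, and the array values are made up for illustration:

```python
import numpy as np

X_train = np.array([[0.0, 2.0], [1.0, 4.0], [2.0, 6.0], [3.0, 8.0]])
X_test = np.array([[4.0, 10.0], [5.0, 12.0]])

# "Fit": the statistics come from the train set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": the same statistics are applied to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
```

The scaled test set will generally not have zero mean or unit variance, which is expected because it is scaled with the train statistics.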

In [None]:
# Print just to see what DataFrame we're dealing with again
data.head()

In [None]:
# Test the function
X_train, y_train, X_test, y_test = df_to_cnn_rnn_format(data, test_size=0.5, look_back=5, target_column='c', scale_X=False)

In [None]:
X_train, y_train
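
As a sanity check on the call above, the expected output shapes can be worked out by hand. This sketch only does the arithmetic and does not call the function; the variable names are chosen for illustration:

```python
n_rows, n_columns = 50, 3   # shape of the artificial DataFrame
look_back = 5
split_index = n_rows // 2   # a 0.5 split puts 25 rows in each half

n_features = n_columns - 1  # every column except the target
samples_train = split_index - look_back
samples_test = (n_rows - split_index) - look_back

print((samples_train, look_back, n_features))  # (20, 5, 2)
print((samples_test, look_back, n_features))   # (20, 5, 2)
```

So `X_train.shape` and `X_test.shape` should both be `(20, 5, 2)`, each with 20 matching target values.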