## Overview
* Countries: United States, France, Germany, Japan, United Kingdom, Italy, Canada
* Time period: 1950-2018 (69 years)
* Target variable: `ngdp_rpch` for annual data; `ngdp_r_sa_pcha` and `ngdp_r_sa_pchy` for quarterly data
* Train-test split: 1950-2009 (train, ≤ x years, depending on data availability), x - y (test, z years)   
  _Needs further discussion. The dataset is divided by x/y here, as in the working paper. For the ML model family such a split is not required._

## Import packages

In [1]:
# Module 1: Importing the libraries

import tensorflow as tf
from tensorflow import keras

# Print all outputs in a code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Importing the libraries
import numpy as np
import pandas as pd
import random
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline


# from tf.random import set_seed

from keras.models import Sequential
from keras.callbacks import EarlyStopping
from keras.callbacks import ReduceLROnPlateau

from keras.callbacks import ModelCheckpoint

# from keras.callbacks import ResetStatesCallback()

from keras.layers import Conv1D
from keras.layers import SimpleRNN
from keras.layers import LSTM
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten,Dense
from keras.utils import to_categorical

Using TensorFlow backend.


In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # Use the %tensorflow_version magic if in colab.
  %tensorflow_version 2.x
except Exception:
  pass

In [3]:
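# Import under an alias so that the standard-library `random` module imported earlier is not shadowed
from tensorflow import random as tf_random
# from tensorflow.random import set_seed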

In [4]:
# Set Seed

seed_global = 42

# Source: https://machinelearningmastery.com/reproducible-results-neural-networks-keras/

from numpy.random import seed
seed(seed_global)

# Gives an error (ImportError) under TensorFlow 2.x:
# from tensorflow import set_random_seed
# set_random_seed(seed_global)

# Source: https://stackoverflow.com/questions/58638701/importerror-cannot-import-name-set-random-seed-from-tensorflow-c-users-po

tf.random.set_seed(seed_global)

# Copy paste this code snippet in every model code chunk 
seed(seed_global)
tf.random.set_seed(seed_global)

# --Ignore--
# tf.random.set_seed(seed)
# # This is giving me an error

# #  Global Seed
# # random.seed (2019) 
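
The seeding pattern above can be wrapped in one helper so that each model cell calls a single function. This is a minimal sketch: `set_global_seeds` is a hypothetical name, and only the Python and NumPy generators are seeded here so that the snippet runs without TensorFlow. In the notebook, `tf.random.set_seed` would be called in the same place.

```python
import random

import numpy as np


def set_global_seeds(seed_value):
    """Seed the Python and NumPy random generators (hypothetical helper)."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    # In the notebook, tf.random.set_seed(seed_value) would also be called here.


set_global_seeds(42)
first_draw = np.random.rand(3)

set_global_seeds(42)
second_draw = np.random.rand(3)

# Re-seeding before each draw reproduces the same random numbers.
draws_match = bool(np.allclose(first_draw, second_draw))
```

Calling the helper at the top of every model cell keeps the runs comparable even when cells are executed out of order.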

## Get data

In [5]:
%%bigquery gdp_quarterly_q

SELECT *
FROM `deep-nexus.temp_for_imf_data.WEO_G7_Quarterly`
ORDER BY time

In [6]:
# Convert the integer quarter index into a "YYYYQn" label.
# Local variables are used because assigning gdp_quarterly_q.year = ... does not create a column.
year = (gdp_quarterly_q['time'] + 40) // 4 + 1950
quarter = (gdp_quarterly_q['time'] + 40) % 4 + 1
gdp_quarterly_q['time'] = year.astype('str') + 'Q' + quarter.astype('str')


In [7]:
gdp_quarterly_q = pd.DataFrame(gdp_quarterly_q)
gdp_quarterly_q.head(5)

Unnamed: 0,country,ifscode,time,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,...,pcpi_sa,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt
0,United States,111,1950Q1,,,,0.062128,,,,...,,,,301.782705,,301.782705,,,,
1,United Kingdom,112,1950Q1,,,,,,,,...,,,,72.776393,,72.776393,,,,
2,France,132,1950Q1,,,,,,,,...,,,,47.016824,,47.016824,,,,
3,Germany,134,1950Q1,,,,,,,,...,,,,,,,,,,
4,Italy,136,1950Q1,,,,,,,,...,,,,,,,,,,
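
The `time` labels shown above (for example `1950Q1`) are derived from an integer quarter index. The arithmetic can be checked on a few values with the sketch below. It assumes, as the `+40` offset and the first label suggest, that the raw index is -40 for 1950Q1; `quarter_label` is a hypothetical helper, not part of the notebook.

```python
def quarter_label(time_index, base_year=1950, offset=40):
    """Convert an integer quarter index into a 'YYYYQn' label."""
    shifted = time_index + offset
    year = shifted // 4 + base_year   # four quarters per year
    quarter = shifted % 4 + 1         # quarters are numbered 1..4
    return f"{year}Q{quarter}"


labels = [quarter_label(t) for t in (-40, -39, -37, -36)]
# labels == ['1950Q1', '1950Q2', '1950Q4', '1951Q1']
```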


In [8]:
# Selecting a subset of countries
gdp_quarterly_q.country.unique()

selected_countries = list(gdp_quarterly_q.country.unique())[0:1]
print("\nSelected Countries: \n")
selected_countries

array(['United States', 'United Kingdom', 'France', 'Germany', 'Italy',
       'Canada', 'Japan'], dtype=object)


Selected Countries: 



['United States']

In [9]:
# dataset_2 = gdp_quarterly_q

# for i in selected_countries:
#    dataset_2[i] = gdp_quarterly_q[gdp_quarterly_q['country'] == i]

# https://stackoverflow.com/questions/51583888/concatenate-dataframe-name-with-variable-value-python

In [10]:
# # Random Forest Regressor

# from sklearn.ensemble import RandomForestRegressor
# from sklearn.metrics import mean_squared_error

# # random forest model creation

# # Set Seed
# seed(seed_global)
# tf.random.set_seed(seed_global)


# rfr = RandomForestRegressor(n_estimators = 1000)

# rfr.fit(X_train, y_train)

# # predictions
# y_pred = rfr.predict(X_test)

# metrics_mse["random_forest"] =  mean_squared_error(y_test, y_pred)

# print("Random Forest Test MSE: ", mean_squared_error(y_test, y_pred))

In [11]:
# # Variable Importance

# # Top i factors by importance

# i = 20
# importances = rfr.feature_importances_
# indices = np.argsort(importances)[-i:]
# features = X.columns

# plt.figure(figsize=(6,6))
# plt.title('Feature Importances - Random Forest Regressor')
# plt.barh(range(len(indices)), importances[indices], color='b', align='center')
# plt.yticks(range(len(indices)), features[indices])
# plt.xlabel('Relative Importance')

In [12]:
# Filter data by country

dataset = gdp_quarterly_q[gdp_quarterly_q['country'].isin(selected_countries)]

dataset = pd.DataFrame(dataset)
# print("#", "column name", "missing values")
# for i in range(len(dataset.columns)):
#     print(i, dataset.columns[i], " ", dataset.iloc[:, i].isnull().sum())

dataset.head(5)

Unnamed: 0,country,ifscode,time,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,...,pcpi_sa,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt
0,United States,111,1950Q1,,,,0.062128,,,,...,,,,301.782705,,301.782705,,,,
7,United States,111,1950Q2,,,,0.062128,,,,...,,,,301.782705,,301.782705,,,,
14,United States,111,1950Q3,,,,0.062128,,,,...,,,,301.782705,,301.782705,,,,
21,United States,111,1950Q4,,,,0.062128,,,,...,,,,301.782705,,301.782705,,,,
28,United States,111,1951Q1,,,,0.062002,,,,...,,,,348.993057,,348.993057,8.709073,11.508033,10.159067,12.875869


## Variable selection

In [13]:
# Variable selection

# Input columns
# Keep every column except the identifiers and the GDP growth targets

dataset_input = dataset

dataset_input = dataset_input.drop(columns = ['country', 'ifscode', 'time', 'ngdp_r_sa_pcha', 'ngdp_r_sa_pchy', 'ngdp_dpchy'])

# Dropped ngdp_dpchy as all values are null

dataset_input.tail(5)

Unnamed: 0,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,ncp_rpch,...,pcpi_sa,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt
1897,18155.7,10628.0,153.952333,0.160311,110.571,4.133333,2565.6,0.469925,12729.7,1.139334,...,247.273333,3.141889,2.109865,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625
1904,18819.741667,10786.0,154.951667,0.162071,111.839,4.066667,2578.3,0.495011,12782.9,0.41792,...,249.250333,3.236639,2.222997,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725
1911,18819.741667,10876.1,155.449,0.162071,110.132,3.9,2592.0,0.531358,12909.2,0.988039,...,250.578667,2.148827,2.668825,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725
1918,18819.741667,10994.3,155.879,0.162071,110.681,3.8,2606.0,0.540123,13019.8,0.856753,...,251.828667,2.010362,2.632912,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725
1925,18819.741667,11057.4,156.776667,0.162071,111.37,3.8,2605.7,-0.011512,13066.3,0.357148,...,252.759,1.485933,2.218463,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725


In [14]:
# Outcome variable (column name) = ngdp_r_sa_pcha

outcome_variable = "ngdp_r_sa_pcha"
predicted_variable = "1_step_ahead_" + outcome_variable

# Work on a copy so that adding columns does not modify dataset_input
dataset_1 = dataset_input.copy()
dataset_1["time"] = dataset["time"]
# dataset_1[num_cols] = dataset[num_cols]
dataset_1[outcome_variable] = dataset[outcome_variable]

dataset_1[predicted_variable] = dataset_1[outcome_variable].shift(-1)

# # Source: https://stackoverflow.com/questions/20095673/shift-column-in-pandas-dataframe-up-by-one

dataset_1 = dataset_1[:-1] 

dataset_1.tail(5)

Unnamed: 0,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,ncp_rpch,...,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,time,ngdp_r_sa_pcha,1_step_ahead_ngdp_r_sa_pcha
1890,18155.7,10456.7,153.815333,0.160311,110.185,4.3,2553.6,0.145104,12586.3,0.586595,...,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625,2017Q3,3.202964,3.545494
1897,18155.7,10628.0,153.952333,0.160311,110.571,4.133333,2565.6,0.469925,12729.7,1.139334,...,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625,2017Q4,3.545494,2.552107
1904,18819.741667,10786.0,154.951667,0.162071,111.839,4.066667,2578.3,0.495011,12782.9,0.41792,...,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2018Q1,2.552107,3.512025
1911,18819.741667,10876.1,155.449,0.162071,110.132,3.9,2592.0,0.531358,12909.2,0.988039,...,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2018Q2,3.512025,2.926498
1918,18819.741667,10994.3,155.879,0.162071,110.681,3.8,2606.0,0.540123,13019.8,0.856753,...,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2018Q3,2.926498,1.089155


In [15]:
# Output Columns

# ngdp_r_sa_pcha: WEO: Gross domestic product, constant prices, seasonally adjusted, quarter-over-quarter percent change, annualized (Percent, Units).

dataset_Y = dataset_1[["time", predicted_variable]]
dataset_Y.tail(5)

Unnamed: 0,time,1_step_ahead_ngdp_r_sa_pcha
1890,2017Q3,3.545494
1897,2017Q4,2.552107
1904,2018Q1,3.512025
1911,2018Q2,2.926498
1918,2018Q3,1.089155
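
The one-step-ahead target above is built by shifting the outcome column up by one row and dropping the last row, which has no next-quarter value. A minimal sketch on made-up numbers (the `toy` frame and its values are illustrative assumptions, not notebook data):

```python
import pandas as pd

# Made-up growth figures, for illustration only
toy = pd.DataFrame({
    "time": ["2018Q1", "2018Q2", "2018Q3", "2018Q4"],
    "growth": [2.5, 3.5, 2.9, 1.1],
})

# shift(-1) moves each value up one row: row t now holds the value of t+1
toy["1_step_ahead_growth"] = toy["growth"].shift(-1)

# The last row has no "next quarter", so its target is NaN and is dropped
toy = toy.iloc[:-1]
```

Each remaining row pairs the features observed in quarter t with the growth rate of quarter t+1.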


In [16]:
# Window size and creating the lagged columns

# Using lag = 0 to identify initial variable importance by fitting a random forest
# Window size
lag = 0
dataset_input_l = dataset_1

# Drop 1) the predicted outcome variable and 2) the time variable

dataset_input_l = dataset_input_l.drop(columns = ["time", predicted_variable])

print("Before adding the lagged variables to the input dataset: ")
dataset_input_l.tail(5)

# Lagging each column in num_columns by the entire range of lag factors

for j in dataset_input_l.columns:
    for i in range(1, (lag + 1), 1):
        new_col = str(j)+"-"+str(i)
        dataset_input_l[str(new_col)] = dataset_input_l[str(j)].shift(i)
    
print("After adding the lagged variables to the input dataset: ")
dataset_input_l.tail(5)

Before adding the lagged variables to the input dataset: 


Unnamed: 0,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,ncp_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
1890,18155.7,10456.7,153.815333,0.160311,110.185,4.3,2553.6,0.145104,12586.3,0.586595,...,2.153214,1.981427,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625,3.202964
1897,18155.7,10628.0,153.952333,0.160311,110.571,4.133333,2565.6,0.469925,12729.7,1.139334,...,3.141889,2.109865,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625,3.545494
1904,18819.741667,10786.0,154.951667,0.162071,111.839,4.066667,2578.3,0.495011,12782.9,0.41792,...,3.236639,2.222997,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2.552107
1911,18819.741667,10876.1,155.449,0.162071,110.132,3.9,2592.0,0.531358,12909.2,0.988039,...,2.148827,2.668825,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,3.512025
1918,18819.741667,10994.3,155.879,0.162071,110.681,3.8,2606.0,0.540123,13019.8,0.856753,...,2.010362,2.632912,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2.926498


After adding the lagged variables to the input dataset: 


Unnamed: 0,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,ncp_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
1890,18155.7,10456.7,153.815333,0.160311,110.185,4.3,2553.6,0.145104,12586.3,0.586595,...,2.153214,1.981427,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625,3.202964
1897,18155.7,10628.0,153.952333,0.160311,110.571,4.133333,2565.6,0.469925,12729.7,1.139334,...,3.141889,2.109865,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625,3.545494
1904,18819.741667,10786.0,154.951667,0.162071,111.839,4.066667,2578.3,0.495011,12782.9,0.41792,...,3.236639,2.222997,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2.552107
1911,18819.741667,10876.1,155.449,0.162071,110.132,3.9,2592.0,0.531358,12909.2,0.988039,...,2.148827,2.668825,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,3.512025
1918,18819.741667,10994.3,155.879,0.162071,110.681,3.8,2606.0,0.540123,13019.8,0.856753,...,2.010362,2.632912,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2.926498
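
With `lag = 0` the loop above adds no columns, which is why the two printed tables are identical. The sketch below shows what the same naming scheme (`<column>-<k>`) would produce for a positive lag. `add_lags` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def add_lags(frame, lag):
    """Return a copy of `frame` with extra columns `<name>-<k>` for k = 1..lag."""
    lagged = frame.copy()
    for column in frame.columns:
        for k in range(1, lag + 1):
            # shift(k) moves values down k rows, so row t holds the value of t-k
            lagged[f"{column}-{k}"] = frame[column].shift(k)
    return lagged


toy = pd.DataFrame({"lur": [4.3, 4.1, 4.0, 3.9], "growth": [3.2, 3.5, 2.6, 3.5]})
with_lags = add_lags(toy, lag=2)
```

The first `lag` rows of every lagged column are NaN, so a later `dropna()` would remove them.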


In [17]:
# Combining Input and Output Values

# X1 = dataset_input
# X = pd.concat([X1, X2, dataset_Y], axis=1)
X = pd.concat([dataset_Y, dataset_input_l], axis=1)
X.head(5)
X.shape

print("\nColumns names:\n")
X.columns

Unnamed: 0,time,1_step_ahead_ngdp_r_sa_pcha,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
0,1950Q1,,,,,0.062128,,,,,...,,,301.782705,,301.782705,,,,,
7,1950Q2,,,,,0.062128,,,,,...,,,301.782705,,301.782705,,,,,
14,1950Q3,,,,,0.062128,,,,,...,,,301.782705,,301.782705,,,,,
21,1950Q4,,,,,0.062128,,,,,...,,,301.782705,,301.782705,,,,,
28,1951Q1,,,,,0.062002,,,,,...,,,348.993057,,348.993057,8.709073,11.508033,10.159067,12.875869,


(275, 63)


Columns names:



Index(['time', '1_step_ahead_ngdp_r_sa_pcha', 'gdpwgt', 'lc', 'le', 'llf',
       'lulcm', 'lur', 'ncg_r', 'ncg_rpch', 'ncp_r', 'ncp_rpch', 'ncp_rpchy',
       'nfbrgdp', 'nfb_r', 'nfdd_r', 'nfdd_rpch', 'nfie_r', 'nfisn_r',
       'nfisr_r', 'nfis_r', 'nfi_r', 'nfi_rpch', 'ngdp', 'ngdp_d', 'ngdp_dpch',
       'ngdp_d_sa', 'ngdp_d_sa_pchy', 'ngdp_r', 'ngdp_rpch', 'ngdp_rpchy',
       'ngdp_r_sa', 'ngdp_r_sa_ar', 'ngdp_sa', 'ngdp_sa_ar', 'nmg_r',
       'nmg_rpch', 'nms_r', 'nm_r', 'nm_rpch', 'nshr', 'ntdd_r', 'ntdd_rpch',
       'ntdd_rpchy', 'nxg_r', 'nxg_rpch', 'nxs_r', 'nx_r', 'nx_rpch', 'pcpi',
       'pcpi_pch', 'pcpi_pchy', 'pcpi_sa', 'pcpi_sa_pcha', 'pcpi_sa_pchy',
       'pppgdp', 'pppsh', 'pppwgt', 'tmgwgt', 'tmwgt', 'txgwgt', 'txwgt',
       'ngdp_r_sa_pcha'],
      dtype='object')

In [18]:
# Dropping all rows with missing data
print("\nAfter dropping rows with missing data")
# X = X.iloc[lag:]
# X = X.iloc[:-1]
X = X.dropna()
X.shape
X.head(5)
X.tail(5)


After dropping rows with missing data


(155, 63)

Unnamed: 0,time,1_step_ahead_ngdp_r_sa_pcha,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
840,1980Q1,-7.985864,2352.456802,1573.6,99.862333,0.106979,77.648,6.3,1447.3,0.885264,...,16.741448,14.210019,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,1.261758
847,1980Q2,-0.476985,2352.456802,1599.2,98.953333,0.106979,80.939,7.333333,1467.3,1.381884,...,14.194984,14.42577,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,-7.985864
854,1980Q3,7.668385,2352.456802,1628.6,98.899,0.106979,83.201,7.666667,1452.7,-0.995025,...,7.721136,12.935323,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,-0.476985
861,1980Q4,8.070747,2352.456802,1687.6,99.498667,0.106979,84.538,7.4,1449.5,-0.220279,...,11.693861,12.53836,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,7.668385
868,1981Q1,-2.926867,2611.68359,1739.6,100.239,0.108677,86.287,7.433333,1461.0,0.793377,...,11.531024,11.261071,3207.025,21.713333,3207.025,248.575,293.825,230.425,280.775,8.070747


Unnamed: 0,time,1_step_ahead_ngdp_r_sa_pcha,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
1890,2017Q3,3.545494,18155.7,10456.7,153.815333,0.160311,110.185,4.3,2553.6,0.145104,...,2.153214,1.981427,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625,3.202964
1897,2017Q4,2.552107,18155.7,10628.0,153.952333,0.160311,110.571,4.133333,2565.6,0.469925,...,3.141889,2.109865,19519.4,15.284951,19519.4,2221.075,2739.425,1444.025,2220.625,3.545494
1904,2018Q1,3.512025,18819.741667,10786.0,154.951667,0.162071,111.839,4.066667,2578.3,0.495011,...,3.236639,2.222997,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2.552107
1911,2018Q2,2.926498,18819.741667,10876.1,155.449,0.162071,110.132,3.9,2592.0,0.531358,...,2.148827,2.668825,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,3.512025
1918,2018Q3,1.089155,18819.741667,10994.3,155.879,0.162071,110.681,3.8,2606.0,0.540123,...,2.010362,2.632912,20580.25,15.19556,20580.25,2379.8,2932.075,1538.375,2356.725,2.926498
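
Dropping incomplete rows reduced the sample from 275 to 155 rows, starting in 1980Q1. Before dropping, it can help to see which columns drive the loss. A minimal sketch on a made-up frame (column names borrowed from the dataset, values invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "time": ["1979Q3", "1979Q4", "1980Q1", "1980Q2"],
    "lur": [np.nan, 6.0, 6.3, 7.3],
    "tmwgt": [np.nan, np.nan, 252.7, 252.7],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# dropna() keeps only rows where every column is observed
complete_rows = toy.dropna()
```

A single sparsely populated column is enough to remove a row, so columns with many missing values may be worth excluding before calling `dropna()`.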


## Process data

In [19]:
# Separating input and output variables

X1 = X

Y1 = X1[predicted_variable]

Y1 = pd.DataFrame(Y1) # Important: keep Y1 as a two-dimensional DataFrame; leaving it as a Series caused formatting errors that took two hours to debug

print("\n Outcome variable dimension", Y1.shape)
Y1.shape
Y1.head(5)

# Dropping outcome variable from input matrix
X1 = X1.drop(columns = [predicted_variable])
print("\n Input matrix: X")
X1.shape
X1.head(5)

print("\n columns in input dataset\n:")
X1.columns


 Outcome variable dimension (155, 1)


(155, 1)

Unnamed: 0,1_step_ahead_ngdp_r_sa_pcha
840,-7.985864
847,-0.476985
854,7.668385
861,8.070747
868,-2.926867



 Input matrix: X


(155, 62)

Unnamed: 0,time,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
840,1980Q1,2352.456802,1573.6,99.862333,0.106979,77.648,6.3,1447.3,0.885264,4277.9,...,16.741448,14.210019,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,1.261758
847,1980Q2,2352.456802,1599.2,98.953333,0.106979,80.939,7.333333,1467.3,1.381884,4181.5,...,14.194984,14.42577,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,-7.985864
854,1980Q3,2352.456802,1628.6,98.899,0.106979,83.201,7.666667,1452.7,-0.995025,4227.4,...,7.721136,12.935323,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,-0.476985
861,1980Q4,2352.456802,1687.6,99.498667,0.106979,84.538,7.4,1449.5,-0.220279,4284.5,...,11.693861,12.53836,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,7.668385
868,1981Q1,2611.68359,1739.6,100.239,0.108677,86.287,7.433333,1461.0,0.793377,4298.8,...,11.531024,11.261071,3207.025,21.713333,3207.025,248.575,293.825,230.425,280.775,8.070747



 columns in input dataset
:


Index(['time', 'gdpwgt', 'lc', 'le', 'llf', 'lulcm', 'lur', 'ncg_r',
       'ncg_rpch', 'ncp_r', 'ncp_rpch', 'ncp_rpchy', 'nfbrgdp', 'nfb_r',
       'nfdd_r', 'nfdd_rpch', 'nfie_r', 'nfisn_r', 'nfisr_r', 'nfis_r',
       'nfi_r', 'nfi_rpch', 'ngdp', 'ngdp_d', 'ngdp_dpch', 'ngdp_d_sa',
       'ngdp_d_sa_pchy', 'ngdp_r', 'ngdp_rpch', 'ngdp_rpchy', 'ngdp_r_sa',
       'ngdp_r_sa_ar', 'ngdp_sa', 'ngdp_sa_ar', 'nmg_r', 'nmg_rpch', 'nms_r',
       'nm_r', 'nm_rpch', 'nshr', 'ntdd_r', 'ntdd_rpch', 'ntdd_rpchy', 'nxg_r',
       'nxg_rpch', 'nxs_r', 'nx_r', 'nx_rpch', 'pcpi', 'pcpi_pch', 'pcpi_pchy',
       'pcpi_sa', 'pcpi_sa_pcha', 'pcpi_sa_pchy', 'pppgdp', 'pppsh', 'pppwgt',
       'tmgwgt', 'tmwgt', 'txgwgt', 'txwgt', 'ngdp_r_sa_pcha'],
      dtype='object')

In [20]:
# Random Forest

# Sequential train-test split
train_test_ratio = 0.69

training = int(round(X1.shape[0]*train_test_ratio, 0))
test = X1.shape[0] - training

print("# items in training set:", training)
print("\n# items in test set:", test)

X_train = X1.iloc[0:(training),:]
y_train = Y1.iloc[0:(training),0]
X_test = X1.iloc[training:(X1.shape[0]),:]
y_test = Y1.iloc[training:(X1.shape[0]),0]
y_test_outcome_value = Y1.iloc[training:(X1.shape[0]),:]

print("\n input training set:")
X_train.shape
X_train.head(5)

y_train.head(5)

print("\n input test set:")
X_test.shape
X_test.head(5)

y_test.head(5)
y_test_outcome_value.head(5)

# items in training set: 107

# items in test set: 48

 input training set:


(107, 62)

Unnamed: 0,time,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
840,1980Q1,2352.456802,1573.6,99.862333,0.106979,77.648,6.3,1447.3,0.885264,4277.9,...,16.741448,14.210019,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,1.261758
847,1980Q2,2352.456802,1599.2,98.953333,0.106979,80.939,7.333333,1467.3,1.381884,4181.5,...,14.194984,14.42577,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,-7.985864
854,1980Q3,2352.456802,1628.6,98.899,0.106979,83.201,7.666667,1452.7,-0.995025,4227.4,...,7.721136,12.935323,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,-0.476985
861,1980Q4,2352.456802,1687.6,99.498667,0.106979,84.538,7.4,1449.5,-0.220279,4284.5,...,11.693861,12.53836,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,7.668385
868,1981Q1,2611.68359,1739.6,100.239,0.108677,86.287,7.433333,1461.0,0.793377,4298.8,...,11.531024,11.261071,3207.025,21.713333,3207.025,248.575,293.825,230.425,280.775,8.070747


840   -7.985864
847   -0.476985
854    7.668385
861    8.070747
868   -2.926867
Name: 1_step_ahead_ngdp_r_sa_pcha, dtype: float64


 input test set:


(48, 62)

Unnamed: 0,time,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
1589,2006Q4,12236.2,7624.0,145.606,0.151394,96.534,4.433333,2443.5,0.846059,10504.5,...,-1.630622,1.965396,13814.6,18.732023,13814.6,1715.45,2026.425,921.925,1305.225,3.449636
1596,2007Q1,13021.65,7806.8,146.135,0.153119,96.994,4.5,2444.9,0.057295,10563.3,...,3.97963,2.431651,14451.875,18.104068,14451.875,1895.725,2243.55,1044.925,1472.6,0.945307
1603,2007Q2,13021.65,7845.4,145.850667,0.153119,95.793,4.5,2460.5,0.638063,10582.8,...,4.607759,2.665287,14451.875,18.104068,14451.875,1895.725,2243.55,1044.925,1472.6,2.312389
1610,2007Q3,13021.65,7885.1,145.943667,0.153119,95.084,4.666667,2472.8,0.499898,10642.5,...,2.556194,2.348975,14451.875,18.104068,14451.875,1895.725,2243.55,1044.925,1472.6,2.189473
1617,2007Q4,13021.65,7978.2,146.271333,0.153119,94.925,4.8,2489.1,0.659172,10672.8,...,4.997587,4.031137,14451.875,18.104068,14451.875,1895.725,2243.55,1044.925,1472.6,2.455478


1589    0.945307
1596    2.312389
1603    2.189473
1610    2.455478
1617   -2.279453
Name: 1_step_ahead_ngdp_r_sa_pcha, dtype: float64

Unnamed: 0,1_step_ahead_ngdp_r_sa_pcha
1589,0.945307
1596,2.312389
1603,2.189473
1610,2.455478
1617,-2.279453
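
The split above is chronological: the first 69% of rows form the training set and the remaining rows form the test set, with no shuffling. A minimal sketch that reproduces the 107/48 sizes on a placeholder frame (`sequential_split` is a hypothetical helper):

```python
import pandas as pd


def sequential_split(frame, train_ratio):
    """Split a time-ordered frame into leading (train) and trailing (test) parts."""
    n_train = int(round(len(frame) * train_ratio))
    return frame.iloc[:n_train], frame.iloc[n_train:]


# Placeholder frame with the same number of rows as the notebook's input matrix
toy = pd.DataFrame({"x": range(155)})
train, test = sequential_split(toy, 0.69)
# len(train) == 107 and len(test) == 48
```

Because every test row comes after every training row, the model is always evaluated on quarters later than those it was trained on.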


In [21]:
# Preparing the training & test sets for scaling

# Dropping the "time" column (reassign rather than drop in place, because
# X_train and X_test are slices of X1)

X_train = X_train.drop(columns = ['time'])
X_test = X_test.drop(columns = ['time'])

train_columns = list(X_train.columns)
# train_columns

X_train.head(5)



Unnamed: 0,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,ncp_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
840,2352.456802,1573.6,99.862333,0.106979,77.648,6.3,1447.3,0.885264,4277.9,-0.14239,...,16.741448,14.210019,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,1.261758
847,2352.456802,1599.2,98.953333,0.106979,80.939,7.333333,1467.3,1.381884,4181.5,-2.253442,...,14.194984,14.42577,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,-7.985864
854,2352.456802,1628.6,98.899,0.106979,83.201,7.666667,1452.7,-0.995025,4227.4,1.097692,...,7.721136,12.935323,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,-0.476985
861,2352.456802,1687.6,99.498667,0.106979,84.538,7.4,1449.5,-0.220279,4284.5,1.350712,...,11.693861,12.53836,2857.325,21.531397,2857.325,212.8,252.675,187.275,230.15,7.668385
868,2611.68359,1739.6,100.239,0.108677,86.287,7.433333,1461.0,0.793377,4298.8,0.333761,...,11.531024,11.261071,3207.025,21.713333,3207.025,248.575,293.825,230.425,280.775,8.070747


In [22]:
# Scaling all the numerical variables
scaler = MinMaxScaler()

# train_columns = list(X_train.columns) # removing 'time' column for feature scaling
# train_columns


print("\nScaled training input dataset:")
X_train = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns = train_columns)

X_train.head(5)

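print("\nScaled test input dataset:")
# Caution: fit_transform re-fits the scaler on the test set, so the test data are scaled with
# their own min/max; scaler.transform(X_test) would reuse the training-set scaling instead.
X_test = scaler.fit_transform(X_test)
X_test = pd.DataFrame(X_test, columns = train_columns)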

X_test.head(5)


Scaled training input dataset:


Unnamed: 0,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,ncp_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
0,0.0,0.0,0.021103,0.0,0.0,0.35468,0.0,0.731506,0.015493,0.492115,...,1.0,0.983648,0.0,0.762432,0.0,0.0,0.0,0.0,0.0,0.531239
1,0.0,0.004329,0.00119,0.0,0.120993,0.507389,0.020435,0.835901,0.0,0.0,...,0.863638,1.0,0.0,0.762432,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.009301,0.0,0.0,0.204154,0.55665,0.005518,0.33625,0.007377,0.781195,...,0.516968,0.887036,0.0,0.762432,0.0,0.0,0.0,0.0,0.0,0.431355
3,0.0,0.019278,0.013137,0.0,0.253309,0.517241,0.002248,0.499109,0.016553,0.840178,...,0.729705,0.856949,0.0,0.762432,0.0,0.0,0.0,0.0,0.0,0.899275
4,0.026228,0.028071,0.029355,0.038232,0.31761,0.522167,0.013998,0.712191,0.018852,0.603113,...,0.720985,0.760141,0.031915,0.811984,0.031915,0.023808,0.023199,0.058735,0.04709,0.922389



Scaled test input dataset:


Unnamed: 0,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,ncp_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
0,0.0,0.0,0.41334,0.0,0.095128,0.103261,0.019358,0.782466,0.033506,0.885535,...,0.47619,0.520792,0.0,1.0,0.0,0.161348,0.050333,0.0,0.0,0.850587
1,0.119305,0.054238,0.44355,0.161546,0.122325,0.11413,0.025968,0.525182,0.0561,0.695181,...,0.846199,0.588765,0.094193,0.822434,0.094193,0.384966,0.278011,0.177483,0.156942,0.670491
2,0.119305,0.065691,0.427312,0.161546,0.051318,0.11413,0.099622,0.71462,0.063593,0.520215,...,0.887625,0.622825,0.094193,0.822434,0.094193,0.384966,0.278011,0.177483,0.156942,0.768803
3,0.119305,0.077471,0.432623,0.161546,0.0094,0.141304,0.157696,0.669553,0.086532,0.697215,...,0.75232,0.576712,0.094193,0.822434,0.094193,0.384966,0.278011,0.177483,0.156942,0.759964
4,0.119305,0.105095,0.451335,0.161546,0.0,0.163043,0.234655,0.721506,0.098175,0.566902,...,0.913335,0.821944,0.094193,0.822434,0.094193,0.384966,0.278011,0.177483,0.156942,0.779093
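
Min-max scaling is usually fitted on the training set only and then applied unchanged to the test set, so that both share one scale and no test-set information enters preprocessing. The sketch below writes this out in plain pandas. `fit_min_max` and `apply_min_max` are hypothetical helpers and the values are made up. With scikit-learn, the counterpart would be `scaler.fit(X_train)` followed by `scaler.transform(X_test)`.

```python
import pandas as pd


def fit_min_max(train):
    """Return the per-column minimum and range computed on the training set."""
    col_min = train.min()
    col_range = (train.max() - col_min).replace(0, 1.0)  # avoid division by zero
    return col_min, col_range


def apply_min_max(frame, col_min, col_range):
    """Scale `frame` with statistics that were fitted elsewhere."""
    return (frame - col_min) / col_range


train = pd.DataFrame({"lur": [4.0, 6.0, 8.0]})
test = pd.DataFrame({"lur": [5.0, 10.0]})

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
# train_scaled["lur"] is [0.0, 0.5, 1.0]; test_scaled["lur"] is [0.25, 1.5]
```

Test values may fall outside the [0, 1] interval under this scheme. That is expected whenever the test period contains levels not seen during training.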


In [23]:
if 'time' in X_train.columns:
    print("Exists")
else:
    print("Does not exist")
    
X_train

# train_cols = X_train.columns

# train_cols

Does not exist


Unnamed: 0,gdpwgt,lc,le,llf,lulcm,lur,ncg_r,ncg_rpch,ncp_r,ncp_rpch,...,pcpi_sa_pcha,pcpi_sa_pchy,pppgdp,pppsh,pppwgt,tmgwgt,tmwgt,txgwgt,txwgt,ngdp_r_sa_pcha
0,0.000000,0.000000,0.021103,0.000000,0.000000,0.354680,0.000000,0.731506,0.015493,0.492115,...,1.000000,0.983648,0.000000,0.762432,0.000000,0.000000,0.000000,0.000000,0.000000,0.531239
1,0.000000,0.004329,0.001190,0.000000,0.120993,0.507389,0.020435,0.835901,0.000000,0.000000,...,0.863638,1.000000,0.000000,0.762432,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.000000,0.009301,0.000000,0.000000,0.204154,0.556650,0.005518,0.336250,0.007377,0.781195,...,0.516968,0.887036,0.000000,0.762432,0.000000,0.000000,0.000000,0.000000,0.000000,0.431355
3,0.000000,0.019278,0.013137,0.000000,0.253309,0.517241,0.002248,0.499109,0.016553,0.840178,...,0.729705,0.856949,0.000000,0.762432,0.000000,0.000000,0.000000,0.000000,0.000000,0.899275
4,0.026228,0.028071,0.029355,0.038232,0.317610,0.522167,0.013998,0.712191,0.018852,0.603113,...,0.720985,0.760141,0.031915,0.811984,0.031915,0.023808,0.023199,0.058735,0.047090,0.922389
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,0.929171,0.938396,0.950484,0.952835,0.662904,0.157635,0.975886,0.634163,0.958456,0.751513,...,0.435381,0.196149,0.928999,0.123489,0.928999,0.861511,0.870486,0.874804,0.881311,0.666370
103,0.929171,0.953226,0.957333,0.952835,0.640515,0.157635,0.976499,0.550664,0.963293,0.594471,...,0.306060,0.185154,0.928999,0.123489,0.928999,0.861511,0.870486,0.874804,0.881311,0.605169
104,1.000000,0.981906,0.975947,1.000000,0.709081,0.123153,1.000000,0.746615,0.981534,0.785331,...,0.215922,0.186381,1.000000,0.000000,1.000000,1.000000,1.000000,1.000000,1.000000,0.770543
105,1.000000,0.990006,0.989492,1.000000,0.693934,0.108374,0.993154,0.487360,0.989907,0.643350,...,0.299431,0.204070,1.000000,0.000000,1.000000,1.000000,1.000000,1.000000,1.000000,0.512677


## LSTM

In [24]:
# Reshape the training and test sets

X_train_1 = np.array(X_train).reshape(X_train.shape[0], X_train.shape[1],1)
print("training set reshaped:", X_train_1.shape)

X_test_1 = np.array(X_test).reshape(X_test.shape[0], X_test.shape[1],1)
print("test set reshaped:", X_test_1.shape)

training set reshaped: (107, 61, 1)
test set reshaped: (48, 61, 1)
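
Keras LSTM layers expect input of shape `(samples, timesteps, features)`. The reshape above therefore presents each of the 61 columns as one timestep with a single feature. The sketch below contrasts that layout with an alternative that treats each row as one timestep carrying all features. The toy array is illustrative, and which layout suits the forecasting task is a modelling choice the notebook does not settle.

```python
import numpy as np

features = np.arange(12, dtype=float).reshape(4, 3)  # 4 samples, 3 features

# Notebook layout: every feature becomes one "timestep" with a single channel
as_steps = features.reshape(features.shape[0], features.shape[1], 1)

# Alternative layout: one timestep per sample that carries all features
as_channels = features.reshape(features.shape[0], 1, features.shape[1])

# as_steps.shape == (4, 3, 1); as_channels.shape == (4, 1, 3)
```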


In [25]:
import wandb
from keras.layers import LSTM
from wandb.keras import WandbCallback

In [26]:
#wandb.init(project='IMF')

In [27]:
sweep_config = {
    'method': 'grid',
    'metric': {
      'name': 'val_loss',
      'goal': 'minimize'   
    },
    'parameters': {
                  'epochs': {'values': [10, 20]},
                  'batch_size': {'values': [32, 64]},
                  'nn_units': {'values': [32, 64]}
    }
}
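
A grid sweep runs every combination of the listed values, here 2 × 2 × 2 = 8 runs. The combinations can be enumerated with the standard library as a quick check. This sketch only mirrors the `parameters` block and does not call wandb. The order in which wandb schedules the runs may differ.

```python
from itertools import product

parameters = {
    "epochs": [10, 20],
    "batch_size": [32, 64],
    "nn_units": [32, 64],
}

names = list(parameters)
# One dict per combination of hyperparameter values
grid = [dict(zip(names, values)) for values in product(*parameters.values())]
# len(grid) == 8
```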

In [28]:
sweepid = wandb.sweep(sweep_config)

Create sweep with ID: e0jayujw
Sweep URL: https://app.wandb.ai/fiscal-forcast/IMF/sweeps/e0jayujw


In [29]:
def train():
    
    # Specify the hyperparameter to be tuned along with
    # an initial value
    config_defaults = {
        'epochs': 10,
        'batch_size': 64,
        'nn_units': 64
    }
    
    # Initialize a new wandb run
    wandb.init(config=config_defaults)
    
    # Config is a variable that holds and saves hyperparameters and inputs
    config = wandb.config
    
    # Define the model
    model = Sequential()
    model.add(LSTM(units = config.nn_units, return_sequences = True, input_shape = (X_train.shape[1], 1)))
    model.add(Dropout(0.2))
    model.add(LSTM(units = config.nn_units, return_sequences = True))
    model.add(Dropout(0.2))
    model.add(LSTM(units = config.nn_units, return_sequences = True))
    model.add(Dropout(0.2))
    model.add(Dense(units = 1, activation='relu'))
    model.add(LSTM(units = config.nn_units, return_sequences = True))
    #model.add(Dropout(0.2))
    model.add(LSTM(units = config.nn_units, return_sequences = True))
    #model.add(Dropout(0.2))
    model.add(LSTM(units = config.nn_units, return_sequences = True))
    #model.add(Dropout(0.2))
    model.add(LSTM(units = config.nn_units))
    # Linear output: the target (GDP growth) can be negative, which a ReLU output cannot produce
    model.add(Dense(units = 1))
    
    # Compile the model
    model.compile(optimizer = 'adam', loss = 'mean_squared_error', metrics=['mse'])
    
    # Train the model
    model.fit(X_train_1, y_train, epochs = config.epochs, batch_size = config.batch_size, 
              validation_split = 0.3, callbacks=[WandbCallback()])

In [30]:
wandb.agent(sweepid, function=train)

wandb: Agent Starting Run: q1munr74 with config:
	epochs: 10
	nn_units: 32
	batch_size: 32
wandb: Agent Started Run: q1munr74


wandb: Wandb version 0.8.30 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


Train on 74 samples, validate on 33 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
wandb: Agent Finished Run: q1munr74 

wandb: Agent Starting Run: aqbuq9k1 with config:
	epochs: 10
	nn_units: 64
	batch_size: 32
wandb: Agent Started Run: aqbuq9k1



Train on 74 samples, validate on 33 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
wandb: Agent Finished Run: aqbuq9k1 

wandb: Agent Starting Run: ztqh8l95 with config:
	epochs: 20
	nn_units: 32
	batch_size: 32
wandb: Agent Started Run: ztqh8l95



Train on 74 samples, validate on 33 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
wandb: Agent Finished Run: ztqh8l95 

wandb: Agent Starting Run: 17735n2f with config:
	epochs: 20
	nn_units: 64
	batch_size: 32
wandb: Agent Started Run: 17735n2f



Train on 74 samples, validate on 33 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
wandb: Agent Finished Run: 17735n2f 

wandb: Agent Starting Run: zr67oh2y with config:
	epochs: 10
	nn_units: 32
	batch_size: 64
wandb: Agent Started Run: zr67oh2y



Train on 74 samples, validate on 33 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
wandb: Agent Finished Run: zr67oh2y 

wandb: Agent Starting Run: v120kxdo with config:
	epochs: 10
	nn_units: 64
	batch_size: 64
wandb: Agent Started Run: v120kxdo



Train on 74 samples, validate on 33 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
wandb: Agent Finished Run: v120kxdo 

wandb: Agent Starting Run: 9poh8ogj with config:
	epochs: 20
	nn_units: 32
	batch_size: 64
wandb: Agent Started Run: 9poh8ogj



Train on 74 samples, validate on 33 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
wandb: Agent Finished Run: 9poh8ogj 

wandb: Agent Starting Run: 0dr236bp with config:
	epochs: 20
	nn_units: 64
	batch_size: 64
wandb: Agent Started Run: 0dr236bp



Train on 74 samples, validate on 33 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
wandb: Agent Finished Run: 0dr236bp 
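
Every run above reports "Train on 74 samples, validate on 33 samples". With `validation_split = 0.3`, Keras holds out the last fraction of the training samples, before any shuffling, as the validation set. The sketch below reproduces the reported sizes for the 107 training rows. The truncating formula is an assumption about how the split index is computed, chosen because it matches the logged numbers.

```python
def validation_split_sizes(n_samples, validation_split):
    """Return (train, validation) sizes for a trailing validation split."""
    # Assumed rule: truncate n_samples * (1 - validation_split) to an integer
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at


sizes = validation_split_sizes(107, 0.3)
# sizes == (74, 33)
```

Since the rows are in time order, the 33 validation samples are the most recent quarters of the training period.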

