# Students Do: Predicting Gold Closing Prices

In this activity, you will gain hands-on experience by building an RNN LSTM for predicting gold closing prices using time-series data.

In [59]:
# Initial imports
import numpy as np
import pandas as pd

%matplotlib inline

## Instructions

### Initial Setup

To ensure models' reproducibility, set the random seed for `numpy` and `tensorflow` libraries.

In [60]:
# Set the random seed for reproducibility
# Note: This is used for model prototyping, but it is good practice to comment this out and run multiple experiments to evaluate your model.
from numpy.random import seed

seed(1)
from tensorflow import random

random.set_seed(2)

### Data Preparation

In this section, you will retrieve historical gold prices from the London Bullion Market Association using the [Quandl API](https://www.quandl.com/data/LBMA/GOLD-Gold-Price-London-Fixing). Be sure to have your Quandl API key at hand.

#### Data Retrieval

Import your Quandl API key from an environment variable named `QUANDL_API_KEY` and store it in a variable named `quandl_key`.

In [61]:
# Import the API key for QUANDL
import os
quandl_key = os.getenv("QUANDL_API_KEY")


Open the ["Gold Price: London Fixing" at the Quandl website](https://www.quandl.com/data/LBMA/GOLD-Gold-Price-London-Fixing), and set the URL to retrieve the historical prices of gold in `json` format.

In [62]:
# Set Gold price URL
request_url = "https://www.quandl.com/api/v3/datasets/LBMA/GOLD.json?api_key="

gold_prices_url = request_url + quandl_key
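
If the environment variable is not set, `os.getenv()` returns `None` and the string concatenation above raises a `TypeError` that does not mention the key. The following is a minimal sketch of a more defensive version, not part of the original activity; the helper name `build_gold_prices_url` and the constant `BASE_URL` are hypothetical.

```python
import os
from urllib.parse import urlencode

# Same endpoint as the cell above, without the query string
BASE_URL = "https://www.quandl.com/api/v3/datasets/LBMA/GOLD.json"


def build_gold_prices_url(env_var="QUANDL_API_KEY"):
    """Return the request URL, or raise a clear error if the key is missing."""
    api_key = os.getenv(env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    # urlencode escapes the key and keeps the query string well formed
    return f"{BASE_URL}?{urlencode({'api_key': api_key})}"
```

Calling `build_gold_prices_url()` in place of the concatenation gives the same URL when the key is present and fails early with a readable message when it is not.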

Use the `requests` library to retrieve the historical prices of gold in `json` format.

In [63]:
# Fetch Gold prices from QUANDL
import requests

response_data = requests.get(gold_prices_url).json()

gold_df = pd.DataFrame(response_data["dataset"]["data"])
gold_df.columns = [
    "Date",
    "USD (AM)",
    "USD (PM)",
    "GBP (AM)",
    "GBP (PM)",
    "EURO (AM)",
    "EURO (PM)",
]
gold_df["Date"] = pd.to_datetime(gold_df["Date"])
gold_df.set_index("Date", inplace=True)
gold_df.head(4)


| Date | USD (AM) | USD (PM) | GBP (AM) | GBP (PM) | EURO (AM) | EURO (PM) |
| --- | --- | --- | --- | --- | --- | --- |
| 2020-12-21 | 1869.25 | 1880.0 | 1415.77 | 1413.57 | 1540.82 | 1539.53 |
| 2020-12-18 | 1878.95 | 1879.75 | 1390.8 | 1393.46 | 1533.41 | 1536.13 |
| 2020-12-17 | 1871.95 | 1890.75 | 1378.32 | 1390.65 | 1530.88 | 1542.85 |
| 2020-12-16 | 1861.35 | 1851.95 | 1377.59 | 1371.92 | 1525.38 | 1519.15 |


Explore the `json` response data and create a Pandas DataFrame containing the historical prices of gold in all the different currencies provided by the API, and set the date as the index of the DataFrame.

Your DataFrame should look like the following sample:


![Sample gold prices DataFrame](../Images/sample-gold-prices-df.png)

**Note:** Remember that while working with time-series data, it's important to transform dates data to `datetime` data type explicitly. Review the [Quandl API documentation for time-series data](https://docs.quandl.com/docs/time-series) if needed.
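
As an illustration of the steps described above, here is a small sketch that builds the DataFrame from a hand-written stand-in for the parsed JSON. Only the `dataset`, `column_names`, and `data` keys are used; the exact strings the API returns under `column_names` are an assumption, and the names `sample_response` and `sample_df` are hypothetical.

```python
import pandas as pd

# Hand-written stand-in for the parsed JSON response (two rows only).
# The contents of "column_names" are assumed, not copied from the API.
sample_response = {
    "dataset": {
        "column_names": [
            "Date", "USD (AM)", "USD (PM)",
            "GBP (AM)", "GBP (PM)", "EURO (AM)", "EURO (PM)",
        ],
        "data": [
            ["2020-12-21", 1869.25, 1880.0, 1415.77, 1413.57, 1540.82, 1539.53],
            ["2020-12-18", 1878.95, 1879.75, 1390.8, 1393.46, 1533.41, 1536.13],
        ],
    }
}

dataset = sample_response["dataset"]

# Take the column labels from the response instead of typing them by hand
sample_df = pd.DataFrame(dataset["data"], columns=dataset["column_names"])

# Convert the date strings explicitly, then use them as the index
sample_df["Date"] = pd.to_datetime(sample_df["Date"])
sample_df = sample_df.set_index("Date")

print(sample_df.index)
```

Reading the labels from `column_names` keeps the DataFrame in step with the response if the column order ever changes.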

In [64]:
# Explore the `response_data` object

# Fetch the keys of the `response_data` object
response_data_keys = response_data["dataset"].keys()
response_data_keys

dict_keys(['id', 'dataset_code', 'database_code', 'name', 'description', 'refreshed_at', 'newest_available_date', 'oldest_available_date', 'column_names', 'frequency', 'type', 'premium', 'limit', 'transform', 'column_index', 'start_date', 'end_date', 'data', 'collapse', 'order', 'database_id'])

In [65]:
# Explore the contents of the main key


In [66]:
# Create a DataFrame with Gold prices

# Transform the "Date" column to datetime

# Set the "Date" column as the DataFrame Index

# Show the DataFrame head


#### Data Cleaning

Before continuing, check whether there are any `null` or missing values in the DataFrame. If so, fill the missing values with the previous price in the series.

**Note:** You may want to review the [Working with missing data guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) from the Pandas documentation.

In [67]:
# Looking for missing values
gold_df.isnull().sum().sum()


16020

In [68]:
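# Filling missing values with the previous ones
# ffill() is equivalent to fillna(method="pad"); assign the result, because it returns a new DataFrame
gold_df = gold_df.ffill()
gold_df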


| Date | USD (AM) | USD (PM) | GBP (AM) | GBP (PM) | EURO (AM) | EURO (PM) |
| --- | --- | --- | --- | --- | --- | --- |
| 2020-12-21 | 1869.25 | 1880.00 | 1415.770 | 1413.570 | 1540.820 | 1539.530 |
| 2020-12-18 | 1878.95 | 1879.75 | 1390.800 | 1393.460 | 1533.410 | 1536.130 |
| 2020-12-17 | 1871.95 | 1890.75 | 1378.320 | 1390.650 | 1530.880 | 1542.850 |
| 2020-12-16 | 1861.35 | 1851.95 | 1377.590 | 1371.920 | 1525.380 | 1519.150 |
| 2020-12-15 | 1844.30 | 1850.65 | 1383.780 | 1384.840 | 1516.570 | 1523.640 |
| ... | ... | ... | ... | ... | ... | ... |
| 1968-01-08 | 35.14 | 37.70 | 14.586 | 15.679 | 244.383 | 243.039 |
| 1968-01-05 | 35.14 | 37.70 | 14.597 | 15.679 | 244.383 | 243.039 |
| 1968-01-04 | 35.14 | 37.70 | 14.603 | 15.679 | 244.383 | 243.039 |
| 1968-01-03 | 35.16 | 37.70 | 14.617 | 15.679 | 244.383 | 243.039 |
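
One point worth checking is row order. The rows above are listed newest-first, so a forward fill copies each missing value from the row above it, which is the more recent date. The sketch below illustrates the difference on a four-row frame with one gap inserted by hand; it is an illustration under that assumption, not a required step of the activity, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Four rows in newest-first order; the NaN was inserted by hand for illustration
prices = pd.DataFrame(
    {"USD (PM)": [1880.00, np.nan, 1890.75, 1851.95]},
    index=pd.to_datetime(["2020-12-21", "2020-12-18", "2020-12-17", "2020-12-16"]),
)

# ffill() returns a new DataFrame; `prices` itself still holds the NaN
filled_newest_first = prices.ffill()

# Sorting oldest-first before filling makes "previous" mean the earlier date
filled_chronological = prices.sort_index().ffill()

print(filled_newest_first)
print(filled_chronological)
```

In the first result the gap on 2020-12-18 takes the value from 2020-12-21; in the second it takes the value from 2020-12-17. If you want the chronologically previous price, sort the index before filling.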


#### Create the Features `X` and Target `y` Data

Use the `window_data()` function below to create the features set `X` and the target vector `y`. Define a window size of `30` days and use the closing gold price column (`USD (PM)`) as both the feature and target column; this will allow your model to predict gold prices in USD.

In [69]:
def window_data(df, window, feature_col_number, target_col_number):
    """
    This function accepts the column number for the features (X) and the target (y).
    It chunks the data up with a rolling window of Xt - window to predict Xt.
    It returns two numpy arrays of X and y.
    """
    X = []
    y = []
    for i in range(len(df) - window):
        features = df.iloc[i : (i + window), feature_col_number]
        target = df.iloc[(i + window), target_col_number]
        X.append(features)
        y.append(target)
    return np.array(X), np.array(y).reshape(-1, 1)

In [70]:
# Define the window size
window_size = 30

# Set the index of the feature and target columns
feature_column = 1
target_column = 1

# Create the features (X) and target (y) data using the window_data() function.
X, y = window_data(gold_df, window_size, feature_column, target_column)

# Print a few sample values from X and y
print(f"X sample values:\n{X[:3]} \n")
print(f"y sample values:\n{y[:3]}")

X sample values:
[[1880.   1879.75 1890.75 1851.95 1850.65 1831.15 1842.   1844.35 1841.75
  1868.15 1859.95 1843.   1832.35 1822.6  1810.75 1762.55 1779.3  1807.4
  1810.2  1799.6  1840.2  1875.7  1857.35 1876.1  1889.05 1885.6  1890.9
  1874.85 1860.95 1878.7 ]
 [1879.75 1890.75 1851.95 1850.65 1831.15 1842.   1844.35 1841.75 1868.15
  1859.95 1843.   1832.35 1822.6  1810.75 1762.55 1779.3  1807.4  1810.2
  1799.6  1840.2  1875.7  1857.35 1876.1  1889.05 1885.6  1890.9  1874.85
  1860.95 1878.7  1867.3 ]
 [1890.75 1851.95 1850.65 1831.15 1842.   1844.35 1841.75 1868.15 1859.95
  1843.   1832.35 1822.6  1810.75 1762.55 1779.3  1807.4  1810.2  1799.6
  1840.2  1875.7  1857.35 1876.1  1889.05 1885.6  1890.9  1874.85 1860.95
  1878.7  1867.3  1940.8 ]] 

y sample values:
[[1867.3 ]
 [1940.8 ]
 [1938.45]]
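
To see what the windowing does on a series small enough to read, the sketch below reproduces the same idea with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). It is an alternative formulation for illustration, not the `window_data()` function itself, and the toy values are placeholders.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy series standing in for one price column (placeholder values)
prices = np.arange(10.0)
window = 3

# Every run of `window` consecutive values; the last run has no
# following value to predict, so it is dropped
X_demo = sliding_window_view(prices, window)[:-1]

# The target for each window is the value immediately after it
y_demo = prices[window:].reshape(-1, 1)

print(X_demo.shape, y_demo.shape)  # (7, 3) (7, 1)
print(X_demo[0], y_demo[0])        # [0. 1. 2.] [3.]
```

As with `window_data()`, a series of length `n` yields `n - window` samples, each paired with the single value that follows its window.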


#### Split Data Between Training and Testing Sets

To avoid the dataset being randomized, manually create the training and testing sets using array slicing. Use 70% of the data for training and the remainder for testing.

In [71]:
# Manually splitting the data
split = int(0.7 * len(X))

X_train = X[: split]
X_test = X[split:]

y_train = y[: split]
y_test = y[split:]

#### Scale Data with `MinMaxScaler`

Before training the RNN LSTM model, use the `MinMaxScaler` from `sklearn` to scale the training and testing data between `0` and `1`.

**Note:** You need to scale both features and target sets.

In [72]:
# Importing the MinMaxScaler from sklearn
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit the MinMaxScaler object with the features data X
scaler.fit(X)

# Scale the features training and testing sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit the MinMaxScaler object with the target data Y
scaler.fit(y)

# Scale the target training and testing sets
y_train = scaler.transform(y_train)
y_test = scaler.transform(y_test)
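
The cell above fits the scaler on the full `X` and `y`, so the testing data influences the scaling. A common alternative is to fit on the training portion only and to keep one scaler per array so that neither fit overwrites the other. The sketch below shows that variant on random placeholder data; the names `x_scaler` and `y_scaler` and the demo arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Random placeholder data with the same layout as X (30 columns) and y (1 column)
rng = np.random.default_rng(0)
X_demo = rng.uniform(30, 2000, size=(100, 30))
y_demo = rng.uniform(30, 2000, size=(100, 1))
split = int(0.7 * len(X_demo))

# Fit each scaler on the training rows only
x_scaler = MinMaxScaler().fit(X_demo[:split])
y_scaler = MinMaxScaler().fit(y_demo[:split])

X_train_scaled = x_scaler.transform(X_demo[:split])
X_test_scaled = x_scaler.transform(X_demo[split:])
y_train_scaled = y_scaler.transform(y_demo[:split])
y_test_scaled = y_scaler.transform(y_demo[split:])

# The target scaler is the one to use later for recovering prices
restored_prices = y_scaler.inverse_transform(y_test_scaled)
```

With this variant, scaled test values can fall outside the `0` to `1` range when the test prices lie outside the range seen in training; that is expected behavior rather than an error.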


#### Reshape Features Data for the LSTM Model

The Keras LSTM layer expects the features data as a three-dimensional array of shape `(samples, time steps, features)`, so reshape the `X` data using `reshape((X_train.shape[0], X_train.shape[1], 1))`. Both the training and testing sets should be reshaped.
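
As a quick illustration of the shape change, the sketch below reshapes a placeholder array of five windows; the array `X_demo` is hypothetical and holds no real prices.

```python
import numpy as np

# Placeholder array: 5 samples, each a window of 30 values
X_demo = np.zeros((5, 30))

# Add a trailing axis of size 1: one feature (the price) per time step
X_demo_3d = X_demo.reshape((X_demo.shape[0], X_demo.shape[1], 1))

print(X_demo.shape)     # (5, 30)
print(X_demo_3d.shape)  # (5, 30, 1)
```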

In [73]:
# Reshape the features data
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# Print some sample data after reshaping the datasets
print(f"X_train sample values:\n{X_train[:3]} \n")
print(f"X_test sample values:\n{X_test[:3]}")

X_train sample values:
[[[0.90791675]
  [0.90779374]
  [0.91320606]
  [0.89411533]
  [0.89347569]
  [0.88388113]
  [0.88921964]
  [0.89037591]
  [0.88909663]
  [0.9020862 ]
  [0.89805156]
  [0.88971167]
  [0.88447156]
  [0.87967428]
  [0.87384373]
  [0.85012793]
  [0.85836942]
  [0.87219543]
  [0.87357312]
  [0.86835761]
  [0.88833399]
  [0.90580102]
  [0.89677229]
  [0.90599784]
  [0.91236961]
  [0.91067211]
  [0.91327987]
  [0.9053828 ]
  [0.89854359]
  [0.90727711]]

 [[0.90779374]
  [0.91320606]
  [0.89411533]
  [0.89347569]
  [0.88388113]
  [0.88921964]
  [0.89037591]
  [0.88909663]
  [0.9020862 ]
  [0.89805156]
  [0.88971167]
  [0.88447156]
  [0.87967428]
  [0.87384373]
  [0.85012793]
  [0.85836942]
  [0.87219543]
  [0.87357312]
  [0.86835761]
  [0.88833399]
  [0.90580102]
  [0.89677229]
  [0.90599784]
  [0.91236961]
  [0.91067211]
  [0.91327987]
  [0.9053828 ]
  [0.89854359]
  [0.90727711]
  [0.90166798]]

 [[0.91320606]
  [0.89411533]
  [0.89347569]
  [0.88388113]
  [0.88921964

---

### Build and Train the LSTM RNN

In this section, you will design a custom LSTM RNN in Keras and fit (train) it using the training data we defined.

You will need to:

1. Define the model architecture in Keras.

2. Compile the model.

3. Fit the model with the training data.

In [74]:
# Importing required Keras modules
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

#### Create the LSTM RNN Model Structure

Design the structure of your RNN LSTM as follows:

* Number of units per layer: `30` (same as the window size)

* Dropout fraction: `0.2` (20% of neurons will be randomly dropped on each training step)

* Add three `LSTM` layers to your model, remember to add a `Dropout` layer after each `LSTM` layer, and to set `return_sequences=True` in the first two layers only.

* Add a `Dense` output layer with one unit.

In [77]:
# Define the LSTM RNN model.
model = Sequential()

# Initial model setup
number_units = 30
dropout_fraction = 0.2

# Layer 1
model.add(LSTM(units=number_units, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(dropout_fraction))

# Layer 2 (input_shape is only needed on the first layer)
model.add(LSTM(units=number_units, return_sequences=True))
model.add(Dropout(dropout_fraction))

# Layer 3
model.add(LSTM(units=number_units))
model.add(Dropout(dropout_fraction))

# Output layer
model.add(Dense(1))


#### Compile the LSTM RNN Model

Compile the model using the `adam` optimizer and `mean_squared_error` as the loss function, since the value you want to predict is continuous.

In [78]:
# Compile the model
model.compile(optimizer="adam", loss="mean_squared_error")

In [79]:
# Show the model summary
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 30, 30)            3840      
_________________________________________________________________
dropout_3 (Dropout)          (None, 30, 30)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 30, 30)            7320      
_________________________________________________________________
dropout_4 (Dropout)          (None, 30, 30)            0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 30)                7320      
_________________________________________________________________
dropout_5 (Dropout)          (None, 30)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                
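
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout with default settings (four gates, each with input weights, recurrent weights, and a bias); the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Parameters in one LSTM layer: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    """Parameters in one Dense layer: weights plus one bias per unit."""
    return input_dim * units + units


first_lstm = lstm_param_count(input_dim=1, units=30)     # 3840
later_lstm = lstm_param_count(input_dim=30, units=30)    # 7320
output_dense = dense_param_count(input_dim=30, units=1)  # 31

print(first_lstm, later_lstm, output_dense)
```

The first layer is smaller because each time step carries a single feature, while the second and third layers receive the 30 outputs of the layer before them.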

#### Train the Model

Train (fit) the model with the training data using `4` epochs and a `batch_size=90`. Because you are working with time-series data, remember to set `shuffle=False` to keep the sequential order of the data.

In [None]:
# Train the model
model.fit(X_train, y_train, epochs=4, shuffle=False, batch_size=90, verbose=1)


Epoch 1/4


---

### Model Performance

In this section, you will evaluate the model using the test data. 

You will need to:

1. Evaluate the model using the `X_test` and `y_test` data.

2. Use the `X_test` data to make predictions.

3. Create a DataFrame of Real (`y_test`) vs. predicted values.

4. Plot the real vs. predicted values as a line chart.

#### Evaluate the Model

Use the `evaluate()` method of the model using the testing data.

In [None]:
# Evaluate the model


#### Make Predictions

Use the `predict()` method of the model to make some closing gold price predictions using your brand new LSTM RNN model and your testing data. Save the predictions in a variable called `predicted`.

In [None]:
# Make predictions using the testing data X_test


Since you scaled the original values using the `MinMaxScaler`, you need to recover the original gold prices to better understand the predictions. Use the `inverse_transform()` method of the scaler to decode the scaled testing and predicted values to their original scale.

In [None]:
# Recover the original prices instead of the scaled version


#### Plotting Predicted Vs. Real Prices

Create a Pandas DataFrame with an index and two columns as follows to plot the predicted vs. the actual gold prices.

* Index: The datetime index from the `gold_df` DataFrame created previously. 

* Column 1: Actual prices (testing data)

* Column 2: Predicted prices

Your DataFrame should look like the sample below:

![Sample actual vs. predicted gold prices](../Images/sample-gold-prices-predictions-df.png)

In [None]:
# Create a DataFrame of Real and Predicted values

# Show the DataFrame's head
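
One way to line up the index with the test values is to take the last `len(real_prices)` entries of the DataFrame index, since the testing targets are the final rows of the windowed data. The sketch below shows this on made-up values; `demo_index`, `real_prices`, `predicted_prices`, and the column names `Real` and `Predicted` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data: a 10-row index and 4 test samples (made-up values)
demo_index = pd.date_range("2020-01-01", periods=10, freq="D")
real_prices = np.array([[1500.0], [1510.0], [1505.0], [1520.0]])
predicted_prices = np.array([[1498.0], [1507.0], [1509.0], [1515.0]])

# The test targets are the last len(real_prices) rows, so they pair
# with the last len(real_prices) index entries
results = pd.DataFrame(
    {
        "Real": real_prices.ravel(),
        "Predicted": predicted_prices.ravel(),
    },
    index=demo_index[-len(real_prices):],
)

print(results.head())
```

`ravel()` flattens the `(n, 1)` arrays that `inverse_transform()` and `predict()` return into the one-dimensional columns a DataFrame expects.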


Use the `plot()` method of the DataFrame to create a line chart comparing the actual and predicted gold prices.

In [None]:
# Plot the real vs predicted prices as a line chart
