# Building a neural network

Given instructions to follow:
'''
1. Write the entropy function for binary classification from scratch using only Python's math module. Do NOT use Numpy.
2. Do NOT call a built-in entropy function from any Python library.
3. Use the math library only. Do not use any other library.
4. You have to write the entropy function from scratch on your own.
5. Your entropy function should take a probability value for one of the two classes as input, and output its entropy value.
6. Call your entropy function using .2 as input.
   **Print** the output of the function call as "The entropy value of probability .2 is xxx" -
   Replace xxx with the appropriate value
7. Call your entropy function using .8 as input.
   **Print** the output of the function call as "The entropy value of probability .8 is xxx" -
   Replace xxx with the appropriate value
8. Call your entropy function using .5 as input.
   **Print** the output of the function call as "The entropy value of probability .5 is xxx" -
   Replace xxx with the appropriate value
'''

In [1]:
import math

def entropy(p):
    if p == 0 or p == 1:
        return 0
    else:
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Call the entropy function with .2 as input
entropy_value = entropy(0.2)
print(f"The entropy value of probability .2 is {entropy_value}")  # use {entropy_value:.3f} for 3 decimal places

The entropy value of probability .2 is 0.7219280948873623


In [2]:
# call the entropy function with 0.8 as input

entropy_value = entropy(0.8)
print(f"The entropy value of probability .8 is {entropy_value}")

The entropy value of probability .8 is 0.7219280948873623


In [3]:
# call the entropy function with 0.5 as input

entropy_value = entropy(0.5)
print(f"The entropy value of probability .5 is {entropy_value}")

The entropy value of probability .5 is 1.0
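
The three outputs above suggest two properties of binary entropy: it is symmetric in p and 1 - p, and it peaks at p = 0.5. The following is a minimal sketch that checks both using only the math module. The name `binary_entropy` and the 0.01 grid are illustrative choices, not part of the assignment.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-class distribution (p, 1 - p)."""
    if p in (0, 1):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Symmetry: swapping the two class labels leaves the entropy unchanged.
symmetric = math.isclose(binary_entropy(0.2), binary_entropy(0.8))

# Scan a grid of probabilities (step 0.01) and find where entropy is largest.
grid = [i / 100 for i in range(101)]
p_max = max(grid, key=binary_entropy)

print(symmetric, p_max)
```

The guard for p equal to 0 or 1 is needed because math.log2(0) raises a ValueError, while the limit of p * log2(p) as p approaches 0 is 0.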


# Building an RNN for time series prediction
'''
RNN for Time Series Prediction

The goal is to predict log_volume using lagged data.

Step 0: Load Libraries
'''

In [16]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Suppress TF warnings

import random
import numpy as np
import tensorflow as tf
import pandas as pd

random.seed(1693)
np.random.seed(1693)
tf.random.set_seed(1693)
# FREEZE CODE END

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.layers import LSTM
#from tensorflow.keras.layers import TimeDistributed
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.preprocessing import StandardScaler

'''
Step 1: Load & Prep Dataframes


Load the NYSE dataset from the NYSE.csv file available in the file tree to the left.


The date column gives you the timestamps of the time series.


The train column indicates True for records to be used in the train set, and False for those to be used in the test set.

For this step, let's keep only these 3 columns: 'DJ_return', 'log_volume', 'log_volatility'

Standardize all 3 columns using ScikitLearn's StandardScaler.

In the starter code given below:

  - cols is a list of the names of these 3 columns.
    
  - X is a dataframe that contains only these 3 columns from NYSE.csv.
    
** Print "0. the shape of dataframe X: xxx" - Replace xxx with the proper value

** Print "1. the first record of dataframe X: xxx" - Replace xxx with proper values
'''

In [5]:

NYSE = pd.read_csv('NYSE.csv')

cols = ['DJ_return', 'log_volume', 'log_volatility']
# Standardize the 3 columns (zero mean, unit variance) and keep the original index
X = pd.DataFrame(
    StandardScaler(with_mean=True, with_std=True).fit_transform(NYSE[cols]),
    columns=NYSE[cols].columns,
    index=NYSE.index)

print(f"0. The shape of the dataframe X: {X.shape}")  # Print the shape of the dataframe X
print(f"1. The first record of dataframe X: {X.iloc[0].values}")  # Print the first record of dataframe X

0. The shape of the dataframe X: (6051, 3)
1. The first record of dataframe X: [-0.54982334  0.17507497 -4.35707786]
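
As a rough illustration of what the standardization step does, the sketch below computes column-wise z-scores by hand with NumPy. StandardScaler uses the population standard deviation (ddof=0), which is also the default of `np.std`. The toy array is made up for illustration and is not taken from NYSE.csv.

```python
import numpy as np

# Toy data: 4 rows, 2 columns (hypothetical values, not from NYSE.csv)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0]])

# Column-wise z-score: subtract the column mean, divide by the
# population standard deviation (np.std defaults to ddof=0)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```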


'''
Use code from the textbook lab to set up lagged versions of these 3 data columns (using the starter code given to you here.)

Add column 'train' from the original dataset to the current dataframe X as the last column (to the right).

** Print "2. the shape of dataframe X with lags: xxx" - Replace xxx with the proper value.

** Print "3. the first record of the data frame with lags: xxx" - Replace xxx with proper values.
'''

In [6]:
for lag in range(1, 6):
    for col in cols:
        newcol = np.zeros(X.shape[0]) * np.nan
        newcol[lag:] = X[col].values[:-lag]
        X.insert(len(X.columns), "{0}_{1}".format(col , lag), newcol)
X.insert(len(X.columns), 'train', NYSE['train'])


# Print the shape of the dataframe X with lags
print(f"2. The shape of dataframe X with lags: {X.shape}")

# Print the first record of the dataframe with lags
print(f"3. The first record of the dataframe with lags: {X.iloc[0].values}")

2. The shape of dataframe X with lags: (6051, 19)
3. The first record of the dataframe with lags: [-0.549823342766741 0.17507496702556102 -4.357077862202008 nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan True]
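
The NaN values in the first record come from the lagging itself: a lag of k has no source value for the first k rows. The sketch below reproduces the manual slicing on a made-up one-column frame and compares it with pandas' `Series.shift`, which appears to give the same result here. The names `toy` and `shifted` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame standing in for a standardized feature
toy = pd.DataFrame({"x": [10.0, 11.0, 12.0, 13.0]})
lag = 2

# Manual approach, mirroring the cell above: start with all NaN,
# then copy the values shifted down by `lag` rows
newcol = np.zeros(toy.shape[0]) * np.nan
newcol[lag:] = toy["x"].values[:-lag]

# pandas alternative: shift moves values down and pads the top with NaN
shifted = toy["x"].shift(lag)

same = np.allclose(newcol, shifted.to_numpy(), equal_nan=True)
print(newcol, same)
```

With 5 lags, the first 5 rows of the real dataframe contain at least one NaN each, which is why they are removed by dropna() in the next step.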


Drop any rows with missing values using the dropna() method.

Print "4. the shape of dataframe X with lags: xxx" - Replace xxx with the proper value

Print "5. the first record of dataframe X with lags: xxx" - Replace xxx with proper values

In [7]:
X = X.dropna()

# Print the shape of the dataframe X with lags
print(f"4. The shape of dataframe X with lags: {X.shape}")

# Print the first record of the dataframe with lags
print("5. the first record of dataframe X with lags:", X.iloc[0])

4. The shape of dataframe X with lags: (6046, 19)
5. the first record of dataframe X with lags: DJ_return          -1.304126
log_volume          0.605918
log_volatility     -1.366028
DJ_return_1          0.04634
log_volume_1        0.224779
log_volatility_1    -2.50097
DJ_return_2        -0.431397
log_volume_2        0.935176
log_volatility_2   -2.366521
DJ_return_3         0.434813
log_volume_3        2.283789
log_volatility_3   -2.418037
DJ_return_4           0.9052
log_volume_4        1.517291
log_volatility_4   -2.529058
DJ_return_5        -0.549823
log_volume_5        0.175075
log_volatility_5   -4.357078
train                   True
Name: 5, dtype: object


Create the Y response target using the 'log_volume' column from dataframe X.

Extract the 'train' column from dataframe X as a separate variable called train. Drop the 'train' column from dataframe X.

Later on we will use the train variable to split the dataset into train vs. test.

Drop the current day’s DJ_return (the "DJ_return" column) and log_volatility from dataframe X.

- Current day refers to the non-lagged columns of these two variables.
- In other words, remove these two non-lagged features, along with the non-lagged log_volume column that now serves as the Y response, so that dataframe X contains only lagged columns.

Print "6. the first 3 records of the Y target : xxx" - Replace xxx with proper values.

Print "7. the first 3 records of the train variable: xxx" - Replace xxx with proper values.

Print "8. the first 3 records of dataframe X: xxx" - Replace xxx with proper values.

In [8]:
# Create the Y target
#Y = X['log_volume']

## extract the 'train' column from X
#train = X['train']

Y, train = X['log_volume'], X['train']

# Drop 'train' and the three current-day (non-lagged) columns so X keeps only lagged features
X = X.drop(columns=['train'] + cols)


# Print the first 3 records of the Y target
print("6. the first 3 records of the Y target:", Y.iloc[0:3])

# Print the first 3 records of the train variable
print("7. the first 3 records of train variable:", train.iloc[0:3])

# Print the first 3 records of dataframe X
print("8. the first 3 records of dataframe X:", X.iloc[0:3])


6. the first 3 records of the Y target: 5    0.605918
6   -0.013661
7    0.042552
Name: log_volume, dtype: float64
7. the first 3 records of train variable: 5    True
6    True
7    True
Name: train, dtype: bool
8. the first 3 records of dataframe X:    DJ_return_1  log_volume_1  log_volatility_1  DJ_return_2  log_volume_2  \
5     0.046340      0.224779         -2.500970    -0.431397      0.935176   
6    -1.304126      0.605918         -1.366028     0.046340      0.224779   
7    -0.006294     -0.013661         -1.505667    -1.304126      0.605918   

   log_volatility_2  DJ_return_3  log_volume_3  log_volatility_3  DJ_return_4  \
5         -2.366521     0.434813      2.283789         -2.418037     0.905200   
6         -2.500970    -0.431397      0.935176         -2.366521     0.434813   
7         -1.366028     0.046340      0.224779         -2.500970    -0.431397   

   log_volume_4  log_volatility_4  DJ_return_5  log_volume_5  log_volatility_5  
5      1.517291         -2.529058 

To fit the RNN, we must reshape the X dataframe, as the RNN layer will expect 5 lagged versions of each feature as indicated by the (5,3) shape of the RNN layer below.

We first ensure the columns of our X dataframe are such that a reshaped matrix will have the variables correctly lagged.

We use the reindex() method to do this.

The RNN layer also expects the first row of each observation to be earliest in time.

So we must reverse the current order.

Print "9. the first 3 records of X after reindexing: xxx" - Replace xxx with proper values.

In [9]:
ordered_cols = []
for lag in range(5,0,-1):
    for col in cols:
        ordered_cols.append('{0}_{1}'.format(col , lag))
X = X.reindex(columns=ordered_cols)
print("9. the first 3 records of X after reindexing:", X.iloc[0:3])

9. the first 3 records of X after reindexing:    DJ_return_5  log_volume_5  log_volatility_5  DJ_return_4  log_volume_4  \
5    -0.549823      0.175075         -4.357078     0.905200      1.517291   
6     0.905200      1.517291         -2.529058     0.434813      2.283789   
7     0.434813      2.283789         -2.418037    -0.431397      0.935176   

   log_volatility_4  DJ_return_3  log_volume_3  log_volatility_3  DJ_return_2  \
5         -2.529058     0.434813      2.283789         -2.418037    -0.431397   
6         -2.418037    -0.431397      0.935176         -2.366521     0.046340   
7         -2.366521     0.046340      0.224779         -2.500970    -1.304126   

   log_volume_2  log_volatility_2  DJ_return_1  log_volume_1  log_volatility_1  
5      0.935176         -2.366521     0.046340      0.224779         -2.500970  
6      0.224779         -2.500970    -1.304126      0.605918         -1.366028  
7      0.605918         -1.366028    -0.006294     -0.013661         -1.505667

Reshape dataframe X into a 3-D NumPy array in which each record has the shape (5,3): 5 lagged time steps (earliest first) by 3 variables.

Print "10. the shape of X after reshaping: xxx" - Replace xxx with proper values.

Print "11. the first 2 records of X after reshaping: xxx" - Replace xxx with proper values.

In [10]:
# -1 lets NumPy infer the number of observations; 5 is the number of lags (time steps), 3 the number of variables
X_rnn = X.to_numpy().reshape((-1, 5, 3))

print(f"10. the shape of X after reshaping: {X_rnn.shape}")
print("11. the first 2 records of X after reshaping:", X_rnn[0:2])

10. the shape of X after reshaping: (6046, 5, 3)
11. the first 2 records of X after reshaping: [[[-0.54982334  0.17507497 -4.35707786]
  [ 0.90519995  1.51729071 -2.52905765]
  [ 0.43481275  2.28378937 -2.41803694]
  [-0.43139673  0.93517558 -2.36652094]
  [ 0.04634026  0.22477858 -2.5009701 ]]

 [[ 0.90519995  1.51729071 -2.52905765]
  [ 0.43481275  2.28378937 -2.41803694]
  [-0.43139673  0.93517558 -2.36652094]
  [ 0.04634026  0.22477858 -2.5009701 ]
  [-1.30412619  0.60591805 -1.366028  ]]]
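
To see why the column order matters for the reshape, the sketch below encodes each cell of one flat record as lag * 10 + variable index, laid out in the same lag 5 to lag 1 order produced by the reindexing step. After reshaping, each row of the (5,3) matrix holds the 3 variables for one time step, oldest first. The encoding is purely illustrative.

```python
import numpy as np

n_lags, n_vars = 5, 3

# One flat record of 15 values in column order lag5..lag1, 3 variables per lag.
# Each value is lag * 10 + variable index, so 52 means "lag 5, variable 2".
row = np.array([lag * 10 + var
                for lag in range(n_lags, 0, -1)
                for var in range(n_vars)])
flat = row.reshape(1, -1)  # shape (1, 15), like one row of dataframe X

# Row-major reshape: consecutive groups of 3 values become one time step
cube = flat.reshape((-1, n_lags, n_vars))

print(cube.shape)
print(cube[0])
```

If the columns were left in lag 1 to lag 5 order, the same reshape would put the most recent step first, which is the ordering the reindexing step is meant to avoid.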


Now we are ready for RNN modeling.

Set up your X_train, X_test, Y_train, and Y_test using the X dataframe, Y response target, and the train variable you have created above.

Include records where train = True in the train set, and train = False in the test set.

Configure a Keras Sequential model with:

(1) proper input shape,

(2) SimpleRNN layer with 12 hidden units, the relu activation function, and 10% dropout

(3) a proper output layer.

Do not name the model or any of the layers.
Print a summary of your model.

In [11]:
# train is a boolean Series; ~ inverts it to select the test records
X_train = X_rnn[train]
X_test = X_rnn[~train]
Y_train = Y[train]
Y_test = Y[~train]
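
The split relies on boolean-mask indexing. The sketch below shows the idea on made-up arrays, using `~` as the logical NOT for boolean arrays. The names `features`, `target` and `is_train` are hypothetical stand-ins for X_rnn, Y and train.

```python
import numpy as np

# Hypothetical stand-ins for X_rnn, Y and train
features = np.arange(12).reshape(4, 3)
target = np.array([0.1, 0.2, 0.3, 0.4])
is_train = np.array([True, True, False, True])

# Rows where the mask is True go to the train set; ~ flips the mask for the test set
X_tr, X_te = features[is_train], features[~is_train]
y_tr, y_te = target[is_train], target[~is_train]

print(X_tr.shape, X_te.shape)
```

On a plain NumPy boolean array, unary minus raises a TypeError, so `~` is the safer spelling when the mask may be either a pandas Series or an array.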

In [12]:
model = Sequential()
model.add(SimpleRNN(units=12, activation='relu', input_shape=(5, 3), dropout = 0.1))
model.add(Dense(1, activation ="linear"))
model.summary()
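
For intuition about what this model computes, here is a NumPy sketch of a SimpleRNN forward pass followed by a one-unit linear layer. It assumes the standard recurrence h_t = relu(x_t · W + h_{t-1} · U + b). The weights are random placeholders rather than trained values, and dropout is ignored because it only applies during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_features, n_units = 5, 3, 12

# Placeholder weights with the same shapes as the layers above (not trained values)
kernel = rng.normal(size=(n_features, n_units))        # input -> hidden
recurrent_kernel = rng.normal(size=(n_units, n_units)) # hidden -> hidden
bias = np.zeros(n_units)
dense_w = rng.normal(size=(n_units, 1))                # hidden -> output
dense_b = np.zeros(1)

def simple_rnn_forward(seq):
    """Run one (n_steps, n_features) sequence through the recurrence, oldest step first."""
    h = np.zeros(n_units)
    for x_t in seq:
        h = np.maximum(0.0, x_t @ kernel + h @ recurrent_kernel + bias)  # relu
    return h @ dense_w + dense_b  # linear output layer

# Parameter counts implied by these shapes
rnn_params = kernel.size + recurrent_kernel.size + bias.size  # 36 + 144 + 12
dense_params = dense_w.size + dense_b.size                    # 12 + 1

prediction = simple_rnn_forward(rng.normal(size=(n_steps, n_features)))
print(rnn_params, dense_params, prediction.shape)
```

These shapes give 192 parameters for the recurrent layer and 13 for the output layer, 205 in total, which is what the model summary should report for this architecture.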



'''
Compile the model with

(1) the adam optimizer,

(2) MSE as the loss,

(3) MSE as the metric.

Fit the model with

(1) 200 epochs,

(2) batch size of 32.
'''

In [14]:
model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=['mean_squared_error'])

history = model.fit(X_train, Y_train, batch_size=32, epochs=200, validation_data=(X_test, Y_test), verbose=0)

'''
Evaluate the model using model.evaluate() with the test set

** Print "13. Test MSE: xxx" - Replace xxx with the proper value.
'''

In [15]:
# evaluate() returns the loss followed by the compiled metrics; both are MSE here
test_loss, test_mse = model.evaluate(X_test, Y_test)
print("13. Test MSE:", test_mse)

56/56 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.4959 - mean_squared_error: 0.4935
13. Test MSE: 0.6403339505195618
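
For reference, the metric reported above is the mean of the squared differences between targets and predictions. The sketch below computes it by hand on made-up numbers. The function name and the values are illustrative and unrelated to the NYSE results.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions: errors are 0, 0 and 2
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]

mse = mean_squared_error(y_true, y_pred)  # (0 + 0 + 4) / 3
print(mse)
```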
