## The XOR example (Fig. 6.6)

Here is a Python implementation of the XOR neural net drawn in Figure 6.6 of Ch. 6.

Work through the discussion that follows the code, then answer the questions at the end.

The `xor` function in the cell below implements the xor NN in Fig. 6.6 (the same weights,
the same number of layers, but only one ReLU call, applied elementwise to the hidden layer,
which seems to be all that's needed).

The `xor2` function in the cell below is equivalent, but uses
NumPy arrays (computational implementations of the matrices in your 
Linear Algebra class) to contain and apply the weights and biases.  The transition
from `xor` to `xor2` requires some minor changes to input and output, illustrated below.

In [35]:
import numpy as np

def ReLU(z):
    """
    Note: if z is an array or sequence, this will broadcast 0 to the same shape and
    zero out the negative elements, as desired for ReLU.
    
    Example: np.maximum([1,-1],0) returns [1,0]
    """
    return np.maximum(z,0)
    # The following variant works only for scalar z
    #return max (z,0)
    
#######################################################
#
#  X O R    N e u r a l    N e t  (Version 1)
#
#######################################################

# Map to hidden layer
w1 = np.array([1,1])
w2 = np.array([1,1])
b = np.array([0,-1])
# Map to output layer
o = np.array([1,-2])
b2 = 0

def xor (x,verbose=0):
    """
    The caption of Fig. 6.6 says to use 3 ReLUs, but one 
    ReLU applied to the hidden layer implements two of those.
    
    And a ReLU on the output layer is unneeded.
    """
    # The hidden layer:  2 dot products with biases added
    #h[0] = w1[0]*x[0] + w1[1]*x[1] + b[0]
    #h[1] = w2[0]*x[0] + w2[1]*x[1] + b[1]
    h = np.array([w1.dot(x), w2.dot(x)]) + b
    h = ReLU(h)
    if verbose > 1:
        print("*** x: ",x[0],x[1])
    if verbose > 0:
        print("*** h: ",h[0],h[1])
    # (o[0]*h[0]) + (o[1]*h[1]) + b2
    return  o.dot(h) + b2


#######################################################
#
#  X O R    N e u r a l    N e t  (Matrix Version)
#
#######################################################
# A 2D array whose rows are w1 and  w2
# b will be unchanged
H = np.array([w1,w2])
# Output layer
# O is 1x2 2D array whose single row is 1D array o
# This way all our LEARNED parameters are in 2D arrays.  
# b2 is unchanged.
O = np.array([o])

def xor2 (x,verbose=0,linear=False):
    """
    This version represents the hidden layer weights + bias and
    the output weights + bias as matrices.
    """
    # Hidden layer matrix
    h = H@x + b
    if not linear:
        # Add the teensy bit of nonlinearity here
        h = ReLU(h)
    if verbose > 1:
        print("*** x: ",x[0],x[1])
    if verbose > 0:
        print("*** h: ",h[0],h[1])
    return O@h + b2



In [250]:
# Our data is generally also going to be in an array
X = np.array([[1,1],[1,0],[0,1],[0,0]])
# Verbose=2 prints out x (the input) and the hidden layer output
for x in X:
    # Iterate through the ROWS of X
    print(xor(x,verbose=2))

*** x:  1 1
*** h:  2 1
0
*** x:  1 0
*** h:  1 0
1
*** x:  0 1
*** h:  1 0
1
*** x:  0 0
*** h:  0 0
0


The function `xor2` is equivalent in behavior 
to `xor` but outputs a 1D array with shape (1,) instead of a scalar.

Have a look at its definition and make sure you understand how it's equivalent, because it is written
much more in the matrix-programming style that is typical for NNs.

In [251]:
for x in X:
    print(xor2(x,verbose=2))

*** x:  1 1
*** h:  2 1
[0]
*** x:  1 0
*** h:  1 0
[1]
*** x:  0 1
*** h:  1 0
[1]
*** x:  0 0
*** h:  0 0
[0]
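
One way to convince yourself of the equivalence is to check, input by input, that the matrix product reproduces the two separate dot products and that the two outputs differ only in shape. The sketch below re-declares the same weights so that it runs on its own; the names `by_hand`, `as_matrix`, `scalar_out`, and `array_out` are introduced here purely for illustration.

```python
import numpy as np

# Same weights and biases as in the cells above
w1 = np.array([1, 1])
w2 = np.array([1, 1])
b = np.array([0, -1])
o = np.array([1, -2])
b2 = 0

H = np.array([w1, w2])   # shape (2, 2): rows are w1 and w2
O = np.array([o])        # shape (1, 2): single row is o

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
for x in X:
    by_hand = np.array([w1.dot(x), w2.dot(x)])   # two separate dot products
    as_matrix = H @ x                            # one matrix-vector product
    assert np.array_equal(by_hand, as_matrix)

    h = np.maximum(as_matrix + b, 0)             # ReLU, as in both functions
    scalar_out = o.dot(h) + b2                   # the value xor returns
    array_out = O @ h + b2                       # the value xor2 returns
    assert array_out.shape == (1,)
    assert array_out[0] == scalar_out
```

Because `O` is a 1x2 2D array rather than a 1D array, `O @ h` keeps one dimension, which is where the shape `(1,)` comes from.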


We need our ReLU to get the right result.  Here's what happens if we take it out:

In [236]:
for x in X:
    print(xor2(x,verbose=2,linear=True))

*** x:  1 1
*** h:  2 1
[0]
*** x:  1 0
*** h:  1 0
[1]
*** x:  0 1
*** h:  1 0
[1]
*** x:  0 0
*** h:  0 -1
[2]


Because of the use of ReLU, the transformation to the first layer ($\mathbf{x} \mapsto \mathbf{h}$) is not a linear map. Let's call that transformation L.
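
To see what the `linear=True` run above is actually computing, it may help to note that two stacked affine layers without an activation collapse into a single affine map. The following is a minimal sketch using the same weights; `W_collapsed` and `b_collapsed` are names made up for this illustration.

```python
import numpy as np

H = np.array([[1, 1], [1, 1]])
b = np.array([0, -1])
O = np.array([[1, -2]])
b2 = 0

# Composing two affine maps gives one affine map:
#   O @ (H @ x + b) + b2  ==  (O @ H) @ x + (O @ b + b2)
W_collapsed = O @ H          # [[-1, -1]]
b_collapsed = O @ b + b2     # [2]

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
for x in X:
    two_layer = O @ (H @ x + b) + b2
    one_layer = W_collapsed @ x + b_collapsed
    assert np.array_equal(two_layer, one_layer)
    print(x, one_layer)
```

With these weights the collapsed model is just $2 - x_1 - x_2$, which is why the input `[0, 0]` produces 2 rather than 0 once the ReLU is removed.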

## Questions

1.  [2 points] Implement the `and` and `or` NNs shown in Fig. 6.4 as matrices and biases.  All you need to do is 
    consult the figure and define the matrix and bias so that the loop in the solution stub for this exercise works. 
    For example, if `and_M` is the matrix for `and` and `and_b` is the corresponding bias:
    
    ```python
    x = [1,0]
    and_M@x + and_b
    ```
    
    returns 
    
    ```python
    -1
    ```
    
    After applying equation 6.7 this is converted to the right value, 0.
    Use the solution stub below to test it on all 4 combinations of 1 and 0.
 
2.  [2 points] Define `nand` as a matrix and bias. `a nand b`, that is $\text{not} (a \& b)$,
    written as $a \mid b$, flips the 1s and 0s in the truth table of `&`, so
    it behaves as follows:
    
    $$
    \begin{array}{cc|c}
    a  &  b &  a\mid b\\
    \hline
    1 & 1 & 0\\
    1 & 0 &  1\\
    0 & 1 &  1\\
    0 & 0 &  1
    \end{array}
    $$
    
    Your `nand` matrix and `nand` bias should work just the way `and` and `or` did.
    Use a copy of the solution stub for Exc. 1 for testing.
    
3.  [2 points] The following  is a fact from a logic class: Using `|` for `nand` (as we did in the  previous 
    problem), and $\vee$ for `or`, `xor` can be defined as 

    $$
    (3.1) \; a\text{ xor } b = (a\,\vee \,b) \;\& \;(a \,\mid\, b)
    $$

    You should check that this has the right truth table.  We have linear definitions for all the operators on 
    the right hand side ($\&$, $\vee$, $\mid$).  Don't we now have a linear definition for `xor`?  Define a 
    function that implements the idea of definition 3.1 above, using the
    matrices and biases for $\&$, $\vee$, and $\mid$ that you already have.  Test it. What happens? 
    You can fix your `xor` function by asking just what values `and` is getting from `or` and `nand`, and 
    adjusting them with a nonlinear activation function $\tau$.
    
    
    $$
    (3.2)\; a\text{ xor } b = \tau(a\,\vee \,b) \;\& \;\tau(a \,\mid\, b)
    $$
    
    Hint:  The easiest answer is not to use a standard activation function
    (like the ones in the textbook or in the PyTorch `torch.nn.functional` module)
    but to choose one that makes the `or` and `nand` models return exactly
    what they returned in Exc. 1 and 2.

4.  **a.**  [6 points] Code up a `numpy` version of the toy language model in 
    slide 67 of the Jurafsky and Martin slides for the NN chapter 
    (or, for the same diagram, see Figure 6.14 in the Jurafsky and Martin reading for the NN module).
    This diagram defines a **language model**, a model that predicts the next word on the basis
    of the words before it.  Language models have been our focus from the first day 
    of the course.
    
    To help you get started in implementing the diagram, a solution-stub cell has been provided below.  It 
    defines a vocabulary, an example to run, and a function `get_input_layer` that maps from an example 
    string to the input layer in the diagram.
    
    The solution stub defines your vocabulary words as
    
    ```python
    vocab = np.array('a across and boy dog his run runs the yard'.split())
    ```
    
    The model in the diagram uses 1-hot encoding vectors for input words.  
    
    The context used to predict the next word is N words long. Therefore the input to the learner is a 
    sequence of N one-hot vectors,  which means that the layer labeled `input_layer` in the diagram
    is an NxV array (execute the code in the solution stub and the example cell below it to see this).
    For concreteness, we assume that N = 4.
    
    To follow the diagram, you must define  a separate 2D array `E` which contains embeddings for all
    the words in the vocabulary.  Given the one-hot encodings, the mapping from the
    `input_layer` to the `embedding_layer` can be partially accomplished via a matrix multiplication with `E`
    (Fig. 6.11 in Jurafsky and Martin). This will return the 4 embeddings for the input words in a 2D array. 
    Let's call that 2D array `input_embeddings`.  To construct `embedding_layer` we concatenate  the N vectors 
    in `input_embeddings`.  To code this, use:
    
    ```python
    embedding_layer = np.concatenate(input_embeddings)
    ```
   
    Write the NN as a function called `LM_NN` (Language Model Neural Net),
    which takes the input layer as its argument and  returns the corresponding output layer.  
    It should use the architecture of the NN in the diagram, with these modifications to the hyperparameters:
    
    **Assume the embedding dimensionality used in the embedding layer is 5 ($d = 5$; the diagram uses 3),
    the hidden layer has dimensionality 9 ($d_h = 9$), and the size of the input sequence is 4 words ($N = 4$;
    the diagram uses 3).**
    
    You should define 2D arrays of the right shape for `E`, `W`, and `U`, as in the diagram.  
    Initialize these arrays with random numbers.  To do that, and yet have reproducible results,
    use the `numpy` random number generator, which supports a number of random-generation tasks:
    
    ```python
    seed = 48
    rng = np.random.default_rng(seed=seed)
    rng.random((m, n))
    ```
    
    The last line returns an array of shape m x n filled with random numbers.
    
    In writing `LM_NN`, make your work clear by using the layer names in the diagram.  Assign appropriate values
    for `embedding_layer`, `hidden_layer`, and `output_layer`.  The value of `input_layer` has
    already been computed for you for an example (in the example cell following the solution stub).
    
    In order to define `output_layer` correctly, you will need to do a **softmax**.  You can use  
    `scipy.special.softmax` for your softmax function or write your own.
    
    **b.** [1 point] For fun,  apply this utterly untrained language model to the appropriate input
    sequence for `a boy and his`. That input has been supplied in the solution stub.
    
    **c.** [3 points] Write a function to compute the loss, using **Negative Log Likelihood** as your loss 
    function.  See the slides entitled *Information_theory.pdf*.  Compute the loss for the `a boy and his`-input
    when the actual next word is *dog*.  
    
    **d.** [1 point] Use the output layer returned by `LM_NN` to compute the predicted next word for the 
    `a boy and his`-input.
    
    **e.** [1 point] If the LM predicts the next word correctly, is the loss always 0? If not, is it ever 0?
    If the loss can ever be 0, under what circumstances can it be 0?
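
The one-hot-to-embedding step described in Exc. 4a can be sketched on a toy example. The sizes and values below are made up for illustration and are not the ones the exercise asks for. The sketch also assumes that embeddings are stored as rows (a V x d layout), which may be the transpose of the convention in the diagram. All `toy_` names are hypothetical.

```python
import numpy as np

# Toy sizes chosen only for illustration (not the sizes required in Exc. 4)
toy_V, toy_d = 3, 2

# Hypothetical embedding matrix: one row of length toy_d per vocabulary word
toy_E = np.array([[0.1, 0.2],
                  [0.3, 0.4],
                  [0.5, 0.6]])

# One-hot rows for the word sequence (word 2, word 0)
toy_input = np.array([[0., 0., 1.],
                      [1., 0., 0.]])

# Each one-hot row picks out the matching row of toy_E
toy_embeddings = toy_input @ toy_E        # shape (2, toy_d)
assert np.allclose(toy_embeddings, toy_E[[2, 0]])

# Concatenating the rows gives a single 1D vector of length 2 * toy_d
toy_layer = np.concatenate(toy_embeddings)
print(toy_layer.shape)    # (4,)
```

Multiplying by a one-hot vector is just a row lookup, so the same result could be obtained by indexing; the matrix form is shown because it matches the diagram.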

##  Solution stub for Excs 1 and 2

In [None]:
import numpy as np
# Define M and b with the appropriate values for the Boolean operator `and` or `or` or `nand`
# Then test your definition here. Use one copy of this cell for each Boolean
# operator you've defined.
M,b = ??,??
X = np.array([[1,1],[1,0],[0,1],[0,0]])

def PosU (x):
    """
    This is Eqn 6.7
    """
    return int(x>0)

for x in X:
        v = (M@x+b)[0]   ## Put your matrix and bias calculation here
        # This implements the output rule in Eqn 6.7
        o = PosU(v)
        print(f"{x} {v:> 2} {o:>2}")

## Solution stub for Exc 4

In [186]:
##########  Vocab and utilities ##########################
def get_input_layer (doc):
    """
    This constructs the 2D one-hot encoding matrix
    for the input doc.  It assumes the input sequence
    length N is defined at run time.
    """
    wds = doc.split()
    M = len(wds)
    assert M==N, f"The input sequence must be of length {N}, not {M}!"
    input_layer = np.zeros((N,V))
    for (i,wd) in enumerate(wds):
        input_layer[i,word2index[wd]] = 1
    return input_layer

vocab = np.array('a across and boy dog his run runs the yard'.split())
V = len(vocab)   # vocabulary size, used by get_input_layer
word2index = {w:i for (i,w) in enumerate(vocab)}
##########  Vocab and utilities ##########################


#############  LEARNING PARAMS OF THE MODEL  ###############
##  Your initializations of E, W, and U go here
##  Defining these depends on certain hyperparameters
##  discussed in the instructions, such as N and d.

############################################################

def LM_NN (input_layer):
    """
    Given the input `input_layer`
    operates on globals E, W, and U
    to produce and return the
    output `output_layer`.
    """
    ##  YOUR CODE GOES HERE
    return output_layer



################################################################


In [187]:
######### Example input ###################################
N=4
input_doc,next_word = "a boy and his","dog"
input_layer = get_input_layer (input_doc)
print(input_layer)
###########   Example input ################################


[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]
