1. Single-layer XOR.
-----------
Adapt the sample code from HW0 to train a single perceptron on XOR data. Examine the resulting prediction, weights, and history, and comment on the success of the training.

### Answer : 

In [1]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
import numpy as np
np.random.seed(237)


Using TensorFlow backend.


In [2]:
dataset = np.array([[0,0,0], [0,1,1], [1,0,1], [1,1,0]])
in_data = dataset[:,0:2]
out_data = dataset[:,2]

In [3]:
model = Sequential()
model.add(Dense(1, input_dim=2, activation='sigmoid'))

In [4]:
sgd = SGD(lr=0.1)
model.compile(loss='mse', optimizer=sgd)
hist = model.fit(in_data, out_data, batch_size=1, epochs=1000, verbose=0)
#hist = model.fit(in_data, out_data, batch_size=1, epochs=1000, verbose=2)

In [5]:
print(model.predict_proba(in_data))

[[ 0.5033198 ]
 [ 0.500395  ]
 [ 0.50065207]
 [ 0.49772722]]


### Let's check weights


In [6]:
for layer in model.layers:
    print(layer.get_weights())

[array([[-0.01067128],
       [-0.01169946]], dtype=float32), array([ 0.01327952], dtype=float32)]


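### The output is clearly not what we expected
* Instead of (0, 1, 1, 0), we got approximately (0.50, 0.50, 0.50, 0.50).
* The weights and the bias all stayed close to zero, so the perceptron outputs roughly 0.5 for every input.
* This is expected, because the XOR outputs are not linearly separable and a single perceptron can only draw one linear decision boundary.
* Hence we need a hidden layer to make it work.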
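
As a quick check of the points above, here is a minimal NumPy sketch (not part of the original notebook). It recomputes the perceptron outputs from the weights printed in `In [6]`, then tries a coarse grid of weights to see how many XOR rows a single linear threshold unit can classify correctly. The grid range and step are arbitrary choices made for illustration.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Weights and bias copied from the layer.get_weights() output above.
w = np.array([-0.01067128, -0.01169946])
b = 0.01327952

# Forward pass of the single sigmoid unit: every value lands near 0.5.
pred = sigmoid(x @ w + b)
print(pred)

# Brute-force search over a coarse grid (assumed range [-5, 5], step 0.5):
# count how many of the four XOR rows a threshold unit gets right.
grid = np.linspace(-5, 5, 21)
best = 0
for w1, w2, b0 in itertools.product(grid, repeat=3):
    guess = (x @ np.array([w1, w2]) + b0 > 0).astype(int)
    best = max(best, int((guess == y).sum()))
print(best)
```

The search never gets more than three of the four rows right, which is what non-separability predicts: no single line splits the XOR points correctly.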



2. XOR with a hidden layer of 2 nodes.
-------------
Repeat the training process with a multilayer perceptron, using two nodes in a hidden layer.
* Identify the output weights, and describe what is happening with the hidden nodes and the output node.
* Do the fit process at least a few more times, starting with new models and random seeds. What is different or the same about each of these experiments?
* For two of the runs that appear substantially different in their weights, identify the logical function that each intermediary node is computing, as well as the output node. (For a network like this, with two inputs and one output per node, you should be able to write out a logic sentence or fill in a table of binary values.)

### Answer : 

### Model - 1
* Using the same model as given with additional layer

In [12]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
import numpy as np
from keras.utils import plot_model
np.random.seed(237)


In [13]:
dataset = np.array([[0,0,0], [0,1,1], [1,0,1], [1,1,0]])
in_data = dataset[:,0:2]
out_data = dataset[:,2]

In [14]:
model = Sequential()
model.add(Dense(2, input_dim=2, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

In [15]:
sgd = SGD(lr=0.1)
model.compile(loss='mse', optimizer=sgd)
hist = model.fit(in_data, out_data, batch_size=1, epochs=10000, verbose=0)
#hist = model.fit(in_data, out_data, batch_size=1, epochs=1000, verbose=2)

In [16]:
print(model.predict_proba(in_data))

[[ 0.03164677]
 [ 0.97265345]
 [ 0.9728266 ]
 [ 0.02818649]]


In [21]:
for layer in model.layers:
    print(layer.get_weights())

[array([[-1.17854929,  0.03321838],
       [-0.42558348, -0.94829679]], dtype=float32), array([ 0.,  0.], dtype=float32)]
[array([[ 1.09355223],
       [ 1.37676966]], dtype=float32), array([ 0.], dtype=float32)]


### The prediction is clearly much better
* We get approximately (0.03, 0.97, 0.97, 0.03), which is close to the expected values (0, 1, 1, 0).

* The first line sets up an empty model using the Sequential API.
* Then we add a Dense layer to the model.
* We set input_dim=2 because each of our input samples is an array of length 2 ([0, 1], [1, 0] etc.).
* The dimension of the output for this layer is 2. In terms of neurons, this means that we have two input neurons (input_dim=2) feeding into 2 neurons in a so-called hidden layer.
* We also add another layer with an output dimension of 1 and without an explicit input dimension. In this case the input dimension is implicitly 2, since that is the output dimension of the previous layer.
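
To make the layer dimensions described above concrete, the following sketch counts the trainable values in a stack of fully connected layers. The helper names `dense_params` and `mlp_params` are made up for this illustration and are not Keras functions.

```python
def dense_params(n_in, n_out):
    """Trainable values in one fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Total trainable values for consecutive dense layers, e.g. [2, 2, 1]."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

# 2 inputs -> 2 hidden nodes -> 1 output, as in the model above.
print(mlp_params([2, 2, 1]))  # 6 in the hidden layer + 3 in the output layer = 9
```

These nine values correspond to the two weight matrices and two bias vectors returned by `layer.get_weights()`.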


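### Weight values
* These are the trained values printed under Model - 2 below, rounded to two decimals. The weights printed in `In [21]` above have all-zero biases and do not reproduce the predictions in `In [16]`, so they appear to come from an untrained model.
* W(1,2) = -5.92
* W(1,3) = 5.87
* b1 = -3.18
* .
* W(2,2) = 5.71
* W(2,3) = -6.02
* b2 = -3.27
* .
* W(3,2) = 9.22
* W(3,3) = 9.19
* b3 = -4.55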


### We can visualize our model like this.

![title](img/Multi.png)

### Model - 2
* This model uses the same architecture with an increased learning rate (0.2) and the same number of epochs (10000).

In [22]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
import numpy as np
from keras.utils import plot_model
np.random.seed(237)

In [23]:
dataset = np.array([[0,0,0], [0,1,1], [1,0,1], [1,1,0]])
in_data = dataset[:,0:2]
out_data = dataset[:,2]

In [24]:
model = Sequential()
model.add(Dense(2, input_dim=2, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

In [25]:
sgd = SGD(lr=0.2)
model.compile(loss='mse', optimizer=sgd)
hist = model.fit(in_data, out_data, batch_size=1, epochs=10000, verbose=0)
#hist = model.fit(in_data, out_data, batch_size=1, epochs=1000, verbose=2)

In [26]:
print(model.predict_proba(in_data))

[[ 0.02092898]
 [ 0.98194629]
 [ 0.98200673]
 [ 0.01870725]]


In [27]:
for layer in model.layers:
    print(layer.get_weights())

[array([[-5.91967487,  5.86592293],
       [ 5.70987177, -6.02209902]], dtype=float32), array([-3.17758083, -3.27455902], dtype=float32)]
[array([[ 9.22402477],
       [ 9.18853474]], dtype=float32), array([-4.54956007], dtype=float32)]
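
One way to read off the logical function of each node is to push the four inputs through the printed weights by hand and round the activations. The sketch below does this in plain NumPy, assuming the usual Dense convention `output = activation(x @ kernel + bias)`, where each kernel column belongs to one node.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Values copied from the layer.get_weights() output above.
w_hidden = np.array([[-5.91967487, 5.86592293],
                     [5.70987177, -6.02209902]])
b_hidden = np.array([-3.17758083, -3.27455902])
w_out = np.array([9.22402477, 9.18853474])
b_out = -4.54956007

hidden = sigmoid(x @ w_hidden + b_hidden)   # shape (4, 2): one column per hidden node
output = sigmoid(hidden @ w_out + b_out)    # shape (4,)

# Print a truth table with activations rounded to 0 or 1.
print("x1 x2 | h1 h2 | out")
for (x1, x2), (h1, h2), o in zip(x.astype(int),
                                 hidden.round().astype(int),
                                 output.round().astype(int)):
    print(f" {x1}  {x2} |  {h1}  {h2} |  {o}")
```

For this run, the rounded table shows the first hidden node computing (NOT x1) AND x2, the second computing x1 AND (NOT x2), and the output node computing h1 OR h2, which together give XOR.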


### Model - 3
* This model uses 32 nodes in the hidden layer, with ReLU as the activation function.
* It uses the 'adam' optimizer and was run with different numbers of epochs to compare the predictions.

In [28]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
import numpy as np
from keras.utils import plot_model
np.random.seed(237)

In [29]:
dataset = np.array([[0,0,0], [0,1,1], [1,0,1], [1,1,0]])
in_data = dataset[:,0:2]
out_data = dataset[:,2]

In [30]:
model = Sequential()
model.add(Dense(32, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [31]:
model.compile(loss='mse', optimizer='adam')
#hist = model.fit(in_data, out_data, batch_size=1, epochs=1000, verbose=0)
hist = model.fit(in_data, out_data, batch_size=1, epochs=10000, verbose=0)
#hist = model.fit(in_data, out_data, batch_size=1, epochs=100, verbose=2)

In [32]:
#print(model.predict_proba(in_data).round())
print(model.predict_proba(in_data))

[[  1.21201265e-05]
 [  9.99991298e-01]
 [  9.99991298e-01]
 [  7.95344386e-06]]


In [33]:
for layer in model.layers:
    print(layer.get_weights())

[array([[-1.69767249,  1.62174177, -0.23225045, -0.32526308,  0.0427657 ,
         0.02756534,  1.28406119,  0.82311469, -0.17420785,  0.60011816,
        -0.35459313, -0.1818248 ,  0.15972893, -0.23245458,  1.7789346 ,
         0.39777139, -0.36727858,  1.57581699, -0.17183733,  2.19454122,
        -0.2543183 , -0.25804031, -0.36253077,  0.45965564, -1.54771793,
        -2.03339219,  0.1993116 ,  0.30599567, -1.44028044,  0.16291015,
         0.50966966, -1.87768149],
       [ 1.6977042 , -1.62171912,  0.0475382 , -0.38090867, -0.39276823,
        -0.58506739, -1.28403914,  0.82311159, -0.36305991,  0.60012156,
        -0.30699015, -0.22405182, -0.26927742, -0.05007973, -1.77889323,
         0.13049468, -0.17718498, -1.57581246, -0.11812219, -2.1945107 ,
        -0.15362132,  0.02173809,  0.06299868,  0.45964518,  1.54773104,
         2.03338909,  0.11602172,  0.17503706,  1.44030619,  0.41295651,
         0.50967193,  1.87768126]], dtype=float32), array([ -2.27531054e-05,  -6.1295204

### More epochs give predictions closer to the targets
* Epochs = 1000 =>
  0.08, 0.95, 0.95, 0.03

* Epochs = 10000 =>
  0, 0.999, 0.999, 0
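
To put a number on the comparison above, this small sketch computes the mean squared error and the rounded (binary) accuracy for the two sets of predictions quoted in the list. The values are typed in from the text, not recomputed from a model.

```python
import numpy as np

target = np.array([0, 1, 1, 0], dtype=float)

# Predictions as quoted above, keyed by number of training epochs.
runs = {
    1000: np.array([0.08, 0.95, 0.95, 0.03]),
    10000: np.array([0.0, 0.999, 0.999, 0.0]),
}

mse = {epochs: float(np.mean((pred - target) ** 2)) for epochs, pred in runs.items()}
acc = {epochs: float(np.mean(pred.round() == target)) for epochs, pred in runs.items()}

for epochs in runs:
    print(epochs, mse[epochs], acc[epochs])
```

Both runs classify all four rows correctly after rounding, so the gain from extra epochs shows up in the error (confidence of the outputs) rather than in the binary accuracy.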

3. Linearly separable logic with hidden layer
---------------
Train a network with the same structure as in Problem 2 on an easier logic relation of your choice – one that is linearly separable, and for which the hidden nodes are not necessary.
* Before fitting the model, write down your predictions or expectations of how the training process might be different, and how the network will perform.
* After training, see if you are able to interpret the weights, and what the hidden nodes are computing.
* How did this experiment go relative to your predictions? Was there anything different about the training process?

### Answer : 

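* Using a simple hidden layer of 2 nodes, trained on the OR function, which is linearly separable.
* Because the logic is linearly separable, we expect the network to reach the correct values more easily than for XOR, even though the hidden layer is not needed.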

In [34]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
import numpy as np
np.random.seed(237)

In [35]:
dataset = np.array([[0,0,0], [0,1,1], [1,0,1], [1,1,1]])
in_data = dataset[:,0:2]
out_data = dataset[:,2]

In [36]:
model = Sequential()
model.add(Dense(2, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [37]:
sgd = SGD(lr=0.2)
model.compile(loss='mse', optimizer=sgd)
hist = model.fit(in_data, out_data, batch_size=1, epochs=2000, verbose=0)

In [38]:
print(model.predict_proba(in_data))

[[ 0.06655546]
 [ 0.97793853]
 [ 0.98811907]
 [ 0.99998069]]


In [39]:
for layer in model.layers:
    print(layer.get_weights())

[array([[-1.17854929,  2.2603755 ],
       [-0.42558348,  2.05898833]], dtype=float32), array([ 0.        , -0.00033171], dtype=float32)]
[array([[ 1.09355223],
       [ 3.1245904 ]], dtype=float32), array([-2.64084601], dtype=float32)]


* As we can see, the predicted values (0.07, 0.98, 0.99, 1.00) are very close to the expected values (0, 1, 1, 1).
* Although the hidden layer is unnecessary for linearly separable logic, the network still predicts the correct values.
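
To interpret what the hidden nodes are computing here, the sketch below replays the forward pass with the weights printed in `In [39]`, again assuming the Dense convention `activation(x @ kernel + bias)`.

```python
import numpy as np

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Values copied from the layer.get_weights() output above.
w_hidden = np.array([[-1.17854929, 2.2603755],
                     [-0.42558348, 2.05898833]])
b_hidden = np.array([0.0, -0.00033171])
w_out = np.array([1.09355223, 3.1245904])
b_out = -2.64084601

hidden = np.maximum(0.0, x @ w_hidden + b_hidden)            # ReLU hidden layer
output = 1.0 / (1.0 + np.exp(-(hidden @ w_out + b_out)))     # sigmoid output

print(hidden)
print(output)
```

The first hidden node has negative weights on both inputs and a zero bias, so its ReLU output is 0 for every row and it contributes nothing. Its weights are also identical to the values printed in `In [21]`, which suggests it never received a gradient update. The second node computes roughly 2.26·x1 + 2.06·x2, and the output node thresholds that sum, so the network effectively solves OR with a single useful hidden node.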

4. XOR in a larger network
----------
Create a new network structure to train on the XOR data. For instance, try using additional hidden nodes or another hidden layer. For any change you make, predict what the effect will be on the experiment relative to the outcomes in Problem 2. Analyze the results and compare them to your predictions.

### Answer :

### Model 1 :
* Instead of 32 nodes in the hidden layer, this model uses 40 nodes (`Dense(40, ...)` in the code below).
* As the number of nodes is increased, we expect the network to produce better predicted values.

In [50]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
import numpy as np
from keras.utils import plot_model
np.random.seed(237)

In [51]:
dataset = np.array([[0,0,0], [0,1,1], [1,0,1], [1,1,0]])
in_data = dataset[:,0:2]
out_data = dataset[:,2]

In [52]:
model = Sequential()
model.add(Dense(40, input_dim=2, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

In [53]:
sgd = SGD(lr=0.1)
model.compile(loss='mse', optimizer=sgd)
hist = model.fit(in_data, out_data, batch_size=1, epochs=10000, verbose=0)


In [54]:
print(model.predict_proba(in_data))

[[ 0.03244289]
 [ 0.96342623]
 [ 0.96462059]
 [ 0.03964273]]


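### Comparison with 32 nodes
* The values predicted with 32 nodes were (0.0123, 0.9893, 0.9893, 0.0110).
* With 40 nodes the values are (0.032, 0.963, 0.965, 0.040), which are not closer to the targets, so the expected improvement from adding nodes did not appear in this run.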

### Model 2 :
* Using two hidden layers of 32 nodes each (ReLU, then sigmoid) with the adam optimizer.
* Using binary accuracy to check how well the network is being trained.
* Expecting the network to reach good predictions even with a low number of epochs.

In [55]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
import numpy as np
from keras.utils import plot_model
np.random.seed(237)

In [56]:
dataset = np.array([[0,0,0], [0,1,1], [1,0,1], [1,1,0]])
in_data = dataset[:,0:2]
out_data = dataset[:,2]

In [57]:
model = Sequential()
model.add(Dense(32, input_dim=2, activation='relu'))
model.add(Dense(32, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

In [58]:
#model.compile(loss='mse', optimizer='adam')
#hist = model.fit(in_data, out_data, batch_size=1, epochs=1000, verbose=0)
#hist = model.fit(in_data, out_data, batch_size=1, epochs=1000, verbose=0)
#hist = model.fit(in_data, out_data, batch_size=1, epochs=100, verbose=2)
model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=['binary_accuracy'])
hist = model.fit(in_data, out_data, batch_size=1, epochs=500, verbose=2)

Epoch 1/500
0s - loss: 0.2895 - binary_accuracy: 0.5000
Epoch 2/500
0s - loss: 0.2856 - binary_accuracy: 0.5000
Epoch 3/500
0s - loss: 0.2821 - binary_accuracy: 0.5000
Epoch 4/500
0s - loss: 0.2786 - binary_accuracy: 0.5000
Epoch 5/500
0s - loss: 0.2756 - binary_accuracy: 0.5000
Epoch 6/500
0s - loss: 0.2746 - binary_accuracy: 0.5000
Epoch 7/500
0s - loss: 0.2696 - binary_accuracy: 0.5000
Epoch 8/500
0s - loss: 0.2674 - binary_accuracy: 0.5000
Epoch 9/500
0s - loss: 0.2653 - binary_accuracy: 0.5000
Epoch 10/500
0s - loss: 0.2616 - binary_accuracy: 0.5000
Epoch 11/500
0s - loss: 0.2609 - binary_accuracy: 0.5000
Epoch 12/500
0s - loss: 0.2601 - binary_accuracy: 0.5000
Epoch 13/500
0s - loss: 0.2576 - binary_accuracy: 0.5000
Epoch 14/500
0s - loss: 0.2541 - binary_accuracy: 0.5000
Epoch 15/500
0s - loss: 0.2523 - binary_accuracy: 0.5000
Epoch 16/500
0s - loss: 0.2508 - binary_accuracy: 0.5000
Epoch 17/500
0s - loss: 0.2487 - binary_accuracy: 0.5000
Epoch 18/500
0s - loss: 0.2496 - binary_

0s - loss: 0.0689 - binary_accuracy: 1.0000
Epoch 159/500
0s - loss: 0.0680 - binary_accuracy: 1.0000
Epoch 160/500
0s - loss: 0.0672 - binary_accuracy: 1.0000
Epoch 161/500
0s - loss: 0.0664 - binary_accuracy: 1.0000
Epoch 162/500
0s - loss: 0.0653 - binary_accuracy: 1.0000
Epoch 163/500
0s - loss: 0.0645 - binary_accuracy: 1.0000
Epoch 164/500
0s - loss: 0.0638 - binary_accuracy: 1.0000
Epoch 165/500
0s - loss: 0.0628 - binary_accuracy: 1.0000
Epoch 166/500
0s - loss: 0.0622 - binary_accuracy: 1.0000
Epoch 167/500
0s - loss: 0.0611 - binary_accuracy: 1.0000
Epoch 168/500
0s - loss: 0.0603 - binary_accuracy: 1.0000
Epoch 169/500
0s - loss: 0.0596 - binary_accuracy: 1.0000
Epoch 170/500
0s - loss: 0.0587 - binary_accuracy: 1.0000
Epoch 171/500
0s - loss: 0.0580 - binary_accuracy: 1.0000
Epoch 172/500
0s - loss: 0.0572 - binary_accuracy: 1.0000
Epoch 173/500
0s - loss: 0.0566 - binary_accuracy: 1.0000
Epoch 174/500
0s - loss: 0.0559 - binary_accuracy: 1.0000
Epoch 175/500
0s - loss: 0.0

Epoch 325/500
0s - loss: 0.0116 - binary_accuracy: 1.0000
Epoch 326/500
0s - loss: 0.0115 - binary_accuracy: 1.0000
Epoch 327/500
0s - loss: 0.0114 - binary_accuracy: 1.0000
Epoch 328/500
0s - loss: 0.0113 - binary_accuracy: 1.0000
Epoch 329/500
0s - loss: 0.0112 - binary_accuracy: 1.0000
Epoch 330/500
0s - loss: 0.0111 - binary_accuracy: 1.0000
Epoch 331/500
0s - loss: 0.0111 - binary_accuracy: 1.0000
Epoch 332/500
0s - loss: 0.0110 - binary_accuracy: 1.0000
Epoch 333/500
0s - loss: 0.0109 - binary_accuracy: 1.0000
Epoch 334/500
0s - loss: 0.0108 - binary_accuracy: 1.0000
Epoch 335/500
0s - loss: 0.0107 - binary_accuracy: 1.0000
Epoch 336/500
0s - loss: 0.0107 - binary_accuracy: 1.0000
Epoch 337/500
0s - loss: 0.0106 - binary_accuracy: 1.0000
Epoch 338/500
0s - loss: 0.0105 - binary_accuracy: 1.0000
Epoch 339/500
0s - loss: 0.0104 - binary_accuracy: 1.0000
Epoch 340/500
0s - loss: 0.0103 - binary_accuracy: 1.0000
Epoch 341/500
0s - loss: 0.0103 - binary_accuracy: 1.0000
Epoch 342/500


Epoch 495/500
0s - loss: 0.0040 - binary_accuracy: 1.0000
Epoch 496/500
0s - loss: 0.0040 - binary_accuracy: 1.0000
Epoch 497/500
0s - loss: 0.0039 - binary_accuracy: 1.0000
Epoch 498/500
0s - loss: 0.0039 - binary_accuracy: 1.0000
Epoch 499/500
0s - loss: 0.0039 - binary_accuracy: 1.0000
Epoch 500/500
0s - loss: 0.0039 - binary_accuracy: 1.0000


In [59]:
#print(model.predict_proba(in_data).round())
print(model.predict_proba(in_data))

[[ 0.08642094]
 [ 0.93870527]
 [ 0.94294232]
 [ 0.03128288]]


### Correct predictions with fewer epochs
* As expected, this network reaches a binary accuracy of 1 by the 41st epoch.
* So there is no need to train the network for a very large number of epochs.
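
Rather than scanning the verbose log by eye, the epoch at which the accuracy first reaches 1 can be found programmatically. This is a minimal sketch: `first_epoch_at` is a hypothetical helper, and the example curve is made-up data for illustration. In the notebook, the list would presumably come from `hist.history['binary_accuracy']`, assuming the metric is stored under that key.

```python
def first_epoch_at(values, threshold=1.0):
    """Return the 1-based epoch at which values first reach threshold, or None."""
    for epoch, value in enumerate(values, start=1):
        if value >= threshold:
            return epoch
    return None

# Hypothetical accuracy curve, for illustration only (not the run above).
example_acc = [0.5, 0.5, 0.75, 0.75, 1.0, 1.0]
print(first_epoch_at(example_acc))  # 5
```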