## Neural Networks: Network, Layers & Neurons
Many different architectures and neuron types can be used. In a way it is a very advanced LEGO: you need to understand what each LEGO piece is built of, how it interacts with the other pieces, and how everything works together as a system. 
![Neural Network](https://pvsmt99345.i.lithium.com/t5/image/serverpage/image-id/42339i8BA3F2CCCEDE7458/image-size/large?v=1.0&px=999)

A layer in a neural network is built from multiple neurons stacked side by side (in width). 
I'll start from neurons and build upwards, to first understand the single LEGO piece and then the bigger picture. 

### Backpropagation
Backpropagation is the key to training a neural network. It calculates the _gradient_ of the loss with respect to every weight, which the update step (the optimizer) then uses to adjust those weights. Backpropagation is short for "the backward propagation of errors".  
In simple terms: we compare our prediction with the correct answer, push the resulting error backward through the network, and update the weights automatically so the next guess is better.
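
To make the "push the error backward" idea concrete, here is a minimal sketch in plain Python for a single neuron with one weight and one bias, trained on squared error. The function name `train_neuron`, the learning rate and the toy data are made up for illustration; frameworks like PyTorch compute the same kind of gradients automatically.

```python
def train_neuron(data, lr=0.1, epochs=200):
    """Fit y = w * x + b by gradient descent on the loss 0.5 * (pred - y)**2."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            # Forward pass: make a prediction and measure the error.
            pred = w * x + b
            error = pred - y       # d(loss)/d(pred)
            # Backward pass: the chain rule gives one gradient per parameter.
            grad_w = error * x     # d(loss)/dw
            grad_b = error         # d(loss)/db
            # Update step: move each parameter against its gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical example values).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_neuron(data)
print(round(w, 2), round(b, 2))  # should land close to 2.0 and 1.0
```

The three commented steps (forward, backward, update) correspond to `model(...)`, `loss.backward()` and `optimizer.step()` in the PyTorch cell further down.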


In [0]:
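# PyTorch
import torch
import torch.nn as nn

# Assumed setup (not given in the original): MNIST-style 28x28 inputs, and
# train_loader / test_loader being torch.utils.data.DataLoader objects.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
input_size = 28 * 28   # matches the reshape(-1, 28*28) below
hidden_size = 500      # assumed example value
num_classes = 10       # assumed: ten digit classes
num_epochs = 5         # assumed example value
learning_rate = 0.001  # assumed example value

# Fully connected neural network with one hidden layer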
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out
# Init model
model = NeuralNet(input_size, hidden_size, num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):  
        # Move tensors to the configured device
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        outputs = model(images)
        # Index of the highest score per sample (.data is not needed under no_grad)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))

# Save the model checkpoint
torch.save(model.state_dict(), 'model.ckpt')

In [0]:
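# Keras
from keras.models import Sequential
from keras.layers import Dense, Input

# Assumed (not given in the original): X has shape (n_samples, 8) and
# Y holds one 0/1 label per sample.

# create model
model = Sequential()
model.add(Input(shape=(8,)))
model.add(Dense(12, activation='relu'))
model.add(Dense(1, activation='sigmoid'))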

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
model.fit(X, Y, epochs=150, batch_size=10)

# evaluate the model (returns [loss, accuracy] because metrics=['accuracy'])
loss, accuracy = model.evaluate(X, Y)
print("\naccuracy: %.2f%%" % (accuracy * 100))

# calculate predictions
predictions = model.predict(X)

### Neurons
There is a crazy amount of neuron types that can be used, and I'll try to quickly go through the most common ones. 
![Neuron](https://cdn-images-1.medium.com/max/1200/1*SJPacPhP4KDEB1AdhOFy_Q.png)

#### 1. Dense / Linear
Dense layers are the simplest. They're basically just a matrix (dot) multiplication. Each neuron receives input from all the neurons in the previous layer, hence densely connected.  
In Keras it's implemented as the following: `output = activation(dot(input, kernel) + bias)`  
Here `kernel` is the weight matrix and `bias` is a bias vector with one entry per neuron. The activation is selected when creating the layer.
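
As a rough illustration of that formula, the sketch below computes a dense layer for a single input vector in plain Python. The helper names (`dense`, `relu`) and all weight values are made up for this example; a real layer learns its kernel and bias during training.

```python
def relu(z):
    return max(0.0, z)


def dense(inputs, kernel, bias, activation=None):
    """output = activation(dot(input, kernel) + bias) for one input vector.

    kernel has shape (n_inputs, n_units): kernel[i][j] connects input i to unit j.
    """
    outputs = []
    for j in range(len(bias)):
        # Every unit sees every input: that is the "densely connected" part.
        z = sum(inputs[i] * kernel[i][j] for i in range(len(inputs))) + bias[j]
        outputs.append(activation(z) if activation else z)
    return outputs


# Hypothetical weights: 3 inputs feeding 2 units.
kernel = [[1.0, -1.0],
          [0.5, 2.0],
          [-1.0, 0.0]]
bias = [0.0, -10.0]
out = dense([2.0, 4.0, 1.0], kernel, bias, activation=relu)
print(out)  # [3.0, 0.0]
```

The second unit ends up at `-4.0` before the activation, so ReLU clips it to zero.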

#### 2. Dropout
Dropout is a way to combat overfitting. During training we randomly drop nodes from the network; how many is set by a float between 0 and 1 that gives the fraction of nodes to drop.  
There is another variant called **Spatial Dropout**. In a CNN it removes whole feature maps (the output of one convolution filter) rather than individual pixels. So basically we drop an entire feature map instead of an individual feature.
![dropout gif](http://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/1IrdJ5PghD9YoOyVAQ73MJw.gif)
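
A minimal sketch of (element-wise) dropout in plain Python, assuming the common "inverted dropout" convention where the surviving values are scaled by `1 / (1 - rate)` so the expected sum stays the same. The function name and the seed are illustrative only.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Zero each value with probability `rate` (0 <= rate < 1) during training.

    Survivors are scaled by 1 / (1 - rate); at inference nothing is dropped.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)                                          # mix of 0.0 and 2.0
print(dropout(activations, rate=0.5, training=False))   # unchanged at inference
```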

#### 3. Pooling
Pooling is used to reduce dimensions. By doing this we also make the network less sensitive to small shifts or changes of a pattern in the input. It goes hand in hand with CNNs (which will be explained soon). There are a few different types of pooling, such as max pooling, average pooling and their global variants (global max / global average pooling).
![pooling layer gif](https://cdn-images-1.medium.com/max/1200/1*ZCjPUFrB6eHPRi4eyP6aaA.gif)
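
The two flavours mentioned above can be sketched in plain Python on a small 2D feature map. The helper names and the numbers are made up for illustration, and the max pooling assumes a 2x2 window with stride 2 and even height and width.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a 2D list (even height and width assumed)."""
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            window = (image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


def global_avg_pool(image):
    """Global average pooling: collapse the whole feature map to one number."""
    values = [v for row in image for v in row]
    return sum(values) / len(values)


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
print(max_pool_2x2(feature_map))     # [[4, 2], [2, 8]]
print(global_avg_pool(feature_map))  # 2.6875
```

The 4x4 map shrinks to 2x2, and moving a large value around inside its 2x2 window would not change the pooled output.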

#### 4. Recurrent: LSTM & GRU
The two most common recurrent neurons:  
LSTM - Long Short-Term Memory  
GRU - Gated Recurrent Unit  

ELI5: We add a memory to the neuron, controlled by gates. This makes the network context aware, which is great for natural language. We can now remember things over a sequence.

More in depth, an LSTM has three gates (input, forget and output) and a GRU has two (update and reset). An LSTM carries two states, the cell state (c) and the hidden state (h), while a GRU only carries a hidden state (h). Important to add is that besides remembering, the gates also make it possible to forget.  
LSTM & GRU mitigate the vanishing gradient problem that plain RNNs run into when we backpropagate through long sequences.  
As they're very complex I won't go through the theory, just the basics.
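
For the curious, here is a rough sketch of one time step of the standard LSTM update, reduced to scalar inputs and states so the gating is easy to follow. The parameter names (`wf`, `uf`, `bf`, ...) and their values are hypothetical; real layers use weight matrices and learn them during training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step with scalar input x and scalar states h, c."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g   # forget part of the old memory, add new memory
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c


# Hypothetical parameters, all set to 0.5 just to have something to run.
params = {name: 0.5 for name in ("wf", "uf", "bf", "wi", "ui", "bi",
                                 "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

The cell state `c` is the memory that travels along the sequence; the forget gate `f` decides how much of it survives each step.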

I might extend this.

### Layers/Network

Everything above is a layer too. 

#### 1. Convolutional

One can see a Convolutional Neural Network (CNN) as a stack of filters. Each filter is a small matrix (kernel) that slides over the input; at every position we multiply it element-wise with the patch underneath and sum the result to detect some kind of feature. For images, one such filter could find all edges in an image.  
In CNNs, as with other neural networks, the filter values are trained automatically by backpropagation. 
![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/28132045/cnnimage.png)
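
A small plain-Python sketch of the filter idea: slide a 3x3 kernel over an image and sum the element-wise products at each position. It assumes no padding and stride 1, and the vertical-edge kernel is written by hand here, whereas a trained CNN would learn its own kernels.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Dark left half, bright right half: one vertical edge in the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written vertical-edge kernel (hypothetical example values).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

print(conv2d(image, kernel))  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The output is zero over the flat regions and non-zero only where the window straddles the edge.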

#### 2. Embedding

The embedding layer does exactly what the name suggests: it embeds the feature, mapping each discrete ID to a dense vector. In our case we always talk about word embeddings, although the same layer works for any categorical feature. 

How we did it:
![alt text](https://cdn-images-1.medium.com/max/1200/1*YEJf9BQQh0ma1ECs6x_7yQ.png)

How Embedding does it (dense representation):
![alt text](https://cdn-images-1.medium.com/max/1200/0*mRGKYujQkI7PcMDE.)
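
The difference between a sparse one-hot vector and a dense embedding can be sketched in plain Python. The toy vocabulary and the table values below are made up; a real embedding table is trained together with the rest of the network.

```python
vocab = {"cat": 0, "dog": 1, "car": 2}  # hypothetical toy vocabulary

# Embedding table: one dense row per word (values invented for illustration).
embedding_table = [[0.9, 0.1],
                   [0.8, 0.2],
                   [-0.7, 0.6]]


def one_hot(index, size):
    """Sparse representation: all zeros except a single 1."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def embed(word):
    """Embedding lookup: pick the table row that belongs to the word's index."""
    return embedding_table[vocab[word]]


def embed_via_one_hot(word):
    """Same result as multiplying the one-hot vector with the table."""
    hot = one_hot(vocab[word], len(vocab))
    dim = len(embedding_table[0])
    return [sum(hot[i] * embedding_table[i][d] for i in range(len(vocab)))
            for d in range(dim)]


print(one_hot(vocab["dog"], len(vocab)))  # [0.0, 1.0, 0.0]
print(embed("dog"))                       # [0.8, 0.2]
```

The lookup gives the same vector as the one-hot multiplication, but the dense vector stays small no matter how large the vocabulary grows.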


#### 3. Recurrent: LSTM & GRU
![RNN](https://cdn-images-1.medium.com/max/800/1*XosBFfduA1cZB340SSL1hg.png)
 
![RNN_unfolded](http://www.wildml.com/wp-content/uploads/2015/09/rnn.jpg)


#### 4. Seq2Seq
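Sequence-to-sequence (Seq2Seq) is a special arrangement of RNNs: an encoder RNN reads the input sequence into a state, and a decoder RNN generates the output sequence from that state. 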


#### Bonus: Capsule, Autoencoding & Attention

Let's **CODE**!