# Deep Learning

Deep learning is a subset of machine learning. In classical machine learning, features are usually engineered by hand. Deep learning, by contrast, learns features directly from the data.

- The parameters are the weights and the bias.
- Weights: the coefficient of each pixel
- Bias: the intercept
- `z = (w.T)x + b`: z equals (the transpose of the weights times the input x) plus the bias
- In other words, `z = b + px1*w1 + px2*w2 + ... + px4096*w4096`
- `y_head = sigmoid(z)`

- The sigmoid function squashes z into the range between zero and one, so the output can be read as a probability. You can see the sigmoid function in the computation graph.
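
The forward pass described in the list above can be sketched in plain Python. This is a minimal illustration, not a full model: the four-pixel input, the weight values, and the helper name `forward` are assumptions made for the example (the text uses 4096 pixels).

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, pixels):
    """Compute z = b + sum(w_i * x_i) and return (z, y_head)."""
    z = bias + sum(w * x for w, x in zip(weights, pixels))
    return z, sigmoid(z)

# Hypothetical 4-pixel "image" instead of the 4096 pixels in the text.
pixels = [0.0, 0.5, 1.0, 0.25]
weights = [0.2, -0.4, 0.6, 0.8]
bias = 0.1

z, y_head = forward(weights, bias, pixels)
print(z, y_head)
```

Here `z` is an unbounded score, while `y_head` always lies strictly between 0 and 1.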


- A Lambda layer wraps an arbitrary expression as a layer, which suits simple stateless operations such as a sum, an average, or exponentiation.

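```python
from keras import backend as K
from keras.layers import Lambda

def standardize(x):
    # Subtract the mean and divide by the standard deviation
    return (x - K.mean(x)) / K.std(x)

# `model` is assumed to be an existing Sequential model
model.add(Lambda(standardize, input_shape=(28, 28, 1)))
```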

__Why do we use the sigmoid function?__

It gives a probabilistic result. It is also differentiable, so we can use it in the gradient descent algorithm (covered shortly).
Say we find z = 4 and put it into the sigmoid function. The result (y_head) is about 0.98, which means the classification result is 1 with roughly 98% probability.
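
Differentiability matters because gradient descent needs the slope of the activation. As a small sketch, the sigmoid's derivative can be written in terms of the sigmoid itself, `s(z) * (1 - s(z))`. The helper name `sigmoid_grad` is chosen for this example, and the finite-difference check is only a sanity test.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

y_head = sigmoid(4.0)       # roughly 0.982
slope = sigmoid_grad(4.0)   # small: the curve is nearly flat at z = 4

# Sanity check against a central finite difference.
eps = 1e-6
numeric = (sigmoid(4.0 + eps) - sigmoid(4.0 - eps)) / (2 * eps)
print(y_head, slope, numeric)
```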

Adam is one of the most effective optimization algorithms for training neural networks. Its advantages include relatively low memory requirements and good performance even with little hyperparameter tuning.
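
To make the memory claim concrete, here is a minimal sketch of the Adam update for a single scalar parameter. Adam keeps only two extra running values per parameter (`m` and `v`). The objective, the learning rate, and the function name `adam_minimize` are assumptions for illustration, and the remaining defaults are the commonly quoted ones.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a scalar function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Hypothetical objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Dividing by the square root of `v_hat` rescales each step, which is one reason the method tends to be forgiving about the exact learning rate.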


# Keras
- models
- layers
- callbacks
- optimizers
- metrics
- losses
- utils
- constraints
- data preprocessing

__models__

The core data structures of Keras are layers and models. The simplest type of model is the Sequential model, a linear stack of layers. For more complex architectures, use the Keras functional API, which lets you build arbitrary graphs of layers, or write models entirely from scratch via subclassing.

__How to restrict weights to a range in Keras__

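```python
from keras.constraints import max_norm
from keras.layers import Conv2D

# `model` is assumed to be an existing Sequential model.
# input_shape=(3, 32, 32) is channels-first, so this assumes
# image_data_format is set to 'channels_first'.
model.add(Conv2D(32, (3, 3), input_shape=(3, 32, 32), padding='same',
                 activation='relu', kernel_constraint=max_norm(3)))
```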

Constraining the weight matrix directly is another kind of regularization. A simple L2 regularization term penalizes large weights through the loss function, whereas this constraint acts on the weights directly. As referenced in the Keras code, it seems to work especially well in combination with a dropout layer.
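
The idea behind a max-norm constraint can be sketched without Keras: after an update, any weight vector whose L2 norm exceeds the limit is scaled back onto it. The function name `apply_max_norm` and the sample vectors are hypothetical. Keras applies the constraint per unit along an axis, while this toy version handles a single vector.

```python
import math

def apply_max_norm(weights, max_value=3.0):
    """Rescale a weight vector so that its L2 norm is at most max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    scale = max_value / norm
    return [w * scale for w in weights]

small = apply_max_norm([1.0, 2.0])   # norm is about 2.24, left unchanged
large = apply_max_norm([3.0, 4.0])   # norm is 5, rescaled to norm 3
print(small, large)
```

Unlike an L2 penalty, nothing is added to the loss here: the weights are simply projected back into the allowed range.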

# Bidirectional LSTM

RNN architectures such as LSTM and BiLSTM are used when the learning problem is sequential.

LSTMs and their bidirectional variants are popular because their gates let them learn what to forget and what to keep. In earlier RNN architectures, vanishing gradients were a major problem and prevented those networks from learning long-range dependencies.

With a bidirectional LSTM, you feed the learning algorithm the original data twice: once from beginning to end and once from end to beginning.

The term bidirectional means that the input is run in two directions (from past to future and from future to past).

A unidirectional LSTM preserves only information from the past, because the only inputs it has seen are from the past.

The bidirectional variant differs in that the LSTM running backwards preserves information from the future. By combining the two hidden states, the network has access at any point in time to information from both the past and the future.
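
The two-pass idea can be illustrated with a toy recurrence. In this sketch a running sum stands in for the LSTM cell (an assumption made purely to keep the states easy to read), and the names `run_rnn` and `bidirectional` are hypothetical.

```python
def run_rnn(inputs, step, h0=0.0):
    """Apply `step` over the sequence and return the state at every position."""
    states, h = [], h0
    for x in inputs:
        h = step(h, x)
        states.append(h)
    return states

def bidirectional(inputs, step):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs, step)
    backward = run_rnn(inputs[::-1], step)[::-1]  # re-align to original order
    return list(zip(forward, backward))

# Toy recurrence (a running sum) standing in for an LSTM cell.
def running_sum(h, x):
    return h + x

states = bidirectional([1, 2, 3, 4], running_sum)
print(states)
```

At the second position the forward state has seen `1 + 2` and the backward state has seen `4 + 3 + 2`, so the pair covers the whole sequence.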


# GRU vs LSTM

The key difference between a GRU and an LSTM is that a GRU has two gates (reset and update gates) whereas an LSTM has three gates (namely input, output and forget gates).

The GRU controls the flow of information like the LSTM unit, but without a separate memory cell. It exposes its full hidden state without any output control.
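
A single GRU step can be sketched for a scalar input and a scalar hidden state. The parameter names and values below are hypothetical. The blend `(1 - z) * h_prev + z * h_cand` follows one common convention, and some references swap the roles of `z` and `1 - z`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step for scalar input and scalar hidden state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])   # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend old state and candidate; the result is exposed in full as the output.
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical parameter values, chosen only for illustration.
params = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
          "wr": 0.3, "ur": 0.2, "br": 0.0,
          "wh": 0.8, "uh": 0.4, "bh": 0.0}

h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, params)
print(h)
```

There are only two gates and no separate cell state: the value returned by `gru_step` is both the memory carried forward and the output.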


__Sample:__ one element of a dataset. For instance, one image is a sample in a convolutional network. One audio snippet is a sample for a speech recognition model.

__Batch:__ a set of N samples. The samples in a batch are processed independently, in parallel. If training, a batch results in only one update to the model. A batch generally approximates the distribution of the input data better than a single input. The larger the batch, the better the approximation; however, it is also true that the batch will take longer to process and will still result in only one update. For inference (evaluate/predict), it is recommended to pick a batch size that is as large as you can afford without going out of memory (since larger batches will usually result in faster evaluation/prediction).

__Epoch:__ an arbitrary cutoff, generally defined as "one pass over the entire dataset", used to separate training into distinct phases, which is useful for logging and periodic evaluation. When using validation_data or validation_split with the fit method of Keras models, evaluation will be run at the end of every epoch. Within Keras, there is the ability to add callbacks specifically designed to be run at the end of an epoch. Examples of these are learning rate changes and model checkpointing (saving).
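
The relationship between samples, batches, and epochs can be made concrete by counting weight updates. This is a small sketch with a hypothetical dataset of 10 samples, and the helper `iter_batches` is a name chosen for this example.

```python
import math

def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))   # hypothetical dataset of 10 samples
batch_size = 4
epochs = 3

updates = 0
for epoch in range(epochs):
    for batch in iter_batches(samples, batch_size):
        updates += 1        # one weight update per batch

updates_per_epoch = math.ceil(len(samples) / batch_size)
print(updates_per_epoch, updates)
```

The last batch of each epoch holds only the 2 remaining samples but still counts as one update.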

__EarlyStopping callback:__ interrupts training when the validation loss is no longer decreasing.
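
The stopping rule can be sketched in plain Python, assuming a patience-based criterion similar in spirit to the `patience` argument of the Keras callback. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: ran all epochs

# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.60]
stop_at = early_stopping_epoch(losses, patience=2)
print(stop_at)
```

With these values training stops at epoch 4 and never reaches the later improvement at epoch 5, which is the usual trade-off when choosing the patience.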

__ModelCheckpoint callback:__ To ensure the ability to recover from an interrupted training run at any time (fault tolerance), you should use a callback that regularly saves your model to disk. You should also set up your code to optionally reload that model at startup.
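
The save-and-resume pattern can be sketched with the standard library alone. Here `pickle` and a plain dictionary stand in for a real model, and the helper names, file name, and "training" step are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write to a temporary file first, then rename, so that a crash
    mid-write does not leave a corrupted checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical training state: the last finished epoch and the weights.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
state = load_checkpoint(ckpt_path) or {"epoch": 0, "weights": [0.0, 0.0]}

for epoch in range(state["epoch"], 3):
    state["weights"] = [w + 0.1 for w in state["weights"]]  # stand-in for training
    state["epoch"] = epoch + 1
    save_checkpoint(ckpt_path, state)

resumed = load_checkpoint(ckpt_path)
print(resumed["epoch"])
```

Because the loop starts from `state["epoch"]`, rerunning the script after an interruption continues from the last saved epoch instead of starting over.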

The following example trains a bidirectional LSTM on the IMDB sentiment dataset:

```python
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.datasets import imdb


n_unique_words = 10000  # keep only the most frequent words
maxlen = 200            # cut or pad each review to this many tokens
batch_size = 128

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=n_unique_words)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
y_train = np.array(y_train)
y_test = np.array(y_test)

model = Sequential()
model.add(Embedding(n_unique_words, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=4,
          validation_data=(x_test, y_test))
```
