# Intro to Data Science
## Part VIII. - Deep Learning and its applications

### Table of contents

- #### Deep learning basics
    - <a href="#What-is-Deep-Learning?">Theory</a>
    - <a href="#1.-Architectures">Layer Architecture types</a>
        - Dense Neural Networks
            - Activation and Loss Functions
        - Convolutional Neural Networks
        - Recurrent Neural Networks
        - Word Embeddings
        - Regularization
    
---

# I. Deep learning basics

## What is Deep Learning?

> _Deep learning refers to neural networks with multiple hidden layers that can learn increasingly abstract representations of the input data._ [source](https://elitedatascience.com/keras-tutorial-deep-learning-in-python)

> _Deep learning is a class of neural network algorithms that:_
> - _use a cascade of __multiple layers__ of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input._
> - _learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) manners._
> - _learn __multiple levels of representations__ that correspond to __different levels of abstraction__; the levels form a hierarchy of concepts._ 
[source](https://en.wikipedia.org/wiki/Deep_learning#Definition)

## Why is it important?

Deep learning is widely used in everyday products. It powers web search engines, recommender systems, image recognition systems, and self-driving cars, and it is used to generate sound, images, and text and to build better AI agents.  
It is the current state of the art for many tasks, including image recognition, text mining, and classification.

## Tools
- Scikit-Learn
- Gensim
- Tensorflow
- Torch
- Keras

# II. Deep Neural Network Architectures

## [Dense feedforward network](https://keras.io/layers/core/#dense)

A dense layer is a regular layer of neurons in a neural network. Each neuron receives input from all the neurons in the previous layer, hence "densely connected". The layer has a weight matrix W and a bias vector b, and it operates on the activations a of the previous layer. The following is the docstring of the `Dense` class from the Keras documentation:

> _`output = activation(dot(input, kernel) + bias)` where `activation` is the element-wise activation function passed as the `activation` argument, `kernel` is a weights matrix created by the layer, and `bias` is a bias vector created by the layer._
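
To make the formula above concrete, here is a minimal NumPy sketch of a dense layer's forward pass. The function name `dense_forward` and the toy weights are illustrative assumptions and not part of Keras; Keras creates and trains `kernel` and `bias` itself.

```python
import numpy as np

def dense_forward(inputs, kernel, bias, activation=None):
    """Compute activation(dot(inputs, kernel) + bias) for a batch of inputs."""
    z = inputs @ kernel + bias  # shape: (batch_size, units)
    return z if activation is None else activation(z)

def relu(z):
    return np.maximum(z, 0.0)

# Toy example: 2 samples, 3 input features, 2 units (hand-picked values).
x = np.array([[1.0, 2.0, 3.0],
              [0.0, -1.0, 1.0]])
W = np.array([[1.0, -1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
b = np.array([0.5, -2.0])

out = dense_forward(x, W, b, activation=relu)
print(out)
```

Each row of `out` holds the activations of the two units for one sample; negative pre-activations are clipped to zero by ReLU.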

### [Activation functions](https://keras.io/activations/)

- Sigmoid: $\frac{{\rm e}^x}{{\rm e}^x + 1}$
- Tanh: $\tanh(x)$
- ReLU: $\max(x, 0)$
- Softmax:  $\frac{{\rm e}^{x_i}}{\sum_j{{\rm e}^{x_j}}}$
- Hierarchical Softmax
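
As a rough illustration, the first four functions in the list can be written in a few lines of NumPy. This is a sketch for intuition, not the Keras implementation; the max-subtraction in `softmax` is a common numerical-stability trick added here as an assumption.

```python
import numpy as np

def sigmoid(x):
    # Algebraically equal to e^x / (e^x + 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("tanh:   ", np.tanh(z))
print("relu:   ", relu(z))
print("softmax:", softmax(z))
```

Sigmoid and tanh squash each value independently, while softmax normalizes the whole vector so that its entries sum to 1, which is why it is used on the output layer of a multi-class classifier.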

#### Further reading:

- https://medium.com/@srnghn/deep-learning-overview-of-neurons-and-activation-functions-1d98286cf1e4
- https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/
- http://cs231n.github.io/neural-networks-1/#commonly-used-activation-functions
- https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-ac02f1c56aa8
- https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/

### [Loss functions](https://keras.io/losses/)

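- MSE: mean squared error $\frac{\sum{e^2}}{n}$, where $e$ is the prediction error and $n$ is the number of samples
- MAE: mean absolute error $\frac{\sum{\left|{e}\right|}}{n}$
- Hinge Loss: $\max(0, 1 - t \cdot y)$, where $t \in \{-1, 1\}$ is the true label and $y$ is the raw model output
- Cross Entropy Loss: $V(f(x), t) = -t\ln(f(x))-(1-t)\ln(1-f(x))$, where $f(x)$ is the predicted probability and $t \in \{0, 1\}$ is the true label (a label $\ell \in \{-1, 1\}$ maps to $t=(1+\ell)/2$)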
- Cosine proximity
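
The formulas above translate almost directly into NumPy. The sketch below averages each loss over the samples; the function names and the clipping constant `eps` are assumptions made for this illustration, not the Keras API.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def hinge(t, y):
    # t: true labels in {-1, +1}; y: raw model outputs
    return np.mean(np.maximum(0.0, 1.0 - t * y))

def binary_cross_entropy(t, p, eps=1e-12):
    # t: true labels in {0, 1}; p: predicted probabilities
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return np.mean(-t * np.log(p) - (1.0 - t) * np.log(1.0 - p))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
print("MSE:", mse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("hinge:", hinge(np.array([1.0, -1.0]), np.array([0.5, 2.0])))
print("cross entropy:", binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5])))
```

Note how MSE punishes the single large error more heavily than MAE does, and how the hinge loss is large for the second sample, which is confidently on the wrong side of the decision boundary.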

#### Further reading:

- http://cs231n.github.io/neural-networks-2/#loss-functions
- http://cs231n.github.io/neural-networks-3/#loss
- https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23
- https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-ac02f1c56aa8


### [Regularization](https://chatbotslife.com/regularization-in-deep-learning-f649a45d6e0)

There are many different kinds of regularization techniques available in most deep learning frameworks.

#### Early stopping

Early stopping is a simple method that monitors a loss criterion on a validation dataset during training. If the criterion stops decreasing, training ends. A patience parameter sets how many non-improving epochs are tolerated before stopping, which gives the optimizer a chance to move out of a local minimum or past a noisy plateau.
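
The patience logic can be sketched in plain Python. This is a simplified illustration under the assumption that "improvement" means a strictly lower validation loss; the function name is hypothetical, and the Keras `EarlyStopping` callback offers further options that are not modeled here.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.60, 0.61, 0.59, 0.60, 0.62, 0.63, 0.50]
stop = early_stopping_epoch(losses, patience=3)
print("training stops at epoch", stop)
```

With `patience=1` the same run would already stop at the first non-improving epoch and miss the later improvement, which shows the trade-off the patience parameter controls.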

#### Dropout

<img src="pics/dl_dense_dropout_network.png" width=500>By <a href="http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf">Srivastava, Nitish, et al. ”Dropout: a simple way to prevent neural networks from overfitting”, JMLR 2014</a>

Dropout is a technique used to tackle overfitting. The `Dropout` layer in the `keras.layers` module takes a float between 0 and 1, which is the fraction of the input units to drop. Below is the docstring of the `Dropout` layer from the documentation:

> _Dropout consists in randomly setting a fraction `rate` of input units to 0 at each update during training time, which helps prevent overfitting._
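
A minimal NumPy sketch of the idea follows. It uses the common "inverted dropout" formulation, in which the surviving units are scaled by `1 / (1 - rate)` during training so that the expected activation stays the same and nothing needs to change at prediction time. The function name and signature are assumptions for illustration.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the inputs during training."""
    if not training or rate == 0.0:
        return x  # dropout is a no-op at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob  # True for units that survive
    return x * mask / keep_prob             # rescale the survivors

rng = np.random.default_rng(0)
x = np.ones((4, 5))
dropped = dropout(x, rate=0.4, rng=rng)
print(dropped)
```

Every entry of `dropped` is either 0 (dropped) or `1 / 0.6` (kept and rescaled); a different random mask is drawn at every call.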

#### Weight penalty

A simple L1 (sum of absolute values) or L2 (sum of squared values) penalty on the weights, added as a regularization term to the objective function.
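
As a small worked example, the penalty is computed from the weights and added to the data loss. The weight matrix, the data loss value, and the strength `lam` below are made-up numbers for illustration.

```python
import numpy as np

def l1_penalty(weights, lam):
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    return lam * np.sum(weights ** 2)

W = np.array([[0.5, -1.0],
              [2.0, 0.0]])
data_loss = 0.30  # hypothetical loss measured on the training batch

total_l1 = data_loss + l1_penalty(W, lam=0.01)
total_l2 = data_loss + l2_penalty(W, lam=0.01)
print("loss with L1 penalty:", total_l1)
print("loss with L2 penalty:", total_l2)
```

The L2 term grows quadratically, so it is dominated by the largest weight (2.0 here), whereas L1 treats all weights proportionally and tends to push small ones to exactly zero. In Keras, such a penalty can be attached to a layer through its `kernel_regularizer` argument.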

#### Further reading:

- http://cs231n.github.io/neural-networks-2/#reg

---

### In Practice
#### Building a simple dense network to classify hand-written digits

#### 1. Loading data

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
X, y = load_digits(return_X_y=True)
# .toarray() densifies the encoder's sparse output; this works across scikit-learn versions
yt = OneHotEncoder(categories='auto').fit_transform(y.reshape(-1, 1)).toarray()

Xtrain, Xtest, ytrain, ytest = train_test_split(X, yt, random_state=42)

#### 2. Model construction

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, TensorBoard

In [None]:
model = Sequential([
    Dense(8, activation='relu', input_dim=64),
    Dense(10, activation='softmax')
])

#### 3. Assembly

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

#### 4. Model summary

In [None]:
model.summary()

#### 5. Model training

In [None]:
model.fit(Xtrain, ytrain, # training data
          batch_size=16,  # number of samples per gradient update
          epochs=100,     # maximum number of full passes over the training data
          validation_data=(Xtest, ytest),  # validation dataset
          callbacks=[EarlyStopping(patience=3),  # stop after 3 epochs without improvement in val_loss
                     TensorBoard(log_dir='tensor', histogram_freq=0, write_graph=True,
                                 write_images=True, update_freq='epoch')])  # callbacks are invoked during training, e.g. at the end of each epoch

#### 6. Model evaluation

In [None]:
loss, acc = model.evaluate(Xtest, ytest)
print(f'test loss: {loss}, test acc: {acc}')

#### 7. Prediction

In [None]:
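# predict_classes was removed from recent Keras releases; argmax over the softmax output is equivalent
model.predict(X).argmax(axis=1)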

#### Exercise: Build a classification model for the iris dataset

---

### [Convolutional Neural Network (CNN)](https://keras.io/layers/convolutional/)

<img src="pics/dl_cnn.png" width=600 alt="Typical cnn.png"><br>By Aphex34 - Own work, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=45679374">Link</a>

> _Convolutional Neural Networks are very similar to ordinary Neural Networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply._  
 _So what changes? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the amount of parameters in the network._ - [source](http://cs231n.github.io/convolutional-networks/)

> _Convolutional Neural Networks have a different architecture than regular Neural Networks. Regular Neural Networks transform an input by putting it through a series of hidden layers. Every layer is made up of a set of neurons, where each layer is fully connected to all neurons in the layer before. Finally, there is a last fully-connected layer — the output layer — that represent the predictions._  
 _Convolutional Neural Networks are a bit different. First of all, the layers are organised in 3 dimensions: width, height and depth. Further, the neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. Lastly, the final output will be reduced to a single vector of probability scores, organized along the depth dimension._ - [source](https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050)

A Convolutional Neural Network consists of different building blocks:
- Convolutional layers: feature extraction
- Pooling layers: feature selection
- Dense layers: classification

#### Convolution layer

A filtering layer with learnable filters. Its purpose is to detect features in the input. For images, these features could be edges or even shapes.

> _In mathematics convolution is a mathematical operation on two functions (f and g) to produce a third function that expresses how the shape of one is modified by the other._ - [source](https://en.wikipedia.org/wiki/Convolution)

<img src="pics/dl_convolution.gif" alt="Convolution of box signal with itself2.gif"><br>By <a href="//commons.wikimedia.org/wiki/File:Convolution_of_box_signal_with_itself.gif" title="File:Convolution of box signal with itself.gif">Convolution_of_box_signal_with_itself.gif</a>: Brian Amberg
derivative work: <a href="//commons.wikimedia.org/wiki/User:Tinos" title="User:Tinos">Tinos</a> (<a href="//commons.wikimedia.org/wiki/User_talk:Tinos" title="User talk:Tinos"><span class="signature-talk">talk</span></a>) - <a href="//commons.wikimedia.org/wiki/File:Convolution_of_box_signal_with_itself.gif" title="File:Convolution of box signal with itself.gif">Convolution_of_box_signal_with_itself.gif</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=11003835">Link</a>

> _We execute a convolution by sliding the filter over the input. At every location, a matrix multiplication is performed and sums the result onto the feature map._ 
_In the animation below, you can see the convolution operation. You can see the filter (the green square) is sliding over our input (the blue square) and the sum of the convolution goes into the feature map (the red square)._ - [source](https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050)

<div style="display: inline-block;">
<img src="pics/dl_sliding_window.gif" width=400 align='left'>
<img src="pics/dl_filter.png" width=400 align='left'>
</div>

<div style='align: clear'>
<br>
Animation by <a href="https://towardsdatascience.com/@ardendertat">Arden Dertat</a>, <a href="https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2">Link</a> 
Image by <a href="https://towardsdatascience.com/@ardendertat">Arden Dertat</a>, <a href="https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2">Link</a>
</div>

The three main parameters to watch in a convolutional layer are:

- __depth__: The number of filters we'd like to use
- __stride__: The size of the step the convolution filter moves each time
- __padding__: the size of the zero-padding around the input
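
The sliding-window operation and the effect of stride and padding can be sketched in NumPy for a single filter. The function name `conv2d_single` and the hand-made edge filter are assumptions for illustration; a real convolutional layer learns its filters and applies many of them in parallel (the depth). Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1, padding=0):
    """Slide one filter over a 2-D input and return the feature map."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")  # zero-padding
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = image[r:r + kh, c:c + kw]
            out[i, j] = np.sum(window * kernel)  # element-wise product, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # simple vertical-edge filter

valid = conv2d_single(image, kernel)              # no padding
same = conv2d_single(image, kernel, padding=1)    # padding keeps the input size
strided = conv2d_single(image, kernel, stride=2)  # larger steps, smaller output
print(valid.shape, same.shape, strided.shape)
```

For an input of size $n$, filter size $k$, padding $p$ and stride $s$, the output size along each axis is $\lfloor (n + 2p - k) / s \rfloor + 1$, which gives 3, 5 and 2 for the three calls above.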

#### Pooling layer

<img src="pics/dl_pooling.png" width=400><br>By <a href="https://cs.stanford.edu/people/karpathy/">Andrej Karpathy</a>, <a href="http://cs231n.github.io/convolutional-networks/">Link</a>

Pooling is much more straightforward: it reduces the dimensionality of the input by downsampling it. It defines a window size and an aggregation function to create an approximate, smaller version of the input. Using a pooling layer helps reduce overfitting, reduces the number of weights in the subsequent layers, and shortens training time, while keeping the most important information. The most common aggregation function is __max__.
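
A minimal NumPy sketch of max pooling on a single 2-D feature map, using a 2x2 window with stride 2 (the function name and the input values are illustrative assumptions):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Downsample a 2-D feature map by taking the max of each window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool2d(x)
print(pooled)
```

The 4x4 input shrinks to 2x2, and each output value is the strongest response within its window. Pooling has no learnable weights.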

#### Fully-connected layer

A regular fully connected (dense) layer that maps the extracted features to the output, trained with a loss function suited to the task.

#### Further reading:

- https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050
- http://cs231n.github.io/convolutional-networks/
- https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2
- https://www.analyticsvidhya.com/blog/2018/12/guide-convolutional-neural-network-cnn/
- https://github.com/ardendertat/Applied-Deep-Learning-with-Keras/blob/master/notebooks/Part%204%20%28GPU%29%20-%20Convolutional%20Neural%20Networks.ipynb

---

### In Practice

#### Build a CNN classifier for the handwritten digits dataset

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D

In [None]:
X, y = load_digits(return_X_y=True)

In [None]:
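# number of samples, height, width, channels (1 = grayscale)
Xt = X.reshape((X.shape[0], 8, 8, 1))
# .toarray() densifies the encoder's sparse output; this works across scikit-learn versions
yt = OneHotEncoder(categories='auto').fit_transform(y.reshape(-1, 1)).toarray()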

Xtrain, Xtest, ytrain, ytest = train_test_split(Xt, yt, random_state=42)

In [None]:
sns.heatmap(Xt[1, :, :, 0], cmap="gray")

In [None]:
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), strides=(1, 1), activation='relu', input_shape=(8, 8, 1)),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Flatten(),
    Dense(10, activation='softmax')
])

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
model.fit(Xtrain, ytrain, # training data
          batch_size=16,  # number of samples per gradient update
          epochs=100,     # maximum number of full passes over the training data
          validation_data=(Xtest, ytest),  # validation dataset
          callbacks=[EarlyStopping(patience=3)])  # stop after 3 epochs without improvement in val_loss

In [None]:
loss, acc = model.evaluate(Xtest, ytest)
print(f'test loss: {loss}, test acc: {acc}')

#### Exercise: Build a CNN for the MNIST classification problem

If you get stuck, see [this code](https://github.com/adventuresinML/adventures-in-ml-code/blob/master/keras_cnn.py) and the accompanying [tutorial](https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/).

In [None]:
from keras.datasets import mnist
from keras.utils import to_categorical

num_classes = 10

# input image dimensions
img_x, img_y = 28, 28

# load the MNIST data set, which already splits into train and test sets for us
(Xtrain, ytrain), (Xtest, ytest) = mnist.load_data()

# because the MNIST is greyscale, we only have a single channel
Xtrain = Xtrain.reshape()  # TODO: fill in the required shape 
Xtest = Xtest.reshape()    # TODO: fill in the required shape 
input_shape = ()           # TODO: fill in the required shape 

# keras built-in OneHotEncoder solution
ytrain = to_categorical(ytrain, num_classes)
ytest = to_categorical(ytest, num_classes)

In [None]:
# plot the first image in Xtrain with sns.heatmap


In [None]:
# define model here
model = Sequential([
    
])

In [None]:
# compile model here


In [None]:
model.summary()

In [None]:
# fit model


In [None]:
# evaluate model


---

### [Recurrent Neural Networks (RNN)](https://keras.io/layers/recurrent/)

<img src="pics/dl_rnn.svg" alt="Recurrent neural network unfold.svg" height="213" width="640"><br>By <a href="//commons.wikimedia.org/wiki/User:Ixnay" title="User:Ixnay">François Deloche</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=60109157">Link</a>

> _A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition._ - [source](https://en.wikipedia.org/wiki/Recurrent_neural_network)

So, basically, the network has memory: it retains information about previous inputs by feeding the results of earlier steps back in.
> _A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop: as you can see above, this chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists._ - [source](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) 
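
The unrolled loop can be sketched in NumPy as a plain (Elman-style) RNN: the same weights are applied at every time step, and the hidden state `h` carries information forward. The weight names, sizes, and random values below are assumptions for illustration only.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over a sequence and return all hidden states."""
    h = np.zeros(W_h.shape[0])  # initial hidden state (the "memory")
    states = []
    for x_t in inputs:          # one iteration per time step
        # The new state depends on the current input AND the previous state.
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(42)
n_steps, n_features, n_units = 5, 3, 4
inputs = rng.normal(size=(n_steps, n_features))
W_x = rng.normal(scale=0.5, size=(n_features, n_units))
W_h = rng.normal(scale=0.5, size=(n_units, n_units))
b = np.zeros(n_units)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # one hidden state per time step
```

Because `h` is squashed through `tanh` and multiplied by `W_h` at every step, the influence of early inputs tends to fade over long sequences.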

The biggest problem with this setup is easy to see in the following sentence: "I grew up in _France_... that's why I speak fluent _French_." The related pieces of information are far apart, so it is easy for the network to miss the connection. In theory, RNNs can learn such long-range dependencies, but in practice they are often unable to do so.  
A more complex variant, the LSTM, is able to overcome this difficulty.

#### [Long-Short Term Memory (LSTM) Networks](https://keras.io/layers/recurrent/#lstm)

They are built to have long-term memory. They have the same kind of chained structure, but the modules themselves are different.

<img src="pics/dl_lstm.png" width="600"><br>By <a href="https://colah.github.io/about.html">Christopher Olah</a>, <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Link</a>

> _Long short-term memory (LSTM) is an artificial recurrent neural network, (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections that make it a "general purpose computer" (that is, it can compute anything that a Turing machine can). It can not only process single data points (such as images), but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. Bloomberg Business Week wrote: "These powers make LSTM arguably the most commercial AI achievement, used for everything from predicting diseases to composing music."_  
_A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell._  
_LSTM networks are well-suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. LSTMs were developed to deal with the exploding and vanishing gradient problems that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over RNNs, hidden Markov models and other sequence learning methods in numerous applications._ - [source](https://en.wikipedia.org/wiki/Long_short-term_memory)


Its main building blocks are:
- Cell state: acts as a highway that transports relevant information along the sequence chain.
- Forget gate: decides which information should be kept and which should be discarded.
- Input gate: decides which new information is written to the cell state.
- Output gate: decides what the next hidden state (which contains information on previous inputs) should be.
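
The interplay of these building blocks can be sketched as a single LSTM time step in NumPy. This follows the standard textbook equations; the stacking order of the parameters (forget, input, candidate, output) and all sizes are assumptions of this sketch and do not necessarily match the internal layout used by Keras.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b stack the four transformations."""
    n = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b   # shape: (4 * n,)
    f = sigmoid(z[:n])             # forget gate: what to keep from c_prev
    i = sigmoid(z[n:2 * n])        # input gate: how much new information to write
    g = np.tanh(z[2 * n:3 * n])    # candidate values for the cell state
    o = sigmoid(z[3 * n:])         # output gate: what to expose as hidden state
    c = f * c_prev + i * g         # updated cell state (the "highway")
    h = o * np.tanh(c)             # next hidden state
    return h, c

rng = np.random.default_rng(0)
n_features, n_units, n_steps = 3, 4, 6
W = rng.normal(scale=0.5, size=(n_features, 4 * n_units))
U = rng.normal(scale=0.5, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

h = np.zeros(n_units)
c = np.zeros(n_units)
for x_t in rng.normal(size=(n_steps, n_features)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print("final hidden state:", h)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through a step almost unchanged, which is what lets an LSTM carry information across long gaps.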

#### Further reading:

- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://medium.com/datadriveninvestor/a-high-level-introduction-to-lstms-34f81bfa262d
- https://skymind.ai/wiki/lstm
- https://www.dlology.com/blog/how-to-use-return_state-or-return_sequences-in-keras/

---

### In Practice

#### Build a sentiment predictor on movie reviews

Based on [this](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) tutorial.

In [None]:
from keras.models import Sequential

from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import LSTM

from keras.datasets import imdb

from keras.preprocessing import sequence

In [None]:
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [None]:
max_review_length = 500
# pad_sequences pads (or truncates) every document in the corpus to the given length
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [None]:
embedding_vector_length = 32

model = Sequential([
    Embedding(input_dim=top_words,                 # number of words in the vocab
              output_dim=embedding_vector_length,  # size of the embedding vector
              input_length=max_review_length),     # size of the documents
    LSTM(units=100),
    Dense(1, activation='sigmoid')
])

In [None]:
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.summary()

In [None]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

In [None]:
score = model.evaluate(X_test, y_test, batch_size=16)
print('test loss: {}, test accuracy: {}'.format(*score))

#### Exercise: Predict simulated stock prices

Follow this [tutorial](https://stackabuse.com/time-series-analysis-with-lstm-using-pythons-keras-library/).

---

### [Word](https://keras.io/layers/embeddings/) [Embeddings](https://radimrehurek.com/gensim/models/word2vec.html)

<div style="display: inline-block;">
<img src="pics/dl_king_queen_embedding.png" width=400 align='left'>
<img src="pics/dl_king_queen_composition.png" width=400 align='left'>
</div>

<div style='align: clear'/>
<br>Images from <a href="https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/">the morning paper</a>

> _Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension._  
_Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base method, and explicit representation in terms of the context in which words appear._ - [source](https://en.wikipedia.org/wiki/Word_embedding)

The intuition behind the model is that words appearing in similar contexts have similar meanings.
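
The mapping from "one dimension per word" to a low-dimensional vector can be sketched in NumPy: multiplying a one-hot vector by the embedding matrix simply selects one row of it. The toy vocabulary and the random, untrained matrix are assumptions for illustration; a trained model would learn these values.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]  # toy vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

rng = np.random.default_rng(0)
embedding_dim = 3
embeddings = rng.normal(size=(len(vocab), embedding_dim))  # untrained, random

idx = word_to_index["queen"]
one_hot = np.zeros(len(vocab))
one_hot[idx] = 1.0

via_matmul = one_hot @ embeddings  # one-hot vector times embedding matrix
via_lookup = embeddings[idx]       # equivalent, much cheaper row lookup

def cosine_similarity(a, b):
    """Similarity measure commonly used to compare word vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(via_lookup)
print(cosine_similarity(embeddings[0], embeddings[1]))
```

An embedding layer stores exactly such a matrix and performs the row lookup, so only the rows of the words that occur in a batch are updated during training.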

#### Training

<div style="display: inline-block;">
<img src="pics/dl_w2v_training_data.png" width=300 align='left'>
<img src="pics/dl_w2v_skip_grams.png" width=300 align='left'>
<img src="pics/dl_w2v_weight_matrix.png" width=300 align='left'>
</div>

<div style='align: clear'/>
<br>Images from <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">Word2Vec Tutorial - The Skip-Gram Model.</a>, by <a href="http://mccormickml.com/">Chris McCormick</a>

There are two approaches to learning word embeddings:
- the continuous bag-of-words (CBOW) architecture: the model predicts the selected word from the context words in the surrounding window (the order of the context words does not matter)
- the skip-gram architecture: the model predicts the context words from the selected word (context words are weighted by their distance to the selected word)
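
The difference between the two approaches is easiest to see in the training examples they generate. The sketch below builds both kinds from a tokenized sentence in plain Python; the function names are hypothetical, and the distance weighting of context words is left out for simplicity.

```python
def skip_gram_pairs(tokens, window=2):
    """(selected word, context word) pairs: predict context from the word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, selected word) examples: predict the word from context."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumps".split()
pairs = skip_gram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
print(pairs)
print(cbow)
```

Skip-gram produces one training pair per (word, neighbour) combination, while CBOW produces one example per word with all of its neighbours as input.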


#### Further reading:

- https://www.tensorflow.org/tutorials/representation/word2vec#vector-representations-of-words
- https://www.quora.com/How-does-word2vec-work-Can-someone-walk-through-a-specific-example
- http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
- https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
- https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526
- https://hackernoon.com/word-embeddings-in-nlp-and-its-applications-fab15eaf7430
- https://blog.cambridgespark.com/tutorial-build-your-own-embedding-and-use-it-in-a-neural-network-e9cde4a81296
- https://skymind.ai/wiki/word2vec
- https://github.com/anvaka/word2vec-graph
- https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
- https://heartbeat.fritz.ai/using-a-keras-embedding-layer-to-handle-text-data-2c88dc019600
- [Google Word2Vec](https://code.google.com/archive/p/word2vec/)


---

### In Practice

#### Learning simple word-embeddings

In [None]:
import numpy as np

from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

In [None]:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

labels = np.array([1, 1, 1, 1, 1,
                   0, 0, 0, 0, 0])

In [None]:
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

In [None]:
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

In [None]:
# define the model
model = Sequential([
    Embedding(vocab_size, 8, input_length=max_length),
    Flatten(),
    Dense(1, activation='sigmoid')
])

In [None]:
# compile the model
model.compile(optimizer='adam', 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

In [None]:
# summarize the model
print(model.summary())

In [None]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)

In [None]:
# evaluate the model
score = model.evaluate(padded_docs, labels, verbose=0)
print('loss: {}, accuracy: {}'.format(*score))

Test model with an example:

In [None]:
text = "good effort"
enc_text = [one_hot(text, vocab_size)]
pad_text = pad_sequences(enc_text, maxlen=max_length, padding='post')
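# predict_classes was removed from recent Keras releases; threshold the sigmoid output at 0.5 instead
pred_text = (model.predict(pad_text) > 0.5).astype('int32')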

text, enc_text, pad_text, pred_text

#### Exercise: News classification

Classify the 20newsgroups dataset while building an embedding. As a first step, try to separate the atheism documents (`alt.atheism`) from the Christian documents (`soc.religion.christian`).

---

### Further tutorials:
- https://www.pyimagesearch.com/2018/09/10/keras-tutorial-how-to-get-started-with-keras-deep-learning-and-python/
- https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/
- https://www.datacamp.com/community/tutorials/deep-learning-python
- https://elitedatascience.com/keras-tutorial-deep-learning-in-python
- https://www.guru99.com/keras-tutorial.html
- https://github.com/adventuresinML/adventures-in-ml-code