# Intro to Data Science
## Part VIII. - Deep Learning and its applications

### Table of contents

- #### Deep learning basics
    - <a href="#What-is-Deep-Learning?">Theory</a>
    - <a href="#1.-Layers">Layer Architecture types</a>
    - <a href="#2.-Activision-and-Loss-Functions">Activation and Loss Functions</a>
    
- #### In practice
    - <a href="#Classification-Regression">Classification and Regression</a>
    - <a href="#Image-Processing">Image Processing</a>
    - <a href="#Word-Embedding">Word Embedding</a>
    
---

# I. Deep learning basics

## What is Deep Learning?

> _Deep learning refers to neural networks with multiple hidden layers that can learn increasingly abstract representations of the input data._ [source](https://elitedatascience.com/keras-tutorial-deep-learning-in-python)

> _Deep learning is a class of neural network algorithms that:_
> - _use a cascade of __multiple layers__ of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input._
> - _learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) manners._
> - _learn __multiple levels of representations__ that correspond to __different levels of abstraction__; the levels form a hierarchy of concepts._ 
[source](https://en.wikipedia.org/wiki/Deep_learning#Definition)

## Why is it important?

Deep learning is widely used in our daily lives. It powers web search engines, recommender systems, image recognition systems, and self-driving cars. It helps generate sound, images, and text, and build better AI agents.  
It is the current state-of-the-art approach for many tasks, including image recognition, text mining, and classification.

## Tools
- Scikit-Learn
- Gensim
- Tensorflow
- Torch
- Keras

## 1. Layers

### [Dense feedforward network](https://keras.io/layers/core/#dense)

A dense layer is a regular, fully connected layer of neurons in a neural network. Each neuron receives input from all the neurons in the previous layer, hence "densely connected". The layer has a weight matrix W and a bias vector b, and it operates on the activations a of the previous layer. The following is the docstring of the `Dense` class from the Keras documentation:

> `output = activation(dot(input, kernel) + bias)`, where `activation` is the element-wise activation function passed as the `activation` argument, `kernel` is a weights matrix created by the layer, and `bias` is a bias vector created by the layer.
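
The formula in the docstring can be sketched in plain NumPy. This is only an illustration of the arithmetic, not how Keras implements the layer; the function names `relu` and `dense_forward` and the toy values are assumptions made for this example.

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)

def dense_forward(inputs, kernel, bias, activation=relu):
    """Compute activation(dot(inputs, kernel) + bias) for a batch of inputs."""
    return activation(np.dot(inputs, kernel) + bias)

# Hypothetical toy values: a batch of 2 samples with 3 features, mapped to 2 units.
inputs = np.array([[1.0, 2.0, 3.0],
                   [0.0, -1.0, 1.0]])
kernel = np.array([[0.5, -1.0],
                   [0.0, 1.0],
                   [1.0, 0.0]])
bias = np.array([0.5, -0.5])

outputs = dense_forward(inputs, kernel, bias)
print(outputs)  # one row per sample, one column per unit
```

Every output unit depends on every input feature through its own column of `kernel`, which is what "densely connected" means in practice.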

### [Regularization](https://chatbotslife.com/regularization-in-deep-learning-f649a45d6e0)

There are many different kinds of regularization techniques, and most deep learning frameworks provide them.
#### Early stopping
Stop training when the monitored loss (typically the validation loss) is no longer decreasing.
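
The stopping rule can be sketched in pure Python. The function name `early_stopping_epoch`, the `patience` semantics, and the loss values below are assumptions for illustration; library callbacks may differ in details such as minimum improvement thresholds.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs. If that never happens, the index of
    the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Hypothetical validation losses: improvement stalls after the third epoch.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.59]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)
```

In this toy run the later improvement to 0.59 is never seen, which is the usual trade-off: a larger `patience` tolerates longer plateaus at the cost of more training time.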
#### Dropout
Dropout is a technique used to tackle overfitting. The `Dropout` layer in the `keras.layers` module takes a float between 0 and 1, which is the fraction of the input units to drop. Below is the description of the `Dropout` layer from the documentation:

> Dropout consists in randomly setting a fraction `rate` of input units to 0 at each update during training time, which helps prevent overfitting.
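
A minimal NumPy sketch of the idea follows. It assumes the common "inverted dropout" variant, in which the kept units are rescaled during training so that nothing needs to change at inference time; the function name `dropout` and its arguments are hypothetical.

```python
import numpy as np

def dropout(inputs, rate, rng, training=True):
    """Zero each input unit with probability `rate` during training.

    Kept units are scaled by 1 / (1 - rate) so that the expected value of
    each activation is unchanged. Outside training the inputs pass through.
    """
    if not training or rate == 0.0:
        return inputs
    keep_mask = rng.random(inputs.shape) >= rate
    return inputs * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((4, 5))

dropped = dropout(activations, rate=0.4, rng=rng)
print(dropped)  # entries are either 0 or 1 / 0.6
```

Because a different random mask is drawn at every update, the network cannot rely on any single unit, which is the intuition behind the regularizing effect.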
#### Weight penalty
An L1 or L2 penalty on the weights, added as an extra term to the loss function.
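
As a rough sketch, the penalty is a scalar computed from the weights and added to the data loss. The function names, the `strength` parameter, and the numbers below are assumptions chosen for illustration.

```python
import numpy as np

def l1_penalty(weights, strength):
    """L1 penalty: strength times the sum of absolute weight values."""
    return strength * np.sum(np.abs(weights))

def l2_penalty(weights, strength):
    """L2 penalty: strength times the sum of squared weight values."""
    return strength * np.sum(np.square(weights))

# Hypothetical weight matrix and data loss.
weights = np.array([[0.5, -1.0],
                    [2.0, 0.0]])
data_loss = 0.30

total_loss_l1 = data_loss + l1_penalty(weights, strength=0.01)
total_loss_l2 = data_loss + l2_penalty(weights, strength=0.01)
print(total_loss_l1, total_loss_l2)
```

The L2 term grows quadratically with each weight and so mostly punishes large weights, while the L1 term tends to push small weights to exactly zero.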

### [Convolutional network](https://keras.io/layers/convolutional/)

<img src="https://upload.wikimedia.org/wikipedia/commons/6/63/Typical_cnn.png" width=600 alt="Typical cnn.png"><br>By <a href="//commons.wikimedia.org/w/index.php?title=User:Aphex34&amp;action=edit&amp;redlink=1" class="new" title="User:Aphex34 (page does not exist)">Aphex34</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=45679374">Link</a>

> _CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons usually refer to fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "fully-connectedness" of these networks make them prone to overfitting data. Typical ways of regularization includes adding some form of magnitude measurement of weights to the loss function. However, CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble more complex patterns using smaller and simpler patterns. Therefore, on the scale of connectedness and complexity, CNNs are on the lower extreme._ - [source](https://en.wikipedia.org/wiki/Convolutional_neural_network)
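
The quote does not spell out the convolution operation itself. As a rough illustration of how each output of a convolutional layer depends only on a small patch of the input, with the same small kernel reused at every position, here is a NumPy sketch. The function name `conv2d_valid` and the toy image and kernel are assumptions; real layers also handle channels, strides, padding, and batches.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kernel_h, j:j + kernel_w]
            output[i, j] = np.sum(patch * kernel)
    return output

# Hypothetical 4x4 "image" and a 2x2 kernel that compares diagonal neighbours.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
```

The kernel here has 4 weights regardless of the image size, whereas a dense layer mapping 16 inputs to 9 outputs would need 144 weights.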

### [Recurrent networks](https://keras.io/layers/recurrent/)

<img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/Recurrent_neural_network_unfold.svg" alt="Recurrent neural network unfold.svg" height="213" width="640"><br>By <a href="//commons.wikimedia.org/wiki/User:Ixnay" title="User:Ixnay">François Deloche</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=60109157">Link</a>

> _A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition._ - [source](https://en.wikipedia.org/wiki/Recurrent_neural_network)
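
The "internal state" in the quote can be sketched as a hidden vector that is updated at every time step from the current input and the previous state. The sketch below assumes a simple tanh recurrence; the function name `rnn_forward`, the weight names, and the random toy data are assumptions for illustration.

```python
import numpy as np

def rnn_forward(inputs, w_input, w_hidden, bias):
    """Run a simple RNN over a sequence and return the hidden state at each step."""
    hidden = np.zeros(w_hidden.shape[0])
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state.
        hidden = np.tanh(x_t @ w_input + hidden @ w_hidden + bias)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, 3))          # 5 time steps, 3 features each
w_input = rng.normal(size=(3, 4)) * 0.1     # input-to-hidden weights
w_hidden = rng.normal(size=(4, 4)) * 0.1    # hidden-to-hidden weights
bias = np.zeros(4)

states = rnn_forward(sequence, w_input, w_hidden, bias)
print(states.shape)  # (time steps, hidden units)
```

The same weights are applied at every time step, so the network can process sequences of any length.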

#### [LSTM](https://keras.io/layers/recurrent/#lstm)

<img src="https://upload.wikimedia.org/wikipedia/commons/3/3b/The_LSTM_cell.png" alt="The LSTM cell.png" height="420" width="430"><br>By <a href="//commons.wikimedia.org/w/index.php?title=User:GChe&amp;action=edit&amp;redlink=1" class="new" title="User:GChe (page does not exist)">Guillaume Chevalier</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by/4.0" title="Creative Commons Attribution 4.0">CC BY 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=71836793">Link</a>

> _Long short-term memory (LSTM) is an artificial recurrent neural network, (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections that make it a "general purpose computer" (that is, it can compute anything that a Turing machine can). It can not only process single data points (such as images), but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. Bloomberg Business Week wrote: "These powers make LSTM arguably the most commercial AI achievement, used for everything from predicting diseases to composing music."_

> _A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell._

> _LSTM networks are well-suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. LSTMs were developed to deal with the exploding and vanishing gradient problems that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over RNNs, hidden Markov models and other sequence learning methods in numerous applications._ - [source](https://en.wikipedia.org/wiki/Long_short-term_memory)
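
The cell and the three gates described above can be sketched as a single time step in NumPy. This follows the commonly used LSTM formulation; the function name `lstm_step`, the dictionary layout of the weights, and the random toy data are assumptions for illustration, and library implementations organize the parameters differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, weights, biases):
    """One LSTM time step.

    `weights` maps each gate name to a matrix of shape
    (input_dim + hidden_dim, hidden_dim); `biases` maps it to a vector.
    """
    z = np.concatenate([x_t, h_prev])
    forget_gate = sigmoid(z @ weights["forget"] + biases["forget"])
    input_gate = sigmoid(z @ weights["input"] + biases["input"])
    output_gate = sigmoid(z @ weights["output"] + biases["output"])
    candidate = np.tanh(z @ weights["candidate"] + biases["candidate"])
    # The cell keeps part of its old content and adds part of the candidate.
    c_t = forget_gate * c_prev + input_gate * candidate
    # The output gate decides how much of the cell is exposed as hidden state.
    h_t = output_gate * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
gate_names = ["forget", "input", "output", "candidate"]
weights = {name: rng.normal(scale=0.1, size=(input_dim + hidden_dim, hidden_dim))
           for name in gate_names}
biases = {name: np.zeros(hidden_dim) for name in gate_names}

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):  # hypothetical 6-step sequence
    h, c = lstm_step(x_t, h, c, weights, biases)
print(h.shape, c.shape)
```

Because the cell state is updated additively and gated, information can persist over many steps, which relates to the gap-length insensitivity mentioned in the quote.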

### [Word](https://keras.io/layers/embeddings/) [Embeddings](https://radimrehurek.com/gensim/models/word2vec.html)

<img src="http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png" width=700><br>By <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">McCormick, C. (2016, April 19) Word2Vec Tutorial - The Skip-Gram Model.</a>  

> _Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension._

> _Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base method, and explicit representation in terms of the context in which words appear._ - [source](https://en.wikipedia.org/wiki/Word_embedding)

- [Google Word2Vec](https://code.google.com/archive/p/word2vec/)
- [Word2Vec explained](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
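
The mapping from a one-dimension-per-word space to a low-dimensional vector space can be sketched as a lookup table. The vocabulary and the 2-dimensional vectors below are made up by hand for illustration; real embeddings are learned from text and usually have tens to hundreds of dimensions.

```python
import numpy as np

vocabulary = ["king", "queen", "apple", "orange"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Hypothetical embedding matrix: one 2-dimensional row per word.
embeddings = np.array([
    [0.9, 0.1],   # king
    [0.8, 0.2],   # queen
    [0.1, 0.9],   # apple
    [0.2, 0.8],   # orange
])

def one_hot(word):
    """Represent a word in the space with one dimension per vocabulary entry."""
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0
    return vector

def embed(word):
    """Look up the dense vector of a word."""
    return embeddings[word_to_index[word]]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Multiplying a one-hot vector by the matrix selects the same row as the lookup.
projected = one_hot("queen") @ embeddings

sim_related = cosine_similarity(embed("king"), embed("queen"))
sim_unrelated = cosine_similarity(embed("king"), embed("apple"))
print(sim_related, sim_unrelated)
```

With well-trained embeddings, words used in similar contexts end up close together, so a similarity measure like this becomes meaningful.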

## 2. Activation and Loss functions

# II. In practice

Tutorials:
- https://www.pyimagesearch.com/2018/09/10/keras-tutorial-how-to-get-started-with-keras-deep-learning-and-python/
- https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/
- https://www.datacamp.com/community/tutorials/deep-learning-python
- https://elitedatascience.com/keras-tutorial-deep-learning-in-python
- https://www.guru99.com/keras-tutorial.html (regression)

## Building a simple network for classification

### Loading data

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
X, y = load_digits(return_X_y=True)
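# One-hot encode the labels; the encoder returns a sparse matrix, so convert it to a dense array for Keras
yt = OneHotEncoder(categories='auto').fit_transform(y.reshape(-1, 1)).toarray()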

Xtrain, Xtest, ytrain, ytest = train_test_split(X, yt, random_state=42)
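
To make the label transformation concrete, here is a small NumPy sketch of what one-hot encoding produces. It is not how `OneHotEncoder` is implemented; the labels and the number of classes are hypothetical.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # hypothetical integer class labels
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot_labels = np.eye(num_classes)[labels]
print(one_hot_labels)
```

Each row has a single 1 in the column of its class, which is the target format expected by a softmax output layer trained with categorical cross-entropy.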

### Model construction

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.callbacks import EarlyStopping

In [None]:
model = Sequential()
model.add(Dense(8, activation='relu', input_dim=64))
model.add(Dense(10, activation='softmax'))

### Assembly

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
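
The softmax activation of the output layer and the categorical cross-entropy loss named in the compile call can be sketched in NumPy as follows. This is only an illustration of the math; the function names, the clipping constant `eps`, and the toy values are assumptions, and the Keras implementation may differ in numerical details.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # for numerical stability
    exponentials = np.exp(shifted)
    return exponentials / np.sum(exponentials, axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class, per sample."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Hypothetical raw scores for 2 samples and 3 classes, with one-hot targets.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

probabilities = softmax(logits)
losses = categorical_crossentropy(y_true, probabilities)
print(probabilities, losses)
```

The loss is small when the network puts most of its probability on the correct class; a uniform guess over three classes costs log(3).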

In [None]:
model.summary()
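
The summary lists the number of trainable parameters per layer. For dense layers these counts can be checked by hand: one weight per input-unit pair plus one bias per unit. The helper name `dense_param_count` is hypothetical; the layer sizes are those of the model built above.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_params = dense_param_count(64, 8)    # 64 pixel inputs -> 8 hidden units
output_params = dense_param_count(8, 10)    # 8 hidden units -> 10 digit classes
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```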

In [None]:
model.fit(Xtrain, ytrain, batch_size=16, epochs=100, validation_data=(Xtest, ytest), callbacks=[EarlyStopping(patience=3)])

In [None]:
model.evaluate(X, yt)

In [None]:
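# predict_classes is not available in recent Keras/TensorFlow releases;
# the argmax over the predicted class probabilities gives the class labels
model.predict(X).argmax(axis=-1)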

Exercise: Build a classification model for the iris dataset

## Regression

## Image processing

## Embedding