# SCS 3546 Week 1 - Introduction to Course & Review

## Agenda
- Introduction to the course
  - Course outline
  - Meet other learners
- Introduction to Deep Learning
  - Applications of Deep Learning
  - Moral issues
  - Review of key Machine Learning concepts
  - Introduction to Information Theory
  - Getting started with Google Colab and TensorFlow

## Learning Objectives
- Develop familiarity with course and university logistics to prepare for success
- Discuss as a group some of the moral and ethical issues of AI
- Identify some existing applications of Deep Learning and enable you to discover others
- Review material from 3253 required for this course

## Introduction to the course
Welcome to SCS 3546 Deep Learning, the second course in the Certificate in AI.

### Certificates in Data Science
The University of Toronto School of Continuing Studies offers three Data Science-related certificates:
- Management of Enterprise Analytics: This certificate is for people who will be managing or working with Data Science teams and need an understanding of the key management issues such as privacy and security and a survey of the core data science techniques.
- Data Science: This certificate covers the key statistical methods of Data Science, the technologies for analyzing small and massive distributed datasets, and the inner workings of machine learning algorithms, all using hands-on tools and exercises in Python.  The certificate includes four courses: Foundations of Data Science, Statistics for Data Science, Big Data Tools and Machine Learning.
- Artificial Intelligence: This certificate shares the Machine Learning course from the Data Science certificate and adds two additional courses, Deep Learning (this course), which extends the introduction to Deep Learning and TensorFlow from the Machine Learning course into specific Deep Learning architectures and their applications; and Intelligent Agents & Reinforcement Learning which covers search and learning techniques for problems (such as game-playing) where there isn't a single best strategy but some moves are likely better than others and there is occasional feedback that the strategy seems to be working well or not.

### Course outline
The course outline is provided in a separate document.

### Introductions
One of the objectives of courses in our certificate programs is to provide an opportunity for you to widen your network of industry contacts.  Please introduce yourself to the class and tell us a little about yourself:
- Who you work for
- Your role
- Previous experience with Python, TensorFlow and Deep Learning
- Why you're taking the course and what you'd like to get out of it

### Health, safety and local services
- Fire exits
- Washrooms
- Coffee
- Parking
- TCards and Library Access: TCard offices are listed at http://tcard.utoronto.ca/contact-us/
- U of T Closures: https://onesearch.library.utoronto.ca/holiday-hours-and-closures https://www.utoronto.ca/campus-status
- If coming by TTC for a weekend class be sure to check TTC status in advance https://www.ttc.ca/Service_Advisories/index.jsp
- Break: About half way through for 10-15 mins.

### News of the Week
The instructor may ask for volunteers (one per week) to prepare a 5 minute presentation on current events in the world of neural nets at the beginning of each class.  Learners who are in class on time will catch the latest news.

### How to get the most out of the course
- Do the readings
- Be on time
- Work/share with/help other attendees
- Do the assignments
- If you get stuck
  1. Try googling the error message: you'll probably find something useful
  2. Post a question on the portal: the instructor and other learners will help
  3. If those don't work, email the instructor

## Introduction to Deep Learning

### Learning outcomes
Be able to:
- Describe the current applications of Deep Learning and propose new applications
- Recognize and discuss the moral issues of AI
- Recall key concepts from the Machine Learning course that you will need for this course
- Use Information Theory to reason about the quality of a model
- Launch and begin to use TensorFlow to run Deep Learning models

### Applications of Deep Learning
There is a wide variety of applications of Deep Learning and more appearing all the time:

#### 1 Image recognition
- Facial recognition
  - Friends on Facebook: Facebook uses image recognition to identify people in photos
  - Your photos from a competition: There are services now that will, for example, find all the posted or intentional photos taken by people along a marathon route you've run that have you in them
- Seeing AI: This is an experimental phone app that attempts to identify and describe objects in photos you take (intended as a near-real-time aid for the visually impaired) https://www.microsoft.com/en-us/seeing-ai



#### 2 Voice recognition and intelligent assistants

- Siri, Google Home, Amazon Alexa, etc.: These services use deep nets to recognize spoken words and extract meaning.


#### 3 Medical diagnostics and drug design
- X-ray and image analysis: Deep nets now perform at levels similar to radiologists in differentiating diseased from normal images.  This will allow radiologists to focus on the complex cases and avoid spending time on images that are normal with high confidence.
- Medical CAD/CAM: Machine and Deep Learning are finding applications in systems for reconstructing damaged bones and joints.
- ECG analysis: Deep Learning is beginning to be used to analyze cardiograms for signs of disease.
- Eye and skin disease identification: It's also being used to analyze images of retinas to track progress of diseases such as glaucoma, and images of skin blemishes that may be potentially cancerous.

#### 4 Autonomous vehicles and factories
- University of Toronto / Uber Advanced Technologies Group: Uber's self-driving car team is located at the University of Toronto https://www.utoronto.ca/news/autonomous-vehicles-u-t-researchers-make-advances-new-algorithm
- Boeing: Boeing has a website specific to their autonomous systems for military and commercial purposes https://www.boeing.com/defense/autonomous-systems/index.page
- Autonom driverless taxi: http://navya.tech/autonom/cab
- Power generation: https://www.bloomberg.com/news/articles/2018-04-09/forget-cars-mitsubishi-hitachi-sees-autonomous-power-plants

#### 5 Language translation
- Google Translate: Deep nets were added to Google Translate in 2016.  It translates entire sentences at a time rather than individual words so the translation uses the context in which each word appears to produce a better result than if the words were translated individually.

#### 6 Playing board games
- Particularly in combination with Reinforcement Learning (Deep Reinforcement Learning)
- DeepMind's AlphaGo beat Lee Sedol, one of the world's best Go players, 4-1 in a five-game match in 2016: https://deepmind.com/research/alphago/.  Go is considered a much more difficult game than chess as the number of possible moves on each turn is far larger and the ramifications of each move more difficult to predict.

#### 7 Artistic and entertainment applications
- Style transfer: These nets take a style from one image such as a painting and apply it to another image.
  - Lucid: https://github.com/tensorflow/lucid
  - Prisma: https://prisma-ai.com/
- Music and sound synthesis
  - Google Project Magenta: https://magenta.tensorflow.org/
- Repairing or adding detail to images:
  - https://www.nvidia.com/research/inpainting/
  - https://www.resetera.com/threads/ai-neural-networks-being-used-to-generate-hq-textures-for-older-games-you-can-do-it-yourself.88272/
  - https://arxiv.org/abs/1609.04802

### Moral issues of AI
The World Economic Forum identified the Top 9 Ethical Issues of AI (https://www.weforum.org/agenda/2016/10/top-10-ethical-issues-in-artificial-intelligence/)
1. Unemployment: What happens after the end of jobs?
2. Inequality: How do we distribute the wealth created by machines?
3. Humanity: How do machines affect our behaviour and interaction?
4. Artificial stupidity: How can we guard against mistakes?
5. Racist robots: How do we eliminate AI bias?
6. Security: How do we keep AI safe from adversaries?
7. Evil genies: How do we protect against unintended consequences?
8. Singularity: How do we stay in control of a complex intelligent system?
9. Robot rights: How do we define the humane treatment of AI?

Class Discussion:
- Should a self-driving car sacrifice its owner to save a pedestrian?
- Should an app tell you that you likely have a serious disease?
- Who is liable if an AI makes a bad call?

### Review of key concepts from previous courses

#### Parameters & Hyperparameters
- Supervised Learning involves finding the set of model parameter values that cause a model (such as a neural net) to best fit an observed input-to-output mapping
- Modern neural nets have thousands or millions of parameters referred to as weights and biases
- We call the process of determining the settings of the parameters that cause the model to produce results similar to a dataset of inputs and corresponding outputs "learning a model"
- Some parameters can't be learned (either in principle or because we don't know how to yet).  For example: how many layers a neural net should have.  We call these parameters, which must be set by the data scientist, "hyperparameters".
- Hyperparameters sometimes become parameters as our body of knowledge improves and theory provides guidance for optimum (or at least good) settings.

#### Capacity & Overfitting
- The more parameters a model has, the more flexible it is to fit the observed input to output mapping.  For example, think of mapping a single variable to a single variable by using a line of best fit versus a polynomial.  As we add more parameters (terms and therefore degrees to the polynomial) it becomes more flexible and can approximate more and more complicated functions.
- If one model is more flexible than another we say it has higher "capacity".
- But higher capacity isn't necessarily good.  We want our models to generalize.  If the actual relationship in the training data is linear (plus noise) and we use a higher-order polynomial, the curve will start to fit to the noise in the data and make increasingly noisy predictions.  We want the model to learn the overall general patterns in the data, not the details of any particular dataset and its unique quirks.
- If a model is too flexible (has too many parameters) it can "overfit".  Overfitting happens when the model "rote learns" rather than extracting generalities.  The learning process will cause the model to contort the prediction function however necessary to make it pass near the observed points; between those points the function can end up far from the observed values and hence make for terrible predictions.
- This is usually because the parameters (weights) have taken on large values.
- Techniques called "regularization" methods exist for constraining the weights from growing too large, helping to keep a model from overfitting.  (Don't confuse this with "normalization", a word that has many different meanings in statistics; be careful to understand what a so-called normalization feature in a library actually does before using it).
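
The polynomial example above can be tried directly.  The following is a minimal sketch using NumPy only; the data (a line y = 2x + 1 plus Gaussian noise), the noise level and the choice of degrees are all made up for illustration.  The training error can only go down as the degree rises, while the error on fresh test points typically stops improving and often gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical data: a linear relationship y = 2x + 1 plus Gaussian noise
x_train = np.linspace(-1.0, 1.0, 12)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=0.3, size=x_train.shape)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = 2.0 * x_test + 1.0 + rng.normal(scale=0.3, size=x_test.shape)

def mse(y_true, y_pred):
    """Average squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (1, 3, 9):  # the degree is a hyperparameter: we choose it
    coeffs = np.polyfit(x_train, y_train, degree)  # the coefficients are the learned parameters
    train_error = mse(y_train, np.polyval(coeffs, x_train))
    test_error = mse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_error, test_error)
    print(f"degree {degree}: train MSE {train_error:.4f}, test MSE {test_error:.4f}")
```

Compare the two columns as the degree grows: a model whose training error keeps falling while its test error rises is fitting the noise rather than the underlying line.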

#### Artificial Neural Nets
- A neural net is a layered network of decision-making or curve-fitting units called neurons.
- Although the idea of the first artificial neural nets was inspired by brain research in the late 1800's and early 1900's, the operation of brain cells is much more complex than artificial neurons so we must be careful not to oversell the similarity.
- An artificial neural net can approximate (closely) any continuous function if it has sufficient capacity (number of neurons).  This is called "universality".
- Each layer of a neural net is almost always fully connected to the layer before it and sometimes there are connections between non-adjacent layers.
- Layers other than the input and output layer are called "hidden" layers (but there isn't really any other significance to the term).
- "Deep" means that the network has more than one hidden layer.
- Training or Learning for a neural net means finding the weights and bias for each neuron that cause the net to best mimic (while avoiding overfitting) the input-to-output mapping implied by the dataset used for the training.
- To begin with we typically initialize a neural net with small random weights to help the training process get started.
- An important word of caution: We need to be careful when using a trained network where the net might see inputs in production use that it wasn't exposed to during training; in this case the behaviour of the net could be completely unpredictable: in effect it's being asked a question it's never seen anything like before and the response could be completely inappropriate.  Safeguards against nonsense outputs would be prudent.
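
To make the layered structure concrete, here is a minimal NumPy sketch of a forward pass through a small fully connected net with two hidden layers.  The layer sizes, the ReLU activation and the initialization scale are illustrative assumptions, not values prescribed by the course.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """A common activation function: pass positives through, clip negatives to 0."""
    return np.maximum(z, 0.0)

# Illustrative layer sizes: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
sizes = [3, 4, 4, 1]

# Small random weights and zero biases, one pair per layer
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    """Push a batch of inputs through every layer in turn."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)               # hidden layers
    return a @ weights[-1] + biases[-1]   # linear output layer

batch = rng.normal(size=(5, 3))  # 5 made-up examples with 3 features each
output = forward(batch)
n_params = sum(W.size for W in weights) + sum(b.size for b in biases)
print("output shape:", output.shape)
print("number of parameters:", n_params)
```

Even this toy net has 41 parameters to learn (12 + 16 + 4 weights and 4 + 4 + 1 biases), which gives a feel for how quickly the count grows in real networks.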

#### Loss functions
- In order for a net to learn a mapping, it needs to know how to adjust its weights and biases to fit the training set.  As it is learning, it needs some measure of how accurate its current predictions are and in which direction to adjust its parameters in order to improve its predictions.
- There are several well-known alternative measures of how far a model is from being optimum.  These measures are called loss or error functions.
- For data with a Gaussian distribution the theoretically best measure is the average squared error (i.e. the sum of the squares of the differences between each prediction and the known value, divided by the number of observations). This is what Ordinary Least Squares Regression uses but this measure is not usually the best for training neural nets.
- We will see several alternative loss functions and their relative merits in this course.
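
As a quick worked example of the average squared error described above, here is a short NumPy sketch.  The four predictions and known values are made up for illustration.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # known values (made up)
y_pred = np.array([1.5, 2.0, 2.0, 5.0])  # model predictions (made up)

squared_errors = (y_pred - y_true) ** 2   # [0.25, 0.0, 1.0, 1.0]
mse = squared_errors.sum() / len(y_true)  # sum of squares divided by number of observations
print("Average squared error:", mse)      # 0.5625
```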

#### Stochastic Gradient Descent
- Some loss functions, such as quadratic loss (i.e. the average squared error above), can be precisely optimized in a known number of operations with a specific algorithm.  For example, most stats packages solve for a line of best fit using a standard linear algebra method that is guaranteed to produce a unique optimum in a predictable amount of time.
- For other loss functions there is no known method guaranteed to find the global minimum.
- In these cases we use a technique known as gradient descent where the algorithm follows the gradient (slope) of the loss function downwards towards a minimum.  If you think of the case where there are two parameters to be learned, the loss function is a surface that we're trying to find the lowest point of.  If the loss is quadratic it's shaped like a bowl and there's a clear minimum.  But if it's more complex it may have many local minima.  Algorithms can only test one point on the surface at a time; they can't see the entire surface at a glance and see where the global minimum is.
- Fortunately however for most applications a good minimum is "good enough" even if it isn't the ultimate minimum.
- Calculating the loss and gradient precisely at a point (the current settings of all the parameters) is expensive because it is a function of *all* the data in the training set and training a neural net takes *a lot* of data
- Learning can be accelerated significantly by using an approximation of the gradient calculated by using a sampled subset of the training data
- This is called *stochastic* gradient descent because of the random sampling of the data at each learning step
- Each period in which all of the data has been used is called an *epoch*: for example, if 1/10 of the data is sampled at each step, there are 10 steps in an epoch
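
The steps above can be sketched in a few lines of NumPy.  This is a minimal illustration on a made-up problem (fitting y = 3x - 2 with two parameters `w` and `b`), not how TensorFlow implements it.  The learning rate, batch size and number of epochs are arbitrary choices, and reshuffling the data once per epoch is one common way of doing the random sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x - 2 plus a little noise
n = 1000
x = rng.uniform(-1.0, 1.0, size=n)
y = 3.0 * x - 2.0 + rng.normal(scale=0.1, size=n)

w, b = 0.0, 0.0            # parameters to be learned
learning_rate = 0.1        # hyperparameter
batch_size = 100           # 1/10 of the data per step -> 10 steps per epoch
steps_per_epoch = n // batch_size

for epoch in range(50):
    order = rng.permutation(n)  # reshuffle so each epoch uses every point once
    for step in range(steps_per_epoch):
        idx = order[step * batch_size:(step + 1) * batch_size]
        error = (w * x[idx] + b) - y[idx]
        # Gradient of the average squared error, estimated from this batch only
        grad_w = 2.0 * np.mean(error * x[idx])
        grad_b = 2.0 * np.mean(error)
        # Step downhill along the estimated gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"learned w = {w:.2f}, b = {b:.2f}")
```

Each update looks at only 100 of the 1000 points, so it is cheap but slightly noisy; over many steps the noise averages out and the parameters settle close to the values used to generate the data.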

### Introduction to Information Theory

#### What is Information Theory?
- Branch of applied mathematics concerned with quantifying the information content of a noisy signal
- Originated with Claude Shannon at Bell Labs in 1948
- Set the theoretical basis for reliable digital communications and data compression
- Introduced the key concept of "Entropy"

#### Entropy
- Key measure of Information Theory
- Quantifies the amount of uncertainty involved in a random variable
- Consider:
  - Life on an island where it rarely rains
  - If someone told you it was not going to rain tomorrow would that be useful/valuable/surprising to you?
  - What if they said they knew for sure it would (and are trustworthy)?
- Entropy measures the unexpectedness of an outcome
- Shannon Entropy H (in bits per symbol) is defined as $H = -\sum_{i} p_i \log_2(p_i)$
- More specifically, Entropy is the average amount of information you learn from an outcome
- In our island example:
  - Info from "going to rain": $I = -\log_2(1/32) = 5\ bits$ (see next cell)
  - Info from "not going to rain": $I = -\log_2(31/32) = 0.046\ bits$
- $H = (1/32) * 5 + (31/32) * 0.046 = 0.20\ bits$
- We get 0.2 bits of information on average over many weather reports
- H can also be expressed in units called "nats": use the natural $\log_e$ rather than $\log_2$
- The ideas of information theory easily extend to continuous variables which are more common in machine learning
- Watch this video: https://www.youtube.com/watch?v=ErfnhcEV1O8&t=7s

In [None]:
# Computing log base 2 and Entropy (H)
import math

print('The -log base 2 of 1/32 is:', -math.log(1.0/32.0, 2))
print('The -log base 2 of 31/32 is:', -math.log(31.0/32.0, 2))
print('H is:', 1/32 * 5 + 31/32 * 0.046)

In [None]:
# Exercise: Compute H if the chance of rain on any day is 50%

#### KL Divergence
- *Kullback-Leibler divergence* (also known as relative entropy) is a measure of how different one distribution is from another
- If we compared a distribution to itself it would have a KL divergence of 0
- You can think of it as being like a distance between two distributions: it's never negative and the more different the distributions, the higher the number
- But it isn't exactly a distance: the KL divergence between distributions *p* and *q* is different than the KL divergence between *q* and *p*
- KL divergence can also be thought of as the number of additional bits required to encode samples from *p* using a code optimized for *q*
- It can also be thought of in Bayesian terms as the information gained when one revises one's beliefs from the prior probability distribution *q* to the posterior *p*
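
These properties are easy to check numerically.  Below is a minimal sketch for discrete distributions using only the standard library; the helper name `kl_divergence` and the coin-flip distribution `q` are made up for illustration, and the function assumes every `q` probability is greater than zero.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/32, 31/32]  # the island's rain / no-rain distribution
q = [0.5, 0.5]     # a hypothetical model that treats rain as a coin flip

print("KL(p || p):", kl_divergence(p, p))  # 0.0: a distribution compared to itself
print("KL(p || q):", kl_divergence(p, q))
print("KL(q || p):", kl_divergence(q, p))  # differs from KL(p || q): not symmetric
```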

#### KL According to @SimonDeDeo May 8, 2018 Twitter
Kullback-Leibler divergence has an enormous number of interpretations and uses:
- Psychological: an excellent predictor of where attention is directed
- Epistemic: a normative measure of where you ought to direct your experimental efforts (maximize expected model-breaking) http://www.jstor.org/stable/4623265 
- Thermodynamic: a measure of work you can extract from an out-of-equilibrium system as it relaxes to equilibrium
- Statistical: too many to count, but for example, a measure of the failure of an approximation method https://t.co/h4L0O2VZXa
- Computational (machine learning): a measure of model inefficiency—the extent to which it retains useless information
- Computational (compression): the extent to which a compression algorithm designed for one system fails when applied to another
- And more (geometrical, biological, etc.; do a Twitter search to check out the whole thread)



#### Cross-Entropy
- The cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "artificial" probability distribution q, rather than the "true" distribution p. https://en.wikipedia.org/wiki/Cross_entropy
- Average message length if the message was binary-encoded
- Is equal to the Entropy of p + the KL divergence between p and q:

    $H(p,q) =  H_p + D_{KL} (p||q)$
- If you knew how to encode the information perfectly, the cross-entropy would just be equal to the entropy; if your model is off, it will be off by an amount equal to the KL divergence, which we need to add
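
The identity above can be verified numerically.  The following sketch defines the three quantities for discrete distributions using only the standard library; the function names and the model distribution `q` are made up for illustration, and `q` is assumed to be non-zero wherever `p` is.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits: average code length for events from p using a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/32, 31/32]  # "true" distribution: the island's rain / no-rain odds
q = [1/4, 3/4]     # hypothetical model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"H(p, q)           = {lhs:.4f} bits")
print(f"H(p) + KL(p || q) = {rhs:.4f} bits")
```

Because the entropy of p is fixed by the data, minimizing the cross-entropy with respect to the model q is the same as minimizing the KL divergence, which is why cross-entropy is a natural loss function.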

#### Maximum Likelihood Estimation
- Say we have a probabilistic model that has some parameters that determine its shape
- A simple example would be the Gaussian distribution which has two parameters
- Neural nets can have thousands or millions of parameters (usually called weights)
- Given a dataset of observed inputs and a resulting output and a parametric model we would like to find the parameters that would make the model make predictions for each input as similar as possible to the actual observed output (leaving aside overfitting for a moment)
- Maximum Likelihood Estimation is a mathematical method for finding the values of the parameters that result in this "best fit"
- In effect MLE minimizes the dissimilarity between the observed distribution of the training set and the model distribution (using KL Divergence or cross-entropy as the measure of dissimilarity)
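
For the two-parameter Gaussian mentioned above, the maximum likelihood estimates have a known closed form: the sample mean and the (uncorrected) sample standard deviation.  The sketch below, using made-up data, checks that nudging either parameter away from those estimates makes the observed data less likely.  The function name and nudge sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=2000)  # made-up observations

def negative_log_likelihood(mu, sigma, x):
    """Negative log-likelihood of x under a Gaussian(mu, sigma); lower means a better fit."""
    return float(np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2)))

# Closed-form maximum likelihood estimates for a Gaussian
mu_hat = data.mean()
sigma_hat = data.std()  # ddof=0 (the default) is the maximum likelihood estimate

best_nll = negative_log_likelihood(mu_hat, sigma_hat, data)
nudged = [negative_log_likelihood(mu_hat + d_mu, sigma_hat + d_sigma, data)
          for d_mu, d_sigma in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]]

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print("NLL at the estimates:", round(best_nll, 2))
print("NLL after nudging   :", [round(v, 2) for v in nudged])
```

A neural net has no such closed form, so the same objective is minimized iteratively with gradient descent instead.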

### TensorFlow

####  Installing TensorFlow
- Installing TensorFlow is a bit tricky and the instructions vary with each new release.  See https://www.tensorflow.org/install/ for the latest installation instructions.
- Note: Unfortunately TensorFlow no longer supports using the GPU on Macs (but you can do all the assignments in this course without GPU support; training the net will just take longer)

#### Setting up a cloud account
- We will be using Google Colaboratory: https://colab.research.google.com
- Other Alternatives
  - Google Cloud Platform: https://cloud.google.com/solutions/running-distributed-tensorflow-on-compute-engine
  - Microsoft Azure: https://blogs.msdn.microsoft.com/uk_faculty_connection/2017/03/27/azure-gpu-tensorflow-step-by-step-setup/
  - AWS: https://aws.amazon.com/machine-learning/amis/
  - Crestle, PaperSpace (see Resources section)

#### TensorFlow Primitives

##### Tensors

- Tensors for our purposes are simply multi-dimensional arrays (like NumPy ndarrays)
- The number of dimensions is referred to as the rank
  - Scalars are rank-0
  - Vectors are rank-1
  - Ordinary matrices are rank-2
- An important question we will face is how to best encode the relevant features of a problem into a tensor
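
Since tensors behave like NumPy ndarrays for our purposes, rank can be illustrated with NumPy alone, where it is exposed as the `ndim` attribute.  The shapes below are made up, and the (examples, height, width, colour channels) layout for images is just one common convention.

```python
import numpy as np

scalar = np.array(6.3)                   # rank-0: a single number
vector = np.array([1.0, 2.0, 3.0])       # rank-1: e.g. three features of one example
matrix = np.zeros((4, 4))                # rank-2: e.g. a tiny greyscale image
image_batch = np.zeros((10, 28, 28, 3))  # rank-4: (examples, height, width, colour channels)

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("image_batch", image_batch)]:
    print(f"{name}: rank {t.ndim}, shape {t.shape}")
```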

In [None]:
# Hands on
# But first: pip install tensorflow  --or-- conda install -c conda-forge tensorflow 

import tensorflow as tf
tf.InteractiveSession() # TensorFlow 1.x: install a default session so that .eval() works in the cells below


In [None]:
tf.constant(6.3) # Create a rank-0 tensor (constant scalar) with single-precision floating point value 6.3

In [None]:
tf.zeros(3) # Create a rank-1 tensor (vector) of length 3 initialized to all zeroes

In [None]:
tf.zeros((4, 4)).eval() # Note we get an ordinary NumPy array back when we evaluate it

Question: Why was it tf.zeros((4, 4)) in the last example and not tf.zeros(4, 4)?

In [None]:
tf.ones((2, 3, 3)).eval()

In [None]:
tf.fill((2, 2), value='hello').eval() # Fill every element with a value
                                      # (usually a real number but can be other NumPy types e.g. string)

In [None]:
tf.eye(5).eval() # Identity matrix

In [None]:
# Exercise: Create a vector array([1, 2, 3]) using NumPy arange and tf.constant()
import numpy as np
# Insert your work here

In [None]:
# Now use np.arange to create a tensor like this:
# array([[1, 2, 3],
#        [4, 5, 6],
#        [7, 8, 9]])
# Hint: See https://www.tensorflow.org/api_docs/python/tf/constant
# Insert your work here

In [None]:
# Exercise: Use tf.transpose (see https://www.tensorflow.org/api_docs/python/tf/transpose) to transpose your tensor

#### Computational graphs

- Computations in TensorFlow are represented as an instance of a tf.Graph object
- The graph consists of tf.Tensor objects that hold data and tf.Operation objects that describe mathematical operations on the data

#### Keras

- Keras is a high-level API that makes TensorFlow easier to use: https://www.tensorflow.org/guide/keras
- Keras can be used as a common language for working with other Deep Learning libraries such as Theano (but there may be small variations depending which engine you're using)


In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
# Keras makes it easy to assemble nets layer-by-layer
# Here is an example from the TensorFlow introduction to Keras:
model = keras.Sequential()
# Adds a densely-connected layer with 64 units to the model:
model.add(keras.layers.Dense(64, activation='relu'))
# Add another:
model.add(keras.layers.Dense(64, activation='relu'))
# Add a softmax layer with 10 output units:
model.add(keras.layers.Dense(10, activation='softmax'))

In [None]:
# Configure the model to run with an optimizer
model.compile(optimizer=keras.optimizers.Adam(0.001),  # Adam optimizer with learning rate 0.001
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
# Then add some data and run the learning process:
import numpy as np

data = np.random.random((1000, 32))
labels = np.random.random((1000, 10))

model.fit(data, labels, epochs=10, batch_size=32)

## Additional resources
- Style transfer
  - https://medium.com/data-science-group-iitr/artistic-style-transfer-with-convolutional-neural-network-7ce2476039fd
- Universality
  - http://neuralnetworksanddeeplearning.com/chap4.html
- Information Theory
  - https://en.wikipedia.org/wiki/Information_theory
  - http://www.inference.org.uk/itprnn/book.html
- Jupyter notebook
  - https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
- Seeing AI
  - https://www.microsoft.com/en-us/seeing-ai
- Cloud GPU services
  - https://crestle.com
  - https://paperspace.com
- Keras
  - https://www.tensorflow.org/guide/keras
- Alpha Go
  - https://www.alphagomovie.com/
  - https://deepmind.com/research/alphago/

## Next week
Model Tuning