# Dive Deeper into Classification Loss Functions

- For a two-class classification problem, the last-layer activation function is sigmoid and the loss function is binary cross entropy

- For a multi-class classification problem, the last-layer activation function is softmax and the loss function is (categorical) cross entropy

- Let's learn more about the binary cross entropy and cross entropy functions

## Let's review the sigmoid and softmax functions

- Read the following blog post for 7 minutes

https://www.depends-on-the-definition.com/guide-to-multi-label-classification-with-neural-networks/
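
As a quick refresher alongside the post, here is a minimal NumPy sketch of both activation functions. The function names and the example scores are illustrative assumptions, not taken from the blog post.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Map a vector of scores to a probability distribution."""
    shifted = z - np.max(z)      # subtracting the max avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

print(sigmoid(0.0))              # a score of 0 maps to probability 0.5

scores = np.array([2.0, 1.0, 0.1])   # assumed example scores
probs = softmax(scores)
print(probs, probs.sum())        # the probabilities sum to 1
```

Sigmoid produces one probability per output, which suits two-class problems; softmax makes the outputs compete so that they sum to 1, which suits multi-class problems.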

## Binary Cross Entropy in Numpy

- For a single data sample in a two-class classification problem, `y_true` and `y_pred` are scalars

- For a batch of data samples, `y_pred` and `y_true` can also be arrays

In [2]:
import numpy as np

def binary_cross_entropy_loss(y_pred, y):
    return (-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)).mean()

In [3]:
y_true = 1
y_pred = 0.9
# y_pred = 0.2

print(binary_cross_entropy_loss(y_pred, y_true))

0.10536051565782628
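
The same formula also works on a batch of samples. The sketch below is a variation on the function above: the `eps` clipping constant is an assumption added here so that a hard prediction of exactly 0 or 1 does not produce `inf`, and the batch values are made up for illustration.

```python
import numpy as np

def binary_cross_entropy_loss(y_pred, y, eps=1e-6):
    # eps is an assumed clipping constant that keeps log() finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return (-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)).mean()

# Hypothetical batch of four samples
y_true_batch = np.array([1, 0, 1, 0])
y_pred_batch = np.array([0.9, 0.1, 0.8, 0.3])

batch_loss = binary_cross_entropy_loss(y_pred_batch, y_true_batch)
print(batch_loss)   # mean loss over the batch

# A confident wrong prediction stays finite thanks to the clipping
hard_loss = binary_cross_entropy_loss(np.array([0.0]), np.array([1]))
print(hard_loss)
```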


## Binary Cross Entropy in Tensorflow

In [4]:
import tensorflow as tf

y_t = 1
y_p = 0.9
# y_p = 0.2

cost = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.convert_to_tensor(y_t, dtype=tf.float32),

                                               logits=tf.convert_to_tensor(np.log(y_p/(1-y_p)), dtype=tf.float32))
binary_cross_entropy = tf.reduce_mean(cost)
sess = tf.Session()

print(sess.run(binary_cross_entropy))

0.10536051
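
To see what the TensorFlow op computes, here is a NumPy sketch of the numerically stable formula given in TensorFlow's documentation for this op, `max(x, 0) - x * z + log(1 + exp(-|x|))`, where `x` is the logit and `z` the label. This is a re-implementation for illustration, not TensorFlow's source code.

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(labels, logits):
    # Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    x = np.asarray(logits, dtype=np.float64)
    z = np.asarray(labels, dtype=np.float64)
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

y_p = 0.9
logit = np.log(y_p / (1 - y_p))   # the logit is the inverse of the sigmoid

loss_from_logit = sigmoid_cross_entropy_with_logits(1, logit)
print(loss_from_logit)            # matches the NumPy binary cross entropy above
```

Working from logits rather than probabilities avoids computing `log(0)` when the sigmoid saturates.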


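Loss is a scalar - it should grow the further "off" the prediction is, no matter what direction the error is in! (Cross entropy grows faster than linearly: it tends to infinity as the predicted probability of the true class approaches 0.)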
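
A small sketch of that behaviour, using made-up predictions: the per-sample loss rises as the prediction moves away from the label, and the mirrored error on the other class costs the same.

```python
import numpy as np

def bce(y_pred, y):
    """Per-sample binary cross entropy (no averaging)."""
    return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

# True label is 1: the loss grows as the prediction drifts toward 0
for y_pred in [0.9, 0.5, 0.2, 0.01]:
    print(f"y_true=1, y_pred={y_pred}: loss={bce(y_pred, 1):.4f}")

# Mirror case: being off by 0.1 costs the same for either label
print(bce(0.9, 1), bce(0.1, 0))
```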

## Binary Cross Entropy in Keras

In [5]:
from keras import backend as K

def f_k_binary_cross_entropy(y_tr, y_pr):
    return K.mean(K.binary_crossentropy(y_tr, y_pr), axis=-1)

# same input as before, it would just be an array of 1 element
y_t = [1] 
y_p = [0.9]
# Same output!
print(K.eval(f_k_binary_cross_entropy(K.constant(y_t), K.constant(y_p))))






0.10536054




## Review of Entropy

**What is Entropy?**

In decision trees, it helped us determine information gain.

**Entropy itself represents the uncertainty around a random variable.** 
For example, the entropy of flipping a fair coin is 1 bit (the maximum for two outcomes).
However, the entropy of an unfair coin would be lower.

Entropy can be both conditional and unconditional.
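
To make the conditional case concrete, here is a sketch that computes H(Y), H(Y | X) and their difference, which is the information gain used in decision trees. The joint distribution is hypothetical and the helper name `entropy_bits` is an assumption for this example.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits, using the convention 0 * log2(0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution P(X, Y): rows are X, columns are Y
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)   # marginal P(X)
p_y = joint.sum(axis=0)   # marginal P(Y)

# Unconditional entropy of Y
h_y = entropy_bits(p_y)

# Conditional entropy: average entropy of Y within each value of X
h_y_given_x = sum(p_x[i] * entropy_bits(joint[i] / p_x[i]) for i in range(len(p_x)))

info_gain = h_y - h_y_given_x
print(h_y, h_y_given_x, info_gain)
```

Knowing X reduces the uncertainty about Y, so the conditional entropy is lower than the unconditional one here.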


What is Cross Entropy?

In [8]:
import numpy as np

def entropy(p):
    H = np.array([-p[i]*np.log2(p[i]) for i in range(len(p))]).sum()
    return H
    
p = [.5, .5]
print(entropy(p))

p = [.9, .1]
print(entropy(p))
print(entropy([0, 1]))  
# note: the last distribution has no uncertainty, since its probabilities point to one certain outcome every time
# (nan is printed only because 0 * log2(0) is undefined numerically)

1.0
0.4689955935892812
nan




## What if a probability is exactly 0 or 1?

In [28]:
p= [1, 0]
print(entropy(p))

nan




## [BETTER] So we should modify our entropy implementation

`np.clip` keeps each probability within `[eps, 1 - eps]`, so `np.log2` never receives 0

In [32]:
import numpy as np

eps = 1e-6

def entropy(p):
    H = np.array([-p[i]*np.log2(np.clip(p[i], eps, 1-eps)) for i in range(len(p))]).sum()
    return H

In [33]:
p= [1, 0]
print(entropy(p))

1.4426957622784505e-06
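
Clipping leaves a tiny residual instead of an exact 0. An alternative sketch, assuming we adopt the usual convention that 0 * log2(0) = 0, is to drop the zero-probability terms before taking the log. The name `entropy_masked` is hypothetical.

```python
import numpy as np

def entropy_masked(p):
    """Entropy in bits, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    # log2(1 / p) equals -log2(p); written this way a certain outcome gives 0.0
    return float((nonzero * np.log2(1.0 / nonzero)).sum())

print(entropy_masked([1, 0]))      # certain outcome: no uncertainty
print(entropy_masked([0.5, 0.5]))  # fair coin
print(entropy_masked([0.9, 0.1]))  # unfair coin
```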


## Cross Entropy

- For each data sample, when we have multi-class classification problem then `y_true` and `y_pred` would be a vector

- Generally speaking, CE measures how different two probability distributions are: the closer `y_pred` is to `y_true`, the lower the value

**Is Cross Entropy a Symmetric Function?**

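No. In general, H(p, q) ≠ H(q, p), because the true distribution and the predicted distribution play different roles. In our implementation below, p must be `y_t` and q must be `y_p`.

**Note**: this is **different from MSE**, which gives the same value when its two arguments are swapped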
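
A short sketch contrasting the two losses on made-up vectors: swapping the arguments leaves MSE unchanged but changes the cross entropy. The `eps` clipping is an assumption added to keep the log finite.

```python
import numpy as np

eps = 1e-6  # assumed clipping constant

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(np.clip(q, eps, 1 - eps))))

# Hypothetical one-hot label and prediction
y_t = np.array([1.0, 0.0, 0.0])
y_p = np.array([0.7, 0.2, 0.1])

print(mse(y_t, y_p), mse(y_p, y_t))                      # identical
print(cross_entropy(y_t, y_p), cross_entropy(y_p, y_t))  # different
```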

## Cross Entropy in Numpy

In [12]:
import numpy as np

eps = 1e-6

def cross_entropy(p, q):
    """Cross entropy H(p, q): p is y_t (true labels), q is y_p (predictions)."""
    # clip q so that np.log never receives 0
    return -sum([p[i]*np.log(np.clip(q[i], eps, 1-eps)) for i in range(len(p))])

y_t = np.array([1, 0, 0, 0, 0])
y_p = np.array([0.4, 0.3, 0.05, 0.05, 0.2])
# y_p = np.array([0.98, 0.01, 0, 0, 0.01])
print(cross_entropy(y_t, y_p))
print(cross_entropy(y_p, y_t))

0.916290731874155
8.289306734778764


## Cross Entropy in Scipy

In [37]:
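from scipy.stats import entropy as scipy_entropy  # alias avoids shadowing our own entropy()

def cross_entropy_via_scipy(x, y):
    ''' SEE: https://en.wikipedia.org/wiki/Cross_entropy'''
    # scipy's entropy(x, y) is the KL divergence KL(x || y), not the cross entropy;
    # adding the entropy H(x) gives H(x, y) = H(x) + KL(x || y).
    # For a one-hot x, H(x) = 0, so both quantities coincide.
    return scipy_entropy(x) + scipy_entropy(x, y)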
    
print(cross_entropy_via_scipy(y_t, y_p))

0.9162907318741551


## Cross Entropy in Keras

In [44]:
from keras import backend as K

def keras_categorical_crossentropy(y_true, y_pred):
    return K.categorical_crossentropy(y_true, y_pred)

In [41]:
import numpy as np
import tensorflow as tf

#https://medium.com/activating-robotic-minds/demystifying-cross-entropy-e80e3ad54a8
# The data are from the above link
y_t = np.array([1, 0, 0, 0, 0])
y_p = np.array([0.4, 0.3, 0.05, 0.05, 0.2])
# y_p = np.array([0.98, 0.01, 0, 0, 0.01])

print(K.eval(keras_categorical_crossentropy(K.constant(y_t), K.constant(y_p))))

0.9162909


## Cross Entropy in Tensorflow

Low-level implementation (TensorFlow 1.x API: `tf.Session`, `tf.log`)

In [10]:
# Reference: https://github.com/tensorflow/tensorflow/issues/2462
y_t = np.array([1, 0, 0, 0, 0])
y_p = np.array([0.4, 0.3, 0.05, 0.05, 0.2])
# y_p = np.array([0.98, 0.01, 0, 0, 0.01])

y_pred_tf = tf.convert_to_tensor(y_p, np.float32)
y_true_tf = tf.convert_to_tensor(y_t, np.float32)
eps = 1e-6
clipped_y_pred_tf = tf.clip_by_value(y_pred_tf, eps, 1-eps)
loss_tf = tf.reduce_mean(-tf.reduce_sum(y_true_tf * tf.log(clipped_y_pred_tf)))
with tf.Session() as sess:
    loss = sess.run(loss_tf)
    print(loss)

0.9162907


## Another implementation of Cross Entropy in Tensorflow

Higher-level implementation

In [51]:
import numpy as np
import tensorflow as tf

y_t = np.array([1, 0, 0, 0, 0])
y_p = np.array([0.4, 0.3, 0.05, 0.05, 0.2])
# y_p = np.array([0.98, 0.01, 0, 0, 0.01])

cost = tf.losses.softmax_cross_entropy(onehot_labels=tf.convert_to_tensor(y_t, dtype=tf.float32), logits=tf.convert_to_tensor(np.log(y_p), dtype=tf.float32))
sess = tf.Session()
print(sess.run(cost))

0.91629076
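
For intuition about what a softmax cross entropy on logits computes, here is a NumPy sketch using the log-sum-exp trick. It is an illustrative re-implementation under that assumption, not TensorFlow's code.

```python
import numpy as np

def softmax_cross_entropy_with_logits(onehot_labels, logits):
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()                        # for numerical stability
    log_softmax = shifted - np.log(np.exp(shifted).sum())  # log of the softmax output
    return float(-(onehot_labels * log_softmax).sum())

y_t = np.array([1, 0, 0, 0, 0])
y_p = np.array([0.4, 0.3, 0.05, 0.05, 0.2])

# np.log(y_p) serves as logits: softmax(log(y_p)) gives back y_p
loss = softmax_cross_entropy_with_logits(y_t, np.log(y_p))
print(loss)

# Adding a constant to every logit leaves the loss unchanged
loss_shifted = softmax_cross_entropy_with_logits(y_t, np.log(y_p) + 5.0)
print(loss_shifted)
```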


# Summary 

We have lots of packages to choose from, including:

- Numpy
- Scipy
- Tensorflow
- Keras

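### Logits()
The logit function is the inverse of sigmoid: `logit(p) = log(p / (1 - p))`. That is why we passed `np.log(y_p/(1-y_p))` to `tf.nn.sigmoid_cross_entropy_with_logits` above.
There is also an inverse for softmax, not just sigmoid: `np.log(y_p)` recovers valid logits (up to an additive constant), which is what we passed to `tf.losses.softmax_cross_entropy`.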
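
A minimal NumPy sketch of both inverses; the helper names are assumptions for this example. It checks that sigmoid undoes logit, and that softmax undoes the element-wise log of a probability vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of sigmoid: maps a probability back to a real-valued score."""
    return np.log(p / (1 - p))

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Sigmoid round trip
p = 0.9
print(sigmoid(logit(p)))

# Softmax round trip: log(probs) is one valid set of logits
probs = np.array([0.4, 0.3, 0.05, 0.05, 0.2])
recovered = softmax(np.log(probs))
print(recovered)

# Any constant shift of the logits maps to the same probabilities
shifted = softmax(np.log(probs) + 3.0)
print(shifted)
```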



### Tensorflow vs. Keras

Tensorflow is good when we need to go low-level, for instance when we want to define a custom loss function (to apply to an image similarity problem, for example)

Typically if a problem is simple enough to use Keras - then just use Keras


### On using different levels:

Refer to Milad's Blog post to see another example of implementing a model at three different levels of abstraction in TF!

## Question: How can we use Deep Learning (a CNN, for example) for sound classification?

- What is a spectrogram: https://towardsdatascience.com/understanding-audio-data-fourier-transform-fft-spectrogram-and-speech-recognition-a4072d228520 
    - this also concerns the Fourier Transform
    - this helps us "visualize" sound as images
    - it relates time to energy and frequencies
    - THEN - you can feed the images into a CNN!
    - For example - we could build a sound classifier that maps a baby's cry to different causes (baby is hungry, needs a diaper change, etc.)
    - HOWEVER - this is not the ONLY approach - look up how chat assistants like Siri work as an example
    - Milad's Observation: spectrograms work well when we have around 20 classes (so this wouldn't work in NLP for example, when you have many more labels for the English language)
- https://medium.com/x8-the-ai-community/audio-classification-using-cnn-coding-example-f9cbd272269e

- https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8605515
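
To illustrate the spectrogram idea, here is a minimal NumPy sketch of a short-time Fourier transform on a synthetic tone. The sample rate, the 440 Hz tone, the frame length, the hop size and the Hann window are all assumptions chosen for the example; real pipelines typically use an audio library and a mel scale.

```python
import numpy as np

sample_rate = 8000                          # Hz, assumed for this toy example
t = np.arange(sample_rate) / sample_rate    # one second of audio
signal = np.sin(2 * np.pi * 440 * t)        # synthetic 440 Hz tone

frame_len, hop = 256, 128                   # assumed frame and hop sizes
window = np.hanning(frame_len)
n_frames = 1 + (len(signal) - frame_len) // hop

# Slice the signal into overlapping, windowed frames
frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                   for i in range(n_frames)])

# Power spectrum of each frame: rows are time frames, columns are frequency bins
spectrogram = np.abs(np.fft.rfft(frames, axis=1)) ** 2
freqs = np.fft.rfftfreq(frame_len, d=1 / sample_rate)

print(spectrogram.shape)
peak_hz = freqs[spectrogram.mean(axis=0).argmax()]
print(peak_hz)                              # close to the 440 Hz tone
```

The resulting 2-D array can be treated like a single-channel image and passed to a CNN.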

## Resource

- https://machinelearningmastery.com/cross-entropy-for-machine-learning/

H(P) denotes the entropy of a probability distribution P, and H(P, Q) the cross entropy (CE) between P and Q

$H(P) = -\sum_{x \in X} p(x) \log p(x)$

$H(P, Q) = H(P) + KL(P \| Q)$

$H(P, Q) \neq H(Q, P)$

In [None]:
from math import log2

# calculate the kl divergence KL(P || Q)
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))

# calculate entropy H(P)
def entropy(p):
    return -sum([p[i] * log2(p[i]) for i in range(len(p))])

# calculate cross entropy H(P, Q)
def cross_entropy(p, q):
    return entropy(p) + kl_divergence(p, q)
 
# define data
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# calculate H(P)
en_p = entropy(p)
print('H(P): %.3f bits' % en_p)
# calculate kl divergence KL(P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f bits' % kl_pq)
# calculate cross entropy H(P, Q)
ce_pq = cross_entropy(p, q)
print('H(P, Q): %.3f bits' % ce_pq)
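
As a cross-check on the decomposition, here is a sketch that computes the cross entropy directly from its definition, $H(P, Q) = -\sum_x p(x) \log_2 q(x)$, for the same `p` and `q`, and also swaps the arguments to show the asymmetry. The function name `cross_entropy_direct` is an assumption for this example.

```python
from math import log2

def cross_entropy_direct(p, q):
    """Cross entropy H(P, Q) in bits, straight from the definition."""
    return -sum(p_i * log2(q_i) for p_i, q_i in zip(p, q))

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

h_pq = cross_entropy_direct(p, q)
h_qp = cross_entropy_direct(q, p)

print('H(P, Q): %.3f bits' % h_pq)
print('H(Q, P): %.3f bits' % h_qp)
```

The first value should agree with `entropy(p) + kl_divergence(p, q)` from the cell above, and the second differs from it, as expected for a non-symmetric function.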

# Term Review

This was your last lecture of DS 2.2! Congrats!