
Commit

minor cleanup
bfortuner committed Apr 23, 2017
1 parent 9542b08 commit 4c25922
Showing 7 changed files with 46 additions and 65 deletions.
12 changes: 0 additions & 12 deletions docs/activation_functions.rst
@@ -6,18 +6,6 @@ Activation Functions

.. contents:: :local:

Introduction
============

Activation functions live inside neural network layers and modify the data they receive before passing it to the next layer. Activation functions give neural networks their power — allowing them to model complex non-linear relationships. By modifying inputs with non-linear functions neural networks can model highly complex relationships between features. Popular activation functions include :ref:`relu <activation_relu>` and :ref:`sigmoid <activation_sigmoid>`.

Activation functions typically have the following properties:

* **Non-linear** - In linear regression we’re limited to a prediction equation that looks like a straight line. This is nice for simple datasets with a one-to-one relationship between inputs and outputs, but what if the patterns in our dataset were non-linear? (e.g. :math:`x^2`, sin, log). To model these relationships we need a non-linear prediction equation.¹ Activation functions provide this non-linearity.

* **Continuously differentiable** — To improve our model with gradient descent, we need our output to have a nice slope so we can compute error derivatives with respect to weights. If our neuron instead outputted 0 or 1 (perceptron), we wouldn’t know in which direction to update our weights to reduce our error.
* **Fixed Range** — Activation functions typically squash the input data into a narrow range that makes training the model more stable and efficient.

ELU
===
12 changes: 6 additions & 6 deletions docs/index.rst
@@ -21,8 +21,8 @@ Brief visual explanations of machine learning concepts with diagrams, code examp

calculus
linear_algebra
probability (todo) <probability>
statistics (todo) <statistics>
Probability (empty) <probability>
Statistics (empty) <statistics>
math_notation


@@ -40,11 +40,11 @@ Brief visual explanations of machine learning concepts with diagrams, code examp

.. toctree::
:maxdepth: 1
:caption: Deep learning (TODO)
:caption: Deep learning

cnn
rnn
gan
CNNs (empty) <cnn>
RNNs (empty) <rnn>
GANs (empty) <gan>


.. toctree::
2 changes: 1 addition & 1 deletion docs/linear_regression.rst
@@ -58,7 +58,7 @@ Let’s say we are given a `dataset <http://www-bcf.usc.edu/~gareth/ISL/Advertis
Making predictions
------------------

Our prediction function outputs an estimate of sales given a company's radio advertising spend and our current values for ''Weight'' and ''Bias''.
Our prediction function outputs an estimate of sales given a company's radio advertising spend and our current values for *Weight* and *Bias*.

.. math::
6 changes: 0 additions & 6 deletions docs/logistic_regression.rst
@@ -27,12 +27,6 @@ Types of logistic regression
- Multi (Cats, Dogs, Sheep)
- Ordinal (Low, Medium, High)

Pros/cons
---------

- **Pros:** Easy to implement, fast to train, returns probability scores
- **Cons:** Bad when too many features or too many classifications



Binary logistic regression
58 changes: 20 additions & 38 deletions docs/loss_functions.rst
@@ -6,24 +6,19 @@ Loss Functions

.. contents:: :local:

Introduction
============

A loss function, or cost function, is a wrapper around our model's predict function that tells us "how good" the model is at making predictions for a given set of parameters. The loss function has its own curve and its own derivatives. The slope of this curve tells us how to change our parameters to make the model more accurate! We use the model to make predictions. We use the cost function to update our parameters. Our cost function can take a variety of forms as there are many different cost functions available. Popular loss functions include: :ref:`mse` and :ref:`Cross-entropy Loss <loss_cross_entropy>`.


.. _loss_cross_entropy:

Cross-Entropy Loss
==================

Cross-entropy loss, or Log Loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
Cross-Entropy
=============

The graph below shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.

.. image:: images/cross_entropy.png
:align: center

The graph above shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!

.. note::

Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.
@@ -38,67 +38,56 @@ The graph below shows the range of possible loss values given a true observation
else:
return -log(1 - prediction)
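
In code, binary cross-entropy reduces to taking the negative log of the probability the model assigned to the true class. A minimal, self-contained sketch (assuming ``prediction`` is the predicted probability of the positive class; not necessarily the exact helper used in this repository):

.. code-block:: python

    from math import log

    def binary_cross_entropy(prediction, y):
        # prediction: predicted probability that the label is 1 (0 < prediction < 1)
        # y: the true label, either 0 or 1
        if y == 1:
            return -log(prediction)
        return -log(1 - prediction)

    binary_cross_entropy(0.012, 1)  # ~4.42, confident and wrong, so the loss is large
    binary_cross_entropy(0.988, 1)  # ~0.01, confident and right, so the loss is small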

.. rubric:: Binary classification (M=2)
.. rubric:: Math

In binary classification, where the number of classes :math:`M` equals 2, cross-entropy can be calculated as:

.. math::
-{(y\log(p) + (1 - y)\log(1 - p))}
.. note::

- N - number of observations
- M - number of possible class labels (dog, cat, fish)
- log - the natural logarithm
- y - a binary indicator (0 or 1) of whether class label :math:`c` is the correct classification for observation :math:`o`
- p - the model's predicted probability that observation :math:`o` is of class :math:`c`


.. rubric:: Multi-class cross-entropy

In multi-class classification (M>2), we take the sum of loss values for each class prediction in the observation.
If :math:`M > 2` (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.

.. math::
-\sum_{c=1}^My_{o,c}\log(p_{o,c})
.. note::

Why the Negative Sign?

Cross-entropy takes the negative log to provide an easy metric for comparison. It takes this approach because the positive log of numbers < 1 returns negative values, which is confusing to work with when comparing the performance of two models.

.. image:: images/log_vs_neglog.gif
:align: center
- M - number of classes (dog, cat, fish)
- log - the natural log
- y - binary indicator (0 or 1) if class label :math:`c` is the correct classification for observation :math:`o`
- p - predicted probability observation :math:`o` is of class :math:`c`
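
For example, with predicted probabilities arranged one row per observation and the true labels one-hot encoded, the multi-class formula can be computed in a few lines of NumPy. This is an illustrative sketch under those assumptions (the per-observation losses are averaged), not the repository's implementation:

.. code-block:: python

    import numpy as np

    def cross_entropy(p, y, eps=1e-12):
        # p: predicted probabilities, one row per observation, one column per class
        # y: true labels, one-hot encoded, same shape as p
        p = np.clip(p, eps, 1.0 - eps)               # avoid log(0)
        return -np.sum(y * np.log(p), axis=1).mean()

    # Two observations, three classes (dog, cat, fish)
    p = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
    y = np.array([[1, 0, 0],
                  [0, 1, 0]])
    cross_entropy(p, y)  # ~0.29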


.. _hinge_loss:

Hinge Loss
==========
Hinge
=====

Be the first to contribute!


.. _kl_divergence:

Kullback-Leibler divergence
===========================
Kullback-Leibler
================

Be the first to contribute!


.. _l1_loss:

L1 Loss
L1
=======

Be the first to contribute!


.. _l2_loss:

L2 Loss
=======
L2
==

Be the first to contribute!

@@ -122,8 +106,6 @@ Description of MSE...
:language: python
:pyobject: MSE

**Derivative**

.. literalinclude:: ../code/loss_functions.py
:language: python
:pyobject: MSE_prime
19 changes: 17 additions & 2 deletions docs/nn_concepts.rst
@@ -97,13 +97,28 @@ Notice, it’s exactly the same equation we use with linear regression! In fact,
Activation Functions
====================

:doc:`activation_functions` live inside neurons and modify the data they receive before passing it to the next layer. Activation functions give neural networks their power — allowing them to model complex non-linear relationships. By modifying inputs with non-linear functions neural networks can model highly complex relationships between features.
Activation functions live inside neural network layers and modify the data they receive before passing it to the next layer. They are what give neural networks their power: by applying non-linear transformations to their inputs, networks can model highly complex relationships between features. Popular activation functions include :ref:`relu <activation_relu>` and :ref:`sigmoid <activation_sigmoid>`.

Activation functions typically have the following properties:

* **Non-linear** - In linear regression we’re limited to a prediction equation that looks like a straight line. This is nice for simple datasets with a one-to-one relationship between inputs and outputs, but what if the patterns in our dataset were non-linear? (e.g. :math:`x^2`, sin, log). To model these relationships we need a non-linear prediction equation.¹ Activation functions provide this non-linearity.

* **Continuously differentiable** — To improve our model with gradient descent, we need our output to have a nice slope so we can compute error derivatives with respect to weights. If our neuron instead outputted 0 or 1 (perceptron), we wouldn’t know in which direction to update our weights to reduce our error.
* **Fixed Range** — Activation functions typically squash the input data into a narrow range that makes training the model more stable and efficient.
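
For illustration, ReLU and sigmoid, along with the derivatives that gradient descent needs, can each be written in a couple of lines of NumPy. This is a minimal sketch rather than the code used elsewhere in these docs:

.. code-block:: python

    import numpy as np

    def relu(z):
        # Clamps negative inputs to 0 and passes positive inputs through unchanged
        return np.maximum(0, z)

    def relu_prime(z):
        # Gradient is 1 where the input was positive, 0 elsewhere
        return (z > 0).astype(float)

    def sigmoid(z):
        # Squashes any real-valued input into the fixed range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1 - s)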

Loss Functions
==============

:doc:`loss_functions` measure "how good" a model is at making predictions for a given set of parameters.
A loss function, or cost function, is a wrapper around our model's predict function that tells us "how good" the model is at making predictions for a given set of parameters. The loss function has its own curve and its own derivatives, and the slope of this curve tells us how to change our parameters to make the model more accurate: we use the model to make predictions and the cost function to update our parameters. Cost functions come in many forms; popular loss functions include :ref:`mse` and :ref:`Cross-entropy Loss <loss_cross_entropy>`.
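
As a concrete illustration (a hypothetical sketch, not this repository's implementation), a mean squared error cost can wrap a simple predict function, and its derivatives tell us which direction to move each parameter:

.. code-block:: python

    import numpy as np

    def predict(x, weight, bias):
        return weight * x + bias

    def cost(weight, bias, x, y):
        # For fixed data, the cost is a function of the parameters:
        # changing weight and bias moves us along the cost curve.
        error = predict(x, weight, bias) - y
        return (error ** 2).mean()

    def cost_gradient(weight, bias, x, y):
        # Slope of the cost curve with respect to each parameter
        error = predict(x, weight, bias) - y
        return 2 * (error * x).mean(), 2 * error.mean()

    # One gradient descent step: move each parameter against its slope
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    weight, bias, lr = 0.0, 0.0, 0.1
    d_weight, d_bias = cost_gradient(weight, bias, x, y)
    weight, bias = weight - lr * d_weight, bias - lr * d_bias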


Optimization Algorithms
=======================

Be the first to contribute!



.. rubric:: References
2 changes: 2 additions & 0 deletions docs/optimizers.rst
@@ -4,6 +4,8 @@
Optimizers
==========

.. contents:: :local:


Adadelta
========
