
Commit

minor cleanup
bfortuner committed Apr 23, 2017
1 parent 9542b08 commit 4c25922
Showing 7 changed files with 46 additions and 65 deletions.
12 changes: 0 additions & 12 deletions docs/activation_functions.rst
@@ -6,18 +6,6 @@ Activation Functions

.. contents:: :local:

Introduction
============

Activation functions live inside neural network layers and modify the data they receive before passing it to the next layer. Activation functions give neural networks their power — allowing them to model complex non-linear relationships. By modifying inputs with non-linear functions neural networks can model highly complex relationships between features. Popular activation functions include :ref:`relu <activation_relu>` and :ref:`sigmoid <activation_sigmoid>`.

Activation functions typically have the following properties:

* **Non-linear** - In linear regression we’re limited to a prediction equation that looks like a straight line. This is nice for simple datasets with a one-to-one relationship between inputs and outputs, but what if the patterns in our dataset were non-linear? (e.g. :math:`x^2`, sin, log). To model these relationships we need a non-linear prediction equation.¹ Activation functions provide this non-linearity.

* **Continuously differentiable** — To improve our model with gradient descent, we need our output to have a nice slope so we can compute error derivatives with respect to weights. If our neuron instead outputted 0 or 1 (perceptron), we wouldn’t know in which direction to update our weights to reduce our error.
* **Fixed Range** — Activation functions typically squash the input data into a narrow range that makes training the model more stable and efficient.

ELU
===
12 changes: 6 additions & 6 deletions docs/index.rst
@@ -21,8 +21,8 @@ Brief visual explanations of machine learning concepts with diagrams, code examp

calculus
linear_algebra
probability (todo) <probability>
statistics (todo) <statistics>
Probability (empty) <probability>
Statistics (empty) <statistics>
math_notation


@@ -40,11 +40,11 @@ Brief visual explanations of machine learning concepts with diagrams, code examp

.. toctree::
:maxdepth: 1
:caption: Deep learning (TODO)
:caption: Deep learning

cnn
rnn
gan
CNNs (empty) <cnn>
RNNs (empty) <rnn>
GANs (empty) <gan>


.. toctree::
2 changes: 1 addition & 1 deletion docs/linear_regression.rst
@@ -58,7 +58,7 @@ Let’s say we are given a `dataset <http://www-bcf.usc.edu/~gareth/ISL/Advertis
Making predictions
------------------

Our prediction function outputs an estimate of sales given a company's radio advertising spend and our current values for ''Weight'' and ''Bias''.
Our prediction function outputs an estimate of sales given a company's radio advertising spend and our current values for *Weight* and *Bias*.

.. math::
6 changes: 0 additions & 6 deletions docs/logistic_regression.rst
@@ -27,12 +27,6 @@ Types of logistic regression
- Multi (Cats, Dogs, Sheep)
- Ordinal (Low, Medium, High)

Pros/cons
---------

- **Pros:** Easy to implement, fast to train, returns probability scores
- **Cons:** Bad when too many features or too many classifications



Binary logistic regression
58 changes: 20 additions & 38 deletions docs/loss_functions.rst
@@ -6,24 +6,19 @@ Loss Functions

.. contents:: :local:

Introduction
============

A loss function, or cost function, is a wrapper around our model's predict function that tells us "how good" the model is at making predictions for a given set of parameters. The loss function has its own curve and its own derivatives. The slope of this curve tells us how to change our parameters to make the model more accurate! We use the model to make predictions. We use the cost function to update our parameters. Our cost function can take a variety of forms as there are many different cost functions available. Popular loss functions include: :ref:`mse` and :ref:`Cross-entropy Loss <loss_cross_entropy>`.


.. _loss_cross_entropy:

Cross-Entropy Loss
==================

Cross-entropy loss, or Log Loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
Cross-Entropy
=============

The graph below shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.

.. image:: images/cross_entropy.png
:align: center

The graph above shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!

.. note::

Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.
@@ -38,67 +38,56 @@ The graph below shows the range of possible loss values given a true observation
else:
return -log(1 - prediction)
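
In code, binary cross-entropy reduces to taking the negative log of the probability the model assigned to the true class. A minimal, self-contained sketch (assuming ``prediction`` is the predicted probability of the positive class; not necessarily the exact helper used in this repository):

.. code-block:: python

    from math import log

    def binary_cross_entropy(prediction, y):
        # prediction: predicted probability that the label is 1 (0 < prediction < 1)
        # y: the true label, either 0 or 1
        if y == 1:
            return -log(prediction)
        return -log(1 - prediction)

    binary_cross_entropy(0.012, 1)  # ~4.42, confident and wrong, so the loss is large
    binary_cross_entropy(0.988, 1)  # ~0.01, confident and right, so the loss is small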

.. rubric:: Binary classification (M=2)
.. rubric:: Math

In binary classification, where the number of classes :math:`M` equals 2, cross-entropy can be calculated as:

.. math::
-{(y\log(p) + (1 - y)\log(1 - p))}
.. note::

- N - number of observations
- M - number of possible class labels (dog, cat, fish)
- log - the natural logarithm
- y - a binary indicator (0 or 1) of whether class label :math:`c` is the correct classification for observation :math:`o`
- p - the model's predicted probability that observation :math:`o` is of class :math:`c`


.. rubric:: Multi-class cross-entropy

In multi-class classification (M>2), we take the sum of loss values for each class prediction in the observation.
If :math:`M > 2` (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.

.. math::
-\sum_{c=1}^My_{o,c}\log(p_{o,c})
.. note::

Why the Negative Sign?

Cross-entropy takes the negative log to provide an easy metric for comparison. It takes this approach because the positive log of numbers < 1 returns negative values, which is confusing to work with when comparing the performance of two models.

.. image:: images/log_vs_neglog.gif
:align: center
- M - number of classes (dog, cat, fish)
- log - the natural log
- y - binary indicator (0 or 1) if class label :math:`c` is the correct classification for observation :math:`o`
- p - predicted probability observation :math:`o` is of class :math:`c`
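
For example, with predicted probabilities arranged one row per observation and the true labels one-hot encoded, the multi-class formula can be computed in a few lines of NumPy. This is an illustrative sketch under those assumptions (the per-observation losses are averaged), not the repository's implementation:

.. code-block:: python

    import numpy as np

    def cross_entropy(p, y, eps=1e-12):
        # p: predicted probabilities, one row per observation, one column per class
        # y: true labels, one-hot encoded, same shape as p
        p = np.clip(p, eps, 1.0 - eps)               # avoid log(0)
        return -np.sum(y * np.log(p), axis=1).mean()

    # Two observations, three classes (dog, cat, fish)
    p = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
    y = np.array([[1, 0, 0],
                  [0, 1, 0]])
    cross_entropy(p, y)  # ~0.29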


.. _hinge_loss:

Hinge Loss
==========
Hinge
=====

Be the first to contribute!


.. _kl_divergence:

Kullback-Leibler divergence
===========================
Kullback-Leibler
================

Be the first to contribute!


.. _l1_loss:

L1 Loss
L1
=======

Be the first to contribute!


.. _l2_loss:

L2 Loss
=======
L2
==

Be the first to contribute!

@@ -122,8 +106,6 @@ Description of MSE...
:language: python
:pyobject: MSE

**Derivative**

.. literalinclude:: ../code/loss_functions.py
:language: python
:pyobject: MSE_prime
19 changes: 17 additions & 2 deletions docs/nn_concepts.rst
@@ -97,13 +97,28 @@ Notice, it’s exactly the same equation we use with linear regression! In fact,
Activation Functions
====================

:doc:`activation_functions` live inside neurons and modify the data they receive before passing it to the next layer. Activation functions give neural networks their power — allowing them to model complex non-linear relationships. By modifying inputs with non-linear functions neural networks can model highly complex relationships between features.
Activation functions live inside neural network layers and modify the data they receive before passing it to the next layer. They are what give neural networks their power: by applying non-linear transformations to their inputs, networks can model highly complex relationships between features. Popular activation functions include :ref:`relu <activation_relu>` and :ref:`sigmoid <activation_sigmoid>`.

Activation functions typically have the following properties:

* **Non-linear** - In linear regression we’re limited to a prediction equation that looks like a straight line. This is nice for simple datasets with a one-to-one relationship between inputs and outputs, but what if the patterns in our dataset were non-linear? (e.g. :math:`x^2`, sin, log). To model these relationships we need a non-linear prediction equation.¹ Activation functions provide this non-linearity.

* **Continuously differentiable** — To improve our model with gradient descent, we need our output to have a nice slope so we can compute error derivatives with respect to weights. If our neuron instead outputted 0 or 1 (perceptron), we wouldn’t know in which direction to update our weights to reduce our error.
* **Fixed Range** — Activation functions typically squash the input data into a narrow range that makes training the model more stable and efficient.
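
For illustration, ReLU and sigmoid, along with the derivatives that gradient descent needs, can each be written in a couple of lines of NumPy. This is a minimal sketch rather than the code used elsewhere in these docs:

.. code-block:: python

    import numpy as np

    def relu(z):
        # Clamps negative inputs to 0 and passes positive inputs through unchanged
        return np.maximum(0, z)

    def relu_prime(z):
        # Gradient is 1 where the input was positive, 0 elsewhere
        return (z > 0).astype(float)

    def sigmoid(z):
        # Squashes any real-valued input into the fixed range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1 - s)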

Loss Functions
==============

:doc:`loss_functions` measure "how good" a model is at making predictions for a given set of parameters.
A loss function, or cost function, is a wrapper around our model's predict function that tells us "how good" the model is at making predictions for a given set of parameters. The loss function has its own curve and its own derivatives, and the slope of this curve tells us how to change our parameters to make the model more accurate: we use the model to make predictions and the cost function to update our parameters. Cost functions come in many forms; popular loss functions include :ref:`mse` and :ref:`Cross-entropy Loss <loss_cross_entropy>`.
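
As a concrete illustration (a hypothetical sketch, not this repository's implementation), a mean squared error cost can wrap a simple predict function, and its derivatives tell us which direction to move each parameter:

.. code-block:: python

    import numpy as np

    def predict(x, weight, bias):
        return weight * x + bias

    def cost(weight, bias, x, y):
        # For fixed data, the cost is a function of the parameters:
        # changing weight and bias moves us along the cost curve.
        error = predict(x, weight, bias) - y
        return (error ** 2).mean()

    def cost_gradient(weight, bias, x, y):
        # Slope of the cost curve with respect to each parameter
        error = predict(x, weight, bias) - y
        return 2 * (error * x).mean(), 2 * error.mean()

    # One gradient descent step: move each parameter against its slope
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    weight, bias, lr = 0.0, 0.0, 0.1
    d_weight, d_bias = cost_gradient(weight, bias, x, y)
    weight, bias = weight - lr * d_weight, bias - lr * d_bias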


Optimization Algorithms
=======================

Be the first to contribute!



.. rubric:: References
2 changes: 2 additions & 0 deletions docs/optimizers.rst
@@ -4,6 +4,8 @@
Optimizers
==========

.. contents:: :local:


Adadelta
========
