Commit

html update
liaoruowang committed Apr 28, 2017
1 parent cdaa85d commit 181ebbb
Showing 19 changed files with 231 additions and 12 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -1,4 +1,4 @@
- TEMPDIR := $(shell mktemp -d -t tmp)
+ TEMPDIR := $(shell mktemp -d -t tmp.XXX)

publish:
echo 'hmmm'
@@ -8,5 +8,5 @@ publish:
git init && \
git add . && \
git commit -m 'publish site' && \
- git remote add origin git@github.com:kuleshov/cs228-notes.git && \
+ git remote add origin https://github.com/ermongroup/cs228-notes.git && \
git push origin master:refs/heads/gh-pages --force
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -8,5 +8,5 @@ publish:
git init && \
git add . && \
git commit -m 'publish site' && \
- git remote add origin git@github.com:kuleshov/cs228-notes.git && \
+ git remote add origin git@github.com:ermongroup/cs228-notes.git && \
git push origin master:refs/heads/gh-pages --force
12 changes: 12 additions & 0 deletions docs/extras/vae/index.html
@@ -286,6 +286,18 @@ <h3 id="experimental-results">Experimental results</h3>

<p>The authors also compare their methods against three alternative approaches: the wake-sleep algorithm, Monte-Carlo EM, and hybrid Monte-Carlo. The latter two methods are sampling-based approaches; they are quite accurate, but don’t scale well to large datasets. Wake-sleep is a variational inference algorithm that scales much better; however, it does not use the exact gradient of the ELBO (it uses an approximation), and hence it is not as accurate as AEVB. The paper illustrates this by plotting learning curves.</p>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../../learning/structLearn">Previous</a></td>
<td><a href="">Next</a></td>
</tr>
</tbody>
</table>



</article>
4 changes: 2 additions & 2 deletions docs/index.html
@@ -83,13 +83,13 @@ <h1>Contents</h1>
Although we have written up most of the material, you will probably find several typos. If you do, please let us know, or submit a pull request with your fixes to our <a href="https://github.com/ermongroup/cs228-notes">Github repository</a>. </span>
You too may help make these notes better by submitting your improvements to us via <a href="https://github.com/ermongroup/cs228-notes">Github</a>.</p>

- <p>This course starts by introducing probabilistic graphical models from the very basics and concludes by explaining from first principles the <a href="">variational auto-encoder</a>, an important probabilistic model that is also one of the most influential recent results in deep learning.</p>
+ <p>This course starts by introducing probabilistic graphical models from the very basics and concludes by explaining from first principles the <a href="extras/vae">variational auto-encoder</a>, an important probabilistic model that is also one of the most influential recent results in deep learning.</p>

<h2 id="preliminaries">Preliminaries</h2>

<ol>
<li>
- <p><a href="preliminaries/introduction/">Introduction</a> What is probabilistic graphical modeling? Overview of the course.</p>
+ <p><a href="preliminaries/introduction/">Introduction</a>: What is probabilistic graphical modeling? Overview of the course.</p>
</li>
<li>
<p><a href="preliminaries/probabilityreview">Review of probability theory</a>: Probability distributions. Conditional probability. Random variables (<em>under construction</em>).</p>
13 changes: 12 additions & 1 deletion docs/inference/jt/index.html
@@ -91,7 +91,7 @@ <h2 id="belief-propagation">Belief propagation</h2>

<h3 id="variable-elimination-as-message-passing">Variable elimination as message passing</h3>

- <p>First, consider what happens if we run the VE algorithm on a tree in order to compute a marginal <script type="math/tex">p(x_i)</script>. We can easily find an optimal ordering for this problem by rooting the tree at <script type="math/tex">x_i</script> and iterating through the nodes in post-order<label for="1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="1" class="margin-toggle" /><span class="sidenote">A postorder traversal of a rooted tree is one that starts from the leaves and goes up the tree such that a node if always visited after all of its children. The root is visited last. </span>.</p>
+ <p>First, consider what happens if we run the VE algorithm on a tree in order to compute a marginal <script type="math/tex">p(x_i)</script>. We can easily find an optimal ordering for this problem by rooting the tree at <script type="math/tex">x_i</script> and iterating through the nodes in post-order<label for="1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="1" class="margin-toggle" /><span class="sidenote">A postorder traversal of a rooted tree is one that starts from the leaves and goes up the tree such that a node is always visited after all of its children. The root is visited last. </span>.</p>

<p>This ordering is optimal because the largest clique that formed during VE will be of size 2. At each step, we will eliminate <script type="math/tex">x_j</script>; this will involve computing the factor <script type="math/tex">\tau_k(x_k) = \sum_{x_j} \phi(x_k, x_j) \tau_j(x_j)</script>, where <script type="math/tex">x_k</script> is the parent of <script type="math/tex">x_j</script> in the tree. At a later step, <script type="math/tex">x_k</script> will be eliminated, and <script type="math/tex">\tau_k(x_k)</script> will be passed up the tree to the parent <script type="math/tex">x_l</script> of <script type="math/tex">x_k</script> in order to be multiplied by the factor <script type="math/tex">\phi(x_l, x_k)</script> before being marginalized out. We can visualize this transfer of information using arrows on a tree.
<label for="mp1" class="margin-toggle"></label><input type="checkbox" id="mp1" class="margin-toggle" /><span class="marginnote"><img class="fullwidth" src="/cs228-notes/assets/img/mp1.png" /><br />Message passing order when using VE to compute <script type="math/tex">p(x_3)</script> on a small tree.</span></p>
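The elimination-as-message-passing view can be sketched in a few lines of Python. This is a hypothetical illustration (the chain, potential values, and variable names are ours, not from the notes): it computes the marginal p(x3) on a three-node chain x1 — x2 — x3 by eliminating in post-order and passing each factor τ up to its parent.

```python
import numpy as np

# Pairwise potentials on a three-node chain x1 - x2 - x3; each variable
# is binary and the entries are arbitrary illustrative numbers.
phi_12 = np.array([[1.0, 2.0], [3.0, 1.0]])    # phi(x1, x2)
phi_23 = np.array([[2.0, 1.0], [1.0, 4.0]])    # phi(x2, x3)

# Root the tree at x3 and eliminate in post-order: first x1, then x2.
tau_2 = phi_12.sum(axis=0)                     # message x1 -> x2: sum_x1 phi(x1, x2)
tau_3 = (tau_2[:, None] * phi_23).sum(axis=0)  # message x2 -> x3: sum_x2 phi(x2, x3) tau_2(x2)

p_x3 = tau_3 / tau_3.sum()                     # normalize to get the marginal p(x3)
print(p_x3)
```

A brute-force sum over all joint assignments yields the same marginal, which makes for an easy sanity check; note the largest intermediate object is always over two variables, matching the claim that the largest clique has size 2.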
@@ -288,6 +288,17 @@ <h3 id="properties">Properties</h3>

<p>We will return to this algorithm later in the course and try to explain it as a special case of <em>variational inference</em> algorithms.</p>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../ve">Previous</a></td>
<td><a href="../map">Next</a></td>
</tr>
</tbody>
</table>



11 changes: 11 additions & 0 deletions docs/inference/map/index.html
@@ -310,6 +310,17 @@ <h3 id="simulated-annealing">Simulated annealing</h3>

<p>The idea of simulated annealing is to run a sampling algorithm starting with a high <script type="math/tex">t</script> and to gradually decrease it as the algorithm runs. If the “cooling rate” is sufficiently slow, we are guaranteed to eventually find the mode of our distribution. In practice, however, choosing the rate requires a lot of tuning, which makes simulated annealing somewhat difficult to use.</p>
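As a rough sketch of the loop just described (the target distribution, local proposal, and cooling schedule below are made-up illustrations, not anything from the notes):

```python
import math, random

random.seed(0)

# Unnormalized target distribution over five states; the mode is state 2.
p = [1.0, 2.0, 10.0, 2.0, 1.0]

def anneal(steps=5000, t0=5.0, cooling=0.999):
    x, t = 0, t0
    for _ in range(steps):
        y = random.choice([max(x - 1, 0), min(x + 1, 4)])  # local proposal
        # Metropolis acceptance for p(x)^(1/t): a high temperature t
        # flattens the landscape, a low one concentrates mass on the mode.
        log_accept = min(0.0, (math.log(p[y]) - math.log(p[x])) / t)
        if random.random() < math.exp(log_accept):
            x = y
        t = max(t * cooling, 1e-3)  # gradually lower the temperature
    return x

print(anneal())  # with this slow schedule the chain settles on the mode, state 2
```

With a much faster cooling schedule the chain can freeze before reaching the mode, which is exactly the tuning difficulty mentioned above.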

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../jt">Previous</a></td>
<td><a href="../sampling">Next</a></td>
</tr>
</tbody>
</table>



12 changes: 12 additions & 0 deletions docs/inference/sampling/index.html
@@ -311,6 +311,18 @@ <h3 id="running-time-of-mcmc">Running time of MCMC</h3>

<p>In summary, even though MCMC is able to sample from the right distribution (which in turn can be used to solve any inference problem), doing so may sometimes require a very long time, and there is no easy way to judge the amount of computation that we need to spend to find a good solution.</p>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../map">Previous</a></td>
<td><a href="../variational">Next</a></td>
</tr>
</tbody>
</table>



</article>
12 changes: 12 additions & 0 deletions docs/inference/variational/index.html
@@ -263,6 +263,18 @@ <h2 id="mean-field-inference">Mean-field inference</h2>
-->

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../sampling">Previous</a></td>
<td><a href="../../learning/directed">Next</a></td>
</tr>
</tbody>
</table>



</article>
14 changes: 13 additions & 1 deletion docs/inference/ve/index.html
@@ -148,7 +148,7 @@ <h3 id="factor-operations">Factor Operations</h3>
<div class="mathblock"><script type="math/tex; mode=display">
\phi_3(x_c) = \phi_1(x_c^{(1)}) \times \phi_2(x_c^{(2)}).
</script></div>
- <p>The scope of <script type="math/tex">\phi_3</script> is defined as the union of the variables in the scopes of <script type="math/tex">\phi_1, \phi_2</script>; also <script type="math/tex">x_c^{(i)}</script> denotes an assignment to the variables in the scope of <script type="math/tex">\phi_i</script> defined by the restriction of <script type="math/tex">x_c</script> to that scope. For example, we define <script type="math/tex">\phi_3(a,b,c) := \phi_1(a,b) \times \phi(b,c)</script>.</p>
+ <p>The scope of <script type="math/tex">\phi_3</script> is defined as the union of the variables in the scopes of <script type="math/tex">\phi_1, \phi_2</script>; also <script type="math/tex">x_c^{(i)}</script> denotes an assignment to the variables in the scope of <script type="math/tex">\phi_i</script> defined by the restriction of <script type="math/tex">x_c</script> to that scope. For example, we define <script type="math/tex">\phi_3(a,b,c) := \phi_1(a,b) \times \phi_2(b,c)</script>.</p>
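A minimal Python sketch may make the factor product concrete. The dictionary representation of factors and the example potentials are our own illustrative choices, not anything defined in the notes:

```python
import itertools

def factor_product(phi1, scope1, phi2, scope2):
    """Product of two discrete factors stored as dicts from assignment
    tuples to values; all variables are assumed binary here."""
    scope = sorted(set(scope1) | set(scope2))
    out = {}
    for xs in itertools.product([0, 1], repeat=len(scope)):
        assign = dict(zip(scope, xs))
        v1 = phi1[tuple(assign[v] for v in scope1)]  # restrict x_c to scope of phi1
        v2 = phi2[tuple(assign[v] for v in scope2)]  # restrict x_c to scope of phi2
        out[xs] = v1 * v2
    return out, scope

# phi3(a, b, c) := phi1(a, b) * phi2(b, c), with arbitrary entries.
phi1 = {(a, b): a + 2 * b + 1 for a in (0, 1) for b in (0, 1)}
phi2 = {(b, c): b * c + 1 for b in (0, 1) for c in (0, 1)}
phi3, scope = factor_product(phi1, ["a", "b"], phi2, ["b", "c"])
print(scope, phi3[(1, 1, 1)])  # ['a', 'b', 'c'] 8, since phi1(1,1)=4 and phi2(1,1)=2
```

Marginalization can be implemented in the same representation, by summing the entries that agree on the variables being kept.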

<p>Next, the marginalization operation “locally” eliminates a set of variables from a factor. If we have a factor <script type="math/tex">\phi(X,Y)</script> over two sets of variables <script type="math/tex">X,Y</script>, marginalizing <script type="math/tex">Y</script> produces a new factor</p>
<div class="mathblock"><script type="math/tex; mode=display">
@@ -230,6 +230,18 @@ <h3 id="choosing-variable-elimination-orderings">Choosing variable elimination orderings</h3>

<p>In practice, these methods often result in reasonably good performance in many interesting settings.</p>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../../representation/undirected">Previous</a></td>
<td><a href="../jt">Next</a></td>
</tr>
</tbody>
</table>



</article>
11 changes: 11 additions & 0 deletions docs/learning/bayesianlearning/index.html
@@ -138,6 +138,17 @@ <h2 id="conjugate-priors">Conjugate Priors</h2>

<figure><figcaption>Here the exponents $$(3,2)$$ and $$(30,20)$$ can both be used to encode the belief that $$\theta$$ is $$0.6$$. But the second set of exponents implies a stronger belief, as it is based on a larger sample.</figcaption><img src="/cs228-notes/assets/img/beta.png" /></figure>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../latent">Previous</a></td>
<td><a href="../structLearn">Next</a></td>
</tr>
</tbody>
</table>



12 changes: 12 additions & 0 deletions docs/learning/directed/index.html
@@ -231,6 +231,18 @@ <h2 id="maximum-likelihood-learning-in-bayesian-networks">Maximum likelihood learning in Bayesian networks</h2>

<p>We thus conclude that in Bayesian networks with discrete variables, the maximum-likelihood estimate has a closed-form solution. Even when the variables are not discrete, the task is equally simple: the log-factors are linearly separable, hence the log-likelihood reduces to estimating each of them separately. The simplicity of learning is one of the most convenient features of Bayesian networks.</p>
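For a small network the closed-form solution is literally just counting. A hypothetical sketch for a two-node network x → y (the data and network are our own illustration):

```python
from collections import Counter

# Hypothetical binary dataset of (x, y) samples for a two-node network x -> y.
data = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
n = len(data)

count_x = Counter(x for x, _ in data)
count_xy = Counter(data)

theta_x = {x: count_x[x] / n for x in count_x}            # theta_x(x) = #{x} / n
theta_y_given_x = {(x, y): count_xy[(x, y)] / count_x[x]  # theta(y|x) = #{x, y} / #{x}
                   for (x, y) in count_xy}

print(theta_x[0], theta_y_given_x[(0, 1)])  # 0.5 0.5
```

Each CPT entry depends only on its own counts, which is the decomposition of the log-likelihood described above.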

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../../inference/variational">Previous</a></td>
<td><a href="../undirected">Next</a></td>
</tr>
</tbody>
</table>



</article>
12 changes: 12 additions & 0 deletions docs/learning/latent/index.html
@@ -273,6 +273,18 @@ <h3 id="properties-of-em">Properties of EM</h3>

<p>In summary, the EM algorithm is a very popular technique for optimizing latent variable models that is also often very effective. Its main downside is its susceptibility to local minima.</p>
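A compact illustration of the two EM steps on a toy mixture of two unit-variance Gaussians; the data, initialization, and fixed equal mixing weights are our own assumptions, not from the notes:

```python
import math

# Toy EM for a mixture of two unit-variance Gaussians with equal, fixed
# mixing weights; only the two means are learned.
data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
mu = [-1.0, 1.0]

def density(x, m):
    """Unit-variance Gaussian density N(x; m, 1)."""
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

for _ in range(50):
    # E-step: posterior responsibility of component 0 for each data point.
    r = [density(x, mu[0]) / (density(x, mu[0]) + density(x, mu[1])) for x in data]
    # M-step: update each mean as a responsibility-weighted average.
    mu[0] = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu[1] = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)

print([round(m, 2) for m in mu])  # converges to roughly [-2.0, 2.0]
```

Starting instead from mu = [1.0, -1.0] converges to the label-swapped solution, a simple instance of the sensitivity to initialization discussed above.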

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../undirected">Previous</a></td>
<td><a href="../bayesianlearning">Next</a></td>
</tr>
</tbody>
</table>



</article>
12 changes: 12 additions & 0 deletions docs/learning/structLearn/index.html
@@ -144,6 +144,18 @@ <h3 id="recent-advances">Recent Advances</h3>

<p>The ILP approach encodes the graph structure, the scoring function, and the acyclicity constraints into an integer linear programming problem, and can therefore use state-of-the-art integer programming solvers. But this approach requires a bound on the maximum number of parents a node may have (say, 4 or 5); otherwise, the number of constraints in the ILP explodes and the computation becomes intractable.</p>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../bayesianlearning">Previous</a></td>
<td><a href="../../extras/vae">Next</a></td>
</tr>
</tbody>
</table>



</article>
12 changes: 12 additions & 0 deletions docs/learning/undirected/index.html
@@ -266,6 +266,18 @@ <h2 id="learning-in-conditional-random-fields">Learning in conditional random fields</h2>

<p>Finally, we would like to add that there exists another popular objective for training CRFs called the max-margin loss, a generalization of the objective for training SVMs. Models trained using this loss are called <em>structured support vector machines</em> or <em>max-margin networks</em>. This loss is more widely used in practice because it often leads to better generalization, and because computing its gradient requires only MAP inference rather than general (e.g. marginal) inference, which is often more expensive to perform.</p>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../directed">Previous</a></td>
<td><a href="../latent">Next</a></td>
</tr>
</tbody>
</table>



</article>
12 changes: 12 additions & 0 deletions docs/preliminaries/applications/index.html
@@ -119,6 +119,18 @@ <h1 id="poverty-mapping">Poverty Mapping</h1>
<h1 id="error-correcting-codes">Error Correcting Codes</h1>
<p><img src="Picture1.png" alt="codes" /></p>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../probabilityreview/">Previous</a></td>
<td><a href="../../representation/directed/">Next</a></td>
</tr>
</tbody>
</table>



</article>
12 changes: 12 additions & 0 deletions docs/preliminaries/introduction/index.html
@@ -184,6 +184,18 @@ <h3 id="learning">Learning</h3>

<p>Our last key task is fitting a model to a dataset, which could be, for example, a large number of labeled examples of spam. By looking at the data, we can infer useful patterns (e.g. which words are found more frequently in spam emails), which we can then use to make predictions about the future. However, we will see that learning and inference are also inherently linked in a more subtle way, since inference will turn out to be a key subroutine that we will repeatedly call within learning algorithms. Also, the topic of learning will feature important connections to the field of computational learning theory — which deals with questions such as generalization from limited data and overfitting — as well as to Bayesian statistics — which tells us (among other things) about how to combine prior knowledge and observed evidence in a principled way.</p>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../../">Previous</a></td>
<td><a href="../probabilityreview">Next</a></td>
</tr>
</tbody>
</table>



</article>
53 changes: 48 additions & 5 deletions docs/preliminaries/probabilityreview/index.html
@@ -108,12 +108,36 @@ <h3 id="properties"><strong>Properties</strong>:</h3>
<li><strong>Law of Total Probability</strong> If <script type="math/tex">A_1, . . . , A_k</script> are a set of disjoint events such that <script type="math/tex">\bigcup^k_{i=1} A_i = \Omega</script>, then <script type="math/tex">\sum^k_{i=1} P(A_i) = 1.</script></li>
</ul>

- <h2 id="11-conditional-probability-and-independence">1.1 Conditional probability and independence</h2>
+ <h2 id="11-conditional-probability">1.1 Conditional probability</h2>

<p>Let B be an event with non-zero probability. The conditional probability of any event A given B is defined as
<script type="math/tex">P(A \mid B) = \frac {P(A \cap B)}{P(B)}</script>
In other words, <script type="math/tex">P(A \mid B)</script> is the probability measure of the event A after observing the occurrence of
- event B. Two events are called independent if and only if <script type="math/tex">P(A \cap B) = P(A)P(B)</script> (or equivalently,
+ event B.</p>

<h2 id="12-chain-rule">1.2 Chain Rule</h2>

<p>Let <script type="math/tex">S_1, \cdots, S_k</script> be events with <script type="math/tex">P(S_i) > 0</script>. Then</p>

<p>\begin{equation}
P(S_1 \cap S_2 \cap \cdots \cap S_k) = P(S_1) P(S_2 | S_1) P(S_3 | S_2 \cap S_1 ) \cdots P(S_k | S_1 \cap S_2 \cap \cdots \cap S_{k-1})
\end{equation}</p>

<p>Note that for <script type="math/tex">k=2</script> events, this is just the definition of conditional probability</p>

<p>\begin{equation}
P(S_1 \cap S_2) = P(S_1) P(S_2 | S_1)
\end{equation}</p>

<p>In general, it is derived by applying the definition of conditional probability multiple times, as in the following example:</p>

<p>\begin{equation}
P(S_1 \cap S_2 \cap S_3 \cap S_4) = P(S_1 \cap S_2 \cap S_3) P(S_4 \mid S_1 \cap S_2 \cap S_3) = P(S_1 \cap S_2) P(S_3 \mid S_1 \cap S_2) P(S_4 \mid S_1 \cap S_2 \cap S_3) = P(S_1) P(S_2 \mid S_1) P(S_3 \mid S_1 \cap S_2) P(S_4 \mid S_1 \cap S_2 \cap S_3)
\end{equation}</p>
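The chain rule is also easy to check numerically. In the hypothetical Python snippet below (the joint distribution and names are our own illustration), the two sides agree on a randomly generated strictly positive joint distribution over three events:

```python
import itertools, random

random.seed(1)

# A made-up strictly positive joint distribution over three binary
# indicator variables (S_i occurred or not), to check the chain rule.
states = list(itertools.product([0, 1], repeat=3))
weights = [random.random() + 0.1 for _ in states]
P = {s: w / sum(weights) for s, w in zip(states, weights)}

def prob(fixed):
    """Probability of a partial assignment given as {index: value}."""
    return sum(v for s, v in P.items() if all(s[i] == b for i, b in fixed.items()))

lhs = P[(1, 0, 1)]  # P(S1 occurs, S2 does not, S3 occurs)
rhs = (prob({0: 1})                                      # P(S1)
       * prob({0: 1, 1: 0}) / prob({0: 1})               # P(S2^c | S1)
       * prob({0: 1, 1: 0, 2: 1}) / prob({0: 1, 1: 0}))  # P(S3 | S1, S2^c)
print(abs(lhs - rhs) < 1e-12)  # True
```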

<h2 id="13-independence">1.3 Independence</h2>

<p>Two events are called independent if and only if <script type="math/tex">P(A \cap B) = P(A)P(B)</script> (or equivalently,
<script type="math/tex">P(A \mid B) = P(A)</script>). Therefore, independence is equivalent to saying that observing B does not have
any effect on the probability of A.</p>

@@ -388,7 +412,15 @@ <h2 id="34-conditional-distributions">3.4 Conditional distributions</h2>
\end{equation}
provided <script type="math/tex">f_X(x) \neq 0</script>.</p>

- <h2 id="35-bayess-rule">3.5 Bayes’s rule</h2>
+ <h2 id="35-chain-rule">3.5 Chain rule</h2>

<p>The chain rule we derived earlier for events can be applied to random variables as follows:</p>

<p>\begin{equation}
p_{X_1, \cdots X_n} (x_1, \cdots, x_n) = p_{X_1} (x_1) p_{X_2 \mid X_1} (x_2 \mid x_1) \cdots p_{X_n \mid X_1, \cdots, X_{n-1}} (x_n \mid x_1, \cdots, x_{n-1})
\end{equation}</p>

<h2 id="36-bayess-rule">3.6 Bayes’s rule</h2>

<p>A useful formula that often arises when trying to derive an expression for the conditional probability of one variable given another is <strong>Bayes’s rule</strong>.</p>
@@ -403,7 +435,7 @@ <h2 id="35-bayess-rule">3.5 Bayes’s rule</h2>
f_{Y \mid X}(y\mid x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{X \mid Y} (x \mid y) f_Y(y)}{\int^{\infty}_{- \infty} f_{X\mid Y} (x \mid y') f_Y (y') dy'}
</script></div>
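For discrete random variables the denominator is just a finite sum, as in this small worked example (all numbers are made up for illustration; Y plays the role of a rare condition and X a noisy positive test):

```python
# P(Y=1 | X=1) = P(X=1 | Y=1) P(Y=1) / sum_y P(X=1 | Y=y) P(Y=y)
p_y = {0: 0.99, 1: 0.01}           # prior P(Y = y)
p_x1_given_y = {0: 0.05, 1: 0.95}  # likelihood P(X = 1 | Y = y)

numerator = p_x1_given_y[1] * p_y[1]
denominator = sum(p_x1_given_y[y] * p_y[y] for y in (0, 1))
posterior = numerator / denominator
print(round(posterior, 3))  # 0.161: even after a positive test, Y=1 remains unlikely
```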

- <h2 id="36-independence">3.6 Independence</h2>
+ <h2 id="37-independence">3.7 Independence</h2>
<p>Two random variables X and Y are independent if <script type="math/tex">F_{XY} (x, y) = F_X(x)F_Y(y)</script> for all values of x and y. Equivalently,</p>
<ul>
<li>For discrete random variables, <script type="math/tex">p_{XY} (x, y) = p_X(x)p_Y(y)</script> for all <script type="math/tex">x \in Val(X)</script>, <script type="math/tex">y \in Val(Y)</script>.</li>
@@ -420,7 +452,7 @@ <h2 id="36-independence">3.6 Independence</h2>
\end{equation}
By using the above lemma one can prove that if <script type="math/tex">X</script> is independent of <script type="math/tex">Y</script> then any function of X is independent of any function of Y.</p>

- <h2 id="37-expectation-and-covariance">3.7 Expectation and covariance</h2>
+ <h2 id="38-expectation-and-covariance">3.8 Expectation and covariance</h2>

<p>Suppose that we have two discrete random variables <script type="math/tex">X, Y</script> and <script type="math/tex">g : {I\!R}^2 \rightarrow I\!R</script> is a function of these two random variables. Then the expected value of g is defined in the following way,
\begin{equation}
@@ -455,6 +487,17 @@ <h3 id="properties-7"><strong>Properties</strong>:</h3>
<li>If <script type="math/tex">X</script> and <script type="math/tex">Y</script> are independent, then <script type="math/tex">E[f(X)g(Y)] = E[f(X)]E[g(Y)]</script>.</li>
</ul>

<p><br /></p>

<table>
<tbody>
<tr>
<td><a href="../../">Index</a></td>
<td><a href="../introduction/">Previous</a></td>
<td><a href="../applications/">Next</a></td>
</tr>
</tbody>
</table>


