linear-classify.md: 4 changes (2 additions & 2 deletions)
@@ -177,7 +177,7 @@ In addition to the motivation we provided above there are many desirable properties

The most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. For example, suppose that we have some input vector \\(x = [1,1,1,1] \\) and two weight vectors \\(w\_1 = [1,0,0,0]\\), \\(w\_2 = [0.25,0.25,0.25,0.25] \\). Then \\(w\_1^Tx = w\_2^Tx = 1\\) so both weight vectors lead to the same dot product, but the L2 penalty of \\(w\_1\\) is 1.0 while the L2 penalty of \\(w\_2\\) is only 0.25. Therefore, according to the L2 penalty the weight vector \\(w\_2\\) would be preferred since it achieves a lower regularization loss. Intuitively, this is because the weights in \\(w\_2\\) are smaller and more diffuse. Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account in small amounts rather than relying heavily on a few input dimensions. As we will see later in the class, this effect can improve the generalization performance of the classifiers on test images and lead to less *overfitting*.
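
As a quick numerical check of the comparison above, here is a minimal numpy sketch; the names `x`, `w1`, and `w2` are purely illustrative and not part of the notes' code:

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

# Both weight vectors produce the same score for this input...
print(w1.dot(x), w2.dot(x))            # 1.0 1.0
# ...but the L2 penalty (sum of squared weights) strongly prefers w2.
print(np.sum(w1**2), np.sum(w2**2))    # 1.0 0.25
```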

-Note that biases do not have the same effect since unlike the weights, they do not control the strength of influence of an input dimension. Therefore, it is common to only regularize the weights \\(W\\) but not the biases \\(b\\). However, in practice this often turns out to have a negligeable effect. Lastly, note that due to the regularization penalty we can never achieve loss of exactly 0.0 on all examples, because this would only be possible in the pathological setting of \\(W = 0\\).
+Note that biases do not have the same effect since unlike the weights, they do not control the strength of influence of an input dimension. Therefore, it is common to only regularize the weights \\(W\\) but not the biases \\(b\\). However, in practice this often turns out to have a negligible effect. Lastly, note that due to the regularization penalty we can never achieve loss of exactly 0.0 on all examples, because this would only be possible in the pathological setting of \\(W = 0\\).

**Code**. Here is the loss function (without regularization) implemented in Python, in both unvectorized and half-vectorized form:

@@ -259,7 +259,7 @@
$$
L\_i = -\log\left(\frac{e^{f\_{y\_i}}}{ \sum\_j e^{f\_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L\_i = -f\_{y\_i} + \log\sum\_j e^{f\_j}
$$

-where we are using the notation \\(f\_j\\) to mean the j-th element of the vector of class scores \\(f\\). As before, the full loss for the dataset is the mean of \\(L\_i\\) over all training examples together with a regularization term \\(R(W)\\). The function \\(f\_j(z) = \frac{e^{z\_j}}{\sum\_k e^{z\_k}} \\) is called the **softmax function**: It takes a vector of arbitrary real-valued scores (in \\(z\\)) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you're seeing it for the first time but it is relatively easy to motivativate.
+where we are using the notation \\(f\_j\\) to mean the j-th element of the vector of class scores \\(f\\). As before, the full loss for the dataset is the mean of \\(L\_i\\) over all training examples together with a regularization term \\(R(W)\\). The function \\(f\_j(z) = \frac{e^{z\_j}}{\sum\_k e^{z\_k}} \\) is called the **softmax function**: It takes a vector of arbitrary real-valued scores (in \\(z\\)) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you're seeing it for the first time but it is relatively easy to motivate.
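
A minimal numpy sketch of computing \\(L\_i\\) for a single example, using the standard max-subtraction trick for numerical stability (the scores and variable names are illustrative, not taken from the notes' code):

```python
import numpy as np

f = np.array([3.2, 5.1, -1.7])     # unnormalized class scores for one example
y = 0                              # index of the correct class

f = f - np.max(f)                       # shift so the highest score is 0 (avoids overflow in exp)
p = np.exp(f) / np.sum(np.exp(f))       # softmax: values in (0, 1) that sum to one
L_i = -np.log(p[y])                     # cross-entropy loss; equals -f[y] + log(sum_j exp(f[j]))
print(L_i)                              # about 2.04
```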

**Information theory view**. The *cross-entropy* between a "true" distribution \\(p\\) and an estimated distribution \\(q\\) is defined as:
