
Add logistic regression tutorial #11651

Merged
merged 4 commits on Jul 25, 2018

Conversation

Ishitori
Contributor

Description

Adds a new tutorial on how to do logistic regression with the Gluon API (writing a minimal amount of custom code), plus some explanation of why it should be done that way.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • New tutorial on logistic regression

@Ishitori requested a review from szha as a code owner July 11, 2018 18:52
@Ishitori
Contributor Author

@thomelane @ThomasDelteil @safrooze or @indhub please take a look


The loss function is used to calculate how much the output of the network differs from the ground truth. In the case of logistic regression, the ground truth consists of class labels, which can be either 0 or 1. Because of that, we are using `SigmoidBinaryCrossEntropyLoss`, which suits this scenario well.

The Trainer object allows us to specify the method of training to be used. There are various methods available, and for our tutorial we use a widely accepted method, Stochastic Gradient Descent. We also need to parametrize it with learning rate value, which defines how fast training happens, and weight decay, which is used for regularization.
Contributor

"We also need to parametrize it with learning rate value, which defines how fast training happens" The learning rate of SGD defines the weight updates, not necessarily how fast training happens. I propose to reword it a bit to avoid confusion that large LR will lead to fast training.

Contributor Author

Good suggestion, fixed.
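
A minimal sketch of the setup described above, with placeholder hyperparameter values rather than the tutorial's exact ones:

```python
import mxnet as mx
from mxnet import gluon

# Binary cross-entropy loss; by default it applies the sigmoid internally,
# so the network can output raw scores.
loss_function = gluon.loss.SigmoidBinaryCrossEntropyLoss()

# Placeholder single-neuron network, just so the Trainer has parameters;
# a fuller network definition is sketched later.
net = gluon.nn.Dense(1)
net.initialize(mx.init.Xavier())

# SGD trainer: learning_rate sets the size of each weight update,
# wd adds L2 weight decay for regularization. Both values are illustrative.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'wd': 0.01})
```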

@@ -0,0 +1,215 @@

# Logistic regression using Gluon API explained
Contributor

This tutorial is very similar to this Straight Dope tutorial. Did you consider focusing purely on the confusion around Accuracy and pointing user to Straight Dope for further reading?

Contributor Author

There is a lot of similarity with the tutorial you have mentioned. My personal pet peeve with the tutorial in The Straight Dope is that it doesn't use library functions for sigmoid, loss and accuracy. I believe it creates an impression that you need to write a lot of boilerplate code instead of using what MXNet offers.

Also, there was confusion around the number of neurons on the forum and StackOverflow, which directly relates to the usage of Accuracy but needs a bit more context to cover.

mx.random.seed(12345) # Added for reproducibility
```

In this tutorial we will use a fake dataset, which contains 10 features drawn from a normal distribution with mean equal to 0 and standard deviation equal to 1, and a class label, which can be either 0 or 1. The length of the dataset is an arbitrary value. The function below helps us generate a dataset.
Contributor

The logistic regression tutorial in straight dope uses the Adult dataset. Did you consider using similar dataset (or perhaps simply pointing the user to that tutorial)?

Contributor Author

I didn't want to add extra code for data loading and data processing, because this is not the point of this tutorial. The optimal way would be if there were a binary classification dataset in MXNet itself, so it could be loaded in one line and no pre-processing would be required.
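
A minimal sketch of the kind of generator described above; the function name and the rule used to derive the class labels are assumptions for illustration, not necessarily what the tutorial does:

```python
import mxnet as mx

mx.random.seed(12345)  # added for reproducibility

def get_random_data(size, ctx=mx.cpu()):
    # 10 features per example, drawn from a normal distribution
    # with mean 0 and standard deviation 1
    x = mx.nd.random.normal(0, 1, shape=(size, 10), ctx=ctx)
    # Hypothetical labelling rule: class 1 when the sum of features is positive
    y = (x.sum(axis=1) > 0).astype('float32')
    return x, y

train_x, train_y = get_random_data(500)  # dataset sizes are arbitrary
val_x, val_y = get_random_data(100)
```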


## Working with data

To work with data, Apache MXNet provides Dataset and DataLoader classes. The former is used to provide an indexed access to the data, the latter is used to shuffle and batchify the data.
Contributor

Point user to links to API doc and tutorials for further reading and perhaps skip explanation of dataset/dataloader.

Contributor Author

Yes, added links and removed my explanation. Added a link to the Datasets and DataLoaders tutorial instead.
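
For reference, a minimal sketch of the Dataset and DataLoader wiring described above; the batch size is an arbitrary example value and the feature and label arrays are assumed to come from the previous sketch:

```python
from mxnet.gluon.data import ArrayDataset, DataLoader

batch_size = 10  # arbitrary example value

# ArrayDataset provides indexed access to (features, label) pairs;
# DataLoader shuffles them and groups them into batches.
train_dataset = ArrayDataset(train_x, train_y)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

val_dataset = ArrayDataset(val_x, val_y)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
```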


This separation is done because the source of a Dataset can vary from a simple array of numbers to complex data structures like text and images. DataLoader doesn't need to be aware of the source of data as long as Dataset provides a way to get the number of records and to load a record by index. As an outcome, Dataset doesn't need to hold all data in memory at once. Needless to say, one can implement their own versions of Dataset and DataLoader, but we are going to use the existing implementations.

Below we define 2 datasets: training dataset and validation dataset. It is a good practice to measure performance of a trained model on data that the network hasn't seen before. That is why we are going to use training set for training the model and validation set to calculate the model's accuracy.
Contributor

Recommend to keep it brief and mention "we define training and validation dataset". The difference between train/validation/test datasets is outside of this tutorial's scope.

Contributor Author

Agree. Removed this one as well.

Contributor

"training set" -> "training dataset"
"validation set" -> "validation dataset"


## Defining and training the model

In real application, model can be arbitrary complex. The only requirement for the logistic regression is that the last layer of the network must be a single neuron. Apache MXNet allows us to do so by using `Dense` layer and specifying the number of units to 1.
Contributor

Make references to functions a link to the API.

Contributor Author

Done
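
A minimal sketch of such a network; the hidden layer is purely illustrative, and the only hard requirement is the final `Dense` layer with a single unit:

```python
import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
# The hidden layer is illustrative; any architecture works as long as
# the last Dense layer has exactly one unit.
net.add(nn.Dense(units=10, activation='relu'))
net.add(nn.Dense(units=1))
net.initialize(mx.init.Xavier())
```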


## Tip 4: Convert probabilities to classes before calculating Accuracy

`Accuracy` metric requires 2 arguments: 1) a vector of ground-truth classes and 2) A tensor of predictions. When tensor of predictions is of the same shape as the vector of ground-truth classes, `Accuracy` class assumes that it should contain predicted classes. So, it converts the vector to `Int32` and compare each item of ground-truth classes to prediction vector.
Contributor

Why is one called a vector and the other called a tensor?

Contributor Author

Replaced it with "vector of matrix"


`Accuracy` metric requires 2 arguments: 1) a vector of ground-truth classes and 2) A tensor of predictions. When tensor of predictions is of the same shape as the vector of ground-truth classes, `Accuracy` class assumes that it should contain predicted classes. So, it converts the vector to `Int32` and compare each item of ground-truth classes to prediction vector.

Because of the behaviour above, you will get an unexpected result if you just pass the output of the `Sigmoid` function as is. The `Sigmoid` function produces output in the range [0; 1], and all numbers in that range are going to be cast to 0, even if a value is as high as 0.99. To avoid this we write a custom bit of code that:
Contributor

Instead of having a tip here, I recommend to write the rounding as a separate function in notebook and include a paragraph on top explaining what it's doing. This paragraph is really the only motivation for writing this tutorial.
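
A rough sketch of what such a separate rounding function might look like; the function name, the use of `ceil`, and the default threshold are assumptions:

```python
import mxnet as mx

def probabilities_to_classes(probabilities, threshold=0.5):
    # Subtract the decision threshold and take the ceiling:
    # values above the threshold become class 1, the rest become class 0.
    return mx.nd.ceil(probabilities - threshold)

probs = mx.nd.array([0.1, 0.49, 0.51, 0.99])
classes = probabilities_to_classes(probs)  # classes 0, 0, 1 and 1
```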


The same is not true if the output shape of your function is different from the shape of the ground-truth classes vector. For example, when doing multiclass regression with `Softmax` as an output, the shape of the output is going to be *number_of_examples* x *number_of_classes*. In that case we don't need to do the transformation above, because the `Accuracy` metric understands that the shapes are different and assumes that the prediction contains the probabilities of an example belonging to each of these classes - exactly what we want it to be.

This makes things a little bit easier, and that's why I have seen examples where `Softmax` is used as an output of prediction. If you want to do that, make sure to change the output layer size to 2 neurons, where each neuron will provide the value of an example belonging to class 0 and class 1 respectively.
Contributor

Wouldn't call it easier. Softmax allows the output layer to have more flexibility in learning because it contains twice the number of parameters to learn.
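
For comparison, a rough sketch of the two-neuron `Softmax` variant described above, where `Accuracy` needs no manual conversion; shapes and values are illustrative:

```python
import mxnet as mx
from mxnet import gluon

# Two output neurons, one score per class; SoftmaxCrossEntropyLoss would
# apply the softmax internally during training.
softmax_net = gluon.nn.Dense(2)
softmax_net.initialize(mx.init.Xavier())
softmax_loss = gluon.loss.SoftmaxCrossEntropyLoss()

accuracy = mx.metric.Accuracy()
data = mx.nd.random.normal(0, 1, shape=(4, 10))
labels = mx.nd.array([0, 1, 1, 0])
output = softmax_net(data)  # shape (4, 2): one score per class for each example
# The output shape differs from the label shape, so Accuracy takes the
# argmax over the class axis itself - no rounding is needed here.
accuracy.update(labels, output)
print(accuracy.get())
```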


## Conclusion

In this tutorial I explained some potential pitfalls to be aware about when doing logistic regression in Apache MXNet Gluon API. There might be some other challenging scenarios, which are not covered in this tutorial, like dealing with imbalanced classes, but I hope this tutorial will serve as a guidance and all other potential pitfalls would be covered in future tutorials.
Contributor

"be aware of"

Contributor Author

Fixed


## Conclusion

In this tutorial I explained some potential pitfalls to be aware about when doing logistic regression in Apache MXNet Gluon API. There might be some other challenging scenarios, which are not covered in this tutorial, like dealing with imbalanced classes, but I hope this tutorial will serve as a guidance and all other potential pitfalls would be covered in future tutorials.
Contributor

Suggest to keep the conclusion factual and not include reference to one of many challenges or "hoping" the tutorial is useful. Also this is not a blog (no posting time associated with it). Future tutorials have no meaning.

Contributor Author

Rewrote the conclusion by reiterating the main points of the tutorial.

@yifeim
Contributor

yifeim commented Jul 19, 2018

The tutorial looks awesome! A few comments:

  • Is it just me, or does everybody experience a training loss 8x larger than the validation loss? I understand it may be possible due to unstable weights from the backward update after every batch. But still, it is not my usual expectation. (Not to say that I have not seen similar issues before, and I am still slightly puzzled by the root cause.)

  • The model class is linear, but the proposed network is a 3-layer overkill (which may be related to the larger training loss). If you want a fun problem, maybe consider an XOR function class: https://medium.com/@jayeshbahire/the-xor-problem-in-neural-networks-50006411840b

  • Is this mxnet.gluon.loss.LogisticLoss equivalent to SigmoidBCELoss? Since you are using metrics, it may be worthwhile exploring mx.metric.F1.

  • There are some magic hyperparameters to be explained: a strong wd=0.01, an Xavier=2.34 initialization. As an elementary tutorial, I would try to simplify them unless they are part of the intended purposes.

@@ -59,9 +55,9 @@ val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)

## Defining and training the model

In real application, model can be arbitrary complex. The only requirement for the logistic regression is that the last layer of the network must be a single neuron. Apache MXNet allows us to do so by using `Dense` layer and specifying the number of units to 1.
The only requirement for the logistic regression is that the last layer of the network must be a single neuron. Apache MXNet allows us to do so by using [Dense](https://mxnet.incubator.apache.org/api/python/gluon/nn.html#mxnet.gluon.nn.Dense) layer and specifying the number of units to 1. The rest of the network can be arbitrary complex.
Contributor

"arbitrarily complex"

Contributor Author

Fixed

Usually, it is not enough to pass the training data through a network only once to achieve high Accuracy. It helps when the network sees each example multiple times. Each pass over every example in the training data is called an `epoch`. How many epochs are needed is unknown in advance, and usually it is estimated using a trial and error approach.

Below we are defining the main training loop, which go through each example in batches specified number of times (epochs). After each epoch we display training loss, validation loss and calculate accuracy of the model using validation set. For now, let's take a look into the code, and I am going to explain the details later.
The next step is to define the training function in which we iterate over all batches of training data, execute the forward pass on each batch and calculate training loss. On line 19, we sum losses of every batch per an epoch into a single variable, because we calculate loss per single batch, but want to display it per epoch.
Contributor

"batch per epoch"

Contributor Author

Fixed
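
A simplified sketch of such a training loop, reusing the names from the earlier sketches and omitting the validation loss and accuracy computation:

```python
import mxnet as mx
from mxnet import autograd

epochs = 10  # illustrative; in practice chosen by trial and error

# net, loss_function, trainer, train_dataloader and batch_size are assumed
# to be defined as in the earlier sketches.
for epoch in range(epochs):
    cumulative_train_loss = 0.0
    for features, labels in train_dataloader:
        with autograd.record():
            output = net(features)
            loss = loss_function(output, labels)
        loss.backward()
        trainer.step(batch_size)
        # accumulate the per-batch loss so it can be reported per epoch
        cumulative_train_loss += mx.nd.sum(loss).asscalar()
    print("Epoch %d, training loss: %f" % (epoch, cumulative_train_loss))
```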


1. Subtracts a threshold from the original prediction. Usually, the threshold is equal to 0.5, but it can be higher if you want to increase the certainty that an item belongs to class 1.
In case when there are 3 or more classes, one cannot use a single Logistic regression, but should do multiclass regression. The solution would be to increase the number of output neurons to the number of classes and use `SoftmaxCrossEntropyLoss`.
Contributor

@safrooze commented Jul 20, 2018

  • "In case there are".
  • I think the correct term would be multi-class classification. I suggest to simply remove this sentence to not create any confusion.

Contributor Author

Removed

@Ishitori
Contributor Author

@yifeim, thanks for the comments; I am answering them in order:

  1. The loss is so different because I didn't average by the number of items in the training and validation sets respectively. You are right that it is better to do so, otherwise the numbers look confusing. Did that.

  2. While the XOR problem is fun and gives a good understanding of why you need activation functions, it requires inputs of 0 and 1 as well. I wouldn't want to limit this part of the tutorial to that, as it seems a better use case to showcase arbitrary inputs.

  3. LogisticLoss can be used instead of SigmoidBCELoss. There is a good explanation about both of them: https://stats.stackexchange.com/questions/229645/why-there-are-two-different-logistic-loss-formulation-notations/231994

I haven't seen LogisticLoss used much in practice. I don't want to mention the second option, because it is going to confuse those who are new to Apache MXNet.

I added a showcase of how to use the F1 metric.

  4. You are probably right - I didn't have an intention of going into default parameters as it may require a long explanation. Just removed wd and used the default for Xavier.

@indhub merged commit 832a5fb into apache:master Jul 25, 2018
@yifeim
Contributor

yifeim commented Jul 28, 2018

Awesome. Done and done! Thanks for addressing the questions.

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Add logistic regression tutorial

* Code review fix

* Add F1 metric, fix code review comments

* Add Download buttons script