GSoC 6-7 #8

Yashwants19 · 2020-07-12T13:26:05Z

Hi @coatless @eddelbuettel @rcurtin, as per the timeline, this PR covers the automatic generation of documentation in .R files. Here is an example: https://gist.github.com/Yashwants19/b52606bce1412c1972b8bdab728d992e
I have almost completed the work. This PR also contains some commits related to #7, which I will rebase after GSoC 4-5 is merged into R-Bindings. In the following weeks I will add the markdown documentation.

Thank You.

coatless · 2020-07-13T16:48:05Z

@Yashwants19 so, the main thing that jumps out at me from https://gist.github.com/Yashwants19/b52606bce1412c1972b8bdab728d992e is that the examples are interleaved with the descriptions.

We're often seeking to separate out the code from the documentation, e.g.

#' @title AdaBoost
#'
#' @description
#' This program implements the AdaBoost (or Adaptive Boosting) algorithm. The
#' variant of AdaBoost implemented here is AdaBoost.MH. It uses a weak learner,
#' either decision stumps or perceptrons, and over many iterations, creates a
#' strong learner that is a weighted ensemble of weak learners. It runs these
#' iterations until a tolerance value is crossed for change in the value of the
#' weighted training error.
#' 
#' @param input_model Input AdaBoost model.
#' @param iterations The maximum number of boosting iterations to be run (0 will run
#'   until convergence.)  Default value "1000".
#' @param labels Labels for the training set.
#' @param test Test dataset.
#' @param tolerance The tolerance for change in values of the weighted error during
#'   training.  Default value "1e-10".
#' @param training Dataset for training AdaBoost.
#' @param verbose Display informational messages and the full list of parameters and
#'   timers at the end of execution.  Default value "FALSE".
#' @param weak_learner The type of weak learner to use: 'decision_stump', or
#'   'perceptron'.  Default value "decision_stump".
#'
#' @return A list with several components:
#' \item{output}{Predicted labels for the test set.}
#' \item{output_model}{Output trained AdaBoost model.}
#' \item{predictions}{Predicted labels for the test set.}
#' \item{probabilities}{Predicted class probabilities for each point in the test
#'   set.}
#'
#' @details
#' This program allows training of an AdaBoost model, and then application of
#' that model to a test dataset.  To train a model, a dataset must be passed
#' with the "training" option.  Labels can be given with the "labels" option; if
#' no labels are specified, the labels will be assumed to be the last column of
#' the input dataset.  Alternately, an AdaBoost model may be loaded with the
#' "input_model" option.
#' 
#' Once a model is trained or loaded, it may be used to provide class
#' predictions for a given test dataset.  A test dataset may be specified with
#' the "test" parameter.  The predicted classes for each point in the test
#' dataset are output to the "predictions" output parameter.  The AdaBoost model
#' itself is output to the "output_model" output parameter.
#' 
#' Note: the following parameter is deprecated and will be removed in mlpack
#' 4.0.0: "output".
#' Use "predictions" instead of "output".
#'
#' @references
#' For more information about the algorithm, see the paper "Improved Boosting
#' Algorithms Using Confidence-Rated Predictions", by R.E. Schapire and Y.
#' Singer.
#'
#' @author
#' MLPACK Developers
#'
#' @export
#' @examples
#' # For example, to run AdaBoost on an input dataset "data" with labels
#' # "labels" and perceptrons as the weak learner type, storing the trained model
#' # in "model", one could use the following command:
#' 
#' output <- adaboost(training=data, labels=labels,
#'      weak_learner="perceptron")     
#' model <- output$output_model
#' 
#' # Similarly, an already-trained model in "model" can be used to provide class
#' # predictions from test data "test_data" and store the output in "predictions"
#' # with the following command: 
#' 
#' output <- adaboost(input_model=model, test=test_data)
#' predictions <- output$predictions

Also, we'll need to make sure each example takes less than ~5 seconds to run. If the example takes more than 5 seconds, we're going to need to protect it with \donttest{}, e.g.

#' @examples
#' \donttest{
#' output <- adaboost(input_model=model, test=test_data)
#' predictions <- output$predictions
#' }

The \donttest{} comes from https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-functions

In terms of the package build stage, the other options to use would be:

| Command | Example | Help | R CMD check |
|---|---|---|---|
| \dontrun{} | | x | |
| \dontshow{} | x | | x |
| \donttest{} | x | x | |

CRAN has recently shifted away from \dontrun{} in favor of \donttest{}.

Yashwants19 · 2020-07-14T09:39:43Z

Hi @coatless, thank you for the great example. I will update the documentation as soon as possible.

rcurtin · 2020-07-14T21:17:09Z

@Yashwants19 @coatless I'm not sure exactly how we can do this; the "input" from the C++ side looks like this:

PROGRAM_INFO("AdaBoost",
    // Short description.
    "An implementation of the AdaBoost.MH (Adaptive Boosting) algorithm for "
    "classification.  This can be used to train an AdaBoost model on labeled "
    "data or use an existing AdaBoost model to predict the classes of new "
    "points.",
    // Long description.
    "This program implements the AdaBoost (or Adaptive "
    "Boosting) algorithm. The variant of AdaBoost implemented here is "
    "AdaBoost.MH. It uses a weak learner, either decision stumps or "
    "perceptrons, and over many iterations, creates a strong learner that is a "
    "weighted ensemble of weak learners. It runs these iterations until a "
    "tolerance value is crossed for change in the value of the weighted "
    "training error."
    "\n\n"
    "For more information about the algorithm, see the paper \"Improved "
    "Boosting Algorithms Using Confidence-Rated Predictions\", by R.E. Schapire"
    " and Y. Singer."
    "\n\n"
    "This program allows training of an AdaBoost model, and then application of"
    " that model to a test dataset.  To train a model, a dataset must be passed"
    " with the " + PRINT_PARAM_STRING("training") + " option.  Labels can be "
    "given with the " + PRINT_PARAM_STRING("labels") + " option; if no labels "
    "are specified, the labels will be assumed to be the last column of the "
    "input dataset.  Alternately, an AdaBoost model may be loaded with the " +
    PRINT_PARAM_STRING("input_model") + " option."
    "\n\n"
    "Once a model is trained or loaded, it may be used to provide class "
    "predictions for a given test dataset.  A test dataset may be specified "
    "with the " + PRINT_PARAM_STRING("test") + " parameter.  The predicted "
    "classes for each point in the test dataset are output to the " +
    PRINT_PARAM_STRING("predictions") + " output parameter.  The AdaBoost "
    "model itself is output to the " + PRINT_PARAM_STRING("output_model") +
    " output parameter."
    "\n\n"
    "Note: the following parameter is deprecated and "
    "will be removed in mlpack 4.0.0: " + PRINT_PARAM_STRING("output") +
    "."
    "\n"
    "Use " + PRINT_PARAM_STRING("predictions") + " instead of " +
    PRINT_PARAM_STRING("output") + '.' +
    "\n\n"
    "For example, to run AdaBoost on an input dataset " +
    PRINT_DATASET("data") + " with labels " + PRINT_DATASET("labels") +
    "and perceptrons as the weak learner type, storing the trained model in " +
    PRINT_MODEL("model") + ", one could use the following command: "
    "\n\n" +
    PRINT_CALL("adaboost", "training", "data", "labels", "labels",
        "output_model", "model", "weak_learner", "perceptron") +
    "\n\n"
    "Similarly, an already-trained model in " + PRINT_MODEL("model") + " can"
    " be used to provide class predictions from test data " +
    PRINT_DATASET("test_data") + " and store the output in " +
    PRINT_DATASET("predictions") + " with the following command: "
    "\n\n" +
    PRINT_CALL("adaboost", "input_model", "model", "test", "test_data",
        "predictions", "predictions"),
    // See also...
    SEE_ALSO("AdaBoost on Wikipedia", "https://en.wikipedia.org/wiki/AdaBoost"),
    SEE_ALSO("Improved boosting algorithms using confidence-rated predictions "
        "(pdf)", "http://rob.schapire.net/papers/SchapireSi98.pdf"),
    SEE_ALSO("Perceptron", "#perceptron"),
    SEE_ALSO("Decision Stump", "#decision_stump"),
    SEE_ALSO("mlpack::adaboost::AdaBoost C++ class documentation",
        "@doxygen/classmlpack_1_1adaboost_1_1AdaBoost.html"));

So, basically, even in our sources, the examples are interleaved with the "long description". Unless @Yashwants19 has a better idea it seems like maybe the only option here is to add some kind of EXAMPLE() macro to define each of the individual examples, and then go through our existing bindings and split out the examples. @Yashwants19 what do you think? If that's the general idea you were thinking I have a few more comments on maybe how it should be done best.
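
To make the idea concrete, here is a minimal sketch of what the R documentation printer could do once the examples are stored separately from the long description. Everything in it is hypothetical: `BindingDoc`, `PrefixLines`, and `PrintRoxygen` are illustrative names rather than mlpack APIs, and the `slowExamples` flag is only an assumed way of deciding when to wrap the examples in `\donttest{}`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: the long description and the examples are stored
// separately instead of being concatenated into one string.
struct BindingDoc
{
  std::string title;
  std::string longDescription;
  std::vector<std::string> examples;
};

// Prefix every line of `text` with the roxygen comment marker.
// Empty lines become "#'" so that no trailing whitespace is emitted.
inline std::string PrefixLines(const std::string& text)
{
  std::istringstream in(text);
  std::ostringstream out;
  std::string line;
  while (std::getline(in, line))
  {
    if (line.empty())
      out << "#'\n";
    else
      out << "#' " << line << '\n';
  }
  return out.str();
}

// Emit roxygen2 comments with @details and @examples as separate sections.
// If `slowExamples` is true, the examples are wrapped in \donttest{}.
inline std::string PrintRoxygen(const BindingDoc& doc, const bool slowExamples)
{
  std::ostringstream out;
  out << "#' @title " << doc.title << "\n#'\n";
  out << "#' @details\n" << PrefixLines(doc.longDescription) << "#'\n";
  out << "#' @examples\n";
  if (slowExamples)
    out << "#' \\donttest{\n";
  for (std::size_t i = 0; i < doc.examples.size(); ++i)
  {
    if (i > 0)
      out << "#'\n"; // Blank roxygen line between consecutive examples.
    out << PrefixLines(doc.examples[i]);
  }
  if (slowExamples)
    out << "#' }\n";
  return out.str();
}
```

With this layout, `PrintRoxygen(doc, true)` produces a block in which `@details` holds only prose and every example sits under `@examples` inside `\donttest{}`, which is the separation requested above.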

rcurtin

Personally I think the implementation looks great. My only comments are really minor.

It seems to me like this would be the right time to make sure that the documentation looks "R-native"; so, @coatless and @eddelbuettel, if you have additional comments about the generated documentation @Yashwants19 showed, now would be a great time to work out those issues. :)

rcurtin · 2020-07-14T21:21:44Z

src/mlpack/bindings/R/print_doc_functions_impl.hpp

+inline std::string ParamString(const std::string& paramName)
+{
+  // For a R binding we don't need to know the type.
+


Suggested change

Seems like this is an extra line.

rcurtin · 2020-07-14T21:22:51Z

src/mlpack/bindings/go/print_doc.hpp

@@ -16,7 +16,7 @@
 #include <mlpack/prereqs.hpp>
 #include <mlpack/core/util/hyphenate_string.hpp>
 #include "get_go_type.hpp"
-#include "camel_case.hpp"
+#include <mlpack/bindings/utils/camel_case.hpp>


So, there also exists the directory src/mlpack/core/util/ which contains things in the namespace mlpack::core::util. Maybe it is better here to match that name with mlpack::bindings::util? (Instead of having util there and utils here.)

rcurtin · 2020-07-14T21:23:40Z

src/mlpack/bindings/utils/camel_case.hpp

-#ifndef MLPACK_BINDINGS_GO_CAMEL_CASE_HPP
-#define MLPACK_BINDINGS_GO_CAMEL_CASE_HPP
+#ifndef MLPACK_BINDINGS_UTILS_CAMEL_CASE_HPP
+#define MLPACK_BINDINGS_UTILS_CAMEL_CASE_HPP


This is a nice cleanup---I appreciate that you've done this. It reduces the overall amount of code which is great. 👍

rcurtin · 2020-07-14T21:24:15Z

src/mlpack/core/util/hyphenate_string.hpp

+  if (prefix.size() >= 80)
+  {
+    throw std::invalid_argument("Prefix size must be less than 80");
+  }


Nice catch! I imagine that was kind of irritating to debug... :)
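
For readers wondering why this guard matters: a wrapper that prefixes every line has `80 - prefix.size()` columns left for text, and with an unsigned size type that subtraction becomes zero or wraps around to a huge value once the prefix reaches 80 characters. The sketch below is a simplified stand-in, not mlpack's actual `HyphenateString`; the name `WrapWithPrefix` and the greedy word-wrapping strategy are assumptions made for illustration.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Simplified, hypothetical line wrapper: breaks `text` at whitespace so that
// each output line (prefix included) is at most 80 columns wide where
// possible. A single word longer than the margin is kept whole on its own
// line.
inline std::string WrapWithPrefix(const std::string& text,
                                  const std::string& prefix)
{
  // Without this check, `80 - prefix.size()` below would be zero or would
  // wrap around to a huge unsigned value.
  if (prefix.size() >= 80)
    throw std::invalid_argument("Prefix size must be less than 80");

  const std::size_t margin = 80 - prefix.size();
  std::istringstream words(text);
  std::string word, line, out;
  while (words >> word)
  {
    // Flush the current line if the next word would not fit.
    if (!line.empty() && line.size() + 1 + word.size() > margin)
    {
      out += prefix + line + '\n';
      line.clear();
    }
    if (!line.empty())
      line += ' ';
    line += word;
  }
  if (!line.empty())
    out += prefix + line + '\n';
  return out;
}
```

Failing fast with an exception turns a silent formatting problem into an immediate, descriptive error at the call site.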

Yashwants19 · 2020-07-14T23:57:18Z

So, basically, even in our sources, the examples are interleaved with the "long description". Unless @Yashwants19 has a better idea it seems like maybe the only option here is to add some kind of EXAMPLE() macro to define each of the individual examples, and then go through our existing bindings and split out the examples. @Yashwants19 what do you think? If that's the general idea you were thinking I have a few more comments on maybe how it should be done best.

I was thinking along the same lines. I would love to discuss how this can be done in the best way.

eddelbuettel · 2020-07-15T00:11:34Z

I haven't had time to look in any detail but I was also hoping that maybe a two step process of first doing what is done at the source level here, and then letting R's roxygen2 generate the help pages and maybe reorder in the process may do the trick. Maybe that'll work.

Yashwants19 · 2020-07-15T10:18:30Z

Hi @rcurtin, I have done some rough work for "example" on the source side. Please take a look.

Yashwants19 · 2020-07-16T16:02:33Z

src/mlpack/bindings/tests/clean_memory.cpp

@@ -2,7 +2,7 @@
 * @file bindings/tests/clean_memory.cpp
 * @author Ryan Curtin
 *
- * Delete any pointers held by the CLI object.
+ * Delete any pointers held by the IO object.


I am not sure whether these remaining CLI references were overlooked or are intentional. If they were overlooked, should I open a separate PR, or are we fine keeping these changes here?

Yes, it looks to me like there were some things that were accidentally missed. Do you want to open a PR to the main repository with those things? Alternately I can do it, just let me know---I can see that these were missed, and also the names of the utility functions in src/mlpack/bindings/julia/mlpack/cli.jl.in. I'm not sure if the Python utility functions need to be adapted too; I didn't check. Anyway let me know what you'd like to do.

I will open a PR soon for the same.

rcurtin

Hey @Yashwants19, everything looks great to me. Many of the comments I left are perhaps larger scope than just this PR, so, let me know what you think of them. We can always open issues in the main mlpack repository to handle some of those other things, or, even, we could handle them some other time in this repo, or maybe not at all depending on what you think. 👍

rcurtin · 2020-07-18T18:56:20Z

CMake/R/ConfigureRCPP.cmake

@@ -30,17 +30,17 @@ if (NOT (MODEL_FILE_TYPE MATCHES "\"${MODEL_SAFE_TYPES}\""))
      set(MODEL_PTR_IMPLS "${MODEL_PTR_IMPLS}
 // Get the pointer to a ${MODEL_TYPE} parameter.
 // [[Rcpp::export]]
-SEXP CLI_GetParam${MODEL_SAFE_TYPE}Ptr(const std::string& paramName)
+SEXP IO_GetParam${MODEL_SAFE_TYPE}Ptr(const std::string& paramName)


Thanks for taking the time to do this---I know it is a bit of a tedious refactoring. :)

rcurtin · 2020-07-18T19:01:26Z

doc/guide/cli_quickstart.hpp

@@ -108,7 +108,7 @@ The example above has only shown a little bit of the functionality of mlpack.
 Lots of other commands are available with different functionality.  A full list
 of commands and full documentation for each can be found on the following page:

- - <a href="https://mlpack.org/doc/mlpack-git/cli_documentation.html">IO documentation</a>
+ - <a href="https://mlpack.org/doc/mlpack-git/cli_documentation.html">CLI documentation</a>


Ahhh, thank you. Actually, do you want to submit this one as a patch back to the mlpack repository?

rcurtin · 2020-07-18T19:09:29Z

src/mlpack/bindings/tests/clean_memory.cpp

@@ -2,7 +2,7 @@
 * @file bindings/tests/clean_memory.cpp
 * @author Ryan Curtin
 *
- * Delete any pointers held by the CLI object.
+ * Delete any pointers held by the IO object.


Yes, it looks to me like there were some things that were accidentally missed. Do you want to open a PR to the main repository with those things? Alternately I can do it, just let me know---I can see that these were missed, and also the names of the utility functions in src/mlpack/bindings/julia/mlpack/cli.jl.in. I'm not sure if the Python utility functions need to be adapted too; I didn't check. Anyway let me know what you'd like to do.

rcurtin · 2020-07-18T19:13:51Z

src/mlpack/core/util/mlpack_main.hpp

+    mlpack::util::ProgramDoc \
+    io_programdoc_dummy_object = mlpack::util::ProgramDoc(NAME, SHORT_DESC, \
+    []() { return std::string(DESC) + std::string(EXAMPLE); }, []() \
+    { return ""; }, { __VA_ARGS__ })


Nice, this is one option for what I was thinking. A couple further thoughts:

It looks like this only allows one EXAMPLE, so we don't have a way to split into multiple examples.

The number of parameters to PROGRAM_INFO is getting large. What would you think about splitting PROGRAM_INFO into a bunch of macros, like this:

BINDING_NAME("program name");
BINDING_SHORT_DESC("This is a short, two-sentence description of what the program does.");
BINDING_LONG_DESC("This is a long description of what the program does."
    " It might be many lines long and have lots of details about different options.");
BINDING_EXAMPLE("This contains one example for this particular binding.\n" +
    PROGRAM_CALL(...));
BINDING_EXAMPLE("This contains another example for this particular binding.\n" +
    PROGRAM_CALL(...));
// There could be many of these "see alsos".
BINDING_SEE_ALSO("https://en.wikipedia.org/wiki/Machine_learning");

It would take a little bit of refactoring but I wonder if that is a cleaner approach overall. What you've done in this PR is fine for now, I think, but I do think maybe it's worth some discussion about a longer-term cleanup. If you think either of those ideas are useful, and we can work it into a quick proposal, we should open an issue back in the main mlpack repository and then we can do it (I can help with the refactoring of course! :)).
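
One way the proposed split could work, sketched under assumptions: each macro defines a small static object whose constructor appends to a shared record, similar in spirit to the `JOIN(..., __COUNTER__)` dummy objects that `PARAM_IN` uses in `param.hpp`. `BindingInfo`, `Registrar`, and the `DOC_*` helper macros are hypothetical names, and this is not necessarily the design mlpack would adopt.

```cpp
#include <string>
#include <vector>

// Hypothetical record holding the separate documentation pieces of a binding.
struct BindingInfo
{
  std::string name;
  std::string shortDesc;
  std::string longDesc;
  std::vector<std::string> examples;
  std::vector<std::string> seeAlso;

  // Shared instance; a function-local static is constructed on first use, so
  // it is ready before any registration object touches it.
  static BindingInfo& Get()
  {
    static BindingInfo info;
    return info;
  }
};

// Helper whose constructor runs a callback on the shared record.
struct Registrar
{
  template<typename F>
  explicit Registrar(F f) { f(BindingInfo::Get()); }
};

// __COUNTER__ is a common compiler extension (GCC, Clang, MSVC); it gives each
// registration object a unique name.
#define DOC_JOIN_IMPL(a, b) a ## b
#define DOC_JOIN(a, b) DOC_JOIN_IMPL(a, b)
#define DOC_REGISTER(BODY) \
    static Registrar DOC_JOIN(doc_registrar_, __COUNTER__)( \
        [](BindingInfo& info) { BODY; })

#define BINDING_NAME(N) DOC_REGISTER(info.name = (N))
#define BINDING_SHORT_DESC(D) DOC_REGISTER(info.shortDesc = (D))
#define BINDING_LONG_DESC(D) DOC_REGISTER(info.longDesc = (D))
#define BINDING_EXAMPLE(E) DOC_REGISTER(info.examples.push_back(E))
#define BINDING_SEE_ALSO(S) DOC_REGISTER(info.seeAlso.push_back(S))

// Usage at namespace scope: each piece is declared on its own, and examples
// can be repeated as often as needed.
BINDING_NAME("adaboost");
BINDING_SHORT_DESC("An implementation of the AdaBoost.MH algorithm.");
BINDING_LONG_DESC("This program implements the AdaBoost algorithm.");
BINDING_EXAMPLE("Train a model on a dataset with labels.");
BINDING_EXAMPLE("Use an already-trained model to predict classes.");
BINDING_SEE_ALSO("https://en.wikipedia.org/wiki/AdaBoost");
```

Because every macro appends to the same record, a binding can use `BINDING_EXAMPLE` any number of times, and each language-specific printer can decide where the pieces go in its own output.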

This looks awesome. We should definitely discuss this; it would be a great help if you could open an issue in the main mlpack repository. It may be better to handle this in a separate PR in the main mlpack repository, since this change is not directly related to the R bindings. Given the current progress, I might complete the R bindings ahead of schedule, and this change could make a great stretch goal for my timeline.

I agree with that---let me open an issue in the mlpack repository later today. 👍

mlpack#2521

Awesome, thank you for opening the issue.

rcurtin · 2020-07-18T19:16:22Z

src/mlpack/core/util/param.hpp

@@ -1129,55 +1130,55 @@ using DatasetInfo = DatasetMapper<IncrementPolicy, std::string>;
 #ifdef __COUNTER__
  #define PARAM_IN(T, ID, DESC, ALIAS, DEF, REQ) \
      static mlpack::util::Option<T> \
-      JOIN(cli_option_dummy_object_in_, __COUNTER__) \
+      JOIN(io_option_dummy_object_in_, __COUNTER__) \


Ahhh, nice catch!

Thank you :)

rcurtin · 2020-07-18T19:18:12Z

src/mlpack/methods/cf/cf_main.cpp

+    "\n\n",
+    // Example.


Suggested change

-    "\n\n",
-    // Example.
+    // Example.

Nice, it's really cool how simple this refactoring turned out to be. 👍 Still, I think maybe we can remove the extra newlines now, since the binding printing code should hopefully be handling that for us. Plus, as users add new bindings, they may not know that they need to add newlines to the end of the program documentation.

Anyway, my suggestion here doesn't totally work, because I can't put the comma with the previous line. 👍
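
As a rough illustration of how the printing code could take care of spacing so that binding authors never need trailing newlines, the sketch below joins documentation pieces with exactly one blank line. `StripTrailingNewlines` and `JoinSections` are hypothetical helper names, not existing mlpack functions.

```cpp
#include <string>
#include <vector>

// Strip trailing '\n' characters from one documentation piece.
inline std::string StripTrailingNewlines(std::string s)
{
  while (!s.empty() && s.back() == '\n')
    s.pop_back();
  return s;
}

// Join pieces with exactly one blank line between them, regardless of whether
// the binding author ended a piece with "\n\n", "\n", or nothing at all.
// Pieces that are empty after stripping are skipped.
inline std::string JoinSections(const std::vector<std::string>& pieces)
{
  std::string out;
  for (const std::string& piece : pieces)
  {
    const std::string trimmed = StripTrailingNewlines(piece);
    if (trimmed.empty())
      continue;
    if (!out.empty())
      out += "\n\n";
    out += trimmed;
  }
  return out;
}
```

Normalizing in one place means a description written with or without a trailing `"\n\n"` renders identically.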

I will soon push the changes regarding newline 👍.

I have made changes in the last commit.

Looks great, thanks! 👍 I would say whenever you're ready to merge this go ahead. Not sure if you were done here yet or if you wanted to do some more, but in any case, don't feel obligated to wait on another approval from me. :)

rcurtin · 2020-07-18T19:18:25Z

src/mlpack/methods/hmm/hmm_train_main.cpp

-    "--tolerance option.  By default, the transition matrix is randomly "
-    "initialized and the emission distributions are initialized to fit the "
-    "extent of the data."
+    + PRINT_PARAM_STRING("tolerance") + "option.  By default, the transition "


Nice catch! 💯

rcurtin · 2020-07-20T22:05:57Z

src/mlpack/bindings/R/print_R.cpp

  cout << util::HyphenateString(programInfo.documentation(), "#' ") << endl;
-  cout << "#' @author" << endl;
+  cout << "#'\n#' @author" << endl;


Maybe it would be easier to just add another endl after HyphenateString? i.e.

cout << util::HyphenateString(programInfo.documentation(), "#' ") << endl << endl;

or similar. Realistically both of these function just the same, but I think there's something nice about consistency of always using endl or always using \n, but not mixing both. Just my opinion. 😄

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Yashwants19 · 2020-07-21T13:32:07Z

Okay, let's merge this. :)

Yashwants19 · 2020-07-21T18:21:47Z

Hey everyone, here is the link for my weekly blog. Kindly share your views.

Thank you.

rcurtin · 2020-07-21T18:22:44Z

Awesome, thanks! If you like, it might be useful to include some example usages from R that now work, etc., for any reader who might not want to dig into the details. However, up to you. 👍

Yashwants19 · 2020-07-21T18:29:30Z

Sure, I will update the blog post.

Yashwants19 force-pushed the Add-R-documentation branch from 5637bc6 to c128054 Compare July 13, 2020 10:54

rcurtin reviewed Jul 14, 2020

Yashwants19 force-pushed the R-bindings branch from 08cc204 to 49383ba Compare July 15, 2020 04:42

Yashwants19 force-pushed the Add-R-documentation branch 2 times, most recently from a788479 to 75fa69f Compare July 15, 2020 07:21

Yashwants19 force-pushed the R-bindings branch from 4978424 to 6bbb77a Compare July 15, 2020 09:56

Yashwants19 force-pushed the Add-R-documentation branch 2 times, most recently from 0e52b1d to 5db856e Compare July 15, 2020 09:59

Yashwants19 commented Jul 16, 2020

Yashwants19 mentioned this pull request Jul 18, 2020

R bindings with documentation Yashwants19/RcppMLPACK#11

Open

rcurtin approved these changes Jul 18, 2020

rcurtin mentioned this pull request Jul 19, 2020

Refactor ProgramInfo to separate out all the different information mlpack/mlpack#2521

Closed

rcurtin reviewed Jul 20, 2020

Yashwants19 force-pushed the R-bindings branch from 6bbb77a to 0d960ae Compare July 21, 2020 13:24

Yashwants19 added 10 commits July 21, 2020 18:59

Setup R-directory-structure using cmake.

6cb9dd5

Add documentation To .R file.

717320f

Rough changes.

e443778

Bad regex.

741e82b

Remove extra 'clenup'.

aed7887

Resolve GH actions.

fae1149

Add proper comments.

962d13a

Update size.

83a6d92

Bad rebase.

b7c3c26

Resolve comments.

2964dc7

Yashwants19 and others added 5 commits July 21, 2020 18:59

Improvement in CMake.

d0bc7b3

Remove less portable flag.

8c5b0e0

Correct branding similar to rest of project.

e1ee83c

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Resolve some final comments.

765172d

Update suggested change.

f622640

Yashwants19 force-pushed the Add-R-documentation branch from b130822 to f622640 Compare July 21, 2020 13:31

Yashwants19 merged commit ed60e22 into R-bindings Jul 21, 2020

Yashwants19 deleted the Add-R-documentation branch October 1, 2020 07:03

Yashwants19 pushed a commit that referenced this pull request Nov 15, 2020

Merge pull request #8 from mlpack/master

ca26c75

updating master
