Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GSoC 6-7 #8

Merged
merged 15 commits into from
Jul 21, 2020
Merged

GSoC 6-7 #8

merged 15 commits into from
Jul 21, 2020

Conversation

Yashwants19
Copy link
Owner

Hi @coatless @eddelbuettel @rcurtin, As per the timeline this PR is regarding the automatically generation of documentation in .R files. Here is an example : https://gist.github.com/Yashwants19/b52606bce1412c1972b8bdab728d992e
I have almost completed the work. This PR also contain some commits regarding #7 which I will rebase after merging GSoC 4-5 in R-Bindings. In the following weeks I will add the markdown documentation.

Thank You.

@coatless
Copy link

@Yashwants19 so, the main part that jumps out at me from https://gist.github.com/Yashwants19/b52606bce1412c1972b8bdab728d992e are interweaved examples with descriptions.

We're often seeking to separate out the code from the documentation, e.g.

#' @title AdaBoost
#'
#' @description
#' This program implements the AdaBoost (or Adaptive Boosting) algorithm. The
#' variant of AdaBoost implemented here is AdaBoost.MH. It uses a weak learner,
#' either decision stumps or perceptrons, and over many iterations, creates a
#' strong learner that is a weighted ensemble of weak learners. It runs these
#' iterations until a tolerance value is crossed for change in the value of the
#' weighted training error.
#' 
#' @param input_model Input AdaBoost model.
#' @param iterations The maximum number of boosting iterations to be run (0 will run
#'   until convergence.)  Default value "1000".
#' @param labels Labels for the training set.
#' @param test Test dataset.
#' @param tolerance The tolerance for change in values of the weighted error during
#'   training.  Default value "1e-10".
#' @param training Dataset for training AdaBoost.
#' @param verbose Display informational messages and the full list of parameters and
#'   timers at the end of execution.  Default value "FALSE".
#' @param weak_learner The type of weak learner to use: 'decision_stump', or
#'   'perceptron'.  Default value "decision_stump".
#'
#' @return A list with several components:
#' \item{output}{Predicted labels for the test set.}
#' \item{output_model}{Output trained AdaBoost model.}
#' \item{predictions}{Predicted labels for the test set.}
#' \item{probabilities}{Predicted class probabilities for each point in the test
#'   set.}
#'
#' @details
#' This program allows training of an AdaBoost model, and then application of
#' that model to a test dataset.  To train a model, a dataset must be passed
#' with the "training" option.  Labels can be given with the "labels" option; if
#' no labels are specified, the labels will be assumed to be the last column of
#' the input dataset.  Alternately, an AdaBoost model may be loaded with the
#' "input_model" option.
#' 
#' Once a model is trained or loaded, it may be used to provide class
#' predictions for a given test dataset.  A test dataset may be specified with
#' the "test" parameter.  The predicted classes for each point in the test
#' dataset are output to the "predictions" output parameter.  The AdaBoost model
#' itself is output to the "output_model" output parameter.
#' 
#' Note: the following parameter is deprecated and will be removed in mlpack
#' 4.0.0: "output".
#' Use "predictions" instead of "output". additional
#'
#' @references
#' For more information about the algorithm, see the paper "Improved Boosting
#' Algorithms Using Confidence-Rated Predictions", by R.E. Schapire and Y.
#' Singer.
#'
#' @author
#' MLPACK Developers
#'
#' @export
#' @examples
#' # For example, to run AdaBoost on an input dataset "data" with labels
#' # "labels"and perceptrons as the weak learner type, storing the trained model
#' # in "model", one could use the following command: 
#' 
#' output <- adaboost(training=data, labels=labels,
#'      weak_learner="perceptron")     
#' model <- output$output_model
#' 
#' # Similarly, an already-trained model in "model" can be used to provide class
#' # predictions from test data "test_data" and store the output in "predictions"
#' # with the following command: 
#' 
#' output <- adaboost(input_model=model, test=test_data)
#' predictions <- output$predictions

Also, we'll need to make sure each example takes less than ~5 seconds to run. If the example takes more than 5 seconds, we're going to need to protect it with \donttest{}, e.g.

#' @examples
#' \donttest{
#' output <- adaboost(input_model=model, test=test_data)
#' predictions <- output$predictions
#' }

The \donttest{} comes from https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-functions

In terms of the package build stage, the other options to use would be:

Command Example Help R CMD check
\dontrun{}   x  
\dontshow{} x   x
\donttest{} x x  

There was a shift away from using \dontrun{} to \donttest{} by CRAN recently.

@Yashwants19
Copy link
Owner Author

Hi @coatless, Thank you for the great example, I will update the documentation as soon as possible.

@rcurtin
Copy link

rcurtin commented Jul 14, 2020

@Yashwants19 @coatless I'm not sure exactly how we can do this; the "input" from the C++ side looks like this:

PROGRAM_INFO("AdaBoost",
    // Short description.
    "An implementation of the AdaBoost.MH (Adaptive Boosting) algorithm for "
    "classification.  This can be used to train an AdaBoost model on labeled "
    "data or use an existing AdaBoost model to predict the classes of new "
    "points.",
    // Long description.
    "This program implements the AdaBoost (or Adaptive "
    "Boosting) algorithm. The variant of AdaBoost implemented here is "
    "AdaBoost.MH. It uses a weak learner, either decision stumps or "
    "perceptrons, and over many iterations, creates a strong learner that is a "
    "weighted ensemble of weak learners. It runs these iterations until a "
    "tolerance value is crossed for change in the value of the weighted "
    "training error."
    "\n\n"
    "For more information about the algorithm, see the paper \"Improved "
    "Boosting Algorithms Using Confidence-Rated Predictions\", by R.E. Schapire"
    " and Y. Singer."
    "\n\n"
    "This program allows training of an AdaBoost model, and then application of"
    " that model to a test dataset.  To train a model, a dataset must be passed"
    " with the " + PRINT_PARAM_STRING("training") + " option.  Labels can be "
    "given with the " + PRINT_PARAM_STRING("labels") + " option; if no labels "
    "are specified, the labels will be assumed to be the last column of the "
    "input dataset.  Alternately, an AdaBoost model may be loaded with the " +
    PRINT_PARAM_STRING("input_model") + " option."
    "\n\n"
    "Once a model is trained or loaded, it may be used to provide class "
    "predictions for a given test dataset.  A test dataset may be specified "
    "with the " + PRINT_PARAM_STRING("test") + " parameter.  The predicted "
    "classes for each point in the test dataset are output to the " +
    PRINT_PARAM_STRING("predictions") + " output parameter.  The AdaBoost "
    "model itself is output to the " + PRINT_PARAM_STRING("output_model") +
    " output parameter."
    "\n\n"
    "Note: the following parameter is deprecated and "
    "will be removed in mlpack 4.0.0: " + PRINT_PARAM_STRING("output") +
    "."
    "\n"
    "Use " + PRINT_PARAM_STRING("predictions") + " instead of " +
    PRINT_PARAM_STRING("output") + '.' +
    "\n\n"
    "For example, to run AdaBoost on an input dataset " +
    PRINT_DATASET("data") + " with labels " + PRINT_DATASET("labels") +
    "and perceptrons as the weak learner type, storing the trained model in " +
    PRINT_MODEL("model") + ", one could use the following command: "
    "\n\n" +
    PRINT_CALL("adaboost", "training", "data", "labels", "labels",
        "output_model", "model", "weak_learner", "perceptron") +
    "\n\n"
    "Similarly, an already-trained model in " + PRINT_MODEL("model") + " can"
    " be used to provide class predictions from test data " +
    PRINT_DATASET("test_data") + " and store the output in " +
    PRINT_DATASET("predictions") + " with the following command: "
    "\n\n" +
    PRINT_CALL("adaboost", "input_model", "model", "test", "test_data",
        "predictions", "predictions"),
    // See also...
    SEE_ALSO("AdaBoost on Wikipedia", "https://en.wikipedia.org/wiki/AdaBoost"),
    SEE_ALSO("Improved boosting algorithms using confidence-rated predictions "
        "(pdf)", "http://rob.schapire.net/papers/SchapireSi98.pdf"),
    SEE_ALSO("Perceptron", "#perceptron"),
    SEE_ALSO("Decision Stump", "#decision_stump"),
    SEE_ALSO("mlpack::adaboost::AdaBoost C++ class documentation",
        "@doxygen/classmlpack_1_1adaboost_1_1AdaBoost.html"));

So, basically, even in our sources, the examples are interleaved with the "long description". Unless @Yashwants19 has a better idea it seems like maybe the only option here is to add some kind of EXAMPLE() macro to define each of the individual examples, and then go through our existing bindings and split out the examples. @Yashwants19 what do you think? If that's the general idea you were thinking I have a few more comments on maybe how it should be done best.

Copy link

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think the implementation looks great. My only comments are really minor.

It seems to me like this would be the right time to make sure that the documentation looks "R-native"; so, @coatless and @eddelbuettel, if you have additional comments about the generated documentation @Yashwants19 showed, now would be a great time to work out those issues. :)

inline std::string ParamString(const std::string& paramName)
{
// For a R binding we don't need to know the type.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Seems like this is an extra line.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -16,7 +16,7 @@
#include <mlpack/prereqs.hpp>
#include <mlpack/core/util/hyphenate_string.hpp>
#include "get_go_type.hpp"
#include "camel_case.hpp"
#include <mlpack/bindings/utils/camel_case.hpp>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, there also exists the directory src/mlpack/core/util/ which contains things in the namespace mlpack::core::util. Maybe it is better here to match that name with mlpack::bindings::util? (Instead of having util there and utils here.)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#ifndef MLPACK_BINDINGS_GO_CAMEL_CASE_HPP
#define MLPACK_BINDINGS_GO_CAMEL_CASE_HPP
#ifndef MLPACK_BINDINGS_UTILS_CAMEL_CASE_HPP
#define MLPACK_BINDINGS_UTILS_CAMEL_CASE_HPP
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice cleanup---I appreciate that you've done this. It reduces the overall amount of code which is great. 👍

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank You.

if (prefix.size() >= 80)
{
throw std::invalid_argument("Prefix size must be less than 80");
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I imagine that was kind of irritating to debug... :)

@Yashwants19
Copy link
Owner Author

So, basically, even in our sources, the examples are interleaved with the "long description". Unless @Yashwants19 has a better idea it seems like maybe the only option here is to add some kind of EXAMPLE() macro to define each of the individual examples, and then go through our existing bindings and split out the examples. @Yashwants19 what do you think? If that's the general idea you were thinking I have a few more comments on maybe how it should be done best.

I was also looking for the same. I will love to discussion that how this can done in the best way.

@eddelbuettel
Copy link

I haven't had time to look in any detail but I was also hoping that maybe a two step process of first doing what is done at the source level here, and then letting R's roxygen2 generate the help pages and maybe reorder in the process may do the trick. Maybe that'll work.

@Yashwants19
Copy link
Owner Author

Hi @rcurtin I have done some rough work for "example" on the source side, Please take a look.

@@ -2,7 +2,7 @@
* @file bindings/tests/clean_memory.cpp
* @author Ryan Curtin
*
* Delete any pointers held by the CLI object.
* Delete any pointers held by the IO object.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, whether these CLI's are ignored or these are intentional. And If they are ignored should I open a different PR or we are good here with these changes?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it looks to me like there were some things that were accidentally missed. Do you want to open a PR to the main repository with those things? Alternately I can do it, just let me know---I can see that these were missed, and also the names of the utility functions in src/mlpack/bindings/julia/mlpack/cli.jl.in. I'm not sure if the Python utility functions need to be adapted too; I didn't check. Anyway let me know what you'd like to do.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will open a PR soon for the same.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Yashwants19, everything looks great to me. Many of the comments I left are perhaps larger scope than just this PR, so, let me know what you think of them. We can always open issues in the main mlpack repository to handle some of those other things, or, even, we could handle them some other time in this repo, or maybe not at all depending on what you think. 👍

@@ -30,17 +30,17 @@ if (NOT (MODEL_FILE_TYPE MATCHES "\"${MODEL_SAFE_TYPES}\""))
set(MODEL_PTR_IMPLS "${MODEL_PTR_IMPLS}
// Get the pointer to a ${MODEL_TYPE} parameter.
// [[Rcpp::export]]
SEXP CLI_GetParam${MODEL_SAFE_TYPE}Ptr(const std::string& paramName)
SEXP IO_GetParam${MODEL_SAFE_TYPE}Ptr(const std::string& paramName)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to do this---I know it is a bit of a tedious refactoring. :)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

@@ -108,7 +108,7 @@ The example above has only shown a little bit of the functionality of mlpack.
Lots of other commands are available with different functionality. A full list
of commands and full documentation for each can be found on the following page:

- <a href="https://mlpack.org/doc/mlpack-git/cli_documentation.html">IO documentation</a>
- <a href="https://mlpack.org/doc/mlpack-git/cli_documentation.html">CLI documentation</a>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, thank you. Actually do you want to submit this one as a patch batch to the mlpack repository?

@@ -2,7 +2,7 @@
* @file bindings/tests/clean_memory.cpp
* @author Ryan Curtin
*
* Delete any pointers held by the CLI object.
* Delete any pointers held by the IO object.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it looks to me like there were some things that were accidentally missed. Do you want to open a PR to the main repository with those things? Alternately I can do it, just let me know---I can see that these were missed, and also the names of the utility functions in src/mlpack/bindings/julia/mlpack/cli.jl.in. I'm not sure if the Python utility functions need to be adapted too; I didn't check. Anyway let me know what you'd like to do.

mlpack::util::ProgramDoc \
io_programdoc_dummy_object = mlpack::util::ProgramDoc(NAME, SHORT_DESC, \
[]() { return std::string(DESC) + std::string(EXAMPLE); }, []() \
{ return ""; }, { __VA_ARGS__ })
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this is one option for what I was thinking. A couple further thoughts:

  • It looks like this only allows one EXAMPLE, so we don't have a way to split into multiple examples.
  • The number of parameters to PROGRAM_INFO is getting large. What would you think about splitting PROGRAM_INFO into a bunch of macros, like this:
BINDING_NAME("program name");
BINDING_SHORT_DESC("This is a short, two-sentence description of what the program does.");
BINDING_LONG_DESC("This is a long description of what the program does."
    " It might be many lines long and have lots of details about different options.");
BINDING_EXAMPLE("This contains one example for this particular binding.\n" +
    PROGRAM_CALL(...));
BINDING_EXAMPLE("This contains another example for this particular binding.\n" +
    PROGRAM_CALL(...));
// There could be many of these "see alsos".
BINDING_SEE_ALSO("https://en.wikipedia.org/wiki/Machine_learning");

It would take a little bit of refactoring but I wonder if that is a cleaner approach overall. What you've done in this PR is fine for now, I think, but I do think maybe it's worth some discussion about a longer-term cleanup. If you think either of those ideas are useful, and we can work it into a quick proposal, we should open an issue back in the main mlpack repository and then we can do it (I can help with the refactoring of course! :)).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome. We should definitely discuss on this, it will give a great help if you can open a issue in the main mlpack repository. May be it will be better if we handle this in some different PR in the main mlpack repository, as this translation is not directly related to R-bindings. It seems like according to the current progress, I might complete R-bindings before time and this translation may result into a great stretch goal to my timeline.

Copy link

@rcurtin rcurtin Jul 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that---let me open an issue in the mlpack repository later today. 👍

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you for opening the issue.

@@ -1129,55 +1130,55 @@ using DatasetInfo = DatasetMapper<IncrementPolicy, std::string>;
#ifdef __COUNTER__
#define PARAM_IN(T, ID, DESC, ALIAS, DEF, REQ) \
static mlpack::util::Option<T> \
JOIN(cli_option_dummy_object_in_, __COUNTER__) \
JOIN(io_option_dummy_object_in_, __COUNTER__) \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, nice catch!

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you :)

"\n\n",
// Example.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"\n\n",
// Example.
// Example.

Nice, it's really cool how simple this refactoring turned out to be. 👍 Still, I think maybe we can remove the extra newlines now, since the binding printing code should hopefully be handling that for us. Plus, as users add new bindings, they may not know that they need to add newlines to the end of the program documentation.

Anyway, my suggestion here doesn't totally work, because I can't put the comma with the previous line. 👍

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will soon push the changes regarding newline 👍.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made changes in the last commit.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks! 👍 I would say whenever you're ready to merge this go ahead. Not sure if you were done here yet or if you wanted to do some more, but in any case, don't feel obligated to wait on another approval from me. :)

"--tolerance option. By default, the transition matrix is randomly "
"initialized and the emission distributions are initialized to fit the "
"extent of the data."
+ PRINT_PARAM_STRING("tolerance") + "option. By default, the transition "
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! 💯

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

src/mlpack/bindings/R/print_R.cpp Outdated Show resolved Hide resolved
Comment on lines 112 to 113
cout << util::HyphenateString(programInfo.documentation(), "#' ") << endl;
cout << "#' @author" << endl;
cout << "#'\n#' @author" << endl;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be easier to just add another endl after HyphenateString? i.e.

cout << util::HyphenateString(programInfo.documentation(), "#' ") << endl << endl;

or similar. Realistically both of these function just the same, but I think there's something nice about consistency of always using endl or always using \n, but not mixing both. Just my opinion. 😄

@Yashwants19
Copy link
Owner Author

Okay, lets merge this. :)

@Yashwants19 Yashwants19 merged commit ed60e22 into R-bindings Jul 21, 2020
@Yashwants19
Copy link
Owner Author

Hey everyone, here is the link for my weekly blog. Kindly share your views.

Thank you.

@rcurtin
Copy link

rcurtin commented Jul 21, 2020

Awesome, thanks! If you like, it might be useful to include some example usages from R that now work, etc., for any reader who might not want to dig into the details. However, up to you. 👍

@Yashwants19
Copy link
Owner Author

Sure I will update the blog post.

@Yashwants19 Yashwants19 deleted the Add-R-documentation branch October 1, 2020 07:03
Yashwants19 pushed a commit that referenced this pull request Nov 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants