Initial Quick start Example for R-bindings. #12
Open
Yashwants19 opened this issue Jul 22, 2020 · 31 comments

@Yashwants19
Owner

Hello everyone,

We are almost ready with the R bindings, but one final review is still required. The other important task we are left with is the markdown documentation, which is not directly related to the package we are going to distribute.

For building the package from source, you can go through these steps:

git clone https://github.com/Yashwants19/mlpack.git
cd mlpack
git checkout R-bindings
mkdir build && cd build
cmake ..
make R                             # For building

Currently you can also install the R bindings from GitHub, which is pretty straightforward; you can just use:

R -e 'devtools::install_github("Yashwants19/RcppMLPACK")'

Let's look at a quickstart example. After installing the package, you can copy-paste this code directly into R to run it.

library("RcppMLPACK")

# Load the dataset from an online URL.  Replace with 'covertype.csv.gz' if you
# want to use the full dataset.
temp <- tempfile()
download.file("https://www.mlpack.org/datasets/covertype-small.csv.gz", temp)
df <- read.csv(temp)
unlink(temp)

# Split the labels.
labels <- df["label"]
dataset <- subset(df, select = -c(label))

# Split the dataset using RcppMLPACK.
output <- preprocess_split(
  input = dataset,
  input_labels = labels,
  test_ratio = 0.3,
  verbose = TRUE
)
training_set <- output$training
training_labels <- output$training_labels
test_set <- output$test
test_labels <- output$test_labels

# Train a random forest.
output <- random_forest(
  training = training_set,
  labels = training_labels,
  print_training_accuracy = TRUE,
  num_trees = 10,
  minimum_leaf_size = 3,
  verbose = TRUE
)
rf_model <- output$output_model

# Predict the labels of the test points.
output <- random_forest(
  input_model = rf_model,
  test = test_set,
  verbose = TRUE
)


# Now print the accuracy.  The third return value ('probabilities'), which we
# ignored here, could also be used to generate an ROC curve.
correct <- sum(output$predictions == test_labels)
cat(
  correct, "out of", length(test_labels), "test points correct",
  correct / length(test_labels) * 100.0, "%\n"
)

Thank You :).

@eddelbuettel
Collaborator

Fantastic stuff. Just ran it.

(Minor: Any chance we can rename these to not clash?

The following object is masked from 'package:stats':

    kmeans

The following object is masked from 'package:base':

    det

Caused by library(RcppMLPACK).

@eddelbuettel
Collaborator

eddelbuettel commented Jul 22, 2020

Minor edits just to "standard style" (which for long-time R users is what ESS encoded, these days of course everybody is using the RStudio defaults :-/)

I used data.table mostly for fread here; it can do much more. The column selection / removal is a little different. You can ignore all that as we don't "really" do any data splicing.

Yuge congrats on getting all this done without any other R dependencies. I really like that you stuck to standard data.frame objects ... as one should.

*Edit: I was sloppy with the code. Better version below now.

suppressMessages({
    library("RcppMLPACK")
    library("data.table")
})

# Load the dataset from an online URL.  Replace with 'covertype.csv.gz' if you
# want to use the full dataset.
#df <- fread("https://www.mlpack.org/datasets/covertype-small.csv.gz")
df <- fread("https://www.mlpack.org/datasets/covertype.csv.gz")

# Split the labels.
labels <- df[, .(label)]
dataset <- df[, label:=NULL]

# Split the dataset using RcppMLPACK.
prepdata <- preprocess_split(input = dataset, input_labels = labels,
                             test_ratio = 0.3, verbose = TRUE)

# Train a random forest.
output <- random_forest(training = prepdata$training,
                        labels = prepdata$training_labels,
                        print_training_accuracy = TRUE,
                        num_trees = 10,
                        minimum_leaf_size = 3,
                        verbose = TRUE)
rf_model <- output$output_model

# Predict the labels of the test points.
output <- random_forest(input_model = rf_model, test = prepdata$test, verbose = TRUE)

# Now print the accuracy.  The third return value ('probabilities'), which we
# ignored here, could also be used to generate an ROC curve.
correct <- sum(output$predictions == prepdata$test_labels)
cat(correct, "out of", length(prepdata$test_labels), "test points correct",
    correct / length(prepdata$test_labels) * 100.0, "%\n")

@Yashwants19
Owner Author

Fantastic stuff. Just ran it.
Yuge congrats on getting all this done without any other R dependencies. I really like that you stuck to standard data.frame objects ... as one should.

Thank You. :)

(Minor: Any chance we can rename these to not clash?

The following object is masked from ‘package:stats’:

  kmeans

The following object is masked from ‘package:base’:

   det

Caused by library(RcppMLPACK).

Here @rcurtin can tell us more about renaming these bindings.

@Yashwants19
Owner Author

Minor edits just to "standard style" (which for long-time R users is what ESS encoded, these days of course everybody is using the RStudio defaults :-/)

I used data.table mostly for fread here; it can do much more. The column selection / removal is a little different. You can ignore all that as we don't "really" do any data splicing.

Yuge congrats on getting all this done without any other R dependencies. I really like that you stuck to standard data.frame objects ... as one should.

This looks awesome. Thank you for the great edit!

@eddelbuettel
Collaborator

Regarding det and kmeans: "maybe we can't" because consistency across MLPACK builds also matters. But the message is a nag. I personally don't like them.

Food for thought: Maybe we keep det and kmeans "inside" RcppMLPACK but try not to export them under these names?

@Yashwants19
Owner Author

Regarding det and kmeans: "maybe we can't" because consistency across MLPACK builds also matters. But the message is a nag. I personally don't like them.

Food for thought: Maybe we keep det and kmeans "inside" RcppMLPACK but try not to export them under these names?

This sounds valid. I was thinking about the same thing. I am really bad at the naming part; should I rename kmeans and det to mlpack_kmeans and mlpack_det? Further, we can add a note that these two bindings have different names from the other mlpack bindings because the original names clash with stats and base and trigger a masking warning.

@coatless
Collaborator

How about uppercasing the d in det and the k in kmeans, to match the earlier decision of capitalizing the s in serialization?

@rcurtin
Collaborator

rcurtin commented Jul 23, 2020

Awesome! It works great.

I'd be a bit hesitant to do manual things like uppercasing some of the letters---ideally, we don't need to do any modification of the original bindings as produced by mlpack. Perhaps we should just prefix each name with mlpack_ like Yashwant suggested? I think that would be just fine to do that for all the mlpack methods, unless packages in R commonly use some kind of namespace concept like mlpack.kmeans(). It's likely we'll get other collisions too, eventually; mlpack has bindings with some very short names, like pca and nmf.

@eddelbuettel
Collaborator

May I suggest we all take a step back, take a deep breath, and count to sixty.

There are 16000 R packages. Collisions happen. There are things we can control, and others we cannot. We may well clash (as a package) with another package also providing pca or nmf. There is little we can do here short of prefixing everything, which is inelegant in an interactive environment such as R (which also happens to be a programming language).

But clashing with base R is another matter. Stomping on base R's det is a Certified Bad Idea (TM). Some "other" "unnamed" "large" software ecosystem within R was cavalier about this years ago; it is now seen as a mistake.

There are a few things we can, or cannot, do. The easiest may be to not export det and kmeans.[1] If we keep these package-local, everything still works but folks need to call these two as RcppMLPACK::det and RcppMLPACK::kmeans. That would be defensive. Prefixing is another option, as is just suffixing these two with a _. Or capitalizing.

There are other choices. I recommend discussing a few before baby-bathwater style jumping to broad conclusions.

[1] For argument's sake, a cleaner load as in

R> library(RcppMLPACK)
R>

without side-effects can be had by not exporting:

edd@rob:~/git/rcppmlpack-gsos2020(master)$ git diff  NAMESPACE
diff --git a/NAMESPACE b/NAMESPACE
index c799fc3..c4f98f3 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -9,7 +9,7 @@ export(cf)
 export(dbscan)
 export(decision_stump)
 export(decision_tree)
-export(det)
+#export(det)
 export(emst)
 export(fastmks)
 export(gmm_generate)
@@ -24,7 +24,7 @@ export(image_converter)
 export(kde)
 export(kernel_pca)
 export(kfn)
-export(kmeans)
+#export(kmeans)
 export(knn)
 export(krann)
 export(lars)
edd@rob:~/git/rcppmlpack-gsos2020(master)$ 

@rcurtin
Collaborator

rcurtin commented Jul 23, 2020

I'm not really picky on what the best way to handle this issue is, but, from my end, it is key that we don't manually change the names of just a couple of bindings if it's avoidable. All of those exported functions are automatically generated, and if we want to change just one or two names, we have to integrate that into our automatic binding generation system, which we should ideally avoid---it's kind of a kludge, and it breaks consistency with all of the rest of the bindings in other languages.

For instance, in Python it's mlpack.det(); in Julia it's mlpack.det(); from the command line it's mlpack_det; in Go it's mlpack.Det() (and all the functions have to be capitalized). So it's weird if we have, in R, mlpack_det() but then decision_stump(), or if for some functions we have to call them like mlpack::det() while just decision_stump() is fine for others. Det() vs. decision_stump() is also confusing.

I hope that what I'm writing here makes sense---consistency is key. So however @Yashwants19 wants to solve it is fine with me, so long as we are consistent in the way I described above. 👍

@eddelbuettel
Collaborator

I hear you. One option is also to leave things as they are, and you get consistently created names from MLPACK. Which is a plus.

Two of those clobber and hide base R functions. Which is a minus.

The rest, as always in life, is choices and trade-offs and reasonable people may disagree over best practices.

@eddelbuettel
Collaborator

eddelbuettel commented Jul 23, 2020

There is even this (and I since reverted to an unchanged NAMESPACE)

R> library("RcppMLPACK", exclude=c("det", "kmeans"))
R> 

That loads RcppMLPACK and all its exported functions except those listed in exclude=. The same can be done, I believe, via import() and importFrom() for inter-package relationships (but I'd have to check to know for sure).
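
As an untested sketch, a downstream package's NAMESPACE might then carry something like the following (import(..., except=) needs a reasonably recent R; the function names are just the ones from the quickstart above):

# Hypothetical NAMESPACE of a downstream package (not RcppMLPACK itself):
import(RcppMLPACK, except = c("det", "kmeans"))
# or only pull in what is actually needed:
importFrom(RcppMLPACK, preprocess_split, random_forest)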

@rcurtin
Collaborator

rcurtin commented Jul 23, 2020

Which do you think is "best" from the R perspective? (I realize it is a tradeoff 😄) If exclude= is commonly used and people will be okay with it, that seems fine to me; otherwise prefixing mlpack_ to exported symbols seems the easiest option.

@eddelbuettel
Collaborator

Nobody knows about exclude. My personal recommendation is to rename, or not to export. (But I am very hard nosed on this aspect.)

Let's take a break, maybe come back in a day.

@Yashwants19
Owner Author

Can we use mlpack.{binding_name} (mlpack.kmeans() or mlpack.linear_svm()), or BigCamelCase (Kmeans() or LinearSvm()) as suggested by Google's R Style Guide? I didn't find many packages using the BigCamelCase naming convention, though; it's just an idea, and I am not 100% sure about these naming conventions.

@rcurtin
Collaborator

rcurtin commented Jul 24, 2020

I've always thought camelcasing is a bit weird, because it's hard to get right---technically SVM is an acronym, so it should be LinearSVM. But we can only programmatically produce LinearSvm, which is a bit odd to me. Is mlpack.kmeans() or mlpack.linear_svm() possible? I like that idea. (I don't know the R language restrictions or how common a thing like that is, if it is possible.)

@coatless
Collaborator

@rcurtin we can prefix all functions with either mlpack_*() or mlpack.*(). The latter pattern, though, is usually reserved for S3 methods (S3 being one of many OO systems for R). In the Syntax section of the Google R Style Guide linked by @Yashwants19, this is a practice they are moving away from:

We previously recommended naming objects with dot.case. We’re moving away from that, as it creates confusion with S3 methods.
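
As a small, purely hypothetical illustration of that confusion: with a generic called mlpack, a function named mlpack.kmeans is picked up by S3 dispatch as the mlpack method for objects of class kmeans.

# Hypothetical names, only to show the S3 dispatch confusion.
mlpack <- function(x, ...) UseMethod("mlpack")
mlpack.kmeans <- function(x, ...) cat("S3 method for class 'kmeans'\n")

fit <- stats::kmeans(matrix(rnorm(100), ncol = 2), centers = 3)
mlpack(fit)    # dispatches to mlpack.kmeans(), not an mlpack binding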

I'm in favor of mlpack_*(). That said, style is a personal/organizational choice.

I will strongly say that we should avoid conflicting with any base R package. So, prefixing or capitalization seems to be where we need to go. (This change should be backported to the serialization() case.)

@eddelbuettel
Collaborator

eddelbuettel commented Jul 24, 2020

@rcurtin for the love of $deity please do not make this a snake_case vs camelCase discussion.

This. Is. A. Much. Narrower. Question.

We have det() and kmeans() clashing. Would it be possible to just talk about

  • living with it as consensus is possible
  • a way to fix it.

My personal preference would be to

  • leave everything as is for faithfully autogenerated bindings from MLPACK into RcppMLPACK
  • recognise the conflict with just these two and add one layer of indirection by possibly adding a post-processing step of
    • not exporting det() and kmeans() (by commenting them out in NAMESPACE and/or fixing the @export tag addition)
    • a one-time addition of manual wrappers with to-be-chosen names, maybe mlpackDet() and mlpackKmeans()
    • which simply dispatch into RcppMLPACK:::det() and RcppMLPACK:::kmeans() (which still exist but are unexported); a quick sketch is below
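
Such a wrapper could be as simple as this (untested sketch; the names are placeholders):

# Hypothetical exported wrappers dispatching to the unexported bindings.
mlpackDet <- function(...) RcppMLPACK:::det(...)
mlpackKmeans <- function(...) RcppMLPACK:::kmeans(...)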

This is the last message I plan to write on this as I have too many other bike sheds to paint 😁

@rcurtin
Collaborator

rcurtin commented Jul 24, 2020

This. Is. A. Much. Narrower. Question.

I really don't think it is in this case. I'm not sure how I can write this message in a way that can signal to you that I'm not in it for the sake of arguing but for the sake of finding the way forward, but please rest assured that I have attempted to write and rewrite this response about three times to figure it out. I hope I've done an okay job. We're on the same side and we have the same goals! I promise.

My personal preference is also the same as yours:

leave everything as is for faithfully autogenerated bindings from MLPACK into RcppMLPACK

But with a slight modification, which is the reason that I brought camelcase and capitalization and all the other options onto the table: we have control over the mlpack side, so we can actually avoid conflicts!

Based on what @coatless suggested, if we prefix with mlpack_, we get the best of both worlds:

  • everything is autogenerated; no manual handling necessary
  • there are no conflicts and certainly won't be in the future! 😄

So that would be my vote. If we all agree that that's a reasonable way forward, then I think we are set here. mlpack_*() is easy from the mlpack side and maintains coherence of bindings across different languages.

@eddelbuettel
Collaborator

eddelbuettel commented Jul 24, 2020

Needlessly clutters the user's namespace. Stylistically-speaking worse than the status quo. "Cleaner" but less usable. Sometimes the only way out. I don't think we are in that situation "but opinions about the shape of the earth differ". Also, vi and emacs.

@rcurtin
Collaborator

rcurtin commented Jul 24, 2020

👍 I see---since I have little R context, I wouldn't have understood those things. Thank you for clarifying. It is possible (although it implies some extra code to create) to maintain a "list" on the mlpack side of function names that don't get exported; we'd also have to make clear in the autogenerated documentation that (a) the name interferes with other R functionality and (b) it should be used as mlpack::kmeans(), etc.

I don't feel like I have enough R context to choose what's best here (my preference that I wrote above remains, but it's a weak preference). I think all the pros and cons of each approach are well documented now. Maybe @Yashwants19 should decide what he thinks is best here since he is the one of us who lives in "both worlds"? :)

@eddelbuettel
Collaborator

We can also punt for now and revisit later.

@coatless
Collaborator

Re-focusing on the markdown docs: @Yashwants19, let's place the above example into a vignette that ships with the package, and let's ship it "pre-made" instead of rendering it when the package is built. Could you try adding it in a similar way to the Rcpp package? https://github.com/RcppCore/Rcpp/tree/master/vignettes/rmd

@coatless
Collaborator

Also, do we want to ship data with the package? I noticed we're downloading data from a remote source:

temp <- tempfile()
download.file("https://www.mlpack.org/datasets/covertype-small.csv.gz", temp)
df <- read.csv(temp)

We could probably use examples from https://www.mlpack.org/datasets/ that are less than 1 MB in size. Glancing at the included R data sets and the example directory, I think only iris appears in both places?

data(package="datasets")
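
If we did go the built-in route, a rough, untested sketch with iris might look like this (it assumes the bindings accept an integer label column, as in the covertype example above):

library("RcppMLPACK")

# Built-in iris data instead of a download; labels as integers 1-3.
labels <- data.frame(label = as.integer(iris$Species))
dataset <- iris[, 1:4]

prep <- preprocess_split(input = dataset, input_labels = labels,
                         test_ratio = 0.3, verbose = TRUE)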

@eddelbuettel
Collaborator

Yes, I like pre-made vignettes as they are simpler. They rely on Sweave, which comes with R, and basic latex/pdf processing. Simpler in the sense of fewer files; examples are in the anytime package and others. And the basic "key" to how it works was written up by my pal Mark here in a short blog post -- in essence just a latex/pdf trick to point to an existing pdf file.

Please poke us if you have questions. I am doing the same in a few packages.

(Re the data set download: I think downloading it is perfectly fine for a static demo or even for a recurrent demo. On every build it would be a bit much. Ambivalent as to whether we need to ship that data set....)

@rcurtin
Collaborator

rcurtin commented Jul 26, 2020

I don't think we need to ship the dataset. This quickstart example is kind of intended as a "copy-pastable" example that you can immediately run and then perhaps adapt to your needs. (Plus, as the documentation points out, you can change covertype-small.csv.gz to covertype.csv.gz (which is much bigger) and see that mlpack will run well on large data. But I assume we wouldn't want to ship that larger csv.gz.)

@Yashwants19
Owner Author

Yashwants19 commented Jul 27, 2020

I am really sorry for the late reply. I think the vignettes are not required for building the package (from the mlpack repo), and they don't need any auto-generation. Hence it might be useful to add these files directly to the mlpack/mlpack-r repository (like mlpack.jl or mlpack-go), which will be created after we merge the R bindings into mlpack/master. Then I can directly open a PR against mlpack/mlpack-r, or else I can open a PR here (Yashwants19/RcppMLPACK) if merging the R bindings takes some time.

Further, I will follow the mentors' advice on this.

@rcurtin
Collaborator

rcurtin commented Jul 29, 2020

@Yashwants19 do you mean you'd like to ship covertype-small.csv.gz (and perhaps covertype.csv.gz) with the CRAN package for mlpack? Shipping both would be ~15MB; I think just the small one is about 2MB. I'll defer to Dirk and James on what we should do here.

By the way, if we can set up some scripts to directly build the R bindings and send them off to CRAN, a separate mlpack/mlpack-r repository might be unnecessary. But I think we can figure that out when we get there. 😄

@eddelbuettel
Collaborator

Subsample it one more time and include a 200kB (compressed) data set?

@rcurtin
Collaborator

rcurtin commented Jul 29, 2020

covertype-small.csv.gz contains 100k points, so, yeah, that could be reduced to far fewer points for sure. That covertype dataset has 7 classes, though, so we should make sure in any subsample that all 7 classes are represented.
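
Something along these lines could do the subsampling (untested sketch; it assumes a data frame df with a label column, as in the quickstart above, and a per-class cap picked to land near the target size):

# Stratified subsample: keep up to 1500 rows from each of the 7 classes.
set.seed(1)
idx <- unlist(lapply(split(seq_len(nrow(df)), df$label),
                     function(i) sample(i, min(length(i), 1500))))
covertype_tiny <- df[idx, ]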

@coatless
Collaborator

@Yashwants19 @rcurtin let's avoid shipping data. Doing so will likely open up a new can of worms. We can revisit on a subsequent CRAN release/update or during the revision period.

I think the key here would be to just ensure a pre-made vignette of the quickstart guide -- that sprouted from this issue -- is made available:

https://github.com/Yashwants19/mlpack/pull/10/files#diff-808b284b15b5e4942bed147539b9665d

Since it is pre-made, we're not having to worry about data sizes or including data inside the package.
