
[MAHOUT-1856][WIP] create a framework for new Mahout Clustering, Classification, and Optimization Algorithms #246

Closed
wants to merge 21 commits

Conversation

@rawkintrevo (Contributor) commented Aug 4, 2016:

Relevant JIRA: https://issues.apache.org/jira/browse/MAHOUT-1856

Readme.md provides a more comprehensive (yet still incomplete) overview.

Key Points:
Top-level class:
Model has one method, fit, and exposes its coefficients as coefs.

Transformers map a vector input to a vector output (of the same or a different length).
Regressors map a vector input to a single output (e.g. a Double).
Classifiers extend Transformers: they take the probability vector a Transformer produces, 'select' the class, and return its label (instead of the entire p-vector).

Pipelines and Ensembles are models as well, except they are composed of the other models listed above, or of other pipelines and ensembles.
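Roughly, the hierarchy being described might look like the following sketch (names and signatures here are illustrative only, not the code in this PR):

import org.apache.mahout.math.{Vector => MahoutVector}
import org.apache.mahout.math.drm.DrmLike

// Hypothetical sketch of the hierarchy described above.
abstract class Model extends Serializable {
  // fitted parameters (e.g. coefficients), keyed by name
  val fitParams = collection.mutable.Map[String, MahoutVector]()
}

// Transformers: vector in, vector out (same or different length).
abstract class Transformer[K] extends Model {
  def fit(input: DrmLike[K]): Unit
  def transform(input: DrmLike[K]): DrmLike[K]
}

// Regressors: vector in, single value out per row.
abstract class Regressor[K] extends Model {
  def fit(drmX: DrmLike[K], drmY: DrmLike[K]): Unit
  def predict(drmX: DrmLike[K]): DrmLike[K]
}

// Classifiers 'select' the class from the probability vector and return its label.
abstract class Classifier[K] extends Transformer[K]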

ToDo:

  • All models need a uniform way to expose their tuning parameters -> this will be required for an auto-tuning algorithm.
  • Pipelines / Ensembles must be able to account for and report the tunable parameters of their sub-models.
  • Need fitness functions.
  • Native method wrappers: the underlying engines and third-party packages already implement many ML models, so let's not reinvent the wheel by exposing YET ANOTHER SGD algorithm. Instead, we should be able to convert a matrix to the expected format of the 'other' library, run its model, package the results back into a matrix, and pass them on in a pipeline or ensemble (this is especially useful for DeepLearning4J integration). Also, native implementations of some algorithms on a given engine are probably more efficient than implementations we would make, since they can leverage engine-specific tricks (think Flink delta iterators). A rough wrapper shape is sketched just after this list.
  • Lots more, open for discussion.
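The native-wrapper idea above might take a shape like this sketch (entirely hypothetical; externalFit stands in for whatever third-party routine is being wrapped, and none of these names are in the PR):

import org.apache.mahout.math.Matrix
import org.apache.mahout.math.drm._

// Hypothetical wrapper: bring the DRMs in-core, hand them to the external
// implementation, and return its in-core result for re-wrapping downstream.
class ExternalRegressorWrapper[K](externalFit: (Matrix, Matrix) => Matrix) {
  def fit(drmX: DrmLike[K], drmY: DrmLike[K]): Matrix = {
    val inCoreX = drmX.collect    // convert to the format the external library expects
    val inCoreY = drmY.collect
    externalFit(inCoreX, inCoreY) // run the external model, hand back its output
  }
}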

This is merely a conversation starter on what to do.

I've included OLS as an example regressor and a normalizer as an example transformer, only for illustrative purposes. I really don't want to pack too many algos into this initial commit, just an example / proof of concept so we can say "yes, this framework makes sense for this kind of model" or "ooh, we probably want to have these features too."

@andrewpalumbo (Member):

Thanks @rawkintrevo.. This is a great start. I'd originally thought that we'd put the algos in the math-scala module, but looking at it, I think this makes sense.

@andrewpalumbo (Member) commented Aug 5, 2016:

Nice work on the slides also! 💯

Another thought I had is that we may want to allow in-core matrices as parameters. Just throwing it out there for discussion. I can't think of a particular use case off the top of my head, but it seems that there should be one.
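For what it's worth, a minimal sketch of what that could look like (hypothetical trait, just to make the idea concrete):

import org.apache.mahout.math.Matrix
import org.apache.mahout.math.drm.DrmLike

// Admit in-core matrices as parameters via overloads next to the DRM signatures.
trait InCoreFittable[K] {
  def fit(drmX: DrmLike[K], drmY: DrmLike[K]): Unit
  def fit(inCoreX: Matrix, inCoreY: Matrix): Unit
}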

@rawkintrevo changed the title from "[MAHOUT-1856][WIP] reate a framework for new Mahout Clustering, Classification, and Optimization Algorithms" to "[MAHOUT-1856][WIP] create a framework for new Mahout Clustering, Classification, and Optimization Algorithms" on Aug 8, 2016
@rawkintrevo (Contributor, Author):

This is WIP so it doesn't really matter that it's failing at the moment, but this isn't good: the build couldn't find Maven via the wget command.

Building framework

add placeholders for ensemble pipeline and fitness test

added readme

fixed readme image

fixed readme image

fixed readme image

removed autogen comments from files


updated pom.xml for 0.12.3-SNAP

Added MeanSubtract

[MAHOUT-1856] Added Subtract Mean

[MAHOUT-1856] Added Durbin Watson Test

[MAHOUT-1856] Refactored to use fitParams

[MAHOUT-1856] Added Factorize refactored others

[MAHOUT-1856] Added Factorize Test

[MAHOUT-1856] Refactor to scala-math

[MAHOUT-1856] Rename Normalizer to StandardScaler

[MAHOUT-1856] Don't collect for ols

[MAHOUT-1856] Add Several Fittness Tests

[MAHOUT-1856] Clean up files

[MAHOUT-1856] Remove old algos
@andrewpalumbo (Member):

@dlyubimov, @mahout-team could you review/provide feedback on this? Originally Trevor had a separate module for this, and I asked him to move it into math-scala.

@andrewpalumbo (Member) commented Jan 16, 2017:

I should note that my above comment was meant as a question to all concerned: does this (the algorithms base-traits package) belong in math-scala or in its own module? Trevor initially had it in its own module; I asked that he move it to math-scala, due mostly to size concerns for the entire binary distribution. However, I am now questioning whether that is the correct call. I can think of several good reasons that it should have its own module (and others that it should be in math-scala). For now, I think that math-scala is an OK place, and we can just move it to another module later if we want. @smarthi @pferrel @rawkintrevo @dlyubimov please weigh in here. Not a big deal right now, but thinking ahead it may be good to get the location right.


abstract class Model extends Serializable {

var fitParams = collection.mutable.Map[String, MahoutVector]()
Contributor:

So Model requires all parameters to be named and to be vectors? Shouldn't this be an artifact of more specialized models, like, say, GLMs? There are plenty of ML models that would probably not fit that fairly rigid definition, at least not easily or pragmatically.

regressor.fit(drmY, drmX)
fitParams("beta0") = regressor.fitParams("beta")

val Y = drmY(1 until drmY.nrow.toInt, 0 until 1).checkpoint()
Contributor:

Consider giving the caller an option to specify the cache hint here, since this algorithm clearly relies on plenty of things being put into memory. Right now this implies memory only, so if the data doesn't fit, the algorithm slows to a crawl. (In all fairness, in Spark it would crawl with a memory-and-disk spec too, but to put things in perspective, we are probably talking about the difference between a crawling snail and a snail skeleton 20 years after its death.)
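One possible shape for that, as a sketch that assumes Mahout's CacheHint enumeration (the trait name is made up):

import org.apache.mahout.math.drm.{CacheHint, DrmLike}

// Expose the cache hint as a fit parameter instead of hard-coding memory-only.
trait CacheAwareRegressor[K] {
  def fit(drmX: DrmLike[K],
          drmY: DrmLike[K],
          cacheHint: CacheHint.CacheHint = CacheHint.MEMORY_ONLY): Unit
}

// A caller whose data won't fit in memory could then pass, e.g.:
//   regressor.fit(drmX, drmY, CacheHint.MEMORY_AND_DISK)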

*/
abstract class Regressor extends Model {

def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
Contributor:

Perhaps making the first parameter the predictors and the second the target would be a more intuitive signature for most.

@dlyubimov (Contributor) commented Jan 17, 2017:

Also, maybe drmT for the target?

Also: why [Int] only? In any kind of fitter, it should be fit[K](...).

There's no reason why the correspondence between predictor and target cannot be established on any type of key in the general case.

Contributor:

Also: I am not sure the fitter should extend a model. Rather, the fitter should return a model, i.e. fit[K](...): Model, right? I think that's how this pattern goes in most kits. The fitter is just a strategy.

And I'd abstain from using abstract classes in Scala unless a trait absolutely cannot do the job (and it can in this case). An abstract class points to a specific, single, necessary base implementation in the hierarchy, which is too constraining without need for the actual implementations.

Contributor (Author):

+1 on swapping Y and X; I've had to catch myself more than once on that already. I think the original motivation was a tip of the hat to R's Y ~ x, but I agree with you.

Re [Int]: I realized that later, but haven't gone through and swapped them all back to K. It is (or was) K in some places.

Contributor:

I guess if this is abstract enough, we also need to be able to admit hyperparameters, which are of course specific to every fitter. In R this is trivial (any call can be made a bag of whatever named arguments), but in Scala this may need a bit of thought (if this abstraction needs to be that high). Otherwise, I guess most Scala kits just create a concrete fit signature per implementation.

If the Regressor trait is meant to be common to all possible regression-class algorithms, we either need a way to universally pass in the hyperparameters, or just not have a fit abstraction in the regressor trait at all (then what, I guess :) ).

@dlyubimov (Contributor) commented Jan 19, 2017:

Like perhaps fit[K](X: DrmLike[K], T: DrmLike[K], (Symbol, Any)*): Model, where the optional varargs list is the hyperparameter list. Then hyperparameterized calls could be something like:

fit(X, Y, 'alpha -> 10.0, 'lambda -> 1e-15)

etc.
This loses the strong typing of the signature, but the call is not that ugly, and it might be OK if specific implementations are cleanly documented.

@rawkintrevo (Contributor, Author):

Thanks for the quick review @dlyubimov, will incorporate your suggestions.


var residuals: DrmLike[K] = _

def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
Member:

I would think that fit(..) should be able to return a list of errors per sample, so possibly def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Any would be a better signature for the trait.

Member:

As well, drmFeatures is somewhat confusing: the Drm is a matrix of samples x features, i.e. the features are columns. I think something like drmSamples, drmObservations, or even drmX may be more straightforward.

Contributor (Author):

By errors, do you mean code errors, or the residuals?

Member:

Yes, e.g. residuals, a Hessian matrix, etc.: anything that a user developing their own algorithm might like to see returned from their own fit(..) method. My point was that there's no reason to limit the return value to Unit; rather, I'd think Any would be a more appropriate return value in the base trait. Though maybe this would not work with the pipeline structure that you're setting up?

Contributor (Author):

@andrewpalumbo (re drmFeatures vs drmX) and @dlyubimov:
We're mixing conventions. I feel like, for consistency, we should either use drmFeatures and drmTarget (similar to Spark ML) or drmX and drmY (similar to sklearn and R). Leaving it for now, but open to debate; I have a slight bias towards drmX, drmY.

var addIntercept: Boolean = hyperparameters.asInstanceOf[Map[String, Boolean]].getOrElse("addIntercept", true)

var summary = ""
def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit = {
Member:

Continuing on from our discussion on Slack, I would think that fit may be a more appropriate place for hyperparameters, e.g.:

fit(observed_independent: Drm[K], observed_targets: Drm[K], hyperparameters: Option[List[Double]]): List[Double]

I think this may be a matter of convention, so if you're following a convention that I am not familiar with, this may be fine. However, I feel that this may be more robust.

d = 2 : no auto-correlation
d > 2 : negative auto-correlation
*/
def DurbinWaton[K](model: Regressor[K]): Regressor[K] = {
Member:

misspelling- "DurbinWatson"


def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] ={
if (!isFit) {
//throw an error
Member:

Is this complete? I.e., do you want to throw an error here for this release?

def invTransform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {

if (!isFit) {
//throw an error
@andrewpalumbo (Member) commented Jan 22, 2017:

Same question here as before: do you want to throw an error here for 0.13.0?

def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {

if (!isFit) {
//throw an error
@andrewpalumbo (Member) commented Jan 22, 2017:

I find this a bit confusing (as well as the other mentions of if (!isFit) { // throw }). So if I want to standardize a Drm to (mean = 0, std_dev = 1), I would need to do something like:

val drmStandardized = StandardScaler(unscaledDrm).fit().transform()

E.g., does the Transformer trait need to extend Model? If so, we do need to throw/catch here if fit() is not called by the user. Or possibly just move fit() into the constructor in this case.

Member:

As well, I think that this could be another argument for moving hyperparameters into fit(...). E.g., if for some reason we wanted to standardize to N(mean = 0, stdDev = 2), we could still call StandardScaler and fit(Map("mu" -> 0, "sigma" -> 2)):

val drmStandardized = StandardScaler(unscaledDrm).fit(Map("mu" -> 0, "sigma" -> 2)).transform()

Contributor (Author):

Correct, and agreed: the sklearn convention is to also have a fitTransform method to do it in one step.

Will add that.
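A sketch of that convenience method (the trait name and signatures are hypothetical, not the PR's actual Transformer):

import org.apache.mahout.math.drm.DrmLike

trait TransformerSketch[K] {
  def fit(input: DrmLike[K]): Unit
  def transform(input: DrmLike[K]): DrmLike[K]

  // fit and transform in a single step, so callers can't forget to fit first
  def fitTransform(input: DrmLike[K]): DrmLike[K] = {
    fit(input)
    transform(input)
  }
}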

(estimate - Ranswers).sum should be < epsilon

}

Member:

It would be good to have a couple more tests here; at least one for transform(...).

@andrewpalumbo (Member):

Trevor, this looks really good to me. I've left some comments, mainly about hyperparameters being moved from the model to fit(...); I think that makes sense in many ways. E.g., when doing a highly iterative hyperparameter search, it would eliminate a good amount of overhead to call:

aModel.fit(....,HyperParameters: Map["hParameter1" -> value , "hParameter2" -> ...])

rather than re-constructing the entire class each time. As I noted inline, I also think that the fit(...) method should have the ability to return at least a List[Double] of errors per row if needed, so I would suggest that it return Any rather than Unit in the base traits (unless the convention that you're following is to rely on predict for this).
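To illustrate the overhead point, here is a sketch of a search loop that reuses one instance (ParamMapRegressor and the "lambda"/"beta" names are hypothetical stand-ins, not anything in the PR):

import org.apache.mahout.math.{Vector => MahoutVector}
import org.apache.mahout.math.drm.DrmLike

trait ParamMapRegressor[K] {
  def fit(drmX: DrmLike[K], drmY: DrmLike[K], params: Map[String, Any]): Unit
  def fitParams: collection.mutable.Map[String, MahoutVector]
}

object GridSearchSketch {
  // Same model instance, new parameters on every pass.
  def search[K](model: ParamMapRegressor[K],
                drmX: DrmLike[K],
                drmY: DrmLike[K],
                lambdas: Seq[Double]): Seq[(Double, MahoutVector)] =
    lambdas.map { lambda =>
      model.fit(drmX, drmY, Map("lambda" -> lambda))
      lambda -> model.fitParams("beta") // keep the fitted coefficients
    }
}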

@andrewpalumbo (Member):

I want to emphasize, though, that if you're building this up to follow a convention (sklearn-like, MLlib, etc.) which I am not familiar with, that may be better to follow than my suggestions, to make this framework as easy on new users as possible.

@rawkintrevo (Contributor, Author) commented Jan 22, 2017:

Thanks for the review @andrewpalumbo !

sklearn parameters are set when the Estimator is instantiated.

MLlib, on the other hand, passes parameter maps in fit as you suggest.

Both, however, allow hyperparameters to be updated. And in the case you refer to, the model would not be re-instantiated; it would be something like this:

model.param1 = 1
model.fit(X, y)
model.param1 = 2
model.fit(X, y)

To your point, I also want to make this as easy as possible for new users, so I think it would be best to leave the option to pass a parameter map at instantiation, and also expose it as an optional parameter of the fit method.

It also seems it would be logical (to avoid duplicating code) to have a setHyperparameters(paramMap: Map[String, Any]) method:

val paramSet1 = Map( "param1" -> 1 )
model.setHyperparameters(paramSet1)

This would be called on the map at instantiation, again if a Map is passed in fit or whenever the user wants, and someday by a CV method.
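A sketch of how the two entry points could share one code path (hypothetical names; fitImpl is just a placeholder for the concrete algorithm):

import org.apache.mahout.math.drm.DrmLike

abstract class HyperparameterizedModel[K](initialParams: Map[String, Any] = Map.empty) {

  protected var hyperparameters: Map[String, Any] = Map.empty
  setHyperparameters(initialParams)

  // one place where parameters are set: at construction, in fit, or by a CV method
  def setHyperparameters(paramMap: Map[String, Any]): Unit = {
    hyperparameters ++= paramMap
  }

  def fit(drmX: DrmLike[K], drmY: DrmLike[K],
          paramMap: Map[String, Any] = Map.empty): Unit = {
    setHyperparameters(paramMap) // fit-time overrides win
    fitImpl(drmX, drmY)
  }

  // concrete models supply the actual fitting logic
  protected def fitImpl(drmX: DrmLike[K], drmY: DrmLike[K]): Unit
}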

@andrewpalumbo (Member):

@rawkintrevo Great- best of both worlds then!

drmTarget: DrmLike[K],
hyperparameters: Map[String, Any] = Map("" -> None)): Unit = {

var hyperparameters: Option[Map[String,Any]] = None
Contributor (Author):

there should be a setHyperparameters right about here...

@dlyubimov (Contributor) commented Jan 24, 2017:

If not too much hassle, please consider using the unicode characters →, ⇒, ← (for new code).
In IntelliJ this is easily facilitated by adding substitution 'live' templates:
(screenshot of IntelliJ live-template settings)


def fit(drmFeatures: DrmLike[K],
drmTarget: DrmLike[K],
hyperparameters: Map[String,Any] = Map("" -> None)): Unit
Contributor:

Just in case it was not quite clear, my suggestion was to use hyperparameters: (Symbol, Any)*. First, symbols are faster for something that is meant to be an id, and second, they have a somewhat more palatable notation, e.g.:

val model = fit(X, Y, 'k -> 10, 'alpha -> 1e-5)

and the signature overall:

def fit(drmFeatures: DrmLike[K],
        drmTarget: DrmLike[K],
        hyperparameters: (Symbol, Any)*): Model

Of course the implementation can easily get a map, should it need to:

  val hmap = hyperparameters.toMap

That's actually a Scala pattern I developed and used in a similar situation.
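For completeness, a sketch of how an implementation might consume that form (the trait name, defaults, and parameter symbols are made up for illustration):

import org.apache.mahout.math.drm.DrmLike

trait VarargsFitSketch[K] {

  def fit(drmX: DrmLike[K], drmY: DrmLike[K], hyperparameters: (Symbol, Any)*): Unit = {
    val hmap   = hyperparameters.toMap
    val alpha  = hmap.getOrElse('alpha, 1.0).asInstanceOf[Double]   // made-up default
    val lambda = hmap.getOrElse('lambda, 0.0).asInstanceOf[Double]  // made-up default
    fitImpl(drmX, drmY, alpha, lambda)
  }

  // concrete models supply the actual fitting logic
  protected def fitImpl(drmX: DrmLike[K], drmY: DrmLike[K],
                        alpha: Double, lambda: Double): Unit
}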

/**
* Abstract of Regressors
*/
trait Regressor[K] extends Model {
Contributor:

Regressor still extends Model? Regressor's fit IMO should be a factory method w.r.t. the model instead.

Contributor (Author):

I agree that Spark and Flink follow the paradigm you are suggesting; however, sklearn doesn't. If we're just going off of what others do, then yes, we should probably follow the conventions of the other larger, Scala-based "Big Data" packages. However, I can't understand WHY they do it that way. It makes the code hard to read/follow, and I assume it is an artifact of all the serialization and the way they execute models (having to ship objects around for map/reduce phases); that is to say, they do it because they are forced to, at the expense of readability.

In Mahout, most of that is taken care of at the distributed-engine level.

If we start going down the rabbit hole of "do as Spark and Flink do" we may find ourselves with an entire class just for the summary of a linear model. I, for one, want to stay as far away from that as possible. I'd like to see algorithm code (these and future) be as succinct and tractable as possible so that:

  1. New contributors aren't intimidated (that is to say, they are encouraged to contribute algorithms).
  2. Those algorithms can be easily reviewed and maintained with minimal Scala knowledge (requiring deep Scala knowledge limits the pool of willing and able contributors who understand the actual math at play).

That isn't to say that, at the end of the day, your proposal is incorrect; you usually are correct, and I value and appreciate you taking the time to review. I am saying that "I think that's how this pattern goes in most kits" is neither necessary nor sufficient, IMO, as in some respects I'm explicitly trying to avoid the approach of other packages; in this case, refactoring something to be more complex with no clear understanding of the benefit.

Contributor:

In most kits I know (including two I am working on right now myself) the pattern is that fit returns a model object. scikit seems to be the outlier on this.

Also, I think the approach "package A does it like this and we do/don't like it, therefore it is good/bad" is a dogma fallacy. IMO we just need to do what makes sense.

And it makes sense to me to serialize or persist the model, not the (fitting algorithm + model). The latter will cause problems on both the user and implementation ends IMO.

@andrewpalumbo (Member) commented Jan 29, 2017:

+1 to model persistence, though we need some thinking here. We could persist/read the model as a set of Drms, since we have native-to-Mahout support for this, i.e. drm.dfsWrite(..), dfsRead(...). This can be quite clunky though (see the NaiveBayes model.persist(...), e.g.).

d > 2 : negative auto-correlation
*/
def DurbinWatson[K](model: Regressor[K]): Regressor[K] = {
val e: DrmLike[K] = model.residuals(1 until model.residuals.nrow.toInt, 0 until 1)
Contributor:

nrow.toInt is generally dangerous, as it does not catch the algorithm's limitation. The problem with this (and I have actually run into it before) is that the algorithm in this case obviously has a limitation of 2 billion rows, and it should explicitly fall apart once that limit is reached, instead of silently producing nonsense. I think there's a method specifically for this purpose in our math-scala drm package, safeToNonNegInt, which throws IllegalArgument if the conversion loses significant bits.

Of course the best approach is to avoid such limitations in the first place, but if they are unavoidable, please use safeToNonNegInt.

val Y = drmTarget(1 until drmTarget.nrow.toInt, 0 until 1).checkpoint(cacheHint)
val Y_lag = drmTarget(0 until drmTarget.nrow.toInt - 1, 0 until 1).checkpoint(cacheHint)
val X = drmFeatures(1 until drmFeatures.nrow.toInt, 0 until 1).checkpoint(cacheHint)
val X_lag = drmFeatures(0 until drmFeatures.nrow.toInt - 1, 0 until 1).checkpoint(cacheHint)
Contributor (Author):

Missed all of these, but have since updated with safeToNonNegInt(...).
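As a sketch of that change (assuming the safeToNonNegInt helper is visible from algorithm code, and borrowing the slicing pattern quoted above):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

object SafeSlicingSketch {
  // Bound-check the row count instead of silently truncating a Long to an Int.
  def lagSlices(drmTarget: DrmLike[Int], cacheHint: CacheHint.CacheHint) = {
    val n    = safeToNonNegInt(drmTarget.nrow)
    val y    = drmTarget(1 until n, 0 until 1).checkpoint(cacheHint)
    val yLag = drmTarget(0 until n - 1, 0 until 1).checkpoint(cacheHint)
    (y, yLag)
  }
}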


def invTransform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {

if (!isFit){
Member:

Style point: if (!isFit){ needs a space after the closing paren, before the brace, i.e. if (!isFit) {.
Here and in other places.

Contributor (Author):

isFit has been more or less abstracted away by the Model/ModelFactory paradigm. Need to clean this up.

@@ -0,0 +1,8 @@
package org.apache.mahout.math.algorithms
Member:

File needs License

@@ -0,0 +1,22 @@
package org.apache.mahout.math.algorithms.regression
Member:

License

@@ -0,0 +1,8 @@
package org.apache.mahout.math.algorithms.regression.tests
Member:

license

@@ -0,0 +1,8 @@
package org.apache.mahout.math.algorithms.regression.tests

/**
Member:

Need to remove all references to your name (auto-generated by IntelliJ, etc.).

@@ -0,0 +1,135 @@
package org.apache.mahout.math.algorithms
Member:

license

val drmY = drmData(::, 4 until 5)

val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY, 'calcStandardErrors → false)

@andrewpalumbo (Member) commented Jan 29, 2017:

If I understand the new factory method correctly, could this just be something like:

val olsModel = OrdinaryLeastSquares[Int].fit(drmX, drmY, 'calcStandardErrors → false)

?

Member:

Looking closer at the code, I see that I was wrong in my last comment. Do we need helper objects for this?

.gitignore Outdated
@@ -20,3 +20,6 @@ mr/temp
temp
foo
math-tests/
derby.log
bin/*
@andrewpalumbo (Member) commented Jan 29, 2017:

Is this meant to be added? I don't think we want bin/* in .gitignore.

Contributor (Author):

Nope, IntelliJ run amok.


import scala.reflect.ClassTag

class AsFactor extends PreprocessorModelFactory{
Member:

Space needed before the opening brace, i.e. PreprocessorModelFactory {.

}

override def invTransform[K](input: DrmLike[K]) = {
// not yet implemented
Member:

throw NotImplementedException here?

var colCentersV: MahoutVector = means

def setCenters(centers: MahoutVector): Unit ={
if (means.length != centers.length){
Member:

space before opening brace here and line above

if (addIntercept) {
X = X cbind 1
}
val drmXtXinv = solve(X.t %*% X)
@andrewpalumbo (Member) commented Jan 29, 2017:

Do we have any way around pulling X.t %*% X into memory here? I thought that recently you'd said that this was being done differently.

Contributor (Author):

Short answer: you can solve for beta without doing this, but not the var/covar matrix. I can refactor so that if 'calcStandardErrors -> false it does it the distributed way.

Also, so long as XtX isn't too fat here, this isn't going to be that brutal. Even 50k columns dense results in a fairly manageable matrix (50k x 50k).

Member:

Hmm, that might be useful: the refactor for the case where the covariance matrix isn't needed. You could also calculate the covariance matrix in a distributed manner on the out-of-core Drms. My concern is that people love to hammer on new features, although the point is moot if the answer is to use thinQR(..) to solve, which is also bottlenecked by an in-memory X.t %*% X.

Maybe a question for after the platform is finished...

@@ -28,10 +28,11 @@ import org.apache.mahout.math.drm.{drmBroadcast, _}
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings.{dvec, _}
import org.apache.mahout.math.scalabindings.RLikeOps._
import sun.reflect.generics.reflectiveObjects.NotImplementedException
Member:

My mistake in my previous comment: this should be UnsupportedOperationException.

def apply(f: Vector): Double = f.max
}))
})(0, ::)
new AsFactorModel(factorMap.sum.toInt, factorMap)
Contributor:

Style nitpicking: since this really does not overload a trait-signed method, it would be most readable if the return type were explicitly stipulated in the declaration:

def fit[K](input: DrmLike[K]): MyModelType = {

@asfgit asfgit closed this in 9a31923 Feb 1, 2017