AutoKMeans produces 0 clusters #48
Comments
Hi @spsanderson, looking into the issue now. However, I think the data set you provided isn't the correct one. You provided customer_trends_tbl, but the AutoKMeans example uses a data set named customer_product_tbl. I can't reproduce the error because there is no column called bikeshop_name. |
Oh boy, let me look into it and fix it.
Steven P Sanderson II, MPH
|
This is the correct file |
@spsanderson Why is the data in the form that it is? What do the values in each column represent? |
Proportions of purchases of each bike model from a bikeshop. Is the function expecting a different form? |
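For readers following along, here is a minimal base R sketch of how a proportions-style user-item matrix like the one described above could be built from transactional rows. The data and the `model` and `quantity` column names are assumptions made up for illustration (only `bikeshop_name` appears in this thread), and this is not the code used to produce the attached file.

```r
# Hypothetical transactional data: one row per order line.
# Column names other than bikeshop_name are assumptions for illustration.
orders <- data.frame(
  bikeshop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop B", "Shop C"),
  model         = c("Road", "Mountain", "Road", "Mountain", "Mountain", "Road"),
  quantity      = c(2, 1, 1, 3, 1, 4)
)

# Sum quantity per shop and model: rows are shops, columns are models.
counts <- xtabs(quantity ~ bikeshop_name + model, data = orders)

# Convert counts to row-wise proportions so each shop's row sums to 1.
user_item <- prop.table(counts, margin = 1)
print(round(user_item, 2))
```

Each row then describes what share of a shop's purchases went to each model, which is the aggregated form being discussed here, as opposed to the raw transactional rows in `orders`.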
I wouldn't aggregate the data before running k-means. I would use the transactional data |
I will try it and report back |
Sounds good |
My data looks like the attached. Should I make my data strictly the quantity column (this is what I am aggregating)? |
So with the following code and attached data I get 2 clusters, 0 and 1, but it should really be at least 4, which is what I get from the method posted in the original post.
|
@spsanderson I would start tinkering with the arguments. What's going on internally is that a GLRM model from H2O is built first (for the purposes of dimensionality reduction) and you select the number of factors from that to keep and pass on to the KMEANS algo from H2O, which will run to find the optimal k using the factors data from the GLRM.
If you go through the help file (?RemixAutoML::AutoKMeans), you can read up on what each argument does. The function is intended to be flexible for most kinds of data sets but you will want to try several settings if you don't already have a good idea of how to set it for your particular case.
This function is just a beginning for unsupervised learning. I spend most of my time working on the supervised learning stuff since I encounter it more often in practice, but I will get around to enhancing these at some point. If you are interested in helping out let me know.
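To make the two-stage idea above concrete, here is a rough base R sketch: reduce the columns to a few factors, then fit k-means on those factors for several values of k. It is only an analogy and not the RemixAutoML or H2O implementation: `prcomp` (PCA) stands in for the GLRM step, `stats::kmeans` stands in for the H2O k-means, the built-in `USArrests` data is used because the attached file is not reproduced here, and `n_factors` is an illustrative name loosely playing the role of `glrmFactors`.

```r
# Stage 1 (stand-in for GLRM): PCA on scaled numeric columns.
# NOTE: this is base R prcomp, not H2O's GLRM.
set.seed(42)
x <- scale(USArrests)
pca <- prcomp(x)
n_factors <- 2  # illustrative choice of how many factors to keep
factors <- pca$x[, seq_len(n_factors), drop = FALSE]

# Stage 2 (stand-in for the H2O k-means search): fit several k on the
# reduced factors and record between-cluster SS as a share of total SS.
k_grid <- 2:6
fits <- lapply(k_grid, function(k) kmeans(factors, centers = k, nstart = 25))
explained <- vapply(fits, function(f) f$betweenss / f$totss, numeric(1))
print(data.frame(k = k_grid, between_over_total = round(explained, 3)))
```

Changing `n_factors` or the range in `k_grid` and rerunning gives a feel for how sensitive the suggested number of clusters is to those settings, which is the kind of tinkering recommended above.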
|
Thanks for the update. I will go through it, take a look, and see what I
can come up with.
Steven P Sanderson II, MPH
…On Sun, Jan 26, 2020, 11:04 PM Adrian ***@***.***> wrote:
AutoKMeans <- function(data,
                       nthreads = 8,
                       MaxMem = "28G",
                       SaveModels = NULL,
                       PathFile = NULL,
                       GridTuneGLRM = TRUE,
                       GridTuneKMeans = TRUE,
                       glrmCols = c(1:5),
                       IgnoreConstCols = TRUE,
                       glrmFactors = 5,
                       Loss = "Absolute",
                       glrmMaxIters = 1000,
                       SVDMethod = "Randomized",
                       MaxRunTimeSecs = 3600,
                       KMeansK = 50,
                       KMeansMetric = "totss") {
|
Working through it. It seems that even on the Iris dataset the h2o::kmeans is only producing 2 clusters when we know there are 3. I forked and cloned the repo and will work on it. |
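As a point of comparison for the Iris check, here is a small baseline using base R `stats::kmeans` with k fixed at 3. This is not the H2O k-means that AutoKMeans calls, so it says nothing about where the 2-cluster result comes from; it only shows what a plain k-means with k = 3 does on the same data.

```r
# Baseline: base R k-means on the four scaled numeric Iris columns, k = 3.
# NOTE: stats::kmeans, not H2O; used only as a reference point.
set.seed(123)
fit <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Cross-tabulate cluster labels against the known species.
print(table(cluster = fit$cluster, species = iris$Species))
```

If the baseline separates the data into three groups while the AutoKMeans pipeline returns two, the difference more likely lies in how k is selected or in the reduction step than in the data itself.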
When using the following function with the following parameters:
I get only 0 returned in the cluster column. Yet when I run a scree plot I can see that 3 or 4 clusters would be a good cutoff.
The data is in user-item matrix form.
customer_trends_tbl.xlsx
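For anyone who wants to reproduce the kind of scree (elbow) check mentioned above without the attachment, here is a hedged sketch in base R. The matrix is synthetic: 30 made-up shops and 5 made-up models drawn from three preference profiles, all of which are assumptions for illustration and not the contents of customer_trends_tbl.xlsx.

```r
set.seed(1)

# Synthetic stand-in for a user-item matrix (NOT the attached data):
# 30 shops x 5 models, generated from three purchase-preference profiles.
profiles <- rbind(c(5, 1, 1, 1, 1),
                  c(1, 5, 1, 1, 1),
                  c(1, 1, 1, 5, 5))
raw <- t(sapply(rep(1:3, each = 10),
                function(g) rpois(5, lambda = 10 * profiles[g, ])))
user_item <- raw / rowSums(raw)  # row-wise proportions

# Total within-cluster sum of squares for k = 1..8.
wss <- vapply(1:8,
              function(k) kmeans(user_item, centers = k, nstart = 25)$tot.withinss,
              numeric(1))

# Scree plot: look for the k where the curve stops dropping sharply.
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")
```

Running the same loop on the real user-item matrix gives the curve to compare against whatever number of clusters AutoKMeans reports.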