AutoKMeans produces 0 clusters #48

spsanderson · 2020-01-23T17:46:44Z

When using the following function with the following parameters:

AutoK_obj <- RemixAutoML::AutoKMeans(
    data = customer_product_tbl %>% select(-bikeshop_name)
    , KMeansK = 15
    , KMeansMetric = "tot_withinss"
    , GridTuneGLRM = TRUE
    , GridTuneKMeans = TRUE
    )

I get only 0 returned in the cluster column. Yet when I draw a scree plot, I can see that 3 or 4 clusters would be a good cutoff.

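library(tidyverse)  # dplyr, purrr, tibble, tidyr, ggplot2
library(broom)      # glance()
library(tidyquant)  # theme_tq()

kmeans_mapper <- function(centers = 3) {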
    
    # Body
    customer_product_tbl %>%
        select(-bikeshop_name) %>%
        kmeans(
            centers = centers
            , nstart = 100
        )
    
}
kmeans_mapper(3) %>% glance()

# Mapping the function to many elements
kmeans_mapped_tbl <- tibble(centers = 1:15) %>%
    mutate(k_means = centers %>% map(kmeans_mapper)) %>%
    mutate(glance = k_means %>% map(glance))

# Scree Plot ----
kmeans_mapped_tbl %>%
    unnest(glance) %>%
    select(centers, tot.withinss) %>%
    ggplot(
        mapping = aes(
            x = centers
            , y = tot.withinss
        )
    ) +
    geom_point() +
    geom_line() +
    ggrepel::geom_label_repel(mapping = aes(label = centers)) +
    theme_tq()

The data is in user-item matrix form.

customer_trends_tbl.xlsx


DougVegas · 2020-01-23T18:21:37Z

Hi @spsanderson

Looking into the issue now. However, I think the data set you provided isn't the correct one. You provided customer_trends_tbl, but the AutoKMeans example uses a data set named customer_product_tbl.

I can't reproduce the error because there is no column called bikeshop_name.

spsanderson · 2020-01-23T18:24:50Z

Oh boy, let me look into it and fix it.


spsanderson · 2020-01-23T18:26:51Z

customer_product_tbl.xlsx

This is the correct file

AdrianAntico · 2020-01-23T20:18:46Z

@spsanderson Why is the data in the form that it is? What do the values in each column represent?

spsanderson · 2020-01-23T20:19:43Z

Proportions of purchases of each bike model from a bikeshop. Is the function expecting a different form?
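For readers without the attachments, here is a minimal sketch of how a proportion-based user-item matrix like this can be derived from order lines in base R. The `orders` data frame, its column `model`, and all values are made up for illustration; only `bikeshop_name` and a quantity column are mentioned in this thread.

```r
# Hypothetical order lines: one row per purchase (names and values are made up)
orders <- data.frame(
  bikeshop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop B"),
  model         = c("Road", "Road", "Mountain", "Mountain", "Road"),
  quantity      = c(2, 1, 1, 3, 1)
)

# Sum quantity per shop and model: rows are shops, columns are models
qty_mat <- xtabs(quantity ~ bikeshop_name + model, data = orders)

# Convert each row to proportions so every shop's row sums to 1
prop_mat <- prop.table(qty_mat, margin = 1)

# Plain data frame with one row per shop, ready for clustering
customer_product <- as.data.frame.matrix(prop_mat)
print(customer_product)
```

Each row then describes one shop's purchase mix, which is the form the scree-plot code in the original post clusters on.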

AdrianAntico · 2020-01-23T20:21:49Z

I wouldn't aggregate the data before running k-means. I would use the transactional data

spsanderson · 2020-01-23T20:22:32Z

I will try it and report back

AdrianAntico · 2020-01-23T20:22:58Z

Sounds good

spsanderson · 2020-01-23T20:30:08Z

My data looks like the attached. Should I restrict my data to just the quantity column (this is what I am aggregating)?

bike_orderlines_tbl.xlsx

spsanderson · 2020-01-24T03:33:32Z

So with the following code and attached data I get 2 clusters, 0 and 1. It should really be at least 4, which is what I get from the method in the original post.

AutoK_obj <- RemixAutoML::AutoKMeans(
    data = customer_trends_tbl %>% select(-prop_of_total)
    , KMeansK = 15
    , KMeansMetric = "tot_withinss"
    , GridTuneGLRM = TRUE
    , GridTuneKMeans = TRUE
    )

customer_trends_tbl.xlsx

AdrianAntico · 2020-01-27T04:04:55Z

@spsanderson I would start tinkering with the arguments. Internally, a GLRM model from H2O is built first for dimensionality reduction. You select how many of its factors to keep, and those factors are passed to the k-means algorithm from H2O, which then searches for the optimal k on the GLRM factor data.
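The two-stage flow described above can be sketched roughly in base R. This is not the AutoKMeans implementation: PCA via `prcomp` is swapped in for GLRM, `stats::kmeans` is swapped in for the H2O k-means, and `n_factors` and `k_grid` are illustrative names chosen here.

```r
set.seed(42)

# Numeric features only; iris ships with base R
x <- scale(iris[, 1:4])

# Stage 1: dimensionality reduction (PCA here, standing in for GLRM)
n_factors <- 2
pca <- prcomp(x)
factors <- pca$x[, seq_len(n_factors), drop = FALSE]

# Stage 2: k-means on the reduced factors for each candidate k
k_grid <- 1:8
tot_withinss <- sapply(k_grid, function(k) {
  kmeans(factors, centers = k, nstart = 25)$tot.withinss
})

# Inspect the curve and pick k by eye (elbow)
print(data.frame(k = k_grid, tot_withinss = round(tot_withinss, 2)))
```

Because the total within-cluster sum of squares keeps falling as k grows, its minimum alone is not a useful selection rule. The number of factors kept in stage 1 also changes what stage 2 can separate, so it is worth varying both.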

If you go through the help file (?RemixAutoML::AutoKMeans), you can read up on what each argument does. The function is intended to be flexible for most kinds of data sets but you will want to try several settings if you don't already have a good idea of how to set it for your particular case.

This function is just a beginning for unsupervised learning. I spend most of my time working on the supervised learning stuff since I encounter it more often in practice, but I will get around to enhancing these at some point. If you are interested in helping out let me know.

AutoKMeans <- function(data,
                       nthreads        = 8,
                       MaxMem          = "28G",
                       SaveModels      = NULL,
                       PathFile        = NULL,
                       GridTuneGLRM    = TRUE,
                       GridTuneKMeans  = TRUE,
                       glrmCols        = c(1:5),
                       IgnoreConstCols = TRUE,
                       glrmFactors     = 5,
                       Loss            = "Absolute",
                       glrmMaxIters    = 1000,
                       SVDMethod       = "Randomized",
                       MaxRunTimeSecs  = 3600,
                       KMeansK         = 50,
                       KMeansMetric    = "totss") {

spsanderson · 2020-01-27T11:47:41Z

Thanks for the update. I will work through it and see what I can come up with.


spsanderson · 2020-01-27T17:19:37Z

Working through it. It seems that even on the Iris dataset the h2o::kmeans is only producing 2 clusters when we know there are 3. I forked and cloned the repo and will work on it.
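A quick baseline for that Iris check, using `stats::kmeans` from base R rather than H2O, is sketched below. It only confirms that three clusters are recoverable from the data with a fixed k, which helps separate a data problem from a library or settings problem. It does not reproduce the H2O behavior.

```r
set.seed(123)

# Baseline: base R k-means on the four numeric iris columns
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 50)

# Number of distinct cluster labels actually assigned
n_clusters <- length(unique(fit$cluster))
print(n_clusters)

# Cross-tabulate clusters against the known species
print(table(cluster = fit$cluster, species = iris$Species))
```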

AdrianAntico closed this as completed Feb 23, 2020
