AutoKMeans produces 0 clusters #48

Closed
spsanderson opened this issue Jan 23, 2020 · 13 comments

@spsanderson

spsanderson commented Jan 23, 2020

When using the following function with the following parameters:

AutoK_obj <- RemixAutoML::AutoKMeans(
    data = customer_product_tbl %>% select(-bikeshop_name)
    , KMeansK = 15
    , KMeansMetric = "tot_withinss"
    , GridTuneGLRM = TRUE
    , GridTuneKMeans = TRUE
    )

I get only 0 returned in the cluster column. Yet when I run a scree plot, I can see that 3 or 4 clusters would be a good cutoff.

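# Packages used below: dplyr/tibble/purrr/tidyr/ggplot2 (via tidyverse),
# broom for glance(), tidyquant for theme_tq()
library(tidyverse)
library(broom)
library(tidyquant)

kmeans_mapper <- function(centers = 3) {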
    
    # Body
    customer_product_tbl %>%
        select(-bikeshop_name) %>%
        kmeans(
            centers = centers
            , nstart = 100
        )
    
}
kmeans_mapper(3) %>% glance()

# Mapping the function to many elements
kmeans_mapped_tbl <- tibble(centers = 1:15) %>%
    mutate(k_means = centers %>% map(kmeans_mapper)) %>%
    mutate(glance = k_means %>% map(glance))

# Scree Plot ----
kmeans_mapped_tbl %>%
    unnest(glance) %>%
    select(centers, tot.withinss) %>%
    ggplot(
        mapping = aes(
            x = centers
            , y = tot.withinss
        )
    ) +
    geom_point() +
    geom_line() +
    ggrepel::geom_label_repel(mapping = aes(label = centers)) +
    theme_tq()

The data is in user-item matrix form.

customer_trends_tbl.xlsx

@DougVegas
Contributor

Hi @spsanderson

Looking into the issue now. However, I don't think the data set you provided is the correct one. You provided customer_trends_tbl, but the AutoKMeans example uses a data set named customer_product_tbl.

I can't reproduce the error because there is no column called bikeshop_name.

@spsanderson
Author

spsanderson commented Jan 23, 2020 via email

@spsanderson
Author

customer_product_tbl.xlsx

This is the correct file

@AdrianAntico
Owner

@spsanderson Why is the data in the form that it is? What do the values in each column represent?

@spsanderson
Author

spsanderson commented Jan 23, 2020

The values are the proportions of purchases of each bike model from a bikeshop. Is the function expecting a different form?
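For readers following along, here is a minimal sketch of how a matrix of this shape could be built from transactional rows with base R. Only `bikeshop_name` and a quantity column are mentioned in this thread; the toy rows, the `model` column name, and the shop and model labels are assumptions made up for illustration, not the attached data.

```r
# Toy transactional data. Every value here is invented for illustration.
orders <- data.frame(
  bikeshop_name = c("shop_a", "shop_a", "shop_a", "shop_b", "shop_b", "shop_c"),
  model         = c("road",   "trail",  "road",   "trail",  "trail",  "road"),
  quantity      = c(2, 1, 1, 3, 1, 4)
)

# Sum quantity per shop and model: rows are shops, columns are models
qty_matrix <- xtabs(quantity ~ bikeshop_name + model, data = orders)

# Convert each row to proportions of that shop's total purchases
prop_matrix <- prop.table(qty_matrix, margin = 1)

# Wide data frame with one row per shop (the user-item form)
customer_product_wide <- as.data.frame.matrix(prop_matrix)
print(customer_product_wide)
```

Each row sums to 1, so the clustering sees only the mix of models a shop buys, not its overall volume.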

@AdrianAntico
Owner

I wouldn't aggregate the data before running k-means. I would use the transactional data

@spsanderson
Author

I will try it and report back

@AdrianAntico
Owner

Sounds good

@spsanderson
Author

My data looks like the attached file. Should I reduce my data to just the quantity column (this is what I am aggregating)?

bike_orderlines_tbl.xlsx

@spsanderson
Author

So with the following code and attached data I get 2 clusters, 0 and 1. It should really be at least 4, which is what I get from the method in the original post.

AutoK_obj <- RemixAutoML::AutoKMeans(
    data = customer_trends_tbl %>% select(-prop_of_total)
    , KMeansK = 15
    , KMeansMetric = "tot_withinss"
    , GridTuneGLRM = TRUE
    , GridTuneKMeans = TRUE
    )

customer_trends_tbl.xlsx

@AdrianAntico
Owner

@spsanderson I would start tinkering with the arguments. Internally, a GLRM model from H2O is built first for dimensionality reduction. You select the number of factors to keep from it, and those factors are passed on to the k-means algorithm from H2O, which searches for the optimal k using the factors data from the GLRM.
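The two-stage flow described above can be tried directly against h2o to see what each stage produces. This is a rough sketch and not the code inside AutoKMeans: the iris columns stand in for the real data, and the rank of 2, the seed, the range of k, and the cluster settings are all assumptions chosen only to keep the example small.

```r
library(h2o)

# Start a small local H2O cluster (sizes are illustrative)
h2o.init(nthreads = 2, max_mem_size = "2G")

# Stand-in data: the four numeric iris columns (an assumption, not the user's data)
hf <- as.h2o(iris[, 1:4])

# Stage 1: GLRM for dimensionality reduction, keeping 2 factors
glrm_fit <- h2o.glrm(
  training_frame = hf,
  k              = 2,
  loss           = "Absolute",
  max_iterations = 1000,
  svd_method     = "Randomized",
  seed           = 1
)

# The low-rank factors are stored as a separate H2O frame
factors <- h2o.getFrame(glrm_fit@model$representation_name)

# Stage 2: k-means on the factors for a range of k, recording total within-cluster SS
ks <- 1:6
tot_withinss <- sapply(ks, function(k) {
  km <- h2o.kmeans(training_frame = factors, x = colnames(factors), k = k, seed = 1)
  h2o.tot_withinss(km)
})
print(data.frame(k = ks, tot_withinss = tot_withinss))

# Count distinct cluster labels for one choice of k
km3 <- h2o.kmeans(training_frame = factors, x = colnames(factors), k = 3, seed = 1)
assignments <- as.data.frame(h2o.predict(km3, factors))
print(table(assignments$predict))
```

When reading the assignments, it is safer to count distinct labels than to read the label values themselves. The output reported earlier in this thread (labels 0 and 1 for two clusters) suggests the numbering starts at 0.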

If you go through the help file (?RemixAutoML::AutoKMeans), you can read up on what each argument does. The function is intended to be flexible for most kinds of data sets but you will want to try several settings if you don't already have a good idea of how to set it for your particular case.

This function is just a beginning for unsupervised learning. I spend most of my time working on the supervised learning stuff since I encounter it more often in practice, but I will get around to enhancing these at some point. If you are interested in helping out let me know.

AutoKMeans <- function(data,
                       nthreads        = 8,
                       MaxMem          = "28G",
                       SaveModels      = NULL,
                       PathFile        = NULL,
                       GridTuneGLRM    = TRUE,
                       GridTuneKMeans  = TRUE,
                       glrmCols        = c(1:5),
                       IgnoreConstCols = TRUE,
                       glrmFactors     = 5,
                       Loss            = "Absolute",
                       glrmMaxIters    = 1000,
                       SVDMethod       = "Randomized",
                       MaxRunTimeSecs  = 3600,
                       KMeansK         = 50,
                       KMeansMetric    = "totss") {

@spsanderson
Author

spsanderson commented Jan 27, 2020 via email

@spsanderson
Author

Working through it. It seems that even on the Iris dataset, h2o::h2o.kmeans is only producing 2 clusters when we know there are 3. I forked and cloned the repo and will work on it.
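As a baseline to compare against, here is a minimal sketch of the same check with base R `stats::kmeans` on iris, independent of h2o. Scaling the columns, the seed, `nstart = 25`, and the range of k are my own choices for the example.

```r
set.seed(123)

# Four numeric iris columns, standardized so no single column dominates
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
tot_withinss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
scree <- data.frame(centers = ks, tot.withinss = tot_withinss)
print(scree)

# Fit k = 3 and cross-tabulate the clusters against the known species
fit3 <- kmeans(x, centers = 3, nstart = 25)
print(table(cluster = fit3$cluster, species = iris$Species))
```

If the h2o route keeps settling on 2 for the same data, comparing its total within-cluster sum of squares per k against this table may show where the selection logic diverges.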
