
how does topicTuner help the parameter setting process #20

Open
652994331 opened this issue Mar 23, 2023 · 4 comments

Comments

@652994331

@drob-xx I checked your code, very impressive work, and I have a question. I think you used grid search to try different settings of min_cluster_size and min_samples and ran some experiments, and I also looked at BaseHDBSCANTuner and the gridSearch, pseudoGridSearch, and randomSearch functions. But I still have questions about how this "grid search", or more precisely these functions, actually help with setting the parameters.

@drob-xx
Owner

drob-xx commented Mar 23, 2023

The short version is that you use the search functions to test different parameters. You then evaluate the resulting clustering. Typically I'm looking to find a number of clusters that makes sense for the corpus I'm working with. Then you can further tune to find the fewest outliers. There are always going to be questions about whether reducing outliers has a (negative) material effect on the cluster formation. However, it seems pretty clear that optimizing for the fewest number of outliers is the way to go. Once you have the parameters that work for you, you can generate a BERTopic model.
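To make the "evaluate the resulting clustering" step concrete, here is a minimal generic sketch (not code from the topictuner package) that scores candidate HDBSCAN settings by how many clusters they produce and then by how many outliers remain; the helper name and the thresholds are illustrative assumptions:

```python
# Generic sketch (not topictuner code): score candidate
# (min_cluster_size, min_samples) pairs by the number of clusters they
# produce and the number of documents left as outliers (label -1).
import numpy as np
from hdbscan import HDBSCAN

def evaluate_settings(embeddings, candidate_params, target_clusters):
    """candidate_params: iterable of (min_cluster_size, min_samples) pairs."""
    results = []
    for mcs, ms in candidate_params:
        labels = HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(embeddings)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_outliers = int(np.sum(labels == -1))
        results.append((mcs, ms, n_clusters, n_outliers))
    # First keep settings near the cluster count that makes sense for the
    # corpus, then prefer the setting with the fewest outliers.
    plausible = [r for r in results if abs(r[2] - target_clusters) <= 2]
    return min(plausible or results, key=lambda r: r[3])
```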

@652994331
Author

@drob-xx I see, so grid search is a tool to generate all the candidate parameter settings, and you then need to evaluate which setting is best. Is the way you do the evaluation something like visualization, or maybe manual checking? (I am not sure.) I also have a question about how you did the grid search with respect to your number of documents (I didn't find document counts in these functions), and how you decided that a particular parameter setting was the "best one". Thank you.

@drob-xx
Owner

drob-xx commented Mar 23, 2023

There are different searches which balance the "depth" and "width" of the search. Since min_samples can be anywhere from 1 to n, where n is whatever min_cluster_size is, the number of searches can grow quickly. I start by running randomSearch over a wide range of cluster sizes, using percentages of the cluster size (typically 0.1 -> 1 in 0.1 increments) to determine the min_samples param. Then, once I've narrowed down some interesting clusterings, I'll use pseudoGridSearch and then gridSearch to get the best possible values in the fewest number of clustering iterations. The package keeps a complete history of the searches and provides summarizeResults and visualizeSearch to make the searching as efficient as possible. Once you have some values you like, you can persist them by setting bestParams (e.g. myTopicTuner.bestParams = (220, 44)). These values will then be the defaults for any method or operation that needs the params (e.g. TopicTuner.getBERTopicModel()).
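A rough sketch of that funnel follows. The search and persistence names (randomSearch, pseudoGridSearch, gridSearch, summarizeResults, visualizeSearch, bestParams, getBERTopicModel) come from the comment above; the setup calls and the exact argument forms are assumptions, not verbatim package API:

```python
# Sketch of the search funnel described above. The setup calls
# (TopicModelTuner, createEmbeddings, reduce) and exact argument forms
# are assumptions; the search/persist method names come from the comment.
from topictuner import TopicModelTuner as TMT

docs = ["..."]                   # placeholder: your own list of documents
tmt = TMT()
tmt.createEmbeddings(docs)       # embed the corpus (assumed helper)
tmt.reduce()                     # reduce the embeddings (assumed helper)

# 1. Wide pass: a range of cluster sizes, with min_samples expressed as
#    fractions of min_cluster_size (0.1 -> 1.0 in 0.1 steps).
search_df = tmt.randomSearch([*range(20, 300, 20)],
                             [x / 10 for x in range(1, 11)])
tmt.summarizeResults(search_df)
tmt.visualizeSearch(search_df)

# 2. Narrower pass over the interesting region, still using fractions.
search_df = tmt.pseudoGridSearch([*range(180, 260, 10)],
                                 [x / 10 for x in range(1, 11)])

# 3. Exhaustive pass over the final candidate cluster sizes.
search_df = tmt.gridSearch([*range(210, 231)])

# Persist the winning pair; it becomes the default for later calls
# such as getBERTopicModel().
tmt.bestParams = (220, 44)
bertopic_model = tmt.getBERTopicModel()
```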

I suggest you take a minute to go through the API documentation and run through the provided notebook to get a more thorough understanding of the tools and how I envisioned them being used.

@arcadiahero

What are the evaluation metrics you used here?
