
how does topicTuner help the parameter setting process #20

Open
652994331 opened this issue Mar 23, 2023 · 4 comments

Comments

@652994331

@drob-xx I checked your code, very impressive work, and I have a question. I think you used grid search to try different settings of min_cluster_size and min_samples and ran some experiments, and I also looked at BaseHDBSCANTuner and the gridSearch, pseudoGridSearch, and randomSearch functions. But I still have questions about how this "grid search", or more precisely these functions, actually help with setting the parameters.

@drob-xx
Owner

drob-xx commented Mar 23, 2023

The short version is that you use the search functions to test different parameters. You then evaluate the resulting clustering. Typically I'm looking to find a number of clusters that makes sense for the corpus I'm working with. Then you can further tune to find the fewest outliers. There are always going to be questions about whether reducing outliers has a (negative) material effect on the cluster formation. However, it seems pretty clear that optimizing for the fewest number of outliers is the way to go. Once you have the parameters that work for you, you can generate a BERTopic model.
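To make the "evaluate the resulting clustering" step concrete, here is a minimal generic sketch (not code from the topictuner package) that scores candidate HDBSCAN settings by how many clusters they produce and then by how many outliers remain; the helper name and the thresholds are illustrative assumptions:

```python
# Generic sketch (not topictuner code): score candidate
# (min_cluster_size, min_samples) pairs by the number of clusters they
# produce and the number of documents left as outliers (label -1).
import numpy as np
from hdbscan import HDBSCAN

def evaluate_settings(embeddings, candidate_params, target_clusters):
    """candidate_params: iterable of (min_cluster_size, min_samples) pairs."""
    results = []
    for mcs, ms in candidate_params:
        labels = HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(embeddings)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_outliers = int(np.sum(labels == -1))
        results.append((mcs, ms, n_clusters, n_outliers))
    # First keep settings near the cluster count that makes sense for the
    # corpus, then prefer the setting with the fewest outliers.
    plausible = [r for r in results if abs(r[2] - target_clusters) <= 2]
    return min(plausible or results, key=lambda r: r[3])
```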

@652994331
Author

@drob-xx I see, so grid search is a tool to generate all the candidate parameter settings, and you then need to evaluate which setting is best. Is the way you do the evaluation something like visualization, or maybe manual checking? (I am not sure.) I also have a question about how you did the grid search with respect to your number of documents (I didn't find document counts in these functions), and how you decided that a particular parameter setting was the "best one". Thank you.

@drob-xx
Owner

drob-xx commented Mar 23, 2023

There are different searches which balance the "depth" and "width" of the search. Since min_samples can be anywhere from 1 to n, where n is whatever min_cluster_size is, the number of searches can grow quickly. I start by running randomSearch over a wide range of cluster sizes, using percentages of the cluster size (typically 0.1 -> 1 in 0.1 increments) to determine the min_samples param. Then, once I've narrowed down some interesting clusterings, I'll use pseudoGridSearch and then gridSearch to get the best possible values in the fewest number of clustering iterations. The package keeps a complete history of the searches and provides summarizeResults and visualizeSearch to make the searching as efficient as possible. Once you have some values you like, you can persist them by setting bestParams (e.g. myTopicTuner.bestParams = (220, 44)). These values will then be the defaults for any method or operation that needs the params (e.g. TopicTuner.getBERTopicModel()).
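A rough sketch of that funnel follows. The search and persistence names (randomSearch, pseudoGridSearch, gridSearch, summarizeResults, visualizeSearch, bestParams, getBERTopicModel) come from the comment above; the setup calls and the exact argument forms are assumptions, not verbatim package API:

```python
# Sketch of the search funnel described above. The setup calls
# (TopicModelTuner, createEmbeddings, reduce) and exact argument forms
# are assumptions; the search/persist method names come from the comment.
from topictuner import TopicModelTuner as TMT

docs = ["..."]                   # placeholder: your own list of documents
tmt = TMT()
tmt.createEmbeddings(docs)       # embed the corpus (assumed helper)
tmt.reduce()                     # reduce the embeddings (assumed helper)

# 1. Wide pass: a range of cluster sizes, with min_samples expressed as
#    fractions of min_cluster_size (0.1 -> 1.0 in 0.1 steps).
search_df = tmt.randomSearch([*range(20, 300, 20)],
                             [x / 10 for x in range(1, 11)])
tmt.summarizeResults(search_df)
tmt.visualizeSearch(search_df)

# 2. Narrower pass over the interesting region, still using fractions.
search_df = tmt.pseudoGridSearch([*range(180, 260, 10)],
                                 [x / 10 for x in range(1, 11)])

# 3. Exhaustive pass over the final candidate cluster sizes.
search_df = tmt.gridSearch([*range(210, 231)])

# Persist the winning pair; it becomes the default for later calls
# such as getBERTopicModel().
tmt.bestParams = (220, 44)
bertopic_model = tmt.getBERTopicModel()
```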

I suggest you take a minute to go through the API documentation and run through the provided notebook to get a more thorough understanding of the tools and how I envisioned them being used.

@arcadiahero

What are the evaluation metrics you used here?
