ClustrLab2k13 is a powerful Python-based tool for clustering text, built using Streamlit.
The tool utilizes Google's Universal Sentence Encoder in conjunction with OpenTSNE, a lightning-fast implementation of t-SNE. It can process plain text files or CSV files with a single column containing text. When provided with a plain text file, it employs sentence embedding similarity to group sentences and create what we can refer to as "pseudo paragraphs." However, if you prefer to avoid this grouping, you can use the CSV mode. Additionally, all data, including text, embeddings, and TSNE output, can be downloaded. Much of the code for this tool is derived from my previous repository, 'Feed Visualizer'.
streamlit run app.py
Context-based help is available for each of the options. I won't bore 🥱 you by writing a manual here; instead, explore the tool and let it guide you.
On the chart there is a button you can use to toggle full screen view .
Instead of relying on Google's 'Universal Sentence Transformer', the 'use zero-shot embedding' option utilizes Huggingface's zero-shot classification to generate embeddings based on provided labels. For example, if you assign labels such as "positive, negative, neutral," the resulting embedding for a sentence could resemble "0.3, 0.4, 0.3".
Note: Exercise caution when experimenting with this option unless you have a GPU. This feature has not yet been tested with a GPU on large datasets.