A practical tool for query deduplication and similarity grouping, designed for dataset cleaning, benchmark construction, search-log consolidation, and question pool management.
It supports:
- Exact deduplication
- Near-text deduplication with SimHash
- Local semantic-like grouping with TF-IDF + cosine similarity + clustering
- GUI-based usage for non-technical users
- Traceable outputs for audit and manual review
In real-world query datasets, duplicates usually appear in multiple forms:
- exactly repeated queries
- formatting variants
- slight rewrites
- locally similar short texts
- semantically adjacent expressions
This tool is built to turn a noisy query set into a cleaner, more compact, and more reusable dataset.
Instead of relying on heavy remote embedding services, this project adopts a lightweight and deployable local pipeline, making it easier to run, share, and reproduce.
Standardizes queries before deduplication:
- lowercase conversion
- trimming spaces
- collapsing repeated spaces
- removing common punctuation
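The four normalization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `normalize_query` and the exact punctuation set (ASCII punctuation plus a handful of full-width CJK marks) are assumptions made for the example.

```python
import re
import string

# Assumption: "common punctuation" = ASCII punctuation plus a few full-width
# CJK marks. The real tool may use a different character set.
CJK_PUNCT = "！？。，、；：（）"
_PUNCT_TABLE = str.maketrans("", "", string.punctuation + CJK_PUNCT)
_SPACES = re.compile(r"\s+")


def normalize_query(text: str) -> str:
    """Lowercase, drop punctuation, collapse repeated spaces, trim the ends."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)  # remove punctuation characters
    text = _SPACES.sub(" ", text)        # collapse runs of whitespace
    return text.strip()                  # trim leading/trailing spaces


print(normalize_query("How to learn Python?"))  # how to learn python
print(normalize_query("北京天气怎么样!"))          # 北京天气怎么样
```

Punctuation is removed before spaces are collapsed, so a query such as `a , b` does not end up with a double space.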
Removes:
- identical queries
- normalization-equivalent queries
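A rough sketch of how exact deduplication can stay traceable: queries are keyed by their normalized form, the first original seen for each key becomes the representative, and every original is mapped to its representative. The function name `exact_dedup` and the tiny stand-in normalizer are hypothetical and only serve the example.

```python
from typing import Callable


def exact_dedup(
    queries: list[str],
    normalize: Callable[[str], str],
) -> tuple[list[str], dict[str, str]]:
    """Return (representatives, original -> representative mapping)."""
    first_seen: dict[str, str] = {}  # normalized key -> first original query
    mapping: dict[str, str] = {}     # original query -> representative
    for query in queries:
        key = normalize(query)
        representative = first_seen.setdefault(key, query)
        mapping[query] = representative
    return list(first_seen.values()), mapping


# Stand-in normalizer for the demo only (assumption, not the tool's rules).
def simple_key(q: str) -> str:
    return q.strip().rstrip("!?").lower()


reps, mapping = exact_dedup(
    ["北京天气怎么样", "北京天气怎么样!", "北京天气怎么样"], simple_key
)
print(reps)     # ['北京天气怎么样']
print(mapping)  # both spellings map to '北京天气怎么样'
```

Keeping the mapping alongside the representatives is what makes the result reviewable afterwards.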
Uses SimHash to detect highly similar short-text variations.
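For readers unfamiliar with SimHash, the sketch below shows the general idea: each query is reduced to a 64-bit fingerprint, and two queries count as near-duplicates when their fingerprints differ in only a few bits. The choice of character bigrams as features, BLAKE2b as the hash, and the threshold of 3 bits are all assumptions for illustration; the tool's own feature extraction and parameters may differ.

```python
import hashlib

BITS = 64  # fingerprint width used in this sketch


def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Split text (spaces removed) into overlapping character n-grams."""
    text = text.replace(" ", "")
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint from character bigrams."""
    weights = [0] * BITS
    for gram in char_ngrams(text):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            # Each feature votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


MAX_DISTANCE = 3  # hypothetical threshold; tune on your own data
d = hamming_distance(simhash("北京天气怎么样"), simhash("北京今天天气怎么样"))
print(d, "near-duplicate" if d <= MAX_DISTANCE else "kept separate")
```

Very short queries produce few features, so their fingerprints are noisy; a looser or tighter threshold may be needed depending on query length.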
Uses:
- TF-IDF
- cosine similarity
- clustering / connected grouping
to merge queries that are not text-identical but still close in expression.
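The combination of TF-IDF, cosine similarity, and connected grouping can be illustrated with a dependency-free sketch. It is an assumption-laden toy, not the tool's implementation: character bigrams are used as tokens, the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, groups are formed with a small union-find, and the `0.5` threshold is arbitrary. In practice scikit-learn's `TfidfVectorizer` could replace the hand-rolled weighting.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> list[str]:
    text = text.replace(" ", "").lower()
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """L2-normalized TF-IDF vectors over character bigrams."""
    counts = [Counter(char_bigrams(d)) for d in docs]
    df = Counter(tok for c in counts for tok in c)  # document frequency
    n = len(docs)
    vectors = []
    for c in counts:
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def group_queries(docs: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Link pairs above the threshold and return connected groups."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: dict[int, list[str]] = {}
    for i, doc in enumerate(docs):
        groups.setdefault(find(i), []).append(doc)
    return list(groups.values())


print(group_queries(["北京天气怎么样", "北京今天天气怎么样", "怎么学习 Python"]))
```

The pairwise loop is quadratic in the number of queries, which is fine for a sketch but would need blocking or sparse matrix operations on large query pools.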
The tool includes a local GUI so that non-engineering users can:
- select CSV files
- specify the query column
- configure parameters
- choose output directories
- run the full pipeline with one click
This project is useful for:
- query dataset cleaning
- benchmark construction
- WebDev / product requirement dataset preparation
- search log consolidation
- FAQ / issue pool cleanup
- preprocessing before intent analysis or clustering
The full pipeline is:
Input
→ Normalization
→ Exact deduplication
→ Near-text deduplication (SimHash)
→ Local semantic grouping (TF-IDF)
→ Result export
Raw query sets may contain duplicates, formatting variants, and similar expressions.
Normalization examples:
How to learn Python? → how to learn python
北京天气怎么样! → 北京天气怎么样
Exact-duplicate examples:
北京天气怎么样
北京天气怎么样!
北京天气怎么样
After normalization, they collapse into one representative query:
北京天气怎么样
Near-duplicate examples:
北京天气怎么样
北京今天天气怎么样
These are not identical, but they are textually very close and may be grouped together.
Semantically similar examples:
怎么学习 Python
Python 入门怎么学
如何快速掌握 Python
These can be grouped under one semantic representative query.
query-dedup-tool/
├── Query Dedup Tfidf Gui Tool.py
├── requirements.txt
├── README.md
├── .gitignore
├── LICENSE
├── gui-home.png
└── examples/
    ├── sample_queries.csv
    ├── deduped_queries_sample.csv
    └── query_groups_sample.csv
Install dependencies with:
pip install -r requirements.txt
Or install manually:
pip install pandas numpy scikit-learn
Run the GUI tool:
python3.12 "Query Dedup Tfidf Gui Tool.py"
Then:
- Select your CSV file
- Enter the query column name
- Choose an output folder
- Set text-dedup parameters
- Optionally enable local semantic grouping
- Click Start
The input file should be a CSV containing one query column.
Example file:
examples/sample_queries.csv
Example content:
query
北京天气怎么样
北京天气怎么样!
北京今天天气怎么样
怎么学习 Python
Python 入门怎么学
如何快速掌握 Python
This repository includes sample input and output files in the examples/ folder:
- examples/sample_queries.csv
- examples/deduped_queries_sample.csv
- examples/query_groups_sample.csv
These files help new users quickly understand the input/output format and test the tool without preparing data from scratch.
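To make the expected input format concrete, here is a small sketch that reads one query column from a CSV and fails early if the column is missing. It uses only the standard library; the helper name `load_queries` is hypothetical, and the tool itself may load files differently (for example with pandas).

```python
import csv
import io
from typing import TextIO


def load_queries(csv_file: TextIO, column: str = "query") -> list[str]:
    """Return the non-empty values of one column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    # Skip rows where the cell is missing or only whitespace.
    return [row[column] for row in reader if (row[column] or "").strip()]


# In-memory stand-in for a file such as examples/sample_queries.csv.
sample = io.StringIO("query\n北京天气怎么样\n北京天气怎么样!\n怎么学习 Python\n")
print(load_queries(sample))
```

With a real file, open it with `encoding="utf-8-sig"` and `newline=""` so that a byte-order mark or embedded line breaks do not corrupt the first header or multi-line cells.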
| Original Query | Representative Query |
|---|---|
| 北京天气怎么样! | 北京天气怎么样 |
| 北京今天天气怎么样 | 北京天气怎么样 |
| 如何快速掌握 Python | Python 入门怎么学 |
Representative queries after text-level deduplication.
Mapping from original queries to text-level representative queries.
Representative queries after local semantic grouping.
Mapping from representative queries to semantic groups.
This project is built around the following priorities:
- Practicality over complexity
- Local deployability
- Reproducibility
- Interpretability
- Low barrier to entry
The goal is not to build the strongest semantic system possible, but to build a usable and shareable query cleaning tool.
This tool is designed for short-text query processing, not for deep semantic reasoning over long documents.
Current semantic grouping is based on local TF-IDF similarity, which is lightweight and stable, but not equivalent to large-scale embedding-based semantic understanding.
So this project is best viewed as:
- a data cleaning tool
- a query consolidation tool
- a benchmark preprocessing tool
rather than a full semantic understanding engine.
Recommended:
- benchmark dataset cleaning
- WebDev task set construction
- query pool compression
- short-text similarity consolidation
- pre-labeling cleanup
Not recommended as a direct replacement for:
- deep semantic retrieval systems
- intent classification systems
- large-model semantic equivalence judgment
- long-document clustering
Potential next steps:
- optional embedding-based semantic deduplication
- CLI packaging
- batch folder support
- report generation
- dataset quality analytics dashboard
- service/API deployment mode
Issues and pull requests are welcome.
MIT License
