Query Dedup Tool

A practical tool for query deduplication and similarity grouping, designed for dataset cleaning, benchmark construction, search-log consolidation, and question pool management.

It supports:

Exact deduplication
Near-text deduplication with SimHash
Local semantic-like grouping with TF-IDF + cosine similarity + clustering
GUI-based usage for non-technical users
Traceable outputs for audit and manual review

Screenshot

Why this project

In real-world query datasets, duplicates usually appear in multiple forms:

exactly repeated queries
formatting variants
slight rewrites
locally similar short texts
semantically adjacent expressions

This tool is built to turn a noisy query set into a cleaner, more compact, and more reusable dataset.

Instead of relying on heavy remote embedding services, this project adopts a lightweight and deployable local pipeline, making it easier to run, share, and reproduce.

Features

1. Text normalization

Standardizes queries before deduplication:

lowercase conversion
trimming spaces
collapsing repeated spaces
removing common punctuation
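
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `normalize_query` is hypothetical, and treating every Unicode punctuation character as "common punctuation" is an assumption.

```python
import re
import unicodedata


def normalize_query(text: str) -> str:
    """Lowercase, trim, drop punctuation, and collapse repeated spaces."""
    text = text.lower().strip()
    # Assumption: "common punctuation" means any character whose Unicode
    # category starts with "P". This covers ASCII marks such as "?" as well
    # as full-width marks such as "！".
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # Collapse whitespace runs, including gaps left by removed punctuation.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_query("How to learn Python?"))
print(normalize_query("北京天气怎么样！"))
```

Removing punctuation by Unicode category rather than from a fixed ASCII list keeps the behavior the same for Chinese and English queries.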

2. Exact deduplication

Removes:

identical queries
normalization-equivalent queries
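
One way to picture this step is a dictionary keyed by the normalized text, keeping the first query seen for each key. The sketch below assumes that "first seen wins" is an acceptable way to pick the representative; the helper names `_key` and `exact_dedup` are hypothetical and the tool's own rule may differ.

```python
import unicodedata


def _key(query: str) -> str:
    # Stand-in for the normalization step: lowercase, drop punctuation,
    # collapse whitespace.
    kept = (
        ch for ch in query.lower()
        if not unicodedata.category(ch).startswith("P")
    )
    return " ".join("".join(kept).split())


def exact_dedup(queries):
    """Keep the first query seen for each normalized key.

    Returns (representatives, mapping), where mapping sends every
    original query to its representative.
    """
    representative_by_key = {}
    mapping = {}
    for query in queries:
        rep = representative_by_key.setdefault(_key(query), query)
        mapping[query] = rep
    return list(representative_by_key.values()), mapping


reps, mapping = exact_dedup(["北京天气怎么样", "北京天气怎么样！", "北京天气怎么样"])
print(reps)
print(mapping)
```

Returning the mapping alongside the representatives is what makes the result traceable: every original query can be followed to the entry that replaced it.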

3. Near-text deduplication

Uses SimHash to detect highly similar short-text variations.
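
As a rough illustration of the idea, the sketch below builds a 64-bit SimHash fingerprint from character bigrams and compares two fingerprints by Hamming distance. The feature choice (character bigrams), the hash function (BLAKE2b), and all function names are assumptions made for this example; the tool's actual features and threshold are not documented here.

```python
import hashlib


def char_ngrams(text: str, n: int = 2):
    """Character n-grams, which also work for unsegmented text such as Chinese."""
    text = text.replace(" ", "")
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint: each feature votes +1 or -1 on every bit position."""
    weights = [0] * bits
    for gram in char_ngrams(text):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


d = hamming_distance(simhash("北京天气怎么样"), simhash("北京今天天气怎么样"))
print(d)  # smaller distance means more similar text
```

Two queries would then be treated as near-duplicates when the distance falls at or below a chosen threshold. The right threshold depends on query length and on the feature choice, so it is best tuned on a sample of real data.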

4. Local semantic-like grouping

Uses:

TF-IDF
cosine similarity
clustering / connected grouping

to merge queries that are not text-identical but still close in expression.
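
The sketch below shows one way these three pieces can fit together: TF-IDF vectors over character bigrams, cosine similarity between every pair, and connected components over the pairs that reach a threshold. It uses only the standard library so that it runs anywhere; the project itself lists scikit-learn as a dependency, and its real tokenization, threshold, and grouping logic may differ. All names and the default threshold of 0.5 are assumptions.

```python
import math
from collections import Counter


def bigrams(text):
    text = text.lower().replace(" ", "")
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def tfidf_vectors(queries):
    """L2-normalized TF-IDF vectors over character bigrams, as sparse dicts."""
    docs = [Counter(bigrams(q)) for q in queries]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed IDF, the same form scikit-learn uses by default.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    # Vectors are already L2-normalized, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def group_queries(queries, threshold=0.5):
    """Connected components of the graph linking pairs with cosine >= threshold."""
    vectors = tfidf_vectors(queries)
    parent = list(range(len(queries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, query in enumerate(queries):
        groups.setdefault(find(i), []).append(query)
    return list(groups.values())


print(group_queries(["怎么学习 Python", "Python 入门怎么学", "北京天气怎么样"]))
```

Connected grouping is transitive: if A is close to B and B is close to C, all three land in one group even when A and C are not close to each other. A higher threshold limits this chaining at the cost of merging fewer queries. The all-pairs loop is quadratic, which is fine for a sketch but would need blocking or sparse matrix operations on large query sets.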

5. GUI support

The tool includes a local GUI so that non-engineering users can:

select CSV files
specify the query column
configure parameters
choose output directories
run the full pipeline with one click

Use cases

This project is useful for:

query dataset cleaning
benchmark construction
WebDev / product requirement dataset preparation
search log consolidation
FAQ / issue pool cleanup
preprocessing before intent analysis or clustering

Method overview

The full pipeline is:

Input
→ Normalization
→ Exact deduplication
→ Near-text deduplication (SimHash)
→ Local semantic-like grouping (TF-IDF + cosine similarity)
→ Result export

Layer-by-layer intuition

Input

Raw query sets may contain duplicates, formatting variants, and similar expressions.

Normalization

Examples:

How to learn Python? → how to learn python
北京天气怎么样！ → 北京天气怎么样 (the full-width "！" is removed)

Exact deduplication

Examples:

北京天气怎么样
北京天气怎么样！
北京天气怎么样

After normalization, all three ("How is the weather in Beijing", with and without a trailing exclamation mark) collapse into one representative query:

北京天气怎么样

Near-text deduplication

Examples:

北京天气怎么样
北京今天天气怎么样

These are not identical ("How is the weather in Beijing" vs. "How is the weather in Beijing today"), but they are textually very close and may be grouped together.

Local semantic-like grouping

Examples:

怎么学习 Python
Python 入门怎么学
如何快速掌握 Python

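These three queries ("How to learn Python", "How to get started with Python", "How to master Python quickly") share little exact wording but ask for the same thing, so they can be grouped under one semantic representative query.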

Project structure

query-dedup-tool/
├── Query Dedup Tfidf Gui Tool.py
├── requirements.txt
├── README.md
├── .gitignore
├── LICENSE
├── gui-home.png
└── examples/
    ├── sample_queries.csv
    ├── deduped_queries_sample.csv
    └── query_groups_sample.csv

Installation

Install dependencies with:

pip install -r requirements.txt

Or install manually:

pip install pandas numpy scikit-learn

Quick start

Run the GUI tool (the quotes are required because the script name contains spaces):

python3.12 "Query Dedup Tfidf Gui Tool.py"

Then:

Select your CSV file
Enter the query column name
Choose an output folder
Set text-dedup parameters
Optionally enable local semantic grouping
Click Start

Input format

The input file should be a CSV with a header row and one column that holds the queries. You enter that column's name in the GUI.

Example file:

examples/sample_queries.csv

Example content:

query
北京天气怎么样
北京天气怎么样！
北京今天天气怎么样
怎么学习 Python
Python 入门怎么学
如何快速掌握 Python
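
To check a file before running the tool, the query column can be read with a few lines of Python. This sketch uses the standard-library `csv` module rather than the pandas dependency the project installs; the function name `read_queries` and the choice to skip empty cells are assumptions.

```python
import csv
import io


def read_queries(csv_file, column="query"):
    """Return the non-empty values of the query column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    # Skip rows whose query cell is missing or blank.
    return [
        row[column].strip()
        for row in reader
        if row[column] and row[column].strip()
    ]


# In-memory stand-in for a file such as examples/sample_queries.csv.
sample = io.StringIO("query\n北京天气怎么样\n北京天气怎么样！\n怎么学习 Python\n")
queries = read_queries(sample)
print(queries)
```

For a real file, open it with `open(path, newline="", encoding="utf-8-sig")` so that a byte-order mark written by spreadsheet software does not end up inside the header name.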

Example files

This repository includes sample input and output files in the examples/ folder:

examples/sample_queries.csv
examples/deduped_queries_sample.csv
examples/query_groups_sample.csv

These files help new users quickly understand the input/output format and test the tool without preparing data from scratch.

Example output

| Original Query | Representative Query |
| --- | --- |
| 北京天气怎么样！ | 北京天气怎么样 |
| 北京今天天气怎么样 | 北京天气怎么样 |
| 如何快速掌握 Python | Python 入门怎么学 |

Output files

`deduped_queries.csv`

Representative queries after text-level deduplication.

`query_groups.csv`

Mapping from original queries to text-level representative queries.

`semantic_deduped_queries.csv`

Representative queries after local semantic grouping.

`semantic_query_groups.csv`

Mapping from representative queries to semantic groups.
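
Because the two mapping files describe consecutive stages, they can be chained to trace any original query to its final semantic representative. The sketch below works on plain dictionaries and assumes each mapping has already been loaded as `{source query: representative query}`; the actual CSV column names are not documented here, and the function name `trace_to_semantic` is hypothetical.

```python
def trace_to_semantic(text_groups, semantic_groups):
    """Compose the two mappings: original -> text-level rep -> semantic rep."""
    traced = {}
    for original, text_rep in text_groups.items():
        # A text-level representative absent from the semantic mapping
        # is treated as its own semantic representative.
        traced[original] = semantic_groups.get(text_rep, text_rep)
    return traced


# Illustrative data only, shaped like the two mapping files.
text_groups = {
    "北京天气怎么样！": "北京天气怎么样",
    "北京今天天气怎么样": "北京天气怎么样",
    "如何快速掌握 Python": "如何快速掌握 Python",
}
semantic_groups = {"如何快速掌握 Python": "Python 入门怎么学"}

traced = trace_to_semantic(text_groups, semantic_groups)
for original, final in traced.items():
    print(original, "->", final)
```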

Design principles

This project is built around the following priorities:

Practicality over complexity
Local deployability
Reproducibility
Interpretability
Low usage threshold

The goal is not to build the strongest semantic system possible, but to build a usable and shareable query cleaning tool.

Limitations

This tool is designed for short-text query processing, not for deep semantic reasoning over long documents.

Current semantic grouping is based on local TF-IDF similarity, which is lightweight and stable, but not equivalent to large-scale embedding-based semantic understanding.

So this project is best viewed as:

a data cleaning tool
a query consolidation tool
a benchmark preprocessing tool

rather than a full semantic understanding engine.

Recommended scenarios

Recommended:

benchmark dataset cleaning
WebDev task set construction
query pool compression
short-text similarity consolidation
pre-labeling cleanup

Not recommended as a direct replacement for:

deep semantic retrieval systems
intent classification systems
large-model semantic equivalence judgment
long-document clustering

Future work

Potential next steps:

optional embedding-based semantic deduplication
CLI packaging
batch folder support
report generation
dataset quality analytics dashboard
service/API deployment mode

Contributing

Issues and pull requests are welcome.

License

MIT License
