A comprehensive tool for analyzing the quality of NLP datasets using Great Expectations and advanced diversity metrics. NLPBench provides detailed quality assessments, diversity measurements, beautiful reports, and actionable recommendations for improving your datasets.
- Comprehensive Quality Analysis: Validates schema, content length, role consistency, duplicates, and more
- Advanced Diversity Metrics: Measures lexical, semantic, syntactic, and topic diversity with tiered analysis
- Great Expectations Integration: Robust data validation framework with custom NLP expectations
- Multiple Dataset Formats: Supports conversations, instruction-following, Q&A, and other common formats
- Rich Console Output: Beautiful CLI interface with progress bars and colored output
- Multiple Report Formats: Generate reports in console, JSON, or HTML format
- Configurable Validation Rules: Customize quality thresholds and validation parameters
- Hugging Face Integration: Direct integration with the Hugging Face Hub
- Optional Enhanced Metrics: Advanced NLP analysis with sentence transformers, spaCy, and topic modeling
- Clone the repository:
git clone <repository-url>
cd nlpbench

- Install with uv:
# Install dependencies
uv sync
# Install with enhanced diversity metrics (optional)
uv sync --extra diversity
# For development (includes dev dependencies)
uv sync --group dev

Alternatively, clone and install with pip:

# Clone and install
git clone <repository-url>
cd nlpbench
pip install -e .
# Install with enhanced diversity metrics
pip install -e ".[diversity]"
# Or install from PyPI (when available)
pip install nlpbench
pip install "nlpbench[diversity]" # with enhanced metricsAnalyze a dataset with default settings:
nlpbench --hf-repo databricks/databricks-dolly-15k

Generate an HTML report:

nlpbench --hf-repo microsoft/DialoGPT-medium --output-format html --output-file report.html

Use a custom configuration file:

# Create a default config file
nlpbench init-config
# Use custom configuration
nlpbench --hf-repo org/dataset --config .nlpbench.json

Analyze a Hugging Face dataset for quality issues:
nlpbench analyze --hf-repo REPO_ID [OPTIONS]

Options:
- --hf-repo: Hugging Face dataset repository (required)
- --config: Path to configuration file
- --config-name: Dataset configuration name
- --split: Dataset split to analyze (default: train)
- --output-format: Output format (console/json/html/all)
- --output-file: Output file path
- --cache-dir: Cache directory for datasets
- --token: Hugging Face authentication token
Quickly inspect a dataset without running full analysis:
nlpbench inspect --hf-repo REPO_ID [OPTIONS]

Configuration commands:

# Create default configuration
nlpbench init-config
# Show example configuration
nlpbench show-config

NLPBench uses a JSON configuration file to customize validation rules. Create a default configuration:
nlpbench init-config

Example configuration:
{
"min_content_length": 10,
"max_content_length": 50000,
"allowed_roles": ["user", "assistant", "system"],
"duplicate_threshold": 0.05,
"empty_content_threshold": 0.01,
"quality_thresholds": {
"excellent": 95.0,
"good": 85.0,
"fair": 70.0,
"poor": 50.0
}
}
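The quality_thresholds entries map the weighted 0-100 quality score to a quality level. As a rough illustration only (not NLPBench's actual implementation), a score could be classified like this, where classify_quality is a hypothetical helper:

```python
# Illustrative only: map a 0-100 score to the highest quality level whose
# threshold it meets, using the values from the example configuration above.
QUALITY_THRESHOLDS = {"excellent": 95.0, "good": 85.0, "fair": 70.0, "poor": 50.0}

def classify_quality(score, thresholds=QUALITY_THRESHOLDS):
    for level, minimum in sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True):
        if score >= minimum:
            return level
    return "critical"  # hypothetical label for scores below every configured threshold

print(classify_quality(91.3))  # -> good
```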
NLPBench automatically detects and handles various dataset formats:

- Conversations Format: {"conversations": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
- Messages Format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
- Instruction-Following Format: {"instruction": "...", "input": "...", "output": "..."}
- Q&A Format: {"question": "...", "answer": "..."}
- Prompt-Response Format: {"prompt": "...", "response": "..."}
NLPBench evaluates datasets across multiple dimensions. Quality checks cover:

- Presence of required columns
- Data type consistency
- Non-null value validation
- Content length distribution
- Empty content detection
- Role consistency validation
- Duplicate detection (illustrated in the sketch below)
- Uniqueness validation
- Role distribution analysis
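For intuition, the duplicate check boils down to a ratio that is compared against the configured duplicate_threshold (0.05 in the example configuration above). A minimal sketch using exact matches on whitespace- and case-normalized text (NLPBench's own detection may be more sophisticated):

```python
# Hypothetical duplicate-ratio sketch: count records whose normalized
# content has already been seen and compare the ratio to the threshold.
def duplicate_ratio(contents):
    seen, duplicates = set(), 0
    for text in contents:
        key = " ".join(text.lower().split())  # normalize whitespace and case
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(contents) if contents else 0.0

contents = ["Hello world", "hello  world", "Goodbye"]
ratio = duplicate_ratio(contents)
print(ratio, "exceeds threshold" if ratio > 0.05 else "within threshold")
```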
Diversity metrics include:

- Lexical Diversity: Type-Token Ratio (TTR), vocabulary richness (illustrated in the sketch below)
- Role Distribution: Balance across conversation roles
- Length Variation: Content length diversity and coefficient of variation
- Character Diversity: Character set distribution and entropy
With the optional diversity extra installed, enhanced metrics add:

- Semantic Diversity: Embedding-based clustering and similarity analysis
- Syntactic Diversity: Part-of-speech patterns and dependency structures
- Topic Diversity: LDA-based topic modeling and distribution
- Advanced Lexical: MTLD, MATTR, and readability scores
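TTR and MATTR are standard lexical-diversity measures and are easy to compute by hand. The following sketch is independent of NLPBench's internals: TTR is the ratio of unique tokens to total tokens, and MATTR averages TTR over a sliding window to reduce sensitivity to text length:

```python
# Illustrative lexical-diversity sketch: Type-Token Ratio (TTR) and
# Moving-Average TTR (MATTR) over a fixed-size sliding window.
def ttr(tokens):
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mattr(tokens, window=50):
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

tokens = "the quick brown fox jumps over the lazy dog".split()
print(round(ttr(tokens), 3), round(mattr(tokens, window=5), 3))
```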
Each analysis produces:

- Weighted quality score (0-100)
- Diversity score (0-100)
- Quality level classification
- Actionable recommendations for improvement
Core requirements:

- Python 3.8+
- datasets >= 2.14.0
- great-expectations >= 0.18.0
- pydantic >= 2.0.0
- click >= 8.0.0
- rich >= 13.0.0
- tqdm >= 4.65.0
- pandas >= 2.0.0
- numpy >= 1.21.0
Optional dependencies for the enhanced diversity metrics (installed with the diversity extra):

- sentence-transformers >= 2.2.0 (semantic analysis)
- torch >= 1.9.0 (neural embeddings)
- scikit-learn >= 1.0.0 (clustering and metrics)
- nltk >= 3.7 (advanced lexical analysis)
- textstat >= 0.7.0 (readability metrics)
- spacy >= 3.4.0 (syntactic analysis)
- gensim >= 4.2.0 (topic modeling)
- umap-learn >= 0.5.0 (dimensionality reduction)
Set up a development environment:

# Install with development dependencies
uv sync --group dev
# Or using make
make dev-setup

Run the tests:

# Using uv
uv run pytest
# With coverage
uv run pytest --cov=src --cov-report=html
# Using make
make test
make test-cov

Check code quality:

# Linting
uv run ruff check .
make lint
# Auto-fix linting issues
uv run ruff check --fix .
make lint-fix
# Formatting
uv run ruff format .
make format
# Type checking
uv run mypy src/
make type-check
# Run all checks
make check

Common make targets:

make help # Show available commands
make install # Install dependencies
make install-dev # Install with dev dependencies
make test # Run tests
make test-cov # Run tests with coverage
make lint # Run linting
make format # Format code
make check # Run all quality checks
make clean # Clean temporary files
make build # Build the package
make demo # Run demo analysis
make demo-html # Generate HTML demo report

To contribute:

- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Run the test suite
- Submit a pull request
This project is licensed under the MIT License.
