A framework for benchmarking tabular data embeddings through text encoding.
tab2vec is a toolkit for evaluating different strategies to encode tabular data into text representations for embedding-based machine learning. The project aims to answer whether we can skip traditional feature engineering by converting raw tabular data into semantic text embeddings.
For a comprehensive explanation of the project's concepts and findings, see the blog post (`blog/tabular_data_to_embeddings.md`), which details our work optimizing templates for the Ames Housing dataset.
```bash
# Clone the repository
git clone https://github.com/WittmannF/tab2vec.git
cd tab2vec

# Create and activate a virtual environment using uv
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

# Install Ollama for local embeddings (optional)
# Follow instructions at https://ollama.ai
```
```
├── blog/                      # Blog posts and documentation
│   └── tabular_data_to_embeddings.md  # Detailed write-up of findings
├── configs/                   # Configuration files
│   ├── baseline/              # Configurations for baseline models
│   └── with_encoder/          # Configurations for encoder-based models
├── data/                      # Dataset storage
├── experiments/               # Experiment tracking
│   ├── reports/               # Detailed experiment reports
│   ├── results/               # CSV results from experiments
│   └── templates/             # Template variations for testing
├── main.py                    # Main CLI entry point
├── requirements.txt           # Project dependencies
├── runs/                      # Experiment runs and results
│   ├── regression/            # Regression task runs
│   │   └── YYYY-MM-DD-THHMMSS/  # Timestamped folder for each run
│   │       ├── metrics.json   # Performance metrics
│   │       └── config.yaml    # Configuration used for this run
│   └── classification/        # Classification task runs
├── src/                       # Source code
│   ├── data_gen.py            # Generates synthetic datasets
│   ├── embeddings.py          # Embedding generation utilities
│   ├── encoders/              # Text encoding strategies
│   │   ├── __init__.py
│   │   ├── encoder_pipeline.py  # Pipeline for applying encoders
│   │   └── raw_text.py        # Raw text encoder implementation
│   ├── get_real_datasets.py   # Utilities for fetching real datasets
│   ├── models/                # Model training and evaluation
│   │   ├── __init__.py
│   │   ├── baseline.py        # Baseline model implementations
│   │   └── model_pipeline.py  # Training and evaluation pipeline
│   └── template_benchmark.py  # Template benchmarking utilities
└── tests/                     # Test suite
    ├── test_data_gen.py
    ├── test_embeddings.py
    ├── test_model_pipeline.py
    ├── test_ollama_connection.py
    └── test_raw_text_encoder.py
```
The system is configured through YAML files in the `configs/` directory:

```yaml
task_type: regression        # or "classification"
target_col: SalePrice        # Target column name
feature_cols:                # Optional - features to use
  - LotArea
  - GrLivArea
  - ...
encoder:                     # List of encoders applied in order
  - raw_text
embedder: nomic-embed-text   # Embedding model to use
model:
  type: catboost             # ML model to use
  params:                    # Model-specific parameters
    iterations: 1000
    learning_rate: 0.03
```

The project uses uv for Python environment management:
```bash
# Set up the environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
```

The project includes tests to verify functionality:

```bash
# Run all tests
pytest

# Run specific test modules
pytest tests/test_data_gen.py

# Run with verbose output
pytest -v
```

Use the data generation module to create dummy datasets:
```bash
# Generate a regression dataset with abstract features
python main.py generate --task regression --samples 1000 --features 10 --output data/abstract_regression.csv

# Generate a classification dataset with abstract features
python main.py generate --task classification --samples 1000 --features 10 --output data/abstract_classification.csv

# Generate semantic datasets with various domains
python -m src.data_gen --task regression --semantic-domain customer --samples 1000 --output data/customer_regression.csv
python -m src.data_gen --task regression --semantic-domain health --samples 1000 --output data/health_regression.csv
python -m src.data_gen --task regression --semantic-domain realestate --samples 1000 --output data/realestate_regression.csv

# See all options
python -m src.data_gen --help
```

```bash
# Download and prepare the Ames Housing dataset
python -m src.get_real_datasets --dataset ames_housing
```

```bash
# Run baseline model on the Ames Housing dataset
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/baseline/ames_housing.yaml

# Run an embedding-based model with the raw text encoder
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml

# Use a custom template for the raw text encoder
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml --template "Real estate listing: ${price_per_sqft}/sqft {property_type} home with {bedrooms} BR, {bathrooms} BA, {area} sqft."
```

```bash
# Run benchmarks on different template designs
python -m src.template_benchmark --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml --templates experiments/templates/ames_housing/template_variations_v1.json --output experiments/results/ames_housing/benchmark_results.csv
```

To use your own dataset:
- Prepare a CSV file with your tabular data
- Update the configuration file to specify the target and feature columns
- Run the model pipeline as shown in the examples above
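Following the configuration format shown earlier, a minimal config for a custom dataset might look like this (the column names and file are illustrative, not part of the repository):

```yaml
task_type: classification    # or "regression"
target_col: churned          # Your target column
feature_cols:                # Your feature columns
  - tenure_months
  - monthly_charges
encoder:
  - raw_text
embedder: nomic-embed-text
model:
  type: catboost
  params:
    iterations: 500
```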
To add a new text encoder:

- Create a new file in the `src/encoders/` directory (e.g., `src/encoders/my_encoder.py`)
- Implement a class that inherits from the base encoder:
```python
from src.encoders import BaseEncoder, register_encoder

@register_encoder("my_encoder")
class MyEncoder(BaseEncoder):
    def encode(self, df):
        """
        Args:
            df (pandas.DataFrame): Tabular dataframe

        Returns:
            list: List of strings ready for embedding
        """
        # Your encoding logic here
        text_representations = []
        for _, row in df.iterrows():
            # Process each feature and create a text representation
            text = "Your encoding logic here"
            text_representations.append(text)
        return text_representations
```

- Import your encoder in `src/encoders/__init__.py` to register it
- Update your configuration file to use your encoder
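As a concrete illustration of the kind of logic `encode()` might contain, here is a hypothetical row-to-text function. It works on plain dicts so it runs standalone; the real method receives a `pandas.DataFrame` instead:

```python
def encode_rows_as_text(rows, columns):
    """Render each row (a dict) as comma-separated '<column> is <value>'
    phrases, producing one embedding-ready string per row."""
    texts = []
    for row in rows:
        parts = [f"{col} is {row[col]}" for col in columns]
        texts.append(", ".join(parts))
    return texts

# Example: one house listing becomes one sentence-like string
listing = encode_rows_as_text([{"bedrooms": 3, "area": 1500}], ["bedrooms", "area"])
# listing == ['bedrooms is 3, area is 1500']
```

Inside a real encoder, the same loop body would go in `encode()` with `df.iterrows()` supplying the rows.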
To use a different embedding model:

- Update the `src/embeddings.py` file to support your embedder
- Update your configuration file to specify your embedder
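For reference, a minimal standard-library sketch of talking to a local Ollama server. This assumes Ollama's `/api/embeddings` endpoint and the default port 11434; the project's actual `src/embeddings.py` may wrap a different client:

```python
import json
import urllib.request

def build_embeddings_request(model, prompt):
    """Build the JSON payload Ollama's /api/embeddings endpoint expects."""
    return {"model": model, "prompt": prompt}

def embed_text(text, model="nomic-embed-text", host="http://localhost:11434"):
    """POST one text to a local Ollama server, return the embedding vector."""
    payload = json.dumps(build_embeddings_request(model, text)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

This is what `tests/test_ollama_connection.py` effectively exercises: if the server is not running, the request fails fast with a connection error.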
The benchmark outputs the following metrics:

For regression tasks:
- R² (coefficient of determination)
- RMSE (root mean squared error)
- MAPE (mean absolute percentage error)

For classification tasks:
- Accuracy
- F1 score
- Precision and recall
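To make the regression metrics concrete, here is a pure-Python sketch of how R², RMSE, and MAPE are defined (the project itself likely relies on scikit-learn's implementations):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAPE from parallel lists of true and
    predicted values (MAPE assumes no true value is zero)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,          # 1.0 is a perfect fit
        "rmse": math.sqrt(ss_res / n),       # same units as the target
        "mape": sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n,
    }
```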
Results are saved in the runs/ directory with timestamps for easy comparison.
Higher R² values and lower RMSE/MAPE values indicate better model performance for regression tasks. For classification tasks, higher accuracy and F1 scores indicate better performance.
The benchmark also includes a comparison table showing how different encoding strategies perform. This helps identify which encoding methods best preserve the semantic meaning of numerical features when converted to text embeddings.