A framework for benchmarking tabular data embeddings through text encoding.
tab2vec is a toolkit for evaluating different strategies to encode tabular data into text representations for embedding-based machine learning. The project aims to answer whether we can skip traditional feature engineering by converting raw tabular data into semantic text embeddings.
For a comprehensive explanation of the project's concepts and findings, see the blog post (`blog/tabular_data_to_embeddings.md`), which details our work optimizing templates for the Ames Housing dataset.
```bash
# Clone the repository
git clone https://github.com/WittmannF/tab2vec.git
cd tab2vec

# Create and activate a virtual environment using uv
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

# Install Ollama for local embeddings (optional)
# Follow instructions at https://ollama.ai
```
```
├── blog/                      # Blog posts and documentation
│   └── tabular_data_to_embeddings.md  # Detailed write-up of findings
├── configs/                   # Configuration files
│   ├── baseline/              # Configurations for baseline models
│   └── with_encoder/          # Configurations for encoder-based models
├── data/                      # Dataset storage
├── experiments/               # Experiment tracking
│   ├── reports/               # Detailed experiment reports
│   ├── results/               # CSV results from experiments
│   └── templates/             # Template variations for testing
├── main.py                    # Main CLI entry point
├── requirements.txt           # Project dependencies
├── runs/                      # Experiment runs and results
│   ├── regression/            # Regression task runs
│   │   └── YYYY-MM-DD-THHMMSS/  # Timestamped folder for each run
│   │       ├── metrics.json   # Performance metrics
│   │       └── config.yaml    # Configuration used for this run
│   └── classification/        # Classification task runs
├── src/                       # Source code
│   ├── data_gen.py            # Generates synthetic datasets
│   ├── embeddings.py          # Embedding generation utilities
│   ├── encoders/              # Text encoding strategies
│   │   ├── __init__.py
│   │   ├── encoder_pipeline.py  # Pipeline for applying encoders
│   │   └── raw_text.py        # Raw text encoder implementation
│   ├── get_real_datasets.py   # Utilities for fetching real datasets
│   ├── models/                # Model training and evaluation
│   │   ├── __init__.py
│   │   ├── baseline.py        # Baseline model implementations
│   │   └── model_pipeline.py  # Training and evaluation pipeline
│   └── template_benchmark.py  # Template benchmarking utilities
└── tests/                     # Test suite
    ├── test_data_gen.py
    ├── test_embeddings.py
    ├── test_model_pipeline.py
    ├── test_ollama_connection.py
    └── test_raw_text_encoder.py
```
The system is configured through YAML files in the `configs/` directory:

```yaml
task_type: regression        # or "classification"
target_col: SalePrice        # Target column name
feature_cols:                # Optional - features to use
  - LotArea
  - GrLivArea
  - ...
encoder:                     # List of encoders applied in order
  - raw_text
embedder: nomic-embed-text   # Embedding model to use
model:
  type: catboost             # ML model to use
  params:                    # Model-specific parameters
    iterations: 1000
    learning_rate: 0.03
```

The project uses uv for Python environment management:
```bash
# Set up the environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
```

The project includes tests to verify functionality:

```bash
# Run all tests
pytest

# Run specific test modules
pytest tests/test_data_gen.py

# Run with verbose output
pytest -v
```

Use the data generation module to create dummy datasets:
```bash
# Generate a regression dataset with abstract features
python main.py generate --task regression --samples 1000 --features 10 --output data/abstract_regression.csv

# Generate a classification dataset with abstract features
python main.py generate --task classification --samples 1000 --features 10 --output data/abstract_classification.csv

# Generate semantic datasets with various domains
python -m src.data_gen --task regression --semantic-domain customer --samples 1000 --output data/customer_regression.csv
python -m src.data_gen --task regression --semantic-domain health --samples 1000 --output data/health_regression.csv
python -m src.data_gen --task regression --semantic-domain realestate --samples 1000 --output data/realestate_regression.csv

# See all options
python -m src.data_gen --help
```

```bash
# Download and prepare the Ames Housing dataset
python -m src.get_real_datasets --dataset ames_housing
```

```bash
# Run baseline model on the Ames Housing dataset
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/baseline/ames_housing.yaml

# Run an embedding-based model with the raw text encoder
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml

# Use a custom template for the raw text encoder
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml --template "Real estate listing: ${price_per_sqft}/sqft {property_type} home with {bedrooms} BR, {bathrooms} BA, {area} sqft."
```

```bash
# Run benchmarks on different template designs
python -m src.template_benchmark --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml --templates experiments/templates/ames_housing/template_variations_v1.json --output experiments/results/ames_housing/benchmark_results.csv
```

To use your own dataset:
- Prepare a CSV file with your tabular data
- Update the configuration file to specify the target and feature columns
- Run the model pipeline as shown in the examples above
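Following the configuration format shown earlier, a minimal config for a custom dataset might look like this (the column names and file are illustrative, not part of the repository):

```yaml
task_type: classification    # or "regression"
target_col: churned          # Your target column
feature_cols:                # Your feature columns
  - tenure_months
  - monthly_charges
encoder:
  - raw_text
embedder: nomic-embed-text
model:
  type: catboost
  params:
    iterations: 500
```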
To add a new text encoder:

- Create a new file in the `src/encoders/` directory (e.g., `src/encoders/my_encoder.py`)
- Implement a class that inherits from the base encoder:
```python
from src.encoders import BaseEncoder, register_encoder

@register_encoder("my_encoder")
class MyEncoder(BaseEncoder):
    def encode(self, df):
        """
        Args:
            df (pandas.DataFrame): Tabular dataframe

        Returns:
            list: List of strings ready for embedding
        """
        # Your encoding logic here
        text_representations = []
        for _, row in df.iterrows():
            # Process each feature and create a text representation
            text = "Your encoding logic here"
            text_representations.append(text)
        return text_representations
```

- Import your encoder in `src/encoders/__init__.py` to register it
- Update your configuration file to use your encoder
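As a concrete illustration of the kind of logic `encode()` might contain, here is a hypothetical row-to-text function. It works on plain dicts so it runs standalone; the real method receives a `pandas.DataFrame` instead:

```python
def encode_rows_as_text(rows, columns):
    """Render each row (a dict) as comma-separated '<column> is <value>'
    phrases, producing one embedding-ready string per row."""
    texts = []
    for row in rows:
        parts = [f"{col} is {row[col]}" for col in columns]
        texts.append(", ".join(parts))
    return texts

# Example: one house listing becomes one sentence-like string
listing = encode_rows_as_text([{"bedrooms": 3, "area": 1500}], ["bedrooms", "area"])
# listing == ['bedrooms is 3, area is 1500']
```

Inside a real encoder, the same loop body would go in `encode()` with `df.iterrows()` supplying the rows.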
To use a different embedding model:

- Update the `src/embeddings.py` file to support your embedder
- Update your configuration file to specify your embedder
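For reference, a minimal standard-library sketch of talking to a local Ollama server. This assumes Ollama's `/api/embeddings` endpoint and the default port 11434; the project's actual `src/embeddings.py` may wrap a different client:

```python
import json
import urllib.request

def build_embeddings_request(model, prompt):
    """Build the JSON payload Ollama's /api/embeddings endpoint expects."""
    return {"model": model, "prompt": prompt}

def embed_text(text, model="nomic-embed-text", host="http://localhost:11434"):
    """POST one text to a local Ollama server, return the embedding vector."""
    payload = json.dumps(build_embeddings_request(model, text)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

This is what `tests/test_ollama_connection.py` effectively exercises: if the server is not running, the request fails fast with a connection error.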
The benchmark outputs the following metrics:

For regression tasks:
- R² (coefficient of determination)
- RMSE (root mean squared error)
- MAPE (mean absolute percentage error)

For classification tasks:
- Accuracy
- F1 score
- Precision and recall
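To make the regression metrics concrete, here is a pure-Python sketch of how R², RMSE, and MAPE are defined (the project itself likely relies on scikit-learn's implementations):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAPE from parallel lists of true and
    predicted values (MAPE assumes no true value is zero)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,          # 1.0 is a perfect fit
        "rmse": math.sqrt(ss_res / n),       # same units as the target
        "mape": sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n,
    }
```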
Results are saved in the runs/ directory with timestamps for easy comparison.
Higher R² values and lower RMSE/MAPE values indicate better model performance for regression tasks. For classification tasks, higher accuracy and F1 scores indicate better performance.
The benchmark also includes a comparison table showing how different encoding strategies perform. This helps identify which encoding methods best preserve the semantic meaning of numerical features when converted to text embeddings.