
tab2vec

A framework for benchmarking tabular data embeddings through text encoding.

Overview

tab2vec is a toolkit for evaluating different strategies to encode tabular data into text representations for embedding-based machine learning. The project aims to answer whether we can skip traditional feature engineering by converting raw tabular data into semantic text embeddings.

For a comprehensive explanation of the project's concepts and findings, see the blog post in blog/tabular_data_to_embeddings.md, which details the process of optimizing templates for the Ames Housing dataset.

Installation

# Clone the repository
git clone https://github.com/WittmannF/tab2vec.git
cd tab2vec

# Create and activate a virtual environment using uv
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

# Install Ollama for local embeddings (optional)
# Follow instructions at https://ollama.ai/

Project Structure

tab2vec/
├── blog/                  # Blog posts and documentation
│   └── tabular_data_to_embeddings.md # Detailed write-up of findings
├── configs/               # Configuration files
│   ├── baseline/          # Configurations for baseline models
│   └── with_encoder/      # Configurations for encoder-based models
├── data/                  # Dataset storage
├── experiments/           # Experiment tracking
│   ├── reports/           # Detailed experiment reports
│   ├── results/           # CSV results from experiments
│   └── templates/         # Template variations for testing
├── main.py                # Main CLI entry point
├── requirements.txt       # Project dependencies
├── runs/                  # Experiment runs and results
│   ├── regression/        # Regression task runs
│   │   └── YYYY-MM-DD-THHMMSS/  # Timestamp folder for each run
│   │       ├── metrics.json     # Performance metrics
│   │       └── config.yaml      # Configuration used for this run
│   └── classification/    # Classification task runs
├── src/                   # Source code
│   ├── data_gen.py        # Generates synthetic datasets
│   ├── embeddings.py      # Embedding generation utilities
│   ├── encoders/          # Text encoding strategies
│   │   ├── __init__.py
│   │   ├── encoder_pipeline.py  # Pipeline for applying encoders
│   │   └── raw_text.py          # Raw text encoder implementation
│   ├── get_real_datasets.py # Utilities for fetching real datasets
│   ├── models/            # Model training and evaluation
│   │   ├── __init__.py
│   │   ├── baseline.py    # Baseline model implementations
│   │   └── model_pipeline.py # Training and evaluation pipeline
│   └── template_benchmark.py # Template benchmarking utilities
└── tests/                 # Test suite
    ├── test_data_gen.py
    ├── test_embeddings.py
    ├── test_model_pipeline.py
    ├── test_ollama_connection.py
    └── test_raw_text_encoder.py

Configuration

The system is configured through YAML files in the configs/ directory:

task_type: regression          # or "classification"
target_col: SalePrice          # Target column name
feature_cols:                  # Optional - features to use
  - LotArea
  - GrLivArea
  - ...
encoder:                       # List of encoders applied in order
  - raw_text
embedder: nomic-embed-text     # Embedding model to use
model:
  type: catboost               # ML model to use
  params:                      # Model-specific parameters
    iterations: 1000
    learning_rate: 0.03
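A config like the one above can be parsed with PyYAML. The sketch below is illustrative, not the repo's actual loader; the validation rules shown are assumptions based on the fields documented above:

```python
import yaml

CONFIG = """
task_type: regression
target_col: SalePrice
encoder:
  - raw_text
embedder: nomic-embed-text
model:
  type: catboost
  params:
    iterations: 1000
    learning_rate: 0.03
"""

def load_config(text):
    """Parse a YAML config string and apply basic sanity checks."""
    cfg = yaml.safe_load(text)
    if cfg["task_type"] not in ("regression", "classification"):
        raise ValueError(f"unknown task_type: {cfg['task_type']}")
    if not cfg.get("target_col"):
        raise ValueError("target_col is required")
    return cfg

cfg = load_config(CONFIG)
```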

Setup and Testing

Environment Setup

The project uses uv for Python environment management; see the Installation section above for the setup commands.

Running Tests

The project includes tests to verify functionality:

# Run all tests
pytest

# Run specific test modules
pytest tests/test_data_gen.py

# Run with verbose output
pytest -v

Generating Sample Datasets

Use the data generation module to create synthetic datasets:

# Generate a regression dataset with abstract features
python main.py generate --task regression --samples 1000 --features 10 --output data/abstract_regression.csv

# Generate a classification dataset with abstract features
python main.py generate --task classification --samples 1000 --features 10 --output data/abstract_classification.csv

# Generate semantic datasets with various domains
python -m src.data_gen --task regression --semantic-domain customer --samples 1000 --output data/customer_regression.csv
python -m src.data_gen --task regression --semantic-domain health --samples 1000 --output data/health_regression.csv
python -m src.data_gen --task regression --semantic-domain realestate --samples 1000 --output data/realestate_regression.csv

# See all options
python -m src.data_gen --help
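Conceptually, generating an abstract regression dataset amounts to sampling random features and a noisy linear combination as the target. The actual src/data_gen.py may work differently; this is a minimal numpy/pandas sketch of the idea:

```python
import numpy as np
import pandas as pd

def generate_regression(samples=1000, features=10, noise=0.1, seed=0):
    """Generate an abstract regression dataset: random features
    plus a noisy linear combination of them as the target."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(samples, features))
    weights = rng.normal(size=features)
    y = X @ weights + noise * rng.normal(size=samples)
    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(features)])
    df["target"] = y
    return df

df = generate_regression(samples=100, features=5)
```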

Usage

Working with Real-World Datasets

# Download and prepare the Ames Housing dataset
python -m src.get_real_datasets --dataset ames_housing

Running Models

# Run baseline model on the Ames Housing dataset
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/baseline/ames_housing.yaml

# Run an embedding-based model with the raw text encoder
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml

# Use a custom template for the raw text encoder
python -m src.models.model_pipeline --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml --template "Real estate listing: ${price_per_sqft}/sqft {property_type} home with {bedrooms} BR, {bathrooms} BA, {area} sqft."
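The --template string uses column names as placeholders, so filling it per row is essentially Python's str.format applied to each record. A hypothetical illustration (the column names and rendering loop here are examples, not the encoder's actual internals):

```python
import pandas as pd

template = ("Real estate listing: ${price_per_sqft}/sqft {property_type} home "
            "with {bedrooms} BR, {bathrooms} BA, {area} sqft.")

df = pd.DataFrame([
    {"price_per_sqft": 180, "property_type": "single-family",
     "bedrooms": 3, "bathrooms": 2, "area": 1500},
])

# Render one text string per row; these strings are what gets embedded.
texts = [template.format(**row) for row in df.to_dict(orient="records")]
```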

Benchmarking Templates

# Run benchmarks on different template designs
python -m src.template_benchmark --data data/ames_housing.csv --config configs/with_encoder/ames_housing_raw_text.yaml --templates experiments/templates/ames_housing/template_variations_v1.json --output experiments/results/ames_housing/benchmark_results.csv

Using your own dataset

To use your own dataset:

  1. Prepare a CSV file with your tabular data
  2. Update the configuration file to specify the target and feature columns
  3. Run the model pipeline as shown in the examples above
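For example, a minimal configuration for a hypothetical customer-churn dataset (the column names below are illustrative, not from the repo) might look like:

```yaml
task_type: classification
target_col: churned
feature_cols:
  - tenure_months
  - monthly_charges
encoder:
  - raw_text
embedder: nomic-embed-text
model:
  type: catboost
```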

Adding New Components

Adding a New Encoder

  1. Create a new file in the src/encoders/ directory (e.g., src/encoders/my_encoder.py)
  2. Implement a class that inherits from the base encoder:
from src.encoders import BaseEncoder, register_encoder

@register_encoder("my_encoder")
class MyEncoder(BaseEncoder):
    def encode(self, df):
        """
        Args:
            df (pandas.DataFrame): Tabular dataframe
            
        Returns:
            list: List of strings ready for embedding
        """
        # Your encoding logic here
        text_representations = []
        for _, row in df.iterrows():
            # Process each feature and create a text representation
            text = "Your encoding logic here"
            text_representations.append(text)
        return text_representations
  3. Import your encoder in src/encoders/__init__.py to register it
  4. Update your configuration file to use your encoder

Adding a New Embedder

  1. Update the src/embeddings.py file to support your embedder
  2. Update your configuration file to specify your embedder

Benchmark Results

The benchmark outputs the following metrics:

For Regression Tasks:

  • R² (Coefficient of determination)
  • RMSE (Root Mean Squared Error)
  • MAPE (Mean Absolute Percentage Error)

For Classification Tasks:

  • Accuracy
  • F1 Score
  • Precision and Recall

Results are saved in the runs/ directory with timestamps for easy comparison.
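The regression metrics above are straightforward to compute; a minimal numpy sketch (not the project's actual evaluation code) of their definitions:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAPE for a regression run."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(resid ** 2)),
        "mape": np.mean(np.abs(resid / y_true)),  # assumes no zero targets
    }

m = regression_metrics([100, 200, 300], [110, 190, 310])
```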

Interpreting Results

Higher R² values and lower RMSE/MAPE values indicate better model performance for regression tasks. For classification tasks, higher accuracy and F1 scores indicate better performance.

The benchmark also includes a comparison table showing how different encoding strategies perform. This helps identify which encoding methods best preserve the semantic meaning of numerical features when converted to text embeddings.

License

MIT License
