Word2Vec gave NLP a breakthrough by mapping words to vectors where similar meanings sit close together. The world embedding applies the same idea to the economy: it maps each trading day to a vector where days with similar economic conditions sit close together, so the distance between any two days measures how different the economic environment was.
The world embedding is a daily, low-dimensional vector that compresses news narratives, financial market data, policy uncertainty, geopolitical risk, and macroeconomic releases into a unified representation of the aggregate economic state. A machine-learning-based multimodal encoder produces this embedding under a strict expanding-window protocol, ensuring the series is look-ahead-free at every point in time. Unsupervised clustering of the world embedding recovers known business-cycle regimes with higher fidelity than linear methods, and the representation carries incremental out-of-sample forecasting power for labor-market indicators. The primary application addresses the interest-rate spanning puzzle: embedding principal components capture unspanned macro risks that raise the in-sample R-squared for bond excess returns by 10 to 34 percentage points beyond standard yield-curve factors. This predictive content originates from non-yield information and survives out-of-sample evaluation, orthogonalization, and a pseudo-out-of-sample extension through the Covid-19 pandemic.
This repository releases the full output: 9,520 pre-computed daily vectors covering January 2, 1985 through June 30, 2021, pre-trained model weights for all expanding windows, and the model architecture code. The vectors can be merged into any dataset in three lines of code. No GPU, no deep-learning expertise, no retraining required.
The full methodology and results are detailed in:
World Embedding: The Daily Economic State and Bond Risk Premia
If you use this data or methodology in your research, please cite the paper:
```bibtex
@article{Tabatabaei2026WorldEmbedding,
  title   = {World Embedding: The Daily Economic State and Bond Risk Premia},
  author  = {Tabatabaei, Elham},
  year    = {2026},
  journal = {SSRN Working Paper},
  number  = {6503446},
  url     = {https://papers.ssrn.com/abstract=6503446}
}
```

- Pre-computed daily vectors. 9,520 business days (1985-2021) of 64-dimensional embedding vectors, ready to download as a CSV and merge into any dataset in three lines of code.
- Seven information modalities. News narratives, economic policy uncertainty, geopolitical risk, news sentiment, domestic financial markets, international markets and FX, and macroeconomic indicators - all compressed into a single vector per day.
- Look-ahead-free. Trained under a strict expanding-window protocol. The embedding for day t uses only information available through day t.
- Validated against known benchmarks. Out-of-sample correlation of 0.50 with the ADS Business Conditions Index (not used in training). Unsupervised clustering recovers NBER recessions with 13-26x higher alignment than linear PCA.
- Bond risk premia. Embedding principal components raise out-of-sample R-squared for bond excess returns by 10-34 percentage points beyond yield-curve factors.
- Near-zero equity return prediction. The embedding does not predict stock returns, confirming it captures the economic state rather than exploitable return patterns.
- Covid-robust. With all parameters frozen at December 2017, the embedding tracks the 2020 contraction and recovery without retraining.
- Pre-trained weights included. All model weights for the three expanding windows and the full-sample reference model are available under Releases, along with hyperparameters and normalization statistics, so you can load the trained model directly without retraining.
- Pip-installable. Available as a Python package with convenience functions for loading embeddings, extracting principal components, and accessing regime labels.
Download the CSV and add embedding principal components to your regressions:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Load pre-computed daily embeddings (see Data Periods for train/test/simulation splits)
df = pd.read_csv("data/world_embedding_daily.csv", parse_dates=["date"], index_col="date")

# Extract the first 5 principal components
pca = PCA(n_components=5)
epc = pd.DataFrame(
    pca.fit_transform(df.values),
    index=df.index,
    columns=[f"EPC{i+1}" for i in range(5)],
)

# Merge into your dataset (your_data is your own date-indexed DataFrame)
your_data = your_data.merge(epc, left_index=True, right_index=True, how="left")
```

Alternatively, install the package:

```
pip install git+https://github.com/elittb/world-embedding.git
```

```python
from worldembedding import load_embedding, load_regime_labels, get_principal_components

# Daily 64-dim embedding vectors (pandas DataFrame)
emb = load_embedding()

# Unsupervised regime labels (16 VQ codes)
regimes = load_regime_labels()

# Principal components (convenience function)
epc = get_principal_components(n_components=5)
```

```
world-embedding/
├── README.md                          # This file
├── CITATION.cff                       # GitHub "Cite this repository" metadata
├── LICENSE                            # MIT license
├── pyproject.toml                     # Package build configuration
├── data/
│   ├── world_embedding_daily.csv          # Pre-computed 64-dim daily vectors (9,520 days)
│   ├── world_embedding_regime_labels.csv  # Unsupervised regime labels (16 codes)
│   └── README.md                          # Data documentation and period details
├── worldembedding/                    # Pip-installable Python package
│   ├── __init__.py
│   ├── core.py                        # load_embedding(), get_principal_components()
│   └── model/                         # Full DSSDE architecture (inspect the pipeline)
│       ├── dssde.py                   # Main model class
│       ├── encoder.py                 # Modality encoders, cross-modal attention, GRU
│       ├── decoder.py                 # Observation reconstruction decoder
│       ├── loss.py                    # Composite loss (CPC, reconstruction, contrastive)
│       └── vq.py                      # Vector quantization for regime discovery
├── examples/
│   ├── quickstart.py                  # 30-line demo: load, PCA, historical analogs
│   ├── bond_spanning.py               # Replicate the spanning puzzle result
│   └── regime_analysis.py             # Unsupervised recession detection
└── docs/
    ├── index.html                     # Landing page for GitHub Pages / SEO
    └── SEARCH_CONSOLE.md              # Google Search Console setup guide
```
| Application | How to use |
|---|---|
| Event studies | Control for aggregate state using EPC1-EPC5 instead of ad hoc market-return or VIX controls |
| Asset pricing tests | Replace separate VIX + term spread + credit spread controls with a single parsimonious set of embedding PCs |
| Bond return prediction | Embedding PCs capture unspanned macro risks beyond yield-curve factors |
| Regime classification | Use VQ regime labels or k-means on the embedding for data-driven expansion/recession classification |
| Macro forecasting | Embedding carries incremental information for labor-market indicators in crisis periods |
| Historical analogs | Cosine similarity retrieval: find the most economically similar days across decades |
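The last application above (historical analogs) can be sketched in a few lines. `nearest_analogs` is a hypothetical helper shown here on synthetic vectors, not a package function; with the real data you would load `data/world_embedding_daily.csv` instead of generating random ones:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the daily embedding (the real CSV is 9,520 days x 64 dims)
dates = pd.bdate_range("1985-01-02", periods=500)
emb = pd.DataFrame(rng.standard_normal((len(dates), 64)), index=dates)

def nearest_analogs(emb: pd.DataFrame, query_date, k: int = 5) -> pd.Series:
    """Return the k days whose embeddings are most cosine-similar to query_date."""
    q = emb.loc[query_date].to_numpy()
    m = emb.to_numpy()
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    # Exclude the query day itself, keep the k closest remaining days
    return pd.Series(sims, index=emb.index).drop(query_date).nlargest(k)

analogs = nearest_analogs(emb, dates[-1], k=5)
print(analogs)
```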
The embedding CSV contains 9,520 business days spanning two distinct periods:
| Period | Dates | Status | News data | All other modalities |
|---|---|---|---|---|
| Training & OOS test | 1985-01-02 to 2017-12-29 | Model trained and evaluated with expanding-window protocol | Actual (Bybee et al. WSJ topics) | Actual |
| Pseudo-OOS extension | 2018-01-02 to 2021-06-30 | All model parameters frozen at Dec 2017 values; no retraining | FAVAR-simulated (see paper Section 7) | Actual |
Within the training period (1985-2017), three expanding windows are used for out-of-sample evaluation:
| Window | Train | Out-of-sample test |
|---|---|---|
| W1 | 1985-2000 | 2001-2005 |
| W2 | 1985-2005 | 2006-2011 |
| W3 | 1985-2011 | 2012-2017 |
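The window boundaries in the table above can be applied to any date-indexed DataFrame with a small helper. `WINDOWS` and `split_window` are illustrative names, not part of the package:

```python
import pandas as pd

# Expanding-window splits from the table above (year boundaries, inclusive)
WINDOWS = {
    "W1": {"train": (1985, 2000), "test": (2001, 2005)},
    "W2": {"train": (1985, 2005), "test": (2006, 2011)},
    "W3": {"train": (1985, 2011), "test": (2012, 2017)},
}

def split_window(df: pd.DataFrame, window: str):
    """Split a date-indexed DataFrame into (train, test) for one expanding window."""
    (t0, t1), (s0, s1) = WINDOWS[window]["train"], WINDOWS[window]["test"]
    train = df[(df.index.year >= t0) & (df.index.year <= t1)]
    test = df[(df.index.year >= s0) & (df.index.year <= s1)]
    return train, test

# Demo on a synthetic business-day index covering the training period
dates = pd.bdate_range("1985-01-02", "2017-12-29")
df = pd.DataFrame({"x": range(len(dates))}, index=dates)
train, test = split_window(df, "W1")
print(train.index.max(), test.index.min())
```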
Important: For the 2018-2021 extension, news narratives (360 of 505 features) are FAVAR-simulated from the historical relationship between news topics and observed market/macro data. The remaining six modalities use actual observed data. See Section 7 of the paper for full details.
| Modality | Features | Source |
|---|---|---|
| News narratives | 360 | Bybee et al. (2024) WSJ topic attention |
| Policy uncertainty | 38 | Baker, Bloom & Davis (2016) EPU |
| Geopolitical risk | 5 | Caldara & Iacoviello (2022) GPR |
| News sentiment | 1 | SF Fed Daily News Sentiment |
| Domestic markets | 9 | Yahoo Finance (SPY, VIX, Treasury, commodities, USD) |
| International | 44 | Yahoo Finance + FRED (equities, FX, commodities) |
| Macro indicators | 48 | FRED (yields, spreads, conditions, labor, housing) |
| Total | 505 | |
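For orientation, the per-day model input is just the concatenation of these seven modality blocks into one 505-dimensional vector. The block contents below are random stand-ins for the real sources listed in the table:

```python
import numpy as np

# Feature counts per modality, from the table above (total 505)
MODALITIES = {
    "news": 360, "epu": 38, "gpr": 5, "sentiment": 1,
    "domestic_markets": 9, "international": 44, "macro": 48,
}

rng = np.random.default_rng(0)
# Hypothetical per-modality feature blocks for a single day
blocks = {name: rng.standard_normal(k) for name, k in MODALITIES.items()}

# One day's raw observation vector: all modalities stacked in a fixed order
x_t = np.concatenate([blocks[name] for name in MODALITIES])
print(x_t.shape)
```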
The Daily State-Space Deep Embedding (DSSDE) implements a simple economic principle: today's state = yesterday's state + today's news.
```
                                       Cross-Modal
News (360) ---> [ Encoder 1 ] ---+    +-----------+      +-------+
EPU   (38) ---> [ Encoder 2 ] ---+    |           |      |       |
GPR    (5) ---> [ Encoder 3 ] ---+    | Attention |      |       |
Sent   (1) ---> [ Encoder 4 ] ---+--> |           | ---> |  GRU  | ---> z_t
Mkt    (9) ---> [ Encoder 5 ] ---+    |  Fusion   |      |       |      (64-dim)
Intl  (44) ---> [ Encoder 6 ] ---+    |           |      +---+---+
Macro (48) ---> [ Encoder 7 ] ---+    +-----------+          |
                                                          z_{t-1}
```
Econometric analog: The architecture is a nonlinear state-space model. The encoders + attention = nonlinear observation equation. The GRU = nonlinear state transition (generalized Kalman filter). The expanding-window protocol = real-time econometric discipline.
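The state transition can be made concrete with a minimal numpy sketch of a single GRU update. Weights are randomly initialized and the observation dimension is illustrative, so this shows the functional form only, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_STATE = 128, 64  # fused-observation and state dimensions (illustrative)

# Randomly initialized GRU weights; in the real model these are learned
W = {g: rng.standard_normal((D_STATE, D_OBS + D_STATE)) * 0.1 for g in ("z", "r", "h")}
b = {g: np.zeros(D_STATE) for g in ("z", "r", "h")}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(obs_t, z_prev):
    """One state update: today's state = f(yesterday's state, today's fused observation)."""
    xu = np.concatenate([obs_t, z_prev])
    update = sigmoid(W["z"] @ xu + b["z"])   # how much of the state to overwrite
    reset = sigmoid(W["r"] @ xu + b["r"])    # how much of the old state to consult
    cand = np.tanh(W["h"] @ np.concatenate([obs_t, reset * z_prev]) + b["h"])
    return (1 - update) * z_prev + update * cand

# Roll the state forward over five synthetic "days"
z = np.zeros(D_STATE)
for _ in range(5):
    z = gru_step(rng.standard_normal(D_OBS), z)
print(z.shape)
```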
See the paper for full details.
```
pip install pandas scikit-learn
```

No GPU, no PyTorch, no ML expertise required. Download the CSV and go.
```
pip install git+https://github.com/elittb/world-embedding.git
```

Python version: 3.9+
Core dependencies: pandas>=2.0, numpy>=1.24, scikit-learn>=1.3
Pre-trained model weights for all three expanding windows and the reference model are available under Releases.
Full training and replication code will be released upon publication. In the meantime, it is available upon request - please contact the author (see Contact).
The embedding is evaluated through multiple independent tests, all reported in the paper:
| Test | Result |
|---|---|
| ADS correlation | Out-of-sample correlation of 0.50 with the Aruoba-Diebold-Scotti Business Conditions Index, which is never used in training |
| NBER recession recovery | Unsupervised k-means clustering on the embedding recovers NBER recession dates with 13-26x higher alignment than linear PCA on the same inputs |
| Bond excess returns | Embedding PCs raise out-of-sample R-squared for 2- through 5-year bond excess returns by 10-34 percentage points beyond yield-curve factors alone |
| Clark-West tests | Statistically significant at the 1% level for all maturities at the annual forecast horizon, rejecting equal predictive ability against the yield-only benchmark |
| Giacomini-White tests | Confirm superior conditional predictive ability of the embedding-augmented model across expansion and recession subsamples |
| Equity return prediction | Near-zero out-of-sample R-squared for equity excess returns, confirming the embedding measures the economic state rather than exploitable return patterns |
| Covid-19 pseudo-OOS | With all parameters frozen at December 2017, the embedding tracks the 2020 contraction and recovery as the largest displacement in its 36-year history - no retraining, no parameter updating |
| Macro nowcasting | Real-time ADS nowcasting from the embedding achieves positive out-of-sample R-squared during crisis periods when standard models deteriorate |
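For readers who want to run the nested-model comparison on their own forecasts, a bare-bones version of the Clark-West (2007) MSPE-adjusted statistic looks like this. Data are synthetic and the implementation is a sketch; the paper's version may differ (e.g. in standard-error treatment):

```python
import numpy as np

def clark_west(y, yhat_restricted, yhat_augmented):
    """MSPE-adjusted Clark-West statistic for comparing nested forecasting models.

    Positive and significant values favor the augmented (larger) model.
    """
    e1 = y - yhat_restricted
    e2 = y - yhat_augmented
    adj = (yhat_restricted - yhat_augmented) ** 2
    f = e1**2 - (e2**2 - adj)  # MSPE-adjusted loss differential
    n = len(f)
    return np.sqrt(n) * f.mean() / f.std(ddof=1)  # one-sided t-statistic

# Synthetic illustration: the augmented forecast tracks y, the restricted one does not
rng = np.random.default_rng(0)
y = rng.standard_normal(500)
stat = clark_west(y, np.zeros_like(y), y + 0.3 * rng.standard_normal(500))
print(round(stat, 2))
```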
Contributions that extend the world embedding or apply it to new settings are welcome. Areas of particular interest include:
- Temporal extension. Updating the embedding beyond June 2021 as new data becomes available
- Additional modalities. Integrating new data sources (e.g., social media sentiment, satellite imagery, earnings call transcripts) as input features
- Cross-country versions. Adapting the architecture to produce daily state embeddings for non-U.S. economies
- Alternative architectures. Experimenting with Transformer-based or diffusion-based state transition models in place of the GRU
- New applications. Using the embedding in settings beyond bond pricing and macro forecasting (e.g., corporate event studies, credit risk, portfolio construction)
- Benchmark comparisons. Evaluating the embedding against other state measures or dimensionality reduction methods on standardized tasks
To contribute, please open an issue or submit a pull request. If you use the embedding in a published study, a citation to the paper is appreciated.
This project is licensed under the MIT License. See LICENSE for details.
The pre-computed embedding vectors in data/ are released under CC BY 4.0, meaning you can use them freely with attribution.
PhD Candidate, Schulich School of Business, York University; Visiting Scholar, Rady School of Management, UC San Diego
