# Minifying Markdown Tables for Efficient Embedding
This notebook demonstrates how to minify markdown tables with **pymdt2json**, reducing their character count for embedding models with strict input limits.

In [16]:
%pip install pymdt2json llama-index pandas transformers tabulate

Note: you may need to restart the kernel to use updated packages.


In [17]:
!git clone https://github.com/amadou-6e/ai-data-zoo.git
!tar -xf ai-data-zoo/markdown.zip

Cloning into 'ai-data-zoo'...


### Step 1: Load Markdown Documents
We'll load a few example markdown files containing large tables.

In [18]:
from pathlib import Path
from llama_index.core import SimpleDirectoryReader

# Load markdown documents
source_dir = Path("markdown")  # Replace with your actual directory
documents = SimpleDirectoryReader(source_dir, required_exts=[".md"], recursive=True).load_data()
print(f"Loaded {len(documents)} documents.")

Loaded 16 documents.


### Step 2: Minify Markdown Tables
Convert the markdown tables in each document to compact JSON blocks to reduce the character count.

In [19]:
from pymdt2json import MinifyMDT

minified_docs = []

for idx, doc in enumerate(documents):
    minified_text = MinifyMDT(doc.text_resource.text).transform()
    minified_docs.append(minified_text)
    print(f"Document {idx} minified. Length: {len(minified_text)} characters.")

Document 0 minified. Length: 2379 characters.
Document 1 minified. Length: 2379 characters.
Document 2 minified. Length: 8196 characters.
Document 3 minified. Length: 13301 characters.
Document 4 minified. Length: 43774 characters.
Document 5 minified. Length: 11700 characters.
Document 6 minified. Length: 52743 characters.
Document 7 minified. Length: 22386 characters.
Document 8 minified. Length: 66389 characters.
Document 9 minified. Length: 145155 characters.
Document 10 minified. Length: 59313 characters.
Document 11 minified. Length: 37728 characters.
Document 12 minified. Length: 44536 characters.
Document 13 minified. Length: 117099 characters.
Document 14 minified. Length: 73503 characters.
Document 15 minified. Length: 12962 characters.
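
To make the idea of "table to compact JSON" concrete, here is a minimal standard-library sketch. It is not pymdt2json's implementation: the function name `table_to_compact_json` and the column-oriented output layout are assumptions made for illustration, and the library's real output format may differ. It also ignores edge cases such as escaped pipes inside cells.

```python
import json


def table_to_compact_json(markdown_table: str) -> str:
    """Convert one pipe-delimited markdown table into a compact JSON string.

    Illustrative only: this is not pymdt2json's implementation, and the
    column-oriented output layout is an assumption made for this sketch.
    """
    rows = []
    for line in markdown_table.strip().splitlines():
        # Drop the outer pipes, then strip the padding around each cell.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append(cells)

    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator row
    columns = {name: [row[i] for row in body] for i, name in enumerate(header)}
    # separators=(",", ":") removes the spaces json.dumps inserts by default.
    return json.dumps(columns, separators=(",", ":"))


sample = """
| Name    |   Age | City   |
|:--------|------:|:-------|
| Person0 |    20 | City0  |
| Person1 |    21 | City1  |
"""
print(table_to_compact_json(sample))
# {"Name":["Person0","Person1"],"Age":["20","21"],"City":["City0","City1"]}
```

All cell values come out as strings here; a fuller converter would need to decide whether to parse numbers.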


### Step 3: Compare a Markdown Table Before and After Minification
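We'll build a 30-row sample table with one deliberately long column header, which forces heavy whitespace padding in every row, and measure its size before and after minification.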

In [20]:
import pandas as pd

# Create sample data
data = {
    "Name": [f"Person{i}" for i in range(30)],
    "Age": [20 + i for i in range(30)],
    "City": [f"City{i}" for i in range(30)]
}

df = pd.DataFrame(data)
df.columns = ["A very long row content, which leads to a lot of white spaces", "Age", "City"]

# Convert to markdown
markdown_table = df.to_markdown(index=False)
print(markdown_table[:500])  # Preview

| A very long row content, which leads to a lot of white spaces   |   Age | City   |
|:----------------------------------------------------------------|------:|:-------|
| Person0                                                         |    20 | City0  |
| Person1                                                         |    21 | City1  |
| Person2                                                         |    22 | City2  |
| Person3                                                         |    23 |


#### Measure size before minification

In [21]:
print(f"Original characters: {len(markdown_table)}")

Original characters: 2719
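
To see where those characters go, the following rough sketch estimates how much of a markdown table is overhead (pipes, alignment padding, the separator row, newlines) rather than cell text. The helper name `padding_share` and the small hand-written `table` are hypothetical and only serve to keep the example self-contained.

```python
def padding_share(markdown_table: str) -> float:
    """Return the fraction of characters that are not cell content.

    Rough estimate: everything except the stripped cell text of the header
    and body rows (pipes, alignment padding, the separator row, newlines)
    counts as overhead.
    """
    lines = markdown_table.strip().splitlines()
    content_chars = 0
    for i, line in enumerate(lines):
        if i == 1:
            continue  # the |:---|---:| separator row carries no cell text
        cells = line.strip().strip("|").split("|")
        content_chars += sum(len(cell.strip()) for cell in cells)
    total_chars = len(markdown_table.strip())
    return 1 - content_chars / total_chars


# Hypothetical table: one long header forces padding in every body row.
table = (
    "| A long header that forces padding |   Age |\n"
    "|:----------------------------------|------:|\n"
    "| Person0                           |    20 |\n"
    "| Person1                           |    21 |\n"
)
print(f"Overhead: {padding_share(table):.0%}")
```

The wider the widest cell in a column, the more padding every other row in that column carries, which is why a single long header inflates the whole table.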


In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
# Count tokens from the plain list of ids, so no tensor backend (PyTorch) is required
encoded = tokenizer(markdown_table, add_special_tokens=False)
print(f"Original tokens: {len(encoded.input_ids)}")

Original tokens: 432


#### Minify and measure size after minification

In [23]:
compressed_table = MinifyMDT(markdown_table).transform()

print(f"Minified characters: {len(compressed_table)}")

# Count tokens from the plain list of ids, so no tensor backend (PyTorch) is required
compressed_encoded = tokenizer(compressed_table, add_special_tokens=False)
print(f"Minified tokens: {len(compressed_encoded.input_ids)}")

Minified characters: 1027
Minified tokens: 461
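
The relative change is easier to judge as a percentage. The short sketch below computes it from the figures printed in the cells above; `percent_change` is a helper name introduced here for illustration.

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent (negative = smaller)."""
    return (after - before) / before * 100


# Figures copied from the cell outputs above.
chars_before, chars_after = 2719, 1027
tokens_before, tokens_after = 432, 461

print(f"Characters: {percent_change(chars_before, chars_after):+.1f}%")
print(f"Tokens:     {percent_change(tokens_before, tokens_after):+.1f}%")
# Characters: -62.2%
# Tokens:     +6.7%
```

For this table and tokenizer, the character count drops sharply while the token count rises slightly, so it is worth measuring both against the limit that actually applies to your model.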


## Conclusion
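Padded markdown tables can waste well over a thousand characters: the sample table shrank from 2719 to 1027 characters after minification. The token count did not shrink with this tokenizer (432 tokens before, 461 after), so the gain applies to character limits rather than token limits.
**Minifying** tables **preserves meaning** while making the text embedding-friendly for models with character limits.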