# Objective

This Jupyter notebook aims to explore the transformation of text from SimpleWikipedia using markdown. We are going to compare several models, both for short and long texts.

**For Short Texts (Token Count <= 3,000)**
- **Direct Transformation**: If the text contains fewer than 7,000 tokens, it will be transformed directly without segmentation. This approach is straightforward and efficient for shorter texts.

**For Long Texts (Token Count > 3,000)**
- **Segmented Transformation**: For longer texts exceeding 7,000 tokens, the text will be divided into manageable sections. Each section will be transformed individually, and the results will be combined to form the final output.

#### Rationale

1. **Model Limitations**: Many models lack the capacity to handle long contexts effectively while maintaining high quality. Segmenting the text helps mitigate this limitation.
2. **Information Preservation**: We observed that models often truncate or significantly shorten formatted texts (e.g., reducing 11,000 tokens to 2,300 tokens), leading to substantial information loss. By processing sections individually, we aim to preserve the integrity and completeness of the original content.

#### Note on 3,000 Token Limit

The maximum output depends on the model. For example Llama models have a maximum of 4k-8k of tokens, GPT-4o of 16k, Deepseek of 8k, etc.

The 3,000-token limit is chosen to provide a buffer, ensuring that even if the token count increases after transforming the text into markdown, the output remains within the model's capacity.

#### Models

We selected several cost-effective models

* `meta-llama/llama-3.2-1b-instruct`
    * \$0.01/M input tokens
    * \$0.01/M output tokens
* `meta-llama/llama-3.2-3b-instruct`
    * \$0.015/M input tokens
    * \$0.025/M output tokens
* `nousresearch/hermes-2-pro-llama-3-8b`
    * \$0.025/M input tokens
    * \$0.04/M output tokens
* `mistralai/ministral-8b`
    * \$0.1/M input tokens
    * \$0.1/M output tokens
* `mistralai/mistral-nemo-12b`
    * \$0.035/M input tokens
    * \$0.08/M output tokens
* `nousresearch/hermes-3-llama-3.1-70b`
    * \$0.12/M input tokens
    * \$0.3/M output tokens
* `openai/gpt-4o-mini`
    * \$0.15/M input tokens
    * \$0.6/M output tokens
* `deepseek/deepseek-chat`
    * \$0.14/M input tokens
    * \$0.28/M output tokens

In [22]:
import yaml
import pandas as pd

from os import getenv
from tqdm import tqdm
from dotenv import load_dotenv
from transformers import AutoTokenizer
from pathlib import Path

from src.utils.tokenizer import count_tokens
from src.format_articles import (
    convert_text_to_markdown,
    convert_long_text_to_markdown
)

def _load_yaml(path):
    with open(path) as file:
        return yaml.safe_load(file)

load_dotenv()
tqdm.pandas()
huggingface_token = getenv("HUGGINGFACE_TOKEN")

base_path = Path("..")

# 1 - Load data

In [23]:
data = pd.read_parquet(base_path / "data/parsed/articles.parquet")

# 2 - Count tokens

We should use the specific tokenizer for each case, but this is just to get an estimation...

In [24]:
model_hf = "deepseek-ai/DeepSeek-V3"

# Load the tokenizer using the Hugging Face token for authentication
tokenizer = AutoTokenizer.from_pretrained(model_hf, token=huggingface_token)

# Apply the function to each row in the 'text' column with a progress bar
data['token_count'] = data['text'].progress_apply(lambda text: count_tokens(tokenizer, text))

# Print the first row to verify
print(f"Token count: {data['token_count'].sum()}")

 68%|██████▊   | 176925/259096 [01:20<00:30, 2667.99it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (182923 > 131072). Running this sequence through the model will result in indexing errors
100%|██████████| 259096/259096 [01:49<00:00, 2374.73it/s]

Token count: 86175214





### Short article

In [25]:
short_article = data[data["id"] == 13].iloc[0]

short_article_directory = Path("./short")
short_article_path = short_article_directory / "base.md"
short_article_directory.mkdir(parents=True, exist_ok=True)
short_article_path.write_text(short_article["text"], encoding="utf-8")

print(short_article["text"])
print("\n================\n")
print(short_article["token_count"])

= Alan Turing =




Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.

== Early life and family ==

Alan Mathison Turing was born in Maida Vale, London on 23 June 1912. His father was part of a family of merchants from Scotland. His mother, Ethel Sara, was the daughter of an engineer.

=== Education ===

Turing went to St. Michael's, a school at 20 Charles Road, St Leonards-on-sea, when he was five years old.
"This is only a foretaste of what is to come, and only the shadow of what is going to be.” – Alan Turing.

The Stoney family were once prominent landlords in North Tipperary. His mother Ethel Sara Stoney (1881–1976) was daughter of Edward Waller Stoney (Borrisokane, North Tipperary) and Sarah Crawford (Cartron Abbey, Co. Longford), who were Protestant Anglo-Irish gentry. She was educated in Dublin at Alexandra School and College. On 1 October 1907, she marrie

### Long article

In [26]:
long_article = data[data["id"] == 3077].iloc[0]

long_article_directory = Path("./long")
long_article_path = long_article_directory / "base.md"
long_article_directory.mkdir(parents=True, exist_ok=True)
long_article_path.write_text(long_article["text"], encoding="utf-8")

print(long_article["text"])
print("\n================\n")
print(long_article["token_count"])

= Manchester United F.C. =


 

Manchester United Football Club (F.C.) is a professional football club. It is based in Old Trafford, Greater Manchester, England. It plays in the Premier League, the highest level of English football. Its nickname is "the Red Devils". The club started as Newton Heath LYR Football Club, in 1878. It changed its name to Manchester United in 1902. It moved to its current stadium, Old Trafford, in 1910.

Manchester United has won more trophies than any other club in English football other than Liverpool FC. It has won 20 League titles, 13 FA Cups, six League Cups and 21 FA Community Shields. United has also won three UEFA Champions Leagues, one UEFA Europa League, one UEFA Cup Winners' Cup, one UEFA Super Cup, one Intercontinental Cup and one FIFA Club World Cup. In 1998–99, the club became the first club in the history of English football to win the continental European treble. It won the UEFA Europa League in 2016–17 and became one of five clubs to win all 

# 3 - Compare models

In [27]:
# Load prompt
prompts_path = base_path / "prompts.yaml"
with open(prompts_path, "r") as file:
    prompts = yaml.safe_load(file)

## 3.1 - Llama-3.2 1B

In [28]:
short_article_formatted_llama_1b = convert_text_to_markdown(
    model_openrouter="meta-llama/llama-3.2-1b-instruct",
    raw_text=short_article["text"],
    template=prompts["format_markdown"],
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)
short_article_formatted_llama_1b_path = short_article_directory / "llama_1b.md"
short_article_formatted_llama_1b_path.write_text(short_article_formatted_llama_1b, encoding="utf-8")

4485

In [29]:
llama_1b_tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    token=huggingface_token
)

long_article_formatted_llama_1b = convert_long_text_to_markdown(
    model_openrouter="meta-llama/llama-3.2-1b-instruct",
    raw_text=long_article["text"],
    template=prompts["format_markdown"],
    tokenizer=llama_1b_tokenizer,
    max_tokens=3000,
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)

long_article_formatted_llama_1b_path = long_article_directory / "llama_1b.md"
long_article_formatted_llama_1b_path.write_text(long_article_formatted_llama_1b, encoding="utf-8")

50177

## 3.2 - Llama-3.2 3B

In [30]:
short_article_formatted_llama_3b = convert_text_to_markdown(
    model_openrouter="meta-llama/llama-3.2-3b-instruct",
    raw_text=short_article["text"],
    template=prompts["format_markdown"],
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)
short_article_formatted_llama_3b_path = short_article_directory / "llama_3b.md"
short_article_formatted_llama_3b_path.write_text(short_article_formatted_llama_3b, encoding="utf-8")

4409

In [31]:
llama_3b_tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    token=huggingface_token
)

long_article_formatted_llama_3b = convert_long_text_to_markdown(
    model_openrouter="meta-llama/llama-3.2-3b-instruct",
    raw_text=long_article["text"],
    template=prompts["format_markdown"],
    tokenizer=llama_3b_tokenizer,
    max_tokens=3000,
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)

long_article_formatted_llama_3b_path = long_article_directory / "llama_3b.md"
long_article_formatted_llama_3b_path.write_text(long_article_formatted_llama_3b, encoding="utf-8")

59429

## 3.3 - Llama-3.1 8B

In [32]:
short_article_formatted_llama_8b = convert_text_to_markdown(
    model_openrouter="nousresearch/hermes-2-pro-llama-3-8b",
    raw_text=short_article["text"],
    template=prompts["format_markdown"],
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)
short_article_formatted_llama_8b_path = short_article_directory / "llama_8b.md"
short_article_formatted_llama_8b_path.write_text(short_article_formatted_llama_8b, encoding="utf-8")

4665

In [33]:
llama_8b_tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Hermes-2-Pro-Llama-3-8B",
    token=huggingface_token
)

long_article_formatted_llama_8b = convert_long_text_to_markdown(
    model_openrouter="nousresearch/hermes-2-pro-llama-3-8b",
    raw_text=long_article["text"],
    template=prompts["format_markdown"],
    tokenizer=llama_8b_tokenizer,
    max_tokens=3000,
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)

long_article_formatted_llama_8b_path = long_article_directory / "llama_8b.md"
long_article_formatted_llama_8b_path.write_text(long_article_formatted_llama_3b, encoding="utf-8")

59429

## 3.4 - Ministral 8B

In [34]:
short_article_formatted_ministral_8b = convert_text_to_markdown(
    model_openrouter="mistralai/ministral-8b",
    raw_text=short_article["text"],
    template=prompts["format_markdown"],
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)
short_article_formatted_ministral_8b_path = short_article_directory / "ministral_8b.md"
short_article_formatted_ministral_8b_path.write_text(short_article_formatted_ministral_8b, encoding="utf-8")

4768

In [36]:
ministral_8b_tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Ministral-8B-Instruct-2410",
    token=huggingface_token
)

long_article_formatted_ministral_8b = convert_long_text_to_markdown(
    model_openrouter="mistralai/ministral-8b",
    raw_text=long_article["text"],
    template=prompts["format_markdown"],
    tokenizer=ministral_8b_tokenizer,
    max_tokens=3000,
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)

long_article_formatted_ministral_8b_path = long_article_directory / "ministral_8b.md"
long_article_formatted_ministral_8b_path.write_text(long_article_formatted_ministral_8b, encoding="utf-8")

65732

## 3.5 - Mistral Nemo 12B

In [37]:
short_article_formatted_nemo_12b = convert_text_to_markdown(
    model_openrouter="mistralai/mistral-nemo",
    raw_text=short_article["text"],
    template=prompts["format_markdown"],
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)
short_article_formatted_nemo_12b_path = short_article_directory / "nemo_12b.md"
short_article_formatted_nemo_12b_path.write_text(short_article_formatted_nemo_12b, encoding="utf-8")

4766

## 3.6 - Llama-3.1 70B

In [38]:
nemo_12b_tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    token=huggingface_token
)

long_article_formatted_nemo_12b = convert_long_text_to_markdown(
    model_openrouter="mistralai/mistral-nemo",
    raw_text=long_article["text"],
    template=prompts["format_markdown"],
    tokenizer=nemo_12b_tokenizer,
    max_tokens=3000,
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)

long_article_formatted_nemo_12b_path = long_article_directory / "nemo_12b.md"
long_article_formatted_nemo_12b_path.write_text(long_article_formatted_nemo_12b, encoding="utf-8")

51981

## 3.7 - GPT-4o-mini

In [39]:
short_article_formatted_gpt4o_mini = convert_text_to_markdown(
    model_openrouter="openai/gpt-4o-mini",
    raw_text=short_article["text"],
    template=prompts["format_markdown"],
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)
short_article_formatted_gpt4o_mini_path = short_article_directory / "gpt4o_mini.md"
short_article_formatted_gpt4o_mini_path.write_text(short_article_formatted_gpt4o_mini, encoding="utf-8")

4766

In [40]:
gpt4o_mini_tokenizer = AutoTokenizer.from_pretrained(
    "Xenova/gpt-4o",
    token=huggingface_token
)

long_article_formatted_gpt4o_mini = convert_long_text_to_markdown(
    model_openrouter="openai/gpt-4o-mini",
    raw_text=long_article["text"],
    template=prompts["format_markdown"],
    tokenizer=gpt4o_mini_tokenizer,
    max_tokens=3000,
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)

long_article_formatted_gpt4o_mini_path = long_article_directory / "gpt4o_mini.md"
long_article_formatted_gpt4o_mini_path.write_text(long_article_formatted_gpt4o_mini, encoding="utf-8")

69178

## 3.8 - DeepSeek V3

In [41]:
short_article_formatted_deepseek_v3 = convert_text_to_markdown(
    model_openrouter="deepseek/deepseek-chat",
    raw_text=short_article["text"],
    template=prompts["format_markdown"],
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)
short_article_formatted_deepseek_v3_path = short_article_directory / "deepseek_v3.md"
short_article_formatted_deepseek_v3_path.write_text(short_article_formatted_deepseek_v3, encoding="utf-8")

4766

In [42]:
deepseek_v3_tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    token=huggingface_token
)

long_article_formatted_deepseek_v3 = convert_long_text_to_markdown(
    model_openrouter="deepseek/deepseek-chat",
    raw_text=long_article["text"],
    template=prompts["format_markdown"],
    tokenizer=deepseek_v3_tokenizer,
    max_tokens=3000,
    apply_simple_formatting=True,
    apply_llm_formatting=True,
)

long_article_formatted_deepseek_v3_path = long_article_directory / "deepseek_v3.md"
long_article_formatted_deepseek_v3_path.write_text(long_article_formatted_deepseek_v3, encoding="utf-8")

64183