# Objective

This notebook shows how to convert multiple Wikipedia articles to Markdown in parallel and store the results in a SQLite database, so that articles already processed are not converted again.

In [1]:
import os
import yaml
import time
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv
from os import getenv

from transformers import AutoTokenizer

from src.convert_to_markdown import convert_text_to_markdown
from src.utils.database import initialize_db, filter_rows_in_db, insert_row
from src.utils.tokenizer import count_tokens
from src.utils.parallel import parallel_process_dataframe

load_dotenv()

True

## Load Config & Prompts

In [2]:
config_path = "../run_config.yaml"
prompts_path = "../prompts.yaml"

with open(config_path, 'r') as file:
    config = yaml.safe_load(file)

with open(prompts_path, 'r') as file:
    prompts = yaml.safe_load(file)

## Initialize SQLite DB

In [3]:
base_path = Path("../")
db_path = base_path / Path(config["data_folder"]) / config["db_file"]

initialize_db(db_path)

Database already exists at ../data/database.db
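`initialize_db` is imported from `src.utils.database` and its implementation is not shown in this notebook. As a rough sketch of what such a helper might do, the block below creates the database file and a table only when they are missing. The function name `initialize_db_sketch`, the table name `articles`, and the column list (inferred from the arguments passed to `insert_row` later on) are all assumptions, not the project's actual schema.

```python
import sqlite3
from pathlib import Path


def initialize_db_sketch(db_path):
    """Create the SQLite file and a hypothetical `articles` table if missing.

    Returns True when the database file already existed, False otherwise.
    """
    db_path = Path(db_path)
    already_there = db_path.exists()
    # Make sure the parent folder (e.g. ../data) exists before connecting.
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS articles (
                id INTEGER PRIMARY KEY,
                title TEXT,
                raw_text TEXT,
                markdown_text TEXT,
                raw_text_tokens INTEGER,
                markdown_text_tokens INTEGER,
                model TEXT
            )
            """
        )
    conn.close()
    return already_there
```

Because of `CREATE TABLE IF NOT EXISTS`, calling the function repeatedly is harmless, which matches the "already exists" message printed above when the cell is re-run.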


## Load data

In [4]:
base_path = Path("../")
data_path = base_path / Path(config["processed_folder"]) / config["output_file"]

data = pd.read_parquet(data_path)
data["id"] = data["id"].astype(int)  # Ensure article ids are integers

print(data.shape)

(258559, 3)


## Load tokenizer

In [5]:
model_hf = "deepseek-ai/DeepSeek-V3"

# Retrieve the Hugging Face token from the environment variables
huggingface_token = getenv("HUGGINGFACE_TOKEN")

# Load the tokenizer using the Hugging Face token for authentication
tokenizer = AutoTokenizer.from_pretrained(model_hf, token=huggingface_token)

## Transform article to markdown

In [6]:
count_tokens(tokenizer, data.iloc[2]["text"])

1463
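`count_tokens` comes from `src.utils.tokenizer` and is not shown here. A minimal sketch, assuming it simply encodes the text and counts the resulting ids, could look like the block below. `count_tokens_sketch` and `WhitespaceTokenizer` are hypothetical names; the latter is only a stand-in so the example runs without downloading a model.

```python
def count_tokens_sketch(tokenizer, text):
    """Return the number of token ids the tokenizer produces for `text`."""
    return len(tokenizer.encode(text))


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()


n_tokens = count_tokens_sketch(WhitespaceTokenizer(), "Art is a creative activity.")
print(n_tokens)  # 5 with the whitespace stand-in
```

With a Hugging Face tokenizer, `encode` adds special tokens by default, so the count may be slightly higher than the number of content tokens.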

In [7]:
print(data.iloc[2]["text"])

= Art =

Art is a creative activity. It produces a product, an object. Art is a diverse range of human activities in creating visual, performing subjects, and expressing the author's thoughts. The product of art is called a work of art, for others to experience.Various definitions in: Wilson, Simon & Lack, Jennifer 2008. The Tate guide to modern art terms. Tate Publishing. ISBN 978-1-85437-750-0E.H. Gombrich 1995. The story of art. London: Phaidon. ISBN 978-0714832470Kleiner, Gardner, Mamiya and Tansey. 2004. Art through the ages. 12th ed. 2 volumes, Wadsworth. ISBN 0-534-64095-8 (vol 1) and ISBN 0-534-64091-5 (vol 2) Some art is useful in a practical sense, such as a sculptured clay bowl that can be used. That kind of art is sometimes called a craft. Those who make art are called artists. They hope to affect the emotions of people who experience it. Some people find art relaxing, exciting or informative. Some say people are driven to make art due to their inner creativity. "The arts" 

In [8]:
model_openrouter = "deepseek/deepseek-chat"

start_time = time.perf_counter()  # Monotonic clock, suited to measuring elapsed time

markdown_text = convert_text_to_markdown(
    model_openrouter=model_openrouter,
    raw_text=data.iloc[2]["text"],
    template=prompts["markdown_conversion"]
)

end_time = time.perf_counter()

execution_time = end_time - start_time  # Elapsed wall-clock time in seconds
print(f"Execution time: {execution_time:.2f} seconds")

Execution time: 21.52 seconds


In [9]:
print(markdown_text)

# Art

Art is a creative activity. It produces a product, an object. Art is a diverse range of human activities in creating visual, performing subjects, and expressing the author's thoughts. The product of art is called a work of art, for others to experience.

Some art is useful in a practical sense, such as a sculptured clay bowl that can be used. That kind of art is sometimes called a craft. Those who make art are called artists. They hope to affect the emotions of people who experience it. Some people find art relaxing, exciting or informative. Some say people are driven to make art due to their inner creativity. "The arts" is a much broader term. It includes drawing, painting, sculpting, photography, performance art, dance, music, poetry, prose and theatre.

## Types of art

Art is divided into the plastic arts, where something is made, and the performing arts, where something is done by humans in action. The other division is between pure arts, done for themselves, and practical 

## Insert article in DB

In [10]:
insert_row(
    db_path=db_path,
    id=int(data.iloc[2]["id"]),
    title=data.iloc[2]["title"],
    raw_text=data.iloc[2]["text"],
    markdown_text=markdown_text,
    raw_text_tokens=count_tokens(tokenizer, data.iloc[2]["text"]),
    markdown_text_tokens=count_tokens(tokenizer, markdown_text),
    model=model_hf,
    debug=True
)

Row inserted successfully with id 6.
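The body of `insert_row` is not shown in this notebook. One way such an insert could avoid duplicate rows is a parameterized `INSERT OR IGNORE` keyed on the article id, sketched below. The function name `insert_row_sketch`, the `articles` table, and the reduced column set are assumptions made for illustration.

```python
import sqlite3


def insert_row_sketch(conn, row):
    """Insert one article; return True if written, False if the id already existed."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO articles (id, title, markdown_text) "
        "VALUES (:id, :title, :markdown_text)",
        row,  # named placeholders keep the values out of the SQL string
    )
    conn.commit()
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, markdown_text TEXT)"
)
row = {"id": 6, "title": "Art", "markdown_text": "# Art"}
first = insert_row_sketch(conn, row)
second = insert_row_sketch(conn, row)
print(first, second)  # True False
```

The second call is ignored because the primary key already exists, so re-running the cell does not create a duplicate row.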


## Parallel insertion of multiple articles in the DB

Using the steps above, we first filter out the articles that are already stored in the DB and then convert and insert 5 new articles in parallel.

## Filter articles already present in the DB

In [14]:
filtered_data = filter_rows_in_db(
    df=data,
    db_path=db_path,
    id_column="id",
)

print(filtered_data.shape)

(258548, 3)
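The filtering step amounts to an anti-join between the DataFrame ids and the ids already stored. The sketch below shows the idea with plain Python rows so that it runs with the standard library only; `filter_new_rows` and the `articles` table are hypothetical. On a DataFrame, the equivalent selection would be `df[~df["id"].isin(existing_ids)]`.

```python
import sqlite3


def filter_new_rows(rows, conn, id_key="id"):
    """Keep only the rows whose id is not yet stored in the `articles` table."""
    # Load the stored ids once into a set for O(1) membership checks.
    existing_ids = {r[0] for r in conn.execute("SELECT id FROM articles")}
    return [row for row in rows if row[id_key] not in existing_ids]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO articles (id) VALUES (?)", [(1,), (2,)])

rows = [{"id": 1}, {"id": 2}, {"id": 3}]
remaining = filter_new_rows(rows, conn)
print(remaining)  # [{'id': 3}]
```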


## Process and insert 5 articles in parallel

In [15]:
df_test = filtered_data.iloc[:5].copy()

# Estimate number of tokens
df_test['text'].apply(lambda x: count_tokens(tokenizer, x))

11     327
12    5794
13     978
14     234
15     364
Name: text, dtype: int64

In [16]:
# Disable tokenizers parallelism to avoid warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

parallel_process_dataframe(
    data=df_test,
    model_openrouter=model_openrouter,
    template=prompts["markdown_conversion"],
    tokenizer=tokenizer,
    model_hf=model_hf,
    db_path=db_path,
    max_tokens=config["max_tokens"],
    max_workers=(os.cpu_count() or 1) * 5,  # os.cpu_count() can return None
)

Processing rows: 100%|██████████| 5/5 [01:17<00:00, 15.43s/it]
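The internals of `parallel_process_dataframe` are not shown here. Setting `max_workers` to several times the CPU count suggests the work is I/O-bound (waiting on the model API), which would suit a thread pool; that is an assumption about the implementation. The sketch below illustrates the pattern with `concurrent.futures`, using `fake_convert` as a hypothetical stand-in for the slow conversion call and `process_in_parallel` as a hypothetical helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_convert(text):
    """Hypothetical stand-in for the slow, network-bound conversion call."""
    time.sleep(0.1)
    return "# " + text


def process_in_parallel(rows, convert, max_workers=8):
    """Convert rows concurrently; failures are collected instead of aborting the batch."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its article id so results can be matched up.
        futures = {pool.submit(convert, row["text"]): row["id"] for row in rows}
        for future in as_completed(futures):
            row_id = futures[future]
            try:
                results[row_id] = future.result()
            except Exception as exc:  # keep going if one article fails
                errors[row_id] = exc
    return results, errors


rows = [{"id": i, "text": f"Article {i}"} for i in range(5)]
results, errors = process_in_parallel(rows, fake_convert)
print(len(results), len(errors))  # 5 0
```

Collecting results in the calling thread and writing them to SQLite from there is one way to avoid sharing a single database connection across worker threads.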
