# Objective

The goal of this notebook is to explore and analyze the processed Wikipedia texts using the **DeepSeek V3 tokenizer**, focusing on the following:

- Counting the number of tokens in the articles.
- Identifying patterns that can help in filtering or further processing the data.

We begin by loading the processed Parquet file, which contains the titles and contents of Simple English Wikipedia articles, and then tokenize each article with the DeepSeek V3 tokenizer.

The conclusions drawn here will later be applied in the dataset generation pipeline.


In [8]:
import pandas as pd
from os import getenv

from dotenv import load_dotenv
from transformers import AutoTokenizer
from tqdm import tqdm

load_dotenv()
tqdm.pandas()

## Load data

In [2]:
data = pd.read_parquet("../dump/processed/data.parquet")

## Load tokenizer

In [3]:
model_hf = "deepseek-ai/DeepSeek-V3"

# Retrieve the Hugging Face token from the environment variables
huggingface_token = getenv("HUGGINGFACE_TOKEN")

# Load the tokenizer using the Hugging Face token for authentication
tokenizer = AutoTokenizer.from_pretrained(model_hf, token=huggingface_token)

## Count tokens

In [4]:
# Function to count tokens
def count_tokens(text):
    return len(tokenizer.encode(text))

# Apply the function to each row in the 'text' column with a progress bar
data['token_count'] = data['text'].progress_apply(count_tokens)

# Print the total token count across all articles
print(f"Token count: {data['token_count'].sum()}")

100%|██████████| 258559/258559 [01:50<00:00, 2334.27it/s]

Token count: 86522240
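
The cell above tokenizes one article per call. As a minimal sketch, assuming the tokenizer object can be called on a list of strings and returns a mapping with an `input_ids` entry (as Hugging Face tokenizers do), the same counts could be computed in batches. The helper name `count_tokens_batched` and the `batch_size` default are hypothetical, and any speed-up has not been measured here.

```python
def count_tokens_batched(texts, tokenizer, batch_size=1000, add_special_tokens=True):
    """Return one token count per text, tokenizing `batch_size` texts per call.

    `add_special_tokens=True` mirrors the default of `tokenizer.encode`, so the
    counts should match the per-row approach used in the notebook.
    """
    counts = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # A Hugging Face tokenizer called on a list returns {"input_ids": [[...], ...]}
        encoded = tokenizer(batch, add_special_tokens=add_special_tokens)
        counts.extend(len(ids) for ids in encoded["input_ids"])
    return counts

# Hypothetical usage with the notebook's objects:
# data["token_count"] = count_tokens_batched(data["text"].tolist(), tokenizer)
```

Because special tokens are included by default, very short articles may owe part of their count to those tokens rather than to their text.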





## Explore low-count articles

Roughly 80K articles are short (20 to 100 tokens), while only a handful are effectively "empty": 22 articles have fewer than 15 tokens and contain little more than a title.

In [28]:
print(data[(data["token_count"] < 15)].shape)
print(data[(data["token_count"] < 30) & (data["token_count"] >= 20)].shape)

data[(data["token_count"] < 15)]

(22, 5)
(5990, 5)


| index | title | text | id | token_count | token_bin |
|---|---|---|---|---|---|
| 1184 | Global | `= Global =\n\n` | 4550 | 4 | (0, 100] |
| 2852 | 587 | `= 587 =\n\n== Deaths ==\n\n * Saint David` | 9257 | 13 | (0, 100] |
| 20373 | List of professional wrestlers | `= List of professional wrestlers =\n\n` | 79924 | 8 | (0, 100] |
| 28294 | UA | `= UA =\n\nUA may mean: * Ukraine * University ...` | 117032 | 14 | (0, 100] |
| 70479 | 62 | `= 62 =\n\nCategory:62` | 303542 | 8 | (0, 100] |
| 85449 | Across | `= Across =\n\n` | 378807 | 4 | (0, 100] |
| 220790 | Den | `= Den =\n\n` | 983477 | 4 | (0, 100] |
| 220927 | Bludgeon | `= Bludgeon =\n\n` | 983922 | 6 | (0, 100] |
| 244390 | Cringe | `= Cringe =\n\n` | 1085406 | 5 | (0, 100] |
| 250774 | Vindicate | `= Vindicate =\n\n` | 1120393 | 6 | (0, 100] |


In [29]:
data[data["token_count"] >= 7000].shape

(439, 5)
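
The two checks above look at both tails of the distribution: near-empty articles and very long ones. A minimal sketch of how such bounds might be turned into a filter is shown below. The helper `filter_by_token_count` is hypothetical, and the defaults of 15 and 7000 are simply the values probed in the cells above, not thresholds the notebook has settled on.

```python
import pandas as pd

def filter_by_token_count(df, min_tokens=15, max_tokens=7000):
    """Keep rows with min_tokens <= token_count < max_tokens."""
    mask = (df["token_count"] >= min_tokens) & (df["token_count"] < max_tokens)
    return df[mask].reset_index(drop=True)

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "title": ["Global", "UA", "Short", "Typical", "Huge"],
    "token_count": [4, 14, 15, 350, 7000],
})

kept = filter_by_token_count(toy)
print(kept)
```

The lower bound is inclusive and the upper bound exclusive, which matches the `< 15` and `>= 7000` comparisons used in the exploration cells.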

### Group articles according to token count

In [13]:
token_group_size = 100

# Step 1: Bin the token_count column into groups of 100
data['token_bin'] = pd.cut(data['token_count'], bins=range(0, data['token_count'].max() + token_group_size, token_group_size))

# Step 2: Count the number of rows in each bin
bin_counts = data['token_bin'].value_counts().sort_values(ascending=False)

# Step 3: Get the top 10 groups with the highest counts
bin_counts.head(10)

token_bin
(0, 100]       84131
(100, 200]     72788
(200, 300]     33628
(300, 400]     17592
(400, 500]     11435
(500, 600]      7770
(600, 700]      5434
(700, 800]      4154
(800, 900]      3263
(900, 1000]     2613
Name: count, dtype: int64
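
To read the bin counts as a distribution, they can be converted into cumulative shares of all articles. The sketch below re-enters the ten counts printed above and divides by the 258,559 articles reported by the progress bar; the variable names are hypothetical, and the calculation assumes every article falls into one of the bins.

```python
import pandas as pd

# Top-10 bin counts copied from the output above
bin_counts_top10 = pd.Series(
    [84131, 72788, 33628, 17592, 11435, 7770, 5434, 4154, 3263, 2613],
    index=[
        "(0, 100]", "(100, 200]", "(200, 300]", "(300, 400]", "(400, 500]",
        "(500, 600]", "(600, 700]", "(700, 800]", "(800, 900]", "(900, 1000]",
    ],
    name="count",
)

total_articles = 258559  # number of rows reported by the progress bar

# Running share of articles at or below each bin's upper edge
cumulative_share = bin_counts_top10.cumsum() / total_articles
print(cumulative_share.round(3))
```

Under these assumptions, roughly 61% of articles have at most 200 tokens and roughly 94% have at most 1,000, so the bulk of the corpus consists of short articles.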