# Segment text data

This notebook shows several methods for segmenting text data.

In [None]:
#!pip install pandas pyarrow

## Load the data

Global News Dataset : https://huggingface.co/datasets/NickyNicky/global-news-dataset

Download the dataset

In [None]:
!mkdir -p data
!wget -O data/train-00000-of-00001.parquet https://huggingface.co/datasets/NickyNicky/global-news-dataset/resolve/main/data/train-00000-of-00001.parquet

--2025-06-27 09:16:48--  https://huggingface.co/datasets/NickyNicky/global-news-dataset/resolve/main/data/train-00000-of-00001.parquet
Resolving huggingface.co (huggingface.co)... 18.172.134.4, 18.172.134.24, 18.172.134.124, ...
Connecting to huggingface.co (huggingface.co)|18.172.134.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.hf.co/repos/fc/b9/fcb9b99c94f49325b2c47cfd2a3f4f1615b3b61a4bee0d152bbdf91676cc212f/148002d1f68e5fbf7c393687410b997d9f3fcd1f4d5b43ad28abcf4ff8f3abf0?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27train-00000-of-00001.parquet%3B+filename%3D%22train-00000-of-00001.parquet%22%3B&Expires=1751019408&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc1MTAxOTQwOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zL2ZjL2I5L2ZjYjliOTljOTRmNDkzMjViMmM0N2NmZDJhM2Y0ZjE2MTViM2I2MWE0YmVlMGQxNTJiYmRmOTE2NzZjYzIxMmYvMTQ4MDAyZDFmNjhlNWZiZjdjMzkzNjg3NDEwYjk5N2Q

In [None]:
ls data

train-00000-of-00001.parquet


In [None]:
import pandas as pd

In [None]:
df_raw = pd.read_parquet("data/train-00000-of-00001.parquet")
df_raw.head(3)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content
0,89541,,International Business Times,Paavan MATHEMA,UN Chief Urges World To 'Stop The Madness' Of ...,UN Secretary-General Antonio Guterres urged th...,https://www.ibtimes.com/un-chief-urges-world-s...,https://d.ibtimes.com/en/full/4496078/nepals-g...,2023-10-30 10:12:35.000000,UN Secretary-General Antonio Guterres urged th...,Nepal,UN Secretary-General Antonio Guterres urged th...
1,89542,,Prtimes.jp,,RANDEBOOよりワンランク上の大人っぽさが漂うニットとベストが新登場。,[株式会社Ainer]\nRANDEBOO（ランデブー）では2023年7月18日(火)より公...,https://prtimes.jp/main/html/rd/p/000000147.00...,https://prtimes.jp/i/32220/147/ogp/d32220-147-...,2023-10-06 04:40:02.000000,"RANDEBOO2023718()WEB2023 Autumn Winter \n""Nepa...",Nepal,
2,89543,,VOA News,webdesk@voanews.com (Agence France-Presse),UN Chief Urges World to 'Stop the Madness' of ...,UN Secretary-General Antonio Guterres urged th...,https://www.voanews.com/a/un-chief-urges-world...,https://gdb.voanews.com/01000000-0a00-0242-60f...,2023-10-30 10:53:30.000000,"Kathmandu, Nepal UN Secretary-General Antonio...",Nepal,


In [None]:
print(df_raw.shape)

(105375, 12)


Cleaning: remove the rows where `full_content` is empty

In [None]:
df = df_raw[df_raw["full_content"].notna()]
df.shape

(58432, 12)

## Sometimes we need shorter texts (context windows)

Let's say we are using CamemBERT (`camembert-base`).

Context window: 512 tokens

Estimate the number of tokens in one article:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")

text = df["full_content"].iloc[0]

tokens = tokenizer.tokenize(text)
token_count = len(tokens)
print(f"Number of tokens: {token_count}")

Token indices sequence length is longer than the specified maximum sequence length for this model (856 > 512). Running this sequence through the model will result in indexing errors


Number of tokens: 856


In [None]:
df["full_content"].sample(1000).apply(lambda x : len(tokenizer.tokenize(x))).describe()

Unnamed: 0,full_content
count,1000.0
mean,1758.142
std,3650.559661
min,31.0
25%,781.25
50%,1298.5
75%,1839.0
max,80074.0


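Most articles exceed the 512-token window (the median in this sample is about 1,300 tokens), so the texts need to be divided.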
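As a rough illustration of what this means in practice, the sketch below converts the token statistics printed above into a minimum number of chunks per article. The figures are copied from that `describe()` output, which comes from a random sample, so they would differ slightly on another run.

```python
import math

CONTEXT_WINDOW = 512  # context window assumed for CamemBERT in this notebook

# Token counts copied from the describe() output above (sample of 1000 articles)
token_stats = {"min": 31.0, "25%": 781.25, "50%": 1298.5, "75%": 1839.0, "max": 80074.0}

# Lower bound on the number of chunks if every chunk is filled to the limit
chunks_needed = {name: math.ceil(n / CONTEXT_WINDOW) for name, n in token_stats.items()}
print(chunks_needed)
```

The median article needs at least 3 chunks, and the longest one in the sample needs more than 150.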

## How to divide a text?

### Divide by sentence

### Brute-force method

Cut the text every N characters, using a rule of thumb (for instance, about 4 characters per token).

In [None]:
def chunk_text_context_window(text, max_length_token=512, letters_per_token=4):
    """
    Cut the text into chunks of at most max_length_token * letters_per_token
    characters (a rough proxy for max_length_token tokens).
    """
    # Convert the token budget into a character budget
    length = max_length_token * letters_per_token
    return [text[i:i + length] for i in range(0, len(text), length)]


Then explode the chunks, keeping the article id on each row:

In [None]:
df_ss = df.sample(5)
df_ss["chunk_cw"] = df_ss["full_content"].apply(chunk_text_context_window)
df_ss.explode(column="chunk_cw")[["article_id","chunk_cw"]]

Unnamed: 0,article_id,chunk_cw
68660,211223,"Chicago, Nov. 08, 2023 (GLOBE NEWSWIRE) -- T..."
68660,211223,or processed and convenience food. This trend ...
67778,200084,"UPDATED, 5:08 AM PT, Wednesday:Democrats won s..."
67778,200084,"g national Democrats, even hugging just a nati..."
67778,200084,on abortion as an issue weighing down the GOP....
88948,432736,Veritable L.P. decreased its holdings in Devon...
88948,432736,"h report on Monday, July 24th. Finally, Piper ..."
88948,432736,rt).
72718,267535,"Leslie’s, Inc. (NASDAQ:LESL–Get Free Report)’s..."
72718,267535,"ompany’s stock worth $240,000 after purchasing..."


A tokenizer does this more precisely, because it counts actual tokens instead of estimating them from the number of characters.
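A minimal sketch of that idea: encode the text once, slice the token sequence into windows, and decode each window back to text. The function name `chunk_by_tokens` and the `encode`/`decode` parameters are hypothetical names introduced here; the toy stand-ins below split on whitespace so the example runs without downloading a model.

```python
def chunk_by_tokens(text, encode, decode, max_tokens=512):
    """Slice the token sequence into windows of at most max_tokens tokens,
    then turn each window back into text."""
    ids = encode(text)
    return [decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]


# Toy stand-ins (assumption): one "token" per whitespace-separated word
def toy_encode(s):
    return s.split()


def toy_decode(tokens):
    return " ".join(tokens)


chunks = chunk_by_tokens("one two three four five", toy_encode, toy_decode, max_tokens=2)
print(chunks)
```

With the tokenizer loaded earlier, one would presumably pass `encode=lambda s: tokenizer.encode(s, add_special_tokens=False)` and `decode=tokenizer.decode`, with `max_tokens` set slightly below 512 to leave room for the special tokens the model adds. Note that windows can cut in the middle of a word or sentence, and re-tokenizing a decoded chunk may not give exactly the same token count.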

The same goes for paragraphs: you first need to define what a paragraph is. In our case, it is any text separated by a line break.

In [None]:
def chunk_paragraphs(text):
    """
    Cut the text into paragraphs (one per line break), skipping empty lines.
    """
    return [p.strip() for p in text.split("\n") if p.strip()]

In [None]:
df_ss = df.sample(5)
df_ss["chunk_cw"] = df_ss["full_content"].apply(chunk_paragraphs)
df_ss.explode(column="chunk_cw")[["article_id","chunk_cw"]]

Unnamed: 0,article_id,chunk_cw
14927,4654,Russian President Vladimir Putin claims his ex...
58435,136180,A second suspect has been arrested after boxes...
48370,76746,"TORRANCE, Calif., Oct. 30, 2023 (GLOBE NEWSW..."
52202,119319,Quantinno Capital Management LP trimmed its po...
43877,65718,Syrian state news agency reported that two wer...
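The result depends on the definition chosen. The sketch below compares two common ones: a paragraph per line break, or a paragraph per blank line. The function name `split_paragraphs` and the `blank_line_only` flag are hypothetical, and the sample string is made up for illustration.

```python
import re


def split_paragraphs(text, blank_line_only=False):
    """Split on every line break, or only on blank lines, dropping empty pieces."""
    pattern = r"\n\s*\n" if blank_line_only else r"\n+"
    return [p.strip() for p in re.split(pattern, text) if p.strip()]


sample = "Title line\nFirst paragraph, line two.\n\nSecond paragraph."

print(split_paragraphs(sample))                        # one piece per line
print(split_paragraphs(sample, blank_line_only=True))  # one piece per blank-line block
```

Texts scraped from the web often have hard line breaks inside a paragraph, in which case the blank-line definition may be the better fit.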


To divide by sentence, you can use the NLTK sentence tokenizer:

In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(df.iloc[0]["full_content"], language="english")

for i, sent in enumerate(sentences, 1):
    print(f"Sentence {i}: {sent}")

Sentence 1: UN Secretary-General Antonio Guterres urged the world Monday to "stop the madness" of climate change as he visited Himalayan regions struggling from rapidly melting glaciers to witness the devastating impact of the phenomenon.
Sentence 2: "The rooftops of the world are caving in," Guterres said on a visit to the Everest region in mountainous Nepal, adding that the country had lost nearly a third of its ice in just over three decades.
Sentence 3: "Glaciers are icy reservoirs -- the ones here in the Himalayas supply fresh water to well over a billion people," he said.
Sentence 4: "When they shrink, so do river flows."
Sentence 5: Nepal's glaciers melted 65 percent faster in the last decade than in the previous one, said Guterres, who is on a four-day visit to Nepal.
Sentence 6: Glaciers in the wider Himalayan and Hindu Kush ranges are a crucial water source for around 240 million people in the mountainous regions, as well as for another 1.65 billion people in the South Asian 

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Or use spaCy with a pretrained pipeline:

In [1]:
#!pip install spacy
#!python -m spacy download en_core_web_trf

In [None]:
import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp(df.iloc[0]["full_content"])
print("🔹 Sentences:")
for sent in doc.sents:
    print(f"- {sent.text.strip()}")

🔹 Sentences:
- UN Secretary-General Antonio Guterres urged the world Monday to "stop the madness" of climate change as he visited Himalayan regions struggling from rapidly melting glaciers to witness the devastating impact of the phenomenon.
- "The rooftops of the world are caving in," Guterres said on a visit to the Everest region in mountainous Nepal, adding that the country had lost nearly a third of its ice in just over three decades.
- "Glaciers are icy reservoirs -- the ones here in the Himalayas supply fresh water to well over a billion people," he said.
- "When they shrink, so do river flows."
- Nepal's glaciers melted 65 percent faster in the last decade than in the previous one, said Guterres, who is on a four-day visit to Nepal.
- Glaciers in the wider Himalayan and Hindu Kush ranges are a crucial water source for around 240 million people in the mountainous regions, as well as for another 1.65 billion people in the South Asian and Southeast Asian river valleys below.
- The 
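Once a text is split into sentences (with NLTK or spaCy as above), the sentences can be regrouped into chunks that fit the context window without cutting a sentence in half. The following is a minimal greedy sketch; `pack_sentences` and `count_tokens` are hypothetical names, and the word-count function below is only a stand-in for a real tokenizer.

```python
def pack_sentences(sentences, count_tokens, max_tokens=512):
    """Greedily group consecutive sentences into chunks whose summed token
    count stays within max_tokens. A single sentence longer than the budget
    is kept as its own (oversized) chunk."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # Close the current chunk if adding this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Stand-in token counter (assumption): one token per whitespace-separated word
def count_words(s):
    return len(s.split())


demo = ["A b c.", "D e.", "F g h i.", "J."]
packed = pack_sentences(demo, count_words, max_tokens=5)
print(packed)
```

With the tokenizer loaded earlier, `count_tokens` could be `lambda s: len(tokenizer.tokenize(s))`. Summing per-sentence counts is an approximation of the token count of the joined chunk, so keeping a small safety margin below the model limit is prudent.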

### Use a dedicated model to segment

For instance: [wtpsplit](https://github.com/segment-any-text/wtpsplit)

It uses dedicated models trained for segmentation: https://huggingface.co/segment-any-text/sat-3l

In [2]:
#!pip install wtpsplit

In [None]:
from wtpsplit import SaT
sat_sm = SaT("sat-3l-sm")

In [None]:
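# do_paragraph_segmentation=True is an assumption: the original call is cut off,
# and the nested lists in the output below match that option.
sat_sm.split(df.iloc[0]["full_content"], do_paragraph_segmentation=True)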

[['UN Secretary-General Antonio Guterres urged the world Monday to "stop the madness" of climate change as he visited Himalayan regions struggling from rapidly melting glaciers to witness the devastating impact of the phenomenon. '],
 ['"The rooftops of the world are caving in," Guterres said on a visit to the Everest region in mountainous Nepal, adding that the country had lost nearly a third of its ice in just over three decades. '],
 ['"Glaciers are icy reservoirs -- the ones here in the Himalayas supply fresh water to well over a billion people," he said. '],
 ['"When they shrink, so do river flows." '],
 ["Nepal's glaciers melted 65 percent faster in the last decade than in the previous one, said Guterres, who is on a four-day visit to Nepal. "],
 ['Glaciers in the wider Himalayan and Hindu Kush ranges are a crucial water source for around 240 million people in the mountainous regions, as well as for another 1.65 billion people in the South Asian and Southeast Asian river valleys 