# Segment data

This notebook shows several ways to segment text data.

In [19]:
#!pip install pandas pyarrow

## Load the data

Global News Dataset: https://huggingface.co/datasets/NickyNicky/global-news-dataset

Download the dataset

In [8]:
!mkdir -p data
!wget -O data/train-00000-of-00001.parquet https://huggingface.co/datasets/NickyNicky/global-news-dataset/resolve/main/data/train-00000-of-00001.parquet


In [12]:
import pandas as pd

In [None]:
df_raw = pd.read_parquet("data/train-00000-of-00001.parquet")
df_raw.head(3)

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content
0,89541,,International Business Times,Paavan MATHEMA,UN Chief Urges World To 'Stop The Madness' Of ...,UN Secretary-General Antonio Guterres urged th...,https://www.ibtimes.com/un-chief-urges-world-s...,https://d.ibtimes.com/en/full/4496078/nepals-g...,2023-10-30 10:12:35.000000,UN Secretary-General Antonio Guterres urged th...,Nepal,UN Secretary-General Antonio Guterres urged th...
1,89542,,Prtimes.jp,,RANDEBOOよりワンランク上の大人っぽさが漂うニットとベストが新登場。,[株式会社Ainer]\nRANDEBOO（ランデブー）では2023年7月18日(火)より公...,https://prtimes.jp/main/html/rd/p/000000147.00...,https://prtimes.jp/i/32220/147/ogp/d32220-147-...,2023-10-06 04:40:02.000000,"RANDEBOO2023718()WEB2023 Autumn Winter \n""Nepa...",Nepal,
2,89543,,VOA News,webdesk@voanews.com (Agence France-Presse),UN Chief Urges World to 'Stop the Madness' of ...,UN Secretary-General Antonio Guterres urged th...,https://www.voanews.com/a/un-chief-urges-world...,https://gdb.voanews.com/01000000-0a00-0242-60f...,2023-10-30 10:53:30.000000,"Kathmandu, Nepal UN Secretary-General Antonio...",Nepal,


In [14]:
print(df_raw.shape)

(105375, 12)


Cleaning: remove the rows where `full_content` is missing

In [15]:
df = df_raw[df_raw["full_content"].notna()]
df.shape

(58432, 12)

## Suppose we use CamemBERT

Context window: 512 tokens

Estimate the number of tokens in one article

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")

text = df["full_content"].iloc[0]

tokens = tokenizer.tokenize(text)
token_count = len(tokens)
print(f"Number of tokens: {token_count}")

Number of tokens: 856


In [18]:
df["full_content"].sample(1000).apply(lambda x : len(tokenizer.tokenize(x))).describe()

count     1000.000000
mean      1742.043000
std       2699.420828
min         12.000000
25%        817.000000
50%       1292.000000
75%       1809.750000
max      32045.000000
Name: full_content, dtype: float64

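Most articles are longer than the 512-token context window (the median in this sample is about 1,292 tokens), so we need to divide them into smaller pieces.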
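As a rough check, the number of chunks per article can be estimated from the token counts reported by `describe()`. The sketch below assumes that 2 positions in each window are reserved for special tokens; `chunks_needed` and `reserved` are illustrative names, not part of the notebook.

```python
import math

def chunks_needed(n_tokens, window=512, reserved=2):
    """Minimum number of chunks for n_tokens, given the per-chunk token budget."""
    budget = window - reserved  # assumption: 2 special tokens per chunk
    return math.ceil(n_tokens / budget)

# Token counts copied from the describe() output of the sample.
for label, n in [("25%", 817), ("50%", 1292), ("75%", 1809.75), ("max", 32045)]:
    print(label, chunks_needed(n))
```

This is only a lower bound: any method that cuts on sentence or paragraph boundaries will generally produce more chunks.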

### Divide by sentence

### Brute-force method

Cut the text every fixed number of characters, using a rule of thumb such as 4 characters per token.

In [60]:
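def chunk_text_context_window(text, max_length_token=512, letter_per_token=4):
    """
    Cut the text into fixed-size character chunks.

    The chunk size in characters is max_length_token * letter_per_token,
    a rough estimate of how much text fits in the context window.
    """
    length = max_length_token * letter_per_token

    # Slice every `length` characters; words may be cut at chunk boundaries.
    return [text[i:i + length] for i in range(0, len(text), length)]


Then explode the chunks so that each row keeps the id of its article

In [65]:
df_ss = df.sample(5)
df_ss["chunk_cw"] = df_ss["full_content"].apply(chunk_text_context_window)
df_ss.explode(column="chunk_cw")[["article_id","chunk_cw"]]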

Unnamed: 0,article_id,chunk_cw
88121,452970,BJ’s Wholesale Club (NYSE:BJ–Get Free Report) ...
88121,452970,and six have given a buy rating to the company...
88121,452970,"t $46,000. Point72 Hong Kong Ltd bought a new ..."
69582,227784,Shares of NIPPON STL & SU/S (OTCMKTS:NSSMY–Get...
58047,134197,"Chico’s FAS, Inc. (NYSE:CHS–Get Free Report)’s..."
58047,134197,osition in shares of Chico’s FAS by 167.9% in ...
15393,5337,Statement by UK Political Coordinator Fergus E...
15393,5337,"iolence, is the only way to achieve lasting pe..."
94841,528246,"SANTIAGO, Chile, Nov. 20, 2023 (GLOBE NEWSWI..."
94841,528246,ment’s current expectations and beliefs and ar...


A tokenizer gives a better result, because it counts tokens exactly instead of estimating them from the number of characters.
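A minimal sketch of token-based chunking follows. The function `chunk_by_tokens` and its `encode`/`decode` parameters are hypothetical names introduced here; a whitespace tokenizer stands in for the real one so that the example runs without downloading a model.

```python
def chunk_by_tokens(text, encode, decode, max_tokens=510):
    """Split text into chunks of at most max_tokens tokens.

    encode: callable, str -> list of tokens (or token ids)
    decode: callable, list of tokens -> str
    """
    tokens = encode(text)
    return [
        decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

# Stand-in tokenizer (assumption for the demo): split on whitespace.
# With the CamemBERT tokenizer loaded earlier, one would pass something like
#   encode=lambda t: tokenizer.encode(t, add_special_tokens=False)
#   decode=tokenizer.decode
toy_encode = str.split
toy_decode = " ".join

chunks = chunk_by_tokens("one two three four five", toy_encode, toy_decode, max_tokens=2)
print(chunks)
```

With a subword tokenizer, decoding a slice and encoding it again may not give exactly the same number of tokens, so it is safer to keep a small margin below the window size.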

The same goes for paragraphs: you first need to define what a paragraph is. In our case, it is a line break.

In [50]:
def chunk_paragraphs(text):
    """
    Cut the text into paragraphs.
    """
    return text.split("\n")

In [66]:
df_ss = df.sample(5)
df_ss["chunk_cw"] = df_ss["full_content"].apply(chunk_paragraphs)
df_ss.explode(column="chunk_cw")[["article_id","chunk_cw"]]

Unnamed: 0,article_id,chunk_cw
81275,337940,Summit X LLC raised its stake in O’Reilly Auto...
72588,266994,TEHRAN: Iranian president Ebrahim Raisi will t...
55380,125435,"Village after village is under attack, with co..."
7024,100171,Life Time is bringing its one-of-a-kind Life T...
101296,675914,Getty Images Shares of Bajaj Holdings & Inve...
101296,675914,\t\t Powered by
101296,675914,\t\t Weekly Top Picks: Eight stock...
101296,675914,View More Sto...
101296,675914,
101296,675914,\t\t Subscribe to ETPrime
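The output above contains empty chunks and chunks that are mostly whitespace, because the raw text has consecutive line breaks and tab-indented fragments. One possible refinement, sketched here under the hypothetical name `chunk_paragraphs_clean`, strips each paragraph and drops the empty ones.

```python
def chunk_paragraphs_clean(text):
    """Split on line breaks, strip surrounding whitespace, drop empty paragraphs."""
    return [p.strip() for p in text.split("\n") if p.strip()]

# Made-up sample that mimics the pattern seen in the scraped articles.
sample = "First paragraph.\n\n\t\t Powered by\n   \nLast paragraph."
print(chunk_paragraphs_clean(sample))
```

Short residual fragments such as navigation labels still pass this filter; a minimum length threshold could be added if they are a problem.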


To divide by sentence, use a sentence tokenizer such as the one in NLTK

In [47]:
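import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence tokenizer data
# (recent NLTK releases ask for "punkt_tab" instead of "punkt").
nltk.download("punkt")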

sentences = sent_tokenize(df.iloc[0]["full_content"], language="english")

for i, sent in enumerate(sentences, 1):
    print(f"Sentence {i}: {sent}")

Sentence 1: UN Secretary-General Antonio Guterres urged the world Monday to "stop the madness" of climate change as he visited Himalayan regions struggling from rapidly melting glaciers to witness the devastating impact of the phenomenon.
Sentence 2: "The rooftops of the world are caving in," Guterres said on a visit to the Everest region in mountainous Nepal, adding that the country had lost nearly a third of its ice in just over three decades.
Sentence 3: "Glaciers are icy reservoirs -- the ones here in the Himalayas supply fresh water to well over a billion people," he said.
Sentence 4: "When they shrink, so do river flows."
Sentence 5: Nepal's glaciers melted 65 percent faster in the last decade than in the previous one, said Guterres, who is on a four-day visit to Nepal.
Sentence 6: Glaciers in the wider Himalayan and Hindu Kush ranges are a crucial water source for around 240 million people in the mountainous regions, as well as for another 1.65 billion people in the South Asian 

[nltk_data] Downloading package punkt to /Users/emilien/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
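Single sentences are far shorter than the 512-token window, so one option is to pack consecutive sentences into chunks that stay under a token budget. The sketch below is a simple greedy approach, not something the notebook itself implements; `pack_sentences` and `count_tokens` are hypothetical names, and a word count stands in for the real token count.

```python
def pack_sentences(sentences, count_tokens, max_tokens=510):
    """Greedily group consecutive sentences into chunks under a token budget."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # Close the current chunk if this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Stand-in counter (assumption for the demo): number of whitespace-separated words.
# With the CamemBERT tokenizer: count_tokens=lambda s: len(tokenizer.tokenize(s))
toy_count = lambda s: len(s.split())

sents = ["A b c.", "D e.", "F g h i.", "J."]
packed = pack_sentences(sents, toy_count, max_tokens=5)
print(packed)
```

A sentence that is longer than the budget on its own still ends up in a chunk by itself, so it would need a fallback such as the character-based or token-based split.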


Or use spaCy with a pretrained pipeline

In [4]:
#!pip install spacy
#!python -m spacy download en_core_web_trf

In [8]:
import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp(df.iloc[0]["full_content"])
print("🔹 Sentences:")
for sent in doc.sents:
    print(f"- {sent.text.strip()}")

🔹 Sentences:
- UN Secretary-General Antonio Guterres urged the world Monday to "stop the madness" of climate change as he visited Himalayan regions struggling from rapidly melting glaciers to witness the devastating impact of the phenomenon.
- "The rooftops of the world are caving in," Guterres said on a visit to the Everest region in mountainous Nepal, adding that the country had lost nearly a third of its ice in just over three decades.
- "Glaciers are icy reservoirs -- the ones here in the Himalayas supply fresh water to well over a billion people," he said.
- "When they shrink, so do river flows."
- Nepal's glaciers melted 65 percent faster in the last decade than in the previous one, said Guterres, who is on a four-day visit to Nepal.
- Glaciers in the wider Himalayan and Hindu Kush ranges are a crucial water source for around 240 million people in the mountainous regions, as well as for another 1.65 billion people in the South Asian and Southeast Asian river valleys below.
- The 

Use a dedicated segmentation model

For instance: [wtpsplit](https://github.com/segment-any-text/wtpsplit)

It uses models trained specifically for segmentation: https://huggingface.co/segment-any-text/sat-3l

In [14]:
#!pip install wtpsplit



In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [1]:
from wtpsplit import SaT
sat_sm = SaT("sat-3l-sm")

In [7]:
sat_sm.split(df.iloc[0]["full_content"], do_paragraph_segmentation=True)

[['UN Secretary-General Antonio Guterres urged the world Monday to "stop the madness" of climate change as he visited Himalayan regions struggling from rapidly melting glaciers to witness the devastating impact of the phenomenon. '],
 ['"The rooftops of the world are caving in," Guterres said on a visit to the Everest region in mountainous Nepal, adding that the country had lost nearly a third of its ice in just over three decades. '],
 ['"Glaciers are icy reservoirs -- the ones here in the Himalayas supply fresh water to well over a billion people," he said. '],
 ['"When they shrink, so do river flows." '],
 ["Nepal's glaciers melted 65 percent faster in the last decade than in the previous one, said Guterres, who is on a four-day visit to Nepal. "],
 ['Glaciers in the wider Himalayan and Hindu Kush ranges are a crucial water source for around 240 million people in the mountainous regions, as well as for another 1.65 billion people in the South Asian and Southeast Asian river valleys 