# Segment text data

This notebook shows several ways to segment text data into smaller units.

In [None]:
#!pip install pandas pyarrow

## Load the data

Global News Dataset: https://huggingface.co/datasets/NickyNicky/global-news-dataset

Download the dataset

In [None]:
!mkdir -p data
!wget -O data/train-00000-of-00001.parquet https://huggingface.co/datasets/NickyNicky/global-news-dataset/resolve/main/data/train-00000-of-00001.parquet

In [None]:
ls data

In [None]:
import pandas as pd

In [None]:
df_raw = pd.read_parquet("data/train-00000-of-00001.parquet")
df_raw.head(3)

In [None]:
print(df_raw.shape)

Cleaning: remove rows with no article content

In [None]:
df = df_raw[df_raw["full_content"].notna()]
df.shape

## Sometimes we need shorter texts (context windows)

Let's say we are using CamemBERT.

Context window: 512 tokens

Estimate the token count of an article

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")

text = df["full_content"].iloc[0]

tokens = tokenizer.tokenize(text)
token_count = len(tokens)
print(f"Number of tokens: {token_count}")

In [None]:
df["full_content"].sample(1000).apply(lambda x : len(tokenizer.tokenize(x))).describe()

Texts longer than the context window need to be divided.
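
To get a feel for how many pieces an article needs, the arithmetic can be sketched as below. The function name `estimate_chunk_count` and the `special_tokens=2` default are assumptions for illustration: the default reserves room for the start and end markers that a tokenizer such as CamemBERT's typically adds around a single sequence.

```python
import math


def estimate_chunk_count(token_count, window=512, special_tokens=2):
    """Estimate how many chunks a text needs to fit a model's context window."""
    # Room left for content once the special tokens are reserved (assumed: 2).
    usable = window - special_tokens
    if usable <= 0:
        raise ValueError("window must be larger than special_tokens")
    # Even an empty text occupies one chunk.
    return max(1, math.ceil(token_count / usable))


print(estimate_chunk_count(1800))  # 1800 / 510 rounds up to 4
```

This is only an estimate: the real number of chunks depends on where the cuts fall.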

## How to divide a text?

### Divide by sentence

### Brute-force method

Cut every fixed number of characters, using a rule of thumb (for instance, about 4 characters per token).

In [None]:
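def chuck_text_context_window(text, max_length_token=512, letters_per_token=4):
    """
    Cut the text into fixed-size chunks of about max_length_token tokens,
    assuming letters_per_token characters per token on average.
    """
    # Character budget per chunk, derived from the token budget.
    length = max_length_token * letters_per_token
    return [text[i:i + length] for i in range(0, len(text), length)]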


Then explode the chunks into one row per chunk, keeping the id of the article

In [None]:
df_ss = df.sample(5)
df_ss["chunk_cw"] = df_ss["full_content"].apply(chuck_text_context_window)
df_ss.explode(column="chunk_cw")[["article_id","chunk_cw"]]
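
Fixed-width slicing can cut a word in half. A minimal sketch of one way to soften that, assuming a character budget as above: back off to the last space inside each window, and fall back to a hard cut when no space is found. The name `chunk_on_whitespace` is hypothetical.

```python
def chunk_on_whitespace(text, max_chars=2048):
    """Split text into chunks of at most max_chars, cutting at a space when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back off to the last space inside the window, if there is one.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


print(chunk_on_whitespace("aaa bbb ccc", max_chars=5))  # ['aaa', 'bbb', 'ccc']
```

It can be applied to the `full_content` column in the same way as the function above.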

You can do better with a tokenizer, which counts tokens exactly instead of estimating them from characters.
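
A minimal sketch of token-based chunking, written against two generic callables so that it runs without downloading a model. The names `chunk_by_tokens`, `whitespace_encode` and `whitespace_decode` are hypothetical, and the whitespace tokenizer is only a stand-in for a real one.

```python
def chunk_by_tokens(text, encode, decode, max_tokens=510):
    """Split text into chunks of at most max_tokens tokens."""
    token_ids = encode(text)
    # Slice the token sequence, then turn each slice back into text.
    return [
        decode(token_ids[i:i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def whitespace_encode(text):
    return text.split()


def whitespace_decode(tokens):
    return " ".join(tokens)


chunks = chunk_by_tokens(
    "one two three four five", whitespace_encode, whitespace_decode, max_tokens=2
)
print(chunks)  # ['one two', 'three four', 'five']
```

With the Hugging Face tokenizer loaded earlier, the callables could be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `tokenizer.decode`. Decoded text may not re-tokenize to exactly the same number of tokens, so it is safer to leave a small margin below the window size.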

The same goes for paragraphs: you need to define what a paragraph is. In our case, it is a line break.

In [None]:
def chunk_paragraphs(text):
    """
    Cut the text into paragraphs (one per line), dropping empty lines.
    """
    return [p.strip() for p in text.split("\n") if p.strip()]

In [None]:
df_ss = df.sample(5)
df_ss["chunk_paragraph"] = df_ss["full_content"].apply(chunk_paragraphs)
df_ss.explode(column="chunk_paragraph")[["article_id","chunk_paragraph"]]
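
Paragraphs are often much shorter than the context window. One possible follow-up, sketched here under the assumption of a character budget, is to merge consecutive paragraphs greedily until the budget is reached. The name `pack_units` and its parameters are hypothetical, and the same idea works for sentences.

```python
def pack_units(units, max_chars=2048, separator="\n"):
    """Greedily merge consecutive units into chunks of at most max_chars characters."""
    chunks = []
    current = ""
    for unit in units:
        unit = unit.strip()
        if not unit:
            continue  # skip empty paragraphs
        candidate = unit if not current else current + separator + unit
        if len(candidate) <= max_chars or not current:
            # Either it still fits, or this is the first unit of a new chunk.
            # A single unit longer than max_chars is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks


print(pack_units(["aa", "bb", "cccc", "d"], max_chars=5))
# ['aa\nbb', 'cccc', 'd']
```

A unit that exceeds the budget on its own is left untouched, so it would still need one of the cutting methods above.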

If you want to divide by sentence, you can use NLTK

In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(df.iloc[0]["full_content"], language="english")

for i, sent in enumerate(sentences, 1):
    print(f"Sentence {i}: {sent}")

Or use spaCy with a pretrained pipeline

In [None]:
#!pip install spacy
#!python -m spacy download en_core_web_trf

In [None]:
import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp(df.iloc[0]["full_content"])
print("🔹 Sentences:")
for sent in doc.sents:
    print(f"- {sent.text.strip()}")

### Use a dedicated model to segment

For instance: [wtpsplit](https://github.com/segment-any-text/wtpsplit)

It uses dedicated models trained for segmentation: https://huggingface.co/segment-any-text/sat-3l

In [None]:
#!pip install wtpsplit

In [None]:
from wtpsplit import SaT
sat_sm = SaT("sat-3l-sm")

In [None]:
sat_sm.split(df.iloc[0]["full_content"],