In [1]:
# English Article 1
article_1 = """Technology is evolving rapidly; every day, we witness new advancements in artificial intelligence, robotics, and data science. 
The question is: are we ready for such changes? Many experts believe that automation will replace millions of jobs—some say this is a threat, while others see it as an opportunity! 
Nevertheless, innovation continues: new industries are emerging, and with them, new career paths are being created."""

# English Article 2
article_2 = """History has taught us valuable lessons—some we remember, others we repeat! 
Take, for example, the rise and fall of ancient civilizations: the Roman Empire, the Maya, and even the Egyptian dynasties. 
What led to their decline? Was it internal corruption, external invasion, or simply the passage of time? 
Regardless, one thing remains certain: no empire lasts forever!"""

# English Article 3
english_pipline_test = """Artificial intelligence is rapidly transforming various industries; businesses are adopting machine learning models to automate processes, 
enhance customer experiences, and gain insights from data; however, ethical concerns regarding data privacy, algorithmic bias, and job displacement continue
to be major topics of discussion in academic and professional circles worldwide."""

# Arabic Article 1
article_3 = """التكنولوجيا تتطور بسرعة؛ نشهد كل يوم تطورات جديدة في الذكاء الاصطناعي، والروبوتات، وعلم البيانات. 
السؤال هو: هل نحن مستعدون لهذه التغيرات؟ يعتقد العديد من الخبراء أن الأتمتة ستحل محل ملايين الوظائف—البعض يراها تهديدًا، بينما يراها آخرون فرصة! 
ومع ذلك، تستمر الابتكارات: تنشأ صناعات جديدة، ومعها تُخلق مسارات وظيفية حديثة."""

# Arabic Article 2
article_4 = """علمتنا التاريخ دروسًا قيّمة—بعضها نتذكره، وبعضها نعيد تكراره! 
خذ على سبيل المثال صعود وسقوط الحضارات القديمة: الإمبراطورية الرومانية، والمايا، وحتى السلالات المصرية. 
ما الذي أدى إلى انهيارها؟ هل كان الفساد الداخلي، أم الغزو الخارجي، أم مجرد مرور الزمن؟ 
بغض النظر، هناك شيء واحد مؤكد: لا إمبراطورية تدوم للأبد!"""

# Arabic Article 3
arabic_pipline_test = """الذكاء الاصطناعي يُحدث تغييرات جذرية في العديد من الصناعات؛ 
الشركات تتبنى نماذج التعلم الآلي لأتمتة العمليات، وتحسين تجارب العملاء؛ 
ومع ذلك، تظل المخاوف الأخلاقية بشأن خصوصية البيانات والتحيز الخوارزمي."""

# English

## Sentence Segmentation

### Introduction
Sentence segmentation is the process of dividing a text into meaningful sentences. It ensures that a passage is correctly split at appropriate sentence boundaries, which is essential for tasks like text summarization, machine translation, and speech processing.
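
To see why boundary detection is harder than splitting on punctuation, here is a minimal sketch using only Python's `re` module. The helper name `naive_split` and the sample text are illustrative assumptions and not part of this notebook's pipeline; the rule simply breaks after `.`, `!`, or `?` when whitespace follows.

```python
import re

def naive_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Dr. Smith arrived at 5 p.m. He left early."
for sentence in naive_split(sample):
    print(sentence)
```

The abbreviation `Dr.` is treated as a sentence end here, which is the kind of error that trained segmenters are designed to reduce.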

### Conclusion
Sentence segmentation is a crucial step in Natural Language Processing (NLP) that transforms raw text into structured data by identifying and separating sentences. It plays a vital role in improving text analysis, enhancing model performance, and ensuring better readability. 🚀

In [2]:
import spacy
import nltk
from spacy.language import Language
from nltk.tokenize import sent_tokenize
from nltk.tokenize import PunktSentenceTokenizer

nlp = spacy.load("en_core_web_sm")
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\aakam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
doc_1 = nlp(article_1)
for sent in doc_1.sents:
    print(sent)

Technology is evolving rapidly; every day, we witness new advancements in artificial intelligence, robotics, and data science. 

The question is: are we ready for such changes?
Many experts believe that automation will replace millions of jobs—some say this is a threat, while others see it as an opportunity! 

Nevertheless, innovation continues: new industries are emerging, and with them, new career paths are being created.


In [4]:
doc_1[10].is_sent_end, doc_1[0].is_sent_start

(False, True)

## Why We Can't Index `doc.sents` Directly

### Handling Sentence Segmentation in Code
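In **spaCy**, sentences are exposed through `doc.sents`, which is a generator rather than a list. This means you **cannot use indexing** like `doc_1.sents[10]`; doing so raises a `TypeError`. Token indexing such as `doc_1[10]` works because the `Doc` itself is indexable. To index sentences, convert the generator to a list first:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc_1 = nlp("This is the first sentence. This is the second sentence.")
sentences = list(doc_1.sents)  # materialize the generator so it can be indexed
print(sentences[1].text)       # second sentence
```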
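
If only one sentence is needed, the generator can also be consumed lazily instead of building the full list. The sketch below uses a plain Python generator, `iter_sentences`, as a hypothetical stand-in for `doc.sents` so that it runs without a spaCy model; the helper `nth` is likewise an assumption, built on `itertools.islice`, which works on any iterator.

```python
from itertools import islice

def iter_sentences(text):
    """Hypothetical stand-in for doc.sents: lazily yields sentence strings."""
    for part in text.split(". "):
        yield part if part.endswith(".") else part + "."

def nth(iterable, n):
    """Return item n (0-based) of any iterable, or None if it is too short."""
    return next(islice(iterable, n, None), None)

text = "This is the first sentence. This is the second sentence."

try:
    iter_sentences(text)[1]  # generators do not support indexing
except TypeError as err:
    print("TypeError:", err)

print(nth(iter_sentences(text), 1))
```

With a real `Doc`, the same helper should work as `nth(doc_1.sents, 1)`, since it only relies on iteration.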

In [5]:
sents = list(doc_1.sents)
sents

[Technology is evolving rapidly; every day, we witness new advancements in artificial intelligence, robotics, and data science. ,
 The question is: are we ready for such changes?,
 Many experts believe that automation will replace millions of jobs—some say this is a threat, while others see it as an opportunity! ,
 Nevertheless, innovation continues: new industries are emerging, and with them, new career paths are being created.]

<Br/>
<p align="center">
  <img src="../img/1.jpeg" alt="nlp pipeline" width="800">
</p>
<Br/>

In [6]:
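@Language.component("set_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after every semicolon as the start of a new sentence.
    for token in doc[:-1]:  # skip the last token: it has no successor
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

# Run before the parser so the boundaries set here are already in place when it runs.
nlp.add_pipe("set_boundaries", before="parser")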

<function __main__.set_custom_boundaries(doc)>

In [7]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [8]:
doc_pipline_en = nlp(english_pipline_test)
for sent in list(doc_pipline_en.sents):
    print(f"{sent}")
    print("==="*25)

Artificial intelligence is rapidly transforming various industries;
businesses are adopting machine learning models to automate processes, 
enhance customer experiences, and gain insights from data;
however, ethical concerns regarding data privacy, algorithmic bias, and job displacement continue
to be major topics of discussion in academic and professional circles worldwide.
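
The rule implemented by `set_custom_boundaries` can be illustrated without spaCy. The sketch below applies the same idea, starting a new sentence right after each `;`, to a plain list of string tokens. The function `split_after` and the token list are hypothetical, and unlike the real pipeline there is no parser adding further boundaries afterwards.

```python
def split_after(tokens, boundary=";"):
    """Group tokens into sentences, closing a sentence after each boundary token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token == boundary:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that follow the last boundary
        sentences.append(current)
    return sentences

tokens = ["AI", "is", "growing", ";", "firms", "adopt", "it", ";", "concerns", "remain", "."]
for sentence in split_after(tokens):
    print(" ".join(sentence))
```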


<br />

### Difference Between Custom `PunktSentenceTokenizer` and `sent_tokenize()`

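`PunktSentenceTokenizer` is an unsupervised sentence tokenizer. When it is constructed with a training text (here `article_1`), it learns parameters such as abbreviations, collocations, and frequent sentence starters from that text. It then applies those learned parameters to `article_2`, so the splitting behavior depends on what it learned from `article_1`.

`sent_tokenize()` uses the pre-trained Punkt model that ships with NLTK. It does not require training on any custom text; it applies its general, pre-learned parameters to split `article_2` into sentences.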

```python
custom_tokenizer = PunktSentenceTokenizer(article_1)  # Train tokenizer on article_1
doc_3 = custom_tokenizer.tokenize(article_2)  # Tokenize article_2 using trained model
```
<br />

In [9]:
doc_2 = sent_tokenize(article_2)

for s in doc_2:
    print(s)
    print("==="*25)

History has taught us valuable lessons—some we remember, others we repeat!
Take, for example, the rise and fall of ancient civilizations: the Roman Empire, the Maya, and even the Egyptian dynasties.
What led to their decline?
Was it internal corruption, external invasion, or simply the passage of time?
Regardless, one thing remains certain: no empire lasts forever!


In [10]:
custom_tokenizer = PunktSentenceTokenizer(article_1)
doc_3 = custom_tokenizer.tokenize(article_2)


for s in doc_3:
    print(s)
    print("==="*25)

History has taught us valuable lessons—some we remember, others we repeat!
Take, for example, the rise and fall of ancient civilizations: the Roman Empire, the Maya, and even the Egyptian dynasties.
What led to their decline?
Was it internal corruption, external invasion, or simply the passage of time?
Regardless, one thing remains certain: no empire lasts forever!


# Arabic

In [11]:
# nlp is still the English pipeline (en_core_web_sm), applied here to Arabic text
doc_4 = nlp(article_4)
for sent in doc_4.sents:
    print(sent)

علمتنا التاريخ دروسًا قيّمة—بعضها نتذكره، وبعضها نعيد تكراره! 

خذ على سبيل المثال صعود وسقوط الحضارات القديمة: الإمبراطورية الرومانية، والمايا،
وحتى السلالات المصرية. 

ما الذي أدى إلى انهيارها؟ هل كان الفساد الداخلي، أم الغزو الخارجي، أم مجرد مرور الزمن؟ 
بغض النظر، هناك شيء
واحد
مؤكد: لا إمبراطورية تدوم للأبد!


In [12]:
doc_pipline_ar = nlp(arabic_pipline_test)
for sent in list(doc_pipline_ar.sents):
    print(f"{sent}")
    print("==="*25)

الذكاء الاصطناعي يُحدث تغييرات جذرية في العديد من الصناعات؛ 
الشركات تتبنى نماذج التعلم الآلي لأتمتة العمليات، وتحسين تجارب العملاء؛ 

ومع ذلك
،
تظل المخاوف الأخلاقية بشأن خصوصية البيانات والتحيز الخوارزمي.


In [13]:
# sent_tokenize defaults to language="english", so the English Punkt model is used here
doc_5 = sent_tokenize(article_3)

for s in doc_5:
    print(s)
    print("==="*25)

التكنولوجيا تتطور بسرعة؛ نشهد كل يوم تطورات جديدة في الذكاء الاصطناعي، والروبوتات، وعلم البيانات.
السؤال هو: هل نحن مستعدون لهذه التغيرات؟ يعتقد العديد من الخبراء أن الأتمتة ستحل محل ملايين الوظائف—البعض يراها تهديدًا، بينما يراها آخرون فرصة!
ومع ذلك، تستمر الابتكارات: تنشأ صناعات جديدة، ومعها تُخلق مسارات وظيفية حديثة.
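
In the output above, the question ending in `؟` and the statement after it stay on one line, which suggests the default English Punkt model does not treat the Arabic question mark as a sentence terminator. A minimal sketch of a rule-based alternative follows, assuming a simple regex split is acceptable. The names `split_sentences` and `TERMINATORS` are hypothetical, and the rule ignores abbreviations and quotations.

```python
import re

# Latin terminators plus the Arabic question mark (U+061F).
TERMINATORS = ".!?\u061F"

def split_sentences(text, terminators=TERMINATORS):
    """Split after any terminator character that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(terminators) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]

sample = "هل نحن مستعدون لهذه التغيرات؟ يعتقد العديد من الخبراء ذلك! ومع ذلك، تستمر الابتكارات."
for sentence in split_sentences(sample):
    print(sentence)
    print("===" * 25)
```

Passing a different `terminators` string is one way to experiment with other boundary characters, such as the Arabic semicolon `؛`.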
