TP2

Exo 1: Tokenization
Objective: Understand the process of tokenization using spaCy and analyze token properties.

In [9]:
import spacy

sentence= 'Google is planning to purchase an U.S. software company for $120 million'
#load the model we're going to use for the tokenization 'English Core Web Small' 
nlp = spacy.load("en_core_web_sm")
#apply the tokenization
doc = nlp(sentence)
#access some meta data of the tokens (properties)
for token in doc:
  print(token.text, token.pos_, token.dep_)

Google PROPN nsubj
is AUX aux
planning VERB ROOT
to PART aux
purchase VERB xcomp
an DET det
U.S. PROPN compound
software NOUN compound
company NOUN dobj
for ADP prep
$ SYM quantmod
120 NUM compound
million NUM pobj


In [10]:
for token in doc:
  print('\n The token is :', token.text)
  print('The shape', token.shape_)
  print('is_alpha (alphabetic characters only):', token.is_alpha)
  print('is_stop', token.is_stop)
  print('is_punct', token.is_punct)
  print('is number', token.like_num)


 The token is : Google
The shape Xxxxx
is_alpha (alphabetic characters only): True
is_stop False
is_punct False
is number False

 The token is : is
The shape xx
is_alpha (alphabetic characters only): True
is_stop True
is_punct False
is number False

 The token is : planning
The shape xxxx
is_alpha (alphabetic characters only): True
is_stop False
is_punct False
is number False

 The token is : to
The shape xx
is_alpha (alphabetic characters only): True
is_stop True
is_punct False
is number False

 The token is : purchase
The shape xxxx
is_alpha (alphabetic characters only): True
is_stop False
is_punct False
is number False

 The token is : an
The shape xx
is_alpha (alphabetic characters only): True
is_stop True
is_punct False
is number False

 The token is : U.S.
The shape X.X.
is_alpha (alphabetic characters only): False
is_stop False
is_punct False
is number False

 The token is : software
The shape xxxx
is_alpha (alphabetic characters only): True
is_s

Comparing regular-expression tokenization with spaCy tokenization

In [11]:
import re
tokens = re.split(r'\W+', sentence)
tokens = [t for t in tokens if t]
print(tokens)

spacy_tokens = [token.text for token in doc]
print(spacy_tokens)

['Google', 'is', 'planning', 'to', 'purchase', 'an', 'U', 'S', 'software', 'company', 'for', '120', 'million']
['Google', 'is', 'planning', 'to', 'purchase', 'an', 'U.S.', 'software', 'company', 'for', '$', '120', 'million']


spaCy gives better results even on this short passage: it keeps “U.S.” as a single token and preserves “$”, whereas the \W+ split breaks the abbreviation into “U” and “S” and drops the currency symbol. 
The difference comes from spaCy’s rule-based tokenizer, which combines prefix, suffix and infix patterns with language-specific exceptions for abbreviations and contractions. 
Covering the same range of cases by hand with regular expressions is hard, because every new kind of token needs its own pattern.
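
To illustrate how quickly a hand-written pattern grows, here is a minimal sketch of a regex tokenizer that reproduces the spaCy tokens for this one sentence. The pattern and the name regex_tokenize are assumptions made for this example; they are not how spaCy works internally, and they still fail on other inputs (for instance, a single-letter word at the end of a sentence would absorb the final period).

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.)+      # dotted abbreviations such as U.S. or i.e.
    | \d+(?:\.\d+)?      # integers and decimals such as 120 or 1.5
    | \w+                # ordinary words
    | [^\w\s]            # any other single symbol, e.g. $ or %
    """,
    re.VERBOSE,
)

def regex_tokenize(text):
    """Return every non-overlapping match of TOKEN_PATTERN in text."""
    return TOKEN_PATTERN.findall(text)

sentence = 'Google is planning to purchase an U.S. software company for $120 million'
regex_tokens = regex_tokenize(sentence)
print(regex_tokens)
```

Each new token type (URLs, contractions, dates) would need another alternative, which is the maintenance cost the comparison above points to.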

In [12]:
import nltk

nltk.download('punkt_tab')
nltk.download('punkt')

[nltk_data] Downloading package punkt_tab to /Users/bishi/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /Users/bishi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)

['Google', 'is', 'planning', 'to', 'purchase', 'an', 'U.S.', 'software', 'company', 'for', '$', '120', 'million']


We notice that the NLTK word tokenizer and the spaCy tokenizer produce the same tokens for this text, which is expected given how simple it is. 
Overall, the two libraries are similarly accurate on simple text, but their results may diverge on more complex sentences. Neither splits on whitespace alone: NLTK's word_tokenize applies a fixed set of Treebank-style regular expressions, while spaCy's tokenizer combines:
1. prefix, suffix and infix rules for punctuation
2. language-specific exception lists (abbreviations, contractions)
3. non-destructive splitting, so every token keeps its position in the original text
Part-of-speech tagging and dependency parsing run after tokenization in the spaCy pipeline; they do not change the tokens, but they attach context to them that NLTK's tokenizer alone does not provide.

Exo 2: Sentence Segmentation
Objective: Understand the process of sentence segmentation using various NLP libraries and
analyze different approaches.

In [15]:
par = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."

#use spaCy 'en_core_web_sm' model for sentence segmentation
doc_2 = nlp(par)
print('spaCy sentence segmentation output:')
for sent in doc_2.sents:
  print(sent)

#use nltk for sentence segmentation
nltk_sent_tokens = nltk.sent_tokenize(par)
print('\nNLTK sentence segmentation output:')
for sent in nltk_sent_tokens:
  print(sent)

#use TextBlob for sentence segmentation 
from textblob import TextBlob
print('\nTextBlob sentence segmentation output:')
blob = TextBlob(par)
for sent in blob.sentences:
    print(sent)

spaCy sentence segmentation output:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

NLTK sentence segmentation output:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true... Well, with a probability of .9 it isn't.

TextBlob sentence segmentation output:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true... Well, with a probability of .9 it isn't.


Comparing the outputs of each library: 
spaCy
	•	Most robust here. In en_core_web_sm, sentence boundaries come from the statistical dependency parser, so they depend on sentence structure and not only on punctuation.
	•	Handles the abbreviations in this paragraph (“Mr.”, “i.e.”, “Jr.”) without splitting sentences incorrectly.
	•	Treats the ellipsis as a sentence end and leaves the decimals (“1.5”, “.9”) intact.
	•	Better for real-world text, where punctuation can appear in many contexts.
NLTK
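	•	Uses PunktSentenceTokenizer, an unsupervised model that learns abbreviations and sentence starters from a corpus; it is not a hand-written rule set.
	•	Can still misread abbreviations it has not learned: here it splits after “i.e.”, while “Mr.” and “Jr.” are handled correctly. It also does not split after the ellipsis (“true... Well”).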
	•	Less context-sensitive — relies heavily on punctuation and capitalization cues.
	•	Works fine for clean text, but struggles with informal writing or irregular punctuation.
TextBlob
	•	Built on top of NLTK’s Punkt tokenizer, so its segmentation is identical to NLTK’s on this paragraph.
	•	Shares the same limitations with abbreviations and ellipses.
	•	Easier to use for quick tasks, but not ideal for nuanced sentence boundary detection.
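
The differences can also be located programmatically. The sketch below is only an illustration: it hard-codes the spaCy and NLTK outputs recorded above instead of calling the libraries, and the helper name boundary_offsets is an assumption made for this example.

```python
def boundary_offsets(text, sentences):
    """Return the character offset at which each sentence ends inside text."""
    offsets, cursor = [], 0
    for sent in sentences:
        # Search from the previous boundary so repeated phrases are not matched twice.
        cursor = text.index(sent, cursor) + len(sent)
        offsets.append(cursor)
    return offsets

par = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."

# Sentences copied from the outputs recorded in this notebook.
spacy_sents = [
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.",
    "Did he mind?",
    "Adam Jones Jr. thinks he didn't.",
    "In any case, this isn't true...",
    "Well, with a probability of .9 it isn't.",
]
nltk_sents = [
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.",
    "he paid a lot for it.",
    "Did he mind?",
    "Adam Jones Jr. thinks he didn't.",
    "In any case, this isn't true... Well, with a probability of .9 it isn't.",
]

spacy_bounds = set(boundary_offsets(par, spacy_sents))
nltk_bounds = set(boundary_offsets(par, nltk_sents))
only_spacy = sorted(spacy_bounds - nltk_bounds)
only_nltk = sorted(nltk_bounds - spacy_bounds)

for name, offsets in (("spaCy only", only_spacy), ("NLTK/TextBlob only", only_nltk)):
    for off in offsets:
        print(f"{name}: boundary after {par[max(0, off - 20):off]!r}")
```

The two libraries disagree at exactly two places: the split after “i.e.” (NLTK and TextBlob only) and the split after the ellipsis (spaCy only).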

In [16]:
#implementing a simple rule-based sentence segmentation function 
def sent_seg(text):
    # Split on whitespace that directly follows '.', '!' or '?'
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

sentences = sent_seg(par)
for i, s in enumerate(sentences, 1):
    print(f"{i}: {s}")

1: Mr.
2: Smith bought cheapsite.com for 1.5 million dollars, i.e.
3: he paid a lot for it.
4: Did he mind?
5: Adam Jones Jr.
6: thinks he didn't.
7: In any case, this isn't true...
8: Well, with a probability of .9 it isn't.


We notice that: 
- The function splits after “Mr.” and “i.e.” and “Jr.” even though they’re not true sentence boundaries. That’s because the regex blindly treats any . followed by a space as the end of a sentence.
- However for standard punctuation-based sentences, it works well.
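
One common way to reduce these false splits is to merge a piece back into the next one when it ends with a known abbreviation. The following is a minimal sketch of that idea; the ABBREVIATIONS set and the function name sent_seg_abbrev are assumptions made for this example, and the list is far from exhaustive.

```python
import re

# Illustrative, hand-picked list; a real system would need a much larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "jr.", "i.e.", "e.g."}

def sent_seg_abbrev(text):
    """Split like the naive regex, then re-join pieces that end with an abbreviation."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences, buffer = [], ""
    for piece in pieces:
        if not piece:
            continue
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.split()[-1].lower()
        if last_word not in ABBREVIATIONS:
            # The piece ends with real sentence-final punctuation: emit it.
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

par = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."
for i, s in enumerate(sent_seg_abbrev(par), 1):
    print(f"{i}: {s}")
```

On this paragraph the result matches the spaCy segmentation, but the approach has the opposite weakness: a sentence that really ends with “Jr.” would be merged with the one that follows.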

Exo 4 + Homework: 
Apply the three libraries to each of the following tasks and compare their outputs: 
1. Part-of-Speech
2. Stemming 
3. Lemmatization 
4. NER (Named Entity Recognition)
5. Stop words

In [17]:
# Core tokenizers & taggers
nltk.download('punkt')
#the pretrained model that nltk.pos_tag uses for part-of-speech tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

# Lemmatizer support
nltk.download('wordnet')
nltk.download('omw-1.4')

# Named Entity Recognition
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

# Corpora used by NLTK + TextBlob
nltk.download('brown')
nltk.download('conll2000')
nltk.download('movie_reviews')
#the used sentence
sent= 'The NLP system accurately classified 95% of the customer feedback as positive.'

[nltk_data] Downloading package punkt to /Users/bishi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/bishi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/bishi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/bishi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/bishi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/bishi/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /Users/bi

POS

In [30]:
#spaCy
doc = nlp(sent)
print('POS with spaCy')
for token in doc: 
    print('\ntoken:', token)
    print('POS:', token.pos_)
    print('tag:', token.tag_)
    print('Description of the tag: ', spacy.explain(token.tag_))


POS with spaCy

token: The
POS: DET
tag: DT
Description of the tag:  determiner

token: NLP
POS: PROPN
tag: NNP
Description of the tag:  noun, proper singular

token: system
POS: NOUN
tag: NN
Description of the tag:  noun, singular or mass

token: accurately
POS: ADV
tag: RB
Description of the tag:  adverb

token: classified
POS: VERB
tag: VBD
Description of the tag:  verb, past tense

token: 95
POS: NUM
tag: CD
Description of the tag:  cardinal number

token: %
POS: NOUN
tag: NN
Description of the tag:  noun, singular or mass

token: of
POS: ADP
tag: IN
Description of the tag:  conjunction, subordinating or preposition

token: the
POS: DET
tag: DT
Description of the tag:  determiner

token: customer
POS: NOUN
tag: NN
Description of the tag:  noun, singular or mass

token: feedback
POS: NOUN
tag: NN
Description of the tag:  noun, singular or mass

token: as
POS: ADP
tag: IN
Description of the tag:  conjunction, subordinating or preposition

token: positive
POS: ADJ
tag: JJ
Description 

In [31]:
#NLTK
print('POS with NLTK')
tokens = nltk.word_tokenize(sent)
tags = nltk.pos_tag(tokens)
for token, tag in tags:
    print(f"Token: {token}  Tag: {tag}")
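# nltk.pos_tag returns only the tag; descriptions can be looked up separately with nltk.help.upenn_tagset(tag), which needs the NLTK tagsets resource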

POS with NLTK
Token: The  Tag: DT
Token: NLP  Tag: NNP
Token: system  Tag: NN
Token: accurately  Tag: RB
Token: classified  Tag: VBD
Token: 95  Tag: CD
Token: %  Tag: NN
Token: of  Tag: IN
Token: the  Tag: DT
Token: customer  Tag: NN
Token: feedback  Tag: NN
Token: as  Tag: IN
Token: positive  Tag: JJ
Token: .  Tag: .
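
Since nltk.pos_tag returns bare tag strings, a small lookup table is one way to print readable descriptions next to them. This is a sketch only: the dictionary covers just the tags that occur in this sentence, the descriptions are copied from the spaCy output above, and the entries for JJ and "." are worded by hand because they are not visible in that output. The names PENN_TAG_DESCRIPTIONS and describe_tags are assumptions made for this example.

```python
# Penn Treebank tags seen in this notebook, with short descriptions.
PENN_TAG_DESCRIPTIONS = {
    "DT": "determiner",
    "NNP": "noun, proper singular",
    "NN": "noun, singular or mass",
    "RB": "adverb",
    "VBD": "verb, past tense",
    "CD": "cardinal number",
    "IN": "conjunction, subordinating or preposition",
    "JJ": "adjective",                 # hand-written description
    ".": "sentence-final punctuation",  # hand-written description
}

def describe_tags(tagged_tokens):
    """Attach a description to each (token, tag) pair; unknown tags are flagged."""
    return [
        (token, tag, PENN_TAG_DESCRIPTIONS.get(tag, "unknown tag"))
        for token, tag in tagged_tokens
    ]

# A few (token, tag) pairs taken from the NLTK output above.
sample_tags = [("The", "DT"), ("NLP", "NNP"), ("classified", "VBD"), ("%", "NN"), (".", ".")]
for token, tag, description in describe_tags(sample_tags):
    print(f"Token: {token}  Tag: {tag}  Description: {description}")
```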


In [20]:
#TextBlob 
print('POS with TextBlob')
blob = TextBlob(sent)
for token, tag in blob.tags:
    print(f"Token: {token}  Tag: {tag}")

POS with TextBlob
Token: The  Tag: DT
Token: NLP  Tag: NNP
Token: system  Tag: NN
Token: accurately  Tag: RB
Token: classified  Tag: VBD
Token: 95  Tag: CD
Token: %  Tag: NN
Token: of  Tag: IN
Token: the  Tag: DT
Token: customer  Tag: NN
Token: feedback  Tag: NN
Token: as  Tag: IN
Token: positive  Tag: JJ


Stemming & Lemmatization

In [21]:
#spaCy
print('Lemmatization with spaCy')

doc = nlp(sent)
for token in doc:
    print(token.text, "→", token.lemma_)

Lemmatization with spaCy
The → the
NLP → NLP
system → system
accurately → accurately
classified → classify
95 → 95
% → %
of → of
the → the
customer → customer
feedback → feedback
as → as
positive → positive
. → .


In [39]:
#NLTK
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize(sent)
tags = nltk.pos_tag(tokens)

def get_wordnet_pos(tag):
    if tag.startswith('J'):  
        return 'a'
    elif tag.startswith('V'):  
        return 'v'
    elif tag.startswith('N'):  
        return 'n'
    elif tag.startswith('R'):  
        return 'r'
    else:
        return 'n' 

# PorterStemmer.stem takes no POS argument, so only the word is passed
stems = [stemmer.stem(word.lower()) for word, tag in tags]
lemmas = [lemmatizer.lemmatize(word.lower(), get_wordnet_pos(tag)) for word, tag in tags]

print('Lemmatization with NLTK')
print("Original Tokens:", tokens)
print("Stemmed Tokens:", stems)
print('Lemmatized Tokens:', lemmas)

Lemmatization with NLTK
Original Tokens: ['The', 'NLP', 'system', 'accurately', 'classified', '95', '%', 'of', 'the', 'customer', 'feedback', 'as', 'positive', '.']
Stemmed Tokens: ['the', 'nlp', 'system', 'accur', 'classifi', '95', '%', 'of', 'the', 'custom', 'feedback', 'as', 'posit', '.']
Lemmatized Tokens: ['the', 'nlp', 'system', 'accurately', 'classify', '95', '%', 'of', 'the', 'customer', 'feedback', 'a', 'positive', '.']


In [23]:
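#TextBlob 
#lemmatize() treats every word as a noun by default, which is why 'classified' is left unchanged below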
print('Lemmatization with TextBlob')
blob = TextBlob(sent)
for token in blob.words:
    print(f"{token:15} → {token.lemmatize()}")

Lemmatization with TextBlob
The             → The
NLP             → NLP
system          → system
accurately      → accurately
classified      → classified
95              → 95
of              → of
the             → the
customer        → customer
feedback        → feedback
as              → a
positive        → positive
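
Both the NLTK and the TextBlob outputs turn “as” into “a”. A likely explanation is that the word is lemmatized as a noun (the fallback in get_wordnet_pos, and the default in TextBlob), so a plural-style “s” is stripped. One possible fix is to skip lemmatization for tags that are not nouns, verbs, adjectives or adverbs. The sketch below shows the idea with a toy stand-in for the lemmatizer so that it runs without NLTK; penn_to_wordnet, lemmatize_tagged and toy_noun_lemmatize are hypothetical names introduced for this example.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None for function words."""
    for prefix, pos in (("J", "a"), ("V", "v"), ("N", "n"), ("R", "r")):
        if tag.startswith(prefix):
            return pos
    return None

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize content words only; lemmatize is any callable (word, pos) -> lemma."""
    lemmas = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word.lower(), pos) if pos else word.lower())
    return lemmas

def toy_noun_lemmatize(word, pos):
    """Toy stand-in for a real lemmatizer: strips a final 's' from nouns only."""
    if pos == "n" and len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word

tagged = [("as", "IN"), ("systems", "NNS"), ("positive", "JJ")]
print(lemmatize_tagged(tagged, toy_noun_lemmatize))
```

With NLTK installed, WordNetLemmatizer().lemmatize could be passed in place of the toy function, since it accepts the same (word, pos) arguments.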


NER

In [24]:
# spaCy
print("spaCy Named Entities:")
for ent in doc.ents:
    print(f"{ent.text:20} → {ent.label_}")

spaCy Named Entities:
NLP                  → ORG
95%                  → PERCENT


In [None]:
#NLTK
print("NLTK Named Entities:")
tokens = nltk.word_tokenize(sent)
#ne_chunk expects POS-tagged tokens, so we tag the tokens first
tags = nltk.pos_tag(tokens)
nltk_entities = nltk.ne_chunk(tags, binary=False)
print(nltk_entities)

NLTK Named Entities:
(S
  The/DT
  (ORGANIZATION NLP/NNP)
  system/NN
  accurately/RB
  classified/VBD
  95/CD
  %/NN
  of/IN
  the/DT
  customer/NN
  feedback/NN
  as/IN
  positive/JJ
  ./.)


In [26]:
#TextBlob
#TextBlob has no built-in NER; noun phrases are printed here as a rough substitute
print("TextBlob Named Entities:")
print(blob.noun_phrases)

TextBlob Named Entities:
['nlp', 'customer feedback']


Stop Words

In [27]:
# spaCy
doc = nlp(sent)
spacy_stopwords = [token.text for token in doc if token.is_stop]
print(f"spaCy Stop Words:\n{spacy_stopwords}")

spaCy Stop Words:
['The', 'of', 'the', 'as']


In [28]:
#NLTK
from nltk.corpus import stopwords
tokens = nltk.word_tokenize(sent)
nltk_stops = set(stopwords.words('english'))
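# keep only the tokens that are stop words (this extracts them, it does not remove them)
nltk_stopwords_found = [w for w in tokens if w.lower() in nltk_stops]
print(f"NLTK stopwords:\n{nltk_stopwords_found}")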

NLTK stopwords:
['The', 'of', 'the', 'as']


In [29]:

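#TextBlob
#TextBlob ships no stop word list of its own, so the NLTK list is reused on TextBlob's tokens
blob = TextBlob(sent)
textblob_stop_words = [word for word in blob.words if word.lower() in nltk_stops]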
print(f"TextBlob stop words Tokens :\n{textblob_stop_words}")

TextBlob stop words Tokens :
['The', 'of', 'the', 'as']


All three approaches return the same stop words here. We need to apply them to longer and more complex sentences to really see the difference in performance between the three libraries. 