# Lab 1

## Exercise 01 :

In [28]:
import spacy

spacy.__version__

'3.7.5'

In [29]:
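# load the pretrained English pipeline (it must already be installed,
# e.g. with `python -m spacy download en_core_web_sm`)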
model = spacy.load("en_core_web_sm")

In [30]:
print(type(model))

<class 'spacy.lang.en.English'>


In [31]:
sent = "Google is planning to purchase a U.S. software company for $120 million."
result = model(sent)

In [32]:
result

Google is planning to purchase a U.S. software company for $120 million.

In [33]:
print(type(result))

<class 'spacy.tokens.doc.Doc'>


In [34]:
for token in result:
    print(
        f"{token.text} | {token.pos_} | {token.dep_} |{token.lemma_} | {token.is_stop} | {token.is_punct}"
    )

Google | PROPN | nsubj |Google | False | False
is | AUX | aux |be | True | False
planning | VERB | ROOT |plan | False | False
to | PART | aux |to | True | False
purchase | VERB | xcomp |purchase | False | False
a | DET | det |a | True | False
U.S. | PROPN | compound |U.S. | False | False
software | NOUN | compound |software | False | False
company | NOUN | dobj |company | False | False
for | ADP | prep |for | True | False
$ | SYM | quantmod |$ | False | False
120 | NUM | compound |120 | False | False
million | NUM | pobj |million | False | False
. | PUNCT | punct |. | False | True


### Q2) Five different properties of tokens that can be accessed using spaCy

`text`: The original text of the token.

Example: token.text returns "Google", "is", "planning", etc.

`pos_` : The part-of-speech tag of the token.

Example: token.pos_ might return "NOUN" for nouns, "VERB" for verbs.

`dep_`: The syntactic dependency relation of the token (i.e., how the token relates to other tokens).

Example: token.dep_ might return "nsubj" for a subject or "dobj" for a direct object.

`lemma_`: The base form of the word.

Example: token.lemma_ for the token "planning" returns "plan".

`is_stop`: A boolean indicating whether the token is a stop word (e.g., "the", "is", "and").

Example: token.is_stop returns True for common stop words.

### Q3) How does spaCy handle special cases in tokenization, such as punctuation, numbers, and abbreviations?

`Punctuation:` SpaCy treats punctuation as separate tokens. For example, a period, comma, or quotation mark is treated as its own token.

`Numbers:` A number is kept as a single token, but an attached currency symbol is split off. For example, "$120" is tokenized as "$" and "120".

`Abbreviations`: SpaCy usually handles abbreviations like "U.S." correctly by keeping them as a single token instead of splitting them into multiple tokens.
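
As a rough illustration of the three rules above, here is a toy tokenizer built only on the standard library. It is not spaCy's algorithm: `ABBREVIATIONS`, `TOKEN_PATTERN`, and `toy_tokenize` are hypothetical names made up for this sketch, and the abbreviation list is hand-picked for the lab sentences.

```python
import re

# Assumption: a tiny hand-written exception list (spaCy ships a much larger one).
ABBREVIATIONS = {"U.S.", "U.K.", "Mr.", "Jr.", "i.e."}

# A number (optionally with decimals), a run of word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split on whitespace first, then peel punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            # Known abbreviation: keep it in one piece.
            tokens.append(chunk)
        else:
            tokens.extend(TOKEN_PATTERN.findall(chunk))
    return tokens


sent = "Google is planning to purchase a U.S. software company for $120 million."
tokens = toy_tokenize(sent)
print(len(tokens))  # 14
print(tokens)
```

On this sentence the sketch yields the same 14 tokens that spaCy printed in the first cell, with "$" and the final "." split off and "U.S." kept whole.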

### Q4) How does spaCy's tokenization differ from simple string splitting? Provide an example to illustrate the difference.

In [35]:
tokens1 = sent.split()
tokens2 = model(sent)
print(f"Text splitting : {len(tokens1)} tokens")
print(tokens1)
print("====================")
print(f"Using Spacy : {len(tokens2)} tokens")
for token in tokens2:
    print(f"{token.text} ")

Text splitting : 12 tokens
['Google', 'is', 'planning', 'to', 'purchase', 'a', 'U.S.', 'software', 'company', 'for', '$120', 'million.']
Using Spacy : 14 tokens
Google 
is 
planning 
to 
purchase 
a 
U.S. 
software 
company 
for 
$ 
120 
million 
. 


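spaCy splits punctuation and symbols into separate tokens (the period after "million" and the "$" before "120"), while `str.split()` only breaks on whitespace and leaves them attached ("$120", "million.").
Both approaches keep "U.S." in one piece here, but for different reasons: `str.split()` never looks inside a chunk, whereas spaCy recognizes the abbreviation and still splits off the sentence-final period.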

### Q5) Do the tokenization this time with word_tokenize from NLTK, what are the differences?

In [36]:
from nltk.tokenize import word_tokenize

text = "Google is planning to purchase an U.S. software company for $120 million."
nltk_tokens = word_tokenize(text)
print(nltk_tokens)

['Google', 'is', 'planning', 'to', 'purchase', 'an', 'U.S.', 'software', 'company', 'for', '$', '120', 'million', '.']


Both NLTK and spaCy split punctuation marks and numbers similarly in this case.
The primary difference is that spaCy provides richer linguistic context (e.g., part-of-speech, dependency parsing), while NLTK’s word_tokenize only splits the text into tokens without offering any additional linguistic analysis.
spaCy can also group multi-word proper nouns into a single span through its named-entity recognizer, which `word_tokenize` does not attempt.

___________________

## Exercise 02 : 

In [37]:
sentence = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."

In [38]:
import spacy
from nltk.tokenize import sent_tokenize
from textblob import TextBlob

In [39]:
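# 1. spaCy Approach
nlp = spacy.load("en_core_web_sm")
# NOTE: `text` still holds the sentence from Exercise 01 at this point, not `sentence`,
# so this cell segments the wrong input. The next cell redefines `text` correctly.
doc = nlp(text)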
spacy_sentences = [sent.text for sent in doc.sents]

print("spaCy Sentence Segmentation:")
print(f"Number of Sentences: {len(spacy_sentences)}")
for idx, sent in enumerate(spacy_sentences):
    print(f"Sentence {idx+1}: {sent}")
print("\n")

# 2. NLTK Approach
nltk_sentences = sent_tokenize(text)

print("NLTK Sentence Segmentation:")
print(f"Number of Sentences: {len(nltk_sentences)}")
for idx, sent in enumerate(nltk_sentences):
    print(f"Sentence {idx+1}: {sent}")
print("\n")

# 3. TextBlob Approach
blob = TextBlob(text)
textblob_sentences = [str(sentence) for sentence in blob.sentences]

print("TextBlob Sentence Segmentation:")
print(f"Number of Sentences: {len(textblob_sentences)}")
for idx, sent in enumerate(textblob_sentences):
    print(f"Sentence {idx+1}: {sent}")
print("\n")

spaCy Sentence Segmentation:
Number of Sentences: 1
Sentence 1: Google is planning to purchase an U.S. software company for $120 million.


NLTK Sentence Segmentation:
Number of Sentences: 1
Sentence 1: Google is planning to purchase an U.S. software company for $120 million.


TextBlob Sentence Segmentation:
Number of Sentences: 1
Sentence 1: Google is planning to purchase an U.S. software company for $120 million.




In [40]:
import spacy
import nltk
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
nltk.download("punkt")
text = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."
doc = nlp(text)
spacy_sentences = [sent.text for sent in doc.sents]
nltk_sentences = nltk.sent_tokenize(text)
blob = TextBlob(text)
textblob_sentences = [str(sentence) for sentence in blob.sentences]

[nltk_data] Downloading package punkt to /home/wissem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [41]:
print("spaCy Sentence Segmentation:")
for sentence in spacy_sentences:
    print(f"- {sentence}")

print("\nNLTK Sentence Segmentation:")
for sentence in nltk_sentences:
    print(f"- {sentence}")

print("\nTextBlob Sentence Segmentation:")
for sentence in textblob_sentences:
    print(f"- {sentence}")

spaCy Sentence Segmentation:
- Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
- Did he mind?
- Adam Jones Jr. thinks he didn't.
- In any case, this isn't true...
- Well, with a probability of .9 it isn't.

NLTK Sentence Segmentation:
- Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.
- he paid a lot for it.
- Did he mind?
- Adam Jones Jr. thinks he didn't.
- In any case, this isn't true... Well, with a probability of .9 it isn't.

TextBlob Sentence Segmentation:
- Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.
- he paid a lot for it.
- Did he mind?
- Adam Jones Jr. thinks he didn't.
- In any case, this isn't true... Well, with a probability of .9 it isn't.


`spaCy` gives the cleanest segmentation of the three on this text.

### Q2) Compare the results of sentence segmentation from spaCy, NLTK, and TextBlob. Are there any differences in how they handle abbreviations, ellipsis, or other special cases?
Comparison:

- spaCy: Handles the abbreviations in this text well. "Mr.", "i.e." and "Jr." are not treated as sentence boundaries. It does split after the ellipsis, so "In any case, this isn't true..." and "Well, with a probability of .9 it isn't." become separate sentences.
- NLTK: Keeps "Mr." and "Jr." inside their sentences but splits after "i.e.", treating its final period as an end-of-sentence marker. It does not split after the ellipsis, so the last two sentences stay merged.
- TextBlob: Produces exactly the same segmentation as NLTK on this text, including the split after "i.e." and the merged sentences around the ellipsis.
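
To see why abbreviations are the hard part of sentence segmentation, here is a minimal rule-based splitter using only the standard library. It is a sketch under stated assumptions, not what spaCy, NLTK, or TextBlob actually do: `naive_split` and the `KNOWN_ABBREVIATIONS` set are hypothetical names introduced for this example.

```python
import re

# Assumption: a hand-picked abbreviation list for this one text.
KNOWN_ABBREVIATIONS = frozenset({"Mr.", "i.e.", "Jr."})


def naive_split(text, abbreviations=frozenset()):
    """Split after '.', '?' or '!' followed by whitespace, unless the
    word ending at that point is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]+\s+", text):
        # Last whitespace-delimited word before the candidate boundary.
        last_word = text[start:match.end()].split()[-1]
        if last_word in abbreviations:
            continue  # not a real boundary, keep scanning
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences


text = (
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. "
    "Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... "
    "Well, with a probability of .9 it isn't."
)

print(len(naive_split(text)))                       # 8
print(len(naive_split(text, KNOWN_ABBREVIATIONS)))  # 5
for sentence in naive_split(text, KNOWN_ABBREVIATIONS):
    print(f"- {sentence}")
```

Without the abbreviation list, every "Mr.", "i.e." and "Jr." ends a sentence. With it, this toy rule happens to reproduce the five sentences spaCy returned above; a statistical or parser-based segmenter reaches its boundaries by other means.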

## Exercise 03 : Part of Speech

In [42]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [43]:
sentence = (
    "The NLP system accurately classified 95% of the customer feedback as positive."
)
doc = nlp(sentence)

In [44]:
for token in doc:
    print(
        f"Token: {token.text} | POS: {token.pos_} | Detailed Tag: {token.tag_} | Lemma: {token.lemma_}"
    )

Token: The | POS: DET | Detailed Tag: DT | Lemma: the
Token: NLP | POS: PROPN | Detailed Tag: NNP | Lemma: NLP
Token: system | POS: NOUN | Detailed Tag: NN | Lemma: system
Token: accurately | POS: ADV | Detailed Tag: RB | Lemma: accurately
Token: classified | POS: VERB | Detailed Tag: VBD | Lemma: classify
Token: 95 | POS: NUM | Detailed Tag: CD | Lemma: 95
Token: % | POS: NOUN | Detailed Tag: NN | Lemma: %
Token: of | POS: ADP | Detailed Tag: IN | Lemma: of
Token: the | POS: DET | Detailed Tag: DT | Lemma: the
Token: customer | POS: NOUN | Detailed Tag: NN | Lemma: customer
Token: feedback | POS: NOUN | Detailed Tag: NN | Lemma: feedback
Token: as | POS: ADP | Detailed Tag: IN | Lemma: as
Token: positive | POS: ADJ | Detailed Tag: JJ | Lemma: positive
Token: . | POS: PUNCT | Detailed Tag: . | Lemma: .


### Q2) Different POS Tags in spaCy and Their Representation:
- `DET (Determiner):`
Example: "The" – Refers to a word that introduces a noun (e.g., articles like 'the', 'a', etc.).

- `PROPN (Proper Noun):`
Example: "NLP" – A proper noun is a specific name of a person, place, or entity.

- `NOUN (Noun):`
Example: "system" – Refers to a general noun, not specific to a proper name.

- `ADV (Adverb):`
Example: "accurately" – Modifies or gives additional information about a verb, adjective, or other adverbs.

- `VERB (Verb):`
Example: "classified" – A word that expresses an action, state, or occurrence.

### Q3) Handling Multi-word Expressions and Abbreviations in POS Tagging:
Abbreviations and multi-word expressions: POS tagging operates on the tokens produced by the tokenizer. The abbreviation "NLP" stays a single token and is tagged PROPN (proper noun). "95%" is not a single token: it is split into "95", tagged NUM (numeral), and "%", tagged NOUN. A multi-word expression such as "customer feedback" is likewise tagged word by word (NOUN, NOUN); spaCy's dependency parse then describes how these tokens relate to each other and to the rest of the sentence.

In [45]:
import nltk
from nltk import word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package punkt to /home/wissem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/wissem/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [46]:
sentence = (
    "The NLP system accurately classified 95% of the customer feedback as positive."
)
tokens = word_tokenize(sentence)
print(f"We have {len(tokens)} tokens")
nltk_pos_tags = nltk.pos_tag(tokens)
for token, tag in nltk_pos_tags:
    print(f"token : {token} | pos_tag : {tag} ")

We have 14 tokens
token : The | pos_tag : DT 
token : NLP | pos_tag : NNP 
token : system | pos_tag : NN 
token : accurately | pos_tag : RB 
token : classified | pos_tag : VBD 
token : 95 | pos_tag : CD 
token : % | pos_tag : NN 
token : of | pos_tag : IN 
token : the | pos_tag : DT 
token : customer | pos_tag : NN 
token : feedback | pos_tag : NN 
token : as | pos_tag : IN 
token : positive | pos_tag : JJ 
token : . | pos_tag : . 


### Differences between spaCy and NLTK:

- spaCy: Exposes two tag levels per token: a coarse Universal POS tag in `pos_` (like NOUN, VERB) and a fine-grained Penn Treebank-style tag in `tag_` (like NN, VBD).
- NLTK: `nltk.pos_tag` returns only Penn Treebank tags (e.g., NN for nouns, VBD for past-tense verbs). For this sentence they match spaCy's `tag_` values token for token.
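
The relationship between the two tag levels can be sketched as a lookup table. The mapping below is a hypothetical, partial one that covers only the Penn Treebank tags seen in this exercise (a full mapping is larger, and some tags such as IN can map to more than one coarse tag); `PTB_TO_COARSE`, `to_coarse`, and `nltk_output` are names invented for the illustration.

```python
# Assumption: partial mapping, limited to the tags that appear in the outputs above.
PTB_TO_COARSE = {
    "DT": "DET",
    "NNP": "PROPN",
    "NN": "NOUN",
    "RB": "ADV",
    "VBD": "VERB",
    "CD": "NUM",
    "IN": "ADP",
    "JJ": "ADJ",
    ".": "PUNCT",
}

# (token, tag) pairs copied from the NLTK output above.
nltk_output = [
    ("The", "DT"), ("NLP", "NNP"), ("system", "NN"), ("accurately", "RB"),
    ("classified", "VBD"), ("95", "CD"), ("%", "NN"), ("of", "IN"),
    ("the", "DT"), ("customer", "NN"), ("feedback", "NN"), ("as", "IN"),
    ("positive", "JJ"), (".", "."),
]


def to_coarse(tagged_tokens):
    """Map fine-grained tags to coarse ones; unknown tags fall back to 'X'."""
    return [(token, PTB_TO_COARSE.get(tag, "X")) for token, tag in tagged_tokens]


for token, coarse in to_coarse(nltk_output):
    print(f"{token} | {coarse}")
```

For this sentence the mapped tags line up with the `pos_` column spaCy printed earlier, which shows that the two libraries agree on the fine-grained tags here and differ mainly in what extra levels they expose.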

## Exercise 04 : [Stemming, Lemmatization, Named Entity Recognition, Stop words]

In [47]:
import spacy
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/wissem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/wissem/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/wissem/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/wissem/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /home/wissem/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [48]:
text = "Apple is looking at buying a U.K. startup for $1 billion. John is excited about the new venture."

#### Using Spacy

In [49]:
# using spacy
doc = nlp(text)
lemmatized_words = [token.lemma_ for token in doc]
print("Lemmatized words:", lemmatized_words)

Lemmatized words: ['Apple', 'be', 'look', 'at', 'buy', 'a', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.', 'John', 'be', 'excited', 'about', 'the', 'new', 'venture', '.']


In [50]:
# NER with spacy
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY
John PERSON


In [51]:
# stop words in spacy
stop_words = [token.text for token in doc if token.is_stop]
print("Stop words:", stop_words)

Stop words: ['is', 'at', 'a', 'for', 'is', 'about', 'the']


#### Using NLTK

In [52]:
# stemming with NLTK's PorterStemmer (tokens come from spaCy)
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(token.text) for token in nlp(text)]
print("Stemmed words:", stemmed_words)

Stemmed words: ['appl', 'is', 'look', 'at', 'buy', 'a', 'u.k.', 'startup', 'for', '$', '1', 'billion', '.', 'john', 'is', 'excit', 'about', 'the', 'new', 'ventur', '.']


In [53]:
from nltk.tokenize import word_tokenize

words = word_tokenize(text)
print("Words:", words)

Words: ['Apple', 'is', 'looking', 'at', 'buying', 'a', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.', 'John', 'is', 'excited', 'about', 'the', 'new', 'venture', '.']


| Feature               | Stemming                                      | Lemmatization                                 |
|-----------------------|-----------------------------------------------|-----------------------------------------------|
| **Definition**        | Reduces a word to a stem by stripping affixes (mainly suffixes). The result is not always a valid word. | Reduces a word to its lemma, the base or dictionary form, taking the word's meaning and context into account. |
| **Process**           | Chops off word endings based on rules         | Uses vocabulary and context to find base form |
| **Output**            | Can result in non-real words (e.g., "ventur", "excit") | Produces actual words (e.g., "look", "buy") |
| **Context-Awareness** | Not context-aware                             | Context-aware (considers POS)                 |
| **Accuracy**          | Less accurate, faster                         | More accurate, slightly slower                |
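
The contrast in the table can be made concrete with a toy example. This is only a sketch: `toy_stem` is a crude suffix stripper, not the Porter algorithm, and `TOY_LEMMAS` is a tiny hypothetical dictionary standing in for a real lemmatizer's vocabulary.

```python
def toy_stem(word):
    """Rule-based: lowercase, then strip the first matching suffix.
    No dictionary and no context are consulted."""
    word = word.lower()
    for suffix in ("ing", "ed", "e", "s"):
        # Keep at least three characters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Assumption: a hand-made (word, POS) -> lemma table for illustration only.
TOY_LEMMAS = {
    ("looking", "VERB"): "look",
    ("buying", "VERB"): "buy",
    ("is", "AUX"): "be",
    ("excited", "VERB"): "excite",
    ("excited", "ADJ"): "excited",
}


def toy_lemmatize(word, pos):
    """Dictionary-based: the answer depends on the part of speech."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


for word in ["Apple", "looking", "excited", "venture"]:
    print(word, "->", toy_stem(word))

print(toy_lemmatize("excited", "ADJ"))   # excited
print(toy_lemmatize("excited", "VERB"))  # excite
```

The stemmer returns "appl", "look", "excit" and "ventur" regardless of context, while the lemmatizer gives different answers for "excited" depending on whether it is used as an adjective or a verb.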


In [54]:
from nltk import pos_tag, ne_chunk

pos_tags = pos_tag(words)
ner_tree = ne_chunk(pos_tags)
named_entities = []
for subtree in ner_tree:
    if hasattr(subtree, "label"):
        entity = " ".join([leaf[0] for leaf in subtree.leaves()])
        entity_type = subtree.label()
        named_entities.append((entity, entity_type))

print("Named Entities:", named_entities)

Named Entities: [('Apple', 'GPE'), ('John', 'PERSON')]


In [55]:
from nltk.corpus import stopwords

# keep the reference set and the filtered result in separate variables
english_stop_words = set(stopwords.words("english"))
stop_words = [word for word in words if word.lower() in english_stop_words]
print(f"Stop words :{stop_words}")

Stop words :['is', 'at', 'a', 'for', 'is', 'about', 'the']


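### Difference between Stemming and Lemmatization

The table above summarizes the contrast. In this exercise, the Porter stemmer produced truncated forms such as "appl", "excit" and "ventur", while spaCy's lemmatizer returned dictionary forms such as "look" (from "looking") and "be" (from "is").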

## Notes : 
- When performing sentiment analysis, removing all stop words is not recommended: negation stop words such as "not" must be excluded from the filter and kept in the text.
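
A minimal sketch of this idea, using only the standard library. The `STOP_WORDS` and `NEGATIONS` sets below are small hypothetical stand-ins for a real stop-word list such as the ones used above, and `remove_stop_words` is a name made up for the example.

```python
# Assumption: toy word lists, far smaller than a real stop-word list.
STOP_WORDS = {"this", "is", "at", "a", "for", "about", "the", "not", "no"}
NEGATIONS = {"not", "no", "never"}


def remove_stop_words(tokens, keep=NEGATIONS):
    """Drop stop words, except those listed in `keep`."""
    return [
        token
        for token in tokens
        if token.lower() not in STOP_WORDS or token.lower() in keep
    ]


tokens = ["This", "movie", "is", "not", "good"]

print(remove_stop_words(tokens, keep=set()))  # ['movie', 'good']
print(remove_stop_words(tokens))              # ['movie', 'not', 'good']
```

With every stop word removed, the sentence reads as positive ("movie good"); keeping the negation preserves the original polarity.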