# Lab 1

## Exercise 01 :

In [30]:
import spacy

spacy.__version__

'3.7.5'

In [31]:
# load the small English pipeline (it must already be installed,
# e.g. via `python -m spacy download en_core_web_sm`)
model = spacy.load("en_core_web_sm")

In [32]:
print(type(model))

<class 'spacy.lang.en.English'>


In [33]:
sent = "Google is planning to purchase a U.S. software company for $120 million."
result = model(sent)

In [34]:
result

Google is planning to purchase a U.S. software company for $120 million.

In [35]:
print(type(result))

<class 'spacy.tokens.doc.Doc'>


In [36]:
for token in result:
    print(
        f"{token.text} | {token.pos_} | {token.dep_} |{token.lemma_} | {token.is_stop} | {token.is_punct}"
    )

Google | PROPN | nsubj |Google | False | False
is | AUX | aux |be | True | False
planning | VERB | ROOT |plan | False | False
to | PART | aux |to | True | False
purchase | VERB | xcomp |purchase | False | False
a | DET | det |a | True | False
U.S. | PROPN | compound |U.S. | False | False
software | NOUN | compound |software | False | False
company | NOUN | dobj |company | False | False
for | ADP | prep |for | True | False
$ | SYM | quantmod |$ | False | False
120 | NUM | compound |120 | False | False
million | NUM | pobj |million | False | False
. | PUNCT | punct |. | False | True


### Q2) Five different properties of tokens that can be accessed using spaCy

`text`: The original text of the token.

Example: token.text returns "Google", "is", "planning", etc.

`pos_`: The part-of-speech tag of the token.

Example: token.pos_ might return "NOUN" for nouns, "VERB" for verbs.

`dep_`: The syntactic dependency relation of the token (i.e., how the token relates to other tokens).

Example: token.dep_ might return "nsubj" for a subject or "dobj" for a direct object.

`lemma_`: The base form of the word.

Example: token.lemma_ for the token "planning" returns "plan".

`is_stop`: A boolean indicating whether the token is a stop word (e.g., "the", "is", "and").

Example: token.is_stop returns True for common stop words.

### Q3) How does spaCy handle special cases in tokenization, such as punctuation, numbers, and abbreviations?

`Punctuation:` spaCy splits punctuation into separate tokens. For example, a period, comma, or quotation mark attached to a word becomes its own token.

`Numbers:` A number is kept as a single token, but symbols attached to it are split off. For example, "$120" is tokenized as "$" and "120".

`Abbreviations`: spaCy keeps known abbreviations such as "U.S." as a single token instead of splitting at the internal periods.
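
To make the three cases concrete, here is a minimal sketch of a rule-based tokenizer that uses only the standard library. It is a toy illustration, not spaCy's implementation: `toy_tokenize`, `ABBREVIATIONS`, `PREFIXES`, and `SUFFIXES` are hypothetical names, and the rule lists are far smaller than what a real tokenizer ships with.

```python
# Hypothetical, deliberately tiny rule lists (assumptions for illustration only).
ABBREVIATIONS = {"U.S.", "Mr.", "Jr.", "i.e.", "e.g."}
PREFIXES = "$(\"'"
SUFFIXES = ".,!?;:)\"'"


def toy_tokenize(text):
    """Split on whitespace, then peel punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        # Abbreviations: a known exception is kept whole.
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue

        leading, trailing = [], []
        # Numbers and currency: peel symbols such as "$" off the front.
        while chunk and chunk[0] in PREFIXES:
            leading.append(chunk[0])
            chunk = chunk[1:]
        # Punctuation: peel marks off the end, unless what remains
        # is itself a known abbreviation.
        while chunk and chunk[-1] in SUFFIXES and chunk not in ABBREVIATIONS:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]

        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


sent = "Google is planning to purchase a U.S. software company for $120 million."
print(toy_tokenize(sent))
```

On this sentence the sketch yields 14 tokens, with "$", "120", "million", and "." separated and "U.S." intact.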

### Q4) How does spaCy's tokenization differ from simple string splitting? Provide an example to illustrate the difference.

In [37]:
tokens1 = sent.split()
tokens2 = model(sent)
print(f"Text splitting : {len(tokens1)} tokens")
print(tokens1)
print("====================")
print(f"Using Spacy : {len(tokens2)} tokens")
for token in tokens2:
    print(f"{token.text} ")

Text splitting : 12 tokens
['Google', 'is', 'planning', 'to', 'purchase', 'a', 'U.S.', 'software', 'company', 'for', '$120', 'million.']
====================
Using Spacy : 14 tokens
Google 
is 
planning 
to 
purchase 
a 
U.S. 
software 
company 
for 
$ 
120 
million 
. 


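spaCy splits the final period off "million." and the "$" off "$120" into separate tokens, which is why it returns 14 tokens where `split()` returns 12.
Both keep "U.S." whole here, but `split()` does so only because the abbreviation contains no whitespace. spaCy relies on tokenizer rules for abbreviations, so it can strip real punctuation without breaking the abbreviation apart.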
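
A short standard-library sketch shows why neither whitespace splitting nor a naive regular expression is a good substitute. The regex pattern is an assumption chosen for illustration, not something spaCy or NLTK uses.

```python
import re

sent = "Google is planning to purchase a U.S. software company for $120 million."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sent.split()

# A naive regex (runs of word characters, or any single non-space symbol)
# errs in the other direction and breaks "U.S." into four pieces.
regex_tokens = re.findall(r"\w+|[^\w\s]", sent)

print(len(whitespace_tokens), whitespace_tokens[-2:])  # 12 ['$120', 'million.']
print(len(regex_tokens), regex_tokens[6:10])           # 17 ['U', '.', 'S', '.']
print("million" in whitespace_tokens)                  # False
```

Whitespace splitting under-splits (12 tokens) and the naive regex over-splits (17 tokens), while spaCy's 14 tokens sit in between.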

### Q5) Do the tokenization this time with word_tokenize from NLTK. What are the differences?

In [38]:
from nltk.tokenize import word_tokenize

text = "Google is planning to purchase an U.S. software company for $120 million."
nltk_tokens = word_tokenize(text)
print(nltk_tokens)

['Google', 'is', 'planning', 'to', 'purchase', 'an', 'U.S.', 'software', 'company', 'for', '$', '120', 'million', '.']


NLTK and spaCy split this sentence in the same places: both separate "$" from "120" and the final period from "million", and both keep "U.S." whole.
The primary difference is what each returns: spaCy returns a `Doc` whose tokens carry linguistic annotations (e.g., part-of-speech tags, dependency labels, lemmas), while NLTK’s word_tokenize returns a plain list of strings without any additional linguistic analysis.
spaCy can additionally group multi-word proper nouns through its named entity recognizer (`doc.ents`), which word_tokenize has no counterpart for.

___________________

## Exercise 02 : 

In [39]:
sentence = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."

In [40]:
import spacy
from nltk.tokenize import sent_tokenize
from textblob import TextBlob

In [41]:
# 1. spaCy Approach
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
spacy_sentences = [sent.text for sent in doc.sents]

print("spaCy Sentence Segmentation:")
print(f"Number of Sentences: {len(spacy_sentences)}")
for idx, sent in enumerate(spacy_sentences):
    print(f"Sentence {idx+1}: {sent}")
print("\n")

# 2. NLTK Approach
nltk_sentences = sent_tokenize(sentence)

print("NLTK Sentence Segmentation:")
print(f"Number of Sentences: {len(nltk_sentences)}")
for idx, sent in enumerate(nltk_sentences):
    print(f"Sentence {idx+1}: {sent}")
print("\n")

# 3. TextBlob Approach
blob = TextBlob(sentence)
textblob_sentences = [str(s) for s in blob.sentences]

print("TextBlob Sentence Segmentation:")
print(f"Number of Sentences: {len(textblob_sentences)}")
for idx, sent in enumerate(textblob_sentences):
    print(f"Sentence {idx+1}: {sent}")
print("\n")

spaCy Sentence Segmentation:
Number of Sentences: 5
Sentence 1: Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Sentence 2: Did he mind?
Sentence 3: Adam Jones Jr. thinks he didn't.
Sentence 4: In any case, this isn't true...
Sentence 5: Well, with a probability of .9 it isn't.


NLTK Sentence Segmentation:
Number of Sentences: 5
Sentence 1: Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.
Sentence 2: he paid a lot for it.
Sentence 3: Did he mind?
Sentence 4: Adam Jones Jr. thinks he didn't.
Sentence 5: In any case, this isn't true... Well, with a probability of .9 it isn't.


TextBlob Sentence Segmentation:
Number of Sentences: 5
Sentence 1: Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.
Sentence 2: he paid a lot for it.
Sentence 3: Did he mind?
Sentence 4: Adam Jones Jr. thinks he didn't.
Sentence 5: In any case, this isn't true... Well, with a probability of .9 it isn't.




In [43]:
import spacy
import nltk
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
nltk.download("punkt")
text = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."
doc = nlp(text)
spacy_sentences = [sent.text for sent in doc.sents]
nltk_sentences = nltk.sent_tokenize(text)
blob = TextBlob(text)
textblob_sentences = [str(sentence) for sentence in blob.sentences]

[nltk_data] Downloading package punkt to /home/wissem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [44]:
print("spaCy Sentence Segmentation:")
for sentence in spacy_sentences:
    print(f"- {sentence}")

print("\nNLTK Sentence Segmentation:")
for sentence in nltk_sentences:
    print(f"- {sentence}")

print("\nTextBlob Sentence Segmentation:")
for sentence in textblob_sentences:
    print(f"- {sentence}")

spaCy Sentence Segmentation:
- Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
- Did he mind?
- Adam Jones Jr. thinks he didn't.
- In any case, this isn't true...
- Well, with a probability of .9 it isn't.

NLTK Sentence Segmentation:
- Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.
- he paid a lot for it.
- Did he mind?
- Adam Jones Jr. thinks he didn't.
- In any case, this isn't true... Well, with a probability of .9 it isn't.

TextBlob Sentence Segmentation:
- Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e.
- he paid a lot for it.
- Did he mind?
- Adam Jones Jr. thinks he didn't.
- In any case, this isn't true... Well, with a probability of .9 it isn't.


`spaCy` gives the cleanest segmentation of the three on this text.

### Q2) Compare the results of sentence segmentation from spaCy, NLTK, and TextBlob. Are there any differences in how they handle abbreviations, ellipsis, or other special cases?
Comparison:

`spaCy`: Handles the abbreviations and the ellipsis correctly on this text. It does not treat "Mr.", "i.e.", or "Jr." as sentence boundaries, and it splits after "isn't true...", where a new sentence ("Well, ...") actually begins. The result is the five expected sentences.

`NLTK`: Handles "Mr." and "Jr." correctly but splits after "i.e.", treating its final period as an end-of-sentence marker. It also does not split after the ellipsis, so "In any case, this isn't true... Well, with a probability of .9 it isn't." stays as one sentence.

`TextBlob`: Produces exactly the same segmentation as NLTK, including the wrong split after "i.e." and the missed boundary after the ellipsis. This is expected, since TextBlob's default sentence tokenizer is built on NLTK's.

None of the three is confused by the periods in "cheapsite.com", "1.5", or ".9".
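
To see why abbreviations are the hard part, the following standard-library sketch contrasts a naive splitter with one that consults an abbreviation list. Neither is what spaCy, NLTK, or TextBlob actually does: `naive_split`, `abbreviation_aware_split`, and `ABBREVIATIONS` are hypothetical names, and the abbreviation list is hand-picked for this text.

```python
import re

text = (
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
    "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
    "isn't true... Well, with a probability of .9 it isn't."
)

# Hypothetical abbreviation list, chosen by hand for this example.
ABBREVIATIONS = {"mr.", "jr.", "i.e.", "e.g."}


def naive_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


def abbreviation_aware_split(text):
    """Close a sentence at a word ending in '.', '!' or '?', unless that
    word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(len(naive_split(text)))               # 8
print(len(abbreviation_aware_split(text)))  # 5
for s in abbreviation_aware_split(text):
    print(f"- {s}")
```

The naive version breaks after "Mr.", "i.e.", and "Jr.", while the abbreviation-aware version recovers the five sentences. A fixed list is brittle, though: it would wrongly glue sentences together whenever an abbreviation really does end a sentence.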