In [1]:
!pip install transformers==4.35.2 -q

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 5.2.3 requires transformers<6.0.0,>=4.41.0, but you have transformers 4.35.2 which is incompatible.[0m[31m
[0m

1. Tokenization  
Tokenize both sentences into words using spaCy. Print the list of tokens for each sentence.
Also tokenize both sentences with the benepar library and print the tokens it produces (the leaves of each parse tree).

In [13]:
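import spacy
import benepar
import nltk
from nltk import Tree

# NLTK tokenizer data, used when benepar is handed a raw string to parse.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Prefer the medium English model; fall back to the small one if it is not installed.
try:
    nlp = spacy.load("en_core_web_md")
except OSError:
    nlp = spacy.load("en_core_web_sm")

# Fetch the benepar English model (if needed) and load the constituency parser.
benepar.download('benepar_en3', quiet=True)
bp = benepar.Parser("benepar_en3")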

sentence1 = (
    "Four score and seven years ago our fathers brought forth on this continent, "
    "a new nation, conceived in Liberty, and dedicated to the proposition "
    "that all men are created equal."
)
sentence2 = (
    "Now we are engaged in a great civil war, testing whether that nation, "
    "or any nation so conceived and so dedicated, can long endure."
)

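# spaCy: tokens, POS tags, and dependency parse for each sentence.
doc1 = nlp(sentence1)
doc2 = nlp(sentence2)
# benepar: constituency parse of each sentence, returned as an nltk Tree.
tree1 = bp.parse(sentence1)
tree2 = bp.parse(sentence2)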

print("\n--- Sentence 1 Tokens (spaCy) ---")
print([token.text for token in doc1])

print("\n--- Sentence 2 Tokens (spaCy) ---")
print([token.text for token in doc2])

print("\n--- Sentence 1 Tokens (benepar leaves) ---")
print(tree1.leaves())

print("\n--- Sentence 2 Tokens (benepar leaves) ---")
print(tree2.leaves())

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



--- Sentence 1 Tokens (spaCy) ---
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.']

--- Sentence 2 Tokens (spaCy) ---
['Now', 'we', 'are', 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.']

--- Sentence 1 Tokens (benepar leaves) ---
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.']

--- Sentence 2 Tokens (benepar leaves) ---
['Now', 'we', 'are', 'engaged', 'in', 'a', 'great', '
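
Rather than comparing the printed token lists by eye, the two tokenizations can be checked for the first position where they differ. The sketch below uses a hypothetical helper, `first_mismatch` (not part of spaCy or benepar), and runs it on the first six tokens of sentence 1 as printed above.

```python
def first_mismatch(a, b):
    """Return the index of the first position where a and b differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One list is a strict prefix of the other: they diverge where the shorter one ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# First six tokens of sentence 1, copied from the spaCy and benepar outputs above.
spacy_tokens = ['Four', 'score', 'and', 'seven', 'years', 'ago']
benepar_leaves = ['Four', 'score', 'and', 'seven', 'years', 'ago']

print(first_mismatch(spacy_tokens, benepar_leaves))  # None means the lists agree
```

In the notebook itself, the same check would be `first_mismatch([t.text for t in doc1], tree1.leaves())`.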

2. Part-of-Speech Tagging  
Print the part-of-speech (POS) tag for each token in the first sentence.

In [14]:
print(f"\n{'Token':<20} {'Coarse POS':<12} {'Fine-grained Tag'}")
print("-" * 50)
for token in doc1:
    print(f"{token.text:<20} {token.pos_:<12} {token.tag_}")


Token                Coarse POS   Fine-grained Tag
--------------------------------------------------
Four                 NUM          CD
score                NOUN         NN
and                  CCONJ        CC
seven                NUM          CD
years                NOUN         NNS
ago                  ADV          RB
our                  PRON         PRP$
fathers              NOUN         NNS
brought              VERB         VBD
forth                ADP          RP
on                   ADP          IN
this                 DET          DT
continent            NOUN         NN
,                    PUNCT        ,
a                    DET          DT
new                  ADJ          JJ
nation               NOUN         NN
,                    PUNCT        ,
conceived            VERB         VBN
in                   ADP          IN
Liberty              PROPN        NNP
,                    PUNCT        ,
and                  CCONJ        CC
dedicated            VERB         VBN
to  

3. Dependency Parsing  
Print the dependency relation and head word for each token in the second sentence.

In [15]:
print(f"\n{'Token':<20} {'Dep Relation':<16} {'Head Word'}")
print("-" * 50)
for token in doc2:
    print(f"{token.text:<20} {token.dep_:<16} {token.head.text}")


Token                Dep Relation     Head Word
--------------------------------------------------
Now                  advmod           engaged
we                   nsubjpass        engaged
are                  auxpass          engaged
engaged              ROOT             engaged
in                   prep             engaged
a                    det              war
great                amod             war
civil                amod             war
war                  pobj             in
,                    punct            engaged
testing              advcl            engaged
whether              mark             conceived
that                 det              nation
nation               nsubj            conceived
,                    punct            nation
or                   cc               nation
any                  det              nation
nation               conj             nation
so                   advmod           conceived
conceived            ccomp            test
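
The head column defines a tree: following heads upward from any token eventually reaches the ROOT token, which spaCy reports as its own head. The sketch below illustrates this using only the first nine rows of the table above, copied by hand. `path_to_root` is a hypothetical helper, and keying by word text works here only because these nine words are all distinct. On a real `Doc`, indexing by `token.i` and `token.head.i` would avoid ambiguity for repeated words such as "nation".

```python
# (token, dependency relation, head word) for the first nine rows of the table above.
ROWS = [
    ("Now", "advmod", "engaged"),
    ("we", "nsubjpass", "engaged"),
    ("are", "auxpass", "engaged"),
    ("engaged", "ROOT", "engaged"),
    ("in", "prep", "engaged"),
    ("a", "det", "war"),
    ("great", "amod", "war"),
    ("civil", "amod", "war"),
    ("war", "pobj", "in"),
]

# Map each word to its head word (valid only while the words are unique).
HEAD = {token: head for token, _, head in ROWS}


def path_to_root(token):
    """Follow head links from token until reaching the ROOT, which is its own head."""
    path = [token]
    while HEAD[path[-1]] != path[-1]:
        path.append(HEAD[path[-1]])
    return path


print(" -> ".join(path_to_root("great")))  # great -> war -> in -> engaged
```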

4. Constituency Parsing  
Using the NLTK and benepar libraries, print the constituency (phrase structure) parse tree
of the first sentence.

In [16]:
tree1.pretty_print()

print("\nRaw parse string:")
print(str(tree1))

                                                                                                          TOP                                                                                                                  
                                                                                                           |                                                                                                                    
                                                                                                           S                                                                                                                   
       ____________________________________________________________________________________________________|_________________________________________________________________________________________________________________   
      |         |                                                                                     

5. Extract Noun Phrases  
Using spaCy, extract all noun phrases (noun chunks) from both sentences.

In [17]:
print("\n--- Sentence 1 Noun Phrases ---")
for chunk in doc1.noun_chunks:
    print(f"  '{chunk.text}'  \t root='{chunk.root.text}', dep='{chunk.root.dep_}'")

print("\n--- Sentence 2 Noun Phrases ---")
for chunk in doc2.noun_chunks:
    print(f"  '{chunk.text}'  \t root='{chunk.root.text}', dep='{chunk.root.dep_}'")


--- Sentence 1 Noun Phrases ---
  'Four score'  	 root='score', dep='nsubj'
  'our fathers'  	 root='fathers', dep='nsubj'
  'this continent'  	 root='continent', dep='pobj'
  'Liberty'  	 root='Liberty', dep='pobj'
  'the proposition'  	 root='proposition', dep='pobj'
  'all men'  	 root='men', dep='nsubjpass'

--- Sentence 2 Noun Phrases ---
  'we'  	 root='we', dep='nsubjpass'
  'a great civil war'  	 root='war', dep='pobj'
  'that nation'  	 root='nation', dep='nsubj'
  'any nation'  	 root='nation', dep='conj'


6. CRF and HMM  
Why do you use CRF and HMM? How do they differ? Please summarize in fewer than 50
words.

Both are used for sequence labeling (e.g., POS tagging, NER). An HMM is generative: it models the joint P(X,Y) under Markov and output-independence assumptions. A CRF is discriminative: it models P(Y|X) directly, so it can use rich, overlapping features, and it typically outperforms HMMs on these tasks.
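
To make the generative side concrete, here is a minimal sketch of an HMM tagger on a three-word example. Every probability in it is invented for illustration; nothing is estimated from data, and the tag set is cut down to three tags. `joint_prob` computes P(X,Y) as a product of start, transition, and emission probabilities, and `viterbi` finds the tag sequence that maximizes it. A linear-chain CRF would decode with the same kind of dynamic program, but would score each step with weighted features of the whole input and normalize once over entire tag sequences; that part is not shown here.

```python
# Toy HMM for POS tagging. All numbers are made up for illustration.
STATES = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"all": 0.9, "men": 0.05, "endure": 0.05},
    "NOUN": {"all": 0.1, "men": 0.8, "endure": 0.1},
    "VERB": {"all": 0.05, "men": 0.05, "endure": 0.9},
}


def joint_prob(words, tags):
    """P(X, Y): start * emission, then transition * emission for each later step."""
    p = START[tags[0]] * EMIT[tags[0]][words[0]]
    for prev, cur, word in zip(tags, tags[1:], words[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][word]
    return p


def viterbi(words):
    """Return (best_tags, probability) maximizing the joint probability."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (START[s] * EMIT[s][words[0]], None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev = max(STATES, key=lambda r: best[-1][r][0] * TRANS[r][s])
            column[s] = (best[-1][prev][0] * TRANS[prev][s] * EMIT[s][word], prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    prob = best[-1][state][0]
    tags = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        tags.append(state)
    return tags[::-1], prob


words = ["all", "men", "endure"]
tags, prob = viterbi(words)
print(tags, prob)
```

With these toy numbers the decoder picks DET, NOUN, VERB, and the returned probability equals `joint_prob(words, tags)` up to floating-point rounding.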