<a href="https://colab.research.google.com/github/giggsy1106/DATA-622-NLP-/blob/main/NLPHW4_KOTA_FIXED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA 622 — Homework 4 (NLP)
**Student:** Rahul Reddy Kota  

This notebook completes the following NLP tasks using **spaCy** and **benepar** (constituency parsing):
2. POS tagging (Sentence 1)
3. Dependency parsing (Sentence 2)
4. Constituency parsing (Sentence 1)
5. Noun phrase extraction (Sentences 1 & 2)
6. Short written comparison: **CRF vs HMM**

---
## Setup notes
If you run this in **Google Colab**, run the setup cell below once, then choose **Runtime → Restart runtime** before continuing.
If you run it locally, use a virtual environment and install the same package versions.


In [None]:
# (Colab/First-time setup) Install compatible versions for benepar
# After running, RESTART runtime/kernel before continuing.

!pip -q uninstall -y transformers tokenizers
!pip -q install transformers==4.17.0 benepar spacy nltk
!python -m spacy download en_core_web_md
!python -c "import benepar; benepar.download('benepar_en3')"


Installing packages... (run once, then restart runtime)
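
Because the setup pins `transformers==4.17.0` for benepar, it can help to confirm what is actually installed after the restart. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name introduced here, not part of any of these packages.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages named in the setup cell above
for name in ("transformers", "benepar", "spacy", "nltk"):
    print(f"{name:<14} {installed_version(name) or 'not installed'}")
```

If `transformers` reports something other than 4.17.0, rerunning the setup cell and restarting the runtime is the likely fix.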


## Step 1 — Load libraries, models, and input sentences

In [None]:
# ── Imports ──────────────────────────────────────────────────
import spacy
import benepar
import nltk
from nltk import Tree

# ── Model download (skipped if benepar_en3 is already present) ──
benepar.download('benepar_en3')

# ── Load the spaCy model and add benepar to the pipeline ─────
nlp = spacy.load('en_core_web_md')
if 'benepar' not in nlp.pipe_names:  # guard so re-running the cell does not add it twice
    nlp.add_pipe('benepar', config={'model': 'benepar_en3'})

# ── Input sentences ──────────────────────────────────────────
sent1 = 'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'
sent2 = 'Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.'

doc1 = nlp(sent1)
doc2 = nlp(sent2)

# benepar attaches its parse to sentence spans, so keep the first sentence of each doc
s1 = list(doc1.sents)[0]
s2 = list(doc2.sents)[0]
print('Models loaded successfully.')


Models loaded successfully.


## Task 2 — POS tagging (Sentence 1)
Below we print each token, its coarse POS tag, and spaCy’s human-readable explanation.

In [None]:
# ============================================================
# TASK 2: Part-of-Speech (POS) Tagging — Sentence 1 only
# ============================================================

# spaCy assigns POS tags based on the token's grammatical role
# token.pos_ gives the coarse-grained tag (NOUN, VERB, ADJ, etc.)
# spacy.explain() gives a human-readable description of the tag
print(f"{'Token':<25} {'POS Tag':<10} {'Description'}")
print("-" * 55)
for token in doc1:
    print(f"{token.text:<25} {token.pos_:<10} {spacy.explain(token.pos_)}")


Token                     POS Tag    Description
-------------------------------------------------------
Four                      NUM        numeral
score                     NOUN       noun
and                       CCONJ      coordinating conjunction
seven                     NUM        numeral
years                     NOUN       noun
ago                       ADV        adverb
our                       PRON       pronoun
fathers                   NOUN       noun
brought                   VERB       verb
forth                     ADV        adverb
on                        ADP        adposition
this                      DET        determiner
continent                 NOUN       noun
,                         PUNCT      punctuation
a                         DET        determiner
new                       ADJ        adjective
nation                    NOUN       noun
,                         PUNCT      punctuation
conceived                 VERB       verb
... (output truncated)
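
A quick way to summarise a tagging result is to count how often each coarse tag occurs. The sketch below is self-contained and uses only the rows that are visible in the output above (the first 19 tokens), so the totals describe that prefix rather than the whole sentence; the variable names are illustrative.

```python
from collections import Counter

# (token, coarse POS) pairs copied from the rows visible in the output above
tagged = [
    ("Four", "NUM"), ("score", "NOUN"), ("and", "CCONJ"), ("seven", "NUM"),
    ("years", "NOUN"), ("ago", "ADV"), ("our", "PRON"), ("fathers", "NOUN"),
    ("brought", "VERB"), ("forth", "ADV"), ("on", "ADP"), ("this", "DET"),
    ("continent", "NOUN"), (",", "PUNCT"), ("a", "DET"), ("new", "ADJ"),
    ("nation", "NOUN"), (",", "PUNCT"), ("conceived", "VERB"),
]

# Tally the tags and print them from most to least frequent
tag_counts = Counter(tag for _, tag in tagged)
for tag, count in tag_counts.most_common():
    print(f"{tag:<8} {count}")
```

With the live pipeline, `Counter(token.pos_ for token in doc1)` would produce the same kind of tally over the full sentence.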

## Task 3 — Dependency parsing (Sentence 2)
Below we show each token’s dependency relation and its head (governor) token.

In [None]:
# ============================================================
# TASK 3: Dependency Parsing — Sentence 2 only
# ============================================================

# Dependency parsing identifies grammatical relationships between tokens
# token.dep_ = the dependency label (e.g., 'nsubj', 'dobj', 'prep')
# token.head = the governor/head word this token is attached to
print(f"{'Token':<25} {'Dependency':<20} {'Head Word'}")
print("-" * 60)
for token in doc2:
    print(f"{token.text:<25} {token.dep_:<20} {token.head.text}")


Token                     Dependency           Head Word
------------------------------------------------------------
Now                       advmod               engaged
we                        nsubj                engaged
are                       aux                  engaged
engaged                   ROOT                 engaged
in                        prep                 engaged
a                         det                  war
great                     amod                 war
civil                     amod                 war
war                       pobj                 in
,                         punct                war
testing                   advcl                engaged
whether                   mark                 endure
that                      det                  nation
nation                    nsubj                endure
,                         punct                nation
or                        cc                   nation
... (output truncated)
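
The token/head table is a flat view of a tree: every token points at its head, and the `ROOT` token points at itself. As a rough sketch, the rows visible in the output above can be regrouped into a head-to-dependents map and printed as an indented tree. Two simplifications are assumed here: tokens are keyed by their text (so repeated words such as `,` are not distinguished), and only the visible rows are included, so the subtree under `endure` is missing. `build_children` and `show` are hypothetical helper names.

```python
from collections import defaultdict

# (token, dependency label, head) rows copied from the output above
rows = [
    ("Now", "advmod", "engaged"), ("we", "nsubj", "engaged"),
    ("are", "aux", "engaged"), ("engaged", "ROOT", "engaged"),
    ("in", "prep", "engaged"), ("a", "det", "war"),
    ("great", "amod", "war"), ("civil", "amod", "war"),
    ("war", "pobj", "in"), (",", "punct", "war"),
    ("testing", "advcl", "engaged"), ("whether", "mark", "endure"),
    ("that", "det", "nation"), ("nation", "nsubj", "endure"),
    (",", "punct", "nation"), ("or", "cc", "nation"),
]


def build_children(rows):
    """Group tokens under their head; return (root token, head -> dependents)."""
    children = defaultdict(list)
    root = None
    for token, dep, head in rows:
        if dep == "ROOT":
            root = token  # the root is its own head, so do not list it as a child
        else:
            children[head].append(token)
    return root, children


root, children = build_children(rows)


def show(word, depth=0):
    """Print a word and, indented beneath it, everything attached to it."""
    print("  " * depth + word)
    for child in children.get(word, []):
        show(child, depth + 1)


show(root)
```

On a real `Doc`, keying by `token.i` instead of the text avoids the ambiguity with repeated words, and `token.children` gives the dependents directly.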

## Task 4 — Constituency parsing (Sentence 1)
Benepar produces a Penn Treebank-style constituency parse. We print it in bracket notation and also load it into an `nltk.Tree` so it can be traversed programmatically.

In [None]:
# ============================================================
# TASK 4: Constituent (Phrase Structure) Parsing — Sentence 1
# ============================================================

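# benepar stores the parse on each sentence span as a bracketed string
parse_string = s1._.parse_string
# Load it into an nltk.Tree so the structure can be traversed later
tree = Tree.fromstring(parse_string)

print('── Constituency Parse Tree (Sentence 1) ──')
print(parse_string)  # bracket notation is plain text, so it always displays in Colab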


── Constituency Parse Tree (Sentence 1) ──
(S
  (NP
    (NP (QP (CD Four) (NNS score) (CC and) (CD seven)) (NNS years))
    (ADVP (RB ago)))
  (NP (PRP$ our) (NNS fathers))
  (VP (VBD brought)
    (ADVP (RB forth))
    (PP (IN on)
      (NP (DT this) (NN continent)))
    (, ,)
    (NP
      (NP (DT a) (JJ new) (NN nation))
      (, ,)
      (VP (VBN conceived)
        (PP (IN in)
          (NP (NNP Liberty))))
      (, ,)
      (CC and)
      (VP (VBN dedicated)
        (PP (IN to)
          (NP
            (NP (DT the) (NN proposition))
            (SBAR (IN that)
              (S
                (NP (DT all) (NNS men))
                (VP (VBP are)
                  (VP (VBN created)
                    (ADJP (JJ equal)))))))))))))
  (. .))
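
To make the bracket notation less opaque, here is a small sketch of how such a string can be read into a nested structure and queried for phrases of one type. It is written in plain Python so it runs without nltk or benepar; `parse_brackets`, `leaves`, and `phrases` are hypothetical helpers, and the input is one subtree copied from the parse above rather than the full sentence.

```python
import re


def parse_brackets(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                  # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                  # consume ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def phrases(node, target="NP"):
    """Collect the text of every constituent labelled `target`, outermost first."""
    found = [" ".join(leaves(node))] if node[0] == target else []
    for child in node[1]:
        if isinstance(child, tuple):
            found.extend(phrases(child, target))
    return found


# Subtree copied from the parse above
fragment = (
    "(NP "
    "(NP (DT a) (JJ new) (NN nation)) "
    "(, ,) "
    "(VP (VBN conceived) (PP (IN in) (NP (NNP Liberty)))))"
)

for phrase in phrases(parse_brackets(fragment)):
    print(phrase)
```

Unlike base noun chunks, a constituency parse nests noun phrases inside one another, which is why both the full phrase and the inner `a new nation` are reported.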


## Task 5 — Noun phrases (Sentences 1 & 2)
We extract base noun phrases using spaCy’s `noun_chunks`.

In [None]:
# ============================================================
# TASK 5: Extract Noun Phrases — Both Sentences
# ============================================================

# spaCy's noun chunks are base noun phrases (NP) identified
# using dependency parse information
# Each chunk has: .text (the phrase), .root (the head noun),
# .root.dep_ (its dependency label)
print("── Noun Phrases in Sentence 1 ──")
for chunk in doc1.noun_chunks:
    print(f"  - {chunk.text}")

print("\n── Noun Phrases in Sentence 2 ──")
for chunk in doc2.noun_chunks:
    print(f"  - {chunk.text}")


── Noun Phrases in Sentence 1 ──
  - Four score and seven years ago
  - our fathers
  - this continent
  - a new nation
  - Liberty
  - the proposition
  - all men

── Noun Phrases in Sentence 2 ──
  - we
  - a great civil war
  - that nation
  - any nation
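
For intuition about what a "base" noun phrase is, the sketch below groups tokens with a simple tag pattern: optional determiner, pronoun, adjective, or numeral modifiers followed by one or more nouns. This is a deliberately simplified stand-in and not how spaCy's `noun_chunks` works (spaCy uses the dependency parse), so its results differ: it splits `Four score and seven years ago` at the conjunction and drops bare pronouns such as `we`. `chunk_noun_phrases` is a hypothetical helper, and the tagged tokens are the ones visible in the Task 2 output.

```python
MODIFIERS = {"DET", "PRON", "ADJ", "NUM"}
NOUNS = {"NOUN", "PROPN"}


def chunk_noun_phrases(tagged):
    """Group (word, coarse POS) pairs into modifier* noun+ sequences."""
    chunks, current, has_noun = [], [], False

    def flush():
        nonlocal current, has_noun
        if has_noun:                      # keep only groups that contain a noun
            chunks.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag in NOUNS:
            current.append(word)
            has_noun = True
        elif tag in MODIFIERS and not has_noun:
            current.append(word)          # modifier before the noun
        else:
            flush()                       # anything else ends the current group
            if tag in MODIFIERS:
                current.append(word)      # a modifier may start the next group
    flush()
    return chunks


tagged = [
    ("Four", "NUM"), ("score", "NOUN"), ("and", "CCONJ"), ("seven", "NUM"),
    ("years", "NOUN"), ("ago", "ADV"), ("our", "PRON"), ("fathers", "NOUN"),
    ("brought", "VERB"), ("forth", "ADV"), ("on", "ADP"), ("this", "DET"),
    ("continent", "NOUN"), (",", "PUNCT"), ("a", "DET"), ("new", "ADJ"),
    ("nation", "NOUN"), (",", "PUNCT"), ("conceived", "VERB"),
]

for chunk in chunk_noun_phrases(tagged):
    print(f"  - {chunk}")
```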


## Task 6 — CRF vs HMM (≤ 50 words)

In [None]:
summary = """HMM is a generative sequence model over hidden states with transition and emission probabilities; it assumes each observation depends only on its current state. CRF is a discriminative model that directly models P(labels|observations) and supports rich, overlapping features, often giving better accuracy for NER and POS tagging."""
print(summary)
print("\nWord count:", len(summary.split()))


HMM is a generative sequence model over hidden states with transition and emission probabilities; it assumes each observation depends only on its current state. CRF is a discriminative model that directly models P(labels|observations) and supports rich, overlapping features, often giving better accuracy for NER and POS tagging.

Word count: 47
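
The HMM half of the comparison can be made concrete with a toy Viterbi decoder, which finds the most probable tag sequence from start, transition, and emission probabilities. This is a minimal sketch: the two-tag model and every probability below are made-up illustration values, not estimates from any corpus, and `viterbi` is a hypothetical helper rather than a library function.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best state path, its joint probability) under an HMM."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob


# Hypothetical two-tag model (all numbers invented for illustration)
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {
    "NOUN": {"fathers": 0.9, "brought": 0.1},
    "VERB": {"fathers": 0.2, "brought": 0.8},
}

path, prob = viterbi(["fathers", "brought"], states, start_p, trans_p, emit_p)
print(path, round(prob, 4))
```

Each emission here is looked up from the current state and word alone. A CRF would instead score the whole label sequence against arbitrary features of the input, such as suffixes, capitalisation, or neighbouring words.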
