<a href="https://colab.research.google.com/github/giggsy1106/DATA-622-NLP-/blob/main/NLPHW4_KOTA_FIXED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA 622 — Homework 4 (NLP)
**Student:** Rahul Reddy Kota  

This notebook completes the following NLP tasks using **spaCy** and **benepar** (constituency parsing):
1. POS tagging (Sentence 1)
2. Dependency parsing (Sentence 2)
3. Constituency parsing (Sentence 1)
4. Noun phrase extraction (Sentences 1 & 2)
5. Short written comparison: **CRF vs HMM**



In [4]:
# (Colab/First-time setup) Install compatible versions for benepar
# After running, RESTART runtime/kernel before continuing.

!pip -q uninstall -y transformers tokenizers
!pip -q install transformers==4.17.0 benepar spacy nltk
!python -m spacy download en_core_web_md
!python -c "import benepar; benepar.download('benepar_en3')"


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 5.2.3 requires transformers<6.0.0,>=4.41.0, but you have transformers 4.17.0 which is incompatible.
Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
[nltk_data] Download
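
Because the install cell pins `transformers` and asks for a runtime restart, it can help to confirm which versions are actually active afterwards. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is ours and is not part of pip, spaCy, or benepar.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names are taken from the install cell above.
report = installed_versions(["transformers", "benepar", "spacy", "nltk"])
for name, version in report.items():
    print(f"{name:<15} {version or 'not installed'}")
```

If `transformers` does not show the pinned version after a restart, the pin was probably overridden by another install in the same runtime.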

## Step 1 — Load libraries, models, and input sentences

In [5]:
# ── Step 2: Imports ────────────────────────────────────────
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

import spacy
import benepar
import nltk
from nltk import Tree

# ── Safe model download guard ────────────────────────────────
benepar.download('benepar_en3')  # skips if already downloaded

# ── Step 3: Load spaCy model and add benepar to pipeline ────
nlp = spacy.load('en_core_web_md')
if 'benepar' not in nlp.pipe_names:
    nlp.add_pipe('benepar', config={'model': 'benepar_en3'})

# ── Step 4: Define source text ───────────────────────────────
sent1 = 'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'
sent2 = 'Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.'

doc1 = nlp(sent1)
doc2 = nlp(sent2)

# benepar attaches its parse to sentence spans, so keep the first sentence of each doc
s1 = list(doc1.sents)[0]
s2 = list(doc2.sents)[0]
print('Models loaded successfully.')

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


TypeError: Couldn't build proto file into descriptor pool: duplicate file name sentencepiece_model.proto

## Task 2 — POS tagging (Sentence 1)
Below we print each token, its coarse POS tag, and spaCy’s human-readable explanation.

In [None]:
# ============================================================
# TASK 2: Part-of-Speech (POS) Tagging — Sentence 1 only
# ============================================================

# spaCy assigns POS tags based on the token's grammatical role
# token.pos_ gives the coarse-grained tag (NOUN, VERB, ADJ, etc.)
# spacy.explain() gives a human-readable description of the tag
print(f"{'Token':<25} {'POS Tag':<10} {'Description'}")
print("-" * 55)
for token in doc1:
    print(f"{token.text:<25} {token.pos_:<10} {spacy.explain(token.pos_)}")

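A per-token table is easier to summarise once the tags are tallied. The sketch below counts coarse tags over any objects that expose a `pos_` attribute. To keep it runnable without a model, it uses stand-in tokens whose tags were assigned by hand; they are illustrative and are not spaCy output. The helper name `pos_counts` is ours.

```python
from collections import Counter
from types import SimpleNamespace


def pos_counts(tokens):
    """Count coarse POS tags over objects exposing a .pos_ attribute."""
    return Counter(tok.pos_ for tok in tokens)


# Stand-in tokens with hand-assigned tags (NOT spaCy output).
demo = [
    SimpleNamespace(text=text, pos_=pos)
    for text, pos in [
        ("our", "PRON"), ("fathers", "NOUN"), ("brought", "VERB"),
        ("forth", "ADV"), ("a", "DET"), ("new", "ADJ"), ("nation", "NOUN"),
    ]
]

counts = pos_counts(demo)
for tag, n in counts.most_common():
    print(f"{tag:<6} {n}")
```

Since spaCy tokens also expose `pos_`, calling `pos_counts(doc1)` in this notebook should work unchanged once the models load.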

## Task 3 — Dependency parsing (Sentence 2)
Below we show each token’s dependency relation and its head (governor) token.

In [None]:
# ============================================================
# TASK 3: Dependency Parsing — Sentence 2 only
# ============================================================

# Dependency parsing identifies grammatical relationships between tokens
# token.dep_ = the dependency label (e.g., 'nsubj', 'dobj', 'prep')
# token.head = the governor/head word this token is attached to
print(f"{'Token':<25} {'Dependency':<20} {'Head Word'}")
print("-" * 60)
for token in doc2:
    print(f"{token.text:<25} {token.dep_:<20} {token.head.text}")

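Each token has exactly one head, so following `head` links from any token eventually reaches the root of the sentence. The sketch below illustrates that walk with a small stand-in class, assuming (as spaCy does) that the root token is its own head. The `Tok` class, the `path_to_root` helper, and the hand-assigned labels are ours for illustration and are not parser output.

```python
class Tok:
    """Minimal stand-in for a spaCy token: text, dep_ label, and head."""

    def __init__(self, text, dep_):
        self.text = text
        self.dep_ = dep_
        self.head = self  # by convention, the root is its own head


def path_to_root(token):
    """Follow .head links from a token up to the root of the parse."""
    path = [token.text]
    while token.head != token:
        token = token.head
        path.append(token.text)
    return path


# Hand-assigned fragment of Sentence 2 (NOT parser output).
engaged = Tok("engaged", "ROOT")
in_ = Tok("in", "prep")
war = Tok("war", "pobj")
civil = Tok("civil", "amod")
in_.head = engaged
war.head = in_
civil.head = war

print(" -> ".join(path_to_root(civil)))
```

The same function can be tried on real tokens, for example `path_to_root(doc2[7])`, to see how deeply a word is embedded.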

## Task 4 — Constituency parsing (Sentence 1)
Benepar produces a Penn Treebank-style constituency parse. We render it using `nltk.Tree`.

In [None]:
# ============================================================
# TASK 4: Constituent (Phrase Structure) Parsing — Sentence 1
# ============================================================

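# benepar stores each sentence's parse as a bracketed string
parse_string = s1._.parse_string
tree = Tree.fromstring(parse_string)

print('── Constituency Parse Tree (Sentence 1) ──')
print(parse_string)  # print bracket notation (always works in Colab)
tree.pretty_print()  # ASCII rendering of the same tree via nltk.Tree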

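Bracket notation encodes every constituent as `(LABEL children...)`, so phrases of a given type can be read off with a single pass and a stack. The sketch below does this in plain Python, without benepar or nltk. The helper name `phrases` is ours, and the example string is hand-written in Penn Treebank style; it is not actual benepar output.

```python
import re


def phrases(parse_string, label):
    """Return the word sequence under every constituent with the given label."""
    tokens = re.findall(r"\(|\)|[^\s()]+", parse_string)
    stack = []   # open constituents: (label, number of words seen when opened)
    words = []   # leaf words seen so far
    found = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token right after "(" is the constituent label.
            stack.append((tokens[i + 1], len(words)))
            i += 2
            continue
        if tok == ")":
            lab, start = stack.pop()
            if lab == label:
                found.append(" ".join(words[start:]))
        else:
            words.append(tok)
        i += 1
    return found


# Hand-written example in Penn Treebank style (NOT benepar output).
example = (
    "(S (NP (PRP$ our) (NNS fathers)) "
    "(VP (VBD brought) (ADVP (RB forth)) (NP (DT a) (JJ new) (NN nation))))"
)
print(phrases(example, "NP"))
print(phrases(example, "VP"))
```

Nested constituents are reported innermost first, because a phrase is only recorded when its closing bracket is reached.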

## Task 5 — Noun phrases (Sentences 1 & 2)
We extract base noun phrases using spaCy’s `noun_chunks`.

In [None]:
# ============================================================
# TASK 5: Extract Noun Phrases — Both Sentences
# ============================================================

# spaCy's noun chunks are base noun phrases (NP) identified
# using dependency parse information
# Each chunk has: .text (the phrase), .root (the head noun),
# .root.dep_ (its dependency label)
print("── Noun Phrases in Sentence 1 ──")
for chunk in doc1.noun_chunks:
    print(f"  - {chunk.text}")

print("\n── Noun Phrases in Sentence 2 ──")
for chunk in doc2.noun_chunks:
    print(f"  - {chunk.text}")


## Task 6 — CRF vs HMM (≤ 50 words)

In [None]:
summary = """An HMM is a generative sequence model with hidden states, transition probabilities, and emission probabilities; it assumes each observation depends only on its current state. A CRF is a discriminative model that directly models P(labels|observations) and supports rich, overlapping features, often giving better accuracy for NER and POS tagging."""
print(summary)
print("\nWord count:", len(summary.split()))
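
To make the HMM side of the comparison concrete, here is a minimal sketch of Viterbi decoding for a two-state tagger. All probabilities are made-up toy values chosen for illustration, not estimates from any corpus, and the function name `viterbi` and its argument names are ours.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Most likely state sequence for `words` under a first-order HMM."""
    # best[t][s] = (probability of the best path ending in state s, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the best predecessor state r for landing in s.
            p, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s].get(word, 0.0), back)
        best.append(column)

    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy parameters (assumed values, for illustration only).
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {
    "NOUN": {"fathers": 0.5, "nation": 0.4, "endure": 0.1},
    "VERB": {"endure": 0.7, "brought": 0.3},
}

tags = viterbi(["fathers", "brought", "nation"], states, start_p, trans_p, emit_p)
print(tags)
```

Each word's score here comes only from one emission and one transition probability. A linear-chain CRF can be decoded with the same dynamic program, but it would replace those products with learned scores over arbitrary, overlapping features of the whole input.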
