# POS Tagging, Chunking, NER and N‑gram Language Model in Python

This notebook demonstrates:
1. **POS Tagging** (Part-of-Speech tagging)
2. **Syntactic Chunking** (e.g., noun phrase chunks)
3. **Named Entity Recognition (NER)**
4. **N‑gram Language Model** (unigram, bigram, trigram)

We use the **NLTK** library for tokenization, POS tagging, chunking and NER. The N‑gram model is written in plain Python on top of NLTK's tokenizer.

In [9]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [10]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import RegexpParser


## 1. POS Tagging

In this section we:
1. Take an input sentence.
2. Tokenize it into words.
3. Apply NLTK's `pos_tag` to get part‑of‑speech tags.

In [12]:
import nltk

# Recent NLTK releases look for the English tagger under this resource name;
# the tokenizer and NER resources were already downloaded in the first cell.
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [13]:
# Example sentence for POS tagging
sentence = "John bought a new laptop from the market yesterday."

# Step 1: Tokenize
tokens = word_tokenize(sentence)
print("Tokens:")
print(tokens)

# Step 2: POS Tagging
pos_tags = pos_tag(tokens)
print("\nPOS Tags (word, tag):")
print(pos_tags)


Tokens:
['John', 'bought', 'a', 'new', 'laptop', 'from', 'the', 'market', 'yesterday', '.']

POS Tags (word, tag):
[('John', 'NNP'), ('bought', 'VBD'), ('a', 'DT'), ('new', 'JJ'), ('laptop', 'NN'), ('from', 'IN'), ('the', 'DT'), ('market', 'NN'), ('yesterday', 'NN'), ('.', '.')]
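
The tags in this output follow the Penn Treebank tagset. As a reading aid, the sketch below attaches a short gloss to each tag that appears above. `TAG_GLOSSARY` and `describe_tags` are illustrative names, not part of NLTK, and the tagged tokens are copied from the output so the snippet runs without NLTK data. Note that `yesterday` is tagged `NN` even though it acts as a time adverbial in this sentence.

```python
# Glosses for the Penn Treebank tags seen in the output above.
# TAG_GLOSSARY and describe_tags are illustrative helpers, not NLTK APIs.
TAG_GLOSSARY = {
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    ".": "sentence-final punctuation",
}

def describe_tags(tagged_tokens):
    """Return (word, tag, gloss) triples; tags missing from the glossary get '?'."""
    return [(word, tag, TAG_GLOSSARY.get(tag, "?")) for word, tag in tagged_tokens]

# Tagged tokens copied from the notebook output.
pos_tags_example = [
    ('John', 'NNP'), ('bought', 'VBD'), ('a', 'DT'), ('new', 'JJ'),
    ('laptop', 'NN'), ('from', 'IN'), ('the', 'DT'), ('market', 'NN'),
    ('yesterday', 'NN'), ('.', '.'),
]

for word, tag, gloss in describe_tags(pos_tags_example):
    print(f"{word:<10} {tag:<4} {gloss}")
```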


## 2. Chunking (Shallow Parsing)

Chunking groups adjacent tagged words into flat, non-overlapping **phrases** such as noun phrases (NP), without building a full parse tree.
We define a simple chunk grammar over POS tags and apply it using NLTK's `RegexpParser`.

In [14]:
# Define a simple chunk grammar for Noun Phrases (NP)
# DT = determiner, JJ = adjective, NN/NNS/NNP/NNPS = noun
chunk_grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
"""

chunk_parser = RegexpParser(chunk_grammar)

# Use the POS‑tagged sentence from above
chunk_tree = chunk_parser.parse(pos_tags)
print(chunk_tree)

# If you run this in a local Jupyter environment with a GUI, you can visualize the tree as:
# chunk_tree.draw()


(S
  (NP John/NNP)
  bought/VBD
  (NP a/DT new/JJ laptop/NN)
  from/IN
  (NP the/DT market/NN yesterday/NN)
  ./.)
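
To see why `yesterday` ends up inside `(NP the/DT market/NN yesterday/NN)`, it helps to read the rule `<DT>?<JJ>*<NN.*>+` as an ordinary regular expression over the tag sequence: an optional determiner, any number of adjectives, then one or more noun tags. Because the tagger labelled `yesterday` as `NN`, the `+` simply keeps consuming it. The sketch below imitates this with Python's `re` module. It approximates what `RegexpParser` does and is not its actual implementation; `np_chunks` is a hypothetical helper.

```python
import re

def np_chunks(tagged_tokens):
    """Return NP chunks (lists of words) for the rule <DT>?<JJ>*<NN.*>+."""
    # Encode the tag sequence as one string, e.g. "<NNP><VBD><DT>...",
    # remembering where each token's tag starts.
    tag_string = ""
    starts = []
    for _, tag in tagged_tokens:
        starts.append(len(tag_string))
        tag_string += f"<{tag}>"

    # <NN[^>]*> stands in for the grammar's <NN.*>: any tag beginning with NN.
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+")

    chunks = []
    for match in pattern.finditer(tag_string):
        # Collect the words whose tags fall inside this match.
        words = [word for (word, _), start in zip(tagged_tokens, starts)
                 if match.start() <= start < match.end()]
        chunks.append(words)
    return chunks

# Tagged tokens copied from the POS tagging output, so no NLTK data is needed.
tagged = [
    ('John', 'NNP'), ('bought', 'VBD'), ('a', 'DT'), ('new', 'JJ'),
    ('laptop', 'NN'), ('from', 'IN'), ('the', 'DT'), ('market', 'NN'),
    ('yesterday', 'NN'), ('.', '.'),
]

print(np_chunks(tagged))
# [['John'], ['a', 'new', 'laptop'], ['the', 'market', 'yesterday']]
```

The three chunks match the NP subtrees printed above, including the over-long last one.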


## 3. Named Entity Recognition (NER)

We use NLTK's built‑in `ne_chunk`, which performs NER on POS‑tagged tokens.
It can recognize entity types such as **PERSON**, **ORGANIZATION** and **GPE** (Geo‑Political Entity).

In [16]:
import nltk

# Recent NLTK releases load the NE chunker from this resource;
# the tokenizer, tagger and word list were downloaded in earlier cells.
nltk.download('maxent_ne_chunker_tab')


[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True

In [17]:
# Example sentence for NER
sentence_ner = "Barack Obama was the 44th President of the United States of America."
tokens_ner = word_tokenize(sentence_ner)
pos_tags_ner = pos_tag(tokens_ner)

print("Tokens:")
print(tokens_ner)
print("\nPOS Tags:")
print(pos_tags_ner)

# Perform NER
ner_tree = ne_chunk(pos_tags_ner)
print("\nNamed Entity Tree:")
print(ner_tree)

# You can also visualize with ner_tree.draw() in a local environment
# ner_tree.draw()


Tokens:
['Barack', 'Obama', 'was', 'the', '44th', 'President', 'of', 'the', 'United', 'States', 'of', 'America', '.']

POS Tags:
[('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('the', 'DT'), ('44th', 'JJ'), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('of', 'IN'), ('America', 'NNP'), ('.', '.')]

Named Entity Tree:
(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  the/DT
  44th/JJ
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  of/IN
  (GPE America/NNP)
  ./.)
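
In the tree above, `Barack` and `Obama` come out as two separate `PERSON` chunks, and `United States of America` is split into two `GPE` chunks around `of`. One possible post-processing step, sketched below, merges directly adjacent chunks that share a label. This rests on the assumption that such neighbours belong to one entity, which can over-merge (for example two people named back to back). The tree is flattened by hand from the printed output so the snippet runs without NLTK; `flat_tree` and `merge_adjacent_entities` are hypothetical names.

```python
# The printed tree, flattened by hand: entity chunks are (label, [words]),
# tokens outside any entity are plain strings.
flat_tree = [
    ("PERSON", ["Barack"]),
    ("PERSON", ["Obama"]),
    "was", "the", "44th", "President", "of", "the",
    ("GPE", ["United", "States"]),
    "of",
    ("GPE", ["America"]),
    ".",
]

def merge_adjacent_entities(items):
    """Merge neighbouring entity chunks that carry the same label."""
    merged = []
    for item in items:
        prev = merged[-1] if merged else None
        if (isinstance(item, tuple) and isinstance(prev, tuple)
                and prev[0] == item[0]):
            # Same label as the previous chunk: extend it with the new words.
            merged[-1] = (prev[0], prev[1] + item[1])
        else:
            merged.append(item)
    return merged

entities = [(item[0], " ".join(item[1]))
            for item in merge_adjacent_entities(flat_tree)
            if isinstance(item, tuple)]
print(entities)
# [('PERSON', 'Barack Obama'), ('GPE', 'United States'), ('GPE', 'America')]
```

The two `GPE` chunks stay separate because `of` sits between them; joining them would need a rule beyond plain adjacency.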


## 4. N‑gram Language Model

An **N‑gram language model** estimates the probability of a word given the previous *(N‑1)* words.

We will:
1. Create a small corpus of sentences.
2. Tokenize and add start `<s>` and end `</s>` markers.
3. Build N‑gram counts (**bigram** and **trigram** below; the same function gives **unigram** counts with `n=1`).
4. Convert counts to probabilities.
5. Generate text from the bigram/trigram model.

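This is a simple **Maximum Likelihood Estimation (MLE)** model without smoothing: each probability is a relative frequency, so any N‑gram that never occurs in the corpus gets probability zero.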
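
For a bigram model, the MLE estimate is `P(w | prev) = count(prev, w) / count(prev)`. The following is a minimal sketch on a hypothetical three-sentence toy corpus (not the corpus used in this notebook), with `str.split` standing in for `word_tokenize` so it runs without NLTK. The names `toy_corpus` and `mle` are illustrative.

```python
from collections import Counter

# Hypothetical toy corpus, chosen only to make the arithmetic easy to follow.
toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]

pair_counts = Counter()     # count(prev, w)
context_counts = Counter()  # count(prev), counted as "prev followed by something"

for sentence in toy_corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        pair_counts[(prev, word)] += 1
        context_counts[prev] += 1

def mle(prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / context_counts[prev]

print(mle("the", "cat"))  # "the" is followed by "cat" in 2 of 3 cases
print(mle("cat", "sat"))  # "cat" is followed by "sat" in 1 of 2 cases
print(mle("dog", "ran"))  # never observed, so the estimate is 0.0
```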

In [18]:
# Small training corpus (you can replace these with any sentences)
corpus_sentences = [
    "John likes to watch movies",
    "Mary likes movies too",
    "John also likes to watch football",
    "Mary enjoys football and movies"
]

def prepare_corpus(sentences):
    """Tokenize each sentence and add <s> and </s> markers."""
    tokenized_sentences = []
    for sent in sentences:
        tokens = word_tokenize(sent.lower())
        # Add start and end tokens
        tokenized_sentences.append(["<s>"] + tokens + ["</s>"])
    return tokenized_sentences

tokenized_corpus = prepare_corpus(corpus_sentences)
for ts in tokenized_corpus:
    print(ts)


['<s>', 'john', 'likes', 'to', 'watch', 'movies', '</s>']
['<s>', 'mary', 'likes', 'movies', 'too', '</s>']
['<s>', 'john', 'also', 'likes', 'to', 'watch', 'football', '</s>']
['<s>', 'mary', 'enjoys', 'football', 'and', 'movies', '</s>']


In [19]:
from collections import defaultdict, Counter

def build_ngram_counts(tokenized_sentences, n=2):
    """Build n‑gram counts: context (tuple) -> Counter of next words."""
    counts = defaultdict(Counter)
    for sent in tokenized_sentences:
        if len(sent) < n:
            continue
        for i in range(len(sent) - n + 1):
            ngram = sent[i:i + n]
            context = tuple(ngram[:-1])  # previous n-1 words
            next_word = ngram[-1]
            counts[context][next_word] += 1
    return counts

def counts_to_probabilities(counts):
    """Convert counts to probability distributions."""
    probs = {}
    for context, counter in counts.items():
        total = float(sum(counter.values()))
        probs[context] = {w: c / total for w, c in counter.items()}
    return probs

# Build bigram (n=2) and trigram (n=3) models
bigram_counts = build_ngram_counts(tokenized_corpus, n=2)
bigram_probs = counts_to_probabilities(bigram_counts)

trigram_counts = build_ngram_counts(tokenized_corpus, n=3)
trigram_probs = counts_to_probabilities(trigram_counts)

print("Bigram probabilities for context ('likes',):")
print(bigram_probs.get(("likes",), {}))


Bigram probabilities for context ('likes',):
{'to': 0.6666666666666666, 'movies': 0.3333333333333333}
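
These conditional probabilities can be chained to score a whole sentence: multiply `P(word | previous word)` over every position, including the `<s>` and `</s>` markers. The sketch below rebuilds the bigram table from the same four sentences, using `str.split` in place of `word_tokenize` (an assumption that holds here because the sentences contain no punctuation). `sentence_probability` is a hypothetical helper.

```python
from collections import Counter, defaultdict

corpus = [
    "John likes to watch movies",
    "Mary likes movies too",
    "John also likes to watch football",
    "Mary enjoys football and movies",
]

# Bigram counts: context tuple -> Counter of next words.
counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        counts[(prev,)][word] += 1

# Relative frequencies per context (MLE, no smoothing).
probs = {ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
         for ctx, ctr in counts.items()}

def sentence_probability(sentence, probs):
    """Chain-rule probability of a sentence under a bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        # Unseen bigrams contribute a factor of 0.0.
        p *= probs.get((prev,), {}).get(word, 0.0)
    return p

# 1/2 * 1/2 * 1/3 * 2/3 = 1/18
print(sentence_probability("John likes movies", probs))
# "likes football" never occurs in the corpus, so the product is 0.0
print(sentence_probability("Mary likes football", probs))
```

The second call shows the practical cost of skipping smoothing: one unseen bigram zeroes out an otherwise plausible sentence.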


In [20]:
import random

def generate_from_ngram(probs, n=2, max_length=15, start_token='<s>'):
    """Generate a sentence from an n‑gram model.
    probs: context -> {next_word: prob}
    n: order of the model
    """
    if n == 1:
        # Unigram model: ignore context
        context = ()
    else:
        # Start with <s> repeated n-1 times
        context = tuple([start_token] * (n - 1))

    result = []
    for _ in range(max_length):
        context_probs = probs.get(context, None)
        if not context_probs:
            break
        # Sample next word according to probability distribution
        words = list(context_probs.keys())
        p = list(context_probs.values())
        next_word = random.choices(words, weights=p, k=1)[0]

        if next_word == '</s>':
            break
        result.append(next_word)

        if n > 1:
            # Update context (slide window by 1)
            context = tuple(list(context)[1:] + [next_word])

    return ' '.join(result)

# Example: generate text using bigram model
print("Generated sentence with bigram model:")
print(generate_from_ngram(bigram_probs, n=2))

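# Example: generate text using trigram model
# Note: this prints an empty line. generate_from_ngram seeds the context with
# ('<s>', '<s>'), but prepare_corpus adds only one '<s>' per sentence, so that
# context has no counts and the loop exits before emitting a word.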
print("\nGenerated sentence with trigram model:")
print(generate_from_ngram(trigram_probs, n=3))


Generated sentence with bigram model:
john also likes to watch movies

Generated sentence with trigram model:
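
The trigram output is empty because the generator starts from the context `('<s>', '<s>')`, while each training sentence carries a single `<s>`, so that context is never counted. One way to address this, sketched below, is to pad each sentence with `n - 1` start markers when counting. `build_padded_counts`, `to_probs` and `generate` are hypothetical stand-ins for the notebook's functions, and `str.split` replaces `word_tokenize` so the snippet runs without NLTK.

```python
import random
from collections import Counter, defaultdict

corpus = [
    "John likes to watch movies",
    "Mary likes movies too",
    "John also likes to watch football",
    "Mary enjoys football and movies",
]

def build_padded_counts(sentences, n):
    """Count n-grams with n-1 leading '<s>' markers and one trailing '</s>'."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts

def to_probs(counts):
    """Turn each context's counts into relative frequencies."""
    return {ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for ctx, ctr in counts.items()}

def generate(probs, n, max_length=15, seed=None):
    """Sample words until '</s>', a dead-end context, or max_length."""
    rng = random.Random(seed)
    context = ("<s>",) * (n - 1)
    words = []
    for _ in range(max_length):
        dist = probs.get(context)
        if not dist:
            break
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if word == "</s>":
            break
        words.append(word)
        if n > 1:
            context = context[1:] + (word,)  # slide the window by one word
    return " ".join(words)

trigram_probs_padded = to_probs(build_padded_counts(corpus, 3))
print(trigram_probs_padded[("<s>", "<s>")])  # {'john': 0.5, 'mary': 0.5}
print(generate(trigram_probs_padded, n=3, seed=0))
```

With the padded counts, the start context exists, so the trigram generator always emits at least one word. For `n=2` the padding adds nothing beyond the single `<s>` already used in the notebook, so the bigram probabilities are unchanged.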

