# **NLP Introduction & Text Processing Assignment**

Question 1: What is Computational Linguistics and how does it relate to NLP?
sol) At its core, Computational Linguistics (CL) is a branch of linguistics that uses computer science to understand how language works. It's less about building a "product" and more about using mathematical models to explore the structure, logic, and evolution of human language.

The Goal: To model the "why" and "how" of language.

The Focus: Syntax (sentence structure), semantics (meaning), phonology (sounds), and morphology (word formation).

The Approach: Historically, it relied heavily on rule-based systems (like formal grammars), though it now embraces statistical and neural models to test linguistic theories.

2. How it Relates to NLP
Natural Language Processing (NLP) is the engineering-driven sibling of CL. While CL asks "How does language work?", NLP asks "How can I get a computer to do something useful with this text?"

The relationship is essentially Theory vs. Application:

| Feature | Computational Linguistics (CL) | Natural Language Processing (NLP) |
| :--- | :--- | :--- |
| Primary Goal | Scientific discovery and modeling. | Solving tasks and building tools. |
| Success Metric | Does the model accurately represent human language? | Does the system perform the task efficiently? |
| Typical Tasks | Parsing, discourse analysis, historical linguistics. | Sentiment analysis, translation, chatbots. |
| Key Discipline | Humanities-heavy (Linguistics). | Engineering-heavy (Computer Science). |

3. The "Sweet Spot" Intersection
In the modern era, the line between them has blurred significantly. To build a large language model, you need both:

CL provides the understanding of nuance, context, and grammar.

NLP provides the massive neural networks and processing power to handle billions of words.

Basically, Computational Linguistics provides the blueprint, and NLP builds the house.

Question 2: Briefly describe the historical evolution of Natural Language Processing.
sol) The history of NLP is a fascinating journey from rigid, logic-based rules to the "black box" of modern neural networks. It’s essentially the story of moving from teaching a computer the rules of language to letting the computer experience language.

1. The Era of Rules (1950s – 1980s)
Early NLP was dominated by the belief that language could be solved with pure logic. Scientists tried to hand-code every grammatical rule.

The Turing Test (1950): Alan Turing proposed his famous test for machine intelligence, setting the stage for conversational AI.

The Georgetown-IBM Experiment (1954): A famous (and overly optimistic) demonstration that automatically translated more than sixty Russian sentences into English.

ELIZA (1966): One of the first "chatterbots," which mimicked a therapist by reflecting the user’s words back at them using simple pattern matching.

2. The Statistical Revolution (1990s – 2010s)
By the 90s, researchers realized that language is too messy for fixed rules. Instead of telling the computer how language should work, they started using probability.

Corpus Linguistics: Researchers began using massive "corpora" (text datasets) to calculate the likelihood of certain words appearing together.

Machine Translation: IBM’s statistical models started outperforming rule-based systems by looking for patterns in translated documents (like UN proceedings).

3. The Neural & Deep Learning Wave (2010s – 2017)
The rise of "Deep Learning" changed everything. Instead of human-engineered statistics, we started using Neural Networks inspired by the human brain.

Word Embeddings (Word2Vec): This allowed computers to understand that "king" and "queen" are related by representing words as mathematical vectors in a high-dimensional space.

RNNs and LSTMs: These models were designed to "remember" the beginning of a sentence while reading the end, which is crucial for context.

4. The Transformer Era (2017 – Present)
This is where we are now. The invention of the Transformer architecture (from the paper "Attention Is All You Need") catalyzed a massive leap in capability.

Attention Mechanisms: Instead of reading word-by-word, models can now "look" at an entire paragraph at once to understand context.

Large Language Models (LLMs): Models like GPT-4 and Gemini are trained on web-scale text corpora, which lets them not only process language but also generate fluent text and handle many reasoning-style tasks.

Fun Fact: In the 1950s, experts predicted machine translation would be "solved" within three to five years. It ended up taking closer to seventy!

Question 3: List and explain three major use cases of NLP in today’s tech industry.
sol) In 2026, NLP has moved beyond simple "word processing" to become the cognitive engine behind most modern business operations. Here are three major use cases that have become industry standards:

1. Autonomous AI Agents (Conversational AI 2.0)
While the chatbots of the early 2020s could only answer questions, 2026-era Autonomous Agents can execute multi-step tasks. These systems use NLP to understand complex instructions and "reason" through a workflow.

How it works: An agent doesn't just tell you your flight is delayed; it understands the sentiment of your frustration, looks up alternative flights, checks your calendar for conflicts, and asks if you'd like it to rebook the best option.

Industry Impact: Customer support is shifting from "FAQ bots" to "Resolution bots" that actually perform the work, drastically reducing the need for human intervention in routine logistics.

2. Intelligent Document Processing (IDP)
In heavily regulated industries like Finance, Law, and Healthcare, NLP is used to transform mountains of unstructured text (contracts, medical notes, emails) into structured, actionable data.

How it works: Instead of a human reading 500 pages of a legal merger, NLP models use Named-Entity Recognition (NER) and Summarization to automatically flag high-risk clauses, extract expiration dates, and cross-reference them with current regulations.

Industry Impact: This has automated up to 75% of manual data entry and compliance auditing, allowing professionals to focus on strategy rather than "paper-pushing."

3. Real-Time Multimodal Semantic Search
Search has evolved from matching "keywords" to understanding "intent" across different types of media. In 2026, tech companies use NLP to power search engines that treat text, voice, and video as a single searchable landscape.

How it works: A user can search for a specific moment in a 10-hour video archive by describing the concept (e.g., "The part where the CEO talks about the 2027 sustainability roadmap"). NLP models index the audio transcripts and visual cues to find the exact timestamp.

Industry Impact: This is revolutionizing E-commerce (allowing users to find products via conversational descriptions) and Enterprise Knowledge Management (allowing employees to "ask" their company's internal data for answers).

Summary Table
| Use Case | Core Tech | Primary Benefit |
| :--- | :--- | :--- |
| Autonomous Agents | LLMs & Planning Algorithms | End-to-end task execution. |
| Document Intelligence | NER & Summarization | Speeding up compliance and data entry. |
| Semantic Search | Vector Embeddings | Finding information by meaning, not words. |
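
To make the "meaning, not words" idea in the last row concrete, here is a minimal sketch of embedding-based search. The three-dimensional vectors and document names are made up for illustration; in practice the vectors would come from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written "embeddings" (real ones have hundreds of dimensions).
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "sustainability roadmap": [0.1, 0.9, 0.2],
    "quarterly earnings": [0.2, 0.3, 0.9],
}

def search(query_vec, docs, k=1):
    """Return the names of the k documents closest to the query vector."""
    ranked = sorted(docs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# A query vector that points in roughly the same direction as the roadmap document.
print(search([0.0, 1.0, 0.1], DOCS))
```

The query shares no keywords with any document; it is ranked purely by vector direction, which is the property semantic search relies on.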

Question 4: What is text normalization and why is it essential in text processing tasks?
sol) Think of Text Normalization as the "cleanup crew" of the NLP world. It is the process of transforming raw, messy text into a consistent, standard format that a computer can actually understand and process without getting confused by superficial variations.

1. What exactly is Text Normalization?
Human language is full of "noise": we use capital letters, punctuation, slang, and different tenses that all point to the same core idea. Text normalization strips away this noise to reveal the underlying meaning.

It typically involves several key steps:

Tokenization: Breaking a sentence into individual words or "tokens."

Case Folding: Converting everything to lowercase so that "Apple," "APPLE," and "apple" are treated as the same word.

Stopword Removal: Filtering out common words that don't add much meaning (like "the," "is," and "at").

Stemming & Lemmatization: Reducing words to their root form. For example, "running," "ran," and "runs" all become "run."

Noise Removal: Stripping out HTML tags, special characters, or emojis if they aren't relevant to the analysis.

2. Why is it Essential?
Without normalization, a computer sees the world in a very fragmented way. Here is why we can't skip it:

A. Reducing Dimensionality
If a model treats "Go," "go," and "going" as three entirely different concepts, your "vocabulary" becomes unnecessarily massive. By normalizing them to a single root, you make the data more dense and the model more efficient.

B. Improving Consistency
Imagine you are building a search engine. If a user searches for "running shoes" but your database only has the phrase "run shoe," a system without normalization might fail to find a match. Normalization ensures the intent matches the data.

C. Boosting Accuracy
In tasks like Sentiment Analysis, keeping punctuation or extra spaces can confuse a model. Normalization ensures the machine focuses on the "signal" (the words that carry emotion) rather than the "noise" (the formatting).

3. A Quick Comparison
Look at how a single sentence changes after a standard normalization pipeline:

| Step | Result |
| :--- | :--- |
| Original Text | "The Quick Brown Foxes are jumping over 2 Lazy Dogs!!" |
| Lowercasing | "the quick brown foxes are jumping over 2 lazy dogs!!" |
| Punctuation Removal | "the quick brown foxes are jumping over 2 lazy dogs" |
| Stopword Removal | "quick brown foxes jumping 2 lazy dogs" |
| Lemmatization | "quick brown fox jump 2 lazy dog" |

Pro Tip: In modern LLMs (like GPT-4), we actually do less aggressive normalization than we used to, because these massive models are smart enough to understand that "Foxes" and "fox" are related. However, for traditional machine learning and search indexing, it remains a critical foundation.
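
The pipeline above can be sketched in a few lines of plain Python. This is only an illustration: the stopword set and the lemma lookup table are tiny hand-written stand-ins (assumptions), not a real stopword list or lemmatizer such as those in NLTK or spaCy.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"the", "are", "over", "is", "a", "an", "at"}
LEMMAS = {"foxes": "fox", "jumping": "jump", "dogs": "dog"}

def normalize(text):
    """Lowercase, strip punctuation, drop stopwords, then map words to lemmas."""
    text = text.lower()                       # case folding
    text = re.sub(r"[^\w\s]", "", text)       # punctuation removal
    tokens = text.split()                     # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens] # dictionary-style lemmatization

print(normalize("The Quick Brown Foxes are jumping over 2 Lazy Dogs!!"))
# ['quick', 'brown', 'fox', 'jump', '2', 'lazy', 'dog']
```

Each line of the function corresponds to one row of the comparison, so the intermediate results can be inspected by printing after any step.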


Question 5: Compare and contrast stemming and lemmatization with suitable
examples.
sol) While both stemming and lemmatization aim to reduce a word to its base form, they go about it in very different ways. One is a "hacker" that chops off the ends of words, while the other is a "linguist" that understands the context and dictionary definition.

1. Stemming: The "Chop-and-Drop" Approach
Stemming is a heuristic process that lops off the ends of words in the hope of reaching a common root. It follows a set of crude rules (like the famous Porter Stemmer) and doesn't actually "know" the language.

How it works: It uses algorithms to remove common suffixes like "-ing," "-ed," or "-ies."

Pros: It is extremely fast and requires very little memory.

Cons: It often results in "non-words" that don't exist in the dictionary (e.g., "univers").

Example: A naive rule that simply strips "-ing" turns "Caring" into "Car". (Note: This is an error, as "caring" relates to "care," not "car." Real stemmers such as Porter add repair rules for some cases like this, but they still produce fragments such as "studi" for "studies.")

2. Lemmatization: The "Dictionary" Approach
Lemmatization is a more sophisticated process that uses a vocabulary and morphological analysis to return the lemma (the dictionary form of a word).

How it works: It looks at the Part of Speech (POS) of a word. It knows that "saw" could be a tool (noun) or the past tense of "see" (verb).

Pros: It is highly accurate and always returns a valid word.

Cons: It is computationally "expensive" (slower) because it has to look up words in a large database like WordNet.

Example: The word "Caring" becomes "Care". (Correct linguistic root.)

3. Side-by-Side Comparison

| Feature | Stemming | Lemmatization |
| :--- | :--- | :--- |
| Logic | Rule-based (suffix stripping). | Morphological analysis (dictionary lookup). |
| Speed | Very Fast. | Slower (requires more processing). |
| Output | Can be non-dictionary fragments. | Always a valid, meaningful word. |
| Context | Ignores context. | Considers Part of Speech (POS tag). |
| Best Use Case | Large-scale search indexing. | Chatbots, Translation, Sentiment Analysis. |

4. Visualizing the Difference
To truly see the contrast, look at how both methods handle the irregular verb "Went":

Stemming: Returns "Went". (It doesn't see a suffix to chop, so it does nothing.)

Lemmatization: Returns "Go". (It recognizes that "went" is the past tense of "go.")

Another classic example is "Better":

Stemming: Returns "Better".

Lemmatization: Returns "Good".

The Verdict: If you need speed and are working with massive datasets where perfect grammar doesn't matter (like a basic search engine), use Stemming. If you are building a sophisticated AI that needs to understand the actual meaning of a sentence, use Lemmatization.
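
The contrast can be sketched with two toy functions. Both are assumptions made for illustration: `naive_stem` is a bare suffix stripper (not the Porter algorithm), and `toy_lemmatize` uses a hand-written lookup table in place of a real resource such as WordNet.

```python
def naive_stem(word):
    """Strip the first matching suffix; no knowledge of the language."""
    word = word.lower()
    for suffix in ("ing", "ed", "ies", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("caring", "VERB"): "care",
    ("went", "VERB"): "go",
    ("better", "ADJ"): "good",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def toy_lemmatize(word, pos):
    """Look the word up together with its part of speech; fall back to the word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

for word, pos in [("Caring", "VERB"), ("Went", "VERB"), ("Better", "ADJ")]:
    print(f"{word:<8} stem={naive_stem(word):<8} lemma={toy_lemmatize(word, pos)}")
```

Note how the part-of-speech argument is what lets the lookup distinguish "saw" the tool from "saw" the past tense of "see"; the suffix stripper has no way to express that.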

Question 6: Write a Python program that uses regular expressions (regex) to extract all
email addresses from the following block of text:
“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”


In [1]:
import re

# The block of text
text = """
Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at
john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz.
"""

# Define the regex pattern for a standard email address
# This pattern matches:
# - Alphanumeric characters, dots, underscores, plus signs, dashes before the @
# - Domain name
# - Top-level domain (e.g., .com, .org, .co.us)
email_pattern = r'[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Use re.findall to extract all matches
emails = re.findall(email_pattern, text)

# Print the results
print("Extracted Email Addresses:")
for email in emails:
    print(email)

Extracted Email Addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz


Question 7: Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:
“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”



In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string

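# Download necessary NLTK data
# (newer NLTK releases load the tokenizer from 'punkt_tab'; older ones use 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')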

# The sample paragraph
text = """
Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.
"""

# 1. Tokenization: Convert text to lowercase and split into words
# We use .lower() to ensure 'Natural' and 'natural' are treated the same.
tokens = word_tokenize(text.lower())

# 2. Filtering: Remove punctuation and stop words (common words like 'is', 'the')
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

filtered_tokens = [
    word for word in tokens
    if word not in stop_words and word not in punctuation
]

# 3. Frequency Distribution
frequency_dist = FreqDist(filtered_tokens)

# Print the top 5 most common words
print("Top 5 Most Frequent Words:")
for word, frequency in frequency_dist.most_common(5):
    print(f"{word}: {frequency}")

# 4. Optional: Print the full distribution
# print("\nFull Frequency Distribution:")
# print(frequency_dist.items())
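
For readers without the NLTK data installed, the same idea can be sketched with only the standard library. This is an approximation, not the NLTK pipeline: the regex tokenizer and the short stopword set below are assumptions chosen for this paragraph, and `collections.Counter` stands in for `FreqDist`, which behaves much like it.

```python
import re
from collections import Counter

paragraph = (
    "Natural Language Processing (NLP) is a fascinating field that combines linguistics, "
    "computer science, and artificial intelligence. It enables machines to understand, "
    "interpret, and generate human language. Applications of NLP include chatbots, "
    "sentiment analysis, and machine translation. As technology advances, the role of NLP "
    "in modern solutions is becoming increasingly critical."
)

# Hypothetical mini stopword list covering the function words in this paragraph.
STOPWORDS = {"is", "a", "that", "and", "it", "to", "of", "as", "the", "in"}

# Tokenize: lowercase, then keep runs of letters (drops punctuation and brackets).
tokens = re.findall(r"[a-z]+", paragraph.lower())
filtered = [t for t in tokens if t not in STOPWORDS]

counts = Counter(filtered)
for word, freq in counts.most_common(2):
    print(f"{word}: {freq}")
# nlp: 3
# language: 2
```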

Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels
proper nouns in a given text.


In [3]:
import spacy
from spacy.language import Language
from spacy.tokens import Token

# 1. Register a custom extension on the Token class
# This allows us to store our own 'is_proper_noun' property on every token
Token.set_extension("is_proper_noun", default=False, force=True)

# 2. Define the Custom Component
@Language.component("proper_noun_annotator")
def proper_noun_annotator(doc):
    # Iterate through every token in the document
    for token in doc:
        # 'PROPN' is the standard spaCy tag for Proper Nouns
        if token.pos_ == "PROPN":
            token._.is_proper_noun = True
    return doc

# 3. Load the model and add our component to the pipeline
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("proper_noun_annotator", last=True)

# 4. Test the annotator
text = "Google was founded by Larry Page and Sergey Brin in California."
doc = nlp(text)

print(f"{'Token':<15} | {'POS':<6} | {'Custom Annotator'}")
print("-" * 40)

for token in doc:
    print(f"{token.text:<15} | {token.pos_:<6} | {token._.is_proper_noun}")

Token           | POS    | Custom Annotator
----------------------------------------
Google          | PROPN  | True
was             | AUX    | False
founded         | VERB   | False
by              | ADP    | False
Larry           | PROPN  | True
Page            | PROPN  | True
and             | CCONJ  | False
Sergey          | PROPN  | True
Brin            | PROPN  | True
in              | ADP    | False
California      | PROPN  | True
.               | PUNCT  | False


Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the
following dataset consisting of example sentences:
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.


In [None]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# 1. The Dataset
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

# 2. Tokenization and Preprocessing
# simple_preprocess lowercases the text and keeps only alphabetic tokens
# (punctuation and digits are dropped); by default it also discards tokens
# shorter than 2 or longer than 15 characters.
processed_dataset = [simple_preprocess(sentence) for sentence in dataset]

# 3. Train the Word2Vec Model
# vector_size: Dimensions of the word vectors
# window: Context window size (words to the left and right)
# min_count: Ignores words with total frequency lower than this
# epochs: Number of iterations over the corpus
model = Word2Vec(sentences=processed_dataset,
                 vector_size=100,
                 window=5,
                 min_count=1,
                 workers=4,
                 epochs=10)

# 4. Test the model: Find similar words
word = "language"
if word in model.wv:
    similar_words = model.wv.most_similar(word, topn=3)
    print(f"Words similar to '{word}': {similar_words}")
else:
    print(f"'{word}' not found in vocabulary.")

Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked
with analyzing customer feedback. Outline the steps you would take to clean, process,
and extract useful insights using NLP techniques from thousands of customer reviews.
sol) As a data scientist in a fintech startup, handling raw customer feedback requires a pipeline that transforms chaotic language into actionable business metrics. I would approach this task by building an NLP pipeline structured into three core phases: Ingestion & Cleaning, NLP Processing, and Insight Extraction.

1. Ingestion & Cleaning (Preprocessing)
Before analysis, the data must be standardized to remove "noise" that doesn't contribute to sentiment or intent.

Handling Raw Data: Combine reviews from various sources (App Store, emails, support tickets) into a single structured format.

Normalization: Convert text to lowercase, remove URLs, HTML tags, and special characters.

Tokenization & Cleaning: Split text into words and remove stopwords (common words like "the", "and") that don't carry significant meaning for financial sentiment.

Lemmatization: Reduce words to their dictionary roots (e.g., "banking," "banked," and "banks" all become "bank") to reduce vocabulary size.
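
The cleaning steps above might look like the following minimal sketch. The regular expressions and the tiny stopword set are illustrative assumptions, and lemmatization is left out because it needs a real resource such as spaCy or WordNet.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")               # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # URLs
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")      # remaining special characters

# Hypothetical mini stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "and", "a", "is", "to", "my", "it"}

def clean_review(raw):
    """Normalize one raw review and return its content tokens."""
    text = html.unescape(raw).lower()   # decode entities, then case-fold
    text = TAG_RE.sub(" ", text)        # strip HTML tags
    text = URL_RE.sub(" ", text)        # strip URLs
    text = NON_ALNUM_RE.sub(" ", text)  # strip punctuation and symbols
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(clean_review("<p>The app CRASHED during my transfer!! See https://example.com/help</p>"))
# ['app', 'crashed', 'during', 'transfer', 'see']
```

Tags are removed before URLs so that a closing tag glued to the end of a link is not swallowed by the URL pattern.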

2. NLP Processing (Modeling)
In this phase, we apply algorithms to understand the structure and emotional tone of the feedback.

Sentiment Analysis: Use a model (like a fine-tuned BERT model) to classify reviews as Positive, Negative, or Neutral. This helps quantify overall customer satisfaction.

Named-Entity Recognition (NER): Identify specific entities mentioned, such as product names ("Savings Account", "Stock Trading") or competitor names.

Topic Modeling (LDA): Use Latent Dirichlet Allocation to automatically cluster reviews into overarching themes (e.g., "App Crashes," "High Fees," "Customer Service") without having to read every single review.

3. Insight Extraction (Actionable Intelligence)
Finally, we translate the model outputs into business strategy.

Driver Analysis: Correlate specific topics with negative sentiment. For example: Do reviews mentioning "transfer speed" correlate strongly with negative feedback?

Dashboarding: Build a real-time dashboard tracking sentiment trends over time and the volume of specific complaints.

Prioritization for Product Team: Provide a ranked list of issues based on sentiment severity and frequency to guide the roadmap.
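
As a rough sketch of the driver analysis and prioritization steps, the snippet below ranks topics by how often they appear in negative reviews. The `(topic, sentiment)` pairs are made-up placeholders for what the topic and sentiment models would output; no real data is implied.

```python
from collections import defaultdict

# Hypothetical model outputs: one (topic, sentiment) pair per review.
labeled_reviews = [
    ("transfer speed", "negative"),
    ("transfer speed", "negative"),
    ("transfer speed", "positive"),
    ("fees", "negative"),
    ("fees", "neutral"),
    ("customer service", "positive"),
]

def negative_share_by_topic(rows):
    """Return (topic, negative_count, negative_share) tuples, worst topics first."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for topic, sentiment in rows:
        totals[topic] += 1
        if sentiment == "negative":
            negatives[topic] += 1
    ranked = [(t, negatives[t], negatives[t] / totals[t]) for t in totals]
    # Rank by how many negative reviews mention the topic, then by their share.
    ranked.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return ranked

for topic, count, share in negative_share_by_topic(labeled_reviews):
    print(f"{topic:<18} negative={count} share={share:.0%}")
```

Sorting on both frequency and share is one possible design choice; a product team might weight severity differently.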

Summary Checklist for a Fintech Pipeline
Data: Unstructured Text

Tools: Python, spaCy, Hugging Face Transformers

Goal: Turn unstructured text into structured sentiment metrics