# Theory

1. Compare and contrast NLTK and spaCy in terms of features, ease of use, and performance.
- I) Tokenization:
  - i) NLTK: Rule-based tokenizers; customizable but slower.
  - ii) spaCy: Highly optimized tokenizer written in Cython (very fast).
- II) POS Tagging:
  - i) NLTK: Several taggers (Perceptron, CRF, etc.); customizable.
  - ii) spaCy: Pre-trained statistical models for multiple languages.
- III) Named Entity Recognition:
  - i) NLTK: Basic; models are older and less accurate.
  - ii) spaCy: State-of-the-art pre-trained models with good accuracy.
- IV) Dependency Parsing:
  - i) NLTK: Limited (via external tools like Stanford Parser).
  - ii) spaCy: Built-in dependency parser, efficient and accurate.
- V) Lemmatization:
  - i) NLTK: WordNet-based; good for English only.
  - ii) spaCy: Built-in lemmatizer that runs as a pipeline component and uses the POS tags (rule- and lookup-based in the standard English models).

2. What is TextBlob and how does it simplify common NLP tasks like sentiment analysis and translation?
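- TextBlob is a high-level NLP (Natural Language Processing) library for Python that builds on top of NLTK and Pattern to make text processing simpler and more intuitive.
It lets developers and beginners perform common NLP tasks, such as sentiment analysis, part-of-speech tagging, and noun phrase extraction, in just a few lines of code. Sentiment analysis, for example, is a single attribute access (`blob.sentiment`) that returns a polarity score in [-1, 1] and a subjectivity score in [0, 1]. Older releases also offered translation and language detection through the Google Translate API; those methods were deprecated and later removed, so they are not available in current versions.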

3. Explain the role of Stanford NLP in academic and industry NLP projects.
- Stanford NLP (Stanford CoreNLP) is one of the most influential and widely used natural language processing (NLP) frameworks developed by the Stanford Natural Language Processing Group at Stanford University. It plays a major role in both academic research and industry applications due to its rich linguistic tools, high accuracy, and flexibility.
- I) Role in Academic Research:
  - i) Benchmark models: Many research papers cite or use Stanford POS Tagger, NER, or Parser as baselines.
  - ii) Corpora & datasets: It provides trained models on standard linguistic datasets (Penn Treebank, OntoNotes, etc.).
  - iii) Reproducibility: Enables researchers to reproduce linguistic analyses consistently.
- II) Role in Industry Applications:
  - i) Information Extraction:- Extracting entities and relations from news, legal, and financial documents.
  - ii) Chatbots & Virtual Assistants:- Understanding user intent and parsing language structure.
  - iii) Search & Recommendation Systems:- Improving semantic search using syntactic and entity information.

4. Describe the architecture and functioning of a Recurrent Neural Network (RNN).
- An RNN's architecture is a series of interconnected units that process sequential data, featuring a "feedback loop" where the output from a previous step is fed as an input to the current step.
- Architecture:
  - i) Recurrent connections: The core feature is a loop that sends the output of a neuron (or layer) back into itself.
  - ii) Hidden state: This is the memory of the network. At each time step, a new hidden state is computed based on the current input and the previous hidden state.
  - iii) Layers: Similar to other neural networks, RNNs have input, output, and hidden layers. The hidden layer is where the "recurrent" processing happens.
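
The functioning described above can be sketched as a forward pass. This is a minimal illustration of the standard vanilla (Elman) update `h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)`, written in plain Python so it runs without extra libraries. The function name `rnn_forward`, the weight names, and the sizes are assumptions chosen for the example, not part of any library.

```python
import math
import random


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return the hidden state at every step.

    inputs: list of input vectors (lists of floats), one per time step
    W_xh:   hidden_size x input_size matrix (input -> hidden weights)
    W_hh:   hidden_size x hidden_size matrix (hidden -> hidden weights)
    b_h:    bias vector of length hidden_size
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size  # initial hidden state h_0
    states = []
    for x in inputs:
        new_h = []
        for i in range(hidden_size):
            total = b_h[i]
            total += sum(W_xh[i][j] * x[j] for j in range(len(x)))        # current input
            total += sum(W_hh[i][k] * h[k] for k in range(hidden_size))   # previous state
            new_h.append(math.tanh(total))
        h = new_h  # the new state becomes the "memory" for the next step
        states.append(h)
    return states


# Illustrative sizes and random weights (assumptions for the demo only)
random.seed(0)
input_size, hidden_size, seq_len = 3, 4, 5
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
b_h = [0.0] * hidden_size
sequence = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(seq_len)]

states = rnn_forward(sequence, W_xh, W_hh, b_h)
print("Time steps:", len(states), "| hidden size:", len(states[-1]))
```

The same weights are reused at every time step, which is what lets the network handle sequences of any length. For classification, the last hidden state is typically passed to an output layer.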

5. What is the key difference between LSTM and GRU networks in NLP applications?
- I) Number of Gates
  - i) LSTM:- 3 gates: Input Gate, Forget Gate, Output Gate.
  - ii) GRU:- 2 gates: Update Gate, Reset Gate.
- II) Memory Components
  - i) LSTM:- Has two states:
     - 1. Cell state (Cₜ): long-term memory
     - 2. Hidden state (hₜ): short-term memory
  - ii) GRU:- Has one state (hₜ) that combines both long- and short-term memory.
- III) Complexity
  - i) LSTM:- More parameters and more computation (heavier model)
  - ii) GRU:- Fewer parameters and simpler (lighter model)
- IV) Training Speed
  - i) LSTM:- Slower due to complex gating
  - ii) GRU:- Faster to train
- V) Expressiveness
  - i) LSTM:- The separate cell state and output gate give finer control over what is stored and what is exposed, which can help on long sequences.
  - ii) GRU:- Compact and often faster to converge; performs comparably on many tasks.
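
The complexity difference can be made concrete by counting weights. This sketch uses the textbook formulation, in which each gate (and the candidate state) has its own input weights, recurrent weights, and one bias vector. The helper names `lstm_params` and `gru_params` are hypothetical. Some implementations add a second bias vector per gate (the default GRU variant in Keras, for example), so the counts they report can be slightly higher.

```python
def lstm_params(input_dim, units):
    # 4 weight sets: input gate, forget gate, output gate, candidate cell state.
    # Each set has input weights, recurrent weights, and a bias.
    return 4 * (input_dim * units + units * units + units)


def gru_params(input_dim, units):
    # 3 weight sets: update gate, reset gate, candidate hidden state.
    return 3 * (input_dim * units + units * units + units)


input_dim, units = 16, 32  # e.g. 16-dimensional embeddings feeding 32 recurrent units
print("LSTM parameters:", lstm_params(input_dim, units))  # 6272
print("GRU parameters: ", gru_params(input_dim, units))   # 4704
```

For the same input and hidden sizes, the GRU layer has three quarters of the LSTM layer's parameters under this formulation.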


In [1]:
# 6. Write a Python program using TextBlob to perform sentiment analysis on the following paragraph of text:
# “I had a great experience using the new mobile banking app. The interface is intuitive, and customer support was quick to resolve my issue. However, the app did crash once during a transaction, which was frustrating.”
# Your program should print out the polarity and subjectivity scores.

from textblob import TextBlob

# Input paragraph
text = """I had a great experience using the new mobile banking app.
The interface is intuitive, and customer support was quick to resolve my issue.
However, the app did crash once during a transaction, which was frustrating."""

# Create a TextBlob object
blob = TextBlob(text)

# Get sentiment analysis
sentiment = blob.sentiment

# Print the results
print("Sentiment Analysis Results:")
print(f"Polarity: {sentiment.polarity}")
print(f"Subjectivity: {sentiment.subjectivity}")

Sentiment Analysis Results:
Polarity: 0.21742424242424244
Subjectivity: 0.6511363636363636


In [6]:
# 7. Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK:
# “Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language.
# Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical.”

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Newer NLTK releases load the tokenizer data from 'punkt_tab'
nltk.download('punkt_tab')

# Sample paragraph
text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Download required NLTK data (run once)
nltk.download('punkt')

# Tokenize the text into words
tokens = word_tokenize(text)

# Display tokens
print("Tokens:")
print(tokens)
print("\nTotal Tokens:", len(tokens))

# Create frequency distribution
freq_dist = FreqDist(tokens)

# Display the most common words
print("\nFrequency Distribution (Top 10):")
for word, freq in freq_dist.most_common(10):
    print(f"{word}: {freq}")

Tokens:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Total Tokens: 63

Frequency Distribution (Top 10):
,: 7
.: 4
NLP: 3
and: 3
is: 2
of: 2
Natural: 1
Language: 1
Processing: 1
(: 1


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
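
In the output above, punctuation dominates the top of the frequency distribution and 'Language' and 'language' are counted separately. The sketch below shows one way to get a word-only, case-insensitive count. It uses only the standard library (`re` and `collections.Counter`) rather than NLTK, and the regular expression is an assumption: it keeps runs of letters and drops digits, hyphens, and apostrophes.

```python
import re
from collections import Counter

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Lowercase first so 'Language' and 'language' collapse into one entry,
# then keep only alphabetic runs (punctuation and brackets are discarded).
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)

print("Frequency Distribution without punctuation (Top 5):")
for word, freq in counts.most_common(5):
    print(f"{word}: {freq}")
```

The same filtering can be applied to the NLTK tokens directly, for example by keeping only tokens for which `str.isalpha()` is true before building the `FreqDist`.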


In [7]:
# 8. Implement a basic LSTM model in Keras for a text classification task using the following dummy dataset. Your model should classify sentences as either positive (1) or negative (0).
# Dataset
# texts = [
# “I love this project”, #Positive
# “This is an amazing experience”, #Positive
# “I hate waiting in line”, #Negative
# “This is the worst service”, #Negative
# “Absolutely fantastic!” #Positive]
# labels = [1, 1, 0, 0, 1]
# Preprocess the text, tokenize it, pad sequences, and build an LSTM model to train on this data. You may use Keras with TensorFlow backend.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Dummy dataset
texts = [
    "I love this project",            # Positive
    "This is an amazing experience",  # Positive
    "I hate waiting in line",         # Negative
    "This is the worst service",      # Negative
    "Absolutely fantastic!"           # Positive
]
labels = np.array([1, 1, 0, 0, 1])  # 1 = Positive, 0 = Negative

# Fit the tokenizer on the texts (keep at most the 100 most frequent words)
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(texts)

# Convert texts to sequences of integers
sequences = tokenizer.texts_to_sequences(texts)

# Check sequences
print("Sequences:", sequences)

# Pad every sequence to the length of the longest one
max_len = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')
print("Padded Sequences:\n", padded_sequences)

# Model hyperparameters
vocab_size = 100  # same as tokenizer num_words
embedding_dim = 16

# Build LSTM model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))
model.add(LSTM(32))  # 32 LSTM units
model.add(Dense(1, activation='sigmoid'))  # Binary classification

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()

# Train the model (batch_size=1 because the dataset has only five samples)
model.fit(padded_sequences, labels, epochs=20, batch_size=1)

Sequences: [[2, 4, 1, 5], [1, 3, 6, 7, 8], [2, 9, 10, 11, 12], [1, 3, 13, 14, 15], [16, 17]]
Padded Sequences:
 [[ 2  4  1  5  0]
 [ 1  3  6  7  8]
 [ 2  9 10 11 12]
 [ 1  3 13 14 15]
 [16 17  0  0  0]]




Epoch 1/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 11ms/step - accuracy: 0.6444 - loss: 0.6912
Epoch 2/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.8917 - loss: 0.6843 
Epoch 3/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 1.0000 - loss: 0.6855 
Epoch 4/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 1.0000 - loss: 0.6836 
Epoch 5/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 1.0000 - loss: 0.6784 
Epoch 6/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 1.0000 - loss: 0.6588 
Epoch 7/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 1.0000 - loss: 0.6546 
Epoch 8/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 1.0000 - loss: 0.6474 
Epoch 9/20
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [

<keras.src.callbacks.history.History at 0x7c4790e712e0>

In [8]:
# 9. Using spaCy, build a simple NLP pipeline that includes tokenization, lemmatization, and entity recognition. Use the following paragraph as your dataset:
# “Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role in the development of India’s atomic energy program. He was the founding director of the Tata Institute of Fundamental Research (TIFR)
# and was instrumental in establishing the Atomic Energy Commission of India.”
# Write a Python program that processes this text using spaCy, then prints tokens, their lemmas, and any named entities found.

import spacy

# Load the English model (download if not already)
# !python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Sample paragraph
text = """Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role in the
development of India’s atomic energy program. He was the founding director of the Tata
Institute of Fundamental Research (TIFR) and was instrumental in establishing the
Atomic Energy Commission of India."""

# Process the text
doc = nlp(text)

# --- Tokenization and Lemmatization ---
print("Tokens and Lemmas:")
for token in doc:
    print(f"Token: {token.text}\t Lemma: {token.lemma_}")

# --- Named Entity Recognition ---
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"Entity: {ent.text}\t Label: {ent.label_}")


Tokens and Lemmas:
Token: Homi	 Lemma: Homi
Token: Jehangir	 Lemma: Jehangir
Token: Bhaba	 Lemma: Bhaba
Token: was	 Lemma: be
Token: an	 Lemma: an
Token: Indian	 Lemma: indian
Token: nuclear	 Lemma: nuclear
Token: physicist	 Lemma: physicist
Token: who	 Lemma: who
Token: played	 Lemma: play
Token: a	 Lemma: a
Token: key	 Lemma: key
Token: role	 Lemma: role
Token: in	 Lemma: in
Token: the	 Lemma: the
Token: 
	 Lemma: 

Token: development	 Lemma: development
Token: of	 Lemma: of
Token: India	 Lemma: India
Token: ’s	 Lemma: ’s
Token: atomic	 Lemma: atomic
Token: energy	 Lemma: energy
Token: program	 Lemma: program
Token: .	 Lemma: .
Token: He	 Lemma: he
Token: was	 Lemma: be
Token: the	 Lemma: the
Token: founding	 Lemma: found
Token: director	 Lemma: director
Token: of	 Lemma: of
Token: the	 Lemma: the
Token: Tata	 Lemma: Tata
Token: 
	 Lemma: 

Token: Institute	 Lemma: Institute
Token: of	 Lemma: of
Token: Fundamental	 Lemma: Fundamental
Token: Research	 Lemma: Research
Token: (	 Lemma: 

10. You are working on a chatbot for a mental health platform. Explain how you would leverage LSTM or GRU networks along with libraries like spaCy or Stanford NLP to understand and respond to user input effectively. Detail your architecture, data preprocessing pipeline, and any ethical considerations.
- I) Data Preprocessing Pipeline
Before feeding data into an LSTM or GRU, we must clean, normalize, and encode text. NLP libraries help with this:
  - i) Text Cleaning

    - Lowercasing, removing unnecessary punctuation.

    - Removing sensitive or personally identifiable information (PII) for privacy.

  - ii) Tokenization & Lemmatization

    - Use spaCy or Stanford NLP to:

      - Split text into tokens (words)

      - Extract lemmas (base word forms)

    - Example: “I’ve been feeling anxious” → ["I", "have", "be", "feel", "anxious"]

  - iii) Named Entity Recognition (NER)

    - Identify names, dates, locations, or organizations to anonymize sensitive information.

    - Example: Replace “John” with <PERSON>.

  - iv) Word Embeddings

    - Convert tokens into numerical vectors, using either:

      - Pre-trained embeddings such as GloVe or Word2Vec, or

      - Trainable embeddings learned by a Keras `Embedding` layer.
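
The preprocessing steps above can be sketched end to end. This is a simplified, standard-library-only illustration: the regular expressions, the placeholder tokens, and the helper names (`clean`, `tokenize`, `build_vocab`, `encode`) are all assumptions made for the example. It masks only emails and phone numbers; in the pipeline described above, names and locations would additionally be replaced using spaCy or Stanford NLP entity recognition, and lemmatization would come from the same library.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def clean(text):
    """Mask obvious PII with placeholders, then lowercase."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text.lower()


def tokenize(text):
    # Placeholders such as <email> stay whole; other tokens are runs of letters/apostrophes.
    return re.findall(r"<[a-z]+>|[a-z']+", text)


def build_vocab(token_lists):
    # Index 0 is reserved for padding and 1 for out-of-vocabulary tokens.
    vocab = {"<pad>": 0, "<oov>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    # Map tokens to integer ids, truncate, then right-pad to a fixed length.
    ids = [vocab.get(tok, vocab["<oov>"]) for tok in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


messages = [
    "I've been feeling anxious lately.",
    "You can reach me at jane.doe@example.com or +1 555-123-4567.",
]
token_lists = [tokenize(clean(m)) for m in messages]
vocab = build_vocab(token_lists)
padded = [encode(tokens, vocab, max_len=12) for tokens in token_lists]

for tokens, ids in zip(token_lists, padded):
    print(tokens)
    print(ids)
```

The padded integer sequences have the shape an `Embedding` layer followed by an LSTM or GRU expects. Masking happens before the vocabulary is built, so raw contact details never become tokens the model can memorize.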