### Name: D Vamsidhar
### PRN: 24070149005
### NLP Assignment 01

# Part A: Basics of NLP & Pipeline 


## **Natural Language Processing (NLP) Pipeline**

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables machines to understand, interpret, and generate human language. The NLP pipeline consists of several key steps:

### **1. Text Preprocessing**
Text preprocessing is crucial for cleaning and preparing raw text data before analysis. It includes:

#### **a. Tokenization**
- Splitting text into smaller units called tokens (words, subwords, or sentences).
- Example: "NLP is amazing!" → ["NLP", "is", "amazing", "!"]
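As a rough illustration of the example above, word-level tokenization can be approximated with a single regular expression. This is a minimal sketch using only the standard library; `simple_tokenize` is a hypothetical helper, not a function from NLTK or spaCy.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("NLP is amazing!")
print(tokens)  # ['NLP', 'is', 'amazing', '!']
```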

#### **b. Stopword Removal**
- Removing very common words that carry little meaning on their own (e.g., "the", "is", "and").
- Example: "This is an example sentence" → ["example", "sentence"]
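The example above can be sketched as a simple filter. The stopword set here is a deliberately tiny, hypothetical list chosen for illustration; library stopword lists are much longer.

```python
# Hypothetical, deliberately small stopword list (an assumption for this sketch).
STOPWORDS = {"this", "is", "an", "a", "the", "and", "of", "to"}

def remove_stopwords(tokens):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

filtered = remove_stopwords("This is an example sentence".split())
print(filtered)  # ['example', 'sentence']
```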

#### **c. Stemming and Lemmatization**
- **Stemming:** Reduces words to their root form by chopping suffixes (e.g., "running" → "run").
- **Lemmatization:** Converts words to their dictionary form (lemma) using a vocabulary and the word's part of speech (e.g., "better" → "good" when tagged as an adjective).

#### **d. Part-of-Speech (POS) Tagging**
- Assigning word categories such as noun, verb, adjective, etc.
- Example: "She runs fast." → [("She", PRON), ("runs", VERB), ("fast", ADV)]

---

### **2. Feature Engineering**
Feature extraction transforms text into numerical representations for machine learning models.

#### **a. Bag of Words (BoW)**
- Represents each document as a vector of word counts over a fixed vocabulary, ignoring word order.
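A minimal sketch of the Bag of Words idea, assuming whitespace tokenization and a two-document toy corpus (both are assumptions made for illustration):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]  # toy corpus

# Sort the vocabulary so every document uses the same column order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```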

#### **b. TF-IDF (Term Frequency-Inverse Document Frequency)**
- Measures the importance of words in a document relative to a collection of documents.
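One common formulation multiplies a term's relative frequency in a document by the logarithm of the inverse document frequency. The sketch below assumes that unsmoothed variant; libraries often apply smoothing or normalization, so their numbers can differ.

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]

def tf(term, doc):
    """Relative frequency of the term within one document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of (number of documents / documents containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)  # assumes the term occurs in at least one document

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("the", docs[0], docs))  # 0.0, because "the" occurs in every document
print(tf_idf("cat", docs[0], docs))  # positive, because "cat" occurs in only one document
```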

#### **c. Word Embeddings (Word2Vec, GloVe, BERT)**
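- Map words to dense vectors in which related words lie close together. Word2Vec and GloVe assign one static vector per word, while BERT produces contextual vectors that depend on the surrounding sentence.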
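Closeness in vector space is usually measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are purely hypothetical; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Hypothetical vectors chosen by hand for illustration, not learned embeddings.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_royal > sim_fruit)  # True
```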

---

### **3. Model Training**
Machine learning or deep learning models are trained on processed text data.

#### **a. Traditional ML Models**
- Naïve Bayes, SVM, Random Forest for text classification.
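To make the Naïve Bayes option concrete, here is a small from-scratch sketch of a multinomial classifier with add-one smoothing. The four training sentences and the `predict` helper are assumptions made for illustration; in practice a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict

# Toy training set (an assumption for this sketch).
train = [
    ("great product love it", "pos"),
    ("love the quality", "pos"),
    ("terrible product hate it", "neg"),
    ("hate the quality", "neg"),
]

class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for text, label in train:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}

def predict(text):
    """Return the class with the highest log-probability for the text."""
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))  # log prior
        for word in text.split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("love it"))  # pos
print(predict("hate it"))  # neg
```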

#### **b. Deep Learning Models**
- RNNs, LSTMs, Transformers (BERT, GPT) for NLP tasks.

---

### **4. Model Evaluation**
Evaluating the NLP model using metrics such as:

- **Accuracy, Precision, Recall, F1-score** (for classification tasks)
- **BLEU Score, ROUGE Score** (for text generation and summarization)
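For a binary classifier, the classification metrics above follow directly from counts of true positives, false positives, and false negatives. A minimal sketch with made-up labels:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```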

---

### **5. Deployment and Real-world Application**
After training and evaluation, NLP models are deployed for various applications:

- **Chatbots & Virtual Assistants** (e.g., Siri, Alexa)
- **Sentiment Analysis** (e.g., product reviews)
- **Machine Translation** (e.g., Google Translate)
- **Speech-to-Text & Text-to-Speech** (e.g., Voice Assistants)
- **Summarization & Question Answering** (e.g., News summarization)


This structured pipeline underlies the applications above: each stage turns text into a form the next stage can use, from cleaned tokens to numerical features to a trained, evaluated, and deployed model.

## **Real-World Applications of Natural Language Processing (NLP)**

Natural Language Processing (NLP) has a wide range of applications across different industries. Below are three key real-world applications and the role of the NLP pipeline in each.

---

## **1. Chatbots and Virtual Assistants**
Chatbots and virtual assistants, such as Siri, Alexa, and Google Assistant, use NLP to understand and respond to user queries in a conversational manner.

### **Role of NLP Pipeline**
1. **Text Preprocessing**:  
   - Tokenization, stopword removal, and lemmatization help clean and structure user input.  
   - Named Entity Recognition (NER) identifies key information (e.g., names, dates, locations).  

2. **Feature Extraction**:  
   - Word embeddings (e.g., BERT, Word2Vec) convert text into numerical vectors.  

3. **Model Training**:  
   - Transformer-based models (GPT, BERT) process the query and generate context-aware responses.  
   - Intent classification models categorize queries (e.g., setting reminders, checking weather).  

4. **Evaluation**:  
   - Responses are assessed with accuracy metrics and user feedback, and the results guide further fine-tuning.  

5. **Deployment**:  
   - The chatbot or assistant interacts with users in real time and may be improved over time, for example through reinforcement learning from user feedback.  

---

## **2. Sentiment Analysis in Customer Feedback**
Sentiment analysis helps businesses analyze customer opinions and reviews to improve products and services.

### **Role of NLP Pipeline**
1. **Text Preprocessing**:  
   - Tokenization, stopword removal, and stemming/lemmatization clean customer feedback.  

2. **Feature Extraction**:  
   - TF-IDF or word embeddings represent text numerically for analysis.  

3. **Model Training**:  
   - Machine learning models (Naïve Bayes, SVM) or deep learning models (LSTMs, Transformers) classify sentiment as positive, negative, or neutral.  

4. **Evaluation**:  
   - Accuracy, F1-score, and confusion matrices help measure performance.  

5. **Deployment**:  
   - The trained model is integrated into business dashboards for real-time analysis of customer sentiment.  
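The evaluation step for a three-class sentiment model can be sketched as follows. The labels and predictions are made up for illustration; rows of the matrix are true classes and columns are predicted classes.

```python
from collections import Counter

labels = ["negative", "neutral", "positive"]
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "neutral", "neutral", "positive", "negative"]

# Count each (true label, predicted label) pair.
pair_counts = Counter(zip(y_true, y_pred))
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy)  # accuracy: 0.8
```

The off-diagonal entry in the first row shows one negative review misclassified as neutral, a detail that a single accuracy figure hides.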

---

## **3. Machine Translation (e.g., Google Translate)**
Machine translation enables automatic language conversion, allowing global communication.

### **Role of NLP Pipeline**
1. **Text Preprocessing**:  
   - Tokenization and sentence segmentation prepare input text.  

2. **Feature Extraction**:  
   - Word embeddings capture semantic meaning across different languages.  

3. **Model Training**:  
   - Sequence-to-sequence models (LSTMs, Transformers) learn language mappings.  
   - Attention mechanisms improve context understanding in long sentences.  

4. **Evaluation**:  
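   - BLEU scores measure n-gram overlap between the model output and reference translations; ROUGE, which is more common in summarization, is sometimes reported as well.  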

5. **Deployment**:  
   - The model is deployed in translation apps, improving over time with user feedback.  
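The core idea behind BLEU is clipped n-gram precision. The sketch below computes only the unigram case against a single reference; full BLEU also combines higher-order n-grams and a brevity penalty, which are omitted here.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate words found in the reference, with clipped counts."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # A candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return overlap / len(candidate)

reference = "the cat is on the mat".split()
candidate = "the the the cat mat".split()
score = clipped_unigram_precision(candidate, reference)
print(score)  # 0.8
```

Clipping is what keeps the repeated "the" from inflating the score: only two of its three occurrences are counted.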

---

## **Conclusion**
The NLP pipeline plays a critical role in various applications by processing, understanding, and generating human language. From chatbots to sentiment analysis and machine translation, NLP continues to revolutionize the way machines interact with humans.

# Part B: Tokenization

# **Word-Level vs. Sentence-Level Tokenization**

Tokenization is a fundamental step in the Natural Language Processing (NLP) pipeline that involves splitting text into meaningful units. It can be performed at different levels, such as word-level and sentence-level.

---

## **1. Word-Level Tokenization**
Word-level tokenization breaks a sentence into individual words or tokens. It is useful for tasks like text analysis, sentiment analysis, and word frequency calculations.

### **Example:**
#### **Input Text:**  
*"Natural Language Processing is fascinating! It enables machines to understand human language."*

#### **Word-Level Tokens:**  
`["Natural", "Language", "Processing", "is", "fascinating", "!", "It", "enables", "machines", "to", "understand", "human", "language", "."]`

### **Key Features:**
- Splits text into individual words, including punctuation marks as separate tokens.
- Helps in tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
- Challenges include handling contractions (e.g., "don't" → ["do", "n't"]) and special characters.
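The contraction case can be sketched with a regular expression that splits off "n't" before matching ordinary words. This is a hypothetical pattern for illustration only; library tokenizers cover many more cases.

```python
import re

# Alternatives are tried left to right: the word part before "n't",
# then "n't" itself, then any other word, then a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")

def tokenize_with_contractions(text):
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_with_contractions("I don't know.")
print(tokens)  # ['I', 'do', "n't", 'know', '.']
```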

---

## **2. Sentence-Level Tokenization**
Sentence-level tokenization divides a text into individual sentences. It is useful for applications like text summarization, machine translation, and sentiment analysis at a document level.

### **Example:**
#### **Input Text:**  
*"Natural Language Processing is fascinating! It enables machines to understand human language."*

#### **Sentence-Level Tokens:**  
`["Natural Language Processing is fascinating!", "It enables machines to understand human language."]`

### **Key Features:**
- Splits text based on sentence boundaries.
- Handles punctuation such as periods, exclamation marks, and question marks.
- Challenges include detecting abbreviations (e.g., "Dr.", "etc.") where periods do not indicate sentence boundaries.
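A minimal sketch of the abbreviation problem, assuming whitespace-separated tokens and a small hypothetical abbreviation list. Sentence tokenizers in libraries use far richer rules or trained models.

```python
# Hypothetical abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no end punctuation
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("Dr. Rao teaches NLP. It is fascinating!")
print(result)  # ['Dr. Rao teaches NLP.', 'It is fascinating!']
```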

---

## **Conclusion**
| **Feature**        | **Word-Level Tokenization** | **Sentence-Level Tokenization** |
|--------------------|---------------------------|--------------------------------|
| **Definition**     | Splits text into words.    | Splits text into sentences.   |
| **Example Output** | `["Natural", "Language", "Processing", "is", "fascinating", "!"]` | `["Natural Language Processing is fascinating!"]` |
| **Use Cases**      | POS tagging, Named Entity Recognition, Machine Translation. | Text summarization, Sentiment Analysis, Document Segmentation. |
| **Challenges**     | Handling punctuation, contractions. | Handling abbreviations, sentence boundaries. |

Both tokenization techniques play a crucial role in NLP, depending on the requirements of the application.


## Write Python code using a library (e.g., NLTK or SpaCy) to perform tokenization on the text mentioned above.

In [6]:
```python
import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text
text = "Natural Language Processing is fascinating! It enables machines to understand human language."

# Using NLTK for tokenization
# Note: recent NLTK releases may look for the 'punkt_tab' resource instead of 'punkt'.
nltk.download('punkt')  # Ensure necessary data is downloaded
word_tokens_nltk = word_tokenize(text)
sentence_tokens_nltk = sent_tokenize(text)

print("NLTK Word Tokenization:", word_tokens_nltk)
print("NLTK Sentence Tokenization:", sentence_tokens_nltk)

# Using spaCy for tokenization (requires the en_core_web_sm model to be installed)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

word_tokens_spacy = [token.text for token in doc]
sentence_tokens_spacy = [sent.text for sent in doc.sents]

print("\nspaCy Word Tokenization:", word_tokens_spacy)
print("spaCy Sentence Tokenization:", sentence_tokens_spacy)
```

```
[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

NLTK Word Tokenization: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!', 'It', 'enables', 'machines', 'to', 'understand', 'human', 'language', '.']
NLTK Sentence Tokenization: ['Natural Language Processing is fascinating!', 'It enables machines to understand human language.']

spaCy Word Tokenization: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!', 'It', 'enables', 'machines', 'to', 'understand', 'human', 'language', '.']
spaCy Sentence Tokenization: ['Natural Language Processing is fascinating!', 'It enables machines to understand human language.']
```


In [3]:
```python
!pip install spacy
```


In [5]:
```python
!python -m spacy download en_core_web_sm
```


# Part C: Stemming and Lemmatization

## **Comparison: Stemming vs. Lemmatization**

Both **stemming** and **lemmatization** are text preprocessing techniques used in Natural Language Processing (NLP) to reduce words to their base or root form. However, they differ in their approach and accuracy.

---

## **1. Stemming**
Stemming is a rule-based process that strips suffixes from words to obtain a root form. The resulting stems are often not valid dictionary words.

### **Example:**
- "Running" → "Run"
- "Happily" → "Happili"
- "Studies" → "Studi"

### **Key Characteristics:**
- Uses heuristic rules to chop off word endings (suffixes).
- Does not consider the actual meaning of words.
- Produces stem words that may not always be valid words.

### **Common Stemmers:**
- Porter Stemmer
- Snowball Stemmer
- Lancaster Stemmer
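To show why rule-based stripping yields non-words, here is a toy suffix stripper. It is not the Porter, Snowball, or Lancaster algorithm; the rule list and the minimum stem length are assumptions chosen for illustration.

```python
# Hypothetical rule list, checked in order (not a real stemmer's rules).
SUFFIXES = ["ing", "ly", "es", "s"]

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["Playing", "Easily", "Studies", "Knives"]]
print(stems)  # ['play', 'easi', 'studi', 'kniv']
```

Only "play" is a dictionary word; the other stems are still useful for grouping related word forms, which is the purpose of stemming.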

---

## **2. Lemmatization**
Lemmatization reduces a word to its **base or dictionary form (lemma)** using linguistic rules and vocabulary.

### **Example:**
- "Running" → "Run" (when tagged as a verb)
- "Happily" → "Happily" (an adverb is already its own lemma)
- "Studies" → "Study"
- "Better" → "Good" (when tagged as an adjective)

### **Key Characteristics:**
- Considers the word's part of speech, and therefore its **context**, when choosing the lemma.
- Uses **lexical databases** like WordNet.
- Produces valid words that exist in the dictionary.

### **Common Lemmatizers:**
- WordNet Lemmatizer (NLTK)
- SpaCy Lemmatizer
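Conceptually, a lemmatizer is a lookup keyed on both the word and its part of speech. The sketch below uses a tiny hypothetical lexicon to show why the part of speech matters; real lemmatizers rely on large lexical resources and morphological rules.

```python
# Hypothetical mini-lexicon mapping (word, part of speech) to a lemma.
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("running", "verb"): "run",
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("children", "noun"): "child",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("Better", "adj"))     # good
print(toy_lemmatize("Better", "noun"))    # better
print(toy_lemmatize("Children", "noun"))  # child
```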

---

## **3. Key Differences:**

| **Feature**       | **Stemming**                          | **Lemmatization**                   |
|-------------------|--------------------------------------|--------------------------------------|
| **Definition**    | Strips suffixes to get a root form. | Converts words to their dictionary form (lemma). |
| **Approach**      | Rule-based heuristic approach.       | Dictionary and linguistic-based approach. |
| **Output Words**  | May not be valid words (e.g., "happi" instead of "happy"). | Always produces valid words (e.g., "happy" instead of "happi"). |
| **Context Aware** | No, simply trims words.             | Yes, considers part of speech and meaning. |
| **Accuracy**      | Less accurate, faster processing.   | More accurate, slightly slower. |
| **Example** (for "running") | "Run" | "Run" |
| **Example** (for "better") | "Better" | "Good" |

---

## **4. When to Use Which?**
- **Use Stemming** when you need **faster processing** and approximate results (e.g., search engines).
- **Use Lemmatization** when **accuracy is important** (e.g., NLP tasks like Named Entity Recognition and Machine Translation).

### **Conclusion**
**Stemming** is faster but less precise, whereas **lemmatization** is more accurate but computationally more expensive. The choice depends on the application and its accuracy requirements.



### Given the following words, perform stemming and lemmatization. Use Python for implementation and include the code snippet and output:
* Playing
* Studies
* Happier
* Knives
* Children
* Easily
* Faster
* Caring

In [9]:
```python
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Ensure necessary NLTK data is downloaded
nltk.download('wordnet')
nltk.download('omw-1.4')

# List of words
words = ["Playing", "Studies", "Happier", "Knives", "Children", "Easily", "Faster", "Caring"]

# Stemming using PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word.lower()) for word in words]

# Lemmatization using WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word.lower(), pos='v') for word in words]  # Using 'v' for verb lemma
lemmatized_words_noun = [lemmatizer.lemmatize(word.lower(), pos='n') for word in words]  # Using 'n' for noun lemma

# Print results
print("Original Words:", words)
print("\nStemmed Words:", stemmed_words)
print("\nLemmatized Words (Verb Lemmatization):", lemmatized_words)
print("\nLemmatized Words (Noun Lemmatization):", lemmatized_words_noun)
```

```
Original Words: ['Playing', 'Studies', 'Happier', 'Knives', 'Children', 'Easily', 'Faster', 'Caring']

Stemmed Words: ['play', 'studi', 'happier', 'knive', 'children', 'easili', 'faster', 'care']

Lemmatized Words (Verb Lemmatization): ['play', 'study', 'happier', 'knives', 'children', 'easily', 'faster', 'care']

Lemmatized Words (Noun Lemmatization): ['playing', 'study', 'happier', 'knife', 'child', 'easily', 'faster', 'caring']

[nltk_data] Downloading package wordnet to C:\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
```
