### Name: Parth Desai
### PRN: 24070149017
### NLP ASSIGNMENT I

# Part A: Basics of NLP & Pipeline 


<h1>Key Components of the Natural Language Processing (NLP) Pipeline</h1>

<p>Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that deals with the interaction between computers and human languages. The NLP pipeline refers to the series of steps or stages through which raw text data is processed to extract meaningful information. Below are the main components of the NLP pipeline:</p>

<h2>1. Text Preprocessing</h2>
<p>Text preprocessing is usually the first step in the NLP pipeline. It cleans and normalizes raw text so that later stages receive consistent input. Common techniques include:</p>
<ul>
    <li><strong>Tokenization</strong>: Breaking down a text into smaller units like words, sentences, or subwords.</li>
    <li><strong>Lowercasing</strong>: Converting all characters to lowercase to avoid case-sensitive mismatches.</li>
    <li><strong>Removing Stop Words</strong>: Filtering out common words (like "the", "is", etc.) that do not add meaningful context.</li>
    <li><strong>Removing Punctuation</strong>: Eliminating punctuation marks that are irrelevant to understanding the meaning of the text.</li>
    <li><strong>Stemming or Lemmatization</strong>: Reducing words to a base form (e.g., "running" becomes "run"). Stemming strips suffixes with heuristic rules and may produce non-words, while lemmatization uses vocabulary and morphological analysis to return a valid dictionary form.</li>
</ul>
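
<p>The preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stop-word list is a tiny made-up sample, and the regular expression is only one simple way to tokenize.</p>

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it"}
PUNCTUATION = set(string.punctuation)

def preprocess(text):
    """Tokenize, lowercase, drop punctuation tokens, and drop stop words."""
    # Tokenize into runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercase every token.
    tokens = [t.lower() for t in tokens]
    # Remove tokens that are a single punctuation mark.
    tokens = [t for t in tokens if t not in PUNCTUATION]
    # Remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The movie is great, and the acting is superb!"))
```

<p>Stemming or lemmatization would normally follow as a further step on the returned tokens.</p>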

<h2>2. Feature Extraction</h2>
<p>Feature extraction involves transforming the raw text data into numerical features that machine learning algorithms can understand. Two common methods are:</p>
<ul>
    <li><strong>Bag of Words (BoW)</strong>: This model represents text as a collection of words without considering the order or structure of the words. Each word is treated as a feature, and its frequency in the text is counted.</li>
    <li><strong>TF-IDF (Term Frequency-Inverse Document Frequency)</strong>: This method weights a word's frequency within a document by how rare the word is across the whole document collection. Words that occur in almost every document receive low weights, which reduces the impact of very common words.</li>
</ul>
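
<p>To make the two representations concrete, here is a small sketch that computes Bag of Words counts and an unsmoothed TF-IDF score by hand. The three documents are made up for illustration, and the formula used (relative term frequency times log(N / document frequency)) is one common textbook variant; libraries often apply smoothing and normalization on top of it.</p>

```python
import math
from collections import Counter

# Toy corpus (made up for this example).
docs = [
    "the movie was great",
    "the movie was boring",
    "the acting was great",
]
tokenized = [d.split() for d in docs]

# Bag of Words: per-document term counts; word order is discarded.
bow = [Counter(tokens) for tokens in tokenized]

def idf(term):
    """Inverse document frequency: log(N / number of documents containing term).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / df)

def tf_idf(term, doc_index):
    """Relative term frequency in one document times the term's IDF."""
    tf = bow[doc_index][term] / len(tokenized[doc_index])
    return tf * idf(term)

print(bow[0])                          # counts for the first document
print(tf_idf("the", 0))                # "the" is in every document, so its weight is 0
print(round(tf_idf("boring", 1), 3))   # "boring" is in one document, so it scores highest
```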

<h2>3. Part-of-Speech Tagging</h2>
<p>Part-of-Speech (POS) tagging assigns a grammatical category (e.g., noun, verb, adjective) to each word in a sentence. This step helps reveal the syntactic structure of the sentence, which later stages such as parsing and lemmatization rely on.</p>

<h2>4. Named Entity Recognition (NER)</h2>
<p>NER is the process of identifying named entities in the text, such as the names of people, organizations, locations, dates, etc. This step helps to extract specific information from unstructured text.</p>

<h2>5. Dependency Parsing</h2>
<p>Dependency parsing involves analyzing the grammatical structure of a sentence to establish relationships between words. It helps in understanding the syntactic dependencies between words, which is crucial for tasks such as question answering or text summarization.</p>

<h2>6. Sentiment Analysis</h2>
<p>Sentiment analysis determines the sentiment or emotional tone expressed in a text (e.g., positive, negative, neutral). This is widely used in applications like social media monitoring and customer feedback analysis.</p>

<h2>7. Machine Learning or Deep Learning Models</h2>
<p>Once features are extracted, machine learning or deep learning models are used to analyze the data and make predictions or classify text. Common models include:</p>
<ul>
    <li><strong>Naive Bayes</strong>: A probabilistic model based on Bayes' theorem, often used for text classification.</li>
    <li><strong>Support Vector Machines (SVM)</strong>: A supervised learning model used for classification tasks.</li>
    <li><strong>Deep Neural Networks</strong>: Advanced models such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformer-based models (e.g., BERT) are also commonly used in NLP tasks.</li>
</ul>

<h2>Practical Example: Sentiment Analysis of Movie Reviews</h2>
<p>Consider a scenario where we want to perform sentiment analysis on a collection of movie reviews. The steps in the NLP pipeline would look like this:</p>
<ul>
    <li><strong>Text Preprocessing</strong>: Clean the text by tokenizing, lowercasing, removing stop words and punctuation, and performing stemming or lemmatization.</li>
    <li><strong>Feature Extraction</strong>: Convert the text data into numerical features using the Bag of Words or TF-IDF method.</li>
    <li><strong>Sentiment Analysis</strong>: Apply a machine learning model (e.g., Naive Bayes or an LSTM network) to classify the sentiment of each review (positive, negative, or neutral).</li>
</ul>

<p>In this example, the final output would be a prediction of the sentiment for each review, such as "positive" or "negative".</p>
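
<p>The classification step of this example can be sketched with a small Naive Bayes classifier written from scratch. The four training reviews are invented for illustration, the tokenizer is a plain whitespace split, and add-one (Laplace) smoothing is an assumption of this sketch rather than something the pipeline requires.</p>

```python
import math
from collections import Counter

# Toy labeled reviews (made up for illustration).
train = [
    ("a great and moving film", "positive"),
    ("great acting and a wonderful story", "positive"),
    ("a boring and slow film", "negative"),
    ("terrible acting and a boring plot", "negative"),
]

# Count words per class and reviews per class.
word_counts = {"positive": Counter(), "negative": Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Return the class with the highest log-probability under Naive Bayes."""
    scores = {}
    for label in word_counts:
        total = sum(word_counts[label].values())
        # Log prior: fraction of training reviews with this label.
        score = math.log(class_counts[label] / len(train))
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("a wonderful film"))
print(predict("boring plot"))
```

<p>With real data, the same structure applies, but the features would typically come from the preprocessing and feature-extraction steps described above.</p>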

<h2>Conclusion</h2>
<p>The NLP pipeline consists of several crucial stages, including text preprocessing, feature extraction, part-of-speech tagging, named entity recognition, and more. Understanding each of these components helps build robust NLP models capable of solving a wide range of language-related tasks.</p>

<h1>Real-World Applications of Natural Language Processing (NLP)</h1>

<p>Natural Language Processing (NLP) is a versatile field that finds applications across various industries. Below are three real-world applications of NLP, along with a description of how the NLP pipeline plays a crucial role in each application.</p>

<h2>1. Sentiment Analysis in Social Media Monitoring</h2>
<p><strong>Application:</strong> Sentiment analysis is used to analyze opinions, feedback, and emotions expressed on social media platforms such as Twitter, Facebook, and Instagram. Companies often use sentiment analysis to monitor brand reputation, understand customer satisfaction, and track public reactions to events or campaigns.</p>

<p><strong>Role of NLP Pipeline:</strong> The NLP pipeline plays a key role in sentiment analysis by processing and understanding large amounts of unstructured social media text. Here's how the pipeline contributes:</p>
<ul>
    <li><strong>Text Preprocessing</strong>: Text from social media posts needs to be cleaned by removing irrelevant content, such as hashtags, mentions, and special characters. Tokenization, lowercasing, and removing stop words ensure that only meaningful content is considered.</li>
    <li><strong>Feature Extraction</strong>: Features are extracted using methods like TF-IDF or word embeddings (e.g., Word2Vec or GloVe) to convert text into a numerical format that a machine learning model can understand.</li>
    <li><strong>Sentiment Classification</strong>: A trained machine learning or deep learning model classifies the sentiment as positive, negative, or neutral based on the extracted features, helping businesses understand public opinion.</li>
</ul>
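
<p>As a rough sketch of the cleaning step for social media text, the function below strips URLs and mentions with regular expressions. The patterns and the sample post are assumptions made for illustration. Note one design choice that differs slightly from the description above: the hashtag symbol is dropped but the hashtag word is kept, since it often carries sentiment.</p>

```python
import re

def clean_post(text):
    """Strip URLs, @mentions, hashtag symbols, and other special characters."""
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    text = re.sub(r"#(\w+)", r"\1", text)       # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # drop digits, emoji, punctuation
    # Collapse repeated whitespace and lowercase the result.
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_post("Loving the new phone from @BrandX!!! #happy https://example.com"))
```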

<h2>2. Chatbots and Virtual Assistants</h2>
<p><strong>Application:</strong> Virtual assistants like Siri, Alexa, and Google Assistant use NLP to understand and respond to user queries in natural language. These assistants are designed to assist users with tasks such as setting reminders, answering questions, or controlling smart devices.</p>

<p><strong>Role of NLP Pipeline:</strong> The NLP pipeline is central to processing and responding to user input in chatbots and virtual assistants. Here's how it works in this application:</p>
<ul>
    <li><strong>Text Preprocessing</strong>: The user’s text input, or the transcript produced from speech, is first normalized to reduce noise such as stray punctuation, informal language, or slang, so that later steps can interpret the message reliably.</li>
    <li><strong>Named Entity Recognition (NER)</strong>: NER identifies key entities such as dates, times, locations, and names within the query. For example, if a user asks, "What is the weather like in New York tomorrow?", NER would identify "New York" as a location and "tomorrow" as a time reference.</li>
    <li><strong>Intent Recognition</strong>: The intent behind the user’s query is determined, whether it's a request for information, a command, or a question. This step often relies on models trained using supervised learning techniques.</li>
    <li><strong>Response Generation</strong>: Based on the recognized intent and extracted entities, a suitable response is generated, which could be retrieved from a database or generated dynamically using models like GPT (Generative Pre-trained Transformer).</li>
</ul>
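
<p>The entity and intent steps can be illustrated with a deliberately simple rule-based sketch. Real assistants use trained models for both steps; here a hypothetical gazetteer (<code>LOCATIONS</code>, <code>TIME_WORDS</code>) and keyword table (<code>INTENT_KEYWORDS</code>) stand in for those models, purely to show what the inputs and outputs look like.</p>

```python
import re

# Hypothetical gazetteer and keyword rules, made up for this sketch.
LOCATIONS = ["New York", "London", "Mumbai"]
TIME_WORDS = ["today", "tomorrow", "tonight"]
INTENT_KEYWORDS = {
    "get_weather": ["weather", "forecast", "temperature"],
    "set_reminder": ["remind", "reminder"],
}

def extract_entities(query):
    """Return (text, label) pairs found by case-insensitive whole-word matching."""
    entities = []
    for loc in LOCATIONS:
        if re.search(rf"\b{re.escape(loc)}\b", query, flags=re.IGNORECASE):
            entities.append((loc, "LOCATION"))
    for word in TIME_WORDS:
        if re.search(rf"\b{re.escape(word)}\b", query, flags=re.IGNORECASE):
            entities.append((word, "TIME"))
    return entities

def detect_intent(query):
    """Return the first intent whose keyword appears in the query."""
    lowered = query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "unknown"

query = "What is the weather like in New York tomorrow?"
print(detect_intent(query))
print(extract_entities(query))
```

<p>A response generator would then combine the intent (<code>get_weather</code>) with the extracted location and time to look up or compose an answer.</p>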

<h2>3. Machine Translation</h2>
<p><strong>Application:</strong> Machine translation involves automatically translating text from one language to another. Popular translation services such as Google Translate and DeepL use NLP techniques to enable cross-lingual communication.</p>

<p><strong>Role of NLP Pipeline:</strong> The NLP pipeline in machine translation facilitates the conversion of text between different languages. Here’s how each component of the pipeline contributes:</p>
<ul>
    <li><strong>Text Preprocessing</strong>: Text is cleaned and normalized to remove inconsistencies like spelling errors, extra spaces, or unnecessary punctuation, which could affect the translation accuracy.</li>
    <li><strong>Part-of-Speech Tagging and Dependency Parsing</strong>: These techniques help analyze the grammatical structure of the source language to identify relationships between words. This is particularly important for languages with complex grammatical rules.</li>
    <li><strong>Translation Model</strong>: A translation model, often based on deep learning techniques (e.g., Neural Machine Translation models like Seq2Seq or Transformer), is used to learn the mapping between words, phrases, and sentence structures in different languages.</li>
    <li><strong>Post-Processing</strong>: After translation, the text is post-processed to ensure proper sentence structure, word order, and fluency in the target language. This step might involve techniques like grammar checking and reordering sentences.</li>
</ul>

<h2>Conclusion</h2>
<p>Natural Language Processing is transforming industries by providing intelligent solutions to tasks that require understanding, interpreting, and generating human language. Whether it's sentiment analysis for brand monitoring, enabling human-like interactions with virtual assistants, or breaking language barriers with machine translation, the NLP pipeline is crucial in ensuring the success of these applications. As NLP continues to evolve, its impact on real-world applications will only grow stronger.</p>


# Part B: Tokenization

<h1>Word-Level vs Sentence-Level Tokenization</h1>

<p>Tokenization is the process of splitting text into smaller, meaningful units, known as "tokens." It is a fundamental step in Natural Language Processing (NLP). There are different ways to perform tokenization, two of the most common being word-level tokenization and sentence-level tokenization. Let's explore both with examples.</p>

<h2>Word-Level Tokenization</h2>
<p><strong>Definition:</strong> Word-level tokenization involves splitting text into individual words. It breaks the text into a sequence of words or word-like units (including punctuation as separate tokens). This is useful when we need to analyze each word in a sentence separately for tasks like text classification or sentiment analysis.</p>

<p><strong>Example:</strong> Consider the following text:</p>
<blockquote>
    "Natural Language Processing is fascinating! It enables machines to understand human language."
</blockquote>

<p>Applying word-level tokenization would break this text into the following tokens:</p>
<ul>
    <li>Natural</li>
    <li>Language</li>
    <li>Processing</li>
    <li>is</li>
    <li>fascinating</li>
    <li>!</li>
    <li>It</li>
    <li>enables</li>
    <li>machines</li>
    <li>to</li>
    <li>understand</li>
    <li>human</li>
    <li>language</li>
    <li>.</li>
</ul>

<p>As you can see, each word is treated as an individual token, and punctuation marks (like "!" and ".") are also considered as separate tokens. This is typical in word-level tokenization, where we are primarily concerned with the individual components of the text.</p>
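
<p>For intuition, the same token list can be approximated with a single regular expression. This is a simplified sketch, not how NLTK or spaCy tokenize internally: it would split contractions such as "don't" differently from those libraries.</p>

```python
import re

text = "Natural Language Processing is fascinating! It enables machines to understand human language."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```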

<h2>Sentence-Level Tokenization</h2>
<p><strong>Definition:</strong> Sentence-level tokenization, on the other hand, involves splitting text into complete sentences. This is often used in applications where understanding the structure and meaning of entire sentences is necessary, such as in machine translation, summarization, or dialogue systems.</p>

<p><strong>Example:</strong> Using the same text:</p>
<blockquote>
    "Natural Language Processing is fascinating! It enables machines to understand human language."
</blockquote>

<p>After sentence-level tokenization, the text would be split into the following two tokens:</p>
<ul>
    <li>"Natural Language Processing is fascinating!"</li>
    <li>"It enables machines to understand human language."</li>
</ul>

<p>Here, the entire sentences are considered as individual tokens. Sentence-level tokenization helps in tasks where the meaning or context of the entire sentence is important.</p>
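
<p>A naive sentence splitter can be sketched by breaking the text at whitespace that follows sentence-ending punctuation. This assumes the text contains no abbreviations or decimal numbers (e.g., "Dr." or "3.5"), which is exactly where trained sentence tokenizers do better.</p>

```python
import re

text = "Natural Language Processing is fascinating! It enables machines to understand human language."

# Split at whitespace that is preceded by '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
print(sentences)
```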

<h2>Key Differences</h2>
<ul>
    <li><strong>Granularity:</strong> Word-level tokenization breaks the text into smaller units (words and punctuation), whereas sentence-level tokenization divides the text into larger units (entire sentences).</li>
    <li><strong>Use Cases:</strong> Word-level tokenization is used in tasks like text classification, sentiment analysis, and word embeddings, where individual words carry significant meaning. Sentence-level tokenization is useful in tasks like machine translation, summarization, and speech-to-text systems where understanding full sentence structure is essential.</li>
    <li><strong>Context:</strong> Sentence-level tokenization retains more contextual information, as each token preserves a full sentence's meaning. Word-level tokens, taken in isolation, carry less of this context.</li>
</ul>

<h2>Conclusion</h2>
<p>Both word-level and sentence-level tokenization play vital roles in the NLP pipeline. While word-level tokenization is more granular and is used for tasks that focus on individual words, sentence-level tokenization allows for understanding the broader context by working with full sentences. Depending on the specific NLP task at hand, one approach may be more suitable than the other.</p>


## Write Python code using a library (e.g., NLTK or spaCy) to perform tokenization on the text mentioned above.

In [6]:
import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text
text = "Natural Language Processing is fascinating! It enables machines to understand human language."

# Using NLTK for tokenization
nltk.download('punkt')  # Ensure tokenizer data is downloaded (newer NLTK releases may need 'punkt_tab' instead)
word_tokens_nltk = word_tokenize(text)
sentence_tokens_nltk = sent_tokenize(text)

print("NLTK Word Tokenization:", word_tokens_nltk)
print("NLTK Sentence Tokenization:", sentence_tokens_nltk)

# Using spaCy for tokenization
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

word_tokens_spacy = [token.text for token in doc]
sentence_tokens_spacy = [sent.text for sent in doc.sents]

print("\nspaCy Word Tokenization:", word_tokens_spacy)
print("spaCy Sentence Tokenization:", sentence_tokens_spacy)

[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


NLTK Word Tokenization: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!', 'It', 'enables', 'machines', 'to', 'understand', 'human', 'language', '.']
NLTK Sentence Tokenization: ['Natural Language Processing is fascinating!', 'It enables machines to understand human language.']

spaCy Word Tokenization: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!', 'It', 'enables', 'machines', 'to', 'understand', 'human', 'language', '.']
spaCy Sentence Tokenization: ['Natural Language Processing is fascinating!', 'It enables machines to understand human language.']


In [3]:
!pip install spacy



In [5]:
!python -m spacy download en_core_web_sm



# Part C: Stemming and Lemmatization

## **Comparison: Stemming vs. Lemmatization**

Both **stemming** and **lemmatization** are text preprocessing techniques used in Natural Language Processing (NLP) to reduce words to their base or root form. However, they differ in their approach and accuracy.

---

## **1. Stemming**
Stemming is a rule-based process that removes suffixes from words to obtain a root form (the stem). The stem is often not a valid dictionary word.

### **Example:**
- "Running" → "Run"
- "Happily" → "Happili"
- "Studies" → "Studi"

### **Key Characteristics:**
- Uses heuristic rules to chop off suffixes (and, in some stemmers, prefixes).
- Does not consider the actual meaning of words.
- Produces stem words that may not always be valid words.

### **Common Stemmers:**
- Porter Stemmer
- Snowball Stemmer
- Lancaster Stemmer
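
The idea of rule-based suffix stripping can be shown with a toy stemmer. This is a simplified sketch, not the Porter algorithm: the rule list `SUFFIX_RULES` and the minimum stem length are assumptions chosen so that the examples above come out the same way.

```python
# Hypothetical suffix rules, checked in order; not the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ily", "ili"), ("ed", ""), ("s", "")]
VOWELS = set("aeiou")

def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least three characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            # Undo consonant doubling after removing "ing", e.g. "runn" -> "run".
            if suffix == "ing" and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                stem = stem[:-1]
            return stem
    return word

for w in ["Running", "Happily", "Studies", "Playing"]:
    print(w, "->", toy_stem(w))
```

Because the rules never consult a dictionary, outputs such as `happili` and `studi` are accepted even though they are not real words.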

---

## **2. Lemmatization**
Lemmatization reduces a word to its **base or dictionary form (lemma)** using linguistic rules and vocabulary.

### **Example:**
- "Running" → "Run"
- "Happily" → "Happily" (adverbs are already in dictionary form; deriving "Happy" would change the part of speech)
- "Studies" → "Study"
- "Better" → "Good"

### **Key Characteristics:**
- Considers the word's **part of speech** (and, in some tools, its sentence context) to choose the correct lemma.
- Uses **lexical databases** like WordNet.
- Produces valid words that exist in the dictionary.

### **Common Lemmatizers:**
- WordNet Lemmatizer (NLTK)
- spaCy Lemmatizer
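
To show why the part of speech matters, here is a toy dictionary-based lemmatizer. The table `LEMMA_TABLE` is a hypothetical mini-lexicon invented for this sketch; a real lemmatizer would consult a resource such as WordNet and apply morphological rules as well.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("children", "n"): "child",
}

def toy_lemmatize(word, pos):
    """Look up (word, pos); fall back to the lowercased word when unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("Better", "a"))    # adjective entry exists
print(toy_lemmatize("Better", "n"))    # no noun entry, so the word is returned unchanged
print(toy_lemmatize("Children", "n"))
```

The same surface form can map to different lemmas, or stay unchanged, depending on the part of speech that is supplied.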



### Given the following words, perform stemming and lemmatization. Use Python for implementation and include the code snippet and output:
* Playing
* Studies
* Happier
* Knives
* Children
* Easily
* Faster
* Caring

In [9]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Ensure necessary NLTK data is downloaded
nltk.download('wordnet')
nltk.download('omw-1.4')

# List of words
words = ["Playing", "Studies", "Happier", "Knives", "Children", "Easily", "Faster", "Caring"]

# Stemming using PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word.lower()) for word in words]

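# Lemmatization using WordNetLemmatizer
# The lemma depends on the part of speech, so each word is tried as a verb and as a noun.
# Comparatives such as "happier" and "faster" are not reduced under either tag.
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word.lower(), pos='v') for word in words]  # pos='v': treat each word as a verb
lemmatized_words_noun = [lemmatizer.lemmatize(word.lower(), pos='n') for word in words]  # pos='n': treat each word as a noun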

# Print results
print("Original Words:", words)
print("\nStemmed Words:", stemmed_words)
print("\nLemmatized Words (Verb Lemmatization):", lemmatized_words)
print("\nLemmatized Words (Noun Lemmatization):", lemmatized_words_noun)

Original Words: ['Playing', 'Studies', 'Happier', 'Knives', 'Children', 'Easily', 'Faster', 'Caring']

Stemmed Words: ['play', 'studi', 'happier', 'knive', 'children', 'easili', 'faster', 'care']

Lemmatized Words (Verb Lemmatization): ['play', 'study', 'happier', 'knives', 'children', 'easily', 'faster', 'care']

Lemmatized Words (Noun Lemmatization): ['playing', 'study', 'happier', 'knife', 'child', 'easily', 'faster', 'caring']


[nltk_data] Downloading package wordnet to C:\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
