# NLP (Natural Language Processing)

NLP is a branch of AI that allows computers to understand, interpret, and generate human language.

Goal: Enable machines to work with text and speech like humans do.

# In Layman's Terms

NLP is teaching computers to understand and work with human language.


# Why Is NLP Needed?

1. Huge Amounts of Text Data

2. Making Machines Understand Human Language

3. Automating Tasks

4. Extracting Insights from Data

5. Enhancing Human-Computer Interaction

# Common NLP Tasks

1. Text Classification

Goal: Assign a label to a piece of text.

2. Sentiment Analysis

Goal: Detect emotion or opinion in text.

3. Part-of-Speech (POS) Tagging

Goal: Identify the role of each word in a sentence.

4. Machine Translation

Goal: Translate text from one language to another.

5. Text Summarization

Goal: Summarize long text into short, meaningful content.

6. Question Answering

Goal: Answer questions based on given text.

7. Speech-to-Text / Text-to-Speech

Goal: Convert spoken audio into text, or text into spoken audio.

8. Named Entity Recognition (NER)

Goal: Identify important entities in text, like names, dates, locations.

9. Topic Modeling

Goal: Find hidden topics or themes in a large set of documents.

10. Text Generation

Goal: Generate human-like text based on input.

# Approaches to NLP

Deep Learning / Neural Network NLP

Modern NLP / Transformer-Based Models

Statistical / Machine Learning NLP

Rule-Based / Symbolic NLP



# Heuristic Approach

A heuristic is a smart shortcut or rule of thumb for solving a problem. It is quick and usually good enough, but it is not guaranteed to be correct.

Examples

Finding a Store in a New City

Spell Check on Your Phone

Chatbots
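As a hedged illustration of a heuristic in NLP, the sketch below guesses sentiment by counting hand-picked positive and negative keywords. The word lists and the function name `heuristic_sentiment` are assumptions made for this example, not part of any library.

```python
# Hypothetical keyword lists, chosen for illustration only
POSITIVE_WORDS = {"good", "great", "love", "amazing"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "awful"}

def heuristic_sentiment(text):
    """Guess sentiment with a rule of thumb: count positive vs. negative words."""
    # Lowercase, split on whitespace, and strip punctuation around each word
    words = [w.strip(".,!?") for w in text.lower().split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"

print(heuristic_sentiment("I love NLP, it is amazing!"))
print(heuristic_sentiment("This movie was terrible."))
```

Like any rule of thumb, this shortcut breaks easily: "not good" is still counted as positive because the rule ignores negation.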



# Challenges in NLP (Natural Language Processing)

1. Ambiguity (Words with Multiple Meanings)

Problem: Words can have more than one meaning depending on context.

2. Sarcasm and Irony

Problem: Computers find it hard to detect sarcasm.

3. Context Understanding

Problem: Words get meaning from other words in a sentence or paragraph.

4. Slang, Abbreviations, and Misspellings

Problem: People write informally online.

5. Different Languages and Dialects

Problem: NLP models trained in one language may not work in another.

6. Data Scarcity

Problem: Modern NLP models need lots of data.

7. Ambiguity in Grammar and Syntax

Problem: Same sentence can have multiple interpretations.

8. Understanding Emotion and Sentiment

9. Spelling Errors


# NLP Pipeline

Think of NLP like making a robot understand human language. You need to clean, break down, and convert text into numbers so a computer can process it.

# Steps to Build End-to-End NLP Software

Step 1: Text Collection

Step 2: Text Preprocessing (Clean and simplify the text)

Step 3: Tokenization

Step 4: Text Representation (Word to Numbers)

Step 5: Model / Algorithm

Step 6: Output / Prediction




Raw Text (Tweets, Reviews)

        │
        ▼

Text Preprocessing
(clean text, remove noise)

        │
        ▼

Tokenization
(split into words/sentences)

        │
        ▼

Text Representation
(convert words to numbers)

        │
        ▼

Model / Algorithm
(train or predict)

        │
        ▼

Output / Prediction
(sentiment, spam, translation, chatbot reply)
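The pipeline steps can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions: `VOCABULARY`, `WEIGHTS`, and the function names are hypothetical, and the "model" is a hand-set scoring rule standing in for a trained algorithm.

```python
import string
from collections import Counter

VOCABULARY = ["love", "great", "hate", "bad"]  # hypothetical vocabulary
WEIGHTS = [1, 1, -1, -1]                       # hypothetical hand-set weights

def preprocess(text):
    """Step 2: lowercase the text and remove punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    """Step 3: split on whitespace (a deliberately simple tokenizer)."""
    return text.split()

def to_vector(tokens):
    """Step 4: bag-of-words counts, one number per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in VOCABULARY]

def predict(vector):
    """Step 5: a stand-in scoring rule instead of a trained model."""
    score = sum(w * x for w, x in zip(WEIGHTS, vector))
    return "positive" if score > 0 else "negative"

raw_text = "I LOVE this phone, great battery!"  # Step 1: collected text
tokens = tokenize(preprocess(raw_text))
vector = to_vector(tokens)
print(tokens)
print(vector)
print(predict(vector))  # Step 6: output / prediction
```

In a real project, Step 5 would be a model trained on labeled data rather than fixed weights.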

# Text Preprocessing

Text preprocessing is the step where we clean and prepare raw text so computers can understand it.

# Why Needed:

Computers don’t understand messy human text (punctuation, slang, capitalization).

Cleaning text improves accuracy of NLP models.

Helps computers understand words consistently. (**Computers see text as exact characters, so even a small difference makes them think it’s a different word.**)

# Layman Analogy:

Like washing, peeling, and chopping vegetables before cooking.

# Lowercasing

Lowercasing is the process of converting all letters in text to lowercase.

# Why it’s needed:

Computers see “Apple” and “apple” as different words.

Lowercasing helps treat the same word uniformly, improving NLP model accuracy.



In [1]:
# Example

text = "I Love NLP"

text_lower = text.lower()

print(text_lower)  # Output: "i love nlp"

i love nlp


# Remove HTML Tags (Using Regex)

HTML tags are the code in a webpage that formats text, like `<p>`, `<a>`, `<b>`, `<div>`.

# Why it’s needed:

When you scrape text from websites, you often get tags like `<p>` or `<a href="">`.

These tags are not meaningful for NLP tasks.

Keeping them can confuse the computer.
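A minimal sketch of regex-based tag removal, assuming simple, well-formed tags. The function name `remove_html_tags` and the sample string are made up for this example.

```python
import re

def remove_html_tags(text):
    """Delete anything shaped like a tag: '<', then non-'>' characters, then '>'."""
    return re.sub(r"<[^>]+>", "", text)

html_text = '<p>I love <b>NLP</b>. Visit <a href="https://example.com">this page</a>.</p>'
print(remove_html_tags(html_text))
# Output: "I love NLP. Visit this page."
```

A regex like this is not a full HTML parser. It can misbehave on comments, scripts, or attribute values containing `>`, so an HTML parsing library is usually the safer choice for messy pages.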

# Remove URLs (Text Preprocessing Step)

URLs are links in text, like https://example.com or www.google.com.

# Why it’s needed:

URLs are not useful for most NLP tasks like sentiment analysis or text classification.

Keeping them can confuse the model.
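One way to strip URLs is a regular expression that matches text starting with `http://`, `https://`, or `www.`. The pattern and the function name `remove_urls` below are assumptions for illustration and will not catch every possible URL form.

```python
import re

# Matches "http://..." or "https://..." or "www...." up to the next whitespace
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text):
    """Remove URLs, then collapse the extra spaces they leave behind."""
    text = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

text = "Check https://example.com and www.google.com for more info"
print(remove_urls(text))
# Output: "Check and for more info"
```

Because the pattern runs to the next whitespace, punctuation attached to a URL (such as a trailing comma) is removed along with it.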



# Remove Punctuation

Removing punctuation cleans the text so that NLP models focus on meaningful words, not symbols.

# Why it’s needed:

Most punctuation does not carry meaning for NLP tasks like sentiment analysis or text classification.

Keeping it can confuse the computer or create unnecessary tokens.

In [2]:
import string

text = "Hello! I love NLP, do you?"

# Remove punctuation
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)
# Output: "Hello I love NLP do you"

Hello I love NLP do you


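# Chat Word Statement

A chat word statement is a sentence or text written in casual chat language, like what people use in texting, social media, or instant messaging. Expanding chat abbreviations into full words lets the model treat them like ordinary text.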





In [3]:
import re

chat_text = "hey! r u coming 2 the party? lol 😄"

# Map chat abbreviations to their full forms
chat_words = {"r": "are", "u": "you", "2": "to", "lol": "laugh out loud"}

# Replace whole words only, so the "r" inside "party" is left untouched
chat_text = re.sub(r"\b\w+\b", lambda m: chat_words.get(m.group().lower(), m.group()), chat_text)
print(chat_text)
# Output: "hey! are you coming to the party? laugh out loud 😄"

hey! are you coming to the party? laugh out loud 😄


# Spelling Correction

Spelling correction is the process of detecting and fixing typos or misspelled words in text.

# Why It’s Needed

People make typos in chat, reviews, social media, and emails.

Correcting spelling helps improve NLP model accuracy.



In [4]:
from textblob import TextBlob

text = "I luv NLP and it is amzing!"
corrected_text = str(TextBlob(text).correct())
print(corrected_text)
# Note: the correction is not perfect; "luv" is not turned into "love" (see the output below)

I lui NLP and it is amazing!


# Removing Stop Words

Stopwords are common words in a language that don’t add much meaning to a sentence, such as "the", "is", and "and".

# WHY ?

Stopwords don’t add value for many NLP tasks.

Removing them reduces noise and makes models faster.
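A small sketch of stop-word removal. The `STOP_WORDS` set here is a tiny hand-written list assumed for illustration; real projects typically use a fuller list, for example the one NLTK provides through `nltk.corpus.stopwords`.

```python
# A tiny hand-written stop-word list, for illustration only
STOP_WORDS = {"i", "a", "an", "the", "is", "am", "are", "and", "of", "to", "in", "it"}

def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

print(remove_stop_words("The movie is a masterpiece and I love it"))
# Output: ['movie', 'masterpiece', 'love']
```

Stop-word removal is not always safe. Some lists include negations such as "not", and dropping those can flip the meaning of a sentence in sentiment analysis.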

# Tokenization

Tokenization is the process of breaking text into smaller pieces called tokens.

Tokens can be words, phrases, or sentences depending on the task.

# Why It’s Needed:

Computers cannot understand full sentences directly.

Breaking text into tokens makes it easier to analyze, count, or process.

# Word-Level Tokenization

Break a text into individual words (tokens).

# Why:

Most NLP tasks like sentiment analysis, word counting, or text classification work at the word level

# Example

Input: "I love Natural Language Processing!"
Word Tokens: ["I", "love", "Natural", "Language", "Processing", "!"]

# Sentence-Level Tokenization

Break a text into sentences (tokens).

# Why:

Useful for summarization, translation, or understanding context.

Helps analyze one sentence at a time.

# Example

Input: "I love NLP. It is amazing!"
Sentence Tokens: ["I love NLP.", "It is amazing!"]

# Why We Need Tokenization

Tokenization is the first step in NLP after cleaning text. It’s essential because computers cannot understand raw text as humans do.

Computers Read Numbers, Not Text

Easier Analysis
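To make "computers read numbers, not text" concrete, the sketch below gives each distinct token an integer ID. The function name `build_vocabulary` and the sample tokens are assumptions for this example; real systems build the vocabulary from a whole corpus.

```python
def build_vocabulary(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary

tokens = ["i", "love", "nlp", "and", "i", "love", "python"]
vocabulary = build_vocabulary(tokens)

# Replace every token with its ID so the text becomes a list of numbers
token_ids = [vocabulary[token] for token in tokens]

print(vocabulary)
print(token_ids)
# Output: [0, 1, 2, 3, 0, 1, 4]
```

Once the text is split into tokens, counting them or mapping them to numbers like this becomes straightforward, which is why tokenization comes before text representation.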



# Code


In [13]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')

text = "I love Natural Language Processing!"

word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)

print("Sentence Tokens:", sentence_tokens)
print("Word Tokens:", word_tokens)

Sentence Tokens: ['I love Natural Language Processing!']
Word Tokens: ['I', 'love', 'Natural', 'Language', 'Processing', '!']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# Stemming

Stemming is the process of reducing words to their root or base form. The result (the stem) is not always a real dictionary word.

It removes suffixes like “-ing”, “-ed”, “-ly”, etc.

# Why It’s Needed:

Words can have many forms, but they often mean the same thing.

Stemming helps NLP models treat similar words as the same.
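The idea of suffix stripping can be sketched with a deliberately naive stemmer. The `SUFFIXES` tuple and the function `naive_stem` are hypothetical and far simpler than real algorithms such as the Porter stemmer (available in NLTK as `nltk.stem.PorterStemmer`).

```python
SUFFIXES = ("ing", "ed", "ly", "s")  # hypothetical list, far from complete

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        # The length check protects short words such as "red" or "is"
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["playing", "played", "plays", "quickly", "running"]
print([naive_stem(word) for word in words])
# Output: ['play', 'play', 'play', 'quick', 'runn']
```

The last result, "runn", shows why stems are not always real words: the rule cuts off the suffix without checking a dictionary.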

# Lemmatization (it returns actual dictionary words)

Lemmatization is the process of reducing words to their base or dictionary form (lemma).

# Why It’s Needed:

Words have different forms (tense, plural, etc.), but their meaning is the same.

Lemmatization helps NLP models understand words correctly and consistently.

# Example

Words: "running", "ran", "better", "wolves"
Lemmatized: "run", "run", "good", "wolf"

# Code



In [14]:
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

# Each word is paired with its part of speech: 'v' = verb, 'a' = adjective, 'n' = noun.
# Using pos='v' for every word would leave "better" and "wolves" unchanged.
words = [("running", "v"), ("ran", "v"), ("better", "a"), ("wolves", "n")]

lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]

print(lemmatized_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


['run', 'run', 'good', 'wolf']
