# Introduction to NLP and Text Processing
Welcome to the first session on Natural Language Processing (NLP). In this notebook, we will cover:
- Overview of NLP
- Techniques for text normalization
- Basics of tokenization and stopword removal
- Implementing stemming and lemmatization


## What is NLP?
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) focused on enabling machines to understand, interpret, and generate human language.

### Applications of NLP:
- Sentiment Analysis
- Chatbots
- Machine Translation
- Text Summarization
- Speech Recognition


## Text Normalization
Text normalization involves cleaning and standardizing text data. Common steps include:
1. Lowercasing
2. Removing punctuation
3. Expanding contractions
4. Removing special characters
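The four steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a complete normalizer: the `CONTRACTIONS` map and the `normalize` helper are hypothetical names introduced here, and the map covers only a handful of contractions.

```python
import re

# Hypothetical, deliberately tiny contraction map; a real project would use a fuller list.
CONTRACTIONS = {
    "let's": "let us",
    "it's": "it is",
    "don't": "do not",
    "i'm": "i am",
}

def normalize(text: str) -> str:
    """Apply the four steps listed above and return the cleaned string."""
    # Step 1: lowercase
    text = text.lower()
    # Step 3: expand contractions first, while the apostrophes still exist
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Steps 2 and 4: replace punctuation and special characters with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hello, World! Let's explore TEXT processing."))
# hello world let us explore text processing
```

Contractions are expanded before punctuation is removed; in the opposite order "let's" would already have become "lets" and the lookup would miss it.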


In [None]:
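# Example Text
text = "Hello, World! This is an NLP class. Let's explore TEXT processing. i am also doing a B.d, i bought this pen for 9.99$. "

# Naive whitespace splitting keeps punctuation attached to the words
text.split()

['Hello,',
 'World!',
 'This',
 'is',
 'an',
 'NLP',
 'class.',
 "Let's",
 'explore',
 'TEXT',
 'processing.',
 'i',
 'am',
 'also',
 'doing',
 'a',
 'B.d,',
 'i',
 'bought',
 'this',
 'pen',
 'for',
 '9.99$.']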

In [None]:
text = "hello world this is an nlp class lets explore text processing"
text.split()

['hello',
 'world',
 'this',
 'is',
 'an',
 'nlp',
 'class',
 'lets',
 'explore',
 'text',
 'processing']

In [None]:
import re

# Example Text
text = "Hello, World! This is an NLP class. Let's explore TEXT processing."

# Lowercasing
text_lower = text.lower()

# Removing punctuation
text_no_punct = re.sub(r'[^\w\s]', '', text_lower)

# Result
print("Original Text:", text)
print("Lowercased Text:", text_lower)
print("Text without Punctuation:", text_no_punct)


Original Text: Hello, World! This is an NLP class. Let's explore TEXT processing.
Lowercased Text: hello, world! this is an nlp class. let's explore text processing.
Text without Punctuation: hello world this is an nlp class lets explore text processing


## Tokenization
Tokenization is the process of breaking down text into smaller units called tokens. Tokens can be words, sentences, or characters.
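To make the three granularities concrete, here is a rough sketch using only the `re` module. The regular expressions are simplifying assumptions (the sentence splitter, for instance, would break on abbreviations such as "Dr. Smith"), so treat it as an illustration rather than a replacement for a trained tokenizer.

```python
import re

text = "Tokens can be words. They can be sentences!"

# Sentence tokens: split on whitespace that follows ., ! or ? (naive rule)
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokens: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

# Character tokens: every non-space character
chars = [c for c in text if not c.isspace()]

print(sentences)  # ['Tokens can be words.', 'They can be sentences!']
print(words)      # ['Tokens', 'can', 'be', 'words', '.', 'They', 'can', 'be', 'sentences', '!']
print(chars[:6])  # ['T', 'o', 'k', 'e', 'n', 's']
```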


In [None]:
import nltk
# nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

# Example Text
text = "Tokenization splits text into meaningful pieces. It's a crucial step in NLP!"

# Word Tokenization
word_tokens = word_tokenize(text)

# Sentence Tokenization
sentence_tokens = sent_tokenize(text)

print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sentence_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Word Tokens: ['Tokenization', 'splits', 'text', 'into', 'meaningful', 'pieces', '.', 'It', "'s", 'a', 'crucial', 'step', 'in', 'NLP', '!']
Sentence Tokens: ['Tokenization splits text into meaningful pieces.', "It's a crucial step in NLP!"]


## Stopword Removal
Stopwords are common words (e.g., "is", "the", "in") that are usually removed to focus on meaningful words.


In [None]:
# A hand-written stopword list. A triple-quoted string may span several lines;
# call stopword.split() to get the individual words.
stopword = """i me my myself we our ours ourselves you your yours yourself yourselves
he him his himself she her hers herself it its itself they them their theirs themselves
what which who whom this that these those am is are was were be been being
have has had having do does did doing a an the and but if or because as until while
of at by for with about against between into through during before after above below
to from up down in out on off over under again further then once here there when where why how
all any both each few more most other some such no nor not only own same so than too very
s t can will just don should now"""

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Example Text
text = "This is a simple example to demonstrate stopword removal. is there are we "

# Tokenize Text
tokens = word_tokenize(text)

# Load Stopwords
stop_words = set(stopwords.words('english'))

# Remove Stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original Tokens:", tokens)
print("Filtered Tokens (Without Stopwords):", filtered_tokens)



Original Tokens: ['This', 'is', 'a', 'simple', 'example', 'to', 'demonstrate', 'stopword', 'removal', '.', 'is', 'there', 'are', 'we']
Filtered Tokens (Without Stopwords): ['simple', 'example', 'demonstrate', 'stopword', 'removal', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 ...}


## Exercise
1. Write a function to normalize text (lowercase, remove punctuation).
2. Tokenize the normalized text.
3. Remove stopwords from the tokens.
4. Test your function with a sample paragraph.


In [None]:
def preprocess_text(text):
    # Normalize
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove Stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return filtered_tokens

# Test with Example Text
sample_text = """Artificial intelligence (AI) refers to the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.[1] Such machines may be called AIs."""
print(preprocess_text(sample_text))


['artificial', 'intelligence', 'ai', 'refers', 'capability', 'computational', 'systems', 'perform', 'tasks', 'typically', 'associated', 'human', 'intelligence', 'learning', 'reasoning', 'problemsolving', 'perception', 'decisionmaking', 'field', 'research', 'computer', 'science', 'develops', 'studies', 'methods', 'software', 'enable', 'machines', 'perceive', 'environment', 'use', 'learning', 'intelligence', 'take', 'actions', 'maximize', 'chances', 'achieving', 'defined', 'goals1', 'machines', 'may', 'called', 'ais']
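The output above contains fused tokens such as 'problemsolving', 'decisionmaking', and 'goals1', because punctuation is deleted rather than replaced. One possible adjustment, sketched below with a hypothetical helper name, is to strip bracketed citation markers first and then turn the remaining punctuation into spaces.

```python
import re

def normalize_keep_boundaries(text: str) -> str:
    """Lowercase, drop citation markers such as [1], and turn punctuation into spaces."""
    text = text.lower()
    # Remove numeric citation markers like "[1]" before touching other punctuation
    text = re.sub(r"\[\d+\]", "", text)
    # Replace (rather than delete) punctuation so hyphenated words split into two tokens
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize_keep_boundaries("problem-solving and decision-making achieve defined goals.[1]"))
# problem solving and decision making achieve defined goals
```

The trade-off is that contractions also split ("let's" becomes "let s"), so this variant works best when contractions are expanded beforehand.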


# Case Study: Text Preprocessing for Sentiment Analysis in NLP
This case study focuses on applying various preprocessing techniques to a real-world dataset: the IMDB Movie Reviews Dataset. The dataset contains 50,000 movie reviews labeled as positive or negative, making it ideal for exploring text preprocessing and its impact on data quality.

**Objective**
To preprocess the IMDB movie reviews dataset using standard NLP techniques, including:

*   Text Normalization   
*   Tokenization
*   Stopword Removal

The processed data will be analyzed to understand its structure and utility for downstream NLP tasks such as sentiment analysis.

**Dataset**
The IMDB Movie Reviews Dataset consists of the following:

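**Columns** (as loaded with the Hugging Face `datasets` library):
* **text**: The movie review text.
* **label**: Sentiment label (0 = negative, 1 = positive).

**Source:** [Kaggle IMDB Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

The dataset can also be loaded programmatically with the Hugging Face `datasets` library, which splits the 50,000 labeled reviews into 25,000 for training and 25,000 for testing.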

# Methodology
# 1. Loading and Exploring the Dataset

The dataset is loaded into a pandas DataFrame for inspection and preprocessing. The structure of the data is explored to understand the distribution of positive and negative reviews.

In [None]:
# Import libraries
import pandas as pd
from datasets import load_dataset

# Load the IMDB dataset (the train split holds 25,000 of the 50,000 labeled reviews)
dataset = load_dataset('imdb')
df = dataset['train'].to_pandas()

# Inspect the dataset
print(df.head())
print("Dataset Shape:", df.shape)
print("Label Distribution:\n", df['label'].value_counts())


# 2. Text Normalization
Normalization involves cleaning the text to make it consistent. Key steps include:

* Converting text to lowercase.
* Removing punctuation and special characters.

In [None]:
import re

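def normalize_text(text):
    # Convert to lowercase
    text = text.lower()
    # IMDB reviews contain HTML line breaks (<br />); replace tags with a space
    # so that "br" does not survive as a token
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    return text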

# Apply normalization
df['normalized_text'] = df['text'].apply(normalize_text)


# 3. Tokenization
Tokenization splits the text into smaller units, typically words or sentences. For this study, word tokenization is applied to each review.

In [None]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases

def tokenize_text(text):
    return word_tokenize(text)

# Apply tokenization
df['tokens'] = df['normalized_text'].apply(tokenize_text)


# 4. Stopword Removal
Stopwords (e.g., "the", "is", "and") are common words that carry little semantic meaning. Removing them helps focus on meaningful content.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

# Apply stopword removal
df['filtered_tokens'] = df['tokens'].apply(remove_stopwords)


# 5. Analysis and Visualization
To evaluate preprocessing, we calculate word counts before and after preprocessing and visualize the most common words in the dataset.

In [None]:
df['original_word_count'] = df['text'].apply(lambda x: len(x.split()))
df['processed_word_count'] = df['filtered_tokens'].apply(len)

# Compare average word counts
print("Average Word Count (Original):", df['original_word_count'].mean())
print("Average Word Count (Processed):", df['processed_word_count'].mean())


**Visualization:** Using a bar chart, we display the most frequent words in the processed reviews.

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

# Count word frequencies
all_words = [word for tokens in df['filtered_tokens'] for word in tokens]
word_counts = Counter(all_words)
common_words = word_counts.most_common(10)

# Plot the most common words
words, counts = zip(*common_words)
plt.bar(words, counts)
plt.title("Most Common Words in Processed Reviews")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()


# Insights
* Normalization Impact: Lowercasing and removing punctuation collapse variants such as "Great", "great", and "great!" into one token, making the vocabulary more consistent.
* Tokenization Insights: Splitting each review into word tokens makes word-level analyses such as frequency counts possible.
* Stopword Removal Impact: Removing stopwords strips out high-frequency function words, so the remaining tokens are mostly content words.
* Word Count Comparison: The processed word count is lower than the original because stopwords are dropped; the gap between the two averages shows how much of each review consisted of stopwords.

# Stemming and Lemmatization


# Setup

In [None]:
# Install necessary libraries
!pip install nltk scikit-learn
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

## Understanding and Applying Stemming and Lemmatization

### Explanation:
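- **Stemming** reduces words to a stem by stripping suffixes with heuristic rules (e.g., "running" -> "run"); the result is not always a dictionary word.
- **Lemmatization** maps words to their dictionary base form (lemma) using a vocabulary and the word's part of speech (e.g., "better" -> "good" when treated as an adjective). NLTK's `WordNetLemmatizer` assumes a noun unless a `pos` argument is passed, which is why "are" and "hanging" come out unchanged in the example that follows.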


In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Example Text
text = "The striped bats are hanging on their feet for best"

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in word_tokenize(text)]
print("Stemmed Words:", stemmed_words)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in word_tokenize(text)]
print("Lemmatized Words:", lemmatized_words)

Stemmed Words: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
Lemmatized Words: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best']
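The difference between the two approaches can be illustrated without any library. The sketch below is a toy: `toy_stem` applies a few made-up suffix rules and is not the Porter algorithm, and `LEMMAS` is a hypothetical four-entry dictionary standing in for a real lexicon such as WordNet.

```python
# Toy suffix rules, checked in order. This is NOT the Porter algorithm, only an illustration.
SUFFIXES = ["ing", "ed", "es", "s"]

# Hypothetical mini-dictionary standing in for a real lexicon such as WordNet.
LEMMAS = {"feet": "foot", "studies": "study", "hanging": "hang", "better": "good"}

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the word itself when it is not in the dictionary."""
    return LEMMAS.get(word, word)

for w in ["hanging", "studies", "feet", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# hanging -> hang | hang
# studies -> studi | study
# feet -> feet | foot
# better -> better | good
```

Rule-based stripping is cheap but can produce non-words ("studi") and misses irregular forms ("feet"), while lookup handles irregular forms but only for words the dictionary knows.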


# Case Study: Text Preprocessing

#### **Background**
Natural Language Processing (NLP) requires converting raw text into numerical features for machine learning models. This case study demonstrates the application of text preprocessing to prepare text data for analysis.

---

#### **Objective**
- Preprocess text data: cleaning, tokenization, stopword removal, stemming, and lemmatization.



#### **Dataset**
For this case study, we use a sample corpus consisting of multiple sentences related to Natural Language Processing (NLP).


In [None]:
# Sample Corpus
corpus = [
    "Natural Language Processing is a subfield of AI.",
    "Deep Learning is part of machine learning.",
    "NLP applications are amazing and diverse.",
    "AI is transforming industries through NLP."
]

#### **Text Preprocessing**


In [None]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Preprocessing Function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove everything except letters and whitespace (digits and punctuation are dropped)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Apply stemming
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    # Apply lemmatization to the stems. Stems are often not dictionary words,
    # so this step changes little; many pipelines use stemming or lemmatization, not both.
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    return ' '.join(lemmatized_tokens)

# Apply preprocessing to the corpus
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]
print("Preprocessed Corpus:", preprocessed_corpus)
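The Background section mentions turning text into numerical features. As a rough sketch of what that step could look like, the snippet below builds bag-of-words count vectors with the standard library only. The `docs` list is hypothetical sample input, not the actual output of the cell above.

```python
from collections import Counter

# Hypothetical preprocessed documents (space-joined tokens), not the real output above.
docs = ["deep learning part machine learning", "nlp applications amazing diverse"]

# Vocabulary: every distinct token, sorted so the column order is reproducible
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One count vector per document, aligned with `vocab`
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[tok] for tok in vocab])

print(vocab)
# ['amazing', 'applications', 'deep', 'diverse', 'learning', 'machine', 'nlp', 'part']
print(vectors)
# [[0, 0, 1, 0, 2, 1, 0, 1], [1, 1, 0, 1, 0, 0, 1, 0]]
```

Each row is one document and each column one vocabulary word, which is the shape most classical machine learning models expect.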