
**Scenario-Based Question 1: Customer Feedback Analysis**

**Step 0: Setup and Imports**

*First, I need to import the tools and download the "knowledge packs" (corpora) that NLTK uses to identify words and grammar.*

In [4]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

**Tasks 1 & 2: Load Data and Normalize**

*I will take the example review and a few others. I'll use regular expressions (regex) to strip out punctuation, which is more concise and typically faster than a manual character loop.*

In [5]:
# 1. Load sample set
reviews = [
    "Loved the products!!! Delivery was sooo slow 😊",
    "The item is amazing, but the packaging was damaged.",
    "I'm not happy with the service. Won't buy again!",
    "Fast delivery and great quality. Highly recommended."
]

def normalize_text(text):
    # Lowercasing
    text = text.lower()
    # Remove punctuation, digits, emoji, and other special characters (keep only letters and whitespace).
    # Note: apostrophes are stripped too, so "won't" becomes "wont".
    text = re.sub(r'[^a-z\s]', '', text)
    return text

normalized_reviews = [normalize_text(r) for r in reviews]
print(f"Normalized: {normalized_reviews[0]}")

Normalized: loved the products delivery was sooo slow 
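
The third review contains contractions ("I'm", "Won't"), and stripping the apostrophe leaves "im" and "wont", which no longer look like negations. A minimal sketch of one way around this is to expand contractions before removing punctuation. The `CONTRACTIONS` map and the function name `normalize_with_contractions` are hypothetical and deliberately tiny; they are not part of the lab code above.

```python
import re

# Hypothetical, deliberately small contraction map (an assumption for illustration);
# a real project would need a much fuller list.
CONTRACTIONS = {"won't": "will not", "can't": "cannot", "i'm": "i am", "n't": " not"}

def normalize_with_contractions(text):
    text = text.lower()
    # Replace longer keys first so "won't" is handled before the generic "n't" rule
    for short, full in sorted(CONTRACTIONS.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(short, full)
    # Keep only letters and whitespace, as in normalize_text
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse repeated whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(normalize_with_contractions("I'm not happy with the service. Won't buy again!"))
# i am not happy with the service will not buy again
```

The final whitespace cleanup also removes the trailing space that is visible in the normalized output above.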


**Tasks 3 & 4: Tokenization and Stop Word Removal**

*Tokenization splits each sentence into individual words. Stop words are common words such as "the," "is," and "in" that carry little sentiment on their own.*

In [6]:
stop_words = set(stopwords.words('english'))

processed_tokens = []
for review in normalized_reviews:
    # 3. Word Tokenization
    tokens = word_tokenize(review)

    # 4. Remove Stop Words
    filtered_tokens = [w for w in tokens if w not in stop_words]
    processed_tokens.append(filtered_tokens)

print(f"Tokens after stop word removal: {processed_tokens[0]}")

Tokens after stop word removal: ['loved', 'products', 'delivery', 'sooo', 'slow']
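
One caveat for sentiment work: NLTK's English stop word list includes negations such as "not", so "not happy" would be reduced to "happy". The sketch below shows one possible way to keep negations by subtracting them from the stop list. The names `NEGATIONS` and `remove_stop_words_keep_negation` are hypothetical, and `sample_stop_words` is a hand-written stand-in for `stopwords.words('english')` so the example runs without any downloads.

```python
# Words we want to survive stop word removal (an assumption; adjust per task)
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stop_words_keep_negation(tokens, stop_words, keep=NEGATIONS):
    # Drop the protected words from the stop list before filtering
    effective_stop_words = set(stop_words) - set(keep)
    return [t for t in tokens if t not in effective_stop_words]

# Stand-in for set(stopwords.words('english')), kept small for illustration
sample_stop_words = {"i", "am", "with", "the", "not", "will", "again"}

tokens = ["i", "am", "not", "happy", "with", "the", "service"]
print(remove_stop_words_keep_negation(tokens, sample_stop_words))
# ['not', 'happy', 'service']
```

In the notebook itself, the same function could be called with the real `stop_words` set instead of the stand-in.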


**Task 5: Stemming vs. Lemmatization**

This is a crucial part of this lab.

Stemming: A "crude", rule-based method that chops off the ends of words and can leave a non-word behind (e.g., "delivery" becomes "deliveri").

Lemmatization: A "smart" method that uses a dictionary to find the root word (e.g., "better" becomes "good" when it is tagged as an adjective).

In [10]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Let's compare them on a single review
example_tokens = processed_tokens[0] # ['loved', 'products', 'delivery', 'sooo', 'slow']

stemmed = [stemmer.stem(w) for w in example_tokens]
# lemmatize() treats every word as a noun unless a pos argument is given
lemmatized = [lemmatizer.lemmatize(w) for w in example_tokens]



**Task 6: Final Cleaned Output**

In a real-world scenario, we usually prefer lemmatization because it keeps the words readable. Here is a side-by-side comparison of how the tokens changed:

In [15]:
print("--- Final Cleaned Output ---\n")

print(f"Original:   {example_tokens}")
print(f"Stemmed:    {stemmed}")
print(f"Lemmatized: {lemmatized}")

--- Final Cleaned Output ---

Original:   ['loved', 'products', 'delivery', 'sooo', 'slow']
Stemmed:    ['love', 'product', 'deliveri', 'sooo', 'slow']
Lemmatized: ['loved', 'product', 'delivery', 'sooo', 'slow']


**Scenario-Based Question 2: News Article Classification System**

**Task 1: Read a Sample News Article**

Here I'll simulate a real news article with dates, numbers, and formal language.



In [19]:
article = """
On March 15, 2024, the Government of India announced a new technology policy.
The initiative aims to invest $5 billion in artificial intelligence and healthcare systems.
Experts believe this move will boost economic growth by 10 percent over the next five years.
"""


**Task 2: Sentence Tokenization**

Sentence tokenization splits text into sentences.

In [20]:
from nltk.tokenize import sent_tokenize  # not imported in the setup cell

sentences = sent_tokenize(article)

print("Sentences:")
for s in sentences:
    print(s)


Sentences:

On March 15, 2024, the Government of India announced a new technology policy.
The initiative aims to invest $5 billion in artificial intelligence and healthcare systems.
Experts believe this move will boost economic growth by 10 percent over the next five years.
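
To see why a trained tokenizer is used here instead of a simple split, the sketch below implements a naive regex splitter. It handles the article above correctly but breaks on abbreviations, which is the kind of case the Punkt model behind `sent_tokenize` is trained to handle. The function name `naive_sentence_split` is hypothetical.

```python
import re

def naive_sentence_split(text):
    # Split on whitespace that follows '.', '!' or '?'
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(naive_sentence_split("Dr. Rao spoke first. The policy was approved."))
# ['Dr.', 'Rao spoke first.', 'The policy was approved.']
```

The abbreviation "Dr." is wrongly treated as the end of a sentence, so the two-sentence input comes back as three pieces.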


**Task 3: Word Tokenization**

Word tokenization splits sentences into words.

In [21]:
words = []

for sentence in sentences:
    tokens = word_tokenize(sentence)
    words.extend(tokens)

print("Word Tokens:")
print(words)


Word Tokens:
['On', 'March', '15', ',', '2024', ',', 'the', 'Government', 'of', 'India', 'announced', 'a', 'new', 'technology', 'policy', '.', 'The', 'initiative', 'aims', 'to', 'invest', '$', '5', 'billion', 'in', 'artificial', 'intelligence', 'and', 'healthcare', 'systems', '.', 'Experts', 'believe', 'this', 'move', 'will', 'boost', 'economic', 'growth', 'by', '10', 'percent', 'over', 'the', 'next', 'five', 'years', '.']


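**Task 4: Text Normalization**

This step is meant to handle:

✔ Lowercasing

✔ Dates → <DATE>

✔ Numbers → <NUM>

✔ Punctuation removal

Why?

ML models work better with consistent patterns: mapping every number to one placeholder keeps the model from treating "5" and "10" as unrelated features.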

In [22]:
normalized_words = []

for word in words:
    word = word.lower()

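    # Replace dates and numbers.
    # Note: r'\d+' also matches four-digit years, so the <DATE> branch below
    # is never reached as written and 2024 is emitted as <NUM>.
    if re.fullmatch(r'\d+', word):
        word = '<NUM>'
    elif re.fullmatch(r'\d{4}', word):
        word = '<DATE>'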

    # Remove punctuation
    word = re.sub(r'[^\w<>]', '', word)

    if word != '':
        normalized_words.append(word)

print("Normalized Words:")
print(normalized_words)


Normalized Words:
['on', 'march', '<NUM>', '<NUM>', 'the', 'government', 'of', 'india', 'announced', 'a', 'new', 'technology', 'policy', 'the', 'initiative', 'aims', 'to', 'invest', '<NUM>', 'billion', 'in', 'artificial', 'intelligence', 'and', 'healthcare', 'systems', 'experts', 'believe', 'this', 'move', 'will', 'boost', 'economic', 'growth', 'by', '<NUM>', 'percent', 'over', 'the', 'next', 'five', 'years']
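
The recorded output shows the year 2024 mapped to `<NUM>` instead of `<DATE>`, because the general digit check runs before the four-digit check. A minimal sketch of one possible fix is to test the more specific pattern first. The function name `normalize_token` is hypothetical, and treating any four-digit number starting with 19 or 20 as a year is a heuristic assumption, not a full date parser.

```python
import re

def normalize_token(token):
    """Lowercase a token and map years/numbers to placeholders.

    Returns an empty string for tokens that are pure punctuation.
    """
    token = token.lower()
    # Specific pattern first: four digits starting with 19 or 20 are treated as a year
    if re.fullmatch(r'(19|20)\d{2}', token):
        return '<DATE>'
    # Any other all-digit token is a generic number
    if re.fullmatch(r'\d+', token):
        return '<NUM>'
    # Strip punctuation from everything else
    return re.sub(r'[^\w]', '', token)

tokens = ['On', 'March', '15', ',', '2024', ',', 'invest', '$', '5', 'billion']
normalized = [t for t in map(normalize_token, tokens) if t]
print(normalized)
# ['on', 'march', '<NUM>', '<DATE>', 'invest', '<NUM>', 'billion']
```

Returning the placeholder directly also avoids passing `<DATE>` and `<NUM>` through the punctuation filter.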


**Task 5: Remove Stop Words (Preserve Keywords)**

*Remove common words without losing domain terms such as "technology" and "healthcare".*

In [23]:
stop_words = set(stopwords.words('english'))

filtered_words = [
    word for word in normalized_words
    if word not in stop_words
]

print("After Stop Word Removal:")
print(filtered_words)


After Stop Word Removal:
['march', '<NUM>', '<NUM>', 'government', 'india', 'announced', 'new', 'technology', 'policy', 'initiative', 'aims', 'invest', '<NUM>', 'billion', 'artificial', 'intelligence', 'healthcare', 'systems', 'experts', 'believe', 'move', 'boost', 'economic', 'growth', '<NUM>', 'percent', 'next', 'five', 'years']
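
The cell above relies on the stop word list not containing any domain terms. A quick sanity check, sketched below, can confirm that the keywords survived. `DOMAIN_KEYWORDS` and `missing_keywords` are hypothetical names, and the keyword set is an assumption chosen for this article; `filtered_words` is copied from the output above so the example is self-contained.

```python
# Copied from the "After Stop Word Removal" output above
filtered_words = ['march', '<NUM>', '<NUM>', 'government', 'india', 'announced', 'new',
                  'technology', 'policy', 'initiative', 'aims', 'invest', '<NUM>', 'billion',
                  'artificial', 'intelligence', 'healthcare', 'systems', 'experts', 'believe',
                  'move', 'boost', 'economic', 'growth', '<NUM>', 'percent', 'next', 'five', 'years']

# Terms the classifier should be able to see (an assumption for illustration)
DOMAIN_KEYWORDS = {'technology', 'healthcare', 'artificial', 'intelligence', 'policy'}

def missing_keywords(tokens, keywords):
    # Return the keywords that no longer appear after filtering, in sorted order
    return sorted(set(keywords) - set(tokens))

print(missing_keywords(filtered_words, DOMAIN_KEYWORDS))
# []
```

An empty list means every listed keyword is still present after stop word removal.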


**Task 6: Lemmatization**

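Lemmatization converts words to meaningful base forms. Because no part-of-speech tag is passed, every word is treated as a noun: plurals such as "systems" are reduced to "system", while the verb form "announced" is left unchanged.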

In [24]:
lemmatizer = WordNetLemmatizer()

lemmatized_words = [
    lemmatizer.lemmatize(word)
    for word in filtered_words
]

print("Lemmatized Words:")
print(lemmatized_words)


Lemmatized Words:
['march', '<NUM>', '<NUM>', 'government', 'india', 'announced', 'new', 'technology', 'policy', 'initiative', 'aim', 'invest', '<NUM>', 'billion', 'artificial', 'intelligence', 'healthcare', 'system', 'expert', 'believe', 'move', 'boost', 'economic', 'growth', '<NUM>', 'percent', 'next', 'five', 'year']


**Task 7: Final Transformed Text (For Feature Extraction)**

*We join the cleaned words back into a single string of text.*

In [25]:
final_text = " ".join(lemmatized_words)

print("Final Transformed Text:")
print(final_text)


Final Transformed Text:
march <NUM> <NUM> government india announced new technology policy initiative aim invest <NUM> billion artificial intelligence healthcare system expert believe move boost economic growth <NUM> percent next five year
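
As an illustration of what feature extraction could look like from here, the sketch below builds a simple bag-of-words count from the final text using only the standard library. This is one possible next step and not part of the assignment; `term_counts` is an illustrative name, and `final_text` is copied from the output above.

```python
from collections import Counter

# Copied from the "Final Transformed Text" output above
final_text = ("march <NUM> <NUM> government india announced new technology policy "
              "initiative aim invest <NUM> billion artificial intelligence healthcare "
              "system expert believe move boost economic growth <NUM> percent next five year")

# Bag of words: count how often each token occurs
term_counts = Counter(final_text.split())

print(term_counts.most_common(1))
# [('<NUM>', 4)]
```

All four numeric tokens collapse into a single `<NUM>` feature, which is the consistency the normalization step was aiming for.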
