In [5]:
!pip install newspaper3k lxml_html_clean



**1. Read the File**

Read the content of https://www.cnn.com/2025/06/13/style/why-luxury-brands-are-so-expensive and print the first 700 characters.

In [6]:
from newspaper import Article

url = "https://www.cnn.com/2025/06/13/style/why-luxury-brands-are-so-expensive"

article = Article(url)
article.download()
article.parse()

text = article.text
print(text[:700])


London CNN —

More than ever, high-end brands want you to know exactly how, and where, their goods are made. They are producing enormous glossy coffee table books showing white-coated workers hand-stitching products in glamorous workshops, and creating marketing campaigns emphasizing the exquisite materials and dedicated handiwork that go into the making of their very, very expensive products.

These companies are trying to explain the value of their creations to consumers because their profits are slowing, even as their prices are increasing. While the personal luxury goods market was worth €363 billion (about $415 billion) in 2024, up from €223 billion ($242 billion) a decade prior, accord


**2. Remove HTML Tags**   
If any HTML tags are present in the file, remove them so that only the raw text remains.

In [7]:
from bs4 import BeautifulSoup

raw_text = BeautifulSoup(text, "html.parser").get_text(separator=" ", strip=True)
print(raw_text[:700])

London CNN —

More than ever, high-end brands want you to know exactly how, and where, their goods are made. They are producing enormous glossy coffee table books showing white-coated workers hand-stitching products in glamorous workshops, and creating marketing campaigns emphasizing the exquisite materials and dedicated handiwork that go into the making of their very, very expensive products.

These companies are trying to explain the value of their creations to consumers because their profits are slowing, even as their prices are increasing. While the personal luxury goods market was worth €363 billion (about $415 billion) in 2024, up from €223 billion ($242 billion) a decade prior, accord
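The output matches step 1 because `article.text` from newspaper is already extracted body text, so there are no tags left to strip. To see tag removal actually change something, here is a minimal sketch on a made-up HTML snippet. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra installs; `TagStripper`, `strip_tags`, and `sample` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text nodes and discards the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join the text nodes and collapse runs of whitespace
    return " ".join("".join(parser.parts).split())


# Made-up snippet, not taken from the article
sample = "<p>High-end <b>brands</b> want you to know &amp; care.</p>"
print(strip_tags(sample))
```

On real pages BeautifulSoup's `get_text()` remains the more robust choice, since it tolerates malformed markup.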


**3. Lowercase and Remove Punctuation**   
Convert all text to lowercase and remove all punctuation characters.

In [8]:
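import re
import string

lower_text = raw_text.lower()
# string.punctuation is ASCII-only, so non-ASCII characters such as "€" are left in place
no_punct = re.sub(rf"[{re.escape(string.punctuation)}]", " ", lower_text)
# Collapse the runs of whitespace left behind by the substitution
no_punct = re.sub(r"\s+", " ", no_punct).strip()

print(no_punct[:700])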

london cnn — more than ever high end brands want you to know exactly how and where their goods are made they are producing enormous glossy coffee table books showing white coated workers hand stitching products in glamorous workshops and creating marketing campaigns emphasizing the exquisite materials and dedicated handiwork that go into the making of their very very expensive products these companies are trying to explain the value of their creations to consumers because their profits are slowing even as their prices are increasing while the personal luxury goods market was worth €363 billion about 415 billion in 2024 up from €223 billion 242 billion a decade prior according to the global m
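Note that the dash after "london cnn" (U+2014) and the "€" sign survive in the output, because `string.punctuation` only lists ASCII characters. If those should go too, one option is to test each character's Unicode category. The sketch below is an alternative, not the method used in the cell above; it assumes symbols (category `S`, which covers `$`, `€`, `+`) should be removed along with punctuation (category `P`). `remove_punctuation` and `sample` are hypothetical names.

```python
import re
import unicodedata


def remove_punctuation(text):
    # Replace every character whose Unicode category starts with
    # "P" (punctuation) or "S" (symbol) with a space
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch
        for ch in text
    )
    # Collapse the whitespace runs created by the replacement
    return re.sub(r"\s+", " ", cleaned).strip()


# Short sample modeled on the article text; \u2014 is the long dash, \u20ac is the euro sign
sample = "london cnn \u2014 high-end brands, worth \u20ac363 billion (about $415 billion)"
print(remove_punctuation(sample))
# → london cnn high end brands worth 363 billion about 415 billion
```

Keeping currency symbols is a matter of dropping `"S"` from the check; which is right depends on whether amounts matter for the downstream task.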


**4. Remove Stopwords**   
Remove English stopwords from the text. (Use NLTK’s list of stopwords.)

In [9]:
import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = no_punct.split()

filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens[:50])

['london', 'cnn', '—', 'ever', 'high', 'end', 'brands', 'want', 'know', 'exactly', 'goods', 'made', 'producing', 'enormous', 'glossy', 'coffee', 'table', 'books', 'showing', 'white', 'coated', 'workers', 'hand', 'stitching', 'products', 'glamorous', 'workshops', 'creating', 'marketing', 'campaigns', 'emphasizing', 'exquisite', 'materials', 'dedicated', 'handiwork', 'go', 'making', 'expensive', 'products', 'companies', 'trying', 'explain', 'value', 'creations', 'consumers', 'profits', 'slowing', 'even', 'prices', 'increasing']


**5. Lemmatize the Text**  
Lemmatize all remaining words (use NLTK’s WordNetLemmatizer) and print the first 50
lemmatized words. Is there any difference in your output if you stem the text instead?

In [10]:
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in filtered_tokens]

print("First 50 lemmatized words:")
print(lemmatized[:50])

stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered_tokens]

print("\nFirst 50 stemmed words:")
print(stemmed[:50])

First 50 lemmatized words:
['london', 'cnn', '—', 'ever', 'high', 'end', 'brand', 'want', 'know', 'exactly', 'good', 'made', 'producing', 'enormous', 'glossy', 'coffee', 'table', 'book', 'showing', 'white', 'coated', 'worker', 'hand', 'stitching', 'product', 'glamorous', 'workshop', 'creating', 'marketing', 'campaign', 'emphasizing', 'exquisite', 'material', 'dedicated', 'handiwork', 'go', 'making', 'expensive', 'product', 'company', 'trying', 'explain', 'value', 'creation', 'consumer', 'profit', 'slowing', 'even', 'price', 'increasing']

First 50 stemmed words:
['london', 'cnn', '—', 'ever', 'high', 'end', 'brand', 'want', 'know', 'exactli', 'good', 'made', 'produc', 'enorm', 'glossi', 'coffe', 'tabl', 'book', 'show', 'white', 'coat', 'worker', 'hand', 'stitch', 'product', 'glamor', 'workshop', 'creat', 'market', 'campaign', 'emphas', 'exquisit', 'materi', 'dedic', 'handiwork', 'go', 'make', 'expens', 'product', 'compani', 'tri', 'explain', 'valu', 'creation', 'consum', 'profit', 'slo

Yes, the two outputs differ clearly. Lemmatization maps each token to a dictionary form: brands → brand, materials → material, companies → company. Words that are already base forms, such as exactly and glossy, are left unchanged, so the result stays readable. Because WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, verb forms such as producing, showing, and made are also left as they are. Stemming instead strips suffixes by rule and often yields non-dictionary forms: exactly → exactli, producing → produc, glossy → glossi, companies → compani. Stemming is fast and needs no dictionary, while lemmatization keeps tokens meaningful, so lemmatization is generally preferred for NLP tasks where word meaning matters.
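To make the comparison concrete, the two results can be lined up token by token. The sketch below does not call NLTK again; it hard-codes a few (token, lemma, stem) triples copied from the printed outputs above, and the variable names (`rows`, `differing`, `unchanged_by_lemmatizer`) are only illustrative.

```python
# (original token, lemmatized form, stemmed form), copied from the outputs above
rows = [
    ("brands", "brand", "brand"),
    ("exactly", "exactly", "exactli"),
    ("goods", "good", "good"),
    ("producing", "producing", "produc"),
    ("glossy", "glossy", "glossi"),
    ("materials", "material", "materi"),
    ("companies", "company", "compani"),
]

print(f"{'token':<12}{'lemma':<12}{'stem':<12}")
for token, lemma, stem in rows:
    print(f"{token:<12}{lemma:<12}{stem:<12}")

# Tokens where the lemmatizer and the stemmer disagree
differing = [token for token, lemma, stem in rows if lemma != stem]
# Tokens the lemmatizer left untouched
unchanged_by_lemmatizer = [token for token, lemma, _ in rows if token == lemma]

print("\nLemma and stem differ for:", differing)
print("Unchanged by the lemmatizer:", unchanged_by_lemmatizer)
```

In the notebook itself, the same comparison could be run over the full lists with `zip(filtered_tokens, lemmatized, stemmed)`.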