# **NLP Preprocessing Assignment**

## **Importing Libraries**

In [1]:
import re
import nltk
import emoji
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


## **Download required NLTK data**

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zafir\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\zafir\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\zafir\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## **Sample Paragraph**

In [3]:
text = """
Artificial Intelligence is transforming the world rapidly in 2025. 
Machine learning models are now being used in healthcare, education, finance, 
and even creative fields like art and music. However, raw text data collected 
from social media, reviews, articles, and chats is usually very noisy. 

It contains emojis 😊🚀, hashtags #AI #MachineLearning, URLs https://deeplearningai.com, 
HTML tags <p></p>, extra spaces   , PUNCTUATION!!!!!, numbers 2025 3.14, and many 
stop words like the, is, are, and, to, etc. 

Proper text preprocessing is extremely important because it improves model 
performance, reduces training time, decreases memory usage, and helps the 
algorithm focus on meaningful information only. Today we learned several 
important steps: converting to lowercase, removing numbers, stripping HTML tags, 
eliminating emojis, cleaning punctuation, removing extra whitespaces, and 
finally removing English stopwords.
"""

print("Original Text:\n")
print(text)
print("\n" + "="*80 + "\n")

Original Text:


Artificial Intelligence is transforming the world rapidly in 2025. 
Machine learning models are now being used in healthcare, education, finance, 
and even creative fields like art and music. However, raw text data collected 
from social media, reviews, articles, and chats is usually very noisy. 

It contains emojis 😊🚀, hashtags #AI #MachineLearning, URLs https://deeplearningai.com, 
HTML tags <p></p>, extra spaces   , PUNCTUATION!!!!!, numbers 2025 3.14, and many 
stop words like the, is, are, and, to, etc. 

Proper text preprocessing is extremely important because it improves model 
performance, reduces training time, decreases memory usage, and helps the 
algorithm focus on meaningful information only. Today we learned several 
important steps: converting to lowercase, removing numbers, stripping HTML tags, 
eliminating emojis, cleaning punctuation, removing extra whitespaces, and 
finally removing English stopwords.





## **Lowercase**

In [4]:
text = text.lower()
print("After Lowercase:\n", text[:300], "...\n")

After Lowercase:
 
artificial intelligence is transforming the world rapidly in 2025. 
machine learning models are now being used in healthcare, education, finance, 
and even creative fields like art and music. however, raw text data collected 
from social media, reviews, articles, and chats is usually very noisy. 

 ...



## **Remove URLs**

In [5]:
# Match http(s):// links and bare www. links; \S+ consumes the rest of the URL
text = re.sub(r'https?://\S+|www\.\S+', '', text)
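
A quick way to sanity-check a URL pattern is to run it on a short sample. The sketch below assumes the `www.` branch is meant to consume non-whitespace characters (`\S+`); the sample string and the name `URL_PATTERN` are illustrative only.

```python
import re

# Assumed pattern: both branches consume a run of non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Illustrative sample, not taken from the notebook's paragraph.
sample = "visit https://deeplearningai.com, or www.example.org today"

# Each URL is replaced by an empty string; surrounding spaces are left behind.
cleaned = URL_PATTERN.sub('', sample)
print(repr(cleaned))
```

Note that `\S+` also swallows punctuation glued to the URL (the comma after `.com` here), and the leftover double spaces are only collapsed later by the whitespace step.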

## **Remove HTML tags**

In [6]:
text = re.sub(r'<.*?>', '', text)
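
The `?` in `<.*?>` makes the match non-greedy, which matters when a line holds more than one tag. A small sketch on an illustrative string (not from the paragraph above) compares the two behaviours:

```python
import re

sample = "<p>hello</p> <b>world</b>"

# Non-greedy: each match stops at the first '>', so only the tags are removed.
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: one match runs from the first '<' to the last '>', deleting the text too.
greedy = re.sub(r'<.*>', '', sample)

print(repr(non_greedy), repr(greedy))
```

A regex like this is adequate for simple tags; for real web pages an HTML parser is usually the safer choice.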

## **Remove emojis**

In [7]:
text = emoji.replace_emoji(text, replace='')

## **Remove Numbers**

In [8]:
text = re.sub(r'\d+(\.\d+)?', '', text)
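
The optional group `(\.\d+)?` lets the same pattern remove both integers and decimals. The sketch below, on an illustrative string, shows what it leaves behind; the sample is an assumption, not part of the notebook's data.

```python
import re

sample = "in 2025. numbers 2025 3.14, v2"

# '2025.' loses only the digits (the dot is not followed by more digits),
# '3.14' is removed as a whole, and the digit inside 'v2' is stripped as well.
cleaned = re.sub(r'\d+(\.\d+)?', '', sample)
print(repr(cleaned))
```

Because digits inside tokens are also removed, a word-boundary variant may be preferable when identifiers such as `v2` should survive.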

## **Remove Punctuation**

In [9]:
text = text.translate(str.maketrans('', '', string.punctuation))
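
`string.punctuation` contains only ASCII punctuation, so typographic quotes and similar characters pass through untouched. A brief sketch on an illustrative string:

```python
import string

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)

# Illustrative sample with a hashtag, repeated '!' and typographic quotes.
sample = "#ai punctuation!!!!! it’s “quoted”"
cleaned = sample.translate(table)
print(cleaned)
```

The `#` and `!` characters disappear, which is why `#MachineLearning` ends up as the single token `machinelearning`, while the curly apostrophe and quotes remain.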

## **Remove Extra Whitespaces**

In [10]:
text = re.sub(r'\s+', ' ', text.strip())

print("After Basic Cleaning:\n")
print(text)
print("\n" + "="*80 + "\n")

After Basic Cleaning:

artificial intelligence is transforming the world rapidly in machine learning models are now being used in healthcare education finance and even creative fields like art and music however raw text data collected from social media reviews articles and chats is usually very noisy it contains emojis hashtags ai machinelearning urls html tags extra spaces punctuation numbers and many stop words like the is are and to etc proper text preprocessing is extremely important because it improves model performance reduces training time decreases memory usage and helps the algorithm focus on meaningful information only today we learned several important steps converting to lowercase removing numbers stripping html tags eliminating emojis cleaning punctuation removing extra whitespaces and finally removing english stopwords
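
The individual cleaning steps above can be collected into one function so the same pipeline can be applied to any string. This is a minimal standard-library sketch; the name `clean_text` is hypothetical, and the emoji step is left out here because it depends on the third-party `emoji` package.

```python
import re
import string

# Table that deletes ASCII punctuation, built once and reused.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(raw: str) -> str:
    """Apply the basic cleaning steps in the same order as the notebook."""
    text = raw.lower()                                    # lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)     # URLs
    text = re.sub(r'<.*?>', '', text)                     # HTML tags
    text = re.sub(r'\d+(\.\d+)?', '', text)               # numbers
    text = text.translate(_PUNCT_TABLE)                   # punctuation
    return re.sub(r'\s+', ' ', text.strip())              # extra whitespace

result = clean_text("Visit https://x.com <p>NOW</p> in 2025!!!   ok")
print(result)
```

The order matters: URLs and HTML tags are removed before punctuation, since stripping `:`, `/`, `<` and `>` first would leave fragments that the earlier patterns can no longer recognise.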




## **Remove Stopwords**

In [11]:
stop_words = set(stopwords.words('english'))
words = text.split()
text_no_stop = ' '.join([word for word in words if word not in stop_words])

print("After Removing Stopwords:\n")
print(text_no_stop)
print("\n" + "="*80 + "\n")

After Removing Stopwords:

artificial intelligence transforming world rapidly machine learning models used healthcare education finance even creative fields like art music however raw text data collected social media reviews articles chats usually noisy contains emojis hashtags ai machinelearning urls html tags extra spaces punctuation numbers many stop words like etc proper text preprocessing extremely important improves model performance reduces training time decreases memory usage helps algorithm focus meaningful information today learned several important steps converting lowercase removing numbers stripping html tags eliminating emojis cleaning punctuation removing extra whitespaces finally removing english stopwords




## **Stemming (Porter Stemmer)**

In [12]:
stemmer = PorterStemmer()
stemmed = ' '.join([stemmer.stem(word) for word in text_no_stop.split()])

print("After Stemming:\n")
print(stemmed)
print("\n" + "="*80 + "\n")

After Stemming:

artifici intellig transform world rapidli machin learn model use healthcar educ financ even creativ field like art music howev raw text data collect social media review articl chat usual noisi contain emoji hashtag ai machinelearn url html tag extra space punctuat number mani stop word like etc proper text preprocess extrem import improv model perform reduc train time decreas memori usag help algorithm focu meaning inform today learn sever import step convert lowercas remov number strip html tag elimin emoji clean punctuat remov extra whitespac final remov english stopword




## **Lemmatization (usually more accurate than stemming)**

In [13]:
lemmatizer = WordNetLemmatizer()
lemmatized = ' '.join([lemmatizer.lemmatize(word) for word in text_no_stop.split()])

print("After Lemmatization:\n")
print(lemmatized)
print("\n" + "="*80 + "\n")

After Lemmatization:

artificial intelligence transforming world rapidly machine learning model used healthcare education finance even creative field like art music however raw text data collected social medium review article chat usually noisy contains emojis hashtags ai machinelearning url html tag extra space punctuation number many stop word like etc proper text preprocessing extremely important improves model performance reduces training time decrease memory usage help algorithm focus meaningful information today learned several important step converting lowercase removing number stripping html tag eliminating emojis cleaning punctuation removing extra whitespaces finally removing english stopwords




## **Bag of Words**

In [14]:
cv = CountVectorizer()
bow = cv.fit_transform([lemmatized])
print("Bag of Words - Feature Names:")
print(cv.get_feature_names_out())
print("\nBag of Words Matrix:\n", bow.toarray())

Bag of Words - Feature Names:
['ai' 'algorithm' 'art' 'article' 'artificial' 'chat' 'cleaning'
 'collected' 'contains' 'converting' 'creative' 'data' 'decrease'
 'education' 'eliminating' 'emojis' 'english' 'etc' 'even' 'extra'
 'extremely' 'field' 'finally' 'finance' 'focus' 'hashtags' 'healthcare'
 'help' 'however' 'html' 'important' 'improves' 'information'
 'intelligence' 'learned' 'learning' 'like' 'lowercase' 'machine'
 'machinelearning' 'many' 'meaningful' 'medium' 'memory' 'model' 'music'
 'noisy' 'number' 'performance' 'preprocessing' 'proper' 'punctuation'
 'rapidly' 'raw' 'reduces' 'removing' 'review' 'several' 'social' 'space'
 'step' 'stop' 'stopwords' 'stripping' 'tag' 'text' 'time' 'today'
 'training' 'transforming' 'url' 'usage' 'used' 'usually' 'whitespaces'
 'word' 'world']

Bag of Words Matrix:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1
  2 1 1 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 3 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
  1 1 1 1 1]]
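
The matrix above has a single row because the vectorizer was fit on one document. To show how several documents share one vocabulary, here is a minimal standard-library sketch of the same counting idea. It is an illustration only, not scikit-learn's implementation (which applies its own tokenization rules), and the two short documents are hypothetical.

```python
from collections import Counter

# Two hypothetical, already-cleaned documents.
docs = ["model improves performance", "model reduces training time model"]
tokenized = [doc.split() for doc in docs]

# Shared vocabulary: every distinct token, in alphabetical order.
vocab = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term.
matrix = [[Counter(tokens)[term] for term in vocab] for tokens in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Each column corresponds to one term, so a term missing from a document simply contributes a zero in that row.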


## **TF-IDF**
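
Because the vectorizer in this section is fit on a single document, every term appears in that one document and the IDF factor is identical for all terms. Assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), the scores then reduce to L2-normalized counts. The sketch below reproduces that arithmetic in plain Python, using the counts from the Bag of Words matrix (66 terms with count 1, ten with count 2, one with count 3); the helper name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# One document in which every term occurs: IDF is exactly 1 for all terms.
idf_single = smoothed_idf(n_docs=1, doc_freq=1)

# Term counts read off the Bag of Words matrix.
counts = [1] * 66 + [2] * 10 + [3]
norm = math.sqrt(sum(c * c for c in counts))  # L2 norm, sqrt(115)

# Scores for terms that occur once and twice, rounded as in the notebook.
score_count_1 = round(1 * idf_single / norm, 4)
score_count_2 = round(2 * idf_single / norm, 4)
print(idf_single, score_count_1, score_count_2)
```

With several documents the IDF factor differs per term, so words that occur in every document are weighted down relative to rarer ones.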

In [15]:
tf = TfidfVectorizer()
tf_matrix = tf.fit_transform([lemmatized])
print("\nTF-IDF - Feature Names:")
print(tf.get_feature_names_out())
print("\nTF-IDF Matrix:\n", tf_matrix.toarray().round(4))


TF-IDF - Feature Names:
['ai' 'algorithm' 'art' 'article' 'artificial' 'chat' 'cleaning'
 'collected' 'contains' 'converting' 'creative' 'data' 'decrease'
 'education' 'eliminating' 'emojis' 'english' 'etc' 'even' 'extra'
 'extremely' 'field' 'finally' 'finance' 'focus' 'hashtags' 'healthcare'
 'help' 'however' 'html' 'important' 'improves' 'information'
 'intelligence' 'learned' 'learning' 'like' 'lowercase' 'machine'
 'machinelearning' 'many' 'meaningful' 'medium' 'memory' 'model' 'music'
 'noisy' 'number' 'performance' 'preprocessing' 'proper' 'punctuation'
 'rapidly' 'raw' 'reduces' 'removing' 'review' 'several' 'social' 'space'
 'step' 'stop' 'stopwords' 'stripping' 'tag' 'text' 'time' 'today'
 'training' 'transforming' 'url' 'usage' 'used' 'usually' 'whitespaces'
 'word' 'world']

TF-IDF Matrix:
 [[0.0933 0.0933 0.0933 0.0933 0.0933 0.0933 0.0933 0.0933 0.0933 0.0933
  0.0933 0.0933 0.0933 0.0933 0.0933 0.1865 0.0933 0.0933 0.0933 0.1865
  0.0933 0.0933 0.0933 0.0933 0.0933 0.09