## **Assignment 1:**
Perform tokenization, stopword removal, stemming, and lemmatization on a sample dataset. Compare how these preprocessing steps impact the quality of text representation.

**By Akash Kumar**

**Import Required Libraries**

In [None]:
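import nltk

# One-time downloads of the NLTK resources used in this notebook
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # dictionary used by the lemmatizer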

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
text = """Natural Language Processing (NLP) is one of the most important areas of Artificial Intelligence.
It helps machines understand, interpret, and generate human language..."""

**Tokenization**

Tokenization splits text into individual tokens: words, punctuation marks, and symbols such as parentheses.

In [None]:
# Recent NLTK releases load the tokenizer data from 'punkt_tab'
nltk.download('punkt_tab')
tokens = word_tokenize(text)
print(tokens)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'one', 'of', 'the', 'most', 'important', 'areas', 'of', 'Artificial', 'Intelligence', '.', 'It', 'helps', 'machines', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '...']
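For contrast, here is a minimal sketch of what naive whitespace splitting does to the same text. It uses only Python's built-in `str.split`. The names `naive_tokens` and `glued` are illustrative and not part of the notebook.

```python
# Same sample text as in the cell above.
text = """Natural Language Processing (NLP) is one of the most important areas of Artificial Intelligence.
It helps machines understand, interpret, and generate human language..."""

# Splitting on whitespace leaves punctuation attached to the neighbouring word.
naive_tokens = text.split()
print(len(naive_tokens))  # 23 chunks, versus the 29 tokens printed above

# Chunks that still contain a non-alphanumeric character.
glued = [tok for tok in naive_tokens if not tok.isalnum()]
print(glued)
# ['(NLP)', 'Intelligence.', 'understand,', 'interpret,', 'language...']
```

`word_tokenize` separates these punctuation marks into their own tokens, which is why its output is longer.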


**Stopword Removal**

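Stopwords are high-frequency function words that carry little meaning on their own (e.g., is, are, the). The check below lowercases each token first, so capitalized stopwords such as "It" are removed too. Punctuation is not in the stopword list and is kept.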

In [None]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)


['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'one', 'important', 'areas', 'Artificial', 'Intelligence', '.', 'helps', 'machines', 'understand', ',', 'interpret', ',', 'generate', 'human', 'language', '...']
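The filtered list still contains punctuation tokens. One possible follow-up step, sketched here with the standard library only, keeps a token only if it has at least one letter or digit. This filter is an assumption about what a cleaner representation might look like, not a step the notebook performs, and the name `word_tokens` is illustrative.

```python
# Stopword-filtered tokens, copied from the output above.
filtered_tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'one',
                   'important', 'areas', 'Artificial', 'Intelligence', '.',
                   'helps', 'machines', 'understand', ',', 'interpret', ',',
                   'generate', 'human', 'language', '...']

# Keep a token only if it contains at least one alphanumeric character.
word_tokens = [tok for tok in filtered_tokens
               if any(ch.isalnum() for ch in tok)]

print(word_tokens)
print(len(filtered_tokens), "->", len(word_tokens))  # 22 -> 16
```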


**Stemming**

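Stemming strips suffixes with rule-based heuristics to reduce each word to a stem, which may not be a real word (e.g., natur, languag). The Porter stemmer also lowercases its output.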

In [None]:
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_tokens]
print(stemmed_words)


['natur', 'languag', 'process', '(', 'nlp', ')', 'one', 'import', 'area', 'artifici', 'intellig', '.', 'help', 'machin', 'understand', ',', 'interpret', ',', 'gener', 'human', 'languag', '...']
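Stemming has a side effect: different words can collapse to the same stem. The short sketch below shows this with the same `PorterStemmer`. The word pairs are chosen for illustration; only "important" appears in the sample text.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Pairs of distinct words to compare after stemming (illustrative choices).
pairs = [("important", "import"), ("language", "languages")]

stems = {}
for first, second in pairs:
    stems[first] = stemmer.stem(first)
    stems[second] = stemmer.stem(second)
    same = stems[first] == stems[second]
    print(f"{first} -> {stems[first]}, {second} -> {stems[second]}, same stem: {same}")
```

Merging "language" and "languages" is usually the desired effect. Merging "important" and "import" loses a real distinction in meaning.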


**Lemmatization**

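Lemmatization maps each word to its dictionary base form (lemma) by looking it up in a vocabulary, here WordNet. Called without a part-of-speech tag, `WordNetLemmatizer` treats every word as a noun, so plurals are reduced (areas → area) while "Processing" is left unchanged. Capitalization is preserved.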

In [None]:
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized_words)


['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'one', 'important', 'area', 'Artificial', 'Intelligence', '.', 'help', 'machine', 'understand', ',', 'interpret', ',', 'generate', 'human', 'language', '...']


# **Comparison of Text Preprocessing**

**Tokenization:**

Breaks running text into smaller units (tokens): words and punctuation marks.

✔ Gives later steps discrete units to count, filter, and normalize

✘ Does not reduce noise by itself

**Stopword Removal:**

Removes common words like is, the, and.

✔ Reduces unnecessary words

✘ May remove words that carry meaning, such as negations (e.g., not)

**Stemming:**

Cuts words to their root form (e.g., playing → play).

✔ Reduces vocabulary size

✘ Stems may not be real words (e.g., natur, languag), and different words can collapse to the same stem (important → import)

**Lemmatization:**

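Converts words to their dictionary base form (e.g., better → good when tagged as an adjective).

✔ Returns real dictionary words, so the output stays readable

✘ Slower than stemming, and needs a part-of-speech tag for best results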
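To put numbers on the comparison, the sketch below counts total and unique tokens at each stage. The lists are copied from the outputs printed earlier in this notebook. Vocabulary size is only one rough proxy for representation quality, and the helper name `summarize` is illustrative.

```python
# Outputs of each stage, copied from the cells above.
tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'one',
          'of', 'the', 'most', 'important', 'areas', 'of', 'Artificial',
          'Intelligence', '.', 'It', 'helps', 'machines', 'understand', ',',
          'interpret', ',', 'and', 'generate', 'human', 'language', '...']
filtered = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'one',
            'important', 'areas', 'Artificial', 'Intelligence', '.', 'helps',
            'machines', 'understand', ',', 'interpret', ',', 'generate',
            'human', 'language', '...']
stemmed = ['natur', 'languag', 'process', '(', 'nlp', ')', 'one', 'import',
           'area', 'artifici', 'intellig', '.', 'help', 'machin',
           'understand', ',', 'interpret', ',', 'gener', 'human', 'languag',
           '...']
lemmatized = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'one',
              'important', 'area', 'Artificial', 'Intelligence', '.', 'help',
              'machine', 'understand', ',', 'interpret', ',', 'generate',
              'human', 'language', '...']


def summarize(stage_tokens):
    """Return (total token count, number of distinct tokens)."""
    return len(stage_tokens), len(set(stage_tokens))


stages = {"tokenized": tokens, "stopwords removed": filtered,
          "stemmed": stemmed, "lemmatized": lemmatized}
summary = {name: summarize(toks) for name, toks in stages.items()}

for name, (total, unique) in summary.items():
    print(f"{name:<18} total={total:<3} unique={unique}")
```

On this sample, stemming gives the smallest vocabulary because it lowercases its output and so merges "Language" with "language". The lemmatizer keeps the two apart because it preserves case.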