# NLP Practice Assignments
Day 1

1. What is the purpose of text preprocessing in NLP, and why is it essential before analysis?

2. Describe tokenization in NLP and explain its significance in text processing.

3. What are the differences between stemming and lemmatization in NLP? When would you 
choose one over the other?

4. Explain the concept of stop words and their role in text preprocessing. How do they impact 
NLP tasks?

5. How does the process of removing punctuation contribute to text preprocessing in NLP? 
What are its benefits?

6. Discuss the importance of lowercase conversion in text preprocessing. Why is it a 
common step in NLP tasks?

7. Explain the term "vectorization" as it applies to text data. How do techniques like 
CountVectorizer contribute to text preprocessing in NLP?

8. Describe the concept of normalization in NLP. Provide examples of normalization 
techniques used in text preprocessing.


Note: The input text may come from a file or from prompted input.
 Python code is mandatory for every question where it is applicable.

# 1. Purpose of Text Preprocessing in NLP:
    
Text preprocessing cleans and organizes raw text so that it is suitable for analysis. It combines techniques such as tokenization, stemming, lemmatization, and removal of irrelevant information (stop words, punctuation, inconsistent casing). Preprocessing is essential because it can:

- Make analysis algorithms more efficient, since there is less text to process.
- Reduce the dimensionality of the data by merging variants such as "Text" and "text" into one feature.
- Improve the accuracy and interpretability of models by removing noise.
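
As a rough illustration of the dimensionality point, the sketch below runs a few preprocessing steps with only the standard library and compares vocabulary sizes before and after. The `STOP_WORDS` set and the `preprocess` function are hypothetical names made up for this example, not part of any library.

```python
import string

# Hypothetical, deliberately tiny stop-word list used only for this sketch.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "on"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]

raw = "The cat sat on the mat. The CAT sat!"
raw_vocab = set(raw.split())
clean_tokens = preprocess(raw)
clean_vocab = set(clean_tokens)

print("Raw vocabulary size:", len(raw_vocab))      # 8
print("Clean tokens:", clean_tokens)               # ['cat', 'sat', 'mat', 'cat', 'sat']
print("Clean vocabulary size:", len(clean_vocab))  # 3
```

Eight distinct raw strings collapse to three distinct tokens, because case variants, attached punctuation, and stop words no longer count as separate features.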

# 2. Tokenization in NLP:
Tokenization is the process of breaking text into words, phrases, symbols, or other meaningful elements (tokens). It is a crucial step in text processing because nearly every later step (stop-word removal, stemming, vectorization) operates on tokens rather than on raw strings. It helps in:

- Understanding the structure of sentences.
- Facilitating further analysis by converting text into manageable units.

In [2]:
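# Requires the NLTK tokenizer models: nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead).
from nltk.tokenize import word_tokenize

text = "Tokenization is a key step in NLP."
# Split the sentence into word and punctuation tokens.
tokens = word_tokenize(text)
print(tokens)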

['Tokenization', 'is', 'a', 'key', 'step', 'in', 'NLP', '.']
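
To see why a tokenizer is more than a whitespace split, here is a minimal regex-based sketch. `simple_tokenize` is a hypothetical helper written for this example; it is not how `word_tokenize` works internally and it handles contractions more crudely.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores; [^\w\s] matches a
    # single punctuation or symbol character, so punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Tokenization is a key step in NLP."
whitespace_tokens = text.split()
regex_tokens = simple_tokenize(text)

print(whitespace_tokens)  # ['Tokenization', 'is', 'a', 'key', 'step', 'in', 'NLP.']
print(regex_tokens)       # ['Tokenization', 'is', 'a', 'key', 'step', 'in', 'NLP', '.']
```

The whitespace split leaves the period glued to "NLP.", which would make it a different token from "NLP" elsewhere in a corpus.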


# 3. Stemming vs. Lemmatization:
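Stemming and lemmatization are techniques to reduce words to their base or root form. 
Stemming strips suffixes by rule, so it is fast but can produce non-words such as "techniqu". Lemmatization looks words up in a dictionary and uses their part of speech, so it returns real words but is slower. Note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, which is why "are" is left unchanged in the output below; calling lemmatize("are", pos="v") returns "be". Choose stemming when speed matters and rough matching is enough (for example, search indexing), and lemmatization when the output must be readable or linguistically correct.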

In [3]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

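# WordNetLemmatizer needs the WordNet corpus: nltk.download('wordnet')
words = word_tokenize("Stemming and lemmatization are techniques.")
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer lowercases and strips suffixes; the lemmatizer assumes pos='n' by default.
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]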

print("Stemmed words:", stemmed_words)
print("Lemmatized words:", lemmatized_words)

Stemmed words: ['stem', 'and', 'lemmat', 'are', 'techniqu', '.']
Lemmatized words: ['Stemming', 'and', 'lemmatization', 'are', 'technique', '.']
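
The difference in approach can be sketched without NLTK. Both functions below are toy stand-ins written for this example: `toy_stem` is not the Porter algorithm, and `LEMMA_TABLE` is a hypothetical four-entry lexicon in place of a real dictionary such as WordNet.

```python
# Hypothetical mini-lexicon standing in for a real dictionary.
LEMMA_TABLE = {"studies": "study", "running": "run", "mice": "mouse", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix; no dictionary, so results may not be words."""
    word = word.lower()
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned lowercased."""
    return LEMMA_TABLE.get(word.lower(), word.lower())

for w in ["studies", "running", "mice", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> stud | study
# running -> runn | run
# mice -> mice | mouse
# better -> better | good
```

Rule-based stripping cannot handle irregular forms such as "mice" or "better", while the lookup handles them but only for words the lexicon knows.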


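# 4. Stop Words:
Stop words are very common words such as "the," "is," and "and" that are often removed during text preprocessing. 
They carry little topical meaning on their own, so removing them shrinks the vocabulary and lets frequency-based methods focus on content words. Removal is not always safe: negations such as "not" appear in many stop-word lists, so tasks like sentiment analysis may need a customized list.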

In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Stop words should be removed for better analysis."
stop_words = set(stopwords.words('english'))

filtered_tokens = [word for word in word_tokenize(text) if word.lower() not in stop_words]
print("Filtered tokens:", filtered_tokens)

Filtered tokens: ['Stop', 'words', 'removed', 'better', 'analysis', '.']
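
The impact on downstream tasks can be sketched with a negation example. The `STOP_WORDS` set and `remove_stop_words` helper below are hypothetical and far smaller than a real list; they only illustrate the effect.

```python
# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "is", "was", "not", "a", "this"}

def remove_stop_words(tokens, stop_words):
    return [tok for tok in tokens if tok.lower() not in stop_words]

positive = "This movie was good".split()
negative = "This movie was not good".split()

filtered_positive = remove_stop_words(positive, STOP_WORDS)
filtered_negative = remove_stop_words(negative, STOP_WORDS)
print(filtered_positive)  # ['movie', 'good']
print(filtered_negative)  # ['movie', 'good']

# Keeping negations: subtract them from the stop list before filtering.
keep_negation = STOP_WORDS - {"not"}
filtered_with_negation = remove_stop_words(negative, keep_negation)
print(filtered_with_negation)  # ['movie', 'not', 'good']
```

With the full list, the two opposite reviews become identical; removing "not" from the stop list preserves the distinction.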


# 5. Removing Punctuation:
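Removing punctuation lets the analysis focus on the words themselves. It keeps tokens such as "processing?" and "processing" from being counted as different features, which reduces vocabulary size. The trade-off is that punctuation sometimes carries meaning (question marks, emoticons, decimal points), so the step should match the task.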

In [5]:
import string

text = "Removing punctuation: Does it improve text processing?"
no_punctuations = text.translate(str.maketrans("", "", string.punctuation))
print("Text without punctuation:", no_punctuations)

Text without punctuation: Removing punctuation Does it improve text processing
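
One caveat is that `string.punctuation` contains only ASCII characters, so curly quotes and similar marks pass through untouched. A possible alternative, sketched below with hypothetical helper names, filters on the Unicode general category instead.

```python
import string
import unicodedata

def strip_ascii_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def strip_unicode_punctuation(text):
    # Drop every character whose Unicode general category starts with "P"
    # (punctuation), which includes curly quotes and the ellipsis character.
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

text = "“Is it done?” she asked…"
ascii_result = strip_ascii_punctuation(text)
unicode_result = strip_unicode_punctuation(text)

print(ascii_result)    # “Is it done” she asked…
print(unicode_result)  # Is it done she asked
```

The two approaches are not interchangeable: `string.punctuation` also removes symbols such as `$` and `+`, which Unicode classifies as symbols rather than punctuation, so the category-based version keeps them.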


# 6. Lowercase Conversion:
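Lowercasing makes text consistent: "Text", "text", and "TEXT" become the same token, which reduces vocabulary size and prevents a model from treating them as unrelated words. It is a common step because it is cheap and usually harmless, although it can erase distinctions that matter, such as "US" versus "us".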

In [6]:
text = "Convert This Text To Lowercase."
lowercased_text = text.lower()
print("Lowercased text:", lowercased_text)

Lowercased text: convert this text to lowercase.
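
The effect on token counts can be sketched with `collections.Counter`. The token list is made up for illustration; the last line shows `str.casefold()`, a stricter built-in variant intended for caseless matching.

```python
from collections import Counter

tokens = "Apple apple APPLE banana Banana".split()

raw_counts = Counter(tokens)
lower_counts = Counter(tok.lower() for tok in tokens)

print("Distinct tokens before:", len(raw_counts))   # 5
print("Distinct tokens after:", len(lower_counts))  # 2
print("Count of 'apple':", lower_counts["apple"])   # 3

# casefold() goes further than lower() for some non-English characters.
print("Straße".lower(), "Straße".casefold())        # straße strasse
```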


# 7. Vectorization:
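Vectorization converts text into numerical vectors that machine learning algorithms can work with. 
CountVectorizer is a common technique for this: it builds a vocabulary from the documents and represents each document as a row of token counts (a bag-of-words model). By default it also lowercases the text and ignores punctuation and single-character tokens, so it performs some preprocessing of its own.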

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

documents = ["This is the first document.", "This document is the second document."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print("Feature names:", vectorizer.get_feature_names_out())
print("Token Counts Matrix:", X.toarray())

Feature names: ['document' 'first' 'is' 'second' 'the' 'this']
Token Counts Matrix: [[1 1 1 0 1 1]
 [2 0 1 1 1 1]]
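
To make the counting explicit, the same matrix can be rebuilt by hand with the standard library. This is a minimal sketch that mirrors CountVectorizer's documented defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the `tokenize` helper is a hypothetical name, and the real class returns a sparse matrix and offers many more options.

```python
import re
from collections import Counter

documents = ["This is the first document.", "This document is the second document."]

def tokenize(doc):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# The vocabulary is the sorted set of all tokens across the documents.
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

# Each row holds one document's count for every vocabulary term.
matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['document', 'first', 'is', 'second', 'the', 'this']
print(matrix)      # [[1, 1, 1, 0, 1, 1], [2, 0, 1, 1, 1, 1]]
```

Each column corresponds to one vocabulary term, so the 2 in the second row records that "document" occurs twice in the second sentence.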


# 8. Normalization in NLP:
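Normalization transforms text into a standard form so that superficially different strings are treated alike. Common techniques include lowercasing, stemming, and lemmatization; the example below lowercases the text and then stems each token.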

In [8]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "Normalization techniques include stemming and lemmatization."
stemmer = PorterStemmer()

# Lowercase first, then tokenize, then stem each token.
normalized_words = [stemmer.stem(word) for word in word_tokenize(text.lower())]
print("Normalized words:", normalized_words)

Normalized words: ['normal', 'techniqu', 'includ', 'stem', 'and', 'lemmat', '.']
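
Normalization can also operate at the character level, before any tokenization. The sketch below is one possible combination, assuming Unicode NFKC normalization, lowercasing, and whitespace collapsing are all appropriate for the data; `normalize_text` is a hypothetical helper name.

```python
import re
import unicodedata

def normalize_text(text):
    # NFKC folds compatibility characters (ligatures, full-width letters)
    # into their ordinary equivalents.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r"\s+", " ", text).strip()

sample = "  The ﬁnal   ＲＥＳＵＬＴ\tis ready  "
normalized = normalize_text(sample)
print(normalized)  # the final result is ready
```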
