Text Preprocessing with NLTK (Tokenization, Stemming, Lemmatization, and Stop Word Removal in NLP)
-------

- In NLP, preprocessing text involves several steps to clean and prepare the text data for analysis. Here's how you can implement tokenization, stemming, lemmatization, and stop word removal using NLTK (Natural Language Toolkit) in Python:

1. Tokenization

-------

Tokenization breaks text into smaller units, such as words or sentences.

In [13]:
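import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# word_tokenize and sent_tokenize need the Punkt models; if they are missing,
# run nltk.download('punkt') once ('punkt_tab' in recent NLTK releases).

text = "Tokenization breaks text into smaller units like words or sentences"

# Split the text into word tokens
words = word_tokenize(text)
print("Tokenized Words:")
print(words)

# Split the text into sentences (this sample contains only one)
sentences = sent_tokenize(text)
print("\nTokenized Sentences:")
print(sentences)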


Tokenized Words:
['Tokenization', 'breaks', 'text', 'into', 'smaller', 'units', 'like', 'words', 'or', 'sentences']

Tokenized Sentences:
['Tokenization breaks text into smaller units like words or sentences']
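
The sample above is a single sentence without punctuation, so it does not show how sentence boundaries or punctuation marks are treated. The following is a minimal sketch of both, assuming a made-up sample text. To avoid any model download it uses `PunktSentenceTokenizer` with its default, untrained parameters and `wordpunct_tokenize`; these are stand-ins for `sent_tokenize` and `word_tokenize`, which load a pre-trained model and may split abbreviations and contractions differently.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Illustrative sample text (not from the notebook above)
sample = "Tokenization is the first step. It splits text into units! Does it keep punctuation?"

# Untrained Punkt tokenizer: no download needed, but no learned abbreviations either
sentence_splitter = PunktSentenceTokenizer()
sample_sentences = sentence_splitter.tokenize(sample)
print(sample_sentences)

# wordpunct_tokenize separates runs of word characters from runs of punctuation,
# so the final period becomes its own token
sample_tokens = wordpunct_tokenize(sample_sentences[0])
print(sample_tokens)
```

Punctuation marks come back as separate tokens, which matters later when filtering stop words.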


2. Stemming

-------------
Stemming strips suffixes to reduce a word to its stem, for example changing "running" to "run". The stem is produced by rules, so it is not always a dictionary word.

In [14]:
from nltk.stem import PorterStemmer

# Initialize stemmer
stemmer = PorterStemmer()

# Stemming example
words = ["playing", "played", "plays"]

stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed Words:")
print(stemmed_words)


Stemmed Words:
['play', 'play', 'play']
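
The "play" family stems cleanly, but rule-based stemming can also produce non-words or merge unrelated words. The sketch below, using a few illustrative words chosen for this purpose, prints each word next to its Porter stem so the effect can be inspected.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Illustrative words: some stem to real words, some do not
examples = ["running", "studies", "university", "universe"]

# Map each word to its stem for side-by-side comparison
stems = {word: porter.stem(word) for word in examples}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

If exact dictionary forms are needed downstream, lemmatization is usually the better fit.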



3. Lemmatization

-----------------
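Lemmatization is similar to stemming, but maps each word to its dictionary form (lemma), for example "better" to "good". NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, which is why "playing" and "played" are left unchanged in the output below.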

In [15]:
from nltk.stem import WordNetLemmatizer

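# Initialize lemmatizer (needs the WordNet corpus; run nltk.download('wordnet') once if it is missing)
lemmatizer = WordNetLemmatizer()

# Lemmatization example
words = ["playing", "played", "plays"]

# No part-of-speech tag is passed, so each word is treated as a noun
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]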
print("Lemmatized Words:")
print(lemmatized_words)


Lemmatized Words:
['playing', 'played', 'play']


4. Removal of Stop Words

----------
Stop words are common words (such as "is", "an", and "the") that often contribute little to the meaning of a text.


In [16]:
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Initialize stopwords
stop_words = set(stopwords.words('english'))

# Example text
text = "This is an example sentence demonstrating stop word removal."

# Remove stop words
filtered_words = [word for word in word_tokenize(text) if word.lower() not in stop_words]
print("Filtered Words after Stopword Removal:")
print(filtered_words)


Filtered Words after Stopword Removal:
['example', 'sentence', 'demonstrating', 'stop', 'word', 'removal', '.']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rr749\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
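
Note that the filtered list above still contains the "." token, because punctuation is not part of the stop word list. One way to drop it is to keep only alphabetic tokens. The sketch below is self-contained and needs no downloads: `demo_stop_words` is a small hand-picked set standing in for `stopwords.words('english')`, and `wordpunct_tokenize` stands in for `word_tokenize`.

```python
from nltk.tokenize import wordpunct_tokenize

# Hypothetical, hand-picked stop list standing in for stopwords.words('english')
demo_stop_words = {"this", "is", "an", "a", "the"}

demo_text = "This is an example sentence demonstrating stop word removal."

# Keep alphabetic tokens only, then drop stop words (case-insensitive)
content_words = [
    token
    for token in wordpunct_tokenize(demo_text)
    if token.isalpha() and token.lower() not in demo_stop_words
]
print(content_words)
```

`str.isalpha()` also discards numbers and tokens with hyphens or apostrophes, so a looser filter may be preferable for some data.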


Summary

---------------------------
- Tokenization: Splits text into words or sentences.

- Stemming: Reduces words to their root form.

- Lemmatization: Converts words to their base or dictionary form.

- Stop Word Removal: Filters out common words that do not carry much meaning.

- These preprocessing steps are commonly applied to prepare text data before more advanced NLP techniques such as sentiment analysis, text classification, or information retrieval. Stemming and lemmatization are usually alternatives, so a pipeline typically uses one or the other.
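
The steps summarized above can be chained into one function. This is a rough sketch rather than a definitive pipeline: `preprocess` and `small_stop_list` are hypothetical names, the stop list is a tiny hand-picked set, and `wordpunct_tokenize` plus `PorterStemmer` are used so that nothing needs to be downloaded.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def preprocess(text, stop_words):
    """Tokenize, drop punctuation and stop words, then stem what is left."""
    stemmer = PorterStemmer()
    # Lowercase first so the stop word check is case-insensitive
    tokens = wordpunct_tokenize(text.lower())
    kept = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(t) for t in kept]


# Hypothetical stop list; in practice stopwords.words('english') could be passed instead
small_stop_list = {"the", "are", "in", "a", "is"}

result = preprocess("The cats are playing in the park.", small_stop_list)
print(result)
```

Swapping the stemmer for a lemmatizer only changes the last line of the function, which keeps the choice between the two easy to revisit.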