Use NLTK functions to read text, remove stop words, apply a stemmer, perform lemmatization, and generate a frequency distribution.

In [1]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

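# Download the required resources: tokenizer models, stop-word lists, and WordNet
# (newer NLTK releases may additionally need nltk.download('punkt_tab') for word_tokenize)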
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Read the text
text = "NLTK is a powerful library for natural language processing in Python. It provides easy-to-use interfaces for tasks such as tokenization, stemming, lemmatization, and frequency analysis."

# Tokenize the text into individual words
tokens = word_tokenize(text)

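# Remove stop words (compared case-insensitively); punctuation tokens are not
# in the stop-word list, so they stay in filtered_tokens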
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

# Apply stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

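# Apply lemmatization; lemmatize() treats every token as a noun by default (pos='n'),
# so plural nouns are reduced but verb forms such as 'provides' are left unchanged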
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# Calculate the frequency distribution of the lemmatized tokens
freq_dist = FreqDist(lemmatized_tokens)

# Print the results
print("Original Text:", text)
print("Filtered Tokens (after removing stop words):", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
print("Most Common Tokens:", freq_dist.most_common())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Original Text: NLTK is a powerful library for natural language processing in Python. It provides easy-to-use interfaces for tasks such as tokenization, stemming, lemmatization, and frequency analysis.
Filtered Tokens (after removing stop words): ['NLTK', 'powerful', 'library', 'natural', 'language', 'processing', 'Python', '.', 'provides', 'easy-to-use', 'interfaces', 'tasks', 'tokenization', ',', 'stemming', ',', 'lemmatization', ',', 'frequency', 'analysis', '.']
Stemmed Tokens: ['nltk', 'power', 'librari', 'natur', 'languag', 'process', 'python', '.', 'provid', 'easy-to-us', 'interfac', 'task', 'token', ',', 'stem', ',', 'lemmat', ',', 'frequenc', 'analysi', '.']
Lemmatized Tokens: ['NLTK', 'powerful', 'library', 'natural', 'language', 'processing', 'Python', '.', 'provides', 'easy-to-use', 'interface', 'task', 'tokenization', ',', 'stemming', ',', 'lemmatization', ',', 'frequency', 'analysis', '.']
Most Common Tokens: [(',', 3), ('.', 2), ('NLTK', 1), ('powerful', 1), ('library', 1
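
In the output above, the most frequent "tokens" are the comma and the full stop, because stop-word removal does not touch punctuation. A minimal sketch of one way to handle this, assuming punctuation should be dropped and case folded before counting: the helper name drop_punctuation is hypothetical, the token list is copied from the Lemmatized Tokens output, and collections.Counter stands in for FreqDist (which is itself a Counter subclass) so the snippet runs without any NLTK downloads.

```python
from collections import Counter

# Lemmatized tokens copied from the notebook output above.
lemmatized_tokens = [
    'NLTK', 'powerful', 'library', 'natural', 'language', 'processing',
    'Python', '.', 'provides', 'easy-to-use', 'interface', 'task',
    'tokenization', ',', 'stemming', ',', 'lemmatization', ',',
    'frequency', 'analysis', '.',
]


def drop_punctuation(tokens):
    """Hypothetical helper: keep tokens with at least one letter or digit.

    Hyphenated words such as 'easy-to-use' are kept, while tokens made
    only of punctuation (',' and '.') are removed.
    """
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# Lowercase so that 'Python' and 'python' would be counted together.
word_tokens = [tok.lower() for tok in drop_punctuation(lemmatized_tokens)]

# Counter offers the same most_common() interface used on FreqDist above.
word_counts = Counter(word_tokens)

print("Tokens kept:", len(word_tokens))
print("Most common:", word_counts.most_common(5))
```

In the notebook itself, the same filter could be applied to filtered_tokens before stemming and lemmatization, and the cleaned list passed to FreqDist in place of Counter.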