In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [9]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/dhruvpai/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dhruvpai/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
# Sample text document
text = """Natural language processing (NLP) is a field of artificial intelligence that enables computers to understand, interpret, and generate human language. It combines computational linguistics with machine learning to process and analyze large amounts of natural language data."""

In [11]:
# Tokenizing the text into words
words = word_tokenize(text)

In [12]:
# Removing stop words
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

In [13]:
# Stemming using PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

In [16]:
# Output the processed words
print("Original Text: ", text)
print("\nTokenized Words: ", words)
print("\nFiltered Words (Stopwords Removed): ", filtered_words)
print("\nStemmed Words: ", stemmed_words)

Original Text:  Natural language processing (NLP) is a field of artificial intelligence that enables computers to understand, interpret, and generate human language. It combines computational linguistics with machine learning to process and analyze large amounts of natural language data.

Tokenized Words:  ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'enables', 'computers', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'It', 'combines', 'computational', 'linguistics', 'with', 'machine', 'learning', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']

Filtered Words (Stopwords Removed):  ['Natural', 'language', 'processing', '(', 'NLP', ')', 'field', 'artificial', 'intelligence', 'enables', 'computers', 'understand', ',', 'interpret', ',', 'generate', 'human', 'language', '.', 'combines', 'computational', 'linguistics', 'machine

Explanation of the Code:
-----

Import Libraries:
We import nltk itself plus three helpers: stopwords (from nltk.corpus), word_tokenize (from nltk.tokenize), and PorterStemmer (from nltk.stem).

Download Required NLTK Data:
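We download the NLTK data files the later cells depend on: punkt (the model word_tokenize() uses to split sentences) and stopwords (the stop-word lists). Newer NLTK releases may ask for a resource named punkt_tab instead; if word_tokenize() raises a LookupError, download the resource named in the error message.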

Sample Text:
A sample text document is provided that discusses Natural Language Processing (NLP).

Tokenization:
The word_tokenize() function breaks the text into individual tokens. Punctuation marks such as "(", ",", and "." become separate tokens, as the Tokenized Words output shows.
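
To see why a tokenizer is used rather than plain whitespace splitting, the sketch below compares str.split() with NLTK's regex-based wordpunct_tokenize(). Note that wordpunct_tokenize() is a stand-in chosen here because it needs no downloaded data; it is not the tokenizer used in the notebook, and word_tokenize() may split some inputs differently.

```python
from nltk.tokenize import wordpunct_tokenize

sentence = "NLP enables computers to understand, interpret, and generate human language."

# Whitespace splitting leaves punctuation attached to the neighbouring word.
split_tokens = sentence.split()

# A tokenizer emits punctuation marks as separate tokens.
tokens = wordpunct_tokenize(sentence)

print(split_tokens)
print(tokens)
```

In the first list "understand," keeps its comma, so it would not match the plain word "understand" in later steps; in the second list the comma is its own token.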

Stop Word Removal:
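The English stop-word list is loaded with stopwords.words('english') and converted to a set so that membership checks are fast.
Each token is lowercased before the check, so capitalized stop words such as "It" are removed too. The result omits common, low-information words like "is", "a", and "of"; punctuation tokens are not stop words, so they remain in the filtered list.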

Stemming:
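The PorterStemmer class reduces each word to a stem by applying suffix-stripping rules. For example, "running" becomes "run". The stem is not always a dictionary word ("computers" becomes "comput"), and irregular forms are left alone ("better" stays "better").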
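
A minimal sketch of the stemmer on a few individual words, assuming only that nltk is installed (PorterStemmer needs no downloaded data). The word list is illustrative and not taken from the notebook's output.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stems come from suffix-stripping rules, so they need not be real words.
examples = ["running", "computers", "natural", "better"]
stems = {word: stemmer.stem(word) for word in examples}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

"running" and "computers" come back as "run" and "comput", while "better" is returned unchanged.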

Output:
The original text, tokenized words, filtered (stopword removed) words, and stemmed words are printed to the console.

Questions:
----

1. What are the different NLTK libraries?
NLTK provides a variety of modules and tools for text processing, including:
nltk.corpus: For accessing datasets like stopwords, movie reviews, etc.
nltk.tokenize: For splitting text into sentences and words.
nltk.stem: For stemming and lemmatization algorithms (e.g., PorterStemmer, WordNetLemmatizer).
nltk.chat: For building simple chatbots.
nltk.probability: For frequency and probability distributions (e.g., FreqDist).
nltk.parse: For parsing text into syntactic structures.
And many more for various natural language processing tasks.
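
As one small illustration of a module from this list, the sketch below counts token frequencies with FreqDist from nltk.probability. The token list is a hand-picked, lowercased subset of the sample text, used here only for demonstration.

```python
from nltk.probability import FreqDist

# Lowercased tokens hand-picked from the sample text (illustrative subset).
tokens = ["natural", "language", "processing", "human", "language",
          "natural", "language", "data"]

freq = FreqDist(tokens)

# most_common(n) returns (token, count) pairs, highest count first.
print(freq.most_common(2))
```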

2. How to remove stop words from the file?
To remove stop words from a text file with NLTK, you can:
Read the file's contents into a string.
Tokenize the text using nltk.tokenize.word_tokenize().
Load the English stop words using nltk.corpus.stopwords.words('english'), ideally as a set.
Keep only the tokens whose lowercase form is not in the stop-word set.
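
The steps above can be sketched end to end as follows. To keep the sketch runnable without NLTK data, it uses two stand-ins that are assumptions, not part of the notebook: a tiny hand-written stop-word set (DEMO_STOP_WORDS) and str.split() in place of word_tokenize(). The helper name remove_stop_words and the file name sample.txt are likewise hypothetical.

```python
import tempfile
from pathlib import Path


def remove_stop_words(tokens, stop_words):
    """Keep tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Hypothetical stand-in for set(stopwords.words('english')).
DEMO_STOP_WORDS = {"is", "a", "of", "that", "to", "and", "it", "with"}

# Write a small demo file, then read it back as a real file would be read.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text("NLP is a field of artificial intelligence", encoding="utf-8")
    text = path.read_text(encoding="utf-8")

tokens = text.split()  # stand-in for word_tokenize(text)
filtered = remove_stop_words(tokens, DEMO_STOP_WORDS)
print(filtered)
```

With NLTK data available, pass set(stopwords.words('english')) as stop_words and word_tokenize(text) as tokens, as in the cells above.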

3. What is meant by stemming?
Stemming is the process of reducing words to a base form (the stem) by stripping affixes, usually suffixes, with fixed rules. For instance, "running" becomes "run", while "better" is left as "better" because no rule applies to it. The goal is to map related word forms to one stem so they are treated as the same term.
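
The sketch below illustrates that goal with PorterStemmer: several forms of one word collapse to a single stem. The word list is an illustrative choice, not taken from the notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

variants = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(word) for word in variants]

# Collapsing the stems into a set shows how many distinct bases remain.
print(set(stems))
```

All five variants reduce to "connect", so a search or count over stems treats them as one term.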

4. What is meant by Lemmatization?
Lemmatization is the process of reducing a word to its base or dictionary form (called a "lemma"). Unlike stemming, which simply chops off word endings, lemmatization looks the word up in a vocabulary and takes its part of speech into account. For example, "running" treated as a verb becomes "run", and "better" treated as an adjective becomes "good", a mapping that suffix stripping cannot produce.