---
# Text Preprocessing Program

## Problem Statement:

In Natural Language Processing (NLP) and text analysis, raw text usually needs to be preprocessed before it is suitable for tasks such as sentiment analysis, text classification, or information retrieval. Text preprocessing typically includes removing stop words (very common words that add little to the analysis), reducing words to their root form (stemming), and stripping punctuation and case differences from the text.

The objective of this program is to develop a text preprocessing tool that can perform the following tasks on a given text document:

1. **Stop Word Removal**: Stop words are common words (e.g., "the," "is," "in") that carry little meaning on their own in most text analysis tasks. The program should identify and remove stop words from the input text.

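2. **Stemming**: Stemming reduces words to a base or root form (e.g., "running" to "run"). The resulting stem is not always a dictionary word; in the example output below, "language" becomes "languag". The program should apply stemming to the words that remain after stop word removal to standardize them.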
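
The two tasks above can be sketched in plain Python without any library. This is only an illustration: `TOY_STOP_WORDS` and `toy_stem` are hypothetical names introduced here, the stop word list is hand-picked, and the suffix stripping is a crude stand-in rather than the Porter algorithm that the program itself uses through NLTK.

```python
# Hypothetical, hand-picked stop word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "is", "in", "a", "of"}


def remove_stop_words(tokens):
    """Drop every token that appears in the toy stop word list."""
    return [t for t in tokens if t.lower() not in TOY_STOP_WORDS]


def toy_stem(word):
    """Strip a few common suffixes. This is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run",
            # but leave "ll", "ss", and "zz" alone.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


tokens = ["the", "dog", "is", "running", "in", "the", "park"]
kept = remove_stop_words(tokens)
stems = [toy_stem(t) for t in kept]
print(kept)   # ['dog', 'running', 'park']
print(stems)  # ['dog', 'run', 'park']
```

A real stemmer applies many more rules and conditions than this, which is why the program relies on NLTK's `PorterStemmer` instead of hand-written suffix rules.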

## Input:

The input to the program is a text document or string of natural language text. It may include punctuation and a mixture of uppercase and lowercase letters.
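
As a rough sketch of what handling punctuation and mixed case involves, the snippet below lowercases a string and keeps only runs of letters and digits. The function name `normalize_tokens` and the regular expression are assumptions made for illustration; the program itself tokenizes with NLTK's `word_tokenize` rather than a regular expression.

```python
import re


def normalize_tokens(text):
    """Lowercase the text and return its runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(normalize_tokens("Natural language processing (NLP) is a field."))
# ['natural', 'language', 'processing', 'nlp', 'is', 'a', 'field']
```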

## Output:

The program outputs the preprocessed text: the lowercased, stemmed words that remain after punctuation and stop words are removed, joined by single spaces.

## Implementation:

The program is implemented in Python and uses the NLTK (Natural Language Toolkit) library for tokenization, stop word removal, and stemming. It consists of a single preprocessing function and an example that applies the function to a sample input text.

The goal is to provide a simple yet effective text preprocessing tool that can be integrated into various text analysis tasks.

---

### Importing Libraries

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
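# word_tokenize needs the 'punkt' tokenizer models; recent NLTK releases may
# instead ask for 'punkt_tab' (if so, run nltk.download('punkt_tab') as well)
nltk.download('punkt')
# English stop word list used by stopwords.words('english')
nltk.download('stopwords')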

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sidla\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sidla\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Preprocessing Function Definition

In [2]:
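# Define a function that tokenizes, filters, and stems the input text
def preprocess_text(text):
    # Split the raw text into word and punctuation tokens
    words = word_tokenize(text)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    # Keep only purely alphanumeric tokens (drops punctuation) and lowercase them
    words = [word.lower() for word in words if word.isalnum()]
    # Remove stop words, then reduce each remaining word to its stem
    filtered_words = [stemmer.stem(word) for word in words if word not in stop_words]
    # Join the stems back into a single space-separated string
    preprocessed_text = ' '.join(filtered_words)
    return preprocessed_text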
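
One consequence of the `word.isalnum()` filter is worth seeing in isolation: it removes any token that contains a non-alphanumeric character, not only standalone punctuation. The sketch below uses a hand-written token list chosen to resemble what a tokenizer might produce; the helper name `keep_alnum_lowercase` is hypothetical and simply repeats the filtering step from `preprocess_text`.

```python
def keep_alnum_lowercase(tokens):
    """Same filter as in preprocess_text: keep alphanumeric tokens, lowercased."""
    return [t.lower() for t in tokens if t.isalnum()]


# Hand-written tokens, not actual tokenizer output.
sample_tokens = ["state-of-the-art", "do", "n't", "NLP", "(", "3.5", "2024"]
print(keep_alnum_lowercase(sample_tokens))
# ['do', 'nlp', '2024']
```

Hyphenated words, contraction fragments, and decimal numbers are all dropped along with the parenthesis. If those tokens matter for a given task, a looser filter may be a better fit.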

### Preprocessing Example

In [3]:
input_text = """
Natural language processing (NLP) is a field of artificial intelligence
that focuses on the interaction between computers and humans through
natural language.
"""
preprocessed_text = preprocess_text(input_text)

print("Original Text:")
print(input_text)

print("\nPreprocessed Text:")
print(preprocessed_text)

Original Text:

Natural language processing (NLP) is a field of artificial intelligence
that focuses on the interaction between computers and humans through
natural language.


Preprocessed Text:
natur languag process nlp field artifici intellig focus interact comput human natur languag
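
The truncated-looking words in the output are Porter stems, not typos. To see this per word, the short sketch below stems a few words from the sample text individually; it assumes NLTK is installed and needs no downloaded corpora, since `PorterStemmer` is purely rule-based.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the sample input, already lowercased.
sample_words = ["natural", "language", "processing", "computers"]
stems = {word: stemmer.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
# natural -> natur
# language -> languag
# processing -> process
# computers -> comput
```

Because different forms of a word collapse to the same stem, downstream tasks can treat them as one term, at the cost of stems that are harder to read.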
