### Algorithm for Text Preprocessing in NLP:

Input: A string of text.

Tokenization:

Split the text into individual tokens. Punctuation marks become separate tokens at this stage.

Output: List of tokens (words and punctuation marks).
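
As a rough illustration of what tokenization produces, the sketch below splits text with a regular expression from the standard library. The function name `simple_tokenize` and the pattern are assumptions made for this example; NLTK's `word_tokenize` applies more elaborate rules (for contractions, abbreviations, and so on), so its output can differ.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("NLP is fun, isn't it?")
print(tokens)  # ['NLP', 'is', 'fun', ',', 'isn', "'", 't', 'it', '?']
```

Note how the apostrophe splits "isn't" into three tokens here, which is one reason a dedicated tokenizer is usually preferred.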

Filtration:

Keep only tokens made up entirely of alphanumeric characters. This drops punctuation tokens such as "(" and "!"; purely numeric tokens survive this step and are removed in the next one.

Output: List of alphanumeric tokens.
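
A minimal sketch of this filter, using Python's built-in `str.isalnum()` on a made-up token list (the sample tokens are assumptions for illustration). It shows that tokens containing digits pass the check, while punctuation and hyphenated tokens do not.

```python
# Hypothetical sample tokens, chosen to cover the different cases.
tokens = ["NLP", "(", "2024", "is", "!", "state-of-the-art", "GPT4"]

# str.isalnum() is True only if every character is a letter or a digit.
filtered = [t for t in tokens if t.isalnum()]
print(filtered)  # ['NLP', '2024', 'is', 'GPT4']
```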

Script Validation:

Remove tokens that contain anything other than the ASCII letters A-Z and a-z (e.g., tokens with digits or non-ASCII characters).

Output: List of tokens made up only of English letters.
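
The validation step can be sketched with a regular expression that accepts ASCII letters only. The token list and the constant name `ASCII_LETTERS` are assumptions for this example. The last line shows why `str.isalpha()` is not a drop-in replacement: it also accepts letters outside the ASCII range.

```python
import re

# Matches a token made up entirely of ASCII letters.
ASCII_LETTERS = re.compile(r"[A-Za-z]+")

# Hypothetical sample tokens; "caf\u00e9" is "cafe" with an accented e.
tokens = ["NLP", "2024", "GPT4", "caf\u00e9", "field"]
validated = [t for t in tokens if ASCII_LETTERS.fullmatch(t)]
print(validated)  # ['NLP', 'field']

# str.isalpha() would keep the accented word.
print("caf\u00e9".isalpha())  # True
```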

Stop Word Removal:

Compare each token, lowercased, to a predefined list of common stop words (like "the", "and", "is", etc.).

Remove tokens that are found in the stop word list.

Output: List of tokens after stop words have been removed.
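
A minimal sketch of the lookup, assuming a tiny hand-written stop word set (`STOP_WORDS` is hypothetical; NLTK's English list is much longer). Storing the list as a set keeps each membership test fast, and lowercasing the token before the lookup makes the comparison case-insensitive.

```python
# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"the", "and", "is", "a", "of"}

tokens = ["The", "field", "of", "NLP", "is", "fascinating"]

# Lowercase only for the comparison; the kept tokens retain their casing.
kept = [t for t in tokens if t.lower() not in STOP_WORDS]
print(kept)  # ['field', 'NLP', 'fascinating']
```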

Stemming:

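Apply stemming to each token, reducing it to its stem by stripping suffixes (e.g., "running" → "run"). The stem is not always a dictionary word ("natural" → "natur"), and irregular forms are left unchanged ("better" → "better").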

Output: List of stemmed tokens.
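
To show the idea of suffix stripping, here is a deliberately naive stemmer. It is a toy stand-in, not the Porter algorithm: the function name `toy_stem`, the three suffixes, and the minimum stem length are all assumptions chosen for this example, and a real stemmer applies many more rules.

```python
def toy_stem(word):
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Leave words like "process" alone rather than cutting "ss" to "s".
        if suffix == "s" and word.endswith("ss"):
            continue
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

words = ["running", "jumped", "cats", "better", "process"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['run', 'jump', 'cat', 'better', 'process']
```

Like any purely suffix-based approach, this sketch leaves "better" untouched because it has no rule linking it to "good".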

Language Detection (Optional):

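Optionally detect the language of the raw input text with the langdetect library. This runs separately from the steps above, which are specific to English and would discard non-English tokens.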

Output: Detected language (e.g., 'en' for English, 'hi' for Hindi).
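
As a rough intuition for how text in different writing systems can be told apart, the sketch below guesses the dominant Unicode script of a string. This is only a heuristic and an assumption of this example, not how langdetect works: the function name `dominant_script` is hypothetical, and a script guess cannot separate languages that share a script (such as English and French).

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    # Count the first word of each letter's or mark's Unicode name,
    # e.g. "DEVANAGARI LETTER NA" -> "DEVANAGARI".
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(dominant_script("Natural Language Processing"))  # LATIN
# The escapes below spell a Hindi greeting in Devanagari.
print(dominant_script("\u0928\u092e\u0938\u094d\u0924\u0947"))  # DEVANAGARI
```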

In [6]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from langdetect import detect

In [2]:
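# Tokenizer models used by word_tokenize, and the stop word lists.
# Recent NLTK releases (3.9 and later) look for 'punkt_tab' instead of 'punkt';
# if word_tokenize raises a LookupError, run nltk.download('punkt_tab') as well.
nltk.download('punkt')
nltk.download('stopwords')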

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Personal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Personal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
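def preprocess_text(text):
    """Tokenize, filter, validate, remove stop words, and stem English text."""
    # Tokenization: split into word and punctuation tokens
    tokens = word_tokenize(text)
    # Filtration: keep tokens made up only of alphanumeric characters
    filtered_tokens = [word for word in tokens if word.isalnum()]
    # Script validation: keep tokens made up only of ASCII letters
    validated_tokens = [word for word in filtered_tokens if re.fullmatch(r"[A-Za-z]+", word)]
    # Stop word removal: case-insensitive lookup in NLTK's English list
    stop_words = set(stopwords.words('english'))
    tokens_without_stopwords = [word for word in validated_tokens if word.lower() not in stop_words]
    # Stemming: PorterStemmer also lowercases each token
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in tokens_without_stopwords]
    return stemmed_tokens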


In [4]:
text = "Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence!"
preprocessed_text = preprocess_text(text)
print("Preprocessed Text:", preprocessed_text)

Preprocessed Text: ['natur', 'languag', 'process', 'nlp', 'fascin', 'field', 'artifici', 'intellig']


In [5]:
text = "नमस्ते, आप कैसे हैं?"
lang = detect(text)
print("Detected Language:", lang)

Detected Language: hi
