In Natural Language Processing (NLP), preprocessing transforms raw text into a form that machine learning models can use efficiently. The key steps are:

1. Text Cleaning
Lowercasing: Convert all text to lowercase to ensure uniformity.
Removing Punctuation: Remove or handle punctuation marks.
Removing Special Characters: Remove characters like #, @, &, etc.
Removing Numbers: Depending on the context, numbers might be removed.
2. Tokenization
Word Tokenization: Split the text into individual words.
Sentence Tokenization: Split the text into sentences.
Subword Tokenization: Break down words into subwords, especially useful for handling out-of-vocabulary words.
3. Stop Word Removal
Stop Words: Remove very common words like "and," "the," and "is" that carry little meaning on their own.
4. Stemming and Lemmatization
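Stemming: Reduce words to a root form by stripping suffixes with heuristic rules (e.g., "running" to "run"); the result is not always a dictionary word.
Lemmatization: Reduce words to their base or dictionary form, using vocabulary and part of speech (e.g., "better" to "good").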
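
The text cleaning step can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the function name `clean_text`, its `remove_numbers` flag, and the sample sentence are assumptions made for this example.

```python
import re
import string

def clean_text(text, remove_numbers=True):
    """Lowercase text and strip ASCII punctuation and, optionally, digits."""
    text = text.lower()
    # string.punctuation covers ASCII punctuation, including #, @ and &
    text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    # Collapse the whitespace runs left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

sample = "Hello, World! Email me @ 9am & bring #3 copies."  # hypothetical input
print(clean_text(sample))                        # hello world email me am bring copies
print(clean_text(sample, remove_numbers=False))  # hello world email me 9am bring 3 copies
```

Note that deleting punctuation before tokenizing merges contractions ("we're" becomes "were"), which is one reason to tokenize first and filter tokens afterwards.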


In [1]:
!pip install nltk



In [10]:
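import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import warnings
warnings.filterwarnings("ignore")  # keep the notebook output free of warnings
# Tokenizer models and stop word lists; recent NLTK releases may also need 'punkt_tab'
nltk.download('punkt')
nltk.download('stopwords')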

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
text = "This is a simple example: we're going to preprocess this text, removing stopwords and punctuation."

In [12]:
words = word_tokenize(text)

In [13]:
words

['This',
 'is',
 'a',
 'simple',
 'example',
 ':',
 'we',
 "'re",
 'going',
 'to',
 'preprocess',
 'this',
 'text',
 ',',
 'removing',
 'stopwords',
 'and',
 'punctuation',
 '.']

In [15]:
# Lowercase and keep only purely alphabetic tokens (drops punctuation and fragments like "'re")
cw = [word.lower() for word in words if word.isalpha()]

In [16]:
cw

['this',
 'is',
 'a',
 'simple',
 'example',
 'we',
 'going',
 'to',
 'preprocess',
 'this',
 'text',
 'removing',
 'stopwords',
 'and',
 'punctuation']
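
The `isalpha()` filter removes more than punctuation: any token containing a digit, hyphen, or apostrophe is dropped too. The sketch below uses a hand-written token list (hypothetical, not tokenizer output) to show the effect, along with a looser filter that may suit tasks where numbers or hyphenated terms matter.

```python
# Hand-written tokens for illustration only (hypothetical, not produced by word_tokenize)
tokens = ["State-of-the-art", "GPT", "2024", "'re", "n't", ",", "Model", "covid19"]

# Strict filter: keep tokens made up entirely of letters
alpha_only = [t.lower() for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

# Looser filter: keep any token with at least one letter or digit
has_alnum = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

print(alpha_only)  # ['gpt', 'model']
print(dropped)     # ['State-of-the-art', '2024', "'re", "n't", ',', 'covid19']
print(has_alnum)   # everything except ','
```

Which filter is appropriate depends on the task; the strict one is fine for a simple example like this, but it silently discards years, product names, and hyphenated terms.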

In [18]:
# A set gives fast membership tests when filtering out English stop words
stop_word = set(stopwords.words('english'))
filtered_words = [word for word in cw if word not in stop_word]

In [19]:
print("Original Text:")
print(text)

print("\nTokenized Text:")
print(words)

print("\nCleaned and Stop Words Removed:")
print(filtered_words)

Original Text:
This is a simple example: we're going to preprocess this text, removing stopwords and punctuation.

Tokenized Text:
['This', 'is', 'a', 'simple', 'example', ':', 'we', "'re", 'going', 'to', 'preprocess', 'this', 'text', ',', 'removing', 'stopwords', 'and', 'punctuation', '.']

Cleaned and Stop Words Removed:
['simple', 'example', 'going', 'preprocess', 'text', 'removing', 'stopwords', 'punctuation']
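
The cells above stop at stop word removal, so step 4 (stemming and lemmatization) is not demonstrated. The sketch below is a deliberately tiny, standard-library illustration of the difference between the two ideas. The suffix rules, the lemma table, and the function names `toy_stem` and `toy_lemmatize` are all assumptions made for this example; they are far cruder than NLTK's `PorterStemmer` and `WordNetLemmatizer` (both in `nltk.stem`), which would be the usual choice in practice.

```python
# Hypothetical suffix rules and lemma table, chosen only for illustration
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {"better": "good", "running": "run", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["going", "removing", "stopwords", "better", "mice"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

Stemming applies rules blindly, so "removing" becomes the non-word "remov", while "better" is untouched because no rule matches. Lemmatization depends on a vocabulary, so it can map "better" to "good" but leaves unknown words unchanged.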
