# Text Preprocessing Techniques using Pandas

Text preprocessing is a crucial step in natural language processing (NLP): it cleans raw text and transforms it into a form that machine learning models can analyze more effectively. In this lecture, we will explore three key preprocessing techniques in Python: tokenization, stop word removal, and stemming/lemmatization. We use pandas to hold the text and apply each step, with scikit-learn supplying the sample data and the stop word list, and NLTK supplying the stemmer and lemmatizer.

## Tokenization

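Tokenization is the process of breaking text into smaller units called tokens. Tokens can be words, phrases, or individual characters, depending on the task. In pandas, the `Series.str.split()` method gives a simple word-level tokenizer: called with no arguments, it splits each string on runs of whitespace, so punctuation stays attached to the neighboring word (for example, `"world!"` remains a single token).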

In [46]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load sample text data from the 20 Newsgroups dataset
data = fetch_20newsgroups(subset='all')
df = pd.DataFrame({'text': data.data})

# Tokenize the text data
df['tokens'] = df['text'].str.split()
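
To make the effect of whitespace splitting concrete, here is a minimal sketch on a toy DataFrame. The `toy` frame and its column names are assumptions made for illustration and are not part of the 20 Newsgroups data. It compares `str.split()` with a simple regex-based alternative built on `str.findall()`.

```python
import pandas as pd

# Hypothetical toy corpus used only for illustration
toy = pd.DataFrame({"text": ["Hello, world! Pandas is great.", "Tokenize this text."]})

# Whitespace tokenization: punctuation stays attached to the words
toy["ws_tokens"] = toy["text"].str.split()

# Regex tokenization: lowercase first, then keep only runs of word characters
toy["re_tokens"] = toy["text"].str.lower().str.findall(r"\w+")

print(toy["ws_tokens"][0])  # ['Hello,', 'world!', 'Pandas', 'is', 'great.']
print(toy["re_tokens"][0])  # ['hello', 'world', 'pandas', 'is', 'great']
```

Which variant is preferable depends on the task: the regex version discards punctuation and case, which may or may not be information the model needs.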

## Stop Word Removal

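Stop words are very common words, such as "the", "a", "and", and "is", that carry little meaning on their own in most text analysis tasks. Removing them shrinks the vocabulary and lets downstream models focus on more informative words, which often helps in tasks such as topic classification. Here we use the built-in English stop word list from scikit-learn, and we lowercase each token before the lookup because the list is lowercase.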

In [47]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Remove stop words from the tokenized text
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word.lower() not in ENGLISH_STOP_WORDS])
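
One caveat with filtering whitespace tokens is that a token such as `"is,"` does not match the stop word `"is"`. The sketch below illustrates this on a toy sentence and shows one possible workaround. The `toy` frame and the `remove_stop_words` helper are hypothetical names introduced here, and stripping a fixed set of punctuation characters is an assumption, not the only reasonable normalization.

```python
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical one-row corpus used only for illustration
toy = pd.DataFrame({"text": ["Is the cat on the mat? It is, and so is the dog."]})
toy["tokens"] = toy["text"].str.split()

# Naive filter: lowercase only, so tokens with attached punctuation slip through
toy["naive"] = toy["tokens"].apply(
    lambda tokens: [t for t in tokens if t.lower() not in ENGLISH_STOP_WORDS]
)

def remove_stop_words(tokens):
    """Drop tokens whose lowercase, punctuation-stripped form is a stop word."""
    kept = []
    for token in tokens:
        normalized = token.lower().strip(".,!?;:\"'()")
        if normalized and normalized not in ENGLISH_STOP_WORDS:
            kept.append(normalized)
    return kept

toy["filtered_tokens"] = toy["tokens"].apply(remove_stop_words)

print(toy["naive"][0])            # ['cat', 'mat?', 'is,', 'dog.']
print(toy["filtered_tokens"][0])  # ['cat', 'mat', 'dog']
```

Note that the helper returns the normalized tokens, so the original casing and punctuation are lost; keep the raw tokens in a separate column if they are needed later.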

## Stemming and Lemmatization

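Stemming and lemmatization both reduce words to a base form, called the stem or the lemma, respectively. Stemming strips common suffixes using heuristic rules, so the result is not always a real word. Lemmatization looks words up in a vocabulary and uses the part of speech to return a dictionary form. Note that NLTK's `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is passed, so a call without `pos` leaves verb forms such as "running" unchanged.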

In [56]:
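import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# The lemmatizer needs the WordNet corpus; download it if it is missing
# (some NLTK versions also require the 'omw-1.4' resource)
nltk.download('wordnet', quiet=True)

# Stemming
stemmer = PorterStemmer()
df['stemmed_tokens'] = df['filtered_tokens'].apply(lambda x: [stemmer.stem(word) for word in x])

# Lemmatization (no `pos` is passed, so every token is treated as a noun)
lemmatizer = WordNetLemmatizer()
df['lemmatized_tokens'] = df['filtered_tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])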
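
To see what rule-based suffix stripping does in practice, the following sketch applies the Porter stemmer to a few hand-picked words. The `toy` frame is a hypothetical stand-in for `df['filtered_tokens']`. Only stemming is shown here because it needs no extra NLTK data downloads.

```python
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical token lists standing in for df['filtered_tokens']
toy = pd.DataFrame({
    "filtered_tokens": [
        ["running", "studies", "flies"],
        ["connection", "connected", "connecting"],
    ]
})

# Stem every token in every row
toy["stemmed_tokens"] = toy["filtered_tokens"].apply(
    lambda tokens: [stemmer.stem(token) for token in tokens]
)

print(toy["stemmed_tokens"][0])  # ['run', 'studi', 'fli']
print(toy["stemmed_tokens"][1])  # ['connect', 'connect', 'connect']
```

The second row shows the benefit (related forms collapse to one stem), while the first row shows the cost (stems such as "studi" and "fli" are not dictionary words).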
