# Text Processing and Feature Extraction Pipeline

## Overview
This notebook demonstrates a comprehensive text processing pipeline using Python, spaCy, and scikit-learn. We'll explore:

- **Data preprocessing** with pandas
- **Natural Language Processing** using spaCy
- **Text cleaning and normalization**
- **Feature extraction** with Bag of Words and TF-IDF
- **N-gram analysis** for enhanced text representation

## Dataset
We're working with a collection of sentences about lemons and lemonade to demonstrate various text processing techniques.

---

In [None]:
# Importing libraries
import pandas as pd 

## 1. Data Setup and Initial Loading

First, we'll import the necessary libraries and set up our sample dataset.

In [2]:
data = [
    "When life gives you lemons, make lemonade! 🙂",
    "She bought 2 lemons for $1 at Maven Market.",
    "A dozen lemons will make a gallon of lemonade. [AllRecipes]",
    "lemon, lemon, lemons, lemon, lemon, lemons",
    "He's running to the market to get a lemon — there's a great sale today.",
    "Does Maven Market carry Eureka lemons or Meyer lemons?",
    "An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",
    "iced tea is my favorite"
]

### Creating Sample Dataset

Our dataset contains 8 sentences with various text characteristics:
- Mixed case text
- Punctuation and special characters
- Emojis and symbols
- Numbers and prices
- Citations and brackets
- Repeated words

In [3]:

# Convert list to DataFrame

data_df = pd.DataFrame(data, columns=['sentence'])

### Converting to DataFrame

Let's convert our list of sentences into a pandas DataFrame for easier manipulation.

In [4]:

# Set display options to show full content

pd.set_option('display.max_colwidth', None)


In [5]:

# Create a copy for spaCy processing

spacy_df = data_df.copy()



# Convert text to lowercase

spacy_df['clean_sentence'] = spacy_df['sentence'].str.lower()


In [6]:

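# Remove the literal '[wikipedia]' citation
# (regex=False so the brackets are not read as a regex character class)

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace('[wikipedia]', '', regex=False)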



# Advanced cleaning with regex

combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(combined, ' ', regex=True)

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(r'\s+', ' ', regex=True).str.strip()


## 2. Text Preprocessing and Cleaning

In this section, we'll clean our text data by:
- Converting to lowercase
- Removing URLs, email addresses, and social media handles
- Removing special characters and punctuation
- Normalizing whitespace
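
As a minimal sketch of the steps listed above, the same cleaning can be applied to a single string with Python's built-in `re` module. The pattern is the notebook's `combined` regex; the helper name `clean_text` is an assumption introduced here for illustration only.

```python
import re

# Same pattern as the notebook's `combined` regex: URLs, HTML tags,
# e-mail addresses, @handles, #hashtags, then any remaining character
# that is not a letter, digit, or whitespace.
COMBINED = re.compile(
    r"https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]"
)


def clean_text(text: str) -> str:
    """Lowercase, blank out unwanted spans, then collapse whitespace."""
    text = text.lower()
    text = COMBINED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "She bought 2 lemons for $1 at Maven Market."
cleaned = clean_text(sample)
print(cleaned)  # she bought 2 lemons for 1 at maven market
```

Replacing matches with a space rather than an empty string keeps neighbouring words from being glued together; the final whitespace pass then removes the extra gaps.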

In [28]:
%pip install -U spacy

Note: you may need to restart the kernel to use updated packages.


### Installing and Setting Up spaCy

spaCy is an industrial-strength Natural Language Processing library that we'll use for:
- Tokenization
- Lemmatization  
- Stop word removal
- Part-of-speech tagging

In [10]:
import spacy

In [11]:
# Download and install English language model

!python -m spacy download en_core_web_sm



# Load the pre-trained pipeline

nlp = spacy.load('en_core_web_sm')



# Process a sample sentence

phrase = spacy_df.clean_sentence[0] # "when life gives you lemons make lemonade"

doc = nlp(phrase)

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Loading spaCy Language Model

Now we'll download the English language model and explore tokenization with a sample sentence.

In [12]:

# Extract tokens as text strings

[token.text for token in doc]

# Output: ['when', 'life', 'gives', 'you', 'lemons', 'make', 'lemonade']



# Extract tokens as spaCy objects (with linguistic attributes)

[token for token in doc]

# Output: [when, life, gives, you, lemons, make, lemonade]


[when, life, gives, you, lemons, make, lemonade]

### Exploring Tokenization

Let's examine how spaCy breaks down our text into tokens and explore the difference between text strings and spaCy token objects.

In [13]:

# Extract lemmatized forms

[token.lemma_ for token in doc]

# Output: ['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']


['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']

### Understanding Lemmatization

Lemmatization reduces each word to its dictionary form (its lemma), so that inflected variants are counted as the same term. For example:
- "gives" → "give"
- "lemons" → "lemon"

In [14]:

# View all English stop words in spaCy

list(nlp.Defaults.stop_words)

print(f"Total stop words: {len(nlp.Defaults.stop_words)}") # 326 stop words



# Remove stop words

[token for token in doc if not token.is_stop]

# Output: [life, gives, lemons, lemonade]



# Combine lemmatization and stop word removal

[token.lemma_ for token in doc if not token.is_stop]

# Output: ['life', 'give', 'lemon', 'lemonade']



# Convert back to sentence format

norm = [token.lemma_ for token in doc if not token.is_stop]

' '.join(norm) # Output: 'life give lemon lemonade'


Total stop words: 326


'life give lemon lemonade'

### Working with Stop Words

Stop words are common words that typically don't carry much meaning for text analysis (e.g., "the", "is", "at"). Let's explore how to identify and remove them.
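
The idea behind stop word removal can be sketched without spaCy as a plain membership test. The `STOP_WORDS` set below is a tiny, hypothetical list chosen for this example; it is not spaCy's list, which is much larger, and `remove_stop_words` is an illustrative helper name.

```python
# Hypothetical, hand-picked stop list for illustration only.
STOP_WORDS = {"when", "you", "make", "the", "is", "at", "a", "to"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = "when life gives you lemons make lemonade".split()
kept = remove_stop_words(tokens)
print(kept)  # ['life', 'gives', 'lemons', 'lemonade']
```

In spaCy the same filter is expressed through the `token.is_stop` attribute, so no separate list has to be maintained.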

In [18]:

# Function for lemmatization and stop word removal

def token_lemma_stopw(text):

    # Tokenize, drop stop words, and keep the lemma of each remaining token

    doc = nlp(text)

    output = [token.lemma_ for token in doc if not token.is_stop]

    return ' '.join(output)



# Apply to entire dataset

spacy_df.clean_sentence.apply(token_lemma_stopw)


0                       life give lemon lemonade
1                     buy 2 lemon 1 maven market
2          dozen lemon gallon lemonade allrecipe
3            lemon lemon lemon lemon lemon lemon
4          s run market lemon s great sale today
5    maven market carry eureka lemon meyer lemon
6       arnold palmer half lemonade half ice tea
7                               ice tea favorite
Name: clean_sentence, dtype: object

In [20]:

def lower_replace(series):

    output = series.str.lower()

    combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

    output = output.str.replace(combined, ' ', regex=True)

    # Normalize whitespace, matching the step-by-step cleaning above

    output = output.str.replace(r'\s+', ' ', regex=True).str.strip()

    return output



def nlp_pipeline(series):

    output = lower_replace(series)

    output = output.apply(token_lemma_stopw)

    return output



# Apply complete pipeline

cleaned_text = nlp_pipeline(data_df.sentence)



# Save processed data for future use

pd.to_pickle(cleaned_text, 'preprocessed_text.pkl')


### Creating a Complete NLP Pipeline

Now let's combine all our preprocessing steps into a single pipeline function and save the processed data for feature extraction.

In [23]:
%pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.7.1-cp311-cp311-macosx_12_0_arm64.whl (8.7 MB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.1-cp311-cp311-macosx_14_0_arm64.whl (20.9 MB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.5.2 scikit-learn-1.7.1 scipy-1.16.1 thread

## 3. Feature Extraction with Scikit-Learn

Now we'll convert our preprocessed text into numerical features that machine learning algorithms can work with. We'll explore two main approaches:

1. **Bag of Words (Count Vectorizer)** - Counts word occurrences
2. **TF-IDF (Term Frequency-Inverse Document Frequency)** - Weights each word by how distinctive it is to a document

In [24]:

# Load preprocessed data

import pandas as pd

series = pd.read_pickle('preprocessed_text.pkl')



from sklearn.feature_extraction.text import CountVectorizer



# Create Count Vectorizer

cv = CountVectorizer()

bow = cv.fit_transform(series)



# Convert to DataFrame for visualization

pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out())


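|   | allrecipe | arnold | buy | carry | dozen | eureka | favorite | gallon | give | great | ... | life | market | maven | meyer | palmer | run | sale | tea | today | wikipedia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |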


### 3.1 Bag of Words (Count Vectorizer)

The Count Vectorizer creates a matrix where each row represents a document and each column represents a unique word. The values indicate how many times each word appears in each document.
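
To make the document-term matrix concrete, here is a rough plain-Python sketch that builds one by hand for three of the preprocessed sentences. It is an illustration of the idea, not scikit-learn's implementation: it splits on whitespace, whereas `CountVectorizer` uses its own token pattern.

```python
from collections import Counter

# Three rows taken from the preprocessed output shown earlier
docs = [
    "life give lemon lemonade",
    "lemon lemon lemon lemon lemon lemon",
    "ice tea favorite",
]

# Vocabulary: every unique word, sorted alphabetically (one column each)
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document; each cell is the count of that word in the document
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

The second row contains a single non-zero cell (6, in the `lemon` column), which shows how raw counts let one repeated word dominate a document's vector.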

In [25]:

# Count Vectorizer with filtering

cv1 = CountVectorizer(

    stop_words='english', # Remove English stop words

    ngram_range=(1,1), # Use only single words (unigrams)

    min_df=2  # Include words that appear in at least 2 documents

)



bow1 = cv1.fit_transform(series)

bow1_df = pd.DataFrame(bow1.toarray(), columns=cv1.get_feature_names_out())



# Calculate term frequencies

term_freq = bow1_df.sum()


### 3.2 Advanced Count Vectorizer Features

Let's explore more sophisticated features of the Count Vectorizer:
- **Stop word filtering**: Remove common words automatically
- **N-gram range**: Control whether to use single words or word combinations
- **Min document frequency**: Drop words that appear in fewer than `min_df` documents
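
The document-frequency filter can be sketched in plain Python as follows. The helper names `document_frequency` and `filter_vocab` are assumptions made for this example; the point is that `min_df` counts the number of documents containing a word, not the total number of occurrences.

```python
# Four rows taken from the preprocessed output shown earlier
docs = [
    "life give lemon lemonade",
    "buy 2 lemon 1 maven market",
    "maven market carry eureka lemon meyer lemon",
    "ice tea favorite",
]


def document_frequency(docs):
    """Count in how many documents each word appears (not how many times)."""
    df = {}
    for doc in docs:
        for word in set(doc.split()):  # set(): count each document once
            df[word] = df.get(word, 0) + 1
    return df


def filter_vocab(docs, min_df=2):
    """Return the sorted words that occur in at least `min_df` documents."""
    df = document_frequency(docs)
    return sorted(word for word, count in df.items() if count >= min_df)


print(filter_vocab(docs, min_df=2))  # ['lemon', 'market', 'maven']
```

Note that "lemon" appears twice in the third sentence but still adds only one to its document frequency.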

In [26]:

from sklearn.feature_extraction.text import TfidfVectorizer



# Basic TF-IDF vectorization

tv = TfidfVectorizer()

tvidf = tv.fit_transform(series)

tvidf_df = pd.DataFrame(tvidf.toarray(), columns=tv.get_feature_names_out())



# TF-IDF with filtering

tv1 = TfidfVectorizer(min_df=2) # Words must appear in at least 2 documents

tvidf1 = tv1.fit_transform(series)

tvidf1_df = pd.DataFrame(tvidf1.toarray(), columns=tv1.get_feature_names_out())


### 3.3 TF-IDF Vectorization

**TF-IDF (Term Frequency-Inverse Document Frequency)** weights each term by combining two components:
- **TF (Term Frequency)**: Measures how frequently a term appears in a document
- **IDF (Inverse Document Frequency)**: Measures how rare or common a term is across all documents
- **TF-IDF Score**: TF × IDF, which gives higher weights to terms that are frequent in a document but rare across the corpus
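
The scoring can be worked through by hand on a toy corpus. The sketch below assumes the weighting that `TfidfVectorizer` applies with its default settings (raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the three documents are made up for illustration, and whitespace splitting stands in for the vectorizer's tokenizer.

```python
import math
from collections import Counter

# Toy corpus, invented for this example
docs = ["lemon lemon lemonade", "lemon market", "ice tea"]
n_docs = len(docs)
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: number of documents that contain each word
df = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# Smoothed inverse document frequency (assumed default behaviour)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}


def tfidf_row(tokens):
    """TF x IDF for one document, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]
for word, value in zip(vocab, rows[0]):
    print(f"{word:10s} {value:.3f}")
```

Because "lemon" occurs in two of the three documents, its IDF is lower than that of "lemonade", which occurs in only one; the normalization step then makes rows comparable regardless of document length.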

In [27]:

# Bigram TF-IDF (pairs of consecutive words)

tv2 = TfidfVectorizer(ngram_range=(1,2)) # Include both unigrams and bigrams

tvidf2 = tv2.fit_transform(series)

tvidf2_df = pd.DataFrame(tvidf2.toarray(), columns=tv2.get_feature_names_out())



# Analyze feature importance

tvidf2_df.sum().sort_values(ascending=False)


lemon                 1.583310
lemon lemon           0.857624
market                0.767950
lemonade              0.743321
ice tea               0.625522
ice                   0.625522
tea                   0.625522
maven market          0.621858
maven                 0.621858
half                  0.505881
tea favorite          0.493436
favorite              0.493436
buy lemon             0.439482
buy                   0.439482
lemon maven           0.439482
life give             0.416207
life                  0.416207
give                  0.416207
give lemon            0.416207
lemon lemonade        0.416207
lemonade allrecipe    0.358685
lemon gallon          0.358685
allrecipe             0.358685
dozen                 0.358685
gallon                0.358685
dozen lemon           0.358685
gallon lemonade       0.358685
run                   0.319884
sale today            0.319884
market lemon          0.319884
sale                  0.319884
run market            0.319884
great   

### 3.4 N-gram Analysis and Feature Importance

Let's explore bigrams (pairs of consecutive words) to capture more context and analyze feature importance.
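
As a small sketch of what `ngram_range=(1,2)` adds to the vocabulary, the hypothetical helper `ngrams` below lists the unigrams and bigrams of one preprocessed sentence. It illustrates the concept only and is not the vectorizer's own code.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "life give lemon lemonade".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['life', 'give', 'lemon', 'lemonade']
print(bigrams)   # ['life give', 'give lemon', 'lemon lemonade']
```

These bigrams are the same feature names that appear in the TF-IDF output above, such as `life give` and `lemon lemonade`.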

## 4. Summary and Conclusions

In this notebook, we've successfully demonstrated a complete text processing pipeline:

### Key Achievements:
1. **Data Preprocessing**: Cleaned and normalized text data using regex and pandas
2. **NLP Pipeline**: Implemented tokenization, lemmatization, and stop word removal with spaCy
3. **Feature Extraction**: Created numerical representations using:
   - Bag of Words (Count Vectorizer)
   - TF-IDF Vectorization
   - N-gram analysis (unigrams and bigrams)

### Key Insights:
- **Text cleaning** is crucial for consistent results
- **Lemmatization** helps reduce vocabulary size while preserving meaning
- **Stop word removal** focuses on meaningful content
- **TF-IDF** down-weights terms that appear in many documents, so it is often more discriminative than raw word counts
- **N-grams** capture context that individual words might miss

### Next Steps:
This preprocessed data is now ready for:
- Machine learning classification tasks
- Clustering analysis
- Similarity comparisons
- Topic modeling
- Sentiment analysis
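
As one hedged illustration of a follow-up task, the sketch below compares preprocessed sentences with cosine similarity over raw word counts. It uses only the standard library; `cosine_similarity` here is a hand-written helper for this example, not a library function, and raw counts are used instead of TF-IDF to keep the example short.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Rows taken from the preprocessed output shown earlier
tea = "ice tea favorite"
palmer = "arnold palmer half lemonade half ice tea"
life = "life give lemon lemonade"

print(round(cosine_similarity(tea, palmer), 3))  # shares "ice" and "tea"
print(cosine_similarity(tea, life))              # no shared words -> 0.0
```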

---

**Libraries Used:**
- `pandas` - Data manipulation
- `spaCy` - Natural language processing
- `scikit-learn` - Feature extraction and machine learning