# Week 6: Python Text Processing & Pattern Matching
## Part 2: Pattern Matching with String Methods
### Wednesday, September 17, 2025

**Business Context**: Finding patterns in e-commerce data for automated categorization  
**Excel Bridge**: Moving from FIND, SEARCH functions to Python pattern matching

## Learning Objectives

By the end of this notebook, you will be able to:
1. Use pandas string methods for pattern detection
2. Find patterns in product names and categories
3. Create automated business classification rules
4. Filter and categorize data based on text patterns
5. Prepare for advanced regular expression patterns

## Setup and Data Import

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 60)

print("✅ Libraries imported successfully!")

In [None]:
# Load enhanced sample data for pattern matching

# E-commerce product data with patterns to identify
products_data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008'],
    'product_name': [
        'Smart Phone Samsung Galaxy S21',
        'Notebook Dell Inspiron 15',
        'Smartphone iPhone 13 Pro',
        'Tablet iPad Air 10.9',
        'Smart TV LG 55 4K',
        'Headphone Bluetooth Sony',
        'Smartphone Xiaomi Redmi Note',
        'Laptop Lenovo ThinkPad X1'
    ],
    'category': [
        'electronics_phones',
        'computers_laptops', 
        'electronics_phones',
        'computers_tablets',
        'electronics_tv',
        'electronics_audio',
        'electronics_phones',
        'computers_laptops'
    ],
    'price': [1200, 2500, 1800, 800, 2200, 300, 400, 3500],
    'brand': ['Samsung', 'Dell', 'Apple', 'Apple', 'LG', 'Sony', 'Xiaomi', 'Lenovo']
}

products_df = pd.DataFrame(products_data)

# Customer review data for sentiment pattern analysis
reviews_data = {
    'review_id': ['R001', 'R002', 'R003', 'R004', 'R005', 'R006'],
    'product_id': ['P001', 'P002', 'P001', 'P003', 'P004', 'P005'],
    'review_text': [
        'Excelente produto! Muito satisfeito com a compra.',
        'Produto chegou com defeito, não recomendo.',
        'ÓTIMO smartphone, bateria dura muito!',
        'Entrega rápida, produto conforme descrição.',
        'NÃO GOSTEI, qualidade ruim.',
        'Produto bom, preço justo, recomendo!'
    ],
    'review_score': [5, 1, 5, 4, 1, 4]
}

reviews_df = pd.DataFrame(reviews_data)

print("📊 Sample datasets loaded!")
print(f"Products dataset shape: {products_df.shape}")
print(f"Reviews dataset shape: {reviews_df.shape}")

print("\n🔍 First look at products:")
print(products_df.head())

## 1. Basic Pattern Detection Methods

### Excel Connection: FIND and SEARCH Functions
- Excel `FIND()` → Python `.str.find()` (case-sensitive; returns a 0-based position, or -1 when the text is not found)
- Excel `SEARCH()` → Python `.str.contains()` with `case=False` (case-insensitive)
- Excel `IF(ISNUMBER(FIND()))` → Python `.str.contains()`, which returns True/False directly
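
The difference between the position-style and the boolean-style methods is easiest to see side by side. The snippet below is a minimal sketch on a made-up three-item Series (the `names` data is illustrative and not part of the course dataset).

```python
import pandas as pd

# Illustrative data, not part of the course dataset
names = pd.Series(["Smart TV LG", "Tablet iPad", "smartwatch"])

# .str.find() returns the 0-based position of the first match, or -1 if absent
# (Excel's FIND is 1-based and returns an error instead of -1)
positions = names.str.find("Smart")
print(positions.tolist())

# .str.contains() returns True/False, which is what boolean filtering needs
found_exact = names.str.contains("Smart")
found_any_case = names.str.contains("Smart", case=False)
print(found_exact.tolist())
print(found_any_case.tolist())
```

Only `.str.contains()` produces a mask that can be passed straight into `df[...]`, which is why the rest of this notebook relies on it.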

### 1.1 Using .str.contains() for Pattern Detection

In [None]:
# Basic pattern detection with .str.contains()
print("🔍 Basic pattern detection:")

# Find all products whose name contains 'Smart' (this includes phones and Smart TVs)
contains_smart = products_df['product_name'].str.contains('Smart', case=False)
print("Products containing 'Smart':")
print(products_df[contains_smart][['product_name', 'category']])

print("\n" + "="*50)

# Find products with specific brands
apple_products = products_df['brand'].str.contains('Apple')
print("\nApple products:")
print(products_df[apple_products][['product_name', 'brand', 'price']])

print("\n" + "="*50)

# Case-sensitive vs case-insensitive comparison
case_sensitive = products_df['product_name'].str.contains('smart')  # lowercase
case_insensitive = products_df['product_name'].str.contains('smart', case=False)

print(f"\n📊 Pattern matching comparison:")
print(f"Case-sensitive 'smart': {case_sensitive.sum()} matches")
print(f"Case-insensitive 'smart': {case_insensitive.sum()} matches")
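
Real product data often has missing names, which the sample dataset above does not. The sketch below, on an illustrative Series, shows the `na=False` argument of `.str.contains()`; depending on the pandas version, leaving it out can return a missing value for the empty row, and a mask with missing values cannot be used for filtering.

```python
import pandas as pd

# Illustrative data with one missing product name
names = pd.Series(["Smart TV LG 55 4K", None, "Laptop Lenovo"])

# na=False treats a missing name as "no match", giving a clean boolean mask
clean_mask = names.str.contains("smart", case=False, na=False)

# The mask can now be used safely for boolean indexing
print(names[clean_mask].tolist())
```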

### 1.2 Using .str.startswith() and .str.endswith()

In [None]:
# Pattern detection based on prefix and suffix
print("🎯 Prefix and suffix pattern matching:")

# Find products starting with specific patterns
print("Products starting with 'Smart':")
starts_with_smart = products_df[products_df['product_name'].str.startswith('Smart')]
print(starts_with_smart[['product_name', 'category']])

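# .str.endswith() only matches literal text, so "ends with a digit"
# needs .str.contains() with a regex anchored at the end ($)
print("\nProducts whose name ends with a digit:")
ends_with_number = products_df[products_df['product_name'].str.contains(r'\d+$')]
print(ends_with_number[['product_name']])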

# Categories starting with specific prefixes
print("\n📁 Category analysis:")
electronics_categories = products_df[products_df['category'].str.startswith('electronics')]
computer_categories = products_df[products_df['category'].str.startswith('computers')]

print(f"Electronics categories: {len(electronics_categories)}")
print(f"Computer categories: {len(computer_categories)}")

print("\nUnique category prefixes:")
category_prefixes = products_df['category'].str.split('_').str[0].unique()
print(category_prefixes)
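
The cell above uses `.str.startswith()` but never calls `.str.endswith()` itself. Here is a small sketch on three illustrative names showing how it behaves, and why a regex has to go through `.str.contains()` instead.

```python
import pandas as pd

# Illustrative names in the same style as the products dataset
names = pd.Series([
    "Smartphone iPhone 13 Pro",
    "Tablet iPad Air 10.9",
    "Smart TV LG 55 4K",
])

# .str.endswith() compares literal text: no regex, no case option
ends_with_pro = names.str.endswith("Pro")

# A regex such as \d$ is treated as plain characters here, so nothing matches
literal_regex = names.str.endswith(r"\d$")

# For "ends with a digit", use .str.contains() with an anchored regex instead
ends_with_digit = names.str.contains(r"\d$")

print(ends_with_pro.tolist())
print(literal_regex.tolist())
print(ends_with_digit.tolist())
```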

### 1.3 Multiple Pattern Matching

In [None]:
# Advanced pattern matching with multiple conditions
print("🔗 Multiple pattern matching:")

# Using OR patterns with the regex | operator
# \b (word boundary) keeps 'Phone' from matching inside 'Headphone'
mobile_devices = products_df['product_name'].str.contains(r'\bPhone\b|Smartphone|Tablet', case=False)
print("Mobile devices (Phone, Smartphone, or Tablet):")
print(products_df[mobile_devices][['product_name', 'category', 'price']])

print("\n" + "="*50)

# Combining multiple conditions
expensive_electronics = (
    (products_df['category'].str.startswith('electronics')) & 
    (products_df['price'] > 1000)
)

print("\nExpensive electronics (> $1000):")
print(products_df[expensive_electronics][['product_name', 'price', 'brand']])

print("\n" + "="*50)

# Pattern matching with exclusions
not_apple = ~products_df['brand'].str.contains('Apple')
non_apple_mobile = mobile_devices & not_apple

print("\nMobile devices (excluding Apple):")
print(products_df[non_apple_mobile][['product_name', 'brand']])
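
Because `.str.contains()` treats its pattern as a regular expression by default, keywords containing characters such as `.` or `+` need escaping before they are joined with `|`. The following is a minimal sketch using `re.escape` from the standard library; the product names and keywords are made up for illustration.

```python
import re
import pandas as pd

# Illustrative data: keywords with regex special characters
names = pd.Series(["USB-C Cable 2.0", "Cable 200 Pack", "C++ Programming Book"])
keywords = ["C++", "2.0"]

# Escape each keyword so '+' and '.' are matched literally, then join with |
safe_pattern = "|".join(re.escape(word) for word in keywords)
matches = names.str.contains(safe_pattern)

print(safe_pattern)
print(matches.tolist())
```

Without escaping, the `.` in `2.0` would match any character, so `Cable 200 Pack` would be picked up as well.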

## 2. Business Pattern Classification

### 2.1 Automated Product Categorization

In [None]:
# Create automated business classification based on product names
print("🏷️ Automated product categorization:")

def classify_product_type(product_name):
    """Classify products based on name patterns"""
    name_lower = product_name.lower()
    
    # Check audio first: 'headphone' and 'earphone' both contain 'phone',
    # so they would otherwise be classified as mobile phones
    if any(device in name_lower for device in ['headphone', 'earphone', 'audio']):
        return 'Audio Device'
    elif any(device in name_lower for device in ['phone', 'smartphone']):
        return 'Mobile Phone'
    elif any(device in name_lower for device in ['laptop', 'notebook']):
        return 'Computer'
    elif 'tablet' in name_lower or 'ipad' in name_lower:
        return 'Tablet'
    elif any(device in name_lower for device in ['tv', 'television']):
        return 'Television'
    else:
        return 'Other Electronics'

# Apply classification
products_df['product_type'] = products_df['product_name'].apply(classify_product_type)

print("Product classification results:")
classification_summary = products_df.groupby('product_type').agg({
    'product_name': 'count',
    'price': ['mean', 'min', 'max']
}).round(2)

classification_summary.columns = ['Count', 'Avg_Price', 'Min_Price', 'Max_Price']
print(classification_summary)

print("\n📊 Detailed classification:")
print(products_df[['product_name', 'product_type', 'price']].sort_values('product_type'))
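
The same rule set can also be written without `.apply()`, as a list of boolean masks passed to `np.select`. This is one possible vectorized sketch, not the notebook's method; the four sample names and the exact patterns are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative subset of product names
names = pd.Series([
    "Smart Phone Samsung Galaxy S21",
    "Headphone Bluetooth Sony",
    "Laptop Lenovo ThinkPad X1",
    "Smart TV LG 55 4K",
])

# \b marks a word boundary, so 'phone' no longer matches inside 'Headphone'
conditions = [
    names.str.contains(r"\bphone\b|smartphone", case=False),
    names.str.contains(r"laptop|notebook", case=False),
    names.str.contains(r"\btv\b|television", case=False),
    names.str.contains(r"headphone|earphone", case=False),
]
labels = ["Mobile Phone", "Computer", "Television", "Audio Device"]

# np.select picks the label of the first condition that is True for each row
product_type = pd.Series(
    np.select(conditions, labels, default="Other Electronics"),
    index=names.index,
)
print(product_type.tolist())
```

As with the `if`/`elif` chain, the order of the conditions matters: when several masks are True for a row, the earliest one wins.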

### 2.2 Review Sentiment Pattern Analysis

In [None]:
# Analyze sentiment patterns in customer reviews
print("💭 Review sentiment pattern analysis:")

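# Define positive and negative keywords (the sample reviews are in Portuguese)
# e.g. 'excelente' = excellent, 'ótimo' = great, 'bom' = good, 'não' = no/not, 'ruim' = bad
positive_keywords = ['excelente', 'ótimo', 'bom', 'satisfeito', 'recomendo', 'rápida']
# Longer phrases come first: in a regex alternation the first alternative that
# matches wins, so a leading 'não' would shadow 'não gostei' and 'não recomendo'
negative_keywords = ['não gostei', 'não recomendo', 'defeito', 'não', 'ruim']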

# Create pattern detection for sentiment
positive_pattern = '|'.join(positive_keywords)
negative_pattern = '|'.join(negative_keywords)

reviews_df['has_positive_words'] = reviews_df['review_text'].str.contains(
    positive_pattern, case=False
)
reviews_df['has_negative_words'] = reviews_df['review_text'].str.contains(
    negative_pattern, case=False
)

# Count positive and negative keywords
reviews_df['positive_word_count'] = reviews_df['review_text'].str.lower().str.count(
    positive_pattern
)
reviews_df['negative_word_count'] = reviews_df['review_text'].str.lower().str.count(
    negative_pattern
)

print("Review sentiment analysis:")
sentiment_analysis = reviews_df[[
    'review_text', 'review_score', 'has_positive_words', 'has_negative_words',
    'positive_word_count', 'negative_word_count'
]]
print(sentiment_analysis)

In [None]:
# Create automated sentiment classification
def classify_sentiment(row):
    """Classify sentiment based on word patterns"""
    if row['positive_word_count'] > row['negative_word_count']:
        return 'Positive'
    elif row['negative_word_count'] > row['positive_word_count']:
        return 'Negative'
    else:
        return 'Neutral'

reviews_df['predicted_sentiment'] = reviews_df.apply(classify_sentiment, axis=1)

# Compare with actual review scores
print("\n🎯 Sentiment prediction vs actual scores:")
comparison = reviews_df[['review_score', 'predicted_sentiment', 'review_text']]
print(comparison)

# Calculate prediction accuracy
def score_to_sentiment(score):
    if score >= 4:
        return 'Positive'
    elif score <= 2:
        return 'Negative'
    else:
        return 'Neutral'

reviews_df['actual_sentiment'] = reviews_df['review_score'].apply(score_to_sentiment)
accuracy = (reviews_df['predicted_sentiment'] == reviews_df['actual_sentiment']).mean()

print(f"\n📊 Sentiment prediction accuracy: {accuracy:.1%}")
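
One limitation of plain keyword counting is negation: the positive keyword `recomendo` also appears inside the negative phrase `não recomendo`. The sketch below shows one possible way to handle this, by counting negated phrases first and removing them before counting positive words. The two reviews are taken from the sample data; the patterns are illustrative assumptions.

```python
import pandas as pd

# Two reviews from the sample data above
reviews = pd.Series([
    "Produto chegou com defeito, não recomendo.",
    "Produto bom, preço justo, recomendo!",
])
text = reviews.str.lower()

# Count negated phrases first, then blank them out so that the
# 'recomendo' inside 'não recomendo' is not also counted as positive
negated_pattern = r"não recomendo|não gostei"
negated_count = text.str.count(negated_pattern)
text_without_negations = text.str.replace(negated_pattern, " ", regex=True)

# \b keeps the keywords from matching inside longer words
positive_count = text_without_negations.str.count(r"\brecomendo\b|\bbom\b")

print(negated_count.tolist())
print(positive_count.tolist())
```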

## 3. Nigerian Market Adaptation Patterns

In [None]:
# Adapt pattern matching for Nigerian e-commerce context
print("🇳🇬 Nigerian market pattern adaptation:")

# Create Nigerian product database
nigerian_products = {
    'product_id': ['NG001', 'NG002', 'NG003', 'NG004', 'NG005', 'NG006'],
    'product_name': [
        'Tecno Spark 8 Smartphone - Lagos Store',
        'Infinix Note 12 - Konga Nigeria',
        'Itel A48 Pro - Jumia NG',
        'Samsung Galaxy A12 - Slot Nigeria',
        'iPhone 12 - Computer Village Lagos',
        'Redmi Note 11 - Abuja Electronics'
    ],
    'source': ['local_store', 'online_konga', 'online_jumia', 'slot_stores', 'computer_village', 'local_store'],
    'location': ['Lagos', 'Nigeria', 'Nigeria', 'Nigeria', 'Lagos', 'Abuja']
}

nigerian_df = pd.DataFrame(nigerian_products)

# Identify brands that are popular in the Nigerian market
nigerian_brands = ['Tecno', 'Infinix', 'Itel']
nigerian_brand_pattern = '|'.join(nigerian_brands)

nigerian_df['is_nigerian_brand'] = nigerian_df['product_name'].str.contains(
    nigerian_brand_pattern, case=False
)

# Identify sales channels
nigerian_df['is_online'] = nigerian_df['product_name'].str.contains(
    'Konga|Jumia', case=False
)
nigerian_df['is_computer_village'] = nigerian_df['product_name'].str.contains(
    'Computer Village', case=False
)

print("Nigerian market analysis:")
market_analysis = nigerian_df[[
    'product_name', 'is_nigerian_brand', 'is_online', 'is_computer_village'
]]
print(market_analysis)

print("\n📊 Market distribution:")
print(f"Products from brands popular in Nigeria: {nigerian_df['is_nigerian_brand'].sum()}")
print(f"Online marketplace products: {nigerian_df['is_online'].sum()}")
print(f"Computer Village products: {nigerian_df['is_computer_village'].sum()}")
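
The cell above detects sales channels from the free-text product name. When a structured column such as `source` is available, matching on it is usually less fragile. The sketch below assumes the `online_` prefix convention seen in the sample `source` values; the `marketplace` column is a hypothetical addition.

```python
import pandas as pd

# Two rows in the same shape as the Nigerian products data
channels = pd.DataFrame({
    "product_name": [
        "Infinix Note 12 - Konga Nigeria",
        "Tecno Spark 8 Smartphone - Lagos Store",
    ],
    "source": ["online_konga", "local_store"],
})

# The 'source' codes share an 'online_' prefix, so a prefix test is enough
channels["is_online"] = channels["source"].str.startswith("online_")

# The marketplace name is whatever follows the prefix (missing for offline rows)
channels["marketplace"] = channels["source"].str.extract(r"^online_(\w+)$", expand=False)

print(channels[["source", "is_online", "marketplace"]])
```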

## 4. Advanced Pattern Applications

### 4.1 Price Range Classification

In [None]:
# Create price-based classification using product name patterns
print("💰 Price range classification based on product patterns:")

# Identify premium indicators in product names
premium_keywords = ['Pro', 'Max', 'Ultra', 'Premium', 'Plus']
budget_keywords = ['Lite', 'Go', 'Essential', 'Basic']

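# \b word boundaries keep short keywords from matching inside longer words
# (e.g. 'Pro' in 'Product', 'Go' in 'Google')
premium_pattern = r'\b(?:' + '|'.join(premium_keywords) + r')\b'
budget_pattern = r'\b(?:' + '|'.join(budget_keywords) + r')\b'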

products_df['has_premium_keywords'] = products_df['product_name'].str.contains(
    premium_pattern, case=False
)
products_df['has_budget_keywords'] = products_df['product_name'].str.contains(
    budget_pattern, case=False
)

# Analyze correlation between keywords and actual prices
print("Price analysis by product name patterns:")
price_analysis = products_df.groupby(['has_premium_keywords', 'has_budget_keywords'])['price'].agg([
    'count', 'mean', 'min', 'max'
]).round(2)

print(price_analysis)

print("\n🎯 Products with premium keywords:")
premium_products = products_df[products_df['has_premium_keywords']]
if not premium_products.empty:
    print(premium_products[['product_name', 'price']])
else:
    print("No products found with premium keywords in current dataset")
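
To compare name keywords against actual price ranges, prices can be bucketed into tiers with `pd.cut`. This is a minimal sketch; the tier boundaries (500 and 2000) and the tier labels are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Three products from the sample dataset
prices = pd.DataFrame({
    "product_name": [
        "Smartphone iPhone 13 Pro",
        "Headphone Bluetooth Sony",
        "Laptop Lenovo ThinkPad X1",
    ],
    "price": [1800, 300, 3500],
})

# Hypothetical tier boundaries, chosen only for illustration
bins = [0, 500, 2000, float("inf")]
tier_labels = ["Budget", "Mid-range", "Premium"]
prices["price_tier"] = pd.cut(prices["price"], bins=bins, labels=tier_labels)

# Compare the name-based flag with the price-based tier
prices["has_premium_keyword"] = prices["product_name"].str.contains(
    r"\b(?:Pro|Max|Ultra)\b", case=False
)

print(prices)
```

With these boundaries, the only product with a premium keyword lands in the middle tier, while the most expensive product has no premium keyword at all. Name keywords are a hint about pricing, not a substitute for the price column.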

### 4.2 Model/Version Pattern Extraction

In [None]:
# Extract model numbers and versions from product names
print("🔢 Model and version pattern extraction:")

# Extract model information using string methods
products_df['has_model_number'] = products_df['product_name'].str.contains(r'\d+', regex=True)

# Extract specific patterns
import re

def extract_model_info(product_name):
    """Extract model information from product names"""
    # Look for common model patterns
    patterns = {
        'iphone_model': r'iPhone\s*(\d+)',
        'galaxy_model': r'Galaxy\s*(\w+)',
        'series_number': r'(\d+)\s*Pro|Pro\s*(\d+)',
        'version_number': r'\b(\d+\.\d+)\b',
        'generation': r'\b(\d+)(?:st|nd|rd|th)?\s*Gen\b'
    }
    
    results = {}
    for pattern_name, pattern in patterns.items():
        match = re.search(pattern, product_name, re.IGNORECASE)
        # 'series_number' has two alternative groups; keep whichever one matched
        # (group(1) alone is None when the 'Pro 16' form matches)
        results[pattern_name] = next((g for g in match.groups() if g is not None), None) if match else None
    
    return results

# Apply model extraction to a few examples
sample_products = [
    'iPhone 13 Pro Max 128GB',
    'Samsung Galaxy S21 Ultra 5G',
    'iPad Air 10.9 4th Gen',
    'MacBook Pro 16 M1 Pro'
]

print("Model extraction examples:")
for product in sample_products:
    model_info = extract_model_info(product)
    print(f"\n{product}:")
    for key, value in model_info.items():
        if value:
            print(f"  {key}: {value}")

# Apply to our dataset using simpler string methods
print("\n📱 Products with model numbers in our dataset:")
model_products = products_df[products_df['has_model_number']]
print(model_products[['product_name', 'has_model_number']])
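
For whole columns, pandas offers `.str.extract()`, which applies a regex with capture groups to every row at once. The sketch below runs it on three names from the sample dataset; the group names `first_number` and `galaxy_model` are illustrative choices.

```python
import re
import pandas as pd

# Three names from the sample dataset
names = pd.Series([
    "Smart Phone Samsung Galaxy S21",
    "Tablet iPad Air 10.9",
    "Headphone Bluetooth Sony",
])

# Named groups become column names; rows without a match get NaN
extracted = names.str.extract(r"(?P<first_number>\d+(?:\.\d+)?)")
galaxy = names.str.extract(r"Galaxy\s+(?P<galaxy_model>\w+)", flags=re.IGNORECASE)

print(extracted)
print(galaxy)
```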

## 5. Practice Exercises

### Exercise 1: E-commerce Category Classifier
**Task**: Create a pattern-based product classifier for multiple categories

In [None]:
# Exercise 1: YOUR CODE HERE
exercise_products = {
    'product_name': [
        'Nike Air Force 1 Sneakers White',
        'Adidas Ultraboost 22 Running Shoes',
        'The Alchemist by Paulo Coelho',
        'Harry Potter Complete Book Set',
        'Instant Pot Duo 7-in-1 Electric Pressure Cooker',
        'KitchenAid Stand Mixer Artisan',
        'Sony WH-1000XM4 Noise Cancelling Headphones',
        'Apple AirPods Pro 2nd Generation'
    ]
}

exercise_df = pd.DataFrame(exercise_products)

print("🎯 Exercise 1: Classify these products into categories")
print(exercise_df)

# TODO: Create pattern-based classification for:
# - Footwear (shoes, sneakers, boots)
# - Books (book, novel, set)
# - Kitchen (cooker, mixer, pot)
# - Electronics (headphones, airpods, speaker)

# Your solution here:
# def classify_product_category(product_name):
#     # Your classification logic
#     pass

# exercise_df['category'] = exercise_df['product_name'].apply(classify_product_category)

print("\n💡 Hint: Use .str.contains() with multiple keywords for each category")

### Exercise 2: Review Quality Filter
**Task**: Create filters to identify high-quality vs low-quality reviews

In [None]:
# Exercise 2: YOUR CODE HERE
sample_reviews = {
    'review_text': [
        'Great product! Excellent quality and fast shipping. Highly recommend!',
        'ok',
        'TERRIBLE PRODUCT!!! DO NOT BUY!!!',
        'The product arrived on time and matches the description perfectly. Good value for money.',
        'bad',
        'Outstanding customer service and product quality. Will definitely order again.'
    ],
    'review_score': [5, 3, 1, 4, 2, 5]
}

review_exercise_df = pd.DataFrame(sample_reviews)

print("🎯 Exercise 2: Identify review quality patterns")
print(review_exercise_df)

# TODO: Create quality indicators:
# 1. Review length (short reviews might be low quality)
# 2. Excessive caps (ALL CAPS might indicate emotion)
# 3. Detailed feedback (specific words like 'quality', 'shipping', 'service')
# 4. Balanced sentiment (both positive and constructive feedback)

# Your solution here:
# review_exercise_df['review_length'] = ...
# review_exercise_df['has_excessive_caps'] = ...
# review_exercise_df['has_detailed_feedback'] = ...
# review_exercise_df['quality_score'] = ...

print("\n💡 Hint: Combine string length, case analysis, and keyword detection")

## 6. Key Takeaways and Next Steps

### 🎯 What We Learned Today

1. **Pattern Detection Methods**: `.str.contains()`, `.str.startswith()`, `.str.endswith()`
2. **Multiple Pattern Matching**: Using `|` inside a pattern for OR, and `&`, `|`, `~` to combine boolean masks (AND, OR, NOT)
3. **Business Classification**: Automated categorization based on text patterns
4. **Sentiment Analysis**: Pattern-based sentiment detection in reviews
5. **Market Adaptation**: Customizing patterns for Nigerian e-commerce context

### 🔄 Excel to Python Translation
- `FIND()` → `.str.find()` (returns a position) or `.str.contains()` (returns True/False)
- `SEARCH()` → `.str.contains(case=False)`
- `IF(FIND())` → Boolean masks with `.str.contains()`
- Multiple conditions → Combine with `&` and `|` operators

### 📝 Best Practices
1. Use `case=False` for case-insensitive pattern matching
2. Combine multiple patterns with `|` for OR conditions
3. Use boolean indexing to filter DataFrames based on patterns
4. Create reusable functions for complex classification logic
5. Test patterns on sample data before applying to large datasets
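
Practice 5 can be as simple as a short list of hand-written cases checked against the pattern. The following is a minimal sketch; the pattern and the four cases are illustrative and reuse names from the sample dataset.

```python
import pandas as pd

# Pattern under test: match phones but not headphones
pattern = r"\bphone\b|smartphone"

# Hand-written cases: (text, should it match?)
cases = [
    ("Smart Phone Samsung Galaxy S21", True),
    ("Smartphone Xiaomi Redmi Note", True),
    ("Headphone Bluetooth Sony", False),
    ("Laptop Lenovo ThinkPad X1", False),
]

texts = pd.Series([text for text, _ in cases])
expected = [want for _, want in cases]
actual = texts.str.contains(pattern, case=False).tolist()

# Any text listed here is one the pattern gets wrong
failures = [text for text, want, got in zip(texts, expected, actual) if want != got]
print(failures)
```

An empty `failures` list means the pattern behaves as intended on every case. Adding known tricky names (such as `Headphone`) to the case list is the main point of the exercise.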

### ⏭️ Coming Next
**Part 3**: Regular Expressions for Advanced Pattern Matching

In [None]:
# Summary of pattern matching results
print("📊 Pattern Matching Summary:")
print(f"\nProducts analyzed: {len(products_df)}")
print(f"Product types identified: {products_df['product_type'].nunique()}")
print(f"Reviews analyzed: {len(reviews_df)}")
print(f"Sentiment prediction accuracy: {accuracy:.1%}")

print("\n🎉 Great job completing Part 2: Pattern Matching!")
print("\n📚 You're now ready for Part 3: Regular Expressions")
print("\n🔍 Next, we'll learn advanced pattern matching with:")
print("   • Regular expression syntax and special characters")
print("   • Complex pattern extraction and replacement")
print("   • Business applications of regex in data processing")