# Week 6: Python Text Processing & Pattern Matching
## Part 1: String Methods Fundamentals
### Wednesday, September 17, 2025

**Business Context**: Cleaning and standardizing e-commerce data for Nigerian marketplace analysis  
**Excel Bridge**: Moving from TEXT, UPPER, LOWER, TRIM functions to Python string methods

## Learning Objectives

By the end of this notebook, you will be able to:
1. Use basic Python string methods for data cleaning
2. Apply pandas `.str` accessor for vectorized string operations
3. Clean and standardize text data in business datasets
4. Handle missing values during string operations
5. Create standardized text formats for analysis

## Setup and Data Import

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("✅ Libraries imported successfully!")

In [None]:
# Load sample data (simulating Olist dataset structure)
# In real scenario, this would be loaded from the datasets folder

# Sample customer data with text cleaning challenges
customers_data = {
    'customer_id': ['CUST001', 'CUST002', 'CUST003', 'CUST004', 'CUST005', 'CUST006'],
    'customer_city': ['  sao paulo  ', 'RIO DE JANEIRO', 'Belo Horizonte', ' brasilia', 'CURITIBA ', 'sao paulo'],
    'customer_state': ['SP', 'rj', 'MG', 'df', 'PR', 'SP'],
    'customer_zip': [1000, 2000, 3000, 7000, 8000, 1100]
}

customers_df = pd.DataFrame(customers_data)

# Sample product categories with text standardization needs
categories_data = {
    'product_category_name': ['cama_mesa_banho', 'esporte_lazer', 'moveis_decoracao', 'beleza_saude', 'utilidades_domesticas'],
    'product_category_name_english': ['bed_bath_table', 'sports_leisure', 'furniture_decor', 'health_beauty', 'housewares'],
    'product_count': [3029, 2867, 2657, 2444, 2335]
}

categories_df = pd.DataFrame(categories_data)

print("📊 Sample datasets created!")
print(f"Customers dataset shape: {customers_df.shape}")
print(f"Categories dataset shape: {categories_df.shape}")

## 1. Basic String Methods

### Excel Connection: TEXT Functions
- Excel `UPPER()` → Python `.upper()`
- Excel `LOWER()` → Python `.lower()`
- Excel `PROPER()` → Python `.title()`
- Excel `TRIM()` → Python `.strip()`
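
One difference is worth seeing before relying on the last mapping: Excel's `TRIM()` also collapses repeated spaces inside the text, while `.strip()` only removes whitespace at the two ends. The sketch below uses a made-up sample value (not part of the course dataset) and shows one common way to get closer to `TRIM()` behaviour with plain Python.

```python
# Hypothetical messy value with padding and repeated internal spaces
raw = "  port   harcourt  "

# .strip() only removes leading and trailing whitespace
stripped = raw.strip()

# Splitting on whitespace and re-joining with single spaces also collapses
# the internal runs, which is closer to what Excel's TRIM() does
trimmed = " ".join(raw.split())

print(repr(stripped))  # 'port   harcourt'
print(repr(trimmed))   # 'port harcourt'
```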

### 1.1 Case Standardization

In [None]:
# Display current customer data to see inconsistencies
print("🔍 Current customer data with inconsistent casing:")
print(customers_df[['customer_city', 'customer_state']].head())
print("\n" + "="*50)

# Demonstrate single string operations first
sample_city = '  sao paulo  '
print(f"\n📝 Single string operations on: '{sample_city}'")
print(f"Original: '{sample_city}'")
print(f"Upper: '{sample_city.upper()}'")
print(f"Lower: '{sample_city.lower()}'")
print(f"Title: '{sample_city.title()}'")
print(f"Stripped: '{sample_city.strip()}'")
print(f"Combined (strip + title): '{sample_city.strip().title()}'")

In [None]:
# Apply string methods to pandas DataFrame columns using .str accessor
print("🔧 Applying case standardization to entire columns:")

# Create a copy to show before/after comparison
customers_clean = customers_df.copy()

# Standardize city names to title case (proper case)
customers_clean['city_cleaned'] = customers_clean['customer_city'].str.strip().str.title()

# Standardize state names to uppercase
customers_clean['state_cleaned'] = customers_clean['customer_state'].str.upper()

# Display before and after
comparison_df = customers_clean[['customer_city', 'city_cleaned', 'customer_state', 'state_cleaned']]
print(comparison_df)
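
The sample data above has no missing cities, so it does not show how the `.str` accessor behaves when a value is absent. The following is a minimal sketch with a hypothetical three-row Series. It illustrates that vectorized `.str` methods leave missing values as missing, whereas calling plain string methods through `.apply()` fails on them.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing city
cities = pd.Series(["  lagos ", np.nan, "ABUJA"])

# Vectorized .str methods skip missing values and leave them missing
cleaned = cities.str.strip().str.title()
print(cleaned)

# Plain string methods applied element by element fail on the missing entry,
# because NaN is a float and has no .strip() method
apply_failed = False
try:
    cities.apply(lambda s: s.strip().title())
except AttributeError as exc:
    apply_failed = True
    print(f"apply() failed: {exc}")
```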

### 1.2 Text Length Analysis
**Excel Connection**: `LEN()` function → Python `len()` and `.str.len()`

In [None]:
# Analyze text lengths for quality assessment
print("📏 Text length analysis:")

# Add length columns
customers_clean['city_original_length'] = customers_clean['customer_city'].str.len()
customers_clean['city_cleaned_length'] = customers_clean['city_cleaned'].str.len()

# Show the comparison
length_comparison = customers_clean[['customer_city', 'city_cleaned', 'city_original_length', 'city_cleaned_length']]
print(length_comparison)

print("\n📊 Length statistics:")
print(f"Average original length: {customers_clean['city_original_length'].mean():.1f}")
print(f"Average cleaned length: {customers_clean['city_cleaned_length'].mean():.1f}")
print(f"Characters saved by cleaning: {(customers_clean['city_original_length'] - customers_clean['city_cleaned_length']).sum()}")

## 2. String Modification Methods

### Excel Connection: SUBSTITUTE and Text Manipulation
- Excel `SUBSTITUTE()` → Python `.replace()`
- Excel `MID()` → Python string slicing `[start:end]`
- Excel `CONCATENATE()` → Python `.join()` and f-strings
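
Before moving to DataFrame columns, here is a small sketch of the three mappings above on a single Python string. The category value comes from the sample data; the `parts` list is made up for illustration. Note that Excel's `MID()` counts from 1 and takes a length, while Python slicing counts from 0 and takes an exclusive end position.

```python
category = "cama_mesa_banho"

# SUBSTITUTE(A1, "_", " ")  ->  str.replace()
readable = category.replace("_", " ")

# MID(A1, 6, 4) starts at character 6 (1-based) and takes 4 characters;
# the equivalent slice starts at index 5 (0-based) and stops before index 9
middle = category[5:9]

# CONCATENATE() or &  ->  .join() for a list of parts, f-string for a template
parts = ["Sao Paulo", "SP", "Brazil"]  # hypothetical values
joined = ", ".join(parts)
label = f"{readable.title()} ({len(category)} characters)"

print(readable)  # cama mesa banho
print(middle)    # mesa
print(joined)    # Sao Paulo, SP, Brazil
print(label)     # Cama Mesa Banho (15 characters)
```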

### 2.1 Text Replacement

In [None]:
# Work with product categories that need standardization
print("🏷️ Product category standardization:")
print("Original categories:")
print(categories_df['product_category_name'].tolist())

# Replace underscores with spaces for better readability
categories_df['category_readable'] = categories_df['product_category_name'].str.replace('_', ' ')

# Create business-friendly names
categories_df['category_display'] = categories_df['category_readable'].str.title()

print("\n✨ After standardization:")
display_comparison = categories_df[['product_category_name', 'category_readable', 'category_display']]
print(display_comparison)

In [None]:
# Multiple replacements and advanced cleaning
print("🔧 Advanced text cleaning with multiple replacements:")

# Create some messy review data for demonstration
messy_reviews = pd.Series([
    'PRODUTO EXCELENTE!!!',
    'muito  bom produto...',
    'NÃO GOSTEI NADA!!!',
    'produto   ok, mas   pode melhorar.',
    'RECOMENDO!! produto TOP!!!'
])

print("Original reviews:")
for i, review in enumerate(messy_reviews):
    print(f"{i+1}. '{review}'")

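# Clean the reviews
# Regex patterns collapse runs of any length; a single literal pass that
# replaces '!!' with '!' would turn '!!!' into '!!' and leave three spaces as two
cleaned_reviews = (
    messy_reviews
    .str.replace(r'!+', '!', regex=True)      # Collapse repeated exclamation marks
    .str.replace(r'\.{2,}', '.', regex=True)  # Collapse repeated periods
    .str.replace(r' {2,}', ' ', regex=True)   # Collapse repeated spaces
    .str.strip()  # Remove leading/trailing spaces
    .str.title()  # Standardize case
)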

print("\n✨ Cleaned reviews:")
for i, review in enumerate(cleaned_reviews):
    print(f"{i+1}. '{review}'")

### 2.2 String Slicing and Extraction
**Excel Connection**: `LEFT()`, `RIGHT()`, `MID()` functions

In [None]:
# Extract parts of strings for analysis
print("✂️ String slicing and extraction:")

# Work with customer cities to extract regional information
cities_sample = ['sao paulo', 'rio de janeiro', 'belo horizonte', 'porto alegre']
cities_df = pd.DataFrame({'full_city_name': cities_sample})

# Extract first word (often the main city name)
cities_df['first_word'] = cities_df['full_city_name'].str.split(' ').str[0]

# Extract first 3 characters (city code)
cities_df['city_code'] = cities_df['full_city_name'].str[:3].str.upper()

# Get city name length category
cities_df['name_length'] = cities_df['full_city_name'].str.len()
cities_df['length_category'] = cities_df['name_length'].apply(
    lambda x: 'Short' if x <= 8 else 'Medium' if x <= 12 else 'Long'
)

print(cities_df)
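
The cell above only uses the `LEFT()` style of slicing. As a rough sketch of the other two Excel functions, the example below applies `.str[...]` and `.str.slice()` to customer IDs that follow the same pattern as the sample data. The variable names are illustrative only.

```python
import pandas as pd

# IDs in the same format as the sample customer data
ids = pd.Series(["CUST001", "CUST002", "CUST003"])

prefix = ids.str[:4]           # LEFT(A1, 4)
suffix = ids.str[-3:]          # RIGHT(A1, 3)
middle = ids.str.slice(2, 5)   # MID(A1, 3, 3): 1-based start 3 is index 2

# The numeric part can then be converted for sorting or arithmetic
number = suffix.astype(int)

print(pd.DataFrame({"id": ids, "prefix": prefix, "suffix": suffix,
                    "middle": middle, "number": number}))
```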

### 2.3 Text Concatenation and Formatting
**Excel Connection**: `CONCATENATE()` and `&` operator

In [None]:
# Create standardized business formats
print("🔗 Text concatenation and formatting:")

# Create standardized customer location format
customers_clean['formatted_location'] = (
    customers_clean['city_cleaned'] + ', ' + customers_clean['state_cleaned']
)

# Create shipping labels using f-strings (modern Python approach)
customers_clean['shipping_label'] = customers_clean.apply(
    lambda row: f"{row['city_cleaned'].upper()} - {row['customer_zip']}", axis=1
)

# Using .str.cat() method for concatenation
customers_clean['location_with_zip'] = customers_clean['city_cleaned'].str.cat(
    [customers_clean['state_cleaned'], customers_clean['customer_zip'].astype(str)], 
    sep=' | '
)

print("🏷️ Formatted location examples:")
location_formats = customers_clean[['customer_id', 'formatted_location', 'shipping_label', 'location_with_zip']].head(3)
print(location_formats)
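
Concatenation deserves extra care when a column has missing values, which the sample data does not. The sketch below, built on two hypothetical Series, shows that the `+` operator turns the whole result into a missing value, and that the `na_rep` argument of `.str.cat()` is one way to substitute a placeholder instead. The placeholder text `'Unknown'` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing state
city = pd.Series(["Sao Paulo", "Curitiba"])
state = pd.Series(["SP", np.nan])

# The + operator propagates the missing value to the whole result
plus_result = city + ", " + state

# .str.cat() does the same by default, but na_rep substitutes a placeholder
cat_result = city.str.cat(state, sep=", ", na_rep="Unknown")

print(plus_result)
print(cat_result)
```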

## 3. Business Application: Nigerian E-commerce Adaptation

In [None]:
# Adapt Brazilian patterns for Nigerian market
print("🇳🇬 Nigerian e-commerce adaptation:")

# Create Nigerian city data
nigerian_cities = {
    'raw_city_name': ['  LAGOS  ', 'abuja', 'Port Harcourt', 'KANO', ' ibadan ', 'Benin City'],
    'raw_state': ['lagos', 'FCT', 'rivers', 'kano', 'oyo', 'edo'],
    'customer_id': ['NG001', 'NG002', 'NG003', 'NG004', 'NG005', 'NG006']
}

nigerian_df = pd.DataFrame(nigerian_cities)

# Standardize Nigerian city and state names
nigerian_df['city_standardized'] = nigerian_df['raw_city_name'].str.strip().str.title()
nigerian_df['state_standardized'] = nigerian_df['raw_state'].str.upper()

# Create full Nigerian address format
# (no ' City' or ' State' suffix: it would produce 'Benin City City' and 'FCT State')
nigerian_df['full_address'] = (
    nigerian_df['city_standardized'] + ', ' + nigerian_df['state_standardized'] + ', Nigeria'
)

print("✨ Standardized Nigerian locations:")
print(nigerian_df[['raw_city_name', 'city_standardized', 'state_standardized', 'full_address']])

## 4. Data Quality Assessment with String Operations

In [None]:
# Assess data quality issues in text fields
print("🔍 Data quality assessment:")

# Create a dataset with various quality issues
quality_test_data = {
    'customer_id': ['Q001', 'Q002', 'Q003', 'Q004', 'Q005', 'Q006'],
    'city_name': ['São Paulo', '', '   ', 'UNKNOWN', 'N/A', 'Test City'],
    'state_code': ['SP', 'XX', '', '??', 'RJ', 'MG']
}

quality_df = pd.DataFrame(quality_test_data)

print("Original data with quality issues:")
print(quality_df)

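# Identify different types of quality issues
city_stripped = quality_df['city_name'].str.strip()
quality_issues = {
    'empty_strings': (quality_df['city_name'] == '').sum(),
    # Non-empty values that become empty once stripped (excludes true empty strings)
    'whitespace_only': ((city_stripped == '') & (quality_df['city_name'] != '')).sum(),
    'unknown_values': quality_df['city_name'].str.contains('UNKNOWN|N/A', case=False, na=False).sum(),
    # Raw string so the regex escapes \? are not treated as Python string escapes
    'suspicious_states': quality_df['state_code'].str.contains(r'XX|\?\?', na=False).sum(),
    'missing_values': quality_df['city_name'].isna().sum()
}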

print("\n📊 Data quality summary:")
for issue, count in quality_issues.items():
    print(f"{issue.replace('_', ' ').title()}: {count}")
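
The assessment above counts the problems but leaves the values in place. One possible next step, sketched here under the assumption that empty strings, `UNKNOWN` and `N/A` should all be treated as missing, is to convert them to real missing values so that `.isna()` and `.fillna()` work on them. The `placeholders` list is an assumption and would need adjusting for a real dataset.

```python
import pandas as pd

# Same city values as in the quality test data
city = pd.Series(["São Paulo", "", "   ", "UNKNOWN", "N/A", "Test City"])

# Assumed list of placeholder tokens (compared after stripping and upper-casing)
placeholders = ["", "UNKNOWN", "N/A"]

stripped = city.str.strip()

# mask() replaces values with NaN wherever the condition is True
normalized = stripped.mask(stripped.str.upper().isin(placeholders))

print(normalized)
print(f"Missing after normalization: {normalized.isna().sum()}")
```

Whether a value such as `Test City` should also be flagged is a business decision, so the sketch leaves it untouched.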

## 5. Practice Exercises

### Exercise 1: Product Category Standardization
**Task**: Clean and standardize product category names for Nigerian market

In [None]:
# Exercise 1: YOUR CODE HERE
# Create a dataset with messy product categories
messy_products = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'category_raw': ['fashion_bags_and_accessories', 'ELECTRONICS_AND_COMPUTERS', '  home_and_garden  ', 'sports&leisure', 'health_beauty']
}

products_exercise = pd.DataFrame(messy_products)

print("🎯 Exercise 1: Clean these product categories")
print(products_exercise)

# TODO: 
# 1. Replace underscores and special characters with spaces
# 2. Standardize case to title case
# 3. Remove extra whitespace
# 4. Create a display-friendly format

# Your solution here:
# products_exercise['category_clean'] = ...

print("\n💡 Hint: Use .str.replace(), .str.strip(), and .str.title()")

### Exercise 2: Customer Data Standardization
**Task**: Create a standardized customer contact format

In [None]:
# Exercise 2: YOUR CODE HERE
customer_contacts = {
    'customer_id': ['C001', 'C002', 'C003'],
    'first_name': ['  john  ', 'MARY', 'peter'],
    'last_name': ['DOE', '  smith  ', 'JONES'],
    'email': ['John.Doe@Email.Com', 'MARY.SMITH@GMAIL.COM', '  peter.jones@yahoo.com  ']
}

contacts_exercise = pd.DataFrame(customer_contacts)

print("🎯 Exercise 2: Standardize customer contact information")
print(contacts_exercise)

# TODO:
# 1. Clean and standardize names (proper case)
# 2. Create full name field
# 3. Standardize email format (lowercase)
# 4. Create a contact label format: "Full Name <email>"

# Your solution here:

print("\n💡 Hint: Combine string methods and concatenation techniques")

## 6. Key Takeaways and Next Steps

### 🎯 What We Learned Today

1. **Basic String Methods**: `.upper()`, `.lower()`, `.title()`, `.strip()`
2. **Pandas String Operations**: Using `.str` accessor for DataFrame columns
3. **Text Modification**: `.replace()`, slicing, concatenation
4. **Business Applications**: Standardizing real-world e-commerce data
5. **Data Quality**: Identifying and handling text data issues

### 🔄 Excel to Python Translation
- `UPPER()` → `.str.upper()`
- `LOWER()` → `.str.lower()`
- `PROPER()` → `.str.title()`
- `TRIM()` → `.str.strip()` (removes leading and trailing whitespace only; `TRIM()` also collapses repeated internal spaces)
- `SUBSTITUTE()` → `.str.replace()`
- `CONCATENATE()` → `.str.cat()` or `+` operator

### 📝 Best Practices
1. Always use `.str` accessor for pandas Series string operations
2. Chain string methods for complex cleaning operations
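3. Decide how to handle missing values before string operations (`.str` methods pass `NaN` through, but `.apply()` with plain string methods does not)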
4. Test string operations on small samples first
5. Create reusable functions for common cleaning tasks
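
As an illustration of the last practice, here is a minimal sketch of a reusable cleaning helper. The function name `clean_text_column` and its `case` parameter are hypothetical, not part of pandas or the course materials. It uses a small regular expression to collapse whitespace, a topic covered properly in Part 2.

```python
import numpy as np
import pandas as pd


def clean_text_column(series: pd.Series, case: str = "title") -> pd.Series:
    """Strip, collapse internal whitespace, and standardize case (hypothetical helper)."""
    cleaned = (
        series
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
    )
    if case == "title":
        return cleaned.str.title()
    if case == "upper":
        return cleaned.str.upper()
    if case == "lower":
        return cleaned.str.lower()
    raise ValueError(f"Unsupported case option: {case!r}")


# Missing values pass through unchanged because only .str methods are used
sample = pd.Series(["  sao   paulo ", "RIO DE JANEIRO", np.nan])
cities_clean = clean_text_column(sample)
states_clean = clean_text_column(pd.Series(["sp", " rj "]), case="upper")

print(cities_clean)
print(states_clean)
```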

### ⏭️ Coming Next
**Part 2**: Pattern Matching with String Methods and Introduction to Regular Expressions

In [None]:
# Summary: Compare your solutions with expected outputs
print("🎉 Great job completing Part 1: String Methods Fundamentals!")
print("\n📚 You're now ready for Part 2: Pattern Matching")
print("\n🔍 Next, we'll learn how to find patterns in text data using:")
print("   • String methods like .str.contains() and .str.startswith()")
print("   • Introduction to regular expressions")
print("   • Advanced pattern matching for business categorization")