# Polars String Operations - Comprehensive Guide

Master string manipulation in Polars using the `.str` namespace.

## Topics:
- Basic string operations
- Case conversion
- String searching and pattern matching
- String extraction and replacement
- String splitting and joining
- Regex operations
- String cleaning and normalization

In [None]:
import polars as pl

In [None]:
# Sample data
df = pl.DataFrame({
    'name': ['Alice Smith', 'bob jones', 'CHARLIE BROWN', 'Diana Lee', 'eve DAVIS'],
    'email': ['alice@email.com', 'bob.jones@company.org', 'charlie@MAIL.COM', 'diana_lee@test.net', 'eve123@domain.co.uk'],
    'phone': ['(123) 456-7890', '123-456-7890', '1234567890', '+1-123-456-7890', '123.456.7890'],
    'address': ['  123 Main St  ', '456 Oak Ave', '789 Pine Rd', '  321 Elm St', '654 Maple Dr  ']
})

print(df)

## Part 1: Basic String Operations

### Length and character count

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('name').str.len_chars().alias('length'),
    pl.col('name').str.len_bytes().alias('bytes')
])

print(result)

### Case conversion

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('name').str.to_lowercase().alias('lowercase'),
    pl.col('name').str.to_uppercase().alias('uppercase'),
    pl.col('name').str.to_titlecase().alias('titlecase')
])

print(result)

### Trimming whitespace

In [None]:
result = df.select([
    pl.col('address'),
    pl.col('address').str.strip_chars().alias('trimmed'),
    pl.col('address').str.strip_chars_start().alias('left_trim'),
    pl.col('address').str.strip_chars_end().alias('right_trim')
])

print(result)

## Part 2: String Searching

### Contains, starts_with, ends_with

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('name').str.contains('li').alias('has_li'),
    pl.col('name').str.starts_with('A').alias('starts_A'),
    pl.col('name').str.ends_with('n').alias('ends_n'),
    pl.col('email').str.contains('@').alias('has_at')
])

print(result)

### Case-insensitive search

In [None]:
# Convert to lowercase first for case-insensitive search
result = df.select([
    pl.col('name'),
    pl.col('name').str.to_lowercase().str.contains('smith').alias('has_smith_ci')
])

print(result)

## Part 3: String Extraction

### Slicing strings

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('name').str.slice(0, 5).alias('first_5'),
    pl.col('name').str.slice(-5).alias('last_5'),
    pl.col('name').str.head(3).alias('head_3'),
    pl.col('name').str.tail(3).alias('tail_3')
])

print(result)

### Regex extraction

In [None]:
# Extract domain and username from email
result = df.select([
    pl.col('email'),
    pl.col('email').str.extract(r'@(.+)', 1).alias('domain'),
    pl.col('email').str.extract(r'(.+)@', 1).alias('username')
])

print(result)

### Extract all matches

In [None]:
# Extract every digit from phone (each match becomes one list element)
result = df.select([
    pl.col('phone'),
    pl.col('phone').str.extract_all(r'\d').alias('all_digits')
])

print(result)

## Part 4: String Replacement

### Simple replacement

In [None]:
result = df.select([
    pl.col('phone'),
    pl.col('phone').str.replace_all('-', '').alias('no_dash'),
    pl.col('phone').str.replace_all(r'[^0-9]', '').alias('digits_only')
])

print(result)

### Multiple replacements

In [None]:
# Clean phone numbers
result = df.select([
    pl.col('phone'),
    pl.col('phone')
      .str.replace_all(r'[()\s.-]', '')  # hyphen placed last so it is a literal, not a range
      .str.replace(r'^\+1', '')
      .alias('cleaned_phone')
])

print(result)

## Part 5: Splitting and Joining

### Split into list

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('name').str.split(' ').alias('name_parts'),
    pl.col('email').str.split('@').alias('email_parts')
])

print(result)

### Extract from split

In [None]:
# Extract first and last name
result = df.select([
    pl.col('name'),
    pl.col('name').str.split(' ').list.get(0).alias('first_name'),
    pl.col('name').str.split(' ').list.get(-1).alias('last_name')
])

print(result)

### Concatenation

In [None]:
# Combine columns
result = df.select([
    pl.col('name'),
    pl.col('email'),
    (pl.col('name') + ' <' + pl.col('email') + '>').alias('name_with_email'),
    pl.concat_str([pl.col('name'), pl.col('email')], separator=' | ').alias('combined')
])

print(result)

## Part 6: Padding and Formatting

In [None]:
df_pad = pl.DataFrame({
    'id': ['1', '42', '123'],
    'value': ['5', '100', '9999']
})

result = df_pad.select([
    pl.col('id'),
    pl.col('id').str.pad_start(5, '0').alias('id_padded'),
    pl.col('value').str.pad_end(6, ' ').alias('value_padded')
])

print(result)

## Part 7: Real-World Examples

### Example 1: Email validation and parsing

In [None]:
email_analysis = df.select([
    pl.col('email'),
    pl.col('email').str.contains(r'^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$').alias('valid_format'),
    pl.col('email').str.extract(r'(.+)@', 1).alias('username'),
    pl.col('email').str.extract(r'@(.+)', 1).alias('domain'),
    pl.col('email').str.extract(r'\.([a-zA-Z]{2,})$', 1).alias('tld')
])

print("Email parsing:")
print(email_analysis)

### Example 2: Phone number standardization

In [None]:
phone_clean = df.select([
    pl.col('phone'),
    # Extract only digits
    pl.col('phone').str.replace_all(r'[^0-9]', '').alias('digits'),
    # Standardize format
    pl.col('phone')
      .str.replace_all(r'[^0-9]', '')
      .str.replace(r'^1(\d{10})$', '$1')  # drop a leading country code 1 only from 11-digit numbers
      .str.replace(r'^(\d{3})(\d{3})(\d{4})$', '($1) $2-$3')
      .alias('formatted')
])

print("Phone standardization:")
print(phone_clean)

### Example 3: Text cleaning pipeline

In [None]:
text_df = pl.DataFrame({
    'text': [
        '  Hello, World!  ',
        'URGENT: Buy NOW!!!',
        'Contact: john.doe@email.com',
        'Price: $19.99'
    ]
})

cleaned = text_df.select([
    pl.col('text'),
    pl.col('text')
      .str.strip_chars()  # Trim leading and trailing whitespace
      .str.to_lowercase()  # Lowercase
      .str.replace_all(r'[^a-z0-9\s@.]', '')  # Keep only letters, digits, whitespace, '@' and '.'
      .str.replace_all(r'\s+', ' ')  # Collapse whitespace runs into a single space
      .alias('cleaned')
])

print("Text cleaning:")
print(cleaned)

### Example 4: URL parsing

In [None]:
url_df = pl.DataFrame({
    'url': [
        'https://www.example.com/path/page.html',
        'http://api.test.org/v1/users',
        'https://subdomain.site.co.uk/products?id=123'
    ]
})

url_parsed = url_df.select([
    pl.col('url'),
    pl.col('url').str.extract(r'(https?)://', 1).alias('protocol'),
    pl.col('url').str.extract(r'://([^/]+)', 1).alias('domain'),
    pl.col('url').str.extract(r'://[^/]+(.+?)(?:\?|$)', 1).alias('path')
])

print("URL parsing:")
print(url_parsed)

## Summary

### Key Operations:
- **Case**: to_lowercase, to_uppercase, to_titlecase
- **Trim**: strip_chars, strip_chars_start, strip_chars_end
- **Search**: contains, starts_with, ends_with
- **Extract**: slice, head, tail, extract, extract_all
- **Replace**: replace, replace_all
- **Split**: split (returns list)
- **Combine**: + operator, concat_str
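- **Regex**: contains, extract, extract_all, replace, and replace_all treat the pattern as a regex by default; starts_with, ends_with, and split match literal text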

### Best Practices:
- Chain operations for complex transformations
- Use regex for flexible pattern matching
- Normalize before comparison (lowercase, trim)
- Extract structured data with regex groups