#### Part 21: Pivot Tables and Text Processing in Pandas

In this notebook, we'll explore:
- Advanced pivot table operations
- Working with text data in pandas
- String methods and pattern matching

##### Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np

##### 1. Advanced Pivot Table Operations

Let's create a sample DataFrame to demonstrate pivot table operations:

In [None]:
# Set a random seed for reproducibility
np.random.seed([3, 1415])
n = 20

# Create column names
cols = np.array(['key', 'row', 'item', 'col'])

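# Build string labels such as 'key1', 'row3', 'item0', 'col4':
# random integers are converted to strings and prefixed with the column names
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
df.columns = cols

# Append two columns of random floats, named 'val0' and 'val1'
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))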

df

###### 1.1 Pivoting with Single Aggregations

Let's pivot the DataFrame so that the 'col' values become the columns, the 'row' values become the index, and each cell holds the mean of 'val0' for that combination:

In [None]:
# Basic pivot table with mean aggregation (default)
df.pivot_table(values='val0', index='row', columns='col', aggfunc='mean')
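
One way to see what `pivot_table` does here is to reproduce it with `groupby` and `unstack`. The sketch below assumes a small hand-written frame (`demo` is hypothetical data, not the random `df` above) so the numbers are easy to check by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration; not the random df built above
demo = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c0', 'c1'],
    'val0': [1.0, 2.0, 3.0, 4.0],
})

# Pivot: one row per 'row' value, one column per 'col' value, mean of 'val0'
pivoted = demo.pivot_table(values='val0', index='row', columns='col', aggfunc='mean')

# The same table built by grouping on both keys, then moving 'col' to the columns
grouped = demo.groupby(['row', 'col'])['val0'].mean().unstack('col')

print(pivoted)
print(np.allclose(pivoted.to_numpy(), grouped.to_numpy(), equal_nan=True))
```

The ('r0', 'c1') cell is missing in both results because that combination never occurs in `demo`.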

We can replace missing values using the `fill_value` parameter:

In [None]:
# Pivot table with fill_value
df.pivot_table(values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)

We can use other aggregation functions as well, such as sum:

In [None]:
# Pivot table with sum aggregation
df.pivot_table(values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)

We can also count how often each combination of row and column values occurs (a cross tabulation) by using 'size' as the aggregation function:

In [None]:
# Cross tabulation using size
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
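
For plain counts, `pd.crosstab` should give the same table. The sketch below compares the two on a small hand-written frame (`demo` is hypothetical data, chosen so that one combination never occurs):

```python
import pandas as pd

# Hypothetical data for illustration; ('r0', 'c1') never occurs
demo = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c0', 'c1'],
})

# Count rows per (row, col) pair with pivot_table
via_pivot = demo.pivot_table(index='row', columns='col', aggfunc='size', fill_value=0)

# Count the same pairs with crosstab
via_crosstab = pd.crosstab(demo['row'], demo['col'])

print(via_pivot)
print((via_pivot.to_numpy() == via_crosstab.to_numpy()).all())
```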

###### 1.2 Pivoting with Multiple Aggregations

We can perform multiple aggregations by passing a list to the `aggfunc` parameter:

In [None]:
# Pivot table with multiple aggregations
df.pivot_table(values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
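
With a list of aggregations, the result has two column levels: the aggregation name on top and the 'col' values below. A minimal sketch of selecting one aggregation from such a result, using a small hand-written frame (`demo` is hypothetical data, not the random `df` above):

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c0', 'c1'],
    'val0': [1.0, 2.0, 3.0, 4.0],
})

result = demo.pivot_table(values='val0', index='row', columns='col',
                          aggfunc=['mean', 'sum'])

# Top column level is the aggregation name, second level is the 'col' value
print(result.columns.tolist())

# Selecting on the top level returns an ordinary pivot table for that aggregation
means = result['mean']
sums = result['sum']
print(means)
```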

##### 2. Working with Text Data

Pandas provides a wide range of string methods through the `.str` accessor. Let's explore some of these methods.

###### 2.1 Testing for Strings that Match or Contain a Pattern

You can check whether elements contain a pattern using `str.contains()`:

In [None]:
# Define a pattern
pattern = r'[0-9][a-z]'

# Check if elements contain the pattern
pd.Series(['1', '2', '3a', '3b', '03c'], dtype="string").str.contains(pattern)

Or whether elements match a pattern using `str.match()`:

In [None]:
# Check if elements match the pattern
pd.Series(['1', '2', '3a', '3b', '03c'], dtype="string").str.match(pattern)

The distinction between `match` and `contains` is strictness:
- `match` relies on `re.match`, which only finds a match that starts at the beginning of the string
- `contains` relies on `re.search`, which finds a match anywhere in the string
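
To see the difference on concrete inputs, the sketch below runs both methods on three hand-picked strings. It also includes `str.fullmatch()`, which the text above does not cover and which requires the entire string to match the pattern:

```python
import pandas as pd

pattern = r'[0-9][a-z]'

# Hand-picked strings: an exact match, a match with trailing text,
# and a match that does not start at the first character
s = pd.Series(['3a', '3ab', '03c'], dtype="string")

contains = s.str.contains(pattern)    # pattern found anywhere
match = s.str.match(pattern)          # pattern found at the start
fullmatch = s.str.fullmatch(pattern)  # pattern covers the whole string

print(pd.DataFrame({'value': s, 'contains': contains,
                    'match': match, 'fullmatch': fullmatch}))
```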

Methods like `match`, `contains`, `startswith`, and `endswith` take an extra `na` argument so missing values can be considered True or False:

In [None]:
# Create a Series with NaN
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'], dtype="string")

# Check if elements contain 'A', treating NaN as False
s4.str.contains('A', na=False)
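
Without `na`, the missing entry stays missing in the result; with `na=False` the result is a plain True/False mask that can be used directly for filtering. A short sketch on the same data:

```python
import numpy as np
import pandas as pd

s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
               dtype="string")

# Default: the missing value propagates into the result
default = s4.str.contains('A')

# na=False: the missing value is treated as "does not contain"
filled = s4.str.contains('A', na=False)

# Keep only the strings that contain an uppercase 'A'
filtered = s4[filled]
print(filtered)
```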

###### 2.2 Creating Indicator Variables

You can extract dummy variables from string columns. For example, if values are separated by a '|':

In [None]:
# Create a Series with pipe-separated values
s = pd.Series(['a', 'a|b', np.nan, 'a|c'], dtype="string")

# Get dummy variables
s.str.get_dummies(sep='|')
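
In practice the indicator columns are usually attached back to the frame they came from. The sketch below assumes a hypothetical frame `items` with a pipe-separated `tags` column; the names and the `tag_` prefix are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'tags' holds pipe-separated labels
items = pd.DataFrame({
    'name': ['w', 'x', 'y', 'z'],
    'tags': pd.Series(['a', 'a|b', np.nan, 'a|c'], dtype="string"),
})

# One 0/1 column per distinct label; the missing row gets all zeros
indicators = items['tags'].str.get_dummies(sep='|')

# Attach the indicator columns next to the original data
with_flags = items.join(indicators.add_prefix('tag_'))
print(with_flags)
```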

A string `Index` also supports `get_dummies`, which returns a `MultiIndex`:

In [None]:
# Create an Index with pipe-separated values
idx = pd.Index(['a', 'a|b', np.nan, 'a|c'])

# Get dummy variables from Index
idx.str.get_dummies(sep='|')

###### 2.3 Extracting Patterns from Strings

`str.extractall()` returns every match of a regular expression in each string, one row per match, with one column per capture group:

In [None]:
# Define a pattern with two capture groups
two_groups = r'([a-z])([0-9])'

# Extract all occurrences of the pattern
pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
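
For comparison, `str.extract()` keeps only the first match in each string. The sketch below runs both on the same data; the group names `letter` and `digit` are illustrative and simply become the column names:

```python
import pandas as pd

# Same pattern as above, with named groups so the columns get labels
named_groups = r'(?P<letter>[a-z])(?P<digit>[0-9])'
s = pd.Series(["a1a2", "b1", "c1"], dtype="string")

# First match per string: one row per input element
first_only = s.str.extract(named_groups)

# Every match per string: rows are indexed by (original index, match number)
all_matches = s.str.extractall(named_groups)

print(first_only)
print(all_matches)
```

The second index level of the `extractall` result is named `match` and numbers the matches within each string from 0.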

###### 2.4 String Method Summary

Here's a summary of some common string methods available in pandas:

- `cat()`: Concatenate strings
- `split()`: Split strings on delimiter
- `rsplit()`: Split strings on delimiter working from the end of the string
- `get()`: Index into each element (retrieve i-th element)
- `join()`: Join strings in each element of the Series with passed separator
- `get_dummies()`: Split strings on the delimiter returning DataFrame of dummy variables
- `contains()`: Return a boolean array indicating whether each string contains the pattern/regex
- `replace()`: Replace occurrences of pattern/regex/string with some other string
- `repeat()`: Duplicate values (`s.str.repeat(3)` is equivalent to `x * 3` for each element `x`)
- `pad()`: Add whitespace to left, right, or both sides of strings
- `center()`: Equivalent to str.center
- `ljust()`: Equivalent to str.ljust
- `rjust()`: Equivalent to str.rjust
- `zfill()`: Equivalent to str.zfill
- `wrap()`: Split long strings into lines no longer than a given width
- `slice()`: Slice each string in the Series
- `slice_replace()`: Replace slice in each string with passed value
- `count()`: Count occurrences of pattern
- `startswith()`: Equivalent to str.startswith(pat) for each element
- `endswith()`: Equivalent to str.endswith(pat) for each element
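
A few of the methods listed above can be sketched on a small example. The underscore-separated strings here are made-up data chosen to keep the results easy to read:

```python
import pandas as pd

# Made-up data for illustration
s = pd.Series(['a_b_c', 'd_e', 'f'], dtype="string")

# split() + get(): first piece of each string
first = s.str.split('_').str.get(0)

# rsplit() with n=1 splits once from the right; get(-1) takes the last piece
last = s.str.rsplit('_', n=1).str.get(-1)

# replace(): literal replacement (regex=False disables pattern matching)
replaced = s.str.replace('_', '-', regex=False)

# cat(): concatenate all elements into a single string
joined = s.str.cat(sep=',')

# zfill(): left-pad with zeros to a fixed width
padded = pd.Series(['7', '42'], dtype="string").str.zfill(3)

print(first.tolist(), last.tolist(), replaced.tolist(), joined, padded.tolist())
```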

Let's demonstrate `count()` along with a few other common methods that are not listed above (`lower()`, `upper()`, and `len()`):

In [None]:
# Create a sample Series
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', 'CABA', 'dog', 'cat'], dtype="string")

# Lowercase
print("Lowercase:")
print(s.str.lower())

# Uppercase
print("\nUppercase:")
print(s.str.upper())

# Length of each string
print("\nLength:")
print(s.str.len())

# Count occurrences of 'a'
print("\nCount 'a':")
print(s.str.count('a'))

##### Summary

In this notebook, we've explored:

1. Advanced pivot table operations in pandas, including:
   - Single and multiple aggregations
   - Handling missing values
   - Cross tabulation

2. Working with text data in pandas, including:
   - Pattern matching with `contains()` and `match()`
   - Creating indicator variables with `get_dummies()`
   - Extracting patterns from strings
   - Various string manipulation methods

These techniques are essential for data preprocessing, transformation, and analysis in pandas.