#### Part 21: Pivot Tables and Text Processing in Pandas

In this notebook, we'll explore:
- Advanced pivot table operations
- Working with text data in pandas
- String methods and pattern matching

##### Setup
First, let's import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np

##### 1. Advanced Pivot Table Operations

Let's create a sample DataFrame to demonstrate pivot table operations:

In [2]:
# Set a random seed for reproducibility
np.random.seed([3, 1415])
n = 20

# Create column names
cols = np.array(['key', 'row', 'item', 'col'])

# Create a DataFrame with random data
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
df.columns = cols
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))

# Show the first 10 of the 20 rows
df.head(10)

    key   row   item   col  val0  val1
0  key0  row3  item1  col3  0.81  0.04
1  key1  row2  item1  col2  0.44  0.07
2  key1  row0  item1  col0  0.77  0.01
3  key0  row4  item0  col2  0.15  0.59
4  key1  row0  item2  col1  0.81  0.64
5  key1  row2  item2  col4  0.13  0.88
6  key2  row4  item1  col3  0.88  0.39
7  key1  row4  item1  col1  0.10  0.07
8  key1  row0  item2  col4  0.65  0.02
9  key1  row2  item0  col2  0.35  0.61


###### 1.1 Pivoting with Single Aggregations

Let's pivot the DataFrame so that the 'col' values become the columns, the 'row' values become the index, and each cell holds the mean of 'val0' for that row/column combination:

In [3]:
# Basic pivot table with mean aggregation (default)
df.pivot_table(values='val0', index='row', columns='col', aggfunc='mean')

col   col0   col1   col2   col3  col4
row
row0  0.77  0.605    NaN  0.860  0.65
row2  0.13    NaN  0.395  0.500  0.25
row3   NaN  0.310    NaN  0.545   NaN
row4   NaN  0.100  0.395  0.760  0.24
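
As a rough mental model, a single-aggregation pivot table behaves like a `groupby` on the index and column keys followed by `unstack`. The sketch below illustrates this on a small hypothetical DataFrame (`demo`, which is not the `df` used above); combinations with no rows come out as `NaN` in both forms.

```python
import pandas as pd

# Hypothetical toy data, separate from the notebook's df
demo = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c0', 'c1'],
    'val0': [1.0, 3.0, 5.0, 7.0],
})

# Pivot table: mean of val0 for each (row, col) pair
pivoted = demo.pivot_table(values='val0', index='row', columns='col', aggfunc='mean')

# Same idea spelled out: group by both keys, aggregate, move 'col' into the columns
grouped = demo.groupby(['row', 'col'])['val0'].mean().unstack()

print(pivoted)
print(grouped)
```

The pair (`r0`, `c1`) never occurs in `demo`, so that cell is missing in both results.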


We can replace missing values using the `fill_value` parameter:

In [4]:
# Pivot table with fill_value
df.pivot_table(values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)

col   col0   col1   col2   col3  col4
row
row0  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.100  0.395  0.760  0.24


We can use other aggregation functions as well, such as sum:

In [5]:
# Pivot table with sum aggregation
df.pivot_table(values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)

col   col0  col1  col2  col3  col4
row
row0  0.77  1.21  0.00  0.86  0.65
row2  0.13  0.00  0.79  0.50  0.50
row3  0.00  0.31  0.00  1.09  0.00
row4  0.00  0.10  0.79  1.52  0.24


We can also count how often each combination of row and column values occurs together (a cross tabulation) by using 'size' as the aggregation function:

In [6]:
# Cross tabulation using size
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

col   col0  col1  col2  col3  col4
row
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1
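
pandas also offers `pd.crosstab` for this kind of frequency table. The sketch below compares the two approaches on a small hypothetical DataFrame (`demo`, not the `df` above). It is an illustration of the idea rather than a claim that the two calls are interchangeable in every case.

```python
import pandas as pd

# Hypothetical toy data, separate from the notebook's df
demo = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c0', 'c1'],
    'val0': [0.1, 0.2, 0.3, 0.4],
})

# Frequency table via pivot_table with aggfunc='size'
by_pivot = demo.pivot_table(index='row', columns='col', aggfunc='size', fill_value=0)

# Frequency table via pd.crosstab, which counts co-occurrences by default
by_crosstab = pd.crosstab(demo['row'], demo['col'])

print(by_pivot)
print(by_crosstab)
```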


###### 1.2 Pivoting with Multiple Aggregations

We can perform multiple aggregations by passing a list to the `aggfunc` parameter:

In [7]:
# Pivot table with multiple aggregations
df.pivot_table(values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])

      mean                               sum
col   col0   col1   col2   col3  col4   col0  col1  col2  col3  col4
row
row0  0.77  0.605    NaN  0.860  0.65   0.77  1.21   NaN  0.86  0.65
row2  0.13    NaN  0.395  0.500  0.25   0.13   NaN  0.79  0.50  0.50
row3   NaN  0.310    NaN  0.545   NaN    NaN  0.31   NaN  1.09   NaN
row4   NaN  0.100  0.395  0.760  0.24    NaN  0.10  0.79  1.52  0.24
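
Passing a list to `aggfunc` produces columns with two levels: the aggregation name on top and the 'col' values underneath. The sketch below shows one way to select from such a result. It uses a small hypothetical DataFrame (`demo`), and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, separate from the notebook's df
demo = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val0': [1.0, 3.0, 5.0],
})

multi = demo.pivot_table(values='val0', index='row', columns='col',
                         aggfunc=['mean', 'sum'])

# The top column level holds the aggregation names
agg_names = multi.columns.get_level_values(0).unique().tolist()
print(agg_names)

# Select every column belonging to one aggregation
sums = multi['sum']
print(sums)

# Select a single (aggregation, column) pair as a Series
mean_c0 = multi[('mean', 'c0')]
print(mean_c0)
```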


##### 2. Working with Text Data

Pandas provides a wide range of string methods through the `.str` accessor. Let's explore some of these methods.

###### 2.1 Testing for Strings that Match or Contain a Pattern

You can check whether elements contain a pattern using `str.contains()`:

In [8]:
# Define a pattern
pattern = r'[0-9][a-z]'

# Check if elements contain the pattern
pd.Series(['1', '2', '3a', '3b', '03c'], dtype="string").str.contains(pattern)

0    False
1    False
2     True
3     True
4     True
dtype: boolean

Or whether elements match a pattern using `str.match()`:

In [9]:
# Check if elements match the pattern
pd.Series(['1', '2', '3a', '3b', '03c'], dtype="string").str.match(pattern)

0    False
1    False
2     True
3     True
4    False
dtype: boolean

The distinction between `match` and `contains` is where the pattern is allowed to match:
- `match` relies on `re.match`, which anchors the pattern at the beginning of the string (the match does not have to extend to the end)
- `contains` relies on `re.search`, which finds the pattern anywhere in the string
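
The sketch below makes the difference concrete, first with Python's `re` module and then with the `.str` accessor. It also includes `str.fullmatch`, which requires the whole string to match. The sample values are illustrative and not taken from the cells above.

```python
import re
import pandas as pd

pattern = r'[0-9][a-z]'

# '03c' starts with '0' followed by '3', so the pattern fails at position 0
print(bool(re.match(pattern, '03c')))   # anchored at the start
print(bool(re.search(pattern, '03c')))  # may match anywhere

s_demo = pd.Series(['3a', '3ab', '03c'], dtype="string")

matched = s_demo.str.match(pattern)        # must match at the start
full = s_demo.str.fullmatch(pattern)       # must match the entire string
contained = s_demo.str.contains(pattern)   # may match anywhere

print(pd.DataFrame({'match': matched, 'fullmatch': full, 'contains': contained}))
```

Here '3ab' passes `match` but not `fullmatch`, while '03c' passes only `contains`.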

Methods like `match`, `contains`, `startswith`, and `endswith` take an extra `na` argument so missing values can be considered True or False:

In [10]:
# Create a Series with NaN
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'], dtype="string")

# Check if elements contain 'A', treating NaN as False
s4.str.contains('A', na=False)

0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean
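
To see why `na` matters, the sketch below compares the default behaviour with `na=False` and `na=True` on a short hypothetical Series (`s_na`). With the `"string"` dtype, a missing element yields `<NA>` unless `na` is set, and a mask containing `<NA>` is awkward to use for filtering.

```python
import numpy as np
import pandas as pd

# Hypothetical short Series with one missing value
s_na = pd.Series(['A', np.nan, 'cat'], dtype="string")

default = s_na.str.contains('A')             # missing stays missing
as_false = s_na.str.contains('A', na=False)  # missing counts as no match
as_true = s_na.str.contains('A', na=True)    # missing counts as a match

print(default)
print(as_false)
print(as_true)

# A mask without missing values can be used directly for filtering
selected = s_na[as_false]
print(selected)
```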

###### 2.2 Creating Indicator Variables

You can extract dummy variables from string columns. For example, if values are separated by a '|':

In [11]:
# Create a Series with pipe-separated values
s = pd.Series(['a', 'a|b', np.nan, 'a|c'], dtype="string")

# Get dummy variables
s.str.get_dummies(sep='|')

   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1
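
In practice the indicator columns are often attached back to the table they came from. The sketch below shows one way to do that. The `films` DataFrame, its column names, and the `genre_` prefix are all hypothetical.

```python
import pandas as pd

# Hypothetical table with a pipe-separated label column
films = pd.DataFrame({
    'title': ['t1', 't2', 't3'],
    'genres': ['a', 'a|b', 'a|c'],
})

# One indicator column per distinct label, prefixed for readability
dummies = films['genres'].str.get_dummies(sep='|').add_prefix('genre_')

# Attach the indicators to the original rows (aligned on the index)
wide = films.join(dummies)
print(wide)
```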


A string `Index` also supports `get_dummies`, which returns a `MultiIndex`:

In [12]:
# Create an Index with pipe-separated values
idx = pd.Index(['a', 'a|b', np.nan, 'a|c'])

# Get dummy variables from Index
idx.str.get_dummies(sep='|')

MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])

###### 2.3 Extracting Patterns from Strings

Let's demonstrate how to extract patterns from strings using regular expressions:

In [13]:
# Define a pattern with two capture groups
two_groups = r'([a-z])([0-9])'

# Extract all occurrences of the pattern
pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)

         0  1
  match
0 0      a  1
  1      a  2
1 0      b  1
2 0      c  1
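
For comparison, `str.extract` returns only the first match per element, while `str.extractall` returns every match and indexes the result by the original position and the match number. The sketch below uses named groups (`letter` and `digit`, names chosen here for illustration) so that the output columns get readable labels.

```python
import pandas as pd

s_demo = pd.Series(["a1a2", "b1", "c1"], dtype="string")

# Named capture groups become column names in the result
named = r'(?P<letter>[a-z])(?P<digit>[0-9])'

first = s_demo.str.extract(named)     # one row per element, first match only
every = s_demo.str.extractall(named)  # one row per match

print(first)
print(every)

# The second match in element 0 is addressed as (0, 1)
second_digit = every.loc[(0, 1), 'digit']
print(second_digit)
```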


###### 2.4 String Method Summary

Here's a summary of some common string methods available in pandas:

- `cat()`: Concatenate strings
- `split()`: Split strings on delimiter
- `rsplit()`: Split strings on delimiter working from the end of the string
- `get()`: Index into each element (retrieve i-th element)
- `join()`: Join strings in each element of the Series with passed separator
- `get_dummies()`: Split strings on the delimiter returning DataFrame of dummy variables
- `contains()`: Return a boolean array indicating whether each string contains the pattern/regex
- `replace()`: Replace occurrences of pattern/regex/string with some other string
- `repeat()`: Duplicate values (`s.str.repeat(3)` is equivalent to `x * 3` for each string `x` in `s`)
- `pad()`: Add whitespace to left, right, or both sides of strings
- `center()`: Equivalent to str.center
- `ljust()`: Equivalent to str.ljust
- `rjust()`: Equivalent to str.rjust
- `zfill()`: Equivalent to str.zfill
- `wrap()`: Split long strings into lines with length less than a given width
- `slice()`: Slice each string in the Series
- `slice_replace()`: Replace slice in each string with passed value
- `count()`: Count occurrences of pattern
- `startswith()`: Equivalent to str.startswith(pat) for each element
- `endswith()`: Equivalent to str.endswith(pat) for each element

Let's demonstrate `count()` alongside a few related methods that are not in the list above (`lower()`, `upper()`, and `len()`):

In [14]:
# Create a sample Series
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', 'CABA', 'dog', 'cat'], dtype="string")

# Lowercase
print("Lowercase:")
print(s.str.lower())

# Uppercase
print("\nUppercase:")
print(s.str.upper())

# Length of each string
print("\nLength:")
print(s.str.len())

# Count occurrences of 'a'
print("\nCount 'a':")
print(s.str.count('a'))

Lowercase:
0       a
1       b
2       c
3    aaba
4    baca
5    caba
6     dog
7     cat
dtype: string

Uppercase:
0       A
1       B
2       C
3    AABA
4    BACA
5    CABA
6     DOG
7     CAT
dtype: string

Length:
0    1
1    1
2    1
3    4
4    4
5    4
6    3
7    3
dtype: Int64

Count 'a':
0    0
1    0
2    0
3    2
4    2
5    0
6    0
7    1
dtype: Int64
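
Several of the methods from the summary list combine naturally. The sketch below chains `split()`, `get()`, `zfill()`, `replace()`, `slice()`, and `cat()` on a small hypothetical Series of codes. The data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical codes of the form <letter>_<number>
codes = pd.Series(['a_1', 'b_22', 'c_333'], dtype="string")

# split() yields a list per element; get() picks one item from each list
parts = codes.str.split('_')
letters = parts.str.get(0)
numbers = parts.str.get(1).str.zfill(3)  # left-pad with zeros to width 3

# replace() with regex=False treats the pattern as a literal string
dashed = codes.str.replace('_', '-', regex=False)

# slice() takes a substring of each element
first_char = codes.str.slice(0, 1)

# cat() concatenates element-wise with another Series
rebuilt = letters.str.cat(numbers, sep=':')

print(numbers)
print(dashed)
print(first_char)
print(rebuilt)
```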


##### Summary

In this notebook, we've explored:

1. Advanced pivot table operations in pandas, including:
   - Single and multiple aggregations
   - Handling missing values
   - Cross tabulation

2. Working with text data in pandas, including:
   - Pattern matching with `contains()` and `match()`
   - Creating indicator variables with `get_dummies()`
   - Extracting patterns from strings
   - Various string manipulation methods

These techniques are essential for data preprocessing, transformation, and analysis in pandas.