#### Part 10: Advanced Data Operations in Pandas

In this notebook, we'll explore advanced data operations in pandas, including:
- Hierarchical indexing (MultiIndex)
- Advanced grouping operations
- Complex data transformations
- Performance optimization techniques

##### Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np

# Set display options for better visualization
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.precision', 3)

##### 1. Hierarchical Indexing (MultiIndex)

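Hierarchical indexing (a `MultiIndex`) lets a single axis carry several index levels. It can be used on rows, on columns, or on both, which makes it possible to represent higher-dimensional data in a two-dimensional DataFrame.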

In [None]:
# Create a MultiIndex DataFrame
arrays = [
    ['A', 'A', 'B', 'B'],
    [1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=('letter', 'number'))
df = pd.DataFrame({'value': [100, 200, 300, 400]}, index=index)
df

###### 1.1 Accessing Data in MultiIndex

In [None]:
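# Select by the outer level: every row whose 'letter' label is 'A'
print("Rows with letter == 'A':")
print(df.loc['A'])

# Select one row by giving a label for every level as a tuple
print("\nRow with letter == 'A' and number == 1:")
print(df.loc[('A', 1)])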
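
Selecting on the inner level, or moving a level into the columns, needs a little more than plain `.loc`. The cell below is a minimal sketch of three common options (`xs`, `pd.IndexSlice`, and `unstack`). The name `mi_df` and the result variables are illustrative only; the data repeats the small frame built above.

```python
import pandas as pd

# Rebuild the small MultiIndex frame used above (mi_df is an illustrative name)
index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]],
    names=('letter', 'number'),
)
mi_df = pd.DataFrame({'value': [100, 200, 300, 400]}, index=index)

# Cross-section: every row whose 'number' level equals 1
# (the selected level is dropped from the result)
number_one = mi_df.xs(1, level='number')

# Slice on the inner level while keeping all outer labels
idx = pd.IndexSlice
number_two = mi_df.loc[idx[:, 2], :]

# Pivot the 'number' level from the row index into the columns
wide = mi_df['value'].unstack('number')

print(number_one)
print(number_two)
print(wide)
```

Slicing with `pd.IndexSlice` generally expects the index to be sorted, which is the case here; on unsorted data, calling `sort_index()` first is the usual remedy.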

##### 2. Advanced Grouping Operations

In [None]:
# Create sample data
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60]
})

# Group by multiple columns
grouped = df.groupby(['A', 'B']).sum()
grouped

###### 2.1 Advanced Aggregation

In [None]:
# Multiple aggregation functions
agg_funcs = {
    'C': ['sum', 'mean', 'std'],
    'D': ['min', 'max']
}
df.groupby('A').agg(agg_funcs)
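
Passing a dict of lists to `agg` produces MultiIndex columns such as `('C', 'sum')`. When flat column names are easier to work with, named aggregation is one alternative, and `transform` is useful when the group result should be aligned back to the original rows. The sketch below reuses the sample data from this section; the names `grp_df`, `summary`, and `C_share` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Same sample data as above, under an illustrative name
grp_df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60],
})

# Named aggregation: output_column=(input_column, function)
summary = grp_df.groupby('A').agg(
    c_total=('C', 'sum'),
    c_mean=('C', 'mean'),
    d_max=('D', 'max'),
)

# transform returns one value per original row, so it can be used
# to compute each row's share of its group's total
group_total = grp_df.groupby('A')['C'].transform('sum')
grp_df['C_share'] = grp_df['C'] / group_total

print(summary)
print(grp_df)
```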

##### 3. Complex Data Transformations

In [None]:
# Create sample time series data
dates = pd.date_range('2023-01-01', periods=6)
df = pd.DataFrame({
    'date': dates,
    'value': [100, 102, 98, 97, 103, 105]
})

# Calculate a 3-row rolling mean (the first two rows are NaN because
# the window is not yet full)
df['rolling_mean'] = df['value'].rolling(window=3).mean()

# Calculate percent change
df['pct_change'] = df['value'].pct_change()

df
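
To make the two transformations above less of a black box, the following sketch recomputes the percent change by hand with `shift` and shows how `min_periods` and a time-based window change the rolling result. It assumes the dates are moved into the index, which the original cell does not do; `ts` and the result names are illustrative.

```python
import pandas as pd

# Same values as above, but stored as a Series with a DatetimeIndex
dates = pd.date_range('2023-01-01', periods=6)
ts = pd.Series([100, 102, 98, 97, 103, 105], index=dates, name='value')

# min_periods=1 yields partial-window means instead of NaN at the start
rolling_partial = ts.rolling(window=3, min_periods=1).mean()

# pct_change is equivalent to dividing by the previous row and subtracting 1
manual_pct = ts / ts.shift(1) - 1

# Time-based window: '3D' covers the current day and the two days before it
rolling_3d = ts.rolling('3D').mean()

print(pd.DataFrame({
    'value': ts,
    'rolling_partial': rolling_partial,
    'rolling_3d': rolling_3d,
    'manual_pct': manual_pct,
}))
```

A time-based window requires a datetime-like index (or the `on=` argument), which is why the dates are used as the index here.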

##### 4. Performance Optimization

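Row-by-row iteration in pandas is typically far slower than vectorized column operations. The following cell compares the two approaches on a DataFrame with one million rows: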

In [None]:
# Create a large DataFrame
n = 1000000
df = pd.DataFrame({
    'A': np.random.randn(n),
    'B': np.random.randn(n),
    'C': np.random.randn(n)
})

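# Time different operations (perf_counter is better suited to
# measuring short intervals than time.time)
import time

# Using iterrows (slow): builds a Series for every row,
# so this loop can take a long time on one million rows
start = time.perf_counter()
result1 = []
for idx, row in df.iterrows():
    result1.append(row['A'] + row['B'])
print(f"iterrows time: {time.perf_counter() - start:.2f} seconds")

# Using vectorization (fast): one operation on whole columns
start = time.perf_counter()
result2 = df['A'] + df['B']
print(f"vectorized time: {time.perf_counter() - start:.4f} seconds")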
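
When a loop cannot be avoided entirely, `itertuples` is generally a cheaper way to iterate than `iterrows`, and operating on the underlying NumPy arrays is another vectorized option. The cell below is a rough sketch on a smaller frame so that it runs quickly; the helper `timed`, the frame `small`, and the row count are assumptions made for illustration, and the measured times will vary by machine.

```python
import time

import numpy as np
import pandas as pd

# Smaller frame than above so the cell finishes quickly
rng = np.random.default_rng(0)
small = pd.DataFrame({
    'A': rng.standard_normal(10_000),
    'B': rng.standard_normal(10_000),
})

def timed(label, func):
    """Run func once, print the elapsed time, and return its result."""
    start = time.perf_counter()
    out = func()
    print(f"{label}: {time.perf_counter() - start:.4f} seconds")
    return out

# itertuples yields lightweight namedtuples instead of a Series per row
via_itertuples = timed(
    "itertuples",
    lambda: [row.A + row.B for row in small.itertuples(index=False)],
)

# Adding the raw NumPy arrays skips index alignment
via_numpy = timed(
    "numpy arrays",
    lambda: small['A'].to_numpy() + small['B'].to_numpy(),
)

# Plain pandas vectorization, as in the cell above
via_pandas = timed("vectorized", lambda: small['A'] + small['B'])
```

All three approaches compute the same values; only the cost differs.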

###### 4.1 Memory Optimization Tips

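1. Use appropriate data types (e.g., the `category` dtype for string columns with few unique values)
2. Read large files in chunks (e.g., with the `chunksize` argument of `pd.read_csv`)
3. Remove unnecessary columns, ideally at load time (e.g., with `usecols`)
4. Do not rely on `inplace=True` to save memory: many pandas methods still build a copy internally, so reassigning the result is usually just as efficient and easier to read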

In [None]:
# Example of memory optimization
df = pd.DataFrame({
    'id': range(1000000),
    'category': np.random.choice(['A', 'B', 'C'], 1000000)
})

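# deep=True counts the actual string data, not just the pointers
# stored in an object column
print(f"Memory usage before: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Convert to category
df['category'] = df['category'].astype('category')

print(f"Memory usage after: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")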
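
Chunked reading and dropping unneeded columns at load time can be combined in a single `pd.read_csv` call. The sketch below uses a tiny in-memory CSV as a hypothetical stand-in for a large file on disk; the column names, the chunk size of 4, and the `note` column are made up for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk
csv_text = "id,category,note\n" + "\n".join(
    f"{i},{'ABC'[i % 3]},unused" for i in range(10)
)

# usecols skips the 'note' column entirely; chunksize makes read_csv
# return an iterator of DataFrames with at most 4 rows each
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['id', 'category'],
    chunksize=4,
)

# Aggregate chunk by chunk so the full file never sits in memory at once
counts = Counter()
n_chunks = 0
for chunk in reader:
    counts.update(chunk['category'].value_counts().to_dict())
    n_chunks += 1

print(f"Processed {n_chunks} chunks")
print(dict(counts))
```

This pattern works for any aggregation that can be updated incrementally, such as counts, sums, or minima and maxima.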