# Pandas Integration

This notebook demonstrates full pandas API compatibility:
- Using pandas DataFrame operations on a table's data
- Filtering and selection
- Grouping and aggregation
- Sorting and transformation
- Automatic tracking of all operations

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import pandalchemy as pa

## Setup

In [2]:
engine = create_engine('sqlite:///:memory:')
db = pa.DataBase(engine)

# Create sample sales data
np.random.seed(42)
sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=50, freq='D').astype(str),
    'product': np.random.choice(['Widget', 'Gadget', 'Doohickey'], 50),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 50),
    'quantity': np.random.randint(1, 20, 50),
    'unit_price': np.random.choice([9.99, 19.99, 29.99], 50)
})
# Use 1..50 as the index; it is stored as the 'id' primary key
sales_data.index = range(1, 51)

sales = db.create_table('sales', sales_data, primary_key='id')

print("Sample data (first 5 rows):")
sales._data.head()

Sample data (first 5 rows):


| id | date | product | region | quantity | unit_price |
|---|---|---|---|---|---|
| 1 | 2024-01-01 | Doohickey | South | 13 | 9.99 |
| 2 | 2024-01-02 | Widget | North | 18 | 29.99 |
| 3 | 2024-01-03 | Doohickey | South | 15 | 29.99 |
| 4 | 2024-01-04 | Doohickey | North | 13 | 29.99 |
| 5 | 2024-01-05 | Widget | South | 9 | 9.99 |


## Column Operations

### Add calculated column

In [3]:
# Add total column using proper schema change
sales.add_column_with_default('total', 0.0)
sales.push()
sales.pull()

# Calculate total values
sales._data['total'] = sales._data['quantity'] * sales._data['unit_price']
sales.push()

print("✓ Added total = quantity × unit_price")
sales._data[['quantity', 'unit_price', 'total']].head()

✓ Added total = quantity × unit_price


| id | quantity | unit_price | total |
|---|---|---|---|
| 1 | 13 | 9.99 | 129.87 |
| 2 | 18 | 29.99 | 539.82 |
| 3 | 15 | 29.99 | 449.85 |
| 4 | 13 | 29.99 | 389.87 |
| 5 | 9 | 9.99 | 89.91 |


### Apply function to categorize

In [4]:
# Add price_tier column
sales.add_column_with_default('price_tier', 'Medium')
sales.push()
sales.pull()

sales._data['price_tier'] = sales._data['unit_price'].apply(
    lambda x: 'Low' if x < 15 else 'Medium' if x < 25 else 'High'
)
sales.push()

print("✓ Categorized prices into tiers")
sales._data[['unit_price', 'price_tier']].head()

✓ Categorized prices into tiers


| id | unit_price | price_tier |
|---|---|---|
| 1 | 9.99 | Low |
| 2 | 29.99 | High |
| 3 | 29.99 | High |
| 4 | 29.99 | High |
| 5 | 9.99 | Low |
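
The nested conditional passed to `apply` can also be written as a binning step. The following is a minimal sketch in plain pandas, assuming a small stand-in Series named `prices` (hypothetical, not the notebook's data). It uses `pd.cut` with left-closed bins so the thresholds match the lambda above:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's unit_price column.
prices = pd.Series([9.99, 19.99, 29.99, 9.99], name="unit_price")

# Same tiers as the lambda: < 15 is Low, 15 to < 25 is Medium, >= 25 is High.
# right=False makes every bin closed on the left, i.e. [low, high).
tiers = pd.cut(
    prices,
    bins=[float("-inf"), 15, 25, float("inf")],
    labels=["Low", "Medium", "High"],
    right=False,
)

# pd.cut returns a categorical Series; cast to str for plain labels.
tier_labels = tiers.astype(str)

# Reference result using the apply/lambda form from the cell above.
via_apply = prices.apply(
    lambda x: "Low" if x < 15 else "Medium" if x < 25 else "High"
)

print(tier_labels.tolist())
```

Both forms give the same labels here. `pd.cut` avoids a Python-level call per row, which may matter on larger tables.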


## Filtering and Selection

### Filter by condition

In [5]:
high_quantity = sales._data[sales._data['quantity'] > 10]
print(f"Sales with quantity > 10: {len(high_quantity)}")
high_quantity[['product', 'quantity', 'total']].head()

Sales with quantity > 10: 18


| id | product | quantity | total |
|---|---|---|---|
| 1 | Doohickey | 13 | 129.87 |
| 2 | Widget | 18 | 539.82 |
| 3 | Doohickey | 15 | 449.85 |
| 4 | Doohickey | 13 | 389.87 |
| 6 | Widget | 15 | 149.85 |


### Multiple conditions

In [6]:
complex_filter = (sales._data['product'] == 'Widget') & (sales._data['region'] == 'North')
filtered = sales._data[complex_filter]
print(f"Widgets sold in North: {len(filtered)}")
filtered[['product', 'region', 'quantity']].head()

Widgets sold in North: 1


| id | product | region | quantity |
|---|---|---|---|
| 2 | Widget | North | 18 |
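
Each comparison in a combined boolean mask needs its own parentheses, because `&` binds more tightly than `==`. As a sketch in plain pandas on a small hypothetical DataFrame (not the notebook's data), the same filter can be written with `DataFrame.query`, which some readers find easier to scan:

```python
import pandas as pd

# Hypothetical sample rows for illustration.
df = pd.DataFrame({
    "product": ["Widget", "Gadget", "Widget", "Doohickey"],
    "region": ["North", "North", "South", "North"],
    "quantity": [18, 5, 9, 13],
})

# Boolean-mask form: parenthesize each comparison before combining with &.
mask = (df["product"] == "Widget") & (df["region"] == "North")
by_mask = df[mask]

# Equivalent query-string form.
by_query = df.query("product == 'Widget' and region == 'North'")

print(by_mask.equals(by_query))
```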


### Use isin for multiple values

In [7]:
selected = sales._data[sales._data['product'].isin(['Widget', 'Gadget'])]
print(f"Widget or Gadget sales: {len(selected)}")

Widget or Gadget sales: 30
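
The complement of an `isin` selection is obtained by negating the mask with `~`. A minimal sketch in plain pandas, using a small hypothetical DataFrame rather than the notebook's table:

```python
import pandas as pd

# Hypothetical sample rows for illustration.
df = pd.DataFrame({
    "product": ["Widget", "Gadget", "Doohickey", "Widget", "Doohickey"],
    "quantity": [3, 7, 2, 9, 4],
})

wanted = ["Widget", "Gadget"]
in_set = df["product"].isin(wanted)

selected = df[in_set]   # rows whose product is in `wanted`
others = df[~in_set]    # every remaining row

# The two selections partition the original rows.
print(len(selected), len(others), len(df))
```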


## Grouping and Aggregation

### Group by product

In [8]:
by_product = sales._data.groupby('product')['total'].sum()
print("Total sales by product:")
by_product

Total sales by product:


product
Doohickey    3158.48
Gadget       2018.81
Widget       3108.41
Name: total, dtype: float64

### Group by multiple columns

In [9]:
by_product_region = sales._data.groupby(['product', 'region'])['total'].agg(['sum', 'mean', 'count'])
print("Sales by product and region:")
by_product_region.head(10)

Sales by product and region:


| product | region | sum | mean | count |
|---|---|---|---|---|
| Doohickey | East | 399.79 | 133.263333 | 3 |
| Doohickey | North | 1019.66 | 254.915 | 4 |
| Doohickey | South | 1069.43 | 152.775714 | 7 |
| Doohickey | West | 669.6 | 111.6 | 6 |
| Gadget | East | 199.8 | 99.9 | 2 |
| Gadget | North | 389.77 | 194.885 | 2 |
| Gadget | South | 1129.41 | 161.344286 | 7 |
| Gadget | West | 299.83 | 74.9575 | 4 |
| Widget | East | 389.76 | 129.92 | 3 |
| Widget | North | 539.82 | 539.82 | 1 |
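
When the generic column names `sum`, `mean`, and `count` are too terse, pandas named aggregation lets each output column be labeled explicitly. This sketch uses plain pandas on a small hypothetical DataFrame; the output names `total_sum`, `total_mean`, and `n_sales` are illustrative choices:

```python
import pandas as pd

# Hypothetical sample rows for illustration.
df = pd.DataFrame({
    "product": ["Widget", "Widget", "Gadget", "Gadget"],
    "region": ["North", "North", "South", "East"],
    "total": [100.0, 50.0, 30.0, 20.0],
})

# Named aggregation: output_name=(input_column, aggregation_function).
summary = df.groupby(["product", "region"]).agg(
    total_sum=("total", "sum"),
    total_mean=("total", "mean"),
    n_sales=("total", "count"),
)

print(summary)
```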


### Multiple statistics

In [10]:
stats = sales._data.groupby('region').agg({
    'total': ['sum', 'mean', 'count'],
    'quantity': ['sum', 'mean']
})
print("Statistics by region:")
stats

Statistics by region:


| region | total (sum) | total (mean) | total (count) | quantity (sum) | quantity (mean) |
|---|---|---|---|---|---|
| East | 989.35 | 123.66875 | 8 | 65 | 8.125 |
| North | 1949.25 | 278.464286 | 7 | 75 | 10.714286 |
| South | 3148.32 | 165.701053 | 19 | 168 | 8.842105 |
| West | 2198.78 | 137.42375 | 16 | 122 | 7.625 |
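
Passing a dict of lists to `agg` produces two-level column labels such as `('total', 'sum')`. If single-level names are easier to work with downstream, the levels can be joined. A minimal sketch in plain pandas on hypothetical data:

```python
import pandas as pd

# Hypothetical sample rows for illustration.
df = pd.DataFrame({
    "region": ["North", "North", "South"],
    "total": [10.0, 30.0, 5.0],
    "quantity": [1, 3, 2],
})

stats = df.groupby("region").agg({
    "total": ["sum", "mean", "count"],
    "quantity": ["sum", "mean"],
})

# Join each (column, statistic) pair into one label, e.g. 'total_sum'.
flat = stats.copy()
flat.columns = ["_".join(pair) for pair in stats.columns]

print(list(flat.columns))
```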


## Sorting

### Top sales

In [11]:
top_sales = sales._data.nlargest(5, 'total')
print("Top 5 sales by total:")
top_sales[['product', 'region', 'quantity', 'total']]

Top 5 sales by total:


| id | product | region | quantity | total |
|---|---|---|---|---|
| 2 | Widget | North | 18 | 539.82 |
| 16 | Widget | South | 17 | 509.83 |
| 31 | Gadget | South | 16 | 479.84 |
| 3 | Doohickey | South | 15 | 449.85 |
| 4 | Doohickey | North | 13 | 389.87 |
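
`nlargest(n, column)` gives the same rows as sorting in descending order and taking the first `n`, at least when the column has no ties. A sketch in plain pandas on hypothetical data with distinct totals:

```python
import pandas as pd

# Hypothetical sample rows with distinct totals (no ties).
df = pd.DataFrame(
    {"total": [129.87, 539.82, 449.85, 89.91, 389.87]},
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)

top3 = df.nlargest(3, "total")
top3_sorted = df.sort_values("total", ascending=False).head(3)

print(top3.index.tolist())
print(top3.equals(top3_sorted))
```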


### Sort by multiple columns

In [12]:
sorted_sales = sales._data.sort_values(['product', 'total'], ascending=[True, False])
print("Sorted by product then total (desc):")
sorted_sales[['product', 'total']].head(10)

Sorted by product then total (desc):


| id | product | total |
|---|---|---|
| 3 | Doohickey | 449.85 |
| 4 | Doohickey | 389.87 |
| 12 | Doohickey | 359.88 |
| 10 | Doohickey | 269.91 |
| 7 | Doohickey | 259.87 |
| 28 | Doohickey | 239.88 |
| 9 | Doohickey | 209.93 |
| 46 | Doohickey | 209.93 |
| 40 | Doohickey | 149.85 |
| 1 | Doohickey | 129.87 |


## Data Transformation

### Use assign() to add columns

In [13]:
sales._data = sales._data.assign(
    discount=0.1,
    final_price=lambda x: x['total'] * 0.9  # total after the 10% discount
)
print("✓ Added discount and final_price columns")
sales._data[['total', 'discount', 'final_price']].head()

✓ Added discount and final_price columns


| id | total | discount | final_price |
|---|---|---|---|
| 1 | 129.87 | 0.1 | 116.883 |
| 2 | 539.82 | 0.1 | 485.838 |
| 3 | 449.85 | 0.1 | 404.865 |
| 4 | 389.87 | 0.1 | 350.883 |
| 5 | 89.91 | 0.1 | 80.919 |
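
Within a single `assign()` call, a later keyword can refer to a column created by an earlier one, so `final_price` could be derived from `discount` instead of repeating the rate as a literal. `assign()` also returns a new DataFrame and leaves the original untouched, which is why the cell above reassigns the result. A sketch in plain pandas on hypothetical data:

```python
import pandas as pd

# Hypothetical sample rows for illustration.
df = pd.DataFrame({"total": [100.0, 250.0]})

priced = df.assign(
    discount=0.1,
    # `x` already contains the `discount` column defined just above.
    final_price=lambda x: x["total"] * (1 - x["discount"]),
)

print(list(priced.columns))
print("discount" in df.columns)  # the original frame is not modified
```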


### Transform by groups

In [14]:
sales._data['pct_of_product_total'] = sales._data.groupby('product')['total'].transform(
    lambda x: x / x.sum() * 100
)
print("✓ Calculated percentage of product total")
sales._data[['product', 'total', 'pct_of_product_total']].head()

✓ Calculated percentage of product total


| id | product | total | pct_of_product_total |
|---|---|---|---|
| 1 | Doohickey | 129.87 | 4.111788 |
| 2 | Widget | 539.82 | 17.366435 |
| 3 | Doohickey | 449.85 | 14.24261 |
| 4 | Doohickey | 389.87 | 12.343596 |
| 5 | Widget | 89.91 | 2.892476 |
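
Unlike `agg`, `transform` returns a result aligned to the original rows, so it can be assigned straight back as a column. A quick sanity check is that the percentages within each product sum to 100. The sketch below uses plain pandas on hypothetical data, and also shows an equivalent form that divides by the built-in `'sum'` transform instead of using a lambda:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows for illustration.
df = pd.DataFrame({
    "product": ["Widget", "Widget", "Gadget", "Gadget"],
    "total": [30.0, 70.0, 50.0, 150.0],
})

# Lambda form, as in the cell above.
pct_lambda = df.groupby("product")["total"].transform(
    lambda x: x / x.sum() * 100
)

# Equivalent form: broadcast each group's sum back to its rows, then divide.
group_sum = df.groupby("product")["total"].transform("sum")
pct_vectorized = df["total"] / group_sum * 100

# Percentages within each product should add up to 100.
check = pct_lambda.groupby(df["product"]).sum()
print(check)
```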


In [15]:
sales.push()

## Summary

**Key Takeaways:**
- The full pandas DataFrame API is available
- All operations are tracked automatically
- Use `.loc[]`, `.iloc[]`, `.at[]`, `.iat[]` for indexing
- `groupby()`, `merge()`, `pivot_table()` all work
- String and datetime operations are supported
- Changes are tracked regardless of operation type
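
Several operations named in the takeaways (indexers, `merge()`, `pivot_table()`, string and datetime accessors) are not exercised in the cells above. The sketch below shows them in plain pandas on a small hypothetical DataFrame. It does not go through pandalchemy, so it illustrates only the pandas side; the `categories` lookup table is an assumption made for the example.

```python
import pandas as pd

# Hypothetical sample rows shaped like the notebook's sales table.
sales = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-02-01"],
        "product": ["Widget", "Gadget", "Widget"],
        "region": ["North", "South", "South"],
        "total": [100.0, 40.0, 60.0],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# Label-based and position-based access.
first_total = sales.loc[1, "total"]   # row whose id is 1
last_product = sales.iat[2, 1]        # third row, second column

# merge(): attach a category from a hypothetical lookup table.
categories = pd.DataFrame({
    "product": ["Widget", "Gadget"],
    "category": ["Hardware", "Electronics"],
})
merged = sales.reset_index().merge(categories, on="product", how="left")

# pivot_table(): total per product and region, 0 where there are no sales.
pivot = sales.pivot_table(
    index="product", columns="region", values="total",
    aggfunc="sum", fill_value=0,
)

# String and datetime accessors.
upper = sales["product"].str.upper()
month = pd.to_datetime(sales["date"]).dt.month

print(pivot)
```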