# PipeFrame Tutorial: Complete Guide

**Author:** Dr. Yasser Mustafa  
**Date:** February 2024  
**Version:** 0.2.1

Welcome to PipeFrame! This tutorial walks through the core PipeFrame workflow: creating DataFrames, chaining verbs with the `>>` operator, reshaping data, reading and writing files, and validating expressions.

## What You'll Learn

1. 🚀 Installation and Setup
2. 📊 Core Concepts
3. 🔧 Basic Operations
4. 🎯 Advanced Features
5. 💼 Real-World Examples
6. 💾 I/O Operations
7. 🔒 Security Features
8. ⚡ Performance Tips

---

## 1. Installation and Setup

### Installation

```bash
# Basic installation
pip install pipeframe

# With all features
pip install pipeframe[all]
```

### Import

In [1]:
# Main imports
# Note: the star import replaces Python's built-in filter() with PipeFrame's verb
from pipeframe import *

# For comparison
import pandas as pd
import numpy as np

# For visualization (optional)
import matplotlib.pyplot as plt
%matplotlib inline

print(f"✅ PipeFrame version: {__version__}")
print("Ready to pipe your data!")

✅ PipeFrame version: 0.2.1
Ready to pipe your data!


---

## 2. Core Concepts

### The Philosophy

PipeFrame is built on three core principles:

1. **Readability** - Code should read like English
2. **Composability** - Complex operations from simple building blocks
3. **Safety** - String expressions are validated before they run

### The Pipe Operator `>>`

The pipe operator chains operations left-to-right, making code read like a recipe:

In [2]:
# Create sample data
employees = DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'age': [25, 32, 37, 29, 41, 28],
    'salary': [50000, 65000, 72000, 58000, 85000, 52000],
    'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Engineering', 'Sales'],
    'years': [2, 5, 8, 3, 12, 1]
})

print("📊 Employee Dataset:")
employees

📊 Employee Dataset:


<pipeframe.DataFrame shape=(6, 5)>
      name  age  salary         dept  years
0    Alice   25   50000  Engineering      2
1      Bob   32   65000    Marketing      5
2  Charlie   37   72000  Engineering      8
3    Diana   29   58000        Sales      3
4      Eve   41   85000  Engineering     12
5    Frank   28   52000        Sales      1

In [3]:
# Compare traditional pandas vs PipeFrame

# ❌ Pandas: Hard to read (`_data` is the wrapped pandas DataFrame)
pandas_result = employees._data[employees._data['age'] > 30].groupby('dept')['salary'].mean().sort_values(ascending=False)
print("Pandas result:")
print(pandas_result)

print("\n" + "="*50 + "\n")

# ✅ PipeFrame: Clear and intuitive
pipeframe_result = (employees
    >> filter('age > 30')
    >> group_by('dept')
    >> summarize(avg_salary='mean(salary)')
    >> arrange('-avg_salary')
)
print("PipeFrame result:")
pipeframe_result

Pandas result:
dept
Engineering    78500.0
Marketing      65000.0
Name: salary, dtype: float64


PipeFrame result:


<pipeframe.DataFrame shape=(2, 2)>
          dept  avg_salary
0  Engineering     78500.0
1    Marketing     65000.0
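
One common way to get this left-to-right syntax in Python is the reflected shift method `__rrshift__`. The sketch below is not PipeFrame's source: `Verb`, `keep_rows`, and `sort_desc` are hypothetical names, and plain pandas does the work underneath. It only illustrates how `data >> verb` can dispatch to a function.

```python
import pandas as pd


class Verb:
    """Wraps a function so that `df >> Verb(...)` calls it on df."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, left):
        # Python calls this for `left >> self` when `left` does not
        # implement `>>` itself (a plain pandas DataFrame does not).
        return self.func(left)


def keep_rows(expr):
    # Hypothetical stand-in for a filter() verb
    return Verb(lambda df: df.query(expr))


def sort_desc(column):
    # Hypothetical stand-in for arrange('-column')
    return Verb(lambda df: df.sort_values(column, ascending=False))


staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 32, 37],
})

# Each `>>` feeds the result on its left into the verb on its right
older = staff >> keep_rows('age > 30') >> sort_desc('age')
print(older['name'].tolist())  # ['Charlie', 'Bob']
```

Because `>>` is left-associative, each step receives the DataFrame produced by the previous one, which is what makes the chain read top to bottom.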

---

## 3. Basic Operations

### 3.1 Creating DataFrames

In [4]:
# From dictionary
df1 = DataFrame({
    'x': [1, 2, 3, 4],
    'y': ['a', 'b', 'c', 'd']
})

# From lists
df2 = DataFrame([[1, 2], [3, 4]], columns=['col1', 'col2'])

# From pandas DataFrame
pdf = pd.DataFrame({'a': [1, 2, 3]})
df3 = DataFrame.from_pandas(pdf)

print("Dictionary-based:")
display(df1)
print("\nList-based:")
display(df2)

Dictionary-based:


<pipeframe.DataFrame shape=(4, 2)>
   x  y
0  1  a
1  2  b
2  3  c
3  4  d


List-based:


<pipeframe.DataFrame shape=(2, 2)>
   col1  col2
0     1     2
1     3     4

### 3.2 Filtering Rows with `filter()`

In [5]:
# Simple filter
senior_employees = employees >> filter('age >= 35')
print(f"Senior employees (age >= 35): {len(senior_employees)} people")
senior_employees

Senior employees (age >= 35): 2 people


<pipeframe.DataFrame shape=(2, 5)>
      name  age  salary         dept  years
2  Charlie   37   72000  Engineering      8
4      Eve   41   85000  Engineering     12

In [6]:
# Complex filter with multiple conditions
high_earners = employees >> filter('(salary > 60000) & (years > 5)')
print(f"High earners with experience: {len(high_earners)} people")
high_earners

High earners with experience: 2 people


<pipeframe.DataFrame shape=(2, 5)>
      name  age  salary         dept  years
2  Charlie   37   72000  Engineering      8
4      Eve   41   85000  Engineering     12

In [7]:
# String matching
engineers = employees >> filter('dept == "Engineering"')
print(f"Engineers: {len(engineers)} people")
engineers

Engineers: 3 people


<pipeframe.DataFrame shape=(3, 5)>
      name  age  salary         dept  years
0    Alice   25   50000  Engineering      2
2  Charlie   37   72000  Engineering      8
4      Eve   41   85000  Engineering     12

### 3.3 Creating Columns with `define()`

In [8]:
# Add computed columns
enhanced = employees >> define(
    # Basic math
    annual_bonus='salary * 0.1',
    total_comp='salary + annual_bonus',
    
    # Conditional columns
    seniority=if_else('years >= 10', 'Senior', 'Junior'),
    
    # Multiple conditions
    salary_grade=case_when(
        ('salary >= 80000', 'High'),
        ('salary >= 60000', 'Medium'),
        default='Low'
    )
)

enhanced[['name', 'salary', 'annual_bonus', 'total_comp', 'seniority', 'salary_grade']]

<pipeframe.DataFrame shape=(6, 6)>
      name  salary  annual_bonus  total_comp seniority salary_grade
0    Alice   50000        5000.0     55000.0    Junior          Low
1      Bob   65000        6500.0     71500.0    Junior       Medium
2  Charlie   72000        7200.0     79200.0    Junior       Medium
3    Diana   58000        5800.0     63800.0    Junior          Low
4      Eve   85000        8500.0     93500.0    Senior         High
5    Frank   52000        5200.0     57200.0    Junior          Low
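
For readers coming from pandas, the two conditional helpers map onto familiar NumPy calls. The following is a rough equivalent written with `numpy.where` and `numpy.select`, assuming that `case_when` takes the first matching condition (the output above is consistent with that). It is a comparison sketch, not a description of PipeFrame internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'salary': [50000, 65000, 72000, 58000, 85000, 52000],
    'years': [2, 5, 8, 3, 12, 1],
})

# Two-way choice, analogous to if_else(condition, true_value, false_value)
df['seniority'] = np.where(df['years'] >= 10, 'Senior', 'Junior')

# Ordered conditions: the first one that matches wins, otherwise the default
conditions = [df['salary'] >= 80000, df['salary'] >= 60000]
labels = ['High', 'Medium']
df['salary_grade'] = np.select(conditions, labels, default='Low')

print(df[['name', 'seniority', 'salary_grade']])
```

Order matters: if the `>= 60000` test came first, Eve would be graded `Medium` because that condition also matches her salary.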

### 3.4 Selecting Columns with `select()`

In [9]:
# Select specific columns
basic_info = employees >> select('name', 'dept', 'salary')
basic_info

<pipeframe.DataFrame shape=(6, 3)>
      name         dept  salary
0    Alice  Engineering   50000
1      Bob    Marketing   65000
2  Charlie  Engineering   72000
3    Diana        Sales   58000
4      Eve  Engineering   85000
5    Frank        Sales   52000

In [10]:
# Select range of columns
range_select = employees >> select('name', 'age:salary')
print("Columns:", list(range_select.columns))
range_select

Columns: ['name', 'age', 'salary']


<pipeframe.DataFrame shape=(6, 3)>
      name  age  salary
0    Alice   25   50000
1      Bob   32   65000
2  Charlie   37   72000
3    Diana   29   58000
4      Eve   41   85000
5    Frank   28   52000

### 3.5 Sorting with `arrange()`

In [11]:
# Sort ascending
by_age = employees >> arrange('age')
print("Sorted by age (youngest first):")
by_age[['name', 'age']]

Sorted by age (youngest first):


<pipeframe.DataFrame shape=(6, 2)>
      name  age
0    Alice   25
5    Frank   28
3    Diana   29
1      Bob   32
2  Charlie   37
4      Eve   41

In [12]:
# Sort descending (using '-' prefix)
by_salary_desc = employees >> arrange('-salary')
print("Sorted by salary (highest first):")
by_salary_desc[['name', 'salary']]

Sorted by salary (highest first):


<pipeframe.DataFrame shape=(6, 2)>
      name  salary
4      Eve   85000
2  Charlie   72000
1      Bob   65000
3    Diana   58000
5    Frank   52000
0    Alice   50000

In [13]:
# Multiple sort keys
multi_sort = employees >> arrange('dept', '-salary')
print("Sorted by dept (asc), then salary (desc):")
multi_sort[['name', 'dept', 'salary']]

Sorted by dept (asc), then salary (desc):


<pipeframe.DataFrame shape=(6, 3)>
      name         dept  salary
4      Eve  Engineering   85000
2  Charlie  Engineering   72000
0    Alice  Engineering   50000
1      Bob    Marketing   65000
3    Diana        Sales   58000
5    Frank        Sales   52000
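
The `'-'` prefix is a compact way to express a per-column sort direction. As an illustration, the hypothetical helper `split_sort_keys` below translates such keys into the `by` and `ascending` arguments of pandas `sort_values`; how PipeFrame parses the prefix internally is not shown in this tutorial.

```python
import pandas as pd


def split_sort_keys(*keys):
    """Hypothetical helper: ('dept', '-salary') -> (['dept', 'salary'], [True, False])."""
    columns = [k[1:] if k.startswith('-') else k for k in keys]
    ascending = [not k.startswith('-') for k in keys]
    return columns, ascending


df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Engineering', 'Sales'],
    'salary': [50000, 65000, 72000, 58000, 85000, 52000],
})

columns, ascending = split_sort_keys('dept', '-salary')
ordered = df.sort_values(columns, ascending=ascending)
print(ordered['name'].tolist())
# ['Eve', 'Charlie', 'Alice', 'Bob', 'Diana', 'Frank']
```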

### 3.6 Renaming Columns with `rename()`

In [14]:
# Rename columns
renamed = employees >> rename(
    employee_name='name',
    department='dept',
    years_experience='years'
)

print("New column names:")
print(list(renamed.columns))
renamed.head(3)

New column names:
['employee_name', 'age', 'salary', 'department', 'years_experience']


<pipeframe.DataFrame shape=(3, 5)>
  employee_name  age  salary   department  years_experience
0         Alice   25   50000  Engineering                 2
1           Bob   32   65000    Marketing                 5
2       Charlie   37   72000  Engineering                 8

### 3.7 Getting Unique Values with `distinct()`

In [15]:
# Unique departments
unique_depts = employees >> distinct('dept')
print(f"Unique departments: {len(unique_depts)}")
unique_depts

Unique departments: 3


<pipeframe.DataFrame shape=(3, 5)>
    name  age  salary         dept  years
0  Alice   25   50000  Engineering      2
1    Bob   32   65000    Marketing      5
3  Diana   29   58000        Sales      3
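
Note that the result above keeps all five columns and the first row seen for each department. In pandas terms this matches `drop_duplicates(subset=...)` with its default `keep='first'`; the sketch below shows that equivalent, plus `unique()` for the case where only the distinct values are needed. Treat the mapping as an observation about the output, not a statement about how `distinct()` is implemented.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Engineering', 'Sales'],
})

# Keep the first row for each department (all columns survive)
first_per_dept = df.drop_duplicates(subset='dept')
print(first_per_dept)

# Only the distinct values, in order of first appearance
dept_values = df['dept'].unique().tolist()
print(dept_values)  # ['Engineering', 'Marketing', 'Sales']
```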

---

## 4. Advanced Features

### 4.1 Method Chaining

In [16]:
# Complex pipeline combining multiple operations
analysis = (employees
    >> filter('age >= 30')                      # Keep employees aged 30 or older
    >> define(
        bonus='salary * 0.15',                  # Calculate bonus
        total='salary + bonus',                 # Total compensation
        category=if_else('salary > 70000', 'High', 'Standard')
    )
    >> select('name', 'dept', 'salary', 'bonus', 'total', 'category')
    >> arrange('-total')                        # Sort by total comp
)

print("📊 Senior employee analysis:")
analysis

📊 Senior employee analysis:


<pipeframe.DataFrame shape=(3, 6)>
      name         dept  salary    bonus    total  category
4      Eve  Engineering   85000  12750.0  97750.0      High
2  Charlie  Engineering   72000  10800.0  82800.0      High
1      Bob    Marketing   65000   9750.0  74750.0  Standard

### 4.2 GroupBy and Summarize

In [17]:
# Department summary
dept_summary = (employees
    >> group_by('dept')
    >> summarize(
        headcount='count()',
        avg_salary='mean(salary)',
        max_salary='max(salary)',
        avg_years='mean(years)'
    )
    >> arrange('-avg_salary')
)

print("📊 Department Summary:")
dept_summary

📊 Department Summary:


<pipeframe.DataFrame shape=(3, 5)>
          dept  headcount  avg_salary  max_salary  avg_years
0  Engineering          3     69000.0       85000   7.333333
1    Marketing          1     65000.0       65000   5.000000
2        Sales          2     55000.0       58000   2.000000
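
To check the numbers above independently, the same summary can be written with pandas named aggregation. This is a plain-pandas cross-check, assuming `count()` means the number of rows per group.

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Engineering', 'Sales'],
    'salary': [50000, 65000, 72000, 58000, 85000, 52000],
    'years': [2, 5, 8, 3, 12, 1],
})

summary = (
    df.groupby('dept')
      .agg(
          headcount=('salary', 'size'),    # rows per department
          avg_salary=('salary', 'mean'),
          max_salary=('salary', 'max'),
          avg_years=('years', 'mean'),
      )
      .reset_index()
      .sort_values('avg_salary', ascending=False)
)
print(summary)
```

Each keyword in `agg` names an output column and pairs a source column with an aggregation, which is close in spirit to `summarize(avg_salary='mean(salary)')`.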

### 4.3 Conditional Logic with `case_when()`

In [18]:
# Performance grades
graded = employees >> define(
    performance_score='salary / 1000 + years * 5',  # Hypothetical score
    grade=case_when(
        ('performance_score >= 100', 'A - Excellent'),
        ('performance_score >= 85', 'B - Good'),
        ('performance_score >= 70', 'C - Satisfactory'),
        default='D - Needs Improvement'
    )
)

graded[['name', 'performance_score', 'grade']]

<pipeframe.DataFrame shape=(6, 3)>
      name  performance_score                  grade
0    Alice               60.0  D - Needs Improvement
1      Bob               90.0               B - Good
2  Charlie              112.0          A - Excellent
3    Diana               73.0       C - Satisfactory
4      Eve              145.0          A - Excellent
5    Frank               57.0  D - Needs Improvement

### 4.4 Reshape Operations

In [19]:
# Create quarterly sales data (wide format)
sales_wide = DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'Q1': [100, 150, 120],
    'Q2': [110, 140, 130],
    'Q3': [120, 160, 140],
    'Q4': [130, 170, 150]
})

print("Wide format (original):")
display(sales_wide)

# Convert to long format
sales_long = sales_wide >> pivot_longer(
    cols=['Q1', 'Q2', 'Q3', 'Q4'],
    names_to='quarter',
    values_to='sales'
)

print("\nLong format (pivoted):")
sales_long

Wide format (original):


<pipeframe.DataFrame shape=(3, 5)>
     product   Q1   Q2   Q3   Q4
0     Widget  100  110  120  130
1     Gadget  150  140  160  170
2  Doohickey  120  130  140  150


Long format (pivoted):


<pipeframe.DataFrame shape=(12, 3)>
      product quarter  sales
0      Widget      Q1    100
1      Gadget      Q1    150
2   Doohickey      Q1    120
3      Widget      Q2    110
4      Gadget      Q2    140
5   Doohickey      Q2    130
6      Widget      Q3    120
7      Gadget      Q3    160
8   Doohickey      Q3    140
9      Widget      Q4    130
10     Gadget      Q4    170
11  Doohickey      Q4    150

In [20]:
# Convert back to wide format
sales_wide_again = sales_long >> pivot_wider(
    id_cols='product',
    names_from='quarter',
    values_from='sales'
)

print("Back to wide format:")
sales_wide_again

Back to wide format:


<pipeframe.DataFrame shape=(3, 5)>
     product  sales_Q1  sales_Q2  sales_Q3  sales_Q4
0  Doohickey       120       130       140       150
1     Gadget       150       140       160       170
2     Widget       100       110       120       130
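
The round trip above corresponds to `melt` and `pivot` in pandas. The sketch below reproduces it with plain pandas for comparison. One visible difference: pandas keeps the quarter labels as column names (`Q1`, `Q2`, ...), whereas the PipeFrame output above prefixes them with the value column (`sales_Q1`, ...).

```python
import pandas as pd

wide = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'Q1': [100, 150, 120],
    'Q2': [110, 140, 130],
    'Q3': [120, 160, 140],
    'Q4': [130, 170, 150],
})

# Wide -> long: one row per (product, quarter)
long = wide.melt(id_vars='product', var_name='quarter', value_name='sales')

# Long -> wide: quarters become columns again, rows sorted by product
wide_again = (
    long.pivot(index='product', columns='quarter', values='sales')
        .reset_index()
)
wide_again.columns.name = None  # drop the leftover 'quarter' axis label

print(long.shape)   # (12, 3)
print(wide_again)
```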

### 4.5 Separate and Unite Columns

In [21]:
# Create data with names to split
people = DataFrame({
    'full_name': ['John Doe', 'Jane Smith', 'Bob Jones'],
    'age': [30, 25, 35]
})

# Separate full_name into first and last
separated = people >> separate('full_name', into=['first', 'last'], sep=' ')
print("After separation:")
separated

After separation:


<pipeframe.DataFrame shape=(3, 3)>
   age first   last
0   30  John    Doe
1   25  Jane  Smith
2   35   Bob  Jones

In [22]:
# Create date parts
dates = DataFrame({
    'id': [1, 2, 3],
    'year': [2024, 2024, 2024],
    'month': [1, 2, 3],
    'day': [15, 20, 10]
})

# Unite into single date column
united = dates >> unite('date', ['year', 'month', 'day'], sep='-')
print("After uniting:")
united

After uniting:


<pipeframe.DataFrame shape=(3, 2)>
   id       date
0   1  2024-1-15
1   2  2024-2-20
2   3  2024-3-10
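
The united strings above are not zero-padded (`2024-1-15`), so they are not ISO 8601 dates. The plain-pandas sketch below mirrors both operations and then shows one way to obtain padded dates with `pd.to_datetime`; it is offered as a comparison, not as PipeFrame behavior.

```python
import pandas as pd

people = pd.DataFrame({
    'full_name': ['John Doe', 'Jane Smith', 'Bob Jones'],
    'age': [30, 25, 35],
})

# Split on the first space and replace the original column
parts = people['full_name'].str.split(' ', n=1, expand=True)
separated = people.drop(columns='full_name').assign(first=parts[0], last=parts[1])

dates = pd.DataFrame({
    'id': [1, 2, 3],
    'year': [2024, 2024, 2024],
    'month': [1, 2, 3],
    'day': [15, 20, 10],
})

# Join the parts as text, matching the unpadded result shown above
dates['date'] = dates[['year', 'month', 'day']].astype(str).apply('-'.join, axis=1)

# Build real dates from the parts, then format with zero padding
iso = pd.to_datetime(dates[['year', 'month', 'day']]).dt.strftime('%Y-%m-%d')

print(separated)
print(dates['date'].tolist())  # ['2024-1-15', '2024-2-20', '2024-3-10']
print(iso.tolist())            # ['2024-01-15', '2024-02-20', '2024-03-10']
```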

---

## 5. Real-World Examples

### 5.1 Sales Analysis Pipeline

In [23]:
# Create sample sales data
sales_data = DataFrame({
    'date': ['2024-01-15', '2024-01-20', '2024-02-10', '2024-02-15', '2024-03-05', '2024-03-12'],
    'product': ['Widget', 'Gadget', 'Widget', 'Doohickey', 'Gadget', 'Widget'],
    'quantity': [10, 5, 15, 8, 12, 20],
    'price': [100, 200, 100, 150, 200, 100],
    'cost': [60, 120, 60, 90, 120, 60]
})

# Comprehensive sales analysis
sales_analysis = (sales_data
    >> define(
        revenue='quantity * price',
        total_cost='quantity * cost',
        profit='revenue - total_cost',
        margin='profit / revenue * 100',
        month='pd.to_datetime(date).dt.month'
    )
    >> group_by('product', 'month')
    >> summarize(
        total_revenue='sum(revenue)',
        total_profit='sum(profit)',
        avg_margin='mean(margin)',
        num_sales='count()'
    )
    >> define(
        profit_per_sale='total_profit / num_sales'
    )
    >> arrange('-total_revenue')
)

print("📊 Sales Analysis by Product and Month:")
sales_analysis

📊 Sales Analysis by Product and Month:


<pipeframe.DataFrame shape=(6, 7)>
     product  month  num_sales  total_revenue  total_profit  avg_margin  profit_per_sale
2     Gadget      3          1           2400           960        40.0            960.0
5     Widget      3          1           2000           800        40.0            800.0
4     Widget      2          1           1500           600        40.0            600.0
0  Doohickey      2          1           1200           480        40.0            480.0
1     Gadget      1          1           1000           400        40.0            400.0
3     Widget      1          1           1000           400        40.0            400.0

### 5.2 Customer Segmentation

In [24]:
# Create customer data
customers = DataFrame({
    'customer_id': range(1, 9),
    'total_purchases': [15, 3, 25, 8, 1, 40, 12, 5],
    'total_spent': [1500, 200, 3500, 600, 50, 5000, 1200, 400],
    'days_since_last': [5, 45, 10, 90, 180, 2, 30, 120]
})

# Segment customers
segments = (customers
    >> filter('total_purchases > 0')
    >> define(
        avg_order_value='total_spent / total_purchases',
        segment=case_when(
            ('(avg_order_value > 100) & (days_since_last < 30)', 'Premium Active'),
            ('(avg_order_value > 100) & (days_since_last >= 30)', 'Premium At Risk'),
            ('days_since_last < 30', 'Standard Active'),
            ('days_since_last < 90', 'At Risk'),
            default='Churned'
        )
    )
    >> group_by('segment')
    >> summarize(
        customers='count()',
        total_value='sum(total_spent)',
        avg_value='mean(total_spent)'
    )
    >> arrange('-total_value')
)

print("👥 Customer Segmentation:")
segments

👥 Customer Segmentation:


<pipeframe.DataFrame shape=(4, 4)>
           segment  customers  total_value  avg_value
2   Premium Active          2         8500     4250.0
3  Standard Active          1         1500     1500.0
0          At Risk          2         1400      700.0
1          Churned          3         1050      350.0
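
The `case_when` conditions in this example combine comparisons with `&`. In plain Python, `&` binds more tightly than `>` and `<`, while `pandas.eval` gives `&` the precedence of `and`. This tutorial does not state which rule PipeFrame's expression parser follows, so parenthesizing each comparison, as the filter in section 3.2 does, avoids depending on it. The snippet below shows the plain-Python pitfall with two ordinary integers.

```python
# A customer who should NOT count as premium: average order value is 50
aov, days = 50, 10

# Intended meaning: both comparisons evaluated first, then combined
intended = (aov > 100) & (days < 30)

# Without parentheses Python parses this as: aov > (100 & days) < 30
# 100 & 10 == 0, so it becomes the chained comparison 50 > 0 < 30
unparenthesized = aov > 100 & days < 30

print(intended)         # False
print(unparenthesized)  # True
```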

---

## 6. I/O Operations

### 6.1 Reading Data

In [None]:
# Create sample data and save it
sample = DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [95, 87, 92]
})

# Save to CSV
sample.to_csv('sample.csv', index=False)
print("✅ Saved to sample.csv")

# Read back
df_from_csv = read_csv('sample.csv')
print("\n📖 Read from CSV:")
df_from_csv

---

## 7. Security Features

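PipeFrame validates string expressions before evaluating them, so an expression that tries to run arbitrary code raises `PipeFrameExpressionError` instead of executing: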

In [None]:
from pipeframe.exceptions import PipeFrameExpressionError

# ✅ Safe expressions work fine
try:
    result = employees >> define(safe_column='salary * 2')
    print("✅ Safe expression allowed")
except Exception as e:
    print(f"❌ Error: {e}")

# ❌ Dangerous expressions are blocked
try:
    result = employees >> define(bad="__import__('os').system('ls')")
    print("❌ SECURITY FAILURE: Dangerous code executed!")
except PipeFrameExpressionError as e:
    print(f"✅ Security check passed: {e}")
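
To give a feel for how this kind of validation can work, here is a minimal allow-list check built on Python's standard `ast` module. It is an illustrative sketch only: `is_safe_expression` and `ALLOWED_NODES` are hypothetical names, PipeFrame's actual rules are not shown in this tutorial, and this version is deliberately strict (it rejects every function call, including harmless ones such as `mean(salary)`).

```python
import ast

# Node types a simple arithmetic/comparison expression is allowed to contain
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.BoolOp, ast.Compare,
    ast.Name, ast.Load, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Mod, ast.Pow,
    ast.UAdd, ast.USub, ast.Not, ast.Invert,
    ast.And, ast.Or, ast.BitAnd, ast.BitOr,
    ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
)


def is_safe_expression(expr: str) -> bool:
    """Return True only if every node in the parsed expression is allow-listed."""
    try:
        tree = ast.parse(expr, mode='eval')
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False  # e.g. ast.Call or ast.Attribute
        if isinstance(node, ast.Name) and node.id.startswith('__'):
            return False  # no dunder names such as __import__
    return True


print(is_safe_expression('salary * 2'))                       # True
print(is_safe_expression('(salary > 60000) & (years > 5)'))   # True
print(is_safe_expression("__import__('os').system('ls')"))    # False
```

Checking the parsed tree rather than the raw string means the decision is based on what the expression would do, not on pattern matching against suspicious substrings.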

---

## 8. Performance Tips

### 8.1 Performance Comparison

In [26]:
import time

# Create larger dataset
np.random.seed(42)
n = 100000

large_df = DataFrame({
    'x': np.random.randint(0, 100, n),
    'y': np.random.randint(0, 100, n),
    'category': np.random.choice(['A', 'B', 'C'], n),
    'value': np.random.randn(n) * 100
})

# Benchmark filter operation (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
pandas_result = large_df._data[large_df._data['x'] > 50]
pandas_time = time.perf_counter() - start

start = time.perf_counter()
pipeframe_result = large_df >> filter('x > 50')
pipeframe_time = time.perf_counter() - start

print(f"⚡ Performance on {n:,} rows:")
print(f"  Pandas:     {pandas_time*1000:.2f}ms")
print(f"  PipeFrame:  {pipeframe_time*1000:.2f}ms")
print(f"  Overhead:   {(pipeframe_time/pandas_time - 1)*100:.1f}%")
print(f"\n📊 {len(pipeframe_result):,} rows matched the filter")

⚡ Performance on 100,000 rows:
  Pandas:     2.01ms
  PipeFrame:  4.70ms
  Overhead:   133.6%

📊 48,968 rows matched the filter
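
A single timed run in the low-millisecond range is sensitive to noise, so the percentages above should be read as indicative. A more stable measurement repeats each operation and takes the best run. The sketch below does this with the standard `timeit` module on plain pandas, comparing a boolean mask with a string-based `query`; it does not measure PipeFrame itself, and the timings it prints will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000
frame = pd.DataFrame({'x': rng.integers(0, 100, n)})


def mask_filter():
    # Direct boolean indexing
    return frame[frame['x'] > 50]


def query_filter():
    # String expression, parsed on every call
    return frame.query('x > 50')


calls = 20
# Best of 5 repeats, each repeat timing `calls` invocations
mask_ms = min(timeit.repeat(mask_filter, number=calls, repeat=5)) / calls * 1000
query_ms = min(timeit.repeat(query_filter, number=calls, repeat=5)) / calls * 1000

print(f"boolean mask: {mask_ms:.2f} ms per call")
print(f"query string: {query_ms:.2f} ms per call")
print("same rows:", mask_filter().equals(query_filter()))
```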


---

## Summary

### What You've Learned

✅ **Core Concepts**
- Pipe operator `>>` for natural chaining
- String expressions for readable conditions
- Security-validated operations

✅ **Essential Verbs**
- `filter()` - Select rows
- `define()` - Create columns
- `select()` - Choose columns
- `arrange()` - Sort data
- `group_by()` / `summarize()` - Aggregate

✅ **Advanced Features**
- Conditional logic (`if_else`, `case_when`)
- Reshape operations (`pivot_longer`, `pivot_wider`)
- Column operations (separate, unite)
- I/O operations

✅ **Real-World Applications**
- Sales analysis
- Customer segmentation

### Next Steps

1. 📚 **Read the docs**: https://pipeframe.readthedocs.io
2. 💬 **Join discussions**: https://github.com/Yasser03/pipeframe/discussions
3. 🐛 **Report issues**: https://github.com/Yasser03/pipeframe/issues
4. ⭐ **Star on GitHub**: https://github.com/Yasser03/pipeframe

### Get Help

- **Email**: yasser.mustafan@gmail.com
- **GitHub**: @Yasser03

---

**Happy piping! 🔄**

*Built with ❤️ by Dr. Yasser Mustafa*