# Mastering `.pipe()` in Pandas: The Secret to Cleaner Data Pipelines

When working with pandas, you've probably found yourself writing long method chains that become hard to read and maintain. That's where `.pipe()` comes in.

The `.pipe()` method in pandas is a hidden gem that allows you to **insert custom functions seamlessly into your workflow**, making your data pipelines more modular, readable, and reusable.

## 🧠 What is `.pipe()`?

At its core, `.pipe()` is nothing more than a way to pass a DataFrame (or Series) to a function:

```
df.pipe(func, *args, **kwargs)
```

is exactly the same as:

```
func(df, *args, **kwargs)
```

So `.pipe()` simply **hands your DataFrame to the function as the first argument**.
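
A minimal sketch of this equivalence. The `scale` function and the toy data are made up for illustration and are not part of the original walkthrough:

```python
import pandas as pd

# Hypothetical helper for illustration: multiplies every value by `factor`.
def scale(frame, factor=1):
    return frame * factor

toy = pd.DataFrame({"A": [1, 2, 3]})

via_pipe = toy.pipe(scale, factor=10)   # DataFrame is passed as the first argument
direct = scale(toy, factor=10)          # plain function call

print(via_pipe.equals(direct))  # True
```

Both calls run the same function on the same data, so the two results are identical.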

### 🔧 Why Use `.pipe()`?

- ✅ Keeps method chains clean and readable
- ✅ Allows use of custom functions inside chains
- ✅ Encourages modular, reusable code
- ✅ Works with both DataFrames and Series

Without `.pipe()`, calling a custom function usually means breaking the chain with an intermediate variable or wrapping the whole chain in nested function calls, which makes the pipeline harder to follow.
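
To make the contrast concrete, here is a small sketch with two made-up helpers (`drop_negatives` and `add_total` are illustrative names, not pandas functions). Nested calls read inside-out, while the `.pipe()` version reads in execution order:

```python
import pandas as pd

# Illustrative helpers (assumptions for this sketch, not part of pandas).
def drop_negatives(frame, col):
    return frame[frame[col] >= 0]

def add_total(frame, cols, name="total"):
    return frame.assign(**{name: frame[cols].sum(axis=1)})

sales = pd.DataFrame({"x": [1, -2, 3], "y": [10, 20, 30]})

# Nested calls: the first step applied is the innermost one.
nested = add_total(drop_negatives(sales, "x"), ["x", "y"])

# Same steps as a chain: reads top to bottom.
chained = (
    sales
    .pipe(drop_negatives, "x")
    .pipe(add_total, ["x", "y"])
)

print(chained)
```

Both versions produce the same DataFrame; only the reading order differs.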

---

# A. Simple Example with `pipe()` 

```python
# import the libraries
import pandas as pd
import numpy as np
```

```python
# Without pipe()
df = pd.DataFrame({
    "A": [10, 20, 30, 40, np.nan, 60, 70, 70],
    "B": [5, 15, np.nan, 25, 35, 45, np.nan, 65],
    "C": ["x", "y", "z", "x", "y", "z", "x", "y"]
})

# Works on a single value (via .apply) or on a whole Series (via .pipe)
def add_5_and_square(x):
    return (x + 5) ** 2

df['A_add_5_and_square'] = df['A'].apply(add_5_and_square)
df
```

Output:

```
      A     B  C  A_add_5_and_square
0  10.0   5.0  x               225.0
1  20.0  15.0  y               625.0
2  30.0   NaN  z              1225.0
3  40.0  25.0  x              2025.0
4   NaN  35.0  y                 NaN
5  60.0  45.0  z              4225.0
6  70.0   NaN  x              5625.0
7  70.0  65.0  y              5625.0
```


```python
# with pipe()
df['A_add_5_and_square_pipe'] = df['A'].pipe(add_5_and_square)
df
```

Output:

```
      A     B  C  A_add_5_and_square  A_add_5_and_square_pipe
0  10.0   5.0  x               225.0                    225.0
1  20.0  15.0  y               625.0                    625.0
2  30.0   NaN  z              1225.0                   1225.0
3  40.0  25.0  x              2025.0                   2025.0
4   NaN  35.0  y                 NaN                      NaN
5  60.0  45.0  z              4225.0                   4225.0
6  70.0   NaN  x              5625.0                   5625.0
7  70.0  65.0  y              5625.0                   5625.0
```


```python
# Passing arguments

def add_and_multiply(df, col, add_val, mul_val):
    return (df[col] + add_val) * mul_val

df['B_pipe_passing_arg'] = df.pipe(
    add_and_multiply, col='B', add_val=2, mul_val=5
)
df
```

Output:

```
      A     B  C  A_add_5_and_square  A_add_5_and_square_pipe  B_pipe_passing_arg
0  10.0   5.0  x               225.0                    225.0                35.0
1  20.0  15.0  y               625.0                    625.0                85.0
2  30.0   NaN  z              1225.0                   1225.0                 NaN
3  40.0  25.0  x              2025.0                   2025.0               135.0
4   NaN  35.0  y                 NaN                      NaN               185.0
5  60.0  45.0  z              4225.0                   4225.0               235.0
6  70.0   NaN  x              5625.0                   5625.0                 NaN
7  70.0  65.0  y              5625.0                   5625.0               335.0
```
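
`.pipe()` passes the DataFrame as the first positional argument. When a function expects the data under a different parameter, pandas also accepts a `(callable, data_keyword)` tuple. The sketch below uses a hypothetical `add_and_multiply_kw`, whose data parameter is named `data` and comes last, on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical variant for illustration: the DataFrame is NOT the first parameter.
def add_and_multiply_kw(col, add_val, mul_val, data):
    return (data[col] + add_val) * mul_val

frame = pd.DataFrame({"B": [5, 15, np.nan]})

# The tuple tells .pipe() to pass the DataFrame as the keyword argument `data`.
result = frame.pipe(
    (add_and_multiply_kw, "data"), col="B", add_val=2, mul_val=5
)

print(result.tolist())
```

This keeps third-party or legacy functions usable in a chain without writing a wrapper lambda.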


---

# B. Custom Column Transformations

```python
df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

def normalize(col):
    return (col - col.mean()) / col.std()

df.assign(A_norm=lambda d: d["A"].pipe(normalize))
```

Output:

```
   A    A_norm
0  1 -1.264911
1  2 -0.632456
2  3  0.000000
3  4  0.632456
4  5  1.264911
```


### 👌 Let's break down what `assign()` does in pandas and why it's so useful in method chains.

#### 🔧 What `.assign()` Does

The **`.assign()`** method in pandas is used to **add new columns** (or update existing ones) in a **DataFrame**.

* It **returns a new DataFrame** with the new/modified columns (**the original DataFrame is left unchanged**).
* You can pass in **keyword arguments** where:

  * The **key** = new column name
  * The **value** = expression, Series, scalar, or lambda


#### ✅ Syntax

```python
df.assign(new_col=value, another_col=expression, ...)
```

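Apart from returning a new DataFrame rather than modifying `df` in place, this has the same effect as:

```python
df["new_col"] = value
```

The `.assign()` form is designed for **method chaining**.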
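
A quick sketch of that behaviour on a toy frame (the variable and column names are illustrative). It adds one column from a scalar, overwrites an existing column with a lambda, and leaves the source DataFrame alone:

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3]})

# Keyword name becomes the column name; a scalar is broadcast to every row.
updated = original.assign(flag=True, A=lambda d: d["A"] * 10)

print(list(original.columns))  # ['A'] : the source frame is untouched
print(list(updated.columns))   # ['A', 'flag']
print(updated["A"].tolist())   # [10, 20, 30]
```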

#### 🔢 Example 1: Simple Assignment

```python
df = pd.DataFrame({"A": [1, 2, 3]})
df2 = df.assign(B=df["A"] * 2)

print(df2)
```

Output:

```
   A  B
0  1  2
1  2  4
2  3  6
```


#### 🔢 Example 2: Using Lambda

If the value is a **callable** such as a lambda function, `.assign()` calls it with the whole DataFrame as its argument:

```python
df2 = df.assign(B=lambda d: d["A"] * 2)
```

This gives **the same result as Example 1**, but the lambda style is preferred in chains because the function receives the DataFrame as it exists at that point in the chain, not the original variable.
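
A small sketch of why that matters, using a toy frame named `base` (an assumption for this example). A column created earlier in the chain is used in the next step. The lambda sees it, whereas referring to the original variable fails because `base` has no column `B`:

```python
import pandas as pd

base = pd.DataFrame({"A": [1, 2, 3]})

# Each lambda receives the DataFrame as it exists at that point in the chain.
chained = (
    base
    .assign(B=lambda d: d["A"] * 2)
    .assign(C=lambda d: d["B"] + 1)   # `d` already contains column B
)

# Referring to the original variable instead does not see the new column.
try:
    base.assign(B=lambda d: d["A"] * 2).assign(C=base["B"] + 1)
    stale_reference_failed = False
except KeyError:
    stale_reference_failed = True

print(chained["C"].tolist())     # [3, 5, 7]
print(stale_reference_failed)    # True
```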

#### 🔢 Example 3: Multiple Columns at Once

```python
df2 = df.assign(
    B=lambda d: d["A"] * 2,
    C=lambda d: d["A"] + 10
)
```
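
In current pandas versions the keyword arguments are evaluated in order, so a later column can build on one defined earlier in the same call. A brief sketch, with illustrative names:

```python
import pandas as pd

df_small = pd.DataFrame({"A": [1, 2, 3]})

result = df_small.assign(
    B=lambda d: d["A"] * 2,
    C=lambda d: d["B"] + 10,   # uses B, created one line above
)

print(result["C"].tolist())  # [12, 14, 16]
```

This only works with callables: a plain expression such as `df_small["B"] + 10` is evaluated before `.assign()` runs, when `B` does not exist yet.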

### ⚡ Back to our Code

```python
df.assign(A_norm=lambda d: d["A"].pipe(normalize))
```

Here's what happens step by step:

1. `lambda d: ...` → receives the **whole DataFrame `df`** as `d`.
2. Inside, `d["A"].pipe(normalize)` → applies your custom `normalize` function to column **A**.
3. `.assign(A_norm=...)` → creates a new column **`A_norm`** with the normalized values.

So the result is:

```
   A    A_norm
0  1 -1.264911
1  2 -0.632456
2  3  0.000000
3  4  0.632456
4  5  1.264911
```


✅ **Key takeaway**:
`.assign()` is basically **"add a new column inline"**, with the bonus that a **lambda** lets each transformation in a chain see the current DataFrame.


---

# C. Inline Filtering with lambda

```python
(df
 .pipe(lambda d: d[d["A"] > 2])        # filter rows
 .pipe(lambda d: d.assign(A2=d["A"]**2))  # add squared column
)
```

Output:

```
   A  A2
2  3   9
3  4  16
4  5  25
```


### 👌 Let's break this chained code down step by step:

```python
(df
 .pipe(lambda d: d[d["A"] > 2])          # Step 1: filter rows
 .pipe(lambda d: d.assign(A2=d["A"]**2)) # Step 2: add squared column
)
```

#### 🔎 Step-by-Step Explanation

##### Step 1:

```python
.pipe(lambda d: d[d["A"] > 2])
```

* `.pipe()` passes the **DataFrame `df`** to the lambda function.
* Inside the lambda: `d[d["A"] > 2]` filters the rows where column `A` is greater than 2.

👉 Result after Step 1 (assuming `df = pd.DataFrame({"A":[1,2,3,4,5]})`):

```
   A
2  3
3  4
4  5
```

##### Step 2:

```python
.pipe(lambda d: d.assign(A2=d["A"]**2))
```

* Takes the filtered DataFrame as input (`d`).
* `.assign(A2=d["A"]**2)` adds a new column called **A2**.
* `d["A"]**2` squares each value of column `A`.

👉 Final Result:

```
   A  A2
2  3   9
3  4  16
4  5  25
```


#### ✨ Why use `.pipe()` here?

Without `.pipe()`, a common way to write this uses an intermediate variable and separate statements:

```python
df2 = df[df["A"] > 2]      # filter
df2 = df2.assign(A2=df2["A"]**2)  # add column
```

With `.pipe()`, you **chain transformations cleanly** into one flow.
This is super helpful in **data pipelines** where you want readability and reproducibility.


⚡ So in short:

1. **Filter rows** → keep only `A > 2`.
2. **Add new column** `A2` → squares of `A`.
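
If the same filter or transformation shows up in several places, the lambdas can be swapped for small named functions. The sketch below does this with two hypothetical helpers, `keep_above` and `add_square`, which are not part of pandas:

```python
import pandas as pd

# Hypothetical reusable steps; names and parameters are illustrative.
def keep_above(frame, col, threshold):
    """Keep rows where `col` is strictly greater than `threshold`."""
    return frame[frame[col] > threshold]

def add_square(frame, col):
    """Add a column named '<col>2' holding the squared values of `col`."""
    return frame.assign(**{f"{col}2": frame[col] ** 2})

numbers = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

result = (
    numbers
    .pipe(keep_above, "A", 2)
    .pipe(add_square, "A")
)

print(result)
```

Named steps can be tested on their own and reused with other columns or thresholds, which is harder to do with inline lambdas.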



---

# D. Multiple pipes in a chain

```python
# Assumes df = pd.DataFrame({"A": [1, 2, 3, 4, 5]}) from section B
def square(df):
    return df ** 2

def add_100(df):
    return df + 100

(df
 .pipe(square)
 .pipe(add_100)
)
```

Output:

```
     A
0  101
1  104
2  109
3  116
4  125
```


---

## ⚡ `.pipe()` vs. Built-in Methods

- Use `.pipe()` for custom functions or transformations not built into pandas

- Stick with built-in methods (`.sum()`, `.mean()`, `.clip()`) when available

- Think of `.pipe()` as a bridge between pandas and your own custom logic.
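
A short sketch of mixing the two in one chain. Built-in methods handle what pandas already provides, and `.pipe()` slots in a custom step. The `center` helper and the sample values are made up for illustration:

```python
import pandas as pd

# Made-up custom step: subtract the mean from every value.
def center(series):
    return series - series.mean()

values = pd.Series([1, 2, 3, 4, 100])

result = (
    values
    .clip(upper=10)   # built-in: cap values at 10
    .pipe(center)     # custom logic via .pipe()
    .round(1)         # built-in again
)

print(result.tolist())  # [-3.0, -2.0, -1.0, 0.0, 6.0]
```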

---

# E. 🏗️ Building a Mini ETL Pipeline with `.pipe()`

Here's a toy example showing `.pipe()` in action in a data-cleaning workflow:

```python
df = pd.DataFrame({"A": [0, -1, np.nan, 1, 2, 3, 4, 5]})

def drop_missing(df):
    return df.dropna()

def filter_positive(df):
    return df[df["A"] > 0]

def log_transform(df):
    # Return a new DataFrame instead of mutating the filtered input in place
    return df.assign(log_value=np.log(df["A"]))

(df
 .pipe(drop_missing)
 .pipe(filter_positive)
 .pipe(log_transform)
)
```

Output:

```
     A  log_value
3  1.0   0.000000
4  2.0   0.693147
5  3.0   1.098612
6  4.0   1.386294
7  5.0   1.609438
```
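
Because every step takes a DataFrame and returns one, it is easy to slot in a pass-through step for debugging. The sketch below adds a hypothetical `log_shape` helper that prints the shape and returns the data unchanged. The helper name and the simplified inline pipeline are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical debugging step: report the shape, return the data unchanged.
def log_shape(frame, label):
    print(f"{label}: {frame.shape}")
    return frame

raw = pd.DataFrame({"A": [0, -1, np.nan, 1, 2, 3, 4, 5]})

cleaned = (
    raw
    .pipe(log_shape, "raw")            # prints: raw: (8, 1)
    .dropna()
    .pipe(log_shape, "after dropna")   # prints: after dropna: (7, 1)
    .loc[lambda d: d["A"] > 0]
    .pipe(log_shape, "after filter")   # prints: after filter: (5, 1)
    .assign(log_value=lambda d: np.log(d["A"]))
)
```

The logging calls can be removed later without touching the other steps, since they do not alter the data flowing through the chain.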


---