# What's pipe?

__Pipe__ is a method of `pandas.DataFrame` that _passes the dataframe to an existing function from a package or to a self-defined function_. It is one of the methods that enable method chaining. With __pipe__, multiple processing steps can be combined in a chain without nesting function calls. Let’s look at an example to show its benefits.

The pandas documentation uses three functions, `h(df)`, `g(df, arg1=a)` and `f(df, arg2=b, arg3=c)`, applied to `df` in this order. Usually the three calls are nested inside one another, which makes the functions and their arguments hard to read at first glance. With method chaining, the relationship among the operations is shown in a clearer format.


The appropriate method for applying a function depends on whether the function expects to operate on the whole table, on rows or columns, or on individual elements.

* __.pipe():__ Table-wise function application
* __.apply():__ Row-wise or column-wise function application
* __.applymap():__ Element-wise function application (pandas 2.1 renamed this method to `DataFrame.map` and deprecated `applymap`)

In [None]:
# Nested functions
f(g(h(df), arg1=a), arg2=b, arg3=c)

# Method chaining
(df
 .pipe(h)
 .pipe(g, arg1=a)
 .pipe(f, arg2=b, arg3=c)
)
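To make the comparison above concrete, here is a minimal runnable sketch. The names `h`, `g` and `f` follow the pandas documentation, but their bodies, the column names and the argument values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for h, g and f: only the names and signatures
# come from the text above, the bodies are made up for this sketch.
def h(df):
    # Drop rows with missing values
    return df.dropna()

def g(df, arg1):
    # Add a column holding x multiplied by arg1
    return df.assign(scaled=df["x"] * arg1)

def f(df, arg2, arg3):
    # Keep rows whose scaled value lies between arg2 and arg3
    return df[(df["scaled"] >= arg2) & (df["scaled"] <= arg3)]

df = pd.DataFrame({"x": [1.0, None, 3.0, 4.0]})

# Nested calls read inside-out
nested = f(g(h(df), arg1=2), arg2=3, arg3=7)

# Chained calls read top-to-bottom in execution order
chained = (df
           .pipe(h)
           .pipe(g, arg1=2)
           .pipe(f, arg2=3, arg3=7))

print(nested.equals(chained))
```

Both forms run the same three calls in the same order, so the two results should be identical; only the readability differs.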

Let's use online shopping as an example to show different ways of combining multiple steps in a row. A customer completes a transaction through 5 functions: `add_to_cart`, `checkout`, `shipping`, `billing`, and `place_order`. The following two examples show common ways of calling multiple functions consecutively.

The first approach relies heavily on nested calls; without careful formatting it is hard to read, and it is hard to tell which argument belongs to which function. In my experience the second approach is even harder to work with, because every intermediate result needs a meaningful name or it becomes unrecognizable later. The intermediate results are also often used only once and never again in the rest of the process. That’s why I wouldn’t choose this approach if I have alternatives.

In [None]:
# 1. Nested functions
place_order(
      billing(
        shipping(
            checkout(add_to_cart(items)),
            "address"),
        "credit_card"))

# 2. Save all the intermediate results
cart = add_to_cart(items)
new_order = checkout(cart)
shipping_info = shipping(new_order, "address")
billing_info = billing(shipping_info, "credit_card")
completed_order = place_order(billing_info)

For the same process, method chaining with __pipe__ keeps the code readable and makes the arguments of each function call easy to identify. It clearly shows the execution order and the arguments without any nesting.

In [None]:
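# Method chaining (the outer parentheses let the chain span several lines;
# items must be a pandas object, since pipe is a pandas method)
(items
 .pipe(add_to_cart)
 .pipe(checkout)
 .pipe(shipping, "address")
 .pipe(billing, "credit_card")
 .pipe(place_order)
)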

This situation is common in data science as there are numerous processes involved in data manipulation. Oftentimes, the intermediate results are not important since the goal of data manipulation is to get the final data clean.
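The shopping functions above are never defined, so here is one way the chained version could be made runnable. Everything inside the five functions, as well as the `items` dataframe and its columns, is a hypothetical placeholder; only the function names and the call order come from the text.

```python
import pandas as pd

# Hypothetical implementations: each step just adds a column so that the
# effect of every stage stays visible in the final result.
def add_to_cart(items):
    return items.assign(in_cart=True)

def checkout(cart):
    return cart.assign(total=cart["price"] * cart["qty"])

def shipping(order, address):
    return order.assign(ship_to=address)

def billing(order, payment):
    return order.assign(paid_with=payment)

def place_order(order):
    return order.assign(status="placed")

# Made-up input data
items = pd.DataFrame({
    "item": ["pen", "book"],
    "price": [2.0, 10.0],
    "qty": [3, 1],
})

# The outer parentheses allow one .pipe() call per line
completed_order = (
    items
    .pipe(add_to_cart)
    .pipe(checkout)
    .pipe(shipping, "address")
    .pipe(billing, "credit_card")
    .pipe(place_order)
)
print(completed_order)
```

No intermediate variable needs a name here, and each extra argument sits next to the function it belongs to.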

# Examples

Let’s look into several examples to see how __pipe__ really works. Here, a dataframe containing student name, subject and score is randomly generated.

In [8]:
import pandas as pd
import numpy as np

# Set seed
np.random.seed(520)

# Create a dataframe
df = pd.DataFrame({
    'name': ['Ted'] * 3 + ['Lisa'] * 3 + ['Sam'] * 3,
    'subject': ['math', 'physics', 'history'] * 3,
    'score': np.random.randint(60, 100, 9)
})
df

Unnamed: 0,name,subject,score
0,Ted,math,87
1,Ted,physics,80
2,Ted,history,75
3,Lisa,math,79
4,Lisa,physics,78
5,Lisa,history,77
6,Sam,math,85
7,Sam,physics,61
8,Sam,history,88


# To get rank by subject in a line

## Return pandas dataframe


The goal is to rank every score within its subject in one line and append the rank to the original dataframe. A function, `get_subject_rank`, is created for this task. When the function is passed to __pipe__, the returned dataframe contains the rank as an extra column.

In [12]:
def get_subject_rank(input_df):
    # Avoid overwriting the original dataframe
    input_df = input_df.copy()
    input_df['subject_rank'] = (input_df
                                .groupby(['subject'])['score']
                                .rank(ascending=False))
    return input_df

# pipe method
df.pipe(get_subject_rank)

Unnamed: 0,name,subject,score,subject_rank
0,Ted,math,87,1.0
1,Ted,physics,80,1.0
2,Ted,history,75,3.0
3,Lisa,math,79,3.0
4,Lisa,physics,78,2.0
5,Lisa,history,77,2.0
6,Sam,math,85,2.0
7,Sam,physics,61,3.0
8,Sam,history,88,1.0
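Because the function returns a dataframe, the chain can continue with ordinary pandas methods. The sketch below is one possible continuation, not part of the original notebook: the scores are hardcoded from the output above so that it does not depend on the random seed, and the name `top_per_subject` is made up for illustration.

```python
import pandas as pd

# Scores copied from the dataframe printed earlier
df = pd.DataFrame({
    "name": ["Ted"] * 3 + ["Lisa"] * 3 + ["Sam"] * 3,
    "subject": ["math", "physics", "history"] * 3,
    "score": [87, 80, 75, 79, 78, 77, 85, 61, 88],
})

def get_subject_rank(input_df):
    # Work on a copy so the caller's dataframe is left untouched
    input_df = input_df.copy()
    input_df["subject_rank"] = (
        input_df.groupby("subject")["score"].rank(ascending=False)
    )
    return input_df

# pipe mixes freely with built-in methods in the same chain
top_per_subject = (
    df
    .pipe(get_subject_rank)
    .query("subject_rank == 1")
    .sort_values("subject")
    .reset_index(drop=True)
)
print(top_per_subject)
```

With the scores shown above, this keeps one row per subject: the student ranked first in history, math and physics.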


## Return pandas series

A function called through __pipe__ can return any kind of output, not only a dataframe. In the following example, the function returns a pandas series when `df_or_not=False`. When a function takes more than one argument, the additional arguments must be passed in the __pipe__ call, as the example also shows.

In [11]:
def get_subject_rank(input_df, df_or_not=True):
    # Avoid overwriting the original dataframe
    input_df = input_df.copy()
    if df_or_not is True:
        input_df['subject_rank'] = (input_df
                                    .groupby(['subject'])['score']
                                    .rank(ascending=False))
        return input_df
    else:
        output_series = (input_df
                         .groupby(['subject'])['score']
                         .rank(ascending=False))
        return output_series

# pipe method - return arbitrary output
df.pipe(get_subject_rank, df_or_not=False)

0    1.0
1    1.0
2    3.0
3    3.0
4    2.0
5    2.0
6    2.0
7    3.0
8    1.0
Name: score, dtype: float64
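The return value does not have to be a pandas object at all. As a small sketch, the hypothetical helper `summarize_scores` below (not part of the original notebook) returns a plain dictionary; the scores are again hardcoded from the output shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ted"] * 3 + ["Lisa"] * 3 + ["Sam"] * 3,
    "subject": ["math", "physics", "history"] * 3,
    "score": [87, 80, 75, 79, 78, 77, 85, 61, 88],
})

def summarize_scores(input_df, column="score"):
    # Hypothetical helper: reduces the dataframe to a plain Python dict
    return {
        "mean": input_df[column].mean(),
        "max": input_df[column].max(),
        "n_rows": len(input_df),
    }

# pipe simply hands back whatever the function returns
summary = df.pipe(summarize_scores)
print(type(summary).__name__, summary["max"], summary["n_rows"])
```

One consequence is that a step like this has to come last in a chain, since a dictionary has no pandas methods to continue with.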

# Data is not the first argument


When a function is called through __pipe__, the dataframe/series that __pipe__ is called on is passed as the function's first argument by default. Here is an example of a function that modifies scores, `add_score`. Its first argument, `input_df`, receives `df`, so there is no need to specify `input_df` in the __pipe__ call.

In [13]:
def add_score(input_df, added_score):
    # Avoid overwriting the original dataframe
    input_df = input_df.copy()
    input_df = input_df.assign(new_score=lambda x: x.score+added_score)
    return input_df

df.pipe(add_score, 2)

Unnamed: 0,name,subject,score,subject_rank,new_score
0,Ted,math,87,1.0,89
1,Ted,physics,80,1.0,82
2,Ted,history,75,3.0,77
3,Lisa,math,79,3.0,81
4,Lisa,physics,78,2.0,80
5,Lisa,history,77,2.0,79
6,Sam,math,85,2.0,87
7,Sam,physics,61,3.0,63
8,Sam,history,88,1.0,90


Now the two arguments of `add_score` are swapped, so the dataframe is the second argument of the function. In this case a tuple, `(function, "name of the data argument")`, is passed to __pipe__ to indicate which argument should receive the data.

In [14]:
def add_score(added_score, input_df):
    # Avoid overwriting the original dataframe
    input_df = input_df.copy()
    input_df = input_df.assign(new_score=lambda x: x.score+added_score)
    return input_df

df.pipe((add_score, "input_df"), 2)

Unnamed: 0,name,subject,score,subject_rank,new_score
0,Ted,math,87,1.0,89
1,Ted,physics,80,1.0,82
2,Ted,history,75,3.0,77
3,Lisa,math,79,3.0,81
4,Lisa,physics,78,2.0,80
5,Lisa,history,77,2.0,79
6,Sam,math,85,2.0,87
7,Sam,physics,61,3.0,63
8,Sam,history,88,1.0,90
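To see what the tuple form does, it can help to compare it with two equivalent spellings. The sketch below uses a shortened three-row dataframe (an assumption made for brevity) and shows that the tuple form behaves like passing the dataframe as the keyword argument `input_df`.

```python
import pandas as pd

# Shortened example data, assumed for brevity
df = pd.DataFrame({"score": [87, 80, 75]})

def add_score(added_score, input_df):
    # The data is the second parameter here
    return input_df.assign(new_score=input_df["score"] + added_score)

# Tuple form: pipe passes df as the keyword argument "input_df"
via_tuple = df.pipe((add_score, "input_df"), 2)

# Equivalent: wrap the call in a lambda that puts df in the right slot
via_lambda = df.pipe(lambda d: add_score(2, d))

# Equivalent: the plain function call
direct = add_score(2, input_df=df)

print(via_tuple.equals(via_lambda) and via_tuple.equals(direct))
```

The tuple form keeps the chain free of lambdas, which is mostly a readability choice.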


# Apply functions in Python pandas – apply(), applymap(), pipe()

To apply our own function or a function from another library, pandas provides three important methods: __pipe()__, __apply()__ and __applymap()__. These methods are discussed below.

* Table wise Function Application: __pipe()__
* Index wise i.e. Row or Column Wise Function Application: __apply()__
* Element wise Function Application: __applymap()__

## Table wise Function Application: pipe()

The __pipe()__ function applies a custom operation to the entire dataframe. In the example below we use __pipe()__ to add the value 2 to every element of the dataframe.

In [21]:
import pandas as pd
import numpy as np
import math
 
# User-defined function: adds its two arguments (here a dataframe and a scalar)
def adder(adder1, adder2):
    return adder1 + adder2

# Create a dictionary of Series
d = {'Score_Math': pd.Series([66,57,75,44,31,67,85,33,42,62,51,47]),
     'Score_Science': pd.Series([89,87,67,55,47,72,76,79,44,92,93,69])}
 
df = pd.DataFrame(d)
print(f'Original DataFrame \n {df}\n')
print(f'DataFrame with Value 2 Added \n {df.pipe(adder,2)}')

Original DataFrame 
     Score_Math  Score_Science
0           66             89
1           57             87
2           75             67
3           44             55
4           31             47
5           67             72
6           85             76
7           33             79
8           42             44
9           62             92
10          51             93
11          47             69

DataFrame with Value 2 Added 
     Score_Math  Score_Science
0           68             91
1           59             89
2           77             69
3           46             57
4           33             49
5           69             74
6           87             78
7           35             81
8           44             46
9           64             94
10          53             95
11          49             71


## Index Wise - Row or Column Wise Function Application: apply()

The __apply()__ function applies a custom operation either row-wise or column-wise. In the examples below we use __apply()__ to find the mean of each row and the mean of each column.

### Row wise Function in python pandas : Apply()

With `axis=1`, __apply()__ passes each row to the function, so the result is the mean of each row.

In [25]:
print('Row wise Mean')
df.apply(np.mean, axis=1)

Row wise Mean


0     77.5
1     72.0
2     71.0
3     49.5
4     39.0
5     69.5
6     80.5
7     56.0
8     43.0
9     77.0
10    72.0
11    58.0
dtype: float64

### Column wise Function in python pandas : Apply()

With `axis=0`, __apply()__ passes each column to the function, so the result is the mean of each column.

In [26]:
print('Column Wise Mean')
df.apply(np.mean,axis=0)

Column Wise Mean


Score_Math       55.0
Score_Science    72.5
dtype: float64
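For a common reduction such as the mean, the built-in `DataFrame.mean` should give the same numbers as `apply`, and `apply` becomes more useful once the function is custom. The sketch below assumes the same score data as above; `score_range` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Score_Math": [66, 57, 75, 44, 31, 67, 85, 33, 42, 62, 51, 47],
    "Score_Science": [89, 87, 67, 55, 47, 72, 76, 79, 44, 92, 93, 69],
})

# apply with np.mean versus the built-in reduction
col_mean_apply = df.apply(np.mean, axis=0)
col_mean_builtin = df.mean(axis=0)
print(np.allclose(col_mean_apply, col_mean_builtin))

# Hypothetical custom reducer: spread between highest and lowest value
def score_range(values):
    return values.max() - values.min()

col_range = df.apply(score_range, axis=0)  # one value per column
row_range = df.apply(score_range, axis=1)  # one value per row
print(col_range)
```

The `axis` argument names the direction the function consumes: `axis=0` hands each column to the function, `axis=1` hands it each row.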

## Element wise Function Application: applymap()


The __applymap()__ function applies the specified operation to every element of the dataframe. We will use the same dataframe to illustrate __applymap()__ by multiplying all of its elements by 2, as shown below.

In [27]:
# applymap() Function
df.applymap(lambda x:x*2)  # same result as df.pipe(lambda x:x*2)

Unnamed: 0,Score_Math,Score_Science
0,132,178
1,114,174
2,150,134
3,88,110
4,62,94
5,134,144
6,170,152
7,66,158
8,84,88
9,124,184


In [28]:
df.pipe(lambda x:x*2)

Unnamed: 0,Score_Math,Score_Science
0,132,178
1,114,174
2,150,134
3,88,110
4,62,94
5,134,144
6,170,152
7,66,158
8,84,88
9,124,184


Example 2: 

We will find the square root of every element of the dataframe with the __applymap()__ function, as shown below.

In [30]:
#applymap() Function to find the sqrt
df.applymap(lambda x:math.sqrt(x))  # same result as df.pipe(lambda x:x**.5)

Unnamed: 0,Score_Math,Score_Science
0,8.124038,9.433981
1,7.549834,9.327379
2,8.660254,8.185353
3,6.63325,7.416198
4,5.567764,6.855655
5,8.185353,8.485281
6,9.219544,8.717798
7,5.744563,8.888194
8,6.480741,6.63325
9,7.874008,9.591663


In [31]:
df.pipe(lambda x:x**.5)

Unnamed: 0,Score_Math,Score_Science
0,8.124038,9.433981
1,7.549834,9.327379
2,8.660254,8.185353
3,6.63325,7.416198
4,5.567764,6.855655
5,8.185353,8.485281
6,9.219544,8.717798
7,5.744563,8.888194
8,6.480741,6.63325
9,7.874008,9.591663
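The two methods agree in the examples above only because `x*2` and `x**.5` work on a whole dataframe as well as on a single number. A function written for scalars is where they differ. The sketch below uses a hypothetical `grade` function with an arbitrary pass mark of 50 and a shortened dataframe; the small `elementwise` wrapper is also an assumption, added because pandas 2.1 renamed `applymap` to `DataFrame.map`.

```python
import pandas as pd

# Shortened example data, assumed for brevity
df = pd.DataFrame({"Score_Math": [66, 44, 31], "Score_Science": [89, 55, 47]})

def elementwise(frame, func):
    # DataFrame.map exists from pandas 2.1 on; older versions use applymap
    if hasattr(frame, "map"):
        return frame.map(func)
    return frame.applymap(func)

def grade(x):
    # Written for one scalar at a time; the 50-point pass mark is arbitrary
    return "pass" if x >= 50 else "fail"

# Element-wise application calls grade once per cell
grades = elementwise(df, grade)
print(grades)

# pipe calls grade once with the whole dataframe, so the `if` test fails
try:
    df.pipe(grade)
    pipe_failed = False
except ValueError:
    pipe_failed = True
print(pipe_failed)
```

In short, __pipe()__ suits functions that understand a whole dataframe, while the element-wise method suits functions that only understand single values.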
