# How method chaining in pandas can be super effective

Method chaining is a programming style in which multiple method calls are invoked sequentially, each call acting on an object and returning it (or a new object of the same kind) so that the next call can continue from there.

It eliminates the cognitive burden of naming variables at each intermediate step. 

The fluent interface, a style of designing object-oriented APIs, relies on method cascading (aka method chaining).
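To make the idea concrete, here is a minimal sketch of a fluent interface in plain Python. The `Cake` class and its methods are hypothetical and exist only for illustration; the point is that every method returns the object, which is what allows the calls to be strung together. pandas methods generally return a new object rather than the same one, but the chaining mechanics are the same.

```python
class Cake:
    """Toy object (hypothetical) whose methods return the object so calls can be chained."""

    def __init__(self, ingredients):
        self.ingredients = list(ingredients)
        self.steps = []  # record of the actions applied so far

    def mix(self):
        self.steps.append("mix")
        return self  # returning the object is what makes chaining possible

    def bake(self, time):
        self.steps.append(f"bake for {time} min")
        return self

    def slice(self, pieces):
        self.steps.append(f"slice into {pieces}")
        return self


# Each call works on the object returned by the previous call.
cake = Cake(["flour", "eggs", "sugar"]).mix().bake(time=30).slice(pieces=6)
print(cake.steps)  # ['mix', 'bake for 30 min', 'slice into 6']
```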

Method chaining substantially increases the readability of the code. 

It is a top-down approach: the code reads from top to bottom in the order the operations are applied.


## Readable and understandable code - chaining

I will always try to avoid the following two ways of writing code:

**1) NESTED OPTION**

```
eat(
    slice(
        bake(
            put(
                pour(
                    mix(ingredients),
                    into=baking_form),
                into=oven),
            time=30),
        pieces=6),
    1)
```

This first option is the “nested” option: functions are nested within one
another. Historically, this has been the traditional way of composing
function calls; however, it becomes extremely difficult to read what exactly
the code is doing, because it has to be read from the inside out, and it also
becomes easier to make mistakes when updating your code. Although it does not
violate the DRY principle, it definitely violates the basic principle of
readability and clarity, which makes communicating your analysis more
difficult. To make things more readable, people often move to the following approach…

**2) MULTIPLE OBJECT OPTION**

```
it = mix(ingredients)
it = pour(it, into=baking_form)
it = put(it, into=oven)
it = bake(it, time=30)
it = slice(it, pieces=6)
it = eat(it, 1)
```

This second option makes the data wrangling steps more explicit and obvious
but definitely violates the DRY principle. By sequencing multiple functions
in this way you are likely saving multiple intermediate outputs that are not
very informative to you or others; the only reason you save them is to feed
them into the next function and eventually get the final output you desire.
This inevitably creates unnecessary copies and wreaks havoc on properly
managing your objects… basically it results in a global environment
charlie foxtrot!

**I WILL INSIST ON METHOD CHAINING (if and when Python and pandas allow it)**

To provide the same readability (or even better), we can chain methods to string
these operations together without unnecessary object creation…

The point of the chain is to help you write code in a way that is easier to read and understand.
It is a powerful tool for clearly expressing a sequence of multiple operations.

```
(ingredients
    .mix()
    .pour(into=baking_form)
    .put(into=oven)
    .bake(time=30)
    .slice(pieces=6)
    .eat(1)
)
```
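The baking analogy translates directly to pandas. The sketch below uses a small, made-up dataframe (the `wells` data and the `drilled_below_seabed` column are hypothetical, chosen only for illustration) and chains four standard pandas methods without creating any intermediate variables.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the chained style.
wells = pd.DataFrame(
    {
        "well": ["A-1", "A-2", "B-1", "B-2"],
        "water_depth": [120.0, 1350.0, 980.0, 1500.0],
        "total_depth": [3000.0, 4200.0, 2500.0, 5100.0],
    }
)

deep_wells = (
    wells
    # add a derived column
    .assign(drilled_below_seabed=lambda d: d["total_depth"] - d["water_depth"])
    # keep only deep-water wells
    .query("water_depth >= 1000")
    # deepest wells first
    .sort_values("total_depth", ascending=False)
    # renumber the rows after filtering and sorting
    .reset_index(drop=True)
)
print(deep_wells)
```

Each line states one step, and the steps read in the order they are applied.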



```
import pandas as pd
import janitor
```

```
df = pd.read_csv("https://factpages.npd.no/ReportServer_npdpublic?/FactPages/TableView/wellbore_exploration_all&rs:Command=Render&rc:Toolbar=false&rc:Parameters=f&rs:Format=CSV&Top100=false&IpAddress=82.102.27.246&CultureCode=en")
```

```
# initial column name cleaning step
df.columns = (
    df.columns
        .str.replace("wlb", "")
        .str.replace("fld", "")
        .str.replace("fcl", "")
)
df = df.clean_names(case_type="snake")
```

```
df.head()
```

| | wellbore_name | well | drilling_operator | production_licence | purpose | status | content | well_type | sub_sea | entry_date | ... | npdid_wellbore | dsc_npdid_discovery | npdid_field | npdid_facility_drilling | npdid_wellbore_reclass | prl_npdid_production_licence | npdid_site_survey | date_updated | date_updated_max | datesync_npd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/2-1 | 1/2-1 | Phillips Petroleum Norsk AS | 143 | WILDCAT | P&A | OIL | EXPLORATION | NO | 20.03.1989 | ... | 1382 | 43814.0 | 3437650.0 | 296245.0 | 0 | 21956.0 | | 03.10.2019 | 03.10.2019 | 19.11.2019 |
| 1 | 1/2-2 | 1/2-2 | Paladin Resources Norge AS | 143 CS | WILDCAT | P&A | OIL SHOWS | EXPLORATION | NO | 14.12.2005 | ... | 5192 | | | 278245.0 | 0 | 2424919.0 | | 03.10.2019 | 03.10.2019 | 19.11.2019 |
| 2 | 1/3-1 | 1/3-1 | A/S Norske Shell | 011 | WILDCAT | P&A | GAS | EXPLORATION | NO | 06.07.1968 | ... | 154 | 43820.0 | | 288604.0 | 0 | 20844.0 | | 03.10.2019 | 03.10.2019 | 19.11.2019 |
| 3 | 1/3-2 | 1/3-2 | A/S Norske Shell | 011 | WILDCAT | P&A | DRY | EXPLORATION | NO | 14.05.1969 | ... | 165 | | | 288847.0 | 0 | 20844.0 | | 03.10.2019 | 03.10.2019 | 19.11.2019 |
| 4 | 1/3-3 | 1/3-3 | Elf Petroleum Norge AS | 065 | WILDCAT | P&A | OIL | EXPLORATION | NO | 22.08.1982 | ... | 87 | 43826.0 | 1028599.0 | 288334.0 | 0 | 21316.0 | | 03.10.2019 | 03.10.2019 | 19.11.2019 |


pandas doesn’t offer a method for every operation you might want to use in a chain. To make up for it, pandas introduced the `pipe` method in version 0.16.2. `pipe` lets you use user-defined functions in method chains.
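As a minimal sketch of how `pipe` works: it calls the function you give it with the dataframe as the first argument, followed by any extra positional or keyword arguments. The helper functions and the toy data below are hypothetical and only serve to illustrate the call pattern.

```python
import pandas as pd


def keep_at_least(df, column, threshold):
    """Hypothetical helper: keep rows where `column` is at least `threshold`."""
    return df[df[column] >= threshold]


def add_ratio(df, numerator, denominator, name):
    """Hypothetical helper: add a column holding numerator / denominator."""
    return df.assign(**{name: df[numerator] / df[denominator]})


# Made-up data for illustration only.
toy = pd.DataFrame(
    {"water_depth": [100.0, 1000.0, 2000.0], "total_depth": [2000.0, 4000.0, 5000.0]}
)

result = (
    toy
    # equivalent to keep_at_least(toy, "water_depth", 1000)
    .pipe(keep_at_least, "water_depth", 1000)
    # extra keyword arguments are forwarded to the function as well
    .pipe(add_ratio, "water_depth", "total_depth", name="water_share")
)
print(result)
```

Because each helper takes a dataframe and returns a dataframe, it slots into the chain like any built-in method.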

With the introduction of `pipe`, you can write almost anything in a method chain, which raises the question: how much chaining is too much? This is an entirely subjective question and must be left to the discretion of the programmer. Most people find the sweet spot to be around 7 or 8 methods in a single chain. I don’t use any hard limits on the number of methods in a chain. Instead, I try to represent a single coherent thought in a single method chain.

Critics of method chaining argue that it improves code readability at the cost of making debugging tricky, which is true. Imagine debugging a chain that’s ten methods long a month after you wrote it. The data frame structure or the column names have changed since then, and now your chain starts throwing errors. Although you can easily find which method call is breaking the code, you can no longer step through the chain and see the changes each call makes to the data frame. This needs to be addressed before you start using long method chains in production or in notebooks.

### How to track changes in the dataframe shape along the chain

Two things to note in the `csnap` helper below are the `fn` argument, which accepts a function such as a lambda, and the call to `display`. The lambda lends flexibility, and `display` renders data frames and plots nicely in JupyterLab or a notebook.

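```
# display is available by default in notebooks; the explicit import also covers plain IPython use
from IPython.display import display


def csnap(df, fn=lambda x: x.shape, msg=None):
    """Custom helper function to print things in a method chain.

    Returns the df unchanged so the chain can continue.
    """
    if msg:
        print(msg)
    display(fn(df))
    return df
```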

```
(df
    .pipe(csnap) # check the dataframe shape
    .filter(items = ["wellbore_name", "well", "drilling_operator", "water_depth", "total_depth"]) # select only 5 columns
    .pipe(csnap) # check the dataframe shape
    .query("water_depth >= 1000")
    .pipe(csnap) # check the dataframe shape
)
```

(1921, 87)

(1921, 5)

(30, 5)

| | wellbore_name | well | drilling_operator | water_depth | total_depth |
|---|---|---|---|---|---|
| 1393 | 6302/6-1 | 6302/6-1 | Statoil ASA (old) | 1261.0 | 4234.0 |
| 1394 | 6304/3-1 | 6304/3-1 | A/S Norske Shell | 1235.3 | 3642.0 |
| 1396 | 6305/4-1 | 6305/4-1 | Norsk Hydro Produksjon AS | 1002.0 | 2975.0 |
| 1397 | 6305/4-2 S | 6305/4-2 | A/S Norske Shell | 1086.0 | 2985.0 |
| 1413 | 6403/6-1 | 6403/6-1 | Statoil ASA (old) | 1721.0 | 4120.0 |
| 1414 | 6403/10-1 | 6403/10-1 | Norsk Hydro Produksjon AS | 1717.0 | 3400.0 |
| 1415 | 6404/11-1 | 6404/11-1 | BP Amoco Norge AS | 1495.0 | 3650.0 |
| 1416 | 6405/7-1 | 6405/7-1 | Statoil ASA (old) | 1206.0 | 4300.0 |
| 1550 | 6504/5-1 S | 6504/5-1 | Eni Norge AS | 1190.0 | 4193.0 |
| 1675 | 6603/5-1 S | 6603/5-1 | A/S Norske Shell | 1452.0 | 5254.0 |


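```
import numpy as np

# Assumption: `wine` is a dataframe loaded elsewhere that has at least the
# columns "alcohol", "color_intensity" and "hue"; it is not defined in this post.
(
    wine.pipe(csnap)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .pipe(csnap)
    .query("alcohol > 14")
    .pipe(csnap)
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
    .pipe(csnap, lambda x: x.sample(5))  # show 5 random rows instead of the shape
)
```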