# Pandas Flavor

## The easy way to write your own "flavor" of Pandas.

Zach Sailer

## About me

* B.S. in Physics at Cal Poly
* Ph.D. in an evolutionary biophysics lab at the University of Oregon
* Open-source software contributor since 2012 (Jupyter, IPython, SciPy, Numpy, Altair, 

## Here's a little teaser...

Pandas-flavor is a backward-compatible extension API for Pandas.

In [None]:
import pandas as pd

df = pd.read_csv('data/dirty-data.csv')
df.head()

In [None]:
import re

def clean_names(df, case_type='lower'):
    """Function for cleaning column names in a pandas DataFrame.
    """
    
    def _change_case(col, case_type):
        """Change case of a column name."""
        if case_type.lower() == "upper":
            col = col.upper()
        elif case_type.lower() == "lower":
            col = col.lower()
        return col

    def _normalize(col_name):
        """Normalize common special characters."""
        result = col_name
        for search, replace in [(r"[ /:,?()\.-]", "_"), (r"['’]", "")]:
            result = re.sub(search, replace, result)
        return result

    # Should the columns be upper or lower case?
    df = df.rename(columns=lambda x: _change_case(x, case_type))

    # Normalize common special characters.
    df = df.rename(columns=_normalize)

    # Only use single underscores.
    df = df.rename(columns=lambda x: re.sub("_+", "_", x))
    
    return df

In [None]:
df = clean_names(df, case_type='lower')
df.head()

## The pandas flavor way

In [None]:
import re
import pandas_flavor as pf

@pf.register_dataframe_method
def clean_names(df, case_type='lower'):
    """Function for cleaning column names in a pandas DataFrame.
    """
    def _change_case(col: str, case_type: str) -> str:
        """Change case of a column name."""
        if case_type.lower() == "upper":
            col = col.upper()
        elif case_type.lower() == "lower":
            col = col.lower()
        return col

    def _normalize(col_name: str) -> str:
        """Normalize common special characters."""
        result = col_name
        for search, replace in [(r"[ /:,?()\.-]", "_"), (r"['’]", "")]:
            result = re.sub(search, replace, result)
        return result

    # Should the columns be upper or lower case?
    df = df.rename(columns=lambda x: _change_case(x, case_type))

    # Normalize common special characters.
    df = df.rename(columns=_normalize)

    # Only use single underscores.
    df = df.rename(columns=lambda x: re.sub("_+", "_", x))
    
    return df

In [None]:
# clean_names returns a new DataFrame, so assign the result before inspecting it.
df = df.clean_names(case_type='upper')
df.head()

Pandas-flavor enables you to easily extend the Pandas API.

## This allows you to write your own flavor of Pandas

Two ways:
* Method registration
* Accessor registration

## Part 1: Method registration

Method registration is simple with Pandas-flavor. Here's the syntax:

In [None]:
import pandas_flavor as pf

@pf.register_dataframe_method
def my_method(df, arg1, arg2):
    print(arg1, arg2)
    return df

Your method is immediately available on the DataFrame API.

In [None]:
df = pd.DataFrame({'x': [0, 0], 'y': [1, 1]})
df.my_method('hello', 'world')
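
For intuition, here is a minimal sketch of how a decorator like this could be built on top of pandas' own accessor API. It is an illustration under assumptions, not necessarily how pandas-flavor is implemented; the names `register_method_sketch` and `shout_columns` are hypothetical.

```python
from functools import wraps

import pandas as pd


def register_method_sketch(method):
    """Hypothetical stand-in for a method-registration decorator."""

    class _MethodAccessor:
        # pandas instantiates the accessor with the DataFrame it is accessed on.
        def __init__(self, pandas_obj):
            self._obj = pandas_obj

        # Calling the accessor forwards to the wrapped function,
        # passing the DataFrame as the first argument.
        @wraps(method)
        def __call__(self, *args, **kwargs):
            return method(self._obj, *args, **kwargs)

    # Register the callable accessor under the function's own name.
    pd.api.extensions.register_dataframe_accessor(method.__name__)(_MethodAccessor)
    return method


@register_method_sketch
def shout_columns(df):
    """Return a copy of df with upper-cased column names."""
    return df.rename(columns=str.upper)


result = pd.DataFrame({'x': [0, 0], 'y': [1, 1]}).shout_columns()
print(list(result.columns))
```

Because the accessor object is callable, `df.shout_columns()` reads like an ordinary DataFrame method.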

To write your own "flavor" of Pandas:

1. Collect your custom registered functions in a Python module (or package).
2. Import it.

In this example, I'll write my own "flavor" of Pandas called `my_flavor`.

In [None]:
import my_flavor

df = pd.DataFrame({'x': [0, 0], 'y': [1, 1]})

df.zach_func1()
df.zach_func2()

A useful bit of syntactic sugar that falls out of method registration is "method chaining": as long as each registered method returns a DataFrame, the calls can be strung together.

In [None]:
df = (
    pd.DataFrame({'x': [0, 0], 'y': [1, 1]})
    .zach_func1()
    .zach_func2()
)

This is extremely useful for "data-cleaning" activities.
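
For comparison, plain pandas can chain unregistered functions through `DataFrame.pipe`; registration removes that wrapper. The sketch below uses two hypothetical cleaning functions, `drop_empty_rows` and `snake_case_columns`, written for illustration only.

```python
import pandas as pd


def drop_empty_rows(df):
    """Hypothetical helper: drop rows where every value is missing."""
    return df.dropna(how='all')


def snake_case_columns(df):
    """Hypothetical helper: lower-case names and replace spaces with underscores."""
    return df.rename(columns=lambda name: name.lower().replace(' ', '_'))


# Made-up data with one completely empty row.
raw = pd.DataFrame({'First Name': ['Ada', None], 'Hire Date': [1.0, None]})

# Without registration, each step is wrapped in .pipe(...).
cleaned = raw.pipe(drop_empty_rows).pipe(snake_case_columns)
print(cleaned)
```

If the same two functions were registered as methods, the chain would read `raw.drop_empty_rows().snake_case_columns()`.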

## Pyjanitor

For example, a popular "flavor" in the wild is [**pyjanitor**](https://github.com/ericmjl/pyjanitor).

<img src="img/pyjanitor-logo.svg" width="20%">

In [None]:
df = pd.read_excel('data/dirty_data.xlsx')
df.head()

In [None]:
import datetime as dt 
import numpy as np

df = (
    pd.read_excel('data/dirty_data.xlsx')
    
    # Remove the empty column and empty row
    .drop("do not edit! --->", axis=1).drop(7, axis=0)
    .rename(
        mapper={
            "First Name": "first_name",
            "Last Name": "last_name",
            "Employee Status": "employee_status",
            "Subject": "subject",
            "Hire Date": "hire_date",
            "% Allocated": "percentage_allocated",
            "Full time?": "full_time",
            "Certification": "certification",
        },
        axis=1
    )
)

# Convert Excel serial dates (days since 1899-12-30) to datetimes.
df["hire_date"] = pd.to_timedelta(df["hire_date"], unit="D") + dt.datetime(1899, 12, 30)

# Squash certification columns
df['certification'] = df['certification'].combine_first(df['Certification.1'])
gratitude_points = [10, 50, 20, 1000, 392, 115, 12, 182, 1190, 582, 25, 317]
df = (
    df
    .drop(["Certification.1", "Certification.2"], axis=1)
    # Add gratitude points.
    .assign(gratitude_points=gratitude_points)
)

# Log-transform gratitude points.
df["gratitude_points_log"] = df["gratitude_points"].apply(np.log10)

df.head()
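
The hire-date correction above relies on Excel storing dates as a count of days, with 1899-12-30 taken as day zero. Here is a small worked sketch with made-up serial numbers (not values from `dirty_data.xlsx`):

```python
import datetime as dt

import pandas as pd

# Made-up Excel serial dates, for illustration only.
serials = pd.Series([43101, 43466])

# Interpret each serial as a number of days and add it to the origin.
dates = pd.to_timedelta(serials, unit='D') + dt.datetime(1899, 12, 30)
print(dates.dt.strftime('%Y-%m-%d').tolist())  # ['2018-01-01', '2019-01-01']
```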

### The pyjanitor flavor simplifies Pandas' API for data cleaning.

In [None]:
import janitor

df = (
    pd.read_excel("data/dirty_data.xlsx")
    .remove_empty()
    .clean_names(strip_underscores=True)
    .coalesce(["certification", "certification_1"])
    .convert_excel_date("hire_date")
    .rename_column("%_allocated", "percent_allocated")
    .add_column("gratitude_points", gratitude_points)
    .transform_column("gratitude_points", np.log10, "gratitude_log")
)
df.head()

## Conclusion (Part 1)

Using pandas-flavor, you can write your own flavor of Pandas by **registering methods in a Python module** (or package).

You can easily make your flavor pip-installable.

i.e. `pip install my_flavor`

## Part 2: Accessor registration

An **accessor** is an *object* attached to a DataFrame that can affect (i.e. mutate) that DataFrame.

## Start with a real life use-case

<img src="img/phylopandas-logo.png" width="60%">

In biology, we have all kinds of (nonsense) formats.

For example, `fasta` is a common format for genomic sequence data. 

In [None]:
with open('data/PF08793.fasta', 'r') as f:
    print(f.read())

I wanted to read biological data like this into Pandas.

Naturally, I started by writing my own `read_` functions.

In [None]:
import phylopandas as ph

df = ph.read_fasta('data/PF08793.fasta')
df.head()
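
The internals of `ph.read_fasta` are not shown here. As a stripped-down sketch of what such a `read_` function involves, the hypothetical helper below parses FASTA text into a DataFrame; the function name, the column names `id` and `sequence`, and the example data are all assumptions.

```python
import io

import pandas as pd


def read_fasta_sketch(handle):
    """Parse FASTA text from a file-like object into a DataFrame.

    Hypothetical helper; the column names 'id' and 'sequence' are assumptions.
    """
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A header line starts a new record; the id is its first token.
            header = line[1:].split()
            records.append({'id': header[0] if header else '', 'sequence': ''})
        elif records:
            # Sequence lines may wrap, so append to the current record.
            records[-1]['sequence'] += line
    return pd.DataFrame(records, columns=['id', 'sequence'])


# Made-up example data, not taken from PF08793.fasta.
example = io.StringIO(">seq1 first example\nMKT\nAYI\n>seq2\nGGA\n")
fasta_df = read_fasta_sketch(example)
print(fasta_df)
```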

But I couldn't write that DataFrame back out to biological data formats.

This is what inspired me to write pandas-flavor.

I created an *accessor* with custom write methods. 


What I needed was a custom API on Pandas to write out the data:
```python
df.phylo.to_fasta(...)
```

In [None]:
print(df.phylo.to_fasta(id_col='label'))

The PhyloPandas flavor registers an accessor, named `phylo`, on Pandas' DataFrame that includes custom functions for biological data.

In [None]:
accessor = df.phylo

for item in dir(accessor):
    if item.startswith('to'):
        print(item)

Combining representations of data into a single DataFrame.

In [None]:
with open('data/PF08793.newick', 'r') as f:
    print(f.read())

In [None]:
from phylovega import TreeChart

TreeChart.from_newick('data/PF08793.newick')

The PhyloPandas flavor has some clever logic to merge two data formats in a single DataFrame.

In [None]:
df = ph.read_fasta('data/PF08793.fasta')
df.head()

In [None]:
df = (
    ph.read_fasta('data/PF08793.fasta')
    .phylo.read_newick('data/PF08793.newick', combine_on='id')
)

df.head()
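
How phylopandas merges the two formats is not shown here. As a sketch of the idea behind `combine_on='id'`, and not its actual implementation, two tables that share an `id` column can be joined with a plain pandas merge. The data and the `parent` column below are made up.

```python
import pandas as pd

# Made-up sequence table (as a FASTA reader might produce).
sequences = pd.DataFrame({
    'id': ['seq1', 'seq2'],
    'sequence': ['MKTAYI', 'GGA'],
})

# Made-up tree table (as a Newick reader might produce); the internal node
# has no sequence, so it only appears here.
tree = pd.DataFrame({
    'id': ['seq1', 'seq2', 'node0'],
    'length': [0.9, 0.4, 0.0],
    'parent': ['node0', 'node0', None],
})

# An outer join keeps rows found in either table.
combined = sequences.merge(tree, on='id', how='outer')
print(combined)
```

Rows that exist in only one table, such as the internal node, end up with missing values in the other table's columns.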

Don't forget, we still get all of Pandas!

In [None]:
df[df.length > 0.8]

And just for fun, we added a simple `.display` method for showing the data.

In [None]:
df.phylo.display()

In this approach, we flavor Pandas by containing all our custom functions in an accessor. 

## How do we write an accessor?

Check out the `my_flavor_accessor.py` module.

In [None]:
import my_flavor_accessor

df = pd.DataFrame({'x': [0, 0], 'y': [1, 1]})

df.zach.func1()
df.zach.func2()

## Conclusion (Part 2): What does this mean for you?

* We (the scientific community) can write domain-specific DataFrames.

  Follow our example in the evolutionary biology community.

* DataFrames encourage "data scientists" to define schemas or grammars (standardized column names) for their domain.

  Pandas Flavor makes it easy to build a domain-specific API for those schemas.
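
One way to tie an accessor to a schema is to validate the expected columns when the accessor is created, a pattern also used in the pandas documentation on extending pandas. The sketch below uses pandas' built-in `pd.api.extensions.register_dataframe_accessor`; the accessor name `seqs`, the required columns, and the `lengths` method are hypothetical.

```python
import pandas as pd

REQUIRED_COLUMNS = {'id', 'sequence'}  # hypothetical schema


@pd.api.extensions.register_dataframe_accessor('seqs')
class SequenceAccessor:
    def __init__(self, df):
        # Fail early if the DataFrame does not follow the schema.
        missing = REQUIRED_COLUMNS - set(df.columns)
        if missing:
            raise AttributeError(f"missing columns: {sorted(missing)}")
        self._df = df

    def lengths(self):
        """Return the length of each sequence, indexed by id."""
        return self._df.set_index('id')['sequence'].str.len()


valid = pd.DataFrame({'id': ['seq1', 'seq2'], 'sequence': ['MKTAYI', 'GGA']})
print(valid.seqs.lengths().to_dict())
```

A DataFrame without the required columns raises an `AttributeError` as soon as `.seqs` is accessed, so schema violations surface at the point of use.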

## Other Pandas flavors in the wild...

* pandas `.plot`
* pdvega
* pyjanitor
* geopandas
* python-ctd
* pingouin

In [None]:
%matplotlib inline

df.plot.bar()

## Acknowledgements

* Eric Ma (creator of pyjanitor)
* BioPython community (for PhyloPandas inspiration)
* Jeet Sukumaran (creator of DendroPy)
* Jake Vanderplas (original review of pandas-flavor)
* Mike Harms (graduate PI and contributor to phylopandas)

Thanks!