Skip to content

Access the parent Pandas data frame in loc[], iloc[], assign(), and others Pandas helpers

License

Notifications You must be signed in to change notification settings

eikevons/pandas-paddles

Repository files navigation

Pandas Paddles

pandas-paddles logo

Access the calling pandas data frame in loc[], iloc[], assign() and other methods with DF to write better chains of data frame operations, e.g.:

df = (
    df
    # Select all rows with column "x" < 2
    .loc[DF["x"] < 2]
    .assign(
        # Shift "x" by its minimum.
        y = DF["x"] - DF["x"].min(),
        # Clip "x" to it's central 50% window. Note how DF is used
        # in the argument to `clip()`.
        z = DF["x"].clip(
            lower=DF["x"].quantile(0.25),
            upper=DF["x"].quantile(0.75)
        ),
    )
)
Documentation Status Test Status Latest version Supported Python versions PyPI downloads

Overview

  • Motivation: Make chaining Pandas operations easier and bring functionality to Pandas similar to Spark's col() function or referencing columns in R's dplyr.
  • Install from PyPI with pip install pandas-paddles. Pandas versions 1+ (>=1,<3) are supported.
  • Documentation can be found at readthedocs.
  • Source code can be obtained from GitHub.
  • Changelog

Example: Create new column and filter

Instead of writing "traditional" Pandas like this:

df_in = pd.DataFrame({"x": range(5)})
df = df_in.copy()
df["y"] = df["x"] // 2
df = df.loc[df["y"] <= 1]
df
#    x  y
# 0  0  0
# 1  1  0
# 2  2  1
# 3  3  1

One can write:

from pandas_paddles import DF
df = (
  df_in
  .assign(y = DF["x"] // 2)
  .loc[DF["y"] <= 1]
)

This is especially handy when re-iterating on data frame manipulations interactively, e.g. in a notebook (just imagine you have to rename df to df_out).

But you can access all methods and attributes of the data frame from the context:

df = pd.DataFrame({
    "X": range(5),
    "y": ["1", "a", "c", "D", "e"],
})
df.loc[DF["y"].str.isupper() | DF["y"].str.isnumeric()]
#    X  y
# 0  0  1
# 3  3  D
df.loc[:, DF.columns.str.isupper()]
#    X
# 0  0
# 1  1
# 2  2
# 3  3
# 4  4

You can even use DF in the arguments to methods:

df = pd.DataFrame({
    "x": range(5),
    "y": range(2, 7),
})
df.assign(z = DF['x'].clip(lower=2.2, upper=DF['y'].median()))
#    x  y    z
# 0  0  2  2.2
# 1  1  3  2.2
# 2  2  4  2.2
# 3  3  5  3.0
# 4  4  6  4.0

When working with pd.Series the S object exists. It can be used similar to DF:

s = pd.Series(range(5))
s[S < 3]
# 0    0
# 1    1
# 2    2
# dtype: int64

Similar projects for pandas

  • siuba

    • (+) active
    • (-) new API to learn
  • pandas-ply

    • (-) stale(?), last change 6 years ago
    • (-) new API to learn
    • (-) Symbol / pandas_ply.X works only with ply_* functions
  • pandas-select

    • (+) no explicite df necessary
    • (-) new API to learn
  • pandas-selectable

    • (+) simple select accessor

    • (-) usage inside chains clumsy (needs explicite df):

      ((df
        .select.A == 'a')
        .select.B == 'b'
      )
    • (-) hard-coded str, dt accessor methods

    • (?) composable?

Development

Development is containerized with Docker to separte from host systems and improve reproducability. No other prerequisites are needed on the host system.

Recommendation for Windows users: install WSL 2 (tested on Ubuntu 20.04), and for containerized workflows, Docker Desktop for Windows.

The common tasks are collected in Makefile (See make help for a complete list):

  • Run the unit tests: make test or make watch for continuously running tests on code-changes.

  • Build the documentation: make docs

  • TODO: Update the poetry.lock file: make lock

  • Add a dependency:

    1. Start a shell in a new container.

    2. Add dependency with poetry add in the running container. This will update poetry.lock automatically:

      # 1. On the host system
      % make shell
      # 2. In the container instance:
      I have no name!@7d0e85b3a303:/app$ poetry add --dev --lock falcon
      
  • Build the development image make image (Note: This should be done automatically for the targets.)

About

Access the parent Pandas data frame in loc[], iloc[], assign(), and others Pandas helpers

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages