# pplyr <-> base Python / pandas

This is a copy of the dplyr vignette [dplyr <-> base R](https://cran.r-project.org/web/packages/dplyr/vignettes/base.html).

This vignette compares pplyr functions to their equivalents in base Python and pandas. This helps those familiar with base Python/pandas better understand what pplyr does, and shows pplyr users how to express the same ideas using Python and pandas. We’ll start with a rough overview of the major differences, then discuss the one-table verbs in more detail, followed by the two-table verbs.

## Overview

The core pplyr verbs take a data frame as input and return a data frame as output. The input data frame is always the first argument, so functions have the form:

```
def some_action(df, *args, **kwargs):
    ...  # transform df here
    return df  # always a pd.DataFrame
```

This allows the functions to be used naturally with pandas DataFrame objects via the pipe() function:

```
df2 = df.pipe(some_action, ...)
```
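
As a minimal sketch of this calling convention, the example below defines a verb-shaped function and passes it to `pipe()`. The function `keep_tall` and the `people` data are hypothetical and are not part of pplyr; they only illustrate the "data frame first" signature.

```python
import pandas as pd


def keep_tall(df, min_height):
    """Hypothetical verb: keep rows whose height is at least min_height."""
    return df[df["height"] >= min_height]


people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "height": [150, 180, 165],
})

# pipe() passes `people` as the first argument, then forwards the rest.
tall = people.pipe(keep_tall, min_height=160)
print(tall)
```

Because the data frame is the first argument, `people.pipe(keep_tall, min_height=160)` and `keep_tall(people, min_height=160)` do the same thing.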

To allow method chaining we also have a ```pipeline``` object that stores a list of verbs to apply to a DataFrame.  Method chaining can be used to create this pipeline.  For example, you might see:

```
from pplyr import pipeline
df2 = df.pipe(pipeline()
        .select(['name','age','height'])
        .groupby('age')
        .summarise(
          n = lambda x: len(x),
          avg_height = lambda x: x.height.mean()
        ))
```

This allows multiple verbs to be used together without having to store the intermediate results.  While this is possible with pandas DataFrames, there are occasionally operations that do not have a nice verb and can interrupt the flow of your code.  One example is the bracket selectors ```[]``` that are used to select columns and perform filtering operations.  In pplyr these actions are performed by ```select``` and ```filter```.
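
To make the idea of a stored list of verbs concrete, here is a minimal sketch of such an object. The `Pipeline` class below is an illustration written for this vignette and is an assumption about the general technique, not pplyr's actual implementation; it supports only two hypothetical steps.

```python
import pandas as pd


class Pipeline:
    """Illustrative only: records steps and replays them on a DataFrame."""

    def __init__(self):
        self.steps = []

    def select(self, cols):
        self.steps.append(lambda df: df[cols])
        return self  # returning self is what enables method chaining

    def filter(self, predicate):
        self.steps.append(lambda df: df[predicate(df)])
        return self

    def __call__(self, df):
        # Apply the recorded steps in order, feeding each result forward.
        for step in self.steps:
            df = step(df)
        return df


people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [25, 40, 35],
    "height": [150, 180, 165],
})

# The object is callable, so it can be handed directly to pipe().
result = people.pipe(
    Pipeline().select(["name", "age"]).filter(lambda d: d.age > 30)
)
print(result)
```

Nothing runs while the chain is being built; the steps execute only when the object is called with a DataFrame.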

We do not force the "non-standard evaluation" from R's dplyr into the Python environment.  Instead, variable names are always quoted and operations are often defined by lambda expressions where one can operate on an object in a "Pythonic" way.

All pplyr verbs handle “grouped” data frames so that the code to perform a computation per-group looks very similar to code that works on a whole data frame. Some operations keep the DataFrame grouped.  Indexes are also reset after most grouped operations. In pandas, per-group operations tend to rely on the ```groupby().apply()``` pattern.  This always returns a regular DataFrame rather than a grouped DataFrame.  It also returns a DataFrame whose index combines the grouping variables with whatever index the applied function returned.
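
The index behaviour described above can be seen in a small sketch. The data frame `gf` is made up for illustration, and selecting `[["x"]]` before `apply()` is a choice made here so the applied function only sees the non-grouping column.

```python
import pandas as pd

gf = pd.DataFrame({"g": [1, 1, 2, 2], "x": [0.5, 1.5, 2.5, 3.5]})

# Each group returns a one-row DataFrame with its own index [0].
raw = gf.groupby("g")[["x"]].apply(
    lambda d: pd.DataFrame({"x_mean": [d["x"].mean()]})
)
# raw.index is a MultiIndex: (group key, index of the returned frame).
print(raw.index.nlevels)

# Drop the inner level, then move the group key back into a column.
tidy = raw.reset_index(level=1, drop=True).reset_index()
print(tidy)
```

The two `reset_index()` calls are the extra work that a verb which resets the index after grouped operations saves you.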

## One table verbs

The following table shows a condensed translation between pplyr verbs and their pandas equivalents. The following sections describe each operation in more detail. You can learn more about the pplyr verbs in their documentation and in the vignette ```pplyr```.

| pplyr                       | pandas                             |
|:----------------------------|:-----------------------------------|
| ```arrange(df, x)```        | ```df.sort_values(x)```            |
| ```distinct(df, x)```       | ```df.drop_duplicates(subset=x)``` |
| ```filter(df, x)```         | ```df[<logical test>]```           |
| ```mutate(df, z=lambda df: df.x+df.y)``` | ```df.assign(z=lambda df: df.x+df.y)``` |
| ```pull(df, 1)```           | ```df.iloc[:,1]```                 |
| ```pull(df, x)```           | ```df[x]```                        |
| ```rename(df, y="x")```     | ```df.rename(columns={"x": "y"})``` |
| ```relocate(df, "y")```     | n/a                                |
| ```select(df, ["x","y"])``` | ```df[["x","y"]]```                |
| ```summarise(df, avg=lambda x: x.mean())``` | ```df.groupby(...).apply(...)``` |
| ```slice(df, [0,1,4])```    | ```df.iloc[[0,1,4], :]```          |

As the table above shows, the pplyr verbs rarely offer a simpler way to do things.  pandas has a similar operation for every verb, and the pandas version is sometimes shorter or more flexible.  The biggest advantage pplyr offers (besides being more intuitive for a dplyr user) is verbs for operations that pandas can only express with square brackets (```df.iloc[]```, ```df[cols]```, and ```df[<filter criteria>]```).  In these cases the pplyr verbs offer a way to chain the calls together without having to stop your flow and save intermediate data frames.

To begin, we’ll load pplyr and the mtcars, iris, and starwars data sets. Throughout, we show only abbreviated output for each operation.

In [1]:
import sys
if ".." not in sys.path:
    sys.path.append("..")

import pplyr
import pandas as pd
import numpy as np

In [2]:
mtcars = pd.read_csv("../data/mtcars.csv.gz")
iris = pd.read_csv("../data/iris.csv.gz")
starwars = pd.read_csv("../data/starwars.csv.gz")

## ```arrange()```: Arrange rows by variables

pplyr.arrange() orders the rows of a data frame by the values of one or more columns:

In [3]:
mtcars.pipe(pplyr.arrange, ["cyl", "disp"]).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
19,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
18,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
17,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
25,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
27,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


The ```ascending``` parameter can be set to False to order in descending order:

In [4]:
mtcars.pipe(pplyr.arrange, ["cyl", "disp"], ascending=False).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
14,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
15,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
16,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4
24,19.2,8,400.0,175,3.08,3.845,17.05,0,0,3,2
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


We can replicate this in pandas by using ```sort_values()```:

In [5]:
mtcars.sort_values(["cyl", "disp"]).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
19,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
18,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
17,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
25,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
27,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
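
In pandas, ```sort_values()``` also accepts one ```ascending``` flag per column, which allows mixed sort directions. The sketch below uses a small made-up frame rather than mtcars so that it runs on its own; whether ```pplyr.arrange``` accepts a list here is not shown in this vignette.

```python
import pandas as pd

cars = pd.DataFrame({
    "cyl": [4, 8, 4, 8],
    "disp": [71.1, 472.0, 95.1, 360.0],
})

# cyl ascending, then disp descending within each cyl value.
mixed = cars.sort_values(["cyl", "disp"], ascending=[True, False])
print(mixed)
```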


## ```distinct()```: Select distinct/unique rows

```pplyr.distinct()``` selects unique rows:

In [6]:
df = pd.DataFrame({
    "x": np.random.randint(low=1, high=10, size=100),
    "y": np.random.randint(low=1, high=10, size=100)
})

df.pipe(pplyr.distinct, "x")

Unnamed: 0,x,y
0,5,5
1,3,5
2,1,7
3,8,5
5,7,2
10,9,7
14,2,4
18,4,9
19,6,6


NOTE: Behavior differs from R's dplyr.  R will only return the column used in testing distinctness unless ".keep_all=TRUE".  Ours runs with ".keep_all=TRUE" as the default, since this is the default in pandas.  If you just want the single column use ```pull("x")```.

The equivalent in pandas is:

In [7]:
df.drop_duplicates("x")

Unnamed: 0,x,y
0,5,5
1,3,5
2,1,7
3,8,5
5,7,2
10,9,7
14,2,4
18,4,9
19,6,6
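
The difference between keeping all columns and keeping only the tested column can be sketched in plain pandas. The data here is fixed (not random) so the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2, 2, 3], "y": [10, 20, 30, 40, 50]})

# Like dplyr's .keep_all=TRUE: first row per x value, all columns kept.
all_cols = df.drop_duplicates(subset="x")

# Like dplyr's default: only the column used to test distinctness.
only_x = df[["x"]].drop_duplicates()

print(all_cols)
print(only_x)
```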


## ```filter()```: Return rows with matching conditions

```pplyr.filter()``` selects rows where an expression is True:

In [8]:
starwars.pipe(pplyr.filter, lambda x: x.species == "Human").head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
3,Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human
5,Owen Lars,178.0,120.0,"brown, grey",light,blue,52.0,male,masculine,Tatooine,Human
6,Beru Whitesun lars,165.0,75.0,brown,light,blue,47.0,female,feminine,Tatooine,Human


In [9]:
starwars.pipe(pplyr.filter, lambda x: x.mass > 1000)

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
15,Jabba Desilijic Tiure,175.0,1358.0,,"green-tan, brown",orange,600.0,hermaphroditic,masculine,Nal Hutta,Hutt


In [10]:
starwars.pipe(pplyr.filter, 
              lambda x: (x.hair_color == "none") & (x.eye_color == "black")).head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
29,Nien Nunb,160.0,68.0,none,grey,black,,male,masculine,Sullust,Sullustan
45,Gasgano,122.0,,none,"white, blue",black,,male,masculine,Troiken,Xexto
49,Kit Fisto,196.0,87.0,none,green,black,,male,masculine,Glee Anselm,Nautolan
54,Plo Koon,188.0,80.0,none,orange,black,22.0,male,masculine,Dorin,Kel Dor
68,Lama Su,229.0,88.0,none,grey,black,,male,masculine,Kamino,Kaminoan


The pandas equivalent is:

In [11]:
starwars[starwars.species == "Human"].head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
3,Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human
5,Owen Lars,178.0,120.0,"brown, grey",light,blue,52.0,male,masculine,Tatooine,Human
6,Beru Whitesun lars,165.0,75.0,brown,light,blue,47.0,female,feminine,Tatooine,Human


In [12]:
starwars[starwars.mass > 1000]

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
15,Jabba Desilijic Tiure,175.0,1358.0,,"green-tan, brown",orange,600.0,hermaphroditic,masculine,Nal Hutta,Hutt


In [13]:
starwars[lambda x: (x.hair_color == "none") & (x.eye_color == "black")].head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
29,Nien Nunb,160.0,68.0,none,grey,black,,male,masculine,Sullust,Sullustan
45,Gasgano,122.0,,none,"white, blue",black,,male,masculine,Troiken,Xexto
49,Kit Fisto,196.0,87.0,none,green,black,,male,masculine,Glee Anselm,Nautolan
54,Plo Koon,188.0,80.0,none,orange,black,22.0,male,masculine,Dorin,Kel Dor
68,Lama Su,229.0,88.0,none,grey,black,,male,masculine,Kamino,Kaminoan


As shown in the last example, pandas also supports using a function as the selection criteria.
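
Passing a function matters most in the middle of a method chain, where the intermediate frame has no name. The sketch below shows two pandas options, ```.loc``` with a callable and ```DataFrame.query()```, on a small made-up frame with starwars-like columns.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Luke", "Jabba", "Leia"],
    "mass": [77.0, 1358.0, 49.0],
    "species": ["Human", "Hutt", "Human"],
})

# The callable receives the sorted intermediate frame, not `people`.
via_loc = people.sort_values("mass").loc[lambda d: d["species"] == "Human"]

# query() takes the condition as a string evaluated against the columns.
via_query = people.sort_values("mass").query("species == 'Human' and mass > 50")

print(via_loc)
print(via_query)
```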

## ```mutate()```: Create or transform variables

```pplyr.mutate()``` creates new variables from existing variables:

In [14]:
df.pipe(pplyr.mutate, 
        z = lambda df: df.x + df.y,
        z2 = lambda df: df.z.pow(2)
       ).head()

Unnamed: 0,x,y,z,z2
0,5,5,10,100
1,3,5,8,64
2,1,7,8,64
3,8,5,13,169
4,3,4,7,49


The pandas equivalent is ```assign()```:

In [15]:
df.assign( 
        z = lambda df: df.x + df.y,
        z2 = lambda df: df.z.pow(2)
       ).head()

Unnamed: 0,x,y,z,z2
0,5,5,10,100
1,3,5,8,64
2,1,7,8,64
3,8,5,13,169
4,3,4,7,49


When applied to a grouped data frame, pplyr.mutate() computes new variables once per group:

In [16]:
gf = pd.DataFrame({
    "g": [1,1,2,2],
    "x": [0.5, 1.5, 2.5, 3.5]
})

gf.pipe(pplyr.pipeline()
       .groupby("g")
       .mutate(
         x_mean = lambda df: df.x.mean(),
         x_rank = lambda df: range(len(df))
       ).ungroup())

Unnamed: 0,g,x,x_mean,x_rank
0,1,0.5,1.0,0
1,1,1.5,1.0,1
2,2,2.5,3.0,0
3,2,3.5,3.0,1


To replicate this in pandas, you can use:

In [17]:
gf.groupby("g") \
  .apply(lambda df: pd.DataFrame({
    "g": df.g,
    "x": df.x,
    "x_mean": df.x.mean(),
    "x_rank": range(len(df))
}))

Unnamed: 0,g,x,x_mean,x_rank
0,1,0.5,1.0,0
1,1,1.5,1.0,1
2,2,2.5,3.0,0
3,2,3.5,3.0,1
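
Another pandas option for per-group columns, sketched here as an alternative to ```apply()```, is to combine ```assign()``` with ```groupby().transform()``` and ```groupby().cumcount()```. Both return values aligned to the original rows, so the index stays flat.

```python
import pandas as pd

gf = pd.DataFrame({
    "g": [1, 1, 2, 2],
    "x": [0.5, 1.5, 2.5, 3.5],
})

grouped = gf.groupby("g")
result = gf.assign(
    x_mean=grouped["x"].transform("mean"),  # group mean repeated per row
    x_rank=grouped.cumcount(),              # 0-based position within group
)
print(result)
```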


## ```pull()```: Pull out a single variable

```pplyr.pull()``` extracts a variable either by name or position:

In [18]:
mtcars.pipe(pplyr.pull, 0).head()

0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
Name: mpg, dtype: float64

In [19]:
mtcars.pipe(pplyr.pull, "cyl").head()

0    6
1    6
2    4
3    6
4    8
Name: cyl, dtype: int64

This is equivalent to ```.iloc[]``` for positions and ```[]``` for names:

In [20]:
mtcars.iloc[:,0].head()

0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
Name: mpg, dtype: float64

In [21]:
mtcars["cyl"].head()

0    6
1    6
2    4
3    6
4    8
Name: cyl, dtype: int64

## ```relocate()```: Change column order

pplyr.relocate() makes it easy to move a set of columns to a new position (by default, the front):

In [22]:
# to front
mtcars.pipe(pplyr.relocate, ["gear", "carb"]).head()

Unnamed: 0,gear,carb,mpg,cyl,disp,hp,drat,wt,qsec,vs,am
0,4,4,21.0,6,160.0,110,3.9,2.62,16.46,0,1
1,4,4,21.0,6,160.0,110,3.9,2.875,17.02,0,1
2,4,1,22.8,4,108.0,93,3.85,2.32,18.61,1,1
3,3,1,21.4,6,258.0,110,3.08,3.215,19.44,1,0
4,3,2,18.7,8,360.0,175,3.15,3.44,17.02,0,0


In [23]:
# to back
mtcars.pipe(pplyr.relocate, ["mpg","cyl"], after=-1).head() 

Unnamed: 0,disp,hp,drat,wt,qsec,vs,am,gear,carb,mpg,cyl
0,160.0,110,3.9,2.62,16.46,0,1,4,4,21.0,6
1,160.0,110,3.9,2.875,17.02,0,1,4,4,21.0,6
2,108.0,93,3.85,2.32,18.61,1,1,4,1,22.8,4
3,258.0,110,3.08,3.215,19.44,1,0,3,1,21.4,6
4,360.0,175,3.15,3.44,17.02,0,0,3,2,18.7,8


You can accomplish this in pandas with some basic list manipulation:

In [24]:
#cols = ["gear","carb"] + [col for col in mtcars.columns if col not in ["gear","carb"]]

cols = list(mtcars.columns)
cols.insert(0, cols.pop(cols.index('gear')))
cols.insert(1, cols.pop(cols.index('carb')))

mtcars[cols].head()

Unnamed: 0,gear,carb,mpg,cyl,disp,hp,drat,wt,qsec,vs,am
0,4,4,21.0,6,160.0,110,3.9,2.62,16.46,0,1
1,4,4,21.0,6,160.0,110,3.9,2.875,17.02,0,1
2,4,1,22.8,4,108.0,93,3.85,2.32,18.61,1,1
3,3,1,21.4,6,258.0,110,3.08,3.215,19.44,1,0
4,3,2,18.7,8,360.0,175,3.15,3.44,17.02,0,0


In [25]:
cols = list(mtcars.columns)
cols.append(cols.pop(cols.index('mpg')))
cols.append(cols.pop(cols.index('cyl')))

mtcars[cols].head()

Unnamed: 0,disp,hp,drat,wt,qsec,vs,am,gear,carb,mpg,cyl
0,160.0,110,3.9,2.62,16.46,0,1,4,4,21.0,6
1,160.0,110,3.9,2.875,17.02,0,1,4,4,21.0,6
2,108.0,93,3.85,2.32,18.61,1,1,4,1,22.8,4
3,258.0,110,3.08,3.215,19.44,1,0,3,1,21.4,6
4,360.0,175,3.15,3.44,17.02,0,0,3,2,18.7,8
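
The list manipulation above can be wrapped in a small helper so it works inside a ```pipe()``` chain. The function ```move_columns``` is hypothetical, written for this example, and is not pplyr's ```relocate```; it only supports moving columns to the front or the back.

```python
import pandas as pd


def move_columns(df, cols, to_front=True):
    """Hypothetical helper: move `cols` to the front or back of df."""
    rest = [c for c in df.columns if c not in cols]
    order = list(cols) + rest if to_front else rest + list(cols)
    return df[order]


cars = pd.DataFrame({
    "mpg": [21.0], "cyl": [6], "disp": [160.0], "gear": [4], "carb": [4],
})

front = cars.pipe(move_columns, ["gear", "carb"])
back = cars.pipe(move_columns, ["mpg", "cyl"], to_front=False)
print(list(front.columns))
print(list(back.columns))
```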


## ```rename()```: Rename variables by name

```pplyr.rename()``` allows you to rename variables by name or position:

In [26]:
iris.pipe(pplyr.rename, sepal_length="Sepal.Length", sepal_width=1).head()

Unnamed: 0,sepal_length,sepal_width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


This is also easily done in pandas.  (Note that pandas reverses the order of the variables though.  The old name comes first since this dict is treated as a mapping function of existing names to new names.)

In [27]:
iris.rename(columns={
    "Sepal.Length": "sepal_length",
    iris.columns[1]: "sepal_width"
}).head()

Unnamed: 0,sepal_length,sepal_width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## ```rename_with()```: Rename variables with a function

pplyr.rename_with() transforms column names with a function:

In [28]:
iris.pipe(pplyr.rename_with, lambda col: col.upper()).head()

Unnamed: 0,SEPAL.LENGTH,SEPAL.WIDTH,PETAL.LENGTH,PETAL.WIDTH,SPECIES
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


pandas supports this natively if a function is supplied to ```rename``` rather than a dictionary.

In [29]:
iris.rename(columns=lambda col: col.upper()).head()

Unnamed: 0,SEPAL.LENGTH,SEPAL.WIDTH,PETAL.LENGTH,PETAL.WIDTH,SPECIES
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## ```select()```: Select variables by name

pplyr.select() subsets columns by position, name, function of name, or other property:

In [30]:
iris.pipe(pplyr.select, start=0, end=2).head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4


In [31]:
iris.pipe(pplyr.select, ["Species", "Sepal.Length"]).head()

Unnamed: 0,Species,Sepal.Length
0,setosa,5.1
1,setosa,4.9
2,setosa,4.7
3,setosa,4.6
4,setosa,5.0


In [32]:
iris.pipe(pplyr.select, lambda x: x.startswith("Petal")).head()

Unnamed: 0,Petal.Length,Petal.Width
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2


The equivalent operations in pandas are:

In [33]:
iris.iloc[:,0:3].head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4


In [34]:
iris.loc[:, ["Species", "Sepal.Length"]].head()

Unnamed: 0,Species,Sepal.Length
0,setosa,5.1
1,setosa,4.9
2,setosa,4.7
3,setosa,4.6
4,setosa,5.0


In [35]:
iris.loc[:, iris.columns.map(lambda x: x.startswith("Petal"))].head()

Unnamed: 0,Petal.Length,Petal.Width
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2
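
For selecting by a pattern in the column name, pandas has two further options that fit into a chain: the vectorised ```.str``` accessor on ```columns```, and ```DataFrame.filter()```. The sketch uses a one-row frame with iris-like column names so it runs without the data file.

```python
import pandas as pd

iris_like = pd.DataFrame({
    "Sepal.Length": [5.1],
    "Sepal.Width": [3.5],
    "Petal.Length": [1.4],
    "Petal.Width": [0.2],
    "Species": ["setosa"],
})

# Boolean mask built from the column names.
by_mask = iris_like.loc[:, iris_like.columns.str.startswith("Petal")]

# filter() matches column labels against a regular expression.
by_filter = iris_like.filter(regex=r"^Petal")

print(list(by_mask.columns))
print(list(by_filter.columns))
```

Note that ```DataFrame.filter()``` in pandas selects by label; it does not filter rows by value the way ```pplyr.filter``` does.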


## ```summarise()```: Reduce multiple values down to a single value

```pplyr.summarise()``` computes one or more summaries for each group:

In [36]:
mtcars.pipe(pplyr.pipeline()
        .groupby("cyl")
        .summarise(
          mean = lambda df: df.disp.mean(),
          n = lambda df: len(df)
        ))

Unnamed: 0,cyl,mean,n
0,4,105.136364,11
1,6,183.314286,7
2,8,353.1,14


pandas supports this through its ```groupby()``` and ```apply()``` mechanism.  However, this requires a bit of extra work.  Since we are only returning a one-row DataFrame built from scalar values, pandas requires us to specify an index.  Otherwise it will throw an error.  Also, the result comes back with our grouping variables moved to the index and a new index level inserted from the DataFrame we created.  To get this back into a shape we'd want to work with, we have to drop the last index level and then call reset_index() to move the grouping variables back into normal columns.

In [37]:
mtcars.groupby("cyl") \
      .apply(lambda df: pd.DataFrame({
        "mean": df.disp.mean(),
        "n": len(df)
      }, index=[0])) \
      .reset_index(level = 1, drop=True) \
      .reset_index()

Unnamed: 0,cyl,mean,n
0,4,105.136364,11
1,6,183.314286,7
2,8,353.1,14
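
When each summary is a standard reduction of a single column, pandas' named aggregation is a shorter alternative to ```apply()```. The sketch below uses a small made-up frame instead of mtcars; each keyword is an output column name mapped to a ```(column, function)``` pair.

```python
import pandas as pd

cars = pd.DataFrame({
    "cyl": [4, 4, 6, 8, 8, 8],
    "disp": [100.0, 120.0, 160.0, 300.0, 350.0, 400.0],
})

summary = (
    cars.groupby("cyl")
        .agg(mean=("disp", "mean"), n=("disp", "size"))
        .reset_index()  # move cyl from the index back to a column
)
print(summary)
```

Unlike the lambda-based approach, each named aggregation sees only one column, so summaries that combine several columns still need ```apply()```.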


## ```slice()```: Choose rows by position

```pplyr.slice()``` selects rows by their position:

In [38]:
mtcars.pipe(pplyr.slice, 24, len(mtcars)).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
24,19.2,8,400.0,175,3.08,3.845,17.05,0,0,3,2
25,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
26,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
28,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4


This is straightforward to replicate with ```.iloc```:

In [39]:
mtcars.iloc[24:len(mtcars)].head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
24,19.2,8,400.0,175,3.08,3.845,17.05,0,0,3,2
25,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
26,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
28,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4


## Two-table verbs

When we want to merge two data frames (x and y), we have a variety of different ways to bring them together. The DataFrame.merge() function is well suited to this task.  We provide a few light-weight wrappers that call ```merge()``` and specify the type of join that is being performed.

| pplyr                      | pandas                                |
|:---------------------------|:--------------------------------------|
| ```inner_join(df1, df2)``` | ```pd.merge(df1, df2)```              |
| ```left_join(df1, df2)```  | ```pd.merge(df1, df2, how='left')```  |
| ```right_join(df1, df2)``` | ```pd.merge(df1, df2, how='right')``` |
| ```outer_join(df1, df2)``` | ```pd.merge(df1, df2, how='outer')``` |
| ```semi_join(df1, df2)```  | n/a                                   |
| ```anti_join(df1, df2)```  | n/a                                   |

For more information about two-table verbs, see the vignette: "two-table".

### Mutating joins

pplyr's ```inner_join()```, ```left_join()```, ```right_join()```, and ```outer_join()``` add new columns from y to x, matching rows based on a set of “keys”, and differ only in how missing matches are handled. They are equivalent to calls to merge() with various settings of the ```how``` argument.  All other arguments to ```merge``` can be supplied and are passed through to the underlying ```merge``` function.  This includes the ```sort``` argument, which determines whether the results are sorted by key.
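
The effect of the ```how``` argument can be sketched directly with ```DataFrame.merge()```. The two small frames are defined here so the example runs on its own; only pandas calls are used.

```python
import pandas as pd

band_members = pd.DataFrame({
    "name": ["Mick", "John", "Paul"],
    "band": ["Stones", "Beatles", "Beatles"],
})
band_instruments = pd.DataFrame({
    "name": ["John", "Paul", "Keith"],
    "plays": ["guitar", "bass", "guitar"],
})

# Same keys, four ways of handling rows without a match.
inner = band_members.merge(band_instruments, on="name", how="inner")
left = band_members.merge(band_instruments, on="name", how="left")
right = band_members.merge(band_instruments, on="name", how="right")
outer = band_members.merge(band_instruments, on="name", how="outer")

print(len(inner), len(left), len(right), len(outer))
```

Unmatched rows are kept with missing values in the columns that come from the other table; for example, Mick has no ```plays``` value in the left join.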

### Filtering joins

pplyr’s ```semi_join()``` and ```anti_join()``` affect only the rows, not the columns:

In [40]:
band_members = pd.DataFrame({
    "name": ["Mick", "John", "Paul"],
    "band": ["Stones", "Beatles", "Beatles"]
})

band_instruments = pd.DataFrame({
    "name": ["John", "Paul", "Keith"],
    "plays": ["guitar", "bass", "guitar"]
})

In [41]:
band_members.pipe(pplyr.semi_join, band_instruments)

Unnamed: 0,name,band
0,John,Beatles
1,Paul,Beatles


In [42]:
band_members.pipe(pplyr.anti_join, band_instruments)

Unnamed: 0,name,band
0,Mick,Stones


This can be done in pandas with:

In [43]:
band_members.loc[pd.Series(band_members.name).isin(band_instruments.name), :]

Unnamed: 0,name,band
1,John,Beatles
2,Paul,Beatles


In [44]:
band_members.loc[~pd.Series(band_members.name).isin(band_instruments.name), :]

Unnamed: 0,name,band
0,Mick,Stones
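
The ```isin()``` approach works for a single key column. For several key columns, one option is to build filtering joins on top of ```merge()```. The helpers ```semi_join_sketch``` and ```anti_join_sketch``` below are hypothetical and written for this example; they are not pplyr's implementation, and they assume x has no column already named ```_merge```.

```python
import pandas as pd


def semi_join_sketch(x, y, on):
    """Rows of x that have a match in y on the key columns `on` (a list)."""
    keys = y[on].drop_duplicates()  # avoid duplicating rows of x
    return x.merge(keys, on=on, how="inner")


def anti_join_sketch(x, y, on):
    """Rows of x that have no match in y on the key columns `on` (a list)."""
    keys = y[on].drop_duplicates()
    marked = x.merge(keys, on=on, how="left", indicator=True)
    return marked[marked["_merge"] == "left_only"].drop(columns="_merge")


band_members = pd.DataFrame({
    "name": ["Mick", "John", "Paul"],
    "band": ["Stones", "Beatles", "Beatles"],
})
band_instruments = pd.DataFrame({
    "name": ["John", "Paul", "Keith"],
    "plays": ["guitar", "bass", "guitar"],
})

semi = semi_join_sketch(band_members, band_instruments, on=["name"])
anti = anti_join_sketch(band_members, band_instruments, on=["name"])
print(semi)
print(anti)
```

Only the key columns of y take part in the merge, so the result keeps exactly the columns of x.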
