# Introduction to pplyr

This is a copy of the dplyr vignette [Introduction to dplyr](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) showing how to perform the same operations using pplyr/pandas.

When working with data you must:

* Figure out what you want to do.
* Describe those tasks in the form of a computer program.
* Execute the program.

The pplyr package makes these steps fast and easy:

* By constraining your options, it helps you think about your data manipulation challenges.
* It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
* It uses efficient backends, so you spend less time waiting for the computer.

This document introduces you to pplyr’s basic set of tools, and shows you how to apply them to data frames.

## Imports

In [1]:
import sys
if ".." not in sys.path:
    sys.path.append("..")

import pplyr

In [2]:
import pandas as pd

## Data: starwars

To explore the basic data manipulation verbs of pplyr, we’ll use the starwars dataset. This dataset contains 87 characters and comes from the [Star Wars API](https://swapi.dev/). To get this data into Python, I exported the R data to a CSV file. A couple of columns that could not be easily serialized (because they contained lists of objects) were dropped from the DataFrame.

In [3]:
starwars = pd.read_csv("../data/starwars.csv.gz")
starwars

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,167.0,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human
...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,,,brown,light,hazel,,female,feminine,,Human
83,Poe Dameron,,,brown,light,brown,,male,masculine,,Human
84,BB8,,,none,none,black,,none,masculine,,Droid
85,Captain Phasma,,,unknown,unknown,unknown,,,,,


## Single-table Verbs

pplyr aims to provide a function for each basic verb of data manipulation. These verbs can be organised into three categories based on the component of the dataset that they work with:

Rows:

* ```filter()``` chooses rows based on column values.
* ```slice()``` chooses rows based on location.
* ```arrange()``` changes the order of the rows.

Columns:

* ```select()``` changes whether or not a column is included.
* ```rename()``` changes the name of columns.
* ```mutate()``` changes the values of columns and creates new columns.
* ```relocate()``` changes the order of the columns.

Groups of rows:

* ```summarise()``` collapses a group into a single row.

## DataFrame.pipe() and pplyr.pipelines

We have followed dplyr's design principle of implementing verbs as functions whose first argument is always a DataFrame. Functions of this form can be passed to pandas' DataFrame.pipe() method, so they integrate well with the rest of pandas. For example, we define ```select``` as:

```
def select(df, cols):
    return df[cols]   # NOTE: actual implementation is not quite so simple
```

We can then call this function with a pandas DataFrame (df) using:

```
import pplyr
df.pipe(pplyr.select, cols)
```

We recommend importing pplyr this way as opposed to importing the verbs into your local namespace because some of the verbs conflict with Python built-in functions; specifically: filter and slice.
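To make the calling convention concrete, here is a minimal sketch that uses only pandas and the simplified ```select``` shown above (not pplyr's actual implementation). It illustrates that ```DataFrame.pipe()``` simply forwards the DataFrame as the first argument, followed by any positional or keyword arguments:

```python
import pandas as pd


def select(df, cols):
    # Simplified verb from the text: the DataFrame is always the first argument.
    return df[cols]


df = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO"],
    "height": [172.0, 167.0],
    "mass": [77.0, 75.0],
})

# df.pipe(func, *args, **kwargs) calls func(df, *args, **kwargs).
piped = df.pipe(select, ["name", "height"])
keyword = df.pipe(select, cols=["name", "height"])
direct = select(df, ["name", "height"])
```

All three results hold the same two columns, so a verb written this way works both as a plain function call and inside a ```pipe()``` chain.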

We also offer an alternative syntax using a ```pipeline``` object.  This object provides methods with the same names and signatures as the pplyr verbs.  Each method returns the pipeline object so that they can be chained together.  This allows for code such as:

```
from pplyr import pipeline
df.pipe(pipeline().select(cols))
```

or for something more complicated:

```
df.pipe(pipeline()
        .select(cols)
        .group_by('age')
        .summarise(
          n = lambda x: len(x),
          avg_height = lambda x: x.height.mean()
        ))
```

Pipelines can also be stored as separate objects and called multiple times with different DataFrames. Method chaining seems to accomplish the same goal as dplyr's pipe operator (```%>%```): easy-to-read code, especially when many methods are invoked in a row. The one limitation is that the set of operations available on the ```pipeline``` is not easily extended. However, anyone who wants to use their own function can always fall back on pandas' built-in ```DataFrame.pipe()``` method. ```pipeline``` also provides a ```pipe()``` method so that this can be done in the context of a pipeline as well.
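The idea of a reusable, chainable pipeline can be sketched in a few lines. The ```MiniPipeline``` class below is a hypothetical illustration, not pplyr's code: it assumes a pipeline records its steps and replays them when called, which is enough for ```DataFrame.pipe()``` to accept it.

```python
import pandas as pd


class MiniPipeline:
    """Hypothetical, simplified stand-in for a pipeline object."""

    def __init__(self):
        self._steps = []  # recorded (function, args, kwargs) tuples

    def pipe(self, func, *args, **kwargs):
        # Record the step and return self so that calls can be chained.
        self._steps.append((func, args, kwargs))
        return self

    def select(self, cols):
        return self.pipe(lambda df: df[cols])

    def head(self, n=5):
        return self.pipe(lambda df: df.head(n))

    def __call__(self, df):
        # Being callable is what lets DataFrame.pipe() run the pipeline.
        for func, args, kwargs in self._steps:
            df = func(df, *args, **kwargs)
        return df


# Build the pipeline once, then apply it to two different DataFrames.
first_two = MiniPipeline().select(["name", "height"]).head(2)

humans = pd.DataFrame({
    "name": ["Luke Skywalker", "Darth Vader", "Leia Organa"],
    "height": [172.0, 202.0, 150.0],
    "mass": [77.0, 136.0, 49.0],
})
droids = pd.DataFrame({
    "name": ["C-3PO", "R2-D2"],
    "height": [167.0, 96.0],
    "mass": [75.0, 32.0],
})

out_humans = humans.pipe(first_two)
out_droids = droids.pipe(first_two)
```

Because the steps are stored rather than executed immediately, the same object can be applied to any number of DataFrames, and the generic ```pipe()``` method is the escape hatch for user-defined functions.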

NOTE: We did try to implement a pipe operator similar to the one found in 'dfply', but doing so forced the user to use the pipe operator; simple functions were no longer compatible with DataFrame.pipe(). For this reason we chose Python/pandas compatibility rather than forcing R's practices into the Python ecosystem. As mentioned, we believe method chaining accomplishes the same goal anyway and is a bit more "pythonic".

### Filter rows with ```filter```

```filter()``` allows you to select a subset of rows in a data frame. Like all single-table verbs, the first argument is the DataFrame. The second argument is a function that returns a row selector for the DataFrame. In most cases this function will return a boolean Series (True/False values) indicating which rows should be included in the result.

For example, we can select all characters with light skin color and brown eyes with:

In [4]:
starwars.pipe(pplyr.filter, lambda x: (x.skin_color == "light") & (x.eye_color == "brown"))

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human
8,Biggs Darklighter,183.0,84.0,black,light,brown,24.0,male,masculine,Tatooine,Human
57,Cordé,157.0,,brown,light,brown,,female,feminine,Naboo,Human
62,Dormé,165.0,,brown,light,brown,,female,feminine,Naboo,Human
78,Raymus Antilles,188.0,79.0,brown,light,brown,,male,masculine,Alderaan,Human
83,Poe Dameron,,,brown,light,brown,,male,masculine,,Human
86,Padmé Amidala,165.0,45.0,brown,light,brown,46.0,female,feminine,Naboo,Human


This is roughly equivalent to the pandas code:

```
starwars[(starwars.skin_color == "light") & (starwars.eye_color == "brown")]
```
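The role of the lambda as a row selector can be sketched with plain pandas. The function name ```filter_rows``` is hypothetical and the body is only an assumption about how such a verb could work: the predicate receives the DataFrame and returns a boolean Series, which is then used to index the rows.

```python
import pandas as pd


def filter_rows(df, predicate):
    # Hypothetical sketch: evaluate the predicate on the DataFrame and keep
    # the rows where the resulting boolean Series is True.
    return df.loc[predicate(df)]


people = pd.DataFrame({
    "name": ["Luke Skywalker", "Leia Organa", "Owen Lars", "Biggs Darklighter"],
    "skin_color": ["fair", "light", "light", "light"],
    "eye_color": ["blue", "brown", "blue", "brown"],
})

light_brown = people.pipe(
    filter_rows,
    lambda x: (x.skin_color == "light") & (x.eye_color == "brown"),
)
```

Note the parentheses around each comparison: ```&``` binds more tightly than ```==``` in Python, so they are required.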

### Arrange rows with ```arrange()```

```arrange()``` works similarly to ```filter()``` except that instead of filtering or selecting rows, it reorders them. It takes a DataFrame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

In [5]:
starwars.pipe(pplyr.arrange, ["height", "mass"]).head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
18,Yoda,66.0,17.0,white,green,brown,896.0,male,masculine,,Yoda's species
71,Ratts Tyerell,79.0,15.0,none,"grey, blue",unknown,,male,masculine,Aleen Minor,Aleena
28,Wicket Systri Warrick,88.0,20.0,brown,brown,brown,8.0,male,masculine,Endor,Ewok
44,Dud Bolt,94.0,45.0,none,"blue, grey",yellow,,male,masculine,Vulpter,Vulptereen
2,R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid


```arrange()``` takes a parameter named ```ascending``` that can be set to ```False``` to sort in descending order.  This parameter can also take an array of values, one corresponding to each sort column if ascending and descending orders are mixed.

In [6]:
starwars.pipe(pplyr.arrange, "height", ascending=False).head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
53,Yarael Poof,264.0,,none,white,yellow,,male,masculine,Quermia,Quermian
77,Tarfful,234.0,136.0,brown,brown,blue,,male,masculine,Kashyyyk,Wookiee
68,Lama Su,229.0,88.0,none,grey,black,,male,masculine,Kamino,Kaminoan
12,Chewbacca,228.0,112.0,brown,unknown,blue,200.0,male,masculine,Kashyyyk,Wookiee
34,Roos Tarpals,224.0,82.0,none,grey,orange,,male,masculine,Naboo,Gungan
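Mixed sort orders are easiest to see in a small example. The sketch below uses pandas' own ```sort_values()``` rather than pplyr, on the assumption that a list passed as ```ascending``` behaves the same way: one flag per sort column, with the second column breaking ties in the first.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Luke Skywalker", "Padmé Amidala", "Beru Whitesun lars", "Leia Organa"],
    "height": [172.0, 165.0, 165.0, 150.0],
    "mass": [77.0, 45.0, 75.0, 49.0],
})

# Shortest first; among characters of equal height, heaviest first.
ordered = people.sort_values(by=["height", "mass"], ascending=[True, False])
```

Here the two characters who are 165 cm tall are ordered by descending mass, while the overall order is still ascending height.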


### Choose rows using their position with ```slice()```

```slice()``` lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows.

We can get characters from row numbers 5 through 10:

NOTE: Don't forget that Python uses zero-based indexing and that the end index of a slice is not included in the result. Thus the slice that would be ```5:10``` in R is ```4:10``` in Python.

In [7]:
starwars.pipe(pplyr.slice, 4, 10)

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human
5,Owen Lars,178.0,120.0,"brown, grey",light,blue,52.0,male,masculine,Tatooine,Human
6,Beru Whitesun lars,165.0,75.0,brown,light,blue,47.0,female,feminine,Tatooine,Human
7,R5-D4,97.0,32.0,,"white, red",red,,none,masculine,Tatooine,Droid
8,Biggs Darklighter,183.0,84.0,black,light,brown,24.0,male,masculine,Tatooine,Human
9,Obi-Wan Kenobi,182.0,77.0,"auburn, white",fair,blue-gray,57.0,male,masculine,Stewjon,Human
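The index arithmetic in the note above can be checked with plain pandas. This sketch uses ```DataFrame.iloc``` (an assumption about what a positional slice maps to, not a statement about pplyr's internals) on a toy frame whose only column stores the one-based row number that R would use:

```python
import pandas as pd

# r_row holds the row number as R would count it (1-based).
rows = pd.DataFrame({"r_row": range(1, 13)})

# Python positions 4..9 correspond to R rows 5..10; the end position 10 is excluded.
subset = rows.iloc[4:10]
```

The result has six rows, labelled 4 through 9 in pandas, which are rows 5 through 10 in R's numbering.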


It is accompanied by a number of helpers for common use cases:

* ```slice_head()``` and ```slice_tail()``` select the first or last rows.

In [8]:
starwars.pipe(pplyr.slice_head, n=3)

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,167.0,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid


* ```slice_sample()``` randomly selects rows. Use ```n``` to choose a fixed number of rows, or ```prop``` to choose a proportion of them.

In [9]:
starwars.pipe(pplyr.slice_sample, n = 5)

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
69,Taun We,213.0,,none,grey,black,,female,feminine,Kamino,Kaminoan
39,Quarsh Panaka,183.0,,black,dark,brown,62.0,,,Naboo,
60,Luminara Unduli,170.0,56.2,black,yellow,blue,58.0,female,feminine,Mirial,Mirialan
61,Barriss Offee,166.0,50.0,black,yellow,blue,40.0,female,feminine,Mirial,Mirialan
10,Anakin Skywalker,188.0,84.0,blond,fair,blue,41.9,male,masculine,Tatooine,Human


In [10]:
starwars.pipe(pplyr.slice_sample, prop = 0.1)

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
10,Anakin Skywalker,188.0,84.0,blond,fair,blue,41.9,male,masculine,Tatooine,Human
67,Dexter Jettster,198.0,102.0,none,brown,yellow,,male,masculine,Ojom,Besalisk
61,Barriss Offee,166.0,50.0,black,yellow,blue,40.0,female,feminine,Mirial,Mirialan
70,Jocasta Nu,167.0,,white,fair,blue,,female,feminine,Coruscant,Human
13,Han Solo,180.0,80.0,brown,fair,brown,29.0,male,masculine,Corellia,Human
39,Quarsh Panaka,183.0,,black,dark,brown,62.0,,,Naboo,
45,Gasgano,122.0,,none,"white, blue",black,,male,masculine,Troiken,Xexto
42,Bib Fortuna,180.0,,none,pale,pink,,male,masculine,Ryloth,Twi'lek
14,Greedo,173.0,74.0,,green,black,44.0,male,masculine,Rodia,Rodian


Use ```replace = True``` to perform a bootstrap sample. If needed, you can weight the sample with the ```weight_by``` argument.
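For comparison, pandas exposes the same ideas through ```DataFrame.sample()```. The sketch below uses only pandas; the ```w``` column is a made-up weight added for illustration, and the mapping of ```prop``` to ```frac``` and ```weight_by``` to ```weights``` is an assumption about the correspondence, not a description of pplyr's code.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "w": [0, 1, 1, 1, 1],  # hypothetical weights; a weight of 0 excludes the row
})

# Bootstrap sample: more rows than the frame has, drawn with replacement.
boot = people.sample(n=10, replace=True, random_state=0)

# Proportional sample: 40% of 5 rows gives 2 rows.
part = people.sample(frac=0.4, random_state=0)

# Weighted sample without replacement, using the weights stored in column "w".
weighted = people.sample(n=3, weights="w", random_state=0)
```

Passing ```random_state``` makes the draw reproducible, which is handy in notebooks where the output is committed alongside the code.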

* ```slice_min()``` and ```slice_max()``` select rows with the lowest or highest values of a variable.

In [11]:
starwars.pipe(pplyr.slice_max, "height", n=3)

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
53,Yarael Poof,264.0,,none,white,yellow,,male,masculine,Quermia,Quermian
77,Tarfful,234.0,136.0,brown,brown,blue,,male,masculine,Kashyyyk,Wookiee
68,Lama Su,229.0,88.0,none,grey,black,,male,masculine,Kamino,Kaminoan


In [12]:
starwars.pipe(pplyr.slice_min, "height", n=3)

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
18,Yoda,66.0,17.0,white,green,brown,896.0,male,masculine,,Yoda's species
71,Ratts Tyerell,79.0,15.0,none,"grey, blue",unknown,,male,masculine,Aleen Minor,Aleena
28,Wicket Systri Warrick,88.0,20.0,brown,brown,brown,8.0,male,masculine,Endor,Ewok
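In plain pandas, the closest built-in counterparts are ```nlargest()``` and ```nsmallest()```. The following sketch shows them on five rows of the dataset; whether pplyr delegates to these methods is not stated here, so treat the correspondence as an assumption.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
})

tallest = people.nlargest(2, "height")    # rows with the two highest heights
shortest = people.nsmallest(2, "height")  # rows with the two lowest heights
```

Both methods return the selected rows already ordered by the chosen column, tallest first for ```nlargest()``` and shortest first for ```nsmallest()```.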


The following aliases for slice-like functions are also defined:

* head() = slice_head()
* tail() = slice_tail()

### Select columns with ```select()```

Often you work with large datasets with many columns but only a few are actually of interest to you. ```select()``` allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

In [13]:
starwars.pipe(pplyr.select, ["hair_color","skin_color","eye_color"]).head()

Unnamed: 0,hair_color,skin_color,eye_color
0,blond,fair,blue
1,,gold,yellow
2,,"white, blue",red
3,none,white,yellow
4,brown,light,brown


In [14]:
starwars.pipe(pplyr.select, start="hair_color", end="eye_color").head()

Unnamed: 0,hair_color,skin_color,eye_color
0,blond,fair,blue
1,,gold,yellow
2,,"white, blue",red
3,none,white,yellow
4,brown,light,brown


Unlike dplyr, ```select()``` has no way to specify columns that should be excluded from the selection. (dplyr accomplishes this with ```select(-col1, -col2)``` or ```select(!(hair_color:eye_color))```.) Instead, we provide a ```drop()``` function to drop columns.

In [15]:
starwars.pipe(pplyr.drop, start="hair_color", end="eye_color").head()

Unnamed: 0,name,height,mass,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,19.0,male,masculine,Tatooine,Human
1,C-3PO,167.0,75.0,112.0,none,masculine,Tatooine,Droid
2,R2-D2,96.0,32.0,33.0,none,masculine,Naboo,Droid
3,Darth Vader,202.0,136.0,41.9,male,masculine,Tatooine,Human
4,Leia Organa,150.0,49.0,19.0,female,feminine,Alderaan,Human
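A range of columns can be selected or dropped in plain pandas as well. This sketch assumes the ```start```/```end``` arguments behave like a label slice with ```.loc```, which includes both endpoints, unlike positional slicing:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Luke Skywalker", "Darth Vader", "Leia Organa"],
    "height": [172.0, 202.0, 150.0],
    "hair_color": ["blond", "none", "brown"],
    "skin_color": ["fair", "white", "light"],
    "eye_color": ["blue", "yellow", "brown"],
    "species": ["Human", "Human", "Human"],
})

# Label slices are inclusive at both ends.
colors = people.loc[:, "hair_color":"eye_color"]

# Dropping the same range leaves every other column in its original order.
without_colors = people.drop(columns=colors.columns)
```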


In [16]:
starwars.pipe(pplyr.select, lambda x: x.endswith("color")).head()

Unnamed: 0,hair_color,skin_color,eye_color
0,blond,fair,blue
1,,gold,yellow
2,,"white, blue",red
3,none,white,yellow
4,brown,light,brown


pplyr uses lambda functions to let you provide custom logic. For this reason we do not need dplyr's helper functions: ends_with, starts_with, matches, and contains.
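To illustrate how a predicate on column names can stand in for each of those helpers, here is a small pandas-only sketch. The function ```select_where``` is hypothetical and only assumes that the lambda is called once per column name:

```python
import re

import pandas as pd


def select_where(df, predicate):
    # Hypothetical sketch: keep the columns whose name satisfies the predicate.
    return df[[col for col in df.columns if predicate(col)]]


people = pd.DataFrame({
    "name": ["Luke Skywalker", "Leia Organa"],
    "height": [172.0, 150.0],
    "hair_color": ["blond", "brown"],
    "skin_color": ["fair", "light"],
    "eye_color": ["blue", "brown"],
})

ends = select_where(people, lambda c: c.endswith("color"))        # ends_with
starts = select_where(people, lambda c: c.startswith("h"))        # starts_with
contains = select_where(people, lambda c: "_" in c)               # contains
matches = select_where(people, lambda c: re.search(r"^(hair|eye)_", c) is not None)  # matches
```

Ordinary string methods and the ```re``` module cover all four cases, so no dedicated helper vocabulary is needed.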

We also don't support renaming columns with the ```select``` function. Instead, we expect this to be done using ```rename```, where each keyword is the new name and its value is the existing column name:

In [17]:
starwars.pipe(pplyr.rename, home_world = "homeworld").head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,home_world,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,167.0,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human


### Add new columns with ```mutate()```

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of ```mutate()```:

In [18]:
starwars.pipe(pplyr.mutate, height_m = lambda x: x.height / 100).head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species,height_m
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human,1.72
1,C-3PO,167.0,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid,1.67
2,R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid,0.96
3,Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human,2.02
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human,1.5


As with dplyr's mutate() function, you can refer to columns that you’ve just created, since the columns are added one at a time. (Internally this comes naturally from the pandas DataFrame.assign() function.)

In [19]:
starwars.pipe(pplyr.mutate,
    height_m = lambda x: x.height / 100,
    BMI = lambda x: x.mass / (x.height_m.pow(2))
  ).head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species,height_m,BMI
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human,1.72,26.027582
1,C-3PO,167.0,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid,1.67,26.892323
2,R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid,0.96,34.722222
3,Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human,2.02,33.330066
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human,1.5,21.777778
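The sequential behaviour can be reproduced directly with ```DataFrame.assign()```, which evaluates its keyword arguments in order, so a later lambda sees columns created by an earlier one. A minimal pandas-only sketch on three rows of the dataset:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2"],
    "height": [172.0, 167.0, 96.0],
    "mass": [77.0, 75.0, 32.0],
})

with_bmi = people.assign(
    height_m=lambda x: x.height / 100,
    # height_m already exists by the time this lambda runs.
    BMI=lambda x: x.mass / x.height_m.pow(2),
)
```

```assign()``` returns a new DataFrame and leaves ```people``` unchanged, which is what makes it safe to use in the middle of a chain.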


If you only want to keep the new variables, use ```transmute()```:

In [20]:
starwars.pipe(pplyr.transmute,
    height_m = lambda x: x.height / 100,
    BMI = lambda x: x.mass / (x.height_m.pow(2))
  ).head()

Unnamed: 0,height_m,BMI
0,1.72,26.027582
1,1.67,26.892323
2,0.96,34.722222
3,2.02,33.330066
4,1.5,21.777778


### Change column order with ```relocate()```

Use a syntax similar to ```select()``` to move blocks of columns at once:

In [21]:
starwars.pipe(pplyr.relocate, start="sex", end="homeworld", before="height").head()

Unnamed: 0,name,sex,gender,homeworld,height,mass,hair_color,skin_color,eye_color,birth_year,species
0,Luke Skywalker,male,masculine,Tatooine,172.0,77.0,blond,fair,blue,19.0,Human
1,C-3PO,none,masculine,Tatooine,167.0,75.0,,gold,yellow,112.0,Droid
2,R2-D2,none,masculine,Naboo,96.0,32.0,,"white, blue",red,33.0,Droid
3,Darth Vader,male,masculine,Tatooine,202.0,136.0,none,white,yellow,41.9,Human
4,Leia Organa,female,feminine,Alderaan,150.0,49.0,brown,light,brown,19.0,Human
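Moving a block of columns amounts to rebuilding the column list. The helper below, ```relocate_block```, is a hypothetical sketch of that logic in plain pandas and is not taken from pplyr; it assumes ```start``` and ```end``` name an inclusive range and that ```before``` lies outside that range.

```python
import pandas as pd


def relocate_block(df, start, end, before):
    # Hypothetical sketch: move the columns start..end (inclusive) so that
    # they sit immediately before the column named `before`.
    cols = list(df.columns)
    block = cols[cols.index(start): cols.index(end) + 1]
    rest = [c for c in cols if c not in block]
    pos = rest.index(before)
    return df[rest[:pos] + block + rest[pos:]]


people = pd.DataFrame({
    "name": ["Luke Skywalker"],
    "height": [172.0],
    "mass": [77.0],
    "sex": ["male"],
    "gender": ["masculine"],
    "homeworld": ["Tatooine"],
    "species": ["Human"],
})

moved = relocate_block(people, start="sex", end="homeworld", before="height")
```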


### Summarise values with ```summarise()```

The last verb is ```summarise()```. It collapses a data frame to a single row.  It’s not that useful until we learn the ```group_by()``` verb below.

In [22]:
starwars.pipe(pplyr.summarise, height = lambda x: x.height.mean())

Unnamed: 0,height
0,174.358025
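Collapsing a frame to one row can be sketched as follows. ```summarise_sketch``` is a hypothetical helper written only to illustrate the idea that each keyword names an output column and each lambda reduces the whole DataFrame to a single value:

```python
import pandas as pd


def summarise_sketch(df, **summaries):
    # Hypothetical sketch: call each function on the DataFrame and place the
    # scalar results side by side in a one-row DataFrame.
    return pd.DataFrame({name: [func(df)] for name, func in summaries.items()})


people = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
})

summary = people.pipe(
    summarise_sketch,
    n=lambda x: len(x),
    height=lambda x: x.height.mean(),
)
```

Note that ```Series.mean()``` skips missing values by default, which is why the full dataset yields a mean height even though some heights are missing.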


The following aliases also exist:

* summarize()

## Combining functions with pplyr.pipelines

As mentioned earlier, pplyr's ```pipeline``` object allows method chaining, which makes your code easier to read. An example of this is:

```
starwars.pipe(pplyr.pipeline()
  .group_by(["species", "sex"])
  .summarise(
    n = lambda x: len(x),
    height = lambda x: x.height.mean(),
    mass = lambda x: x.mass.mean()
  ))
```
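For readers who want to compare against plain pandas, a roughly equivalent grouped summary can be written with ```groupby()``` and named aggregation. This is a sketch on five rows of the dataset and makes no claim about how pplyr implements ```group_by()``` internally:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
    "mass": [77.0, 75.0, 32.0, 136.0, 49.0],
    "sex": ["male", "none", "none", "male", "female"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
})

by_group = (
    people.groupby(["species", "sex"])
    .agg(
        n=("name", "size"),         # number of rows in each group
        height=("height", "mean"),  # mean height per group
        mass=("mass", "mean"),      # mean mass per group
    )
    .reset_index()
)
```

One caveat when applying this to the full dataset: by default pandas drops rows whose group keys are missing, so characters without a recorded species or sex would not appear unless ```dropna=False``` is passed to ```groupby()```.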

## Patterns of operations

NOTE: This section was not copied because it is not relevant to pplyr.  It is only relevant in R where variable names are used as expressions.  In pplyr variable names are always quoted as strings since Python does not support R's expression syntax.