<a href="https://colab.research.google.com/github/dss5202-2410/Notebooks/blob/main/More_on_dplyr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# More on `dplyr`

`dplyr` is a library for the `R` language designed to make data analysis fast and easy.

In this section, we introduce a package that makes it possible to do `dplyr`-style data manipulation with pipes in Python on `pandas` DataFrames:

+ `dfply`

## Install and load package

In [None]:
!pip install dfply

`dfply` works directly on `pandas` DataFrames. It chains operations on data with the `>>` operator.

In [4]:
from dfply import *
diamonds >> head(3)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
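
To see how chaining with `>>` can work at all, here is a minimal sketch of the general mechanism using Python's reflected right-shift method (`__rrshift__`). It is an illustration only, not `dfply`'s actual code: the class `Pipe` and the helpers `keep` and `take` are hypothetical names, and plain lists of dictionaries stand in for DataFrames.

```python
class Pipe:
    """Wrap a function so that `data >> Pipe(func)` calls func(data)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python falls back to this method for `data >> self` when
        # `data` does not implement `>>` for this operand itself.
        return self.func(data)


def take(n=5):
    """Hypothetical pipeable step: keep the first n rows."""
    return Pipe(lambda rows: rows[:n])


def keep(*names):
    """Hypothetical pipeable step: keep only the named keys of each row."""
    return Pipe(lambda rows: [{k: row[k] for k in names} for row in rows])


# A few values copied by hand from the diamonds output above.
rows = [
    {"carat": 0.23, "cut": "Ideal", "price": 326},
    {"carat": 0.21, "cut": "Premium", "price": 326},
    {"carat": 0.23, "cut": "Good", "price": 327},
]

# Steps run left to right, each receiving the previous result.
result = rows >> keep("carat", "cut") >> take(2)
print(result)
```

Because `>>` is left-associative, each step receives the output of the step before it, which is what makes the chain read like a pipeline.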


We can chain piped operations and assign the output to a new DataFrame.

In [5]:
df1 = diamonds >> head(10)
df1

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
5,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
6,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
7,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
8,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
9,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


### The `X` DataFrame symbol

The DataFrame passing through the pipe is represented by the symbol `X`. For example, the following code selects two columns, `carat` and `cut`, from the original DataFrame (`diamonds`).

In [6]:
diamonds >> select(X.carat, X.cut) >> head(10)

Unnamed: 0,carat,cut
0,0.23,Ideal
1,0.21,Premium
2,0.23,Good
3,0.29,Premium
4,0.31,Good
5,0.24,Very Good
6,0.24,Very Good
7,0.26,Very Good
8,0.22,Fair
9,0.23,Very Good
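
One way to think about a symbol such as `X` is as a placeholder whose attribute access is recorded now and resolved later, once the data arrives. The sketch below illustrates that idea of delayed evaluation in plain Python. The names `Symbol` and `X_demo` are hypothetical, a dictionary of lists stands in for a DataFrame, and the real `dfply` implementation may differ.

```python
class Symbol:
    """Placeholder whose attribute access is resolved later."""

    def __getattr__(self, name):
        # `X_demo.carat` touches no data. It returns a function that
        # looks up the "carat" column once a table is supplied.
        return lambda table: table[name]


X_demo = Symbol()

# Dictionary of columns standing in for a DataFrame.
table = {
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
}

get_carat = X_demo.carat   # nothing is evaluated yet
print(get_carat(table))    # the lookup happens here
```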


### Selecting and dropping variables

There are two functions for selection, `select` and `drop`. These functions accept string labels, integer positions, and/or symbolically represented column names (`X.column_name`).

In [7]:
diamonds >> select(["color", "clarity"], 1, X.carat, X.cut) >> head()

Unnamed: 0,color,clarity,cut,carat
0,E,SI2,Ideal,0.23
1,E,SI1,Premium,0.21
2,E,VS1,Good,0.23
3,I,VS2,Premium,0.29
4,J,SI2,Good,0.31


The `drop` function does the opposite. It returns all columns except the ones specified.

In [8]:
diamonds >> drop(["color", "clarity"], 1, X.carat, X.cut) >> head()

Unnamed: 0,depth,table,price,x,y,z
0,61.5,55.0,326,3.95,3.98,2.43
1,59.8,61.0,326,3.89,3.84,2.31
2,56.9,65.0,327,4.05,4.07,2.31
3,62.4,58.0,334,4.2,4.23,2.63
4,63.3,58.0,335,4.34,4.35,2.75


One particularly nice thing about `dplyr` is that we can drop columns inside a `select()` call by putting a minus sign in front of the column name (`... %>% select(-col)`). `dfply` supports the same idea with the tilde operator (`~`).

For example, let's say we want to select any column except `carat` and `clarity`. One way to do this is to specify them for removal using the `~` operator.

In [10]:
diamonds >> select(~X.carat, ~X.clarity) >> head(2)

Unnamed: 0,cut,color,depth,table,price,x,y,z
0,Ideal,E,61.5,55.0,326,3.95,3.98,2.43
1,Premium,E,59.8,61.0,326,3.89,3.84,2.31
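
For comparison, the same selections can be written in plain `pandas`. This is a rough sketch, assuming a small hand-typed DataFrame (`df`) holding a few columns from the first three rows shown above rather than the full `diamonds` data.

```python
import pandas as pd

# A few columns from the first three diamonds rows, typed in by hand.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
    "price": [326, 326, 327],
})

# Roughly `select(X.carat, X.cut)`: keep only the listed columns.
selected = df[["carat", "cut"]]

# Roughly `drop(X.carat, X.clarity)` or `select(~X.carat, ~X.clarity)`:
# keep every column except the listed ones.
dropped = df.drop(columns=["carat", "clarity"])

print(list(selected.columns))
print(list(dropped.columns))
```

The piped form is mostly a matter of readability: each step names one operation, and the steps read left to right.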


### Selection filter functions

+ `starts_with(prefix)`: Find columns that start with a string prefix.

+ `ends_with(suffix)`: Find columns that end with a string suffix.

+ `contains(string)`: Find columns that contain a string in their name.

+ `everything()`: All columns.

+ `columns_between(start_col, end_col, inclusive = True)`: Find columns between a specified start and end column. The boolean `inclusive` argument (default `True`) controls whether the end column is included.

+ `columns_to(end_col, inclusive = True)`: Get columns up to a specified end column.

+ `columns_from(start_col)`: Get columns starting at a specified column.

Let's see some examples.

Let's say we want to select only the columns that start with "c".

In [11]:
diamonds >> select(starts_with("c")) >> head(2)

Unnamed: 0,carat,cut,color,clarity
0,0.23,Ideal,E,SI2
1,0.21,Premium,E,SI1


The selection filter functions work with the inversion operator (`~`) too. Let's say we want to remove only the columns that start with "c".

In [12]:
diamonds >> select(~starts_with("c")) >> head(2)

Unnamed: 0,depth,table,price,x,y,z
0,61.5,55.0,326,3.95,3.98,2.43
1,59.8,61.0,326,3.89,3.84,2.31
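
The positional filters listed above (`columns_between`, `columns_to`, `columns_from`) are not demonstrated in this section, so here is a plain-Python sketch of the semantics as described in the list. The helper names (`names_starting_with`, `names_between`, `names_to`, `names_from`) are hypothetical stand-ins that operate on a list of column names. They follow the descriptions above and are not `dfply`'s implementation.

```python
# Column order of the diamonds data as shown in the outputs above.
columns = ["carat", "cut", "color", "clarity", "depth",
           "table", "price", "x", "y", "z"]


def names_starting_with(cols, prefix):
    """Column names that start with the given prefix."""
    return [c for c in cols if c.startswith(prefix)]


def names_between(cols, start_col, end_col, inclusive=True):
    """Names from start_col up to end_col; end_col kept if inclusive."""
    start = cols.index(start_col)
    end = cols.index(end_col) + (1 if inclusive else 0)
    return cols[start:end]


def names_to(cols, end_col, inclusive=True):
    """Names from the first column up to end_col."""
    return names_between(cols, cols[0], end_col, inclusive)


def names_from(cols, start_col):
    """Names from start_col through the last column."""
    return cols[cols.index(start_col):]


print(names_starting_with(columns, "c"))
print(names_between(columns, "cut", "depth"))
print(names_between(columns, "cut", "depth", inclusive=False))
print(names_to(columns, "color"))
print(names_from(columns, "price"))
```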
