# Tutorial: Palmer Penguins Dataset

In this tutorial, we explore the Palmer Penguins dataset using the `tidyversetopandas` package, which brings R tidyverse-style data manipulation to pandas in Python. We'll demonstrate its key functions: `select`, `mutate`, `filter`, and `arrange`.


## Loading the Palmer Penguins Dataset

The Palmer Penguins dataset includes various measurements from three penguin species. It's ideal for demonstrating data manipulation techniques.

First, let's load the dataset into a pandas DataFrame:


In [19]:
# Load Penguins dataset
import pandas as pd
from tidyversetopandas import tidyversetopandas as ttp

penguins = pd.read_csv("penguins.csv")
penguins.head()

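|   | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |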


## Removing NAs with `filter`

Let's start by counting the missing (NA) values in each column.


In [2]:
penguins.isna().sum()

rowid                 0
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

There are 11 missing values in `sex`. Let's remove those rows using the `filter` function from `ttp` together with pandas' built-in `isnull` method.


In [3]:
newPenguins = ttp.filter(penguins, "~ sex.isnull()")
newPenguins.isna().sum()

rowid                0
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
year                 0
dtype: int64

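Looks great! We successfully removed all missing values from `sex`. The other columns are now complete as well, which indicates that the rows with missing measurements were among the rows missing `sex`.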
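For comparison, the same row removal can be sketched in plain pandas. The toy DataFrame below uses made-up values rather than the real dataset, and the two approaches shown are standard pandas idioms, not a description of how `ttp.filter` is implemented.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the penguins table (values are made up for illustration).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
        "body_mass_g": [3750.0, np.nan, 5000.0, 3500.0],
        "sex": ["male", np.nan, "female", np.nan],
    }
)

# Option 1: boolean mask that keeps rows where `sex` is present.
by_mask = toy[toy["sex"].notna()]

# Option 2: dropna restricted to the `sex` column.
by_dropna = toy.dropna(subset=["sex"])

# Count the remaining missing values per column.
print(by_mask.isna().sum())
```

Both options return a new DataFrame and leave `toy` untouched, so the result has to be assigned to a name, just as `newPenguins` is above.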


## Filtering species and size with `filter`

Next, let's limit our study to penguins whose `species` is "Adelie" and whose `body_mass_g` is greater than 3000 grams. The `filter` function is perfect for this job. We start with 333 rows and should see fewer rows after filtering.


In [4]:
print(newPenguins.shape)
newPenguins = ttp.filter(newPenguins, "species == 'Adelie' & body_mass_g > 3000")
print(newPenguins.shape)

(333, 9)
(138, 9)


The row count drops to 138, which is a good sign. Let's check the DataFrame to make sure only "Adelie" penguins remain and that every one is heavier than 3000 grams.


In [5]:
print(newPenguins.species.unique())
print(newPenguins.body_mass_g.min())

['Adelie']
3050.0


We did it again! The only species left is "Adelie", and the lightest penguin weighs 3050 grams.
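The condition string used above has the same shape as a pandas `DataFrame.query` expression. As a rough sketch on made-up toy data, and assuming only that the string is read the way `query` would read it, here is the combined condition written both as a query string and as an explicit boolean mask:

```python
import pandas as pd

# Toy data with made-up values, for illustration only.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
        "body_mass_g": [3750.0, 2900.0, 5000.0, 3050.0],
    }
)

# In a query string, comparisons bind tighter than `&`,
# so the two conditions need no parentheses.
by_query = toy.query("species == 'Adelie' & body_mass_g > 3000")

# With plain boolean masks, each comparison must be parenthesised,
# because Python's `&` binds more tightly than `==` and `>`.
by_mask = toy[(toy["species"] == "Adelie") & (toy["body_mass_g"] > 3000)]

print(by_query.shape)
```

Forgetting the parentheses in the mask version is a common source of errors, which is one reason a string-based filter can be more convenient.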


## Creating new columns with `mutate`

Now, let's create a new column called `body_mass_kg` that converts `body_mass_g` to kilograms. We can do this with the `mutate` function.


In [13]:
ttp.mutate(penguins, "body_mass_kg = body_mass_g / 1000")

penguins.head()

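|   | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | body_mass_kg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 | 3.75 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 | 3.8 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 | 3.25 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 | NaN |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | 3.45 |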


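The new `body_mass_kg` column now appears as the rightmost column. Note that the call above does not reassign `penguins`; the output shows that the column was added to the existing DataFrame.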
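For readers who want to compare against plain pandas, the sketch below derives the same kind of column with `assign` and with `DataFrame.eval`. The data is a made-up toy example, and unlike the `mutate` call shown above, both pandas calls return a new DataFrame instead of modifying the original.

```python
import pandas as pd

# Toy data with made-up values, for illustration only.
toy = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0]})

# assign returns a new DataFrame and leaves `toy` unchanged.
with_assign = toy.assign(body_mass_kg=toy["body_mass_g"] / 1000)

# eval accepts an assignment expression as a string; by default it
# also returns a new DataFrame rather than modifying `toy`.
with_eval = toy.eval("body_mass_kg = body_mass_g / 1000")

print(with_assign)
print("body_mass_kg" in toy.columns)  # the original is untouched
```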

### Converting all lengths to cm

Now, let us convert the `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm` columns to centimeters. We can do this with the `mutate` function as well.


In [14]:
ttp.mutate(penguins, "bill_length_cm = bill_length_mm / 10")
ttp.mutate(penguins, "bill_depth_cm = bill_depth_mm / 10")
ttp.mutate(penguins, "flipper_length_cm = flipper_length_mm / 10")

penguins.head()

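|   | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | body_mass_kg | bill_length_cm | bill_depth_cm | flipper_length_cm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 | 3.75 | 3.91 | 1.87 | 18.1 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 | 3.8 | 3.95 | 1.74 | 18.6 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 | 3.25 | 4.03 | 1.8 | 19.5 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 | NaN | NaN | NaN | NaN |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | 3.45 | 3.67 | 1.93 | 19.3 |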


The DataFrame now holds both the original and the converted columns, so the originals are redundant. Let's use the `select` function to keep only the columns we need.


## Selecting Columns with `select`

The `select` function in the `tidyversetopandas` package mirrors `dplyr`'s `select` in R: it returns a DataFrame containing only the columns you name. This keeps the dataset focused on the columns relevant to the current analysis and gives R users a familiar syntax in Python.


### Example 1: Selecting One Column

Let's try selecting one column:


In [15]:
# Selecting species
penguins_subset = ttp.select(penguins, "species")
penguins_subset.head()

|   | species |
| --- | --- |
| 0 | Adelie |
| 1 | Adelie |
| 2 | Adelie |
| 3 | Adelie |
| 4 | Adelie |


The output is a subset of the original penguins DataFrame containing only the `species` column.


### Example 2: Selecting Multiple Columns for Comparative Analysis

For a more detailed comparative analysis, let's select columns that would provide insight into the physical characteristics of the penguins. We'll choose species, bill length, bill depth, and body mass.


In [7]:
# Selecting species, bill_length_mm, bill_depth_mm, and body_mass_g columns
penguins_physical = ttp.select(
    penguins, "species", "bill_length_mm", "bill_depth_mm", "body_mass_g"
)
penguins_physical.head()

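|   | species | bill_length_mm | bill_depth_mm | body_mass_g |
| --- | --- | --- | --- | --- |
| 0 | Adelie | 39.1 | 18.7 | 3750.0 |
| 1 | Adelie | 39.5 | 17.4 | 3800.0 |
| 2 | Adelie | 40.3 | 18.0 | 3250.0 |
| 3 | Adelie | NaN | NaN | NaN |
| 4 | Adelie | 36.7 | 19.3 | 3450.0 |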


Here, the output is a DataFrame with just the four requested columns: `species`, `bill_length_mm`, `bill_depth_mm`, and `body_mass_g`. With the other columns out of the way, it is easier to compare the penguins' physical characteristics.
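As a point of reference, column selection in plain pandas uses a list of column names inside the indexing brackets. The sketch below uses a made-up toy DataFrame and illustrates the pandas idiom only; it does not show how `ttp.select` works internally.

```python
import pandas as pd

# Toy data with made-up values, for illustration only.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo"],
        "island": ["Torgersen", "Biscoe"],
        "bill_length_mm": [39.1, 49.2],
        "body_mass_g": [3750.0, 6300.0],
    }
)

# One column: double brackets keep the result a DataFrame rather than a Series.
species_only = toy[["species"]]

# Several columns: the list order determines the column order in the result.
physical = toy[["species", "body_mass_g", "bill_length_mm"]]

print(list(physical.columns))
```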

### Example 3: Only selecting the columns we want to keep after `mutate`

Earlier in the mutate section, we created a few new columns. Let's use `select` to remove the old columns.


In [18]:
penguins = ttp.select(
    penguins,
    "species",
    "bill_length_cm",
    "bill_depth_cm",
    "flipper_length_cm",
    "body_mass_kg",
)

penguins.head()

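|   | species | bill_length_cm | bill_depth_cm | flipper_length_cm | body_mass_kg |
| --- | --- | --- | --- | --- | --- |
| 0 | Adelie | 3.91 | 1.87 | 18.1 | 3.75 |
| 1 | Adelie | 3.95 | 1.74 | 18.6 | 3.8 |
| 2 | Adelie | 4.03 | 1.8 | 19.5 | 3.25 |
| 3 | Adelie | NaN | NaN | NaN | NaN |
| 4 | Adelie | 3.67 | 1.93 | 19.3 | 3.45 |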


Now our penguins DataFrame has only the columns we want to keep.


## Sorting the data using `arrange`

In pandas, sorting is done with `DataFrame.sort_values()`. In R's tidyverse, the equivalent `dplyr` function is `arrange()`, which takes the dataset as its first argument, followed by the names of the columns to sort by. To make sorting friendlier for people used to R, `ttp.arrange` wraps the pandas function and follows the structure of `arrange()`. In the calls below, the second argument appears to set the sort direction: `False` gives descending order and `True` gives ascending order.

To see how it works, let's sort the Palmer Penguins dataset to find the heaviest penguins.


In [21]:
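# `penguins` was narrowed by `select` above, so reload the full dataset first
penguins = pd.read_csv("penguins.csv")
penguins_sorted = ttp.arrange(penguins, False, "body_mass_g")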
penguins_sorted.head()

|   | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 169 | 170 | Gentoo | Biscoe | 49.2 | 15.2 | 221.0 | 6300.0 | male | 2007 |
| 185 | 186 | Gentoo | Biscoe | 59.6 | 17.0 | 230.0 | 6050.0 | male | 2007 |
| 269 | 270 | Gentoo | Biscoe | 48.8 | 16.2 | 222.0 | 6000.0 | male | 2009 |
| 229 | 230 | Gentoo | Biscoe | 51.1 | 16.3 | 220.0 | 6000.0 | male | 2008 |
| 263 | 264 | Gentoo | Biscoe | 49.8 | 15.9 | 229.0 | 5950.0 | male | 2009 |


Above, we can see that the five heaviest penguins are all male Gentoo penguins from Biscoe island.


We can also sort by multiple columns with `arrange`. The data is ordered by the first column, and ties are broken by the next one. Suppose we want the penguin with the shortest bill length, using bill depth as the tie-breaker:


In [22]:
penguins_small_bill = ttp.arrange(penguins, True, "bill_length_mm", "bill_depth_mm")
penguins_small_bill.head(1)

|   | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 142 | 143 | Adelie | Dream | 32.1 | 15.5 | 188.0 | 3050.0 | female | 2009 |
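To make the wrapping idea concrete, here is a minimal sketch of what such a wrapper around `sort_values` could look like. The function name `arrange_sketch` is hypothetical, the call shape simply copies the one used above, and the toy data is made up; this is not the package's actual implementation.

```python
import pandas as pd


def arrange_sketch(df, ascending, *columns):
    """Hypothetical stand-in that mirrors the call shape used above."""
    # sort_values takes the sort keys as a list and returns a new DataFrame.
    return df.sort_values(by=list(columns), ascending=ascending)


# Toy data with made-up values, for illustration only.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
        "bill_length_mm": [32.1, 49.2, 32.1, 40.9],
        "bill_depth_mm": [15.5, 15.2, 14.0, 16.6],
    }
)

# Ascending sort: ties on bill_length_mm are broken by bill_depth_mm.
smallest = arrange_sketch(toy, True, "bill_length_mm", "bill_depth_mm")

# Descending sort on a single column.
largest = arrange_sketch(toy, False, "bill_length_mm")

print(smallest.head(1))
```

The original row labels are preserved after sorting, which is why the outputs above show index values such as 169 and 142 rather than 0.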


## Putting it all together with `pipe`

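We can chain all four functions in a single expression using the `pipe` method of pandas DataFrames, which passes the DataFrame as the first argument to the given function. This is similar to the `%>%` operator used throughout R's tidyverse.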


In [24]:
penguins_subset2 = (
    penguins.pipe(ttp.filter, "~ sex.isnull()")
    .pipe(ttp.mutate, "body_mass_kg = body_mass_g / 1000")
    .pipe(ttp.select, "species", "island", "body_mass_kg")
    .pipe(ttp.arrange, False, "body_mass_kg")
)

penguins_subset2.head()

|   | species | island | body_mass_kg |
| --- | --- | --- | --- |
| 169 | Gentoo | Biscoe | 6.3 |
| 185 | Gentoo | Biscoe | 6.05 |
| 269 | Gentoo | Biscoe | 6.0 |
| 229 | Gentoo | Biscoe | 6.0 |
| 263 | Gentoo | Biscoe | 5.95 |
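The sketch below illustrates the mechanics of `pipe` on made-up toy data, using only plain pandas. The helper `add_kg` is hypothetical and exists only to show that `df.pipe(f, a, b)` is the same as `f(df, a, b)`; the chained expression then repeats the four steps (drop missing `sex`, add a column, select, sort) with pandas methods.

```python
import numpy as np
import pandas as pd


def add_kg(df, source, target):
    """Hypothetical helper: add `target` as `source` divided by 1000."""
    return df.assign(**{target: df[source] / 1000})


# Toy data with made-up values, for illustration only.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
        "island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
        "body_mass_g": [3750.0, 6300.0, 5000.0, 3500.0],
        "sex": ["male", "male", np.nan, "female"],
    }
)

# pipe hands the DataFrame to the function as its first argument.
piped = toy.pipe(add_kg, "body_mass_g", "body_mass_kg")
direct = add_kg(toy, "body_mass_g", "body_mass_kg")

# The same four steps as one chained expression.
result = (
    toy.dropna(subset=["sex"])
    .pipe(add_kg, "body_mass_g", "body_mass_kg")
    .loc[:, ["species", "island", "body_mass_kg"]]
    .sort_values("body_mass_kg", ascending=False)
)

print(result)
```

Because each step returns a new DataFrame, the chain reads top to bottom in the order the operations are applied, much like a `%>%` pipeline in R.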
