# Tutorial: Analyzing the Palmer Penguins Dataset with `tidyversetopandas`

In this tutorial, we explore the Palmer Penguins dataset using the `tidyversetopandas` package, which brings tidyverse-style data manipulation from R to pandas. We'll demonstrate its key functions: `select`, `mutate`, `filter`, and `arrange`.

### Loading the Palmer Penguins Dataset

The Palmer Penguins dataset includes various measurements from three penguin species. It's ideal for demonstrating data manipulation techniques.

First, let's load the dataset into a pandas DataFrame:

```python
# Load Penguins dataset
import pandas as pd
from tidyversetopandas import tidyversetopandas as ttp

penguins = pd.read_csv('penguins.csv')
print(penguins.head())
```

```text
   rowid species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0      1  Adelie  Torgersen            39.1           18.7              181.0
1      2  Adelie  Torgersen            39.5           17.4              186.0
2      3  Adelie  Torgersen            40.3           18.0              195.0
3      4  Adelie  Torgersen             NaN            NaN                NaN
4      5  Adelie  Torgersen            36.7           19.3              193.0

   body_mass_g     sex  year
0       3750.0    male  2007
1       3800.0  female  2007
2       3250.0  female  2007
3          NaN     NaN  2007
4       3450.0  female  2007
```
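
If `penguins.csv` is not at hand, the five rows printed above are enough to follow along. The sketch below rebuilds them by hand with plain pandas and counts the missing values per column. The name `penguins_head` is only illustrative, and the frame is a stand-in for the full dataset, not a replacement for it.

```python
import pandas as pd

nan = float("nan")

# The five rows printed above, rebuilt by hand as a stand-in for penguins.csv.
penguins_head = pd.DataFrame({
    "rowid": [1, 2, 3, 4, 5],
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, nan, 3450.0],
    "sex": ["male", "female", "female", None, "female"],
    "year": [2007] * 5,
})

# Count missing values per column; the fourth row has no measurements at all.
missing_per_column = penguins_head.isna().sum()
print(missing_per_column)
```

Checking for missing values early is worthwhile here, because the `NaN` row carries through every selection in the examples that follow.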


### Selecting Columns with `select`

#### Example 1: Basic Column Selection

Let's focus on a few relevant columns: species, island, and flipper length.

```python
# Selecting species, island, and flipper_length_mm columns
penguins_subset = ttp.select(penguins, 'species', 'island', 'flipper_length_mm')
print(penguins_subset.head())
```

```text
  species     island  flipper_length_mm
0  Adelie  Torgersen              181.0
1  Adelie  Torgersen              186.0
2  Adelie  Torgersen              195.0
3  Adelie  Torgersen                NaN
4  Adelie  Torgersen              193.0
```


The output is a subset of the original `penguins` DataFrame containing only `species`, `island`, and `flipper_length_mm`: each penguin's species and island alongside its flipper length in millimeters.
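
For comparison, the same kind of subset can be taken with plain pandas indexing. This is a minimal sketch that assumes `ttp.select` simply returns the named columns in the order given, which is what the output above suggests. It uses a small hand-built frame (`penguins_demo`, an illustrative name) with values copied from that output, so it runs without `penguins.csv`.

```python
import pandas as pd

# Small stand-in frame; values copied from the first rows printed above.
penguins_demo = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "flipper_length_mm": [181.0, 186.0, 195.0],
})

# Plain-pandas column selection: pass a list of column names in brackets.
columns = ["species", "island", "flipper_length_mm"]
subset = penguins_demo[columns]
print(subset)
```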

#### Example 2: Selecting Multiple Columns for Comparative Analysis

For a more detailed comparative analysis, let's select columns that would provide insight into the physical characteristics of the penguins. We'll choose species, bill length, bill depth, and body mass.

```python
# Selecting species, bill_length_mm, bill_depth_mm, and body_mass_g columns
penguins_physical = ttp.select(penguins, 'species', 'bill_length_mm', 'bill_depth_mm', 'body_mass_g')
print(penguins_physical.head())
```

```text
  species  bill_length_mm  bill_depth_mm  body_mass_g
0  Adelie            39.1           18.7       3750.0
1  Adelie            39.5           17.4       3800.0
2  Adelie            40.3           18.0       3250.0
3  Adelie             NaN            NaN          NaN
4  Adelie            36.7           19.3       3450.0
```


Here, the output contains a different set of columns: `species`, `bill_length_mm`, `bill_depth_mm`, and `body_mass_g`. Together they give the species label plus the three physical measurements we want to compare.
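
One way such a comparison might proceed is a per-species average of each measurement. The sketch below uses plain pandas `groupby` on the five rows shown above, rebuilt by hand under the illustrative name `physical_head`. Those rows are all Adelie, so the result has a single row; on the full dataset the same call would give one row per species.

```python
import pandas as pd

nan = float("nan")

# The five rows printed above, rebuilt by hand (all Adelie).
physical_head = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, nan, 19.3],
    "body_mass_g": [3750.0, 3800.0, 3250.0, nan, 3450.0],
})

# Mean of each measurement per species; pandas skips NaN values by default.
species_means = physical_head.groupby("species").mean()
print(species_means)
```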

#### Example 3: Selecting columns with `.pipe()`

The same selection of `species`, `island`, and `flipper_length_mm` can also be written with pandas' `.pipe()` method, which is convenient when chaining several steps.

```python
# Selecting species, island, and flipper_length_mm columns using `.pipe()`
penguins_subset2 = penguins.pipe(ttp.select, 'species', 'island', 'flipper_length_mm')
print(penguins_subset2.head())
```

```text
  species     island  flipper_length_mm
0  Adelie  Torgersen              181.0
1  Adelie  Torgersen              186.0
2  Adelie  Torgersen              195.0
3  Adelie  Torgersen                NaN
4  Adelie  Torgersen              193.0
```


This output is identical to that of Example 1. `penguins.pipe(ttp.select, ...)` passes `penguins` as the first argument to `ttp.select`, so the call is equivalent to `ttp.select(penguins, ...)` and returns the same three columns (`species`, `island`, and `flipper_length_mm`).
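
To make the mechanics of `.pipe()` concrete, here is a minimal sketch that does not depend on `tidyversetopandas`. The function `pick_columns` is a hypothetical stand-in for `ttp.select`, written only for illustration, and `penguins_demo` is a hand-built frame with values copied from the output above.

```python
import pandas as pd


def pick_columns(df, *columns):
    """Hypothetical stand-in for ttp.select: keep only the named columns."""
    return df[list(columns)]


nan = float("nan")

# Hand-built frame; values copied from the rows printed above.
penguins_demo = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "flipper_length_mm": [181.0, 186.0, 195.0, nan, 193.0],
})

# df.pipe(func, *args) calls func(df, *args), so these two are equivalent.
direct = pick_columns(penguins_demo, "species", "island")
piped = penguins_demo.pipe(pick_columns, "species", "island")
print(direct.equals(piped))

# .pipe() lets a custom function sit in a chain of ordinary DataFrame methods.
chained = (
    penguins_demo
    .pipe(pick_columns, "species", "flipper_length_mm")
    .dropna()
    .sort_values("flipper_length_mm", ascending=False)
)
print(chained)
```

The chained form reads top to bottom in the order the steps run, which is the main reason to prefer `.pipe()` over nested function calls.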