# **Practical Activity Notebook**

---
###### ${By \ Vinicius\ Franceschini}$




> The goal of this notebook is to provide students with the opportunity to explore unknown datasets autonomously and efficiently. \
\
Using a dataset on penguin morphology, we will first present the biological context of the dataset, followed by a guided exploration of the dataset. \
\
The initial idea for this 4th day of the workshop is to let the students formulate their own questions about the dataset and write the code needed to answer them. We are not sure this will be possible, so we will develop the notebook with that intention while also preparing for questions that may arise during the day.

---
## About the Dataset

The present dataset was originally compiled in a study by [GORMAN, WILLIAMS & FRASER (2014)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081).

In summary, the authors conducted a survey of morphological characteristics of three species of penguins on three different islands in the Palmer Archipelago, Antarctica.

Below, we see a representation of the three species:

![](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)
<p align="center">
Penguins of the Palmer Archipelago. Art by <a href="https://allisonhorst.github.io/">Allison Horst</a>.
</p>

## Getting to Know the Dataset

> This part will be done together with the students.

In [None]:
# Import the file-upload helper from google.colab and upload the dataset file
from google.colab import files
_ = files.upload()

Saving dataset_pinguins.csv to dataset_pinguins.csv


First, let's read the .csv file of the dataset with Pandas and then use the `.head()` method for an initial exploration of the dataset.

In [None]:
import pandas as pd
dataset = pd.read_csv('/content/dataset_pinguins.csv')
dataset.head()

| Unnamed: 0 | especie | ilha | tamanho_bico_mm | altura_bico_mm | tamanho_nadadeira_mm | massa_corporea_g | sexo | ano_coleta |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | macho | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | femea | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | femea | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | femea | 2007 |


Given the context of the dataset, we can see that each row represents a penguin observed by the authors.

In the previous cell, we observe that, in addition to cataloging the species, sex, and island of each penguin, the authors also measured morphological characteristics such as bill length and height, flipper length, and body mass.

Note that the unit of measurement for these characteristics is annotated in the column names of the file (mm and g).
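The fourth row of the preview has no measurements at all, so it is worth checking for missing values before asking questions. The sketch below is only an illustration: it rebuilds the five rows shown by `.head()` as a stand-in called `sample` (a name we made up, and without the extra `Unnamed: 0` column), so the numbers it prints describe those five rows and not the full file.

```python
import io
import pandas as pd

# Stand-in for the uploaded file: only the five rows shown by .head() above.
csv_text = """especie,ilha,tamanho_bico_mm,altura_bico_mm,tamanho_nadadeira_mm,massa_corporea_g,sexo,ano_coleta
Adelie,Torgersen,39.1,18.7,181.0,3750.0,macho,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,femea,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,femea,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,femea,2007
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Data type of each column (measurements are read as floating-point numbers)
print(sample.dtypes)

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data, the same two calls can be applied directly to `dataset`.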


---
The process of data analysis is primarily carried out through asking and answering questions based on a dataset.

For this reason, in this notebook, we will formulate and answer some questions about the dataset.

Before we proceed, it is important to remember two things at this moment:

1. **There are various ways to arrive at an answer.** Some approaches are more intuitive for some people than for others; what matters is reaching the answer. Google is a great ally here! It is crucial to know how to search for what we do not know and, even more importantly, to understand the answers we find. We can also use official package documentation. The [Pandas documentation](https://pandas.pydata.org/docs/), for instance, is quite comprehensive and can help us find answers to our questions.


2. **There is an appropriate way to answer each question.** Knowing that there are various types of variables in Python, it is crucial to consider what type of response is expected for a question. For example, the question "_How many islands were cataloged in the study?_" is better answered with an integer. However, there are questions (usually those inquiring about relationships between variables) whose response is better represented by graphs. Therefore, understanding the type of response your question can generate helps guide and focus your analysis.

We will start with more basic questions:

### How many penguins did the authors observe?

We expect an integer as the answer.

Considering that each row represents a penguin, we can simply count the number of rows in the DataFrame. One way to do this is by using the `len()` function.

In [None]:
# How many penguins did the authors observe?
len(dataset)

344
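As a small illustration of "various ways to arrive at an answer", the row count can also be read from the `shape` attribute or from the length of the index. The sketch below uses a hypothetical five-row stand-in named `sample`, so it prints 5 rather than 344.

```python
import pandas as pd

# Hypothetical five-row stand-in for `dataset` (the real file has 344 rows)
sample = pd.DataFrame({
    'especie': ['Adelie'] * 5,
    'ilha': ['Torgersen'] * 5,
})

n_len = len(sample)          # number of rows
n_shape = sample.shape[0]    # shape is a (rows, columns) tuple
n_index = len(sample.index)  # length of the row index

print(n_len, n_shape, n_index)
```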

### How many penguins were observed on each island?

Note that, for this question, we expect three numbers (one count per island) as the answer. Since we are working with Pandas, we can be more specific: we expect a `Series`.

Looking at the [Pandas documentation](https://pandas.pydata.org/docs/), we see that the `value_counts()` method returns exactly what we are expecting for a given column. So, we can apply it to the `'ilha'` (island) column, selected with the `dataset['ilha']` notation.

In [None]:
dataset['ilha'].value_counts()

Biscoe       168
Dream        124
Torgersen     52
Name: ilha, dtype: int64
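The earlier example question, "How many islands were cataloged in the study?", expects a single integer instead of a `Series`. A minimal sketch of the difference, assuming a stand-in column rebuilt from the three counts printed above (the name `ilha_stand_in` is ours, not part of the dataset):

```python
import pandas as pd

# Rebuild an island column from the counts printed above (stand-in, not the real file)
island_counts = {'Biscoe': 168, 'Dream': 124, 'Torgersen': 52}
ilha_stand_in = pd.Series(
    [island for island, n in island_counts.items() for _ in range(n)],
    name='ilha',
)

# A Series answer: one count per island
counts = ilha_stand_in.value_counts()
print(counts)

# An integer answer: how many distinct islands appear
n_islands = ilha_stand_in.nunique()
print(n_islands)

# The same counts expressed as fractions of all penguins
shares = ilha_stand_in.value_counts(normalize=True)
print(shares)
```

With the real data, `dataset['ilha'].nunique()` and `dataset['ilha'].value_counts(normalize=True)` follow the same pattern.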

Another way to answer this question would be by using the `.groupby()` method followed by `.size()`. Essentially, in this approach, we would group the penguins by their island, and, in the end, we would see the size of each group (i.e., the number of penguins observed on each island).

The `groupby()` method is very powerful, as we will see shortly, and it can be used in conjunction with various other methods and functions (such as `mean()`, `sum()`, `count()`, and others; see the Pandas documentation for more information).

In [None]:
dataset.groupby('ilha').size()

ilha
Biscoe       168
Dream        124
Torgersen     52
dtype: int64
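To hint at what `groupby()` can do beyond `size()`, the sketch below combines it with `count()` and `mean()`. It is only an illustration on a stand-in (`sample`, our own name) built from the five rows shown by `.head()` earlier, so the values describe those rows and not the full dataset. It also shows one difference worth knowing: `size()` counts rows, while `count()` counts only non-missing values.

```python
import io
import pandas as pd

# Stand-in for the uploaded file: only the five rows shown by .head() earlier.
csv_text = """especie,ilha,tamanho_bico_mm,altura_bico_mm,tamanho_nadadeira_mm,massa_corporea_g,sexo,ano_coleta
Adelie,Torgersen,39.1,18.7,181.0,3750.0,macho,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,femea,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,femea,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,femea,2007
"""
sample = pd.read_csv(io.StringIO(csv_text))

by_island = sample.groupby('ilha')

# Rows per island (the row with missing measurements is included)
rows_per_island = by_island.size()

# Penguins per island that actually have a body mass recorded
measured_per_island = by_island['massa_corporea_g'].count()

# Mean body mass per island (missing values are skipped)
mean_mass_per_island = by_island['massa_corporea_g'].mean()

print(rows_per_island)
print(measured_per_island)
print(mean_mass_per_island)
```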

### How many penguins of each species were observed?

Again, we expect a response of type `Series` to this question, containing a count value of penguins for each of the three species.

To obtain the answer, we can use the same procedure we used to answer the previous question (`groupby()` or `value_counts()`). Let's use the `groupby()` method, mainly for its greater versatility.

In [None]:
dataset.groupby('especie').size()

especie
Adelie       152
Chinstrap     68
Gentoo       124
dtype: int64

From this `Series`, we can see that Adelie penguins were the most observed in the study. Next, we have Gentoo, and finally, Chinstrap.



### How many penguins of each species were observed on each island?

For this question, we also expect a `Series`-type response, showing the count of each penguin species on each island.

We will again use the `groupby()` method followed by `size()`. This time, however, instead of grouping the penguins only by their island, we will group them by both their island and species.

In [None]:
dataset.groupby(['ilha', 'especie']).size()

ilha       especie  
Biscoe     Adelie        44
           Gentoo       124
Dream      Adelie        56
           Chinstrap     68
Torgersen  Adelie        52
dtype: int64

From this `Series`, we can notice several things.

* On Biscoe Island, both Adelie and Gentoo penguins were observed, with Gentoo in larger numbers.
* On Dream Island, both Adelie and Chinstrap penguins were observed, with Chinstrap in larger numbers.
* On Torgersen Island, only Adelie penguins were observed.
* Gentoo penguins were observed only on Biscoe Island.
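Observations like these can be easier to read from a table with islands as rows and species as columns. One possible way to get there is `unstack()` on the grouped counts, or `pd.crosstab()`. The sketch below assumes a stand-in DataFrame rebuilt from the five counts printed above (`stand_in` is our own name and has only the two grouping columns).

```python
import pandas as pd

# Counts copied from the groupby output above: (island, species) -> penguins
counts = {
    ('Biscoe', 'Adelie'): 44,
    ('Biscoe', 'Gentoo'): 124,
    ('Dream', 'Adelie'): 56,
    ('Dream', 'Chinstrap'): 68,
    ('Torgersen', 'Adelie'): 52,
}

# One row per penguin, keeping only the two grouping columns (stand-in data)
rows = [key for key, n in counts.items() for _ in range(n)]
stand_in = pd.DataFrame(rows, columns=['ilha', 'especie'])

# Same grouping as before, then pivot species into columns;
# combinations that were never observed become 0
grouped = stand_in.groupby(['ilha', 'especie']).size()
table = grouped.unstack(fill_value=0)
print(table)

# pd.crosstab builds the same island-by-species table in one call
same_table = pd.crosstab(stand_in['ilha'], stand_in['especie'])
print(same_table)
```

In the resulting table, a zero marks a species that was not observed on that island.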






## Hands-On

The study that originated this dataset drew conclusions about **sexual dimorphism** in these penguin species. In general, sexual dimorphism is any morphological difference (except those of the reproductive organs) between individuals of different sexes of the same species.

From this point forward, this notebook will be completed by you. Try to formulate questions whose answers can provide insights into sexual dimorphism in these penguin species.

Remember to anticipate the most appropriate type of response for your question!

Also, do not hesitate to seek our help or search for solutions on the internet.

* If your answer involves a Pandas structure, perhaps the [Pandas documentation](https://pandas.pydata.org/docs/) can assist you.
* If your answer involves a graph, maybe this [Python graph gallery](https://www.python-graph-gallery.com/) can be helpful.

## Examples of Questions

* In Gentoo penguins, does the flipper size vary with sex?
* For which of the three species does body mass vary with sex?
* What morphological characteristics are larger in Gentoo penguins?
* What morphological characteristics are larger in Chinstrap penguins?
* What morphological characteristics are larger in Adelie penguins?
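If you are unsure where to start, one possible pattern for questions like these is to group by species and sex and then summarize a measurement. The sketch below is only a starting point under a strong assumption: it uses a stand-in (`sample`, our own name) with the five rows shown by `.head()` earlier, so its output says nothing about dimorphism in the real data. Try the same idea on `dataset` and decide whether a table or a graph answers your question better.

```python
import io
import pandas as pd

# Stand-in for the uploaded file: only the five rows shown by .head() earlier.
csv_text = """especie,ilha,tamanho_bico_mm,altura_bico_mm,tamanho_nadadeira_mm,massa_corporea_g,sexo,ano_coleta
Adelie,Torgersen,39.1,18.7,181.0,3750.0,macho,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,femea,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,femea,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,femea,2007
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Flipper length by species and sex: how many birds were measured and their mean.
# Rows with a missing sex are left out of the groups by default.
summary = (
    sample.groupby(['especie', 'sexo'])['tamanho_nadadeira_mm']
    .agg(['count', 'mean'])
)
print(summary)
```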