# Jupyter Notebook: 03_pandas_basics.ipynb

# Pandas Basics

Welcome! Now that you know NumPy, let's learn **Pandas**, the library for data manipulation and analysis.

---

## Table of Contents

1. What is Pandas?
2. Series and DataFrames
3. Reading and Writing Data
4. Selecting and Filtering Data
5. Operations on Data
6. Mini-Exercises

---

# 1. What is Pandas?

**Pandas** is a Python library that makes it easy to work with structured data (tables).

First, let's import it:

```python
import pandas as pd
```

---

# 2. Series and DataFrames

## Series
A **Series** is a one-dimensional labeled array.

```python
# Create a Series
s = pd.Series([10, 20, 30, 40])
print(s)
```
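The "labeled" part of that definition is easier to see with explicit labels. The sketch below assumes hypothetical city names as the index (they are not part of the example above) and looks up the same value by label and by position.

```python
import pandas as pd

# Hypothetical labels, chosen only for illustration
temps = pd.Series([10, 20, 30, 40], index=["Oslo", "Paris", "Rome", "Cairo"])

by_label = temps["Rome"]       # look up by label
by_position = temps.iloc[2]    # look up by integer position (0-based)
print(by_label, by_position)
```

When no index is given, as in the example above, Pandas falls back to the default labels 0, 1, 2, and so on.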

## DataFrames
A **DataFrame** is a two-dimensional labeled data structure (like a spreadsheet).

```python
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}

df = pd.DataFrame(data)
print(df)
```

---

# 3. Reading and Writing Data

## Reading CSV files

```python
# Read a CSV file (replace the placeholder path with the path to a real file)
df = pd.read_csv('path/to/your/file.csv')
print(df.head())
```

> (We'll use built-in small examples for now. No need for a real file yet.)

## Writing CSV files

```python
# Save a DataFrame to CSV
df.to_csv('my_data.csv', index=False)
```
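If you want to try reading and writing without a file on disk, one option is an in-memory text buffer. This is only a sketch: the CSV text and the name `people` are made up for the illustration.

```python
import io
import pandas as pd

# An in-memory text buffer stands in for a real file on disk
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,Paris\n"
people = pd.read_csv(io.StringIO(csv_text))

# With no path given, to_csv() returns the CSV content as a string
round_trip = people.to_csv(index=False)
print(round_trip)
```

Passing `index=False` keeps the row labels out of the output, which is why the text that comes back has the same three columns as the text that went in.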

---

# 4. Selecting and Filtering Data

## Selecting Columns

```python
# Select one column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'City']])
```

## Selecting Rows

```python
# Select by integer position (0-based)
print(df.iloc[0])  # First row

# Select by label (requires a set index)
# df.set_index('Name', inplace=True)
# print(df.loc['Alice'])
```
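The commented-out lines above change `df` in place. A gentler variant, sketched here on a fresh copy of the same data (the names `people` and `by_name` are just for this illustration), keeps the original DataFrame untouched.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Paris", "London"],
})

# set_index() returns a new DataFrame; `people` itself is left unchanged
by_name = people.set_index("Name")

print(by_name.loc["Alice"])        # one row, selected by label
print(by_name.loc["Bob", "City"])  # one cell: row label, then column label
```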

## Filtering Rows

```python
# People older than 28
older_than_28 = df[df['Age'] > 28]
print(older_than_28)
```

---

# 5. Operations on Data

## Basic Operations

```python
# Mean age
print(df['Age'].mean())
```

## Adding Columns

```python
# Add a new column: age in 5 years
df['Age_in_5_years'] = df['Age'] + 5
print(df)
```

## Grouping Data

```python
# Group by city and get average age
print(df.groupby('City')['Age'].mean())
```
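In the small table above every city appears once, so each group mean is just that person's age. The sketch below uses made-up data with repeated cities to show the averaging at work.

```python
import pandas as pd

# Hypothetical data with repeated cities, so that each group holds two rows
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 35, 40],
    "City": ["Paris", "Paris", "London", "London"],
})

mean_age = people.groupby("City")["Age"].mean()
print(mean_age)
```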

---

# 6. Mini-Exercises

### 6.1 Create a DataFrame with your own data (3 columns, 5 rows)

```python
# Your code here
data = {
    'Animal': ['Dog', 'Cat', 'Rabbit', 'Hamster', 'Bird'],
    'Age': [5, 3, 2, 1, 4],
    'Type': ['Mammal', 'Mammal', 'Mammal', 'Mammal', 'Bird']
}

pets = pd.DataFrame(data)
print(pets)
```

### 6.2 Select only the animals older than 2 years

```python
# Your code here
older_pets = pets[pets['Age'] > 2]
print(older_pets)
```

### 6.3 Group your data by type and calculate the average age

```python
# Your code here
avg_age_by_type = pets.groupby('Type')['Age'].mean()
print(avg_age_by_type)
```

---

# Congratulations! 🎉

You've learned the basics of **Pandas**!

Next, we'll move on to **visualizing data** with **Matplotlib**!

---

# Quick Recap
- **Series** = 1D data; **DataFrame** = 2D data.
- Load data with `read_csv()`, save with `to_csv()`.
- Select columns and rows easily.
- Filter, group, and calculate statistics.

See you in the next notebook!


# Pandas

Pandas stands for panel data (so not the animal 🐼). It offers a more user-friendly way of working with tabular data than plain NumPy arrays.

If you want to learn more about it, visit [https://pandas.pydata.org/](https://pandas.pydata.org/).

In [1]:
# Import dependencies
import pandas as pd

# Creating a pandas dataframe

In [None]:
# Create a Series
my_series = pd.Series(data=[10, 20, 30, 40, 180])

print(my_series)
print(type(my_series))

0     10
1     20
2     30
3     40
4    180
dtype: int64
<class 'pandas.core.series.Series'>


In [11]:
print(my_series.skew())
print(my_series.kurtosis())

2.095594125462855
4.502672300647195


In [None]:
# Create a Series of strings
my_series2 = pd.Series(data=["Biscoe", "Biscoe", "Dream", "Dream", "Dream"])

print(my_series2)
print(type(my_series2))

0    Biscoe
1    Biscoe
2     Dream
3     Dream
4     Dream
dtype: object
<class 'pandas.core.series.Series'>


In [17]:
print(my_series2.max())
print(my_series2.min())

Dream
Biscoe


In [18]:
# Create a small dictionary (you should know how to do this by now)
penguins_dico = {
    "island": ["Biscoe", "Dream", "Dream"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0]
    }

print(penguins_dico)

{'island': ['Biscoe', 'Dream', 'Dream'], 'bill_length_mm': [39.1, 39.5, 40.3], 'bill_depth_mm': [18.7, 17.4, 18.0]}


In [19]:
penguins_df = pd.DataFrame(data=penguins_dico)
print(penguins_df)

   island  bill_length_mm  bill_depth_mm
0  Biscoe            39.1           18.7
1   Dream            39.5           17.4
2   Dream            40.3           18.0


In [23]:
penguins_df.values

array([['Biscoe', 39.1, 18.7],
       ['Dream', 39.5, 17.4],
       ['Dream', 40.3, 18.0]], dtype=object)

In [24]:
type(penguins_df.values)

numpy.ndarray

In [20]:
# Have a look at the values attribute
print(penguins_df.values)

# Print a separator line
print("-" * 50)

# Have a look at the type of object it is
print(type(penguins_df.values))
print(penguins_df.shape)
print(penguins_df.dtypes)

[['Biscoe' 39.1 18.7]
 ['Dream' 39.5 17.4]
 ['Dream' 40.3 18.0]]
--------------------------------------------------
<class 'numpy.ndarray'>
(3, 3)
island             object
bill_length_mm    float64
bill_depth_mm     float64
dtype: object
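The `dtype=object` in the array above comes from mixing text and numbers in a single NumPy array. As a small sketch on the same three rows (the names `mixed` and `numeric` are only for this illustration), selecting the numeric columns first gives a regular float array.

```python
import pandas as pd

penguins = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Dream"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

# Text and numbers together: NumPy falls back to generic Python objects
mixed = penguins.to_numpy()

# Numeric columns only: a plain float array
numeric = penguins[["bill_length_mm", "bill_depth_mm"]].to_numpy()
print(mixed.dtype, numeric.dtype)
```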


In [21]:
# So you still have access to familiar methods
penguins_df.sum(axis=0)

island            BiscoeDreamDream
bill_length_mm               118.9
bill_depth_mm                 54.1
dtype: object

# Better indexing

Because Pandas has labeled row and column indices, it is capable of selecting data by name rather than only by position.

In [29]:
print(penguins_df)

   island  bill_length_mm  bill_depth_mm
0  Biscoe            39.1           18.7
1   Dream            39.5           17.4
2   Dream            40.3           18.0


In [32]:
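# Only the last expression of a cell is displayed, so the columns are not shown here;
# wrap each expression in print() to see both
penguins_df.columns
penguins_df.index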

RangeIndex(start=0, stop=3, step=1)

In [35]:
# Access the column directly
print(penguins_df["bill_length_mm"])
#
print(penguins_df["island"])

0    39.1
1    39.5
2    40.3
Name: bill_length_mm, dtype: float64
0    Biscoe
1     Dream
2     Dream
Name: island, dtype: object


In [37]:
# You can access several columns by giving their names as a list
print(penguins_df[["bill_length_mm", "bill_depth_mm"]])

   bill_length_mm  bill_depth_mm
0            39.1           18.7
1            39.5           17.4
2            40.3           18.0


In [142]:
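# Accessing both rows and columns is (usually) done with .loc[]
# Here, palmer is the full Palmer penguins dataset, loaded with pd.read_csv() in the "Additional features" section
# Note that the label slice 0:3 includes its end point, so 4 rows are returned
palmer.loc[0:3, ["bill_length_mm", "body_mass_g"]]

   bill_length_mm  body_mass_g
0            39.1       3750.0
1            39.5       3800.0
2            40.3       3250.0
3             NaN          NaN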


In [145]:
# You can select rows with a boolean condition and columns by name
palmer.loc[(palmer["sex"] == "male"), ["bill_length_mm", "body_mass_g"]]

     bill_length_mm  body_mass_g
0              39.1       3750.0
5              39.3       3650.0
7              39.2       4675.0
13             38.6       3800.0
14             34.6       4400.0
..              ...          ...
334            50.2       3800.0
336            51.9       3950.0
339            55.8       4000.0
341            49.6       3775.0


In [68]:
# You can combine several row conditions with & and still select columns by name
palmer.loc[(palmer["sex"] == "male") & (palmer["island"] == "Dream"), ["bill_length_mm", "body_mass_g"]]

     bill_length_mm  body_mass_g
31             37.2       3900.0
33             40.9       3900.0
35             39.2       4150.0
36             38.8       3950.0
39             39.8       4650.0
..              ...          ...
334            50.2       3800.0
336            51.9       3950.0
339            55.8       4000.0
341            49.6       3775.0


In [None]:
# Another accessor exists: .iloc[] (integer location)
# It uses NumPy-style positional indexing, so the end of a slice is excluded
palmer.iloc[0:3, 0:2]

  species     island
0  Adelie  Torgersen
1  Adelie  Torgersen
2  Adelie  Torgersen


# Location vs. integer location
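The two accessors differ in what a slice means: `.loc[]` works on labels and includes the end of the slice, while `.iloc[]` works on positions and excludes it. The sketch below uses a small made-up table (`birds` is a hypothetical stand-in, not the real dataset) and also shows that labels and positions stop matching once rows have been filtered out.

```python
import pandas as pd

# Small stand-in DataFrame with the default integer labels 0..4 (made-up values)
birds = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Dream", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3],
})

by_label = birds.loc[0:3, "bill_length_mm"]  # labels 0, 1, 2 and 3
by_position = birds.iloc[0:3, 1]             # positions 0, 1 and 2

# After filtering, the remaining rows keep their original labels
dream = birds[birds["island"] == "Dream"]
first_dream_label = dream.index[0]           # label of the first remaining row
first_dream_value = dream.iloc[0, 1]         # value at position 0, column 1
print(len(by_label), len(by_position), first_dream_label, first_dream_value)
```

A reasonable rule of thumb is to use `.loc[]` when you think in terms of names and conditions, and `.iloc[]` when you think in terms of "the first n rows".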

# Additional features

Pandas is more than just syntactic sugar on top of NumPy.

In [38]:
# Read the entire Palmer penguins dataset
palmer = pd.read_csv(filepath_or_buffer="../data/penguins.csv")

#
print(palmer)
print(type(palmer))
print(type(palmer.values))

       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0       Adelie  Torgersen            39.1           18.7              181.0   
1       Adelie  Torgersen            39.5           17.4              186.0   
2       Adelie  Torgersen            40.3           18.0              195.0   
3       Adelie  Torgersen             NaN            NaN                NaN   
4       Adelie  Torgersen            36.7           19.3              193.0   
..         ...        ...             ...            ...                ...   
339  Chinstrap      Dream            55.8           19.8              207.0   
340  Chinstrap      Dream            43.5           18.1              202.0   
341  Chinstrap      Dream            49.6           18.2              193.0   
342  Chinstrap      Dream            50.8           19.0              210.0   
343  Chinstrap      Dream            50.2           18.7              198.0   

     body_mass_g     sex  year  
0         3750.0  

In [39]:
# You can check the column names with the .columns attribute
print(palmer.columns)

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')


In [None]:
# Columns are labeled, and they are also exposed as attributes of the DataFrame
# This means that you can select individual columns by their name rather than by their number
palmer["species"]
# Equivalent attribute access: palmer.species

0         Adelie
1         Adelie
2         Adelie
3         Adelie
4         Adelie
         ...    
339    Chinstrap
340    Chinstrap
341    Chinstrap
342    Chinstrap
343    Chinstrap
Name: species, Length: 344, dtype: object

In [52]:
# Again, the most general form uses the .loc accessor:
palmer.loc[:, :]

       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3       Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
..         ...        ...             ...            ...                ...          ...     ...   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009


In [None]:
# With .loc[], columns must be given by name
# Label slices include their end point, so 0:10 returns 11 observations (labels 0 through 10)
# For example, "bill_length_mm" for those observations:
palmer.loc[0:10, "bill_length_mm"]

0     39.1
1     39.5
2     40.3
3      NaN
4     36.7
5     39.3
6     38.9
7     39.2
8     34.1
9     42.0
10    37.8
Name: bill_length_mm, dtype: float64

In [66]:
# Groups of columns are given as a list
palmer.loc[:, ["sex", "bill_length_mm", "bill_depth_mm"]]

        sex  bill_length_mm  bill_depth_mm
0      male            39.1           18.7
1    female            39.5           17.4
2    female            40.3           18.0
3       NaN             NaN            NaN
4    female            36.7           19.3
..      ...             ...            ...
339    male            55.8           19.8
340  female            43.5           18.1
341    male            49.6           18.2
342    male            50.8           19.0


In [61]:
# In addition, you can also use boolean filters for rows
palmer["sex"] == "male"

0       True
1      False
2      False
3      False
4      False
       ...  
339     True
340    False
341     True
342     True
343    False
Name: sex, Length: 344, dtype: bool

In [65]:
# Combine a boolean row filter with a list of columns
palmer.loc[palmer["sex"] == "male", ["sex", "bill_length_mm", "bill_depth_mm"]]

      sex  bill_length_mm  bill_depth_mm
0    male            39.1           18.7
5    male            39.3           20.6
7    male            39.2           19.6
13   male            38.6           21.2
14   male            34.6           21.1
..    ...             ...            ...
334  male            50.2           18.8
336  male            51.9           19.5
339  male            55.8           19.8
341  male            49.6           18.2


In [72]:
# Combine several conditions with & (note the parentheses around each condition)
palmer.loc[(palmer["sex"] == "male") & (palmer["island"] == "Dream"), ["island", "sex", "bill_length_mm", "bill_depth_mm"]]

    island   sex  bill_length_mm  bill_depth_mm
31   Dream  male            37.2           18.1
33   Dream  male            40.9           18.9
35   Dream  male            39.2           21.1
36   Dream  male            38.8           20.0
39   Dream  male            39.8           19.1
..     ...   ...             ...            ...
334  Dream  male            50.2           18.8
336  Dream  male            51.9           19.5
339  Dream  male            55.8           19.8
341  Dream  male            49.6           18.2
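The parentheses matter because `&` binds more tightly than `==` in Python, so without them the expression is grouped differently and typically raises an error. One way to sidestep the issue, sketched here on a made-up stand-in table (`birds` and the mask names are hypothetical), is to give each condition a name first.

```python
import pandas as pd

# Made-up stand-in for the penguins table
birds = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "island": ["Dream", "Dream", "Biscoe", "Biscoe"],
})

is_male = birds["sex"] == "male"
on_dream = birds["island"] == "Dream"

both = birds.loc[is_male & on_dream]        # AND: both conditions hold
either = birds.loc[is_male | on_dream]      # OR: at least one condition holds
neither = birds.loc[~(is_male | on_dream)]  # NOT: negate a mask with ~
print(len(both), len(either), len(neither))
```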


# Quick summaries

In [27]:
# You can also get quick summaries of variables
print(palmer.describe(include="number"))

       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  \
count      342.000000     342.000000         342.000000   342.000000   
mean        43.921930      17.151170         200.915205  4201.754386   
std          5.459584       1.974793          14.061714   801.954536   
min         32.100000      13.100000         172.000000  2700.000000   
25%         39.225000      15.600000         190.000000  3550.000000   
50%         44.450000      17.300000         197.000000  4050.000000   
75%         48.500000      18.700000         213.000000  4750.000000   
max         59.600000      21.500000         231.000000  6300.000000   

              year  
count   344.000000  
mean   2008.029070  
std       0.818356  
min    2007.000000  
25%    2007.000000  
50%    2008.000000  
75%    2009.000000  
max    2009.000000  


In [29]:
# Qualitative variables use the argument "object"
print(palmer.describe(include="object"))

       species  island   sex
count      344     344   333
unique       3       3     2
top     Adelie  Biscoe  male
freq       152     168   168


In [None]:
# You can use the argument "all" to get it for all variables,
# but the resulting table is ugly and somewhat useless
print(palmer.describe(include="all"))

       species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
count      344     344      342.000000     342.000000         342.000000   
unique       3       3             NaN            NaN                NaN   
top     Adelie  Biscoe             NaN            NaN                NaN   
freq       152     168             NaN            NaN                NaN   
mean       NaN     NaN       43.921930      17.151170         200.915205   
std        NaN     NaN        5.459584       1.974793          14.061714   
min        NaN     NaN       32.100000      13.100000         172.000000   
25%        NaN     NaN       39.225000      15.600000         190.000000   
50%        NaN     NaN       44.450000      17.300000         197.000000   
75%        NaN     NaN       48.500000      18.700000         213.000000   
max        NaN     NaN       59.600000      21.500000         231.000000   

        body_mass_g   sex         year  
count    342.000000   333   344.000000  
uniqu

# Data cleaning

Data cleaning is efficient: for example, you can remove rows that contain NaN values with `.dropna()`.

In [32]:
palmer.head()

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007


In [35]:
palmer.dropna(inplace=False).head()

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
5  Adelie  Torgersen            39.3           20.6              190.0       3650.0    male  2007


In [36]:
# Other things to try (uncomment one line at a time):
#palmer.dropna(inplace=False)
#palmer.dropna(inplace=False, subset=["sex"])
#palmer[palmer.isna().any(axis=1)]

In [74]:
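# Removing NaN values based on one particular variable only
# By default, .dropna() removes every row that contains at least one NaN value,
# so take a small piece of the data to see what this means in practice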
palmer_piece = palmer.loc[267:272, :].copy()
#
print(palmer_piece)

    species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
267  Gentoo  Biscoe            55.1           16.0              230.0   
268  Gentoo  Biscoe            44.5           15.7              217.0   
269  Gentoo  Biscoe            48.8           16.2              222.0   
270  Gentoo  Biscoe            47.2           13.7              214.0   
271  Gentoo  Biscoe             NaN            NaN                NaN   
272  Gentoo  Biscoe            46.8           14.3              215.0   

     body_mass_g     sex  year  
267       5850.0    male  2009  
268       4875.0     NaN  2009  
269       6000.0    male  2009  
270       4925.0  female  2009  
271          NaN     NaN  2009  
272       4850.0  female  2009  


In [None]:
# If you use .dropna() without arguments, you remove both rows 268 and 271,
# because each of them contains at least one NaN
print(palmer_piece.dropna())

    species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
267  Gentoo  Biscoe            55.1           16.0              230.0   
269  Gentoo  Biscoe            48.8           16.2              222.0   
270  Gentoo  Biscoe            47.2           13.7              214.0   
272  Gentoo  Biscoe            46.8           14.3              215.0   

     body_mass_g     sex  year  
267       5850.0    male  2009  
269       6000.0    male  2009  
270       4925.0  female  2009  
272       4850.0  female  2009  


In [76]:
# But you can specify on which columns you want to drop rows if they contain NaN values
#
print(palmer_piece.dropna(subset=["bill_depth_mm"]))

    species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
267  Gentoo  Biscoe            55.1           16.0              230.0   
268  Gentoo  Biscoe            44.5           15.7              217.0   
269  Gentoo  Biscoe            48.8           16.2              222.0   
270  Gentoo  Biscoe            47.2           13.7              214.0   
272  Gentoo  Biscoe            46.8           14.3              215.0   

     body_mass_g     sex  year  
267       5850.0    male  2009  
268       4875.0     NaN  2009  
269       6000.0    male  2009  
270       4925.0  female  2009  
272       4850.0  female  2009  
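Before dropping anything, it can help to count how many values are missing in each column. The sketch below uses a few made-up rows that mimic the pattern above (one row misses only `sex`, another misses the measurement as well); `piece` and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second misses only "sex", the third misses the measurement too
piece = pd.DataFrame({
    "bill_depth_mm": [16.0, 15.7, np.nan],
    "sex": ["male", np.nan, np.nan],
    "year": [2009, 2009, 2009],
})

missing_per_column = piece.isna().sum()               # number of NaN per column
all_complete = piece.dropna()                         # drop rows with any NaN
depth_known = piece.dropna(subset=["bill_depth_mm"])  # only check one column
print(missing_per_column)
print(len(all_complete), len(depth_known))
```

Using `subset` keeps rows that are still usable for the variable you care about, at the cost of leaving NaN values in the other columns.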


In [93]:
#  You can build new variables
print(palmer["bill_length_mm"] / palmer["bill_depth_mm"])

# Adding a column is done simply by assigning to a new column name
palmer["bill_ratio"] = palmer["bill_length_mm"] / palmer["bill_depth_mm"]

0      2.090909
1      2.270115
2      2.238889
3           NaN
4      1.901554
         ...   
339    2.818182
340    2.403315
341    2.725275
342    2.673684
343    2.684492
Length: 344, dtype: float64


In [45]:
# You can check to make sure it is there.
print(palmer.columns)

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year', 'bill_ratio'],
      dtype='object')
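Assigning to a new column name modifies the DataFrame in place. If you would rather keep the original untouched, `.assign()` returns a copy with the extra column. The sketch below uses made-up measurements (`bills` is a hypothetical stand-in) and also shows that a missing input gives a missing result.

```python
import numpy as np
import pandas as pd

# Made-up measurements; the second row has a missing depth
bills = pd.DataFrame({
    "bill_length_mm": [40.0, 45.0, 50.0],
    "bill_depth_mm": [20.0, np.nan, 16.0],
})

# assign() returns a new DataFrame with the extra column; `bills` is unchanged
with_ratio = bills.assign(bill_ratio=bills["bill_length_mm"] / bills["bill_depth_mm"])
print(with_ratio)
```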


# Summarizing variables using .groupby

In [77]:
# You can use the .groupby() method to group dataframes by a qualitative variable
print(palmer.groupby(by="island"))
print(type(palmer.groupby(by="island")))

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023A51B98AD0>
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>


In [81]:
# You can get the summary you want by calling the corresponding method afterwards:
# .count()
# .mean()
# .min()
# .max()
print("Number of non-missing values per column, by island")
(palmer.groupby(by="island").count())

Number of non-missing values per column, by island


           species  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex  year
island
Biscoe         168             167            167                167          167  163   168
Dream          124             124            124                124          124  123   124
Torgersen       52              51             51                 51           51   47    52


In [None]:
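# The following code fails in recent Pandas versions (2.0 and later),
# because .mean() cannot average text columns such as "species" and "sex"
# Read the error message to understand what happens
# when you change the aggregate function from .count() to .mean()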
print("Average penguin bill length by island")
print(palmer.groupby(by="island").mean())
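Two common ways around that failure are sketched below on a made-up stand-in table (`birds` is hypothetical): either ask `.mean()` to skip non-numeric columns with `numeric_only=True`, or select the numeric columns yourself before aggregating.

```python
import pandas as pd

# Made-up stand-in with two text columns and one numeric column
birds = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream"],
    "species": ["Adelie", "Gentoo", "Adelie"],
    "bill_length_mm": [38.0, 48.0, 39.0],
})

# Option 1: ask Pandas to skip the non-numeric columns
means = birds.groupby(by="island").mean(numeric_only=True)

# Option 2: select the numeric columns yourself before aggregating
same_means = birds.groupby(by="island")[["bill_length_mm"]].mean()
print(means)
```

The second option is more explicit about which variables end up in the result, which can make a notebook easier to read later.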

In [86]:
# You can consider a single variable by appending it at the end
print("Average penguin bill length by island")
(palmer.groupby(by="island")["bill_length_mm"].mean())

Average penguin bill length by island


island
Biscoe       45.257485
Dream        44.167742
Torgersen    38.950980
Name: bill_length_mm, dtype: float64

In [None]:
# You can submit more than one variable in the "by" parameter
# These are considered in succession
print("Average penguin bill length by island, then by species")
(palmer.groupby(by=["island", "species"])["bill_length_mm"].mean())

Average penguin bill length by island, then by species


island     species  
Biscoe     Adelie       38.975000
           Gentoo       47.504878
Dream      Adelie       38.501786
           Chinstrap    48.833824
Torgersen  Adelie       38.950980
Name: bill_length_mm, dtype: float64

In [90]:
# You can submit more than one variable in the "by" parameter
# These are considered in succession
print("Average penguin bill length by island, then by species, then by sex")
(palmer.groupby(by=["island", "species", "sex"])["bill_length_mm"].mean())

Average penguin bill length by island, then by species, then by sex


island     species    sex   
Biscoe     Adelie     female    37.359091
                      male      40.590909
           Gentoo     female    45.563793
                      male      49.473770
Dream      Adelie     female    36.911111
                      male      40.071429
           Chinstrap  female    46.573529
                      male      51.094118
Torgersen  Adelie     female    37.554167
                      male      40.586957
Name: bill_length_mm, dtype: float64

In [91]:
# More than one variable can be aggregated, in which case the names must be given as a list
print("Average penguin bill length and depth by island, then by species, then by sex")
(palmer.groupby(by=["island", "species", "sex"])[["bill_length_mm", "bill_depth_mm"]].mean())


In [94]:
# You can go further and apply several aggregation functions to several variables at once with .agg()
palmer.groupby(["species", "sex"])[["body_mass_g", "bill_ratio"]].agg(["min", "median", "max"])

                  body_mass_g                  bill_ratio
                          min  median     max         min    median       max
species   sex
Adelie    female       2850.0  3400.0  3900.0    1.763158  2.105263  2.434524
          male         3325.0  4000.0  4775.0    1.639810  2.151832  2.450000
Chinstrap female       2700.0  3550.0  4150.0    2.350515  2.628989  3.258427
          male         3250.0  3950.0  4800.0    2.477387  2.667567  2.872928
Gentoo    female       3950.0  4700.0  5200.0    2.836735  3.218840  3.492424
          male         4750.0  5500.0  6300.0    2.566474  3.132075  3.612676
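The result above has two levels of column names, which can be awkward to work with afterwards. One alternative is "named aggregation", where each output column gets a name of your choosing. This is a sketch on made-up data; `birds` and the output column names are hypothetical.

```python
import pandas as pd

# Made-up body masses for two species
birds = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3000.0, 4000.0, 5000.0, 6000.0],
})

# Named aggregation: output_column=(source_column, function)
summary = birds.groupby("species").agg(
    lightest=("body_mass_g", "min"),
    typical=("body_mass_g", "median"),
    heaviest=("body_mass_g", "max"),
)
print(summary)
```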


# Pivot, stack, and unstack

In [100]:
# Store the result of a two-level groupby (a Series with a MultiIndex) in a variable
toto = (palmer.groupby(by=["island", "species"])["bill_length_mm"].mean())

print("Average penguin bill length by island, then by species")
print(toto)

Average penguin bill length by island, then by species
island     species  
Biscoe     Adelie       38.975000
           Gentoo       47.504878
Dream      Adelie       38.501786
           Chinstrap    48.833824
Torgersen  Adelie       38.950980
Name: bill_length_mm, dtype: float64


In [None]:
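# Unstack: move the inner index level ("species") to the columns
print(toto.unstack())
# The original Series is left unchanged
print(toto)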

species       Adelie  Chinstrap     Gentoo
island                                    
Biscoe     38.975000        NaN  47.504878
Dream      38.501786  48.833824        NaN
Torgersen  38.950980        NaN        NaN
island     species  
Biscoe     Adelie       38.975000
           Gentoo       47.504878
Dream      Adelie       38.501786
           Chinstrap    48.833824
Torgersen  Adelie       38.950980
Name: bill_length_mm, dtype: float64


In [105]:
toto

island     species  
Biscoe     Adelie       38.975000
           Gentoo       47.504878
Dream      Adelie       38.501786
           Chinstrap    48.833824
Torgersen  Adelie       38.950980
Name: bill_length_mm, dtype: float64

In [None]:
# Reset the index to convert the MultiIndex into columns
toto_df = toto.reset_index()

# Pivot the DataFrame: 'island' as rows, 'species' as columns, and 'bill_length_mm' as values
pivoted_df = toto_df.pivot(index="island", columns="species", values="bill_length_mm")

# Display the result
print(pivoted_df)

species       Adelie  Chinstrap     Gentoo
island                                    
Biscoe     38.975000        NaN  47.504878
Dream      38.501786  48.833824        NaN
Torgersen  38.950980        NaN        NaN
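The two routes shown above, `.unstack()` on a MultiIndex Series and `.pivot()` on a long table, lead to the same wide layout. The sketch below puts them side by side on made-up group means (`long_df` and the result names are hypothetical), so you can compare them on a table small enough to check by eye.

```python
import pandas as pd

# Made-up group means in "long" form: one row per (island, species) pair
long_df = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream"],
    "species": ["Adelie", "Gentoo", "Adelie"],
    "bill_length_mm": [39.0, 47.5, 38.5],
})

# Route 1: pivot the long table directly
wide_pivot = long_df.pivot(index="island", columns="species", values="bill_length_mm")

# Route 2: build a MultiIndex Series, then unstack its inner level
wide_unstack = long_df.set_index(["island", "species"])["bill_length_mm"].unstack()

print(wide_pivot)
print(wide_unstack)
```

Combinations that do not occur in the long table (here Gentoo on Dream) show up as NaN in the wide one.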
