# Pivot tables

## TL;DR
The main arguments of the `pivot_table` method are as follows:
1. `values`: this is the first argument, so you can skip typing `values=` altogether; the method treats the first positional argument as `values`.
    - This is the column you want to aggregate (the new DataFrame is pivoted around it, hence "pivot table").

2. `index`: a column or a list of columns to group by; in practice you will almost always specify it. These become the index (the rows) of the newly generated DataFrame.

3. `columns`: a column or a list of columns to group by. As you may guess by now, these become the columns of the new DataFrame.
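
As a minimal sketch of how the three arguments fit together, the snippet below pivots a tiny made-up `sales` table. The table and its column names are assumptions for illustration only; they are not part of the dogs dataset used in this notebook.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the three arguments.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "units": [10, 20, 30, 40],
})

# values  -> the column to aggregate
# index   -> groups shown as rows
# columns -> groups shown as columns
table = sales.pivot_table(values="units", index="region", columns="product")

# The same call with `values` passed positionally.
same_table = sales.pivot_table("units", index="region", columns="product")

print(table)
print(table.equals(same_table))
```

Each cell holds the mean of `units` for one region and product pair, because mean is the default aggregation.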

<img src="../4_Data_Manipulation_with_pandas\Resoureces\Screenshot 2025-07-26 184848.png"/>

In [None]:
import pandas as pd
import numpy as np

## Usage
Pivot tables are another way of calculating grouped summary statistics. If you've ever used a spreadsheet, chances are you've used a pivot table. Let's see how to create pivot tables in pandas.

In [None]:
data = [
    {"date_of_birth": "2011-12-11", "name": "Cooper", "breed": "Schnauzer", "color": "Grey", "height_cm": 49, "weight_kg": 17},
    {"date_of_birth": "2013-07-01", "name": "Bella", "breed": "Labrador", "color": "Brown", "height_cm": 56, "weight_kg": 25},
    {"date_of_birth": "2014-08-25", "name": "Lucy", "breed": "Chow Chow", "color": "Brown", "height_cm": 46, "weight_kg": 22},
    {"date_of_birth": "2015-04-20", "name": "Stella", "breed": "Chihuahua", "color": "Tan", "height_cm": 18, "weight_kg": 2},
    {"date_of_birth": "2016-09-16", "name": "Charlie", "breed": "Poodle", "color": "Black", "height_cm": 43, "weight_kg": 23},
    {"date_of_birth": "2017-01-20", "name": "Max", "breed": "Labrador", "color": "Black", "height_cm": 59, "weight_kg": 29},
    {"date_of_birth": "2018-02-27", "name": "Bernie", "breed": "St. Bernard", "color": "White", "height_cm": 77, "weight_kg": 74},
]

dogs = pd.DataFrame(data)
dogs['date_of_birth'] = pd.to_datetime(dogs['date_of_birth'])
dogs.set_index('date_of_birth', inplace=True)


In [None]:
dogs

## Group by to pivot table
In the last lesson, we grouped the dogs by color and calculated their mean weights. We can do the same thing using the `pivot_table` method. The `values` argument is the column that you want to summarize, and the `index` argument is the column that you want to group by. By default, `pivot_table` takes the **mean value** for each group.

In [None]:
dogs.groupby("color")["weight_kg"].mean()

In [None]:
dogs.pivot_table(values="weight_kg", index="color")
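
To check that the two approaches agree, here is a small sketch that rebuilds the color and weight columns of the seven dogs and compares the results. The variable names `by_group` and `by_pivot` are made up for this example.

```python
import pandas as pd

# Color and weight of the seven dogs from the table above.
dogs = pd.DataFrame({
    "color": ["Grey", "Brown", "Brown", "Tan", "Black", "Black", "White"],
    "weight_kg": [17, 25, 22, 2, 23, 29, 74],
})

# groupby returns a Series indexed by color.
by_group = dogs.groupby("color")["weight_kg"].mean()

# pivot_table returns a DataFrame with a single weight_kg column.
by_pivot = dogs.pivot_table(values="weight_kg", index="color")

print(by_group)
print(by_pivot)

# Compare the numbers group by group.
print((by_pivot["weight_kg"] == by_group).all())
```

The main practical difference is the return type: a Series from `groupby`, a DataFrame from `pivot_table`.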

If we want a different summary statistic, we can use the **aggfunc** argument and pass it a function or the name of one, such as "median". Here, we take the median for each dog color.

In [None]:
dogs.pivot_table(values="weight_kg", index="color", aggfunc="median")

To get multiple summary statistics at a time, we can pass a list of functions to the aggfunc argument. Here, we get the mean and median for each dog color.

In [None]:
dogs.pivot_table(values="weight_kg", index="color", aggfunc=["median", "mean"])
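
When `aggfunc` is a list, the result has two levels of column labels: the function name on top and the aggregated column below. The sketch below, again on the seven dogs, shows one way to pull a single statistic back out; the name `summary` is just for illustration.

```python
import pandas as pd

# Color and weight of the seven dogs from the table above.
dogs = pd.DataFrame({
    "color": ["Grey", "Brown", "Brown", "Tan", "Black", "Black", "White"],
    "weight_kg": [17, 25, 22, 2, 23, 29, 74],
})

summary = dogs.pivot_table(
    values="weight_kg", index="color", aggfunc=["median", "mean"]
)

# Column labels are (function name, column name) pairs.
print(summary.columns.tolist())

# Select all columns produced by one function...
print(summary["mean"])

# ...or a single cell by row label and full column label.
print(summary.loc["Brown", ("median", "weight_kg")])
```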

## Pivot on two variables
You also previously computed the mean weight grouped by two variables: color and breed. We can also do this using the `pivot_table` method. To group by two variables, we can pass a second variable name into the `columns` argument. While the result looks a little different than what we had before, it contains the same numbers. There are NaNs, or missing values, because there are no black Chihuahuas or grey Labradors in our dataset, for example.

In [None]:
dogs.groupby(["color", "breed"])["weight_kg"].mean()

In [None]:
dogs.pivot_table(values="weight_kg", index="color", columns="breed")
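
One way to see that both results contain the same numbers is to reshape the grouped Series with `unstack`, which moves one index level into the columns. This is a sketch of that comparison, not a description of how `pivot_table` must be used; the names `grouped`, `wide`, and `pivot` are made up here.

```python
import pandas as pd

# Breed, color and weight of the seven dogs from the table above.
dogs = pd.DataFrame({
    "breed": ["Schnauzer", "Labrador", "Chow Chow", "Chihuahua",
              "Poodle", "Labrador", "St. Bernard"],
    "color": ["Grey", "Brown", "Brown", "Tan", "Black", "Black", "White"],
    "weight_kg": [17, 25, 22, 2, 23, 29, 74],
})

# Long format: one row per (color, breed) pair that exists in the data.
grouped = dogs.groupby(["color", "breed"])["weight_kg"].mean()

# Wide format: move the breed level into the columns.
wide = grouped.unstack("breed")

# The pivot table goes straight to the wide format.
pivot = dogs.pivot_table(values="weight_kg", index="color", columns="breed")

print(wide)
print(pivot)

# Missing combinations are NaN in both; compare after filling them.
print(wide.fillna(-1).equals(pivot.fillna(-1)))
```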

### Filling missing values in pivot tables
Instead of having lots of missing values in our pivot table, we can have them filled in using the fill_value argument. Here, all of the NaNs get filled in with zeros.

In [None]:
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

## Summing with pivot tables

If we set the `margins` argument to True, the last row and last column of the pivot table contain the mean of all the values in the column or row, not including the missing values that were filled in with 0s. For example, in the last row of the Labrador column, we can see that the mean weight of the Labradors is 27 kilograms. In the last column of the Brown row, the mean weight of the Brown dogs is 23.5 kilograms. The value in the bottom right, in the last row and last column, is the mean weight of all the dogs in the dataset. Setting `margins=True` allows us to see a summary statistic for multiple levels of the dataset: the entire dataset, grouped by one variable, by another variable, and by two variables.

In [None]:
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
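
The sketch below reads the margin cells by label to confirm they come from the underlying rows rather than from the filled-in zeros. It rebuilds the seven dogs so it can run on its own; `All` is the default label pandas gives to the margin row and column.

```python
import pandas as pd

# Breed, color and weight of the seven dogs from the table above.
dogs = pd.DataFrame({
    "breed": ["Schnauzer", "Labrador", "Chow Chow", "Chihuahua",
              "Poodle", "Labrador", "St. Bernard"],
    "color": ["Grey", "Brown", "Brown", "Tan", "Black", "Black", "White"],
    "weight_kg": [17, 25, 22, 2, 23, 29, 74],
})

table = dogs.pivot_table(
    values="weight_kg", index="color", columns="breed",
    fill_value=0, margins=True,
)

# Mean over the two Labradors (25 and 29), ignoring the filled-in zeros.
print(table.loc["All", "Labrador"])

# Mean over the two Brown dogs (25 and 22).
print(table.loc["Brown", "All"])

# Bottom-right cell: mean over every dog in the dataset.
print(table.loc["All", "All"])
print(dogs["weight_kg"].mean())
```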

In [None]:
# Forward slashes avoid the invalid "\d" escape sequence and work on every OS
dogs_csv = pd.read_csv("Resources/dogs.csv", index_col='date_of_birth', parse_dates=['date_of_birth'])

In [None]:
dogs_csv

- The first argument is the column name containing values to aggregate.
- The index argument lists the columns to group by and display in rows.
- The columns argument lists the columns to group by and display in columns.

We'll use the default aggregation function, which is mean.

In [None]:
dogs_height_by_breed_vs_color = dogs_csv.pivot_table("height_cm", index="breed", columns="color")
dogs_height_by_breed_vs_color

Pivot tables are just DataFrames with sorted indexes. That means that all the fun stuff you've learned so far in this chapter can be used on them. In particular, the `loc` and slicing combination is ideal for subsetting pivot tables, like so.

In [None]:
dogs_height_by_breed_vs_color.loc["American Staffordshire Terrier":"Basset Hound"]
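
Because `dogs.csv` is not reproduced in this notebook, here is a self-contained sketch of the same idea on the seven-dog table. The name `heights` is made up for this example. Label slices with `loc` include both endpoints, and they rely on the sorted index that `pivot_table` produces.

```python
import pandas as pd

# Breed, color and height of the seven dogs from the table above.
dogs = pd.DataFrame({
    "breed": ["Schnauzer", "Labrador", "Chow Chow", "Chihuahua",
              "Poodle", "Labrador", "St. Bernard"],
    "color": ["Grey", "Brown", "Brown", "Tan", "Black", "Black", "White"],
    "height_cm": [49, 56, 46, 18, 43, 59, 77],
})

heights = dogs.pivot_table("height_cm", index="breed", columns="color")

# Slice rows by label: both "Chow Chow" and "Poodle" are included.
rows = heights.loc["Chow Chow":"Poodle"]
print(rows)

# Slice rows and columns at the same time.
block = heights.loc["Chow Chow":"Poodle", "Black":"Brown"]
print(block)
```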

## The axis argument

The methods for calculating summary statistics on a DataFrame, such as mean, have an axis argument. The default value is "index," which means "calculate the statistic across rows." Here, the mean is calculated for each color. That is, "across the breeds." The behavior is the same as if you hadn't specified the axis argument.

In [None]:
dogs_height_by_breed_vs_color.mean(axis="index")

### Calculating summary stats across columns

To calculate a summary statistic for each row, that is, "across the columns," you set `axis` to "columns." Here, the mean height is calculated for each breed. That is, "across the colors." For most DataFrames, setting `axis` to "columns" doesn't make any sense, since you'll have different data types in each column. Pivot tables are a special case since every column contains the same data type.

In [None]:
dogs_height_by_breed_vs_color.mean(axis="columns")
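
To make the two directions concrete, this sketch applies both `axis` values to a pivot table built from the seven-dog table (used here as a stand-in for `dogs.csv`; the names `heights`, `per_color`, and `per_breed` are made up). Note that these are means of the pivot table's cells, with NaN cells skipped.

```python
import pandas as pd

# Breed, color and height of the seven dogs from the table above.
dogs = pd.DataFrame({
    "breed": ["Schnauzer", "Labrador", "Chow Chow", "Chihuahua",
              "Poodle", "Labrador", "St. Bernard"],
    "color": ["Grey", "Brown", "Brown", "Tan", "Black", "Black", "White"],
    "height_cm": [49, 56, 46, 18, 43, 59, 77],
})

heights = dogs.pivot_table("height_cm", index="breed", columns="color")

# axis="index": collapse the rows, leaving one value per color (column).
per_color = heights.mean(axis="index")
print(per_color)

# axis="columns": collapse the columns, leaving one value per breed (row).
per_breed = heights.mean(axis="columns")
print(per_breed)
```

The result is always indexed by the axis that was not collapsed: colors for `axis="index"`, breeds for `axis="columns"`.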

In [None]:
dogs_height_by_breed_vs_color.mean()

In [None]:
dogs_height_by_breed_vs_color

# Recap
You learned about the power of pivot tables in pandas for data analysis, focusing on creating, subsetting, and performing calculations on them. Pivot tables allow you to reorganize and summarize selected columns and rows of data in order to draw insights from large datasets. Key points include:

- Creating pivot tables using the `pivot_table` method, where you specify the values to aggregate, and the indices and columns to group by. The default aggregation function is mean.
- Adding a new column to a DataFrame based on date components, such as extracting the year from a date to use in a pivot table.
- Subsetting pivot tables using the `.loc[]` method combined with slicing to select specific ranges of data. This is useful for drilling down into specific segments of your data.
- Calculating summary statistics on pivot tables, such as the mean temperature for each year or city. You can calculate these statistics across rows or columns by specifying the `axis` argument.

For example, to create a pivot table of average temperatures by country and city versus year and then see the result, you would use:
```python
# Add a year column to temperatures
temperatures["year"] = temperatures["date"].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c", index=["country", "city"], columns="year")

# See the result
print(temp_by_country_city_vs_year)
```

This lesson equipped you with the skills to manipulate and analyze data more effectively using pivot tables in pandas.

The goal of the next lesson is to teach you how to visualize and understand data distributions, relationships, and trends over time using `pandas` and `matplotlib` in Python.