Indexes are supercharged row and column names. Learn how they can be combined with slicing for powerful DataFrame subsetting.

# Explicit indexes

## Explicit indexes

In [1]:
import pandas as pd
dogs = pd.read_csv('dogs.csv')
dogs.head()

,name,breed,color,height_cm,weight_kg,date_of_birth
0,Bella,Labrador,Brown,56,25,2013-07-01
1,Charlie,Poodle,Black,43,23,2016-09-16
2,Lucy,Chow Chow,Brown,46,22,2014-08-25
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
4,Max,Labrador,Black,59,29,2017-01-20


In [2]:
print(dogs)

      name        breed   color  height_cm  weight_kg date_of_birth
0    Bella     Labrador   Brown         56         25    2013-07-01
1  Charlie       Poodle   Black         43         23    2016-09-16
2     Lucy    Chow Chow   Brown         46         22    2014-08-25
3   Cooper    Schnauzer    Gray         49         17    2011-12-11
4      Max     Labrador   Black         59         29    2017-01-20
5   Stella    Chihuahua     Tan         18          2    2015-04-20
6   Bernie  St. Bernard   White         77         74    2018-02-27


In [3]:
dogs.columns

Index(['name', 'breed', 'color', 'height_cm', 'weight_kg', 'date_of_birth'], dtype='object')

In [4]:
dogs.index

RangeIndex(start=0, stop=7, step=1)

## Setting a column as the index

In [5]:
dogs_ind = dogs.set_index("name")
print(dogs_ind)

               breed   color  height_cm  weight_kg date_of_birth
name                                                            
Bella       Labrador   Brown         56         25    2013-07-01
Charlie       Poodle   Black         43         23    2016-09-16
Lucy       Chow Chow   Brown         46         22    2014-08-25
Cooper     Schnauzer    Gray         49         17    2011-12-11
Max         Labrador   Black         59         29    2017-01-20
Stella     Chihuahua     Tan         18          2    2015-04-20
Bernie   St. Bernard   White         77         74    2018-02-27


## Removing an index

In [6]:
dogs_ind.reset_index()

,name,breed,color,height_cm,weight_kg,date_of_birth
0,Bella,Labrador,Brown,56,25,2013-07-01
1,Charlie,Poodle,Black,43,23,2016-09-16
2,Lucy,Chow Chow,Brown,46,22,2014-08-25
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
4,Max,Labrador,Black,59,29,2017-01-20
5,Stella,Chihuahua,Tan,18,2,2015-04-20
6,Bernie,St. Bernard,White,77,74,2018-02-27


## Dropping an index

In [7]:
dogs_ind.reset_index(drop=True)

,breed,color,height_cm,weight_kg,date_of_birth
0,Labrador,Brown,56,25,2013-07-01
1,Poodle,Black,43,23,2016-09-16
2,Chow Chow,Brown,46,22,2014-08-25
3,Schnauzer,Gray,49,17,2011-12-11
4,Labrador,Black,59,29,2017-01-20
5,Chihuahua,Tan,18,2,2015-04-20
6,St. Bernard,White,77,74,2018-02-27


## Indexes make subsetting simpler

In [8]:
dogs[dogs["name"].isin(["Bella", "Stella"])]

,name,breed,color,height_cm,weight_kg,date_of_birth
0,Bella,Labrador,Brown,56,25,2013-07-01
5,Stella,Chihuahua,Tan,18,2,2015-04-20


In [9]:
dogs_ind.loc[["Bella", "Stella"]]

name,breed,color,height_cm,weight_kg,date_of_birth
Bella,Labrador,Brown,56,25,2013-07-01
Stella,Chihuahua,Tan,18,2,2015-04-20


## Index values don't need to be unique

In [10]:
dogs_ind2 = dogs.set_index("breed")
print(dogs_ind2)

                name   color  height_cm  weight_kg date_of_birth
breed                                                           
Labrador       Bella   Brown         56         25    2013-07-01
Poodle       Charlie   Black         43         23    2016-09-16
Chow Chow       Lucy   Brown         46         22    2014-08-25
Schnauzer     Cooper    Gray         49         17    2011-12-11
Labrador         Max   Black         59         29    2017-01-20
Chihuahua     Stella     Tan         18          2    2015-04-20
St. Bernard   Bernie   White         77         74    2018-02-27


## Subsetting on duplicated index values

In [11]:
dogs_ind2.loc["Labrador"]

breed,name,color,height_cm,weight_kg,date_of_birth
Labrador,Bella,Brown,56,25,2013-07-01
Labrador,Max,Black,59,29,2017-01-20


## Multi-level indexes a.k.a. hierarchical indexes

In [12]:
dogs_ind3 = dogs.set_index(["breed", "color"])
print(dogs_ind3)

                       name  height_cm  weight_kg date_of_birth
breed       color                                              
Labrador    Brown     Bella         56         25    2013-07-01
Poodle      Black   Charlie         43         23    2016-09-16
Chow Chow   Brown      Lucy         46         22    2014-08-25
Schnauzer   Gray     Cooper         49         17    2011-12-11
Labrador    Black       Max         59         29    2017-01-20
Chihuahua   Tan      Stella         18          2    2015-04-20
St. Bernard White    Bernie         77         74    2018-02-27


## Subset the outer level with a list

In [13]:
dogs_ind3.loc[["Labrador", "Chihuahua"]]

breed,color,name,height_cm,weight_kg,date_of_birth
Labrador,Brown,Bella,56,25,2013-07-01
Labrador,Black,Max,59,29,2017-01-20
Chihuahua,Tan,Stella,18,2,2015-04-20


## Subset inner levels with a list of tuples

In [14]:
dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]

breed,color,name,height_cm,weight_kg,date_of_birth
Labrador,Brown,Bella,56,25,2013-07-01
Chihuahua,Tan,Stella,18,2,2015-04-20


## Sorting by index values

In [15]:
dogs_ind3.sort_index()

breed,color,name,height_cm,weight_kg,date_of_birth
Chihuahua,Tan,Stella,18,2,2015-04-20
Chow Chow,Brown,Lucy,46,22,2014-08-25
Labrador,Black,Max,59,29,2017-01-20
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16
Schnauzer,Gray,Cooper,49,17,2011-12-11
St. Bernard,White,Bernie,77,74,2018-02-27


## Controlling sort_index

In [16]:
dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])

breed,color,name,height_cm,weight_kg,date_of_birth
Labrador,Black,Max,59,29,2017-01-20
Poodle,Black,Charlie,43,23,2016-09-16
Labrador,Brown,Bella,56,25,2013-07-01
Chow Chow,Brown,Lucy,46,22,2014-08-25
Schnauzer,Gray,Cooper,49,17,2011-12-11
Chihuahua,Tan,Stella,18,2,2015-04-20
St. Bernard,White,Bernie,77,74,2018-02-27


## Now you have two problems

- Index values are just data
- Indexes violate "tidy data" principles
- You need to learn two syntaxes

## Setting & removing indexes

pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).
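
As a minimal sketch of what setting and resetting an index does, the following uses a small made-up DataFrame (the name `toy` and its values are invented for illustration and do not come from temp.csv).

```python
import pandas as pd

# Hypothetical toy data standing in for the temperatures DataFrame
toy = pd.DataFrame({
    "city": ["Lahore", "Moscow", "Cairo"],
    "avg_temp_c": [24.1, 5.8, 22.3],
})

# "city" moves out of the columns and becomes the row labels
toy_ind = toy.set_index("city")

# reset_index() turns the index back into an ordinary column
restored = toy_ind.reset_index()

# drop=True discards the index values instead of keeping them
dropped = toy_ind.reset_index(drop=True)

print(list(toy_ind.columns))
print(list(restored.columns))
print(list(dropped.columns))
```

Note that none of these calls modify `toy` in place; each returns a new DataFrame.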

In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures in cities around the world. pandas is loaded as pd.

### Instructions

- Look at temperatures.
- Set the index of temperatures to "city", assigning to temperatures_ind.
- Look at temperatures_ind. How is it different from temperatures?
- Reset the index of temperatures_ind, keeping its contents.
- Reset the index of temperatures_ind, dropping its contents.

In [18]:
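# Parse date as datetime so the .dt accessor and date slicing used later work
temperatures = pd.read_csv('temp.csv', parse_dates=['date'])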
temperatures.head()

,date,city,country,avg_temp_c
0,2000-01-01,Abidjan,Côte D'Ivoire,27.293
1,2000-02-01,Abidjan,Côte D'Ivoire,27.685
2,2000-03-01,Abidjan,Côte D'Ivoire,29.061
3,2000-04-01,Abidjan,Côte D'Ivoire,28.162
4,2000-05-01,Abidjan,Côte D'Ivoire,27.547


In [19]:
# Look at temperatures
print(temperatures)

# Index temperatures by city
temperatures_ind = temperatures.set_index("city")

# Look at temperatures_ind
print(temperatures_ind)

# Reset the index, keeping its contents
print(temperatures_ind.reset_index())

# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True))

         date     city        country  avg_temp_c
0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
               date        country  avg_temp_c
city                                          
Abidjan  2000-01-01  Côte D'Ivoire      27.293
Abidjan  2000-02-01  Côte D'Ivoire      27.685
Abidjan  2000-03-01  Côte D'Ivoire      29.061
Abidjan  2000-04-01  Côte D'Ivoire      28.162
Abidjan  2000-05-01  Côte D'Ivoire      27.547
      city        date        country  avg_temp_c
0  Abidjan  2000-01-01  Côte D'Ivoire      27.293
1  Abidjan  2000-02-01  Côte D'Ivoire      27.685
2  Abidjan  2000-03-01  Côte D'Ivoire      29.061
3  Abidjan  2000-04-01  Côte D'Ivoire      28.162
4  Abidjan  2000-05-01  Côte D'Ivoire      27.547
         date        country  avg_temp_c
0  2000-01-01  Côte D'Ivoire  

***

## Subsetting with .loc[]

The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.
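
To make the comparison concrete, here is a small sketch on invented data (the `toy` frame and the `wanted` list are assumptions for illustration). It also shows one difference that is easy to miss: the two approaches can return rows in a different order.

```python
import pandas as pd

# Hypothetical toy data; "Lahore" appears twice on purpose
toy = pd.DataFrame({
    "city": ["Cairo", "Lahore", "Moscow", "Lahore"],
    "avg_temp_c": [22.3, 24.1, 5.8, 25.0],
})
toy_ind = toy.set_index("city")
wanted = ["Moscow", "Lahore"]

# Square brackets: build a Boolean mask, rows keep their original order
by_brackets = toy[toy["city"].isin(wanted)]

# .loc[]: pass the labels directly, rows follow the order of the list
by_loc = toy_ind.loc[wanted]

print(by_brackets)
print(by_loc)
```

Both results contain the same three rows, but `.loc[]` returns the Moscow row first because "Moscow" comes first in `wanted`.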

pandas is loaded as pd. temperatures and temperatures_ind are available; the latter is indexed by city.

### Instructions

- Create a list of cities to subset on: Moscow and Saint Petersburg. Assign to cities.
- Use [] subsetting to filter temperatures for rows where the city column takes a value in cities.
- Use .loc[] subsetting to filter temperatures_ind for rows where the city is in cities.

In [None]:
# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]

# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])

# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

## Setting multi-level indexes

Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial you might have control and treatment groups. Each test subject belongs to one group or the other, so we can say that test subject is nested inside treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say city is nested inside country.

The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes and keep track of how your data is represented.
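
The sketch below illustrates the nesting idea on a made-up frame (all values are invented). A list of plain labels selects on the outer level, while a list of tuples selects specific outer/inner pairs.

```python
import pandas as pd

# Hypothetical toy data: cities nested inside countries
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Pakistan", "Pakistan"],
    "city": ["Rio De Janeiro", "Salvador", "Lahore", "Karachi"],
    "avg_temp_c": [25.0, 26.0, 24.0, 27.0],
})
toy_ind = toy.set_index(["country", "city"])

# Outer level only: every city in Brazil
outer = toy_ind.loc[["Brazil"]]

# Outer and inner level together: one tuple per row wanted
inner = toy_ind.loc[[("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]]

print(outer)
print(inner)
```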

pandas is loaded as pd. temperatures is available.

### Instructions

- Set the index of temperatures to the "country" and "city" columns, assigning to temperatures_ind.
- Specify two country/city pairs to keep: Brazil/Rio De Janeiro and Pakistan/Lahore, assigning to rows_to_keep.
- Subset for rows_to_keep using .loc[].

In [None]:
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country", "city"])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])

## Sorting by index values

Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need to use .sort_index().
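
A minimal sketch of the three common `.sort_index()` calls, using an invented two-level index (names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with an unsorted country/city index
toy_ind = pd.DataFrame({
    "country": ["Pakistan", "Brazil", "Pakistan", "Brazil"],
    "city": ["Lahore", "Salvador", "Karachi", "Rio De Janeiro"],
    "avg_temp_c": [24.0, 26.0, 27.0, 25.0],
}).set_index(["country", "city"])

# Default: sort by every level, outer to inner, ascending
default_sort = toy_ind.sort_index()

# Sort by one named level
by_city = toy_ind.sort_index(level="city")

# One sort direction per level: country ascending, city descending
mixed = toy_ind.sort_index(level=["country", "city"], ascending=[True, False])

print(default_sort)
print(by_city)
print(mixed)
```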

pandas is loaded as pd. temperatures_ind has a multi-level index of country and city, and is available.

### Instructions

- Sort temperatures_ind by the index values.
- Sort temperatures_ind by the index values at the "city" level.
- Sort temperatures_ind by ascending country then descending city.

In [None]:
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level=["city"]))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False]))

***

# Slicing and subsetting with .loc and .iloc

## Slicing lists

In [21]:
breeds = ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"]

In [22]:
breeds

['Labrador',
 'Poodle',
 'Chow Chow',
 'Schnauzer',
 'Labrador',
 'Chihuahua',
 'St. Bernard']

In [23]:
breeds[2:5]

['Chow Chow', 'Schnauzer', 'Labrador']

In [24]:
breeds[:3]

['Labrador', 'Poodle', 'Chow Chow']

In [25]:
breeds[:]

['Labrador',
 'Poodle',
 'Chow Chow',
 'Schnauzer',
 'Labrador',
 'Chihuahua',
 'St. Bernard']

## Sort the index before you slice

In [26]:
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
print(dogs_srt)

                       name  height_cm  weight_kg date_of_birth
breed       color                                              
Chihuahua   Tan      Stella         18          2    2015-04-20
Chow Chow   Brown      Lucy         46         22    2014-08-25
Labrador    Black       Max         59         29    2017-01-20
            Brown     Bella         56         25    2013-07-01
Poodle      Black   Charlie         43         23    2016-09-16
Schnauzer   Gray     Cooper         49         17    2011-12-11
St. Bernard White    Bernie         77         74    2018-02-27


## Slicing the outer index level

In [27]:
dogs_srt.loc["Chow Chow":"Poodle"]

breed,color,name,height_cm,weight_kg,date_of_birth
Chow Chow,Brown,Lucy,46,22,2014-08-25
Labrador,Black,Max,59,29,2017-01-20
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16


## Slicing the inner index levels badly

In [32]:
dogs_srt.loc["Tan":"Gray"]

breed,color,name,height_cm,weight_kg,date_of_birth


## Slicing the inner index levels correctly

In [33]:
dogs_srt.loc[
("Labrador", "Brown"):("Schnauzer", "Gray")]

breed,color,name,height_cm,weight_kg,date_of_birth
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16
Schnauzer,Gray,Cooper,49,17,2011-12-11


## Slicing columns

In [35]:
dogs_srt.loc[:, "name":"height_cm"]

breed,color,name,height_cm
Chihuahua,Tan,Stella,18
Chow Chow,Brown,Lucy,46
Labrador,Black,Max,59
Labrador,Brown,Bella,56
Poodle,Black,Charlie,43
Schnauzer,Gray,Cooper,49
St. Bernard,White,Bernie,77


## Slice twice

In [36]:
dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Gray"), "name":"height_cm"]

breed,color,name,height_cm
Labrador,Brown,Bella,56
Poodle,Black,Charlie,43
Schnauzer,Gray,Cooper,49


## Dog days

In [37]:
dogs = dogs.set_index("date_of_birth").sort_index()
print(dogs)

                  name        breed   color  height_cm  weight_kg
date_of_birth                                                    
2011-12-11      Cooper    Schnauzer    Gray         49         17
2013-07-01       Bella     Labrador   Brown         56         25
2014-08-25        Lucy    Chow Chow   Brown         46         22
2015-04-20      Stella    Chihuahua     Tan         18          2
2016-09-16     Charlie       Poodle   Black         43         23
2017-01-20         Max     Labrador   Black         59         29
2018-02-27      Bernie  St. Bernard   White         77         74


## Slicing by dates

In [40]:
dogs.loc["2014-08-25":"2016-09-16"]

date_of_birth,name,breed,color,height_cm,weight_kg
2014-08-25,Lucy,Chow Chow,Brown,46,22
2015-04-20,Stella,Chihuahua,Tan,18,2
2016-09-16,Charlie,Poodle,Black,43,23


## Slicing by partial dates

In [41]:
dogs.loc["2014":"2016"]

date_of_birth,name,breed,color,height_cm,weight_kg
2014-08-25,Lucy,Chow Chow,Brown,46,22
2015-04-20,Stella,Chihuahua,Tan,18,2
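
Charlie (born 2016-09-16) is missing from this result because date_of_birth was read from the CSV as plain text, so "2016" is compared as a string and sorts before "2016-09-16". The sketch below, on three invented rows, contrasts a text index with a real datetime index built with `pd.to_datetime()`.

```python
import pandas as pd

dates = ["2014-08-25", "2015-04-20", "2016-09-16"]
names = ["Lucy", "Stella", "Charlie"]

# Index holds plain strings: the slice uses alphabetical comparison
as_text = pd.DataFrame({"name": names}, index=dates)

# Index holds datetimes: "2016" is read as the whole year 2016
as_dates = pd.DataFrame({"name": names}, index=pd.to_datetime(dates))

text_names = as_text.loc["2014":"2016", "name"].tolist()
date_names = as_dates.loc["2014":"2016", "name"].tolist()

print(text_names)
print(date_names)
```

With the datetime index, the partial-date slice covers everything up to the end of 2016, so Charlie is included.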


## Subsetting by row/column number

In [43]:
print(dogs.iloc[2:5, 1:4])

                   breed  color  height_cm
date_of_birth                             
2014-08-25     Chow Chow  Brown         46
2015-04-20     Chihuahua    Tan         18
2016-09-16        Poodle  Black         43


***

## Slicing index values

Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values, or by row/column number; we'll start with the first case. This involves slicing inside the .loc[] method.

Compared to slicing lists, there are a few things to remember.

- You can only slice an index if the index is sorted (using .sort_index()).
- To slice at the outer level, first and last can be strings.
- To slice at inner levels, first and last should be tuples.
- If you pass a single slice to .loc[], it will slice the rows.

pandas is loaded as pd. temperatures_ind has country and city in the index, and is available.
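
The slicing rules above can be sketched on a small invented frame (names and values are assumptions for illustration). One further difference from list slicing is worth seeing in action: with `.loc[]`, the last label is included in the result.

```python
import pandas as pd

# Hypothetical toy data, indexed by country and city, then sorted
toy_srt = pd.DataFrame({
    "country": ["Russia", "India", "Pakistan", "Russia", "Pakistan"],
    "city": ["Moscow", "Delhi", "Lahore", "Saint Petersburg", "Karachi"],
    "avg_temp_c": [5.8, 25.0, 24.0, 5.0, 27.0],
}).set_index(["country", "city"]).sort_index()

# Outer level: plain strings, both end labels are included
outer = toy_srt.loc["Pakistan":"Russia"]

# Inner level: tuples of (outer, inner), both end tuples are included
inner = toy_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]

print(outer)
print(inner)
```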

### Instructions

- Sort the index of temperatures_ind.
- Use slicing with .loc[] to get these subsets:
  - from Pakistan to Russia.
  - from Lahore to Moscow. (This will return nonsense.)
  - from Pakistan, Lahore to Russia, Moscow.

In [None]:
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia
print(temperatures_srt.loc["Pakistan":"Russia"])

# Try to subset rows from Lahore to Moscow (this returns nonsense)
print(temperatures_srt.loc["Lahore":"Moscow"])

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[("Pakistan", "Lahore") : ("Russia", "Moscow")])

## Slicing in both directions

You've seen slicing DataFrames by rows and by columns, but since DataFrames are two dimensional objects it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.
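
As a hedged sketch on invented data (the `station` column is a hypothetical extra column added only to show that it gets excluded), a row slice and a column slice can be combined in a single `.loc[]` call:

```python
import pandas as pd

# Hypothetical toy data with a sorted country/city index
toy_srt = pd.DataFrame({
    "country": ["India", "India", "Iraq", "Iraq"],
    "city": ["Delhi", "Hyderabad", "Baghdad", "Basra"],
    "date": ["2000-01-01"] * 4,
    "avg_temp_c": [14.0, 23.0, 9.0, 12.0],
    "station": ["a", "b", "c", "d"],
}).set_index(["country", "city"]).sort_index()

# First argument slices rows, second argument slices columns
both = toy_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"]

print(both)
```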

pandas is loaded as pd. temperatures_srt is indexed by country and city, has a sorted index, and is available.

### Instructions

- Use .loc[] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
- Use .loc[] slicing to subset columns from date to avg_temp_c.
- Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c.

In [None]:
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])

# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, "date":"avg_temp_c"])

# Subset in both directions at once
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"])

## Slicing time series

Slicing is particularly useful for time series, since it's a common thing to want to filter for data within a date range. Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, yyyy-mm-dd.

Recall from Chapter 1 that you can combine multiple Boolean conditions using logical operators (such as &). To do so in one line of code you'll need to add parentheses () around each condition.
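
The following sketch compares the two approaches on five invented rows. It assumes the date column holds real datetimes (built here with `pd.to_datetime()`), which is what allows partial dates such as "2010" or "2010-08" to stand for a whole year or month.

```python
import pandas as pd

# Hypothetical toy data with one row just outside each end of the range
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2009-12-01", "2010-08-01", "2011-02-01", "2011-12-01", "2012-01-01",
    ]),
    "avg_temp_c": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Boolean conditions: each condition needs its own parentheses
by_bool = toy[(toy["date"] >= "2010-01-01") & (toy["date"] <= "2011-12-31")]

# Date index: sort first, then slice with partial dates
toy_ind = toy.set_index("date").sort_index()
by_loc = toy_ind.loc["2010":"2011"]
by_month = toy_ind.loc["2010-08":"2011-02"]

print(by_bool)
print(by_loc)
print(by_month)
```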

pandas is loaded as pd and temperatures, with no index, is available.

### Instructions

- Use Boolean conditions to subset for rows in 2010 and 2011, and print the results.
- Set the index to the date column.
- Use .loc[] to subset for rows in 2010 and 2011.
- Use .loc[] to subset for rows from Aug 2010 to Feb 2011.

In [None]:
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
print(temperatures[(temperatures["date"] >= "2010") & (temperatures["date"] < "2012")])

# Set date as an index and sort it (slicing requires a sorted index)
temperatures_ind = temperatures.set_index("date").sort_index()

# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])

# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])

## Subsetting by row/column number

The most common ways to subset rows are the ways we've previously discussed: using a Boolean condition, or by index labels. However, it is also occasionally useful to pass row numbers.

This is done using .iloc[], and like .loc[], it can take two arguments to let you subset by rows and columns.
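
A minimal sketch of positional subsetting on a made-up frame (values invented for illustration). Unlike `.loc[]` label slices, `.iloc[]` slices follow the usual Python rule and exclude the end position.

```python
import pandas as pd

# Hypothetical toy data with four columns at positions 0 to 3
toy = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2000-03-01"],
    "city": ["Abidjan", "Lahore", "Moscow"],
    "country": ["Côte D'Ivoire", "Pakistan", "Russia"],
    "avg_temp_c": [27.3, 14.0, -6.0],
})

cell = toy.iloc[1, 1]         # single value: 2nd row, 2nd column
first_two = toy.iloc[:2]      # rows at positions 0 and 1
last_cols = toy.iloc[:, 2:4]  # all rows, columns at positions 2 and 3
corner = toy.iloc[:2, 2:4]    # both at once

print(cell)
print(corner)
```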

pandas is loaded as pd. temperatures (without an index) is available.

### Instructions

Use .iloc[] on temperatures to take subsets.

- Get the 23rd row, 2nd column (index positions 22 and 1).
- Get the first 5 rows (index positions 0 to 5, end position excluded).
- Get all rows, columns at index positions 2 and 3 (slice 2 to 4, end position excluded).
- Get the first 5 rows, columns at index positions 2 and 3.

In [None]:
# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22, 1])

# Use slicing to get the first 5 rows
print(temperatures.iloc[:5])

# Use slicing to get columns 2 to 3
print(temperatures.iloc[:, 2:4])

# Use slicing in both directions at once
print(temperatures.iloc[:5, 2:4])

***

# Working with pivot tables

## A bigger dog dataset

In [None]:
print(dog_pack)

## Pivoting the dog pack

In [None]:
dogs_height_by_breed_vs_color = dog_pack.pivot_table("height_cm", index="breed", columns="color")
print(dogs_height_by_breed_vs_color)

## .loc[] + slicing is a power combo

In [None]:
dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]

## The axis argument

In [None]:
dogs_height_by_breed_vs_color.mean(axis="index")

## Calculating summary stats across columns

In [None]:
dogs_height_by_breed_vs_color.mean(axis="columns")

***

## Pivot temperature by city and year

It's interesting to see how temperatures for each city change over time. Looking at every month results in a big table that's tricky to reason about. Instead, let's look at how temperatures change by year.

You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year. This requires the column to hold datetime values rather than strings.

Once you have the year column, you can create a pivot table with the data aggregated by city and year, which you'll explore in the coming exercises.
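
As a sketch of these two steps on invented data (the values are assumptions for illustration), the year is extracted with `.dt.year` and then used as the pivot table's columns. By default `pivot_table()` aggregates with the mean, and combinations with no data become NaN.

```python
import pandas as pd

# Hypothetical toy data; the date column must be datetime for .dt to work
toy = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-02-01", "2001-01-01", "2000-01-01"]),
    "country": ["Egypt", "Egypt", "Egypt", "India"],
    "city": ["Cairo", "Cairo", "Cairo", "Delhi"],
    "avg_temp_c": [14.0, 16.0, 15.0, 13.0],
})

# Year component of each date, stored as integers
toy["year"] = toy["date"].dt.year

# Rows: country and city; columns: year; cells: mean of avg_temp_c
pivot = toy.pivot_table("avg_temp_c", index=["country", "city"], columns="year")

print(pivot)
```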

pandas is loaded as pd. temperatures is available.

### Instructions

- Add a year column to temperatures, from the year component of the date column.
- Make a pivot table of the avg_temp_c column, with country and city as rows, and year as columns. Assign to temp_by_country_city_vs_year, and look at the result.

In [None]:
# Add a year column to temperatures
temperatures["year"] = temperatures["date"].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c", index = ["country", "city"], columns = "year")

# See the result
print(temp_by_country_city_vs_year)

## Subsetting pivot tables

A pivot table is just a DataFrame with sorted indexes, so the techniques you have learned already can be used to subset them. In particular, the .loc[] + slicing combination is often helpful.
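
Here is a hedged sketch on a hand-built table shaped like a pivot result (all values invented). It assumes the year labels are integers, as they are when produced by `.dt.year`, so the column slice uses integers too.

```python
import pandas as pd

# Hypothetical pivot-like table: (country, city) rows, integer year columns
pivot = pd.DataFrame(
    [[14.0, 15.0, 16.0], [20.0, 21.0, 22.0], [25.0, 26.0, 27.0]],
    index=pd.MultiIndex.from_tuples(
        [("Egypt", "Cairo"), ("India", "Delhi"), ("Iraq", "Baghdad")],
        names=["country", "city"],
    ),
    columns=pd.Index([2004, 2005, 2006], name="year"),
).sort_index()

# Rows by tuple slice, columns by integer label slice (both ends included)
subset = pivot.loc[("Egypt", "Cairo"):("India", "Delhi"), 2005:2006]

print(subset)
```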

pandas is loaded as pd. temp_by_country_city_vs_year is available.

### Instructions

Use .loc[] on temp_by_country_city_vs_year to take subsets.

- From Egypt to India.
- From Egypt, Cairo to India, Delhi.
- From Egypt, Cairo to India, Delhi and 2005 to 2010.

In [None]:
# Subset for Egypt to India
temp_by_country_city_vs_year.loc["Egypt":"India"]

# Subset for Egypt, Cairo to India, Delhi
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi")]

# Subset in both directions at once (year labels are integers, so slice with integers)
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi"), 2005:2010]

## Calculating on a pivot table

Pivot tables are filled with summary statistics, but they are only a first step to finding something insightful. Often you'll need to perform further calculations on them. A common thing to do is to find the rows or columns where a highest or lowest value occurs.

Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside of square brackets. For example: series[series > value].
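
The pattern can be sketched on a tiny invented table (city names and values are assumptions for illustration). Calling `.mean()` with no arguments averages down each column, while `axis="columns"` averages across each row.

```python
import pandas as pd

# Hypothetical pivot-like table: cities as rows, years as columns
pivot = pd.DataFrame(
    {2000: [10.0, 20.0], 2001: [12.0, 26.0]},
    index=["Cairo", "Delhi"],
)

# One mean per year (averaging over cities)
mean_by_year = pivot.mean()
hottest_year = mean_by_year[mean_by_year == mean_by_year.max()]

# One mean per city (averaging over years)
mean_by_city = pivot.mean(axis="columns")
coldest_city = mean_by_city[mean_by_city == mean_by_city.min()]

print(hottest_year)
print(coldest_city)
```

Filtering with `==` keeps the result as a Series and returns every tied row, which is why it is used here rather than a single-label lookup.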

pandas is loaded as pd and the DataFrame temp_by_country_city_vs_year is available.

### Instructions

- Calculate the mean temperature for each year, assigning to mean_temp_by_year.
- Filter mean_temp_by_year for the year that had the highest mean temperature.
- Calculate the mean temperature for each city (across columns), assigning to mean_temp_by_city.
- Filter mean_temp_by_city for the city that had the lowest mean temperature.

In [None]:
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Find the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")

# Find the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])