# Data Manipulation with pandas

## Introduction

These are my notes for DataCamp's course [_Data Manipulation with pandas_](https://www.datacamp.com/courses/data-manipulation-with-pandas).

Presented by Maggie Matsui, Curriculum Manager at DataCamp, and Richie Cotton, Learning Solutions Architect at DataCamp. Collaborators are Amy Peterson, Adel Nehme, Alex Yarosh, and Justin Saddlemeyer.

Prerequisite:

- [_Intermediate Python_](../Intermediate%20Python/Intermediate%20Python.ipynb)

This course is part of these tracks:

- Data Analyst with Python
- Data Manipulation with Python
- Data Scientist with Python
- Data Scientist Professional with Python
- Python Programmer

## Versions

The course uses Python '3.9.7 (default, Sep 10 2021, 00:03:59) \n[GCC 7.5.0]'.

## Imports

Imports are placed here for convenience and clarity.

In [None]:
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Datasets

| Name | File |
| :--- | :--- |
| Avocado Prices | avoplotto.pkl |
| Walmart Sales | sales_subset.csv |
| Homelessness Data | homelessness.csv |
| Temperatures | temperatures.csv |

### Walmart Sales

In [None]:
# Load sales_subset.csv into a DataFrame.
sales = pd.read_csv("sales_subset.csv", index_col=0, parse_dates=["date"])
print(sales.head(), "\n")
print(sales.info(), "\n")
print(sales.shape, "\n")
print(sales.describe())

### Homelessness Data

In [None]:
# Load homelessness.csv into a DataFrame.
homelessness = pd.read_csv("homelessness.csv", index_col=0)
print(homelessness.head(), "\n")
print(homelessness.info(), "\n")
print(homelessness.shape, "\n")
print(homelessness.describe())

### Temperatures

In [None]:
# Load temperatures.csv into a DataFrame.
temperatures = \
    pd.read_csv(
        "temperatures.csv",
        index_col=None,
        usecols=[1, 2, 3, 4],
        parse_dates=[0])
print(temperatures.head(), "\n")
print(temperatures.info(), "\n")
print(temperatures.shape, "\n")
print(temperatures.describe())

## Transforming DataFrames

### Introducing DataFrames

Pandas is built on top of NumPy and Matplotlib. Pandas organizes data in rectangular or tabular form. Pandas often provides multiple ways of doing something.

#### Create a DataFrame from a Dictionary (Demonstration)

In [None]:
# Build the example dogs DataFrame.
# Extra: Make sure the values for date_of_birth have dtype datetime64[ns].
# Later in the course, the examples use "Grey" instead of "Gray".
date_of_birth = np.array(
    ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
        "2017-01-20", "2015-04-20", "2018-02-27"],
    dtype="datetime64[ns]")
print(date_of_birth)
print()

# Create the DataFrame and explore it.
data_dict = {
    "name" : ["Bella", "Charlie", "Lucy", "Cooper", "Max", 
              "Stella", "Bernie"],
    "breed" : ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador",
               "Chihuahua", "St. Bernard"],
    "color" : ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "height_cm" : [56, 43, 46, 49, 59, 18, 77],
    "weight_kg" : [25, 23, 22, 17, 29, 2, 74],
    "date_of_birth" : date_of_birth}
dogs = pd.DataFrame(data_dict)

# Explore the content of a new DataFrame.
print(dogs.head())
print()

# Print a summary of information about the DataFrame.
print(dogs.info())
print()

# Print the shape of the DataFrame (rows, columns).
print(dogs.shape)
print()

# Create summary statistics.
print(dogs.describe())
print()

# Get the values from the DataFrame as a NumPy array containing the
# rows of the DataFrame.
print(type(dogs.values))
print(dogs.values)
print()

# Get the names of the columns.
print(dogs.columns)
print()

# Get the names of the rows. Here, we have a RangeIndex object with row
# numbers since there aren't any row labels.
print(dogs.index)
print()

# Let Jupyter format the DataFrame nicely.
dogs

#### Create a DataFrame from a Tuple of Tuples (Extra)

In [None]:
# Build a DataFrame from a tuple of tuples (as if the data came from a
# database query). Note how the dates must be handled.
dogs_columns = ("name", "breed", "color", "height_cm", "weight_kg",
                "date_of_birth")
dogs_data = (
    ('Bella', 'Labrador', 'Brown', 56, 25, datetime.datetime.fromisoformat('2013-07-01')),
    ('Charlie', 'Poodle', 'Black', 43, 23, datetime.datetime.fromisoformat('2016-09-16')),
    ('Lucy', 'Chow Chow', 'Brown', 46, 22, datetime.datetime.fromisoformat('2014-08-25')),
    ('Cooper', 'Schnauzer', 'Gray', 49, 17, datetime.datetime.fromisoformat('2011-12-11')),
    ('Max', 'Labrador', 'Black', 59, 29, datetime.datetime.fromisoformat('2017-01-20')),
    ('Stella', 'Chihuahua', 'Tan', 18, 2, datetime.datetime.fromisoformat('2015-04-20')),
    ('Bernie', 'St. Bernard', 'White', 77, 74, datetime.datetime.fromisoformat('2018-02-27'))
)
dogs2 = pd.DataFrame(data=dogs_data, columns=dogs_columns)
dogs2

In [None]:
# Check equivalence of the dogs and dogs2 DataFrames.
print(dogs.equals(dogs2))

#### Load Homeless Data and Print Information (Exercise)

`homelessness` is a DataFrame containing estimates of homelessness in each U.S. state in 2018.

- The `individuals` column is the number of homeless individuals who are not part of a family with children.
- The `family_members` column is the number of homeless individuals who are part of a family with children.
- The `state_pop` column is the state's total population.

In [None]:
# The parts of a DataFrame are the values, index, and columns.
# Print each of the three parts.
print(homelessness.values)
print()
print(homelessness.columns)
print()
print(homelessness.index)

### Sorting and Subsetting

#### Sort Rows by Values in a Column (Demonstration)

In [None]:
# Sort rows by values in a column.
dogs.sort_values("weight_kg")

In [None]:
# Sort descending (ascending=False).
dogs.sort_values("weight_kg", ascending=False)

In [None]:
# Sort rows by multiple columns.
dogs.sort_values(["weight_kg", "height_cm"])

In [None]:
# Sort rows by multiple columns, ascending for one column and descending
# for another. This sorts by weight_kg in ascending order, then by
# height_cm in descending order.
dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])

#### Create Subsets from One or More Columns of a DataFrame

In [None]:
# Create a Series subset from a column of a DataFrame.
subset1 = dogs["name"]
print("type(subset1):", type(subset1))
subset1

In [None]:
# Create a DataFrame subset from a column of a DataFrame.
subset2 = dogs[["name"]]
print("type(subset2):", type(subset2))
subset2

In [None]:
# Create a DataFrame subset containing multiple columns.
# Subsetting by multiple columns requires [[ ... ]]. The inner
# [ ... ] is the list of column names. The outer [ ... ] performs the
# subsetting.
dogs[["breed", "height_cm"]]

In [None]:
# Subset using multiple columns, a second way that emphasizes
# how the [[ ... ]] work.
cols_to_subset = ["breed", "height_cm"]
dogs[cols_to_subset]

#### Subset Using a Boolean Filter Mask (Demonstration)

In [None]:
# Create a boolean filter mask.
dogs["height_cm"] > 50

In [None]:
# Create and apply the boolean filter mask.
dogs[dogs["height_cm"] > 50]

In [None]:
# Subset using a boolean filter mask based on text data.
dogs[dogs["breed"] == "Labrador"]

In [None]:
# Subset based on a date. Because date_of_birth has dtype datetime64[ns],
# pandas converts the string to a datetime before comparing.
dogs[dogs["date_of_birth"] < "2015-01-01"]
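
As a small check of how this comparison behaves, the sketch below uses a made-up three-row Series (not the course data). Comparing a datetime column with a date string gives the same mask as comparing it with an explicit `pd.Timestamp`, which suggests the string is converted to a datetime before the comparison.

```python
import pandas as pd

# Hypothetical birth dates, stored with a datetime dtype.
births = pd.Series(pd.to_datetime(["2013-07-01", "2016-09-16", "2011-12-11"]))

# Compare with a string and with an explicit Timestamp.
mask_from_string = births < "2015-01-01"
mask_from_timestamp = births < pd.Timestamp("2015-01-01")

print(births.dtype)
print(mask_from_string.tolist())
print(mask_from_string.equals(mask_from_timestamp))
```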

#### Subset Based on Multiple Conditions (Demonstration)

In [None]:
# Subset based on multiple conditions.
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"
print(is_lab & is_brown)
dogs[is_lab & is_brown]

In [None]:
# Subset based on multiple conditions, using one line of code.
dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == "Brown")]

#### Subset Based on Multiple Categories Using `.isin()` (Demonstration)

In [None]:
# Subset based on multiple categories using .isin().
is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
dogs[is_black_or_brown]

In [None]:
# Extra credit.
# Using | to produce the equivalent of .isin().
is_black_or_brown = (dogs["color"] == "Black") | (dogs["color"] == "Brown")
dogs[is_black_or_brown]

#### Sort a DataFrame by a Column and Show the First Few Rows (Exercise)

In [None]:
# Sort homelessness by the "individuals" column and show the first few rows.
homelessness_ind = homelessness.sort_values("individuals")
homelessness_ind.head()

In [None]:
# Sort homelessness by the "family_members" column descending and show the
# first few rows.
homelessness_fam = homelessness.sort_values("family_members", ascending=False)
homelessness_fam.head()

#### Sort a DataFrame by Two Columns, One Ascending, One Descending (Exercise)

In [None]:
# Sort homelessness by "region" ascending and then by "family_members"
# descending.
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])
homelessness_reg_fam.head()

#### Create a DataFrame Subset from One Column (Exercise)

In [None]:
# Extract the "individuals" column from homelessness.
# The course says the result is a DataFrame, but it is a Series.
individuals = homelessness["individuals"]
print(type(individuals))
individuals.head()

In [None]:
# Create a DataFrame containing the "individuals" column from homelessness.
individuals2 = homelessness[["individuals"]]
print(type(individuals2))
individuals2.head()

#### Create a DataFrame from a Subset of Multiple Columns (Exercise)

In [None]:
# Create a DataFrame from the "state" and "family_members" columns of
# homelessness.
state_fam = homelessness[["state", "family_members"]]
state_fam.head()

In [None]:
# Create a DataFrame from the "individuals" and "state" columns of
# homelessness.
ind_state = homelessness[["individuals", "state"]]
ind_state.head()

#### Create a DataFrame Subset Using a Boolean Filter (Exercise)

In [None]:
# Find the rows where individuals > 10000.
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]
ind_gt_10k.head()

In [None]:
# Filter for rows where region is "Mountain".
mountain_reg = homelessness[homelessness["region"] == "Mountain"]
mountain_reg

In [None]:
# Find the rows where family_members < 1000 and region == "Pacific".
fam_lt_1k_pac = homelessness[
    (homelessness["family_members"] < 1000)
    & (homelessness["region"] == "Pacific")]
fam_lt_1k_pac

In [None]:
# Find rows in homelessness where the region is "South Atlantic" or
# "Mid-Atlantic".
south_mid_atlantic = homelessness[homelessness["region"].isin(
    ["South Atlantic", "Mid-Atlantic"])]
south_mid_atlantic

In [None]:
# Find the rows where the state is in the Mojave Desert states.
canu = ["California", "Arizona", "Nevada", "Utah"]
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]
mojave_homelessness

### New Columns

#### Create New Columns from Existing Columns (Demonstration)

It is often necessary to create new columns in a DataFrame that are derived from existing columns to simplify further analysis. This process is called "mutating a DataFrame", "transforming a DataFrame", or "feature engineering".

In [None]:
dogs["height_m"] = dogs["height_cm"] / 100

# Now calculate the BMI for the dogs: kg / m ** 2.
# The numbers that come out show what a terrible measurement BMI is.
# The values should be similar for humans and dogs.
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2
dogs

#### Perform Multiple Manipulations of a DataFrame (Demonstration)

In [None]:
# Perform multiple manipulations by finding the names of skinny tall dogs.
# Find the rows for skinny dogs.
bmi_lt_100 = dogs[dogs["bmi"] < 100]

# Sort the rows by height descending to get the tallest dogs at the top.
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)

# Keep only the columns we're interested in.
bmi_lt_100_height[["name", "height_cm", "bmi"]]

#### Add Columns to a DataFrame (Exercise)

In [None]:
# Add "total" and "p_individuals" columns to homelessness.
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]
homelessness.head()

#### Analyze and Return Data from a DataFrame

In [None]:
# Which state has the highest number of homeless individuals per 10000 in
# the state?
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)
result = high_homelessness_srt[["state", "indiv_per_10k"]]
state = result.iloc[0]
state

## Aggregating DataFrames

### Summary Statistics

Summary statistics can be calculated on individual columns of a DataFrame.

```Python
# Summary statistics for all columns.
dogs.describe()
# The mean of one numerical column.
dogs["height_cm"].mean()
```

Other summary statistics include `.median()`, `.mode()`, `.max()`, `.min()`, `.var()`, `.std()`, `.sum()`, `.quantile()`. Summary statistics work for date columns.
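
To illustrate the point about date columns, here is a minimal sketch using a made-up Series of dates (the values are assumptions, not course data). `.min()` and `.max()` return the earliest and latest dates.

```python
import pandas as pd

# Hypothetical dates; pd.to_datetime gives the Series a datetime dtype.
dates = pd.Series(pd.to_datetime(["2013-07-01", "2016-09-16", "2011-12-11"]))

oldest = dates.min()
newest = dates.max()
print(oldest, newest)

# Subtracting two Timestamps gives a Timedelta.
print(newest - oldest)
```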

There is a method, `.agg()`, for custom summary statistics.

#### Using the `.agg()` Method for Custom Summary Statistics (Demonstration)

In [None]:
# This is a simple example of using the .agg() method.
# Single column result
print(dogs["weight_kg"])
def pct30(column):
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)

In [None]:
# Two-column result.
dogs[["weight_kg", "height_cm"]].agg(pct30)

In [None]:
# Use .agg() to compute multiple summary statistics.
def pct40(column):
    return column.quantile(0.4)

dogs[["weight_kg", "height_cm"]].agg([pct30, pct40])

#### Calculate Cumulative Statistics (Demonstration)

pandas provides these cumulative statistics methods: `.cumsum()`, `.cummax()`, `.cummin()`, and `.cumprod()`.

In [None]:
# Calculate the cumulative sum.
dogs["weight_kg"].cumsum()
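
The other cumulative methods work the same way. This sketch uses small made-up Series (not the `dogs` data) so the running values are easy to follow.

```python
import pandas as pd

# Hypothetical weights in kg, not the course data.
weights = pd.Series([25, 23, 22, 29])

running_total = weights.cumsum()    # running total
running_max = weights.cummax()      # largest value seen so far
running_min = weights.cummin()      # smallest value seen so far
running_product = pd.Series([1, 2, 3, 4]).cumprod()  # running product

print(running_total.tolist())
print(running_max.tolist())
print(running_min.tolist())
print(running_product.tolist())
```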

#### Mean and Median (Exercise)

In [None]:
# Get information about the sales DataFrame.
print(sales.head())
print(sales.info())
# Print the mean and median of weekly_sales.
print(sales["weekly_sales"].mean())
print(sales["weekly_sales"].median())

#### Summarizing Dates (Exercise)

Note that the sales["date"] column has dtype `datetime64[ns]`.

In [None]:
# Print the minimum and maximum of the date column.
print(sales["date"].min())
print(sales["date"].max())

#### Efficient Summary Statistics Using `.agg()` (Exercise)

> The `.agg()` method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. 

In [None]:
# Create and Apply a Function to Calculate Inter-Quartile Range
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print("75th percentile:", sales["temperature_c"].quantile(0.75))
print("25th percentile:", sales["temperature_c"].quantile(0.25))
print("inter-quartile range:", sales["temperature_c"].agg(iqr))
print("inter-quartile range:")
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr))
print("inter-quartile range, median:")
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, "median"]))

#### Cumulative Statistics (Exercise)

Cumulative statistics calculations return a column of values, as demonstrated below.

In [None]:
# Calculate some cumulative statistics for Store 1 Department 1.
# First, create a DataFrame with data for store 1 department 1.
# Use parentheses when using the & operator.
# Sort by date.
sales_1_1 = sales[(sales["store"] == 1) & (sales["department"] == 1)]
sales_1_1 = sales_1_1.sort_values("date")

# Add new columns for cumulative weekly sales.
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

### Counting

Categorical data can be counted. The video's example involves counting the number of different dogs that have visited a veterinarian. The data set contains the names and breeds of the dogs who have visited. We need to remove duplicates before counting, using
```Python
unique_dogs = vet_visits.drop_duplicates(subset="name")
```

In the example, two dogs of different breeds have the name "Max". Drop duplicates using
```Python
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
```

Count the dogs by breed using
```Python
unique_dogs["breed"].value_counts()
```
The counts are sorted in descending order by default; passing `sort=True` makes this explicit:
```Python
unique_dogs["breed"].value_counts(sort=True)
```
The values can be normalized like this:
```Python
unique_dogs["breed"].value_counts(normalize=True)
```
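
The `vet_visits` data is not included in these notes, so the sketch below builds a tiny stand-in DataFrame (the rows are made up) to show the steps above end to end, including why both `name` and `breed` are needed when two dogs share a name.

```python
import pandas as pd

# Hypothetical vet visits; two different dogs are both named "Max".
vet_visits = pd.DataFrame({
    "name": ["Bella", "Max", "Max", "Bella", "Max"],
    "breed": ["Labrador", "Labrador", "Chow Chow", "Labrador", "Labrador"],
})

# Dropping duplicates on name alone loses the second Max.
by_name = vet_visits.drop_duplicates(subset="name")

# Dropping duplicates on name and breed keeps both dogs named Max.
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

breed_counts = unique_dogs["breed"].value_counts()
breed_props = unique_dogs["breed"].value_counts(normalize=True)
print(len(by_name), len(unique_dogs))
print(breed_counts)
print(breed_props)
```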

#### Dropping Duplicates (Exercise)

In [None]:
# The course drops duplicates on the full DataFrame using subset=..., which
# keeps every column. My alternative (commented out below) subsets the
# columns first and then drops duplicates, which keeps only those columns.
# Both approaches produce DataFrame objects here.
# Drop duplicate store/type combinations.
# store_types = sales[["store", "type"]].drop_duplicates()
store_types = sales.drop_duplicates(subset=["store", "type"])
print("store_types.head():")
print(store_types.head())
print()

# Drop duplicate store/department combinations.
# store_depts = sales[["store", "department"]].drop_duplicates()
store_depts = sales.drop_duplicates(subset=["store", "department"])
print("store_depts.head():")
print(store_depts.head())
print()

# Subset the rows where is_holiday is True, drop duplicate dates, and
# print the "date" column.
# My approach, where holiday_dates1 is a Series:
holiday_dates1 = sales[sales["is_holiday"] == True]["date"].drop_duplicates()
print("holiday_dates1:", holiday_dates1)
print()

# The course's solution to the exercise, where holiday_dates2 is a DataFrame.
holiday_dates2 = sales[sales["is_holiday"] == True].drop_duplicates(subset=["date"])
print("holiday_dates2:", holiday_dates2["date"])

#### Counting Categorical Variables (Exercise)

In [None]:
# Count the number of stores of each type.
store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type.
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of each department number and sort.
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort.
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

#### Sort and Count (Extra)

In [None]:
# I experimented with sorting and counting.
# I wanted to know the counts for the store and type combinations,
# sorted by store and then by type.
# .value_counts() should be named .count_values().
store_types = sales[["store", "type"]].sort_values(["store", "type"])
print(store_types.info())
# This returns a pandas.Series with a MultiIndex.
# Watch out! sort=True by default.
store_types_value_counts = store_types.value_counts(sort=False)
print(type(store_types_value_counts))
print(store_types_value_counts)
print(store_types_value_counts.index)

# Alternatively, specify the subset as part of the call to .value_counts().
store_types_value_counts2 = \
    sales.sort_values(["store", "type"]).\
        value_counts(subset=["store", "type"], sort=False)
print(store_types_value_counts.equals(store_types_value_counts2))

### Grouped Summary Statistics

#### Summaries by Group (Demonstration)

Use the `.groupby()` method to calculate summary statistics for each group without filtering for each group by hand.

In [None]:
# Create summaries by group, doing it the hard way.
print(dogs[dogs["color"] == "Black"]["weight_kg"].mean())
print(dogs[dogs["color"] == "Brown"]["weight_kg"].mean())
print(dogs[dogs["color"] == "Gray"]["weight_kg"].mean())
print(dogs[dogs["color"] == "Tan"]["weight_kg"].mean())
print(dogs[dogs["color"] == "White"]["weight_kg"].mean())

In [None]:
# Use .groupby(), which creates a DataFrameGroupBy object.
print(dogs.groupby("color"))
print(dogs.groupby("color")["weight_kg"].mean())

In [None]:
# Use .agg() to get multiple grouped statistics.
print(dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum"]))

In [None]:
# Group by multiple columns.
print(dogs.groupby(["color", "breed"])["weight_kg"].mean())

In [None]:
# Group by and aggregate by multiple columns.
print(dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean())
print()

# When using .agg(), pass "mean". The result is the same.
print(dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].agg("mean"))
print()

# Calculate statistics.
print(dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].agg(
    ["mean", "min", "max", "sum"]))

#### Grouped Summary Statistics Without Using `.groupby()` (Exercise)

From the course:

> While .groupby() is useful, you can calculate grouped summary statistics without it.

> Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C." In this exercise, you'll calculate the total sales made at each store type, without using .groupby(). You can then use these numbers to see what proportion of Walmart's total sales were made at each type.

In [None]:
# Calculate sales by type of store.
# The type of sales_all, sales_A, sales_B, sales_C is numpy.float64.
# This enables the division of the items of a list by a scalar.
sales_all = sales["weekly_sales"].sum()
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()
print(type(sales_all))
# For the vectorized division to work, the divisor must be a NumPy scalar
# (here, numpy.float64); NumPy then converts the list to an array.
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)
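
A minimal sketch of the list division used above, with made-up numbers. It suggests that the NumPy scalar on the right-hand side is what makes the division work, since a plain Python list does not support `/` on its own.

```python
import numpy as np

values = [10.0, 30.0, 60.0]   # a plain Python list of floats

# Dividing by a NumPy scalar works: NumPy converts the list to an array.
props = values / np.float64(100.0)
print(type(props), props)

# Dividing by a built-in float fails, because lists do not define "/".
try:
    values / 100.0
    plain_division_failed = False
except TypeError:
    plain_division_failed = True
print(plain_division_failed)
```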

#### Grouped Summary Statistics Using `.groupby()` (Exercise)

In [None]:
# Make the same calculations using .groupby().
# This was my approach:
print(sales.groupby("type")["weekly_sales"].sum() / sales["weekly_sales"].sum())
print()
# This was the course's approach:
sales_by_type = sales.groupby("type")["weekly_sales"].sum()
print(sales_by_type / sum(sales_by_type))
print()

# Extra: Three different ways to calculate the sum of sales_by_type:
print(sales_by_type.sum())
print(sum(sales_by_type))
print(np.sum(sales_by_type))
print()

# Group by "type" and "is_holiday" and calculate sales proportions.
sales_by_type_is_holiday = sales.groupby(["type", "is_holiday"])["weekly_sales"].sum()
print(sales_by_type_is_holiday)
print(sales_by_type_is_holiday / sales_by_type_is_holiday.sum())
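
The ways of summing in the cell above agree for this data. As a hedged aside, the Series method and the built-in `sum()` can differ when values are missing. The sketch uses a made-up Series with one `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical totals with one missing value.
totals = pd.Series([100.0, np.nan, 50.0])

method_sum = totals.sum()   # Series.sum() skips NaN by default
builtin_sum = sum(totals)   # the built-in adds every element, including NaN

print(method_sum)
print(builtin_sum)
```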

#### Multiple Grouped Summaries (Exercise)

In [None]:
# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby("type")["weekly_sales"]\
    .agg(["min", "max", "mean", "median"])

# Print sales_stats
print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l:
# get min, max, mean, and median
unemp_fuel_stats = sales.groupby("type")\
    [["unemployment", "fuel_price_usd_per_l"]]\
    .agg(["min", "max", "mean", "median"])

# Print unemp_fuel_stats
print(unemp_fuel_stats)

### Pivot Tables

Use the `.pivot_table()` method to create the equivalent of a spreadsheet pivot table. Use the `index` parameter to specify the column(s) you want to group by. Use the `values` parameter to specify the column you want to summarize. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html

#### Simple Pivot Table (Demonstration)

In [None]:
# Using .groupby() and .pivot_table() to achieve the same result.
# For a pivot table, the default for aggfunc is "mean".
# This data set is too small.
# The output from pivot_table shows the values column and the summary
# statistic method.
# One pivot, one values column, one summary statistic.
# Pass a list to aggfunc to get its name included in the output table.
print(dogs.groupby("color")["weight_kg"].mean())
print()
print(dogs.pivot_table(index="color", values="weight_kg", aggfunc="mean"))
print()
print(dogs.pivot_table(index="color", values="weight_kg", aggfunc=["mean"]))
print()

#### Pivot Table with Multiple Summary Statistics (Demonstration)

In [None]:
# Multiple summary statistics.
print(dogs.groupby("color")["weight_kg"].agg(["mean", "median"]))
print()
print(dogs.pivot_table(index="color", values="weight_kg", aggfunc=["mean", "median"]))
print()

#### Pivot on Multiple Columns (Demonstration)

In [None]:
# Pivot on multiple columns.
print(dogs.groupby(["color", "breed"])["weight_kg"].mean())
print()
print(dogs.groupby(["color", "breed"])["weight_kg"].agg(["mean"]))
print()
print(dogs.pivot_table(index=["color", "breed"], values="weight_kg", aggfunc="mean"))
print()

#### Pivot on Multiple Columns with Multiple Statistics (Demonstration)

In [None]:
# Pivot on multiple columns with multiple summary statistics.
print(dogs.groupby(["color", "breed"])["weight_kg"].agg(["mean", "median"]))
print()
print(dogs.pivot_table(index=["color", "breed"], values="weight_kg", aggfunc=["mean", "median"]))

#### Pivot Table Specifying Columns (Demonstration)

In [None]:
# Pivot the table to put breeds into the columns.
print(dogs.pivot_table(index="color", values="weight_kg", columns="breed", aggfunc=["mean"]))
print()
# Multiple summary statistics.
print(dogs.pivot_table(index="color", values="weight_kg", columns="breed", aggfunc=["mean", "median"]))
print()

#### Fill Missing Values in a Pivot Table (Demonstration)

Replace `NaN` with `0` in a pivot table. Personally, I prefer seeing the `NaN` values.

In [None]:
# Replace NaN with 0 in the pivot table.
print(dogs.pivot_table(index="color", values="weight_kg", columns="breed", aggfunc=["mean"], fill_value=0))

#### Summary Statistics for Rows and Columns (Demonstration)

Use the `margins=True` argument to get summary statistics for rows and columns in the table margins.

> The `margins` parameter is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: it gives the row and column totals of the pivot table contents.

In [None]:
# Get summary statistics for rows and columns.
print(dogs.pivot_table(index="color", values="weight_kg", 
    columns="breed", aggfunc=["mean"], fill_value=0, margins=True))
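
To see what the margins contain, this sketch uses a small made-up DataFrame (not the `dogs` data) and passes `aggfunc` as a string rather than a list, which keeps the column labels simple. The `All` cells are computed from the underlying rows, so the bottom-right cell matches the overall mean.

```python
import pandas as pd

# Hypothetical data, not the course's dogs DataFrame.
pets = pd.DataFrame({
    "color": ["Black", "Black", "Brown", "Brown"],
    "breed": ["Labrador", "Poodle", "Labrador", "Labrador"],
    "weight_kg": [29, 23, 25, 27],
})

table = pets.pivot_table(
    index="color", columns="breed", values="weight_kg",
    aggfunc="mean", fill_value=0, margins=True)
print(table)

# The "All" row and column hold means of the underlying rows, so the
# bottom-right cell is the mean of every weight in the data.
overall = table.loc["All", "All"]
print(overall, pets["weight_kg"].mean())
```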

#### Pivot on One Variable (Exercise)

In [None]:
# Get the mean weekly sales for store type using a pivot table.
mean_sales_by_type = sales.pivot_table(
    index="type", values="weekly_sales", aggfunc="mean")
print(mean_sales_by_type)
print()

# Get the mean and median weekly sales for store type.
mean_med_sales_by_type = sales.pivot_table(
    index="type", values="weekly_sales", aggfunc=["mean", "median"])
print(mean_med_sales_by_type)
print()

# Get the mean weekly sales for store type and holiday.
mean_sales_by_type_holiday = sales.pivot_table(
    index=["type", "is_holiday"], values="weekly_sales", aggfunc=["mean"])
print(mean_sales_by_type_holiday)
print()
# Pivot the above table to put is_holiday in columns.
mean_sales_by_type_holiday2 = sales.pivot_table(
    index="type", columns="is_holiday", values="weekly_sales", aggfunc=["mean"])
print(mean_sales_by_type_holiday2)

#### Fill in Missing Values and Get Row and Column Summary Statistics (Exercise)

In [None]:
# Print the mean weekly_sales by department and type, replacing missing values
# with 0.
print(sales.pivot_table(
    index="department", columns="type", values="weekly_sales", fill_value=0,
    aggfunc="mean"))
print()
# Print the mean weekly sales by department and type, replacing missing values
# with 0, and providing summary means for rows and columns.
print(sales.pivot_table(
    index="department", columns="type", values="weekly_sales", fill_value=0, 
    aggfunc="mean", margins=True))

## Slicing and Indexing DataFrames

### Explicit Indexes

#### Setting a Column as an Index (Demonstration)

Use the `set_index` method to set the index from a column of the DataFrame.

In [None]:
# Set the index to the names column.
# In the example, the color for Cooper has changed from "Gray" to "Grey".
print(dogs)
print()
dogs_ind = dogs.set_index("name")
print(dogs_ind)

#### Remove an Index (Demonstration)

Remove an index (restoring the values to a column) using the `reset_index` method.

In [None]:
# Reset the index.
dogs_ind.reset_index()

#### Drop an Index (Demonstration)

Drop an index using the `reset_index` method with the argument `drop=True`.

In [None]:
# Drop an index.
dogs_ind.reset_index(drop=True)

#### Indexes Make Subsetting Code Cleaner (Demonstration)

In [None]:
# Subset using the names column.
print(dogs[dogs["name"].isin(["Bella", "Stella"])])
# Subset using the index, which contains the names of the dogs.
print(dogs_ind.loc[["Bella", "Stella"]])

#### Index Values Don't Need to Be Unique (Demonstration)

In [None]:
# Create an Index from the breed Column.
# There are two rows with the index "Labrador".
dogs_ind2 = dogs.set_index("breed")
print(dogs_ind2)
print()
print(dogs_ind2.loc["Labrador"])
print()
print(dogs_ind2.loc[["Labrador"]])
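
One consequence of repeated index values is that the type returned by `.loc[]` can depend on how many rows match. The sketch below uses a made-up three-row DataFrame; `.index.is_unique` is a quick way to check for repeats.

```python
import pandas as pd

# Hypothetical data with a repeated breed.
pets = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Labrador"],
    "name": ["Bella", "Charlie", "Max"],
}).set_index("breed")

print(pets.index.is_unique)          # False: "Labrador" appears twice

labs = pets.loc["Labrador"]          # repeated label: a DataFrame with 2 rows
poodle = pets.loc["Poodle"]          # label appears once: a single row as a Series
poodle_df = pets.loc[["Poodle"]]     # list of labels: always a DataFrame
print(type(labs), type(poodle), type(poodle_df))
```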

#### Set Multi-level (Hierarchical) Indexes (Demonstration)

The idea is that the inner level is nested in the outer level.

In [None]:
# Create a hierarchical index using two columns.
dogs_ind3 = dogs.set_index(["breed", "color"])
print(dogs_ind3)

#### Subset the Outer Level with a List (Demonstration)

In [None]:
# Subset using the outer level of a hierarchical index.
print(dogs_ind3.loc[["Labrador", "Chihuahua"]])

#### Subset Inner Levels with a List of Tuples (Demonstration)

In [None]:
# Subset a DataFrame on the inner level of a hierarchical index by passing
# a list of (outer, inner) tuples.
print(dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]])

#### Sorting by Index Values (Demonstration)

Rows of a DataFrame can be sorted by values using the `.sort_values()` method. Rows of a DataFrame can also be sorted by index values using the `.sort_index()` method.

In [None]:
# Sort by index values.
print(dogs_ind3.sort_index())
print()
# Sort by "color" ascending first, then by "breed" descending.
print(dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False]))

#### Problems with Indexes

Using indexes causes three problems:
- Indexes are just data, and storing data in multiple forms makes it harder to think about
- Using indexes violates these tidy data principles (https://vita.had.co.nz/papers/tidy-data.pdf):
    - Data is stored in tabular form such as a DataFrame
    - Each row of the DataFrame contains a single observation, where each variable is stored in its own column; indexes violate this rule since the data values don't get their own column
- In pandas, the syntax for working with indexes is different from the syntax for working with columns, making the code more complicated and introducing more bugs

This is an example of a tidy table, taken from the paper, where each column defines a variable, each row contains an observation, and each type of observational unit forms a table.

<table>
    <tr>
      <th>name</th>
      <th>trt</th>
      <th>result</th>
    </tr>
    <tr>
        <td>John Smith</td>
        <td>a</td>
        <td style="text-align: right;">-</td>
    </tr>
    <tr>
        <td>Jane Doe</td>
        <td>a</td>
        <td style="text-align: right;">16</td>
    </tr>
    <tr>
        <td>Mary Johnson</td>
        <td>a</td>
        <td style="text-align: right;">3</td>
    </tr>
    <tr>
        <td>John Smith</td>
        <td>b</td>
        <td style="text-align: right;">2</td>
    </tr>
    <tr>
        <td>Jane Doe</td>
        <td>b</td>
        <td style="text-align: right;">11</td>
    </tr>
    <tr>
        <td>Mary Johnson</td>
        <td>b</td>
        <td style="text-align: right;">1</td>
    </tr>
</table>
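
The tidy table above can be built as a DataFrame. This is a sketch that assumes the "-" entry means a missing value, stored here as `NaN`. With one variable per column, grouping by treatment is a one-liner, and the wide layout is shown only for comparison.

```python
import numpy as np
import pandas as pd

# The tidy table above; the "-" result is assumed to mean a missing value.
tidy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"] * 2,
    "trt": ["a", "a", "a", "b", "b", "b"],
    "result": [np.nan, 16, 3, 2, 11, 1],
})

# Tidy form makes filtering and grouping straightforward.
mean_by_trt = tidy.groupby("trt")["result"].mean()
print(mean_by_trt)

# For comparison, a wide layout with one column per treatment.
wide = tidy.pivot(index="name", columns="trt", values="result")
print(wide)
```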

#### Setting and Removing Indexes (Exercise)

In [None]:
# Carry out the exercise.
print(temperatures)
print()
# Index temperatures by city
temperatures_ind = temperatures.set_index("city")
# Look at temperatures_ind
print(temperatures_ind)
print()
# Reset the index, keeping its contents
print(temperatures_ind.reset_index())
print()
# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True))

#### Subsetting with `.loc[]` (Exercise)

Once a DataFrame is indexed by the values you want to filter on, subsetting with `.loc[]` is more concise than filtering on a column with `.isin()`.

In [None]:
# Subsetting two ways. The second DataFrame is indexed
# by "city".
cities = ["Moscow", "Saint Petersburg"]
print(temperatures[temperatures["city"].isin(cities)])
print()
print(temperatures_ind.loc[cities])

#### Setting Multi-Level Indexes (Exercise)

In [None]:
# Index temperatures by country & city.
temperatures_ind = temperatures.set_index(["country", "city"])

# Create a list of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

# Subset for rows to keep.
print(temperatures_ind.loc[rows_to_keep])

#### Sorting by Index Values (Exercise)

Use `.sort_values()` when sorting using columns; use `.sort_index()` when sorting using an index.

In [None]:
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())
print()

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level=["city"]))
print()

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False]))
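
The difference between the two sorting methods can be seen on a small example. This is a minimal sketch; the DataFrame `df` and its values are hypothetical and are not taken from the temperatures dataset.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"city": ["Lahore", "Moscow", "Abidjan"], "avg_temp_c": [25.1, 4.8, 27.3]}
)

# Sort by a column with .sort_values().
by_column = df.sort_values("avg_temp_c")
print(by_column)
print()

# Sort by the index with .sort_index() after moving "city" into it.
by_index = df.set_index("city").sort_index()
print(by_index)
```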

### Slicing and Subsetting with `.loc` and `.iloc`

DataFrames can be sliced by index values or by row/column numbers.

#### Slicing a List (Demonstration)

In [None]:
# Slice a list.
breeds = ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"]
print(breeds[2:5])
print(breeds[:3])
print(breeds[:])
print(breeds)

#### Slice a DataFrame after Calling `.sort_index()` (Demonstration)

In [None]:
# Review the dogs DataFrame.
print(dogs)
print()

# There can be problems if you don't sort the index first.
try:
    print(dogs.set_index(["breed", "color"]).loc["Chow Chow":"Poodle"])
except Exception as ex:
    print("Exception:", ex)
print()

# Add an index and sort before slicing.
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
print(dogs_srt)
print()
# Slice using the outer level. The last item is included!
print(dogs_srt.loc["Chow Chow":"Poodle"])
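
The same behavior can be reproduced without the course data. The sketch below uses a hypothetical stand-in named `dogs_demo` with made-up values. It assumes the failure on the unsorted index surfaces as `pandas.errors.UnsortedIndexError`, which subclasses `KeyError`, so it catches `KeyError`.

```python
import pandas as pd

# Hypothetical stand-in for the dogs DataFrame; values are illustrative only.
dogs_demo = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "color": ["Brown", "Black", "Brown", "Gray"],
    "height_cm": [56, 43, 46, 49],
})
unsorted = dogs_demo.set_index(["breed", "color"])

# Slicing an unsorted multi-level index fails with a KeyError subclass.
try:
    unsorted.loc["Chow Chow":"Poodle"]
    raised = False
except KeyError as ex:
    raised = True
    print(type(ex).__name__, ex)

# After sorting, the slice works and includes the "Poodle" endpoint.
sliced = unsorted.sort_index().loc["Chow Chow":"Poodle"]
print(sliced)
```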

#### Slicing the Inner Index Levels Badly (Demonstration)

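A slice of plain strings is matched against the outer index level, not the inner one. Pandas does not raise an exception to let you know that you've done this incorrectly; it silently returns an empty DataFrame.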

In [None]:
# This does not work; it returns an empty DataFrame.
print(dogs_srt.loc["Tan":"Gray"])
print()
# To slice correctly, you must provide two tuples.
print(dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Gray")])
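
Here is a self-contained sketch of the same pitfall. The DataFrame `dogs_demo` is a hypothetical stand-in with made-up values, already indexed by breed and color and sorted.

```python
import pandas as pd

# Hypothetical stand-in for dogs_srt; values are illustrative only.
dogs_demo = pd.DataFrame({
    "breed": ["Chow Chow", "Labrador", "Labrador", "Poodle", "Schnauzer"],
    "color": ["Brown", "Black", "Brown", "Black", "Gray"],
    "height_cm": [46, 59, 56, 43, 49],
}).set_index(["breed", "color"]).sort_index()

# A plain string slice is matched against the outer level (breed), so
# the colors find nothing and no exception is raised.
wrong = dogs_demo.loc["Tan":"Gray"]
print(len(wrong))

# Tuples name a value at every level, so the slice behaves as intended.
right = dogs_demo.loc[("Labrador", "Brown"):("Schnauzer", "Gray")]
print(right)
```

Checking `len()` or `.empty` on the result is one way to catch the silent failure.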

#### Slicing Columns (Demonstration)

Since DataFrames are two-dimensional, you can also slice columns.

In [None]:
# Slice columns.
print(dogs_srt.loc[:, "name":"height_cm"])

#### Slicing Rows and Columns at the Same Time (Demonstration)

In [None]:
# Slice rows and columns.
print(dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Gray"), "name":"height_cm"])

#### Slicing Using a Range of Dates (Demonstration)

An important use case for slicing is using a range of dates.

In [None]:
# Create an index and sort it.
dogs_bd = dogs.set_index("date_of_birth").sort_index()
print(dogs_bd)
print()
# Slice the DataFrame by date of birth.
print(dogs_bd.loc["2014-08-25":"2016-09-16"])

#### Slice by Partial Dates (Demonstration)

For this to work correctly, the index containing the dates must have a datetime dtype such as "datetime64[ns]"; with plain strings, the slice bounds are compared alphabetically.

In [None]:
# Slice by partial dates.
print(dogs_bd.loc["2014":"2016"])
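
To see why the dtype matters, the sketch below slices the same dates twice, once as a datetime index and once as plain strings. The dates and the names `dt_ind` and `str_ind` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical birth dates for illustration only.
dates = ["2013-07-01", "2014-08-25", "2016-09-16", "2017-01-20"]

# With a datetime index, "2016" covers the whole year, so 2016-09-16 is kept.
dt_ind = pd.DataFrame({"n": range(4)}, index=pd.to_datetime(dates))
print(dt_ind.loc["2014":"2016"])
print()

# With plain strings the comparison is alphabetical: "2016-09-16" sorts
# after "2016", so that row is dropped.
str_ind = pd.DataFrame({"n": range(4)}, index=dates)
print(str_ind.loc["2014":"2016"])
```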

#### Subsetting by Row or Column Number (Demonstration)

Use the `.iloc[]` indexer to subset by row and column numbers. Unlike `.loc[]`, it excludes the final position of a slice, just as list slicing does.

In [None]:
# Use row and column numbers to create a subset.
print(dogs.iloc[2:5])
print()
print(dogs.iloc[:, 1:4])
print()
print(dogs.iloc[2:5, 1:4])
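
The two indexers treat the end of a slice differently, which is easy to miss. The sketch below contrasts them on a hypothetical DataFrame `df` whose names and heights are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"height_cm": [56, 43, 46, 49]},
    index=["Bella", "Charlie", "Lucy", "Cooper"],
).sort_index()

# .iloc[] slices by position and excludes the end position, like a list.
by_position = df.iloc[1:3]
print(by_position)
print()

# .loc[] slices by label and includes the end label.
by_label = df.loc["Charlie":"Lucy"]
print(by_label)
```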

#### Slicing Index Values (Exercise)

In [None]:
# Create an indexed DataFrame for the exercise.
temperatures_ind = temperatures.set_index(["country", "city"])
# Sort the index of temperatures_ind.
temperatures_srt = temperatures_ind.sort_index()
# Subset rows from Pakistan to Russia.
print(temperatures_srt.loc["Pakistan":"Russia"])
# Try to subset rows from Lahore to Moscow. The slice is matched against
# the outer (country) level, so the results are not what is wanted.
print(temperatures_srt.loc["Lahore":"Moscow"])
# Subset rows from Pakistan, Lahore to Russia, Moscow.
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")])

#### Slicing in Both Directions (Exercise)

In [None]:
# Subset rows from India, Hyderabad to Iraq, Baghdad.
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])

# Subset columns from date to avg_temp_c.
print(temperatures_srt.loc[:, "date":"avg_temp_c"])

# Subset in both directions at once
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"])

### Working with Pivot Tables

## Creating and Visualizing DataFrames

### Visualizing Your Data

### Missing Values

### Creating DataFrames

### Reading and Writing CSVs

### Wrap-Up