
# 📊 Data Manipulation with pandas — Summary Statistics (Complete Notes)

**Source:** DataCamp  
**Topic:** Summary statistics, `.agg()`, cumulative methods

---

## 🐶 Step 1: Dataset Setup

We start with a small dataset about dogs: each row holds a dog's name, breed, color, height, weight, and date of birth.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    "date_of_birth": pd.to_datetime(["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11", "2017-01-20", "2015-04-20", "2018-02-27"])
})
```

---

## 📈 Step 2: Summary Statistics on Numeric Columns

Summary statistics give us a quick overview of the data's central tendencies and spread.

```python
# Mean height
dogs["height_cm"].mean()
```

🟢 **Output:** `49.714285714285715` (average height)

Other useful statistics:

```python
dogs["height_cm"].median()    # Middle value
dogs["height_cm"].mode()      # Most frequent value(s), returned as a Series
dogs["height_cm"].min()       # Smallest height
dogs["height_cm"].max()       # Largest height
dogs["height_cm"].var()       # Variance (spread of values)
dogs["height_cm"].std()       # Standard deviation
dogs["height_cm"].sum()       # Total height
dogs["height_cm"].quantile(0.25)  # 25th percentile (1st quartile)
```
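One detail worth checking by hand: `.mode()` returns a Series rather than a single number, because several values can tie for most frequent. A minimal sketch, rebuilding only the two numeric columns from Step 1 so that it runs on its own:

```python
import pandas as pd

# Same numbers as the dogs DataFrame in Step 1
dogs = pd.DataFrame({
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

weight_modes = dogs["weight_kg"].mode()         # one entry: 24 appears three times
height_modes = dogs["height_cm"].mode()         # all heights are unique, so all 7 come back
height_median = dogs["height_cm"].median()      # middle of the 7 sorted heights
height_q25 = dogs["height_cm"].quantile(0.25)   # interpolated between 43 and 46

print(weight_modes.tolist(), len(height_modes), height_median, height_q25)
```

If you need a single number from `.mode()`, you have to pick one explicitly, for example with `.iloc[0]`.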

---

## 🕰️ Step 3: Summarizing Dates

You can also use summary functions on datetime columns.

```python
dogs["date_of_birth"].min()  # Earliest birth date (the oldest dog)
dogs["date_of_birth"].max()  # Latest birth date (the youngest dog)
```

🟢 **Output:**

```
2011-12-11
2018-02-27
```
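Because `.min()` and `.max()` on a datetime column return timestamps, you can subtract them to see how much time the data covers. A small sketch using the birth dates from Step 1 (the variable names are only for illustration):

```python
import pandas as pd

birth_dates = pd.Series(pd.to_datetime([
    "2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
    "2017-01-20", "2015-04-20", "2018-02-27",
]))

# Subtracting two Timestamps gives a Timedelta
age_gap = birth_dates.max() - birth_dates.min()
print(age_gap.days)  # days between the oldest and youngest dog's birth dates
```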

---

## 🧮 Step 4: Custom Aggregation using `.agg()`

`.agg()` allows you to apply a custom function to a column.

```python
def pct30(column):
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)
```

🟢 **Output:** `22.599999999999998`
📌 This gives the 30th percentile of the weight column.
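To see where `22.6` comes from, the sketch below redoes the calculation by hand. It assumes pandas' default behaviour for `.quantile()`, which is linear interpolation between the two nearest sorted values.

```python
import pandas as pd

weights = pd.Series([24, 24, 24, 17, 29, 2, 74], name="weight_kg")

ordered = sorted(weights)               # [2, 17, 24, 24, 24, 29, 74]
position = 0.3 * (len(ordered) - 1)     # about 1.8, between index 1 and index 2
lower = int(position)                   # 1
fraction = position - lower             # about 0.8
manual_pct30 = ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

print(manual_pct30)            # 17 + 0.8 * (24 - 17), roughly 22.6
print(weights.quantile(0.3))   # pandas agrees up to floating-point rounding
```

This also explains the long decimal in the output: it is ordinary floating-point rounding, not a different statistic.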

---

## 📊 Step 5: Aggregation on Multiple Columns

You can also call `.agg()` on a selection of several columns. The function is then applied to each column separately:

```python
dogs[["weight_kg", "height_cm"]].agg(pct30)
```

🟢 **Output:**

```
weight_kg     22.6
height_cm     45.4
dtype: float64
```
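The same pattern works for any function that turns a column into one number. As a sketch, here is the interquartile range (75th percentile minus 25th percentile) applied to both numeric columns of the dogs data:

```python
import pandas as pd

dogs = pd.DataFrame({
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# .agg() calls iqr once per selected column and returns a Series
spread = dogs[["weight_kg", "height_cm"]].agg(iqr)
print(spread)
```

The result is indexed by column name, so `spread["weight_kg"]` picks out a single value.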

---

## 🔁 Step 6: Multiple Aggregation Functions

You can even apply **more than one** function at once.

```python
def pct40(column):
    return column.quantile(0.4)

dogs["weight_kg"].agg([pct30, pct40])
```

🟢 **Output:**

```
pct30    22.6
pct40    24.0
Name: weight_kg, dtype: float64
```
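Multiple functions and multiple columns can be combined. In the sketch below, a custom function is mixed with the built-in name `"median"`; the result should be a DataFrame with one row per function and one column per selected column.

```python
import pandas as pd

dogs = pd.DataFrame({
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

def pct30(column):
    return column.quantile(0.3)

# Rows are labelled by function name, columns by the original column names
stats = dogs[["weight_kg", "height_cm"]].agg([pct30, "median"])
print(stats)
```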

---

## 📈 Step 7: Cumulative Sums

Cumulative methods return one value per row: the running result up to and including that row. This makes it easy to follow how a total builds up in row order.

```python
dogs["weight_kg"].cumsum()
```

🟢 **Output:**

```
0     24
1     48
2     72
3     89
4    118
5    120
6    194
Name: weight_kg, dtype: int64
```
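A running total depends on row order, so it is usually worth sorting first. The sketch below sorts the dogs by date of birth before taking the cumulative sum; the column name `cum_weight_kg` is made up for this example.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    "date_of_birth": pd.to_datetime([
        "2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
        "2017-01-20", "2015-04-20", "2018-02-27",
    ]),
})

# Oldest dog first, then accumulate weights in that order
by_age = dogs.sort_values("date_of_birth")
by_age["cum_weight_kg"] = by_age["weight_kg"].cumsum()
print(by_age[["name", "date_of_birth", "weight_kg", "cum_weight_kg"]])
```

The final value is still 194, the plain `.sum()`, but the intermediate values differ from the unsorted version.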

---

## ➕ Step 8: Other Cumulative Methods

pandas provides additional cumulative methods:

```python
dogs["weight_kg"].cummax()   # Running max
dogs["weight_kg"].cummin()   # Running min
dogs["weight_kg"].cumprod()  # Cumulative product
```

These return a value for each row instead of a single summary number.
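A quick sketch of what the running max and running min look like on the weight column from Step 1:

```python
import pandas as pd

weights = pd.Series([24, 24, 24, 17, 29, 2, 74], name="weight_kg")

running_max = weights.cummax()   # largest value seen so far
running_min = weights.cummin()   # smallest value seen so far

print(running_max.tolist())  # [24, 24, 24, 24, 29, 29, 74]
print(running_min.tolist())  # [24, 24, 24, 17, 17, 2, 2]
```

The last entry of each result equals the ordinary `.max()` or `.min()` of the whole column.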

---

## 🏪 Step 9: Walmart Dataset Preview (Next Lesson Intro)

You’ll later work on a **Walmart weekly sales dataset** with columns like:

* `store`, `type`, `dept`, `date`, `weekly_sales`
* Environmental and economic data: `temp_c`, `fuel_price`, `unemp`, `is_holiday`

```python
walmart_sales.head()
```

🟢 **Output (example):**

```
  store type  dept       date  weekly_sales  is_holiday  temp_c  fuel_price  unemp
0     1    A     1 2010-02-05      24924.50       False    5.73       0.679  8.106
```

---

## 🧠 Summary - What You Learned
* ✅ Use `.mean()`, `.median()`, `.std()` etc. to understand numeric data
* ✅ Use `.min()`, `.max()` for datetimes to find oldest/newest entries
* ✅ Create and apply custom functions with `.agg()`
* ✅ Apply `.agg()` on one or many columns
* ✅ Use multiple functions inside `.agg([func1, func2])`
* ✅ Compute running stats with `.cumsum()`, `.cummax()`, etc.
* ✅ Prepped to handle a real-world dataset (Walmart sales)



In [None]:
# Mean and median
# Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.

# sales is available and pandas is loaded as pd.

# Instructions
# 100 XP
# Explore your new DataFrame first by printing the first few rows of the sales DataFrame.
# Print information about the columns in sales.
# Print the mean of the weekly_sales column.
# Print the median of the weekly_sales column.

# Print the head of the sales DataFrame
print(sales.head())

# Print the info about the sales DataFrame
print(sales.info())

# Print the mean of weekly_sales
print(sales['weekly_sales'].mean())

# Print the median of weekly_sales
print(sales['weekly_sales'].median())


#    store type  department       date  weekly_sales  is_holiday  temperature_c  fuel_price_usd_per_l  unemployment
# 0      1    A           1 2010-02-05      24924.50       False       5.727778              0.679451         8.106
# 1      1    A           1 2010-03-05      21827.90       False       8.055556              0.693452         8.106
# 2      1    A           1 2010-04-02      57258.43       False      16.816667              0.718284         7.808
# 3      1    A           1 2010-05-07      17413.94       False      22.527778              0.748928         7.808
# 4      1    A           1 2010-06-04      17558.09       False      27.050000              0.714586         7.808
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10774 entries, 0 to 10773
# Data columns (total 9 columns):
#  #   Column                Non-Null Count  Dtype         
# ---  ------                --------------  -----         
#  0   store                 10774 non-null  int64         
#  1   type                  10774 non-null  object        
#  2   department            10774 non-null  int32         
#  3   date                  10774 non-null  datetime64[ns]
#  4   weekly_sales          10774 non-null  float64       
#  5   is_holiday            10774 non-null  bool          
#  6   temperature_c         10774 non-null  float64       
#  7   fuel_price_usd_per_l  10774 non-null  float64       
#  8   unemployment          10774 non-null  float64       
# dtypes: bool(1), datetime64[ns](1), float64(4), int32(1), int64(1), object(1)
# memory usage: 599.8+ KB
# None
# 23843.95014850566
# 12049.064999999999


In [1]:
# Exercise
# Summarizing dates
# Summary statistics can also be calculated on date columns that have values with the data type datetime64. Some summary statistics — like mean — don't make a ton of sense on dates, but others are super helpful, for example, minimum and maximum, which allow you to see what time range your data covers.

# sales is available and pandas is loaded as pd.

# Instructions
# 100 XP
# Print the maximum of the date column.
# Print the minimum of the date column.

# Print the maximum of the date column
print(sales['date'].max())

# Print the minimum of the date column
print(sales['date'].min())



# <script.py> output:
#     2012-10-26 00:00:00
#     2010-02-05 00:00:00


In [None]:
# Exercise
# Efficient summaries
# While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

# The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

# df['column'].agg(function)
# In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.

# sales is available and pandas is loaded as pd.

# Instructions 1/3
# 35 XP
# 1 Use the custom iqr function defined for you along with .agg() to print the IQR of the temperature_c column of sales.

# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(sales['temperature_c'].agg(iqr))

# 16.583333333333336



# Instructions 2/3
# 35 XP
# 2 Update the column selection to use the custom iqr function with .agg() to print the IQR of temperature_c, fuel_price_usd_per_l, and unemployment, in that order.

# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", 'fuel_price_usd_per_l', 'unemployment']].agg(iqr))

# temperature_c           16.583333
# fuel_price_usd_per_l     0.073176
# unemployment             0.565000
# dtype: float64

# 3. Update the aggregation functions called by .agg(): include iqr and "median" in that order.
# Create a custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, 'median']))


# <script.py> output:
#             temperature_c  fuel_price_usd_per_l  unemployment
#     iqr         16.583333              0.073176         0.565
#     median      16.966667              0.743381         8.099



In [None]:
# Exercise
# Cumulative statistics
# Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

# A DataFrame called sales_1_1 has been created for you, which contains the sales data for department 1 of store 1. pandas is loaded as pd.

# Instructions
# 100 XP
# Sort the rows of sales_1_1 by the date column in ascending order.
# Get the cumulative sum of weekly_sales and add it as a new column of sales_1_1 called cum_weekly_sales.
# Get the cumulative maximum of weekly_sales, and add it as a column called cum_max_sales.
# Print the date, weekly_sales, cum_weekly_sales, and cum_max_sales columns.

# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values('date')

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()


# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])





# <script.py> output:
#              date  weekly_sales  cum_weekly_sales  cum_max_sales
#     0  2010-02-05      24924.50          24924.50       24924.50
#     1  2010-03-05      21827.90          46752.40       24924.50
#     2  2010-04-02      57258.43         104010.83       57258.43
#     3  2010-05-07      17413.94         121424.77       57258.43
#     4  2010-06-04      17558.09         138982.86       57258.43
#     5  2010-07-02      16333.14         155316.00       57258.43
#     6  2010-08-06      17508.41         172824.41       57258.43
#     7  2010-09-03      16241.78         189066.19       57258.43
#     8  2010-10-01      20094.19         209160.38       57258.43
#     9  2010-11-05      34238.88         243399.26       57258.43
#     10 2010-12-03      22517.56         265916.82       57258.43
#     11 2011-01-07      15984.24         281901.06       57258.43


# 🐶 Counting & Avoiding Double Counting (Data Manipulation with pandas)

#### 📝 Slide Summary:
We’re working with a `vet_visits` DataFrame. Some dogs appear multiple times due to repeat visits (e.g., Max, Stella). To get an accurate count of breeds, we must avoid double-counting.

#### 🎙️ Transcript Breakdown:
- First, we want to count how many dogs of each **breed** visited the vet.
- But some dogs (like Max) visited multiple times, and appear in multiple rows — so a simple count will overestimate.
- We fix this by dropping duplicates. First, by name (`drop_duplicates(subset="name")`), but that can wrongly drop different dogs with the same name.
- Better: drop duplicates using **both name and breed** — this way, we only keep one entry for each unique dog.
- After cleaning, we use `value_counts()` to count how many dogs of each breed, and `normalize=True` to see proportions.

---

#### 💻 Step 1: View raw data
```python
print(vet_visits)
```

✅ *Shows multiple entries for some dogs, e.g., Max the Labrador and Max the Chow Chow*

---

#### 💻 Step 2: Drop duplicate names only

```python
vet_visits.drop_duplicates(subset="name")
```

📤 Output (sample):

```
         date     name        breed  weight_kg
0  2018-09-02    Bella     Labrador      24.87
1  2019-06-07      Max    Chow Chow      24.01
2  2019-03-19  Charlie       Poodle      24.95
3  2018-01-17   Stella    Chihuahua       1.51
4  2019-10-19     Lucy    Chow Chow      24.07
7  2019-03-30   Cooper    Schnauzer      16.91
10 2019-01-04   Bernie  St. Bernard      74.98
```

⚠️ Max the Labrador is missing! Two dogs named Max exist, but different breeds.

---

#### 💻 Step 3: Drop duplicate (name, breed) pairs

```python
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
print(unique_dogs)
```

📤 Output:

```
         date     name        breed  weight_kg
0  2018-09-02    Bella     Labrador      24.87
1  2019-03-13      Max    Chow Chow      24.13
2  2019-03-19  Charlie       Poodle      24.95
3  2018-01-17   Stella    Chihuahua       1.51
4  2019-10-19     Lucy    Chow Chow      24.07
6  2019-06-07      Max     Labrador      28.35
7  2019-03-30   Cooper    Schnauzer      16.91
10 2019-01-04   Bernie  St. Bernard      74.98
```

✅ Now each dog appears only once per unique name–breed combo.
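The difference between the two `subset` choices is easy to reproduce on a tiny table. The visit log below is hypothetical (it is not the course's `vet_visits` data): it has two different dogs named Max, and Max the Labrador visits twice.

```python
import pandas as pd

# Hypothetical visit log, one row per visit
vet_visits = pd.DataFrame({
    "name":  ["Bella", "Max", "Max", "Max", "Stella", "Stella"],
    "breed": ["Labrador", "Chow Chow", "Labrador", "Labrador",
              "Chihuahua", "Chihuahua"],
})

by_name = vet_visits.drop_duplicates(subset="name")
by_name_breed = vet_visits.drop_duplicates(subset=["name", "breed"])

print(len(by_name))        # 3 rows: Max the Labrador is lost
print(len(by_name_breed))  # 4 rows: one per distinct dog
```

By default `drop_duplicates()` keeps the first row of each duplicate group, which is why the surviving Max in `by_name` is the Chow Chow.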

---

#### 💻 Step 4: Count dog breeds

```python
unique_dogs["breed"].value_counts()
```

📤 Output:

```
Labrador       2
Chow Chow      2
Schnauzer      1
St. Bernard    1
Poodle         1
Chihuahua      1
Name: breed, dtype: int64
```

📌 Labrador and Chow Chow each have a count of 2. Since duplicates were already dropped, these are two distinct dogs per breed, not repeat visits.

---

#### 💻 Step 5: Proportions of dog breeds

```python
unique_dogs["breed"].value_counts(normalize=True)
```

📤 Output:

```
Labrador       0.250
Chow Chow      0.250
Schnauzer      0.125
St. Bernard    0.125
Poodle         0.125
Chihuahua      0.125
Name: breed, dtype: float64
```

🔍 This tells us that 25% of the unique dogs visiting this vet are Labradors, and 25% are Chow Chows.
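`normalize=True` is simply each count divided by the total number of rows. A minimal sketch, using the breeds of the eight unique dogs listed in Step 3:

```python
import pandas as pd

# Breeds of the eight unique dogs from Step 3
breeds = pd.Series(
    ["Labrador", "Chow Chow", "Poodle", "Chihuahua",
     "Chow Chow", "Labrador", "Schnauzer", "St. Bernard"],
    name="breed",
)

counts = breeds.value_counts()
props = breeds.value_counts(normalize=True)

# Dividing the counts by the number of rows gives the same proportions
manual_props = counts / len(breeds)
print(props["Labrador"], manual_props["Labrador"])  # 0.25 0.25
```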

---

✅ **Takeaway:** Use `.drop_duplicates(subset=["name", "breed"])` to avoid counting the same dog twice. Then `.value_counts()` and `.value_counts(normalize=True)` help summarize categorical data — both as counts and as proportions.






In [None]:
# Exercise
# Dropping duplicates
# Removing duplicates is an essential skill to get accurate counts because often, you don't want to count the same thing multiple times. In this exercise, you'll create some new DataFrames using unique values from sales.

# sales is available and pandas is imported as pd.

# Instructions
# 100 XP
# Remove rows of sales with duplicate pairs of store and type and save as store_types and print the head.
# Remove rows of sales with duplicate pairs of store and department and save as store_depts and print the head.
# Subset the rows that are holiday weeks using the is_holiday column, and drop the duplicate dates, saving as holiday_dates.
# Select the date column of holiday_dates, and print.


# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=['store', 'type'])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=['store', 'department'])
print(store_depts.head())

# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = sales[sales['is_holiday']].drop_duplicates(subset='date')

# Print date col of holiday_dates
print(holiday_dates['date'])


# <script.py> output:
#           store type  department       date  weekly_sales  is_holiday  temperature_c  fuel_price_usd_per_l  unemployment
#     0         1    A           1 2010-02-05      24924.50       False       5.727778              0.679451         8.106
#     901       2    A           1 2010-02-05      35034.06       False       4.550000              0.679451         8.324
#     1798      4    A           1 2010-02-05      38724.42       False       6.533333              0.686319         8.623
#     2699      6    A           1 2010-02-05      25619.00       False       4.683333              0.679451         7.259
#     3593     10    B           1 2010-02-05      40212.84       False      12.411111              0.782478         9.765
#         store type  department       date  weekly_sales  is_holiday  temperature_c  fuel_price_usd_per_l  unemployment
#     0       1    A           1 2010-02-05      24924.50       False       5.727778              0.679451         8.106
#     12      1    A           2 2010-02-05      50605.27       False       5.727778              0.679451         8.106
#     24      1    A           3 2010-02-05      13740.12       False       5.727778              0.679451         8.106
#     36      1    A           4 2010-02-05      39954.04       False       5.727778              0.679451         8.106
#     48      1    A           5 2010-02-05      32229.38       False       5.727778              0.679451         8.106
#     498    2010-09-10
#     691    2011-11-25
#     2315   2010-02-12
#     6735   2012-09-07
#     6810   2010-12-31
#     6815   2012-02-10
#     6820   2011-09-09
#     Name: date, dtype: datetime64[ns]

In [None]:
# Exercise
# Counting categorical variables
# Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. In this exercise, you'll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:

# # Drop duplicate store/type combinations
# store_types = sales.drop_duplicates(subset=["store", "type"])

# # Drop duplicate store/department combinations
# store_depts = sales.drop_duplicates(subset=["store", "department"])
# The store_types and store_depts DataFrames you created in the last exercise are available, and pandas is imported as pd.

# Instructions
# 100 XP
# Count the number of stores of each store type in store_types.
# Count the proportion of stores of each store type in store_types.
# Count the number of stores of each department in store_depts, sorting the counts in descending order.
# Count the proportion of stores of each department in store_depts, sorting the proportions in descending order.



# Count the number of stores of each type
store_counts = store_types['type'].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of stores for each department and sort
dept_counts_sorted = store_depts['department'].value_counts().sort_values(ascending=False)
print(dept_counts_sorted)

# Get the proportion of stores in each department and sort
dept_props_sorted = store_depts['department'].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

# <script.py> output:
#     type
#     A    11
#     B     1
#     Name: count, dtype: int64
#     type
#     A    0.916667
#     B    0.083333
#     Name: proportion, dtype: float64
#     department
#     1     12
#     3     12
#     5     12
#     6     12
#     7     12
#           ..
#     37    10
#     48     8
#     50     6
#     39     4
#     43     2
#     Name: count, Length: 80, dtype: int64
#     department
#     1     0.012917
#     55    0.012917
#     72    0.012917
#     71    0.012917
#     67    0.012917
#             ...   
#     37    0.010764
#     48    0.008611
#     50    0.006459
#     39    0.004306
#     43    0.002153
#     Name: proportion, Length: 80, dtype: float64


# 📊 Grouped Summary Statistics (Data Manipulation with pandas)

#### 📝 Slide Summary:
Instead of summarizing the whole dataset at once, we can use **grouped summaries** to compare subsets — like average weight by dog color or breed. This is much easier using `.groupby()` instead of writing multiple filter statements manually.

---

#### 🎙️ Transcript Breakdown:
- Until now, we used `.mean()`, `.sum()`, etc., across entire columns.
- But to answer questions like **"Which color of dog is heaviest?"** or **"What’s the average size for each breed?"**, we need to summarize by **groups**.
- At first, we could filter each color manually and compute the mean, but that’s repetitive and error-prone.
- Using `.groupby()` allows us to group the data by one or more columns, and then apply summary functions (e.g., `mean`, `min`, `sum`).
- `.agg()` lets us get **multiple summaries at once**, and we can even group by **multiple variables** like color + breed.

---

#### 💻 Step 1: Manual summary by color (inefficient)
```python
dogs[dogs["color"] == "Black"]["weight_kg"].mean()
dogs[dogs["color"] == "Brown"]["weight_kg"].mean()
dogs[dogs["color"] == "White"]["weight_kg"].mean()
dogs[dogs["color"] == "Gray"]["weight_kg"].mean()
dogs[dogs["color"] == "Tan"]["weight_kg"].mean()
```

📤 Output (sample):

```
26.5
24.0
74.0
17.0
2.0
```

⚠️ Tedious and repetitive — better to use `.groupby()`!

---

#### 💻 Step 2: Grouped mean weight by color

```python
dogs.groupby("color")["weight_kg"].mean()
```

📤 Output:

```
color
Black    26.5
Brown    24.0
Gray     17.0
Tan       2.0
White    74.0
Name: weight_kg, dtype: float64
```

✅ Much simpler — just one line for all groups!
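As a sanity check, the sketch below computes the per-color means both ways on the dogs data from the earlier notes and compares them. Only the two columns needed are rebuilt here.

```python
import pandas as pd

dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# One call handles every color
grouped = dogs.groupby("color")["weight_kg"].mean()

# The manual filtering approach, written as a loop for comparison
manual = {
    color: dogs[dogs["color"] == color]["weight_kg"].mean()
    for color in dogs["color"].unique()
}

print(grouped["Black"], manual["Black"])  # 26.5 26.5
```

With `.groupby()`, a new color in the data is picked up automatically, whereas the manual version needs another line per color.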

---

#### 💻 Step 3: Multiple grouped summaries (min, max, sum)

```python
dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum"])
```

📤 Output:

```
         min  max  sum
color                  
Black     24   29   53
Brown     24   24   48
Gray      17   17   17
Tan        2    2    2
White     74   74   74
```

🛠️ `.agg()` lets you compute several summary statistics for each group in one call.
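Not covered in these notes, but closely related: pandas also supports "named aggregation", where each output column gets its own name and its own (column, function) pair. The output names below (`min_weight`, `max_weight`, `mean_height`) are made up for this sketch.

```python
import pandas as pd

dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# keyword = (source column, aggregation name)
summary = dogs.groupby("color").agg(
    min_weight=("weight_kg", "min"),
    max_weight=("weight_kg", "max"),
    mean_height=("height_cm", "mean"),
)
print(summary)
```

This is handy when different columns need different statistics, since the result has flat, readable column names.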

---

#### 💻 Step 4: Group by multiple columns (color + breed)

```python
dogs.groupby(["color", "breed"])["weight_kg"].mean()
```

📤 Output:

```
color  breed      
Black  Labrador       29.0
       Poodle         24.0
Brown  Chow Chow      24.0
       Labrador       24.0
Gray   Schnauzer      17.0
Tan    Chihuahua       2.0
White  St. Bernard    74.0
Name: weight_kg, dtype: float64
```

🔍 Now we see breed-specific weights grouped by color!
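Grouping by two columns produces a two-level index. The sketch below shows a few ways to work with it, assuming the dogs data from the earlier notes (only the needed columns are rebuilt):

```python
import pandas as pd

dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

by_color_breed = dogs.groupby(["color", "breed"])["weight_kg"].mean()

# Look up one (color, breed) pair with a tuple
print(by_color_breed.loc[("Black", "Labrador")])

# Look up every breed within one color
print(by_color_breed.loc["Brown"])

# Turn the index levels back into ordinary columns
flat = by_color_breed.reset_index()
print(flat.columns.tolist())
```

Only combinations that actually occur in the data get a row, so there is one row per dog here.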

---

#### 💻 Step 5: Many groups + many summaries (weight & height)

```python
dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
```

📤 Output:

```
                   weight_kg  height_cm
color breed                            
Black Labrador          29.0       59.0
      Poodle            24.0       43.0
Brown Chow Chow         24.0       46.0
      Labrador          24.0       56.0
Gray  Schnauzer         17.0       49.0
Tan   Chihuahua          2.0       18.0
White St. Bernard       74.0       77.0
```

📊 This gives us multi-dimensional summaries for complex comparisons.

---

✅ **Takeaway:**
Use `.groupby()` to summarize data by category. Combine with `.agg()` for multiple stats, and group by more than one column to analyze complex relationships (e.g., color + breed). This is more powerful, cleaner, and avoids manual filtering.




In [None]:
# Exercise
# What percent of sales occurred at each store type?
# While .groupby() is useful, you can calculate grouped summary statistics without it.

# Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C." In this exercise, you'll calculate the total sales made at each store type, without using .groupby(). You can then use these numbers to see what proportion of Walmart's total sales were made at each type.

# sales is available and pandas is imported as pd.

# Instructions
# 100 XP
# Calculate the total weekly_sales over the whole dataset.
# Subset for type "A" stores, and calculate their total weekly sales.
# Do the same for type "B" and type "C" stores.
# Combine the A/B/C results into a list, and divide by sales_all to get the proportion of sales by type.


# Calc total weekly sales
sales_all = sales["weekly_sales"].sum()

# Subset for type A stores, calc total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

# Subset for type B stores, calc total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()

# Subset for type C stores, calc total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)


# <script.py> output:
#     [0.9097747 0.0902253 0.       ]

In [None]:
# Exercise
# Calculations with .groupby()
# The .groupby() method makes life much easier. In this exercise, you'll perform the same calculations as last time, except you'll use the .groupby() method. You'll also perform calculations on data grouped by two variables to see if sales differ by store type depending on if it's a holiday week or not.

# sales is available and pandas is loaded as pd.

# Instructions 1/2
# 50 XP
# 1. Group sales by "type", take the sum of "weekly_sales", and store as sales_by_type.
# Calculate the proportion of sales at each store type by dividing by the sum of sales_by_type. Assign to sales_propn_by_type.

# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type / sum(sales_by_type)
print(sales_propn_by_type)

# type
# A    0.909775
# B    0.090225
# Name: weekly_sales, dtype: float64


# Instructions 2/2
# 50 XP
# 2. Group sales by "type" and "is_holiday", take the sum of weekly_sales, and store as sales_by_type_is_holiday

# From previous step
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Group by type and is_holiday; calc total weekly sales
sales_by_type_is_holiday = sales.groupby(['type', 'is_holiday'])['weekly_sales'].sum()
print(sales_by_type_is_holiday)


# <script.py> output:
#     type  is_holiday
#     A     False         2.336927e+08
#           True          2.360181e+04
#     B     False         2.317678e+07
#           True          1.621410e+03
#     Name: weekly_sales, dtype: float64

In [None]:
# Multiple grouped summaries
# Earlier in this chapter, you saw that the .agg() method is useful to compute multiple statistics on multiple variables. It also works with grouped data. You can use built-in functions like min, max, mean, and median.

# sales is available and pandas is imported as pd.

# Instructions
# 100 XP
# Get the min, max, mean, and median of weekly_sales for each store type using .groupby() and .agg(). Store this as sales_stats.
# Get the min, max, mean, and median of unemployment and fuel_price_usd_per_l for each store type. Store this as unemp_fuel_stats.

# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby('type')['weekly_sales'].agg(['min', 'max', 'mean', 'median'])

# Print sales_stats
print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = sales.groupby('type')[['unemployment','fuel_price_usd_per_l']].agg(['min', 'max', 'mean', 'median'])

# Print unemp_fuel_stats
print(unemp_fuel_stats)



#          min        max          mean    median
# type                                           
# A    -1098.0  293966.05  23674.667242  11943.92
# B     -798.0  232558.51  25696.678370  13336.08
#      unemployment                         fuel_price_usd_per_l                              
#               min    max      mean median                  min       max      mean    median
# type                                                                                        
# A           3.879  8.992  7.972611  8.067             0.664129  1.107410  0.744619  0.735455
# B           7.170  9.765  9.279323  9.199             0.760023  1.107674  0.805858  0.803348


# 📈 Pivot Tables (Data Manipulation with pandas)

#### 📝 Slide Summary:
Pivot tables offer a spreadsheet-style way to compute grouped summaries — but directly in pandas. They work similarly to `groupby()`, but give more structured, tabular output and support advanced options like filling missing data or calculating multiple statistics at once.

---

#### 🎙️ Transcript Breakdown:
- **Pivot tables** help calculate grouped summaries just like in Excel.
- As an alternative to `.groupby()`, `pivot_table()` groups and summarizes in a single call and adds options such as `fill_value` and `margins`.
- The `values` parameter is the numeric column to summarize, and `index` is what to group by.
- `aggfunc` lets us choose which summary function to apply (default is `mean`; others include `median`, `sum`, etc.).
- We can also pivot on **two variables** using `columns`.
- Missing values (NaNs) can be replaced using `fill_value`.
- `margins=True` adds summary rows and columns (like totals/averages).

---

#### 💻 Step 1: Mean weight by color using pivot table
```python
dogs.pivot_table(values="weight_kg", index="color")
```

📤 Output:

```
        weight_kg
color            
Black        26.5
Brown        24.0
Gray         17.0
Tan           2.0
White        74.0
```

✅ Same values as the `.groupby()` version, returned as a DataFrame instead of a Series.
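The sketch below puts the two approaches side by side on the dogs data (only the needed columns are rebuilt) to show that the numbers match and only the container type differs:

```python
import pandas as pd

dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

pivot = dogs.pivot_table(values="weight_kg", index="color")   # DataFrame
grouped = dogs.groupby("color")["weight_kg"].mean()           # Series

print(type(pivot).__name__, type(grouped).__name__)
print(pivot["weight_kg"] - grouped)   # all zeros: same means per color
```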

---

#### 💻 Step 2: Use median instead of mean

```python
dogs.pivot_table(values="weight_kg", index="color", aggfunc="median")
```

📤 Output:

```
        weight_kg
color            
Black        26.5
Brown        24.0
Gray         17.0
Tan           2.0
White        74.0
```

📌 Just switch the aggregation function with `aggfunc="median"`.

---

#### 💻 Step 3: Get both mean and median

```python
dogs.pivot_table(values="weight_kg", index="color", aggfunc=["mean", "median"])
```

📤 Output:

```
              mean  median
        weight_kg weight_kg
color                       
Black        26.5     26.5
Brown        24.0     24.0
Gray         17.0     17.0
Tan           2.0      2.0
White        74.0     74.0
```

📊 Now the table shows multiple statistics per group.
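With a list in `aggfunc`, the columns have two levels: the statistic on top and the original column name below. A sketch of how to pull out one of them, assuming the dogs data from the earlier notes:

```python
import pandas as pd

dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

both = dogs.pivot_table(values="weight_kg", index="color",
                        aggfunc=["mean", "median"])

print(both.columns.nlevels)          # two levels of column labels
mean_weights = both[("mean", "weight_kg")]   # select with a (statistic, column) tuple
print(mean_weights)
```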

---

#### 💻 Step 4: Pivot on two variables (color × breed)

```python
dogs.pivot_table(values="weight_kg", index="color", columns="breed")
```

📤 Output:

```
breed  Chihuahua  Chow Chow  Labrador  Poodle  Schnauzer  St. Bernard
color                                                                 
Black        NaN        NaN      29.0    24.0        NaN          NaN
Brown        NaN       24.0      24.0     NaN        NaN          NaN
Gray         NaN        NaN       NaN     NaN       17.0          NaN
Tan          2.0        NaN       NaN     NaN        NaN          NaN
White        NaN        NaN       NaN     NaN        NaN         74.0
```

⚠️ Missing combinations appear as NaN — e.g., no Black Chihuahua.
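A two-variable pivot is closely related to grouping by both columns and then moving one index level into the columns with `.unstack()`. The sketch below compares the two on the dogs data and counts the empty cells:

```python
import pandas as pd

dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

pivot = dogs.pivot_table(values="weight_kg", index="color", columns="breed")
unstacked = dogs.groupby(["color", "breed"])["weight_kg"].mean().unstack()

print(pivot.loc["Black", "Labrador"], unstacked.loc["Black", "Labrador"])

# 5 colors x 6 breeds = 30 cells, of which only 7 combinations exist
n_missing = int(pivot.isna().sum().sum())
print(n_missing)
```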

---

#### 💻 Step 5: Fill missing values with 0

```python
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)
```

📤 Output:

```
breed  Chihuahua  Chow Chow  Labrador  Poodle  Schnauzer  St. Bernard
color                                                                 
Black          0          0        29      24          0            0
Brown          0         24        24       0          0            0
Gray           0          0         0       0         17            0
Tan            2          0         0       0          0            0
White          0          0         0       0          0           74
```

💡 Cleaner output with all empty cells replaced by 0. Keep in mind that these zeros are placeholders for missing combinations, not measured weights.
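One way to think about `fill_value` is as a `.fillna()` applied after the aggregation. The sketch below compares the two approaches on the same data; it checks values only, since the resulting dtypes (integer versus float) can differ between pandas versions. Variable names are illustrative.

```python
import pandas as pd

# Minimal copy of the dogs data: only the columns this pivot uses
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# Fill missing cells inside pivot_table
filled = dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

# Fill missing cells afterwards with fillna
filled_later = dogs.pivot_table(values="weight_kg", index="color", columns="breed").fillna(0)

# Compare cell by cell (labels are identical, so == aligns cleanly)
same_values = bool((filled == filled_later).all().all())
print(same_values)
```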

---

#### 💻 Step 6: Add row & column summaries using `margins=True`

```python
dogs.pivot_table(
    values="weight_kg",
    index="color",
    columns="breed",
    fill_value=0,
    margins=True
)
```

📤 Output:

```
breed  Chihuahua  Chow Chow  Labrador  Poodle  Schnauzer  St. Bernard       All
color                                                                           
Black          0          0        29      24          0            0  26.500000
Brown          0         24        24       0          0            0  24.000000
Gray           0          0         0       0         17            0  17.000000
Tan            2          0         0       0          0            0   2.000000
White          0          0         0       0          0           74  74.000000
All            2         24        26      24         17           74  27.714286
```

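📌 The **last row and column** (`All`) show the mean for each group, and the bottom-right cell shows the mean of the entire dataset (194 kg / 7 dogs ≈ 27.71). These margins are computed from the original data, not from the filled-in zeros: the Black row averages 29 and 24 to 26.5. Likewise, the Labrador value in the `All` row is the mean of 24 and 29, i.e. 26.5, which this output shows truncated to 26.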
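The sketch below checks two of the margin values against the raw data. It also uses the `margins_name` argument to relabel the summary row and column; the label `"Overall"` is an arbitrary choice for illustration.

```python
import pandas as pd

# Minimal copy of the dogs data: only the columns this pivot uses
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

table = dogs.pivot_table(
    values="weight_kg",
    index="color",
    columns="breed",
    fill_value=0,
    margins=True,
    margins_name="Overall",  # replaces the default "All" label
)

# Bottom-right cell: mean weight of all dogs
grand_mean = table.loc["Overall", "Overall"]
print(grand_mean, dogs["weight_kg"].mean())

# Row margin for Black: mean of the two black dogs, ignoring the filled zeros
black_mean = table.loc["Black", "Overall"]
print(black_mean)
```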

---

✅ **Takeaway:**
Use `pivot_table()` when you want tabular summaries like in Excel:

* One or two grouping variables
* Summary stats (`mean`, `median`, etc.)
* Multiple stats with `aggfunc=[...]`
* Fill NaNs with `fill_value`
* Add overall summaries with `margins=True`
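For the two-variable case, a pivot table can also be reproduced with `.groupby()` followed by `.unstack()`. This is a sketch of that correspondence under the same `dogs` data, not a statement about how `pivot_table()` is implemented; `pivot` and `grouped` are illustrative names.

```python
import pandas as pd

# Minimal copy of the dogs data: only the columns this pivot uses
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# Two-variable pivot: colors as rows, breeds as columns
pivot = dogs.pivot_table(values="weight_kg", index="color", columns="breed")

# Same summary: group by both variables, then move breed into the columns
grouped = dogs.groupby(["color", "breed"])["weight_kg"].mean().unstack()

print(pivot)
print(grouped)
print(pivot.equals(grouped))
```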

```


In [None]:
# Exercise
# Pivoting on one variable
# Pivot tables are the standard way of aggregating data in spreadsheets.

# In pandas, pivot tables are essentially another way of performing grouped calculations. That is, the .pivot_table() method is an alternative to .groupby().

# In this exercise, you'll perform calculations using .pivot_table() to replicate the calculations you performed in the last lesson using .groupby().

# sales is available and pandas is imported as pd.

# Instructions 1/3
# 1. Get the mean weekly_sales by type using .pivot_table() and store as mean_sales_by_type.


# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values='weekly_sales', index='type')

# Print mean_sales_by_type
print(mean_sales_by_type)

#       weekly_sales
# type              
# A     23674.667242
# B     25696.678370


# 2. Get the mean and median of weekly_sales by type using .pivot_table() and store as mean_med_sales_by_type.

# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values='weekly_sales', index='type', aggfunc=['mean', 'median'])

# Print mean_med_sales_by_type
print(mean_med_sales_by_type)

# <script.py> output:
#                   mean       median
#           weekly_sales weekly_sales
#     type                           
#     A     23674.667242     11943.92
#     B     25696.678370     13336.08


# 3. Get the mean of weekly_sales by type and is_holiday using .pivot_table() and store as mean_sales_by_type_holiday.

# Pivot for mean weekly_sales by store type and holiday 
mean_sales_by_type_holiday = sales.pivot_table(values='weekly_sales', index='type', columns='is_holiday')

# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)

# <script.py> output:
#     is_holiday         False      True 
#     type                               
#     A           23768.583523  590.04525
#     B           25751.980533  810.70500

In [None]:
# Exercise
# Fill in missing values and sum values with pivot tables
# The .pivot_table() method has several useful arguments, including fill_value and margins.

# fill_value replaces missing values with a real value (known as imputation). What to replace missing values with is a topic big enough to have its own course (Dealing with Missing Data in Python), but the simplest thing to do is to substitute a dummy value.
# margins is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: it gives the row and column totals of the pivot table contents.
# In this exercise, you'll practice using these arguments to up your pivot table skills, which will help you crunch numbers more efficiently!

# sales is available and pandas is imported as pd.

# Instructions 1/2
# 1. Print the mean weekly_sales by department and type, filling in any missing values with 0.

# Print mean weekly_sales by department and type; fill missing values with 0
print(sales.pivot_table(values='weekly_sales', index='type', columns='department', fill_value=0))


# <script.py> output:
#     department            1              2             3             4             5             6             7             8             9             10  ...            90            91             92            93            94             95            96            97            98          99
#     type                                                                                                                                                     ...                                                                                                                                            
#     A           30961.725379   67600.158788  17160.002955  44285.399091  34821.011364   7136.292652  38454.336818  48583.475303  30120.449924  30930.456364  ...  85776.905909  70423.165227  139722.204773  53413.633939  60081.155303  123933.787121  21367.042857  28471.266970  12875.423182  379.123659
#     B           44050.626667  112958.526667  30580.655000  51219.654167  63236.875000  10717.297500  52909.653333  90733.753333  66679.301667  48595.126667  ...  14780.210000  13199.602500   50859.278333   1466.274167    161.445833   77082.102500   9528.538333   5828.873333    217.428333    0.000000



# 2. Print the mean weekly_sales by department and type, filling in any missing values with 0 and summing all rows and columns.

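# Print the total weekly_sales by department and type; fill missing values with 0s; add row and column totals
# Note: aggfunc="sum" makes every cell (and the All margins) a sum; omit aggfunc to get the means the instruction asks for
print(sales.pivot_table(values="weekly_sales", index="department", columns="type", fill_value=0, aggfunc="sum", margins=True))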

# <script.py> output:
#     type                   A            B           All
#     department                                         
#     1           4.086948e+06    528607.52  4.615555e+06
#     2           8.923221e+06   1355502.32  1.027872e+07
#     3           2.265120e+06    366967.86  2.632088e+06
#     4           5.845673e+06    614635.85  6.460309e+06
#     5           4.596374e+06    758842.50  5.355216e+06
#     6           9.419906e+05    128607.57  1.070598e+06
#     7           5.075972e+06    634915.84  5.710888e+06
#     8           6.413019e+06   1088805.04  7.501824e+06
#     9           3.975899e+06    800151.62  4.776051e+06
#     10          4.082820e+06    583141.52  4.665962e+06
#     11          3.039737e+06    425861.15  3.465598e+06
#     12          8.958630e+05    115878.24  1.011741e+06
#     13          6.784558e+06    806563.05  7.591121e+06
#     14          2.964416e+06    484800.24  3.449216e+06
#     16          3.326763e+06    354698.19  3.681461e+06
#     17          2.134121e+06    332104.22  2.466226e+06
#     18          1.610634e+06    208336.17  1.818970e+06
#     19          1.998018e+05     40390.74  2.401926e+05
#     20          1.097193e+06    194301.72  1.291495e+06
#     21          1.230819e+06    124427.62  1.355247e+06
#     22          1.877743e+06    312537.57  2.190280e+06
#     23          3.874298e+06    767523.28  4.641822e+06
#     24          1.083006e+06    184521.95  1.267528e+06
#     25          1.766446e+06    247431.42  2.013877e+06
#     26          1.485569e+06    210425.49  1.695995e+06
#     27          2.744492e+05     54644.65  3.290938e+05
#     28          1.278086e+05     17073.15  1.448818e+05
#     29          9.054023e+05    179998.12  1.085400e+06
#     30          7.132778e+05    149304.96  8.625828e+05
#     31          4.251722e+05     73823.94  4.989962e+05
#     32          1.267108e+06    252568.79  1.519677e+06
#     33          1.133747e+06    178945.92  1.312693e+06
#     34          2.527633e+06    539174.27  3.066807e+06
#     35          4.626222e+05     90720.54  5.533428e+05
#     36          3.241680e+05     45932.64  3.701007e+05
#     37          3.918261e+05     51640.58  4.434667e+05
#     38          1.119311e+07   1027983.70  1.222110e+07
#     39          3.639000e+01         0.00  3.639000e+01
#     40          9.331837e+06   1257967.55  1.058980e+07
#     41          2.928441e+05     90174.17  3.830183e+05
#     42          1.033817e+06    135369.58  1.169187e+06
#     43          1.320000e+00         0.00  1.320000e+00
#     44          7.899491e+05    109093.70  8.990428e+05
#     45          2.179780e+03       501.02  2.680800e+03
#     46          3.958821e+06    640873.96  4.599695e+06
#     47          1.069460e+03      -863.00  2.064600e+02
#     48          7.963667e+04     23581.30  1.032180e+05
#     49          1.453065e+06    231762.65  1.684828e+06
#     50          1.395138e+05     47898.00  1.874118e+05
#     51          2.144350e+03       231.97  2.376320e+03
#     52          3.854076e+05     64330.09  4.497377e+05
#     54          2.921002e+04      6612.20  3.582222e+04
#     55          2.028577e+06    245048.72  2.273626e+06
#     56          7.298506e+05     11907.72  7.417583e+05
#     58          8.593800e+05     61882.00  9.212620e+05
#     59          1.793208e+05     30438.46  2.097592e+05
#     60          6.087531e+04      6934.20  6.780951e+04
#     67          1.592545e+06    159932.57  1.752478e+06
#     71          9.882041e+05    191350.26  1.179554e+06
#     72          9.605358e+06   1742852.72  1.134821e+07
#     74          2.916816e+06    533763.86  3.450580e+06
#     77          1.408475e+04      1590.00  1.567475e+04
#     78          4.358000e+02        12.00  4.478000e+02
#     79          5.364891e+06    477624.46  5.842515e+06
#     80          2.960224e+06      1400.16  2.961624e+06
#     81          3.751154e+06    133015.10  3.884169e+06
#     82          2.864145e+06    263321.95  3.127467e+06
#     83          9.267668e+05      3924.14  9.306910e+05
#     85          4.743071e+05     34647.80  5.089549e+05
#     87          2.976557e+06    261949.59  3.238506e+06
#     90          1.132255e+07    177362.52  1.149991e+07
#     91          9.295858e+06    158395.23  9.454253e+06
#     92          1.844333e+07    610311.34  1.905364e+07
#     93          7.050600e+06     17595.29  7.068195e+06
#     94          7.930712e+06      1937.35  7.932650e+06
#     95          1.635926e+07    924985.23  1.728425e+07
#     96          2.692247e+06    114342.46  2.806590e+06
#     97          3.758207e+06     69946.48  3.828154e+06
#     98          1.699556e+06      2609.14  1.702165e+06
#     99          4.663221e+04         0.00  4.663221e+04
#     All         2.337163e+08  23178403.89  2.568947e+08