---
## 📘 Author Information

**👨‍💻 Name:** Abdul Rehman  
**📌 Role:** Data Science Enthusiast | Python Learner  
**📅 Notebook Created:** 31 July 2025  

**🔗 Connect with Me:**  


[![LinkedIn](https://img.shields.io/badge/LinkedIn-blue?style=flat&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/abdul-rehman-74b418350/)
[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/datawithrehman/Data-Science-Beginning)
[![Twitter](https://img.shields.io/badge/Twitter-blue?style=flat&logo=twitter&logoColor=white)](https://x.com/datawithrehman)



Understanding pandas `groupby()` for Beginners
==============================================

Introduction
------------

In data analysis, you often need to summarize data by categories. For
example, you might want the average sales for each product category, or
the total quantity sold per region. The pandas `groupby()` method is
built for exactly this kind of question. It’s similar to the
“PivotTable” feature in Excel, but because it is scriptable in Python it
is more flexible and easier to combine with other steps.

At its core, `groupby()` involves three steps:

1.  **Splitting** the data into groups based on some criteria.
2.  **Applying** a function (like `mean()`, `sum()`, `count()`, etc.) to
    each group independently.
3.  **Combining** the results into a single data structure.
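To make the three steps concrete, here is a minimal sketch that performs them by hand and compares the outcome with the one-line `groupby()` call. The `toy` DataFrame and its values are made up for illustration only; they are not the notebook’s sample data.

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not the notebook's data)
toy = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Electronics", "Clothing"],
    "Sales": [100, 150, 120, 200],
})

# 1. Split: iterating over a GroupBy object yields (group_key, sub_frame) pairs
pieces = {key: group for key, group in toy.groupby("Category")}

# 2. Apply: compute the mean of 'Sales' within each piece
means = {key: group["Sales"].mean() for key, group in pieces.items()}

# 3. Combine: gather the per-group results into a single Series
manual_result = pd.Series(means, name="Sales").sort_index()

# The usual one-liner, which is conceptually the same three steps
builtin_result = toy.groupby("Category")["Sales"].mean()

print(manual_result)
print(builtin_result)
```

In practice you would always use the one-liner; the manual version is only meant to show what happens conceptually.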

Let’s dive into some examples to see how it works!

Setup: Creating a Sample DataFrame
----------------------------------

First, let’s create a sample pandas DataFrame that we’ll use throughout
this notebook. This DataFrame represents some fictional sales data.

``` python
import pandas as pd

data = {
    'Product': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B'],
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Home', 'Clothing', 'Home', 'Electronics', 'Clothing'],
    'Region': ['East', 'West', 'East', 'North', 'West', 'North', 'East', 'West'],
    'Sales': [100, 150, 120, 80, 200, 90, 110, 180],
    'Quantity': [2, 3, 2, 1, 4, 1, 2, 3]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```

Grouping by a Single Column
---------------------------

The most common use case for `groupby()` is to group data by a single
column and then apply an aggregation function. Let’s find the average
sales for each product category.

``` python
# Group by 'Category' and calculate the mean of 'Sales'
category_sales_mean = df.groupby("Category")["Sales"].mean()
print("\nAverage Sales by Category:")
print(category_sales_mean)
```

In the example above, `df.groupby("Category")` splits the DataFrame into
groups based on unique values in the ‘Category’ column. Then,
`["Sales"].mean()` calculates the average of the ‘Sales’ column for each
of these groups.
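The same pattern works with other aggregation functions. The sketch below rebuilds only the two columns it needs (with the same values as the sample data) so that it runs on its own, then swaps `mean()` for `sum()`, `count()`, and `size()`. The variable names are chosen here for illustration.

```python
import pandas as pd

# Same 'Category' and 'Sales' values as the notebook's sample data
df = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Electronics", "Home",
                 "Clothing", "Home", "Electronics", "Clothing"],
    "Sales": [100, 150, 120, 80, 200, 90, 110, 180],
})

# Select the column once, then reuse the grouped object
grouped_sales = df.groupby("Category")["Sales"]

category_sales_sum = grouped_sales.sum()        # total sales per category
category_sales_count = grouped_sales.count()    # non-null 'Sales' values per category
category_sizes = df.groupby("Category").size()  # rows per category, nulls included

print(category_sales_sum)
print(category_sales_count)
print(category_sizes)
```

`count()` and `size()` agree here because the sample data has no missing values; they differ when a column contains `NaN`.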

Performing Multiple Aggregations
--------------------------------

You’re not limited to just one aggregation. The `.agg()` method lets you
apply several aggregation functions to one or more columns in a single
call.

Let’s find the mean sales, the total quantity, and the minimum and
maximum sales for each category. (We treat ‘Sales’ as a stand-in for
price here for simplicity; in a real dataset ‘Price’ would be a separate
column.)

``` python
# Group by 'Category' and apply multiple aggregations
multi_agg_by_category = df.groupby("Category").agg(
    Sales_Mean=("Sales", "mean"),
    Quantity_Sum=("Quantity", "sum"),
    Sales_Min=("Sales", "min"),
    Sales_Max=("Sales", "max")
)
print("\nMultiple Aggregations by Category:")
print(multi_agg_by_category)
```

Here, we’ve used *named aggregation*: each keyword argument passed to
`.agg()` becomes a column name in the result, and its value is a
`(column, function)` tuple that says which column to aggregate and how.
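For comparison, `.agg()` also accepts a list of function names or a dictionary. The sketch below shows both forms on a self-contained copy of the relevant columns; the variable names `sales_stats` and `per_column` are made up for this example.

```python
import pandas as pd

# Same 'Category', 'Sales' and 'Quantity' values as the notebook's sample data
df = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Electronics", "Home",
                 "Clothing", "Home", "Electronics", "Clothing"],
    "Sales": [100, 150, 120, 80, 200, 90, 110, 180],
    "Quantity": [2, 3, 2, 1, 4, 1, 2, 3],
})

# A list of functions on one column: result columns are named after the functions
sales_stats = df.groupby("Category")["Sales"].agg(["mean", "min", "max"])

# A dict maps each column to a function: result columns keep the original names
per_column = df.groupby("Category").agg({"Sales": "mean", "Quantity": "sum"})

print(sales_stats)
print(per_column)
```

These forms are shorter to type, but they give you less control over the output column names, which is one reason to prefer the keyword form when the result feeds further steps.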

Grouping by Multiple Columns
----------------------------

Sometimes, you need to group your data based on more than one criterion.
For instance, you might want to see the average sales per product
category *and* per region. You can do this by passing a list of column
names to `groupby()`.

``` python
# Group by 'Category' and 'Region' and calculate the mean of 'Sales'
category_region_sales_mean = df.groupby(["Category", "Region"])["Sales"].mean()
print("\nAverage Sales by Category and Region:")
print(category_region_sales_mean)
```

Notice that when you group by multiple columns, pandas creates a
`MultiIndex` (hierarchical index) in the result, with one index level
per grouping column. Each row of the result is identified by a
combination of values, such as (‘Electronics’, ‘East’).
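A short sketch of how a `MultiIndex` result can be used: looking up one combination with `.loc` and pivoting one index level into columns with `.unstack()`. The data is a self-contained copy of the relevant sample columns, and the variable names are chosen for illustration.

```python
import pandas as pd

# Same 'Category', 'Region' and 'Sales' values as the notebook's sample data
df = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Electronics", "Home",
                 "Clothing", "Home", "Electronics", "Clothing"],
    "Region": ["East", "West", "East", "North",
               "West", "North", "East", "West"],
    "Sales": [100, 150, 120, 80, 200, 90, 110, 180],
})

result = df.groupby(["Category", "Region"])["Sales"].mean()

# The index has one level per grouping column
print(result.index.names)

# Look up a single (Category, Region) combination with a tuple
electronics_east = result.loc[("Electronics", "East")]
print(electronics_east)

# Move the 'Region' level into columns to get a Category x Region table;
# combinations that never occur in the data show up as NaN
wide = result.unstack("Region")
print(wide)
```

In this sample data each category is sold in only one region, so most cells of the wide table are empty.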

Flattening the Result with `.reset_index()`
-------------------------------------------

As seen in the previous example, `groupby()` often returns a Series or
DataFrame with one or more columns set as the index. While this can be
useful, sometimes you want these grouped columns back as regular columns
in your DataFrame. This is where `.reset_index()` comes in handy.

``` python
# Group by 'Category' and 'Region' and calculate the mean of 'Sales', then reset the index
category_region_sales_mean_flat = df.groupby(["Category", "Region"])["Sales"].mean().reset_index()
print("\nAverage Sales by Category and Region (with reset_index()):")
print(category_region_sales_mean_flat)
```

Using `.reset_index()` converts the MultiIndex back into regular
columns, making the result easier to work with, especially if you plan
further operations or want to export the data.
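As an aside, there are two related options worth knowing: passing `as_index=False` to `groupby()` so the grouping columns are never moved into the index, and using the `name` argument of `Series.reset_index()` to rename the value column. The sketch below is self-contained, and the column name `Avg_Sales` is just an example.

```python
import pandas as pd

# Same 'Category', 'Region' and 'Sales' values as the notebook's sample data
df = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Electronics", "Home",
                 "Clothing", "Home", "Electronics", "Clothing"],
    "Region": ["East", "West", "East", "North",
               "West", "North", "East", "West"],
    "Sales": [100, 150, 120, 80, 200, 90, 110, 180],
})

# Approach from the text: aggregate first, then flatten the index
flat_via_reset = df.groupby(["Category", "Region"])["Sales"].mean().reset_index()

# Alternative: ask groupby() to keep the grouping keys as regular columns
flat_via_flag = df.groupby(["Category", "Region"], as_index=False)["Sales"].mean()

# Optional: give the aggregated column a more descriptive name while flattening
renamed = (
    df.groupby(["Category", "Region"])["Sales"]
    .mean()
    .reset_index(name="Avg_Sales")
)

print(flat_via_reset)
print(flat_via_flag)
print(renamed)
```

Both approaches give the same table for this data, so the choice is mostly a matter of style.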

Mini Challenge: Group by Product and Find Total Quantity
--------------------------------------------------------

Now it’s your turn! Using the `df` DataFrame, try to group by the
‘Product’ column and find the total ‘Quantity’ sold for each product.
Remember to use `.reset_index()` to get a clean DataFrame as your
result.

``` python
# Your code here:
# product_quantity_sum = df.groupby(...)
# print(product_quantity_sum)
```

Conclusion
----------

The `groupby()` function is a cornerstone of data analysis with pandas.
Mastering it allows you to quickly summarize, aggregate, and gain
insights from your data. Experiment with different aggregation functions
and grouping combinations to unlock its full potential!