
#### What is Data Aggregation and Group Operations?

Data aggregation and group operations involve **splitting** a dataset into **groups**, then applying **functions** (like aggregation, transformation, or filtering) to each group. This process is central to many data analysis workflows, especially for **summarizing data**, **creating reports**, or **building visualizations**.

#### Importance in Data Analysis

- After loading, merging, and cleaning data, often you'll need to **analyze groups** of data separately.
- Examples: 
  - Calculating average income by region,
  - Normalizing scores within categories,
  - Running regressions per group.

#### The Role of `groupby` in pandas

The `groupby()` method in pandas allows you to:
1. **Split** the data into groups based on a key (or multiple keys),
2. **Apply** a function to each group,
3. **Combine** the results back into a useful structure.

This is often referred to as the **Split–Apply–Combine** strategy.
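
The three steps can be spelled out by hand to see what `groupby()` is doing. This is only a sketch on a made-up `city`/`income` table (the data and variable names are illustrative assumptions, not part of the original notes), comparing a manual split, apply, and combine with the single `groupby` call.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo", "Rome", "Oslo"],
                   "income": [50, 40, 70, 60, 30]})

# Split: one sub-frame per distinct key value.
pieces = {key: df[df["city"] == key] for key in df["city"].unique()}

# Apply: reduce each piece to a single number.
applied = {key: piece["income"].mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied, name="income").sort_index()
manual.index.name = "city"

# groupby performs the same three steps in one call.
grouped = df.groupby("city")["income"].mean()
print(grouped)
```

Both `manual` and `grouped` hold the same per-city means; the `groupby` version is shorter and avoids the explicit Python loop.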

#### Why Not Just SQL?

- SQL is great for filtering, joining, and basic aggregations.
- However, SQL is **somewhat limited** when it comes to:
  - Complex group transformations,
  - Custom logic for each group,
  - Advanced statistical computations.
- pandas lets you **write custom Python functions** for these operations, which makes group operations highly flexible and expressive.

#### Key Capabilities of pandas Group Operations

##### 1. **Split** Data Using Keys
- Keys can be:
  - **Column names** (e.g., `df.groupby('city')`)
  - **Multiple columns** (e.g., `df.groupby(['city', 'gender'])`)
  - **Functions** (e.g., `df.groupby(lambda x: x[:3])`)
  - **Arrays** (e.g., an array specifying labels for grouping)


##### 2. **Aggregate** Data
- Built-in aggregation methods:
  - `count()`, `sum()`, `mean()`, `median()`, `min()`, `max()`, `std()`, etc.
- Custom aggregations:
  - Use `agg()` with a **dictionary** or **custom functions**.
- Example:
  ```python
  df.groupby('category')['sales'].agg(['sum', 'mean'])
  ```
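
The dictionary and custom-function forms of `agg()` mentioned above might look like the following. The `category`, `sales`, and `units` columns and the `value_range` helper are hypothetical names chosen for this sketch.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"category": ["x", "x", "y", "y"],
                   "sales": [10, 20, 30, 50],
                   "units": [1, 2, 3, 5]})

# Dictionary form: a different aggregation for each column.
per_column = df.groupby("category").agg({"sales": "sum", "units": "max"})

# Custom function: any callable that reduces a Series to a scalar.
def value_range(s):
    return s.max() - s.min()

spread = df.groupby("category")["sales"].agg(value_range)

print(per_column)
print(spread)
```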

##### 3. **Transform** Data Within Groups
- The `transform()` method returns an object **the same size as the input**.
- Useful for:
  - **Normalization** (e.g., z-score within group),
  - **Filling missing values by group mean**,
  - **Ranking** elements within a group.
- Example:
  ```python
  df['normalized'] = df.groupby('group')['value'].transform(lambda x: (x - x.mean()) / x.std())
  ```
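
Filling missing values with the group mean, one of the uses listed above, can be sketched like this. The `group` and `value` columns are assumed names, and the approach shown (broadcast the group mean with `transform`, then `fillna`) is one reasonable way to do it rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each group.
df = pd.DataFrame({"group": ["a", "a", "a", "b", "b"],
                   "value": [1.0, np.nan, 3.0, 10.0, np.nan]})

# transform("mean") broadcasts each group's mean back to every row of that group,
# so the result lines up with the original index.
group_means = df.groupby("group")["value"].transform("mean")

# Use the aligned means to fill only the missing entries.
df["filled"] = df["value"].fillna(group_means)
print(df)
```

Because `group_means` has one entry per original row, it can be combined with the original column directly, which is the property that distinguishes `transform()` from `agg()`.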

##### 4. **Filter Groups**
- You can filter groups based on a **condition**.
- Example: Keep groups with size > 3:
  ```python
  df.groupby('category').filter(lambda x: len(x) > 3)
  ```

##### 5. **Apply Custom Functions**
- `apply()` allows full flexibility:
  - You can define custom operations per group,
  - Can return scalars, Series, or DataFrames.
- Example:
  ```python
  def top_n(df, n=3):
      return df.sort_values('score', ascending=False).head(n)

  df.groupby('class').apply(top_n)
  ```

##### 6. **Pivot Tables and Cross-tabulations**
- `pivot_table()` creates summary tables:
  ```python
  df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
  ```
- `crosstab()` is used for frequency tables:
  ```python
  pd.crosstab(df['gender'], df['product'])
  ```
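
A pivot table with a single aggregation is closely related to a two-key `groupby` followed by `unstack()`. The sketch below, on made-up `region`/`product`/`sales` data, shows the two routes side by side.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"region": ["N", "N", "S", "S", "N"],
                   "product": ["p", "q", "p", "q", "p"],
                   "sales": [1, 2, 3, 4, 5]})

# Route 1: pivot_table does the grouping and reshaping in one call.
pivoted = df.pivot_table(values="sales", index="region",
                         columns="product", aggfunc="sum")

# Route 2: group by both keys, aggregate, then move "product" into the columns.
via_groupby = df.groupby(["region", "product"])["sales"].sum().unstack()

print(pivoted)
print(via_groupby)
```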

##### 7. **Statistical Group Analysis**
- **Quantile analysis**, regression, correlation by group:
  ```python
  df.groupby('group')['score'].quantile(0.75)
  ```
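
Correlation by group can be computed by applying a small function to each group. This is a sketch under assumed column names (`group`, `x`, `y`); selecting the two value columns before calling `apply()` keeps the grouping column out of the function's input.

```python
import pandas as pd

# Hypothetical toy data: x and y rise together in group "a"
# and move in opposite directions in group "b".
df = pd.DataFrame({"group": ["a"] * 4 + ["b"] * 4,
                   "x": [1, 2, 3, 4, 1, 2, 3, 4],
                   "y": [2, 4, 6, 8, 8, 6, 4, 2]})

# For each group, compute the Pearson correlation between x and y.
corr_by_group = df.groupby("group")[["x", "y"]].apply(
    lambda g: g["x"].corr(g["y"])
)
print(corr_by_group)
```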

#### Summary of Key Concepts

| Concept                          | Description |
|----------------------------------|-------------|
| `groupby()`                      | Splits data into groups |
| Aggregation                      | Computes statistics like sum, mean, etc. |
| `agg()`                          | Allows multiple or custom aggregation |
| `transform()`                    | Transforms each group, keeping original shape |
| `filter()`                       | Drops groups based on a condition |
| `apply()`                        | Fully flexible, applies custom function to each group |
| `pivot_table()` / `crosstab()`   | Summary and frequency tables |
| Advanced Analysis                | Quantiles, correlations, regressions by group |

> Time-based aggregation of time series data, a special use case of `groupby`, will be covered later.

In [21]:
import numpy as np 
import pandas as pd 

## [ How to Think about Group Operations ]
- Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations.
- In the first stage of the process, data contained in a pandas object is split into groups based on one or more keys that we provide.
- The splitting is performed on a particular axis of an object.
- Once this is done, a function is applied to each group, producing a new value.
- Finally, the results of all those function applications are combined into a result object.
- The form of the resulting object will usually depend on what's being done to the data.

- The grouping keys tell pandas how to divide the data, and pandas supports several flexible ways to define those keys.

#### **Types of Grouping Keys in pandas**

1. ##### **Column name**
   - Group by values in a column.
   - 📌 Example: `df.groupby('city')`

2. ##### **List or array of values**
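   - A list/array with the same length as the DataFrame/Series axis being grouped.
   - Each element gives the group label for the corresponding row.
   - 📌 Example: `df.groupby(np.array(['A', 'A', 'B', 'B']))` (wrapping the labels in an array keeps pandas from reading them as column names)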

3. ##### **Multiple column names (list of labels)**
   - Group by combinations of multiple columns.
   - 📌 Example: `df.groupby(['city', 'gender'])`

4. ##### **Dictionary**
   - Map index labels to group names.
   - 📌 Example: `df.groupby({'a': 'group1', 'b': 'group1', 'c': 'group2'})`

5. ##### **Series**
   - A Series mapping each row (or column) to a group.
   - 📌 Example:
     ```python
     group_map = pd.Series(['X', 'X', 'Y'], index=df.index)
     df.groupby(group_map)
     ```

6. ##### **Function**
   - A function called on each index label (or column name) that returns the group label.
   - 📌 Examples:
     - Group by first letter of index: `df.groupby(lambda x: x[0])`
     - Group by length of the index label: `df.groupby(len)`
     
7. ##### **Combinations (Mix of types)**
   - You can combine any of the above types in a list.
   - 📌 Example:
     ```python
     df.groupby([array, 'column_name', lambda x: x[-1]])
     ```
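
The key types listed above can be tried side by side on one small frame. The data, the index labels, and the variable names below are all made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with string index labels.
df = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo", "Rome"],
                   "value": [1, 2, 3, 4]},
                  index=["ann", "bob", "amy", "ben"])

# Column name.
by_column = df.groupby("city")["value"].sum()

# Array of labels, one per row.
labels = np.array(["A", "A", "B", "B"])
by_array = df.groupby(labels)["value"].sum()

# Dictionary mapping index labels to group names.
by_dict = df.groupby({"ann": "g1", "bob": "g1",
                      "amy": "g2", "ben": "g2"})["value"].sum()

# Function called on each index label.
by_func = df.groupby(lambda name: name[0])["value"].sum()

# Mix of a column name and a function.
by_mixed = df.groupby(["city", lambda name: name[0]])["value"].sum()

for result in (by_column, by_array, by_dict, by_func, by_mixed):
    print(result, end="\n\n")
```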


In [22]:
# Now Examples
# here is a small tabular dataset as a DataFrame

df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None],
                   "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
                   "data1" : np.random.standard_normal(7),
                   "data2" : np.random.standard_normal(7)})
df

   key1  key2     data1     data2
0     a     1 -1.667753  1.192453
1     a     2 -0.244164 -0.197482
2  None     1  0.854690  2.362503
3     b     2 -1.626087 -0.477084
4     b     1 -0.853424  2.019757
5     a  <NA> -1.249935 -1.183894
6  None     1  0.138188  0.562949


In [23]:
# suppose you wanted to compute the mean of the data1 column using the labels from key1
# there are a number of ways to do this.
# one is to access data1 and call groupby with the column (a series) at key1

grouped = df["data1"].groupby(df["key1"])
grouped
# this grouped variable is now a special "GroupBy" object. 
# this object has all of the information needed to then apply some operation to each of the groups.

<pandas.core.groupby.generic.SeriesGroupBy object at 0x723df90691d0>

In [24]:
# for ex, to compute group means we can call the GroupBy's mean method
grouped.mean()

key1
a   -1.053951
b   -1.239755
Name: data1, dtype: float64

In [25]:
# passing multiple arrays as a list
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()
means

key1  key2
a     1      -1.667753
      2      -0.244164
b     1      -0.853424
      2      -1.626087
Name: data1, dtype: float64

In [26]:
# here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed
means.unstack()

key2         1         2
key1
a    -1.667753 -0.244164
b    -0.853424 -1.626087


In [27]:
# in this example, the group keys are all series, though they could be any arrays of the right length

states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

df["data1"].groupby([states, years]).mean()

CA  2005   -0.747049
    2006    0.854690
OH  2005   -1.646920
    2006   -0.357618
Name: data1, dtype: float64

In [28]:
# frequently, the grouping information is found in the same DataFrame as the data you want to work on

df.groupby("key1").mean()

      key2     data1     data2
key1
a      1.5 -1.053951 -0.062974
b      1.5 -1.239755  0.771337


In [29]:
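# df.groupby("key2").mean() raises a TypeError in pandas 2.x because key1 holds strings;
# restricting the aggregation to numeric columns avoids this:
# df.groupby("key2").mean(numeric_only=True)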

In [30]:
df.groupby(["key1", "key2"]).mean()

              data1     data2
key1 key2
a    1    -1.667753  1.192453
     2    -0.244164 -0.197482
b    1    -0.853424  2.019757
     2    -1.626087 -0.477084


In [31]:
# regardless of why we're using groupby, sometimes we just want to know how many rows are in each group.
df.groupby(["key1", "key2"]).size()

key1  key2
a     1       1
      2       1
b     1       1
      2       1
dtype: int64

In [32]:
# note that any missing values in a group key are excluded from result by default
# this behavior can be disabled by passing dropna=False to groupby
df.groupby("key1", dropna=False).size()

key1
a      3
b      2
NaN    2
dtype: int64

In [33]:
df.groupby(["key1", "key2"], dropna=False).size()

key1  key2
a     1       1
      2       1
      <NA>    1
b     1       1
      2       1
NaN   1       2
dtype: int64

In [34]:
# a group function similar in spirit to size is count, which computes the number of nonnull values in each group
df.groupby("key1").count()

      key2  data1  data2
key1
a        2      3      3
b        2      2      2
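
The difference between `size` and `count` shows up only when a group contains missing values. A minimal sketch on made-up data (the `key` and `val` names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" has one missing value.
df = pd.DataFrame({"key": ["a", "a", "b"],
                   "val": [1.0, np.nan, 3.0]})

# size counts rows per group, missing values included.
sizes = df.groupby("key").size()

# count counts only the non-null entries per group.
counts = df.groupby("key")["val"].count()

print(sizes)   # a -> 2, b -> 1
print(counts)  # a -> 1, b -> 1
```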


## [ Iterating over Groups ]
The object returned by `groupby` supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

In [35]:
# iterate over (name, group) pairs; rows with a missing key1 are skipped by default
for name, group in df.groupby("key1"):
    print(name)
    print(group)

a
  key1  key2     data1     data2
0    a     1 -1.667753  1.192453
1    a     2 -0.244164 -0.197482
5    a  <NA> -1.249935 -1.183894
b
  key1  key2     data1     data2
3    b     2 -1.626087 -0.477084
4    b     1 -0.853424  2.019757


In [36]:
# in case of multiple keys, the first element in the tuple will be a tuple of key values
for (k1, k2), group in df.groupby(["key1", "key2"]):
    print((k1, k2))
    print(group)

('a', np.int64(1))
  key1  key2     data1     data2
0    a     1 -1.667753  1.192453
('a', np.int64(2))
  key1  key2     data1     data2
1    a     2 -0.244164 -0.197482
('b', np.int64(1))
  key1  key2     data1     data2
4    b     1 -0.853424  2.019757
('b', np.int64(2))
  key1  key2     data1     data2
3    b     2 -1.626087 -0.477084


In [37]:
# compute a dictionary of the data pieces as a one-liner
pieces = {name: group for name, group in df.groupby("key1")}
pieces["b"]

  key1  key2     data1     data2
3    b     2 -1.626087 -0.477084
4    b     1 -0.853424  2.019757


In [42]:
# by default groupby groups on axis="index", but we can also group on the columns
# note: axis="columns" is deprecated as of pandas 2.1 and emits a FutureWarning
grouped = df.groupby({"key1": "key", "key2": "key", "data1": "data", "data2": "data"}, axis="columns")

for group_key, group_values in grouped:
    print(group_key)
    print(group_values)

data
      data1     data2
0 -1.667753  1.192453
1 -0.244164 -0.197482
2  0.854690  2.362503
3 -1.626087 -0.477084
4 -0.853424  2.019757
5 -1.249935 -1.183894
6  0.138188  0.562949
key
   key1  key2
0     a     1
1     a     2
2  None     1
3     b     2
4     b     1
5     a  <NA>
6  None     1




## [ Selecting a Column or Subset of Columns ]

In [None]:
# indexing a groupby object created from a dataframe with a column name or array of column names has the effect of column subsetting for aggregation
# this means that: 
    # df.groupby("key1")["data1"]   --  # Group df by key1, then select data1
    # df.groupby("key1")[["data2"]]  --  # Group df by key1, then select data2 (as DataFrame) 
# are conveniences for: 
    # df["data1"].groupby(df["key1"])   --  # Select data1, then group by key1
    # df[["data2"]].groupby(df["key1"])  --   # Select data2, then group by key1

# df.groupby("key1")["data1"]  ≡  df["data1"].groupby(df["key1"])
# both lines produce equivalent grouped objects

# Why does this matter?
    # this flexibility lets you:
        # write shorter and cleaner code
        # control which columns you're aggregating
        # combine groupby with selections for better performance and clarity
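
The equivalence described in the comments above can be checked directly. This sketch rebuilds a tiny frame with the same column names (`key1`, `data1`, `data2`) but made-up values, so the numbers are not those of the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy data, reusing the column names from the notebook.
df = pd.DataFrame({"key1": ["a", "a", "b", "b"],
                   "data1": [1.0, 2.0, 3.0, 5.0],
                   "data2": [10.0, 20.0, 30.0, 50.0]})

# Group first, then select the column ...
short = df.groupby("key1")["data1"].mean()

# ... or select the column first, then group by the key Series.
long_form = df["data1"].groupby(df["key1"]).mean()

# A list of column names yields a DataFrame instead of a Series.
frame = df.groupby("key1")[["data2"]].mean()

print(short.equals(long_form))  # True
print(type(short).__name__, type(frame).__name__)  # Series DataFrame
```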

In [43]:
# especially for large datasets,  it may be desirable to aggregate only a few columns
# For example, in the preceding dataset, to compute the means for just the data2 column and get the result as a DataFrame, we could write:

df.groupby(["key1", "key2"])[["data2"]].mean()

              data2
key1 key2
a    1     1.192453
     2    -0.197482
b    1     2.019757
     2    -0.477084


In [44]:
# the object returned by this indexing operation is a grouped dataframe if a list or array is passed, or a grouped series
# if only a single column name is passed as a scalar

s_grouped = df.groupby(["key1", "key2"])["data2"]
s_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x723df906bed0>

In [45]:
s_grouped.mean()

key1  key2
a     1       1.192453
      2      -0.197482
b     1       2.019757
      2      -0.477084
Name: data2, dtype: float64

## [ Grouping with Dictionaries and Series ]

In [46]:
# grouping information may exist in a form other than an array.
# consider another example
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wanda", "Jill", "Trey"])

people.iloc[2:3, [1,2]] = np.nan # adding a few NA values
people 

              a         b         c         d         e
Joe    0.342550 -0.349884 -0.508735 -2.228331 -0.321739
Steve  0.076718 -1.154433 -0.048853 -1.414574  0.662565
Wanda -1.718774       NaN       NaN -0.804147  1.078869
Jill  -0.057635 -0.237819  2.651237  1.113195 -0.270864
Trey   0.426659 -0.019596 -1.370235 -0.514216  0.361476


In [50]:
# now suppose I have a group correspondence for the columns and want to sum the columns by group
mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f" : "orange"}

# we could construct an array from this dictionary to pass to groupby, but we can just pass the dictionary itself
by_column = people.groupby(mapping, axis="columns")
by_column.sum()



           blue       red
Joe   -2.737066 -0.329073
Steve -1.463427 -0.415149
Wanda -0.804147 -0.639905
Jill   3.764433 -0.566318
Trey  -1.884451  0.768539
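
Since `axis="columns"` is deprecated in recent pandas releases, one alternative is to transpose, group on the (now row) labels, and transpose back, which is what the deprecation message suggests. The sketch below assumes pandas 2.1 or later and uses a small stand-in for `people` with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `people`, with easy-to-check values.
people = pd.DataFrame(np.arange(10, dtype=float).reshape(2, 5),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve"])

mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}

# Transpose so the column labels become the index, group them with the
# mapping, aggregate, then transpose back to the original orientation.
by_column_sum = people.T.groupby(mapping).sum().T
print(by_column_sum)
```

The result has the same layout as the `axis="columns"` version: one row per person and one column per color.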


> NOTE: unused grouping keys (here `"f"`) are OK.

In [51]:
# the same functionality holds for Series, which can be viewed as a fixed-size mapping

map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [52]:
people.groupby(map_series, axis="columns").count()



       blue  red
Joe       2    3
Steve     2    3
Wanda     1    2
Jill      2    3
Trey      2    3
