# Group By: split-apply-combine
> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

By "group by" we are referring to a process involving one or more of the following steps:

  - **Splitting** the data into groups based on some criteria. (Divide the rows into groups.)
  - **Applying** a function to each group independently. (Take some action on each group.)
  - **Combining** the results into a data structure. (Collect the per-group results.)

In the **apply** step, we might wish to do one of the following: (What action do we take on each group?)

  - **Aggregation**: compute a summary statistic (or statistics) for each group. (Extract a scalar value per group.)
    - Compute group sums or means.
    - Compute group sizes / counts.

  - **Transformation**: perform some group-specific computations and return a like-indexed object. (Build a new, same-shaped object from each group.)
    - Standardize data (z-score) within a group.
    - Fill NAs within groups with a value derived from each group.

  - **Filtration**: discard some groups, according to a group-wise computation that evaluates True or False. (Decide which groups to keep.)

    - Discard data that belongs to groups with only a few members.
    - Filter out data based on the group sum or mean.
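
The three kinds of apply step differ mainly in the shape of what they return. The following is a minimal sketch on a hypothetical toy frame (the `team` and `score` columns are invented for illustration and are not part of the dataset used below):

```python
import pandas as pd

# Hypothetical toy data, used only to contrast the three operations.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1.0, 3.0, 2.0, 4.0, 6.0],
})
by_team = df.groupby("team")["score"]

# Aggregation: one value per group, indexed by the group key.
agg = by_team.mean()

# Transformation: same index and length as the original column.
centered = by_team.transform(lambda s: s - s.mean())

# Filtration: whole groups are kept or dropped; the original rows come back.
big = df.groupby("team").filter(lambda g: len(g) >= 3)

print(agg.shape, centered.shape, big.shape)
```

Aggregation shrinks the data to one row per group, transformation keeps every row, and filtration keeps only the rows of the groups that pass the test.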

```
import numpy as np
import pandas as pd

import seaborn as sns

tips = sns.load_dataset('tips')
tips.head()
```

### Splitting an object into groups

- On a DataFrame, we obtain a `GroupBy` object by calling `groupby()`. For `tips` we can group by a single column such as `smoker`, or by several columns at once such as `smoker` and `time`:

> [**References**]
> - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
> - https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#indexing-iteration
> - https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

```
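# observed=True keeps only the key combinations present in the data and
# avoids the FutureWarning that recent pandas versions emit for categorical keys.
grouped = tips.groupby('smoker', observed=True)
grouped

grouped.groups         # mapping: group key -> row labels
grouped.groups.keys()

grouped = tips.groupby(['smoker', 'time'], observed=True)
grouped.groups

len(grouped)           # number of groups
# numeric_only=True skips the categorical columns (sex, day), which cannot be summed
grouped.sum(numeric_only=True)
grouped.describe()

df = grouped.sum(numeric_only=True)
df.index               # MultiIndex with levels (smoker, time)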
```
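
To make the structure of `.groups` and of a multi-key result easier to see, here is a small sketch on a hypothetical four-row frame that stands in for `tips` (the values are invented):

```python
import pandas as pd

# Hypothetical toy frame standing in for `tips`.
df = pd.DataFrame({
    "smoker": ["Yes", "No", "No", "Yes"],
    "time": ["Lunch", "Lunch", "Dinner", "Lunch"],
    "tip": [1.0, 2.0, 3.0, 4.0],
})

g = df.groupby(["smoker", "time"])

# .groups maps each key tuple to the row labels that belong to that group.
groups = {key: list(labels) for key, labels in g.groups.items()}

# Aggregating by two keys yields a MultiIndex with one level per key.
totals = g["tip"].sum()

print(groups)
print(totals.index.names)
```

Only the key combinations that actually occur become groups here, because the keys are plain strings rather than categoricals.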

### Selecting a group

- A single group can be selected using `get_group()`:

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#indexing-iteration

```
keys = list(grouped.groups.keys())
keys

index = grouped.groups[keys[0]]
tips.loc[index]

grouped.get_group(keys[0])
```

### Iterating through groups

- With the `GroupBy` object in hand, iterating through the grouped data is very natural and functions similarly to `itertools.groupby()`:

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#indexing-iteration

```
for key, group in grouped:
    print(key, '\n', group.head(), end='\n'*2)
```
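
One difference from `itertools.groupby()` is worth keeping in mind: the standard-library function only groups consecutive items with equal keys, while pandas collects every row that shares a key. A minimal sketch with invented data:

```python
from itertools import groupby

import pandas as pd

# Hypothetical unsorted data: the key "a" appears in two separate runs.
labels = ["a", "b", "a"]
values = [1, 2, 3]

# itertools.groupby starts a new group whenever the key changes,
# so unsorted input yields "a" twice.
runs = [
    (key, [v for _, v in pairs])
    for key, pairs in groupby(zip(labels, values), key=lambda p: p[0])
]

# pandas gathers all rows with the same key into a single group.
df = pd.DataFrame({"label": labels, "value": values})
groups = [(key, grp["value"].tolist()) for key, grp in df.groupby("label")]

print(runs)
print(groups)
```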

### Aggregation

- Once the `GroupBy` object has been created, several methods are available to perform a computation on the grouped data. An obvious one is aggregation via the `aggregate()` or equivalently `agg()` method:

```
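# Select the numeric columns to aggregate
grouped2 = grouped[['total_bill', 'tip']]
grouped2.get_group(('Yes', 'Lunch')).sum()

# String names are preferred; passing np.sum triggers a FutureWarning in recent pandas
grouped2.aggregate('sum')

grouped2.sum()   # same result as aggregate('sum')
grouped2.agg(['sum', 'mean', 'std'])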
```
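
When different columns need different statistics, `agg()` also accepts keyword arguments of the form `new_name=(column, function)`, known as named aggregation. A short sketch on a hypothetical frame (the output column names `bill_total`, `tip_mean`, and `tip_max` are chosen here for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same column names as `tips`.
df = pd.DataFrame({
    "smoker": ["Yes", "Yes", "No", "No"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip": [1.0, 3.0, 4.0, 6.0],
})

# Each keyword becomes an output column: name=(input column, function).
summary = df.groupby("smoker").agg(
    bill_total=("total_bill", "sum"),
    tip_mean=("tip", "mean"),
    tip_max=("tip", "max"),
)

print(summary)
```

This produces flat, descriptive column names instead of the two-level columns returned when a list of functions is passed.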

### Transformation

- The `transform` method returns an object with the same index (and therefore the same size) as the one being grouped.

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#transformation

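1. First, create a time series covering the 1100 days from October 1, 1999 to October 4, 2002.
2. Draw the values at random from a Gaussian distribution (mean 0.5, standard deviation 2).
3. Compute the 100-day moving average of this time series.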

```
index = pd.date_range('10/1/1999', periods=1100)
index

ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
ts.head()

ts = ts.rolling(window=100).mean().dropna() # 100-day moving average; the first 99 values are NaN and get dropped
ts.head()
```

- How do we normalize the data within each year (mean 0, standard deviation 1)?
> `(s - s.mean()) / s.std()`

```
grouped = ts.groupby(lambda x: x.year)
grouped.groups

s = grouped.get_group(2000)
(s - s.mean()) / s.std()

transformed = grouped.transform(lambda s: (s - s.mean()) / s.std())
transformed

s = transformed.groupby(lambda x: x.year).get_group(2000)
s.mean(), s.std()

compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})
compare.T

compare.plot(figsize=(15,4))
```
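
The other transformation mentioned at the top, filling NAs with a value derived from each group, follows the same pattern. A minimal sketch on a hypothetical frame (the `year` and `value` columns are invented), using the group mean as the fill value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per year.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "value": [1.0, np.nan, 3.0, 10.0, np.nan],
})

# transform("mean") broadcasts each group's mean back onto its rows,
# so the result lines up with the original index.
group_mean = df.groupby("year")["value"].transform("mean")

# Fill each missing value with the mean of its own group.
filled = df["value"].fillna(group_mean)

print(filled)
```

Because the transformed result is like-indexed, it can be passed straight to `fillna()` without any manual alignment.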

### Filtration

- The `filter` method returns a subset of the original object: the rows of every group for which the supplied function returns True.

```
grouped2.sum()

# Keep only the groups whose total tip lies strictly between 100 and 300
df = grouped2.filter(lambda g: 100 < g['tip'].sum() < 300)
df.T

df = grouped2.get_group(('Yes', 'Dinner'))
df.T

df = grouped2.get_group(('No', 'Lunch'))
df.T
```
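
Discarding groups with only a few members works the same way. The sketch below uses a hypothetical frame (invented `day` and `tip` values) and also shows an equivalent boolean mask built with `transform`, which is an alternative formulation rather than something the steps above rely on:

```python
import pandas as pd

# Hypothetical toy data: Thu has 1 row, Fri has 2, Sat has 3.
df = pd.DataFrame({
    "day": ["Thu", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Keep only the rows that belong to groups with at least two members.
kept = df.groupby("day").filter(lambda g: len(g) >= 2)

# Same selection as a mask: "count" gives the number of non-missing
# tips per group, broadcast back onto every row.
mask = df.groupby("day")["tip"].transform("count") >= 2
kept_via_mask = df[mask]

print(kept)
```

The mask version avoids calling a Python function once per group, which can matter when there are many groups.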