# Grouping


## groupby()

groupby(by=None, axis=0, level=None, as_index=True, ...)


```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4,4), columns=list('ABUV'), index=pd.MultiIndex.from_product([['a','b'],['u','v']]))

df
      A   B   U   V
a u   0   1   2   3
  v   4   5   6   7
b u   8   9  10  11
  v  12  13  14  15
    
for name, gp in df.groupby(level=0): print(name); print(gp); break
a
     A  B  U  V
a u  0  1  2  3
  v  4  5  6  7
    
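# Group the columns by label. axis=1 is deprecated in pandas >= 2.1, where df.T.groupby([...]) is the suggested replacement.
for name, gp in df.groupby(['v','c','v','c'], axis=1): print(name); print(gp); break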
c
      B   V
a u   1   3
  v   5   7
b u   9  11
  v  13  15
```
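The `by` and `as_index` parameters from the signature above can be sketched as follows. The frame `sales` and its column names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data, not the frame used in the example above.
sales = pd.DataFrame({'key': ['x', 'y', 'x', 'y'], 'val': [1, 2, 3, 4]})

# as_index=True (the default): the group labels become the index of the result.
indexed = sales.groupby('key').sum()

# as_index=False: the group labels stay as a regular column.
flat = sales.groupby('key', as_index=False).sum()

print(indexed)
print(flat)
```

`as_index=False` is convenient when the result is going to be merged or plotted and a flat table is easier to handle than an indexed one.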

## agg(), apply(), filter(), transform() 

* agg(): The function passed to agg() is applied to each column of each group and returns a scalar, so the result has one row per group.


* apply(): The function passed to apply() is applied to each subframe and returns a DataFrame, a Series or a scalar. apply() is very flexible, but it can be considerably slower than more specific methods such as agg() or transform(). Pandas offers a wide range of methods that are much faster for their specific purposes, so try those before reaching for apply().


* filter(): The function passed to filter() is applied to each subframe and returns True or False; only the rows of groups for which it returns True are kept.


* transform(): The function passed to transform() is applied to each column of each group and returns either a Series with the same index as the group or a scalar that is broadcast over the group, so the result has the same index as the original object.


```python
df
   A  B     C
0  1  1  1.82
1  1  2 -0.87
2  2  3  0.35
3  2  4 -1.26


df.groupby('A').agg(['min', 'max'])
    B         C      
  min max   min   max
A                    
1   1   2 -0.87  1.82
2   3   4 -1.26  0.35


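# Return a modified copy instead of mutating the group in place.
# group_keys=False keeps the original index in pandas >= 2.0.
def f(x): return x.assign(C=x['C'] / x['B']**2)
df.groupby('A', group_keys=False).apply(f)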
   A  B         C
0  1  1  1.820000
1  1  2 -0.217500
2  2  3  0.038889
3  2  4 -0.078750


df.groupby('A').filter(lambda x: x['B'].mean() > 2)
   A  B     C
2  2  3  0.35
3  2  4 -1.26


df.groupby('A').transform(lambda x: x - x.mean())
     B      C
0 -0.5  1.345
1  0.5 -1.345
2 -0.5  0.805
3  0.5 -0.805


# Here ts is a DataFrame whose index is of type DatetimeIndex.
ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
```
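The advice to prefer specific methods over apply() can be illustrated with a small sketch. It rebuilds the four-row frame from the example above and computes the same per-group quantities twice; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [1.82, -0.87, 0.35, -1.26]})
g = df.groupby('A')

# Generic route: a Python function is called once per group.
range_via_apply = g['C'].apply(lambda s: s.max() - s.min())

# Same numbers from built-in reductions, without a per-group Python call.
range_via_builtin = g['C'].max() - g['C'].min()

# Centering by group: transform('mean') broadcasts each group mean back to the rows.
centered = df['C'] - g['C'].transform('mean')

print(range_via_apply)
print(range_via_builtin)
print(centered)
```

On a frame this small the difference in speed is not measurable; the point is only that both routes give the same values.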

## rolling(), expanding()

```python
df = pd.DataFrame({'A': [1] * 4 + [5] * 4, 'B': range(8)})

# Sum over a sliding window of 2 rows, restarted within each group
df.groupby('A').B.rolling(2).sum()
A   
1  0     NaN
   1     1.0
   2     3.0
   3     5.0
5  4     NaN
   5     9.0
   6    11.0
   7    13.0
Name: B, dtype: float64

        
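# Cumulative sum within each group; the argument 2 is min_periods, so the first row of each group is NaN
df.groupby('A').B.expanding(2).sum()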
A   
1  0     NaN
   1     1.0
   2     3.0
   3     6.0
5  4     NaN
   5     9.0
   6    15.0
   7    22.0
Name: B, dtype: float64
```

## pd.Grouper()

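A Grouper packages a groupby instruction for a target object: a column (key) or an index level (level) to group on, optionally with a resampling frequency (freq). It can be passed to groupby() on its own or mixed with other keys in a list.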

```python
df
     A   B   U
a u  3   1   2
  v  7   5   6
b u  7   9  10
  v  3  13  14

df.groupby([pd.Grouper(level=1), 'A']).sum()
      B   U
  A        
u 3   1   2
  7   9  10
v 3  13  14
  7   5   6

# Specify a resample operation on the column 'date' with a frequency of 60s
df.groupby(pd.Grouper(key='date', freq='60s'))

# Specify a resample operation on the level 'date' on the columns axis with a frequency of 60s
df.groupby(pd.Grouper(level='date', freq='60s', axis=1))
```
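The resampling form of Grouper can be tried on a small frame. This is a minimal sketch; the frame `events`, its timestamps and its `value` column are made up for the illustration.

```python
import pandas as pd

# Hypothetical events with a datetime column named 'date'.
events = pd.DataFrame({
    'date': pd.to_datetime([
        '2024-01-01 00:00:10', '2024-01-01 00:00:50',
        '2024-01-01 00:01:20', '2024-01-01 00:02:05',
    ]),
    'value': [1, 2, 3, 4],
})

# Bucket the rows into 60-second bins on the 'date' column and sum each bin.
per_minute = events.groupby(pd.Grouper(key='date', freq='60s'))['value'].sum()
print(per_minute)
```

The result is indexed by the start of each 60-second bin.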


## Other functions used in groups

### first(), last(), head(), tail(), nth()
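A minimal sketch of how these methods behave on a grouped object, using a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: three rows in group 1, two rows in group 2.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': [10, 20, 30, 40, 50]})
g = df.groupby('A')

first_rows = g.first()    # first non-null value per column, indexed by group label
last_rows = g.last()      # last non-null value per column, indexed by group label
head_rows = g.head(2)     # first two rows of each group, original index kept
tail_rows = g.tail(1)     # last row of each group, original index kept
second_rows = g.nth(1)    # row at 0-based position 1 within each group

print(first_rows, last_rows, head_rows, tail_rows, second_rows, sep='\n\n')
```

first() and last() reduce each group to one row, while head(), tail() and nth() select rows by position within the group.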


### nlargest(), nsmallest()
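A minimal sketch, assuming a hypothetical toy frame; both methods are called on a single grouped column:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': [10, 20, 30, 40, 50]})

# The two largest values of B within each group, largest first.
top2 = df.groupby('A')['B'].nlargest(2)

# The smallest value of B within each group.
bottom1 = df.groupby('A')['B'].nsmallest(1)

print(top2)
print(bottom1)
```

The result is a Series whose index combines the group label with the original row label.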


### shift()
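A minimal sketch, assuming a hypothetical toy frame. The shift is applied within each group, so values never leak from one group into the next:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': [10, 20, 30, 40, 50]})

# Previous value of B within the same group; the first row of each group gets NaN.
prev = df.groupby('A')['B'].shift(1)

# Row-to-row change inside each group.
delta = df['B'] - prev

print(prev)
print(delta)
```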

# Wide format, Long format


## stack(), unstack()

* stack(level=-1, dropna=True): Move the prescribed level(s) from the columns to the index.
* unstack(level=-1, fill_value=None): Move the prescribed index level(s) to the columns. Unstacking a Series with a MultiIndex produces a DataFrame.

```python
s = pd.Series(range(4), index=pd.MultiIndex.from_product([['A','B'],['a','b']]))
s
A  a    0
   b    1
B  a    2
   b    3
    
s.unstack()
   a  b
A  0  1
B  2  3

s.unstack().stack()    # same as s
```
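The `level` and `fill_value` parameters of unstack() can be sketched on the same Series. The variable names below are only illustrative.

```python
import pandas as pd

s = pd.Series(range(4), index=pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']]))

# level=0 moves the outer index level to the columns instead of the inner one.
outer_to_columns = s.unstack(level=0)

# Drop the ('B', 'b') entry so that one cell of the wide table is missing,
# then fill that cell with 0 instead of NaN.
incomplete = s.iloc[:3]
filled = incomplete.unstack(fill_value=0)

print(outer_to_columns)
print(filled)
```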

## melt(), pivot()

* melt(id_vars, value_vars, ...)
* pivot(index, columns, values)

```python
df = pd.DataFrame({'A': list('abc'), 'B':[1,3,5],'C':[2,4,6]})
df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6

# Do not reassign df here: the next call still needs the original columns B and C
df.melt(id_vars='A', value_vars='B')
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5

df2 = df.melt(id_vars='A', value_vars=['B','C'])
df2
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6

df2.pivot(index='A', columns='variable', values='value')
variable  B  C
A             
a         1  2
b         3  4
c         5  6
```


## pivot_table()

pivot_table = pivot + aggregate function

pivot_table(values, index, columns, aggfunc, fill_value, margins, dropna, ...)

```python
df = pd.DataFrame({'A': ['foo',]*5+['bar',]*4, 
                  'B': np.array(['one','two'])[[0,0,0,1,1,0,0,1,1]], 
                  'C': np.array(['small','large'])[[0,1,1,0,0,1,0,0,1]],
                  'D': [1,2,2,3,3,4,5,6,7],
                  'E': [2,3,5,5,6,6,8,9,9]})

df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  3
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

# The string 'sum' avoids the FutureWarning that recent pandas versions raise for aggfunc=np.sum
df.pivot_table(values='D', index=['A','B'], columns=['C'], aggfunc='sum')
C        large  small
A   B                
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
```
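The difference between pivot() and pivot_table() shows up when an index/column pair occurs more than once. The sketch below uses a hypothetical three-row frame in which the pair ('x', 'p') is duplicated.

```python
import pandas as pd

# Hypothetical data: the pair (k='x', c='p') appears twice.
df = pd.DataFrame({'k': ['x', 'x', 'y'], 'c': ['p', 'p', 'q'], 'v': [1, 2, 3]})

# pivot() cannot decide which of the duplicate values to keep and raises.
try:
    df.pivot(index='k', columns='c', values='v')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates; fill_value replaces the empty cells.
table = df.pivot_table(index='k', columns='c', values='v', aggfunc='sum', fill_value=0)

print(pivot_failed)
print(table)
```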

# Regular expressions


## \\*number*

\\*number* is a backreference: it refers to the text matched by the group with that number. For example, \1 stands for whatever the first parenthesized group of the regex matched.

```python
import re
re.sub(r"([?.,!])", r" \1 ", "He's good,, but not always..")       # "He's good ,  ,  but not always .  . "
```
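A backreference can also be used inside the pattern itself, where it matches a repetition of what the group already captured. A minimal sketch on the same sentence:

```python
import re

text = "He's good,, but not always.."

# ([?.,!])\1+ matches a punctuation mark followed by one or more copies of itself;
# replacing the whole run with \1 collapses it to a single mark.
collapsed = re.sub(r"([?.,!])\1+", r"\1", text)
print(collapsed)    # He's good, but not always.

# In a replacement string, \g<1> is an unambiguous spelling of \1,
# useful when the reference is directly followed by a digit.
padded = re.sub(r"(\d+)", r"\g<1>0", "a1 b22")
print(padded)       # a10 b220
```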