# Data Wrangling: Group Data

In [1]:
import pandas as pd
import numpy as np
%load_ext line_profiler
#%load_ext cython
#from joblib import Parallel, delayed
from scipy.stats import zscore

# Grouping Data 

Grouping data means we:

* Split: Segment the data into groups based on some criteria

* Apply: Run aggregate, transform, or filter operations on the elements of each group

* Combine: Merge the per-group results into a single data structure
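The steps above can be sketched on a small frame. This is a minimal illustration, assuming a hypothetical `toy` DataFrame that only mimics the column names of bureau_balance.csv; the values are made up.

```python
import pandas as pd

# Hypothetical toy data that mimics the columns of bureau_balance.csv.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["C", "C", "0", "X", "C"],
})

# Split: one group per distinct SK_ID_BUREAU value.
groups = toy.groupby("SK_ID_BUREAU")

# Apply: reduce each group's MONTHS_BALANCE column to a single number.
# pandas then gathers the per-group results into one Series keyed by group.
oldest_month = groups["MONTHS_BALANCE"].min()
print(oldest_month)
```

The result is indexed by the group key, which is why later cells can look up a single bureau ID directly in the output.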


In [2]:
# Load data 
df = pd.read_csv('/Users/stewarta/Documents/DATA/Home Data/bureau_balance.csv')

In [3]:
df.head()

,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


## Index the Data

In [4]:
df_bal = df.set_index('SK_ID_BUREAU')
df_bal.head()

SK_ID_BUREAU,MONTHS_BALANCE,STATUS
5715448,0,C
5715448,-1,C
5715448,-2,C
5715448,-3,C
5715448,-4,C


# Split a Dataframe into Groups

Split: Segment data based on criteria

Pandas objects can be split on any of their axes. Splits are created using the groupby() function.

We form groups by passing one or more columns and an axis to the groupby function. The default is axis = 0.
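A short sketch of the different ways to name the grouping keys, again on a hypothetical `toy` frame with made-up values. Passing the index name as a string, as the notebook does after set_index(), resolves to the index level; `level=` makes that explicit.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["C", "C", "0", "X", "C"],
})

# Group by a single column.
by_id = toy.groupby("SK_ID_BUREAU")

# Group by several columns: each distinct (id, status) pair forms a group.
by_id_status = toy.groupby(["SK_ID_BUREAU", "STATUS"])

# Group by an index level after the key column has become the index.
by_level = toy.set_index("SK_ID_BUREAU").groupby(level="SK_ID_BUREAU")

# ngroups reports how many groups each split produced.
print(by_id.ngroups, by_id_status.ngroups, by_level.ngroups)
```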



In [5]:
group_bal = df_bal.groupby('SK_ID_BUREAU')
group_bal.head()

SK_ID_BUREAU,MONTHS_BALANCE,STATUS
5715448,0,C
5715448,-1,C
5715448,-2,C
5715448,-3,C
5715448,-4,C
5715449,0,C
5715449,-1,C
5715449,-2,C
5715449,-3,C
5715449,-4,C


## Inspecting Groups
A single group can be selected using get_group()

In [6]:
%%timeit -n 1 -r 1 -t x = range(10) 
group_bal.get_group(5715448) # returns a dataframe!

2.69 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
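For comparison, here is a minimal sketch of inspecting groups on a hypothetical `toy` frame. It also shows label-based selection as an alternative for a one-off lookup on an indexed frame; whether that is faster on the full data set is an assumption that would need to be timed.

```python
import pandas as pd

# Hypothetical toy data indexed the same way as df_bal.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
}).set_index("SK_ID_BUREAU")

grouped = toy.groupby("SK_ID_BUREAU")

# size() reports how many rows each group holds.
print(grouped.size())

# get_group() returns the rows of one group as a DataFrame.
one_group = grouped.get_group(1)
print(type(one_group).__name__)

# Label selection with a list returns the rows for that key
# without going through the GroupBy object.
same_rows = toy.loc[[1]]
print(same_rows)
```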


## Iterating through Groups

In [7]:
%%timeit -n 1 -r 1 -t x = range(10)
count = 0
lst = df['SK_ID_BUREAU'].unique()
for key in lst:
    print(key)  # the group key is an integer ID
    print(group_bal.get_group(key))  # each group is a pandas DataFrame
    count += 1
    if count == 1:  # stop after the first group to keep the output short
        break

5715448
              MONTHS_BALANCE STATUS
SK_ID_BUREAU                       
5715448                    0      C
5715448                   -1      C
5715448                   -2      C
5715448                   -3      C
5715448                   -4      C
5715448                   -5      C
5715448                   -6      C
5715448                   -7      C
5715448                   -8      C
5715448                   -9      0
5715448                  -10      0
5715448                  -11      X
5715448                  -12      X
5715448                  -13      X
5715448                  -14      0
5715448                  -15      0
5715448                  -16      0
5715448                  -17      0
5715448                  -18      0
5715448                  -19      0
5715448                  -20      X
5715448                  -21      X
5715448                  -22      X
5715448                  -23      X
5715448                  -24      X
5715448             
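A GroupBy object can also be iterated directly, which avoids the separate unique() call and the get_group() lookup per key. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy data indexed the same way as df_bal.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
}).set_index("SK_ID_BUREAU")

grouped = toy.groupby("SK_ID_BUREAU")

# Iterating yields (key, sub-frame) pairs, one per group.
seen = []
for key, frame in grouped:
    seen.append((key, len(frame)))
    print(key, frame["MONTHS_BALANCE"].tolist())

# To look at only the first group, take one item from the iterator.
first_key, first_frame = next(iter(grouped))
print(first_key, len(first_frame))
```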

# Simple Operations on Groups

## Single Operations on Groups

Compute a statistic for each group: first, last, nth, mean, sum, std, var, min, max, size, count, describe, sem.

NOTE: Dask does not support every one of these methods: describe, nth, sem.
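A few of these methods on a hypothetical `toy` frame. The missing value is included on purpose to show that size counts every row while count skips NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing MONTHS_BALANCE value.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0.0, -1.0, np.nan, 0.0, -1.0],
})
col = toy.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"]

sizes = col.size()      # rows per group, NaN included
counts = col.count()    # non-missing values per group
firsts = col.first()    # first non-missing value per group
means = col.mean()      # mean per group, NaN skipped

print(pd.DataFrame({"size": sizes, "count": counts,
                    "first": firsts, "mean": means}))
```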

### Variance for each Group

2.42 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [8]:
%%timeit -n 1 -r 1 -t x = range(10)
group_bal['MONTHS_BALANCE'].var()

523 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Sum of a Selected Column for Each Group

In [15]:
%%timeit -n 1 -r 1 -t x = range(10)
res_bal = group_bal['MONTHS_BALANCE'].sum()
print(res_bal)

SK_ID_BUREAU
5001709   -4656
5001710   -3403
5001711      -6
5001712    -171
5001713    -231
5001714    -105
5001715   -1770
5001716   -3655
5001717    -231
5001718    -741
5001719    -903
5001720    -630
5001721   -3570
5001722   -3655
5001723    -465
5001724    -465
5001725     -28
5001726    -741
5001727   -4465
5001728       0
5001729     -21
5001730   -1830
5001731     -55
5001732    -630
5001733     -55
5001734      -3
5001735    -276
5001736    -276
5001737    -231
5001738    -946
           ... 
6842859    -630
6842860    -630
6842861    -630
6842862   -2346
6842863   -3081
6842864    -595
6842865    -171
6842866     -10
6842867    -105
6842868   -1540
6842869    -351
6842870    -630
6842871    -171
6842872   -3828
6842873   -1540
6842874   -4005
6842875   -2926
6842876    -325
6842877    -820
6842878    -435
6842879    -741
6842880   -1653
6842881    -496
6842882     -28
6842883    -666
6842884   -1128
6842885    -276
6842886    -528
6842887    -666
6842888   -1891
Name: MONTH

In [14]:
res_bal

NameError: name 'res_bal' is not defined
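The NameError is most likely a scoping effect: %%timeit runs the cell body inside a timing function, so names assigned there (here res_bal) are not kept in the notebook namespace. A minimal sketch of one workaround, timing by hand with time.perf_counter on a hypothetical `toy` frame so that the result stays available:

```python
import time

import pandas as pd

# Hypothetical toy data indexed the same way as df_bal.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
}).set_index("SK_ID_BUREAU")
group_toy = toy.groupby("SK_ID_BUREAU")

# Time the statement manually; res_toy remains an ordinary variable.
start = time.perf_counter()
res_toy = group_toy["MONTHS_BALANCE"].sum()
elapsed = time.perf_counter() - start

print(f"{elapsed:.6f} s")
print(res_toy)
```

Another option is to run the assignment in a plain cell first and keep %%timeit for a cell that only repeats the expression.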

# Aggregate

## Multiple Operations on Groups

The aggregation API lets you apply one or more operations over the specified axis.

* Pass multiple aggregation functions as a list.

* You can also pass **named methods** as strings. A single name, such as df.agg('sum'), returns a Series of the aggregated output; a list, such as df.agg(['sum', 'mean']), returns a DataFrame with one row per operation.

* NumPy mathematical functions can be passed as well: https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html

* NOTE: Using a single function is equivalent to apply().
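The options above can be sketched on a grouped frame as follows. The `toy` data and the output column names `months_total` and `distinct_status` are hypothetical; string method names are used instead of NumPy functions.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["C", "C", "0", "X", "C"],
})
grouped = toy.groupby("SK_ID_BUREAU")

# Several operations on one column, passed as a list of method names.
stats = grouped["MONTHS_BALANCE"].agg(["sum", "mean", "std"])

# A different operation per column, passed as a dict.
per_column = grouped.agg({"MONTHS_BALANCE": "mean", "STATUS": "nunique"})

# Named aggregation: output_name=(column, operation).
named = grouped.agg(
    months_total=("MONTHS_BALANCE", "sum"),
    distinct_status=("STATUS", "nunique"),
)

print(stats)
print(per_column)
print(named)
```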

### Multiple Operations to the Same Column

Pandas: 1.46 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [12]:
%%timeit -n 1 -r 1 -t x = range(10)
group_bal['MONTHS_BALANCE'].agg([np.sum, np.mean, np.std])

2.11 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Multiple Operations to Different Columns

In [13]:
%%timeit -n 1 -r 1 -t x = range(10)
group_bal.agg({'MONTHS_BALANCE' : np.mean, 'STATUS' : np.sum})

1min 42s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Transformation Operation

The transform() method returns an object that has the same index, and therefore the same size, as the original. It can take several operations at once rather than one by one, and its API is quite similar to the .agg API.

Some examples:

* Standardize data (zscore) within a group.
* Filling NAs within groups with a value derived from each group.
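The second example can be sketched like this, assuming a hypothetical `toy` frame with one missing value. The point to notice is that transform() returns one value per input row, not one per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value in group 1.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0.0, np.nan, -2.0, -4.0, -6.0],
})
grouped = toy.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"]

# transform() broadcasts each group's mean back to every row of that group,
# so the result has the same length and index as the input column.
group_mean = grouped.transform("mean")

# Fill each missing value with the mean of the group it belongs to.
filled = toy["MONTHS_BALANCE"].fillna(group_mean)

print(filled)
```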


In [24]:
%%timeit -n 1 -r 1 -t x = range(10) # Not Good!
group_bal['MONTHS_BALANCE'].transform(zscore)

892 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Apply Operation: Not Good!

In [16]:
## T_01: scipy.stats zscore via apply
%prun -l 15 group_bal['MONTHS_BALANCE'].apply(zscore) ## i have no idea what this function is doing!!!

  return (a - mns) / sstd


 

         160210183 function calls (158575374 primitive calls) in 259.787 seconds

   Ordered by: internal time
   List reduced from 235 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   817395   27.231    0.000   44.896    0.000 _methods.py:86(_var)
  2452185   23.560    0.000   23.560    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   817397   14.289    0.000   57.640    0.000 base.py:255(__new__)
 37600322   12.762    0.000   16.577    0.000 {built-in method builtins.isinstance}
4086977/3269582   10.832    0.000   18.452    0.000 {built-in method numpy.core.multiarray.array}
   817395    8.925    0.000  102.127    0.000 stats.py:2194(zscore)
        1    7.511    7.511  257.069  257.069 groupby.py:2247(apply)
   817395    6.559    0.000   21.426    0.000 _methods.py:53(_mean)
   817395    5.844    0.000   51.207    0.000 _methods.py:133(_std)
   817395    4.845    0.000  109.068    0.000 internals.py:4702(get_slice)
  572179

In [54]:
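# T_02: pure Python loop: compute the z-score group by group
def zcore_score_loop(group):
    """Return a dict that maps each group key to its array of z-scores."""
    result = {}
    for name, g in group:
        x = g.values       # NumPy array of the group's values
        m = x.mean()
        std = x.std()      # population standard deviation (ddof=0)
        result[name] = (x - m) / std
    return result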

In [55]:
%prun -l 15 zcore_score_loop(group_bal['MONTHS_BALANCE'])



 

         147131493 function calls (146314097 primitive calls) in 226.697 seconds

   Ordered by: internal time
   List reduced from 189 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   817395   40.348    0.000   63.317    0.000 _methods.py:91(_var)
  2452185   32.202    0.000   32.202    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   817396   11.589    0.000   46.632    0.000 base.py:255(__new__)
        1   10.277   10.277  226.011  226.011 <ipython-input-54-f49d24733107>:2(zcore_score_loop)
 35148060    8.983    0.000   12.249    0.000 {built-in method builtins.isinstance}
   817395    5.821    0.000   25.448    0.000 _methods.py:58(_mean)
   817395    5.217    0.000   68.911    0.000 _methods.py:138(_std)
  1634790    4.544    0.000    4.849    0.000 _methods.py:48(_count_reduce_items)
  5721776    4.174    0.000    4.174    0.000 {built-in method builtins.hasattr}
   817395    3.975    0.000   88.142    0.000 internals.p
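Both profiles show 817,395 calls to the per-group mean, variance, and std helpers, that is, one Python-level call per group. A possible alternative, sketched here on a hypothetical `toy` frame and not profiled on the real data, is to let pandas compute the group statistics with its built-in reductions and do the arithmetic once over the whole column. ddof=0 is passed so the result matches the default of scipy's zscore.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data indexed the same way as df_bal.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0.0, -1.0, -2.0, 0.0, -4.0],
}).set_index("SK_ID_BUREAU")

col = toy["MONTHS_BALANCE"]
grouped = col.groupby(level="SK_ID_BUREAU")

# Built-in reductions are computed per group and broadcast back to the rows.
mean = grouped.transform("mean")
std = grouped.transform("std", ddof=0)

# One vectorized expression replaces one function call per group.
z = (col - mean) / std

print(z)
```

Groups whose values are all identical have a standard deviation of zero, so their z-scores come out as NaN in this sketch, just as the division in zscore is undefined for them.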

## Filter:

Discard some groups, according to a group-wise computation that evaluates to True or False.

* Discard data that belongs to groups with only a few members.
* Filter out data based on the group sum or mean.   
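Both cases can be written with the GroupBy filter() method, which keeps the rows of every group for which the function returns True. A minimal sketch on a hypothetical `toy` frame; the thresholds are arbitrary.

```python
import pandas as pd

# Hypothetical toy data: group 3 has a single member.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2, 3],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1, 0],
})
grouped = toy.groupby("SK_ID_BUREAU")

# Keep only the rows of groups that have at least two members.
big_enough = grouped.filter(lambda g: len(g) >= 2)

# Keep only the rows of groups whose MONTHS_BALANCE sum is below -1.
low_sum = grouped.filter(lambda g: g["MONTHS_BALANCE"].sum() < -1)

print(big_enough)
print(low_sum)
```

filter() calls the function once per group, so on a frame with many small groups it may show the same per-group overhead as apply().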


### This takes a long time ...

In [11]:
%%timeit -n 1 -r 1 -t x = range(10)
group_bal['MONTHS_BALANCE'].apply(lambda x: x.sum() > 0 )

2min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### ... but when reformulated to aggregate first, the runtime is much shorter.

In [49]:
%%timeit -n 1 -r 1 -t x = range(10)
group_bal['MONTHS_BALANCE'].sum().apply(lambda x : x > 0)

736 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
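The reformulated cell produces one boolean per group. To actually drop rows, that per-group decision still has to be mapped back onto the original frame. A minimal sketch on a hypothetical `toy` frame, with an arbitrary threshold of -1 instead of 0 so that some rows survive:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2, 3],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1, 0],
})
key = toy["SK_ID_BUREAU"]

# Comparing the aggregated Series directly gives one boolean per group;
# no apply() or lambda is needed.
keep_group = toy["MONTHS_BALANCE"].groupby(key).sum() < -1

# Look up each row's group decision and use the result as a row mask.
row_mask = key.map(keep_group)
kept = toy[row_mask]

print(kept)
```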
