# Subsetting and split-apply-combine

Subsetting: only load the columns and rows that you need.
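As a minimal sketch of subsetting at load time, `pd.read_csv` accepts `usecols` to pick columns and `nrows` to stop after the first few rows. The toy CSV text below is made up for illustration; the `YEAR` and `STATE` columns are assumptions and may not exist in the real files.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for one yearly file.
csv_text = """ABI,YEAR,SECTOR,EMPLOYEES,STATE
1,2001,11,5,WI
2,2001,23,12,IL
3,2001,44,3,WI
4,2001,52,40,MN
"""

# usecols keeps only the named columns; nrows stops after the first N data rows.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['ABI', 'EMPLOYEES'], nrows=2)
print(subset)
```

Columns that are never parsed do not end up in the DataFrame, so this keeps memory use down when only a few columns are needed.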

Split-apply-combine strategy:
- **split** your data into smaller subsets
- **apply** the necessary transformation to each subset, one at a time, and store the results
- **combine** the results from all subsets to get the final result
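The three steps above can be sketched on a toy example. This is only an illustration under assumptions: the CSV text is made up, and `chunksize` stands in for whatever split makes sense for your data (in this notebook, one file per year).

```python
import io

import pandas as pd

# Hypothetical toy data; the real files have many more rows and columns.
csv_text = """SECTOR,EMPLOYEES
11,5
23,12
11,7
44,3
23,8
44,1
"""

partial = []
# split: read the file in chunks of 2 rows instead of all at once
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # apply: aggregate each chunk and keep only the small result
    partial.append(chunk.groupby('SECTOR')['EMPLOYEES'].sum())

# combine: add up the partial sums that belong to the same sector
total = pd.concat(partial).groupby(level=0).sum()
print(total)
```

Sums and counts combine by simple addition. A mean does not: compute it from the combined sum and count rather than averaging the per-chunk means.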

In [None]:
import os

import pandas as pd
import joblib


os.getpid()

### bigger than memory

Loading the whole example dataset into a DataFrame (20M rows, 15 columns) takes about 1 minute and occupies 4GB+ of memory, and more once you start running computations. This is not feasible in the Binder environment, where memory is limited to 1-2GB: the kernel will crash and restart. For the actual InfoGroup data, multiply these numbers by 10.

In [None]:
# running this cell in Binder will restart your kernel
df = []
for year in range(2001, 2021):
    df.append(pd.read_csv(f'data/synig/{year}.csv'))
df = pd.concat(df)
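One way to get a rough idea of the memory footprint without loading everything is to measure a small sample and extrapolate. The sketch below uses a made-up three-column frame, so the number it prints says nothing about the real dataset; it only shows the mechanics of `DataFrame.memory_usage`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in practice this could come from pd.read_csv(..., nrows=...).
n_sample = 1_000
sample = pd.DataFrame({
    'ABI': np.arange(n_sample, dtype='int64'),
    'EMPLOYEES': np.ones(n_sample, dtype='int64'),
    'SECTOR': ['44'] * n_sample,
})

# deep=True also counts the memory held by string values, not just pointers to them.
sample_bytes = sample.memory_usage(deep=True).sum()
bytes_per_row = sample_bytes / len(sample)

# Extrapolate to the full row count quoted in the text (20M rows).
n_total = 20_000_000
estimate_gb = bytes_per_row * n_total / 1e9
print(f'{estimate_gb:.1f} GB')
```

The estimate is only as good as the sample: if string columns are longer or more varied elsewhere in the data, the real footprint will differ.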

### use a subset for development and testing

If data rows are in random order, reading just the first few rows will give you a representative sample.

In [None]:
df = []
for year in range(2001, 2021):
    df.append(pd.read_csv(f'data/synig/{year}.csv', nrows=10_000))
df = pd.concat(df)

In [None]:
df.sample(10)

In [None]:
df.groupby('SECTOR')['EMPLOYEES'].agg(['size', 'sum', 'mean']).astype(int).T

We are not getting all sectors of the economy here. Clearly, row order is not random.

### create a random sample

Let's create a random 5% sample.

I will only use a subset of years to save time. Normally you would want to save the results of long-running intermediate steps to disk. I will return to this when we talk about `parquet` and `joblib`.

In [None]:
df = []
for year in range(2001, 2006):
    d = pd.read_csv(f'data/synig/{year}.csv')
    d = d.sample(frac=0.05)
    df.append(d)
df = pd.concat(df)

In [None]:
df.shape

In [None]:
df.sample(10)

The problem with this simple approach on our dataset is that longitudinal histories are broken: rows are drawn independently in each year, so the same establishment rarely appears in consecutive years. Even if we could load all years of data and sample rows from that, it would not help. The solution is to draw a random sample of unique identifiers and then get the full histories for those identifiers. This approach yields a sample that has the same distribution as the original.
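The difference between the two sampling schemes is easy to see on a toy panel. This is a sketch with made-up data: the `YEAR` column and the use of `isin` for filtering are assumptions for illustration, not the method used in the cells of this notebook.

```python
import pandas as pd

# Hypothetical toy panel: 4 establishments (ABI) observed in 3 years each.
panel = pd.DataFrame({
    'ABI': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'YEAR': [2001, 2002, 2003] * 4,
})

# Row sampling: each row is drawn independently, so histories can get gaps.
by_row = panel.sample(frac=0.5, random_state=0)

# ID sampling: draw identifiers first, then keep every row for those identifiers.
ids = panel['ABI'].drop_duplicates().sample(frac=0.5, random_state=0)
by_id = panel[panel['ABI'].isin(ids)]

# Number of years kept per establishment under each scheme.
print(by_row.groupby('ABI').size())
print(by_id.groupby('ABI').size())
```

Under ID sampling every selected establishment keeps all three of its years, while row sampling gives no such guarantee.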

In [None]:
abi = []
for year in range(2001, 2006):
    abi.append(pd.read_csv(f'data/synig/{year}.csv', usecols=['ABI']))
abi = pd.concat(abi)
abi = abi.drop_duplicates()
abi = abi.sample(frac=0.05)

In [None]:
df = []
for year in range(2001, 2006):
    d = pd.read_csv(f'data/synig/{year}.csv')
    # indicator=True adds a '_merge' column; 'both' marks rows whose ABI is in the sample
    d = d.merge(abi, how='left', on='ABI', indicator=True)
    d = d[d['_merge'] == 'both']
    del d['_merge']
    df.append(d)
df = pd.concat(df)

In [None]:
df.sample(5)

We can use this lightweight sample to learn something about the whole dataset, for example to compare sector sizes.

In [None]:
df.groupby('SECTOR')['EMPLOYEES'].agg(['size', 'sum', 'mean']).astype(int).T

### persist intermediate data for later use

You can save a DataFrame as CSV, `parquet` (stay tuned) or some other storage format. You can also use the standard Python `pickle` module. Here I am using `joblib`.

In [None]:
joblib.dump(df, 'data/rand_5pct.pkl')
df.shape

Restart the kernel and import the modules again before running the next cell.

In [None]:
df = joblib.load('data/rand_5pct.pkl')
df.shape
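For comparison, the CSV and pickle options mentioned above can be sketched with pandas' own writers and readers. This is a toy round trip on a made-up frame, written to a temporary directory rather than to `data/`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for the sampled data.
df_small = pd.DataFrame({'ABI': [1, 2], 'EMPLOYEES': [5, 12]})

with tempfile.TemporaryDirectory() as tmp:
    # pickle: binary format that preserves dtypes and index as they are
    pkl_path = os.path.join(tmp, 'sample.pkl')
    df_small.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

    # CSV: plain text; dtypes are inferred again when reading
    csv_path = os.path.join(tmp, 'sample.csv')
    df_small.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.equals(df_small), from_csv.equals(df_small))
```

CSV is portable and human-readable but has to re-infer column types on every load, whereas pickle restores the object as it was saved. Only unpickle files that you trust.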

### example: size vs age