# Chunk Size Guide

## Overview

When Featuretools calculates a feature matrix, it is passed (or creates) a dataframe called `cutoff_time` that determines what data can be used to calculate each row of features.  In this example we create a `cutoff_time` dataframe for the "customers" entity, with each customer's feature values calculated at the time of that customer's final purchase.

In [27]:
import featuretools as ft
import pandas as pd
from featuretools.computational_backends import bin_cutoff_times

entityset = ft.demo.load_retail()
cutoff_time = pd.DataFrame({'time': entityset["customers"].last_time_index,
                            'instance_id': entityset["customers"].df["CustomerID"]})
cutoff_time.head()

            instance_id                time
CustomerID                                 
17850.0         17850.0 2011-02-10 14:38:00
13047.0         13047.0 2011-11-08 12:10:00
12583.0         12583.0 2011-12-07 08:07:00
13748.0         13748.0 2011-09-05 09:45:00
15100.0         15100.0 2011-01-13 17:09:00


By default `calculate_feature_matrix` will divide the `cutoff_time` dataframe into smaller `chunks` to calculate on, trying to group rows with the same time in the same chunk.  The cell below shows featuretools creating a 10 row chunk, keeping rows with the same time together.

In [3]:
from featuretools.computational_backends import get_next_chunk

chunk_iterator = get_next_chunk(cutoff_time=cutoff_time, time_variable='time', num_per_chunk=10)
print(next(chunk_iterator))

            instance_id                time
CustomerID                                 
13823.0         13823.0 2011-11-21 10:22:00
12658.0         12658.0 2011-11-21 10:22:00
14723.0         14723.0 2011-11-29 15:23:00
12670.0         12670.0 2011-11-29 15:23:00
14824.0         14824.0 2011-11-29 14:20:00
13428.0         13428.0 2011-11-29 14:20:00
15153.0         15153.0 2011-11-08 17:09:00
17022.0         17022.0 2011-11-08 17:09:00
15215.0         15215.0 2011-10-26 12:28:00
17712.0         17712.0 2011-10-26 12:28:00


If there are more rows with the same cutoff time than the chunk size allows, the rows will be placed in multiple chunks. By binning the timestamp values into 30 minute increments we get times that have more than 10 rows in `cutoff_time`. The chunking algorithm will create a chunk of 10 rows with the same time, then save the other rows to include in another chunk.

In [62]:
cutoff_time_30_minutes = bin_cutoff_times(cutoff_time.copy(), "30 minutes")
chunk_iterator = get_next_chunk(cutoff_time=cutoff_time_30_minutes, time_variable='time', num_per_chunk=10)

print("All rows with timestamp 2011-12-06 12:00\n")
print(cutoff_time_30_minutes[cutoff_time_30_minutes['time'] == pd.Timestamp("2011-12-06 12:00:00")])
print("\nFirst chunk\n")
print(next(chunk_iterator))
print("\nSecond chunk\n")
print(next(chunk_iterator))

All rows with timestamp 2011-12-06 12:00

            instance_id                time
CustomerID                                 
17346.0         17346.0 2011-12-06 12:00:00
15061.0         15061.0 2011-12-06 12:00:00
12971.0         12971.0 2011-12-06 12:00:00
16353.0         16353.0 2011-12-06 12:00:00
17744.0         17744.0 2011-12-06 12:00:00
14735.0         14735.0 2011-12-06 12:00:00
14113.0         14113.0 2011-12-06 12:00:00
18283.0         18283.0 2011-12-06 12:00:00
16979.0         16979.0 2011-12-06 12:00:00
14110.0         14110.0 2011-12-06 12:00:00
14071.0         14071.0 2011-12-06 12:00:00
17250.0         17250.0 2011-12-06 12:00:00
16620.0         16620.0 2011-12-06 12:00:00
15907.0         15907.0 2011-12-06 12:00:00
17481.0         17481.0 2011-12-06 12:00:00
13755.0         13755.0 2011-12-06 12:00:00
14121.0         14121.0 2011-12-06 12:00:00

First chunk

            instance_id                time
CustomerID                                 
17346.0         1734
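
The grouping and splitting behaviour described above can be approximated in plain pandas. The sketch below is an illustrative simplification, not the Featuretools implementation: `iter_chunks` is a hypothetical helper, the greedy packing order is an assumption, and the example data is made up.

```python
import pandas as pd


def iter_chunks(cutoff_time, time_column, num_per_chunk):
    """Yield DataFrame chunks of at most num_per_chunk rows.

    Rows that share a time stay in one chunk when they fit; a time group
    larger than num_per_chunk is cut into several pieces.
    (Hypothetical helper, not part of Featuretools.)
    """
    pending = []       # row blocks collected for the chunk being built
    pending_rows = 0
    for _, group in cutoff_time.groupby(time_column, sort=False):
        # Cut the group into pieces no larger than the chunk size.
        for start in range(0, len(group), num_per_chunk):
            piece = group.iloc[start:start + num_per_chunk]
            # Emit the current chunk if this piece would overflow it.
            if pending_rows + len(piece) > num_per_chunk:
                yield pd.concat(pending)
                pending, pending_rows = [], 0
            pending.append(piece)
            pending_rows += len(piece)
    if pending:
        yield pd.concat(pending)


# Made-up data: two small time groups and one group of 6 rows.
times = pd.to_datetime(["2011-12-06 12:00"] * 2
                       + ["2011-12-06 12:30"] * 2
                       + ["2011-12-06 13:00"] * 6)
example = pd.DataFrame({"instance_id": list(range(10)), "time": times})

chunks = list(iter_chunks(example, "time", num_per_chunk=4))
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

With a chunk size of 4, the two small groups share the first chunk, while the group of 6 rows is split across the remaining chunks.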

The size of each chunk is determined by the `chunk_size` parameter of `dfs` or `calculate_feature_matrix`.  Valid inputs are:

* A positive integer (each chunk has this many rows)

In [64]:
from featuretools.computational_backends import calc_num_per_chunk
calc_num_per_chunk(200, cutoff_time.shape)

200

* A float between 0 and 1 (each chunk is this fraction of the rows in the cutoff dataframe)

In [65]:
# Each chunk is 25% of all rows in cutoff_time
calc_num_per_chunk(0.25, cutoff_time.shape)

1093

* None (the default option. Each chunk will be 10% of the entire cutoff dataframe, or 10 rows per chunk, whichever is bigger)

In [67]:
calc_num_per_chunk(None, cutoff_time.shape)

437

* "cutoff time" 

Unlike the other options, "cutoff time" does not generate a specific number of rows per chunk.  Instead of trying to create uniformly sized chunks, Featuretools will calculate every row with the same time together.

In [69]:
calc_num_per_chunk("cutoff time", cutoff_time.shape)

'cutoff time'
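
The four options above can be summarized in one small function. This is a sketch of the rules as described in this guide, not the source of `calc_num_per_chunk`: `rows_per_chunk` is a hypothetical name, it takes a row count rather than a shape tuple, and truncating fractional results is an assumption that happens to match the outputs shown above.

```python
def rows_per_chunk(chunk_size, num_rows):
    """Resolve a chunk_size setting to a number of rows per chunk.

    Hypothetical helper that mirrors the rules listed in this guide.
    """
    if chunk_size is None:
        # Default: 10% of the rows, but never fewer than 10 rows.
        return max(int(num_rows * 0.1), 10)
    if chunk_size == "cutoff time":
        # No fixed row count: rows are grouped by cutoff time instead.
        return chunk_size
    if isinstance(chunk_size, float) and 0 < chunk_size < 1:
        # A fraction of all rows (truncation is an assumption).
        return int(num_rows * chunk_size)
    if isinstance(chunk_size, int) and chunk_size > 0:
        return chunk_size
    raise ValueError("unsupported chunk_size: %r" % (chunk_size,))


# 4373 is the number of rows in the cutoff_time dataframe used in this guide.
for setting in (200, 0.25, None, "cutoff time"):
    print(setting, "->", rows_per_chunk(setting, 4373))
```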

Choosing the right chunk size can reduce the time spent calculating a feature matrix.  Small chunks can slow down computation in two ways: splitting up rows that share the same cutoff time means less shared computation, and if a chunk takes very little time to compute, the overhead of creating it can dominate the overall running time.  Overly large chunks can slow down computation if the data needed for the calculations is too large.  What constitutes a good chunk size varies by dataset, cutoff times, and machine hardware.

### Small number of cutoff times

Here we look at the runtimes of several different chunk sizes for a feature matrix calculation with a small number of cutoff times (500) and few cutoff times sharing the same time (1.03 rows per timestamp, on average).
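
The "rows per timestamp" figure is the number of rows divided by the number of distinct cutoff times. A minimal sketch, where `mean_rows_per_cutoff_time` is a hypothetical helper and the small dataframe is made-up illustration data:

```python
import pandas as pd


def mean_rows_per_cutoff_time(cutoff_time, time_column="time"):
    """Average number of rows sharing each distinct cutoff time."""
    return len(cutoff_time) / cutoff_time[time_column].nunique()


# Made-up data: 4 rows spread over 3 distinct times.
toy_cutoffs = pd.DataFrame({
    "instance_id": [1, 2, 3, 4],
    "time": pd.to_datetime(["2011-01-01", "2011-01-01",
                            "2011-01-02", "2011-01-03"]),
})
toy_mean = mean_rows_per_cutoff_time(toy_cutoffs)
print(toy_mean)
```

A value close to 1 means there is little computation to share between rows, whichever chunk size is chosen.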

#### 10% per chunk

In [75]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time.iloc[:500])

Building features: 134it [00:00, 8343.16it/s]
Progress: 100%|██████████| 500/500 [01:19<00:00,  6.23cutoff time/s]


#### 1 chunk

In [78]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time.iloc[:500],
                                  chunk_size=500)

Building features: 134it [00:00, 6051.02it/s]
Progress: 100%|██████████| 500/500 [01:21<00:00,  6.11cutoff time/s]


#### 2 rows per chunk

In [80]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time.iloc[:500],
                                  chunk_size=2)

Building features: 134it [00:00, 4867.93it/s]
Progress: 100%|██████████| 500/500 [01:45<00:00,  4.61cutoff time/s]


#### "cutoff time" option

In [28]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time.iloc[:500],
                                  chunk_size="cutoff time")

Building features: 134it [00:00, 5268.34it/s]
Progress: 100%|██████████| 500/500 [01:44<00:00,  3.66cutoff time/s]


Since the number of cutoff times is so small and few of them share a timestamp, calculating all cutoff times in one chunk is comparable to the default 10% approach.  With a very small chunk size of 2 rows per chunk, the calculation slows down considerably.  Grouping by time is similarly slow, which seems reasonable given the average number of rows per cutoff time.
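
The relative cost of each option can be worked out from the progress bars above. In this small calculation the elapsed times are copied from the four runs in this section, and `to_seconds` is a helper introduced only for illustration:

```python
def to_seconds(elapsed):
    """Convert an 'mm:ss' progress-bar time to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + int(seconds)


# Elapsed times reported by the four 500-row runs above.
elapsed = {
    "10% per chunk": "01:19",
    "1 chunk": "01:21",
    "2 rows per chunk": "01:45",
    "cutoff time": "01:44",
}

baseline = to_seconds(elapsed["10% per chunk"])
slowdown = {name: round(to_seconds(value) / baseline, 2)
            for name, value in elapsed.items()}
print(slowdown)
```

On these single runs, both the 2-row chunks and the "cutoff time" option take roughly a third longer than the default.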

### All cutoff times

Using all of the cutoff times does not change how few rows share the same cutoff time.  The default chunk size of 10% and calculating all of the cutoff times in a single chunk still take about the same amount of time.  Using a chunk size of 2 is noticeably slower, and grouping by cutoff time even more so.

#### 10% per chunk

In [82]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time)

Building features: 134it [00:00, 7222.17it/s]
Progress: 100%|██████████| 4373/4373 [10:22<00:00,  6.60cutoff time/s]


#### 1 chunk

In [84]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  chunk_size=4373)

Building features: 134it [00:00, 6128.48it/s]
Progress: 100%|██████████| 4373/4373 [10:25<00:00,  6.99cutoff time/s]


#### 2 rows per chunk

In [86]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  chunk_size=2)

Building features: 134it [00:00, 5178.11it/s]
Progress: 100%|██████████| 4373/4373 [14:03<00:00,  4.00cutoff time/s]


#### "cutoff time" option

In [29]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  chunk_size="cutoff time")

Building features: 134it [00:00, 5697.74it/s]
Progress: 100%|██████████| 4373/4373 [19:18<00:00,  2.61cutoff time/s]


### Fewer unique cutoff times

Next we bin the timestamps in the cutoff time dataframe by week, reducing the number of unique cutoff times from over 4000 to 55.  This speeds up the overall computation because more of the work can be shared between rows.  It also causes a more noticeable drop in performance when the chunk size is much smaller than the number of instances per unique cutoff time.

In [14]:
binned_cutoffs = bin_cutoff_times(cutoff_time.copy(), '7 days')
grouped = binned_cutoffs.groupby('time').count()
print("Number of unique cutoff times: " + str(len(grouped)))
max_cutoff_size = grouped.max()[0]
print("Largest cutoff group: %d instances" % max_cutoff_size)
max_cutoff_percent = float(max_cutoff_size)/binned_cutoffs.shape[0]
print("Largest cutoff percent of whole: %f" % (max_cutoff_percent))

Number of unique cutoff times: 55
Largest cutoff group: 532 instances
Largest cutoff percent of whole: 0.121656


In [100]:
binned_cutoffs.head()

            instance_id       time
CustomerID                        
17850.0         17850.0 2011-02-10
13047.0         13047.0 2011-11-03
12583.0         12583.0 2011-12-01
13748.0         13748.0 2011-09-01
15100.0         15100.0 2011-01-13
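
The binned dates above are consistent with rounding each timestamp down to a multiple of the bin width, counted from the Unix epoch. A minimal pandas sketch of that idea, assuming this is what the binning amounts to (it uses `Series.dt.floor`, not `bin_cutoff_times` itself):

```python
import pandas as pd

# Three of the original cutoff times shown at the top of this guide.
cutoffs = pd.Series(pd.to_datetime([
    "2011-02-10 14:38:00",
    "2011-11-08 12:10:00",
    "2011-12-07 08:07:00",
]))

# Round down to 7-day and 30-minute bins.
weekly = cutoffs.dt.floor("7D")
half_hour = cutoffs.dt.floor("30min")

print(weekly.dt.strftime("%Y-%m-%d").tolist())
print(half_hour.dt.strftime("%H:%M").tolist())
```

Because the bins are counted from 1970-01-01, a Thursday, each 7-day bin starts on a Thursday rather than on a calendar-week boundary.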


#### 10% per chunk

In [6]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  verbose=True)

Building features: 134it [00:00, 4504.37it/s]
Progress: 100%|██████████| 4373/4373 [00:54<00:00, 95.12cutoff time/s]


#### 1 chunk

In [7]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  chunk_size=4373,
                                  verbose=True)

Building features: 134it [00:00, 7367.11it/s]
Progress: 100%|██████████| 4373/4373 [00:53<00:00, 82.29cutoff time/s]


#### Mean cutoff time group size

Next we try a new chunk size: the average cutoff time group size, about 80 rows, or 1.829% of the entire list of cutoff times.

In [20]:
binned_cutoffs.groupby('time').count().mean()

instance_id    79.490909
dtype: float64

In [34]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  chunk_size=80,
                                  verbose=True)

Building features: 134it [00:00, 5199.23it/s]
Progress: 100%|██████████| 4373/4373 [00:57<00:00, 93.67cutoff time/s] 


#### 10 cutoffs per chunk 

In [16]:
# Chunk size of 10 rows (much smaller than the average cutoff time group)
feature_matrix, features = ft.dfs(entityset=entityset,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  chunk_size=10,
                                  verbose=True)

Building features: 134it [00:00, 7815.29it/s]
Progress: 100%|██████████| 4373/4373 [03:52<00:00, 28.70cutoff time/s]


#### "cutoff time" option

In [30]:
feature_matrix, features = ft.dfs(entityset=entityset,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  chunk_size="cutoff time",
                                  verbose=True)

Building features: 134it [00:00, 6469.41it/s]
Progress: 100%|██████████| 4373/4373 [00:55<00:00, 47.60cutoff time/s] 


### Instacart Dataset

Next we'll try some examples using the Instacart dataset used in the Predict Next Purchase demo.

In [22]:
import utils

# import entityset
es = utils.load_entityset("partitioned_data/part_1/")

# make labels and cutoff times
label_times = pd.concat([utils.make_labels(es=es,
                                           product_name = "Banana",
                                           cutoff_time = pd.Timestamp('March 15, 2015'),
                                           prediction_window = ft.Timedelta("4 weeks"),
                                           training_window = ft.Timedelta("60 days")),
                         utils.make_labels(es=es,
                                           product_name = "Banana",
                                           cutoff_time = pd.Timestamp('June 15, 2015'),
                                           prediction_window = ft.Timedelta("4 weeks"),
                                           training_window = ft.Timedelta("60 days")),
                         utils.make_labels(es=es,
                                           product_name = "Banana",
                                           cutoff_time = pd.Timestamp('September 15, 2015'),
                                           prediction_window = ft.Timedelta("4 weeks"),
                                           training_window = ft.Timedelta("60 days")),
                         utils.make_labels(es=es,
                                           product_name = "Banana",
                                           cutoff_time = pd.Timestamp('December 15, 2015'),
                                           prediction_window = ft.Timedelta("4 weeks"),
                                           training_window = ft.Timedelta("60 days")),],
                       ignore_index=True)

print(label_times.groupby('time').count())

            user_id  label
time                      
2015-03-15     7790   7790
2015-06-15     4458   4458
2015-09-15     2515   2515
2015-12-15     1091   1091


One cutoff time group accounts for nearly half of all rows in the cutoff time dataframe.  Comparing the single-chunk and "cutoff time" cases with the default 10% chunk size shows that, in this scenario, keeping the entire cutoff time group together slows down the computation.
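
The "nearly half" figure follows directly from the group counts printed above. As a small worked calculation (the counts are copied from that output; the variable names are illustrative):

```python
import pandas as pd

# Rows per cutoff time, as printed by label_times.groupby('time').count().
group_counts = pd.Series(
    [7790, 4458, 2515, 1091],
    index=pd.to_datetime(["2015-03-15", "2015-06-15",
                          "2015-09-15", "2015-12-15"]),
)

total_rows = int(group_counts.sum())
shares = group_counts / total_rows   # fraction of all rows per cutoff time
largest_share = float(shares.max())

print(total_rows)
print(round(largest_share, 4))
```

The largest group holds about 49% of the 15854 rows, so it is several times larger than a default chunk of 10% of the rows.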

#### 10% per chunk

In [23]:
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=es,
                                  verbose=True)

Building features: 120it [00:00, 3605.88it/s]
Progress: 100%|██████████| 15854/15854 [03:21<00:00, 71.49cutoff time/s]


#### 1 chunk

In [24]:
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=es,
                                  chunk_size=15854,
                                  verbose=True)

Building features: 120it [00:00, 3105.82it/s]
Progress: 100%|██████████| 15854/15854 [04:26<00:00, 59.55cutoff time/s]


#### 250 per chunk

In [25]:
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=es,
                                  chunk_size=250,
                                  verbose=True)

Building features: 120it [00:00, 2267.82it/s]
Progress: 100%|██████████| 15854/15854 [03:14<00:00, 64.78cutoff time/s]


#### 100 per chunk

In [37]:
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=es,
                                  chunk_size=100,
                                  verbose=True)

Building features: 120it [00:00, 3566.43it/s]
Progress: 100%|██████████| 15854/15854 [03:41<00:00, 56.09cutoff time/s]


#### "cutoff time" option

In [26]:
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=es,
                                  chunk_size="cutoff time",
                                  verbose=True)

Building features: 120it [00:00, 2912.12it/s]
Progress: 100%|██████████| 15854/15854 [04:22<00:00, 60.86cutoff time/s]


### Summary

In review, Featuretools uses the `chunk_size` parameter to divide up the instances when calculating features.  Chunks that contain too little or too much calculation can slow things down, so experimenting with different chunk sizes on a subset of the data before calculating on the entire dataset can help in finding a reasonable chunk size.
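
One way to run such an experiment is a small timing loop over candidate chunk sizes. The sketch below is generic and makes several assumptions: `time_chunk_sizes` is a hypothetical helper, and `fake_run` is a stand-in for a function that would call `ft.dfs` on a subset such as `cutoff_time.iloc[:500]` with the given `chunk_size`.

```python
import time


def time_chunk_sizes(run, candidates):
    """Time run(chunk_size) once for each candidate chunk size.

    Returns a dict mapping chunk size to elapsed seconds.
    (Hypothetical helper; a single run per candidate is only a rough guide.)
    """
    timings = {}
    for chunk_size in candidates:
        start = time.perf_counter()
        run(chunk_size)
        timings[chunk_size] = time.perf_counter() - start
    return timings


def fake_run(chunk_size):
    # Stand-in workload; replace with a feature matrix calculation
    # on a subset of the cutoff times.
    return sum(range(1000))


candidates = [None, 2, 0.25, "cutoff time"]
timings = time_chunk_sizes(fake_run, candidates)
for chunk_size in candidates:
    print(chunk_size, "%.6f s" % timings[chunk_size])
```

Timings from a subset will not scale perfectly to the full data, but they can rule out clearly poor choices before a long run.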