# Chunk Size Guide

## Overview

When Featuretools calculates a feature matrix, it is passed or creates a dataframe called `cutoff_time` that determines what data can be used to calculate each row of features.  By default, `calculate_feature_matrix` divides the `cutoff_time` dataframe into smaller chunks to calculate on, trying to group rows with the same time in the same chunk.  If there are more rows with the same cutoff time than the chunk size allows, the rows will be placed in multiple chunks.
The size of each chunk is determined by the `chunk_size` parameter in `dfs` or `calculate_feature_matrix`.  Valid inputs are:

* None (the default option; each chunk will be 10% of the entire cutoff dataframe, or 10 rows per chunk, whichever is bigger)
* A positive integer (each chunk has this many rows)
* A float between 0 and 1 (each chunk is a percentage of the entire cutoff dataframe)
* The string "cutoff time" 


Unlike the other options, "cutoff time" does not generate a specific number of rows per chunk.  Instead of trying to create uniformly sized chunks, Featuretools will calculate every row with the same time together.
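The options listed above can be summarized as a small lookup. The following is a minimal sketch rather than Featuretools' own code: `resolve_chunk_size` is a hypothetical helper, and rounding fractional sizes down to whole rows is an assumption.

```python
def resolve_chunk_size(chunk_size, num_rows):
    """Return rows per chunk, or None when chunks follow cutoff time groups.

    Hypothetical helper for illustration only; rounding down is an assumption.
    """
    if chunk_size is None:
        # Default: 10% of the rows, but never fewer than 10 rows per chunk.
        return max(num_rows // 10, 10)
    if chunk_size == "cutoff time":
        # No fixed size: rows that share a cutoff time are calculated together.
        return None
    if isinstance(chunk_size, float) and 0 < chunk_size < 1:
        # A float is a fraction of the whole cutoff time dataframe.
        return max(int(num_rows * chunk_size), 1)
    if isinstance(chunk_size, int) and chunk_size > 0:
        # A positive integer is used directly as the number of rows.
        return chunk_size
    raise ValueError("unsupported chunk_size: %r" % (chunk_size,))


print(resolve_chunk_size(None, 500))   # 10% of 500 rows
print(resolve_chunk_size(None, 50))    # falls back to the 10-row minimum
print(resolve_chunk_size(0.25, 400))   # a quarter of 400 rows
```

Returning `None` for "cutoff time" keeps the "no fixed size" case distinct from the numeric options.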

Choosing the right chunk size can reduce the time spent calculating a feature matrix.  Small chunks can slow down computation in two ways.  First, splitting up rows that share the same cutoff time means less shared computation.  Second, if the time necessary to compute a chunk is too short, the overhead of creating that chunk can dominate the overall running time of the calculation.  Overly large chunks can also slow down computation if the data necessary for the calculations is too large.  What constitutes a good chunk size varies by dataset, cutoff time, and machine hardware.
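To make the grouping behavior concrete, here is a minimal sketch of one way to build chunks that keep rows with the same cutoff time together and split a group only when it exceeds the chunk size. The function `chunk_by_cutoff_time` and its greedy strategy are illustrative assumptions, not the algorithm Featuretools uses internally.

```python
from collections import defaultdict


def chunk_by_cutoff_time(rows, chunk_size):
    """Greedily pack (instance_id, time) rows into chunks of at most chunk_size.

    Hypothetical illustration: rows sharing a time stay together when they fit,
    and a group larger than chunk_size is split across several chunks.
    """
    groups = defaultdict(list)
    for instance_id, time in rows:
        groups[time].append((instance_id, time))

    chunks = []
    current = []
    for time in sorted(groups):
        group = groups[time]
        # Cut an oversized group into pieces no larger than chunk_size.
        for start in range(0, len(group), chunk_size):
            piece = group[start:start + chunk_size]
            # Start a new chunk when the piece does not fit in the current one.
            if len(current) + len(piece) > chunk_size:
                chunks.append(current)
                current = []
            current.extend(piece)
    if current:
        chunks.append(current)
    return chunks


rows = [(1, "2011-01-01"), (2, "2011-01-01"), (3, "2011-01-01"),
        (4, "2011-01-02"), (5, "2011-01-03")]
chunks = chunk_by_cutoff_time(rows, chunk_size=2)
print([len(chunk) for chunk in chunks])
```

In this toy input the first timestamp has three rows, so with a chunk size of 2 its rows cannot all share one chunk, which is the "less shared computation" case described above.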

### Small number of cutoff times

Here we look at the runtimes of several different chunk sizes for a feature matrix calculation with a small number of cutoff times (500) and few rows sharing the same cutoff time (1.03 rows per timestamp, on average).

In [1]:
import time
import datetime

import featuretools as ft
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from featuretools.computational_backends import bin_cutoff_times

retail_es = ft.demo.load_retail()
cutoff_time = pd.DataFrame({'time': retail_es["customers"].last_time_index,
                            'instance_id': retail_es["customers"].df["CustomerID"]})
output_notebook()

#### Default (10% per chunk)

In [31]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time.iloc[:500])
stop = time.time()
default_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 6378.52it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 10
Elapsed: 01:20 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 10/10 chunks


There are more than 100 rows in `cutoff_time`, so the default chunk size is 10% of the rows: 50 rows per chunk, for 10 chunks.

#### 1 chunk

In [32]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time.iloc[:500],
                                  chunk_size=500)
stop = time.time()
one_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 7598.99it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 1
Elapsed: 01:29 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 1/1 chunks


The run time is similar with just 1 chunk.  Because the dataset is small and the timestamps are mostly unique, computation does not slow down with fewer chunks.

#### 2 rows per chunk

In [33]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time.iloc[:500],
                                  chunk_size=2)
stop = time.time()
two_per_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 8635.16it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 250
Elapsed: 01:47 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 250/250 chunks


With 2 rows per chunk, the calculation slows somewhat because of the overhead of calculating each chunk.

#### "cutoff time" option

In [34]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time.iloc[:500],
                                  chunk_size="cutoff time")
stop = time.time()
cutoff_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 5383.65it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 493
Elapsed: 02:16 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 493/493 chunks


With a mean of 1.03 rows per timestamp, the "cutoff time" option creates 493 chunks for 500 rows, so it is similarly slow.

In [35]:
bar_names = ['10%', '100%', '2', 'cutoff time']
bar_values = [default_duration, one_chunk_duration, two_per_chunk_duration, cutoff_duration]
p = figure(y_range=bar_names, x_axis_type='datetime', plot_width=600, plot_height=400)
p.hbar(y=bar_names, height=0.5, left=0,
       right=bar_values, color="navy")

show(p)

### All cutoff times

Using all of the cutoff times does not change how few rows have the same cutoff time.  The default chunk size of 10% and calculating all of the cutoff times in a single chunk still take about the same amount of time.  Using a chunk size of 2 is noticeably slower, and grouping by cutoff time even more so.

#### 10% per chunk

In [7]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time)
stop = time.time()
default_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 6367.89it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 11
Elapsed: 10:45 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 11/11 chunks


#### 1 chunk

In [8]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  chunk_size=4373)
stop = time.time()
one_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 8910.75it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 1
Elapsed: 11:08 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 1/1 chunks


#### 2 rows per chunk

In [9]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  chunk_size=2)
stop = time.time()
two_per_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 6565.70it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 2187
Elapsed: 14:26 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 2187/2187 chunks


#### "cutoff time" option

In [10]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  chunk_size="cutoff time")
stop = time.time()
cutoff_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 7888.35it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 4240
Elapsed: 21:22 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 4240/4240 chunks


In [11]:
bar_names = ['10%', '100%', '2', 'cutoff time']
bar_values = [default_duration, one_chunk_duration, two_per_chunk_duration, cutoff_duration]
p = figure(y_range=bar_names, x_axis_type='datetime', plot_width=600, plot_height=400)
p.hbar(y=bar_names, height=0.5, left=0,
       right=bar_values, color="navy")

show(p)

### Fewer unique cutoff times

Next we group the timestamps in the cutoff time dataframe by week to reduce the number of unique cutoff times to 55 from over 4000.  This speeds up the overall computation because there is more shareable computation when calculating.  This also causes a more noticeable drop in performance when choosing chunk sizes much smaller than the number of unique cutoff times.
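As a rough picture of what grouping timestamps by week means, the sketch below floors each timestamp to the start of a 7-day bin counted from the Unix epoch. The helper `floor_to_bin` is hypothetical, and flooring from the epoch is an assumption made for illustration; `bin_cutoff_times` is the function that defines what Featuretools actually does.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def floor_to_bin(timestamp, bin_size=timedelta(days=7)):
    """Round a timestamp down to the start of its bin (hypothetical helper)."""
    # Whole number of bins elapsed since the epoch.
    bins_elapsed = (timestamp - EPOCH) // bin_size
    return EPOCH + bins_elapsed * bin_size


raw = [datetime(2011, 2, 12, 10, 30),
       datetime(2011, 2, 16, 23, 59),
       datetime(2011, 2, 17, 8, 0)]
binned = [floor_to_bin(t) for t in raw]

# Three distinct raw timestamps collapse into two shared cutoff times.
print(len(set(raw)), "->", len(set(binned)))
```

Rows that land in the same bin share a cutoff time, which is what creates the extra shareable computation.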

In [12]:
binned_cutoffs = bin_cutoff_times(cutoff_time.copy(), '7 days')
grouped = binned_cutoffs.groupby('time').count()
print("Number of unique cutoff times: " + str(len(grouped)))
max_cutoff_size = grouped.max().iloc[0]  # positional access: size of the largest cutoff group
print("Largest cutoff group: %d instances" % max_cutoff_size)
max_cutoff_percent = float(max_cutoff_size)/binned_cutoffs.shape[0]
print("Largest cutoff percent of whole: %f" % (max_cutoff_percent))

Number of unique cutoff times: 55
Largest cutoff group: 532 instances
Largest cutoff percent of whole: 0.121656


In [13]:
binned_cutoffs.head()

| CustomerID | instance_id | time |
|---|---|---|
| 17850.0 | 17850.0 | 2011-02-10 |
| 13047.0 | 13047.0 | 2011-11-03 |
| 12583.0 | 12583.0 | 2011-12-01 |
| 13748.0 | 13748.0 | 2011-09-01 |
| 15100.0 | 15100.0 | 2011-01-13 |


#### 10% per chunk

In [14]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  target_entity="customers",
                                  verbose=True,
                                  cutoff_time=binned_cutoffs)
stop = time.time()
default_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 5547.69it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 11
Elapsed: 00:54 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 11/11 chunks


#### 1 chunk

In [15]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  chunk_size=4373,
                                  verbose=True)
stop = time.time()
one_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 4124.22it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 1
Elapsed: 00:52 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 1/1 chunks


#### Mean cutoff time group size

Next we try a new chunk size: the average cutoff time group size, which is about 80 rows, or about 1.829% of the entire list of cutoff times.

In [16]:
binned_cutoffs.groupby('time').count().mean()

instance_id    79.490909
dtype: float64

In [17]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  chunk_size=80,
                                  verbose=True)
stop = time.time()
mean_group_size_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 6724.21it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 57
Elapsed: 01:00 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 57/57 chunks


#### 10 rows per chunk

In [18]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  chunk_size=10,
                                  verbose=True)
stop = time.time()
ten_per_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 8444.70it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 440
Elapsed: 02:02 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 440/440 chunks


#### "cutoff time" option

In [19]:
start = time.time()
feature_matrix, features = ft.dfs(entityset=retail_es,
                                  cutoff_time=binned_cutoffs,
                                  target_entity="customers",
                                  chunk_size="cutoff time",
                                  verbose=True)
stop = time.time()
cutoff_duration = datetime.timedelta(seconds=stop-start)

Building features: 134it [00:00, 5416.75it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 55
Elapsed: 00:57 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 55/55 chunks


In [20]:
bar_names = ['10%', '100%', 'mean group size', '10', 'cutoff time']
bar_values = [default_duration, one_chunk_duration, mean_group_size_duration, ten_per_chunk_duration, cutoff_duration]
p = figure(y_range=bar_names, x_axis_type='datetime', plot_width=600, plot_height=500)
p.hbar(y=bar_names, height=0.5, left=0,
       right=bar_values, color="navy")

show(p)

### Instacart Dataset

Next we'll try some examples using the Instacart dataset used in the Predict Next Purchase demo.

In [21]:
import utils

# import entityset
instacart_es = utils.load_entityset("partitioned_data/part_1/")

# make labels and cutoff times
label_times = pd.concat([utils.make_labels(es=instacart_es,
                                           product_name = "Banana",
                                           cutoff_time = pd.Timestamp('March 15, 2015'),
                                           prediction_window = ft.Timedelta("4 weeks"),
                                           training_window = ft.Timedelta("60 days")),
                         utils.make_labels(es=instacart_es,
                                           product_name = "Banana",
                                           cutoff_time = pd.Timestamp('June 15, 2015'),
                                           prediction_window = ft.Timedelta("4 weeks"),
                                           training_window = ft.Timedelta("60 days")),
                         utils.make_labels(es=instacart_es,
                                           product_name = "Banana",
                                           cutoff_time = pd.Timestamp('September 15, 2015'),
                                           prediction_window = ft.Timedelta("4 weeks"),
                                           training_window = ft.Timedelta("60 days")),
                         utils.make_labels(es=instacart_es,
                                           product_name = "Banana",
                                           cutoff_time = pd.Timestamp('December 15, 2015'),
                                           prediction_window = ft.Timedelta("4 weeks"),
                                           training_window = ft.Timedelta("60 days")),],
                       ignore_index=True)

print(label_times.groupby('time').count())

            user_id  label
time                      
2015-03-15     7790   7790
2015-06-15     4458   4458
2015-09-15     2515   2515
2015-12-15     1091   1091


One cutoff time group accounts for nearly half of all rows of the cutoff time dataframe.  As we can see by comparing the single chunk and "cutoff time" cases with the default 10% chunk size, in this scenario keeping the entire cutoff time group together slows down the computation.
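The "nearly half" figure follows directly from the group counts printed above. The sketch below redoes that arithmetic and estimates how many default-sized chunks the largest group would span, assuming the default chunk is 10% of the rows rounded down (an assumption, not something confirmed here).

```python
import math

# Rows per cutoff time, copied from the groupby output above.
group_sizes = {
    "2015-03-15": 7790,
    "2015-06-15": 4458,
    "2015-09-15": 2515,
    "2015-12-15": 1091,
}

total_rows = sum(group_sizes.values())
largest_time, largest_size = max(group_sizes.items(), key=lambda item: item[1])
largest_share = largest_size / total_rows

# Assumed default: 10% of all rows, rounded down, with a 10-row minimum.
default_rows = max(total_rows // 10, 10)
chunks_for_largest = math.ceil(largest_size / default_rows)

print("Largest group: %s (%.1f%% of %d rows)"
      % (largest_time, 100 * largest_share, total_rows))
print("Default-sized chunks needed for it: %d" % chunks_for_largest)
```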

#### 10% per chunk

In [22]:
start = time.time()
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=instacart_es,
                                  verbose=True)
stop = time.time()
default_duration = datetime.timedelta(seconds=stop-start)

Building features: 120it [00:00, 3604.80it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 11
Elapsed: 03:27 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 11/11 chunks


#### 1 chunk

In [23]:
start = time.time()
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=instacart_es,
                                  chunk_size=15854,
                                  verbose=True)
stop = time.time()
one_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 120it [00:00, 3745.78it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 1
Elapsed: 05:00 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 1/1 chunks


#### 250 per chunk

In [24]:
start = time.time()
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=instacart_es,
                                  chunk_size=250,
                                  verbose=True)
stop = time.time()
two_fifty_per_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 120it [00:00, 3858.78it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 64
Elapsed: 03:39 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 64/64 chunks


#### 100 per chunk

In [25]:
start = time.time()
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=instacart_es,
                                  chunk_size=100,
                                  verbose=True)
stop = time.time()
one_hundred_per_chunk_duration = datetime.timedelta(seconds=stop-start)

Building features: 120it [00:00, 3438.68it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 159
Elapsed: 03:59 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 159/159 chunks


#### "cutoff time" option

In [26]:
start = time.time()
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"),
                                  entityset=instacart_es,
                                  chunk_size="cutoff time",
                                  verbose=True)
stop = time.time()
cutoff_duration = datetime.timedelta(seconds=stop-start)

Building features: 120it [00:00, 3838.42it/s]
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████||, Chunks created: 4
Elapsed: 04:32 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 4/4 chunks


In [28]:
bar_names = ['10%', '100%', '250', '100', 'cutoff time']
bar_values = [default_duration, one_chunk_duration, two_fifty_per_chunk_duration, one_hundred_per_chunk_duration, cutoff_duration]
p = figure(y_range=bar_names, x_axis_type='datetime', plot_width=600, plot_height=500)
p.hbar(y=bar_names, height=0.5, left=0,
       right=bar_values, color="navy")

show(p)

### Summary

In review, Featuretools uses the `chunk_size` parameter to divide up the instances when calculating features.  Chunks that contain too little or too much calculation can slow down the overall run, so experimenting with different chunk sizes on a subset of the data before calculating on the entire dataset can help in finding a reasonable chunk size.
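The timing pattern repeated throughout this guide can be wrapped in a small loop. This is a minimal sketch under stated assumptions: `benchmark_chunk_sizes` is a hypothetical helper, and `fake_compute` is a stand-in for a real call such as `ft.dfs(..., cutoff_time=cutoff_time.iloc[:500], chunk_size=chunk_size)` on a subset of the data.

```python
import time


def benchmark_chunk_sizes(compute, chunk_sizes):
    """Time compute(chunk_size) once per candidate and return seconds per size."""
    timings = {}
    for chunk_size in chunk_sizes:
        start = time.perf_counter()
        compute(chunk_size)
        timings[chunk_size] = time.perf_counter() - start
    return timings


def fake_compute(chunk_size):
    # Stand-in workload; replace with the real feature matrix calculation.
    sum(range(1000))


candidates = [None, 2, 0.5, "cutoff time"]
timings = benchmark_chunk_sizes(fake_compute, candidates)
fastest = min(timings, key=timings.get)
print("Fastest chunk_size on this subset: %r" % (fastest,))
```

A single run per candidate is noisy, so repeating the measurement and checking that the ranking holds on a larger subset is a reasonable precaution before committing to a value.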