# Retail Data Example

Below is a demo applying automated feature engineering to a retail dataset.

In [1]:
import featuretools as ft
import pandas as pd

## Prepare data

We load the data from a CSV file hosted on Amazon S3. The original dataset is available for download [here](http://archive.ics.uci.edu/ml/datasets/online+retail).

We then break the file up into several entities:

* **item_purchases**: items in each invoice
* **items**: items and associated descriptions
* **invoices**: invoices placed 
* **customers**: customers who placed invoices
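
To make the split concrete, here is a minimal plain-Python sketch of what normalizing a flat table into parent entities amounts to. It is not how featuretools implements `normalize_entity`; the `normalize` helper and the three sample rows are hypothetical, and the sketch omits the derived time columns (such as `first_invoices_time`) that featuretools adds to the new entities.

```python
# Hypothetical sample rows standing in for the flat retail CSV.
rows = [
    {"InvoiceNo": "A1", "StockCode": "S1", "Description": "mug", "CustomerID": 1, "Country": "UK"},
    {"InvoiceNo": "A1", "StockCode": "S2", "Description": "lamp", "CustomerID": 1, "Country": "UK"},
    {"InvoiceNo": "A2", "StockCode": "S1", "Description": "mug", "CustomerID": 2, "Country": "France"},
]


def normalize(rows, index, additional):
    """Move the `additional` columns into a new parent table keyed by `index`.

    Returns (parent_rows, child_rows). The child keeps `index` as a foreign key.
    The first value seen for each key wins.
    """
    parent = {}
    children = []
    for row in rows:
        key = row[index]
        parent.setdefault(key, {index: key, **{c: row[c] for c in additional}})
        children.append({k: v for k, v in row.items() if k not in additional})
    return list(parent.values()), children


# Same three steps as the notebook: items, then invoices, then customers.
items, purchases = normalize(rows, "StockCode", ["Description"])
invoices, purchases = normalize(purchases, "InvoiceNo", ["CustomerID", "Country"])
customers, invoices = normalize(invoices, "CustomerID", ["Country"])
```

After the three calls, `purchases` holds only the two foreign keys, and each parent table holds one row per unique key.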

In [2]:
es = ft.EntitySet("retail")
data = pd.read_csv("s3://featuretools-static/uk_online_retail.csv")

# Base entity: one row per purchased item, with a generated integer index
es.entity_from_dataframe("item_purchases",
                   dataframe=data,
                   index="item_purchase_id",
                   make_index=True,
                   time_index="InvoiceDate")

# items: one row per StockCode, carrying its Description
es.normalize_entity(new_entity_id="items",
                    base_entity_id="item_purchases",
                    index="StockCode",
                    additional_variables=["Description"])

# invoices: one row per InvoiceNo, carrying the customer and country
es.normalize_entity(new_entity_id="invoices",
                    base_entity_id="item_purchases",
                    index="InvoiceNo",
                    additional_variables=["CustomerID","Country"])

# customers: one row per CustomerID, split out of invoices
es.normalize_entity(new_entity_id="customers",
                    base_entity_id="invoices",
                    index="CustomerID",
                    additional_variables=["Country"])

Entityset: retail
  Entities:
    item_purchases (shape = [541909, 6])
    items (shape = [4070, 3])
    invoices (shape = [25900, 3])
    customers (shape = [4373, 3])
  Relationships:
    item_purchases.StockCode -> items.StockCode
    item_purchases.InvoiceNo -> invoices.InvoiceNo
    invoices.CustomerID -> customers.CustomerID

## Run Deep Feature Synthesis

The input to DFS is a set of entities and a list of relationships (both defined by our EntitySet), plus the "target_entity" to calculate features for. We can also supply "cutoff times" to specify the point in time at which each row's features are calculated; here we use one year after each customer's first invoice.
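
As a rough illustration of what a cutoff time means, the sketch below computes a per-customer cutoff (first invoice plus 365 days) and then aggregates only the invoices up to that cutoff. It is plain Python on made-up data, not featuretools code, and treating the boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical invoices: (customer_id, invoice_time, amount)
invoices = [
    (1, datetime(2010, 12, 1, 8, 26), 10.0),
    (1, datetime(2011, 6, 1, 9, 0), 20.0),
    (1, datetime(2012, 1, 15, 9, 0), 40.0),  # falls after customer 1's cutoff
    (2, datetime(2011, 3, 1, 12, 0), 5.0),
]

# Cutoff for each customer = time of first invoice + 365 days
first_seen = {}
for cid, t, _ in invoices:
    first_seen[cid] = min(t, first_seen.get(cid, t))
cutoffs = {cid: t + timedelta(days=365) for cid, t in first_seen.items()}

# Aggregate using only data at or before each customer's cutoff
totals = {cid: 0.0 for cid in cutoffs}
for cid, t, amount in invoices:
    if t <= cutoffs[cid]:
        totals[cid] += amount
```

Here customer 1's third invoice is ignored because it happened after that customer's cutoff, so it cannot leak into the features.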

The output of DFS is a feature matrix and the corresponding list of feature definitions.
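
Feature definitions produced by DFS are stacked aggregations across relationships. The sketch below computes one such definition by hand, `MEAN(invoices.SUM(item_purchases.Quantity))`, on made-up data. It only illustrates the idea of stacking; it is not the featuretools implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchases: (customer_id, invoice_no, quantity)
purchases = [
    (1, "A1", 2), (1, "A1", 4),
    (1, "A2", 10),
    (2, "B1", 3),
]

# Depth 1: SUM(item_purchases.Quantity) for each invoice
invoice_sum = defaultdict(int)
invoice_owner = {}
for cid, inv, qty in purchases:
    invoice_sum[inv] += qty
    invoice_owner[inv] = cid

# Depth 2: MEAN(invoices.SUM(item_purchases.Quantity)) for each customer
per_customer = defaultdict(list)
for inv, total in invoice_sum.items():
    per_customer[invoice_owner[inv]].append(total)
feature = {cid: mean(vals) for cid, vals in per_customer.items()}
```

Customer 1 has invoice totals of 6 and 10, so the stacked feature is their mean, 8.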

In [3]:
# One cutoff time per customer: the time of their first invoice plus one year
cutoff_times = es["customers"].df[["CustomerID", "first_invoices_time"]].rename(
    columns={"CustomerID": "instance_id", "first_invoices_time": "time"})
cutoff_times["time"] = cutoff_times["time"] + pd.Timedelta("365 days")
cutoff_times.head(3)

| CustomerID | instance_id | time |
|---|---|---|
| 17850.0 | 17850.0 | 2011-12-01 08:26:00 |
| 13047.0 | 13047.0 | 2011-12-01 08:34:00 |
| 12583.0 | 12583.0 | 2011-12-01 08:45:00 |


In [4]:
feature_matrix, features = ft.dfs(entityset=es, target_entity="customers",
                                  cutoff_time=cutoff_times.sample(100),
                                  agg_primitives=["avg_time_between", "mean", "sum", "count"],
                                  trans_primitives=["day"], max_depth=5, verbose=True)

Built 49 features
Elapsed: 00:32 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 10/10 chunks


In [5]:
feature_matrix.sample(3)

| CustomerID | Country | AVG_TIME_BETWEEN(invoices.first_item_purchases_time) | COUNT(invoices) | AVG_TIME_BETWEEN(item_purchases.InvoiceDate) | MEAN(item_purchases.Quantity) | MEAN(item_purchases.UnitPrice) | SUM(item_purchases.Quantity) | SUM(item_purchases.UnitPrice) | COUNT(item_purchases) | DAY(first_invoices_time) | ... | MEAN(invoices.SUM(item_purchases.items.MEAN(item_purchases.UnitPrice))) | MEAN(invoices.SUM(item_purchases.items.SUM(item_purchases.Quantity))) | MEAN(invoices.SUM(item_purchases.items.SUM(item_purchases.UnitPrice))) | MEAN(invoices.SUM(item_purchases.items.COUNT(item_purchases))) | SUM(invoices.MEAN(item_purchases.items.AVG_TIME_BETWEEN(item_purchases.InvoiceDate))) | SUM(invoices.MEAN(item_purchases.items.MEAN(item_purchases.Quantity))) | SUM(invoices.MEAN(item_purchases.items.MEAN(item_purchases.UnitPrice))) | SUM(invoices.MEAN(item_purchases.items.SUM(item_purchases.Quantity))) | SUM(invoices.MEAN(item_purchases.items.SUM(item_purchases.UnitPrice))) | SUM(invoices.MEAN(item_purchases.items.COUNT(item_purchases))) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14501.0 | United Kingdom | 11033160.0 | 3 | 2006029.0 | 9.916667 | 2.054167 | 119 | 24.65 | 12 | 24 | ... | 9.885158 | 39723.0 | 7623.166667 | 2878.0 | 239338.687868 | 34.948782 | 7.246728 | 25785.916667 | 5221.5525 | 1958.0 |
| 17228.0 | United Kingdom | 4403520.0 | 8 | 148911.3 | 7.365385 | 1.726635 | 1532 | 359.14 | 208 | 2 | ... | 52.217376 | 167893.25 | 25703.38125 | 12585.875 | 994419.433878 | 101.769726 | 16.010187 | 54517.444056 | 8106.948946 | 3990.390675 |
| 14336.0 | United Kingdom | 7481840.0 | 4 | 252196.9 | 19.544444 | 1.615111 | 1759 | 145.36 | 90 | 8 | ... | 45.692224 | 126747.75 | 23279.6025 | 9373.75 | 775827.050852 | 51.578943 | 8.393304 | 22146.207443 | 4006.740658 | 1598.204632 |
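
The `AVG_TIME_BETWEEN` columns are easier to read once the unit is known. The sketch below assumes the primitive reports the mean gap between consecutive events in seconds, which should be checked against the documentation for your featuretools version. The `avg_time_between` helper is a hypothetical plain-Python stand-in, not the library's code.

```python
from datetime import datetime


def avg_time_between(times):
    """Mean gap in seconds between consecutive timestamps (None if fewer than 2)."""
    ordered = sorted(times)
    if len(ordered) < 2:
        return None
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Gaps of 2 days and 6 days give a mean gap of 4 days
times = [datetime(2011, 1, 1), datetime(2011, 1, 3), datetime(2011, 1, 9)]
avg_seconds = avg_time_between(times)

# Under the seconds assumption, convert a value from the table above to days
SECONDS_PER_DAY = 86400
days_between_invoices = 11033160.0 / SECONDS_PER_DAY
```

Under that assumption, the value 11033160.0 for customer 14501 corresponds to roughly 127.7 days between invoices.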


In [6]:
features

[<Feature: Country>,
 <Feature: AVG_TIME_BETWEEN(invoices.first_item_purchases_time)>,
 <Feature: COUNT(invoices)>,
 <Feature: AVG_TIME_BETWEEN(item_purchases.InvoiceDate)>,
 <Feature: MEAN(item_purchases.Quantity)>,
 <Feature: MEAN(item_purchases.UnitPrice)>,
 <Feature: SUM(item_purchases.Quantity)>,
 <Feature: SUM(item_purchases.UnitPrice)>,
 <Feature: COUNT(item_purchases)>,
 <Feature: DAY(first_invoices_time)>,
 <Feature: MEAN(invoices.AVG_TIME_BETWEEN(item_purchases.InvoiceDate))>,
 <Feature: MEAN(invoices.MEAN(item_purchases.Quantity))>,
 <Feature: MEAN(invoices.MEAN(item_purchases.UnitPrice))>,
 <Feature: MEAN(invoices.SUM(item_purchases.Quantity))>,
 <Feature: MEAN(invoices.SUM(item_purchases.UnitPrice))>,
 <Feature: MEAN(invoices.COUNT(item_purchases))>,
 <Feature: SUM(invoices.AVG_TIME_BETWEEN(item_purchases.InvoiceDate))>,
 <Feature: SUM(invoices.MEAN(item_purchases.Quantity))>,
 <Feature: SUM(invoices.MEAN(item_purchases.UnitPrice))>,
 <Feature: MEAN(item_purchases.items.AV