# EntitySets and Feature Engineering

In this tutorial, we will go over the basics of Featuretools. 

## Getting started 


### Import Featuretools Library

In [1]:
import featuretools as ft

## Working with EntitySets


In this example, we will use a retail dataset of customer purchases from a UK website, covering December 2010 to December 2011.


In [2]:
es = ft.demo.load_retail(nrows=1000)
es.id

'demo_retail_data'

### List entities in the EntitySet
An entity is a single table of data. Each row is an "instance" of our entity and each column is a "variable".

In [3]:
for e in es.entities:
    print(e.name)

invoices
items
customers
item_purchases


See "Accessing Variables" below for how to list the variables of an entity.

### Show first n rows of an entity
We control the number of rows with ``n``. By default, ``n=10``.

In [4]:
es["item_purchases"].head(n=5)

| item_purchase_id | item_purchase_id | InvoiceNo | StockCode | Quantity | InvoiceDate | UnitPrice |
|---|---|---|---|---|---|---|
| 0 | 0 | 536365 | 85123A | 6 | 2010-12-01 08:26:00 | 2.55 |
| 1 | 1 | 536365 | 71053 | 6 | 2010-12-01 08:26:00 | 3.39 |
| 2 | 2 | 536365 | 84406B | 8 | 2010-12-01 08:26:00 | 2.75 |
| 3 | 3 | 536365 | 84029G | 6 | 2010-12-01 08:26:00 | 3.39 |
| 4 | 4 | 536365 | 84029E | 6 | 2010-12-01 08:26:00 | 3.39 |


### Show specific instance (row) of entity
If we know a row we want to see, we can use ``show_instance``.

In [5]:
es["item_purchases"].show_instance(100)

| item_purchase_id | item_purchase_id | InvoiceNo | StockCode | Quantity | InvoiceDate | UnitPrice |
|---|---|---|---|---|---|---|
| 100 | 100 | 536378 | 84519A | 6 | 2010-12-01 09:37:00 | 2.95 |


## Accessing Variables

### List variables in entity

To list all the variables of an entity, which correspond to the columns of the entity's table, we do as shown below.

In [6]:
es["invoices"].variables

[<Variable: first_item_purchases_time (dtype: datetime, format: None)>,
 <Variable: CustomerID (dtype = categorical, count = 65)>,
 <Variable: InvoiceNo (dtype = categorical, count = 66)>]

### Access a variable in an entity
We can access certain variables as well. This will become useful later for feature engineering and analysis.

In [7]:
es["item_purchases"]["UnitPrice"]

<Variable: UnitPrice (dtype = numeric, count = 1000)>

### Variable Statistics
We can view statistics about a variable such as the price of the items.

In [8]:
print(es["item_purchases"]["UnitPrice"].max)
print(es["item_purchases"]["UnitPrice"].min)

165.0
0.0


For a given variable, we can see the first n values using ``head``.

In [9]:
es["item_purchases"]["UnitPrice"].head(5)

| item_purchase_id | UnitPrice |
|---|---|
| 0 | 2.55 |
| 1 | 3.39 |
| 2 | 2.75 |
| 3 | 3.39 |
| 4 | 3.39 |


### Different Variable Types
Variables can be of many different types. The item price used above is a numeric variable, but there are also categorical, datetime, ordinal, and text variables.

In [10]:
print(es["item_purchases"]['item_purchase_id'])
print(es["item_purchases"]['InvoiceDate'])

<Variable: item_purchase_id (dtype = categorical, count = 1000)>
<Variable: InvoiceDate (dtype: datetime, format: None)>


### View Relationships
The entities are connected by relationships between variables. Each relationship is a parent-child relationship between two entities. For example, the third relationship shown below says that the invoices entity is a child of the customers entity. This means that each customer can have multiple invoices.

In [11]:
es.relationships

[<Relationship: item_purchases.StockCode -> items.StockCode>,
 <Relationship: item_purchases.InvoiceNo -> invoices.InvoiceNo>,
 <Relationship: invoices.CustomerID -> customers.CustomerID>]

These relationships are analogous to foreign-key relationships in a SQL database.
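
To make the parent-child idea concrete, here is a minimal plain-Python sketch that does not use Featuretools at all. The invoice and customer identifiers are made up for illustration; only the column names `InvoiceNo` and `CustomerID` come from the relationship list above.

```python
from collections import defaultdict

# Made-up child rows: each invoice carries the key of its parent customer,
# mirroring invoices.CustomerID -> customers.CustomerID.
invoices = [
    {"InvoiceNo": "inv-1", "CustomerID": "cust-A"},
    {"InvoiceNo": "inv-2", "CustomerID": "cust-A"},
    {"InvoiceNo": "inv-3", "CustomerID": "cust-B"},
]

# Collect the children that belong to each parent by following the key.
invoices_by_customer = defaultdict(list)
for row in invoices:
    invoices_by_customer[row["CustomerID"]].append(row["InvoiceNo"])

print(dict(invoices_by_customer))
```

One parent key maps to many child rows, which is what lets child values be aggregated per parent.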

## Building Features

### A feature is any new value we create by transforming our raw data
The simplest way to build a feature is to use a variable directly.

In [12]:
region = ft.Feature(es["customers"]["Country"])
region

<Feature: Country>

### Creating More Complicated Features
We can create more complicated features as well. New features can be defined using previously defined features.

We will build on the simpler feature above to create more complicated ones.

In [13]:
germany = region == 'Germany'
germany

<Feature: Country = Germany>

The feature above has a boolean value (True or False) depending on whether the customer is from Germany.

In [14]:
italy = region == 'Italy'
germany_or_italy = germany | italy
germany_or_italy

<Feature: Country = Germany OR Country = Italy>

### Viewing feature values

We can view feature values like this:

In [15]:
germany_or_italy.head(5)

|  | Country = Germany OR Country = Italy |
|---|---|
| 12431.0 | 0.0 |
| 16029.0 | 0.0 |
| 16098.0 | 0.0 |
| 16210.0 | 0.0 |
| 16218.0 | 0.0 |
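
As a rough mental model of how features compose, each feature can be pictured as a function of a row, with new features built by calling earlier ones. The sketch below is plain Python, not the Featuretools API, and the customer rows are made up for illustration.

```python
# Made-up customer rows; only the "Country" column name comes from the tutorial.
customers = [
    {"CustomerID": "cust-A", "Country": "Germany"},
    {"CustomerID": "cust-B", "Country": "United Kingdom"},
    {"CustomerID": "cust-C", "Country": "Italy"},
]

def region(row):
    # Identity feature: just read the variable.
    return row["Country"]

def germany(row):
    return region(row) == "Germany"

def italy(row):
    return region(row) == "Italy"

def germany_or_italy(row):
    # A feature defined in terms of two previously defined features.
    return germany(row) or italy(row)

values = {row["CustomerID"]: germany_or_italy(row) for row in customers}
print(values)
```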


Featuretools has many functions such as ``multiply``, ``divide``, ``add``, and ``subtract`` that can be used to create more complicated features. For example, we can multiply the price and quantity of a purchase to calculate its total cost.

In [16]:
price = ft.Feature(es['item_purchases']['UnitPrice'])
quantity = ft.Feature(es['item_purchases']['Quantity'])
total_item_cost = price * quantity
total_item_cost.head(10)

| item_purchase_id | UnitPrice * Quantity |
|---|---|
| 0 | 15.3 |
| 662 | 6.3 |
| 661 | 6.3 |
| 656 | 8.5 |
| 659 | 1.1 |
| 655 | 17.7 |
| 666 | 3.4 |
| 660 | 1.65 |
| 667 | 2.55 |
| 654 | 25.5 |
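
The product feature is a row-wise calculation. The plain-Python sketch below reproduces it for the first three item purchases listed earlier in this tutorial; the rounding to two decimals is an assumption added here to keep the floating-point output tidy.

```python
# First three rows of item_purchases as shown earlier in the tutorial.
item_purchases = [
    {"item_purchase_id": 0, "Quantity": 6, "UnitPrice": 2.55},
    {"item_purchase_id": 1, "Quantity": 6, "UnitPrice": 3.39},
    {"item_purchase_id": 2, "Quantity": 8, "UnitPrice": 2.75},
]

# One output value per row: price times quantity.
total_item_cost = {
    row["item_purchase_id"]: round(row["UnitPrice"] * row["Quantity"], 2)
    for row in item_purchases
}
print(total_item_cost)
```

The value for item purchase 0 matches the first row of the output above.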


### Stacking and Creating Aggregation Features
We can stack features by aggregating features of a child entity. We use the 'Sum' aggregation on the cost of each item purchase to get the total cost of each invoice. Unlike a SQL groupby and join, Featuretools handles the grouping of the child entity automatically.

In [17]:
from featuretools.primitives import Sum

total_invoice_cost = Sum(total_item_cost, es['invoices'])
total_invoice_cost

<Feature: SUM(item_purchases.UnitPrice * Quantity)>
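
For comparison, the grouping that the aggregation performs can be written out by hand. This is a plain-Python sketch with made-up invoice numbers and costs, not a description of how Featuretools implements `Sum` internally.

```python
from collections import defaultdict

# Made-up child rows: (InvoiceNo, cost of one item purchase).
item_costs = [
    ("inv-1", 10.0),
    ("inv-1", 5.0),
    ("inv-2", 2.5),
]

# Group the child rows by their parent invoice and add them up.
total_invoice_cost = defaultdict(float)
for invoice_no, cost in item_costs:
    total_invoice_cost[invoice_no] += cost

print(dict(total_invoice_cost))
```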

We can aggregate once again over each customer's invoices to get the
* largest purchase made by the customer in the time span, using the 'Max' aggregation
* total spent by the customer in the time span, using the 'Sum' aggregation

In [18]:
from featuretools.primitives import Max

cost_largest_purchase = Max(total_invoice_cost, es['customers'])
total_spent = Sum(total_invoice_cost, es['customers'])

In [19]:
total_spent.head(3)

|  | SUM(invoices.SUM(item_purchases.UnitPrice * Quantity)) |
|---|---|
| 12431.0 | 358.25 |
| 16029.0 | 3702.12 |
| 16098.0 | 430.6 |


Instead of aggregating multiple times over parent entities, the aggregation can also be done in a single step as shown below.

In [20]:
total_spent = Sum(total_item_cost, es['customers'])

Notice the table below is equivalent to the one above.

In [21]:
total_spent.head(3)

|  | SUM(item_purchases.UnitPrice * Quantity) |
|---|---|
| 12431.0 | 358.25 |
| 16029.0 | 3702.12 |
| 16098.0 | 430.6 |
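
The two routes agree because a sum of sums is the same as one sum over all the underlying rows. The plain-Python sketch below checks this on made-up data; the identifiers and costs are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from collections import defaultdict

# Made-up rows: (customer, invoice, item cost). Not taken from the retail data.
item_costs = [
    ("cust-A", "inv-1", 10.0),
    ("cust-A", "inv-1", 5.0),
    ("cust-A", "inv-2", 2.5),
    ("cust-B", "inv-3", 4.0),
]

# Stacked: sum items per invoice, then sum invoice totals per customer.
invoice_totals = defaultdict(float)
invoice_owner = {}
for customer, invoice, cost in item_costs:
    invoice_totals[invoice] += cost
    invoice_owner[invoice] = customer
stacked = defaultdict(float)
for invoice, total in invoice_totals.items():
    stacked[invoice_owner[invoice]] += total

# Direct: sum items per customer in one pass.
direct = defaultdict(float)
for customer, _invoice, cost in item_costs:
    direct[customer] += cost

print(dict(stacked))
print(dict(direct))
```

This shortcut relies on sums being freely regroupable. A 'Max' over invoice totals would not generally equal a 'Max' over individual item costs, so the two-step form is still needed for features such as the largest purchase.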


### Creating Datetime Features

We will explore and create features using the datetime column in the "invoices" entity. The column "first_item_purchases_time" is the time of the purchase.

In [22]:
es["invoices"]["first_item_purchases_time"].head(5)

| InvoiceNo | first_item_purchases_time |
|---|---|
| 536365 | 2010-12-01 08:26:00 |
| 536366 | 2010-12-01 08:28:00 |
| 536367 | 2010-12-01 08:34:00 |
| 536368 | 2010-12-01 08:34:00 |
| 536369 | 2010-12-01 08:35:00 |


We will create a feature that computes the month of the purchase.

In [23]:
from featuretools.primitives import Month

month_of_purchase = Month(es["invoices"]["first_item_purchases_time"])
month_of_purchase.head(n=5)

| InvoiceNo | MONTH(first_item_purchases_time) |
|---|---|
| 536365 | 12 |
| 536400 | 12 |
| 536401 | 12 |
| 536402 | 12 |
| 536403 | 12 |


Let's assume we are interested in customers who shopped on weekends in December leading up to the Christmas holidays.

In [24]:
from featuretools.primitives import Weekend, Day

weekend_purchase = Weekend(es["invoices"]["first_item_purchases_time"])
print(weekend_purchase)
purchase_before_25th = Day(es["invoices"]["first_item_purchases_time"]).LT(25)
print(purchase_before_25th)
holiday_shopping = (month_of_purchase == 12) & weekend_purchase & purchase_before_25th
print(holiday_shopping)

<Feature: IS_WEEKEND(first_item_purchases_time)>
<Feature: DAY(first_item_purchases_time) < 25>
<Feature: MONTH(first_item_purchases_time) = 12 AND IS_WEEKEND(first_item_purchases_time) AND DAY(first_item_purchases_time) < 25>
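
The same three conditions can be checked on a single timestamp with the standard `datetime` module. This is a plain-Python sketch of the logic, not Featuretools code; the function name `is_holiday_shopping` and the second and third timestamps are hypothetical, while the first timestamp is the one shown for invoice 536365 above.

```python
from datetime import datetime

def is_holiday_shopping(ts):
    # Mirrors MONTH = 12 AND IS_WEEKEND AND DAY < 25.
    month_is_december = ts.month == 12
    is_weekend = ts.weekday() >= 5  # Monday is 0, so Saturday is 5 and Sunday is 6
    before_25th = ts.day < 25
    return month_is_december and is_weekend and before_25th

first_invoice = datetime(2010, 12, 1, 8, 26)   # a Wednesday
a_saturday = datetime(2010, 12, 4, 10, 0)      # hypothetical weekend purchase
christmas_day = datetime(2010, 12, 25, 10, 0)  # a Saturday, but not before the 25th

print(is_holiday_shopping(first_invoice))
print(is_holiday_shopping(a_saturday))
print(is_holiday_shopping(christmas_day))
```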


### Renaming Features
Feature names can get hard to read, so we may want to assign new names to features.

In [25]:
holiday_shopping = holiday_shopping.rename("is_holiday")
holiday_shopping

<Feature: is_holiday>

In [26]:
holiday_shopping.head(n=5)

| InvoiceNo | is_holiday |
|---|---|
| 536365 | False |
| 536400 | False |
| 536401 | False |
| 536402 | False |
| 536403 | False |


### TimeSince Feature
Suppose we are interested in the amount of time since the customer's most recent purchase. To do this, we can find each customer's last purchase time and then apply the ``TimeSince`` feature.

In [27]:
from featuretools.primitives import Last, TimeSince

last_invoice = Last(es["invoices"]["first_item_purchases_time"], es["customers"])
time_since = TimeSince(last_invoice).rename("time_since_last_invoice")
time_since.head()

|  | time_since_last_invoice |
|---|---|
| 12431.0 | 2410.244215 |
| 16029.0 | 2410.247687 |
| 16098.0 | 2410.256715 |
| 16210.0 | 2410.144215 |
| 16218.0 | 2410.184493 |
| 16250.0 | 2410.246993 |
| 16552.0 | 2410.137965 |
| 16583.0 | 2410.160881 |
| 17181.0 | 2410.136576 |
| 17377.0 | 2410.138659 |
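
The calculation behind this kind of feature can be sketched in plain Python: take the latest timestamp per customer and subtract it from a reference time. Everything specific here is an assumption made for illustration: the customer labels, the assignment of timestamps to customers, the reference time `as_of`, and the choice of days as the unit.

```python
from datetime import datetime

# Made-up assignment of invoice times to customers.
invoice_times = {
    "cust-A": [datetime(2010, 12, 1, 8, 26), datetime(2010, 12, 1, 9, 37)],
    "cust-B": [datetime(2010, 12, 1, 8, 34)],
}

# Hypothetical reference time at which the feature is evaluated.
as_of = datetime(2010, 12, 11, 9, 37)

def days_since_last(times, now):
    # "Last" is taken as the most recent timestamp; the unit (days) is an assumption.
    return (now - max(times)).total_seconds() / 86400.0

time_since_last_invoice = {
    customer: days_since_last(times, as_of)
    for customer, times in invoice_times.items()
}
print(time_since_last_invoice)
```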


### Comparing a Feature

If we are interested in whether the unit price of a purchased item is above $2.50, we can compute it as shown below.

In [28]:
price = ft.Feature(es["item_purchases"]["UnitPrice"])
price_gt_2_50 = price > 2.50
price_gt_2_50.head(n=5)

| item_purchase_id | UnitPrice > 2.5 |
|---|---|
| 0 | True |
| 662 | False |
| 661 | False |
| 656 | True |
| 659 | False |


### Limiting Duration of Historical Data 

Suppose we are only interested in the average amount of the purchases the customer made in the last month. We can pass the ``use_previous`` parameter with a value of 30 days.

In [29]:
from featuretools.primitives import Mean 

total_invoice = Sum(total_item_cost, es['invoices']).rename("total")

average_purchase_last_month = Mean(total_invoice, es['customers'],
                                   use_previous="30 days")
average_purchase_last_month

<Feature: MEAN(invoices.total, Last 30 Days)>
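
Conceptually, limiting the history means discarding child rows older than the window before aggregating. The plain-Python sketch below shows that idea for one customer. The data, the `cutoff` reference time, and the choice to exclude the window's start and include its end are assumptions, not documented Featuretools behavior.

```python
from datetime import datetime, timedelta

def mean_in_window(rows, cutoff, window):
    # Keep only rows inside (cutoff - window, cutoff]; boundary handling is an assumption.
    start = cutoff - window
    recent = [total for when, total in rows if start < when <= cutoff]
    return sum(recent) / len(recent) if recent else None

# Made-up invoices for one customer: (invoice time, invoice total).
invoices = [
    (datetime(2011, 11, 1), 100.0),  # older than 30 days, ignored
    (datetime(2011, 11, 20), 40.0),
    (datetime(2011, 12, 5), 20.0),
]

cutoff = datetime(2011, 12, 9)  # hypothetical reference time
average_last_30_days = mean_in_window(invoices, cutoff, timedelta(days=30))
print(average_last_30_days)
```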

### Conditional Summing | Adding the 'where' Parameter

Suppose we are interested in the total amount the customer spent on holiday shopping. We can use the 'where' parameter, passing the Boolean feature "holiday_shopping" that we created previously.

In [30]:
total_holiday_purchase = Sum(total_invoice,
                             es['customers'], where=holiday_shopping)
total_holiday_purchase

<Feature: SUM(invoices.total WHERE is_holiday)>
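
A conditional aggregation filters the child rows by a Boolean value before summing. The following plain-Python sketch uses made-up invoices. Reporting `0.0` for a customer with no matching invoices is an assumption made here; the tutorial does not say what Featuretools returns in that case.

```python
# Made-up invoices: (customer, invoice total, is_holiday flag).
invoices = [
    ("cust-A", 50.0, True),
    ("cust-A", 20.0, False),
    ("cust-B", 30.0, False),
]

def conditional_sum(rows):
    totals = {}
    for customer, total, is_holiday in rows:
        totals.setdefault(customer, 0.0)  # assumption: no matching rows yields 0.0
        if is_holiday:
            totals[customer] += total
    return totals

total_holiday_purchase = conditional_sum(invoices)
print(total_holiday_purchase)
```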

### SlidingWindow Feature
Sometimes we would like to use more than a single aggregate value from a child entity.

SlidingWindow features let us define windows of time on child entities with which to aggregate data. Unlike standard AggregationFeatures such as Mean and Max that return a single value for each parent instance, SlidingWindow features return a fixed size array of values, referred to as windows.

In our example below, the feature will return 3 values, since the ``use_previous`` span of 30 days divides into three windows of ``window_size`` 10 days each.



In [31]:
from featuretools.primitives import SlidingMean
SlidingMean(total_item_cost, es['customers'], use_previous='30 days', window_size='10 days')

<Feature: SLIDING_MEAN(item_purchases.UnitPrice * Quantity, Last 30 Days, window_size = 10 Days)>