# Manual Feature Engineering on the Retail Dataset

In this notebook we will manually engineer features for a retail dataset. The dataset originally comes from the UCI Machine Learning Repository and is reminiscent of real-world data.

In [1]:
import numpy as np
import pandas as pd

import featuretools as ft

In [6]:
data = pd.read_excel('../input/Online Retail.xlsx', names = ['order_id', 'product_id', 'desc', 'quantity', 
                                                             'date', 'unit_price', 'customer_id', 'country'])
data.head()

,order_id,product_id,desc,quantity,date,unit_price,customer_id,country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [7]:
data.shape

(541909, 8)

In [8]:
data['month'] = data['date'].dt.month

# Prediction Problem

The first thing we need to decide is what exactly we want to predict from this data. One question of interest to businesses is: will my customers be repeat buyers? We can frame this as determining whether or not a customer will place more than 2 orders in the next month. To build labels, we need a prediction point for each customer: we identify the customer's last observation in the data and then go back a month from that point. We then count the number of unique orders in that last month of data, and the label is whether that count is greater than 2.

In [43]:
last_purchase = pd.DataFrame(data.groupby(['customer_id'])['date'].max())
# Step back 30 days from each customer's last purchase.
# (unit='M' does not mean days, so spell the offset out explicitly.)
last_purchase['prediction_point'] = last_purchase['date'] - pd.Timedelta(days=30)
last_purchase.head()

customer_id,date,prediction_point
12346.0,2011-01-18 10:17:00,2010-12-19 10:17:00
12347.0,2011-12-07 15:52:00,2011-11-07 15:52:00
12348.0,2011-09-25 13:13:00,2011-08-26 13:13:00
12349.0,2011-11-21 09:51:00,2011-10-22 09:51:00
12350.0,2011-02-02 16:01:00,2011-01-03 16:01:00


In [None]:
orders = []

# Iterate through each customer
for i, customer_info in last_purchase.iterrows():
    customer = i
    prediction_point = customer_info['prediction_point']
    subset = data.loc[(data['date'] > prediction_point) & (data['customer_id'] == customer), :].copy()
    
    # Add the number of unique orders in the month
    orders.append(subset['order_id'].nunique())

In [None]:
last_purchase['orders'] = orders
last_purchase['label'] = last_purchase['orders'] > 2
last_purchase.describe()
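
The per-customer loop can be slow on a dataset of this size. As a rough sketch of the same labeling logic without a Python loop, the cell below uses a small made-up frame (the `toy` data and its values are assumptions for illustration, not taken from the retail dataset) and a 30-day window before each customer's last purchase.

```python
import pandas as pd

# Toy stand-in for the retail data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 1, 2, 2],
    'order_id': ['A', 'B', 'C', 'D', 'E', 'F'],
    'date': pd.to_datetime(['2011-01-01', '2011-03-05', '2011-03-10',
                            '2011-03-20', '2011-02-01', '2011-06-01']),
})

# Last purchase date per customer, broadcast back onto every row
last_date = toy.groupby('customer_id')['date'].transform('max')

# Keep rows strictly after the prediction point (30 days before the last purchase)
in_window = toy[toy['date'] > last_date - pd.Timedelta(days=30)]

# Count distinct orders per customer in the window, then apply the "> 2" threshold
toy_orders = in_window.groupby('customer_id')['order_id'].nunique()
toy_labels = toy_orders > 2
```

Every customer has at least one row in the window (their last purchase), so no customer drops out of `toy_orders`.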

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
last_purchase['label'].value_counts().plot.bar();

# Normalizing Data

The first step is to get the data into discrete, normalized tables. We'll create four tables of information:

* customers: each customer (`customer_id`) will have one row
* products: each product (`product_id`) will have one row
* orders: each invoice (`order_id`) will have one row
* purchases: each purchased item will have one row

The `purchases` dataframe sits at the bottom of the hierarchy: it is a child of `products` and `orders`, and `orders` is in turn a child of `customers`. The `customers` dataframe is where we will make our features for the prediction problem.
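
One way to do this split by hand with pandas is sketched below on a tiny made-up frame. The column names follow the ones used when loading the data in this notebook, but the assignment of columns to tables (for example, putting `country` on customers and `date` on orders) is an assumption for illustration, as are the `*_df` names.

```python
import pandas as pd

# Toy flat table (values are made up; column names match this notebook)
flat = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'product_id': ['P1', 'P2', 'P1', 'P3'],
    'desc': ['lantern', 'hanger', 'lantern', 'bottle'],
    'quantity': [6, 8, 2, 1],
    'date': pd.to_datetime(['2010-12-01', '2010-12-01', '2010-12-02', '2010-12-03']),
    'unit_price': [3.39, 2.75, 3.39, 4.00],
    'customer_id': [10, 10, 10, 20],
    'country': ['United Kingdom', 'United Kingdom', 'United Kingdom', 'France'],
})

# One row per customer
customers_df = (flat[['customer_id', 'country']]
                .drop_duplicates('customer_id').reset_index(drop=True))

# One row per product
products_df = (flat[['product_id', 'desc']]
               .drop_duplicates('product_id').reset_index(drop=True))

# One row per invoice, keeping the link to the customer
orders_df = (flat[['order_id', 'customer_id', 'date']]
             .drop_duplicates('order_id').reset_index(drop=True))

# One row per purchased item, keeping the links to orders and products
purchases_df = flat[['order_id', 'product_id', 'quantity', 'unit_price']].copy()
```

Each parent table keeps only its key plus the attributes that are constant for that key, while the child table keeps the foreign keys and the per-item measurements.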

In [4]:
ft_data = ft.demo.load_retail()
ft_data

Entityset: demo_retail_data
  Entities:
    order_products [Rows: 541909, Columns: 5]
    products [Rows: 4070, Columns: 2]
    orders [Rows: 25900, Columns: 3]
    customers [Rows: 4373, Columns: 3]
  Relationships:
    order_products.product_id -> products.product_id
    order_products.order_id -> orders.order_id
    orders.customer_id -> customers.customer_id
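
The relationships listed above can be followed by hand to build customer-level features: aggregate the item rows up to orders, then aggregate orders up to customers. The cell below is a minimal sketch on made-up data; the `toy_*` frames and the feature names (`num_orders`, `total_spent`, `mean_order_value`) are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Toy child table: one row per purchased item (made-up values)
toy_purchases = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'quantity': [6, 8, 2, 1],
    'unit_price': [3.0, 2.5, 3.0, 4.0],
})

# Toy parent table: one row per order, linked to a customer
toy_order_table = pd.DataFrame({'order_id': [1, 2, 3],
                                'customer_id': [10, 10, 20]})

# Item-level value, then roll up to the order level
toy_purchases['total'] = toy_purchases['quantity'] * toy_purchases['unit_price']
order_totals = (toy_purchases.groupby('order_id')['total']
                .sum().rename('order_total').reset_index())

# Attach order totals to the orders table via the order_id key
order_level = toy_order_table.merge(order_totals, on='order_id', how='left')

# Roll up again to the customer level using named aggregations
customer_features = order_level.groupby('customer_id')['order_total'].agg(
    num_orders='count', total_spent='sum', mean_order_value='mean')
```

For the prediction problem, only rows dated before each customer's prediction point should feed these aggregations, otherwise the features leak information from the label window.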