# Welcome to the retail notebook!

In this demonstration, we will show you how the Retail EntitySet and Projects were created. 

In this notebook you will learn:

1. How an EntitySet can be made from a single table on S3
2. What a *prediction problem* is
3. How to load EntitySets and Projects into Tempo

# Step 1: Make an EntitySet
We start by loading a dataframe of retail logs from a public S3 bucket. The CSV has information on Customers, Transactions, Orders and Products. Reading an `s3://` path with pandas requires the `s3fs` package.

In [9]:
import featuretools as ft
import pandas as pd

import utils

csv_s3 = "s3://featurelabs-static/online-retail-logs.csv"
data = pd.read_csv(csv_s3, parse_dates=["order_date"])
# Convert to dollars
data['price'] = data['price'] * 1.65
data['total'] = data['price'] * data['quantity']
# utils.overview(data)
# utils.warnings(data)

In [10]:
# drop the duplicates
data = data.drop_duplicates()

# drop rows with any missing value (this includes rows with a null customer id)
data = data.dropna(axis=0)

data['cancelled'] = data['order_id'].str.startswith('C')
# utils.show(utils.timeseries(data['order_date'], data['total'], dynamic=False, aggregate='sum', n_bins=30),
#           title='Total sales over time', width=900, height=300)
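
To see what the cleaning cell does, here is a minimal sketch on a toy frame. The rows and values are made up for illustration; only the column names and the three cleaning calls come from the cell above.

```python
import pandas as pd

# Toy stand-in for the retail logs; every value here is invented.
toy = pd.DataFrame({
    "order_id": ["1001", "1001", "C1002", "1003"],
    "customer_id": [17850.0, 17850.0, 13047.0, None],
    "price": [2.0, 2.0, 5.0, 1.0],
    "quantity": [3, 3, -1, 4],
})

toy = toy.drop_duplicates()   # the first two rows are identical, so one is dropped
toy = toy.dropna(axis=0)      # the row with a missing customer_id is dropped
# Order ids that start with "C" are flagged as cancelled
toy["cancelled"] = toy["order_id"].str.startswith("C")

print(toy)
```

Two rows survive: one regular order and one cancelled order.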

EntitySets organize the data you work with to define prediction problems, perform feature engineering, and train machine learning models. They contain multiple tables, known as entities, and the relationships between them.

In this case, we have a single table of data, but we can create new entities using `normalize_entity`. 

In [6]:
es = ft.EntitySet(id="Online Retail Logs")
es.entity_from_dataframe("order_products",
                         dataframe=data,
                         index="order_product_id",
                         time_index = 'order_date',
                         variable_types={'description': ft.variable_types.Text})

# create a new "products" entity
es.normalize_entity(new_entity_id="products",
                    base_entity_id="order_products",
                    index="product_id",
                    additional_variables=["description"])

# create a new "orders" entity
es.normalize_entity(new_entity_id="orders",
                    base_entity_id="order_products",
                    index="order_id",
                    additional_variables=[
                        "customer_id", "country", 'cancelled'])

# create a new "customers" entity based on the orders entity
es.normalize_entity(new_entity_id="customers",
                    base_entity_id="orders",
                    index="customer_id")

es.add_last_time_indexes()
es



Entityset: Online Retail Logs
  Entities:
    order_products [Rows: 401604, Columns: 7]
    products [Rows: 3684, Columns: 3]
    orders [Rows: 22190, Columns: 5]
    customers [Rows: 4372, Columns: 2]
  Relationships:
    order_products.product_id -> products.product_id
    order_products.order_id -> orders.order_id
    orders.customer_id -> customers.customer_id
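
The normalization above can be pictured in plain pandas. This is only a conceptual sketch on made-up data, not how Featuretools implements `normalize_entity`: it pulls the order-level columns out into a parent table with one row per `order_id`, and leaves the linking key in the child table. Featuretools additionally records the relationships listed in the output above.

```python
import pandas as pd

# Made-up flat log: one row per product line within an order.
flat = pd.DataFrame({
    "order_product_id": [0, 1, 2, 3],
    "order_id": ["A", "A", "B", "C"],
    "customer_id": [1, 1, 1, 2],
    "country": ["UK", "UK", "UK", "FR"],
    "total": [10.0, 5.0, 7.5, 3.0],
})

# Parent table: one row per order, carrying the order-level columns.
orders = (flat[["order_id", "customer_id", "country"]]
          .drop_duplicates("order_id")
          .reset_index(drop=True))

# Child table: keeps only the key that links each row to its order.
order_products = flat.drop(columns=["customer_id", "country"])

# Repeating the step on `orders` yields a `customers` table.
customers = orders[["customer_id"]].drop_duplicates().reset_index(drop=True)

print(len(order_products), len(orders), len(customers))
```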

In [13]:
import sys
sys.getsizeof(es) / 1e9

0.083496776

In [7]:
ft.save_obj_pickle(es, '../input/es.pkl')

In [8]:
es.to_pickle('../input/es')

In [4]:
utils.show(utils.timeseries(es['customers'].df['first_orders_time'], es['customers'], dynamic=False, aggregate='count', n_bins=30),
           title='Number of new customers over time', width=900, height=300)

In [5]:
utils.show(utils.histogram(data.groupby('customer_id')['total'].sum(),
                           col_max=1500, n_bins=100, dynamic=False),
           title='Total customer spending histogram', height=400)

In [6]:
utils.show(utils.histogram(data.groupby('product_id').count()['order_id'], 
                                  col_max=400, n_bins=100, dynamic=False),
           title='Histogram of product orders', height=400)

# Step 2: Making predictions

The next step is to decide what we want to predict. *We can use the same EntitySet to make different predictions*. For example, we might be interested in predicting how much a customer will spend in the future, or we might be interested in predicting how many of each product we'll sell. Here's code to create those prediction problems:

In [None]:
def make_retail_cutoffs_total(start_date, end_date):
    # Find customers with at least one order before the start date
    customer_pool = data.loc[data['order_date'] < start_date, 'customer_id'].unique()

    # Rows for customers in the pool that fall strictly between the start and end date.
    # The masks are combined into one so pandas does not have to reindex a chained filter.
    in_window = (
        data['customer_id'].isin(customer_pool)
        & (data['order_date'] > start_date)
        & (data['order_date'] < end_date)
    )

    # Take the sum of `total` per customer between the start date and end date
    ct = data.loc[in_window].groupby('customer_id')['total'].sum().reset_index()

    # The cutoff time is the start date
    ct['cutoff_time'] = start_date
    return ct

predict_may_sales = make_retail_cutoffs_total('2011-05-01', '2011-06-01')

predict_may_sales_binary = predict_may_sales.copy()
predict_may_sales_binary['total'] = predict_may_sales_binary['total'] > 500

predict_may_sales.to_csv('predict_may_sales.csv')
predict_may_sales_binary.to_csv('predict_may_sales_binary.csv')
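
The other problem mentioned above, predicting how many of each product we'll sell, can be labeled the same way. The following is a minimal sketch, not part of the original notebook: the function name `make_product_cutoffs_quantity` and the toy data are hypothetical, and it assumes the `product_id`, `quantity` and `order_date` columns of the retail data.

```python
import pandas as pd

def make_product_cutoffs_quantity(data, start_date, end_date):
    """Hypothetical product-level analogue of make_retail_cutoffs_total."""
    # Products that appear in at least one order before the start date
    product_pool = data.loc[data["order_date"] < start_date, "product_id"].unique()

    # Rows for those products strictly between the start and end date
    in_window = (
        data["product_id"].isin(product_pool)
        & (data["order_date"] > start_date)
        & (data["order_date"] < end_date)
    )

    # Label: total quantity per product inside the window
    ct = data.loc[in_window].groupby("product_id")["quantity"].sum().reset_index()

    # The cutoff time is the start date
    ct["cutoff_time"] = start_date
    return ct

# Made-up order lines for illustration only
toy = pd.DataFrame({
    "product_id": ["P1", "P1", "P2", "P2", "P3"],
    "quantity": [2, 3, 1, 4, 5],
    "order_date": pd.to_datetime(
        ["2011-04-10", "2011-05-03", "2011-04-20", "2011-05-15", "2011-05-20"]),
})

labels = make_product_cutoffs_quantity(toy, "2011-05-01", "2011-06-01")
print(labels)
```

In the toy data, `P3` is first seen in May, so it is not in the pool and gets no label, mirroring how the customer-level function skips customers with no history before the cutoff.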

In [None]:
predict_june_sales = make_retail_cutoffs_total('2011-06-01', '2011-07-01')
predict_july_sales = make_retail_cutoffs_total('2011-07-01', '2011-08-01')
predict_august_sales = make_retail_cutoffs_total('2011-08-01', '2011-09-01')
predict_september_sales = make_retail_cutoffs_total('2011-09-01', '2011-10-01')
predict_october_sales = make_retail_cutoffs_total('2011-10-01', '2011-11-01')
predict_november_sales = make_retail_cutoffs_total('2011-11-01', '2011-12-01')
predict_december_sales = make_retail_cutoffs_total('2011-12-01', '2012-01-01')

In [None]:
test_cutoffs = pd.concat([
    predict_june_sales,
    predict_july_sales,
    predict_august_sales,
    predict_september_sales,
    predict_october_sales,
    predict_november_sales,
    predict_december_sales,
])
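
The seven monthly windows above can also be generated rather than typed out. A small sketch using `pd.date_range`; the variable names `month_starts` and `windows` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Month starts from June 2011 through January 2012 ("MS" = month start).
month_starts = pd.date_range("2011-06-01", "2012-01-01", freq="MS")

# Pair each month start with the next one to get (start, end) windows.
windows = [
    (start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d"))
    for start, end in zip(month_starts[:-1], month_starts[1:])
]

print(windows[0], windows[-1])
```

With `make_retail_cutoffs_total` defined as above, the same frame could then be built with `pd.concat([make_retail_cutoffs_total(s, e) for s, e in windows])`.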

In [8]:
utils.show(utils.histogram(predict_may_sales['total'], col_max=5000, col_min=0, n_bins=50, dynamic=False), 
           height=300, y_range=(0, 200))
utils.show(utils.piechart(predict_may_sales_binary['total'], dynamic=False), title='Predict more than $500 in May sales')

# Step 3: Connecting to Tempo

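Here, we'll upload the data directly to the webapp. Be careful: overwriting the retail EntitySet can destroy any work you've already done with it in the app. If you do want to overwrite the existing EntitySet and Projects, uncomment the `client.unpublish_entityset(es)` line. If you leave it commented and an EntitySet with the same name already exists, publishing fails with the `ValueError` shown at the end.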

In [9]:
import featurelabs as fl
client = fl.Client(api_key="ab8c5174-8f92-11e8-9899-16d71f649b32")

# client.unpublish_entityset(es)
client.publish_entityset(es)
client.publish_project(project_name='Predict May Sales',
                       label_times=predict_may_sales,
                       entityset_id=es.id,
                       entity_id='customers',
                       label_type='regression',
                       description='For every customer who has a transaction before '
                                   'May 1, predict how much they will spend in May.')

client.publish_project(project_name='Predict May Sales (Binary)',
                       label_times=predict_may_sales_binary,
                       entityset_id=es.id,
                       entity_id='customers',
                       label_type='multiclass',
                       description='For every customer who has a transaction before '
                                   'May 1, predict whether or not they will spend more than $500 '
                                   'in May. This is a standard method to convert a regression '
                                   'to a binary classification problem')


ValueError: Duplicate EntitySet, try deleting EntitySet or changing EntitySet name