# Duplicated Rows in Deep Feature Synthesis

When `dfs` is given a `cutoff_time` dataframe with more than one row per unique observation (here, per customer), it creates duplicated rows whenever `chunk_size` is less than the number of rows in the labels. The duplicated rows share the same cutoff time and the same customer. If `chunk_size = len(labels)`, the issue does not occur. Calling `drop_duplicates` on the feature matrix _does not work_ as a fix, because for some customers the feature values are exactly the same across multiple months. Instead, you have to reset the index (assuming `cutoff_time_in_index = True` was passed to `dfs`) so that each customer's cutoff time becomes part of the row, and then call `drop_duplicates()`. This is shown below.

In [14]:
# Data manipulation
import pandas as pd 
import numpy as np

# Automated feature engineering
import featuretools as ft

# Read in data
csv_s3 = "https://s3.amazonaws.com/duplicated-rows/clean_retail_data.csv"
data = pd.read_csv(csv_s3, parse_dates=["order_date"])

labels = pd.read_csv('https://s3.amazonaws.com/duplicated-rows/labels.csv', 
                     parse_dates = ['cutoff_time'], index_col = 0)

# Entity set
es = ft.EntitySet(id="Online Retail Logs")

# Add the entire data table as an entity
es.entity_from_dataframe("purchases",
                         dataframe=data,
                         index="purchases_index",
                         time_index = 'order_date',
                         variable_types={'description': ft.variable_types.Text})

# create a new "products" entity
es.normalize_entity(new_entity_id="products",
                    base_entity_id="purchases",
                    index="product_id",
                    additional_variables=["description"])

# create a new "customers" entity based on the purchases entity
es.normalize_entity(new_entity_id="customers",
                    base_entity_id="purchases",
                    index="customer_id")

# create a new "orders" entity
es.normalize_entity(new_entity_id="orders",
                    base_entity_id="purchases",
                    index="order_id",
                    additional_variables=["country"])

# Creates no duplicate rows (chunk_size = len(labels))
feature_matrix, _ = ft.dfs(entityset=es, target_entity='customers',
                           cutoff_time = labels[labels['customer_id'].isin([12347, 12352])],
                           verbose = 2,
                           cutoff_time_in_index = True,
                           chunk_size = len(labels), n_jobs = -1,
                           max_depth = 1)

# Creates 42 duplicate rows
feature_matrix_duplicated, _  = ft.dfs(entityset=es, target_entity='customers',
                                       cutoff_time = labels[labels['customer_id'].isin([12347, 12352])],
                                       verbose = 2,
                                       cutoff_time_in_index = True,
                                       chunk_size = 2, n_jobs = -1,
                                       max_depth = 1)

Built 33 features
EntitySet scattered to workers in 1.948 seconds
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 1/1 chunks
Built 33 features
EntitySet scattered to workers in 6.875 seconds
Elapsed: 00:01 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 10/10 chunks


In [15]:
feature_matrix

customer_id,time,SUM(purchases.Unnamed: 0),SUM(purchases.quantity),SUM(purchases.price),SUM(purchases.total),STD(purchases.Unnamed: 0),STD(purchases.quantity),STD(purchases.price),STD(purchases.total),MAX(purchases.Unnamed: 0),MAX(purchases.quantity),...,NUM_UNIQUE(purchases.order_id),NUM_UNIQUE(purchases.product_id),MODE(purchases.order_id),MODE(purchases.product_id),DAY(first_purchases_time),YEAR(first_purchases_time),MONTH(first_purchases_time),WEEKDAY(first_purchases_time),total,label
12347.0,2011-03-01,2095540,315,120.7305,784.3935,8.3666,7.379955,3.945969,11.4459,72274,24,...,1,29,542237,20719,26,2011,1,2,0.0,0
12352.0,2011-03-01,1377360,98,112.7775,489.225,4.320494,3.981066,5.355588,10.528141,91831,12,...,1,15,544156,21232,16,2011,2,2,502.722,1
12347.0,2011-04-01,2095540,315,120.7305,784.3935,8.3666,7.379955,3.945969,11.4459,72274,24,...,1,29,542237,20719,26,2011,1,2,1049.8125,1
12352.0,2011-04-01,5350589,188,3135.1155,991.947,15767.778014,6.960703,157.369454,171.627342,129788,12,...,8,26,544156,M,16,2011,2,2,0.0,0
12347.0,2011-05-01,5654656,798,223.509,1834.206,37848.692147,32.080921,4.049262,53.512381,148308,240,...,2,46,542237,21041,26,2011,1,2,0.0,0
12352.0,2011-05-01,5350589,188,3135.1155,991.947,15767.778014,6.960703,157.369454,171.627342,129788,12,...,8,26,544156,M,16,2011,2,2,0.0,0
12347.0,2011-06-01,5654656,798,223.509,1834.206,37848.692147,32.080921,4.049262,53.512381,148308,240,...,2,46,542237,21041,26,2011,1,2,631.158,1
12352.0,2011-06-01,5350589,188,3135.1155,991.947,15767.778014,6.960703,157.369454,171.627342,129788,12,...,8,26,544156,M,16,2011,2,2,0.0,0
12347.0,2011-07-01,9625105,994,311.982,2465.364,59363.500894,28.079866,3.926609,46.989513,220589,240,...,3,58,542237,22196,26,2011,1,2,0.0,0
12352.0,2011-07-01,5350589,188,3135.1155,991.947,15767.778014,6.960703,157.369454,171.627342,129788,12,...,8,26,544156,M,16,2011,2,2,0.0,0


In [16]:
feature_matrix_duplicated

customer_id,time,SUM(purchases.Unnamed: 0),SUM(purchases.quantity),SUM(purchases.price),SUM(purchases.total),STD(purchases.Unnamed: 0),STD(purchases.quantity),STD(purchases.price),STD(purchases.total),MAX(purchases.Unnamed: 0),MAX(purchases.quantity),...,NUM_UNIQUE(purchases.order_id),NUM_UNIQUE(purchases.product_id),MODE(purchases.order_id),MODE(purchases.product_id),DAY(first_purchases_time),YEAR(first_purchases_time),MONTH(first_purchases_time),WEEKDAY(first_purchases_time),total,label
12347.0,2011-03-01,2095540,315,120.7305,784.3935,8.366600,7.379955,3.945969,11.445900,72274,24,...,1,29,542237,20719,26,2011,1,2,0.0000,0
12352.0,2011-03-01,1377360,98,112.7775,489.2250,4.320494,3.981066,5.355588,10.528141,91831,12,...,1,15,544156,21232,16,2011,2,2,502.7220,1
12352.0,2011-03-01,1377360,98,112.7775,489.2250,4.320494,3.981066,5.355588,10.528141,91831,12,...,1,15,544156,21232,16,2011,2,2,502.7220,1
12352.0,2011-03-01,1377360,98,112.7775,489.2250,4.320494,3.981066,5.355588,10.528141,91831,12,...,1,15,544156,21232,16,2011,2,2,502.7220,1
12352.0,2011-03-01,1377360,98,112.7775,489.2250,4.320494,3.981066,5.355588,10.528141,91831,12,...,1,15,544156,21232,16,2011,2,2,502.7220,1
12352.0,2011-03-01,1377360,98,112.7775,489.2250,4.320494,3.981066,5.355588,10.528141,91831,12,...,1,15,544156,21232,16,2011,2,2,502.7220,1
12352.0,2011-03-01,1377360,98,112.7775,489.2250,4.320494,3.981066,5.355588,10.528141,91831,12,...,1,15,544156,21232,16,2011,2,2,502.7220,1
12352.0,2011-03-01,1377360,98,112.7775,489.2250,4.320494,3.981066,5.355588,10.528141,91831,12,...,1,15,544156,21232,16,2011,2,2,502.7220,1
12347.0,2011-04-01,2095540,315,120.7305,784.3935,8.366600,7.379955,3.945969,11.445900,72274,24,...,1,29,542237,20719,26,2011,1,2,1049.8125,1
12347.0,2011-04-01,2095540,315,120.7305,784.3935,8.366600,7.379955,3.945969,11.445900,72274,24,...,1,29,542237,20719,26,2011,1,2,1049.8125,1


## Dropping duplicates does not work unless index is reset

In [20]:
feature_matrix_duplicated.shape

(62, 35)
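To confirm that the extra rows are repeats of the same `(customer_id, time)` pair, one option is to count repeated index entries. The sketch below uses a small made-up frame (the `toy` data is hypothetical, not taken from the retail dataset) indexed the same way as the feature matrix:

```python
import pandas as pd

# Hypothetical stand-in for a feature matrix indexed by (customer_id, time)
index = pd.MultiIndex.from_tuples(
    [
        (12347, "2011-03-01"),
        (12352, "2011-03-01"),
        (12352, "2011-03-01"),  # repeat of an earlier index entry
        (12352, "2011-03-01"),  # repeat of an earlier index entry
        (12347, "2011-04-01"),
    ],
    names=["customer_id", "time"],
)
toy = pd.DataFrame({"SUM(purchases.quantity)": [315, 98, 98, 98, 315]}, index=index)

# Index.duplicated() flags every repeat after the first occurrence
is_repeat = toy.index.duplicated(keep="first")
n_repeats = int(is_repeat.sum())
n_unique = int((~is_repeat).sum())
print(n_repeats, n_unique)
```

Applied to `feature_matrix_duplicated`, the same check should report how many rows are surplus and how many distinct customer/cutoff pairs remain.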

Dropping the duplicated rows directly is not valid because some months contain exactly the same feature values for a customer (they had no purchases from month to month that affected the features). Calling `drop_duplicates()` on the feature matrix as-is therefore drops _too many_ rows.

In [21]:
feature_matrix_duplicated.drop_duplicates().shape

(16, 35)
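The over-dropping follows from how `DataFrame.drop_duplicates` works: it compares column values only and ignores the index. A minimal sketch with made-up values (the `toy` frame is hypothetical) shows two different months collapsing into a single row:

```python
import pandas as pd

# Hypothetical rows: one true duplicate plus a later month with identical features
index = pd.MultiIndex.from_tuples(
    [
        (12352, "2011-04-01"),
        (12352, "2011-04-01"),  # true duplicate of the row above
        (12352, "2011-05-01"),  # different month, same feature values
    ],
    names=["customer_id", "time"],
)
toy = pd.DataFrame(
    {"SUM(purchases.quantity)": [188, 188, 188], "label": [0, 0, 0]},
    index=index,
)

# The index is not part of the comparison, so all three rows look identical
naive = toy.drop_duplicates()
print(len(naive))
```

Only one row survives here even though two distinct `(customer_id, time)` pairs exist, which is the same effect seen in the `(16, 35)` result.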

Instead, we can use `drop_duplicates` _after_ resetting the index.

In [22]:
feature_matrix_duplicated.reset_index().drop_duplicates().shape

(20, 37)
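The result has 37 columns rather than 35 because `reset_index` moves `customer_id` and `time` into ordinary columns. If the original index is wanted back, it can be restored with `set_index` after deduplicating. The sketch below uses a made-up frame (hypothetical values) and also shows an alternative that filters on `index.duplicated()`; that alternative assumes rows sharing an index entry are identical, which matches the duplicates described here but is not checked by the code:

```python
import pandas as pd

# Hypothetical rows: one true duplicate plus a later month with identical features
index = pd.MultiIndex.from_tuples(
    [
        (12352, "2011-04-01"),
        (12352, "2011-04-01"),  # true duplicate of the row above
        (12352, "2011-05-01"),  # different month, same feature values
    ],
    names=["customer_id", "time"],
)
toy = pd.DataFrame(
    {"SUM(purchases.quantity)": [188, 188, 188], "label": [0, 0, 0]},
    index=index,
)

# Option 1: reset, deduplicate, then restore the (customer_id, time) index
deduped = toy.reset_index().drop_duplicates().set_index(["customer_id", "time"])

# Option 2: keep only the first row for each index entry
deduped_alt = toy[~toy.index.duplicated(keep="first")]

print(deduped.shape, deduped_alt.shape)
```

Both options keep one row per customer and cutoff time. Option 1 compares full rows, so it would leave rows in place if two entries with the same index ever differed in their feature values, while Option 2 would silently keep the first.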