# Ancestry Product Analytics - Predicting Cross-Sells

Ancestry lets users research their genealogy and trace their family trees. Users often first experience the product through a DNA kit and then go on to purchase a subscription.

The goal of this project is to understand what factors and customer behaviors best predict a cross-sell to a subscription.

## Import data and libraries

In [96]:
# Import Libraries
import numpy             as np 
import pandas            as pd 
import matplotlib.pyplot as plt

In [79]:
# Load Data
data = pd.read_csv('../../take-home_exercise_data.csv')

## Clean data for EDA

Replace 'NA' and '-1' values with nan

In [80]:
# Replace 'NA' with nan
data.replace('NA', np.nan, inplace= True)

# Replace '-1' with nan in daystogetresult_grp
data.loc[data.daystogetresult_grp == '-1', 'daystogetresult_grp'] = np.nan
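A quick way to confirm these replacements is to run them on a small frame and count the missing values per column. The sketch below uses a hypothetical toy frame (`toy`); only the column names come from the data above, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    'daystogetresult_grp': ['4 weeks', '-1', 'NA', '1 week'],
    'regtenure': ['NA', '5 days', '1 day', 'No Reg Date'],
})

# Same two steps as above: 'NA' in every column, '-1' only in daystogetresult_grp
toy = toy.replace('NA', np.nan)
toy.loc[toy['daystogetresult_grp'] == '-1', 'daystogetresult_grp'] = np.nan

# Count missing values per column to verify the replacement
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running the same `isna().sum()` check on `data` before and after the cleaning cell shows how many values each step turned into missing entries.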

Convert strings to numeric and date-time

In [81]:
# xsell_day_exact: convert to numeric
data['xsell_day_exact'] = pd.to_numeric(data['xsell_day_exact'])

# dnatestactivationdayid: parse as datetime, then keep only the calendar date
data['dnatestactivationdayid'] = pd.to_datetime(data['dnatestactivationdayid']).dt.date
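By default, `pd.to_numeric` and `pd.to_datetime` raise an error on values they cannot parse. A more defensive variant passes `errors='coerce'`, which turns unparseable entries into NaN or NaT instead. The sketch below uses invented sample values and assumes an ISO `YYYY-MM-DD` date format; neither the format nor the choice to coerce silently is confirmed for this dataset.

```python
import pandas as pd

# Invented sample values; the real columns may look different
raw_days = pd.Series(['3', '41', 'n/a'])
raw_dates = pd.Series(['2016-01-15', '2016-02-30', '2016-03-01'])

# errors='coerce' maps unparseable entries to NaN / NaT instead of raising
days = pd.to_numeric(raw_days, errors='coerce')
dates = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Report how many values could not be converted
print(days.isna().sum(), dates.isna().sum())
```

Counting the coerced values afterwards makes it visible how much data the conversion dropped, which is worth checking before relying on the converted columns.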

In [82]:
# Strip the ' week' / ' weeks' suffix into a new 'weekstogetresult_grp' column
data['weekstogetresult_grp'] = (
    data['daystogetresult_grp']
    .str.replace(r'\s*weeks?', '', regex=True)  # also removes the space before the suffix
    .str.strip()
)

Clean Strings

In [83]:
# regtenure: clean strings
# Assign the result back instead of calling replace(..., inplace=True) on a column,
# which may not update `data` under pandas Copy-on-Write
data['regtenure'] = data['regtenure'].replace({
    'More than 120 days old': '>120',
    'Order prior to reg':     '0',
    'No Reg Date':            np.nan,
})
data['regtenure'] = data['regtenure'].str.replace(' days', '', regex=False)
data['regtenure'] = data['regtenure'].str.replace(' day',  '', regex=False)

In [84]:
# dna_visittrafficsubtype
# aggregate values into new 'traffic_source' column

data.loc[data['dna_visittrafficsubtype'].str.contains('unknown', case=False), 'traffic_source']             = 'unknown'
data.loc[data['dna_visittrafficsubtype'].str.contains('email', case=False), 'traffic_source']               = 'email'
data.loc[data['dna_visittrafficsubtype'].str.contains('social', case=False), 'traffic_source']              = 'social'
data.loc[data['dna_visittrafficsubtype'].str.contains('organic', case=False), 'traffic_source']             = 'organic search'
data.loc[data['dna_visittrafficsubtype'].str.contains('paid search', case=False), 'traffic_source']         = 'paid search'
data.loc[data['dna_visittrafficsubtype'].str.contains('affiliate', case=False), 'traffic_source']           = 'affiliate external'
data.loc[data['dna_visittrafficsubtype'].str.contains('content marketing', case=False), 'traffic_source']   = 'content marketing'
data.loc[data['dna_visittrafficsubtype'].str.contains('external referrals', case=False), 'traffic_source']  = 'external referrals'
data.loc[data['dna_visittrafficsubtype'].str.contains('internal referrals', case=False), 'traffic_source']  = 'internal referrals'
data.loc[data['dna_visittrafficsubtype'].str.contains('external paid media', case=False), 'traffic_source'] = 'external paid media'
data.loc[data['dna_visittrafficsubtype'].str.contains('app', case=False), 'traffic_source']                 = 'app'

data.loc[data['dna_visittrafficsubtype'].isin(['direct non-homepage', 'direct dna homepage', 
                                               'direct core homepage', 'Direct']), 'traffic_source'] = 'direct'

data.loc[data['dna_visittrafficsubtype'].isin(['Digital Video', 'Direct Mail', 'Feeders', 'FindAGrave', 
                                              'FTM Software Integration','geo-redirect', 'Inbound', 'Library/Assoc.', 
                                              'Mobile', 'Overlays', 'Partners', 'Radio Brand/PR', 'Search', 
                                              'Telemarketing Other (short term 8/31/05)', 'TV Brand/PR', 'Web Property',
                                              'Biz Dev', 'Kiosk', 'Display'])
                                              ,'traffic_source'] = 'other'
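The same aggregation can be written as an ordered list of rules, which keeps the substring-to-label mapping in one place and handles missing subtypes through `na=False`. This is a minimal sketch rather than the notebook's method: the helper `categorize_traffic`, the shortened rule list, and the sample values are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Ordered (substring, label) rules; later rules overwrite earlier ones,
# mirroring the order of the assignments above (shortened for illustration)
SUBSTRING_RULES = [
    ('email', 'email'),
    ('organic', 'organic search'),
    ('paid search', 'paid search'),
    ('app', 'app'),
]
DIRECT_VALUES = ['direct non-homepage', 'direct dna homepage',
                 'direct core homepage', 'Direct']

def categorize_traffic(subtype: pd.Series) -> pd.Series:
    """Map raw traffic subtypes to coarse labels (hypothetical helper)."""
    source = pd.Series(np.nan, index=subtype.index, dtype='object')
    for pattern, label in SUBSTRING_RULES:
        # na=False treats missing subtypes as non-matches
        mask = subtype.str.contains(pattern, case=False, na=False)
        source[mask] = label
    # Exact-match group, applied after the substring rules
    source[subtype.isin(DIRECT_VALUES)] = 'direct'
    return source

# Invented sample values
sample = pd.Series(['Paid Search Brand', 'Email Campaign', None, 'Direct', 'Mobile App'])
result = categorize_traffic(sample)
print(result)
```

Because later rules overwrite earlier ones, the order of `SUBSTRING_RULES` matters in the same way the order of the `.loc` assignments does.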

# Other Potential Segmentations
# paid vs non-paid
# Search: brand core, brand dna, non-brand
# Landing Page: dna, home, other
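The first idea, paid vs non-paid, could be derived from the `traffic_source` labels. A minimal sketch, assuming that a source counts as paid exactly when its label contains the word "paid"; that rule and the name `is_paid` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Invented sample of traffic_source labels, including a missing value
traffic_source = pd.Series(['paid search', 'organic search',
                            'external paid media', 'direct', None])

# Flag sources whose label contains 'paid'; missing labels count as non-paid
is_paid = traffic_source.str.contains('paid', na=False)
print(is_paid.tolist())
```

Other channels, such as affiliates, may also involve spend, so the rule would need to be adjusted to match how the business defines paid traffic.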

**Remove kit orders from existing ACOM subscribers**

In [85]:
data = data[data['customer_type_group'] != 'Acom Sub']

Remove columns with irregular dates

**Export CSV for EDA**

In [86]:
data.to_csv('cross_sell_data2.csv', encoding='utf-8', index=False)