ToDo:

- define conversion point 
- separate sales activity from CX activity
- map to leads & opportunities
- survey sales lifecycle; characterize the typical patterns of activities for each stage of the sales cycle

Stages:

1. Initial Contact
2. Demo
3. Signup
4. Onboarding

Questions:

- What is the most common sales abandonment stage?
- What behavior has the most impact on abandonment?
- What behavior influences conversion?
- What are the most common and the ideal timeframes for each stage of the sales cycle?

Caveats:

- WhoId and WhatId potentially reference different things
    - WhoId typically refers to leads or contacts, while WhatId typically refers to accounts or opportunities
    - we need to track activity from lead through opportunity, so a given tracked entity will move from WhoId to WhatId during its life cycle
    - _are these identifiers automatically updated when a lead is converted to an opportunity?_
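
One way to probe the open question above is to check, per activity row, whether its WhoId matches a known lead and whether its WhatId matches a known opportunity. The sketch below is only an illustration on made-up data: the column names follow this notebook, but every ID is hypothetical.

```python
import pandas as pd

# Toy data; all IDs are made up for illustration.
activities = pd.DataFrame({
    "Id": ["a1", "a2", "a3", "a4"],
    "WhoId": ["L1", "L1", "C9", None],
    "WhatId": [None, "O1", "O1", "O2"],
})
mapping = pd.DataFrame({
    "lead_id": ["L1", "L2"],
    "opportunity_id": ["O1", "O2"],
})

# Flag which side of the lead -> opportunity mapping each activity matches.
activities["matches_lead"] = activities["WhoId"].isin(mapping["lead_id"])
activities["matches_opp"] = activities["WhatId"].isin(mapping["opportunity_id"])

# Cross-tabulate the flags: lead only, opportunity only, both, or neither.
coverage = pd.crosstab(activities["matches_lead"], activities["matches_opp"])
print(coverage)
```

If most rows matched by opportunity also match by lead, that would suggest the identifiers are carried over on conversion; a large "opportunity only" group would suggest they are not.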

In [1]:
import pandas as pd
import sys
import numpy as np

sys.path.insert(1, '../../../scripts/')
from s3_support import *

# Load & confirm

In [2]:
activity_history_file = "activity_history.csv"

In [3]:
df = get_dataframe_from_file("sfc-export", activity_history_file)
df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])
print("{} rows; {} columns".format(len(df), len(df.columns)))

459989 rows; 11 columns


In [4]:
unq_whatids = len(df['WhatId'].unique())
unq_whoids = len(df['WhoId'].unique())

print("unique what IDs: {}; unique who IDs: {}".format(unq_whatids, unq_whoids))
print("mean entries per who ID: {:.2f}".format(df.groupby('WhoId')['Id'].count().mean()))

unique what IDs: 8892; unique who IDs: 31907
mean entries per who ID: 14.42
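
The two statistics above treat missing IDs differently, which matters if the export contains activities with no WhoId (not verified here). A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data; IDs are made up. One activity has no WhoId.
toy = pd.DataFrame({
    "Id": ["a1", "a2", "a3", "a4"],
    "WhoId": ["L1", "L1", np.nan, "L2"],
})

with_nan = len(toy["WhoId"].unique())   # counts NaN as its own value
without_nan = toy["WhoId"].nunique()    # ignores NaN

# groupby drops NaN keys by default, so the mean covers only rows with a WhoId.
mean_per_who = toy.groupby("WhoId")["Id"].count().mean()

print(with_nan, without_nan, mean_per_who)
```

Using `nunique()` for both counts would keep the unique-ID totals consistent with the per-ID mean.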


In [5]:
df.columns

Index(['Id', 'WhoId', 'WhatId', 'Subject', 'OwnerId', 'Description', 'Type',
       'AccountId', 'CreatedDate', 'CreatedById', 'SystemModstamp'],
      dtype='object')

In [6]:
leads_to_opportunities = get_dataframe_from_file("sfc-export", 'leads_to_opportunities.csv')
len(leads_to_opportunities), len(leads_to_opportunities.columns)

(7014, 6)

In [7]:
# map whoid to lead, whatid to opportunity
opps_mrgd = df.merge(leads_to_opportunities, left_on='WhatId', right_on='opportunity_id')
leads_mrgd = df.merge(leads_to_opportunities, left_on="WhoId", right_on='lead_id')

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
mrgd = pd.concat([opps_mrgd, leads_mrgd])

len(mrgd), len(mrgd.columns)

(59142, 17)
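
Because the two merges are stacked, an activity that matches both by WhatId and by WhoId can appear twice in the combined frame. Whether that happens in this export is an assumption to check, not a confirmed fact. A minimal sketch on made-up data, deduplicating on the activity `Id`:

```python
import pandas as pd

# Toy data; IDs are made up. Activity "a2" matches by both WhoId and WhatId.
activities = pd.DataFrame({
    "Id": ["a1", "a2", "a3"],
    "WhoId": ["L1", "L1", "L2"],
    "WhatId": [None, "O1", None],
})
mapping = pd.DataFrame({
    "lead_id": ["L1", "L2"],
    "opportunity_id": ["O1", "O2"],
})

by_opp = activities.merge(mapping, left_on="WhatId", right_on="opportunity_id")
by_lead = activities.merge(mapping, left_on="WhoId", right_on="lead_id")

# Stack both views, then keep one row per activity.
combined = pd.concat([by_opp, by_lead], ignore_index=True)
deduped = combined.drop_duplicates(subset="Id")

print(len(combined), len(deduped))
```

Passing `ignore_index=True` also avoids the duplicated index labels that the plain stack produces.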

In [8]:
# filter out post sales close date to remove CX activity
mrgd['CreatedDate'] = pd.to_datetime(mrgd['CreatedDate'])
mrgd['opportunity_closedate'] = pd.to_datetime(mrgd['opportunity_closedate'])

mrgd = mrgd[mrgd['CreatedDate']<=mrgd['opportunity_closedate']]

In [9]:
len(mrgd)

56043
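
Note that comparisons against a missing close date (`NaT`) evaluate to False, so the filter above also drops activity on opportunities that have no close date. The sketch below shows the effect on made-up dates; keeping those rows is an assumption about the desired behaviour, not something this notebook does.

```python
import pandas as pd

# Toy data; the third opportunity has no close date.
toy = pd.DataFrame({
    "CreatedDate": pd.to_datetime(["2019-01-05", "2019-03-01", "2019-02-01"]),
    "opportunity_closedate": pd.to_datetime(["2019-02-01", "2019-02-01", None]),
})

# Plain filter: the NaT comparison is False, so the third row is dropped.
strict = toy[toy["CreatedDate"] <= toy["opportunity_closedate"]]

# Alternative: also keep activity on opportunities without a close date.
lenient = toy[
    (toy["CreatedDate"] <= toy["opportunity_closedate"])
    | toy["opportunity_closedate"].isna()
]

print(len(strict), len(lenient))
```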

In [10]:
df = mrgd

In [11]:
df.columns

Index(['Id', 'WhoId', 'WhatId', 'Subject', 'OwnerId', 'Description', 'Type',
       'AccountId', 'CreatedDate', 'CreatedById', 'SystemModstamp', 'lead_id',
       'opportunity_id', 'lead_date', 'opportunity_date',
       'opportunity_closedate', 'opportunity_stage'],
      dtype='object')

# Type distributions

In [12]:
# NOTE: 000000000000000AAA is probably an anonymous/placeholder ID; it is not filtered out here and still appears in the pivot below
what_type_counts = df.groupby(['WhatId', 'Type'])['Id'].count().reset_index()

In [13]:
what_types = what_type_counts.pivot(index='WhatId', columns='Type', values='Id').fillna(0)
what_types.head(3).transpose()

Type \ WhatId,000000000000000AAA,0063100000XzTINAA3,0063100000Y01KhAAJ
60_day_follow_up_call,0.0,0.0,0.0
call,0.0,0.0,0.0
demo completed,0.0,0.0,1.0
demo scheduled,0.0,0.0,0.0
email,6.0,8.0,4.0
initial contact,0.0,0.0,0.0
interested,0.0,0.0,0.0
lead qualification,0.0,0.0,0.0
lead submitted form,0.0,0.0,0.0
meeting,0.0,0.0,0.0


In [14]:
what_types.mean()

Type
60_day_follow_up_call    0.009004
call                     2.679530
demo completed           0.261852
demo scheduled           0.088938
email                    4.244579
initial contact          0.072032
interested               0.000184
lead qualification       0.019111
lead submitted form      0.015619
meeting                  0.026461
none                     0.000368
not interested           0.002756
one pager campaign       0.021316
post-demo follow up      1.268100
pre-demo follow up       1.586549
prepared materials       0.001838
dtype: float64

Email is clearly the most common activity type, averaging about 4.2 per WhatId, followed by calls at about 2.7. Activity created after the opportunity close date was already filtered out above, which should remove most CX activity, so these averages mainly reflect sales operations.

### Add lead ID, opportunity ID, and close status

In [15]:
# map WhatId to opportunity (what_types is indexed by WhatId)
what_mrgd = what_types.merge(leads_to_opportunities, left_on='WhatId', right_on='opportunity_id')

len(what_mrgd), len(what_mrgd.columns), len(what_types), len(what_types.columns)

(5441, 22, 5442, 16)

In [16]:
what_mrgd.columns

Index(['60_day_follow_up_call', 'call', 'demo completed', 'demo scheduled',
       'email', 'initial contact', 'interested', 'lead qualification',
       'lead submitted form', 'meeting', 'none', 'not interested',
       'one pager campaign', 'post-demo follow up', 'pre-demo follow up',
       'prepared materials', 'lead_id', 'opportunity_id', 'lead_date',
       'opportunity_date', 'opportunity_closedate', 'opportunity_stage'],
      dtype='object')

In [17]:
what_mrgd['opportunity_date'] = pd.to_datetime(what_mrgd['opportunity_date'])
what_mrgd['opportunity_closedate'] = pd.to_datetime(what_mrgd['opportunity_closedate'])

what_mrgd['created_to_close'] = what_mrgd['opportunity_closedate'] - what_mrgd['opportunity_date']

In [18]:
print("time delta created to close:")
print("-"*30)
for stage in what_mrgd['opportunity_stage'].unique().tolist():
    # created-to-close deltas for opportunities currently in this stage
    deltas = what_mrgd.loc[what_mrgd['opportunity_stage'] == stage, 'created_to_close']
    print("{}: {} days mean; {} days std".format(stage, deltas.mean().days, deltas.std().days))

time delta created to close:
------------------------------
Closed Lost: 130 days mean; 143 days std
Closed Won: 34 days mean; 57 days std
Discovery: 204 days mean; 366 days std
Demo Completed: 235 days mean; 154 days std
Technical Win: 97 days mean; 115 days std
Demo Scheduled: 98 days mean; 74 days std
Signup: 99 days mean; 76 days std
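
For several stages the standard deviation exceeds the mean (for example Closed Won: 34 days mean, 57 days std), which suggests skewed durations. A median alongside the mean may describe the typical timeframe better. A minimal sketch on made-up dates, using the same column names:

```python
import pandas as pd

# Toy data; the dates are made up for illustration.
toy = pd.DataFrame({
    "opportunity_stage": ["Closed Won", "Closed Won", "Closed Lost", "Closed Lost"],
    "opportunity_date": pd.to_datetime(
        ["2019-01-01", "2019-01-10", "2019-01-01", "2019-02-01"]),
    "opportunity_closedate": pd.to_datetime(
        ["2019-01-31", "2019-01-20", "2019-04-11", "2019-03-03"]),
})

# Whole days between opportunity creation and close.
toy["created_to_close_days"] = (
    toy["opportunity_closedate"] - toy["opportunity_date"]
).dt.days

# One row per stage with count, mean, median, and standard deviation.
summary = toy.groupby("opportunity_stage")["created_to_close_days"].agg(
    ["count", "mean", "median", "std"]
)
print(summary)
```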


In [19]:
cols_of_interest = ['call', 'demo completed', 'demo scheduled', 'email', 
                    'initial contact', 'interested', 'lead qualification', 'lead submitted form',
                    'meeting', 'none', 'one pager campaign', 'post-demo follow up', 'pre-demo follow up',
                    'prepared materials', 'created_to_close']

what_mrgd.groupby('opportunity_stage')[cols_of_interest].mean().transpose()

opportunity_stage,Closed Lost,Closed Won,Demo Completed,Demo Scheduled,Discovery,Signup,Technical Win
call,2.647059,2.424467,4.80427,5.588235,1.375,5.714286,2.023256
demo completed,0.265608,0.259469,0.24911,0.029412,0.25,0.142857,0.44186
demo scheduled,0.101768,0.069221,0.096085,0.0,0.0,0.285714,0.325581
email,3.930711,4.680888,3.131673,6.352941,6.75,9.142857,5.465116
initial contact,0.067124,0.047453,0.323843,0.058824,0.125,0.142857,0.046512
interested,0.0,0.000435,0.0,0.0,0.0,0.0,0.0
lead qualification,0.022375,0.014367,0.017794,0.088235,0.125,0.0,0.0
lead submitted form,0.020209,0.012625,0.0,0.0,0.0,0.0,0.0
meeting,0.024901,0.024815,0.05694,0.0,0.0,0.0,0.046512
none,0.000722,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
what_mrgd.groupby('opportunity_stage')[cols_of_interest].std().transpose()

opportunity_stage,Closed Lost,Closed Won,Demo Completed,Demo Scheduled,Discovery,Signup,Technical Win
call,3.722447,3.788573,4.499299,4.472933,1.92261,9.357961,3.195893
demo completed,0.4687,0.470079,0.472691,0.171499,0.46291,0.377964,0.547824
demo scheduled,0.337383,0.273697,0.31851,0.0,0.0,0.48795,0.474137
email,5.334538,6.204594,5.671397,7.800087,9.422617,6.618876,7.762391
initial contact,0.542595,0.357734,0.991986,0.342997,0.353553,0.377964,0.213083
interested,0.0,0.020865,0.0,0.0,0.0,0.0,0.0
lead qualification,0.170593,0.142351,0.157106,0.287902,0.353553,0.0,0.0
lead submitted form,0.148237,0.111675,0.0,0.0,0.0,0.0,0.0
meeting,0.162652,0.166415,0.247047,0.0,0.0,0.0,0.213083
none,0.026861,0.0,0.0,0.0,0.0,0.0,0.0


## Close percentage by rep

In [21]:
closed_stages = [
    'Closed Lost',
    'Closed Won',
    'Technical Win'
]
rep_closes = pd.DataFrame(df[df['opportunity_stage'].isin(closed_stages)].groupby('OwnerId')['opportunity_stage'].value_counts(normalize=True))

In [22]:
rep_closes.columns = ['stage_count']

In [23]:
rep_closes = rep_closes.reset_index().pivot(index='OwnerId', columns='opportunity_stage', values='stage_count').reset_index()

rep_closes['Closed Won'] = rep_closes['Closed Won'] + rep_closes['Technical Win']
rep_closes['Rep'] = rep_closes['OwnerId']
rep_closes['Won'] = rep_closes['Closed Won']
rep_closes['Lost'] = rep_closes['Closed Lost']
rep_closes.drop(['Technical Win', 'OwnerId', 'Closed Won', 'Closed Lost'], axis=1, inplace=True)

In [24]:
rep_closes[['Lost', 'Won']].mean()

opportunity_stage
Lost    0.619096
Won     0.451592
dtype: float64

In [25]:
rep_closes[['Rep', 'Lost', 'Won']].dropna()

opportunity_stage,Rep,Lost,Won
0,00531000006kRT2AAM,0.594927,0.405073
1,00531000007YpsBAAS,0.573964,0.426036
3,00531000007gwftAAA,0.552598,0.447402
4,00531000008TfLhAAK,0.529537,0.470463
7,0055A000006JxseQAC,0.707965,0.292035
8,0055A000008YAu0QAG,0.553226,0.446774
11,0055A000008p6HDQAY,0.565951,0.434049
12,0055A000008pdP0QAI,0.508393,0.491607
14,0055A000009U9pGQAS,0.696078,0.303922
18,005i0000001hjDIAAY,0.169082,0.830918
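
Two things are worth keeping in mind when reading the table above. The proportions are computed over activity rows, so opportunities with many activities weigh more. The pivot also leaves NaN for any rep without a given stage, which is likely why some reps drop out under `dropna()` and why the mean Lost and Won shares do not sum to 1. The sketch below is one possible alternative on made-up data, not the method used in this notebook: it counts each opportunity once per rep and fills missing stages with 0.

```python
import pandas as pd

# Toy activity rows; IDs are made up. Several rows can share one opportunity.
toy = pd.DataFrame({
    "OwnerId": ["r1", "r1", "r1", "r2", "r2", "r2"],
    "opportunity_id": ["o1", "o1", "o2", "o3", "o4", "o5"],
    "opportunity_stage": ["Closed Won", "Closed Won", "Closed Lost",
                          "Technical Win", "Closed Lost", "Closed Won"],
})

# One row per (rep, opportunity) so busy opportunities are not over-counted.
opps = toy.drop_duplicates(subset=["OwnerId", "opportunity_id"])

# Share of each stage per rep; crosstab fills absent combinations with 0.
shares = pd.crosstab(opps["OwnerId"], opps["opportunity_stage"], normalize="index")

# reindex guarantees all three stage columns exist even if one never occurs.
shares = shares.reindex(
    columns=["Closed Lost", "Closed Won", "Technical Win"], fill_value=0
)

rates = pd.DataFrame({
    "Lost": shares["Closed Lost"],
    "Won": shares["Closed Won"] + shares["Technical Win"],
})
print(rates)
```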


# Classifying the Stage for a given event.

# Load opportunity data

In [26]:
url = get_file_url("sfc-export", "opportunity_date_ranges.csv")
date_ranges = pd.read_csv(url, encoding="ISO-8859-1", low_memory=False)

# Cast the strings to pd.Timestamps
date_ranges['Initial Contact'] = pd.to_datetime(date_ranges['Initial Contact'])
date_ranges['Demo Scheduled'] = pd.to_datetime(date_ranges['Demo Scheduled'])
date_ranges['Demo Completed'] = pd.to_datetime(date_ranges['Demo Completed'])
date_ranges['Signup'] = pd.to_datetime(date_ranges['Signup'])
date_ranges['Closed Won'] = pd.to_datetime(date_ranges['Closed Won'])
date_ranges['Closed Lost'] = pd.to_datetime(date_ranges['Closed Lost'])
date_ranges['Onboarding'] = pd.to_datetime(date_ranges['Onboarding'])

In [27]:
# Prints the row of date ranges and all events associated with the given opportunity_id
def print_opportunity_data(opp_id):
    print('---------------------------- Date Ranges ----------------------------')
    display(date_ranges.loc[date_ranges['OpportunityId'] == opp_id])
    print('---------------------------- Events ----------------------------')
    display(df.loc[df['opportunity_id'] == opp_id])
    
# print_opportunity_data('0063100000cD1ySAAS')

# NOTE: Some of the `CreatedDate`s are not accurate and/or are duplicated, so this function will not cover every case.

In [28]:
def get_stage(r):
    created_date = r['CreatedDate']
    stages = date_ranges[date_ranges['OpportunityId']==r['opportunity_id']]
    
    ordered_stages = ['Initial Contact', 'Demo Scheduled', 'Demo Completed', 'Signup',
                      'Closed Won', 'Closed Lost', 'Onboarding']
    
    current_stage = 'Unknown'
    
    # verify we have date ranges for stages
    if len(stages) > 0:
        stages = stages.iloc[0]
        
        # iterate through ordered columns
        for c in ordered_stages:
            # if the stage entry date is on or before the activity creation date, store the stage name
            # (later entries in ordered_stages overwrite earlier ones)
            if pd.notnull(stages[c]) and stages[c] <= created_date:
                current_stage = c
            
    return current_stage


def get_final_state(r):
    this_opp = date_ranges[date_ranges['OpportunityId']==r['opportunity_id']]['Status']
    
    if len(this_opp) > 0:
        return this_opp.iloc[0]
    else:
        return 'Unknown'
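
The row-wise lookup in `get_stage` filters `date_ranges` once per activity, which can be slow on tens of thousands of rows. A vectorized variant might look like the sketch below. It is meant to follow the same rule (the last stage in `ordered_stages` whose date is on or before `CreatedDate`, else 'Unknown'), but it runs on made-up data with a shortened stage list and has not been checked against the real export.

```python
import pandas as pd

ordered_stages = ["Initial Contact", "Demo Scheduled", "Demo Completed"]

# Toy stand-ins for `date_ranges` and the activity frame; IDs are made up.
ranges = pd.DataFrame({
    "OpportunityId": ["o1", "o2"],
    "Initial Contact": pd.to_datetime(["2019-01-01", "2019-01-05"]),
    "Demo Scheduled": pd.to_datetime(["2019-01-10", None]),
    "Demo Completed": pd.to_datetime(["2019-01-20", None]),
})
events = pd.DataFrame({
    "opportunity_id": ["o1", "o1", "o2", "o3"],
    "CreatedDate": pd.to_datetime(
        ["2019-01-12", "2018-12-31", "2019-02-01", "2019-01-01"]),
})

# Left merge keeps events without a matching opportunity (stage dates are NaT).
# drop_duplicates keeps the first row per opportunity, like `.iloc[0]` does.
merged = events.merge(
    ranges.drop_duplicates(subset="OpportunityId"),
    how="left", left_on="opportunity_id", right_on="OpportunityId",
)

# Walk the stages in order; a later stage that has been reached overwrites
# an earlier one.
stage = pd.Series("Unknown", index=merged.index)
for col in ordered_stages:
    reached = merged[col].notna() & (merged[col] <= merged["CreatedDate"])
    stage = stage.mask(reached, col)

events["Stage"] = stage.to_numpy()
print(events)
```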

In [30]:
# You may need to reset the index since the apply method complains when you have duplicated indexes.
# df = df.reset_index()

# Applies `get_stage` to each row of df, using the date_ranges DataFrame to determine the most likely stage for a given entry.
df['Stage'] = df.apply(get_stage, axis=1)
df['Final State'] = df.apply(get_final_state, axis=1)

In [31]:
# get_stage returns 'Unknown' (not 'None') when no stage can be assigned
print('There are {} rows missing a stage classification.'.format(
    len(df[df['Stage'] == 'Unknown'])
))

print('There are {} duplicated dates for the same opportunity.'.format(
    len(df[df.duplicated(subset=['CreatedDate', 'opportunity_id'])])
))

print('{:.2f}% of events are unclassified.'.format(
     100 * (len(df[df['Stage'] == 'Unknown']) / len(df))
))

There are 26148 rows missing a stage classification.
There are 2573 duplicated dates for the same opportunity.
46.66% of events are unclassified.


In [32]:
df['Stage'].value_counts()

Unknown            26148
Demo Completed     17070
Initial Contact     7502
Signup              2588
Demo Scheduled      1727
Closed Lost          830
Onboarding           157
Closed Won            21
Name: Stage, dtype: int64

In [33]:
df['Final State'].value_counts()

Lost       28936
Won        23347
Unknown     3760
Name: Final State, dtype: int64

In [51]:
stage_type_counts = df.fillna(0).groupby(['Final State', 'Stage', 'Type'])['Id'].count().reset_index()
stage_type_counts.columns = ['Final State', 'Stage', 'Type', 'Count']

In [52]:
cols_of_interest = ['call', 'email', 'lead qualification', 'lead submitted form', 'meeting', 'not interested']

In [53]:
print("Losses")
lost_counts = stage_type_counts[stage_type_counts['Final State']=='Lost']
# NOTE: the denominator is the number of (Stage, Type) rows for lost opportunities,
# not the number of lost opportunities, so these are scaled counts rather than per-opportunity averages
losses_tbl = lost_counts.drop('Final State', axis=1).pivot(index="Stage", columns="Type", values="Count")[cols_of_interest].transpose() / len(lost_counts)
losses_tbl[['Initial Contact', 'Demo Scheduled', 'Demo Completed', 'Signup', 'Closed Lost', 'Unknown']]

Losses


Type \ Stage,Initial Contact,Demo Scheduled,Demo Completed,Signup,Closed Lost,Unknown
call,10.431034,8.189655,59.568966,1.689655,4.327586,43.655172
email,28.741379,4.775862,44.448276,5.948276,3.948276,100.103448
lead qualification,0.068966,0.086207,0.344828,,0.017241,0.568966
lead submitted form,,,,,,0.965517
meeting,0.12069,0.017241,0.137931,,0.017241,0.896552
not interested,0.172414,,0.051724,,,0.017241


In [54]:
print("Wins")
won_counts = stage_type_counts[stage_type_counts['Final State']=='Won']
# NOTE: as with the losses table, the denominator is the number of (Stage, Type) rows,
# not the number of won opportunities
wins_tbl = won_counts.drop('Final State', axis=1).pivot(index="Stage", columns="Type", values="Count")[cols_of_interest].transpose() / len(won_counts)
wins_tbl[['Initial Contact', 'Demo Scheduled', 'Demo Completed', 'Signup', 'Closed Lost', 'Closed Won', 'Unknown']]

Wins


Type \ Stage,Initial Contact,Demo Scheduled,Demo Completed,Signup,Closed Lost,Closed Won,Unknown
call,7.873016,2.253968,39.587302,3.873016,1.079365,0.126984,33.873016
email,20.666667,2.84127,43.52381,15.968254,1.539683,0.063492,86.238095
lead qualification,0.031746,,0.079365,,,,0.412698
lead submitted form,,,,,,,0.460317
meeting,0.111111,0.015873,0.126984,0.031746,,,0.634921
not interested,,,,,,,0.015873
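
To compare wins and losses on the same footing, the counts could instead be divided by the number of distinct opportunities in each final state. The sketch below illustrates that normalization on made-up data. `per_opportunity_table` is a hypothetical helper, and whether this is the intended metric is an assumption.

```python
import pandas as pd

# Toy classified activity rows; IDs are made up.
toy = pd.DataFrame({
    "opportunity_id": ["o1", "o1", "o1", "o2", "o2", "o3"],
    "Final State": ["Lost", "Lost", "Lost", "Lost", "Lost", "Won"],
    "Stage": ["Initial Contact", "Initial Contact", "Demo Completed",
              "Initial Contact", "Demo Completed", "Initial Contact"],
    "Type": ["email", "call", "email", "email", "email", "call"],
})


def per_opportunity_table(frame, final_state):
    """Average number of activities per opportunity, by Type (rows) and Stage (columns)."""
    subset = frame[frame["Final State"] == final_state]
    n_opps = subset["opportunity_id"].nunique()
    counts = pd.crosstab(subset["Type"], subset["Stage"])
    return counts / n_opps


losses = per_opportunity_table(toy, "Lost")
wins = per_opportunity_table(toy, "Won")
print(losses)
```

With this scaling, a cell reads as "activities of this type per opportunity while in this stage", and absent combinations show as 0 rather than NaN.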
