This method was chosen in the interest of meeting deadlines.  It is not the most effective method, but because it is based on assumptions that can be defined it is a good demonstration of a quick solution.  It may also be useful as a comparison to methods that require a greater time investment, for the sake of understanding how the perspective shifts based on the complexity of the method used.

The assumptions:

1. The oldest touchpoint is more valuable than any other non-converting touchpoints because it is the onset of awareness.
2. Together, non-converting touchpoints are worth the same as or less than the converting touchpoint.
3. The converting touchpoint gets at least 50% of the credit because something about it caused a decision none of the others did.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
from itertools import combinations
%matplotlib inline

path = 'C:/Users/alexj/Documents/Ad & Retail Data Projects'
file = 'marketing_dummy_data.csv'
filepath = path + '/' + file
data = pd.read_csv(filepath)

#data.head()
#data.describe()
#data.info()

In [2]:
#  fix dates so they are datetime values

date_columns = ['date_subscribed', 'date_canceled', 'date_of_touch', 'date_alternative_touch']

for col in date_columns:
    data[col] = pd.to_datetime(data[col], errors='coerce')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10037 entries, 0 to 10036
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   user_id                 10037 non-null  object        
 1   marketing_channel       10022 non-null  object        
 2   variant                 10037 non-null  object        
 3   language_displayed      10037 non-null  object        
 4   language_preferred      10037 non-null  object        
 5   age_group               10037 non-null  object        
 6   subscribing_channel     1856 non-null   object        
 7   ab_test                 10037 non-null  object        
 8   date_subscribed         7390 non-null   datetime64[ns]
 9   date_canceled           6273 non-null   datetime64[ns]
 10  date_of_touch           10037 non-null  datetime64[ns]
 11  converted               10037 non-null  bool          
 12  converted_at_touch      10037 non-null  bool  

In [3]:
#  sort the data to ensure chronological order
data = data.sort_values(by=['user_id', 'date_of_touch'])

data.head()

Unnamed: 0,user_id,marketing_channel,variant,language_displayed,language_preferred,age_group,subscribing_channel,ab_test,date_subscribed,date_canceled,date_of_touch,converted,converted_at_touch,converted_counter,retained,retained_counter,retention_length,date_alternative_touch
2380,a100000001,Social,control,English,English,GZ,,A,NaT,NaT,2020-10-28,False,False,0,True,1,,2019-05-12
2381,a100000002,Influencer,control,English,English,Mil,,B,2018-04-27,2019-01-19,2018-04-23,True,False,1,False,0,267.0,2018-11-11
2382,a100000003,Influencer,control,English,English,Mil,,A,NaT,NaT,2023-07-09,False,False,0,True,1,,2022-04-22
1815,a100000004,Streaming,control,English,English,GZ,,B,2022-06-03,2024-04-05,2020-01-13,True,False,1,False,0,672.0,2018-01-03
2383,a100000005,Social,control,English,English,GX,,B,2021-09-07,2021-10-17,2020-06-19,True,False,1,False,0,40.0,2021-06-27


In [4]:
data = data[['user_id', 'date_of_touch', 'date_subscribed', 'marketing_channel', 'converted', 'ab_test']]

In [5]:
data.describe()

Unnamed: 0,date_of_touch,date_subscribed
count,10037,7390
mean,2020-06-07 00:07:53.448241408,2021-04-24 22:51:01.217861888
min,2018-01-01 00:00:00,2018-01-01 00:00:00
25%,2018-12-03 00:00:00,2019-08-22 00:00:00
50%,2020-01-23 00:00:00,2021-04-27 00:00:00
75%,2021-09-15 00:00:00,2022-12-25 18:00:00
max,2024-09-02 00:00:00,2024-09-02 00:00:00


In [6]:
#  position based attribution (U-Shaped)
#  first and last get 40 % each, middle gets remaining %
#  gives preference to the first and last points
#  my behavioral intuition says setting this to 30 - 20 - 60 is a good representation of reality
#  the rational behind my intution is that first contact initiates interest
#  making it more valuable than any in between, but less valuable than last, because last caused the conversion


df = data.copy()

# Calculate position weights
def assign_weights(touchpoints):
    n = len(touchpoints)
    if n == 1:
        return [1]
    if n == 2:
        return [0.5, 0.5]
    middle_weight = 0.2 / (n - 2)
    weights = [0.3] + [middle_weight] * (n - 2) + [0.5]
    return weights

# Sort the data for sequential processing
df = df.sort_values(['user_id', 'date_of_touch'])
# Apply position weights
df['credit'] = df.groupby('user_id')['date_of_touch'].transform (assign_weights).explode().reset_index(drop=True)
    
# Sum the credits for each marketing channel
attribution_table = df.groupby(['marketing_channel', 'ab_test'])['credit'].sum().reset_index().sort_values(by='credit', ascending=False)

attribution_table

Unnamed: 0,marketing_channel,ab_test,credit
8,Streaming,A,1909.023333
9,Streaming,B,1560.4
3,Influencer,B,721.551667
7,Social,B,707.065
6,Social,A,599.603333
2,Influencer,A,583.49
4,Push,A,384.65
5,Push,B,372.225
1,Email,B,286.491667
0,Email,A,169.5


In [7]:
#  make it into a file for other tools

attribution_table.to_csv('heuristic_spend_table.csv', index=False)

In [8]:
#  export prepared dataset for other tools

data.to_csv('spend_analysis_data.csv', index=False)