In [1]:
import pandas as pd

# Goal

 This notebook aims to compare 3 different attribution models applied to the dataset 'homepage_visits.csv'. The models are First-Touch, Last-Touch and Linear.
 
 The three models produce similar channel shares for this dataset (the largest gap between any two models is about 3 percentage points, for channel_8), so the recommendation is to proceed with Last-Touch attribution, which is simple and widely adopted in the industry.

# Data Preparation

In [2]:
# Load the data
df = pd.read_csv('homepage_visits.csv')

In [3]:
df

Unnamed: 0,EVENT_DATETIME,VISITOR_ID,ACQUISITION_CHANNEL
0,2021-10-19 08:40:21.204000000,385685697470,channel_8
1,2021-10-30 11:37:24.650000000,490063861180,channel_4
2,2021-10-29 18:29:08.754000000,489723674348,channel_4
3,2021-10-02 21:17:11.035000000,473638768926,channel_1
4,2021-10-16 07:11:35.390000000,480995855292,channel_4
...,...,...,...
251457,2021-10-03 02:57:21.317000000,473744387526,channel_8
251458,2021-10-27 07:42:56.317000000,488328212466,channel_1
251459,2021-10-20 04:40:34.164000000,483285124336,channel_1
251460,2021-10-10 19:50:51.873000000,477883496040,channel_4


In [4]:
# Convert EVENT_DATETIME to datetime object for easier manipulation
df['EVENT_DATETIME'] = pd.to_datetime(df['EVENT_DATETIME'])

# Sort by VISITOR_ID and EVENT_DATETIME to simulate the journey per visitor
df = df.sort_values(by=['VISITOR_ID', 'EVENT_DATETIME'])

In [5]:
# Pick one touch per visitor: the first (first-touch) or the last (last-touch)
def attribution_model(df, model='last_touch'):
    # Assumes df is already sorted by VISITOR_ID and EVENT_DATETIME
    grouped = df.groupby('VISITOR_ID')

    if model == 'last_touch':
        # Take the last channel for each visitor
        touches = grouped.tail(1)
    elif model == 'first_touch':
        # Take the first channel for each visitor
        touches = grouped.head(1)
    else:
        # Fail loudly instead of silently returning an empty DataFrame
        raise ValueError(f"Unknown attribution model: {model!r}")

    attribution_df = touches[['VISITOR_ID', 'ACQUISITION_CHANNEL']].copy()
    return attribution_df.rename(columns={'ACQUISITION_CHANNEL': 'ATTRIBUTION_CHANNEL'})
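
To make the first-touch and last-touch logic concrete, here is a minimal sketch on a hypothetical toy dataset (the four rows below are invented for illustration and are not taken from 'homepage_visits.csv'). It shows why the earlier sort matters: `head(1)` and `tail(1)` only mean "earliest" and "latest" touch once each visitor's rows are in chronological order.

```python
import pandas as pd

# Hypothetical toy data (not from homepage_visits.csv), deliberately out of order.
toy = pd.DataFrame({
    "VISITOR_ID": [1, 1, 1, 2],
    "EVENT_DATETIME": pd.to_datetime([
        "2021-10-01 09:00", "2021-10-03 12:00", "2021-10-02 10:00", "2021-10-05 08:00",
    ]),
    "ACQUISITION_CHANNEL": ["channel_1", "channel_8", "channel_4", "channel_4"],
})

# Sorting first is what makes head(1)/tail(1) mean "earliest"/"latest" touch.
toy = toy.sort_values(by=["VISITOR_ID", "EVENT_DATETIME"])

first_touch = toy.groupby("VISITOR_ID").head(1).set_index("VISITOR_ID")["ACQUISITION_CHANNEL"]
last_touch = toy.groupby("VISITOR_ID").tail(1).set_index("VISITOR_ID")["ACQUISITION_CHANNEL"]

print(first_touch.to_dict())  # {1: 'channel_1', 2: 'channel_4'}
print(last_touch.to_dict())   # {1: 'channel_8', 2: 'channel_4'}
```

Visitor 1 is credited to channel_1 under first-touch and to channel_8 under last-touch; visitor 2 has a single touch, so both models agree.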

In [6]:
# Applying last-touch attribution
last_touch_df = attribution_model(df, model='last_touch')

# Applying first-touch attribution
first_touch_df = attribution_model(df, model='first_touch')

# Analyze the results, for example, by counting occurrences of each channel
last_touch_counts = last_touch_df['ATTRIBUTION_CHANNEL'].value_counts()
first_touch_counts = first_touch_df['ATTRIBUTION_CHANNEL'].value_counts()

In [10]:
# Create the linear attribution model
def linear_attribution(df):
    # Group by VISITOR_ID to find all channels a visitor touched
    grouped = df.groupby('VISITOR_ID')
    
    # Create a list to store the attribution results
    attribution_records = []

    # For each visitor, find their unique acquisition channels
    for visitor_id, group in grouped:
        unique_channels = group['ACQUISITION_CHANNEL'].unique()
        # Equal credit for each distinct channel; repeat visits through the same channel count once
        credit_per_channel = 1 / len(unique_channels)

        # Assign credit to each channel and store it in the list
        for channel in unique_channels:
            attribution_records.append({'ACQUISITION_CHANNEL': channel,
                                        'ATTRIBUTION_CREDIT': credit_per_channel})
    
    # Convert the list of records to a DataFrame
    attribution_df = pd.DataFrame(attribution_records)
    
    # Aggregate the attribution credits per channel
    total_attribution = attribution_df.groupby('ACQUISITION_CHANNEL')['ATTRIBUTION_CREDIT'].sum().reset_index()

    return total_attribution
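
The per-visitor loop above can be slow on a few hundred thousand rows. The following is a possible vectorized sketch of the same idea, assuming the same column names; the function name `linear_attribution_vectorized` is hypothetical and is not used elsewhere in this notebook. It keeps the same definition of "linear" as the loop: one unit of credit per visitor, split equally across that visitor's distinct channels.

```python
import pandas as pd

def linear_attribution_vectorized(df):
    # One row per (visitor, channel) pair, mirroring .unique() in the loop version.
    pairs = df[["VISITOR_ID", "ACQUISITION_CHANNEL"]].drop_duplicates()
    # Number of distinct channels each visitor touched, broadcast back to every row.
    n_channels = pairs.groupby("VISITOR_ID")["ACQUISITION_CHANNEL"].transform("size")
    # Each visitor's single unit of credit is split equally across those channels.
    pairs = pairs.assign(ATTRIBUTION_CREDIT=1.0 / n_channels)
    return (
        pairs.groupby("ACQUISITION_CHANNEL")["ATTRIBUTION_CREDIT"]
        .sum()
        .reset_index()
    )

# Hypothetical toy data: visitor 1 touches channel_1 twice and channel_4 once.
toy = pd.DataFrame({
    "VISITOR_ID": [1, 1, 1, 2],
    "ACQUISITION_CHANNEL": ["channel_1", "channel_4", "channel_1", "channel_4"],
})
credit = linear_attribution_vectorized(toy).set_index("ACQUISITION_CHANNEL")["ATTRIBUTION_CREDIT"]
print(credit.to_dict())  # {'channel_1': 0.5, 'channel_4': 1.5}
```

Visitor 1 contributes 0.5 to each of two channels and visitor 2 contributes 1.0 to channel_4, so the credits sum to the number of visitors.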

In [11]:
# Apply the linear attribution model
linear_attribution_df = linear_attribution(df)

In [13]:
## Combine the models in one dataframe ##

# First, convert the first_touch_counts and last_touch_counts into DataFrames for merging
first_touch_df = first_touch_counts.reset_index()
first_touch_df.columns = ['ACQUISITION_CHANNEL', 'FIRST_TOUCH_COUNT']

last_touch_df = last_touch_counts.reset_index()
last_touch_df.columns = ['ACQUISITION_CHANNEL', 'LAST_TOUCH_COUNT']

# Merge the three DataFrames: first_touch_df, last_touch_df, and linear_attribution_df
comparison_df = pd.merge(first_touch_df, last_touch_df, on='ACQUISITION_CHANNEL', how='outer')
comparison_df = pd.merge(comparison_df, linear_attribution_df, on='ACQUISITION_CHANNEL', how='outer')

# Fill any missing values with 0 (in case some channels are missing in one model)
comparison_df = comparison_df.fillna(0)

# Sort by last-touch count for easier viewing; change the sort key as needed
comparison_df = comparison_df.sort_values(by='LAST_TOUCH_COUNT', ascending=False)

# Results

In [14]:
# Styling the comparison DataFrame
styled_df = comparison_df[["ACQUISITION_CHANNEL","FIRST_TOUCH_COUNT","LAST_TOUCH_COUNT","ATTRIBUTION_CREDIT"]].style.hide(axis='index')\
    .bar(subset=['FIRST_TOUCH_COUNT'], color="#63eb80", align='mid')\
    .bar(subset=['LAST_TOUCH_COUNT'], color="#0569ff", align='mid')\
    .bar(subset=['ATTRIBUTION_CREDIT'], color='#55d794', align='mid')\
    .format(precision=0, thousands=',', formatter={
        'FIRST_TOUCH_COUNT': "{:.0f}",
        'LAST_TOUCH_COUNT': "{:.0f}",
        'ATTRIBUTION_CREDIT': "{:.2f}"
    })

# Display the styled DataFrame
styled_df


ACQUISITION_CHANNEL,FIRST_TOUCH_COUNT,LAST_TOUCH_COUNT,ATTRIBUTION_CREDIT
channel_1,57640,56360,56893.87
channel_4,53347,49587,51091.8
channel_8,17974,22372,20409.77
channel_2,13046,10930,11797.82
channel_11,3138,5881,4984.82
channel_3,2539,2836,2641.77
channel_6,1494,1292,1400.47
channel_5,457,410,431.63
channel_10,224,246,239.37
channel_7,219,164,186.7
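
A quick sanity check on this table: each of the three models hands out exactly one unit of credit per visitor, so the three columns should sum to the same total. The sketch below re-enters the values from the table above by hand (the variable names `table` and `totals` are illustrative only) and sums each column.

```python
import pandas as pd

# Values copied from the results table above.
table = pd.DataFrame(
    [
        ("channel_1", 57640, 56360, 56893.87),
        ("channel_4", 53347, 49587, 51091.80),
        ("channel_8", 17974, 22372, 20409.77),
        ("channel_2", 13046, 10930, 11797.82),
        ("channel_11", 3138, 5881, 4984.82),
        ("channel_3", 2539, 2836, 2641.77),
        ("channel_6", 1494, 1292, 1400.47),
        ("channel_5", 457, 410, 431.63),
        ("channel_10", 224, 246, 239.37),
        ("channel_7", 219, 164, 186.70),
    ],
    columns=["ACQUISITION_CHANNEL", "FIRST_TOUCH_COUNT", "LAST_TOUCH_COUNT", "ATTRIBUTION_CREDIT"],
)

# Sum each model's column; all three should match the number of unique visitors.
totals = table[["FIRST_TOUCH_COUNT", "LAST_TOUCH_COUNT", "ATTRIBUTION_CREDIT"]].sum()
print(totals.round(2))
```

First-touch and last-touch both total 150,078, and the linear credits total 150,078.02; the small difference comes from the credits being rounded to two decimals in the table. This shared total is what makes the percentage comparison in the next cell meaningful.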


In [15]:
# First, calculate total counts/credits for each model
total_first_touch = comparison_df['FIRST_TOUCH_COUNT'].sum()
total_last_touch = comparison_df['LAST_TOUCH_COUNT'].sum()
total_linear_credit = comparison_df['ATTRIBUTION_CREDIT'].sum()

# Add new columns to the DataFrame for percentages
comparison_df['FIRST_TOUCH_PERCENT'] = (comparison_df['FIRST_TOUCH_COUNT'] / total_first_touch) * 100
comparison_df['LAST_TOUCH_PERCENT'] = (comparison_df['LAST_TOUCH_COUNT'] / total_last_touch) * 100
comparison_df['LINEAR_ATTRIBUTION_PERCENT'] = (comparison_df['ATTRIBUTION_CREDIT'] / total_linear_credit) * 100

# Now apply styling to the percentage columns
styled_df = comparison_df[["ACQUISITION_CHANNEL","FIRST_TOUCH_PERCENT","LAST_TOUCH_PERCENT","LINEAR_ATTRIBUTION_PERCENT"]].style.hide(axis='index')\
    .bar(subset=['FIRST_TOUCH_PERCENT'], color="#63eb80", align='mid')\
    .bar(subset=['LAST_TOUCH_PERCENT'], color="#0569ff", align='mid')\
    .bar(subset=['LINEAR_ATTRIBUTION_PERCENT'], color='#55d794', align='mid')\
    .format(precision=2, thousands=',', formatter={
        'FIRST_TOUCH_PERCENT': "{:.2f}%",
        'LAST_TOUCH_PERCENT': "{:.2f}%",
        'LINEAR_ATTRIBUTION_PERCENT': "{:.2f}%"
    })

# Display the styled DataFrame with percentages
styled_df



ACQUISITION_CHANNEL,FIRST_TOUCH_PERCENT,LAST_TOUCH_PERCENT,LINEAR_ATTRIBUTION_PERCENT
channel_1,38.41%,37.55%,37.91%
channel_4,35.55%,33.04%,34.04%
channel_8,11.98%,14.91%,13.60%
channel_2,8.69%,7.28%,7.86%
channel_11,2.09%,3.92%,3.32%
channel_3,1.69%,1.89%,1.76%
channel_6,1.00%,0.86%,0.93%
channel_5,0.30%,0.27%,0.29%
channel_10,0.15%,0.16%,0.16%
channel_7,0.15%,0.11%,0.12%


# Analysis

**Top Acquisition Channels**:

- Channel 1 has the highest linear attribution credit (56,893.87) and also ranks first under first-touch and last-touch, making it the dominant channel in the dataset.
- Channel 4 follows closely with 51,091.80 in linear attribution credit.


**Discrepancies Between First-Touch and Last-Touch**:

- Channel 8 shifts significantly (+4,398, from 17,974 to 22,372) from first-touch to last-touch, suggesting it is more effective at closing conversions than at initiating them.
- Channel 11 also shows a notable shift (+2,743, from 3,138 to 5,881), suggesting that its role is in finalizing conversions rather than in early acquisition.
- Channel 1 is relatively stable (-1,280, roughly 2% of its first-touch count), meaning its role is balanced across the funnel. Channel 2 drops by 2,116 (roughly 16%), so it leans toward the start of the journey.

**First-Touch vs. Attribution Credit**:

- Channels 1, 4, and 2 have higher first-touch counts than their linear attribution credit, meaning these channels may be good at generating initial traffic but could be losing conversions later.
- Channels 8 and 11 have lower first-touch counts than their linear attribution credit, suggesting that they contribute more effectively in later-stage conversions.

**Last-Touch vs. Attribution Credit**:

- Channel 8 (+1,962) and Channel 11 (+896) have higher last-touch counts than their linear attribution credit, reinforcing their strong role in finalizing conversions.
- Channel 1 (-534) and Channel 4 (-1,505) have lower last-touch counts than their linear attribution credit, indicating that their impact might be spread across multiple touchpoints instead of concentrated in the final interaction.
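
The differences quoted in this analysis can be reproduced from the results table. The sketch below re-enters the five channels discussed here by hand; the column names `LAST_MINUS_FIRST` and `LAST_MINUS_LINEAR` are made up for this illustration and do not exist in `comparison_df`.

```python
import pandas as pd

# Values copied from the results table for the five channels discussed above.
results = pd.DataFrame({
    "ACQUISITION_CHANNEL": ["channel_1", "channel_4", "channel_8", "channel_2", "channel_11"],
    "FIRST_TOUCH_COUNT": [57640, 53347, 17974, 13046, 3138],
    "LAST_TOUCH_COUNT": [56360, 49587, 22372, 10930, 5881],
    "ATTRIBUTION_CREDIT": [56893.87, 51091.80, 20409.77, 11797.82, 4984.82],
}).set_index("ACQUISITION_CHANNEL")

# Shift from first-touch to last-touch (positive = more often the final touch).
results["LAST_MINUS_FIRST"] = results["LAST_TOUCH_COUNT"] - results["FIRST_TOUCH_COUNT"]
# Last-touch count relative to linear credit.
results["LAST_MINUS_LINEAR"] = (results["LAST_TOUCH_COUNT"] - results["ATTRIBUTION_CREDIT"]).round(2)

print(results[["LAST_MINUS_FIRST", "LAST_MINUS_LINEAR"]])
```

On these inputs, channel_8 and channel_11 show positive values in both columns, while channel_1, channel_4, and channel_2 show negative values in both.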

**Recommendations**:

- Optimize Early-Stage Channels (Channel 1, Channel 4, Channel 2): These channels drive strong initial engagement, so focus on improving retention and conversion strategies.
- Leverage Closing-Stage Channels (Channel 8, Channel 11): Since they perform well in the last touch, consider allocating budget for retargeting and remarketing campaigns.
- Analyze Attribution Models Further: The discrepancies between first-touch, last-touch, and linear attribution suggest that deeper funnel insights can be explored. Analyzing users' journeys would complement this analysis.
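
As one possible starting point for the journey analysis suggested above, the sketch below summarizes each visitor's path. It assumes the same column names as the dataset; the function `journey_summary` and its output columns (`TOUCHES`, `DISTINCT_CHANNELS`, `PATH`) are hypothetical and were not part of the analysis above.

```python
import pandas as pd

def journey_summary(df):
    """Per visitor: number of touches, number of distinct channels, and ordered channel path."""
    ordered = df.sort_values(by=["VISITOR_ID", "EVENT_DATETIME"])
    return ordered.groupby("VISITOR_ID")["ACQUISITION_CHANNEL"].agg(
        TOUCHES="count",
        DISTINCT_CHANNELS="nunique",
        PATH=lambda s: " > ".join(s),
    )

# Hypothetical toy data, not taken from homepage_visits.csv.
toy = pd.DataFrame({
    "VISITOR_ID": [1, 1, 1, 2],
    "EVENT_DATETIME": pd.to_datetime([
        "2021-10-01 09:00", "2021-10-02 10:00", "2021-10-03 12:00", "2021-10-05 08:00",
    ]),
    "ACQUISITION_CHANNEL": ["channel_1", "channel_4", "channel_1", "channel_4"],
})

summary = journey_summary(toy)
print(summary)

# Share of visitors who touched more than one channel.
multi_channel_share = (summary["DISTINCT_CHANNELS"] > 1).mean()
print(multi_channel_share)  # 0.5
```

If most visitors turn out to have a single channel, that would help explain why the three attribution models give such similar results on this dataset.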