# Reach Curve Modeling Based On Timespends Data
 
This notebook explores the timespends data and models the reach curve.

## Objectives:
1. Load and explore data
2. Exploratory data analysis
3. Research
4. Reach curve modeling
5. Evaluation method
6. Reach optimization
7. Recommendations and insights

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

## 1. Load and explore data

In [None]:
timespends = pd.read_csv("../data/timespends.csv", index_col=0)

TOTAL_NUM_SAMPLES = len(timespends)
print(f"No. of samples: {TOTAL_NUM_SAMPLES}")
print(f"Sample:\n {timespends.head()}")


plt.plot(sorted(timespends['timespends']))
plt.title("Timespends (sorted)")
plt.show()

### Basic observations

The entire population comprises 10,333 samples; for each one, the average time spent on a particular medium was measured.

The plot above shows that about 20% of the sampled population doesn't use the medium at all. 
There is no way to reach these users through this medium, which sets an upper limit of around 80% for the reach curve.

At the other end, about 15% of the population spends more time on this medium than any other group. 
This group will likely contain the highest number of individuals who see an advertisement multiple times. 
As a result, the reach curve flattens even as the number of impressions keeps increasing.
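
Both observations can be checked numerically. The sketch below is only an illustration: it runs on a small made-up array (`toy_timespends`) rather than the real data, and the helper name `reach_ceiling_and_concentration` is hypothetical.

```python
import numpy as np

def reach_ceiling_and_concentration(timespends, top_share=0.15):
    """Return (ceiling_pct, time_share_pct) for an array of timespends.

    ceiling_pct:    percentage of users with non-zero timespend, i.e. the
                    upper limit of the reach curve.
    time_share_pct: percentage of the total time that belongs to the
                    heaviest `top_share` fraction of users.
    """
    t = np.asarray(timespends, dtype=float).ravel()
    ceiling_pct = np.mean(t > 0) * 100
    n_top = max(1, int(round(top_share * len(t))))
    top_time = np.sort(t)[::-1][:n_top].sum()
    time_share_pct = top_time / t.sum() * 100
    return ceiling_pct, time_share_pct

# Toy data (assumption): 2 of 10 users never use the medium.
toy_timespends = np.array([0, 0, 1, 1, 2, 2, 3, 5, 10, 26])
ceiling_pct, time_share_pct = reach_ceiling_and_concentration(
    toy_timespends, top_share=0.2
)
print(f"Reach ceiling: {ceiling_pct:.1f}% of users")
print(f"Top 20% of users hold {time_share_pct:.1f}% of the total time")
```

The more of the total time the heaviest users hold, the more impressions land on people who were already reached, which is what flattens the curve.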

## 2. Exploratory Data Analysis

In [None]:
timespends['timespends'].value_counts().head(15)

In [None]:
print("Zero time users are {:.2f} % of all users.".format(np.sum(timespends['timespends'] == 0) / len(timespends) * 100))

In [None]:
N_CLUSTERS = 7
X = timespends['timespends'].to_numpy().reshape(-1, 1)
kmedoids = KMedoids(n_clusters=N_CLUSTERS, random_state=0, method='pam').fit(X)

In [None]:
clustered_data = timespends.copy()
clustered_data['cluster'] = kmedoids.predict(X)
# Look up each row's medoid value directly instead of a row-wise apply.
clustered_data['cluster-timespend'] = kmedoids.cluster_centers_[clustered_data['cluster'].to_numpy(), 0]
clustered_data.head()

In [None]:
plt.plot(sorted(clustered_data['cluster-timespend']))
plt.title("Timespends (sorted) - clustered")

In [None]:
df = pd.DataFrame(clustered_data['cluster-timespend'].value_counts())
df.reset_index(inplace=True)
df.sort_values(by='cluster-timespend', inplace=True, ascending=False)
df['perc_of_total'] = df['count'] / TOTAL_NUM_SAMPLES * 100
df['perc_of_total_cum'] = np.cumsum(df['perc_of_total'].to_numpy())
df

### Observations
1. We have 7 main user groups defined.
2. The maximum value the reach curve can achieve is `79.45%`.
3. The minimum value is 0. It would be interesting to estimate this worst-case scenario somehow, i.e., the chance that no one receives even a single impression.
4. The reach curve is a non-decreasing curve because an impression, once received, cannot be unreceived (or taken back).

## 3. Research

### Articles - a brief literature overview
1. [Reach Measurement, Optimization and Frequency Capping In Targeted Online Advertising Under k-Anonymity](https://arxiv.org/pdf/2501.04882v1)
2. [Estimating reach curves from one data point](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43218.pdf)
3. [Privacy-centric Cross-publisher Reach and Frequency Estimation Via Vector of Counts](https://storage.googleapis.com/gweb-research2023-media/pubtools/6039.pdf#cite.kreuter2020privacy)
4. [Privacy-Preserving Secure Cardinality and Frequency Estimation](https://storage.googleapis.com/gweb-research2023-media/pubtools/5611.pdf)
5. [Virtual People: Actionable Reach Modeling](https://research.google/pubs/virtual-people-actionable-reach-modeling/)
6. [Measuring Cross-Device Online Audiences](https://research.google/pubs/measuring-cross-device-online-audiences/)
7. [Scalable Multi-objective Optimization in Programmatic Advertising via Feedback Control](https://www.researchgate.net/publication/356879928_Scalable_Multi-objective_Optimization_in_Programmatic_Advertising_via_Feedback_Control)

These are articles I found while searching for topics related to reach curve modeling. While there's only one curve, it's a surprisingly broad subject with fascinating modeling.

## 4. Model and simulation of reach curve
To better understand the problem, it's worth running a simulation.

The way we model the reach curve will largely depend on the media type. \
TV differs from online platforms: on TV a single impression can reach multiple viewers who can't be told apart, while on a platform the same impression may appear across a user's various accounts. \
On these platforms, you pay for every impression, but if the user is not anonymous, you can gather data on whether a specific ad has been previously viewed.

In this notebook, the idea is to treat impressions as something **requested** by a user of a particular medium. There is no tracking of whether a user has already seen the ad, so repeat views are possible.

## Mathematical Framework

### Some definitions
- $\Omega = \{0, 1, ..., N - 1\}$ - user population, where $N = 10333$
- random variable $T: \Omega \to \mathbb{R}$, where $T(i)$ is the average timespend of user $i$
- random variable $Y: \Omega \to \mathbb{R}$, where $Y(i) = p_i$ is the probability that an impression will be requested by user $i$ (?) \
it can be seen as a function of $T$; one of the simplest ideas is:
$$
Y(i) = \frac{T(i)}{\sum_{j \in \Omega} T(j)} 
$$
- random variable $X_n: \Omega \to \mathbb{R}$ to model the result of requesting the $n^{\text{th}}$ impression (?)
- $S_n \subseteq \Omega$ - set of users reached after $n$ impressions
- $R(n) = |S_n| / |\Omega| \cdot 100\%$ - reach after $n$ impressions

### Reach Function
- $S_0 = \emptyset$
- $S_{n+1} = S_n \cup \{? \}$ <- open question: how to model the outcome of $X_n$; ideally I would like to define a map and obtain $S_n$ as its image
- $|S_n|$ is non-decreasing in $n$ with $|S_n| \leq |A|$, where $A = \{i \in \Omega : Y(i) > 0\}$
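
One possible way to get the upper limit out of the model itself is sketched below. It rests on an assumption that the definitions above do not fix: the draws $X_1, X_2, \dots$ are independent, take values in $\Omega$ with $P(X_k = i) = Y(i) = p_i$, and $S_n = \{X_1, \dots, X_n\}$.

$$
P(i \notin S_n) = (1 - p_i)^n
\quad \Longrightarrow \quad
E\,|S_n| = \sum_{i \in \Omega} \left(1 - (1 - p_i)^n\right)
$$

Under this assumption, users with $p_i = 0$ contribute nothing to the sum, and every user with $p_i > 0$ contributes a term that increases towards $1$ as $n \to \infty$. So $E\,|S_n|$ is non-decreasing in $n$ and converges to $|A|$, the number of users with non-zero probability.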

!!! Note: I still struggle a little to formalize the problem, so the description is incomplete.

TODO:
1. Probably try a brute-force computation of all states for a given number $n$ of impressions
2. Draw the states for $\Omega$ with $N = 3, 4$ users
3. I would like the upper limit for $R$ to follow from the model as a conclusion, not from common sense
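
A minimal sketch of the first two TODO items, assuming impressions are independent draws of a user index with probability proportional to timespend (the same assumption the simulation below makes). The function names and the three-user toy population are hypothetical. The brute-force value is compared with the sum of `1 - (1 - p_i) ** n`, which is the expression `compute_expected_reach` relies on.

```python
from itertools import product

import numpy as np

def exact_expected_reach(probabilities, n_impressions):
    """Enumerate every sequence of draws and average the number of unique users."""
    p = np.asarray(probabilities, dtype=float)
    expected = 0.0
    for sequence in product(range(len(p)), repeat=n_impressions):
        # Probability of this exact sequence of independent draws.
        sequence_prob = np.prod(p[list(sequence)])
        expected += sequence_prob * len(set(sequence))
    return expected

def closed_form_expected_reach(probabilities, n_impressions):
    """Sum over users of the probability of being drawn at least once."""
    p = np.asarray(probabilities, dtype=float)
    return float(np.sum(1 - (1 - p) ** n_impressions))

# Toy population (assumption): N = 3 users, one of them inactive.
toy_timespends = np.array([0.0, 1.0, 3.0])
toy_p = toy_timespends / toy_timespends.sum()

for n in range(1, 6):
    brute = exact_expected_reach(toy_p, n)
    closed = closed_form_expected_reach(toy_p, n)
    print(f"n={n}: brute force={brute:.4f}, closed form={closed:.4f}")
```

The enumeration visits `N ** n` sequences, so it is only practical for very small `N` and `n`, but that is enough to sanity-check a formula before trusting it on the full population.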

### Simulation

In [67]:
class StochasticReachSimulator:
    """
    Simulates reach curve using stochastic process based on user timespends.

    Note: Probabilities for active users are 
    directly related to their time spent on the medium.
    """
    
    def __init__(self, timespends):
        # Flatten so that the .values of a single-column DataFrame
        # (shape (N, 1)) is handled the same way as a 1-D array.
        self.timespends = np.asarray(timespends, dtype=float).ravel()
        self.total_users = len(self.timespends)
        
        self.active_mask = self.timespends > 0
        self.active_indices = np.where(self.active_mask)[0]
        self.active_timespends = self.timespends[self.active_mask]
        
        # Inactive users get probability 0, so they are never drawn.
        self.probabilities = self.timespends / np.sum(self.timespends)
        self.max_reach = len(self.active_indices)
        
    def simulate_single_run(self, n_impressions: int) -> tuple[np.ndarray, set[int]]:
        """
        Simulate a single run of the reach process.
        
        Returns:
            reach_curve: array of reach values at each impression
            reached_users: set of indices of the users reached

        """
        reached_users = set()
        reach_curve = np.zeros(n_impressions + 1)
        
        # Draw user indices (not timespend values), one per impression,
        # in a single call instead of one call per impression.
        selected_user_indices = np.random.choice(
            self.total_users,
            size=n_impressions,
            p=self.probabilities
        )
        
        for i, selected_user_idx in enumerate(selected_user_indices):
            reached_users.add(int(selected_user_idx))
            reach_curve[i + 1] = len(reached_users)
            
        return reach_curve, reached_users
    
    def simulate_multiple_runs(self, n_impressions: int, n_runs: int) -> np.array:
        """
        Simulate multiple runs to get distribution of reach curves.
        """
        reach_curves = np.zeros((n_runs, n_impressions + 1))
        
        for run in tqdm(range(n_runs), desc="Simulating runs"):
            reach_curve, _ = self.simulate_single_run(n_impressions)
            reach_curves[run] = reach_curve
            
        return reach_curves
    
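    def compute_expected_reach(self, n_impressions: int) -> float:
        """
        Compute the theoretical expected reach after n_impressions.

        Draws are independent, so user i is still unreached with
        probability (1 - p_i) ** n. By linearity of expectation:
            E[|S_n|] = sum over active users of 1 - (1 - p_i) ** n
        If all M active users had equal probabilities, this would be
        approximately M * (1 - exp(-n / M)).
        """
        # Only active users enter the sum: inactive users (p_i = 0) can never
        # be reached and must not be subtracted from max_reach.
        active_probabilities = self.probabilities[self.active_mask]
        prob_not_reached = (1 - active_probabilities) ** n_impressions
        expected_reach = self.max_reach - np.sum(prob_not_reached)
        
        return expected_reach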
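
The per-impression loop in `simulate_single_run` can become slow for long runs. The sketch below is one possible vectorised alternative, not part of the class above: the function name `reach_curve_vectorised` and the toy probabilities are hypothetical. It draws all impressions at once and marks the position where each user appears for the first time.

```python
import numpy as np

def reach_curve_vectorised(probabilities, n_impressions, rng=None):
    """Return an array where entry n is the number of unique users after n impressions."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(probabilities, dtype=float).ravel()
    # One user index per impression, drawn with replacement.
    draws = rng.choice(len(p), size=n_impressions, p=p)
    # Position of the first impression for every user drawn at least once.
    _, first_seen = np.unique(draws, return_index=True)
    is_new_user = np.zeros(n_impressions, dtype=int)
    is_new_user[first_seen] = 1
    # Prepend 0 so that index n holds the reach after n impressions.
    return np.concatenate(([0], np.cumsum(is_new_user)))

# Toy probabilities (assumption): user 0 is inactive.
toy_p = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
curve = reach_curve_vectorised(toy_p, n_impressions=200, rng=np.random.default_rng(0))
print(curve[:10], curve[-1])
```

Passing an explicit `rng` keeps runs reproducible, which helps when comparing the simulated curve with the theoretical one.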

In [None]:
# Pass the timespends column as a 1-D array (timespends.values would be 2-D).
simulator = StochasticReachSimulator(timespends['timespends'].to_numpy())

N_IMPRESSIONS = 100_000
N_RUNS = 20

reach_curves = simulator.simulate_multiple_runs(N_IMPRESSIONS, N_RUNS)

In [None]:
mean_reach = np.mean(reach_curves, axis=0)
std_reach = np.std(reach_curves, axis=0)
percentile_5 = np.percentile(reach_curves, 5, axis=0)
percentile_95 = np.percentile(reach_curves, 95, axis=0)

impressions = np.arange(N_IMPRESSIONS + 1)

plt.figure(figsize=(12, 8))

for i in range(min(10, N_RUNS)):
    plt.plot(impressions, reach_curves[i], alpha=0.1, color='gray')

plt.plot(impressions, mean_reach, 'b-', linewidth=2, label='Mean reach')
plt.fill_between(impressions, percentile_5, percentile_95, 
                 alpha=0.2, color='blue', label='5th-95th percentile of runs')
plt.axhline(y=simulator.max_reach, color='r', linestyle='--', 
            label=f'Maximum possible reach: {simulator.max_reach:,}')

plt.xlabel('Number of Impressions')
plt.ylabel('Reach (Unique Users)')
plt.title('Stochastic Reach Curve Simulation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 5. Reach Efficiency Analysis

In [None]:
reach_pct_total = mean_reach / simulator.total_users * 100
reach_pct_active = mean_reach / simulator.max_reach * 100

marginal_reach = np.diff(mean_reach)

fig, ax = plt.subplots(figsize=(12, 10))

ax.plot(impressions, reach_pct_total, 'b-', linewidth=2, 
         label='% of total population')
ax.plot(impressions, reach_pct_active, 'g-', linewidth=2, 
         label='% of active users')
ax.set_xlabel('Number of Impressions')
ax.set_ylabel('Reach (%)')
ax.set_title('Reach as Percentage of Population')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
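
One way to read efficiency off the curve is to ask how many impressions are needed to reach a given share of the active users. The sketch below assumes a non-decreasing mean reach curve indexed by number of impressions, like `mean_reach` above. The helper `impressions_for_target` and the toy curve are hypothetical and only illustrate the lookup.

```python
import numpy as np

def impressions_for_target(mean_reach, max_reach, target_share):
    """Smallest n with mean_reach[n] >= target_share * max_reach, or None if never reached."""
    target = target_share * max_reach
    # mean_reach is non-decreasing, so a binary search is valid.
    n = int(np.searchsorted(mean_reach, target, side="left"))
    return n if n < len(mean_reach) else None

# Toy curve (assumption): saturating growth towards 100 reachable users.
toy_max_reach = 100
toy_n = np.arange(0, 1001)
toy_mean_reach = toy_max_reach * (1 - np.exp(-toy_n / 200))

for share in (0.5, 0.8, 0.95):
    needed = impressions_for_target(toy_mean_reach, toy_max_reach, share)
    print(f"{share:.0%} of active users: {needed} impressions")
```

The gaps between successive targets grow quickly, which is the same diminishing-returns effect that the flattening curve shows.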

## 6. Theoretical vs Empirical Comparison

In [None]:
impression_points = np.logspace(1, np.log10(N_IMPRESSIONS), 50).astype(int)
theoretical_reach = []

for n in impression_points:
    expected = simulator.compute_expected_reach(n)
    theoretical_reach.append(expected)

theoretical_reach = np.array(theoretical_reach)

plt.figure(figsize=(12, 8))

plt.plot(impressions, mean_reach, 'b-', linewidth=2, 
         label='Empirical (simulation mean)', alpha=0.8)

plt.plot(impression_points, theoretical_reach, 'ro--', 
         markersize=6, linewidth=2, label='Theoretical expected reach')

plt.axhline(y=simulator.max_reach, color='g', linestyle='--', 
            label=f'Maximum: {simulator.max_reach:,}')

plt.xlabel('Number of Impressions')
plt.ylabel('Reach (Unique Users)')
plt.title('Theoretical vs Empirical Reach Curves')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, N_IMPRESSIONS)
plt.show()
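
The approximation `M * (1 - exp(-n / M))` treats all active users as equally likely to request an impression. The sketch below compares it with the per-user sum on skewed toy data to show how far the two can drift apart. Everything here is an assumption made for illustration: the exponential toy timespends, the function names, and the chosen values of `n`.

```python
import numpy as np

def expected_reach(probabilities, n_impressions):
    """Expected unique users: sum over active users of 1 - (1 - p_i) ** n."""
    p = np.asarray(probabilities, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(1 - (1 - p) ** n_impressions))

def uniform_approximation(n_active, n_impressions):
    """Approximation that assumes equal probabilities for all active users."""
    return float(n_active * (1 - np.exp(-n_impressions / n_active)))

# Skewed toy timespends (assumption): a few heavy users, many light ones.
rng = np.random.default_rng(0)
toy_timespends = rng.exponential(scale=1.0, size=1000)
toy_p = toy_timespends / toy_timespends.sum()
n_active = int(np.sum(toy_p > 0))

for n in (100, 1_000, 10_000):
    exact = expected_reach(toy_p, n)
    approx = uniform_approximation(n_active, n)
    print(f"n={n}: per-user sum={exact:.1f}, uniform approximation={approx:.1f}")
```

On this toy data the uniform approximation sits above the per-user sum, because heavy users absorb a disproportionate share of the impressions.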