# Reach Curve Modeling Based On Timespends Data
 
This notebook explores timespend data and models the reach curve derived from it.

## Objectives:
1. Load and explore data
2. Exploratory data analysis
3. Research
4. Reach curve modeling
5. Evaluation method
6. Reach optimization
7. Recommendations and insights

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids

import warnings
warnings.filterwarnings('ignore')

## 1. Load and explore data

In [None]:
data = pd.read_csv("../data/timespends.csv", index_col=0)

print(f"No. of samples: {len(data)}")
print(f"Sample:\n {data.head()}")


plt.plot(sorted(data['timespends']))
plt.title("Timespends (sorted)")
plt.show()

### Basic observations

The entire population comprises 10,333 samples, each recording the average time a person spends on a particular medium.

The plot above shows that about 20% of the sampled population does not use the medium at all. There is simply no way to reach these people through this medium, so they set an upper limit on the reach curve (around 80%).

At the other end, about 15% of the population spends far more time on this medium than any other group. This group will likely contain the most individuals who see an advertisement multiple times. As a result, the reach curve will flatten even as the number of impressions keeps increasing.
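
The shares quoted above can also be computed directly instead of being read off the plot. The following is a minimal sketch that runs on its own, so it uses a small made-up array in place of `data['timespends']`; the 0.85 quantile used to delimit the heaviest users is an illustrative cut-off, not a value taken from the analysis.

```python
import numpy as np

# Hypothetical stand-in for data['timespends'].to_numpy(); swap in the real column.
timespends = np.array([0, 0, 1, 2, 2, 3, 5, 8, 40, 60], dtype=float)

# Share of people who never use the medium: they can never be reached.
zero_share = np.mean(timespends == 0)

# Upper bound on reach through this medium.
reach_ceiling = 1 - zero_share

# Share of total time accounted for by the heaviest users
# (everyone above the 0.85 quantile, an assumed cut-off).
threshold = np.quantile(timespends, 0.85)
heavy_time_share = timespends[timespends > threshold].sum() / timespends.sum()

print(f"zero share: {zero_share:.1%}, reach ceiling: {reach_ceiling:.1%}")
print(f"time share of heaviest users: {heavy_time_share:.1%}")
```

A large `heavy_time_share` is what drives the flattening: most additional impressions land on people who have already been reached.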

## 2. Exploratory Data Analysis

In [None]:
data['timespends'].value_counts().head(15)

In [None]:
N_CLUSTERS = 7
X = data['timespends'].to_numpy().reshape(-1, 1)
kmedoids = KMedoids(n_clusters=N_CLUSTERS, random_state=0, method='pam').fit(X)

In [None]:
clustered_data = data.copy()
clustered_data['cluster'] = kmedoids.predict(X)
# Look up each row's medoid value by indexing the centers with the cluster labels.
clustered_data['cluster-center'] = kmedoids.cluster_centers_[clustered_data['cluster'].to_numpy(), 0]
clustered_data.head()

In [None]:
plt.plot(sorted(clustered_data['cluster-center']))
clustered_data['cluster-center'].value_counts()

### Simple observations
1. We have defined 7 main user groups.
2. The maximum value the reach curve can achieve is about `77%` (simply `1 - 2346 / 10_333`).
3. The minimum value is 0 (it would be interesting to estimate this worst-case scenario somehow, i.e., the probability that absolutely no one receives even a single impression).
4. The reach curve is non-decreasing, because an impression, once received, cannot be unreceived (or taken back).

However, one can imagine that if a single person receives too many impressions, the curve might not only grow more slowly but stop growing altogether, for example if that person discourages others from using the medium.
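
The ceiling and the non-decreasing property can be checked with a small simulation. The sketch below rests on an assumption that is not established in this notebook: each impression lands on a person with probability proportional to that person's timespend. The timespend values and the number of impressions are made up. Under this particular assumption the worst case in item 3 cannot occur once at least one impression has been served, because every impression lands on someone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical timespends: 3 non-users and 7 users of varying intensity.
timespends = np.array([0, 0, 0, 1, 1, 2, 3, 5, 10, 30], dtype=float)
n_people = len(timespends)

# Assumption: an impression hits person i with probability t_i / sum(t).
probs = timespends / timespends.sum()

n_impressions = 500
hits = rng.choice(n_people, size=n_impressions, p=probs)

# Fraction of distinct people reached after each successive impression.
seen = np.zeros(n_people, dtype=bool)
reach = np.empty(n_impressions)
for k, person in enumerate(hits):
    seen[person] = True
    reach[k] = seen.mean()

# Non-users can never be reached, so this bounds the curve from above.
ceiling = np.mean(timespends > 0)
print(f"final reach: {reach[-1]:.0%}, ceiling: {ceiling:.0%}")
```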

## 3. Research

### Articles
1. [Reach Measurement, Optimization and Frequency Capping In Targeted Online Advertising Under k-Anonymity](https://arxiv.org/pdf/2501.04882v1)
2. [Estimating reach curves from one data point](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43218.pdf)

## 4. Reach Curve Modeling

In [None]:
def reach_curve(impressions: float) -> float:
    raise NotImplementedError
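
One possible way to fill in this stub is sketched below. It is not the model chosen in this notebook. It assumes that each of `n` impressions independently lands on person `i` with probability `p_i = t_i / sum(t)`, where `t_i` is that person's timespend. Person `i` is then missed by all impressions with probability `(1 - p_i)^n`, and the expected reach is the population mean of `1 - (1 - p_i)^n`. The name `expected_reach` and the sample values are hypothetical.

```python
import numpy as np

def expected_reach(impressions, timespends):
    """Expected fraction of the population reached at least once.

    Assumes each impression independently lands on person i with
    probability proportional to that person's timespend.
    """
    t = np.asarray(timespends, dtype=float)
    p = t / t.sum()
    n = np.atleast_1d(np.asarray(impressions, dtype=float))
    # Probability that person i is missed by all n impressions: (1 - p_i)^n.
    missed = (1.0 - p[None, :]) ** n[:, None]
    return (1.0 - missed).mean(axis=1)

# Made-up example: 2 non-users out of 6 people, so the ceiling is 4/6.
timespends = np.array([0, 0, 1, 1, 2, 6], dtype=float)
curve = expected_reach([0, 1, 10, 100, 10_000], timespends)
print(curve)
```

People with zero timespend have `p_i = 0` and are never reached, so the curve saturates at the share of users, matching the ceiling noted in the observations.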

## 5. Evaluation method

## 6. Reach Optimization

## 7. Recommendations and Insights

Based on the reach curve analysis:

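1. **Optimal impression levels**: Marginal reach decreases as impressions increase, so each additional impression adds less new reach than the one before.
2. **Cost efficiency**: There is a sweet spot for cost-efficient reach, beyond which extra impressions mostly hit people who have already been reached.
3. **Model selection**: Choose the model that fits your data best.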
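
The first two points can be made concrete once a reach curve is available. The sketch below uses a hypothetical stand-in curve, a saturating exponential with made-up `ceiling` and `rate` values, only to illustrate how marginal reach and a "sweet spot" might be computed. The budget step and the minimum-gain threshold are likewise illustrative assumptions.

```python
import numpy as np

def reach(n, ceiling=0.77, rate=1e-4):
    """Hypothetical stand-in reach curve; replace with the fitted model."""
    return ceiling * (1.0 - np.exp(-rate * np.asarray(n, dtype=float)))

step = 1_000                          # impressions per budget increment (assumed)
budgets = np.arange(0, 100_001, step)

# Extra reach bought by each successive increment of `step` impressions.
marginal = np.diff(reach(budgets))

# "Sweet spot" here: the first budget at which the next increment would add
# less than a chosen minimum gain (0.5 percentage points, an assumed threshold).
min_gain = 0.005
below = np.nonzero(marginal < min_gain)[0]
sweet_spot = budgets[below[0]] if below.size else budgets[-1]

print(f"sweet spot: {sweet_spot} impressions, reach there: {reach(sweet_spot):.1%}")
```

With real costs per impression, `min_gain` could be replaced by a threshold on reach gained per unit of spend.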

### Next Steps:
- Segment analysis by campaign type
- Time-series analysis of reach patterns
- Multi-channel reach optimization