# Reach Curve Modeling Based On Timespends Data
 
This notebook explores timespend data and uses it to model a reach curve.

## Objectives:
1. Load and explore data
2. Build analytical model
3. Questions and answers

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

## 1. Load and explore data

In [None]:
timespends = pd.read_csv("../data/timespends.csv", index_col=0)

TOTAL_NUM_SAMPLES = len(timespends)
print(f"No. of samples: {TOTAL_NUM_SAMPLES}")
print(f"Sample:\n {timespends.head()}")

In [None]:
plt.plot(sorted(timespends['timespends']))
plt.title("Timespends (sorted)")
plt.show()

In [None]:
timespends['timespends'].hist()

In [None]:
print("Zero time users are {:.2f} % of all users.".format(np.sum(timespends['timespends'] == 0) / len(timespends) * 100))

### Observations

The dataset contains $10{,}333$ samples, each recording the average time a person spends on a particular medium.

The plots and the zero-time share above show that about $20$% of the sampled population does not use the medium at all.
These people cannot be reached through this medium, which caps the reach curve at about $80$%.

At the other end, about $15$% of the population spends more time on this medium than any other group.
These heavy users are the most likely to see the same advertisement several times.
As a result, the reach curve flattens as the number of impressions grows, because additional impressions increasingly land on people who have already been reached.
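
The two observations above can be checked numerically. The sketch below is illustrative only: it uses a synthetic stand-in for `timespends.csv` (an assumption, roughly $20$% zeros plus an exponential tail), and the names `demo`, `max_reach`, and `top_share` are hypothetical. It computes the reach ceiling and the share of total timespend held by the heaviest $15$% of users.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: ~20% zeros, skewed positive tail).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"timespends": rng.exponential(scale=30.0, size=1_000)})
demo.loc[rng.random(len(demo)) < 0.2, "timespends"] = 0.0

# Upper bound on reach: only users with non-zero timespend can ever be reached.
max_reach = (demo["timespends"] > 0).mean()

# Share of total timespend held by the heaviest 15% of users.
top_n = int(round(0.15 * len(demo)))
top_share = demo["timespends"].nlargest(top_n).sum() / demo["timespends"].sum()

print(f"Reach upper bound: {max_reach:.1%}")
print(f"Top 15% of users hold {top_share:.1%} of total timespend")
```

Running the same two computations on the real `timespends` frame would show how concentrated usage actually is in this dataset.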

In [None]:
# Each user's share of total timespend, in percent (rows sorted ascending by timespend).
TOTAL_TIMESPENDS_SUM = np.sum(timespends['timespends'])
timespends.sort_values(by='timespends', inplace=True)
timespends['perc_of_total'] = timespends['timespends'] / TOTAL_TIMESPENDS_SUM * 100

## 2. Modeling

### Observation
The approach to modeling a reach curve depends heavily on the media type.

For TV, a single ad impression can reach multiple viewers at once, but you cannot distinguish between them, so there is no way to prevent the same individual from seeing the ad repeatedly.

On digital platforms, you pay for every impression. Here, the situation is more complex. While you can sometimes track if a user has seen an ad before (if they aren't anonymous), a single person might have multiple accounts. The platform sees these as different users, which inflates your reach numbers and makes it difficult to accurately cap frequency for that individual.

For this notebook, let's simplify by treating each impression as a request from a single user. Assume there is no tracking history, so we cannot tell whether a user has already seen an ad. Under this assumption, the same user can see the same ad multiple times.
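
The simplification above can be sketched as a small simulation. One extra assumption, not stated in the text, is that each impression lands on a user with probability proportional to that user's timespend. The function name `simulate_reach` and the toy array `demo_t` are hypothetical and only serve the illustration.

```python
import numpy as np

def simulate_reach(timespends, n_impressions, seed=0):
    """Fraction of users hit at least once by n_impressions independent draws.

    Assumption: each impression goes to a user with probability
    proportional to that user's timespend; there is no frequency capping.
    """
    t = np.asarray(timespends, dtype=float)
    probs = t / t.sum()
    rng = np.random.default_rng(seed)
    # Every impression is an independent draw, so repeats are allowed.
    hits = rng.choice(len(t), size=n_impressions, replace=True, p=probs)
    return np.unique(hits).size / len(t)

# Toy population: 2 of 10 users never use the medium.
demo_t = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 30.0])
for n in (1, 10, 100, 1000):
    print(n, simulate_reach(demo_t, n))
```

In this toy population the simulated reach can never exceed the share of users with non-zero timespend, and it grows more slowly as impressions pile up on the heaviest users.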

### Poisson regression model
One candidate is to model the reach curve with a [Poisson regression model](https://en.wikipedia.org/wiki/Poisson_regression), which is designed for count data such as the number of impressions a user receives.
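
As a rough illustration of why a Poisson view of impression counts leads to a reach curve, the sketch below uses a plain Poisson approximation rather than a fitted Poisson regression. It assumes that a user's expected number of impressions is the total number of impressions times that user's share of total timespend; the function name `expected_reach` and the toy array are hypothetical.

```python
import numpy as np

def expected_reach(timespends, n_impressions):
    """Expected fraction of users reached under a Poisson approximation.

    Assumption: user i receives Poisson(lambda_i) impressions with
    lambda_i = n_impressions * t_i / sum(t).
    """
    t = np.asarray(timespends, dtype=float)
    share = t / t.sum()
    lam = n_impressions * share
    # P(at least one impression) = 1 - exp(-lambda_i), averaged over all users.
    return float(np.mean(1.0 - np.exp(-lam)))

# Toy population: 2 of 10 users never use the medium.
toy_t = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 30.0])
curve = [expected_reach(toy_t, n) for n in (0, 10, 100, 1000, 10_000)]
print(curve)
```

Under these assumptions the curve starts at zero, never decreases, and saturates at the share of users with non-zero timespend.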

## 3. Questions and answers
1. What criteria should the reach function meet (maximum and minimum value, monotonicity)?
2. How would you map the distribution of impressions among people and why?
3. What is the relationship between a person's timespend and the expected number of impressions that person receives?