In [None]:
"""
REFERENCES:
Methods that were originally developed and tested by Adobe and Alibaba:
- G. Theocharous, P. Thomas, and M. Ghavamzadeh, Personalized Ad Recommendation Systems 
  for Life-Time Value Optimization with Guarantees, 2015.
- D. Wu, X. Chen, X. Yang, H. Wang, Q. Tan, X. Zhang, J. Xu, and K. Gai, Budget Constrained 
  Bidding by Model-free Reinforcement Learning in Display Advertising, 2018.

OBJECTIVE:
A simple testbed environment with synthetic data to explain the model design and 
implementation more cleanly.

THEORY:
Basics of customer intent analysis:
One of the most basic and widely used targeting methods is look-alike modeling.
The idea of look-alike modeling is to personalize ads or offers based on the similarity of a
given customer to other customers who have exhibited certain desirable or undesirable 
properties in the past. 

We can approach this problem by collecting a number of historical customer profiles that can
include both demographic and behavioral features, labeling those profiles with the observed 
outcomes, training a classification model on these samples, and then using this model to 
score any given customer to determine whether the offer should be issued.
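The steps above can be sketched in a few lines of plain Python. This is only an illustrative
sketch under stated assumptions: the two features, the synthetic response rule, the 0.5 
threshold, and the helper names (train_lookalike, score) are all hypothetical, and a 
hand-written logistic regression stands in for whatever classifier would be used in practice.

```python
import math
import random

def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_lookalike(profiles, outcomes, lr=0.1, epochs=200):
    # Fit a logistic regression by stochastic gradient descent on the log loss.
    weights = [0.0] * len(profiles[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(profiles, outcomes):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def score(model, x):
    # Estimated P(response | x) for a single profile x.
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

# Synthetic history: each profile is (visit frequency, past purchases), scaled to [0, 1].
rng = random.Random(0)
profiles, outcomes = [], []
for _ in range(200):
    x = [rng.random(), rng.random()]
    # Assumed ground truth: more engaged customers respond more often.
    p_true = sigmoid(4.0 * (x[0] + x[1] - 1.0))
    profiles.append(x)
    outcomes.append(1 if rng.random() < p_true else 0)

model = train_lookalike(profiles, outcomes)
engaged = score(model, [0.9, 0.9])
dormant = score(model, [0.1, 0.1])

# Hypothetical decision rule: issue the offer when the score clears a threshold.
THRESHOLD = 0.5
issue_offer_to_engaged = engaged >= THRESHOLD
```

Changing how the outcome label is defined (click, purchase, churn) changes what the score 
means without changing the rest of the pipeline.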

This approach provides significant flexibility, in the sense that models can be built for a 
wide variety of business objectives depending on how the outcome label is defined. It also 
works well in a variety of environments, including online advertising and retail promotions.

The solution described above can be extended in a number of ways. One of the apparent 
limitations of basic look-alike modeling is that models for different outcomes and objectives
are built separately, and the useful information about similarities between offerings is 
ignored. 

This issue becomes critical in environments with a large number of offerings, such as 
recommendation systems. Typically, the problem is circumvented by incorporating user 
features, offering features and a larger user-offering interaction matrix into the model so 
that interaction patterns can be generalized across the offerings. This is done in many 
collaborative filtering algorithms, including those based on matrix factorization, 
factorization machines and deep learning methods.

These methods help to predict customer intent more accurately by incorporating a wider range
of data, but they are not particularly useful for optimizing multi-step action strategies.

The problem of strategic optimization can be partly addressed by a more careful design of the
target variable, and this group of techniques represents the second important extension of 
basic look-alike modeling. The target variable is often designed to quantify the probability
of some immediate event like a click, purchase, or subscription cancellation, but it can also
incorporate more strategic considerations. 

For example, it is common to combine look-alike modeling with lifetime value (LTV) models to 
quantify not only the probability of the event, but also its long-term economic impact (e.g. 
what will be the total return from a customer retained from churn or what will be a 3-month 
spending uplift after a special offer).

In this case, for a profile x, the score can be defined in one of the following ways:

Unconditional Propensity: score(x) = P(response | x)
Expected LTV: score(x) = P(response | x) * LTV(x)
LTV Uplift: score(x) = (P(response | x) - P(response | no offer, x)) * LTV(x)
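As a small worked example, the three scores can be written as functions and compared on two 
made-up customers. The probabilities and lifetime values below are invented for illustration
only, and the function names are hypothetical; in practice the inputs would come from fitted
propensity and lifetime value models.

```python
def propensity_score(p_response):
    # Unconditional propensity: P(response | x)
    return p_response

def expected_ltv_score(p_response, ltv):
    # Expected lifetime value: P(response | x) * LTV(x)
    return p_response * ltv

def ltv_uplift_score(p_response, p_response_no_offer, ltv):
    # Uplift: (P(response | x) - P(response | no offer, x)) * LTV(x)
    return (p_response - p_response_no_offer) * ltv

# Invented inputs for two customers.
customers = {
    "a": {"p": 0.5, "p_no_offer": 0.25, "ltv": 100.0},
    "b": {"p": 0.75, "p_no_offer": 0.75, "ltv": 200.0},
}

scores = {
    name: {
        "propensity": propensity_score(c["p"]),
        "expected_ltv": expected_ltv_score(c["p"], c["ltv"]),
        "ltv_uplift": ltv_uplift_score(c["p"], c["p_no_offer"], c["ltv"]),
    }
    for name, c in customers.items()
}

def best_target(metric):
    # Customer with the highest score under the chosen metric.
    return max(scores, key=lambda name: scores[name][metric])
```

Customer b wins on propensity and expected lifetime value, but customer a wins on uplift, 
because b would respond just as often without the offer.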

While these techniques help put the modeling process into a strategic context, they do not 
really provide a framework for optimizing a long-term strategy. So our next step will be to 
develop a more suitable framework, specifically for this problem.

Customer journey as a Markov decision process:
The problem of strategic (multi-step) optimization stems from the stateful nature of 
customer relationships and dependencies between actions. For example, one can view retail 
offer targeting as a single-step process where a customer can either be converted or 
completely lost (and train a model that maximizes the probability of conversion):

Offer --> Purchase
  |
  v
Lost customer

However, the real retail environment is more complex: customers can interact with a 
retailer multiple times before a conversion occurs, and can go on to make related purchases.

For instance, in a situation where the actions in the strategy are related, and their order 
is important, the initial action alone may not increase conversions, but it can boost the 
efficiency of the downstream actions. 

It can also be the case that an action is most efficient if issued not at the beginning, but
between two other different actions, and so on.

In a complex customer journey, customer maturity grows over time, and offerings need to be 
sequenced properly to account for this.

MARKOV DECISION PROCESSES (MDP):
Such use cases can frequently be represented as Markov decision processes (MDPs), where a 
customer can be in one of several states and move from one state to another over time under 
the influence of marketing actions. 

In each state, the marketer has to choose an action (or non-action) to execute, and each 
transition between states corresponds to some reward (e.g. number of purchases), so that all
rewards along the customer trajectory sum up to a total return that corresponds to customer 
LTV or campaign ROI.
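A toy version of this setup can be sketched as a transition table. The state names, the 
actions, and the rewards below are invented, and transitions are made deterministic purely 
for readability (a real customer MDP would be stochastic). The sketch shows that the same 
three actions produce different total returns depending on their order.

```python
# (state, action) -> (next_state, reward); reward counts purchases on that transition.
TRANSITIONS = {
    ("new", "email"): ("engaged", 0.0),
    ("new", "discount"): ("new", 0.0),
    ("engaged", "email"): ("engaged", 0.0),
    ("engaged", "discount"): ("buyer", 1.0),
    ("buyer", "email"): ("buyer", 0.0),
    ("buyer", "discount"): ("buyer", 1.0),
}

def rollout(start_state, actions):
    # Apply the actions in order and sum the rewards along the trajectory.
    state = start_state
    total_return = 0.0
    visited = [start_state]
    for action in actions:
        state, reward = TRANSITIONS[(state, action)]
        total_return += reward
        visited.append(state)
    return visited, total_return

# Same actions, different order.
states_a, return_a = rollout("new", ["email", "discount", "discount"])
states_b, return_b = rollout("new", ["discount", "discount", "email"])
```

Here the email earns nothing by itself, but it moves the customer to a state where the 
discounts convert, so the first ordering returns 2.0 and the second returns 0.0.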

Although we can use a hand-crafted set of states with clear semantic meaning, we will assume
the state is represented simply by a customer feature vector, so that the total number of 
states can be large or infinite in the case of real-valued features. Note that the state at 
any point of time can include records of all previous actions taken with regard to this 
customer.

In the MDP framework, our goal is to find an optimal policy π that maps each state to the 
probabilities of selecting each possible action, so that π(a | s) is the probability of 
taking action a given that a customer is currently in state s. The optimality of the 
policy can be quantified using the expected return under this policy: the expected sum of 
rewards r earned at each transition.
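To make the expected return concrete, the sketch below evaluates a stochastic policy exactly
on a tiny two-state MDP. Everything here is an assumption made for illustration: the states,
the transition probabilities, the rewards, and the choice of an undiscounted finite horizon.

```python
# (state, action) -> list of (probability, next_state, reward)
P = {
    ("new", "offer"): [(0.5, "active", 0.0), (0.5, "new", 0.0)],
    ("new", "wait"): [(1.0, "new", 0.0)],
    ("active", "offer"): [(0.5, "active", 1.0), (0.5, "new", 0.0)],
    ("active", "wait"): [(1.0, "active", 0.0)],
}
STATES = ["new", "active"]
ACTIONS = ["offer", "wait"]

def expected_return(policy, state, horizon):
    # policy[state][action] is pi(action | state).
    # Returns the expected sum of rewards over the next `horizon` transitions.
    if horizon == 0:
        return 0.0
    total = 0.0
    for action in ACTIONS:
        for prob, next_state, reward in P[(state, action)]:
            future = expected_return(policy, next_state, horizon - 1)
            total += policy[state][action] * prob * (reward + future)
    return total

always_offer = {s: {"offer": 1.0, "wait": 0.0} for s in STATES}
never_offer = {s: {"offer": 0.0, "wait": 1.0} for s in STATES}

value_always = expected_return(always_offer, "new", 2)
value_never = expected_return(never_offer, "new", 2)
```

Over two steps from the "new" state, always offering yields an expected return of 0.25 and 
never offering yields 0.0, so the first policy is better under this toy model.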

A naive solution for this problem is thus to build multiple look-alike models for different 
actions and different designs of the training labels.
"""