# Introduction

A specific piece of content, such as an article or a video about a product, can be very engaging
for a niche of users but not interesting at all for the rest of the audience. The idea
behind inPowered’s success is that, if we can find that niche, we can focus our client’s
budget on that specific audience segment. This dramatically increases the ratio
of users who really engage with the content (i.e. spend more than 15 seconds on it)
to those who just click and leave. The result is a good CPE (cost per engagement), which
gives us a healthy margin on the price we charge our clients for an engagement.
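
The engagement rule and the CPE can be written down in a few lines. The sketch below is only an illustration: the function name `cost_per_engagement`, the `dwell_times` input and the sample numbers are assumptions, not part of the inPowered platform. Only the 15-second threshold and the reading of CPE as spend divided by engagements come from the text.

```python
ENGAGEMENT_SECONDS = 15  # threshold quoted in the text above


def cost_per_engagement(spend, dwell_times):
    """Return spend divided by the number of engaged visits.

    A visit counts as an engagement when the user stays strictly longer
    than ENGAGEMENT_SECONDS. Returns None when nobody engaged, because
    the ratio is undefined in that case.
    """
    engagements = sum(1 for seconds in dwell_times if seconds > ENGAGEMENT_SECONDS)
    if engagements == 0:
        return None
    return spend / engagements


# Hypothetical visit durations (seconds) for one Ad that spent 9.0.
cpe = cost_per_engagement(9.0, [3, 20, 40, 15, 16])
print(cpe)
```

Here three of the five visits last longer than 15 seconds, so the spend is divided by three.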

# How InPowered works

The channels we use to advertise our clients’ content (e.g. Facebook, Yahoo, Twitter, etc.)
support the creation of Ads through an API. They also let us specify the audience an Ad is
targeted to. The targeting segments generally supported are:

* Age (e.g. 18-24, 25-34, 35-44, 45-54, 55-64, 65+)
* Gender (e.g. Male, Female)
* Device (e.g. Mobile, Desktop)
* OS (e.g. Android, iOS)
* Language (e.g. English, French, Spanish, ...)
* Geo (e.g. United States, Canada, New York, San Francisco ...)
* Interest (e.g. Travel, Sport, Auto, ... plus custom interests per channel)

For each Ad, channels also let us define a creative, which is the combination of an image and
a headline shown as a preview of the Ad.

At a very high level, what we do is:

1. For each piece of content to promote (e.g. a URL pointing to an article), we generate all the possible combinations of targeting segments. Each combination is a lineItem.

1. For each lineItem, we create an Ad and traffic it on the associated channel. This means we end up with thousands of versions of the same Ad, which differ only in the audience segment they target.

1. We then activate them **randomly** with a micro budget and collect performance metrics (i.e. spend and engagements).

1. As soon as an Ad’s exploratory budget has been used, we evaluate its performance by comparing its CPE with the CPE of the other explored versions, and decide whether it deserves to be kept alive or has to be killed.
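
The first step above can be sketched with `itertools.product`. The segment values are the examples from the list of targeting segments. The dictionary layout and the function name `generate_line_items` are assumptions made for illustration, not the production code.

```python
from itertools import product

# Example segment values copied from the targeting list above.
# Language, Geo and Interest are left out to keep the output small.
segments = {
    "Age": ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"],
    "Gender": ["Male", "Female"],
    "Device": ["Mobile", "Desktop"],
    "OS": ["Android", "iOS"],
}


def generate_line_items(segments):
    """Return one dict per combination of segment values."""
    names = list(segments)
    return [dict(zip(names, combo)) for combo in product(*segments.values())]


line_items = generate_line_items(segments)
print(len(line_items))  # 6 * 2 * 2 * 2 combinations
print(line_items[0])
```

With only these four dimensions there are already 48 lineItems. Adding language, geo and interest multiplies that count further, which is how a single piece of content ends up with thousands of Ads.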

# Smart Exploration goal

Smart Exploration aims to change the criterion we use to decide which lineItem (version of the Ad) to test first, from **randomly** to **based on the experience gathered through previous campaigns on the same type of content**.

In [1]:
import pandas as pd

columns=['LineItemID', 'ContentUrl', 'Campaign ID', 
         'Story ID', 'Channel', 'Preview Headline', 
         'Preview Image', 'Channel Ad ID', 'Gender Targeting', 
         'Age Targeting', 'Device Targeting', 'Geo Targeting', 
         'Interest Targeting', 'Language Targeting', 'OS Targeting',
         'Lifetime Spend', 'Lifetime Clicks', 'Lifetime Engagements', 'CPE']

pd.DataFrame([], columns=columns).head()

Unnamed: 0,LineItemID,ContentUrl,Campaign ID,Story ID,Channel,Preview Headline,Preview Image,Channel Ad ID,Gender Targeting,Age Targeting,Device Targeting,Geo Targeting,Interest Targeting,Language Targeting,OS Targeting,Lifetime Spend,Lifetime Clicks,Lifetime Engagements,CPE


# Problem

In this section we describe the problem to solve. We can rely on historical data. The
first table shows what lineItem data looks like for campaigns that have already run.

The second table shows the lineItems generated for the current campaign. Its performance
columns are still empty because none of these lineItems has run yet.

The columns of both tables correspond to the LineItem attributes listed in the table from
the previous section.

In [2]:
table1 = [[1, 1, 'd1', 'd1_1', 'fb', 1, 'g1', 'fb1', 'M', [35, -4, 4], 'D', 'US', 'id_fb1', 'EN', 'iOS', 45., 1000, 500, .09],
          [2, 1, 'd1', 'd1_1', 'fb', 1, 'g1', 'fb2', 'F', [35, -4, 4], 'D', 'US', 'id_fb1', 'EN', 'iOS', 50., 1000, 500, .10],
          [3, 1, 'd1', 'd1_1', 'yh', 1, 'g1', 'yh1', 'F', [35, -4, 4], 'D', 'US', 'id_yh1', 'EN', 'iOS', 150., 1000, 500, .30],
          [30000000, 543040, 'd100', 'd100_1', 'yh', 2, 'g1', 'yh154', 'F', [55, -6, 5], 'M', 'US', 'id_yh30', 'EN', 'iOS', 100., 1000, 500, .20],
          ]

historical_data = pd.DataFrame(table1, columns=columns)

In [3]:
historical_data.head()

Unnamed: 0,LineItemID,ContentUrl,Campaign ID,Story ID,Channel,Preview Headline,Preview Image,Channel Ad ID,Gender Targeting,Age Targeting,Device Targeting,Geo Targeting,Interest Targeting,Language Targeting,OS Targeting,Lifetime Spend,Lifetime Clicks,Lifetime Engagements,CPE
0,1,1,d1,d1_1,fb,1,g1,fb1,M,"[35, -4, 4]",D,US,id_fb1,EN,iOS,45.0,1000,500,0.09
1,2,1,d1,d1_1,fb,1,g1,fb2,F,"[35, -4, 4]",D,US,id_fb1,EN,iOS,50.0,1000,500,0.1
2,3,1,d1,d1_1,yh,1,g1,yh1,F,"[35, -4, 4]",D,US,id_yh1,EN,iOS,150.0,1000,500,0.3
3,30000000,543040,d100,d100_1,yh,2,g1,yh154,F,"[55, -6, 5]",M,US,id_yh30,EN,iOS,100.0,1000,500,0.2


In [4]:
table2 = [[30000001, 643040, 'd101', 'd101_1', 'fb', 1, 'g1', 'fb1', 'M', [45, -4, 4], 'D', 'US', 'id_fb1', 'EN', 'iOS', None, None, None, None],
          [30000002, 643040, 'd101', 'd101_1', 'fb', 1, 'g1', 'fb2', 'F', [35, -4, 4], 'D', 'US', 'id_fb2', 'EN', 'iOS', None, None, None, None],
          [30000003, 643040, 'd101', 'd101_1', 'yh', 1, 'g1', 'yh1', 'F', [35, -4, 4], 'D', 'US', 'id_yh1', 'EN', 'iOS', None, None, None, None],
          [30000004, 743040, 'd101', 'd101_1', 'yh', 2, 'g1', 'yh154', 'F', [55, -6, 5], 'M', 'US', 'id_yh30', 'EN', 'iOS', None, None, None, None],
          ]
current_campaign_data = pd.DataFrame(table2, columns=columns)

In [5]:
current_campaign_data.head()

Unnamed: 0,LineItemID,ContentUrl,Campaign ID,Story ID,Channel,Preview Headline,Preview Image,Channel Ad ID,Gender Targeting,Age Targeting,Device Targeting,Geo Targeting,Interest Targeting,Language Targeting,OS Targeting,Lifetime Spend,Lifetime Clicks,Lifetime Engagements,CPE
0,30000001,643040,d101,d101_1,fb,1,g1,fb1,M,"[45, -4, 4]",D,US,id_fb1,EN,iOS,,,,
1,30000002,643040,d101,d101_1,fb,1,g1,fb2,F,"[35, -4, 4]",D,US,id_fb2,EN,iOS,,,,
2,30000003,643040,d101,d101_1,yh,1,g1,yh1,F,"[35, -4, 4]",D,US,id_yh1,EN,iOS,,,,
3,30000004,743040,d101,d101_1,yh,2,g1,yh154,F,"[55, -6, 5]",M,US,id_yh30,EN,iOS,,,,


The **first part** of the problem is to understand which historical stories can be used to make
decisions about the current campaign’s stories.

The **second part** of the problem is to assign a priority to each lineItem of the current campaign, so
that we start exploring the ones with the highest chance of performing well (in terms of CPE).
Even assuming we know which historical story to use as a reference, exact
matches (in terms of lineItem targeting) may turn out to be poor performers. Partial matches may have
a better CPE, but we need to figure out how negligible the discrepancy is.

# Part 1

The most straightforward way to tackle this problem is to keep only the meaningful data.
Basically, we should rely on the stories that were successful. Since the goal stated here is to have the maximum CPE possible, we can discard every historical lineItem whose CPE is not above a chosen threshold.

We still need to pick the right lineItem in the historical data in order to estimate the CPE on the current data.

For instance, the following cell applies a threshold of $.1$. The comparison is strict, so rows with a CPE of exactly $.1$ are dropped as well:

In [6]:
valids = historical_data[historical_data.CPE > .1]
valids.head()

Unnamed: 0,LineItemID,ContentUrl,Campaign ID,Story ID,Channel,Preview Headline,Preview Image,Channel Ad ID,Gender Targeting,Age Targeting,Device Targeting,Geo Targeting,Interest Targeting,Language Targeting,OS Targeting,Lifetime Spend,Lifetime Clicks,Lifetime Engagements,CPE
2,3,1,d1,d1_1,yh,1,g1,yh1,F,"[35, -4, 4]",D,US,id_yh1,EN,iOS,150.0,1000,500,0.3
3,30000000,543040,d100,d100_1,yh,2,g1,yh154,F,"[55, -6, 5]",M,US,id_yh30,EN,iOS,100.0,1000,500,0.2


# Part 2

Next, we define a similarity metric so that we can rank the lineItems of the current campaign against every historical row we kept.

## Preprocessing


First, some of the features need preprocessing.

The approach here is simply to convert the categorical features to numerical ones. A value that does not appear in the historical data is encoded as $-1$.
`Age Targeting` gets special treatment: it is stored as a triple such as `[35, -4, 4]`, from which we create two new features, `low_age` and `high_age` (here $35 - 4 = 31$ and $35 + 4 = 39$), to represent the whole range.

In [7]:
from sklearn.preprocessing import LabelEncoder

str_features = ['Campaign ID', 'Story ID', 'Channel', 'Preview Image', 'Channel Ad ID', 
                'Gender Targeting', 'Device Targeting', 'Geo Targeting', 'Interest Targeting', 
                'Language Targeting', 'OS Targeting']

# creating label encoder for every column
encoders = {}
for feature in str_features:
    le = LabelEncoder()
    encoders[feature] = le.fit(valids[feature])

In [8]:
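def transform(le, x):
    # Encode a single value. LabelEncoder.transform raises ValueError for
    # values it did not see during fit; those are mapped to -1.
    try:
        return le.transform([x])[0]
    except (KeyError, ValueError):
        return -1

def normalize(df):
    norm = df.copy()
    # Replace every categorical column by its integer code.
    for feature, le in encoders.items():
        norm[feature] = df[feature].apply(lambda x: transform(le, x))
    # Age Targeting holds [reference age, lower offset, upper offset]:
    # expand it into the bounds of the range and keep the reference age.
    norm['low_age'] = df['Age Targeting'].apply(lambda x: x[0] + x[1])
    norm['high_age'] = df['Age Targeting'].apply(lambda x: x[0] + x[2])
    norm['Age Targeting'] = df['Age Targeting'].apply(lambda x: x[0])
    return norm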

In [9]:
normalize(valids).head()

Unnamed: 0,LineItemID,ContentUrl,Campaign ID,Story ID,Channel,Preview Headline,Preview Image,Channel Ad ID,Gender Targeting,Age Targeting,...,Geo Targeting,Interest Targeting,Language Targeting,OS Targeting,Lifetime Spend,Lifetime Clicks,Lifetime Engagements,CPE,low_age,high_age
2,3,1,0,1,0,1,0,0,0,35,...,0,0,0,0,150.0,1000,500,0.3,31,39
3,30000000,543040,1,0,0,2,0,1,0,55,...,0,1,0,0,100.0,1000,500,0.2,49,60


For each lineItem of the current campaign, we look for the most similar row in the historical data.

The results are then ranked on two factors: the similarity with the historical data and the estimated CPE (the CPE of the matched historical row). We combine them into a weighted CPE, which is the historical CPE multiplied by the similarity with the current lineItem.
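
The ranking rule can be illustrated without pandas or scikit-learn. The sketch below is a simplified, pure-Python version under stated assumptions: the vectors, the identifiers (`c1`, `h1`, ...) and the function names are made up for the example, and the descending sort follows the convention used in this notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_weighted_cpe(current, historic):
    """Rank current items by similarity * CPE of their closest historic item.

    current:  {item_id: feature vector}
    historic: {item_id: (feature vector, CPE)}
    Returns tuples (current_id, matched_id, similarity, weighted_cpe),
    sorted by weighted_cpe in descending order.
    """
    ranked = []
    for cur_id, cur_vec in current.items():
        # Closest historic item according to cosine similarity.
        best_id = max(historic, key=lambda h: cosine(cur_vec, historic[h][0]))
        best_vec, best_cpe = historic[best_id]
        sim = cosine(cur_vec, best_vec)
        ranked.append((cur_id, best_id, sim, sim * best_cpe))
    return sorted(ranked, key=lambda row: row[3], reverse=True)


# Toy data: two historic items with known CPE, two unexplored items.
historic = {"h1": ([1.0, 0.0], 0.3), "h2": ([0.0, 1.0], 0.2)}
current = {"c1": [1.0, 0.1], "c2": [0.1, 1.0]}

ranked = rank_by_weighted_cpe(current, historic)
for row in ranked:
    print(row)
```

In this toy case both current items are equally close to their best match, so the order is decided by the historical CPE of the matched item.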

In [10]:
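from sklearn.metrics.pairwise import cosine_similarity

def compute_similarity(historic, current):
    df = current.copy()
    # Encode the first 15 columns (identifiers, creative and targeting);
    # the last four columns hold the performance metrics.
    df1 = normalize(current.iloc[:, :15])
    df2 = normalize(historic.iloc[:, :15])
    # sim[i, j] is the cosine similarity between current row i and historic row j.
    sim = cosine_similarity(df1, df2)
    best = sim.argmax(axis=1)  # position of the closest historic row
    # Copy the metrics of the closest historic row into each current row.
    df.iloc[:, 15:] = historic.iloc[best, 15:].reset_index(drop=True)
    df['similarity'] = sim.max(axis=1)
    df['weighted CPE'] = sim.max(axis=1) * historic.iloc[best].CPE.values
    return df.sort_values('weighted CPE', ascending=False), sim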

In [11]:
df, sim = compute_similarity(valids, current_campaign_data)

In [12]:
df.head()

Unnamed: 0,LineItemID,ContentUrl,Campaign ID,Story ID,Channel,Preview Headline,Preview Image,Channel Ad ID,Gender Targeting,Age Targeting,...,Geo Targeting,Interest Targeting,Language Targeting,OS Targeting,Lifetime Spend,Lifetime Clicks,Lifetime Engagements,CPE,similarity,weighted CPE
2,30000003,643040,d101,d101_1,yh,1,g1,yh1,F,"[35, -4, 4]",...,US,id_yh1,EN,iOS,100.0,1000,500,0.2,0.999994,0.199999
1,30000002,643040,d101,d101_1,fb,1,g1,fb2,F,"[35, -4, 4]",...,US,id_fb2,EN,iOS,100.0,1000,500,0.2,0.999994,0.199999
0,30000001,643040,d101,d101_1,fb,1,g1,fb1,M,"[45, -4, 4]",...,US,id_fb1,EN,iOS,100.0,1000,500,0.2,0.999994,0.199999
3,30000004,743040,d101,d101_1,yh,2,g1,yh154,F,"[55, -6, 5]",...,US,id_yh30,EN,iOS,100.0,1000,500,0.2,0.999978,0.199996


In [13]:
sim.argmax(axis=1) 

array([1, 1, 1, 1])

# Wrap up

Given the current campaign data, the historical data suggests that lineItems `30000003`, `30000002` and `30000001` are likely to slightly outperform `30000004` (weighted CPE of 0.199999 versus 0.199996). In addition, all four are more similar to historical lineItem `30000000` than to lineItem `3`.
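
One possible reading of why every current lineItem matches `30000000`: cosine similarity is computed on raw column values, so columns with very large magnitudes, such as `LineItemID` and `ContentUrl`, weigh far more than the targeting features. The sketch below illustrates this on a reduced set of columns. The row values are taken from the tables above, while the column subset, the helper names and the `skip` parameter are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Reduced rows: [LineItemID, ContentUrl, Age Targeting, low_age, high_age].
# The other encoded columns are omitted to keep the illustration short.
current = [30000003, 643040, 35, 31, 39]
historic = {
    3: [3, 1, 35, 31, 39],
    30000000: [30000000, 543040, 55, 49, 60],
}


def closest(vector, candidates, skip=0):
    """Return the key of the candidate most similar to `vector`,
    ignoring the first `skip` columns of every row."""
    return max(candidates, key=lambda k: cosine(vector[skip:], candidates[k][skip:]))


with_ids = closest(current, historic)             # identifier columns included
without_ids = closest(current, historic, skip=2)  # identifier columns dropped

print(with_ids, without_ids)
```

On this reduced example the match flips from `30000000` to `3` once the two identifier columns are left out. Excluding identifiers or scaling the features before computing the similarity may therefore be worth testing on the full data.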