# Data Science Workshop Project

### Eran Horowitz, Yair Hadas, Dean Ayalon

For our project, we chose to take on the Outbrain competition presented on Kaggle. While we did not actually enter the competition, we still managed to achieve respectable results - just shy of the top 50 entries. In this notebook, we will show the workflow, features and models that got us there.

## Business Problem to Machine Learning Problem

Companies like Outbrain are in the business of online advertising - many high-profile websites include advertising based on Outbrain's platform, and many high-profile companies choose Outbrain as their online advertising platform. Both sides share the same goal - to maximize the chance that an ad is clicked. Every click means another potential customer or potential revenue for the advertiser, and it probably carries some sort of incentive for Outbrain as well.

Unlike traditional advertising platforms, advertising on the Internet can be highly personalized and tailored to each user. So the natural thing to do would be to tailor the ads to each user, in a way that would maximize the chances of an ad being clicked. But exactly how? Based on what attributes of the user and of the ad?

This is where Machine Learning comes in. Using the vast amounts of data provided by Outbrain as part of this challenge, we can build a model that attempts to predict which ad a given user will click when presented with a given series of ads. The output of this model could be used to design a better algorithm for choosing which ads to present to each user.

## A Bit More Formally

The main panel on which ads are presented to the user is called a Display. Each display has a unique id, and may consist of anywhere between 3 and 12 individual ads, each of them also having a unique id.
For each display id in the test data, we need to return an array of its ad ids, sorted in descending order of the probability that each one was the ad clicked. The main training and test datasets only include displays in which an ad was clicked, so we know for sure that one of the ads in each display was clicked.
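
The required output format can be illustrated with a minimal sketch. The table and the `click_prob` column below are made up for illustration and stand in for whatever scores a model produces; they are not taken from the competition data.

```python
import pandas as pd

# Toy predictions: one row per (display, ad) pair. `click_prob` is a
# hypothetical model score, not a column from the Outbrain tables.
preds = pd.DataFrame({
    "display_id": [1, 1, 1, 2, 2],
    "ad_id": [10, 11, 12, 20, 21],
    "click_prob": [0.2, 0.7, 0.1, 0.4, 0.6],
})

# Sort by score within each display (highest first), then collect
# the ad ids of every display in that order.
ranked = (
    preds.sort_values(["display_id", "click_prob"], ascending=[True, False])
         .groupby("display_id")["ad_id"]
         .apply(list)
)
```

Here `ranked` is a Series indexed by display id, holding one ordered list of ad ids per display.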

## Introducing the Data

Outbrain provided two weeks' worth of data for this challenge, for the period of 14-28.6.2016, containing information about displays, the sites they were displayed on, the ads they contained, various details about the documents these ads were linked to, the users viewing those ads, and so forth.

The full dataset for this challenge was enormous, totalling almost 100 GB when uncompressed. As we lacked the resources to handle such vast amounts of information, we chose to focus on only a small subset of the data. The full training file contains information about over 16 million different displays. We chose to sample about 3% of that amount, so the data set we worked with contained about 500,000 displays, uniformly sampled from the original data.
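
A sketch of one way such a sample could be drawn is shown below. It assumes sampling is done at the level of whole displays (so every sampled display keeps all of its ads), and it uses a small synthetic table in place of the real training file; the seed and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training table: 100 displays with 3 ads each,
# exactly one clicked ad per display.
clicks = pd.DataFrame({
    "display_id": np.repeat(np.arange(1, 101), 3),
    "ad_id": np.arange(300),
    "clicked": np.tile([1, 0, 0], 100),
})

rng = np.random.default_rng(0)  # fixed seed so the sample is reproducible
all_displays = clicks["display_id"].unique()

# Keep about 3% of the display ids, chosen uniformly without replacement.
n_keep = max(1, int(round(0.03 * len(all_displays))))
kept = rng.choice(all_displays, size=n_keep, replace=False)

# Keep every row belonging to a sampled display.
sample = clicks[clicks["display_id"].isin(kept)]
```

Sampling display ids rather than individual rows keeps each display intact, which matters because the task is to rank the ads within a display.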

Some of the tables we worked with:

**clicks_train.csv** - the core table, containing display id's, the ads each display contains, and which of them was clicked.

**events.csv** - contains information about the click "events" - on what platform they were made (mobile, desktop or tablet), the country from which they were made, when they were made, etc.

**documents_categories.csv, documents_topics.csv, documents_entities.csv, documents_meta.csv** - tables containing various metadata about documents, both the ones on which ads were displayed and the ones linked to by ads.

**promoted_content.csv** - contains information about the publishers of ads and the campaigns they were part of.

The biggest table in the data was **page_views.csv**, which was essentially the same as events.csv, but contained **all** of the page views of users tracked in those two weeks, including pages they viewed without clicking an ad. We decided to abandon this table due to the massive resources we would have needed in order to work with it.

## Evaluating Our Models

Models in the competition were evaluated using the MAP@12 metric, MAP standing for Mean Average Precision:

$$MAP@12 = \frac{1}{|U|} \sum_{u=1}^{|U|} \sum_{k=1}^{\min(12, n)} P(k)$$

In this formulation, |U| is the number of display ids, n is the number of ads predicted for a display, and P(k) is the precision at cutoff k. To get a better understanding of this metric, we can think of the array of ads we give in response to a display id as a **series** of predictions, each trying to predict the ad that was actually clicked. Naturally, the best case would be to make the correct prediction on the first try, but making it on the second or third attempt isn't bad either. However, making it on the last attempt isn't that satisfying. The MAP metric captures this difference.

For our data, since for every display there is only one correct answer (the ad that was actually clicked) the formulation becomes a bit simpler:

$$MAP@12 = \frac{1}{|U|} \sum_{u=1}^{|U|} \frac{1}{r_{u}}$$

In this formulation, $r_{u}$ is the position of the correct ad in the array we return: 1 if it's first, 2 if it's second, and so on.
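
The simplified formula can be sketched in a few lines of plain Python. The function name `map_at_12` and the toy rankings are illustrative; this is not the competition's official scoring code.

```python
def map_at_12(ranked_ads, clicked_ads, k=12):
    """Mean of 1 / r_u over all displays, where r_u is the 1-based
    position of the clicked ad in the returned ranking.
    A display contributes 0 if the clicked ad is not in the top k."""
    total = 0.0
    for ranking, clicked in zip(ranked_ads, clicked_ads):
        top = list(ranking)[:k]
        if clicked in top:
            total += 1.0 / (top.index(clicked) + 1)
    return total / len(clicked_ads)


# Two displays: the clicked ad is ranked first in one and second in the other.
score = map_at_12([[11, 10, 12], [21, 20]], [11, 20])
```

With these toy inputs the two displays contribute 1 and 1/2, so the score is their mean.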


# Data Preparation

One of the main challenges we faced was deciding how to integrate the many tables contained in the data set into one comprehensive table on which we could run our models. It took us a while to understand the connections between them, and the meanings of the various features contained in each of them. The first stage was to understand which basic features we could rely on, that is, which features don't have large proportions of null values.
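
A check of this kind might look like the sketch below. The toy table, its column names and the 50% cut-off are assumptions made for illustration; the threshold we actually used is not implied here.

```python
import numpy as np
import pandas as pd

# Toy table with deliberately missing values.
events = pd.DataFrame({
    "display_id": [1, 2, 3, 4],
    "platform": [1, 2, np.nan, 3],
    "geo_location": ["US>CA", None, None, None],
})

# Share of null values per column, from most to least complete.
null_share = events.isna().mean().sort_values()

# Keep columns whose null share is at most an (arbitrary) 50% cut-off.
usable = null_share[null_share <= 0.5].index.tolist()
```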

# Importing the Final Table

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot

In [None]:
dataset = pd.read_csv("final_main_table.csv")

# Feature Extraction

## Click Time

It's reasonable to assume that people tend to click different ads when surfing at different times of day, so we wanted to factor the time each click happened into the model.  

This required several adjustments:

1. The click timestamps provided were all in UTC, so they didn't account for timezone differences between countries. At first we wanted to come up with a clever way of making this adjustment automatically for every single country in the data, but this quickly turned out to be infeasible. There are Python libraries intended for this purpose, but coordinating them with our data required a lot of effort. So instead, we settled on doing it manually.

   Fortunately for us, some countries appear in the data a lot more than others. Let's look at the ten most common:
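
A sketch of both steps, counting the most common countries and applying a hand-written UTC offset per country, is given below. The column names `country` and `click_time_utc` are hypothetical, and the offsets are placeholder values that ignore daylight saving time and countries spanning several timezones; they are not the offsets we actually used.

```python
import pandas as pd

# Toy click log with hypothetical column names.
events = pd.DataFrame({
    "country": ["US", "US", "GB", "IN", "US", "GB"],
    "click_time_utc": pd.to_datetime([
        "2016-06-14 04:30", "2016-06-14 23:10", "2016-06-15 12:00",
        "2016-06-15 20:45", "2016-06-16 08:00", "2016-06-16 23:30",
    ]),
})

# The (up to) ten most frequent countries, most common first.
top_countries = events["country"].value_counts().head(10)

# Manually chosen offsets in hours (placeholders for illustration).
utc_offset_hours = {"US": -5.0, "GB": 0.0, "IN": 5.5}

# Countries without an entry fall back to UTC (offset 0).
offsets = events["country"].map(utc_offset_hours).fillna(0.0)
local_time = events["click_time_utc"] + pd.to_timedelta(offsets, unit="h")
events["local_hour"] = local_time.dt.hour
```

Writing offsets by hand only for the most common countries covers most of the rows while keeping the manual work small.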

## Platforms

The data includes information about the platform from which every click was made. The possible platforms include desktop, mobile, and tablet. We expected to see a difference here as well, as each platform might be used in a different context (work vs. home). 

In the original dataset, this information was presented as a single digit, with possible values of 1 (desktop), 2 (mobile) and 3 (tablet). These digits are obviously numerically meaningless - desktop isn't "smaller" than mobile, and tablet isn't "bigger" than both of them. So in order to include this information in the model, we decided to create three individual binary vectors, each corresponding to one of the platforms. 

In [2]:
# Each binary column sums to the number of clicks made from that platform
desktop_clicks = dataset.platform_is_desktop.sum()
mobile_clicks = dataset.platform_is_mobile.sum()
tablet_clicks = dataset.platform_is_tablet.sum()
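
The three binary platform columns described above could be built along the following lines. This is a minimal sketch on a toy table, assuming the raw column is called `platform` and uses the 1/2/3 coding mentioned in the text.

```python
import pandas as pd

# Toy data with the original single-digit platform code.
events = pd.DataFrame({"platform": [1, 2, 3, 2, 1]})

# Map the codes to readable names, then one-hot encode them.
names = {1: "desktop", 2: "mobile", 3: "tablet"}
dummies = pd.get_dummies(
    events["platform"].map(names), prefix="platform_is", dtype=int
)

# Attach the binary columns next to the original code.
events = pd.concat([events, dummies], axis=1)
```

Each row ends up with exactly one of `platform_is_desktop`, `platform_is_mobile` and `platform_is_tablet` set to 1.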



## Weekend Clicks

We expected the clicking trends to be different over the weekends - the data includes two of them: 18-19.6 and 25-26.6.
In order to include this in the model, we created a binary vector based on the click times. This time we didn't correct for timezone, mainly because when we did, the results were actually worse.
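
A weekend flag of this kind can be sketched as follows, assuming the click times are available as UTC timestamps in a hypothetical `click_time_utc` column and that the weekend means Saturday and Sunday, matching the dates above.

```python
import pandas as pd

# Toy UTC click times around the two weekends in the data.
clicks = pd.DataFrame({
    "click_time_utc": pd.to_datetime([
        "2016-06-17 22:00",  # Friday
        "2016-06-18 09:00",  # Saturday
        "2016-06-19 23:59",  # Sunday
        "2016-06-20 00:01",  # Monday
        "2016-06-25 12:00",  # Saturday
    ]),
})

# dayofweek counts Monday as 0, so Saturday is 5 and Sunday is 6.
clicks["is_weekend"] = (clicks["click_time_utc"].dt.dayofweek >= 5).astype(int)
```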

## Ads per Display

We thought the number of ads in a display might be a factor affecting which ad will be clicked. 
Bigger displays might attract more attention, or repel a person from looking at them, while smaller displays might be more stealthy.

Displays of different sizes might also be located in different places on the webpage, some more likely to engage the person viewing them than others. On top of that, Outbrain's algorithms probably choose different ads for displays of different sizes, so there may be ads which never appear on a small display, if small displays are reserved for ads with a high probability of being clicked.
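
Counting the ads in each display is a simple group-wise operation. The sketch below uses a toy table with `display_id` and `ad_id` columns; the feature name `ads_per_display` is chosen for illustration.

```python
import pandas as pd

# Toy table: display 1 holds three ads, display 2 holds four.
clicks = pd.DataFrame({
    "display_id": [1, 1, 1, 2, 2, 2, 2],
    "ad_id": [10, 11, 12, 20, 21, 22, 23],
})

# transform broadcasts the per-display count back onto every row.
clicks["ads_per_display"] = (
    clicks.groupby("display_id")["ad_id"].transform("count")
)
```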

## Ads Per Advertiser and Ads Per Campaign

We wanted to measure the "attractiveness" of every ad - whether it's because it was published by a big-name publisher or because it mentions some high-profile person or company. As the topic and entity data that were provided are purely numeric, we couldn't do this directly.

Our solution was to check how many ads each advertiser published - working under the assumption that "attractive" advertisers would be big ones with a lot of money and high-profile customers, meaning they will have many more ads published than small publishers with small budgets.

We also checked for the same trend within campaigns - trying to identify ads which are part of big campaigns.
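
Both counts can be sketched with group-wise transforms. The toy table below assumes columns named `ad_id`, `campaign_id` and `advertiser_id`, and the two feature names are illustrative.

```python
import pandas as pd

# Toy stand-in for the ad metadata: advertiser 7 runs two campaigns.
promoted = pd.DataFrame({
    "ad_id": [1, 2, 3, 4, 5],
    "campaign_id": [100, 100, 101, 200, 200],
    "advertiser_id": [7, 7, 7, 8, 8],
})

# Number of distinct ads published by each ad's advertiser.
promoted["ads_per_advertiser"] = (
    promoted.groupby("advertiser_id")["ad_id"].transform("nunique")
)

# Number of distinct ads belonging to each ad's campaign.
promoted["ads_per_campaign"] = (
    promoted.groupby("campaign_id")["ad_id"].transform("nunique")
)
```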

## CTR - Click-Through Rate

This turned out to be our main feature, perhaps not surprisingly. 
Our attention was first drawn to it via the article Amit sent at the beginning of the semester (https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf), and then again when we discovered that a relatively high score of 0.63714 could be achieved based only on this metric (https://www.kaggle.com/clustifier/outbrain-click-prediction/pandas-is-cool-lb-0-63714).

In light of these discoveries, we knew it would probably be beneficial to include CTR in our model. Now we only needed to decide how to calculate it.

The naive formula for CTR is simple:
For a given ad, divide the number of times it was clicked by the total number of appearances ("impressions") it had.
This brings up a problem - an ad which appeared once and was clicked that one time would get a CTR of 1, while an ad that appeared 100 times and was clicked 99 times would get a CTR of 0.99. The second ad gets a lower rating, even though it's probably much more successful than the first. 

Intuitively, a CTR for an ad that appeared many times is more credible than one for an ad that didn't - it got "tested" more often, and therefore we should be able to trust the "results" more.

The solution to this is regularization - somehow factoring the number of times an ad appeared into the calculation, so that ads that appeared more often get a "bonus" if they also got a lot of clicks.
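
To make the problem and the idea concrete, here is a sketch of one common form of regularization: adding a constant to the number of impressions in the denominator. Both this form and the constant `reg = 10` are assumptions chosen for illustration, not necessarily the exact formula or value used elsewhere in this notebook. The toy data reproduces the two ads from the example above.

```python
import pandas as pd

# Ad "A": shown once, clicked once. Ad "B": shown 100 times, clicked 99 times.
log = pd.DataFrame({
    "ad_id": ["A"] + ["B"] * 100,
    "clicked": [1] + [1] * 99 + [0],
})

# Clicks and impressions per ad.
stats = log.groupby("ad_id")["clicked"].agg(clicks="sum", impressions="count")

reg = 10  # assumed regularization constant

# Naive CTR: ad A (1.0) outranks ad B (0.99).
stats["naive_ctr"] = stats["clicks"] / stats["impressions"]

# Regularized CTR: the constant dominates for rarely shown ads,
# so ad A drops to 1/11 while ad B stays at 99/110 = 0.9.
stats["reg_ctr"] = stats["clicks"] / (stats["impressions"] + reg)
```

The constant barely changes the CTR of frequently shown ads but pulls the CTR of rarely shown ads towards zero, which restores the intuitive ordering.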

The formula used in the kernel was this:



