# w261 Final Project - Clickthrough Rate Prediction


- Team 15
- Erik Hou, Noah Pflaum, Connor Stern, Anu Yadav
- Fall 2019, section 3  

## Table of Contents

* __Section 1__ - Question Formulation
* __Section 2__ - Algorithm Explanation
* __Section 3__ - EDA & Challenges
* __Section 4__ - Algorithm Implementation
* __Section 5__ - Course Concepts

In [5]:
# imports
import re
import ast
import time
import numpy as np
import pandas as pd
import seaborn as sns
import networkx as nx
import matplotlib.pyplot as plt

In [6]:
%reload_ext autoreload
%autoreload 2

In [7]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [40]:
# start Spark Session
from pyspark.sql import SparkSession
app_name = "fp_notebook"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext

In [41]:
spark

In [4]:
!head -1 data/dac/train.txt

0	1	1	5	0	1382	4	15	2	181	1	2		2	68fd1e64	80e26c9b	fb936136	7b4723c4	25c83c98	7e0ccccf	de7995b8	1f89b562	a73ee510	a8cd5504	b2cb9c98	37c9c164	2824a5f6	1adce6ef	8ba8b39a	891b62e7	e5ba7672	f54016b9	21ddcdc9	b1252a9d	07b5194c		3a171ecb	c5c50484	e8b83407	9727dd16


In [14]:
rawRDD = sc.textFile('data/dac/train.txt')

In [18]:
rawDF = spark.read.text('data/dac/train.txt')
#rawDF = rawRDD.toDF(rawRDD, )

In [20]:
rawDF.write.parquet('data/adclicks.parquet')

In [21]:
parquetDF = spark.read.parquet('data/adclicks.parquet')

In [None]:
sampleDF = spark.read.parquet('data/adclicks.parquet/part-00000-39ed38a6-7cfa-463c-9076-9a79a698c2fc-c000.snappy.parquet')

In [26]:
sampleDF.count()

555493

In [None]:
sampleDF.take(2)

In [None]:
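def parse(row):
    # each Row produced by spark.read.text has a single string column named `value`
    variables = row.value.split('\t')
    # the first tab-separated field is the click label (0 or 1)
    click = int(variables[0])
    yield click
#sampleDF.rdd.flatMap(parse).take(1)
sampleDF.take(1)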

In [33]:
sampleDF.describe().show()

+-------+--------------------+
|summary|               value|
+-------+--------------------+
|  count|              555493|
|   mean|                null|
| stddev|                null|
|    min|0		-1												...|
|    max|1	99	130	3	2	0	0	...|
+-------+--------------------+



# __Section 1__ - Question Formulation

# __Section 2__ - Algorithm Explanation

Because an online advertisement can either be clicked ($response = 1$) or not ($response = 0$), Click-Through-Rate (CTR) prediction is generally treated as a logistic regression problem. For any set of features, we calculate some value $s$ and apply the logistic (sigmoid) function, the inverse of the logit, to yield our CTR prediction:
$$CTR = \frac{1}{1+e^{-s}} $$

There are several methods we can use to estimate $s$, each with its own benefits and drawbacks. Typical implementations include a linear model, a degree-2 polynomial mapping, a factorization machine, and a field-aware factorization machine. 

We will consider an example with the following dataset as we discuss the different methods of estimating $s$:

<table>
<th>Response</th>
<th>Publisher</th>
<th>Advertiser</th>
<th>Gender</th>
<tr><td>1</td><td>Netflix</td><td>Pepsi</td><td>Male</td></tr>
<tr><td>0</td><td>Spotify</td><td>Pepsi</td><td>Male</td></tr>
<tr><td>0</td><td>Facebook</td><td>Gatorade</td><td>Female</td></tr>
<tr><td>1</td><td>Spotify</td><td>Coca-cola</td><td>Male</td></tr>
<tr><td>1</td><td>Facebook</td><td>Coca-cola</td><td>Female</td></tr>
<tr><td>0</td><td>Facebook</td><td>Pepsi</td><td>Female</td></tr>
<tr><td>1</td><td>Netflix</td><td>Gatorade</td><td>Female</td></tr>
</table>

In this dataset, we refer to the categories Publisher, Advertiser, and Gender as "fields" and the labels within each field (Netflix, Spotify, Facebook, Pepsi, Gatorade, Coca-cola, Male, Female) as "features."

Below is a one-hot encoded representation of the dataset:

<table>
<th>Response</th>
<th>Netflix</th>
<th>Spotify</th>
<th>Facebook</th>
<th>Pepsi</th>
<th>Gatorade</th>
<th>Coca-cola</th>
<th>Male</th>
<th>Female</th>
<tr><td>1</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
<tr><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
<tr><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td></tr>
<tr><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
</table>
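
As a minimal sketch of how the one-hot table above can be derived from the original dataset, the snippet below encodes each impression by hand in plain Python. The names `ROWS`, `FEATURES`, and `one_hot` are illustrative and not part of the project code, and the column order is fixed manually to match the table.

```python
# Toy dataset from the first table: (response, publisher, advertiser, gender).
ROWS = [
    (1, "Netflix", "Pepsi", "Male"),
    (0, "Spotify", "Pepsi", "Male"),
    (0, "Facebook", "Gatorade", "Female"),
    (1, "Spotify", "Coca-cola", "Male"),
    (1, "Facebook", "Coca-cola", "Female"),
    (0, "Facebook", "Pepsi", "Female"),
    (1, "Netflix", "Gatorade", "Female"),
]

# Column order of the one-hot table (chosen by hand to match it).
FEATURES = ["Netflix", "Spotify", "Facebook",
            "Pepsi", "Gatorade", "Coca-cola",
            "Male", "Female"]


def one_hot(row):
    """Return (response, indicator vector) for one impression."""
    response, *labels = row
    return response, [1 if feature in labels else 0 for feature in FEATURES]


encoded = [one_hot(row) for row in ROWS]
for response, x in encoded:
    print(response, x)
```

Each impression has exactly three non-zero entries, one per field, which is why the sums in the models below only ever involve three features at a time.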

#### Linear Model
In a linear model, the algorithm learns a weight for every given feature. The formulation of the model is:
$$s = \phi(\textbf{w},\textbf{x}) =\textbf{w}^T \textbf{x}  =\sum_{j \in C_1}w_jx_j$$
where $\textbf{w}$ is the learned weight vector, $\textbf{x}$ is the data observation, and $C_1$ is the set of non-zero elements in $\textbf{x}$.

In our toy example, our model would learn different weights for the different Publishers (Netflix, Spotify, and Facebook), Advertisers (Pepsi, Gatorade, and Coca-cola), and Genders (Male, Female). The value $s$ would then be calculated for each impression using these weights. Thus, for each impression we would have:

$$
\begin{aligned}
s =& w_{Netflix}\cdot x_{Netflix} + w_{Spotify}\cdot x_{Spotify} + w_{Facebook}\cdot x_{Facebook} \\
&+ w_{Coca-cola}\cdot x_{Coca-cola} + w_{Pepsi}\cdot x_{Pepsi} + w_{Gatorade}\cdot x_{Gatorade} \\
&+ w_{Male}\cdot x_{Male} + w_{Female}\cdot x_{Female}
\end{aligned}
$$

For the first impression in our dataset (Netflix, Pepsi, Male) this becomes:
$$s = w_{Netflix} + w_{Pepsi} + w_{Male}$$
since $x_j = 1$ for Netflix, Pepsi, and Male while $x_j = 0$ for all other features. 


This model is simple and efficient, yet it does not allow for interaction effects between features. For example, Coca-cola may have a higher CTR with Netflix than with another publisher. A linear model is unable to learn this type of information, as it essentially learns the "average effect" of each feature.
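
The following is a small sketch of the linear score for the toy example. The weight values are hypothetical (chosen for illustration, not learned from the data), and `linear_score` and `sigmoid` are illustrative helper names. It also shows the "average effect" limitation: swapping Pepsi for Coca-cola changes $s$ by the same amount regardless of the publisher.

```python
import math

# Hypothetical weights, one per feature (illustrative values, not learned).
weights = {
    "Netflix": 0.4, "Spotify": -0.2, "Facebook": -0.1,
    "Pepsi": -0.3, "Gatorade": -0.1, "Coca-cola": 0.5,
    "Male": 0.1, "Female": 0.0,
}


def linear_score(active_features, w):
    """s = sum of w_j over the non-zero features (x_j = 1 for each of them)."""
    return sum(w[j] for j in active_features)


def sigmoid(s):
    """Logistic function mapping a score s to a probability."""
    return 1.0 / (1.0 + math.exp(-s))


s = linear_score(["Netflix", "Pepsi", "Male"], weights)
ctr = sigmoid(s)
print(f"s = {s:.2f}, predicted CTR = {ctr:.3f}")

# The effect of switching advertiser is identical for every publisher.
gap_netflix = (linear_score(["Netflix", "Coca-cola", "Male"], weights)
               - linear_score(["Netflix", "Pepsi", "Male"], weights))
gap_spotify = (linear_score(["Spotify", "Coca-cola", "Male"], weights)
               - linear_score(["Spotify", "Pepsi", "Male"], weights))
print(gap_netflix, gap_spotify)
```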

#### Degree-2 Polynomial Mapping
The simplest way to learn the effect of the "feature conjunction" described above (where a particular advertiser may have a higher CTR with one publisher than with others) is to use a degree-2 polynomial mapping. In this model, the algorithm learns an additional weight for each feature pair. The formulation of the model is:
$$s = \phi(\textbf{w},\textbf{x}) = \sum_{j \in C_1} w_j x_j + \sum_{j_1, j_2 \in C_2} w_{j_1,j_2} \cdot x_{j_1}x_{j_2}$$
where $C_1$ is the set of non-zero elements in $\textbf{x}$ and $C_2$ is the set of pairwise combinations of non-zero elements in $\textbf{x}$.


Returning to our example dataset and the impression with the features Netflix, Pepsi, and Male the model would be:
$$s = w_{Netflix} + w_{Pepsi} + w_{Male} + w_{Netflix,Pepsi} + w_{Netflix,Male} + w_{Pepsi,Male}$$

While this model improves on the linear model by allowing us to account for interactions between features, it does not handle sparse datasets well. Since we have no impressions of the advertiser Gatorade with the publisher Spotify, no weight is learned for this feature pair and the model cannot make a meaningful prediction for it. The model is also susceptible to overfitting, as it generates unreliable predictions for feature combinations with very few impressions.
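
To make the sparsity problem concrete, the sketch below counts how many impressions in the toy dataset contain each feature pair. The names `IMPRESSIONS` and `pair_counts` are illustrative. A pair with a count of zero has no data from which a dedicated pair weight could be learned.

```python
from collections import Counter
from itertools import combinations

# Toy impressions as (publisher, advertiser, gender).
IMPRESSIONS = [
    ("Netflix", "Pepsi", "Male"),
    ("Spotify", "Pepsi", "Male"),
    ("Facebook", "Gatorade", "Female"),
    ("Spotify", "Coca-cola", "Male"),
    ("Facebook", "Coca-cola", "Female"),
    ("Facebook", "Pepsi", "Female"),
    ("Netflix", "Gatorade", "Female"),
]

# Count how many impressions contain each feature pair.
pair_counts = Counter()
for features in IMPRESSIONS:
    pair_counts.update(combinations(features, 2))

# A Counter returns 0 for pairs that were never observed.
print("Spotify x Gatorade:", pair_counts[("Spotify", "Gatorade")])
print("Pepsi x Male:", pair_counts[("Pepsi", "Male")])
```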

#### Factorization Machine
Factorization Machines (FM) begin to provide us a solution to the problem of sparse datasets. Here the model learns a latent vector, rather than an explicit weight, for each feature. The model formulation is:
$$s = \phi(\textbf{w},\textbf{x}) = \sum_{j_1, j_2 \in C_2} \langle \textbf{w}_{j_1}, \textbf{w}_{j_2} \rangle x_{j_1}x_{j_2}$$
where $\textbf{w}_{j_1}$ and $\textbf{w}_{j_2}$ are two learned latent vectors of length $k$ (a user-defined parameter).

Returning to our example of an impression with the features of Netflix, Pepsi, and Male, the FM model would be:
$$s = w_{Netflix} \cdot w_{Pepsi} + w_{Netflix} \cdot w_{Male} + w_{Pepsi} \cdot w_{Male}$$
where $w_{Netflix}, w_{Pepsi}, w_{Male} \in \mathbb{R}^k$.

This allows the model to learn the latent vectors for each feature based on all the data points for that feature, and these latent vectors can be used to predict the CTR for unobserved feature combinations (such as Spotify and Gatorade, as previously mentioned), something the degree-2 polynomial mapping method did not allow us to do.
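
A minimal sketch of the FM score, assuming hypothetical latent vectors of length $k = 2$ (the values and the helper names `dot` and `fm_score` are illustrative, not learned or taken from the project). Because every feature has its own vector, the model can score (Spotify, Gatorade, Male) even though Spotify and Gatorade never appear together in the data.

```python
from itertools import combinations

# Hypothetical latent vectors of length k = 2 (illustrative, not learned).
latent = {
    "Netflix": [0.3, -0.1],
    "Spotify": [0.2, 0.4],
    "Pepsi": [-0.5, 0.2],
    "Gatorade": [0.1, 0.6],
    "Male": [0.4, 0.0],
}


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fm_score(active_features, w):
    """Sum of <w_j1, w_j2> over all pairs of active features (x_j = 1)."""
    return sum(dot(w[a], w[b]) for a, b in combinations(active_features, 2))


seen = fm_score(["Netflix", "Pepsi", "Male"], latent)
unseen = fm_score(["Spotify", "Gatorade", "Male"], latent)
print(f"observed combination: s = {seen:.2f}")
print(f"unobserved combination: s = {unseen:.2f}")
```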

We notice in our example above, however, that the latent vector $w_{Netflix}$ is used twice: once to calculate the latent effect of the Publisher Netflix with the Advertiser Pepsi ($w_{Netflix} \cdot w_{Pepsi}$), and once to calculate the latent effect of the Publisher Netflix with the Gender Male ($w_{Netflix} \cdot w_{Male}$). Yet the latent effect for publisher with advertiser could be very different from the latent effect for publisher with gender, and as such the Factorization Machine model is too restrictive and unrealistic.


#### Field-Aware Factorization Machine
Field-Aware Factorization Machines (FFM) provide a solution to this problem by introducing the flexibility to learn multiple latent vectors for each feature. In the context of the above example, the FFM model will learn two separate latent vectors for Netflix: $w_{Netflix, A}$, which will be used for the calculation of $P \times A$ (to learn the latent effect of Netflix with a given advertiser), and $w_{Netflix, G}$, which will be used for the calculation of $P \times G$ (to learn the latent effect of Netflix with a given gender).

Specifically, the FFM model formulation with the (Netflix, Pepsi, Male) impression gives us:

$$s = w_{Netflix, A} \cdot w_{Pepsi, P} + w_{Netflix, G} \cdot w_{Male, P} + w_{Pepsi, G} \cdot w_{Male, A}$$

Now, the latent effect of (Netflix, Pepsi) is learned by using the latent vector $w_{Netflix, A}$, since Pepsi belongs to the advertiser field (A). The latent effect of (Netflix, Male), by contrast, is learned using a different latent vector, $w_{Netflix, G}$, since Male belongs to the gender field (G). In this way, FFM splits the latent factors for $P \times A$ and $P \times G$, something the traditional Factorization Machine is unable to do.

The full model formulation for a Field-Aware Factorization Machine is:
$$s = \phi(\textbf{w},\textbf{x}) = \sum_{j_1, j_2 \in C_2} \langle \textbf{w}_{j_1, f_2}, \textbf{w}_{j_2, f_1} \rangle x_{j_1}x_{j_2}$$

where $f_1$ and $f_2$ are the fields of $j_1$ and $j_2$, respectively.

#### Optimization

As part of a logistic regression problem, the goal is to find the set of parameters that minimizes the log-loss function, defined by:

$$ LogLoss = - \frac{1}{n} \sum_{i=1}^n [y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i})] $$


where $n$ is the number of impressions, $y_i$ is the true response (0 or 1) of impression $i$, and $\hat{y_i} = \frac{1}{1+e^{-s_i}}$ is the predicted CTR of impression $i$.

For a given impression $x_i$, we have two cases:
* If $y_i = 1$:
$$
\begin{aligned}
loss &= -log_e(\hat{y_i}) \\
&= -log_e \left( \frac{1}{1+e^{-s}} \right) \\
&= log_e (1+e^{-s})
\end{aligned}
$$

* If $y_i = 0$:
$$
\begin{aligned}
loss &= -log_e(1-\hat{y_i}) \\
&= -log_e \left( 1 - \frac{1}{1+e^{-s}} \right)\\
&= -log_e \left( \frac{1 + e^{-s} - 1}{1+e^{-s}} \right) \\
&= -log_e \left( \frac{e^{-s}}{1+e^{-s}} \right) \\
&= -log_e \left( \frac{1}{1+e^{s}} \right) \\
&= log_e(1+e^{s})
\end{aligned}
$$

As such, the loss for impression $x_i$ can be written as:
$$log(1+exp(-\bar{y_i}\cdot s_i))$$
where 
$$\bar{y_i} = 
\begin{cases}
    1,\ if\ y_i=1\\
    -1,\ if\ y_i=0\\
\end{cases}$$
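
As a quick numerical sanity check of the derivation above, the sketch below compares the cross-entropy form of the per-impression loss with the compact $log(1+exp(-\bar{y_i}\cdot s_i))$ form for a few arbitrary scores. The function names are illustrative.

```python
import math


def sigmoid(s):
    """Logistic function mapping a score s to a probability."""
    return 1.0 / (1.0 + math.exp(-s))


def cross_entropy(y, s):
    """-[y log(y_hat) + (1 - y) log(1 - y_hat)] with y in {0, 1}."""
    y_hat = sigmoid(s)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def logistic_loss(y, s):
    """log(1 + exp(-y_bar * s)) with y_bar = +1 if y == 1 else -1."""
    y_bar = 1 if y == 1 else -1
    return math.log(1 + math.exp(-y_bar * s))


# Both forms should agree for either label and any score.
for y in (0, 1):
    for s in (-2.0, -0.27, 0.0, 0.5, 3.0):
        a, b = cross_entropy(y, s), logistic_loss(y, s)
        assert math.isclose(a, b, rel_tol=1e-9)
        print(f"y={y}, s={s:+.2f}: {a:.6f} vs {b:.6f}")
```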

Therefore, we can rewrite the log-loss function as:
$$ Loss = \frac{1}{n} \sum_{i=1}^n log(1+exp(-\bar{y_i}\cdot s_i))$$
where 
$\bar{y_i} = 
\begin{cases}
    1,\ if\ y_i=1\\
    -1,\ if\ y_i=0\\
\end{cases}$


and $s_i = \phi(\textbf{w},\textbf{x}_i) = \sum_{j_1, j_2 \in C_2} \langle \textbf{w}_{j_1, f_2}, \textbf{w}_{j_2, f_1} \rangle x_{j_1}x_{j_2}$.


Introducing regularization, the optimization problem we have is:

$$ \min_{\textbf{w}} \sum_{i=1}^m \left( log(1+exp(-\bar{y_i}\phi(\textbf{w},\textbf{x}_i))) + \frac{\lambda}{2}\|\textbf{w}\|^2 \right)$$

where $m$ is the number of impressions (written $n$ above) and $\lambda$ is our regularization parameter.

We solve this optimization problem using gradient descent methods.
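
The sketch below shows what one gradient step could look like for a single impression, under stated assumptions. For the per-impression loss $log(1+exp(-\bar{y}s))$, the derivative with respect to $s$ is $\kappa = -\bar{y}/(1+exp(\bar{y}s))$, and for an active pair $(j_1, j_2)$ the derivative of $s$ with respect to $\textbf{w}_{j_1,f_2}$ is $\textbf{w}_{j_2,f_1}$ (and vice versa). The code uses plain stochastic gradient descent with a fixed, hypothetical learning rate `eta`; the FFM paper listed under Sources uses per-coordinate adaptive learning rates instead. The vectors, the dictionary layout, and the function names are all illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ffm_score(active, w):
    """active: list of (feature, field); w[(feature, other_field)] is a latent vector."""
    s = 0.0
    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            (j1, f1), (j2, f2) = active[i], active[j]
            s += dot(w[(j1, f2)], w[(j2, f1)])
    return s


def sgd_step(active, y, w, eta=0.1, lam=0.0):
    """One plain SGD update on one impression; returns the loss before the update."""
    y_bar = 1.0 if y == 1 else -1.0
    s = ffm_score(active, w)
    kappa = -y_bar / (1.0 + math.exp(y_bar * s))  # d loss / d s
    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            (j1, f1), (j2, f2) = active[i], active[j]
            w1, w2 = w[(j1, f2)], w[(j2, f1)]
            # gradient of the regularized loss with respect to each vector
            g1 = [lam * a + kappa * b for a, b in zip(w1, w2)]
            g2 = [lam * b + kappa * a for a, b in zip(w1, w2)]
            w[(j1, f2)] = [a - eta * g for a, g in zip(w1, g1)]
            w[(j2, f1)] = [b - eta * g for b, g in zip(w2, g2)]
    return math.log(1.0 + math.exp(-y_bar * s))


# Hypothetical k = 2 vectors for a clicked (Netflix, Pepsi, Male) impression.
active = [("Netflix", "P"), ("Pepsi", "A"), ("Male", "G")]
w = {
    ("Netflix", "A"): [0.1, -0.2], ("Netflix", "G"): [0.3, 0.1],
    ("Pepsi", "P"): [-0.4, 0.2], ("Pepsi", "G"): [0.2, 0.2],
    ("Male", "P"): [0.1, -0.3], ("Male", "A"): [-0.1, 0.4],
}
before = sgd_step(active, 1, w)
after = math.log(1.0 + math.exp(-ffm_score(active, w)))
print(f"loss before: {before:.4f}, loss after one step: {after:.4f}")
```

With a small step size, the loss on the updated impression should decrease, which is what the final print statement lets you check.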


#### Toy example: FFM in action

We will now walk through an entire iteration of the FFM algorithm to learn a CTR prediction model.
For reference, we reprint the one-hot encoded version of our dataset here:

<table>
<th>Response</th>
<th>Netflix</th>
<th>Spotify</th>
<th>Facebook</th>
<th>Pepsi</th>
<th>Gatorade</th>
<th>Coca-cola</th>
<th>Male</th>
<th>Female</th>
<tr><td>1</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
<tr><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
<tr><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td></tr>
<tr><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td></tr>
<tr><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>
<tr><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>
</table>


__1)__ We define k (the length of our latent vectors) to be 5.

__2)__ We randomly initialize our latent vectors $\textbf{w}$. These latent vectors will be represented as columns below.

For the Publisher field, each feature will have two different latent vectors--one corresponding to Advertiser (A), and one corresponding to Gender (G):

<table>
<th>$\textbf{w}_{Netflix,A}$</th>
<th>$\textbf{w}_{Netflix,G}$</th>
<th>$\textbf{w}_{Spotify,A}$</th>
<th>$\textbf{w}_{Spotify,G}$</th>
<th>$\textbf{w}_{Facebook,A}$</th>
<th>$\textbf{w}_{Facebook,G}$</th>
<tr><td>-0.1</td><td>0.5</td><td>0.7</td><td>0.3</td><td>0.3</td><td>0</td></tr>
<tr><td>0.3</td><td>-0.5</td><td>-0.4</td><td>0.2</td><td>-0.1</td><td>0.4</td></tr>
<tr><td>0.8</td><td>0.2</td><td>0</td><td>0.6</td><td>0.1</td><td>-0.5</td></tr>
<tr><td>0</td><td>-0.1</td><td>0.1</td><td>-0.9</td><td>0.2</td><td>-0.1</td></tr>
<tr><td>-0.6</td><td>0.9</td><td>0.1</td><td>0.4</td><td>0.7</td><td>0.9</td></tr>
</table>
    
The Advertiser field will similarly have two different latent vectors for each feature--one for Publisher (P), and one for Gender (G):
<table>
<th>$\textbf{w}_{Pepsi,P}$</th>
<th>$\textbf{w}_{Pepsi,G}$</th>
<th>$\textbf{w}_{Gatorade,P}$</th>
<th>$\textbf{w}_{Gatorade,G}$</th>
<th>$\textbf{w}_{Coca-cola,P}$</th>
<th>$\textbf{w}_{Coca-cola,G}$</th>
<tr><td>0.2</td><td>-0.4</td><td>0.5</td><td>0.6</td><td>0</td><td>0.7</td></tr>
<tr><td>0.1</td><td>0</td><td>0.3</td><td>-0.6</td><td>-0.1</td><td>-0.2</td></tr>
<tr><td>-0.7</td><td>0</td><td>0.1</td><td>0.3</td><td>-0.3</td><td>-0.1</td></tr>
<tr><td>0.5</td><td>0.3</td><td>0.1</td><td>0.1</td><td>-0.4</td><td>-0.5</td></tr>
<tr><td>0.2</td><td>-0.5</td><td>0</td><td>-0.9</td><td>0.8</td><td>0.3</td></tr>
</table>
    
The Gender field will also have two different latent vectors per feature (one for Publisher (P), one for Advertiser (A)): 
<table>
<th>$\textbf{w}_{Male,P}$</th>
<th>$\textbf{w}_{Male,A}$</th>
<th>$\textbf{w}_{Female,P}$</th>
<th>$\textbf{w}_{Female,A}$</th>
<tr><td>0.6</td><td>0.3</td><td>0.0</td><td>-0.5</td></tr>
<tr><td>-0.8</td><td>-0.4</td><td>0.3</td><td>-0.4</td></tr>
<tr><td>0</td><td>0.6</td><td>0.6</td><td>-0.2</td></tr>
<tr><td>0.8</td><td>0.7</td><td>-0.9</td><td>0</td></tr>
<tr><td>-0.4</td><td>-0.1</td><td>-0.2</td><td>0.9</td></tr>
</table>

__3)__ We calculate $s$ for each impression using these weight vectors.

For each impression, we would use the formula
$$s = \phi(\textbf{w},\textbf{x}) = \sum_{j_1, j_2 \in C_2} \langle \textbf{w}_{j_1, f_2}, \textbf{w}_{j_2, f_1} \rangle x_{j_1}x_{j_2}$$
to generate our value of $s$. Fully expanded, this would be:
    
$$
\begin{aligned}
s =& \langle\textbf{w}_{Netflix,A}, \textbf{w}_{Pepsi,P}\rangle x_{Netflix}x_{Pepsi} + \langle\textbf{w}_{Netflix,A}, \textbf{w}_{Coca-cola,P}\rangle x_{Netflix}x_{Coca-cola} + \langle\textbf{w}_{Netflix,A}, \textbf{w}_{Gatorade,P}\rangle x_{Netflix}x_{Gatorade} \\
&+ \langle\textbf{w}_{Netflix,G}, \textbf{w}_{Male,P}\rangle x_{Netflix}x_{Male} + \langle\textbf{w}_{Netflix,G}, \textbf{w}_{Female,P}\rangle x_{Netflix}x_{Female} + \langle\textbf{w}_{Spotify,A}, \textbf{w}_{Pepsi,P}\rangle x_{Spotify}x_{Pepsi} \\
&+ \langle\textbf{w}_{Spotify,A}, \textbf{w}_{Coca-cola,P}\rangle x_{Spotify}x_{Coca-cola} + \langle\textbf{w}_{Spotify,A}, \textbf{w}_{Gatorade,P}\rangle x_{Spotify}x_{Gatorade} + \langle\textbf{w}_{Spotify,G}, \textbf{w}_{Male,P}\rangle x_{Spotify}x_{Male} \\
&+ \langle\textbf{w}_{Spotify,G}, \textbf{w}_{Female,P}\rangle x_{Spotify}x_{Female} + \langle\textbf{w}_{Facebook,A}, \textbf{w}_{Pepsi,P}\rangle x_{Facebook}x_{Pepsi} + \langle\textbf{w}_{Facebook,A}, \textbf{w}_{Coca-cola,P}\rangle x_{Facebook}x_{Coca-cola} \\
&+ \langle\textbf{w}_{Facebook,A}, \textbf{w}_{Gatorade,P}\rangle x_{Facebook}x_{Gatorade} + \langle\textbf{w}_{Facebook,G}, \textbf{w}_{Male,P}\rangle x_{Facebook}x_{Male} + \langle\textbf{w}_{Facebook,G}, \textbf{w}_{Female,P}\rangle x_{Facebook}x_{Female} \\
&+ \langle\textbf{w}_{Pepsi,G}, \textbf{w}_{Male,A}\rangle x_{Pepsi}x_{Male} + \langle\textbf{w}_{Pepsi,G}, \textbf{w}_{Female,A}\rangle x_{Pepsi}x_{Female} + \langle\textbf{w}_{Coca-cola,G}, \textbf{w}_{Male,A}\rangle x_{Coca-cola}x_{Male} \\
&+ \langle\textbf{w}_{Coca-cola,G}, \textbf{w}_{Female,A}\rangle x_{Coca-cola}x_{Female} + \langle\textbf{w}_{Gatorade,G}, \textbf{w}_{Male,A}\rangle x_{Gatorade}x_{Male} + \langle\textbf{w}_{Gatorade,G}, \textbf{w}_{Female,A}\rangle x_{Gatorade}x_{Female}
\end{aligned}
$$

For each impression, we are only concerned with the terms containing the features represented in the impression, since these $x$ values are equal to one, while all other terms become zero.
    
Our first impression (Netflix, Pepsi, Male) would therefore give us:
$$
\begin{aligned}
s &= \langle\textbf{w}_{Netflix,A}, \textbf{w}_{Pepsi,P}\rangle + \langle\textbf{w}_{Netflix,G}, \textbf{w}_{Male,P}\rangle + \langle\textbf{w}_{Pepsi,G}, \textbf{w}_{Male,A}\rangle\\
&= (-0.1*0.2 + 0.3*0.1 + 0.8*-0.7 + 0*0.5 + -0.6*0.2) + (0.5*0.6 + -0.5*-0.8 + 0.2*0 + -0.1*0.8 + 0.9*-0.4) + (-0.4*0.3 + 0*-0.4 + 0*0.6 + 0.3*0.7 + -0.5*-0.1)  \\
&= -0.67 + 0.26 + 0.14 \\
&= -0.27
\end{aligned}
$$
Plugging this value into the logistic function, we get a predicted CTR of:
$$\frac{1}{1+e^{-s}} = \frac{1}{1+e^{0.27}} \approx 0.43$$
    
Since this value is below 0.5, we would ultimately predict a non-click, which is incorrect for this impression. However, the predicted value itself is what enters our log-loss calculation, which we will see momentarily.
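
The calculation for the first impression can be reproduced with a short sketch. The latent vectors are copied from the columns of the tables in step 2 (only the six vectors needed for this impression are included); the dictionary layout, keyed by `(feature, field of the other feature)`, and the function names are illustrative choices rather than the project's implementation.

```python
import math

# Latent vectors from step 2 that are needed for (Netflix, Pepsi, Male).
# Key: (feature, field of the *other* feature in the pair).
w = {
    ("Netflix", "A"): [-0.1, 0.3, 0.8, 0.0, -0.6],
    ("Netflix", "G"): [0.5, -0.5, 0.2, -0.1, 0.9],
    ("Pepsi", "P"): [0.2, 0.1, -0.7, 0.5, 0.2],
    ("Pepsi", "G"): [-0.4, 0.0, 0.0, 0.3, -0.5],
    ("Male", "P"): [0.6, -0.8, 0.0, 0.8, -0.4],
    ("Male", "A"): [0.3, -0.4, 0.6, 0.7, -0.1],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ffm_score(active, w):
    """Sum <w[j1, f2], w[j2, f1]> over all pairs of active (feature, field)."""
    s = 0.0
    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            (j1, f1), (j2, f2) = active[i], active[j]
            s += dot(w[(j1, f2)], w[(j2, f1)])
    return s


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


# Publisher = P, Advertiser = A, Gender = G.
impression = [("Netflix", "P"), ("Pepsi", "A"), ("Male", "G")]
s = ffm_score(impression, w)
ctr = sigmoid(s)
print(f"s = {s:.2f}, predicted CTR = {ctr:.2f}")
```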
    
The following table shows the results of the calculations for all seven impressions:

<table>
<th>Impression</th>
<th>$s$</th>
<th>Predicted CTR</th>
<th>True Response</th>
<th>Correct?</th>
<tr><td>(Netflix, Pepsi, Male)</td><td>-0.27</td><td>$\approx 0.43$</td><td>1</td><td>No</td></tr>
<tr><td>(Spotify, Pepsi, Male)</td><td>-0.55</td><td>$\approx 0.37$</td><td>0</td><td>Yes</td></tr>
<tr><td>(Facebook, Gatorade, Female)</td><td>-1.05</td><td>$\approx 0.26$</td><td>0</td><td>Yes</td></tr>
<tr><td>(Spotify, Coca-cola, Male)</td><td>-0.93</td><td>$\approx 0.28$</td><td>1</td><td>No</td></tr>
<tr><td>(Facebook, Coca-cola, Female)</td><td>0.21</td><td>$\approx 0.55$</td><td>1</td><td>Yes</td></tr>
<tr><td>(Facebook, Pepsi, Female)</td><td>-0.30</td><td>$\approx 0.43$</td><td>0</td><td>Yes</td></tr>
<tr><td>(Netflix, Gatorade, Female)</td><td>-0.93</td><td>$\approx 0.28$</td><td>1</td><td>No</td></tr>
</table>


__4)__ We calculate the log-loss of the model, using the equation:
$$ LogLoss = - \frac{1}{n} \sum_{i=1}^n [y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i})] $$

Simply put, if the True Response ($y_i$) for an impression is 0, we add the log of `1 - Predicted CTR` to our sum; if the True Response ($y_i$) for an impression is 1, we add the log of `Predicted CTR`.

Using the values from the table above, we have:

$$ \begin{aligned}
LogLoss =& - \frac{1}{7}[log_e(0.43) + log_e(0.63) + log_e(0.74) + log_e(0.28) + log_e(0.55) + log_e(0.57)+ log_e(0.28)]\\
\approx& 0.759
\end{aligned}
$$


__5)__ We then would optimize our weights using gradient descent ...

We repeat steps 3-5 until some convergence criterion is met.

#### Sources:
- https://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf
- https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
- https://www.youtube.com/watch?v=1cRGpDXTJC8

###### Removed for now, but may still utilize:
* OR Formula from the paper...

$$\sum_{i=1}^m \left( log(1+exp(-y_i\phi(\textbf{w},\textbf{x}_i))) + \frac{\lambda}{2}\|\textbf{w}\|^2 \right)$$

For simplicity, we ignore the regularization term $\frac{\lambda}{2}\|\textbf{w}\|^2$.

The log-loss then becomes:
$$
\begin{aligned}
L =& log(1+exp(-y_1 \cdot s_1)) + log(1+exp(-y_2 \cdot s_2)) \\
&+ log(1+exp(-y_3 \cdot s_3)) + log(1+exp(-y_4 \cdot s_4)) + log(1+exp(-y_5 \cdot s_5)) \\
&+ log(1+exp(-y_6 \cdot s_6)) + log(1+exp(-y_7 \cdot s_7))
\end{aligned}
$$
Using the values given above, we have:
$$
\begin{aligned}
L =& log(1+exp(-1 \cdot -0.27)) + log(1+exp(1 \cdot -0.55)) \\
&+ log(1+exp(1 \cdot -1.05)) + log(1+exp(-1 \cdot -0.93)) + log(1+exp(-1 \cdot 0.21)) \\
&+ log(1+exp(1 \cdot -0.30)) + log(1+exp(-1 \cdot -0.93))
\end{aligned}
$$

# __Section 3__ - EDA & Challenges

# __Section 4__ - Algorithm Implementation

# __Section 5__ - Course Concepts